Contextual Retrieval

Prepend situational context to chunks before embedding to preserve document-level meaning.

The Problem

Traditional chunking loses context:
Original document: "ACME Q3 2024 Earnings Report..."
Chunk: "Revenue increased 15% compared to the previous quarter."

Query: "What was ACME's Q3 2024 revenue growth?"
Result: Chunk doesn't mention "ACME" or "Q3 2024" - retrieval fails
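The failure above is easy to reproduce with a simple term-overlap check, the kind of lexical matching BM25 relies on (a toy sketch, not from the original):

```python
chunk = "Revenue increased 15% compared to the previous quarter."
query_terms = {"acme", "q3", "2024", "revenue"}

# tokenize the raw chunk the way a lexical index would
chunk_terms = {w.strip(".,").lower() for w in chunk.split()}

overlap = query_terms & chunk_terms
# only "revenue" overlaps; "ACME" and "Q3 2024" never appear in the chunk
```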

The Solution

Contextual Retrieval prepends a brief context to each chunk:
Contextualized chunk:
"This chunk is from ACME Corp's Q3 2024 earnings report, specifically
the revenue section. Revenue increased 15% compared to the previous quarter."
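With the context prepended, the same toy lexical check now matches every query term:

```python
ctx_chunk = (
    "This chunk is from ACME Corp's Q3 2024 earnings report, specifically "
    "the revenue section. Revenue increased 15% compared to the previous quarter."
)
query_terms = {"acme", "q3", "2024", "revenue"}

ctx_terms = {w.strip(".,").lower() for w in ctx_chunk.split()}
# all four query terms now appear in the indexed text
```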

Implementation

Context Generation

python
import anthropic

client = anthropic.Anthropic()

CONTEXT_PROMPT = """
<document>
{document}
</document>

Here is the chunk we want to situate within the document:
<chunk>
{chunk}
</chunk>

Please give a short, succinct context (1-2 sentences) to situate this chunk
within the overall document. Focus on information that would help retrieval.
Answer only with the context, nothing else.
"""

def generate_context(document: str, chunk: str) -> str:
    """Generate context for a single chunk."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # alias for the latest Sonnet 4.5 snapshot
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(document=document, chunk=chunk)
        }]
    )
    return response.content[0].text

def contextualize_chunk(document: str, chunk: str) -> str:
    """Prepend context to chunk."""
    context = generate_context(document, chunk)
    return f"{context}\n\n{chunk}"

Batch Processing with Caching

python
from anthropic import Anthropic

client = Anthropic()

def contextualize_chunks_cached(document: str, chunks: list[str]) -> list[str]:
    """
    Use prompt caching to efficiently process many chunks from same document.
    Document is cached, only chunk changes per request.
    """
    results = []

    for i, chunk in enumerate(chunks):
        response = client.messages.create(
            model="claude-sonnet-4-5",  # alias for the latest Sonnet 4.5 snapshot
            max_tokens=150,
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"<document>\n{document}\n</document>",
                        "cache_control": {"type": "ephemeral"}  # Cache document
                    },
                    {
                        "type": "text",
                        "text": f"""
Here is chunk {i+1} to situate:
<chunk>
{chunk}
</chunk>

Give a short context (1-2 sentences) to situate this chunk.
"""
                    }
                ]
            }]
        )
        context = response.content[0].text
        results.append(f"{context}\n\n{chunk}")

    return results

Hybrid Search (BM25 + Vector)

Contextual Retrieval works best with hybrid search:
python
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    def __init__(self, chunks: list[str], embeddings: np.ndarray):
        self.chunks = chunks
        self.embeddings = embeddings

        # BM25 index on raw text
        tokenized = [c.lower().split() for c in chunks]
        self.bm25 = BM25Okapi(tokenized)

    def search(
        self,
        query: str,
        query_embedding: np.ndarray,
        top_k: int = 20,
        bm25_weight: float = 0.4,
        vector_weight: float = 0.6
    ) -> list[tuple[int, float]]:
        """Hybrid search combining BM25 and vector similarity."""
        # BM25 scores
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-6)

        # Vector similarity
        vector_scores = np.dot(self.embeddings, query_embedding)
        vector_scores = (vector_scores - vector_scores.min()) / (vector_scores.max() - vector_scores.min() + 1e-6)

        # Combine
        combined = bm25_weight * bm25_scores + vector_weight * vector_scores

        # Top-k
        top_indices = np.argsort(combined)[::-1][:top_k]
        return [(i, combined[i]) for i in top_indices]
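The weighted-combination step in `search` can be checked on toy scores (values made up for illustration):

```python
import numpy as np

bm25 = np.array([1.0, 0.2, 0.0])   # normalized BM25 scores for 3 chunks
vec = np.array([0.3, 1.0, 0.5])    # normalized vector-similarity scores

combined = 0.4 * bm25 + 0.6 * vec  # same 40%/60% split as above
ranking = np.argsort(combined)[::-1]
# chunk 1 ranks first: its strong vector match outweighs chunk 0's BM25 lead
```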

Complete Pipeline

python
from dataclasses import dataclass

import numpy as np
from rank_bm25 import BM25Okapi

@dataclass
class ContextualChunk:
    original: str
    contextualized: str
    embedding: list[float]
    doc_id: str
    chunk_index: int

class ContextualRetriever:
    def __init__(self, embed_model, llm_client):
        self.embed_model = embed_model
        self.llm = llm_client
        self.chunks: list[ContextualChunk] = []
        self.bm25 = None

    def add_document(self, doc_id: str, text: str, chunk_size: int = 512):
        """Process and index a document."""
        # 1. Chunk the document
        raw_chunks = self._chunk_text(text, chunk_size)

        # 2. Generate context for each chunk (with caching)
        contextualized = self._contextualize_batch(text, raw_chunks)

        # 3. Embed contextualized chunks
        embeddings = self.embed_model.embed(contextualized)

        # 4. Store
        for i, (raw, ctx, emb) in enumerate(zip(raw_chunks, contextualized, embeddings)):
            self.chunks.append(ContextualChunk(
                original=raw,
                contextualized=ctx,
                embedding=emb,
                doc_id=doc_id,
                chunk_index=i
            ))

        # 5. Rebuild BM25 index
        self._rebuild_bm25()

    def search(self, query: str, top_k: int = 10) -> list[ContextualChunk]:
        """Hybrid search over contextualized chunks."""
        query_emb = self.embed_model.embed([query])[0]

        # BM25 on contextualized text
        bm25_scores = self.bm25.get_scores(query.lower().split())

        # Vector similarity
        embeddings = np.array([c.embedding for c in self.chunks])
        vector_scores = np.dot(embeddings, query_emb)

        # Normalize and combine
        bm25_norm = self._normalize(bm25_scores)
        vector_norm = self._normalize(vector_scores)
        combined = 0.4 * bm25_norm + 0.6 * vector_norm

        # Return top-k
        top_indices = np.argsort(combined)[::-1][:top_k]
        return [self.chunks[i] for i in top_indices]

    def _contextualize_batch(self, document: str, chunks: list[str]) -> list[str]:
        """Generate context for all chunks (use prompt caching)."""
        results = []
        for chunk in chunks:
            context = self._generate_context(document, chunk)
            results.append(f"{context}\n\n{chunk}")
        return results

    def _generate_context(self, document: str, chunk: str) -> str:
        # Implementation from above
        pass

    def _chunk_text(self, text: str, chunk_size: int) -> list[str]:
        """Simple sentence-aware chunking."""
        sentences = text.split('. ')
        chunks = []
        current = []
        current_len = 0

        for sent in sentences:
            if current_len + len(sent) > chunk_size and current:
                chunks.append('. '.join(current) + '.')
                current = [sent]
                current_len = len(sent)
            else:
                current.append(sent)
                current_len += len(sent)

        if current:
            chunks.append('. '.join(current))
        return chunks

    def _rebuild_bm25(self):
        tokenized = [c.contextualized.lower().split() for c in self.chunks]
        self.bm25 = BM25Okapi(tokenized)

    def _normalize(self, scores: np.ndarray) -> np.ndarray:
        return (scores - scores.min()) / (scores.max() - scores.min() + 1e-6)
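The sentence-aware chunking step can be exercised in isolation. A standalone copy of the same greedy packing logic, with a deliberately small `chunk_size` to force splits:

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    # greedily pack whole sentences until the size budget is exceeded
    sentences = text.split('. ')
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        if current_len + len(sent) > chunk_size and current:
            chunks.append('. '.join(current) + '.')
            current, current_len = [sent], len(sent)
        else:
            current.append(sent)
            current_len += len(sent)
    if current:
        chunks.append('. '.join(current))
    return chunks

parts = chunk_text("One sentence here. Another sentence here. A third one", 25)
```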

Optimization Tips

1. Cost Reduction with Caching

  • Prompt caching reduces cost by ~90% when processing many chunks from the same document.
  • The document is cached on the first request and reused for subsequent chunks.
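A back-of-the-envelope check of the ~90% figure, assuming cache writes bill at 1.25x and cache reads at 0.1x the base input rate (these multipliers are assumptions; check current pricing, and note per-chunk prompt tokens are ignored):

```python
doc_tokens, n_chunks = 50_000, 100

uncached = doc_tokens * n_chunks  # document resent in full for every chunk
cached = doc_tokens * 1.25 + doc_tokens * 0.1 * (n_chunks - 1)

savings = 1 - cached / uncached   # roughly 0.89 under these assumptions
```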

2. Parallel Processing

python
import asyncio

async def contextualize_parallel(document: str, chunks: list[str]) -> list[str]:
    """Process chunks in parallel with rate limiting."""
    semaphore = asyncio.Semaphore(10)  # cap at 10 concurrent requests

    async def process_chunk(chunk: str) -> str:
        async with semaphore:
            # async_generate_context is assumed to be an async variant of
            # generate_context above (e.g. built on anthropic.AsyncAnthropic)
            context = await async_generate_context(document, chunk)
            return f"{context}\n\n{chunk}"

    return await asyncio.gather(*[process_chunk(c) for c in chunks])

3. Context Quality

Good context examples:
  • "This chunk is from the API authentication section of the FastAPI documentation."
  • "This describes the company's Q3 2024 financial performance, specifically operating expenses."
  • "This section covers error handling in the user registration flow."
Bad context (too generic):
  • "This is a chunk from the document."
  • "Information about the topic."
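One cheap way to catch the too-generic failure mode is a vocabulary check before accepting a generated context. This is a hypothetical heuristic, not part of the original technique:

```python
# generic filler words; a context with no other content words is suspect
GENERIC = {"this", "chunk", "from", "document", "about", "topic",
           "information", "text", "section"}

def looks_generic(context: str) -> bool:
    """Flag contexts whose content words are all generic filler."""
    words = {w.strip('."').lower() for w in context.split()}
    content = {w for w in words if len(w) > 3}
    return content <= GENERIC
```

A generic context like "This is a chunk from the document." is flagged, while one naming a specific product or section passes.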

Results (from Anthropic's research)

| Method | Retrieval failure rate |
| --- | --- |
| Traditional embeddings | 5.7% |
| + Contextual embeddings | 3.5% |
| + Contextual + BM25 hybrid | 1.9% |
| + Contextual + BM25 + reranking | 1.3% |

67% reduction in retrieval failures with the full contextual retrieval pipeline.
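The headline 67% figure is the drop from the 5.7% baseline to 1.9% with contextual embeddings plus BM25:

```python
baseline = 5.7          # traditional embeddings, failure rate in %
contextual_bm25 = 1.9   # contextual embeddings + BM25 hybrid

reduction = 1 - contextual_bm25 / baseline
# about 0.667, i.e. the quoted 67% fewer retrieval failures
```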

When to Use

Use Contextual Retrieval when:
  • Documents have important metadata (dates, names, versions)
  • Chunks frequently lose meaning without document context
  • Retrieval quality is critical (customer-facing, compliance)
  • You can afford the additional LLM cost during indexing
Skip if:
  • Chunks are self-contained (Q&A pairs, definitions)
  • Low latency indexing required (high-volume streaming)
  • Cost-sensitive with many small documents

Related Skills

  • rag-retrieval
    - Core RAG pipeline patterns that contextual retrieval enhances
  • embeddings
    - Text embedding strategies for the vector search component
  • reranking-patterns
    - Post-retrieval reranking to further improve precision
  • hyde-retrieval
    - Alternative retrieval enhancement using hypothetical documents

Key Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Context generation model | Claude Sonnet | Balance of quality and cost for context generation |
| BM25/vector weight split | 40%/60% | Anthropic's research shows a slight vector bias is optimal |
| Chunk context length | 1-2 sentences | Enough context without excessive token overhead |
| Prompt caching | Ephemeral cache | ~90% cost reduction when processing many chunks from the same doc |
