Contextual Retrieval

Prepend situational context to chunks before embedding to preserve document-level meaning.

The Problem

Traditional chunking loses context:
Original document: "ACME Q3 2024 Earnings Report..."
Chunk: "Revenue increased 15% compared to the previous quarter."

Query: "What was ACME's Q3 2024 revenue growth?"
Result: Chunk doesn't mention "ACME" or "Q3 2024" - retrieval fails
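The failure above is easy to reproduce with a simple term-overlap check, the kind of lexical matching BM25 relies on (a toy sketch, not from the original):

```python
chunk = "Revenue increased 15% compared to the previous quarter."
query_terms = {"acme", "q3", "2024", "revenue"}

# tokenize the raw chunk the way a lexical index would
chunk_terms = {w.strip(".,").lower() for w in chunk.split()}

overlap = query_terms & chunk_terms
# only "revenue" overlaps; "ACME" and "Q3 2024" never appear in the chunk
```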

The Solution

Contextual Retrieval prepends a brief context to each chunk:
Contextualized chunk:
"This chunk is from ACME Corp's Q3 2024 earnings report, specifically
the revenue section. Revenue increased 15% compared to the previous quarter."
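With the context prepended, the same toy lexical check now matches every query term:

```python
ctx_chunk = (
    "This chunk is from ACME Corp's Q3 2024 earnings report, specifically "
    "the revenue section. Revenue increased 15% compared to the previous quarter."
)
query_terms = {"acme", "q3", "2024", "revenue"}

ctx_terms = {w.strip(".,").lower() for w in ctx_chunk.split()}
# all four query terms now appear in the indexed text
```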

Implementation

Context Generation

python
import anthropic

client = anthropic.Anthropic()

CONTEXT_PROMPT = """
<document>
{document}
</document>

Here is the chunk we want to situate within the document:
<chunk>
{chunk}
</chunk>

Please give a short, succinct context (1-2 sentences) to situate this chunk
within the overall document. Focus on information that would help retrieval.
Answer only with the context, nothing else.
"""

def generate_context(document: str, chunk: str) -> str:
    """Generate context for a single chunk."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # alias for the latest Sonnet 4.5 snapshot
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(document=document, chunk=chunk)
        }]
    )
    return response.content[0].text

def contextualize_chunk(document: str, chunk: str) -> str:
    """Prepend context to chunk."""
    context = generate_context(document, chunk)
    return f"{context}\n\n{chunk}"

Batch Processing with Caching

python
from anthropic import Anthropic

client = Anthropic()

def contextualize_chunks_cached(document: str, chunks: list[str]) -> list[str]:
    """
    Use prompt caching to efficiently process many chunks from same document.
    Document is cached, only chunk changes per request.
    """
    results = []

    for i, chunk in enumerate(chunks):
        response = client.messages.create(
            model="claude-sonnet-4-5",  # alias for the latest Sonnet 4.5 snapshot
            max_tokens=150,
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"<document>\n{document}\n</document>",
                        "cache_control": {"type": "ephemeral"}  # Cache document
                    },
                    {
                        "type": "text",
                        "text": f"""
Here is chunk {i+1} to situate:
<chunk>
{chunk}
</chunk>

Give a short context (1-2 sentences) to situate this chunk.
"""
                    }
                ]
            }]
        )
        context = response.content[0].text
        results.append(f"{context}\n\n{chunk}")

    return results

Hybrid Search (BM25 + Vector)

Contextual Retrieval works best with hybrid search:
python
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    def __init__(self, chunks: list[str], embeddings: np.ndarray):
        self.chunks = chunks
        self.embeddings = embeddings

        # BM25 index on raw text
        tokenized = [c.lower().split() for c in chunks]
        self.bm25 = BM25Okapi(tokenized)

    def search(
        self,
        query: str,
        query_embedding: np.ndarray,
        top_k: int = 20,
        bm25_weight: float = 0.4,
        vector_weight: float = 0.6
    ) -> list[tuple[int, float]]:
        """Hybrid search combining BM25 and vector similarity."""
        # BM25 scores
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-6)

        # Vector similarity
        vector_scores = np.dot(self.embeddings, query_embedding)
        vector_scores = (vector_scores - vector_scores.min()) / (vector_scores.max() - vector_scores.min() + 1e-6)

        # Combine
        combined = bm25_weight * bm25_scores + vector_weight * vector_scores

        # Top-k
        top_indices = np.argsort(combined)[::-1][:top_k]
        return [(i, combined[i]) for i in top_indices]
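The weighted-combination step in `search` can be checked on toy scores (values made up for illustration):

```python
import numpy as np

bm25 = np.array([1.0, 0.2, 0.0])   # normalized BM25 scores for 3 chunks
vec = np.array([0.3, 1.0, 0.5])    # normalized vector-similarity scores

combined = 0.4 * bm25 + 0.6 * vec  # same 40%/60% split as above
ranking = np.argsort(combined)[::-1]
# chunk 1 ranks first: its strong vector match outweighs chunk 0's BM25 lead
```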

Complete Pipeline

python
from dataclasses import dataclass

import numpy as np
from rank_bm25 import BM25Okapi

@dataclass
class ContextualChunk:
    original: str
    contextualized: str
    embedding: list[float]
    doc_id: str
    chunk_index: int

class ContextualRetriever:
    def __init__(self, embed_model, llm_client):
        self.embed_model = embed_model
        self.llm = llm_client
        self.chunks: list[ContextualChunk] = []
        self.bm25 = None

    def add_document(self, doc_id: str, text: str, chunk_size: int = 512):
        """Process and index a document."""
        # 1. Chunk the document
        raw_chunks = self._chunk_text(text, chunk_size)

        # 2. Generate context for each chunk (with caching)
        contextualized = self._contextualize_batch(text, raw_chunks)

        # 3. Embed contextualized chunks
        embeddings = self.embed_model.embed(contextualized)

        # 4. Store
        for i, (raw, ctx, emb) in enumerate(zip(raw_chunks, contextualized, embeddings)):
            self.chunks.append(ContextualChunk(
                original=raw,
                contextualized=ctx,
                embedding=emb,
                doc_id=doc_id,
                chunk_index=i
            ))

        # 5. Rebuild BM25 index
        self._rebuild_bm25()

    def search(self, query: str, top_k: int = 10) -> list[ContextualChunk]:
        """Hybrid search over contextualized chunks."""
        query_emb = self.embed_model.embed([query])[0]

        # BM25 on contextualized text
        bm25_scores = self.bm25.get_scores(query.lower().split())

        # Vector similarity
        embeddings = np.array([c.embedding for c in self.chunks])
        vector_scores = np.dot(embeddings, query_emb)

        # Normalize and combine
        bm25_norm = self._normalize(bm25_scores)
        vector_norm = self._normalize(vector_scores)
        combined = 0.4 * bm25_norm + 0.6 * vector_norm

        # Return top-k
        top_indices = np.argsort(combined)[::-1][:top_k]
        return [self.chunks[i] for i in top_indices]

    def _contextualize_batch(self, document: str, chunks: list[str]) -> list[str]:
        """Generate context for all chunks (use prompt caching)."""
        results = []
        for chunk in chunks:
            context = self._generate_context(document, chunk)
            results.append(f"{context}\n\n{chunk}")
        return results

    def _generate_context(self, document: str, chunk: str) -> str:
        # Implementation from above
        pass

    def _chunk_text(self, text: str, chunk_size: int) -> list[str]:
        """Simple sentence-aware chunking."""
        sentences = text.split('. ')
        chunks = []
        current = []
        current_len = 0

        for sent in sentences:
            if current_len + len(sent) > chunk_size and current:
                chunks.append('. '.join(current) + '.')
                current = [sent]
                current_len = len(sent)
            else:
                current.append(sent)
                current_len += len(sent)

        if current:
            chunks.append('. '.join(current))
        return chunks

    def _rebuild_bm25(self):
        tokenized = [c.contextualized.lower().split() for c in self.chunks]
        self.bm25 = BM25Okapi(tokenized)

    def _normalize(self, scores: np.ndarray) -> np.ndarray:
        return (scores - scores.min()) / (scores.max() - scores.min() + 1e-6)
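The sentence-aware chunking step can be exercised in isolation. A standalone copy of the same greedy packing logic, with a deliberately small `chunk_size` to force splits:

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    # greedily pack whole sentences until the size budget is exceeded
    sentences = text.split('. ')
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        if current_len + len(sent) > chunk_size and current:
            chunks.append('. '.join(current) + '.')
            current, current_len = [sent], len(sent)
        else:
            current.append(sent)
            current_len += len(sent)
    if current:
        chunks.append('. '.join(current))
    return chunks

parts = chunk_text("One sentence here. Another sentence here. A third one", 25)
```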

Optimization Tips

1. Cost Reduction with Caching

  • Prompt caching reduces cost by ~90% when processing many chunks from the same document.
  • The document is cached on the first request and reused for subsequent chunks.
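A back-of-the-envelope check of the ~90% figure, assuming cache writes bill at 1.25x and cache reads at 0.1x the base input rate (these multipliers are assumptions; check current pricing, and note per-chunk prompt tokens are ignored):

```python
doc_tokens, n_chunks = 50_000, 100

uncached = doc_tokens * n_chunks  # document resent in full for every chunk
cached = doc_tokens * 1.25 + doc_tokens * 0.1 * (n_chunks - 1)

savings = 1 - cached / uncached   # roughly 0.89 under these assumptions
```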

2. Parallel Processing

python
import asyncio

async def contextualize_parallel(document: str, chunks: list[str]) -> list[str]:
    """Process chunks in parallel with rate limiting."""
    semaphore = asyncio.Semaphore(10)  # cap at 10 concurrent requests

    async def process_chunk(chunk: str) -> str:
        async with semaphore:
            # async_generate_context is assumed to be an async variant of
            # generate_context above (e.g. built on anthropic.AsyncAnthropic)
            context = await async_generate_context(document, chunk)
            return f"{context}\n\n{chunk}"

    return await asyncio.gather(*[process_chunk(c) for c in chunks])

3. Context Quality

Good context examples:
  • "This chunk is from the API authentication section of the FastAPI documentation."
  • "This describes the company's Q3 2024 financial performance, specifically operating expenses."
  • "This section covers error handling in the user registration flow."
Bad context (too generic):
  • "This is a chunk from the document."
  • "Information about the topic."
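One cheap way to catch the too-generic failure mode is a vocabulary check before accepting a generated context. This is a hypothetical heuristic, not part of the original technique:

```python
# generic filler words; a context with no other content words is suspect
GENERIC = {"this", "chunk", "from", "document", "about", "topic",
           "information", "text", "section"}

def looks_generic(context: str) -> bool:
    """Flag contexts whose content words are all generic filler."""
    words = {w.strip('."').lower() for w in context.split()}
    content = {w for w in words if len(w) > 3}
    return content <= GENERIC
```

A generic context like "This is a chunk from the document." is flagged, while one naming a specific product or section passes.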

Results (from Anthropic's research)

| Method | Retrieval failure rate |
| --- | --- |
| Traditional embeddings | 5.7% |
| + Contextual embeddings | 3.5% |
| + Contextual + BM25 hybrid | 1.9% |
| + Contextual + BM25 + reranking | 1.3% |

67% reduction in retrieval failures with the full contextual retrieval pipeline.
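The headline 67% figure is the drop from the 5.7% baseline to 1.9% with contextual embeddings plus BM25:

```python
baseline = 5.7          # traditional embeddings, failure rate in %
contextual_bm25 = 1.9   # contextual embeddings + BM25 hybrid

reduction = 1 - contextual_bm25 / baseline
# about 0.667, i.e. the quoted 67% fewer retrieval failures
```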

When to Use

Use Contextual Retrieval when:
  • Documents have important metadata (dates, names, versions)
  • Chunks frequently lose meaning without document context
  • Retrieval quality is critical (customer-facing, compliance)
  • You can afford the additional LLM cost during indexing
Skip if:
  • Chunks are self-contained (Q&A pairs, definitions)
  • Low latency indexing required (high-volume streaming)
  • Cost-sensitive with many small documents

Related Skills

  • rag-retrieval
    - Core RAG pipeline patterns that contextual retrieval enhances
  • embeddings
    - Text embedding strategies for the vector search component
  • reranking-patterns
    - Post-retrieval reranking to further improve precision
  • hyde-retrieval
    - Alternative retrieval enhancement using hypothetical documents

Key Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Context generation model | Claude Sonnet | Balance of quality and cost for context generation |
| BM25/vector weight split | 40%/60% | Anthropic's research shows a slight vector bias is optimal |
| Chunk context length | 1-2 sentences | Enough context without excessive token overhead |
| Prompt caching | Ephemeral cache | ~90% cost reduction when processing many chunks from the same doc |
