contextual-retrieval

Contextual Retrieval
Prepend situational context to chunks before embedding to preserve document-level meaning.
The Problem
Traditional chunking loses context:

Original document: "ACME Q3 2024 Earnings Report..."
Chunk: "Revenue increased 15% compared to the previous quarter."
Query: "What was ACME's Q3 2024 revenue growth?"
Result: The chunk mentions neither "ACME" nor "Q3 2024", so retrieval fails.
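The gap is visible even with a crude lexical check. A toy `term_overlap` helper (illustrative only, not part of any library) shows which query terms the chunk can actually match:

```python
def term_overlap(query: str, text: str) -> set[str]:
    """Query terms that literally appear in the text (naive whitespace tokens)."""
    return set(query.lower().split()) & set(text.lower().split())

query = "What was ACME's Q3 2024 revenue growth?"
chunk = "Revenue increased 15% compared to the previous quarter."
contextualized = (
    "This chunk is from ACME Corp's Q3 2024 earnings report, "
    "specifically the revenue section. " + chunk
)

term_overlap(query, chunk)           # only "revenue" matches
term_overlap(query, contextualized)  # now "q3" and "2024" match as well
```

A BM25 index suffers from exactly this mismatch, which is why prepending context helps lexical search as much as vector search.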
The Solution
Contextual Retrieval prepends a brief context to each chunk:

Contextualized chunk:
"This chunk is from ACME Corp's Q3 2024 earnings report, specifically
the revenue section. Revenue increased 15% compared to the previous quarter."

Implementation
Context Generation
```python
import anthropic

client = anthropic.Anthropic()

CONTEXT_PROMPT = """
<document>
{document}
</document>

Here is the chunk we want to situate within the document:

<chunk>
{chunk}
</chunk>

Please give a short, succinct context (1-2 sentences) to situate this chunk
within the overall document. Focus on information that would help retrieval.
Answer only with the context, nothing else.
"""


def generate_context(document: str, chunk: str) -> str:
    """Generate context for a single chunk."""
    response = client.messages.create(
        model="claude-sonnet-4-5-20251101",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(document=document, chunk=chunk)
        }]
    )
    return response.content[0].text


def contextualize_chunk(document: str, chunk: str) -> str:
    """Prepend the generated context to the chunk before embedding."""
    context = generate_context(document, chunk)
    return f"{context}\n\n{chunk}"
```

Batch Processing with Caching
```python
from anthropic import Anthropic

client = Anthropic()


def contextualize_chunks_cached(document: str, chunks: list[str]) -> list[str]:
    """
    Use prompt caching to process many chunks from the same document efficiently.
    The document is cached; only the chunk changes per request.
    """
    results = []
    for i, chunk in enumerate(chunks):
        response = client.messages.create(
            model="claude-sonnet-4-5-20251101",
            max_tokens=150,
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"<document>\n{document}\n</document>",
                        "cache_control": {"type": "ephemeral"}  # Cache the document
                    },
                    {
                        "type": "text",
                        "text": f"""
Here is chunk {i + 1} to situate:

<chunk>
{chunk}
</chunk>

Give a short context (1-2 sentences) to situate this chunk."""
                    }
                ]
            }]
        )
        context = response.content[0].text
        results.append(f"{context}\n\n{chunk}")
    return results
```

Hybrid Search (BM25 + Vector)
Contextual Retrieval works best with hybrid search:
```python
from rank_bm25 import BM25Okapi
import numpy as np


class HybridRetriever:
    def __init__(self, chunks: list[str], embeddings: np.ndarray):
        self.chunks = chunks
        self.embeddings = embeddings
        # BM25 index on raw text
        tokenized = [c.lower().split() for c in chunks]
        self.bm25 = BM25Okapi(tokenized)

    def search(
        self,
        query: str,
        query_embedding: np.ndarray,
        top_k: int = 20,
        bm25_weight: float = 0.4,
        vector_weight: float = 0.6,
    ) -> list[tuple[int, float]]:
        """Hybrid search combining BM25 and vector similarity."""
        # BM25 scores, min-max normalized
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-6)
        # Vector similarity, min-max normalized
        vector_scores = np.dot(self.embeddings, query_embedding)
        vector_scores = (vector_scores - vector_scores.min()) / (vector_scores.max() - vector_scores.min() + 1e-6)
        # Weighted combination
        combined = bm25_weight * bm25_scores + vector_weight * vector_scores
        # Top-k indices by combined score
        top_indices = np.argsort(combined)[::-1][:top_k]
        return [(int(i), float(combined[i])) for i in top_indices]
```
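As a quick numeric check of the combination step (numpy only; the raw BM25 and vector scores below are made up for illustration), a chunk with a weak lexical score can still win on the weighted sum:

```python
import numpy as np

def minmax(scores: np.ndarray) -> np.ndarray:
    """Same min-max normalization used in HybridRetriever.search."""
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-6)

bm25_scores = np.array([2.0, 0.5, 0.0])    # chunk 0 dominates lexically
vector_scores = np.array([0.2, 0.9, 0.4])  # chunk 1 dominates semantically

combined = 0.4 * minmax(bm25_scores) + 0.6 * minmax(vector_scores)
best = int(np.argmax(combined))  # chunk 1 wins: the 0.6 vector weight tips it
```

Normalizing before combining matters: raw BM25 scores are unbounded while cosine similarities live in [-1, 1], so mixing them without rescaling would let one signal swamp the other.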
Complete Pipeline
```python
from dataclasses import dataclass

import numpy as np
from rank_bm25 import BM25Okapi


@dataclass
class ContextualChunk:
    original: str
    contextualized: str
    embedding: list[float]
    doc_id: str
    chunk_index: int


class ContextualRetriever:
    def __init__(self, embed_model, llm_client):
        self.embed_model = embed_model
        self.llm = llm_client
        self.chunks: list[ContextualChunk] = []
        self.bm25 = None

    def add_document(self, doc_id: str, text: str, chunk_size: int = 512):
        """Process and index a document."""
        # 1. Chunk the document
        raw_chunks = self._chunk_text(text, chunk_size)
        # 2. Generate context for each chunk (with prompt caching)
        contextualized = self._contextualize_batch(text, raw_chunks)
        # 3. Embed the contextualized chunks
        embeddings = self.embed_model.embed(contextualized)
        # 4. Store
        for i, (raw, ctx, emb) in enumerate(zip(raw_chunks, contextualized, embeddings)):
            self.chunks.append(ContextualChunk(
                original=raw,
                contextualized=ctx,
                embedding=emb,
                doc_id=doc_id,
                chunk_index=i,
            ))
        # 5. Rebuild the BM25 index
        self._rebuild_bm25()

    def search(self, query: str, top_k: int = 10) -> list[ContextualChunk]:
        """Hybrid search over contextualized chunks."""
        query_emb = self.embed_model.embed([query])[0]
        # BM25 on contextualized text
        bm25_scores = self.bm25.get_scores(query.lower().split())
        # Vector similarity
        embeddings = np.array([c.embedding for c in self.chunks])
        vector_scores = np.dot(embeddings, query_emb)
        # Normalize and combine
        bm25_norm = self._normalize(bm25_scores)
        vector_norm = self._normalize(vector_scores)
        combined = 0.4 * bm25_norm + 0.6 * vector_norm
        # Return top-k chunks
        top_indices = np.argsort(combined)[::-1][:top_k]
        return [self.chunks[i] for i in top_indices]

    def _contextualize_batch(self, document: str, chunks: list[str]) -> list[str]:
        """Generate context for all chunks (use prompt caching)."""
        results = []
        for chunk in chunks:
            context = self._generate_context(document, chunk)
            results.append(f"{context}\n\n{chunk}")
        return results

    def _generate_context(self, document: str, chunk: str) -> str:
        # Implementation as in "Context Generation" above
        ...

    def _chunk_text(self, text: str, chunk_size: int) -> list[str]:
        """Simple sentence-aware chunking."""
        sentences = text.split('. ')
        chunks = []
        current = []
        current_len = 0
        for sent in sentences:
            if current_len + len(sent) > chunk_size and current:
                chunks.append('. '.join(current) + '.')
                current = [sent]
                current_len = len(sent)
            else:
                current.append(sent)
                current_len += len(sent)
        if current:
            chunks.append('. '.join(current))
        return chunks

    def _rebuild_bm25(self):
        tokenized = [c.contextualized.lower().split() for c in self.chunks]
        self.bm25 = BM25Okapi(tokenized)

    def _normalize(self, scores: np.ndarray) -> np.ndarray:
        return (scores - scores.min()) / (scores.max() - scores.min() + 1e-6)
```
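The sentence-aware chunker is easy to sanity-check in isolation. The same greedy logic as `_chunk_text`, lifted into a free function with a toy input:

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Greedily pack sentences; start a new chunk once the budget is exceeded."""
    sentences = text.split('. ')
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        if current_len + len(sent) > chunk_size and current:
            chunks.append('. '.join(current) + '.')
            current, current_len = [sent], len(sent)
        else:
            current.append(sent)
            current_len += len(sent)
    if current:
        chunks.append('. '.join(current))
    return chunks

text = "First sentence here. Second sentence here. Third sentence here."
chunk_text(text, chunk_size=25)  # one sentence per chunk at this tiny budget
```

Note the budget is in characters, not tokens; for production you would typically swap in a tokenizer-based length and smarter sentence splitting.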
Optimization Tips
1. Cost Reduction with Caching
Prompt caching reduces cost by roughly 90% when processing many chunks from the same document: the document is cached on the first request and reused for every subsequent chunk.
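The ~90% figure can be sanity-checked with back-of-envelope arithmetic, assuming Anthropic's published multipliers of about 1.25× for cache writes and 0.1× for cache reads relative to the base input-token price (the token counts below are hypothetical):

```python
doc_tokens = 50_000     # long document shared by all requests
n_chunks = 100          # chunks to contextualize
per_chunk_tokens = 700  # chunk text + instructions per request

# Cost measured in "base-price input token" units.
# Without caching: the full document is re-sent with every chunk.
uncached = n_chunks * (doc_tokens + per_chunk_tokens)

# With caching: one cache write (~1.25x), then cheap cache reads (~0.1x).
cached = doc_tokens * 1.25 + n_chunks * (doc_tokens * 0.10 + per_chunk_tokens)

savings = 1 - cached / uncached  # roughly 0.88 under these assumptions
```

The savings grow with document length and chunk count, since the cached document dominates the per-request cost.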
2. Parallel Processing
```python
import asyncio


async def contextualize_parallel(document: str, chunks: list[str]) -> list[str]:
    """Process chunks in parallel with rate limiting."""
    semaphore = asyncio.Semaphore(10)  # at most 10 concurrent requests

    async def process_chunk(chunk: str) -> str:
        async with semaphore:
            # async_generate_context: async variant of generate_context above,
            # e.g. built on anthropic.AsyncAnthropic
            context = await async_generate_context(document, chunk)
            return f"{context}\n\n{chunk}"

    return await asyncio.gather(*[process_chunk(c) for c in chunks])
```

3. Context Quality
Good context examples:

- "This chunk is from the API authentication section of the FastAPI documentation."
- "This describes the company's Q3 2024 financial performance, specifically operating expenses."
- "This section covers error handling in the user registration flow."

Bad context (too generic):

- "This is a chunk from the document."
- "Information about the topic."
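A cheap guardrail (a hypothetical heuristic; tune the stopword set to your corpus) is to flag generated contexts that contain nothing document-specific and retry those:

```python
GENERIC_WORDS = {
    "this", "is", "a", "an", "the", "chunk", "from", "document",
    "about", "information", "topic", "section", "text",
}

def looks_generic(context: str) -> bool:
    """True if the context contains no tokens beyond generic filler."""
    tokens = {t.strip(".,\"'").lower() for t in context.split()}
    return not (tokens - GENERIC_WORDS - {""})

looks_generic("This is a chunk from the document.")  # True: retry this one
looks_generic("This chunk is from ACME Corp's Q3 2024 earnings report.")  # False
```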
Results (from Anthropic's research)
| Method | Retrieval Failure Rate |
|---|---|
| Traditional embeddings | 5.7% |
| + Contextual embeddings | 3.7% |
| + Contextual + BM25 hybrid | 2.9% |
| + Contextual + BM25 + reranking | 1.9% |

A 67% reduction in retrieval failures (5.7% → 1.9%) with the full contextual retrieval pipeline.
When to Use
Use Contextual Retrieval when:
- Documents have important metadata (dates, names, versions)
- Chunks frequently lose meaning without document context
- Retrieval quality is critical (customer-facing, compliance)
- You can afford the additional LLM cost during indexing
Skip if:
- Chunks are self-contained (Q&A pairs, definitions)
- Low latency indexing required (high-volume streaming)
- Cost-sensitive with many small documents
Related Skills
- rag-retrieval - Core RAG pipeline patterns that contextual retrieval enhances
- embeddings - Text embedding strategies for the vector search component
- reranking-patterns - Post-retrieval reranking to further improve precision
- hyde-retrieval - Alternative retrieval enhancement using hypothetical documents
Key Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Context generation model | Claude Sonnet | Balance of quality and cost for context generation |
| BM25/Vector weight split | 40%/60% | Anthropic research shows slight vector bias optimal |
| Chunk context length | 1-2 sentences | Enough context without excessive token overhead |
| Prompt caching | Ephemeral cache | 90% cost reduction when processing many chunks from same doc |