rag-implementation
RAG Implementation
Master Retrieval-Augmented Generation (RAG) to build LLM applications that provide accurate, grounded responses using external knowledge sources.
When to Use This Skill
- Building Q&A systems over proprietary documents
- Creating chatbots with current, factual information
- Implementing semantic search with natural language queries
- Reducing hallucinations with grounded responses
- Enabling LLMs to access domain-specific knowledge
- Building documentation assistants
- Creating research tools with source citation
Core Components
1. Vector Databases
Purpose: Store and retrieve document embeddings efficiently
Options:
- Pinecone: Managed, scalable, serverless
- Weaviate: Open-source, hybrid search, GraphQL
- Milvus: High performance, on-premise
- Chroma: Lightweight, easy to use, local development
- Qdrant: Fast, filtered search, Rust-based
- pgvector: PostgreSQL extension, SQL integration
2. Embeddings
Purpose: Convert text to numerical vectors for similarity search
Models (2026):
| Model | Dimensions | Best For |
|---|---|---|
| voyage-3-large | 1024 | Claude apps (Anthropic recommended) |
| voyage-code-3 | 1024 | Code search |
| text-embedding-3-large | 3072 | OpenAI apps, high accuracy |
| text-embedding-3-small | 1536 | OpenAI apps, cost-effective |
| bge-large-en-v1.5 | 1024 | Open source, local deployment |
| multilingual-e5-large | 1024 | Multi-language support |
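Before indexing, it can help to embed one sample query and confirm the vector length matches the dimension you plan to configure on the index. A minimal sketch using the LangChain VoyageAI integration; it assumes a `VOYAGE_API_KEY` environment variable and the default output dimension for voyage-3-large:

```python
from langchain_voyageai import VoyageAIEmbeddings

# Assumes VOYAGE_API_KEY is set in the environment
embeddings = VoyageAIEmbeddings(model="voyage-3-large")

vector = embeddings.embed_query("What is retrieval-augmented generation?")
print(len(vector))  # should match the index dimension (1024 by default for voyage-3-large)
```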
3. Retrieval Strategies
Approaches:
- Dense Retrieval: Semantic similarity via embeddings
- Sparse Retrieval: Keyword matching (BM25, TF-IDF)
- Hybrid Search: Combine dense + sparse with weighted fusion
- Multi-Query: Generate multiple query variations
- HyDE: Generate hypothetical documents for better retrieval
4. Reranking
Purpose: Improve retrieval quality by reordering results
Methods:
- Cross-Encoders: BERT-based reranking (ms-marco-MiniLM)
- Cohere Rerank: API-based reranking
- Maximal Marginal Relevance (MMR): Diversity + relevance
- LLM-based: Use an LLM to score relevance (a sketch follows this list)
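Cross-encoder and Cohere reranking are covered under Retrieval Optimization below. For LLM-based scoring, here is a minimal sketch; the 0-10 scale and prompt wording are illustrative, and `llm` refers to the ChatAnthropic model created in the Quick Start:

```python
from pydantic import BaseModel, Field
from langchain_core.documents import Document


class RelevanceScore(BaseModel):
    score: int = Field(description="Relevance of the document to the query, from 0 to 10")


async def llm_rerank(query: str, docs: list[Document], k: int = 5) -> list[Document]:
    """Score each candidate document with the LLM and keep the top k."""
    scorer = llm.with_structured_output(RelevanceScore)
    scored = []
    for doc in docs:
        result = await scorer.ainvoke(
            f"Query: {query}\n\nDocument:\n{doc.page_content}\n\n"
            "Rate the document's relevance to the query from 0 to 10."
        )
        scored.append((doc, result.score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:k]]
```

This trades latency and token cost for flexibility, so it is usually reserved for small candidate sets.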
Quick Start with LangGraph
```python
from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic
from langchain_voyageai import VoyageAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from typing import TypedDict


class RAGState(TypedDict):
    question: str
    context: list[Document]
    answer: str


# Initialize components
llm = ChatAnthropic(model="claude-sonnet-4-5")
embeddings = VoyageAIEmbeddings(model="voyage-3-large")
vectorstore = PineconeVectorStore(index_name="docs", embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# RAG prompt
rag_prompt = ChatPromptTemplate.from_template(
    """Answer based on the context below. If you cannot answer, say so.
Context:
{context}
Question: {question}
Answer:""")


async def retrieve(state: RAGState) -> RAGState:
    """Retrieve relevant documents."""
    docs = await retriever.ainvoke(state["question"])
    return {"context": docs}


async def generate(state: RAGState) -> RAGState:
    """Generate answer from context."""
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(
        context=context_text,
        question=state["question"]
    )
    response = await llm.ainvoke(messages)
    return {"answer": response.content}


# Build RAG graph
builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)
rag_chain = builder.compile()

# Use
result = await rag_chain.ainvoke({"question": "What are the main features?"})
print(result["answer"])
```
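The graph above assumes the `docs` index is already populated. A minimal indexing sketch under that assumption; the file path and chunk sizes are illustrative:

```python
from langchain_community.document_loaders import TextLoader

# Load and chunk a source file (path is illustrative)
documents = TextLoader("docs/product_guide.md").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# Embed and upsert the chunks into the same index the graph queries
await vectorstore.aadd_documents(chunks)
```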
Advanced RAG Patterns
Pattern 1: Hybrid Search with RRF
```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Sparse retriever (BM25 for keyword matching)
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Dense retriever (embeddings for semantic search)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Combine with Reciprocal Rank Fusion weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.3, 0.7]  # 30% keyword, 70% semantic
)
```
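EnsembleRetriever fuses the two ranked lists with weighted Reciprocal Rank Fusion (RRF). For intuition, a minimal sketch of that scoring; the k=60 constant is the value conventionally used for RRF, and this is an illustration rather than the library's exact implementation:

```python
from collections import defaultdict
from langchain_core.documents import Document


def weighted_rrf(result_lists: list[list[Document]], weights: list[float], k: int = 60) -> list[Document]:
    """Each document earns weight / (k + rank) from every list it appears in; higher totals rank first."""
    scores: dict[str, float] = defaultdict(float)
    by_content: dict[str, Document] = {}
    for docs, weight in zip(result_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc.page_content] += weight / (k + rank)
            by_content[doc.page_content] = doc
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [by_content[content] for content in ranked]
```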
Pattern 2: Multi-Query Retrieval
```python
from langchain.retrievers.multi_query import MultiQueryRetriever

# Generate multiple query perspectives for better recall
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm
)

# Single query → multiple variations → combined results
results = await multi_query_retriever.ainvoke("What is the main topic?")
```
Pattern 3: Contextual Compression
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Compressor extracts only relevant portions
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)

# Returns only relevant parts of documents
compressed_docs = await compression_retriever.ainvoke("specific query")
```
Pattern 4: Parent Document Retriever
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small chunks for precise retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Store for parent documents
docstore = InMemoryStore()
parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

# Add documents (splits children, stores parents)
await parent_retriever.aadd_documents(documents)

# Retrieval returns parent documents with full context
results = await parent_retriever.ainvoke("query")
```
Pattern 5: HyDE (Hypothetical Document Embeddings)
```python
from langchain_core.prompts import ChatPromptTemplate


class HyDEState(TypedDict):
    question: str
    hypothetical_doc: str
    context: list[Document]
    answer: str


hyde_prompt = ChatPromptTemplate.from_template(
    """Write a detailed passage that would answer this question:
Question: {question}
Passage:"""
)


async def generate_hypothetical(state: HyDEState) -> HyDEState:
    """Generate hypothetical document for better retrieval."""
    messages = hyde_prompt.format_messages(question=state["question"])
    response = await llm.ainvoke(messages)
    return {"hypothetical_doc": response.content}


async def retrieve_with_hyde(state: HyDEState) -> HyDEState:
    """Retrieve using hypothetical document."""
    # Use hypothetical doc for retrieval instead of original query
    docs = await retriever.ainvoke(state["hypothetical_doc"])
    return {"context": docs}


# Build HyDE RAG graph (reuses the generate node from the Quick Start)
builder = StateGraph(HyDEState)
builder.add_node("hypothetical", generate_hypothetical)
builder.add_node("retrieve", retrieve_with_hyde)
builder.add_node("generate", generate)
builder.add_edge(START, "hypothetical")
builder.add_edge("hypothetical", "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)
hyde_rag = builder.compile()
```
Document Chunking Strategies
Recursive Character Text Splitter
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Try in order
)
chunks = splitter.split_documents(documents)
```
Token-Based Splitting
```python
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    encoding_name="cl100k_base"  # OpenAI tiktoken encoding
)
```
Semantic Chunking
```python
from langchain_experimental.text_splitter import SemanticChunker

splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
```
Markdown Header Splitter
```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False
)
```
Vector Store Configurations
Pinecone (Serverless)
```python
import os

from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

# Initialize Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create index if needed
if "my-index" not in pc.list_indexes().names():
    pc.create_index(
        name="my-index",
        dimension=1024,  # voyage-3-large dimensions
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# Create vector store
index = pc.Index("my-index")
vectorstore = PineconeVectorStore(index=index, embedding=embeddings)
```
Weaviate
```python
import weaviate
from langchain_weaviate import WeaviateVectorStore

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()
vectorstore = WeaviateVectorStore(
    client=client,
    index_name="Documents",
    text_key="content",
    embedding=embeddings
)
```
Chroma (Local Development)
```python
from langchain_chroma import Chroma

vectorstore = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
```
pgvector (PostgreSQL)
```python
from langchain_postgres.vectorstores import PGVector

connection_string = "postgresql+psycopg://user:pass@localhost:5432/vectordb"
vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection=connection_string,
)
```
Retrieval Optimization
1. Metadata Filtering
```python
from datetime import datetime

from langchain_core.documents import Document

# Add metadata during indexing
docs_with_metadata = []
for doc in documents:
    doc.metadata.update({
        "source": doc.metadata.get("source", "unknown"),
        "category": determine_category(doc.page_content),
        "date": datetime.now().isoformat()
    })
    docs_with_metadata.append(doc)

# Filter during retrieval
results = await vectorstore.asimilarity_search(
    "query",
    filter={"category": "technical"},
    k=5
)
```
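`determine_category` above stands in for whatever classification logic fits your corpus. A hypothetical keyword-based version, purely for illustration:

```python
def determine_category(text: str) -> str:
    """Hypothetical helper: tag a chunk with a coarse category based on keywords."""
    lowered = text.lower()
    if any(term in lowered for term in ("api", "endpoint", "function", "class")):
        return "technical"
    if any(term in lowered for term in ("price", "invoice", "billing")):
        return "billing"
    return "general"
```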
2. Maximal Marginal Relevance (MMR)
```python
# Balance relevance with diversity
results = await vectorstore.amax_marginal_relevance_search(
    "query",
    k=5,
    fetch_k=20,  # Fetch 20, return top 5 diverse
    lambda_mult=0.5  # 0 = max diversity, 1 = max relevance
)
```
3. Reranking with Cross-Encoder
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')


async def retrieve_and_rerank(query: str, k: int = 5) -> list[Document]:
    # Get initial results
    candidates = await vectorstore.asimilarity_search(query, k=20)
    # Rerank
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)
    # Sort by score and take top k
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:k]]
```
4. Cohere Rerank
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)

# Wrap retriever with reranking
reranked_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)
```
Prompt Engineering for RAG
Contextual Prompt with Citations
```python
rag_prompt = ChatPromptTemplate.from_template(
    """Answer the question based on the context below. Include citations using [1], [2], etc.
If you cannot answer based on the context, say "I don't have enough information."
Context:
{context}
Question: {question}
Instructions:
1. Use only information from the context
2. Cite sources with [1], [2] format
3. If uncertain, express uncertainty
Answer (with citations):"""
)
```
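For the [1], [2] citations to line up with real documents, number the retrieved chunks when building the context string instead of joining them bare. A small sketch; the `source` metadata key is an assumption about how your documents are tagged:

```python
from langchain_core.documents import Document


def format_context_with_citations(docs: list[Document]) -> str:
    """Prefix each chunk with an index so the model can cite [1], [2], ..."""
    return "\n\n".join(
        f"[{i}] (source: {doc.metadata.get('source', 'unknown')})\n{doc.page_content}"
        for i, doc in enumerate(docs, start=1)
    )
```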
Structured Output for RAG
```python
from pydantic import BaseModel, Field


class RAGResponse(BaseModel):
    answer: str = Field(description="The answer based on context")
    confidence: float = Field(description="Confidence score 0-1")
    sources: list[str] = Field(description="Source document IDs used")
    reasoning: str = Field(description="Brief reasoning for the answer")


# Use with structured output
structured_llm = llm.with_structured_output(RAGResponse)
```
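A sketch of using the structured model inside a generate node, assuming the `rag_prompt` and `RAGState` from the Quick Start:

```python
async def generate_structured(state: RAGState) -> RAGState:
    """Variant of the generate node that returns a validated RAGResponse."""
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(
        context=context_text,
        question=state["question"]
    )
    response = await structured_llm.ainvoke(messages)  # a RAGResponse instance
    return {"answer": response.answer}
```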
Evaluation Metrics
```python
from typing import TypedDict


class RAGEvalMetrics(TypedDict):
    retrieval_precision: float  # Relevant docs / retrieved docs
    retrieval_recall: float     # Retrieved relevant / total relevant
    answer_relevance: float     # Answer addresses question
    faithfulness: float         # Answer grounded in context
    context_relevance: float    # Context relevant to question


async def evaluate_rag_system(
    rag_chain,
    test_cases: list[dict]
) -> RAGEvalMetrics:
    """Evaluate RAG system on test cases."""
    metrics = {k: [] for k in RAGEvalMetrics.__annotations__}
    for test in test_cases:
        result = await rag_chain.ainvoke({"question": test["question"]})

        # Retrieval metrics
        retrieved_ids = {doc.metadata["id"] for doc in result["context"]}
        relevant_ids = set(test["relevant_doc_ids"])
        precision = len(retrieved_ids & relevant_ids) / len(retrieved_ids)
        recall = len(retrieved_ids & relevant_ids) / len(relevant_ids)
        metrics["retrieval_precision"].append(precision)
        metrics["retrieval_recall"].append(recall)

        # Use LLM-as-judge for quality metrics
        quality = await evaluate_answer_quality(
            question=test["question"],
            answer=result["answer"],
            context=result["context"],
            expected=test.get("expected_answer")
        )
        metrics["answer_relevance"].append(quality["relevance"])
        metrics["faithfulness"].append(quality["faithfulness"])
        metrics["context_relevance"].append(quality["context_relevance"])

    return {k: sum(v) / len(v) for k, v in metrics.items()}
```
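`evaluate_answer_quality` is left undefined above. A minimal LLM-as-judge sketch, assuming the `llm` from the Quick Start; the 0-1 scales and prompt wording are illustrative:

```python
from pydantic import BaseModel, Field


class AnswerQuality(BaseModel):
    relevance: float = Field(description="Does the answer address the question? 0-1")
    faithfulness: float = Field(description="Is every claim supported by the context? 0-1")
    context_relevance: float = Field(description="Is the retrieved context relevant to the question? 0-1")


async def evaluate_answer_quality(
    question: str,
    answer: str,
    context: list[Document],
    expected: str | None = None,
) -> dict:
    """Score a single RAG result with an LLM judge."""
    judge = llm.with_structured_output(AnswerQuality)
    context_text = "\n\n".join(doc.page_content for doc in context)
    prompt = (
        f"Question: {question}\n\nContext:\n{context_text}\n\nAnswer: {answer}\n"
        + (f"\nReference answer: {expected}\n" if expected else "")
        + "\nScore relevance, faithfulness, and context_relevance from 0 to 1."
    )
    result = await judge.ainvoke(prompt)
    return result.model_dump()
```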
Resources
Best Practices
- Chunk Size: Balance between context (larger) and specificity (smaller) - typically 500-1000 tokens
- Overlap: Use 10-20% overlap to preserve context at boundaries
- Metadata: Include source, page, timestamp for filtering and debugging
- Hybrid Search: Combine semantic and keyword search for best recall
- Reranking: Use cross-encoder reranking for precision-critical applications
- Citations: Always return source documents for transparency
- Evaluation: Continuously test retrieval quality and answer accuracy
- Monitoring: Track retrieval metrics and latency in production
Common Issues
- Poor Retrieval: Check embedding quality, chunk size, query formulation
- Irrelevant Results: Add metadata filtering, use hybrid search, rerank
- Missing Information: Ensure documents are properly indexed, check chunking
- Slow Queries: Optimize vector store, use caching, reduce k
- Hallucinations: Improve the grounding prompt and add a verification step (see the sketch after this list)
- Context Too Long: Use compression or parent document retriever
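For the hallucination issue above, one option is a verification node that runs after generate and checks whether the answer is grounded in the retrieved context. A minimal sketch, assuming the `RAGState` and `llm` from the Quick Start; the conditional wiring and retry policy are left out:

```python
from pydantic import BaseModel, Field


class GroundednessCheck(BaseModel):
    grounded: bool = Field(description="True if every claim in the answer is supported by the context")


async def verify(state: RAGState) -> RAGState:
    """Replace answers that are not supported by the retrieved context."""
    checker = llm.with_structured_output(GroundednessCheck)
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    result = await checker.ainvoke(
        f"Context:\n{context_text}\n\nAnswer: {state['answer']}\n\n"
        "Is every claim in the answer supported by the context?"
    )
    if not result.grounded:
        return {"answer": "I don't have enough information to answer reliably."}
    return {"answer": state["answer"]}
```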