RAG Implementation

Master Retrieval-Augmented Generation (RAG) to build LLM applications that provide accurate, grounded responses using external knowledge sources.

When to Use This Skill

  • Building Q&A systems over proprietary documents
  • Creating chatbots with current, factual information
  • Implementing semantic search with natural language queries
  • Reducing hallucinations with grounded responses
  • Enabling LLMs to access domain-specific knowledge
  • Building documentation assistants
  • Creating research tools with source citation

Core Components

1. Vector Databases

Purpose: Store and retrieve document embeddings efficiently
Options:
  • Pinecone: Managed, scalable, serverless
  • Weaviate: Open-source, hybrid search, GraphQL
  • Milvus: High performance, on-premise
  • Chroma: Lightweight, easy to use, local development
  • Qdrant: Fast, filtered search, Rust-based
  • pgvector: PostgreSQL extension, SQL integration

2. Embeddings

Purpose: Convert text to numerical vectors for similarity search
Models (2026):
Model                    Dimensions  Best For
voyage-3-large           1024        Claude apps (Anthropic recommended)
voyage-code-3            1024        Code search
text-embedding-3-large   3072        OpenAI apps, high accuracy
text-embedding-3-small   1536        OpenAI apps, cost-effective
bge-large-en-v1.5        1024        Open source, local deployment
multilingual-e5-large    1024        Multi-language support
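
To make the interface concrete, here is a minimal sketch of embedding a query and a few documents with voyage-3-large via the langchain_voyageai integration (the same model used in the quick start below); it assumes a VOYAGE_API_KEY environment variable, and the sample texts are placeholders.

python
from langchain_voyageai import VoyageAIEmbeddings

# Assumes VOYAGE_API_KEY is set in the environment
embeddings = VoyageAIEmbeddings(model="voyage-3-large")

# One 1024-dimensional vector for the query text
query_vector = embeddings.embed_query("How do I enable hybrid search?")

# One vector per document; these are what get stored in the vector database
doc_vectors = embeddings.embed_documents([
    "Hybrid search combines BM25 keyword matching with dense embeddings.",
    "Reranking reorders retrieved candidates with a cross-encoder.",
])

print(len(query_vector))  # 1024 for voyage-3-large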

3. Retrieval Strategies

Approaches:
  • Dense Retrieval: Semantic similarity via embeddings
  • Sparse Retrieval: Keyword matching (BM25, TF-IDF)
  • Hybrid Search: Combine dense + sparse with weighted fusion
  • Multi-Query: Generate multiple query variations
  • HyDE: Generate hypothetical documents for better retrieval

4. Reranking

Purpose: Improve retrieval quality by reordering results
Methods:
  • Cross-Encoders: BERT-based reranking (ms-marco-MiniLM)
  • Cohere Rerank: API-based reranking
  • Maximal Marginal Relevance (MMR): Diversity + relevance
  • LLM-based: Use LLM to score relevance

Quick Start with LangGraph

python
from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic
from langchain_voyageai import VoyageAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from typing import TypedDict

class RAGState(TypedDict):
    question: str
    context: list[Document]
    answer: str

# Initialize components
llm = ChatAnthropic(model="claude-sonnet-4-5")
embeddings = VoyageAIEmbeddings(model="voyage-3-large")
vectorstore = PineconeVectorStore(index_name="docs", embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# RAG prompt
rag_prompt = ChatPromptTemplate.from_template(
    """Answer based on the context below. If you cannot answer, say so.

Context:
{context}

Question: {question}

Answer:"""
)

async def retrieve(state: RAGState) -> RAGState:
    """Retrieve relevant documents."""
    docs = await retriever.ainvoke(state["question"])
    return {"context": docs}

async def generate(state: RAGState) -> RAGState:
    """Generate answer from context."""
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(
        context=context_text, question=state["question"]
    )
    response = await llm.ainvoke(messages)
    return {"answer": response.content}

# Build RAG graph
builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

rag_chain = builder.compile()

# Use
result = await rag_chain.ainvoke({"question": "What are the main features?"})
print(result["answer"])

Advanced RAG Patterns

Pattern 1: Hybrid Search with RRF

python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Sparse retriever (BM25 for keyword matching)
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Dense retriever (embeddings for semantic search)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Combine with Reciprocal Rank Fusion weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.3, 0.7]  # 30% keyword, 70% semantic
)
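
Like any LangChain retriever, the ensemble can be invoked directly; the query below is a placeholder, and the 0.3/0.7 weights above are only a starting point to tune against your own evaluation set.

python
# Fused results from both retrievers, combined with Reciprocal Rank Fusion
results = await ensemble_retriever.ainvoke("How do I configure access control?")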

Pattern 2: Multi-Query Retrieval

python
from langchain.retrievers.multi_query import MultiQueryRetriever

# Generate multiple query perspectives for better recall
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm
)

# Single query → multiple variations → combined results
results = await multi_query_retriever.ainvoke("What is the main topic?")

Pattern 3: Contextual Compression

python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Compressor extracts only relevant portions
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)

# Returns only relevant parts of documents
compressed_docs = await compression_retriever.ainvoke("specific query")

Pattern 4: Parent Document Retriever

python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small chunks for precise retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Store for parent documents
docstore = InMemoryStore()

parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

# Add documents (splits children, stores parents)
await parent_retriever.aadd_documents(documents)

# Retrieval returns parent documents with full context
results = await parent_retriever.ainvoke("query")

Pattern 5: HyDE (Hypothetical Document Embeddings)

python
from langchain_core.prompts import ChatPromptTemplate

class HyDEState(TypedDict):
    question: str
    hypothetical_doc: str
    context: list[Document]
    answer: str

hyde_prompt = ChatPromptTemplate.from_template(
    """Write a detailed passage that would answer this question:

    Question: {question}

    Passage:"""
)

async def generate_hypothetical(state: HyDEState) -> HyDEState:
    """Generate hypothetical document for better retrieval."""
    messages = hyde_prompt.format_messages(question=state["question"])
    response = await llm.ainvoke(messages)
    return {"hypothetical_doc": response.content}

async def retrieve_with_hyde(state: HyDEState) -> HyDEState:
    """Retrieve using hypothetical document."""
    # Use hypothetical doc for retrieval instead of original query
    docs = await retriever.ainvoke(state["hypothetical_doc"])
    return {"context": docs}

# Build HyDE RAG graph
builder = StateGraph(HyDEState)
builder.add_node("hypothetical", generate_hypothetical)
builder.add_node("retrieve", retrieve_with_hyde)
builder.add_node("generate", generate)
builder.add_edge(START, "hypothetical")
builder.add_edge("hypothetical", "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

hyde_rag = builder.compile()

Document Chunking Strategies

Recursive Character Text Splitter

python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Try in order
)

chunks = splitter.split_documents(documents)

Token-Based Splitting

python
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    encoding_name="cl100k_base"  # OpenAI tiktoken encoding
)

Semantic Chunking

python
from langchain_experimental.text_splitter import SemanticChunker

splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

Markdown Header Splitter

python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False
)
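
Whichever splitter you choose, the resulting chunks still need to be indexed. A minimal sketch, assuming the recursive character splitter above, a list of loaded documents, and the vectorstore from the quick start:

python
# Split source documents and index the chunks into the vector store
chunks = splitter.split_documents(documents)
await vectorstore.aadd_documents(chunks)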

Vector Store Configurations

Pinecone (Serverless)

python
import os

from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

# Initialize Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create index if needed
if "my-index" not in pc.list_indexes().names():
    pc.create_index(
        name="my-index",
        dimension=1024,  # voyage-3-large dimensions
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# Create vector store
index = pc.Index("my-index")
vectorstore = PineconeVectorStore(index=index, embedding=embeddings)

Weaviate

python
import weaviate
from langchain_weaviate import WeaviateVectorStore

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()

vectorstore = WeaviateVectorStore(
    client=client,
    index_name="Documents",
    text_key="content",
    embedding=embeddings
)

Chroma (Local Development)

python
from langchain_chroma import Chroma

vectorstore = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)

pgvector (PostgreSQL)

python
from langchain_postgres.vectorstores import PGVector

connection_string = "postgresql+psycopg://user:pass@localhost:5432/vectordb"

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection=connection_string,
)

Retrieval Optimization

1. Metadata Filtering

python
from datetime import datetime

from langchain_core.documents import Document

# Add metadata during indexing
docs_with_metadata = []
for doc in documents:
    doc.metadata.update({
        "source": doc.metadata.get("source", "unknown"),
        "category": determine_category(doc.page_content),  # determine_category: your own classification helper
        "date": datetime.now().isoformat()
    })
    docs_with_metadata.append(doc)

# Filter during retrieval
results = await vectorstore.asimilarity_search(
    "query",
    filter={"category": "technical"},
    k=5
)

2. Maximal Marginal Relevance (MMR)

python
# Balance relevance with diversity
results = await vectorstore.amax_marginal_relevance_search(
    "query",
    k=5,
    fetch_k=20,      # Fetch 20 candidates, return the top 5 diverse results
    lambda_mult=0.5  # 0 = max diversity, 1 = max relevance
)

3. Reranking with Cross-Encoder

python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

async def retrieve_and_rerank(query: str, k: int = 5) -> list[Document]:
    # Get initial results
    candidates = await vectorstore.asimilarity_search(query, k=20)

    # Rerank
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)

    # Sort by score and take top k
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:k]]

4. Cohere Rerank

python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)

# Wrap retriever with reranking
reranked_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)

Prompt Engineering for RAG

Contextual Prompt with Citations

python
rag_prompt = ChatPromptTemplate.from_template(
    """Answer the question based on the context below. Include citations using [1], [2], etc.

    If you cannot answer based on the context, say "I don't have enough information."

    Context:
    {context}

    Question: {question}

    Instructions:
    1. Use only information from the context
    2. Cite sources with [1], [2] format
    3. If uncertain, express uncertainty

    Answer (with citations):"""
)
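
The prompt above expects numbered sources in the context. One way to supply them (a sketch, not part of the original pattern) is a small helper that prefixes each retrieved document with its citation index before filling {context}:

python
from langchain_core.documents import Document

def format_context_with_citations(docs: list[Document]) -> str:
    """Format retrieved documents as numbered sources the model can cite as [1], [2], ..."""
    return "\n\n".join(
        f"[{i}] ({doc.metadata.get('source', 'unknown')})\n{doc.page_content}"
        for i, doc in enumerate(docs, start=1)
    )

# In the generate step: context_text = format_context_with_citations(state["context"])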

Structured Output for RAG

python
from pydantic import BaseModel, Field

class RAGResponse(BaseModel):
    answer: str = Field(description="The answer based on context")
    confidence: float = Field(description="Confidence score 0-1")
    sources: list[str] = Field(description="Source document IDs used")
    reasoning: str = Field(description="Brief reasoning for the answer")

# Use with structured output
structured_llm = llm.with_structured_output(RAGResponse)
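
As a sketch of how this fits the graph from the quick start, a hypothetical generate_structured node can call the structured model in place of the plain llm and surface sources and confidence alongside the answer:

python
async def generate_structured(state: RAGState) -> dict:
    """Generate a structured answer; assumes rag_prompt and structured_llm defined above."""
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(
        context=context_text, question=state["question"]
    )
    response: RAGResponse = await structured_llm.ainvoke(messages)
    # The full RAGResponse (sources, confidence, reasoning) is available here;
    # only the answer is written back into the graph state
    return {"answer": response.answer}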

Evaluation Metrics

python
from typing import TypedDict

class RAGEvalMetrics(TypedDict):
    retrieval_precision: float  # Relevant docs / retrieved docs
    retrieval_recall: float     # Retrieved relevant / total relevant
    answer_relevance: float     # Answer addresses question
    faithfulness: float         # Answer grounded in context
    context_relevance: float    # Context relevant to question

async def evaluate_rag_system(
    rag_chain,
    test_cases: list[dict]
) -> RAGEvalMetrics:
    """Evaluate RAG system on test cases."""
    metrics = {k: [] for k in RAGEvalMetrics.__annotations__}

    for test in test_cases:
        result = await rag_chain.ainvoke({"question": test["question"]})

        # Retrieval metrics
        retrieved_ids = {doc.metadata["id"] for doc in result["context"]}
        relevant_ids = set(test["relevant_doc_ids"])

        precision = len(retrieved_ids & relevant_ids) / len(retrieved_ids)
        recall = len(retrieved_ids & relevant_ids) / len(relevant_ids)

        metrics["retrieval_precision"].append(precision)
        metrics["retrieval_recall"].append(recall)

        # Use LLM-as-judge for quality metrics
        quality = await evaluate_answer_quality(
            question=test["question"],
            answer=result["answer"],
            context=result["context"],
            expected=test.get("expected_answer")
        )
        metrics["answer_relevance"].append(quality["relevance"])
        metrics["faithfulness"].append(quality["faithfulness"])
        metrics["context_relevance"].append(quality["context_relevance"])

    return {k: sum(v) / len(v) for k, v in metrics.items()}
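
The evaluate_answer_quality helper referenced above is left undefined; a minimal LLM-as-judge sketch (the class name, prompt wording, and score fields below are assumptions, not part of the original) could look like this:

python
from pydantic import BaseModel, Field

class AnswerQuality(BaseModel):
    relevance: float = Field(description="Does the answer address the question? 0-1")
    faithfulness: float = Field(description="Is the answer grounded in the context? 0-1")
    context_relevance: float = Field(description="Is the retrieved context relevant? 0-1")

# Reuse the Claude model from the quick start as the judge
judge_llm = llm.with_structured_output(AnswerQuality)

async def evaluate_answer_quality(
    question: str,
    answer: str,
    context: list[Document],
    expected: str | None = None,
) -> dict:
    """Score answer quality with an LLM-as-judge; returns scores in [0, 1]."""
    context_text = "\n\n".join(doc.page_content for doc in context)
    prompt = (
        "Rate the answer for relevance, faithfulness, and context relevance on a 0-1 scale.\n\n"
        f"Question: {question}\n\nContext:\n{context_text}\n\nAnswer: {answer}"
        + (f"\n\nReference answer: {expected}" if expected else "")
    )
    scores = await judge_llm.ainvoke(prompt)
    return scores.model_dump()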

Resources

Best Practices

  1. Chunk Size: Balance between context (larger) and specificity (smaller) - typically 500-1000 tokens
  2. Overlap: Use 10-20% overlap to preserve context at boundaries
  3. Metadata: Include source, page, timestamp for filtering and debugging
  4. Hybrid Search: Combine semantic and keyword search for best recall
  5. Reranking: Use cross-encoder reranking for precision-critical applications
  6. Citations: Always return source documents for transparency
  7. Evaluation: Continuously test retrieval quality and answer accuracy
  8. Monitoring: Track retrieval metrics and latency in production

Common Issues

  • Poor Retrieval: Check embedding quality, chunk size, query formulation
  • Irrelevant Results: Add metadata filtering, use hybrid search, rerank
  • Missing Information: Ensure documents are properly indexed, check chunking
  • Slow Queries: Optimize vector store, use caching (see the sketch below the list), reduce k
  • Hallucinations: Improve grounding prompt, add verification step
  • Context Too Long: Use compression or parent document retriever
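
For the "Slow Queries" point, two low-effort levers in LangChain are caching LLM calls and caching embeddings so repeated or unchanged inputs are not recomputed. A minimal sketch, assuming the embeddings object from the quick start; the cache directory and namespace are placeholders:

python
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache

# Cache identical LLM calls (e.g. repeated questions) in memory
set_llm_cache(InMemoryCache())

# Cache document embeddings on disk so re-indexing does not re-embed unchanged chunks
store = LocalFileStore("./embedding_cache")
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    embeddings, store, namespace="voyage-3-large"
)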