

# Chunking Strategies for RAG


Optimize document splitting for retrieval accuracy and context preservation.

## When to Use


- Designing a new RAG pipeline
- Retrieval quality is poor due to chunk boundaries
- Documents have mixed content types (code, tables, prose)
- Need to balance context window limits with retrieval precision

## Chunking Methods


### 1. Fixed-Size Chunking


```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separator="\n"
)
chunks = splitter.split_text(document)
```
**Best for**: Homogeneous content, quick prototyping
**Avoid when**: Documents have natural boundaries (sections, paragraphs)

### 2. Recursive Character Splitting


```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " ", ""]
)
chunks = splitter.split_documents(docs)
```
**Best for**: General-purpose text, maintains paragraph integrity
**Hierarchy**: Tries larger separators first, falls back to smaller

### 3. Semantic Chunking


```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
chunks = splitter.split_text(document)
```
**Best for**: When meaning matters more than size
**Trade-off**: Slower, requires embedding calls

### 4. Document-Specific Chunking


#### Markdown


```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown_doc)
```

#### Code


```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,
    chunk_overlap=200
)
chunks = splitter.split_documents(code_docs)
```

#### HTML


```python
from langchain.text_splitter import HTMLHeaderTextSplitter

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "h1"), ("h2", "h2"), ("h3", "h3")]
)
chunks = splitter.split_text(html_doc)
```

## Chunk Size Guidelines


| Content Type | Recommended Size | Overlap |
|---|---|---|
| Dense technical docs | 500-1000 tokens | 10-20% |
| Conversational/FAQ | 200-500 tokens | 5-10% |
| Legal/contracts | 1000-1500 tokens | 15-20% |
| Code | 1500-2000 tokens | 10-15% |
| Mixed content | 800-1200 tokens | 15% |
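
The sizes above are in tokens, while the splitters count characters by default. A minimal sketch of turning the table into token-based presets (the `PRESETS` dict and `splitter_for` helper are illustrative, not part of LangChain; `from_tiktoken_encoder` requires the `tiktoken` package):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative presets distilled from the table above
# (chunk size in tokens, overlap as a fraction of chunk size)
PRESETS = {
    "technical": (800, 0.15),
    "faq": (350, 0.08),
    "legal": (1200, 0.18),
    "code": (1800, 0.12),
    "mixed": (1000, 0.15),
}

def splitter_for(doc_type: str) -> RecursiveCharacterTextSplitter:
    size, overlap_ratio = PRESETS[doc_type]
    # from_tiktoken_encoder measures chunk_size in tokens, not characters
    return RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=size,
        chunk_overlap=int(size * overlap_ratio),
    )
```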

## Advanced: Parent-Child Chunking


```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small chunks for retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,  # an initialized LangChain vector store
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```

**Why**: Small chunks = precise retrieval, large chunks = better context
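
One detail the snippet leaves implicit: the retriever splits documents itself, so you add raw documents rather than pre-split chunks. A usage sketch, assuming `docs` is a list of loaded `Document` objects:

```python
# Splits docs internally: child chunks go to the vector store,
# full parent chunks to the docstore
retriever.add_documents(docs)

# Matching happens on small child chunks, but the larger
# parent chunks are returned for context
results = retriever.get_relevant_documents("how should chunk overlap be set?")
```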

## Metadata Enrichment


Always attach metadata to chunks:
```python
import re
from datetime import datetime

for i, chunk in enumerate(chunks):
    chunk.metadata.update({
        "source": doc.metadata["source"],
        "chunk_index": i,
        "total_chunks": len(chunks),
        "doc_type": detect_doc_type(chunk.page_content),
        "has_code": bool(re.search(r'```', chunk.page_content)),
        "timestamp": datetime.now().isoformat()
    })
```
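
`detect_doc_type` is not a library function; it stands in for whatever classification you need. A hypothetical heuristic might look like:

```python
import re

def detect_doc_type(text: str) -> str:
    """Placeholder classifier -- swap in your own rules or model."""
    if re.search(r"```|\bdef |\bclass ", text):
        return "code"
    if text.count("|") >= 8:  # crude markdown-table signal
        return "table"
    return "prose"
```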

## Evaluation Checklist


- Chunks don't break mid-sentence
- Code blocks stay intact
- Tables aren't split across chunks
- Headers stay with their content
- Overlap preserves context continuity
- Metadata enables filtering
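
Several of these checks can be automated. A heuristic audit sketch (the punctuation rule and fence count are illustrative thresholds, not a standard):

```python
def audit_chunks(chunks):
    """Return (index, issue) pairs for chunks that look suspect."""
    issues = []
    for i, chunk in enumerate(chunks):
        text = chunk.page_content.strip()
        # Mid-sentence break: chunk ends without terminal punctuation
        if text and text[-1] not in ".!?:;\"')]}`":
            issues.append((i, "may end mid-sentence"))
        # Broken code block: odd number of fence markers
        if text.count("```") % 2 == 1:
            issues.append((i, "unbalanced code fence"))
    return issues
```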

## Best Practices


1. **Start with recursive splitting** - works for 80% of cases
2. **Test retrieval quality** - not just chunk count (see the sketch after this list)
3. **Use overlap** - 10-20% prevents context loss at boundaries
4. **Match chunk size to model** - consider the embedding model's optimal input length
5. **Preserve structure** - use document-aware splitters when possible
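
For practice 2, retrieval quality can be spot-checked against a few hand-labeled query-to-source pairs; hit rate is just one simple metric, and the `labeled` data below is illustrative:

```python
# Hand-labeled (query, expected source) pairs -- replace with your own
labeled = [
    ("how do I set chunk overlap?", "docs/chunking.md"),
    ("what is parent-child chunking?", "docs/retrievers.md"),
]

hits = 0
for query, expected_source in labeled:
    results = retriever.get_relevant_documents(query)
    if any(r.metadata.get("source") == expected_source for r in results):
        hits += 1

print(f"hit rate: {hits / len(labeled):.0%}")
```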