# Chunking Strategies for RAG
Optimize document splitting for retrieval accuracy and context preservation.
## When to Use
- Designing a new RAG pipeline
- Retrieval quality is poor due to chunk boundaries
- Documents have mixed content types (code, tables, prose)
- Need to balance context window limits with retrieval precision
## Chunking Methods

### 1. Fixed-Size Chunking
```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separator="\n"
)
chunks = splitter.split_text(document)
```

**Best for**: Homogeneous content, quick prototyping
**Avoid when**: Documents have natural boundaries (sections, paragraphs)
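To see what these parameters actually do, you can measure how much context the splitter carried across each boundary. A minimal sketch, assuming `chunks` from the snippet above:

```python
# Inspect chunk sizes and the span shared across each boundary.
# The shared span is at most chunk_overlap characters, trimmed to separators.
for a, b in zip(chunks, chunks[1:]):
    shared = next((n for n in range(min(len(a), len(b)), 0, -1) if a.endswith(b[:n])), 0)
    print(f"{len(a)} chars, {shared} shared with next chunk")
```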
### 2. Recursive Character Splitting
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " ", ""]
)
chunks = splitter.split_documents(docs)
```

**Best for**: General-purpose text, maintains paragraph integrity
**Hierarchy**: Tries larger separators first, falls back to smaller
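The fallback logic is easy to picture in miniature. The sketch below is illustrative only; it omits the merge-and-overlap step the real splitter performs:

```python
# Illustrative fallback: split on the largest separator, then recurse
# with smaller separators for any piece still over the size limit.
def recursive_split(text, separators, chunk_size=1000):
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, *rest = separators
    pieces = text.split(sep) if sep else list(text)
    out = []
    for piece in pieces:
        if len(piece) > chunk_size:
            out.extend(recursive_split(piece, rest, chunk_size))
        else:
            out.append(piece)
    return out

pieces = recursive_split(document, ["\n\n", "\n", ".", " ", ""])
```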
### 3. Semantic Chunking
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
chunks = splitter.split_text(document)
```

**Best for**: When meaning matters more than size
**Trade-off**: Slower, requires embedding calls
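The idea, roughly: embed each sentence, measure the distance between neighbours, and break wherever the distance spikes past the chosen percentile. A hand-rolled sketch of that idea (`embed` is any sentence-embedding callable you supply; this is not SemanticChunker's actual source):

```python
import numpy as np

def semantic_breakpoints(sentences, embed, percentile=95):
    # One embedding per sentence, normalised so dot product = cosine similarity
    vecs = np.array(embed(sentences), dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    # Cosine distance between each sentence and the next
    dists = 1 - (vecs[:-1] * vecs[1:]).sum(axis=1)
    # A new chunk starts wherever the distance exceeds the percentile threshold
    threshold = np.percentile(dists, percentile)
    return [i + 1 for i, d in enumerate(dists) if d > threshold]
```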
### 4. Document-Specific Chunking

#### Markdown
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown_doc)
```
#### Code
```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,
    chunk_overlap=200
)
chunks = splitter.split_documents(code_docs)
```
#### HTML
python
from langchain.text_splitter import HTMLHeaderTextSplitter
splitter = HTMLHeaderTextSplitter(
headers_to_split_on=[("h1", "h1"), ("h2", "h2"), ("h3", "h3")]
)
chunks = splitter.split_text(html_doc)python
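Header splitters bound chunks by structure, not size, so one long section can still blow the token budget. A common follow-up (parameter values illustrative) is to run the header chunks through a size-based splitter; the header metadata survives the second pass:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Second pass: cap each header-derived section at a fixed size.
size_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
final_chunks = size_splitter.split_documents(chunks)
```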
## Chunk Size Guidelines
| Content Type | Recommended Size | Overlap |
|---|---|---|
| Dense technical docs | 500-1000 tokens | 10-20% |
| Conversational/FAQ | 200-500 tokens | 5-10% |
| Legal/contracts | 1000-1500 tokens | 15-20% |
| Code | 1500-2000 tokens | 10-15% |
| Mixed content | 800-1200 tokens | 15% |
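The sizes above are in tokens, while the splitters shown earlier count characters. One way to budget in tokens directly is LangChain's `from_tiktoken_encoder` constructor; the encoding name and numbers here are illustrative:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Lengths are measured with a tiktoken tokenizer instead of len(text),
# so chunk_size and chunk_overlap below are token counts.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # match your embedding model's tokenizer
    chunk_size=1000,              # e.g. dense technical docs, per the table
    chunk_overlap=150             # ~15% overlap
)
chunks = splitter.split_documents(docs)
```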
## Advanced: Parent-Child Chunking
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

# Small chunks for retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```

**Why**: Small chunks = precise retrieval, large chunks = better context
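Usage, assuming `vectorstore` and `docs` exist as in the earlier snippets (the query string is illustrative): index once, then query; retrieval matches against the small child chunks but returns the larger parents they belong to.

```python
retriever.add_documents(docs)

# Matches are found in the 400-char children; the 2000-char parents come back.
results = retriever.get_relevant_documents("how is chunk overlap configured?")
```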
## Metadata Enrichment
Always attach metadata to chunks:
```python
import re
from datetime import datetime

for i, chunk in enumerate(chunks):
    chunk.metadata.update({
        "source": doc.metadata["source"],
        "chunk_index": i,
        "total_chunks": len(chunks),
        "doc_type": detect_doc_type(chunk.page_content),  # your own classifier helper
        "has_code": bool(re.search(r'```', chunk.page_content)),
        "timestamp": datetime.now().isoformat()
    })
```
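That metadata pays off at query time. Most vector stores accept a metadata filter; the exact syntax varies by store (the dict form below matches, e.g., Chroma):

```python
# Restrict retrieval to chunks that contain code.
results = vectorstore.similarity_search(
    "example query",   # illustrative
    k=4,
    filter={"has_code": True}
)
```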
## Evaluation Checklist
- Chunks don't break mid-sentence
- Code blocks stay intact
- Tables aren't split across chunks
- Headers stay with their content
- Overlap preserves context continuity
- Metadata enables filtering
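The first two checks can be roughly scripted. A minimal sketch with crude heuristics; tune the punctuation set for your corpus:

```python
def audit_chunks(chunks):
    issues = []
    for i, c in enumerate(chunks):
        text = c.page_content.strip()
        # Heuristic: prose chunks should end at sentence-final punctuation
        if not text.endswith((".", "!", "?", ":", "```", "|")):
            issues.append((i, "possible mid-sentence break"))
        # An odd number of fences means a code block was split
        if text.count("```") % 2 != 0:
            issues.append((i, "code block split across chunks"))
    return issues
```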
## Best Practices
- **Start with recursive splitting** - works for 80% of cases
- **Test retrieval quality** - not just chunk count (see the sketch after this list)
- **Use overlap** - 10-20% prevents context loss at boundaries
- **Match chunk size to model** - consider the embedding model's optimal input
- **Preserve structure** - use document-aware splitters when possible
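A hypothetical mini-benchmark for the retrieval-quality point: keep a handful of hand-labeled (query, expected source) pairs and score whether the expected source lands in the top-k. All names below are placeholders:

```python
def hit_rate(retriever, labeled_pairs, k=4):
    # labeled_pairs: [(query, expected_source_path), ...] curated by hand
    hits = 0
    for query, expected in labeled_pairs:
        docs = retriever.get_relevant_documents(query)[:k]
        hits += any(d.metadata.get("source") == expected for d in docs)
    return hits / len(labeled_pairs)
```

Re-run it whenever chunk size or overlap changes; a tweak that raises chunk counts but lowers hit rate is a regression.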