| Strategy | Best For | Chunk Size | Overlap | Pros | Cons |
|---|---|---|---|---|---|
| Fixed-size (token) | Uniform docs, consistent sizing | 512-2048 tokens | 10-20% | Predictable, simple | Breaks semantic units |
| Sentence-based | Narrative text, articles | 3-8 sentences | 1 sentence | Preserves language boundaries | Variable sizes |
| Paragraph-based | Structured docs, technical manuals | 1-3 paragraphs | 0-1 paragraph | Preserves topic coherence | Highly variable sizes |
| Semantic | Long-form, research papers | Dynamic | Topic-shift detection | Best coherence | Computationally expensive |
| Recursive | Mixed content types | Dynamic, multi-level | Per-level | Optimal utilization | Complex implementation |
| Document-aware | Multi-format collections | Format-specific | Section-level | Preserves metadata | Format-specific code required |
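The fixed-size row above can be sketched in a few lines. This is a minimal illustration only: it splits on whitespace "tokens" as a stand-in, whereas a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def fixed_size_chunks(text, max_tokens=512, overlap_ratio=0.15):
    """Split text into fixed-size chunks with proportional overlap.

    Whitespace tokens stand in for model tokens in this sketch.
    """
    tokens = text.split()
    # Each chunk starts (1 - overlap_ratio) * max_tokens after the previous one.
    step = max(1, int(max_tokens * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks
```

With the table's 10-20% overlap guidance, consecutive chunks share the trailing tokens of their predecessor, which reduces the chance that a fact is split across a chunk boundary.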
| Model | Dimensions | Speed | Quality | Cost | Best For |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ~14K tok/s | Good | Free (local) | Prototyping, low-latency |
| all-mpnet-base-v2 | 768 | ~2.8K tok/s | Better | Free (local) | Balanced production use |
| text-embedding-3-small | 1536 | API | High | $0.02/1M tokens | Cost-effective production |
| text-embedding-3-large | 3072 | API | Highest | $0.13/1M tokens | Maximum quality |
| Domain fine-tuned | Varies | Varies | Domain-best | Training cost | Specialized domains (legal, medical) |
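The API pricing above is easy to sanity-check against the example corpus later in this document (12,000 Confluence pages + 3,000 PDFs at ~2,400 tokens each):

```python
# One-time embedding cost for the example corpus in this document.
docs = 12_000 + 3_000                      # Confluence pages + PDFs
tokens = docs * 2_400                      # 36,000,000 tokens total
cost_small = tokens / 1_000_000 * 0.02     # text-embedding-3-small
cost_large = tokens / 1_000_000 * 0.13     # text-embedding-3-large
print(f"small: ${cost_small:.2f}, large: ${cost_large:.2f}")
# small: $0.72, large: $4.68
```

At this corpus size the embedding bill is negligible either way; the quality/latency trade-off, not cost, should drive the model choice.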
| Database | Type | Scaling | Key Feature | Best For |
|---|---|---|---|---|
| Pinecone | Managed | Auto-scaling | Metadata filtering, hybrid search | Production, managed preference |
| Weaviate | Open source | Horizontal | GraphQL API, multi-modal | Complex data types |
| Qdrant | Open source | Distributed | High perf, low memory (Rust) | Performance-critical |
| Chroma | Embedded | Limited | Simple API, SQLite-backed | Prototyping, small-scale |
| pgvector | PostgreSQL ext | PostgreSQL scaling | ACID, SQL joins | Existing PostgreSQL infra |
| Strategy | When to Use | Implementation |
|---|---|---|
| Dense (vector similarity) | Default for semantic search | Cosine similarity with k-NN/ANN |
| Sparse (BM25/TF-IDF) | Exact keyword matching needed | Elasticsearch or inverted index |
| Hybrid (dense + sparse) | Best of both needed | Reciprocal Rank Fusion (RRF) with tuned weights |
| + Reranking | Precision must exceed 0.85 | Cross-encoder reranker after initial retrieval |
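The hybrid row's Reciprocal Rank Fusion can be sketched directly; this is a minimal weighted-RRF implementation over two ranked ID lists, using the conventional smoothing constant k=60 (an assumption, not a value mandated by the source):

```python
def rrf_fuse(dense_ids, sparse_ids, dense_weight=0.7, sparse_weight=0.3, k=60):
    """Weighted Reciprocal Rank Fusion over two ranked doc-ID lists.

    score(d) = w_dense/(k + rank_dense(d)) + w_sparse/(k + rank_sparse(d));
    a document absent from one list simply contributes nothing for it.
    """
    scores = {}
    for weight, ranking in ((dense_weight, dense_ids), (sparse_weight, sparse_ids)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Sorting the dict keys also deduplicates documents found by both retrievers.
    return sorted(scores, key=scores.get, reverse=True)
```

Because fusion operates on ranks rather than raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.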
| Technique | When to Use | How It Works |
|---|---|---|
| HyDE | Query/document style mismatch | LLM generates hypothetical answer; embed that instead of query |
| Multi-query | Ambiguous queries | Generate 3-5 query variations; retrieve for each; deduplicate |
| Step-back | Specific questions needing general context | Transform to broader query; retrieve general + specific |
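The multi-query row reduces to expand, retrieve, deduplicate. In this sketch, `generate_variations` (e.g. an LLM call) and `retrieve` are caller-supplied stand-ins, not functions defined by this toolkit:

```python
def multi_query_retrieve(query, generate_variations, retrieve, n_variations=4):
    """Multi-query retrieval: expand the query, retrieve per variation, dedupe.

    generate_variations(query, n) -> list[str] and retrieve(query) -> list of
    doc IDs are injected dependencies standing in for an LLM and a retriever.
    """
    queries = [query] + generate_variations(query, n_variations)
    seen, merged = set(), []
    for q in queries:
        for doc_id in retrieve(q):
            if doc_id not in seen:   # keep only the first occurrence of each doc
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

Deduplicating before reranking matters: without it, the same chunk retrieved by several variations crowds out genuinely distinct context.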
| Metric | Target | What It Measures |
|---|---|---|
| Faithfulness | > 0.90 | Answers grounded in retrieved context |
| Context Relevance | > 0.80 | Retrieved chunks relevant to query |
| Answer Relevance | > 0.85 | Answer addresses the original question |
| Precision@K | > 0.70 | % of top-K results that are relevant |
| Recall@K | > 0.80 | % of relevant docs found in top-K |
| MRR | > 0.75 | Reciprocal rank of first relevant result |
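The retrieval-side metrics in this table have direct definitions; a minimal sketch over lists of retrieved IDs and a set of relevant IDs:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result (0.0 if none found)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1 / rank
    return 0.0
```

Faithfulness, context relevance, and answer relevance are LLM-judged (e.g. via RAGAS) rather than computable from ID lists, which is why they are not sketched here.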
```yaml
corpus:
  documents: 12,000 Confluence pages + 3,000 PDFs
  avg_length: 2,400 tokens
  languages: [English]
  domain: internal engineering docs

pipeline:
  chunking:
    strategy: recursive
    max_tokens: 512
    overlap: 50 tokens
    boundary: paragraph
  embedding:
    model: text-embedding-3-small
    dimensions: 1536
    batch_size: 100
  vector_db:
    engine: pgvector
    index: HNSW (ef_construction=128, m=16)
    reason: "Existing PostgreSQL infra; ACID compliance for audit"
  retrieval:
    strategy: hybrid
    dense_weight: 0.7
    sparse_weight: 0.3
    top_k: 10
    reranker: cross-encoder/ms-marco-MiniLM-L-12-v2
    final_k: 5

evaluation_results:
  faithfulness: 0.93
  context_relevance: 0.84
  answer_relevance: 0.88
  precision_at_5: 0.76
  recall_at_10: 0.85
```

| Problem | Solution |
|---|---|
| Chunks break mid-sentence | Use boundary-aware chunking with sentence/paragraph overlap |
| Low retrieval precision | Add cross-encoder reranker; tune similarity threshold |
| High latency (> 2s) | Cache embeddings; use faster model; reduce top-K |
| Inconsistent quality | Implement RAGAS evaluation in CI; add quality scoring |
| Scalability bottleneck | Shard vector DB; implement auto-scaling; add caching layer |
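The "cache embeddings" fix for high latency can be as simple as a content-addressed cache. A minimal sketch, where `embed_fn` is a stand-in for any embedding call (local model or API):

```python
import hashlib

class EmbeddingCache:
    """Content-addressed cache: re-embedding identical text costs nothing.

    embed_fn is an injected stand-in for any embedding call.
    """
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}    # sha256(text) -> vector
        self.misses = 0    # number of actual embedding calls made

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]
```

Hashing the text rather than keying on the raw string keeps memory bounded for long documents; a production version would persist the store and evict by recency.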
| Problem | Cause | Solution |
|---|---|---|
| Chunks contain incomplete sentences or broken code blocks | Fixed-size chunking ignoring semantic boundaries | Switch to sentence-based or semantic (heading-aware) chunking; enable boundary detection in the chunking configuration |
| Retrieved context is relevant but answer is wrong | LLM hallucinating beyond retrieved chunks | Enable faithfulness evaluation via RAGAS; add source attribution guardrails; lower confidence threshold to surface "I don't know" responses |
| Precision@K below 0.50 despite relevant documents existing | Embedding model does not capture domain vocabulary | Fine-tune embedding model on domain data or switch to a domain-specific model; add cross-encoder reranking stage |
| Query latency exceeds 2 seconds | Large top-K, no caching, or unoptimized HNSW index | Reduce top-K, enable query-level and semantic caching, tune HNSW parameters (ef_search, m) |
| Recall drops after adding new documents | Stale embeddings or index fragmentation after incremental inserts | Trigger full re-index; verify new documents pass chunking pipeline; check embedding model version consistency |
| Hybrid retrieval returns duplicate chunks | Dense and sparse retrievers returning overlapping results without deduplication | Apply Reciprocal Rank Fusion (RRF) with deduplication before reranking; tune dense/sparse weight ratio |
| Evaluation metrics fluctuate across runs | Non-deterministic embedding batching or insufficient test query set | Fix random seeds, increase evaluation sample size, run evaluations on a frozen ground-truth set |
| Skill | Integration | Data Flow |
|---|---|---|
| engineering/prompt-engineer-toolkit | Optimize system prompts and few-shot examples fed alongside retrieved chunks | Pipeline design output --> prompt templates that reference chunk format and metadata |
| engineering/database-designer | Design relational metadata stores (tags, access control, source tracking) paired with the vector database | Vector DB recommendation --> metadata schema for hybrid storage |
| engineering/observability-designer | Set up latency, throughput, and accuracy monitoring for the deployed RAG pipeline | Evaluation metrics and SLO targets --> dashboards and alerting rules |
| engineering/agent-workflow-designer | Embed the RAG retrieval step inside multi-agent reasoning workflows | Retrieval config --> agent tool definition with top-K and threshold parameters |
| | Automate embedding re-indexing, evaluation regression tests, and deployment on document changes | Evaluation thresholds --> CI gate that blocks deploys when metrics regress |
| | Review the query and ingestion API surface exposed by the RAG service | Pipeline config --> OpenAPI spec review for search and ingest endpoints |
```
python chunking_optimizer.py <directory> [options]
```

| Flag | Type | Default | Description |
|---|---|---|---|
| `<directory>` | positional, required | -- | Directory containing text/markdown documents to analyze |
| `--output` | string | None | Output file path for results in JSON format |
| | string | None | JSON configuration file to customize strategy parameters (fixed_sizes, overlaps, sentence_max_sizes, paragraph_max_sizes, semantic_max_sizes) |
| `--extensions` | string list | | File extensions to include when scanning the corpus |
| `--verbose` | flag | off | Print all strategy scores in addition to the recommendation |

```
python chunking_optimizer.py ./docs --output results.json --extensions .txt .md --verbose
```

The JSON written via `--output` contains `corpus_info`, `strategy_results`, `recommendation`, and `sample_chunks` sections.
python chunking_optimizer.py ./docs --output results.json --extensions .txt .md --verbose--verbose--outputcorpus_infostrategy_resultsrecommendationsample_chunkspython retrieval_evaluator.py <queries> <corpus> <ground_truth> [options]| Flag | Type | Default | Description |
|---|---|---|---|
| positional, required | -- | JSON file containing queries (list of |
| positional, required | -- | Directory containing the document corpus |
| positional, required | -- | JSON file mapping query IDs to lists of relevant document IDs |
| string | None | Output file path for results in JSON format |
| int list | | K values used when computing Precision@K, Recall@K, and NDCG@K |
| string list | | File extensions to include from the corpus directory |
| flag | off | Print detailed per-metric values and failure analysis counts |
python retrieval_evaluator.py queries.json ./corpus ground_truth.json --output eval.json --k-values 1 5 10 --verbose--verbose--outputaggregate_metricsquery_resultsfailure_analysisevaluation_summaryrecommendationspython retrieval_evaluator.py <queries> <corpus> <ground_truth> [options]| 标志 | 类型 | 默认值 | 描述 |
|---|---|---|---|
| 位置参数,必填 | -- | 包含查询的JSON文件( |
| 位置参数,必填 | -- | 包含文档语料库的目录 |
| 位置参数,必填 | -- | 将查询ID映射到相关文档ID列表的JSON文件 |
| 字符串 | None | JSON格式结果的输出文件路径 |
| 整数列表 | | 计算Precision@K、Recall@K和NDCG@K时使用的K值 |
| 字符串列表 | | 从语料库目录中包含的文件扩展名 |
| 标志 | 关闭 | 打印详细的每指标值和失败分析统计 |
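The ground-truth file's shape follows directly from its flag description (query IDs mapped to lists of relevant document IDs); the IDs below are illustrative, not taken from the source:

```json
{
  "q1": ["doc_017", "doc_042"],
  "q2": ["doc_003"]
}
```

The exact field layout of the queries file is not fully specified in the flag table above, so it is not sketched here.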
python retrieval_evaluator.py queries.json ./corpus ground_truth.json --output eval.json --k-values 1 5 10 --verbose--verbose--outputaggregate_metricsquery_resultsfailure_analysisevaluation_summaryrecommendationspython rag_pipeline_designer.py <requirements> [options]| Flag | Type | Default | Description |
|---|---|---|---|
| positional, required | -- | JSON file containing system requirements (document_types, document_count, avg_document_size, queries_per_day, query_patterns, latency_requirement, budget_monthly, accuracy_priority, cost_priority, maintenance_complexity) |
| string | None | Output file path for the pipeline design in JSON format |
| flag | off | Print full configuration templates for each component |
python rag_pipeline_designer.py requirements.json --output pipeline_design.json --verbose--verbose--outputComponentRecommendationtotal_costarchitecture_diagramconfig_templatespython rag_pipeline_designer.py <requirements> [options]| 标志 | 类型 | 默认值 | 描述 |
|---|---|---|---|
| 位置参数,必填 | -- | 包含系统需求的JSON文件(document_types、document_count、avg_document_size、queries_per_day、query_patterns、latency_requirement、budget_monthly、accuracy_priority、cost_priority、maintenance_complexity) |
| 字符串 | None | JSON格式管道设计的输出文件路径 |
| 标志 | 关闭 | 打印各组件的完整配置模板 |
python rag_pipeline_designer.py requirements.json --output pipeline_design.json --verbose--verbose--outputComponentRecommendationtotal_costarchitecture_diagramconfig_templates
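A possible `requirements.json`, using the field names listed in the flag table; all values here are illustrative assumptions, and the exact value types (strings vs. numbers) are not specified by the source:

```json
{
  "document_types": ["markdown", "pdf"],
  "document_count": 15000,
  "avg_document_size": 2400,
  "queries_per_day": 5000,
  "query_patterns": ["factual", "exploratory"],
  "latency_requirement": "2s",
  "budget_monthly": 500,
  "accuracy_priority": "high",
  "cost_priority": "medium",
  "maintenance_complexity": "low"
}
```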