
Embedding Optimization


Optimize embedding generation for cost, performance, and quality in RAG and semantic search systems.

When to Use This Skill


Trigger this skill when:
  • Building RAG (Retrieval Augmented Generation) systems
  • Implementing semantic search or similarity detection
  • Optimizing embedding API costs (reducing by 70-90%)
  • Improving document retrieval quality through better chunking
  • Processing large document corpora (thousands to millions of documents)
  • Selecting between API-based vs. local embedding models

Model Selection Framework


Choose the optimal embedding model based on requirements:
Quick Recommendations:
  • Startup/MVP: all-MiniLM-L6-v2 (local, 384 dims, zero API costs)
  • Production: text-embedding-3-small (API, 1,536 dims, balanced quality/cost)
  • High Quality: text-embedding-3-large (API, 3,072 dims, premium)
  • Multilingual: multilingual-e5-base (local, 768 dims) or Cohere embed-multilingual-v3.0
For detailed decision frameworks including cost comparisons, quality benchmarks, and data privacy considerations, see references/model-selection-guide.md.
Model Comparison Summary:

| Model | Type | Dimensions | Cost per 1M tokens | Best For |
| --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | Local | 384 | $0 (compute only) | High volume, tight budgets |
| BGE-base-en-v1.5 | Local | 768 | $0 (compute only) | Quality + cost balance |
| text-embedding-3-small | API | 1,536 | $0.02 | General purpose production |
| text-embedding-3-large | API | 3,072 | $0.13 | Premium quality requirements |
| embed-multilingual-v3.0 | API | 1,024 | $0.10 | 100+ language support |
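The per-token prices in the table translate directly into a back-of-the-envelope cost estimator. A minimal sketch — the price dictionary is copied from the table above, and `embedding_cost_usd` is an illustrative helper name, not part of any SDK:

```python
# API prices per 1M tokens, taken from the comparison table above.
PRICE_PER_1M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "embed-multilingual-v3.0": 0.10,
}

def embedding_cost_usd(n_tokens: int, model: str) -> float:
    """Estimated API cost (USD) for embedding n_tokens with the given model."""
    return n_tokens / 1_000_000 * PRICE_PER_1M_TOKENS[model]

# Example: a 50M-token corpus costs ~$1.00 on -small vs. ~$6.50 on -large,
# which is why model choice dominates embedding spend at scale.
small_cost = embedding_cost_usd(50_000_000, "text-embedding-3-small")
large_cost = embedding_cost_usd(50_000_000, "text-embedding-3-large")
```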

Chunking Strategies


Select chunking strategy based on content type and use case:
Content Type → Strategy Mapping:
  • Documentation: Recursive (heading-aware), 800 chars, 100 overlap
  • Code: Recursive (function-level), 1,000 chars, 100 overlap
  • Q&A/FAQ: Fixed-size, 500 chars, 50 overlap (precise retrieval)
  • Legal/Technical: Semantic (large), 1,500 chars, 200 overlap (context preservation)
  • Blog Posts: Semantic (paragraph), 1,000 chars, 100 overlap
  • Academic Papers: Recursive (section-aware), 1,200 chars, 150 overlap
For detailed chunking patterns, decision trees, and implementation guidance, see references/chunking-strategies.md.
Quick Start with CLI:

```bash
python scripts/chunk_document.py \
  --input document.txt \
  --content-type markdown \
  --chunk-size 800 \
  --overlap 100 \
  --output chunks.jsonl
```
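The fixed-size strategy from the mapping above (the recursive and semantic variants add boundary detection on top of the same idea) can be sketched in a few lines. This is a simplified illustration, not the implementation in scripts/chunk_document.py:

```python
def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size chunking: each chunk starts (chunk_size - overlap) characters
    after the previous one, so the last `overlap` characters are repeated and
    context is not cut mid-thought at chunk boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the Q&A/FAQ settings (500 chars, 50 overlap), a 1,200-character document yields three chunks, and the tail of each chunk reappears at the head of the next.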

Caching Implementation


Achieve 80-90% cost reduction through content-addressable caching.
Caching Architecture by Query Volume:
  • <10K queries/month: In-memory cache (Python lru_cache)
  • 10K-100K queries/month: Redis (fast, TTL-based expiration)
  • 100K-1M queries/month: Redis (hot) + PostgreSQL (warm)
  • >1M queries/month: Multi-tier (Redis + PostgreSQL + S3)
Production Caching with Redis:

```bash
# Embed documents with caching enabled
python scripts/cached_embedder.py \
  --model text-embedding-3-small \
  --input documents.jsonl \
  --output embeddings.npy \
  --cache-backend redis \
  --cache-ttl 2592000  # 30 days
```

**Caching ROI Example:**
- 50,000 document chunks
- 20% duplicate content
- Without caching: $0.50 API cost
- With caching (60% hit rate): $0.20 API cost
- **Savings: 60% ($0.30)**
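Content-addressable caching boils down to "hash the (model, text) pair and use the digest as the cache key", so identical chunks never pay for a second API call. A minimal in-memory sketch — a plain dict stands in for Redis, and `CachedEmbedder` with its fields is an illustrative design, not the API of scripts/cached_embedder.py:

```python
import hashlib

class CachedEmbedder:
    """Wraps an embedding function with a content-addressable cache."""

    def __init__(self, embed_fn, model: str):
        self.embed_fn = embed_fn          # text -> vector (API or local model)
        self.model = model                # part of the key: vectors differ per model
        self.cache: dict[str, list[float]] = {}  # swap for Redis in production
        self.hits = 0
        self.misses = 0

    def _key(self, text: str) -> str:
        # Same model + same text => same key, so duplicate chunks hit the cache.
        return hashlib.sha256(f"{self.model}:{text}".encode()).hexdigest()

    def embed(self, text: str) -> list[float]:
        key = self._key(text)
        if key in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]
```

With 20% duplicate content, every repeated chunk is a cache hit — which is exactly where the savings in the ROI example above come from.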

Dimensionality Trade-offs


Balance storage, search speed, and quality:
| Dimensions | Storage (1M vectors) | Search Speed (p95) | Quality | Use Case |
| --- | --- | --- | --- | --- |
| 384 | 1.5 GB | 10ms | Good | Large-scale search |
| 768 | 3 GB | 15ms | High | General purpose RAG |
| 1,536 | 6 GB | 25ms | Very High | High-quality retrieval |
| 3,072 | 12 GB | 40ms | Highest | Premium applications |
Key Insight: For most RAG applications, 768 dimensions (BGE-base-en-v1.5 local or equivalent) provides the best quality/cost/speed balance.
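The storage column follows directly from vectors × dimensions × 4 bytes (float32). A quick check, assuming raw vectors with no index overhead (real vector databases add index structures on top):

```python
def storage_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw float32 storage for a vector collection, in GB (no index overhead)."""
    return n_vectors * dims * bytes_per_value / 1e9

# Reproduces the table for 1M vectors:
#   384  -> ~1.5 GB    768  -> ~3.1 GB
#   1536 -> ~6.1 GB    3072 -> ~12.3 GB
```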

Batch Processing Optimization


Maximize throughput for large-scale ingestion:
OpenAI API:
  • Batch up to 2,048 inputs per request
  • Implement rate limiting (tier-dependent: 500-5,000 RPM)
  • Use parallel requests with backoff on rate limits
Local Models (sentence-transformers):
  • GPU acceleration (CUDA, MPS for Apple Silicon)
  • Batch size tuning (32-128 based on GPU memory)
  • Multi-GPU support for maximum throughput
Expected Throughput:
  • OpenAI API: 1,000-5,000 texts/minute (rate limit dependent)
  • Local GPU (RTX 3090): 5,000-10,000 texts/minute
  • Local CPU: 100-500 texts/minute
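The batching-plus-backoff loop described above can be sketched as follows. `RateLimitError` here is a stand-in for whatever exception your client raises on HTTP 429 (for the OpenAI Python SDK that is `openai.RateLimitError`), and the 2,048 default matches the per-request input limit noted above:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the client library's rate-limit exception."""

def embed_in_batches(texts, embed_batch_fn, batch_size=2048, max_retries=5):
    """Embed texts in batches, retrying each batch with exponential backoff.

    embed_batch_fn takes a list of texts and returns a list of vectors.
    """
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        for attempt in range(max_retries):
            try:
                embeddings.extend(embed_batch_fn(batch))
                break
            except RateLimitError:
                # Exponential backoff with jitter, capped at 60s.
                time.sleep(min(2 ** attempt + random.random(), 60.0))
        else:
            raise RuntimeError(f"batch starting at {i} failed after {max_retries} retries")
    return embeddings
```

For higher throughput the batches can be dispatched concurrently (up to your tier's RPM limit) rather than sequentially as shown here.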

Performance Monitoring


Track key metrics for optimization:
Critical Metrics:
  • Latency: Embedding generation time (p50, p95, p99)
  • Throughput: Embeddings per second/minute
  • Cost: API usage tracking (USD per 1K/1M tokens)
  • Cache Efficiency: Hit rate percentage
For detailed monitoring setup, metric collection patterns, and dashboarding, see references/performance-monitoring.md.
Monitor with Wrapper:

```python
from scripts.performance_monitor import MonitoredEmbedder

monitored = MonitoredEmbedder(
    embedder=your_embedder,
    cost_per_1k_tokens=0.00002  # OpenAI pricing
)

embeddings = monitored.embed_batch(texts)
metrics = monitored.get_metrics()
print(f"Cache hit rate: {metrics['cache_hit_rate_pct']}%")
print(f"Total cost: ${metrics['total_cost_usd']}")
```

Working Examples


See the examples/ directory for complete implementations:
Python Examples:
  • examples/openai_cached.py - OpenAI embeddings with Redis caching
  • examples/local_embedder.py - sentence-transformers local embedding
  • examples/smart_chunker.py - Content-aware recursive chunking
  • examples/performance_monitor.py - Pipeline performance tracking
  • examples/batch_processor.py - Large-scale document processing
All examples include:
  • Complete, runnable code
  • Dependency installation instructions
  • Error handling and retry logic
  • Configuration options

Integration Points


Upstream (This skill provides to):
  • Vector Databases: Embeddings flow to Pinecone, Weaviate, Qdrant, pgvector
  • RAG Systems: Optimized embeddings for retrieval pipelines
  • Semantic Search: Query and document embeddings for similarity search
Downstream (This skill uses from):
  • Document Processing: Chunk documents before embedding
  • Data Ingestion: Process documents from various sources
Related Skills:
  • For RAG architecture, see the building-ai-chat skill
  • For vector database operations, see the databases-vector skill
  • For data ingestion pipelines, see the ingesting-data skill

Common Patterns


Pattern 1: RAG Pipeline
Document → Chunk → Embed → Store (vector DB) → Retrieve
Pattern 2: Semantic Search
Query → Embed → Search (vector DB) → Rank → Display
Pattern 3: Multi-Stage Retrieval (Cost Optimization)
Query → Cheap Embedding (384d) → Initial Search →
Expensive Embedding (1,536d) → Rerank Top-K → Return
Cost Savings: 70% reduction vs. single-stage with expensive embeddings
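Pattern 3 in miniature, with pure-Python cosine similarity and two hypothetical embedding functions (`cheap_embed`, `rich_embed`) standing in for the 384d and 1,536d models:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def two_stage_search(query, docs, cheap_embed, rich_embed, first_k=20, top_k=5):
    """Stage 1: rank all docs with the cheap model; stage 2: rerank only the
    top first_k candidates with the expensive model."""
    q = cheap_embed(query)
    candidates = sorted(docs, key=lambda d: cosine(q, cheap_embed(d)),
                        reverse=True)[:first_k]
    q = rich_embed(query)
    return sorted(candidates, key=lambda d: cosine(q, rich_embed(d)),
                  reverse=True)[:top_k]
```

The expensive embeddings are computed for only `first_k` candidates instead of the whole corpus, which is where the ~70% cost reduction comes from; in a real pipeline the cheap document embeddings are precomputed and served by the vector database, not recomputed per query as in this sketch.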

Quick Reference Checklist


Model Selection:
  • Identified data privacy requirements (local vs. API)
  • Calculated expected query volume
  • Determined quality requirements (good/high/highest)
  • Checked multilingual support needs
Chunking:
  • Analyzed content type (code, docs, legal, etc.)
  • Selected appropriate chunk size (500-1,500 chars)
  • Set overlap to prevent context loss (50-200 chars)
  • Validated chunks preserve semantic boundaries
Caching:
  • Implemented content-addressable hashing
  • Selected cache backend (Redis, PostgreSQL)
  • Set TTL based on content volatility
  • Monitored cache hit rate (target: >60%)
Performance:
  • Tracking latency (embedding generation time)
  • Measuring throughput (embeddings/sec)
  • Monitoring costs (USD spent on API calls)
  • Optimizing batch sizes for maximum efficiency