Vector database selection, embedding storage, approximate nearest neighbor (ANN) algorithms, and vector search optimization. Use when choosing vector stores, designing semantic search, or optimizing similarity search performance.
npx skill4agent add melodic-software/claude-code-plugins vector-databases

Managed Services:

| Database | Strengths | Limitations | Best For |
|---|---|---|---|
| Pinecone | Fully managed, easy scaling, enterprise | Vendor lock-in, cost at scale | Enterprise production |
| Weaviate Cloud | GraphQL, hybrid search, modules | Complexity | Knowledge graphs |
| Zilliz Cloud | Milvus-based, high performance | Learning curve | High-scale production |
| MongoDB Atlas Vector | Existing MongoDB users | Newer feature | MongoDB shops |
| Elastic Vector | Existing Elastic stack | Resource heavy | Search platforms |
Self-Hosted / Open Source:

| Database | Strengths | Limitations | Best For |
|---|---|---|---|
| Milvus | Feature-rich, scalable, GPU support | Operational complexity | Large-scale production |
| Qdrant | Rust performance, filtering, easy | Smaller ecosystem | Performance-focused |
| Weaviate | Modules, semantic, hybrid | Memory usage | Knowledge applications |
| Chroma | Simple, Python-native | Limited scale | Development, prototyping |
| pgvector | PostgreSQL extension | Performance limits | Postgres shops |
| FAISS | Library, not DB, fastest | No persistence, no filtering | Research, embedded |
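For quick prototyping (the Chroma row above), a minimal sketch using the chromadb Python client; the collection name, documents, and metadata are illustrative placeholders:

```python
# Minimal Chroma prototype: in-memory client, add a few documents, query.
import chromadb

client = chromadb.Client()  # ephemeral, in-memory instance
collection = client.create_collection(name="docs")

# Chroma embeds documents with its default embedding function
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "Wireless headphones with noise cancellation",
        "Mechanical keyboard with RGB lighting",
    ],
    metadatas=[{"category": "audio"}, {"category": "input"}],
)

results = collection.query(query_texts=["bluetooth headset"], n_results=1)
print(results["documents"])
```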
Selection Decision Tree:

Need managed, don't want operations?
├── Yes → Pinecone (simplest) or Weaviate Cloud
└── No (self-hosted)
    └── Already using PostgreSQL?
        ├── Yes, <1M vectors → pgvector
        └── No
            └── Need maximum performance at scale?
                ├── Yes → Milvus or Qdrant
                └── No
                    └── Prototyping/development?
                        ├── Yes → Chroma
                        └── No → Qdrant (balanced choice)

Exact KNN:
• Search ALL vectors
• O(n) time complexity
• Perfect accuracy
• Impractical at scale
Approximate NN (ANN):
• Search SUBSET of vectors
• O(log n) to O(1) complexity
• Near-perfect accuracy
• Practical at any scale
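For contrast, exact KNN in its entirety is a few lines of NumPy; this toy sketch makes the O(n) cost visible:

```python
# Exact KNN: score the query against every stored vector — O(n) per query.
import numpy as np

db = np.random.rand(50_000, 128).astype("float32")  # toy vector store
q = np.random.rand(128).astype("float32")

scores = db @ q                   # one dot product per stored vector
top10 = np.argsort(-scores)[:10]  # indices of the 10 best matches
```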
HNSW Index Structure:

Layer 3: ○───────────────────────○  (sparse, long connections)
         │                       │
Layer 2: ○───○───────○───────○───○  (medium density)
         │   │       │       │   │
Layer 1: ○─○─○─○─○─○─○─○─○─○─○─○─○  (denser)
         │ │ │ │ │ │ │ │ │ │ │ │ │
Layer 0: ○○○○○○○○○○○○○○○○○○○○○○○○○  (all vectors)
Search: Start at top layer, greedily descend
• Fast: O(log n) search time
• High recall: >95% typically
• Memory: Extra graph storage

HNSW Parameters:

| Parameter | Description | Trade-off |
|---|---|---|
| M | Connections per node | Memory vs. recall |
| ef_construction | Build-time search width | Build time vs. recall |
| ef_search | Query-time search width | Latency vs. recall |
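A sketch of how these parameters appear in practice, using the hnswlib library; data and sizes are illustrative:

```python
# HNSW with hnswlib: the M / ef parameters map directly to the table above.
import hnswlib
import numpy as np

dim = 128
data = np.random.rand(10_000, dim).astype("float32")  # toy vectors

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(
    max_elements=10_000,
    M=16,                 # connections per node: memory vs. recall
    ef_construction=200,  # build-time search width: build time vs. recall
)
index.add_items(data, np.arange(10_000))

index.set_ef(50)  # query-time search width (ef_search): latency vs. recall
labels, distances = index.knn_query(data[:1], k=5)
```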
IVF (Inverted File Index):

Clustering Phase:
┌─────────────────────────────────────────┐
│ Cluster vectors into K centroids │
│ │
│ ● ● ● ● │
│ /│\ /│\ /│\ /│\ │
│ ○○○○○ ○○○○○ ○○○○○ ○○○○○ │
│ Cluster 1 Cluster 2 Cluster 3 Cluster 4│
└─────────────────────────────────────────┘
Search Phase:
1. Find nprobe nearest centroids
2. Search only those clusters
3. Much faster than exhaustive search

IVF Parameters:

| Parameter | Description | Trade-off |
|---|---|---|
| nlist | Number of clusters | Build time vs. search quality |
| nprobe | Clusters to search | Latency vs. recall |
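A minimal FAISS sketch of the IVF flow above; the sizes are chosen for illustration:

```python
# IVF with FAISS: k-means-train a coarse quantizer, then probe a few clusters.
import faiss
import numpy as np

d = 128
xb = np.random.rand(100_000, d).astype("float32")  # toy database vectors
xq = np.random.rand(5, d).astype("float32")        # toy queries

nlist = 1024  # number of clusters (centroids)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(xb)  # clustering phase
index.add(xb)

index.nprobe = 16  # clusters to search: latency vs. recall
D, I = index.search(xq, 10)  # distances and ids of top-10 per query
```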
Product Quantization (PQ):

Original Vector (128 dim):
[0.1, 0.2, ..., 0.9] (128 × 4 bytes = 512 bytes)
PQ Compressed (8 subvectors, 8-bit codes):
[23, 45, 12, 89, 56, 34, 78, 90] (8 bytes)
Memory reduction: 64x
Accuracy trade-off: ~5% recall drop

Algorithm Comparison:

| Algorithm | Search Speed | Memory | Build Time | Recall |
|---|---|---|---|---|
| Flat/Brute | Slow (O(n)) | Low | None | 100% |
| IVF | Fast | Low | Medium | 90-95% |
| IVF-PQ | Very fast | Very low | Medium | 85-92% |
| HNSW | Very fast | High | Slow | 95-99% |
| HNSW+PQ | Very fast | Medium | Slow | 90-95% |
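Combining IVF with PQ (the IVF-PQ row above) in FAISS; the parameters mirror the 8-subvector, 8-bit example:

```python
# IVF-PQ with FAISS: 8 subvectors × 8-bit codes = 8 bytes per stored vector.
import faiss
import numpy as np

d = 128
xb = np.random.rand(100_000, d).astype("float32")

nlist, m, nbits = 1024, 8, 8
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)  # trains both the coarse quantizer and the PQ codebooks
index.add(xb)

index.nprobe = 16
D, I = index.search(xb[:5], 10)
```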
Index Selection by Scale:

< 100K vectors:
└── Flat index (exact search is fast enough)

100K - 1M vectors:
└── HNSW (best recall/speed trade-off)

1M - 100M vectors:
├── Memory available → HNSW
└── Memory constrained → IVF-PQ or HNSW+PQ

> 100M vectors:
└── Sharded IVF-PQ or distributed HNSW

Distance Metrics:

| Metric | Formula | Range | Best For |
|---|---|---|---|
| Cosine Similarity | (a · b) / (‖a‖ ‖b‖) | [-1, 1] | Normalized embeddings |
| Dot Product | a · b | (-∞, ∞) | When magnitude matters |
| Euclidean (L2) | ‖a − b‖₂ | [0, ∞) | Absolute distances |
| Manhattan (L1) | Σᵢ \|aᵢ − bᵢ\| | [0, ∞) | High-dimensional sparse |
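The four metrics from the table, computed with NumPy on toy vectors; the final assertion shows why normalized embeddings make cosine and dot product interchangeable:

```python
import numpy as np

a = np.array([0.1, 0.3, 0.5])
b = np.array([0.2, 0.1, 0.4])

dot = np.dot(a, b)
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
manhattan = np.sum(np.abs(a - b))

# For unit-length vectors, cosine similarity equals the dot product:
a_u, b_u = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(np.dot(a_u, b_u), cosine)
```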
Embeddings pre-normalized (unit vectors)?
├── Yes → Cosine = Dot Product (use Dot, faster)
└── No
    └── Magnitude meaningful?
        ├── Yes → Dot Product
        └── No → Cosine Similarity

Note: Most embedding models output normalized vectors
→ Dot product is usually the best choice

Filtering Strategies:

Pre-filtering (Filter → Search):
┌─────────────────────────────────────────┐
│ 1. Apply metadata filter │
│ (category = "electronics") │
│ Result: 10K of 1M vectors │
│ │
│ 2. Vector search on 10K vectors │
│ Much faster, guaranteed filter match │
└─────────────────────────────────────────┘
Post-filtering (Search → Filter):
┌─────────────────────────────────────────┐
│ 1. Vector search on 1M vectors │
│ Return top-1000 │
│ │
│ 2. Apply metadata filter │
│ May return < K results! │
└─────────────────────────────────────────┘

Filtered Search Example:

Query: "wireless headphones under $100"
               │
         ┌─────┴─────┐
         ▼           ▼
    ┌─────────┐ ┌─────────┐
    │ Vector  │ │ Filter  │
    │ Search  │ │ Build   │
    │"wireless│ │ price   │
    │ head-   │ │ < 100   │
    │ phones" │ │         │
    └─────────┘ └─────────┘
         │           │
         └─────┬─────┘
               ▼
         ┌───────────┐
         │  Combine  │
         │  Results  │
         └───────────┘

Metadata Indexing:

| Metadata Type | Index Strategy | Query Example |
|---|---|---|
| Categorical | Bitmap/hash index | category = "books" |
| Numeric range | B-tree | price BETWEEN 10 AND 50 |
| Keyword search | Inverted index | tags CONTAINS "sale" |
| Geospatial | R-tree/geohash | location NEAR (lat, lng) |
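A pre-filtered search sketch using the qdrant-client library; the collection name and payload fields ("category", "price") are illustrative, and the filter is applied during the ANN search so all returned hits satisfy it:

```python
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue, Range,
)

client = QdrantClient(":memory:")  # embedded mode for demonstration
client.create_collection(
    collection_name="products",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)
client.upsert(
    collection_name="products",
    points=[PointStruct(
        id=1,
        vector=np.random.rand(4).tolist(),
        payload={"category": "electronics", "price": 79.0},
    )],
)

# Metadata filter evaluated inside the vector search (pre-filtering)
hits = client.search(
    collection_name="products",
    query_vector=np.random.rand(4).tolist(),
    query_filter=Filter(must=[
        FieldCondition(key="category", match=MatchValue(value="electronics")),
        FieldCondition(key="price", range=Range(lt=100)),
    ]),
    limit=10,
)
```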
Naive Sharding (by ID):
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Shard 1 │ │ Shard 2 │ │ Shard 3 │
│ IDs 0-N │ │IDs N-2N │ │IDs 2N-3N│
└─────────┘ └─────────┘ └─────────┘
Query → Search ALL shards → Merge results
Semantic Sharding (by cluster):
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Shard 1 │ │ Shard 2 │ │ Shard 3 │
│ Tech │ │ Health │ │ Finance │
│ docs │ │ docs │ │ docs │
└─────────┘ └─────────┘ └─────────┘
Query → Route to relevant shard(s) → Faster!

Replication Architecture:

┌─────────────────────────────────────────┐
│              Load Balancer              │
└─────────────────────────────────────────┘
      │             │             │
      ▼             ▼             ▼
┌─────────┐   ┌─────────┐   ┌─────────┐
│Replica 1│   │Replica 2│   │Replica 3│
│ (Read)  │   │ (Read)  │   │ (Read)  │
└─────────┘   └─────────┘   └─────────┘
      │             │             │
      └─────────────┼─────────────┘
                    │
              ┌─────────┐
              │ Primary │
              │ (Write) │
              └─────────┘

Scaling Guidelines:

| Scale (vectors) | Architecture | Replication |
|---|---|---|
| < 1M | Single node | Optional |
| 1-10M | Single node, more RAM | For HA |
| 10-100M | Sharded, few nodes | Required |
| 100M-1B | Sharded, many nodes | Required |
| > 1B | Sharded + tiered | Required |
Index Build Optimizations:

| Optimization | Description | Impact |
|---|---|---|
| Batch insertion | Insert in batches of 1K-10K | 10x faster |
| Parallel build | Multi-threaded index construction | 2-4x faster |
| Incremental index | Add to existing index | Avoids rebuild |
| GPU acceleration | Use GPU for training (IVF) | 10-100x faster |
Query Optimizations:

| Optimization | Description | Impact |
|---|---|---|
| Warm cache | Keep index in memory | 10x latency reduction |
| Query batching | Batch similar queries | Higher throughput |
| Reduce dimensions | PCA, random projection | 2-4x faster |
| Early termination | Stop when "good enough" | Lower latency |
Memory per vector:
┌────────────────────────────────────────┐
│ 1536 dims × 4 bytes = 6KB per vector │
│ │
│ 1M vectors: │
│ Raw: 6GB │
│ + HNSW graph: +2-4GB (M-dependent) │
│ = 8-10GB total │
│ │
│ With PQ (64 subquantizers): │
│ 1M vectors: ~64MB │
│ = 100x reduction │
└────────────────────────────────────────┘
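A quick estimator matching the arithmetic above; the 1.5x HNSW graph overhead is a rough assumption (actual overhead depends on M):

```python
def index_memory_gb(n_vectors: int, dims: int, bytes_per_dim: int = 4,
                    graph_overhead: float = 1.5) -> float:
    raw_bytes = n_vectors * dims * bytes_per_dim
    return raw_bytes * graph_overhead / 1e9

# 1M × 1536-dim float32 ≈ 6 GB raw, ~9 GB with graph overhead
print(f"{index_memory_gb(1_000_000, 1536):.1f} GB")
```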
Backup Strategies:

| Strategy | Description | RPO/RTO |
|---|---|---|
| Snapshots | Periodic full backup | Hours |
| WAL replication | Write-ahead log streaming | Minutes |
| Real-time sync | Synchronous replication | Seconds |
Monitoring Metrics:

| Metric | Description | Alert Threshold |
|---|---|---|
| Query latency p99 | 99th percentile latency | > 100ms |
| Recall | Search accuracy | < 90% |
| QPS | Queries per second | Capacity dependent |
| Memory usage | Index memory | > 80% |
| Index freshness | Time since last update | Domain dependent |
┌─────────────────────────────────────────┐
│ Index Maintenance Tasks │
├─────────────────────────────────────────┤
│ • Compaction: Merge small segments │
│ • Reindex: Rebuild degraded index │
│ • Vacuum: Remove deleted vectors │
│ • Optimize: Tune parameters │
│ │
│ Schedule during low-traffic periods │
└─────────────────────────────────────────┘

Multi-Tenancy Options:

Option 1: Namespace/Collection per tenant
┌─────────────────────────────────────────┐
│ tenant_1_collection │
│ tenant_2_collection │
│ tenant_3_collection │
└─────────────────────────────────────────┘
Pro: Complete isolation
Con: Many indexes, operational overhead
Option 2: Single collection + tenant filter
┌─────────────────────────────────────────┐
│ shared_collection │
│ metadata: { tenant_id: "..." } │
│ Pre-filter by tenant_id │
└─────────────────────────────────────────┘
Pro: Simpler operations
Con: Requires efficient filtering

Write Path:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Write │ │ Buffer │ │ Merge │
│ Request │───▶│ (Memory) │───▶│ to Index │
└─────────────┘ └─────────────┘ └─────────────┘
Strategy:
1. Buffer writes in memory
2. Periodically merge to main index
3. Search: main index + buffer
4. Compact periodically
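A toy sketch of this buffering strategy; brute-force NumPy stands in for the real main index, and all names are illustrative:

```python
import numpy as np

class BufferedIndex:
    def __init__(self, dim: int, merge_threshold: int = 1000):
        self.main = np.empty((0, dim), dtype="float32")  # "main index"
        self.buffer: list[np.ndarray] = []               # recent writes
        self.merge_threshold = merge_threshold

    def add(self, vec: np.ndarray) -> None:
        self.buffer.append(vec.astype("float32"))
        if len(self.buffer) >= self.merge_threshold:
            self.merge()

    def merge(self) -> None:
        # Periodic merge: fold buffered vectors into the main index
        self.main = np.vstack([self.main, np.stack(self.buffer)])
        self.buffer.clear()

    def search(self, q: np.ndarray, k: int = 10) -> np.ndarray:
        # Search main index + buffer so fresh writes are immediately visible
        rows = [self.main] + [v[None, :] for v in self.buffer]
        candidates = np.vstack(rows)
        return np.argsort(-(candidates @ q))[:k]
```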
Embedding Model Migration:

Version 1 embeddings ──┐
                       │
Version 2 embeddings ──┼──▶ Parallel indexes during migration
                       │
                       │    ┌─────────────────────┐
                       └───▶│ Gradual reindexing  │
                            │ Blue-green switch   │
                            └─────────────────────┘

Storage Cost Estimation:

Cost = (vectors × dimensions × bytes × replicas) in GB × $/GB/month
Example:
10M vectors × 1536 dims × 4 bytes × 3 replicas = 184 GB
At $0.10/GB/month = $18.40/month storage
Note: Memory (for serving) costs more than storage.
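The storage formula as a function; the defaults reproduce the example figures above:

```python
def storage_cost_per_month(vectors: int, dims: int, replicas: int = 3,
                           bytes_per_dim: int = 4,
                           price_per_gb: float = 0.10) -> float:
    gb = vectors * dims * bytes_per_dim * replicas / 1e9
    return gb * price_per_gb

# 10M × 1536 dims × 4 bytes × 3 replicas ≈ 184 GB → ≈ $18.4/month
print(f"${storage_cost_per_month(10_000_000, 1536):.2f}")
```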
Capacity Planning Factors:

• QPS (queries per second)
• Latency requirements
• Index type (HNSW needs more RAM)
• Filtering complexity
Rule of thumb:
• 1M vectors, HNSW, <50ms latency: 16GB RAM
• 10M vectors, HNSW, <50ms latency: 64-128GB RAM
• 100M vectors: Distributed system required

Related skills: rag-architecture, llm-serving-patterns, ml-system-design, estimation-techniques