chroma

Chroma - Open-Source Embedding Database

The AI-native database for building LLM applications with memory.

When to use Chroma

Use Chroma when:
  • Building RAG (retrieval-augmented generation) applications
  • You need a local or self-hosted vector database
  • You want an open-source solution (Apache 2.0)
  • Prototyping in notebooks
  • Running semantic search over documents
  • Storing embeddings together with metadata
Metrics:
  • 24,300+ GitHub stars
  • 1,900+ forks
  • v1.3.3 (stable, weekly releases)
  • Apache 2.0 license
Consider alternatives instead:
  • Pinecone: Managed cloud, auto-scaling
  • FAISS: Pure similarity search, no metadata
  • Weaviate: Production ML-native database
  • Qdrant: High performance, Rust-based

Quick start

Installation

Python

bash
pip install chromadb

JavaScript/TypeScript

bash
npm install chromadb @chroma-core/default-embed

Basic usage (Python)

python
import chromadb

# Create client
client = chromadb.Client()

# Create collection
collection = client.create_collection(name="my_collection")

# Add documents
collection.add(
    documents=["This is document 1", "This is document 2"],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],
    ids=["id1", "id2"],
)

# Query
results = collection.query(
    query_texts=["document about topic"],
    n_results=2,
)
print(results)

Core operations

1. Create collection

python
# Simple collection
collection = client.create_collection("my_docs")

# With a custom embedding function
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small",
)
collection = client.create_collection(
    name="my_docs",
    embedding_function=openai_ef,
)

# Get an existing collection
collection = client.get_collection("my_docs")

# Delete a collection
client.delete_collection("my_docs")

2. Add documents

python
# Add documents with metadata
collection.add(
    documents=["Doc 1", "Doc 2", "Doc 3"],
    metadatas=[
        {"source": "web", "category": "tutorial"},
        {"source": "pdf", "page": 5},
        {"source": "api", "timestamp": "2025-01-01"},
    ],
    ids=["id1", "id2", "id3"],
)

# Add with precomputed embeddings
collection.add(
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["Doc 1", "Doc 2"],
    ids=["id1", "id2"],
)

3. Query (similarity search)

python
# Basic query
results = collection.query(
    query_texts=["machine learning tutorial"],
    n_results=5,
)

# Query with a metadata filter
results = collection.query(
    query_texts=["Python programming"],
    n_results=3,
    where={"source": "web"},
)

# Query with combined metadata filters
results = collection.query(
    query_texts=["advanced topics"],
    where={
        "$and": [
            {"category": "tutorial"},
            {"difficulty": {"$gte": 3}},
        ]
    },
)

# Access results
print(results["documents"])  # List of matching documents
print(results["metadatas"])  # Metadata for each doc
print(results["distances"])  # Distance scores (lower = more similar)
print(results["ids"])        # Document IDs
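Note that `query` returns parallel lists nested one level deep: one inner list per entry in `query_texts`. A minimal sketch of walking such a result dict; the sample values below are fabricated for illustration, not real Chroma output:

```python
def iter_hits(results, query_index=0):
    """Pair up the parallel result lists for one query text."""
    return list(zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["metadatas"][query_index],
        results["distances"][query_index],
    ))

# Fabricated sample mirroring the nested-list shape of collection.query
sample = {
    "ids": [["id2", "id1"]],
    "documents": [["This is document 2", "This is document 1"]],
    "metadatas": [[{"source": "doc2"}, {"source": "doc1"}]],
    "distances": [[0.12, 0.34]],
}

for doc_id, doc, meta, dist in iter_hits(sample):
    print(f"{doc_id}: {doc!r} (source={meta['source']}, distance={dist})")
```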

4. Get documents

python
# Get by IDs
docs = collection.get(ids=["id1", "id2"])

# Get with filters
docs = collection.get(
    where={"category": "tutorial"},
    limit=10,
)

# Get all documents
docs = collection.get()

5. Update documents

python
# Update document content and metadata
collection.update(
    ids=["id1"],
    documents=["Updated content"],
    metadatas=[{"source": "updated"}],
)

6. Delete documents

python
# Delete by IDs
collection.delete(ids=["id1", "id2"])

# Delete with a filter
collection.delete(where={"source": "outdated"})

Persistent storage

python
# Persist to disk
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.create_collection("my_docs")
collection.add(documents=["Doc 1"], ids=["id1"])
# Data is persisted automatically

# Reload later with the same path
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("my_docs")

Embedding functions

Default (Sentence Transformers)

python
# Uses sentence-transformers by default
collection = client.create_collection("my_docs")
# Default model: all-MiniLM-L6-v2
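The distances returned by `query` depend on the collection's distance metric; to my knowledge the default `hnsw:space` is `"l2"` (squared Euclidean), with `"cosine"` and `"ip"` selectable via collection metadata. A plain-Python sketch of the two common metrics, using made-up vectors, to make the scores concrete:

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance: 0.0 for identical vectors, grows with dissimilarity."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 for parallel vectors, up to 2.0 for opposite ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

u, v = [1.0, 0.0], [0.0, 1.0]
print(squared_l2(u, u))       # identical vectors -> 0.0
print(cosine_distance(u, v))  # orthogonal vectors -> 1.0
```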

OpenAI

OpenAI

python
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)

collection = client.create_collection(
    name="openai_docs",
    embedding_function=openai_ef
)

HuggingFace

HuggingFace

python
from chromadb.utils import embedding_functions

huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key="your-key",
    model_name="sentence-transformers/all-mpnet-base-v2"
)

collection = client.create_collection(
    name="hf_docs",
    embedding_function=huggingface_ef
)

Custom embedding function

自定义嵌入向量函数

python
from chromadb import Documents, EmbeddingFunction, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Your embedding logic; must return one vector per input document
        embeddings = [[0.0] * 384 for _ in input]  # placeholder
        return embeddings

my_ef = MyEmbeddingFunction()
collection = client.create_collection(
    name="custom_docs",
    embedding_function=my_ef
)

Metadata filtering

python
# Exact match
results = collection.query(
    query_texts=["query"],
    where={"category": "tutorial"},
)

# Comparison operators: $gt, $gte, $lt, $lte, $ne
results = collection.query(
    query_texts=["query"],
    where={"page": {"$gt": 10}},
)

# Logical operators: $and, $or
results = collection.query(
    query_texts=["query"],
    where={
        "$and": [
            {"category": "tutorial"},
            {"difficulty": {"$lte": 3}},
        ]
    },
)

# Membership: $in
results = collection.query(
    query_texts=["query"],
    where={"tags": {"$in": ["python", "ml"]}},
)
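To make the filter semantics concrete, here is a hypothetical plain-Python evaluator for the operators shown above (exact match, comparisons, `$in`, `$and`/`$or`). This is a sketch of the matching logic for illustration, not Chroma's actual implementation:

```python
def matches(where, metadata):
    """Return True if a metadata dict satisfies a Chroma-style where filter (sketch)."""
    ops = {
        "$gt": lambda a, b: a > b, "$gte": lambda a, b: a >= b,
        "$lt": lambda a, b: a < b, "$lte": lambda a, b: a <= b,
        "$ne": lambda a, b: a != b, "$eq": lambda a, b: a == b,
        "$in": lambda a, b: a in b,
    }
    for key, cond in where.items():
        if key == "$and":
            if not all(matches(clause, metadata) for clause in cond):
                return False
        elif key == "$or":
            if not any(matches(clause, metadata) for clause in cond):
                return False
        elif isinstance(cond, dict):  # operator form, e.g. {"page": {"$gt": 10}}
            if not all(ops[op](metadata.get(key), val) for op, val in cond.items()):
                return False
        elif metadata.get(key) != cond:  # exact match, e.g. {"category": "tutorial"}
            return False
    return True

meta = {"category": "tutorial", "difficulty": 2, "page": 12}
print(matches({"$and": [{"category": "tutorial"}, {"difficulty": {"$lte": 3}}]}, meta))  # True
print(matches({"page": {"$gt": 20}}, meta))  # False
```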

LangChain integration

python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs = text_splitter.split_documents(documents)

# Create Chroma vector store
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db",
)

# Query
results = vectorstore.similarity_search("machine learning", k=3)

# As retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

LlamaIndex integration

python
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
import chromadb

# Initialize Chroma
db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("my_collection")

# Create vector store
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create index
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What is machine learning?")

Server mode

Run the Chroma server in a terminal:

bash
chroma run --path ./chroma_db --port 8000

Connect to the server:

python
import chromadb
from chromadb.config import Settings

client = chromadb.HttpClient(
    host="localhost",
    port=8000,
    settings=Settings(anonymized_telemetry=False),
)

# Use as normal
collection = client.get_or_create_collection("my_docs")

Best practices

  1. Use persistent client - Don't lose data on restart
  2. Add metadata - Enables filtering and tracking
  3. Batch operations - Add multiple docs at once
  4. Choose right embedding model - Balance speed/quality
  5. Use filters - Narrow search space
  6. Unique IDs - Avoid collisions
  7. Regular backups - Copy chroma_db directory
  8. Monitor collection size - Scale up if needed
  9. Test embedding functions - Ensure quality
  10. Use server mode for production - Better for multi-user
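Practice 3 (batch operations) can be sketched as a small helper that chunks documents and calls `collection.add` once per chunk. The `batch_size` of 100 is an arbitrary illustration, not a Chroma recommendation, and the stand-in collection exists only so the snippet runs without a live database:

```python
def add_in_batches(collection, documents, ids, metadatas=None, batch_size=100):
    """Add documents in fixed-size chunks instead of one call per document."""
    for start in range(0, len(documents), batch_size):
        end = start + batch_size
        collection.add(
            documents=documents[start:end],
            ids=ids[start:end],
            metadatas=metadatas[start:end] if metadatas else None,
        )

# Stand-in collection that just records calls, for illustration
class FakeCollection:
    def __init__(self):
        self.calls = []
    def add(self, **kwargs):
        self.calls.append(kwargs)

fake = FakeCollection()
docs = [f"Doc {i}" for i in range(250)]
ids = [f"id{i}" for i in range(250)]
add_in_batches(fake, docs, ids, batch_size=100)
print(len(fake.calls))  # 3 batches: 100 + 100 + 50
```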

Performance

Operation          Latency       Notes
Add 100 docs       ~1-3 s        With embedding generation
Query (top 10)     ~50-200 ms    Depends on collection size
Metadata filter    ~10-50 ms     Fast with proper indexing
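These figures vary heavily with hardware, embedding model, and collection size, so measure your own workload. A simple pattern is wrapping the call in `time.perf_counter`; the no-op stand-in here is just so the snippet runs without a live collection:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in for e.g. collection.query(query_texts=["..."], n_results=10)
result, elapsed = timed(lambda: sum(range(1000)))
print(f"took {elapsed * 1000:.2f} ms")
```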

Resources
