Chroma - Open-Source Embedding Database
The AI-native database for building LLM applications with memory.
When to use Chroma
Use Chroma when:
- Building RAG (retrieval-augmented generation) applications
- Need local/self-hosted vector database
- Want open-source solution (Apache 2.0)
- Prototyping in notebooks
- Semantic search over documents
- Storing embeddings with metadata
Metrics:
- 24,300+ GitHub stars
- 1,900+ forks
- v1.3.3 (stable, weekly releases)
- Apache 2.0 license
Consider an alternative instead:
- Pinecone: Managed cloud, auto-scaling
- FAISS: Pure similarity search, no metadata
- Weaviate: Production ML-native database
- Qdrant: High performance, Rust-based
Quick start
Installation
Python:

```bash
pip install chromadb
```

JavaScript/TypeScript:

```bash
npm install chromadb @chroma-core/default-embed
```
Basic usage (Python)

```python
import chromadb

# Create client
client = chromadb.Client()

# Create collection
collection = client.create_collection(name="my_collection")

# Add documents
collection.add(
    documents=["This is document 1", "This is document 2"],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],
    ids=["id1", "id2"]
)

# Query
results = collection.query(
    query_texts=["document about topic"],
    n_results=2
)
print(results)
```
Core operations

1. Create collection

```python
# Simple collection
collection = client.create_collection("my_docs")

# With a custom embedding function
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)
collection = client.create_collection(
    name="my_docs",
    embedding_function=openai_ef
)

# Get an existing collection
collection = client.get_collection("my_docs")

# Delete a collection
client.delete_collection("my_docs")
```
2. Add documents

```python
# Add documents with metadata (IDs are required and must be unique)
collection.add(
    documents=["Doc 1", "Doc 2", "Doc 3"],
    metadatas=[
        {"source": "web", "category": "tutorial"},
        {"source": "pdf", "page": 5},
        {"source": "api", "timestamp": "2025-01-01"}
    ],
    ids=["id1", "id2", "id3"]
)

# Add with precomputed embeddings
collection.add(
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["Doc 1", "Doc 2"],
    ids=["id1", "id2"]
)
```
3. Query (similarity search)

```python
# Basic query
results = collection.query(
    query_texts=["machine learning tutorial"],
    n_results=5
)

# Query with a metadata filter
results = collection.query(
    query_texts=["Python programming"],
    n_results=3,
    where={"source": "web"}
)

# Query with combined metadata filters
results = collection.query(
    query_texts=["advanced topics"],
    where={
        "$and": [
            {"category": "tutorial"},
            {"difficulty": {"$gte": 3}}
        ]
    }
)

# Access results (one list per query text)
print(results["documents"])  # Matching documents
print(results["metadatas"])  # Metadata for each document
print(results["distances"])  # Distances (lower = more similar)
print(results["ids"])        # Document IDs
```
4. Get documents

```python
# Get by IDs
docs = collection.get(ids=["id1", "id2"])

# Get with a filter
docs = collection.get(
    where={"category": "tutorial"},
    limit=10
)

# Get all documents
docs = collection.get()
```
5. Update documents

```python
# Update document content and metadata
collection.update(
    ids=["id1"],
    documents=["Updated content"],
    metadatas=[{"source": "updated"}]
)
```
6. Delete documents

```python
# Delete by IDs
collection.delete(ids=["id1", "id2"])

# Delete with a filter
collection.delete(where={"source": "outdated"})
```
Persistent storage

```python
# Persist to disk
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.create_collection("my_docs")
collection.add(documents=["Doc 1"], ids=["id1"])
# Data is persisted automatically

# Reload later with the same path
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("my_docs")
```
Embedding functions

Default (Sentence Transformers)

```python
# Uses sentence-transformers by default
collection = client.create_collection("my_docs")
# Default model: all-MiniLM-L6-v2
```
OpenAI

```python
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)
collection = client.create_collection(
    name="openai_docs",
    embedding_function=openai_ef
)
```

HuggingFace

```python
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key="your-key",
    model_name="sentence-transformers/all-mpnet-base-v2"
)
collection = client.create_collection(
    name="hf_docs",
    embedding_function=huggingface_ef
)
```

Custom embedding function
```python
from chromadb import Documents, EmbeddingFunction, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Your embedding logic: return one vector per input document
        # (embed() here stands in for your own model call)
        embeddings = [embed(doc) for doc in input]
        return embeddings

my_ef = MyEmbeddingFunction()
collection = client.create_collection(
    name="custom_docs",
    embedding_function=my_ef
)
```

Metadata filtering
```python
# Exact match
results = collection.query(
    query_texts=["query"],
    where={"category": "tutorial"}
)

# Comparison operators
results = collection.query(
    query_texts=["query"],
    where={"page": {"$gt": 10}}  # Operators: $gt, $gte, $lt, $lte, $ne
)

# Logical operators
results = collection.query(
    query_texts=["query"],
    where={
        "$and": [
            {"category": "tutorial"},
            {"difficulty": {"$lte": 3}}
        ]
    }  # Also: $or
)

# Membership: value must be one of the listed options
results = collection.query(
    query_texts=["query"],
    where={"tags": {"$in": ["python", "ml"]}}
)
```
LangChain integration

```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs = text_splitter.split_documents(documents)

# Create a Chroma vector store
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db"
)

# Query
results = vectorstore.similarity_search("machine learning", k=3)

# As a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
```
LlamaIndex integration

```python
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

# Initialize Chroma
db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("my_collection")

# Create the vector store
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create the index
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context
)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What is machine learning?")
```
Server mode

```python
# Run the Chroma server in a terminal:
#   chroma run --path ./chroma_db --port 8000

# Connect to the server
import chromadb
from chromadb.config import Settings

client = chromadb.HttpClient(
    host="localhost",
    port=8000,
    settings=Settings(anonymized_telemetry=False)
)

# Use as normal
collection = client.get_or_create_collection("my_docs")
```
Best practices

- Use a persistent client - don't lose data on restart
- Add metadata - enables filtering and tracking
- Batch operations - add multiple documents at once
- Choose the right embedding model - balance speed and quality
- Use filters - narrow the search space
- Use unique IDs - avoid collisions
- Take regular backups - copy the chroma_db directory
- Monitor collection size - scale up if needed
- Test embedding functions - ensure retrieval quality
- Use server mode in production - better for multi-user access
Performance
| Operation | Latency | Notes |
|---|---|---|
| Add 100 docs | ~1-3 s | With embedding generation |
| Query (top 10) | ~50-200 ms | Depends on collection size |
| Metadata filter | ~10-50 ms | Fast with proper indexing |
Resources
- GitHub: https://github.com/chroma-core/chroma ⭐ 24,300+
- Docs: https://docs.trychroma.com
- Discord: https://discord.gg/MMeYNTmh3x
- Version: 1.3.3+
- License: Apache 2.0