chroma

Chroma - Open-Source Embedding Database

The AI-native database for building LLM applications with memory.

When to use Chroma

Use Chroma when:
  • Building RAG (retrieval-augmented generation) applications
  • You need a local or self-hosted vector database
  • You want an open-source solution (Apache 2.0)
  • Prototyping in notebooks
  • Running semantic search over documents
  • Storing embeddings together with metadata
Metrics:
  • 24,300+ GitHub stars
  • 1,900+ forks
  • v1.3.3 (stable, weekly releases)
  • Apache 2.0 license
Consider alternatives instead:
  • Pinecone: Managed cloud, auto-scaling
  • FAISS: Pure similarity search, no metadata
  • Weaviate: Production ML-native database
  • Qdrant: High performance, Rust-based

Quick start

Installation

Python

bash
pip install chromadb

JavaScript/TypeScript

bash
npm install chromadb @chroma-core/default-embed

Basic usage (Python)

python
import chromadb

# Create client
client = chromadb.Client()

# Create collection
collection = client.create_collection(name="my_collection")

# Add documents
collection.add(
    documents=["This is document 1", "This is document 2"],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],
    ids=["id1", "id2"],
)

# Query
results = collection.query(
    query_texts=["document about topic"],
    n_results=2,
)
print(results)

Core operations

1. Create collection

python
# Simple collection
collection = client.create_collection("my_docs")

# With a custom embedding function
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small",
)
collection = client.create_collection(
    name="my_docs",
    embedding_function=openai_ef,
)

# Get an existing collection
collection = client.get_collection("my_docs")

# Delete a collection
client.delete_collection("my_docs")

2. Add documents

python
# Add documents with metadata
collection.add(
    documents=["Doc 1", "Doc 2", "Doc 3"],
    metadatas=[
        {"source": "web", "category": "tutorial"},
        {"source": "pdf", "page": 5},
        {"source": "api", "timestamp": "2025-01-01"},
    ],
    ids=["id1", "id2", "id3"],
)

# Add with precomputed embeddings
collection.add(
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["Doc 1", "Doc 2"],
    ids=["id1", "id2"],
)

3. Query (similarity search)

python
# Basic query
results = collection.query(
    query_texts=["machine learning tutorial"],
    n_results=5,
)

# Query with a metadata filter
results = collection.query(
    query_texts=["Python programming"],
    n_results=3,
    where={"source": "web"},
)

# Query with combined metadata filters
results = collection.query(
    query_texts=["advanced topics"],
    where={
        "$and": [
            {"category": "tutorial"},
            {"difficulty": {"$gte": 3}},
        ]
    },
)

# Access results
print(results["documents"])  # List of matching documents
print(results["metadatas"])  # Metadata for each doc
print(results["distances"])  # Distance scores (lower = more similar)
print(results["ids"])        # Document IDs
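Note that `query` returns parallel lists nested one level deep: one inner list per entry in `query_texts`. A minimal sketch of walking such a result dict; the sample values below are fabricated for illustration, not real Chroma output:

```python
def iter_hits(results, query_index=0):
    """Pair up the parallel result lists for one query text."""
    return list(zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["metadatas"][query_index],
        results["distances"][query_index],
    ))

# Fabricated sample mirroring the nested-list shape of collection.query
sample = {
    "ids": [["id2", "id1"]],
    "documents": [["This is document 2", "This is document 1"]],
    "metadatas": [[{"source": "doc2"}, {"source": "doc1"}]],
    "distances": [[0.12, 0.34]],
}

for doc_id, doc, meta, dist in iter_hits(sample):
    print(f"{doc_id}: {doc!r} (source={meta['source']}, distance={dist})")
```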

4. Get documents

python
# Get by IDs
docs = collection.get(ids=["id1", "id2"])

# Get with filters
docs = collection.get(
    where={"category": "tutorial"},
    limit=10,
)

# Get all documents
docs = collection.get()

5. Update documents

python
# Update document content and metadata
collection.update(
    ids=["id1"],
    documents=["Updated content"],
    metadatas=[{"source": "updated"}],
)

6. Delete documents

python
# Delete by IDs
collection.delete(ids=["id1", "id2"])

# Delete with a filter
collection.delete(where={"source": "outdated"})

Persistent storage

python
# Persist to disk
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.create_collection("my_docs")
collection.add(documents=["Doc 1"], ids=["id1"])
# Data is persisted automatically

# Reload later with the same path
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("my_docs")

Embedding functions

Default (Sentence Transformers)

python
# Uses sentence-transformers by default
collection = client.create_collection("my_docs")
# Default model: all-MiniLM-L6-v2
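The distances returned by `query` depend on the collection's distance metric; to my knowledge the default `hnsw:space` is `"l2"` (squared Euclidean), with `"cosine"` and `"ip"` selectable via collection metadata. A plain-Python sketch of the two common metrics, using made-up vectors, to make the scores concrete:

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance: 0.0 for identical vectors, grows with dissimilarity."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 for parallel vectors, up to 2.0 for opposite ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

u, v = [1.0, 0.0], [0.0, 1.0]
print(squared_l2(u, u))       # identical vectors -> 0.0
print(cosine_distance(u, v))  # orthogonal vectors -> 1.0
```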

OpenAI

OpenAI

python
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)

collection = client.create_collection(
    name="openai_docs",
    embedding_function=openai_ef
)

HuggingFace

HuggingFace

python
from chromadb.utils import embedding_functions

huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key="your-key",
    model_name="sentence-transformers/all-mpnet-base-v2"
)

collection = client.create_collection(
    name="hf_docs",
    embedding_function=huggingface_ef
)

Custom embedding function

自定义嵌入向量函数

python
from chromadb import Documents, EmbeddingFunction, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Your embedding logic; must return one vector per input document
        embeddings = [[0.0] * 384 for _ in input]  # placeholder
        return embeddings

my_ef = MyEmbeddingFunction()
collection = client.create_collection(
    name="custom_docs",
    embedding_function=my_ef
)

Metadata filtering

python
# Exact match
results = collection.query(
    query_texts=["query"],
    where={"category": "tutorial"},
)

# Comparison operators: $gt, $gte, $lt, $lte, $ne
results = collection.query(
    query_texts=["query"],
    where={"page": {"$gt": 10}},
)

# Logical operators: $and, $or
results = collection.query(
    query_texts=["query"],
    where={
        "$and": [
            {"category": "tutorial"},
            {"difficulty": {"$lte": 3}},
        ]
    },
)

# Membership: $in
results = collection.query(
    query_texts=["query"],
    where={"tags": {"$in": ["python", "ml"]}},
)
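To make the filter semantics concrete, here is a hypothetical plain-Python evaluator for the operators shown above (exact match, comparisons, `$in`, `$and`/`$or`). This is a sketch of the matching logic for illustration, not Chroma's actual implementation:

```python
def matches(where, metadata):
    """Return True if a metadata dict satisfies a Chroma-style where filter (sketch)."""
    ops = {
        "$gt": lambda a, b: a > b, "$gte": lambda a, b: a >= b,
        "$lt": lambda a, b: a < b, "$lte": lambda a, b: a <= b,
        "$ne": lambda a, b: a != b, "$eq": lambda a, b: a == b,
        "$in": lambda a, b: a in b,
    }
    for key, cond in where.items():
        if key == "$and":
            if not all(matches(clause, metadata) for clause in cond):
                return False
        elif key == "$or":
            if not any(matches(clause, metadata) for clause in cond):
                return False
        elif isinstance(cond, dict):  # operator form, e.g. {"page": {"$gt": 10}}
            if not all(ops[op](metadata.get(key), val) for op, val in cond.items()):
                return False
        elif metadata.get(key) != cond:  # exact match, e.g. {"category": "tutorial"}
            return False
    return True

meta = {"category": "tutorial", "difficulty": 2, "page": 12}
print(matches({"$and": [{"category": "tutorial"}, {"difficulty": {"$lte": 3}}]}, meta))  # True
print(matches({"page": {"$gt": 20}}, meta))  # False
```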

LangChain integration

python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs = text_splitter.split_documents(documents)

# Create Chroma vector store
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db",
)

# Query
results = vectorstore.similarity_search("machine learning", k=3)

# As retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

LlamaIndex integration

python
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
import chromadb

# Initialize Chroma
db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("my_collection")

# Create vector store
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create index
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What is machine learning?")

Server mode

Run the Chroma server in a terminal:

bash
chroma run --path ./chroma_db --port 8000

Connect to the server:

python
import chromadb
from chromadb.config import Settings

client = chromadb.HttpClient(
    host="localhost",
    port=8000,
    settings=Settings(anonymized_telemetry=False),
)

# Use as normal
collection = client.get_or_create_collection("my_docs")

Best practices

  1. Use persistent client - Don't lose data on restart
  2. Add metadata - Enables filtering and tracking
  3. Batch operations - Add multiple docs at once
  4. Choose right embedding model - Balance speed/quality
  5. Use filters - Narrow search space
  6. Unique IDs - Avoid collisions
  7. Regular backups - Copy chroma_db directory
  8. Monitor collection size - Scale up if needed
  9. Test embedding functions - Ensure quality
  10. Use server mode for production - Better for multi-user
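Practice 3 (batch operations) can be sketched as a small helper that chunks documents and calls `collection.add` once per chunk. The `batch_size` of 100 is an arbitrary illustration, not a Chroma recommendation, and the stand-in collection exists only so the snippet runs without a live database:

```python
def add_in_batches(collection, documents, ids, metadatas=None, batch_size=100):
    """Add documents in fixed-size chunks instead of one call per document."""
    for start in range(0, len(documents), batch_size):
        end = start + batch_size
        collection.add(
            documents=documents[start:end],
            ids=ids[start:end],
            metadatas=metadatas[start:end] if metadatas else None,
        )

# Stand-in collection that just records calls, for illustration
class FakeCollection:
    def __init__(self):
        self.calls = []
    def add(self, **kwargs):
        self.calls.append(kwargs)

fake = FakeCollection()
docs = [f"Doc {i}" for i in range(250)]
ids = [f"id{i}" for i in range(250)]
add_in_batches(fake, docs, ids, batch_size=100)
print(len(fake.calls))  # 3 batches: 100 + 100 + 50
```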

Performance

Operation          Latency       Notes
Add 100 docs       ~1-3 s        With embedding generation
Query (top 10)     ~50-200 ms    Depends on collection size
Metadata filter    ~10-50 ms     Fast with proper indexing
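These figures vary heavily with hardware, embedding model, and collection size, so measure your own workload. A simple pattern is wrapping the call in `time.perf_counter`; the no-op stand-in here is just so the snippet runs without a live collection:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in for e.g. collection.query(query_texts=["..."], n_results=10)
result, elapsed = timed(lambda: sum(range(1000)))
print(f"took {elapsed * 1000:.2f} ms")
```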

Resources
