neo4j-vector-index-skill


When to Use

  • Creating a vector index (`CREATE VECTOR INDEX`) on nodes or relationships
  • Running vector similarity / nearest-neighbor search
  • Storing embeddings on graph nodes during ingestion
  • Choosing similarity function, dimensions, HNSW params, or quantization
  • Using the `SEARCH` clause (2026.01+) or `db.index.vector.queryNodes()` (2025.x)
  • Batch-updating embeddings after model change
  • Combining vector results with immediate graph neighborhood (full retrieval_query pipelines → neo4j-graphrag-skill)

When NOT to Use

  • GraphRAG pipelines (VectorCypherRetriever, HybridCypherRetriever, retrieval_query) → neo4j-graphrag-skill
  • Fulltext / keyword search (FULLTEXT INDEX, `db.index.fulltext.queryNodes`) → neo4j-cypher-skill
  • GDS graph embeddings (FastRP, Node2Vec, GraphSAGE) → neo4j-gds-skill
  • Index admin (list all indexes, drop range/text/lookup indexes) → neo4j-cypher-skill

Pre-flight — Determine Version

Drives syntax choice:

```cypher
CALL dbms.components() YIELD versions RETURN versions[0] AS neo4j_version
```

| Version | Use |
| --- | --- |
| 2026.01 or higher | `SEARCH` clause (in-index filtering, preferred) |
| 2025.x | `db.index.vector.queryNodes()` procedure (deprecated 2026.04 — use `SEARCH` when on 2026.x) |

Step 1 — Create Vector Index

Node index (single label):

```cypher
CYPHER 25
CREATE VECTOR INDEX chunk_embedding IF NOT EXISTS
FOR (c:Chunk) ON (c.embedding)
OPTIONS {
  indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine',
    `vector.quantization.enabled`: true,
    `vector.hnsw.m`: 16,
    `vector.hnsw.ef_construction`: 100
  }
}
```

Node index with filterable properties [2026.01+] — `WITH` declares which properties can be used in `SEARCH ... WHERE`:

```cypher
CYPHER 25
CREATE VECTOR INDEX chunk_embedding IF NOT EXISTS
FOR (c:Chunk) ON (c.embedding)
WITH [c.source, c.lang, c.published_year]  // stored as metadata; filterable in SEARCH WHERE
OPTIONS { indexConfig: { `vector.dimensions`: 1536, `vector.similarity_function`: 'cosine' } }
```

Multi-label index with filterable properties [2026.01+]:

```cypher
CYPHER 25
CREATE VECTOR INDEX doc_embedding IF NOT EXISTS
FOR (n:Document|Article) ON n.embedding
WITH [n.author, n.published_year, n.lang]
OPTIONS { indexConfig: { `vector.dimensions`: 1536, `vector.similarity_function`: 'cosine' } }
```

Relationship index:

```cypher
CYPHER 25
CREATE VECTOR INDEX rel_embedding IF NOT EXISTS
FOR ()-[r:HAS_CHUNK]-() ON (r.embedding)
OPTIONS { indexConfig: { `vector.dimensions`: 768, `vector.similarity_function`: 'cosine' } }
```
`WITH` property types — only scalar types allowed: `INTEGER`, `FLOAT`, `STRING`, `BOOLEAN`, `DATE`, `ZONED DATETIME`, `LOCAL DATETIME`, `ZONED TIME`, `LOCAL TIME`, `DURATION`. Not allowed: `LIST`, `POINT`, or the vector property itself.
Index config reference:

| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| `vector.dimensions` | INTEGER 1–4096 | none | Required; must match embedding model exactly |
| `vector.similarity_function` | STRING | `'cosine'` | `'cosine'` or `'euclidean'` |
| `vector.quantization.enabled` | BOOLEAN | `true` | Reduces storage; slight accuracy tradeoff; needs vector-2.0+ (5.18+) |
| `vector.hnsw.m` | INTEGER 1–512 | 16 | HNSW graph connections; higher = better recall, more memory |
| `vector.hnsw.ef_construction` | INTEGER 1–3200 | 100 | Build-time candidates; higher = better recall, slower build |

Similarity function choice:

| Use case | Function |
| --- | --- |
| Normalized embeddings (OpenAI, Cohere, Voyage, Google) | `'cosine'` |
| Unnormalized / raw distance matters | `'euclidean'` |

Step 2 — Wait for Index ONLINE

Index builds asynchronously — do NOT query until ONLINE:

```cypher
SHOW VECTOR INDEXES YIELD name, state, populationPercent
WHERE name = 'chunk_embedding'
RETURN name, state, populationPercent
```

Poll every 5s until `state = 'ONLINE'` and `populationPercent = 100.0`. If `state = 'FAILED'` → stop, check logs.

Shell poll (cypher-shell):

```bash
until cypher-shell -u neo4j -p "$NEO4J_PASSWORD" \
  "SHOW VECTOR INDEXES YIELD name, state WHERE name='chunk_embedding' RETURN state" \
  | grep -q ONLINE; do
  sleep 5
done
```
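The same polling loop can live in a Python ingestion script. A minimal sketch, with hypothetical names (`wait_for_index_online`, `get_status` are not driver APIs): `get_status` is any callable returning `(state, populationPercent)`, e.g. a wrapper around the `SHOW VECTOR INDEXES` query above.

```python
import time

def wait_for_index_online(get_status, timeout_s: float = 300,
                          interval_s: float = 5, sleep=time.sleep) -> None:
    """Poll get_status() until the index is ONLINE and fully populated.

    Raises RuntimeError on FAILED (check neo4j logs) and TimeoutError if the
    index never comes up. `sleep` is injectable to make the loop testable.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state, pct = get_status()
        if state == "FAILED":
            raise RuntimeError("index build FAILED — check neo4j logs")
        if state == "ONLINE" and pct == 100.0:
            return
        sleep(interval_s)
    raise TimeoutError("index did not come ONLINE within timeout")
```

Called once between index creation (Step 1) and embedding ingestion (Step 3).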

Step 3 — Ingest Embeddings

Batch UNWIND pattern (use for > 100 nodes — never one-node-per-transaction):

```python
from neo4j import GraphDatabase
from openai import OpenAI

driver = GraphDatabase.driver(uri, auth=(user, password))
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_batch(texts: list[str]) -> list[list[float]]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small", input=texts
    )
    return [r.embedding for r in response.data]

def store_embeddings(records: list[dict], batch_size: int = 500):
    expected_dim = 1536  # must match vector.dimensions
    texts = [r["text"] for r in records]
    embeddings = embed_batch(texts)
    for emb in embeddings:
        assert len(emb) == expected_dim, f"Dim mismatch: {len(emb)} != {expected_dim}"
    rows = [{"id": r["id"], "embedding": emb}
            for r, emb in zip(records, embeddings)]
    for i in range(0, len(rows), batch_size):
        driver.execute_query(
            "UNWIND $rows AS row MATCH (c:Chunk {id: row.id}) SET c.embedding = row.embedding",
            rows=rows[i:i+batch_size]
        )
```

❌ Never create the index after embeddings are already stored — always create the index first. ✅ Create index → poll ONLINE → ingest embeddings.

Step 4 — Run Vector Search

SEARCH clause (2026.01+, preferred)

```cypher
CYPHER 25
MATCH (c:Chunk)
  SEARCH c IN (
    VECTOR INDEX chunk_embedding
    FOR $queryEmbedding
    LIMIT 10
  ) SCORE AS score
RETURN c.text, score
ORDER BY score DESC
```

With in-index filter [2026.01+] — properties must be declared in `WITH` at index creation:

```cypher
// Index must have been created with: WITH [c.source, c.lang, c.published_year]
CYPHER 25
MATCH (c:Chunk)
  SEARCH c IN (
    VECTOR INDEX chunk_embedding
    FOR $queryEmbedding
    WHERE c.source = $source AND c.lang = 'en' AND c.published_year >= 2024
    LIMIT 10
  ) SCORE AS score
RETURN c.text, c.source, score
ORDER BY score DESC
```
Filtering strategy — choose one:

| Strategy | When to use | Tradeoff |
| --- | --- | --- |
| In-index `WHERE` [2026.01+] | Filters on pre-declared `WITH` properties; known at index design time | Fast, consistent latency; properties must be declared upfront |
| Post-filter (MATCH + procedure) | Arbitrary Cypher predicates, graph traversal, OR/NOT | Full flexibility; may over-fetch then discard |
| Pre-filter (MATCH first, then SEARCH) | Small known candidate set; exact nearest-neighbor within subset | Deterministic; slow on large candidate sets |
In-index `WHERE` hard limits [2026.01+]:

  • Property must be listed in `WITH [...]` at index creation — undeclared properties silently fall back to post-filtering
  • AND predicates only — no OR, NOT, list ops, string ops
  • Scalar types only: `INTEGER`, `FLOAT`, `STRING`, `BOOLEAN`, temporal types — not VECTOR/LIST/POINT
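These limits can be encoded as a small planning check in query-building code, so a pipeline knows up front whether a filter can run in-index or must fall back to post-filtering. This is purely an illustrative helper (hypothetical name `filter_strategy`, hypothetical predicate shape); Neo4j makes the real decision itself at query time.

```python
SCALAR_TYPES = {"INTEGER", "FLOAT", "STRING", "BOOLEAN", "DATE",
                "ZONED DATETIME", "LOCAL DATETIME", "ZONED TIME",
                "LOCAL TIME", "DURATION"}

def filter_strategy(declared_with: set[str], predicates: list[dict]) -> str:
    """Return 'in-index' only if every predicate satisfies the limits above.

    Each predicate is a dict like {'prop': 'lang', 'type': 'STRING',
    'op': 'AND'} — an assumed shape for this sketch, not a Neo4j structure.
    """
    for p in predicates:
        if p.get("op", "AND") != "AND":
            return "post-filter"        # OR/NOT not allowed in-index
        if p["type"] not in SCALAR_TYPES:
            return "post-filter"        # VECTOR/LIST/POINT not allowed
        if p["prop"] not in declared_with:
            return "post-filter"        # undeclared props silently fall back
    return "in-index"
```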

Post-filter pattern (2025.x or arbitrary predicates)

```cypher
CYPHER 25
CALL db.index.vector.queryNodes('chunk_embedding', 50, $queryEmbedding)
YIELD node AS c, score
WHERE c.source = $source    // post-filter: fetch more, then filter
RETURN c.text, score
ORDER BY score DESC LIMIT 10
```

Relationship index procedure:

```cypher
CYPHER 25
CALL db.index.vector.queryRelationships('rel_embedding', 5, $queryEmbedding)
YIELD relationship AS r, score
RETURN r.text, score
```

SEARCH clause hard limits (all versions):

  • Index name cannot be a parameter (`$indexName` not allowed — use a literal string)
  • Binding variable must come from the enclosing MATCH pattern
  • Query vector cannot reference the binding variable

Step 5 — Combine with Graph Traversal (simple cases)

Vector search as entry point, then graph hop:

```cypher
CYPHER 25
MATCH (c:Chunk)
  SEARCH c IN (
    VECTOR INDEX chunk_embedding
    FOR $queryEmbedding
    LIMIT 10
  ) SCORE AS score
MATCH (c)<-[:HAS_CHUNK]-(a:Article)
OPTIONAL MATCH (a)-[:MENTIONS]->(org:Organization)
RETURN c.text, a.title, score, collect(DISTINCT org.name) AS organizations
ORDER BY score DESC
```

For full retrieval_query pipelines, HybridCypherRetriever, or the neo4j-graphrag library → delegate to neo4j-graphrag-skill.

Embedding Provider Quick-Reference

| Provider / Model | Dimensions | Similarity | Notes |
| --- | --- | --- | --- |
| OpenAI text-embedding-3-small | 1536 | cosine | Default; reducible to 256–1536 via `dimensions=` param |
| OpenAI text-embedding-3-large | 3072 | cosine | Reducible to 256–3072 |
| OpenAI text-embedding-ada-002 | 1536 | cosine | Legacy; prefer 3-small |
| Cohere embed-v3 (English) | 1024 | cosine | Use `input_type='search_document'` at ingest, `'search_query'` at query |
| Voyage voyage-3-large | 1024 | cosine | High quality; needs `voyage-ai` package |
| Google text-embedding-004 | 768 | cosine | Via Vertex AI |
| Ollama nomic-embed-text | 768 | cosine | Local dev/testing |
| Ollama mxbai-embed-large | 1024 | cosine | Local; production-quality |

`vector.dimensions` must exactly match model output — no auto-truncation.
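Since there is no auto-truncation, it can help to keep a local model→dimension registry next to the ingestion code so a model swap fails fast instead of at query time. A sketch under stated assumptions: `MODEL_DIMS` mirrors the table above and `check_dims` is a hypothetical helper, not a library API.

```python
# Dimensions copied from the provider quick-reference table above.
MODEL_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
    "voyage-3-large": 1024,
    "text-embedding-004": 768,
    "nomic-embed-text": 768,
    "mxbai-embed-large": 1024,
}

def check_dims(model: str, embedding: list[float], index_dims: int) -> None:
    """Fail fast if the model or a produced embedding cannot fit the index."""
    expected = MODEL_DIMS.get(model)
    if expected is not None and expected != index_dims:
        raise ValueError(f"{model} outputs {expected} dims but index expects {index_dims}")
    if len(embedding) != index_dims:
        raise ValueError(f"got {len(embedding)} dims, index expects {index_dims}")
```

Call it on every embedding before `SET c.embedding`, the same place the Step 3 assertion runs.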

Vector Functions

Ad-hoc similarity (not for kNN search — use the index for that):

```cypher
MATCH (a:Chunk {id: $id1}), (b:Chunk {id: $id2})
RETURN vector.similarity.cosine(a.embedding, b.embedding) AS sim
// vector.similarity.euclidean(a, b) — same signature, 0–1 range

// vector_distance (2025.10+) — metrics: EUCLIDEAN, EUCLIDEAN_SQUARED, MANHATTAN, COSINE, DOT, HAMMING
// Returns distance (lower = more similar, inverse of similarity)
RETURN vector_distance(a.embedding, b.embedding, 'COSINE') AS dist

// vector_dimension_count (2025.10+)
RETURN vector_dimension_count(n.embedding) AS dims

// vector_norm (2025.20+) — metrics: EUCLIDEAN, MANHATTAN
RETURN vector_norm(n.embedding, 'EUCLIDEAN') AS norm
```

Convert LIST to typed VECTOR:

```cypher
// vector(value, dimension, coordinateType)
// coordinateType: FLOAT64, FLOAT32, INTEGER8/16/32/64
WITH vector([1.0, 2.0, 3.0], 3, 'FLOAT32') AS v
RETURN vector_dimension_count(v)
```
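For debugging outside the database, the raw math behind these functions is easy to reproduce. A minimal sketch of plain cosine similarity and euclidean distance; note this shows only the raw values, not any 0–1 score normalization Neo4j applies to its similarity functions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Raw cosine similarity: dot(a, b) / (||a|| * ||b||), range -1..1."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Raw L2 distance: lower = more similar (a distance, not a similarity)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Useful for spot-checking a pair of stored embeddings pulled out via the driver.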

Index Management

```cypher
// Show all vector indexes with config
SHOW VECTOR INDEXES YIELD name, state, populationPercent,
  labelsOrTypes, properties, indexConfig
RETURN name, state, populationPercent, labelsOrTypes, properties, indexConfig;

// Drop (node data unchanged — only index structure removed)
DROP INDEX chunk_embedding IF EXISTS;

// No ALTER VECTOR INDEX — to change dimensions or similarity function:
// 1. DROP INDEX old_index IF EXISTS
// 2. CREATE VECTOR INDEX new_index ... with new OPTIONS
// 3. Re-generate all embeddings with new model
// 4. Poll until ONLINE
```
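Migration scripts that do the drop → recreate dance often build the DDL from parameters. A sketch of such a builder (hypothetical name `create_vector_index_cypher`), validating the limits from the Step 1 config table before emitting the statement:

```python
def create_vector_index_cypher(name: str, label: str, prop: str,
                               dimensions: int, similarity: str = "cosine") -> str:
    """Emit a CREATE VECTOR INDEX statement like the Step 1 examples."""
    if not 1 <= dimensions <= 4096:
        raise ValueError("vector.dimensions must be in 1-4096")
    if similarity not in ("cosine", "euclidean"):
        raise ValueError("similarity must be 'cosine' or 'euclidean'")
    return (
        f"CREATE VECTOR INDEX {name} IF NOT EXISTS "
        f"FOR (n:{label}) ON (n.{prop}) "
        "OPTIONS { indexConfig: { "
        f"`vector.dimensions`: {dimensions}, "
        f"`vector.similarity_function`: '{similarity}' }} }}"
    )
```

Only trusted, hard-coded names should ever be interpolated this way; index names and labels cannot be query parameters in Cypher DDL.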

Common Errors

| Error | Cause | Fix |
| --- | --- | --- |
| IllegalArgumentException: Index dimension mismatch | Stored embedding dim ≠ `vector.dimensions` | Fix embedding generation; drop + recreate index with correct dim |
| Search returns incomplete results | Index still POPULATING | Poll until `state = 'ONLINE'` |
| Unknown procedure db.index.vector.queryNodes | Neo4j < 5.11 | No vector index support below 5.11; upgrade |
| SEARCH clause not available | Neo4j < 2026.01 | Use `queryNodes()` procedure |
| OR/NOT not allowed in SEARCH WHERE | SEARCH in-index filter restriction | Move complex predicates to outer WHERE after SEARCH |
| Zero results from correct query | Wrong similarity function or all-zeros embedding | Verify with `vector.similarity.cosine()`; check embed call succeeded |
| Score always 1.0 | All-zeros or identical vectors | Embedding generation failed; add dimension assertion before ingest |
| `vector.quantization.enabled` option rejected | Provider vector-1.0 (Neo4j < 5.18) | Omit quantization option or upgrade to 5.18+ |

Checklist

  • `vector.dimensions` matches embedding model output exactly
  • Vector index created before ingesting embeddings
  • Similarity function chosen explicitly (`cosine` for normalized, `euclidean` for distance-based)
  • Index polled to `state = 'ONLINE'` before first query
  • Dimension validated on every embedding before ingest
  • `SEARCH` clause on Neo4j >= 2026.01 (preferred); procedure fallback only on 2025.x (deprecated 2026.04)
  • SEARCH `WHERE` uses AND-only predicates with scalar types
  • Batch UNWIND pattern used for > 100 nodes
  • If model changes: drop index → recreate with new dimensions → re-generate all embeddings

In-Cypher Embedding Generation — ai.text.embed() [2025.12]

Generate embeddings at query time without external Python code. Use `ai.text.embed()` — the current API since [2025.12]:

```cypher
// Syntax (requires CYPHER 25)
CYPHER 25
// ai.text.embed(resource :: STRING, provider :: STRING, configuration :: MAP) :: VECTOR
```

Provider strings are lowercase (`'openai'`, `'vertexai'`, `'bedrock-titan'`, `'azure-openai'`). Full provider config → neo4j-genai-plugin-skill.

Full query pattern — embed at query time, search immediately (procedure fallback for 2025.x):

```cypher
CYPHER 25
WITH ai.text.embed(
    "What are good open source projects",
    "openai",
    { token: $openaiKey, model: 'text-embedding-3-small' }) AS userEmbedding
CALL db.index.vector.queryNodes('chunk_embedding', 6, userEmbedding)  // deprecated 2026.04
YIELD node AS c, score
RETURN c.text, score
ORDER BY score DESC
```

With SEARCH clause (2026.01+):

```cypher
CYPHER 25
WITH ai.text.embed("my query", "openai", { token: $openaiKey, model: 'text-embedding-3-small' }) AS userEmbedding
MATCH (c:Chunk)
  SEARCH c IN (VECTOR INDEX chunk_embedding FOR userEmbedding LIMIT 6) SCORE AS score
RETURN c.text, score
ORDER BY score DESC
```

❌ Never pass an API key as a literal string in production — use `$param` or `apoc.static.get()`. ✅ Use a `$openaiKey` parameter; inject via the driver's params dict.

Rule: Use the same model at ingest time and query time — embeddings from different models are not comparable.

Deprecated (still works but do not use in new code):

  • `genai.vector.encode()` [deprecated] → use `ai.text.embed()` [2025.12]
  • `genai.vector.encodeBatch()` [deprecated] → use `CALL ai.text.embedBatch()` [2025.12]
  • `genai.vector.listEncodingProviders()` [deprecated] → use `CALL ai.text.embed.providers()` [2025.12]

For the full `ai.text.*` reference (completion, structured output, chat, tokenization) → neo4j-genai-plugin-skill.

Cypher-Based Embedding Ingestion — db.create.setNodeVectorProperty

Set vector property via Cypher (e.g. during a LOAD CSV or MERGE pipeline):

```cypher
LOAD CSV WITH HEADERS FROM 'https://example.com/data.csv' AS row
MERGE (q:Question {text: row.question})
WITH q, row
CALL db.create.setNodeVectorProperty(q, 'embedding', apoc.convert.fromJsonList(row.question_embedding))
```

Use when the embedding is already in CSV/JSON form as a string — `apoc.convert.fromJsonList()` converts `"[0.1,0.2,...]"` to `LIST<FLOAT>`. For Python-generated embeddings, use the Python UNWIND batch pattern (Step 3) instead.

Similarity Function — Extended Guidance

The table in Step 1 gives the basic rule. Additional guidance from course patterns:

Choose based on training loss function:

  • Check embedding model docs — models trained with cosine loss → use `'cosine'`
  • Models trained with L2/Euclidean loss → use `'euclidean'`
  • When docs are silent: default to `'cosine'` (all major hosted APIs use it)

Common pitfall — wrong similarity function:

❌ Created index with 'euclidean' but model outputs L2-normalized vectors
   → scores are mathematically valid, but their scale differs from the expected
     cosine scores, so any score thresholds silently misbehave
   → no error thrown; surprising results silently returned

✅ Verify: run vector.similarity.cosine(a.embedding, b.embedding) manually on known
   similar pairs — score should be > 0.9 for near-duplicate text

Sanity-check query after index creation:

```cypher
MATCH (c:Chunk) WITH c LIMIT 2
WITH collect(c) AS nodes
RETURN vector.similarity.cosine(nodes[0].embedding, nodes[1].embedding) AS cosine_check,
       vector.similarity.euclidean(nodes[0].embedding, nodes[1].embedding) AS euclidean_check
```

If both return `null` → embeddings not set. If cosine returns `1.0` → identical vectors (embed call failed).
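One subtlety worth checking numerically: for L2-normalized vectors, squared euclidean distance is an affine function of cosine similarity (d² = 2 − 2·cos), so the two metrics order neighbors identically; what differs between them is the score scale, which is what breaks score thresholds. A quick self-contained check:

```python
import math

def normalize(v: list[float]) -> list[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

cos = sum(x * y for x, y in zip(a, b))          # cosine of unit vectors
d2 = sum((x - y) ** 2 for x, y in zip(a, b))    # squared euclidean distance

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
assert abs(d2 - (2 - 2 * cos)) < 1e-9
```

This is only about raw geometry on normalized vectors; for unnormalized embeddings the two functions genuinely rank differently, which is why the choice must match the model.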

Gotchas — Extended

| Gotcha | Detail | Fix |
| --- | --- | --- |
| Index not ONLINE at ingest time | Inserting nodes before the index exists is valid — the index auto-populates. But querying during POPULATING returns partial results | Always poll `state = 'ONLINE'` before first query |
| Wrong dimensions — silent failure | Stored vector dim ≠ `vector.dimensions` → IllegalArgumentException at query time, not at ingest time | Assert `len(emb) == expected_dim` before every `SET c.embedding` |
| Different models at ingest vs query | No error; cosine scores ~0.3–0.5 for clearly similar text | Use the same model string/version for both; store model name as node metadata |
| Missing model at query | `ai.text.embed` returns null silently if provider config is wrong | Test the encode call standalone; check `CYPHER 25 RETURN ai.text.embed(...)` before embedding it into a pipeline |
| Large single-transaction ingest | One transaction for 10k nodes → OOM or timeout | Use `UNWIND $rows ... CALL IN TRANSACTIONS OF 500 ROWS` or a Python batch loop |
| Chunk overlap not set | Adjacent chunks with no overlap → context at boundaries lost → poor recall for cross-paragraph queries | Set `chunk_overlap` ≥ 10% of `chunk_size` |
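The chunk-overlap gotcha is easy to get right with a few lines of splitting code. A minimal character-based sketch (hypothetical helper name `chunk_text`; production splitters are usually token- or sentence-aware) in which every chunk boundary is covered by two adjacent chunks:

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks; overlap >= 10% of size per the table."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    # Each chunk starts `step` chars after the previous one, so the last
    # `chunk_overlap` chars of chunk i reappear at the start of chunk i+1.
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

The resulting chunks are what gets embedded in Step 3, one `Chunk` node per string.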

References

Load on demand: