spice-search

Search Data

Spice provides integrated search: vector (semantic) search, full-text (keyword) search, and hybrid search that fuses results with Reciprocal Rank Fusion (RRF). All three are available through SQL functions and HTTP APIs. Search indexes are built on top of accelerated datasets.

Search Methods

| Method | When to Use | Requires |
| --- | --- | --- |
| Vector search | Semantic similarity, RAG, recommendations | Embedding model + column embeddings |
| Full-text search | Keyword/phrase matching, exact terms | `full_text_search.enabled: true` on columns |
| Hybrid (RRF) | Best of both: combines rankings from multiple methods | Multiple search methods configured |
| Lexical (LIKE/=) | Exact pattern or value matching | Nothing extra |
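
To make the "semantic similarity" row concrete, the sketch below ranks a few toy documents by cosine similarity to a query vector. The three-dimensional vectors and the `rank_by_cosine` helper are invented for illustration; real embeddings come from the configured model, and this document does not state which distance metric Spice uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vec, docs):
    """Return (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings; a real model produces hundreds of dimensions.
TOY_DOCS = {
    "ai_paper": [0.9, 0.1, 0.0],
    "cooking_blog": [0.0, 0.2, 0.9],
    "ml_tutorial": [0.7, 0.3, 0.1],
}

ranking = rank_by_cosine([1.0, 0.0, 0.0], TOY_DOCS)
for doc_id, similarity in ranking:
    print(f"{doc_id}: {similarity:.3f}")
```

Documents whose vectors point in nearly the same direction as the query rank first, regardless of whether they share any literal keywords with it.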

Set Up Vector Search

1. Define an Embedding Model

```yaml
embeddings:
  - name: local_embeddings
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2

  - name: openai_embeddings
    from: openai:text-embedding-3-small
    params:
      openai_api_key: ${ secrets:OPENAI_API_KEY }
```

Supported Embedding Providers

| Provider | From Format | Status |
| --- | --- | --- |
| OpenAI | `openai:text-embedding-3-large` | Release Candidate |
| HuggingFace | `huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2` | Release Candidate |
| Local file | `file:model.safetensors` | Release Candidate |
| Azure OpenAI | `azure:my-deployment` | Alpha |
| Google AI | `google:text-embedding-004` | Alpha |
| Amazon Bedrock | `bedrock:amazon.titan-embed-text-v1` | Alpha |
| Databricks | `databricks:endpoint` | Alpha |
| Model2Vec | `model2vec:model-name` | Alpha |

2. Configure Dataset Columns for Embeddings

```yaml
datasets:
  - from: postgres:documents
    name: docs
    acceleration:
      enabled: true
    columns:
      - name: content
        embeddings:
          - from: local_embeddings
            row_id: id
            chunking:
              enabled: true
              target_chunk_size: 512
              overlap_size: 64
```
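
The chunking settings above describe overlapping windows over each document. The following is a minimal sketch of that idea, not Spice's actual chunker: the `chunk_with_overlap` helper is hypothetical, it treats the sizes as counts of list items (this document does not say whether the units are tokens or characters), and a real chunker would likely also respect sentence or word boundaries.

```python
def chunk_with_overlap(items, target_chunk_size=512, overlap_size=64):
    """Split a sequence into windows of target_chunk_size that share overlap_size items."""
    if target_chunk_size <= 0:
        raise ValueError("target_chunk_size must be positive")
    if not 0 <= overlap_size < target_chunk_size:
        raise ValueError("overlap_size must be in [0, target_chunk_size)")

    step = target_chunk_size - overlap_size  # how far each new window advances
    chunks = []
    start = 0
    while start < len(items):
        chunks.append(items[start:start + target_chunk_size])
        if start + target_chunk_size >= len(items):
            break  # this window already reached the end of the input
        start += step
    return chunks


# Stand-in for a tokenized document of 1000 tokens.
tokens = list(range(1000))
chunks = chunk_with_overlap(tokens)
print([len(chunk) for chunk in chunks])
```

With the values from the configuration, each window starts 448 items after the previous one, so the last 64 items of one chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary visible in at least one chunk.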

Embedding Methods

| Method | Description | When to Use |
| --- | --- | --- |
| Accelerated | Precomputed and stored | Faster queries, frequently searched datasets |
| JIT (Just-in-Time) | Computed at query time (no acceleration) | Large or rarely queried datasets |
| Passthrough | Pre-existing embeddings used directly | Source already has `<col>_embedding` columns |

3. Query via HTTP API

```bash
curl -X POST http://localhost:8090/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "datasets": ["docs"],
    "text": "cutting edge AI",
    "where": "author=\"jeadie\"",
    "additional_columns": ["title", "state"],
    "limit": 5
  }'
```

| Field | Required | Description |
| --- | --- | --- |
| `text` | Yes | Search text |
| `datasets` | No | Datasets to search (null = all searchable) |
| `additional_columns` | No | Extra columns to return |
| `where` | No | SQL filter predicate |
| `limit` | No | Max results per dataset |

To retrieve full documents (not just chunks), include the embedding column name in `additional_columns`.
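
The same request can be assembled from application code. The sketch below uses only the Python standard library and mirrors the fields documented above; the `build_search_request` helper is hypothetical, and the response shape is not described in this section, so the send step is left as a comment.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust host and port for your runtime.
SEARCH_URL = "http://localhost:8090/v1/search"


def build_search_request(text, datasets=None, where=None,
                         additional_columns=None, limit=None, url=SEARCH_URL):
    """Build a POST request for the search endpoint, omitting unset optional fields."""
    payload = {"text": text}  # the only required field
    optional = {
        "datasets": datasets,
        "where": where,
        "additional_columns": additional_columns,
        "limit": limit,
    }
    payload.update({key: value for key, value in optional.items() if value is not None})
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request(
    "cutting edge AI",
    datasets=["docs"],
    where='author="jeadie"',
    additional_columns=["title", "state"],
    limit=5,
)

# To send it against a running Spice runtime (not executed here):
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```

Building the body with `json.dumps` handles the quote escaping inside the `where` predicate that the curl example has to do by hand.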

4. Query via SQL UDTF

```sql
SELECT id, title, score
FROM vector_search(docs, 'cutting edge AI')
WHERE state = 'Open'
ORDER BY score DESC
LIMIT 5;
```

`vector_search` signature:

```sql
vector_search(
  table STRING,          -- Dataset name (required)
  query STRING,          -- Search text (required)
  col STRING,            -- Column (optional if single embedding column)
  limit INTEGER,         -- Max results (default: 1000)
  include_score BOOLEAN  -- Include score column (default: TRUE)
) RETURNS TABLE
```

Limitation: the `vector_search` UDTF does not yet support chunked embedding columns. Use the HTTP API for chunked data.

Set Up Full-Text Search

Full-text search uses BM25 scoring (powered by Tantivy) for keyword relevance ranking.
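
For intuition about what BM25 rewards, here is a small sketch of the textbook Okapi BM25 formula with the commonly used defaults `k1 = 1.2` and `b = 0.75`. It is not Tantivy's implementation: the whitespace tokenizer, the parameter values, and the `bm25_scores` helper are assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with textbook Okapi BM25."""
    tokenized = [doc.lower().split() for doc in documents]  # toy tokenizer
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "spice search supports vector search",
    "full text search uses keyword ranking",
    "recipes for spicy food",
]
scores = bm25_scores("keyword search", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The second document ranks highest because it matches both query terms and "keyword" is rare in the collection. The first matches only the more common "search", and the third matches nothing.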

1. Enable Indexing on Columns

```yaml
datasets:
  - from: postgres:articles
    name: articles
    acceleration:
      enabled: true
    columns:
      - name: title
        full_text_search:
          enabled: true
          row_id:
            - id
      - name: body
        full_text_search:
          enabled: true
```

2. Query via SQL UDTF

```sql
SELECT id, title, score
FROM text_search(articles, 'search keywords', body)
ORDER BY score DESC
LIMIT 5;
```

`text_search` signature:

```sql
text_search(
  table STRING,          -- Dataset name (required)
  query STRING,          -- Keywords/phrase (required)
  col STRING,            -- Column (required if multiple indexed columns)
  limit INTEGER,         -- Max results (default: 1000)
  include_score BOOLEAN  -- Include score column (default: TRUE)
) RETURNS TABLE
```

Hybrid Search with RRF

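Reciprocal Rank Fusion (RRF) merges rankings from multiple search methods. Each query runs independently, and the results are then combined:

RRF Score = Σ(rank_weight / (k + rank))

The sum runs over the queries that returned the document, where rank is the document's position in that query's results. Documents appearing across multiple result sets therefore receive higher scores.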
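
The formula above can be sketched in a few lines of Python. This is an illustration of the scoring rule, not Spice's implementation: the `rrf_fuse` helper and the document IDs are hypothetical, and it assumes ranks start at 1 and that an unweighted query has `rank_weight` 1.0, neither of which this document states.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60.0):
    """Fuse ranked result lists with Reciprocal Rank Fusion.

    ranked_lists: iterable of (rank_weight, [doc ids, best first]) pairs.
    Returns (doc_id, fused_score) pairs, highest score first.
    """
    fused = defaultdict(float)
    for rank_weight, doc_ids in ranked_lists:
        for rank, doc_id in enumerate(doc_ids, start=1):  # assumed 1-based ranks
            fused[doc_id] += rank_weight / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. output order of a vector query
text_hits = ["doc_b", "doc_d"]             # e.g. output order of a keyword query

fused = rrf_fuse([(1.0, vector_hits), (1.0, text_hits)])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

`doc_b` comes out on top even though neither list ranks it first in both, because it is the only document that collects a contribution from each query.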

Basic Hybrid Search

```sql
SELECT id, title, content, fused_score
FROM rrf(
    vector_search(documents, 'machine learning algorithms'),
    text_search(documents, 'neural networks deep learning', content),
    join_key => 'id'
)
ORDER BY fused_score DESC
LIMIT 5;
```

Weighted Ranking

```sql
SELECT fused_score, title, content
FROM rrf(
    text_search(posts, 'artificial intelligence', rank_weight => 50.0),
    vector_search(posts, 'AI machine learning', rank_weight => 200.0)
)
ORDER BY fused_score DESC
LIMIT 10;
```

Recency-Boosted Search

```sql
-- Exponential decay (1-hour scale)
SELECT fused_score, title, created_at
FROM rrf(
    text_search(news, 'breaking news'),
    vector_search(news, 'latest updates'),
    time_column => 'created_at',
    recency_decay => 'exponential',
    decay_constant => 0.05,
    decay_scale_secs => 3600
)
ORDER BY fused_score DESC
LIMIT 10;

-- Linear decay (24-hour window)
SELECT fused_score, content
FROM rrf(
    text_search(posts, 'trending'),
    vector_search(posts, 'viral popular'),
    time_column => 'created_at',
    recency_decay => 'linear',
    decay_window_secs => 86400
)
ORDER BY fused_score DESC;
```
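
To get a feel for how the two decay modes differ, the sketch below computes a recency multiplier for a row of a given age. The formulas are an assumption, since this document names the parameters but does not give the exact expressions: exponential decay is taken as `exp(-decay_constant * age / decay_scale_secs)` and linear decay as `max(0, 1 - age / decay_window_secs)`. Both helper functions are hypothetical.

```python
import math


def exponential_decay(age_secs, decay_constant=0.01, decay_scale_secs=86400.0):
    """Assumed form: the multiplier shrinks smoothly and never reaches zero."""
    age_units = max(age_secs, 0.0) / decay_scale_secs
    return math.exp(-decay_constant * age_units)


def linear_decay(age_secs, decay_window_secs=86400.0):
    """Assumed form: the multiplier falls in a straight line to zero at the window edge."""
    return max(0.0, 1.0 - max(age_secs, 0.0) / decay_window_secs)


# Compare the settings used in the two queries above for rows of various ages.
for hours in (0, 1, 12, 24, 48):
    age = hours * 3600
    exp_mult = exponential_decay(age, decay_constant=0.05, decay_scale_secs=3600)
    lin_mult = linear_decay(age, decay_window_secs=86400)
    print(f"{hours:>2}h old  exponential={exp_mult:.3f}  linear={lin_mult:.3f}")
```

Under these assumed formulas, linear decay removes anything older than the window from contention, while exponential decay only discounts older rows gradually. How the multiplier is combined with `fused_score` is likewise not specified here.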

Cross-Language Search

```sql
SELECT fused_score, text, langs
FROM rrf(
    vector_search(posts, 'ultimas noticias', rank_weight => 100),
    text_search(posts, 'news'),
    time_column => 'created_at',
    recency_decay => 'exponential',
    decay_constant => 0.05,
    decay_scale_secs => 3600
)
WHERE trim(text) != ''
ORDER BY fused_score DESC
LIMIT 15;
```

rrf Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `query_1`, `query_2`, ... | Search UDTF | Yes (2+) | `vector_search` or `text_search` calls (variadic) |
| `join_key` | String | No | Column for joining results (default: auto-hash) |
| `k` | Float | No | Smoothing parameter (default: 60.0, lower = more aggressive) |
| `time_column` | String | No | Timestamp column for recency boosting |
| `recency_decay` | String | No | `'exponential'` (default) or `'linear'` |
| `decay_constant` | Float | No | Rate for exponential decay (default: 0.01) |
| `decay_scale_secs` | Float | No | Time scale for exponential decay (default: 86400) |
| `decay_window_secs` | Float | No | Window for linear decay (default: 86400) |
| `rank_weight` | Float | No | Per-query weight (specified inside search calls) |
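
The note that a lower `k` is "more aggressive" can be checked with a little arithmetic on the RRF term `rank_weight / (k + rank)`. The sketch below compares how much more a rank-1 hit contributes than a rank-10 hit for several values of `k`; the helper names are hypothetical and ranks are assumed to start at 1.

```python
def rank_contribution(rank, k=60.0, rank_weight=1.0):
    """One query's contribution to a document's fused score."""
    return rank_weight / (k + rank)


def top_rank_advantage(k, lower_rank=10):
    """How many times more a rank-1 hit contributes than a hit at lower_rank."""
    return rank_contribution(1, k) / rank_contribution(lower_rank, k)


for k in (1.0, 60.0, 1000.0):
    print(f"k={k:g}: rank 1 counts {top_rank_advantage(k):.2f}x as much as rank 10")
```

With `k = 1` a top hit contributes 5.5 times as much as a tenth-place hit, while at the default `k = 60` the ratio is only about 1.15. Smaller values let the head of each list dominate, and larger values flatten the ranking so that appearing in several lists matters more than position.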

Vector Engines

Store and index embeddings at scale using dedicated vector engines:

```yaml
datasets:
  - from: postgres:documents
    name: docs
    acceleration:
      enabled: true
    columns:
      - name: content
        embeddings:
          - from: embed_model
            row_id: id
        metadata:
          vectors: non-filterable
      - name: category
        metadata:
          vectors: filterable # enable filtering on this column
    vectors:
      enabled: true
      engine: s3_vectors
      params:
        s3_vectors_bucket: my-bucket
        s3_vectors_region: us-east-1
```

Lexical Search (SQL)

Standard SQL filtering:

```sql
SELECT * FROM my_table WHERE column LIKE '%substring%';
SELECT * FROM my_table WHERE column = 'exact value';
SELECT * FROM my_table WHERE regexp_like(column, '^spice.*ai$');
```

CLI Search

```bash
spice search "cutting edge AI" --dataset docs --limit 5
spice search --cache-control no-cache "search terms"
```

Complete Example

```yaml
version: v1
kind: Spicepod
name: search_app

secrets:
  - from: env
    name: env

embeddings:
  - name: embeddings
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2

datasets:
  - from: file:articles.parquet
    name: articles
    acceleration:
      enabled: true
      engine: duckdb
    columns:
      - name: title
        full_text_search:
          enabled: true
          row_id:
            - id
      - name: content
        embeddings:
          - from: embeddings
        full_text_search:
          enabled: true
```

```sql
SELECT id, title, content, fused_score
FROM rrf(
    vector_search(articles, 'machine learning best practices'),
    text_search(articles, 'neural network training', content),
    join_key => 'id',
    time_column => 'published_at',
    recency_decay => 'exponential',
    decay_constant => 0.01,
    decay_scale_secs => 86400
)
WHERE fused_score > 0.01
ORDER BY fused_score DESC
LIMIT 10;
```

Troubleshooting

| Issue | Solution |
| --- | --- |
| `vector_search` returns no results | Verify that embeddings are configured on the column and that the model is loaded |
| `text_search` returns no results | Check `full_text_search.enabled: true`; acceleration must be enabled |
| Poor hybrid search relevance | Tune `rank_weight` per query and adjust `k` |
| Results missing recent content | Add `time_column` and `recency_decay` to RRF |
| Chunked vector search not working via SQL | Use the HTTP API instead (the UDTF doesn't support chunked columns yet) |

Documentation
