algo-ecom-bm25


BM25 Ranking Function


Overview


BM25 (Best Matching 25) is an improved TF-IDF ranking function that adds term frequency saturation and document length normalization. Score = Σ IDF(t) × (TF(t,d) × (k₁+1)) / (TF(t,d) + k₁ × (1 - b + b × |d|/avgdl)). Standard parameters: k₁=1.2, b=0.75. The backbone of most text search engines (Elasticsearch, Solr).
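The score formula above can be sketched as a single term-level function in Python (an illustration of the formula only, not the bundled script):

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score one query term against one document.

    tf: term frequency in the document
    df: number of documents containing the term
    n_docs: total documents in the corpus
    doc_len / avg_doc_len: document length and corpus average
                           (both measured after stop-word removal)
    """
    # Lucene-style smoothed IDF: never negative, even for very common terms.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Length-normalized, saturating TF component.
    norm = 1 - b + b * doc_len / avg_doc_len
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)
```

Summing `bm25_term_score` over all query terms gives Score(d, Q).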

When to Use


Trigger conditions:
  • Building product search with text-based relevance ranking
  • Replacing basic TF-IDF with better document length normalization
  • Tuning search relevance in Elasticsearch/Solr
When NOT to use:
  • When semantic similarity matters more than keyword matching (use embeddings)
  • For single-field exact matching (simpler methods suffice)

Algorithm


IRON LAW: BM25 Has Two Critical Parameters — k₁ and b
k₁ controls term frequency saturation: higher k₁ = more weight to
repeated terms. k₁=0 ignores TF entirely (boolean).
b controls document length normalization: b=1 fully normalizes by
length, b=0 ignores length. Default k₁=1.2, b=0.75 works for most
cases but MUST be tuned for your specific corpus.

Phase 1: Input Validation + Tokenization


Tokenize each document to lowercase word tokens. Remove stop words before counting — the bundled script drops a standard English stop list (the, a, an, and, or, but, of, in, on, at, to, for, with, by, from, as, is, are, was, were, be, been, being). Then build an inverted index: term → list of (document, term frequency). Compute: document lengths (post stop-word removal), average document length, and document frequency per term.
⚠️ Stop-word removal affects |d| and avgdl: because stop words are dropped before length is measured, hand-computing BM25 without removing them will give the wrong length normalization, and scores will be off by 3–5%. If you're reproducing BM25 by hand to compare against the script, apply the same stop list first — or just run the script.
Gate: Index built, statistics computed, corpus non-empty.
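The Phase 1 steps can be sketched as follows (the regex tokenizer and function name are illustrative; the bundled script's tokenizer may differ in detail, but the stop list matches the one above):

```python
import re
from collections import Counter

STOP_WORDS = {
    "the", "a", "an", "and", "or", "but", "of", "in", "on", "at", "to",
    "for", "with", "by", "from", "as", "is", "are", "was", "were",
    "be", "been", "being",
}

def build_index(docs):
    """docs: {doc_id: raw text}. Returns (inverted_index, doc_lens, avgdl)."""
    inverted = {}   # term -> {doc_id: term frequency}
    doc_lens = {}   # doc_id -> length AFTER stop-word removal
    for doc_id, text in docs.items():
        tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
                  if t not in STOP_WORDS]
        doc_lens[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            inverted.setdefault(term, {})[doc_id] = tf
    avgdl = sum(doc_lens.values()) / len(doc_lens) if doc_lens else 0.0
    return inverted, doc_lens, avgdl
```

Note that `doc_lens` and `avgdl` already exclude stop words, which is exactly the |d| / avgdl subtlety flagged above.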

Phase 2: Core Algorithm


For query Q with terms t₁...tₙ against document d:
  1. For each query term tᵢ: compute IDF(tᵢ) = log((N - DF(tᵢ) + 0.5) / (DF(tᵢ) + 0.5) + 1)
  2. Compute TF component: (TF(tᵢ,d) × (k₁+1)) / (TF(tᵢ,d) + k₁ × (1 - b + b × |d|/avgdl))
  3. Score(d, Q) = Σᵢ IDF(tᵢ) × TF_component(tᵢ, d)
  4. Rank documents by score descending
⚠️ IDF variant lock-in: BM25 has several IDF formulations in the wild (Robertson-Sparck Jones, classic Okapi, Lucene's smoothed +1, BM25+, BM25L). This skill — and the bundled script — uses the Lucene-style smoothed variant shown above (log((N - df + 0.5) / (df + 0.5) + 1)), which never returns negative IDF for very common terms. If you compare scores against another engine (Elasticsearch, Solr, Whoosh), they may differ by ~3–5% even on identical inputs. Do not "correct" the script unless you intend to change the variant globally.
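The four steps above can be sketched as one function (assuming the term → {doc_id: tf} index shape from Phase 1; names are illustrative, not the bundled script's API):

```python
import math

def bm25_scores(query_terms, inverted, doc_lens, avgdl, k1=1.2, b=0.75):
    """Return {doc_id: score} for documents matching at least one query term."""
    n_docs = len(doc_lens)
    scores = {}
    for term in query_terms:
        postings = inverted.get(term, {})
        df = len(postings)
        if df == 0:
            continue
        # Step 1: Lucene-style smoothed IDF.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        for doc_id, tf in postings.items():
            # Step 2: saturating, length-normalized TF component.
            norm = 1 - b + b * doc_lens[doc_id] / avgdl
            tf_component = (tf * (k1 + 1)) / (tf + k1 * norm)
            # Step 3: sum the per-term contributions.
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * tf_component
    # Step 4: callers rank by sorting scores.items() descending.
    return scores
```

Ranking is then `sorted(scores.items(), key=lambda kv: kv[1], reverse=True)`.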

Phase 3: Verification


Spot-check: query "red shoes" should rank documents containing both "red" and "shoes" higher than documents with only one term. Shorter product titles with both terms should rank above long descriptions with sparse mentions. Gate: Relevance spot-check passes on 10+ test queries.

Phase 4: Output


Return ranked results with scores.

Output Format


```json
{
  "results": [{"doc_id": "SKU-123", "score": 12.5, "title": "Red Running Shoes"}],
  "metadata": {"query": "red shoes", "hits": 85, "k1": 1.2, "b": 0.75, "avg_doc_length": 45}
}
```

Examples


Sample I/O


Input: Query "wireless earbuds", corpus of 1000 product listings
Expected: Products with "wireless earbuds" in the title rank highest; "wireless headphones" ranks lower (no "earbuds" term).

Edge Cases


| Input | Expected | Why |
| --- | --- | --- |
| Single-word query | IDF-dominated ranking | Only one term's IDF differentiates |
| Very common term ("the") | Near-zero IDF, low impact | IDF suppresses common terms |
| Document with 100 repetitions | Saturated TF, not 100× score | k₁ caps the benefit of repetition |
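The saturation edge case can be checked numerically with the TF component alone (illustrative values; setting doc_len equal to avg_doc_len makes the length norm equal 1 so only saturation is visible):

```python
def tf_component(tf, k1=1.2, b=0.75, doc_len=50, avg_doc_len=50):
    # TF component of BM25; with doc_len == avg_doc_len the length norm is 1.
    norm = 1 - b + b * doc_len / avg_doc_len
    return (tf * (k1 + 1)) / (tf + k1 * norm)

# One occurrence scores 1.0; a hundred occurrences approach the
# k1 + 1 = 2.2 ceiling, nowhere near 100x.
print(tf_component(1))    # 1.0
print(tf_component(100))  # ~2.17
```

This is why keyword stuffing a product description buys almost nothing under BM25.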

Gotchas


  • Multi-field scoring: E-commerce products have title, description, brand, category. Weight fields differently: title match > description match. Use field-boosted BM25.
  • Synonyms and stemming: BM25 is keyword-exact. "earphones" won't match "earbuds." Add synonym expansion and stemming in the query pipeline.
  • Parameter tuning: Default k₁=1.2, b=0.75 is reasonable but not optimal. Tune on relevance judgments specific to your catalog.
  • Numeric attributes: BM25 doesn't handle numeric filtering (price range, ratings). Use it for text relevance, then combine with numeric filters.
  • Zero-result queries: When BM25 returns nothing, fall back to fuzzy matching or semantic search rather than showing empty results.
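The multi-field gotcha can be approximated with a weighted sum of per-field BM25 scores. This is a simplification: true BM25F merges term frequencies across fields before saturation, and the weights below are illustrative, not recommendations:

```python
def field_boosted_score(field_scores, weights=None):
    """Combine per-field BM25 scores with field weights.

    field_scores: e.g. {"title": 4.1, "description": 1.3}, each value from
    a separate BM25 pass over that field's text. A weighted sum is a common,
    simpler stand-in for full BM25F (see references/bm25f.md).
    """
    # Illustrative weights: title matches count most, brand next.
    weights = weights or {"title": 3.0, "description": 1.0, "brand": 2.0}
    return sum(weights.get(field, 1.0) * s for field, s in field_scores.items())
```

A title-only match thus outscores an equal-strength description-only match by the title weight.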

Scripts


| Script | Description | Usage |
| --- | --- | --- |
| scripts/bm25.py | Score documents against a query using the BM25 ranking function | python scripts/bm25.py --help |

Run python scripts/bm25.py --verify to execute built-in sanity tests.

References


  • For BM25F multi-field extension, see references/bm25f.md
  • For parameter tuning methodology, see references/parameter-tuning.md