algo-seo-tfidf
TF-IDF
Overview
TF-IDF (Term Frequency–Inverse Document Frequency) scores term importance as TF(t,d) × IDF(t). High scores mean a term is frequent in a document but rare across the corpus. It computes in O(N × V) time, where N is the number of documents and V is the vocabulary size.
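The scoring rule can be sketched directly; a minimal illustration (the `tfidf` helper below is illustrative, not part of any library):

```python
import math

def tfidf(tf: float, df: int, n_docs: int) -> float:
    """TF-IDF(t, d) = TF(t, d) * log(N / DF(t))."""
    return tf * math.log(n_docs / df)

# Same raw frequency, very different scores:
rare = tfidf(tf=3, df=2, n_docs=1000)      # rare term: 3 * log(500) ≈ 18.6
common = tfidf(tf=3, df=900, n_docs=1000)  # common term: 3 * log(1000/900) ≈ 0.3
```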
When to Use
Trigger conditions:
- Ranking documents by keyword relevance
- Extracting distinguishing terms from documents
- Building lightweight search without ML models
When NOT to use:
- When semantic similarity matters (use embeddings instead)
- When you need ranking with link authority (combine with PageRank)
Algorithm
IRON LAW: TF-IDF Measures RELATIVE Importance
- A term with high TF but low IDF is common, NOT important
- TF-IDF = TF(t,d) × log(N / DF(t))
- A term appearing in ALL documents has IDF = 0 → score = 0
Phase 1: Input Validation
Tokenize documents, apply lowercasing, remove stop words. Build vocabulary.
Gate: All documents tokenized, vocabulary size reasonable.
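A minimal Phase 1 sketch (the stop-word set here is a small illustrative subset, not a complete list):

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}  # illustrative subset

def tokenize(doc: str) -> list[str]:
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", doc.lower()) if t not in STOP_WORDS]

docs = ["The cat sat", "The dog sat", "The cat played"]
tokenized = [tokenize(d) for d in docs]
vocabulary = sorted({t for toks in tokenized for t in toks})
# The gate holds when every document tokenized and the vocabulary
# is a sane size for the corpus.
```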
Phase 2: Core Algorithm
- Compute TF(t,d) for each term in each document (raw count, log-normalized, or boolean)
- Compute IDF(t) = log(N / DF(t)) where DF(t) = number of documents containing term t
- Compute TF-IDF(t,d) = TF(t,d) × IDF(t)
- Optionally L2-normalize document vectors for cosine similarity
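The steps above, sketched with log-normalized TF and L2 normalization (a hand-rolled illustration, not a library API):

```python
import math
from collections import Counter

def tf_idf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one L2-normalized TF-IDF vector per tokenized document."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # DF(t)
    idf = {t: math.log(n / df[t]) for t in df}          # IDF(t) = log(N / DF(t))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # log-normalized TF: 1 + log(count)
        vec = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

vecs = tf_idf_vectors([["cat", "sat"], ["dog", "sat"], ["cat", "played"]])
# In doc 1, "dog" (DF=1) outweighs "sat" (DF=2).
```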
Phase 3: Verification
Check: terms appearing in all documents have IDF ≈ 0. Rare terms have high IDF.
Gate: Score distribution is reasonable; common words score low.
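The Phase 3 checks can be automated; a sketch (the `check_idf` name is illustrative):

```python
import math
from collections import Counter

def check_idf(docs: list[list[str]]) -> None:
    """Verify the two IDF invariants: ubiquitous terms ~0, singleton terms log(N)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    for t, d in df.items():
        idf = math.log(n / d)
        if d == n:                      # term appears in every document
            assert abs(idf) < 1e-12, f"IDF({t!r}) should be 0"
        if d == 1:                      # term appears in exactly one document
            assert abs(idf - math.log(n)) < 1e-12, f"IDF({t!r}) should be log(N)"

check_idf([["the", "cat"], ["the", "dog"], ["the", "bird"]])  # passes: IDF("the") == 0
```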
Phase 4: Output
Return scored terms per document or ranked documents per query.
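Query scoring can be sketched as summing the matching term weights per document (a simplification of full cosine scoring; the `rank` name and document IDs are illustrative):

```python
def rank(query_terms: list[str], vectors: list[dict[str, float]], ids: list[str]) -> list[dict]:
    """Score each document by summing its TF-IDF weights for the query terms."""
    results = []
    for doc_id, vec in zip(ids, vectors):
        matching = [t for t in query_terms if t in vec]
        if matching:
            score = round(sum(vec[t] for t in matching), 3)
            results.append({"document": doc_id, "score": score, "matching_terms": matching})
    return sorted(results, key=lambda r: r["score"], reverse=True)

hits = rank(["cat"], [{"cat": 0.7, "sat": 0.7}, {"dog": 1.0}, {"cat": 0.9}], ["d1", "d2", "d3"])
# d3 ranks first (0.9), then d1 (0.7); d2 has no matching terms and is dropped.
```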
Output Format
输出格式
```json
{
  "query_results": [{"document": "doc_id", "score": 0.73, "matching_terms": ["term1", "term2"]}],
  "metadata": {"corpus_size": 1000, "vocabulary_size": 5000, "tf_variant": "log_normalized"}
}
```
Examples
Sample I/O
Input: Corpus: ["the cat sat", "the dog sat", "the cat played"], Query: "cat"
Expected: TF("cat", doc1)=1/3, DF("cat")=2, IDF=log(3/2)=0.405. TF-IDF(doc1)=0.135, TF-IDF(doc3)=0.135, TF-IDF(doc2)=0
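The worked numbers can be checked in a few lines (raw-count TF divided by document length, no stop-word removal, matching the example above):

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "played"]]
n = len(docs)
df_cat = sum("cat" in d for d in docs)               # DF("cat") = 2
idf_cat = math.log(n / df_cat)                       # log(3/2) ≈ 0.405
scores = [d.count("cat") / len(d) * idf_cat for d in docs]
# scores ≈ [0.135, 0.0, 0.135]
```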
Edge Cases
| Input | Expected | Why |
|---|---|---|
| Term in all docs | Score = 0 | IDF = log(N/N) = 0 |
| Term in one doc | Highest IDF | log(N/1) = log(N) |
| Empty document | All scores = 0 | No terms to score |
Gotchas
- Stop words matter: Without stop word removal, "the", "is", "a" dominate TF but have zero IDF. Preprocess properly.
- TF variant choice: Raw count, log(1+count), or boolean TF produce very different rankings. Log normalization prevents long documents from dominating.
- IDF smoothing: Add 1 to denominator to avoid division by zero for unknown query terms: IDF = log(N / (DF+1)) + 1.
- Not semantic: "car" and "automobile" are treated as completely different terms. TF-IDF has no concept of synonymy.
- Corpus dependency: IDF values change when the corpus changes. Adding documents alters all scores.
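The smoothed IDF from the gotcha above, as a one-liner (note the trade-off: with smoothing, a term in every document no longer scores exactly zero):

```python
import math

def idf_smoothed(df: int, n_docs: int) -> float:
    """IDF = log(N / (DF + 1)) + 1 — defined even for DF == 0 (unseen query terms)."""
    return math.log(n_docs / (df + 1)) + 1

idf_smoothed(df=0, n_docs=100)    # unseen term: no ZeroDivisionError
idf_smoothed(df=100, n_docs=100)  # ubiquitous term: log(100/101) + 1 ≈ 0.99, not 0
```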
Scripts
| Script | Description | Usage |
|---|---|---|
| scripts/tfidf.py | Compute TF-IDF vectors, top terms per document, and query scoring | python scripts/tfidf.py --verify |
Run with --verify to execute built-in sanity tests.
References
- For BM25 (improved TF-IDF), see references/bm25-comparison.md
- For efficient inverted index implementation, see references/inverted-index.md