algo-seo-tfidf
TF-IDF
Overview
TF-IDF (Term Frequency–Inverse Document Frequency) scores term importance as TF(t,d) × IDF(t). High scores mean a term is frequent in a document but rare across the corpus. It computes in O(N × V) time, where N is the number of documents and V is the vocabulary size.
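The scoring rule can be sketched directly; a minimal illustration (the `tfidf` helper below is illustrative, not part of any library):

```python
import math

def tfidf(tf: float, df: int, n_docs: int) -> float:
    """TF-IDF(t, d) = TF(t, d) * log(N / DF(t))."""
    return tf * math.log(n_docs / df)

# Same raw frequency, very different scores:
rare = tfidf(tf=3, df=2, n_docs=1000)      # rare term: 3 * log(500) ≈ 18.6
common = tfidf(tf=3, df=900, n_docs=1000)  # common term: 3 * log(1000/900) ≈ 0.3
```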
When to Use
Trigger conditions:
- Ranking documents by keyword relevance
- Extracting distinguishing terms from documents
- Building lightweight search without ML models
When NOT to use:
- When semantic similarity matters (use embeddings instead)
- When you need ranking with link authority (combine with PageRank)
Algorithm
IRON LAW: TF-IDF Measures RELATIVE Importance
- A term with high TF but low IDF is common, NOT important
- TF-IDF = TF(t,d) × log(N / DF(t))
- A term appearing in ALL documents has IDF = 0 → score = 0
Phase 1: Input Validation
Tokenize documents, apply lowercasing, remove stop words. Build vocabulary.
Gate: All documents tokenized, vocabulary size reasonable.
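A minimal Phase 1 sketch (the stop-word set here is a small illustrative subset, not a complete list):

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}  # illustrative subset

def tokenize(doc: str) -> list[str]:
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", doc.lower()) if t not in STOP_WORDS]

docs = ["The cat sat", "The dog sat", "The cat played"]
tokenized = [tokenize(d) for d in docs]
vocabulary = sorted({t for toks in tokenized for t in toks})
# The gate holds when every document tokenized and the vocabulary
# is a sane size for the corpus.
```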
Phase 2: Core Algorithm
- Compute TF(t,d) for each term in each document (raw count, log-normalized, or boolean)
- Compute IDF(t) = log(N / DF(t)) where DF(t) = number of documents containing term t
- Compute TF-IDF(t,d) = TF(t,d) × IDF(t)
- Optionally L2-normalize document vectors for cosine similarity
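The steps above, sketched with log-normalized TF and L2 normalization (a hand-rolled illustration, not a library API):

```python
import math
from collections import Counter

def tf_idf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one L2-normalized TF-IDF vector per tokenized document."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # DF(t)
    idf = {t: math.log(n / df[t]) for t in df}          # IDF(t) = log(N / DF(t))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # log-normalized TF: 1 + log(count)
        vec = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

vecs = tf_idf_vectors([["cat", "sat"], ["dog", "sat"], ["cat", "played"]])
# In doc 1, "dog" (DF=1) outweighs "sat" (DF=2).
```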
Phase 3: Verification
Check: terms appearing in all documents have IDF ≈ 0. Rare terms have high IDF.
Gate: Score distribution is reasonable; common words score low.
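The Phase 3 checks can be automated; a sketch (the `check_idf` name is illustrative):

```python
import math
from collections import Counter

def check_idf(docs: list[list[str]]) -> None:
    """Verify the two IDF invariants: ubiquitous terms ~0, singleton terms log(N)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    for t, d in df.items():
        idf = math.log(n / d)
        if d == n:                      # term appears in every document
            assert abs(idf) < 1e-12, f"IDF({t!r}) should be 0"
        if d == 1:                      # term appears in exactly one document
            assert abs(idf - math.log(n)) < 1e-12, f"IDF({t!r}) should be log(N)"

check_idf([["the", "cat"], ["the", "dog"], ["the", "bird"]])  # passes: IDF("the") == 0
```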
Phase 4: Output
Return scored terms per document or ranked documents per query.
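Query scoring can be sketched as summing the matching term weights per document (a simplification of full cosine scoring; the `rank` name and document IDs are illustrative):

```python
def rank(query_terms: list[str], vectors: list[dict[str, float]], ids: list[str]) -> list[dict]:
    """Score each document by summing its TF-IDF weights for the query terms."""
    results = []
    for doc_id, vec in zip(ids, vectors):
        matching = [t for t in query_terms if t in vec]
        if matching:
            score = round(sum(vec[t] for t in matching), 3)
            results.append({"document": doc_id, "score": score, "matching_terms": matching})
    return sorted(results, key=lambda r: r["score"], reverse=True)

hits = rank(["cat"], [{"cat": 0.7, "sat": 0.7}, {"dog": 1.0}, {"cat": 0.9}], ["d1", "d2", "d3"])
# d3 ranks first (0.9), then d1 (0.7); d2 has no matching terms and is dropped.
```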
Output Format
输出格式
```json
{
  "query_results": [{"document": "doc_id", "score": 0.73, "matching_terms": ["term1", "term2"]}],
  "metadata": {"corpus_size": 1000, "vocabulary_size": 5000, "tf_variant": "log_normalized"}
}
```
Examples
Sample I/O
Input: Corpus: ["the cat sat", "the dog sat", "the cat played"], Query: "cat"
Expected: TF("cat", doc1)=1/3, DF("cat")=2, IDF=log(3/2)=0.405. TF-IDF(doc1)=0.135, TF-IDF(doc3)=0.135, TF-IDF(doc2)=0
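The worked numbers can be checked in a few lines (raw-count TF divided by document length, no stop-word removal, matching the example above):

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "played"]]
n = len(docs)
df_cat = sum("cat" in d for d in docs)               # DF("cat") = 2
idf_cat = math.log(n / df_cat)                       # log(3/2) ≈ 0.405
scores = [d.count("cat") / len(d) * idf_cat for d in docs]
# scores ≈ [0.135, 0.0, 0.135]
```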
Edge Cases
| Input | Expected | Why |
|---|---|---|
| Term in all docs | Score = 0 | IDF = log(N/N) = 0 |
| Term in one doc | Highest IDF | log(N/1) = log(N) |
| Empty document | All scores = 0 | No terms to score |
Gotchas
- Stop words matter: Without stop word removal, "the", "is", "a" dominate TF but have zero IDF. Preprocess properly.
- TF variant choice: Raw count, log(1+count), or boolean TF produce very different rankings. Log normalization prevents long documents from dominating.
- IDF smoothing: Add 1 to denominator to avoid division by zero for unknown query terms: IDF = log(N / (DF+1)) + 1.
- Not semantic: "car" and "automobile" are treated as completely different terms. TF-IDF has no concept of synonymy.
- Corpus dependency: IDF values change when the corpus changes. Adding documents alters all scores.
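The smoothed IDF from the gotcha above, as a one-liner (note the trade-off: with smoothing, a term in every document no longer scores exactly zero):

```python
import math

def idf_smoothed(df: int, n_docs: int) -> float:
    """IDF = log(N / (DF + 1)) + 1 — defined even for DF == 0 (unseen query terms)."""
    return math.log(n_docs / (df + 1)) + 1

idf_smoothed(df=0, n_docs=100)    # unseen term: no ZeroDivisionError
idf_smoothed(df=100, n_docs=100)  # ubiquitous term: log(100/101) + 1 ≈ 0.99, not 0
```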
Scripts
| Script | Description | Usage |
|---|---|---|
| scripts/tfidf.py | Compute TF-IDF vectors, top terms per document, and query scoring | python scripts/tfidf.py --verify |
Run with --verify to execute built-in sanity tests.
References
- For BM25 (improved TF-IDF), see references/bm25-comparison.md
- For efficient inverted index implementation, see references/inverted-index.md