algo-rank-wilson
Wilson Score Ranking
Overview
The Wilson score interval provides a lower confidence bound on the true proportion of positive ratings. Unlike a simple average, it penalizes items with few ratings, so an item rated 5/5 from a single review does not outrank an item rated 4.8/5 across 1000 reviews. The bound is computed in O(1) per item.
When to Use
Trigger conditions:
- Ranking items by user ratings when review counts vary widely
- Building "top rated" or "best of" lists that are fair to well-reviewed items
- Sorting binary feedback (upvote/downvote) with confidence
When NOT to use:
- For continuous scores (use Bayesian average instead)
- When comparing items with similar sample sizes (simple average suffices)
Algorithm
IRON LAW: Never Rank by Simple Average When Sample Sizes Differ
A 5.0 average from 1 review is NOT better than 4.8 from 1000 reviews.
Wilson Score lower bound accounts for sample uncertainty:
Items with few ratings get a LOWER bound, properly reflecting our
uncertainty about their true quality.
Phase 1: Input Validation
Collect per item: number of positive ratings (p), total ratings (n). For star ratings, convert to binary (e.g., 4-5 stars = positive).
Gate: n > 0 for all items, confidence level chosen (typically 95%, z=1.96).
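The validation and binary conversion described above can be sketched as follows. This is a minimal illustration, not the project's script: the function name `binarize_star_ratings` and the `positive_threshold` parameter are hypothetical, and the 4-star cutoff simply mirrors the example given in this phase.

```python
def binarize_star_ratings(stars, positive_threshold=4):
    """Collapse 1-5 star ratings into (positive, total) counts.

    positive_threshold is an assumed parameter: ratings at or above it
    count as positive, everything else as negative.
    """
    if not stars:
        # Gate: n > 0. Items without ratings cannot be scored.
        raise ValueError("item has no ratings; exclude it before ranking")
    if any(s not in (1, 2, 3, 4, 5) for s in stars):
        raise ValueError("star ratings must be integers from 1 to 5")
    positive = sum(1 for s in stars if s >= positive_threshold)
    return positive, len(stars)


# Two of the five ratings fall below the 4-star cutoff.
print(binarize_star_ratings([5, 4, 3, 5, 1]))
```

Lowering the cutoff to 3 stars would count the 3-star rating as positive as well, which is why the threshold should be fixed before any items are compared.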
Phase 2: Core Algorithm
- Compute observed proportion: p̂ = positive / total
- Wilson lower bound: (p̂ + z²/(2n) - z × √(p̂(1-p̂)/n + z²/(4n²))) / (1 + z²/n)
- Rank by Wilson lower bound descending (conservative estimate of true quality)
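The three steps above translate fairly directly into code. The sketch below is one possible implementation under stated assumptions: `wilson_lower_bound` and `rank_items` are hypothetical names, and items are assumed to arrive as a dict mapping a name to a `(positive, total)` pair.

```python
import math


def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for positive/total."""
    if total <= 0:
        raise ValueError("total must be greater than 0")
    if not 0 <= positive <= total:
        raise ValueError("positive must be between 0 and total")
    p_hat = positive / total                      # observed proportion
    z2 = z * z
    centre = p_hat + z2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)


def rank_items(counts, z=1.96):
    """Return (name, lower_bound) pairs sorted by lower bound, best first."""
    scored = [(name, wilson_lower_bound(pos, n, z)) for name, (pos, n) in counts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for name, score in rank_items({"A": (1, 1), "B": (950, 1000)}):
    print(f"{name}: {score:.4f}")
```

Each score costs a constant number of arithmetic operations, so ranking is dominated by the sort.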
Phase 3: Verification
Check: among items with the same positive proportion, those with more reviews rank higher, and items with very few reviews are penalized accordingly.
Gate: Ranking intuitively correct on manual inspection.
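Part of this check can be automated before the manual inspection. The sketch below holds the positive proportion fixed and confirms that the lower bound rises with sample size while staying below the observed proportion. The helper `check_monotone_in_sample_size` is a hypothetical name, and the proportion and sample sizes are arbitrary test values.

```python
import math


def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval (same formula as Phase 2)."""
    p_hat = positive / total
    z2 = z * z
    centre = p_hat + z2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)


def check_monotone_in_sample_size(proportion=0.9, sizes=(10, 100, 1000, 10000)):
    """Return (ok, bounds): ok is True if the bound strictly grows with n."""
    bounds = [wilson_lower_bound(round(proportion * n), n) for n in sizes]
    ok = all(a < b for a, b in zip(bounds, bounds[1:]))
    return ok, bounds


ok, bounds = check_monotone_in_sample_size()
print(ok, [round(b, 4) for b in bounds])
```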
Phase 4: Output
Return ranked items with scores and confidence intervals.
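As a rough sketch of this phase, the snippet below assembles a payload using the field names from the Output Format section. The function name `build_output`, the input shape (a dict of name to `(positive, total)`), and the rounding to four decimals are assumptions made for illustration. Only the lower bound is emitted here, matching those fields.

```python
import json
import math


def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval (same formula as Phase 2)."""
    p_hat = positive / total
    z2 = z * z
    centre = p_hat + z2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)


def build_output(counts, confidence=0.95, z=1.96):
    """Build the rankings/metadata structure, best item first."""
    rankings = [
        {
            "item": name,
            "wilson_lower": round(wilson_lower_bound(pos, n, z), 4),
            "positive": pos,
            "total": n,
            "proportion": round(pos / n, 4),
        }
        for name, (pos, n) in counts.items()
    ]
    rankings.sort(key=lambda row: row["wilson_lower"], reverse=True)
    metadata = {"confidence": confidence, "z": z, "items_ranked": len(rankings)}
    return {"rankings": rankings, "metadata": metadata}


payload = build_output({"Product_C": (3, 3), "Product_A": (950, 1000)})
print(json.dumps(payload, indent=2))
```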
Output Format
json
{
  "rankings": [{"item": "Product_A", "wilson_lower": 0.93, "positive": 950, "total": 1000, "proportion": 0.95}],
  "metadata": {"confidence": 0.95, "z": 1.96, "items_ranked": 500}
}
Examples
Sample I/O
Input: Item A: 1 positive / 1 total (100%). Item B: 950 positive / 1000 total (95%).
Expected: B ranks higher. Wilson lower: A ≈ 0.21, B ≈ 0.93. A single review gives very little confidence.
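Working the Phase 2 formula through by hand for both items, with z = 1.96 (so z² = 3.8416), gives the arithmetic below. For reference, a figure near 0.05 for item A is what the continuity-corrected variant of the interval produces, which is a different formula from the one in Phase 2.

```latex
\begin{align*}
\text{Item A } (\hat{p}=1,\ n=1):\quad
& \frac{1 + \frac{3.8416}{2} - 1.96\sqrt{0 + \frac{3.8416}{4}}}{1 + 3.8416}
  = \frac{1 + 1.9208 - 1.9208}{4.8416} \approx 0.207 \\[4pt]
\text{Item B } (\hat{p}=0.95,\ n=1000):\quad
& \frac{0.95 + 0.0019208 - 1.96\sqrt{0.0000475 + 0.00000096}}{1.0038416}
  \approx \frac{0.93828}{1.00384} \approx 0.935
\end{align*}
```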
Edge Cases
| Input | Expected | Why |
|---|---|---|
| 0 reviews | Cannot rank | n=0, undefined. Exclude or assign minimum |
| 0 positive, 100 total | Lower bound = 0 | Genuinely bad item; the lower bound is exactly 0 whenever there are no positives |
| 1M positive, 1M total | Lower bound ≈ 1.0 | Massive sample, high confidence in 100% |
Gotchas
- Binary conversion: For 5-star ratings, the positive/negative threshold matters. 4+ stars as positive? 3+ stars? Different thresholds produce different rankings.
- Not for continuous data: Wilson Score is for proportions (binary outcomes). For continuous ratings, use Bayesian average with a prior.
- Cold start: New items with zero reviews can't be ranked. Use a minimum review threshold or Bayesian smoothing.
- Confidence level choice: Higher confidence (99%) penalizes small samples more aggressively. 95% is standard but tune for your use case.
- Sorting by lower bound is conservative: This approach favors well-known items. For discovery/exploration, consider also boosting items with high upper bounds (potential hidden gems).
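Two of the points above, the effect of the confidence level and the use of upper bounds for exploration, can be illustrated with a short sketch. It derives z from a confidence level using the standard library's `statistics.NormalDist` and returns both ends of the interval. The function names `z_for_confidence` and `wilson_interval` are hypothetical, and the sample counts are made up for illustration.

```python
import math
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-sided z-score for a confidence level, e.g. 0.95 -> about 1.96."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def wilson_interval(positive, total, z):
    """Return (lower, upper) bounds of the Wilson score interval."""
    p_hat = positive / total
    z2 = z * z
    centre = p_hat + z2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total))
    denom = 1 + z2 / total
    return (centre - margin) / denom, (centre + margin) / denom


# Same 90% positive proportion at two sample sizes and two confidence levels.
for confidence in (0.95, 0.99):
    z = z_for_confidence(confidence)
    small = wilson_interval(9, 10, z)
    large = wilson_interval(900, 1000, z)
    print(confidence, [round(x, 3) for x in small], [round(x, 3) for x in large])
```

A wide gap between the lower and upper bound marks an item whose quality is still uncertain, which is the signal an exploration-oriented ranking could use to surface potential hidden gems.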
Scripts
| Script | Description | Usage |
|---|---|---|
| scripts/wilson_score.py | Compute Wilson score interval and rank items | |
Run the following command to execute the built-in sanity tests:
python scripts/wilson_score.py --verify
References
- For the Bayesian average alternative, see references/bayesian-average.md
- For Reddit's ranking algorithm (Wilson-based), see references/reddit-ranking.md