algo-rank-wilson

Wilson Score Ranking

Overview

The Wilson score interval gives a lower confidence bound on the true proportion of positive ratings. Unlike a simple average, it penalizes items with few ratings, so an item with a single 5-star review does not outrank an item averaging 4.8 stars across 1000 reviews. The score is computed in O(1) time per item.

When to Use

Trigger conditions:

Ranking items by user ratings when review counts vary widely
Building "top rated" or "best of" lists that are fair to well-reviewed items
Sorting binary feedback (upvote/downvote) with confidence

When NOT to use:

For continuous scores (use Bayesian average instead)
When comparing items with similar sample sizes (simple average suffices)

Algorithm

IRON LAW: Never Rank by Simple Average When Sample Sizes Differ
A 5.0 average from 1 review is NOT better than 4.8 from 1000 reviews.
The Wilson Score lower bound accounts for sample uncertainty:
items with few ratings receive a lower score, properly reflecting our
uncertainty about their true quality.

Phase 1: Input Validation

Collect, for each item, the number of positive ratings (p) and the total number of ratings (n). Convert star ratings to binary first (e.g., 4-5 stars = positive). Gate: n > 0 for every item, and a confidence level has been chosen (typically 95%, z = 1.96).
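
The collection and gate steps can be sketched as follows. This is a minimal illustration, not the contents of `scripts/wilson_score.py`: the `RatingCounts` container, the `binarize_stars` and `validate` helpers, and the default threshold of 4 stars are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatingCounts:
    """Binary rating summary for one item (hypothetical container)."""
    item: str
    positive: int  # p in the text
    total: int     # n in the text


def binarize_stars(item, stars, positive_threshold=4):
    """Collapse 1-5 star ratings into positive/total counts.

    Assumption: ratings >= positive_threshold count as positive.
    """
    positive = sum(1 for s in stars if s >= positive_threshold)
    return RatingCounts(item=item, positive=positive, total=len(stars))


def validate(counts):
    """Apply the Phase 1 gate; raise ValueError on unusable input."""
    if counts.total <= 0:
        raise ValueError(f"{counts.item}: n must be > 0")
    if not 0 <= counts.positive <= counts.total:
        raise ValueError(f"{counts.item}: need 0 <= p <= n")
    return counts


a = validate(binarize_stars("Product_A", [5, 4, 3, 5, 2]))
print(a.positive, a.total)  # three of the five ratings are 4 stars or more
```

Raising on `n = 0` is one possible policy; filtering such items out before ranking would satisfy the same gate.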

Phase 2: Core Algorithm

Compute observed proportion: p̂ = positive / total
Wilson lower bound: (p̂ + z²/(2n) - z × √(p̂(1 - p̂)/n + z²/(4n²))) / (1 + z²/n)
Rank by Wilson lower bound descending (conservative estimate of true quality)
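
The three steps above translate fairly directly into Python. The sketch below assumes items arrive as `(name, positive, total)` tuples; the function names `wilson_lower_bound` and `rank` are illustrative and may not match the bundled script.

```python
import math


def wilson_lower_bound(positive, total, z=1.96):
    """Lower end of the Wilson score interval for positive/total."""
    if total <= 0:
        raise ValueError("total must be > 0")
    p_hat = positive / total          # observed proportion
    z2 = z * z
    centre = p_hat + z2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)


def rank(items, z=1.96):
    """Sort (name, positive, total) tuples by Wilson lower bound, best first."""
    scored = [(name, wilson_lower_bound(p, n, z)) for name, p, n in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for name, score in rank([("A", 1, 1), ("B", 950, 1000)]):
    print(f"{name}: {score:.4f}")
```

Each score needs only `p`, `n`, and `z`, so the per-item cost is constant; the overall ranking is dominated by the sort.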

Phase 3: Verification

Check: at the same positive proportion, items with more reviews rank above items with fewer reviews, and items with very few reviews are visibly penalized. Gate: the ranking looks intuitively correct on manual inspection.
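
Part of this check can be automated before the manual inspection. The following is a hedged sketch with a hypothetical `sanity_checks` helper and made-up sample counts; it is independent of the script's own `--verify` mode.

```python
import math


def wilson_lower_bound(positive, total, z=1.96):
    """Lower end of the Wilson score interval for positive/total."""
    p_hat = positive / total
    z2 = z * z
    centre = p_hat + z2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)


def sanity_checks():
    """Return named Phase 3 checks mapped to True/False."""
    # Same 90% proportion at three sample sizes (illustrative data).
    same_ratio = [wilson_lower_bound(p, n) for p, n in [(9, 10), (90, 100), (900, 1000)]]
    return {
        # The bound should grow strictly with the sample size.
        "more_reviews_rank_higher": same_ratio == sorted(same_ratio)
        and len(set(same_ratio)) == 3,
        # A lone perfect review should not beat a large, strong sample.
        "single_review_penalized": wilson_lower_bound(1, 1) < wilson_lower_bound(950, 1000),
    }


print(sanity_checks())
```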

Phase 4: Output

Return ranked items with scores and confidence intervals.

Output Format

```json
{
  "rankings": [{"item": "Product_A", "wilson_lower": 0.935, "positive": 950, "total": 1000, "proportion": 0.95}],
  "metadata": {"confidence": 0.95, "z": 1.96, "items_ranked": 500}
}
```

Examples

Sample I/O

Input: Item A: 1 positive / 1 total (100%). Item B: 950 positive / 1000 total (95%). Expected: B ranks higher. Wilson lower bounds at z = 1.96: A ≈ 0.21, B ≈ 0.93. A single review provides very little confidence, so A's bound falls far below its observed 100%.
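
For readers who want to check the arithmetic, here is the lower-bound formula evaluated by hand for both items at z = 1.96 (so z² = 3.8416). Intermediate values are rounded for display.

```latex
\begin{aligned}
\text{A: } \hat{p} = 1,\; n = 1:\quad
  &\frac{1 + \tfrac{3.8416}{2} - 1.96\sqrt{0 + \tfrac{3.8416}{4}}}{1 + 3.8416}
   = \frac{1 + 1.9208 - 1.9208}{4.8416}
   = \frac{1}{4.8416} \approx 0.207 \\[6pt]
\text{B: } \hat{p} = 0.95,\; n = 1000:\quad
  &\frac{0.95 + 0.00192 - 1.96\sqrt{0.0000475 + 0.00000096}}{1.00384}
   \approx \frac{0.95192 - 0.01364}{1.00384} \approx 0.935
\end{aligned}
```

For A the square-root term cancels the z²/(2n) correction exactly, so the bound reduces to 1/(1 + z²).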

Edge Cases

| Input | Expected | Why |
| --- | --- | --- |
| 0 reviews | Cannot rank | n = 0, so the score is undefined. Exclude the item or assign a minimum score |
| 0 positive, 100 total | Very low score (lower bound = 0) | Genuinely bad item, high confidence |
| 1M positive, 1M total | Lower bound ≈ 1.0 | Massive sample, high confidence in 100% |
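
One way to handle these cases in code is sketched below. The `split_rankable` helper and the choice to set unrated items aside (rather than assign them a minimum score) are assumptions for illustration. The `max(0.0, ...)` clamp guards against tiny negative values from floating-point rounding when there are no positive ratings.

```python
import math


def wilson_lower_bound(positive, total, z=1.96):
    """Wilson lower bound, clamped at 0 to absorb rounding error."""
    p_hat = positive / total
    z2 = z * z
    centre = p_hat + z2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total))
    return max(0.0, (centre - margin) / (1 + z2 / total))


def split_rankable(items):
    """Separate (name, positive, total) tuples with n > 0 from unrated ones."""
    rankable = [it for it in items if it[2] > 0]
    unrated = [it for it in items if it[2] == 0]
    return rankable, unrated


items = [("new", 0, 0), ("bad", 0, 100), ("huge", 1_000_000, 1_000_000)]
rankable, unrated = split_rankable(items)
scores = {name: wilson_lower_bound(p, n) for name, p, n in rankable}
print(scores, [name for name, _, _ in unrated])
```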

Gotchas

Binary conversion: For 5-star ratings, the positive/negative threshold matters. 4+ stars as positive? 3+ stars? Different thresholds produce different rankings.
Not for continuous data: Wilson Score is for proportions (binary outcomes). For continuous ratings, use Bayesian average with a prior.
Cold start: New items with zero reviews can't be ranked. Use a minimum review threshold or Bayesian smoothing.
Confidence level choice: Higher confidence (99%) penalizes small samples more aggressively. 95% is standard but tune for your use case.
Sorting by lower bound is conservative: This approach favors well-known items. For discovery/exploration, consider also boosting items with high upper bounds (potential hidden gems).
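
The last two points can be explored numerically. The sketch below derives z from a confidence level with the standard library's `statistics.NormalDist` and returns both ends of the interval, so the upper bound is available for exploration-style boosting. The function names and the 9-out-of-10 sample are illustrative assumptions.

```python
import math
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-sided critical value, e.g. 0.95 gives roughly 1.96."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def wilson_interval(positive, total, confidence=0.95):
    """Return (lower, upper) Wilson bounds for positive/total."""
    if total <= 0:
        raise ValueError("total must be > 0")
    z = z_for_confidence(confidence)
    p_hat = positive / total
    z2 = z * z
    centre = p_hat + z2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total))
    denom = 1 + z2 / total
    return (centre - margin) / denom, (centre + margin) / denom


low95, up95 = wilson_interval(9, 10, 0.95)
low99, up99 = wilson_interval(9, 10, 0.99)
print(f"95%: [{low95:.3f}, {up95:.3f}]")
print(f"99%: [{low99:.3f}, {up99:.3f}]")
```

At 99% the interval is wider in both directions, so the same small sample gets a lower ranking score but also a higher upper bound. How much weight to give that upper bound when surfacing new items is a product decision rather than a statistical one.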

Scripts

| Script | Description | Usage |
| --- | --- | --- |
| `scripts/wilson_score.py` | Compute Wilson score interval and rank items | `python scripts/wilson_score.py --help` |

Run `python scripts/wilson_score.py --verify` to execute the built-in sanity tests.

References

For the Bayesian average alternative, see `references/bayesian-average.md`.
For Reddit's ranking algorithm (Wilson-based), see `references/reddit-ranking.md`.
