CTR Prediction Model
Overview
CTR prediction estimates the probability that a user clicks on an ad given the context (user, query, ad, position). It forms the core of ad ranking: AdRank = Bid × pCTR. Models are typically logistic regression or gradient-boosted trees, trained on billions of impressions.
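The AdRank rule can be sketched directly (the ad records and numbers are made up for illustration):

```python
# Rank candidate ads by AdRank = Bid × pCTR (field names are illustrative).
ads = [
    {"id": "ad1", "bid": 2.00, "pctr": 0.030},
    {"id": "ad2", "bid": 1.50, "pctr": 0.050},
    {"id": "ad3", "bid": 3.00, "pctr": 0.015},
]

for ad in ads:
    ad["ad_rank"] = ad["bid"] * ad["pctr"]

ranked = sorted(ads, key=lambda a: a["ad_rank"], reverse=True)
print([a["id"] for a in ranked])  # → ['ad2', 'ad1', 'ad3']
```

Note that the highest bid does not win: ad3 bids the most but its low pCTR gives it the lowest AdRank.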
When to Use
Trigger conditions:
- Building or improving an ad ranking system
- Predicting click probability for bid optimization
- Evaluating ad creative effectiveness from feature analysis
When NOT to use:
- When predicting post-click conversions (use conversion rate model)
- When setting bid amounts (use bidding strategy skill)
Algorithm
IRON LAW: A CTR Model Must Be CALIBRATED
Predicting relative ranking is insufficient. The predicted probability
must MATCH actual click frequency (e.g., predicted 5% → 5 clicks per
100 impressions). Without calibration, bid optimization breaks:
Expected Value = Bid × pCTR × pConversion
If pCTR is off by 2x, bids are wrong by 2x.
Phase 1: Input Validation
Collect impression logs with: user features, ad features, query features, position, click label (0/1). Handle class imbalance (CTR typically 1-5%).
Gate: Sufficient volume (100K+ impressions), click labels verified, no data leakage from position.
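The volume and label parts of this gate can be expressed as a simple pre-training check; a sketch with illustrative field names and thresholds (position-leakage auditing is a separate, data-specific step not covered here):

```python
def validate_impressions(impressions, min_volume=100_000):
    """Pre-training gate: enough volume, valid 0/1 labels, plausible CTR."""
    if len(impressions) < min_volume:
        return False, "insufficient volume"
    if not {imp["click"] for imp in impressions} <= {0, 1}:
        return False, "click labels must be 0 or 1"
    ctr = sum(imp["click"] for imp in impressions) / len(impressions)
    if not 0.001 <= ctr <= 0.20:  # CTR is typically 1-5%; flag anomalies
        return False, f"suspicious overall CTR: {ctr:.4f}"
    return True, "ok"
```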
Phase 2: Core Algorithm
- Feature engineering: user demographics, ad category, query-ad match, historical CTR, time/device features
- Train model: logistic regression (interpretable) or GBDT (higher accuracy)
- Calibrate predictions: Platt scaling or isotonic regression on holdout set
- Evaluate: log-loss (calibration) + AUC (ranking quality)
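The train → calibrate → evaluate steps above can be sketched end to end, assuming scikit-learn and synthetic data (the feature weights and sizes are illustrative):

```python
# A minimal train/calibrate/evaluate sketch using scikit-learn and
# synthetic data; weights and dataset sizes are illustrative.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 5))  # stand-in feature matrix
true_logits = X @ np.array([0.8, 1.5, 0.3, 0.0, 0.0]) - 3.0
y = (rng.random(20_000) < 1 / (1 + np.exp(-true_logits))).astype(int)

X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
raw = model.predict_proba(X_hold)[:, 1]

# Calibrate raw scores with isotonic regression. (In production, fit the
# calibrator on a split separate from the one used for evaluation.)
iso = IsotonicRegression(out_of_bounds="clip").fit(raw, y_hold)
calibrated = np.clip(iso.predict(raw), 1e-6, 1 - 1e-6)

print("AUC:", roc_auc_score(y_hold, raw))         # ranking quality
print("log-loss:", log_loss(y_hold, calibrated))  # calibration quality
```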
Phase 3: Verification
Check calibration: bucket predictions into deciles, compare predicted vs actual CTR per bucket. Plot reliability diagram.
Gate: Calibration curve close to diagonal, AUC > 0.70.
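The decile check can be sketched with plain NumPy (the expected-calibration-error summary is an added convenience, not part of the gate above):

```python
import numpy as np

def calibration_by_decile(p_pred, y_true, n_buckets=10):
    """Sort by predicted CTR, split into deciles, and return the mean
    predicted CTR vs. the actual click rate for each bucket."""
    order = np.argsort(p_pred)
    return [(float(p_pred[idx].mean()), float(y_true[idx].mean()))
            for idx in np.array_split(order, n_buckets)]

def calibration_error(p_pred, y_true, n_buckets=10):
    """Mean absolute gap between predicted and actual CTR across buckets."""
    rows = calibration_by_decile(p_pred, y_true, n_buckets)
    return sum(abs(p - a) for p, a in rows) / len(rows)
```

Plotting the (predicted, actual) pairs gives the reliability diagram; a well-calibrated model's points lie near the diagonal.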
Phase 4: Output
Return predicted CTR with confidence interval and top contributing features.
Output Format
```json
{
  "prediction": {"ctr": 0.035, "confidence_interval": [0.028, 0.042]},
  "top_features": [{"feature": "query_ad_match", "importance": 0.32}],
  "metadata": {"model": "gbdt", "auc": 0.78, "log_loss": 0.21, "calibration_error": 0.008}
}
```
Examples
Sample I/O
Input: Trained logistic regression with 3 features and these coefficients:
intercept: -3.0
position_1: 0.8
query_ad_match: 1.5
user_is_mobile: 0.3
Features for current request: position_1=1, query_ad_match=1, user_is_mobile=1
Expected: logit = -3.0 + 0.8 + 1.5 + 0.3 = -0.4
pCTR = sigmoid(-0.4) = 1/(1 + e^0.4) ≈ 0.401 → 40.1%
Verify: for features all 0 (baseline), pCTR = sigmoid(-3.0) ≈ 0.047 (4.7%). Calibration is checked by bucketing predictions and comparing to actual CTR in each bucket.
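The arithmetic in this example can be checked directly:

```python
import math

coef = {"intercept": -3.0, "position_1": 0.8,
        "query_ad_match": 1.5, "user_is_mobile": 0.3}
features = {"position_1": 1, "query_ad_match": 1, "user_is_mobile": 1}

logit = coef["intercept"] + sum(coef[k] * v for k, v in features.items())
pctr = 1 / (1 + math.exp(-logit))
print(round(logit, 1), round(pctr, 3))  # → -0.4 0.401

# Baseline: all features 0, so only the intercept contributes.
baseline = 1 / (1 + math.exp(3.0))
print(round(baseline, 3))  # → 0.047
```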
Edge Cases
| Input | Expected | Why |
|---|---|---|
| New ad, no history | Use ad category average | Cold start for features |
| Position 1 vs position 4 | Different CTR, same relevance | Position bias inflates top-slot CTR |
| Very rare query | Low confidence | Insufficient training data for that query |
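The cold-start row above can be handled with an explicit fallback; a sketch where the function names, fields, and the 1000-impression threshold are all illustrative:

```python
def predict_ctr(ad, model_pctr, category_avg_ctr, global_avg_ctr=0.03):
    """Fall back to the ad-category average CTR when an ad has too little
    history for its per-ad features to be trustworthy (cold start)."""
    if ad.get("impressions", 0) < 1000:  # illustrative threshold
        return category_avg_ctr.get(ad["category"], global_avg_ctr)
    return model_pctr(ad)  # normal path: use the trained model
```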
Gotchas
- Position bias: Ads in position 1 get more clicks regardless of relevance. Train on position-debiased data or include position as a feature and normalize at inference.
- Data freshness: CTR patterns change rapidly (seasonality, trends). Retrain daily or use online learning.
- Feature leakage: Including click-derived features (e.g., historical CTR of this exact ad-query pair) creates leakage if not handled carefully with time-based splits.
- Class imbalance: 97% no-click, 3% click. Use proper evaluation metrics (log-loss, AUC), not accuracy. Consider downsampling negatives during training.
- Multi-task learning: CTR and conversion rate are related but different. Joint models can improve both by sharing lower layers.
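If negatives are downsampled during training, the model's raw probabilities are inflated and must be corrected at serving time; the standard odds-based correction, as a sketch:

```python
def correct_downsampled_ctr(p_biased, neg_keep_rate):
    """Undo the probability inflation caused by keeping only a fraction
    `neg_keep_rate` of negative examples during training. Derived from
    odds: downsampling negatives by w multiplies the odds by 1/w."""
    return p_biased / (p_biased + (1 - p_biased) / neg_keep_rate)
```

For example, a model trained on 10% of negatives that outputs about 0.236 corresponds to a true pCTR of roughly 0.03.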
References
- For feature engineering best practices, see references/feature-engineering.md
- For position debiasing techniques, see references/position-debiasing.md