algo-ad-ctr


CTR Prediction Model


Overview


CTR prediction estimates the probability that a user clicks on an ad given the context (user, query, ad, position). It forms the core of ad ranking: AdRank = Bid × pCTR. Models are typically logistic regression or gradient-boosted trees, trained on billions of impressions.
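The AdRank formula above can be run on a toy auction; the ad ids, bids, and pCTR values below are invented for illustration:

```python
# Toy ad auction: rank candidates by AdRank = Bid * pCTR.
# All values are illustrative, not from a real system.
ads = [
    {"id": "ad_a", "bid": 2.00, "pctr": 0.010},
    {"id": "ad_b", "bid": 0.50, "pctr": 0.060},
    {"id": "ad_c", "bid": 1.00, "pctr": 0.035},
]

for ad in ads:
    ad["ad_rank"] = ad["bid"] * ad["pctr"]

ranked = sorted(ads, key=lambda ad: ad["ad_rank"], reverse=True)
print([ad["id"] for ad in ranked])  # ['ad_c', 'ad_b', 'ad_a']
```

Note that the highest bidder (ad_a) ranks last: its low pCTR outweighs its bid, which is exactly why a miscalibrated pCTR corrupts the auction.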

When to Use


Trigger conditions:
  • Building or improving an ad ranking system
  • Predicting click probability for bid optimization
  • Evaluating ad creative effectiveness from feature analysis
When NOT to use:
  • When predicting post-click conversions (use conversion rate model)
  • When setting bid amounts (use bidding strategy skill)

Algorithm


IRON LAW: A CTR Model Must Be CALIBRATED
Predicting relative ranking is insufficient. The predicted probability
must MATCH actual click frequency (e.g., predicted 5% → 5 clicks per
100 impressions). Without calibration, bid optimization breaks:
  Expected Value = Bid × pCTR × pConversion
  If pCTR is off by 2x, bids are wrong by 2x.
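The sensitivity claim can be checked numerically; the bid and probabilities below are invented for illustration:

```python
# Why calibration matters: expected value scales linearly with pCTR,
# so a systematic 2x error in pCTR makes the implied value off by 2x.
bid, p_conversion = 1.50, 0.02

true_pctr = 0.03
miscalibrated_pctr = 2 * true_pctr  # model predicts 2x too high

ev_true = bid * true_pctr * p_conversion
ev_miscalibrated = bid * miscalibrated_pctr * p_conversion
print(ev_miscalibrated / ev_true)  # prints 2.0
```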

Phase 1: Input Validation


Collect impression logs with: user features, ad features, query features, position, click label (0/1). Handle class imbalance (CTR typically 1-5%). Gate: Sufficient volume (100K+ impressions), click labels verified, no data leakage from position.
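A minimal sketch of the Phase 1 gate; the column names are assumptions, and a real log would also carry user, ad, and query features:

```python
# Sketch of the Phase 1 gate on a (tiny, synthetic) impression log.
import pandas as pd

log = pd.DataFrame({
    "click":    [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],   # binary click label
    "position": [1, 2, 1, 3, 4, 1, 2, 1, 3, 2],
})

n = len(log)
ctr = log["click"].mean()

assert set(log["click"].unique()) <= {0, 1}, "click label must be 0/1"
volume_ok = n >= 100_000        # this toy log fails the volume gate on purpose
print(f"impressions={n}, ctr={ctr:.1%}, volume_ok={volume_ok}")
```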

Phase 2: Core Algorithm


  1. Feature engineering: user demographics, ad category, query-ad match, historical CTR, time/device features
  2. Train model: logistic regression (interpretable) or GBDT (higher accuracy)
  3. Calibrate predictions: Platt scaling or isotonic regression on holdout set
  4. Evaluate: log-loss (calibration) + AUC (ranking quality)
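Steps 2–4 above can be sketched with scikit-learn. This is a minimal sketch on synthetic data, using a Platt-style recalibration (a 1-D logistic regression on the base model's log-odds scores) fit on a separate holdout and evaluated on a third split; a production pipeline would use logged impression features instead:

```python
# Train a GBDT, recalibrate on a holdout (Platt-style), evaluate log-loss + AUC.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 5))                     # stand-in features
true_logit = -3.0 + 1.2 * X[:, 0] + 0.5 * X[:, 1]    # low base rate, like real CTR
y = (rng.uniform(size=len(X)) < 1 / (1 + np.exp(-true_logit))).astype(int)

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_calib, X_eval, y_calib, y_eval = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

gbdt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Platt scaling: fit a sigmoid on the model's log-odds scores (decision_function)
# using the calibration split, never the training split.
platt = LogisticRegression().fit(gbdt.decision_function(X_calib).reshape(-1, 1), y_calib)
p_eval = platt.predict_proba(gbdt.decision_function(X_eval).reshape(-1, 1))[:, 1]

print(f"log-loss={log_loss(y_eval, p_eval):.3f}  AUC={roc_auc_score(y_eval, p_eval):.3f}")
```

scikit-learn's `CalibratedClassifierCV` wraps the same idea (`method="sigmoid"` or `"isotonic"`); the manual version is shown here to make the three-way split explicit.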

Phase 3: Verification


Check calibration: bucket predictions into deciles, compare predicted vs actual CTR per bucket. Plot reliability diagram. Gate: Calibration curve close to diagonal, AUC > 0.70.
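The decile check described above reduces to a few lines of NumPy; the predictions here are synthetic and generated to be perfectly calibrated, so the two columns should match:

```python
# Decile-bucket calibration check: compare mean predicted CTR to observed
# CTR per bucket (the numeric form of a reliability diagram).
import numpy as np

rng = np.random.default_rng(1)
p_pred = rng.uniform(0.0, 0.1, size=100_000)                   # model's pCTR
clicks = (rng.uniform(size=p_pred.size) < p_pred).astype(int)  # calibrated world

order = np.argsort(p_pred)
for bucket in np.array_split(order, 10):   # deciles by predicted pCTR
    print(f"predicted={p_pred[bucket].mean():.4f}  actual={clicks[bucket].mean():.4f}")
# A calibrated model keeps each row's two numbers close (the diagonal of the
# reliability diagram); large per-bucket gaps mean recalibration is needed.
```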

Phase 4: Output


Return predicted CTR with confidence interval and top contributing features.

Output Format


```json
{
  "prediction": {"ctr": 0.035, "confidence_interval": [0.028, 0.042]},
  "top_features": [{"feature": "query_ad_match", "importance": 0.32}],
  "metadata": {"model": "gbdt", "auc": 0.78, "log_loss": 0.21, "calibration_error": 0.008}
}
```

Examples


Sample I/O


Input: Trained logistic regression with 3 features and these coefficients:
intercept: -3.0
position_1:  0.8
query_ad_match: 1.5
user_is_mobile: 0.3
Features for current request: position_1=1, query_ad_match=1, user_is_mobile=1
Expected:
  logit = -3.0 + 0.8 + 1.5 + 0.3 = -0.4
  pCTR = sigmoid(-0.4) = 1/(1 + e^0.4) ≈ 0.401 → 40.1%
Verify: for features all 0 (baseline), pCTR = sigmoid(-3.0) ≈ 0.047 (4.7%). Calibration is checked by bucketing predictions and comparing to actual CTR in each bucket.
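The worked example can be executed directly to confirm both numbers:

```python
# Logistic regression scoring by hand, using the coefficients above.
import math

coef = {"intercept": -3.0, "position_1": 0.8, "query_ad_match": 1.5, "user_is_mobile": 0.3}
features = {"position_1": 1, "query_ad_match": 1, "user_is_mobile": 1}

logit = coef["intercept"] + sum(coef[name] * value for name, value in features.items())
pctr = 1 / (1 + math.exp(-logit))
print(f"logit={logit:.1f}  pCTR={pctr:.3f}")   # logit=-0.4  pCTR=0.401

baseline = 1 / (1 + math.exp(3.0))             # all features 0
print(f"baseline pCTR={baseline:.3f}")         # baseline pCTR=0.047
```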

Edge Cases


Input                    | Expected                      | Why
New ad, no history       | Use ad category average       | Cold start for features
Position 1 vs position 4 | Different CTR, same relevance | Position bias inflates top-slot CTR
Very rare query          | Low confidence                | Insufficient training data for that query

Gotchas


  • Position bias: Ads in position 1 get more clicks regardless of relevance. Train on position-debiased data or include position as a feature and normalize at inference.
  • Data freshness: CTR patterns change rapidly (seasonality, trends). Retrain daily or use online learning.
  • Feature leakage: Including click-derived features (e.g., historical CTR of this exact ad-query pair) creates leakage if not handled carefully with time-based splits.
  • Class imbalance: 97% no-click, 3% click. Use proper evaluation metrics (log-loss, AUC), not accuracy. Consider downsampling negatives during training.
  • Multi-task learning: CTR and conversion rate are related but different. Joint models can improve both by sharing lower layers.
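The class-imbalance note suggests downsampling negatives; that changes the base rate the model sees, so its probabilities must be corrected back afterward or calibration is destroyed. A standard correction (sketched here; keeping a fraction w of negatives scales the odds by 1/w):

```python
# Correcting pCTR after negative downsampling. If only a fraction w of
# negatives is kept for training, the model learns inflated odds; this
# maps them back to the original scale: p = p' / (p' + (1 - p') / w).
def correct_downsampled_pctr(p_sampled: float, w: float) -> float:
    """Undo negative downsampling at keep rate w (0 < w <= 1)."""
    return p_sampled / (p_sampled + (1 - p_sampled) / w)

# Example: keeping 10% of negatives turns a true 3% CTR into ~23.6% in the
# training distribution; the correction recovers the original 3%.
p_true, w = 0.03, 0.1
odds_sampled = (p_true / (1 - p_true)) / w       # downsampling scales odds by 1/w
p_sampled = odds_sampled / (1 + odds_sampled)
print(round(p_sampled, 4))                       # 0.2362
print(round(correct_downsampled_pctr(p_sampled, w), 4))  # 0.03
```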

References


  • For feature engineering best practices, see
    references/feature-engineering.md
  • For position debiasing techniques, see
    references/position-debiasing.md