CTR Prediction Model
Overview
CTR prediction estimates the probability that a user clicks on an ad given the context (user, query, ad, position). It forms the core of ad ranking: AdRank = Bid × pCTR. Models are typically logistic regression or gradient-boosted trees, trained on billions of impressions.
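The AdRank rule can be sketched directly (the ad records and numbers are made up for illustration):

```python
# Rank candidate ads by AdRank = Bid × pCTR (field names are illustrative).
ads = [
    {"id": "ad1", "bid": 2.00, "pctr": 0.030},
    {"id": "ad2", "bid": 1.50, "pctr": 0.050},
    {"id": "ad3", "bid": 3.00, "pctr": 0.015},
]

for ad in ads:
    ad["ad_rank"] = ad["bid"] * ad["pctr"]

ranked = sorted(ads, key=lambda a: a["ad_rank"], reverse=True)
print([a["id"] for a in ranked])  # → ['ad2', 'ad1', 'ad3']
```

Note that the highest bid does not win: ad3 bids the most but its low pCTR gives it the lowest AdRank.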
When to Use
Trigger conditions:
- Building or improving an ad ranking system
- Predicting click probability for bid optimization
- Evaluating ad creative effectiveness from feature analysis
When NOT to use:
- When predicting post-click conversions (use conversion rate model)
- When setting bid amounts (use bidding strategy skill)
Algorithm
IRON LAW: A CTR Model Must Be CALIBRATED
Predicting relative ranking is insufficient. The predicted probability
must MATCH actual click frequency (e.g., predicted 5% → 5 clicks per
100 impressions). Without calibration, bid optimization breaks:
Expected Value = Bid × pCTR × pConversion
If pCTR is off by 2x, bids are wrong by 2x.
Phase 1: Input Validation
Collect impression logs with: user features, ad features, query features, position, click label (0/1). Handle class imbalance (CTR typically 1-5%).
Gate: Sufficient volume (100K+ impressions), click labels verified, no data leakage from position.
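The volume and label parts of this gate can be expressed as a simple pre-training check; a sketch with illustrative field names and thresholds (position-leakage auditing is a separate, data-specific step not covered here):

```python
def validate_impressions(impressions, min_volume=100_000):
    """Pre-training gate: enough volume, valid 0/1 labels, plausible CTR."""
    if len(impressions) < min_volume:
        return False, "insufficient volume"
    if not {imp["click"] for imp in impressions} <= {0, 1}:
        return False, "click labels must be 0 or 1"
    ctr = sum(imp["click"] for imp in impressions) / len(impressions)
    if not 0.001 <= ctr <= 0.20:  # CTR is typically 1-5%; flag anomalies
        return False, f"suspicious overall CTR: {ctr:.4f}"
    return True, "ok"
```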
Phase 2: Core Algorithm
- Feature engineering: user demographics, ad category, query-ad match, historical CTR, time/device features
- Train model: logistic regression (interpretable) or GBDT (higher accuracy)
- Calibrate predictions: Platt scaling or isotonic regression on holdout set
- Evaluate: log-loss (calibration) + AUC (ranking quality)
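The train → calibrate → evaluate steps above can be sketched end to end, assuming scikit-learn and synthetic data (the feature weights and sizes are illustrative):

```python
# A minimal train/calibrate/evaluate sketch using scikit-learn and
# synthetic data; weights and dataset sizes are illustrative.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 5))  # stand-in feature matrix
true_logits = X @ np.array([0.8, 1.5, 0.3, 0.0, 0.0]) - 3.0
y = (rng.random(20_000) < 1 / (1 + np.exp(-true_logits))).astype(int)

X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
raw = model.predict_proba(X_hold)[:, 1]

# Calibrate raw scores with isotonic regression. (In production, fit the
# calibrator on a split separate from the one used for evaluation.)
iso = IsotonicRegression(out_of_bounds="clip").fit(raw, y_hold)
calibrated = np.clip(iso.predict(raw), 1e-6, 1 - 1e-6)

print("AUC:", roc_auc_score(y_hold, raw))         # ranking quality
print("log-loss:", log_loss(y_hold, calibrated))  # calibration quality
```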
Phase 3: Verification
Check calibration: bucket predictions into deciles, compare predicted vs actual CTR per bucket. Plot reliability diagram.
Gate: Calibration curve close to diagonal, AUC > 0.70.
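The decile check can be sketched with plain NumPy (the expected-calibration-error summary is an added convenience, not part of the gate above):

```python
import numpy as np

def calibration_by_decile(p_pred, y_true, n_buckets=10):
    """Sort by predicted CTR, split into deciles, and return the mean
    predicted CTR vs. the actual click rate for each bucket."""
    order = np.argsort(p_pred)
    return [(float(p_pred[idx].mean()), float(y_true[idx].mean()))
            for idx in np.array_split(order, n_buckets)]

def calibration_error(p_pred, y_true, n_buckets=10):
    """Mean absolute gap between predicted and actual CTR across buckets."""
    rows = calibration_by_decile(p_pred, y_true, n_buckets)
    return sum(abs(p - a) for p, a in rows) / len(rows)
```

Plotting the (predicted, actual) pairs gives the reliability diagram; a well-calibrated model's points lie near the diagonal.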
Phase 4: Output
Return predicted CTR with confidence interval and top contributing features.
Output Format
```json
{
  "prediction": {"ctr": 0.035, "confidence_interval": [0.028, 0.042]},
  "top_features": [{"feature": "query_ad_match", "importance": 0.32}],
  "metadata": {"model": "gbdt", "auc": 0.78, "log_loss": 0.21, "calibration_error": 0.008}
}
```
Examples
Sample I/O
Input: Trained logistic regression with 3 features and these coefficients:
intercept: -3.0
position_1: 0.8
query_ad_match: 1.5
user_is_mobile: 0.3
Features for current request: position_1=1, query_ad_match=1, user_is_mobile=1
Expected: logit = -3.0 + 0.8 + 1.5 + 0.3 = -0.4
pCTR = sigmoid(-0.4) = 1/(1 + e^0.4) ≈ 0.401 → 40.1%
Verify: for features all 0 (baseline), pCTR = sigmoid(-3.0) ≈ 0.047 (4.7%). Calibration is checked by bucketing predictions and comparing to actual CTR in each bucket.
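The arithmetic in this example can be checked directly:

```python
import math

coef = {"intercept": -3.0, "position_1": 0.8,
        "query_ad_match": 1.5, "user_is_mobile": 0.3}
features = {"position_1": 1, "query_ad_match": 1, "user_is_mobile": 1}

logit = coef["intercept"] + sum(coef[k] * v for k, v in features.items())
pctr = 1 / (1 + math.exp(-logit))
print(round(logit, 1), round(pctr, 3))  # → -0.4 0.401

# Baseline: all features 0, so only the intercept contributes.
baseline = 1 / (1 + math.exp(3.0))
print(round(baseline, 3))  # → 0.047
```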
Edge Cases
| Input | Expected | Why |
|---|---|---|
| New ad, no history | Use ad category average | Cold start for features |
| Position 1 vs position 4 | Different CTR, same relevance | Position bias inflates top-slot CTR |
| Very rare query | Low confidence | Insufficient training data for that query |
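The cold-start row above can be handled with an explicit fallback; a sketch where the function names, fields, and the 1000-impression threshold are all illustrative:

```python
def predict_ctr(ad, model_pctr, category_avg_ctr, global_avg_ctr=0.03):
    """Fall back to the ad-category average CTR when an ad has too little
    history for its per-ad features to be trustworthy (cold start)."""
    if ad.get("impressions", 0) < 1000:  # illustrative threshold
        return category_avg_ctr.get(ad["category"], global_avg_ctr)
    return model_pctr(ad)  # normal path: use the trained model
```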
Gotchas
- Position bias: Ads in position 1 get more clicks regardless of relevance. Train on position-debiased data or include position as a feature and normalize at inference.
- Data freshness: CTR patterns change rapidly (seasonality, trends). Retrain daily or use online learning.
- Feature leakage: Including click-derived features (e.g., historical CTR of this exact ad-query pair) creates leakage if not handled carefully with time-based splits.
- Class imbalance: 97% no-click, 3% click. Use proper evaluation metrics (log-loss, AUC), not accuracy. Consider downsampling negatives during training.
- Multi-task learning: CTR and conversion rate are related but different. Joint models can improve both by sharing lower layers.
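If negatives are downsampled during training, the model's raw probabilities are inflated and must be corrected at serving time; the standard odds-based correction, as a sketch:

```python
def correct_downsampled_ctr(p_biased, neg_keep_rate):
    """Undo the probability inflation caused by keeping only a fraction
    `neg_keep_rate` of negative examples during training. Derived from
    odds: downsampling negatives by w multiplies the odds by 1/w."""
    return p_biased / (p_biased + (1 - p_biased) / neg_keep_rate)
```

For example, a model trained on 10% of negatives that outputs about 0.236 corresponds to a true pCTR of roughly 0.03.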
References
- For feature engineering best practices, see references/feature-engineering.md
- For position debiasing techniques, see references/position-debiasing.md