regression

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese
<!-- Bundled files (accessible via ${CLAUDE_SKILL_DIR}): - SKILL.md — this file - scripts/demo.py — runnable marimo notebook with worked example -->
<!-- 捆绑文件(可通过 ${CLAUDE_SKILL_DIR} 访问): - SKILL.md — 本文件 - scripts/demo.py — 可运行的marimo示例笔记本 -->

Regression with XGBoost + Conformal Prediction Intervals

基于XGBoost + 保形预测区间的回归建模

For tabular regression, default to XGBoost as the point estimator and use conformalized quantile regression (CQR) to attach prediction intervals that actually achieve their stated coverage. Point estimates without intervals are not a model — they're a guess. This skill teaches the workflow for shipping a regressor that a downstream system can trust.
对于表格回归任务,默认使用XGBoost作为点估计器,并采用**保形分位数回归(CQR)**生成能真正达到声明覆盖率的预测区间。没有区间的点估计不能称之为模型——只是猜测。本方案将教授如何构建可被下游系统信任的回归模型并投入使用的完整流程。

When to use this skill

何时使用本方案

  • The target is continuous (price, count, score, time, demand)
  • The features are tabular (numbers, categories, dates) — not images, text, or audio
  • You have at least a few hundred labeled examples
  • Downstream consumers need both a point estimate and an interval
  • 目标变量为连续型(如价格、数量、分数、时间、需求量)
  • 特征为表格型(数字、类别、日期)——而非图像、文本或音频
  • 至少拥有数百个带标签的样本
  • 下游使用者同时需要点估计预测区间

When NOT to use this skill

何时不使用本方案

  • Classification (binary / multiclass / multilabel) → see the classification skills
  • Time-series with strong temporal structure → use time-series methods, not XGBoost on a flat dataframe
  • Truly linear data with < 100 rows → a regularized linear model (
    Ridge
    ,
    Lasso
    ,
    ElasticNet
    ) will be hard to beat and gives closed-form coefficient interpretation
  • You need a fully-Bayesian posterior over predictions (e.g. for decision analysis with explicit utilities) → use PyMC GLMs
  • 分类任务(二分类/多分类/多标签)→ 查看分类相关方案
  • 具有强时间结构的时间序列数据 → 使用时间序列方法,而非基于扁平DataFrame的XGBoost
  • 样本量<100的纯线性数据 → 正则化线性模型(
    Ridge
    Lasso
    ElasticNet
    )的表现难以被超越,且能提供闭式系数解释
  • 需要预测结果的完全贝叶斯后验分布(如用于带有明确效用的决策分析)→ 使用PyMC广义线性模型

Project layout

项目结构

<project>/
├── data/                # input parquet/csv
├── src/
│   ├── train.py         # ibis read → 3 XGBRegressors → conformal cal → MLflow
│   ├── predict.py       # reload models + conformal_q, return point + interval
│   └── plots.py         # predicted vs actual, residual diagnostics, coverage, SHAP
├── notebooks/
│   └── demo.py          # marimo walkthrough
└── mlruns/              # MLflow tracking store (gitignored)
<project>/
├── data/                # 输入的parquet/csv文件
├── src/
│   ├── train.py         # ibis读取数据 → 训练3个XGBRegressors → 保形校准 → MLflow记录
│   ├── predict.py       # 重新加载模型 + 保形参数,返回点估计和区间
│   └── plots.py         # 预测值vs真实值、残差诊断、覆盖率、SHAP可视化
├── notebooks/
│   └── demo.py          # marimo分步演示
└── mlruns/              # MLflow跟踪存储(已加入git忽略)

Data access — ibis at the source, pandas at the sklearn boundary

数据访问 — 数据源用ibis,sklearn边界用pandas

Use ibis (
ibis-framework[duckdb]
) to read data and compute summaries; materialize with
.execute()
exactly once just before sklearn:
python
import ibis

table = ibis.duckdb.connect().read_parquet("data/train.parquet")
feature_cols = [c for c in table.columns if c.startswith("feature_")]

target_stats = (
    table
    .aggregate(
        target_mean=table.target.mean(),
        target_std=table.target.std(),
        n_total=table.count(),
    )
    .execute()
    .iloc[0]
)

data = (
    table
    .select(*feature_cols, "target")
    .execute()
)
X = data[feature_cols]
y = data["target"]
使用ibis
ibis-framework[duckdb]
)读取数据并计算统计量;在传入sklearn前仅执行一次
.execute()
实现数据物化:
python
import ibis

table = ibis.duckdb.connect().read_parquet("data/train.parquet")
feature_cols = [c for c in table.columns if c.startswith("feature_")]

target_stats = (
    table
    .aggregate(
        target_mean=table.target.mean(),
        target_std=table.target.std(),
        n_total=table.count(),
    )
    .execute()
    .iloc[0]
)

data = (
    table
    .select(*feature_cols, "target")
    .execute()
)
X = data[feature_cols]
y = data["target"]

The pipeline — three models, not one

流水线 — 三个模型,而非一个

For regression with intervals you fit three XGBoost models on the same training data:
  1. Point estimator
    objective="reg:squarederror"
    , gives the expected value
  2. Lower quantile
    objective="reg:quantileerror"
    ,
    quantile_alpha=0.05
    (or 0.10), gives the lower bound
  3. Upper quantile — same, with
    quantile_alpha=0.95
    (or 0.90)
python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

def build_xgb_regressor(feature_cols, seed, *, objective="reg:squarederror", quantile_alpha=None):
    kwargs = dict(
        n_estimators=400, max_depth=4, learning_rate=0.05,
        subsample=0.8, colsample_bytree=0.8, reg_lambda=1.0,
        objective=objective, random_state=seed, n_jobs=-1,
    )
    if quantile_alpha is not None:
        kwargs["quantile_alpha"] = quantile_alpha
    return Pipeline([
        ("preprocess", ColumnTransformer([("num", StandardScaler(), feature_cols)])),
        ("clf", XGBRegressor(**kwargs)),
    ])

xgb_point = build_xgb_regressor(feature_cols, seed=42)
xgb_lower = build_xgb_regressor(feature_cols, seed=42, objective="reg:quantileerror", quantile_alpha=0.05)
xgb_upper = build_xgb_regressor(feature_cols, seed=42, objective="reg:quantileerror", quantile_alpha=0.95)
XGBoost 2.0+ has native quantile regression via
reg:quantileerror
. Earlier versions don't — if you're stuck on 1.x, use a different library (
lightgbm
has supported it for years) or switch to a quantile random forest.
要构建带区间的回归模型,需在同一训练数据上拟合三个XGBoost模型:
  1. 点估计模型
    objective="reg:squarederror"
    ,输出期望值
  2. 低分位数模型
    objective="reg:quantileerror"
    quantile_alpha=0.05
    (或0.10),输出下界
  3. 高分位数模型 — 同上,
    quantile_alpha=0.95
    (或0.90)
python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

def build_xgb_regressor(feature_cols, seed, *, objective="reg:squarederror", quantile_alpha=None):
    kwargs = dict(
        n_estimators=400, max_depth=4, learning_rate=0.05,
        subsample=0.8, colsample_bytree=0.8, reg_lambda=1.0,
        objective=objective, random_state=seed, n_jobs=-1,
    )
    if quantile_alpha is not None:
        kwargs["quantile_alpha"] = quantile_alpha
    return Pipeline([
        ("preprocess", ColumnTransformer([("num", StandardScaler(), feature_cols)])),
        ("clf", XGBRegressor(**kwargs)),
    ])

xgb_point = build_xgb_regressor(feature_cols, seed=42)
xgb_lower = build_xgb_regressor(feature_cols, seed=42, objective="reg:quantileerror", quantile_alpha=0.05)
xgb_upper = build_xgb_regressor(feature_cols, seed=42, objective="reg:quantileerror", quantile_alpha=0.95)
XGBoost 2.0+版本通过
reg:quantileerror
原生支持分位数回归。若使用1.x版本,可换用其他库(
lightgbm
已支持多年)或分位数随机森林。

The four things that separate this from a tutorial

本方案与普通教程的四大区别

1. Conformalized quantile regression — intervals that actually cover

1. 保形分位数回归 — 真正达到覆盖率的区间

Vanilla quantile XGBoost optimizes pinball loss but does not guarantee coverage. On Friedman1 with
quantile_alpha=0.05/0.95
(nominal 90%), raw empirical coverage is typically 70-75% — way below target. The fix is conformal calibration.
The recipe:
python
import numpy as np
from sklearn.model_selection import train_test_split
普通分位数XGBoost优化pinball损失,但无法保证覆盖率。在Friedman1数据集上使用
quantile_alpha=0.05/0.95
(标称90%覆盖率)时,原始经验覆盖率通常仅为70-75%——远低于目标值。解决方案是保形校准。
实现步骤:
python
import numpy as np
from sklearn.model_selection import train_test_split

1. Split a calibration set off the training data

1. 从训练数据中拆分出校准集

X_train, X_calib, y_train, y_calib = train_test_split( X_train_full, y_train_full, test_size=0.2, random_state=42, )
X_train, X_calib, y_train, y_calib = train_test_split( X_train_full, y_train_full, test_size=0.2, random_state=42, )

2. Fit quantile models on the (smaller) train set

2. 在(较小的)训练集上拟合分位数模型

xgb_lower.fit(X_train, y_train) xgb_upper.fit(X_train, y_train)
xgb_lower.fit(X_train, y_train) xgb_upper.fit(X_train, y_train)

3. Compute conformity scores on the calibration set:

3. 在校准集上计算一致性分数:

E_i = max(q_low(x_i) - y_i, y_i - q_high(x_i))

E_i = max(q_low(x_i) - y_i, y_i - q_high(x_i))

Positive when y_i is OUTSIDE the predicted interval.

当y_i落在预测区间外时为正值。

cal_low = xgb_lower.predict(X_calib) cal_high = xgb_upper.predict(X_calib) conformity = np.maximum(cal_low - y_calib, y_calib - cal_high)
cal_low = xgb_lower.predict(X_calib) cal_high = xgb_upper.predict(X_calib) conformity = np.maximum(cal_low - y_calib, y_calib - cal_high)

4. Find the appropriate quantile of the conformity scores

4. 找到一致性分数的对应分位数

nominal_coverage = 0.90 n_cal = len(y_calib) q_level = min(1.0, np.ceil(nominal_coverage * (n_cal + 1)) / n_cal) conformal_q = float(np.quantile(conformity, q_level))
nominal_coverage = 0.90 n_cal = len(y_calib) q_level = min(1.0, np.ceil(nominal_coverage * (n_cal + 1)) / n_cal) conformal_q = float(np.quantile(conformity, q_level))

5. At inference time, expand the raw quantile bounds by ±conformal_q

5. 推理阶段,将原始分位数边界扩展±conformal_q

def predict_interval(X_new): y_low_raw = xgb_lower.predict(X_new) y_high_raw = xgb_upper.predict(X_new) return y_low_raw - conformal_q, y_high_raw + conformal_q

This is **conformalized quantile regression** (Romano, Patterson,
Candes 2019). It guarantees marginal coverage of at least
`nominal_coverage` on test data drawn from the same distribution as
calibration. The intervals get wider, but they're now honest.

**Always log both the raw and conformalized empirical coverage** to
MLflow so you can see the gap. If raw coverage is already at the
nominal level, your data is well-behaved and the correction is small;
if it's far off, the correction was load-bearing.
def predict_interval(X_new): y_low_raw = xgb_lower.predict(X_new) y_high_raw = xgb_upper.predict(X_new) return y_low_raw - conformal_q, y_high_raw + conformal_q

这就是**保形分位数回归**(Romano, Patterson, Candes 2019)。它能保证从与校准集同分布的测试数据中得到至少`nominal_coverage`的边际覆盖率。区间会变宽,但结果更可靠。

**务必将原始和保形后的经验覆盖率都记录到MLflow**,以便查看差距。若原始覆盖率已达到标称值,说明数据表现良好,校准修正幅度较小;若差距较大,则校准是必不可少的。

2. Residual diagnostics — what the model is missing

2. 残差诊断 — 模型遗漏的信息

After training, plot the residuals (
y_true - y_pred
) three ways:
  • Residual vs predicted: should be a flat band around 0. A funnel shape = heteroscedasticity (the variance depends on the prediction). A curve = the model is missing non-linear structure.
  • Residual histogram + Normal overlay: should look roughly Normal for squared-error models. Heavy tails suggest you should use
    reg:absoluteerror
    (MAE / L1) or
    reg:huber
    instead.
  • QQ plot vs Normal: quick visual check for tail behavior.
python
import matplotlib.pyplot as plt
from scipy import stats

residuals = y_test - y_pred
fig, axes = plt.subplots(1, 3, figsize=(14, 4.5))
axes[0].scatter(y_pred, residuals, alpha=0.5)
axes[0].axhline(0, color="red", ls="--")
axes[1].hist(residuals, bins=40, density=True)
stats.probplot(residuals, dist="norm", plot=axes[2])
If the residuals show structure, the model is leaving signal on the table. Add features, change the loss, or try a more flexible model.
训练完成后,从三个维度绘制残差(
y_true - y_pred
):
  • 残差vs预测值:应围绕0值形成平坦的带状分布。漏斗形状表示异方差性(方差随预测值变化)。曲线形状表示模型遗漏了非线性结构。
  • 残差直方图+正态分布拟合:对于平方误差模型,残差应大致符合正态分布。重尾分布表明应改用
    reg:absoluteerror
    (MAE/L1)或
    reg:huber
    损失。
  • QQ图vs正态分布:快速可视化检查尾部行为。
python
import matplotlib.pyplot as plt
from scipy import stats

residuals = y_test - y_pred
fig, axes = plt.subplots(1, 3, figsize=(14, 4.5))
axes[0].scatter(y_pred, residuals, alpha=0.5)
axes[0].axhline(0, color="red", ls="--")
axes[1].hist(residuals, bins=40, density=True)
stats.probplot(residuals, dist="norm", plot=axes[2])
若残差呈现特定结构,说明模型仍有未捕捉到的信号。可添加特征、更换损失函数或尝试更灵活的模型。

3. Honest metrics — RMSE and MAE in domain units, not just R²

3. 真实指标 — 采用领域单位的RMSE和MAE,而非仅R²

R² ("explained variance") looks great for a wide range of models and hides important information. Always report:
  • RMSE (root mean squared error) — same units as the target, penalizes large errors. The benchmark to beat is the irreducible error (the std of the noise). If RMSE ≈ noise std, the model is essentially perfect.
  • MAE (mean absolute error) — same units as the target, more interpretable than RMSE, less sensitive to outliers. "On average my prediction is off by X."
  • — useful as a sanity check (≥ 0?) but don't optimize for it in isolation. R² = 0.95 on a problem with no signal is meaningless.
python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
mae = float(mean_absolute_error(y_test, y_pred))
r2 = float(r2_score(y_test, y_pred))
R²(“解释方差”)在各类模型中表现都很亮眼,但会隐藏重要信息。务必报告:
  • RMSE(均方根误差)——与目标变量单位相同,对大误差惩罚更重。需超越的基准是不可约误差(噪声的标准差)。若RMSE≈噪声标准差,模型已接近完美。
  • MAE(平均绝对误差)——与目标变量单位相同,比RMSE更易解释,对异常值更不敏感。“我的预测平均偏差为X。”
  • ——可作为 sanity check(是否≥0?),但不要单独针对它优化。在无信号问题上R²=0.95毫无意义。
python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
mae = float(mean_absolute_error(y_test, y_pred))
r2 = float(r2_score(y_test, y_pred))

4. SHAP for feature importance

4. SHAP特征重要性

Same advice as for binary classification: don't use
feature_importances_
, it has biases. Use SHAP's
TreeExplainer
:
python
import shap

clf = pipeline.named_steps["clf"]  # the XGBRegressor
preprocessor = pipeline.named_steps["preprocess"]
X_test_t = preprocessor.transform(X_test.iloc[:200])

explainer = shap.TreeExplainer(clf)
shap_values = explainer(X_test_t)
shap.summary_plot(shap_values, X_test_t, feature_names=feature_cols)
SHAP also gives you per-prediction explanations (
shap.plots.waterfall
) which are essential for any deployed model that affects users.
与二分类任务建议相同:不要使用
feature_importances_
,它存在偏差。使用SHAP的
TreeExplainer
python
import shap

clf = pipeline.named_steps["clf"]  # XGBRegressor实例
preprocessor = pipeline.named_steps["preprocess"]
X_test_t = preprocessor.transform(X_test.iloc[:200])

explainer = shap.TreeExplainer(clf)
shap_values = explainer(X_test_t)
shap.summary_plot(shap_values, X_test_t, feature_names=feature_cols)
SHAP还能提供单预测结果的解释(
shap.plots.waterfall
),这对任何影响用户的部署模型都至关重要。

MLflow logging

MLflow记录

Every run logs:
KindWhat
params
data path, n_rows, n_features, target_mean / target_std, seed, lower_quantile, upper_quantile, nominal_coverage, model name, hyperparameters
metrics
test_rmse, test_mae, test_r2, irreducible_rmse (when known), rmse_above_irreducible, conformal_q, coverage_raw vs coverage_conformal, interval_width_raw vs interval_width_conformal
tags
data hash, target distribution stats
artifacts
three models (
model_point
,
model_lower
,
model_upper
), predicted-vs-actual scatter with interval bands, residual diagnostics, interval coverage plot (raw + conformal), SHAP summary, sidecar JSON
The most important metric is
coverage_conformal
— if it's not within ~1% of
nominal_coverage
, something is wrong with your calibration set (too small? distribution shift?).
每次运行需记录:
类型内容
params
数据路径、行数、特征数、目标均值/标准差、随机种子、低分位数高分位数标称覆盖率、模型名称、超参数
metrics
测试集RMSE、测试集MAE、测试集R²、不可约RMSE(已知时)、RMSE超出不可约误差部分conformal_q原始覆盖率vs保形后覆盖率原始区间宽度vs保形后区间宽度
tags
数据哈希、目标分布统计量
artifacts
三个模型(
model_point
model_lower
model_upper
)、带区间带的预测值vs真实值散点图、残差诊断图、区间覆盖率图(原始+保形后)、SHAP汇总图、附带JSON文件
最重要的指标是
coverage_conformal
——若它与
nominal_coverage
的偏差超过~1%,说明校准集存在问题(样本量过小?分布偏移?)。

Common pitfalls

常见陷阱

  1. Reporting only R². R² of 0.95 can be a model that's useless in practice. Always report RMSE/MAE in domain units alongside.
  2. Skipping conformal calibration. Vanilla quantile XGBoost intervals undercover. Without CQR your "90% interval" is probably a 70-75% interval. This silently breaks any downstream system that trusts the bounds.
  3. Calibrating on the test set. The whole point of conformal prediction is that the calibration set is held out from both training and evaluation. Don't peek.
  4. Heteroscedastic data with squared-error loss. If
    residual vs predicted
    shows a funnel, your variance depends on the prediction. Switch to
    reg:quantileerror
    (which doesn't assume constant variance) or
    reg:huber
    (robust to outliers).
  5. Trusting
    feature_importances_
    .
    Use SHAP.
  6. Ignoring the irreducible error. If you know the noise level (e.g. measurement uncertainty in the target), report
    RMSE - noise_std
    . That's the model's actual error budget.
  7. Treating prediction intervals as confidence intervals. They are not the same. A prediction interval covers the next observation; a confidence interval covers the true mean. PIs are wider.
  1. 仅报告R²。R²=0.95的模型在实际中可能毫无用处。务必同时报告领域单位的RMSE/MAE。
  2. 跳过保形校准。普通分位数XGBoost的区间覆盖率不足。没有CQR的话,你的“90%区间”实际可能只有70-75%的覆盖率。这会悄无声息地破坏所有信任该边界的下游系统。
  3. 在测试集上校准。保形预测的核心是校准集与训练集、测试集均分离。不要偷看测试集数据。
  4. 对异方差数据使用平方误差损失。若“残差vs预测值”呈现漏斗形状,说明方差随预测值变化。应改用
    reg:quantileerror
    (不假设方差恒定)或
    reg:huber
    (对异常值鲁棒)。
  5. 信任
    feature_importances_
    。使用SHAP。
  6. 忽略不可约误差。若已知噪声水平(如目标变量的测量不确定性),报告
    RMSE - noise_std
    。这才是模型的实际误差预算。
  7. 将预测区间等同于置信区间。二者并不相同。预测区间覆盖下一个观测值;置信区间覆盖真实均值。预测区间更宽。

Worked example

示例演示

See
demo.py
(marimo notebook). It generates the Friedman1 non-linear regression dataset inline (
y = 10·sin(π·x₀·x₁) + 20·(x₂−0.5)² + 10·x₃ + 5·x₄ + ε
), fits all three XGBoost models, applies conformal calibration, plots residual diagnostics + interval coverage, and includes a baseline
LinearRegression
cell so you can see why a linear model fails on non-linear structure.
查看
demo.py
(marimo笔记本)。它会生成Friedman1非线性回归数据集(
y = 10·sin(π·x₀·x₁) + 20·(x₂−0.5)² + 10·x₃ + 5·x₄ + ε
),拟合三个XGBoost模型,应用保形校准,绘制残差诊断图和区间覆盖率图,还包含
LinearRegression
基线对比单元,可直观看到线性模型在非线性结构上的劣势。