tabular-eda
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
Chinese<!-- Bundled files (accessible via ${CLAUDE_SKILL_DIR}):
- SKILL.md — this file
- scripts/demo.py — runnable marimo notebook with worked example
-->
<!-- 打包文件(可通过${CLAUDE_SKILL_DIR}访问):
- SKILL.md — 本文件
- scripts/demo.py — 可运行的marimo笔记本,包含示例演示
-->
Tabular EDA — Done Right
表格数据EDA — 正确的做法
Whenever you get handed a new tabular dataset, stop. Do not jump
straight to . Ten minutes of EDA will catch problems
that would otherwise destroy your downstream model — target leakage,
high-cardinality explosions, MAR missing data, non-linear features that
Pearson correlation says are useless. This skill is the workflow.
XGBClassifier()每当拿到一份新的表格数据集时,先停下来。不要直接上手使用。花十分钟做EDA能发现那些会毁掉后续模型的问题——target leakage、高基数特征爆炸、MAR缺失数据、皮尔逊相关系数认为无用的非线性特征。本技能就是对应的工作流。
XGBClassifier()When to use this skill
何时使用本技能
- You just received a new dataset and have no idea what's in it
- You're about to train a model and want to validate the data first
- A model is performing suspiciously well (or poorly) and you suspect a data quality issue
- The user asks "what should I do with this dataset?"
- 刚收到一份新数据集,完全不清楚其中内容
- 即将训练模型,想要先验证数据质量
- 模型表现异常好(或差),怀疑存在数据质量问题
- 用户询问“我该如何处理这个数据集?”
When NOT to use this skill
何时不使用本技能
- You already deeply know the dataset and have profiled it before
- The dataset is image / text / audio / time-series — different rules
- The user just wants a model trained, fast, and is OK with risk
- 已经非常熟悉该数据集,并且之前做过探查
- 数据集是图像/文本/音频/时间序列——需遵循不同规则
- 用户只想快速训练模型,愿意承担风险
The workflow
工作流
1. Load → shape, dtypes, memory
2. Identify the target → infer problem type (binary / multiclass / regression)
3. Missing data → per-column %, overall %, patterns
4. Numeric distributions → skew, outliers, scale mismatches
5. Categorical cardinality → flag high-cardinality (OHE explosion risk)
6. Near-constant features → flag and consider dropping
7. Redundant pairs → flag features with > 0.95 mutual correlation
8. **Target leakage detection** → flag features with > 0.95 |Pearson| to target
9. **Mutual information vs Pearson** → catch non-linear features Pearson misses
10. Optional: PCA / UMAP for low-dim visualizationThe output is a findings report: a list of suspicious things, each
with a feature name, the metric that flagged it, and a recommended
action. Don't just print plots. A list of problems with names is
what you act on.
1. 加载数据 → 查看形状、数据类型、内存占用
2. 确定目标列 → 推断问题类型(二分类/多分类/回归)
3. 缺失数据 → 按列统计占比、整体占比、缺失模式
4. 数值特征分布 → 偏度、异常值、尺度不匹配
5. 分类特征基数 → 标记高基数特征(独热编码爆炸风险)
6. 近常值特征 → 标记并考虑删除
7. 冗余特征对 → 标记互相关系数>0.95的特征对
8. **Target leakage检测** → 标记与目标列皮尔逊绝对值>0.95的特征
9. **互信息vs皮尔逊相关系数** → 捕捉皮尔逊无法发现的非线性特征
10. 可选:PCA/UMAP降维可视化输出是一份发现报告:列出所有可疑问题,每个问题包含特征名称、触发标记的指标,以及建议操作。不要只输出图表。带有名称的问题列表才是需要采取行动的依据。
Five things that separate this from a tutorial
与普通教程的五大区别
1. Target leakage detection — the single most valuable EDA check
1. Target leakage检测 —— EDA中最有价值的检查
A "leakage" feature is one that contains information about the target
that wouldn't actually be available at prediction time. The classic
examples:
- for predicting
account_balance_after_paymentmade_payment - (cumulative) for predicting
total_chargeschurned - for predicting
claim_paid_amountclaim_was_filed
These features are computed after the prediction time. Train on
them and you get 99% test accuracy and a model that completely fails
in production. The signature is suspiciously high correlation with
the target — anything > 0.95 is a leak suspect, anything > 0.99 is
almost certainly a leak.
python
def find_leakage_candidates(df, target_col, numeric_cols, threshold=0.95):
out = []
for col in numeric_cols:
if col == target_col:
continue
corr = float(df[[col, target_col]].dropna().corr().iloc[0, 1])
if abs(corr) > threshold:
out.append({"feature": col, "pearson": round(corr, 4)})
return outWhen you find a leakage candidate, always confirm with the data
owner before dropping it. Sometimes a feature is legitimately almost
perfectly correlated with the target (e.g. an upstream model's
prediction). But the default assumption is "this is a leak."
“泄露”特征是指包含了预测时实际无法获取的目标信息的特征。经典例子:
- 用于预测(是否付款)的
made_payment(付款后账户余额)account_balance_after_payment - 用于预测(是否流失)的
churned(累计费用)total_charges - 用于预测(是否提交理赔)的
claim_was_filed(理赔支付金额)claim_paid_amount
这些特征是在预测时间点之后计算的。用它们训练模型会得到99%的测试准确率,但模型在生产环境中完全失效。这类特征的标志是与目标列的相关系数异常高——任何>0.95的值都是疑似泄露,>0.99则几乎可以确定是泄露。
python
def find_leakage_candidates(df, target_col, numeric_cols, threshold=0.95):
out = []
for col in numeric_cols:
if col == target_col:
continue
corr = float(df[[col, target_col]].dropna().corr().iloc[0, 1])
if abs(corr) > threshold:
out.append({"feature": col, "pearson": round(corr, 4)})
return out当发现疑似泄露的特征时,务必先与数据所有者确认再删除。有时某个特征确实与目标列高度相关(例如上游模型的预测结果)。但默认假设应为“这是一个泄露特征”。
2. Mutual information vs Pearson — catch non-linear signal
2. 互信息vs皮尔逊相关系数 —— 捕捉非线性信号
Pearson correlation only catches linear relationships. A feature
that drives the target via or will have Pearson ≈ 0
and Pearson alone will mark it as useless. Mutual information catches
both.
sin(x)(x - 0.5)²python
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression皮尔逊相关系数只能捕捉线性关系。通过或驱动目标的特征,其皮尔逊相关系数≈0,仅用皮尔逊会被标记为无用。而互信息能同时捕捉线性和非线性关系。
sin(x)(x - 0.5)²python
from sklearn.feature_selection import mutual_info_classif, mutual_info_regressionFor classification
针对分类任务
mi_scores = mutual_info_classif(X, y, random_state=0)
mi_scores = mutual_info_classif(X, y, random_state=0)
For regression
针对回归任务
mi_scores = mutual_info_regression(X, y, random_state=0)
Plot |Pearson| and MI side by side as a bar chart. Features where
**MI is high but |Pearson| is low** are non-linear signal hiding from
your linear EDA. They'll be invisible to a linear model and powerful
in XGBoost.
This is the same lesson the regression bundle teaches with Friedman1's
`sin(π·x₀·x₁)` term: zero linear correlation, large mutual information,
huge contribution to the target.mi_scores = mutual_info_regression(X, y, random_state=0)
将|皮尔逊相关系数|和互信息绘制成并排柱状图。那些**互信息高但|皮尔逊相关系数|低**的特征,就是线性EDA无法发现的非线性信号。它们对线性模型不可见,但对XGBoost模型非常有用。
这与回归教程中Friedman1的`sin(π·x₀·x₁)`项所传达的道理一致:线性相关为零,但互信息很高,对目标值的贡献巨大。3. High-cardinality categorical detection
3. 高基数分类特征检测
A column like with thousands of unique values will explode
a into thousands of sparse columns. Flag any
categorical with > 50 unique values:
user_idOneHotEncoderpython
def find_high_cardinality(df, cat_cols, threshold=50):
return [
{"feature": c, "n_unique": int(df[c].nunique())}
for c in cat_cols if df[c].nunique() > threshold
]Recommended action for high-cardinality categoricals:
- Target encoding (smoothed mean of the target per category) — works well, but leaks during cross-validation if you're careless
- Frequency encoding — replace each category with its frequency
- Hash encoding — fixed-size hash buckets
- Just drop it — is rarely a useful feature anyway
user_id
像这样包含数千个唯一值的列,会让生成数千个稀疏列。标记任何唯一值数量>50的分类特征:
user_idOneHotEncoderpython
def find_high_cardinality(df, cat_cols, threshold=50):
return [
{"feature": c, "n_unique": int(df[c].nunique())}
for c in cat_cols if df[c].nunique() > threshold
]针对高基数分类特征的建议操作:
- 目标编码(每个类别对应的目标值平滑均值)——效果好,但如果不小心会在交叉验证中造成泄露
- 频率编码——用类别出现的频率替换该类别
- 哈希编码——固定大小的哈希桶
- 直接删除——很少是有用的特征
user_id
4. Near-constant feature detection
4. 近常值特征检测
A column where one value covers > 98% of the rows has essentially no
signal. It's not always wrong to keep it (some signal beats no signal),
but it's often indicative of a data collection issue and worth flagging:
python
def find_near_constant(df, threshold=0.98):
return [
{"feature": c, "top_value_freq": float(df[c].value_counts(normalize=True).iloc[0])}
for c in df.columns
if df[c].value_counts(normalize=True).iloc[0] > threshold
]某个值覆盖>98%行的列基本没有信号。保留它并非总是错误(有信号总比没信号好),但这通常表明存在数据收集问题,值得标记:
python
def find_near_constant(df, threshold=0.98):
return [
{"feature": c, "top_value_freq": float(df[c].value_counts(normalize=True).iloc[0])}
for c in df.columns
if df[c].value_counts(normalize=True).iloc[0] > threshold
]5. Redundant feature detection
5. 冗余特征检测
Features with mutual correlation > 0.95 carry the same information.
Drop one of each pair to reduce multicollinearity (which messes up
linear models more than tree models, but is still wasted compute):
python
def find_redundant_pairs(df, numeric_cols, threshold=0.95):
corr = df[numeric_cols].corr().abs()
out = []
for i, c1 in enumerate(numeric_cols):
for c2 in numeric_cols[i + 1:]:
if float(corr.loc[c1, c2]) > threshold:
out.append({"pair": [c1, c2], "pearson": float(corr.loc[c1, c2])})
return out互相关系数>0.95的特征携带相同信息。每对中删除一个以减少多重共线性(对线性模型的影响比对树模型大,但仍会浪费计算资源):
python
def find_redundant_pairs(df, numeric_cols, threshold=0.95):
corr = df[numeric_cols].corr().abs()
out = []
for i, c1 in enumerate(numeric_cols):
for c2 in numeric_cols[i + 1:]:
if float(corr.loc[c1, c2]) > threshold:
out.append({"pair": [c1, c2], "pearson": float(corr.loc[c1, c2])})
return outVisual checks (always include in the output)
可视化检查(输出中务必包含)
These six plots together answer "what's in this data?" in 30 seconds:
- Missing data bar chart — sorted by % missing, descending
- Numeric distributions grid — histograms with skew annotated
- Categorical cardinality bar chart — red bars > 50 unique values
- Correlation heatmap — feature × target Pearson, with target column annotated with the actual numbers
- Mutual info vs Pearson side-by-side bar chart — surfaces the non-linear signal Pearson misses
- Outlier box plots — per-numeric-column with IQR-based outlier counts
Output a findings.json file alongside the plots. Each finding has
a feature name, the metric that flagged it, and a recommended action.
The list is what gets actioned; the plots are the supporting evidence.
以下六个图表能在30秒内回答“这份数据里有什么?”:
- 缺失数据柱状图——按缺失占比降序排列
- 数值特征分布网格——带偏度标注的直方图
- 分类特征基数柱状图——唯一值>50的特征用红色标注
- 相关系数热力图——特征与目标列的皮尔逊相关系数,目标列标注具体数值
- 互信息vs皮尔逊相关系数并排柱状图——凸显皮尔逊无法发现的非线性信号
- 异常值箱线图——每个数值特征基于IQR的异常值计数
在图表旁输出findings.json文件。每个发现包含特征名称、触发标记的指标,以及建议操作。列表是行动依据;图表是支撑证据。
Type inference for the target
目标列类型推断
Before any modeling, infer the target type heuristically:
python
def infer_target_type(y):
if y.dtype.kind in "biu": # bool / int
n_unique = y.nunique()
if n_unique == 2:
return "binary"
if n_unique <= 20:
return "multiclass"
return "regression"
if y.dtype.kind == "f":
return "regression"
return "categorical"This tells you which downstream skill to invoke next:
- → binary-classification skill
binary - → multiclass-classification skill
multiclass - → regression skill
regression - (no obvious target) → unsupervised skill
categorical
建模前,通过启发式方法推断目标列类型:
python
def infer_target_type(y):
if y.dtype.kind in "biu": # bool / int
n_unique = y.nunique()
if n_unique == 2:
return "binary"
if n_unique <= 20:
return "multiclass"
return "regression"
if y.dtype.kind == "f":
return "regression"
return "categorical"这会告诉你接下来应该调用哪个技能:
- → 二分类技能
binary - → 多分类技能
multiclass - → 回归技能
regression - (无明确目标)→ 无监督技能
categorical
Common pitfalls
常见误区
- Skipping EDA entirely. "I'll just throw it at XGBoost." This is how target leakage and 99%-test-accuracy-then-broken-in-prod happen.
- Pearson-only correlation. Misses sin / quadratic / categorical relationships. Always pair with mutual information.
- Dropping a "leakage" feature without confirming with the data owner. Sometimes the feature is legitimate (an upstream model's prediction). Confirm before deleting.
- OneHotEncoding a high-cardinality categorical. Explodes feature count, drowns the model in noise, slows training. Use target encoding, frequency encoding, or just drop the column.
- Not checking for duplicates. A dataset with 50% duplicate rows will show inflated test metrics if duplicates land in both train and test.
- Imputing missing values without thinking. Fill with the median for MCAR, but for MAR/MNAR you may need to model the missingness itself. Always flag the missingness pattern; don't silently impute.
- Treating the EDA report as ephemeral. Log the findings JSON and the plots to MLflow (or wherever your experiment tracker lives). When a model fails six months later, you want to be able to look at the EDA report from when the data was first profiled.
- 完全跳过EDA。“我直接用XGBoost就行。”这就是target leakage和“测试准确率99%但生产环境失效”问题的根源。
- 仅依赖皮尔逊相关系数。会错过正弦/二次/分类关系。务必搭配互信息使用。
- 未与数据所有者确认就删除“泄露”特征。有时该特征是合法的(例如上游模型的预测结果)。删除前请确认。
- 对高基数分类特征使用独热编码。会导致特征数量爆炸,让模型淹没在噪声中,减慢训练速度。使用目标编码、频率编码,或直接删除该列。
- 不检查重复数据。如果重复行同时出现在训练集和测试集中,数据集有50%重复行会导致测试指标虚高。
- 不假思索地填充缺失值。MCAR(完全随机缺失)用中位数填充,但MAR/MNAR(随机缺失/非随机缺失)可能需要对缺失本身建模。务必标记缺失模式;不要静默填充。
- 将EDA报告视为临时文件。将findings.json和图表记录到MLflow(或你的实验跟踪工具)中。当模型六个月后失效时,你需要查看首次探查数据时的EDA报告。
Worked example
示例演示
See (marimo notebook). It generates a deliberately messy
synthetic binary classification dataset with seven planted issues
(target leakage, high-cardinality categorical, near-constant feature,
30% missing data, log-normal skew, 2% outliers, redundant pair) and
walks through detecting each one. The notebook ends with a
findings table summarizing what the EDA pipeline caught — and that
table is the input to "what model do I train next?"
demo.py查看(marimo笔记本)。它生成了一个刻意设置了七个问题的合成二分类数据集(target leakage、高基数分类特征、近常值特征、30%缺失数据、对数正态偏度、2%异常值、冗余特征对),并逐步演示如何检测每个问题。笔记本最后会生成一个发现表格,总结EDA pipeline捕捉到的问题——该表格就是“接下来我该训练什么模型?”的输入依据。
demo.pyAfter EDA: what to do next
EDA之后:下一步做什么
Based on the findings, decide:
- Drop: leakage features, near-constant features, one of each redundant pair
- Encode: high-cardinality categoricals via target/frequency/hash
- Impute: missing data (median for numeric, "missing" sentinel for categorical)
- Transform: skewed features (log, Box-Cox), outliers (winsorize or robust scaler)
- Then: invoke the appropriate problem-type skill (binary-classification, regression, multiclass-classification, etc.)
根据发现结果,决定:
- 删除:泄露特征、近常值特征、每对冗余特征中的一个
- 编码:对高基数分类特征使用目标/频率/哈希编码
- 填充:缺失数据(数值特征用中位数,分类特征用“missing”标记值)
- 转换:偏态特征(对数、Box-Cox转换)、异常值(缩尾或鲁棒缩放)
- 然后:调用对应问题类型的技能(二分类、回归、多分类等)