IRON LAW: Perform EDA Only AFTER Train/Test Split — Or You Leak the Future
Agents know "do EDA first." But they almost always do EDA on the FULL
dataset before splitting. This is information leakage: you've seen the
test set's distributions, outliers, and correlations, and your subsequent
modeling choices (feature scaling, outlier treatment, imputation strategy)
are now informed by data the model shouldn't see. Split first, then EDA
only on the training set. Apply the same transformations to the test set
without re-examining it.
Exception: data quality checks (nulls, dtypes, duplicates) CAN run on
the full dataset since they don't inform model hyperparameters.
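
The split-first workflow described above can be sketched as follows. This is a minimal illustration, not a prescribed pipeline: the toy data, the column names `income` and `age`, the 80/20 split, the 99th-percentile cap, and the helper `apply_choices` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=200),
    "age": rng.integers(18, 80, size=200).astype(float),
})
df.loc[rng.choice(200, size=20, replace=False), "age"] = np.nan

# 1. Split FIRST, before looking at any distributions.
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

# 2. Every data-driven choice is derived from the training set only.
age_median = train["age"].median()            # imputation value
income_cap = train["income"].quantile(0.99)   # outlier cap (assumed rule)


def apply_choices(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the training-derived choices without inspecting `frame`."""
    out = frame.copy()
    out["age"] = out["age"].fillna(age_median)
    out["income"] = out["income"].clip(upper=income_cap)
    return out


train_prepared = apply_choices(train)

# 3. Scaling parameters are also learned from the training set only.
mean, std = train_prepared.mean(), train_prepared.std()
train_scaled = (train_prepared - mean) / std

# 4. The test set receives the same transformations, unexamined.
test_scaled = (apply_choices(test) - mean) / std
```

The test set is touched exactly once, in the last step, and only with parameters computed from `train`.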

| Issue | Columns Affected | Count/% | Action |
|---|---|---|---|
| Missing values | {cols} | {N / %} | {drop / impute / investigate} |
| Outliers | {cols} | {N} | {cap / remove / keep} |
| Duplicates | — | {N} | {remove} |
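
One way to fill in the first three columns of this table is sketched below. The function names `quality_report` and `iqr_outlier_count` are hypothetical, and the 1.5 × IQR rule is an assumed outlier definition, not one taken from this document. Null and duplicate counts are run on the full dataset, as the exception allows; the outlier count is meant to be called on training-set columns only. The Action column is left to the analyst.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize nulls and duplicate rows; safe to run on the full dataset."""
    rows = []
    missing_cols = [str(c) for c in df.columns[df.isna().any()]]
    if missing_cols:
        n_rows = int(df.isna().any(axis=1).sum())  # rows with any null
        rows.append({
            "Issue": "Missing values",
            "Columns Affected": ", ".join(missing_cols),
            "Count/%": f"{n_rows} / {100 * n_rows / len(df):.1f}%",
        })
    n_dupes = int(df.duplicated().sum())  # repeats after the first occurrence
    if n_dupes:
        rows.append({
            "Issue": "Duplicates",
            "Columns Affected": "all",
            "Count/%": str(n_dupes),
        })
    return pd.DataFrame(rows, columns=["Issue", "Columns Affected", "Count/%"])


def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]. Use on TRAIN data only."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return int(((series < q1 - k * iqr) | (series > q3 + k * iqr)).sum())


# Hypothetical example data.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 2.0, 2.0],
    "b": ["x", "y", "z", "y", "y"],
})
report = quality_report(df)
print(report)
print(df.dtypes)
```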

| Variable | Mean | Median | Std | Min | Max | Distribution |
|---|---|---|---|---|---|---|
| {var} | ... | ... | ... | ... | ... | {normal/skewed/bimodal} |
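
A summary like this can be produced from the training set with pandas, as in the sketch below. The function name `summarize` and the skewness threshold of 1.0 are assumptions for illustration. The label is only a rough hint: low skewness does not establish normality, and bimodality is not detected here, so a histogram is still needed before filling in the Distribution column.

```python
import numpy as np
import pandas as pd


def summarize(train: pd.DataFrame, skew_threshold: float = 1.0) -> pd.DataFrame:
    """Per-variable summary statistics for numeric TRAINING columns."""
    num = train.select_dtypes(include="number")
    summary = num.agg(["mean", "median", "std", "min", "max"]).T
    summary.columns = ["Mean", "Median", "Std", "Min", "Max"]
    # Rough shape hint based on sample skewness (assumed threshold).
    skew = num.skew()
    summary["Distribution"] = np.where(
        skew.abs() > skew_threshold, "skewed", "roughly symmetric"
    )
    summary.index.name = "Variable"
    return summary


# Hypothetical training data.
train = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [1.0, 1.0, 1.0, 1.0, 50.0],
})
stats = summarize(train)
print(stats)
```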