data-science-eda
Exploratory Data Analysis (EDA)
Use this skill for understanding datasets before modeling: profiling distributions, detecting anomalies, identifying relationships, and assessing data quality.
When to use this skill
- New dataset — need orientation on structure, types, distributions
- Before feature engineering — understand variable relationships
- Data quality investigation — find anomalies, missing patterns, outliers
- Model preparation — validate assumptions about data
Core EDA workflow
- Profile structure
  - Schema, types, cardinality
  - Missing value patterns
- Analyze distributions
  - Numerical: histograms, boxplots, skewness
  - Categorical: frequencies, rare categories
- Explore relationships
  - Correlation matrix (numerical)
  - Cross-tabulations (categorical)
  - Target-variable relationships
- Identify issues
  - Outliers, duplicates, inconsistencies
  - Class imbalance (classification)
  - Temporal patterns (time series)
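The first two workflow steps can be sketched in a few lines of pandas. This is a minimal illustration, not a prescribed implementation: the helper name `profile_structure` and the toy columns (`age`, `city`, `income`) are hypothetical and exist only to show the shape of the output.

```python
import numpy as np
import pandas as pd


def profile_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, cardinality, and missingness for each column."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_unique": df.nunique(dropna=True),
            "missing_frac": df.isna().mean(),
        }
    )


# Hypothetical toy data, used only to illustrate the summary.
toy = pd.DataFrame(
    {
        "age": [23, 35, np.nan, 41, 35],
        "city": ["Oslo", "Lima", "Oslo", None, "Oslo"],
        "income": [30_000, 42_000, 39_000, 250_000, 41_000],
    }
)

summary = profile_structure(toy)                    # one row per column
numeric_skew = toy.select_dtypes("number").skew()   # skewness of numeric columns

print(summary)
print(numeric_skew)
```

A strongly positive skew, as the single large `income` value produces here, is a cue to look at the histogram or boxplot before deciding on any transformation.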
Quick tool selection
| Task | Default choice | Notes |
|---|---|---|
| Automated profiling | ydata-profiling (formerly pandas-profiling) | Fast, comprehensive reports |
| Interactive exploration | ipywidgets + plotly | Drill-down capability |
| Statistical tests | scipy.stats | Normality, correlations |
| Large datasets | Polars (lazy API) | Memory-efficient |
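As an illustration of the "Statistical tests" row, the sketch below runs a normality test and a rank correlation with `scipy.stats`. The data is synthetic and the variable names are assumptions made for the example; the choice of Shapiro-Wilk and Spearman is one reasonable default, not the only option.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed data (an assumption for illustration only).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)
y = x + rng.normal(scale=0.1, size=500)

# Shapiro-Wilk: a small p-value suggests the sample is not normally distributed.
shapiro_stat, shapiro_p = stats.shapiro(x)

# Spearman rank correlation: robust to skew and monotonic nonlinearity.
rho, rho_p = stats.spearmanr(x, y)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3g}")
print(f"Spearman rho: {rho:.3f}")
```

With large samples, normality tests reject for even trivial deviations, so it is usually worth reading the p-value alongside a histogram or Q-Q plot.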
Core implementation rules
1) Start with automated profiling
```python
import polars as pl
from ydata_profiling import ProfileReport

# Load with Polars, then convert: ProfileReport expects a pandas DataFrame.
df = pl.read_parquet("data.parquet")
profile = ProfileReport(df.to_pandas(), title="Data Profile")

# Write a standalone HTML report.
profile.to_file("profile_report.html")
```
2) Focus on actionable insights
- Document outliers worth investigating (not all outliers are problems)
- Flag features with high cardinality or rare categories
- Note strong correlations that may cause multicollinearity
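The flags above could be computed along these lines. This is a sketch under stated assumptions: the helper names `rare_categories` and `strong_pairs`, the 5% rarity cutoff, the 0.9 correlation threshold, and the toy columns are all hypothetical choices for illustration and should be tuned to the dataset.

```python
import numpy as np
import pandas as pd


def rare_categories(s: pd.Series, min_frac: float = 0.05) -> list:
    """Return category values whose relative frequency is below min_frac."""
    freq = s.value_counts(normalize=True, dropna=True)
    return sorted(freq[freq < min_frac].index.tolist())


def strong_pairs(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return (col_a, col_b, |r|) for numeric pairs with |Pearson r| >= threshold."""
    corr = df.select_dtypes("number").corr().abs()
    cols = corr.columns
    return [
        (cols[i], cols[j], round(float(corr.iloc[i, j]), 3))
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
        if corr.iloc[i, j] >= threshold
    ]


# Hypothetical toy data.
rng = np.random.default_rng(0)
n = 40
df = pd.DataFrame(
    {
        "user_id": [f"u{i}" for i in range(n)],
        "plan": ["basic"] * 30 + ["pro"] * 9 + ["legacy"],
        "height_cm": np.linspace(150, 190, n),
        "score": rng.normal(size=n),
    }
)
df["height_in"] = df["height_cm"] / 2.54  # deliberately redundant feature

cardinality_ratio = df["user_id"].nunique() / len(df)  # 1.0 means one value per row
rare = rare_categories(df["plan"])
pairs = strong_pairs(df)

print(cardinality_ratio, rare, pairs)
```

A cardinality ratio near 1.0 usually indicates an identifier rather than a feature, and a near-perfect correlation such as `height_cm` versus `height_in` points to a redundant column.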
3) Visualize for communication
- Distribution plots for key variables
- Correlation heatmap
- Missing value patterns
- Target relationship plots
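Three of the plots listed above can be drawn with plain Matplotlib, as in the sketch below. It assumes Matplotlib is installed and uses synthetic data with hypothetical column names (`a`, `b`, `c`); Seaborn or Plotly would work equally well.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data with some values knocked out of column "c".
rng = np.random.default_rng(42)
df = pd.DataFrame(
    {
        "a": rng.normal(size=200),
        "b": rng.exponential(size=200),
        "c": rng.normal(size=200),
    }
)
df.loc[rng.choice(200, size=20, replace=False), "c"] = np.nan

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Distribution plot for one key variable.
axes[0].hist(df["b"].dropna(), bins=30)
axes[0].set_title("Distribution of b")

# Correlation heatmap on a fixed [-1, 1] color scale.
corr = df.corr()
im = axes[1].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[1].set_xticks(range(len(corr)))
axes[1].set_xticklabels(corr.columns)
axes[1].set_yticks(range(len(corr)))
axes[1].set_yticklabels(corr.columns)
axes[1].set_title("Correlation")
fig.colorbar(im, ax=axes[1])

# Missing value fraction per column.
missing = df.isna().mean()
axes[2].bar(missing.index, missing.to_numpy())
axes[2].set_title("Missing fraction")

fig.tight_layout()
```

Calling `fig.savefig(...)` afterwards exports the overview for a report or shared notebook.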
4) Validate assumptions
- Check for expected ranges/business rules
- Verify temporal consistency
- Confirm key relationships match domain knowledge
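Range and temporal checks like these translate directly into boolean filters. In the sketch below, the `orders` table, its columns, and the rule "quantity must be between 1 and 100" are hypothetical stand-ins for whatever business rules apply to the real data.

```python
import pandas as pd

# Hypothetical order data with deliberate rule violations.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "quantity": [2, 0, 5, -1],
        "ordered_at": pd.to_datetime(
            ["2024-01-03", "2024-01-05", "2024-01-04", "2024-01-09"]
        ),
        "shipped_at": pd.to_datetime(
            ["2024-01-04", "2024-01-06", "2024-01-02", "2024-01-10"]
        ),
    }
)

# Expected range / business rule: quantity must be between 1 and 100 inclusive.
range_violations = orders[~orders["quantity"].between(1, 100)]

# Temporal consistency: an order cannot ship before it was placed.
temporal_violations = orders[orders["shipped_at"] < orders["ordered_at"]]

# Ordering check: are rows already sorted by order time?
is_sorted = orders["ordered_at"].is_monotonic_increasing

print(range_violations["order_id"].tolist())
print(temporal_violations["order_id"].tolist())
print(is_sorted)
```

Keeping the violating rows, rather than only a count, makes it easier to take concrete examples back to the data owners.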
Common anti-patterns
- ❌ Skipping EDA and jumping to modeling
- ❌ Treating all outliers as errors
- ❌ Ignoring missing value mechanisms (MCAR/MAR/MNAR)
- ❌ Over-plotting large datasets without sampling
- ❌ Not documenting findings for the team
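For the missing-value mechanism point, a quick first check is whether missingness in one column depends on another observed column. The sketch below builds synthetic data in which `income` is missing far more often for one `segment`; both column names and the drop probabilities are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 2000
segment = rng.choice(["web", "store"], size=n)
income = rng.normal(50_000, 10_000, size=n)

# Synthetic MAR-style missingness: "web" rows lose income far more often.
drop_prob = np.where(segment == "web", 0.4, 0.05)
income[rng.random(n) < drop_prob] = np.nan

df = pd.DataFrame({"segment": segment, "income": income})

# Missing rate of income within each segment.
missing_by_segment = df["income"].isna().groupby(df["segment"]).mean()
print(missing_by_segment)
```

A large gap between groups is evidence against MCAR and suggests that dropping incomplete rows would bias the sample. It cannot distinguish MAR from MNAR, which generally requires domain knowledge.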
Progressive disclosure
- ../references/automated-profiling.md: ydata-profiling, Sweetviz, D-Tale
- ../references/visualization-patterns.md: Matplotlib, Seaborn, Plotly patterns
- ../references/statistical-tests.md: SciPy statistical tests guide
- ../references/large-dataset-eda.md: Sampling, Polars, Dask approaches
Related skills
- @data-science-feature-engineering: Next step after EDA
- @data-science-model-evaluation: Validate modeling assumptions
- @data-engineering-quality: Data validation frameworks