exploratory-data-analysis

Exploratory Data Analysis

Overview

EDA is a process for discovering patterns, anomalies, and relationships in data. This skill analyzes CSV/Excel/JSON/Parquet files to generate statistical summaries, distributions, correlations, outlier reports, and visualizations. All outputs are markdown-formatted for integration into downstream workflows.

When to Use This Skill

This skill should be used when:
  • User provides a data file and requests analysis or exploration
  • User asks to "explore this dataset", "analyze this data", or "what's in this file?"
  • User needs statistical summaries, distributions, or correlations
  • User requests data visualizations or insights
  • User wants to understand data quality issues or patterns
  • User mentions EDA, exploratory analysis, or data profiling
Supported file formats: CSV, Excel (.xlsx, .xls), JSON, Parquet, TSV, Feather, HDF5, Pickle
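Format auto-detection can be sketched as a simple extension-to-reader dispatch. This is a minimal illustration under assumed names (the `READERS` table and `load_dataframe` helper are hypothetical, not the actual loader in `eda_analyzer.py`); the reader functions themselves are standard pandas APIs.

```python
from pathlib import Path

import pandas as pd

# Hypothetical dispatch table mapping file extensions to pandas readers.
READERS = {
    ".csv": pd.read_csv,
    ".tsv": lambda p, **kw: pd.read_csv(p, sep="\t", **kw),
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
    ".feather": pd.read_feather,
    ".h5": pd.read_hdf,
    ".pkl": pd.read_pickle,
}

def load_dataframe(path: str) -> pd.DataFrame:
    """Pick a reader based on the file extension, or raise for unsupported formats."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return READERS[suffix](path)
```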

Quick Start Workflow

  1. Receive the data file from the user
  2. Run comprehensive analysis using scripts/eda_analyzer.py
  3. Generate visualizations using scripts/visualizer.py
  4. Create a markdown report from the insights using the assets/report_template.md template
  5. Present findings to the user with key insights highlighted

Core Capabilities

1. Comprehensive Data Analysis

Execute full statistical analysis using the eda_analyzer.py script:

   python scripts/eda_analyzer.py <data_file_path> -o <output_directory>

What it provides:
  • Auto-detection and loading of file formats
  • Basic dataset information (shape, types, memory usage)
  • Missing data analysis (patterns, percentages)
  • Summary statistics for numeric and categorical variables
  • Outlier detection using IQR and Z-score methods
  • Distribution analysis with normality tests (Shapiro-Wilk, Anderson-Darling)
  • Correlation analysis (Pearson and Spearman)
  • Data quality assessment (completeness, duplicates, issues)
  • Automated insight generation

Output: a JSON file containing all analysis results at <output_directory>/eda_analysis.json
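The IQR and Z-score outlier checks listed above can be sketched as follows. These are hypothetical helpers illustrating the two standard methods, not the analyzer's actual implementation.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (the classic Tukey fence)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values more than `threshold` standard deviations from the mean."""
    z = (series - series.mean()) / series.std(ddof=0)
    return z.abs() > threshold
```

Note that the two methods can disagree: a single extreme value inflates the standard deviation, so the Z-score check may miss an outlier that the IQR fence catches.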

2. Comprehensive Visualizations

Generate the complete visualization suite using the visualizer.py script:

   python scripts/visualizer.py <data_file_path> -o <output_directory>

Generated visualizations:
  • Missing data patterns: heatmap and bar chart showing missing data
  • Distribution plots: histograms with KDE overlays for all numeric variables
  • Box plots with violin overlays: outlier detection visualizations
  • Correlation heatmap: both Pearson and Spearman correlation matrices
  • Scatter matrix: pairwise relationships between numeric variables
  • Categorical analysis: bar charts for top categories
  • Time series plots: temporal trends with trend lines (if datetime columns exist)

Output: high-quality PNG files saved to <output_directory>/eda_visualizations/

All visualizations are production-ready with:
  • 300 DPI resolution
  • Clear titles and labels
  • Statistical annotations
  • Professional styling using seaborn
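A minimal sketch of how one such figure might be saved at 300 DPI with matplotlib. The `save_histogram` helper is hypothetical and much simpler than visualizer.py itself; it only illustrates the DPI, title, and label conventions listed above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

def save_histogram(data, path: str, title: str) -> None:
    """Save a labeled histogram as a 300 DPI PNG (sketch, not visualizer.py)."""
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.hist(data, bins=30, edgecolor="black")
    ax.set_title(title)
    ax.set_xlabel("value")
    ax.set_ylabel("count")
    fig.savefig(path, dpi=300, bbox_inches="tight")  # 300 DPI per the spec above
    plt.close(fig)
```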

3. Automated Insight Generation

The analyzer automatically generates actionable insights including:
  • Data scale insights: Dataset size considerations for processing
  • Missing data alerts: Warnings when missing data exceeds thresholds
  • Correlation discoveries: Strong relationships identified for feature engineering
  • Outlier warnings: Variables with high outlier rates flagged
  • Distribution assessments: Skewness issues requiring transformations
  • Duplicate alerts: Duplicate row detection
  • Imbalance warnings: Categorical variable imbalance detection
Access insights from the analysis results JSON under the "insights" key.
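Reading those insights back out of the JSON might look like the sketch below. The `load_insights` helper is hypothetical; only the "insights" key is taken from the description above, and other fields in the file may vary.

```python
import json

def load_insights(path: str) -> list:
    """Return the "insights" list from the analyzer's JSON output (sketch)."""
    with open(path) as f:
        results = json.load(f)
    return results.get("insights", [])
```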

4. Statistical Interpretation

For detailed interpretation of statistical tests and measures, reference references/statistical_tests_guide.md, a comprehensive guide covering:
  • Normality tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov)
  • Distribution characteristics (skewness, kurtosis)
  • Correlation tests (Pearson, Spearman)
  • Outlier detection methods (IQR, Z-score)
  • Hypothesis testing guidelines
  • Data transformation strategies
Load this reference when needing to interpret specific statistical tests or explain results to users.
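As a small worked example of the kind of interpretation the guide covers, the sketch below (assuming scipy is available) runs a Shapiro-Wilk test on a right-skewed sample and shows how a log1p transform reduces skewness. The variable names are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed = rng.exponential(scale=1.0, size=500)  # strongly right-skewed sample

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed,
# so a small p-value is evidence of non-normality.
_, p_value = stats.shapiro(skewed)

# A log1p transform typically reduces right skew dramatically.
skew_before = stats.skew(skewed)
skew_after = stats.skew(np.log1p(skewed))
```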

5. Best Practices Guidance

For methodological guidance, reference references/eda_best_practices.md, which details best practices including:
  • EDA process framework (6-step methodology)
  • Univariate, bivariate, and multivariate analysis approaches
  • Visualization guidelines
  • Statistical analysis guidelines
  • Common pitfalls to avoid
  • Domain-specific considerations
  • Communication tips for technical and non-technical audiences
Load this reference when planning analysis approach or needing guidance on specific EDA scenarios.

Creating Analysis Reports

Use the provided template, assets/report_template.md, to structure comprehensive EDA reports. This professional report template includes sections for:
  • Executive summary
  • Dataset overview
  • Data quality assessment
  • Univariate, bivariate, and multivariate analysis
  • Outlier analysis
  • Key insights and findings
  • Recommendations
  • Limitations and appendices
To use the template:
  1. Copy the template content
  2. Fill in sections with analysis results from JSON output
  3. Embed visualization images using markdown syntax
  4. Populate insights and recommendations
  5. Save as markdown for user consumption

Typical Workflow Example

When a user provides a data file:
User: "Can you explore this sales_data.csv file and tell me what you find?"

1. Run analysis:
   python scripts/eda_analyzer.py sales_data.csv -o ./analysis_output

2. Generate visualizations:
   python scripts/visualizer.py sales_data.csv -o ./analysis_output

3. Read analysis results:
   Read ./analysis_output/eda_analysis.json

4. Create markdown report using template:
   - Copy assets/report_template.md structure
   - Fill in sections with analysis results
   - Reference visualizations from ./analysis_output/eda_visualizations/
   - Include automated insights from JSON

5. Present to user:
   - Show key insights prominently
   - Highlight data quality issues
   - Provide visualizations inline
   - Make actionable recommendations
   - Save complete report as .md file

Advanced Analysis Scenarios

Large Datasets (>1M rows)

  • Run analysis on sampled data first for quick exploration
  • Note sample size in report
  • Recommend distributed computing for full analysis
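The sampling suggestion above can be sketched as follows; `sample_for_eda` is a hypothetical helper, and the 100,000-row cutoff is an arbitrary example, not a recommendation from the scripts.

```python
import pandas as pd

def sample_for_eda(df: pd.DataFrame, max_rows: int = 100_000, seed: int = 0) -> pd.DataFrame:
    """Return the frame unchanged if small enough, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)
```

Fixing `random_state` keeps the exploration reproducible, so the sample size noted in the report corresponds to a sample that can be regenerated exactly.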

High-Dimensional Data (>50 columns)

  • Focus on most important variables first
  • Consider PCA or feature selection
  • Generate correlation analysis to identify variable groups
  • Reference the eda_best_practices.md section on high-dimensional data
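Identifying groups of strongly correlated variables, as suggested above, might be sketched like this; `correlated_pairs` is a hypothetical helper built on pandas' standard correlation matrix.

```python
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """List column pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs
```

Pairs flagged here are candidates for dropping one member or combining them via PCA before deeper analysis.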

Time Series Data

  • Ensure datetime columns are properly detected
  • Time series visualizations will be automatically generated
  • Consider temporal patterns, trends, and seasonality
  • Reference the eda_best_practices.md section on time series

Imbalanced Data

  • Categorical analysis will flag imbalances
  • Report class distributions prominently
  • Recommend stratified sampling if needed

Small Sample Sizes (<100 rows)

  • Non-parametric methods are automatically used where appropriate
  • Be conservative in statistical conclusions
  • Note sample size limitations in report

Output Best Practices

Always output as markdown:
  • Structure findings using markdown headers, tables, and lists
  • Embed visualizations using ![Description](path/to/image.png) syntax
  • Use tables for statistical summaries
  • Include code blocks for any suggested transformations
  • Highlight key insights with bold or bullet points
Ensure reports are actionable:
  • Provide clear recommendations based on findings
  • Flag data quality issues that need attention
  • Suggest next steps for modeling or further analysis
  • Identify feature engineering opportunities
Make insights accessible:
  • Explain statistical concepts in plain language
  • Use reference guides to provide detailed interpretations
  • Include both technical details and executive summary
  • Tailor communication to user's technical level

Handling Edge Cases

Unsupported file formats:
  • Request user to convert to supported format
  • Suggest using pandas-compatible formats
Files too large to load:
  • Recommend sampling approach
  • Suggest chunked processing
  • Consider alternative tools for big data
Corrupted or malformed data:
  • Report specific errors encountered
  • Suggest data cleaning steps
  • Try to salvage partial analysis if possible
Columns with all data missing:
  • Flag completely empty columns
  • Recommend removal or investigation
  • Document in data quality section
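The chunked-processing suggestion for oversized files can be sketched with pandas' standard `chunksize` option; `chunked_row_count` is a hypothetical helper showing the streaming pattern.

```python
import pandas as pd

def chunked_row_count(path: str, chunksize: int = 100_000) -> int:
    """Stream a CSV in fixed-size chunks instead of loading it whole."""
    total = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        total += len(chunk)  # any per-chunk aggregation could go here
    return total
```

The same loop shape works for accumulating per-chunk statistics (sums, counts, min/max) that can be combined into whole-file summaries without ever holding the full dataset in memory.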

Resources Summary

scripts/

  • eda_analyzer.py: Main analysis engine for comprehensive statistical analysis
  • visualizer.py: Visualization generator that creates all chart types
Both scripts are fully executable and handle multiple file formats automatically.

references/

  • statistical_tests_guide.md: Statistical test interpretation and methodology
  • eda_best_practices.md: Comprehensive EDA methodology and best practices
Load these references as needed to inform analysis approach and interpretation.

assets/

  • report_template.md: Professional markdown report template
Use this template structure to create consistent, comprehensive EDA reports.

Key Reminders

  1. Always generate markdown output for textual results
  2. Run both scripts (analyzer and visualizer) for complete analysis
  3. Use the template to structure comprehensive reports
  4. Include visualizations by referencing generated PNG files
  5. Provide actionable insights - don't just present statistics
  6. Interpret findings using reference guides
  7. Document limitations and data quality issues
  8. Make recommendations for next steps
This skill transforms raw data into actionable insights through systematic exploration, advanced statistics, rich visualizations, and clear communication.