exploratory-data-analysis
Exploratory Data Analysis
Discover patterns, anomalies, and relationships in tabular data through statistical analysis and visualization.
Supported formats: CSV, Excel (.xlsx, .xls), JSON, Parquet, TSV, Feather, HDF5, Pickle
Standard Workflow
- Run statistical analysis:
  ```bash
  python scripts/eda_analyzer.py <data_file> -o <output_dir>
  ```
- Generate visualizations:
  ```bash
  python scripts/visualizer.py <data_file> -o <output_dir>
  ```
- Read analysis results from `<output_dir>/eda_analysis.json`
- Create report using `assets/report_template.md` structure
- Present findings with key insights and visualizations
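The workflow above can also be driven from Python. The following is a minimal sketch, not part of the skill itself: `build_eda_commands` and `run_workflow` are hypothetical helper names, and using `sys.executable` in place of a bare `python` is an assumption made so that the same interpreter runs both scripts.

```python
import json
import subprocess
import sys
from pathlib import Path


def build_eda_commands(data_file, output_dir):
    """Return the two documented command lines as argument lists."""
    return [
        [sys.executable, "scripts/eda_analyzer.py", str(data_file), "-o", str(output_dir)],
        [sys.executable, "scripts/visualizer.py", str(data_file), "-o", str(output_dir)],
    ]


def run_workflow(data_file, output_dir):
    """Run both scripts, then load <output_dir>/eda_analysis.json."""
    for cmd in build_eda_commands(data_file, output_dir):
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run(cmd, check=True)
    with open(Path(output_dir) / "eda_analysis.json", encoding="utf-8") as f:
        return json.load(f)
```

Calling `run_workflow("sales_data.csv", "./output")` from the skill's root directory would return the parsed analysis, ready for the report-building steps.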
Analysis Capabilities
Statistical Analysis
Run `scripts/eda_analyzer.py` to generate comprehensive analysis:
```bash
python scripts/eda_analyzer.py sales_data.csv -o ./output
```
Produces `output/eda_analysis.json` containing:
- Dataset shape, types, memory usage
- Missing data patterns and percentages
- Summary statistics (numeric and categorical)
- Outlier detection (IQR and Z-score methods)
- Distribution analysis with normality tests
- Correlation matrices (Pearson and Spearman)
- Data quality metrics (completeness, duplicates)
- Automated insights
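For readers unfamiliar with the two outlier rules named above, here is a minimal standard-library sketch of each. It is not the code in `scripts/eda_analyzer.py`; the function names and the cutoffs (1.5 * IQR and |z| > 3) are conventional defaults assumed for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        return []  # constant column: nothing can be an outlier
    return [v for v in values if abs((v - mean) / sd) > threshold]
```

On a small sample a single extreme value inflates the standard deviation, so the Z-score rule can miss a point that the IQR rule flags. That is one reason to report both.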
Visualizations
Run `scripts/visualizer.py` to generate plots:
```bash
python scripts/visualizer.py sales_data.csv -o ./output
```
Creates high-resolution (300 DPI) PNG files in `output/eda_visualizations/`:
- Missing data heatmaps and bar charts
- Distribution plots (histograms with KDE)
- Box plots and violin plots for outliers
- Correlation heatmaps
- Scatter matrices for numeric relationships
- Categorical bar charts
- Time series plots (if datetime columns detected)
Automated Insights
Access generated insights from the `"insights"` key in the analysis JSON:
- Dataset size considerations
- Missing data warnings (when exceeding thresholds)
- Strong correlations for feature engineering
- High outlier rate flags
- Skewness requiring transformations
- Duplicate detection
- Categorical imbalance warnings
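A short sketch of pulling the insights into report-ready bullets follows. The source only names the `"insights"` key; the assumption that its value is a list of strings is mine, and `format_insights` is a hypothetical helper, so check the actual JSON before relying on it.

```python
def format_insights(results):
    """Return markdown bullet lines for results['insights'].

    Assumption: the value is a list of strings. The real schema written by
    scripts/eda_analyzer.py may differ, so other shapes are handled loosely.
    """
    insights = results.get("insights", [])
    if isinstance(insights, str):
        # A single message rather than a list.
        insights = [insights]
    elif isinstance(insights, dict):
        # Hypothetical alternative shape: {category: message}.
        insights = [f"{key}: {value}" for key, value in insights.items()]
    return [f"- {item}" for item in insights]
```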
Reference Materials
Statistical Interpretation
See `references/statistical_tests_guide.md` for detailed guidance on:
- Normality tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov)
- Distribution characteristics (skewness, kurtosis)
- Correlation methods (Pearson, Spearman)
- Outlier detection (IQR, Z-score)
- Hypothesis testing and data transformations
Use when interpreting statistical results or explaining findings.
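The difference between the two correlation methods is easiest to see on a monotonic but non-linear relationship. The sketch below uses only the standard library; `pearson`, `spearman`, and `average_ranks` are illustrative names, not functions from the skill's scripts.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation: linear association between x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For `y = x**2` with `x = [1, 2, 3, 4, 5]`, Spearman is 1.0 because the ranks agree exactly, while Pearson is about 0.98 because the relationship is not linear.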
Methodology
See `references/eda_best_practices.md` for comprehensive guidance on:
- 6-step EDA process framework
- Univariate, bivariate, multivariate analysis approaches
- Visualization and statistical analysis guidelines
- Common pitfalls and domain-specific considerations
- Communication strategies for different audiences
Use when planning analysis or handling specific scenarios.
Report Template
Use `assets/report_template.md` to structure findings. Template includes:
- Executive summary
- Dataset overview
- Data quality assessment
- Univariate, bivariate, and multivariate analysis
- Outlier analysis
- Key insights and recommendations
- Limitations and appendices
Fill sections with analysis JSON results and embed visualizations using markdown image syntax.
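Embedding the plots can be mechanical. As a rough sketch, the helpers below turn PNG paths into markdown image lines; `image_markdown` and `embed_visualizations` are hypothetical names, and deriving the alt text from the file name is an assumption, since the source does not list the PNG file names the visualizer writes.

```python
from pathlib import Path


def image_markdown(png_path):
    """Return a markdown image line, with alt text taken from the file name."""
    path = Path(png_path)
    alt = path.stem.replace("_", " ").capitalize()
    # as_posix() keeps forward slashes so the link also renders on Windows.
    return f"![{alt}]({path.as_posix()})"


def embed_visualizations(viz_dir):
    """Return one markdown image line per PNG in viz_dir, sorted by name."""
    return [image_markdown(p) for p in sorted(Path(viz_dir).glob("*.png"))]
```

The resulting lines can be pasted into the matching sections of the report template.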
Example: Complete Analysis
User request: "Explore this sales_data.csv file"
```bash
# 1. Run analysis
python scripts/eda_analyzer.py sales_data.csv -o ./output

# 2. Generate visualizations
python scripts/visualizer.py sales_data.csv -o ./output
```
```python
# 3. Read results
import json
with open('./output/eda_analysis.json') as f:
    results = json.load(f)

# 4. Build report from assets/report_template.md
# - Fill sections with results
# - Embed images using markdown image syntax
# - Include insights from results['insights']
# - Add recommendations
```
Special Cases
Dataset Size Strategy
If < 100 rows: Note sample size limitations, use non-parametric methods
If 100-1M rows: Standard workflow applies
If > 1M rows: Sample first for quick exploration, note sample size in report, recommend distributed computing for full analysis
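The three size bands above translate directly into a small decision helper. This is a sketch under stated assumptions: `size_strategy` and `sample_row_indices` are hypothetical names, and the default sample size of 100,000 rows is an illustrative choice that the source does not specify.

```python
import random

SMALL_MAX_ROWS = 100            # below this: small-sample caveats apply
STANDARD_MAX_ROWS = 1_000_000   # above this: sample before exploring


def size_strategy(n_rows):
    """Map a row count to 'small', 'standard', or 'large'."""
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if n_rows < SMALL_MAX_ROWS:
        return "small"
    if n_rows <= STANDARD_MAX_ROWS:
        return "standard"
    return "large"


def sample_row_indices(n_rows, sample_size=100_000, seed=0):
    """Pick a reproducible random subset of row indices for quick exploration."""
    k = min(sample_size, n_rows)
    return sorted(random.Random(seed).sample(range(n_rows), k))
```

Fixing the seed keeps the sample reproducible, which makes it easier to note the exact sample in the report.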
Data Characteristics
High-dimensional (>50 columns): Focus on key variables first, use correlation analysis to identify groups, consider PCA or feature selection. See `references/eda_best_practices.md` for guidance.
Time series: Datetime columns auto-detected, temporal visualizations generated automatically. Consider trends, seasonality, patterns.
Imbalanced: Categorical analysis flags imbalances automatically. Report distributions prominently, recommend stratified sampling if needed.
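To make "imbalance" concrete when reporting distributions, one simple measure is the share of the most frequent category. The sketch below is illustrative only: the source does not say how the scripts flag imbalance, and both the helper names and the 0.9 threshold are assumptions.

```python
from collections import Counter


def dominant_share(values):
    """Return (most common label, its share of all values)."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no values to analyze")
    label, count = counts.most_common(1)[0]
    return label, count / total


def is_imbalanced(values, threshold=0.9):
    """Flag a column when one category reaches the (assumed) threshold share."""
    return dominant_share(values)[1] >= threshold
```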
Output Guidelines
Format findings as markdown:
- Use headers, tables, and lists for structure
- Embed visualizations using markdown image syntax
- Include code blocks for suggested transformations
- Highlight key insights
Make reports actionable:
- Provide clear recommendations
- Flag data quality issues requiring attention
- Suggest next steps (modeling, feature engineering, further analysis)
- Tailor communication to user's technical level
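As an example of the kind of transformation snippet a report might include, the sketch below measures skewness and applies a `log1p` transform to a right-skewed column. The helper names are hypothetical and the choice of `log1p` is an assumption; other transformations (square root, Box-Cox) may suit a given column better.

```python
import math
import statistics


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m3 = statistics.fmean([(v - mean) ** 3 for v in values])
    if m2 == 0:
        return 0.0  # constant column has no skew
    return m3 / m2 ** 1.5


def log1p_transform(values):
    """Apply log(1 + x); every value must be greater than -1."""
    return [math.log1p(v) for v in values]
```

Comparing `skewness(column)` before and after the transform shows whether it actually helped, which is worth stating in the report alongside the suggestion.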
Error Handling
Unsupported formats: Request conversion to supported format (CSV, Excel, JSON, Parquet)
Files too large: Recommend sampling or chunked processing
Corrupted data: Report specific errors, suggest cleaning steps, attempt partial analysis
Empty columns: Flag in data quality section, recommend removal or investigation
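The unsupported-format case can be caught before either script runs. The following is a minimal sketch: `detect_format` is a hypothetical helper, and apart from `.xlsx` and `.xls` the file suffixes are assumptions, since the source names the formats rather than their extensions.

```python
from pathlib import Path

# Formats listed as supported in this document. Suffixes other than
# .xlsx and .xls are assumed; the source names formats, not extensions.
SUPPORTED_EXTENSIONS = {
    ".csv": "CSV",
    ".tsv": "TSV",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".json": "JSON",
    ".parquet": "Parquet",
    ".feather": "Feather",
    ".h5": "HDF5",
    ".hdf5": "HDF5",
    ".pkl": "Pickle",
    ".pickle": "Pickle",
}


def detect_format(data_file):
    """Return the format name for a file, or raise ValueError if unsupported."""
    suffix = Path(data_file).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(
            f"Unsupported format {suffix or '(none)'!r}: "
            "convert to CSV, Excel, JSON, or Parquet"
        )
    return SUPPORTED_EXTENSIONS[suffix]
```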
Resources
Scripts (handle all formats automatically):
- `scripts/eda_analyzer.py` - Statistical analysis engine
- `scripts/visualizer.py` - Visualization generator
References (load as needed):
- `references/statistical_tests_guide.md` - Test interpretation and methodology
- `references/eda_best_practices.md` - EDA process and best practices
Template:
- `assets/report_template.md` - Professional report structure
Key Points
- Run both scripts for complete analysis
- Structure reports using the template
- Provide actionable insights, not just statistics
- Use reference guides for detailed interpretations
- Document data quality issues and limitations
- Make clear recommendations for next steps