exploratory-data-analysis
Exploratory Data Analysis
Overview
EDA is a process for discovering patterns, anomalies, and relationships in data. Analyze CSV/Excel/JSON/Parquet files to generate statistical summaries, distributions, correlations, outliers, and visualizations. All outputs are markdown-formatted for integration into workflows.
When to Use This Skill
This skill should be used when:
- User provides a data file and requests analysis or exploration
- User asks to "explore this dataset", "analyze this data", or "what's in this file?"
- User needs statistical summaries, distributions, or correlations
- User requests data visualizations or insights
- User wants to understand data quality issues or patterns
- User mentions EDA, exploratory analysis, or data profiling
Supported file formats: CSV, Excel (.xlsx, .xls), JSON, Parquet, TSV, Feather, HDF5, Pickle
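As a rough illustration of how format auto-detection for the formats listed above might work, the sketch below maps file extensions to pandas reader names. The reader names are real pandas functions, but the mapping, the `READERS` table, and the `pick_reader` helper are assumptions made for this example; the actual logic in scripts/eda_analyzer.py is not shown here.

```python
from pathlib import Path

# Hypothetical extension-to-reader table. The values are names of real pandas
# functions; the table itself is an assumption, not the analyzer's own code.
READERS = {
    ".csv": "read_csv",
    ".tsv": "read_csv",  # would also need sep="\t"
    ".xlsx": "read_excel",
    ".xls": "read_excel",
    ".json": "read_json",
    ".parquet": "read_parquet",
    ".feather": "read_feather",
    ".h5": "read_hdf",
    ".hdf5": "read_hdf",
    ".pkl": "read_pickle",
    ".pickle": "read_pickle",
}


def pick_reader(path):
    """Return the pandas reader name for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file format: {suffix or '(none)'}")
    return READERS[suffix]
```

An unsupported extension raises `ValueError`, which matches the edge-case guidance of asking the user to convert the file to a supported format.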
Quick Start Workflow
- Receive data file from user
- Run comprehensive analysis using scripts/eda_analyzer.py
- Generate visualizations using scripts/visualizer.py
- Create markdown report using insights and the assets/report_template.md template
- Present findings to user with key insights highlighted
Core Capabilities
1. Comprehensive Data Analysis
Execute full statistical analysis using the eda_analyzer.py script:
python scripts/eda_analyzer.py <data_file_path> -o <output_directory>
What it provides:
- Auto-detection and loading of file formats
- Basic dataset information (shape, types, memory usage)
- Missing data analysis (patterns, percentages)
- Summary statistics for numeric and categorical variables
- Outlier detection using IQR and Z-score methods
- Distribution analysis with normality tests (Shapiro-Wilk, Anderson-Darling)
- Correlation analysis (Pearson and Spearman)
- Data quality assessment (completeness, duplicates, issues)
- Automated insight generation
Output: JSON file containing all analysis results at <output_directory>/eda_analysis.json
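To make the two outlier methods named in the capability list concrete, here is a minimal standard-library sketch. The function names, the multiplier `k=1.5`, and the Z-score threshold of 3.0 are conventional defaults assumed for illustration; the analyzer's own thresholds are not documented here.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    if std == 0:
        return []  # constant data: no value deviates from the mean
    return [v for v in values if abs(v - mean) / std > threshold]


# Toy data (made up for this example) with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]
print(iqr_outliers(data))
print(zscore_outliers(data, threshold=2.5))
```

The IQR rule relies on quartiles, so a single extreme value barely moves its bounds. The Z-score rule uses the mean and standard deviation, both of which the outlier itself inflates, so it tends to be less sensitive on small samples.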
2. Comprehensive Visualizations
Generate complete visualization suite using the visualizer.py script:
python scripts/visualizer.py <data_file_path> -o <output_directory>
Generated visualizations:
- Missing data patterns: Heatmap and bar chart showing missing data
- Distribution plots: Histograms with KDE overlays for all numeric variables
- Box plots with violin plots: Outlier detection visualizations
- Correlation heatmap: Both Pearson and Spearman correlation matrices
- Scatter matrix: Pairwise relationships between numeric variables
- Categorical analysis: Bar charts for top categories
- Time series plots: Temporal trends with trend lines (if datetime columns exist)
Output: High-quality PNG files saved to <output_directory>/eda_visualizations/
All visualizations are production-ready with:
- 300 DPI resolution
- Clear titles and labels
- Statistical annotations
- Professional styling using seaborn
3. Automated Insight Generation
The analyzer automatically generates actionable insights including:
- Data scale insights: Dataset size considerations for processing
- Missing data alerts: Warnings when missing data exceeds thresholds
- Correlation discoveries: Strong relationships identified for feature engineering
- Outlier warnings: Variables with high outlier rates flagged
- Distribution assessments: Skewness issues requiring transformations
- Duplicate alerts: Duplicate row detection
- Imbalance warnings: Categorical variable imbalance detection
Access insights from the analysis results JSON under the "insights" key.
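A minimal sketch of reading the insights back from the results file follows. It assumes the value stored under "insights" is a list of strings; the real schema of eda_analysis.json is not shown in this document, and the helper names `load_insights` and `insights_to_markdown` are hypothetical.

```python
import json
from pathlib import Path


def load_insights(output_dir):
    """Read eda_analysis.json and return whatever is stored under "insights"."""
    results_path = Path(output_dir) / "eda_analysis.json"
    with results_path.open(encoding="utf-8") as fh:
        results = json.load(fh)
    # Assumption: "insights" is a list of strings; fall back to an empty list.
    return results.get("insights", [])


def insights_to_markdown(insights):
    """Render each insight as a markdown bullet."""
    return "\n".join(f"- {item}" for item in insights)
```

For example, `insights_to_markdown(load_insights("./analysis_output"))` would produce a bullet list ready to paste into the report.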
4. Statistical Interpretation
For detailed interpretation of statistical tests and measures, refer to references/statistical_tests_guide.md, which covers:
- Normality tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov)
- Distribution characteristics (skewness, kurtosis)
- Correlation tests (Pearson, Spearman)
- Outlier detection methods (IQR, Z-score)
- Hypothesis testing guidelines
- Data transformation strategies
Load this reference when needing to interpret specific statistical tests or explain results to users.
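As one concrete instance of the distribution characteristics and transformation strategies listed above, the sketch below computes skewness and shows a log transform reducing it. It uses the population (Fisher-Pearson) formula, which may differ from whatever estimator the analyzer uses, and the sample data is made up for illustration.

```python
import math


def skewness(values):
    """Fisher-Pearson skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant data has no asymmetry to measure
    return m3 / m2 ** 1.5


# Made-up right-skewed sample: most values are small, one is large.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 50]
logged = [math.log(v) for v in right_skewed]

print(skewness(right_skewed))
print(skewness(logged))
```

A positive value indicates a long right tail. The log transform only applies to strictly positive data; for data containing zeros, `math.log1p` is a common alternative.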
5. Best Practices Guidance
For methodological guidance, refer to references/eda_best_practices.md, which covers:
- EDA process framework (6-step methodology)
- Univariate, bivariate, and multivariate analysis approaches
- Visualization guidelines
- Statistical analysis guidelines
- Common pitfalls to avoid
- Domain-specific considerations
- Communication tips for technical and non-technical audiences
Load this reference when planning analysis approach or needing guidance on specific EDA scenarios.
Creating Analysis Reports
Use the provided template assets/report_template.md to structure comprehensive EDA reports:
- Executive summary
- Dataset overview
- Data quality assessment
- Univariate, bivariate, and multivariate analysis
- Outlier analysis
- Key insights and findings
- Recommendations
- Limitations and appendices
To use the template:
- Copy the template content
- Fill in sections with analysis results from JSON output
- Embed visualization images using markdown syntax
- Populate insights and recommendations
- Save as markdown for user consumption
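The image-embedding step can be sketched as below: one markdown image line per PNG in the visualization directory. The helper name `image_links` and the choice to derive alt text from the file name are assumptions for this example; the actual file names produced by visualizer.py are not listed in this document.

```python
from pathlib import Path


def image_links(viz_dir):
    """Build one markdown image line per PNG found in viz_dir."""
    lines = []
    for png in sorted(Path(viz_dir).glob("*.png")):
        # Use the file name without its extension as the alt text.
        alt = png.stem.replace("_", " ")
        lines.append(f"![{alt}]({png.as_posix()})")
    return lines
```

Joining the result with blank lines gives a block that can be pasted into the relevant report section. Paths are emitted as given, so pass a directory path relative to where the report will be saved.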
Typical Workflow Example
When user provides a data file:
User: "Can you explore this sales_data.csv file and tell me what you find?"
1. Run analysis:
python scripts/eda_analyzer.py sales_data.csv -o ./analysis_output
2. Generate visualizations:
python scripts/visualizer.py sales_data.csv -o ./analysis_output
3. Read analysis results:
Read ./analysis_output/eda_analysis.json
4. Create markdown report using template:
- Copy assets/report_template.md structure
- Fill in sections with analysis results
- Reference visualizations from ./analysis_output/eda_visualizations/
- Include automated insights from JSON
5. Present to user:
- Show key insights prominently
- Highlight data quality issues
- Provide visualizations inline
- Make actionable recommendations
- Save complete report as .md file
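Steps 1 and 2 of this workflow could be driven from Python as sketched below. The command lines are the ones shown above; the wrapper functions `build_commands` and `run_pipeline` are hypothetical, and `sys.executable` is used in place of a bare `python` so the scripts run under the current interpreter.

```python
import subprocess
import sys


def build_commands(data_file, output_dir):
    """Return the analyzer and visualizer command lines as argument lists."""
    return [
        [sys.executable, "scripts/eda_analyzer.py", data_file, "-o", output_dir],
        [sys.executable, "scripts/visualizer.py", data_file, "-o", output_dir],
    ]


def run_pipeline(data_file, output_dir):
    """Run the analyzer, then the visualizer, stopping at the first failure."""
    for cmd in build_commands(data_file, output_dir):
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("sales_data.csv", "./analysis_output")` would then leave the JSON results and the PNG files in place for steps 3 to 5.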
Advanced Analysis Scenarios
Large Datasets (>1M rows)
- Run analysis on sampled data first for quick exploration
- Note sample size in report
- Recommend distributed computing for full analysis
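One way to produce the sample without loading the whole file is reservoir sampling, sketched here for CSV input with the standard library. This is an illustrative approach rather than something the scripts are known to do; the function name and the assumption that the first row is a header are both hypothetical.

```python
import csv
import random


def reservoir_sample_csv(src_path, dst_path, k, seed=0):
    """Write a uniform random sample of k data rows from src_path to dst_path.

    Reads the file once and keeps at most k rows in memory (Algorithm R).
    Assumes the first row is a header. Returns the number of data rows seen.
    """
    rng = random.Random(seed)
    sample = []
    seen = 0
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        for row in reader:
            seen += 1
            if len(sample) < k:
                sample.append(row)
            else:
                # Keep the new row with probability k / seen.
                j = rng.randrange(seen)
                if j < k:
                    sample[j] = row
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(sample)
    return seen
```

The returned row count and `k` together give the sampling fraction to note in the report.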
High-Dimensional Data (>50 columns)
- Focus on most important variables first
- Consider PCA or feature selection
- Generate correlation analysis to identify variable groups
- Reference the eda_best_practices.md section on high-dimensional data
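To illustrate how correlation analysis can surface groups of related variables, the sketch below lists column pairs whose absolute Pearson correlation meets a threshold. The 0.8 cutoff, the function names, and the dict-of-lists input format are assumptions for this example, not the analyzer's behavior.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # undefined for a constant column; treated as 0 here
    return cov / (sx * sy)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for column pairs with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs
```

Columns that appear together in many pairs are candidates for a single variable group, from which one representative can be kept or which can be combined through PCA.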
Time Series Data
- Ensure datetime columns are properly detected
- Time series visualizations will be automatically generated
- Consider temporal patterns, trends, and seasonality
- Reference the eda_best_practices.md section on time series
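A quick way to check whether a column would be recognized as datetime is to test how many of its values parse. The sketch below only handles ISO 8601 strings through `datetime.fromisoformat`; the 90% cutoff and the function name are assumptions, and the visualizer's real detection rules are not described here.

```python
from datetime import datetime


def looks_like_datetime(values, min_fraction=0.9):
    """Return True if most non-empty values parse as ISO 8601 dates or datetimes."""
    non_empty = [v for v in values if v not in ("", None)]
    if not non_empty:
        return False
    parsed = 0
    for v in non_empty:
        try:
            datetime.fromisoformat(str(v))
            parsed += 1
        except ValueError:
            pass  # not an ISO 8601 value
    return parsed / len(non_empty) >= min_fraction
```

Columns that fail this check but are known to hold dates in another format would need explicit parsing before the time series plots can be expected.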
Imbalanced Data
- Categorical analysis will flag imbalances
- Report class distributions prominently
- Recommend stratified sampling if needed
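For reporting class distributions, a small standard-library sketch like the following may be enough. The helper names and the max-to-min ratio as an imbalance measure are illustrative choices, not the analyzer's documented method.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: share of rows}, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}


def imbalance_ratio(labels):
    """Ratio of the most common class count to the least common class count."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)
```

The shares can go straight into a report table, and a large ratio is a signal to recommend stratified sampling.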
Small Sample Sizes (<100 rows)
- Non-parametric methods are used automatically where appropriate
- Be conservative in statistical conclusions
- Note sample size limitations in report
Output Best Practices
Always output as markdown:
- Structure findings using markdown headers, tables, and lists
- Embed visualizations using markdown image syntax
- Use tables for statistical summaries
- Include code blocks for any suggested transformations
- Highlight key insights with bold or bullet points
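A statistical summary can be turned into a markdown table with a few lines of Python. This is a minimal sketch; the `markdown_table` helper and the list-of-dicts input shape are assumptions for illustration.

```python
def markdown_table(rows, columns):
    """Render a list of dicts as a markdown table with the given column order."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row.get(col, "")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])
```

Missing keys render as empty cells, so rows for numeric and categorical variables can share one table.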
Ensure reports are actionable:
- Provide clear recommendations based on findings
- Flag data quality issues that need attention
- Suggest next steps for modeling or further analysis
- Identify feature engineering opportunities
Make insights accessible:
- Explain statistical concepts in plain language
- Use reference guides to provide detailed interpretations
- Include both technical details and executive summary
- Tailor communication to user's technical level
Handling Edge Cases
Unsupported file formats:
- Request user to convert to supported format
- Suggest using pandas-compatible formats
Files too large to load:
- Recommend sampling approach
- Suggest chunked processing
- Consider alternative tools for big data
Corrupted or malformed data:
- Report specific errors encountered
- Suggest data cleaning steps
- Try to salvage partial analysis if possible
All missing data in columns:
- Flag completely empty columns
- Recommend removal or investigation
- Document in data quality section
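Flagging completely empty columns can be sketched as follows for CSV input. The set of tokens treated as missing and the function name are assumptions for this example; the analyzer may use different rules.

```python
import csv


def empty_columns(csv_path, missing_tokens=("", "NA", "NaN", "null")):
    """Return the names of columns in which every value is missing."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        fieldnames = reader.fieldnames or []
        candidates = set(fieldnames)
        for row in reader:
            # Drop a column from the candidates as soon as it has a real value.
            for name in list(candidates):
                value = row.get(name)
                if value is not None and value.strip() not in missing_tokens:
                    candidates.discard(name)
            if not candidates:
                break  # every column has at least one value
        return [name for name in fieldnames if name in candidates]
```

The returned names can be listed in the data quality section with a recommendation to remove or investigate them.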
Resources Summary
scripts/
- eda_analyzer.py: Main analysis engine - comprehensive statistical analysis
- visualizer.py: Visualization generator - creates all chart types
Both scripts are fully executable and handle multiple file formats automatically.
references/
- statistical_tests_guide.md: Statistical test interpretation and methodology
- eda_best_practices.md: Comprehensive EDA methodology and best practices
Load these references as needed to inform analysis approach and interpretation.
assets/
- report_template.md: Professional markdown report template
Use this template structure for creating consistent, comprehensive EDA reports.
Key Reminders
- Always generate markdown output for textual results
- Run both scripts (analyzer and visualizer) for complete analysis
- Use the template to structure comprehensive reports
- Include visualizations by referencing generated PNG files
- Provide actionable insights - don't just present statistics
- Interpret findings using reference guides
- Document limitations and data quality issues
- Make recommendations for next steps
This skill transforms raw data into actionable insights through systematic exploration, advanced statistics, rich visualizations, and clear communication.