exploratory-data-analysis

Exploratory Data Analysis

Overview

EDA is a process for discovering patterns, anomalies, and relationships in data. This skill analyzes CSV/Excel/JSON/Parquet files to generate statistical summaries, distributions, correlations, outlier reports, and visualizations. All outputs are markdown-formatted for integration into downstream workflows.

When to Use This Skill

This skill should be used when:
  • User provides a data file and requests analysis or exploration
  • User asks to "explore this dataset", "analyze this data", or "what's in this file?"
  • User needs statistical summaries, distributions, or correlations
  • User requests data visualizations or insights
  • User wants to understand data quality issues or patterns
  • User mentions EDA, exploratory analysis, or data profiling
Supported file formats: CSV, Excel (.xlsx, .xls), JSON, Parquet, TSV, Feather, HDF5, Pickle
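Format auto-detection can be sketched as a simple extension-to-reader dispatch. This is a minimal illustration under assumed names (the `READERS` table and `load_dataframe` helper are hypothetical, not the actual loader in `eda_analyzer.py`); the reader functions themselves are standard pandas APIs.

```python
from pathlib import Path

import pandas as pd

# Hypothetical dispatch table mapping file extensions to pandas readers.
READERS = {
    ".csv": pd.read_csv,
    ".tsv": lambda p, **kw: pd.read_csv(p, sep="\t", **kw),
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
    ".feather": pd.read_feather,
    ".h5": pd.read_hdf,
    ".pkl": pd.read_pickle,
}

def load_dataframe(path: str) -> pd.DataFrame:
    """Pick a reader based on the file extension, or raise for unsupported formats."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return READERS[suffix](path)
```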

Quick Start Workflow

  1. Receive the data file from the user
  2. Run comprehensive analysis using scripts/eda_analyzer.py
  3. Generate visualizations using scripts/visualizer.py
  4. Create a markdown report from the insights using the assets/report_template.md template
  5. Present findings to the user with key insights highlighted

Core Capabilities

1. Comprehensive Data Analysis

Execute full statistical analysis using the eda_analyzer.py script:

   python scripts/eda_analyzer.py <data_file_path> -o <output_directory>

What it provides:
  • Auto-detection and loading of file formats
  • Basic dataset information (shape, types, memory usage)
  • Missing data analysis (patterns, percentages)
  • Summary statistics for numeric and categorical variables
  • Outlier detection using IQR and Z-score methods
  • Distribution analysis with normality tests (Shapiro-Wilk, Anderson-Darling)
  • Correlation analysis (Pearson and Spearman)
  • Data quality assessment (completeness, duplicates, issues)
  • Automated insight generation

Output: a JSON file containing all analysis results at <output_directory>/eda_analysis.json
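The IQR and Z-score outlier checks listed above can be sketched as follows. These are hypothetical helpers illustrating the two standard methods, not the analyzer's actual implementation.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (the classic Tukey fence)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values more than `threshold` standard deviations from the mean."""
    z = (series - series.mean()) / series.std(ddof=0)
    return z.abs() > threshold
```

Note that the two methods can disagree: a single extreme value inflates the standard deviation, so the Z-score check may miss an outlier that the IQR fence catches.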

2. Comprehensive Visualizations

Generate the complete visualization suite using the visualizer.py script:

   python scripts/visualizer.py <data_file_path> -o <output_directory>

Generated visualizations:
  • Missing data patterns: heatmap and bar chart showing missing data
  • Distribution plots: histograms with KDE overlays for all numeric variables
  • Box plots with violin overlays: outlier detection visualizations
  • Correlation heatmap: both Pearson and Spearman correlation matrices
  • Scatter matrix: pairwise relationships between numeric variables
  • Categorical analysis: bar charts for top categories
  • Time series plots: temporal trends with trend lines (if datetime columns exist)

Output: high-quality PNG files saved to <output_directory>/eda_visualizations/

All visualizations are production-ready with:
  • 300 DPI resolution
  • Clear titles and labels
  • Statistical annotations
  • Professional styling using seaborn
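A minimal sketch of how one such figure might be saved at 300 DPI with matplotlib. The `save_histogram` helper is hypothetical and much simpler than visualizer.py itself; it only illustrates the DPI, title, and label conventions listed above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

def save_histogram(data, path: str, title: str) -> None:
    """Save a labeled histogram as a 300 DPI PNG (sketch, not visualizer.py)."""
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.hist(data, bins=30, edgecolor="black")
    ax.set_title(title)
    ax.set_xlabel("value")
    ax.set_ylabel("count")
    fig.savefig(path, dpi=300, bbox_inches="tight")  # 300 DPI per the spec above
    plt.close(fig)
```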

3. Automated Insight Generation

The analyzer automatically generates actionable insights including:
  • Data scale insights: Dataset size considerations for processing
  • Missing data alerts: Warnings when missing data exceeds thresholds
  • Correlation discoveries: Strong relationships identified for feature engineering
  • Outlier warnings: Variables with high outlier rates flagged
  • Distribution assessments: Skewness issues requiring transformations
  • Duplicate alerts: Duplicate row detection
  • Imbalance warnings: Categorical variable imbalance detection
Access insights from the analysis results JSON under the "insights" key.
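Reading those insights back out of the JSON might look like the sketch below. The `load_insights` helper is hypothetical; only the "insights" key is taken from the description above, and other fields in the file may vary.

```python
import json

def load_insights(path: str) -> list:
    """Return the "insights" list from the analyzer's JSON output (sketch)."""
    with open(path) as f:
        results = json.load(f)
    return results.get("insights", [])
```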

4. Statistical Interpretation

For detailed interpretation of statistical tests and measures, reference references/statistical_tests_guide.md, a comprehensive guide covering:
  • Normality tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov)
  • Distribution characteristics (skewness, kurtosis)
  • Correlation tests (Pearson, Spearman)
  • Outlier detection methods (IQR, Z-score)
  • Hypothesis testing guidelines
  • Data transformation strategies
Load this reference when needing to interpret specific statistical tests or explain results to users.
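As a small worked example of the kind of interpretation the guide covers, the sketch below (assuming scipy is available) runs a Shapiro-Wilk test on a right-skewed sample and shows how a log1p transform reduces skewness. The variable names are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed = rng.exponential(scale=1.0, size=500)  # strongly right-skewed sample

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed,
# so a small p-value is evidence of non-normality.
_, p_value = stats.shapiro(skewed)

# A log1p transform typically reduces right skew dramatically.
skew_before = stats.skew(skewed)
skew_after = stats.skew(np.log1p(skewed))
```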

5. Best Practices Guidance

For methodological guidance, reference references/eda_best_practices.md, which details best practices including:
  • EDA process framework (6-step methodology)
  • Univariate, bivariate, and multivariate analysis approaches
  • Visualization guidelines
  • Statistical analysis guidelines
  • Common pitfalls to avoid
  • Domain-specific considerations
  • Communication tips for technical and non-technical audiences
Load this reference when planning analysis approach or needing guidance on specific EDA scenarios.

Creating Analysis Reports

Use the provided template, assets/report_template.md, to structure comprehensive EDA reports. This professional report template includes sections for:
  • Executive summary
  • Dataset overview
  • Data quality assessment
  • Univariate, bivariate, and multivariate analysis
  • Outlier analysis
  • Key insights and findings
  • Recommendations
  • Limitations and appendices
To use the template:
  1. Copy the template content
  2. Fill in sections with analysis results from JSON output
  3. Embed visualization images using markdown syntax
  4. Populate insights and recommendations
  5. Save as markdown for user consumption

Typical Workflow Example

When a user provides a data file:
User: "Can you explore this sales_data.csv file and tell me what you find?"

1. Run analysis:
   python scripts/eda_analyzer.py sales_data.csv -o ./analysis_output

2. Generate visualizations:
   python scripts/visualizer.py sales_data.csv -o ./analysis_output

3. Read analysis results:
   Read ./analysis_output/eda_analysis.json

4. Create markdown report using template:
   - Copy assets/report_template.md structure
   - Fill in sections with analysis results
   - Reference visualizations from ./analysis_output/eda_visualizations/
   - Include automated insights from JSON

5. Present to user:
   - Show key insights prominently
   - Highlight data quality issues
   - Provide visualizations inline
   - Make actionable recommendations
   - Save complete report as .md file

Advanced Analysis Scenarios

Large Datasets (>1M rows)

  • Run analysis on sampled data first for quick exploration
  • Note sample size in report
  • Recommend distributed computing for full analysis
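The sampling suggestion above can be sketched as follows; `sample_for_eda` is a hypothetical helper, and the 100,000-row cutoff is an arbitrary example, not a recommendation from the scripts.

```python
import pandas as pd

def sample_for_eda(df: pd.DataFrame, max_rows: int = 100_000, seed: int = 0) -> pd.DataFrame:
    """Return the frame unchanged if small enough, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)
```

Fixing `random_state` keeps the exploration reproducible, so the sample size noted in the report corresponds to a sample that can be regenerated exactly.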

High-Dimensional Data (>50 columns)

  • Focus on most important variables first
  • Consider PCA or feature selection
  • Generate correlation analysis to identify variable groups
  • Reference the eda_best_practices.md section on high-dimensional data
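Identifying groups of strongly correlated variables, as suggested above, might be sketched like this; `correlated_pairs` is a hypothetical helper built on pandas' standard correlation matrix.

```python
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """List column pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs
```

Pairs flagged here are candidates for dropping one member or combining them via PCA before deeper analysis.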

Time Series Data

  • Ensure datetime columns are properly detected
  • Time series visualizations will be automatically generated
  • Consider temporal patterns, trends, and seasonality
  • Reference the eda_best_practices.md section on time series

Imbalanced Data

  • Categorical analysis will flag imbalances
  • Report class distributions prominently
  • Recommend stratified sampling if needed

Small Sample Sizes (<100 rows)

  • Non-parametric methods are automatically used where appropriate
  • Be conservative in statistical conclusions
  • Note sample size limitations in report

Output Best Practices

Always output as markdown:
  • Structure findings using markdown headers, tables, and lists
  • Embed visualizations using ![Description](path/to/image.png) syntax
  • Use tables for statistical summaries
  • Include code blocks for any suggested transformations
  • Highlight key insights with bold or bullet points
Ensure reports are actionable:
  • Provide clear recommendations based on findings
  • Flag data quality issues that need attention
  • Suggest next steps for modeling or further analysis
  • Identify feature engineering opportunities
Make insights accessible:
  • Explain statistical concepts in plain language
  • Use reference guides to provide detailed interpretations
  • Include both technical details and executive summary
  • Tailor communication to user's technical level

Handling Edge Cases

Unsupported file formats:
  • Request user to convert to supported format
  • Suggest using pandas-compatible formats
Files too large to load:
  • Recommend sampling approach
  • Suggest chunked processing
  • Consider alternative tools for big data
Corrupted or malformed data:
  • Report specific errors encountered
  • Suggest data cleaning steps
  • Try to salvage partial analysis if possible
Columns with all data missing:
  • Flag completely empty columns
  • Recommend removal or investigation
  • Document in data quality section
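The chunked-processing suggestion for oversized files can be sketched with pandas' standard `chunksize` option; `chunked_row_count` is a hypothetical helper showing the streaming pattern.

```python
import pandas as pd

def chunked_row_count(path: str, chunksize: int = 100_000) -> int:
    """Stream a CSV in fixed-size chunks instead of loading it whole."""
    total = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        total += len(chunk)  # any per-chunk aggregation could go here
    return total
```

The same loop shape works for accumulating per-chunk statistics (sums, counts, min/max) that can be combined into whole-file summaries without ever holding the full dataset in memory.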

Resources Summary

scripts/

  • eda_analyzer.py: Main analysis engine for comprehensive statistical analysis
  • visualizer.py: Visualization generator that creates all chart types
Both scripts are fully executable and handle multiple file formats automatically.

references/

  • statistical_tests_guide.md: Statistical test interpretation and methodology
  • eda_best_practices.md: Comprehensive EDA methodology and best practices
Load these references as needed to inform analysis approach and interpretation.

assets/

  • report_template.md: Professional markdown report template
Use this template structure to create consistent, comprehensive EDA reports.

Key Reminders

  1. Always generate markdown output for textual results
  2. Run both scripts (analyzer and visualizer) for complete analysis
  3. Use the template to structure comprehensive reports
  4. Include visualizations by referencing generated PNG files
  5. Provide actionable insights - don't just present statistics
  6. Interpret findings using reference guides
  7. Document limitations and data quality issues
  8. Make recommendations for next steps
This skill transforms raw data into actionable insights through systematic exploration, advanced statistics, rich visualizations, and clear communication.