data-exploration-visualization

9 scripts

An automated data exploration and visualization tool that covers the full EDA workflow, from data loading to professional report generation. It supports multiple chart types, intelligent data diagnosis, modeling evaluation, and HTML report generation. It is suited to data analysis projects in fields such as healthcare, finance, and e-commerce.

6 installs

NPX Install

npx skill4agent add liangdabiao/claude-data-analysis-ultra-main data-exploration-visualization

SKILL.md Content (Chinese)

Data Exploration and Visualization Skill

Skill Overview

The Data Exploration and Visualization Skill is an automated EDA toolkit based on the theory from Lesson 2 of "Ten Talks on Data Analysis by Brother Ka". It covers the full workflow from data loading to professional analysis report generation, combining data exploration, visualization, and machine learning techniques to help users understand data characteristics and patterns quickly and in depth.

Core Features

🔍 Intelligent Data Exploration

  • Automated Data Diagnosis: Detect data quality issues, outliers, and missing value patterns
  • Statistical Descriptive Analysis: Generate comprehensive statistical summaries and distribution characteristics
  • Correlation Analysis: Identify relationships and dependency patterns between features
  • Data Quality Report: Professional-level data quality assessment and recommendations
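
The kind of diagnosis listed above can be sketched with plain pandas. This is a minimal illustration, not the skill's own implementation: the function name `diagnose`, the 1.5 × IQR outlier rule, and the toy data are all assumptions made for the example.

```python
import pandas as pd
from pandas.api import types as ptypes


def diagnose(df: pd.DataFrame, iqr_factor: float = 1.5) -> pd.DataFrame:
    """Summarize dtype, missing values, and IQR-rule outliers per column."""
    rows = []
    for name in df.columns:
        col = df[name]
        n_outliers = 0
        # The IQR rule only makes sense for numeric (non-boolean) columns.
        if ptypes.is_numeric_dtype(col) and not ptypes.is_bool_dtype(col):
            q1, q3 = col.quantile(0.25), col.quantile(0.75)
            spread = q3 - q1
            low, high = q1 - iqr_factor * spread, q3 + iqr_factor * spread
            n_outliers = int(((col < low) | (col > high)).sum())
        rows.append({
            "column": name,
            "dtype": str(col.dtype),
            "missing": int(col.isna().sum()),
            "missing_pct": round(100 * col.isna().mean(), 1),
            "outliers": n_outliers,
        })
    return pd.DataFrame(rows).set_index("column")


# Toy data (hypothetical): one implausible age and two missing cells.
df = pd.DataFrame({
    "age": [34, 41, 29, 38, 250, None],
    "city": ["A", "B", None, "A", "B", "A"],
})
summary = diagnose(df)
print(summary)
```

The IQR rule is only one possible outlier definition; z-scores or domain-specific ranges may fit a given dataset better.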

📊 Professional Visualization Generation

  • Distribution Visualization: Histograms, density plots, violin plots, QQ plots
  • Statistical Visualization: Box plots, error bar plots, confidence interval plots
  • Relationship Visualization: Scatter plots, heatmaps, pair plots, 3D scatter plots
  • Specialized Charts: ROC curves, confusion matrices, feature importance plots
  • Interactive Charts: Plotly-powered dynamic visualizations

🏥 Healthcare Data Specialization

  • Medical Coding Support: ICD-10, SNOMED CT and other medical standards
  • Biomarker Analysis: Specialized processing of medical indicators
  • Diagnostic Model Construction: Medical prediction models and evaluation
  • Medical Interpretability: Interpretation framework aligned with medical practices

🤖 Automated Modeling Evaluation

  • Multi-algorithm Support: Logistic Regression, Random Forest, XGBoost, Neural Networks
  • Automated Feature Engineering: Feature selection, transformation and optimization
  • Hyperparameter Tuning: Grid search and Bayesian optimization
  • Model Interpretability: SHAP values, feature importance, partial dependence plots

📋 Professional Report Generation

  • HTML Report: Publication-level interactive analysis reports
  • PDF Export: High-quality document format output
  • Markdown Support: Lightweight report format
  • Custom Templates: Configurable report template system
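
As a rough idea of what an HTML report involves, the sketch below assembles a standalone page from pandas summary tables. It does not use the skill's `ReportGenerator` or its templates; `build_html_report` and the toy data are hypothetical names and values chosen for illustration.

```python
import html

import pandas as pd


def build_html_report(df: pd.DataFrame, title: str) -> str:
    """Return a standalone HTML page with summary statistics and missing counts."""
    stats_table = df.describe(include="all").to_html(
        na_rep="", float_format="{:.2f}".format
    )
    missing_table = df.isna().sum().to_frame("missing").to_html()
    safe_title = html.escape(title)  # never place raw user text into markup
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset='utf-8'><title>{safe_title}</title></head>\n"
        f"<body><h1>{safe_title}</h1>\n"
        f"<h2>Summary statistics</h2>\n{stats_table}\n"
        f"<h2>Missing values</h2>\n{missing_table}\n"
        "</body></html>"
    )


# Toy data (hypothetical values).
df = pd.DataFrame({
    "credit_score": [720, 680, 705, None],
    "segment": ["retail", "retail", "sme", "sme"],
})
page = build_html_report(df, "Credit data <draft>")
print(page[:80])
```

The returned string can be written to disk with `pathlib.Path("report.html").write_text(page, encoding="utf-8")`. A template engine such as jinja2 becomes worthwhile once the layout grows beyond a few sections.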

Usage Scenarios

🏥 Healthcare Field

  • Disease Prediction: Disease risk prediction based on clinical data
  • Diagnostic Assistance: Analysis of medical imaging and test results
  • Epidemiological Research: Epidemic data analysis and trend prediction
  • Clinical Trials: Statistical analysis and visualization of trial data

💰 Financial Risk Control Field

  • Credit Assessment: Personal and enterprise credit risk modeling
  • Fraud Detection: Identification of abnormal transaction patterns
  • Investment Analysis: Market trend and risk assessment
  • Compliance Reporting: Analysis reports meeting regulatory requirements

🛒 E-commerce Retail Field

  • User Analysis: Customer behavior and preference analysis
  • Sales Forecasting: Sales volume prediction and inventory optimization
  • Recommendation Systems: Evaluation of personalized recommendation algorithms
  • Market Segmentation: Customer group analysis and profiling

🎓 Scientific Research and Education Field

  • Academic Research: Data-driven academic research support
  • Teaching Cases: Data analysis teaching and practice
  • Thesis Writing: Research data analysis and chart creation
  • Skill Training: Data science skill training tool

Tool Usage Guide

Quick Start

  1. Basic Data Exploration
    python
    from scripts.eda_analyzer import EDAAnalyzer
    
    # Initialize the analyzer
    analyzer = EDAAnalyzer()
    
    # Load the data and run the automated analysis
    data = analyzer.load_data('data.csv')
    report = analyzer.auto_eda(data)
  2. Visualization Generation
    python
    from scripts.visualizer import DataVisualizer
    
    # Initialize the visualizer
    visualizer = DataVisualizer()
    
    # Generate all charts automatically
    charts = visualizer.auto_visualize(data)
    
    # Generate specific chart types
    dist_plot = visualizer.plot_distribution(data, 'column_name')
    corr_heatmap = visualizer.plot_correlation(data)
  3. Modeling Evaluation
    python
    from scripts.modeling_evaluator import ModelingEvaluator
    
    # Initialize the modeler
    modeler = ModelingEvaluator()
    
    # Run automated modeling and evaluation
    results = modeler.auto_modeling(
        data=data,
        target_col='target',
        algorithms=['logistic', 'rf', 'xgboost']
    )
  4. Report Generation
    python
    from scripts.report_generator import ReportGenerator
    
    # Generate the full report
    generator = ReportGenerator()
    report = generator.generate_comprehensive_report(
        data=data,
        model_results=results,  # output of modeler.auto_modeling(...) in step 3
        output_path='analysis_report.html'
    )

Advanced Features

  1. Healthcare Data Analysis
    python
    # Specialized handling for medical data (medical_df is assumed to be an already-loaded DataFrame)
    from scripts.medical_analyzer import MedicalDataAnalyzer
    
    medical_analyzer = MedicalDataAnalyzer()
    medical_report = medical_analyzer.analyze_medical_data(
        data=medical_df,
        diagnosis_col='diagnosis',
        biomarker_cols=['biomarker1', 'biomarker2']
    )
  2. Interactive Dashboard
    python
    # Generate an interactive dashboard (reuses `visualizer` from Quick Start step 2)
    dashboard = visualizer.create_dashboard(
        data=data,
        charts=['distribution', 'correlation', 'model_performance']
    )
  3. Batch Data Processing
    python
    # Analyze multiple datasets in batch (reuses `analyzer` from Quick Start step 1)
    batch_results = analyzer.batch_analyze(
        data_files=['data1.csv', 'data2.csv'],
        analysis_types=['eda', 'modeling', 'visualization']
    )

Technical Dependencies

Core Libraries

  • pandas (>=1.3.0): Data processing and analysis
  • numpy (>=1.20.0): Numerical computation
  • scikit-learn (>=1.0.0): Machine learning algorithms
  • xgboost (>=1.5.0): Gradient boosting algorithms

Visualization Libraries

  • matplotlib (>=3.4.0): Basic plotting
  • seaborn (>=0.11.0): Statistical visualization
  • plotly (>=5.0.0): Interactive charts

Statistical Analysis Libraries

  • scipy (>=1.7.0): Scientific computation
  • statsmodels (>=0.13.0): Statistical modeling

Report Generation

  • jinja2 (>=3.0.0): Template engine
  • weasyprint: PDF generation

Best Practices

Data Preparation

  • Ensure standardized data formats (CSV, Excel, etc.)
  • Check the file encoding to avoid garbled Chinese characters
  • Handle missing values and outliers
  • Verify data types and formats
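
The encoding check in particular is easy to get wrong, so here is a minimal sketch using only the standard library. It assumes the file is either UTF-8 or GBK; `decode_with_fallback` is a hypothetical helper, not part of the skill.

```python
import csv
import io
from typing import Sequence, Tuple


def decode_with_fallback(
    raw: bytes, encodings: Sequence[str] = ("utf-8-sig", "gbk")
) -> Tuple[str, str]:
    """Try each encoding in order and return (text, encoding_used).

    Strict UTF-8 goes first: it rejects most GBK byte sequences, whereas GBK
    would often decode UTF-8 bytes into wrong characters without any error.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {list(encodings)} could decode the input")


# Simulate a CSV export saved as GBK, which is common for Excel in Chinese locales.
raw = "姓名,年龄\n张三,30\n".encode("gbk")
text, used = decode_with_fallback(raw)
rows = list(csv.reader(io.StringIO(text)))
print(used, rows)
```

Once the encoding is known, it can be passed on directly, for example `pandas.read_csv(path, encoding=used)`.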

Analysis Workflow

  1. Data Loading and Inspection: Confirm data quality and completeness
  2. Exploratory Analysis: Understand basic data characteristics and distributions
  3. Visualization Exploration: Discover data patterns through charts
  4. Preprocessing: Data cleaning and feature engineering
  5. Modeling Analysis: Build and evaluate prediction models
  6. Result Interpretation: Extract insights and business recommendations
  7. Report Generation: Create professional analysis reports

Visualization Selection

  • Univariate Analysis: Histograms, box plots, violin plots
  • Bivariate Analysis: Scatter plots, grouped box plots
  • Multivariate Analysis: Heatmaps, pair plots, 3D plots
  • Time Series: Timeline plots, trend plots
  • Geographic Data: Map visualizations
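
The selection rules above can be expressed as a small lookup on column types. This is a sketch under assumptions: `suggest_charts` and `kind` are hypothetical helpers, and the two fallbacks for purely categorical data (count-based charts) are additions not listed in the guide.

```python
from typing import List, Optional

import pandas as pd
from pandas.api import types as ptypes


def kind(col: pd.Series) -> str:
    """Classify a column as 'datetime', 'numeric', or 'categorical'."""
    if ptypes.is_datetime64_any_dtype(col):
        return "datetime"
    if ptypes.is_numeric_dtype(col) and not ptypes.is_bool_dtype(col):
        return "numeric"
    return "categorical"


def suggest_charts(df: pd.DataFrame, x: str, y: Optional[str] = None) -> List[str]:
    """Suggest chart types for one column (x) or a pair of columns (x, y)."""
    kx = kind(df[x])
    if y is None:  # univariate analysis
        if kx == "numeric":
            return ["histogram", "box plot", "violin plot"]
        if kx == "datetime":
            return ["timeline plot"]
        return ["bar chart of counts"]  # assumption: not part of the list above
    ky = kind(df[y])
    if kx == "datetime" and ky == "numeric":
        return ["trend plot"]
    if {kx, ky} == {"numeric"}:
        return ["scatter plot"]
    if {kx, ky} == {"numeric", "categorical"}:
        return ["grouped box plot"]
    return ["heatmap of counts"]  # assumption: fallback for other pairs


# Toy data (hypothetical values).
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=4, freq="D"),
    "sales": [10.0, 12.5, 9.0, 14.0],
    "region": ["N", "S", "N", "S"],
})
print(suggest_charts(df, "sales"))
print(suggest_charts(df, "region", "sales"))
print(suggest_charts(df, "date", "sales"))
```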

Sample Data

Healthcare Data Sample

python
# Breast exam data example
medical_data = {
    'patient_id': ['P001', 'P002', ...],
    'diagnosis': ['Malignant', 'Benign', ...],
    'radius_mean': [17.99, 20.57, ...],
    'texture_mean': [10.38, 17.77, ...],
    'perimeter_mean': [122.8, 132.9, ...]
}

Financial Data Sample

python
# Credit scoring data example
financial_data = {
    'customer_id': ['C001', 'C002', ...],
    'credit_score': [720, 680, ...],
    'income': [85000, 62000, ...],
    'debt_ratio': [0.15, 0.32, ...],
    'default': [0, 1, ...]
}

Frequently Asked Questions

Q: How is Chinese data handled?

A: The skill automatically detects and handles Chinese text encodings, including UTF-8 and GBK.

Q: What data formats are supported?

A: Common file formats such as CSV, Excel, JSON, and Parquet are supported, as are database connections.
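
For the file formats, dispatching on the file extension is one straightforward approach. The sketch below uses standard pandas readers; `load_table` and `READERS` are hypothetical names, and the skill's own `load_data` may work differently. Database connections are left out here (pandas offers `read_sql` for that).

```python
import tempfile
from pathlib import Path

import pandas as pd

# Map file extensions to pandas readers.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,        # requires an Excel engine such as openpyxl
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,   # requires pyarrow or fastparquet
}


def load_table(path, **kwargs) -> pd.DataFrame:
    """Load a tabular file, choosing the reader from the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
    return reader(path, **kwargs)


# Usage: write a small CSV to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "demo.csv"
    csv_path.write_text("id,score\n1,0.5\n2,0.8\n", encoding="utf-8")
    frame = load_table(csv_path)
print(frame.shape)  # (2, 2)
```

Extra keyword arguments are forwarded to the reader, so `load_table(path, encoding="gbk")` works for CSV files.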

Q: How can visualization styles be customized?

A: You can customize style parameters such as colors, fonts, and chart layouts through configuration files.

Q: How is model accuracy ensured?

A: The skill uses cross-validation, multiple evaluation metrics, and ensemble methods to ensure model reliability and generalization ability.
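
To make the cross-validation idea concrete, here is a minimal scikit-learn sketch that compares two models on several metrics. It uses scikit-learn's bundled breast cancer dataset as a stand-in for the healthcare sample. The model choices and settings are assumptions for illustration and are not taken from `ModelingEvaluator`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

models = {
    # Scaling lives inside the pipeline so it is refit on each training fold.
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    # A random forest is itself an ensemble of decision trees.
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "f1", "roc_auc"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the five held-out folds.
    results[name] = {m: float(scores[f"test_{m}"].mean()) for m in scoring}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

Reporting several metrics matters most on imbalanced data, where accuracy alone can look good while recall on the minority class is poor.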

Skill Highlights

✅ High Intelligence - 90% of EDA work automated
✅ Professional Specialization - Specialized processing for healthcare data
✅ Rich Visualizations - 20+ professional chart types
✅ Strong Modeling Capability - Multi-algorithm integration and automated tuning
✅ High-quality Reports - Publication-level analysis reports
✅ Ease of Use - Simple API with automated complex workflows
✅ High Scalability - Modular design for easy customization and expansion

Update Log

v1.0.0 (2025-01-19)

  • Initial version release
  • Complete EDA functionality
  • Basic visualization support
  • Logistic regression modeling
  • HTML report generation

Future Plans

  • Support for more machine learning algorithms
  • Add deep learning model support
  • Expand healthcare data analysis functions
  • Cloud deployment support
  • Real-time data analysis capability

With this skill, you can significantly improve data analysis efficiency, free yourself from repetitive tasks, and focus on insight discovery and decision support.