Data Analysis Skill

Comprehensive data analysis toolkit using Polars - a blazingly fast DataFrame library. This skill provides instructions, reference documentation, and ready-to-use scripts for common data analysis tasks.

Iteration Checkpoints

| Step | What to Present | User Input Type |
| --- | --- | --- |
| Data Loading | Shape, columns, sample rows | "Is this the right data?" |
| Data Exploration | Summary stats, data quality issues | "Any columns to focus on?" |
| Transformation | Before/after comparison | "Does this transformation look correct?" |
| Analysis | Key findings, charts | "Should I dig deeper into anything?" |
| Export | Output preview | "Ready to save, or any changes?" |
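
As a minimal sketch of the first checkpoint (the data.csv path is a placeholder), the loading step presents shape, columns, and sample rows, then pauses for the user's confirmation:

```python
import polars as pl

# Placeholder input file, used only to illustrate the Data Loading checkpoint
df = pl.read_csv("data.csv")

# Present shape, columns, and sample rows, then ask: "Is this the right data?"
print(df.shape)
print(df.columns)
print(df.head(5))
```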

Quick Start

```python
import polars as pl
from polars import col

# Load data
df = pl.read_csv("data.csv")

# Explore
print(df.shape, df.schema)
df.describe()

# Transform and analyze
result = (
    df.filter(col("value") > 0)
    .group_by("category")
    .agg(col("value").sum().alias("total"))
    .sort("total", descending=True)
)

# Export
result.write_csv("output.csv")
```

When to Use This Skill

  • Loading datasets (CSV, JSON, Parquet, Excel, databases)
  • Data cleaning, filtering, and transformation
  • Aggregations, grouping, and pivot tables
  • Statistical analysis and summary statistics
  • Time series analysis and resampling
  • Joining and merging multiple datasets (see the sketch after this list)
  • Creating visualizations and charts
  • Exporting results to various formats
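
Most of these tasks are covered under Core Patterns below; joining and merging is not, so here is a minimal sketch using tiny inline DataFrames invented for illustration:

```python
import polars as pl

# Invented example data: orders referencing a customer lookup table
orders = pl.DataFrame({"customer_id": [1, 2, 1], "amount": [10.0, 20.0, 5.0]})
customers = pl.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Bo"]})

# Left join: keep every order, attach the customer name where one matches
joined = orders.join(customers, on="customer_id", how="left")
```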

Skill Contents

Reference Documentation

Detailed API reference and patterns for specific operations:
  • reference/loading.md - Loading data from all supported formats
  • reference/transformations.md - Column operations, filtering, sorting, type casting
  • reference/aggregations.md - Group by, window functions, running totals
  • reference/time_series.md - Date parsing, resampling, lag features (see the sketch after this list)
  • reference/statistics.md - Correlations, distributions, hypothesis testing setup
  • reference/visualization.md - Creating charts with matplotlib/plotly
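
As a small taste of the time-series patterns (the three-row series below is invented; the full patterns live in reference/time_series.md):

```python
import polars as pl
from polars import col

# Invented daily series for illustration
ts = pl.DataFrame({
    "day": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "value": [10, 12, 9],
}).with_columns(col("day").str.to_date("%Y-%m-%d"))

# A 1-step lag feature: each row sees the previous day's value
ts = ts.with_columns(col("value").shift(1).alias("value_lag_1"))
```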

Ready-to-Use Scripts

Executable Python scripts for common tasks:
  • scripts/explore_data.py - Quick dataset exploration and profiling (see the sketch after this list)
  • scripts/summary_stats.py - Generate a comprehensive statistics report
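
The scripts' actual interfaces are not documented here; the following is only a rough, hypothetical sketch of the kind of profiling explore_data.py performs (the profile function and data.csv path are invented):

```python
import polars as pl

def profile(path: str) -> None:
    """Hypothetical profiling pass: shape, schema, nulls, summary stats."""
    df = pl.read_csv(path)
    print(df.shape, df.schema)
    print(df.null_count())   # missing values per column
    print(df.describe())     # summary statistics

profile("data.csv")  # invented input file
```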

Core Patterns

Loading Data

```python
# CSV (most common)
df = pl.read_csv("data.csv")

# Lazy loading for large files
df = pl.scan_csv("large.csv").filter(col("x") > 0).collect()

# Parquet (recommended for large datasets)
df = pl.read_parquet("data.parquet")

# JSON
df = pl.read_json("data.json")
df = pl.read_ndjson("data.ndjson")  # Newline-delimited
```

Filtering and Selection

```python
# Select columns
df.select("col1", "col2")
df.select(col("name"), col("value") * 2)

# Filter rows
df.filter(col("age") > 25)
df.filter((col("status") == "active") & (col("value") > 100))
df.filter(col("name").str.contains("Smith"))
```

Transformations

```python
# Add/modify columns
df = df.with_columns(
    (col("price") * col("qty")).alias("total"),
    col("date_str").str.to_date("%Y-%m-%d").alias("date"),
)

# Conditional values
df = df.with_columns(
    pl.when(col("score") >= 90).then(pl.lit("A"))
    .when(col("score") >= 80).then(pl.lit("B"))
    .otherwise(pl.lit("C"))
    .alias("grade")
)
```

Aggregations

```python
# Group by
df.group_by("category").agg(
    col("value").sum().alias("total"),
    col("value").mean().alias("avg"),
    pl.len().alias("count"),
)

# Window functions
df.with_columns(
    col("value").sum().over("group").alias("group_total"),
    col("value").rank().over("group").alias("rank_in_group"),
)
```

Exporting

```python
df.write_csv("output.csv")
df.write_parquet("output.parquet")
df.write_json("output.json")  # row-oriented by default in recent Polars
```

Best Practices

  1. Use lazy evaluation for large datasets: `pl.scan_csv()` + `.collect()` (see the sketch after this list)
  2. Filter early to reduce data volume before expensive operations
  3. Select only needed columns to minimize memory usage
  4. Prefer Parquet for storage - faster I/O, better compression
  5. Use `.explain()` to understand and optimize query plans
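
As a minimal sketch combining practices 1, 2, 3, and 5 (large.csv is a placeholder file name):

```python
import polars as pl
from polars import col

# Nothing is read yet: scan_csv builds a lazy query
lazy = (
    pl.scan_csv("large.csv")      # placeholder file name
    .filter(col("x") > 0)         # filter early (predicate pushdown)
    .select("x", "category")      # select only needed columns (projection pushdown)
)

print(lazy.explain())  # inspect the optimized query plan before running it
df = lazy.collect()    # execute and materialize the result
```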