# Data Analysis Skill
Comprehensive data analysis toolkit using Polars - a blazingly fast DataFrame library. This skill provides instructions, reference documentation, and ready-to-use scripts for common data analysis tasks.
## Iteration Checkpoints
| Step | What to Present | User Input Type |
|---|---|---|
| Data Loading | Shape, columns, sample rows | "Is this the right data?" |
| Data Exploration | Summary stats, data quality issues | "Any columns to focus on?" |
| Transformation | Before/after comparison | "Does this transformation look correct?" |
| Analysis | Key findings, charts | "Should I dig deeper into anything?" |
| Export | Output preview | "Ready to save, or any changes?" |
## Quick Start
```python
import polars as pl
from polars import col

# Load data
df = pl.read_csv("data.csv")

# Explore
print(df.shape, df.schema)
print(df.describe())

# Transform and analyze
result = (
    df.filter(col("value") > 0)
    .group_by("category")
    .agg(col("value").sum().alias("total"))
    .sort("total", descending=True)
)

# Export
result.write_csv("output.csv")
```
## When to Use This Skill
- Loading datasets (CSV, JSON, Parquet, Excel, databases)
- Data cleaning, filtering, and transformation
- Aggregations, grouping, and pivot tables
- Statistical analysis and summary statistics
- Time series analysis and resampling
- Joining and merging multiple datasets
- Creating visualizations and charts
- Exporting results to various formats
## Skill Contents
### Reference Documentation
Detailed API reference and patterns for specific operations:

- `reference/loading.md` - Loading data from all supported formats
- `reference/transformations.md` - Column operations, filtering, sorting, type casting
- `reference/aggregations.md` - Group by, window functions, running totals
- `reference/time_series.md` - Date parsing, resampling, lag features
- `reference/statistics.md` - Correlations, distributions, hypothesis testing setup
- `reference/visualization.md` - Creating charts with matplotlib/plotly
### Ready-to-Use Scripts
Executable Python scripts for common tasks:

- `scripts/explore_data.py` - Quick dataset exploration and profiling
- `scripts/summary_stats.py` - Generate comprehensive statistics report
## Core Patterns
### Loading Data
```python
# CSV (most common)
df = pl.read_csv("data.csv")

# Lazy loading for large files
df = pl.scan_csv("large.csv").filter(col("x") > 0).collect()

# Parquet (recommended for large datasets)
df = pl.read_parquet("data.parquet")

# JSON
df = pl.read_json("data.json")
df = pl.read_ndjson("data.ndjson")  # Newline-delimited
```
### Filtering and Selection
```python
# Select columns
df.select("col1", "col2")
df.select(col("name"), col("value") * 2)

# Filter rows
df.filter(col("age") > 25)
df.filter((col("status") == "active") & (col("value") > 100))
df.filter(col("name").str.contains("Smith"))
```
### Transformations
```python
# Add/modify columns
df = df.with_columns(
    (col("price") * col("qty")).alias("total"),
    col("date_str").str.to_date("%Y-%m-%d").alias("date"),
)

# Conditional values
df = df.with_columns(
    pl.when(col("score") >= 90).then(pl.lit("A"))
    .when(col("score") >= 80).then(pl.lit("B"))
    .otherwise(pl.lit("C"))
    .alias("grade")
)
```
### Aggregations
```python
# Group by
df.group_by("category").agg(
    col("value").sum().alias("total"),
    col("value").mean().alias("avg"),
    pl.len().alias("count"),
)

# Window functions
df.with_columns(
    col("value").sum().over("group").alias("group_total"),
    col("value").rank().over("group").alias("rank_in_group"),
)
```
### Exporting
```python
df.write_csv("output.csv")
df.write_parquet("output.parquet")
# Note: the row_oriented parameter was removed in Polars 1.0; newer versions
# write row-oriented records by default, so there use df.write_json("output.json").
df.write_json("output.json", row_oriented=True)
```

## Best Practices
- Use lazy evaluation for large datasets: `pl.scan_csv(...).collect()`
- Filter early to reduce data volume before expensive operations
- Select only needed columns to minimize memory usage
- Prefer Parquet for storage - faster I/O, better compression
- Use `.explain()` to understand and optimize query plans