data-analysis
Data Analysis
Overview
A comprehensive data analysis and storytelling skill optimized for financial, SaaS, and RevOps contexts. This skill provides structured workflows for turning raw data into actionable insights with full transparency on analytical decisions, bias awareness, and progressive disclosure reporting.
Workflow Overview
Every analysis follows a 7-phase process:
1. SETUP → Initialize Marimo notebook (run init_marimo_notebook.py)
2. INGEST → Load data, document sources and assumptions
3. EXPLORE → EDA with logged decisions (why this viz, why this filter)
4. MODEL → If needed, with interpretable-first approach
5. INTERPRET → Apply bias checklist, hedge appropriately
6. WISHLIST → Document data gaps and proxies used
7. OUTPUT → Generate appropriate tier (slides/report/notebook)
Decision Logging Protocol
Every analytical choice must be logged. This creates an audit trail and enables reproducibility.
What to Log
| Decision Type | Example | Log Format |
|---|---|---|
| Data filtering | Removed 47 records with null revenue | |
| Metric choice | Used logo churn vs revenue churn | |
| Visualization | Line chart for time series | |
| Assumption | Assumed linear growth for projection | |
| Proxy used | Used support tickets as NPS proxy | |
Log Format in Notebook
```python
=== DECISION LOG ===
FILTER: Excluded trial accounts - 1,247 records removed
METRIC: NRR over GRR because expansion is significant factor
ASSUMPTION: Q4 seasonality similar to prior year - confidence: M
PROXY: Support ticket sentiment for NPS - quality: Weak
```
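Entries in that format can be produced by a small helper; `log_decision` is a hypothetical name for illustration, not one of the skill's scripts:

```python
# Hypothetical helper: collect tagged decision entries so the notebook
# can print the full log at the end of the analysis.
decision_log = []

def log_decision(kind, detail):
    """Record one analytical choice, tagged FILTER/METRIC/ASSUMPTION/PROXY."""
    entry = f"{kind.upper()}: {detail}"
    decision_log.append(entry)
    return entry

log_decision("filter", "Excluded trial accounts - 1,247 records removed")
log_decision("metric", "NRR over GRR because expansion is significant factor")
print("=== DECISION LOG ===")
print("\n".join(decision_log))
```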
Analysis Workflow Details
Phase 1: Setup
Run the initialization script to create a new Marimo notebook with pre-built scaffolding:
```bash
python scripts/init_marimo_notebook.py <notebook_name>
```
This creates a .py file with:
- Decision log cell (markdown)
- Data loading template
- EDA template cells
- Bias checklist cell
Phase 2: Data Ingestion
When loading data:
- Document the source (file path, API, database query)
- Record row/column counts
- Note any immediate data quality issues
- Log assumptions about data freshness
```python
=== DATA SOURCE ===
Source: sales_data_2024.csv
Loaded: 2024-01-15
Records: 15,847 rows x 23 columns
Note: Data through 2024-01-10, 5-day lag from source system
```
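A minimal ingestion sketch that records those counts, using an inline CSV in place of a real source file such as sales_data_2024.csv:

```python
import io

import pandas as pd

# Sketch of an ingestion cell: load data, then log row/column counts
# and an immediate quality note. The inline CSV is illustrative.
raw = io.StringIO("account_id,revenue\nA1,1200\nA2,\nA3,950\n")
df = pd.read_csv(raw)

print("=== DATA SOURCE ===")
print(f"Records: {len(df):,} rows x {df.shape[1]} columns")
print(f"Quality note: {df['revenue'].isna().sum()} null revenue values")
```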
Phase 3: Exploratory Data Analysis
Follow this EDA checklist:
- Distribution of key numeric variables
- Missing value patterns
- Outlier detection
- Time series patterns (if applicable)
- Segment breakdowns
- Correlation exploration
Log every visualization choice and filtering decision.
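Two of these checks sketched in pandas on a toy revenue column (missing-value count and the common IQR outlier rule):

```python
import pandas as pd

# Toy column: one null, one extreme value.
df = pd.DataFrame({"revenue": [100, 120, 115, 130, 5000, None]})

# Missing value pattern
missing = int(df["revenue"].isna().sum())

# Outlier detection via the 1.5 * IQR rule
q1, q3 = df["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["revenue"] < q1 - 1.5 * iqr) | (df["revenue"] > q3 + 1.5 * iqr)
outliers = df[mask]

print(f"missing={missing}, outliers={outliers['revenue'].tolist()}")
```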
Phase 4: Modeling (If Needed)
Prioritize interpretability:
- First choice: Descriptive statistics, cohort analysis
- Second choice: Linear regression, decision trees
- Last resort: Complex ML (document why simpler won't work)
Always provide:
- Model assumptions
- Feature importance / coefficients
- Confidence intervals
- What the model cannot tell us
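For instance, a numpy-only trend fit keeps the coefficient directly inspectable (the ARR figures are illustrative):

```python
import numpy as np

# Interpretable-first sketch: ordinary least squares on a short series,
# reporting the slope rather than a black-box prediction.
months = np.arange(6)
arr = np.array([100, 108, 115, 124, 131, 140], dtype=float)

# Fit arr ≈ slope * months + intercept
slope, intercept = np.polyfit(months, arr, 1)
print(f"ARR grows ~{slope:.1f} per month (intercept {intercept:.1f})")
```

The coefficient itself is the finding; a more complex model would need to justify what this line misses.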
Phase 5: Interpretation
Before finalizing insights, run the bias checklist. See references/biases.md for the full checklist.
Quick check:
- Survivorship bias: Am I only looking at "survivors"?
- Simpson's paradox: Do segment trends differ from aggregate?
- Selection bias: Is my sample representative?
- Small-n warning: Is sample size sufficient for claims?
Hedge appropriately:
- Use "suggests" not "proves"
- State confidence levels
- Note what additional data would strengthen conclusions
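The Simpson's paradox check can be made concrete (counts below are illustrative, not from the skill's data):

```python
import pandas as pd

# Win rates per segment vs in aggregate for two reps.
deals = pd.DataFrame({
    "rep":     ["A", "A", "B", "B"],
    "segment": ["SMB", "ENT", "SMB", "ENT"],
    "won":     [9, 30, 80, 3],
    "total":   [10, 90, 90, 10],
})

per_segment = deals.assign(rate=deals["won"] / deals["total"])
aggregate = deals.groupby("rep")[["won", "total"]].sum()
aggregate["rate"] = aggregate["won"] / aggregate["total"]

# Rep A leads in BOTH segments, yet rep B leads in aggregate,
# because A's deals skew toward the harder ENT segment.
print(per_segment[["rep", "segment", "rate"]])
print(aggregate["rate"])
```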
Phase 5.5: Validation Gate (MANDATORY)
⚠️ GATE: Before proceeding to output, you MUST run the data quality validation checklist.
This is not optional. Run through references/data-quality-validator.md before finalizing.
Critical Patterns Checklist:
- Market Context: Are YoY changes compared to market/baseline?
- Weighting Sensitivity: If using composite scores, tested 5-6 weight scenarios?
- Bootstrap CIs: For small samples, generated P10/P50/P90 ranges?
- Survivorship Quantified: Data_Availability column added? Coverage % reported?
Statistical Checks:
- Sample sizes disclosed with confidence intervals?
- Multiple comparisons accounted for?
- Cherry-picked baselines avoided?
Logic Checks:
- Correlation not claimed as causation?
- Alternative explanations considered?
- Back-of-envelope validation passed?
Methodology Note on Time Horizons:
When assessing skill vs luck (e.g., sales rep performance, investment returns):
- Longer time horizons (3+ years) reveal inconsistency that short windows hide
- More periods = higher bar for "likely skill"
- A rep who is "top 10%" for 2 quarters could easily be luck
- A rep who is "top 10%" for 12 quarters is more likely skill
- Always state the number of periods analyzed and what that implies for confidence
Do not proceed to Phase 6/7 until this checklist is complete.
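The bootstrap item above can be sketched with numpy (win-rate values are illustrative):

```python
import numpy as np

# Bootstrap P10/P50/P90 range for a small-sample mean.
rng = np.random.default_rng(42)
sample = np.array([0.18, 0.22, 0.31, 0.12, 0.27, 0.19, 0.25])

# Resample with replacement many times and collect the means.
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(2000)]
p10, p50, p90 = np.percentile(boot_means, [10, 50, 90])
print(f"mean={sample.mean():.3f}  P10={p10:.3f}  P50={p50:.3f}  P90={p90:.3f}")
```

Reporting the P10/P90 range instead of a single mean makes small-n uncertainty visible to the reader.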
Phase 6: Data Wishlisting
Document gaps and proxies. See references/data-wishlisting.md for patterns.
Format:
Data Wishlist
| Missing Data | Proxy Used | Quality | Impact on Analysis |
|---|---|---|---|
| Customer NPS | Support sentiment | Weak | Core finding, needs validation |
| True LTV | 12-month value | Moderate | Acceptable for segmentation |
Phase 7: Output Generation
阶段7:输出生成
Choose output tier based on audience and purpose:
| Tier | When to Use | Tool |
|---|---|---|
| Slides | Executive summary, board deck | python-pptx |
| Report | Detailed findings, stakeholder review | Markdown/PDF |
| Notebook | Full analysis, data team handoff | Marimo .py file |
根据受众与用途选择输出层级:
| 层级 | 适用场景 | 工具 |
|---|---|---|
| 演示文稿 | 执行摘要、董事会报告 | |
| 详细报告 | 详细发现、利益相关方评审 | Markdown/PDF |
| Notebook | 完整分析、数据团队交接 | Marimo .py文件 |
Data Cleaning Workflow
For messy data that needs cleaning before analysis:
1. Profile the Data
```bash
python scripts/profile_data.py <csv_file> --output data_quality_report.md
```
This generates:
- Column-level statistics (nulls, uniques, types)
- Data quality score (A-F grading)
- Suspicious pattern detection
- Suggested cleaning steps
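A sketch of the column-level statistics such a profiler might compute (the real script's output format may differ):

```python
import pandas as pd

# Toy frame: one duplicate account id, one null revenue value.
df = pd.DataFrame({
    "account": ["A1", "A2", "A2"],
    "revenue": [100.0, None, 95.0],
})

# One row per column: dtype, null count, distinct values.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
    "uniques": df.nunique(),
})
print(profile)
```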
2. Apply Cleaning Patterns
Reference references/data-cleaning.md for:
- Missing value strategies (drop, impute, flag)
- Outlier detection methods (IQR, z-score, domain rules)
- Common transforms (pivot, melt, merge patterns)
- Type coercion recipes
- Deduplication patterns
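The three missing-value strategies can be sketched in pandas (toy data):

```python
import pandas as pd

df = pd.DataFrame({"revenue": [100.0, None, 95.0, None]})

# Drop: discard rows with missing revenue.
dropped = df.dropna(subset=["revenue"])

# Impute: fill with the column median.
imputed = df.assign(revenue=df["revenue"].fillna(df["revenue"].median()))

# Flag: keep the gap but mark it for downstream analysis.
flagged = df.assign(revenue_missing=df["revenue"].isna())

print(len(dropped), imputed["revenue"].tolist(), int(flagged["revenue_missing"].sum()))
```

Flagging is the most decision-log-friendly option: the gap stays visible instead of silently disappearing.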
3. Handle Datetime Issues
Reference references/datetime-handling.md for:
- Timezone conversion patterns
- Date parsing for mixed formats
- Fiscal calendar handling (FY vs CY)
- Period aggregation (daily → weekly → monthly)
- Business day calculations
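Date parsing and period aggregation, sketched with pandas (dates and amounts are illustrative):

```python
import pandas as pd

# Parse date strings, then roll daily records up to calendar months.
s = pd.Series(["2024-01-05", "2024-01-20", "2024-02-03"])
df = pd.DataFrame({"date": pd.to_datetime(s), "amount": [10, 20, 30]})

# "MS" buckets by month start: Jan = 10 + 20, Feb = 30.
monthly = df.set_index("date")["amount"].resample("MS").sum()
print(monthly.tolist())
```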
Dashboard Building Workflow
For interactive monitoring dashboards:
1. Initialize Dashboard
```bash
python scripts/init_dashboard.py <dashboard_name>
```
This creates a Marimo dashboard with:
- KPI cards row
- Filter sidebar (segment, date range, period)
- Time series trend chart
- Summary data table
- Responsive layout
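A KPI card boils down to a small data-prep step; a sketch of the values one card needs (MRR figures are illustrative):

```python
import pandas as pd

# Current value, delta vs prior period, and a trailing series
# for the card's sparkline.
mrr = pd.Series([100, 104, 103, 110, 118],
                index=pd.period_range("2024-01", periods=5, freq="M"))

current = mrr.iloc[-1]
delta_pct = (mrr.iloc[-1] / mrr.iloc[-2] - 1) * 100
sparkline = mrr.tail(4).tolist()
print(f"MRR {current} ({delta_pct:+.1f}%)", sparkline)
```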
2. Apply Dashboard Patterns
Reference references/dashboard-patterns.md for:
- Marimo layout patterns (sidebar, tabs, grid)
- KPI card templates with sparklines
- Filter/slider patterns for interactivity
- Data table styling and formatting
- Time series with range selection
- Refresh patterns for live data
Data Quality Validation Workflow
Before presenting or accepting analytical claims:
Run the Data Quality Validation Checklist
Reference references/data-quality-validator.md for comprehensive checklists:
Statistical Sins:
- P-hacking / multiple comparisons
- Small sample extrapolation
- Missing confidence intervals
- Cherry-picked baselines
Chart Crimes:
- Truncated y-axis
- Dual y-axis manipulation
- 3D charts
- Misleading scales
Logic Fallacies:
- Correlation ≠ causation
- Ecological fallacy
- Base rate neglect
- Survivorship bias
Sanity Checks:
- Does this pass the smell test?
- Back-of-envelope validation
- Historical comparison
- Cross-source validation
Excel Output Workflow
For exporting analysis results to Excel with proper formulas and formatting:
Financial Model Standards
Reference references/xlsx-patterns.md for:
- Color coding convention (blue=inputs, black=formulas, green=cross-sheet links)
- Number formatting standards (currency, percentages, multiples)
- Formula construction rules (use formulas, not hardcoded values)
- Common formula patterns for analysis
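A minimal openpyxl sketch of the "formulas, not hardcoded values" rule (in-memory save; a real model would write an .xlsx file and then run the recalc step):

```python
import io

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws["A1"], ws["A2"] = 100, 250
ws["A3"] = "=SUM(A1:A2)"   # store the formula, not the literal 350

buf = io.BytesIO()
wb.save(buf)               # illustrative in-memory save
print(ws["A3"].value)
```

Because openpyxl stores but never evaluates formulas, the recalculation step below is what turns `=SUM(A1:A2)` into a number.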
Verification
After creating Excel files with formulas, always recalculate:
```bash
python scripts/recalc.py output.xlsx
```
This ensures:
- All formulas are calculated (openpyxl doesn't evaluate formulas)
- Zero formula errors (#REF!, #DIV/0!, etc.)
- JSON output shows any errors to fix
PDF Handling Workflow
For extracting data from PDFs or creating PDF reports:
Extracting Data
Reference references/pdf-patterns.md for:
- Text extraction (pypdf, pdfplumber)
- Table extraction to DataFrame
- OCR for scanned documents
- Command-line tools (pdftotext, qpdf)
Creating Reports
Reference references/pdf-patterns.md for:
- Basic report creation with reportlab
- Professional reports with sections and tables
- Embedding matplotlib charts in PDFs
- Merge/split operations
Reference Files
Load these as needed during analysis:
| Reference | When to Use |
|---|---|
| | Calculating SaaS/RevOps metrics |
| references/biases.md | Interpretation phase, before finalizing insights |
| | Structuring output (pyramid vs consulting style) |
| | Choosing chart types, avoiding anti-patterns |
| references/data-wishlisting.md | Documenting gaps, rating proxy quality |
| references/data-cleaning.md | Data quality checks, cleaning patterns |
| references/datetime-handling.md | Timezone, parsing, fiscal calendars |
| references/dashboard-patterns.md | Marimo layouts, KPIs, interactivity |
| references/data-quality-validator.md | Data quality validation, detecting issues |
| references/xlsx-patterns.md | Excel output, financial model standards, formulas |
| references/pdf-patterns.md | PDF extraction, report creation, manipulation |
Scripts
| Script | Purpose | Usage |
|---|---|---|
| scripts/init_marimo_notebook.py | Initialize analysis workspace | `python scripts/init_marimo_notebook.py <notebook_name>` |
| | Create slide deck from findings | |
| scripts/profile_data.py | Generate data quality report | `python scripts/profile_data.py <csv_file> --output data_quality_report.md` |
| scripts/init_dashboard.py | Scaffold interactive dashboard | `python scripts/init_dashboard.py <dashboard_name>` |
| scripts/recalc.py | Recalculate Excel formulas | `python scripts/recalc.py output.xlsx` |
Technology Stack
| Tool | Purpose | Why |
|---|---|---|
| Marimo | Notebook environment | Pure Python files, reactive, git-friendly |
| pandas | Data manipulation | Reliable LLM code generation, mature ecosystem |
| Matplotlib/Seaborn | Visualization | Publication-quality, static, well-supported |
| python-pptx | Slide generation | Programmatic PowerPoint creation |
| openpyxl | Excel files | Formulas, formatting, financial models |
| pypdf/pdfplumber | PDF handling | Extract text, tables; create reports |
| reportlab | PDF creation | Professional PDF reports |
Example Invocations
Revenue analysis:
"Analyze our ARR trends by segment and identify drivers of growth/churn"
Pipeline analytics:
"Build a win rate analysis by deal size and sales rep"
Cohort analysis:
"Create a retention cohort analysis for customers acquired in 2023"
Forecasting:
"Project next quarter revenue based on current pipeline"
Board deck:
"Create an executive summary deck of our key SaaS metrics"
Data cleaning:
"Clean this messy CSV and profile the data quality"
Dashboard:
"Build a dashboard to monitor our key SaaS metrics"
Data validation:
"Validate these findings before I present them"
Excel output:
"Export this analysis to Excel with proper formulas and formatting"
PDF extraction:
"Extract the tables from this quarterly report PDF"
Financial model:
"Create a revenue projection model in Excel with scenario inputs"