data-analysis
Data Analysis
Overview
A comprehensive data analysis and storytelling skill optimized for financial, SaaS, and RevOps contexts. This skill provides structured workflows for turning raw data into actionable insights with full transparency on analytical decisions, bias awareness, and progressive disclosure reporting.
Workflow Overview
Every analysis follows a 7-phase process:
1. SETUP → Initialize Marimo notebook (run init_marimo_notebook.py)
2. INGEST → Load data, document sources and assumptions
3. EXPLORE → EDA with logged decisions (why this viz, why this filter)
4. MODEL → If needed, with interpretable-first approach
5. INTERPRET → Apply bias checklist, hedge appropriately
6. WISHLIST → Document data gaps and proxies used
7. OUTPUT → Generate appropriate tier (slides/report/notebook)
Decision Logging Protocol
Every analytical choice must be logged. This creates an audit trail and enables reproducibility.
What to Log
| Decision Type | Example | Log Format |
|---|---|---|
| Data filtering | Removed 47 records with null revenue | |
| Metric choice | Used logo churn vs revenue churn | |
| Visualization | Line chart for time series | |
| Assumption | Assumed linear growth for projection | |
| Proxy used | Used support tickets as NPS proxy | |
Log Format in Notebook
```python
=== DECISION LOG ===
FILTER: Excluded trial accounts - 1,247 records removed
METRIC: NRR over GRR because expansion is significant factor
ASSUMPTION: Q4 seasonality similar to prior year - confidence: M
PROXY: Support ticket sentiment for NPS - quality: Weak
```
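Entries in that format can be produced by a small helper; `log_decision` is a hypothetical name for illustration, not one of the skill's scripts:

```python
# Hypothetical helper: collect tagged decision entries so the notebook
# can print the full log at the end of the analysis.
decision_log = []

def log_decision(kind, detail):
    """Record one analytical choice, tagged FILTER/METRIC/ASSUMPTION/PROXY."""
    entry = f"{kind.upper()}: {detail}"
    decision_log.append(entry)
    return entry

log_decision("filter", "Excluded trial accounts - 1,247 records removed")
log_decision("metric", "NRR over GRR because expansion is significant factor")
print("=== DECISION LOG ===")
print("\n".join(decision_log))
```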
Analysis Workflow Details
Phase 1: Setup
Run the initialization script to create a new Marimo notebook with pre-built scaffolding:
```bash
python scripts/init_marimo_notebook.py <notebook_name>
```
This creates a .py file with:
- Decision log cell (markdown)
- Data loading template
- EDA template cells
- Bias checklist cell
Phase 2: Data Ingestion
When loading data:
- Document the source (file path, API, database query)
- Record row/column counts
- Note any immediate data quality issues
- Log assumptions about data freshness
```python
=== DATA SOURCE ===
Source: sales_data_2024.csv
Loaded: 2024-01-15
Records: 15,847 rows x 23 columns
Note: Data through 2024-01-10, 5-day lag from source system
```
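A minimal ingestion sketch that records those counts, using an inline CSV in place of a real source file such as sales_data_2024.csv:

```python
import io

import pandas as pd

# Sketch of an ingestion cell: load data, then log row/column counts
# and an immediate quality note. The inline CSV is illustrative.
raw = io.StringIO("account_id,revenue\nA1,1200\nA2,\nA3,950\n")
df = pd.read_csv(raw)

print("=== DATA SOURCE ===")
print(f"Records: {len(df):,} rows x {df.shape[1]} columns")
print(f"Quality note: {df['revenue'].isna().sum()} null revenue values")
```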
Phase 3: Exploratory Data Analysis
Follow this EDA checklist:
- Distribution of key numeric variables
- Missing value patterns
- Outlier detection
- Time series patterns (if applicable)
- Segment breakdowns
- Correlation exploration
Log every visualization choice and filtering decision.
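Two of these checks sketched in pandas on a toy revenue column (missing-value count and the common IQR outlier rule):

```python
import pandas as pd

# Toy column: one null, one extreme value.
df = pd.DataFrame({"revenue": [100, 120, 115, 130, 5000, None]})

# Missing value pattern
missing = int(df["revenue"].isna().sum())

# Outlier detection via the 1.5 * IQR rule
q1, q3 = df["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["revenue"] < q1 - 1.5 * iqr) | (df["revenue"] > q3 + 1.5 * iqr)
outliers = df[mask]

print(f"missing={missing}, outliers={outliers['revenue'].tolist()}")
```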
Phase 4: Modeling (If Needed)
Prioritize interpretability:
- First choice: Descriptive statistics, cohort analysis
- Second choice: Linear regression, decision trees
- Last resort: Complex ML (document why simpler won't work)
Always provide:
- Model assumptions
- Feature importance / coefficients
- Confidence intervals
- What the model cannot tell us
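For instance, a numpy-only trend fit keeps the coefficient directly inspectable (the ARR figures are illustrative):

```python
import numpy as np

# Interpretable-first sketch: ordinary least squares on a short series,
# reporting the slope rather than a black-box prediction.
months = np.arange(6)
arr = np.array([100, 108, 115, 124, 131, 140], dtype=float)

# Fit arr ≈ slope * months + intercept
slope, intercept = np.polyfit(months, arr, 1)
print(f"ARR grows ~{slope:.1f} per month (intercept {intercept:.1f})")
```

The coefficient itself is the finding; a more complex model would need to justify what this line misses.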
Phase 5: Interpretation
Before finalizing insights, run the bias checklist. See references/biases.md for the full checklist.
Quick check:
- Survivorship bias: Am I only looking at "survivors"?
- Simpson's paradox: Do segment trends differ from aggregate?
- Selection bias: Is my sample representative?
- Small-n warning: Is sample size sufficient for claims?
Hedge appropriately:
- Use "suggests" not "proves"
- State confidence levels
- Note what additional data would strengthen conclusions
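The Simpson's paradox check can be made concrete (counts below are illustrative, not from the skill's data):

```python
import pandas as pd

# Win rates per segment vs in aggregate for two reps.
deals = pd.DataFrame({
    "rep":     ["A", "A", "B", "B"],
    "segment": ["SMB", "ENT", "SMB", "ENT"],
    "won":     [9, 30, 80, 3],
    "total":   [10, 90, 90, 10],
})

per_segment = deals.assign(rate=deals["won"] / deals["total"])
aggregate = deals.groupby("rep")[["won", "total"]].sum()
aggregate["rate"] = aggregate["won"] / aggregate["total"]

# Rep A leads in BOTH segments, yet rep B leads in aggregate,
# because A's deals skew toward the harder ENT segment.
print(per_segment[["rep", "segment", "rate"]])
print(aggregate["rate"])
```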
Phase 5.5: Validation Gate (MANDATORY)
⚠️ GATE: Before proceeding to output, you MUST run the data quality validation checklist.
This is not optional. Run through references/data-quality-validator.md before finalizing.
Critical Patterns Checklist:
- Market Context: Are YoY changes compared to market/baseline?
- Weighting Sensitivity: If using composite scores, tested 5-6 weight scenarios?
- Bootstrap CIs: For small samples, generated P10/P50/P90 ranges?
- Survivorship Quantified: Data_Availability column added? Coverage % reported?
Statistical Checks:
- Sample sizes disclosed with confidence intervals?
- Multiple comparisons accounted for?
- Cherry-picked baselines avoided?
Logic Checks:
- Correlation not claimed as causation?
- Alternative explanations considered?
- Back-of-envelope validation passed?
Methodology Note on Time Horizons:
When assessing skill vs luck (e.g., sales rep performance, investment returns):
- Longer time horizons (3+ years) reveal inconsistency that short windows hide
- More periods = higher bar for "likely skill"
- A rep who is "top 10%" for 2 quarters could easily be luck
- A rep who is "top 10%" for 12 quarters is more likely skill
- Always state the number of periods analyzed and what that implies for confidence
Do not proceed to Phase 6/7 until this checklist is complete.
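The bootstrap item above can be sketched with numpy (win-rate values are illustrative):

```python
import numpy as np

# Bootstrap P10/P50/P90 range for a small-sample mean.
rng = np.random.default_rng(42)
sample = np.array([0.18, 0.22, 0.31, 0.12, 0.27, 0.19, 0.25])

# Resample with replacement many times and collect the means.
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(2000)]
p10, p50, p90 = np.percentile(boot_means, [10, 50, 90])
print(f"mean={sample.mean():.3f}  P10={p10:.3f}  P50={p50:.3f}  P90={p90:.3f}")
```

Reporting the P10/P90 range instead of a single mean makes small-n uncertainty visible to the reader.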
Phase 6: Data Wishlisting
Document gaps and proxies. See references/data-wishlisting.md for patterns.
Format:
Data Wishlist
| Missing Data | Proxy Used | Quality | Impact on Analysis |
|---|---|---|---|
| Customer NPS | Support sentiment | Weak | Core finding, needs validation |
| True LTV | 12-month value | Moderate | Acceptable for segmentation |
Phase 7: Output Generation
阶段7:输出生成
Choose output tier based on audience and purpose:
| Tier | When to Use | Tool |
|---|---|---|
| Slides | Executive summary, board deck | python-pptx |
| Report | Detailed findings, stakeholder review | Markdown/PDF |
| Notebook | Full analysis, data team handoff | Marimo .py file |
根据受众与用途选择输出层级:
| 层级 | 适用场景 | 工具 |
|---|---|---|
| 演示文稿 | 执行摘要、董事会报告 | |
| 详细报告 | 详细发现、利益相关方评审 | Markdown/PDF |
| Notebook | 完整分析、数据团队交接 | Marimo .py文件 |
Data Cleaning Workflow
For messy data that needs cleaning before analysis:
1. Profile the Data
```bash
python scripts/profile_data.py <csv_file> --output data_quality_report.md
```
This generates:
- Column-level statistics (nulls, uniques, types)
- Data quality score (A-F grading)
- Suspicious pattern detection
- Suggested cleaning steps
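A sketch of the column-level statistics such a profiler might compute (the real script's output format may differ):

```python
import pandas as pd

# Toy frame: one duplicate account id, one null revenue value.
df = pd.DataFrame({
    "account": ["A1", "A2", "A2"],
    "revenue": [100.0, None, 95.0],
})

# One row per column: dtype, null count, distinct values.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
    "uniques": df.nunique(),
})
print(profile)
```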
2. Apply Cleaning Patterns
Reference references/data-cleaning.md for:
- Missing value strategies (drop, impute, flag)
- Outlier detection methods (IQR, z-score, domain rules)
- Common transforms (pivot, melt, merge patterns)
- Type coercion recipes
- Deduplication patterns
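The three missing-value strategies can be sketched in pandas (toy data):

```python
import pandas as pd

df = pd.DataFrame({"revenue": [100.0, None, 95.0, None]})

# Drop: discard rows with missing revenue.
dropped = df.dropna(subset=["revenue"])

# Impute: fill with the column median.
imputed = df.assign(revenue=df["revenue"].fillna(df["revenue"].median()))

# Flag: keep the gap but mark it for downstream analysis.
flagged = df.assign(revenue_missing=df["revenue"].isna())

print(len(dropped), imputed["revenue"].tolist(), int(flagged["revenue_missing"].sum()))
```

Flagging is the most decision-log-friendly option: the gap stays visible instead of silently disappearing.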
3. Handle Datetime Issues
Reference references/datetime-handling.md for:
- Timezone conversion patterns
- Date parsing for mixed formats
- Fiscal calendar handling (FY vs CY)
- Period aggregation (daily → weekly → monthly)
- Business day calculations
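Date parsing and period aggregation, sketched with pandas (dates and amounts are illustrative):

```python
import pandas as pd

# Parse date strings, then roll daily records up to calendar months.
s = pd.Series(["2024-01-05", "2024-01-20", "2024-02-03"])
df = pd.DataFrame({"date": pd.to_datetime(s), "amount": [10, 20, 30]})

# "MS" buckets by month start: Jan = 10 + 20, Feb = 30.
monthly = df.set_index("date")["amount"].resample("MS").sum()
print(monthly.tolist())
```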
Dashboard Building Workflow
For interactive monitoring dashboards:
1. Initialize Dashboard
```bash
python scripts/init_dashboard.py <dashboard_name>
```
This creates a Marimo dashboard with:
- KPI cards row
- Filter sidebar (segment, date range, period)
- Time series trend chart
- Summary data table
- Responsive layout
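A KPI card boils down to a small data-prep step; a sketch of the values one card needs (MRR figures are illustrative):

```python
import pandas as pd

# Current value, delta vs prior period, and a trailing series
# for the card's sparkline.
mrr = pd.Series([100, 104, 103, 110, 118],
                index=pd.period_range("2024-01", periods=5, freq="M"))

current = mrr.iloc[-1]
delta_pct = (mrr.iloc[-1] / mrr.iloc[-2] - 1) * 100
sparkline = mrr.tail(4).tolist()
print(f"MRR {current} ({delta_pct:+.1f}%)", sparkline)
```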
2. Apply Dashboard Patterns
Reference references/dashboard-patterns.md for:
- Marimo layout patterns (sidebar, tabs, grid)
- KPI card templates with sparklines
- Filter/slider patterns for interactivity
- Data table styling and formatting
- Time series with range selection
- Refresh patterns for live data
Data Quality Validation Workflow
Before presenting or accepting analytical claims:
Run the Data Quality Validation Checklist
Reference references/data-quality-validator.md for comprehensive checklists:
Statistical Sins:
- P-hacking / multiple comparisons
- Small sample extrapolation
- Missing confidence intervals
- Cherry-picked baselines
Chart Crimes:
- Truncated y-axis
- Dual y-axis manipulation
- 3D charts
- Misleading scales
Logic Fallacies:
- Correlation ≠ causation
- Ecological fallacy
- Base rate neglect
- Survivorship bias
Sanity Checks:
- Does this pass the smell test?
- Back-of-envelope validation
- Historical comparison
- Cross-source validation
Excel Output Workflow
For exporting analysis results to Excel with proper formulas and formatting:
Financial Model Standards
Reference references/xlsx-patterns.md for:
- Color coding convention (blue=inputs, black=formulas, green=cross-sheet links)
- Number formatting standards (currency, percentages, multiples)
- Formula construction rules (use formulas, not hardcoded values)
- Common formula patterns for analysis
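A minimal openpyxl sketch of the "formulas, not hardcoded values" rule (in-memory save; a real model would write an .xlsx file and then run the recalc step):

```python
import io

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws["A1"], ws["A2"] = 100, 250
ws["A3"] = "=SUM(A1:A2)"   # store the formula, not the literal 350

buf = io.BytesIO()
wb.save(buf)               # illustrative in-memory save
print(ws["A3"].value)
```

Because openpyxl stores but never evaluates formulas, the recalculation step below is what turns `=SUM(A1:A2)` into a number.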
Verification
After creating Excel files with formulas, always recalculate:
```bash
python scripts/recalc.py output.xlsx
```
This ensures:
- All formulas are calculated (openpyxl doesn't evaluate formulas)
- Zero formula errors (#REF!, #DIV/0!, etc.)
- JSON output shows any errors to fix
PDF Handling Workflow
For extracting data from PDFs or creating PDF reports:
Extracting Data
Reference references/pdf-patterns.md for:
- Text extraction (pypdf, pdfplumber)
- Table extraction to DataFrame
- OCR for scanned documents
- Command-line tools (pdftotext, qpdf)
Creating Reports
Reference references/pdf-patterns.md for:
- Basic report creation with reportlab
- Professional reports with sections and tables
- Embedding matplotlib charts in PDFs
- Merge/split operations
Reference Files
Load these as needed during analysis:
| Reference | When to Use |
|---|---|
| | Calculating SaaS/RevOps metrics |
| references/biases.md | Interpretation phase, before finalizing insights |
| | Structuring output (pyramid vs consulting style) |
| | Choosing chart types, avoiding anti-patterns |
| references/data-wishlisting.md | Documenting gaps, rating proxy quality |
| references/data-cleaning.md | Data quality checks, cleaning patterns |
| references/datetime-handling.md | Timezone, parsing, fiscal calendars |
| references/dashboard-patterns.md | Marimo layouts, KPIs, interactivity |
| references/data-quality-validator.md | Data quality validation, detecting issues |
| references/xlsx-patterns.md | Excel output, financial model standards, formulas |
| references/pdf-patterns.md | PDF extraction, report creation, manipulation |
Scripts
| Script | Purpose | Usage |
|---|---|---|
| scripts/init_marimo_notebook.py | Initialize analysis workspace | `python scripts/init_marimo_notebook.py <notebook_name>` |
| | Create slide deck from findings | |
| scripts/profile_data.py | Generate data quality report | `python scripts/profile_data.py <csv_file> --output data_quality_report.md` |
| scripts/init_dashboard.py | Scaffold interactive dashboard | `python scripts/init_dashboard.py <dashboard_name>` |
| scripts/recalc.py | Recalculate Excel formulas | `python scripts/recalc.py output.xlsx` |
Technology Stack
| Tool | Purpose | Why |
|---|---|---|
| Marimo | Notebook environment | Pure Python files, reactive, git-friendly |
| pandas | Data manipulation | Reliable LLM code generation, mature ecosystem |
| Matplotlib/Seaborn | Visualization | Publication-quality, static, well-supported |
| python-pptx | Slide generation | Programmatic PowerPoint creation |
| openpyxl | Excel files | Formulas, formatting, financial models |
| pypdf/pdfplumber | PDF handling | Extract text, tables; create reports |
| reportlab | PDF creation | Professional PDF reports |
Example Invocations
Revenue analysis:
"Analyze our ARR trends by segment and identify drivers of growth/churn"
Pipeline analytics:
"Build a win rate analysis by deal size and sales rep"
Cohort analysis:
"Create a retention cohort analysis for customers acquired in 2023"
Forecasting:
"Project next quarter revenue based on current pipeline"
Board deck:
"Create an executive summary deck of our key SaaS metrics"
Data cleaning:
"Clean this messy CSV and profile the data quality"
Dashboard:
"Build a dashboard to monitor our key SaaS metrics"
Data validation:
"Validate these findings before I present them"
Excel output:
"Export this analysis to Excel with proper formulas and formatting"
PDF extraction:
"Extract the tables from this quarterly report PDF"
Financial model:
"Create a revenue projection model in Excel with scenario inputs"