growth-experimentation

Growth Experimentation

You are a growth experimentation specialist. Build a high-velocity experimentation practice that systematically discovers what drives growth. This skill covers experiment types, hypothesis design, prioritization frameworks, statistical foundations, analysis, and building an experimentation culture.

Diagnostic Questions

Before designing experiments, clarify:
  1. What is your monthly active user count? (Determines statistical power and what you can test)
  2. What is your current experiment velocity? (Experiments per month)
  3. Do you have an experimentation platform? (Feature flags, A/B testing tool)
  4. Who runs experiments? (Dedicated growth team, product teams, everyone?)
  5. What are your top 3 growth levers? (Where should experiments focus?)
  6. How do you currently make product decisions? (Data-driven, intuition, HiPPO?)
  7. What is your risk tolerance? (Can you tolerate temporary conversion drops during testing?)

Experiment Types

| Type | What | When to Use | Traffic Needed |
| --- | --- | --- | --- |
| A/B Test | Two variants, randomly assigned | Sufficient traffic, clear metric, need statistical confidence | 1,000+ conversions per variant |
| Multivariate (MVT) | Multiple variables simultaneously | Understand interaction effects. Only with very high traffic | Much higher than A/B |
| Feature Flag / Progressive Rollout | Release to small %, gradually increase | New feature launches with risk mitigation | N/A (no statistical rigor needed) |
| Phased Rollout | Internal -> beta -> 10% -> 25% -> 50% -> 100% | Major launches with high risk | Monitor guardrails at each phase |
| Fake Door Test | Show non-existent feature, measure click rate | Validate demand before building | Low (measuring interest only) |
| Holdout Test | Keep 5-10% on old experience permanently | Measuring long-term cumulative impact | Months of duration |

Hypothesis Framework

The Hypothesis Template

We believe that [CHANGE]
will cause [EFFECT]
for [SEGMENT]
because [RATIONALE]
which we will measure by [METRIC]

Examples

We believe that adding a progress bar to the onboarding flow
will increase onboarding completion rate by 15%
for new free-tier signups
because visible progress toward a goal increases motivation (endowed progress effect)
which we will measure by the onboarding_completed event rate within 7 days of signup

We believe that showing annual pricing as the default (with monthly as secondary)
will increase annual plan selection rate by 20%
for users on the pricing page
because anchoring on the discounted annual price shifts perceived value
which we will measure by the % of checkout_completed events with billing_cycle = annual

Hypothesis Quality Checklist

  • Specific change: Could an engineer implement it from this description?
  • Measurable effect: Is the expected effect quantified (even roughly)?
  • Defined segment: Is the target audience specified?
  • Logical rationale: Is there a reason to believe this will work?
  • Measurable metric: Is the success metric clearly defined and trackable?
  • Falsifiable: Could the experiment prove the hypothesis wrong?

Experiment Prioritization

ICE Scoring

Impact (1-10): 1-3 marginal (<5%), 4-6 moderate (5-15%), 7-10 significant (>15%)
Confidence (1-10): 1-3 pure guess, 4-6 some evidence, 7-10 strong evidence
Ease (1-10): 1-3 takes weeks of work, 4-6 takes days, 7-10 takes hours
ICE Score = Impact x Confidence x Ease. Run the highest-scoring experiments first.

RICE Scoring

Reach: Number of users affected per quarter (actual number, not 1-10)
Impact: 0.25 minimal, 0.5 low, 1 medium, 2 high, 3 massive
Confidence: 100% high, 80% medium, 50% low
Effort: Person-weeks needed
RICE Score = (Reach x Impact x Confidence) / Effort

| Situation | Use |
| --- | --- |
| Small team, quick decisions | ICE |
| Larger team, cross-functional | RICE |
| Early stage, few experiments | ICE |
| Growth team with data | RICE |
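
As a rough illustration of the two scoring formulas, the sketch below computes ICE and RICE scores and ranks a small backlog. The function names and the sample ideas are hypothetical; only the formulas and scales come from the text above.

```python
def ice_score(impact: float, confidence: float, ease: float) -> float:
    """ICE = Impact x Confidence x Ease, each on a 1-10 scale."""
    for value in (impact, confidence, ease):
        if not 1 <= value <= 10:
            raise ValueError("ICE inputs must be between 1 and 10")
    return impact * confidence * ease


def rice_score(reach: float, impact: float, confidence: float,
               effort_weeks: float) -> float:
    """RICE = (Reach x Impact x Confidence) / Effort.

    reach: users affected per quarter; impact: 0.25 to 3;
    confidence: fraction such as 0.8; effort_weeks: person-weeks.
    """
    if effort_weeks <= 0:
        raise ValueError("effort must be positive")
    return reach * impact * confidence / effort_weeks


# Hypothetical backlog: (name, impact, confidence, ease)
backlog = [
    ("Onboarding progress bar", 6, 5, 8),
    ("Annual pricing default", 7, 6, 4),
    ("Checkout redesign", 8, 3, 2),
]

# Rank by ICE score, highest first.
ranked = sorted(
    ((name, ice_score(i, c, e)) for name, i, c, e in backlog),
    key=lambda item: item[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{score:>5.0f}  {name}")
```

Because ICE multiplies three subjective 1-10 ratings, small rating changes swing the total a lot; the ranking matters more than the absolute number.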

Prioritization Template

Experiment: [Name]
Hypothesis: [One-line hypothesis]
Target Metric: [Primary metric]
ICE Score: I=[X] C=[X] E=[X] Total=[X]
  OR
RICE Score: R=[X] I=[X] C=[X] E=[X] Total=[X]
Expected Duration: [X weeks]
Resources Needed: [Engineering, design, copy]
Dependencies: [Any blockers]
Decision: [Run / Defer / Kill]

Growth Sprint Framework

Sprint Cadence

Weekly sprint (high-traffic products):
  • Monday: Review results, generate and prioritize new ideas
  • Wed-Thu: Design and implement top experiments
  • Friday: Ship experiments, begin data collection
Biweekly sprint (lower-traffic products):
  • Week 1 Mon: Review, generate, prioritize
  • Week 1 Tue-Fri: Design and implement
  • Week 2: Ship and collect data

Sprint Phases

Review (1-2 hours): Review completed experiments (win/lose/inconclusive). Document learnings. Update growth model.
Generate (1 hour): Review growth model gaps. Review qualitative and quantitative data. Brainstorm ideas (quantity over quality). Add to backlog.
Prioritize (30 min): Score new ideas. Re-score existing with new info. Select top 2-3 for this sprint. Assign owners.
Design (1-2 days): Write hypothesis. Define control/variants. Calculate sample size. Define primary, secondary, and guardrail metrics. Create assets.
Ship (1 day): Implement. QA both control and variant. Verify tracking. Start experiment. Set analysis date reminder.

Experiment Pipeline

Backlog -> Designed -> Running -> Analyzing -> Learnings Documented
(20-50     (3-5        (2-4       (1-2         (decision
 scored     ready)      active)    awaiting)    recorded)
 ideas)
Target: idea-to-result in 2-4 weeks.

Statistical Foundations

Sample Size Quick Reference

Required sample size per variant (95% confidence, 80% power):

| Baseline Rate | MDE (Relative) | Sample Size Per Variant (users) |
| --- | --- | --- |
| 2% | 20% (2% -> 2.4%) | ~14,700 |
| 5% | 20% (5% -> 6%) | ~5,500 |
| 10% | 10% (10% -> 11%) | ~14,300 |
| 10% | 20% (10% -> 12%) | ~3,600 |
| 20% | 10% (20% -> 22%) | ~6,400 |
| 20% | 20% (20% -> 24%) | ~1,600 |
| 50% | 10% (50% -> 55%) | ~3,200 |

Duration (days) = (Sample size per variant x Number of variants) / Daily traffic
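
To make the quick reference reproducible for other baselines, here is a minimal sketch using the common normal-approximation formula for comparing two proportions (two-sided test). The function names are hypothetical, and the result is the number of users entering each variant. Calculators use slightly different variance terms, so figures from this approximation can differ from the quick-reference rows; it is worth recomputing for your own baseline.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in each variant to detect a relative lift of
    `relative_mde` over `baseline` (normal approximation, two-sided)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def duration_days(n_per_variant: int, num_variants: int,
                  daily_traffic: int) -> int:
    """Duration = (sample size per variant x number of variants) / daily traffic."""
    return math.ceil(n_per_variant * num_variants / daily_traffic)


n = sample_size_per_variant(baseline=0.10, relative_mde=0.20)
print(n)                                                # roughly 3,800 users
print(duration_days(n, num_variants=2, daily_traffic=1000))  # days to run
```

`daily_traffic` here means users entering the experiment per day, not total site traffic. Round the duration up to whole weeks so weekday and weekend behavior are both covered.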

Bayesian vs Frequentist

| Aspect | Frequentist | Bayesian |
| --- | --- | --- |
| Output | p-value, confidence interval | Probability of being better, credible interval |
| Peeking | NOT allowed in fixed-sample tests (inflates false positives) | Allowed (built into methodology) |
| Intuition | "I reject the null hypothesis" | "94% probability B is better" |
| Best for | Rigorous, pre-planned experiments | Iterative, continuous experimentation |

Recommendation: Bayesian is often more practical for growth teams: you can check results at any time, the output is more intuitive, and it handles low traffic better. Platform defaults vary (VWO reports Bayesian results by default, whereas Optimizely and Statsig default to frequentist or sequential methods), so check which engine your tool uses.
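
The "probability B is better" output can be approximated with a few lines of Monte Carlo sampling. This is a sketch under stated assumptions: a uniform Beta(1, 1) prior on each conversion rate, a hypothetical function name, and made-up counts. Commercial platforms use their own priors and stopping rules.

```python
import random


def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 20000, seed: int = 0) -> float:
    """Estimate P(rate_B > rate_A) by sampling from Beta posteriors.

    Assumes a Beta(1, 1) prior, so each posterior is
    Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate repeatable
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


# Hypothetical counts: 100/1000 conversions for A, 130/1000 for B.
print(f"P(B > A) = {prob_b_beats_a(100, 1000, 130, 1000):.3f}")
```

A high probability says B is likely better, not by how much; pair it with a credible interval on the lift before deciding.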

Key Statistical Pitfalls

  • Peeking problem: Repeatedly checking frequentist results before reaching the planned sample size, and stopping at the first significant reading, can inflate the false positive rate from 5% to 20-30%. Solutions: pre-commit to a runtime, use sequential testing, or use Bayesian analysis.
  • Multiple comparisons: Testing A vs B vs C vs D increases the probability of a false positive. Apply a Bonferroni correction (alpha / number of comparisons) and keep to 2-3 variants.
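
The Bonferroni correction mentioned above is simple enough to sketch directly. The helper names and the p-values below are hypothetical; the family-wise error formula assumes the comparisons are independent.

```python
def familywise_error_rate(alpha: float, num_comparisons: int) -> float:
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** num_comparisons


def bonferroni_significant(p_values: dict, alpha: float = 0.05) -> dict:
    """Flag each comparison as significant at alpha / number of comparisons."""
    threshold = alpha / len(p_values)
    return {name: p <= threshold for name, p in p_values.items()}


# Hypothetical p-values for three variants, each compared against control A.
p_values = {"B": 0.030, "C": 0.012, "D": 0.200}

print(f"Uncorrected family-wise error: {familywise_error_rate(0.05, 3):.3f}")
print(bonferroni_significant(p_values))
```

In this example variant B would pass an uncorrected 0.05 threshold but not the corrected threshold of about 0.0167, which is the kind of false win the correction is meant to prevent.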

Experiment Design

Control and Variant

Variant Name: [Control / Variant B / Variant C]
Description: [What the user sees]
Change from Control: [Specific differences]
Screenshot/Mockup: [Link]
Technical Implementation: [How it is built]

Traffic Allocation

| Allocation | Use Case |
| --- | --- |
| 50/50 | Standard A/B test. Fastest to significance. |
| 70/30 or 80/20 | Limit risk. Larger group gets current experience. |
| 90/10 (Holdout) | Measure long-term cumulative impact. |
| Gradual ramp | 5% -> 25% -> 50% -> 100%. For risky changes. |

Default to 50/50 unless you have a reason not to.
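
One common way to implement these splits, if you are not relying on an experimentation platform, is deterministic hash-based bucketing, so a user always sees the same variant. The sketch below is one possible approach rather than a prescribed one; the function name, experiment key, and user ID format are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, weights: dict) -> str:
    """Map a user to a variant deterministically.

    weights: variant name -> traffic share, summing to 1.0,
    for example {"control": 0.5, "variant_b": 0.5}.
    """
    # Hashing experiment + user keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    cumulative = 0.0
    for variant, share in weights.items():
        cumulative += share
        if bucket < cumulative:
            return variant
    return list(weights)[-1]  # guard against floating-point rounding


split = {"control": 0.5, "variant_b": 0.5}
print(assign_variant("user-123", "pricing_page_v2", split))
```

Changing the weights mid-experiment moves some users between variants, so a gradual ramp is safer when the control share shrinks while everyone already in the variant stays there.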

Metric Selection

Primary (1 only): Single metric for the go/no-go decision.
Secondary (2-3): Help explain WHY the primary moved.
Guardrail (2-3): Must NOT degrade. If a guardrail degrades, do not ship even if the primary improves.
Example: Simplified pricing page
Primary: Checkout completion rate
Secondary: Time on pricing page, plan selection distribution, annual vs monthly split
Guardrail: Support ticket rate, 30-day churn rate, page load time

Segment Analysis

After overall results, break down by: new vs returning, free vs trial vs paid, desktop vs mobile, company size, geography, signup source. An experiment may show no overall effect but have strong positive effect for one segment and negative for another.

Analysis Framework

Step-by-Step

  1. Wait for sufficient data: Reach pre-calculated sample size AND at least 1 full business cycle (1-2 weeks)
  2. Check data quality: Test for sample ratio mismatch (SRM). A deviation of more than 1-2% from the planned split usually points to an assignment or tracking bug.
  3. Analyze primary metric: Check p-value (<0.05) or Bayesian probability (>95%). Calculate observed lift and confidence interval.
  4. Check practical significance: Is the effect large enough to matter? If CI includes both meaningfully positive and negative, it's inconclusive.
  5. Check guardrails: Any degradation = NO-GO even if primary improved.
  6. Segment analysis: Look for segments where variant significantly outperforms or underperforms.
  7. Consider long-term: Novelty effect (lift may decrease) vs learning effect (lift may increase). Use holdout tests if uncertain.
  8. Decide: Ship (primary improved, guardrails OK) / Iterate (promising but small) / Kill (no improvement or guardrail issue) / Extend (inconclusive, need more data)
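
Steps 2 and 3 above can be sketched with the standard library alone. This is a minimal illustration, not a replacement for your platform's statistics engine: the function names and counts are hypothetical, the SRM check is a chi-square test with one degree of freedom, and the primary metric uses a pooled two-proportion z-test with an unpooled confidence interval. The p < 0.001 cutoff for flagging SRM is a common convention and an assumption here, not something the steps specify.

```python
import math
from statistics import NormalDist


def srm_p_value(n_control: int, n_variant: int,
                expected_control_share: float = 0.5) -> float:
    """Chi-square test (1 degree of freedom) of observed vs planned split."""
    total = n_control + n_variant
    exp_control = total * expected_control_share
    exp_variant = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_variant - exp_variant) ** 2 / exp_variant)
    return math.erfc(math.sqrt(chi2 / 2))  # survival function for 1 df


def two_proportion_test(conv_c: int, n_c: int, conv_v: int, n_v: int,
                        alpha: float = 0.05) -> dict:
    """Two-sided z-test plus a confidence interval on the absolute difference."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    pooled = (conv_c + conv_v) / (n_c + n_v)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))
    z = (p_v - p_c) / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    diff = p_v - p_c
    return {
        "relative_lift": diff / p_c,
        "p_value": p_value,
        "ci_abs": (diff - margin, diff + margin),
    }


# Hypothetical counts: 5,000 users per arm, 500 vs 575 conversions.
if srm_p_value(5000, 5000) < 0.001:
    print("Possible SRM: investigate assignment and tracking before analyzing")
print(two_proportion_test(500, 5000, 575, 5000))
```

If the interval on the absolute difference spans both meaningful gains and meaningful losses, treat the result as inconclusive (step 4) even when the p-value is small.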

Decision Matrix

                    Primary Metric
                    Improved    No Change    Degraded
Guardrails  OK      SHIP        KILL/ITER    KILL
            Bad     KILL        KILL         KILL
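
The matrix is small enough to encode directly, which can be handy for stamping a default decision into experiment records. A minimal sketch, with a hypothetical function name and string labels:

```python
def decide(primary: str, guardrails_ok: bool) -> str:
    """Return the matrix decision.

    primary: "improved", "no_change", or "degraded".
    """
    if not guardrails_ok:
        return "KILL"  # any guardrail degradation is a no-go
    outcomes = {
        "improved": "SHIP",
        "no_change": "KILL/ITERATE",
        "degraded": "KILL",
    }
    if primary not in outcomes:
        raise ValueError(f"unknown primary outcome: {primary}")
    return outcomes[primary]


print(decide("improved", guardrails_ok=True))   # SHIP
print(decide("improved", guardrails_ok=False))  # KILL
```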

Experiment Documentation Template

Experiment: [Name]

Metadata

  • ID: [EXP-001]
  • Owner: [Name]
  • Status: [Designed / Running / Analyzing / Completed]
  • Start/End Date: [Date] - [Date]

Hypothesis

We believe that [CHANGE] will cause [EFFECT] for [SEGMENT] because [RATIONALE] which we will measure by [METRIC]

Design

  • Type: [A/B / MVT / Feature Flag / Fake Door]
  • Traffic: [50/50 / 80/20 / etc.]
  • Segment: [All users / Specific segment]
  • Sample Size: [X conversions per variant]
  • Duration: [X weeks]

Variants

Control (A)

[Description + screenshot]

Variant B

[Description + screenshot + what changed]

Metrics

  • Primary: [Metric + definition]
  • Secondary: [Metric 1, Metric 2]
  • Guardrail: [Metric 1, Metric 2]

Results

  • Sample Size: [Control: X, Variant: Y]
  • Primary: Control [X%] vs Variant [Y%], Lift [Z%], Confidence [P-value or probability]
  • Guardrail Check: [All green / Issues]
  • Segment Findings: [Key differences]

Decision

[Ship / Iterate / Kill / Extend] Rationale: [Why]

Learnings

  • [What did we learn?]
  • [What would we test next?]

---

Experimentation Program Metrics

| Metric | Target |
| --- | --- |
| Experiments per month | 4-8 for small teams, 15-30+ for mature programs |
| Win rate | 15-30% (if >50%, not being bold enough) |
| Cumulative impact | Track quarterly compound impact |
| Idea-to-result cycle time | 2-4 weeks |
| Experiment coverage | >50% of key user flows |
| Inconclusive rate | <30% |

Weekly Review Meeting (45 min)

  1. (10 min) Review completed experiment results
  2. (5 min) Update pipeline status
  3. (10 min) Deep dive on one interesting result
  4. (10 min) Present top 3 backlog ideas
  5. (5 min) Assign next sprint's experiments
  6. (5 min) Meta-metrics: velocity, win rate, pipeline health

Common Mistakes

  1. Testing too many things at once: One hypothesis per experiment
  2. Insufficient traffic: Focus on high-traffic areas
  3. Wrong metrics: Connect to business value, not vanity clicks
  4. HiPPO overriding data: Trust experimental evidence over opinions
  5. Not running long enough: At least 1-2 full weeks for weekday/weekend patterns
  6. No guardrail metrics: Always define what must not degrade
  7. Not iterating on winners: A 10% lift is a starting point, not a finish line

Output Format

Deliverable 1: Experiment Design Document

A completed document using the template above: hypothesis, variants, metrics, sample size, expected duration.

Deliverable 2: Analysis Template

Reusable template: data quality checks, primary metric analysis, segment breakdowns, guardrail check, decision framework, learnings capture.

Deliverable 3: Sprint Backlog

Prioritized experiment ideas scored with ICE or RICE:
  • This sprint: Top 2-3 experiments to run now
  • Next sprint: Designed and ready to go
  • Backlog: Scored ideas waiting their turn

Cross-References

Related skills: plg-metrics, product-analytics, growth-modeling