experiment-designer

Experiment Designer

Design, prioritize, and evaluate product experiments with clear hypotheses and defensible decisions.

When To Use

Use this skill for:
  • A/B and multivariate experiment planning
  • Hypothesis writing and success criteria definition
  • Sample size and minimum detectable effect planning
  • Experiment prioritization with ICE scoring
  • Reading statistical output for product decisions

Core Workflow

  1. Write the hypothesis in If/Then/Because format
    • If we change [intervention]
    • Then [metric] will change by [expected direction/magnitude]
    • Because [behavioral mechanism]
  2. Define metrics before running the test
    • Primary metric: single decision metric
    • Guardrail metrics: quality/risk protection
    • Secondary metrics: diagnostics only
  3. Estimate sample size
    • Baseline conversion or baseline mean
    • Minimum detectable effect (MDE)
    • Significance level (alpha) and power
    Use:
        python3 scripts/sample_size_calculator.py --baseline-rate 0.12 --mde 0.02 --mde-type absolute
  4. Prioritize experiments with ICE
    • Impact: potential upside
    • Confidence: evidence quality
    • Ease: cost/speed/complexity
    ICE Score = (Impact * Confidence * Ease) / 10
  5. Launch with stopping rules
    • Decide on a fixed sample size or fixed duration in advance
    • Avoid repeated peeking at results unless the analysis method is designed for it
    • Monitor guardrails continuously
  6. Interpret results
    • Statistical significance is not business significance
    • Compare the point estimate and confidence interval to the decision threshold
    • Investigate novelty effects and segment heterogeneity
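The ICE formula in the prioritization step can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill's scripts: the `ExperimentIdea` class, the `rank_by_ice` helper, the 1 to 10 scoring scale, and the example ideas are all assumptions made here for demonstration.

```python
from dataclasses import dataclass


@dataclass
class ExperimentIdea:
    """Hypothetical container for one backlog item (not from the skill itself)."""
    name: str
    impact: float      # potential upside, assumed to be scored 1-10
    confidence: float  # evidence quality, assumed to be scored 1-10
    ease: float        # cost/speed/complexity, assumed to be scored 1-10

    @property
    def ice_score(self) -> float:
        # ICE Score = (Impact * Confidence * Ease) / 10, as given in the workflow
        return (self.impact * self.confidence * self.ease) / 10


def rank_by_ice(ideas):
    """Return ideas sorted from highest to lowest ICE score."""
    return sorted(ideas, key=lambda idea: idea.ice_score, reverse=True)


# Illustrative backlog; the names and scores are made up.
ideas = [
    ExperimentIdea("Shorter signup form", impact=7, confidence=6, ease=8),
    ExperimentIdea("New pricing page", impact=9, confidence=4, ease=3),
    ExperimentIdea("Checkout button copy", impact=3, confidence=8, ease=10),
]
ranked = rank_by_ice(ideas)
for idea in ranked:
    print(f"{idea.name}: {idea.ice_score:.1f}")
```

Because the three factors are multiplied, a low score on any one of them pulls the whole idea down, which is why the high-impact but hard and weakly evidenced pricing idea ranks last here.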

Hypothesis Quality Checklist

  • Contains explicit intervention and audience
  • Specifies measurable metric change
  • States plausible causal reason
  • Includes expected minimum effect
  • Defines failure condition
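One way to apply this checklist is to store each hypothesis as a structured record and flag empty fields before the test is approved. The sketch below is only an illustration: the `Hypothesis` class, its field names, and the example values are assumptions introduced here, not something the skill defines.

```python
from dataclasses import dataclass, fields


@dataclass
class Hypothesis:
    """Hypothetical record with one field per checklist item."""
    intervention: str       # explicit change being made
    audience: str           # who is exposed to the change
    metric: str             # measurable metric expected to move
    expected_change: str    # expected direction/magnitude
    mechanism: str          # plausible causal reason
    minimum_effect: str     # expected minimum effect
    failure_condition: str  # what result would count as failure

    def missing_items(self):
        """Return the names of checklist fields that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def statement(self):
        """Render the hypothesis in If/Then/Because form."""
        return (
            f"If we change {self.intervention} for {self.audience}, "
            f"then {self.metric} will change by {self.expected_change}, "
            f"because {self.mechanism}."
        )


# Illustrative example with a deliberately missing failure condition.
draft = Hypothesis(
    intervention="the signup form from 6 fields to 3",
    audience="new mobile visitors",
    metric="signup conversion",
    expected_change="+2 percentage points",
    mechanism="shorter forms reduce abandonment",
    minimum_effect="+1 percentage point",
    failure_condition="",
)
print(draft.statement())
print("Missing:", draft.missing_items())
```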

Common Experiment Pitfalls

  • Underpowered tests leading to false negatives
  • Running too many simultaneous changes without isolation
  • Changing targeting or implementation mid-test
  • Stopping early on random spikes
  • Ignoring sample ratio mismatch and instrumentation drift
  • Declaring success from p-value without effect-size context
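Sample ratio mismatch from the list above can be checked with a chi-square goodness-of-fit test on the assignment counts. The following is a minimal sketch for a two-variant test using only the standard library; the function name `srm_p_value`, the example counts, and the 0.001 alert threshold are assumptions chosen for illustration (the threshold is a common convention, not something this document specifies).

```python
import math


def srm_p_value(control_n, treatment_n, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-variant split (1 degree of freedom)."""
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (treatment_n - expected_treatment) ** 2 / expected_treatment
    )
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Made-up counts for a test that was configured as a 50/50 split.
p = srm_p_value(control_n=50_000, treatment_n=48_500)
SRM_ALERT_THRESHOLD = 0.001  # assumed convention, adjust to your own policy
if p < SRM_ALERT_THRESHOLD:
    print(f"Possible sample ratio mismatch (p = {p:.2e}); check assignment and logging")
else:
    print(f"No evidence of sample ratio mismatch (p = {p:.3f})")
```

A very small p-value here points to a problem with assignment or instrumentation rather than with the treatment itself, so the experiment results should not be trusted until the cause is found.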

Statistical Interpretation Guardrails

  • A p-value below alpha indicates evidence against the null hypothesis, not proof that the effect is real.
  • A confidence interval that crosses zero (no effect) means the direction of the effect is uncertain.
  • Wide intervals imply low precision even when the result is significant.
  • Use practical significance thresholds tied to business impact.
See:
  • references/experiment-playbook.md
  • references/statistics-reference.md
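The guardrails above can be made concrete by computing a confidence interval for the difference in conversion rates and comparing it to a practical threshold. This is a hedged sketch, not the method used by the referenced files: it assumes a simple Wald (normal-approximation) interval, and the function names, verdict labels, counts, and the 0.005 threshold are all illustrative assumptions.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(conv_control, n_control, conv_treatment, n_treatment, alpha=0.05):
    """Wald interval for (treatment rate - control rate); returns (diff, lower, upper)."""
    p_c = conv_control / n_control
    p_t = conv_treatment / n_treatment
    diff = p_t - p_c
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    return diff, diff - margin, diff + margin


def interpret(lower, upper, practical_threshold):
    """Map an interval to a coarse verdict; the labels are assumptions for this sketch."""
    if lower >= practical_threshold:
        return "clears practical threshold"
    if upper < 0:
        return "negative effect"
    if lower > 0:
        return "positive, but may be below practical threshold"
    return "direction uncertain"  # interval includes zero


# Made-up counts: 12% control conversion vs 14% treatment conversion.
diff, lower, upper = diff_confidence_interval(1200, 10_000, 1400, 10_000)
verdict = interpret(lower, upper, practical_threshold=0.005)
print(f"diff = {diff:.4f}, 95% CI = ({lower:.4f}, {upper:.4f}) -> {verdict}")
```

Comparing the lower bound, rather than the point estimate, to the threshold is a deliberately conservative choice; a team could reasonably pick a different rule as long as it is fixed before the test starts.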

Tooling

scripts/sample_size_calculator.py

Computes required sample size (per variant and total) from:
  • baseline rate
  • MDE (absolute or relative)
  • significance level (alpha)
  • statistical power
Example:
    python3 scripts/sample_size_calculator.py \
      --baseline-rate 0.10 \
      --mde 0.015 \
      --mde-type absolute \
      --alpha 0.05 \
      --power 0.8
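The source does not show how the script computes its result, so the following is only a sketch of one standard approach: the normal-approximation sample size formula for a two-sided test of two proportions. The function name `sample_size_per_variant` and its parameters mirror the CLI flags above for readability, but they are assumptions, and the real script may use a different formula and return different numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, mde, mde_type="absolute", alpha=0.05, power=0.8):
    """Approximate per-variant sample size for a two-sided two-proportion z-test."""
    if mde_type not in ("absolute", "relative"):
        raise ValueError("mde_type must be 'absolute' or 'relative'")
    # A relative MDE is interpreted as a fraction of the baseline rate (assumption).
    delta = mde if mde_type == "absolute" else baseline_rate * mde
    p1 = baseline_rate
    p2 = baseline_rate + delta
    if delta == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must lie strictly between 0 and 1 and differ")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / delta ** 2)


n = sample_size_per_variant(0.10, 0.015, mde_type="absolute", alpha=0.05, power=0.8)
print(f"per variant: {n}, total for two variants: {2 * n}")
```

Because the required sample size scales with one over the squared effect, halving the MDE roughly quadruples the traffic needed, which is usually the main lever when a test looks too expensive to run.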