ab-testing
A/B Test Setup
You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.
Initial Assessment
Check for product marketing context first:
If .agents/product-marketing.md exists (or .claude/product-marketing.md, or the legacy filename product-marketing-context.md in older setups), read it before asking questions. Use that context and only ask for information not already covered or specific to this task.
Before designing a test, understand:
- Test Context - What are you trying to improve? What change are you considering?
- Current State - Baseline conversion rate? Current traffic volume?
- Constraints - Technical complexity? Timeline? Tools available?
Core Principles
1. Start with a Hypothesis
- Not just "let's see what happens"
- Specific prediction of outcome
- Based on reasoning or data
2. Test One Thing
- Single variable per test
- Otherwise you don't know what worked
3. Statistical Rigor
- Pre-determine sample size
- Don't peek and stop early
- Commit to the methodology
4. Measure What Matters
- Primary metric tied to business value
- Secondary metrics for context
- Guardrail metrics to prevent harm
Hypothesis Framework
Structure
Because [observation/data],
we believe [change]
will cause [expected outcome]
for [audience].
We'll know this is true when [metrics].
Example
Weak: "Changing the button color might increase clicks."
Strong: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."
Test Types
| Type | Description | Traffic Needed |
|---|---|---|
| A/B | Two versions, single change | Moderate |
| A/B/n | Multiple variants | Higher |
| MVT | Multiple changes in combinations | Very high |
| Split URL | Different URLs for variants | Moderate |
Sample Size
Quick Reference
| Baseline conversion rate | 10% relative lift | 20% relative lift | 50% relative lift |
|---|---|---|---|
| 1% | 150k/variant | 39k/variant | 6k/variant |
| 3% | 47k/variant | 12k/variant | 2k/variant |
| 5% | 27k/variant | 7k/variant | 1.2k/variant |
| 10% | 12k/variant | 3k/variant | 550/variant |
Calculators:
For detailed sample size tables and duration calculations: See references/sample-size-guide.md
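As a rough cross-check on the quick reference, the sketch below estimates visitors per variant with the standard normal approximation for comparing two proportions. The defaults (two-sided alpha of 0.05, 80% power) and the function names `sample_size_per_variant` and `days_to_run` are assumptions made for illustration; they are not taken from references/sample-size-guide.md.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant (two-sided two-proportion test).

    baseline: control conversion rate, e.g. 0.05 for 5%
    relative_lift: minimum detectable effect, relative, e.g. 0.20 for +20%
    alpha and power are assumed defaults, not values from the source document.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(n_per_variant, variants, daily_visitors):
    """Days needed to reach the sample size, assuming all traffic enters the test."""
    return math.ceil(n_per_variant * variants / daily_visitors)


if __name__ == "__main__":
    n = sample_size_per_variant(baseline=0.05, relative_lift=0.20)
    print(n, "visitors per variant")
    print(days_to_run(n, variants=2, daily_visitors=1000), "days at 1,000 visitors/day")
```

With a 5% baseline and a 20% relative lift this comes out a little over 8,000 per variant. The quick reference table may rest on different assumptions, so expect figures in the same range rather than identical ones.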
Metrics Selection
Primary Metric
- Single metric that matters most
- Directly tied to hypothesis
- What you'll use to call the test
Secondary Metrics
- Support primary metric interpretation
- Explain why/how the change worked
Guardrail Metrics
- Things that shouldn't get worse
- Stop test if significantly negative
Example: Pricing Page Test
- Primary: Plan selection rate
- Secondary: Time on page, plan distribution
- Guardrail: Support tickets, refund rate
Designing Variants
What to Vary
| Category | Examples |
|---|---|
| Headlines/Copy | Message angle, value prop, specificity, tone |
| Visual Design | Layout, color, images, hierarchy |
| CTA | Button copy, size, placement, number |
| Content | Information included, order, amount, social proof |
Best Practices
- Single, meaningful change
- Bold enough to make a difference
- True to the hypothesis
Traffic Allocation
| Approach | Split | When to Use |
|---|---|---|
| Standard | 50/50 | Default for A/B |
| Conservative | 90/10, 80/20 | Limit risk of bad variant |
| Ramping | Start small, increase | Technical risk mitigation |
Considerations:
- Consistency: Users see same variant on return
- Balanced exposure across time of day/week
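One common way to get the consistency described above is deterministic hashing: the same user and experiment always map to the same bucket, with no lookup table. The sketch below is a minimal illustration, not how PostHog, Optimizely, or any other tool does it; `assign_variant` and its parameters are hypothetical names.

```python
import hashlib


def assign_variant(user_id, experiment, weights):
    """Deterministically map a user to a variant.

    weights: dict of variant name -> traffic share, summing to 1.0,
             e.g. {"control": 0.9, "variant": 0.1} for a conservative split.
    Hashing experiment + user_id keeps assignments stable across visits and
    independent between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    point = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    cumulative = 0.0
    for variant, share in weights.items():
        cumulative += share
        if point < cumulative:
            return variant
    return variant  # fallback if float rounding leaves a tiny gap below 1.0


if __name__ == "__main__":
    split = {"control": 0.5, "variant": 0.5}
    print(assign_variant("user-42", "pricing-page-test", split))
    print(assign_variant("user-42", "pricing-page-test", split))  # same answer
```

Changing the weights mid-test can move users between buckets, which is one reason ramping needs care.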
Implementation
Client-Side
- JavaScript modifies page after load
- Quick to implement, can cause flicker
- Tools: PostHog, Optimizely, VWO
Server-Side
- Variant determined before render
- No flicker, requires dev work
- Tools: PostHog, LaunchDarkly, Split
Running the Test
Pre-Launch Checklist
- Hypothesis documented
- Primary metric defined
- Sample size calculated
- Variants implemented correctly
- Tracking verified
- QA completed on all variants
During the Test
DO:
- Monitor for technical issues
- Check segment quality
- Document external factors
Avoid:
- Peek at results and stop early
- Make changes to variants
- Add traffic from new sources
The Peeking Problem
Looking at results before reaching sample size and stopping early leads to false positives and wrong decisions. Pre-commit to sample size and trust the process.
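A small A/A simulation makes the effect concrete. Both arms share the same true conversion rate, so every "significant" result is a false positive. The sketch compares stopping at the first significant interim look with testing once at the planned sample size. All parameters (300 runs, 2,000 visitors per arm, 10 looks, 10% rate, the seed) are illustrative assumptions.

```python
import math
import random


def z_significant(conv_a, conv_b, n, z_crit=1.96):
    """Two-sided pooled z-test for equal-sized arms at roughly the 5% level."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # no variance yet, nothing to test
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(conv_b - conv_a) / n / se > z_crit


def simulate_aa_tests(runs=300, n_per_arm=2000, looks=10, rate=0.10, seed=7):
    """Return (false positive rate with peeking, false positive rate without)."""
    rng = random.Random(seed)
    step = n_per_arm // looks
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        any_look = False
        for look in range(1, looks + 1):
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            significant = z_significant(conv_a, conv_b, look * step)
            any_look = any_look or significant  # would have stopped here
        peeking_hits += any_look
        fixed_hits += significant  # only the final, planned look counts
    return peeking_hits / runs, fixed_hits / runs


if __name__ == "__main__":
    peeking, fixed = simulate_aa_tests()
    print(f"false positives with peeking: {peeking:.1%}")
    print(f"false positives at planned sample size: {fixed:.1%}")
```

Because the final look is one of the looks, the peeking rate can never be lower than the fixed rate, and in practice it is typically several times higher.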
Analyzing Results
Statistical Significance
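- 95% confidence corresponds to a significance threshold of p-value < 0.05
- Means that if the change had no real effect, a difference at least this large would show up less than 5% of the time
- Not a guarantee, just a threshold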
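For a conversion-rate metric, the p-value and confidence interval can be computed with a two-proportion z-test. The sketch below is one common choice, assuming large samples and independent visitors; the function name `two_proportion_test` and the example counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Compare control (a) and variant (b) conversion counts.

    Returns the absolute difference in rates, a two-sided p-value, and a
    confidence interval for the difference. Assumes at least one conversion
    and one non-conversion overall, otherwise the standard error is zero.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error for the hypothesis test (assumes no true difference).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the interval around the observed difference.
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = NormalDist().inv_cdf(0.5 + confidence / 2) * se_diff
    return {"diff": diff, "p_value": p_value, "ci": (diff - margin, diff + margin)}


if __name__ == "__main__":
    result = two_proportion_test(conv_a=500, n_a=10_000, conv_b=600, n_b=10_000)
    print(result)
```

Reading the interval alongside the p-value shows how large or small the true effect could plausibly be, which the p-value alone does not.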
Analysis Checklist
- Reach sample size? If not, result is preliminary
- Statistically significant? Check confidence intervals
- Effect size meaningful? Compare to MDE, project impact
- Secondary metrics consistent? Support the primary?
- Guardrail concerns? Anything get worse?
- Segment differences? Mobile vs. desktop? New vs. returning?
Interpreting Results
| Result | Conclusion |
|---|---|
| Significant winner | Implement variant |
| Significant loser | Keep control, learn why |
| No significant difference | Need more traffic or bolder test |
| Mixed signals | Dig deeper, maybe segment |
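The table above can be sketched as a small decision helper driven by the confidence interval of the primary metric's lift. Treating "mixed signals" as a significant primary win alongside a degraded guardrail is an assumption made for this example; the source does not define it that precisely. `interpret_result` is a hypothetical name.

```python
def interpret_result(ci_low, ci_high, guardrails_ok=True):
    """Map a confidence interval for (variant - control) to a conclusion.

    ci_low, ci_high: bounds of the interval for the primary metric difference.
    guardrails_ok: False if any guardrail metric got significantly worse.
    """
    if ci_low > 0 and not guardrails_ok:
        # Assumed reading of "mixed signals": primary improves, guardrail degrades.
        return "Mixed signals: dig deeper, maybe segment"
    if ci_low > 0:
        return "Significant winner: implement variant"
    if ci_high < 0:
        return "Significant loser: keep control, learn why"
    return "No significant difference: need more traffic or bolder test"


if __name__ == "__main__":
    print(interpret_result(0.004, 0.016))
    print(interpret_result(-0.003, 0.009))
    print(interpret_result(0.004, 0.016, guardrails_ok=False))
```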
Documentation
Document every test with:
- Hypothesis
- Variants (with screenshots)
- Results (sample, metrics, significance)
- Decision and learnings
For templates: See references/test-templates.md
Growth Experimentation Program
Individual tests are valuable. A continuous experimentation program is a compounding asset. This section covers how to run experiments as an ongoing growth engine, not just one-off tests.
The Experiment Loop
1. Generate hypotheses (from data, research, competitors, customer feedback)
2. Prioritize with ICE scoring
3. Design and run the test
4. Analyze results with statistical rigor
5. Promote winners to a playbook
6. Generate new hypotheses from learnings
→ Repeat
Hypothesis Generation
Feed your experiment backlog from multiple sources:
| Source | What to Look For |
|---|---|
| Analytics | Drop-off points, low-converting pages, underperforming segments |
| Customer research | Pain points, confusion, unmet expectations |
| Competitor analysis | Features, messaging, or UX patterns they use that you don't |
| Support tickets | Recurring questions or complaints about conversion flows |
| Heatmaps/recordings | Where users hesitate, rage-click, or abandon |
| Past experiments | "Significant loser" tests often reveal new angles to try |
ICE Prioritization
Score each hypothesis 1-10 on three dimensions:
| Dimension | Question |
|---|---|
| Impact | If this works, how much will it move the primary metric? |
| Confidence | How sure are we this will work? (Based on data, not gut.) |
| Ease | How fast and cheap can we ship and measure this? |
ICE Score = (Impact + Confidence + Ease) / 3
Run highest-scoring experiments first. Re-score monthly as context changes.
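The scoring and ordering can be sketched in a few lines. The `Hypothesis` class, the `prioritize` helper, and the backlog entries are hypothetical examples; only the formula comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    impact: int      # 1-10: how much the primary metric moves if it works
    confidence: int  # 1-10: how sure we are, based on data
    ease: int        # 1-10: how fast and cheap it is to ship and measure

    @property
    def ice(self) -> float:
        """ICE Score = (Impact + Confidence + Ease) / 3."""
        return (self.impact + self.confidence + self.ease) / 3


def prioritize(backlog):
    """Return hypotheses sorted from highest to lowest ICE score."""
    return sorted(backlog, key=lambda h: h.ice, reverse=True)


if __name__ == "__main__":
    backlog = [
        Hypothesis("Larger CTA on signup page", impact=6, confidence=7, ease=9),
        Hypothesis("Rebuild pricing page layout", impact=9, confidence=5, ease=3),
        Hypothesis("Add social proof near pricing CTA", impact=7, confidence=6, ease=8),
    ]
    for h in prioritize(backlog):
        print(f"{h.ice:.1f}  {h.name}")
```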
Experiment Velocity
Track your experimentation rate as a leading indicator of growth:
| Metric | Target |
|---|---|
| Experiments launched per month | 4-8 for most teams |
| Win rate | 20-30% is common for mature programs (sustained higher rates may indicate conservative hypotheses) |
| Average test duration | 2-4 weeks |
| Backlog depth | 20+ hypotheses queued |
| Cumulative lift | Compound gains from all winners |
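Win rate and cumulative lift are simple to compute once results are logged. The sketch below compounds the relative lifts of winning tests, which assumes the wins act on the same metric and stack multiplicatively; in practice effects can overlap or fade, so treat the result as an upper-end estimate. Function names are hypothetical.

```python
import math


def win_rate(wins, concluded):
    """Share of concluded experiments that produced a significant winner."""
    return wins / concluded if concluded else 0.0


def cumulative_lift(lifts):
    """Compound relative lifts, e.g. [0.10, 0.05] means +10% then +5%.

    Assumes each lift applies on top of the previous ones (multiplicative).
    """
    return math.prod(1 + lift for lift in lifts) - 1


if __name__ == "__main__":
    print(f"win rate: {win_rate(wins=3, concluded=12):.0%}")
    print(f"cumulative lift: {cumulative_lift([0.10, 0.05]):.1%}")
```

Two wins of +10% and +5% compound to about 15.5%, slightly more than their sum.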
The Experiment Playbook
When a test wins, don't just implement it — document the pattern:
[Experiment Name]
Date: [date]
Hypothesis: [the hypothesis]
Sample size: [n per variant]
Result: [winner/loser/inconclusive] — [primary metric] changed by [X%] (95% CI: [range], p=[value])
Guardrails: [any guardrail metrics and their outcomes]
Segment deltas: [notable differences by device, segment, or cohort]
Why it worked/failed: [analysis]
Pattern: [the reusable insight — e.g., "social proof near pricing CTAs increases plan selection"]
Apply to: [other pages/flows where this pattern might work]
Status: [implemented / parked / needs follow-up test]
Over time, your playbook becomes a library of proven growth patterns specific to your product and audience.
Experiment Cadence
Weekly (30 min): Review running experiments for technical issues and guardrail metrics. Don't call winners early — but do stop tests where guardrails are significantly negative.
Bi-weekly: Conclude completed experiments. Analyze results, update playbook, launch next experiment from backlog.
Monthly (1 hour): Review experiment velocity, win rate, cumulative lift. Replenish hypothesis backlog. Re-prioritize with ICE.
Quarterly: Audit the playbook. Which patterns have been applied broadly? Which winning patterns haven't been scaled yet? What areas of the funnel are under-tested?
Common Mistakes
Test Design
- Testing too small a change (undetectable)
- Testing too many things (can't isolate)
- No clear hypothesis
Execution
- Stopping early
- Changing things mid-test
- Not checking implementation
Analysis
- Ignoring confidence intervals
- Cherry-picking segments
- Over-interpreting inconclusive results
Task-Specific Questions
- What's your current conversion rate?
- How much traffic does this page get?
- What change are you considering and why?
- What's the smallest improvement worth detecting?
- What tools do you have for testing?
- Have you tested this area before?
Related Skills
- cro: For generating test ideas based on CRO principles
- analytics: For setting up test measurement
- copywriting: For creating variant copy