ab-testing

A/B Test Setup

You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.

Initial Assessment

Check for product marketing context first: If .agents/product-marketing.md exists (or .claude/product-marketing.md, or the legacy product-marketing-context.md filename in older setups), read it before asking questions. Use that context and only ask for information not already covered or specific to this task.

Before designing a test, understand:

Test Context - What are you trying to improve? What change are you considering?
Current State - Baseline conversion rate? Current traffic volume?
Constraints - Technical complexity? Timeline? Tools available?

Core Principles

1. Start with a Hypothesis

Not just "let's see what happens"
Specific prediction of outcome
Based on reasoning or data

2. Test One Thing

Single variable per test
Otherwise you don't know what worked

3. Statistical Rigor

Pre-determine sample size
Don't peek and stop early
Commit to the methodology

4. Measure What Matters

Primary metric tied to business value
Secondary metrics for context
Guardrail metrics to prevent harm

Hypothesis Framework

Structure

Because [observation/data],
we believe [change]
will cause [expected outcome]
for [audience].
We'll know this is true when [metrics].

Example

Weak: "Changing the button color might increase clicks."

Strong: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."

Test Types

Type	Description	Traffic Needed
A/B	Two versions, single change	Moderate
A/B/n	Multiple variants	Higher
MVT	Multiple changes in combinations	Very high
Split URL	Different URLs for variants	Moderate

Sample Size

Quick Reference

Baseline	10% Lift	20% Lift	50% Lift
1%	150k/variant	39k/variant	6k/variant
3%	47k/variant	12k/variant	2k/variant
5%	27k/variant	7k/variant	1.2k/variant
10%	12k/variant	3k/variant	550/variant

Calculators:

For detailed sample size tables and duration calculations: See references/sample-size-guide.md
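
The quick-reference figures can be approximated with the standard two-proportion sample size formula. The sketch below assumes a two-sided test at 5% significance with 80% power; the table does not state its own assumptions, so expect results in the same ballpark rather than exact matches. The function names and the daily_visitors parameter are illustrative, not taken from the referenced guide.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided two-proportion z-test.

    alpha and power are assumed defaults, not values stated in the table above.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def days_to_run(n_per_variant, variants, daily_visitors):
    """Days needed if daily_visitors are split evenly across all variants."""
    return math.ceil(n_per_variant * variants / daily_visitors)


n = sample_size_per_variant(baseline=0.05, relative_lift=0.20)
print(n, "visitors per variant")
print(days_to_run(n, variants=2, daily_visitors=1000), "days at 1,000 visitors/day")
```

Halving the detectable lift roughly quadruples the required sample, which is why small changes on low-traffic pages are often not worth testing.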

Metrics Selection

Primary Metric

Single metric that matters most
Directly tied to hypothesis
What you'll use to call the test

Secondary Metrics

Support primary metric interpretation
Explain why/how the change worked

Guardrail Metrics

Things that shouldn't get worse
Stop test if significantly negative

Example: Pricing Page Test

Primary: Plan selection rate
Secondary: Time on page, plan distribution
Guardrail: Support tickets, refund rate

Designing Variants

What to Vary

Category	Examples
Headlines/Copy	Message angle, value prop, specificity, tone
Visual Design	Layout, color, images, hierarchy
CTA	Button copy, size, placement, number
Content	Information included, order, amount, social proof

Best Practices

Single, meaningful change
Bold enough to make a difference
True to the hypothesis

Traffic Allocation

Approach	Split	When to Use
Standard	50/50	Default for A/B
Conservative	90/10, 80/20	Limit risk of bad variant
Ramping	Start small, increase	Technical risk mitigation

Considerations:

Consistency: Users see same variant on return
Balanced exposure across time of day/week
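
One common way to keep assignment consistent across visits is to hash a stable user identifier instead of drawing a random number on each request. The sketch below is a minimal illustration under that assumption; assign_variant, the experiment key, and the weights format are hypothetical and do not reflect any particular tool's API.

```python
import hashlib


def assign_variant(user_id, experiment, weights):
    """Map a user to a variant deterministically, honoring the traffic split.

    weights: dict of variant name -> share of traffic, summing to 1.0.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    cumulative = 0.0
    for name, share in weights.items():
        cumulative += share
        if bucket < cumulative:
            return name
    return name  # guards against floating-point rounding in the weights


# Conservative 90/10 split: the same user always lands in the same variant.
split = {"control": 0.9, "variant": 0.1}
print(assign_variant("user-123", "pricing-page-cta", split))
print(assign_variant("user-123", "pricing-page-cta", split))
```

Including the experiment name in the hash keeps assignments independent between experiments, so a user in the variant of one test is not systematically in the variant of the next.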

Implementation

Client-Side

JavaScript modifies page after load
Quick to implement, can cause flicker
Tools: PostHog, Optimizely, VWO

Server-Side

Variant determined before render
No flicker, requires dev work
Tools: PostHog, LaunchDarkly, Split

Running the Test

Pre-Launch Checklist

During the Test

Do:

Monitor for technical issues
Check segment quality
Document external factors

Avoid:

Peeking at results and stopping early
Making changes to variants
Adding traffic from new sources

The Peeking Problem

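Checking results before reaching the planned sample size and stopping as soon as they look significant inflates false positives, because every interim look gives random noise another chance to cross the threshold. Pre-commit to a sample size and call the test only once it is reached.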
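
The effect can be seen in a small simulation of A/A tests, where both arms share the same true conversion rate and any "winner" is a false positive. This is an illustrative sketch only: the 10% rate, 2,000 users per arm, 10 interim looks, and the fixed seed are arbitrary assumptions.

```python
import math
import random
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(rate=0.10, n_per_arm=2000, looks=10, runs=500, seed=7):
    """Simulate A/A tests (no real difference) and compare two stopping rules."""
    rng = random.Random(seed)
    step = n_per_arm // looks
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        stopped_early = False
        for look in range(1, looks + 1):
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            n = look * step
            if p_value(conv_a, n, conv_b, n) < 0.05:
                stopped_early = True  # a peeker would have declared a winner here
        peeking_hits += stopped_early
        # Fixed-horizon rule: only the final look counts.
        fixed_hits += p_value(conv_a, n, conv_b, n) < 0.05
    return peeking_hits / runs, fixed_hits / runs


peeking, fixed = false_positive_rates()
print(f"False positives with peeking: {peeking:.1%}, fixed sample size: {fixed:.1%}")
```

The fixed-horizon rate should sit near the nominal 5%, while the peeking rate comes out noticeably higher even though nothing differs between the arms.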

Analyzing Results

Statistical Significance

95% confidence = p-value < 0.05
Means that if the change had no real effect, a difference at least this large would show up less than 5% of the time
Not a guarantee, just a threshold
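
For conversion-rate metrics, the p-value and confidence interval are often computed with a two-proportion z-test. The sketch below shows one such calculation using only the standard library; it assumes large samples and a single pre-planned analysis, and compare_variants is an illustrative name rather than any testing tool's API.

```python
import math
from statistics import NormalDist


def compare_variants(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return the absolute lift, its confidence interval, and a two-sided p-value."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a
    # Pooled standard error for the hypothesis test (assumes no true difference).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p = 2 * (1 - NormalDist().cdf(abs(diff / se_pooled)))
    # Unpooled standard error for the interval around the observed difference.
    se = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, (diff - z * se, diff + z * se), p


# Hypothetical counts: 500/10,000 conversions for control, 600/10,000 for the variant.
diff, (low, high), p = compare_variants(500, 10_000, 600, 10_000)
print(f"lift: {diff:.2%} (95% CI {low:.2%} to {high:.2%}), p = {p:.4f}")
```

Reading the interval alongside the p-value shows how large or small the true effect could plausibly be, which matters when comparing the result to the minimum detectable effect.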

Analysis Checklist

Reach sample size? If not, result is preliminary
Statistically significant? Check confidence intervals
Effect size meaningful? Compare to MDE, project impact
Secondary metrics consistent? Support the primary?
Guardrail concerns? Anything get worse?
Segment differences? Mobile vs. desktop? New vs. returning?

Interpreting Results

Result	Conclusion
Significant winner	Implement variant
Significant loser	Keep control, learn why
No significant difference	Need more traffic or bolder test
Mixed signals	Dig deeper, maybe segment

Documentation

Document every test with:

Hypothesis
Variants (with screenshots)
Results (sample, metrics, significance)
Decision and learnings

For templates: See references/test-templates.md

Growth Experimentation Program

Individual tests are valuable. A continuous experimentation program is a compounding asset. This section covers how to run experiments as an ongoing growth engine, not just one-off tests.

The Experiment Loop

1. Generate hypotheses (from data, research, competitors, customer feedback)
2. Prioritize with ICE scoring
3. Design and run the test
4. Analyze results with statistical rigor
5. Promote winners to a playbook
6. Generate new hypotheses from learnings
→ Repeat

Hypothesis Generation

Feed your experiment backlog from multiple sources:

Source	What to Look For
Analytics	Drop-off points, low-converting pages, underperforming segments
Customer research	Pain points, confusion, unmet expectations
Competitor analysis	Features, messaging, or UX patterns they use that you don't
Support tickets	Recurring questions or complaints about conversion flows
Heatmaps/recordings	Where users hesitate, rage-click, or abandon
Past experiments	"Significant loser" tests often reveal new angles to try

ICE Prioritization

Score each hypothesis 1-10 on three dimensions:

Dimension	Question
Impact	If this works, how much will it move the primary metric?
Confidence	How sure are we this will work? (Based on data, not gut.)
Ease	How fast and cheap can we ship and measure this?

ICE Score = (Impact + Confidence + Ease) / 3

Run highest-scoring experiments first. Re-score monthly as context changes.
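
As a small illustration of the scoring formula above, a backlog can be ranked like this. The hypotheses and their scores are made-up examples, and the Hypothesis class is a hypothetical structure for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    impact: int      # 1-10
    confidence: int  # 1-10
    ease: int        # 1-10

    @property
    def ice(self):
        # ICE Score = (Impact + Confidence + Ease) / 3
        return (self.impact + self.confidence + self.ease) / 3


def prioritize(backlog):
    """Return hypotheses sorted from highest to lowest ICE score."""
    return sorted(backlog, key=lambda h: h.ice, reverse=True)


backlog = [
    Hypothesis("Larger CTA on signup page", impact=6, confidence=7, ease=9),
    Hypothesis("Rewrite pricing page headline", impact=8, confidence=5, ease=8),
    Hypothesis("Rebuild onboarding flow", impact=9, confidence=4, ease=2),
]
for h in prioritize(backlog):
    print(f"{h.ice:.1f}  {h.name}")
```

Because the three dimensions are averaged with equal weight, a high-impact idea that is hard to ship can rank below a modest but cheap one; re-scoring as context changes keeps the order honest.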

Experiment Velocity

Track your experimentation rate as a leading indicator of growth:

Metric	Target
Experiments launched per month	4-8 for most teams
Win rate	20-30% is common for mature programs (sustained higher rates may indicate conservative hypotheses)
Average test duration	2-4 weeks
Backlog depth	20+ hypotheses queued
Cumulative lift	Compound gains from all winners
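
Cumulative lift compounds rather than adds. A rough sketch of the arithmetic, assuming each winner's relative lift applies to the same metric and the effects multiply independently (real tests can overlap or fade, so treat the result as an upper-bound estimate):

```python
def cumulative_lift(lifts):
    """Compound a list of relative lifts (0.05 means +5%) into one overall lift."""
    total = 1.0
    for lift in lifts:
        total *= 1 + lift
    return total - 1


# Three hypothetical winners: +5%, +8%, and +3%.
print(f"{cumulative_lift([0.05, 0.08, 0.03]):.1%}")
```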

The Experiment Playbook

When a test wins, don't just implement it — document the pattern:

[Experiment Name]

Date: [date]
Hypothesis: [the hypothesis]
Sample size: [n per variant]
Result: [winner/loser/inconclusive]; [primary metric] changed by [X%] (95% CI: [range], p=[value])
Guardrails: [any guardrail metrics and their outcomes]
Segment deltas: [notable differences by device, segment, or cohort]
Why it worked/failed: [analysis]
Pattern: [the reusable insight, e.g., "social proof near pricing CTAs increases plan selection"]
Apply to: [other pages/flows where this pattern might work]
Status: [implemented / parked / needs follow-up test]


Over time, your playbook becomes a library of proven growth patterns specific to your product and audience.

Experiment Cadence

Weekly (30 min): Review running experiments for technical issues and guardrail metrics. Don't call winners early — but do stop tests where guardrails are significantly negative.

Bi-weekly: Conclude completed experiments. Analyze results, update playbook, launch next experiment from backlog.

Monthly (1 hour): Review experiment velocity, win rate, cumulative lift. Replenish hypothesis backlog. Re-prioritize with ICE.

Quarterly: Audit the playbook. Which patterns have been applied broadly? Which winning patterns haven't been scaled yet? What areas of the funnel are under-tested?

Common Mistakes

Test Design

Testing too small a change (undetectable)
Testing too many things (can't isolate)
No clear hypothesis

Execution

Stopping early
Changing things mid-test
Not checking implementation

Analysis

Ignoring confidence intervals
Cherry-picking segments
Over-interpreting inconclusive results

Task-Specific Questions

What's your current conversion rate?
How much traffic does this page get?
What change are you considering and why?
What's the smallest improvement worth detecting?
What tools do you have for testing?
Have you tested this area before?
