experiment-designer-tracker

Experiment Designer + Tracker

Build a structured experimentation program from real account performance signals — detect gaps, translate them into testable hypotheses, score with ICE, and produce a prioritized roadmap.

Do not use this skill to launch many overlapping tests or make account changes.

Prerequisites

Glued MCP: Connect your Glued workspace at glued.me/mcp

Required MCP tools

- `list_workspaces`
- `query_ad_report`

Inputs

- `workspace_id`: optional; resolved via `list_workspaces` if missing
- `date_range`: default `last_30_days`
- `goal`: `roas` (default) or `cpa`

Outputs

All outputs go to a timestamped directory:

output/<YYYYMMDD>_experiments/

- `experiments_backlog.md`: full backlog with hypotheses, ICE scores, and roadmap
- `experiments_backlog.json`: structured data for programmatic use
- `postmortem_template.md`: reusable template for tracking experiment outcomes
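
A minimal sketch of how the timestamped directory could be derived, assuming the local date and Python's standard library. The helper name `experiments_output_dir` and the `OUTPUT_FILES` constant are illustrative and not part of the skill.

```python
from datetime import date
from pathlib import Path
from typing import Optional

# The three files this skill writes into the output directory.
OUTPUT_FILES = (
    "experiments_backlog.md",
    "experiments_backlog.json",
    "postmortem_template.md",
)


def experiments_output_dir(base: str = "output", today: Optional[date] = None) -> Path:
    """Return output/<YYYYMMDD>_experiments for the given (or current) date."""
    today = today or date.today()
    return Path(base) / f"{today:%Y%m%d}_experiments"
```

Calling `experiments_output_dir().mkdir(parents=True, exist_ok=True)` would then create the directory before any file is written.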

Procedure

Phase 0: Context Collection

1. If `workspace_id` is not provided, call `list_workspaces` and ask the user which workspace to analyze. If only one workspace exists, use it automatically.
2. Ask the user two questions using AskUserQuestion:
   - Question 1: "What is your primary optimization goal?"
     - Options: "ROAS" / "CPA"
   - Question 2: "Are there any active tests or recent changes we should know about?"
     - Options: "No active tests" / "Yes — I'll describe them"
     - If yes, note them to avoid proposing duplicate or conflicting tests.
3. Create the output directory `output/<YYYYMMDD>_experiments/`.
4. Tell the user: "Pulling 30-day performance data across campaigns and creatives. This takes about 30-60 seconds."
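
The workspace rule above can be sketched as a small decision function. This is an illustration only: `workspaces` stands in for whatever `list_workspaces` returns, and the `"id"` key is an assumed field name, not a documented one.

```python
from typing import Optional, Sequence


def resolve_workspace(workspace_id: Optional[str],
                      workspaces: Sequence[dict]) -> Optional[str]:
    """Return the workspace to analyze, or None when the user must choose."""
    if workspace_id:
        # An explicitly provided input always wins.
        return workspace_id
    if len(workspaces) == 1:
        # Exactly one workspace: use it automatically.
        return workspaces[0]["id"]  # "id" is an assumed key
    # Zero or several workspaces: fall back to asking the user.
    return None
```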

Phase 1: Data Pull

Run these API calls in parallel:

1a. Campaign-level performance:

query_ad_report:
  workspace_id: <id>
  date_range: <date_range>
  group_by: campaign
  metrics: [spend, impressions, clicks, ctr, cpc, cpm, roas, revenue, conversions, cpa]
  sort_metric: spend
  sort_direction: desc
  limit: 50

1b. Creative-level performance:

query_ad_report:
  workspace_id: <id>
  date_range: <date_range>
  group_by: creative
  metrics: [spend, impressions, clicks, ctr, cpc, cpm, roas, revenue, conversions, cpa]
  sort_metric: spend
  sort_direction: desc
  limit: 20

1c. Ad set-level performance (for audience/placement signals):

query_ad_report:
  workspace_id: <id>
  date_range: <date_range>
  group_by: ad_set
  metrics: [spend, impressions, clicks, ctr, cpm, roas, cpa]
  sort_metric: spend
  sort_direction: desc
  limit: 30

If any call errors, continue with available data and note the gap.
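
One way to run the three calls concurrently while tolerating failures is sketched below. The `query` callable is a hypothetical stand-in for however the `query_ad_report` tool is invoked in your environment; only the parameters listed above are taken from this document.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

COMMON = {"sort_metric": "spend", "sort_direction": "desc"}
FULL_METRICS = ["spend", "impressions", "clicks", "ctr", "cpc", "cpm",
                "roas", "revenue", "conversions", "cpa"]
REPORTS = {
    "campaign": {"group_by": "campaign", "metrics": FULL_METRICS, "limit": 50},
    "creative": {"group_by": "creative", "metrics": FULL_METRICS, "limit": 20},
    "ad_set": {"group_by": "ad_set", "limit": 30,
               "metrics": ["spend", "impressions", "clicks", "ctr",
                           "cpm", "roas", "cpa"]},
}


def pull_reports(query: Callable[..., list], workspace_id: str, date_range: str):
    """Run the three report queries in parallel; return (results, gaps)."""
    results, gaps = {}, {}
    with ThreadPoolExecutor(max_workers=len(REPORTS)) as pool:
        futures = {
            level: pool.submit(query, workspace_id=workspace_id,
                               date_range=date_range, **COMMON, **params)
            for level, params in REPORTS.items()
        }
        for level, future in futures.items():
            try:
                results[level] = future.result()
            except Exception as exc:
                # Continue with available data and record the gap.
                gaps[level] = str(exc)
    return results, gaps
```

The `gaps` mapping gives the later phases something concrete to cite when noting which level of data was unavailable.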

Phase 2: Signal Detection

Scan the data for these signal categories. For each signal found, note the specific campaigns/creatives/numbers involved.

Structural signals:

- Bid strategy gap: Compare ROAS across bid strategies (LOWEST_COST vs COST_CAP vs BID_CAP). Flag if one strategy consistently outperforms.
- Spend concentration: Flag if top 2 campaigns account for >50% of total spend.
- Sub-breakeven campaigns: Any campaign with ROAS < 1.0 and spend > $1,000.
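
As an illustration, the spend concentration and sub-breakeven checks reduce to a few lines. The row shape (dicts with `name`, `spend`, and `roas` keys) is an assumption about how the campaign report is represented, and the function names are hypothetical.

```python
def spend_concentration(campaigns, top_n=2, share_limit=0.50):
    """Return the top-N share of spend if it exceeds the limit, else None."""
    total = sum(c["spend"] for c in campaigns)
    if total <= 0:
        return None
    top = sorted((c["spend"] for c in campaigns), reverse=True)[:top_n]
    share = sum(top) / total
    return share if share > share_limit else None


def sub_breakeven(campaigns, roas_floor=1.0, min_spend=1000.0):
    """Names of campaigns with ROAS < 1.0 and spend > $1,000."""
    return [c["name"] for c in campaigns
            if c["roas"] < roas_floor and c["spend"] > min_spend]


campaigns = [
    {"name": "A", "spend": 6000.0, "roas": 2.1},
    {"name": "B", "spend": 3000.0, "roas": 0.8},
    {"name": "C", "spend": 1000.0, "roas": 0.5},
]
print(spend_concentration(campaigns))  # top 2 hold 9,000 of 10,000 in spend
print(sub_breakeven(campaigns))        # only B clears the $1,000 spend bar
```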

Creative signals:

- Format imbalance: Compare image vs video ROAS/CTR. Flag if one format is underrepresented but outperforming.
- Stale creatives: Seasonal hooks (Valentine's, New Year, BFCM, etc.) still running past their relevance window.
- CTR outliers: Creatives with CTR > 3x the account median. Note what distinguishes them.
- Dynamic creative performance: If DCA/dynamic creatives exist, compare their ROAS to static creatives.
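
The CTR outlier rule is easy to express with the standard library. This sketch assumes the account median is taken over the creatives under analysis and that each row carries `name` and `ctr` keys; both are assumptions, not part of the skill.

```python
from statistics import median


def ctr_outliers(creatives, multiple=3.0):
    """Creatives whose CTR exceeds `multiple` times the median CTR."""
    if not creatives:
        return []
    threshold = multiple * median(c["ctr"] for c in creatives)
    return [c["name"] for c in creatives if c["ctr"] > threshold]
```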

Audience/placement signals:

- CPM spread: If max CPM > 5x min CPM across campaigns, flag that placement/audience efficiency varies widely.
- Localization gap: If geo-targeted campaigns exist, compare ROAS across countries.
- Ad set audience overlap: Multiple ad sets in the same campaign with similar targeting.

Operational signals:

- No testing framework: Multiple recent launches with no shared naming convention or evaluation cadence.
- Learning phase violations: Campaigns with frequent edits in their first 7 days.

Minimum spend threshold for signals:

min_spend = max(total_spend * 0.003, 500)

Ignore campaigns/creatives below this threshold for signal detection.
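
A short sketch of the threshold, assuming `total_spend` is the account-wide spend for the period and each row has a `spend` key. With $100,000 total spend, 0.3% is $300, so the $500 floor applies; with $400,000 it rises to about $1,200.

```python
def min_spend_threshold(total_spend: float) -> float:
    """min_spend = max(total_spend * 0.003, 500)."""
    return max(total_spend * 0.003, 500)


def eligible(rows, total_spend: float):
    """Drop rows whose spend falls below the signal-detection threshold."""
    floor = min_spend_threshold(total_spend)
    return [r for r in rows if r["spend"] >= floor]
```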

Phase 3: Console Output — Signal Summary

Before writing files, display the signal summary directly to the user:

Signal Scan Complete

Account: <workspace_name> | Period: <date_range> | Goal: <goal>
Total spend: $X | Blended ROAS: X.Xx | Active campaigns: N

Signals Detected: N

| # | Signal | Category | Evidence | Impact |
|---|--------|----------|----------|--------|
| 1 | ... | Structural | ... | High/Med/Low |

Quick Wins (no test needed)

[list any immediate actions like pausing sub-breakeven campaigns]


This gives the user immediate value before the full backlog is built.

Phase 4: Hypothesis Generation

For each signal, generate a testable hypothesis using this structure:

- Observation: What the data shows (with specific numbers)
- Hypothesis: "If we [specific change], then [primary metric] will [improve by X] because [mechanism]"
- Change to test: Exact action to take
- Primary metric: Single metric that determines success
- Guardrail metric: Metric that must not degrade beyond a threshold
- Minimum runtime: Days needed (at least 7 days; bid strategy tests need 14 days)
- Sample size check: Based on current daily conversions, can we detect a meaningful effect?

Sample size guidance:

- Need at least 50 conversions per variant to detect a 20% lift
- Need at least 100 conversions per variant to detect a 10% lift
- If current daily conversions < 5, flag as "low-volume — extend runtime to 21 days"
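
The guidance above can be turned into a rough runtime estimate. This is a sketch under two assumptions not stated in the skill: conversions split evenly across variants, and a test has two variants (control plus one change). The function and constant names are hypothetical.

```python
import math

# Conversions needed per variant, keyed by the lift (in percent) to detect.
REQUIRED_PER_VARIANT = {20: 50, 10: 100}


def runtime_check(daily_conversions: float, lift_pct: int = 20,
                  variants: int = 2, min_days: int = 7) -> dict:
    """Estimate runtime in days from current daily conversions."""
    if daily_conversions < 5:
        # Low-volume rule from the guidance: extend runtime to 21 days.
        return {"days": 21, "flag": "low-volume"}
    needed = REQUIRED_PER_VARIANT[lift_pct] * variants
    days = max(min_days, math.ceil(needed / daily_conversions))
    return {"days": days, "flag": None}
```

For example, at 10 conversions per day, detecting a 10% lift needs 200 conversions in total and therefore about 20 days, while at 40 per day a 20% lift fits inside the 7-day minimum.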

Quality checks:

- No two experiments should test the same variable
- Each experiment must have exactly one primary metric
- Guardrail thresholds must be specific (e.g., "CPA must not exceed $50" not "CPA must not increase")

Phase 5: ICE Scoring

Score each experiment on three dimensions (1-10 scale):

Impact — How much will this move the goal metric if it works?

- 9-10: >20% improvement on >$50K monthly spend
- 7-8: 10-20% improvement or affects $20-50K spend
- 5-6: 5-10% improvement or affects $5-20K spend
- 1-4: <5% improvement or affects <$5K spend

Confidence — How sure are we this will work?

- 9-10: Strong data signal + proven in similar accounts
- 7-8: Clear data signal, logical mechanism
- 5-6: Directional signal, some uncertainty
- 1-4: Speculative, weak signal

Ease — How easy is this to implement and measure?

- 9-10: Single setting change, no creative needed
- 7-8: Campaign duplication or budget reallocation
- 5-6: New creative or audience build needed
- 1-4: Complex setup, cross-team coordination, new tooling

ICE Total = Impact × Confidence × Ease
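
Because the three scores are multiplied, totals range from 1 to 1,000 and a single weak dimension pulls the total down sharply. A minimal sketch of the scoring and ranking follows; the `Experiment` class and `prioritize` helper are illustrative names, not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    id: str
    impact: int
    confidence: int
    ease: int

    def __post_init__(self):
        # Each dimension is scored on a 1-10 scale.
        for name in ("impact", "confidence", "ease"):
            score = getattr(self, name)
            if not 1 <= score <= 10:
                raise ValueError(f"{name} must be 1-10, got {score}")

    @property
    def ice_total(self) -> int:
        """ICE Total = Impact x Confidence x Ease."""
        return self.impact * self.confidence * self.ease


def prioritize(experiments):
    """Sort experiments from highest to lowest ICE total."""
    return sorted(experiments, key=lambda e: e.ice_total, reverse=True)
```

An experiment scored I(8) × C(7) × E(9) totals 504 and would rank above one scored I(9) × C(5) × E(4), which totals 180.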

Phase 6: Console Output — Experiment Backlog

Display the prioritized backlog directly to the user:

Experiment Backlog — N Experiments

Priority Roadmap

This Week (P1):

| # | Experiment | ICE | Primary Metric | Runtime |
|---|------------|-----|----------------|---------|

Next 2 Weeks (P2):

| # | Experiment | ICE | Primary Metric | Runtime |
|---|------------|-----|----------------|---------|

Backlog (P3):

| # | Experiment | ICE | Primary Metric | Runtime |
|---|------------|-----|----------------|---------|

Top 3 Experiments Detail

[For the top 3 by ICE score, show the full hypothesis, change, metrics, and sample size check]

Active Test Limits

- Max 3 concurrent tests per ad account
- No overlapping tests on same audience
- Minimum 7-day evaluation window

Phase 7: Write Output Files

**File 1: `<output_dir>/experiments_backlog.md`**

Experiment Backlog

Signal Summary

Overview of signals detected with evidence.

Experiment Backlog

EXP-001: [Title]

Observation: ...
Hypothesis: If we [change], then [metric] will [improve] because [reason]
Change to test: ...
Primary metric: ...
Guardrail metric: ... (threshold: ...)
Minimum runtime: N days
Sample size check: Current daily conversions: N → Feasible / Extend runtime / Low confidence
ICE Score: I(N) × C(N) × E(N) = Total
Priority: P1/P2/P3
Owner: TBD

(Repeat for all experiments)

Priority Roadmap

This Week (P1)

Table of P1 experiments with ICE, metric, runtime.

Next 2 Weeks (P2)

Table of P2 experiments.

Backlog (P3)

Table of P3 experiments.

Active Test Limits

- Max 3 concurrent tests per ad account
- No overlapping tests on same audience
- Minimum 7-day evaluation window before calling results
- Learning phase protection: no edits in first 7 days of any test


**File 2: `<output_dir>/experiments_backlog.json`**

Structured JSON containing:
- `report_date`, `workspace`, `goal`, `date_range`, `currency`
- `total_spend`, `blended_roas`, `active_campaigns`
- `spend_threshold` (computed min_spend)
- `signals` (array: id, category, description, evidence, impact)
- `experiments` (array: id, title, observation, hypothesis, change, primary_metric, guardrail_metric, guardrail_threshold, runtime_days, sample_size_feasible, ice_impact, ice_confidence, ice_ease, ice_total, priority, status)
- `roadmap` (this_week, next_2_weeks, backlog — arrays of experiment IDs)
- `quick_wins` (array: action, campaigns, expected_impact)
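
A sketch of how the JSON document might be assembled from the pieces above. The key names come from the field list; the mapping of priorities to roadmap buckets follows the roadmap headings (P1 this week, P2 next 2 weeks, P3 backlog). The `build_backlog` helper itself is hypothetical.

```python
import json

# Roadmap bucket for each priority label.
BUCKETS = {"P1": "this_week", "P2": "next_2_weeks", "P3": "backlog"}


def build_backlog(meta: dict, signals: list, experiments: list,
                  quick_wins: list) -> dict:
    """Assemble the backlog; roadmap buckets hold experiment IDs only."""
    roadmap = {name: [] for name in BUCKETS.values()}
    for exp in experiments:
        roadmap[BUCKETS[exp["priority"]]].append(exp["id"])
    return {**meta, "signals": signals, "experiments": experiments,
            "roadmap": roadmap, "quick_wins": quick_wins}


def to_json(backlog: dict) -> str:
    """Serialize with indentation so the file stays diff-friendly."""
    return json.dumps(backlog, indent=2, ensure_ascii=False)
```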

**File 3: `<output_dir>/postmortem_template.md`**

Reusable template for tracking experiment outcomes:

Experiment Postmortem: [EXP-ID] [Title]

Summary

| Field | Value |
|-------|-------|
| Experiment ID | |
| Hypothesis | |
| Start date | |
| End date | |
| Runtime | X days |
| Status | Won / Lost / Inconclusive |

Setup

Control: What was the baseline
Variant: What was changed
Audience: Who saw this
Budget allocation: Control X% / Variant X%

Results

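| Metric | Control | Variant | Delta | Significant? |
|--------|---------|---------|-------|--------------|
| Primary: [metric] | | | | Y/N |
| Guardrail: [metric] | | | | Y/N |
| Spend | | | | — |
| Conversions | | | | — |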

Statistical Validity

Total conversions (control + variant):
Confidence level:
Minimum detectable effect achieved: Yes / No

Analysis

What happened and why.

Decision

Scale winner / Kill variant / Extend test / Iterate.

Learnings

What this tells us for future tests.

Follow-up Experiments

What to test next based on these results.

Phase 8: Console Output — Wrap-Up

End with a brief summary:

Done

Files written to:

output/<YYYYMMDD>_experiments/

- `experiments_backlog.md`: N experiments, N signals
- `experiments_backlog.json`: structured data
- `postmortem_template.md`: reusable tracker

Immediate actions (no test needed):

[quick wins list]

First test to launch: EXP-XXX — [title]

---

ICE Scoring Reference

| Dimension | 9-10 | 7-8 | 5-6 | 1-4 |
|-----------|------|-----|-----|-----|
| Impact | >20% on >$50K/mo | 10-20% or $20-50K | 5-10% or $5-20K | <5% or <$5K |
| Confidence | Strong signal + proven | Clear signal | Directional | Speculative |
| Ease | Setting change | Campaign dupe | New creative needed | Complex/cross-team |

Guardrails

- Never pause campaigns or edit budgets automatically.
- Limit recommendations to max 3 concurrent tests per ad account.
- If data is sparse (< 10 campaigns or < $5K total spend), produce fewer high-confidence tests and note low data volume.
- If no meaningful signals appear, propose diagnostic experiments (e.g., "run a structured A/B to establish baselines").
- Flag any experiment where sample size is insufficient for statistical significance.
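
Two of these guardrails are mechanical enough to sketch in code: the cap of 3 concurrent tests and the sparse-data check. The `audience` key used to detect overlap is an assumed field that does not appear in the JSON schema, and both function names are hypothetical.

```python
MAX_CONCURRENT_TESTS = 3


def select_concurrent(ranked, max_tests=MAX_CONCURRENT_TESTS):
    """Walk experiments in ICE order, skipping audiences already under test."""
    chosen, audiences = [], set()
    for exp in ranked:
        if len(chosen) == max_tests:
            break
        if exp["audience"] in audiences:  # "audience" is an assumed key
            continue
        chosen.append(exp["id"])
        audiences.add(exp["audience"])
    return chosen


def is_sparse(campaign_count: int, total_spend: float) -> bool:
    """Sparse data: fewer than 10 campaigns or under $5K total spend."""
    return campaign_count < 10 or total_spend < 5000
```

These helpers only shape recommendations; consistent with the first guardrail, nothing here pauses a campaign or edits a budget.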
