experiment-designer-tracker

Experiment Designer + Tracker

Build a structured experimentation program from real account performance signals — detect gaps, translate them into testable hypotheses, score with ICE, and produce a prioritized roadmap.
Do not use this skill to launch many overlapping tests or make account changes.

Prerequisites

  • Glued MCP: connect your Glued workspace via glued.me/mcp

Required MCP tools

  • list_workspaces
  • query_ad_report

Inputs

  • workspace_id: optional; resolved via list_workspaces if missing
  • date_range: default last_30_days
  • goal: roas (default) or cpa

Outputs

All outputs go to a timestamped directory:
output/<YYYYMMDD>_experiments/
  • experiments_backlog.md: full backlog with hypotheses, ICE scores, roadmap
  • experiments_backlog.json: structured data for programmatic use
  • postmortem_template.md: reusable template for tracking experiment outcomes
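
As a rough illustration of the directory convention above, the following sketch builds the output path from a run date. The helper names (`output_dir_for`, `planned_paths`) are hypothetical and not part of the skill; only the `output/<YYYYMMDD>_experiments/` pattern and the three file names come from this document.

```python
from datetime import date
from pathlib import Path

# File names listed in the Outputs section.
OUTPUT_FILES = (
    "experiments_backlog.md",
    "experiments_backlog.json",
    "postmortem_template.md",
)

def output_dir_for(run_date: date, root: str = "output") -> Path:
    """Return <root>/<YYYYMMDD>_experiments for the given run date."""
    return Path(root) / f"{run_date:%Y%m%d}_experiments"

def planned_paths(run_date: date, root: str = "output") -> list:
    """Return the three output file paths for a run (nothing is created on disk)."""
    base = output_dir_for(run_date, root)
    return [base / name for name in OUTPUT_FILES]
```

Calling `output_dir_for(date.today()).mkdir(parents=True, exist_ok=True)` would then create the directory before any files are written.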

Procedure

Phase 0: Context Collection

  1. If workspace_id is not provided, call list_workspaces and ask the user which workspace to analyze. If only one workspace exists, use it automatically.
  2. Ask the user using AskUserQuestion (2 questions):
    Question 1: "What is your primary optimization goal?"
    • Options: "ROAS" / "CPA"
    Question 2: "Are there any active tests or recent changes we should know about?"
    • Options: "No active tests" / "Yes — I'll describe them"
    • If yes, note them to avoid proposing duplicate or conflicting tests.
  3. Create output directory: output/<YYYYMMDD>_experiments/
  4. Tell the user: "Pulling 30-day performance data across campaigns and creatives. This takes about 30-60 seconds."

Phase 1: Data Pull

Run these API calls in parallel:
1a. Campaign-level performance:
query_ad_report:
  workspace_id: <id>
  date_range: <date_range>
  group_by: campaign
  metrics: [spend, impressions, clicks, ctr, cpc, cpm, roas, revenue, conversions, cpa]
  sort_metric: spend
  sort_direction: desc
  limit: 50
1b. Creative-level performance:
query_ad_report:
  workspace_id: <id>
  date_range: <date_range>
  group_by: creative
  metrics: [spend, impressions, clicks, ctr, cpc, cpm, roas, revenue, conversions, cpa]
  sort_metric: spend
  sort_direction: desc
  limit: 20
1c. Ad set-level performance (for audience/placement signals):
query_ad_report:
  workspace_id: <id>
  date_range: <date_range>
  group_by: ad_set
  metrics: [spend, impressions, clicks, ctr, cpm, roas, cpa]
  sort_metric: spend
  sort_direction: desc
  limit: 30
If any call errors, continue with available data and note the gap.
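
One way to picture the parallel pull with error tolerance is the sketch below. It assumes a hypothetical `call_tool(name, params)` callable that invokes an MCP tool and returns its rows; that callable is an illustration only and is not an API of Glued or of any MCP client library. Metric lists are left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor

# group_by level -> row limit, taken from calls 1a-1c above.
REPORTS = {"campaign": 50, "creative": 20, "ad_set": 30}

def pull_reports(call_tool, workspace_id, date_range="last_30_days"):
    """Run the three report queries in parallel.

    Returns (data, gaps): rows per level that succeeded, and an error
    message per level that failed, so later phases can note the gap.
    """
    def one(level, limit):
        # `call_tool` is a hypothetical stand-in for the real tool invocation.
        return call_tool("query_ad_report", {
            "workspace_id": workspace_id,
            "date_range": date_range,
            "group_by": level,
            "sort_metric": "spend",
            "sort_direction": "desc",
            "limit": limit,
        })

    data, gaps = {}, {}
    with ThreadPoolExecutor(max_workers=len(REPORTS)) as pool:
        futures = {lvl: pool.submit(one, lvl, lim) for lvl, lim in REPORTS.items()}
        for level, future in futures.items():
            try:
                data[level] = future.result()
            except Exception as exc:  # continue with whatever succeeded
                gaps[level] = str(exc)
    return data, gaps
```

Collecting failures into `gaps` instead of raising keeps the run alive and gives the report a place to record which level is missing.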

Phase 2: Signal Detection

Scan the data for these signal categories. For each signal found, note the specific campaigns/creatives/numbers involved.
Structural signals:
  • Bid strategy gap: Compare ROAS across bid strategies (LOWEST_COST vs COST_CAP vs BID_CAP). Flag if one strategy consistently outperforms.
  • Spend concentration: Flag if top 2 campaigns account for >50% of total spend.
  • Sub-breakeven campaigns: Any campaign with ROAS < 1.0 and spend > $1,000.
Creative signals:
  • Format imbalance: Compare image vs video ROAS/CTR. Flag if one format is underrepresented but outperforming.
  • Stale creatives: Seasonal hooks (Valentine's, New Year, BFCM, etc.) still running past their relevance window.
  • CTR outliers: Creatives with CTR > 3x account median — what's different about them?
  • Dynamic creative performance: If DCA/dynamic creatives exist, compare their ROAS to static.
Audience/placement signals:
  • CPM spread: If max CPM > 5x min CPM across campaigns, placement/audience efficiency varies widely.
  • Localization gap: If geo-targeted campaigns exist, compare ROAS across countries.
  • Ad set audience overlap: Multiple ad sets in same campaign with similar targeting.
Operational signals:
  • No testing framework: Multiple recent launches with no shared naming convention or evaluation cadence.
  • Learning phase violations: Campaigns with frequent edits in first 7 days.
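
A few of the numeric checks above can be sketched as plain functions. This is a minimal illustration, assuming each row is a dict with `name`, `spend`, `roas`, `ctr`, and `cpm` keys (an assumed shape, not a documented response format). The CTR median is computed over the rows passed in, which only approximates the account median.

```python
from statistics import median

def structural_signals(campaigns):
    """Flag spend concentration (top 2 > 50%) and sub-breakeven campaigns."""
    signals = []
    total = sum(c["spend"] for c in campaigns)
    top2 = sum(sorted((c["spend"] for c in campaigns), reverse=True)[:2])
    if total > 0 and top2 / total > 0.50:
        signals.append(("spend_concentration", round(top2 / total, 2)))
    for c in campaigns:
        if c["roas"] < 1.0 and c["spend"] > 1000:
            signals.append(("sub_breakeven", c["name"]))
    return signals

def ctr_outliers(creatives):
    """Names of creatives whose CTR exceeds 3x the median CTR of the given rows."""
    if not creatives:
        return []
    mid = median(c["ctr"] for c in creatives)
    return [c["name"] for c in creatives if c["ctr"] > 3 * mid]

def cpm_spread_is_wide(campaigns):
    """True when the highest CPM is more than 5x the lowest non-zero CPM."""
    cpms = [c["cpm"] for c in campaigns if c["cpm"] > 0]
    return bool(cpms) and max(cpms) > 5 * min(cpms)
```

The qualitative signals (stale seasonal hooks, audience overlap, naming conventions) still need human or model judgment and are not covered here.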
Minimum spend threshold for signals:
min_spend = max(total_spend * 0.003, 500)
Ignore campaigns/creatives below this threshold for signal detection.
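
The threshold formula translates directly into code. The sketch below assumes rows are dicts with a `spend` key; the function names are illustrative.

```python
def min_spend_threshold(total_spend: float) -> float:
    """0.3% of total spend, with a floor of 500 (in account currency)."""
    return max(total_spend * 0.003, 500)

def eligible_rows(rows, total_spend):
    """Keep only campaigns/creatives at or above the threshold."""
    floor = min_spend_threshold(total_spend)
    return [r for r in rows if r["spend"] >= floor]
```

With this formula the 500 floor applies until total spend passes roughly 166,667 (500 / 0.003); above that, the 0.3% term takes over.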

Phase 3: Console Output — Signal Summary

Before writing files, display the signal summary directly to the user:

Signal Scan Complete

Account: <workspace_name> | Period: <date_range> | Goal: <goal>
Total spend: $X | Blended ROAS: X.Xx | Active campaigns: N

Signals Detected: N

| # | Signal | Category | Evidence | Impact |
|---|--------|----------|----------|--------|
| 1 | ... | Structural | ... | High/Med/Low |

Quick Wins (no test needed)

  • [list any immediate actions like pausing sub-breakeven campaigns]

This gives the user immediate value before the full backlog is built.

Phase 4: Hypothesis Generation

For each signal, generate a testable hypothesis using this structure:
  • Observation: What the data shows (with specific numbers)
  • Hypothesis: "If we [specific change], then [primary metric] will [improve by X] because [mechanism]"
  • Change to test: Exact action to take
  • Primary metric: Single metric that determines success
  • Guardrail metric: Metric that must not degrade beyond a threshold
  • Minimum runtime: Days needed (minimum 7 days, bid strategy tests need 14 days)
  • Sample size check: Based on current daily conversions, can we detect a meaningful effect?
Sample size guidance:
  • Need at least 50 conversions per variant to detect a 20% lift
  • Need at least 100 conversions per variant to detect a 10% lift
  • If current daily conversions < 5, flag as "low-volume — extend runtime to 21 days"
Quality checks:
  • No two experiments should test the same variable
  • Each experiment must have exactly one primary metric
  • Guardrail thresholds must be specific (e.g., "CPA must not exceed $50" not "CPA must not increase")
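
The runtime and sample size rules above can be combined into one feasibility check. This is a sketch under two stated assumptions: conversions are split evenly across variants, and the required runtime is the larger of the minimum runtime and the days needed to reach the per-variant conversion count. The function name and return shape are illustrative.

```python
import math

# Target lift (%) -> conversions needed per variant, from the guidance above.
REQUIRED_PER_VARIANT = {20: 50, 10: 100}

def sample_size_check(daily_conversions, target_lift_pct=20,
                      variants=2, bid_strategy_test=False):
    """Estimate the runtime needed and flag low-volume accounts."""
    min_days = 14 if bid_strategy_test else 7
    if daily_conversions <= 0:
        return {"runtime_days": None, "low_volume": True}
    low_volume = daily_conversions < 5
    if low_volume:
        min_days = max(min_days, 21)  # "extend runtime to 21 days"
    needed = REQUIRED_PER_VARIANT[target_lift_pct] * variants
    days_for_sample = math.ceil(needed / daily_conversions)
    return {"runtime_days": max(min_days, days_for_sample), "low_volume": low_volume}
```

For example, an account with 20 conversions per day reaches 100 total conversions in 5 days, so the 7-day minimum is the binding constraint; at 4 conversions per day and a 10% target lift, the sample requirement (50 days) dominates.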

Phase 5: ICE Scoring

Score each experiment on three dimensions (1-10 scale):
Impact — How much will this move the goal metric if it works?
  • 9-10: >20% improvement on >$50K monthly spend
  • 7-8: 10-20% improvement or affects $20-50K spend
  • 5-6: 5-10% improvement or affects $5-20K spend
  • 1-4: <5% improvement or affects <$5K spend
Confidence — How sure are we this will work?
  • 9-10: Strong data signal + proven in similar accounts
  • 7-8: Clear data signal, logical mechanism
  • 5-6: Directional signal, some uncertainty
  • 1-4: Speculative, weak signal
Ease — How easy is this to implement and measure?
  • 9-10: Single setting change, no creative needed
  • 7-8: Campaign duplication or budget reallocation
  • 5-6: New creative or audience build needed
  • 1-4: Complex setup, cross-team coordination, new tooling
ICE Total = Impact × Confidence × Ease
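
The scoring formula is a plain product, so totals range from 1 to 1,000. A minimal sketch, assuming each experiment is a dict with `impact`, `confidence`, and `ease` keys (illustrative names):

```python
def ice_total(impact: int, confidence: int, ease: int) -> int:
    """ICE Total = Impact x Confidence x Ease, each on a 1-10 scale."""
    for score in (impact, confidence, ease):
        if not 1 <= score <= 10:
            raise ValueError("ICE scores must be between 1 and 10")
    return impact * confidence * ease

def rank_by_ice(experiments):
    """Return experiments sorted by ICE total, highest first."""
    return sorted(
        experiments,
        key=lambda e: ice_total(e["impact"], e["confidence"], e["ease"]),
        reverse=True,
    )
```

Because the score is multiplicative, one low dimension pulls the total down sharply: 9 × 9 × 2 = 162 ranks below 6 × 6 × 6 = 216.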

Phase 6: Console Output — Experiment Backlog

Display the prioritized backlog directly to the user:

Experiment Backlog — N Experiments

Priority Roadmap

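This Week (P1):
| # | Experiment | ICE | Primary Metric | Runtime |
|---|------------|-----|----------------|---------|
Next 2 Weeks (P2):
| # | Experiment | ICE | Primary Metric | Runtime |
|---|------------|-----|----------------|---------|
Backlog (P3):
| # | Experiment | ICE | Primary Metric | Runtime |
|---|------------|-----|----------------|---------|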

Top 3 Experiments Detail

[For the top 3 by ICE score, show the full hypothesis, change, metrics, and sample size check]

Active Test Limits

  • Max 3 concurrent tests per ad account
  • No overlapping tests on same audience
  • Minimum 7-day evaluation window

Phase 7: Write Output Files

**File 1: `<output_dir>/experiments_backlog.md`**

Experiment Backlog

Date: <today> | Workspace: <name> | Goal: <roas|cpa>
Period analyzed: <date_range> | Currency: <from workspace>
Total spend: $X | Blended ROAS: X.Xx | Active campaigns: N

Signal Summary

Overview of signals detected with evidence.

Experiment Backlog

EXP-001: [Title]

  • Observation: ...
  • Hypothesis: If we [change], then [metric] will [improve] because [reason]
  • Change to test: ...
  • Primary metric: ...
  • Guardrail metric: ... (threshold: ...)
  • Minimum runtime: N days
  • Sample size check: Current daily conversions: N → Feasible / Extend runtime / Low confidence
  • ICE Score: I(N) × C(N) × E(N) = Total
  • Priority: P1/P2/P3
  • Owner: TBD
(Repeat for all experiments)

Priority Roadmap

This Week (P1)

Table of P1 experiments with ICE, metric, runtime.

Next 2 Weeks (P2)

Table of P2 experiments.

Backlog (P3)

Table of P3 experiments.

Active Test Limits

  • Max 3 concurrent tests per ad account
  • No overlapping tests on same audience
  • Minimum 7-day evaluation window before calling results
  • Learning phase protection: no edits in first 7 days of any test

**File 2: `<output_dir>/experiments_backlog.json`**

Structured JSON containing:
- `report_date`, `workspace`, `goal`, `date_range`, `currency`
- `total_spend`, `blended_roas`, `active_campaigns`
- `spend_threshold` (computed min_spend)
- `signals` (array: id, category, description, evidence, impact)
- `experiments` (array: id, title, observation, hypothesis, change, primary_metric, guardrail_metric, guardrail_threshold, runtime_days, sample_size_feasible, ice_impact, ice_confidence, ice_ease, ice_total, priority, status)
- `roadmap` (this_week, next_2_weeks, backlog — arrays of experiment IDs)
- `quick_wins` (array: action, campaigns, expected_impact)
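
To make the field list concrete, here is a skeleton of the JSON built as a Python dict. The keys follow the list above; every value is a made-up placeholder, and the `SIG-001` id format and the `"proposed"` status value are assumptions rather than something this document specifies.

```python
import json

# Placeholder values only; every number and name below is illustrative.
backlog = {
    "report_date": "2025-01-15",
    "workspace": "Example Workspace",
    "goal": "roas",
    "date_range": "last_30_days",
    "currency": "USD",
    "total_spend": 120000.0,
    "blended_roas": 2.4,
    "active_campaigns": 12,
    "spend_threshold": 500,  # max(120000 * 0.003, 500)
    "signals": [
        {"id": "SIG-001", "category": "structural", "description": "...",
         "evidence": "...", "impact": "high"},
    ],
    "experiments": [
        {"id": "EXP-001", "title": "...", "observation": "...",
         "hypothesis": "...", "change": "...",
         "primary_metric": "roas", "guardrail_metric": "cpa",
         "guardrail_threshold": "...", "runtime_days": 14,
         "sample_size_feasible": True,
         "ice_impact": 8, "ice_confidence": 7, "ice_ease": 9,
         "ice_total": 504,  # 8 * 7 * 9
         "priority": "P1", "status": "proposed"},
    ],
    "roadmap": {"this_week": ["EXP-001"], "next_2_weeks": [], "backlog": []},
    "quick_wins": [
        {"action": "...", "campaigns": ["..."], "expected_impact": "..."},
    ],
}

serialized = json.dumps(backlog, indent=2)
```

Keeping `roadmap` as arrays of experiment IDs, rather than embedding full experiment objects, avoids duplicating data between the two sections.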

**File 3: `<output_dir>/postmortem_template.md`**

Reusable template for tracking experiment outcomes:

Experiment Postmortem: [EXP-ID] [Title]

Summary

| Field | Value |
|-------|-------|
| Experiment ID | |
| Hypothesis | |
| Start date | |
| End date | |
| Runtime | X days |
| Status | Won / Lost / Inconclusive |

Setup

  • Control: What was the baseline
  • Variant: What was changed
  • Audience: Who saw this
  • Budget allocation: Control X% / Variant X%

Results

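| Metric | Control | Variant | Delta | Significant? |
|--------|---------|---------|-------|--------------|
| Primary: [metric] | | | | Y/N |
| Guardrail: [metric] | | | | Y/N |
| Spend | | | | |
| Conversions | | | | |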

Statistical Validity

  • Total conversions (control + variant):
  • Confidence level:
  • Minimum detectable effect achieved: Yes / No

Analysis

What happened and why.

Decision

Scale winner / Kill variant / Extend test / Iterate.

Learnings

What this tells us for future tests.

Follow-up Experiments

What to test next based on these results.

Phase 8: Console Output — Wrap-Up

End with a brief summary:

Done

Files written to:
output/<YYYYMMDD>_experiments/
  • experiments_backlog.md: N experiments, N signals
  • experiments_backlog.json: structured data
  • postmortem_template.md: reusable tracker
Immediate actions (no test needed):
  • [quick wins list]
First test to launch: EXP-XXX — [title]

---

ICE Scoring Reference

| Dimension | 9-10 | 7-8 | 5-6 | 1-4 |
|-----------|------|-----|-----|-----|
| Impact | >20% on >$50K/mo | 10-20% or $20-50K | 5-10% or $5-20K | <5% or <$5K |
| Confidence | Strong signal + proven | Clear signal | Directional | Speculative |
| Ease | Setting change | Campaign dupe | New creative needed | Complex/cross-team |

Guardrails

  • Never pause campaigns or edit budgets automatically.
  • Limit recommendations to max 3 concurrent tests per ad account.
  • If data is sparse (< 10 campaigns or < $5K total spend), produce fewer high-confidence tests and note low data volume.
  • If no meaningful signals appear, propose diagnostic experiments (e.g., "run a structured A/B to establish baselines").
  • Flag any experiment where sample size is insufficient for statistical significance.
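
Two of these guardrails are simple numeric checks and could be expressed as below. This is only a sketch: the function names are hypothetical, and applying the 3-test cap by truncating an already ranked list is an assumption about how the limit might be enforced.

```python
MAX_CONCURRENT_TESTS = 3  # per ad account, from the guardrails above

def is_sparse(campaign_count: int, total_spend: float) -> bool:
    """True when there are fewer than 10 campaigns or under $5K total spend."""
    return campaign_count < 10 or total_spend < 5000

def cap_concurrent(ranked_experiments):
    """Keep at most MAX_CONCURRENT_TESTS from a list already sorted by priority."""
    return ranked_experiments[:MAX_CONCURRENT_TESTS]
```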