experiment-designer-tracker
Experiment Designer + Tracker
Build a structured experimentation program from real account performance signals — detect gaps, translate them into testable hypotheses, score with ICE, and produce a prioritized roadmap.
Do not use this skill to launch many overlapping tests or make account changes.
Prerequisites
- Glued MCP: Connect your Glued workspace at glued.me/mcp
Required MCP tools
- `list_workspaces`
- `query_ad_report`
Inputs
- `workspace_id`: optional; resolved via `list_workspaces` if missing
- `date_range`: default `last_30_days`
- `goal`: `roas` (default) or `cpa`
Outputs
All outputs go to a timestamped directory: `output/<YYYYMMDD>_experiments/`
- `experiments_backlog.md`: full backlog with hypotheses, ICE scores, roadmap
- `experiments_backlog.json`: structured data for programmatic use
- `postmortem_template.md`: reusable template for tracking experiment outcomes
Procedure
Phase 0: Context Collection
- If `workspace_id` is not provided, call `list_workspaces` and ask the user which workspace to analyze. If only one workspace exists, use it automatically.
- Ask the user using AskUserQuestion (2 questions):
  - Question 1: "What is your primary optimization goal?"
    - Options: "ROAS" / "CPA"
  - Question 2: "Are there any active tests or recent changes we should know about?"
    - Options: "No active tests" / "Yes — I'll describe them"
    - If yes, note them to avoid proposing duplicate or conflicting tests.
- Create output directory: `output/<YYYYMMDD>_experiments/`
- Tell the user: "Pulling 30-day performance data across campaigns and creatives. This takes about 30-60 seconds."
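The output directory in Phase 0 follows the `output/<YYYYMMDD>_experiments/` pattern. A minimal Python sketch of building and creating that path is shown here; the function names `output_dir_for` and `make_output_dir` are illustrative and not part of the skill, and using the local calendar date is an assumption.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def output_dir_for(today: Optional[date] = None, base: Path = Path("output")) -> Path:
    """Return the output/<YYYYMMDD>_experiments path without touching the disk."""
    today = today or date.today()  # assumption: the local date is acceptable
    return base / f"{today:%Y%m%d}_experiments"


def make_output_dir(today: Optional[date] = None, base: Path = Path("output")) -> Path:
    """Create the directory if it does not exist yet and return it."""
    out_dir = output_dir_for(today, base)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

Keeping the path computation separate from the `mkdir` call makes the naming rule easy to check without writing to disk.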
Phase 1: Data Pull
Run these API calls in parallel:
1a. Campaign-level performance:
query_ad_report:
workspace_id: <id>
date_range: <date_range>
group_by: campaign
metrics: [spend, impressions, clicks, ctr, cpc, cpm, roas, revenue, conversions, cpa]
sort_metric: spend
sort_direction: desc
limit: 50
1b. Creative-level performance:
query_ad_report:
workspace_id: <id>
date_range: <date_range>
group_by: creative
metrics: [spend, impressions, clicks, ctr, cpc, cpm, roas, revenue, conversions, cpa]
sort_metric: spend
sort_direction: desc
limit: 20
1c. Ad set-level performance (for audience/placement signals):
query_ad_report:
workspace_id: <id>
date_range: <date_range>
group_by: ad_set
metrics: [spend, impressions, clicks, ctr, cpm, roas, cpa]
sort_metric: spend
sort_direction: desc
limit: 30
If any call errors, continue with available data and note the gap.
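One way to run the three Phase 1 calls in parallel while tolerating failures is sketched below in Python. The `call_tool(tool_name, arguments)` callable is a hypothetical stand-in for whatever MCP client is in use; only the tool name `query_ad_report` and its arguments come from the calls above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Tuple

METRICS_FULL = ["spend", "impressions", "clicks", "ctr", "cpc", "cpm",
                "roas", "revenue", "conversions", "cpa"]
METRICS_AD_SET = ["spend", "impressions", "clicks", "ctr", "cpm", "roas", "cpa"]


def build_requests(workspace_id: str, date_range: str) -> Dict[str, Dict[str, Any]]:
    """Arguments for calls 1a, 1b, and 1c, keyed by group_by level."""
    common = {"workspace_id": workspace_id, "date_range": date_range,
              "sort_metric": "spend", "sort_direction": "desc"}
    return {
        "campaign": {**common, "group_by": "campaign", "metrics": METRICS_FULL, "limit": 50},
        "creative": {**common, "group_by": "creative", "metrics": METRICS_FULL, "limit": 20},
        "ad_set": {**common, "group_by": "ad_set", "metrics": METRICS_AD_SET, "limit": 30},
    }


def pull_reports(call_tool: Callable[[str, Dict[str, Any]], Any],
                 workspace_id: str,
                 date_range: str = "last_30_days") -> Tuple[Dict[str, Any], List[str]]:
    """Run the three calls in parallel; keep partial results and record gaps."""
    requests = build_requests(workspace_id, date_range)
    results: Dict[str, Any] = {}
    gaps: List[str] = []
    with ThreadPoolExecutor(max_workers=len(requests)) as pool:
        futures = {level: pool.submit(call_tool, "query_ad_report", args)
                   for level, args in requests.items()}
        for level, future in futures.items():
            try:
                results[level] = future.result()
            except Exception as exc:  # a failed call becomes a noted gap, not a crash
                gaps.append(f"{level}: {exc}")
    return results, gaps
```

The returned `gaps` list can be carried into the signal summary so the user sees which level of data was missing.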
Phase 2: Signal Detection
Scan the data for these signal categories. For each signal found, note the specific campaigns/creatives/numbers involved.
Structural signals:
- Bid strategy gap: Compare ROAS across bid strategies (LOWEST_COST vs COST_CAP vs BID_CAP). Flag if one strategy consistently outperforms.
- Spend concentration: Flag if top 2 campaigns account for >50% of total spend.
- Sub-breakeven campaigns: Any campaign with ROAS < 1.0 and spend > $1,000.
Creative signals:
- Format imbalance: Compare image vs video ROAS/CTR. Flag if one format is underrepresented but outperforming.
- Stale creatives: Seasonal hooks (Valentine's, New Year, BFCM, etc.) still running past their relevance window.
- CTR outliers: Creatives with CTR > 3x account median — what's different about them?
- Dynamic creative performance: If DCA/dynamic creatives exist, compare their ROAS to static.
Audience/placement signals:
- CPM spread: If max CPM > 5x min CPM across campaigns, placement/audience efficiency varies widely.
- Localization gap: If geo-targeted campaigns exist, compare ROAS across countries.
- Ad set audience overlap: Multiple ad sets in same campaign with similar targeting.
Operational signals:
- No testing framework: Multiple recent launches with no shared naming convention or evaluation cadence.
- Learning phase violations: Campaigns with frequent edits in first 7 days.
Minimum spend threshold for signals:
`min_spend = max(total_spend * 0.003, 500)`
Ignore campaigns/creatives below this threshold for signal detection.
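A few of the numeric checks above can be sketched in Python as follows. This covers only spend concentration, sub-breakeven campaigns, CPM spread, and CTR outliers. The row shape (dicts with `spend`, `roas`, `cpm`, `ctr` keys) and the signal labels are assumptions for illustration, as is computing `total_spend` from the campaign rows.

```python
from statistics import median
from typing import Dict, List


def min_spend_threshold(total_spend: float) -> float:
    """Spend floor below which rows are ignored for signal detection."""
    return max(total_spend * 0.003, 500)


def detect_signals(campaigns: List[Dict[str, float]],
                   creatives: List[Dict[str, float]]) -> List[str]:
    total_spend = sum(c["spend"] for c in campaigns)
    floor = min_spend_threshold(total_spend)
    signals: List[str] = []

    # Spend concentration: top 2 campaigns hold more than 50% of total spend.
    top2 = sum(sorted((c["spend"] for c in campaigns), reverse=True)[:2])
    if total_spend > 0 and top2 / total_spend > 0.5:
        signals.append("spend_concentration")

    eligible = [c for c in campaigns if c["spend"] >= floor]

    # Sub-breakeven: ROAS < 1.0 with spend > $1,000.
    if any(c["roas"] < 1.0 and c["spend"] > 1000 for c in eligible):
        signals.append("sub_breakeven")

    # CPM spread: max CPM more than 5x min CPM across campaigns.
    cpms = [c["cpm"] for c in eligible if c["cpm"] > 0]
    if cpms and max(cpms) > 5 * min(cpms):
        signals.append("cpm_spread")

    # CTR outliers: any creative with CTR above 3x the median.
    ctrs = [c["ctr"] for c in creatives if c["spend"] >= floor]
    if ctrs and any(ctr > 3 * median(ctrs) for ctr in ctrs):
        signals.append("ctr_outlier")

    return signals
```

The remaining signals (stale seasonal hooks, audience overlap, learning phase edits) need metadata that the report calls above do not return, so they are left out of this sketch.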
Phase 3: Console Output — Signal Summary
Before writing files, display the signal summary directly to the user:
Signal Scan Complete
Account: <workspace_name> | Period: <date_range> | Goal: <goal>
Total spend: $X | Blended ROAS: X.Xx | Active campaigns: N
Signals Detected: N
| # | Signal | Category | Evidence | Impact |
|---|---|---|---|---|
| 1 | ... | Structural | ... | High/Med/Low |
Quick Wins (no test needed)
- [list any immediate actions like pausing sub-breakeven campaigns]
This gives the user immediate value before the full backlog is built.
Phase 4: Hypothesis Generation
For each signal, generate a testable hypothesis using this structure:
- Observation: What the data shows (with specific numbers)
- Hypothesis: "If we [specific change], then [primary metric] will [improve by X] because [mechanism]"
- Change to test: Exact action to take
- Primary metric: Single metric that determines success
- Guardrail metric: Metric that must not degrade beyond a threshold
- Minimum runtime: Days needed (minimum 7 days, bid strategy tests need 14 days)
- Sample size check: Based on current daily conversions, can we detect a meaningful effect?
Sample size guidance:
- Need at least 50 conversions per variant to detect a 20% lift
- Need at least 100 conversions per variant to detect a 10% lift
- If current daily conversions < 5, flag as "low-volume — extend runtime to 21 days"
Quality checks:
- No two experiments should test the same variable
- Each experiment must have exactly one primary metric
- Guardrail thresholds must be specific (e.g., "CPA must not exceed $50" not "CPA must not increase")
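The sample size guidance above reduces to simple arithmetic, sketched here in Python. The sketch assumes traffic is split evenly across variants and that conversions during the test continue at the current daily rate; neither assumption is stated in the guidance, and the function name and return keys are illustrative.

```python
import math
from typing import Dict

# Conversions needed per variant, keyed by target lift in percent (from the guidance above).
MIN_CONVERSIONS_PER_VARIANT = {20: 50, 10: 100}


def sample_size_check(daily_conversions: float, runtime_days: int,
                      target_lift_pct: int = 20, variants: int = 2) -> Dict[str, object]:
    """Estimate whether a test can reach the per-variant conversion minimum."""
    needed = MIN_CONVERSIONS_PER_VARIANT[target_lift_pct]
    low_volume = daily_conversions < 5
    if low_volume:
        runtime_days = max(runtime_days, 21)  # low-volume: extend runtime to 21 days
    expected_per_variant = daily_conversions * runtime_days / variants
    days_needed = (math.ceil(needed * variants / daily_conversions)
                   if daily_conversions > 0 else None)
    return {
        "low_volume": low_volume,
        "runtime_days": runtime_days,
        "expected_per_variant": expected_per_variant,
        "feasible": expected_per_variant >= needed,
        "days_needed": days_needed,
    }
```

For example, at 10 conversions per day a 7-day two-variant test yields about 35 conversions per variant, which is short of the 50 needed for a 20% lift, so the check suggests roughly 10 days instead.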
Phase 5: ICE Scoring
Score each experiment on three dimensions (1-10 scale):
Impact — How much will this move the goal metric if it works?
- 9-10: >20% improvement on >$50K monthly spend
- 7-8: 10-20% improvement or affects $20-50K spend
- 5-6: 5-10% improvement or affects $5-20K spend
- 1-4: <5% improvement or affects <$5K spend
Confidence — How sure are we this will work?
- 9-10: Strong data signal + proven in similar accounts
- 7-8: Clear data signal, logical mechanism
- 5-6: Directional signal, some uncertainty
- 1-4: Speculative, weak signal
Ease — How easy is this to implement and measure?
- 9-10: Single setting change, no creative needed
- 7-8: Campaign duplication or budget reallocation
- 5-6: New creative or audience build needed
- 1-4: Complex setup, cross-team coordination, new tooling
ICE Total = Impact × Confidence × Ease
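As a small illustration of the scoring rule, the Python sketch below computes ICE totals and ranks experiments by them. The `Experiment` class and `rank_by_ice` function are hypothetical names, and the example scores in the usage note are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Experiment:
    exp_id: str
    title: str
    impact: int
    confidence: int
    ease: int

    def __post_init__(self) -> None:
        # Each dimension is scored on the 1-10 scale described above.
        for name in ("impact", "confidence", "ease"):
            score = getattr(self, name)
            if not 1 <= score <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {score}")

    @property
    def ice_total(self) -> int:
        """ICE Total = Impact x Confidence x Ease."""
        return self.impact * self.confidence * self.ease


def rank_by_ice(experiments: List[Experiment]) -> List[Experiment]:
    """Highest ICE total first."""
    return sorted(experiments, key=lambda e: e.ice_total, reverse=True)
```

With scores of (8, 7, 9), (9, 5, 4), and (6, 8, 9), the totals are 504, 180, and 432, so the first and third experiments rank ahead of the second. How totals map to P1/P2/P3 is not defined here, so the sketch stops at ranking.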
Phase 6: Console Output — Experiment Backlog
Display the prioritized backlog directly to the user:
Experiment Backlog — N Experiments
Priority Roadmap
This Week (P1):
| # | Experiment | ICE | Primary Metric | Runtime |
|---|---|---|---|---|
Next 2 Weeks (P2):
| # | Experiment | ICE | Primary Metric | Runtime |
|---|---|---|---|---|
Backlog (P3):
| # | Experiment | ICE | Primary Metric | Runtime |
|---|---|---|---|---|
Top 3 Experiments Detail
[For the top 3 by ICE score, show the full hypothesis, change, metrics, and sample size check]
Active Test Limits
- Max 3 concurrent tests per ad account
- No overlapping tests on same audience
- Minimum 7-day evaluation window
Phase 7: Write Output Files
**File 1: `<output_dir>/experiments_backlog.md`**
Experiment Backlog
Date: <today> | Workspace: <name> | Goal: <roas|cpa>
Period analyzed: <date_range> | Currency: <from workspace>
Total spend: $X | Blended ROAS: X.Xx | Active campaigns: N
Signal Summary
Overview of signals detected with evidence.
Experiment Backlog
EXP-001: [Title]
- Observation: ...
- Hypothesis: If we [change], then [metric] will [improve] because [reason]
- Change to test: ...
- Primary metric: ...
- Guardrail metric: ... (threshold: ...)
- Minimum runtime: N days
- Sample size check: Current daily conversions: N → Feasible / Extend runtime / Low confidence
- ICE Score: I(N) × C(N) × E(N) = Total
- Priority: P1/P2/P3
- Owner: TBD
(Repeat for all experiments)
Priority Roadmap
This Week (P1)
Table of P1 experiments with ICE, metric, runtime.
Next 2 Weeks (P2)
Table of P2 experiments.
Backlog (P3)
Table of P3 experiments.
Active Test Limits
- Max 3 concurrent tests per ad account
- No overlapping tests on same audience
- Minimum 7-day evaluation window before calling results
- Learning phase protection: no edits in first 7 days of any test
**File 2: `<output_dir>/experiments_backlog.json`**
Structured JSON containing:
- `report_date`, `workspace`, `goal`, `date_range`, `currency`
- `total_spend`, `blended_roas`, `active_campaigns`
- `spend_threshold` (computed min_spend)
- `signals` (array: id, category, description, evidence, impact)
- `experiments` (array: id, title, observation, hypothesis, change, primary_metric, guardrail_metric, guardrail_threshold, runtime_days, sample_size_feasible, ice_impact, ice_confidence, ice_ease, ice_total, priority, status)
- `roadmap` (this_week, next_2_weeks, backlog — arrays of experiment IDs)
- `quick_wins` (array: action, campaigns, expected_impact)
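The field list above can be turned into a skeleton document plus a completeness check, sketched here in Python. The key names are taken from the list; the use of `None` placeholders, the nesting of `roadmap` as a dict of three arrays, and the helper names `empty_backlog` and `missing_fields` are assumptions for illustration.

```python
import json
from typing import Any, Dict, List

EXPERIMENT_FIELDS = (
    "id", "title", "observation", "hypothesis", "change", "primary_metric",
    "guardrail_metric", "guardrail_threshold", "runtime_days",
    "sample_size_feasible", "ice_impact", "ice_confidence", "ice_ease",
    "ice_total", "priority", "status",
)


def empty_backlog() -> Dict[str, Any]:
    """Skeleton with every top-level key; values are placeholders to fill in."""
    return {
        "report_date": None, "workspace": None, "goal": None,
        "date_range": None, "currency": None,
        "total_spend": None, "blended_roas": None, "active_campaigns": None,
        "spend_threshold": None,  # computed min_spend
        "signals": [],            # items: id, category, description, evidence, impact
        "experiments": [],        # items: see EXPERIMENT_FIELDS
        "roadmap": {"this_week": [], "next_2_weeks": [], "backlog": []},
        "quick_wins": [],         # items: action, campaigns, expected_impact
    }


def missing_fields(experiment: Dict[str, Any]) -> List[str]:
    """Names from EXPERIMENT_FIELDS that the record does not contain."""
    return [name for name in EXPERIMENT_FIELDS if name not in experiment]


if __name__ == "__main__":
    print(json.dumps(empty_backlog(), indent=2))
```

Running `missing_fields` on each experiment before writing the file is a cheap way to catch records that would break programmatic consumers.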
**File 3: `<output_dir>/postmortem_template.md`**
Reusable template for tracking experiment outcomes:
Experiment Postmortem: [EXP-ID] [Title]
Summary
| Field | Value |
|---|---|
| Experiment ID | |
| Hypothesis | |
| Start date | |
| End date | |
| Runtime | X days |
| Status | Won / Lost / Inconclusive |
Setup
- Control: What was the baseline
- Variant: What was changed
- Audience: Who saw this
- Budget allocation: Control X% / Variant X%
Results
| Metric | Control | Variant | Delta | Significant? |
|---|---|---|---|---|
| Primary: [metric] | | | | Y/N |
| Guardrail: [metric] | | | | Y/N |
| Spend | | | | n/a |
| Conversions | | | | n/a |
Statistical Validity
- Total conversions (control + variant):
- Confidence level:
- Minimum detectable effect achieved: Yes / No
Analysis
What happened and why.
Decision
Scale winner / Kill variant / Extend test / Iterate.
Learnings
What this tells us for future tests.
Follow-up Experiments
What to test next based on these results.
Phase 8: Console Output — Wrap-Up
End with a brief summary:
Done
Files written to: `output/<YYYYMMDD>_experiments/`
- `experiments_backlog.md`: N experiments, N signals
- `experiments_backlog.json`: structured data
- `postmortem_template.md`: reusable tracker
Immediate actions (no test needed):
- [quick wins list]
First test to launch: EXP-XXX — [title]
---
ICE Scoring Reference
| Dimension | 9-10 | 7-8 | 5-6 | 1-4 |
|---|---|---|---|---|
| Impact | >20% on >$50K/mo | 10-20% or $20-50K | 5-10% or $5-20K | <5% or <$5K |
| Confidence | Strong signal + proven | Clear signal | Directional | Speculative |
| Ease | Setting change | Campaign dupe | New creative needed | Complex/cross-team |
Guardrails
- Never pause campaigns or edit budgets automatically.
- Limit recommendations to max 3 concurrent tests per ad account.
- If data is sparse (< 10 campaigns or < $5K total spend), produce fewer high-confidence tests and note low data volume.
- If no meaningful signals appear, propose diagnostic experiments (e.g., "run a structured A/B to establish baselines").
- Flag any experiment where sample size is insufficient for statistical significance.