Experiment Designer + Tracker
Build a structured experimentation program from real account performance signals — detect gaps, translate them into testable hypotheses, score with ICE, and produce a prioritized roadmap.
Do not use this skill to launch many overlapping tests or make account changes.
Prerequisites
- Glued MCP: Connect your Glued workspace at glued.me/mcp
Required MCP tools
Inputs
- : optional — resolved via if missing
- : default
- : (default) or
Outputs
All outputs go to a timestamped directory:
output/<YYYYMMDD>_experiments/
- `experiments_backlog.md`: full backlog with hypotheses, ICE scores, roadmap
- `experiments_backlog.json`: structured data for programmatic use
- `postmortem_template.md`: reusable template for tracking experiment outcomes
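As a rough illustration, the timestamped directory could be created as follows. The helper name `make_output_dir` and its `base` parameter are assumptions made for this sketch, not part of the skill itself.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def make_output_dir(base: str = "output", today: Optional[date] = None) -> Path:
    """Create <base>/<YYYYMMDD>_experiments/ and return its path."""
    today = today or date.today()
    out_dir = Path(base) / f"{today:%Y%m%d}_experiments"
    # exist_ok keeps a second run on the same day from failing
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

Calling `make_output_dir()` with no arguments uses today's date and the relative `output/` folder.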
Procedure
Phase 0: Context Collection
-
If workspace_id is not provided, call
and ask the user which workspace to analyze. If only one workspace exists, use it automatically.
- Ask the user using AskUserQuestion (2 questions):
Question 1: "What is your primary optimization goal?"
Question 2: "Are there any active tests or recent changes we should know about?"
- Options: "No active tests" / "Yes — I'll describe them"
- If yes, note them to avoid proposing duplicate or conflicting tests.
- Create output directory: output/<YYYYMMDD>_experiments/
- Tell the user: "Pulling 30-day performance data across campaigns and creatives. This takes about 30-60 seconds."
Phase 1: Data Pull
Run these API calls in parallel:
1a. Campaign-level performance:
query_ad_report:
workspace_id: <id>
date_range: <date_range>
group_by: campaign
metrics: [spend, impressions, clicks, ctr, cpc, cpm, roas, revenue, conversions, cpa]
sort_metric: spend
sort_direction: desc
limit: 50
1b. Creative-level performance:
query_ad_report:
workspace_id: <id>
date_range: <date_range>
group_by: creative
metrics: [spend, impressions, clicks, ctr, cpc, cpm, roas, revenue, conversions, cpa]
sort_metric: spend
sort_direction: desc
limit: 20
1c. Ad set-level performance (for audience/placement signals):
query_ad_report:
workspace_id: <id>
date_range: <date_range>
group_by: ad_set
metrics: [spend, impressions, clicks, ctr, cpm, roas, cpa]
sort_metric: spend
sort_direction: desc
limit: 30
If any call errors, continue with available data and note the gap.
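One way to run the three calls concurrently while tolerating failures is sketched below. The `call_tool(name, params)` callable is a hypothetical stand-in for whatever MCP client is in use, and the return shape is an assumption. The parameters mirror calls 1a to 1c above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Tuple

FULL_METRICS = ["spend", "impressions", "clicks", "ctr", "cpc", "cpm",
                "roas", "revenue", "conversions", "cpa"]
AD_SET_METRICS = ["spend", "impressions", "clicks", "ctr", "cpm", "roas", "cpa"]

# (group_by, metrics, limit) for calls 1a, 1b and 1c
REPORT_SPECS = [
    ("campaign", FULL_METRICS, 50),
    ("creative", FULL_METRICS, 20),
    ("ad_set", AD_SET_METRICS, 30),
]

# Hypothetical MCP client signature: call_tool(tool_name, params) -> rows
ToolCaller = Callable[[str, Dict[str, Any]], Any]


def pull_reports(call_tool: ToolCaller, workspace_id: str,
                 date_range: str) -> Tuple[Dict[str, Any], List[str]]:
    """Run the three query_ad_report calls in parallel.

    Returns (results keyed by group_by, list of notes about failed calls).
    """
    def run(group_by: str, metrics: List[str], limit: int) -> Any:
        return call_tool("query_ad_report", {
            "workspace_id": workspace_id,
            "date_range": date_range,
            "group_by": group_by,
            "metrics": metrics,
            "sort_metric": "spend",
            "sort_direction": "desc",
            "limit": limit,
        })

    results: Dict[str, Any] = {}
    gaps: List[str] = []
    with ThreadPoolExecutor(max_workers=len(REPORT_SPECS)) as pool:
        futures = {g: pool.submit(run, g, m, n) for g, m, n in REPORT_SPECS}
        for group_by, future in futures.items():
            try:
                results[group_by] = future.result()
            except Exception as exc:  # continue with whatever succeeded
                gaps.append(f"{group_by}: {exc}")
    return results, gaps
```

The `gaps` list can be carried into the signal summary so the user sees which level of data was unavailable.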
Phase 2: Signal Detection
Scan the data for these signal categories. For each signal found, note the specific campaigns/creatives/numbers involved.
Structural signals:
- Bid strategy gap: Compare ROAS across bid strategies (LOWEST_COST vs COST_CAP vs BID_CAP). Flag if one strategy consistently outperforms.
- Spend concentration: Flag if top 2 campaigns account for >50% of total spend.
- Sub-breakeven campaigns: Any campaign with ROAS < 1.0 and spend > $1,000.
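The spend concentration and sub-breakeven checks above could be expressed roughly like this. The sketch assumes each campaign row is a dict with `spend` and `roas` keys, matching the metric names requested in Phase 1. The function names are illustrative.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]


def spend_concentration(campaigns: List[Row], top_n: int = 2,
                        threshold: float = 0.5) -> Tuple[bool, float]:
    """Return (flagged, share), where share is the top_n campaigns' share of spend."""
    total = sum(c["spend"] for c in campaigns)
    if total <= 0:
        return False, 0.0
    top = sorted((c["spend"] for c in campaigns), reverse=True)[:top_n]
    share = sum(top) / total
    return share > threshold, share


def sub_breakeven(campaigns: List[Row], min_spend: float = 1000.0) -> List[Row]:
    """Campaigns with ROAS below 1.0 and spend above min_spend."""
    return [c for c in campaigns if c["roas"] < 1.0 and c["spend"] > min_spend]
```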
Creative signals:
- Format imbalance: Compare image vs video ROAS/CTR. Flag if one format is underrepresented but outperforming.
- Stale creatives: Seasonal hooks (Valentine's, New Year, BFCM, etc.) still running past their relevance window.
- CTR outliers: Creatives with CTR > 3x account median — what's different about them?
- Dynamic creative performance: If DCA/dynamic creatives exist, compare their ROAS to static.
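For the CTR outlier check, a minimal sketch using the median from the standard library is shown below. It assumes creative rows are dicts with a `ctr` key. The `name` key in the usage example is purely illustrative.

```python
from statistics import median
from typing import Any, Dict, List


def ctr_outliers(creatives: List[Dict[str, Any]],
                 multiple: float = 3.0) -> List[Dict[str, Any]]:
    """Creatives whose CTR exceeds `multiple` times the account median CTR."""
    ctrs = [c["ctr"] for c in creatives]
    if not ctrs:
        return []
    cutoff = multiple * median(ctrs)
    return [c for c in creatives if c["ctr"] > cutoff]
```

With CTRs of 0.8, 1.0, 1.1, 1.2 and 4.0, the median is 1.1, the cutoff is 3.3, and only the 4.0 creative is returned.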
Audience/placement signals:
- CPM spread: If max CPM > 5x min CPM across campaigns, placement/audience efficiency varies widely.
- Localization gap: If geo-targeted campaigns exist, compare ROAS across countries.
- Ad set audience overlap: Multiple ad sets in same campaign with similar targeting.
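The CPM spread rule reduces to a max/min ratio. The following sketch assumes rows are dicts with a `cpm` key and skips rows with a zero or negative CPM to avoid dividing by zero, which is a choice made for this example rather than something the skill specifies.

```python
from typing import Dict, List, Optional, Tuple


def cpm_spread(rows: List[Dict[str, float]],
               ratio_threshold: float = 5.0) -> Tuple[bool, Optional[float]]:
    """Return (flagged, ratio), where ratio is max CPM divided by min CPM."""
    cpms = [r["cpm"] for r in rows if r["cpm"] > 0]
    if len(cpms) < 2:
        return False, None  # not enough data to compare
    ratio = max(cpms) / min(cpms)
    return ratio > ratio_threshold, ratio
```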
Operational signals:
- No testing framework: Multiple recent launches with no shared naming convention or evaluation cadence.
- Learning phase violations: Campaigns with frequent edits in first 7 days.
Minimum spend threshold for signals:
min_spend = max(total_spend * 0.003, 500)
Ignore campaigns/creatives below this threshold for signal detection.
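A worked example of the threshold: at $100,000 total spend, 0.3% is $300, so the $500 floor applies. The percentage term only takes over once total spend passes roughly $166,667. The sketch below applies the formula and filters rows. The helper names are illustrative, and rows are assumed to be dicts with a `spend` key.

```python
from typing import Dict, List


def min_spend_threshold(total_spend: float) -> float:
    """min_spend = max(total_spend * 0.003, 500)"""
    return max(total_spend * 0.003, 500)


def eligible_rows(rows: List[Dict[str, float]],
                  total_spend: float) -> List[Dict[str, float]]:
    """Drop campaigns/creatives whose spend is below the threshold."""
    floor = min_spend_threshold(total_spend)
    return [r for r in rows if r["spend"] >= floor]
```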
Phase 3: Console Output — Signal Summary
Before writing files, display the signal summary directly to the user:
## Signal Scan Complete
**Account:** <workspace_name> | **Period:** <date_range> | **Goal:** <goal>
**Total spend:** $X | **Blended ROAS:** X.Xx | **Active campaigns:** N
### Signals Detected: N
| # | Signal | Category | Evidence | Impact |
|---|--------|----------|----------|--------|
| 1 | ... | Structural | ... | High/Med/Low |
### Quick Wins (no test needed)
- [list any immediate actions like pausing sub-breakeven campaigns]
This gives the user immediate value before the full backlog is built.
Phase 4: Hypothesis Generation
For each signal, generate a testable hypothesis using this structure:
- Observation: What the data shows (with specific numbers)
- Hypothesis: "If we [specific change], then [primary metric] will [improve by X] because [mechanism]"
- Change to test: Exact action to take
- Primary metric: Single metric that determines success
- Guardrail metric: Metric that must not degrade beyond a threshold
- Minimum runtime: Days needed (at least 7 days; bid strategy tests need 14 days)
- Sample size check: Based on current daily conversions, can we detect a meaningful effect?
Sample size guidance:
- Need at least 50 conversions per variant to detect a 20% lift
- Need at least 100 conversions per variant to detect a 10% lift
- If current daily conversions < 5, flag as "low-volume — extend runtime to 21 days"
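The guidance above can be turned into a rough feasibility check. This sketch assumes conversions split evenly across two variants, which the skill does not state, and the function and field names are illustrative. It applies the 21-day extension for low-volume accounts and then compares expected conversions per variant against the required count.

```python
from typing import Any, Dict

# Target lift (percent) -> conversions needed per variant, from the guidance above
REQUIRED_PER_VARIANT = {20: 50, 10: 100}


def sample_size_check(daily_conversions: float, lift_pct: int = 20,
                      variants: int = 2,
                      min_runtime_days: int = 7) -> Dict[str, Any]:
    """Estimate whether a test can reach the required conversions per variant."""
    required = REQUIRED_PER_VARIANT[lift_pct]
    low_volume = daily_conversions < 5
    # Low-volume accounts get the extended 21-day runtime
    runtime_days = max(min_runtime_days, 21) if low_volume else min_runtime_days
    # Assumption: traffic (and conversions) split evenly across variants
    expected = daily_conversions / variants * runtime_days
    return {
        "runtime_days": runtime_days,
        "required_per_variant": required,
        "expected_per_variant": expected,
        "low_volume": low_volume,
        "feasible": expected >= required,
    }
```

For example, 20 conversions per day over 7 days yields about 70 per variant, enough for a 20% lift but short of the 100 needed for a 10% lift. At 4 per day, even 21 days yields only about 42 per variant, so the experiment should be flagged.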
Quality checks:
- No two experiments should test the same variable
- Each experiment must have exactly one primary metric
- Guardrail thresholds must be specific (e.g., "CPA must not exceed $50" not "CPA must not increase")
Phase 5: ICE Scoring
Score each experiment on three dimensions (1-10 scale):
Impact — How much will this move the goal metric if it works?
- 9-10: >20% improvement on >$50K monthly spend
- 7-8: 10-20% improvement or affects $20-50K spend
- 5-6: 5-10% improvement or affects $5-20K spend
- 1-4: <5% improvement or affects <$5K spend
Confidence — How sure are we this will work?
- 9-10: Strong data signal + proven in similar accounts
- 7-8: Clear data signal, logical mechanism
- 5-6: Directional signal, some uncertainty
- 1-4: Speculative, weak signal
Ease — How easy is this to implement and measure?
- 9-10: Single setting change, no creative needed
- 7-8: Campaign duplication or budget reallocation
- 5-6: New creative or audience build needed
- 1-4: Complex setup, cross-team coordination, new tooling
ICE Total = Impact × Confidence × Ease
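Because the total is a product of three 1-10 scores, it ranges from 1 to 1,000. A minimal sketch of scoring and ranking follows. The field names `ice_impact`, `ice_confidence`, and `ice_ease` follow the JSON output described in Phase 7, while the function names are illustrative.

```python
from typing import Any, Dict, List


def ice_total(impact: int, confidence: int, ease: int) -> int:
    """ICE Total = Impact x Confidence x Ease, each on a 1-10 scale."""
    for name, score in (("impact", impact), ("confidence", confidence), ("ease", ease)):
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {score}")
    return impact * confidence * ease


def rank_by_ice(experiments: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return experiments sorted by ICE total, highest first."""
    return sorted(
        experiments,
        key=lambda e: ice_total(e["ice_impact"], e["ice_confidence"], e["ice_ease"]),
        reverse=True,
    )
```

An experiment scored 8/7/9 totals 504 and ranks above one scored 9/5/4, which totals 180, even though the second has the higher impact score.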
Phase 6: Console Output — Experiment Backlog
Display the prioritized backlog directly to the user:
## Experiment Backlog — N Experiments
### Priority Roadmap
**This Week (P1):**
| # | Experiment | ICE | Primary Metric | Runtime |
|---|-----------|-----|----------------|---------|
**Next 2 Weeks (P2):**
| # | Experiment | ICE | Primary Metric | Runtime |
|---|-----------|-----|----------------|---------|
**Backlog (P3):**
| # | Experiment | ICE | Primary Metric | Runtime |
|---|-----------|-----|----------------|---------|
### Top 3 Experiments Detail
[For the top 3 by ICE score, show the full hypothesis, change, metrics, and sample size check]
### Active Test Limits
- Max 3 concurrent tests per ad account
- No overlapping tests on same audience
- Minimum 7-day evaluation window
Phase 7: Write Output Files
File 1: <output_dir>/experiments_backlog.md
# Experiment Backlog
**Date:** <today> | **Workspace:** <name> | **Goal:** <roas|cpa>
**Period analyzed:** <date_range> | **Currency:** <from workspace>
**Total spend:** $X | **Blended ROAS:** X.Xx | **Active campaigns:** N
## Signal Summary
Overview of signals detected with evidence.
## Experiment Backlog
### EXP-001: [Title]
- **Observation:** ...
- **Hypothesis:** If we [change], then [metric] will [improve] because [reason]
- **Change to test:** ...
- **Primary metric:** ...
- **Guardrail metric:** ... (threshold: ...)
- **Minimum runtime:** N days
- **Sample size check:** Current daily conversions: N → Feasible / Extend runtime / Low confidence
- **ICE Score:** I(N) × C(N) × E(N) = Total
- **Priority:** P1/P2/P3
- **Owner:** TBD
(Repeat for all experiments)
## Priority Roadmap
### This Week (P1)
Table of P1 experiments with ICE, metric, runtime.
### Next 2 Weeks (P2)
Table of P2 experiments.
### Backlog (P3)
Table of P3 experiments.
## Active Test Limits
- Max 3 concurrent tests per ad account
- No overlapping tests on same audience
- Minimum 7-day evaluation window before calling results
- Learning phase protection: no edits in first 7 days of any test
File 2: <output_dir>/experiments_backlog.json
Structured JSON containing:
- , , , ,
- , ,
- (computed min_spend)
- (array: id, category, description, evidence, impact)
- (array: id, title, observation, hypothesis, change, primary_metric, guardrail_metric, guardrail_threshold, runtime_days, sample_size_feasible, ice_impact, ice_confidence, ice_ease, ice_total, priority, status)
- (this_week, next_2_weeks, backlog — arrays of experiment IDs)
- (array: action, campaigns, expected_impact)
File 3: <output_dir>/postmortem_template.md
Reusable template for tracking experiment outcomes:
# Experiment Postmortem: [EXP-ID] [Title]
## Summary
| Field | Value |
|-------|-------|
| Experiment ID | |
| Hypothesis | |
| Start date | |
| End date | |
| Runtime | X days |
| Status | Won / Lost / Inconclusive |
## Setup
- **Control:** What was the baseline
- **Variant:** What was changed
- **Audience:** Who saw this
- **Budget allocation:** Control X% / Variant X%
## Results
| Metric | Control | Variant | Delta | Significant? |
|--------|---------|---------|-------|-------------|
| Primary: [metric] | | | | Y/N |
| Guardrail: [metric] | | | | Y/N |
| Spend | | | | — |
| Conversions | | | | — |
## Statistical Validity
- Total conversions (control + variant):
- Confidence level:
- Minimum detectable effect achieved: Yes / No
## Analysis
What happened and why.
## Decision
Scale winner / Kill variant / Extend test / Iterate.
## Learnings
What this tells us for future tests.
## Follow-up Experiments
What to test next based on these results.
Phase 8: Console Output — Wrap-Up
End with a brief summary:
## Done
**Files written to:** `output/<YYYYMMDD>_experiments/`
- `experiments_backlog.md` — N experiments, N signals
- `experiments_backlog.json` — structured data
- `postmortem_template.md` — reusable tracker
**Immediate actions (no test needed):**
- [quick wins list]
**First test to launch:** EXP-XXX — [title]
ICE Scoring Reference
| Dimension | 9-10 | 7-8 | 5-6 | 1-4 |
|---|---|---|---|---|
| Impact | >20% on >$50K/mo | 10-20% or $20-50K | 5-10% or $5-20K | <5% or <$5K |
| Confidence | Strong signal + proven | Clear signal | Directional | Speculative |
| Ease | Setting change | Campaign dupe | New creative needed | Complex/cross-team |
Guardrails
- Never pause campaigns or edit budgets automatically.
- Limit recommendations to max 3 concurrent tests per ad account.
- If data is sparse (< 10 campaigns or < $5K total spend), produce fewer high-confidence tests and note low data volume.
- If no meaningful signals appear, propose diagnostic experiments (e.g., "run a structured A/B to establish baselines").
- Flag any experiment where sample size is insufficient for statistical significance.
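The sparse-data guardrail can be checked before generating hypotheses. This is a small sketch using the thresholds stated above; the function name is an assumption.

```python
def is_sparse(campaign_count: int, total_spend: float) -> bool:
    """True when there are fewer than 10 campaigns or under $5K total spend."""
    return campaign_count < 10 or total_spend < 5000
```

When this returns true, the backlog should contain fewer, higher-confidence tests and note the low data volume.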