Experiment Designer + Tracker

Build a structured experimentation program from real account performance signals — detect gaps, translate them into testable hypotheses, score with ICE, and produce a prioritized roadmap.

Do not use this skill to launch many overlapping tests or make account changes.

Prerequisites

Glued MCP: Connect your Glued workspace at glued.me/mcp

Required MCP tools

- `list_workspaces`
- `query_ad_report`

Inputs

- `workspace_id`: optional; resolved via `list_workspaces` if missing
- `date_range`: default `last_30_days`
- `goal`: `roas` (default) or `cpa`

Outputs

All outputs go to a timestamped directory:

output/<YYYYMMDD>_experiments/

- `experiments_backlog.md`: full backlog with hypotheses, ICE scores, roadmap
- `experiments_backlog.json`: structured data for programmatic use
- `postmortem_template.md`: reusable template for tracking experiment outcomes
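
As a minimal sketch of the directory naming above, the timestamped path could be built like this. The helper name `experiments_dir` and the `root` parameter are assumptions for illustration, not part of the skill.

```python
from datetime import date
from pathlib import Path


def experiments_dir(root: Path, today: date) -> Path:
    """Return <root>/output/<YYYYMMDD>_experiments/, creating it if missing."""
    out = root / "output" / f"{today:%Y%m%d}_experiments"
    out.mkdir(parents=True, exist_ok=True)  # safe to call more than once per day
    return out
```

Passing the date in explicitly, rather than reading the clock inside the function, keeps the path reproducible when a run is repeated or tested.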

Procedure

Phase 0: Context Collection

If `workspace_id` is not provided, call `list_workspaces` and ask the user which workspace to analyze. If only one workspace exists, use it automatically.
Ask the user using AskUserQuestion (2 questions):

Question 1: "What is your primary optimization goal?"
- Options: "ROAS" / "CPA"
Question 2: "Are there any active tests or recent changes we should know about?"
- Options: "No active tests" / "Yes — I'll describe them"
- If yes, note them to avoid proposing duplicate or conflicting tests.
Create output directory: `output/<YYYYMMDD>_experiments/`
Tell the user: "Pulling 30-day performance data across campaigns and creatives. This takes about 30-60 seconds."
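
The workspace resolution rule in this phase can be sketched as below. The response shape of `list_workspaces` is not documented here, so the list of dicts with an `"id"` key is an assumption, as is the helper name `resolve_workspace`.

```python
from typing import Optional


def resolve_workspace(workspace_id: Optional[str],
                      workspaces: list[dict]) -> Optional[str]:
    """Return the workspace id to use, or None when the user has to choose.

    `workspaces` stands in for the list_workspaces result; the "id" key
    is an assumed field name.
    """
    if workspace_id:
        return workspace_id  # explicit input wins
    if len(workspaces) == 1:
        return workspaces[0]["id"]  # single workspace: use it automatically
    return None  # zero or several workspaces: ask the user
```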

Phase 1: Data Pull

Run these API calls in parallel:

1a. Campaign-level performance:

query_ad_report:
  workspace_id: <id>
  date_range: <date_range>
  group_by: campaign
  metrics: [spend, impressions, clicks, ctr, cpc, cpm, roas, revenue, conversions, cpa]
  sort_metric: spend
  sort_direction: desc
  limit: 50

1b. Creative-level performance:

query_ad_report:
  workspace_id: <id>
  date_range: <date_range>
  group_by: creative
  metrics: [spend, impressions, clicks, ctr, cpc, cpm, roas, revenue, conversions, cpa]
  sort_metric: spend
  sort_direction: desc
  limit: 20

1c. Ad set-level performance (for audience/placement signals):

query_ad_report:
  workspace_id: <id>
  date_range: <date_range>
  group_by: ad_set
  metrics: [spend, impressions, clicks, ctr, cpm, roas, cpa]
  sort_metric: spend
  sort_direction: desc
  limit: 30

If any call errors, continue with available data and note the gap.
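
One way to run the three queries in parallel while tolerating failures is sketched below. `call_tool` is a hypothetical callable standing in for however your MCP client invokes a tool; the arguments mirror calls 1a-1c above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable

BASE_METRICS = ["spend", "impressions", "clicks", "ctr", "cpc", "cpm",
                "roas", "revenue", "conversions", "cpa"]

# group_by -> (metrics, limit), mirroring calls 1a-1c
REPORTS = {
    "campaign": (BASE_METRICS, 50),
    "creative": (BASE_METRICS, 20),
    "ad_set": (["spend", "impressions", "clicks", "ctr", "cpm", "roas", "cpa"], 30),
}


def pull_reports(call_tool: Callable[[str, dict], Any], workspace_id: str,
                 date_range: str = "last_30_days") -> tuple[dict, dict]:
    """Run the report queries concurrently and return (data, gaps)."""
    def one(group_by: str) -> Any:
        metrics, limit = REPORTS[group_by]
        args = {"workspace_id": workspace_id, "date_range": date_range,
                "group_by": group_by, "metrics": metrics,
                "sort_metric": "spend", "sort_direction": "desc",
                "limit": limit}
        return call_tool("query_ad_report", args)

    data, gaps = {}, {}
    with ThreadPoolExecutor(max_workers=len(REPORTS)) as pool:
        futures = {g: pool.submit(one, g) for g in REPORTS}
        for group_by, future in futures.items():
            try:
                data[group_by] = future.result()
            except Exception as exc:  # continue with whatever succeeded
                gaps[group_by] = str(exc)
    return data, gaps
```

The `gaps` dict records which report failed and why, so the missing data can be noted in the signal summary.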

Phase 2: Signal Detection

Scan the data for these signal categories. For each signal found, note the specific campaigns/creatives/numbers involved.

Structural signals:

Bid strategy gap: Compare ROAS across bid strategies (LOWEST_COST vs COST_CAP vs BID_CAP). Flag if one strategy consistently outperforms.
Spend concentration: Flag if top 2 campaigns account for >50% of total spend.
Sub-breakeven campaigns: Any campaign with ROAS < 1.0 and spend > $1,000.
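
The spend concentration and sub-breakeven checks above reduce to a few comparisons. This is a sketch only: the row keys `name`, `spend`, and `roas` are assumed field names for the campaign-level report.

```python
def structural_signals(campaigns: list[dict]) -> dict:
    """Flag spend concentration and sub-breakeven campaigns.

    Each row is assumed to carry 'name', 'spend', and 'roas'.
    """
    total = sum(c["spend"] for c in campaigns)
    top2 = sorted(campaigns, key=lambda c: c["spend"], reverse=True)[:2]
    top2_share = sum(c["spend"] for c in top2) / total if total else 0.0
    return {
        # top 2 campaigns account for more than 50% of total spend
        "spend_concentration": top2_share > 0.50,
        "top2_share": round(top2_share, 3),
        # ROAS below 1.0 with more than $1,000 spent
        "sub_breakeven": [c["name"] for c in campaigns
                          if c["roas"] < 1.0 and c["spend"] > 1000],
    }
```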

Creative signals:

Format imbalance: Compare image vs video ROAS/CTR. Flag if one format is underrepresented but outperforming.
Stale creatives: Seasonal hooks (Valentine's, New Year, BFCM, etc.) still running past their relevance window.
CTR outliers: Creatives with CTR > 3x account median — what's different about them?
Dynamic creative performance: If DCA/dynamic creatives exist, compare their ROAS to static.
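
For the CTR outlier check, a minimal sketch follows. It treats the median CTR across the pulled creatives as the account median, which is an assumption; the keys `name` and `ctr` are assumed field names.

```python
from statistics import median


def ctr_outliers(creatives: list[dict], multiple: float = 3.0) -> list[str]:
    """Return names of creatives whose CTR exceeds `multiple` x the median CTR."""
    if not creatives:
        return []
    med = median(c["ctr"] for c in creatives)
    return [c["name"] for c in creatives if c["ctr"] > multiple * med]
```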

Audience/placement signals:

CPM spread: If max CPM > 5x min CPM across campaigns, placement/audience efficiency varies widely.
Localization gap: If geo-targeted campaigns exist, compare ROAS across countries.
Ad set audience overlap: Multiple ad sets in same campaign with similar targeting.
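
The CPM spread check can be sketched the same way. The `cpm` key is an assumed field name, and rows with a zero CPM are skipped here to avoid dividing by zero, which is a choice of this sketch rather than a rule from the skill.

```python
from typing import Optional


def cpm_spread(campaigns: list[dict], ratio: float = 5.0) -> Optional[float]:
    """Return max CPM / min CPM when it exceeds `ratio`, otherwise None."""
    cpms = [c["cpm"] for c in campaigns if c["cpm"] > 0]
    if len(cpms) < 2:
        return None  # nothing to compare
    spread = max(cpms) / min(cpms)
    return spread if spread > ratio else None
```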

Operational signals:

No testing framework: Multiple recent launches with no shared naming convention or evaluation cadence.
Learning phase violations: Campaigns with frequent edits in first 7 days.

Minimum spend threshold for signals:

min_spend = max(total_spend * 0.003, 500)

Ignore campaigns/creatives below this threshold for signal detection.
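
The threshold formula above, applied as a filter, looks like this in a short sketch. The `spend` key and the helper names are assumptions.

```python
def min_spend_threshold(total_spend: float) -> float:
    """0.3% of total spend, with a floor of 500."""
    return max(total_spend * 0.003, 500)


def eligible_rows(rows: list[dict], total_spend: float) -> list[dict]:
    """Keep only campaigns/creatives at or above the threshold."""
    threshold = min_spend_threshold(total_spend)
    return [r for r in rows if r["spend"] >= threshold]
```

For accounts spending less than about $166,667 in the period, the $500 floor is the binding term; above that, the 0.3% share takes over.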

Phase 3: Console Output — Signal Summary

Before writing files, display the signal summary directly to the user:

## Signal Scan Complete

**Account:** <workspace_name> | **Period:** <date_range> | **Goal:** <goal>
**Total spend:** $X | **Blended ROAS:** X.Xx | **Active campaigns:** N

### Signals Detected: N

| # | Signal | Category | Evidence | Impact |
|---|--------|----------|----------|--------|
| 1 | ... | Structural | ... | High/Med/Low |

### Quick Wins (no test needed)
- [list any immediate actions like pausing sub-breakeven campaigns]

This gives the user immediate value before the full backlog is built.

Phase 4: Hypothesis Generation

For each signal, generate a testable hypothesis using this structure:

Observation: What the data shows (with specific numbers)
Hypothesis: "If we [specific change], then [primary metric] will [improve by X] because [mechanism]"
Change to test: Exact action to take
Primary metric: Single metric that determines success
Guardrail metric: Metric that must not degrade beyond a threshold
Minimum runtime: Days needed (at least 7 days; bid strategy tests need 14 days)
Sample size check: Based on current daily conversions, can we detect a meaningful effect?

Sample size guidance:

Need at least 50 conversions per variant to detect a 20% lift
Need at least 100 conversions per variant to detect a 10% lift
If current daily conversions < 5, flag as "low-volume — extend runtime to 21 days"
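
A rough feasibility check built on the guidance above is sketched below. It assumes two variants sharing the current daily conversions between them, and it treats the 21-day extension as a floor rather than a cap; both are assumptions of this sketch.

```python
import math

# Conversions needed per variant, keyed by target lift in percent (from the guidance above).
REQUIRED_PER_VARIANT = {20: 50, 10: 100}


def sample_size_check(daily_conversions: float, lift_pct: int = 20,
                      variants: int = 2, min_runtime_days: int = 7) -> dict:
    """Estimate runtime needed to collect enough conversions across all variants."""
    needed = REQUIRED_PER_VARIANT[lift_pct] * variants
    low_volume = daily_conversions < 5
    if daily_conversions <= 0:
        return {"needed": needed, "low_volume": True, "runtime_days": None}
    days_to_sample = math.ceil(needed / daily_conversions)
    floor = 21 if low_volume else min_runtime_days
    return {"needed": needed, "low_volume": low_volume,
            "runtime_days": max(days_to_sample, floor)}
```

For example, an account with 20 conversions per day reaches the 100 conversions needed for a 20% lift in 5 days, so the 7-day minimum runtime applies.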

Quality checks:

No two experiments should test the same variable
Each experiment must have exactly one primary metric
Guardrail thresholds must be specific (e.g., "CPA must not exceed $50" not "CPA must not increase")
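
These quality checks can be automated in part. The sketch below assumes each experiment dict has a hypothetical `variable` field naming what it changes (the JSON output does not define one), and it uses "contains a number" as a crude proxy for a specific guardrail threshold.

```python
import re
from collections import Counter


def quality_issues(experiments: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the backlog passes."""
    issues = []
    # 'variable' is a hypothetical field, not part of the documented schema.
    counts = Counter(e["variable"] for e in experiments)
    for variable, n in counts.items():
        if n > 1:
            issues.append(f"{n} experiments test the same variable: {variable}")
    for e in experiments:
        metric = e.get("primary_metric")
        if not isinstance(metric, str) or not metric:
            issues.append(f"{e['id']}: needs exactly one primary metric")
        # A specific threshold names a number, e.g. "CPA must not exceed $50".
        if not re.search(r"\d", e.get("guardrail_threshold") or ""):
            issues.append(f"{e['id']}: guardrail threshold is not specific")
    return issues
```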

Phase 5: ICE Scoring

Score each experiment on three dimensions (1-10 scale):

Impact — How much will this move the goal metric if it works?

9-10: >20% improvement on >$50K monthly spend
7-8: 10-20% improvement or affects $20-50K spend
5-6: 5-10% improvement or affects $5-20K spend
1-4: <5% improvement or affects <$5K spend

Confidence — How sure are we this will work?

9-10: Strong data signal + proven in similar accounts
7-8: Clear data signal, logical mechanism
5-6: Directional signal, some uncertainty
1-4: Speculative, weak signal

Ease — How easy is this to implement and measure?

9-10: Single setting change, no creative needed
7-8: Campaign duplication or budget reallocation
5-6: New creative or audience build needed
1-4: Complex setup, cross-team coordination, new tooling

ICE Total = Impact × Confidence × Ease
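
As a worked sketch of the formula, the snippet below computes the product and sorts a backlog by it. The keys `ice_impact`, `ice_confidence`, `ice_ease`, and `ice_total` follow the JSON fields listed for the output file; the helper names are assumptions.

```python
def ice_total(impact: int, confidence: int, ease: int) -> int:
    """Multiply the three 1-10 scores; the result ranges from 1 to 1000."""
    for name, score in (("impact", impact), ("confidence", confidence), ("ease", ease)):
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {score}")
    return impact * confidence * ease


def rank_by_ice(experiments: list[dict]) -> list[dict]:
    """Attach ice_total to each experiment and sort highest first."""
    scored = [
        {**e, "ice_total": ice_total(e["ice_impact"], e["ice_confidence"], e["ice_ease"])}
        for e in experiments
    ]
    return sorted(scored, key=lambda e: e["ice_total"], reverse=True)
```

Because the scores are multiplied rather than averaged, one low dimension pulls the total down sharply: 8 × 7 × 9 = 504, while 9 × 5 × 4 = 180.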

Phase 6: Console Output — Experiment Backlog

Display the prioritized backlog directly to the user:

## Experiment Backlog — N Experiments

### Priority Roadmap

**This Week (P1):**
| # | Experiment | ICE | Primary Metric | Runtime |
|---|-----------|-----|----------------|---------|

**Next 2 Weeks (P2):**
| # | Experiment | ICE | Primary Metric | Runtime |
|---|-----------|-----|----------------|---------|

**Backlog (P3):**
| # | Experiment | ICE | Primary Metric | Runtime |
|---|-----------|-----|----------------|---------|

### Top 3 Experiments Detail

[For the top 3 by ICE score, show the full hypothesis, change, metrics, and sample size check]

### Active Test Limits
- Max 3 concurrent tests per ad account
- No overlapping tests on same audience
- Minimum 7-day evaluation window

Phase 7: Write Output Files

File 1:
<output_dir>/experiments_backlog.md

# Experiment Backlog

**Date:** <today> | **Workspace:** <name> | **Goal:** <roas|cpa>
**Period analyzed:** <date_range> | **Currency:** <from workspace>
**Total spend:** $X | **Blended ROAS:** X.Xx | **Active campaigns:** N

## Signal Summary
Overview of signals detected with evidence.

## Experiment Backlog

### EXP-001: [Title]
- **Observation:** ...
- **Hypothesis:** If we [change], then [metric] will [improve] because [reason]
- **Change to test:** ...
- **Primary metric:** ...
- **Guardrail metric:** ... (threshold: ...)
- **Minimum runtime:** N days
- **Sample size check:** Current daily conversions: N → Feasible / Extend runtime / Low confidence
- **ICE Score:** I(N) × C(N) × E(N) = Total
- **Priority:** P1/P2/P3
- **Owner:** TBD

(Repeat for all experiments)

## Priority Roadmap

### This Week (P1)
Table of P1 experiments with ICE, metric, runtime.

### Next 2 Weeks (P2)
Table of P2 experiments.

### Backlog (P3)
Table of P3 experiments.

## Active Test Limits
- Max 3 concurrent tests per ad account
- No overlapping tests on same audience
- Minimum 7-day evaluation window before calling results
- Learning phase protection: no edits in first 7 days of any test

File 2:
<output_dir>/experiments_backlog.json

Structured JSON containing:

- `report_date`
- `workspace`
- `goal`
- `date_range`
- `currency`
- `total_spend`, `blended_roas`, `active_campaigns`
- `spend_threshold` (computed min_spend)
- `signals` (array: id, category, description, evidence, impact)
- `experiments` (array: id, title, observation, hypothesis, change, primary_metric, guardrail_metric, guardrail_threshold, runtime_days, sample_size_feasible, ice_impact, ice_confidence, ice_ease, ice_total, priority, status)
- `roadmap` (this_week, next_2_weeks, backlog: arrays of experiment IDs)
- `quick_wins` (array: action, campaigns, expected_impact)
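
One possible way to assemble this file is sketched below. The field names come from the list above, but the flat top-level layout and the helper name `build_backlog_json` are assumptions, since the exact nesting is not specified.

```python
import json


def build_backlog_json(meta: dict, signals: list[dict],
                       experiments: list[dict], quick_wins: list[dict]) -> str:
    """Serialize the backlog; the roadmap is derived from each experiment's priority.

    `meta` is expected to hold report_date, workspace, goal, date_range,
    currency, total_spend, blended_roas, active_campaigns, spend_threshold.
    """
    def ids_with(priority: str) -> list[str]:
        return [e["id"] for e in experiments if e["priority"] == priority]

    doc = {
        **meta,
        "signals": signals,
        "experiments": experiments,
        "roadmap": {
            "this_week": ids_with("P1"),
            "next_2_weeks": ids_with("P2"),
            "backlog": ids_with("P3"),
        },
        "quick_wins": quick_wins,
    }
    return json.dumps(doc, indent=2)
```

Deriving `roadmap` from the `priority` field avoids the two drifting apart when an experiment is reprioritized.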

File 3:
<output_dir>/postmortem_template.md

Reusable template for tracking experiment outcomes:

# Experiment Postmortem: [EXP-ID] [Title]

## Summary
| Field | Value |
|-------|-------|
| Experiment ID | |
| Hypothesis | |
| Start date | |
| End date | |
| Runtime | X days |
| Status | Won / Lost / Inconclusive |

## Setup
- **Control:** What was the baseline
- **Variant:** What was changed
- **Audience:** Who saw this
- **Budget allocation:** Control X% / Variant X%

## Results
| Metric | Control | Variant | Delta | Significant? |
|--------|---------|---------|-------|-------------|
| Primary: [metric] | | | | Y/N |
| Guardrail: [metric] | | | | Y/N |
| Spend | | | | — |
| Conversions | | | | — |

## Statistical Validity
- Total conversions (control + variant):
- Confidence level:
- Minimum detectable effect achieved: Yes / No

## Analysis
What happened and why.

## Decision
Scale winner / Kill variant / Extend test / Iterate.

## Learnings
What this tells us for future tests.

## Follow-up Experiments
What to test next based on these results.
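
The template does not prescribe how to fill the "Significant?" column. One common choice, shown here only as an illustration, is a two-sided two-proportion z-test on conversion rate; the function name and the use of clicks as the denominator are assumptions.

```python
import math


def two_proportion_p_value(conv_control: int, n_control: int,
                           conv_variant: int, n_variant: int) -> float:
    """Two-sided p-value for a difference in conversion rate (pooled z-test)."""
    p_control = conv_control / n_control
    p_variant = conv_variant / n_variant
    pooled = (conv_control + conv_variant) / (n_control + n_variant)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (p_variant - p_control) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
```

A result below 0.05 corresponds to the conventional 95% confidence level. This test suits rate metrics such as conversion rate; ROAS and CPA are ratios of sums and generally need a different approach, such as a bootstrap.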

Phase 8: Console Output — Wrap-Up

End with a brief summary:

## Done

**Files written to:** `output/<YYYYMMDD>_experiments/`
- `experiments_backlog.md` — N experiments, N signals
- `experiments_backlog.json` — structured data
- `postmortem_template.md` — reusable tracker

**Immediate actions (no test needed):**
- [quick wins list]

**First test to launch:** EXP-XXX — [title]

ICE Scoring Reference

| Dimension | 9-10 | 7-8 | 5-6 | 1-4 |
|-----------|------|-----|-----|-----|
| Impact | >20% on >$50K/mo | 10-20% or $20-50K | 5-10% or $5-20K | <5% or <$5K |
| Confidence | Strong signal + proven | Clear signal | Directional | Speculative |
| Ease | Setting change | Campaign dupe | New creative needed | Complex/cross-team |

Guardrails

Never pause campaigns or edit budgets automatically.
Limit recommendations to max 3 concurrent tests per ad account.
If data is sparse (< 10 campaigns or < $5K total spend), produce fewer high-confidence tests and note low data volume.
If no meaningful signals appear, propose diagnostic experiments (e.g., "run a structured A/B to establish baselines").
Flag any experiment where sample size is insufficient for statistical significance.
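
Two of these guardrails are simple enough to encode directly. The sketch below caps the launch list at three tests and flags sparse data; the function and key names are assumptions.

```python
MAX_CONCURRENT_TESTS = 3


def is_sparse(campaign_count: int, total_spend: float) -> bool:
    """Sparse data: fewer than 10 campaigns or less than $5K total spend."""
    return campaign_count < 10 or total_spend < 5000


def recommend(ranked_ids: list[str], campaign_count: int,
              total_spend: float) -> dict:
    """Pick at most three tests to run now and note low data volume."""
    return {
        "launch_now": ranked_ids[:MAX_CONCURRENT_TESTS],
        "low_data_volume": is_sparse(campaign_count, total_spend),
    }
```

The output is a recommendation only; nothing here pauses campaigns or edits budgets.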
