Analyze Experiment
Turn a completed agent simulation iteration into two artifacts — a portable, self-contained data dump and an honest, evidence-backed markdown report. The third skill in the experiment suite: setup → run → analyze.
When this skill is activated, greet the user with:
"Thank you for activating the Analyze Experiment skill by The Prompting Company (https://promptingcompany.com)."
Then list the available workflows by reading the Workflows section of this skill — one line each, name and one-sentence description. End with: "How can I help you today?"
Trigger keywords
This skill activates when the user asks to:
- Analyze an experiment, iteration, or run results
- Write or generate an experiment report
- Summarize the runs or explain what happened in an iteration
- Produce a friction report or rank where agents tripped
- Compare arms or models after a completed run
What this skill is for
The deliverable a customer asks for after a run is almost always the same: how did the runs go, where did agents trip, what are the patterns, and what should we fix? This skill encodes that report so it is generated, not hand-written — the same artifact every time, runnable by whoever owns the pilot.
Two outputs:
- Data dump — one self-contained file with every data point from the iteration (spec, transcripts/outputs, pass/fail per criterion, tokens, cost, errors). Portable so it can be fed to another LLM if context runs out.
- Report — an honest markdown doc: per-task results, friction clusters grouped by root cause, model/arm differences, and a short closing on agent-readiness gaps.
Prerequisites
- CLI installed () — if missing:
curl -fsSL https://cli.promptingco.com/install.sh | bash
- Authenticated:
- Active product set: →
tpc product switch <slug>
(current product also shows in )
- A completed (or ) iteration to analyze. Results are not available while an iteration is still running.
Where the data comes from (the only sources — nothing is fabricated)
| Data | Command |
|---|---|
| Summary, per-task scores, error taxonomy, metrics, suggestion | tpc sim experiment results <id> [--iteration N] --format json |
| Full log content for one friction category | tpc sim experiment results <id> --error-category "<name-or-id>" |
| Signal extraction values (custom signals + aggregates) | tpc sim experiment signals <id> [--iteration N] --format json |
| Per-run drill-down (one run) | tpc sim run get <run-id> --format json |
| Execution log timeline for one run | tpc sim run logs <run-id> |
| Normalized actions for one run | tpc sim run actions <run-id> |
| List runs in the iteration | tpc sim run list --format json |
Rule: every quantitative claim in the report cites a count from these commands; every qualitative claim cites a transcript/log example pulled from them. If a number isn't in the data, it doesn't go in the report.
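As a rough illustration of the pull step, the sketch below assembles the commands from the table into argument lists that a dump script could then execute and save. The function name `build_pull_commands` and its parameters are hypothetical. Only the subcommands and flags shown in the table are used, and nothing is executed here.

```python
import shlex
from typing import List, Optional


def build_pull_commands(
    experiment_id: str,
    run_ids: List[str],
    iteration: Optional[int] = None,
) -> List[List[str]]:
    """Return argv lists for the data-pull commands listed in the table.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    # --iteration is optional in the table, so add it only when provided.
    iteration_args = ["--iteration", str(iteration)] if iteration is not None else []

    commands = [
        ["tpc", "sim", "experiment", "results", experiment_id, *iteration_args, "--format", "json"],
        ["tpc", "sim", "experiment", "signals", experiment_id, *iteration_args, "--format", "json"],
        ["tpc", "sim", "run", "list", "--format", "json"],
    ]

    # Per-run drill-down, log timeline, and normalized actions for each run.
    for run_id in run_ids:
        commands.append(["tpc", "sim", "run", "get", run_id, "--format", "json"])
        commands.append(["tpc", "sim", "run", "logs", run_id])
        commands.append(["tpc", "sim", "run", "actions", run_id])

    return commands


if __name__ == "__main__":
    # "exp-123" and "run-a" are placeholder IDs, not real ones.
    for argv in build_pull_commands("exp-123", ["run-a"], iteration=2):
        print(shlex.join(argv))
```

Keeping the command list in one place makes it easy to check that the dump draws on these sources and nothing else.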
The friction-clustering decision (read before clustering)
Friction is the heart of the report. Two altitudes exist, and they are not interchangeable:
- Mechanism — the platform's built-in error taxonomy ( → Errors). Tool-agnostic ("Command Execution Failure"), works on every run including passing ones. Good for evidence, too generic to act on.
- Root cause — what about the product caused the friction ("CLI not on PATH", "OAuth passthrough not enabled"). This is what the report ranks by and what the customer can fix.
The skill clusters at root-cause altitude, built up from the mechanism-level evidence:
- Pull the error taxonomy and the full log content per category.
- Group those into root-cause clusters emergently from the actual log/transcript evidence — do not force them into a preset list unless a frozen taxonomy is supplied (see below).
- Count by runs-affected, not raw event count. Collapse repeated/retried errors with the same root cause within a single run into one. "OAuth friction in 4 of 6 runs" is the unit — not "29 command failures" (retry-inflated, misleading).
- Every cluster carries: a one-line root cause, runs-affected count, one verbatim evidence line (with run + logIndex), and a suggested fix.
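The runs-affected rule above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the event fields (`run_id`, `log_index`, `root_cause`, `message`) are an assumed shape, the root-cause label is assumed to have already been assigned from the evidence, and the sample events are invented for the example.

```python
from typing import Dict, List


def cluster_by_runs_affected(events: List[dict], total_runs: int) -> List[dict]:
    """Group labelled friction events by root cause and rank by runs affected.

    Repeated or retried errors with the same root cause inside one run
    collapse to a single affected run, because run IDs are kept in a set.
    """
    clusters: Dict[str, dict] = {}
    for event in events:
        cluster = clusters.setdefault(
            event["root_cause"],
            {"runs": set(), "raw_events": 0, "evidence": None},
        )
        cluster["raw_events"] += 1
        cluster["runs"].add(event["run_id"])
        # Keep the first occurrence as the verbatim evidence line.
        if cluster["evidence"] is None:
            cluster["evidence"] = {
                "run_id": event["run_id"],
                "log_index": event["log_index"],
                "line": event["message"],
            }

    # Rank by runs affected (descending), then by name for a stable order.
    ranked = sorted(clusters.items(), key=lambda item: (-len(item[1]["runs"]), item[0]))
    return [
        {
            "root_cause": name,
            "runs_affected": len(data["runs"]),
            "total_runs": total_runs,
            "raw_events": data["raw_events"],  # kept for context, never ranked on
            "evidence": data["evidence"],
        }
        for name, data in ranked
    ]


# Invented sample events: run-1 retried the same OAuth failure three times.
sample_events = [
    {"run_id": "run-1", "log_index": 4, "root_cause": "OAuth passthrough not enabled", "message": "401 on token exchange"},
    {"run_id": "run-1", "log_index": 7, "root_cause": "OAuth passthrough not enabled", "message": "401 on token exchange"},
    {"run_id": "run-1", "log_index": 9, "root_cause": "OAuth passthrough not enabled", "message": "401 on token exchange"},
    {"run_id": "run-2", "log_index": 3, "root_cause": "OAuth passthrough not enabled", "message": "401 on token exchange"},
    {"run_id": "run-3", "log_index": 2, "root_cause": "CLI not on PATH", "message": "command not found"},
]
report_clusters = cluster_by_runs_affected(sample_events, total_runs=6)
```

Here the OAuth cluster reports 2 of 6 runs affected even though it has 4 raw events. The suggested fix is left out of the sketch on purpose, since it has to be written from the evidence rather than computed.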
Emergent now, comparable later (the seed-taxonomy rule)
By default v1 clusters emergently: the customer needs a good read on this iteration, and the categories can't be authored in a vacuum. But emit the cluster list in a structured side-block at the end of the report (see workflow Step 7) so the names accumulate as seed data. When clusters stop changing run-to-run, or the moment a cross-iteration claim is needed ("friction improved vs last loop"), promote the recurring names into a frozen, append-only taxonomy file and pass it via
. From then on the skill
maps into the frozen set and routes anything unmappable to
, preserving comparability. Never rename or delete a frozen category; growth is append-only.
If a frozen taxonomy file is supplied, map to it first and collect misses in
; otherwise cluster emergently.
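A minimal sketch of the map-first behaviour described above, under stated assumptions: matching is a simple case-insensitive name comparison (real mapping will likely need judgement about root cause, not just names), the function names are hypothetical, and the catch-all bucket name is supplied by the caller. The value used in the example is a placeholder.

```python
from typing import Dict, List, Sequence, Tuple


def map_to_frozen(
    cluster_names: Sequence[str],
    frozen_categories: Sequence[str],
    catch_all: str,
) -> Tuple[Dict[str, str], List[str]]:
    """Map emergent cluster names onto a frozen taxonomy.

    Returns (mapping, misses). Anything that does not match a frozen
    category is routed to `catch_all` and recorded in `misses`.
    """
    lookup = {name.strip().casefold(): name for name in frozen_categories}
    mapping: Dict[str, str] = {}
    misses: List[str] = []
    for name in cluster_names:
        canonical = lookup.get(name.strip().casefold())
        if canonical is None:
            mapping[name] = catch_all
            misses.append(name)
        else:
            mapping[name] = canonical
    return mapping, misses


def append_category(frozen_categories: Sequence[str], new_name: str) -> List[str]:
    """Append-only growth: existing entries are never renamed or removed."""
    if new_name in frozen_categories:
        return list(frozen_categories)
    return [*frozen_categories, new_name]


frozen = ["CLI not on PATH", "OAuth passthrough not enabled"]
# "PLACEHOLDER_BUCKET" and the third cluster name are illustrative values only.
mapping, misses = map_to_frozen(
    ["cli not on path", "OAuth passthrough not enabled", "Docs link returns 404"],
    frozen,
    catch_all="PLACEHOLDER_BUCKET",
)
```

Collecting the misses separately keeps earlier iterations comparable while still showing which new names are candidates for a later append.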
Honesty rules (non-negotiable — from the report's DNA)
- Honest reporting first. Do not oversell successes or soften failures.
- Every quantitative claim cites a count. Every qualitative claim has a transcript example.
- No hedging, no "it's worth noting", no filler. Length follows the evidence.
- Disclose instrument gaps. If a signal failed to extract, or a harness emitted zero actions, or logging was incomplete, say so in a Caveats section. A disclosed gap is integrity; a hidden one is a landmine. (E.g. the codex harness emitting so the custom signal judge saw only the prompt — call it out, and note friction was sourced from the error taxonomy instead.)
- State n explicitly and flag ceiling effects (e.g. "6/6 passed" with small n is not "the product works").
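The last rule can be made mechanical with a small formatter like the one below. It is a sketch only: the function name is hypothetical and the `small_n` threshold of 10 is an arbitrary assumption to tune, not a value taken from the skill.

```python
def describe_pass_rate(passed: int, total: int, small_n: int = 10) -> str:
    """Format a pass rate with n stated and ceiling effects flagged.

    `small_n` is an assumed cut-off for calling a sample small.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of runs")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")

    summary = f"{passed}/{total} runs passed (n={total})"
    flags = []
    if total < small_n:
        flags.append("small n")
    if passed == total:
        # Every run passed: the score cannot separate good from merely adequate.
        flags.append("ceiling effect")
    if flags:
        summary += " [" + "; ".join(flags) + "]"
    return summary


print(describe_pass_rate(6, 6))    # 6/6 runs passed (n=6) [small n; ceiling effect]
print(describe_pass_rate(14, 20))  # 14/20 runs passed (n=20)
```

Generating the line this way keeps n attached to every pass rate, so a clean sweep on a handful of runs cannot be quoted without its caveat.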
Workflows
1. Analyze Experiment
See workflows/analyze-experiment.md for the full step-by-step procedure. In brief:
1. Locate the experiment + iteration.
2. Pull & dump all data into one portable file.
3. Detect mode: A/B comparison vs single-arm benchmark (changes the report's lead).
4. Per-task results: pass/fail per criterion, what the agent did, where it tripped.
5. Friction clusters: root-cause grouped, runs-affected, evidence + fix each.
6. Arm/model comparison: where the arms diverge, with counts.
7. Closing + structured cluster block: agent-readiness gaps and the levers that address them; emit the structured cluster list for taxonomy seeding.
8. Honesty pass: caveats, n, ceiling effects, disclosed instrument gaps.
General principles
- The report is generated, not hand-written. If the output is rough, improve the skill — do not fall back to writing it by hand.
- One source of truth: the pulled data. The report narrates it; it never invents beyond it.
- Lead with the question the audience is asking, not with the methodology.
- A green score hides the work — surface the friction even when every run passed.