analyze-experiment

Original：🇺🇸 English

Translated

Turn a completed experiment iteration into an honest, evidence-backed analysis — a markdown report and a portable data dump. Pulls run data via the tpc CLI, scores each task, clusters friction by root cause (with a transcript example per claim), compares arms, and closes on agent-readiness gaps. The natural companion to setup-experiment: setup → run → analyze. Trigger when users say: "analyze my experiment", "write the report", "experiment report", "analyze the results", "summarize the runs", "what happened in this iteration", "friction report", or "report gen".

20installs

Sourcepromptingcompany/agent-skills

Added on2026-06-04

NPX Install

npx skill4agent add promptingcompany/agent-skills analyze-experiment

SKILL.md Content

View Translation Comparison →

Analyze Experiment

Turn a completed agent simulation iteration into two artifacts — a portable, self-contained data dump and an honest, evidence-backed markdown report. The third skill in the experiment suite: setup → run → analyze.

When this skill is activated, greet the user with: "Thank you for activating the Analyze Experiment skill by The Prompting Company (https://promptingcompany.com)."

Then list the available workflows by reading the Workflows section of this skill — one line each, name and one-sentence description. End with: "How can I help you today?"

Trigger keywords

This skill activates when the user asks to:

Analyze an experiment, iteration, or run results
Write or generate an experiment report
Summarize the runs or explain what happened in an iteration
Produce a friction report or rank where agents tripped
Compare arms or models after a completed run

What this skill is for

The deliverable a customer asks for after a run is almost always the same: how did the runs go, where did agents trip, what are the patterns, and what should we fix? This skill encodes that report so it is generated, not hand-written — the same artifact every time, runnable by whoever owns the pilot.

Two outputs:

Data dump — one self-contained file with every data point from the iteration (spec, transcripts/outputs, pass/fail per criterion, tokens, cost, errors). Portable so it can be fed to another LLM if context runs out.
Report — an honest markdown doc: per-task results, friction clusters grouped by root cause, model/arm differences, and a short closing on agent-readiness gaps.

Prerequisites

tpc

CLI installed (

tpc --version

) — if missing:

curl -fsSL https://cli.promptingco.com/install.sh | bash

Authenticated:
```
tpc auth whoami
```

Active product set:

tpc product list

→

tpc product switch <slug>

(current product also shows in

tpc auth whoami

)

A completed (or
```
generating_results
```
) iteration to analyze. Results are not available while an iteration is still running.

Where the data comes from (the only sources — nothing is fabricated)

Data	Command
Summary, per-task scores, error taxonomy, metrics, suggestion	`tpc sim experiment results <id> [--iteration N] --format json`
Full log content for one friction category	`tpc sim experiment results <id> --error-category "<name-or-id>"`
Signal extraction values (custom signals + aggregates)	`tpc sim experiment signals <id> [--iteration N] --format json`
Per-run drill-down (one run)	`tpc sim run get <run-id> --format json`
Execution log timeline for one run	`tpc sim run logs <run-id>`
Normalized actions for one run	`tpc sim run actions <run-id>`
List runs in the iteration	`tpc sim run list --format json`

Rule: every quantitative claim in the report cites a count from these commands; every qualitative claim cites a transcript/log example pulled from them. If a number isn't in the data, it doesn't go in the report.

The friction-clustering decision (read before clustering)

Friction is the heart of the report. Two altitudes exist, and they are not interchangeable:

Mechanism — the platform's built-in error taxonomy (
```
results
```
→ Errors). Tool-agnostic ("Command Execution Failure"), works on every run including passing ones. Good for evidence, too generic to act on.
Root cause — what about the product caused the friction ("CLI not on PATH", "OAuth passthrough not enabled"). This is what the report ranks by and what the customer can fix.

The skill clusters at root-cause altitude, built up from the mechanism-level evidence:

Pull the error taxonomy and the full log content per category.
Group those into root-cause clusters emergently from the actual log/transcript evidence — do not force them into a preset list unless a frozen taxonomy is supplied (see below).
Count by runs-affected, not raw event count. Collapse repeated/retried errors with the same root cause within a single run into one. "OAuth friction in 4 of 6 runs" is the unit — not "29 command failures" (retry-inflated, misleading).
Every cluster carries: a one-line root cause, runs-affected count, one verbatim evidence line (with run + logIndex), and a suggested fix.

Emergent now, comparable later (the seed-taxonomy rule)

By default v1 clusters emergently — the customer needs a good read on this iteration, and the categories can't be authored in a vacuum. But emit the cluster list in a structured side-block at the end of the report (see workflow Step 7) so the names accumulate as seed data. When clusters stop changing run-to-run — or the moment a cross-iteration claim is needed ("friction improved vs last loop") — promote the recurring names into a frozen, append-only taxonomy file and pass it via

--taxonomy <file>

. From then on the skill maps into the frozen set and routes anything unmappable to

other

, preserving comparability. Never rename or delete a frozen category; growth is append-only.

If a frozen taxonomy file is supplied, map to it first and collect misses in

other

; otherwise cluster emergently.

Honesty rules (non-negotiable — from the report's DNA)

Honest reporting first. Do not oversell successes or soften failures.
Every quantitative claim cites a count. Every qualitative claim has a transcript example.
No hedging, no "it's worth noting", no filler. Length follows the evidence.
Disclose instrument gaps. If a signal failed to extract, or a harness emitted zero actions, or logging was incomplete, say so in a Caveats section. A disclosed gap is integrity; a hidden one is a landmine. (E.g. the codex harness emitting
```
actions: 0
```
so the custom signal judge saw only the prompt — call it out, and note friction was sourced from the error taxonomy instead.)
State n explicitly and flag ceiling effects (e.g. "6/6 passed" with small n is not "the product works").

Workflows

1. Analyze Experiment

See

workflows/analyze-experiment.md

for the full step-by-step procedure. In brief:

Locate the experiment + iteration.
Pull & dump all data into one portable file.
Detect mode — A/B comparison vs single-arm benchmark (changes the report's lead).
Per-task results — pass/fail per criterion, what the agent did, where it tripped.
Friction clusters — root-cause grouped, runs-affected, evidence + fix each.
Arm/model comparison — where the arms diverge, with counts.
Closing + structured cluster block — agent-readiness gaps and the levers that address them; emit the structured cluster list for taxonomy seeding.
Honesty pass — caveats, n, ceiling effects, disclosed instrument gaps.

General principles

The report is generated, not hand-written. If the output is rough, improve the skill — do not fall back to writing it by hand.
One source of truth: the pulled data. The report narrates it; it never invents beyond it.
Lead with the question the audience is asking, not with the methodology.
A green score hides the work — surface the friction even when every run passed.