analyze-experiment

Analyze Experiment

Turn a completed agent simulation iteration into two artifacts: a portable, self-contained data dump and an honest, evidence-backed markdown report. This is the third skill in the experiment suite: setup → run → analyze.

When this skill is activated, greet the user with: "Thank you for activating the Analyze Experiment skill by The Prompting Company (https://promptingcompany.com)."

Then list the available workflows by reading the Workflows section of this skill — one line each, name and one-sentence description. End with: "How can I help you today?"

Trigger keywords

This skill activates when the user asks to:

Analyze an experiment, iteration, or run results
Write or generate an experiment report
Summarize the runs or explain what happened in an iteration
Produce a friction report or rank where agents tripped
Compare arms or models after a completed run

What this skill is for

The deliverable a customer asks for after a run is almost always the same: how did the runs go, where did agents trip, what are the patterns, and what should we fix? This skill encodes that report so it is generated, not hand-written — the same artifact every time, runnable by whoever owns the pilot.

Two outputs:

Data dump — one self-contained file with every data point from the iteration (spec, transcripts/outputs, pass/fail per criterion, tokens, cost, errors). Portable so it can be fed to another LLM if context runs out.
Report — an honest markdown doc: per-task results, friction clusters grouped by root cause, model/arm differences, and a short closing on agent-readiness gaps.

Prerequisites

`tpc` CLI installed (`tpc --version`). If missing: `curl -fsSL https://cli.promptingco.com/install.sh | bash`
Authenticated: `tpc auth whoami`
Active product set: `tpc product list` → `tpc product switch <slug>` (current product also shows in `tpc auth whoami`)
A completed (or `generating_results`) iteration to analyze. Results are not available while an iteration is still running.

Where the data comes from (the only sources — nothing is fabricated)

| Data | Command |
| --- | --- |
| Summary, per-task scores, error taxonomy, metrics, suggestion | `tpc sim experiment results <id> [--iteration N] --format json` |
| Full log content for one friction category | `tpc sim experiment results <id> --error-category "<name-or-id>"` |
| Signal extraction values (custom signals + aggregates) | `tpc sim experiment signals <id> [--iteration N] --format json` |
| Per-run drill-down (one run) | `tpc sim run get <run-id> --format json` |
| Execution log timeline for one run | `tpc sim run logs <run-id>` |
| Normalized actions for one run | `tpc sim run actions <run-id>` |
| List runs in the iteration | `tpc sim run list --format json` |

Rule: every quantitative claim in the report cites a count from these commands; every qualitative claim cites a transcript/log example pulled from them. If a number isn't in the data, it doesn't go in the report.
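
As an illustration of how the sources in the table could be gathered into one portable file, here is a minimal Python sketch. The command lines are the ones listed in the table; everything else is an assumption made for the example: the function names `run_tpc` and `build_dump`, the layout of the resulting dictionary, and the idea of passing run IDs in explicitly (the output shape of `tpc sim run list` is not documented here, so the sketch does not parse it).

```python
import json
import subprocess
from typing import Callable, Optional, Sequence

Runner = Callable[[Sequence[str]], str]


def run_tpc(argv: Sequence[str]) -> str:
    """Run one tpc command and return its stdout (raises on a non-zero exit)."""
    return subprocess.run(list(argv), check=True, capture_output=True, text=True).stdout


def build_dump(experiment_id: str, run_ids: Sequence[str],
               iteration: Optional[int] = None, runner: Runner = run_tpc) -> dict:
    """Collect the data sources from the table into one dict (hypothetical layout)."""
    iter_args = ["--iteration", str(iteration)] if iteration is not None else []
    dump = {
        "experiment_id": experiment_id,
        "iteration": iteration,
        "results": json.loads(runner(
            ["tpc", "sim", "experiment", "results", experiment_id, *iter_args, "--format", "json"])),
        "signals": json.loads(runner(
            ["tpc", "sim", "experiment", "signals", experiment_id, *iter_args, "--format", "json"])),
        "runs": {},
    }
    for run_id in run_ids:
        dump["runs"][run_id] = {
            "detail": json.loads(runner(["tpc", "sim", "run", "get", run_id, "--format", "json"])),
            # The table shows no --format flag for logs or actions, so keep the raw text.
            "logs": runner(["tpc", "sim", "run", "logs", run_id]),
            "actions": runner(["tpc", "sim", "run", "actions", run_id]),
        }
    return dump
```

Writing the result with `json.dump(dump, fh, indent=2)` would give a single file that can be handed to another LLM. The `runner` parameter exists only so the sketch can be exercised without the CLI installed.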

The friction-clustering decision (read before clustering)

Friction is the heart of the report. Two altitudes exist, and they are not interchangeable:

Mechanism: the platform's built-in error taxonomy (`results` → Errors). Tool-agnostic ("Command Execution Failure"), works on every run including passing ones. Good for evidence, too generic to act on.
Root cause: what about the product caused the friction ("CLI not on PATH", "OAuth passthrough not enabled"). This is what the report ranks by and what the customer can fix.

The skill clusters at root-cause altitude, built up from the mechanism-level evidence:

Pull the error taxonomy and the full log content per category.
Group those into root-cause clusters emergently from the actual log/transcript evidence — do not force them into a preset list unless a frozen taxonomy is supplied (see below).
Count by runs-affected, not raw event count. Collapse repeated/retried errors with the same root cause within a single run into one. "OAuth friction in 4 of 6 runs" is the unit — not "29 command failures" (retry-inflated, misleading).
Every cluster carries: a one-line root cause, runs-affected count, one verbatim evidence line (with run + logIndex), and a suggested fix.
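
The runs-affected rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the skill's implementation: the event shape (`run_id`, `root_cause`, `log_index`, `line`) and the function name `cluster_by_runs_affected` are hypothetical, and the root-cause label on each event is assumed to have been assigned already from the log evidence. The suggested fix is left out because it has to be written from the evidence.

```python
from collections import defaultdict


def cluster_by_runs_affected(events, total_runs):
    """Collapse retries within a run, then rank root causes by runs affected.

    events: iterable of dicts with run_id, root_cause, log_index, line
    (a hypothetical shape chosen for this sketch).
    """
    first_seen = {}  # (root_cause, run_id) -> earliest event for that pair
    for ev in events:
        key = (ev["root_cause"], ev["run_id"])
        if key not in first_seen or ev["log_index"] < first_seen[key]["log_index"]:
            first_seen[key] = ev

    by_cause = defaultdict(list)  # root_cause -> one event per affected run
    for (cause, _run_id), ev in first_seen.items():
        by_cause[cause].append(ev)

    clusters = []
    for cause, evs in by_cause.items():
        # One verbatim evidence line per cluster, chosen deterministically.
        evidence = min(evs, key=lambda e: (e["run_id"], e["log_index"]))
        clusters.append({
            "root_cause": cause,
            "runs_affected": len(evs),
            "total_runs": total_runs,
            "evidence": {
                "run_id": evidence["run_id"],
                "log_index": evidence["log_index"],
                "line": evidence["line"],
            },
        })
    # Rank by runs affected, not raw event count; break ties by name.
    clusters.sort(key=lambda c: (-c["runs_affected"], c["root_cause"]))
    return clusters
```

With this shape, three retried failures in one run count once, so a cluster reads as "2 of 6 runs" rather than as a retry-inflated event total.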

Emergent now, comparable later (the seed-taxonomy rule)

By default v1 clusters emergently: the customer needs a good read on this iteration, and the categories can't be authored in a vacuum. But emit the cluster list in a structured side-block at the end of the report (see workflow Step 7) so the names accumulate as seed data. When clusters stop changing run-to-run, or the moment a cross-iteration claim is needed ("friction improved vs last loop"), promote the recurring names into a frozen, append-only taxonomy file and pass it via `--taxonomy <file>`. From then on the skill maps into the frozen set and routes anything unmappable to `other`, preserving comparability. Never rename or delete a frozen category; growth is append-only.

If a frozen taxonomy file is supplied, map to it first and collect misses in `other`; otherwise cluster emergently.
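
To make the frozen-taxonomy behaviour concrete, here is a small Python sketch. It assumes the frozen taxonomy is simply an ordered list of category names (the actual file format is not specified here) and uses exact-name matching, which is a deliberate simplification: real mapping of an emergent cluster onto a frozen category needs judgement from the evidence. The function names are hypothetical.

```python
def map_to_frozen(cluster_names, frozen):
    """Map each emergent cluster name into the frozen set, or to 'other'.

    Exact-name matching only; this is a simplification for illustration.
    """
    frozen_set = set(frozen)
    return {name: (name if name in frozen_set else "other") for name in cluster_names}


def is_append_only(old, new):
    """True if `new` keeps every old category, in order, and only adds at the end."""
    old, new = list(old), list(new)
    return new[:len(old)] == old
```

For example, `is_append_only(["CLI not on PATH"], ["CLI not on PATH", "OAuth passthrough not enabled"])` is true, while renaming or dropping the first entry makes it false, which is the condition that would break cross-iteration comparability.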

Honesty rules (non-negotiable — from the report's DNA)

Honest reporting first. Do not oversell successes or soften failures.
Every quantitative claim cites a count. Every qualitative claim has a transcript example.
No hedging, no "it's worth noting", no filler. Length follows the evidence.
Disclose instrument gaps. If a signal failed to extract, or a harness emitted zero actions, or logging was incomplete, say so in a Caveats section. A disclosed gap is integrity; a hidden one is a landmine. (E.g. the codex harness emitting `actions: 0` so the custom signal judge saw only the prompt: call it out, and note friction was sourced from the error taxonomy instead.)
State n explicitly and flag ceiling effects (e.g. "6/6 passed" with small n is not "the product works").
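
One way to see why "6/6 passed" is weak evidence is to put an interval around the pass rate. The sketch below uses the Wilson score interval, which is a technique chosen for this illustration and not something the skill prescribes; the function name and the 95% default (`z=1.96`) are assumptions.

```python
import math


def wilson_interval(passed, n, z=1.96):
    """Wilson score interval for a pass rate (illustrative, not part of the skill)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)
```

For 6 passes out of 6, the lower bound comes out at roughly 0.61, so the data are still compatible with a true pass rate far below 100%. That is the ceiling effect the report should flag alongside n.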

Workflows

1. Analyze Experiment

See `workflows/analyze-experiment.md` for the full step-by-step procedure. In brief:

1. Locate the experiment + iteration.
2. Pull & dump all data into one portable file.
3. Detect mode: A/B comparison vs single-arm benchmark (changes the report's lead).
4. Per-task results: pass/fail per criterion, what the agent did, where it tripped.
5. Friction clusters: root-cause grouped, runs-affected, evidence + fix each.
6. Arm/model comparison: where the arms diverge, with counts.
7. Closing + structured cluster block: agent-readiness gaps and the levers that address them; emit the structured cluster list for taxonomy seeding.
8. Honesty pass: caveats, n, ceiling effects, disclosed instrument gaps.
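
The mode-detection step can be pictured as counting distinct arms among the runs. This is a minimal sketch under an assumption: that each run record carries a field naming its arm. The key `arm`, the function name, and the two mode labels are hypothetical and would need to match whatever the pulled data actually contains.

```python
def detect_mode(runs, arm_key="arm"):
    """Return (mode, arms): A/B comparison if two or more distinct arms are present.

    `arm_key` is a hypothetical field name; adjust it to the real run records.
    """
    arms = sorted({r[arm_key] for r in runs if r.get(arm_key) is not None})
    mode = "ab_comparison" if len(arms) >= 2 else "single_arm_benchmark"
    return mode, arms
```

An A/B result leads the report with where the arms diverge; a single-arm result leads with how the benchmark went.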

General principles

The report is generated, not hand-written. If the output is rough, improve the skill — do not fall back to writing it by hand.
One source of truth: the pulled data. The report narrates it; it never invents beyond it.
Lead with the question the audience is asking, not with the methodology.
A green score hides the work — surface the friction even when every run passed.
