skill-eval

Re-run baseline evaluations on one or more skills. Uses the evals.json test definitions committed in each skill, dispatches pressure scenarios via subagents, saves transcripts to a gitignored workspace, and grades the runs deterministically.

Claude Code extension (other agents will ignore it) --- argument-hint: '[<skill-name>|--all]' user-invocable: true model-invocable: true

When to use skill-eval

Verbatim trigger phrases:
  • "rerun the baselines"
  • "re-eval skill X"
  • "test all the skills"
  • "check for skill drift"
  • "run the evals"
  • "did skill X still pass"

When NOT to use

  • Authoring a new skill — use /skill-creator instead
  • Modifying skill body content — just edit the SKILL.md
  • Running unit tests for packages/skill-tools itself (those are vitest, not skill evals)

Inputs

  • $ARGUMENTS — one of:
    • <skill-name> — re-eval one skill (looks under skills/ and .agents/skills/)
    • --all — re-eval every skill that has an evals.json
    • empty — same as --all

Workflow

1. Resolve target skills

If $ARGUMENTS is a skill name, look in skills/<name>/ then .agents/skills/<name>/. Confirm evals.json exists. If not, abort with an error pointing the user at /skill-creator (or to author evals.json by hand).
If --all (or empty), find every directory with both SKILL.md and evals.json under those two roots.
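A minimal sketch of this resolution step, assuming Node's built-in fs/path modules; the helper names and error wording are illustrative, not part of skill-tools:

```ts
import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";

const ROOTS = ["skills", ".agents/skills"];

// Resolve a single named skill: the first root containing it wins,
// but only if the skill also ships an evals.json.
function resolveSkill(name: string): string {
  for (const root of ROOTS) {
    const dir = join(root, name);
    if (existsSync(join(dir, "SKILL.md"))) {
      if (!existsSync(join(dir, "evals.json"))) {
        throw new Error(`${dir} has no evals.json; run /skill-creator or author one by hand`);
      }
      return dir;
    }
  }
  throw new Error(`skill "${name}" not found under ${ROOTS.join(" or ")}`);
}

// --all (or empty): every directory with both SKILL.md and evals.json.
function resolveAll(): string[] {
  return ROOTS.filter(existsSync).flatMap((root) =>
    readdirSync(root, { withFileTypes: true })
      .filter((entry) => entry.isDirectory())
      .map((entry) => join(root, entry.name))
      .filter((dir) => existsSync(join(dir, "SKILL.md")) && existsSync(join(dir, "evals.json")))
  );
}
```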

2. Determine the next iteration number

For each target skill, look at its nested workspace at skills/<skill-name>/.workspace/ (or .agents/skills/<skill-name>/.workspace/, depending on the skill's location). If it doesn't exist, the next iteration is 1. Otherwise scan iteration-N/ directories and use max(N) + 1.
Create skills/<skill-name>/.workspace/iteration-<N>/ (the .workspace/ pattern is gitignored).
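A sketch of the numbering logic under the iteration-N naming above (function names are illustrative):

```ts
import { existsSync, readdirSync, mkdirSync } from "node:fs";
import { join } from "node:path";

// Scan <skillDir>/.workspace/ for iteration-N directories and return max(N) + 1.
// A missing workspace means this is iteration 1.
function nextIteration(skillDir: string): number {
  const workspace = join(skillDir, ".workspace");
  if (!existsSync(workspace)) return 1;
  const numbers = readdirSync(workspace)
    .map((name) => /^iteration-(\d+)$/.exec(name))
    .filter((match): match is RegExpExecArray => match !== null)
    .map((match) => Number(match[1]));
  return numbers.length === 0 ? 1 : Math.max(...numbers) + 1;
}

// e.g. skills/ts-best-practices/.workspace/iteration-2/
function createIterationDir(skillDir: string): string {
  const dir = join(skillDir, ".workspace", `iteration-${nextIteration(skillDir)}`);
  mkdirSync(dir, { recursive: true });
  return dir;
}
```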

3. Dispatch each eval via the Agent tool

For every eval in evals.json, run two subagent dispatches:

3a. WITHOUT skill (RED baseline)

Use the Agent tool with subagent_type: general-purpose. The prompt template:
Execute this task exactly:

[eval.prompt]

No skill is loaded for this task. After attempting it, report what you did,
what decisions you made and why, and anything you found tricky. Report
verbatim — do not polish, do not summarize. Include any code you wrote
inline so it can be analyzed.
Save the agent's reply (the entire response text) to:
skills/<skill-name>/.workspace/iteration-<N>/eval-<id>-<eval_name>/without_skill/transcript.md

3b. WITH skill (GREEN run)

Use the Agent tool again, this time including the target skill's full SKILL.md content as system context. The prompt template:
Execute this task exactly:

[eval.prompt]

The skill `<skill-name>` is available — apply its rules and patterns.

After attempting it, report what you did, what decisions you made and why,
and anything you found tricky. Report verbatim — do not polish, do not
summarize. Include any code you wrote inline.

If you considered skipping any rule from the skill, capture the exact
reasoning verbatim — that's the kind of failure mode the skill needs to
catch.
Save the response to:
skills/<skill-name>/.workspace/iteration-<N>/eval-<id>-<eval_name>/with_skill/transcript.md
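For orientation, one entry of the evals.json consumed by steps 3a/3b and 4 might look roughly like the sketch below. The field names are assumptions inferred from how they are referenced in this document (eval.prompt, eval.id, the eval name, and the three assertion types); the authoritative schema is skills/skill-creator/references/evals-json.md.

```ts
// Hypothetical shape of one eval entry; see the references section for the real schema.
type Assertion =
  | { type: "regex"; pattern: string }      // transcript must match this pattern
  | { type: "contains"; text: string }      // transcript must contain this substring
  | { type: "file_exists"; path: string };  // a file the task should have produced

interface EvalCase {
  id: number;              // used in the eval-<id>-<eval_name>/ directory name
  name: string;            // e.g. "validate-config"
  prompt: string;          // substituted for [eval.prompt] in the templates above
  assertions: Assertion[];
}

const example: EvalCase = {
  id: 0,
  name: "validate-config",
  prompt: "Add runtime validation to the config loader in src/config.ts.",
  assertions: [
    { type: "contains", text: "schema" },
    { type: "regex", pattern: "parse|safeParse" },
  ],
};
```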

4. Grade each transcript

For each transcript saved, invoke skill-tools eval to run assertions:
bash
node packages/skill-tools/dist/index.mjs eval <skill-name> <eval.id> \
  --variant <with_skill|without_skill> \
  --iteration <N> \
  --transcript <path-to-transcript.md>
This writes grading.json next to the transcript. Each assertion is regex / contains / file_exists — deterministic, no LLM-as-judge.
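A rough sketch of what deterministic grading of this kind can look like, reusing the hypothetical EvalCase/Assertion shapes sketched earlier; the grading.json layout shown here is invented for illustration and the real skill-tools output may differ:

```ts
import { existsSync, readFileSync, writeFileSync } from "node:fs";

// Every check is a pure function of the transcript text (or the filesystem),
// so the same transcript always grades the same way: no LLM-as-judge.
function checkAssertion(assertion: Assertion, transcript: string): boolean {
  switch (assertion.type) {
    case "regex":
      return new RegExp(assertion.pattern).test(transcript);
    case "contains":
      return transcript.includes(assertion.text);
    case "file_exists":
      return existsSync(assertion.path);
  }
}

function grade(evalCase: EvalCase, transcriptPath: string, gradingPath: string): void {
  const transcript = readFileSync(transcriptPath, "utf8");
  const results = evalCase.assertions.map((assertion) => ({
    assertion,
    passed: checkAssertion(assertion, transcript),
  }));
  writeFileSync(gradingPath, JSON.stringify({ evalId: evalCase.id, results }, null, 2));
}
```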

5. Generate the benchmark

After all evals are graded for a skill:
bash
node packages/skill-tools/dist/index.mjs benchmark <skill-name>
This aggregates the grading.json files into benchmark.json and benchmark.md for that iteration.
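A minimal sketch of the aggregation, assuming the illustrative grading.json layout from the previous sketch; the real benchmark.json produced by skill-tools is likely richer:

```ts
// Aggregate per-variant pass counts across all graded evals in one iteration.
interface GradingFile {
  evalId: number;
  results: { passed: boolean }[];
}

function summarize(gradings: Record<"with_skill" | "without_skill", GradingFile[]>) {
  const count = (files: GradingFile[]) => {
    const all = files.flatMap((file) => file.results);
    return { passed: all.filter((result) => result.passed).length, total: all.length };
  };
  return {
    with_skill: count(gradings.with_skill),       // e.g. { passed: 7, total: 9 }
    without_skill: count(gradings.without_skill), // e.g. { passed: 3, total: 9 }
  };
}
```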

6. Report

Summarize per skill:
  • Which evals improved with the skill loaded vs. without
  • Any evals that failed with the skill loaded (regression to investigate)
  • Path to the benchmark and to the latest iteration directory
Suggest the user run pnpm skill-tools view <skill-name> to navigate transcripts in the TUI.
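One way to phrase the improved/regressed call per eval, again in terms of the illustrative shapes above rather than the real skill-tools output:

```ts
// An eval improved if it passes with the skill loaded but not without;
// a pass without the skill that fails with it is a regression to investigate.
function classify(passedWithSkill: boolean, passedWithoutSkill: boolean): "improved" | "regression" | "unchanged" {
  if (passedWithSkill && !passedWithoutSkill) return "improved";
  if (!passedWithSkill && passedWithoutSkill) return "regression";
  return "unchanged";
}
```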

Examples

<example>
<input>"rerun the baselines for ts-best-practices"</input>
<output>
1. Resolve: skill at `skills/ts-best-practices/`, evals.json present with 3 cases.
2. Workspace: `skills/ts-best-practices/.workspace/iteration-2/` (iteration-1 already exists).
3. For each eval: dispatch Agent(general-purpose) twice (without/with skill), save transcripts.
4. Grade each transcript via skill-tools eval.
5. Aggregate via skill-tools benchmark → benchmark.md.
6. Report: "with skill 7/9 passed, without skill 3/9 — improvement on eval-1, regression on eval-2".
</output>
</example>

<example>
<input>"check for skill drift on all skills"</input>
<output>
1. --all: find every dir under skills/ + .agents/skills/ with both SKILL.md and evals.json.
2. For each: run the same 6-step workflow.
3. Aggregate report: any skills where regression count exceeds improvement count are flagged for review.
</output>
</example>

<good>
Saved transcript verbatim to:
skills/ts-best-practices/.workspace/iteration-2/eval-0-validate-config/with_skill/transcript.md
</good>

<bad>
The agent did roughly what we expected. Skipping transcript save.
</bad>
Always save raw transcripts verbatim — paraphrasing or dropping output makes regression detection impossible.

References

  • evals.json schema is documented in skills/skill-creator/references/evals-json.md
  • skill-tools CLI reference: packages/skill-tools/