skill-eval

Re-run baseline evaluations on one or more skills. Uses the evals.json test definitions committed in each skill, dispatches pressure scenarios via subagents, saves transcripts to a gitignored workspace, and grades the runs deterministically.

Claude Code extension (other agents will ignore it) --- argument-hint: '[<skill-name>|--all]' user-invocable: true model-invocable: true

When to use skill-eval

Verbatim trigger phrases:
  • "rerun the baselines"
  • "re-eval skill X"
  • "test all the skills"
  • "check for skill drift"
  • "run the evals"
  • "did skill X still pass"

When NOT to use

  • Authoring a new skill — use /skill-creator instead
  • Modifying skill body content — just edit the SKILL.md
  • Running unit tests for packages/skill-tools itself (those are vitest, not skill evals)

Inputs

  • $ARGUMENTS — one of:
    • <skill-name> — re-eval one skill (looks under skills/ and .agents/skills/)
    • --all — re-eval every skill that has an evals.json
    • empty — same as --all

Workflow

1. Resolve target skills

If $ARGUMENTS is a skill name, look in skills/<name>/ then .agents/skills/<name>/. Confirm evals.json exists. If not, abort with an error pointing the user at /skill-creator (or to author evals.json by hand).
If --all (or empty), find every directory with both SKILL.md and evals.json under those two roots.
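A minimal sketch of this resolution step, assuming Node's built-in fs/path modules; the helper names and error wording are illustrative, not part of skill-tools:

```ts
import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";

const ROOTS = ["skills", ".agents/skills"];

// Resolve a single named skill: the first root containing it wins,
// but only if the skill also ships an evals.json.
function resolveSkill(name: string): string {
  for (const root of ROOTS) {
    const dir = join(root, name);
    if (existsSync(join(dir, "SKILL.md"))) {
      if (!existsSync(join(dir, "evals.json"))) {
        throw new Error(`${dir} has no evals.json; run /skill-creator or author one by hand`);
      }
      return dir;
    }
  }
  throw new Error(`skill "${name}" not found under ${ROOTS.join(" or ")}`);
}

// --all (or empty): every directory with both SKILL.md and evals.json.
function resolveAll(): string[] {
  return ROOTS.filter(existsSync).flatMap((root) =>
    readdirSync(root, { withFileTypes: true })
      .filter((entry) => entry.isDirectory())
      .map((entry) => join(root, entry.name))
      .filter((dir) => existsSync(join(dir, "SKILL.md")) && existsSync(join(dir, "evals.json")))
  );
}
```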

2. Determine the next iteration number

For each target skill, look at its nested workspace at skills/<skill-name>/.workspace/ (or .agents/skills/<skill-name>/.workspace/, depending on the skill's location). If it doesn't exist, the next iteration is 1. Otherwise scan iteration-N/ directories and use max(N) + 1.
Create skills/<skill-name>/.workspace/iteration-<N>/ (the .workspace/ pattern is gitignored).
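A sketch of the numbering logic under the iteration-N naming above (function names are illustrative):

```ts
import { existsSync, readdirSync, mkdirSync } from "node:fs";
import { join } from "node:path";

// Scan <skillDir>/.workspace/ for iteration-N directories and return max(N) + 1.
// A missing workspace means this is iteration 1.
function nextIteration(skillDir: string): number {
  const workspace = join(skillDir, ".workspace");
  if (!existsSync(workspace)) return 1;
  const numbers = readdirSync(workspace)
    .map((name) => /^iteration-(\d+)$/.exec(name))
    .filter((match): match is RegExpExecArray => match !== null)
    .map((match) => Number(match[1]));
  return numbers.length === 0 ? 1 : Math.max(...numbers) + 1;
}

// e.g. skills/ts-best-practices/.workspace/iteration-2/
function createIterationDir(skillDir: string): string {
  const dir = join(skillDir, ".workspace", `iteration-${nextIteration(skillDir)}`);
  mkdirSync(dir, { recursive: true });
  return dir;
}
```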

3. Dispatch each eval via the Agent tool

For every eval in evals.json, run two subagent dispatches:

3a. WITHOUT skill (RED baseline)

Use the Agent tool with subagent_type: general-purpose. The prompt template:
Execute this task exactly:

[eval.prompt]

No skill is loaded for this task. After attempting it, report what you did,
what decisions you made and why, and anything you found tricky. Report
verbatim — do not polish, do not summarize. Include any code you wrote
inline so it can be analyzed.
Save the agent's reply (the entire response text) to:
skills/<skill-name>/.workspace/iteration-<N>/eval-<id>-<eval_name>/without_skill/transcript.md

3b. WITH skill (GREEN run)

Use the Agent tool again, this time including the target skill's full SKILL.md content as system context. The prompt template:
Execute this task exactly:

[eval.prompt]

The skill `<skill-name>` is available — apply its rules and patterns.

After attempting it, report what you did, what decisions you made and why,
and anything you found tricky. Report verbatim — do not polish, do not
summarize. Include any code you wrote inline.

If you considered skipping any rule from the skill, capture the exact
reasoning verbatim — that's the kind of failure mode the skill needs to
catch.
Save the response to:
skills/<skill-name>/.workspace/iteration-<N>/eval-<id>-<eval_name>/with_skill/transcript.md
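For orientation, one entry of the evals.json consumed by steps 3a/3b and 4 might look roughly like the sketch below. The field names are assumptions inferred from how they are referenced in this document (eval.prompt, eval.id, the eval name, and the three assertion types); the authoritative schema is skills/skill-creator/references/evals-json.md.

```ts
// Hypothetical shape of one eval entry; see the references section for the real schema.
type Assertion =
  | { type: "regex"; pattern: string }      // transcript must match this pattern
  | { type: "contains"; text: string }      // transcript must contain this substring
  | { type: "file_exists"; path: string };  // a file the task should have produced

interface EvalCase {
  id: number;              // used in the eval-<id>-<eval_name>/ directory name
  name: string;            // e.g. "validate-config"
  prompt: string;          // substituted for [eval.prompt] in the templates above
  assertions: Assertion[];
}

const example: EvalCase = {
  id: 0,
  name: "validate-config",
  prompt: "Add runtime validation to the config loader in src/config.ts.",
  assertions: [
    { type: "contains", text: "schema" },
    { type: "regex", pattern: "parse|safeParse" },
  ],
};
```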

4. Grade each transcript

For each transcript saved, invoke skill-tools eval to run assertions:
bash
node packages/skill-tools/dist/index.mjs eval <skill-name> <eval.id> \
  --variant <with_skill|without_skill> \
  --iteration <N> \
  --transcript <path-to-transcript.md>
This writes grading.json next to the transcript. Each assertion is regex / contains / file_exists — deterministic, no LLM-as-judge.
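A rough sketch of what deterministic grading of this kind can look like, reusing the hypothetical EvalCase/Assertion shapes sketched earlier; the grading.json layout shown here is invented for illustration and the real skill-tools output may differ:

```ts
import { existsSync, readFileSync, writeFileSync } from "node:fs";

// Every check is a pure function of the transcript text (or the filesystem),
// so the same transcript always grades the same way: no LLM-as-judge.
function checkAssertion(assertion: Assertion, transcript: string): boolean {
  switch (assertion.type) {
    case "regex":
      return new RegExp(assertion.pattern).test(transcript);
    case "contains":
      return transcript.includes(assertion.text);
    case "file_exists":
      return existsSync(assertion.path);
  }
}

function grade(evalCase: EvalCase, transcriptPath: string, gradingPath: string): void {
  const transcript = readFileSync(transcriptPath, "utf8");
  const results = evalCase.assertions.map((assertion) => ({
    assertion,
    passed: checkAssertion(assertion, transcript),
  }));
  writeFileSync(gradingPath, JSON.stringify({ evalId: evalCase.id, results }, null, 2));
}
```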

5. Generate the benchmark

After all evals are graded for a skill:
bash
node packages/skill-tools/dist/index.mjs benchmark <skill-name>
This aggregates the grading.json files into benchmark.json and benchmark.md for that iteration.
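A minimal sketch of the aggregation, assuming the illustrative grading.json layout from the previous sketch; the real benchmark.json produced by skill-tools is likely richer:

```ts
// Aggregate per-variant pass counts across all graded evals in one iteration.
interface GradingFile {
  evalId: number;
  results: { passed: boolean }[];
}

function summarize(gradings: Record<"with_skill" | "without_skill", GradingFile[]>) {
  const count = (files: GradingFile[]) => {
    const all = files.flatMap((file) => file.results);
    return { passed: all.filter((result) => result.passed).length, total: all.length };
  };
  return {
    with_skill: count(gradings.with_skill),       // e.g. { passed: 7, total: 9 }
    without_skill: count(gradings.without_skill), // e.g. { passed: 3, total: 9 }
  };
}
```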

6. Report

Summarize per skill:
  • Which evals improved with the skill loaded vs. without
  • Any evals that failed with the skill loaded (regression to investigate)
  • Path to the benchmark and to the latest iteration directory
Suggest the user run pnpm skill-tools view <skill-name> to navigate transcripts in the TUI.
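One way to phrase the improved/regressed call per eval, again in terms of the illustrative shapes above rather than the real skill-tools output:

```ts
// An eval improved if it passes with the skill loaded but not without;
// a pass without the skill that fails with it is a regression to investigate.
function classify(passedWithSkill: boolean, passedWithoutSkill: boolean): "improved" | "regression" | "unchanged" {
  if (passedWithSkill && !passedWithoutSkill) return "improved";
  if (!passedWithSkill && passedWithoutSkill) return "regression";
  return "unchanged";
}
```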

Examples

<example>
<input>"rerun the baselines for ts-best-practices"</input>
<output>
1. Resolve: skill at `skills/ts-best-practices/`, evals.json present with 3 cases.
2. Workspace: `skills/ts-best-practices/.workspace/iteration-2/` (iteration-1 already exists).
3. For each eval: dispatch Agent(general-purpose) twice (without/with skill), save transcripts.
4. Grade each transcript via skill-tools eval.
5. Aggregate via skill-tools benchmark → benchmark.md.
6. Report: "with skill 7/9 passed, without skill 3/9 — improvement on eval-1, regression on eval-2".
</output>
</example>

<example>
<input>"check for skill drift on all skills"</input>
<output>
1. --all: find every dir under skills/ + .agents/skills/ with both SKILL.md and evals.json.
2. For each: run the same 6-step workflow.
3. Aggregate report: any skills where regression count exceeds improvement count are flagged for review.
</output>
</example>

<good>
Saved transcript verbatim to:
skills/ts-best-practices/.workspace/iteration-2/eval-0-validate-config/with_skill/transcript.md
</good>

<bad>
The agent did roughly what we expected. Skipping transcript save.
</bad>
Always save raw transcripts verbatim — paraphrasing or dropping output makes regression detection impossible.

References

  • evals.json schema is documented in skills/skill-creator/references/evals-json.md
  • skill-tools CLI reference: packages/skill-tools/