---
argument-hint: '[<skill-name>|--all]'
user-invocable: true
model-invocable: true
---

# skill-eval

Re-run baseline evaluations on one or more skills. Uses the `evals.json` test definitions committed in each skill, dispatches pressure scenarios via subagents, saves transcripts to a gitignored workspace, and grades the runs deterministically.
## When to use `skill-eval`
Verbatim trigger phrases:
- "rerun the baselines"
- "re-eval skill X"
- "test all the skills"
- "check for skill drift"
- "run the evals"
- "did skill X still pass"
## When NOT to use
- Authoring a new skill — use `/skill-creator` instead
- Modifying skill body content — just edit the SKILL.md
- Running unit tests for `packages/skill-tools` itself (those are vitest, not skill evals)
## Inputs
- `$ARGUMENTS` — one of:
  - `<skill-name>` — re-eval one skill (looks under `skills/` and `.agents/skills/`)
  - `--all` — re-eval every skill that has an `evals.json`
  - empty — same as `--all`
## Workflow
### 1. Resolve target skills
If `$ARGUMENTS` is a skill name, look in `skills/<name>/` then `.agents/skills/<name>/`. Confirm `evals.json` exists. If not, abort with an error pointing the user at `/skill-creator` (or to author `evals.json` by hand).

If `--all` (or empty), find every directory with both `SKILL.md` and `evals.json` under those two roots.
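The `--all` discovery step can be sketched in plain shell. This is an illustrative approximation only, not the command's actual implementation; the fixture directories exist purely to make the sketch self-contained:

```shell
# Hypothetical sketch of --all discovery: collect every directory under
# the two roots that contains both SKILL.md and evals.json.
# The mktemp fixtures below are illustration only.
cd "$(mktemp -d)"
mkdir -p skills/a skills/b .agents/skills/c
touch skills/a/SKILL.md skills/a/evals.json                   # qualifies
touch skills/b/SKILL.md                                       # no evals.json
touch .agents/skills/c/SKILL.md .agents/skills/c/evals.json   # qualifies

found=$(
  for root in skills .agents/skills; do
    [ -d "$root" ] || continue
    for dir in "$root"/*/; do
      [ -f "${dir}SKILL.md" ] && [ -f "${dir}evals.json" ] && echo "$dir"
    done
  done
)
echo "$found"   # lists skills/a/ and .agents/skills/c/, skips skills/b/
```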
### 2. Determine the next iteration number
For each target skill, look at its nested workspace at `skills/<skill-name>/.workspace/` (or `.agents/skills/<skill-name>/.workspace/`, depending on the skill's location). If it doesn't exist, the next iteration is `1`. Otherwise scan `iteration-N/` directories and use `max(N) + 1`.

Create `skills/<skill-name>/.workspace/iteration-<N>/` (the `.workspace/` pattern is gitignored).
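The `max(N) + 1` scan can be sketched as follows. This is a hedged illustration, not the tool's real code; the fixture workspace and skill name are invented:

```shell
# Hypothetical sketch: compute the next iteration number for one skill.
# Fixture dirs stand in for a real gitignored .workspace.
cd "$(mktemp -d)"
ws="skills/my-skill/.workspace"
mkdir -p "$ws/iteration-1" "$ws/iteration-3"

next=1
if [ -d "$ws" ]; then
  for d in "$ws"/iteration-*/; do
    n="${d%/}"; n="${n##*iteration-}"
    case "$n" in (*[!0-9]*) continue ;; esac   # skip non-numeric suffixes
    [ "$n" -ge "$next" ] && next=$((n + 1))
  done
fi
echo "next iteration: $next"   # → next iteration: 4 (max(1, 3) + 1)
```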
### 3. Dispatch each eval via the Agent tool
For every eval in `evals.json`, run two subagent dispatches:
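As a rough illustration of what an `evals.json` entry pairs together, a sketch follows. The field names here are assumptions for readability only; the authoritative schema is documented in `skills/skill-creator/references/evals-json.md`:

```json
{
  "evals": [
    {
      "id": 0,
      "name": "validate-config",
      "prompt": "Write a function that validates the app config.",
      "assertions": [
        { "type": "regex", "value": "satisfies\\s+AppConfig" },
        { "type": "contains", "value": "zod" },
        { "type": "file_exists", "value": "src/config.ts" }
      ]
    }
  ]
}
```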
#### 3a. WITHOUT skill (RED baseline)
Use the Agent tool with `subagent_type: general-purpose`. The prompt template:

```
Execute this task exactly:

[eval.prompt]

No skill is loaded for this task. After attempting it, report what you did,
what decisions you made and why, and anything you found tricky. Report
verbatim — do not polish, do not summarize. Include any code you wrote
inline so it can be analyzed.
```

Save the agent's reply (the entire response text) to:
`skills/<skill-name>/.workspace/iteration-<N>/eval-<id>-<eval_name>/without_skill/transcript.md`
#### 3b. WITH skill (GREEN run)
Use the Agent tool again, this time including the target skill's full SKILL.md content as system context. The prompt template:

```
Execute this task exactly:

[eval.prompt]

The skill `<skill-name>` is available — apply its rules and patterns.
After attempting it, report what you did, what decisions you made and why,
and anything you found tricky. Report verbatim — do not polish, do not
summarize. Include any code you wrote inline.

If you considered skipping any rule from the skill, capture the exact
reasoning verbatim — that's the kind of failure mode the skill needs to
catch.
```

Save the response to:
`skills/<skill-name>/.workspace/iteration-<N>/eval-<id>-<eval_name>/with_skill/transcript.md`
### 4. Grade each transcript
For each transcript saved, invoke `skill-tools eval` to run assertions:

```bash
node packages/skill-tools/dist/index.mjs eval <skill-name> <eval.id> \
  --variant <with_skill|without_skill> \
  --iteration <N> \
  --transcript <path-to-transcript.md>
```

This writes `grading.json` next to the transcript. Each assertion is regex / contains / file_exists — deterministic, no LLM-as-judge.
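The three assertion kinds can be approximated with shell primitives to show why grading is deterministic. This sketch is not the real checker (that lives inside skill-tools), and the transcript content is invented:

```shell
# Illustrative approximation of the three deterministic assertion kinds.
# regex / contains / file_exists need no LLM judgment — each is a plain
# mechanical check against the saved transcript.
cd "$(mktemp -d)"
transcript="transcript.md"
printf 'Ran eslint --fix and added a satisfies check\n' > "$transcript"

grep -qE 'eslint --(fix|cache)' "$transcript" && echo "regex: pass"
grep -qF 'satisfies' "$transcript"            && echo "contains: pass"
[ -e "$transcript" ]                          && echo "file_exists: pass"
```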
### 5. Generate the benchmark
After all evals are graded for a skill:

```bash
node packages/skill-tools/dist/index.mjs benchmark <skill-name>
```

This aggregates the grading.json files into `benchmark.json` and `benchmark.md` for that iteration.
### 6. Report
Summarize per skill:
- Which evals improved with the skill loaded vs. without
- Any evals that failed with the skill loaded (regression to investigate)
- Path to the benchmark and to the latest iteration directory
Suggest the user run `pnpm skill-tools view <skill-name>` to navigate transcripts in the TUI.

## Examples
<example>
<input>"rerun the baselines for ts-best-practices"</input>
<output>
1. Resolve: skill at `skills/ts-best-practices/`, evals.json present with 3 cases.
2. Workspace: `skills/ts-best-practices/.workspace/iteration-2/` (iteration-1 already exists).
3. For each eval: dispatch Agent(general-purpose) twice (without/with skill), save transcripts.
4. Grade each transcript via skill-tools eval.
5. Aggregate via skill-tools benchmark → benchmark.md.
6. Report: "with skill 7/9 passed, without skill 3/9 — improvement on eval-1, regression on eval-2".
</output>
</example>
<example>
<input>"check for skill drift on all skills"</input>
<output>
1. --all: find every dir under skills/ + .agents/skills/ with both SKILL.md and evals.json.
2. For each: run the same 6-step workflow.
3. Aggregate report: any skills where regression count exceeds improvement count are flagged for review.
</output>
</example>
<good>
Saved transcript verbatim to:
skills/ts-best-practices/.workspace/iteration-2/eval-0-validate-config/with_skill/transcript.md
</good>
<bad>
The agent did roughly what we expected. Skipping transcript save.
</bad>
Always save raw transcripts verbatim — paraphrasing or dropping output makes regression detection impossible.
## References
- `evals.json` schema is documented in `skills/skill-creator/references/evals-json.md`
- skill-tools CLI reference: `packages/skill-tools/`