skill-eval


Skill Evaluation & Improvement


Measure and improve skill quality through empirical testing — because structure doesn't guarantee behavior, and measurement beats assumption.

Operator Context


This skill operates as the eval-driven improvement pipeline for Claude Code skills. It provides four capabilities: trigger evaluation, description optimization, output benchmarking, and structural validation.

Hardcoded Behaviors (Always Apply)


  • Measure before changing: Always run baseline eval before making improvements
  • Train/test split: Use 60/40 holdout to prevent overfitting descriptions
  • Generalize, don't overfit: Improvements should help across many prompts, not just test cases
  • Report results: Always show before/after metrics

Default Behaviors (ON unless disabled)


  • HTML reports: Generate visual reports for description optimization
  • Verbose output: Show per-query pass/fail during eval runs
  • 3 runs per query: Run each trigger test 3 times for reliability

Optional Behaviors (OFF unless enabled)


  • Blind A/B comparison: Use comparator agent for unbiased output comparison
  • Full benchmark suite: Run aggregate benchmarks with timing and token metrics

What This Skill CAN Do


  • Test whether a skill's description triggers correctly for a set of queries
  • Optimize descriptions via automated eval+improve loop (train/test split)
  • Benchmark skill output quality (with-skill vs without-skill)
  • Validate skill structure (frontmatter, naming, description length)
  • Generate HTML reports for visual review

What This Skill CANNOT Do


  • Create new skills from scratch (use skill-creator-engineer)
  • Modify skill instructions automatically (human reviews changes)
  • Test skills that require specific MCP servers or external services
  • Run evals without the `claude` CLI available


Instructions


Phase 1: ASSESS — Determine what to evaluate


Step 1: Identify the skill

Validate skill structure first:

```bash
python3 -m scripts.skill_eval.quick_validate <path/to/skill>
```

**Step 2: Choose evaluation mode**

| User Intent | Mode | Script |
|------------|------|--------|
| "Test if description triggers correctly" | Trigger eval | `run_eval.py` |
| "Optimize/improve the description" | Description optimization | `run_loop.py` |
| "Compare skill vs no-skill output" | Output benchmark | Manual + `aggregate_benchmark.py` |
| "Validate skill structure" | Quick validate | `quick_validate.py` |

**GATE**: Skill path confirmed, mode selected.

Phase 2: EVALUATE — Run the appropriate evaluation


Mode A: Trigger Evaluation


Test whether a skill's description causes Claude to invoke it for the right queries.
Step 1: Create eval set (or use existing)
Create a JSON file with 8-20 test queries:
```json
[
  {"query": "realistic user prompt that should trigger", "should_trigger": true},
  {"query": "similar but different domain prompt", "should_trigger": false}
]
```
Eval set quality matters — use realistic prompts with detail (file paths, context, casual phrasing), not abstract one-liners. Focus on edge cases where the skill competes with adjacent skills.
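A quick sanity check on an eval set can catch imbalance before burning eval runs. Here is a minimal sketch; the `check_eval_set` helper and its thresholds are illustrative assumptions, not part of the bundled scripts, and it assumes the schema shown above:

```python
import json

def check_eval_set(path: str) -> dict:
    """Load an evals.json file and report its positive/negative balance.

    Illustrative helper, not part of scripts/skill_eval; assumes the
    [{"query": ..., "should_trigger": ...}] schema shown above.
    """
    with open(path) as f:
        cases = json.load(f)
    positives = sum(1 for c in cases if c["should_trigger"])
    negatives = len(cases) - positives
    return {
        "total": len(cases),
        "positives": positives,
        "negatives": negatives,
        # 8-20 queries with both classes present gives the later
        # 60/40 split something to stratify on.
        "ok": 8 <= len(cases) <= 20 and positives > 0 and negatives > 0,
    }
```

A set that fails the `ok` check is worth fixing before spending any `claude -p` runs on it.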
Step 2: Run evaluation
```bash
python3 -m scripts.skill_eval.run_eval \
  --eval-set evals.json \
  --skill-path <path/to/skill> \
  --runs-per-query 3 \
  --verbose
```
This spawns `claude -p` for each query, checking whether it invokes the skill. Output includes pass/fail per query with trigger rates.
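With 3 runs per query, each per-query result reduces to a trigger rate and a verdict. A sketch of that reduction; the majority-vote pass criterion here is an assumption for illustration, and `run_eval.py` defines the actual rule:

```python
def grade_query(should_trigger: bool, run_results: list[bool]) -> dict:
    """Turn raw per-run trigger observations into a pass/fail verdict.

    Assumption for illustration: a query passes when the majority of
    runs match the expected should_trigger value.
    """
    matches = sum(1 for triggered in run_results if triggered == should_trigger)
    return {
        # fraction of runs where the skill actually triggered
        "trigger_rate": sum(run_results) / len(run_results),
        "passed": matches > len(run_results) / 2,
    }
```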
GATE: Eval results available. Proceed to improvement if failures found.

Mode B: Description Optimization


Automated loop that tests, improves, and re-tests descriptions.
```bash
python3 -m scripts.skill_eval.run_loop \
  --eval-set evals.json \
  --skill-path <path/to/skill> \
  --model claude-opus-4-6 \
  --max-iterations 5 \
  --verbose
```
This will:
  1. Split eval set 60/40 train/test (stratified by should_trigger)
  2. Evaluate current description on all queries (3 runs each)
  3. Use Claude with extended thinking to propose improvements based on failures
  4. Re-evaluate the new description
  5. Repeat until all pass or max iterations reached
  6. Select best description by test score (prevents overfitting)
  7. Open an HTML report in the browser
GATE: Loop complete. Best description identified.
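The stratified split in step 1 can be pictured as follows. This is a minimal sketch assuming the evals.json schema above; `run_loop.py`'s actual shuffling and rounding may differ:

```python
import random

def stratified_split(cases, train_frac=0.6, seed=0):
    """Split eval cases train/test, stratified by should_trigger.

    Sketch only: each label group is shuffled and cut separately so
    both splits keep roughly the same positive/negative ratio.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label in (True, False):
        group = [c for c in cases if c["should_trigger"] is label]
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test
```

Because the cut happens per label, a 20-query set with 10 positives yields 6 positives in train and 4 in test, rather than whatever an unstratified shuffle happens to produce.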

Mode C: Output Benchmark


Compare skill quality by running prompts with and without the skill.
Step 1: Create test prompts — 2-3 realistic user prompts
Step 2: Run with-skill and without-skill in parallel subagents:
For each test prompt, spawn two agents:
  • With skill: Load the skill, run the prompt, save outputs
  • Without skill (baseline): Same prompt, no skill, save outputs
Step 3: Grade outputs
Spawn a grader subagent using the prompt in `agents/grader.md`. It evaluates assertions against the outputs.
Step 4: Aggregate
```bash
python3 -m scripts.skill_eval.aggregate_benchmark <workspace>/iteration-1 --skill-name <name>
```
Produces `benchmark.json` and `benchmark.md` with pass rates, timing, and token usage.
Step 5: Analyze (optional)
For blind comparison, use `agents/comparator.md` to judge outputs without knowing which skill produced them. Then use `agents/analyzer.md` to understand why the winner won.
GATE: Benchmark results available.
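The value-add in steps 4-5 comes down to a delta between the two configurations. A minimal sketch of that arithmetic; the aligned pass/fail lists used here are an illustrative assumption, not the real grading.json schema (see `references/schemas.md` for that):

```python
def pass_rate_delta(with_skill: list[bool], without_skill: list[bool]) -> dict:
    """Compare assertion outcomes from the with-skill and baseline runs.

    Inputs are per-assertion pass/fail lists aligned by assertion index;
    this shape is an illustrative assumption, not the grading.json schema.
    """
    def rate(xs):
        return sum(xs) / len(xs)
    return {
        "with_rate": rate(with_skill),
        "without_rate": rate(without_skill),
        "delta": rate(with_skill) - rate(without_skill),
        # assertions that only pass with the skill: the value-add
        "skill_only_passes": [
            i for i, (w, b) in enumerate(zip(with_skill, without_skill))
            if w and not b
        ],
    }
```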

Mode D: Quick Validate


```bash
python3 -m scripts.skill_eval.quick_validate <path/to/skill>
```
Checks: SKILL.md exists, valid frontmatter, required fields (name, description), kebab-case naming, description under 1024 chars, no angle brackets.
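The checks above amount to a handful of string and regex tests. A rough sketch for orientation only; `quick_validate.py` is the authoritative implementation:

```python
import re

def validate_skill(name: str, description: str) -> list[str]:
    """Return a list of structural problems; an empty list means valid.

    Mirrors the checks listed above; quick_validate.py is authoritative.
    """
    problems = []
    if not name:
        problems.append("missing required field: name")
    elif not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name):
        problems.append("name is not kebab-case")
    if not description:
        problems.append("missing required field: description")
    else:
        if len(description) > 1024:
            problems.append("description over 1024 chars")
        if "<" in description or ">" in description:
            problems.append("description contains angle brackets")
    return problems
```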

Phase 3: IMPROVE — Apply results


Step 1: Review results
For trigger eval / description optimization:
  • Show the best description vs original
  • Show per-query results (which queries improved, which regressed)
  • Show train vs test scores
For output benchmark:
  • Show pass rate delta (with-skill vs without-skill)
  • Show timing and token cost delta
  • Highlight assertions that only pass with the skill (value-add)
Step 2: Apply changes (with user confirmation)
If description optimization found a better description:
  1. Show before/after with scores
  2. Ask user to confirm
  3. Update the skill's SKILL.md frontmatter
  4. Re-run quick_validate to confirm the update is valid
GATE: Changes applied and validated, or user chose to keep original.
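Updating the frontmatter in step 2.3 is a small, mechanical edit. A sketch of it, assuming a single-line `description:` entry in SKILL.md's YAML frontmatter (multi-line descriptions would need real YAML handling):

```python
import re

def replace_description(skill_md: str, new_description: str) -> str:
    """Swap the description field inside SKILL.md's YAML frontmatter.

    Sketch assuming a single-line `description:` entry; only the
    frontmatter (before the closing ---) is touched, so any
    'description:' text in the body is left alone.
    """
    head, sep, body = skill_md.partition("\n---")
    new_head = re.sub(
        r"(?m)^description:.*$",
        f"description: {new_description}",
        head,
        count=1,
    )
    return new_head + sep + body
```

After writing the result back, re-running quick_validate (step 2.4) confirms the new description still passes the structural checks.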


Error Handling


Error: "No SKILL.md found"


Cause: Skill path doesn't point to a valid skill directory.
Solution: Verify the path contains a `SKILL.md` file. Skills must follow the `skill-name/SKILL.md` structure.

Error: "claude: command not found"


Cause: Claude CLI not available for trigger evaluation.
Solution: Install the Claude Code CLI. Trigger eval requires `claude -p` to test skill invocation.

Error: "anthropic SDK not installed"


Cause: Description optimization requires the Anthropic Python SDK.
Solution: `pip install anthropic`. Only needed for `improve_description.py` and `run_loop.py`.

Error: "CLAUDECODE environment variable"


Cause: Running eval from inside a Claude Code session blocks nested instances.
Solution: The scripts automatically strip the `CLAUDECODE` env var. If issues persist, run from a separate terminal.

Error: "All queries timeout"


Cause: The default 30s timeout is too short for complex queries.
Solution: Increase it with `--timeout 60`. Simple trigger queries should complete in under 15s.


Anti-Patterns


Anti-Pattern 1: Abstract Eval Queries


What it looks like: "Format this data", "Create a chart"
Why wrong: Real users write detailed, specific prompts. Abstract queries don't test real triggering behavior.
Do instead: "ok so my boss sent me this xlsx file (Q4 sales final FINAL v2.xlsx) and she wants profit margin as a percentage"

Anti-Pattern 2: Overfitting to Test Cases


What it looks like: Adding specific query text to the description to force triggers
Why wrong: Works for the test set, fails on real usage. Bloats the description.
Do instead: Generalize from failures to broader categories of user intent.

Anti-Pattern 3: Skipping Baseline


What it looks like: Running with-skill only, no without-skill comparison
Why wrong: Can't prove the skill adds value without a baseline. Maybe Claude handles it fine without the skill.
Do instead: Always run both configurations. The delta is what matters.


References


Scripts (in `scripts/skill_eval/`)


  • `run_eval.py` — Trigger evaluation: tests description against query set
  • `run_loop.py` — Eval+improve loop: automated description optimization
  • `improve_description.py` — Single-shot description improvement via Claude API
  • `generate_report.py` — HTML report from loop output
  • `aggregate_benchmark.py` — Benchmark aggregation from grading results
  • `quick_validate.py` — Structural validation of SKILL.md

Bundled Agents (in `skills/skill-eval/agents/`)


  • `grader.md` — Evaluates assertions against execution outputs
  • `comparator.md` — Blind A/B comparison of two outputs
  • `analyzer.md` — Post-hoc analysis of why one version beat another

Reference Files


  • `${CLAUDE_SKILL_DIR}/references/schemas.md` — JSON schemas for evals.json, grading.json, benchmark.json

Shared Patterns


  • Anti-Rationalization
  • Verification Checklist