Skill Evaluation & Improvement
Measure and improve skill quality through empirical testing — because structure doesn't guarantee behavior, and measurement beats assumption.
Operator Context
This skill operates as the eval-driven improvement pipeline for Claude Code skills. It provides four capabilities: trigger evaluation, description optimization, output benchmarking, and structural validation.
Hardcoded Behaviors (Always Apply)
- Measure before changing: Always run baseline eval before making improvements
- Train/test split: Use 60/40 holdout to prevent overfitting descriptions
- Generalize, don't overfit: Improvements should help across many prompts, not just test cases
- Report results: Always show before/after metrics
Default Behaviors (ON unless disabled)
- HTML reports: Generate visual reports for description optimization
- Verbose output: Show per-query pass/fail during eval runs
- 3 runs per query: Run each trigger test 3 times for reliability
Optional Behaviors (OFF unless enabled)
- Blind A/B comparison: Use comparator agent for unbiased output comparison
- Full benchmark suite: Run aggregate benchmarks with timing and token metrics
What This Skill CAN Do
- Test whether a skill's description triggers correctly for a set of queries
- Optimize descriptions via automated eval+improve loop (train/test split)
- Benchmark skill output quality (with-skill vs without-skill)
- Validate skill structure (frontmatter, naming, description length)
- Generate HTML reports for visual review
What This Skill CANNOT Do
- Create new skills from scratch (use skill-creator-engineer)
- Modify skill instructions automatically (human reviews changes)
- Test skills that require specific MCP servers or external services
- Run evals without the CLI available
Instructions
Phase 1: ASSESS — Determine what to evaluate
Step 1: Identify the skill
bash
# Validate skill structure first
python3 -m scripts.skill_eval.quick_validate <path/to/skill>
Step 2: Choose evaluation mode
| User Intent | Mode | Script |
|---|
| "Test if description triggers correctly" | Trigger eval | |
| "Optimize/improve the description" | Description optimization | |
| "Compare skill vs no-skill output" | Output benchmark | Manual + |
| "Validate skill structure" | Quick validate | |
GATE: Skill path confirmed, mode selected.
Phase 2: EVALUATE — Run the appropriate evaluation
Mode A: Trigger Evaluation
Test whether a skill's description causes Claude to invoke it for the right queries.
Step 1: Create eval set (or use existing)
Create a JSON file with 8-20 test queries:
json
[
{"query": "realistic user prompt that should trigger", "should_trigger": true},
{"query": "similar but different domain prompt", "should_trigger": false}
]
Eval set quality matters — use realistic prompts with detail (file paths, context, casual phrasing), not abstract one-liners. Focus on edge cases where the skill competes with adjacent skills.
Step 2: Run evaluation
bash
python3 -m scripts.skill_eval.run_eval \
--eval-set evals.json \
--skill-path <path/to/skill> \
--runs-per-query 3 \
--verbose
This spawns
for each query, checking whether it invokes the skill. Output includes pass/fail per query with trigger rates.
GATE: Eval results available. Proceed to improvement if failures found.
Mode B: Description Optimization
Automated loop that tests, improves, and re-tests descriptions.
bash
python3 -m scripts.skill_eval.run_loop \
--eval-set evals.json \
--skill-path <path/to/skill> \
--model claude-opus-4-6 \
--max-iterations 5 \
--verbose
This will:
- Split eval set 60/40 train/test (stratified by should_trigger)
- Evaluate current description on all queries (3 runs each)
- Use Claude with extended thinking to propose improvements based on failures
- Re-evaluate the new description
- Repeat until all pass or max iterations reached
- Select best description by test score (prevents overfitting)
- Open an HTML report in the browser
GATE: Loop complete. Best description identified.
Mode C: Output Benchmark
Compare skill quality by running prompts with and without the skill.
Step 1: Create test prompts — 2-3 realistic user prompts
Step 2: Run with-skill and without-skill in parallel subagents:
For each test prompt, spawn two agents:
- With skill: Load the skill, run the prompt, save outputs
- Without skill (baseline): Same prompt, no skill, save outputs
Step 3: Grade outputs
Spawn a grader subagent using the prompt in
. It evaluates assertions against the outputs.
Step 4: Aggregate
bash
python3 -m scripts.skill_eval.aggregate_benchmark <workspace>/iteration-1 --skill-name <name>
Produces
and
with pass rates, timing, and token usage.
Step 5: Analyze (optional)
For blind comparison, use
to judge outputs without knowing which skill produced them. Then use
to understand why the winner won.
GATE: Benchmark results available.
Mode D: Quick Validate
bash
python3 -m scripts.skill_eval.quick_validate <path/to/skill>
Checks: SKILL.md exists, valid frontmatter, required fields (name, description), kebab-case naming, description under 1024 chars, no angle brackets.
Phase 3: IMPROVE — Apply results
Step 1: Review results
For trigger eval / description optimization:
- Show the best description vs original
- Show per-query results (which queries improved, which regressed)
- Show train vs test scores
For output benchmark:
- Show pass rate delta (with-skill vs without-skill)
- Show timing and token cost delta
- Highlight assertions that only pass with the skill (value-add)
Step 2: Apply changes (with user confirmation)
If description optimization found a better description:
- Show before/after with scores
- Ask user to confirm
- Update the skill's SKILL.md frontmatter
- Re-run quick_validate to confirm the update is valid
GATE: Changes applied and validated, or user chose to keep original.
Error Handling
Error: "No SKILL.md found"
Cause: Skill path doesn't point to a valid skill directory
Solution: Verify path contains a
file. Skills must follow the
structure.
Error: "claude: command not found"
Cause: Claude CLI not available for trigger evaluation
Solution: Install Claude Code CLI. Trigger eval requires
to test skill invocation.
Error: "anthropic SDK not installed"
Cause: Description optimization requires the Anthropic Python SDK
Solution:
. Only needed for
and
.
Error: "CLAUDECODE environment variable"
Cause: Running eval from inside a Claude Code session blocks nested instances
Solution: The scripts automatically strip the
env var. If issues persist, run from a separate terminal.
Error: "All queries timeout"
Cause: Default 30s timeout too short for complex queries
Solution: Increase with
. Simple trigger queries should complete in <15s.
Anti-Patterns
Anti-Pattern 1: Abstract Eval Queries
What it looks like:
,
Why wrong: Real users write detailed, specific prompts. Abstract queries don't test real triggering behavior.
Do instead:
"ok so my boss sent me this xlsx file (Q4 sales final FINAL v2.xlsx) and she wants profit margin as a percentage"
Anti-Pattern 2: Overfitting to Test Cases
What it looks like: Adding specific query text to the description to force triggers
Why wrong: Works for test set, fails on real usage. Bloats the description.
Do instead: Generalize from failures to broader categories of user intent.
Anti-Pattern 3: Skipping Baseline
What it looks like: Running with-skill only, no without-skill comparison
Why wrong: Can't prove the skill adds value without a baseline. Maybe Claude handles it fine without the skill.
Do instead: Always run both configurations. The delta is what matters.
References
Scripts (in )
- — Trigger evaluation: tests description against query set
- — Eval+improve loop: automated description optimization
- — Single-shot description improvement via Claude API
- — HTML report from loop output
- — Benchmark aggregation from grading results
- — Structural validation of SKILL.md
Bundled Agents (in skills/skill-eval/agents/
)
- — Evaluates assertions against execution outputs
- — Blind A/B comparison of two outputs
- — Post-hoc analysis of why one version beat another
Reference Files
${CLAUDE_SKILL_DIR}/references/schemas.md
— JSON schemas for evals.json, grading.json, benchmark.json
Shared Patterns
- Anti-Rationalization
- Verification Checklist