eval-agent-md — Behavioral Compliance Testing
What This Does
- Reads a CLAUDE.md (or agent .md file)
- Auto-generates behavioral test scenarios for each rule it finds
- Runs each scenario via `claude -p` with LLM-as-judge scoring
- Reports a compliance score with per-rule pass/fail breakdown
- Optionally runs an automated mutation loop to improve failing rules
Workflow
Progress Reporting
This skill runs long operations (30s-5min per step). Always keep the user informed:
- Before each step, tell the user what is about to happen and roughly how long it takes
- Run all scripts via the Bash tool (never capture output) so per-scenario progress streams to the user in real time
- After each step completes, give a brief transition summary before starting the next step
- Set an appropriate timeout on Bash calls (120s for generation, 600s for eval/mutation)
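As an aside, if a wrapper script ever needs to enforce such a limit itself (rather than relying on the Bash tool's timeout parameter, which is the intended mechanism here), the GNU `timeout` utility is one option:

```bash
# A step that exceeds its limit is killed; GNU timeout reports exit status 124.
status=0
timeout 1 sleep 3 || status=$?
echo "exit status: $status"   # prints "exit status: 124"
```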
Step 1: Locate the target file
Find the CLAUDE.md to test. Priority order:
- If the user provided a path argument (e.g., `/eval-agent-md ./CLAUDE.md`), use that
- If a project-level CLAUDE.md exists in the current working directory, use that
- Fall back to `~/.claude/CLAUDE.md` (user global)
- If none is found, ask the user
Read the file and confirm with the user: "I found your CLAUDE.md at [path] ([N] lines). Testing this file."
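The priority order above can be sketched as a small shell lookup (an illustration; `$1` stands for the optional path argument):

```bash
# Resolve the target file: explicit argument > project CLAUDE.md > user global.
resolve_target() {
  if [ -n "$1" ]; then
    echo "$1"                           # explicit path argument wins
  elif [ -f "./CLAUDE.md" ]; then
    echo "./CLAUDE.md"                  # project-level file
  elif [ -f "$HOME/.claude/CLAUDE.md" ]; then
    echo "$HOME/.claude/CLAUDE.md"      # user-global fallback
  else
    return 1                            # nothing found: ask the user
  fi
}
```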
Step 2: Generate test scenarios
Tell the user: "Generating test scenarios from [filename]... this calls `claude -p --model sonnet` and typically takes 30-60 seconds."

Run the scenario generator script bundled with this skill. IMPORTANT: Do NOT capture output — run it via the Bash tool so the user sees progress lines in real time:

```bash
[SKILL_DIR]/scripts/generate-scenarios.py [TARGET_FILE]
```

The script auto-detects the repository name from git and saves to `/tmp/eval-agent-md-<repo>-scenarios.yaml` (e.g., `/tmp/eval-agent-md-my-project-scenarios.yaml`). Override with `--repo-name NAME` or `-o PATH`.

After generation, read the output file and show the user a summary:
- How many scenarios were generated
- Which rules each scenario tests
- A brief preview of each scenario's prompt
Ask the user: "Generated [N] test scenarios. Ready to run? (Or edit/skip any?)"
Step 3: Run behavioral tests
Tell the user: "Running [N] scenarios x [runs] run(s) against [model]... each scenario calls `claude -p` twice (subject + judge), so this takes a few minutes. You'll see per-scenario results as they complete."

IMPORTANT: Do NOT capture output — run it via the Bash tool so the user sees per-scenario progress (`[1/N] scenario_id... PASS/FAIL (Xs)`) in real time:

```bash
[SKILL_DIR]/scripts/eval-behavioral.py \
  --scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
  --claude-md [TARGET_FILE] \
  --runs 1 \
  --model sonnet
```

Options the user can control:
- `--runs N` — runs per scenario for majority vote (default: 1; recommend 3 for reliability)
- `--model MODEL` — model for the test subject (default: sonnet)
- `--compare-models` — run across haiku/sonnet/opus and show a comparison matrix
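With multiple runs, each scenario's verdict is decided by majority vote. The idea can be sketched in shell (an illustration of the voting rule only, not the script's actual implementation; here a tie counts as FAIL):

```bash
# Majority vote over per-run verdicts: PASS wins only if it takes
# strictly more than half of the runs.
majority_verdict() {
  local pass=0 total=0 v
  for v in "$@"; do
    total=$((total + 1))
    [ "$v" = "PASS" ] && pass=$((pass + 1))
  done
  if [ $((pass * 2)) -gt "$total" ]; then
    echo PASS
  else
    echo FAIL
  fi
}
```

With `--runs 3`, one noisy FAIL out of three runs no longer flips the scenario's verdict.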
Step 4: Report results
Print a compliance report:
Compliance Report — [filename]
Score: 8/10 (80%)
| Scenario | Rule | Verdict | Evidence |
|---|---|---|---|
| gate1_think | GATE-1 | PASS | Lists assumptions before code |
| ... | ... | ... | ... |
Failing Rules
- [rule]: [what went wrong] — suggested fix: [brief suggestion]
Step 5: Improve (optional)
If the user says "improve", "fix", or passed `--improve`:

Tell the user: "Starting mutation loop (dry-run) — this iteratively generates wording fixes for failing rules and A/B tests them. Each iteration takes 1-2 minutes."

IMPORTANT: Do NOT capture output — run it via the Bash tool so the user sees iteration progress in real time:

```bash
[SKILL_DIR]/scripts/mutate-loop.py \
  --target [TARGET_FILE] \
  --scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
  --max-iterations 3 \
  --runs 3 \
  --model sonnet
```

The loop is dry-run by default. Show the user each suggested mutation and ask before applying.
Arguments
Parse the user's `/eval-agent-md` invocation for these optional arguments:
- `[path]` — target file (positional, e.g., `/eval-agent-md ./CLAUDE.md`)
- `--improve` — run the mutation loop after testing
- `--runs N` — runs per scenario (default: 1)
- `--model MODEL` — model for the test subject (default: sonnet)
- `--compare-models` — cross-model comparison (haiku/sonnet/opus)
- `--agent` — hint that the target is an agent definition file (adjusts generation style)
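The flag handling above can be sketched in shell (illustrative only; in practice the agent parses the invocation itself, and the variable names here are made up):

```bash
# Collect flags and the positional target path into shell variables.
parse_args() {
  TARGET="" IMPROVE=0 RUNS=1 MODEL=sonnet COMPARE=0 AGENT=0
  while [ $# -gt 0 ]; do
    case "$1" in
      --improve)        IMPROVE=1 ;;
      --runs)           RUNS="$2"; shift ;;
      --model)          MODEL="$2"; shift ;;
      --compare-models) COMPARE=1 ;;
      --agent)          AGENT=1 ;;
      *)                TARGET="$1" ;;   # positional path
    esac
    shift
  done
}
```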
Examples
Positive Trigger
User: "Run compliance tests against my CLAUDE.md to check if all rules are being followed."
Expected behavior: Use the eval-agent-md workflow — locate the CLAUDE.md, generate test scenarios, run behavioral tests, and report compliance results.
Non-Trigger
User: "Add a new linting rule to our ESLint config."
Expected behavior: Do not use this skill. Choose a more relevant skill or proceed directly.
Troubleshooting
Scenario Generation Fails
- Error: `generate-scenarios.py` exits with a non-zero status or produces empty output.
- Cause: The target CLAUDE.md has no detectable rules or structured sections for the generator to parse.
- Solution: Ensure the target file contains clearly structured rules (headings, numbered items, or labeled sections). Try a simpler file first to confirm the script works.
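For that simpler smoke-test file, something minimal but clearly structured works (the rule names below are illustrative, not from this skill):

```bash
# Write a minimal, clearly structured rules file to test the generator against.
cat > /tmp/sample-CLAUDE.md <<'EOF'
# Rules

1. GATE-1: List assumptions before writing any code.
2. GATE-2: Ask before running destructive commands.
EOF
```

If generation succeeds on this file but not on the real target, the problem is the target's structure, not the script.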
Low Compliance Score Despite Correct Rules
- Error: Multiple scenarios report FAIL even though the CLAUDE.md rules look correct.
- Cause: Single-run mode (`--runs 1`) is susceptible to LLM variance. The model may not follow rules consistently in a single sample.
- Solution: Re-run with `--runs 3` for majority-vote scoring to reduce noise.
Scripts Not Found
- Error: `No such file or directory` when running skill scripts.
- Cause: The skill directory path is not resolving correctly, or the scripts lack execute permissions.
- Solution: Verify the skill is installed at the expected path and run `chmod +x` on the scripts in the skill's `scripts/` directory.
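Both causes can be checked in one pass (a sketch; pass the skill's resolved install directory as the argument):

```bash
# Given the resolved skill directory, make every bundled script executable.
# Returns non-zero if the glob matches nothing, i.e. the path is wrong.
ensure_executable() {
  local f found=0
  for f in "$1"/scripts/*.py; do
    [ -e "$f" ] || break            # glob didn't match: wrong path
    chmod +x "$f"
    found=1
  done
  [ "$found" -eq 1 ]
}
```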
Notes
- All scripts use `uv run --script` — no pip install needed
- The judge always uses haiku (cheap, fast, reliable for scoring)
- Generated scenarios are ephemeral (temp dir) — they adapt to the current file state
- For agent .md files, the generator creates role-boundary scenarios (e.g., "does the reviewer avoid writing code?")
- Scripts are in this skill's `scripts/` directory