
# eval-agent-md — Behavioral Compliance Testing

## What This Does

1. Reads a CLAUDE.md (or agent .md file)
2. Auto-generates behavioral test scenarios for each rule it finds
3. Runs each scenario via `claude -p` with LLM-as-judge scoring
4. Reports a compliance score with a per-rule pass/fail breakdown
5. Optionally runs an automated mutation loop to improve failing rules

## Workflow

### Progress Reporting

This skill runs long operations (30s-5min per step). Always keep the user informed:

- Before each step, tell the user what is about to happen and roughly how long it takes
- Run all scripts via the Bash tool (never capture output) so per-scenario progress streams to the user in real time
- After each step completes, give a brief transition summary before starting the next step
- Set an appropriate timeout on Bash calls (120s for generation, 600s for eval/mutation)

### Step 1: Locate the target file

Find the CLAUDE.md to test. Priority order:

1. If the user provided a path argument (e.g., `/eval-agent-md ./CLAUDE.md`), use that
2. If a project-level CLAUDE.md exists in the current working directory, use that
3. Fall back to `~/.claude/CLAUDE.md` (user global)
4. If none is found, ask the user

Read the file and confirm with the user: "I found your CLAUDE.md at [path] ([N] lines). Testing this file."

### Step 2: Generate test scenarios

Tell the user: "Generating test scenarios from [filename]... this calls `claude -p --model sonnet` and typically takes 30-60 seconds."

Run the scenario generator script bundled with this skill. IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees progress lines in real time:

```bash
[SKILL_DIR]/scripts/generate-scenarios.py [TARGET_FILE]
```

The script auto-detects the repository name from git and saves to `/tmp/eval-agent-md-<repo>-scenarios.yaml` (e.g., `/tmp/eval-agent-md-my-project-scenarios.yaml`). Override with `--repo-name NAME` or `-o PATH`.

After generation, read the output file and show the user a summary:

- How many scenarios were generated
- Which rules each scenario tests
- A brief preview of each scenario's prompt

Ask the user: "Generated [N] test scenarios. Ready to run? (Or edit/skip any?)"

### Step 3: Run behavioral tests

Tell the user: "Running [N] scenarios x [runs] run(s) against [model]... each scenario calls `claude -p` twice (subject + judge), so this takes a few minutes. You'll see per-scenario results as they complete."

IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees per-scenario progress (`[1/N] scenario_id... PASS/FAIL (Xs)`) in real time:

```bash
[SKILL_DIR]/scripts/eval-behavioral.py \
  --scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
  --claude-md [TARGET_FILE] \
  --runs 1 \
  --model sonnet
```

Options the user can control:

- `--runs N` — runs per scenario for majority vote (default: 1; recommend 3 for reliability)
- `--model MODEL` — model for the test subject (default: sonnet)
- `--compare-models` — run across haiku/sonnet/opus and show a comparison matrix
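The `--runs N` majority vote can be sketched as follows. This is an assumed scoring rule for illustration; the real script's aggregation may differ:

```python
def majority_verdict(verdicts: list[str]) -> str:
    """Aggregate repeated runs of one scenario: PASS only if a strict
    majority of runs passed (odd run counts avoid ties)."""
    passes = sum(v == "PASS" for v in verdicts)
    return "PASS" if passes * 2 > len(verdicts) else "FAIL"
```

With `--runs 3`, a single flaky FAIL no longer sinks a scenario, which is why 3 runs is recommended for reliability.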

### Step 4: Report results

Print a compliance report:

```
Compliance Report — [filename]

Score: 8/10 (80%)

Scenario      Rule    Verdict  Evidence
gate1_think   GATE-1  PASS     Lists assumptions before code
...           ...     ...      ...
```

#### Failing Rules

- [rule]: [what went wrong] — suggested fix: [brief suggestion]
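The score line is a simple pass ratio. A minimal sketch, assuming one final verdict per scenario (`compliance_score` is a hypothetical helper, not a bundled script):

```python
def compliance_score(verdicts: dict[str, str]) -> str:
    """Render the report's score line from per-scenario verdicts."""
    passed = sum(v == "PASS" for v in verdicts.values())
    total = len(verdicts)
    return f"Score: {passed}/{total} ({passed * 100 // total}%)"
```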

### Step 5: Improve (optional)

If the user says "improve", "fix", or passed `--improve`:

Tell the user: "Starting mutation loop (dry-run) — this iteratively generates wording fixes for failing rules and A/B tests them. Each iteration takes 1-2 minutes."

IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees iteration progress in real time:

```bash
[SKILL_DIR]/scripts/mutate-loop.py \
  --target [TARGET_FILE] \
  --scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
  --max-iterations 3 \
  --runs 3 \
  --model sonnet
```

This is always a dry run by default. Show the user each suggested mutation and ask before applying.
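The loop's core logic can be sketched like this. It is a simplified stand-in: `propose_fn` and `score_fn` are hypothetical placeholders for the `claude -p` rewrite and A/B evaluation calls the real script makes:

```python
def mutate_until_pass(rule: str, propose_fn, score_fn, max_iterations: int = 3):
    """Iteratively propose rewordings of a failing rule, keep whichever
    wording scores best, and stop early once a perfect score is reached."""
    best, best_score = rule, score_fn(rule)
    for _ in range(max_iterations):
        candidate = propose_fn(best)         # suggest a new wording
        cand_score = score_fn(candidate)     # A/B test it against the current best
        if cand_score > best_score:
            best, best_score = candidate, cand_score
        if best_score >= 1.0:
            break
    return best, best_score
```

Because the caller chooses whether to write `best` back to the file, the dry-run default falls out naturally.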

## Arguments

Parse the user's `/eval-agent-md` invocation for these optional arguments:

- `[path]` — target file (positional, e.g., `/eval-agent-md ./CLAUDE.md`)
- `--improve` — run the mutation loop after testing
- `--runs N` — runs per scenario (default: 1)
- `--model MODEL` — model for the test subject (default: sonnet)
- `--compare-models` — cross-model comparison (haiku/sonnet/opus)
- `--agent` — hint that the target is an agent definition file (adjusts generation style)
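The argument grammar above maps onto a standard parser. A sketch with Python's `argparse`, for illustration only (the skill itself parses the invocation text, not `sys.argv`):

```python
import argparse

parser = argparse.ArgumentParser(prog="/eval-agent-md")
parser.add_argument("path", nargs="?", default=None,
                    help="target file (positional)")
parser.add_argument("--improve", action="store_true",
                    help="run mutation loop after testing")
parser.add_argument("--runs", type=int, default=1,
                    help="runs per scenario")
parser.add_argument("--model", default="sonnet",
                    help="model for the test subject")
parser.add_argument("--compare-models", action="store_true",
                    help="cross-model comparison")
parser.add_argument("--agent", action="store_true",
                    help="target is an agent definition file")

args = parser.parse_args(["./CLAUDE.md", "--runs", "3", "--improve"])
```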

## Examples

### Positive Trigger

User: "Run compliance tests against my CLAUDE.md to check if all rules are being followed."

Expected behavior: Use the `eval-agent-md` workflow — locate the CLAUDE.md, generate test scenarios, run behavioral tests, and report compliance results.

### Non-Trigger

User: "Add a new linting rule to our ESLint config."

Expected behavior: Do not use this skill. Choose a more relevant skill or proceed directly.

## Troubleshooting

### Scenario Generation Fails

- Error: `generate-scenarios.py` exits with a non-zero status or produces empty output.
- Cause: The target CLAUDE.md has no detectable rules or structured sections for the generator to parse.
- Solution: Ensure the target file contains clearly structured rules (headings, numbered items, or labeled sections). Try a simpler file first to confirm the script works.

### Low Compliance Score Despite Correct Rules

- Error: Multiple scenarios report FAIL even though the CLAUDE.md rules look correct.
- Cause: Single-run mode (`--runs 1`) is susceptible to LLM variance. The model may not follow rules consistently in a single sample.
- Solution: Re-run with `--runs 3` for majority-vote scoring to reduce noise.

### Scripts Not Found

- Error: `No such file or directory` when running skill scripts.
- Cause: The skill directory path is not resolving correctly, or the scripts lack execute permissions.
- Solution: Verify the skill is installed at the expected path and run `chmod +x` on the scripts in the `scripts/` directory.

## Notes

- All scripts use `uv run --script` — no pip install needed
- The judge always uses haiku (cheap, fast, reliable for scoring)
- Generated scenarios are ephemeral (temp dir) — they adapt to the current file state
- For agent .md files, the generator creates role-boundary scenarios (e.g., "does the reviewer avoid writing code?")
- Scripts are in this skill's `scripts/` directory
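The `uv run --script` note refers to inline script metadata (PEP 723): a script declares its own Python version and dependencies in a comment header, so no virtualenv or pip install is needed. Each bundled script presumably opens with something along these lines (the dependency list shown here is a guess, not taken from the actual scripts):

```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = ["pyyaml"]
# ///
```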