
# eval-agent-md — Behavioral Compliance Testing

## What This Does

1. Reads a CLAUDE.md (or agent .md file)
2. Auto-generates behavioral test scenarios for each rule it finds
3. Runs each scenario via `claude -p` with LLM-as-judge scoring
4. Reports a compliance score with a per-rule pass/fail breakdown
5. Optionally runs an automated mutation loop to improve failing rules

## Workflow

### Progress Reporting

This skill runs long operations (30s-5min per step). Always keep the user informed:

- Before each step, tell the user what is about to happen and roughly how long it takes
- Run all scripts via the Bash tool (never capture output) so per-scenario progress streams to the user in real time
- After each step completes, give a brief transition summary before starting the next step
- Set an appropriate timeout on Bash calls (120s for generation, 600s for eval/mutation)

### Step 1: Locate the target file

Find the CLAUDE.md to test. Priority order:

1. If the user provided a path argument (e.g., `/eval-agent-md ./CLAUDE.md`), use that
2. If a project-level CLAUDE.md exists in the current working directory, use that
3. Fall back to `~/.claude/CLAUDE.md` (user global)
4. If none is found, ask the user

Read the file and confirm with the user: "I found your CLAUDE.md at [path] ([N] lines). Testing this file."

### Step 2: Generate test scenarios

Tell the user: "Generating test scenarios from [filename]... this calls `claude -p --model sonnet` and typically takes 30-60 seconds."

Run the scenario generator script bundled with this skill. IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees progress lines in real time:

```bash
[SKILL_DIR]/scripts/generate-scenarios.py [TARGET_FILE]
```

The script auto-detects the repository name from git and saves to `/tmp/eval-agent-md-<repo>-scenarios.yaml` (e.g., `/tmp/eval-agent-md-my-project-scenarios.yaml`). Override with `--repo-name NAME` or `-o PATH`.

After generation, read the output file and show the user a summary:

- How many scenarios were generated
- Which rules each scenario tests
- A brief preview of each scenario's prompt

Ask the user: "Generated [N] test scenarios. Ready to run? (Or edit/skip any?)"

### Step 3: Run behavioral tests

Tell the user: "Running [N] scenarios x [runs] run(s) against [model]... each scenario calls `claude -p` twice (subject + judge), so this takes a few minutes. You'll see per-scenario results as they complete."

IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees per-scenario progress (`[1/N] scenario_id... PASS/FAIL (Xs)`) in real time:

```bash
[SKILL_DIR]/scripts/eval-behavioral.py \
  --scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
  --claude-md [TARGET_FILE] \
  --runs 1 \
  --model sonnet
```

Options the user can control:

- `--runs N` — runs per scenario for majority vote (default: 1; recommend 3 for reliability)
- `--model MODEL` — model for the test subject (default: sonnet)
- `--compare-models` — run across haiku/sonnet/opus and show a comparison matrix
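The `--runs N` majority vote can be sketched as follows. This is an assumed scoring rule for illustration; the real script's aggregation may differ:

```python
def majority_verdict(verdicts: list[str]) -> str:
    """Aggregate repeated runs of one scenario: PASS only if a strict
    majority of runs passed (odd run counts avoid ties)."""
    passes = sum(v == "PASS" for v in verdicts)
    return "PASS" if passes * 2 > len(verdicts) else "FAIL"
```

With `--runs 3`, a single flaky FAIL no longer sinks a scenario, which is why 3 runs is recommended for reliability.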

### Step 4: Report results

Print a compliance report:

```
Compliance Report — [filename]

Score: 8/10 (80%)

Scenario      Rule    Verdict  Evidence
gate1_think   GATE-1  PASS     Lists assumptions before code
...           ...     ...      ...
```

#### Failing Rules

- [rule]: [what went wrong] — suggested fix: [brief suggestion]
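The score line is a simple pass ratio. A minimal sketch, assuming one final verdict per scenario (`compliance_score` is a hypothetical helper, not a bundled script):

```python
def compliance_score(verdicts: dict[str, str]) -> str:
    """Render the report's score line from per-scenario verdicts."""
    passed = sum(v == "PASS" for v in verdicts.values())
    total = len(verdicts)
    return f"Score: {passed}/{total} ({passed * 100 // total}%)"
```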

### Step 5: Improve (optional)

If the user says "improve", "fix", or passed `--improve`:

Tell the user: "Starting mutation loop (dry-run) — this iteratively generates wording fixes for failing rules and A/B tests them. Each iteration takes 1-2 minutes."

IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees iteration progress in real time:

```bash
[SKILL_DIR]/scripts/mutate-loop.py \
  --target [TARGET_FILE] \
  --scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
  --max-iterations 3 \
  --runs 3 \
  --model sonnet
```

This is always a dry run by default. Show the user each suggested mutation and ask before applying.
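The loop's core logic can be sketched like this. It is a simplified stand-in: `propose_fn` and `score_fn` are hypothetical placeholders for the `claude -p` rewrite and A/B evaluation calls the real script makes:

```python
def mutate_until_pass(rule: str, propose_fn, score_fn, max_iterations: int = 3):
    """Iteratively propose rewordings of a failing rule, keep whichever
    wording scores best, and stop early once a perfect score is reached."""
    best, best_score = rule, score_fn(rule)
    for _ in range(max_iterations):
        candidate = propose_fn(best)         # suggest a new wording
        cand_score = score_fn(candidate)     # A/B test it against the current best
        if cand_score > best_score:
            best, best_score = candidate, cand_score
        if best_score >= 1.0:
            break
    return best, best_score
```

Because the caller chooses whether to write `best` back to the file, the dry-run default falls out naturally.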

## Arguments

Parse the user's `/eval-agent-md` invocation for these optional arguments:

- `[path]` — target file (positional, e.g., `/eval-agent-md ./CLAUDE.md`)
- `--improve` — run the mutation loop after testing
- `--runs N` — runs per scenario (default: 1)
- `--model MODEL` — model for the test subject (default: sonnet)
- `--compare-models` — cross-model comparison (haiku/sonnet/opus)
- `--agent` — hint that the target is an agent definition file (adjusts generation style)
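The argument grammar above maps onto a standard parser. A sketch with Python's `argparse`, for illustration only (the skill itself parses the invocation text, not `sys.argv`):

```python
import argparse

parser = argparse.ArgumentParser(prog="/eval-agent-md")
parser.add_argument("path", nargs="?", default=None,
                    help="target file (positional)")
parser.add_argument("--improve", action="store_true",
                    help="run mutation loop after testing")
parser.add_argument("--runs", type=int, default=1,
                    help="runs per scenario")
parser.add_argument("--model", default="sonnet",
                    help="model for the test subject")
parser.add_argument("--compare-models", action="store_true",
                    help="cross-model comparison")
parser.add_argument("--agent", action="store_true",
                    help="target is an agent definition file")

args = parser.parse_args(["./CLAUDE.md", "--runs", "3", "--improve"])
```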

## Examples

### Positive Trigger

User: "Run compliance tests against my CLAUDE.md to check if all rules are being followed."

Expected behavior: Use the `eval-agent-md` workflow — locate the CLAUDE.md, generate test scenarios, run behavioral tests, and report compliance results.

### Non-Trigger

User: "Add a new linting rule to our ESLint config."

Expected behavior: Do not use this skill. Choose a more relevant skill or proceed directly.

## Troubleshooting

### Scenario Generation Fails

- Error: `generate-scenarios.py` exits with a non-zero status or produces empty output.
- Cause: The target CLAUDE.md has no detectable rules or structured sections for the generator to parse.
- Solution: Ensure the target file contains clearly structured rules (headings, numbered items, or labeled sections). Try a simpler file first to confirm the script works.

### Low Compliance Score Despite Correct Rules

- Error: Multiple scenarios report FAIL even though the CLAUDE.md rules look correct.
- Cause: Single-run mode (`--runs 1`) is susceptible to LLM variance. The model may not follow rules consistently in a single sample.
- Solution: Re-run with `--runs 3` for majority-vote scoring to reduce noise.

### Scripts Not Found

- Error: `No such file or directory` when running skill scripts.
- Cause: The skill directory path is not resolving correctly, or the scripts lack execute permissions.
- Solution: Verify the skill is installed at the expected path and run `chmod +x` on the scripts in the `scripts/` directory.

## Notes

- All scripts use `uv run --script` — no pip install needed
- The judge always uses haiku (cheap, fast, reliable for scoring)
- Generated scenarios are ephemeral (temp dir) — they adapt to the current file state
- For agent .md files, the generator creates role-boundary scenarios (e.g., "does the reviewer avoid writing code?")
- Scripts are in this skill's `scripts/` directory
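The `uv run --script` note refers to inline script metadata (PEP 723): a script declares its own Python version and dependencies in a comment header, so no virtualenv or pip install is needed. Each bundled script presumably opens with something along these lines (the dependency list shown here is a guess, not taken from the actual scripts):

```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = ["pyyaml"]
# ///
```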