eval-agent-md

3 scripts

Behavioral compliance testing for any CLAUDE.md or agent definition file. Auto-generates test scenarios from your rules, runs them via LLM-as-judge scoring, and reports compliance. Optionally improves failing rules via automated mutation loop.

7 installs

Source: ravnhq/ai-toolkit

Added on 2026-03-24

NPX Install

npx skill4agent add ravnhq/ai-toolkit eval-agent-md

SKILL.md Content

eval-agent-md — Behavioral Compliance Testing

What This Does

Reads a CLAUDE.md (or agent .md file)
Auto-generates behavioral test scenarios for each rule it finds
Runs each scenario via `claude -p` with LLM-as-judge scoring
Reports a compliance score with per-rule pass/fail breakdown
Optionally runs an automated mutation loop to improve failing rules

Workflow

Progress Reporting

This skill runs long operations (30s-5min per step). Always keep the user informed:

Before each step, tell the user what is about to happen and roughly how long it takes
Run all scripts via the Bash tool (never capture output) so per-scenario progress streams to the user in real time
After each step completes, give a brief transition summary before starting the next step
Set an appropriate timeout on Bash calls (120s for generation, 600s for eval/mutation)
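
Outside the agent's Bash tool, the same two ideas (no output capture, a per-step timeout) can be sketched in Python. This is only an illustration: `STEP_TIMEOUTS` and `run_streaming` are hypothetical names, and the stand-in command is not one of the skill's scripts.

```python
import subprocess
import sys

# Timeouts in seconds, taken from the guidance above; the dict itself is illustrative.
STEP_TIMEOUTS = {"generate": 120, "eval": 600, "mutate": 600}


def run_streaming(cmd, step):
    """Run cmd with stdout/stderr inherited so output appears as it is produced."""
    # No capture_output and no stdout=PIPE: the child writes straight to the terminal.
    # subprocess.run raises subprocess.TimeoutExpired if the step overruns its timeout.
    completed = subprocess.run(cmd, timeout=STEP_TIMEOUTS[step], check=False)
    return completed.returncode


if __name__ == "__main__":
    # Stand-in command; a real call would pass a skill script and its arguments.
    demo = [sys.executable, "-c", "print('[1/1] demo... PASS (0s)')"]
    print("exit code:", run_streaming(demo, "generate"))
```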

Step 1: Locate the target file

Find the CLAUDE.md to test. Priority order:

If user provided a path argument (e.g., `/eval-agent-md ./CLAUDE.md`), use that
If a project-level CLAUDE.md exists in the current working directory, use that
Fall back to `~/.claude/CLAUDE.md` (user global)
If none found, ask the user

Read the file and confirm with the user: "I found your CLAUDE.md at [path] ([N] lines). Testing this file."
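
The priority order above can be written down as a small lookup function. This is a minimal sketch, not the skill's own code: `locate_target` is a hypothetical helper, and the working directory and home directory are passed in explicitly so the logic is easy to test.

```python
from pathlib import Path
from typing import Optional


def locate_target(path_arg: Optional[str], cwd: Path, home: Path) -> Optional[Path]:
    """Return the file to test, or None when the user has to be asked."""
    if path_arg:
        return Path(path_arg)  # 1. an explicit path argument wins
    project_file = cwd / "CLAUDE.md"
    if project_file.is_file():
        return project_file  # 2. project-level file in the working directory
    global_file = home / ".claude" / "CLAUDE.md"
    if global_file.is_file():
        return global_file  # 3. user-global fallback
    return None  # 4. nothing found: ask the user
```

A caller would typically pass `Path.cwd()` and `Path.home()`, then count the lines of the returned file for the confirmation message.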

Step 2: Generate test scenarios

Tell the user: "Generating test scenarios from [filename]... this calls `claude -p --model sonnet` and typically takes 30-60 seconds."

Run the scenario generator script bundled with this skill. IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees progress lines in real time:

```bash
[SKILL_DIR]/scripts/generate-scenarios.py [TARGET_FILE]
```

The script auto-detects the repository name from git and saves to `/tmp/eval-agent-md-<repo>-scenarios.yaml` (e.g., `/tmp/eval-agent-md-my-project-scenarios.yaml`). Override with `--repo-name NAME` or `-o PATH`.
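
The naming scheme can be sketched as follows. How the script actually detects the repository name is not stated here, so `detect_repo_name` is a guess (the basename of `git rev-parse --show-toplevel`), and both function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Optional


def detect_repo_name(cwd: str = ".") -> str:
    """Assumed detection: basename of the git top-level directory."""
    top = subprocess.run(
        ["git", "rev-parse", "--show-toplevel"],
        cwd=cwd, capture_output=True, text=True, check=True,
    ).stdout.strip()
    return Path(top).name


def scenarios_path(repo_name: str, output: Optional[str] = None) -> Path:
    """Default scenarios file for a repo; an explicit output path overrides it."""
    if output:  # mirrors -o PATH
        return Path(output)
    return Path("/tmp") / f"eval-agent-md-{repo_name}-scenarios.yaml"
```

Passing a fixed name to `scenarios_path` corresponds to `--repo-name NAME`; passing `output` corresponds to `-o PATH`.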

After generation, read the output file and show the user a summary:

How many scenarios were generated
Which rules each scenario tests
A brief preview of each scenario's prompt

Ask the user: "Generated [N] test scenarios. Ready to run? (Or edit/skip any?)"

Step 3: Run behavioral tests

Tell the user: "Running [N] scenarios x [runs] run(s) against [model]... each scenario calls `claude -p` twice (subject + judge), so this takes a few minutes. You'll see per-scenario results as they complete."

IMPORTANT: Do NOT capture output. Run via the Bash tool so the user sees per-scenario progress (`[1/N] scenario_id... PASS/FAIL (Xs)`) in real time:

```bash
[SKILL_DIR]/scripts/eval-behavioral.py \
  --scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
  --claude-md [TARGET_FILE] \
  --runs 1 \
  --model sonnet
```
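
If the streamed output is also saved to a log for later review, progress lines could be parsed roughly like this. The pattern assumes the lines look exactly like the `[1/N] scenario_id... PASS/FAIL (Xs)` template; the real script's output may differ, and `parse_progress` is a hypothetical helper.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape: "[3/10] some_id... PASS (12s)"
PROGRESS_RE = re.compile(
    r"^\[(\d+)/(\d+)\]\s+(\S+?)\.\.\.\s+(PASS|FAIL)\s+\((\d+(?:\.\d+)?)s\)$"
)


class Progress(NamedTuple):
    position: int
    total: int
    scenario_id: str
    verdict: str
    seconds: float


def parse_progress(line: str) -> Optional[Progress]:
    """Parse one progress line; return None for anything that does not match."""
    match = PROGRESS_RE.match(line.strip())
    if match is None:
        return None
    position, total, scenario_id, verdict, seconds = match.groups()
    return Progress(int(position), int(total), scenario_id, verdict, float(seconds))
```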

Options the user can control:

`--runs N`: runs per scenario for majority vote (default: 1, recommend 3 for reliability)
`--model MODEL`: model for test subject (default: sonnet)
`--compare-models`: run across haiku/sonnet/opus and show comparison matrix
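
Majority voting over several runs of one scenario can be sketched as below. The tie-breaking rule (a tie counts as FAIL) is an assumption; the source does not say how the script resolves ties, and `majority_verdict` is a hypothetical name.

```python
from collections import Counter


def majority_verdict(verdicts):
    """Collapse per-run verdicts ("PASS"/"FAIL") into one verdict."""
    counts = Counter(verdicts)
    # A strict majority of PASS is required; ties and empty input count as FAIL (assumed).
    return "PASS" if counts["PASS"] * 2 > len(verdicts) else "FAIL"
```

With an odd number of runs such as 3, a tie cannot occur, which is one reason an odd `--runs` value is convenient.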

Step 4: Report results

Print a compliance report:

## Compliance Report — [filename]

Score: 8/10 (80%)

| Scenario | Rule | Verdict | Evidence |
|----------|------|---------|----------|
| gate1_think | GATE-1 | PASS | Lists assumptions before code |
| ... | ... | ... | ... |

### Failing Rules
- [rule]: [what went wrong] — suggested fix: [brief suggestion]
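
The score line and table in the template above could be produced from a list of per-scenario results along these lines. The result dictionary keys (`scenario`, `rule`, `verdict`, `evidence`) are assumptions chosen to match the table columns, not the scripts' actual output format.

```python
def score_line(results):
    """Build the 'Score: passed/total (pct%)' line from per-scenario results."""
    passed = sum(1 for r in results if r["verdict"] == "PASS")
    total = len(results)
    pct = round(100 * passed / total) if total else 0
    return f"Score: {passed}/{total} ({pct}%)"


def results_table(results):
    """Render the per-scenario table in the same column order as the template."""
    rows = [
        "| Scenario | Rule | Verdict | Evidence |",
        "|----------|------|---------|----------|",
    ]
    for r in results:
        rows.append(f"| {r['scenario']} | {r['rule']} | {r['verdict']} | {r['evidence']} |")
    return "\n".join(rows)
```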

Step 5: Improve (optional)

If the user says "improve" or "fix", or passed `--improve`:

Tell the user: "Starting mutation loop (dry-run) — this iteratively generates wording fixes for failing rules and A/B tests them. Each iteration takes 1-2 minutes."

IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees iteration progress in real time:

```bash
[SKILL_DIR]/scripts/mutate-loop.py \
  --target [TARGET_FILE] \
  --scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
  --max-iterations 3 \
  --runs 3 \
  --model sonnet
```

The loop runs in dry-run mode by default. Show the user each suggested mutation and ask before applying it.
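
The general shape of such a loop (propose a rewording, score it, keep it only if it beats the current best) can be sketched as follows. This is not `mutate-loop.py` itself: `propose` and `evaluate` are hypothetical callables standing in for the LLM rewrite and the behavioral eval, and nothing is written to disk, which is what makes it a dry run.

```python
def mutate_loop(text, propose, evaluate, max_iterations=3):
    """Return (best_text, best_score, accepted) without modifying any file."""
    best_text, best_score = text, evaluate(text)
    accepted = []
    for _ in range(max_iterations):
        candidate = propose(best_text)
        score = evaluate(candidate)
        if score > best_score:  # A/B comparison: keep only strict improvements
            best_text, best_score = candidate, score
            accepted.append(candidate)
    return best_text, best_score, accepted
```

The `accepted` list is what would be shown to the user for confirmation before any change is applied to the target file.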

Arguments

Parse the user's `/eval-agent-md` invocation for these optional arguments:

`[path]`: target file (positional, e.g., `/eval-agent-md ./CLAUDE.md`)
`--improve`: run mutation loop after testing
`--runs N`: runs per scenario (default: 1)
`--model MODEL`: model for test subject (default: sonnet)
`--compare-models`: cross-model comparison (haiku/sonnet/opus)
`--agent`: hint that the target is an agent definition file (adjusts generation style)
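
For illustration, the argument list above maps onto a standard `argparse` parser. The skill itself is parsed by the agent rather than by a script, so this is only a sketch of the same contract; `build_parser` is a hypothetical name.

```python
import argparse
import shlex


def build_parser():
    """Parser mirroring the documented /eval-agent-md arguments and defaults."""
    parser = argparse.ArgumentParser(prog="/eval-agent-md")
    parser.add_argument("path", nargs="?", default=None, help="target file (optional)")
    parser.add_argument("--improve", action="store_true", help="run mutation loop after testing")
    parser.add_argument("--runs", type=int, default=1, metavar="N", help="runs per scenario")
    parser.add_argument("--model", default="sonnet", metavar="MODEL", help="model for test subject")
    parser.add_argument("--compare-models", action="store_true", help="cross-model comparison")
    parser.add_argument("--agent", action="store_true", help="target is an agent definition file")
    return parser


if __name__ == "__main__":
    # Everything after the command name is handed to the parser.
    argv = shlex.split("./CLAUDE.md --runs 3 --improve")
    print(build_parser().parse_args(argv))
```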

Examples

Positive Trigger

User: "Run compliance tests against my CLAUDE.md to check if all rules are being followed."

Expected behavior: Use the `eval-agent-md` workflow: locate the CLAUDE.md, generate test scenarios, run behavioral tests, and report compliance results.

Non-Trigger

User: "Add a new linting rule to our ESLint config."

Expected behavior: Do not use this skill. Choose a more relevant skill or proceed directly.

Troubleshooting

Scenario Generation Fails

Error: `generate-scenarios.py` exits with non-zero status or produces empty output.
Cause: The target CLAUDE.md has no detectable rules or structured sections for the generator to parse.
Solution: Ensure the target file contains clearly structured rules (headings, numbered items, or labeled sections). Try a simpler file first to confirm the script works.

Low Compliance Score Despite Correct Rules

Error: Multiple scenarios report FAIL even though the CLAUDE.md rules look correct.
Cause: Single-run mode (`--runs 1`) is susceptible to LLM variance. The model may not follow rules consistently in a single sample.
Solution: Re-run with `--runs 3` for majority-vote scoring to reduce noise.

Scripts Not Found

Error: `No such file or directory` when running skill scripts.
Cause: The skill directory path is not resolving correctly, or scripts lack execute permissions.
Solution: Verify the skill is installed at the expected path and run `chmod +x` on the scripts in the `scripts/` directory.
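
A quick pre-flight check for both causes (missing files, missing execute bit) might look like this. The three script names are the ones invoked in the workflow above; `check_scripts` is a hypothetical helper and the skill directory must be supplied by the caller.

```python
import os
from pathlib import Path

# Script names as they appear in the workflow commands.
SCRIPT_NAMES = ["generate-scenarios.py", "eval-behavioral.py", "mutate-loop.py"]


def check_scripts(skill_dir):
    """Return {script_name: problem} for scripts that are missing or not executable."""
    problems = {}
    for name in SCRIPT_NAMES:
        path = Path(skill_dir) / "scripts" / name
        if not path.is_file():
            problems[name] = "missing"
        elif not os.access(path, os.X_OK):
            problems[name] = "not executable"
    return problems
```

An empty result means all three scripts are present and executable; a "not executable" entry is the case that `chmod +x` fixes.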

Notes

All scripts use `uv run --script`, so no pip install is needed
The judge always uses haiku (cheap, fast, reliable for scoring)
Generated scenarios are ephemeral (temp dir) — they adapt to the current file state
For agent .md files, the generator creates role-boundary scenarios (e.g., "does the reviewer avoid writing code?")
Scripts are in this skill's `scripts/` directory

