# Evaluate Presets

Use when testing Ralph's hat collection presets, validating preset configurations, or auditing the preset library for bugs and UX issues.

Install via npx:

```bash
npx skill4agent add mikeyobrien/ralph-orchestrator evaluate-presets
```
## Overview

Systematically test all hat collection presets using shell scripts. Direct CLI invocation—no meta-orchestration complexity.
## When to Use
- Testing preset configurations after changes
- Auditing the preset library for quality
- Validating new presets work correctly
- After modifying hat routing logic
## Quick Start

Evaluate a single preset:

```bash
./tools/evaluate-preset.sh tdd-red-green claude
```

Evaluate all presets:

```bash
./tools/evaluate-all-presets.sh claude
```

Arguments:
- First arg: preset name (without the `.yml` extension)
- Second arg: backend (`claude` or `kiro`, defaults to `claude`)
## Bash Tool Configuration

IMPORTANT: When invoking these scripts via the Bash tool, use these settings:
- Single preset evaluation: Use `timeout: 600000` (10 minutes max) and `run_in_background: true`
- All presets evaluation: Use `timeout: 600000` (10 minutes max) and `run_in_background: true`

Since preset evaluations can run for hours (especially the full suite), always run in background mode and use the `TaskOutput` tool to check progress periodically.

Example invocation pattern:

```
Bash tool with:
  command: "./tools/evaluate-preset.sh tdd-red-green claude"
  timeout: 600000
  run_in_background: true
```

After launching, use `TaskOutput` with `block: false` to check status without waiting for completion.

## What the Scripts Do
### evaluate-preset.sh

- Loads the test task from `tools/preset-test-tasks.yml` (if `yq` is available)
- Creates a merged config with evaluation settings
- Runs Ralph with `--record-session` for metrics capture
- Captures output logs, exit codes, and timing
- Extracts metrics: iterations, hats activated, events published
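As a rough mental model, the script's flow looks something like the sketch below. This is a minimal illustration, not the real script: the `--record-session` flag and the output paths come from this document, but the `ralph` invocation, its other flags, and the `yq` query path are assumptions.

```bash
#!/usr/bin/env bash
# Sketch of the evaluate-preset.sh flow (illustrative; not the shipped script).
set -euo pipefail

PRESET="${1:?usage: evaluate-preset.sh <preset> [backend]}"
BACKEND="${2:-claude}"
RUN_DIR=".eval/logs/${PRESET}/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$RUN_DIR"

# Load the test task if yq is installed (the query path is an assumption).
TASK=""
if command -v yq >/dev/null 2>&1; then
  TASK="$(yq ".[\"$PRESET\"].task" tools/preset-test-tasks.yml)"
fi

# Run Ralph with session recording; flags besides --record-session are placeholders.
START=$(date +%s)
set +e
ralph --record-session "$TASK" >"$RUN_DIR/output.log" 2>&1
EXIT_CODE=$?
set -e

# Record exit code and timing, then refresh the `latest` symlink.
printf '{"exit_code": %d, "elapsed_s": %d}\n' \
  "$EXIT_CODE" "$(( $(date +%s) - START ))" >"$RUN_DIR/metrics.json"
ln -sfn "$(basename "$RUN_DIR")" ".eval/logs/${PRESET}/latest"
```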
Output structure:

```
.eval/
├── logs/<preset>/<timestamp>/
│   ├── output.log         # Full stdout/stderr
│   ├── session.jsonl      # Recorded session
│   ├── metrics.json       # Extracted metrics
│   ├── environment.json   # Runtime environment
│   └── merged-config.yml  # Config used
└── logs/<preset>/latest -> <timestamp>
```

### evaluate-all-presets.sh

Runs all 12 presets sequentially and generates a summary:

```
.eval/results/<suite-id>/
├── SUMMARY.md      # Markdown report
├── <preset>.json   # Per-preset metrics
└── latest -> <suite-id>
```

## Presets Under Evaluation
| Preset | Test Task |
|---|---|
|  | Add … |
|  | Review user input handler for security |
|  | Understand … |
|  | Specify and implement … |
|  | Implement a … |
|  | Debug failing mock test assertion |
|  | Understand history of … |
|  | Profile hat matching |
|  | Design a … |
|  | Document … |
|  | Respond to "tests failing in CI" |
|  | Plan v1 to v2 config migration |
## Interpreting Results

Exit codes from `evaluate-preset.sh`:
- `0` — Success (LOOP_COMPLETE reached)
- `124` — Timeout (preset hung or took too long)
- Other — Failure (check `output.log`)
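A wrapper can branch on these codes directly; for example:

```bash
# Run one evaluation and classify the result by exit code.
./tools/evaluate-preset.sh tdd-red-green claude
case $? in
  0)   echo "success: LOOP_COMPLETE reached" ;;
  124) echo "timeout: preset hung or took too long" ;;
  *)   echo "failure: inspect .eval/logs/tdd-red-green/latest/output.log" ;;
esac
```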
Metrics in `metrics.json`:
- `iterations` — How many event loop cycles ran
- `hats_activated` — Which hats were triggered
- `events_published` — Total events emitted
- `completed` — Whether the completion promise was reached
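These fields can be pulled out with `jq`; a quick sketch, assuming `hats_activated` is stored as a JSON array:

```bash
M=.eval/logs/tdd-red-green/latest/metrics.json
# Print the headline numbers on one line.
jq -r '"iterations=\(.iterations) events=\(.events_published) completed=\(.completed)"' "$M"
# List activated hats, one per line (assumes an array field).
jq -r '.hats_activated[]' "$M"
```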
## Hat Routing Performance

Critical: Validate that hats get fresh context per Tenet #1 ("Fresh Context Is Reliability").

### What Good Looks Like

Each hat should execute in its own iteration:

```
Iter 1: Ralph → publishes starting event → STOPS
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
Iter 4: Hat C → does work → LOOP_COMPLETE
```

### Red Flags (Same-Iteration Hat Switching)

BAD: Multiple hat personas in one iteration:

```
Iter 2: Ralph does Blue Team + Red Team + Fixer work
        ^^^ All in one bloated context!
```

### How to Check
1. Count iterations vs events in `session.jsonl`:

```bash
# Count iterations
grep -c "_meta.loop_start\|ITERATION" .eval/logs/<preset>/latest/output.log
# Count events published
grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl
```

Expected: iterations ≈ events published (one event per iteration).
Bad sign: 2-3 iterations but 5+ events (all work done in a single iteration).
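The two counts can be compared in one small script, reusing the grep patterns above (the +1 of slack for the starting event is an arbitrary choice):

```bash
DIR=.eval/logs/<preset>/latest
ITERS=$(grep -c "_meta.loop_start\|ITERATION" "$DIR/output.log")
EVENTS=$(grep -c "bus.publish" "$DIR/session.jsonl")
echo "iterations=$ITERS events=$EVENTS"
if [ "$EVENTS" -gt $(( ITERS + 1 )) ]; then
  echo "WARNING: events outnumber iterations; likely same-iteration hat switching"
fi
```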
2. Check for same-iteration hat switching in `output.log`:

```bash
grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
  .eval/logs/<preset>/latest/output.log
```

Red flag: hat-switching phrases WITHOUT an ITERATION separator between them.
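To automate that eyeball check, a small `awk` pass can flag two hat-switching phrases with no ITERATION marker between them (same phrases as above; treat hits as leads, not proof):

```bash
awk '/ITERATION/ { seen = 0 }
     /Now I need to perform|Let me put on|I'\''ll switch to/ {
       if (seen) printf "RED FLAG line %d: %s\n", NR, $0
       seen = 1
     }' .eval/logs/<preset>/latest/output.log
```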
3. Check event timestamps in `session.jsonl`:

```bash
cat .eval/logs/<preset>/latest/session.jsonl | jq -r '.ts'
```

Red flag: multiple events with identical timestamps (published in the same iteration).
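Duplicate timestamps can be surfaced directly by sorting the values and printing only the repeats:

```bash
# Any output here is a timestamp shared by two or more events.
jq -r '.ts' .eval/logs/<preset>/latest/session.jsonl | sort | uniq -d
```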
### Routing Performance Triage
| Pattern | Diagnosis | Action |
|---|---|---|
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |
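Using the `ITERS` and `EVENTS` counts from the check snippet above, the table can be applied mechanically (the 2x cutoffs for `<<` and `>>` are arbitrary choices, not from the tooling):

```bash
if   [ "$EVENTS" -eq 0 ];                 then echo "broken: no events read from JSONL"
elif [ "$ITERS"  -lt $(( EVENTS / 2 )) ]; then echo "same-iteration switching: check STOP instruction"
elif [ "$ITERS"  -gt $(( EVENTS * 2 )) ]; then echo "recovery loops: required events not published"
else                                           echo "good: iterations ≈ events"
fi
```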
## Root Cause Checklist

If hat routing is broken:

- Check the workflow prompt in `hatless_ralph.rs`:
  - Does it say "CRITICAL: STOP after publishing"?
  - Is the DELEGATE section clear about yielding control?
- Check hat instructions propagation:
  - Does `HatInfo` include an `instructions` field?
  - Are instructions rendered in the `## HATS` section?
- Check events context:
  - Is `build_prompt(context)` using the context parameter?
  - Does the prompt include a `## PENDING EVENTS` section?
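A few greps against the source tree can pre-answer parts of this checklist (the `src/` path is an assumption about the repo layout):

```bash
# Does the workflow prompt carry the STOP instruction?
grep -rn "STOP after publishing" --include="*.rs" src/ || echo "no STOP instruction found"
# Is a PENDING EVENTS section rendered anywhere?
grep -rn "PENDING EVENTS" --include="*.rs" src/ || echo "no PENDING EVENTS section found"
```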
## Autonomous Fix Workflow

After evaluation, delegate fixes to subagents:

### Step 1: Triage Results

Read `.eval/results/latest/SUMMARY.md` and identify:
- ❌ FAIL → Create code tasks for fixes
- ⏱️ TIMEOUT → Investigate infinite loops
- ⚠️ PARTIAL → Check for edge cases
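A quick grep over the summary pulls out the rows needing attention, assuming SUMMARY.md labels each preset with the statuses above:

```bash
grep -E "FAIL|TIMEOUT|PARTIAL" .eval/results/latest/SUMMARY.md
```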
### Step 2: Dispatch Task Creation

For each issue, spawn a Task agent:

```
Use /code-task-generator to create a task for fixing: [issue from evaluation]
Output to: tasks/preset-fixes/
```

### Step 3: Dispatch Implementation

For each created task:

```
Use /code-assist to implement: tasks/preset-fixes/[task-file].code-task.md
Mode: auto
```

### Step 4: Re-evaluate

```bash
./tools/evaluate-preset.sh <fixed-preset> claude
```

## Prerequisites
- `yq` (optional): For loading test tasks from YAML. Install: `brew install yq`
- Cargo: Must be able to build Ralph
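A preflight check before kicking off a long suite, under the assumption that the repo root holds the Cargo project:

```bash
command -v yq >/dev/null || echo "note: yq missing; test tasks won't load from YAML"
cargo build && echo "Ralph builds OK"
```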
## Related Files

- `tools/evaluate-preset.sh` — Single preset evaluation
- `tools/evaluate-all-presets.sh` — Full suite evaluation
- `tools/preset-test-tasks.yml` — Test task definitions
- `tools/preset-evaluation-findings.md` — Manual findings doc
- `presets/` — The preset collection being evaluated