Testing Agents With Subagents

RED-GREEN-REFACTOR testing for agents: dispatch subagents with known inputs, capture verbatim outputs, verify against expectations. Use when creating, modifying, or validating agents and skills. Use for "test agent", "validate agent", "verify agent works", or pre-deployment checks. Do NOT use for feature requests, simple prompt edits without behavioral impact, or agents with no structured output to verify.
Operator Context
This skill operates as an operator for agent testing workflows, configuring Claude's behavior for systematic agent validation. It applies TDD methodology to agent development — RED (observe failures), GREEN (fix agent definition), REFACTOR (edge cases and robustness) — with subagent dispatch as the execution mechanism.
Hardcoded Behaviors (Always Apply)
- CLAUDE.md Compliance: Read and follow repository CLAUDE.md files before testing
- Over-Engineering Prevention: Only test what's directly needed. No elaborate test harnesses or infrastructure. Keep test cases focused and minimal.
- Verbatim Output Capture: Document exact agent outputs. NEVER summarize or paraphrase.
- Isolated Execution: Each test runs in a fresh subagent to avoid context pollution
- Evidence-Based Claims: Every claim about agent behavior MUST be backed by actual test execution
- No Self-Exemption: You cannot decide an agent doesn't need testing. Human partner must confirm exemptions.
Default Behaviors (ON unless disabled)
- Multi-Case Testing: Run at least 3 test cases per agent (success, failure, edge case)
- Output Schema Validation: Verify agent output matches expected structure and required sections
- Consistency Testing: Run same input 2+ times to verify deterministic behavior
- Regression Testing: After fixes, re-run ALL previous test cases before declaring green
- Temporary File Cleanup: Remove test files and artifacts at completion. Keep only files needed for documentation.
- Document Findings: Log all observations, hypotheses, and test results in structured format
Optional Behaviors (OFF unless enabled)
- A/B Testing: Compare agent variants using agent-comparison skill
- Performance Benchmarking: Measure response time and token usage
- Stress Testing: Test with large inputs, many iterations, concurrent requests
- Eval Harness Integration: Use `evals/harness.py skill-test` for YAML-based automated testing
What This Skill CAN Do
- Systematically validate agents through RED-GREEN-REFACTOR test cycles
- Dispatch subagents with controlled inputs and capture verbatim outputs
- Distinguish between output structure issues and behavioral correctness issues
- Verify fixes don't introduce regressions across the full test suite
- Test routing logic, skill invocation, and multi-agent workflows
What This Skill CANNOT Do
- Deploy agents without completing all three test phases
- Substitute reading agent prompts for executing actual test runs
- Make claims about agent behavior without evidence from subagent dispatch
- Evaluate agent quality structurally (use agent-evaluation instead)
- Skip the RED phase even when "the fix is obvious"
Instructions
Phase 0: PREPARE — Understand the Agent
Goal: Read the agent definition and understand what it claims to do before writing tests.
Step 1: Read the agent file
```bash
# Read the agent definition
cat agents/{agent-name}.md

# Read any referenced skills
cat skills/{skill-name}/SKILL.md
```
Step 2: Identify testable claims
Extract concrete, testable behaviors from the agent definition:
- What inputs does it accept?
- What output structure does it produce?
- What routing triggers should activate it?
- What error conditions does it handle?
- What skills does it invoke?
Step 3: Determine minimum test count
| Agent Type | Minimum Tests | Required Coverage |
|---|---|---|
| Reviewer agents | 6 | 2 real issues, 2 clean, 1 edge, 1 ambiguous |
| Implementation agents | 5 | 2 typical, 1 complex, 1 minimal, 1 error |
| Analysis agents | 4 | 2 standard, 1 edge, 1 malformed |
| Routing/orchestration | 4 | 2 correct route, 1 ambiguous, 1 invalid |
No gate — this phase is preparation. Move directly to Phase 1.
Phase 1: RED — Observe Current Behavior
Goal: Run agent with test inputs and document exact current behavior before any changes.
Step 1: Define test plan
```markdown
## Test Plan: {agent-name}
**Agent Purpose:** {what the agent does}
**Agent File:** agents/{agent-name}.md
**Date:** {date}
**Test Cases:**
| ID | Input | Expected Output | Validates |
|----|-------|-----------------|-----------|
| T1 | {input} | {expected} | Happy path |
| T2 | {input} | {expected} | Error handling |
| T3 | {input} | {expected} | Edge case |
```
Write the test plan to a file before executing. This creates a reproducible baseline.
Step 2: Dispatch subagent with test inputs
Use the Task tool to dispatch the agent:
```
Task(
  prompt="""
  [Test input for the agent]
  Context: [Any required context]
  {Include the actual problem/request the agent should handle}
  """,
  subagent_type="{agent-name}"
)
```
Step 3: Capture results verbatim
```markdown
## Test T1: Happy Path
**Input:**
{exact input provided}
**Expected Output:**
{what you expected}
**Actual Output:**
{verbatim output from agent — do not summarize}
**Result:** PASS / FAIL
**Failure Reason:** {if FAIL, exactly what was wrong}
```
Step 4: Identify failure patterns
- Which test categories fail (happy path, error, edge)?
- Are failures structural (missing sections) or behavioral (wrong answers)?
- Do failures correlate with input characteristics?
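The structural-vs-behavioral split above can be automated with a small check on each captured output. A minimal sketch, assuming a hypothetical reviewer agent whose output schema requires Summary, Findings, and Severity sections (adjust the section names and expected findings to what your agent definition actually specifies):

```python
import re

# Required sections for a hypothetical reviewer agent's output schema.
# This list is an assumption for illustration, not prescribed by the skill.
REQUIRED_SECTIONS = ["Summary", "Findings", "Severity"]

def classify_failure(output: str, expected_findings: list[str]) -> list[str]:
    """Separate structural failures (missing sections) from behavioral
    failures (expected findings not reported) in a verbatim output."""
    failures = []
    for section in REQUIRED_SECTIONS:
        # Look for a markdown heading like "## Summary" anywhere in the output
        if not re.search(rf"^#+\s*{re.escape(section)}\b", output, re.MULTILINE):
            failures.append(f"structural: missing section '{section}'")
    for finding in expected_findings:
        if finding.lower() not in output.lower():
            failures.append(f"behavioral: expected finding '{finding}' not reported")
    return failures

output = "## Summary\nLooks fine.\n## Findings\nNo issues found."
print(classify_failure(output, ["SQL injection"]))
# One structural failure (no Severity section) and one behavioral (missed finding)
```

Correlating these two categories with input characteristics tells you whether the fix belongs in the output schema instructions or in the agent's reasoning guidance.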
Gate: All test cases executed. Exact outputs captured verbatim. Failures documented with specific issues identified. Proceed only when gate passes.
Phase 2: GREEN — Fix Agent Definition
Goal: Update agent definition until all test cases pass. One fix at a time.
Step 1: Prioritize failures
Triage failures by severity:
| Severity | Description | Priority |
|---|---|---|
| Critical | Agent produces wrong answers or harmful output | Fix first |
| High | Agent missing required output sections | Fix second |
| Medium | Agent formatting or structure issues | Fix third |
| Low | Agent phrasing or style inconsistencies | Fix last |
Step 2: Diagnose root cause
| Failure Type | Fix Approach |
|---|---|
| Missing output section | Add explicit instruction to include section |
| Wrong format | Add output schema with examples |
| Missing context handling | Add instructions for handling missing info |
| Incorrect classification | Add calibration examples |
| Hallucinated content | Add constraint to only use provided info |
| Agent asks questions instead of answering | Provide required context in prompt or add default handling |
Step 3: Make one fix at a time
Change one thing in the agent definition. Re-run ALL test cases. Document which tests now pass/fail.
Never make multiple fixes simultaneously — you cannot determine which change was effective. This is the same principle as debugging: one variable at a time.
Step 4: Iterate until green
Repeat Step 3 until all test cases pass. If a fix causes a previously passing test to fail, revert and try a different approach.
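The Step 3 and Step 4 loop can be sketched as follows. `run_suite`, `apply_fix`, and `revert_fix` are hypothetical helpers standing in for subagent dispatch and agent-definition edits; the point is the control flow: one change at a time, full suite after each, revert on regression.

```python
def iterate_until_green(fixes, run_suite, apply_fix, revert_fix):
    """Apply one candidate fix at a time, re-running the FULL suite after
    each change. Revert any fix that breaks a previously passing test."""
    baseline = run_suite()  # {test_id: passed} before any change
    log = []
    for fix in fixes:
        apply_fix(fix)
        results = run_suite()
        # A regression is a test that passed at baseline but fails now
        regressions = [t for t in results if baseline.get(t) and not results[t]]
        if regressions:
            revert_fix(fix)
            log.append((fix, "reverted", regressions))
            continue
        baseline = results
        log.append((fix, "kept", [t for t, ok in results.items() if not ok]))
        if all(results.values()):
            break  # green: every test passes
    return baseline, log
```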
Track fix iterations:
```markdown
## Fix Log
| Iteration | Change Made | Tests Passed | Tests Failed | Action |
|-----------|-------------|--------------|--------------|--------|
| 1 | Added output schema | T1, T2 | T3 | Continue |
| 2 | Added error handling instruction | T1, T2, T3 | — | Green |
```
Gate: All test cases pass. No regressions from previously passing tests. Can explain what each fix changed and why. Proceed only when gate passes.
Phase 3: REFACTOR — Edge Cases and Robustness
Goal: Verify agent handles boundary conditions and produces consistent outputs.
Step 1: Add edge case tests
| Category | Test Cases |
|---|---|
| Empty Input | Empty string, whitespace only, no context |
| Large Input | Very long content, deeply nested structures |
| Unusual Input | Malformed data, unexpected formats |
| Ambiguous Input | Cases where correct behavior is unclear |
Step 2: Run consistency tests
Run the same input 3 times. Outputs should be consistent:
- Same structure
- Same key findings (for analysis agents)
- Acceptable variation in phrasing only
If inconsistent: add more explicit instructions to the agent definition. Re-test.
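A minimal sketch of the consistency check: compare only the markdown heading skeleton across runs, so structure must match while phrasing is free to vary. The helpers below are illustrative, not part of the skill's required tooling.

```python
def structure_of(output: str) -> list[str]:
    """Reduce an agent output to its markdown heading skeleton, so runs
    can be compared while ignoring differences in phrasing."""
    return [line.strip() for line in output.splitlines()
            if line.lstrip().startswith("#")]

def consistent(runs: list[str]) -> bool:
    """True when every run shares the same section structure."""
    skeletons = [structure_of(r) for r in runs]
    return all(s == skeletons[0] for s in skeletons[1:])

runs = [
    "## Summary\nAll good.\n## Findings\nNone.",
    "## Summary\nLooks fine overall.\n## Findings\nNothing to report.",
    "## Summary\nOK.\n## Findings\nNone found.",
]
print(consistent(runs))  # True: same sections, only phrasing varies
```

For analysis agents you would extend `structure_of` to also extract key findings, since those must match across runs as well.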
Step 3: Run regression suite
Re-run ALL test cases (original + edge cases) to confirm nothing broke during refactoring.
Step 4: Document final test report
```markdown
## Test Report: {agent-name}
| Metric | Result |
|--------|--------|
| Test Cases Run | N |
| Passed | N |
| Failed | N |
| Pass Rate | N% |

## Verdict
READY FOR DEPLOYMENT / NEEDS FIXES / REQUIRES REVIEW
```
Gate: Edge cases handled. Consistency verified. Full suite green. Test report documented. Testing is complete.
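The report's pass rate and verdict can be computed mechanically from the suite results. The thresholds below are illustrative assumptions, not values prescribed by this skill:

```python
def verdict(results: dict[str, bool]) -> tuple[float, str]:
    """Compute pass rate and a deployment verdict from suite results.
    Thresholds here are an assumption; tune them to your own risk bar."""
    passed = sum(results.values())
    rate = 100 * passed / len(results)
    if rate == 100:
        return rate, "READY FOR DEPLOYMENT"
    if rate >= 80:
        return rate, "REQUIRES REVIEW"
    return rate, "NEEDS FIXES"

print(verdict({"T1": True, "T2": True, "T3": False, "T4": True}))
# (75.0, 'NEEDS FIXES')
```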
Examples
Example 1: Testing a New Reviewer Agent
User says: "Test the new reviewer-security agent"
Actions:
- Define 6 test cases: 2 real issues, 2 clean code, 1 edge case, 1 ambiguous (RED)
- Dispatch subagent for each, capture verbatim outputs (RED)
- Fix agent definition for any failures, re-run all tests (GREEN)
- Add edge cases (empty input, malformed code), verify consistency (REFACTOR)

Result: Agent passes all tests; report documents pass rate and verdict.
Example 2: Testing After Agent Modification
User says: "I updated the golang-general-engineer, make sure it still works"
Actions:
- Run existing test cases against modified agent (RED)
- Compare outputs to previous baseline (RED)
- Fix any regressions introduced by the modification (GREEN)
- Test edge cases to verify robustness not degraded (REFACTOR)

Result: Agent modification validated, no regressions confirmed.
Example 3: Testing Routing Logic
User says: "Verify the /do router sends Go requests to the right agent"
Actions:
- Define test cases: "Review this Go code", "Fix this .go file", "Write a goroutine" (RED)
- Dispatch each through router, verify correct agent handles it (RED)
- Fix routing triggers if wrong agent selected (GREEN)
- Test ambiguous inputs like "Review this code" with mixed-language context (REFACTOR)

Result: Routing validated for all trigger phrases; ambiguous cases documented.
Error Handling
Error: "Agent type not found"
Cause: Agent not registered or name misspelled
Solution:
- Verify the agent file exists: `ls agents/{agent-name}.md`
- Check the YAML frontmatter has the correct `name` field
- Restart Claude Code to pick up new agents
Error: "Inconsistent outputs across runs"
Cause: Agent produces different results for same input
Solution:
- Document the inconsistency — this is a valid finding
- Add more explicit instructions to agent definition
- Re-test consistency after fix
- Determine if variation is acceptable (phrasing) or problematic (structure/findings)
Error: "Subagent timeout"
Cause: Agent taking too long to respond
Solution:
- Simplify test input to reduce processing
- Check agent isn't in an infinite loop or excessive tool use
- Increase timeout if agent legitimately needs more time
Error: "Agent asks questions instead of answering"
Cause: Agent needs clarification that test input did not provide
Solution:
- This may be correct behavior — agent properly requesting context
- Update test input to provide the required context
- Or update agent definition to handle ambiguity with defaults
- Document whether questioning behavior is acceptable for this agent type
Anti-Patterns
Anti-Pattern 1: Testing Without Capturing Exact Output
What it looks like: "Tested the agent, it looks good."
Why wrong: No evidence of what was tested. Cannot reproduce or verify results. Subjective assessment instead of objective evidence.
Do instead: Capture verbatim output for every test case. Document input, expected, actual, and result.
Anti-Pattern 2: Testing Only Happy Path
What it looks like: "Tested with one example, it worked."
Why wrong: Agents fail on edge cases most often. One test proves almost nothing. False confidence in agent quality.
Do instead: Minimum 3-6 test cases per agent covering success, failure, edge, and ambiguous inputs.
Anti-Pattern 3: Skipping Re-test After Fixes
What it looks like: "Fixed the issue, should work now."
Why wrong: Fix might have broken other tests. No verification fix actually works. Regression bugs slip through.
Do instead: Re-run ALL test cases after any change. Only mark green when full suite passes.
Anti-Pattern 4: Reading Prompts Instead of Running Agents
What it looks like: "Checked that agent prompt has the right sections."
Why wrong: Reading a prompt is not executing an agent. Prompt structure does not guarantee behavior. Must verify actual output.
Do instead: Test what the agent DOES, not what the prompt SAYS. Execute with real inputs via Task tool.
Anti-Pattern 5: Self-Exempting from Testing
What it looks like: "This agent is simple, doesn't need testing." or "Simple change, no need to re-test."
Why wrong: Simple agents can still fail. Small changes can break behavior. You cannot self-determine exemptions from testing.
Do instead: Get human partner confirmation for exemptions. When in doubt, test. Document why testing was skipped if approved.
References
This skill uses these shared patterns:
- Anti-Rationalization — Prevents shortcut rationalizations
- Anti-Rationalization: Testing — Testing-specific rationalization blocks
- Verification Checklist — Pre-completion checks
Domain-Specific Anti-Rationalization
| Rationalization | Why It's Wrong | Required Action |
|---|---|---|
| "Agent prompt looks correct" | Reading prompt ≠ executing agent | Dispatch subagent and capture output |
| "Tested manually in conversation" | Not reproducible, no baseline | Use Task tool for formal dispatch |
| "Only a small change" | Small changes can break agent behavior | Run full test suite |
| "Will monitor in production" | Production monitoring ≠ pre-deployment testing | Complete RED-GREEN-REFACTOR first |
| "Based on working template" | Template correctness ≠ instance correctness | Test this specific agent |
Integration
- agent-comparison: A/B test agent variants
- agent-evaluation: Structural quality checks
- test-driven-development: TDD principles applied to agents
Reference Files
- ${CLAUDE_SKILL_DIR}/references/testing-patterns.md: Dispatch patterns, test scenarios, eval harness integration