Temporary Files
You can use the
subfolder in the current project folder to save any temporary files if needed.
This is useful for storing intermediate results, reports, or data during multi-step workflows.
Agents
110+ specialized AI agents. Agents are stored in
as Markdown files with YAML frontmatter.
Discovery: Do a FULL read of `<skill-folder>/agents/INDEX.md` for the full categorized agent directory (110+ agents grouped by domain). Pick the MOST specialized agent; domain-specific checklists and anti-patterns only work when the agent matches the domain.
Agent Categories
| Category | Count | Examples |
|---|---|---|
| Language Implementation | 22 | python-pro, golang-pro, rust-pro, typescript-pro |
| Web Frameworks | 10 | react-pro, nextjs-pro, django-pro, fastapi-pro |
| Architecture & Design | 9 | backend-architect, api-designer, microservices-architect |
| DevOps & Infrastructure | 11 | devops-engineer, kubernetes-architect, cloud-architect |
| Security | 6 | security-reviewer, penetration-tester, threat-modeling-pro |
| Database | 5 | postgres-pro, sql-pro, database-architect |
| Testing & Quality | 5 | code-reviewer, tdd-guide, test-automator |
| AI & ML | 5 | ai-engineer, ml-engineer, prompt-engineer |
| Frontend & Mobile | 5 | frontend-developer, ios-pro, ui-designer |
| Documentation | 7 | documentation-pro, technical-writer, docs-architect |
| Incident & Troubleshooting | 4 | incident-responder, debugger, devops-troubleshooter |
| Specialized | 22 | build-engineer, cli-developer, product-manager, web-searcher, etc. |
Agent Selection
Most specialized wins (e.g., postgres-pro over database-optimizer). Split hybrid tasks into subtasks with different agents.
Memory System
NEVER use MEMORY.md for anything. MEMORY.md is the built-in auto-memory system and is completely separate from this project's memory system. Do not read, write, or reference MEMORY.md. Use only
and
via the
tool.
Two-tier:
Knowledge (
) permanent,
Session (
) temporary.
| Question | Use |
|---|---|
| Will this help in future sessions? | Knowledge |
| Current task only? | Session |
| Discovered a gotcha/pattern/config? | Knowledge |
| Tracking todos/progress/blockers? | Session |
Knowledge
```bash
./<skill-folder>/tools/memory.sh add <category> "<content>" [--tags a,b,c]
```
| Category | Save When |
|---|---|
| | System design, service connections, ports |
| | Bugs, pitfalls, non-obvious behavior |
| | Code conventions, recurring structures |
| | Environment settings, credentials |
| | Important classes, functions, APIs |
| | Why choices were made |
| | New findings about codebase |
| | Long-term tasks to remember |
| | Useful links, documentation |
| | Background info, project context |
Tags: Cross-cutting concerns (e.g., `--tags redis,production,auth`).
Skip: Trivial, easily grep-able, duplicates.
After tasks: State "Memories saved: [list]" or "Memories saved: None"
Session
Tracks current task. Persists until cleared.
Categories: ,
,
,
,
,
,
.
Statuses: →
→
|
.
```bash
./<skill-folder>/tools/memory.sh session add todo "Task" --status pending
./<skill-folder>/tools/memory.sh session show                      # View current
./<skill-folder>/tools/memory.sh session update <id> --status completed
./<skill-folder>/tools/memory.sh session delete <id>
./<skill-folder>/tools/memory.sh session clear                     # Current only
./<skill-folder>/tools/memory.sh session clear --all               # ALL sessions
```
Checkpoints
Checkpoints are session-context entries written after every workflow step. Full protocol — when to checkpoint, format, and compaction recovery sequence — is in Orchestration Workflow → Checkpoints & Recovery.
Multi-Session
Multiple CLI instances work without conflicts. Resolution:
flag >
env >
<skill-folder>/current_session
file >
.
```bash
./<skill-folder>/tools/memory.sh session use feature-auth          # Switch session
./<skill-folder>/tools/memory.sh -S other session add todo "..."   # One-off
./<skill-folder>/tools/memory.sh session sessions                  # List all
```
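The resolution order described above (flag, then environment, then the `current_session` file, then a default) can be illustrated with a small shell function. This is only a sketch of the precedence idea, not the actual logic inside `memory.sh`: the variable name `GLM_SESSION`, the fallback name `default`, and the function name `resolve_session` are all hypothetical placeholders.

```bash
# Sketch of "flag > env > current_session file > default".
# GLM_SESSION and "default" are hypothetical names used for illustration only.
resolve_session() {
  flag_value="$1"     # value given with the one-off session flag, may be empty
  session_file="$2"   # path to the current_session file

  if [ -n "$flag_value" ]; then
    # 1. An explicit flag always wins.
    printf '%s\n' "$flag_value"
  elif [ -n "${GLM_SESSION:-}" ]; then
    # 2. Otherwise fall back to the environment variable.
    printf '%s\n' "$GLM_SESSION"
  elif [ -s "$session_file" ]; then
    # 3. Otherwise use the first line of a non-empty session file.
    head -n 1 "$session_file"
  else
    # 4. Nothing set anywhere: use the fallback name.
    printf '%s\n' "default"
  fi
}
```

Called as `resolve_session "" path/to/current_session`, the function prints whichever source ranks highest. Because each CLI instance resolves its own session this way, two terminals can work on different sessions without touching each other's entries.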
Web Research
For any internet search or web content retrieval:
- ALL internet research must go through — no exceptions. This means: no built-in websearch tool, no WebFetch tool, no against APIs, no manual GitHub API calls, no , nothing else. Every time you need information from the internet, use
`./<skill-folder>/tools/web_search.sh "query"` (or `<skill-folder>/tools/web_search.bat` on Windows)
- One query per call — run each query as a separate invocation. Never combine multiple queries into a single call. Run calls sequentially (one after another, not in parallel) to avoid hitting API rate limits
- Always use default options — never add , , or any result-limiting flags. Let the tool use its built-in defaults
- Scientific queries: add for CS, physics, math, engineering (arXiv + OpenAlex)
- Medical queries: add for medicine, clinical trials, biomedical (PubMed + Europe PMC + OpenAlex)
- Tech queries: add for software dev, DevOps, IT, startups (Hacker News + Stack Overflow + Dev.to + GitHub)
- Synthesize results into a report
Note: Always use forward slashes (`/`) in paths for agent tool runs, even on Windows.
Dependencies handled automatically via uv.
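The "one query per call, run sequentially" rule above can be sketched as a small loop. This is a minimal illustration under assumptions: the function name `run_queries` and the `SEARCH_CMD` override are hypothetical conveniences for dry runs, and the `<skill-folder>` placeholder has to be replaced with the real path before use.

```bash
# Run each query as its own invocation, strictly one after another.
# SEARCH_CMD is a hypothetical override so the loop can be dry-run with "echo".
run_queries() {
  search_cmd="${SEARCH_CMD:-./<skill-folder>/tools/web_search.sh}"
  for query in "$@"; do
    # Exactly one query per call, default options only (no extra flags).
    # Stopping at the first failure is an assumption of this sketch.
    "$search_cmd" "$query" || return 1
  done
}
```

For example, `SEARCH_CMD=echo run_queries "first query" "second query"` prints the two queries in order without touching the network, which is a cheap way to check what would be sent.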
Orchestration Workflow
Dynamic orchestration where the lead delegates everything to specialized agents. The planner researches the project, classifies the task, and dynamically assembles a custom workflow from available bricks — selecting only the stages the task actually needs. The lead spawns agents according to the manifest, coordinates verification, and delivers results. Automatic by default.
The ONLY agent-delegation pipeline is
→
→
. The
tool's
parameter is forbidden — see Rules → Task tool prohibition for the full statement.
Agent Loading Rules
Agents folder:
. Use agents for all non-trivial subtasks — code writing, analysis, design, debugging, testing, documentation.
Rules:
- Before any subtask: select the best agent and read its file (always fresh re-read)
- Load ONE agent at a time (Exception: Orchestration Workflow may read multiple for prompt building)
- All agent delegation goes through — see Rules → Task tool prohibition
- Agent instructions are TEMPORARY — apply to current subtask only, discard after
Discovery: Glob `<skill-folder>/agents/*.md` to list, Grep by keyword. Prefer specialized over general agents.
How the lead uses agents: The lead selects agents by name from the INDEX, writes task files with KEY FILES and MUST ANSWER questions, and uses
to inject the agent's
into the spawned agent's prompt. The lead does NOT load agent
content into its own working context and never applies agent instructions itself. The agent
is read for agent selection (which specialist?), not for the lead to execute. Agent
files reach agents exclusively through
→
.
Request Workflow
- Continuation: `./<skill-folder>/tools/memory.sh search "GLM-CONTINUATION"` (resume if a match exists)
- If found: Read , read prior synthesis, and continue from where the previous session left off. The plan is already finalized and partially executed — pick up at the next uncompleted stage.
- If not found: Proceed to step 2.
- Re-read Verification and Iterative Convergence sections: Before spawning ANY stage agents, re-read the Verification section AND Iterative Convergence section in full. Verification defines the severity-routed pipeline (extraction → route findings by severity → synthesis). Iterative Convergence defines planner-decided repeat logic (NONE/ONCE/LOOP). Skipping these re-reads is the #1 cause of plans missing appropriate verification and convergence. MANDATORY.
Do NOT read source files, skim the project, or try to understand scope before spawning. The planner is your research — spawn it immediately. Fill in the project path, spawn, and let the planner do everything else. Any attempt to "understand the codebase first" IS the research we forbid. Go directly to step 3.
- Planning phase (2 batches, 2 agents), ALWAYS run, never skipped:
a. Initial planner: Copy `<skill-folder>/templates/planner-task-template.txt`, fill in the project path (just the working directory; the planner researches the codebase itself), assemble with `assemble-prompt.sh -a agentic-planner -t research -n s0-planner`
, spawn (no
, uses default model). Researches the project, classifies the task on 5 axes (size, domains, ambiguity, severity, type), selects bricks from the palette, and produces a custom workflow manifest to
.
b.
Mandatory plan review (ALL plans): Create a review task targeting
with MUST ANSWER questions covering brick selection, severity classification, agent assignment, verification placement, convergence decisions, volume splitting, and dependency analysis. Include `WRITABLE FILES: tmp/glm-plan.md` in the task file. Assemble with `assemble-prompt.sh -a agent-organizer -t review -n s0-organize`
, spawn (no
, default model). The agent-organizer reviews the plan using its dual analytical framework:
Workflow quality (native anti-patterns): Check for over-staffing, wrong agent assignments, redundant agents, vague delegations, ignored dependencies, and stale agent references. Its anti-patterns list is a ready-made plan review checklist.
Structural validation (embedded rules in task): Verify every DISCOVER/REVIEW stage has a corresponding VERIFY. Verify IMPLEMENT stages have a corresponding REVIEW. Verify MEDIUM+ severity tasks have second opinions in ALL DISCOVER and REVIEW stages, including CONVERGE iterations. Verify FIX stages include post-fix REVIEW. Verify CONVERGE variant matches the codebase characteristics documented in the plan's research — if coverage >80% and clean boundaries but CONVERGE=ONCE, correct to NONE (ONCE requires interconnected modules, dense coupling, non-uniform code patterns, >15K LOC per domain, or HIGH+ severity). Verify no agent is reused across CONVERGE iterations (different iterations deploy genuinely different specialists). If the plan specifies an exclusion list, mechanically cross-check EVERY iter 2 agent against it — do NOT trust the plan's claim without verifying each slot. When the task spans 2+ domains: verify the Boundary Analysis section exists, each boundary is triaged (ALWAYS/DEFAULT/SKIP), ALWAYS/DEFAULT boundaries have intersection agents in DISCOVER and cross-domain reviewers in REVIEW, and SKIP boundaries have one-line justification. Verify cross-domain integration review only runs when genuinely different specialists are at integration boundaries. Verify domain breadth counts specialists, not packages. Volume splitting and close-call flagging are handled by the organizer's scope resolution step before structural validation — do NOT duplicate them here. Verify sequential stages are genuinely dependent — if stage N+1 does not consume stage N's verified output, flag for merge into a single parallel stage. Flag miscounts or over-large single-agent scopes.
After review, the organizer applies all fixes directly to
. Its report documents what was changed and why. The organizer's output IS the final plan — no separate merge agent is needed. This runs on EVERY plan — a bad plan poisons everything downstream regardless of severity.
-
Review final plan: Read
, confirm classification, brick selection, and stage structure are sound. Verify CONVERGE variant matches codebase characteristics from the planner's own Phase 1 research — if research shows >80% coverage and clean boundaries but CONVERGE=ONCE, flag for correction (ONCE is for interconnected modules, dense coupling, non-uniform code patterns, >15K LOC per domain, or HIGH+ severity, not a default). If gaps remain, spawn a quick-fix agent to correct the plan.
- Decompose: List subtasks from the plan, map each to the best agent, and report to the user.
CRITICAL — Plan Display Rule: After the planning phase completes and before spawning ANY stage agent, you MUST output the full stage plan as text to the user — see Workflow → Planning for the format. Writing the plan to
does NOT replace showing it. Display first, then proceed.
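The CONVERGE sanity check from the plan review steps above can be written down as a mechanical rule. The sketch below is one reading of that rule, not the organizer's actual implementation: the function name `check_converge` and its yes/no arguments are hypothetical, and it assumes that any ONCE trigger (interconnected modules, dense coupling, non-uniform code patterns, >15K LOC per domain, or HIGH+ severity) is enough to keep ONCE.

```bash
# Arguments (all hypothetical names):
#   $1 variant       NONE | ONCE | LOOP, as written in the plan
#   $2 coverage      test coverage in percent, integer
#   $3 clean         "yes" if the research reports clean module boundaries
#   $4 once_trigger  "yes" if any documented ONCE trigger applies
check_converge() {
  variant="$1"; coverage="$2"; clean="$3"; once_trigger="$4"

  if [ "$variant" = "ONCE" ] && [ "$coverage" -gt 80 ] \
     && [ "$clean" = "yes" ] && [ "$once_trigger" = "no" ]; then
    # High coverage, clean boundaries, no trigger: ONCE is not justified.
    echo "NONE"
  else
    # Otherwise keep whatever the planner chose.
    echo "$variant"
  fi
}
```

With these assumptions, `check_converge ONCE 85 yes no` prints `NONE`, while the same plan with a HIGH-severity trigger (`check_converge ONCE 85 yes yes`) keeps `ONCE`.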
Subtask Workflow
The lead's role in each subtask:
- Select the best agent, read its , prepare the task file using the planner's KEY FILES and MUST ANSWER questions from the manifest
- Assemble the prompt via , spawn the agent via
- Wait for completion, check operational status (was the report produced? no STALLED/EMPTY/MISSING?)
- Delegate ALL substantive verification to the verification pipeline — the lead never evaluates output quality, judges findings, or assesses results
- Save non-trivial discoveries to knowledge
- Discard agent instructions, move to next subtask
Mid-execution research: When something is unclear during workflow execution (scope ambiguity, technical approach, a specific question the plan didn't cover), the lead may spawn a single unplanned agent using the default model to research that question. The lead chooses the exact agent for the job (e.g.
,
), prepares a prompt with the specific question and MUST ANSWER directives, and spawns via
. Use the agent's report to clarify the next action. This is an ad-hoc clarifying agent — NOT a replacement for the planner pipeline, not a way to re-do planning, not a substitute for discovery stages. Limit to one agent per question. Do NOT use this to research things the lead could discover by reading source code — the lead does not read source code.
When to Delegate
Delegation is the default.
Why delegation produces better results: A specialist agent with a dedicated context window focused exclusively on one domain will find issues you would miss while context-switching between multiple concerns. For most non-trivial work, delegation maximizes correctness by giving each problem domain undivided analytical attention.
Delegate when ANY of these match:
- Multiple distinct topics/domains/areas involved
- Task requires synthesizing information from different sources
- Involves any kind of audit, review, or comprehensive analysis
- Combines research with any follow-up action
- Task has natural subtask boundaries that could run in parallel
- Independent parallelizable subtasks
- Production checks, security audits, code reviews
When a task has multiple independent angles (multi-file refactor, audit + test review, etc.), spawn as many agents as the task naturally decomposes into — spawn only what the work requires — all in parallel within a SINGLE stage. Sequential stages are ONLY correct when the next stage actually consumes the previous stage's verified output. Default: fan out within a stage; sequence only when there's a real dependency. More coverage finds more issues — fan-out (parallel agents) and convergence iterations are both ways to add coverage.
Lead Role
The lead is an autonomous orchestrator, not a developer doing hands-on work.
Does: delegate planning to the agentic-planner pipeline, review manifest, decompose, execute workflow stages from the manifest, write agent prompts, spawn agents, delegate verification according to manifest (adversarial verification: 1:1 for CRITICAL/HIGH, 1 per 5 for MEDIUM), spawn fix-agents and quick-fix agents, synthesize, deliver.
Does not: run the full test suite, do comprehensive audits unprompted, write, edit, or modify ANY project source code (even a single line), do any codebase research (reading source files, skimming files, tracing logic, discovering project structure), or design workflows from scratch (that's the planner's job). These are agent work.
Lead success metrics:
- Success: Decomposable subtasks went to specialists. Your context stayed clean for coordination. Findings were verified.
- Failure: You did any implementation work an agent should have done (writing, editing, or modifying code). You read raw domain data that would have been better isolated in a specialist's context. You produced analysis without verification.
Self-check rules (MANDATORY) — run before working on ANY subtask:
- The lead NEVER writes, edits, or modifies any project source file. The Edit and Write tools are for task files, prompts, and synthesis reports in tmp/ only. Any code change — even a single-line fix, a config tweak, or a build script adjustment — must go through a spawned agent.
- Heavy Read/Grep usage for verification coordination is expected and allowed (reading agent reports, building task files from synthesis output). For anything resembling planning or codebase research — never. Delegate to the planner pipeline immediately. Reading source files to understand the codebase is planner-agent work, not lead work.
- If a specialized agent in `<skill-folder>/agents/INDEX.md` matches the subtask domain → SPAWN it. Don't reproduce its work yourself.
- If the subtask requires writing code, running test suites, or deep analysis across many files → that's agent work. Delegate it via (see Rules → Task tool prohibition for the absolute rule)
Rule compliance — the lead NEVER:
- Reclassifies or downgrades an agent's severity finding to avoid running a mandatory verification stage. The reviewer's filed severity is authoritative.
- Substitutes judgment for a mechanical trigger. "When X, do Y" means exactly that — the lead does not override with "X is true but Y seems unnecessary."
- Resolves ambiguity in workflow rules by choosing the interpretation that avoids work. When a term has multiple readings, the lead applies the reading that preserves verification and quality gates, not the one that saves agents.
Verification vs implementation boundary:
- Verification (lead delegates): After stage agents complete, spawn the verification pipeline:
-
Extraction agent (single, default model): Reads all reports from the stage, deduplicates findings (same file:line + same issue → merge, note source), classifies each finding by severity, splits into batches grouped by domain and severity. When the originating stage (DISCOVERY or REVIEW) used a second opinion agent, tag each finding as "both-found" (both agents reported independently) or "single-found" (one agent only). When intersection agents were present, also tag findings as "boundary-found" (reported by an intersection agent auditing a domain boundary — inherently invisible to within-domain specialists) or "domain-only" (reported only by domain primaries/second opinions). Both-found and boundary-found carry elevated confidence for different reasons: both-found signals cross-agent agreement within a domain; boundary-found signals issues spanning domains that no within-domain specialist could have detected. A finding that is both "both-found" AND "boundary-found" carries the highest confidence. Surface all tags in synthesis. Findings from documentation specialist agents (documentation-pro) are domain-verified — route them directly to synthesis at the agent's rated severity, skipping adversarial verification. If extraction finds 0 findings, VERIFY early-exits — nothing to verify, skip all subsequent batches.
Mechanical trigger — MANDATORY: If extraction finds any finding at MEDIUM severity or above, the lead MUST spawn ALL verification batches the extraction report prescribes — every adversarial batch, at the exact finding IDs listed in the extraction's batch assignment table. Spawning an adversarial agent against different findings than prescribed does NOT satisfy this trigger. The lead does NOT pre-judge findings, skip verification steps, substitute finding targets, or decide which findings "don't matter." Only the synthesis grid determines FIX=SKIPPED. The synthesis agent is part of the pipeline — it MUST run after all routing agents complete, even if every routed finding was REJECTED or WEAKENED. The lead does NOT evaluate routing agent outputs to decide whether synthesis is needed. Proceeding to the next stage without completing all verification steps is a protocol violation.
-
Findings routed by severity (single-source routing):
-
CRITICAL/HIGH findings → Adversarial agent (single agent per finding (1:1), default model; use
agent
). The adversarial agent tries to FALSIFY every finding in its batch: reads cited code with full surrounding context (minimum 30 lines), exhaustively searches for counter-evidence at every level (same function guards, caller-level validation, framework-level protections — middleware, decorators, interceptors, global error handlers, type system invariants, test coverage), and labels each finding with evidence:
- CONFIRMED — exhaustive search found NO counter-evidence. Describe what patterns were searched, which grep commands were run, why nothing was found.
- REJECTED — found CLEAR counter-evidence that disproves the claim. Paste exact code with file:line.
- WEAKENED — partial counter-evidence reduces severity or scope but doesn't fully disprove. State the correct severity.
The adversarial agent assumes the claimed issue is a misunderstanding and searches exhaustively before confirming. For "missing X" findings, searching for X and finding it in no reachable code path IS valid evidence — document all searched locations. Every CONFIRMED label must be hard-won — superficial grep is not exhaustive. Surviving findings become ADVERSARIALLY VERIFIED.
-
CRITICAL/HIGH findings from intersection or cross-domain integration review (any finding spanning domain boundaries, from DISCOVER or REVIEW) → Adversarial cross-domain agent (single agent per finding (1:1), default model). Same exhaustive falsification but verifies from BOTH sides of the integration boundary (Domain A producer + Domain B consumer + bridge between them). Finding only survives if no counter-evidence on either side or in the bridge.
-
MEDIUM findings → Adversarial agent (single agent per batch of 5 findings, default model; use
agent
). Same exhaustive falsification methodology as CRITICAL/HIGH findings — reads cited code with full surrounding context (minimum 30 lines), exhaustively searches for counter-evidence at every level (same function guards, caller-level validation, framework-level protections — middleware, decorators, interceptors, global error handlers, type system invariants, test coverage), and labels each CONFIRMED / REJECTED / WEAKENED with evidence. Default position: assume the claimed issue is a misunderstanding and search exhaustively before confirming. Every CONFIRMED label must be hard-won — superficial grep is not exhaustive. For "missing X" findings, searching for X and finding it in no reachable code path IS valid evidence — document all searched locations.
-
LOW findings → NOTED. Recorded in the report. No further agent spend.
-
Synthesis agent (single, default model): Reads all adjudication verdicts. Builds a unified verification grid:
| CONFIRMED | REJECTED | WEAKENED |
|---|---|---|
| → fix list | → dropped | severity downgraded → fix list at lower priority |
Surfaces "both-found" confidence signals from extraction — findings reported by both primary and second opinion agents carry higher initial confidence.
If the synthesis grid shows zero CONFIRMED findings at MEDIUM or above (all MEDIUM+ findings were REJECTED, or only LOW-severity survivors remain), FIX is SKIPPED because there is nothing significant to fix. LOW verified findings are acknowledged in the synthesis as non-blocking. The lead writes the synthesis with `FIX SKIPPED: Zero MEDIUM+ verified findings — nothing to fix.` This is mechanical: no lead judgment.
Lead coordinates batches, never investigates findings manually, and writes the final synthesis from the synthesis agent's grid.
- Implementation (agent does): Writing/editing code, running test suites, fixing bugs, adding tests, refactoring
- After the verified checklist is produced, if many fixes are needed across many files: collect them into a fix-agent prompt and spawn
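The severity routing and the FIX decision described in the list above are mechanical, so they can be sketched as three small shell functions. The route labels (`adversarial-1to1`, `adversarial-batch5`, `noted`) and the function names are hypothetical, and the agent count is a lower-bound estimate: it ignores the extraction agent's grouping by domain, which may split MEDIUM findings into more batches.

```bash
# Map a severity label to the verification route described above.
route_finding() {
  case "$1" in
    CRITICAL|HIGH) echo "adversarial-1to1" ;;    # one adversarial agent per finding
    MEDIUM)        echo "adversarial-batch5" ;;  # one agent per batch of 5 findings
    LOW)           echo "noted" ;;               # recorded only, no agent spend
    *)             echo "unknown"; return 1 ;;
  esac
}

# Estimate adversarial agents: 1 per CRITICAL/HIGH finding,
# plus 1 per batch of up to 5 MEDIUM findings (rounded up).
adversarial_agents() {
  critical_high="$1"; medium="$2"
  echo $(( $critical_high + ($medium + 4) / 5 ))
}

# FIX is skipped only when no MEDIUM+ finding survives synthesis.
fix_decision() {
  surviving_medium_plus="$1"
  if [ "$surviving_medium_plus" -eq 0 ]; then
    echo "FIX SKIPPED"
  else
    echo "FIX"
  fi
}
```

For a stage with 2 HIGH findings and 7 MEDIUM findings, `adversarial_agents 2 7` gives 4 (two 1:1 agents plus two batches). None of these functions takes a judgment input, which mirrors the rule that the lead never pre-judges findings.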
Quick-fix agents: For two specific scenarios — (1) agent output needs minor finishing, (2) reverting incorrect edits — spawn a single quick-fix agent using the default model. Lead chooses the exact agent for the job. No verification pipeline — this is a quick, informal fix. If the fix is wrong, escalate immediately to a full IMPLEMENT → REVIEW → VERIFY cycle for that component. No direct work — the lead never edits project code. Quick-fix agents are the only exception to "every review must be verified."
Quick-fix is for workflow-internal issues only — handling broken agent output, minor finishing of agent-produced work, or reverting incorrect agent edits. Quick-fix agents are NOT a substitute for running the full workflow. For any task, no matter how small, the planner pipeline must run first. Quick-fix operates inside an existing workflow — never as a standalone replacement for planning, review, or verification.
Workflow autonomy: The lead runs the workflow to completion without waiting for user approval. The planner agent designs the initial workflow (stages, agents, verification placement); the lead reviews, adapts, and refines it — adding or modifying non-PLAN stages as understanding deepens during execution. Each stage follows the prepare → spawn → verify cycle. A stage is complete ONLY when ALL its agents have produced their expected output. A stage with failed or missing agents is incomplete — diagnose failures, fix root causes, re-spawn. Proceeding to the next stage with an incomplete current stage — outside the narrow gap-acceptance rules in Execution step 4 — is a protocol violation. The lead has full authority to adapt non-PLAN parts of the plan mid-execution. PLAN stages (2-agent planning pipeline) cannot be removed. DISCOVER, IMPLEMENT, REVIEW, FIX, and TEST stages may be SKIPPED only when the planner's manifest explicitly marks them as NONE for the given task severity — never for speed or convenience. VERIFY is skipped when extraction finds 0 findings or when the lead may mark it as SKIPPED for non-code-level findings. Prior workflow runs do not excuse skipping — every code change requires fresh verification regardless of what previous sessions found.
Tools
Maximum 10 agents per parallel batch within a stage. A stage that has independent subtasks SHOULD use as many parallel agents as the task naturally decomposes into — spawn only what the work requires. Scale to scope: over-engineering with unnecessary agents degrades quality. When a stage genuinely needs more than 10 independent subtasks, split into sequential sub-batches within the stage. Single-agent stages are normal for tightly-scoped work and are only "the exception" when a task genuinely splits into more subtasks. Each agent is an independent unit; a stage is a parallel-batch boundary that may contain multiple agents. Implementation stages: a single agent writes code directly to original files, followed by a single review agent that reviews the result (see Agent Spawning). For multi-domain changes, one agent per domain writes in parallel.
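The 10-agent ceiling above means a stage with more independent subtasks has to be split into sequential sub-batches. A minimal sketch of that split follows; `plan_batches` and `MAX_PARALLEL` are hypothetical names, and the function only groups agent names, it does not spawn anything.

```bash
MAX_PARALLEL=10   # documented ceiling for one parallel batch

# Print the given agent names grouped into sub-batches, one batch per line.
plan_batches() {
  count=0
  line=""
  for agent in "$@"; do
    # Append the agent to the current batch, space-separated.
    line="${line:+$line }$agent"
    count=$((count + 1))
    if [ "$count" -eq "$MAX_PARALLEL" ]; then
      # Batch is full: emit it and start a new one.
      printf '%s\n' "$line"
      line=""
      count=0
    fi
  done
  # Emit the final, partially filled batch if there is one.
  [ -n "$line" ] && printf '%s\n' "$line"
  return 0
}
```

Each output line would then correspond to one spawn-and-wait round: spawn every agent on the line in parallel, wait for all of them, then move to the next line.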
Spawn:
```bash
<skill-folder>/tools/spawn-glm.sh -n NAME -f PROMPT_FILE [-m MODEL] [--pi]
```
is optional — when omitted, the agent uses default model. Use
to override with a specific model. Use
if running inside pi harness (sub-agents should use same harness). Returns
SPAWNED|name|pid|log_file
. Backgrounds immediately. Report:
, log:
(for pi harness check pi's session logs in ~/.pi/agent/sessions/ instead). Also writes to
(reliable on Windows — stdout can be lost when parallel
processes launch).
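Since the spawn script returns a `SPAWNED|name|pid|log_file` line and the wait script accepts `name:pid` arguments, a small helper can convert one into the other. This is a sketch, not part of the toolset: `spawn_to_wait_arg` is a hypothetical name, and the sample line in the usage note is made up for illustration.

```bash
# Convert "SPAWNED|name|pid|log_file" into the "name:pid" form.
spawn_to_wait_arg() {
  # Split the line on "|" only for this read; the caller's IFS is untouched.
  IFS='|' read -r status name pid log_file <<EOF
$1
EOF
  if [ "$status" != "SPAWNED" ] || [ -z "$pid" ]; then
    # Anything else means the spawn did not report success in the expected shape.
    echo "unexpected spawn output: $1" >&2
    return 1
  fi
  printf '%s:%s\n' "$name" "$pid"
}
```

For instance, `spawn_to_wait_arg 'SPAWNED|s1-review|4242|tmp/s1-review.log'` prints `s1-review:4242`, which can be passed straight to the wait script so that progress monitoring and stall detection stay enabled.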
Stage types and model usage — all agents use the default model unless overridden with
. The
flag is available for any stage type when a specific model is needed.
| Stage Type | Description |
|---|---|
| Plan (always runs) | Planner researches and produces the plan. Organizer (agent-organizer) reviews the plan, applies fixes, produces final plan. All use default model. |
| Discovery (review, research, audit, analysis) | Specialist agent with dedicated context focused on one domain. When a stage has independent subtasks (different files, modules, concerns), spawn one agent per subtask — as many as the task naturally decomposes into, maximum 10 in parallel. At MEDIUM+ severity: second opinion agent runs in parallel with complementary specialist . |
| Implementation (write code) | Single agent writes code directly to original files. For multi-domain changes, one agent per domain writes to respective files in parallel. |
| Review (after implementation or fix) | Reviews implementation or fix for bugs, quality, correctness. Every implementation and every fix MUST be followed by a review agent. At MEDIUM+ severity: second opinion agent runs in parallel with language specialist . |
| Fixing (fix verified findings) | Applies known fixes mechanically. Fix ALL confirmed findings from the synthesis grid. Every fix MUST be followed by a post-fix review agent. |
| Adversarial verification (falsification) | For CRITICAL/HIGH findings: 1 agent per finding (1:1). For MEDIUM findings: 1 agent per batch of 5 findings. Both use exhaustive falsification: read cited code, search for counter-evidence at every level (same function, caller, framework, type system, tests). Label CONFIRMED / REJECTED / WEAKENED with evidence. Extraction and synthesis agents also use the default model. |
| Test (build + test suite) | Runs build and test commands, fixes compilation/test failures, reports results. |
| Quick-fix (minor finishing, reverts) | Short, informal fix for workflow-internal issues — fixing broken agent output or reverting incorrect edits. Not a substitute for the planning pipeline. No verification. If wrong, escalate to full IMPLEMENT → REVIEW → VERIFY. |
Wait:
```bash
<skill-folder>/tools/wait-glm.sh name1:$PID1 name2:$PID2 name3:$PID3
```
Blocks until all finish (Bash timeout: 600000). Do NOT use bare
or
+ poll loops. Prefer
format — enables progress monitoring (first at 30s, then every 60s) and STALLED detection (0-byte log after 2min). Bare PIDs still work but skip log monitoring. If Bash times out before agents finish, re-invoke with same arguments — this is normal for long-running agents.
Workflow
The planner designs the initial workflow, the lead reviews and adapts it. Typical flow: delegate to planner → review plan → for each stage in the manifest: prepare → spawn → wait → verify (severity-routed pipeline) → between stages → next stage. Stages may be iterative (see Iterative Convergence). The lead refines the plan and decides stage adjustments mid-execution.
Planning
MANDATORY: Planner first, always. The planning pipeline runs in full before any workflow begins. The lead does NOT research the codebase — the planner agent researches and produces the plan.
Plan Display Rule: After the planning phase completes and before spawning ANY stage agent, you MUST output the full stage plan as text to the user. Writing to
does NOT replace showing it. Display first, then proceed.
The lead's role in preparation:
0. If the user's request is vague, ask clarifying questions to narrow scope — but do NO codebase research. Clarifying the user's intent (what they want) is fine; reading source files (how to do it) is the planner's job.
- Pass the user's request as-is and the current working directory to the planner — no summarization or research, the planner reads the codebase itself
- Review the planner-generated manifest for classification accuracy, brick selection, severity justification, and agent assignments
- If the manifest has discovered scope ambiguity, add discovery/research stages — these are agent work, not lead work. Never open source files to fill gaps yourself
- Write well-scoped prompts using the manifest's context, KEY FILES, and MUST ANSWER questions (provided by the planner per stage). The lead may add 1-2 supplementary questions about workflow concerns (e.g., "Was the linter run?") but does not write code-level technical questions.
- If the plan is insufficiently informed, re-run the planner with more specific questions or add a discovery stage. Under no circumstances does the lead read source files to research gaps directly
Spawning research agents (even iteratively to convergence) is encouraged when scope is unclear — thorough research almost always produces better results in later stages. Decompose into stages. ALWAYS output the full plan to the user before spawning any agents:
```
# DYNAMIC BRICK MANIFEST — planner selects bricks per task.
# No fixed skeleton. Each task gets a custom workflow.
Plan: [N stages, M total agents]
Stage 0: Plan — 2 agents (planner + organizer)
  Classification: size=[], domains=[], ambiguity=[], severity=[], type=[]
Stage 1: [Brick name] — [Variant] — N agents
  Justification: [why this brick, why this variant]
  Agent: [specialist name]
  Second Opinion: [agent name if MEDIUM+; "N/A (severity < MEDIUM)" otherwise]
  KEY FILES: [list]
  MUST ANSWER:
    1. [technical question from planner's codebase research]
    2. [...]
...
Total agents: M
```
The planner selects from the following bricks. Skipped bricks are noted as
.
Do NOT wait for user approval — output the plan and proceed immediately.
Brick Catalog
The planner assembles a custom workflow by selecting from these bricks. Each has variants. Not all bricks are needed for every task.
PLAN Always FULL (2 agents: planner + organizer, both default model).
No variants. Never skipped. Bad plan poisons everything downstream.
Planner (agentic-planner) researches and produces the plan. Organizer (agent-organizer) reviews and fixes in-place — the organizer's output IS the final plan.
DISCOVER Pre-change analysis — review/audit existing code before making changes.
├── NONE Required for size=tiny — nothing to discover on changes this small.
│ Required for size=small when the planner traced the complete code
│ path and identified the exact fix location with file:line citations
│ — no open questions remain. Justify with specific research findings.
│ If the planner cannot state "Root cause at [file:line], fix is
│ [approach]" with concrete evidence, the NONE bar is not met.
├── SINGLE 1 agent per domain. Use for medium+ tasks, or small tasks
│ where open questions remain after planning research.
│ At MEDIUM+ severity: +1 second opinion agent per domain (parallel).
│ Default pair: domain specialist (primary) + code-reviewer (second opinion) — planner may override based on task context.
└── MULTI N agents, one per domain. Split by specialist → volume.
At MEDIUM+: each domain gets a second opinion agent.
When the task spans 2+ domains with non-trivial coupling (see
Boundary Selection Criteria below), the planner adds intersection
discovery agents to the DISCOVER batch. An intersection agent
audits the integration boundary between two adjacent domains —
tracing the full data/error/call flow across the divide,
verifying contracts hold at the boundary, and identifying
mismatches in data format, error semantics, or transactional
consistency. This is distinct from second opinions: second
opinions apply a different analytical lens to the SAME domain;
intersection agents trace the boundary BETWEEN different domains
where coupling creates defect-prone blind spots invisible to
either domain specialist alone. Intersection findings are tagged
"boundary-found" in extraction — signaling issues no within-domain
specialist could have detected. CRITICAL/HIGH findings from
intersection discovery are routed through cross-domain adversarial
verification (1:1 per finding, verifying from both sides of the
boundary). Intersection agents MUST be placed in the first DISCOVER
stage — never deferred to CONVERGE iterations. CONVERGE inherits
the intersection requirement but adds ADDITIONAL agents with
different specialists, not replacements for the first-stage ones.
Intersection agents run in parallel with domain primaries and
second opinions within the same stage. At MEDIUM+ severity: each
intersection agent gets its own second opinion (a different
specialist from the INDEX, not the same type as the intersection
agent). Intersection agents audit gaps between domains — second
opinions audit the intersection audit itself for missed concerns.
The planner selects the best agent for each boundary based on
domain context. Suggested defaults (planner's selection is
authoritative — these are starting points, not mandates):
`backend-architect` for data flow and contract tracing;
`security-reviewer` for crypto/auth boundaries. The planner
may choose any agent from the INDEX that fits the boundary.
IMPLEMENT Write or modify code.
├── NONE No code change (analysis-only, cosmetic-only).
├── SINGLE 1 agent per domain. Writes code directly to original files.
│ Standard for all code changes.
└── MULTI N agents, one per domain. Split by specialist → volume.
SINGLE for narrow single-domain changes; MULTI for changes
spanning multiple specialists. Line count is not the measure —
split by domain diversity, not file count.
REVIEW Review code changes.
├── NONE Skip: change type=cosmetic AND severity=none.
│ Or: IMPLEMENT=NONE.
├── SINGLE 1 agent per domain. Standard.
│ At MEDIUM+ severity: +1 second opinion agent per domain (parallel).
│ Default pair: code-reviewer (primary) + language specialist (second opinion) — planner may override based on task context.
│ When the task spans 2+ domains using DIFFERENT specialists,
│ the planner adds cross-domain integration reviewers to the
│ REVIEW batch (see Boundary Selection Criteria for triage —
│ same ALWAYS/DEFAULT/SKIP tiers apply). These agents focus
│ ONLY on integration points: API contracts, shared types,
│ data flow between domains, and regressions at boundaries from
│ implementation changes. Do NOT re-review domain-internal logic.
│ Post-implementation intersection review is critical: domain
│ reviewers see new methods as correct within their context;
│ only tracing the full boundary reveals regressions where error
│ contracts, data formats, or transactional ordering differ from
│ what the caller expects. Findings from cross-domain integration
│ review are routed through adversarial cross-verification (1:1
│ per CRITICAL/HIGH finding, verifying from both sides).
└── MULTI N agents, one per domain.
VERIFY Verify findings from DISCOVER, REVIEW, or post-fix review.
Always includes extraction (1 agent, default model). Tags findings
"both-found"/"single-found" when originating stage had second opinion,
and "boundary-found"/"domain-only" when intersection agents were present.
Routes findings by severity:
CRITICAL/HIGH → ADVERSARIAL AGENT (1 agent per finding — 1:1)
Adversarial agent tries to FALSIFY every finding: reads cited code
with full surrounding context (minimum 30 lines), exhaustively
searches for counter-evidence at every level (same function guards,
caller-level validation, framework-level protections — middleware,
decorators, interceptors, global error handlers — type system
invariants, test coverage). Labels each CONFIRMED / REJECTED /
WEAKENED with evidence. For CONFIRMED: describe what patterns
were searched, which grep commands were run, why nothing was found.
For REJECTED: paste exact counter-evidence code with file:line.
For WEAKENED: paste partial counter-evidence AND explain what
portion of the original claim still stands.
Default position: assume the claimed issue is a misunderstanding and search exhaustively before confirming. For "missing X" findings, searching for X and finding it in no reachable code path IS valid evidence — document all searched locations. Findings that survive
exhaustive falsification become ADVERSARIALLY VERIFIED.
CRITICAL/HIGH from intersection or cross-domain integration review
(any finding spanning domain boundaries, regardless of whether
it originated in DISCOVER or REVIEW) → ADVERSARIAL CROSS AGENT
(1 agent per finding — 1:1). Same exhaustive falsification but verifies
from BOTH sides of the integration boundary (Domain A producer +
Domain B consumer + bridge between them). A finding survives only
if no counter-evidence exists on either side or in the bridge.
MEDIUM → ADVERSARIAL AGENT (1 agent per batch of 5 findings)
Same exhaustive falsification methodology as CRITICAL/HIGH —
reads cited code with full surrounding context (minimum 30
lines), exhaustively searches for counter-evidence at every
level (same function guards, caller-level validation,
framework-level protections, type system invariants, test
coverage). Labels each CONFIRMED / REJECTED / WEAKENED with
evidence. Default position: assume the claimed issue is a
misunderstanding and search exhaustively before confirming.
Every CONFIRMED label must be hard-won with grep evidence.
LOW → NOTED. Recorded in report. No further agent spend.
After routing: SYNTHESIS (1 agent, default model) compiles all
verdicts into unified grid. Surfaces "both-found" confidence signals.
Unified vocabulary: CONFIRMED / REJECTED / WEAKENED.
Also sanity-checks severity assignments against the severity
classification criteria — if a finding's severity appears mismatched
(e.g., "SQL injection" labeled MEDIUM), flag it as CHALLENGED.
Challenged findings are re-routed through adversarial verification.
Exception: documentation-domain challenged findings skip
adversarial — documentation severity is inherently subjective
(is "10 missing API docs" HIGH or MEDIUM?) and adversarial
review of severity ratings adds no meaningful verification.
Documentation-domain challenged findings stay at their
challenged severity; the lead accepts the downgrade directly.
Early-exit: 0 findings after extraction → skip synthesis.
Always runs when DISCOVER, REVIEW, or post-fix review produced findings with code-level references.
When CONFIRMED findings exist at MEDIUM+, FIX=DOMAINS must follow.
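The severity routing described for VERIFY can be sketched as a small function. This is a minimal illustration, not the workflow's actual tooling: the `Finding` class, the `route_findings` and `adversarial_agent_count` names, and the dictionary keys are hypothetical. It only counts adversarial agents; the extraction and synthesis agents (one each) are not included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    ident: str
    severity: str               # "CRITICAL", "HIGH", "MEDIUM", or "LOW"
    cross_domain: bool = False  # True for intersection / integration findings


def route_findings(findings, medium_batch_size=5):
    """Group findings the way the VERIFY brick routes them (hypothetical helper)."""
    routes = {"adversarial": [], "adversarial_cross": [],
              "medium_batches": [], "noted": []}
    medium = []
    for f in findings:
        if f.severity in ("CRITICAL", "HIGH"):
            # One adversarial agent per finding (1:1); boundary-spanning
            # findings go to the cross-domain adversarial route instead.
            key = "adversarial_cross" if f.cross_domain else "adversarial"
            routes[key].append(f.ident)
        elif f.severity == "MEDIUM":
            medium.append(f.ident)
        elif f.severity == "LOW":
            routes["noted"].append(f.ident)  # recorded only, no agent spend
        else:
            raise ValueError(f"unknown severity: {f.severity!r}")
    # MEDIUM findings share one adversarial agent per batch of five.
    for start in range(0, len(medium), medium_batch_size):
        routes["medium_batches"].append(medium[start:start + medium_batch_size])
    return routes


def adversarial_agent_count(routes):
    """Number of adversarial agents implied by a routing result."""
    return (len(routes["adversarial"])
            + len(routes["adversarial_cross"])
            + len(routes["medium_batches"]))
```

For example, one CRITICAL finding, one cross-domain HIGH finding, seven MEDIUM findings, and one LOW finding would need four adversarial agents under this sketch: two 1:1 agents plus two MEDIUM batches.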
CONVERGE Repeat DISCOVER or REVIEW for additional passes. Planner decides variant.
Factors: ambiguity, codebase complexity, finding volume, production impact,
change type, time sensitivity.
NONE: One pass. For well-understood, narrow work. Also appropriate
for codebases with comprehensive test coverage (>80%) and
clean module boundaries — first pass is unlikely to miss
meaningful issues.
ONCE: One extra iteration if first pass found anything ("found
anything" means any iter 1 agent reported at least one
finding — regardless of whether it survived adversarial
verification; the point is different iter 2 specialists
re-examine what iter 1 noticed). Use when
the planner's Phase 1 research reveals interconnected modules,
dense coupling, non-uniform code patterns, or >15K LOC per
domain — characteristics suggesting a first pass may miss
issues. Also used when severity is HIGH/CRITICAL regardless
of codebase quality (missed findings are expensive). ONCE is
NOT the universal default — well-tested, cleanly-structured
codebases should use NONE.
LOOP: Up to 3 iterations, stop on empty report. For highly ambiguous
or production-critical work where missed findings would be
unacceptable.
Iterations inherit ALL mandatory rules from the parent stage type
(second opinions at MEDIUM+, intersection agents at triaged boundaries,
DISCOVER/REVIEW → VERIFY pipeline, etc.). Intersection agents inherited
by CONVERGE are ADDITIONAL agents, not replacements — the first DISCOVER
stage must have its own intersection agents for ALWAYS/DEFAULT boundaries;
CONVERGE iter 2 adds fresh intersection agents with different specialists.
Each iteration gets its own VERIFY stage. Iter 1's VERIFY runs BEFORE
iter 2 spawns — the synthesis grid from iter 1's VERIFY determines
whether iter 2 spawns (any finding = spawn) AND provides PRIOR CONTEXT
for iter 2 agents. Do NOT merge both iterations' verification into a
single stage after both iterations complete. The plan structure must be:
Stage N: DISCOVER iter 1
Stage N+1: VERIFY iter 1
Stage N+2: DISCOVER iter 2 (conditional, PRIOR CONTEXT from N+1)
Stage N+3: VERIFY iter 2
The planner must list all agents per iteration with different
specialists from the previous iteration — the lead spawns
whatever the plan lists. Before writing iter 2, the planner MUST
list every agent `.md` file used in iter 1 and exclude them all
from iter 2 — no agent may appear in any role in both iterations.
Swapping primary and second opinion roles between iterations does
NOT count as different specialists. Using the same pair of agent
`.md` files in opposite roles is still the same analytical
framework. The exclusion list must be explicit in the plan.
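The CONVERGE iteration gate and the agent exclusion rule lend themselves to a mechanical check. The sketch below is illustrative only: the function names are hypothetical, and it assumes "up to 3 iterations" for LOOP means three passes in total, which the text above does not state explicitly.

```python
# Assumption: the limits count total passes, including the first one.
MAX_ITERATIONS = {"NONE": 1, "ONCE": 2, "LOOP": 3}


def should_spawn_next_iteration(variant, completed_iterations, last_findings):
    """Decide whether another DISCOVER/REVIEW pass should be spawned.

    last_findings is the number of findings reported by the most recent
    iteration, counted before adversarial verification.
    """
    if completed_iterations >= MAX_ITERATIONS[variant]:
        return False
    return last_findings > 0  # an empty report stops the loop


def reused_agents(previous_agents, next_agents):
    """Return agents that appear in both iterations, in any role."""
    return sorted(set(previous_agents) & set(next_agents))
```

A plan passes the exclusion rule only when `reused_agents` returns an empty list. Swapping primary and second opinion roles still returns both names, because roles are ignored.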
FIX Apply verified findings. Always 2-3 sequential stages — includes post-fix review.
When DOMAINS: 1 fix agent per domain → post-fix REVIEW
(same variant/domain split as the REVIEW stage), then VERIFY
if any post-fix review report contains at least one finding at
MEDIUM severity or above. A finding is any numbered item with a
severity label and code reference (file:line, function, or block)
in a reviewer's report. The lead does NOT re-classify, downgrade,
or exclude findings — the reviewer's filed severity is authoritative.
VERIFY is skipped ONLY when ALL post-fix review reports contain
zero MEDIUM+ findings. Mechanical trigger, no judgment.
CONVERGENCE: If post-fix VERIFY produces CONFIRMED MEDIUM+
findings in the synthesis grid, the fix is incomplete. Spawn a new
fix pass (fix agents → post-fix review → conditional verify) for
the confirmed findings. This repeats until post-fix review
produces zero MEDIUM+ findings and VERIFY is skipped. The FIX
brick is a convergence loop — one pass is never final when
MEDIUM+ findings survive verification. Documented findings marked
"for follow-up action" are still unfixed MEDIUM+ findings — fix
them now, not later.
├── NONE No verified findings.
└── DOMAINS 1 fix agent per domain → post-fix REVIEW → conditional VERIFY.
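The post-fix VERIFY trigger is described as mechanical, so it can be expressed as a one-line predicate. A minimal sketch, assuming each post-fix review report has already been reduced to the list of severity labels its reviewer filed; the `post_fix_verify_required` name and the input shape are hypothetical.

```python
SEVERITY_RANK = {"NONE": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}


def post_fix_verify_required(review_reports):
    """Return True if any post-fix review report has a MEDIUM+ finding.

    review_reports is a list of reports, each a list of severity labels
    exactly as filed by the reviewer (no re-classification by the lead).
    """
    threshold = SEVERITY_RANK["MEDIUM"]
    return any(
        SEVERITY_RANK[label.upper()] >= threshold
        for report in review_reports
        for label in report
    )
```

VERIFY is skipped only when this returns `False` for the full set of post-fix review reports.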
TEST Run build + test suite. Always single agent, default model (mechanical).
├── NONE IMPLEMENT=NONE. Or planner skips with justification (no test infra).
└── FULL 1 agent. Runs build + tests, fixes failures.
Severity Classification
The planner assesses severity from code context — NOT keyword matching:
| Level | Criteria |
|---|---|
| None | No functional impact. Comment, formatting, variable rename. |
| Low | Minor, immediately reversible. Dev tooling, internal logging, tests. |
| Medium | User-facing, visible but contained. UI component, new endpoint, feature. |
| High | Core product function, data mutation, could break key flows. Payment, auth, database writes, primary user flows, data model changes. |
| Critical | Product outage, data loss, severe production bugs, irreversible damage. Secret exposure, SQL injection, data deletion, auth bypass, production crash, corrupt state. |
The planner reads the actual code, traces what it touches, and assigns severity based on
actual product impact. No keyword auto-detection. A function named
that handles UI password strength is LOW, not HIGH. A log statement in a payment module is LOW unless the logging itself can break payments.
Domain Splitting
When a task spans multiple domains, split in three steps.
Domain breadth is measured by distinct specialist agents needed, not package count. A task touching 5 Swift packages that all use
is single-domain. A task touching Python + TypeScript files is few-domain.
-
Split by specialist — map each file/concern to the best specialist agent from
<skill-folder>/agents/INDEX.md
-
Split by volume — keep each discovery agent to ~20 files and ~5K LOC. A narrow overage (up to 25 files / 6K LOC, in a single cohesive module — not unrelated files packed together) is acceptable; above that, split is mandatory. Discovery agents must read every file — a 20-line header costs the same context as a 200-line implementation file because the agent must understand the API and cross-reference every caller. After splitting, re-count each resulting sub-group to verify none still exceeds the limits.
Post-split re-evaluation. After mandatory splits, verify the resulting agents
are not fragmented. If any sub-agent has fewer than 15 files AND fewer than 3K LOC,
the split produced an under-utilized agent — standalone agents this small add
coordination overhead without proportional audit depth. Consider merging adjacent
sub-agents: a single 40f/4K-LOC agent with "many thin boilerplate files"
justification is better than two 20f/2K-LOC agents that have almost nothing to
audit. When file count exceeds the 25f cap but total LOC is under 3K, the files
are likely thin code-behind or utility stubs — prefer accepting as close call
over splitting into fragments. The caps exist to prevent agent overload, not to create
standalone agents that have too little to examine.
The planner provides FILE SCOPES (module-level descriptions, e.g. "GPG core:
core/GPGHandler.py, core/gpg_utils/*.py") with rough LOC estimates from Phase
1 research. The organizer resolves every scope to exact individual file paths
(using glob + find + test -f), runs wc -l for exact counts, produces a
systematic volume audit table comparing each domain against the ~20f / ~5K LOC
baseline and the 25f / 6K LOC narrow cap, splits domains exceeding the cap,
and writes the resolved KEY FILES + exact LOC counts into the plan file,
preserving the planner's MUST ANSWER questions, domain descriptions, and
agent assignments for each domain.
-
Split implementation agents by edit density — different from discovery volume splitting. Sequential edits on the same file accumulate context pressure linearly (agent re-reads, re-edits, re-tests the same code) causing edit amnesia: the agent forgets it already applied a change and tries to re-apply it. Two mechanical caps, counted from the synthesis grid's confirmed MEDIUM+ findings:
- Per-file cap: no single file may carry more than 8 confirmed MEDIUM+ findings to one implementation agent. If a file exceeds 8, split that file's fixes across 2 agents by finding index.
- Per-agent cap: no implementation agent may receive more than 12 confirmed MEDIUM+ findings across all files. If a domain exceeds 12 total, split into 2 agents by file/module.
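The volume limits from the split-by-volume step and the post-split re-evaluation can be sketched as a classifier for the organizer's volume audit. This is an illustration under assumptions: the function names and return labels are hypothetical, the approximate "~20 files / ~5K LOC" baseline is treated as a hard threshold, and an overage outside a cohesive module is treated as requiring a split.

```python
def classify_volume(files, loc, cohesive_module=False):
    """Classify one discovery agent's scope against the volume caps."""
    if files <= 20 and loc <= 5000:
        return "within_baseline"
    if files <= 25 and loc <= 6000 and cohesive_module:
        return "narrow_overage"   # acceptable only inside one cohesive module
    if files > 25 and loc < 3000:
        return "close_call"       # many thin files: prefer accepting over fragmenting
    return "split_required"


def is_under_utilized(files, loc):
    """Flag a post-split sub-agent that is a candidate for merging."""
    return files < 15 and loc < 3000
```

After a mandatory split, each resulting sub-group would be passed through `classify_volume` again, and any sub-group flagged by `is_under_utilized` would be considered for merging with an adjacent one.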
Boundary Selection for Intersection Agents
The planner identifies domain adjacencies during Phase 1 research, assesses coupling
density by tracing cross-boundary call sites, and classifies each boundary. This
decision is documented in the plan manifest under "Boundary Analysis."
| Tier | Criteria | Action |
|---|---|---|
| ALWAYS | Two persistence mechanisms at boundary; OR data format/encoding transformation at boundary; OR error contract differs from caller expectation; OR 5+ cross-boundary call sites across 3+ modules | Add intersection agent to DISCOVER and REVIEW |
| DEFAULT | Multiple cross-boundary call sites; moderate coupling; multi-module boundary | Add intersection agent to DISCOVER and REVIEW |
| SKIP | Boundary bridged by a single well-understood mediator class; OR <3 cross-boundary call sites; OR well-documented established pattern (e.g., standard library protocol layer) | Skip — domain primaries + second opinions sufficient |
SKIP boundaries require a one-line justification in the plan's Boundary Analysis
(e.g., "SKIP: Crypto×Network — thin boundary bridged by MailCore2 TLS").
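The tier table can be approximated in code once the planner has counted call sites and modules. This is a rough sketch: the `classify_boundary` function and its parameter names are hypothetical, and it assumes ALWAYS criteria take precedence when a boundary also matches a SKIP criterion, which the table does not specify. The qualitative DEFAULT criteria ("moderate coupling") are reduced to "everything else".

```python
def classify_boundary(persistence_mechanisms=1, format_transformation=False,
                      error_contract_mismatch=False, call_sites=0, modules=0,
                      single_mediator=False, established_pattern=False):
    """Return the triage tier for one domain boundary: ALWAYS, DEFAULT, or SKIP."""
    # Assumption: ALWAYS criteria are checked first and win over SKIP criteria.
    if (persistence_mechanisms >= 2
            or format_transformation
            or error_contract_mismatch
            or (call_sites >= 5 and modules >= 3)):
        return "ALWAYS"
    if single_mediator or call_sites < 3 or established_pattern:
        return "SKIP"
    return "DEFAULT"
```

A SKIP result would still need its one-line justification in the plan's Boundary Analysis; the function only suggests the tier.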
Rationale (from Run 4 empirical data):
Intersection agents at high-coupling boundaries produce unique MEDIUM+ findings at
~1.4 agents per unique finding. At thin boundaries bridged by a single mediator
class, intersection agents add near-zero unique value (<20% precision, 0 unique
findings in Run 4). Triaging prevents wasteful agent spend at boundaries where
domain primaries and second opinions already provide sufficient coverage.
Academic support: Koru et al. (2007) established that highly coupled modules are
more defect-prone. Zhou et al. (2020) confirmed package coupling metrics predict
defect-proneness. An empirical study of interaction bugs in ROS-based software
(2025) found failures "often manifest at the boundaries between components."
Size Classification
The planner assesses scope along with severity. Size gates DISCOVER=NONE decisions.
| Size | Criteria |
|---|---|
| tiny | Single file, single change, under 10 lines. Trivial fix, no structural impact. |
| small | Single module, few files. Well-scoped change with clear boundaries. Under ~20 files and ~5K LOC. |
| medium | Multiple modules, cross-file changes. Moderate scope, may touch different concerns. Under ~20 files and ~5K LOC. |
| large | Exceeds ~20 files OR ~5K LOC in any domain, OR spans multiple specialist domains (different languages/frameworks). Requires volume splitting. |
DISCOVER=NONE requires
(nothing to discover) OR
with planner-identified root cause at file:line. For
and
, DISCOVER is mandatory.
Mid-Execution Amendment
After VERIFY produces confirmed findings at MEDIUM severity or above: if the manifest does not include IMPLEMENT, the lead auto-adds IMPLEMENT followed by FIX (which includes internal post-fix REVIEW + conditional VERIFY). This is unconditional — all confirmed MEDIUM+ findings are fixed regardless of task intent. LOW findings are reported but not auto-fixed.
When auto-adding IMPLEMENT or planning implementation stages from the synthesis grid, count confirmed MEDIUM+ findings per file. Apply the edit-density split (Domain Splitting step 3): if any single file carries more than 8 findings or any domain carries more than 12 total findings, split that domain's implementation into 2 agents.
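The two edit-density caps can be checked mechanically from the synthesis grid. A minimal sketch, assuming the confirmed MEDIUM+ findings assigned to one implementation agent are available as `(file_path, finding_id)` pairs; the `edit_density_violations` name and that input shape are hypothetical.

```python
from collections import Counter


def edit_density_violations(findings, per_file_cap=8, per_agent_cap=12):
    """Check one implementation agent's load against the edit-density caps.

    findings is an iterable of (file_path, finding_id) pairs for the
    confirmed MEDIUM+ findings assigned to that agent.
    """
    per_file = Counter(path for path, _ in findings)
    files_over_cap = sorted(p for p, n in per_file.items() if n > per_file_cap)
    total = sum(per_file.values())
    agent_over_cap = total > per_agent_cap
    return {
        "files_over_cap": files_over_cap,   # split these files by finding index
        "agent_over_cap": agent_over_cap,   # split the domain by file/module
        "needs_split": bool(files_over_cap) or agent_over_cap,
    }
```

Both caps are strict: exactly 8 findings on one file, or exactly 12 across an agent, does not trigger a split.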
After a FIX stage's post-fix VERIFY produces CONFIRMED MEDIUM+ findings in the synthesis grid: auto-add another FIX pass (fix agents → post-fix review → conditional verify). This repeats until post-fix review produces zero MEDIUM+ findings and VERIFY is skipped. This is mechanical — the FIX brick is a convergence loop, and surviving MEDIUM+ findings mean the fix was incomplete. IMPLEMENT already being in the manifest does not block this — FIX convergence re-entry is independent of the IMPLEMENT amendment.
Implementation stages use write → review structure:
Stage N: Implementation — 1 agent per domain
Agent writes code directly to original files.
Stage N+1: Review — 1 agent per domain
Reviews the implementation for bugs, quality, correctness.
Stage N+2: Verification — severity-routed (extraction → adversarial [CRITICAL/HIGH 1:1, MEDIUM 1 per 5] → synthesis)
Fix agents (docs, configs, scripts): use default model agents for code. Split fixes by domain — one agent per domain. Every fix stage MUST be followed by a post-fix review:
Stage N: Fixes — N agents split by domain
Stage N+1: Post-fix review — N agents (1 per domain)
Stage N+2: Verification — severity-routed (only if fix review found MEDIUM+ findings)
Delegation mapping (MANDATORY in every plan): During planning you MUST answer:
- What subtasks exist? (list each one)
- Which agent handles each subtask? (map agent name to subtask — consult
<skill-folder>/agents/INDEX.md
)
- Where is verification in this plan? Confirm verification runs after every DISCOVER and REVIEW stage that produces findings, or mark it explicitly as SKIPPED with justification.
Answer these explicitly in your plan. Every subtask must have an assigned agent — no subtask goes to the lead.
Stage decomposition rule (MANDATORY): If stage N+1 does NOT consume stage N's verified output — they're independent — MERGE them into a single stage with parallel agents. Sequential stages are only correct when the next stage actually needs the previous stage's verified findings as
.
Write full plan to
. The
flag on
is available to override when a specific model is needed. Quick-fix agents (see Lead Role) are always single-model but run outside the plan's stage structure — they handle agent output issues within an existing workflow, never as a standalone workflow replacement. Checkpoint.
Dependency analysis (MANDATORY — lead's responsibility, before spawning): Before spawning any stage, the lead builds a dependency graph of agents within that stage:
- For each agent, list files it will READ and files it will WRITE/CREATE
- If Agent B reads or tests a file that Agent A writes → B depends on A → they CANNOT run in parallel
- Split into batches: independent agents run together, dependent agents run sequentially
- Document in per stage:
Stage N agents:
Batch 1 (parallel): agent-a (writes X.swift), agent-b (writes Y.swift)
Batch 2 (after batch 1): agent-c (tests X.swift, depends on agent-a)
Common dependency patterns to watch: test-writer depends on implementer, fix-agent depends on reviewer, integration-tester depends on all implementers, plan organizer depends on the planner's output. When in doubt, sequence — wasted time from a retry loop exceeds the cost of sequential execution.
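The dependency analysis above amounts to layering agents by read-after-write dependencies. The sketch below shows one way to derive the batches; the `plan_batches` function and its input shape are hypothetical, "tests a file" is treated as reading it, and two agents writing the same file are not detected.

```python
def plan_batches(agents):
    """Split a stage's agents into sequential batches of parallel agents.

    agents maps an agent name to {"reads": set_of_files, "writes": set_of_files}.
    Agent B depends on agent A when B reads (or tests) a file that A writes.
    """
    deps = {name: set() for name in agents}
    for b, b_io in agents.items():
        for a, a_io in agents.items():
            if a != b and b_io["reads"] & a_io["writes"]:
                deps[b].add(a)

    batches, done = [], set()
    while len(done) < len(agents):
        # An agent is ready once everything it depends on has finished.
        ready = sorted(n for n in agents if n not in done and deps[n] <= done)
        if not ready:
            raise ValueError("circular dependency between agents")
        batches.append(ready)
        done.update(ready)
    return batches
```

With the example above (agent-a writes X.swift, agent-b writes Y.swift, agent-c tests X.swift), this yields `[["agent-a", "agent-b"], ["agent-c"]]`.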
Session start: Clean ALL stale workflow artifacts. Use two steps — explicit files first (shell-safe), then wildcard patterns via
(avoids zsh glob errors when no files match a pattern):
find tmp/ -maxdepth 1 \( -name 'stage-*-synthesis.md' -o -name 'stage-*-iter-*-synthesis.md' -o -name 's[0-9]*-task.txt' -o -name 's[0-9]*-prompt.txt' -o -name 's[0-9]*-status.txt' -o -name 's[0-9]*-report.md' -o -name 'plan-review-*' \) -delete
- Verify: — confirm no stale workflow artifacts remain. If any survived, remove them manually before proceeding.
Also clear stale session checkpoints:
echo "# Session Memory" > session.md
CAUTION: Never use broad patterns like
or
— they will delete non-workflow files (e.g.
). Never delete
— this directory contains permanent loop run logs and must be preserved across sessions. Agent names follow
prefix (e.g.
,
), so
safely matches only workflow artifacts.
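Before running the cleanup, it can help to check which file names the patterns would actually match. The sketch below reuses the patterns from the `find` command above and approximates `find -name` matching with Python's `fnmatch`; the `is_workflow_artifact` helper is hypothetical and performs no deletion.

```python
from fnmatch import fnmatchcase

# Same name patterns as the find command in the cleanup step.
WORKFLOW_PATTERNS = [
    "stage-*-synthesis.md",
    "stage-*-iter-*-synthesis.md",
    "s[0-9]*-task.txt",
    "s[0-9]*-prompt.txt",
    "s[0-9]*-status.txt",
    "s[0-9]*-report.md",
    "plan-review-*",
]


def is_workflow_artifact(filename):
    """Return True if a bare file name matches any cleanup pattern."""
    return any(fnmatchcase(filename, pattern) for pattern in WORKFLOW_PATTERNS)
```

Because every agent pattern requires a digit right after the leading `s`, a name such as `session.md` is not matched, while `s1-reviewer-prompt.txt` is.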
Session boundaries: Each session is independent; treat every task as a fresh start. Do not assume prior sessions' findings still hold. Every code change, even from previous sessions, requires fresh verification through the full workflow. Only reference prior sessions when the task explicitly asks you to. If a task will likely need more than 4 stages, plan explicit session splits using the continuation protocol. Long sessions degrade from compaction pressure.
Agent Preparation
Consult
<skill-folder>/agents/INDEX.md
for the full agent directory (110+ agents grouped by domain). Pick the MOST specialized agent (see Agent Selection above) — a PostgreSQL task should use postgres-pro, not database-optimizer. The agent's domain checklists and anti-patterns are the primary value — they only work when the agent matches the domain.
For each agent in the current stage:
- Define task with KEY FILES, CONTEXT, SCOPE, (code agents only — list source files agent may edit), and questions (provided by the planner in the manifest — mandatory, prompts without these are invalid). The lead may add 1-2 supplementary workflow-level questions (e.g., "Was the linter run?") but does not write code-level technical questions.
- Write the TASK ASSIGNMENT block (PROJECT, ENVIRONMENT if code, PRIOR CONTEXT if stage 2+, YOUR TASK, WRITABLE FILES) to . NOTE: Do NOT include the report file path in WRITABLE FILES; the script injects it automatically.
- Assemble the full prompt:
```bash
<skill-folder>/tools/assemble-prompt.sh -a AGENT -t TYPE -n NAME --task tmp/{name}-task.txt
```
Types: (coordination-review + severity + quality-rules-review), (coordination-code + quality-rules-code), (coordination-review + quality-rules-review). The script reads the agent .md, selects templates, substitutes in the task file content, and writes . Output: ASSEMBLED|name|path|bytes
- Validate prompt contains ALL: full agent .md, TASK ASSIGNMENT with MUST ANSWER questions, quality rules, severity guide (review only), environment (code only), coordination, report format. The script handles all boilerplate automatically — you only own the task file. Missing ANY = do not spawn
- Match agent type to task: REVIEW → code-reviewer, security-reviewer, backend-architect. CODE → language-pro, debugger. Git/history analysis (blame, log, diff, tracing fixes through commits) → or
- WRITABLE FILES: Code agents: task file MUST list the exact source files/directories the agent may modify. Review/audit/research agents: omit WRITABLE FILES entirely — the script auto-injects the correct report path and marks all source files as read-only.
- Implementation agents: WRITABLE FILES must list the exact source files the agent may modify directly. The task must instruct them to produce their implementation and run any available lint/test commands to verify correctness. The task MUST also instruct them to write an Intent section in their report before coding: a description of their understanding of the task and their intended approach, in their own words, at whatever level of detail they think is useful for the reviewer. The agent decides what to communicate — architectural reasoning, assumptions about the codebase, trade-offs considered, alternatives rejected, or anything else that helps someone else understand why they built what they built. This is the first thing they write, before any code.
Describe problems and desired behavior — do NOT paste exact fix code unless precision is critical (regex, API signatures, security logic). Name agents with stage prefix: , .
Agent Spawning
The
flag is available to override when a specific model is needed but is never required.
How it works for review/research/audit stages:
- A single agent gets the agent and the task assignment — it works independently
- When a stage has independent subtasks (different files, modules, concerns), spawn one agent per subtask in parallel — as many as the task naturally decomposes into, maximum 10 agents
- Each agent's report feeds into the verification pipeline (see Verification section)
- Naming convention: , e.g. , (stage 2, iteration 1)
How it works for implementation stages:
- Write step: A single agent writes the implementation directly to the original files. The agent reads the full task, understands the requirements, and produces a complete implementation.
- Review step: A single review agent reviews the implementation — same task description, independent assessment.
- Fix and iterate: The review report is processed by the verification pipeline to produce a verified checklist. ALL verified findings are fixed via fix-agents split by domain. The lead does NOT fix findings directly, regardless of how few or how trivial. Every fix MUST be followed by a post-fix review agent. Every review MUST be followed by verification — review findings are not deliverable until they've been verified. The review → fix → re-review loop iterates until the review agent produces no new findings (empty report) — this convergence is the final gate.
Spawn:
```bash
# Single agent (uses default model)
<skill-folder>/tools/spawn-glm.sh -n s1-reviewer -f tmp/s1-reviewer-prompt.txt
# Override model
<skill-folder>/tools/spawn-glm.sh -n s1-reviewer -f tmp/s1-reviewer-prompt.txt -m zai/glm-5.1
```
Prompt assembly: Assemble ONE prompt per agent via
:
```bash
<skill-folder>/tools/assemble-prompt.sh -a AGENT -t TYPE -n NAME --task tmp/task.txt
```
Types:
(coordination-review + severity + quality-rules-review),
(coordination-code + quality-rules-code),
(coordination-review + quality-rules-review).
Implementation spawn pattern:
```bash
# Write step
<skill-folder>/tools/spawn-glm.sh -n sN-impl -f tmp/sN-impl-prompt.txt
# Review step (spawn AFTER write completes)
<skill-folder>/tools/spawn-glm.sh -n sN-review -f tmp/sN-review-prompt.txt
```
Naming convention overview:
- Plan: ,
- Discovery: , (second opinion),
sN-discover-{domainA}-{domainB}
(intersection, e.g., s1-discover-crypto-services
)
- Implementation: , , (second opinion),
sN-review-{domainA}-{domainB}
(intersection, e.g., s6-review-crypto-services
)
- Verification: , (adversarial — 1:1 for CRITICAL/HIGH, 1 per 5 for MEDIUM), (cross-domain adversarial),
- Fix:
- Test:
- Iterations: (e.g., , )
- Respawns: add , suffix when re-spawning a failed agent with corrected configuration (e.g., = stage 2 iteration 1 reviewer, respawn attempt 2). Maximum 3 respawn attempts per agent.
Second Opinion Guidelines
For DISCOVERY and REVIEW stages at MEDIUM+ severity, spawn a second opinion agent using a different agent
from the INDEX. The two agents review the same code but through different analytical frameworks, producing complementary findings (proven: 87% complementarity across 5 language domains across 3 languages; 4-agent audit confirmed each additional agent type finds structurally distinct issues). PLAN always has an agent-organizer review (mandatory, all tasks) — see Planning phase step 3b. Agent selection is task-driven — the tables below show recommended defaults; the planner selects the best agents for the specific task based on codebase context.
No domain exception: The documentation-domain exceptions (skipping adversarial verification, accepting challenged downgrades directly) apply ONLY to the verification pipeline — how findings are routed and verified. They do NOT excuse documentation-domain DISCOVERY or REVIEW stages from the second-opinion requirement. MEDIUM+ severity → second opinion is unconditional across all domains. If a task is MEDIUM+ and includes documentation as a domain, the discovery and review stages for that domain MUST include a second opinion agent.
DISCOVER pairings (defaults — planner may override)
For DISCOVER, the primary agent is typically the domain specialist who audits existing code. The second opinion is typically a code-reviewer providing a general quality lens. The planner may select different agents when the task warrants it — the table shows recommended defaults, not hard assignments.
| Context | Primary | Second Opinion |
|---|---|---|
| General code | domain specialist (, , etc.) | |
| Auth/crypto | | |
| Infrastructure/config | | |
| Trivial / single-domain-small | skip | — (only when overall task severity < MEDIUM; the MEDIUM+ severity rule — "second opinion mandatory in all DISCOVER stages" — overrides this row) |
REVIEW pairings (defaults — planner may override)
For REVIEW, the primary agent is typically a code-reviewer assessing implementation quality. The second opinion varies by context to provide a complementary lens. The planner may select different agents when the task warrants it — the table shows recommended defaults, not hard assignments.
| Context | Primary | Second Opinion |
|---|---|---|
| General code | | language specialist (, , etc.) |
| Auth/crypto | | |
| Infrastructure/config | | |
| System design / architecture | | |
| Multi-language | | (prefer splitting into per-language reviews with individual second opinions) |
| Trivial / single-domain-small | skip | — (only when overall task severity < MEDIUM; the MEDIUM+ severity rule — "second opinion mandatory in all REVIEW stages" — overrides this row) |
Same-agent prohibition: The second opinion agent MUST use a different
file from the primary. Using the same agent
twice — even with "different task scoping" — does not create a different analytical framework. Same checklists, same anti-patterns, same blind spots. The 87% complementarity effect depends on genuinely different agent expertise. If no different specialist can be found for a second opinion, split the review into smaller per-domain reviews where each can get a truly different second opinion.
Task-framing guideline: The task file for the second opinion agent uses the same KEY FILES as the primary but may add a domain-specific emphasis directive in the YOUR TASK section. Example: for
reviewing OAuth code, add "Pay special attention to Python error handling patterns around I/O, binary data decoding, and data class validation." This costs zero tokens and amplifies the complementarity effect.
Both-found confidence signal: When a DISCOVERY or REVIEW stage used a second opinion, the subsequent extraction agent tags each finding as "both-found" (both agents reported independently) or "single-found" (only one agent reported). Both-found findings carry higher confidence — surface this in the synthesis grid.
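The two hard rules in this section (second opinion required at MEDIUM+ severity, and never the same agent twice) can be sketched as a pre-spawn check. This is a sketch only: `is_medium_plus` and `check_pairing` are hypothetical helper names, and the agent names passed in are whatever the planner selected.

```bash
# Succeed when the severity is MEDIUM or above.
is_medium_plus() {
  case "$1" in
    MEDIUM|HIGH|CRITICAL) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical check for a DISCOVER or REVIEW pairing.
# $1 = task severity, $2 = primary agent, $3 = second opinion agent (may be empty).
# Prints "ok" or the reason the pairing is rejected.
check_pairing() {
  local severity="$1" primary="$2" second="$3"
  if is_medium_plus "$severity" && [ -z "$second" ]; then
    echo "missing second opinion"
    return 1
  fi
  if [ -n "$second" ] && [ "$primary" = "$second" ]; then
    echo "same agent used twice"
    return 1
  fi
  echo "ok"
}
```

A pairing such as `check_pairing HIGH security-reviewer code-reviewer` passes, while repeating `code-reviewer` on both sides is rejected regardless of severity.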
Execution
- Spawn current batch of agents via , respecting the per-batch limit from Tools and the dependency analysis above. If stdout is empty (Windows issue), read to get PID. Checkpoint with PIDs and names. If stage has multiple batches, wait for current batch to finish before spawning next
wait-glm.sh name1:$PID1 name2:$PID2 ...
— first progress at 30s, then every 60s, STALLED warnings, health check on finish
- Do verification prep (for VERIFY stages): read the extraction agent's output, create verification task files per batch, assemble prompts. Batch cross-check (MANDATORY): Before spawning, verify that every batch the extraction report prescribes has a corresponding task file, and each task file targets the exact finding IDs from the extraction's batch assignment table (e.g., ADV-1 → B1-B4). A task file for different findings than prescribed does not satisfy the batch assignment. The extraction report is authoritative — the lead does NOT substitute finding targets.
- Review output. Check operational status only — was the report produced? Is the log non-empty? Any STALLED markers? This is NOT quality review (do NOT evaluate findings, accuracy, or correctness). If ANY agent shows STALLED / EMPTY LOG / MISSING REPORT / EMPTY REPORT:
- Diagnose root cause. Fix the issue (environment, prompt, task file, dependencies).
- Re-spawn the agent with corrected configuration.
- Do NOT proceed to the next stage with incomplete stage output.
- Accept a gap and proceed ONLY for trivial gaps in discovery stages (e.g. a single agent in a 10-agent discovery stage failed after 3 respawn attempts with different approaches, AND its domain is partially covered by other agents). Every such decision must be explicitly justified in with
STAGE GAP ACCEPTED: [domain] [reason] [coverage from other agents]
. Do NOT accept gaps in implementation or fix stages — those stages must produce complete, correct output. Do NOT silently skip failed agents.
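The operational status check in the review step can be sketched as a function that looks only at whether output exists, not at its quality. This is a minimal sketch: `agent_status` is a hypothetical helper, the caller supplies both file paths, and the assumption that a stalled agent leaves the literal word `STALLED` in its log is mine, not something the text confirms.

```bash
# Hypothetical helper: classify one agent's operational status.
# $1 = path to the agent's log, $2 = path to the agent's report.
# Assumption: a stalled agent's log contains the literal word STALLED.
agent_status() {
  local log="$1" report="$2"
  if [ ! -s "$log" ]; then
    echo "EMPTY LOG"; return 1
  fi
  if grep -q "STALLED" "$log"; then
    echo "STALLED"; return 1
  fi
  if [ ! -e "$report" ]; then
    echo "MISSING REPORT"; return 1
  fi
  if [ ! -s "$report" ]; then
    echo "EMPTY REPORT"; return 1
  fi
  echo "OK"
}
```

Any result other than `OK` means diagnose, fix, and re-spawn rather than moving to the next stage.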
Verification
Verification uses the severity-routed verification pipeline. The lead does NOT manually verify findings — that's the agents' job. The pipeline runs in batches with sequential dependencies:
Batch 0: Extraction agent (single, default model; use
agent
). Reads all reports from the stage, extracts every finding with file:line and severity, deduplicates (same file:line + same issue → merge, note both sources), classifies each finding by severity, and splits into batches grouped by domain. When the originating stage (DISCOVERY or REVIEW) used a second opinion agent, tag each finding as "both-found" (both agents reported independently) or "single-found" (one agent only). When intersection agents were present, also tag findings as "boundary-found" (reported by an intersection agent auditing a domain boundary — inherently invisible to within-domain specialists) or "domain-only" (reported only by domain primaries/second opinions). Both-found and boundary-found carry elevated confidence for different reasons: both-found signals cross-agent agreement within a domain; boundary-found signals issues spanning domains that no within-domain specialist could have detected. A finding that is both "both-found" AND "boundary-found" carries the highest confidence. Surface all tags in synthesis. Findings from documentation specialist agents (documentation-pro) are domain-verified — route them directly to synthesis at the agent's rated severity, skipping adversarial verification.
Mechanical trigger — MANDATORY: If extraction finds any finding at MEDIUM severity or above, the lead MUST spawn ALL verification batches the extraction report prescribes — every adversarial batch, at the exact finding IDs listed in the extraction's batch assignment table. Spawning an adversarial agent against different findings than prescribed does NOT satisfy this trigger. The synthesis agent runs after all routing agents complete — even if every routed finding was REJECTED or WEAKENED. The lead does NOT evaluate routing agent outputs to decide whether synthesis is needed. The synthesis grid — not the lead's judgment — determines which findings are fixed. Skipping verification for MEDIUM+ findings is a protocol violation.
Batch 1: Findings routed by severity. All findings extracted by Batch 0 are routed:
-
CRITICAL/HIGH findings → Adversarial agent (single agent per finding (1:1), default model). Tries to FALSIFY every finding: reads cited code with full surrounding context, exhaustively searches for counter-evidence (guards, validation, framework protections, type system invariants, test coverage), labels each CONFIRMED / REJECTED / WEAKENED with evidence. Adversarial methodology: assume the claimed issue is a misunderstanding and search exhaustively before confirming. Every CONFIRMED label must be hard-won with grep evidence.
-
CRITICAL/HIGH findings from intersection or cross-domain integration review (any finding spanning domain boundaries, from DISCOVER or REVIEW) → Adversarial cross-domain agent (single agent per finding (1:1), default model). Same exhaustive falsification but verifies from BOTH sides of the integration boundary (Domain A producer + Domain B consumer + bridge between them). Finding only survives if no counter-evidence on either side or in the bridge.
-
MEDIUM findings → Adversarial agent (single agent per batch of 5 findings, default model). Same exhaustive falsification methodology as CRITICAL/HIGH — reads cited code with full surrounding context, exhaustively searches for counter-evidence (guards, validation, framework protections, type system invariants, test coverage), labels each CONFIRMED / REJECTED / WEAKENED with evidence. Adversarial methodology: assume the claimed issue is a misunderstanding and search exhaustively before confirming. Every CONFIRMED label must be hard-won with grep evidence.
-
LOW findings → NOTED. Recorded in the report. No further agent spend.
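The severity routing above implies a simple count of adversarial agents. The sketch below uses hypothetical helper names (`route_finding`, `adversarial_agent_count`) and assumes all MEDIUM findings fall into one domain; because extraction groups batches by domain, a real run may need the batch-of-5 rounding applied per domain. Cross-domain findings, which go to the cross-domain adversarial agent, are not modeled.

```bash
# Hypothetical helper: name the route for a single finding by severity.
route_finding() {
  case "$1" in
    CRITICAL|HIGH) echo "adversarial (1:1)" ;;
    MEDIUM)        echo "adversarial (batch of 5)" ;;
    LOW)           echo "NOTED" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: number of adversarial agents to spawn.
# $1 = count of CRITICAL/HIGH findings (one agent each)
# $2 = count of MEDIUM findings (one agent per batch of 5, rounded up)
adversarial_agent_count() {
  local critical_high="$1" medium="$2"
  local medium_batches=$(( (medium + 4) / 5 ))
  echo $(( critical_high + medium_batches ))
}
```

With 3 CRITICAL/HIGH findings and 7 MEDIUM findings, this gives 3 one-to-one agents plus 2 batch agents, for 5 in total.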
Batch 2: Synthesis agent (single, default model; use
agent
). Reads all verdicts. Builds a cross-reference grid per finding using unified vocabulary:
| CONFIRMED | REJECTED | WEAKENED |
|---|---|---|
| → fix list | → dropped | severity downgraded → fix list at lower priority |
Surfaces "both-found" confidence signals from extraction — findings reported by both primary and second opinion agents carry higher initial confidence.
Also sanity-checks severity assignments against the severity classification criteria — if a finding's severity appears mismatched (e.g., "SQL injection" labeled MEDIUM), flag it as CHALLENGED. Challenged findings are re-routed through adversarial verification. Exception: documentation-domain challenged findings skip adversarial — documentation severity is inherently subjective (is "10 missing API docs" HIGH or MEDIUM?) and adversarial review of severity ratings adds no meaningful verification. Documentation-domain challenged findings stay at their challenged severity; the lead accepts the downgrade directly.
If the synthesis grid shows zero CONFIRMED findings at MEDIUM or above (all MEDIUM+ findings were REJECTED, all were DROPPED, or only LOW-severity survivors remain), FIX is SKIPPED — there is nothing significant to fix. LOW verified findings are acknowledged in the synthesis as non-blocking. The lead writes the synthesis with
FIX SKIPPED: Zero MEDIUM+ verified findings — nothing to fix.
This is mechanical — no lead judgment.
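The FIX decision can be sketched as a count over the synthesis grid. This is a sketch under assumptions: `fix_decision` is a hypothetical helper, the one-row-per-finding input format is invented for illustration, and treating WEAKENED rows as survivors at their downgraded severity is one reading of the "only LOW-severity survivors remain" wording rather than something the text states outright.

```bash
# Hypothetical helper: read "VERDICT SEVERITY" rows on stdin, where SEVERITY
# is the severity after synthesis (already downgraded for WEAKENED rows),
# and print the mechanical decision.
fix_decision() {
  local verdict severity survivors=0
  while read -r verdict severity; do
    case "$verdict" in
      CONFIRMED|WEAKENED)
        case "$severity" in
          MEDIUM|HIGH|CRITICAL) survivors=$(( survivors + 1 )) ;;
        esac
        ;;
    esac
  done
  if [ "$survivors" -eq 0 ]; then
    echo "FIX SKIPPED"
  else
    echo "FIX"
  fi
}
```

For instance, a grid containing only `REJECTED HIGH` and `CONFIRMED LOW` yields `FIX SKIPPED`, while a single `CONFIRMED MEDIUM` row yields `FIX`.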
Verification is MANDATORY after every discovery, review (including cross-domain integration review), and post-fix review stage that produces code-referencing findings with file:line references. Exception: stages producing findings without code-level references (web research, pure analysis, documentation reviews) — lead may mark verification as SKIPPED with explicit justification.
Verification completion checklist — MANDATORY before marking a stage as done:
- Extraction agent spawned and report produced
- If extraction found 0 findings → stage complete (early-exit)
- If extraction found MEDIUM+ findings:
a. ALL adversarial batches from extraction's batch assignment table spawned — cross-check each ADV task file's finding IDs against the prescribed batch:finding mapping
b. Synthesis agent spawned — compiles grid, sanity-checks severity
c. Synthesis grid determines FIX=SKIPPED or FIX follows
Skipping any step when MEDIUM+ findings exist is a protocol violation.
Verification naming convention:
- Extraction:
- Adversarial pairs: (single agent per finding for CRITICAL/HIGH — 1:1; single agent per batch of 5 for MEDIUM)
- Adversarial cross: (single agent per finding — 1:1)
- Synthesis:
Between Stages
- Write — verified results from the synthesis grid, decisions, context for next stage
- Mid-execution amendment (new findings): If VERIFY produces confirmed findings at MEDIUM severity or above and IMPLEMENT is NOT in the manifest, the lead auto-adds IMPLEMENT followed by FIX (always 2-3 sequential stages: fix + post-fix review + conditional VERIFY). This is unconditional — all confirmed MEDIUM+ findings are fixed regardless of task intent. LOW findings are reported but not auto-fixed. This is mechanical — verify the condition, add the stages.
FIX convergence (incomplete fixes): After a FIX stage's post-fix VERIFY produces CONFIRMED MEDIUM+ findings in the synthesis grid, auto-add another FIX pass regardless of whether IMPLEMENT is already in the manifest. IMPLEMENT presence does not block FIX convergence — surviving MEDIUM+ findings mean the fix was incomplete. Repeat until post-fix review produces zero MEDIUM+ findings and VERIFY is skipped.
- If scope changed from original plan, update with actual stages and revised goals
- Checkpoint. Clean up:
rm -f tmp/sN-*-prompt.txt tmp/sN-*-task.txt
- Next stage prompts include synthesis as section. PRIOR CONTEXT should contain only factual project context the next stage needs: what was discovered, what was decided, what constraints exist, what was already fixed. Do NOT include verification process details, rejected findings, or behavioral instructions — these compete with the agent .md. Target under 50 lines
- Never re-do verified work unless evidence shows it was wrong
- Never skip a planned stage without explicitly marking it in as with a reason. A stage is complete only when its agents have been spawned and waited on, their reports have been processed by the verification pipeline, and the findings have been verified. Do not proceed past an incomplete stage, except under the narrow gap-acceptance rules in Execution step 4. PLAN stages cannot be SKIPPED for speed or token savings, only for genuine blockers (environment failure, missing files, corrupted state).
- After writing synthesis, read to confirm the next stage. If the plan has remaining stages, execute them — do not deliver early unless remaining stages are explicitly marked SKIPPED.
Iterative stages: Between iterations, follow the Iterative Convergence protocol below — skip steps 1-5 until convergence is reached. On convergence, write final stage synthesis (step 1) and resume normal between-stages flow (steps 2-5).
Iterative Convergence
Some stages benefit from repeated runs until agents stop producing new meaningful output. What counts as "new output" depends on the stage purpose — new problems (audit), new information (research), new improvements (analysis), new risks (security), etc.
Convergence is mechanical: when ALL agents in an iteration produce zero new findings (empty reports, no new issues found), the stage has converged. A single non-empty report means the iteration produced output — iterate again. The lead does not subjectively judge whether findings are "meaningful enough" — any finding is meaningful.
Planner-decided, not mandatory. The planner selects NONE / ONCE / LOOP per stage based on task characteristics:
- NONE: One pass. For well-understood, narrow work. Also appropriate for codebases with comprehensive test coverage (>80%) and clean module boundaries — first pass is unlikely to miss meaningful issues.
- ONCE: One extra iteration if first pass found anything ("found anything" means any iter 1 agent reported at least one finding — regardless of whether it survived adversarial verification; the point is different iter 2 specialists re-examine what iter 1 noticed). Use when the planner's Phase 1 research reveals interconnected modules, dense coupling, non-uniform code patterns, or >15K LOC per domain — characteristics suggesting a first pass may miss issues. Also used when severity is HIGH/CRITICAL regardless of codebase quality (missed findings are expensive). ONCE is NOT the universal default — well-tested, cleanly-structured codebases should use NONE.
- LOOP: Up to 3 iterations, stop on empty report. For highly ambiguous or production-critical work where missed findings would be unacceptable.
Factors the planner considers: ambiguity, codebase complexity, finding volume from first pass, production impact of missed findings, change type (exploratory vs. mechanical), time sensitivity.
Not used for: Production stages (implementation and fixing) and verification stages. These produce or evaluate output rather than discovering issues.
Mandatory rules apply: CONVERGE iterations of DISCOVERY or REVIEW stages inherit ALL mandatory rules from the parent stage type — including second-opinion requirements at MEDIUM+ severity (see Second Opinion Guidelines). When the original DISCOVER/REVIEW required a second opinion agent, every CONVERGE iteration must also include a second opinion. The planner's decision table must list all agents to spawn per iteration — the lead spawns exactly what the plan lists.
Execution is mechanical — the lead does NOT re-evaluate the CONVERGE decision. If the plan says ONCE and verified findings exist, the lead spawns the iteration agents unconditionally. If the plan says NONE, the lead skips unconditionally. The planner's assessment of codebase characteristics (test coverage, coupling, module density, severity) was already baked into the plan during Phase 1 research. The lead does NOT substitute judgment based on what findings happened to be confirmed — whether findings appear "isolated" or "specific" is the planner's call at plan time, not the lead's call at execution time. The planner sees the full codebase structure during research; the lead only sees post-hoc finding counts.
Mechanics:
- Each iteration = full prepare → spawn → verify cycle
- After verification: check reports mechanically — any non-empty finding list in any agent report?
- Yes (any finding produced) → write iteration synthesis to
tmp/stage-N-iter-K-synthesis.md
, prepare next iteration with cumulative context from all prior iterations
- No (all reports empty, zero findings) → convergence reached; write final stage synthesis and move on
- Lead SHOULD vary approach between iterations — different agents, focus areas, or angles — to avoid blind spots. Running identical agents repeatedly is wasteful.
- Lead can adjust agent count and type between iterations based on what prior iterations revealed
- If iteration cap hit without convergence → synthesize what's known, note "convergence not reached" in delivery, proceed
- Naming: iteration agents follow — e.g. , (stage 2, iteration 1/2). Respawn within iteration: .
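The mechanical convergence check, combined with the NONE / ONCE / LOOP modes, can be sketched as a single decision function. This is a minimal sketch: `next_step` is a hypothetical helper, and the iteration caps (2 for ONCE, 3 for LOOP) are derived from "one extra iteration" and "up to 3 iterations" in the text.

```bash
# Hypothetical helper: decide what follows an iteration.
# $1 = planner mode (NONE, ONCE, LOOP)
# $2 = iteration number just completed (1-based)
# $3 = total findings across all agent reports in that iteration
next_step() {
  local mode="$1" iter="$2" findings="$3" cap
  case "$mode" in
    NONE) echo "proceed"; return 0 ;;   # single pass, no iteration
    ONCE) cap=2 ;;
    LOOP) cap=3 ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  if [ "$findings" -eq 0 ]; then
    echo "converged"
  elif [ "$iter" -lt "$cap" ]; then
    echo "iterate"
  else
    echo "convergence not reached"
  fi
}
```

The function takes only counts as input, which mirrors the rule that the lead does not judge whether findings are meaningful enough.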
VERIFY between iterations (MANDATORY): The plan must include a VERIFY stage
between every pair of CONVERGE iterations. The structure is:
Stage N: DISCOVER iter 1
Stage N+1: VERIFY iter 1 (extraction → adversarial → synthesis)
Stage N+2: DISCOVER iter 2 (conditional on N+1 synthesis, PRIOR CONTEXT from N+1)
Stage N+3: VERIFY iter 2
Iter 1's VERIFY produces the synthesis grid that (a) determines whether iter 2
spawns (any finding = spawn) and (b) provides PRIOR CONTEXT for iter 2 agents.
Merging both iterations' verification into one stage after both complete is a
protocol violation — there is no way to know whether iter 2 should spawn, and no
PRIOR CONTEXT for iter 2 without iter 1's synthesis first.
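The interleaved layout described above can be generated mechanically. The sketch below uses a hypothetical helper, `converge_layout`, and covers DISCOVER only; a REVIEW stage would presumably follow the same alternation.

```bash
# Hypothetical helper: print the stage layout for a DISCOVER stage.
# $1 = first stage number, $2 = number of iterations planned.
# Every DISCOVER iteration is followed by its own VERIFY stage.
converge_layout() {
  local stage="$1" iterations="$2" k=1
  while [ "$k" -le "$iterations" ]; do
    echo "Stage $stage: DISCOVER iter $k"
    echo "Stage $(( stage + 1 )): VERIFY iter $k"
    stage=$(( stage + 2 ))
    k=$(( k + 1 ))
  done
}
```

Calling `converge_layout 3 2` lists stages 3 through 6, alternating DISCOVER and VERIFY, so no two iterations ever share a verification stage.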
Delivery
Before delivery: Read
. Confirm every planned stage is complete or explicitly marked SKIPPED with justification. A stage silently skipped = not delivered yet. Execute it or update the plan. If any code was changed during the fix stage — by fix-agents — confirm that post-fix review and verification both ran (verification runs only if review found new findings). Code changes without downstream verification are not deliverable. The user's task instructions (commit, push, report) are the final step after all stages complete — they do not override the mandatory stages that must run first.
Before delivery, mechanically verify all mid-execution decisions:
- If any conditional VERIFY was skipped: read the stage's review reports.
If any report contains a MEDIUM+ finding with a code reference, the VERIFY
stage must be run now.
- If any finding was marked as dropped or noted by the lead without routing
through the verification pipeline: the finding must be routed through the
verification pipeline now.
After final stage:
- Reviews/audits: write report to with verified findings, rejected items, gaps
- Code changes: spawn a single agent (default model) to run build + tests, fix all failures, and deliver production-ready result. Lead chooses the exact agent for the job (e.g. debugger, build-error-resolver, cpp-pro). This is the final production gate.
- Research/analysis: synthesize into clear summary
- Write : task goal, stages executed, total agents, agent aborts/failures, iterations per iterative stage, verification stats, key decisions, phase durations (planning, preparation, execution/wait, verification, synthesis)
- Cleanup:
rm -f tmp/s[0-9]*-prompt.txt tmp/s[0-9]*-task.txt
. Keep logs, reports, summary
- Save workflow lessons to knowledge if applicable
Agent Prompt Template
Prompts are assembled with cache-aware ordering: stable shared content first (cached across calls), volatile per-instance content last. The assembly order:
You are a single agent working solo. Do all the work yourself — do not spawn sub-agents, do not delegate to other agents, do not run agentic workflows. Agentic workflows are not allowed in this session.
Before claiming something is missing or broken — grep for existing guards, handlers, or implementations first.
{cat <skill-folder>/templates/coordination-review.txt OR coordination-code.txt — replace {NAME}}
{cat <skill-folder>/templates/severity-guide.txt — REVIEW/audit tasks only}
{cat <skill-folder>/templates/quality-rules-review.txt OR quality-rules-code.txt}
{Full <skill-folder>/agents/{agent}.md — see Rules → Prompts}
You are an AI agent named {NAME}.
--- TASK ASSIGNMENT ---
PROJECT: {working directory and project description}
ENVIRONMENT (code tasks only):
{Runtime, test command, lint command}
PRIOR CONTEXT (stage 2+ or iteration 2+):
{Contents of tmp/stage-N-synthesis.md OR cumulative tmp/stage-N-iter-*-synthesis.md for iterations}
YOUR TASK: {KEY FILES, CONTEXT, SCOPE, MUST ANSWER questions}
WRITABLE FILES: {code agents only — list source files agent may edit. Review/research/audit agents: omit this section}
| Task Type | Coordination | Severity Guide | Quality Rules |
|---|---|---|---|
| Review/audit | coordination-review.txt | severity-guide.txt | quality-rules-review.txt |
| Code/refactor | coordination-code.txt | — | quality-rules-code.txt |
| Research | coordination-review.txt | — | quality-rules-review.txt |
Boilerplate templates live in
<skill-folder>/templates/
. Lead only writes the unique parts (agent .md selection + TASK ASSIGNMENT). Templates are
-ed into the prompt file verbatim.
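The template table above can be sketched as a lookup plus a concatenation step. This is a sketch only: `templates_for` and `assemble_boilerplate` are hypothetical helpers, the caller supplies the templates directory and agent file path, and the fixed preamble lines, the `{NAME}` replacement, and the TASK ASSIGNMENT section are deliberately left out.

```bash
# Hypothetical helper: print the boilerplate template file names for a
# task type, in assembly order (coordination, severity guide, quality rules).
templates_for() {
  case "$1" in
    review|audit)  printf '%s\n' coordination-review.txt severity-guide.txt quality-rules-review.txt ;;
    code|refactor) printf '%s\n' coordination-code.txt quality-rules-code.txt ;;
    research)      printf '%s\n' coordination-review.txt quality-rules-review.txt ;;
    *) echo "unknown task type: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: write the stable part of a prompt to stdout.
# $1 = task type, $2 = templates directory, $3 = path to the agent .md file.
# Templates come first and the full agent file last, matching the
# cache-aware ordering described in the text.
assemble_boilerplate() {
  local task_type="$1" templates_dir="$2" agent_md="$3" name
  templates_for "$task_type" | while read -r name; do
    cat "$templates_dir/$name"
  done
  cat "$agent_md"
}
```

Redirecting the output of `assemble_boilerplate` into a prompt file and then appending the per-agent TASK ASSIGNMENT keeps the volatile content at the end.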
Checkpoints & Recovery
Save after every step — no exceptions. One active checkpoint (delete previous first). Under 500 chars.
```bash
./<skill-folder>/tools/memory.sh session add context "CHECKPOINT: [task] | DONE: [steps] | NEXT: [remaining] | SKIP: [do not redo — completed agents, failed approaches, skipped stages, pending approvals] | FILES: [key files] | BUILD/TEST: [commands]"
```
The
field prevents rework after compaction/crash recovery. Record:
- Already-completed agents whose reports exist (e.g. )
- Failed approaches tried 3× (do not retry same thing)
- Stages explicitly skipped with reason (e.g.
verify skipped — 0 findings
)
- Pending approval decisions (
awaiting user approval for push
)
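The checkpoint format and its size limit can be sketched as a builder that refuses oversized entries. `build_checkpoint` is a hypothetical helper written for illustration; the field order follows the checkpoint command shown earlier, and "under 500 chars" is read as a strict limit.

```bash
# Hypothetical helper: build a checkpoint line and enforce the size limit.
# $1 = task, $2 = done steps, $3 = next steps, $4 = skip list,
# $5 = key files, $6 = build/test commands.
build_checkpoint() {
  local line
  line="CHECKPOINT: $1 | DONE: $2 | NEXT: $3 | SKIP: $4 | FILES: $5 | BUILD/TEST: $6"
  if [ "${#line}" -ge 500 ]; then
    echo "checkpoint too long: ${#line} chars" >&2
    return 1
  fi
  printf '%s\n' "$line"
}
```

The printed line is what would be passed as the quoted argument to `memory.sh session add context`.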
Compaction recovery — MANDATORY sequence (do ALL steps, no skipping):
- Run
<skill-folder>/tools/glm-recover.sh
— prints memory session, plan, continuation (if any), newest synthesis (iter or stage, by mtime), and latest checklist in one stream. Replaces steps 1, 2, 3 below with a single command
- Re-read AGENTS.md in full and STRICTLY follow its instructions — ALWAYS, no exceptions, no partial reads. does NOT do this for you
- Only then resume work
If
is unavailable, fall back to the manual sequence:
./<skill-folder>/tools/memory.sh session show
— restore session state
- Read — restore current plan
- Read the latest ,
tmp/stage-N-iter-K-synthesis.md
, or — restore verification/iteration/stage state
Do not rely on continuation summary alone. Do not skip the AGENTS.md re-read — this is the #1 cause of workflow deviation after compaction.
| Checkpoint | Recovery |
|---|---|
| Plan done | Read → prepare agents |
| Agents prepared | List prompts → spawn |
| Agents spawned | Check PIDs/reports → verify or re-wait |
| Verifying stage N | Read — the lead's synthesis from the synthesis agent's grid |
| Iterating stage N, iter K | Read tmp/stage-N-iter-K-synthesis.md + cumulative context → prepare next iteration |
| Stage N done | Read synthesis + plan → next stage |
Compaction handoff format — for long-running stages, include this block in stage synthesis to preserve active process state:
```markdown
## Compaction Handoff
- **Current objective:** [what this stage is doing]
- **User constraints:** [explicit instructions that must survive compaction]
- **Active plan / workflow:** [reference to plan artifact or current step]
- **Approval state:** [what's approved, what's pending, what was denied]
- **Key facts and decisions:** [exact values, resolved ambiguities, why choices were made]
- **Actions already taken:** [agents spawned, commands run, files changed]
- **Errors, blockers, attempted fixes:** [what failed and what was tried; do not retry same approach]
- **Pending tasks:** [remaining subtasks in this stage]
- **Next recommended step:** [single concrete action to resume with]
- **Do not redo:** [completed agents, failed approaches, skipped steps]
```
Session Continuation
For tasks exceeding a single session:
- Complete current stage fully
- Write : original task, plan, completed stages, next stage, decisions, modified files, blockers
./<skill-folder>/tools/memory.sh add context "GLM-CONTINUATION: [summary]" --tags glm-opencode,continuation
- Tell user what's done and what continues
Pickup: ./<skill-folder>/tools/memory.sh search "GLM-CONTINUATION"
→ read continuation file → read prior synthesis → continue next stage. On final stage, clean up continuation file and memory entry. Never re-do verified prior work.
Error Handling
| Scenario | Action |
|---|---|
| No report after exit | Read log to diagnose failure. Fix root cause (bad prompt? missing dependency? environment?). Re-spawn the agent. Do NOT fill gaps yourself — filling gaps is agent work. |
| STALLED (flagged by wait-glm.sh) | Kill process, read log to diagnose. Fix root cause. Re-spawn. Do NOT note gap and proceed. |
| Agent claims success but output wrong | Diagnose why output is wrong (bad prompt? misunderstood task?). Fix the prompt/task. Re-spawn the agent. Do NOT verify or fix the output yourself. |
| Incorrect edits | Diagnose why the agent produced wrong output (bad prompt? misunderstood task?). Fix the prompt/task. Spawn a quick-fix agent to revert and rewrite. Do NOT revert changes yourself. If the quick-fix agent is still wrong, escalate to full IMPLEMENT → REVIEW → VERIFY. |
| 2+ agents fail same env error | STOP respawning. Diagnose environment first (do NOT fix environment issues directly — spawn an agent if changes needed) |
| Agent aborted (same error 3×) | Read log to diagnose root cause, fix environment/config (spawn an agent if code/config changes needed), then respawn |
| Stage partially failed (1+ agents produced no useful output or wrong output) | Diagnose root causes across all failed agents. Fix issues (environment, prompts, tasks). Re-spawn ALL failed agents. The stage is incomplete until all agents succeed. Do NOT proceed to the next stage with gaps. |
| Iteration cap hit without convergence | Synthesize all iterations, note "convergence not reached" in delivery, proceed |
| Adversarial verification produces suspicious results (CONFIRMED on obviously-wrong findings or REJECTED with weak evidence) | Diagnose prompt/task quality — adversarial agent may have misunderstood its role. Adjust MUST ANSWER questions or adversarial instructions and respawn. |
Rules
Quality over speed — ALWAYS. Never rush, never cut corners, never try to finish faster. Slow, thorough, methodical work produces quality. Speed produces bugs. Prefer more stages, more agents, more verification over shorter timelines. There is no deadline. The only measure of success is production-ready, bug-free code.
Limits: Per-batch limit and agent parallelism rules are defined in Tools and Agent Spawning — don't restate. Need more coverage than the 10-agent per-batch cap allows? Add stages, not more agents per batch. Agents run until done (no turn limit). One task per agent. Respawn naming:
,
. No two agents edit same file within a stage (read overlap OK). Balance workload — each agent should cover roughly equal scope.
Task tool prohibition (MANDATORY — single most important rule): Agent delegation in this project happens ONLY via
. The
tool with its
parameter is FORBIDDEN — never call it, regardless of the use case (exploration, code review, implementation, research, anything).
The Task tool's built-in
list happens to share names with our agent
files in
(
,
,
, etc.) — these are TWO DIFFERENT THINGS. The Task tool ships a separate sub-agent runtime that bypasses our agent delegation system, the
pipeline, verification, report formats, and quality rules. Our agent
files are reached ONLY by passing
to
and then spawning via
.
If you catch yourself about to call
— stop, use
instead.
Agent count per stage (MANDATORY — fill capacity by task decomposition): Decompose the task into as many independent subtasks as it naturally splits into, spawn one agent per subtask, maximum 10 agents per batch. Default to what the task genuinely requires — scale to scope. Over-engineering with unnecessary agents adds coordination overhead that degrades quality (proven across 260+ configurations). Fill the maximum only when the task truly spans that many distinct domains. Verification stages scale with findings count and impact surface, not discovery agent count — minimum 1 extraction agent for every stage; adversarial agents run only if extraction finds at least one finding to falsify. When in doubt, decompose into more parallel agents — broader coverage finds more issues. Never run sequential single-agent stages when those stages could be a single stage with parallel agents (see Workflow → Planning → Stage decomposition rule).
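The 10-agent per-batch cap can be sketched as a splitter that groups a stage's agents into sequential batches. `batch_agents` is a hypothetical helper; it only groups names and does not spawn anything.

```bash
# Hypothetical helper: read agent names (one per line) on stdin and print
# one batch per line, with at most 10 space-separated names in each batch.
batch_agents() {
  local name count=0 line=""
  while read -r name; do
    [ -n "$name" ] || continue
    if [ "$count" -eq 10 ]; then
      printf '%s\n' "$line"
      line=""
      count=0
    fi
    line="${line:+$line }$name"
    count=$(( count + 1 ))
  done
  if [ -n "$line" ]; then
    printf '%s\n' "$line"
  fi
  return 0
}
```

Twelve agents therefore produce two lines, and each line is a batch that must finish before the next one is spawned.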
Prompts: Include the FULL agent
file — agents are optimized and every section earns its place. Do NOT trim or skip sections. Boilerplate (quality rules, severity guide, coordination, report format) comes from
<skill-folder>/templates/
and is prepended before the agent .md for prompt-cache stability (stable shared content cached first, volatile content last). Agents don't load AGENTS.md — all context must be in prompt.
Verification: Every finding labeled. Every label backed by Read. 100% complete before proceeding. ALL verified actionable findings fixed via fix-agent — the lead does not fix findings directly.
Lead code prohibition (MANDATORY): The lead never writes, edits, or modifies project source code. Every code change — implementation, bug fixes, config adjustments, script changes, one-liners — goes through a spawned agent. The lead's tools (Edit, Write) are for tmp/ artifacts only: task files, prompts, synthesis reports. The only exception is editing AGENTS.md itself (meta-configuration).
Platform: or
on all platforms (spawn-glm.sh handles invocation, use
flag if running in pi). Always redirect output to log files.
Skills (Workflows)
Workflows are available as skills in
directory. Use
to invoke. Skills are orthogonal to the agentic workflow — they are utility operations invoked directly by the lead as needed. Skill output is not routed through the verification pipeline.