Diagnose Why Work Stopped
A repeatable procedure for the recurring class of issues where the user (or a manager) points at a stalled / looping / over-recovered issue tree and asks "why did this stop / why is this looping / how do we make sure this doesn't happen again?"
This skill is diagnostic + product-design, not engineering. The output is a written root cause and an approved plan. No code changes leave this skill.
Canonical execution model: read doc/execution-semantics.md before diagnosing or proposing a new liveness/recovery rule. Use that document as the source of truth for status, action-path, post-run disposition, bounded continuation, productivity review, pause-hold, watchdog, and explicit recovery semantics. If the investigation finds a true product-rule gap, the plan should say whether doc/execution-semantics.md needs a matching update.
When to use
Trigger on an assignment whose title or body matches any of:
- "why did this work stop", "why did this stall", "why did this just stop"
- "infinite loop", "looping", "spinning", "going too deep", "recovery went too deep"
- "liveness — what happened here", "this tree stopped working", "stuck"
- "approach it from a product perspective", "general product principle / rule"
- An attached link to a specific stalled / looping / over-recovered issue tree
Also use when the user asks for forensics, root cause, or a write-up before any product change.
When NOT to use
- The assignment asks you to ship a code change directly. Use normal engineering flow.
- The assignment is a normal bug report against a specific feature. Use normal investigation.
- You are the original implementer being asked to fix your own bug. Use normal debugging.
Three invariants you must preserve
Every diagnosis and every proposed rule must hold these three invariants together. The user has restated them on at least four issues; treat them as load-bearing:
- Productive work continues. Agents that have a clear next action must keep working without needing the user to wake them. (PAP-2674, PAP-2708)
- Only real blockers stop work. Stops happen when something genuinely cannot proceed (missing approval, missing dependency, human owner). Pseudo-stops (in_review with no action path, cancelled leaves, malformed metadata) must be detected and routed, not left silent. (PAP-2335, PAP-2674)
- No infinite loops. Stranded-work recovery and continuation loops must be bounded and distinguishable from genuinely productive continuation. (PAP-2602, PAP-2486)
If a proposed rule violates any of the three, drop it or rework it. State explicitly in the plan how each invariant is held.
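One way to make "state explicitly how each invariant is held" checkable is a small pre-flight helper run over the draft plan. This is a minimal sketch: the invariant names come from the list above, while the plan-notes dictionary and the helper `missing_invariants` are hypothetical and not part of any existing tooling.

```python
# The three invariant names come from the section above; the plan-notes
# dictionary and this helper are hypothetical.
INVARIANTS = (
    "productive work continues",
    "only real blockers stop work",
    "no infinite loops",
)


def missing_invariants(plan_notes: dict) -> list:
    """Return the invariants the plan has not explained how it preserves."""
    return [
        name
        for name in INVARIANTS
        if not str(plan_notes.get(name, "")).strip()
    ]


notes = {
    "productive work continues": "Agents with a queued action path are never paused.",
    "no infinite loops": "   ",  # whitespace only, so it counts as missing
}
print(missing_invariants(notes))
```

An empty result means every invariant has at least some written justification; it says nothing about whether that justification is sound.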
Procedure
0. Read the current execution contract
Before walking the tree, read doc/execution-semantics.md and keep its terms intact:
- live path / waiting path / recovery path
- post-run disposition: terminal, explicitly live, explicitly waiting, invalid
- bounded run_liveness_continuation
- productivity review vs liveness recovery
- active subtree pause holds
- silent active-run watchdog
Do not invent a new rule until you can state how it differs from the current execution semantics document.
1. Forensics on the named tree — before anything else
Do this in the same heartbeat. Do not propose a rule until you have a concrete stop point.
- Open the linked issue (and its blocker chain, parents, recovery siblings, recent runs).
- Walk the tree node-by-node and find the exact issue + state combination that stops the world. Common shapes seen in the company so far:
- An issue with no typed execution participant, no active run, no pending interaction, and no recovery issue (PAP-2335, PAP-2674).
- An issue left with no future action path queued after a successful run (PAP-2674).
- Blocker chain whose leaf is / malformed / cross-company-inaccessible (PAP-2602).
- issue.continuation_recovery waking the same issue >N times after successful runs (PAP-2602).
- Stranded-work recovery treating its own recovery issues as more recoverable source work (PAP-2486).
- Quote the evidence: run ids, comment timestamps, status transitions. "Inferred" is acceptable only when an API boundary blocks direct evidence — say so explicitly and mark the claim provisional (PAP-2631).
Respect the API boundary. If the linked issue is in another company and your agent token returns 403, do not bypass scoping. Either request a board-approved diagnostic path or proceed from inferred PAP-side evidence and label it.
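The node-by-node walk can be sketched as a search for non-terminal issues that have nothing able to move them forward. This is an illustration only: the `Issue` fields and the helper names below are hypothetical stand-ins chosen to mirror the first shape in the list above, not the real issue schema or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Issue:
    # Hypothetical shape for illustration; not the real issue schema.
    id: str
    terminal: bool = False
    has_participant: bool = False
    has_active_run: bool = False
    has_pending_interaction: bool = False
    has_recovery_issue: bool = False
    children: list[Issue] = field(default_factory=list)


def has_action_path(issue: Issue) -> bool:
    """True if anything could still move this issue forward."""
    return (
        issue.has_participant
        or issue.has_active_run
        or issue.has_pending_interaction
        or issue.has_recovery_issue
    )


def find_stop_points(root: Issue) -> list[str]:
    """Walk the tree depth-first and return ids of silent stops."""
    stops: list[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.terminal and not has_action_path(node):
            stops.append(node.id)
        # Reverse so children are visited in their listed order.
        stack.extend(reversed(node.children))
    return stops
```

Each id returned is a candidate stop point, not a conclusion. It still needs the quoted evidence (run ids, timestamps, status transitions) before it goes into the forensics.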
2. Survey recent related work
Before proposing a new product rule, read what already shipped this week in the same area. The user has explicitly called this out (PAP-2602): "review our recent work on liveness that we shipped in the last couple of days." A new rule that contradicts code merged 48 hours ago is rework, not improvement.
Quick survey:
- Recent merged PRs in the affected area.
- Recent done issues whose title mentions liveness, recovery, productivity, continuation, or the affected subsystem.
- Any active plan documents on parent issues. The fix may belong as a revision to an existing plan, not as a new top-level proposal.
State in the forensics: "I reviewed X, Y, Z. The new gap is …"
3. Classify each non-progressing issue in the tree
For every issue in the affected tree that is not terminal or actively running, decide:
- Truly needs human or board intervention — name the owner and the action.
- Agent-actionable but not currently routed: name the rule that would have routed it, and the agent that should have been woken.
- Already covered — point at the active run, queued wake, recovery issue, or pending interaction.
This is the table the user has asked for repeatedly (PAP-2335). Without it the plan is abstract.
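The three-way classification can be drafted as a small table builder. This is a sketch under stated assumptions: the `Row` fields, the precedence (an issue that is already covered is reported as covered even if a human owner is also named), and the issue ids in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Bucket(Enum):
    NEEDS_HUMAN = "needs human or board intervention"
    UNROUTED = "agent-actionable but not currently routed"
    COVERED = "already covered"


@dataclass
class Row:
    # Hypothetical fields for illustration only.
    issue_id: str
    covered_by: Optional[str] = None    # e.g. "active run", "queued wake"
    human_owner: Optional[str] = None   # named owner if a human must act
    human_action: Optional[str] = None  # what that owner has to do


def classify(row: Row) -> Bucket:
    """Assumed precedence: covered first, then human, otherwise unrouted."""
    if row.covered_by:
        return Bucket.COVERED
    if row.human_owner:
        return Bucket.NEEDS_HUMAN
    return Bucket.UNROUTED


def render_table(rows: list) -> str:
    """Render one plain-text line per non-progressing issue."""
    lines = ["issue | classification | detail"]
    for row in rows:
        bucket = classify(row)
        if bucket is Bucket.COVERED:
            detail = row.covered_by
        elif bucket is Bucket.NEEDS_HUMAN:
            detail = f"{row.human_owner}: {row.human_action or 'action not yet named'}"
        else:
            detail = "name the missing routing rule and the agent to wake"
        lines.append(f"{row.issue_id} | {bucket.value} | {detail}")
    return "\n".join(lines)
```

For example, `render_table([Row("EX-1", covered_by="queued wake"), Row("EX-2")])` yields a header plus one covered row and one unrouted row. The unrouted rows are the ones the plan has to explain.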
4. Frame as a general product rule
The user does not want a one-off patch on the named tree. They want the rule. Three checks:
- The rule is stated as a contract, not as an if/else patch. Example contract: "every agent-owned non-terminal issue must finish each heartbeat with a terminal state, an explicit waiting path, or an explicit live path" (PAP-2674).
- The rule is reconciled against doc/execution-semantics.md. Prefer citing and applying the existing contract; propose a document change only when the current doc is incomplete or contradicted by accepted/implemented behavior.
- The rule explicitly preserves the three invariants above. Show the work.
If the rule would have blocked a recent productive run from succeeding, drop or narrow it.
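To show the difference between a contract and an if/else patch, the example contract from PAP-2674 can be written as a single predicate over end-of-heartbeat state. This is a sketch, not the definition in doc/execution-semantics.md: the four disposition labels are the terms listed in step 0, while the function names, the input shape, and the choice to check live paths before waiting paths are assumptions.

```python
from enum import Enum


class Disposition(Enum):
    # The four labels are the terms listed in step 0.
    TERMINAL = "terminal"
    LIVE = "explicitly live"
    WAITING = "explicitly waiting"
    INVALID = "invalid"


def post_run_disposition(is_terminal, live_paths, waiting_paths):
    """Classify an agent-owned issue at the end of a heartbeat.

    live_paths and waiting_paths are hypothetical lists of human-readable
    reasons, e.g. "queued wake" or "pending board approval".
    """
    if is_terminal:
        return Disposition.TERMINAL
    if live_paths:
        return Disposition.LIVE
    if waiting_paths:
        return Disposition.WAITING
    # Non-terminal with nothing queued and nothing awaited: a silent stop.
    return Disposition.INVALID


def contract_violations(issues):
    """Return ids of issues that end the heartbeat in an invalid state."""
    return [
        issue_id
        for issue_id, (is_terminal, live, waiting) in issues.items()
        if post_run_disposition(is_terminal, live, waiting) is Disposition.INVALID
    ]
```

The useful property is that the predicate applies to every agent-owned issue, not only to the named tree. Replaying it against a recent productive run is a quick way to see whether the rule would have blocked work that succeeded.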
5. Plan, do not code
Write the plan into the issue's document. Cover:
- Forensics summary (root cause + evidence).
- The general product rule, stated as a contract.
- Whether the existing doc/execution-semantics.md contract already covers the case, or what exact documentation update is needed.
- Phased subtasks: typically Phase 0 resolves the named live tree (carefully, not destructively), a docs phase codifies the contract, and implementation phases then cover detection, recovery, UI surfacing, security review, QA, and CTO review.
- Explicit assignees per phase; favor team specialty (CodexCoder for server, ClaudeCoder for FE, UXDesigner for visible state, SecurityEngineer for ownership/permissions, QA for validation).
- Blocking dependencies wired between phases, and parallel branches identified.
Do not create the child issues yet. Do not push code.
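Before writing the dependency section, it can help to check the draft phase graph mechanically: which phases can run in parallel, and whether any blocker is unknown or circular. The sketch below groups phases into waves. The function name, the phase names, and the example dependencies are hypothetical and only loosely follow the phase list above.

```python
from __future__ import annotations


def parallel_waves(blocked_by: dict[str, set[str]]) -> list[list[str]]:
    """Group phases into waves; phases in the same wave can run in parallel."""
    unknown = {dep for deps in blocked_by.values() for dep in deps} - set(blocked_by)
    if unknown:
        raise ValueError(f"unknown blockers: {sorted(unknown)}")

    remaining = {phase: set(deps) for phase, deps in blocked_by.items()}
    finished: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        # A phase is ready once every one of its blockers has finished.
        ready = sorted(p for p, deps in remaining.items() if deps <= finished)
        if not ready:
            raise ValueError(f"dependency cycle among: {sorted(remaining)}")
        waves.append(ready)
        finished.update(ready)
        for phase in ready:
            del remaining[phase]
    return waves


# Hypothetical plan: each phase maps to the phases that block it.
plan = {
    "phase-0-hygiene": set(),
    "docs-contract": set(),
    "detection": {"docs-contract"},
    "recovery": {"detection"},
    "ui-surfacing": {"detection"},
    "qa": {"recovery", "ui-surfacing", "phase-0-hygiene"},
    "cto-review": {"qa"},
}
for number, wave in enumerate(parallel_waves(plan), start=1):
    print(number, wave)
```

A wave with more than one phase is a parallel branch worth calling out in the plan. A `ValueError` means the dependency list needs fixing before anyone is asked to approve it.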
6. Request approval, then decompose
- Open a confirmation interaction targeting the latest plan revision. Idempotency key: confirmation:{issueId}:plan:{revisionId}.
- Wait for board/CTO acceptance. If the user posts a new comment that supersedes the plan, the prior confirmation is invalidated — open a fresh confirmation tied to the new revision (PAP-2602 cycled three revisions; that is fine).
- Only after acceptance: create the phased child issues with the right assignees and dependencies, then block this parent on the final QA / CTO review issue so the parent only wakes when the chain finishes.
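The idempotency key is what keeps a retried heartbeat from opening duplicate confirmations, and what makes an older confirmation stale once the plan is revised. A minimal sketch of both behaviors follows. The key format is the one given above; `ConfirmationLog` is a hypothetical in-memory stand-in, not the real interaction store, and the ids are made up.

```python
def confirmation_key(issue_id: str, revision_id: str) -> str:
    """Build the idempotency key in the format given above."""
    return f"confirmation:{issue_id}:plan:{revision_id}"


class ConfirmationLog:
    """Hypothetical in-memory stand-in for the interaction store."""

    def __init__(self) -> None:
        self._opened = set()

    def open(self, issue_id: str, revision_id: str) -> bool:
        """Open a confirmation once per revision; repeats are no-ops."""
        key = confirmation_key(issue_id, revision_id)
        if key in self._opened:
            return False
        self._opened.add(key)
        return True

    def is_current(self, key: str, issue_id: str, latest_revision_id: str) -> bool:
        """A confirmation only counts if it targets the latest revision."""
        return key in self._opened and key == confirmation_key(issue_id, latest_revision_id)


log = ConfirmationLog()
log.open("issue-1", "rev-1")  # first request for revision 1
log.open("issue-1", "rev-1")  # retry in a later heartbeat: ignored
log.open("issue-1", "rev-2")  # plan revised: fresh confirmation
```

Because the revision id is part of the key, an acceptance recorded against `rev-1` cannot be mistaken for approval of `rev-2`.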
7. Phase 0 hygiene on the named tree
Phase 0 cleans up the live tree without papering over evidence:
- Move stalled leaves with no participant to with a precise next action and named owner (PAP-2335).
- Detach cancelled/dead blockers from chains they were holding hostage; do not silently mark issues to clear backlog.
- Leave a comment on the original named issue summarizing what changed and why; never hide the recovery chain history.
8. Final close-out
When the phase chain is complete, post a board-level summary comment on the parent issue: what changed, what the new contract is, what the rollout step is (e.g. "restart the control-plane to pick up the new response shape"), and the live state of the originally-named tree. Then close the parent.
Pitfalls
- Coding before approval. The user has said "make a plan first" on every recent diagnostic issue. Producing code in the forensic phase wastes the round-trip.
- Restating one invariant at the cost of another. Bound continuation too tightly and productive work stalls; loosen recovery and infinite loops return. Always check all three.
- Skipping the recent-work survey. Proposing a contract that contradicts what shipped 24 hours ago is the easiest way to get the plan rejected.
- Letting "in_review" mean done. A leaf assigned to another agent with no participant or active run is not progress; treat it as a stop.
- Bypassing company scoping. Cross-company forensics needs a board-approved diagnostic path, not a database read.
- Recursive recovery. Stranded-work recovery that recovers its own recovery issues is the canonical infinite loop (PAP-2486). Detect it and refuse to deepen.
- Hiding the chain. Don't silently delete or hide the symptomatic recovery issues — the operator needs the audit trail.
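The recursive-recovery and unbounded-continuation pitfalls both reduce to a guard that runs before any recovery or re-wake. The sketch below is illustrative only: `RecoveryCandidate`, `may_recover`, and the default bound of 3 are hypothetical, and the real bound and recovery semantics are whatever doc/execution-semantics.md defines.

```python
from __future__ import annotations

from dataclasses import dataclass

MAX_CONTINUATION_WAKES = 3  # hypothetical bound; the real N is a product decision


@dataclass
class RecoveryCandidate:
    # Hypothetical shape for illustration only.
    issue_id: str
    is_recovery_issue: bool
    continuation_wakes: int  # wakes already issued after successful runs


def may_recover(
    candidate: RecoveryCandidate,
    max_wakes: int = MAX_CONTINUATION_WAKES,
) -> tuple[bool, str]:
    """Decide whether recovery may act on this issue, with a reason."""
    if candidate.is_recovery_issue:
        # Recovering a recovery issue is how the chain deepens without end.
        return False, "refuse to deepen: source is itself a recovery issue"
    if candidate.continuation_wakes >= max_wakes:
        # One more wake would exceed the bound, so escalate instead.
        return False, "continuation bound reached: escalate instead of re-waking"
    return True, "ok"
```

Returning the reason alongside the decision keeps the refusal visible in the audit trail, so the guard does not silently drop work.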
Verification checklist (before posting the plan)