diagnose-why-work-stopped
Diagnose Why Work Stopped
A repeatable procedure for the recurring class of issues where the user (or a manager) points at a stalled / looping / over-recovered issue tree and asks "why did this stop / why is this looping / how do we make sure this doesn't happen again?"
This skill is diagnostic + product-design, not engineering. The output is a written root cause and an approved plan. No code changes leave this skill.
Canonical execution model: read `doc/execution-semantics.md` before diagnosing or proposing a new liveness/recovery rule. Use that document as the source of truth for status, action-path, post-run disposition, bounded continuation, productivity review, pause-hold, watchdog, and explicit recovery semantics. If the investigation finds a true product-rule gap, the plan should say whether `doc/execution-semantics.md` needs a matching update.
When to use
Trigger on an assignment whose title or body matches any of:
- "why did this work stop", "why did this stall", "why did this just stop"
- "infinite loop", "looping", "spinning", "going too deep", "recovery went too deep"
- "liveness — what happened here", "this tree stopped working", "stuck"
- "approach it from a product perspective", "general product principle / rule"
- An attached link to a specific stalled / looping / over-recovered issue tree
Also use when the user asks for forensics, root cause, or a write-up before any product change.
When NOT to use
- The assignment asks you to ship a code change directly. Use normal engineering flow.
- The assignment is a normal bug report against a specific feature. Use normal investigation.
- You are the original implementer being asked to fix your own bug. Use normal debugging.
Three invariants you must preserve
Every diagnosis and every proposed rule must hold these three invariants together. The user has restated them on at least four issues; treat them as load-bearing:
- Productive work continues. Agents that have a clear next action must keep working without needing the user to wake them. (PAP-2674, PAP-2708)
- Only real blockers stop work. Stops happen when something genuinely cannot proceed (missing approval, missing dependency, human owner). Pseudo-stops (in_review with no action path, cancelled leaves, malformed metadata) must be detected and routed, not left silent. (PAP-2335, PAP-2674)
- No infinite loops. Stranded-work recovery and continuation loops must be bounded and distinguishable from genuinely productive continuation. (PAP-2602, PAP-2486)
If a proposed rule violates any of the three, drop it or rework it. State explicitly in the plan how each invariant is held.
Procedure
0. Read the current execution contract
Before walking the tree, read `doc/execution-semantics.md` and keep its terms intact:
- live path / waiting path / recovery path
- post-run disposition: terminal, explicitly live, explicitly waiting, invalid
- bounded `run_liveness_continuation`
- productivity review vs liveness recovery
- active subtree pause holds
- silent active-run watchdog
Do not invent a new rule until you can state how it differs from the current execution semantics document.
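The post-run disposition vocabulary can be pinned down as a small enum. This is a minimal sketch, not the actual data model: `IssueSnapshot`, its fields, and `classify_disposition` are hypothetical names introduced for illustration, and treating `cancelled` as terminal is an assumption. `doc/execution-semantics.md` remains the source of truth.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Disposition(Enum):
    TERMINAL = "terminal"
    EXPLICITLY_LIVE = "explicitly live"
    EXPLICITLY_WAITING = "explicitly waiting"
    INVALID = "invalid"


@dataclass
class IssueSnapshot:
    # All fields are hypothetical; the real schema may differ.
    status: str                        # e.g. "todo", "in_progress", "in_review", "done"
    has_active_run: bool = False       # a run is executing right now (live path)
    has_queued_wake: bool = False      # a future action is queued (live path)
    waiting_on: Optional[str] = None   # named blocker, approval, or owner (waiting path)


def classify_disposition(issue: IssueSnapshot) -> Disposition:
    """Map an issue's state after a run to one of the four dispositions."""
    if issue.status in ("done", "cancelled"):  # assumption: both count as terminal
        return Disposition.TERMINAL
    if issue.has_active_run or issue.has_queued_wake:
        return Disposition.EXPLICITLY_LIVE
    if issue.waiting_on is not None:
        return Disposition.EXPLICITLY_WAITING
    # Non-terminal with neither a live path nor a waiting path: a silent stop.
    return Disposition.INVALID
```

An `INVALID` result is the case worth investigating: the issue is neither finished nor routed anywhere.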
1. Forensics on the named tree — before anything else
Do this in the same heartbeat. Do not propose a rule until you have a concrete stop point.
- Open the linked issue (and its blocker chain, parents, recovery siblings, recent runs).
- Walk the tree node-by-node and find the exact issue + state combination that stops the world. Common shapes seen in the company so far:
  - `in_review` with no typed execution participant, no active run, no pending interaction, no recovery issue (PAP-2335, PAP-2674).
  - `in_progress` after a successful run with no future action path queued (PAP-2674).
  - Blocker chain whose leaf is `cancelled` / malformed / cross-company-inaccessible (PAP-2602).
  - `issue.continuation_recovery` waking the same issue >N times after successful runs (PAP-2602).
  - Stranded-work recovery treating its own recovery issues as more recoverable source work (PAP-2486).
- Quote the evidence: run ids, comment timestamps, status transitions. "Inferred" is acceptable only when an API boundary blocks direct evidence — say so explicitly and mark the claim provisional (PAP-2631).
Respect the API boundary. If the linked issue is in another company and your agent token returns 403, do not bypass scoping. Either request a board-approved diagnostic path or proceed from inferred PAP-side evidence and label it.
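The tree walk can be sketched as a depth-first search that labels each node matching one of the common stop shapes. This is an illustration under stated assumptions, not the product's detector: the `Node` fields, the `MAX_WAKES_AFTER_SUCCESS` bound, and the in-memory `by_id` map are all hypothetical, and a blocker id missing from the map stands in for "malformed or inaccessible".

```python
from dataclasses import dataclass, field

MAX_WAKES_AFTER_SUCCESS = 3  # hypothetical stand-in for the ">N" bound


@dataclass
class Node:
    # Hypothetical snapshot of one issue; field names are illustrative only.
    id: str
    status: str
    has_participant: bool = False
    has_active_run: bool = False
    has_pending_interaction: bool = False
    has_recovery_issue: bool = False
    has_queued_action: bool = False
    last_run_succeeded: bool = False
    wakes_after_success: int = 0
    blocked_by: list = field(default_factory=list)  # ids of blocking issues
    children: list = field(default_factory=list)    # ids of child issues


def blocker_leaves(node_id, by_id, seen=None):
    """Follow blocked_by links and return the ids at the end of each chain."""
    seen = set() if seen is None else seen
    if node_id in seen:
        return []  # cycle guard
    seen.add(node_id)
    node = by_id.get(node_id)
    if node is None or not node.blocked_by:
        return [node_id]
    leaves = []
    for blocker_id in node.blocked_by:
        leaves.extend(blocker_leaves(blocker_id, by_id, seen))
    return leaves


def stop_shape(node, by_id):
    """Return a label for the stop shape at this node, or None."""
    if node.status == "in_review" and not (
        node.has_participant or node.has_active_run
        or node.has_pending_interaction or node.has_recovery_issue
    ):
        return "in_review with no participant, run, interaction, or recovery issue"
    if (node.status == "in_progress" and node.last_run_succeeded
            and not node.has_queued_action and not node.has_active_run):
        return "in_progress after a successful run with no queued action path"
    if node.wakes_after_success > MAX_WAKES_AFTER_SUCCESS:
        return "continuation recovery woke the issue too many times"
    for blocker_id in node.blocked_by:
        for leaf_id in blocker_leaves(blocker_id, by_id):
            leaf = by_id.get(leaf_id)
            if leaf is None:
                return f"blocker leaf {leaf_id} is missing or inaccessible"
            if leaf.status == "cancelled":
                return f"blocker leaf {leaf_id} is cancelled"
    return None


def find_stop_points(root_id, by_id):
    """Depth-first walk returning (issue id, shape) for every stop point found."""
    found, stack, visited = [], [root_id], set()
    while stack:
        node_id = stack.pop()
        if node_id in visited or node_id not in by_id:
            continue
        visited.add(node_id)
        node = by_id[node_id]
        shape = stop_shape(node, by_id)
        if shape:
            found.append((node.id, shape))
        stack.extend(reversed(node.children))
    return found
```

The labels only point at candidates. Each one still needs quoted evidence (run ids, timestamps, status transitions) before it counts as the stop point.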
2. Survey recent related work
Before proposing a new product rule, read what already shipped this week in the same area. The user has explicitly called this out: (PAP-2602) "review our recent work on liveness that we shipped in the last couple of days." A new rule that contradicts code merged 48 hours ago is rework, not improvement.
Quick survey:
- Recent merged PRs in the affected area.
- Recent done issues whose title mentions liveness, recovery, productivity, continuation, or the affected subsystem.
- Any active plan documents on parent issues. The fix may belong as a revision to an existing plan, not as a new top-level proposal.
State in the forensics: "I reviewed X, Y, Z. The new gap is …"
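The done-issue part of the survey amounts to a keyword filter over a recent time window. The sketch below assumes issues are available as plain dictionaries with hypothetical `title`, `status`, and `completed_at` keys; the seven-day window is likewise an assumption standing in for "this week".

```python
from datetime import datetime, timedelta

KEYWORDS = ("liveness", "recovery", "productivity", "continuation")


def related_recent_work(issues, subsystem, now: datetime, window=timedelta(days=7)):
    """Return done issues from the window whose title mentions a keyword or the subsystem."""
    terms = KEYWORDS + (subsystem.lower(),)
    hits = []
    for issue in issues:
        if issue["status"] != "done":
            continue
        if now - issue["completed_at"] > window:
            continue
        title = issue["title"].lower()
        if any(term in title for term in terms):
            hits.append(issue)
    # Newest first, so the most recently shipped work is read first.
    return sorted(hits, key=lambda i: i["completed_at"], reverse=True)
```

A title match only nominates an issue for reading; the forensics should cite what was actually reviewed.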
3. Classify each non-progressing issue in the tree
For every issue in the affected tree that is not `done` / `cancelled` / actively running, decide:
- Truly needs human or board intervention: name the owner and the action.
- Agent-actionable but not currently routed: name the rule that would have routed it, and the agent that should have been woken.
- Already covered: point at the active run, queued wake, recovery issue, or pending interaction.
This is the table the user has asked for repeatedly (PAP-2335). Without it the plan is abstract.
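One way to keep the table honest is to derive each row from recorded state rather than from memory. The following is a sketch only: `IssueState` and every field on it are hypothetical, the `blocked` status in the usage is made up, and pointing at an active run on a related issue is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IssueState:
    # Hypothetical fields; the real issue schema may differ.
    id: str
    status: str
    has_active_run: bool = False
    queued_wake: Optional[str] = None          # id of a queued wake, if any
    recovery_issue: Optional[str] = None       # id of an open recovery issue, if any
    pending_interaction: Optional[str] = None  # id of a pending interaction, if any
    human_owner: Optional[str] = None          # set when only a human or the board can act
    human_action: Optional[str] = None
    missed_rule: Optional[str] = None          # rule that should have routed the issue
    agent: Optional[str] = None                # agent that should have been woken


def classify(issue):
    """Return (category, detail), or None for issues outside the table."""
    if issue.status in ("done", "cancelled") or issue.has_active_run:
        return None
    for label, ref in (("queued wake", issue.queued_wake),
                       ("recovery issue", issue.recovery_issue),
                       ("pending interaction", issue.pending_interaction)):
        if ref is not None:
            return ("already-covered", f"{label} {ref}")
    if issue.human_owner is not None:
        return ("human-needed", f"{issue.human_owner}: {issue.human_action}")
    return ("agent-actionable", f"rule: {issue.missed_rule}; wake: {issue.agent}")


def classification_table(issues):
    """Build the rows of the table: (issue id, category, detail)."""
    rows = []
    for issue in issues:
        result = classify(issue)
        if result is not None:
            rows.append((issue.id, *result))
    return rows
```

A row whose detail contains `None` is a signal that the owner, rule, or agent has not been named yet.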
4. Frame as a general product rule
The user does not want a one-off patch on the named tree. They want the rule. Three checks:
- The rule is stated as a contract, not as an if/else patch. Example contract: "every agent-owned non-terminal issue must finish each heartbeat with a terminal state, an explicit waiting path, or an explicit live path" (PAP-2674).
- The rule is reconciled against `doc/execution-semantics.md`. Prefer citing and applying the existing contract; propose a document change only when the current doc is incomplete or contradicted by accepted/implemented behavior.
- The rule explicitly preserves the three invariants above. Show the work.
If the rule would have blocked a recent productive run from succeeding, drop or narrow it.
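The last check can be made concrete by replaying the proposed rule against recent runs. In this sketch a rule is modeled as a predicate that returns `True` when it would have stopped a run; the run record shape (`id`, `productive`, `continuation_count`) and both example rules are hypothetical.

```python
def blocked_productive_runs(rule, recent_runs):
    """Return the productive runs that the proposed rule would have stopped."""
    return [run for run in recent_runs if run["productive"] and rule(run)]


def verdict(rule, recent_runs):
    """Recommend dropping or narrowing any rule that blocks productive work."""
    blocked = blocked_productive_runs(rule, recent_runs)
    if blocked:
        return ("drop or narrow", [run["id"] for run in blocked])
    return ("keep", [])


# Two illustrative candidate rules bounding continuation depth.
def too_tight(run):
    return run["continuation_count"] > 1


def looser(run):
    return run["continuation_count"] > 5
```

A "keep" verdict only shows the rule does not break the first invariant on the sampled runs; the other two invariants still need their own argument in the plan.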
5. Plan, do not code
Write the plan into the issue's `plan` document. Cover:
- Forensics summary (root cause + evidence).
- The general product rule, stated as a contract.
- Whether the existing `doc/execution-semantics.md` contract already covers the case, or what exact documentation update is needed.
- Phased subtasks: typically Phase 0 resolves the named live tree (carefully, not destructively), Phase 1 codifies the contract in docs, then implementation phases for detection, recovery, UI surfacing, security review, QA, and CTO review.
- Explicit assignees per phase; favor team specialty (CodexCoder for server, ClaudeCoder for FE, UXDesigner for visible state, SecurityEngineer for ownership/permissions, QA for validation).
- Blocking dependencies wired with `blockedByIssueIds`, parallel branches identified.
Do not create the child issues yet. Do not push code.
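Parallel branches fall out of the dependency wiring: phases whose blockers are all satisfied at the same point can run side by side. The sketch below layers a plan into such waves. Only the `blockedByIssueIds` field name comes from this document; the phase ids and the dictionary shape are hypothetical, and nothing here creates issues.

```python
def parallel_waves(phases):
    """Group phases into waves; phases in one wave have no unmet blockers."""
    remaining = {p["id"]: set(p["blockedByIssueIds"]) for p in phases}
    done, waves = set(), []
    while remaining:
        ready = sorted(pid for pid, blockers in remaining.items() if blockers <= done)
        if not ready:
            # Nothing can start: the wiring has a cycle or names an unknown blocker.
            raise ValueError(f"cycle or unknown blocker among: {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for pid in ready:
            del remaining[pid]
    return waves


# Illustrative plan; the ids are placeholders, not real issue ids.
PLAN = [
    {"id": "phase-0-hygiene", "blockedByIssueIds": []},
    {"id": "phase-1-docs", "blockedByIssueIds": ["phase-0-hygiene"]},
    {"id": "phase-2-detection", "blockedByIssueIds": ["phase-1-docs"]},
    {"id": "phase-2-ui", "blockedByIssueIds": ["phase-1-docs"]},
    {"id": "phase-3-qa", "blockedByIssueIds": ["phase-2-detection", "phase-2-ui"]},
]
```

Running `parallel_waves(PLAN)` places `phase-2-detection` and `phase-2-ui` in the same wave, which is the parallel branch to call out in the plan. The `ValueError` path catches a plan that would block itself.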
6. Request approval, then decompose
- Open a `request_confirmation` interaction targeting the latest plan revision. Idempotency key `confirmation:{issueId}:plan:{revisionId}`.
- Wait for board/CTO acceptance. If the user posts a new comment that supersedes the plan, the prior confirmation is invalidated; open a fresh confirmation tied to the new revision (PAP-2602 cycled three revisions; that is fine).
- Only after acceptance: create the phased child issues with the right assignees and dependencies, then block this parent on the final QA / CTO review issue so the parent only wakes when the chain finishes.
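The idempotency key makes a repeated request for the same revision a no-op, while a new revision supersedes the old confirmation. A minimal sketch of that behavior follows; the key format comes from this document, but `ConfirmationTracker` is a hypothetical in-memory stand-in and not the real interaction API.

```python
def confirmation_key(issue_id, revision_id):
    """Build the idempotency key for a plan confirmation."""
    return f"confirmation:{issue_id}:plan:{revision_id}"


class ConfirmationTracker:
    """In-memory stand-in for the interaction store (hypothetical)."""

    def __init__(self):
        self._open = {}        # issue id -> key of the currently valid confirmation
        self.invalidated = []  # keys superseded by a newer plan revision

    def request(self, issue_id, revision_id):
        """Return (key, created). created is False when the request is a repeat."""
        key = confirmation_key(issue_id, revision_id)
        current = self._open.get(issue_id)
        if current == key:
            return key, False  # same revision: idempotent no-op
        if current is not None:
            self.invalidated.append(current)  # superseded by the new revision
        self._open[issue_id] = key
        return key, True
```

Keying on the revision id, not just the issue id, is what lets a superseding plan open a fresh confirmation without colliding with the stale one.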
7. Phase 0 hygiene on the named tree
Phase 0 cleans up the live tree without papering over evidence:
- Move stalled `in_review` leaves with no participant to `todo` with a precise next action and named owner (PAP-2335).
- Detach cancelled/dead blockers from chains they were holding hostage; do not silently mark issues `done` to clear backlog.
- Leave a comment on the original named issue summarizing what changed and why; never hide the recovery chain history.
8. Final close-out
When the phase chain is complete, post a board-level summary comment on the parent issue: what changed, what the new contract is, what the rollout step is (e.g. "restart the control-plane to pick up the new response shape"), and the live state of the originally-named tree. Then close the parent.
Pitfalls
- Coding before approval. The user has said "make a plan first" on every recent diagnostic issue. Producing code in the forensic phase wastes the round-trip.
- Restating one invariant at the cost of another. Bound continuation too tightly and productive work stalls; loosen recovery and infinite loops return. Always check all three.
- Skipping the recent-work survey. Proposing a contract that contradicts what shipped 24 hours ago is the easiest way to get the plan rejected.
- Letting "in_review" mean done. A leaf assigned to another agent with no participant or active run is not progress; treat it as a stop.
- Bypassing company scoping. Cross-company forensics needs a board-approved diagnostic path, not a database read.
- Recursive recovery. Stranded-work recovery that recovers its own recovery issues is the canonical infinite loop (PAP-2486). Detect it and refuse to deepen.
- Hiding the chain. Don't silently delete or hide the symptomatic recovery issues — the operator needs the audit trail.
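The recursive-recovery pitfall can be guarded with a depth check before any new recovery issue is opened. This sketch assumes a hypothetical `recovers` mapping from each recovery issue id to the issue it recovers, and a hypothetical `MAX_RECOVERY_DEPTH` of 1, meaning only original source work may be recovered.

```python
MAX_RECOVERY_DEPTH = 1  # hypothetical bound: recover source work only


def recovery_depth(issue_id, recovers):
    """Count the recovery hops between this issue and the original source work."""
    depth, seen = 0, set()
    while issue_id in recovers:
        if issue_id in seen:
            raise ValueError("recovery chain contains a cycle")
        seen.add(issue_id)
        issue_id = recovers[issue_id]
        depth += 1
    return depth


def may_open_recovery(source_id, recovers):
    """Refuse to deepen: a recovery issue is never itself recoverable source work."""
    return recovery_depth(source_id, recovers) < MAX_RECOVERY_DEPTH
```

The guard refuses silently creating a deeper chain, but the existing recovery issues stay visible so the audit trail survives.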
Verification checklist (before posting the plan)
- The exact stop point in the named tree is identified with run ids / comment ids.
- Recent shipped work in the same area was surveyed and is referenced.
- Every non-progressing issue is classified human-needed / agent-actionable / already-covered.
- The proposed rule is stated as a contract, not a patch.
- All three invariants are explicitly preserved.
- No code change has landed in this heartbeat.
- A `request_confirmation` against the latest plan revision is open.
- Phase 0 of the plan addresses the live named tree without destroying evidence.
- Implementation phases name specialty-appropriate assignees and `blockedByIssueIds` dependencies.