diagnose-why-work-stopped

Diagnose Why Work Stopped

A repeatable procedure for the recurring class of issues where the user (or a manager) points at a stalled / looping / over-recovered issue tree and asks "why did this stop / why is this looping / how do we make sure this doesn't happen again?"

This skill is diagnostic + product-design, not engineering. The output is a written root cause and an approved plan. No code changes leave this skill.

Canonical execution model: read `doc/execution-semantics.md` before diagnosing or proposing a new liveness/recovery rule. Use that document as the source of truth for status, action-path, post-run disposition, bounded continuation, productivity review, pause-hold, watchdog, and explicit recovery semantics. If the investigation finds a true product-rule gap, the plan should say whether `doc/execution-semantics.md` needs a matching update.

When to use

Trigger on an assignment whose title or body matches any of:

"why did this work stop", "why did this stall", "why did this just stop"
"infinite loop", "looping", "spinning", "going too deep", "recovery went too deep"
"liveness — what happened here", "this tree stopped working", "stuck"
"approach it from a product perspective", "general product principle / rule"
An attached link to a specific stalled / looping / over-recovered issue tree

Also use when the user asks for forensics, root cause, or a write-up before any product change.

When NOT to use

The assignment asks you to ship a code change directly. Use normal engineering flow.
The assignment is a normal bug report against a specific feature. Use normal investigation.
You are the original implementer being asked to fix your own bug. Use normal debugging.

Three invariants you must preserve

Every diagnosis and every proposed rule must hold these three invariants together. The user has restated them on at least four issues; treat them as load-bearing:

Productive work continues. Agents that have a clear next action must keep working without needing the user to wake them. (PAP-2674, PAP-2708)
Only real blockers stop work. Stops happen when something genuinely cannot proceed (missing approval, missing dependency, human owner). Pseudo-stops (in_review with no action path, cancelled leaves, malformed metadata) must be detected and routed, not left silent. (PAP-2335, PAP-2674)
No infinite loops. Stranded-work recovery and continuation loops must be bounded and distinguishable from genuinely productive continuation. (PAP-2602, PAP-2486)

If a proposed rule violates any of the three, drop it or rework it. State explicitly in the plan how each invariant is held.

Procedure

0. Read the current execution contract

Before walking the tree, read `doc/execution-semantics.md` and keep its terms intact:

live path / waiting path / recovery path
post-run disposition: terminal, explicitly live, explicitly waiting, invalid
bounded `run_liveness_continuation`
productivity review vs liveness recovery
active subtree pause holds
silent active-run watchdog

Do not invent a new rule until you can state how it differs from the current execution semantics document.

1. Forensics on the named tree — before anything else

Do this in the same heartbeat. Do not propose a rule until you have a concrete stop point.

Open the linked issue (and its blocker chain, parents, recovery siblings, recent runs).
Walk the tree node-by-node and find the exact issue + state combination that stops the world. Common shapes seen in the company so far:
- `in_review` with no typed execution participant, no active run, no pending interaction, no recovery issue (PAP-2335, PAP-2674).
- `in_progress` after a successful run with no future action path queued (PAP-2674).
- Blocker chain whose leaf is `cancelled` / malformed / cross-company-inaccessible (PAP-2602).
- `issue.continuation_recovery` waking the same issue >N times after successful runs (PAP-2602).
- Stranded-work recovery treating its own recovery issues as more recoverable source work (PAP-2486).
Quote the evidence: run ids, comment timestamps, status transitions. "Inferred" is acceptable only when an API boundary blocks direct evidence — say so explicitly and mark the claim provisional (PAP-2631).

Respect the API boundary. If the linked issue is in another company and your agent token returns 403, do not bypass scoping. Either request a board-approved diagnostic path or proceed from inferred PAP-side evidence and label it.
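
The common stop shapes listed in this step can be written down as predicates over a snapshot of one issue, which makes it easier to name the exact issue + state combination in the forensics. The following is a minimal sketch, not the product's implementation: the `IssueSnapshot` record, all of its field names, the returned labels, and the `max_recovery_wakes` bound (standing in for "N") are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IssueSnapshot:
    """Hypothetical, simplified view of one issue; every field name is an assumption."""
    status: str                                # e.g. "in_review", "in_progress", "blocked"
    has_participant: bool = False              # typed execution participant present
    has_active_run: bool = False
    has_pending_interaction: bool = False
    has_recovery_issue: bool = False
    last_run_succeeded: bool = False
    has_queued_action: bool = False            # a future action path is queued
    blocker_leaf_state: Optional[str] = None   # state of the blocker chain's leaf, if any
    recovery_wakes_after_success: int = 0      # continuation-recovery wakes after good runs
    recovers_recovery_issue: bool = False      # recovery work whose source is a recovery issue


def stop_shape(issue: IssueSnapshot, max_recovery_wakes: int = 3) -> Optional[str]:
    """Return a label for the first matching stop shape, or None if none match."""
    if issue.recovers_recovery_issue:
        return "recursive-recovery"
    if issue.recovery_wakes_after_success > max_recovery_wakes:
        return "continuation-recovery-loop"
    if issue.blocker_leaf_state in {"cancelled", "malformed", "inaccessible"}:
        return "dead-blocker-leaf"
    covered = (issue.has_participant or issue.has_active_run
               or issue.has_pending_interaction or issue.has_recovery_issue)
    if issue.status == "in_review" and not covered:
        return "in-review-no-action-path"
    if (issue.status == "in_progress" and issue.last_run_succeeded
            and not issue.has_queued_action and not issue.has_active_run):
        return "in-progress-no-next-action"
    return None
```

A label from a helper like this is only a pointer into the tree; the write-up still needs the quoted run ids, timestamps, and status transitions behind it.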

2. Survey recent related work

Before proposing a new product rule, read what already shipped this week in the same area. The user has explicitly called this out: (PAP-2602) "review our recent work on liveness that we shipped in the last couple of days." A new rule that contradicts code merged 48 hours ago is rework, not improvement.

Quick survey:

Recent merged PRs in the affected area.
Recent done issues whose title mentions liveness, recovery, productivity, continuation, or the affected subsystem.
Any active plan documents on parent issues. The fix may belong as a revision to an existing plan, not as a new top-level proposal.

State in the forensics: "I reviewed X, Y, Z. The new gap is …"

3. Classify each non-progressing issue in the tree

For every issue in the affected tree that is not `done` / `cancelled` / actively running, decide:

Truly needs human or board intervention — name the owner and the action.
Agent-actionable but not currently routed — name the rule that would have routed it, and the agent that should have been woken.
Already covered — point at the active run, queued wake, recovery issue, or pending interaction.

This is the table the user has asked for repeatedly (PAP-2335). Without it the plan is abstract.
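
The three-way classification can be sketched as a small function that produces the rows of that table. This is an illustration under stated assumptions: the `Node` record, its field names, and the check order (existing coverage first, then human need, then agent-actionable by default) are hypothetical and not taken from the product.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Bucket(Enum):
    HUMAN_NEEDED = "human-needed"
    AGENT_ACTIONABLE = "agent-actionable"
    ALREADY_COVERED = "already-covered"


@dataclass
class Node:
    """Hypothetical issue record; every field name is an assumption."""
    issue_id: str
    status: str
    actively_running: bool = False
    needs_human: bool = False             # e.g. missing approval or a human owner
    has_queued_wake: bool = False
    has_recovery_issue: bool = False
    has_pending_interaction: bool = False


def classify(node: Node) -> Optional[Bucket]:
    """Classify one issue; None means it is out of scope for the table."""
    if node.status in ("done", "cancelled") or node.actively_running:
        return None
    if node.has_queued_wake or node.has_recovery_issue or node.has_pending_interaction:
        return Bucket.ALREADY_COVERED
    if node.needs_human:
        return Bucket.HUMAN_NEEDED
    return Bucket.AGENT_ACTIONABLE


def classify_tree(nodes: List[Node]) -> List[Tuple[str, str]]:
    """Return (issue id, bucket) rows for every non-progressing issue."""
    rows = []
    for node in nodes:
        bucket = classify(node)
        if bucket is not None:
            rows.append((node.issue_id, bucket.value))
    return rows
```

Each row still needs the detail the step asks for: the owner and action for human-needed issues, the missing rule and agent for agent-actionable ones, and a pointer to the covering run, wake, recovery issue, or interaction for the rest.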

4. Frame as a general product rule

The user does not want a one-off patch on the named tree. They want the rule. Three checks:

The rule is stated as a contract, not as an if/else patch. Example contract: "every agent-owned non-terminal issue must finish each heartbeat with a terminal state, an explicit waiting path, or an explicit live path" (PAP-2674).
The rule is reconciled against `doc/execution-semantics.md`. Prefer citing and applying the existing contract; propose a document change only when the current doc is incomplete or contradicted by accepted/implemented behavior.
The rule explicitly preserves the three invariants above. Show the work.

If the rule would have blocked a recent productive run from succeeding, drop or narrow it.
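
To show the difference between a contract and an if/else patch, the example contract from PAP-2674 can be phrased as one check applied uniformly to every agent-owned issue at the end of a heartbeat. The sketch below is an assumption-laden illustration: the four disposition names come from the execution-semantics terms listed in step 0, but the `TERMINAL_STATUSES` set, the path counters, and the function names are hypothetical.

```python
from enum import Enum
from typing import Iterable, List, Tuple


class Disposition(Enum):
    TERMINAL = "terminal"
    LIVE = "explicitly live"
    WAITING = "explicitly waiting"
    INVALID = "invalid"


# Assumption: which statuses count as terminal is defined by the real contract doc.
TERMINAL_STATUSES = frozenset({"done", "cancelled"})


def post_run_disposition(status: str, live_paths: int, waiting_paths: int) -> Disposition:
    """Classify an issue after a run from its status and its queued paths."""
    if status in TERMINAL_STATUSES:
        return Disposition.TERMINAL
    if live_paths > 0:
        return Disposition.LIVE
    if waiting_paths > 0:
        return Disposition.WAITING
    return Disposition.INVALID


def contract_violations(issues: Iterable[Tuple[str, str, int, int]]) -> List[str]:
    """Return ids of issues that ended the heartbeat with no terminal, live, or waiting path.

    Each input tuple is (issue_id, status, live_paths, waiting_paths).
    """
    return [
        issue_id
        for issue_id, status, live, waiting in issues
        if post_run_disposition(status, live, waiting) is Disposition.INVALID
    ]
```

Because the check does not mention any specific tree or status pair, it reads as a rule that holds everywhere rather than a special case for the named issue.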

5. Plan, do not code

Write the plan into the issue's `plan` document. Cover:

Forensics summary (root cause + evidence).
The general product rule, stated as a contract.
Whether the existing `doc/execution-semantics.md` contract already covers the case, or what exact documentation update is needed.
Phased subtasks: typically `Phase 0` resolves the named live tree (carefully, not destructively), `Phase 1` codifies the contract in docs, then implementation phases for detection, recovery, UI surfacing, security review, QA, and CTO review.
Explicit assignees per phase; favor team specialty (CodexCoder for server, ClaudeCoder for FE, UXDesigner for visible state, SecurityEngineer for ownership/permissions, QA for validation).
Blocking dependencies wired with `blockedByIssueIds`, parallel branches identified.

Do not create the child issues yet. Do not push code.
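
One way to identify parallel branches while drafting the plan is to layer the phases by their blockers. The sketch below assumes a plain mapping from phase name to the list of phases that block it (the role `blockedByIssueIds` plays); the phase names and the dependency edges in `example_plan` are invented for illustration and are not a recommended ordering.

```python
from typing import Dict, List


def parallel_waves(blocked_by: Dict[str, List[str]]) -> List[List[str]]:
    """Group phases into waves; phases in the same wave can run in parallel.

    Raises ValueError if the blockers contain a cycle or reference an unknown phase.
    """
    remaining = {phase: set(deps) for phase, deps in blocked_by.items()}
    finished = set()
    waves = []
    while remaining:
        ready = sorted(p for p, deps in remaining.items() if deps <= finished)
        if not ready:
            raise ValueError(f"cycle or unknown blocker among: {sorted(remaining)}")
        waves.append(ready)
        finished.update(ready)
        for phase in ready:
            del remaining[phase]
    return waves


# Hypothetical plan: names and edges are illustrative only.
example_plan = {
    "phase-0-live-tree": [],
    "phase-1-docs": [],
    "detection": ["phase-1-docs"],
    "recovery": ["phase-1-docs"],
    "ui-surfacing": ["detection"],
    "qa": ["recovery", "ui-surfacing"],
    "cto-review": ["qa"],
}
```

For `example_plan`, the first wave holds the two phases with no blockers, the second holds `detection` and `recovery` side by side, and the final wave is `cto-review`, which is the issue the parent would later be blocked on.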

6. Request approval, then decompose

Open a `request_confirmation` interaction targeting the latest plan revision. Idempotency key `confirmation:{issueId}:plan:{revisionId}`.
Wait for board/CTO acceptance. If the user posts a new comment that supersedes the plan, the prior confirmation is invalidated — open a fresh confirmation tied to the new revision (PAP-2602 cycled three revisions; that is fine).
Only after acceptance: create the phased child issues with the right assignees and dependencies, then block this parent on the final QA / CTO review issue so the parent only wakes when the chain finishes.
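
Because the idempotency key embeds the revision id, a confirmation opened against an older revision can be recognized as stale by rebuilding the key for the latest revision and comparing. The key format below is the one given in this step; the helper names and the sample ids are hypothetical.

```python
def confirmation_key(issue_id: str, revision_id: str) -> str:
    """Build the idempotency key for a plan confirmation."""
    return f"confirmation:{issue_id}:plan:{revision_id}"


def confirmation_is_current(key: str, issue_id: str, latest_revision_id: str) -> bool:
    """True only if the key was built for the latest plan revision of this issue."""
    return key == confirmation_key(issue_id, latest_revision_id)


# Hypothetical ids for illustration.
key = confirmation_key("ISSUE-123", "rev-2")
```

Here `key` is `confirmation:ISSUE-123:plan:rev-2`; once a `rev-3` exists, `confirmation_is_current(key, "ISSUE-123", "rev-3")` is false, which is the signal to open a fresh confirmation.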

7. Phase 0 hygiene on the named tree

Phase 0 cleans up the live tree without papering over evidence:

Move stalled `in_review` leaves with no participant to `todo` with a precise next action and named owner (PAP-2335).
Detach cancelled/dead blockers from chains they were holding hostage; do not silently mark issues `done` to clear backlog.
Leave a comment on the original named issue summarizing what changed and why; never hide the recovery chain history.

8. Final close-out

When the phase chain is complete, post a board-level summary comment on the parent issue: what changed, what the new contract is, what the rollout step is (e.g. "restart the control-plane to pick up the new response shape"), and the live state of the originally-named tree. Then close the parent.

Pitfalls

Coding before approval. The user has said "make a plan first" on every recent diagnostic issue. Producing code in the forensic phase wastes the round-trip.
Satisfying one invariant at the cost of another. Bound continuation too tightly and productive work stalls; loosen recovery and infinite loops return. Always check all three.
Skipping the recent-work survey. Proposing a contract that contradicts what shipped 24 hours ago is the easiest way to get the plan rejected.
Letting "in_review" mean done. A leaf assigned to another agent with no participant or active run is not progress; treat it as a stop.
Bypassing company scoping. Cross-company forensics needs a board-approved diagnostic path, not a database read.
Recursive recovery. Stranded-work recovery that recovers its own recovery issues is the canonical infinite loop (PAP-2486). Detect it and refuse to deepen.
Hiding the chain. Don't silently delete or hide the symptomatic recovery issues — the operator needs the audit trail.
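
The "detect it and refuse to deepen" advice for recursive recovery can be illustrated with a depth check over the recovery chain. This is a sketch under assumptions, not the shipped rule: the `recovers` mapping (recovery issue id to the issue it recovers), the function names, and the default `max_depth` of 1 are all hypothetical.

```python
from typing import Dict


def recovery_depth(issue_id: str, recovers: Dict[str, str]) -> int:
    """Count recovery hops from issue_id down to the original source work.

    `recovers` maps a recovery issue id to the id of the issue it is recovering.
    Raises ValueError if the chain loops back on itself.
    """
    depth = 0
    seen = {issue_id}
    current = issue_id
    while current in recovers:
        current = recovers[current]
        if current in seen:
            raise ValueError(f"recovery chain cycles at {current}")
        seen.add(current)
        depth += 1
    return depth


def may_open_recovery(source_id: str, recovers: Dict[str, str], max_depth: int = 1) -> bool:
    """Allow a new recovery issue only if the source is fewer than max_depth hops deep.

    With max_depth=1, original work may be recovered, but a recovery issue may not
    itself become the source of another recovery issue.
    """
    return recovery_depth(source_id, recovers) < max_depth
```

A refusal from a guard like this should still surface as a visible state on the issue, so that the bound does not turn into a silent stop and the existing chain stays available as an audit trail.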

Verification checklist (before posting the plan)

The exact stop point in the named tree is identified with run ids / comment ids.
Recent shipped work in the same area was surveyed and is referenced.
Every non-progressing issue is classified human-needed / agent-actionable / already-covered.
The proposed rule is stated as a contract, not a patch.
All three invariants are explicitly preserved.
No code change has landed in this heartbeat.
A `request_confirmation` against the latest plan revision is open.
Phase 0 of the plan addresses the live named tree without destroying evidence.
Implementation phases name specialty-appropriate assignees and `blockedByIssueIds` dependencies.
