diagnose-why-work-stopped

Diagnose Why Work Stopped

A repeatable procedure for the recurring class of issues where the user (or a manager) points at a stalled / looping / over-recovered issue tree and asks "why did this stop / why is this looping / how do we make sure this doesn't happen again?"
This skill is diagnostic + product-design, not engineering. The output is a written root cause and an approved plan. No code changes leave this skill.
Canonical execution model: read doc/execution-semantics.md before diagnosing or proposing a new liveness/recovery rule. Use that document as the source of truth for status, action-path, post-run disposition, bounded continuation, productivity review, pause-hold, watchdog, and explicit recovery semantics. If the investigation finds a true product-rule gap, the plan should say whether doc/execution-semantics.md needs a matching update.

When to use

Trigger on an assignment whose title or body matches any of:
  • "why did this work stop", "why did this stall", "why did this just stop"
  • "infinite loop", "looping", "spinning", "going too deep", "recovery went too deep"
  • "liveness — what happened here", "this tree stopped working", "stuck"
  • "approach it from a product perspective", "general product principle / rule"
  • An attached link to a specific stalled / looping / over-recovered issue tree
Also use when the user asks for forensics, root cause, or a write-up before any product change.

When NOT to use

  • The assignment asks you to ship a code change directly. Use normal engineering flow.
  • The assignment is a normal bug report against a specific feature. Use normal investigation.
  • You are the original implementer being asked to fix your own bug. Use normal debugging.

Three invariants you must preserve

Every diagnosis and every proposed rule must hold these three invariants together. The user has restated them on at least four issues; treat them as load-bearing:
  1. Productive work continues. Agents that have a clear next action must keep working without needing the user to wake them. (PAP-2674, PAP-2708)
  2. Only real blockers stop work. Stops happen when something genuinely cannot proceed (missing approval, missing dependency, human owner). Pseudo-stops (in_review with no action path, cancelled leaves, malformed metadata) must be detected and routed, not left silent. (PAP-2335, PAP-2674)
  3. No infinite loops. Stranded-work recovery and continuation loops must be bounded and distinguishable from genuinely productive continuation. (PAP-2602, PAP-2486)
If a proposed rule violates any of the three, drop it or rework it. State explicitly in the plan how each invariant is held.

Procedure

0. Read the current execution contract

Before walking the tree, read doc/execution-semantics.md and keep its terms intact:
  • live path / waiting path / recovery path
  • post-run disposition: terminal, explicitly live, explicitly waiting, invalid
  • bounded run_liveness_continuation
  • productivity review vs liveness recovery
  • active subtree pause holds
  • silent active-run watchdog
Do not invent a new rule until you can state how it differs from the current execution semantics document.

1. Forensics on the named tree — before anything else

Do this in the same heartbeat. Do not propose a rule until you have a concrete stop point.
  • Open the linked issue (and its blocker chain, parents, recovery siblings, recent runs).
  • Walk the tree node-by-node and find the exact issue + state combination that stops the world. Common shapes seen in the company so far:
    • in_review with no typed execution participant, no active run, no pending interaction, no recovery issue (PAP-2335, PAP-2674).
    • in_progress after a successful run with no future action path queued (PAP-2674).
    • Blocker chain whose leaf is cancelled / malformed / cross-company-inaccessible (PAP-2602).
    • issue.continuation_recovery waking the same issue >N times after successful runs (PAP-2602).
    • Stranded-work recovery treating its own recovery issues as more recoverable source work (PAP-2486).
  • Quote the evidence: run ids, comment timestamps, status transitions. "Inferred" is acceptable only when an API boundary blocks direct evidence — say so explicitly and mark the claim provisional (PAP-2631).
Respect the API boundary. If the linked issue is in another company and your agent token returns 403, do not bypass scoping. Either request a board-approved diagnostic path or proceed from inferred PAP-side evidence and label it.
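The first three stop shapes listed above can be sketched as a small check over a flattened issue snapshot. This is a minimal illustration only: IssueSnapshot, its field names, and the "malformed" / "inaccessible" leaf labels are hypothetical and are not taken from doc/execution-semantics.md or any real API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IssueSnapshot:
    """Hypothetical flattened view of one issue; all field names are illustrative."""
    status: str
    has_participant: bool = False
    has_active_run: bool = False
    has_pending_interaction: bool = False
    has_recovery_issue: bool = False
    has_queued_action: bool = False
    last_run_succeeded: bool = False
    blocker_leaf_states: List[str] = field(default_factory=list)


# Assumed labels for blocker leaves that can never unblock their chain.
DEAD_LEAF_STATES = {"cancelled", "malformed", "inaccessible"}


def stop_shape(issue: IssueSnapshot) -> Optional[str]:
    """Return a label for the first matching stop shape, or None if none match."""
    covered = (
        issue.has_participant
        or issue.has_active_run
        or issue.has_pending_interaction
        or issue.has_recovery_issue
    )
    if issue.status == "in_review" and not covered:
        return "in_review with no action path"
    if (
        issue.status == "in_progress"
        and issue.last_run_succeeded
        and not issue.has_queued_action
        and not issue.has_active_run
    ):
        return "in_progress with no future action path"
    if any(state in DEAD_LEAF_STATES for state in issue.blocker_leaf_states):
        return "blocker chain ends in a dead leaf"
    return None
```

A match is only a pointer to where to look; the written forensics still need the run ids, comment timestamps, and status transitions as evidence.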

2. Survey recent related work

Before proposing a new product rule, read what already shipped this week in the same area. The user has explicitly called this out: (PAP-2602) "review our recent work on liveness that we shipped in the last couple of days." A new rule that contradicts code merged 48 hours ago is rework, not improvement.
Quick survey:
  • Recent merged PRs in the affected area.
  • Recent done issues whose title mentions liveness, recovery, productivity, continuation, or the affected subsystem.
  • Any active plan documents on parent issues. The fix may belong as a revision to an existing plan, not as a new top-level proposal.
State in the forensics: "I reviewed X, Y, Z. The new gap is …"

3. Classify each non-progressing issue in the tree

For every issue in the affected tree that is not done / cancelled / actively running, decide:
  • Truly needs human or board intervention — name the owner and the action.
  • Agent-actionable but not currently routed — name the rule that would have routed it, and the agent that should have been woken.
  • Already covered — point at the active run, queued wake, recovery issue, or pending interaction.
This is the table the user has asked for repeatedly (PAP-2335). Without it the plan is abstract.
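One way to build that table is a classification pass over the tree. The sketch below assumes a hypothetical Node record whose field names are illustrative; the precedence (coverage evidence first, then human ownership, then agent-actionable) is also an assumption, not a documented rule.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical per-issue record; field names are illustrative."""
    issue_id: str
    status: str
    active_run_id: Optional[str] = None
    queued_wake: bool = False
    recovery_issue_id: Optional[str] = None
    pending_interaction_id: Optional[str] = None
    human_owner: Optional[str] = None  # set when only a human or the board can act


def classify(node: Node) -> Optional[str]:
    """Return one of the three buckets, or None for done/cancelled issues."""
    if node.status in ("done", "cancelled"):
        return None
    if (
        node.active_run_id
        or node.queued_wake
        or node.recovery_issue_id
        or node.pending_interaction_id
    ):
        return "already-covered"
    if node.human_owner:
        return "human-needed"
    return "agent-actionable"


def classification_table(nodes: List[Node]) -> List[Tuple[str, str]]:
    """Rows of (issue_id, bucket) for every issue that needs a decision."""
    rows = []
    for node in nodes:
        bucket = classify(node)
        if bucket is not None:
            rows.append((node.issue_id, bucket))
    return rows
```

Each row still needs the human-written part: the owner and action for human-needed issues, the missing routing rule and agent for agent-actionable ones, and the covering run, wake, recovery issue, or interaction for the rest.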

4. Frame as a general product rule

The user does not want a one-off patch on the named tree. They want the rule. Three checks:
  • The rule is stated as a contract, not as an if/else patch. Example contract: "every agent-owned non-terminal issue must finish each heartbeat with a terminal state, an explicit waiting path, or an explicit live path" (PAP-2674).
  • The rule is reconciled against doc/execution-semantics.md. Prefer citing and applying the existing contract; propose a document change only when the current doc is incomplete or contradicted by accepted/implemented behavior.
  • The rule explicitly preserves the three invariants above. Show the work.
If the rule would have blocked a recent productive run from succeeding, drop or narrow it.
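To show the difference between a contract and an if/else patch, the example contract from PAP-2674 can be written as a single predicate evaluated at the end of every heartbeat. This is a sketch under assumptions: the disposition names follow the terms listed in step 0, while the terminal status set, the function names, and the boolean inputs are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: these are the terminal statuses; confirm against doc/execution-semantics.md.
TERMINAL_STATUSES = {"done", "cancelled"}


def post_run_disposition(status: str, has_live_path: bool, has_waiting_path: bool) -> str:
    """Map an agent-owned issue to one of the four post-run dispositions."""
    if status in TERMINAL_STATUSES:
        return "terminal"
    if has_live_path:
        return "explicitly live"
    if has_waiting_path:
        return "explicitly waiting"
    return "invalid"


def contract_violations(issues: Dict[str, Tuple[str, bool, bool]]) -> List[str]:
    """Return ids of issues that ended the heartbeat in an invalid disposition.

    `issues` maps issue id -> (status, has_live_path, has_waiting_path).
    """
    return sorted(
        issue_id
        for issue_id, (status, live, waiting) in issues.items()
        if post_run_disposition(status, live, waiting) == "invalid"
    )
```

The predicate does not mention any specific tree or status transition, which is what makes it a contract: any issue that lands in "invalid" is a gap to detect and route, whatever path led there.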

5. Plan, do not code

Write the plan into the issue's plan document. Cover:
  • Forensics summary (root cause + evidence).
  • The general product rule, stated as a contract.
  • Whether the existing doc/execution-semantics.md contract already covers the case, or what exact documentation update is needed.
  • Phased subtasks: typically Phase 0 resolves the named live tree (carefully, not destructively), Phase 1 codifies the contract in docs, then implementation phases for detection, recovery, UI surfacing, security review, QA, and CTO review.
  • Explicit assignees per phase; favor team specialty (CodexCoder for server, ClaudeCoder for FE, UXDesigner for visible state, SecurityEngineer for ownership/permissions, QA for validation).
  • Blocking dependencies wired with blockedByIssueIds, parallel branches identified.
Do not create the child issues yet. Do not push code.
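When drafting the plan, parallel branches can be read off the dependency map before any child issue exists. The sketch below groups planned phases into layers that can run side by side; the phase names are placeholders, and the dictionary shape (phase -> list of blocking phases) is an assumption that mirrors how blockedByIssueIds would later be wired.

```python
from typing import Dict, List


def parallel_layers(blocked_by: Dict[str, List[str]]) -> List[List[str]]:
    """Group phases into layers; every phase in a layer can run in parallel.

    Raises ValueError on a cycle or on a blocker that is not itself a phase.
    """
    remaining = {phase: set(deps) for phase, deps in blocked_by.items()}
    finished = set()
    layers: List[List[str]] = []
    while remaining:
        # A phase is ready once all of its blockers are in earlier layers.
        ready = sorted(p for p, deps in remaining.items() if deps <= finished)
        if not ready:
            raise ValueError(
                "dependency cycle or unknown blocker among: "
                + ", ".join(sorted(remaining))
            )
        layers.append(ready)
        finished.update(ready)
        for phase in ready:
            del remaining[phase]
    return layers


if __name__ == "__main__":
    plan = {
        "phase0-tree-hygiene": [],
        "phase1-docs-contract": ["phase0-tree-hygiene"],
        "detection": ["phase1-docs-contract"],
        "ui-surfacing": ["phase1-docs-contract"],
        "qa": ["detection", "ui-surfacing"],
    }
    for index, layer in enumerate(parallel_layers(plan)):
        print(index, layer)
```

A layer with more than one entry is a parallel branch worth naming in the plan; a ValueError means the proposed dependencies would themselves create a stall.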

6. Request approval, then decompose

  • Open a request_confirmation interaction targeting the latest plan revision. Idempotency key: confirmation:{issueId}:plan:{revisionId}.
  • Wait for board/CTO acceptance. If the user posts a new comment that supersedes the plan, the prior confirmation is invalidated — open a fresh confirmation tied to the new revision (PAP-2602 cycled three revisions; that is fine).
  • Only after acceptance: create the phased child issues with the right assignees and dependencies, then block this parent on the final QA / CTO review issue so the parent only wakes when the chain finishes.
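The idempotency key format above ties each confirmation to one plan revision, so a superseded plan can be detected by comparing keys. A minimal sketch follows; the helper names and the example ids are hypothetical, and only the key format comes from this document.

```python
from typing import Optional


def confirmation_key(issue_id: str, revision_id: str) -> str:
    """Build the idempotency key in the format confirmation:{issueId}:plan:{revisionId}."""
    return f"confirmation:{issue_id}:plan:{revision_id}"


def needs_fresh_confirmation(
    open_key: Optional[str], issue_id: str, latest_revision_id: str
) -> bool:
    """True when no confirmation is open, or the open one targets an older revision."""
    return open_key != confirmation_key(issue_id, latest_revision_id)
```

Because the revision id is part of the key, retrying the request for the same revision reuses the same key, while a new revision always produces a new one.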

7. Phase 0 hygiene on the named tree

Phase 0 cleans up the live tree without papering over evidence:
  • Move stalled in_review leaves with no participant to todo with a precise next action and named owner (PAP-2335).
  • Detach cancelled/dead blockers from chains they were holding hostage; do not silently mark issues done to clear backlog.
  • Leave a comment on the original named issue summarizing what changed and why; never hide the recovery chain history.

8. Final close-out

When the phase chain is complete, post a board-level summary comment on the parent issue: what changed, what the new contract is, what the rollout step is (e.g. "restart the control-plane to pick up the new response shape"), and the live state of the originally-named tree. Then close the parent.

Pitfalls

  • Coding before approval. The user has said "make a plan first" on every recent diagnostic issue. Producing code in the forensic phase wastes the round-trip.
  • Satisfying one invariant at the cost of another. Bound continuation too tightly and productive work stalls; loosen recovery and infinite loops return. Always check all three.
  • Skipping the recent-work survey. Proposing a contract that contradicts what shipped 24 hours ago is the easiest way to get the plan rejected.
  • Letting "in_review" mean done. A leaf assigned to another agent with no participant or active run is not progress; treat it as a stop.
  • Bypassing company scoping. Cross-company forensics needs a board-approved diagnostic path, not a database read.
  • Recursive recovery. Stranded-work recovery that recovers its own recovery issues is the canonical infinite loop (PAP-2486). Detect it and refuse to deepen.
  • Hiding the chain. Don't silently delete or hide the symptomatic recovery issues — the operator needs the audit trail.
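The recursive-recovery and bounded-continuation pitfalls can be illustrated with a guard that runs before any recovery is scheduled. This is a sketch, not the shipped behavior: RecoveryCandidate, its fields, and the default bound of 3 are hypothetical (the source only says ">N times"), and the real limits belong in doc/execution-semantics.md.

```python
from dataclasses import dataclass
from typing import Tuple

# Illustrative bound only; the source does not state the value of N.
MAX_CONTINUATION_WAKES = 3


@dataclass
class RecoveryCandidate:
    """Hypothetical record for an issue being considered for recovery."""
    issue_id: str
    is_recovery_issue: bool
    continuation_wakes: int  # wakes already issued after successful runs


def may_recover(
    candidate: RecoveryCandidate, max_wakes: int = MAX_CONTINUATION_WAKES
) -> Tuple[bool, str]:
    """Decide whether recovery may proceed, with a reason for the audit trail."""
    if candidate.is_recovery_issue:
        # Recovering a recovery issue is the recursive shape; refuse to deepen.
        return (False, "refused: source is itself a recovery issue")
    if candidate.continuation_wakes >= max_wakes:
        return (False, "refused: continuation bound reached, escalate instead")
    return (True, "ok")
```

Returning a reason alongside the decision keeps the refusal visible, which fits the rule that the recovery chain must stay auditable rather than be hidden.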

Verification checklist (before posting the plan)

  • The exact stop point in the named tree is identified with run ids / comment ids.
  • Recent shipped work in the same area was surveyed and is referenced.
  • Every non-progressing issue is classified human-needed / agent-actionable / already-covered.
  • The proposed rule is stated as a contract, not a patch.
  • All three invariants are explicitly preserved.
  • No code change has landed in this heartbeat.
  • A request_confirmation against the latest plan revision is open.
  • Phase 0 of the plan addresses the live named tree without destroying evidence.
  • Implementation phases name specialty-appropriate assignees and blockedByIssueIds dependencies.