pr-babysit


Babysit a PR/MR until CI is green AND every valid piece of reviewer feedback is addressed. Supports GitHub PRs (via `gh`) and GitLab MRs (via `glab`); auto-detect the platform from `git remote get-url origin` (github.com → gh; gitlab.com / self-hosted GitLab → glab).
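
The auto-detection rule can be sketched as a small shell function. This is a minimal illustration, not the skill's actual implementation: the function name `detect_cli` is hypothetical, and it assumes a self-hosted GitLab instance has "gitlab" somewhere in its hostname, which will not hold for every installation.

```shell
# Sketch: map a git remote URL to the CLI to use.
# Assumption: a self-hosted GitLab host contains "gitlab" in its hostname.
detect_cli() {
    url=$1
    case "$url" in
        *github.com[:/]*) echo gh ;;
        *gitlab*)         echo glab ;;
        *)                echo unknown ;;   # ask the user rather than guess
    esac
}

# Typical call site (needs a git checkout, so it is left commented out):
# detect_cli "$(git remote get-url origin)"
```

An `unknown` result is a reasonable point to stop and ask which platform the remote belongs to.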

Arguments

`$ARGUMENTS` accepts:

empty → current branch's open PR/MR
a number → PR/MR by IID on the current repo
a URL → parse owner/repo + IID from it

If multiple PRs/MRs match the current branch, stop and ask which one.
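
One way to classify the three accepted argument forms is sketched below. The function name `parse_target` and its output format are illustrative assumptions, and the URL patterns cover only the common `/pull/<n>` (GitHub) and `/-/merge_requests/<n>` (GitLab) shapes.

```shell
# Sketch: classify $ARGUMENTS as empty, a number, or a PR/MR URL.
# Output ("current", "iid <n>", "url <owner/repo> <n>", "invalid") is illustrative.
parse_target() {
    arg=$1
    case "$arg" in
        "") echo "current" ;;
        *[!0-9]*)
            # Contains a non-digit: try to read it as a PR/MR URL.
            parsed=$(printf '%s\n' "$arg" | sed -n -E \
                -e 's#^https?://[^/]+/(.+)/-/merge_requests/([0-9]+).*$#\1 \2#p' \
                -e 's#^https?://[^/]+/(.+)/pull/([0-9]+).*$#\1 \2#p')
            if [ -n "$parsed" ]; then echo "url $parsed"; else echo "invalid"; fi
            ;;
        *) echo "iid $arg" ;;
    esac
}
```

For example, `parse_target https://github.com/acme/widgets/pull/7` prints `url acme/widgets 7`, while an empty argument prints `current`, leaving the branch lookup to the caller.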

Reply Language

Reply prose posted to PR/MR threads (the `<what changed>`, `<reason>`, and `<evidence>` content following each reply-template anchor, plus the prose inside Wontfix Template fields) renders in the PR/MR description's primary language. Everything else stays English: the anchor phrases themselves, Wontfix Template field labels, conventional commit prefixes, the race meta tag, and P-codes / severity / justification tokens (same canonical set as `pr-review`'s Output Language).

Fallback when the PR description lacks substantive prose: linked issue body, then English.

Terminal output (step 6 run report, Gate A / Gate B audit messages, invisible-findings prompt) stays English — those go to the dispatcher session, not the PR.

Loop

1. Snapshot

Fetch: PR/MR metadata + head SHA, all checks / pipeline jobs, all review comments, all general comments, all review threads / discussions (with resolved state), the current user login.

For each thread you've previously replied to in this PR, cache `{file path, rule code or primary keyword, your reply summary}` for use by step 2 dedup.

Filter on content, not author:

Drop comments whose body is only CI status lines (build green/red, deploy event, "pipeline succeeded"). That is noise.
Keep any comment containing actionable signals (`Suggestion` / `Warning` / `Critical` / `Issue` / `quality gate` / `failed` / line-level review notes), even from bot accounts. AI review bots, SonarQube, Snyk are content bots, not noise bots.
Drop your own past replies and already-resolved threads.

2. Triage

Hard gate — invisible findings: if a check is failing but the actual finding list lives in an external dashboard your CLI cannot reach (SonarQube, Snyk, DataDog test reports, etc. — no token, no API endpoint accessible), STOP immediately and ask the user to paste the findings. Do not reproduce locally and process "guessed" findings as a complete cycle. Do not process unrelated feedback first while the invisible finding sits unaddressed. Root-cause diagnosis assumes you can see the finding; when you can't, this gate fires first.

Cross-round dedup — for each new comment, check the cache from step 1:

Same file + same rule code (e.g. `CA1031`) OR same primary keyword as a thread you already replied to → treat as duplicate. Reply with one line linking back to the earlier thread; do not re-implement or re-explain.
Same issue surviving 3 rounds despite fix attempts → escalate to `needs-user-input` (the bot is stuck; user has to break the tie).
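
The cache lookup behind cross-round dedup could look like the sketch below. The cache layout is an assumption (the source only names the cached fields): one thread per line, tab-separated as path, rule code or keyword, thread link. The function name `find_duplicate` is hypothetical, and the sketch covers only the "same file + same key" case.

```shell
# Sketch: look up an earlier thread with the same file and the same rule code / keyword.
# Assumed cache format: "path<TAB>rule-or-keyword<TAB>thread-link", one thread per line.
find_duplicate() {
    cache_file=$1
    path=$2
    key=$3
    # Prints the earlier thread link and returns 0 on a hit; returns 1 otherwise.
    awk -F '\t' -v p="$path" -v k="$key" '
        $1 == p && $2 == k { print $3; found = 1; exit }
        END { exit !found }
    ' "$cache_file"
}
```

On a hit, the printed link is what the one-line "Same as the earlier ... thread" reply would point to.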

Feedback — bucket each remaining unresolved comment:

Valid — bug, security, logic error, clear actionable suggestion
Discuss — ambiguous, possible source misread, design tradeoff, scope unclear → do NOT reply autonomously, do NOT implement — collect for user
Out-of-scope — clearly outside this PR's stated goal → collect for user

Checks — for each failing check: pull the failure log via CLI, diagnose root cause before attempting a fix (no patch without a named cause). Distinguish real failure vs flaky; only retry on evidence of flake. If the failure log doesn't contain the actual findings → invisible-findings gate above.

3. Address (Valid + real failures only)

For each item:

Implement the fix.
Small commit, conventional commits format, one logical change per commit. Type cheat sheet: behaviour change → `fix`; behaviour-preserving structure / readability (incl. lint suppressions) → `refactor`; non-source (CI, husky, tooling) → `chore`; pure docs → `docs`.
Reply on the originating comment / discussion thread (template table below).
Verify the reply landed inside the thread, not as a top-level note (see "Reply endpoints" below).

Reply endpoints by platform — mismatching these creates orphan top-level notes:

Action	GitHub	GitLab
Reply to a review thread	`POST /repos/{O}/{R}/pulls/{id}/comments` with `in_reply_to_id`	`POST /projects/:id/merge_requests/{iid}/discussions/{disc_id}/notes`
New top-level comment	`POST /repos/{O}/{R}/issues/{id}/comments`	`POST /projects/:id/merge_requests/{iid}/notes`

After posting a reply, `GET` the discussion / review thread back and confirm your note is in the thread (note count ≥ 2, your username present). If it landed top-level → delete it and retry on the right endpoint.
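
To keep threaded replies and top-level comments from being mixed up, the endpoint choice can be centralised in one helper. This is only a sketch: the paths are the ones listed in the endpoints table, while the function name `reply_endpoint` and its argument order are illustrative assumptions.

```shell
# Sketch: choose the REST path for a threaded reply versus a top-level comment.
# $1 platform (github|gitlab), $2 kind (thread|toplevel),
# $3 project ("owner/repo" on GitHub, project id on GitLab),
# $4 PR number / MR IID, $5 discussion id (GitLab threads only).
reply_endpoint() {
    platform=$1
    kind=$2
    project=$3
    id=$4
    disc=${5:-}
    case "$platform:$kind" in
        github:thread)   echo "/repos/$project/pulls/$id/comments" ;;   # body must carry in_reply_to_id
        github:toplevel) echo "/repos/$project/issues/$id/comments" ;;
        gitlab:thread)   echo "/projects/$project/merge_requests/$id/discussions/$disc/notes" ;;
        gitlab:toplevel) echo "/projects/$project/merge_requests/$id/notes" ;;
        *) return 1 ;;
    esac
}
```

The returned path would then be passed to `gh api` or `glab api` with the POST method; the read-back check described above still applies after posting.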

Reply templates — pick by situation:

Situation	Template
Adopted and fixed	`Addressed in <SHA> — <what changed>.`
Deliberate design, won't change	`Deliberate design — <reason>. <spec or codebase ref>.`
Same issue already replied earlier in this PR	`Same as the earlier <topic> thread — <link>.`
Bot premise wrong, won't fix	`Won't fix — premise doesn't hold. <evidence: file:line / spec section>.`

The Deliberate / Won't-fix templates exist to keep tone neutral and evidence-led — without a template these tend to drift into defensive or implementation-dump replies.

Anchor phrases stay English; only the prose after each anchor adapts to the PR description's language. See Reply Language.

Lint / warning suppression: any `#pragma`, `// eslint-disable`, `# noqa`, `@SuppressWarnings`, etc. must include:

(a) inline rationale comment on the same line, AND
(b) reference to the spec section OR an existing codebase precedent (`file:line`) using the same suppression for the same reason.

If neither (a) nor (b) is available → do not suppress, refactor instead. When (b) applies, cite the precedent `file:line` in the commit message.

Hard rules:

No `--amend` on already-pushed commits
No force-push (`git push --force`)
Don't mark GitLab discussions resolved unless the reviewer explicitly asked for that
Don't close any reviewer thread without a reply
3 failed attempts on the same fix → STOP, document what failed + assumptions to question, hand back to user (per global CLAUDE.md)

4. Push & wait

`git push`. Poll CI to a terminal state (GitHub: `gh pr checks --watch`; GitLab: poll `head_pipeline.status` until success/failed/canceled).

4.1 Record `prior_fix_range`

After step 3's fix commits land and step 4 has pushed them, capture the SHA range covering this iter's fixes. This range is the canonical source-of-truth for two downstream consumers:

Next iter's pr-review invocation: pass as `prior_fix_range` input so pr-review's incremental mode can apply drop signal (B) self-introduced surface
Gate B in step 4.5 below — same range, same line-level attribution mechanism

bash

# After step 4 push, before invoking the next pr-review iter:
FIRST_FIX_SHA=$(git log --format='%H' "$PREV_HEAD..HEAD" | tail -1)   # oldest fix in this iter
LAST_FIX_SHA=$(git rev-parse HEAD)                                     # newest fix in this iter
PRIOR_FIX_RANGE="${FIRST_FIX_SHA}^..${LAST_FIX_SHA}"


Persist `PRIOR_FIX_RANGE` (and `$LAST_FIX_SHA` as the next iter's `$PREV_HEAD`) into the babysit state file or session env. If the iter pushed a single commit, `FIRST_FIX_SHA == LAST_FIX_SHA` and the range collapses to `<sha>^..<sha>`.

If this iter pushed zero commits (CI re-run only) → no fix range to record; skip the Gate B self-introduced check for the next iter, but still run Gate A as normal.

**Why not compute lazily at Gate B**: computing at push time anchors the range to the exact commits that addressed iter (N-1) findings. Lazy computation at Gate B time could pick up unrelated commits if the user manually edits the branch between iters.
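
The range construction, including the zero-commit case, can be exercised without a repository by feeding the SHA list on stdin. The sketch below is illustrative: `fix_range` is a hypothetical helper, and it assumes the input is newest-first, which is the order `git log --format='%H' "$PREV_HEAD..HEAD"` prints.

```shell
# Sketch: derive PRIOR_FIX_RANGE from a newest-first SHA list on stdin.
# Prints nothing and returns 1 when the list is empty (CI re-run only, no fix range).
fix_range() {
    shas=$(cat)
    [ -n "$shas" ] || return 1
    first=$(printf '%s\n' "$shas" | tail -n 1)   # oldest fix in this iter
    last=$(printf '%s\n' "$shas" | head -n 1)    # newest fix in this iter
    printf '%s^..%s\n' "$first" "$last"
}

# Possible call site (needs a git checkout, so it is left commented out):
# PRIOR_FIX_RANGE=$(git log --format='%H' "$PREV_HEAD..HEAD" | fix_range) || PRIOR_FIX_RANGE=""
```

Returning a non-zero status for the empty list gives the caller an explicit signal to skip Gate B's self-introduced check rather than building a malformed `^..<sha>` range.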

4.5 Self-feedback loop gates

After pushing this iter's fixes and waiting for CI green, before looping back to step 1, run TWO sub-gates that catch different self-feedback failure modes. Without these, an automated reviewer paired with an automated babysitter can spend N iterations either chasing test-hygiene nits (Gate A) or chasing race-of-race surfaces (Gate B).

Both gates parse pr-review's inline comments on this PR:

bash

gh api repos/$OWNER/$REPO/pulls/$N/comments \
  --jq '[.[] | select(.body | contains("<!-- pr-review:finding-id=")) |
         {id, created_at, path, line, body,
          justification: ((.body | capture("<!-- pr-review:justification=(?<j>[^ ]+) -->").j) // null),
          race_meta: (.body | capture("\\[window=(?<w>[^,]+), damage=(?<d>[^,]+), recovery=(?<r>[^\\]]+)\\]") // null)}]'

Take only findings created since the previous iter's HEAD sha (the new ones this iter introduced).

Gate A: Diminishing Returns (only-hygiene iter)

Fires when ALL of:

≥1 new pr-review finding this iter
ZERO new findings have `justification ∈ {Reachable, Precedent, Asymmetric, Historical}`
ALL new findings are `justification=Hygiene` (or missing; treat missing as Hygiene)

Action: STOP automatic loop, skip step 5's normal decision, jump to step 6 with:

Status: needs-user-input (diminishing returns)

This iter's pr-review surfaced only hygiene findings — no Reachable / Precedent /
Asymmetric / Historical justification on any new finding.

Hygiene followups (N):
  <list — id, slug, file:line, one-line failure mode>

Continuing the loop will likely surface more hygiene from the same code paths.

Your call:
  (s) ship — open a single follow-up issue collecting the hygiene items, mark PR ready-to-merge
  (p) polish — keep looping (override the gate for this round)
  (r) re-review-full — challenge whether the self-loop missed anything (force `mode=full` on next pr-review)
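
The Gate A firing condition reduces to a small predicate over the justification tokens of this iter's new findings. The sketch below assumes the tokens have already been extracted, one per line, with an empty line standing for a missing justification; the function name `gate_a_fires` is hypothetical.

```shell
# Sketch: Gate A fires when there is at least one new finding and every
# justification is "Hygiene" or missing (empty line).
gate_a_fires() {
    awk '
        { n++ }
        $0 != "" && $0 != "Hygiene" { substantive++ }
        END { exit !(n >= 1 && substantive == 0) }
    '
}
```

A caller could pipe the `justification` column of the parsed comments into it and branch on the exit status, for example `if ... | gate_a_fires; then` emit the diminishing-returns report.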

Gate B: Convergence Audit (race-of-race iter)

Catches the failure mode where iter (N-1)'s fix introduces a new race / state-transition surface, the reviewer flags it as a Reachable finding, the next fix introduces yet another race surface, ad infinitum. Gate A does NOT catch this: those findings carry `justification=Reachable` and are individually valid; the divergence is only visible at cluster level.

`prior_fix_range`: use the range recorded in step 4.1. This is the same range fed to pr-review's incremental-mode invocation, so Gate B's self-introduced check and pr-review's drop signal (B) operate on identical evidence. If step 4.1 recorded nothing (iter N-1 pushed no commits), Gate B does not fire; there is no iter (N-1) fix surface to converge against.

Fires when ALL of:

`iter ≥ 3` (first two iters are normal review cadence, not divergence)
≥ 2 new findings this iter cite `file:line` inside `prior_fix_range`, i.e. critiquing iter (N-1)'s freshly-added surface
≥ 2 of those findings are race-class; detection is OR of:
- (i) carries `[window=..., damage=..., recovery=...]` meta from pr-review's race-class metadata requirement, OR
- (ii) slug/category keyword-matches one of: `race | TOCTOU | concurren | sweep | lifecycle | state-transition | debounce | claim | lease | fence | stale | orphan | race-window`, OR
- (iii) `\bwindow=` (matches the meta-tag prefix even when full meta is malformed) OR `atomic.*race | race.*atomic` (require co-occurrence to avoid catching DB-transaction `atomic` and frontend-viewport `window` noise)

Keyword design notes: bare `window` and bare `atomic` are deliberately excluded because they false-positive on rate-limiter / viewport / DB-transaction-correctness comments. `TOCTOU` is the canonical security-race term and matches Codex findings that bypass the meta-tag path. `debounce / claim / lease / fence` cover distributed-locking vocabulary; `stale / orphan` cover sweep-race descriptions.
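
The three detection paths can be sketched with `grep -E`. This is an illustration under assumptions: the function name `is_race_class` and the split into a slug argument and a body argument are hypothetical, and because `\b` is not portable in POSIX extended regular expressions the word boundary before `window=` is approximated with an explicit character class. The `race-window` keyword is omitted since the bare `race` alternative already matches it.

```shell
# Sketch of the race-class test. $1 = finding slug/category, $2 = comment body.
is_race_class() {
    slug=$1
    body=$2
    # (i)/(iii): a "window=" meta-tag prefix, even when the full tag is malformed.
    printf '%s\n' "$body" | grep -Eq '(^|[^[:alnum:]_])window=' && return 0
    # (ii): keyword match on the slug/category.
    printf '%s\n' "$slug" | grep -Eiq \
        'race|TOCTOU|concurren|sweep|lifecycle|state-transition|debounce|claim|lease|fence|stale|orphan' \
        && return 0
    # (iii): "atomic" only counts when it co-occurs with "race" on the same line.
    printf '%s\n' "$body" | grep -Eiq 'atomic.*race|race.*atomic' && return 0
    return 1
}
```

Counting how many new findings inside `prior_fix_range` pass this test gives the "≥ 2 race-class" condition; a viewport comment such as "window resize is not atomic" passes none of the three paths.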

How to verify file:line inside prior_fix_range:

bash

git diff --name-only $prior_fix_range                  # files touched
git diff -U0 $prior_fix_range -- <file>                # line-level attribution
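
The line-level attribution step can be automated by reading the hunk headers of the `-U0` diff. The sketch below is one possible approach, not the skill's own code: `line_in_hunks` is a hypothetical helper that expects the output of `git diff -U0 <range> -- <file>` on stdin and checks a line number against the new-side ranges.

```shell
# Sketch: succeed when line $1 falls inside any added/changed hunk on the new side.
# Hunk headers look like "@@ -a[,b] +c[,d] @@"; the new side is the "+c[,d]" field.
line_in_hunks() {
    awk -v target="$1" '
        /^@@ / {
            split(substr($3, 2), new, ",")
            start = new[1] + 0
            count = (new[2] == "") ? 1 : new[2] + 0   # a missing count means one line
            if (count > 0 && target >= start && target < start + count) hit = 1
        }
        END { exit !hit }
    '
}

# Possible call site (needs a git checkout, so it is left commented out):
# git diff -U0 "$PRIOR_FIX_RANGE" -- src/worker.ts | line_in_hunks 120
```

Pure deletions show up with a new-side count of 0 and are skipped, since a finding cannot cite a line that no longer exists.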

Action: STOP automatic loop, run Convergence Audit for the cluster. For each race-class finding, apply the Wontfix Template five-step decision:

Window: estimate ms / s / min / hr between the race operations (use the meta tag if present)

Damage: classify as `data-loss | deadlock | inconsistency | latency | marginal`

Asymmetric check: is the failure mode security / data-integrity / billing?
Mitigation cost: does the proposed fix introduce a new race surface?
Recovery path: does fault tolerance / next webhook / sweeper cover the race?

Audit verdict per finding:

Verdict	When
modify (Asymmetric)	Justification is Asymmetric (security / data-loss / data-integrity / billing) → ALWAYS modify, regardless of mitigation cost
modify (damage gate)	`damage` value is `data-loss` / `deadlock` / `inconsistency` → modify even if Justification is not formally Asymmetric. These damage classes have no acceptable "fault tolerance" answer
modify (safe fix)	non-Asymmetric, `damage ∈ {latency, marginal}`, BUT mitigation does NOT introduce new race surface → modify (no race-of-race risk)
wontfix-with-template	non-Asymmetric + `damage ∈ {latency, marginal}` + `recovery=has` + mitigation introduces new race surface → reply using Wontfix Template. ALL four conditions required; missing any → fall through to modify
defer-followup	valid concern but resolution requires infrastructure (e.g. real DB test, schema migration, new background job) that belongs to a follow-up issue
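
The mechanical rows of the verdict table can be read as a decision function evaluated top to bottom. The sketch below is an interpretation, not the skill's implementation: the function name `audit_verdict` and its argument encoding are assumptions, and `defer-followup` is left out because it depends on a judgment about required infrastructure rather than on the finding's fields.

```shell
# Sketch of the verdict table. Arguments (encoding is illustrative):
# $1 justification, $2 damage, $3 recovery ("has" or anything else),
# $4 new_surface ("yes" when the proposed fix introduces a new race surface).
audit_verdict() {
    justification=$1
    damage=$2
    recovery=$3
    new_surface=$4
    if [ "$justification" = "Asymmetric" ]; then echo "modify (Asymmetric)"; return; fi
    case "$damage" in
        data-loss|deadlock|inconsistency) echo "modify (damage gate)"; return ;;
        latency|marginal) ;;
        *) echo "modify"; return ;;   # unknown damage class: do not wontfix
    esac
    if [ "$new_surface" != "yes" ]; then echo "modify (safe fix)"; return; fi
    if [ "$recovery" = "has" ]; then echo "wontfix-with-template"; else echo "modify"; fi
}
```

Checking the Asymmetric and damage gates first mirrors the table's intent that wontfix is only reachable when every earlier row has been ruled out.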

Report to user:

Status: convergence-audit (race-of-race detected)

iter (N-1) fix surface attracted N race-class findings this iter (cluster):
  <id> <slug> @ <file:line>  window=<w> damage=<d> recovery=<r>
  ...

Audit verdict per finding:
  <id>: modify    — <reason: Asymmetric / mitigation safe / etc>
  <id>: wontfix   — <five-field summary from Wontfix Template>
  <id>: defer     — <followup issue suggestion>

Your call:
  (a) accept all verdicts (post wontfix replies via template, address modify items, open defer issues)
  (m) modify a specific verdict — say which finding-id and target verdict
  (s) ship — accept all wontfix + defer as-is, mark PR ready-to-merge
  (p) override audit — treat as normal iter, loop back to step 1

Gate B does NOT fire when:

Cluster contains any Asymmetric finding — Asymmetric (security / data-loss / data-integrity / billing) bypasses the convergence escape just as it does in pr-review's drop signal (B). Surface them and modify
`iter < 3`: early iters are normal review cadence
Race-class meta is missing AND no slug/category keyword match — keeps gate narrow to actual race domain; non-race convergence (e.g. naming-bikeshed) falls back to Gate A or normal flow

Rationale: Gate A catches iters where everything is hygiene; Gate B catches iters where individually-valid race findings cluster on freshly-introduced surfaces. Together they cover the two main self-feedback failure modes without suppressing genuine Asymmetric findings or third-party signal (Codex / SonarQube / Snyk findings without pr-review's metadata bypass both gates and route through normal step 2 dedup + 3-round escalation).

4.6 Wontfix Template

Used by step 4.5 Gate B (Convergence Audit) and as a manual reply template for race / state / sweep / atomic class findings where modification would introduce new race surfaces.

Five fields are minimum-required. Missing any one → finding deserves modification, not wontfix.

Wontfix — deliberate trade-off.

Race window: <ms / s / min / hr> between <op A> and <op B>.
Precondition: <only fires when X is in Y state for N+ time>
Damage if race fires: <not data-loss / not deadlock / only X happens N seconds earlier than ideal>
Recovery path: <new event / cron sweeper / next webhook covers it; user-visible behavior unchanged>

Asymmetric check: <not security / not data-loss / not data-integrity / not billing>
Mitigation cost: <atomic re-check / two-step merge into transaction is doable, but introduces new race-of-race surface at X>

Acknowledged as known trade-off; fault tolerance covers genuinely <abandoned / stranded / dropped> class.
Tracking: <if needed, opened followup issue X>

Field semantics:

Race window: concrete time estimate, not "small". `ms` for tight CAS, `min` for sweep cycle gap, `hr` for cron lifecycle. Reviewer needs the magnitude to judge.
Precondition: what state the system must already be in for the race to even matter. If precondition is rare or already-degraded, race is acceptable.
Damage: concrete user / data observation, not "could be a problem". If you cannot describe damage in one line, the finding may not actually be Reachable.
Recovery path: must name a concrete mechanism (next webhook / sweeper run / cron / fault-tolerant retry). "It'll probably be fine" is not a recovery path.
Asymmetric check: explicit declaration that finding is not security / data-integrity / billing. Wontfix is INVALID for Asymmetric findings; modify them.
Mitigation cost: name the new race surface the proposed fix would introduce. "race-of-race" is the load-bearing reasoning.

Reference example: PR #148 `sweepAbandonedTasklessThreads` two-UPDATE race. Codex flagged "re-check thread state before abandoning queued events"; race window was milliseconds between two sweep UPDATEs, precondition was thread already stranded 1+ hour, damage was `marginal` (already-stranded events terminalize seconds earlier than ideal), recovery path was new webhook hits reactivation gate. Wontfix posted; PR shipped.

When NOT to use:

Any of the five fields cannot be filled honestly → finding is real, modify it. Wontfix Template is for the specific case where modification introduces equivalent or worse race surface; it is NOT a generic decline template.
Dev-stage self-review context (no separate session between code author and verdict reasoner): do NOT fill these fields from main-session memory. Babysit normally runs in a session separate from the code author, which is what makes Wontfix Template safe to apply: the babysit session has no prior commitment to the design and can honestly reason about damage / recovery / mitigation cost. In a dev-stage self-review loop (same session wrote the code AND is reasoning about findings), author-narrative bias compounds; bug-free framing produces the strongest detection drop among framing conditions tested across 6 LLMs (Mitropoulos et al., Measuring and Exploiting Contextual Bias in LLM-Assisted Security Code Review, arXiv:2603.18740). Pause and either (a) hand off to a separate session for the verdict, or (b) use a fresh-spawn verdict subagent that independently derives `damage` / `recovery` / `mitigation cost` from code, not from the finding object's fields. The Deriver-pattern verdict subagent is not built as a skill yet; until it is, treat dev-stage wontfix decisions as advisory and surface them to the user.

5. Decide

✅ All checks green AND all Valid feedback resolved → Report (step 6)
🟡 New comment / check status changed mid-cycle → back to step 1
🔴 Hit 3-failure stop, invisible-findings gate, dedup 3-round escalation, OR something genuinely needs human judgment → Report with `blocked` / `needs-user-input`

6. Report (end of run, not auto-merge)

PR/MR: <link>
Status: ready-to-merge | needs-user-input | blocked
Checks: <green>/<total>
Addressed (this run): <list of SHA → comment ref + one-liner>

Awaiting your decision:
  Discuss (I did NOT reply): <list with comment text + my read of the ambiguity>
  Out-of-scope: <list>  → open follow-up issues for any of these? (y/N per item)

Blockers (if any): <description + what I tried>

Next command: gh pr merge --squash <id>   # or: glab mr merge <id>

After the report, if there are out-of-scope items, ask once: open follow-up issues for which ones? Open only the ones the user picks (`gh issue create` / `glab issue create`), and edit the reply on each PR/MR comment to link the new issue.

What I never do without asking

Reply, dismiss, or implement based on Discuss items — list them, stop.
Open follow-up issues for Out-of-scope items without confirming the list with the user first.
Merge the PR/MR. Even when fully green, report ready-to-merge and let the user run the merge.
Force-push, amend pushed commits, skip hooks (`--no-verify`), or bypass signing.
Loop forever — if a cycle produces no new work and nothing is resolved, stop and report.
