pr-babysit
Babysit a PR/MR until CI is green AND every valid piece of reviewer feedback is addressed. Supports GitHub PRs (via `gh`) and GitLab MRs (via `glab`); auto-detect with `git remote get-url origin` (github.com → gh; gitlab.com / self-hosted GitLab → glab).
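The auto-detection above can be sketched as a small shell function. This is a minimal sketch, not the skill's actual implementation: `detect_platform` is a hypothetical helper name, and treating any remote whose URL contains "gitlab" as a GitLab host is an assumption that will miss self-hosted instances with unrelated hostnames.

```bash
# detect_platform URL: print "gh" for GitHub remotes, "glab" for GitLab remotes.
# Hypothetical helper; returns non-zero when the host is not recognised.
detect_platform() {
  case "$1" in
    *github.com[:/]*) echo gh ;;
    *gitlab*)         echo glab ;;  # assumption: self-hosted hosts contain "gitlab"
    *)                return 1 ;;   # unknown host: ask the user instead of guessing
  esac
}

# Usage inside a checkout:
#   detect_platform "$(git remote get-url origin)"
```

When the function fails, stopping to ask the user is safer than defaulting to one CLI.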
Arguments
- empty → current branch's open PR/MR
- a number → PR/MR by IID on the current repo
- a URL → parse owner/repo + IID from it
If multiple PRs/MRs match the current branch, stop and ask which one.
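The three argument shapes could be told apart roughly as follows. This is a sketch under assumptions: `parse_target` and its output format are hypothetical, and the URL pattern only covers the common `/pull/<n>` and `/-/merge_requests/<n>` layouts.

```bash
# parse_target ARG: print "<kind> <repo> <id>".
# repo and id are "-" when they must come from the current checkout.
parse_target() {
  local parsed
  case "${1:-}" in
    "")
      echo "branch - -" ;;
    http://*|https://*)
      # Keep the owner/repo path and the number; drop GitLab's "/-" separator.
      parsed=$(printf '%s\n' "$1" |
        sed -nE 's#^https?://[^/]+/(.+)/(pull|merge_requests)/([0-9]+).*$#\1 \3#p' |
        sed 's#/- # #')
      [ -n "$parsed" ] || return 1
      echo "url $parsed" ;;
    *[!0-9]*)
      return 1 ;;                 # neither empty, numeric, nor a URL
    *)
      echo "number - $1" ;;
  esac
}
```

Anything that fails to parse should be handed back to the user rather than guessed at.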
Reply Language
Reply prose posted to PR/MR threads (the free-text content following each reply-template anchor, plus the prose inside Wontfix Template fields) renders in the PR/MR description's primary language. Everything else stays English: the anchor phrases themselves, Wontfix Template field labels, conventional commit prefixes, the race meta tag, and P-codes / severity / justification tokens (same canonical set as pr-review's Output Language).
Fallback when the PR description lacks substantive prose: linked issue body, then English.
Terminal output (step 6 run report, Gate A / Gate B audit messages, invisible-findings prompt) stays English — those go to the dispatcher session, not the PR.
Loop
1. Snapshot
Fetch: PR/MR metadata + head SHA, all checks / pipeline jobs, all review comments, all general comments, all review threads / discussions (with resolved state), the current user login.
For each thread you've previously replied to in this PR, cache `{file path, rule code or primary keyword, your reply summary}` for use by the step 2 dedup.
Filter on content, not author:
- Drop comments whose body is only CI status lines (build green/red, deploy event, "pipeline succeeded"). That is noise.
- Keep any comment containing actionable signals (for example, line-level review notes), even from bot accounts. AI review bots, SonarQube, Snyk are content bots, not noise bots.
- Drop your own past replies and already-resolved threads.
2. Triage
Hard gate — invisible findings: if a check is failing but the actual finding list lives in an external dashboard your CLI cannot reach (SonarQube, Snyk, DataDog test reports, etc. — no token, no API endpoint accessible), STOP immediately and ask the user to paste the findings. Do not reproduce locally and process "guessed" findings as a complete cycle. Do not process unrelated feedback first while the invisible finding sits unaddressed. Root-cause diagnosis assumes you can see the finding; when you can't, this gate fires first.
Cross-round dedup — for each new comment, check the cache from step 1:
- Same file + same rule code OR same primary keyword as a thread you already replied to → treat as duplicate. Reply with one line linking back to the earlier thread; do not re-implement or re-explain.
- Same issue surviving 3 rounds despite fix attempts → escalate to the user (the bot is stuck; the user has to break the tie).
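A dedup lookup against the step 1 cache might look like the sketch below. The tab-separated cache layout (file path, rule code or keyword, thread link) and the name `find_duplicate` are assumptions for illustration, and the sketch implements only one reading of the rule: same file plus same key.

```bash
# find_duplicate CACHE_FILE PATH KEY
# Print the link of the earlier thread when PATH + KEY was already answered.
# Assumed cache format, one thread per line: <path>\t<rule-or-keyword>\t<link>
find_duplicate() {
  awk -F '\t' -v p="$2" -v k="$3" '
    $1 == p && $2 == k { print $3; found = 1; exit }
    END { exit !found }' "$1"
}
```

A hit feeds the "Same as the earlier thread" reply; a miss sends the comment on to bucketing.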
Feedback — bucket each remaining unresolved comment:
- Valid — bug, security, logic error, clear actionable suggestion
- Discuss — ambiguous, possible source misread, design tradeoff, scope unclear → do NOT reply autonomously, do NOT implement — collect for user
- Out-of-scope — clearly outside this PR's stated goal → collect for user
Checks — for each failing check: pull the failure log via CLI, diagnose root cause before attempting a fix (no patch without a named cause). Distinguish real failure vs flaky; only retry on evidence of flake. If the failure log doesn't contain the actual findings → invisible-findings gate above.
3. Address (Valid + real failures only)
For each item:
- Implement the fix.
- Small commit, conventional commits format, one logical change per commit. Type cheat: behaviour change → ; behaviour-preserving structure / readability (incl. lint suppressions) → ; non-source (CI, husky, tooling) → ; pure docs → .
- Reply on the originating comment / discussion thread (template table below).
- Verify the reply landed inside the thread, not as a top-level note (see "Reply endpoints" below).
Reply endpoints by platform — mismatching these creates orphan top-level notes:
| Action | GitHub | GitLab |
|---|---|---|
| Reply to a review thread | POST /repos/{O}/{R}/pulls/{id}/comments with `in_reply_to` | POST /projects/:id/merge_requests/{iid}/discussions/{disc_id}/notes |
| New top-level comment | POST /repos/{O}/{R}/issues/{id}/comments | POST /projects/:id/merge_requests/{iid}/notes |
After posting a reply, fetch the discussion / review thread back and confirm your note is in the thread (note count ≥ 2, your username present). If it landed top-level → delete it and retry on the right endpoint.
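The landing check can be reduced to a filter over the thread's note authors. This is a sketch: `reply_landed` is a hypothetical helper, and how you extract one author login per line from the fetched thread depends on the platform's response shape, which is not shown here.

```bash
# reply_landed USER: read one note-author login per line on stdin.
# Succeeds when the thread holds at least 2 notes and USER wrote one of them.
reply_landed() {
  awk -v u="$1" '
    NF { n++; if ($0 == u) mine = 1 }
    END { exit !(n >= 2 && mine) }'
}
```

A non-zero exit is the cue to delete the orphan note and repost on the thread endpoint.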
Reply templates — pick by situation:
| Situation | Template |
|---|---|
| Adopted and fixed | Addressed in <SHA> — <what changed>. |
| Deliberate design, won't change | Deliberate design — <reason>. <spec or codebase ref>. |
| Same issue already replied earlier in this PR | Same as the earlier <topic> thread — <link>. |
| Bot premise wrong, won't fix | Won't fix — premise doesn't hold. <evidence: file:line / spec section>. |
The Deliberate / Won't-fix templates exist to keep tone neutral and evidence-led — without a template these tend to drift into defensive or implementation-dump replies.
Anchor phrases stay English; only the prose after each anchor adapts to the PR description's language. See Reply Language.
Lint / warning suppression: any inline suppression directive must include:
- (a) inline rationale comment on the same line, AND
- (b) reference to the spec section OR an existing codebase precedent using the same suppression for the same reason.
If neither (a) nor (b) is available → do not suppress, refactor instead. When (b) applies, cite the precedent in the commit message.
Hard rules:
- No amending of already-pushed commits
- No force-push, skipped hooks, or bypassed signing (see "What I never do without asking")
- Don't mark GitLab discussions resolved unless the reviewer explicitly asked for that
- Don't close any reviewer thread without a reply
- 3 failed attempts on the same fix → STOP, document what failed + assumptions to question, hand back to user (per global CLAUDE.md)
4. Push & wait
Push the fix commits. Poll CI to a terminal state on either platform (on GitLab, poll until the pipeline reports success/failed/canceled).
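A generic polling loop for this step could look like the following sketch. `poll_until_terminal` and the `POLL_INTERVAL` variable are hypothetical names; the status command you pass in is assumed to print a single state word, and the terminal states are the three named above.

```bash
# poll_until_terminal CMD [ARGS...]
# Run CMD repeatedly until it prints a terminal state, then print that state.
# POLL_INTERVAL (seconds, default 30) is an assumed tuning knob.
poll_until_terminal() {
  local interval="${POLL_INTERVAL:-30}" state
  while :; do
    state="$("$@")"
    case "$state" in
      success|failed|canceled) printf '%s\n' "$state"; return 0 ;;
    esac
    sleep "$interval"
  done
}
```

In practice you would also want an upper bound on total wait time so a stuck pipeline ends in a report instead of an endless loop.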
4.1 Record
After step 3's fix commits land and step 4 has pushed them, capture the SHA range covering this iter's fixes. This range is the canonical source-of-truth for two downstream consumers:
- Next iter's pr-review invocation: pass the range as input so pr-review's incremental mode can apply drop signal (B) self-introduced surface
- Gate B in step 4.5 below — same range, same line-level attribution mechanism
```bash
# After step 4 push, before invoking the next pr-review iter:
FIRST_FIX_SHA=$(git log --format='%H' "$PREV_HEAD..HEAD" | tail -1)  # oldest fix in this iter
LAST_FIX_SHA=$(git rev-parse HEAD)                                    # newest fix in this iter
# An empty FIRST_FIX_SHA means this iter pushed zero commits: record no range.
[ -n "$FIRST_FIX_SHA" ] && PRIOR_FIX_RANGE="${FIRST_FIX_SHA}^..${LAST_FIX_SHA}"
```
Persist PRIOR_FIX_RANGE (and LAST_FIX_SHA as the next iter's PREV_HEAD) into the babysit state file or session env. If the iter pushed a single commit, FIRST_FIX_SHA == LAST_FIX_SHA and the range collapses to ${LAST_FIX_SHA}^..${LAST_FIX_SHA}.
If this iter pushed zero commits (CI re-run only) → no fix range to record; skip the Gate B self-introduced check for the next iter, but still run Gate A as normal.
Why not compute lazily at Gate B: computing at push time anchors the range to the exact commits that addressed iter (N-1) findings. Lazy computation at Gate B time could pick up unrelated commits if the user manually edits the branch between iters.
4.5 Self-feedback loop gates
After pushing this iter's fixes and waiting for CI green, before looping back to step 1, run TWO sub-gates that catch different self-feedback failure modes. Without these, an automated reviewer paired with an automated babysitter can spend N iterations either chasing test-hygiene nits (Gate A) or chasing race-of-race surfaces (Gate B).
Both gates parse pr-review's inline comments on this PR:
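```bash
# List pr-review's inline findings with their justification and race meta.
# "// null" keeps findings whose tag is missing instead of silently dropping them.
gh api "repos/$OWNER/$REPO/pulls/$N/comments" \
  --jq '[.[] | select(.body | contains("<!-- pr-review:finding-id=")) |
        {id, created_at, path, line, body,
         justification: (.body | (capture("<!-- pr-review:justification=(?<j>[^ ]+) -->").j // null)),
         race_meta: (.body | (capture("\\[window=(?<w>[^,]+), damage=(?<d>[^,]+), recovery=(?<r>[^\\]]+)\\]") // null))}]'
```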
Take only findings created since the previous iter's HEAD sha (the new ones this iter introduced).
Gate A: Diminishing Returns (only-hygiene iter)
Fires when ALL of:
- ≥1 new pr-review finding this iter
- ZERO new findings have justification ∈ {Reachable, Precedent, Asymmetric, Historical}
- ALL new findings are Hygiene (or missing; treat missing as Hygiene)
Action: STOP automatic loop, skip step 5's normal decision, jump to step 6 with:
Status: needs-user-input (diminishing returns)
This iter's pr-review surfaced only hygiene findings — no Reachable / Precedent /
Asymmetric / Historical justification on any new finding.
Hygiene followups (N):
<list — id, slug, file:line, one-line failure mode>
Continuing the loop will likely surface more hygiene from the same code paths.
Your call:
(s) ship — open a single follow-up issue collecting the hygiene items, mark PR ready-to-merge
(p) polish — keep looping (override the gate for this round)
(r) re-review-full — challenge whether the self-loop missed anything (force `mode=full` on next pr-review)
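The Gate A firing condition above can be sketched as a filter over the new findings' justification values. `gate_a_fires` is a hypothetical helper; it assumes one justification per line on stdin, with an empty line standing for a missing tag.

```bash
# gate_a_fires: read one justification per new finding on stdin.
# Succeeds (gate fires) when there is at least one finding, none is
# Reachable / Precedent / Asymmetric / Historical, and all are Hygiene or missing.
gate_a_fires() {
  awk '
    { n++ }
    /^(Reachable|Precedent|Asymmetric|Historical)$/ { strong++ }
    /^(Hygiene)?$/ { hygiene++ }
    END { exit !(n >= 1 && strong == 0 && hygiene == n) }'
}
```

An unrecognised justification value makes the gate stay closed, which errs on the side of continuing the normal loop.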
Gate B: Convergence Audit (race-of-race iter)
Catches the failure mode where iter (N-1)'s fix introduces a new race / state-transition surface, the reviewer flags it as a Reachable finding, the next fix introduces yet another race surface, ad infinitum. Gate A does NOT catch this: those findings carry a Reachable justification and are individually valid; the divergence is only visible at cluster level.
prior_fix_range: use the range recorded in step 4.1. This is the same range fed to pr-review's incremental-mode invocation, so Gate B's self-introduced check and pr-review's drop signal (B) operate on identical evidence. If step 4.1 recorded nothing (iter N-1 pushed no commits), Gate B does not fire, because there is no iter (N-1) fix surface to converge against.
Fires when ALL of:
- Iter number ≥ 3 (first two iters are normal review cadence, not divergence)
- ≥ 2 new findings this iter cite file:line inside prior_fix_range, i.e. they critique iter (N-1)'s freshly-added surface
- ≥ 2 of those findings are race-class; detection is the OR of:
  - (i) the finding carries `[window=..., damage=..., recovery=...]` meta from pr-review's race-class metadata requirement, OR
  - (ii) slug/category keyword-matches one of: `race | TOCTOU | concurren | sweep | lifecycle | state-transition | debounce | claim | lease | fence | stale | orphan | race-window`, OR
  - (iii) the body contains `[window=` (matches the meta-tag prefix even when the full meta is malformed) OR matches `atomic.*race | race.*atomic` (co-occurrence is required to avoid catching DB-transaction and frontend-viewport noise)
Keyword design notes: bare `window` and bare `atomic` are deliberately excluded because they false-positive on rate-limiter / viewport / DB-transaction-correctness comments. `TOCTOU` is the canonical security-race term and matches Codex findings that bypass the meta-tag path. `debounce / claim / lease / fence` cover distributed-locking vocabulary; the remaining sweep-oriented keywords cover sweep-race descriptions.
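The race-class detection rules can be approximated with plain pattern matching. This is a sketch: `is_race_class` is a hypothetical helper, it takes the slug/category and the comment body as separate arguments, and it assumes that the literal `[window=` prefix is what identifies a (possibly malformed) meta tag.

```bash
# Keyword list from rule (ii); matched case-insensitively as substrings.
RACE_KEYWORDS='race|TOCTOU|concurren|sweep|lifecycle|state-transition|debounce|claim|lease|fence|stale|orphan'

# is_race_class SLUG BODY: succeed when the finding looks race-class.
is_race_class() {
  local slug="$1" body="$2"
  # (i)/(iii): meta tag present, complete or malformed (assumed prefix "[window=").
  case "$body" in *"[window="*) return 0 ;; esac
  # (ii): slug or category keyword match.
  printf '%s\n' "$slug" | grep -Eiq "$RACE_KEYWORDS" && return 0
  # (iii): "atomic" counts only when it co-occurs with "race".
  printf '%s\n' "$body" | grep -Eiq 'atomic.*race|race.*atomic'
}
```

Substring matching keeps the sketch close to the keyword list as written, at the cost of hits such as "release" matching `lease`; word-boundary anchors would tighten that if it proves noisy.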
How to verify file:line inside prior_fix_range:
```bash
git diff --name-only "$prior_fix_range"        # files touched
git diff -U0 "$prior_fix_range" -- <file>      # line-level attribution
```
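To turn the `-U0` output into a yes/no answer for a single line, the hunk headers can be parsed directly. This is a sketch: `line_in_diff` is a hypothetical helper that reads a unified diff on stdin and checks whether a line number of the new file falls inside any added or changed hunk.

```bash
# line_in_diff LINE: read `git diff -U0` output on stdin.
# Succeeds when LINE (new-file numbering) lies inside a hunk's "+start,count" range.
line_in_diff() {
  awk -v want="$1" '
    /^@@ / {
      split($3, new, ",")                      # $3 looks like "+120,4" or "+120"
      start = substr(new[1], 2) + 0
      count = (new[2] == "" ? 1 : new[2] + 0)  # omitted count means 1 line
      if (count > 0 && want >= start && want < start + count) { hit = 1; exit }
    }
    END { exit !hit }'
}

# Usage: git diff -U0 "$prior_fix_range" -- path/to/file | line_in_diff 120
```

Pure-deletion hunks (count 0) never match, since they leave no new-file line to cite.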
Action: STOP automatic loop, run Convergence Audit for the cluster. For each race-class finding, apply the Wontfix Template five-step decision:
- Window: estimate ms / s / min / hr between the race operations (use the meta tag if present)
- Damage: classify as data-loss | deadlock | inconsistency | latency | marginal
- Asymmetric check: is the failure mode security / data-integrity / billing?
- Mitigation cost: does the proposed fix introduce a new race surface?
- Recovery path: does fault tolerance / next webhook / sweeper cover the race?
Audit verdict per finding:
| Verdict | When |
|---|---|
| modify (Asymmetric) | Justification is Asymmetric (security / data-loss / data-integrity / billing) → ALWAYS modify, regardless of mitigation cost |
| modify (damage gate) | damage value is data-loss / deadlock / inconsistency → modify even if Justification is not formally Asymmetric. These damage classes have no acceptable "fault tolerance" answer |
| modify (safe fix) | non-Asymmetric, damage ∈ {latency, marginal}, BUT mitigation does NOT introduce new race surface → modify (no race-of-race risk) |
| wontfix-with-template | non-Asymmetric + damage ∈ {latency, marginal} + a concrete recovery path + mitigation introduces new race surface → reply using Wontfix Template. ALL conditions required; missing any → fall through to modify |
| defer-followup | valid concern but resolution requires infrastructure (e.g. real DB test, schema migration, new background job) that belongs to a follow-up issue |
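The mechanical part of the verdict table can be written as a decision function. This is a sketch under assumptions: `audit_verdict` is a hypothetical helper, the yes/no inputs for recovery path and new race surface are simplifications of judgments the auditor still has to make, the recovery-path condition is taken from the five-step decision list, and defer-followup is left out because it is not derivable from these inputs.

```bash
# audit_verdict JUSTIFICATION DAMAGE RECOVERY(yes|no) NEW_RACE(yes|no)
# Print the verdict for one race-class finding.
audit_verdict() {
  local justification="$1" damage="$2" recovery="$3" new_race="$4"
  if [ "$justification" = "Asymmetric" ]; then
    echo "modify (Asymmetric)"; return
  fi
  case "$damage" in
    data-loss|deadlock|inconsistency) echo "modify (damage gate)"; return ;;
    latency|marginal) ;;
    *) echo "modify"; return ;;   # unknown damage class: fall through to modify
  esac
  if [ "$new_race" != "yes" ]; then
    echo "modify (safe fix)"; return
  fi
  if [ "$recovery" = "yes" ]; then
    echo "wontfix-with-template"; return
  fi
  echo "modify"                   # a wontfix condition is missing
}
```

Every path that does not satisfy all wontfix conditions ends in modify, mirroring the table's fall-through rule.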
Report to user:
Status: convergence-audit (race-of-race detected)
iter (N-1) fix surface attracted N race-class findings this iter (cluster):
<id> <slug> @ <file:line> window=<w> damage=<d> recovery=<r>
...
Audit verdict per finding:
<id>: modify — <reason: Asymmetric / mitigation safe / etc>
<id>: wontfix — <five-field summary from Wontfix Template>
<id>: defer — <followup issue suggestion>
Your call:
(a) accept all verdicts (post wontfix replies via template, address modify items, open defer issues)
(m) modify a specific verdict — say which finding-id and target verdict
(s) ship — accept all wontfix + defer as-is, mark PR ready-to-merge
(p) override audit — treat as normal iter, loop back to step 1
Gate B does NOT fire when:
- Cluster contains any Asymmetric finding — Asymmetric (security / data-loss / data-integrity / billing) bypasses the convergence escape just as it does in pr-review's drop signal (B). Surface them and modify
- Iter number < 3: early iters are normal review cadence
- Race-class meta is missing AND no slug/category keyword match — keeps gate narrow to actual race domain; non-race convergence (e.g. naming-bikeshed) falls back to Gate A or normal flow
Rationale: Gate A catches iters where everything is hygiene; Gate B catches iters where individually-valid race findings cluster on freshly-introduced surfaces. Together they cover the two main self-feedback failure modes without suppressing genuine Asymmetric findings or third-party signal (Codex / SonarQube / Snyk findings without pr-review's metadata bypass both gates and route through normal step 2 dedup + 3-round escalation).
4.6 Wontfix Template
Used by step 4.5 Gate B (Convergence Audit) and as a manual reply template for race / state / sweep / atomic class findings where modification would introduce new race surfaces.
Five fields are minimum-required. Missing any one → finding deserves modification, not wontfix.
Wontfix — deliberate trade-off.
Race window: <ms / s / min / hr> between <op A> and <op B>.
Precondition: <only fires when X is in Y state for N+ time>
Damage if race fires: <not data-loss / not deadlock / only X happens N seconds earlier than ideal>
Recovery path: <new event / cron sweeper / next webhook covers it; user-visible behavior unchanged>
Asymmetric check: <not security / not data-loss / not data-integrity / not billing>
Mitigation cost: <atomic re-check / two-step merge into transaction is doable, but introduces new race-of-race surface at X>
Acknowledged as known trade-off; fault tolerance covers genuinely <abandoned / stranded / dropped> class.
Tracking: <if needed, opened followup issue X>
Field semantics:
- Race window: a concrete time estimate, not "small". For example, ms for tight CAS, s / min for a sweep cycle gap, hr for a cron lifecycle. The reviewer needs the magnitude to judge.
- Precondition — what state the system must already be in for the race to even matter. If precondition is rare or already-degraded, race is acceptable.
- Damage — concrete user / data observation, not "could be a problem". If you cannot describe damage in one line, the finding may not actually be Reachable.
- Recovery path — must name a concrete mechanism (next webhook / sweeper run / cron / fault-tolerant retry). "It'll probably be fine" is not a recovery path.
- Asymmetric check — explicit declaration that finding is not security / data-integrity / billing. Wontfix is INVALID for Asymmetric findings — modify them.
- Mitigation cost — name the new race surface the proposed fix would introduce. "race-of-race" is the load-bearing reasoning.
Reference example: PR #148, the sweepAbandonedTasklessThreads two-UPDATE race. Codex flagged "re-check thread state before abandoning queued events"; the race window was milliseconds between two sweep UPDATEs, the precondition was a thread already stranded 1+ hour, the damage was that already-stranded events terminalize seconds earlier than ideal, and the recovery path was a new webhook hitting the reactivation gate. Wontfix posted; PR shipped.
When NOT to use:
- Any of the five fields cannot be filled honestly → finding is real, modify it. Wontfix Template is for the specific case where modification introduces equivalent or worse race surface; it is NOT a generic decline template.
- Dev-stage self-review context (no separate session between code author and verdict reasoner): do NOT fill these fields from main-session memory. Babysit normally runs in a session separate from the code author, which is what makes the Wontfix Template safe to apply: the babysit session has no prior commitment to the design and can honestly reason about damage / recovery / mitigation cost. In a dev-stage self-review loop (same session wrote the code AND is reasoning about findings), author-narrative bias compounds: bug-free framing produces the strongest detection drop among framing conditions tested across 6 LLMs (Mitropoulos et al., Measuring and Exploiting Contextual Bias in LLM-Assisted Security Code Review, arXiv:2603.18740). Pause and either (a) hand off to a separate session for the verdict, or (b) use a fresh-spawn verdict subagent that independently derives window / damage / recovery from code, not from the finding object's fields. The Deriver-pattern verdict subagent is not built as a skill yet; until it is, treat dev-stage wontfix decisions as advisory and surface them to the user.
5. Decide
- ✅ All checks green AND all Valid feedback resolved → Report (step 6)
- 🟡 New comment / check status changed mid-cycle → back to step 1
- 🔴 Hit 3-failure stop, invisible-findings gate, dedup 3-round escalation, OR something genuinely needs human judgment → Report with needs-user-input / blocked
6. Report (end of run, not auto-merge)
PR/MR: <link>
Status: ready-to-merge | needs-user-input | blocked
Checks: <green>/<total>
Addressed (this run): <list of SHA → comment ref + one-liner>
Awaiting your decision:
Discuss (I did NOT reply): <list with comment text + my read of the ambiguity>
Out-of-scope: <list> → open follow-up issues for any of these? (y/N per item)
Blockers (if any): <description + what I tried>
Next command: gh pr merge --squash <id> # or: glab mr merge <id>
After the report, if there are out-of-scope items, ask once: open follow-up issues for which ones? Open only the ones the user picks (`gh issue create` / `glab issue create`), and edit the report's reply on each MR/PR comment to link the new issue.
What I never do without asking
- Reply, dismiss, or implement based on Discuss items — list them, stop.
- Open follow-up issues for Out-of-scope items without confirming the list with the user first.
- Merge the PR/MR. Even when fully green, report ready-to-merge and let the user run the merge.
- Force-push, amend pushed commits, skip hooks (`--no-verify`), or bypass signing.
- Loop forever — if a cycle produces no new work and nothing is resolved, stop and report.