Review All - two clean-agent code review
What this does, and why it has this shape
Two independent reviewers examine the same change, then their findings are
reconciled. The value is not "two reviews"; it is the agreement signal: when
both clean agents independently flag the same issue, confidence is high. When
only one flags something, it deserves scrutiny.
Four properties make the signal useful, and the procedure exists to
protect them:
- Clean context per reviewer. Each reviewer must start fresh, with no
memory of this orchestration or of each other, or they stop being
independent. Spawn two agents with clean context; if the agent API offers a
clean-context option, enable it. Do not paste either reviewer's output
into the other reviewer's prompt.
- Same target. Agreement only means something if both looked at the exact
same diff. Resolve the target once and hand the identical range
to both.
- Two complementary review paths. One clean agent runs Codex's native
`codex review --base "$base"` command. The other clean agent performs a
direct review using `references/review-guide.md`, which is based on Google's
"What to look for in a code review":
https://google.github.io/eng-practices/review/reviewer/looking-for.html
- Genuine parallelism. Spawn both agents before waiting for either result.
Do not serialize the reviews.
This skill is
review-only. Never pass
/
, never apply
patches, and never tell the user you are about to change code.
Not a PR-finishing loop
When the user asks to update code until review comments are handled and checks
pass, use the
workflow instead. This skill should be run once after
a batched fix, or at most once more after a second batched fix if valid P0-P2
findings remain. Do not rerun it
after each individual fix.
Procedure
Step 1 — Resolve the shared target (one base, one range)
The target is `"$base"...HEAD` (the merge-base diff of the current branch), so
both reviewers see exactly the commits this branch adds.
```bash
current=$(git rev-parse --abbrev-ref HEAD)

# Base precedence: user-provided --base > origin's default branch > main > master
base="${ARG_BASE:-}"   # whatever the user passed; may be empty or unset
if [ -z "$base" ]; then
  # origin/HEAD points at the remote's default branch, when it is set
  base=$(git symbolic-ref --quiet refs/remotes/origin/HEAD 2>/dev/null \
    | sed 's@^refs/remotes/origin/@@')
fi
[ -z "$base" ] && git show-ref --verify --quiet refs/heads/main && base=main
[ -z "$base" ] && git show-ref --verify --quiet refs/heads/master && base=master

echo "current=$current base=$base"
# Three-dot range: diff from the merge base of $base and HEAD up to HEAD
git diff --shortstat "$base"...HEAD
git diff --name-only "$base"...HEAD
```
Step 2 — Pre-flight (fail fast, don't waste a review)
Stop and tell the user plainly if any of these hold:
- Not inside a git repo.
- No base branch could be resolved → ask the user which base to diff against.
- The current branch is the base branch → there is nothing to compare; ask for a base.
- The diff is empty (`git diff --shortstat "$base"...HEAD` prints nothing) →
there are no committed changes to review. Remind the user this mode reviews
committed changes only; if their work is uncommitted, they should commit
first.
- Codex is not ready: does not report a logged-in account.
Report it and offer to run the Google-rubric review alone.
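The first four checks can be sketched as one function that prints the reason to stop, or nothing when the review may proceed. This is a minimal sketch under assumptions: `preflight_reason` and `run_preflight` are hypothetical names, the third check is read as "the current branch equals the base", and the Codex readiness check is left out because it depends on the Codex CLI.

```bash
# Hypothetical helper: print why the review must stop, or nothing if it can proceed.
# Arguments: in_repo ("true" or anything else), current branch, base branch,
# and the output of `git diff --shortstat`.
preflight_reason() {
  if [ "$1" != "true" ]; then
    echo "not inside a git repo"
  elif [ -z "$3" ]; then
    echo "no base branch resolved; ask which base to diff against"
  elif [ "$2" = "$3" ]; then
    echo "current branch is the base branch; ask for a base"
  elif [ -z "$4" ]; then
    echo "no committed changes to review; commit first"
  fi
}

# Hypothetical wrapper: gather the inputs from git, using $current and $base from Step 1.
run_preflight() {
  in_repo=$(git rev-parse --is-inside-work-tree 2>/dev/null || echo false)
  shortstat=""
  [ -n "$base" ] && shortstat=$(git diff --shortstat "$base"...HEAD 2>/dev/null)
  preflight_reason "$in_repo" "$current" "$base" "$shortstat"
}
```

Keeping the decision in a pure function makes each stop condition easy to check without a repository; any non-empty output is the message to relay to the user.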
Step 3 - Spawn BOTH clean agents in parallel
Use the agent/subagent facility available in the current environment. Start both
review agents before waiting. If the API exposes a clean-context option, enable
it for each agent. Give each agent only the repo path, resolved base, and
its task.
Agent A: Codex review-command agent
Task prompt:
You are a clean-context review-command runner. In the repo at <repo path>,
run Codex's native review command against the committed branch diff:
codex review --base <base>
This is review-only. Do not pass
or
, do not post anything
to GitHub, and do not modify files. Return the native command output and, if
possible, a normalized JSON array of findings:
{"file":"...","line":<int or null>,"priority":"P0|P1|P2|P3","category":"design|functionality|complexity|tests|naming|comments|consistency|documentation|security|other","title":"<one line>","description":"<evidence and impact>"}
.
If the command finds nothing, return `[]` after the raw output summary. If the
command fails, return the exact failure and stop.
Agent B: Google-rubric review agent
Read `references/review-guide.md` next to this skill file, then give the agent
this task with the full rubric pasted in:
You are an independent code reviewer with clean context. In the repo at
<repo path>, review only the committed changes in <base>...HEAD.
Apply this review rubric, based on Google's "What to look for in a code
review":
<paste the full contents of references/review-guide.md here>
Constraints: This is review-only. Do
not pass
or
, do
not post anything to GitHub, and do
not modify any files. Use system
context as a lens to judge the changed lines, but anchor every finding to the
diff (a changed line, or something the change should have touched but didn't,
like a missing test). Skip nitpicks a linter, formatter, typechecker, or
compiler would catch.
Return your findings to me as a JSON array and nothing else. Each finding:
{"file": "...", "line": <int or null>, "priority": "P0|P1|P2|P3", "category": "design|functionality|complexity|tests|naming|comments|consistency|documentation|security|other", "title": "<one line>", "description": "<why it's a problem, with evidence>"}
.
Assign `priority` per the rubric's P0–P3 scale. If you find nothing, return
`[]`. If the change does something notably well, you may add one finding with
priority
and category
titled "Good: …".
After both agents have been spawned, wait for their results.
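The ordering matters more than the mechanism: both reviewers start before either result is awaited. The agent API is environment-specific, so the following is only a shell analogy built on background jobs; `run_both` and the reviewer commands passed to it are hypothetical.

```bash
# Hypothetical sketch: start two reviewer commands, then wait for both.
# $1 and $2 name the commands (or shell functions) that run each review.
run_both() {
  out_a=$(mktemp)
  out_b=$(mktemp)
  "$1" >"$out_a" 2>&1 &   # reviewer A starts
  pid_a=$!
  "$2" >"$out_b" 2>&1 &   # reviewer B starts before A is awaited
  pid_b=$!
  # Only now block on the results, recording each exit status separately.
  if wait "$pid_a"; then status_a=0; else status_a=$?; fi
  if wait "$pid_b"; then status_b=0; else status_b=$?; fi
}
```

Recording the two statuses separately lets the orchestrator continue with one side when the other fails, and say so in the report.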
Step 4 — Collect both results
- Await the Google-rubric review agent's JSON.
- Await the Codex review-command agent's raw output and/or normalized JSON.
Parse the native Codex output semantically; don't rely on a rigid regex.
What the native reviewer's output often looks like:
- A preamble block (Codex version, workdir, model, and a dump of the diff and
the shell commands it ran) — skip all of it.
- The findings appear after a marker as a summary line followed by
and a list of entries shaped like `- [P2] <title> — <path>:<start>-<end>`,
with a description paragraph under each. Each entry is one finding.
- The findings block is often printed twice (streamed, then repeated as
the final message). Dedupe — it's the same findings, not new ones.
- Codex already tags each finding P0–P3; keep those labels as-is. It is the
same scale the Google-rubric reviewer uses, so no remapping is needed.
- Harmless / lines come from the
read-only sandbox; ignore them.
If one side fails (Codex errored, an agent returned nothing usable), continue
with whatever you have and say so explicitly in the report — a half review
clearly labeled beats a silent gap.
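Because the findings block is often printed twice, a loose first pass can collapse repeated entry lines before the semantic read. This sketch assumes entry lines begin with `- [P<n>] ` as described above; `unique_finding_headers` is a hypothetical helper and does not replace reading the output.

```bash
# Hypothetical helper: read the native reviewer's output on stdin and print
# each "- [P0..P3] ..." entry line once, in first-seen order.
unique_finding_headers() {
  grep -E '^- \[P[0-3]\] ' | awk '!seen[$0]++'
}
```

It only lists the entry headers; the description paragraphs still have to be read from the raw output and attached to each finding.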
Step 5 — Merge, dedupe, rank
Normalize both sides into the same finding shape, then reconcile:
- Dedupe by same file + overlapping/adjacent lines + same underlying issue
(semantic match, not string match — the two agents will word things
differently).
- Tag the source of every finding: Codex only, Google-rubric only, or both.
- Resolve priority for each merged finding: if both reviewers flagged it but
assigned different priorities, take the higher (more severe) one and note
the split.
- Rank primarily by priority (P0 → P3). Within a priority tier, list
findings both agents agree on first — independent agreement is the strongest
confidence signal this skill produces.
- Surface disagreement rather than hiding it: if the two agents conflict
on whether something is a bug, show both positions briefly. That tension is
often the most useful part of the report.
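The priority and ranking rules above are mechanical once the semantic matching is done. A minimal sketch, assuming findings have already been merged into tab-separated lines of priority, source, location, and title: that line format and the helper names are hypothetical, and only the `both` source label is taken from the report template.

```bash
# Hypothetical helper: of two priorities, print the more severe one (P0 is most severe).
higher_priority() {
  a=${1#P}
  b=${2#P}
  if [ "$a" -le "$b" ]; then echo "$1"; else echo "$2"; fi
}

# Hypothetical helper: rank tab-separated findings read from stdin
# (priority, source, location, title): by priority first, then with
# agreed ("both") findings ahead of single-reviewer ones within a tier.
rank_findings() {
  awk -F '\t' -v OFS='\t' '{ r = ($2 == "both") ? 0 : 1; print r, $0 }' \
    | LC_ALL=C sort -t "$(printf '\t')" -k2,2 -k1,1n \
    | cut -f2-
}
```

For example, `higher_priority P2 P1` prints `P1`, which is the value to record when the two reviewers split on severity.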
Step 6 — Present one unified report
Lead with a verdict, then a priority overview table, then findings grouped by
priority tier. Tag every finding with its priority, its source (Codex,
Google-rubric, or both), and its rubric dimension.
## Review-all: <current> vs <base> (<N> files, +<adds>/-<dels>)
**Verdict:** APPROVE / REQUEST CHANGES / COMMENT
**Overall correctness:** patch is correct / patch is incorrect
Codex review and Google-rubric review examined the same diff independently; <X>
findings agreed.
### Findings overview
| Priority | Count | Where | Summary |
| -------- | ----- | ----- | ------- |
| P0 | <n> | <file:line or —> | <short or "none"> |
| P1 | <n> | … | … |
| P2 | <n> | … | … |
| P3 | <n> | … | … |
### P0 ← show only tiers that have findings
1. [P0] **<title>** — `file:line` · _both_ · functionality
<merged description>
### P1
...
### P2
...
### P3
...
Derive the verdict from the priorities (same logic the kelos reviewer uses):
- Overall correctness is "patch is incorrect" if there's any P0 or P1
finding; otherwise "patch is correct". Ignore P2/P3 nits for this call.
- REQUEST CHANGES when there's a P0/P1; APPROVE when only P2/P3 (or
nothing); COMMENT when you genuinely need the author's input before
deciding.
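The two mechanical outcomes can be sketched as a function of the P0 and P1 counts. `derive_verdict` is a hypothetical helper; COMMENT is a judgment call about needing the author's input, so it is deliberately not derived here.

```bash
# Hypothetical helper: derive the two headline lines from the P0 and P1 counts.
# P2/P3 counts do not affect the outcome, so they are not passed in.
derive_verdict() {
  if [ $(( $1 + $2 )) -gt 0 ]; then
    echo "Verdict: REQUEST CHANGES"
    echo "Overall correctness: patch is incorrect"
  else
    echo "Verdict: APPROVE"
    echo "Overall correctness: patch is correct"
  fi
}
```

Calling `derive_verdict 0 1` yields the REQUEST CHANGES pair, while `derive_verdict 0 0` yields the APPROVE pair regardless of how many P2/P3 findings exist.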
Keep it tight: no emojis, cite `file:line`, mark agreed (`both`) findings clearly
since that's the highest-confidence signal, and don't pad single-model findings to
look like consensus. If both agents found nothing, say so and stop. A notable
strength may be a one-line "Good:" note under the lowest tier — matter-of-fact,
not flattery.
Notes & edge cases
- Committed changes only. Both review paths ignore
uncommitted/untracked files. If the user wants those reviewed, they must commit
first (a future mode could cover that case).
- Large diffs. Codex may take a while; that's exactly why it runs in its own
clean agent. Don't kill it early.
- Arguments. Accept an optional base override (e.g.
`review-all --base develop`). If none is given, auto-resolve per Step 1.
- Don't double-review. Both reviewers must get the identical range; never let
one drift to working-tree and the other to branch scope.
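The base override from the Arguments note can feed the `ARG_BASE` variable that Step 1 reads. A minimal sketch: `parse_base_arg` is a hypothetical helper, and the `--base=<name>` spelling is an assumption, since only `--base <name>` appears in the example above.

```bash
# Hypothetical sketch: extract an optional base override into ARG_BASE.
# The "--base=<name>" spelling is an assumption; "--base <name>" is the documented form.
parse_base_arg() {
  ARG_BASE=""
  while [ $# -gt 0 ]; do
    case "$1" in
      --base)
        ARG_BASE=${2:-}          # empty if no value follows the flag
        shift
        if [ $# -gt 0 ]; then shift; fi
        ;;
      --base=*)
        ARG_BASE=${1#--base=}
        shift
        ;;
      *)
        shift                    # ignore anything else in this sketch
        ;;
    esac
  done
}
```

After `parse_base_arg --base develop`, `ARG_BASE` holds `develop`; with no arguments it stays empty and Step 1 falls back to auto-resolution.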