codex-review

Cross-Model Code Review with Codex

Cross-model validation using Codex CLI's MCP tools. Claude writes code, Codex reviews it — different architecture, different training distribution, no self-approval bias.
Core insight: Single-model self-review is systematically biased. Cross-model review catches different bug classes because the reviewer has fundamentally different blind spots than the author.

The Two Codex Tools

The Codex CLI MCP server exposes two tools relevant for reviews:
| Tool | Best For | Key Constraint |
|------|----------|----------------|
| Codex review | Structured diff review with prioritized findings | prompt cannot combine with uncommitted: true |
| Codex codex | Freeform deep-dive on specific concerns | Requires explicit diff context in prompt |
Always pass these parameters:
  • model: "gpt-5.3-codex" — most capable coding model
  • reasoningEffort: "xhigh" — maximum depth (on the codex tool)
  • sandbox: "read-only" — the reviewer must never modify the working tree (on the codex tool)
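These defaults can be centralized in a small helper so no call forgets them. A minimal Python sketch, assuming a dict-based MCP call interface; build_codex_args is a hypothetical name, not part of the Codex CLI:

```python
# Hypothetical helper that assembles baseline arguments for a Codex MCP
# tool call. Parameter names mirror this document; the helper itself is
# illustrative and not a real Codex CLI API.
def build_codex_args(tool: str, **extra) -> dict:
    if tool not in ("review", "codex"):
        raise ValueError(f"unknown tool: {tool}")
    args = {"model": "gpt-5.3-codex"}
    if tool == "codex":
        # reasoningEffort and sandbox apply to the freeform codex tool.
        args["reasoningEffort"] = "xhigh"
        args["sandbox"] = "read-only"
    args.update(extra)
    return args
```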

Review Patterns

Pattern 1: Pre-PR Full Review (Default)

The standard review before opening a PR. Use for any non-trivial change.
Step 1 — Structured review (catches correctness + general issues):
  Codex review(
    base: "main",
    model: "gpt-5.3-codex",
    title: "Pre-PR Review"
  )

Step 2 — Security deep-dive (if code touches auth, input handling, or APIs):
  Codex codex(
    prompt: <security template from references/prompts.md>,
    model: "gpt-5.3-codex",
    reasoningEffort: "xhigh",
    sandbox: "read-only"
  )

Step 3 — Fix findings, then re-review:
  Codex review(
    base: "main",
    model: "gpt-5.3-codex",
    title: "Re-review after fixes"
  )
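The three steps can be driven from one function. A hedged sketch: call_tool and pre_pr_review are hypothetical stand-ins for however your MCP client dispatches tool calls, not real Codex APIs:

```python
# Sketch of the three-step pre-PR flow. call_tool is an assumed dispatch
# function (tool name + keyword arguments); nothing here is a real API.
def pre_pr_review(call_tool, touches_security: bool, security_prompt: str):
    results = []
    # Step 1: structured diff review against main.
    results.append(call_tool("review", base="main",
                             model="gpt-5.3-codex", title="Pre-PR Review"))
    # Step 2: security deep-dive only when the diff touches risky surfaces.
    if touches_security:
        results.append(call_tool("codex", prompt=security_prompt,
                                 model="gpt-5.3-codex",
                                 reasoningEffort="xhigh",
                                 sandbox="read-only"))
    # Step 3: after fixes land, confirm with a re-review.
    results.append(call_tool("review", base="main",
                             model="gpt-5.3-codex",
                             title="Re-review after fixes"))
    return results
```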

Pattern 2: Commit-Level Review

Quick check after each meaningful commit. Good for iterative development.
Codex review(
  commit: "<SHA>",
  model: "gpt-5.3-codex",
  title: "Commit review"
)

Pattern 3: WIP Check

Review uncommitted work mid-development. Catches issues before they're baked in.
Codex review(
  uncommitted: true,
  model: "gpt-5.3-codex",
  title: "WIP check"
)
Note: uncommitted: true cannot combine with a custom prompt.
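This constraint is easy to enforce mechanically before dispatching the call. A sketch with a hypothetical validate_review_args guard:

```python
# Guard encoding the documented constraint: a review call may set
# uncommitted: true or a custom prompt, but never both.
# Hypothetical helper, not part of the Codex CLI.
def validate_review_args(args: dict) -> dict:
    if args.get("uncommitted") and "prompt" in args:
        raise ValueError("uncommitted: true cannot combine with prompt")
    return args
```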

Pattern 4: Focused Investigation

Surgical deep-dive on a specific concern (error handling, concurrency, data flow).
Codex codex(
  prompt: "Analyze [specific concern] in the changes between main and HEAD.
           For each issue found: cite file and line, explain the risk,
           suggest a concrete fix. Confidence threshold: only flag issues
           you are >=70% confident about.",
  model: "gpt-5.3-codex",
  reasoningEffort: "xhigh",
  sandbox: "read-only"
)
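The bracketed placeholder can be filled by a small formatter so every investigation uses the same structure. Illustrative only; the template wording is taken from the call above, the function name is an assumption:

```python
# Formats the focused-investigation prompt for a given concern.
# Template text comes from this document; focused_prompt is illustrative.
def focused_prompt(concern: str, threshold: float = 0.7) -> str:
    return (
        f"Analyze {concern} in the changes between main and HEAD. "
        "For each issue found: cite file and line, explain the risk, "
        "suggest a concrete fix. Confidence threshold: only flag issues "
        f"you are >={int(threshold * 100)}% confident about."
    )
```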

Pattern 5: Ralph Loop (Implement-Review-Fix)

Iterative quality enforcement — implement, review, fix, repeat. Max 3 iterations.
Iteration 1:
  Claude → implement feature
  Codex → review(base: "main") → findings
  Claude → fix critical/high findings

Iteration 2:
  Codex → review(base: "main") → verify fixes + catch remaining
  Claude → fix remaining issues

Iteration 3 (final):
  Codex → review(base: "main") → clean bill of health
  (or accept known trade-offs and document them)

STOP after 3 iterations. Diminishing returns beyond this.
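The loop and its hard cap can be sketched as a driver function, with review and fix as stand-ins for the Codex review call and the fix step:

```python
# Ralph loop driver with a hard cap of three iterations. review() returns
# a list of findings (empty means clean); fix() applies fixes. Both are
# assumed callables, not real APIs.
def ralph_loop(review, fix, max_iterations: int = 3) -> int:
    for iteration in range(1, max_iterations + 1):
        findings = review()
        if not findings:
            return iteration          # clean bill of health
        fix(findings)
    return max_iterations             # stop: accept and document trade-offs
```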

Multi-Pass Strategy

For thorough reviews, run multiple focused passes instead of one vague pass. Each pass gets a specific persona and concern domain.
| Pass | Focus | Tool | Reasoning |
|------|-------|------|-----------|
| Correctness | Bugs, logic, edge cases, race conditions | review | xhigh |
| Security | OWASP Top 10, injection, auth, secrets | codex | xhigh |
| Architecture | Coupling, abstractions, API consistency | codex | xhigh |
| Performance | O(n^2), N+1 queries, memory leaks | codex | xhigh |
Run passes sequentially. Fix critical findings between passes to avoid noise compounding.
When to use multi-pass vs single-pass:

| Change Size | Strategy |
|-------------|----------|
| < 50 lines, single concern | Single review call |
| 50-300 lines, feature work | review + security codex pass |
| 300+ lines or architecture change | Full 4-pass |
| Security-sensitive (auth, payments, crypto) | Always include security pass |
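The sizing rules map naturally onto a lookup function. A sketch using the thresholds above; the name choose_strategy is illustrative:

```python
# Selects review passes by change size, per the sizing table.
# Thresholds (50 / 300 lines) come from this document.
def choose_strategy(lines_changed: int, security_sensitive: bool = False):
    if lines_changed >= 300:
        passes = ["correctness", "security", "architecture", "performance"]
    elif lines_changed >= 50:
        passes = ["correctness", "security"]
    else:
        passes = ["correctness"]
    # Security-sensitive changes always get a security pass, at any size.
    if security_sensitive and "security" not in passes:
        passes.append("security")
    return passes
```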

Decision Tree: Which Pattern?

Graphviz DOT:
digraph review_decision {
    rankdir=TB;
    node [shape=diamond];

    "What stage?" -> "Pre-commit" [label="writing code"];
    "What stage?" -> "Pre-PR" [label="ready to submit"];
    "What stage?" -> "Post-commit" [label="just committed"];
    "What stage?" -> "Investigating" [label="specific concern"];

    node [shape=box];
    "Pre-commit" -> "Pattern 3: WIP Check";
    "Pre-PR" -> "How big?";
    "Post-commit" -> "Pattern 2: Commit Review";
    "Investigating" -> "Pattern 4: Focused Investigation";

    "How big?" [shape=diamond];
    "How big?" -> "Pattern 1: Pre-PR Review" [label="< 300 lines"];
    "How big?" -> "Full Multi-Pass" [label=">= 300 lines"];
}
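The same decision tree, expressed as a function for scripting. Stage names and the 300-line threshold come from the graph; everything else is an assumption:

```python
# Maps development stage (and change size, for pre-PR) to a review pattern,
# following the decision tree in this document.
def choose_pattern(stage: str, lines_changed: int = 0) -> str:
    if stage == "pre-commit":
        return "Pattern 3: WIP Check"
    if stage == "post-commit":
        return "Pattern 2: Commit Review"
    if stage == "investigating":
        return "Pattern 4: Focused Investigation"
    if stage == "pre-pr":
        return ("Pattern 1: Pre-PR Review" if lines_changed < 300
                else "Full Multi-Pass")
    raise ValueError(f"unknown stage: {stage}")
```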

Prompt Engineering Rules

  1. Assign a persona — "senior security engineer" beats "review for security"
  2. Specify what to skip — "Skip formatting, naming style, minor docs gaps" prevents bikeshedding
  3. Require confidence scores — Only act on findings with confidence >= 0.7
  4. Demand file:line citations — Vague findings without location are not actionable
  5. Ask for concrete fixes — "Suggest a specific fix" not just "this is a problem"
  6. One domain per pass — Security-only, architecture-only. Mixing dilutes depth.
Ready-to-use prompt templates are in references/prompts.md.
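The six rules compose into a prompt template. A hedged sketch; the exact strings are illustrative, so adapt them to the real templates in references/prompts.md:

```python
# Builds a review prompt applying the six rules above. All wording is
# illustrative, not taken from references/prompts.md.
def build_review_prompt(persona, domain, skip, confidence=0.7):
    return "\n".join([
        f"You are a {persona}. Review ONLY for {domain}.",         # rules 1, 6
        f"Skip: {', '.join(skip)}.",                                # rule 2
        f"Only report findings with confidence >= {confidence}.",   # rule 3
        "Cite file:line for every finding.",                        # rule 4
        "Suggest a concrete fix for every finding.",                # rule 5
    ])
```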

Anti-Patterns

| Anti-Pattern | Why It Fails | Fix |
|--------------|--------------|-----|
| "Review this code" | Too vague — produces surface-level bikeshedding | Use specific domain prompts with persona |
| Single pass for everything | Context dilution — every dimension gets shallow treatment | Multi-pass with one concern per pass |
| Self-review (Claude reviews Claude's code) | Systematic bias — models approve their own patterns | Cross-model: Claude writes, Codex reviews |
| No confidence threshold | Noise floods signal — 0.3-confidence findings waste time | Only act on >= 0.7 confidence |
| Style comments in review | LLMs default to bikeshedding without explicit skip directives | "Skip: formatting, naming, minor docs" |
| > 3 review iterations | Diminishing returns, increasing noise, overbaking | Stop at 3. Accept trade-offs. |
| Review without project context | Generic advice disconnected from codebase conventions | Codex reads CLAUDE.md/AGENTS.md automatically |
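Two of these anti-patterns (no confidence threshold, style comments) can be filtered mechanically on the reviewer's output. A sketch assuming findings arrive as dicts with confidence and category fields, which is an assumption about the output shape:

```python
# Triages review findings: drops low-confidence and style-only items before
# acting. The dict fields "confidence" and "category" are assumed, not a
# documented Codex output format.
def triage(findings, threshold=0.7):
    style = {"formatting", "naming", "docs"}
    return [f for f in findings
            if f.get("confidence", 0) >= threshold
            and f.get("category") not in style]
```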

What This Skill is NOT

  • Not a replacement for human review. Cross-model review catches bugs but can't evaluate product direction or user experience.
  • Not a linter. Don't use Codex review for formatting or style — that's what linters are for.
  • Not infallible. 5-15% false positive rate is normal. Triage findings, don't blindly fix everything.
  • Not for self-approval. The whole point is cross-model validation. Don't use Claude to review Claude's code.

References

For ready-to-use prompt templates, see references/prompts.md.