codex-review

Cross-Model Code Review with Codex

Cross-model validation using Codex CLI's MCP tools. Claude writes code, Codex reviews it — different architecture, different training distribution, no self-approval bias.
Core insight: Single-model self-review is systematically biased. Cross-model review catches different bug classes because the reviewer has fundamentally different blind spots than the author.

The Two Codex Tools

The Codex CLI MCP server exposes two tools relevant for reviews:
| Tool | Best For | Key Constraint |
|------|----------|----------------|
| Codex review | Structured diff review with prioritized findings | prompt cannot combine with uncommitted: true |
| Codex codex | Freeform deep-dive on specific concerns | Requires explicit diff context in prompt |
Always pass these parameters:
  • model: "gpt-5.3-codex" — most capable coding model
  • reasoningEffort: "xhigh" — maximum depth (on the codex tool)
  • sandbox: "read-only" — the reviewer must never modify the working tree (on the codex tool)
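These defaults can be centralized in a small helper so no call forgets them. A minimal Python sketch, assuming a dict-based MCP call interface; build_codex_args is a hypothetical name, not part of the Codex CLI:

```python
# Hypothetical helper that assembles baseline arguments for a Codex MCP
# tool call. Parameter names mirror this document; the helper itself is
# illustrative and not a real Codex CLI API.
def build_codex_args(tool: str, **extra) -> dict:
    if tool not in ("review", "codex"):
        raise ValueError(f"unknown tool: {tool}")
    args = {"model": "gpt-5.3-codex"}
    if tool == "codex":
        # reasoningEffort and sandbox apply to the freeform codex tool.
        args["reasoningEffort"] = "xhigh"
        args["sandbox"] = "read-only"
    args.update(extra)
    return args
```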

Review Patterns

Pattern 1: Pre-PR Full Review (Default)

The standard review before opening a PR. Use for any non-trivial change.
Step 1 — Structured review (catches correctness + general issues):
  Codex review(
    base: "main",
    model: "gpt-5.3-codex",
    title: "Pre-PR Review"
  )

Step 2 — Security deep-dive (if code touches auth, input handling, or APIs):
  Codex codex(
    prompt: <security template from references/prompts.md>,
    model: "gpt-5.3-codex",
    reasoningEffort: "xhigh",
    sandbox: "read-only"
  )

Step 3 — Fix findings, then re-review:
  Codex review(
    base: "main",
    model: "gpt-5.3-codex",
    title: "Re-review after fixes"
  )
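The three steps can be driven from one function. A hedged sketch: call_tool and pre_pr_review are hypothetical stand-ins for however your MCP client dispatches tool calls, not real Codex APIs:

```python
# Sketch of the three-step pre-PR flow. call_tool is an assumed dispatch
# function (tool name + keyword arguments); nothing here is a real API.
def pre_pr_review(call_tool, touches_security: bool, security_prompt: str):
    results = []
    # Step 1: structured diff review against main.
    results.append(call_tool("review", base="main",
                             model="gpt-5.3-codex", title="Pre-PR Review"))
    # Step 2: security deep-dive only when the diff touches risky surfaces.
    if touches_security:
        results.append(call_tool("codex", prompt=security_prompt,
                                 model="gpt-5.3-codex",
                                 reasoningEffort="xhigh",
                                 sandbox="read-only"))
    # Step 3: after fixes land, confirm with a re-review.
    results.append(call_tool("review", base="main",
                             model="gpt-5.3-codex",
                             title="Re-review after fixes"))
    return results
```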

Pattern 2: Commit-Level Review

Quick check after each meaningful commit. Good for iterative development.
Codex review(
  commit: "<SHA>",
  model: "gpt-5.3-codex",
  title: "Commit review"
)

Pattern 3: WIP Check

Review uncommitted work mid-development. Catches issues before they're baked in.
Codex review(
  uncommitted: true,
  model: "gpt-5.3-codex",
  title: "WIP check"
)
Note: uncommitted: true cannot combine with a custom prompt.
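This constraint is easy to enforce mechanically before dispatching the call. A sketch with a hypothetical validate_review_args guard:

```python
# Guard encoding the documented constraint: a review call may set
# uncommitted: true or a custom prompt, but never both.
# Hypothetical helper, not part of the Codex CLI.
def validate_review_args(args: dict) -> dict:
    if args.get("uncommitted") and "prompt" in args:
        raise ValueError("uncommitted: true cannot combine with prompt")
    return args
```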

Pattern 4: Focused Investigation

Surgical deep-dive on a specific concern (error handling, concurrency, data flow).
Codex codex(
  prompt: "Analyze [specific concern] in the changes between main and HEAD.
           For each issue found: cite file and line, explain the risk,
           suggest a concrete fix. Confidence threshold: only flag issues
           you are >=70% confident about.",
  model: "gpt-5.3-codex",
  reasoningEffort: "xhigh",
  sandbox: "read-only"
)
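The bracketed placeholder can be filled by a small formatter so every investigation uses the same structure. Illustrative only; the template wording is taken from the call above, the function name is an assumption:

```python
# Formats the focused-investigation prompt for a given concern.
# Template text comes from this document; focused_prompt is illustrative.
def focused_prompt(concern: str, threshold: float = 0.7) -> str:
    return (
        f"Analyze {concern} in the changes between main and HEAD. "
        "For each issue found: cite file and line, explain the risk, "
        "suggest a concrete fix. Confidence threshold: only flag issues "
        f"you are >={int(threshold * 100)}% confident about."
    )
```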

Pattern 5: Ralph Loop (Implement-Review-Fix)

Iterative quality enforcement — implement, review, fix, repeat. Max 3 iterations.
Iteration 1:
  Claude → implement feature
  Codex → review(base: "main") → findings
  Claude → fix critical/high findings

Iteration 2:
  Codex → review(base: "main") → verify fixes + catch remaining
  Claude → fix remaining issues

Iteration 3 (final):
  Codex → review(base: "main") → clean bill of health
  (or accept known trade-offs and document them)

STOP after 3 iterations. Diminishing returns beyond this.
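The loop and its hard cap can be sketched as a driver function, with review and fix as stand-ins for the Codex review call and the fix step:

```python
# Ralph loop driver with a hard cap of three iterations. review() returns
# a list of findings (empty means clean); fix() applies fixes. Both are
# assumed callables, not real APIs.
def ralph_loop(review, fix, max_iterations: int = 3) -> int:
    for iteration in range(1, max_iterations + 1):
        findings = review()
        if not findings:
            return iteration          # clean bill of health
        fix(findings)
    return max_iterations             # stop: accept and document trade-offs
```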

Multi-Pass Strategy

For thorough reviews, run multiple focused passes instead of one vague pass. Each pass gets a specific persona and concern domain.
| Pass | Focus | Tool | Reasoning |
|------|-------|------|-----------|
| Correctness | Bugs, logic, edge cases, race conditions | review | xhigh |
| Security | OWASP Top 10, injection, auth, secrets | codex | xhigh |
| Architecture | Coupling, abstractions, API consistency | codex | xhigh |
| Performance | O(n^2), N+1 queries, memory leaks | codex | xhigh |
Run passes sequentially. Fix critical findings between passes to avoid noise compounding.
When to use multi-pass vs single-pass:

| Change Size | Strategy |
|-------------|----------|
| < 50 lines, single concern | Single review call |
| 50-300 lines, feature work | review + security codex pass |
| 300+ lines or architecture change | Full 4-pass |
| Security-sensitive (auth, payments, crypto) | Always include security pass |
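The sizing rules map naturally onto a lookup function. A sketch using the thresholds above; the name choose_strategy is illustrative:

```python
# Selects review passes by change size, per the sizing table.
# Thresholds (50 / 300 lines) come from this document.
def choose_strategy(lines_changed: int, security_sensitive: bool = False):
    if lines_changed >= 300:
        passes = ["correctness", "security", "architecture", "performance"]
    elif lines_changed >= 50:
        passes = ["correctness", "security"]
    else:
        passes = ["correctness"]
    # Security-sensitive changes always get a security pass, at any size.
    if security_sensitive and "security" not in passes:
        passes.append("security")
    return passes
```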

Decision Tree: Which Pattern?

Graphviz DOT:
digraph review_decision {
    rankdir=TB;
    node [shape=diamond];

    "What stage?" -> "Pre-commit" [label="writing code"];
    "What stage?" -> "Pre-PR" [label="ready to submit"];
    "What stage?" -> "Post-commit" [label="just committed"];
    "What stage?" -> "Investigating" [label="specific concern"];

    node [shape=box];
    "Pre-commit" -> "Pattern 3: WIP Check";
    "Pre-PR" -> "How big?";
    "Post-commit" -> "Pattern 2: Commit Review";
    "Investigating" -> "Pattern 4: Focused Investigation";

    "How big?" [shape=diamond];
    "How big?" -> "Pattern 1: Pre-PR Review" [label="< 300 lines"];
    "How big?" -> "Full Multi-Pass" [label=">= 300 lines"];
}
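The same decision tree, expressed as a function for scripting. Stage names and the 300-line threshold come from the graph; everything else is an assumption:

```python
# Maps development stage (and change size, for pre-PR) to a review pattern,
# following the decision tree in this document.
def choose_pattern(stage: str, lines_changed: int = 0) -> str:
    if stage == "pre-commit":
        return "Pattern 3: WIP Check"
    if stage == "post-commit":
        return "Pattern 2: Commit Review"
    if stage == "investigating":
        return "Pattern 4: Focused Investigation"
    if stage == "pre-pr":
        return ("Pattern 1: Pre-PR Review" if lines_changed < 300
                else "Full Multi-Pass")
    raise ValueError(f"unknown stage: {stage}")
```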

Prompt Engineering Rules

  1. Assign a persona — "senior security engineer" beats "review for security"
  2. Specify what to skip — "Skip formatting, naming style, minor docs gaps" prevents bikeshedding
  3. Require confidence scores — Only act on findings with confidence >= 0.7
  4. Demand file:line citations — Vague findings without location are not actionable
  5. Ask for concrete fixes — "Suggest a specific fix" not just "this is a problem"
  6. One domain per pass — Security-only, architecture-only. Mixing dilutes depth.
Ready-to-use prompt templates are in references/prompts.md.
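The six rules compose into a prompt template. A hedged sketch; the exact strings are illustrative, so adapt them to the real templates in references/prompts.md:

```python
# Builds a review prompt applying the six rules above. All wording is
# illustrative, not taken from references/prompts.md.
def build_review_prompt(persona, domain, skip, confidence=0.7):
    return "\n".join([
        f"You are a {persona}. Review ONLY for {domain}.",         # rules 1, 6
        f"Skip: {', '.join(skip)}.",                                # rule 2
        f"Only report findings with confidence >= {confidence}.",   # rule 3
        "Cite file:line for every finding.",                        # rule 4
        "Suggest a concrete fix for every finding.",                # rule 5
    ])
```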

Anti-Patterns

| Anti-Pattern | Why It Fails | Fix |
|--------------|--------------|-----|
| "Review this code" | Too vague — produces surface-level bikeshedding | Use specific domain prompts with persona |
| Single pass for everything | Context dilution — every dimension gets shallow treatment | Multi-pass with one concern per pass |
| Self-review (Claude reviews Claude's code) | Systematic bias — models approve their own patterns | Cross-model: Claude writes, Codex reviews |
| No confidence threshold | Noise floods signal — 0.3-confidence findings waste time | Only act on >= 0.7 confidence |
| Style comments in review | LLMs default to bikeshedding without explicit skip directives | "Skip: formatting, naming, minor docs" |
| > 3 review iterations | Diminishing returns, increasing noise, overbaking | Stop at 3. Accept trade-offs. |
| Review without project context | Generic advice disconnected from codebase conventions | Codex reads CLAUDE.md/AGENTS.md automatically |
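Two of these anti-patterns (no confidence threshold, style comments) can be filtered mechanically on the reviewer's output. A sketch assuming findings arrive as dicts with confidence and category fields, which is an assumption about the output shape:

```python
# Triages review findings: drops low-confidence and style-only items before
# acting. The dict fields "confidence" and "category" are assumed, not a
# documented Codex output format.
def triage(findings, threshold=0.7):
    style = {"formatting", "naming", "docs"}
    return [f for f in findings
            if f.get("confidence", 0) >= threshold
            and f.get("category") not in style]
```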

What This Skill is NOT

  • Not a replacement for human review. Cross-model review catches bugs but can't evaluate product direction or user experience.
  • Not a linter. Don't use Codex review for formatting or style — that's what linters are for.
  • Not infallible. 5-15% false positive rate is normal. Triage findings, don't blindly fix everything.
  • Not for self-approval. The whole point is cross-model validation. Don't use Claude to review Claude's code.

References

For ready-to-use prompt templates, see references/prompts.md.