math-olympiad

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Math Olympiad Solver

数学奥赛解题器

The five things that change outcomes

决定解题效果的5个核心要点

  1. Strip thinking before verifying — a verifier that sees the reasoning is biased toward agreement. Fresh context, cleaned proof only.
  2. "Does this prove RH?" — if your theorem's specialization to ζ is a famous open problem, you have a gap. Most reliable red flag.
  3. Short proof → extract the general lemma — try 2×2 counterexamples. If general form is false, find what's special about THIS instance.
  4. Same gap twice → step back — the case split may be obscuring a unified argument. Three lines sometimes does what twelve pages couldn't.
  5. Say "no confident solution" — wrong-and-confident is worse than honest abstain.

Tool policy: Solvers and verifiers use THINKING ONLY in the tight-budget workflow. Competition math is reasoning. Computation is for deep mode (§6c), and even then bounded — a recurrence that's doubly-exponential can't be computed past n~30, work mod 2^m instead.

  1. 验证前清除思考痕迹——看过解题推理过程的验证者会倾向于认同原结论,因此仅向验证者提供全新上下文和整理后的证明内容即可。
  2. 「这能证明RH吗?」——如果你的定理专用于ζ函数时得到的是著名的公开问题,说明你的证明存在漏洞,这是最可靠的危险信号。
  3. 短证明→提取通用引理——尝试寻找2×2的反例,如果通用形式不成立,找到当前问题的特殊属性。
  4. 同一漏洞出现两次→退一步梳理——情况拆分可能会掩盖统一的论证逻辑,有时候三行就能写完十二页都没讲清的内容。
  5. 主动告知「无置信解决方案」——错误但笃定的结论比诚实弃权更糟糕。

工具使用规则:解题器和验证器在紧凑工作流中仅使用推理能力,竞赛数学核心是逻辑推理。计算仅可在深度模式(§6c)中使用,且必须有边界——双指数增长的递推式无法计算到n~30以上,可改用模2^m运算。

When to use which approach

不同问题的适配方案

ProblemApproachVerification
AIME numeric answerBest-of-N → majority voteAnswer check only
Olympiad proof (IMO/Putnam/USAMO)Full workflow below5-pass adversarial
"Is this proof correct?"Skip to verification (step 4)Adversarial + spec-gaming
Full problem set (e.g. all 6 from a competition)Sequential: one full workflow per problem, collect results, compile single PDFPer-problem adversarial
Batch in one Workflow: Set
opts.label
on every
agent()
call to include the problem ID (e.g.,
label: "P3:solver:2"
). Without labels, 36 results come back with no problem association. Run problems in parallel — the label is what matters, not ordering.
问题类型处理方案验证方式
AIME数值答案题最优N选→多数投票仅答案校验
奥赛证明题(IMO/Putnam/USAMO)下方完整工作流5轮对抗验证
「这个证明正确吗?」直接跳到验证环节(第4步)对抗验证+规范校验
完整题集(例如某次竞赛的全部6道题)串行处理:每道题走一遍完整工作流,收集结果,编译为单个PDF单题独立对抗验证
单工作流批量处理规则:每次调用
agent()
时设置
opts.label
来标注问题ID(例如
label: "P3:solver:2"
)。没有标签的话,36个返回结果无法对应到具体问题。可以并行处理多个问题,标签是关联依据,和处理顺序无关。

For a full problem set

完整题集处理流程

Launch one solver workflow per problem (same VERBATIM prompt, different statement). Run them in parallel. When all return, run adversarial verification per problem. Problems that pass get their proof in the PDF; problems that abstain get "No confident solution" with partial notes.
Don't try to solve all N problems in one agent's context — each problem needs its own thinking budget and its own fresh-context verifier. The composition is mechanical: collect the per-problem outputs, fill in LaTeX sections, compile once. | "Simplify this proof" | Skip to presentation (step 8) | — |

为每道题启动独立的解题工作流(使用完全相同的提示词,仅题目内容不同),并行运行。全部返回后,逐题执行对抗验证。验证通过的题目将证明放入PDF,弃权的题目标注「无置信解决方案」并附上部分解题记录。
不要尝试在单个Agent的上下文里解完全部N道题——每道题需要独立的计算资源和全新上下文的验证器。组合逻辑非常机械:收集单题输出,填充LaTeX章节,一次性编译即可。 | "简化这个证明" | 直接跳到呈现环节(第8步) | — |

The Workflow

完整工作流

1. Interpretation check (30 seconds, catches 50/63 of one class of errors)

1. 题意校验(30秒,可捕获一类错误中的50/63)

Before solving anything, identify the interpretation.
Read the problem statement. List 2-3 ways it could be interpreted. For each: is this reading TRIVIAL? If one reading makes the problem easy and another makes it hard, the hard one is almost certainly intended. State which interpretation you're solving and WHY you believe it's the intended one.
The Aletheia case study found 50 of 63 "technically correct" solutions were for the wrong interpretation. Olympiad problems often have a trap easy reading.
解题前先确认题意理解是否正确。
阅读题目描述,列出2-3种可能的解读方式。针对每种解读判断:这种理解下题目是不是非常简单?如果一种理解下题目很简单,另一种很难,几乎可以确定出题人要考的是难的那个版本。说明你选择的解读方式,以及你认为这是题意本意的原因。
Aletheia案例研究显示,63个「技术上正确」的解答里有50个解的是错误的题意。奥赛题目经常设置容易误读的陷阱。

2. Generate candidates with internal refinement (parallel, thinking only)

2. 生成候选方案并内部优化(并行运行,仅用推理能力)

Launch 8-12 attempt agents in parallel. Each agent internally iterates — solve → self-improve → self-verify → correct → repeat. This is the Yang-Huang structure that achieves 85.7% on IMO: one-shot solving isn't enough; per-attempt refinement matters.
The Agent tool cannot enforce tool restriction. Subagents get the full tool set. The only mechanism is the prompt. Use this prompt VERBATIM — do not summarize, do not synthesize your own:
NO COMPUTATION. Do not use Bash, Python, WebSearch, Read, Write, or any tool that runs code or fetches data. Numerical verification is not a proof step. "I computed n=1..10 and the pattern holds" is not a proof.

(If your agent harness requires a StructuredOutput or similar return-mechanism tool call, that is NOT a computation tool — call it to return your answer. The restriction is on tools that DO work, not tools that REPORT work.)

Your internal process (iterate until done):
- Solve: Complete rigorous solution.
- Self-improve: Reread. Fix gaps before a grader sees it.
- Self-verify: Strict grader mode. Every step justified?
- Correct: Fix and re-verify. Up to 5 rounds.
- Stop: Self-verify passes twice clean, OR 5 rounds, OR approach fundamentally wrong.

A correct answer from flawed reasoning is a failure. If incomplete, say so honestly. Never hide gaps.

PROBLEM: <insert the problem statement here>
ANGLE: <insert one starting angle here>
The first two paragraphs are load-bearing. A session that writes its own prompt and omits them will produce subagents that grind Python for 30 iterations and confidently get wrong answers — a pattern that fits n≤10 but fails at n=100 is not a proof.
Starting angles (vary across agents — see
references/solver_heuristics.md
):
  • Work out small cases (test past n=3)
  • Look for an invariant or monovariant
  • Consider the extremal case
  • Try induction
  • What symmetries?
  • Work backwards
  • Drop a condition — where does it become trivially false?
  • Generalize (inventor's paradox — more structure is sometimes easier)
Each returns its FINAL state (not intermediate rounds):
**Verdict**: complete solution | partial result | no progress
**Rounds**: [how many verify→correct cycles]
**Method**: [key idea, one paragraph]
**Detailed Solution**: [full step-by-step, every step justified]
**Answer**: [if applicable]
**Self-verification notes**: [what you caught and fixed; remaining concerns]
Retry policy: If an agent fails or times out, retry once. Transient failures happen.
并行启动8-12个尝试Agent,每个Agent内部迭代:解题→自我优化→自验证→修正→重复。这是Yang-Huang架构在IMO上达到85.7%正确率的核心:单次解题不够,每次尝试的迭代优化非常重要。
Agent工具无法强制工具限制,子Agent可以访问完整工具集,唯一的限制机制是提示词。请完全照搬以下提示词——不要总结,不要自行改写:
NO COMPUTATION. Do not use Bash, Python, WebSearch, Read, Write, or any tool that runs code or fetches data. Numerical verification is not a proof step. "I computed n=1..10 and the pattern holds" is not a proof.

(If your agent harness requires a StructuredOutput or similar return-mechanism tool call, that is NOT a computation tool — call it to return your answer. The restriction is on tools that DO work, not tools that REPORT work.)

Your internal process (iterate until done):
- Solve: Complete rigorous solution.
- Self-improve: Reread. Fix gaps before a grader sees it.
- Self-verify: Strict grader mode. Every step justified?
- Correct: Fix and re-verify. Up to 5 rounds.
- Stop: Self-verify passes twice clean, OR 5 rounds, OR approach fundamentally wrong.

A correct answer from flawed reasoning is a failure. If incomplete, say so honestly. Never hide gaps.

PROBLEM: <insert the problem statement here>
ANGLE: <insert one starting angle here>
前两段是核心约束,如果自行编写提示词省略了这部分,生成的子Agent会用Python跑30轮迭代,给出笃定但错误的答案——符合n≤10的规律但在n=100时失效的推导不能算作证明。
解题切入点(不同Agent使用不同切入点,参考
references/solver_heuristics.md
):
  • 推导小案例(至少测试到n>3)
  • 寻找不变量或单调变量
  • 考虑极端情况
  • 尝试归纳法
  • 寻找对称性
  • 反向推导
  • 去掉一个约束条件——看哪里会明显不成立
  • 泛化(发明者悖论:更多结构有时候反而更容易解题)
每个Agent返回最终状态(不要中间迭代过程):
**Verdict**: complete solution | partial result | no progress
**Rounds**: [how many verify→correct cycles]
**Method**: [key idea, one paragraph]
**Detailed Solution**: [full step-by-step, every step justified]
**Answer**: [if applicable]
**Self-verification notes**: [what you caught and fixed; remaining concerns]
重试规则:如果Agent失败或超时,重试一次,偶尔会出现临时故障。

3. Clean the solution (context isolation — the #1 lever)

3. 整理解题方案(上下文隔离——最重要的优化手段)

The thinking trace biases the verifier toward agreement — a long chain of reasoning reads as supporting evidence even when the conclusion is wrong. Before any verification, strip:
  • All thinking-block content
  • All "Let me try..." / "Actually wait..." / "Hmm" prose
  • All false starts and backtracking
What remains: problem statement + clean final argument only.
Extract only the Method + Proof + Answer sections from each solver's output. The verifier never sees how the solver got there.
解题思考痕迹会让验证者倾向于认同结论——哪怕结论是错的,长推理链也会被当作支持证据。验证前清除所有以下内容:
  • 所有思考块内容
  • 所有「我试试...」/「等等...」/「嗯」这类口语化内容
  • 所有错误尝试和回溯内容
仅保留:题目描述+整洁的最终论证内容。
从每个解题器的输出里仅提取方法+证明+答案部分,验证者完全看不到解题过程。

4. Adversarial verify (fresh context, pattern-armed)

4. 对抗验证(全新上下文,基于错误模式校验)

For each cleaned solution, launch a fresh verifier agent. Fresh context: it sees only (problem statement + cleaned solution). No tools.
The verifier's job is to ATTACK, not grade. Load
references/adversarial_prompts.md
for the prompts. The key patterns it runs:
PatternThe check
#4Does this theorem specialize to a famous object (ζ, quadratic reciprocity, etc.) and prove something open about it? → gap
#18Substitute the proof's own intermediate identities into any "remaining gap." Recover the original claim? → tautological
#40Is any step a "one-line lemma"? Extract the GENERAL form. Find a 2×2 counterexample. If the general form is false, find what special structure saves THIS instance
#5For each invoked theorem: re-check hypotheses FROM SCRATCH. "Continuous on [0,1]" ≠ "continuous on ℝ"
#6Any infinite sum "bounded" via a regularized value? Check the boundary — if there's a pole there, the sum diverges
Full pattern list:
references/verifier_patterns.md
Verifier returns:
**Verdict**: HOLDS | HOLE FOUND | UNCLEAR

**If HOLE FOUND**:
- Location: [quote the problematic step]
- Pattern: [which check fired, or "other"]
- Why it breaks: [specific]
- Fixable?: [yes with X / no, fundamental]
为每个整理后的解题方案启动全新的验证Agent,全新上下文:仅能看到(题目描述+整理后的解题方案),不能使用工具
验证者的任务是「攻击」证明,而不是打分。参考
references/adversarial_prompts.md
里的提示词,核心校验模式如下:
模式编号校验规则
#4这个定理是否专用于著名对象(ζ、二次互反律等)并证明了相关的公开问题?→存在漏洞
#18将证明的中间恒等式代入所有「剩余漏洞」,是否能还原原始结论?→存在同义反复
#40是否存在「一行引理」?提取通用形式,寻找2×2反例,如果通用形式不成立,找到当前案例的特殊支撑结构
#5对每个引用的定理:从零开始重新校验假设条件,「在[0,1]上连续」≠「在ℝ上连续」
#6有无穷和通过正则化值判定「有界」?校验边界——如果存在极点,这个和是发散的
完整模式列表见:
references/verifier_patterns.md
验证器返回结果:
**Verdict**: HOLDS | HOLE FOUND | UNCLEAR

**If HOLE FOUND**:
- Location: [quote the problematic step]
- Pattern: [which check fired, or "other"]
- Why it breaks: [specific]
- Fixable?: [yes with X / no, fundamental]

5. Rank and vote-verify (asymmetric + early exit)

5. 排序和投票验证(非对称阈值+提前终止)

Rank solutions by (verdict, verifier confidence). Take the top one. Run up to 5 fresh verifier agents.
Asymmetric thresholds: 4 HOLDS to confirm, 2 HOLE FOUND to refute. Why asymmetric: one flaky verifier shouldn't kill a correct proof; but two independent dissents is a real signal.
Pigeonhole early exit: stop launching verifiers once the outcome is decided.
  • 2 say HOLE FOUND → refuted, stop (save the remaining 3 calls)
  • 4 say HOLDS → confirmed, stop (save the 5th)
  • After 3 verifiers: if 2 HOLDS + 1 HOLE, launch 2 more (outcome undecided). If 3 HOLDS + 0 HOLE, launch 1 more (could still hit 4-1).
Dual context-isolation: each verifier is blind to (a) the solver's thinking trace — already stripped in step 3 — AND (b) other verifiers' verdicts. Each verifier thinks it's the first. No "3 agents already confirmed this" social proof.
A solver cannot verify its own solution. Different agent, fresh context.
按照(验证结论、验证者置信度)排序解题方案,取排名最高的方案,最多运行5个全新验证Agent。
非对称阈值:4个「成立」即可确认有效,2个「发现漏洞」即可判定无效。非对称设计的原因:单个不稳定的验证者不应该否定正确的证明,但两个独立的反对意见就是真实的错误信号。
鸽巢提前终止:结果确定后就停止启动新的验证器。
  • 2个验证器指出「发现漏洞」→判定无效,停止(节省剩下3次调用)
  • 4个验证器判定「成立」→确认有效,停止(节省第5次调用)
  • 3个验证器后:如果是2个「成立」+1个「漏洞」,再启动2个(结果未确定);如果是3个「成立」+0个「漏洞」,再启动1个(仍可能出现4-1的结果)。
双重上下文隔离:每个验证者都看不到(a)解题者的思考痕迹——第3步已经清除,以及(b)其他验证者的结论。每个验证者都认为自己是第一个校验的,不存在「3个Agent已经确认正确」的社会证明影响。
解题者不能验证自己的方案,必须使用不同Agent、全新上下文。

5b. When one case won't close — step back before grinding

5b. 某个情况无法闭合时——硬啃之前先退一步梳理

If a proof splits into cases and one case proves easily but the other resists: before grinding through the hard case, ask whether there's a route that makes the split disappear.
The pattern that saves you: the hard case's very hypothesis often implies something strong about an intermediate object you haven't looked at. Use that implication directly instead of the original chain.
Concrete shape: proving f(n) ≤ cn for a constrained function f, with a case split on a prime p dividing f(n). One branch closes by index arguments in (ℤ/p^e)*. The other branch resists — same group structure, but the arithmetic doesn't contradict. The fix: the hypothesis "p | f(n)" plugged back into the governing equation implies f(p) = p itself. Once you have that, a Fermat+Dirichlet argument kills both branches in three lines. The case split was a detour — it was splitting on a variable that, under the hypothesis, takes a known value.
Check when stuck on case B:
  • What does case B's hypothesis imply about f at other inputs?
  • Is there a different pair (a,b) to plug into the governing equation?
  • Are you proving too much? (A cleaner contradiction needs less machinery.)
This is also a presentation-pass win: the split-free proof is shorter AND more general.
如果证明拆分成多个情况,其中一个很容易证明,另一个一直卡住:硬啃难的情况之前,先想想有没有办法消除这个情况拆分
有效的解决模式:难的情况的假设往往隐含了你还没注意到的「中间对象」的强属性,直接用这个隐含属性而非原始推导链即可。
具体例子:证明约束函数f满足f(n) ≤ cn,按照整除f(n)的素数p拆分情况,其中一个分支用(ℤ/p^e)*的索引论证即可闭合,另一个分支一直卡住——组结构相同,但算术推导没有矛盾。解决方法:将「p | f(n)」的假设代入控制方程,可以推出f(p) = p本身,得出这个结论后,用Fermat+Dirichlet论证三行就能覆盖两个分支。情况拆分本身就是弯路——你拆分的变量在假设下已经是已知值了。
卡在情况B时检查:
  • 情况B的假设对f的其他输入有什么隐含属性?
  • 有没有其他(a,b)对可以代入控制方程?
  • 你是不是证明了太多内容?(更简洁的矛盾不需要那么多机制)
这也能优化最终呈现:无拆分的证明更短,也更通用。

6. Revise (if needed)

6. 修订(如有需要)

If verification finds a hole: launch a reviser agent. It gets (cleaned solution + verifier's hole report). STILL no access to the original thinking — the reviser works from the hole, not by rereading how you got there.
A verifier found this issue in the proof:
[hole report]

Fix the proof. If the hole is fundamental (the approach doesn't work), say so and return **Verdict: no confident solution** with what partial progress remains.

For any step you cannot fully close, mark it inline: [GAP: specific description of what remains]. Gaps in the proof text, not in a separate list — they're greppable and the next reviser knows exactly where to look.
Up to 3 revise cycles. Then re-run the vote on the revised proof.
If pattern #40 fired (one-line-proof-too-clean), the reviser gets a stronger brief — the Adversarial Brief template from
references/adversarial_prompts.md
§7. It forces a binary: "the general lemma is obviously false (here's a 2×2 counterexample) — so either find what's special about THIS case, or find where the proof breaks." Can't return "looks fine."
如果验证发现漏洞:启动修订Agent,它能拿到(整理后的解题方案+验证者的漏洞报告),仍然不能访问原始解题思考——修订者基于漏洞修正,而不是重读解题过程。
A verifier found this issue in the proof:
[hole report]

Fix the proof. If the hole is fundamental (the approach doesn't work), say so and return **Verdict: no confident solution** with what partial progress remains.

For any step you cannot fully close, mark it inline: [GAP: specific description of what remains]. Gaps in the proof text, not in a separate list — they're greppable and the next reviser knows exactly where to look.
最多3轮修订,之后重新对修订后的证明做投票验证。
如果触发了模式#40(一行证明太完美),给修订者更明确的要求——
references/adversarial_prompts.md
第7节的对抗提示模板,强制二选一:「通用引理明显不成立(这里有2×2反例)——要么找到当前案例的特殊属性,要么找到证明的漏洞」,不能返回「看起来没问题」。

6c. Deep mode (when tight-budget abstains)

6c. 深度模式(紧凑流程弃权时使用)

The standard workflow is tight-budget: 8 solvers, ~15 min, pure reasoning. When it abstains, the problem may need more time, not more capability.
Deep mode is a single focused agent with:
  • Unlimited time — no wall-clock pressure
  • Targeted computation allowed — modular arithmetic checks, small-case enumeration, symbolic verification of identities. NOT exploratory brute force or unbounded recursion.
  • The abstention reason as starting point — if verifiers found a specific gap, start there. If solvers never claimed complete, start from what they partially proved.
The archetype: a focused agent that gets the proven-so-far state plus "one case of Lemma 5 is open" — and finds a 3-line argument the case split was obscuring. Often under 10 minutes with almost no computation. Deep mode is about giving the problem sustained attention, not throwing compute at it.
What deep mode is NOT: open-ended exploration, literature search, looking up solutions, multi-day investigation. That's a different workflow (
math-research
). Deep mode is still "solve THIS problem yourself" — just without the clock.
NO WEB. NO LOOKUP. Deep mode may use Bash/Python for bounded computation, but NEVER WebFetch, WebSearch, or any network access. Finding the solution on AoPS or a blog is not solving the problem — it's cheating on an olympiad, and it teaches us nothing about the skill's actual capability. Put this at the TOP of the deep-mode prompt:
NO WEB ACCESS. Do not use WebFetch, WebSearch, or any tool that touches the internet. Do not look up this problem, its solution, or related problems. You are solving this yourself — the only allowed computation is local (Bash/Python for mod-k arithmetic, small-case enumeration n≤10, symbolic identity checks). If you invoke a web tool, the proof is void.
Computation bounds in deep mode (bug #8 lesson): A6's b_{n+1}=2b_n²+b_n+1 is doubly-exponential; b_99 has ~10^{2^98} digits. Never compute such objects exactly — work in ℤ/2^m, or track only v_p(·), or prove the recursion mod the quantity you care about. If a computation is running longer than 60 seconds, it's probably unbounded. Kill it and work symbolically.
Step 6d (not optional): After any ABSTAIN at the verify stage, automatically launch one deep-mode agent before writing the abstention into the output. Give it:
  • The problem statement
  • The best partial proof from tight-budget solvers
  • The verifier gap descriptions (what specifically didn't close)
  • The instruction: "NO WEB ACCESS — do not look up this problem or its solution. Bounded local computation allowed (mod 2^k, small cases n≤10, symbolic identity checks via Bash/Python only). 60-second computation limit. If n≤10 brute force reveals a pattern the tight-budget solvers missed, that pattern IS the proof structure."
The deep agent may find the construction the pure-reasoning solvers couldn't see. If it also abstains, THEN write the abstention. Do not skip this step — problems with √n or log n answers are often invisible to pure reasoning because the optimal structure is the asymmetric one.
Orchestrator self-restraint: The orchestrator itself must not web-search the problem "to help" the deep agent. If you're tempted to Fetch an AoPS thread "just to check the answer," don't — that contaminates the skill's output and misrepresents its capability.
标准工作流是紧凑预算:8个解题器,约15分钟,纯推理。如果弃权,说明问题可能需要更多时间,而不是更高的能力。
深度模式是单个专注的Agent,具备:
  • 无时间限制——没有时钟压力
  • 允许针对性计算:模运算校验、小案例枚举、恒等式符号验证,不允许探索性暴力破解或无边界递归
  • 以弃权原因作为切入点——如果验证者发现了特定漏洞,从那里开始;如果解题器从来没给出完整方案,从部分证明的内容开始
典型场景:专注的Agent拿到已证明的状态,加上「引理5的一个情况未证明」——就能找到情况拆分掩盖的3行论证,通常不到10分钟,几乎不需要计算。深度模式是给问题持续的注意力,而不是堆砌算力。
深度模式不允许:无限制探索、文献搜索、查答案、多日调研,这些属于另一个工作流(
math-research
)。深度模式仍然是「自己解这道题」——只是没有时间限制。
禁止联网,禁止查找资料。深度模式可以用Bash/Python做有限计算,但绝对不能用WebFetch、WebSearch或任何网络访问。在AoPS或博客上找到答案不算解题——这是奥赛作弊,也无法体现这个技能的真实能力。把这段放在深度模式提示词的最顶部:
NO WEB ACCESS. Do not use WebFetch, WebSearch, or any tool that touches the internet. Do not look up this problem, its solution, or related problems. You are solving this yourself — the only allowed computation is local (Bash/Python for mod-k arithmetic, small-case enumeration n≤10, symbolic identity checks). If you invoke a web tool, the proof is void.
深度模式的计算边界(漏洞#8的教训):A6的递推式b_{n+1}=2b_n²+b_n+1是双指数增长,b_99有~10^{2^98}位,永远不要精确计算这类对象——改用ℤ/2^m运算,或者仅跟踪v_p(·),或者证明你关心的量的模递推。如果计算运行超过60秒,大概率是无边界的,终止计算改用符号推导。
第6d步(非可选):验证阶段任何弃权后,自动启动一个深度模式Agent,再输出弃权结果。给它提供:
  • 题目描述
  • 紧凑流程解题器给出的最优部分证明
  • 验证者的漏洞描述(具体哪部分没闭合)
  • 指令:「禁止联网——不要查这道题或它的答案,仅允许有限本地计算(模2^k、n≤10的小案例、仅用Bash/Python做符号恒等式校验),计算上限60秒。如果n≤10的暴力枚举发现了纯推理解题器遗漏的规律,这个规律就是证明结构。」
深度Agent可能找到纯推理解题器没发现的构造,如果它也弃权,再输出弃权结果。不要跳过这一步——带√n或log n答案的问题纯推理往往看不到,因为最优结构是非对称的。
编排器自我约束:编排器本身不能搜题「帮助」深度Agent,如果你忍不住想抓个AoPS帖子「只是核对答案」,不要这么做——这会污染技能输出,也会错误展示它的真实能力。

7. Calibrated abstention

7. 校准弃权

If 3 revise cycles all fail: stop and admit it.
**Verdict**: no confident solution

**What was tried**: [approaches]
**What WAS proven**: [any lemma or partial result that survived verification]
**Where it breaks**: [the unfixed hole]
Do NOT guess. A wrong confident answer is worse than an honest "couldn't solve it." The metric that matters is CONDITIONAL accuracy — when you say "solved," are you right?
如果3轮修订都失败:停止,诚实告知
**Verdict**: no confident solution

**What was tried**: [approaches]
**What WAS proven**: [any lemma or partial result that survived verification]
**Where it breaks**: [the unfixed hole]
不要猜,错误的笃定答案比诚实的「解不出来」更糟糕。核心指标是条件准确率——你说「解出来」的时候,是不是真的对?

8. Presentation pass (after correctness is established)

8. 呈现优化(确认正确性之后)

A VERIFIED-CORRECT proof is often not a BEAUTIFUL proof. The order you discovered it is rarely the best order to present it. Launch a fresh presentation agent with the verified proof.
Load
references/presentation_prompts.md
. The agent asks:
  • What's the simplest way to say this?
  • Which lemmas should be inlined? Which deserve to stand alone?
  • Is anything OVERKILL? (constructing a double exponential when linear suffices)
  • Now that we know the answer, is there a 3-line hindsight proof?
Output: LaTeX-formatted proof. If
pdflatex
is available (
scripts/check_latex.sh
returns 0), also compile to PDF via
scripts/compile_pdf.sh
.

验证正确的证明往往不是优美的证明,你发现结论的顺序很少是最佳呈现顺序。给验证通过的证明启动全新的呈现Agent。
加载
references/presentation_prompts.md
,Agent会考虑:
  • 怎么说最简洁?
  • 哪些引理应该内联?哪些应该单独列出来?
  • 有没有过度设计?(够用的线性推导就不要用双指数构造)
  • 现在知道答案了,有没有事后诸葛亮的3行极简证明?
输出:LaTeX格式的证明。如果可用
pdflatex
scripts/check_latex.sh
返回0),通过
scripts/compile_pdf.sh
编译为PDF。

Model tier defaults

模型层级默认配置

Read
references/model_tier_defaults.md
for full details. Summary:
ModelSolversVerify passesAbstain afterPresentation
Haiku832 revise failsskip
Sonnet453 revise failsyes
Opus35 + full pattern sweep4 revise fails2 drafts, pick cleaner
Weaker models: more parallel attempts, faster abstention. Stronger models: deeper verification, more presentation effort.

查看
references/model_tier_defaults.md
获取完整细节,摘要如下:
模型解题器数量验证轮次弃权阈值呈现优化
Haiku832次修订失败跳过
Sonnet453次修订失败
Opus35 + 全模式扫描4次修订失败2版草稿,选更简洁的
越弱的模型:并行尝试越多,弃权越快。越强的模型:验证越深,呈现优化投入越多。

For numeric-answer problems (AIME-style)

数值答案题(AIME类型)处理规则

Skip the proof machinery. Run 5-7 solvers with varied approaches, take majority vote on the numeric answer. If no majority: verify the top 2 candidates by substitution.

跳过证明机制,运行5-7个用不同方法的解题器,对数值答案做多数投票。没有多数的话:代入校验前2个候选答案。

Key references

核心参考资料

  • references/verifier_patterns.md
    — the 12 adversarial checks
  • references/adversarial_prompts.md
    — ready-to-use verifier prompts
  • references/presentation_prompts.md
    — beautification prompts + LaTeX template
  • references/model_tier_defaults.md
    — per-model configuration

  • references/verifier_patterns.md
    ——12种对抗校验规则
  • references/adversarial_prompts.md
    ——开箱即用的验证提示词
  • references/presentation_prompts.md
    ——美化提示词+LaTeX模板
  • references/model_tier_defaults.md
    ——分模型配置

What makes this different from generic verify-and-refine

和通用验证优化方案的差异

  1. Dual context isolation: verifier is blind to (a) the solver's thinking trace — which biases toward agreement — and (b) other verifiers' verdicts — social proof also biases. Each verifier thinks it's first.
  2. Pattern-specific attacks: not "is this correct?" but "does this make the #40 mistake? the #4 mistake?" Specific beats generic. The 7-category refutation taxonomy gives the verifier a checklist.
  3. Asymmetric vote + pigeonhole exit: 4-to-confirm, 2-to-refute. One flaky verifier doesn't kill a correct proof; two dissents does. Stop launching verifiers once the outcome is decided — saves ~30% of verification cost on clear cases.
  4. Specification-gaming check first: explicitly asks "is this the intended interpretation?" before solving. The #1 failure mode in prior work (50/63 "correct" answers solved the wrong reading).
  5. Calibrated abstention: will say "no confident solution" with partial results. Optimizes conditional accuracy, not coverage.
  6. Presentation pass: correctness and elegance are separate steps. The presentation agent gets the VERIFIED proof and finds the cleanest way to say it.
  1. 双重上下文隔离:验证者看不到(a)解题者的思考痕迹——会导致认同偏差,以及(b)其他验证者的结论——社会证明也会导致偏差,每个验证者都认为自己是第一个校验的。
  2. 特定模式攻击:不是问「这个对吗」,而是问「有没有犯#40错误?#4错误?」,特定检查比通用检查效果好,7类反驳分类给验证者提供了检查清单。
  3. 非对称投票+鸽巢提前终止:4次确认生效,2次反对失效,单个不稳定的验证者不会否定正确证明,两次反对即可判定错误。结果确定后停止启动验证器——明确案例能节省约30%的验证成本。
  4. 规范作弊校验前置:解题前明确问「这是题意的本意吗?」,这是此前工作的第一大故障模式(50/63的「正确」答案解的是错误的题意)。
  5. 校准弃权:会主动给出「无置信解决方案」和部分结果,优化条件准确率而非覆盖率。
  6. 呈现优化环节:正确性和优美性是独立步骤,呈现Agent拿到验证后的证明,找到最简洁的表达方式。