ablation-planner
Ablation Planner
Systematically design ablation studies that answer the questions reviewers will ask. Codex leads the design (reviewer perspective), CC reviews feasibility and implements.
Context: $ARGUMENTS
When to Use
- Main results pass /result-to-claim with claim_supported = yes or partial
- User explicitly requests ablation planning
- /auto-review-loop reviewer identifies missing ablations
Workflow
Step 1: Prepare Context
CC reads available project files to build the full picture:
- Method description and components (from docs/research_contract.md or project CLAUDE.md)
- Current experiment results (from EXPERIMENT_LOG.md, EXPERIMENT_TRACKER.md, or W&B)
- Confirmed and intended claims (from result-to-claim output or project notes)
- Available compute resources (from CLAUDE.md server config, if present)
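The context-gathering step above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function name `gather_context`, the category keys, and the "first existing file wins" fallback order are assumptions. Only the file names come from the list above, and W&B is left out because querying it needs project-specific credentials.

```python
from pathlib import Path

# File names are taken from the list above. The grouping into categories
# and the fallback order inside each group are assumptions for illustration.
CONTEXT_SOURCES = {
    "method": ["docs/research_contract.md", "CLAUDE.md"],
    "results": ["EXPERIMENT_LOG.md", "EXPERIMENT_TRACKER.md"],
}


def gather_context(project_root: str) -> dict[str, str]:
    """Return the text of the first existing file for each context category."""
    root = Path(project_root)
    context: dict[str, str] = {}
    for category, candidates in CONTEXT_SOURCES.items():
        for relative_path in candidates:
            path = root / relative_path
            if path.is_file():
                context[category] = path.read_text(encoding="utf-8")
                break  # stop at the first file found for this category
    return context
```

Categories with no matching file are simply absent from the result, so the caller can see which parts of the picture are still missing before calling Codex.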
Step 2: Codex Designs Ablations
mcp__codex__codex:
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    You are a rigorous ML reviewer planning ablation studies.
    Given this method and results, design ablations that:
    1. Isolate the contribution of each novel component
    2. Answer questions reviewers will definitely ask
    3. Test sensitivity to key hyperparameters
    4. Compare against natural alternative design choices

    Method: [description from project files]
    Components: [list of removable/replaceable components]
    Current results: [key metrics from experiments]
    Claims: [what we claim and current evidence]

    For each ablation, specify:
    - name: what to change (e.g., "remove module X", "replace Y with Z")
    - what_it_tests: the specific question this answers
    - expected_if_component_matters: what we predict if the component is important
    - priority: 1 (must-run) to 5 (nice-to-have)

    Also provide:
    - coverage_assessment: what reviewer questions these ablations answer
    - unnecessary_ablations: experiments that seem useful but won't add insight
    - suggested_order: run order optimized for maximum early information
    - estimated_compute: total GPU-hours estimate
Step 3: Parse Ablation Plan
Normalize the Codex response into a structured format:
Ablation Plan
Component Ablations (highest priority)
| # | Name | What It Tests | Expected If Matters | Priority |
|---|---|---|---|---|
| 1 | remove module X | contribution of X | performance drops on metric Y | 1 |
| 2 | replace X with simpler Z | value of learned vs fixed | drops, especially on dataset A | 2 |
Hyperparameter Sensitivity
| # | Parameter | Values to Test | What It Tests | Priority |
|---|---|---|---|---|
| 3 | lambda | [0.01, 0.1, 1.0] | sensitivity to regularization | 3 |
Design Choice Comparisons
| # | Name | What It Tests | Priority |
|---|---|---|---|
| 4 | joint vs separate matching | whether joint adds value | 4 |
Coverage Assessment
[What reviewer questions these ablations answer]
Unnecessary Ablations
[Experiments that seem useful but won't add insight — skip these]
Run Order
[Optimized for maximum early information]
Estimated Compute
[Total GPU-hours]
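As a rough illustration of the normalization step, the sketch below renders the component-ablation table from structured records. The `Ablation` dataclass and `render_component_table` are hypothetical names; the fields mirror the ones requested in the Codex prompt, and how the raw Codex response gets parsed into these records is left open.

```python
from dataclasses import dataclass


@dataclass
class Ablation:
    # Field names mirror the per-ablation fields requested in the prompt.
    name: str
    what_it_tests: str
    expected_if_component_matters: str
    priority: int  # 1 (must-run) to 5 (nice-to-have)


def render_component_table(ablations: list[Ablation]) -> str:
    """Render ablations as a markdown table, lowest priority number first."""
    header = "| # | Name | What It Tests | Expected If Matters | Priority |"
    divider = "|---|---|---|---|---|"
    ordered = sorted(ablations, key=lambda a: a.priority)
    rows = [
        f"| {index} | {a.name} | {a.what_it_tests} | "
        f"{a.expected_if_component_matters} | {a.priority} |"
        for index, a in enumerate(ordered, start=1)
    ]
    return "\n".join([header, divider, *rows])
```

Sorting by priority before numbering keeps the must-run ablations at the top of the table regardless of the order in which Codex listed them.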
Step 4: CC Reviews Feasibility
Before running anything, CC checks:
- Compute budget: can we afford all ablations with available GPUs?
- Code changes: which ablations need code modifications vs config-only changes?
- Dependencies: which ablations can run in parallel?
- Cuts: if budget is tight, propose removing lower-priority ablations and ask Codex to confirm
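The budget check and the proposed cuts could look something like the sketch below. It is a simple greedy pass in priority order, which is an assumption rather than the skill's prescribed method; `propose_cuts` and the `(name, priority, gpu_hours)` tuple layout are hypothetical. The returned cut list is only a proposal to send back to Codex for confirmation.

```python
def propose_cuts(
    ablations: list[tuple[str, int, float]], budget_gpu_hours: float
) -> tuple[list[str], list[str]]:
    """Split ablations into (keep, cut) under a GPU-hour budget.

    Each ablation is a (name, priority, gpu_hours) tuple, where priority 1
    is must-run. Ablations are considered in priority order and kept while
    they still fit in the remaining budget.
    """
    keep: list[str] = []
    cut: list[str] = []
    remaining = budget_gpu_hours
    for name, _priority, gpu_hours in sorted(ablations, key=lambda a: a[1]):
        if gpu_hours <= remaining:
            keep.append(name)
            remaining -= gpu_hours
        else:
            cut.append(name)  # proposed cut, pending Codex confirmation
    return keep, cut
```

For example, with a 20 GPU-hour budget, a 10-hour priority-1 ablation and a 5-hour priority-4 comparison both fit, while a 30-hour priority-3 sweep ends up on the proposed cut list.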
Step 5: Implement and Run
- Create configs/scripts for each ablation (config-only changes first)
- Smoke test each ablation before full run
- Run in suggested order, using descriptive names (e.g., ablation-no-module-X)
- Track results in EXPERIMENT_LOG.md
- After all ablations complete → update findings.md with insights
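One possible way to derive descriptive run names and log entries is sketched below. The helpers `run_name` and `log_line` are hypothetical, and the log line layout is an assumption, since the document does not specify the format of EXPERIMENT_LOG.md.

```python
import re


def run_name(short_description: str) -> str:
    """Turn a short description into a name like ablation-no-module-X."""
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", short_description).strip("-")
    return f"ablation-{slug}"


def log_line(short_description: str, metric: str, value: float) -> str:
    """Format one result entry; the layout is illustrative only."""
    return f"- {run_name(short_description)}: {metric} = {value:.4f}"
```

Keeping the name derivation in one function means the config file, the run directory, and the log entry for an ablation all share the same identifier.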
Rules
- Codex leads the design. CC does not pre-filter or bias the ablation list before Codex sees it. Codex thinks like a reviewer; CC thinks like an engineer.
- Every ablation must have a clear what_it_tests and expected_if_component_matters. No "just try it" experiments.
- Config-only ablations take priority over those needing code changes (faster, less error-prone).
- If total compute exceeds budget, CC proposes cuts and asks Codex to re-prioritize — don't silently drop ablations.
- Component ablations (remove/replace) take priority over hyperparameter sweeps.
- Do not generate ablations for components identical to the baseline (no-op ablations).
- Record all ablation results in EXPERIMENT_LOG.md, including negative results (component removal had no effect = important finding).
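Two of the rules above lend themselves to a mechanical check before anything is run. The sketch below assumes each ablation is a plain dict; `rule_violations` and the `identical_to_baseline` flag are hypothetical names introduced for illustration, while the two required field names come from the rules themselves.

```python
# Fields every ablation must fill in, per the rules above.
REQUIRED_FIELDS = ("what_it_tests", "expected_if_component_matters")


def rule_violations(ablation: dict) -> list[str]:
    """Return human-readable rule violations for one ablation spec."""
    problems: list[str] = []
    for field in REQUIRED_FIELDS:
        # Treat missing, empty, and whitespace-only values the same way.
        if not str(ablation.get(field, "")).strip():
            problems.append(f"missing {field}")
    # identical_to_baseline is a hypothetical flag set during feasibility review.
    if ablation.get("identical_to_baseline", False):
        problems.append("no-op ablation: identical to baseline")
    return problems
```

An empty list means the spec passes these two checks; the remaining rules (who leads the design, how cuts are negotiated) are process rules and are not something a validator like this can enforce.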