research-paper-writing

Research Paper Writing Pipeline

End-to-end pipeline for producing publication-ready ML/AI research papers targeting NeurIPS, ICML, ICLR, ACL, AAAI, and COLM. This skill covers the full research lifecycle: experiment design, execution, monitoring, analysis, paper writing, review, revision, and submission.

This is not a linear pipeline — it is an iterative loop. Results trigger new experiments. Reviews trigger new analysis. The agent must handle these feedback loops.

┌─────────────────────────────────────────────────────────────┐
│                    RESEARCH PAPER PIPELINE                  │
│                                                             │
│  Phase 0: Project Setup ──► Phase 1: Literature Review      │
│       │                          │                          │
│       ▼                          ▼                          │
│  Phase 2: Experiment     Phase 5: Paper Drafting ◄──┐      │
│       Design                     │                   │      │
│       │                          ▼                   │      │
│       ▼                    Phase 6: Self-Review      │      │
│  Phase 3: Execution &           & Revision ──────────┘      │
│       Monitoring                 │                          │
│       │                          ▼                          │
│       ▼                    Phase 7: Submission               │
│  Phase 4: Analysis ─────► (feeds back to Phase 2 or 5)     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

When To Use This Skill

Use this skill when:

- Starting a new research paper from an existing codebase or idea
- Designing and running experiments to support paper claims
- Writing or revising any section of a research paper
- Preparing for submission to a specific conference or workshop
- Responding to reviews with additional experiments or revisions
- Converting a paper between conference formats
- Writing non-empirical papers: theory, survey, benchmark, or position papers (see Paper Types Beyond Empirical ML)
- Designing human evaluations for NLP, HCI, or alignment research
- Preparing post-acceptance deliverables: posters, talks, code releases

Core Philosophy

Be proactive. Deliver complete drafts, not questions. Scientists are busy — produce something concrete they can react to, then iterate.
Never hallucinate citations. AI-generated citations have a ~40% error rate. Always fetch programmatically. Mark unverifiable citations as `[CITATION NEEDED]`.
Paper is a story, not a collection of experiments. Every paper needs one clear contribution stated in a single sentence. If you can't do that, the paper isn't ready.
Experiments serve claims. Every experiment must explicitly state which claim it supports. Never run experiments that don't connect to the paper's narrative.
Commit early, commit often. Every completed experiment batch, every paper draft update — commit with descriptive messages. Git log is the experiment history.

Proactivity and Collaboration

Default: Be proactive. Draft first, ask with the draft.

| Confidence Level | Action |
|---|---|
| High (clear repo, obvious contribution) | Write full draft, deliver, iterate on feedback |
| Medium (some ambiguity) | Write draft with flagged uncertainties, continue |
| Low (major unknowns) | Ask 1-2 targeted questions via `clarify`, then draft |

| Section | Draft Autonomously? | Flag With Draft |
|---|---|---|
| Abstract | Yes | "Framed contribution as X; adjust if needed" |
| Introduction | Yes | "Emphasized problem Y; correct if wrong" |
| Methods | Yes | "Included details A, B, C; add missing pieces" |
| Experiments | Yes | "Highlighted results 1, 2, 3; reorder if needed" |
| Related Work | Yes | "Cited papers X, Y, Z; add any I missed" |

Block for input only when: target venue unclear, multiple contradictory framings, results seem incomplete, explicit request to review first.

Phase 0: Project Setup

Goal: Establish the workspace, understand existing work, identify the contribution.

Step 0.1: Explore the Repository

```bash
# Understand project structure
ls -la
find . -name "*.py" | head -30
find . -name "*.md" -o -name "*.txt" | xargs grep -l -i -E "result|conclusion|finding"
```

Look for:
- `README.md`: project overview and claims
- `results/`, `outputs/`, `experiments/`: existing findings
- `configs/`: experimental settings
- `.bib` files: existing citations
- Draft documents or notes

Step 0.2: Organize the Workspace

Establish a consistent workspace structure:

workspace/
  paper/               # LaTeX source, figures, compiled PDFs
  experiments/         # Experiment runner scripts
  code/                # Core method implementation
  results/             # Raw experiment results (auto-generated)
  tasks/               # Task/benchmark definitions
  human_eval/          # Human evaluation materials (if needed)

Step 0.3: Set Up Version Control

```bash
git init  # if not already
git remote add origin <repo-url>
git checkout -b paper-draft  # or main
```

Git discipline: Every completed experiment batch gets committed with a descriptive message. Example:

Add Monte Carlo constrained results (5 runs, Sonnet 4.6, policy memo task)
Add Haiku baseline comparison: autoreason vs refinement baselines at cheap model tier

Step 0.4: Identify the Contribution

Before writing anything, articulate:

The What: What is the single thing this paper contributes?
The Why: What evidence supports it?
The So What: Why should readers care?

Propose to the scientist: "Based on my understanding, the main contribution is: [one sentence]. The key results show [Y]. Is this the framing you want?"

Step 0.5: Create a TODO List

Use the `todo` tool to create a structured project plan:

Research Paper TODO:
- [ ] Define one-sentence contribution
- [ ] Literature review (related work + baselines)
- [ ] Design core experiments
- [ ] Run experiments
- [ ] Analyze results
- [ ] Write first draft
- [ ] Self-review (simulate reviewers)
- [ ] Revise based on review
- [ ] Submission prep

Update this throughout the project. It serves as the persistent state across sessions.

Step 0.6: Estimate Compute Budget

Before running experiments, estimate total cost and time:

Compute Budget Checklist:
- [ ] API costs: (model price per token) × (estimated tokens per run) × (number of runs)
- [ ] GPU hours: (time per experiment) × (number of experiments) × (number of seeds)
- [ ] Human evaluation costs: (annotators) × (hours) × (hourly rate)
- [ ] Total budget ceiling and contingency (add 30-50% for reruns)
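The checklist arithmetic can be sketched in a few lines. This is a minimal illustration, not part of the original skill: the function names are hypothetical, prices are assumed to be quoted per million tokens, and input and output tokens are priced separately, which is an assumption about how the provider bills.

```python
def estimate_api_cost(price_in_per_mtok, price_out_per_mtok,
                      in_tokens_per_run, out_tokens_per_run,
                      num_runs, contingency=0.3):
    """Estimate API spend in USD (hypothetical helper).

    Prices are per million tokens. contingency is the fraction added
    for reruns (the checklist suggests 0.3 to 0.5).
    """
    per_run = (in_tokens_per_run * price_in_per_mtok
               + out_tokens_per_run * price_out_per_mtok)
    base = num_runs * per_run / 1_000_000
    return {"base_usd": base, "ceiling_usd": base * (1 + contingency)}


def gpu_hours(hours_per_experiment, num_experiments, num_seeds):
    """GPU hours = time per experiment x experiments x seeds."""
    return hours_per_experiment * num_experiments * num_seeds


if __name__ == "__main__":
    # Illustrative numbers only, not real prices.
    print(estimate_api_cost(3.0, 15.0, 10_000, 2_000, num_runs=100, contingency=0.5))
    print(gpu_hours(0.5, 12, 3))
```

Comparing `ceiling_usd` against the agreed budget before launching a sweep makes it easier to decide whether a pilot run is needed first.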

Track actual spend as experiments run:

```python
# Simple cost tracker pattern
import json
import os
from datetime import datetime

COST_LOG = "results/cost_log.jsonl"


def log_cost(experiment: str, model: str, input_tokens: int,
             output_tokens: int, cost_usd: float):
    entry = {
        "timestamp": datetime.now().isoformat(),
        "experiment": experiment,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": cost_usd,
    }
    # Create results/ on first use so the append does not fail
    os.makedirs(os.path.dirname(COST_LOG), exist_ok=True)
    with open(COST_LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

**When budget is tight**: Run pilot experiments (1-2 seeds, subset of tasks) before committing to full sweeps. Use cheaper models for debugging pipelines, then switch to target models for final runs.

Step 0.7: Multi-Author Coordination

Most papers have 3-10 authors. Establish workflows early:

| Workflow | Tool | When to Use |
|---|---|---|
| Overleaf | Browser-based | Multiple authors editing simultaneously, no git experience |
| Git + LaTeX | `git` with `.gitignore` for aux files | Technical teams, need branch-based review |
| Overleaf + Git sync | Overleaf premium | Best of both: live collaboration with version history |

Section ownership: Assign each section to one primary author. Others comment but don't edit directly. Prevents merge conflicts and style inconsistency.

Author Coordination Checklist:
- [ ] Agree on section ownership (who writes what)
- [ ] Set up shared workspace (Overleaf or git repo)
- [ ] Establish notation conventions (before anyone writes)
- [ ] Schedule internal review rounds (not just at the end)
- [ ] Designate one person for final formatting pass
- [ ] Agree on figure style (colors, fonts, sizes) before creating figures

LaTeX conventions to agree on early:

- `\method{}` macro for consistent method naming
- Citation style: `\citet{}` vs `\citep{}` usage
- Math notation: lowercase bold for vectors, uppercase bold for matrices, etc.
- British vs American spelling

Phase 1: Literature Review

Goal: Find related work, identify baselines, gather citations.

Step 1.1: Identify Seed Papers

Start from papers already referenced in the codebase:

```bash
# Via terminal:
grep -r -E "arxiv|doi|cite" --include="*.md" --include="*.bib" --include="*.py" .
find . -name "*.bib"
```

Step 1.2: Search for Related Work

Load the `arxiv` skill for structured paper discovery: `skill_view("arxiv")`. It provides arXiv REST API search, Semantic Scholar citation graphs, author profiles, and BibTeX generation.

Use `web_search` for broad discovery and `web_extract` for fetching specific papers:

```
# Via web_search:
web_search("[main technique] + [application domain] site:arxiv.org")
web_search("[baseline method] comparison ICML NeurIPS 2024")

# Via web_extract (for specific papers):
web_extract("https://arxiv.org/abs/2303.17651")
```

Additional search queries to try:

- "[main technique] + [application domain]"
- "[baseline method] comparison"
- "[problem name] state-of-the-art"
- Author names from existing citations

**Recommended**: Install **Exa MCP** for real-time academic search:
```bash
claude mcp add exa -- npx -y mcp-remote "https://mcp.exa.ai/mcp"
```

Step 1.2b: Deepen the Search (Breadth-First, Then Depth)

A flat search (one round of queries) typically misses important related work. Use an iterative breadth-then-depth pattern inspired by deep research pipelines:

Iterative Literature Search:

Round 1 (Breadth): 4-6 parallel queries covering different angles
  - "[method] + [domain]"
  - "[problem name] state-of-the-art 2024 2025"
  - "[baseline method] comparison"
  - "[alternative approach] vs [your approach]"
  → Collect papers, extract key concepts and terminology

Round 2 (Depth): Generate follow-up queries from Round 1 learnings
  - New terminology discovered in Round 1 papers
  - Papers cited by the most relevant Round 1 results
  - Contradictory findings that need investigation
  → Collect papers, identify remaining gaps

Round 3 (Targeted): Fill specific gaps
  - Missing baselines identified in Rounds 1-2
  - Concurrent work (last 6 months, same problem)
  - Key negative results or failed approaches
  → Stop when new queries return mostly papers you've already seen

When to stop: If a round returns >80% papers already in your collection, the search is saturated. Typically 2-3 rounds suffice. For survey papers, expect 4-5 rounds.
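The stopping rule above can be expressed as a small check. This is a sketch under stated assumptions: papers are identified by some stable ID (for example an arXiv ID or DOI), the function names are hypothetical, and an empty round is treated as saturated.

```python
def saturation(seen_ids, round_ids):
    """Fraction of this round's unique papers that were already collected."""
    unique = set(round_ids)
    if not unique:
        return 1.0  # assumption: an empty round counts as saturated
    return len(unique & set(seen_ids)) / len(unique)


def should_stop(seen_ids, round_ids, threshold=0.8):
    """True when more than `threshold` of the round is already known."""
    return saturation(seen_ids, round_ids) > threshold


if __name__ == "__main__":
    seen = {"2303.17651", "2210.03629", "2305.10601"}
    new_round = ["2303.17651", "2210.03629", "2401.00001"]
    print(saturation(seen, new_round), should_stop(seen, new_round))
```

Deduplicating before computing the ratio matters, because parallel queries in one round often return the same paper several times.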

For agent-based workflows: Delegate each round's queries in parallel via `delegate_task`. Collect results, deduplicate, then generate the next round's queries from the combined learnings.

Step 1.3: Verify Every Citation

NEVER generate BibTeX from memory. ALWAYS fetch programmatically.

For each citation, follow the mandatory 5-step process:

Citation Verification (MANDATORY per citation):
1. SEARCH → Query Semantic Scholar or Exa MCP with specific keywords
2. VERIFY → Confirm paper exists in 2+ sources (Semantic Scholar + arXiv/CrossRef)
3. RETRIEVE → Get BibTeX via DOI content negotiation (programmatically, not from memory)
4. VALIDATE → Confirm the claim you're citing actually appears in the paper
5. ADD → Add verified BibTeX to bibliography
If ANY step fails → mark as [CITATION NEEDED], inform scientist

```python
# Fetch BibTeX via DOI
import requests


def doi_to_bibtex(doi: str) -> str:
    response = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},
        timeout=30,  # avoid hanging forever on a slow resolver
    )
    response.raise_for_status()
    return response.text
```

If you cannot verify a citation:

```latex
\cite{PLACEHOLDER_author2024_verify_this}  % TODO: Verify this citation exists
```

Always tell the scientist: "I've marked [X] citations as placeholders that need verification."

See references/citation-workflow.md for complete API documentation and the full `CitationManager` class.

Step 1.4: Organize Related Work

Group papers by methodology, not paper-by-paper:

Good: "One line of work uses X's assumption [refs] whereas we use Y's assumption because..."
Bad: "Smith et al. introduced X. Jones et al. introduced Y. We combine both."

Phase 2: Experiment Design

Goal: Design experiments that directly support paper claims. Every experiment must answer a specific question.

Step 2.1: Map Claims to Experiments

Create an explicit mapping:

| Claim | Experiment | Expected Evidence |
|---|---|---|
| "Our method outperforms baselines" | Main comparison (Table 1) | Win rate, statistical significance |
| "Effect is larger for weaker models" | Model scaling study | Monotonic improvement curve |
| "Convergence requires scope constraints" | Constrained vs unconstrained | Convergence rate comparison |

Rule: If an experiment doesn't map to a claim, don't run it.
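One way to enforce this rule is to keep the mapping as data and audit it before launching anything. The structure below is a hypothetical sketch: the claim IDs, the `claim` field, and `audit_mapping` are illustrative names, not part of the skill.

```python
def audit_mapping(claims, experiments):
    """Return (experiments with no valid claim, claims with no experiment)."""
    orphan_experiments = [
        e["name"] for e in experiments if e.get("claim") not in claims
    ]
    covered = {e.get("claim") for e in experiments}
    unsupported_claims = [c for c in claims if c not in covered]
    return orphan_experiments, unsupported_claims


if __name__ == "__main__":
    claims = {
        "C1": "Our method outperforms baselines",
        "C2": "Effect is larger for weaker models",
    }
    experiments = [
        {"name": "main_comparison", "claim": "C1"},
        {"name": "temperature_sweep", "claim": None},  # maps to no claim
    ]
    orphans, unsupported = audit_mapping(claims, experiments)
    print("Drop or justify:", orphans)
    print("Still needs evidence:", unsupported)
```

An orphan experiment is a candidate to cut; an unsupported claim is a candidate to either test or remove from the paper.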

Step 2.2: Design Baselines

Strong baselines often separate accepted papers from rejected ones. Reviewers will ask: "Did they compare against X?"

Standard baseline categories:

Naive baseline: Simplest possible approach
Strong baseline: Best known existing method
Ablation baselines: Your method minus one component
Compute-matched baselines: Same compute budget, different allocation

Step 2.3: Define Evaluation Protocol

Before running anything, specify:

Metrics: What you're measuring, with a direction marker for each (higher is better or lower is better)
Aggregation: How results are combined across runs/tasks
Statistical tests: What tests will establish significance
Sample sizes: How many runs/problems/tasks

Step 2.4: Write Experiment Scripts

Follow these patterns from successful research pipelines:

Incremental saving — save results after each step for crash recovery:

```python
import json
import os

# Save after each problem/task.
# `tasks`, `strategies`, and `result` are placeholders for your own pipeline.
for task in tasks:
    for strategy in strategies:
        result_path = f"results/{task}/{strategy}/result.json"
        if os.path.exists(result_path):
            continue  # Skip already-completed work

        # ... run experiment ...

        os.makedirs(os.path.dirname(result_path), exist_ok=True)
        with open(result_path, "w") as f:
            json.dump(result, f, indent=2)
```

**Artifact preservation.** Save all intermediate outputs:

```
results/<experiment>/
  <task>/
    <strategy>/
      final_output.md     # Final result
      history.json        # Full trajectory
      pass_01/            # Per-iteration artifacts
        version_a.md
        version_b.md
        critic.md
```

**Separation of concerns.** Keep generation, evaluation, and visualization separate:

```
run_experiment.py          # Core experiment runner
run_baselines.py           # Baseline comparison
run_comparison_judge.py    # Blind evaluation
analyze_results.py         # Statistical analysis
make_charts.py             # Visualization
```

See [references/experiment-patterns.md](references/experiment-patterns.md) for complete design patterns, cron monitoring, and error recovery.

Step 2.5: Design Human Evaluation (If Applicable)

Many NLP, HCI, and alignment papers require human evaluation as primary or complementary evidence. Design this before running automated experiments — human eval often has longer lead times (IRB approval, annotator recruitment).

When human evaluation is needed:

Automated metrics don't capture what you care about (fluency, helpfulness, safety)
Your contribution is about human-facing qualities (readability, preference, trust)
Reviewers at NLP venues (ACL, EMNLP) expect it for generation tasks

Key design decisions:

| Decision | Options | Guidance |
|---|---|---|
| Annotator type | Expert, crowdworker, end-user | Match to what your claims require |
| Scale | Likert (1-5), pairwise comparison, ranking | Pairwise is more reliable than Likert for LLM outputs |
| Sample size | Per annotator and total items | Power analysis or minimum 100 items, 3+ annotators |
| Agreement metric | Cohen's kappa, Krippendorff's alpha, ICC | Krippendorff's alpha for >2 annotators; report raw agreement too |
| Platform | Prolific, MTurk, internal team | Prolific for quality; MTurk for scale; internal for domain expertise |

Annotation guideline checklist:

- [ ] Clear task description with examples (good AND bad)
- [ ] Decision criteria for ambiguous cases
- [ ] At least 2 worked examples per category
- [ ] Attention checks / gold standard items (10-15% of total)
- [ ] Qualification task or screening round
- [ ] Estimated time per item and fair compensation (>= local minimum wage)
- [ ] IRB/ethics review if required by your institution

Reporting requirements (reviewers check all of these):

Number of annotators and their qualifications
Inter-annotator agreement with specific metric and value
Compensation details (amount, estimated hourly rate)
Annotation interface description or screenshot (appendix)
Total annotation time

See references/human-evaluation.md for complete guide including statistical tests for human eval data, crowdsourcing quality control patterns, and IRB guidance.
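As a quick illustration of the agreement metrics named above, the sketch below computes raw agreement and Cohen's kappa for the two-annotator, categorical-label case using only the standard library. It does not cover Krippendorff's alpha, which the guidance recommends for more than two annotators, and the function names are illustrative.

```python
from collections import Counter


def raw_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement corrected for chance, two annotators only."""
    n = len(labels_a)
    p_observed = raw_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies
    p_expected = sum(
        counts_a[k] * counts_b[k] for k in set(labels_a) | set(labels_b)
    ) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    a = ["good", "good", "bad", "bad"]
    b = ["good", "bad", "bad", "bad"]
    print(raw_agreement(a, b), cohens_kappa(a, b))
```

Reporting both numbers is useful because kappa can be low even when raw agreement is high, for example when one label dominates.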

Phase 3: Experiment Execution & Monitoring

Goal: Run experiments reliably, monitor progress, recover from failures.

Step 3.1: Launch Experiments

Use `nohup` for long-running experiments:

```bash
nohup python run_experiment.py --config config.yaml > logs/experiment_01.log 2>&1 &
echo $!  # Record the PID
```

Parallel execution: Run independent experiments simultaneously, but be aware of API rate limits. 4+ concurrent experiments on the same API will slow each down.

Step 3.2: Set Up Monitoring (Cron Pattern)

For long-running experiments, set up periodic status checks. The cron prompt should follow this template:

Monitor Prompt Template:
1. Check if process is still running: ps aux | grep <pattern>
2. Read last 30 lines of log: tail -30 <logfile>
3. Check for completed results: ls <result_dir>
4. If results exist, read and report: cat <result_file>
5. If all done, commit: git add -A && git commit -m "<descriptive message>" && git push
6. Report in structured format (tables with key metrics)
7. Answer the key analytical question for this experiment

Silent mode: If nothing has changed since the last check, respond with `[SILENT]` to suppress notification to the user. Only report when there's news.
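A simple way to decide whether "nothing has changed" is to fingerprint the observed status and compare it with the previous check. This is a hypothetical sketch: the `status` dictionary layout and both function names are assumptions, and how the fingerprint is persisted between cron runs is left to the caller.

```python
import hashlib
import json


def fingerprint(status):
    """Stable hash of a JSON-serializable status snapshot."""
    payload = json.dumps(status, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def monitor_reply(status, last_fingerprint, report):
    """Return (reply, new_fingerprint); reply is "[SILENT]" if unchanged."""
    current = fingerprint(status)
    if current == last_fingerprint:
        return "[SILENT]", current
    return report, current


if __name__ == "__main__":
    status = {"running": True, "results": ["task_a/result.json"]}
    reply, fp = monitor_reply(status, None, "1 result so far")
    print(reply)
    reply, fp = monitor_reply(status, fp, "1 result so far")
    print(reply)
```

Hashing only the fields that matter (process state, result file list) avoids spurious reports caused by timestamps or log noise.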

Step 3.3: Handle Failures

Common failure modes and recovery:

| Failure | Detection | Recovery |
|---|---|---|
| API rate limit / credit exhaustion | 402/429 errors in logs | Wait, then re-run (scripts skip completed work) |
| Process crash | PID gone, incomplete results | Re-run from last checkpoint |
| Timeout on hard problems | Process stuck, no log progress | Kill and skip, note in results |
| Wrong model ID | Errors referencing model name | Fix ID and re-run |

Key: Scripts should always check for existing results and skip completed work. This makes re-runs safe and efficient.
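For the rate-limit case, "wait, then re-run" can also be handled inside the script with exponential backoff. The sketch below is an illustration, not the skill's own implementation: `RetryableError` is a hypothetical stand-in for whatever exception your API client raises on a 429 response, and the delays are arbitrary defaults.

```python
import time


class RetryableError(Exception):
    """Hypothetical error type standing in for a rate-limit (429) response."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RetryableError with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] < 3:
            raise RetryableError("rate limited")
        return "ok"

    print(call_with_backoff(flaky, base_delay=0.01))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Credit exhaustion is a different situation: retrying will not help until the account is topped up.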

Step 3.4: Commit Completed Results

After each experiment batch completes:

```bash
git add -A
git commit -m "Add <experiment name>: <key finding in 1 line>"
git push
```

Step 3.5: Maintain an Experiment Journal

Git commits track what happened, but not the exploration tree — the decisions about what to try next based on what you learned. Maintain a structured experiment journal that captures this tree:

```json
// experiment_journal.jsonl: append one entry per experiment attempt
{
  "id": "exp_003",
  "parent": "exp_001",
  "timestamp": "2025-05-10T14:30:00Z",
  "hypothesis": "Adding scope constraints will fix convergence failure from exp_001",
  "plan": "Re-run autoreason with max_tokens=2000 and fixed structure template",
  "config": {"model": "haiku", "strategy": "autoreason", "max_tokens": 2000},
  "status": "completed",
  "result_path": "results/exp_003/",
  "key_metrics": {"win_rate": 0.85, "convergence_rounds": 3},
  "analysis": "Scope constraints fixed convergence. Win rate jumped from 0.42 to 0.85.",
  "next_steps": ["Try same constraints on Sonnet", "Test without structure template"],
  "figures": ["figures/exp003_convergence.pdf"]
}
```

Why a journal, not just git? Git tracks file changes. The journal tracks the reasoning: why you tried X, what you learned, and what that implies for the next experiment. When writing the paper, this tree is invaluable for the Methods section ("we observed X, which motivated Y") and for honest failure reporting.

Selecting the best path: When the journal shows a branching tree (exp_001 → exp_002a, exp_002b, exp_003), identify the path that best supports the paper's claims. Document dead-end branches in the appendix as ablations or negative results.

Snapshot code per experiment: Copy the experiment script after each run:

```bash
cp experiment.py results/exp_003/experiment_snapshot.py
```

This enables exact reproduction even after subsequent code changes.
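The journal and its exploration tree can be handled with a few small helpers. This is a minimal sketch, assuming one JSON object per line and the `id` and `parent` fields from the example entry; the helper names are hypothetical.

```python
import json
from pathlib import Path


def append_entry(journal_path, entry):
    """Append one experiment entry as a single JSON line."""
    path = Path(journal_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def load_journal(journal_path):
    """Load all entries into a dict keyed by experiment id."""
    entries = {}
    with open(journal_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                entry = json.loads(line)
                entries[entry["id"]] = entry
    return entries


def lineage(journal, exp_id):
    """Walk parent links from exp_id back to the root; return root-first."""
    chain = []
    while exp_id in journal and exp_id not in chain:  # guard against cycles
        chain.append(exp_id)
        exp_id = journal[exp_id].get("parent")
    return chain[::-1]


if __name__ == "__main__":
    journal_file = "results/experiment_journal.jsonl"
    append_entry(journal_file, {"id": "exp_001", "parent": None})
    append_entry(journal_file, {"id": "exp_003", "parent": "exp_001"})
    print(lineage(load_journal(journal_file), "exp_003"))
```

The lineage of the final experiment gives the chain of decisions to describe when writing up the path the paper reports.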

Phase 4: Result Analysis

Goal: Extract findings, compute statistics, identify the story.

Step 4.1: Aggregate Results

Write analysis scripts that:

Load all result files from a batch
Compute per-task and aggregate metrics
Generate summary tables

```python
# Standard analysis pattern
import json
from pathlib import Path

import numpy as np

results = {}
for result_file in Path("results/").rglob("result.json"):
    data = json.loads(result_file.read_text())
    strategy = result_file.parent.name
    task = result_file.parent.parent.name
    results.setdefault(strategy, {})[task] = data

# Compute aggregate metrics
for strategy, tasks in results.items():
    scores = [t["score"] for t in tasks.values()]
    print(f"{strategy}: mean={np.mean(scores):.1f}, std={np.std(scores):.1f}")
```

Step 4.2: Statistical Significance

Always compute:

Error bars: Standard deviation or standard error, specify which
Confidence intervals: 95% CI for key results
Pairwise tests: McNemar's test for comparing two methods on the same items (paired binary outcomes such as pass/fail)
Effect sizes: Cohen's d or h for practical significance

See references/experiment-patterns.md for complete implementations of McNemar's test, bootstrapped CIs, and Cohen's h.
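The quantities listed above can be sketched with the standard library alone. This is a minimal illustration, not the implementation in references/experiment-patterns.md: the function names `bootstrap_ci`, `mcnemar_exact`, and `cohens_h` are assumptions, the bootstrap uses the simple percentile method, and the McNemar test is the exact binomial variant on paired per-example correctness flags.

```python
import math
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-example correctness flags."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)  # only A right
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)  # only B right
    n = b + c
    if n == 0:
        return 1.0  # the two methods never disagree
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cohens_h(p1, p2):
    """Effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))
```

When reporting, state which interval was computed (here: percentile bootstrap of the mean) and how many resamples were drawn, in the same way the error-bar type is stated.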

务必计算：

误差棒：标准差或标准误，注明使用的是哪种
置信区间：关键结果的95%置信区间
两两检验：McNemar检验用于比较两种方法
效应量：Cohen's d或h用于实际显著性

详见references/experiment-patterns.md获取McNemar检验、自助法置信区间和Cohen's h的完整实现。

Step 4.3: Identify the Story

步骤4.3：确定叙事主线

After analysis, explicitly answer:

What is the main finding? State it in one sentence.
What surprised you? Unexpected results often make the best papers.
What failed? Failed experiments can be the most informative. Honest reporting of failures strengthens the paper.
What follow-up experiments are needed? Results often raise new questions.

分析后，明确回答：

主要发现是什么？ 用一句话概括。
什么让你惊讶？ 意外结果往往成就最佳论文。
什么失败了？ 失败的实验可能最具信息量。如实报告失败能增强论文可信度。
需要哪些后续实验？ 结果往往会引出新问题。

Handling Negative or Null Results

处理负面或无效结果

When your hypothesis was wrong or results are inconclusive, you have four options:

Situation	Action	Venue Fit
Hypothesis wrong but why is informative	Frame paper around the analysis of why	NeurIPS, ICML (if analysis is rigorous)
Method doesn't beat baselines but reveals something new	Reframe contribution as understanding/analysis	ICLR (values understanding), workshop papers
Clean negative result on popular claim	Write it up — the field needs to know	NeurIPS Datasets & Benchmarks, TMLR, workshops
Results inconclusive, no clear story	Pivot — run different experiments or reframe	Don't force a paper that isn't there

How to write a negative results paper:

Lead with what the community believes and why it matters to test it
Describe your rigorous methodology (must be airtight — reviewers will scrutinize harder)
Present the null result clearly with statistical evidence
Analyze why the expected result didn't materialize
Discuss implications for the field

Venues that explicitly welcome negative results: NeurIPS (Datasets & Benchmarks track), TMLR, ML Reproducibility Challenge, workshops at major conferences. Some workshops specifically call for negative results.

当你的假设错误或结果不确定时，有三种选择：

情况	行动	会议适配性
假设错误但原因有信息量	将论文框架围绕原因分析展开	NeurIPS、ICML（若分析严谨）
方法未优于基准但揭示了新发现	将贡献重构为理解/分析	ICLR（重视理解）、研讨会论文
对流行论点的明确负面结果	撰写出来——领域需要知道这些	NeurIPS Datasets & Benchmarks、TMLR、研讨会
结果不确定，无清晰叙事主线	转向其他方向——运行不同实验或重构框架	不要强行拼凑论文

如何撰写负面结果论文：

开头说明社区的普遍看法以及测试它的重要性
描述严谨的方法（必须无懈可击——评审会更严格审查）
清晰呈现无效结果并提供统计证据
分析为什么预期结果未出现
讨论对领域的影响

明确欢迎负面结果的会议：NeurIPS（Datasets & Benchmarks track）、TMLR、ML Reproducibility Challenge、顶会研讨会。部分研讨会专门征集负面结果。

Step 4.4: Create Figures and Tables

步骤4.4：创建图表与表格

Figures:

Use vector graphics (PDF) for all plots: `plt.savefig('fig.pdf')`
Colorblind-safe palettes (Okabe-Ito or Paul Tol)
Self-contained captions — reader should understand without main text
No title inside figure — the caption serves this function
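A minimal sketch of the figure rules above, assuming matplotlib is installed. The helper name `plot_scores`, the axis labels, and the figure size are hypothetical; the hex values are the Okabe-Ito palette.

```python
# Okabe-Ito colorblind-safe palette.
OKABE_ITO = [
    "#E69F00",  # orange
    "#56B4E9",  # sky blue
    "#009E73",  # bluish green
    "#F0E442",  # yellow
    "#0072B2",  # blue
    "#D55E00",  # vermillion
    "#CC79A7",  # reddish purple
    "#000000",  # black
]


def plot_scores(series, out_path="fig.pdf"):
    """Plot {label: [y values]} as lines and save the figure as a vector PDF.

    matplotlib is imported lazily so the palette can be reused without it.
    """
    import matplotlib
    matplotlib.use("Agg")  # headless backend; assumption: no display is needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(4, 3))
    for color, (label, ys) in zip(OKABE_ITO, series.items()):
        ax.plot(range(1, len(ys) + 1), ys, color=color, label=label)
    ax.set_xlabel("Round")  # hypothetical axis names
    ax.set_ylabel("Score")
    ax.legend(frameon=False)
    # No ax.set_title(): the LaTeX caption carries the title.
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
    return out_path
```

A call such as `plot_scores({"Ours": [0.4, 0.7, 0.9], "Baseline": [0.4, 0.5, 0.5]})` would write `fig.pdf` in the working directory.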

Tables:

Use the `booktabs` LaTeX package
Bold best value per metric
Include direction symbols (higher/lower better)
Consistent decimal precision

latex

\usepackage{booktabs}  % goes in the preamble
\begin{tabular}{lcc}
\toprule
Method & Accuracy $\uparrow$ & Latency $\downarrow$ \\
\midrule
Baseline & 85.2 & 45ms \\
\textbf{Ours} & \textbf{92.1} & \textbf{38ms} \\
\bottomrule
\end{tabular}
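Bolding the best value by hand is easy to get wrong when results change, so the table can be generated from the result data instead. The sketch below is one possible approach: `booktabs_table` and its argument layout are assumptions made for illustration, not part of the pipeline.

```python
def booktabs_table(rows, metrics):
    """Render rows as a booktabs tabular with the best value per metric in bold.

    rows:    {method_name: {metric_name: value}}
    metrics: list of (metric_name, higher_is_better, format_spec)
    """
    # Best value per metric, respecting the metric's direction.
    best = {}
    for name, higher, _ in metrics:
        values = [r[name] for r in rows.values()]
        best[name] = max(values) if higher else min(values)

    header = ["Method"]
    for name, higher, _ in metrics:
        arrow = r"$\uparrow$" if higher else r"$\downarrow$"
        header.append(f"{name} {arrow}")

    lines = [r"\begin{tabular}{l" + "c" * len(metrics) + "}", r"\toprule"]
    lines.append(" & ".join(header) + r" \\")
    lines.append(r"\midrule")
    for method, r in rows.items():
        cells = [method]
        for name, _, spec in metrics:
            text = format(r[name], spec)  # one format spec per column keeps precision consistent
            cells.append(rf"\textbf{{{text}}}" if r[name] == best[name] else text)
        lines.append(" & ".join(cells) + r" \\")
    lines += [r"\bottomrule", r"\end{tabular}"]
    return "\n".join(lines)


table = booktabs_table(
    {
        "Baseline": {"Accuracy": 85.2, "Latency (ms)": 45},
        "Ours": {"Accuracy": 92.1, "Latency (ms)": 38},
    },
    [("Accuracy", True, ".1f"), ("Latency (ms)", False, "d")],
)
```

Putting the unit in the column header rather than in each cell keeps the cells numeric and easier to compare.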

图表：

所有绘图使用矢量图形（PDF）：
```
plt.savefig('fig.pdf')
```
色盲友好调色板（Okabe-Ito或Paul Tol）
自包含标题——读者无需阅读正文即可理解
图表内无标题——标题由图例承担

表格：

使用
```
booktabs
```
LaTeX包
每个指标的最佳值加粗
包含方向符号（越高越好/越低越好）
一致的小数精度

latex

\usepackage{booktabs}
\begin{tabular}{lcc}
\toprule
Method & Accuracy $\uparrow$ & Latency $\downarrow$ \\
\midrule
Baseline & 85.2 & 45ms \\
\textbf{Ours} & \textbf{92.1} & 38ms \\
\bottomrule
\end{tabular}

Step 4.5: Decide: More Experiments or Write?

步骤4.5：决定：继续实验还是开始撰写？

Situation	Action
Core claims supported, results significant	Move to Phase 5 (writing)
Results inconclusive, need more data	Back to Phase 2 (design)
Unexpected finding suggests new direction	Back to Phase 2 (design)
Missing one ablation reviewers will ask for	Run it, then Phase 5
All experiments done but some failed	Note failures, move to Phase 5

情况	行动
核心论点得到支撑，结果显著	进入阶段5（撰写）
结果不确定，需要更多数据	返回阶段2（设计）
意外发现提示新方向	返回阶段2（设计）
缺少评审会要求的消融实验	运行该实验，然后进入阶段5
所有实验完成但部分失败	记录失败，进入阶段5

Step 4.6: Write the Experiment Log (Bridge to Writeup)

步骤4.6：撰写实验日志（连接实验与撰写的桥梁）

Before moving to paper writing, create a structured experiment log that bridges results to prose. This is the single most important connective tissue between experiments and the writeup — without it, the writing agent has to re-derive the story from raw result files.

Create `experiment_log.md` with the following structure:

开始撰写论文前，创建结构化实验日志，连接结果与文稿。这是实验与撰写之间最重要的连接纽带——没有它，撰写Agent必须从原始结果文件中重新推导叙事主线。

创建
experiment_log.md
，结构如下：

markdown

# Experiment Log

# 实验日志

## Contribution (one sentence)

## 核心贡献（一句话）

[The paper's main claim]

[论文的主要论点]

## Experiments Run

## 已运行实验

### Experiment 1: [Name]

### 实验1：[名称]

- Claim tested: [Which paper claim this supports]
- Setup: [Model, dataset, config, number of runs]
- Key result: [One sentence with the number]
- Result files: results/exp1/final_info.json
- Figures generated: figures/exp1_comparison.pdf
- Surprising findings: [Anything unexpected]

- 测试的论点：[支撑论文的哪个论点]
- 设置：[模型、数据集、配置、运行次数]
- 关键结果：[一句话带数值]
- 结果文件：results/exp1/final_info.json
- 生成的图表：figures/exp1_comparison.pdf
- 意外发现：[任何意外情况]

### Experiment 2: [Name]

### 实验2：[名称]

...

## Figures

## 图表

| Filename | Description | Which section it belongs in |
|---|---|---|
| figures/main_comparison.pdf | Bar chart comparing all methods on benchmark X | Results, Figure 2 |
| figures/ablation.pdf | Ablation removing components A, B, C | Results, Figure 3 |
| ... | | |

| 文件名 | 描述 | 所属章节 |
|---|---|---|
| figures/main_comparison.pdf | 所有方法在基准测试X上的对比柱状图 | 结果，图2 |
| figures/ablation.pdf | 移除组件A、B、C的消融实验 | 结果，图3 |
| ... | | |

## Failed Experiments (document for honesty)

## 失败实验（如实记录）

[What was tried, why it failed, what it tells us]

[尝试了什么，为什么失败，说明了什么]

## Open Questions

## 开放问题

[Anything the results raised that the paper should address]

[结果引出的论文应解决的问题]


**Why this matters**: When drafting, the agent (or a delegated sub-agent) can load `experiment_log.md` alongside the LaTeX template and produce a first draft grounded in actual results. Without this bridge, the writing agent must parse raw JSON/CSV files and infer the story, which is a common source of hallucinated or misreported numbers.

**Git discipline**: Commit this log alongside the results it describes.

**为什么这很重要**：撰写时，Agent（或委派的子Agent）可以加载`experiment_log.md`和LaTeX模板，生成基于实际结果的初稿。没有这个桥梁，撰写Agent必须解析原始JSON/CSV文件并推断叙事主线，这是编造或错误报告数据的常见来源。

**Git规范**：将此日志与对应的结果一起提交。

---

Iterative Refinement: Strategy Selection

迭代优化：策略选择

Any output in this pipeline — paper drafts, experiment scripts, analysis — can be iteratively refined. The autoreason research provides empirical evidence for when each refinement strategy works and when it fails. Use this section to choose the right approach.

本流程中的任何输出——论文草稿、实验脚本、分析——都可以迭代优化。Autoreason研究提供了实证证据，说明每种优化策略何时有效、何时失败。使用本节选择合适的方法。

Quick Decision Table

快速决策表

Your Situation	Strategy	Why
Mid-tier model + constrained task	Autoreason	Sweet spot. Generation-evaluation gap is widest. Baselines actively destroy weak model outputs.
Mid-tier model + open task	Autoreason with scope constraints added	Add fixed facts, structure, or deliverable to bound the improvement space.
Frontier model + constrained task	Autoreason	Wins 2/3 constrained tasks even at frontier.
Frontier model + unconstrained task	Critique-and-revise or single pass	Autoreason comes last. Model self-evaluates well enough.
Concrete technical task (system design)	Critique-and-revise	Direct find-and-fix loop is more efficient.
Template-filling task (one correct structure)	Single pass or conservative	Minimal decision space. Iteration adds no value.
Code with test cases	Autoreason (code variant)	Structured analysis of why it failed before fixing. Recovery rate 62% vs 43%.
Very weak model (Llama 8B class)	Single pass	Model too weak for diverse candidates. Invest in generation quality.

你的情况	策略	原因
中端模型 + 约束任务	Autoreason	最佳场景。生成-评估差距最大。基准方法会破坏弱模型输出。
中端模型 + 开放任务	Autoreason加范围约束	添加固定事实、结构或交付物以缩小改进空间。
前沿模型 + 约束任务	Autoreason	在约束任务中胜率2/3，即使是前沿模型。
前沿模型 + 无约束任务	Critique-and-revise或单次生成	Autoreason效果最差。模型自我评估足够好。
具体技术任务（系统设计）	Critique-and-revise	直接的发现-修复循环更高效。
模板填充任务（唯一正确结构）	单次生成或保守策略	决策空间极小。迭代无价值。
带测试用例的代码	Autoreason（代码变体）	结构化分析失败原因再修复。恢复率62% vs 43%。
非常弱的模型（Llama 8B级）	单次生成	模型太弱无法生成多样候选。专注于生成质量。

The Generation-Evaluation Gap

生成-评估差距

Core insight: Autoreason's value depends on the gap between a model's generation capability and its self-evaluation capability.

Model Tier        │ Generation │ Self-Eval │ Gap    │ Autoreason Value
──────────────────┼────────────┼───────────┼────────┼─────────────────
Weak (Llama 8B)   │ Poor       │ Poor      │ Small  │ None — can't generate diverse candidates
Mid (Haiku 3.5)   │ Decent     │ Poor      │ LARGE  │ MAXIMUM — 42/42 perfect Borda
Mid (Gemini Flash)│ Decent     │ Moderate  │ Large  │ High — wins 2/3
Strong (Sonnet 4) │ Good       │ Decent    │ Medium │ Moderate — wins 3/5
Frontier (S4.6)   │ Excellent  │ Good      │ Small  │ Only with constraints

This gap is structural, not temporary. As costs drop, today's frontier becomes tomorrow's mid-tier. The sweet spot moves but never disappears.

核心见解：Autoreason的价值取决于模型生成能力与自我评估能力之间的差距。

模型层级        │ 生成能力 │ 自我评估能力 │ 差距    │ Autoreason价值
──────────────────┼────────────┼───────────┼────────┼─────────────────
弱（Llama 8B）   │ 差       │ 差      │ 小  │ 无——无法生成多样候选
中（Haiku 3.5）   │ 尚可     │ 差      │ 大  │ 最大——42/42完美Borda
中（Gemini Flash）│ 尚可     │ 中等      │ 大  │ 高——胜率2/3
强（Sonnet 4） │ 好       │ 尚可    │ 中 │ 中等——胜率3/5
前沿（S4.6）   │ 优秀     │ 好      │ 小  │ 仅在约束下有效

这个差距是结构性的，而非暂时的。随着成本下降，今天的前沿模型会成为明天的中端模型。最佳场景会移动但永远不会消失。

Autoreason Loop (Summary)

Autoreason循环（摘要）

Each pass produces three candidates from fresh, isolated agents:

Critic → finds problems in incumbent A (no fixes)
Author B → revises A based on critique
Synthesizer → merges A and B (randomized labels)
Judge Panel → 3 blind CoT judges rank A, B, AB via Borda count
Convergence → A wins k=2 consecutive passes → done

Key parameters:

k=2 convergence (k=1 premature, k=3 too expensive, no quality gain)
CoT judges always (3x faster convergence)
Temperature 0.8 authors, 0.3 judges
Conservative tiebreak: incumbent wins ties
Every role is a fresh agent with no shared context
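The judging and convergence rules above can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation from references/autoreason-methodology.md: `borda_winner`, `run_until_converged`, and the `judge_pass` callable are hypothetical names, the incumbent is assumed to be relabelled "A" at the start of each pass, and ties among non-incumbent candidates are broken alphabetically for determinism.

```python
def borda_winner(rankings, incumbent="A"):
    """Pick a winner from judge rankings via Borda count.

    rankings: list of lists, each ordering candidate labels best-first.
    Ties are broken conservatively in favour of the incumbent.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            # Best rank earns n-1 points, worst earns 0.
            scores[label] = scores.get(label, 0) + (n - 1 - position)
    top = max(scores.values())
    leaders = [label for label, s in scores.items() if s == top]
    if incumbent in leaders:
        return incumbent, scores
    return sorted(leaders)[0], scores  # assumption: alphabetical tiebreak


def run_until_converged(judge_pass, k=2, max_passes=20):
    """Call judge_pass() once per pass until the incumbent wins k passes in a row.

    judge_pass is a hypothetical callable that runs critic, author, synthesizer,
    and judges, then returns the judges' rankings.
    """
    streak = 0
    for pass_index in range(1, max_passes + 1):
        winner, _ = borda_winner(judge_pass())
        streak = streak + 1 if winner == "A" else 0
        if streak >= k:
            return pass_index
    return None  # no convergence within max_passes
```

Returning `None` after `max_passes` gives the caller a hook for the "no convergence" failure mode rather than looping indefinitely.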

每轮从独立的新Agent生成三个候选：

批评者 → 发现现有版本A的问题（不修复）
作者B → 根据批评意见修订A
合成器 → 合并A和B（随机标记）
评审小组 → 3个盲评CoT评审通过Borda计数对A、B、AB排名
收敛 → A连续k=2轮获胜 → 完成

关键参数：

k=2收敛（k=1过早，k=3成本过高，无质量提升）
始终使用CoT评审（收敛速度快3倍）
作者温度0.8，评审温度0.3
保守平局规则：现有版本获胜
每个角色都是无共享上下文的新Agent

Applying to Paper Drafts

应用于论文草稿

When refining the paper itself through autoreason:

Provide ground truth to the critic: actual experimental data, result JSONs, statistical outputs. Without this, models hallucinate fabricated ablation studies and fake confidence intervals.
Use 3 working judges minimum: A broken judge parser doesn't add noise — it prevents equilibrium entirely.
Scope constrain the revision: "Address these specific weaknesses" not "improve the paper."

当通过Autoreason优化论文本身时：

向批评者提供真实数据：实际实验数据、结果JSON、统计输出。没有这些，模型会编造消融研究和虚假置信区间。
至少使用3个有效评审：故障评审解析器不会增加噪声——会完全阻止平衡。
约束修订范围：“解决这些特定弱点”而非“改进论文”。

Failure Modes

失败模式

Failure	Detection	Fix
No convergence (A never wins)	A wins <15% over 20+ passes	Add scope constraints to the task
Synthesis drift	Word counts grow unboundedly	Constrain structure and deliverable
Degradation below single pass	Baselines score higher than iterated output	Switch to single pass; model may be too weak
Overfitting (code)	High public-test pass, low private-test pass	Use structured analysis, not just test feedback
Broken judges	Parsing failures reduce panel below 3	Fix parser before continuing

See references/autoreason-methodology.md for complete prompts, Borda scoring details, model selection guide, scope constraint design patterns, and compute budget reference.

失败	检测	修复
不收敛（A从未获胜）	20+轮中A胜率<15%	为任务添加范围约束
合成漂移	字数无限制增长	约束结构和交付物
性能低于单次生成	基准方法得分高于迭代输出	切换到单次生成；模型可能太弱
过拟合（代码）	公开测试通过率高，私有测试通过率低	使用结构化分析，而非仅测试反馈
评审故障	解析失败导致评审小组少于3人	修复解析器后再继续

详见references/autoreason-methodology.md获取完整提示、Borda评分细节、模型选择指南、范围约束设计模式和计算预算参考。

Phase 5: Paper Drafting

阶段5：论文撰写

Goal: Write a complete, publication-ready paper.

目标：撰写完整、可提交的论文。

Context Management for Large Projects

大型项目的上下文管理

A paper project with 50+ experiment files, multiple result directories, and extensive literature notes can easily exceed the agent's context window. Manage this proactively:

What to load into context per drafting task:

Drafting Task	Load Into Context	Do NOT Load
Writing Introduction	`experiment_log.md` , contribution statement, 5-10 most relevant paper abstracts	Raw result JSONs, full experiment scripts, all literature notes
Writing Methods	Experiment configs, pseudocode, architecture description	Raw logs, results from other experiments
Writing Results	`experiment_log.md` , result summary tables, figure list	Full analysis scripts, intermediate data
Writing Related Work	Organized citation notes (Step 1.4 output), .bib file	Experiment files, raw PDFs
Revision pass	Full paper draft, specific reviewer concerns	Everything else

Principles:

`experiment_log.md` is the primary context bridge: it summarizes everything needed for writing without loading raw data files (see Step 4.6)
Load one section's context at a time when delegating. A sub-agent drafting Methods doesn't need the literature review notes.
Summarize, don't include raw files. For a 200-line result JSON, load a 10-line summary table. For a 50-page related paper, load the 5-sentence abstract + your 2-line note about its relevance.

For very large projects: Create a `context/` directory with pre-compressed summaries:

context/
  contribution.md          # 1 sentence
  experiment_summary.md    # Key results table (from experiment_log.md)
  literature_map.md        # Organized citation notes
  figure_inventory.md      # List of figures with descriptions

包含50+实验文件、多个结果目录和大量文献笔记的论文项目很容易超出Agent的上下文窗口。主动管理上下文：

每个撰写任务需加载的上下文：

撰写任务	需加载的上下文	无需加载
撰写引言	`experiment_log.md` 、核心贡献声明、5-10篇最相关论文的摘要	原始结果JSON、完整实验脚本、所有文献笔记
撰写方法	实验配置、伪代码、架构描述	原始日志、其他实验的结果
撰写结果	`experiment_log.md` 、结果汇总表格、图表列表	完整分析脚本、中间数据
撰写相关工作	整理后的引用笔记（步骤1.4输出）、.bib文件	实验文件、原始PDF
修订	完整论文草稿、特定评审关注点	其他所有内容

原则：

experiment_log.md
是主要上下文桥梁——它总结了撰写所需的一切，无需加载原始数据文件（见步骤4.6）
委派时一次加载一个章节的上下文。撰写方法的子Agent不需要文献笔记。
汇总，不要包含原始文件。对于200行的结果JSON，加载10行的汇总表格。对于50页的相关论文，加载5句话的摘要+2句话的相关性笔记。

对于超大型项目：创建

context/

目录，包含预压缩的摘要：

context/
  contribution.md          # 1句话
  experiment_summary.md    # 关键结果表格（来自experiment_log.md）
  literature_map.md        # 整理后的引用笔记
  figure_inventory.md      # 图表列表及描述

The Narrative Principle

叙事原则

The single most critical insight: Your paper is not a collection of experiments — it's a story with one clear contribution supported by evidence.

Every successful ML paper centers on what Neel Nanda calls "the narrative": a short, rigorous, evidence-based technical story with a takeaway readers care about.

Three Pillars (must be crystal clear by end of introduction):

Pillar	Description	Test
The What	1-3 specific novel claims	Can you state them in one sentence?
The Why	Rigorous empirical evidence	Do experiments distinguish your hypothesis from alternatives?
The So What	Why readers should care	Does this connect to a recognized community problem?

If you cannot state your contribution in one sentence, you don't yet have a paper.

最关键的见解：你的论文不是实验集合——它是一个故事，有一个清晰的核心论点并由证据支撑。

每篇成功的ML论文都围绕Neel Nanda所说的“叙事”展开：一个简短、严谨、基于证据的技术故事，包含读者关心的要点。

三大支柱（引言结束时必须清晰）：

支柱	描述	测试
是什么	1-3个具体的新颖论点	你能用一句话概括吗？
为什么	严谨的实证证据	实验是否区分了你的假设与其他假设？
重要性	读者为什么应该关心	这是否与公认的社区问题相关？

如果你不能用一句话概括核心贡献，说明你还没有论文。

The Sources Behind This Guidance

本指南的来源

This skill synthesizes writing philosophy from researchers who have published extensively at top venues. The writing philosophy layer was originally compiled by Orchestra Research as the `ml-paper-writing` skill.

Source	Key Contribution	Link
Neel Nanda (Google DeepMind)	The Narrative Principle, What/Why/So What framework	How to Write ML Papers
Sebastian Farquhar (DeepMind)	5-sentence abstract formula	How to Write ML Papers
Gopen & Swan	7 principles of reader expectations	Science of Scientific Writing
Zachary Lipton	Word choice, eliminating hedging	Heuristics for Scientific Writing
Jacob Steinhardt (UC Berkeley)	Precision, consistent terminology	Writing Tips
Ethan Perez (Anthropic)	Micro-level clarity tips	Easy Paper Writing Tips
Andrej Karpathy	Single contribution focus	Various lectures

For deeper dives into any of these, see:

references/writing-guide.md — Full explanations with examples
references/sources.md — Complete bibliography

本技能综合了在顶会发表大量论文的研究人员的写作理念。写作理念层最初由Orchestra Research作为

ml-paper-writing

技能整理。

来源	核心贡献	链接
Neel Nanda（Google DeepMind）	叙事原则、What/Why/So What框架	How to Write ML Papers
Sebastian Farquhar（DeepMind）	5句话摘要公式	How to Write ML Papers
Gopen & Swan	读者期望的7原则	Science of Scientific Writing
Zachary Lipton	用词、消除模糊表述	Heuristics for Scientific Writing
Jacob Steinhardt（UC Berkeley）	精确性、术语一致性	Writing Tips
Ethan Perez（Anthropic）	微观层面的清晰度技巧	Easy Paper Writing Tips
Andrej Karpathy	聚焦单一贡献	各类讲座

深入了解任何主题，请参阅：

references/writing-guide.md —— 含示例的完整解释
references/sources.md —— 完整参考文献

Time Allocation

时间分配

Spend approximately equal time on each of:

The abstract
The introduction
The figures
Everything else combined

Why? Most reviewers form judgments before reaching your methods. Readers encounter your paper as: title → abstract → introduction → figures → maybe the rest.

将大约相等的时间分配给：

摘要
引言
图表
其他所有内容

为什么？ 大多数评审在看到方法部分前就形成了判断。读者接触论文的顺序是：标题→摘要→引言→图表→可能的其他内容。

Writing Workflow

写作流程

Paper Writing Checklist:
- [ ] Step 1: Define the one-sentence contribution
- [ ] Step 2: Draft Figure 1 (core idea or most compelling result)
- [ ] Step 3: Draft abstract (5-sentence formula)
- [ ] Step 4: Draft introduction (1-1.5 pages max)
- [ ] Step 5: Draft methods
- [ ] Step 6: Draft experiments & results
- [ ] Step 7: Draft related work
- [ ] Step 8: Draft conclusion & discussion
- [ ] Step 9: Draft limitations (REQUIRED by all venues)
- [ ] Step 10: Plan appendix (proofs, extra experiments, details)
- [ ] Step 11: Complete paper checklist
- [ ] Step 12: Final review

论文撰写清单：
- [ ] 步骤1：定义一句话核心贡献
- [ ] 步骤2：绘制图1（核心思想或最具说服力的结果）
- [ ] 步骤3：撰写摘要（5句话公式）
- [ ] 步骤4：撰写引言（最多1-1.5页）
- [ ] 步骤5：撰写方法
- [ ] 步骤6：撰写实验与结果
- [ ] 步骤7：撰写相关工作
- [ ] 步骤8：撰写结论与讨论
- [ ] 步骤9：撰写局限性（所有会议要求）
- [ ] 步骤10：规划附录（证明、额外实验、细节）
- [ ] 步骤11：完成论文清单
- [ ] 步骤12：最终评审

Two-Pass Refinement Pattern

两轮优化模式

When drafting with an AI agent, use a two-pass approach (proven effective in SakanaAI's AI-Scientist pipeline):

Pass 1 — Write + immediate refine per section: For each section, write a complete draft, then immediately refine it in the same context. This catches local issues (clarity, flow, completeness) while the section is fresh.

Pass 2 — Global refinement with full-paper context: After all sections are drafted, revisit each section with awareness of the complete paper. This catches cross-section issues: redundancy, inconsistent terminology, narrative flow, and gaps where one section promises something another doesn't deliver.

Second-pass refinement prompt (per section):
"Review the [SECTION] in the context of the complete paper.
- Does it fit with the rest of the paper? Are there redundancies with other sections?
- Is terminology consistent with Introduction and Methods?
- Can anything be cut without weakening the message?
- Does the narrative flow from the previous section and into the next?
Make minimal, targeted edits. Do not rewrite from scratch."

使用AI Agent撰写时，采用两轮方法（在SakanaAI的AI-Scientist流程中被证明有效）：

第一轮——撰写+即时章节内优化：每个章节写完完整草稿后，立即在同一上下文中优化。这能在章节新鲜时发现局部问题（清晰度、流畅性、完整性）。

第二轮——全论文上下文下的全局优化：所有章节写完后，结合完整论文重新审视每个章节。这能发现跨章节问题：冗余、术语不一致、叙事流畅性、一个章节承诺而另一个章节未交付的差距。

第二轮优化提示（每个章节）：
"结合完整论文评审[章节]。
- 它与论文其他部分契合吗？与其他章节有冗余吗？
- 术语与引言和方法一致吗？
- 有没有可以删除而不削弱信息的内容？
- 叙事是否从前一章节流畅过渡到下一章节？
进行最小、有针对性的编辑。不要从头重写。"

LaTeX Error Checklist

LaTeX错误清单

Append this checklist to every refinement prompt. These are the most common errors when LLMs write LaTeX:

LaTeX Quality Checklist (verify after every edit):
- [ ] No unenclosed math symbols ($ signs balanced)
- [ ] Only reference figures/tables that exist (\ref matches \label)
- [ ] No fabricated citations (\cite matches entries in .bib)
- [ ] Every \begin{env} has matching \end{env} (especially figure, table, algorithm)
- [ ] No HTML contamination (</end{figure}> instead of \end{figure})
- [ ] No unescaped underscores outside math mode (use \_ in text)
- [ ] No duplicate \label definitions
- [ ] No duplicate section headers
- [ ] Numbers in text match actual experimental results
- [ ] All figures have captions and labels
- [ ] No overly long lines that cause overfull hbox warnings
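Several of the checklist items above are mechanical and can be checked by a script before a human or agent reads the draft. The sketch below covers only label/reference matching, citation keys, environment balance, and dollar-sign balance. `lint_latex` is a hypothetical helper, its regular expressions are approximations (for example, they do not skip comments or verbatim blocks), and `bib_keys` is assumed to be the set of keys already extracted from the .bib file.

```python
import re


def lint_latex(tex, bib_keys=()):
    """Return a list of human-readable problems found in a LaTeX source string."""
    problems = []

    labels = re.findall(r"\\label\{([^}]*)\}", tex)
    refs = re.findall(r"\\(?:ref|eqref)\{([^}]*)\}", tex)
    for label in sorted({l for l in labels if labels.count(l) > 1}):
        problems.append(f"duplicate \\label: {label}")
    for ref in sorted(set(refs) - set(labels)):
        problems.append(f"\\ref without matching \\label: {ref}")

    # \cite, \citep, \citet, ...; keys may be comma-separated.
    cited = set()
    for group in re.findall(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}", tex):
        cited.update(key.strip() for key in group.split(","))
    for key in sorted(cited - set(bib_keys)):
        problems.append(f"\\cite key not in .bib: {key}")

    # Environments must close in the order they were opened.
    stack = []
    for kind, env in re.findall(r"\\(begin|end)\{([^}]*)\}", tex):
        if kind == "begin":
            stack.append(env)
        elif not stack or stack.pop() != env:
            problems.append(f"unmatched \\end{{{env}}}")
    problems += [f"unclosed \\begin{{{env}}}" for env in stack]

    # Unescaped dollar signs should come in pairs.
    if len(re.findall(r"(?<!\\)\$", tex)) % 2:
        problems.append("unbalanced $ signs")
    return problems
```

An empty list means only that these four checks passed; items such as "numbers in text match actual experimental results" still need to be verified against `experiment_log.md`.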

将此清单附加到每个优化提示中。这些是LLM撰写LaTeX时最常见的错误：

LaTeX质量清单（每次编辑后验证）：
- [ ] 无未闭合的数学符号（$符号平衡）
- [ ] 仅引用存在的图表/表格（\ref与\label匹配）
- [ ] 无编造的引用（\cite与.bib中的条目匹配）
- [ ] 每个\begin{env}都有对应的\end{env}（尤其是figure、table、algorithm）
- [ ] 无HTML污染（</end{figure}>而非\end{figure}）
- [ ] 文本模式下无不转义的下划线（文本中使用\_）
- [ ] 无重复的\label定义
- [ ] 无重复的章节标题
- [ ] 文本中的数字与实际实验结果匹配
- [ ] 所有图表都有标题和标签
- [ ] 无导致hbox溢出警告的过长行

Step 5.0: Title

步骤5.0：标题

The title is the single most-read element of the paper. It determines whether anyone clicks through to the abstract.

Good titles:

State the contribution or finding: "Autoreason: When Iterative LLM Refinement Works and Why It Fails"
Highlight a surprising result: "Scaling Data-Constrained Language Models" (implies you can)
Name the method + what it does: "DPO: Direct Preference Optimization of Language Models"

Bad titles:

Too generic: "An Approach to Improving Language Model Outputs"
Too long: anything over ~15 words
Jargon-only: "Asymptotic Convergence of Iterative Stochastic Policy Refinement" (who is this for?)

Rules:

Include your method name if you have one (for citability)
Include 1-2 keywords reviewers will search for
Avoid colons unless both halves carry meaning
Test: would a reviewer know the domain and contribution from the title alone?

标题是论文被阅读最多的部分，决定了是否有人点击查看摘要。

好标题：

说明贡献或发现："Autoreason: When Iterative LLM Refinement Works and Why It Fails"
突出意外结果："Scaling Data-Constrained Language Models"（暗示你能做到）
命名方法+功能："DPO: Direct Preference Optimization of Language Models"

坏标题：

太笼统："An Approach to Improving Language Model Outputs"
太长：超过~15个单词
仅含行话："Asymptotic Convergence of Iterative Stochastic Policy Refinement"（给谁看的？）

规则：

如果有方法名称，包含进去（便于引用）
包含1-2个评审会搜索的关键词
除非两部分都有意义，否则避免使用冒号
测试：评审仅从标题能知道领域和贡献吗？

Step 5.1: Abstract (5-Sentence Formula)

步骤5.1：摘要（5句话公式）

From Sebastian Farquhar (DeepMind):

1. What you achieved: "We introduce...", "We prove...", "We demonstrate..."
2. Why this is hard and important
3. How you do it (with specialist keywords for discoverability)
4. What evidence you have
5. Your most remarkable number/result

Delete generic openings like "Large language models have achieved remarkable success..."

来自Sebastian Farquhar（DeepMind）：

1. 你取得了什么："We introduce...", "We prove...", "We demonstrate..."
2. 为什么这很难且重要
3. 你如何做（包含专业关键词以提高可发现性）
4. 你有什么证据
5. 你最显著的数字/结果

删除通用开头，如"Large language models have achieved remarkable success..."

Step 5.2: Figure 1

步骤5.2：图1

Figure 1 is the second thing most readers look at (after abstract). Draft it before writing the introduction — it forces you to clarify the core idea.

Figure 1 Type	When to Use	Example
Method diagram	New architecture or pipeline	TikZ flowchart showing your system
Results teaser	One compelling result tells the whole story	Bar chart: "Ours vs baselines" with clear gap
Problem illustration	The problem is unintuitive	Before/after showing failure mode you fix
Conceptual diagram	Abstract contribution needs visual grounding	2x2 matrix of method properties

Rules: Figure 1 must be understandable without reading the main text. The figure and its caption alone should communicate the core idea. Use color purposefully; don't just decorate.

图1是大多数读者看的第二件事（在摘要之后）。在撰写引言前绘制它——这能迫使你明确核心思想。

图1类型	使用场景	示例
方法图	新架构或流程	展示系统的TikZ流程图
结果预告	一个有说服力的结果就能说明全部故事	柱状图："我们的方法vs基准方法"，差距明显
问题示意图	问题不直观	展示你修复的故障模式的前后对比
概念图	抽象贡献需要视觉支撑	方法属性的2x2矩阵

规则：图1无需阅读文本即可理解。仅标题就应传达核心思想。有目的地使用颜色——不要只是装饰。

Step 5.3: Introduction (1-1.5 pages max)

步骤5.3：引言（最多1-1.5页）

Must include:

Clear problem statement
Brief approach overview
2-4 bullet contribution list (max 1-2 lines each in two-column format)
Methods should start by page 2-3

必须包含：

清晰的问题陈述
简要的方法概述
2-4个项目符号的贡献列表（双栏格式下每个最多1-2行）
方法部分应在第2-3页开始

Step 5.4: Methods

步骤5.4：方法

Enable reimplementation:

Conceptual outline or pseudocode
All hyperparameters listed
Architectural details sufficient for reproduction
Present final design decisions; ablations go in experiments

确保可复现：

概念大纲或伪代码
列出所有超参数
足够的架构细节以复现
呈现最终设计决策；消融实验放在实验部分

Step 5.5: Experiments & Results

步骤5.5：实验与结果

For each experiment, explicitly state:

What claim it supports
How it connects to main contribution
What to observe: "the blue line shows X, which demonstrates Y"

Requirements:

Error bars with methodology (std dev vs std error)
Hyperparameter search ranges
Compute infrastructure (GPU type, total hours)
Seed-setting methods

每个实验需明确说明：

支撑哪个论点
如何与核心贡献关联
观察什么："蓝线显示X，这证明了Y"

要求：

带方法说明的误差棒（标准差vs标准误）
超参数搜索范围
计算基础设施（GPU类型、总时长）
随机种子设置方法

Step 5.6: Related Work

步骤5.6：相关工作

Organize methodologically, not paper-by-paper. Cite generously — reviewers likely authored relevant papers.

按方法学组织，而非按论文逐个罗列。慷慨引用——评审可能是相关论文的作者。

Step 5.7: Limitations (REQUIRED)

步骤5.7：局限性（必须包含）

Most major conferences require or strongly expect this section; check the current call for papers. Honesty helps:

Reviewers are instructed not to penalize honest limitation acknowledgment
Pre-empt criticisms by identifying weaknesses first
Explain why limitations don't undermine core claims

所有主要会议都要求这部分。诚实有帮助：

评审被指示不要因诚实承认局限性而扣分
通过先指出弱点来预先阻止批评
解释为什么局限性不会削弱核心论点

Step 5.8: Conclusion & Discussion

步骤5.8：结论与讨论

Conclusion (required, 0.5-1 page):

Restate the contribution in one sentence (different wording from abstract)
Summarize key findings (2-3 sentences, not a list)
Implications: what does this mean for the field?
Future work: 2-3 concrete next steps (not vague "we leave X for future work")

Discussion (optional, sometimes combined with conclusion):

Broader implications beyond immediate results
Connections to other subfields
Honest assessment of when the method does and doesn't work
Practical deployment considerations

Do NOT introduce new results or claims in the conclusion.

结论（必须包含，0.5-1页）：

用一句话重述核心贡献（与摘要措辞不同）
总结关键发现（2-3句话，不要列表）
意义：这对领域意味着什么？
未来工作：2-3个具体的下一步（不要模糊的"我们将X留作未来工作"）

讨论（可选，有时与结论合并）：

超出直接结果的更广泛意义
与其他子领域的联系
诚实评估方法何时有效、何时无效
实际部署考虑

不要在结论中引入新结果或论点。

Step 5.9: Appendix Strategy

步骤5.9：附录策略

Appendices are unlimited at all major venues and are essential for reproducibility. Structure:

Appendix Section	What Goes Here
Proofs & Derivations	Full proofs too long for main text. Main text can state theorems with "proof in Appendix A."
Additional Experiments	Ablations, scaling curves, per-dataset breakdowns, hyperparameter sensitivity
Implementation Details	Full hyperparameter tables, training details, hardware specs, random seeds
Dataset Documentation	Data collection process, annotation guidelines, licensing, preprocessing
Prompts & Templates	Exact prompts used (for LLM-based methods), evaluation templates
Human Evaluation	Annotation interface screenshots, instructions given to annotators, IRB details
Additional Figures	Per-task breakdowns, trajectory visualizations, failure case examples

Rules:

The main paper must be self-contained — reviewers are not required to read appendices
Never put critical evidence only in the appendix
Cross-reference: "Full results in Table 5 (Appendix B)" not just "see appendix"
Use the `\appendix` command, then `\section{Proofs}` etc. LaTeX letters appendix sections automatically (A, B, ...), so leave the letter out of the title.

所有主要会议的附录不受页数限制，对可复现性至关重要。结构：

附录章节	内容
证明与推导	太长无法放入正文的完整证明。正文可陈述定理并注明"证明见附录A"。
额外实验	消融实验、缩放曲线、每个数据集的细分、超参数敏感性
实现细节	完整超参数表、训练细节、硬件规格、随机种子
数据集文档	数据收集过程、标注指南、许可、预处理
提示与模板	使用的确切提示（针对基于LLM的方法）、评估模板
人类评估	标注界面截图、给标注员的说明、IRB细节
额外图表	每个任务的细分、轨迹可视化、故障案例示例

规则：

正文必须自包含——评审无需阅读附录
不要将关键证据仅放在附录中
交叉引用："完整结果见表5（附录B）"而非仅"见附录"
使用
```
\appendix
```
命令，然后
```
\section{A: Proofs}
```
等

Page Budget Management

页数预算管理

When over the page limit:

Cut Strategy	Saves	Risk
Move proofs to appendix	0.5-2 pages	Low — standard practice
Condense related work	0.5-1 page	Medium — may miss key citations
Combine tables with subfigures	0.25-0.5 page	Low — often improves readability
Use `\vspace{-Xpt}` sparingly	0.1-0.3 page	Low if subtle, high if obvious
Remove qualitative examples	0.5-1 page	Medium — reviewers like examples
Reduce figure sizes	0.25-0.5 page	High — figures must remain readable

Do NOT: reduce font size, change margins, remove required sections (limitations, broader impact), or use `\small`/`\footnotesize` for main text.

超过页数限制时：

删减策略	节省页数	风险
将证明移至附录	0.5-2页	低——标准做法
压缩相关工作	0.5-1页	中——可能遗漏关键引用
将表格与子图合并	0.25-0.5页	低——通常提高可读性
谨慎使用 `\vspace{-Xpt}`	0.1-0.3页	若微妙则低，若明显则高
删除定性示例	0.5-1页	中——评审喜欢示例
缩小图表尺寸	0.25-0.5页	高——图表必须保持可读

不要：缩小字体、更改页边距、删除必填章节（局限性、更广泛影响）、在正文中使用

\small

\footnotesize

。

Step 5.10: Ethics & Broader Impact Statement

步骤5.10：伦理与更广泛影响声明

Most venues now require or strongly encourage an ethics/broader impact statement. This is not boilerplate — reviewers read it and can flag ethics concerns that trigger desk rejection.

What to include:

Component	Content	Required By
Positive societal impact	How your work benefits society	NeurIPS, ICML
Potential negative impact	Misuse risks, dual-use concerns, failure modes	NeurIPS, ICML
Fairness & bias	Does your method/data have known biases?	All venues (implicitly)
Environmental impact	Compute carbon footprint for large-scale training	ICML, increasingly NeurIPS
Privacy	Does your work use or enable processing of personal data?	ACL, NeurIPS
LLM disclosure	Was AI used in writing or experiments?	ICLR (mandatory), ACL

Writing the statement:

latex

\section*{Broader Impact Statement}
% NeurIPS/ICML: after conclusion, does not count toward page limit

% 1. Positive applications (1-2 sentences)
This work enables [specific application] which may benefit [specific group].

% 2. Risks and mitigations (1-3 sentences, be specific)
[Method/model] could potentially be misused for [specific risk]. We mitigate
this by [specific mitigation, e.g., releasing only model weights above size X,
including safety filters, documenting failure modes].

% 3. Limitations of impact claims (1 sentence)
Our evaluation is limited to [specific domain]; broader deployment would
require [specific additional work].

Common mistakes:

Writing "we foresee no negative impacts" (almost never true — reviewers distrust this)
Being vague: "this could be misused" without specifying how
Ignoring compute costs for large-scale work
Forgetting to disclose LLM use at venues that require it

Compute carbon footprint (for training-heavy papers):


大多数会议现在要求或强烈鼓励伦理/更广泛影响声明。这不是模板——评审会阅读它，可能标记伦理问题导致 desk rejection。

应包含的内容：

组件	内容	要求会议
积极社会影响	你的工作如何造福社会	NeurIPS、ICML
潜在负面影响	误用风险、双重用途问题、故障模式	NeurIPS、ICML
公平性与偏见	你的方法/数据是否有已知偏见？	所有会议（隐含要求）
环境影响	大规模训练的计算碳足迹	ICML、越来越多的NeurIPS
隐私	你的工作是否使用或支持处理个人数据？	ACL、NeurIPS
LLM披露	AI是否用于撰写或实验？	ICLR（强制）、ACL

撰写声明：

latex

\section*{Broader Impact Statement}
% NeurIPS/ICML：结论之后，不计入页数限制

% 1. 积极应用（1-2句话）
This work enables [specific application] which may benefit [specific group].

% 2. 风险与缓解措施（1-3句话，具体）
[Method/model] could potentially be misused for [specific risk]. We mitigate
this by [specific mitigation, e.g., releasing only model weights above size X,
including safety filters, documenting failure modes].

% 3. 影响声明的局限性（1句话）
Our evaluation is limited to [specific domain]; broader deployment would
require [specific additional work].

常见错误：

写"我们预见没有负面影响"（几乎从不真实——评审不信任这种说法）
模糊："这可能被误用"而不说明如何
忽略大规模工作的计算成本
在要求披露的会议上忘记披露LLM使用情况

计算碳足迹（针对训练密集型论文）：

python

# Estimate using ML CO2 Impact tool methodology
gpu_hours = 1000          # total GPU hours
gpu_tdp_watts = 400       # e.g., A100 = 400W
pue = 1.1                 # Power Usage Effectiveness (data center overhead)
carbon_intensity = 0.429  # kg CO2/kWh (US average; varies by region)

energy_kwh = (gpu_hours * gpu_tdp_watts * pue) / 1000
carbon_kg = energy_kwh * carbon_intensity
print(f"Energy: {energy_kwh:.0f} kWh, Carbon: {carbon_kg:.0f} kg CO2eq")

Step 5.11: Datasheets & Model Cards (If Applicable)


If your paper introduces a new dataset or releases a model, include structured documentation. Reviewers increasingly expect this, and NeurIPS Datasets & Benchmarks track requires it.

Datasheets for Datasets (Gebru et al., 2021) — include in appendix:

Dataset Documentation (Appendix):
- Motivation: Why was this dataset created? What task does it support?
- Composition: What are the instances? How many? What data types?
- Collection: How was data collected? What was the source?
- Preprocessing: What cleaning/filtering was applied?
- Distribution: How is the dataset distributed? Under what license?
- Maintenance: Who maintains it? How to report issues?
- Ethical considerations: Contains personal data? Consent obtained?
  Potential for harm? Known biases?

Model Cards (Mitchell et al., 2019) — include in appendix for model releases:

Model Card (Appendix):
- Model details: Architecture, training data, training procedure
- Intended use: Primary use cases, out-of-scope uses
- Metrics: Evaluation metrics and results on benchmarks
- Ethical considerations: Known biases, fairness evaluations
- Limitations: Known failure modes, domains where model underperforms

Writing Style

Sentence-level clarity (Gopen & Swan's 7 Principles):

Principle	Rule
Subject-verb proximity	Keep subject and verb close
Stress position	Place emphasis at sentence ends
Topic position	Put context first, new info after
Old before new	Familiar info → unfamiliar info
One unit, one function	Each paragraph makes one point
Action in verb	Use verbs, not nominalizations
Context before new	Set stage before presenting

Word choice (Lipton, Steinhardt):

Be specific: "accuracy" not "performance"
Eliminate hedging: drop "may" unless genuinely uncertain
Consistent terminology throughout
Avoid incremental vocabulary: "develop", not "combine"

Full writing guide with examples: See references/writing-guide.md

Using LaTeX Templates

Always copy the entire template directory first, then write within it.

Template Setup Checklist:
- [ ] Step 1: Copy entire template directory to new project
- [ ] Step 2: Verify template compiles as-is (before any changes)
- [ ] Step 3: Read the template's example content to understand structure
- [ ] Step 4: Replace example content section by section
- [ ] Step 5: Use template macros (check preamble for \newcommand definitions)
- [ ] Step 6: Clean up template artifacts only at the end

Step 1: Copy the Full Template

bash

cp -r templates/neurips2025/ ~/papers/my-paper/
cd ~/papers/my-paper/
ls -la  # Should see: main.tex, neurips.sty, Makefile, etc.

Copy the ENTIRE directory, not just the .tex file. Templates include style files (.sty), bibliography styles (.bst), example content, and Makefiles.
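One way to confirm the copy is complete before compiling is to check that the expected file types are present. The sketch below is a minimal illustration: the function name `missing_template_files` and the required-suffix set are assumptions made here, not part of any conference template.

```python
import tempfile
from pathlib import Path

# Assumed minimum: a LaTeX source and a style file. Extend per venue (e.g. ".bst").
REQUIRED_SUFFIXES = {".tex", ".sty"}


def missing_template_files(project_dir, required=REQUIRED_SUFFIXES):
    """Return the required file suffixes that are absent from project_dir."""
    present = {p.suffix for p in Path(project_dir).iterdir() if p.is_file()}
    return sorted(set(required) - present)


# Demo: simulate copying only the .tex file into a fresh project directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "main.tex").write_text("% placeholder")
    print(missing_template_files(tmp))  # the style file is reported as missing
```

An empty list means every required file type was found; anything else names what the copy left behind.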

Step 2: Verify Template Compiles First

Before making ANY changes:

bash

latexmk -pdf main.tex

# Or manual: pdflatex main.tex && bibtex main && pdflatex main.tex && pdflatex main.tex


If the unmodified template doesn't compile, fix that first (usually missing TeX packages — install via `tlmgr install <package>`).

**Step 3: Keep Template Content as Reference**

Don't immediately delete example content. Comment it out and use as formatting reference:
```latex
% Template example (keep for reference):
% \begin{figure}[t]
%   \centering
%   \includegraphics[width=0.8\linewidth]{example-image}
%   \caption{Template shows caption style}
% \end{figure}

% Your actual figure:
\begin{figure}[t]
  \centering
  \includegraphics[width=0.8\linewidth]{your-figure.pdf}
  \caption{Your caption following the same style.}
\end{figure}
```

Step 4: Replace Content Section by Section

Work through systematically: title/authors → abstract → introduction → methods → experiments → related work → conclusion → references → appendix. Compile after each section.

Step 5: Use Template Macros

latex

\usepackage{xspace}                   % Provides \xspace (load it if the template does not)
\newcommand{\method}{YourMethodName}  % Consistent method naming
\newcommand{\eg}{e.g.,\xspace}        % Proper abbreviations
\newcommand{\ie}{i.e.,\xspace}


Template Pitfalls

Pitfall	Problem	Solution
Copying only `.tex` file	Missing `.sty` , won't compile	Copy entire directory
Modifying `.sty` files	Breaks conference formatting	Never edit style files
Adding random packages	Conflicts, breaks template	Only add if necessary
Deleting template content early	Lose formatting reference	Keep as comments until done
Not compiling frequently	Errors accumulate	Compile after each section
Raster PNGs for plots	Blurry when zoomed or printed	Export plots as vector PDF via `plt.savefig('fig.pdf')`; keep raster for photographs only

Quick Template Reference

Conference	Main File	Style File	Page Limit
NeurIPS 2025	`main.tex`	`neurips.sty`	9 pages
ICML 2026	`example_paper.tex`	`icml2026.sty`	8 pages
ICLR 2026	`iclr2026_conference.tex`	`iclr2026_conference.sty`	9 pages
ACL 2025	`acl_latex.tex`	`acl.sty`	8 pages (long)
AAAI 2026	`aaai2026-unified-template.tex`	`aaai2026.sty`	7 pages
COLM 2025	`colm2025_conference.tex`	`colm2025_conference.sty`	9 pages

Universal: Double-blind, references don't count, appendices unlimited, LaTeX required.

Templates are in the `templates/` directory. See `templates/README.md` for compilation setup (VS Code, CLI, Overleaf, other IDEs).

Tables and Figures

Tables: use `booktabs` for professional formatting:

latex

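\usepackage{booktabs}  % goes in the preamble
% Numeric columns are right-aligned (r); the best value in every metric column is bold
\begin{tabular}{lrr}
\toprule
Method & Accuracy $\uparrow$ & Latency $\downarrow$ \\
\midrule
Baseline & 85.2 & 45ms \\
\textbf{Ours} & \textbf{92.1} & \textbf{38ms} \\
\bottomrule
\end{tabular}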

Rules:

Bold best value per metric
Include direction symbols ($\uparrow$ higher better, $\downarrow$ lower better)
Right-align numerical columns
Consistent decimal precision
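These rules can be applied mechanically when table rows are generated from result files rather than typed by hand. The following is a minimal sketch, assuming results arrive as a nested dictionary; `format_rows` and its argument names are hypothetical helpers, and the numbers are adapted from the example table.

```python
def format_rows(results, higher_is_better):
    """Build LaTeX table rows, bolding the best value in each metric column.

    results: {method: {metric: value}}
    higher_is_better: {metric: bool}, the direction of each metric
    """
    metrics = list(higher_is_better)
    best = {}
    for m in metrics:
        pick = max if higher_is_better[m] else min
        best[m] = pick(row[m] for row in results.values())

    rows = []
    for method, row in results.items():
        cells = []
        for m in metrics:
            text = f"{row[m]:.1f}"  # one decimal everywhere for consistent precision
            cells.append(rf"\textbf{{{text}}}" if row[m] == best[m] else text)
        rows.append(" & ".join([method] + cells) + r" \\")
    return rows


# Illustrative values adapted from the example table.
results = {
    "Baseline": {"Accuracy": 85.2, "Latency": 45.0},
    "Ours": {"Accuracy": 92.1, "Latency": 38.0},
}
for line in format_rows(results, {"Accuracy": True, "Latency": False}):
    print(line)
```

Generating rows this way keeps bolding and precision consistent when results change late in the revision cycle.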

Figures:

Vector graphics (PDF, EPS) for all plots and diagrams: `plt.savefig('fig.pdf')`
Raster (PNG 600 DPI) only for photographs
Colorblind-safe palettes (Okabe-Ito or Paul Tol)
Verify grayscale readability (8% of men have color vision deficiency)
No title inside figure — the caption serves this function
Self-contained captions — reader should understand without main text
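The grayscale check can be approximated numerically before printing anything. The sketch below converts the Okabe-Ito hex values used in this guide to an approximate gray level and lists pairs that end up close together. The Rec. 601 luma weights and the `min_gap` threshold of 10 are assumptions chosen for illustration; actual printers and viewers may behave differently.

```python
# Okabe-Ito hex values as used in this guide's LaTeX preamble.
PALETTE = {
    "okblue": "0072B2",
    "okorange": "E69F00",
    "okgreen": "009E73",
    "okred": "D55E00",
    "okpurple": "CC79A7",
    "okcyan": "56B4E9",
    "okyellow": "F0E442",
}


def luma(hex_color):
    """Approximate grayscale value (0-255) using Rec. 601 luma weights."""
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return 0.299 * r + 0.587 * g + 0.114 * b


def close_in_grayscale(palette, min_gap=10.0):
    """Return name pairs whose grayscale values differ by less than min_gap."""
    names = sorted(palette)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(luma(palette[a]) - luma(palette[b])) < min_gap:
                pairs.append((a, b))
    return pairs


print(close_in_grayscale(PALETTE))
```

Under these assumptions the cyan, orange, and purple entries land near each other in gray, which is why a colorblind-safe palette still benefits from distinct line styles or markers.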

Conference Resubmission

For converting between venues, see Phase 7 (Submission Preparation) — it covers the full conversion workflow, page-change table, and post-rejection guidance.

Professional LaTeX Preamble

Add these packages to any paper for professional quality. They work with most major conference style files, but check for clashes with packages the template already loads:

latex

% --- Professional Packages (add after conference style file) ---

% Typography
\usepackage{microtype}              % Microtypographic improvements (protrusion, expansion)
                                     % Makes text noticeably more polished — always include

% Tables
\usepackage{booktabs}               % Professional table rules (\toprule, \midrule, \bottomrule)
\usepackage{siunitx}                % Consistent number formatting, decimal alignment
                                     % Usage: \num{12345} → 12 345 (thin-space digit grouping by default)
                                     % \qty{3.5}{\giga\hertz} → 3.5 GHz (siunitx v3; v2 uses \SI)
                                     % Table alignment: S column type for decimal-aligned numbers

% Figures
\usepackage{graphicx}               % Include graphics (\includegraphics)
\usepackage{subcaption}             % Subfigures with (a), (b), (c) labels
                                     % Usage: \begin{subfigure}{0.48\textwidth} ... \end{subfigure}

% Diagrams and Algorithms
\usepackage{tikz}                   % Programmable vector diagrams
\usetikzlibrary{arrows.meta, positioning, shapes.geometric, calc, fit, backgrounds}
\usepackage[ruled,vlined]{algorithm2e}  % Professional pseudocode
                                     % Alternative: \usepackage{algorithmicx} if template bundles it

% Cross-references
\usepackage{cleveref}               % Smart references: \cref{fig:x} → "Figure 1"
                                     % MUST be loaded AFTER hyperref
                                     % Handles: figures, tables, sections, equations, algorithms

% Math (usually included by conference .sty, but verify)
\usepackage{amsmath,amssymb}        % AMS math environments and symbols
\usepackage{mathtools}              % Extends amsmath (dcases, coloneqq, etc.)

% Colors (for figures and diagrams)
\usepackage{xcolor}                 % Color management
% Okabe-Ito colorblind-safe palette:
\definecolor{okblue}{HTML}{0072B2}
\definecolor{okorange}{HTML}{E69F00}
\definecolor{okgreen}{HTML}{009E73}
\definecolor{okred}{HTML}{D55E00}
\definecolor{okpurple}{HTML}{CC79A7}
\definecolor{okcyan}{HTML}{56B4E9}
\definecolor{okyellow}{HTML}{F0E442}

Notes:

`microtype` is the single highest-impact package for visual quality. It applies character protrusion and font expansion to even out margins and word spacing. Always include it.
`siunitx` handles decimal alignment in tables via the `S` column type, which eliminates manual spacing.
`cleveref` must be loaded after `hyperref`. Most conference .sty files load hyperref, so put cleveref last.
Check if the conference template already loads any of these (especially `algorithm`, `amsmath`, `graphicx`). Don't double-load, and don't mix `algorithm2e` with the `algorithm`/`algorithmic` packages, since both define an `algorithm` environment.

siunitx Table Alignment

`siunitx` makes number-heavy tables significantly more readable:

latex

\begin{tabular}{l S[table-format=2.1] S[table-format=2.1] S[table-format=2.1]}
\toprule
Method & {Accuracy $\uparrow$} & {F1 $\uparrow$} & {Latency (ms) $\downarrow$} \\
\midrule
Baseline         & 85.2  & 83.7  & 45.3 \\
Ablation (no X)  & 87.1  & 85.4  & 42.1 \\
\textbf{Ours}    & {\textbf{92.1}} & {\textbf{90.8}} & {\textbf{38.7}} \\  % braces keep bold cells away from the S-column number parser
\bottomrule
\end{tabular}

The `S` column type auto-aligns on the decimal point. Headers wrapped in `{}` escape the alignment.

Subfigures

Standard pattern for side-by-side figures:

latex

\begin{figure}[t]
  \centering
  \begin{subfigure}[b]{0.48\textwidth}
    \centering
    \includegraphics[width=\textwidth]{fig_results_a.pdf}
    \caption{Results on Dataset A.}
    \label{fig:results-a}
  \end{subfigure}
  \hfill
  \begin{subfigure}[b]{0.48\textwidth}
    \centering
    \includegraphics[width=\textwidth]{fig_results_b.pdf}
    \caption{Results on Dataset B.}
    \label{fig:results-b}
  \end{subfigure}
  \caption{Comparison of our method across two datasets: (a) Dataset A and
  (b) Dataset B. Both use 5 random seeds.}
  \label{fig:results}
\end{figure}

Use `\cref{fig:results}` → "Figure 1", `\cref{fig:results-a}` → "Figure 1a".

Pseudocode with algorithm2e

latex

\begin{algorithm}[t]
\caption{Iterative Refinement with Judge Panel}
\label{alg:method}
\KwIn{Task $T$, model $M$, judges $J_1 \ldots J_n$, convergence threshold $k$}
\KwOut{Final output $A^*$}
$A \gets M(T)$ \tcp*{Initial generation}
$\text{streak} \gets 0$\;
\While{$\text{streak} < k$}{
  $C \gets \text{Critic}(A, T)$ \tcp*{Identify weaknesses}
  $B \gets M(T, C)$ \tcp*{Revised version addressing critique}
  $AB \gets \text{Synthesize}(A, B)$ \tcp*{Merge best elements}
  \ForEach{judge $J_i$}{
    $\text{rank}_i \gets J_i(\text{shuffle}(A, B, AB))$ \tcp*{Blind ranking}
  }
  $\text{winner} \gets \text{BordaCount}(\text{ranks})$\;
  \eIf{$\text{winner} = A$}{
    $\text{streak} \gets \text{streak} + 1$\;
  }{
    $A \gets \text{winner}$; $\text{streak} \gets 0$\;
  }
}
\Return{$A$}\;
\end{algorithm}

TikZ Diagram Patterns

TikZ is widely used for method diagrams in ML papers. Common patterns:

Pipeline/Flow Diagram (most common in ML papers):

latex

\begin{figure}[t]
\centering
\begin{tikzpicture}[
  node distance=0.8cm,  % gap between box edges (positioning library syntax)
  box/.style={rectangle, draw, rounded corners, minimum height=1cm, 
              minimum width=2cm, align=center, font=\small},
  arrow/.style={-{Stealth[length=3mm]}, thick},
]
  \node[box, fill=okcyan!20] (input) {Input\\$x$};
  \node[box, fill=okblue!20, right=of input] (encoder) {Encoder\\$f_\theta$};
  \node[box, fill=okgreen!20, right=of encoder] (latent) {Latent\\$z$};
  \node[box, fill=okorange!20, right=of latent] (decoder) {Decoder\\$g_\phi$};
  \node[box, fill=okred!20, right=of decoder] (output) {Output\\$\hat{x}$};
  
  \draw[arrow] (input) -- (encoder);
  \draw[arrow] (encoder) -- (latent);
  \draw[arrow] (latent) -- (decoder);
  \draw[arrow] (decoder) -- (output);
\end{tikzpicture}
\caption{Architecture overview. The encoder maps input $x$ to latent 
representation $z$, which the decoder reconstructs.}
\label{fig:architecture}
\end{figure}

Comparison/Matrix Diagram (for showing method variants):

latex

\begin{tikzpicture}[
  cell/.style={rectangle, draw, minimum width=2.5cm, minimum height=1cm, 
               align=center, font=\small},
  header/.style={cell, fill=gray!20, font=\small\bfseries},
]
  % Headers
  \node[header] at (0, 0) {Method};
  \node[header] at (3, 0) {Converges?};
  \node[header] at (6, 0) {Quality?};
  % Rows
  \node[cell] at (0, -1) {Single Pass};
  \node[cell, fill=okgreen!15] at (3, -1) {N/A};
  \node[cell, fill=okorange!15] at (6, -1) {Baseline};
  \node[cell] at (0, -2) {Critique+Revise};
  \node[cell, fill=okred!15] at (3, -2) {No};
  \node[cell, fill=okred!15] at (6, -2) {Degrades};
  \node[cell] at (0, -3) {Ours};
  \node[cell, fill=okgreen!15] at (3, -3) {Yes ($k$=2)};
  \node[cell, fill=okgreen!15] at (6, -3) {Improves};
\end{tikzpicture}

Iterative Loop Diagram (for methods with feedback):

latex

\begin{tikzpicture}[
  node distance=2cm,
  box/.style={rectangle, draw, rounded corners, minimum height=0.8cm, 
              minimum width=1.8cm, align=center, font=\small},
  arrow/.style={-{Stealth[length=3mm]}, thick},
  edgelabel/.style={font=\scriptsize, midway, above},  % avoids shadowing TikZ's built-in label key
]
  \node[box, fill=okblue!20] (gen) {Generator};
  \node[box, fill=okred!20, right=2.5cm of gen] (critic) {Critic};
  \node[box, fill=okgreen!20, below=1.5cm of $(gen)!0.5!(critic)$] (judge) {Judge Panel};
  
  \draw[arrow] (gen) -- node[edgelabel] {output $A$} (critic);
  \draw[arrow] (critic) -- node[edgelabel, right] {critique $C$} (judge);
  \draw[arrow] (judge) -| node[edgelabel, left, pos=0.3] {winner} (gen);
\end{tikzpicture}

latexdiff for Revision Tracking

Essential for rebuttals. It generates a marked-up PDF showing changes between versions:

bash

# Install
# macOS: brew install latexdiff (or comes with TeX Live)
# Linux: sudo apt install latexdiff

# Generate diff
latexdiff paper_v1.tex paper_v2.tex > paper_diff.tex
pdflatex paper_diff.tex

# For multi-file projects (with \input{} or \include{})
latexdiff --flatten paper_v1.tex paper_v2.tex > paper_diff.tex

This produces a PDF with deletions in red strikethrough and additions in blue, the standard format for rebuttal supplements.

SciencePlots for matplotlib


Install and use for publication-quality plots:

bash

pip install SciencePlots

python

import matplotlib.pyplot as plt
import scienceplots  # registers styles

# x, y, y2 are placeholders for your own data arrays
# Use science style (IEEE-like, clean)
with plt.style.context(['science', 'no-latex']):
    fig, ax = plt.subplots(figsize=(3.5, 2.5))  # Single-column width
    ax.plot(x, y, label='Ours', color='#0072B2')
    ax.plot(x, y2, label='Baseline', color='#D55E00', linestyle='--')
    ax.set_xlabel('Training Steps')
    ax.set_ylabel('Accuracy')
    ax.legend()
    fig.savefig('paper/fig_results.pdf', bbox_inches='tight')

# Available styles include 'science', 'ieee', and 'nature'; combine them as a list, e.g. ['science', 'ieee']
# Add 'no-latex' if LaTeX is not installed on the machine generating plots


**Standard figure sizes** (two-column format):
- Single column: `figsize=(3.5, 2.5)` — fits in one column
- Double column: `figsize=(7.0, 3.0)` — spans both columns
- Square: `figsize=(3.5, 3.5)` — for heatmaps, confusion matrices

---


Phase 6: Self-Review & Revision

Goal: Simulate the review process before submission. Catch weaknesses early.

Step 6.1: Simulate Reviews (Ensemble Pattern)

Generate reviews from multiple perspectives. The key insight from automated research pipelines (notably SakanaAI's AI-Scientist): ensemble reviewing with a meta-reviewer produces far more calibrated feedback than a single review pass.

Step 1: Generate N independent reviews (N=3-5)

Use different models or temperature settings. Each reviewer sees only the paper, not other reviews. Default to negative bias — LLMs have well-documented positivity bias in evaluation.

You are an expert reviewer for [VENUE]. You are critical and thorough.
If a paper has weaknesses or you are unsure about a claim, flag it clearly
and reflect that in your scores. Do not give the benefit of the doubt.

Review this paper according to the official reviewer guidelines. Evaluate:

1. Soundness (are claims well-supported? are baselines fair and strong?)
2. Clarity (is the paper well-written? could an expert reproduce it?)
3. Significance (does this matter to the community?)
4. Originality (new insights, not just incremental combination?)

Provide your review as structured JSON:
{
  "summary": "2-3 sentence summary",
  "strengths": ["strength 1", "strength 2", ...],
  "weaknesses": ["weakness 1 (most critical)", "weakness 2", ...],
  "questions": ["question for authors 1", ...],
  "missing_references": ["paper that should be cited", ...],
  "soundness": 1-4,
  "presentation": 1-4,
  "contribution": 1-4,
  "overall": 1-10,
  "confidence": 1-5
}
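Model replies do not always follow the requested schema, so it can help to validate each review before aggregating. The sketch below checks the fields and score ranges listed in the prompt; `validate_review` is a hypothetical helper, and the sample reply is invented for illustration.

```python
import json

# Score ranges copied from the review schema in the prompt.
SCORE_RANGES = {
    "soundness": (1, 4),
    "presentation": (1, 4),
    "contribution": (1, 4),
    "overall": (1, 10),
    "confidence": (1, 5),
}
LIST_FIELDS = ("strengths", "weaknesses", "questions", "missing_references")


def validate_review(raw):
    """Parse a reviewer's JSON reply and return (review, problems)."""
    try:
        review = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(review, dict):
        return None, ["top-level JSON value must be an object"]

    problems = []
    if not isinstance(review.get("summary"), str):
        problems.append("summary must be a string")
    for field in LIST_FIELDS:
        if not isinstance(review.get(field), list):
            problems.append(f"{field} must be a list")
    for field, (low, high) in SCORE_RANGES.items():
        value = review.get(field)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not low <= value <= high:
            problems.append(f"{field} must be an integer in [{low}, {high}]")
    return review, problems


# Invented reply with an out-of-range overall score.
reply = json.dumps({
    "summary": "Proposes X.", "strengths": ["clear"], "weaknesses": ["weak baselines"],
    "questions": [], "missing_references": [],
    "soundness": 3, "presentation": 3, "contribution": 2, "overall": 11, "confidence": 4,
})
review, problems = validate_review(reply)
print(problems)
```

A review with a non-empty problem list can be regenerated rather than passed on to the meta-reviewer.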

Step 2: Meta-review (Area Chair aggregation)

Feed all N reviews to a meta-reviewer:

You are an Area Chair at [VENUE]. You have received [N] independent reviews
of a paper. Your job is to:

1. Identify consensus strengths and weaknesses across reviewers
2. Resolve disagreements by examining the paper directly
3. Produce a meta-review that represents the aggregate judgment
4. Use AVERAGED numerical scores across all reviews

Be conservative: if reviewers disagree on whether a weakness is serious,
treat it as serious until the authors address it.

Reviews:
[review_1]
[review_2]
...
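The numeric part of the aggregation does not need a model at all. The sketch below averages the scores and counts how many reviewers raised each weakness. It is a simplification: `aggregate_reviews` is a hypothetical helper, the sample reviews are invented, and weaknesses are matched by exact string, whereas in practice the meta-reviewer would group paraphrased concerns.

```python
from statistics import mean

SCORE_FIELDS = ("soundness", "presentation", "contribution", "overall", "confidence")


def aggregate_reviews(reviews):
    """Average numeric scores and count how many reviewers raised each weakness."""
    averaged = {f: round(mean(r[f] for r in reviews), 2) for f in SCORE_FIELDS}
    weakness_counts = {}
    for review in reviews:
        for weakness in set(review["weaknesses"]):
            weakness_counts[weakness] = weakness_counts.get(weakness, 0) + 1
    # Weaknesses raised by more reviewers come first; ties sort alphabetically.
    ranked = sorted(weakness_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {"scores": averaged, "weaknesses": ranked}


# Invented reviews for illustration only.
sample_reviews = [
    {"soundness": 2, "presentation": 3, "contribution": 2, "overall": 4, "confidence": 4,
     "weaknesses": ["missing baseline", "unclear notation"]},
    {"soundness": 3, "presentation": 3, "contribution": 3, "overall": 5, "confidence": 3,
     "weaknesses": ["missing baseline"]},
    {"soundness": 3, "presentation": 2, "contribution": 3, "overall": 6, "confidence": 4,
     "weaknesses": ["small test set"]},
]
summary = aggregate_reviews(sample_reviews)
print(summary["scores"])
print(summary["weaknesses"])
```

Passing these averages to the meta-reviewer prompt keeps the Area Chair model from re-deriving, and possibly miscomputing, the scores itself.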

Step 3: Reflection loop (optional, 2-3 rounds)

Each reviewer can refine their review after seeing the meta-review. Use an early termination sentinel: if the reviewer responds "I am done" (no changes), stop iterating.
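The reflection loop with its early-termination sentinel can be sketched as follows. The model call is abstracted behind `ask_reviewer`, a caller-supplied function that is hypothetical here; `fake_reviewer` is a stub standing in for it so the loop can be run without any API.

```python
DONE_SENTINEL = "I am done"  # termination phrase described in the text


def reflect(review, meta_review, ask_reviewer, max_rounds=3):
    """Let one reviewer refine its review until it signals completion.

    ask_reviewer(review, meta_review) should send both texts to a model
    and return its reply as a string.
    """
    for _ in range(max_rounds):
        reply = ask_reviewer(review, meta_review)
        if DONE_SENTINEL.lower() in reply.lower():
            break  # reviewer has no further changes
        review = reply
    return review


# Stub standing in for a model call: revises once, then stops.
def fake_reviewer(review, meta_review):
    return "I am done" if review.endswith("(revised)") else review + " (revised)"


final = reflect("Weak baselines.", "Reviewers agree on baselines.", fake_reviewer)
print(final)
```

The `max_rounds` cap matters as much as the sentinel, since a reviewer model that never emits the phrase would otherwise loop indefinitely.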

Model selection for reviewing: Reviewing is best done with the strongest available model, even if you wrote the paper with a cheaper one. The reviewer model should be chosen independently from the writing model.

Few-shot calibration: If available, include 1-2 real published reviews from the target venue as examples. This dramatically improves score calibration. See references/reviewer-guidelines.md for example reviews.

Step 6.1b: Visual Review Pass (VLM)

Text-only review misses an entire class of problems: figure quality, layout issues, visual consistency. If you have access to a vision-capable model, run a separate visual review on the compiled PDF:

You are reviewing the visual presentation of this research paper PDF.
Check for:
1. Figure quality: Are plots readable? Labels legible? Colors distinguishable?
2. Figure-caption alignment: Does each caption accurately describe its figure?
3. Layout issues: Orphaned section headers, awkward page breaks, figures far from their references
4. Table formatting: Aligned columns, consistent decimal precision, bold for best results
5. Visual consistency: Same color scheme across all figures, consistent font sizes
6. Grayscale readability: Would the figures be understandable if printed in B&W?

For each issue, specify the page number and exact location.

This catches problems that text-based review cannot: a plot with illegible axis labels, a figure placed 3 pages from its first reference, inconsistent color palettes between Figure 2 and Figure 5, or a table that's clearly wider than the column width.

Step 6.1c: Claim Verification Pass

After simulated reviews, run a separate verification pass. This catches factual errors that reviewers might miss:

Claim Verification Protocol:
1. Extract every factual claim from the paper (numbers, comparisons, trends)
2. For each claim, trace it to the specific experiment/result that supports it
3. Verify the number in the paper matches the actual result file
4. Flag any claim without a traceable source as [VERIFY]

For agent-based workflows: delegate verification to a fresh sub-agent that receives only the paper text and the raw result files. The fresh context prevents confirmation bias — the verifier doesn't "remember" what the results were supposed to be.

模拟评审后，进行单独的验证。这能发现评审可能遗漏的事实错误：

论点验证协议：
1. 从论文中提取每个事实主张（数字、比较、趋势）
2. 每个主张都追溯到支撑它的特定实验/结果
3. 验证论文中的数字与实际结果文件匹配
4. 将任何无可追溯来源的主张标记为[VERIFY]

对于Agent工作流：委派给新的子Agent，仅提供论文文本和原始结果文件。新上下文可防止确认偏差——验证者不会“记住”结果应该是什么。
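Steps 1 to 4 of the protocol can be approximated mechanically as a first filter. The sketch below is an assumption-laden illustration: it treats results as a nested JSON-like dict, compares numbers after rounding to one decimal, and splits sentences naively. The function names and sample data are hypothetical. It will also flag harmless numbers such as table indices or years, so its output is a list for a human or sub-agent to check.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def flatten_numbers(obj) -> list:
    """Collect every numeric leaf from a nested JSON-like structure."""
    if isinstance(obj, bool):
        return []
    if isinstance(obj, (int, float)):
        return [float(obj)]
    if isinstance(obj, dict):
        return [n for v in obj.values() for n in flatten_numbers(v)]
    if isinstance(obj, list):
        return [n for v in obj for n in flatten_numbers(v)]
    return []


def verify_claims(paper_text: str, results: dict, decimals: int = 1) -> list:
    """Return sentences containing a number that matches no value in results."""
    known = {round(n, decimals) for n in flatten_numbers(results)}
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", paper_text):
        numbers = [float(m) for m in NUMBER.findall(sentence)]
        if any(round(n, decimals) not in known for n in numbers):
            flagged.append(f"[VERIFY] {sentence.strip()}")
    return flagged


# Hypothetical paper text and result file contents.
text = "Our method reaches 84.3 accuracy on the test set. The baseline reaches 79.0."
results = {"ours": {"acc": 84.31}, "baseline": {"acc": 80.2}}
print(verify_claims(text, results))
```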

Step 6.2: Prioritize Feedback

步骤6.2：优先处理反馈

After collecting reviews, categorize:

Priority	Action
Critical (technical flaw, missing baseline)	Must fix. May require new experiments → back to Phase 2
High (clarity issue, missing ablation)	Should fix in this revision
Medium (minor writing issues, extra experiments)	Fix if time allows
Low (style preferences, tangential suggestions)	Note for future work

收集评审意见后分类：

优先级	行动
关键（技术缺陷、缺少基准方法）	必须修复。可能需要新实验 → 返回阶段2
高（清晰度问题、缺少消融实验）	应在本次修订中修复
中（次要写作问题、额外实验）	时间允许则修复
低（风格偏好、无关建议）	记录为未来工作

Step 6.3: Revision Cycle

步骤6.3：修订循环

For each critical/high issue:

Identify the specific section(s) affected
Draft the fix
Verify the fix doesn't break other claims
Update the paper
Re-check against the reviewer's concern

每个关键/高优先级问题：

确定受影响的特定章节
起草修复方案
验证修复不会破坏其他论点
更新论文
对照评审关注点重新检查

Step 6.4: Rebuttal Writing

步骤6.4：反驳撰写

When responding to actual reviews (post-submission), rebuttals are a distinct skill from revision:

Format: Point-by-point. For each reviewer concern:

> R1-W1: "The paper lacks comparison with Method X."

We thank the reviewer for this suggestion. We have added a comparison with 
Method X in Table 3 (revised). Our method outperforms X by 3.2pp on [metric] 
(p<0.05). We note that X requires 2x our compute budget.

Rules:

Address every concern — reviewers notice if you skip one
Lead with the strongest responses
Be concise and direct — reviewers read dozens of rebuttals
Include new results if you ran experiments during the rebuttal period
Never be defensive or dismissive, even of weak criticisms
Use `latexdiff` to generate a marked-up PDF showing changes (see Professional LaTeX Tooling section)
Thank reviewers for specific, actionable feedback (not generic praise)

What NOT to do: "We respectfully disagree" without evidence. "This is out of scope" without explanation. Ignoring a weakness by only responding to strengths.

回应实际评审意见（提交后）时，反驳是与修订不同的技能：

格式：逐点回应。每个评审关注点：

> R1-W1: "The paper lacks comparison with Method X."

We thank the reviewer for this suggestion. We have added a comparison with 
Method X in Table 3 (revised). Our method outperforms X by 3.2pp on [metric] 
(p<0.05). We note that X requires 2x our compute budget.

规则：

回应每个关注点——评审会注意到你是否跳过
从最强的回应开始
简洁直接——评审要读几十份反驳
如果反驳期间运行了实验，包含新结果
永远不要防御性或 dismissive，即使是弱批评
使用 `latexdiff` 生成标记PDF显示变化（见专业LaTeX工具部分）
感谢评审提供的具体、可操作的反馈（不要泛泛赞美）

不要做：“我们不同意”而无证据。“这超出范围”而无解释。仅回应优点而忽略弱点。

Step 6.5: Paper Evolution Tracking

步骤6.5：论文演变跟踪

Save snapshots at key milestones:

paper/
  paper.tex                    # Current working version
  paper_v1_first_draft.tex     # First complete draft
  paper_v2_post_review.tex     # After simulated review
  paper_v3_pre_submission.tex  # Final before submission
  paper_v4_camera_ready.tex    # Post-acceptance final

在关键里程碑保存快照：

paper/
  paper.tex                    # 当前工作版本
  paper_v1_first_draft.tex     # 第一份完整草稿
  paper_v2_post_review.tex     # 模拟评审后
  paper_v3_pre_submission.tex  # 提交前最终版本
  paper_v4_camera_ready.tex    # 录用后最终版本
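Snapshots following the naming scheme above can be taken with a small helper. This is only a sketch: the `snapshot` function and its refusal to overwrite an existing file are assumptions for illustration, and the demo runs in a temporary directory, not a real `paper/` folder.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot(paper_dir: Path, version: int, milestone: str) -> Path:
    """Copy paper.tex to paper_v<version>_<milestone>.tex without overwriting."""
    source = paper_dir / "paper.tex"
    target = paper_dir / f"paper_v{version}_{milestone}.tex"
    if target.exists():
        raise FileExistsError(f"{target.name} already exists; pick a new version number")
    shutil.copy2(source, target)
    return target


# Demo in a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "paper.tex").write_text("\\documentclass{article}\n")
    print(snapshot(demo_dir, 1, "first_draft").name)
```

Committing each snapshot to git as well keeps the history recoverable even if a file is deleted by accident.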

Phase 7: Submission Preparation

阶段7：提交准备

Goal: Final checks, formatting, and submission.

目标：最终检查、格式化与提交。

Step 7.1: Conference Checklist

步骤7.1：会议清单

Every venue has mandatory checklists. Complete them carefully — incomplete checklists can result in desk rejection.

See references/checklists.md for:

NeurIPS 16-item paper checklist
ICML broader impact + reproducibility
ICLR LLM disclosure policy
ACL mandatory limitations section
Universal pre-submission checklist

每个会议都有强制清单。仔细完成——不完整的清单可能导致desk rejection。

详见references/checklists.md获取：

NeurIPS 16项论文清单
ICML更广泛影响+可复现性要求
ICLR LLM披露政策
ACL强制局限性章节
通用提交前清单

Step 7.2: Anonymization Checklist

步骤7.2：匿名化清单

Double-blind review means reviewers cannot know who wrote the paper. Check ALL of these:

Anonymization Checklist:
- [ ] No author names or affiliations anywhere in the PDF
- [ ] No acknowledgments section (add after acceptance)
- [ ] Self-citations written in third person: "Smith et al. [1] showed..." not "We previously showed [1]..."
- [ ] No GitHub/GitLab URLs pointing to your personal repos
- [ ] Use Anonymous GitHub (https://anonymous.4open.science/) for code links
- [ ] No institutional logos or identifiers in figures
- [ ] No file metadata containing author names (check PDF properties)
- [ ] No "our previous work" or "in our earlier paper" phrasing
- [ ] Dataset names don't reveal institution (rename if needed)
- [ ] Supplementary materials don't contain identifying information

Common mistakes: Git commit messages visible in supplementary code, watermarked figures from institutional tools, acknowledgments left in from a previous draft, arXiv preprint posted during the venue's anonymity period.

双盲评审意味着评审不知道作者是谁。检查所有这些：

匿名化清单：
- [ ] PDF中无作者姓名或机构
- [ ] 无致谢部分（录用后添加）
- [ ] 自引用用第三人称："Smith et al. [1] showed..."而非"We previously showed [1]..."
- [ ] 无指向个人仓库的GitHub/GitLab URL
- [ ] 使用Anonymous GitHub（https://anonymous.4open.science/）作为代码链接
- [ ] 图表中无机构标志或标识符
- [ ] 文件元数据中无作者姓名（检查PDF属性）
- [ ] 无"our previous work"或"in our earlier paper"表述
- [ ] 数据集名称不透露机构（必要时重命名）
- [ ] 补充材料中无识别信息

常见错误：补充代码中可见的Git提交消息、机构工具生成的带水印图表、前一草稿中遗留的致谢、匿名期前发布的arXiv预印本。
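Several checklist items are textual and can be grepped for before a manual read. The sketch below assumes the paper source is available as a single string; the pattern set is a hypothetical starting point covering only the self-reference, repository URL, and acknowledgments items. It cannot see PDF metadata, figures, or supplementary files.

```python
import re

# Illustrative patterns only; extend them for your own paper.
PATTERNS = {
    "first-person self-reference": re.compile(
        r"\b(?:our (?:previous|prior|earlier) (?:work|paper)|we previously)\b", re.I),
    "GitHub/GitLab URL": re.compile(
        r"https?://(?:www\.)?(?:github|gitlab)\.com/\S+", re.I),
    "acknowledgments section": re.compile(
        r"\\(?:section|paragraph)\*?\{Acknowledge?ments?\}", re.I),
}


def scan_for_identity_leaks(tex: str) -> list:
    """Return (line number, issue label) pairs for lines that match a pattern."""
    hits = []
    for lineno, line in enumerate(tex.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits


sample_tex = "\n".join([
    r"\section{Introduction}",
    "In our previous work [1], we showed that the gap is wide.",
    r"Code: \url{https://github.com/alice/secret-project}",
    r"Code: \url{https://anonymous.4open.science/r/abc}",
    r"\section*{Acknowledgments}",
])
for lineno, label in scan_for_identity_leaks(sample_tex):
    print(f"line {lineno}: {label}")
```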

Step 7.3: Formatting Verification

步骤7.3：格式验证

Pre-Submission Format Check:
- [ ] Page limit respected (excluding references and appendix)
- [ ] All figures are vector (PDF) or high-res raster (600 DPI PNG)
- [ ] All figures readable in grayscale
- [ ] All tables use booktabs
- [ ] References compile correctly (no "?" in citations)
- [ ] No overfull hboxes in critical areas
- [ ] Appendix clearly labeled and separated
- [ ] Required sections present (limitations, broader impact, etc.)

提交前格式检查：
- [ ] 遵守页数限制（不包括参考文献和附录）
- [ ] 所有图表为矢量（PDF）或高分辨率光栅（600 DPI PNG）
- [ ] 所有图表灰度可读
- [ ] 所有表格使用booktabs
- [ ] 参考文献编译正确（无"?"引用）
- [ ] 关键区域无hbox溢出
- [ ] 附录清晰标记并分离
- [ ] 包含必填章节（局限性、更广泛影响等）
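Two items in this check (undefined citations and overfull hboxes) leave traces in the LaTeX `.log` file. The sketch below assumes the usual pdfLaTeX warning wording and that each warning fits on one log line, since TeX wraps long lines. The 5pt threshold for "critical" overflow is an arbitrary assumption.

```python
import re

OVERFULL = re.compile(r"^Overfull \\hbox \((\d+(?:\.\d+)?)pt too wide\) (.+)$")
UNDEFINED = re.compile(r"^LaTeX Warning: (Citation|Reference) `([^']+)' .*undefined")


def summarize_log(log_text: str, min_overfull_pt: float = 5.0) -> dict:
    """Collect large overfull hboxes and undefined citations/references."""
    report = {"overfull": [], "undefined": []}
    for line in log_text.splitlines():
        m = OVERFULL.match(line)
        if m:
            if float(m.group(1)) >= min_overfull_pt:
                report["overfull"].append((float(m.group(1)), m.group(2)))
            continue
        m = UNDEFINED.match(line)
        if m:
            report["undefined"].append((m.group(1).lower(), m.group(2)))
    return report


# Hand-written sample in the style of a pdfLaTeX log, not real output.
sample_log = "\n".join([
    r"Overfull \hbox (1.2pt too wide) in paragraph at lines 10--12",
    r"Overfull \hbox (23.5pt too wide) in paragraph at lines 45--47",
    "LaTeX Warning: Citation `smith2020' on page 3 undefined on input line 120.",
    "LaTeX Warning: Reference `fig:arch' on page 2 undefined on input line 88.",
])
print(summarize_log(sample_log))
```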

Step 7.4: Pre-Compilation Validation

步骤7.4：预编译验证

Run these automated checks before attempting `pdflatex`. Catching errors here is faster than debugging compiler output.

尝试 `pdflatex` 前运行这些自动检查。在此处发现错误比调试编译器输出更快。

```bash
# 1. Lint with chktex (catches common LaTeX mistakes)
# Suppress noisy warnings: -n1 (command terminated with space), -n2 (non-breaking space),
# -n13 (intersentence spacing), -n24 (space before \label or \index)
chktex main.tex -q -n2 -n24 -n13 -n1

# 2. Verify all citations exist in .bib
# Extract \cite{...} from .tex, check each against .bib
# (quoted heredoc so the shell leaves backslashes and quotes untouched)
python3 - <<'EOF'
import re
tex = open('main.tex').read()
bib = open('references.bib').read()
cites = set(re.findall(r'\\cite[tp]?\{([^}]+)\}', tex))
for cite_group in cites:
    for cite in cite_group.split(','):
        cite = cite.strip()
        if cite and cite not in bib:
            print(f'WARNING: \\cite{{{cite}}} not found in references.bib')
EOF

# 3. Verify all referenced figures exist on disk
# Note: paths without an extension, or resolved via \graphicspath, are reported as missing
python3 - <<'EOF'
import re, os
tex = open('main.tex').read()
figs = re.findall(r'\\includegraphics(?:\[.*?\])?\{([^}]+)\}', tex)
for fig in figs:
    if not os.path.exists(fig):
        print(f'WARNING: Figure file not found: {fig}')
EOF

# 4. Check for duplicate \label definitions
python3 - <<'EOF'
import re
from collections import Counter
tex = open('main.tex').read()
labels = re.findall(r'\\label\{([^}]+)\}', tex)
dupes = {k: v for k, v in Counter(labels).items() if v > 1}
for label, count in dupes.items():
    print(f'WARNING: Duplicate label: {label} (appears {count} times)')
EOF
```


Fix any warnings before proceeding. For agent-based workflows: feed chktex output back to the agent with instructions to make minimal fixes.


继续前修复所有警告。对于Agent工作流：将chktex输出反馈给Agent，指示进行最小修复。
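A related check in the same spirit is that every cross-reference points at a defined label. The sketch below is an optional extra, not one of the numbered checks in this step. It assumes all labels live in the text passed in, and it only knows the reference commands listed in its regex.

```python
import re

LABEL = re.compile(r"\\label\{([^}]+)\}")
REF = re.compile(r"\\(?:ref|eqref|autoref|cref|Cref|pageref)\{([^}]+)\}")


def undefined_refs(tex: str) -> list:
    """Return referenced keys that have no matching \\label, in order of first use."""
    labels = set(LABEL.findall(tex))
    missing = []
    for group in REF.findall(tex):
        for key in group.split(","):  # \cref accepts comma-separated keys
            key = key.strip()
            if key and key not in labels and key not in missing:
                missing.append(key)
    return missing


sample = r"\section{A}\label{sec:a} See \ref{sec:a}, \cref{fig:x, sec:a} and \eqref{eq:1}."
for key in undefined_refs(sample):
    print(f"WARNING: reference to undefined label: {key}")
```

For multi-file papers, concatenate the included `.tex` files before calling the function.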

Step 7.5: Final Compilation

步骤7.5：最终编译

```bash
# Clean build (remove only main.pdf so figure PDFs in this directory survive)
rm -f *.aux *.bbl *.blg *.log *.out main.pdf
latexmk -pdf main.tex

# Or manual (pdflatex, bibtex, then pdflatex twice to resolve cross-references)
pdflatex -interaction=nonstopmode main.tex
bibtex main
pdflatex -interaction=nonstopmode main.tex
pdflatex -interaction=nonstopmode main.tex

# Verify output exists and has content
ls -la main.pdf
```


**If compilation fails**: Parse the `.log` file for the first error. Common fixes:
- "Undefined control sequence" → missing package or typo in command name
- "Missing $ inserted" → math symbol outside math mode
- "File not found" → wrong figure path or missing .sty file
- "Citation undefined" → .bib entry missing or bibtex not run


**如果编译失败**：解析`.log`文件找第一个错误。常见修复：
- "Undefined control sequence" → 缺少包或命令名拼写错误
- "Missing $ inserted" → 数学符号在数学模式外
- "File not found" → 图表路径错误或缺少.sty文件
- "Citation undefined" → .bib条目缺失或未运行bibtex
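Finding the first error in the `.log` can be sketched as follows. It assumes TeX's default error format, where error lines start with `! ` and the offending source line follows as `l.<number>`; compiling with `-file-line-error` changes that format. The hint table mirrors the fixes listed above; "Citation undefined" is left out because LaTeX reports it as a warning, not an error.

```python
import re

# Substring of the error message -> suggested fix (taken from the list above).
HINTS = {
    "Undefined control sequence": "missing package or typo in command name",
    "Missing $ inserted": "math symbol outside math mode",
    "not found": "wrong figure path or missing .sty file",
}


def first_error(log_text: str):
    """Return message, source line, and hint for the first '! ' error, or None."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if not line.startswith("! "):
            continue
        message = line[2:].strip()
        line_no = None
        # TeX echoes the offending source line as "l.<number> ..." shortly after.
        for follow in lines[i + 1:i + 10]:
            m = re.match(r"l\.(\d+)", follow)
            if m:
                line_no = int(m.group(1))
                break
        hint = next((h for key, h in HINTS.items() if key in message), None)
        return {"message": message, "line": line_no, "hint": hint}
    return None


# Hand-written sample in the style of a pdfLaTeX log, not real output.
sample_log = "\n".join([
    "(./main.tex",
    "! Undefined control sequence.",
    r"l.42 \foo",
    "        bar",
    "! Missing $ inserted.",
])
print(first_error(sample_log))
```

Fixing only the first error and recompiling is usually enough, since later errors are often side effects of it.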

Step 7.6: Conference-Specific Requirements

步骤7.6：会议特定要求

Venue	Special Requirements
NeurIPS	Paper checklist in appendix, lay summary if accepted
ICML	Broader Impact Statement (after conclusion, doesn't count toward limit)
ICLR	LLM disclosure required, reciprocal reviewing agreement
ACL	Mandatory Limitations section, Responsible NLP checklist
AAAI	Strict style file — no modifications whatsoever
COLM	Frame contribution for language model community

会议	特殊要求
NeurIPS	附录中包含论文清单，录用后需提供通俗摘要
ICML	更广泛影响声明（结论后，不计入页数限制）
ICLR	强制LLM披露，互惠评审协议
ACL	强制局限性章节，负责任NLP清单
AAAI	严格样式文件——不得修改
COLM	为语言模型社区构建贡献框架

Step 7.7: Conference Resubmission & Format Conversion

步骤7.7：会议重提交与格式转换

When converting between venues, never copy LaTeX preambles between templates:

在不同会议间转换时，绝不要在模板间复制LaTeX导言区：

```bash
# 1. Start fresh with target template
cp -r templates/icml2026/ new_submission/

# 2. Copy ONLY content sections (not preamble)
#    - Abstract text, section content, figures, tables, bib entries

# 3. Adjust for page limits

# 4. Add venue-specific required sections

# 5. Update references
```


| From → To | Page Change | Key Adjustments |
|-----------|-------------|-----------------|
| NeurIPS → ICML | 9 → 8 | Cut 1 page, add Broader Impact |
| ICML → ICLR | 8 → 9 | Expand experiments, add LLM disclosure |
| NeurIPS → ACL | 9 → 8 | Restructure for NLP conventions, add Limitations |
| ICLR → AAAI | 9 → 7 | Significant cuts, strict style adherence |
| Any → COLM | varies → 9 | Reframe for language model focus |

When cutting pages: move proofs to appendix, condense related work, combine tables, use subfigures.
When expanding: add ablations, expand limitations, include additional baselines, add qualitative examples.

**After rejection**: Address reviewer concerns in the new version, but don't include a "changes" section or reference the previous submission (blind review).


| 从 → 到 | 页数变化 | 关键调整 |
|-----------|-------------|-----------------|
| NeurIPS → ICML | 9 → 8 | 删减1页，添加更广泛影响声明 |
| ICML → ICLR | 8 → 9 | 扩展实验，添加LLM披露 |
| NeurIPS → ACL | 9 → 8 | 按NLP惯例重构，添加局限性章节 |
| ICLR → AAAI | 9 → 7 | 大幅删减，严格遵守样式 |
| 任何 → COLM | 可变 → 9 | 为语言模型焦点重构框架 |

删减页数时：将证明移至附录，压缩相关工作，合并表格，使用子图。
扩展页数时：添加消融实验，扩展局限性，包含更多基准方法，添加定性示例。

**拒稿后**：在新版本中解决评审关注点，但不要包含“变化”部分或引用之前的提交（双盲评审）。

Step 7.8: Camera-Ready Preparation (Post-Acceptance)

步骤7.8：终稿准备（录用后）

After acceptance, prepare the camera-ready version:

Camera-Ready Checklist:
- [ ] De-anonymize: add author names, affiliations, email addresses
- [ ] Add Acknowledgments section (funding, compute grants, helpful reviewers)
- [ ] Add public code/data URL (real GitHub, not anonymous)
- [ ] Address any mandatory revisions from meta-reviewer
- [ ] Switch template to camera-ready mode (if applicable — e.g., AAAI \anon → \camera)
- [ ] Add copyright notice if required by venue
- [ ] Update any "anonymous" placeholders in text
- [ ] Verify final PDF compiles cleanly
- [ ] Check page limit for camera-ready (sometimes differs from submission)
- [ ] Upload supplementary materials (code, data, appendix) to venue portal

录用后准备终稿：

终稿清单：
- [ ] 去匿名化：添加作者姓名、机构、邮箱
- [ ] 添加致谢部分（资助、计算资源、有帮助的评审）
- [ ] 添加公开代码/数据URL（真实GitHub，而非匿名）
- [ ] 解决元评审要求的强制修订
- [ ] 切换到终稿模板（如有需要——例如，AAAI \anon → \camera）
- [ ] 如会议要求，添加版权声明
- [ ] 更新文本中所有“匿名”占位符
- [ ] 验证最终PDF编译无误
- [ ] 检查终稿页数限制（有时与提交时不同）
- [ ] 将补充材料（代码、数据、附录）上传到会议门户

Step 7.9: arXiv & Preprint Strategy

步骤7.9：arXiv与预印本策略

Posting to arXiv is standard practice in ML but has important timing and anonymity considerations.

Timing decision tree:

Situation	Recommendation
Submitting to double-blind venue (NeurIPS, ICML, ACL)	Post to arXiv after submission deadline, not before. Posting before can technically violate anonymity policies, though enforcement varies.
Submitting to ICLR	ICLR explicitly allows arXiv posting before submission. But don't put author names in the submission itself.
Paper already on arXiv, submitting to new venue	Acceptable at most venues. Do NOT update arXiv version during review with changes that reference reviews.
Workshop paper	arXiv is fine at any time — workshops are typically not double-blind.
Want to establish priority	Post immediately if scooping is a concern — but accept the anonymity tradeoff.

arXiv category selection (ML/AI papers):

Category	Code	Best For
Machine Learning	`cs.LG`	General ML methods
Computation and Language	`cs.CL`	NLP, language models
Artificial Intelligence	`cs.AI`	Reasoning, planning, agents
Computer Vision	`cs.CV`	Vision models
Information Retrieval	`cs.IR`	Search, recommendation

List primary + 1-2 cross-listed categories. More categories = more visibility, but only cross-list where genuinely relevant.

Versioning strategy:

v1: Initial submission (matches conference submission)
v2: Post-acceptance with camera-ready corrections (add "accepted at [Venue]" to abstract)
Don't post v2 during the review period with changes that clearly respond to reviewer feedback

发布到arXiv是ML领域的标准做法，但有重要的时间和匿名性考虑。

时间决策树：

情况	建议
提交到双盲会议（NeurIPS、ICML、ACL）	提交截止后再发布到arXiv，不要提前。提前发布可能在技术上违反匿名政策，尽管执行力度不同。
提交到ICLR	ICLR明确允许提交前发布到arXiv。但提交本身不要包含作者姓名。
论文已在arXiv，提交到新会议	大多数会议允许。评审期间不要更新arXiv版本以响应评审意见。
研讨会论文	任何时间都可以发布到arXiv——研讨会通常不是双盲。
想确立优先权	如果担心被抢先，立即发布——但接受匿名性权衡。

arXiv类别选择（ML/AI论文）：

类别	代码	最佳适用场景
Machine Learning	`cs.LG`	通用ML方法
Computation and Language	`cs.CL`	NLP、语言模型
Artificial Intelligence	`cs.AI`	推理、规划、Agent
Computer Vision	`cs.CV`	视觉模型
Information Retrieval	`cs.IR`	搜索、推荐

列出主要类别+1-2个交叉类别。更多类别=更高可见性，但仅在真正相关时交叉列出。

版本策略：

v1：初始提交（与会议提交匹配）
v2：录用后终稿修正（摘要添加"accepted at [会议]"）
评审期间不要发布v2，因为变化明显响应评审意见

```bash
# Check if your paper's title is already taken on arXiv
# (before choosing a title)
pip install arxiv
# Quoted heredoc so the double quotes inside the query survive the shell
python - <<'EOF'
import arxiv
search = arxiv.Search(query='ti:"Your Exact Title"', max_results=5)
results = list(arxiv.Client().results(search))
print(f'Found {len(results)} matches')
for r in results:
    print(f'  {r.title} ({r.published.year})')
EOF
```

Step 7.10: Research Code Packaging

步骤7.10：科研代码打包

Releasing clean, runnable code significantly increases citations and reviewer trust. Package code alongside the camera-ready submission.

Repository structure:

your-method/
  README.md              # Setup, usage, reproduction instructions
  requirements.txt       # Or environment.yml for conda
  setup.py               # For pip-installable packages
  LICENSE                # MIT or Apache 2.0 recommended for research
  configs/               # Experiment configurations
  src/                   # Core method implementation
  scripts/               # Training, evaluation, analysis scripts
    train.py
    evaluate.py
    reproduce_table1.sh  # One script per main result
  data/                  # Small data or download scripts
    download_data.sh
  results/               # Expected outputs for verification

README template for research code:

发布干净、可运行的代码能显著增加引用和评审信任。终稿提交时一并打包代码。

仓库结构：

your-method/
  README.md              # 设置、使用、复现说明
  requirements.txt       # 或conda的environment.yml
  setup.py               # 用于pip可安装包
  LICENSE                # 研究推荐MIT或Apache 2.0
  configs/               # 实验配置
  src/                   # 核心方法实现
  scripts/               # 训练、评估、分析脚本
    train.py
    evaluate.py
    reproduce_table1.sh  # 每个主要结果对应一个脚本
  data/                  # 小数据或下载脚本
    download_data.sh
  results/               # 预期输出用于验证

科研代码README模板：

````markdown
# [Paper Title]

Official implementation of "[Paper Title]" (Venue Year).

## Setup

[Exact commands to set up environment]

## Reproduction

To reproduce Table 1:

```bash
bash scripts/reproduce_table1.sh
```

To reproduce Figure 2:

```bash
python scripts/make_figure2.py
```

## Citation

[BibTeX entry]
````


**Pre-release checklist:**

Code runs from a clean clone (test on fresh machine or Docker)
All dependencies pinned to specific versions
No hardcoded absolute paths
No API keys, credentials, or personal data in repo
README covers setup, reproduction, and citation
LICENSE file present (MIT or Apache 2.0 for max reuse)
Results are reproducible within expected variance
.gitignore excludes data files, checkpoints, logs


**Anonymous code for submission** (before acceptance):


**发布前清单**：

代码从干净克隆可运行（在新机器或Docker上测试）
所有依赖固定到特定版本
无硬编码绝对路径
仓库中无API密钥、凭据或个人数据
README涵盖设置、复现和引用
存在LICENSE文件（MIT或Apache 2.0以最大化复用）
结果在预期方差内可复现
.gitignore排除数据文件、检查点、日志
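Two of the checklist items (pinned dependencies and hardcoded absolute paths) can be checked with simple text scans. The sketch below assumes a plain `requirements.txt` with one requirement per line, and it treats string literals starting with a few common root directories as suspicious. Both the accepted pin syntax (`==` only) and the directory list are illustrative assumptions.

```python
import re

PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\]]+==[^\s;]+")
ABSOLUTE_PATH = re.compile(
    r"""["'](/(?:home|Users|mnt|data|scratch)/[^"']*|[A-Za-z]:\\[^"']*)["']""")


def unpinned_requirements(requirements_text: str) -> list:
    """Return requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e)
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


def hardcoded_paths(source: str) -> list:
    """Return quoted absolute paths found in a source file's text."""
    return ABSOLUTE_PATH.findall(source)


# Hypothetical file contents for demonstration.
requirements = "torch==2.3.0\nnumpy>=1.24\n# tools\ntransformers\n-r extra.txt\nscipy==1.11.4  # pinned\n"
code = 'data = load("/home/alice/data/train.json")\nout = "results/run1"\n'
print(unpinned_requirements(requirements))
print(hardcoded_paths(code))
```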


**提交用匿名代码**（录用前）：
```bash
# Use Anonymous GitHub for double-blind review
# https://anonymous.4open.science/
# Upload your repo → get an anonymous URL → put in paper
```

---

---

Phase 8: Post-Acceptance Deliverables

阶段8：录用后交付物

Goal: Maximize the impact of your accepted paper through presentation materials and community engagement.

目标：通过演示材料和社区参与最大化录用论文的影响力。

Step 8.1: Conference Poster

步骤8.1：会议海报

Most conferences require a poster session. Poster design principles:

Element	Guideline
Size	Check venue requirements (typically 24"x36" or A0 portrait/landscape)
Content	Title, authors, 1-sentence contribution, method figure, 2-3 key results, conclusion
Flow	Top-left to bottom-right (Z-pattern) or columnar
Text	Title readable at 3m, body at 1m. No full paragraphs — bullet points only.
Figures	Reuse paper figures at higher resolution. Enlarge key result.

Tools: LaTeX (`beamerposter` package), PowerPoint/Keynote, Figma, Canva.

Production: Order 2+ weeks before the conference. Fabric posters are lighter for travel. Many conferences now support virtual/digital posters too.

大多数会议要求海报展示。海报设计原则：

元素	指南
尺寸	检查会议要求（通常24"x36"或A0竖版/横版）
内容	标题、作者、1句话贡献、方法图、2-3个关键结果、结论
流程	左上到右下（Z模式）或分栏
文本	标题3米可读，正文1米可读。无完整段落——仅项目符号。
图表	高分辨率复用论文图表。放大关键结果。

工具：LaTeX（`beamerposter` 包）、PowerPoint/Keynote、Figma、Canva。

制作：会议前2+周订购。织物海报更轻便，便于旅行。许多会议现在也支持虚拟/数字海报。

Step 8.2: Conference Talk / Spotlight

步骤8.2：会议演讲 / 亮点展示

If awarded an oral or spotlight presentation:

Talk Type	Duration	Content
Spotlight	5 min	Problem, approach, one key result. Rehearse to exactly 5 minutes.
Oral	15-20 min	Full story: problem, approach, key results, ablations, limitations.
Workshop talk	10-15 min	Adapt based on workshop audience — may need more background.

Slide design rules:

One idea per slide
Minimize text — speak the details, don't project them
Animate key figures to build understanding step-by-step
Include a "takeaway" slide at the end (single sentence contribution)
Prepare backup slides for anticipated questions

如果获得口头或亮点展示机会：

演讲类型	时长	内容
亮点展示	5分钟	问题、方法、一个关键结果。排练到正好5分钟。
口头演讲	15-20分钟	完整故事：问题、方法、关键结果、消融实验、局限性。
研讨会演讲	10-15分钟	根据研讨会受众调整——可能需要更多背景。

幻灯片设计规则：

每张幻灯片一个想法
最小化文本——讲述细节，不要投影出来
逐步动画关键图表以建立理解
结尾包含“要点”幻灯片（一句话贡献）
准备备份幻灯片以应对预期问题

Step 8.3: Blog Post / Social Media

步骤8.3：博客文章 / 社交媒体

An accessible summary significantly increases impact:

Twitter/X thread: 5-8 tweets. Lead with the result, not the method. Include Figure 1 and key result figure.
Blog post: 800-1500 words. Written for ML practitioners, not reviewers. Skip formalism, emphasize intuition and practical implications.
Project page: HTML page with abstract, figures, demo, code link, BibTeX. Use GitHub Pages.

Timing: Post within 1-2 days of paper appearing on proceedings or arXiv camera-ready.

通俗易懂的摘要能显著提高影响力：

Twitter/X线程：5-8条推文。从结果开始，而非方法。包含图1和关键结果图。
博客文章：800-1500字。面向ML从业者，而非评审。跳过形式主义，强调直觉和实际意义。
项目页面：HTML页面，包含摘要、图表、演示、代码链接、BibTeX。使用GitHub Pages。

时间：论文出现在会议论文集或arXiv终稿后1-2天内发布。

Workshop & Short Papers

研讨会与短文

Workshop papers and short papers (e.g., ACL short papers, Findings papers) follow the same pipeline but with different constraints and expectations.

研讨会论文和短文（如ACL短文、Findings论文）遵循相同流程，但有不同的约束和期望。

Workshop Papers

研讨会论文

Property	Workshop	Main Conference
Page limit	4-6 pages (typically)	7-9 pages
Review standard	Lower bar for completeness	Must be complete, thorough
Review process	Usually single-blind or light review	Double-blind, rigorous
What's valued	Interesting ideas, preliminary results, position pieces	Complete empirical story with strong baselines
arXiv	Post anytime	Timing matters (see arXiv strategy)
Contribution bar	Novel direction, interesting negative result, work-in-progress	Significant advance with strong evidence

When to target a workshop:

Early-stage idea you want feedback on before a full paper
Negative result that doesn't justify 8+ pages
Position piece or opinion on a timely topic
Replication study or reproducibility report

属性	研讨会	主会议
页数限制	4-6页（通常）	7-9页
评审标准	完整性要求较低	必须完整、全面
评审流程	通常单盲或轻量评审	双盲、严格
重视的内容	有趣的想法、初步结果、立场文章	完整的实证故事，有强大的基准方法
arXiv	任何时间都可发布	时间很重要（见arXiv策略）
贡献门槛	新颖方向、有趣的负面结果、进行中的工作	重大进展，有强有力的证据

何时瞄准研讨会：

你想在写长文前获得反馈的早期想法
不足以支撑8+页的负面结果
关于热门话题的立场文章或观点
复制研究或可复现性报告

ACL Short Papers & Findings

ACL短文与Findings

ACL venues have distinct submission types:

Type	Pages	What's Expected
Long paper	8	Complete study, strong baselines, ablations
Short paper	4	Focused contribution: one clear point with evidence
Findings	8	Solid work that narrowly missed main conference

Short paper strategy: Pick ONE claim and support it thoroughly. Don't try to compress a long paper into 4 pages — write a different, more focused paper.

ACL会议有不同的提交类型：

类型	页数	预期内容
长文	8	完整研究，强大的基准方法，消融实验
短文	4	聚焦贡献：一个清晰的论点，有证据支撑
Findings	8	扎实的工作， narrowly missed主会议

短文策略：选择一个论点并彻底支撑。不要试图把长文压缩到4页——写一篇不同的、更聚焦的论文。

Paper Types Beyond Empirical ML

实证ML之外的论文类型

The main pipeline above targets empirical ML papers. Other paper types require different structures and evidence standards. See references/paper-types.md for detailed guidance on each type.

上述主要流程针对实证ML论文。其他论文类型需要不同的结构和证据标准。详见references/paper-types.md获取每种类型的详细指南。

Theory Papers

理论论文

Structure: Introduction → Preliminaries (definitions, notation) → Main Results (theorems) → Proof Sketches → Discussion → Full Proofs (appendix)

Key differences from empirical papers:

Contribution is a theorem, bound, or impossibility result — not experimental numbers
Methods section replaced by "Preliminaries" and "Main Results"
Proofs are the evidence, not experiments (though empirical validation of theory is welcome)
Proof sketches in main text, full proofs in appendix is standard practice
Experimental section is optional but strengthens the paper if it validates theoretical predictions

Proof writing principles:

State theorems formally with all assumptions explicit
Provide intuition before formal proof ("The key insight is...")
Proof sketches should convey the main idea in 0.5-1 page
Use `\begin{proof}...\end{proof}` environments
Number assumptions and reference them in theorems: "Under Assumptions 1-3, ..."

结构：引言 → 预备知识（定义、符号） → 主要结果（定理） → 证明梗概 → 讨论 → 完整证明（附录）

与实证论文的关键区别：

贡献是定理、边界或不可能结果——而非实验数字
方法部分替换为“预备知识”和“主要结果”
证明是证据，而非实验（尽管理论的实证验证受欢迎）
正文包含证明梗概，完整证明放在附录是标准做法
实验部分可选，但如果验证理论预测会增强论文

证明写作原则：

正式陈述定理，所有假设明确
正式证明前提供直觉（“关键见解是...”）
证明梗概应在0.5-1页内传达主要思想
使用 `\begin{proof}...\end{proof}` 环境
对假设编号并在定理中引用：“在假设1-3下，...”

Survey / Tutorial Papers

综述 / 教程论文

Structure: Introduction → Taxonomy / Organization → Detailed Coverage → Open Problems → Conclusion

Key differences:

Contribution is the organization, synthesis, and identification of open problems — not new methods
Must be comprehensive within scope (reviewers will check for missing references)
Requires a clear taxonomy or organizational framework
Value comes from connections between works that individual papers don't make
Best venues: TMLR (survey track), JMLR, Foundations and Trends in ML, ACM Computing Surveys

结构：引言 → 分类 / 组织 → 详细覆盖 → 开放问题 → 结论

关键区别：

贡献是组织、综合和识别开放问题——而非新方法
必须在范围内全面（评审会检查是否遗漏引用）
需要清晰的分类或组织框架
价值来自于单个论文未建立的工作间联系
最佳会议：TMLR（综述track）、JMLR、Foundations and Trends in ML、ACM Computing Surveys

Benchmark Papers

基准测试论文

Structure: Introduction → Task Definition → Dataset Construction → Baseline Evaluation → Analysis → Intended Use & Limitations

Key differences:

Contribution is the benchmark itself — it must fill a genuine evaluation gap
Dataset documentation is mandatory, not optional (see Datasheets, Step 5.11)
Must demonstrate the benchmark is challenging (baselines don't saturate it)
Must demonstrate the benchmark measures what you claim it measures (construct validity)
Best venues: NeurIPS Datasets & Benchmarks track, ACL (resource papers), LREC-COLING

结构：引言 → 任务定义 → 数据集构建 → 基准评估 → 分析 → 预期用途与局限性

关键区别：

贡献是基准测试本身——必须填补真正的评估空白
数据集文档是必须的，而非可选（见数据集表，步骤5.11）
必须证明基准测试具有挑战性（基准方法未达到饱和）
必须证明基准测试测量的是你声称的内容（结构效度）
最佳会议：NeurIPS Datasets & Benchmarks track、ACL（资源论文）、LREC-COLING

Position Papers

立场论文

Structure: Introduction → Background → Thesis / Argument → Supporting Evidence → Counterarguments → Implications

Key differences:

Contribution is an argument, not a result
Must engage seriously with counterarguments
Evidence can be empirical, theoretical, or logical analysis
Best venues: ICML (position track), workshops, TMLR

结构：引言 → 背景 → 论点 / 主张 → 支撑证据 → 反论点 → 意义

关键区别：

贡献是论点，而非结果
必须认真对待反论点
证据可以是实证、理论或逻辑分析
最佳会议：ICML（立场track）、研讨会、TMLR

Hermes Agent Integration

Hermes Agent集成

This skill is designed for the Hermes agent. It uses Hermes tools, delegation, scheduling, and memory for the full research lifecycle.

本技能为Hermes Agent设计，使用Hermes工具、委派、调度和内存管理完整科研生命周期。

Hermes Tools Reference

Hermes工具参考

Tool	Usage in This Pipeline
`terminal`	LaTeX compilation ( `latexmk -pdf` ), git operations, launching experiments ( `nohup python run.py &` ), process checks
`process`	Background experiment management: `process("start", ...)` , `process("poll", pid)` , `process("log", pid)` , `process("kill", pid)`
`execute_code`	Run Python for citation verification, statistical analysis, data aggregation. Has tool access via RPC.
`read_file` / `write_file` / `patch`	Paper editing, experiment scripts, result files. Use `patch` for targeted edits to large .tex files.
`web_search`	Literature discovery: `web_search("transformer attention mechanism 2024")`
`web_extract`	Fetch paper content, verify citations: `web_extract("https://arxiv.org/abs/2303.17651")`
`delegate_task`	Parallel section drafting — spawn isolated subagents for each section. Also for concurrent citation verification.
`todo`	Primary state tracker across sessions. Update after every phase transition.
`memory`	Persist key decisions across sessions: contribution framing, venue choice, reviewer feedback.
`cronjob`	Schedule experiment monitoring, deadline countdowns, automated arXiv checks.
`clarify`	Ask the user targeted questions when blocked (venue choice, contribution framing).
`send_message`	Notify user when experiments complete or drafts are ready, even if user isn't in chat.

工具	本流程中的用法
`terminal`	LaTeX编译（ `latexmk -pdf` ）、Git操作、启动实验（ `nohup python run.py &` ）、进程检查
`process`	后台实验管理： `process("start", ...)` 、 `process("poll", pid)` 、 `process("log", pid)` 、 `process("kill", pid)`
`execute_code`	运行Python进行引用验证、统计分析、数据聚合。通过RPC访问工具。
`read_file` / `write_file` / `patch`	论文编辑、实验脚本、结果文件。使用 `patch` 对大型.tex文件进行针对性编辑。
`web_search`	文献发现： `web_search("transformer attention mechanism 2024")`
`web_extract`	获取论文内容、验证引用： `web_extract("https://arxiv.org/abs/2303.17651")`
`delegate_task`	并行章节撰写——为每个章节生成独立的子Agent。也用于并发引用验证。
`todo`	跨会话的主要状态跟踪器。每次阶段转换后更新。
`memory`	跨会话持久化关键决策：贡献框架、会议选择、评审反馈。
`cronjob`	调度实验监控、截止日期倒计时、自动arXiv检查。
`clarify`	受阻时向用户提出针对性问题（会议选择、贡献框架）。
`send_message`	实验完成或草稿准备好时通知用户，即使用户不在聊天中。

Tool Usage Patterns

工具使用模式

Experiment monitoring (most common):

terminal("ps aux | grep <pattern>")
→ terminal("tail -30 <logfile>")
→ terminal("ls results/")
→ execute_code("analyze results JSON, compute metrics")
→ terminal("git add -A && git commit -m '<descriptive message>' && git push")
→ send_message("Experiment complete: <summary>")

Parallel section drafting (using delegation):

delegate_task("Draft the Methods section based on these experiment scripts and configs. 
  Include: pseudocode, all hyperparameters, architectural details sufficient for 
  reproduction. Write in LaTeX using the neurips2025 template conventions.")

delegate_task("Draft the Related Work section. Use web_search and web_extract to 
  find papers. Verify every citation via Semantic Scholar. Group by methodology.")

delegate_task("Draft the Experiments section. Read all result files in results/. 
  State which claim each experiment supports. Include error bars and significance.")

Each delegate runs as a fresh subagent with no shared context — provide all necessary information in the prompt. Collect outputs and integrate.

Citation verification (using execute_code):

实验监控（最常见）：

terminal("ps aux | grep <pattern>")
→ terminal("tail -30 <logfile>")
→ terminal("ls results/")
→ execute_code("分析结果JSON，计算指标")
→ terminal("git add -A && git commit -m '<描述性信息>' && git push")
→ send_message("实验完成：<摘要>")

并行章节撰写（使用委派）：

delegate_task("根据这些实验脚本和配置撰写方法章节。
  包含：伪代码、所有超参数、足够的架构细节以复现。
  使用neurips2025模板惯例撰写LaTeX。")

delegate_task("撰写相关工作章节。使用web_search和web_extract查找论文。
  通过Semantic Scholar验证每个引用。按方法学分组。")

delegate_task("撰写实验章节。读取results/中的所有结果文件。
  说明每个实验支撑哪个论点。包含误差棒和显著性。")

每个委派作为新的子Agent运行，无共享上下文——在提示中提供所有必要信息。收集输出并整合。

引用验证（使用execute_code）：

```python
# In execute_code:
from itertools import islice

import requests
from semanticscholar import SemanticScholar

sch = SemanticScholar()
results = sch.search_paper("attention mechanism transformers", limit=5)
# Iterating the result object directly may fetch further pages, so cap it explicitly.
for paper in islice(results, 5):
    doi = (paper.externalIds or {}).get('DOI')  # externalIds can be None
    if doi:
        # DOI content negotiation returns a BibTeX entry for the paper
        response = requests.get(f"https://doi.org/{doi}",
                                headers={"Accept": "application/x-bibtex"}, timeout=30)
        if response.ok:
            print(response.text)
```

State Management with `memory` and `todo`

使用 `memory` 和 `todo` 进行状态管理

`memory` tool, for persisting key decisions (bounded: ~2200 chars for MEMORY.md):

memory("add", "Paper: autoreason. Venue: NeurIPS 2025 (9 pages). 
  Contribution: structured refinement works when generation-evaluation gap is wide.
  Key results: Haiku 42/42, Sonnet 3/5, S4.6 constrained 2/3.
  Status: Phase 5 — drafting Methods section.")

Update memory after major decisions or phase transitions. This persists across sessions.

`todo` tool, for tracking granular progress:

todo("add", "Design constrained task experiments for Sonnet 4.6")
todo("add", "Run Haiku baseline comparison")
todo("add", "Draft Methods section")
todo("update", id=3, status="in_progress")
todo("update", id=1, status="completed")

Session startup protocol:

1. todo("list")                           # Check current task list
2. memory("read")                         # Recall key decisions
3. terminal("git log --oneline -10")      # Check recent commits
4. terminal("ps aux | grep python")       # Check running experiments
5. terminal("ls results/ | tail -20")     # Check for new results
6. Report status to user, ask for direction

`memory` 工具，用于持久化关键决策（有限：MEMORY.md约2200字符）：

memory("add", "Paper: autoreason. Venue: NeurIPS 2025 (9 pages). 
  Contribution: structured refinement works when generation-evaluation gap is wide.
  Key results: Haiku 42/42, Sonnet 3/5, S4.6 constrained 2/3.
  Status: Phase 5 — drafting Methods section.")

重大决策或阶段转换后更新内存。跨会话持久化。

todo
工具——跟踪细粒度进度：

todo("add", "为Sonnet 4.6设计约束任务实验")
todo("add", "运行Haiku基准对比")
todo("add", "撰写方法章节")
todo("update", id=3, status="in_progress")
todo("update", id=1, status="completed")

会话启动协议：

1. todo("list")                           # 检查当前任务列表
2. memory("read")                         # 回忆关键决策
3. terminal("git log --oneline -10")      # 检查最近提交
4. terminal("ps aux | grep python")       # 检查运行中的实验
5. terminal("ls results/ | tail -20")     # 检查新结果
6. 向用户报告状态，询问方向

Cron Monitoring with `cronjob`

Use the `cronjob` tool to schedule periodic experiment checks:

cronjob("create", {
  "schedule": "*/30 * * * *",  # Every 30 minutes
  "prompt": "Check experiment status:
    1. ps aux | grep run_experiment
    2. tail -30 logs/experiment_haiku.log
    3. ls results/haiku_baselines/
    4. If complete: read results, compute Borda scores, 
       git add -A && git commit -m 'Add Haiku results' && git push
    5. Report: table of results, key finding, next step
    6. If nothing changed: respond with [SILENT]"
})

[SILENT] protocol: When nothing has changed since the last check, respond with exactly `[SILENT]`. This suppresses notification delivery to the user. Only report when there are genuine changes worth knowing about.
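One way to decide whether "nothing has changed" is to fingerprint a small status snapshot and compare it with the fingerprint from the previous check. This is a sketch only: the snapshot fields, `fingerprint`, and `check_response` are hypothetical names, and where the previous fingerprint is stored between cron runs is left to the caller.

```python
import hashlib
import json
from typing import Optional

SILENT = "[SILENT]"


def fingerprint(snapshot: dict) -> str:
    """Stable hash of a JSON-serializable status snapshot."""
    payload = json.dumps(snapshot, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def check_response(snapshot: dict, last_fingerprint: Optional[str], report: str) -> tuple[str, str]:
    """Return (message, new_fingerprint).

    The message is exactly [SILENT] when the snapshot matches the previous
    check, and the full report otherwise.
    """
    current = fingerprint(snapshot)
    if current == last_fingerprint:
        return SILENT, current
    return report, current


# Hypothetical snapshot: running processes and result files seen by this check.
snapshot = {"running": ["run_experiment"], "result_files": []}
first_message, fp = check_response(snapshot, None, "Experiment started.")
second_message, fp = check_response(snapshot, fp, "Experiment started.")
```

Here the first check reports and the second, with an identical snapshot, returns the silent marker.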

Deadline tracking:

cronjob("create", {
  "schedule": "0 9 * * *",  # Daily at 9am
  "prompt": "NeurIPS 2025 deadline: May 22. Today is {date}. 
    Days remaining: {compute}. 
    Check todo list — are we on track? 
    If <7 days: warn user about remaining tasks."
})
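The `{date}` and `{compute}` placeholders in the prompt above amount to a date subtraction and a threshold check. A minimal sketch follows, using the May 22 deadline from the example; `days_remaining`, `deadline_message`, and the open-task count are hypothetical, and the real venue deadline should be confirmed from the call for papers.

```python
from datetime import date


def days_remaining(deadline: date, today: date) -> int:
    """Whole days from today until the deadline (negative once it has passed)."""
    return (deadline - today).days


def deadline_message(deadline: date, today: date, open_tasks: int, warn_within: int = 7) -> str:
    """Build the daily status line, adding a warning inside the final week."""
    remaining = days_remaining(deadline, today)
    message = (
        f"Deadline {deadline.isoformat()}: {remaining} days remaining, "
        f"{open_tasks} open tasks."
    )
    if remaining < warn_within and open_tasks > 0:
        message += f" WARNING: fewer than {warn_within} days left with tasks still open."
    return message


deadline = date(2025, 5, 22)  # date taken from the example prompt above
print(deadline_message(deadline, date.today(), open_tasks=3))
```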

Communication Patterns

When to notify the user (via `send_message` or direct response):

Experiment batch completed (with results table)
Unexpected finding or failure requiring decision
Draft section ready for review
Deadline approaching with incomplete tasks

When NOT to notify:

Experiment still running, no new results → `[SILENT]`
Routine monitoring with no changes → `[SILENT]`
Intermediate steps that don't need attention

Report format — always include structured data:

Experiment: <name>

Status: Complete / Running / Failed

Task	Method A	Method B	Method C
Task 1	85.2	82.1	89.4

Key finding: <one sentence>
Next step: <what happens next>
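The report layout above can be produced by a small formatter so that every notification has the same shape. This is an illustrative sketch: `render_report` and `ALLOWED_STATUSES` are hypothetical names, and the tab-separated table simply mirrors the template rather than any format the messaging tool requires.

```python
ALLOWED_STATUSES = {"Complete", "Running", "Failed"}


def render_report(name, status, header, rows, key_finding, next_step):
    """Render a plain-text status report in the layout shown above."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    for row in rows:
        if len(row) != len(header):
            raise ValueError("every row needs one cell per header column")
    lines = [f"Experiment: {name}", f"Status: {status}", ""]
    lines.append("\t".join(header))
    lines.extend("\t".join(str(cell) for cell in row) for row in rows)
    lines += ["", f"Key finding: {key_finding}", f"Next step: {next_step}"]
    return "\n".join(lines)


# Placeholder values copied from the template, not real results.
report = render_report(
    name="<name>",
    status="Complete",
    header=["Task", "Method A", "Method B", "Method C"],
    rows=[["Task 1", 85.2, 82.1, 89.4]],
    key_finding="<one sentence>",
    next_step="<what happens next>",
)
```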

Decision Points Requiring Human Input

Use `clarify` for targeted questions when genuinely blocked:

Decision	When to Ask
Target venue	Before starting paper (affects page limits, framing)
Contribution framing	When multiple valid framings exist
Experiment priority	When TODO list has more experiments than time allows
Submission readiness	Before final submission

Do NOT ask about (be proactive, make a choice, flag it):

Word choice, section ordering
Which specific results to highlight
Citation completeness (draft with what you find, note gaps)

Reviewer Evaluation Criteria

Understanding what reviewers look for helps focus effort:

Criterion	What They Check
Quality	Technical soundness, well-supported claims, fair baselines
Clarity	Clear writing, reproducible by experts, consistent notation
Significance	Community impact, advances understanding
Originality	New insights (doesn't require new method)

Scoring (NeurIPS 6-point scale):

6: Strong Accept (groundbreaking, flawless)
5: Accept (technically solid, high impact)
4: Borderline Accept (solid, limited evaluation)
3: Borderline Reject (weaknesses outweigh strengths)
2: Reject (technical flaws)
1: Strong Reject (known results or ethics issues)

See references/reviewer-guidelines.md for detailed guidelines, common concerns, and rebuttal strategies.

Common Issues and Solutions

Issue	Solution
Abstract too generic	Delete first sentence if it could prepend any ML paper. Start with your specific contribution.
Introduction exceeds 1.5 pages	Split background into Related Work. Front-load contribution bullets.
Experiments lack explicit claims	Add: "This experiment tests whether [specific claim]..." before each one.
Reviewers find paper hard to follow	Add signposting, use consistent terminology, make figure captions self-contained.
Missing statistical significance	Add error bars, number of runs, statistical tests, confidence intervals.
Scope creep in experiments	Every experiment must map to a specific claim. Cut experiments that don't.
Paper rejected, need to resubmit	See Conference Resubmission in Phase 7. Address reviewer concerns without referencing reviews.
Missing broader impact statement	See Step 5.10. Most venues require it. "No negative impacts" is almost never credible.
Human eval criticized as weak	See Step 2.5 and references/human-evaluation.md. Report agreement metrics, annotator details, compensation.
Reviewers question reproducibility	Release code (Step 7.9), document all hyperparameters, include seeds and compute details.
Theory paper lacks intuition	Add proof sketches with plain-language explanations before formal proofs. See references/paper-types.md.
Results are negative/null	See Phase 4.3 on handling negative results. Consider workshops, TMLR, or reframing as analysis.
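For the "Missing statistical significance" row, the numbers reviewers usually want are the run count, mean, spread, and a confidence interval. Here is a minimal standard-library sketch; `summarize_runs` is a hypothetical helper, the scores are made-up placeholders, and the interval uses a normal approximation, which is narrower than a t-interval when there are only a few runs.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarize_runs(scores: list[float], confidence: float = 0.95) -> dict[str, float]:
    """Mean, sample std, and a normal-approximation confidence interval over runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    n = len(scores)
    m = mean(scores)
    s = stdev(scores)  # sample standard deviation (n - 1 denominator)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * s / sqrt(n)
    return {"n": n, "mean": m, "std": s, "ci_low": m - half_width, "ci_high": m + half_width}


# Placeholder accuracies from three hypothetical seeds, not real results.
summary = summarize_runs([84.0, 85.0, 86.0])
print(f"{summary['mean']:.1f} ± {summary['std']:.1f} (n={summary['n']})")
```

With very few seeds, consider reporting the individual run values alongside the summary, since any interval estimate is coarse at that sample size.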

Reference Documents

Document	Contents
references/writing-guide.md	Gopen & Swan 7 principles, Perez micro-tips, Lipton word choice, Steinhardt precision, figure design
references/citation-workflow.md	Citation APIs, Python code, CitationManager class, BibTeX management
references/checklists.md	NeurIPS 16-item, ICML, ICLR, ACL requirements, universal pre-submission checklist
references/reviewer-guidelines.md	Evaluation criteria, scoring, common concerns, rebuttal template
references/sources.md	Complete bibliography of all writing guides, conference guidelines, APIs
references/experiment-patterns.md	Experiment design patterns, evaluation protocols, monitoring, error recovery
references/autoreason-methodology.md	Autoreason loop, strategy selection, model guide, prompts, scope constraints, Borda scoring
references/human-evaluation.md	Human evaluation design, annotation guidelines, agreement metrics, crowdsourcing QC, IRB guidance
references/paper-types.md	Theory papers (proof writing, theorem structure), survey papers, benchmark papers, position papers

LaTeX Templates

Templates in `templates/` for: NeurIPS 2025, ICML 2026, ICLR 2026, ACL, AAAI 2026, COLM 2025.

See templates/README.md for compilation instructions.

Key External Sources

Writing Philosophy:

APIs: Semantic Scholar | CrossRef | arXiv

Venues: NeurIPS | ICML | ICLR | ACL

Skill	When to Use	How to Load
arxiv	Phase 1 (Literature Review): searching arXiv, generating BibTeX, finding related papers via Semantic Scholar	`skill_view("arxiv")`
subagent-driven-development	Phase 5 (Drafting): parallel section writing with 2-stage review (spec compliance then quality)	`skill_view("subagent-driven-development")`
plan	Phase 0 (Setup): creating structured plans before execution. Writes to `.hermes/plans/`	`skill_view("plan")`
qmd	Phase 1 (Literature): searching local knowledge bases (notes, transcripts, docs) via hybrid BM25+vector search	Install: `skill_manage("install", "qmd")`
diagramming	Phase 4-5: creating Excalidraw-based figures and architecture diagrams	`skill_view("diagramming")`
data-science	Phase 4 (Analysis): Jupyter live kernel for interactive analysis and visualization	`skill_view("data-science")`
