auto-research

Auto Research

Run iterative NeMo-RL experiments in this repository against the user's stated objective, such as accuracy, reward, throughput, latency, stability, or another recipe-specific metric, with git as the research ledger.

Treat dependencies as ready, but choose the runtime deliberately. Use the recipe's authoritative metric as the source of truth. Keep changes small, reproducible, and simple. Preserve unrelated user work.

Use the `session-memory` skill for every auto-research campaign. Start or resume a session record before branching, then checkpoint after forming the plan, before and after meaningful edits or long-running launches, when the user changes direction, and before handoff or final summary.

After context compaction, handoff, disconnect, or a long gap, reload this skill and any companion skills already in use, read the latest `session-memory` handoff, and restate the overall objective, stop rules, current branch, and latest result before continuing. Treat follow-up steering as additive unless the user explicitly changes the main objective.

Workflow

Inspect the current git state and identify unrelated user changes before branching.
Use a shared branch prefix. Prefer a user-provided one; otherwise create a suggestive default such as `autoresearch/2026-03-24-dapo-qwen2p5`.
Read the target recipe, its parents, and the relevant code paths in `examples/run_grpo.py`, `nemo_rl/models/`, `nemo_rl/algorithms/`, `nemo_rl/environments/`, and `docs/`. For NeMo-gym recipes, also inspect `examples/nemo_gym/` entrypoints, configs, and launch scripts.
Translate any user stop rule into explicit values you can monitor, such as the requested number of experiments as `target_experiment_count`, `campaign_deadline`, `per_experiment_timeout`, or `target_metric`.
Verify required data, checkpoints, runtime inputs, and the launcher.
Create an untracked TSV log and per-experiment log directory.
Run a baseline first on `<prefix>/baseline` if none exists.
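
The prefix and TSV setup steps above might be sketched in shell as follows. This is a minimal sketch, not the skill's own tooling: `default_prefix` and `init_tsv` are hypothetical helper names, and the column names are an assumption derived from the fields this document asks you to record. The authoritative schema is whatever `references/experiment-log-template.md` defines.

```shell
# Hypothetical helper: build a default branch prefix from a date and a recipe slug.
# $1 = date as YYYY-MM-DD, $2 = short recipe slug
default_prefix() {
  printf 'autoresearch/%s-%s\n' "$1" "$2"
}

# Hypothetical helper: create the TSV log with a header row unless it already has content.
# $1 = path to the TSV file (keep it untracked, for example via .git/info/exclude)
# The column names below are assumed, not taken from the real template.
init_tsv() {
  mkdir -p "$(dirname "$1")"
  if [ ! -s "$1" ]; then
    printf 'index\tbranch\tparent_commit\tcommit\trecipe\tmetric_name\tmetric_value\tmemory_gb\telapsed_min\tlauncher\tjob_id\tcommand\tlog_path\tstatus\tdescription\n' > "$1"
  fi
}
```

Calling `init_tsv` a second time leaves an existing log untouched, so it is safe to run when resuming a campaign.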

For GPU, CPU-heavy, distributed, or long-running work, choose the execution environment deliberately. Run locally when the current machine has suitable GPUs and capacity; otherwise follow the user's requested environment, use `launch-nemo-rl` for nrl-k8s/Kubernetes, use the environment's native launcher for Slurm, or clarify with the user before launching. Use CPU-only local runs only for light inspection, dry runs, and short non-GPU checks.
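
As a first check for the "run locally" case, one option is to count the GPUs that `nvidia-smi -L` lists. This is only a sketch: `local_gpu_count` is a hypothetical helper, and a non-zero count says nothing about free memory or whether the GPUs suit the recipe, so it narrows the decision rather than making it.

```shell
# Hypothetical helper: print the number of GPUs nvidia-smi reports,
# or 0 when the tool is not installed.
local_gpu_count() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo 0
    return 0
  fi
  # `nvidia-smi -L` prints one line per GPU, each starting with "GPU <index>:".
  # grep -c exits non-zero on zero matches, hence the `|| true`.
  count=$(nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true)
  echo "${count:-0}"
}

if [ "$(local_gpu_count)" -gt 0 ]; then
  echo "local GPUs detected; check free memory and utilization before launching"
else
  echo "no local GPUs detected; use the requested remote environment or ask the user"
fi
```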

If the user mentions Brev, or if `/home/ubuntu/RL` exists and `/ephemeral` is available as a volume, treat the machine as a Brev instance and use `brev-etiquette` before creating experiment directories, caches, logs, checkpoints, or authenticated runtime state.
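
The filesystem half of that rule could be checked roughly as below. `looks_like_brev` is a hypothetical helper, and treating "exists and is writable" as a stand-in for "available as a volume" is an assumption of this sketch. It does not cover the case where the user simply mentions Brev.

```shell
# Hypothetical helper: succeed when the machine matches the Brev path heuristics.
# $1 = repo directory (default /home/ubuntu/RL), $2 = volume directory (default /ephemeral)
# Assumption: a writable directory is treated as an available volume.
looks_like_brev() {
  repo_dir=${1:-/home/ubuntu/RL}
  volume_dir=${2:-/ephemeral}
  [ -d "$repo_dir" ] && [ -d "$volume_dir" ] && [ -w "$volume_dir" ]
}

if looks_like_brev; then
  echo "Brev-like machine: apply brev-etiquette before creating directories or caches"
else
  echo "no Brev path markers found"
fi
```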

Branching

Put every experiment on its own branch under the shared prefix.
Keep every branch, even for failed or weak ideas.
Put at least one commit on each branch for the hypothesis.
Add follow-up fix commits on the same branch when a rerun is justified.
Never stash, reset, or overwrite unrelated user changes silently. If dirty files overlap the experiment, use a separate worktree or ask before proceeding.

See `references/git-workflow.md` for the exact pattern.
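
Until you have read that reference, the dirty-file rule can be approximated with a small overlap check. This is a sketch under assumptions, not the referenced pattern: `has_overlap` is a hypothetical helper, and the commented wiring uses illustrative variables (`planned_file`, `branch`, `worktree_dir`). Note that `git status --porcelain` quotes unusual paths and prints renames as `old -> new`, which this simple comparison does not handle.

```shell
# Hypothetical helper: succeed (exit 0) when any planned file appears
# in the newline-separated list of dirty paths.
# $1 = dirty paths, one per line; remaining arguments = files the experiment will edit
has_overlap() {
  dirty_list=$1
  shift
  for planned in "$@"; do
    # -F fixed string, -x whole-line match, -q quiet
    if printf '%s\n' "$dirty_list" | grep -Fxq -- "$planned"; then
      return 0
    fi
  done
  return 1
}

# Example wiring, commented out so the sketch has no side effects:
# dirty=$(git status --porcelain | cut -c4-)        # drop the two status columns
# if has_overlap "$dirty" "$planned_file"; then
#   git worktree add -b "$branch" "$worktree_dir"   # isolate the experiment
# else
#   git switch -c "$branch"
# fi
```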

Loop

Pick one concrete hypothesis.
Create a branch such as `autoresearch/2026-03-24-dapo-qwen2p5/prompt-compact-schema`.
Edit the smallest set of files needed.
Commit the hypothesis.
Before launching the run, check the monitored stop conditions. Do not stop early unless one is already clearly met.
Identify the authoritative metric source from the recipe or logging code, then run with a unique log path:

```bash
LOG_DIR=reports/auto_research/<campaign>/<experiment>
mkdir -p "$LOG_DIR"
uv run <entrypoint> > "$LOG_DIR/run.log" 2>&1
```

If the user gave a per-experiment wall-clock limit, enforce it explicitly. Prefer a recipe-level timeout when one already exists; otherwise wrap the command with an external timeout. If both exist, honor the tighter limit.
Extract the primary metric with a command appropriate for the actual log format. If extraction is empty, inspect the last log lines and the recipe's logging path before marking the run.
Record index, branch, parent commit, commit, recipe, metric name, metric value, memory (GB), elapsed time (minutes), launcher, job id, command, log path, status, and description in the TSV, along with enough timing or count information to evaluate the stop rule.
Periodically print user-facing progress updates during the campaign. Include the current branch, latest known result, attempted experiment count, remaining experiment count if applicable, remaining campaign time if applicable, and whether any stop condition has been met yet.
Re-check the monitored stop conditions after the experiment completes and state the result explicitly, for example `stop condition not yet met: 17/24 attempted, 6h12m remaining` or `stop condition met: 24/24 attempted`.
Mark the result as `keep`, `discard`, or `crash`, then move to the next branch unless a user-specified stop condition has been clearly met.
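
For the per-experiment wall-clock limit in the steps above, an external wrapper might look like the sketch below. It assumes GNU coreutils `timeout`, which exits with status 124 when the limit is hit. `tighter_limit` and `run_with_limit` are hypothetical helpers, and the outcome words they print are illustrative labels rather than the TSV status values.

```shell
# Hypothetical helper: print the smaller of two integer limits in seconds,
# for the case where both a recipe-level and a user-level limit exist.
tighter_limit() {
  if [ "$1" -le "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Hypothetical helper: run a command under a wall-clock limit and write its
# output to <log_dir>/run.log. Prints one outcome word: ok, timeout, or failed.
# $1 = limit in seconds, $2 = log directory, remaining arguments = the command
run_with_limit() {
  limit_s=$1
  log_dir=$2
  shift 2
  mkdir -p "$log_dir"
  rc=0
  timeout "$limit_s" "$@" > "$log_dir/run.log" 2>&1 || rc=$?
  case "$rc" in
    0)   echo ok ;;
    124) echo timeout ;;   # GNU timeout reports an overrun with exit status 124
    *)   echo failed ;;
  esac
}
```

A call could then take the form `run_with_limit "$(tighter_limit "$recipe_limit_s" "$user_limit_s")" "$LOG_DIR" uv run <entrypoint>`, where both limit variables are assumed to hold seconds. Distributed jobs that ignore the default termination signal may also need `timeout`'s `-k` (kill-after) option.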

For count-based stop rules, count attempted ideas, not only successful or fully completed runs.

For campaign time budgets, convert the user limit into an absolute deadline at the start of the campaign and keep checking remaining time.

For per-experiment budgets, enforce a timeout on every run and treat overruns as failures.

Examples:

`do 50 experiments`: stop only after 50 attempted experiment rows exist in the TSV
`10h total, 1h each`: enforce a 1 hour limit per run and stop when the 10 hour campaign budget is reached, or when there is not enough remaining budget to start another 1 hour run
`50 experiments or 10h total, 1h each`: monitor all three values, never exceed the per-run cap, and stop only when one campaign-level stop trigger is clearly reached
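
The count, deadline, and remaining-budget rules above can be combined into one check that also produces the status line to report. This is a minimal sketch: `fmt_remaining` and `check_stop` are hypothetical helpers, the argument order is an assumption, and a value of 0 is used here to mean "this rule was not given".

```shell
# Hypothetical helper: format a non-negative number of seconds as "<h>h<mm>m".
fmt_remaining() {
  printf '%dh%02dm\n' "$(( $1 / 3600 ))" "$(( ($1 % 3600) / 60 ))"
}

# Hypothetical helper: print a status line; exit 0 when the campaign should stop.
# $1 = attempted count, $2 = target count (0 = no count rule),
# $3 = now in epoch seconds, $4 = deadline in epoch seconds (0 = no deadline),
# $5 = per-experiment timeout in seconds (0 = none)
check_stop() {
  attempted=$1; target=$2; now=$3; deadline=$4; per_run=$5
  if [ "$target" -gt 0 ] && [ "$attempted" -ge "$target" ]; then
    echo "stop condition met: $attempted/$target attempted"
    return 0
  fi
  if [ "$deadline" -gt 0 ]; then
    remaining=$(( deadline - now ))
    if [ "$remaining" -le 0 ]; then
      echo "stop condition met: campaign deadline reached"
      return 0
    fi
    # Not enough budget left to start another full-length run.
    if [ "$per_run" -gt 0 ] && [ "$remaining" -lt "$per_run" ]; then
      echo "stop condition met: $(fmt_remaining "$remaining") remaining is less than one run"
      return 0
    fi
  fi
  msg="stop condition not yet met:"
  sep=" "
  if [ "$target" -gt 0 ]; then
    msg="$msg$sep$attempted/$target attempted"
    sep=", "
  fi
  if [ "$deadline" -gt 0 ]; then
    msg="$msg$sep$(fmt_remaining "$remaining") remaining"
  fi
  echo "$msg"
  return 1
}

# Example: "50 experiments or 10h total, 1h each", evaluated at campaign start.
start=$(date +%s)
campaign_deadline=$(( start + 10 * 3600 ))   # absolute deadline fixed once
check_stop 0 50 "$start" "$campaign_deadline" 3600 || true
```

Passing the attempted count from the TSV row count keeps the check aligned with the rule that attempted ideas count, not only completed runs.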

Priorities

Prefer ideas with high expected objective gain and low complexity cost:

correctness and backend compatibility
prompt and rollout formatting
batch, sequence, and precision layout
optimizer and scheduler tuning
reward shaping, clipping, or scaling
dataset mix or validation changes
synchronous versus asynchronous execution based on hardware

All else equal, prefer simpler wins and avoid brittle hardware-specific hacks.

Avoid

Do not conclude a training idea failed from an underpowered smoke run. If a run uses tiny batch sizes, very few optimizer steps, or otherwise non-representative settings, treat it as plumbing validation only; scale to a meaningful batch size and train long enough to test the hypothesis before marking it `discard`.
Do not repeatedly pay batch-scheduler setup costs for tight edit-run-debug loops. If Slurm batch jobs have a large startup tax and failures require quick iteration, use the documented interactive Slurm pattern or ask the user before resubmitting more batch jobs.
Do not let context compaction or follow-up steering questions erase the original campaign goal. Refresh `session-memory`, reload active skills, and preserve the main objective unless the user explicitly changes it.

Stop

If the user gives explicit stopping conditions, they override the generic rule. Do not stop because the search feels sufficient; stop only when the requested count, deadline, budget, or target condition has been clearly met.

During the campaign, explicitly inform the user whether the stop condition has been met. If not, report the remaining count, remaining time, or other remaining threshold in concrete terms.

If the user does not give explicit stopping conditions, run the baseline plus up to three low-risk experiments, then summarize the best result and ask before continuing.

References

`references/git-workflow.md` for branch, dirty-worktree, parent-commit, and baseline rules.
`references/exploration-ideas.md` for turning symptoms into concrete hypotheses.
`references/experiment-log-template.md` for the TSV schema and reproducibility fields.
