autoresearch

Autoresearch: Autonomous Iterative Experimentation

An autonomous experimentation loop for any programming task. You define the goal and how to measure it; the agent iterates autonomously -- modifying code, running experiments, measuring results, and keeping or discarding changes -- until interrupted.

This skill is inspired by Karpathy's autoresearch, generalized from ML training to any programming task with a measurable outcome.

Agent Behavior Rules

DO guide the user through the Setup phase interactively before starting the loop.
DO establish a baseline measurement before making any changes.
DO commit every experiment attempt before running it (so it can be reverted cleanly).
DO keep a results log (TSV) tracking every experiment.
DO revert changes that do not improve the metric (git reset to last known good).
DO run autonomously once the loop starts -- never pause to ask "should I continue?".
DO NOT modify files the user marked as out-of-scope.
DO NOT skip the measurement step -- every experiment must be measured.
DO NOT keep changes that regress the metric unless the user explicitly allowed trade-offs.
DO NOT install new dependencies or make environment changes unless the user approved it.

Phase 1: Setup (Interactive)

Before any experimentation begins, work with the user to establish these parameters. Ask the user directly for each item. Do not assume or skip any.

1.1 Define the Goal

Ask the user:

What are you trying to improve or optimize?

Examples: execution time, memory usage, binary size, test pass rate, code coverage, API response latency, throughput, error rate, benchmark score, build time, bundle size, lines of code, cyclomatic complexity, etc.

Record the user's answer as the goal.

1.2 Define the Metric

Ask the user:

How do we measure success? What exact command produces the metric?

I need:

The command to run (e.g., `dotnet test`, `npm run benchmark`, `time ./build.sh`, `pytest --tb=short`)
How to extract the metric from the output (e.g., a regex pattern, a specific line, a JSON field)
Direction: Is lower better or higher better?

Example: "Run `dotnet test --logger trx`, count passing tests. Higher is better."
Example: "Run `hyperfine './my-program'`, extract mean time. Lower is better."

Record:

`METRIC_COMMAND`: the command to run
`METRIC_EXTRACTION`: how to extract the numeric metric from output
`METRIC_DIRECTION`: `lower_is_better` or `higher_is_better`
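
As an illustration of what a regex-based `METRIC_EXTRACTION` might look like, here is a minimal Python sketch. The function name `extract_metric` and the sample output format (`Passed: 42`) are assumptions made for this example; the real pattern depends entirely on what the user's metric command prints.

```python
import re
from typing import Optional


def extract_metric(output: str, pattern: str) -> Optional[float]:
    """Return the first captured number in `output`, or None if nothing matches.

    `pattern` must contain one capture group around the numeric value.
    A None result can be treated as a failed extraction (crash or error).
    """
    match = re.search(pattern, output, flags=re.MULTILINE)
    if match is None:
        return None
    return float(match.group(1))


# Hypothetical test-runner output; the real format is whatever the command emits.
sample_output = "Build succeeded.\nPassed: 42, Failed: 1\n"
passing = extract_metric(sample_output, r"Passed:\s*(\d+)")
```

Returning `None` instead of raising keeps the crash path explicit, so the caller can decide whether to inspect the log or revert.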

1.3 Define the Scope

Ask the user:

Which files or directories am I allowed to modify?

And which files are OFF LIMITS (read-only)?

Record:

`IN_SCOPE_FILES`: files/dirs the agent may edit
`OUT_OF_SCOPE_FILES`: files/dirs that must not be modified
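
One way to enforce the scope before each edit is a small path check. The sketch below assumes both lists are stored as glob patterns, which the skill does not prescribe, and the helper name `is_editable` is hypothetical. Out-of-scope patterns take precedence, so a file inside an allowed directory can still be protected.

```python
from fnmatch import fnmatchcase
from typing import Sequence


def is_editable(path: str, in_scope: Sequence[str], out_of_scope: Sequence[str]) -> bool:
    """Return True only if `path` matches an in-scope pattern and no out-of-scope one.

    Patterns are shell-style globs; with fnmatch, `*` also matches `/`,
    so "src/*" covers nested files such as "src/core/fast.py".
    """
    if any(fnmatchcase(path, pattern) for pattern in out_of_scope):
        return False  # off-limits always wins over in-scope
    return any(fnmatchcase(path, pattern) for pattern in in_scope)


# Hypothetical scope, for illustration only.
IN_SCOPE_FILES = ["src/*"]
OUT_OF_SCOPE_FILES = ["src/vendor/*", "tests/*"]
```

Files that match neither list are treated as not editable, which is the conservative reading of the rule against modifying out-of-scope files.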

1.4 Define Constraints

Ask the user:

Are there any constraints I should respect?

Examples:

Time budget per experiment (e.g., "each run should take < 2 minutes")

No new dependencies

Must keep all existing tests passing

Must not change the public API

Must maintain backward compatibility

VRAM/memory limit

Code complexity limits (prefer simpler solutions)

Record as `CONSTRAINTS`.

1.5 Define the Experiment Budget (Optional)

Ask the user:

How many experiments should I run, or should I just keep going until you stop me?

You can say a number (e.g., "try 20 experiments") or "unlimited" (I'll run until you interrupt).

Record as `MAX_EXPERIMENTS` (number or `unlimited`).

1.6 Simplicity Criterion

Inform the user of the default simplicity policy:

Simplicity policy (default): All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it. Removing code while maintaining or improving the metric is a great outcome. I'll weigh the complexity cost against the improvement magnitude. Does this policy work for you, or do you want to adjust it?

Record any adjustments as `SIMPLICITY_POLICY`.

1.7 Confirm Setup

Summarize all parameters back to the user in a clear table:

Parameter	Value
Goal	...
Metric command	...
Metric extraction	...
Direction	lower is better / higher is better
In-scope files	...
Out-of-scope files	...
Constraints	...
Max experiments	...
Simplicity policy	...

Ask the user to confirm. Do not proceed until confirmed.

Phase 2: Branch & Baseline

Once the user confirms:

Create a branch: Propose a tag based on today's date (e.g., `autoresearch/mar17`). Create the branch: `git checkout -b autoresearch/<tag>`.
Read in-scope files: Read all files that are in scope to build full context of the current state.
Initialize results.tsv: Create `results.tsv` in the repo root with the header row:
```
experiment	commit	metric	status	description
```
Add `results.tsv` and `run.log` to `.git/info/exclude` (append if not already present) so they stay untracked without modifying any tracked files.
Run the baseline: Execute the metric command on the current unmodified code. Record the result as experiment `0` with status `baseline` in `results.tsv`.
Report baseline to the user:

Baseline established: [metric_name] = [value] Starting autonomous experimentation loop.
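
The bookkeeping part of this phase can be sketched in Python as follows. The function names `init_results` and `append_result` are hypothetical, and the six-decimal metric formatting is an assumption borrowed from the example rows in the Quick Reference; the skill itself only fixes the header row and the two untracked file names.

```python
from pathlib import Path

HEADER = "experiment\tcommit\tmetric\tstatus\tdescription\n"
UNTRACKED = ("results.tsv", "run.log")


def init_results(repo_root: Path) -> None:
    """Create results.tsv with its header and list the log files in .git/info/exclude."""
    results = repo_root / "results.tsv"
    if not results.exists():
        results.write_text(HEADER, encoding="utf-8")

    exclude = repo_root / ".git" / "info" / "exclude"
    exclude.parent.mkdir(parents=True, exist_ok=True)
    text = exclude.read_text(encoding="utf-8") if exclude.exists() else ""
    missing = [name for name in UNTRACKED if name not in text.splitlines()]
    if missing:
        # Start on a fresh line if the existing file lacks a trailing newline.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        with exclude.open("a", encoding="utf-8") as handle:
            handle.write(prefix + "".join(name + "\n" for name in missing))


def append_result(repo_root: Path, experiment: int, commit: str,
                  metric: float, status: str, description: str) -> None:
    """Append one tab-separated row to results.tsv."""
    row = f"{experiment}\t{commit}\t{metric:.6f}\t{status}\t{description}\n"
    with (repo_root / "results.tsv").open("a", encoding="utf-8") as handle:
        handle.write(row)
```

Calling `init_results` twice is harmless, since entries are only appended when they are not already present. The baseline would then be logged with something like `append_result(root, 0, "<commit>", value, "baseline", "unmodified code")`.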

Phase 3: Experiment Loop

Run this loop continuously. Do not stop to ask the user. Run until:

`MAX_EXPERIMENTS` is reached, OR
The user manually interrupts

For each experiment:

LOOP:
  1. THINK   - Analyze previous results and the current code.
               Generate an experiment hypothesis.
               Consider: what worked, what didn't, what hasn't been tried.

  2. EDIT    - Modify the in-scope file(s) to implement the idea.
               Keep changes focused and minimal per experiment.

  3. COMMIT  - git add + git commit with a short descriptive message.
               Format: "experiment: <short description of what changed>"

  4. RUN     - Execute the metric command.
               Redirect output to run.log so it does not flood the context window.
               Use shell-appropriate redirection:
               - Bash/Zsh: `<command> > run.log 2>&1`
               - PowerShell: `<command> *> run.log`

  5. MEASURE - Extract the metric from run.log.
               If extraction fails (crash/error), read the last 50 lines
               of run.log for the error.

  6. DECIDE  - Compare metric to the current best:
               - IMPROVED: Keep the commit. Update the "best" baseline.
                 Log status = "keep".
               - SAME OR WORSE: Revert. `git reset --hard HEAD~1`.
                 Log status = "discard".
               - CRASH: Attempt a quick fix (typo, import, simple error).
                 Amend the experiment commit (`git commit --amend`) with the fix
                 and rerun. The experiment keeps its original number.
                 If unfixable after 2 attempts, revert the entire experiment
                 (`git reset --hard HEAD~1`) and log status = "crash".

  7. LOG     - Append a row to results.tsv:
               experiment_number  commit_hash  metric_value  status  description

  8. CONTINUE - Go to step 1.
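
The DECIDE step above can be sketched as a small pure function in Python. The name `decide` and the convention of passing `None` for a failed extraction are assumptions for illustration; the git operations (keeping the commit or running `git reset --hard HEAD~1`) would be driven by the returned status.

```python
from typing import Optional, Tuple


def decide(metric: Optional[float], best: float, direction: str) -> Tuple[str, float]:
    """Map one measurement to (status, new_best).

    metric is None when extraction failed, i.e. the run crashed.
    A tie counts as "discard", matching the SAME OR WORSE rule.
    """
    if direction not in ("lower_is_better", "higher_is_better"):
        raise ValueError(f"unknown METRIC_DIRECTION: {direction!r}")
    if metric is None:
        return "crash", best
    if direction == "lower_is_better":
        improved = metric < best
    else:
        improved = metric > best
    if improved:
        return "keep", metric  # the kept commit becomes the new reference point
    return "discard", best
```

Keeping the decision separate from the git commands makes it easy to check the rule in isolation before wiring it into the loop.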

Experiment Strategy

When generating experiment ideas, follow this priority order:

Low-hanging fruit first: Simple parameter tweaks, obvious inefficiencies.
Informed by results: If a direction showed promise, explore further in that direction.
Diversify after plateaus: If the last 3-5 experiments all failed, try a different approach entirely.
Combine winners: If experiments A and B each improved independently, try combining them.
Simplification passes: Periodically try removing code/complexity to see if the metric holds.
Radical changes: After exhausting incremental ideas, try larger architectural changes.

Handling Constraints

Time budget: If a run exceeds 2x the expected duration, kill it and treat as a crash.
Existing tests: If constraints require tests to pass, run them before/after and revert if they break.
Memory/resources: Monitor and revert if resource usage exceeds stated limits.
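
The time-budget rule can be sketched with Python's standard `subprocess` module. The helper name `run_with_budget` and its parameters are hypothetical; the sketch assumes the metric command is given as an argument list and that the expected duration comes from the user's constraints.

```python
import subprocess
from typing import Sequence


def run_with_budget(command: Sequence[str], expected_seconds: float,
                    log_path: str = "run.log") -> bool:
    """Run `command` with stdout and stderr redirected to `log_path`.

    Returns False if the run exceeds 2x the expected duration; subprocess.run
    kills the child process before raising TimeoutExpired. The exit code is
    not checked here; a failed run shows up as a failed metric extraction.
    """
    with open(log_path, "w", encoding="utf-8") as log:
        try:
            subprocess.run(command, stdout=log, stderr=subprocess.STDOUT,
                           timeout=2 * expected_seconds)
        except subprocess.TimeoutExpired:
            return False  # caller treats this as a crash and reverts
    return True
```

A `False` result would then follow the CRASH branch of the loop. Monitoring memory or other resources needs platform-specific tooling and is not covered by this sketch.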

Phase 4: Reporting

When the loop ends (budget reached or user interrupts):

Print the full results.tsv as a formatted table.
Summarize:
- Total experiments run
- Experiments kept / discarded / crashed
- Starting metric (baseline) vs. final metric
- Improvement percentage
- Top 3 most impactful changes
Show the cumulative git log of kept experiments:
```
git log --oneline <start_commit>..HEAD
```
Recommend next steps: Based on the results, suggest what a human researcher might try next (ideas that were too risky/complex for automated experimentation).
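
The numeric part of the summary can be derived directly from `results.tsv`. The sketch below is one possible approach, not a prescribed implementation: the function name `summarize` is hypothetical, it takes the last kept row as the final metric (each kept experiment improved on the previous best), and it assumes a non-zero baseline when computing the percentage change.

```python
import csv
import io
from collections import Counter


def summarize(tsv_text: str) -> dict:
    """Compute experiment counts and the relative change from baseline to final metric."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    counts = Counter(row["status"] for row in rows)
    baseline = next(float(row["metric"]) for row in rows if row["status"] == "baseline")
    kept = [float(row["metric"]) for row in rows if row["status"] == "keep"]
    final = kept[-1] if kept else baseline
    return {
        "total": len(rows) - counts["baseline"],  # the baseline row is not an experiment
        "kept": counts["keep"],
        "discarded": counts["discard"],
        "crashed": counts["crash"],
        "baseline": baseline,
        "final": final,
        # Signed change; negative is an improvement when lower is better.
        "change_pct": (final - baseline) / baseline * 100,
    }
```

The sign of `change_pct` has to be read together with `METRIC_DIRECTION`: a negative value is good news for a lower-is-better metric and bad news otherwise.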

Quick Reference

Results TSV Format

Tab-separated, 5 columns:

experiment	commit	metric	status	description
0	a1b2c3d	0.997900	baseline	unmodified code
1	b2c3d4e	0.993200	keep	increase learning rate to 0.04
2	c3d4e5f	1.005000	discard	switch to GeLU activation
3	d4e5f6a	0.000000	crash	double model width (OOM)

Git Workflow

All experiments happen on the `autoresearch/<tag>` branch
Each experiment is committed before running
Failed experiments are reverted with `git reset --hard HEAD~1`
Successful experiments advance the branch
`results.tsv` and `run.log` stay untracked (added to `.git/info/exclude`)

Key Principles

Measure everything: No experiment without a measurement.
Revert failures: The branch only advances on improvements.
Stay autonomous: Never stop to ask. Think harder if stuck.
Keep it simple: Complexity is a cost. Weigh it against gains.
Log everything: The TSV is the research journal.
