# Autoresearch Agent

You sleep. The agent experiments. You wake up to results.
Autonomous experiment loop inspired by Karpathy's autoresearch. The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.
Not one guess — fifty measured attempts, compounding.


## Slash Commands


| Command | What it does |
| --- | --- |
| `/ar:setup` | Set up a new experiment interactively |
| `/ar:run` | Run a single experiment iteration |
| `/ar:loop` | Start the autonomous loop with a configurable interval (10m, 1h, daily, weekly, monthly) |
| `/ar:status` | Show the dashboard and results |
| `/ar:resume` | Resume a paused experiment |


## When This Skill Activates


Recognize these patterns from the user:

- "Make this faster / smaller / better"
- "Optimize [file] for [metric]"
- "Improve my [headlines / copy / prompts]"
- "Run experiments overnight"
- "I want to get [metric] from X to Y"
- Any request involving: optimize, benchmark, improve, experiment loop, autoresearch

If the user describes a target file + a way to measure success → this skill applies.


## Setup


### First Time: Create the Experiment


Run the setup script. The user decides where experiments live.

Project-level (inside the repo, git-tracked, shareable with the team):

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name api-speed \
  --target src/api/search.py \
  --eval "pytest bench.py --tb=no -q" \
  --metric p50_ms \
  --direction lower \
  --scope project
```

User-level (personal, in `~/.autoresearch/`):

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name medium-ctr \
  --target content/titles.md \
  --eval "python evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content \
  --scope user
```

The `--scope` flag determines where `.autoresearch/` lives:

- `project` (default) → `.autoresearch/` in the repo root. Experiment definitions are git-tracked; results are gitignored.
- `user` → `~/.autoresearch/` in the home directory. Everything is personal.

### What Setup Creates


```
.autoresearch/
├── config.yaml                        ← Global settings
├── .gitignore                         ← Ignores results.tsv, *.log
└── {domain}/{experiment-name}/
    ├── program.md                     ← Objectives, constraints, strategy
    ├── config.cfg                     ← Target, eval cmd, metric, direction
    ├── results.tsv                    ← Experiment log (gitignored)
    └── evaluate.py                    ← Evaluation script (if --evaluator used)
```

`results.tsv` columns: `commit | metric | status | description`

- `commit`: short git hash
- `metric`: float value, or "N/A" for crashes
- `status`: keep | discard | crash
- `description`: what changed, or why it crashed
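A few example rows (hypothetical values, for illustration only):

```
commit	metric	status	description
a3f9c12	241.0	keep	baseline
7b21e0d	198.3	keep	cache compiled regex
c4d881a	N/A	crash	missing import after refactor
e09aa4f	205.1	discard	batched queries, slower than cache
```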

## Domains


| Domain | Use Cases |
| --- | --- |
| `engineering` | Code speed, memory, bundle size, test pass rate, build time |
| `marketing` | Headlines, social copy, email subjects, ad copy, engagement |
| `content` | Article structure, SEO descriptions, readability, CTR |
| `prompts` | System prompts, chatbot tone, agent instructions |
| `custom` | Anything else with a measurable metric |

## If `program.md` Already Exists


The user may have written their own `program.md`. If one is found in the experiment directory, read it: it overrides the template. Only ask for what's missing.


## Agent Protocol


You are the loop. The scripts handle setup and evaluation — you handle the creative work.

### Before Starting


1. Read `.autoresearch/{domain}/{name}/config.cfg` (a sample sketch follows this list) to get:
   - `target`: the file you edit
   - `evaluate_cmd`: the command that measures your changes
   - `metric`: the metric name to look for in eval output
   - `metric_direction`: whether "lower" or "higher" is better
   - `time_budget_minutes`: max time per evaluation
2. Read `program.md` for the strategy, constraints, and what you can and cannot change.
3. Read `results.tsv` for the experiment history (columns: commit, metric, status, description).
4. Check out the experiment branch: `git checkout autoresearch/{domain}/{name}`
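For the api-speed example above, the generated `config.cfg` might look roughly like this (the exact keys and layout are an assumption; read the generated file rather than trusting this sketch):

```ini
[experiment]
target = src/api/search.py
evaluate_cmd = pytest bench.py --tb=no -q
metric = p50_ms
metric_direction = lower
time_budget_minutes = 10
```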

### Each Iteration


1. Review `results.tsv`: what worked? What failed? What hasn't been tried?
2. Decide on ONE change to the target file. One variable per experiment.
3. Edit the target file.
4. Commit: `git add {target} && git commit -m "experiment: {description}"`
5. Evaluate: `python scripts/run_experiment.py --experiment {domain}/{name} --single`
6. Read the output: it prints KEEP, DISCARD, or CRASH with the metric value (illustrated below).
7. Go to step 1.
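The exact output format belongs to `run_experiment.py`; a hypothetical step-6 result might look like:

```
KEEP: p50_ms = 182.4 (previous best: 241.0)
```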

### What the Script Handles (you don't)


- Running the eval command with a timeout
- Parsing the metric from the eval output
- Comparing to the previous best
- Reverting the commit on failure (`git reset --hard HEAD~1`)
- Logging the result to `results.tsv` (a sketch of the keep/discard logic follows)
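The keep/discard decision itself is a small comparison. A minimal sketch of that logic, assumed rather than copied from `run_experiment.py`:

```python
import re

def decide(output: str, metric: str, direction: str, best: float | None):
    """Parse '<metric>: <value>' from eval output; return (status, value)."""
    m = re.search(rf"{re.escape(metric)}\s*[:=]\s*(-?\d+(?:\.\d+)?)", output)
    if m is None:
        return "crash", None  # no metric in the output: treated as a crash
    value = float(m.group(1))
    if best is None:
        return "keep", value  # first successful run sets the baseline
    improved = value < best if direction == "lower" else value > best
    return ("keep" if improved else "discard"), value

# decide("p50_ms: 182.4", "p50_ms", "lower", best=241.0) -> ("keep", 182.4)
```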

## Starting an Experiment

```bash
# Single iteration (the agent calls this repeatedly)
python scripts/run_experiment.py --experiment engineering/api-speed --single

# Dry run (test setup before starting)
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
```

## Strategy Escalation


- Runs 1-5: low-hanging fruit (obvious improvements, simple optimizations)
- Runs 6-15: systematic exploration (vary one parameter at a time)
- Runs 16-30: structural changes (algorithm swaps, architecture shifts)
- Runs 30+: radical experiments (completely different approaches)
- No improvement in 20+ runs: update the Strategy section of `program.md`

## Self-Improvement


After every 10 experiments, review `results.tsv` for patterns. Update the Strategy section of `program.md` with what you learned (e.g., "caching changes consistently improve the metric by 5-10%", "refactoring attempts never improve the metric"). Future iterations benefit from this accumulated knowledge.

## Stopping


- Run until interrupted by the user, the context limit is reached, or the goal in `program.md` is met.
- Before stopping, ensure `results.tsv` is up to date.
- On context limit, the next session can resume: `results.tsv` and the git log persist.

## Rules


- One change per experiment. Don't change five things at once; you won't know which one worked.
- Simplicity criterion. A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code while keeping the same results is the best outcome.
- Never modify the evaluator. `evaluate.py` is the ground truth; modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.
- Timeout. If a run exceeds 2.5× the time budget, kill it and treat it as a crash.
- Crash handling. If it's a typo or a missing import, fix it and re-run. If the idea is fundamentally broken, revert, log "crash", and move on. Five consecutive crashes → pause and alert the user.
- No new dependencies. Only use what's already available in the project.


## Evaluators


Ready-to-use evaluation scripts, copied into the experiment directory during setup with `--evaluator`.

### Free Evaluators (no API cost)


| Evaluator | Metric | Use Case |
| --- | --- | --- |
| `benchmark_speed` | `p50_ms` (lower) | Function/API execution time |
| `benchmark_size` | `size_bytes` (lower) | File, bundle, Docker image size |
| `test_pass_rate` | `pass_rate` (higher) | Test suite pass percentage |
| `build_speed` | `build_seconds` (lower) | Build/compile/Docker build time |
| `memory_usage` | `peak_mb` (lower) | Peak memory during execution |
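To show the contract these scripts follow, here is a stripped-down sketch in the spirit of `benchmark_speed` (not the shipped evaluator; the workload and run count are placeholders):

```python
#!/usr/bin/env python3
"""Sketch: time a workload repeatedly, print the median in ms to stdout."""
import statistics
import subprocess
import time

samples = []
for _ in range(9):
    start = time.perf_counter()
    # Placeholder workload: swap in the command you actually benchmark.
    subprocess.run(["python", "-c", "pass"], check=True)
    samples.append((time.perf_counter() - start) * 1000)

# The only hard requirement: print "<metric_name>: <value>" to stdout.
print(f"p50_ms: {statistics.median(samples):.1f}")
```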

### LLM Judge Evaluators (uses your subscription)


| Evaluator | Metric | Use Case |
| --- | --- | --- |
| `llm_judge_content` | `ctr_score` 0-10 (higher) | Headlines, titles, descriptions |
| `llm_judge_prompt` | `quality_score` 0-100 (higher) | System prompts, agent instructions |
| `llm_judge_copy` | `engagement_score` 0-10 (higher) | Social posts, ad copy, emails |

LLM judges call the CLI tool the user is already running (Claude, Codex, Gemini). The evaluation prompt is locked inside `evaluate.py`; the agent cannot modify it. This prevents the agent from gaming its own evaluator (a sketch of the pattern follows the list below).

The user's existing subscription covers the cost:

- Claude Code Max → unlimited Claude calls for evaluation
- Codex CLI (ChatGPT Pro) → unlimited Codex calls
- Gemini CLI (free tier) → free evaluation calls
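A minimal sketch of the judge pattern, assuming the `claude` CLI's `-p` print mode (the shipped evaluators' prompts, flags, and parsing may differ):

```python
#!/usr/bin/env python3
"""Sketch: score content by sending it, plus a locked rubric, to a CLI model."""
import re
import subprocess
from pathlib import Path

RUBRIC = (
    "Rate these headlines 0-10 for click-through appeal. "
    "Reply with exactly one line: ctr_score: <number>"
)

content = Path("content/titles.md").read_text()  # hypothetical target file
out = subprocess.run(
    ["claude", "-p", f"{RUBRIC}\n\n{content}"],  # assumed CLI invocation
    capture_output=True, text=True, check=True,
).stdout
m = re.search(r"ctr_score:\s*(\d+(?:\.\d+)?)", out)
print(f"ctr_score: {m.group(1) if m else 'N/A'}")
```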

### Custom Evaluators


If no built-in evaluator fits, the user writes their own `evaluate.py`. The only requirement: it must print `metric_name: value` to stdout.

```python
#!/usr/bin/env python3
# My custom evaluator — DO NOT MODIFY after experiment starts
import json
import subprocess

def parse_score(stdout: str) -> float:
    # User-supplied parsing; a hypothetical {"score": ...} JSON shape assumed here.
    return float(json.loads(stdout)["score"])

result = subprocess.run(["my-benchmark", "--json"], capture_output=True, text=True)

# Parse and output
print(f"my_metric: {parse_score(result.stdout)}")
```

---

## Viewing Results


```bash
# Single experiment
python scripts/log_results.py --experiment engineering/api-speed

# All experiments in a domain
python scripts/log_results.py --domain engineering

# Cross-experiment dashboard
python scripts/log_results.py --dashboard

# Export formats
python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
python scripts/log_results.py --dashboard --format markdown --output dashboard.md
```

### Dashboard Output


```
DOMAIN          EXPERIMENT          RUNS  KEPT  BEST         Δ FROM START  STATUS
engineering     api-speed            47    14   185ms        -76.9%        active
engineering     bundle-size          23     8   412KB        -58.3%        paused
marketing       medium-ctr           31    11   8.4/10       +68.0%        active
prompts         support-tone         15     6   82/100       +46.4%        done
```

### Export Formats


- TSV: the default; tab-separated (compatible with spreadsheets)
- CSV: comma-separated, with proper quoting
- Markdown: a formatted table, readable in GitHub/docs


## Proactive Triggers


Flag these without being asked:

- Evaluation command doesn't work → test it before starting the loop. Run it once and verify the output.
- Target file not in git → run `git init && git add . && git commit -m 'initial'` first.
- Metric direction unclear → ask: is lower or higher better? This must be known before starting.
- Time budget too short → if the eval takes longer than the budget, every run crashes.
- Agent modifying `evaluate.py` → hard stop. This invalidates all comparisons.
- 5 consecutive crashes → pause the loop and alert the user. Don't keep burning cycles.
- No improvement in 20+ runs → suggest changing the strategy in `program.md` or trying a different approach.


## Installation


### One-liner (any tool)


```bash
git clone https://github.com/alirezarezvani/claude-skills.git
cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/
```

### Multi-tool install


```bash
./scripts/convert.sh --skill autoresearch-agent --tool codex|gemini|cursor|windsurf|openclaw
```

### OpenClaw


```bash
clawhub install cs-autoresearch-agent
```

## Related Skills


- `self-improving-agent`: improves an agent's own memory/rules over time. NOT for structured experiment loops.
- `senior-ml-engineer`: ML architecture decisions. Complementary; use it for the initial design, then autoresearch for optimization.
- `tdd-guide`: test-driven development. Complementary; tests can serve as the evaluation function.
- `skill-security-auditor`: audits skills before publishing. NOT for optimization loops.