# Autoresearch Agent

You sleep. The agent experiments. You wake up to results.
Autonomous experiment loop inspired by Karpathy's autoresearch. The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.
Not one guess — fifty measured attempts, compounding.


## Slash Commands


| Command | What it does |
| --- | --- |
| `/ar:setup` | Set up a new experiment interactively |
| `/ar:run` | Run a single experiment iteration |
| `/ar:loop` | Start the autonomous loop with a configurable interval (10m, 1h, daily, weekly, monthly) |
| `/ar:status` | Show the dashboard and results |
| `/ar:resume` | Resume a paused experiment |


## When This Skill Activates


Recognize these patterns from the user:

- "Make this faster / smaller / better"
- "Optimize [file] for [metric]"
- "Improve my [headlines / copy / prompts]"
- "Run experiments overnight"
- "I want to get [metric] from X to Y"
- Any request involving: optimize, benchmark, improve, experiment loop, autoresearch

If the user describes a target file + a way to measure success → this skill applies.


## Setup


### First Time: Create the Experiment


Run the setup script. The user decides where experiments live.

Project-level (inside the repo, git-tracked, shareable with the team):

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name api-speed \
  --target src/api/search.py \
  --eval "pytest bench.py --tb=no -q" \
  --metric p50_ms \
  --direction lower \
  --scope project
```

User-level (personal, in `~/.autoresearch/`):

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name medium-ctr \
  --target content/titles.md \
  --eval "python evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content \
  --scope user
```

The `--scope` flag determines where `.autoresearch/` lives:

- `project` (default) → `.autoresearch/` in the repo root. Experiment definitions are git-tracked; results are gitignored.
- `user` → `~/.autoresearch/` in the home directory. Everything is personal.

### What Setup Creates


```
.autoresearch/
├── config.yaml                        ← Global settings
├── .gitignore                         ← Ignores results.tsv, *.log
└── {domain}/{experiment-name}/
    ├── program.md                     ← Objectives, constraints, strategy
    ├── config.cfg                     ← Target, eval cmd, metric, direction
    ├── results.tsv                    ← Experiment log (gitignored)
    └── evaluate.py                    ← Evaluation script (if --evaluator used)
```

`results.tsv` columns: `commit | metric | status | description`

- `commit`: short git hash
- `metric`: float value, or "N/A" for crashes
- `status`: keep | discard | crash
- `description`: what changed, or why it crashed
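A few example rows (hypothetical values, for illustration only):

```
commit	metric	status	description
a3f9c12	241.0	keep	baseline
7b21e0d	198.3	keep	cache compiled regex
c4d881a	N/A	crash	missing import after refactor
e09aa4f	205.1	discard	batched queries, slower than cache
```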

## Domains


| Domain | Use Cases |
| --- | --- |
| `engineering` | Code speed, memory, bundle size, test pass rate, build time |
| `marketing` | Headlines, social copy, email subjects, ad copy, engagement |
| `content` | Article structure, SEO descriptions, readability, CTR |
| `prompts` | System prompts, chatbot tone, agent instructions |
| `custom` | Anything else with a measurable metric |

## If `program.md` Already Exists


The user may have written their own `program.md`. If one is found in the experiment directory, read it: it overrides the template. Only ask for what's missing.


## Agent Protocol


You are the loop. The scripts handle setup and evaluation — you handle the creative work.

### Before Starting


1. Read `.autoresearch/{domain}/{name}/config.cfg` (a sample sketch follows this list) to get:
   - `target`: the file you edit
   - `evaluate_cmd`: the command that measures your changes
   - `metric`: the metric name to look for in eval output
   - `metric_direction`: whether "lower" or "higher" is better
   - `time_budget_minutes`: max time per evaluation
2. Read `program.md` for the strategy, constraints, and what you can and cannot change.
3. Read `results.tsv` for the experiment history (columns: commit, metric, status, description).
4. Check out the experiment branch: `git checkout autoresearch/{domain}/{name}`
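For the api-speed example above, the generated `config.cfg` might look roughly like this (the exact keys and layout are an assumption; read the generated file rather than trusting this sketch):

```ini
[experiment]
target = src/api/search.py
evaluate_cmd = pytest bench.py --tb=no -q
metric = p50_ms
metric_direction = lower
time_budget_minutes = 10
```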

### Each Iteration


1. Review `results.tsv`: what worked? What failed? What hasn't been tried?
2. Decide on ONE change to the target file. One variable per experiment.
3. Edit the target file.
4. Commit: `git add {target} && git commit -m "experiment: {description}"`
5. Evaluate: `python scripts/run_experiment.py --experiment {domain}/{name} --single`
6. Read the output: it prints KEEP, DISCARD, or CRASH with the metric value (illustrated below).
7. Go to step 1.
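The exact output format belongs to `run_experiment.py`; a hypothetical step-6 result might look like:

```
KEEP: p50_ms = 182.4 (previous best: 241.0)
```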

### What the Script Handles (you don't)


- Running the eval command with a timeout
- Parsing the metric from the eval output
- Comparing to the previous best
- Reverting the commit on failure (`git reset --hard HEAD~1`)
- Logging the result to `results.tsv` (a sketch of the keep/discard logic follows)
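The keep/discard decision itself is a small comparison. A minimal sketch of that logic, assumed rather than copied from `run_experiment.py`:

```python
import re

def decide(output: str, metric: str, direction: str, best: float | None):
    """Parse '<metric>: <value>' from eval output; return (status, value)."""
    m = re.search(rf"{re.escape(metric)}\s*[:=]\s*(-?\d+(?:\.\d+)?)", output)
    if m is None:
        return "crash", None  # no metric in the output: treated as a crash
    value = float(m.group(1))
    if best is None:
        return "keep", value  # first successful run sets the baseline
    improved = value < best if direction == "lower" else value > best
    return ("keep" if improved else "discard"), value

# decide("p50_ms: 182.4", "p50_ms", "lower", best=241.0) -> ("keep", 182.4)
```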

## Starting an Experiment

```bash
# Single iteration (the agent calls this repeatedly)
python scripts/run_experiment.py --experiment engineering/api-speed --single

# Dry run (test setup before starting)
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
```

## Strategy Escalation


- Runs 1-5: low-hanging fruit (obvious improvements, simple optimizations)
- Runs 6-15: systematic exploration (vary one parameter at a time)
- Runs 16-30: structural changes (algorithm swaps, architecture shifts)
- Runs 30+: radical experiments (completely different approaches)
- No improvement in 20+ runs: update the Strategy section of `program.md`

## Self-Improvement


After every 10 experiments, review `results.tsv` for patterns. Update the Strategy section of `program.md` with what you learned (e.g., "caching changes consistently improve the metric by 5-10%", "refactoring attempts never improve the metric"). Future iterations benefit from this accumulated knowledge.

## Stopping


- Run until interrupted by the user, the context limit is reached, or the goal in `program.md` is met.
- Before stopping, ensure `results.tsv` is up to date.
- On context limit, the next session can resume: `results.tsv` and the git log persist.

## Rules


- One change per experiment. Don't change five things at once; you won't know which one worked.
- Simplicity criterion. A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code while keeping the same results is the best outcome.
- Never modify the evaluator. `evaluate.py` is the ground truth; modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.
- Timeout. If a run exceeds 2.5× the time budget, kill it and treat it as a crash.
- Crash handling. If it's a typo or a missing import, fix it and re-run. If the idea is fundamentally broken, revert, log "crash", and move on. Five consecutive crashes → pause and alert the user.
- No new dependencies. Only use what's already available in the project.


## Evaluators


Ready-to-use evaluation scripts, copied into the experiment directory during setup with `--evaluator`.

### Free Evaluators (no API cost)


| Evaluator | Metric | Use Case |
| --- | --- | --- |
| `benchmark_speed` | `p50_ms` (lower) | Function/API execution time |
| `benchmark_size` | `size_bytes` (lower) | File, bundle, Docker image size |
| `test_pass_rate` | `pass_rate` (higher) | Test suite pass percentage |
| `build_speed` | `build_seconds` (lower) | Build/compile/Docker build time |
| `memory_usage` | `peak_mb` (lower) | Peak memory during execution |
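To show the contract these scripts follow, here is a stripped-down sketch in the spirit of `benchmark_speed` (not the shipped evaluator; the workload and run count are placeholders):

```python
#!/usr/bin/env python3
"""Sketch: time a workload repeatedly, print the median in ms to stdout."""
import statistics
import subprocess
import time

samples = []
for _ in range(9):
    start = time.perf_counter()
    # Placeholder workload: swap in the command you actually benchmark.
    subprocess.run(["python", "-c", "pass"], check=True)
    samples.append((time.perf_counter() - start) * 1000)

# The only hard requirement: print "<metric_name>: <value>" to stdout.
print(f"p50_ms: {statistics.median(samples):.1f}")
```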

### LLM Judge Evaluators (uses your subscription)


| Evaluator | Metric | Use Case |
| --- | --- | --- |
| `llm_judge_content` | `ctr_score` 0-10 (higher) | Headlines, titles, descriptions |
| `llm_judge_prompt` | `quality_score` 0-100 (higher) | System prompts, agent instructions |
| `llm_judge_copy` | `engagement_score` 0-10 (higher) | Social posts, ad copy, emails |

LLM judges call the CLI tool the user is already running (Claude, Codex, Gemini). The evaluation prompt is locked inside `evaluate.py`; the agent cannot modify it. This prevents the agent from gaming its own evaluator (a sketch of the pattern follows the list below).

The user's existing subscription covers the cost:

- Claude Code Max → unlimited Claude calls for evaluation
- Codex CLI (ChatGPT Pro) → unlimited Codex calls
- Gemini CLI (free tier) → free evaluation calls
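A minimal sketch of the judge pattern, assuming the `claude` CLI's `-p` print mode (the shipped evaluators' prompts, flags, and parsing may differ):

```python
#!/usr/bin/env python3
"""Sketch: score content by sending it, plus a locked rubric, to a CLI model."""
import re
import subprocess
from pathlib import Path

RUBRIC = (
    "Rate these headlines 0-10 for click-through appeal. "
    "Reply with exactly one line: ctr_score: <number>"
)

content = Path("content/titles.md").read_text()  # hypothetical target file
out = subprocess.run(
    ["claude", "-p", f"{RUBRIC}\n\n{content}"],  # assumed CLI invocation
    capture_output=True, text=True, check=True,
).stdout
m = re.search(r"ctr_score:\s*(\d+(?:\.\d+)?)", out)
print(f"ctr_score: {m.group(1) if m else 'N/A'}")
```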

### Custom Evaluators


If no built-in evaluator fits, the user writes their own `evaluate.py`. The only requirement: it must print `metric_name: value` to stdout.

```python
#!/usr/bin/env python3
# My custom evaluator — DO NOT MODIFY after experiment starts
import json
import subprocess

def parse_score(stdout: str) -> float:
    # User-supplied parsing; a hypothetical {"score": ...} JSON shape assumed here.
    return float(json.loads(stdout)["score"])

result = subprocess.run(["my-benchmark", "--json"], capture_output=True, text=True)

# Parse and output
print(f"my_metric: {parse_score(result.stdout)}")
```

---

## Viewing Results


```bash
# Single experiment
python scripts/log_results.py --experiment engineering/api-speed

# All experiments in a domain
python scripts/log_results.py --domain engineering

# Cross-experiment dashboard
python scripts/log_results.py --dashboard

# Export formats
python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
python scripts/log_results.py --dashboard --format markdown --output dashboard.md
```

### Dashboard Output


```
DOMAIN          EXPERIMENT          RUNS  KEPT  BEST         Δ FROM START  STATUS
engineering     api-speed            47    14   185ms        -76.9%        active
engineering     bundle-size          23     8   412KB        -58.3%        paused
marketing       medium-ctr           31    11   8.4/10       +68.0%        active
prompts         support-tone         15     6   82/100       +46.4%        done
```

### Export Formats


- TSV: the default; tab-separated (compatible with spreadsheets)
- CSV: comma-separated, with proper quoting
- Markdown: a formatted table, readable in GitHub/docs


## Proactive Triggers


Flag these without being asked:

- Evaluation command doesn't work → test it before starting the loop. Run it once and verify the output.
- Target file not in git → run `git init && git add . && git commit -m 'initial'` first.
- Metric direction unclear → ask: is lower or higher better? This must be known before starting.
- Time budget too short → if the eval takes longer than the budget, every run crashes.
- Agent modifying `evaluate.py` → hard stop. This invalidates all comparisons.
- 5 consecutive crashes → pause the loop and alert the user. Don't keep burning cycles.
- No improvement in 20+ runs → suggest changing the strategy in `program.md` or trying a different approach.


## Installation


### One-liner (any tool)


```bash
git clone https://github.com/alirezarezvani/claude-skills.git
cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/
```

### Multi-tool install


```bash
./scripts/convert.sh --skill autoresearch-agent --tool codex|gemini|cursor|windsurf|openclaw
```

### OpenClaw


```bash
clawhub install cs-autoresearch-agent
```

## Related Skills


- `self-improving-agent`: improves an agent's own memory/rules over time. NOT for structured experiment loops.
- `senior-ml-engineer`: ML architecture decisions. Complementary; use it for the initial design, then autoresearch for optimization.
- `tdd-guide`: test-driven development. Complementary; tests can serve as the evaluation function.
- `skill-security-auditor`: audits skills before publishing. NOT for optimization loops.