autoresearch-agent
Autoresearch Agent
You sleep. The agent experiments. You wake up to results.
Autonomous experiment loop inspired by Karpathy's autoresearch. The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.
Not one guess — fifty measured attempts, compounding.
Slash Commands
| Command | What it does |
|---|---|
| | Set up a new experiment interactively |
| | Run a single experiment iteration |
| | Start autonomous loop with configurable interval (10m, 1h, daily, weekly, monthly) |
| | Show dashboard and results |
| | Resume a paused experiment |
When This Skill Activates
Recognize these patterns from the user:
- "Make this faster / smaller / better"
- "Optimize [file] for [metric]"
- "Improve my [headlines / copy / prompts]"
- "Run experiments overnight"
- "I want to get [metric] from X to Y"
- Any request involving: optimize, benchmark, improve, experiment loop, autoresearch
If the user describes a target file + a way to measure success → this skill applies.
Setup
First Time — Create the Experiment
Run the setup script. The user decides where experiments live:
Project-level (inside repo, git-tracked, shareable with team):
```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name api-speed \
  --target src/api/search.py \
  --eval "pytest bench.py --tb=no -q" \
  --metric p50_ms \
  --direction lower \
  --scope project
```

User-level (personal, in ~/.autoresearch/):

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name medium-ctr \
  --target content/titles.md \
  --eval "python evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content \
  --scope user
```

The --scope flag determines where .autoresearch/ lives:
- project (default) → .autoresearch/ in the repo root. Experiment definitions are git-tracked. Results are gitignored.
- user → ~/.autoresearch/ in the home directory. Everything is personal.
What Setup Creates
```
.autoresearch/
├── config.yaml                  ← Global settings
├── .gitignore                   ← Ignores results.tsv, *.log
└── {domain}/{experiment-name}/
    ├── program.md               ← Objectives, constraints, strategy
    ├── config.cfg               ← Target, eval cmd, metric, direction
    ├── results.tsv              ← Experiment log (gitignored)
    └── evaluate.py              ← Evaluation script (if --evaluator used)
```

results.tsv columns (sample below): commit | metric | status | description
- commit — short git hash
- metric — float value or "N/A" for crashes
- status — keep | discard | crash
- description — what changed or why it crashed
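For illustration, a few hypothetical results.tsv rows (hashes, values, and descriptions are made up):

```
commit   metric  status   description
a1b2c3d  812.4   keep     cache compiled regex at module level
e4f5a6b  815.1   discard  inline the scoring helper
c7d8e9f  N/A     crash    swap list sort for heapq (missing import)
```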
Domains
| Domain | Use Cases |
|---|---|
| engineering | Code speed, memory, bundle size, test pass rate, build time |
| marketing | Headlines, social copy, email subjects, ad copy, engagement |
| | Article structure, SEO descriptions, readability, CTR |
| prompts | System prompts, chatbot tone, agent instructions |
| | Anything else with a measurable metric |
If program.md Already Exists
The user may have written their own program.md. If found in the experiment directory, read it. It overrides the template. Only ask for what's missing.
Agent Protocol
You are the loop. The scripts handle setup and evaluation — you handle the creative work.
Before Starting
- Read .autoresearch/{domain}/{name}/config.cfg to get (see the config.cfg sketch after this list):
  - target — the file you edit
  - evaluate_cmd — the command that measures your changes
  - metric — the metric name to look for in eval output
  - metric_direction — "lower" or "higher" is better
  - time_budget_minutes — max time per evaluation
- Read program.md for strategy, constraints, and what you can/cannot change
- Read results.tsv for experiment history (columns: commit, metric, status, description)
- Check out the experiment branch: git checkout autoresearch/{domain}/{name}
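A hypothetical config.cfg for the engineering/api-speed example. The exact file format is whatever setup_experiment.py writes; the field names follow the list above and the values are illustrative:

```
target = src/api/search.py
evaluate_cmd = pytest bench.py --tb=no -q
metric = p50_ms
metric_direction = lower
time_budget_minutes = 10
```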
Each Iteration
1. Review results.tsv — what worked? What failed? What hasn't been tried?
2. Decide ONE change to the target file. One variable per experiment.
3. Edit the target file
4. Commit: git add {target} && git commit -m "experiment: {description}"
5. Evaluate: python scripts/run_experiment.py --experiment {domain}/{name} --single (a consolidated sketch of steps 4-5 follows this list)
6. Read the output — it prints KEEP, DISCARD, or CRASH with the metric value
7. Go to step 1
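A minimal sketch of one iteration's shell side, using the engineering/api-speed example; the commit description and the noted output are illustrative, not a guaranteed format:

```bash
# Step 4: commit the single change to the target file
git add src/api/search.py
git commit -m "experiment: cache compiled regex at module level"

# Step 5: let the runner evaluate, compare to the best so far, and log the result
python scripts/run_experiment.py --experiment engineering/api-speed --single
# Expect a KEEP, DISCARD, or CRASH line together with the measured metric value
```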
What the Script Handles (you don't)
- Running the eval command with timeout
- Parsing the metric from eval output
- Comparing to previous best
- Reverting the commit on failure (git reset --hard HEAD~1)
- Logging the result to results.tsv
Starting an Experiment
```bash
# Single iteration (the agent calls this repeatedly)
python scripts/run_experiment.py --experiment engineering/api-speed --single

# Dry run (test setup before starting)
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
```
Strategy Escalation
- Runs 1-5: Low-hanging fruit (obvious improvements, simple optimizations)
- Runs 6-15: Systematic exploration (vary one parameter at a time)
- Runs 16-30: Structural changes (algorithm swaps, architecture shifts)
- Runs 30+: Radical experiments (completely different approaches)
- If no improvement in 20+ runs: update program.md Strategy section
Self-Improvement
After every 10 experiments, review results.tsv for patterns. Update the
Strategy section of program.md with what you learned (e.g., "caching changes
consistently improve by 5-10%", "refactoring attempts never improve the metric").
Future iterations benefit from this accumulated knowledge.
Stopping
- Run until interrupted by the user, context limit reached, or goal in program.md is met
- Before stopping: ensure results.tsv is up to date
- On context limit: the next session can resume — results.tsv and git log persist
Rules
- One change per experiment. Don't change 5 things at once. You won't know what worked.
- Simplicity criterion. A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code while getting the same results is the best outcome.
- Never modify the evaluator. evaluate.py is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.
- Timeout. If a run exceeds 2.5× the time budget, kill it and treat it as a crash.
- Crash handling. If it's a typo or missing import, fix and re-run. If the idea is fundamentally broken, revert, log "crash", move on. 5 consecutive crashes → pause and alert.
- No new dependencies. Only use what's already available in the project.
Evaluators
Ready-to-use evaluation scripts. Copied into the experiment directory during setup with --evaluator.
Free Evaluators (no API cost)
| Evaluator | Metric | Use Case |
|---|---|---|
| | | Function/API execution time |
| | | File, bundle, Docker image size |
| | | Test suite pass percentage |
| | | Build/compile/Docker build time |
| | | Peak memory during execution |
LLM Judge Evaluators (uses your subscription)
| Evaluator | Metric | Use Case |
|---|---|---|
| | | Headlines, titles, descriptions |
| | | System prompts, agent instructions |
| | | Social posts, ad copy, emails |
LLM judges call the CLI tool the user is already running (Claude, Codex, Gemini). The evaluation prompt is locked inside evaluate.py — the agent cannot modify it. This prevents the agent from gaming its own evaluator.
The user's existing subscription covers the cost:
- Claude Code Max → unlimited Claude calls for evaluation
- Codex CLI (ChatGPT Pro) → unlimited Codex calls
- Gemini CLI (free tier) → free evaluation calls
Custom Evaluators
If no built-in evaluator fits, the user writes their own evaluate.py. Only requirement: it must print metric_name: value to stdout.

```python
#!/usr/bin/env python3
# My custom evaluator — DO NOT MODIFY after experiment starts
import json
import subprocess

result = subprocess.run(["my-benchmark", "--json"], capture_output=True, text=True)

def parse_score(stdout):
    # Example parsing: assumes the benchmark emits JSON with a "score" field
    return json.loads(stdout)["score"]

# Parse and output in the required "metric_name: value" format
print(f"my_metric: {parse_score(result.stdout)}")
```
Viewing Results

```bash
# Single experiment
python scripts/log_results.py --experiment engineering/api-speed

# All experiments in a domain
python scripts/log_results.py --domain engineering

# Cross-experiment dashboard
python scripts/log_results.py --dashboard

# Export formats
python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
python scripts/log_results.py --dashboard --format markdown --output dashboard.md
```
Dashboard Output
```
DOMAIN        EXPERIMENT     RUNS   KEPT   BEST      Δ FROM START   STATUS
engineering   api-speed      47     14     185ms     -76.9%         active
engineering   bundle-size    23     8      412KB     -58.3%         paused
marketing     medium-ctr     31     11     8.4/10    +68.0%         active
prompts       support-tone   15     6      82/100    +46.4%         done
```

Export Formats
- TSV — default, tab-separated (compatible with spreadsheets)
- CSV — comma-separated, with proper quoting
- Markdown — formatted table, readable in GitHub/docs
Proactive Triggers
Flag these without being asked:
- No evaluation command works → Test it before starting the loop. Run once, verify output (see the check after this list).
- Target file not in git → git init && git add . && git commit -m 'initial' first.
- Metric direction unclear → Ask: is lower or higher better? Must know before starting.
- Time budget too short → If eval takes longer than budget, every run crashes.
- Agent modifying evaluate.py → Hard stop. This invalidates all comparisons.
- 5 consecutive crashes → Pause the loop. Alert the user. Don't keep burning cycles.
- No improvement in 20+ runs → Suggest changing strategy in program.md or trying a different approach.
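A quick check for the first trigger, using the api-speed example; the dry-run flag is the one documented above, and the metric name mentioned in the comment is illustrative:

```bash
# Verify the whole setup end to end before starting the loop
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run

# Or run the eval command by hand and confirm the metric name (e.g. p50_ms) shows up in its output
pytest bench.py --tb=no -q
```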
Installation
One-liner (any tool)
```bash
git clone https://github.com/alirezarezvani/claude-skills.git
cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/
```

Multi-tool install
```bash
./scripts/convert.sh --skill autoresearch-agent --tool codex|gemini|cursor|windsurf|openclaw
```

OpenClaw
```bash
clawhub install cs-autoresearch-agent
```

Related Skills
- self-improving-agent — improves an agent's own memory/rules over time. NOT for structured experiment loops.
- senior-ml-engineer — ML architecture decisions. Complementary — use for initial design, then autoresearch for optimization.
- tdd-guide — test-driven development. Complementary — tests can be the evaluation function.
- skill-security-auditor — audit skills before publishing. NOT for optimization loops.