autoresearch
Original:🇺🇸 English
Translated
2 scripts
This skill should be used when the user asks to "run autoresearch", "optimize X in a loop", "set up autonomous experiments", "start autoresearch", "optimize X overnight", or "experiment loop". Sets up and runs an autonomous experiment loop for any optimization target.
2installs
Sourcepaulrberg/agent-skills
Added on
NPX Install
npx skill4agent add paulrberg/agent-skills autoresearchTags
Translated version includes tags in frontmatterSKILL.md Content
View Translation Comparison →Autoresearch
Autonomous experiment loop: try ideas, measure results, keep what works, discard what doesn't, never stop.
Works for any optimization target: test speed, bundle size, LLM training, build times, Lighthouse scores, binary size, latency, memory usage.
Setup
If already exists in the working directory, skip setup and resume the loop — read , , and , then continue experimenting.
autoresearch.mdautoresearch.mdautoresearch.jsonlgit logOtherwise:
- Gather context: Ask (or infer from and conversation) the Goal, Command to benchmark, Primary metric (name + direction), Files in scope, and Constraints.
$ARGUMENTS - Create branch: (e.g.
git checkout -b autoresearch/<goal>-<date>).autoresearch/test-speed-2026-03-21 - Read source files: Understand the workload deeply before writing anything. Read every file in scope.
- Write session files: Create and
autoresearch.md(see templates below). If constraints require correctness validation (tests must pass, types must check), also createautoresearch.sh. Commit all.autoresearch.checks.sh - Run baseline: Execute the first experiment with no changes to establish the baseline metric.
- Start looping: Begin the experiment loop immediately after the baseline is logged.
autoresearch.md
autoresearch.mdThe heart of the session. A fresh agent with no context should be able to read this file alone and run the loop effectively. Invest time making it excellent.
markdown
# Autoresearch: <goal>
## Objective
<Specific description of what we're optimizing and the workload.>
## Metrics
- **Primary**: <name> (<unit>, lower/higher is better)
- **Secondary**: <name>, <name>, ...
## How to Run
`./autoresearch.sh` — outputs `METRIC name=value` lines.
## Files in Scope
<Every file the agent may modify, with a brief note on what it does.>
## Off Limits
<What must NOT be touched — evaluation harness, data prep, etc.>
## Constraints
<Hard rules: tests must pass, no new deps, fixed time budget, etc.>
## What's Been Tried
<Update this section as experiments accumulate. Note key wins, dead ends,
and architectural insights so the agent doesn't repeat failed approaches.>Update periodically — especially "What's Been Tried" — so resuming agents have full context.
autoresearch.mdautoresearch.sh
autoresearch.shBash script that runs the benchmark and outputs structured metrics.
bash
#!/bin/bash
set -euo pipefail
# Pre-checks (fast, <1s — catch syntax errors early)
python3 -c "import ast; ast.parse(open('train.py').read())"
# Run benchmark
uv run train.py > /tmp/autoresearch-output.log 2>&1
# Extract and output metrics as METRIC lines
val_bpb=$(grep "^val_bpb:" /tmp/autoresearch-output.log | awk '{print $2}')
echo "METRIC val_bpb=$val_bpb"Rules:
- Use .
set -euo pipefail - Output lines to stdout (one per metric). The primary metric name must match what's documented in
METRIC name=value.autoresearch.md - Metric names: word chars, dots, or (e.g.
µ,val_bpb,total_µs).bundle.size_kb - Keep the script fast — every second is multiplied by hundreds of runs.
- For fast/noisy benchmarks (<5s), run multiple times inside the script and report the median.
- Update the script during the loop as needed.
autoresearch.checks.sh
(optional)
autoresearch.checks.shBackpressure checks: tests, types, lint. Only create when constraints require correctness validation.
bash
#!/bin/bash
set -euo pipefail
pnpm test --run --reporter=dot 2>&1 | tail -50
pnpm typecheck 2>&1 | grep -i error || trueWhen this file exists:
- Run it after every passing benchmark (exit 0).
- If checks fail, log the experiment as and revert.
checks_failed - Check execution time does NOT affect the primary metric.
- Keep output minimal — suppress verbose progress, only show errors.
When this file does not exist, skip checks entirely.
The Experiment Loop
LOOP FOREVER. Never ask "should I continue?" — the user expects autonomous work.
Each iteration:
- Formulate hypothesis: Based on prior results, source code understanding, and any ideas in , choose what to try next.
autoresearch.ideas.md - Edit code: Modify the in-scope files. Make a single, focused change per experiment.
- Commit:
git add -A && git commit -m "<short description of what this experiment tries>" - Run benchmark:
If the command times out or crashes, treat it as a failure.bash
timeout 600 ./autoresearch.sh > run.log 2>&1 - Parse metrics: Extract lines from the output:
METRICIf no METRIC lines found, the run crashed — readbashgrep '^METRIC ' run.logfor the error.tail -50 run.log - Run checks (if exists and benchmark passed):
autoresearch.checks.shbashtimeout 300 ./autoresearch.checks.sh > checks.log 2>&1 - Evaluate and log:
- Improved (primary metric better than best so far) → status . The commit stays.
keep - Worse or equal → status . Revert: stage autoresearch files first, then reset.
discard - Crash (benchmark failed) → status . Fix if trivial, otherwise revert and move on.
crash - Checks failed → status . Revert.
checks_failed
- Improved (primary metric better than best so far) → status
- Log to JSONL: Append one line to :
autoresearch.jsonljson{"run":1,"commit":"a1b2c3d","metric":0.9979,"metrics":{"val_bpb":0.9979,"peak_vram_mb":45060.2},"status":"keep","description":"baseline","timestamp":1711036800000,"confidence":null} - On discard/crash/checks_failed — revert code changes:
bash
# Preserve autoresearch session files, revert everything else git add autoresearch.jsonl autoresearch.md autoresearch.sh autoresearch.ideas.md autoresearch.checks.sh 2>/dev/null || true git checkout -- . git clean -fd - Check confidence: After 3+ runs, run the confidence script from the skill's installation directory:
Or locate it via the skill path and run it directly. Interpret the score:bash
bash "$(dirname "$(readlink -f "$0")")/scripts/confidence.sh"- >= 2.0x: Improvement is likely real (green).
- 1.0-2.0x: Above noise but marginal (yellow).
- < 1.0x: Within noise — consider re-running to confirm (red).
- Update session: Periodically update "What's Been Tried" section and run the summary script to review progress.
autoresearch.md
Repeat forever until interrupted.
JSONL Schema
Each line in is a JSON object:
autoresearch.jsonl| Field | Type | Description |
|---|---|---|
| number | 1-indexed experiment count |
| string | Short git SHA (7 chars) |
| number | Primary metric value |
| object | All metrics dict (primary + secondary) |
| string | |
| string | What this experiment tried |
| number | Unix timestamp (ms) |
| number or null | MAD-based confidence score (null if <3 runs) |
Resuming
When exists in the working directory:
autoresearch.md- Read for full context (objective, what's been tried, constraints).
autoresearch.md - Read to reconstruct state (best metric, run count, last segment).
autoresearch.jsonl - Read for recent commit history.
git log --oneline -20 - Check if it exists — prune stale entries, experiment with promising ones.
autoresearch.ideas.md - Continue the loop from where it left off. Do not re-run the baseline.
Ideas Backlog
When you discover complex but promising optimizations you won't pursue right now, append them as bullets to . Don't let good ideas get lost.
autoresearch.ideas.mdOn resume, check this file — prune stale/tried entries, experiment with the rest. When all paths are exhausted, delete the file and write a final summary to .
autoresearch.mdLoop Rules
See for the full reference. Key rules:
references/loop-rules.md- Primary metric is king. Improved → keep. Worse/equal → discard.
- Simpler is better. Remove code for equal perf = keep. Ugly complexity for tiny gain = discard.
- Don't thrash. Repeatedly reverting the same idea? Try something structurally different.
- Think longer when stuck. Re-read source files, reason about what the CPU/compiler/runtime is actually doing. Deep understanding beats random variation.
- Crashes: fix if trivial (typo, missing import), otherwise log and move on. Don't over-invest.
- NEVER STOP. The user may be away for hours. Keep going until interrupted.
User Messages During Experiments
If the user sends a message while an experiment is running, finish the current run-evaluate-log cycle first, then incorporate their feedback in the next iteration.