codex-autoresearch-loop

Codex Autoresearch

Skill by ara.so — Daily 2026 Skills collection.
Codex Autoresearch is a Codex skill that runs an autonomous modify→verify→keep/revert loop on your codebase. You describe a measurable goal in one sentence; Codex confirms the plan, then iterates unattended — every improvement stacks in git, every failure reverts automatically — until interrupted or a cap is reached. Inspired by Karpathy's autoresearch concept, generalized beyond ML training to any software metric.

Installation

Option A — manual copy into your project:
bash
git clone https://github.com/leo-lilinxiao/codex-autoresearch.git
cp -r codex-autoresearch your-project/.agents/skills/codex-autoresearch
Option B — Codex skill installer:
text
$skill-installer install https://github.com/leo-lilinxiao/codex-autoresearch
The skill lives at `.agents/skills/codex-autoresearch/` inside your project. No config file is required before first use.

How to Activate

Open Codex in your project directory and prefix your goal with `$codex-autoresearch`:
text
$codex-autoresearch
I want to get rid of all `any` types in my TypeScript code
Codex will:
  1. Scan the repo and infer scope, metric, verify command, and guard command.
  2. Present a confirmation summary. Reply `go` to start, or correct anything that is wrong.
  3. Run the loop unattended until you interrupt it or the goal is met.
You never write config. Codex infers everything.

Confirmation Flow

Before the loop starts Codex always shows what it found and asks you to confirm. Example exchange:
Codex: I found 47 `any` occurrences across src/**/*.ts.

       Confirmed:
       - Target: eliminate `any` types in src/**/*.ts
       - Metric: `any` count (current: 47), direction: lower
       - Verify: grep + tsc --noEmit as guard

       Need to confirm:
       - Run until all gone, or cap at N iterations?

       Reply "go" to start, or tell me what to change.

You:   Go, run overnight.

Codex: Starting — baseline: 47. Iterating until interrupted.
Up to five confirmation rounds are possible. After that, Codex proceeds.

The Loop (internals)

PHASE 0: Probe environment (CPU/GPU/RAM/toolchains), check for session resume
PHASE 1: Read context + lessons file from prior run (if any)

LOOP (forever or N times):
  1. Review current state, git history, results log, lessons
  2. Pick ONE hypothesis (apply perspectives, filter by environment)
     -- or N hypotheses if parallel mode is active
  3. Make ONE atomic change
  4. git commit (before verification)
  5. Run verify command  →  did the target metric improve?
     Run guard command   →  did anything else break?
  6. Improved → keep (extract lesson)
     Worse    → approved rollback strategy (git revert)
     Crashed  → fix or skip
  7. Log the result to results log
  8. Health check (disk, git, verify health)
  9. If 3+ discards → REFINE; 5+ → PIVOT; 2 PIVOTs → web search
 10. Repeat. Never stop. Never ask.
The loop runs unbounded unless you say `Iterations: N` during confirmation.
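The keep/revert core of the loop can be sketched in a few lines of Python. This is an illustrative sketch, not the skill's actual implementation: the four callbacks (`propose_change`, `measure`, `guard_ok`, `revert`) are hypothetical stand-ins for "make and commit one atomic change", "run the verify command", "run the guard command", and "roll the commit back". It also simplifies the real flow by running sequentially and reverting immediately on a guard failure instead of attempting a rework.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Outcome:
    iteration: int
    metric: float
    kept: bool


def run_loop(
    propose_change: Callable[[int], None],  # hypothetical: apply + commit one atomic change
    measure: Callable[[], float],           # hypothetical: run the verify command
    guard_ok: Callable[[], bool],           # hypothetical: run the guard command
    revert: Callable[[], None],             # hypothetical: e.g. a wrapper around `git revert`
    max_iterations: int,
    lower_is_better: bool = True,
) -> List[Outcome]:
    """Run a bounded modify -> verify -> keep/revert loop and return its log."""
    best = measure()  # baseline before any change is made
    log: List[Outcome] = []
    for i in range(1, max_iterations + 1):
        propose_change(i)
        value = measure()
        improved = value < best if lower_is_better else value > best
        if improved and guard_ok():
            best = value  # improvement stacks on top of earlier ones
            log.append(Outcome(i, value, True))
        else:
            revert()  # failed verify or guard: undo the change
            log.append(Outcome(i, value, False))
    return log
```

Passing the commands in as callbacks keeps the control flow testable without touching git or a real toolchain.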

Dual-Gate Verification

Two commands serve distinct purposes:
| Gate | Purpose | Fails means |
| --- | --- | --- |
| Verify | Did the target metric improve? | Change discarded, reverted |
| Guard | Did anything else break? | Change reworked (up to 2 attempts), then reverted |
Guard files are never modified by the loop.
Example verify + guard pair for a Python coverage run:
text
Verify: pytest --cov=src --cov-report=term 2>&1 | grep TOTAL | awk '{print $NF}' | tr -d '%'
Guard:  python -m mypy src --ignore-missing-imports
Example for TypeScript type cleanup:
text
Verify: grep -rwo "any" src --include="*.ts" | wc -l
Guard:  npx tsc --noEmit
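One way to read the table above is as a small decision function. The sketch below is an assumption about how the two gates combine, not code taken from the skill: `parse_metric` and `gate` are hypothetical names, and the returned strings (`"keep"`, `"rework"`, `"revert"`) are labels chosen for illustration.

```python
def parse_metric(output: str) -> float:
    """Turn verify-command output such as '47' or '85%' into a number."""
    return float(output.strip().rstrip("%"))


def gate(previous: float, current: float, direction: str,
         guard_passed: bool, rework_attempts: int = 0,
         max_rework: int = 2) -> str:
    """Combine the verify and guard gates into one decision.

    direction is 'lower' or 'higher', matching the Direction setting.
    """
    improved = current < previous if direction == "lower" else current > previous
    if not improved:
        return "revert"   # verify gate failed: discard the change
    if guard_passed:
        return "keep"     # both gates passed
    # Guard gate failed: allow a limited number of rework attempts first.
    return "rework" if rework_attempts < max_rework else "revert"
```

For example, `gate(47, 45, "lower", guard_passed=False)` asks for a rework, while the same call with `rework_attempts=2` gives up and reverts.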

Modes

Codex maps your sentence to one of seven modes automatically, so you rarely need to pick a mode explicitly.

`loop`: iterate toward a measurable target (default)

text
$codex-autoresearch
Improve test coverage in src/ to at least 80%
text
$codex-autoresearch
Reduce bundle size: it's currently 2.3 MB, get it under 1 MB

`plan`: turn a vague goal into a validated loop config

text
$codex-autoresearch
I want to make our API faster but I don't know where to start
Codex will interview you (p95 latency vs throughput? which endpoint?) and produce a ready-to-run loop config.

`fix`: repair errors until count reaches zero

text
$codex-autoresearch
pytest is failing, 12 tests broken after the refactor. Fix them all

`debug`: evidence-driven root-cause hunting

text
$codex-autoresearch
Our API returns 503 randomly under load, no idea why
Each iteration tests one falsifiable hypothesis. Codex presents evidence, not guesses.

`security`: read-only STRIDE + OWASP audit

text
$codex-autoresearch
Is this code secure?

`ship`: readiness verification and release gating

text
$codex-autoresearch
Ship it

`exec`: one-shot execution with no loop

text
$codex-autoresearch
Run the benchmark suite and summarize results

Inline Configuration (optional)

You can override defaults inline during the confirmation step — no file edits needed:
| Phrase | Effect |
| --- | --- |
| `Iterations: 20` | Cap the loop at 20 iterations |
| `Parallel: 3` | Test 3 hypotheses concurrently per round |
| `Guard: npm test` | Override the inferred guard command |
| `Verify: <command>` | Override the inferred verify command |
| `Scope: src/api/` | Restrict changes to a subdirectory |
Example during confirmation:
You:   Go. Iterations: 30, Guard: npm test, Scope: src/api/
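Codex interprets these phrases itself, so no parser is needed on your side. Still, if you want to record or validate a reply before sending it, the phrase format is regular enough to parse. The following is a sketch under that assumption; `parse_overrides` is a hypothetical helper, and it recognizes only the five phrases listed above.

```python
import re
from typing import Dict, Union

KEYS = ("Iterations", "Parallel", "Guard", "Verify", "Scope")
_KEY_ALT = "|".join(KEYS)
# A value runs until the next ", <Key>:" or the end of the reply.
_PATTERN = re.compile(
    rf"\b({_KEY_ALT}):\s*(.*?)(?=,\s*(?:{_KEY_ALT}):|$)", re.S
)


def parse_overrides(reply: str) -> Dict[str, Union[int, str]]:
    """Extract 'Key: value' overrides from a confirmation reply."""
    overrides: Dict[str, Union[int, str]] = {}
    for key, value in _PATTERN.findall(reply):
        value = value.strip()
        # Iterations and Parallel are counts; the rest are free-form text.
        overrides[key] = int(value) if key in ("Iterations", "Parallel") else value
    return overrides
```

Applied to the reply shown above, this yields `Iterations` as the integer 30, `Guard` as `npm test`, and `Scope` as `src/api/`.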

Cross-Run Learning

At the end of each iteration Codex writes a structured lesson to `.agents/skills/codex-autoresearch/lessons.md`:
Iteration 7 — KEPT
Hypothesis: replace explicit `any` with inferred generic in src/utils/mapper.ts
Change: added <T extends Record<string, unknown>> to mapKeys()
Result: any count 31 → 29
Lesson: Generic constraints on utility functions eliminate clusters of `any` downstream.
On session resume Codex reads this file first. Each new run benefits from prior runs.
To resume an interrupted run:
text
$codex-autoresearch
Resume
Codex re-reads the lessons file, checks git state, re-establishes the baseline, and continues.
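If you want to inspect the lessons file yourself, a small parser is enough. The sketch below assumes every entry follows the shape of the example above: a header line `Iteration <n> <separator> <STATUS>` followed by `Field: value` lines. The real file may contain other fields or statuses; `parse_lessons` is a hypothetical helper, not part of the skill.

```python
import re
from typing import Dict, List

# Header like "Iteration 7 <separator> KEPT"; any single non-space token
# is accepted as the separator.
HEADER = re.compile(r"^Iteration\s+(\d+)\s+\S+\s+(\w+)\s*$")


def parse_lessons(text: str) -> List[Dict[str, str]]:
    """Split a lessons file into one dict per iteration entry."""
    entries: List[Dict[str, str]] = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line.strip())
        if header:
            current = {"iteration": header.group(1), "status": header.group(2)}
            entries.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so values may contain colons.
            field, _, value = line.partition(":")
            current[field.strip().lower()] = value.strip()
    return entries
```

Each entry then exposes keys such as `hypothesis`, `change`, `result`, and `lesson`, which makes it easy to filter for kept iterations.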

Parallel Experiments

Request parallel mode during confirmation or at any time:
text
You:   Go, parallel 4
Codex runs four hypotheses concurrently, keeps the best result, discards the rest. Useful when hypothesis space is large.
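The "keep the best, discard the rest" step amounts to a min or max over the measured metrics. The sketch below illustrates that selection only. It assumes each hypothesis has already been measured in isolation, and `pick_winner` is a hypothetical name; how Codex isolates concurrent changes is not described in the source.

```python
from typing import Dict, Optional


def pick_winner(results: Dict[str, float], baseline: float,
                direction: str = "lower") -> Optional[str]:
    """Return the hypothesis with the best metric, or None if none beats the baseline."""
    if not results:
        return None
    chooser = min if direction == "lower" else max
    name = chooser(results, key=results.get)
    value = results[name]
    improved = value < baseline if direction == "lower" else value > baseline
    return name if improved else None
```

With a baseline of 31 and results `{"a": 30, "b": 28, "c": 31, "d": 29}`, hypothesis `b` wins and the other three are discarded.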

Pivot Protocol

If the loop stalls, escalation happens automatically:
| Consecutive discards | Action |
| --- | --- |
| 3 | REFINE: narrow hypothesis, try smaller atomic changes |
| 5 | PIVOT: change strategy entirely |
| 2 PIVOTs | Web search: Codex fetches external references to unstick itself |
You are never asked for permission during escalation. The loop continues.
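The thresholds can be expressed as a small function. The source does not say whether the discard counter resets after a PIVOT or exactly when the web search fires, so the sketch below makes an explicit assumption: the web search replaces what would have been a third PIVOT. The function name `escalation` and its return labels are hypothetical.

```python
def escalation(consecutive_discards: int, pivots_so_far: int) -> str:
    """Map the stall counters to the next escalation step.

    Assumption: after two PIVOTs, the next PIVOT-level stall triggers a
    web search instead of a third PIVOT.
    """
    if consecutive_discards >= 5:
        return "web_search" if pivots_so_far >= 2 else "pivot"
    if consecutive_discards >= 3:
        return "refine"
    return "continue"
```

Under this reading, `escalation(3, 0)` returns `"refine"`, `escalation(5, 1)` returns `"pivot"`, and `escalation(5, 2)` returns `"web_search"`.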

Real Code Examples

Example 1: TypeScript `any` elimination (Python verify script)

If you want a custom verify script instead of a one-liner:
python
# scripts/count_any.py
import subprocess, sys

# Count lines under src/ that contain `any` as a whole word in .ts files.
result = subprocess.run(
    ["grep", "-r", "--include=*.ts", r"\bany\b", "src/"],
    capture_output=True, text=True
)
count = len(result.stdout.strip().splitlines())
print(count)
sys.exit(0)  # always exit 0; the number is what matters

Tell Codex during confirmation:
text
Verify: python scripts/count_any.py
Guard:  npx tsc --noEmit

Example 2: pytest coverage loop (Python)

python
# scripts/coverage_pct.py
import subprocess, re, sys

# Use run() rather than check_output(): pytest exits non-zero when tests
# fail, and the script should still print a number in that case.
proc = subprocess.run(
    ["pytest", "--cov=src", "--cov-report=term", "-q"],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
)
match = re.search(r"TOTAL\s+\d+\s+\d+\s+(\d+)%", proc.stdout)
if match:
    print(int(match.group(1)))
    sys.exit(0)
print(0)  # no TOTAL line found; report 0% coverage
sys.exit(0)

text
$codex-autoresearch
Improve test coverage, target 85%

Verify: python scripts/coverage_pct.py
Guard:  python -m mypy src
Direction: higher
Target: 85
Iterations: 50

Example 3: bundle size loop (Node.js project)

bash
#!/usr/bin/env bash
# scripts/bundle_size.sh
npm run build --silent 2>/dev/null
du -k dist/bundle.js | awk '{print $1}'

text
$codex-autoresearch
Reduce our JS bundle size, currently ~2300 KB, target under 900 KB

Verify: bash scripts/bundle_size.sh
Guard:  npm test
Direction: lower
Target: 900

Example 4: lint warning count (any language)

bash
#!/usr/bin/env bash
# scripts/lint_count.sh
npx eslint src/ --format json 2>/dev/null \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print(sum(len(f['messages']) for f in d))"

text
$codex-autoresearch
Get our ESLint warning count to zero

Verify: bash scripts/lint_count.sh
Direction: lower
Target: 0

Unattended Runs

For overnight or long runs, ensure Codex CLI approval settings do not interrupt `git commit` or `git revert` commands. The simplest option is to run in a disposable or sandboxed repo clone:
bash
git clone . /tmp/autoresearch-sandbox
cd /tmp/autoresearch-sandbox
# launch Codex here with full permissions

Results accumulate in git history. Pull the winning commits back to your main repo when done:
bash
# in your main repo
git fetch /tmp/autoresearch-sandbox main
git cherry-pick <winning-commit-sha>

Session Artifacts

| File | Contents |
| --- | --- |
| `.agents/skills/codex-autoresearch/lessons.md` | Structured lessons from every iteration |
| `.agents/skills/codex-autoresearch/results.log` | Full per-iteration log (metric value, kept/reverted, elapsed) |
| `.agents/skills/codex-autoresearch/session.json` | Current session state for resume |
These files persist across Codex sessions. Delete them to start fresh.

Troubleshooting

Loop reverts every change:
  • Verify command may be returning a non-numeric value. Test it manually: `bash -c "<your verify command>"` should print a single number.
  • Metric direction may be wrong. Confirm `Direction: lower` or `Direction: higher` during setup.
Guard fires on unrelated files:
  • Narrow scope: `Scope: src/specific-module/`
  • Or tell Codex explicitly during confirmation: `Do not touch tests/`
Session resume picks up wrong baseline:
  • Delete `session.json` to force a fresh baseline: `rm .agents/skills/codex-autoresearch/session.json`
Parallel mode produces merge conflicts:
  • Codex handles this internally via the pivot protocol, but if it gets stuck, reduce parallelism: `Parallel: 2`
Codex asks questions mid-loop:
  • This means a guard crash produced ambiguous output. Pre-empt it by specifying `Guard: <command> || true` if guard failures should be non-fatal, or by giving Codex fuller sandbox permissions so it can run git commands freely.
Loop hits PIVOT but makes no progress:
  • Supply a seed hypothesis during confirmation: `Hint: try tree-shaking unused imports first`
  • Or run `plan` mode first to produce a richer hypothesis list before switching to `loop`.
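The first check above, that the verify command prints a single number, is easy to automate before starting a long run. The following is a minimal sketch and assumes `bash` is available on your PATH; `check_verify` is a hypothetical helper, not something the skill ships.

```python
import subprocess


def check_verify(command: str) -> float:
    """Run a verify command and confirm it prints exactly one number."""
    result = subprocess.run(
        ["bash", "-c", command], capture_output=True, text=True
    )
    tokens = result.stdout.split()
    if len(tokens) != 1:
        raise ValueError(
            f"expected a single number, got {len(tokens)} tokens: {result.stdout!r}"
        )
    # float() raises ValueError for output such as '85%' or 'TOTAL'.
    return float(tokens[0])
```

For instance, `check_verify("echo 42")` returns `42.0`, while a command that prints `85%` raises `ValueError`, which points to a missing step that strips the percent sign.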

Quick Reference

text
# Start a loop
$codex-autoresearch <your goal in one sentence>

# Resume interrupted run
$codex-autoresearch Resume

# Bounded run
$codex-autoresearch <goal>. Iterations: 25

# Parallel hypotheses
$codex-autoresearch <goal>. Parallel: 4

# Force a mode
$codex-autoresearch fix pytest has 8 failures, repair them

# Read-only audit
$codex-autoresearch security Audit src/api/ for injection vulnerabilities