codex-autoresearch-loop

Codex Autoresearch

Skill by ara.so — Daily 2026 Skills collection.

Codex Autoresearch is a Codex skill that runs an autonomous modify→verify→keep/revert loop on your codebase. You describe a measurable goal in one sentence; Codex confirms the plan, then iterates unattended — every improvement stacks in git, every failure reverts automatically — until interrupted or a cap is reached. Inspired by Karpathy's autoresearch concept, generalized beyond ML training to any software metric.


Installation

Option A (manual copy into your project):

```bash
git clone https://github.com/leo-lilinxiao/codex-autoresearch.git
# Create the skills directory first so the copy does not fail on a fresh project.
mkdir -p your-project/.agents/skills
cp -r codex-autoresearch your-project/.agents/skills/codex-autoresearch
```

Option B (Codex skill installer):

```text
$skill-installer install https://github.com/leo-lilinxiao/codex-autoresearch
```

The skill lives at `.agents/skills/codex-autoresearch/` inside your project. No config file is required before first use.


How to Activate

Open Codex in your project directory and prefix your goal with `$codex-autoresearch`:

```text
$codex-autoresearch
I want to get rid of all `any` types in my TypeScript code
```

Codex will:

1. Scan the repo and infer scope, metric, verify command, and guard command.
2. Present a confirmation summary; reply `go` (or correct anything).
3. Run the loop unattended until you interrupt it or the goal is met.

You never write config. Codex infers everything.


Confirmation Flow


Before the loop starts Codex always shows what it found and asks you to confirm. Example exchange:

Codex: I found 47 `any` occurrences across src/**/*.ts.

       Confirmed:
       - Target: eliminate `any` types in src/**/*.ts
       - Metric: `any` count (current: 47), direction: lower
       - Verify: grep + tsc --noEmit as guard

       Need to confirm:
       - Run until all gone, or cap at N iterations?

       Reply "go" to start, or tell me what to change.

You:   Go, run overnight.

Codex: Starting — baseline: 47. Iterating until interrupted.

Up to five confirmation rounds are possible. After that, Codex proceeds.


The Loop (internals)


PHASE 0: Probe environment (CPU/GPU/RAM/toolchains), check for session resume
PHASE 1: Read context + lessons file from prior run (if any)

LOOP (forever or N times):
  1. Review current state, git history, results log, lessons
  2. Pick ONE hypothesis (apply perspectives, filter by environment)
     -- or N hypotheses if parallel mode is active
  3. Make ONE atomic change
  4. git commit (before verification)
  5. Run verify command  →  did the target metric improve?
     Run guard command   →  did anything else break?
  6. Improved → keep (extract lesson)
     Worse    → approved rollback strategy (git revert)
     Crashed  → fix or skip
  7. Log the result to results log
  8. Health check (disk, git, verify health)
  9. If 3+ discards → REFINE; 5+ → PIVOT; 2 PIVOTs → web search
 10. Repeat. Never stop. Never ask.

The loop runs unbounded unless you say `Iterations: N` during confirmation.
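The keep/revert decision at the heart of the loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the skill's implementation: `run_loop`, `propose`, and `measure` are hypothetical names, and an in-memory value stands in for the git commit and revert steps.

```python
from typing import Callable, List, Tuple


def run_loop(
    state: int,
    propose: Callable[[int, int], int],  # (state, iteration) -> candidate state
    measure: Callable[[int], float],     # stands in for the verify command
    iterations: int,
    lower_is_better: bool = True,
) -> Tuple[int, List[str]]:
    """Keep a candidate only if the metric strictly improves; otherwise revert."""
    best = measure(state)  # baseline
    log: List[str] = []
    for i in range(1, iterations + 1):
        candidate = propose(state, i)  # one atomic change
        value = measure(candidate)     # run verify
        improved = value < best if lower_is_better else value > best
        if improved:
            state, best = candidate, value          # keep
            log.append(f"{i}: kept ({value})")
        else:
            log.append(f"{i}: reverted ({value})")  # state stays unchanged
    return state, log


# Toy run: the "codebase" is just a count of `any` occurrences.
# Odd iterations remove two occurrences, even iterations add one.
final, log = run_loop(
    state=47,
    propose=lambda s, i: s - 2 if i % 2 else s + 1,
    measure=float,
    iterations=4,
)
print(final)
print(log)
```

In the toy run, the two regressions are reverted and the two improvements stack, which mirrors how kept changes accumulate in git history.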


Dual-Gate Verification

Two commands serve distinct purposes:

| Gate | Purpose | Failure means |
|---|---|---|
| Verify | Did the target metric improve? | Change discarded, reverted |
| Guard | Did anything else break? | Change reworked (up to 2 attempts), then reverted |

Guard files are never modified by the loop.

Example verify + guard pair for a Python coverage run (the trailing `tr` strips the `%` sign so the command prints a bare number):

```text
Verify: pytest --cov=src --cov-report=term 2>&1 | grep TOTAL | awk '{print $NF}' | tr -d '%'
Guard:  python -m mypy src --ignore-missing-imports
```

Example for TypeScript type cleanup (`-w` matches `any` as a whole word, so identifiers such as `many` are not counted):

```text
Verify: grep -rw "any" src --include="*.ts" | wc -l
Guard:  npx tsc --noEmit
```
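The table above can be read as a small decision function. The sketch below is only an illustration of that reading: `Outcome`, `decide`, and `MAX_GUARD_REWORKS` are hypothetical names, and checking verify before guard is an assumption based on the order in which the loop lists them.

```python
from enum import Enum


class Outcome(Enum):
    KEEP = "keep"
    REWORK = "rework"
    REVERT = "revert"


MAX_GUARD_REWORKS = 2  # from the table: reworked up to 2 attempts


def decide(metric_improved: bool, guard_passed: bool, reworks_used: int) -> Outcome:
    """Map the two gate results to an action for the current change."""
    if not metric_improved:
        return Outcome.REVERT  # verify failed: discard the change
    if guard_passed:
        return Outcome.KEEP    # both gates passed
    if reworks_used < MAX_GUARD_REWORKS:
        return Outcome.REWORK  # guard failed: try to repair the change
    return Outcome.REVERT      # rework budget exhausted


print(decide(True, True, 0))
print(decide(True, False, 1))
print(decide(True, False, 2))
```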


Modes


Codex maps your sentence to one of seven modes automatically — you never pick a mode explicitly.

`loop`: iterate toward a measurable target (default)

```text
$codex-autoresearch
Improve test coverage in src/ to at least 80%
```

```text
$codex-autoresearch
Reduce bundle size: it's currently 2.3 MB, get it under 1 MB
```

`plan`: turn a vague goal into a validated loop config

```text
$codex-autoresearch
I want to make our API faster but I don't know where to start
```

Codex will interview you (p95 latency vs throughput? which endpoint?) and produce a ready-to-run loop config.

`fix`: repair errors until count reaches zero

```text
$codex-autoresearch
pytest is failing, 12 tests broken after the refactor, fix them all
```

`debug`: evidence-driven root-cause hunting

```text
$codex-autoresearch
Our API returns 503 randomly under load, no idea why
```

Each iteration tests one falsifiable hypothesis. Codex presents evidence, not guesses.

`security`: read-only STRIDE + OWASP audit

```text
$codex-autoresearch
Is this code secure?
```

`ship`: readiness verification and release gating

```text
$codex-autoresearch
Ship it
```

`exec`: one-shot execution with no loop

```text
$codex-autoresearch
Run the benchmark suite and summarize results
```

Inline Configuration (optional)

You can override defaults inline during the confirmation step, with no file edits needed:

| Phrase | Effect |
|---|---|
| `Iterations: 20` | Cap the loop at 20 iterations |
| `Parallel: 3` | Test 3 hypotheses concurrently per round |
| `Guard: npm test` | Override the inferred guard command |
| `Verify: <command>` | Override the inferred verify command |
| `Scope: src/api/` | Restrict changes to a subdirectory |

Example during confirmation:

You:   Go. Iterations: 30, Guard: npm test, Scope: src/api/
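To make the phrase format concrete, here is a small Python sketch that pulls the `Key: value` pairs out of a confirmation reply. Codex interprets these phrases itself, so this parser is purely illustrative; `parse_overrides` and the regular expression are assumptions, not part of the skill.

```python
import re
from typing import Dict

# The override keys listed in the table above.
KEYS = ("Iterations", "Parallel", "Guard", "Verify", "Scope")
_KEYS = "|".join(KEYS)

# A value runs until the next ", Key:" or the end of the reply,
# so commands that contain commas or spaces stay intact.
_PATTERN = re.compile(rf"\b({_KEYS}):\s*(.*?)(?=,\s*(?:{_KEYS}):|$)", re.DOTALL)


def parse_overrides(reply: str) -> Dict[str, str]:
    """Return the inline overrides found in a confirmation reply."""
    return {key: value.strip() for key, value in _PATTERN.findall(reply)}


overrides = parse_overrides("Go. Iterations: 30, Guard: npm test, Scope: src/api/")
print(overrides)
```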


Cross-Run Learning

At the end of each iteration Codex writes a structured lesson to `.agents/skills/codex-autoresearch/lessons.md`:

Iteration 7 — KEPT
Hypothesis: replace explicit `any` with inferred generic in src/utils/mapper.ts
Change: added <T extends Record<string, unknown>> to mapKeys()
Result: any count 31 → 29
Lesson: Generic constraints on utility functions eliminate clusters of `any` downstream.

On session resume Codex reads this file first. Each new run benefits from prior runs.

To resume an interrupted run:

```text
$codex-autoresearch
Resume
```

Codex re-reads the lessons file, checks git state, re-establishes the baseline, and continues.
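As a rough picture of how a lessons file like this could be appended to and read back, here is a Python sketch that follows the field layout of the sample entry. The helper names (`format_lesson`, `append_lesson`, `read_lessons`) are hypothetical, and the real file may carry more structure than the sample shows.

```python
import tempfile
from pathlib import Path
from typing import List

DASH = "\u2014"  # separator used in the sample entry's header line


def format_lesson(iteration: int, status: str, hypothesis: str,
                  change: str, result: str, lesson: str) -> str:
    """Render one entry using the fields from the sample above."""
    return "\n".join([
        f"Iteration {iteration} {DASH} {status}",
        f"Hypothesis: {hypothesis}",
        f"Change: {change}",
        f"Result: {result}",
        f"Lesson: {lesson}",
        "",
    ])


def append_lesson(path: Path, entry: str) -> None:
    """Append an entry, creating the file and its directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(entry + "\n")  # blank line between entries


def read_lessons(path: Path) -> List[str]:
    """Return the text of every 'Lesson:' line, oldest first."""
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line[len("Lesson:"):].strip() for line in lines if line.startswith("Lesson:")]


# Demo against a throwaway directory rather than a real project.
with tempfile.TemporaryDirectory() as tmp:
    lessons_file = Path(tmp) / "lessons.md"
    append_lesson(lessons_file, format_lesson(
        7, "KEPT",
        "replace explicit `any` with inferred generic in src/utils/mapper.ts",
        "added <T extends Record<string, unknown>> to mapKeys()",
        "any count 31 -> 29",
        "Generic constraints on utility functions eliminate clusters of `any` downstream.",
    ))
    lessons = read_lessons(lessons_file)
print(lessons)
```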


Parallel Experiments

Request parallel mode during confirmation or at any time:

```text
You:   Go, parallel 4
```

Codex runs four hypotheses concurrently, keeps the best result, and discards the rest. This is useful when the hypothesis space is large.
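The "keep the best, discard the rest" step can be sketched as a selection over the metric each hypothesis produced. This is an illustrative sketch only: `pick_winner` is a hypothetical name, and treating a round where nothing beats the baseline as "keep nothing" is an assumption.

```python
from typing import Dict, Optional


def pick_winner(results: Dict[str, float], baseline: float,
                lower_is_better: bool = True) -> Optional[str]:
    """Return the best hypothesis name, or None if none beats the baseline."""
    if not results:
        return None
    chooser = min if lower_is_better else max
    best = chooser(results, key=results.get)
    value = results[best]
    improved = value < baseline if lower_is_better else value > baseline
    return best if improved else None


# Four hypotheses measured against a baseline of 47 `any` occurrences.
round_results = {"generics": 44.0, "unknown-cast": 46.0, "type-guards": 47.0, "zod": 49.0}
print(pick_winner(round_results, baseline=47.0))
```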


Pivot Protocol

If the loop stalls, escalation happens automatically:

| Trigger | Action |
|---|---|
| 3 consecutive discards | REFINE: narrow the hypothesis, try smaller atomic changes |
| 5 consecutive discards | PIVOT: change strategy entirely |
| 2 PIVOTs | Web search: Codex fetches external references to unstick itself |

You are never asked for permission during escalation. The loop continues.
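The escalation thresholds can be written as a small lookup. The sketch below is a hypothetical reading of the table: `escalation` and its return labels are illustrative names, and the source does not say how or when the counters reset, so that is left to the caller.

```python
def escalation(consecutive_discards: int, pivots_so_far: int) -> str:
    """Map the stall counters to the action named in the table."""
    if pivots_so_far >= 2:
        return "WEB_SEARCH"  # two PIVOTs have not helped
    if consecutive_discards >= 5:
        return "PIVOT"       # change strategy entirely
    if consecutive_discards >= 3:
        return "REFINE"      # narrow the hypothesis
    return "CONTINUE"        # no escalation needed


for discards, pivots in [(2, 0), (3, 0), (5, 0), (1, 2)]:
    print(discards, pivots, escalation(discards, pivots))
```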


Real Code Examples

Example 1: TypeScript `any` elimination (Python verify script)

If you want a custom verify script instead of a one-liner:

```python
# scripts/count_any.py
import subprocess
import sys

# Count the lines in .ts files under src/ that contain `any` as a whole word.
result = subprocess.run(
    ["grep", "-r", "--include=*.ts", r"\bany\b", "src/"],
    capture_output=True,
    text=True,
)
count = len(result.stdout.strip().splitlines())
print(count)
sys.exit(0)  # always exit 0; the number is what matters
```

Tell Codex during confirmation:

```text
Verify: python scripts/count_any.py
Guard:  npx tsc --noEmit
```

Example 2: pytest coverage loop (Python)

```python
# scripts/coverage_pct.py
import re
import subprocess
import sys

# Use run() rather than check_output(): pytest exits non-zero when tests fail,
# and the script should still print a number in that case.
proc = subprocess.run(
    ["pytest", "--cov=src", "--cov-report=term", "-q"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)
match = re.search(r"TOTAL\s+\d+\s+\d+\s+(\d+)%", proc.stdout)
# Print the total coverage percentage, or 0 if no TOTAL line was found.
print(int(match.group(1)) if match else 0)
sys.exit(0)
```

```text
$codex-autoresearch
Improve test coverage, target 85%

Verify: python scripts/coverage_pct.py
Guard:  python -m mypy src
Direction: higher
Target: 85
Iterations: 50
```

Example 3: bundle size loop (Node.js project)

```bash
#!/usr/bin/env bash
# scripts/bundle_size.sh
# Build quietly, then print the bundle size in KB.
npm run build --silent 2>/dev/null
du -k dist/bundle.js | awk '{print $1}'
```

```text
$codex-autoresearch
Reduce our JS bundle size, currently ~2300 KB, target under 900 KB

Verify: bash scripts/bundle_size.sh
Guard:  npm test
Direction: lower
Target: 900
```

Example 4: lint warning count (any language)

```bash
#!/usr/bin/env bash
# scripts/lint_count.sh
# Sum the ESLint messages across all files in the JSON report.
npx eslint src/ --format json 2>/dev/null \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print(sum(len(f['messages']) for f in d))"
```

```text
$codex-autoresearch
Get our ESLint warning count to zero

Verify: bash scripts/lint_count.sh
Direction: lower
Target: 0
```

Unattended Runs

For overnight or long runs, ensure Codex CLI approval settings do not interrupt `git commit` or `git revert` commands. The simplest option is to run in a disposable or sandboxed repo clone:

```bash
git clone . /tmp/autoresearch-sandbox
cd /tmp/autoresearch-sandbox
# launch Codex here with full permissions
```

Results accumulate in git history. Pull the winning commits back to your main repo when done:

```bash
# in your main repo
git fetch /tmp/autoresearch-sandbox main
git cherry-pick <winning-commit-sha>
```

Session Artifacts

| File | Contents |
|---|---|
| `.agents/skills/codex-autoresearch/lessons.md` | Structured lessons from every iteration |
| `.agents/skills/codex-autoresearch/results.log` | Full per-iteration log (metric value, kept/reverted, elapsed) |
| `.agents/skills/codex-autoresearch/session.json` | Current session state for resume |

These files persist across Codex sessions. Delete them to start fresh.

Troubleshooting

Loop reverts every change:

- The verify command may be returning a non-numeric value. Test it manually: `bash -c "<your verify command>"` should print a single number.
- The metric direction may be wrong. Confirm `Direction: lower` or `Direction: higher` during setup.

Guard fires on unrelated files:

- Narrow the scope: `Scope: src/specific-module/`
- Or tell Codex explicitly during confirmation: `Do not touch tests/`

Session resume picks up wrong baseline:

- Delete `session.json` to force a fresh baseline: `rm .agents/skills/codex-autoresearch/session.json`

Parallel mode produces merge conflicts:

- Codex handles this internally via the pivot protocol, but if it gets stuck, reduce parallelism: `Parallel: 2`

Codex asks questions mid-loop:

- This means a guard crash produced ambiguous output. Pre-empt it by specifying `Guard: <command> || true` if guard failures should be non-fatal, or by giving Codex fuller sandbox permissions so it can run git commands freely.

Loop hits PIVOT but makes no progress:

- Supply a seed hypothesis during confirmation: `Hint: try tree-shaking unused imports first`
- Or run `plan` mode first to produce a richer hypothesis list before switching to `loop`.
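For the first troubleshooting item, a quick way to see whether verify output qualifies as "a single number" is to try parsing it. The Python sketch below is a hypothetical check, not something the skill ships; `parse_metric` is an illustrative name, and it accepts whatever Python's `float()` accepts.

```python
def parse_metric(output: str) -> float:
    """Parse verify output, raising ValueError unless float() accepts it whole."""
    text = output.strip()
    try:
        return float(text)
    except ValueError:
        raise ValueError(f"verify output is not a single number: {text!r}") from None


# A bare count parses; a percent sign or a full report line does not.
for sample in ("47\n", "85%", "TOTAL 120 18 85%", ""):
    try:
        print(repr(sample), "->", parse_metric(sample))
    except ValueError as err:
        print(repr(sample), "-> rejected:", err)
```

Output such as `85%` is rejected here, which is why a verify pipeline that prints a coverage percentage needs to strip the `%` sign first.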


Quick Reference

```text
# Start a loop
$codex-autoresearch <your goal in one sentence>

# Resume interrupted run
$codex-autoresearch Resume

# Bounded run
$codex-autoresearch <goal>, Iterations: 25

# Parallel hypotheses
$codex-autoresearch <goal>, Parallel: 4

# Force a mode
$codex-autoresearch fix pytest has 8 failures, repair them

# Read-only audit
$codex-autoresearch security Audit src/api/ for injection vulnerabilities
```
