recursive-benchmark

Use this skill to run a fair benchmark that compares the same coding agent with recursive-mode off and recursive-mode on.
The benchmark should use the same project requirements, the same model family, and the same success criteria for both arms. The recursive-on arm should additionally start from a bootstrapped recursive-mode scaffold, a run-local 00-requirements.md, and a command-style prompt that explicitly tells the agent to use the bootstrapped recursive control-plane files as the recursive-mode skill before implementing the run end to end.
For fairness, the recursive-off arm should receive controller guidance only in the chat prompt, never as requirement, rubric, or prompt documents inside its repo or benchmark workspace.

Primary Use Case

Use recursive-benchmark when the user wants to:
  • compare recursive-mode against a non-recursive baseline
  • measure whether recursive-mode improves implementation quality, reliability, or completion rate
  • run disposable benchmark repos in temp folders
  • capture build/test/preview outcomes, timings, issues, and final scores
  • capture and report screenshot artifacts produced during the benchmark
  • generate a markdown benchmark report or dashboard
  • choose among packaged easy, medium, hard, and xhard benchmark scenarios
  • optionally run the paired arms in parallel when the runtime/provider can tolerate it; unstable runners should fall back to sequential execution and record that downgrade in the report
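The parallel-with-fallback behavior described in the last bullet could be sketched as below; the arm names match the benchmark, but the `run_arm` stub and the overall structure are illustrative assumptions, not the packaged harness API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_arm(arm: str) -> dict:
    # Placeholder for invoking the agent runtime for one arm;
    # the real harness would shell out to the selected CLI.
    return {"arm": arm, "status": "ok"}

def run_paired_arms(parallel_ok: bool) -> dict:
    """Run both arms in parallel when the runtime/provider tolerates it,
    otherwise fall back to sequential execution and record the downgrade."""
    arms = ["recursive-off", "recursive-on"]
    downgraded = False
    if parallel_ok:
        try:
            with ThreadPoolExecutor(max_workers=2) as pool:
                results = list(pool.map(run_arm, arms))
        except Exception:
            # Unstable runner: degrade to sequential execution.
            downgraded = True
            results = [run_arm(a) for a in arms]
    else:
        downgraded = True
        results = [run_arm(a) for a in arms]
    # The final report should surface any fallback to sequential execution.
    return {"results": results, "sequential_fallback": downgraded}
```

The `sequential_fallback` flag is what the report would record as the downgrade evidence.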

Benchmark Contract

For each benchmark run:
  1. Create paired disposable repos for recursive-off and recursive-on.
  2. Give both repos the same benchmark project requirements.
  3. Bootstrap recursive-mode only in the recursive-on repo and place the benchmark requirements in a run-local, recursive-compliant 00-requirements.md.
  4. Prompt the recursive-on arm to read /.recursive/RECURSIVE.md, the bridge docs, and the run requirements before implementing the run.
  5. Record the runner, provider family, model string, and timeout budget.
  6. Execute the selected agent runtime non-interactively for both arms.
  7. Run a mandatory controller-side judge review for every completed arm, preferring gpt-5.4 and falling back to a fresh instance of the benchmarked model when needed.
  8. Capture logs, durations, issues, screenshot artifacts, live progress artifacts, and evaluation outcomes, including whether the recursive-on arm produced the expected run artifacts through 08-memory-impact.md, passed the controller-side recursive run lint, and required an isolated product snapshot for Rust build/test/preview evaluation.
  9. Keep repo-local benchmark workspaces such as .benchmark-workspaces/ ignored when the harness runs inside the packaged repo.
  10. Produce a final markdown report that compares the two arms side by side, including a combined benchmark score that blends heuristic rubric coverage with the mandatory judge metric.
  11. Surface whether recursive-on completed the recursive artifact set, whether it passed controller-side recursive lint, and whether it used an isolated worktree or stayed in the repo root.
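Contract steps 1 through 3 could be set up along these lines; the directory layout, file contents, and helper name are assumptions for illustration, not the packaged harness implementation.

```python
import tempfile
from pathlib import Path

REQUIREMENTS = "# Benchmark project requirements\n"

def create_paired_repos() -> dict:
    """Create paired disposable repos and bootstrap recursive-mode
    only in the recursive-on arm (contract steps 1-3)."""
    root = Path(tempfile.mkdtemp(prefix="recursive-benchmark-"))
    repos = {}
    for arm in ("recursive-off", "recursive-on"):
        repo = root / arm
        repo.mkdir(parents=True)
        # Step 2: identical project requirements for both arms.
        (repo / "REQUIREMENTS.md").write_text(REQUIREMENTS)
        if arm == "recursive-on":
            # Step 3: bootstrap the recursive-mode scaffold and the
            # run-local, recursive-compliant 00-requirements.md.
            scaffold = repo / ".recursive"
            scaffold.mkdir()
            (scaffold / "RECURSIVE.md").write_text("# recursive-mode control plane\n")
            (repo / "00-requirements.md").write_text(REQUIREMENTS)
        repos[arm] = repo
    return repos
```

Because both arms read the same `REQUIREMENTS` text, the spec stays identical; only the recursive-on repo gains the control-plane scaffold.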

Packaged Scenario Tiers

  • local-first-planner - easy
  • team-capacity-board - medium
  • release-readiness-dashboard - hard
  • scientific-calculator-rust - xhard
All packaged scenarios should remain:
  • browser-local state only
  • free of external database or server dependencies
  • previewable in a local browser from a temp folder
  • suitable for later screenshot validation
Current packaged stacks:
  • React + TypeScript + Vite for easy/medium/hard
  • Rust + WebAssembly with Trunk for xhard
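The tier-to-scenario mapping above can be expressed as a simple lookup; the dictionary structure is a sketch, not the packaged manifest format.

```python
# Packaged scenario tiers and their current stacks, as listed above.
SCENARIOS = {
    "easy":   {"name": "local-first-planner",         "stack": "React + TypeScript + Vite"},
    "medium": {"name": "team-capacity-board",         "stack": "React + TypeScript + Vite"},
    "hard":   {"name": "release-readiness-dashboard", "stack": "React + TypeScript + Vite"},
    "xhard":  {"name": "scientific-calculator-rust",  "stack": "Rust + WebAssembly with Trunk"},
}

def pick_scenario(tier: str) -> dict:
    """Resolve a requested difficulty tier to its packaged scenario."""
    if tier not in SCENARIOS:
        raise ValueError(f"unknown tier: {tier!r}; choose from {sorted(SCENARIOS)}")
    return SCENARIOS[tier]
```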

Logging Requirements

The benchmark should preserve:
  • raw agent stdout/stderr or JSON event logs when available
  • per-phase timing data
  • per-arm live progress files
  • build/test/preview results
  • screenshot paths and image embeds when screenshots exist
  • timeout or failure reasons
  • benchmark repo paths and report paths
  • token or usage data only when the underlying CLI exposes it
Both benchmark arms should also ask the coding agent to maintain a simple in-repo benchmark activity log. If the controller provides hints during the benchmark, the arm should record them in benchmark/hints.md so the report can apply any configured hint penalty.

Output

The benchmark should produce a final report that includes:
  • benchmark scenario name
  • provider/runtime and model
  • recursive-off vs recursive-on comparison
  • total duration and timeout status
  • build/test/preview outcomes
  • screenshot galleries for both arms when screenshots exist
  • runner health reported separately from product outcome
  • heuristic score breakdown
  • mandatory code-review judge metric and reviewer identity
  • combined benchmark score that weights heuristic coverage and judge review together
  • recursive-on worktree isolation status and recorded worktree location
  • artifact paths for live progress inspection
  • notable issues or gaps
  • links or relative paths to logs and generated artifacts
  • timestamp fallback evidence when agent logging is incomplete
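The combined benchmark score that weights heuristic coverage and judge review together might be computed as a weighted average; the 60/40 split below is illustrative, not the packaged rubric's actual weighting.

```python
def combined_score(heuristic_coverage: float, judge_score: float,
                   heuristic_weight: float = 0.6) -> float:
    """Blend heuristic rubric coverage (0-1) with the mandatory
    judge metric (0-1) into a single 0-100 benchmark score."""
    for v in (heuristic_coverage, judge_score):
        if not 0.0 <= v <= 1.0:
            raise ValueError("scores must be in [0, 1]")
    blended = heuristic_weight * heuristic_coverage + (1 - heuristic_weight) * judge_score
    return round(100 * blended, 1)
```

Using the same weights for both arms keeps the comparison fair, per the fairness rules below.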

Fairness Rules

  • Keep the project spec identical between both arms.
  • Do not silently give one arm different acceptance criteria.
  • Record when a metric is unavailable instead of faking it.
  • Keep the benchmark disposable; do not contaminate this reusable repo with run residue.
  • Use the same timeout budget and scoring rubric for both arms.
  • If one arm receives hints, record them and reflect the configured penalty in the final scoring.

Boundaries

  • This skill is for benchmark setup, execution, and reporting.
  • It does not replace the recursive-mode workflow spec itself.
  • It should not use hidden benchmark-specific criteria that are absent from the packaged rubric.
  • It should not require external services such as a database server.

References

  • ./references/patterns.md
  • ../../references/benchmarks/README.md
  • ../../references/benchmarks/local-first-planner/README.md
  • ../../references/benchmarks/local-first-planner/00-requirements.md
  • ../../references/benchmarks/local-first-planner/scoring-rubric.md
  • ../../scripts/run-recursive-benchmark.py