recursive-benchmark

Use this skill to run a fair benchmark that compares the same coding agent with recursive-mode off and recursive-mode on.
The benchmark should use the same project requirements, the same model family, and the same success criteria for both arms. The recursive-on arm should additionally start from a bootstrapped recursive-mode scaffold, a run-local 00-requirements.md, and a command-style prompt that explicitly tells the agent to use the bootstrapped recursive control-plane files as the recursive-mode skill before implementing the run end to end.
For fairness, the recursive-off arm should receive controller guidance only in the chat prompt, never as requirement, rubric, or prompt documents inside its repo or benchmark workspace.

Primary Use Case

Use recursive-benchmark when the user wants to:
  • compare recursive-mode against a non-recursive baseline
  • measure whether recursive-mode improves implementation quality, reliability, or completion rate
  • run disposable benchmark repos in temp folders
  • capture build/test/preview outcomes, timings, issues, and final scores
  • capture and report screenshot artifacts produced during the benchmark
  • generate a markdown benchmark report or dashboard
  • choose among packaged easy, medium, hard, and xhard benchmark scenarios
  • optionally run the paired arms in parallel when the runtime/provider can tolerate it; unstable runners should fall back to sequential execution and record that downgrade in the report
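The parallel-with-fallback behavior described in the last bullet could be sketched as below; the arm names match the benchmark, but the `run_arm` stub and the overall structure are illustrative assumptions, not the packaged harness API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_arm(arm: str) -> dict:
    # Placeholder for invoking the agent runtime for one arm;
    # the real harness would shell out to the selected CLI.
    return {"arm": arm, "status": "ok"}

def run_paired_arms(parallel_ok: bool) -> dict:
    """Run both arms in parallel when the runtime/provider tolerates it,
    otherwise fall back to sequential execution and record the downgrade."""
    arms = ["recursive-off", "recursive-on"]
    downgraded = False
    if parallel_ok:
        try:
            with ThreadPoolExecutor(max_workers=2) as pool:
                results = list(pool.map(run_arm, arms))
        except Exception:
            # Unstable runner: degrade to sequential execution.
            downgraded = True
            results = [run_arm(a) for a in arms]
    else:
        downgraded = True
        results = [run_arm(a) for a in arms]
    # The final report should surface any fallback to sequential execution.
    return {"results": results, "sequential_fallback": downgraded}
```

The `sequential_fallback` flag is what the report would record as the downgrade evidence.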

Benchmark Contract

For each benchmark run:
  1. Create paired disposable repos for recursive-off and recursive-on.
  2. Give both repos the same benchmark project requirements.
  3. Bootstrap recursive-mode only in the recursive-on repo and place the benchmark requirements in a run-local, recursive-compliant 00-requirements.md.
  4. Prompt the recursive-on arm to read /.recursive/RECURSIVE.md, the bridge docs, and the run requirements before implementing the run.
  5. Record the runner, provider family, model string, and timeout budget.
  6. Execute the selected agent runtime non-interactively for both arms.
  7. Run a mandatory controller-side judge review for every completed arm, preferring gpt-5.4 and falling back to a fresh instance of the benchmarked model when needed.
  8. Capture logs, durations, issues, screenshot artifacts, live progress artifacts, and evaluation outcomes, including whether the recursive-on arm produced the expected run artifacts through 08-memory-impact.md, passed the controller-side recursive run lint, and required an isolated product snapshot for Rust build/test/preview evaluation.
  9. Keep repo-local benchmark workspaces such as .benchmark-workspaces/ ignored when the harness runs inside the packaged repo.
  10. Produce a final markdown report that compares the two arms side by side, including a combined benchmark score that blends heuristic rubric coverage with the mandatory judge metric.
  11. Surface whether recursive-on completed the recursive artifact set, whether it passed controller-side recursive lint, and whether it used an isolated worktree or stayed in the repo root.
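Contract steps 1 through 3 could be set up along these lines; the directory layout, file contents, and helper name are assumptions for illustration, not the packaged harness implementation.

```python
import tempfile
from pathlib import Path

REQUIREMENTS = "# Benchmark project requirements\n"

def create_paired_repos() -> dict:
    """Create paired disposable repos and bootstrap recursive-mode
    only in the recursive-on arm (contract steps 1-3)."""
    root = Path(tempfile.mkdtemp(prefix="recursive-benchmark-"))
    repos = {}
    for arm in ("recursive-off", "recursive-on"):
        repo = root / arm
        repo.mkdir(parents=True)
        # Step 2: identical project requirements for both arms.
        (repo / "REQUIREMENTS.md").write_text(REQUIREMENTS)
        if arm == "recursive-on":
            # Step 3: bootstrap the recursive-mode scaffold and the
            # run-local, recursive-compliant 00-requirements.md.
            scaffold = repo / ".recursive"
            scaffold.mkdir()
            (scaffold / "RECURSIVE.md").write_text("# recursive-mode control plane\n")
            (repo / "00-requirements.md").write_text(REQUIREMENTS)
        repos[arm] = repo
    return repos
```

Because both arms read the same `REQUIREMENTS` text, the spec stays identical; only the recursive-on repo gains the control-plane scaffold.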

Packaged Scenario Tiers

  • local-first-planner - easy
  • team-capacity-board - medium
  • release-readiness-dashboard - hard
  • scientific-calculator-rust - xhard
All packaged scenarios should remain:
  • browser-local state only
  • free of external database or server dependencies
  • previewable in a local browser from a temp folder
  • suitable for later screenshot validation
Current packaged stacks:
  • React + TypeScript + Vite for easy/medium/hard
  • Rust + WebAssembly with Trunk for xhard
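The tier-to-scenario mapping above can be expressed as a simple lookup; the dictionary structure is a sketch, not the packaged manifest format.

```python
# Packaged scenario tiers and their current stacks, as listed above.
SCENARIOS = {
    "easy":   {"name": "local-first-planner",         "stack": "React + TypeScript + Vite"},
    "medium": {"name": "team-capacity-board",         "stack": "React + TypeScript + Vite"},
    "hard":   {"name": "release-readiness-dashboard", "stack": "React + TypeScript + Vite"},
    "xhard":  {"name": "scientific-calculator-rust",  "stack": "Rust + WebAssembly with Trunk"},
}

def pick_scenario(tier: str) -> dict:
    """Resolve a requested difficulty tier to its packaged scenario."""
    if tier not in SCENARIOS:
        raise ValueError(f"unknown tier: {tier!r}; choose from {sorted(SCENARIOS)}")
    return SCENARIOS[tier]
```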

Logging Requirements

The benchmark should preserve:
  • raw agent stdout/stderr or JSON event logs when available
  • per-phase timing data
  • per-arm live progress files
  • build/test/preview results
  • screenshot paths and image embeds when screenshots exist
  • timeout or failure reasons
  • benchmark repo paths and report paths
  • token or usage data only when the underlying CLI exposes it
Both benchmark arms should also ask the coding agent to maintain a simple in-repo benchmark activity log. If the controller provides hints during the benchmark, the arm should record them in benchmark/hints.md so the report can apply any configured hint penalty.

Output

The benchmark should produce a final report that includes:
  • benchmark scenario name
  • provider/runtime and model
  • recursive-off vs recursive-on comparison
  • total duration and timeout status
  • build/test/preview outcomes
  • screenshot galleries for both arms when screenshots exist
  • runner health reported separately from product outcome
  • heuristic score breakdown
  • mandatory code-review judge metric and reviewer identity
  • combined benchmark score that weights heuristic coverage and judge review together
  • recursive-on worktree isolation status and recorded worktree location
  • artifact paths for live progress inspection
  • notable issues or gaps
  • links or relative paths to logs and generated artifacts
  • timestamp fallback evidence when agent logging is incomplete
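The combined benchmark score that weights heuristic coverage and judge review together might be computed as a weighted average; the 60/40 split below is illustrative, not the packaged rubric's actual weighting.

```python
def combined_score(heuristic_coverage: float, judge_score: float,
                   heuristic_weight: float = 0.6) -> float:
    """Blend heuristic rubric coverage (0-1) with the mandatory
    judge metric (0-1) into a single 0-100 benchmark score."""
    for v in (heuristic_coverage, judge_score):
        if not 0.0 <= v <= 1.0:
            raise ValueError("scores must be in [0, 1]")
    blended = heuristic_weight * heuristic_coverage + (1 - heuristic_weight) * judge_score
    return round(100 * blended, 1)
```

Using the same weights for both arms keeps the comparison fair, per the fairness rules below.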

Fairness Rules

  • Keep the project spec identical between both arms.
  • Do not silently give one arm different acceptance criteria.
  • Record when a metric is unavailable instead of faking it.
  • Keep the benchmark disposable; do not contaminate this reusable repo with run residue.
  • Use the same timeout budget and scoring rubric for both arms.
  • If one arm receives hints, record them and reflect the configured penalty in the final scoring.

Boundaries

  • This skill is for benchmark setup, execution, and reporting.
  • It does not replace the recursive-mode workflow spec itself.
  • It should not use hidden benchmark-specific criteria that are absent from the packaged rubric.
  • It should not require external services such as a database server.

References

  • ./references/patterns.md
  • ../../references/benchmarks/README.md
  • ../../references/benchmarks/local-first-planner/README.md
  • ../../references/benchmarks/local-first-planner/00-requirements.md
  • ../../references/benchmarks/local-first-planner/scoring-rubric.md
  • ../../scripts/run-recursive-benchmark.py