recursive-benchmark
Use this skill to run a fair benchmark that compares the same coding agent with recursive-mode off and recursive-mode on.
The benchmark should use the same project requirements, the same model family, and the same success criteria for both arms. The recursive-on arm should additionally start from a bootstrapped recursive-mode scaffold, a run-local `00-requirements.md`, and a command-style prompt that explicitly tells the agent to use the bootstrapped recursive control-plane files as the recursive-mode skill before implementing the run end to end.
For fairness, the recursive-off arm should receive controller guidance only in the chat prompt, not as benchmark requirement, rubric, or prompt documents inside its repo or benchmark workspace.
Primary Use Case
Use when the user wants to:
- compare recursive-mode against a non-recursive baseline
- measure whether recursive-mode improves implementation quality, reliability, or completion rate
- run disposable benchmark repos in temp folders
- capture build/test/preview outcomes, timings, issues, and final scores
- capture and report screenshot artifacts produced during the benchmark
- generate a markdown benchmark report or dashboard
- choose among packaged easy, medium, hard, and xhard benchmark scenarios
- optionally run the paired arms in parallel when the runtime/provider can tolerate it; unstable runners should fall back to sequential execution and record that downgrade in the report
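The parallel-vs-sequential policy above can be sketched as follows; `run_arm` is a hypothetical callable that executes one benchmark arm by name, and the `execution_mode` field is an assumed way to record the downgrade in the report.

```python
from concurrent.futures import ThreadPoolExecutor

def run_arms(run_arm, arms, parallel_ok: bool) -> dict:
    """Run both benchmark arms, in parallel only when the runtime allows it."""
    results = {"execution_mode": "parallel" if parallel_ok else "sequential"}
    if parallel_ok:
        with ThreadPoolExecutor(max_workers=len(arms)) as pool:
            futures = {arm: pool.submit(run_arm, arm) for arm in arms}
            results["arms"] = {arm: f.result() for arm, f in futures.items()}
    else:
        # Unstable runners fall back to sequential execution; the recorded
        # execution_mode lets the report surface the downgrade.
        results["arms"] = {arm: run_arm(arm) for arm in arms}
    return results
```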
Benchmark Contract
For each benchmark run:
- Create paired disposable repos for `recursive-off` and `recursive-on`.
- Give both repos the same benchmark project requirements.
- Bootstrap recursive-mode only in the recursive-on repo and place the benchmark requirements in a run-local, recursive-compliant `00-requirements.md`.
- Prompt the recursive-on arm to read `/.recursive/RECURSIVE.md`, the bridge docs, and the run requirements before implementing the run.
- Record the runner, provider family, model string, and timeout budget.
- Execute the selected agent runtime non-interactively for both arms.
- Run a mandatory controller-side judge review for every completed arm, preferring `gpt-5.4` and falling back to a fresh instance of the benchmarked model when needed.
- Capture logs, durations, issues, screenshot artifacts, live progress artifacts, and evaluation outcomes, including whether the recursive-on arm produced the expected run artifacts through `08-memory-impact.md`, passed controller-side recursive run lint, and required an isolated product snapshot for Rust build/test/preview evaluation.
- Keep repo-local benchmark workspaces such as `.benchmark-workspaces/` ignored when the harness runs inside the packaged repo.
- Produce a final markdown report that compares the two arms side by side, including a combined benchmark score that blends heuristic rubric coverage with the mandatory judge metric.
- Surface whether recursive-on completed the recursive artifact set, whether it passed controller-side recursive lint, and whether it used an isolated worktree or stayed in the repo root.
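A minimal sketch of the paired-repo setup, assuming `tempfile` temp folders and the file names used in this document; the scaffold content is a placeholder, not the real control plane.

```python
import tempfile
from pathlib import Path

def create_paired_repos(requirements: str) -> dict:
    """Create disposable recursive-off / recursive-on repos in a temp folder."""
    root = Path(tempfile.mkdtemp(prefix="recursive-benchmark-"))
    arms = {}
    for arm in ("recursive-off", "recursive-on"):
        repo = root / arm
        repo.mkdir()
        if arm == "recursive-on":
            # Bootstrap the recursive-mode scaffold only in this arm and
            # place the requirements in a run-local 00-requirements.md.
            control_plane = repo / ".recursive"
            control_plane.mkdir()
            (control_plane / "RECURSIVE.md").write_text("# placeholder control plane\n")
            (repo / "00-requirements.md").write_text(requirements)
        arms[arm] = repo
    return arms
```

The recursive-off arm receives the same requirements through the chat prompt rather than a repo file, per the fairness rules.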
Packaged Scenario Tiers
- `local-first-planner` - easy
- `team-capacity-board` - medium
- `release-readiness-dashboard` - hard
- `scientific-calculator-rust` - xhard
All packaged scenarios should stay:
- browser-local state only
- no external database or server dependencies
- local browser preview should work from a temp folder
- output should be suitable for later screenshot validation
Current packaged stacks:
- React + TypeScript + Vite for easy/medium/hard
- Rust + WebAssembly with Trunk for xhard
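The tier-to-scenario mapping above can be represented as a small lookup table; this dictionary is an illustrative assumption, not the harness's actual data structure.

```python
# Assumed mapping from difficulty tier to packaged scenario and stack.
SCENARIOS = {
    "easy": {"name": "local-first-planner", "stack": "React + TypeScript + Vite"},
    "medium": {"name": "team-capacity-board", "stack": "React + TypeScript + Vite"},
    "hard": {"name": "release-readiness-dashboard", "stack": "React + TypeScript + Vite"},
    "xhard": {"name": "scientific-calculator-rust", "stack": "Rust + WebAssembly (Trunk)"},
}

def pick_scenario(tier: str) -> dict:
    """Resolve a tier name to its packaged scenario, failing loudly otherwise."""
    if tier not in SCENARIOS:
        raise ValueError(f"unknown tier {tier!r}; expected one of {sorted(SCENARIOS)}")
    return SCENARIOS[tier]
```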
Logging Requirements
The benchmark should preserve:
- raw agent stdout/stderr or JSON event logs when available
- per-phase timing data
- per-arm live progress files
- build/test/preview results
- screenshot paths and image embeds when screenshots exist
- timeout or failure reasons
- benchmark repo paths and report paths
- token or usage data only when the underlying CLI exposes it
Both benchmark arms should also ask the coding agent to maintain a simple in-repo benchmark activity log.
If the controller provides hints during the benchmark, the arm should record them in `benchmark/hints.md` so the report can apply any configured hint penalty.
Output
The benchmark should produce a final report that includes:
- benchmark scenario name
- provider/runtime and model
- recursive-off vs recursive-on comparison
- total duration and timeout status
- build/test/preview outcomes
- screenshot galleries for both arms when screenshots exist
- separated runner health vs product outcome
- heuristic score breakdown
- mandatory code-review judge metric and reviewer identity
- combined benchmark score that weights heuristic coverage and judge review together
- recursive-on worktree isolation status and recorded worktree location
- artifact paths for live progress inspection
- notable issues or gaps
- links or relative paths to logs and generated artifacts
- timestamp fallback evidence when agent logging is incomplete
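One plausible shape for the combined benchmark score, assuming equal default weighting of heuristic coverage and the judge metric plus a flat per-hint penalty; the weights and penalty value are assumptions, not values mandated by the packaged rubric.

```python
def combined_score(heuristic: float, judge: float, hints: int = 0,
                   hint_penalty: float = 0.05, heuristic_weight: float = 0.5) -> float:
    """Blend heuristic rubric coverage with the mandatory judge metric
    (both normalized to 0..1), then apply the configured hint penalty."""
    blended = heuristic_weight * heuristic + (1.0 - heuristic_weight) * judge
    return max(0.0, blended - hints * hint_penalty)
```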
Fairness Rules
- Keep the project spec identical between both arms.
- Do not silently give one arm different acceptance criteria.
- Record when a metric is unavailable instead of faking it.
- Keep the benchmark disposable; do not contaminate this reusable repo with run residue.
- Use the same timeout budget and scoring rubric for both arms.
- If one arm receives hints, record them and reflect the configured penalty in the final scoring.
Boundaries
- This skill is for benchmark setup, execution, and reporting.
- It does not replace the recursive-mode workflow spec itself.
- It should not use hidden benchmark-specific criteria that are absent from the packaged rubric.
- It should not require external services such as a database server.
References
- ./references/patterns.md
- ../../references/benchmarks/README.md
- ../../references/benchmarks/local-first-planner/README.md
- ../../references/benchmarks/local-first-planner/00-requirements.md
- ../../references/benchmarks/local-first-planner/scoring-rubric.md
- ../../scripts/run-recursive-benchmark.py