# Terminal-Bench Integration
This directory contains the mux agent adapter for Terminal-Bench 2.0, using Harbor as the evaluation harness.
## Quick Start
When a user asks to run tbench, generally assume they mean in CI via `workflow_dispatch`.

```bash
# Run full benchmark suite
make benchmark-terminal

# Run specific tasks
make benchmark-terminal TB_TASK_NAMES="hello-world chess-best-move"

# Run with a specific model
make benchmark-terminal TB_ARGS="--agent-kwarg model_name=anthropic/claude-opus-4-5"

# Run on Daytona cloud (high parallelism)
TB_ENV=daytona TB_CONCURRENCY=48 make benchmark-terminal
```
## Daytona Cloud Sandboxes
For faster benchmarks, use Daytona cloud sandboxes instead of local Docker:

```bash
# Set API key (get from https://app.daytona.io)
export DAYTONA_API_KEY="your-api-key"

# Run with 48 concurrent cloud sandboxes (~6x faster than local)
make benchmark-terminal TB_ENV=daytona TB_CONCURRENCY=48

# Run specific tasks on Daytona
make benchmark-terminal TB_ENV=daytona TB_CONCURRENCY=48 TB_TASK_NAMES="chess-best-move stockfish-elo"
```
**Account limits (Tier 3):** Pool of 250 vCPU / 500GB RAM. Most tasks require 1 vCPU / 2GB RAM, with a few needing up to 4 vCPU / 8GB RAM. Harbor automatically requests the correct per-task resources.

**Speed comparison:**

| Environment | Concurrency | Full suite time |
|---------------|-------------|-----------------|
| Local Docker | 4 | ~90 min |
| Daytona Cloud | 48 | ~10-15 min |
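The table above can be sanity-checked with a back-of-envelope estimate. This is a sketch under stated assumptions: the 60-task suite size is hypothetical, and the ~6-minute mean task duration is taken from the timeout analysis below.

```python
import math

def est_wall_clock_minutes(n_tasks: int, mean_task_minutes: float, concurrency: int) -> float:
    """Crude lower bound: perfect batching, no scheduling or image-pull overhead."""
    return math.ceil(n_tasks / concurrency) * mean_task_minutes

print(est_wall_clock_minutes(60, 6, 4))   # local Docker -> 90
print(est_wall_clock_minutes(60, 6, 48))  # Daytona      -> 12
```

Real runs add container startup and retry overhead, so treat the bound as a rough guide rather than a prediction.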
## Configuration
### Environment Variables
- `TB_DATASET`: Dataset to use (default: `terminal-bench@2.0`)
- `TB_CONCURRENCY`: Number of concurrent tasks (default: 4)
- `TB_TIMEOUT`: Global timeout in seconds (default: 1800 = 30 minutes)
- `TB_ENV`: Environment to run in (`local` or `daytona`)
- `TB_TASK_NAMES`: Space-separated task names to run (default: all tasks)
- `TB_ARGS`: Additional arguments passed to harbor
- `MUX_RUN_ARGS`: CLI flags passed directly to `mux run` inside the container (e.g., `--thinking high --use-1m --budget 5.00`). This is the primary mechanism for all `mux run` flags; it avoids per-flag plumbing.
### Timeout Handling
The benchmark uses a global timeout applied to all tasks. The default is 30 minutes (1800 seconds), which provides sufficient time for most tasks while catching genuinely stuck agents.

**Design Rationale:**

Based on analysis of Oct 30, 2025 nightly runs:

- Longest successful task: `blind-maze-explorer-algorithm.hard` at 20 minutes
- 95th percentile: ~15 minutes
- Mean duration: ~6 minutes

The 30-minute default provides comfortable headroom for complex tasks without excessive wait times for failed attempts.
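The way a default like this falls out of run statistics can be sketched as follows. The durations are hypothetical stand-ins, not the actual Oct 30 data:

```python
from statistics import quantiles

# Hypothetical per-task durations (minutes) from one nightly run
durations = [2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20]

p95 = quantiles(durations, n=100, method="inclusive")[94]
suggested_timeout = 2 * p95  # ~2x headroom over the 95th percentile
print(p95, suggested_timeout)  # -> 15.0 30.0
```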
**Override timeout:**

```bash
# Run with 60 minute timeout for very complex tasks
TB_TIMEOUT=3600 make benchmark-terminal

# Run with shorter 10 minute timeout for quick iteration
TB_TIMEOUT=600 make benchmark-terminal TB_SAMPLE_SIZE=5
```

**Note:** We prefer global timeout defaults over per-task configuration to avoid complexity and maintenance burden. If you find tasks consistently timing out, increase `TB_TIMEOUT` rather than adding per-task configuration.
### Agent Configuration
The agent adapter accepts a few Harbor kwargs (passed via `--agent-kwarg`):

- `model_name`: Model to use (e.g., `anthropic/claude-sonnet-4-5`, `openai/gpt-5-codex`)
- `experiments`: Experiments to enable, comma-separated (e.g., `programmatic-tool-calling`)

All other CLI flags (thinking level, mode, runtime, budget, etc.) are passed via `MUX_RUN_ARGS`; no per-flag plumbing needed.

**CI dispatch (primary method):**

```bash
# Run with model, thinking, and 1M context
gh workflow run terminal-bench.yml \
  -f model_name=anthropic/claude-opus-4-6 \
  -f mux_run_args="--thinking xhigh --use-1m"

# Run with budget cap
gh workflow run terminal-bench.yml \
  -f model_name=anthropic/claude-opus-4-6 \
  -f mux_run_args="--thinking high --budget 5.00"
```
**Local runs:**

```bash
# Pass flags via MUX_RUN_ARGS env var
MUX_RUN_ARGS="--thinking high --use-1m" make benchmark-terminal

# Model and experiments via TB_ARGS
make benchmark-terminal TB_ARGS="--agent-kwarg model_name=openai/gpt-5-codex --agent-kwarg experiments=programmatic-tool-calling"
```
## Results
Results are saved to `runs/YYYY-MM-DD__HH-MM-SS/`:

- `results.json`: Aggregate results with pass/fail rates
- `run_metadata.json`: Run configuration and metadata
- `<task-id>/`: Per-task directories containing:
  - `sessions/agent.log`: Full agent execution log
  - `sessions/agent.cast`: Asciinema recording of agent session
  - `sessions/tests.log`: Test execution output
  - `results.json`: Per-trial results
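A quick way to summarize a run locally. This is a minimal sketch that assumes the aggregate `results.json` holds a list of per-task entries with a boolean `passed` field; the actual Harbor schema may differ:

```python
import json
from pathlib import Path

def pass_rate(results: dict) -> float:
    """Fraction of entries marked passed; assumes a {'results': [{'passed': bool, ...}]} shape."""
    trials = results.get("results", [])
    if not trials:
        return 0.0
    return sum(1 for t in trials if t.get("passed")) / len(trials)

def summarize(run_dir: str) -> float:
    return pass_rate(json.loads(Path(run_dir, "results.json").read_text()))

# Illustrative data:
print(pass_rate({"results": [{"task_id": "a", "passed": True},
                             {"task_id": "b", "passed": False}]}))  # -> 0.5
```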
## CI/CD Integration

### Querying Results from BigQuery
Mux Terminal-Bench results are uploaded to BigQuery after CI runs. Query via the `bq` CLI after authenticating with `gcloud auth login` and setting the project to `mux-benchmarks`.

**Table:** `mux-benchmarks.benchmarks.tbench_results`

**Schema:** `run_id` (STRING), `task_id` (STRING), `model_name` (STRING), `thinking_level` (STRING: off/low/medium/high), `mode` (STRING: plan/exec), `dataset` (STRING), `experiments` (STRING), `passed` (BOOL), `score` (FLOAT), `n_input_tokens` (INT), `n_output_tokens` (INT), `github_run_id` (INT), `github_sha` (STRING), `ingested_at` (TIMESTAMP).

See `.github/workflows/terminal-bench.yml` and `.github/workflows/nightly-terminal-bench.yml` for GitHub Actions integration.

The nightly workflow runs both Claude and GPT models on the full task suite, uploading results as artifacts.
## Leaderboard Submission

To submit mux results to the Terminal-Bench 2.0 leaderboard:
### Step 1: Prepare Submission
The leaderboard computes pass@k from multiple attempts per task. Provide multiple runs so each becomes its own job folder inside the submission.
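For reference, the standard unbiased pass@k estimator (from the HumanEval paper) looks like this; whether the leaderboard uses exactly this formula is an assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n attempts per task, c of which passed."""
    if n - c < k:
        return 1.0  # every size-k sample contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(5, 2, 1))  # -> 0.4
```

This is why multiple runs matter: more attempts per task tighten the estimate.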
```bash
# Download latest 5 successful nightly runs (recommended for submission)
python3 benchmarks/terminal_bench/prepare_leaderboard_submission.py --n-runs 5

# Use specific run IDs (each becomes a separate job folder)
python3 benchmarks/terminal_bench/prepare_leaderboard_submission.py --run-id 111 222 333 444 555

# Use multiple existing artifact directories
python3 benchmarks/terminal_bench/prepare_leaderboard_submission.py --artifacts-dir ./run1 ./run2

# Download latest single run (quick iteration)
python3 benchmarks/terminal_bench/prepare_leaderboard_submission.py

# Only prepare specific models
python3 benchmarks/terminal_bench/prepare_leaderboard_submission.py --n-runs 5 --models anthropic/claude-opus-4-5
```
This creates a properly structured submission folder at `leaderboard_submission/` containing:

```
submissions/terminal-bench/2.0/Mux__<model>/
  metadata.yaml       # Agent and model info
  <job-folder-1>/     # Results from run 1
    config.json
    result.json
    <trial-1>/
      config.json
      result.json
      agent/
      verifier/
    ...
  <job-folder-2>/     # Results from run 2
  ...
```
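Before uploading, a quick local sanity check of the layout can save a round-trip. A sketch: `check_submission` is hypothetical, and the leaderboard bot's real validation rules may be stricter:

```python
from pathlib import Path

def check_submission(root: str) -> list[str]:
    """Report obvious gaps in a Mux__<model> submission folder."""
    problems = []
    root_path = Path(root)
    if not (root_path / "metadata.yaml").is_file():
        problems.append("missing metadata.yaml")
    jobs = [d for d in root_path.iterdir() if d.is_dir()]
    if not jobs:
        problems.append("no job folders")
    for job in jobs:
        for required in ("config.json", "result.json"):
            if not (job / required).is_file():
                problems.append(f"{job.name}: missing {required}")
    return problems
```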
### Step 2: Submit via HuggingFace Python API

The `hf upload` CLI tends to time out on large submissions due to LFS file handling. Use the Python API with an extended timeout instead:
undefinedInstall huggingface_hub (via uv or pip)
Install huggingface_hub (via uv or pip)
pip install huggingface_hub
pip install huggingface_hub
Authenticate (one-time setup)
Authenticate (one-time setup)
hf auth login
```python
import httpx
from huggingface_hub import HfApi
from huggingface_hub.utils import configure_http_backend
configure_http_backend(
backend_factory=lambda: httpx.Client(timeout=httpx.Timeout(300.0, connect=60.0))
)
api = HfApi()
api.upload_folder(
repo_id="alexgshaw/terminal-bench-2-leaderboard",
folder_path="./leaderboard_submission/submissions",
path_in_repo="submissions",
repo_type="dataset",
create_pr=True,
commit_message="Add Mux + <Model> submission",
commit_description="- Agent: Mux (Coder)\n- Model: <model>\n- <N> tasks × <K> attempts",
)The PR will be automatically validated by the leaderboard bot. Once merged, results appear on the leaderboard.
**Tips from past submissions:**

- The prepare script already strips `*.log` files (they trigger HF LFS and cause timeouts)
- `--artifacts-dir` accepts raw job folders directly (e.g., an extracted tarball root)
- To update an existing PR, pass `revision="refs/pr/<N>"` instead of `create_pr=True`
- To remove stale files from a PR, use `api.delete_folder(..., revision="refs/pr/<N>")`
## Files
- `mux_agent.py`: Main agent adapter implementing Harbor's `BaseInstalledAgent` interface
- `mux-run.sh`: Shell script that sets up environment and invokes mux CLI
- `mux_payload.py`: Helper to package mux app for containerized execution
- `mux_setup.sh.j2`: Jinja2 template for agent installation script
- `prepare_leaderboard_submission.py`: Script to prepare results for leaderboard submission
- `analyze_failure_rates.py`: Analyze failure rates to find optimization opportunities
- `download_run_logs.py`: Download and inspect raw agent logs from nightly runs
## Comparative Failure Analysis Workflow
When investigating why Mux fails on a task more than other agents, consider this workflow:
### 1. Identify High-Priority Failures

```bash
# Find tasks where Mux underperforms (high M/O ratio = Mux fails more than others)
python benchmarks/terminal_bench/analyze_failure_rates.py --top 20
```
### 2. Check BigQuery for Failure Patterns

```bash
# Authenticate and set project
gcloud auth login && gcloud config set project mux-benchmarks
```
```bash
# Query pass/fail by model for specific task (strip __hash suffix mentally)
bq query --use_legacy_sql=false '
SELECT model_name, passed, COUNT(*) as runs
FROM `mux-benchmarks.benchmarks.tbench_results`
WHERE REGEXP_REPLACE(task_id, r"__[a-zA-Z0-9]+$", "") = "TASK_NAME_HERE"
  AND github_workflow = "Nightly Terminal-Bench"
  AND passed IS NOT NULL
GROUP BY model_name, passed
ORDER BY model_name, passed
'
```
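The same suffix-stripping that the query does with `REGEXP_REPLACE` can be done in Python when post-processing exported rows; `base_task_id` is a hypothetical helper name:

```python
import re

def base_task_id(task_id: str) -> str:
    """Drop the trailing __<hash> seen on task_id values in the results table."""
    return re.sub(r"__[a-zA-Z0-9]+$", "", task_id)

print(base_task_id("chess-best-move__a1B2c3"))  # -> chess-best-move
```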
### 3. Download and Inspect Agent Logs
```bash
# List recent nightly runs
python benchmarks/terminal_bench/download_run_logs.py --list-runs

# Download latest run and filter to failing task
python benchmarks/terminal_bench/download_run_logs.py --task TASK_NAME --failures-only

# Download specific run, filter to specific model
python benchmarks/terminal_bench/download_run_logs.py --run-id 21230456195 --model opus --task TASK_NAME

# Verbose mode shows stderr from agent execution
python benchmarks/terminal_bench/download_run_logs.py --task TASK_NAME -v
```

Logs are cached in `.run_logs/<run-id>/`. Inspect:

- `agent/command-0/stdout.txt` — Full agent output (JSONL stream)
- `agent/command-0/stderr.txt` — Errors during execution
- `result.json` — Trial result with `verifier_result` and `exception_info`
### 4. Compare with Leaderboard Submissions
```bash
# Clone leaderboard repo from HuggingFace (cached in .leaderboard_cache/)
cd benchmarks/terminal_bench
git clone https://huggingface.co/datasets/alexgshaw/terminal-bench-2-leaderboard .leaderboard_cache/terminal-bench-2-leaderboard 2>/dev/null

# Find passing submissions for the task
find .leaderboard_cache -path "*TASK_NAME*" -name "result.json" -exec sh -c '
  agent=$(echo "$1" | cut -d/ -f5)
  reward=$(python3 -c "import json,sys; print(json.load(sys.stdin).get(\"verifier_result\",{}).get(\"rewards\",{}).get(\"reward\",0))" < "$1")
  echo "$agent: reward=$reward"
' _ {} \;
```
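The `find` one-liner above can also be written in Python, which is easier to tweak. A sketch: `submission_rewards` and the `agent_depth` parameter are hypothetical; pick the path component that names the agent in your checkout:

```python
import json
from pathlib import Path

def submission_rewards(cache_root: str, task_name: str, agent_depth: int = 4):
    """Yield (agent, reward) for each result.json whose path mentions task_name."""
    root = Path(cache_root)
    for p in root.rglob("result.json"):
        if task_name not in str(p):
            continue
        data = json.loads(p.read_text())
        reward = data.get("verifier_result", {}).get("rewards", {}).get("reward", 0)
        rel = p.relative_to(root)
        agent = rel.parts[agent_depth] if len(rel.parts) > agent_depth else str(rel)
        yield agent, reward
```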
## Analyzing Failure Rates
To identify where Mux underperforms relative to other top agents, use the analysis script:

```bash
# Run analysis (requires bq CLI for Mux results, git for leaderboard data)
python benchmarks/terminal_bench/analyze_failure_rates.py

# Show more results
python benchmarks/terminal_bench/analyze_failure_rates.py --top 50

# Filter to specific Mux model
python benchmarks/terminal_bench/analyze_failure_rates.py --mux-model sonnet

# Force refresh of cached data
python benchmarks/terminal_bench/analyze_failure_rates.py --refresh

# Output as JSON for further processing
python benchmarks/terminal_bench/analyze_failure_rates.py --json > opportunities.json
```
The script computes the **M/O ratio** for each task:

```
M/O ratio = Mux failure rate / Average failure rate of top 10 agents
```

Tasks with a **high M/O ratio** are where Mux underperforms relative to competitors; these represent the best optimization opportunities.
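In code, the ratio is simply the following. This is a sketch; the actual script may smooth the denominator, which would explain sample outputs like 9.09 for a 100%/10% split rather than exactly 10.0:

```python
def mo_ratio(mux_fail_rate: float, avg_other_fail_rate: float, eps: float = 1e-9) -> float:
    """Mux failure rate relative to the top-10 average; eps guards division by zero."""
    return mux_fail_rate / max(avg_other_fail_rate, eps)

print(mo_ratio(0.8, 0.2))  # -> 4.0
```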
**Example output:**

```
================================================================================
OPTIMIZATION OPPORTUNITIES (sorted by M/O ratio)

Task ID                Mux Fail%   Avg Other%   M/O Ratio   Agent
some-difficult-task    100.0%      10.0%        9.09        Mux__Claude-Sonnet-4.5
another-task           80.0%       20.0%        3.64        Mux__Claude-Sonnet-4.5
...

================================================================================
SUMMARY
Total tasks with Mux failures: 42
High priority (M/O > 2.0): 12
Medium priority (1.0 < M/O ≤ 2.0): 8
```