
# Terminal-Bench Integration

This directory contains the mux agent adapter for Terminal-Bench 2.0, using Harbor as the evaluation harness.

## Quick Start

When a user asks to run tbench, assume they generally mean running it in CI via `workflow_dispatch`.

```bash
# Run full benchmark suite
make benchmark-terminal

# Run specific tasks
make benchmark-terminal TB_TASK_NAMES="hello-world chess-best-move"

# Run with specific model
make benchmark-terminal TB_ARGS="--agent-kwarg model_name=anthropic/claude-opus-4-5"

# Run on Daytona cloud (high parallelism)
TB_ENV=daytona TB_CONCURRENCY=48 make benchmark-terminal
```

## Daytona Cloud Sandboxes

For faster benchmarks, use Daytona cloud sandboxes instead of local Docker:

```bash
# Set API key (get from https://app.daytona.io)
export DAYTONA_API_KEY="your-api-key"

# Run with 48 concurrent cloud sandboxes (~6x faster than local)
make benchmark-terminal TB_ENV=daytona TB_CONCURRENCY=48

# Run specific tasks on Daytona
make benchmark-terminal TB_ENV=daytona TB_CONCURRENCY=48 TB_TASK_NAMES="chess-best-move stockfish-elo"
```

**Account limits (Tier 3):** Pool of 250 vCPU / 500GB RAM. Most tasks require 1 vCPU / 2GB RAM, with a few needing up to 4 vCPU / 8GB RAM. Harbor automatically requests the correct per-task resources.

**Speed comparison:**
| Environment | Concurrency | Full suite time |
|-------------|-------------|-----------------|
| Local Docker | 4 | ~90 min |
| Daytona Cloud | 48 | ~10-15 min |

## Configuration

### Environment Variables

- `TB_DATASET`: Dataset to use (default: `terminal-bench@2.0`)
- `TB_CONCURRENCY`: Number of concurrent tasks (default: 4)
- `TB_TIMEOUT`: Global timeout in seconds (default: 1800 = 30 minutes)
- `TB_ENV`: Environment to run in (`local` or `daytona`)
- `TB_TASK_NAMES`: Space-separated task names to run (default: all tasks)
- `TB_ARGS`: Additional arguments passed to Harbor
- `MUX_RUN_ARGS`: CLI flags passed directly to `mux run` inside the container (e.g., `--thinking high --use-1m --budget 5.00`). This is the primary mechanism for all `mux run` flags and avoids per-flag plumbing.

### Timeout Handling

The benchmark applies a single global timeout to all tasks. The default is 30 minutes (1800 seconds), which provides sufficient time for most tasks while catching genuinely stuck agents.

**Design rationale:** based on analysis of the Oct 30, 2025 nightly runs:

- Longest successful task: `blind-maze-explorer-algorithm.hard` at 20 minutes
- 95th percentile: ~15 minutes
- Mean duration: ~6 minutes

The 30-minute default provides comfortable headroom for complex tasks without excessive wait times for failed attempts.

**Override timeout:**

```bash
# Run with 60 minute timeout for very complex tasks
TB_TIMEOUT=3600 make benchmark-terminal

# Run with shorter 10 minute timeout for quick iteration
TB_TIMEOUT=600 make benchmark-terminal TB_SAMPLE_SIZE=5
```

**Note:** We prefer global timeout defaults over per-task configuration to avoid complexity and maintenance burden. If you find tasks consistently timing out, increase `TB_TIMEOUT` rather than adding per-task configuration.

### Agent Configuration

The agent adapter accepts a few Harbor kwargs (passed via `--agent-kwarg`):

- `model_name`: Model to use (e.g., `anthropic/claude-sonnet-4-5`, `openai/gpt-5-codex`)
- `experiments`: Experiments to enable, comma-separated (e.g., `programmatic-tool-calling`)

All other `mux run` CLI flags (thinking level, mode, runtime, budget, etc.) are passed via `MUX_RUN_ARGS`, so no per-flag plumbing is needed.

**CI dispatch (primary method):**

```bash
# Run with model, thinking, and 1M context
gh workflow run terminal-bench.yml \
  -f model_name=anthropic/claude-opus-4-6 \
  -f mux_run_args="--thinking xhigh --use-1m"

# Run with budget cap
gh workflow run terminal-bench.yml \
  -f model_name=anthropic/claude-opus-4-6 \
  -f mux_run_args="--thinking high --budget 5.00"
```

**Local runs:**

```bash
# Pass flags via MUX_RUN_ARGS env var
MUX_RUN_ARGS="--thinking high --use-1m" make benchmark-terminal

# Model and experiments via TB_ARGS
make benchmark-terminal TB_ARGS="--agent-kwarg model_name=openai/gpt-5-codex --agent-kwarg experiments=programmatic-tool-calling"
```

## Results

Results are saved to `runs/YYYY-MM-DD__HH-MM-SS/`:

- `results.json`: Aggregate results with pass/fail rates
- `run_metadata.json`: Run configuration and metadata
- `<task-id>/`: Per-task directories containing:
  - `sessions/agent.log`: Full agent execution log
  - `sessions/agent.cast`: Asciinema recording of the agent session
  - `sessions/tests.log`: Test execution output
  - `results.json`: Per-trial results

## CI/CD Integration

### Querying Results from BigQuery

Mux Terminal-Bench results are uploaded to BigQuery after CI runs. Query via the `bq` CLI after authenticating with `gcloud auth login` and setting the project to `mux-benchmarks`.

**Table:** `mux-benchmarks.benchmarks.tbench_results`

**Schema:** `run_id` (STRING), `task_id` (STRING), `model_name` (STRING), `thinking_level` (STRING: off/low/medium/high), `mode` (STRING: plan/exec), `dataset` (STRING), `experiments` (STRING), `passed` (BOOL), `score` (FLOAT), `n_input_tokens` (INT), `n_output_tokens` (INT), `github_run_id` (INT), `github_sha` (STRING), `ingested_at` (TIMESTAMP).

See `.github/workflows/terminal-bench.yml` and `.github/workflows/nightly-terminal-bench.yml` for GitHub Actions integration. The nightly workflow runs both Claude and GPT models on the full task suite, uploading results as artifacts.
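For example, a per-model pass-rate rollup over this table can be written against the schema above (a sketch; it uses only the columns listed in the schema):

```sql
-- Per-model pass rate across all ingested trials
SELECT
  model_name,
  COUNT(*) AS trials,
  COUNTIF(passed) / COUNT(*) AS pass_rate
FROM `mux-benchmarks.benchmarks.tbench_results`
WHERE passed IS NOT NULL
GROUP BY model_name
ORDER BY pass_rate DESC
```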

## Leaderboard Submission

To submit mux results to the Terminal-Bench 2.0 leaderboard:

### Step 1: Prepare Submission

The leaderboard computes pass@k from multiple attempts per task. Provide multiple runs so each becomes its own job folder inside the submission.

```bash
# Download latest 5 successful nightly runs (recommended for submission)
python3 benchmarks/terminal_bench/prepare_leaderboard_submission.py --n-runs 5

# Use specific run IDs (each becomes a separate job folder)
python3 benchmarks/terminal_bench/prepare_leaderboard_submission.py --run-id 111 222 333 444 555

# Use multiple existing artifact directories
python3 benchmarks/terminal_bench/prepare_leaderboard_submission.py --artifacts-dir ./run1 ./run2

# Download latest single run (quick iteration)
python3 benchmarks/terminal_bench/prepare_leaderboard_submission.py

# Only prepare specific models
python3 benchmarks/terminal_bench/prepare_leaderboard_submission.py --n-runs 5 --models anthropic/claude-opus-4-5
```

This creates a properly structured submission folder at `leaderboard_submission/` containing:

```
submissions/terminal-bench/2.0/Mux__<model>/
  metadata.yaml       # Agent and model info
  <job-folder-1>/     # Results from run 1
    config.json
    result.json
    <trial-1>/
      config.json
      result.json
      agent/
      verifier/
    ...
  <job-folder-2>/     # Results from run 2
  ...
```
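For reference, pass@k over n attempts of which c passed is typically computed with the standard unbiased estimator; a minimal sketch of the usual formula (not the leaderboard's own code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts, of which c passed."""
    if n - c < k:
        # Too few failures: every size-k subset of attempts contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With five job folders per task, n = 5, so estimates up to pass@5 are possible.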

### Step 2: Submit via HuggingFace Python API

The `hf upload` CLI tends to time out on large submissions due to LFS file handling. Use the Python API with an extended timeout instead:

```bash
# Install huggingface_hub (via uv or pip)
pip install huggingface_hub

# Authenticate (one-time setup)
hf auth login
```

```python
import httpx
from huggingface_hub import HfApi
from huggingface_hub.utils import configure_http_backend

# Use a generous HTTP timeout so large folder uploads don't fail midway
configure_http_backend(
    backend_factory=lambda: httpx.Client(timeout=httpx.Timeout(300.0, connect=60.0))
)

api = HfApi()
api.upload_folder(
    repo_id="alexgshaw/terminal-bench-2-leaderboard",
    folder_path="./leaderboard_submission/submissions",
    path_in_repo="submissions",
    repo_type="dataset",
    create_pr=True,
    commit_message="Add Mux + <Model> submission",
    commit_description="- Agent: Mux (Coder)\n- Model: <model>\n- <N> tasks × <K> attempts",
)
```
The PR will be automatically validated by the leaderboard bot. Once merged, results appear on the leaderboard.

**Tips from past submissions:**

- The prepare script already strips `*.log` files (they trigger HF LFS and cause timeouts)
- `--artifacts-dir` accepts raw job folders directly (e.g., an extracted tarball root)
- To update an existing PR, pass `revision="refs/pr/<N>"` instead of `create_pr=True`
- To remove stale files from a PR, use `api.delete_folder(..., revision="refs/pr/<N>")`

## Files

- `mux_agent.py`: Main agent adapter implementing Harbor's `BaseInstalledAgent` interface
- `mux-run.sh`: Shell script that sets up the environment and invokes the mux CLI
- `mux_payload.py`: Helper to package the mux app for containerized execution
- `mux_setup.sh.j2`: Jinja2 template for the agent installation script
- `prepare_leaderboard_submission.py`: Script to prepare results for leaderboard submission
- `analyze_failure_rates.py`: Analyze failure rates to find optimization opportunities
- `download_run_logs.py`: Download and inspect raw agent logs from nightly runs

## Comparative Failure Analysis Workflow

When investigating why Mux fails on a task more often than other agents, use this workflow:

### 1. Identify High-Priority Failures

```bash
# Find tasks where Mux underperforms (high M/O ratio = Mux fails more than others)
python benchmarks/terminal_bench/analyze_failure_rates.py --top 20
```

### 2. Check BigQuery for Failure Patterns

```bash
# Authenticate and set project
gcloud auth login && gcloud config set project mux-benchmarks

# Query pass/fail by model for specific task (strip __hash suffix mentally)
bq query --use_legacy_sql=false '
  SELECT model_name, passed, COUNT(*) AS runs
  FROM `mux-benchmarks.benchmarks.tbench_results`
  WHERE REGEXP_REPLACE(task_id, r"__[a-zA-Z0-9]+$", "") = "TASK_NAME_HERE"
    AND github_workflow = "Nightly Terminal-Bench"
    AND passed IS NOT NULL
  GROUP BY model_name, passed
  ORDER BY model_name, passed'
```

### 3. Download and Inspect Agent Logs

```bash
# List recent nightly runs
python benchmarks/terminal_bench/download_run_logs.py --list-runs

# Download latest run and filter to failing task
python benchmarks/terminal_bench/download_run_logs.py --task TASK_NAME --failures-only

# Download specific run, filter to specific model
python benchmarks/terminal_bench/download_run_logs.py --run-id 21230456195 --model opus --task TASK_NAME

# Verbose mode shows stderr from agent execution
python benchmarks/terminal_bench/download_run_logs.py --task TASK_NAME -v
```

Logs are cached in `.run_logs/<run-id>/`. Inspect:

- `agent/command-0/stdout.txt` — Full agent output (JSONL stream)
- `agent/command-0/stderr.txt` — Errors during execution
- `result.json` — Trial result with `verifier_result` and `exception_info`
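When triaging many downloaded trials at once, the fields above can drive a tiny classifier; a sketch that assumes only the keys named in this README (`exception_info`, and the `verifier_result.rewards.reward` path used in the leaderboard comparison step):

```python
def classify_trial(result: dict) -> str:
    """Triage one trial's result.json into error, pass, or fail.

    Assumes harness errors appear under exception_info and the reward lives
    at verifier_result.rewards.reward (key paths as used in this README).
    """
    if result.get("exception_info"):
        return "error"
    reward = ((result.get("verifier_result") or {}).get("rewards") or {}).get("reward", 0)
    return "pass" if reward else "fail"
```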

### 4. Compare with Leaderboard Submissions

```bash
# Clone leaderboard repo from HuggingFace (cached in .leaderboard_cache/)
cd benchmarks/terminal_bench
git clone https://huggingface.co/datasets/alexgshaw/terminal-bench-2-leaderboard .leaderboard_cache/terminal-bench-2-leaderboard 2>/dev/null

# Find passing submissions for the task
find .leaderboard_cache -path "*TASK_NAME*" -name "result.json" -exec sh -c '
  agent=$(echo "$1" | cut -d/ -f5)
  reward=$(python3 -c "import json,sys; print(json.load(sys.stdin).get(\"verifier_result\",{}).get(\"rewards\",{}).get(\"reward\",0))" < "$1")
  echo "$agent: reward=$reward"
' _ {} \;
```

## Analyzing Failure Rates

To identify where Mux underperforms relative to other top agents, use the analysis script:

```bash
# Run analysis (requires bq CLI for Mux results, git for leaderboard data)
python benchmarks/terminal_bench/analyze_failure_rates.py

# Show more results
python benchmarks/terminal_bench/analyze_failure_rates.py --top 50

# Filter to specific Mux model
python benchmarks/terminal_bench/analyze_failure_rates.py --mux-model sonnet

# Force refresh of cached data
python benchmarks/terminal_bench/analyze_failure_rates.py --refresh

# Output as JSON for further processing
python benchmarks/terminal_bench/analyze_failure_rates.py --json > opportunities.json
```

The script computes the **M/O ratio** for each task:

```
M/O ratio = Mux failure rate / Average failure rate of top 10 agents
```

Tasks with a **high M/O ratio** are where Mux underperforms relative to competitors; these represent the best optimization opportunities.

Example output:

```
================================================================================
OPTIMIZATION OPPORTUNITIES (sorted by M/O ratio)

Task ID               Mux Fail%   Avg Other%   M/O Ratio   Agent
some-difficult-task   100.0%      10.0%        9.09        Mux__Claude-Sonnet-4.5
another-task          80.0%       20.0%        3.64        Mux__Claude-Sonnet-4.5
...

================================================================================
SUMMARY

Total tasks with Mux failures: 42
High priority (M/O > 2.0): 12
Medium priority (1.0 < M/O ≤ 2.0): 8
```
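Read literally, the ratio amounts to the following; a sketch of the formula only, since the actual script may smooth small samples or handle zero denominators differently:

```python
def mo_ratio(mux_fail_rate: float, other_fail_rates: list[float]) -> float:
    """Mux failure rate divided by the average failure rate of the other agents."""
    avg_other = sum(other_fail_rates) / len(other_fail_rates)
    if avg_other == 0.0:
        # Others never fail: any Mux failure is infinitely worse than the field
        return float("inf") if mux_fail_rate > 0 else 0.0
    return mux_fail_rate / avg_other
```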