
# Terminal-Bench Integration

This directory contains the mux agent adapter for Terminal-Bench 2.0, using Harbor as the evaluation harness.

## Quick Start

When a user asks to run tbench, assume they generally mean running it in CI via `workflow_dispatch`.

```bash
# Run full benchmark suite
make benchmark-terminal

# Run specific tasks
make benchmark-terminal TB_TASK_NAMES="hello-world chess-best-move"

# Run with specific model
make benchmark-terminal TB_ARGS="--agent-kwarg model_name=anthropic/claude-opus-4-5"

# Run on Daytona cloud (high parallelism)
TB_ENV=daytona TB_CONCURRENCY=48 make benchmark-terminal
```

## Daytona Cloud Sandboxes

For faster benchmarks, use Daytona cloud sandboxes instead of local Docker:

```bash
# Set API key (get from https://app.daytona.io)
export DAYTONA_API_KEY="your-api-key"

# Run with 48 concurrent cloud sandboxes (~6x faster than local)
make benchmark-terminal TB_ENV=daytona TB_CONCURRENCY=48

# Run specific tasks on Daytona
make benchmark-terminal TB_ENV=daytona TB_CONCURRENCY=48 TB_TASK_NAMES="chess-best-move stockfish-elo"
```

**Account limits (Tier 3):** Pool of 250 vCPU / 500GB RAM. Most tasks require 1 vCPU / 2GB RAM, with a few needing up to 4 vCPU / 8GB RAM. Harbor automatically requests the correct per-task resources.

**Speed comparison:**
| Environment | Concurrency | Full suite time |
|-------------|-------------|-----------------|
| Local Docker | 4 | ~90 min |
| Daytona Cloud | 48 | ~10-15 min |

## Configuration

### Environment Variables

- `TB_DATASET`: Dataset to use (default: `terminal-bench@2.0`)
- `TB_CONCURRENCY`: Number of concurrent tasks (default: 4)
- `TB_TIMEOUT`: Global timeout in seconds (default: 1800 = 30 minutes)
- `TB_ENV`: Environment to run in (`local` or `daytona`)
- `TB_TASK_NAMES`: Space-separated task names to run (default: all tasks)
- `TB_ARGS`: Additional arguments passed to Harbor
- `MUX_RUN_ARGS`: CLI flags passed directly to `mux run` inside the container (e.g., `--thinking high --use-1m --budget 5.00`). This is the primary mechanism for all `mux run` flags and avoids per-flag plumbing.

### Timeout Handling

The benchmark applies a single global timeout to all tasks. The default is 30 minutes (1800 seconds), which provides sufficient time for most tasks while catching genuinely stuck agents.

**Design rationale:** based on analysis of the Oct 30, 2025 nightly runs:

- Longest successful task: `blind-maze-explorer-algorithm.hard` at 20 minutes
- 95th percentile: ~15 minutes
- Mean duration: ~6 minutes

The 30-minute default provides comfortable headroom for complex tasks without excessive wait times for failed attempts.

**Override timeout:**

```bash
# Run with 60 minute timeout for very complex tasks
TB_TIMEOUT=3600 make benchmark-terminal

# Run with shorter 10 minute timeout for quick iteration
TB_TIMEOUT=600 make benchmark-terminal TB_SAMPLE_SIZE=5
```

**Note:** We prefer global timeout defaults over per-task configuration to avoid complexity and maintenance burden. If you find tasks consistently timing out, increase `TB_TIMEOUT` rather than adding per-task configuration.

### Agent Configuration

The agent adapter accepts a few Harbor kwargs (passed via `--agent-kwarg`):

- `model_name`: Model to use (e.g., `anthropic/claude-sonnet-4-5`, `openai/gpt-5-codex`)
- `experiments`: Experiments to enable, comma-separated (e.g., `programmatic-tool-calling`)

All other `mux run` CLI flags (thinking level, mode, runtime, budget, etc.) are passed via `MUX_RUN_ARGS`, so no per-flag plumbing is needed.

**CI dispatch (primary method):**

```bash
# Run with model, thinking, and 1M context
gh workflow run terminal-bench.yml \
  -f model_name=anthropic/claude-opus-4-6 \
  -f mux_run_args="--thinking xhigh --use-1m"

# Run with budget cap
gh workflow run terminal-bench.yml \
  -f model_name=anthropic/claude-opus-4-6 \
  -f mux_run_args="--thinking high --budget 5.00"
```

**Local runs:**

```bash
# Pass flags via MUX_RUN_ARGS env var
MUX_RUN_ARGS="--thinking high --use-1m" make benchmark-terminal

# Model and experiments via TB_ARGS
make benchmark-terminal TB_ARGS="--agent-kwarg model_name=openai/gpt-5-codex --agent-kwarg experiments=programmatic-tool-calling"
```

## Results

Results are saved to `runs/YYYY-MM-DD__HH-MM-SS/`:

- `results.json`: Aggregate results with pass/fail rates
- `run_metadata.json`: Run configuration and metadata
- `<task-id>/`: Per-task directories containing:
  - `sessions/agent.log`: Full agent execution log
  - `sessions/agent.cast`: Asciinema recording of the agent session
  - `sessions/tests.log`: Test execution output
  - `results.json`: Per-trial results

## CI/CD Integration

### Querying Results from BigQuery

Mux Terminal-Bench results are uploaded to BigQuery after CI runs. Query via the `bq` CLI after authenticating with `gcloud auth login` and setting the project to `mux-benchmarks`.

**Table:** `mux-benchmarks.benchmarks.tbench_results`

**Schema:** `run_id` (STRING), `task_id` (STRING), `model_name` (STRING), `thinking_level` (STRING: off/low/medium/high), `mode` (STRING: plan/exec), `dataset` (STRING), `experiments` (STRING), `passed` (BOOL), `score` (FLOAT), `n_input_tokens` (INT), `n_output_tokens` (INT), `github_run_id` (INT), `github_sha` (STRING), `ingested_at` (TIMESTAMP).

See `.github/workflows/terminal-bench.yml` and `.github/workflows/nightly-terminal-bench.yml` for GitHub Actions integration. The nightly workflow runs both Claude and GPT models on the full task suite, uploading results as artifacts.
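For example, a per-model pass-rate rollup over this table can be written against the schema above (a sketch; it uses only the columns listed in the schema):

```sql
-- Per-model pass rate across all ingested trials
SELECT
  model_name,
  COUNT(*) AS trials,
  COUNTIF(passed) / COUNT(*) AS pass_rate
FROM `mux-benchmarks.benchmarks.tbench_results`
WHERE passed IS NOT NULL
GROUP BY model_name
ORDER BY pass_rate DESC
```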

## Leaderboard Submission

To submit mux results to the Terminal-Bench 2.0 leaderboard:

### Step 1: Prepare Submission

The leaderboard computes pass@k from multiple attempts per task. Provide multiple runs so each becomes its own job folder inside the submission.

```bash
# Download latest 5 successful nightly runs (recommended for submission)
python3 benchmarks/terminal_bench/prepare_leaderboard_submission.py --n-runs 5

# Use specific run IDs (each becomes a separate job folder)
python3 benchmarks/terminal_bench/prepare_leaderboard_submission.py --run-id 111 222 333 444 555

# Use multiple existing artifact directories
python3 benchmarks/terminal_bench/prepare_leaderboard_submission.py --artifacts-dir ./run1 ./run2

# Download latest single run (quick iteration)
python3 benchmarks/terminal_bench/prepare_leaderboard_submission.py

# Only prepare specific models
python3 benchmarks/terminal_bench/prepare_leaderboard_submission.py --n-runs 5 --models anthropic/claude-opus-4-5
```

This creates a properly structured submission folder at `leaderboard_submission/` containing:

```
submissions/terminal-bench/2.0/Mux__<model>/
  metadata.yaml       # Agent and model info
  <job-folder-1>/     # Results from run 1
    config.json
    result.json
    <trial-1>/
      config.json
      result.json
      agent/
      verifier/
    ...
  <job-folder-2>/     # Results from run 2
  ...
```
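For reference, pass@k over n attempts of which c passed is typically computed with the standard unbiased estimator; a minimal sketch of the usual formula (not the leaderboard's own code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts, of which c passed."""
    if n - c < k:
        # Too few failures: every size-k subset of attempts contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With five job folders per task, n = 5, so estimates up to pass@5 are possible.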

### Step 2: Submit via HuggingFace Python API

The `hf upload` CLI tends to time out on large submissions due to LFS file handling. Use the Python API with an extended timeout instead:

```bash
# Install huggingface_hub (via uv or pip)
pip install huggingface_hub

# Authenticate (one-time setup)
hf auth login
```

```python
import httpx
from huggingface_hub import HfApi
from huggingface_hub.utils import configure_http_backend

# Use a generous HTTP timeout so large folder uploads don't fail midway
configure_http_backend(
    backend_factory=lambda: httpx.Client(timeout=httpx.Timeout(300.0, connect=60.0))
)

api = HfApi()
api.upload_folder(
    repo_id="alexgshaw/terminal-bench-2-leaderboard",
    folder_path="./leaderboard_submission/submissions",
    path_in_repo="submissions",
    repo_type="dataset",
    create_pr=True,
    commit_message="Add Mux + <Model> submission",
    commit_description="- Agent: Mux (Coder)\n- Model: <model>\n- <N> tasks × <K> attempts",
)
```
The PR will be automatically validated by the leaderboard bot. Once merged, results appear on the leaderboard.

**Tips from past submissions:**

- The prepare script already strips `*.log` files (they trigger HF LFS and cause timeouts)
- `--artifacts-dir` accepts raw job folders directly (e.g., an extracted tarball root)
- To update an existing PR, pass `revision="refs/pr/<N>"` instead of `create_pr=True`
- To remove stale files from a PR, use `api.delete_folder(..., revision="refs/pr/<N>")`

## Files

- `mux_agent.py`: Main agent adapter implementing Harbor's `BaseInstalledAgent` interface
- `mux-run.sh`: Shell script that sets up the environment and invokes the mux CLI
- `mux_payload.py`: Helper to package the mux app for containerized execution
- `mux_setup.sh.j2`: Jinja2 template for the agent installation script
- `prepare_leaderboard_submission.py`: Script to prepare results for leaderboard submission
- `analyze_failure_rates.py`: Analyze failure rates to find optimization opportunities
- `download_run_logs.py`: Download and inspect raw agent logs from nightly runs

## Comparative Failure Analysis Workflow

When investigating why Mux fails on a task more often than other agents, use this workflow:

### 1. Identify High-Priority Failures

```bash
# Find tasks where Mux underperforms (high M/O ratio = Mux fails more than others)
python benchmarks/terminal_bench/analyze_failure_rates.py --top 20
```

### 2. Check BigQuery for Failure Patterns

```bash
# Authenticate and set project
gcloud auth login && gcloud config set project mux-benchmarks

# Query pass/fail by model for specific task (strip __hash suffix mentally)
bq query --use_legacy_sql=false '
  SELECT model_name, passed, COUNT(*) AS runs
  FROM `mux-benchmarks.benchmarks.tbench_results`
  WHERE REGEXP_REPLACE(task_id, r"__[a-zA-Z0-9]+$", "") = "TASK_NAME_HERE"
    AND github_workflow = "Nightly Terminal-Bench"
    AND passed IS NOT NULL
  GROUP BY model_name, passed
  ORDER BY model_name, passed'
```

### 3. Download and Inspect Agent Logs

```bash
# List recent nightly runs
python benchmarks/terminal_bench/download_run_logs.py --list-runs

# Download latest run and filter to failing task
python benchmarks/terminal_bench/download_run_logs.py --task TASK_NAME --failures-only

# Download specific run, filter to specific model
python benchmarks/terminal_bench/download_run_logs.py --run-id 21230456195 --model opus --task TASK_NAME

# Verbose mode shows stderr from agent execution
python benchmarks/terminal_bench/download_run_logs.py --task TASK_NAME -v
```

Logs are cached in `.run_logs/<run-id>/`. Inspect:

- `agent/command-0/stdout.txt` — Full agent output (JSONL stream)
- `agent/command-0/stderr.txt` — Errors during execution
- `result.json` — Trial result with `verifier_result` and `exception_info`
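When triaging many downloaded trials at once, the fields above can drive a tiny classifier; a sketch that assumes only the keys named in this README (`exception_info`, and the `verifier_result.rewards.reward` path used in the leaderboard comparison step):

```python
def classify_trial(result: dict) -> str:
    """Triage one trial's result.json into error, pass, or fail.

    Assumes harness errors appear under exception_info and the reward lives
    at verifier_result.rewards.reward (key paths as used in this README).
    """
    if result.get("exception_info"):
        return "error"
    reward = ((result.get("verifier_result") or {}).get("rewards") or {}).get("reward", 0)
    return "pass" if reward else "fail"
```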

### 4. Compare with Leaderboard Submissions

```bash
# Clone leaderboard repo from HuggingFace (cached in .leaderboard_cache/)
cd benchmarks/terminal_bench
git clone https://huggingface.co/datasets/alexgshaw/terminal-bench-2-leaderboard .leaderboard_cache/terminal-bench-2-leaderboard 2>/dev/null

# Find passing submissions for the task
find .leaderboard_cache -path "*TASK_NAME*" -name "result.json" -exec sh -c '
  agent=$(echo "$1" | cut -d/ -f5)
  reward=$(python3 -c "import json,sys; print(json.load(sys.stdin).get(\"verifier_result\",{}).get(\"rewards\",{}).get(\"reward\",0))" < "$1")
  echo "$agent: reward=$reward"
' _ {} \;
```

## Analyzing Failure Rates

To identify where Mux underperforms relative to other top agents, use the analysis script:

```bash
# Run analysis (requires bq CLI for Mux results, git for leaderboard data)
python benchmarks/terminal_bench/analyze_failure_rates.py

# Show more results
python benchmarks/terminal_bench/analyze_failure_rates.py --top 50

# Filter to specific Mux model
python benchmarks/terminal_bench/analyze_failure_rates.py --mux-model sonnet

# Force refresh of cached data
python benchmarks/terminal_bench/analyze_failure_rates.py --refresh

# Output as JSON for further processing
python benchmarks/terminal_bench/analyze_failure_rates.py --json > opportunities.json
```

The script computes the **M/O ratio** for each task:

```
M/O ratio = Mux failure rate / Average failure rate of top 10 agents
```

Tasks with a **high M/O ratio** are where Mux underperforms relative to competitors; these represent the best optimization opportunities.

Example output:

```
================================================================================
OPTIMIZATION OPPORTUNITIES (sorted by M/O ratio)

Task ID               Mux Fail%   Avg Other%   M/O Ratio   Agent
some-difficult-task   100.0%      10.0%        9.09        Mux__Claude-Sonnet-4.5
another-task          80.0%       20.0%        3.64        Mux__Claude-Sonnet-4.5
...

================================================================================
SUMMARY

Total tasks with Mux failures: 42
High priority (M/O > 2.0): 12
Medium priority (1.0 < M/O ≤ 2.0): 8
```
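Read literally, the ratio amounts to the following; a sketch of the formula only, since the actual script may smooth small samples or handle zero denominators differently:

```python
def mo_ratio(mux_fail_rate: float, other_fail_rates: list[float]) -> float:
    """Mux failure rate divided by the average failure rate of the other agents."""
    avg_other = sum(other_fail_rates) / len(other_fail_rates)
    if avg_other == 0.0:
        # Others never fail: any Mux failure is infinitely worse than the field
        return float("inf") if mux_fail_rate > 0 else 0.0
    return mux_fail_rate / avg_other
```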