tao-run-on-brev


Brev

NVIDIA Brev provides on-demand GPU instances across multiple cloud providers. Instances come pre-loaded with NVIDIA drivers, CUDA, Docker, and NVIDIA Container Toolkit.

Brev is instance-based (not job-based like Lepton). You create an instance, run commands on it via `brev exec`, and delete it when done. The TAO SDK's `BrevHandler` wraps this into the standard job interface.


Preflight


This skill needs the `brev` CLI, its companion agent skill (`brev-cli`), and an active login. Check before proceeding:

```bash
# 1. brev CLI installed
command -v brev >/dev/null 2>&1 || {
  echo "MISSING: brev CLI not installed. Install:"
  echo "  https://docs.nvidia.com/brev/"
  exit 1
}

# 2. brev-cli agent skill installed (provides the brev CLI's command reference to the agent)
[ -d "$HOME/.claude/skills/brev-cli" ] || [ -d ".claude/skills/brev-cli" ] || {
  echo "MISSING: brev-cli agent skill not installed. Run:"
  echo "  brev agent-skill install"
  exit 1
}

# 3. brev login active: always token-login first when running headless.
# Plain `brev ls` will hit an interactive auth prompt (read: EOF on stdin)
# even when BREV_API_TOKEN is set, so refresh the session up front.
if [ -n "$BREV_API_TOKEN" ]; then
  brev login --token "$BREV_API_TOKEN" >/dev/null 2>&1 || {
    echo "MISSING: brev token login failed. Verify BREV_API_TOKEN."
    exit 1
  }
fi

# Retry once after a forced re-login: cached creds occasionally desync and the
# first `brev ls` returns auth EOF until the session is rebuilt.
brev ls >/dev/null 2>&1 || {
  [ -n "$BREV_API_TOKEN" ] && brev login --token "$BREV_API_TOKEN" >/dev/null 2>&1
  brev ls >/dev/null 2>&1 || {
    echo "MISSING: not logged in to brev. Run:"
    echo "  brev login                    # interactive (opens browser)"
    # The \$ keeps the token value itself out of the printed hint.
    echo "  # or set BREV_API_TOKEN in ~/.config/tao/.env (then 'brev login --token \$BREV_API_TOKEN')"
    exit 1
  }
}
```


If any step fails, the agent prompts the user to authorize the fix via Bash, then re-runs the preflight before continuing. The TAO SDK is **not** required for Brev: `brev exec docker run …` is sufficient. Reach for the SDK only if you want Job handles, S3 I/O wrapping via `script_runner`, or state persistence. `nvidia-tao-sdk` is on public PyPI; install the pinned Brev extra from `versions.yaml`: `pip install "$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_brev)"`. **When going the SDK route, read `tao-skill-bank:tao-run-platform` for the `BrevSDK` kwarg reference, `build_entrypoint`, and `ActionWorkflow` patterns.**



Authentication


Two options:

- Automated (recommended): Get an API token from the Brev console settings page. Set `BREV_API_TOKEN` as an environment variable (e.g., in `~/.config/tao/.env`). The handler auto-authenticates via `brev login --token` on first use, the same UX as Lepton.
- Manual: Run `brev login` (opens browser). Tokens expire hourly; the handler refreshes automatically.

S3 credentials (`ACCESS_KEY`, `SECRET_KEY`) are needed separately for data transfer.


Headless / non-interactive


In a CI shell, container, or agent session with no controlling TTY, always run `brev login --token "$BREV_API_TOKEN"` before any other `brev` call, even when the token is exported. Otherwise the CLI prompts on stdin and returns an `EOF` auth error on commands like `brev ls`, `brev create`, or `brev exec`. Re-run the token login if a call returns auth-EOF; a single refresh is usually enough.
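
For callers that drive the CLI from a script, the login-first, refresh-once rule might look like the sketch below. The helper names `run_brev` and `brev_with_relogin` are hypothetical, and treating any non-zero exit as a possible auth-EOF is a simplifying assumption; a stricter version would inspect stderr.

```python
import subprocess

def run_brev(args, runner=subprocess.run):
    """Run `brev <args>` with no stdin attached and return the CompletedProcess."""
    return runner(["brev", *args], capture_output=True, text=True,
                  stdin=subprocess.DEVNULL)

def brev_with_relogin(args, token, runner=subprocess.run):
    """Token-login first, run the command, and refresh the login once on failure."""
    login = ["login", "--token", token]
    run_brev(login, runner)              # always log in before any other call
    result = run_brev(args, runner)
    if result.returncode != 0:           # assumption: non-zero exit may be auth-EOF
        run_brev(login, runner)          # single refresh, then one retry
        result = run_brev(args, runner)
    return result
```

The `runner` parameter exists only so the logic can be exercised without a real `brev` binary; in normal use it stays at its default.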


Launch Preflight


Before generating scripts or submitting jobs:

- Verify `BREV_API_TOKEN` is set.
- Verify the `brev` CLI is installed and can list instances, for example with `brev ls --json`. If needed, authenticate with `brev login --token`.
- For `s3://` datasets/results, verify `ACCESS_KEY` and `SECRET_KEY` are set and the exact paths are readable with `aws s3 ls`.
- Do not accept local `/path` inputs for Brev unless the user has proven those paths exist on the target Brev instance or are mounted into it.
- Verify model-specific credentials such as `HF_TOKEN` before launch.
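
The environment and path checks in this list can be sketched as a pure function. `launch_preflight_problems` is a hypothetical name; the sketch does not cover the CLI check, the `aws s3 ls` readability check, or model-specific credentials, which need live calls.

```python
def launch_preflight_problems(env, paths):
    """Return a list of problems; an empty list means these checks passed.

    env   -- mapping of environment variables (for example os.environ)
    paths -- dataset/result paths the job will read or write
    """
    problems = []
    if not env.get("BREV_API_TOKEN"):
        problems.append("BREV_API_TOKEN is not set")
    if any(p.startswith("s3://") for p in paths):
        for key in ("ACCESS_KEY", "SECRET_KEY"):
            if not env.get(key):
                problems.append(f"{key} is not set (needed for s3:// paths)")
    for p in paths:
        if not p.startswith("s3://"):
            # Local paths are rejected until someone proves they exist remotely.
            problems.append(f"local path {p!r} must be proven to exist on the instance")
    return problems
```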


Instance Lifecycle


The agent controls instance lifecycle:

- Reuse: Pass `instance_id` in `backend_details` to run multiple jobs on the same instance. Efficient for multi-step workflows.
- Ephemeral: Omit `instance_id` and the handler creates a new instance per job. Clean but slower (instance boot takes ~2-5 min).
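
A small sketch of the two modes, assuming `backend_details` is a plain dict as in the SDK example elsewhere in this document. The helper name `backend_details_for`, the instance name, and the step names are hypothetical.

```python
def backend_details_for(instance_id=None):
    """Build backend_details for one job.

    With an instance_id the job reuses that instance; with None the key is
    omitted, which selects the ephemeral, one-instance-per-job mode.
    """
    details = {}
    if instance_id:
        details["instance_id"] = instance_id
    return details

# Three steps of one workflow sharing a single (hypothetical) instance.
steps = ["prepare", "train", "evaluate"]
shared = [backend_details_for("my-instance") for _ in steps]
```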


Creating an instance — placement info


For accounts with more than one cloud credential or workspace group, plain `brev create` rejects the call with a placement error. Pass the account-specific IDs explicitly:

```bash
brev create my-instance \
  --gpu L40S:1 \
  --cloud-cred-id <cloudCredId> \
  --workspace-group-id <workspaceGroupId>
```

Discover the values once and stash them in `~/.config/tao/.env`:

```bash
brev ls --json | jq -r '.workspaces[0].workspaceGroupId'   # default group
brev orgs --json | jq -r '.[0].cloudCredentials[].id'      # cloud credential
```

When using the SDK, pass them through `backend_details`:

```python
BrevSDK().create_job(
    ...,
    backend_details={
        "cloud_cred_id": "<cloudCredId>",
        "workspace_group_id": "<workspaceGroupId>",
    },
)
```
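
The two `jq` lookups can also be done in Python when the JSON has already been captured. This sketch assumes only the JSON shape implied by the `jq` paths; the function names are hypothetical, and picking the first credential is an arbitrary choice for illustration.

```python
import json

def workspace_group_id(ls_json):
    """Mirror of jq '.workspaces[0].workspaceGroupId' on `brev ls --json` output."""
    return json.loads(ls_json)["workspaces"][0]["workspaceGroupId"]

def cloud_cred_ids(orgs_json):
    """Mirror of jq '.[0].cloudCredentials[].id' on `brev orgs --json` output."""
    return [cred["id"] for cred in json.loads(orgs_json)[0]["cloudCredentials"]]

def placement_backend_details(ls_json, orgs_json):
    """Combine both lookups into the backend_details keys used by the SDK call."""
    return {
        "cloud_cred_id": cloud_cred_ids(orgs_json)[0],   # assumption: first credential
        "workspace_group_id": workspace_group_id(ls_json),
    }
```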


Multi-GPU and multi-node


Multi-node is not supported on Brev. Brev is instance-based — one job runs on one instance, with no cross-instance coordination.

Multi-GPU on a single instance is supported (instances are available with up to 8× H100 / A100 / L40S). `gpu_count` maps to the GPU count on the instance; `torchrun --nproc-per-node=N` or PyTorch DDP works within the instance.
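
As an illustration, a single-node launch command can be derived from the GPU count as follows. The helper `torchrun_command` and the script name `train.py` are hypothetical; the 8-GPU ceiling reflects the instance sizes mentioned above.

```python
import shlex

def torchrun_command(gpu_count, script, *script_args):
    """Build a single-node torchrun command with one process per GPU."""
    if not 1 <= gpu_count <= 8:
        raise ValueError("expected between 1 and 8 GPUs on one instance")
    return ["torchrun", f"--nproc-per-node={gpu_count}", script, *script_args]

# Prints: torchrun --nproc-per-node=4 train.py --epochs 10
print(shlex.join(torchrun_command(4, "train.py", "--epochs", "10")))
```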


GPU Types


Available via `brev search`:

- L40S, A100 80GB, H100 (availability varies by provider)
- Use `--gpu-name` to filter and `--min-vram` for memory requirements


Storage


No shared NFS/Lustre. All data flows through S3 via the `script_runner`'s fsspec integration. Instance-local disk under the login user's home directory (`$HOME`) persists across stop/start but not across delete/create.


Docker on Brev


VM Mode instances have Docker pre-installed. For TAO container images:

```bash
# NGC auth (one-time per instance)
brev exec <instance> -- docker login nvcr.io -u '$oauthtoken' -p <NGC_KEY>

# Run a TAO training job
brev exec <instance> -- docker run --gpus all --rm \
  -v $HOME/data:/data \
  nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt \
  visual_changenet train -e /data/spec.yaml
```

Wait for instance readiness before the first `brev exec`


A freshly created instance reports `RUNNING` long before sshd, hostname resolution, and the user shell are ready. The first `brev exec` against an unsettled instance fails with `hostname not resolvable`, `Connection refused`, or a silent timeout. Always poll until a trivial exec succeeds before issuing real work:

```bash
# Wait up to 5 minutes for shell readiness; this covers the SSH bring-up window.
for i in $(seq 1 60); do
  brev exec <instance> -- true >/dev/null 2>&1 && break
  sleep 5
done
brev exec <instance> -- true >/dev/null 2>&1 || { echo "instance <instance> never became exec-ready"; exit 1; }
```
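
For scripts that drive `brev` from Python, the same polling idea might look like this. The function name `wait_until_exec_ready` and its parameters are hypothetical; the defaults (60 attempts, 5 s apart) match the five-minute window described here.

```python
import subprocess
import time

def wait_until_exec_ready(instance, attempts=60, delay=5.0,
                          runner=subprocess.run, sleep=time.sleep):
    """Poll `brev exec <instance> -- true` until it succeeds or attempts run out."""
    cmd = ["brev", "exec", instance, "--", "true"]
    for attempt in range(attempts):
        done = runner(cmd, stdin=subprocess.DEVNULL,
                      stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        if done.returncode == 0:
            return True
        if attempt < attempts - 1:   # no need to sleep after the last try
            sleep(delay)
    return False
```

`runner` and `sleep` are injectable so the loop can be tested without a live instance or real waiting.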

`brev exec` timeout for cold-start workloads


`brev exec` has no default timeout, but anything that wraps it (the SDK handler, CI step wrappers, the `timeout` command) must allow time for both the SSH bring-up window and the container pull on a fresh instance. Use ≥ 600 s (10 min) for the first exec on a new instance; the previous 60–120 s default cuts off remote startup and surfaces as a spurious `exec failed` even though the remote command is still progressing.
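
A wrapper that follows this guidance might pick its timeout by whether the exec is the first one on a new instance. This is a sketch only: `exec_with_timeout` is a hypothetical helper, and the 120 s warm value is an assumption rather than a documented setting.

```python
import subprocess

COLD_START_TIMEOUT_S = 600   # first exec on a new instance
WARM_TIMEOUT_S = 120         # assumption: a shorter limit for later execs

def exec_with_timeout(cmd, first_exec, runner=subprocess.run):
    """Run cmd with a timeout sized for cold or warm starts.

    Returns (returncode, timed_out). A local timeout only means the wrapper
    gave up waiting; the remote command may still be running.
    """
    limit = COLD_START_TIMEOUT_S if first_exec else WARM_TIMEOUT_S
    try:
        done = runner(cmd, timeout=limit)
    except subprocess.TimeoutExpired:
        return None, True
    return done.returncode, False
```

Reporting the timeout separately from the return code keeps a wrapper timeout from being mistaken for a failed remote command.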


Mixed-Platform Workflows


Brev can be mixed with Lepton in the same workflow. Per-stage platform assignment:

```json
{"skill": "vcn-gap-analysis", "action": "analyze", "platform": "brev"},
{"skill": "visual-changenet", "action": "train", "platform": "lepton"}
```

CPU stages (gap analysis, data merge) run cheaply on Brev. GPU stages (training) run on Lepton H100s.
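
To see which stages land on which platform, the stage entries can be grouped by their `platform` field. Wrapping the two entries in a JSON array is an assumption about the enclosing workflow format, and `stages_by_platform` is a hypothetical helper.

```python
import json
from collections import defaultdict

stages = json.loads("""[
  {"skill": "vcn-gap-analysis", "action": "analyze", "platform": "brev"},
  {"skill": "visual-changenet", "action": "train", "platform": "lepton"}
]""")

def stages_by_platform(stages):
    """Group "skill:action" labels by platform, preserving workflow order."""
    grouped = defaultdict(list)
    for stage in stages:
        grouped[stage["platform"]].append(f'{stage["skill"]}:{stage["action"]}')
    return dict(grouped)
```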


Cleanup


```bash
brev delete <instance>      # plain delete, no flags
```

The CLI does not accept `--yes` / `-y`; passing it errors with `unknown flag: --yes`. `brev delete <instance>` is already non-interactive on recent CLIs, so no confirmation flag is needed.
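
For ephemeral instances, one way to make sure cleanup always happens is a `try`/`finally` around the work. This is a sketch with a hypothetical helper name, `run_then_delete`; it is not appropriate for instances that are meant to be reused.

```python
import subprocess

def run_then_delete(instance, work, runner=subprocess.run):
    """Run work(instance) and always issue a plain `brev delete` afterwards."""
    try:
        return work(instance)
    finally:
        # No --yes / -y here: the CLI rejects it with "unknown flag: --yes".
        runner(["brev", "delete", instance], stdin=subprocess.DEVNULL)
```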


Error Patterns


- `brev` CLI not found: Install from https://docs.nvidia.com/brev/.
- `brev ls` returns auth EOF even with `BREV_API_TOKEN` set: The headless shell has no stdin for the interactive auth prompt. Run `brev login --token "$BREV_API_TOKEN"` first, then retry. If the failure persists after a single retry, the token itself is stale; mint a fresh one.
- Token expired: The handler auto-refreshes via `brev login --token`. If the problem persists, run `brev login` manually.
- `brev create` rejected with placement error (`cloudCredId` / `workspaceGroupId` required): Multi-credential or multi-workspace accounts must pass `--cloud-cred-id` and/or `--workspace-group-id`. See the placement info section above.
- `brev exec` fails with `hostname not resolvable` or `Connection refused` right after create: The instance reports `RUNNING` before sshd is up. Run the readiness-wait loop from "Wait for instance readiness before the first `brev exec`" before issuing the real command.
- SDK exec timeout / `exec failed` on a fresh instance: The SDK's `brev exec` wrapper timed out before remote startup finished. Raise the timeout to ≥ 600 s for cold-start runs (see "`brev exec` timeout for cold-start workloads").
- `brev delete --yes` fails with `unknown flag: --yes`: The CLI has no confirmation flag. Use plain `brev delete <instance>`.
- Instance stuck in provisioning: Some GPU types have limited availability. Try a different `--gpu-name` or provider.
- Docker pull fails on nvcr.io: `NGC_KEY` is not set or has expired. Run `docker login nvcr.io` on the instance.
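
An agent could turn part of this list into a lookup from error text to suggested fix. The sketch below assumes the quoted substrings appear verbatim in the CLI's error output, which may not hold for every CLI version; `REMEDIES` and `suggest_remedy` are hypothetical names.

```python
# Ordered (substring, remedy) pairs; more specific patterns come first.
REMEDIES = [
    ("unknown flag: --yes", "Use plain `brev delete <instance>`."),
    ("hostname not resolvable", "Wait for exec readiness, then retry."),
    ("Connection refused", "Wait for exec readiness, then retry."),
    ("cloudCredId", "Pass --cloud-cred-id and/or --workspace-group-id."),
    ("workspaceGroupId", "Pass --cloud-cred-id and/or --workspace-group-id."),
    ("EOF", 'Run `brev login --token "$BREV_API_TOKEN"`, then retry.'),
]

def suggest_remedy(error_text):
    """Return the first matching remedy for an error message, or None."""
    for needle, remedy in REMEDIES:
        if needle in error_text:
            return remedy
    return None
```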
