tao-run-on-brev

Brev

NVIDIA Brev provides on-demand GPU instances across multiple cloud providers. Instances come pre-loaded with NVIDIA drivers, CUDA, Docker, and the NVIDIA Container Toolkit.
Brev is instance-based (not job-based like Lepton). You create an instance, run commands on it via `brev exec`, and delete it when done. The TAO SDK's `BrevHandler` wraps this into the standard job interface.

Preflight

This skill needs the `brev` CLI, its companion agent skill (`brev-cli`), and an active login. Check before proceeding:
bash
# 1. brev CLI installed
command -v brev >/dev/null 2>&1 || {
  echo "MISSING: brev CLI not installed. Install:"
  echo "  https://docs.nvidia.com/brev/"
  exit 1
}

# 2. brev-cli agent skill installed; provides the brev CLI's command reference to the agent
[ -d "$HOME/.claude/skills/brev-cli" ] || [ -d ".claude/skills/brev-cli" ] || {
  echo "MISSING: brev-cli agent skill not installed. Run:"
  echo "  brev agent-skill install"
  exit 1
}

# 3. brev login active; always token-login first when running headless.
# Plain `brev ls` will hit an interactive auth prompt (read: EOF on stdin)
# even when BREV_API_TOKEN is set, so refresh the session up front.
if [ -n "$BREV_API_TOKEN" ]; then
  brev login --token "$BREV_API_TOKEN" >/dev/null 2>&1 || {
    echo "MISSING: brev token login failed. Verify BREV_API_TOKEN."
    exit 1
  }
fi
# Retry once after a forced re-login: cached creds occasionally desync and the
# first `brev ls` returns auth EOF until the session is rebuilt.
brev ls >/dev/null 2>&1 || {
  [ -n "$BREV_API_TOKEN" ] && brev login --token "$BREV_API_TOKEN" >/dev/null 2>&1
  brev ls >/dev/null 2>&1 || {
    echo "MISSING: not logged in to brev. Run:"
    echo "  brev login   # interactive (opens browser)"
    # The dollar sign is escaped so the hint prints the variable name, not the token.
    echo "  # or set BREV_API_TOKEN in ~/.config/tao/.env (then 'brev login --token \$BREV_API_TOKEN')"
    exit 1
  }
}

If any step fails, the agent prompts the user to authorize the fix via Bash, then re-runs the preflight before continuing.
The TAO SDK is **not** required for Brev; `brev exec docker run …` is sufficient. Reach for the SDK only if you want Job handles, S3 I/O wrapping via `script_runner`, or state persistence. `nvidia-tao-sdk` is on public PyPI; install the pinned Brev extra from `versions.yaml`: `pip install "$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_brev)"`. **When going the SDK route, read `tao-skill-bank:tao-run-platform` for the `BrevSDK` kwarg reference, `build_entrypoint`, and `ActionWorkflow` patterns.**

Authentication

Two options:
  1. Automated (recommended): Get an API token from the Brev console settings page. Set `BREV_API_TOKEN` as an environment variable (e.g., in `~/.config/tao/.env`). The handler auto-authenticates via `brev login --token` on first use, the same UX as Lepton.
  2. Manual: Run `brev login` (opens browser). Tokens expire hourly; the handler refreshes automatically.
S3 credentials (`ACCESS_KEY`, `SECRET_KEY`) are needed separately for data transfer.
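One way to pick the token up from `~/.config/tao/.env` without sourcing the whole file is sketched below. `read_env_value` is a hypothetical helper (it is not part of the brev CLI or the TAO SDK), and it assumes the file holds plain `KEY=VALUE` lines.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of brev or the TAO SDK.
# Prints the value of KEY from a dotenv-style file of plain KEY=VALUE lines.
# Assumption: no multi-line values and no variable interpolation in the file.
read_env_value() {
  local key="$1" file="$2" line value
  [ -r "$file" ] || return 1
  # Last matching assignment wins; an optional "export " prefix is tolerated.
  line=$(grep -E "^(export[[:space:]]+)?${key}=" "$file" | tail -n 1) || true
  [ -n "$line" ] || return 1
  value=${line#*=}
  # Strip one pair of surrounding quotes, if present.
  case "$value" in
    \"*\") value=${value#\"}; value=${value%\"} ;;
    \'*\') value=${value#\'}; value=${value%\'} ;;
  esac
  printf '%s\n' "$value"
}

# Usage (commented out so the block has no side effects):
#   BREV_API_TOKEN=$(read_env_value BREV_API_TOKEN "$HOME/.config/tao/.env") && export BREV_API_TOKEN
#   brev login --token "$BREV_API_TOKEN"
```

Reading a single key keeps unrelated lines in the file from being executed, which sourcing the file would do.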

Headless / non-interactive

In a CI shell, container, or agent session with no controlling TTY, always run `brev login --token "$BREV_API_TOKEN"` before any other `brev` call, even when the token is exported. Otherwise the CLI prompts on stdin and returns an `EOF` auth error on commands like `brev ls`, `brev create`, or `brev exec`. Re-run the token login if a call returns auth-EOF; a single refresh is usually enough.
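The "refresh once and retry" advice can be wrapped in a small function. This is a sketch only: `brev_with_relogin` is a hypothetical name, and for simplicity it retries after any non-zero exit rather than inspecting the output for an auth EOF.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a brev command; if it fails, refresh the token
# login once and retry. Assumes BREV_API_TOKEN is exported.
# Simplification: any failure triggers the retry, not only an auth EOF.
brev_with_relogin() {
  brev "$@" && return 0
  [ -n "${BREV_API_TOKEN:-}" ] || return 1
  brev login --token "$BREV_API_TOKEN" >/dev/null 2>&1 || return 1
  brev "$@"
}

# Usage (commented out so the block has no side effects):
#   brev_with_relogin ls --json
```

Limiting the wrapper to one retry matches the guidance that a stale token will not be fixed by further attempts.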

Launch Preflight

Before generating scripts or submitting jobs:
  1. Verify `BREV_API_TOKEN` is set.
  2. Verify the `brev` CLI is installed and can list instances, for example `brev ls --json`. If needed, authenticate with `brev login --token`.
  3. For `s3://` datasets/results, verify `ACCESS_KEY` and `SECRET_KEY` are set and the exact paths are readable with `aws s3 ls`.
  4. Do not accept local `/path` inputs for Brev unless the user has proven those paths exist on the target Brev instance or are mounted into it.
  5. Verify model-specific credentials such as `HF_TOKEN` before launch.
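The variable and path checks in this list can be sketched as two small bash helpers. `require_env` and `is_s3_uri` are hypothetical names for illustration; the sketch covers only the local checks and leaves out the `brev ls --json` and `aws s3 ls` calls.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the launch preflight; not part of brev or the TAO SDK.

# Report every named variable that is unset or empty; fail if any is missing.
require_env() {
  local missing=0 name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "MISSING: $name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Succeed only for s3:// URIs that have something after the scheme.
is_s3_uri() {
  case "$1" in
    s3://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (commented out so the block has no side effects):
#   require_env BREV_API_TOKEN ACCESS_KEY SECRET_KEY HF_TOKEN || exit 1
#   is_s3_uri "$DATASET_URI" || echo "refusing local path: $DATASET_URI" >&2
```

Reporting all missing variables in one pass avoids a fix-one, rerun, fix-the-next loop.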

Instance Lifecycle

The agent controls instance lifecycle:
  • Reuse: Pass `instance_id` in `backend_details` to run multiple jobs on the same instance. Efficient for multi-step workflows.
  • Ephemeral: Omit `instance_id`, and the handler creates a new instance per job. Clean but slower (instance boot takes ~2-5 min).
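The ephemeral pattern can also be sketched directly against the CLI, using only the `brev create`, `brev exec`, and `brev delete` calls that appear elsewhere in this document. `run_ephemeral` and the `BREV_GPU` variable are hypothetical names; this is not how the SDK handler is implemented, and a readiness wait would normally sit between create and exec.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a one-job-per-instance run using the brev CLI.
# BREV_GPU is an assumed convenience variable; L40S:1 mirrors the example
# used in the placement section of this document.
run_ephemeral() {
  local name="$1"; shift
  local rc=0
  brev create "$name" --gpu "${BREV_GPU:-L40S:1}" || return 1
  # Keep the job's exit code so cleanup does not mask a failure.
  brev exec "$name" -- "$@" || rc=$?
  # Always attempt the delete, even when the job failed.
  brev delete "$name" || echo "WARNING: failed to delete $name" >&2
  return "$rc"
}

# Usage (commented out so the block has no side effects):
#   run_ephemeral tao-job-1 nvidia-smi
```

Deleting the instance regardless of the job result avoids leaving billed instances behind after a failed run.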

Creating an instance — placement info

For accounts with more than one cloud credential or workspace group, plain `brev create` rejects the call with a placement error. Pass the account-specific IDs explicitly:
bash
brev create my-instance \
  --gpu L40S:1 \
  --cloud-cred-id <cloudCredId> \
  --workspace-group-id <workspaceGroupId>
Discover the values once and stash them in `~/.config/tao/.env`:
bash
brev ls --json | jq -r '.workspaces[0].workspaceGroupId'   # default group
brev orgs --json | jq -r '.[0].cloudCredentials[].id'      # cloud credential
When using the SDK, pass them through `backend_details`:
python
BrevSDK().create_job(
    ...,
    backend_details={
        "cloud_cred_id": "<cloudCredId>",
        "workspace_group_id": "<workspaceGroupId>",
    },
)
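For CLI use, the placement flags can be added only when the IDs are known. The sketch below assumes the stashed values are exported as `BREV_CLOUD_CRED_ID` and `BREV_WORKSPACE_GROUP_ID`; those variable names and the `build_create_args` helper are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the arguments for `brev create`, one per line,
# adding placement flags only when the assumed environment variables are set.
build_create_args() {
  local name="$1" gpu="$2"
  local args=("$name" --gpu "$gpu")
  if [ -n "${BREV_CLOUD_CRED_ID:-}" ]; then
    args+=(--cloud-cred-id "$BREV_CLOUD_CRED_ID")
  fi
  if [ -n "${BREV_WORKSPACE_GROUP_ID:-}" ]; then
    args+=(--workspace-group-id "$BREV_WORKSPACE_GROUP_ID")
  fi
  printf '%s\n' "${args[@]}"
}

# Usage (commented out so the block has no side effects; mapfile needs bash 4+):
#   mapfile -t create_args < <(build_create_args my-instance L40S:1)
#   brev create "${create_args[@]}"
```

Single-credential accounts can leave both variables unset and get the plain `brev create` call.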

Multi-GPU and multi-node

Multi-node is not supported on Brev. Brev is instance-based: one job runs on one instance, with no cross-instance coordination.
Multi-GPU on a single instance is supported (instances are available with up to 8× H100 / A100 / L40S). `gpu_count` maps to the GPU count on the instance; `torchrun --nproc-per-node=N` or PyTorch DDP work within the instance.
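As an illustration of the single-instance case, the helper below composes a `torchrun` command line for a given GPU count. `torchrun_cmd` is a hypothetical name, the 1 to 8 bound is taken from the instance sizes mentioned above, and `--standalone` is torchrun's single-node mode.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a single-node torchrun command for N GPUs.
# The upper bound of 8 is an assumption based on the instance sizes above.
torchrun_cmd() {
  local gpus="${1:-}"
  shift || true
  case "$gpus" in
    ''|*[!0-9]*) echo "gpu_count must be a positive integer" >&2; return 1 ;;
  esac
  if [ "$gpus" -lt 1 ] || [ "$gpus" -gt 8 ]; then
    echo "gpu_count must be between 1 and 8 on a single instance" >&2
    return 1
  fi
  if [ "$#" -lt 1 ]; then
    echo "usage: torchrun_cmd GPU_COUNT SCRIPT [ARGS...]" >&2
    return 1
  fi
  printf 'torchrun --standalone --nproc-per-node=%s' "$gpus"
  # %q shell-quotes each argument so the line can be pasted into a command.
  printf ' %q' "$@"
  printf '\n'
}

# Usage (commented out so the block has no side effects):
#   torchrun_cmd 4 train.py --epochs 10
```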

GPU Types

Available via `brev search`:
  • L40S, A100 80GB, H100 (availability varies by provider)
  • Use `--gpu-name` to filter, `--min-vram` for memory requirements

Storage

No shared NFS/Lustre. All data flows through S3 via the `script_runner`'s fsspec integration. Instance-local disk under the login user's home directory (`$HOME`) persists across stop/start but not across delete/create.

Docker on Brev

VM Mode instances have Docker pre-installed. For TAO container images:
bash
# NGC auth (one-time per instance)
brev exec <instance> -- docker login nvcr.io -u '$oauthtoken' -p <NGC_KEY>

# Run a TAO training job
brev exec <instance> -- docker run --gpus all --rm \
  -v $HOME/data:/data \
  nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt \
  visual_changenet train -e /data/spec.yaml

Wait for instance readiness before the first `brev exec`

A freshly created instance reports `RUNNING` long before sshd, hostname resolution, and the user shell are ready. The first `brev exec` against an unsettled instance fails with `hostname not resolvable`, `Connection refused`, or a silent timeout. Always poll until a trivial exec succeeds before issuing real work:
bash
# Wait up to 5 minutes for shell readiness; covers the SSH bring-up window.
for i in $(seq 1 60); do
  brev exec <instance> -- true >/dev/null 2>&1 && break
  sleep 5
done
brev exec <instance> -- true >/dev/null 2>&1 || { echo "instance <instance> never became exec-ready"; exit 1; }

`brev exec` timeout for cold-start workloads

`brev exec` has no default timeout of its own, but anything that wraps it (the SDK handler, CI step wrappers, the coreutils `timeout` command) must allow time for both the SSH bring-up window and the container pull on a fresh instance. Use ≥ 600 s (10 min) for the first exec on a new instance; the previous 60–120 s default cuts off remote startup and surfaces as a spurious `exec failed` even though the remote command is still progressing.
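A wrapper that applies the 600 s floor with the coreutils `timeout` command might look like the sketch below. `first_exec` and the `FIRST_EXEC_TIMEOUT` variable are hypothetical names; GNU `timeout` exits with status 124 when the limit is hit, which the wrapper uses to tell a local cutoff apart from a remote failure.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper for the first exec on a fresh instance.
# FIRST_EXEC_TIMEOUT is an assumed override; 600 s is the floor suggested above.
first_exec() {
  local instance="$1"; shift
  local limit="${FIRST_EXEC_TIMEOUT:-600}" rc=0
  timeout "$limit" brev exec "$instance" -- "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    # GNU timeout exits with 124 when the time limit is reached.
    echo "brev exec hit the ${limit}s local limit; the remote command may still be running" >&2
  fi
  return "$rc"
}

# Usage (commented out so the block has no side effects):
#   first_exec my-instance docker pull nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt
```

Surfacing status 124 separately keeps a local cutoff from being reported as a failed remote command.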

Mixed-Platform Workflows

Brev can be mixed with Lepton in the same workflow. Per-stage platform assignment:
json
{"skill": "vcn-gap-analysis", "action": "analyze", "platform": "brev"},
{"skill": "visual-changenet", "action": "train", "platform": "lepton"}
CPU stages (gap analysis, data merge) run cheaply on Brev. GPU stages (training) run on Lepton H100s.

Cleanup

bash
brev delete <instance>      # plain delete, no flags
The CLI does not accept `--yes` / `-y`; passing it errors with `unknown flag: --yes`. `brev delete <instance>` is already non-interactive on recent CLIs, so no confirmation flag is needed.

Error Patterns

**brev CLI not found**: Install from https://docs.nvidia.com/brev/.
**`brev ls` returns auth EOF even with `BREV_API_TOKEN` set**: The headless shell has no stdin for the interactive auth prompt. Run `brev login --token "$BREV_API_TOKEN"` first, then retry. If the failure persists after a single retry, the token itself is stale; mint a fresh one.
**Token expired**: The handler auto-refreshes via `brev login --token`. If the error persists, run `brev login` manually.
**`brev create` rejected with placement error (`cloudCredId` / `workspaceGroupId` required)**: Multi-credential or multi-workspace accounts must pass `--cloud-cred-id` and/or `--workspace-group-id`. See the placement info section above.
**`brev exec` fails with `hostname not resolvable` or `Connection refused` right after create**: The instance reports `RUNNING` before sshd is up. Run the readiness-wait loop from the instance readiness section before issuing the real command.
**SDK exec timeout / `exec failed` on a fresh instance**: The SDK's `brev exec` wrapper timed out before remote startup finished. Raise the timeout to ≥ 600 s for cold-start runs (see the cold-start timeout section).
**`brev delete --yes`: `unknown flag: --yes`**: The CLI has no confirmation flag. Use plain `brev delete <instance>`.
**Instance stuck in provisioning**: Some GPU types have limited availability. Try a different `--gpu-name` or provider.
**Docker pull fails on nvcr.io**: `NGC_KEY` is not set or has expired. Run `docker login nvcr.io` on the instance.
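A few of these patterns can be matched mechanically against captured stderr. The sketch below is illustrative only: `brev_hint` is a hypothetical helper, and the substrings are the ones quoted in this document, so the exact wording the CLI emits is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a captured brev error message to a remediation hint.
# The matched substrings come from this document; real CLI output may differ.
brev_hint() {
  case "$1" in
    *"unknown flag: --yes"*)
      echo "drop --yes; use plain: brev delete <instance>" ;;
    *"hostname not resolvable"*|*"Connection refused"*)
      echo "instance not exec-ready; poll 'brev exec <instance> -- true' first" ;;
    *cloudCredId*|*workspaceGroupId*)
      echo "pass --cloud-cred-id and/or --workspace-group-id to brev create" ;;
    *EOF*)
      echo "re-run: brev login --token \"\$BREV_API_TOKEN\"" ;;
    *)
      return 1 ;;
  esac
}

# Usage (commented out so the block has no side effects):
#   err=$(brev ls 2>&1 >/dev/null) || brev_hint "$err"
```

The generic `EOF` match comes last so that the more specific messages win.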