tao-run-on-brev
Brev
NVIDIA Brev provides on-demand GPU instances across multiple cloud providers. Instances come pre-loaded with NVIDIA drivers, CUDA, Docker, and NVIDIA Container Toolkit.
Brev is instance-based (not job-based like Lepton). You create an instance, run commands on it via `brev exec`, and delete it when done. The TAO SDK's `BrevHandler` wraps this into the standard job interface.
Preflight
This skill needs the `brev` CLI, its companion agent skill (`brev-cli`), and an active login. Check before proceeding:

```bash
# 1. brev CLI installed
command -v brev >/dev/null 2>&1 || {
  echo "MISSING: brev CLI not installed. Install:"
  echo "  https://docs.nvidia.com/brev/"
  exit 1
}

# 2. brev-cli agent skill installed (provides the brev CLI's command reference to the agent)
[ -d "$HOME/.claude/skills/brev-cli" ] || [ -d ".claude/skills/brev-cli" ] || {
  echo "MISSING: brev-cli agent skill not installed. Run:"
  echo "  brev agent-skill install"
  exit 1
}

# 3. brev login active: always token-login first when running headless.
# Plain `brev ls` will hit an interactive auth prompt (read: EOF on stdin)
# even when BREV_API_TOKEN is set, so refresh the session up front.
if [ -n "$BREV_API_TOKEN" ]; then
  brev login --token "$BREV_API_TOKEN" >/dev/null 2>&1 || {
    echo "MISSING: brev token login failed. Verify BREV_API_TOKEN."
    exit 1
  }
fi

# Retry once after a forced re-login: cached creds occasionally desync and the
# first `brev ls` returns auth EOF until the session is rebuilt.
brev ls >/dev/null 2>&1 || {
  [ -n "$BREV_API_TOKEN" ] && brev login --token "$BREV_API_TOKEN" >/dev/null 2>&1
  brev ls >/dev/null 2>&1 || {
    echo "MISSING: not logged in to brev. Run:"
    echo "  brev login   # interactive (opens browser)"
    # The \$ keeps the token value out of the printed hint.
    echo "  # or set BREV_API_TOKEN in ~/.config/tao/.env (then 'brev login --token \$BREV_API_TOKEN')"
    exit 1
  }
}
```

If any step fails, the agent prompts the user to authorize the fix via Bash, then re-runs the preflight before continuing.
The TAO SDK is **not** required for Brev: `brev exec docker run …` is sufficient. Reach for the SDK only if you want Job handles, S3 I/O wrapping via `script_runner`, or state persistence. `nvidia-tao-sdk` is on public PyPI; install the pinned Brev extra from `versions.yaml`: `pip install "$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_brev)"`. **When going the SDK route, read `tao-skill-bank:tao-run-platform` for the `BrevSDK` kwarg reference, `build_entrypoint`, and `ActionWorkflow` patterns.**
Authentication
Two options:
- Automated (recommended): Get an API token from the Brev console settings page. Set `BREV_API_TOKEN` as an environment variable (e.g., in `~/.config/tao/.env`). The handler auto-authenticates via `brev login --token` on first use, the same UX as Lepton.
- Manual: Run `brev login` (opens browser). Tokens expire hourly; the handler refreshes automatically.
S3 credentials (ACCESS_KEY, SECRET_KEY) are needed separately for data transfer.
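For the automated option, a minimal sketch of loading the token from the env file before logging in might look like the following. The function name `load_tao_env` is hypothetical, and the sketch assumes the file contains plain `KEY=value` lines that are safe to source as shell.

```shell
# Hypothetical helper: export every KEY=value pair from an env file, then
# confirm BREV_API_TOKEN is present. Assumes shell-compatible assignments.
load_tao_env() {
  local env_file="${1:-$HOME/.config/tao/.env}"
  [ -f "$env_file" ] || { echo "MISSING: $env_file not found" >&2; return 1; }
  set -a            # auto-export everything assigned while sourcing
  . "$env_file"
  set +a
  [ -n "${BREV_API_TOKEN:-}" ] || {
    echo "MISSING: BREV_API_TOKEN not set in $env_file" >&2
    return 1
  }
}
```

A headless script could then start with `load_tao_env && brev login --token "$BREV_API_TOKEN"`.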
Headless / non-interactive
In a CI shell, container, or agent session with no controlling TTY, always run `brev login --token "$BREV_API_TOKEN"` before any other `brev` call, even when the token is exported. Otherwise the CLI prompts on stdin and returns an auth `EOF` error on commands like `brev ls`, `brev create`, or `brev exec`. Re-run the token login if a call returns auth-EOF; a single refresh is usually enough.
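The refresh-and-retry advice above can be sketched as a small wrapper. `brev_retry` is a hypothetical name; only `brev login --token` is taken from the text. For simplicity the wrapper retries on any failure rather than detecting auth-EOF specifically.

```shell
# Hypothetical wrapper: run a brev subcommand; if it fails, refresh the token
# session once and retry. It does not inspect the error, so non-auth failures
# are retried once as well.
brev_retry() {
  brev "$@" && return 0
  [ -n "${BREV_API_TOKEN:-}" ] || return 1
  brev login --token "$BREV_API_TOKEN" >/dev/null 2>&1 || return 1
  brev "$@"
}
```

For example, `brev_retry ls --json >/dev/null` would stand in for a bare `brev ls --json` in a headless script.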
Launch Preflight
Before generating scripts or submitting jobs:
- Verify `BREV_API_TOKEN` is set.
- Verify the `brev` CLI is installed and can list instances, for example `brev ls --json`. If needed, authenticate with `brev login --token`.
- For `s3://` datasets/results, verify `ACCESS_KEY` and `SECRET_KEY` are set and the exact paths are readable with `aws s3 ls`.
- Do not accept local `/path` inputs for Brev unless the user has proven those paths exist on the target Brev instance or are mounted into it.
- Verify model-specific credentials such as `HF_TOKEN` before launch.
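The variable checks in this list could be gathered into one helper, sketched below. `require_env` is a hypothetical name, not part of the skill or the SDK, and it only checks that each variable is non-empty, not that the credential is valid.

```shell
# Hypothetical helper: report every unset or empty variable among the names
# given, and return non-zero if any is missing. Names are trusted constants.
require_env() {
  local missing=0 name value
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "MISSING: $name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A launch script might call `require_env BREV_API_TOKEN ACCESS_KEY SECRET_KEY HF_TOKEN || exit 1` before the `brev ls --json` and `aws s3 ls` checks.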
Instance Lifecycle
The agent controls instance lifecycle:
- Reuse: Pass `instance_id` in `backend_details` to run multiple jobs on the same instance. Efficient for multi-step workflows.
- Ephemeral: Omit `instance_id`; the handler creates a new instance per job. Clean but slower (instance boot ~2-5 min).
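At the CLI level, the same reuse-versus-ephemeral choice can be sketched as below. This illustrates the pattern and is not how `BrevHandler` is implemented. `run_on_brev`, `INSTANCE_ID`, and `BREV_GPU` are hypothetical names. The sketch assumes `brev create <name> --gpu …` needs no extra placement flags on the account, and it omits the readiness polling that a fresh instance needs before its first `brev exec`.

```shell
# Hypothetical helper: run a command on a Brev instance, reusing INSTANCE_ID
# when it is set and otherwise creating a throwaway instance that is deleted
# afterwards. Readiness polling after create is intentionally left out.
run_on_brev() {
  local instance="${INSTANCE_ID:-}" ephemeral=0 status=0
  if [ -z "$instance" ]; then
    instance="tao-job-$$"
    brev create "$instance" --gpu "${BREV_GPU:-L40S:1}" || return 1
    ephemeral=1
  fi
  brev exec "$instance" -- "$@" || status=$?
  if [ "$ephemeral" -eq 1 ]; then
    brev delete "$instance"   # plain delete, no confirmation flag
  fi
  return "$status"
}
```

Setting `INSTANCE_ID` once and calling `run_on_brev` several times gives the reuse behaviour; leaving it unset pays the boot cost on every call.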
Creating an instance — placement info
For accounts with more than one cloud credential or workspace group, plain `brev create` rejects the call with a placement error. Pass the account-specific IDs explicitly:

```bash
brev create my-instance \
  --gpu L40S:1 \
  --cloud-cred-id <cloudCredId> \
  --workspace-group-id <workspaceGroupId>
```

Discover the values once and stash them in `~/.config/tao/.env`:

```bash
brev ls --json | jq -r '.workspaces[0].workspaceGroupId'   # default group
brev orgs --json | jq -r '.[0].cloudCredentials[].id'      # cloud credential
```

When using the SDK, pass them through `backend_details`:

```python
BrevSDK().create_job(
    ...,
    backend_details={
        "cloud_cred_id": "<cloudCredId>",
        "workspace_group_id": "<workspaceGroupId>",
    },
)
```

Multi-GPU and multi-node
Multi-node is not supported on Brev. Brev is instance-based — one job runs on one instance, with no cross-instance coordination.
Multi-GPU on a single instance is supported (instances are available with up to 8× H100 / A100 / L40S). `gpu_count` maps to the GPU count on the instance; `torchrun --nproc-per-node=N` or PyTorch DDP works within the instance.
GPU Types
Available via `brev search`:
- L40S, A100 80GB, H100 (availability varies by provider)
- Use `--gpu-name` to filter, `--min-vram` for memory requirements
Storage
No shared NFS/Lustre. All data flows through S3 via the script_runner's fsspec integration. Instance-local disk under the login user's home directory (`$HOME`) persists across stop/start but not across delete/create.
Docker on Brev
VM Mode instances have Docker pre-installed. For TAO container images:
```bash
# NGC auth (one-time per instance)
brev exec <instance> -- docker login nvcr.io -u '$oauthtoken' -p <NGC_KEY>

# Run a TAO training job
brev exec <instance> -- docker run --gpus all --rm \
  -v $HOME/data:/data \
  nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt \
  visual_changenet train -e /data/spec.yaml
```

Wait for instance readiness before the first `brev exec`
A freshly created instance reports `RUNNING` long before sshd, hostname resolution, and the user shell are ready. The first `brev exec` against an unsettled instance fails with `hostname not resolvable`, `Connection refused`, or a silent timeout. Always poll until a trivial exec succeeds before issuing real work:

```bash
# Wait up to 5 minutes for shell readiness (covers the SSH bring-up window).
for i in $(seq 1 60); do
  brev exec <instance> -- true >/dev/null 2>&1 && break
  sleep 5
done
brev exec <instance> -- true >/dev/null 2>&1 || {
  echo "instance <instance> never became exec-ready"; exit 1;
}
```

`brev exec` timeout for cold-start workloads
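On a fresh instance, the SDK's `brev exec` wrapper can hit its `timeout` and report `exec failed` before remote startup finishes. Raise the timeout to ≥ 600 s for cold-start runs.
Mixed-Platform Workflows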
Brev can be mixed with Lepton in the same workflow. Per-stage platform assignment:
```json
{"skill": "vcn-gap-analysis", "action": "analyze", "platform": "brev"},
{"skill": "visual-changenet", "action": "train", "platform": "lepton"}
```

CPU stages (gap analysis, data merge) run cheaply on Brev. GPU stages (training) run on Lepton H100s.
Cleanup
```bash
brev delete <instance>   # plain delete, no flags
```

The CLI does not accept `--yes` / `-y`; passing it errors with `unknown flag: --yes`. `brev delete <instance>` is already non-interactive on recent CLIs, so no confirmation flag is needed.
Error Patterns
- brev CLI not found: Install from https://docs.nvidia.com/brev/.
- `brev ls` returns auth EOF even with `BREV_API_TOKEN` set: A headless shell has no stdin for the interactive auth prompt. Run `brev login --token "$BREV_API_TOKEN"` first, then retry. If the retry still fails, the token has expired; generate a new one.
- Token expired: The handler auto-refreshes via `brev login --token`. If the error persists, run `brev login` manually.
- `brev create` rejected with a placement error (missing `cloudCredId` / `workspaceGroupId`): Pass `--cloud-cred-id` and `--workspace-group-id` explicitly.
- `brev exec` fails with `hostname not resolvable` or `Connection refused` right after create: The instance reports `RUNNING` before sshd is up. Use the readiness wait loop from "Wait for instance readiness before the first `brev exec`" before issuing real commands.
- SDK exec timeout / `exec failed` on a fresh instance: The SDK's `brev exec` wrapper timed out before remote startup finished. Raise the timeout to ≥ 600 s for cold-start runs (see "`brev exec` timeout for cold-start workloads").
- `brev delete --yes` fails with `unknown flag: --yes`: Use plain `brev delete <instance>`.
- Instance stuck in provisioning: Some GPU types have limited availability. Try a different `--gpu-name` or provider.
- Docker pull fails on nvcr.io: `NGC_KEY` not set or expired. Run `docker login nvcr.io` on the instance.
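An agent script could route captured error text to the remediations above with a simple pattern match, sketched here. `brev_error_hint` is a hypothetical name. The match strings are the ones quoted in this document, so real CLI output may be worded differently and the patterns would need adjusting.

```shell
# Hypothetical helper: map an error message to a remediation hint.
# Patterns are the strings quoted in this document, not a verified list of
# CLI messages; the first matching pattern wins.
brev_error_hint() {
  case "$1" in
    *"unknown flag: --yes"*)
      echo "use plain 'brev delete <instance>'" ;;
    *"hostname not resolvable"*|*"Connection refused"*)
      echo "wait for exec readiness, then retry" ;;
    *"exec failed"*)
      echo "raise the exec timeout to at least 600 s" ;;
    *EOF*)
      echo "re-run the token login, then retry" ;;
    *)
      echo "no known remediation" ;;
  esac
}
```

For instance, `brev_error_hint "$(brev ls 2>&1 >/dev/null)"` would print a hint for whatever `brev ls` wrote to stderr.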