tao-run-platform

TAO Execution SDK

The SDK is the optional Python layer for users who need job handles, S3 I/O wrapping, or platform-specific features (Lepton multi-node, SLURM/Lustre queues, Kubernetes Jobs, local Docker debugging, Brev instance reuse). Most TAO skills run with just `docker run` and don't need it. Reach for the SDK when:

- You want a `Job` handle to poll status and stream logs over time.
- The platform is API-only (Lepton has no docker-run equivalent).
- You need S3-aware input download / output upload baked into the entrypoint.
- You're chaining multiple jobs and want persisted state.

Preflight

Install `nvidia-tao-sdk[all]` before using this platform. The `[all]` extra pulls in every platform-specific dependency (Lepton, Brev, S3 utilities, etc.):

```bash
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install nvidia-tao-sdk[all]"
  exit 1
}
```

The package index is environment-specific: the runner/container is expected to have a working `pip` configuration (e.g. `~/.pip/pip.conf`, `PIP_INDEX_URL`, `PIP_EXTRA_INDEX_URL`, or a proxy). If the install fails for index/network reasons, that's a runner setup issue; this skill stays agnostic to the registry.

If missing, the agent prompts the user to authorize the install via Bash, then re-runs the preflight. Never auto-install silently.
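The same presence check can also be written in Python without importing the package. This is a minimal sketch that assumes only that the importable module is named `tao_sdk`, as in the shell check above; the helper name `sdk_installed` is illustrative and not part of the SDK.

```python
import importlib.util


def sdk_installed(module_name: str = "tao_sdk") -> bool:
    """Return True if the top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if not sdk_installed():
        # Report only; the install itself is left to the user and never done silently.
        print("MISSING: nvidia-tao-sdk not installed. Run:")
        print("  pip install 'nvidia-tao-sdk[all]'")
```

Quoting the requirement in the printed hint avoids shells such as zsh treating the square brackets as a glob pattern.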

Setup

Credentials come from environment variables, sourced from `~/.config/tao/.env` (auto-loaded by the skill bank's SessionStart hook).

```python
from tao_sdk.platforms.lepton import LeptonSDK   # DGX Cloud
from tao_sdk.platforms.brev   import BrevSDK     # Brev GPU instances

sdk = LeptonSDK()    # reads LEPTON_WORKSPACE_ID, LEPTON_AUTH_TOKEN
# or
sdk = BrevSDK()      # reads BREV_API_TOKEN (optional; falls back to brev login)
```


Both SDKs validate credentials lazily on first use and raise `CredentialError` with a clear message if a required env var is missing. Required env vars:

| Platform | Required | Optional |
|---|---|---|
| Lepton | `LEPTON_WORKSPACE_ID`, `LEPTON_AUTH_TOKEN` | — |
| Brev | — (manual `brev login` works) | `BREV_API_TOKEN` |
| S3 I/O (any platform) | `S3_BUCKET_NAME`, `ACCESS_KEY`, `SECRET_KEY` | `S3_ENDPOINT_URL`, `CLOUD_REGION` |
| Container env | `NGC_KEY` | `HF_TOKEN` |

The agent never reads credential values — it only checks presence with `[ -n "$VAR_NAME" ]`.
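A presence-only check like the shell test above might look as follows in Python. This is a sketch under stated assumptions: the variable names are copied from the table, but the `REQUIRED_ENV` grouping keys and the `missing_env` helper are hypothetical and not part of the SDK.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Required variables per row of the table above; the dict keys are illustrative labels.
REQUIRED_ENV = {
    "lepton": ["LEPTON_WORKSPACE_ID", "LEPTON_AUTH_TOKEN"],
    "brev": [],
    "s3": ["S3_BUCKET_NAME", "ACCESS_KEY", "SECRET_KEY"],
    "container": ["NGC_KEY"],
}


def missing_env(names: Iterable[str],
                env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names that are unset or empty.

    Only names are returned; values are never printed or passed along,
    which mirrors the `[ -n "$VAR_NAME" ]` presence test.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]
```

For example, `missing_env(REQUIRED_ENV["lepton"])` yields the list of Lepton variables the user still needs to set, which can be reported by name without exposing any secret.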

Workflow Launch Intake

For any TAO workflow or action launch, first confirm the user goal. Then ask for platform and monitoring preferences before credentials or launch details. Generate the supported platform choices from the packaged helper, not by scanning platform docs or folders:

```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_platforms.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} --format text
```

Ask:

- Which supported platform should run this workflow?
- Should long-running monitoring stay enabled? Default: enabled. This means the agent remains attached and posts status until terminal state, including long `PENDING` queue waits.
- How many minutes between status updates? Default: 5 minutes.

After the model/action are known, resolve the default container image from the packaged metadata and ask the user to confirm it or provide `image=<override>` before creating runner files:

```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/resolve_tao_image.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} \
  --model <network_arch> --action <action> --format text
```

For train-capable model workflows, inspect model-level AutoML metadata before creating a plain training job:

```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_models.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} \
  --scope automl --format json
```

If the selected model has `automl_enabled: true` and a valid train schema, route training through `skills/applications/tao-run-automl` by default. A workflow should only bypass AutoML when its run settings include `automl_policy: off`, the user explicitly asks for a plain run, or the model metadata says AutoML is enabled but the train schema is not packaged yet.
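The routing rule above can be sketched as a small decision function. The field names `automl_enabled` and `automl_policy` come from the text; the `train_schema_packaged` key and the function itself are hypothetical, since the actual metadata layout is not shown here.

```python
from typing import Any, Mapping


def use_automl(model_meta: Mapping[str, Any],
               run_settings: Mapping[str, Any],
               user_wants_plain_run: bool = False) -> bool:
    """Return True when training should be routed through tao-run-automl."""
    if not model_meta.get("automl_enabled", False):
        return False
    # Hypothetical key standing in for "a valid train schema is packaged".
    if not model_meta.get("train_schema_packaged", False):
        return False
    policy = run_settings.get("automl_policy")
    # A YAML 1.1 loader such as PyYAML parses a bare `off` as boolean False,
    # so both the string and the boolean form are treated as "off".
    if policy is False or str(policy).lower() == "off":
        return False
    return not user_wants_plain_run
```

Checking for both `"off"` and `False` is a defensive choice: it keeps the bypass working whether or not the settings file quoted the value.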

After the platform is selected, get the credential filter:

```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_platforms.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} \
  --platform <platform> --format text
```

Ask only for credentials returned for the selected platform. For example, SLURM needs `SLURM_USER` and `SLURM_HOSTNAME`; it does not need Lepton credentials. Kubernetes and local Docker do not need Lepton or SLURM credentials. Ask for storage credentials such as S3 keys only when the selected platform and the data/result URIs require them.

Core API

All platform SDKs implement the same core shape:

```python
sdk.create_job(image, command, gpu_count=1, env_vars=None, inputs=None, outputs=None, **kwargs) -> Job
sdk.get_job_status(job_id) -> JobStatus
sdk.get_job_logs(job_id, tail=None) -> str
sdk.cancel_job(job_id) -> bool
sdk.get_failure_analysis(job_id) -> dict | None
sdk.get_job_results_dir(job_id) -> str
sdk.check_path(remote_path) -> bool
sdk.list_path(remote_path) -> list[str]
```

Lepton-only:

- `sdk.get_job_replicas(job_id)`: replica-level diagnostics for stuck-pending jobs.

Brev-only:

- `sdk.delete_instance(instance_id)`: clean up an ephemeral instance.
- `sdk.list_instances()`: list active instances.

Submitting a Job

The agent always constructs the container command via `build_entrypoint` before calling `create_job`. The agent reads the action's schema from `skill_info.yaml` (`command`, `config_format`, `inputs`, `outputs`, `upload_excludes`) and passes those fields as kwargs. `build_entrypoint` bakes the in-container `script_runner` runtime (inlined as a base64 heredoc) and the CLI invocation that, at runtime, downloads declared inputs, writes the spec file at `{config_path}` with remote URIs rewritten to local paths, runs the user command, and uploads outputs. The platform SDK's `create_job` runs the resulting command as-is, with no implicit wrapping.

`build_entrypoint` infers the mode (`config`, `args`, `passthrough`) from what you pass; you never pass `mode` explicitly. See `references/job-construction.md` for the full entrypoint contract, the spec/args construction strategy per action `mode`, the mode-inference table, and `resolve_container_image()`. See `references/outputs.md` for where outputs land (the runtime destination tables and per-platform injection policy) and the critical "spec is nested dicts, not flat dotted keys" rule. See `references/examples.md` for complete spec-driven and path-keyed `build_entrypoint` and `create_job` examples.
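To illustrate the "spec is nested dicts, not flat dotted keys" rule, the sketch below converts a flat dotted mapping into the nested form. The helper `nest_dotted_keys` is hypothetical and not part of the SDK, and `train.num_epochs` is an illustrative field name; only `dataset.train_csv` mirrors a field used elsewhere in this document.

```python
from typing import Any, Dict, Mapping


def nest_dotted_keys(flat: Mapping[str, Any]) -> Dict[str, Any]:
    """Turn {"a.b": 1} into {"a": {"b": 1}}."""
    nested: Dict[str, Any] = {}
    for dotted, value in flat.items():
        *parents, leaf = dotted.split(".")
        node = nested
        for key in parents:
            # Create intermediate dicts on demand and walk into them.
            node = node.setdefault(key, {})
        node[leaf] = value
    return nested


# Wrong shape for a spec: flat keys with dots in them.
flat = {
    "dataset.train_csv": "s3://my-bucket/coco/train.csv",
    "train.num_epochs": 10,  # illustrative field name
}

# Right shape: nested dictionaries.
specs = nest_dotted_keys(flat)
```

Building the spec as nested dicts from the start is simpler than converting; the helper is mainly useful when overrides arrive as `key.subkey=value` strings.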

Monitoring

```python
status = sdk.get_job_status(job.id)
print(status.status)   # Pending, Running, Complete, Error, Canceled
print(status.message)  # platform-specific detail

logs = sdk.get_job_logs(job.id, tail=200)
print(logs)
```

For stuck-Pending Lepton jobs, replica diagnostics reveal the cause (image pull, scheduling, mount errors):

```python
for r in sdk.get_job_replicas(job.id):
    issue = r["status"].get("readiness_issue")
    if issue:
        print(issue["reason"], issue["message"])
        # e.g. "InProgress" / "Pulling image"  (normal for big images)
        #      "Failed"     / "ImagePullBackOff" (NGC_KEY problem)
        #      "ConfigError" / "Mount point not found" (bad node)
```

On failure, `get_failure_analysis()` classifies the root cause:

```python
analysis = sdk.get_failure_analysis(job.id)
if analysis:
    print(analysis["err_class"])   # ERR_PROGRAM, ERR_INFRA, etc.
    print(analysis["suggestion"])  # human-readable fix
    for event in analysis.get("job_failure_by_node_event", []):
        print(event["node_event_name"], event["message"])  # OOM, GPU error, etc.
```
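When the analysis has to be relayed in a status message rather than printed field by field, it can be flattened first. The following is a sketch that assumes the dict keys shown above (`err_class`, `suggestion`, `job_failure_by_node_event`); the helper `summarize_failure` and the sample payload are illustrative, not SDK output.

```python
from typing import Any, List, Mapping, Optional


def summarize_failure(analysis: Optional[Mapping[str, Any]]) -> List[str]:
    """Flatten a failure-analysis dict into printable lines."""
    if not analysis:
        return ["no failure analysis available"]
    lines = [
        f"class: {analysis.get('err_class', 'unknown')}",
        f"suggestion: {analysis.get('suggestion', 'n/a')}",
    ]
    for event in analysis.get("job_failure_by_node_event", []):
        lines.append(
            f"node event: {event.get('node_event_name')}: {event.get('message')}"
        )
    return lines


# Made-up payload for illustration; real values would come from
# sdk.get_failure_analysis(job.id).
sample = {
    "err_class": "ERR_INFRA",
    "suggestion": "Retry on a different node",
    "job_failure_by_node_event": [
        {"node_event_name": "OOM", "message": "out of memory"},
    ],
}
summary = summarize_failure(sample)
```

Using `.get()` throughout keeps the summary usable even if a platform omits one of the fields.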

Polling pattern

For interactive runs where the user wants to watch:

```python
import time
status_interval_minutes = status_interval_minutes or 5
while True:
    status = sdk.get_job_status(job.id)
    if status.status in ("Complete", "Error", "Canceled"):
        break
    print(f"  {status.status}")
    time.sleep(status_interval_minutes * 60)

if status.status == "Error":
    print(sdk.get_job_logs(job.id, tail=100))
    print(sdk.get_failure_analysis(job.id))
```

With long-running monitoring enabled, do not stop after 30 minutes or after a few unchanged polls. Keep emitting updates every `status_interval_minutes` until the job finishes, fails, is canceled, or the user asks to detach/stop. If the chat/runtime cannot remain open that long, say so explicitly and provide the durable workflow/log path for manual status refresh.

Do not use a final response for non-terminal monitored jobs. Finalizing the turn detaches the chat watcher. Keep non-terminal status messages in progress updates and continue polling; only finalize at terminal state, explicit user detach/stop, or a real runtime limit that prevents further polling.
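The monitoring rules above (report on every interval, no give-up timeout, stop only at a terminal state) can be sketched as a reusable loop. The status strings come from the monitoring section; the function `wait_for_terminal` and its injected `report` and `sleep` parameters are assumptions made so the loop can be exercised without a real SDK.

```python
import time
from typing import Callable

TERMINAL_STATES = ("Complete", "Error", "Canceled")


def wait_for_terminal(get_status: Callable[[], str],
                      interval_minutes: float = 5,
                      report: Callable[[str], None] = print,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until a terminal state is seen, reporting every poll.

    There is deliberately no timeout and no "unchanged for N polls" exit:
    a long Pending queue wait keeps producing updates.
    """
    while True:
        state = get_status()
        report(state)
        if state in TERMINAL_STATES:
            return state
        sleep(interval_minutes * 60)
```

With a real SDK object this would be called as `wait_for_terminal(lambda: sdk.get_job_status(job.id).status)`, with `report` wired to whatever emits progress updates in the host runtime.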

For background runs, persist `job.id` and the `state_file` path, then re-attach later by constructing the same SDK and calling `get_job_status(job_id)`; job state is read from the on-disk store.
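One way to persist those two values between sessions is a small JSON handle file. This is only a sketch: the file name, the JSON layout, and the `save_handle` / `load_handle` helpers are hypothetical and separate from the SDK's own on-disk store.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def save_handle(path: Path, job_id: str, state_file: str, platform: str) -> None:
    """Write the minimum needed to re-attach to a background job later."""
    record = {"job_id": job_id, "state_file": state_file, "platform": platform}
    path.write_text(json.dumps(record, indent=2))


def load_handle(path: Path) -> Dict[str, str]:
    """Read the handle back; the caller rebuilds the same SDK from `platform`."""
    return json.loads(path.read_text())


# Round trip in a temporary directory, with made-up values for illustration.
with tempfile.TemporaryDirectory() as tmp:
    handle_path = Path(tmp) / "job_handle.json"
    save_handle(handle_path, "job-123", "/runs/demo/state.json", "lepton")
    restored = load_handle(handle_path)
```

After loading, the agent would construct the SDK named by `platform` and pass `restored["job_id"]` to `get_job_status`.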

Orchestration patterns

Multi-step workflows, parallel sweeps, and run-folder durability via `ActionWorkflow` live in `references/orchestration-patterns.md`. Read it before chaining `create_job` calls, sweeping a parameter, or persisting run state across context breaks.

Dataset utilities

When the skill's documented filenames don't match the user's layout, list the dataset to confirm:

```python
assert sdk.check_path("s3://my-bucket/coco/")
files = sdk.list_path("s3://my-bucket/coco/train/")
```

Use the actual paths to set spec fields.
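Comparing the listing against the filenames a skill documents can be automated. The sketch below assumes `list_path` returns either full URIs or bare names (the exact return format is not specified here); `missing_files` and the sample listing are illustrative.

```python
from typing import Iterable, List


def missing_files(expected: Iterable[str], listing: Iterable[str]) -> List[str]:
    """Return expected names that are not the last path component of any listed entry."""
    # Strip a trailing slash (directory entries), then keep the final component.
    present = {entry.rstrip("/").rsplit("/", 1)[-1] for entry in listing}
    return [name for name in expected if name not in present]


# Made-up listing; in practice this would be sdk.list_path("s3://my-bucket/coco/train/").
listing = [
    "s3://my-bucket/coco/train/images/",
    "s3://my-bucket/coco/train/annotations.json",
]
absent = missing_files(["annotations.json", "train.csv"], listing)
```

Any name in `absent` is a prompt to ask the user for the real filename before writing it into the spec.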


For S3 paths, strip trailing slashes when concatenating to avoid `//`:

```python
base = dataset_uri.rstrip("/")
specs["dataset"]["train_csv"] = f"{base}/train.csv"   # nested; see "spec is nested dicts"
```

Platform-specific notes

Each backend (Lepton, Brev, SLURM, Kubernetes, local Docker) has its own import path, storage model, distributed-training options, credential scope, and `create_job` kwargs. See `references/platform-notes.md` for the per-platform details before generating or launching runner artifacts for a given backend.

Error patterns

SDK error → root cause → fix mappings are in `references/error-patterns.md`. Read it when you hit a `CredentialError`, image-pull failure, stuck-Pending job, or similar; the entries map exception text to the underlying cause.

What the SDK does NOT do

Scope guardrails (no skill-reading, no HPO, no spec opinions, no auto-platform-selection, no workflow orchestration) live in `references/scope.md`.
