tao-run-platform

TAO Execution SDK


The SDK is the optional Python layer for users who need job handles, S3 I/O wrapping, or platform-specific features (Lepton multi-node, SLURM/Lustre queues, Kubernetes Jobs, local Docker debugging, Brev instance reuse). Most TAO skills run with just `docker run` and don't need it. Reach for the SDK when:
  • You want a `Job` handle to poll status and stream logs over time.
  • The platform is API-only (Lepton has no docker-run equivalent).
  • You need S3-aware input download / output upload baked into the entrypoint.
  • You're chaining multiple jobs and want persisted state.

Preflight


Install `nvidia-tao-sdk[all]` before using this platform. The `[all]` extra pulls in every platform-specific dependency (Lepton, Brev, S3 utilities, etc.):

```bash
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install nvidia-tao-sdk[all]"
  exit 1
}
```

The package index is environment-specific: the runner/container is expected to have a working `pip` configuration (e.g. `~/.pip/pip.conf`, `PIP_INDEX_URL`, `PIP_EXTRA_INDEX_URL`, or a proxy). If the install fails for index/network reasons, that's a runner setup issue; this skill stays agnostic to the registry.

If missing, the agent prompts the user to authorize the install via Bash, then re-runs the preflight. Never auto-install silently.
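
When the preflight runs from a Python script rather than a shell, the same presence check can be sketched as below. This is a minimal sketch, not part of the SDK: `tao_sdk` is the import name used in this section, while `is_importable` and `preflight_message` are hypothetical helper names. The helper only reports; it never installs anything.

```python
import importlib.util
from typing import Optional


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def preflight_message(module_name: str = "tao_sdk") -> Optional[str]:
    """Return None when the module is present, else the message to show the user."""
    if is_importable(module_name):
        return None
    return (
        "MISSING: nvidia-tao-sdk not installed. Run:\n"
        "  pip install nvidia-tao-sdk[all]"
    )


if __name__ == "__main__":
    message = preflight_message()
    if message is not None:
        # Report and stop; installation still needs explicit user approval.
        raise SystemExit(message)
```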

Setup


Credentials come from environment variables, sourced from `~/.config/tao/.env` (auto-loaded by the skill bank's SessionStart hook).

```python
from tao_sdk.platforms.lepton import LeptonSDK   # DGX Cloud
from tao_sdk.platforms.brev import BrevSDK       # Brev GPU instances

sdk = LeptonSDK()    # reads LEPTON_WORKSPACE_ID, LEPTON_AUTH_TOKEN
# or
sdk = BrevSDK()      # reads BREV_API_TOKEN (optional; falls back to brev login)
```

Both SDKs validate credentials lazily on first use and raise `CredentialError` with a clear message if a required env var is missing. Required env vars:

| Platform | Required | Optional |
|---|---|---|
| Lepton | `LEPTON_WORKSPACE_ID`, `LEPTON_AUTH_TOKEN` | — |
| Brev | — (manual `brev login` works) | `BREV_API_TOKEN` |
| S3 I/O (any platform) | `S3_BUCKET_NAME`, `ACCESS_KEY`, `SECRET_KEY` | `S3_ENDPOINT_URL`, `CLOUD_REGION` |
| Container env | `NGC_KEY` | `HF_TOKEN` |

The agent never reads credential values — it only checks presence with `[ -n "$VAR_NAME" ]`.
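
A presence-only check like the shell test above can also be sketched in Python. The variable names come from the table in this section; the `REQUIRED_ENV` mapping, its scope keys, and the `missing_env_vars` helper are hypothetical and not part of the SDK. The function returns only the names that are unset or empty and never returns or prints a value.

```python
import os
from typing import Mapping, Optional

# Required variables per scope, copied from the table above.
# The scope keys ("lepton", "brev", ...) are illustrative labels.
REQUIRED_ENV = {
    "lepton": ("LEPTON_WORKSPACE_ID", "LEPTON_AUTH_TOKEN"),
    "brev": (),  # a manual `brev login` is enough
    "s3": ("S3_BUCKET_NAME", "ACCESS_KEY", "SECRET_KEY"),
    "container": ("NGC_KEY",),
}


def missing_env_vars(scope: str, environ: Optional[Mapping[str, str]] = None) -> list:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    # Only truthiness is tested, mirroring `[ -n "$VAR_NAME" ]`.
    return [name for name in REQUIRED_ENV[scope] if not env.get(name)]
```

For example, `missing_env_vars("lepton")` on a machine with neither variable set would list both Lepton names, which the agent can then ask the user to provide.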

Workflow Launch Intake


For any TAO workflow or action launch, first confirm the user goal. Then ask for platform and monitoring preferences before credentials or launch details. Generate the supported platform choices from the packaged helper, not by scanning platform docs or folders:

```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_platforms.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} --format text
```

Ask:
  1. Which supported platform should run this workflow?
  2. Should long-running monitoring stay enabled? Default: enabled. This means the agent remains attached and posts status until terminal state, including long `PENDING` queue waits.
  3. How many minutes between status updates? Default: 5 minutes.

After the model and action are known, resolve the default container image from the packaged metadata and ask the user to confirm it or provide `image=<override>` before creating runner files:

```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/resolve_tao_image.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} \
  --model <network_arch> --action <action> --format text
```

For train-capable model workflows, inspect model-level AutoML metadata before creating a plain training job:

```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_models.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} \
  --scope automl --format json
```

If the selected model has `automl_enabled: true` and a valid train schema, route training through `skills/applications/tao-run-automl` by default. A workflow should only bypass AutoML when its run settings include `automl_policy: off`, the user explicitly asks for a plain run, or the model metadata says AutoML is enabled but the train schema is not packaged yet.

After the platform is selected, get the credential filter:

```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_platforms.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} \
  --platform <platform> --format text
```

Ask only for credentials returned for the selected platform. For example, SLURM needs `SLURM_USER` and `SLURM_HOSTNAME`; it does not need Lepton credentials. Kubernetes and local Docker do not need Lepton or SLURM credentials. Ask for storage credentials such as S3 keys only when the selected platform and the data/result URIs require them.
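
The AutoML routing rule described above can be written as a small decision function. This is a sketch under assumptions: `automl_enabled` and `automl_policy` are the keys named in this section, while `train_schema_packaged`, `choose_training_route`, and the `"automl"` / `"plain"` return labels are hypothetical. The check also accepts a boolean `False` for the policy, because a YAML 1.1 loader such as PyYAML reads an unquoted `off` as `False`.

```python
def choose_training_route(model_meta, run_settings, user_wants_plain_run=False):
    """Return "automl" or "plain" following the bypass rules described above."""
    # AutoML is only a candidate when the model metadata enables it.
    if not model_meta.get("automl_enabled", False):
        return "plain"
    # Enabled but the train schema is not packaged yet: bypass.
    # "train_schema_packaged" is a hypothetical key name.
    if not model_meta.get("train_schema_packaged", False):
        return "plain"
    # Run settings opt out with `automl_policy: off` (string or YAML boolean).
    policy = run_settings.get("automl_policy")
    if policy is False or str(policy).lower() == "off":
        return "plain"
    # The user explicitly asked for a plain run.
    if user_wants_plain_run:
        return "plain"
    return "automl"
```

Keeping the rule in one function makes the default explicit: every path that is not an explicit bypass routes to AutoML.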

Core API


All platform SDKs implement the same core shape:

```python
sdk.create_job(image, command, gpu_count=1, env_vars=None, inputs=None, outputs=None, **kwargs) -> Job
sdk.get_job_status(job_id) -> JobStatus
sdk.get_job_logs(job_id, tail=None) -> str
sdk.cancel_job(job_id) -> bool
sdk.get_failure_analysis(job_id) -> dict | None
sdk.get_job_results_dir(job_id) -> str
sdk.check_path(remote_path) -> bool
sdk.list_path(remote_path) -> list[str]
```

Lepton-only:
  • `sdk.get_job_replicas(job_id)`: replica-level diagnostics for stuck-pending jobs.

Brev-only:
  • `sdk.delete_instance(instance_id)`: clean up an ephemeral instance.
  • `sdk.list_instances()`: list active instances.

Submitting a Job


The agent always constructs the container command via `build_entrypoint` before calling `create_job`. The agent reads the action's schema from `skill_info.yaml` (`command`, `config_format`, `inputs`, `outputs`, `upload_excludes`) and passes those fields as kwargs. `build_entrypoint` bakes in the in-container `script_runner` runtime (inlined as a base64 heredoc) and the CLI invocation that, at runtime, downloads declared inputs, writes the spec file at `{config_path}` with remote URIs rewritten to local paths, runs the user command, and uploads outputs. The platform SDK's `create_job` runs the resulting command as-is, with no implicit wrapping.

`build_entrypoint` infers the mode (`config` / `args` / `passthrough`) from what you pass; you never pass `mode` explicitly. See `references/job-construction.md` for the full entrypoint contract, the spec/args construction strategy per action `mode`, the mode-inference table, and `resolve_container_image()`. See `references/outputs.md` for where outputs land (the runtime destination tables and per-platform injection policy) and the critical "spec is nested dicts, not flat dotted keys" rule. See `references/examples.md` for complete spec-driven and path-keyed `build_entrypoint` + `create_job` examples.
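
To illustrate the "spec is nested dicts, not flat dotted keys" rule, the sketch below converts a flat, dotted-key mapping into the nested form. `nest_dotted_keys` is a hypothetical helper, not an SDK function, and the field names are illustrative; the authoritative rule lives in `references/outputs.md`.

```python
def nest_dotted_keys(flat):
    """Convert {"a.b": 1} style keys into nested dicts: {"a": {"b": 1}}."""
    nested = {}
    for dotted, value in flat.items():
        *parents, leaf = dotted.split(".")
        node = nested
        for part in parents:
            # Create intermediate dicts on demand.
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                raise ValueError(f"{dotted!r} conflicts with a scalar at {part!r}")
        if isinstance(node.get(leaf), dict):
            raise ValueError(f"{dotted!r} would overwrite a nested section")
        node[leaf] = value
    return nested


# Illustrative field names only.
flat = {"dataset.train_csv": "s3://my-bucket/coco/train.csv", "train.num_epochs": 10}
specs = nest_dotted_keys(flat)
```

Here `specs["dataset"]["train_csv"]` holds the URI, whereas a flat `specs["dataset.train_csv"]` key would not follow the nested-dict rule.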

Monitoring


```python
status = sdk.get_job_status(job.id)
print(status.status)   # Pending, Running, Complete, Error, Canceled
print(status.message)  # platform-specific detail

logs = sdk.get_job_logs(job.id, tail=200)
print(logs)
```

For stuck-Pending Lepton jobs, replica diagnostics reveal the cause (image pull, scheduling, mount errors):

```python
for r in sdk.get_job_replicas(job.id):
    issue = r["status"].get("readiness_issue")
    if issue:
        print(issue["reason"], issue["message"])
        # e.g. "InProgress" / "Pulling image"  (normal for big images)
        #      "Failed"     / "ImagePullBackOff" (NGC_KEY problem)
        #      "ConfigError" / "Mount point not found" (bad node)
```

On failure, `get_failure_analysis()` classifies the root cause:

```python
analysis = sdk.get_failure_analysis(job.id)
if analysis:
    print(analysis["err_class"])   # ERR_PROGRAM, ERR_INFRA, etc.
    print(analysis["suggestion"])  # human-readable fix
    for event in analysis.get("job_failure_by_node_event", []):
        print(event["node_event_name"], event["message"])  # OOM, GPU error, etc.
```
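
When the analysis has to go into a status update rather than straight to stdout, it can be flattened into a few lines first. The sketch below assumes the dictionary keys shown above (`err_class`, `suggestion`, `job_failure_by_node_event`, `node_event_name`, `message`); `summarize_failure` is a hypothetical helper, and it tolerates both a `None` result and missing keys.

```python
def summarize_failure(analysis):
    """Turn a failure-analysis dict (or None) into short report lines."""
    if not analysis:
        return ["No failure analysis available."]
    lines = [f"class: {analysis.get('err_class', 'unknown')}"]
    suggestion = analysis.get("suggestion")
    if suggestion:
        lines.append(f"suggestion: {suggestion}")
    # The event list may be absent or None; treat both as empty.
    for event in analysis.get("job_failure_by_node_event") or []:
        name = event.get("node_event_name", "unknown event")
        message = event.get("message")
        lines.append(f"node event: {name}: {message}" if message else f"node event: {name}")
    return lines
```

Called as `summarize_failure(sdk.get_failure_analysis(job.id))`, it always returns a non-empty list, so the caller does not need a separate `None` check.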

Polling pattern


For interactive runs where the user wants to watch:

```python
import time

# Minutes between status updates; use the interval chosen at intake (default: 5).
status_interval_minutes = 5

while True:
    status = sdk.get_job_status(job.id)
    if status.status in ("Complete", "Error", "Canceled"):
        break
    print(f"  {status.status}")
    time.sleep(status_interval_minutes * 60)

# On failure, show recent logs and the classified root cause.
if status.status == "Error":
    print(sdk.get_job_logs(job.id, tail=100))
    print(sdk.get_failure_analysis(job.id))
```

With long-running monitoring enabled, do not stop after 30 minutes or after a few unchanged polls. Keep emitting updates every `status_interval_minutes` minutes until the job finishes, fails, is canceled, or the user asks to detach/stop. If the chat/runtime cannot remain open that long, say so explicitly and provide the durable workflow/log path for manual status refresh.

Do not use a final response for non-terminal monitored jobs. Finalizing the turn detaches the chat watcher. Keep non-terminal status messages in progress updates and continue polling; only finalize at terminal state, explicit user detach/stop, or a real runtime limit that prevents further polling.

For background runs, persist `job.id` and the `state_file` path, then re-attach later by constructing the same SDK and calling `get_job_status(job_id)`; job state is read from the on-disk store.
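
The polling loop can also be wrapped in a function so the same code serves a fresh launch and a later re-attach with a persisted job ID. This is a sketch, not SDK code: `poll_until_terminal` is a hypothetical name, and the status getter, sleep function, and reporter are injected so the loop can be exercised without a real platform. It assumes only that the status object exposes a `.status` string with the states listed under Monitoring.

```python
import time

# Terminal states as listed in the Monitoring section.
TERMINAL_STATES = ("Complete", "Error", "Canceled")


def poll_until_terminal(get_status, job_id, interval_minutes=5,
                        sleep=time.sleep, report=print):
    """Report status every interval until the job reaches a terminal state."""
    while True:
        status = get_status(job_id)
        # Emit an update on every poll, including unchanged states.
        report(f"{job_id}: {status.status}")
        if status.status in TERMINAL_STATES:
            return status
        sleep(interval_minutes * 60)
```

With a real SDK object this would be called as `poll_until_terminal(sdk.get_job_status, job.id, interval_minutes=5)`. The loop has no poll-count or wall-clock cutoff, matching the guidance that monitoring continues through long queue waits.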

Orchestration patterns


Multi-step workflows, parallel sweeps, and run-folder durability via `ActionWorkflow` live in `references/orchestration-patterns.md`. Read it before chaining `create_job` calls, sweeping a parameter, or persisting run state across context breaks.

Dataset utilities


When the skill's documented filenames don't match the user's layout, list the dataset to confirm:

```python
assert sdk.check_path("s3://my-bucket/coco/")
files = sdk.list_path("s3://my-bucket/coco/train/")
```

Use the actual paths to set spec fields.



For S3 paths, strip trailing slashes when concatenating to avoid `//`:

```python
base = dataset_uri.rstrip("/")
specs["dataset"]["train_csv"] = f"{base}/train.csv"   # nested; see "spec is nested dicts"
```

Platform-specific notes


Each backend (Lepton, Brev, SLURM, Kubernetes, local Docker) has its own import path, storage model, distributed-training options, credential scope, and `create_job` kwargs. See `references/platform-notes.md` for the per-platform details before generating or launching runner artifacts for a given backend.

Error patterns


SDK error → root cause → fix mappings are in `references/error-patterns.md`. Read it when you hit a `CredentialError`, image-pull failure, stuck-Pending job, or similar; the entries map exception text to the underlying cause.

What the SDK does NOT do


Scope guardrails (no skill-reading, no HPO, no spec opinions, no auto-platform-selection, no workflow orchestration) live in `references/scope.md`.