tao-run-platform
TAO Execution SDK
The SDK is the optional Python layer for users who need job handles, S3 I/O wrapping, or platform-specific features (Lepton multi-node, SLURM/Lustre queues, Kubernetes Jobs, local Docker debugging, Brev instance reuse). Most TAO skills run with just `docker run` and don't need it. Reach for the SDK when:
- You want a `Job` handle to poll status and stream logs over time.
- The platform is API-only (Lepton has no docker-run equivalent).
- You need S3-aware input download / output upload baked into the entrypoint.
- You're chaining multiple jobs and want persisted state.
Preflight
Install `nvidia-tao-sdk[all]` before using this platform. The `[all]` extra pulls in every platform-specific dependency (Lepton, Brev, S3 utilities, etc.):
```bash
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo " pip install nvidia-tao-sdk[all]"
  exit 1
}
```
The package index is environment-specific: the runner/container is expected to have a working `pip` configuration (e.g. `~/.pip/pip.conf`, `PIP_INDEX_URL`, `PIP_EXTRA_INDEX_URL`, or proxy). If the install fails for index/network reasons, that's a runner setup issue; this skill stays agnostic to the registry.

If the package is missing, the agent prompts the user to authorize the install via Bash, then re-runs the preflight. Never auto-install silently.
Setup
Credentials come from environment variables, sourced from `~/.config/tao/.env` (auto-loaded by the skill bank's SessionStart hook).
```python
from tao_sdk.platforms.lepton import LeptonSDK  # DGX Cloud
from tao_sdk.platforms.brev import BrevSDK      # Brev GPU instances

sdk = LeptonSDK()  # reads LEPTON_WORKSPACE_ID, LEPTON_AUTH_TOKEN
# or
sdk = BrevSDK()    # reads BREV_API_TOKEN (optional; falls back to brev login)
```
Both SDKs validate credentials lazily on first use and raise `CredentialError` with a clear message if a required env var is missing. Required env vars:
| Platform | Required | Optional |
|---|---|---|
| Lepton | `LEPTON_WORKSPACE_ID`, `LEPTON_AUTH_TOKEN` | — |
| Brev | — (manual `brev login` works) | `BREV_API_TOKEN` |
| S3 I/O (any platform) | `S3_BUCKET_NAME`, `ACCESS_KEY`, `SECRET_KEY` | `S3_ENDPOINT_URL`, `CLOUD_REGION` |
| Container env | `NGC_KEY` | `HF_TOKEN` |
The agent never reads credential values; it only checks presence with `[ -n "$VAR_NAME" ]`.
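The same presence-only check can be sketched in Python. This is a minimal illustration, not part of the SDK: the `REQUIRED_VARS` mapping is copied from the table above, and the `missing_vars` helper name is hypothetical. It reports only the names of unset variables and never returns or prints a value.

```python
import os

# Required variables per scope, copied from the table above.
REQUIRED_VARS = {
    "lepton": ["LEPTON_WORKSPACE_ID", "LEPTON_AUTH_TOKEN"],
    "brev": [],  # manual `brev login` works; BREV_API_TOKEN is optional
    "s3": ["S3_BUCKET_NAME", "ACCESS_KEY", "SECRET_KEY"],
    "container": ["NGC_KEY"],
}


def missing_vars(scope, env=None):
    """Return the names of required variables that are unset or empty.

    Mirrors `[ -n "$VAR" ]`: only names are reported, never values.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS[scope] if not env.get(name)]


if __name__ == "__main__":
    for scope in REQUIRED_VARS:
        missing = missing_vars(scope)
        print(scope, "ok" if not missing else f"missing: {', '.join(missing)}")
```

Passing an explicit `env` dict keeps the helper testable without touching the real environment.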
Workflow Launch Intake
For any TAO workflow or action launch, first confirm the user goal. Then ask
for platform and monitoring preferences before credentials or launch details.
Generate the supported platform choices from the packaged helper, not by
scanning platform docs or folders:
```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_platforms.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} --format text
```
Ask:
- Which supported platform should run this workflow?
- Should long-running monitoring stay enabled? Default: enabled. This means
  the agent remains attached and posts status until terminal state, including
  long `PENDING` queue waits.
- How many minutes between status updates? Default: 5 minutes.
After the model/action are known, resolve the default container image from the
packaged metadata and ask the user to confirm it or provide `image=<override>`
before creating runner files:
```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/resolve_tao_image.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} \
  --model <network_arch> --action <action> --format text
```
For train-capable model workflows, inspect model-level AutoML metadata before
creating a plain training job:
```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_models.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} \
  --scope automl --format json
```
If the selected model has `automl_enabled: true` and a valid train schema,
route training through `skills/applications/tao-run-automl` by default. A workflow should
only bypass AutoML when its run settings include `automl_policy: off`, the user
explicitly asks for a plain run, or the model metadata says AutoML is enabled
but the train schema is not packaged yet.

After the platform is selected, get the credential filter:
```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_platforms.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} \
  --platform <platform> --format text
```
Ask only for credentials returned for the selected platform. For example, SLURM
needs `SLURM_USER` and `SLURM_HOSTNAME`; it does not need Lepton credentials.
Kubernetes and local Docker do not need Lepton or SLURM credentials. Ask for storage
credentials such as S3 keys only when the selected platform and the data/result
URIs require them.
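The AutoML routing rule described earlier in this section reduces to a small decision function. The sketch below is illustrative only: the function name `should_route_to_automl` and its parameters (`train_schema_packaged`, `user_wants_plain_run`) are hypothetical, and the real metadata layout returned by `list_tao_models.py` may differ.

```python
def should_route_to_automl(
    automl_enabled,
    train_schema_packaged,
    automl_policy=None,
    user_wants_plain_run=False,
):
    """Return True when training should go through tao-run-automl."""
    if not automl_enabled:
        return False  # model metadata does not enable AutoML
    if not train_schema_packaged:
        return False  # AutoML enabled, but the train schema is not packaged yet
    # YAML 1.1 loaders may parse a bare `off` as the boolean False,
    # so accept both spellings of the bypass setting.
    if automl_policy in ("off", False):
        return False  # run settings explicitly bypass AutoML
    if user_wants_plain_run:
        return False  # user explicitly asked for a plain run
    return True


print(should_route_to_automl(True, True))                       # routed through AutoML
print(should_route_to_automl(True, True, automl_policy="off"))  # plain run
```

Keeping the rule in one pure function makes each bypass condition easy to audit against the prose.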
Core API
All platform SDKs implement the same core shape:
```python
sdk.create_job(image, command, gpu_count=1, env_vars=None, inputs=None, outputs=None, **kwargs) -> Job
sdk.get_job_status(job_id) -> JobStatus
sdk.get_job_logs(job_id, tail=None) -> str
sdk.cancel_job(job_id) -> bool
sdk.get_failure_analysis(job_id) -> dict | None
sdk.get_job_results_dir(job_id) -> str
sdk.check_path(remote_path) -> bool
sdk.list_path(remote_path) -> list[str]
```
Lepton-only:
- `sdk.get_job_replicas(job_id)`: replica-level diagnostics for stuck-pending jobs.

Brev-only:
- `sdk.delete_instance(instance_id)`: clean up an ephemeral instance.
- `sdk.list_instances()`: list active instances.
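Because every backend exposes the same methods, calling code can be written against a structural type rather than a concrete class. The sketch below assumes nothing about the real `tao_sdk` internals: `JobBackend`, `FakeBackend`, and the simplified `Job` / `JobStatus` dataclasses are hypothetical stand-ins that only mirror a subset of the signatures listed above. They are handy for dry-running orchestration logic without a cluster.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Job:  # hypothetical stand-in for the SDK's Job handle
    id: str


@dataclass
class JobStatus:  # hypothetical stand-in; mirrors .status and .message
    status: str
    message: str = ""


class JobBackend(Protocol):
    """Subset of the core shape listed above, as a structural type."""

    def create_job(self, image: str, command: str, gpu_count: int = 1, **kwargs) -> Job: ...
    def get_job_status(self, job_id: str) -> JobStatus: ...
    def get_job_logs(self, job_id: str, tail: Optional[int] = None) -> str: ...
    def cancel_job(self, job_id: str) -> bool: ...


class FakeBackend:
    """In-memory backend for dry runs; not part of tao_sdk."""

    def __init__(self):
        self._jobs = {}

    def create_job(self, image, command, gpu_count=1, **kwargs):
        job = Job(id=f"fake-{len(self._jobs) + 1}")
        self._jobs[job.id] = JobStatus("Pending")
        return job

    def get_job_status(self, job_id):
        return self._jobs[job_id]

    def get_job_logs(self, job_id, tail=None):
        return ""

    def cancel_job(self, job_id):
        if job_id not in self._jobs:
            return False
        self._jobs[job_id] = JobStatus("Canceled", "canceled by caller")
        return True


def submit_and_cancel(backend: JobBackend) -> str:
    """Works with any object that satisfies JobBackend."""
    job = backend.create_job("example/image:tag", "echo hello")
    backend.cancel_job(job.id)
    return backend.get_job_status(job.id).status


print(submit_and_cancel(FakeBackend()))
```

A `Protocol` is used instead of a base class so that the real platform SDK objects would satisfy it without inheriting from anything defined here.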
Submitting a Job
The agent always constructs the container command via `build_entrypoint` before calling `create_job`. The agent reads the action's schema from `skill_info.yaml` (`command`, `config_format`, `inputs`, `outputs`, `upload_excludes`) and passes those fields as kwargs. `build_entrypoint` bakes in the in-container `script_runner` runtime (inlined as a base64 heredoc) and the CLI invocation that, at runtime, downloads declared inputs, writes the spec file at `{config_path}` with remote URIs rewritten to local paths, runs the user command, and uploads outputs. The platform SDK's `create_job` runs the resulting command as-is, with no implicit wrapping.

See also `references/job-construction.md`, `references/outputs.md`, and `references/examples.md`.
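One step of that runtime, rewriting remote URIs in the spec to local paths, can be pictured with the sketch below. It is not the `script_runner` implementation: the function name `rewrite_remote_uris`, the `/inputs` download root, and the `s3://` prefix test are all assumptions made for illustration.

```python
import posixpath


def rewrite_remote_uris(spec, local_root="/inputs"):
    """Return a copy of a nested spec with s3:// URIs mapped under local_root."""
    if isinstance(spec, dict):
        return {k: rewrite_remote_uris(v, local_root) for k, v in spec.items()}
    if isinstance(spec, list):
        return [rewrite_remote_uris(v, local_root) for v in spec]
    if isinstance(spec, str) and spec.startswith("s3://"):
        # s3://bucket/key/path -> <local_root>/bucket/key/path
        return posixpath.join(local_root, spec[len("s3://"):])
    return spec  # numbers, booleans, and local strings pass through unchanged


spec = {"dataset": {"train_csv": "s3://my-bucket/coco/train.csv"}, "epochs": 10}
print(rewrite_remote_uris(spec))
# {'dataset': {'train_csv': '/inputs/my-bucket/coco/train.csv'}, 'epochs': 10}
```

The function returns a new structure rather than mutating its argument, so the original spec with remote URIs stays available for logging or upload bookkeeping.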
Monitoring
```python
status = sdk.get_job_status(job.id)
print(status.status)   # Pending, Running, Complete, Error, Canceled
print(status.message)  # platform-specific detail
logs = sdk.get_job_logs(job.id, tail=200)
print(logs)
```
For stuck-Pending Lepton jobs, replica diagnostics reveal the cause (image pull, scheduling, mount errors):
```python
for r in sdk.get_job_replicas(job.id):
    issue = r["status"].get("readiness_issue")
    if issue:
        print(issue["reason"], issue["message"])
        # e.g. "InProgress" / "Pulling image" (normal for big images)
        #      "Failed" / "ImagePullBackOff" (NGC_KEY problem)
        #      "ConfigError" / "Mount point not found" (bad node)
```
On failure, `get_failure_analysis()` classifies the root cause:
```python
analysis = sdk.get_failure_analysis(job.id)
if analysis:
    print(analysis["err_class"])   # ERR_PROGRAM, ERR_INFRA, etc.
    print(analysis["suggestion"])  # human-readable fix
    for event in analysis.get("job_failure_by_node_event", []):
        print(event["node_event_name"], event["message"])  # OOM, GPU error, etc.
```
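The replica diagnostics can be turned into short hints with a lookup keyed on the message text. The sketch below is hypothetical: `triage_replicas` and `HINTS` are not SDK names, the hints are copied from the comments in the replica example above, and the replica dict shape is assumed to match that example.

```python
# Hints copied from the comments in the replica example above.
HINTS = {
    "Pulling image": "normal for big images",
    "ImagePullBackOff": "NGC_KEY problem",
    "Mount point not found": "bad node",
}


def triage_replicas(replicas):
    """Return (reason, message, hint) tuples for replicas that report an issue."""
    findings = []
    for replica in replicas:
        issue = replica.get("status", {}).get("readiness_issue")
        if not issue:
            continue  # replica is healthy or has nothing to report
        message = issue.get("message", "")
        hint = next(
            (text for key, text in HINTS.items() if key in message),
            "unrecognized; inspect logs",
        )
        findings.append((issue.get("reason", ""), message, hint))
    return findings


sample = [
    {"status": {"readiness_issue": {"reason": "Failed", "message": "ImagePullBackOff"}}},
    {"status": {}},
]
print(triage_replicas(sample))
```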
Polling pattern
For interactive runs where the user wants to watch:
```python
import time

# status_interval_minutes comes from the launch intake; fall back to 5 if unset.
status_interval_minutes = status_interval_minutes or 5
while True:
    status = sdk.get_job_status(job.id)
    if status.status in ("Complete", "Error", "Canceled"):
        break
    print(f"  {status.status}")
    time.sleep(status_interval_minutes * 60)
if status.status == "Error":
    print(sdk.get_job_logs(job.id, tail=100))
    print(sdk.get_failure_analysis(job.id))
```
With long-running monitoring enabled, do not stop after 30 minutes or after a
few unchanged polls. Keep emitting updates every `status_interval_minutes`
until the job finishes, fails, is canceled, or the user asks to detach/stop.
If the chat/runtime cannot remain open that long, say so explicitly and provide
the durable workflow/log path for manual status refresh.

Do not use a final response for non-terminal monitored jobs. Finalizing the
turn detaches the chat watcher. Keep non-terminal status messages in progress
updates and continue polling; only finalize at terminal state, explicit user
detach/stop, or a real runtime limit that prevents further polling.

For background runs, persist `job.id` and the `state_file` path, then re-attach later by constructing the same SDK and calling `get_job_status(job_id)`; job state is read from the on-disk store.
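The polling loop and the background re-attach step can be wrapped into small helpers. The sketch below is an illustration under stated assumptions: `poll_until_terminal`, `save_handle`, `load_handle`, and the JSON layout of the handle file are hypothetical, and `sdk` is any object exposing `get_job_status(job_id)` as in the Core API. A stub is used so the example runs without a cluster.

```python
import json
import time
from types import SimpleNamespace

TERMINAL = ("Complete", "Error", "Canceled")


def save_handle(path, job_id, state_file):
    """Persist the job id and state_file path for a later re-attach (hypothetical layout)."""
    with open(path, "w") as fh:
        json.dump({"job_id": job_id, "state_file": state_file}, fh)


def load_handle(path):
    """Read back what save_handle wrote."""
    with open(path) as fh:
        return json.load(fh)


def poll_until_terminal(sdk, job_id, interval_minutes=5, sleep=time.sleep, report=print):
    """Report every status until a terminal state is reached, then return it."""
    while True:
        status = sdk.get_job_status(job_id)
        report(f"{job_id}: {status.status}")
        if status.status in TERMINAL:
            return status
        sleep(interval_minutes * 60)


class StubSDK:
    """Stand-in that walks through a fixed status sequence."""

    def __init__(self, sequence):
        self._sequence = iter(sequence)

    def get_job_status(self, job_id):
        return SimpleNamespace(status=next(self._sequence), message="")


final = poll_until_terminal(
    StubSDK(["Pending", "Running", "Complete"]), "job-1", sleep=lambda seconds: None
)
print(final.status)
```

Injecting `sleep` and `report` keeps the loop testable and lets the caller route updates to progress messages instead of a final response.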
Orchestration patterns
Multi-step workflows, parallel sweeps, and run-folder durability via
`ActionWorkflow` live in `references/orchestration-patterns.md`.
Read it before chaining `create_job` calls, sweeping a parameter, or
persisting run state across context breaks.
Dataset utilities
When the skill's documented filenames don't match the user's layout, list the dataset to confirm:
```python
assert sdk.check_path("s3://my-bucket/coco/")
files = sdk.list_path("s3://my-bucket/coco/train/")
```
Use the actual paths to set spec fields.

For S3 paths, strip trailing slashes when concatenating to avoid `//`:
```python
base = dataset_uri.rstrip("/")
specs["dataset"]["train_csv"] = f"{base}/train.csv"  # nested; see "spec is nested dicts"
```
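When several segments are joined, the same slash handling can live in one helper. The sketch below is a hypothetical convenience function (`join_uri` is not an SDK name) that strips slashes at the seams while leaving the `//` after the scheme untouched.

```python
def join_uri(base, *parts):
    """Join URI segments with single slashes, keeping the scheme's '//' intact."""
    cleaned = [part.strip("/") for part in parts]
    segments = [base.rstrip("/")] + [part for part in cleaned if part]
    return "/".join(segments)


print(join_uri("s3://my-bucket/coco/", "train.csv"))
# s3://my-bucket/coco/train.csv
print(join_uri("s3://my-bucket/coco//", "/train/", "labels.csv"))
# s3://my-bucket/coco/train/labels.csv
```

Only the trailing slashes of `base` are stripped, so the double slash in `s3://` is never affected.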
Platform-specific notes
Each backend (Lepton, Brev, SLURM, Kubernetes, local Docker) has its own import
path, storage model, distributed-training options, credential scope, and
`create_job` kwargs. See `references/platform-notes.md` for the
per-platform details before generating or launching runner artifacts for a
given backend.
Error patterns
SDK error → root cause → fix mappings are in
`references/error-patterns.md`. Read it when
you hit a `CredentialError`, image-pull failure, stuck-Pending job, or
similar; the entries map exception text to the underlying cause.
What the SDK does NOT do
Scope guardrails (no skill-reading, no HPO, no spec opinions, no
auto-platform-selection, no workflow orchestration) live in
`references/scope.md`.