tao-launch-workflow

TAO Workflow Launch Intake

Use this skill before launching any TAO workflow or model action.

Quick Start

Run the platform helper, ask for platform and monitoring preferences, then run the selected platform detail helper before asking for credentials.

Non-Negotiable Launch Gate

Do not create runner scripts, launch scripts, compatibility shims, workspace folders, state files, logs, or dependency-install side effects until the launch preflight passes.

Preflight passes only after all of these are true:

- The execution platform is selected from the packaged platform helper.
- Platform credentials and required credential groups are satisfied.
- Model-specific credentials are satisfied.
- The default container image is resolved from packaged model/action metadata, shown to the user, and either confirmed or replaced by an explicit `image=<override>`.
- The platform access check succeeds from the launch host.
- Dataset inputs are mapped to concrete spec keys and verified from the selected platform's point of view.
- Required compute shape fields from the model/workflow skill are known.

If any item is missing, ask for the missing input and stop before generating artifacts. This applies to AutoML, normal train/eval/infer/export/TRT, and DEFT/application workflows.
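
The gate above can be pictured as a checklist that must be fully satisfied before anything is written to disk. The following is a minimal sketch, not the packaged helper's implementation; the class name `LaunchPreflight` and its field names are hypothetical labels for the seven conditions listed above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class LaunchPreflight:
    """One flag per gate item; names are illustrative, not from the TAO helpers."""
    platform_selected: bool = False
    platform_credentials_ok: bool = False
    model_credentials_ok: bool = False
    image_confirmed: bool = False
    platform_access_ok: bool = False
    datasets_verified: bool = False
    compute_shape_known: bool = False

    def missing(self) -> List[str]:
        """Return the names of gate items that are not yet satisfied."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def passed(self) -> bool:
        """The gate passes only when nothing is missing."""
        return not self.missing()


def may_create_artifacts(state: LaunchPreflight) -> bool:
    """Scripts, workspaces, logs, and state files are allowed only after a full pass."""
    return state.passed()
```

If `missing()` returns anything, the agent would ask for those inputs and stop rather than generate artifacts.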

Initial Questions

After the user confirms what they want to do, ask for the execution platform using the packaged helper. Do not scan platform docs, skill folders, or config folders to build the choices.

bash

${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_platforms.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} --format text

Then ask:

1. Which supported platform should run this workflow?
2. Should I monitor the run in this chat? Monitoring means I keep polling the backend/job logs after launch and report progress until the job finishes, fails, or you ask me to stop, even if the job stays queued for hours or days. If disabled, I launch the job, give you the job id/log path, and stop polling. Default: monitor in chat.
3. How often should I post status? Default: every 5 minutes. Use 1-2 minutes for smoke tests, 5 minutes for normal training, or 10-15 minutes for long runs.

Use `long_running_enabled=true` and `status_interval_minutes=5` when the user accepts the defaults.

When monitoring is enabled, do not send a final summary just because several polls have elapsed or the job is still `PENDING`. Keep the turn attached and emit status every `status_interval_minutes` minutes until a terminal state or explicit user stop/detach request. If the runtime environment cannot keep the chat turn open, say that clearly and leave a durable watcher/log path; do not imply that chat updates will continue after the turn ends.

Final-answer rule: a `final` response ends chat-side monitoring. While `long_running_enabled=true` and any launched job is non-terminal, status messages must be sent as in-progress updates and the agent must continue polling. Only send a final response when the workflow reaches terminal state, the user explicitly asks to detach/stop monitoring, or the runtime genuinely cannot keep the turn open; in that last case, say it is a runtime limitation and provide the exact durable status command/log path.
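
The monitoring and final-answer rules above amount to a loop that keeps reporting until a terminal state or an explicit stop. The sketch below illustrates that shape only; the `TERMINAL_STATES` names and the `poll`, `report`, and `stop_requested` callables are assumptions, since real backends expose their own state labels and APIs.

```python
import time
from typing import Callable

# Assumed terminal state names; real backends may use different labels.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def monitor(
    poll: Callable[[], str],
    report: Callable[[str], None],
    status_interval_minutes: float = 5,
    stop_requested: Callable[[], bool] = lambda: False,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal state or an explicit stop; return the last state seen."""
    while True:
        state = poll()
        report(state)  # an in-progress update, not a final response
        if state in TERMINAL_STATES or stop_requested():
            return state
        # A job that is still PENDING is not a reason to stop; just wait and poll again.
        sleep(status_interval_minutes * 60)
```

Injecting `sleep` keeps the loop testable without waiting; the default interval mirrors `status_interval_minutes=5`.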

Missing-Input Prompt Shape

When asking for launch inputs, include concrete examples and both dataset input modes. Do not ask only for "dataset root".

Use this structure and adapt spec keys to the selected model/action:

text

I need these launch inputs before I can create specs or runner files:

1. Execution platform: lepton, brev, slurm, local-docker, or kubernetes.

2. Dataset inputs. You can provide either mode:
   A) Root mode: give train/eval roots and I map required files automatically.
      Example Cosmos-RL:
      train_root=/lustre/fsw/.../cosmos/train
      -> custom.train_dataset.annotation_path=train_root/annotations.json
      -> custom.train_dataset.media_path=train_root
   B) Direct spec mode: give the exact config/spec parameters yourself.
      Example:
      custom.train_dataset.annotation_path=/lustre/fsw/.../train_annotations.json
      custom.train_dataset.media_path=/lustre/fsw/.../videos_train.tar.gz
      custom.val_dataset.annotation_path=/lustre/fsw/.../eval_annotations.json
      custom.val_dataset.media_path=/lustre/fsw/.../eval_videos/

   Platform examples:
   - SLURM/Lustre: /lustre/fsw/.../data/train or lustre:///lustre/fsw/.../data/train
   - Lepton/Brev/Kubernetes: s3://bucket/path/train and s3://bucket/path/eval
   - local-docker: /data/tao/<model>/train or file:///data/tao/<model>/eval

3. Container image. I will resolve the default from packaged model metadata and
   show it before launch, for example:
   default image for <model>/<action>: <resolved container image>
   Use this image, or provide image=<override> to pin a different TAO build.

4. Compute shape required by the model, for example GPUs/nodes.

5. Required credentials from platform/model docs, for example HF_TOKEN for
   gated Hugging Face models.

6. Monitoring preference. By default I monitor in this chat and post progress
   every 5 minutes; choose 1-2 minutes for smoke tests or 10-15 minutes for
   long training.

Container Image Confirmation

Before creating specs, runner scripts, workspaces, logs, state files, or submitting a job, resolve the image for the selected model/action:

bash

${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/resolve_tao_image.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} \
  --model <network> --action <action> --format text

If the helper is unavailable, read `skills/models/<network>/config.json` through `SkillBank().get_model_config(network_arch)`. Resolve image fields in this order:

1. `actions.<action>.container_image`
2. `actions.<action>.image`
3. top-level `container_image`
4. top-level `image`
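
The lookup order above can be sketched as a small function over the loaded config. This is a minimal illustration that assumes the config is a plain dictionary with an `actions` mapping keyed by action name; `resolve_image` is a hypothetical name, not part of the packaged helpers.

```python
from typing import Any, Dict, Optional


def resolve_image(config: Dict[str, Any], action: str) -> Optional[str]:
    """Return the first non-empty image field, checked in the documented order."""
    action_cfg = (config.get("actions") or {}).get(action) or {}
    candidates = [
        action_cfg.get("container_image"),  # 1. actions.<action>.container_image
        action_cfg.get("image"),            # 2. actions.<action>.image
        config.get("container_image"),      # 3. top-level container_image
        config.get("image"),                # 4. top-level image
    ]
    for value in candidates:
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None  # nothing resolved; the caller should ask rather than guess
```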

Show the exact image and ask:

text

Container image for <network>/<action>:
default=<resolved image>

Use this image, or provide image=<override>?

If the user accepts, pass the resolved image as the job `image`. If the user overrides, require a non-empty image reference and pass that value instead. Do not silently launch on the default image. This confirmation applies to training, AutoML recommendations, evaluation, inference, export, TensorRT engine generation, and application workflows that submit TAO containers.
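
One way to picture the accept-or-override rule is a function that refuses to return an image unless the user answered clearly. This is a sketch under assumptions: the set of accepted replies and the name `choose_image` are illustrative, and a real agent would interpret the user's reply more flexibly.

```python
def choose_image(resolved_default: str, user_reply: str) -> str:
    """Return the image to launch with, based on the user's confirmation reply."""
    # Assumed acceptance phrases; purely illustrative.
    accept_words = {"yes", "y", "ok", "use this image"}
    reply = user_reply.strip()
    if reply.startswith("image="):
        override = reply[len("image="):].strip()
        if not override:
            raise ValueError("image override must be a non-empty image reference")
        return override
    if reply.lower() in accept_words:
        return resolved_default
    # No clear answer: never launch silently on the default.
    raise ValueError("image not confirmed; ask the user again")
```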

Credential Filtering

After the user chooses a platform, get the credential list for only that platform:

bash

${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_platforms.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} \
  --platform <platform> --format text

Ask only for credentials returned by that command, plus model-specific credentials from the selected model skill. Do not ask for Lepton credentials on SLURM, Kubernetes, or local Docker. Do not ask for SLURM credentials on Lepton, Brev, Kubernetes, or local Docker. Ask for S3 credentials only when the selected platform and the dataset/result URIs require `s3://` access.

For initial launch intake, ask for required credentials and required credential groups only. Treat the helper's optional credentials/settings section as reference material; do not request those values unless their `only_when` condition applies, the selected workflow cannot proceed without them, or the user asks to customize that setting.

When the helper output includes a "Required credential groups" section, satisfy one credential from each group before proceeding. Explain each requested value using the helper's description and "How to get it" text.
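
The "one credential from each group" rule can be checked with a few lines. The sketch below assumes the groups have already been parsed into a mapping from a group name to its member credential names; that structure and the group name used in the usage note are hypothetical, not the helper's actual output format.

```python
from typing import Dict, List


def unsatisfied_groups(groups: Dict[str, List[str]], provided: Dict[str, str]) -> List[str]:
    """Return names of groups where none of the member credentials has a value."""
    return [
        name
        for name, members in groups.items()
        if not any(provided.get(member, "").strip() for member in members)
    ]
```

For SLURM, for instance, a group such as `{"ssh_auth": ["SSH_KEY_PATH", "SSH_AUTH_SOCK"]}` would be satisfied by either value; an empty result means intake can proceed.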

For SLURM, user-facing prompts should ask for `SSH_KEY_PATH` first. Mention `SSH_AUTH_SOCK` only if the user says they already use an SSH agent.

Dataset Intake

Accept dataset inputs in either mode:

- Dataset root mode: the user gives train/eval/calibration roots, and the model skill maps required files by convention. Example for Cosmos-RL train: `custom.train_dataset.annotation_path=<root>/annotations.json` and `custom.train_dataset.media_path=<root>`.
- Direct spec mode: the user gives exact spec-key paths when annotations, media archives, videos, or image folders live in different places. Preserve those keys directly, for example `custom.train_dataset.annotation_path=/lustre/.../train_annotations.json` and `custom.train_dataset.media_path=/lustre/.../videos.tar.gz`.

Ask for dataset examples that match the selected platform:

- SLURM: shared cluster paths such as `/lustre/fsw/portfolios/<team>/<your-dir>/data/<model>/train` (where `<your-dir>` is your per-user directory on the cluster), or direct spec paths under `/lustre/...`.
- Lepton, Brev, Kubernetes: usually `s3://bucket/path/train` and `s3://bucket/path/eval` unless the platform profile mounts shared storage.
- Local Docker: local paths visible to the Docker host, such as `/data/tao/<model>/train`, or direct spec paths visible inside the planned container mount.

Do not assume "dataset root" is the only acceptable input. When direct spec paths are supplied, validate the exact spec paths rather than appending default filenames.
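
The two input modes can be contrasted in code. This sketch covers only the Cosmos-RL train convention quoted above; the function name is hypothetical, and stripping a trailing slash from the root is an added normalization rather than documented behavior.

```python
from typing import Dict, Optional


def cosmos_rl_train_spec(
    root: Optional[str] = None,
    direct: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Build train dataset spec keys from either a root or direct spec paths."""
    if direct:
        # Direct spec mode: preserve the user's keys and paths untouched,
        # without appending any default filenames.
        return dict(direct)
    if root:
        # Root mode: apply the Cosmos-RL train convention.
        base = root.rstrip("/")
        return {
            "custom.train_dataset.annotation_path": f"{base}/annotations.json",
            "custom.train_dataset.media_path": base,
        }
    raise ValueError("provide a dataset root or direct spec paths")
```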

Platform Preflight

Run the selected platform's preflight checks before any launch artifact is created.

Prefer the packaged preflight helper when the needed inputs are available:

bash

${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/check_tao_launch_preflight.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} \
  --platform <platform> \
  --path train_annotation=<path> \
  --path train_media=<path>

Pass exact direct spec paths when the user supplied them. For root-mode inputs, expand model-required files first, then pass those concrete annotation/media paths to the helper.
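
When the concrete paths are held in a mapping, the helper invocation shown above can be assembled programmatically. The sketch below uses only the flags that appear in that command; `preflight_argv` is a hypothetical wrapper, and it only builds the argument list without running anything.

```python
import os
from typing import Dict, List, Optional


def preflight_argv(
    platform: str,
    paths: Dict[str, str],
    skill_bank: Optional[str] = None,
) -> List[str]:
    """Assemble the preflight helper's argv from labelled concrete paths."""
    # Mirror ${TAO_SKILL_BANK_PATH:-~/tao-skills-external}: fall back when unset or empty.
    bank = (
        skill_bank
        or os.environ.get("TAO_SKILL_BANK_PATH")
        or os.path.expanduser("~/tao-skills-external")
    )
    argv = [
        f"{bank}/scripts/check_tao_launch_preflight.py",
        "--skill-bank", bank,
        "--platform", platform,
    ]
    for label, path in paths.items():
        # One --path flag per label, e.g. train_annotation=<path>.
        argv += ["--path", f"{label}={path}"]
    return argv
```

For root-mode inputs, the mapping passed in would already contain the expanded annotation and media paths.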

When a model skill lists annotation-level required fields, pass them with `--json-required-field <path-label>=<field>[,<field>...]` so schema/data content issues fail during preflight rather than inside the first training container. For example, Cosmos-RL train/AutoML requires `--json-required-field train_annotation=video_fps` and `--json-required-field val_annotation=video_fps`.
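
To make the `<path-label>=<field>[,<field>...]` syntax concrete, here is a minimal sketch of parsing such an argument and checking records against it. It is not the helper's implementation; in particular, treating the annotation file as a JSON list of objects is an assumption, and both function names are hypothetical.

```python
import json
from typing import List, Tuple


def parse_required_field(arg: str) -> Tuple[str, List[str]]:
    """Split '<path-label>=<field>[,<field>...]' into (label, [fields])."""
    label, sep, raw_fields = arg.partition("=")
    fields = [f.strip() for f in raw_fields.split(",") if f.strip()]
    if not sep or not label or not fields:
        raise ValueError(f"expected <path-label>=<field>[,<field>...], got {arg!r}")
    return label, fields


def records_missing_fields(annotation_json: str, fields: List[str]) -> List[int]:
    """Return indexes of records lacking any required field.

    Assumes the annotation content is a JSON list of objects.
    """
    records = json.loads(annotation_json)
    return [i for i, rec in enumerate(records) if any(f not in rec for f in fields)]
```

A non-empty result from `records_missing_fields` would correspond to a preflight failure, caught before any training container starts.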

Do not use `--skip-platform-access` for a real launch. That flag is only for dry environment checks or for cases where the user has already provided explicit manual proof of platform and storage access. If the helper cannot verify remote API, CLI, cluster, or object-store access, treat preflight as failed and do not generate launch artifacts.

For SLURM:

- Require `SLURM_USER`, `SLURM_HOSTNAME`, `SLURM_PARTITION`, and one of `SSH_KEY_PATH` or `SSH_AUTH_SOCK`. Use the selected platform helper's `Resource defaults` for runtime values. For the packaged SLURM defaults, generate launchers with `SLURM_TIME_HOURS=4` and `SLURM_TIMEOUT_HOURS=3.8`; never invent a 12-hour default for the 4-hour partition list. Launching the orchestrator with `nohup` or in the background is allowed for durability, but it does not satisfy chat monitoring by itself. After launch, keep a foreground chat-side polling loop attached until terminal state or explicit detach.
- Split comma-separated `SLURM_HOSTNAME`, resolve hosts where possible, and require passwordless `ssh -o BatchMode=yes` to at least one host.
- If SSH fails, do not offer several equivalent choices. Ask for `SSH_KEY_PATH=/path/to/private_key` and show the passwordless setup steps:
  1. Create a key if needed with `ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519`.
  2. Install it with `ssh-copy-id -i ~/.ssh/id_ed25519.pub <SLURM_USER>@<login-host>`.
  3. Trust the host with `ssh-keyscan -H <login-host> >> ~/.ssh/known_hosts`.
  4. Set `chmod 600 ~/.ssh/id_ed25519`.
  5. Verify with `ssh -o BatchMode=yes -i ~/.ssh/id_ed25519 <SLURM_USER>@<login-host> 'hostname'`.
  6. Rerun with `SSH_KEY_PATH=~/.ssh/id_ed25519`.
- After SSH passes, validate dataset annotation/media paths on the remote login host with `test -e` or an equivalent read-only command.
- Only then create runner scripts, specs, workspaces, or submit jobs.
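
The host-splitting and passwordless SSH requirement can be sketched as follows. The code only builds the probe commands and takes the runner as a parameter, so nothing is executed here; the function names are hypothetical, and the `hostname` probe mirrors the verification command in the steps above.

```python
from typing import Callable, List, Optional


def ssh_check_commands(
    slurm_hostname: str,
    slurm_user: str,
    ssh_key_path: Optional[str] = None,
) -> List[List[str]]:
    """Build one non-interactive SSH probe per host in a comma-separated list."""
    hosts = [h.strip() for h in slurm_hostname.split(",") if h.strip()]
    commands = []
    for host in hosts:
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        cmd = ["ssh", "-o", "BatchMode=yes"]
        if ssh_key_path:
            cmd += ["-i", ssh_key_path]
        cmd += [f"{slurm_user}@{host}", "hostname"]
        commands.append(cmd)
    return commands


def any_host_reachable(commands: List[List[str]], run: Callable[[List[str]], int]) -> bool:
    """True if at least one probe exits with code 0; `run` returns an exit code."""
    return any(run(cmd) == 0 for cmd in commands)
```

In practice `run` could be something like `lambda cmd: subprocess.run(cmd, timeout=15).returncode`; the 15-second timeout is an arbitrary illustrative value.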

For local Docker, validate Docker/GPU access and local dataset paths before writing launch artifacts. For Lepton, Brev, and Kubernetes, validate API or cluster access plus object-storage credentials and `aws s3 ls` readability for `s3://` inputs before writing launch artifacts. For mounted shared-storage or PVC paths on those remote platforms, require manual proof that the path is mounted into the job environment; the helper fails closed rather than accepting unverified remote mount paths.
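
The per-platform rules above can be summarized as a routing decision for each dataset input. This is only a sketch of that decision; the returned route names are illustrative labels, not values used by the packaged helper.

```python
from urllib.parse import urlparse

# Platforms where mounted paths cannot be verified from the launch host.
REMOTE_PLATFORMS = {"lepton", "brev", "kubernetes"}


def dataset_check_kind(uri: str, platform: str) -> str:
    """Pick a verification route for a dataset input (route names are illustrative)."""
    scheme = urlparse(uri).scheme
    if scheme == "s3":
        return "object-store-listing"   # e.g. `aws s3 ls` readability
    if platform in REMOTE_PLATFORMS:
        return "manual-mount-proof"     # fail closed unless the user proves the mount
    if platform == "slurm":
        return "remote-test-e"          # `test -e` on the login host after SSH passes
    return "local-path-exists"          # local-docker: check on the Docker host
```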
