tao-launch-workflow

TAO Workflow Launch Intake

Use this skill before launching any TAO workflow or model action.

Quick Start

Run the platform helper, ask for platform and monitoring preferences, then run the selected platform detail helper before asking for credentials.

Non-Negotiable Launch Gate

Do not create runner scripts, launch scripts, compatibility shims, workspace folders, state files, logs, or dependency-install side effects until the launch preflight passes.
Preflight passes only after all of these are true:
  1. The execution platform is selected from the packaged platform helper.
  2. Platform credentials and required credential groups are satisfied.
  3. Model-specific credentials are satisfied.
  4. The default container image is resolved from packaged model/action metadata, shown to the user, and either confirmed or replaced by an explicit image=<override>.
  5. The platform access check succeeds from the launch host.
  6. Dataset inputs are mapped to concrete spec keys and verified from the selected platform's point of view.
  7. Required compute shape fields from the model/workflow skill are known.
If any item is missing, ask for the missing input and stop before generating artifacts. This applies to AutoML, normal train/eval/infer/export/TRT, and DEFT/application workflows.
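The gate above can be pictured as a checklist that must be fully ticked before any artifact is written. The following is a minimal sketch, not the packaged helper's implementation: the class name LaunchPreflight and its field names are hypothetical labels for the seven conditions.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class LaunchPreflight:
    """Hypothetical checklist; one flag per gate condition listed above."""
    platform_selected: bool = False
    platform_credentials_ok: bool = False
    model_credentials_ok: bool = False
    image_confirmed: bool = False
    platform_access_ok: bool = False
    dataset_paths_verified: bool = False
    compute_shape_known: bool = False

    def missing(self) -> List[str]:
        # Names of every condition that is not yet satisfied.
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def passed(self) -> bool:
        # The gate opens only when nothing is missing.
        return not self.missing()
```

A caller would ask the user for whatever missing() returns and stop, creating scripts or workspaces only once passed() is true.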

Initial Questions

After the user confirms what they want to do, ask for the execution platform using the packaged helper. Do not scan platform docs, skill folders, or config folders to build the choices.
bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_platforms.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} --format text
Then ask:
  • Which supported platform should run this workflow?
  • Should I monitor the run in this chat? Monitoring means I keep polling the backend/job logs after launch and report progress until the job finishes, fails, or you ask me to stop, even if the job stays queued for hours or days. If disabled, I launch the job, give you the job id/log path, and stop polling. Default: monitor in chat.
  • How often should I post status? Default: every 5 minutes. Use 1-2 minutes for smoke tests, 5 minutes for normal training, or 10-15 minutes for long runs.
Use long_running_enabled=true and status_interval_minutes=5 when the user accepts the defaults.
When monitoring is enabled, do not send a final summary just because several polls have elapsed or the job is still PENDING. Keep the turn attached and emit status every status_interval_minutes until a terminal state or explicit user stop/detach request. If the runtime environment cannot keep the chat turn open, say that clearly and leave a durable watcher/log path; do not imply that chat updates will continue after the turn ends.
Final-answer rule: a final response ends chat-side monitoring. While long_running_enabled=true and any launched job is non-terminal, status messages must be sent as in-progress updates and the agent must continue polling. Only send a final response when the workflow reaches terminal state, the user explicitly asks to detach/stop monitoring, or the runtime genuinely cannot keep the turn open; in that last case, say it is a runtime limitation and provide the exact durable status command/log path.
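The final-answer rule reduces to a small decision function. This is a sketch under stated assumptions: the function name may_send_final is hypothetical, and the terminal state names in TERMINAL_STATES are assumed, since the text names only PENDING as a non-terminal state.

```python
# Assumed terminal state names; the real backend may use different labels.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def may_send_final(long_running_enabled, job_states,
                   user_detached=False, runtime_can_hold_turn=True):
    """Return True when a final response is allowed under the rule above."""
    if not long_running_enabled:
        # Monitoring disabled: launch, report job id/log path, stop polling.
        return True
    if user_detached or not runtime_can_hold_turn:
        # Explicit detach, or a genuine runtime limitation (which must be
        # stated together with the durable status command/log path).
        return True
    # Otherwise every launched job must have reached a terminal state.
    return all(state in TERMINAL_STATES for state in job_states)
```

While this returns False, status goes out as in-progress updates every status_interval_minutes and polling continues.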

Missing-Input Prompt Shape

When asking for launch inputs, include concrete examples and both dataset input modes. Do not ask only for "dataset root".
Use this structure and adapt spec keys to the selected model/action:
text
I need these launch inputs before I can create specs or runner files:

1. Execution platform: lepton, brev, slurm, local-docker, or kubernetes.

2. Dataset inputs. You can provide either mode:
   A) Root mode: give train/eval roots and I map required files automatically.
      Example Cosmos-RL:
      train_root=/lustre/fsw/.../cosmos/train
      -> custom.train_dataset.annotation_path=train_root/annotations.json
      -> custom.train_dataset.media_path=train_root
   B) Direct spec mode: give the exact config/spec parameters yourself.
      Example:
      custom.train_dataset.annotation_path=/lustre/fsw/.../train_annotations.json
      custom.train_dataset.media_path=/lustre/fsw/.../videos_train.tar.gz
      custom.val_dataset.annotation_path=/lustre/fsw/.../eval_annotations.json
      custom.val_dataset.media_path=/lustre/fsw/.../eval_videos/

   Platform examples:
   - SLURM/Lustre: /lustre/fsw/.../data/train or lustre:///lustre/fsw/.../data/train
   - Lepton/Brev/Kubernetes: s3://bucket/path/train and s3://bucket/path/eval
   - local-docker: /data/tao/<model>/train or file:///data/tao/<model>/eval

3. Container image. I will resolve the default from packaged model metadata and
   show it before launch, for example:
   default image for <model>/<action>: <resolved container image>
   Use this image, or provide image=<override> to pin a different TAO build.

4. Compute shape required by the model, for example GPUs/nodes.

5. Required credentials from platform/model docs, for example HF_TOKEN for
   gated Hugging Face models.

6. Monitoring preference. By default I monitor in this chat and post progress
   every 5 minutes; choose 1-2 minutes for smoke tests or 10-15 minutes for
   long training.

Container Image Confirmation

Before creating specs, runner scripts, workspaces, logs, state files, or submitting a job, resolve the image for the selected model/action:
bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/resolve_tao_image.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} \
  --model <network> --action <action> --format text
If the helper is unavailable, read skills/models/<network>/config.json through SkillBank().get_model_config(network_arch). Resolve image fields in this order:
  1. actions.<action>.container_image
  2. actions.<action>.image
  3. top-level container_image
  4. top-level image
Show the exact image and ask:
text
Container image for <network>/<action>:
default=<resolved image>

Use this image, or provide image=<override>?
If the user accepts, pass the resolved image as the job image. If the user overrides, require a non-empty image reference and pass that value instead. Do not silently launch on the default image. This confirmation applies to training, AutoML recommendations, evaluation, inference, export, TensorRT engine generation, and application workflows that submit TAO containers.
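The fallback order and the override rule can be sketched as below. This is an illustration only: resolve_image is a hypothetical name, and it assumes the model config is a plain dict whose nesting matches the dotted field names above. It does not call the real SkillBank API.

```python
def resolve_image(config, action, override=None):
    """Pick the job image: explicit override first, then the documented order."""
    if override is not None:
        # An override must be a non-empty image reference.
        if not override.strip():
            raise ValueError("image override must be a non-empty reference")
        return override.strip()

    action_cfg = (config.get("actions") or {}).get(action) or {}
    candidates = [
        action_cfg.get("container_image"),  # 1. actions.<action>.container_image
        action_cfg.get("image"),            # 2. actions.<action>.image
        config.get("container_image"),      # 3. top-level container_image
        config.get("image"),                # 4. top-level image
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    raise LookupError(f"no container image found for action {action!r}")
```

The returned default still has to be shown to the user and confirmed before launch; resolving it does not count as confirmation.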

Credential Filtering

After the user chooses a platform, get the credential list for only that platform:
bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_platforms.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} \
  --platform <platform> --format text
Ask only for credentials returned by that command, plus model-specific credentials from the selected model skill. Do not ask for Lepton credentials on SLURM, Kubernetes, or local Docker. Do not ask for SLURM credentials on Lepton, Brev, Kubernetes, or local Docker. Ask for S3 credentials only when the selected platform and the dataset/result URIs require s3:// access.
For initial launch intake, ask for required credentials and required credential groups only. Treat the helper's optional credentials/settings section as reference material; do not request those values unless their only_when condition applies, the selected workflow cannot proceed without them, or the user asks to customize that setting.
When the helper output includes a "Required credential groups" section, satisfy one credential from each group before proceeding. Explain each requested value using the helper's description and "How to get it" text.
For SLURM, user-facing prompts should ask for SSH_KEY_PATH first. Mention SSH_AUTH_SOCK only if the user says they already use an SSH agent.
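The "required credentials plus one from each group" rule can be sketched as follows. The function name and the input shapes (a list of required names, a list of groups, a dict of provided values) are hypothetical; the helper's actual output format is not specified here and would need to be parsed into this form.

```python
def missing_credentials(required, required_groups, provided):
    """Return (missing required names, groups with no credential supplied)."""
    # Treat empty strings and None as "not provided".
    have = {name for name, value in provided.items() if value}
    missing = [name for name in required if name not in have]
    # A group is satisfied by any one of its members.
    open_groups = [group for group in required_groups
                   if not have.intersection(group)]
    return missing, open_groups
```

For SLURM, SSH_KEY_PATH and SSH_AUTH_SOCK would form one group, so supplying either closes it. Optional settings are deliberately absent from this check.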
Dataset Intake

Accept dataset inputs in either mode:
  • Dataset root mode: the user gives train/eval/calibration roots, and the model skill maps required files by convention. Example for Cosmos-RL train: custom.train_dataset.annotation_path=<root>/annotations.json and custom.train_dataset.media_path=<root>.
  • Direct spec mode: the user gives exact spec-key paths when annotations, media archives, videos, or image folders live in different places. Preserve those keys directly, for example custom.train_dataset.annotation_path=/lustre/.../train_annotations.json and custom.train_dataset.media_path=/lustre/.../videos.tar.gz.
Ask for dataset examples that match the selected platform:
  • SLURM: shared cluster paths such as /lustre/fsw/portfolios/<team>/<your-dir>/data/<model>/train (where <your-dir> is your per-user directory on the cluster), or direct spec paths under /lustre/... .
  • Lepton, Brev, Kubernetes: usually s3://bucket/path/train and s3://bucket/path/eval unless the platform profile mounts shared storage.
  • Local Docker: local paths visible to the Docker host, such as /data/tao/<model>/train, or direct spec paths visible inside the planned container mount.
Do not assume "dataset root" is the only acceptable input. When direct spec paths are supplied, validate the exact spec paths rather than appending default filenames.
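The two intake modes differ only in who fills in the spec keys. A minimal sketch for the Cosmos-RL train example, assuming a hypothetical function name and covering only the two train keys shown above:

```python
def expand_train_inputs(train_root=None, direct_spec=None):
    """Return concrete spec keys for the Cosmos-RL train example."""
    if direct_spec:
        # Direct spec mode: keep the user's exact paths, append nothing.
        return dict(direct_spec)
    if not train_root:
        raise ValueError("provide either train_root or direct spec keys")
    # Root mode: map required files by the convention described above.
    root = train_root.rstrip("/")
    return {
        "custom.train_dataset.annotation_path": f"{root}/annotations.json",
        "custom.train_dataset.media_path": root,
    }
```

Either way, the resulting concrete paths are what preflight should verify from the selected platform's point of view.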

Platform Preflight

Run the selected platform's preflight checks before any launch artifact is created.
Prefer the packaged preflight helper when the needed inputs are available:
bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/check_tao_launch_preflight.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} \
  --platform <platform> \
  --path train_annotation=<path> \
  --path train_media=<path>
Pass exact direct spec paths when the user supplied them. For root-mode inputs, expand model-required files first, then pass those concrete annotation/media paths to the helper.
When a model skill lists annotation-level required fields, pass them with --json-required-field <path-label>=<field>[,<field>...] so schema/data content issues fail during preflight rather than inside the first training container. For example, Cosmos-RL train/AutoML requires --json-required-field train_annotation=video_fps and --json-required-field val_annotation=video_fps.
Do not use --skip-platform-access for a real launch. That flag is only for dry environment checks or for cases where the user has already provided explicit manual proof of platform and storage access. If the helper cannot verify remote API, CLI, cluster, or object-store access, treat preflight as failed and do not generate launch artifacts.
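Assembling the helper invocation programmatically might look like the sketch below. It uses only the flags quoted above; the function name build_preflight_argv is hypothetical, and the result is just an argument list that still needs to be run, for example with subprocess.run.

```python
import os


def build_preflight_argv(platform, paths, json_required=None):
    """Build the argument list for check_tao_launch_preflight.py."""
    # Mirrors ${TAO_SKILL_BANK_PATH:-~/tao-skills-external}: unset or empty
    # falls back to the default location.
    bank = (os.environ.get("TAO_SKILL_BANK_PATH")
            or os.path.expanduser("~/tao-skills-external"))
    argv = [
        os.path.join(bank, "scripts", "check_tao_launch_preflight.py"),
        "--skill-bank", bank,
        "--platform", platform,
    ]
    for label, path in paths.items():
        argv += ["--path", f"{label}={path}"]
    for label, required in (json_required or {}).items():
        argv += ["--json-required-field", f"{label}={','.join(required)}"]
    # --skip-platform-access is intentionally never added for a real launch.
    return argv
```

A non-zero exit status from the helper should be treated as a failed preflight, with no launch artifacts generated.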
For SLURM:
  1. Require SLURM_USER, SLURM_HOSTNAME, SLURM_PARTITION, and one of SSH_KEY_PATH or SSH_AUTH_SOCK. Use the selected platform helper's "Resource defaults" for runtime values. For the packaged SLURM defaults, generate launchers with SLURM_TIME_HOURS=4 and SLURM_TIMEOUT_HOURS=3.8; never invent a 12-hour default for the 4-hour partition list. Launching the orchestrator with nohup or in the background is allowed for durability, but it does not satisfy chat monitoring by itself. After launch, keep a foreground chat-side polling loop attached until terminal state or explicit detach.
  2. Split comma-separated SLURM_HOSTNAME, resolve hosts where possible, and require passwordless ssh -o BatchMode=yes to at least one host.
  3. If SSH fails, do not offer several equivalent choices. Ask for SSH_KEY_PATH=/path/to/private_key and show the passwordless setup steps: create a key if needed with ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519; install it with ssh-copy-id -i ~/.ssh/id_ed25519.pub <SLURM_USER>@<login-host>; trust the host with ssh-keyscan -H <login-host> >> ~/.ssh/known_hosts; set chmod 600 ~/.ssh/id_ed25519; verify with ssh -o BatchMode=yes -i ~/.ssh/id_ed25519 <SLURM_USER>@<login-host> 'hostname'; then rerun with SSH_KEY_PATH=~/.ssh/id_ed25519.
  4. After SSH passes, validate dataset annotation/media paths on the remote login host with test -e or an equivalent read-only command.
  5. Only then create runner scripts, specs, workspaces, or submit jobs.
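Steps 2 and 3 can be sketched as a function that splits SLURM_HOSTNAME and builds one non-interactive SSH probe per login host. The function name is hypothetical and the commands are only constructed here, not executed.

```python
def ssh_probe_commands(slurm_user, slurm_hostname, ssh_key_path=None):
    """Build one passwordless SSH probe command per comma-separated host."""
    hosts = [h.strip() for h in slurm_hostname.split(",") if h.strip()]
    commands = []
    for host in hosts:
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        cmd = ["ssh", "-o", "BatchMode=yes"]
        if ssh_key_path:
            cmd += ["-i", ssh_key_path]
        cmd += [f"{slurm_user}@{host}", "hostname"]
        commands.append(cmd)
    return commands
```

Preflight would run these in turn (for instance with subprocess.run and a timeout) and count the SSH check as passed once at least one exits with status 0. Remote path validation with test -e follows only after that.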
For local Docker, validate Docker/GPU access and local dataset paths before writing launch artifacts. For Lepton, Brev, and Kubernetes, validate API or cluster access plus object-storage credentials and aws s3 ls readability for s3:// inputs before writing launch artifacts. For mounted shared-storage or PVC paths on those remote platforms, require manual proof that the path is mounted into the job environment; the helper fails closed rather than accepting unverified remote mount paths.