tao-train-single-step

Normal Train

Standard supervised fine-tuning: train a model on a labeled dataset, optionally evaluate, then optionally export. The most common TAO workflow for adapting a pretrained model to a new dataset.

Steps

  1. train: executed through AutoML when the selected model has automl_enabled: true and automl_policy is auto; set automl_policy=off for a plain single training run
  2. eval: executed if eval_dataset_uri is resolved
  3. export: optional, on user request after training
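
The step conditions above can be sketched as a small planning function. This is an illustration only, not part of the TAO skill bank: the RunRequest class, the plan_steps function, the export_requested flag, and the s3://bucket/val/ URI are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunRequest:
    """Hypothetical container for the inputs named in the steps above."""
    automl_enabled: bool                    # from the selected model's metadata
    automl_policy: str = "auto"             # "auto" or "off"
    eval_dataset_uri: Optional[str] = None  # None means it was not resolved
    export_requested: bool = False          # user asked for export after training


def plan_steps(req: RunRequest) -> List[str]:
    """Return the ordered steps this workflow would run (illustrative only)."""
    use_automl = req.automl_enabled and req.automl_policy == "auto"
    steps = ["train (AutoML)" if use_automl else "train (single run)"]
    if req.eval_dataset_uri:
        steps.append("eval")
    if req.export_requested:
        steps.append("export")
    return steps


# AutoML-enabled model with an eval dataset: train through AutoML, then eval.
print(plan_steps(RunRequest(automl_enabled=True, eval_dataset_uri="s3://bucket/val/")))
# Same model with automl_policy=off and an export request: plain train, then export.
print(plan_steps(RunRequest(automl_enabled=True, automl_policy="off", export_requested=True)))
```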

Prerequisites

Required

  • model: A compatible TAO model (e.g., clip, nvdinov2, grounding_dino)
  • train_dataset_uri: URI of the training dataset (e.g., s3://bucket/train/)
  • platform: Ask the user to choose from the generated supported-platform list: ${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_platforms.py --format text
  • container image confirmation: Resolve the default image from the selected model/action config, show it to the user, and require confirmation or image=<override> before creating runner files or submitting training.

Optional

  • eval_dataset_uri: Some model skills mark this as required — check the resolved model skill before treating it as optional.
  • base_checkpoint: If not provided, defaults to the NGC pretrained checkpoint listed in the model skill, or trains from scratch if no NGC checkpoint exists.
  • automl_policy: auto by default; set off to bypass model-level AutoML for this run while leaving model metadata unchanged.
  • image override: Use image=<override> to pin a specific TAO toolkit build after reviewing the resolved default.
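
The required and optional inputs above could be checked and defaulted along these lines. This is a minimal sketch under stated assumptions: the resolve_inputs function, the eval_required flag (standing in for what the resolved model skill says), the ngc_checkpoint argument, and the "local" platform value are all hypothetical and do not come from the skill bank.

```python
from typing import Optional

# Required inputs named in the Prerequisites list.
REQUIRED_FIELDS = ("model", "train_dataset_uri", "platform")


def resolve_inputs(inputs: dict,
                   eval_required: bool = False,
                   ngc_checkpoint: Optional[str] = None) -> dict:
    """Validate required inputs and fill optional defaults (illustrative only)."""
    missing = [name for name in REQUIRED_FIELDS if not inputs.get(name)]
    # Some model skills mark eval_dataset_uri as required.
    if eval_required and not inputs.get("eval_dataset_uri"):
        missing.append("eval_dataset_uri")
    if missing:
        raise ValueError("missing required inputs: " + ", ".join(missing))

    resolved = dict(inputs)
    resolved.setdefault("automl_policy", "auto")
    # User checkpoint first, then the NGC checkpoint; None means train from scratch.
    resolved["base_checkpoint"] = inputs.get("base_checkpoint") or ngc_checkpoint
    return resolved


request = {"model": "clip", "train_dataset_uri": "s3://bucket/train/", "platform": "local"}
print(resolve_inputs(request, ngc_checkpoint=None)["base_checkpoint"])  # None: from scratch
```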

Launch Intake

After the user confirms they want this standard train/eval/export workflow, ask which supported platform they intend to run on. Generate the choices with scripts/list_tao_platforms.py --format text; do not scan platform docs or folders.
Before creating a plain train runner, inspect the selected model's metadata with scripts/list_tao_models.py --scope automl --format json or read skills/models/<network>/references/skill_info.yaml. If automl_enabled is true and the helper reports a valid train schema for that model, route the train stage through skills/applications/tao-run-automl by default. Only stay on the plain train path when automl_policy=off, the user explicitly asks for no HPO/AutoML, or AutoML is enabled but not runnable because the model's train schema is not packaged yet.
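
The routing rule above can be written as a single decision function. This is a sketch only: choose_train_route and its parameters are hypothetical names, and the two boolean inputs are assumed to have been read already from the model metadata and the helper's schema report.

```python
from typing import Tuple

# Skill path named in the text above.
AUTOML_SKILL = "skills/applications/tao-run-automl"


def choose_train_route(automl_enabled: bool,
                       has_train_schema: bool,
                       automl_policy: str = "auto",
                       user_declined_automl: bool = False) -> Tuple[str, str]:
    """Return (route, reason); route is "plain" or the AutoML skill path."""
    if automl_policy == "off":
        return "plain", "automl_policy=off"
    if user_declined_automl:
        return "plain", "user asked for no HPO/AutoML"
    if not automl_enabled:
        return "plain", "automl_enabled is not true for this model"
    if not has_train_schema:
        return "plain", "train schema is not packaged yet"
    return AUTOML_SKILL, "automl_enabled with a valid train schema"


route, reason = choose_train_route(automl_enabled=True, has_train_schema=False)
print(route, "-", reason)
```

Returning the reason alongside the route makes it easy to tell the user why a run stayed on the plain train path.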
Also ask whether long-running monitoring should stay enabled and how many minutes between status updates. Defaults: enabled, 5 minutes.
After the model/action are known, run scripts/resolve_tao_image.py --model <network> --action train --format text and ask whether to use the resolved image or an image=<override>. Do not create the tao-train-single-step runner until the image is confirmed.
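
The image confirmation gate above might look like the following. It is a sketch under assumptions: resolve_image_command and confirm_image are hypothetical helpers, invoking the script through the current Python interpreter is an assumption about how the skill bank is run, and the image names are placeholders. Only the script path and flags come from the text above.

```python
import sys
from typing import List, Optional


def resolve_image_command(network: str) -> List[str]:
    """Build the argv for the image resolver named in the text (not executed here)."""
    return [sys.executable, "scripts/resolve_tao_image.py",
            "--model", network, "--action", "train", "--format", "text"]


def confirm_image(resolved_default: str,
                  user_confirmed: bool = False,
                  override: Optional[str] = None) -> str:
    """Return the image to use, or refuse until the user has decided."""
    if override:
        return override            # user supplied image=<override>
    if user_confirmed:
        return resolved_default    # user accepted the resolved default
    raise RuntimeError("image not confirmed; do not create the runner yet")


print(resolve_image_command("clip")[1:])
print(confirm_image("registry.example/tao:default", override="registry.example/tao:pinned"))
```

Raising instead of silently falling back to the default keeps the runner from being created before the user has seen the resolved image.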
After platform selection, run scripts/list_tao_platforms.py --platform <platform> --format text and ask only for credentials relevant to that platform, plus any selected-model credentials. Do not ask for unrelated platform credentials.