tao-run-automl

TAO AutoML Skill

Run automated hyperparameter optimization (HPO) for any TAO network. The agent uses `AutoMLRunner`, a single interface that manages the full loop: generate recommendations, launch training jobs, extract metrics, and feed results back to the optimizer.

The runner is platform-agnostic: it takes any object implementing the standard SDK shape (`create_job`, `get_job_status`, `get_job_logs`, `get_failure_analysis`) and calls those methods. Pick whichever SDK matches where you want jobs to run:

| SDK | Best for AutoML |
| --- | --- |
| `LeptonSDK` | Multi-node sweeps on DGX Cloud; managed scheduling |
| `BrevSDK` | Cost-tuned sweeps on Brev instances (one instance per rec, multi-GPU OK). Multi-credential / multi-workspace accounts must pass `cloud_cred_id=` and `workspace_group_id=` to `create_job`; see `skills/platform/tao-run-on-brev/SKILL.md`. |
| `SlurmSDK` | Large sweeps on shared HPC clusters with queue/quota |
| `KubernetesSDK` | Sweeps on EKS / GKE / AKS / on-prem clusters with the NVIDIA GPU Operator |
| `DockerSDK` | Local debugging or single-host sweeps |

Multi-node per rec works on Lepton, SLURM, and K8s (each rec is an N-node distributed training job). Brev and local Docker are single-host per rec: multi-GPU within one host still works (`gpu_count > 1`), but one rec can't span multiple hosts.

Workflow: (1) parse user intent + preflight, (2) select algorithm, (3) configure and run, (4) monitor/resume/query status, (5) interpret results. Each step below links the reference holding its full detail. Failure modes: `references/pitfalls.md`. Example exchanges: `references/examples.md`. Setup detail: `references/prerequisites.md`.
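
The "standard SDK shape" mentioned above can be pictured as a structural interface. The sketch below is illustrative only: the four method names come from this skill, but every parameter name, return type, and the `JobSDK` / `FakeSDK` / `has_sdk_shape` names are assumptions made for the example, not the real SDK signatures.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class JobSDK(Protocol):
    """Hypothetical structural type for the four methods the runner calls."""

    def create_job(self, **kwargs: Any) -> str: ...
    def get_job_status(self, job_id: str) -> str: ...
    def get_job_logs(self, job_id: str) -> str: ...
    def get_failure_analysis(self, job_id: str) -> str: ...


class FakeSDK:
    """In-memory stand-in (assumption), handy for dry-running orchestration logic."""

    def __init__(self) -> None:
        self._jobs: dict[str, dict[str, Any]] = {}

    def create_job(self, **kwargs: Any) -> str:
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = kwargs
        return job_id

    def get_job_status(self, job_id: str) -> str:
        return "Complete" if job_id in self._jobs else "Unknown"

    def get_job_logs(self, job_id: str) -> str:
        return f"logs for {job_id}"

    def get_failure_analysis(self, job_id: str) -> str:
        return "no failure recorded"


def has_sdk_shape(obj: object) -> bool:
    """True when obj exposes all four methods (names only, not signatures)."""
    return isinstance(obj, JobSDK)
```

Because the check is structural, any object with these four methods would pass, which is the sense in which the runner is platform-agnostic.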

Preflight

This skill needs `nvidia-tao-automl` (which pulls in `nvidia-tao-sdk` transitively). Both are on public PyPI; pinned versions live in `versions.yaml` (`wheels.tao_automl_*`), resolved via `scripts/resolve_versions_key.py`. Pick the platform extra you want:

```bash
python -c "import tao_automl" 2>/dev/null || {
  SB="${TAO_SKILL_BANK_PATH:?}"
  echo "MISSING: nvidia-tao-automl not installed. Pick the platform extra you need:"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_lepton)\"      # DGX Cloud / Lepton"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_slurm)\"       # on-prem SLURM cluster"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_kubernetes)\"  # K8s (EKS / GKE / on-prem)"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_docker)\"      # local Docker daemon"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_brev)\"        # Brev GPU instances"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_all)\"         # all 5 platforms"
  echo "  (append ,llm or ,wandb to the extra for agentic-search or experiment-tracking deps)"
  exit 1
}
```

(Local development against a checkout: `pip install -e "$HOME/tao-run-automl[lepton]"`; use `$HOME` rather than a quoted `~`, which the shell does not expand.) If the package is missing, the agent asks the user to authorize the install via Bash, then re-runs the preflight before continuing.
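
The same presence check can be made from Python without actually importing the package. This is a minimal standard-library sketch; the helper name `is_installed` is made up for the example.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None
```

Calling `is_installed("tao_automl")` before and after the install gives the agent a cheap way to re-run the preflight.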

Prerequisites

Before running AutoML, satisfy all of these. The full detail (per-platform credential filtering, dataset URI formats, the bank-structure tree, and the install commands) is in `references/prerequisites.md`:

- Shared launch preflight: run the `tao-launch-workflow` intake pattern first. AutoML must not create runner files, workspaces, state files, logs, or compatibility shims, or install dependencies, until the selected platform's credentials, access check, dataset visibility, model credentials, container image confirmation, and compute shape are satisfied. This prevents wasting the budget on fake recommendation failures caused by SSH, storage, image, or credential setup.
- SDK credentials: env vars sourced from `~/.config/tao/.env` (auto-loaded by the skill bank's SessionStart hook). Filter required vars per platform with `scripts/list_tao_platforms.py --platform <platform> --format text` and ask only for what it lists (S3 only when URIs use `s3://`; `NGC_KEY` for container pulls). The agent never reads values; it only checks presence with `[ -n "$VAR_NAME" ]`. Construct the SDK with no arguments, e.g. `LeptonSDK()`.
- Dataset: accessible from the compute backend; URI format depends on the platform (`s3://...` for Lepton, an absolute shared path for SLURM, `azure://...` for Azure, a local path for Docker; never generate `aws://...`). Accept dataset roots or exact spec-key paths, preserving user-supplied keys such as `custom.train_dataset.annotation_path=` without forcing files to share a parent directory.
- Skill bank available: the runner takes an explicit `skill_dir` (absolute path to `<bank-root>/models/<network>`, no env-var fallback). Use the same bank root the agent loaded the workflow from. **CRITICAL:** AutoML requires a packaged, valid `<bank-root>/models/<network>/schemas/train.schema.json`; it is the AutoML support gate (defines `automl_enabled` params, defaults, ranges, options, weights, popular metadata). The runtime must not expect `~/tao-core` to exist; if the packaged train schema is missing, do not run AutoML for that model. `references/spec_template_<action>.yaml` is required for non-TAO-Core models (cosmos-rl, clip) and optional for TAO Core / Hydra-based models (DINO, BEVFusion).
- `nvidia-tao-automl` installed with the platform extra you want (public PyPI; pin in `versions.yaml`). Use the install commands from the Preflight block above or `references/prerequisites.md`; append `,llm` to the extra for agentic algorithms.
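
Two of the checks above lend themselves to a small sketch: testing that credentials are present without reading them back, and refusing the `aws://` scheme. The function names are hypothetical, and the variable names a real run needs come from `scripts/list_tao_platforms.py`, not from this example.

```python
import os
from typing import Iterable, Mapping, Optional


def missing_vars(names: Iterable[str],
                 env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Names that are unset or empty, mirroring `[ -n "$VAR_NAME" ]`.

    Values are only tested for truthiness; they are never returned or printed.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


def uri_scheme(uri: str) -> str:
    """Return the scheme before '://', or '' for plain filesystem paths."""
    return uri.split("://", 1)[0] if "://" in uri else ""


def rejects_aws_scheme(uri: str) -> bool:
    """True for the 'aws://' form that the dataset rule says never to generate."""
    return uri_scheme(uri).lower() == "aws"
```

A caller would ask the user only for the names that `missing_vars` returns, and stop before launch if any dataset URI trips `rejects_aws_scheme`.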

Verify setup:

```bash
python3 -c "from tao_automl.runner import AutoMLRunner; print('OK')"
python3 -c "from tao_automl.brain.llm_brain import LLMBrain; print('LLM OK')"   # optional, LLM features
python3 -c "import wandb; print('WandB OK')"                                    # optional, WandB
```

Concepts: What is TAO AutoML?

TAO AutoML automates the "try different hyperparameter values → train → compare results → repeat" cycle. You tell it what network (`network_arch`), which hyperparameters to search (from the model skill and schema), what metric to optimize (from the model skill or user request), and how many trials (budget). It then picks hyperparameter values with a search algorithm (Bayesian, Hyperband, LLM, etc.), launches a real training job on whichever backend the SDK targets, reads the result metric from training logs, feeds it back so the algorithm learns what works, repeats until budget is exhausted, and returns the best configuration found.

Each "trial" is called a recommendation (rec). One rec = one full training run with a specific set of hyperparameters.
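
The cycle can be sketched as a toy loop. This is not how `AutoMLRunner` is implemented: plain random search stands in for the real algorithms, `fake_train` replaces a real training job, and the parameter name `lr` and its range are invented for the illustration.

```python
import math
import random
from typing import Callable


def toy_automl(train: Callable[[dict], float], budget: int,
               seed: int = 0, maximize: bool = True) -> dict:
    """Run `budget` recs with random search and return the best one."""
    rng = random.Random(seed)
    history = []
    for rec_id in range(budget):
        # 1. Recommend: pick hyperparameter values (random search as a stand-in).
        params = {"lr": 10 ** rng.uniform(-5, -2)}
        # 2. Train: one rec = one full training run with these values.
        metric = train(params)
        # 3. Feed back: a real optimizer would update its model here; we only record.
        history.append({"id": rec_id, "params": params, "metric": metric})
    pick = max if maximize else min
    return pick(history, key=lambda rec: rec["metric"])


def fake_train(params: dict) -> float:
    """Pretend validation score that peaks at lr = 1e-3 (assumption for the demo)."""
    return -abs(math.log10(params["lr"]) + 3.0)
```

`toy_automl(fake_train, budget=20)` returns the rec whose `lr` landed closest to the peak; a larger budget can only match or improve on that result for the same seed.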

Quick Support Queries

When the user asks what models/networks are supported for AutoML, run the packaged model-list helper in AutoML mode. AutoML enablement is model-level metadata (`skills/models/<network>/references/skill_info.yaml` has `automl_enabled: true`), not workflow-level. The helper reads that metadata, then validates whether the model also has a packaged, parseable train dataclass schema:

```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_models.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} --scope automl --format text
```

The compatibility wrapper below is also valid and delegates to the same logic:

```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_automl_support.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} --format text
```

Return both sections from that output: runnable AutoML models and AutoML-enabled models still blocked on schema packaging. The support rule: AutoML is enabled at model level; runnable AutoML also requires `skills/models/<network>/schemas/train.schema.json` to be packaged and valid.
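
The support rule can be illustrated with a short sketch. It is not the logic of the packaged helper scripts: "valid" is simplified here to "parses as a JSON object", the `automl_enabled` mapping is assumed to have been read from each model's `skill_info.yaml` already, and both function names are hypothetical.

```python
import json
from pathlib import Path


def schema_is_valid(schema_path: Path) -> bool:
    """True when the file exists and parses as a JSON object (simplified check)."""
    try:
        return isinstance(json.loads(schema_path.read_text(encoding="utf-8")), dict)
    except (OSError, ValueError):
        return False


def classify(models_root: Path, automl_enabled: dict[str, bool]) -> dict[str, list[str]]:
    """Split AutoML-enabled networks into runnable vs. blocked on schema packaging."""
    result: dict[str, list[str]] = {"runnable": [], "blocked": []}
    for network, enabled in sorted(automl_enabled.items()):
        if not enabled:
            continue  # not AutoML-enabled at model level, so not reported
        schema = models_root / network / "schemas" / "train.schema.json"
        result["runnable" if schema_is_valid(schema) else "blocked"].append(network)
    return result
```

The two lists correspond to the two sections the agent is expected to return to the user.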

Step 1: Parse User Intent

Default to a quick-start run unless the user explicitly asks to customize AutoML or agrees to a customization offer. Do not present algorithm, budget, or search-space choices as required inputs for a normal "run AutoML" request.

Any workflow/application that reaches a train-capable model skill must consult the selected model's `automl_enabled` metadata. If it is `true`, use this AutoML workflow as the default training path unless the run/workflow setting has `automl_policy: off` or the user explicitly asks for a plain single training run. This keeps AutoML enablement scalable across tao-train-single-step, DEFT, and future workflows without duplicating allowlists in each application skill.

Extract the default-run inputs and apply the quick-start defaults. The full required-field table (`network_arch`, `platform`, dataset URIs / direct spec paths, `image`, `metric`, `direction`, `skill_dir`, `long_running_enabled`, `status_interval_minutes`, credentials, compute shape, and the LLM endpoint/model/key trio), the quick-start defaults (`bayesian`, 10 recs, `None` hyperparameters/ranges, 5-minute monitoring), the friendly launch-intake prompting checklist, the customization-only fields, the quick-start runner shape, and metric-choice best practices all live in `references/intake-and-inputs.md`.

Key gating policy that always applies:

- If any required field is missing, ask the user. Do NOT guess dataset paths, skill bank paths, credentials, or hardware that the model skill marks as required.
- `image`: resolve the default, show it to the user, and require confirmation or `image=<override>` before creating the AutoML runner.
- `direction`: only needed when the metric name disagrees with the implicit "contains 'loss' → minimize, else maximize" rule.
- `llm_endpoint` / `llm_model` / `llm_api_key`: MUST prompt for `llm` / `hybrid` / `autoresearch`; the code default `https://integrate.api.nvidia.com/v1` returns 404, so always pass `llm_endpoint` explicitly.
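
The implicit direction rule in the list above is simple enough to write down. This is a sketch of the rule as stated, not the runner's code; treating the match as case-insensitive is an assumption, and the function names are invented for the example.

```python
def implicit_direction(metric_name: str) -> str:
    """Names containing 'loss' are minimized; all others are maximized."""
    return "minimize" if "loss" in metric_name.lower() else "maximize"


def needs_explicit_direction(metric_name: str, intended: str) -> bool:
    """True when the intended direction disagrees with the implicit rule."""
    return implicit_direction(metric_name) != intended
```

A metric such as an error rate shows why the override exists: its name contains no "loss", so the implicit rule would maximize it unless `direction` is passed explicitly.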

Before generating an AutoML script, verify platform access and dataset visibility using the shared launch preflight. For SLURM, that means passwordless SSH to at least one login host and remote `test -e` checks for each required annotation/media path. Verify container image confirmation the same way: the confirmed train image must be passed into `AutoMLRunner.run(..., image=chosen_image, ...)` or the SDK adapter's `create_job(..., image=chosen_image, ...)`; do not rely on an implicit default. Also run any model-specific annotation content checks documented by the model skill. If preflight fails, stop with remediation steps instead of creating a runner that will immediately fail. Missing required annotation fields are a preflight failure, not an AutoML recommendation failure.

**Customization gate:** After the required quick-start fields are resolved, you may briefly offer customization. If the user declines, proceed with the defaults. If the user chooses customization, present the customization-only fields from `references/intake-and-inputs.md`.

**MANDATORY: Read the generated dataclass schema before configuring AutoML.** For the selected model/action, read `${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/models/<network>/schemas/train.schema.json` and `.../schemas/manifest.json`. AutoML can run only when `train.schema.json` is packaged and valid. Do not fall back to hand-written notes, old runner scripts, or a local `~/tao-core` checkout. If the schema is missing, stop and report that AutoML is enabled but not runnable until the schema is generated and shipped. Use the schema JSON as the source of truth for `automl_default_parameters`, `automl_disabled_parameters`, per-parameter defaults, ranges, enums, `option_weights`, `math_cond`, `depends_on`, `parent_param`, and `popular`. When `automl_hyperparameters=None`, the runner discovers all params marked `automl_enabled=True` in the schema; each network has its own set, so never hardcode them here.

The following MANDATORY rules gate every run; full text, code patterns, and rationale are in `references/mandatory-rules.md`:

- MANDATORY prompting for LLM-based algorithms (`llm`, `hybrid`, `autoresearch`): resolve `llm_endpoint`, `llm_model`, and `llm_api_key` before generating the script (precedence chains in the reference). Without valid LLM settings the brain silently falls back to random sampling and wastes GPU budget.
- MANDATORY: Read the model skill before generating the script. Read `<bank-root>/models/<network>/SKILL.md` and apply its Training Requirements, Per-Action Dataset Requirements, Typical Spec Overrides, AutoML / HPO Notes, and Error Patterns. Do not hardcode model-specific knowledge.
- MANDATORY: No model-specific constants in this AutoML skill. Hyperparameter names, ranges, defaults, metric names, dataset layouts, spec override keys, images, and metric regexes belong in the schema and model skill, not here.
- MANDATORY: Timestamped workspace folders. Always suffix `workspace_path` with `datetime.now().strftime("%Y%m%d_%H%M%S")`; never use a flat path.
- MANDATORY: Fresh runner per new AutoML request, after preflight passes. Every new request creates a new runner script, log, PID file, SDK `state_file`, and `workspace_path` with a unique timestamp; only resume when the user explicitly asks to resume/continue/recover/inspect.
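
The timestamp rule in the list above might look like the following in a generated runner script. The helper name and the underscore separator are assumptions; only the `strftime` format string comes from the rule itself.

```python
from datetime import datetime
from typing import Optional


def timestamped_workspace(base: str, now: Optional[datetime] = None) -> str:
    """Append a _YYYYMMDD_HHMMSS suffix so each request gets its own folder."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{base.rstrip('/')}_{stamp}"
```

Passing `now` explicitly keeps the function testable; a resume should reuse the already-suffixed path rather than call this again.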

**Best-practice on metric choice:** prefer the model skill's recommended validation or task metric over cheap training loss (which overfits on small fine-tuning sets); when using a validation proxy, also apply the model skill's required validation-related `spec_overrides` so the metric is emitted; a real task metric via `eval_fn` is most honest but adds per-rec cost. Details in `references/intake-and-inputs.md`.

Step 2: Select Algorithm

Default to `bayesian`. The full classical and LLM/agentic algorithm tables (use-when, typical budget, how it works), the default/caveat rules, and the decision tree are in `references/algorithms.md`. Present the algorithm guide only in customization mode or when the user names one.

Step 3: Configure and Run

Build the runner from the generic shapes in `references/automl-settings.md`: minimal example, full all-options example, LLM-powered example, the programmatic `AutoML` API, the complete `automl_settings` key table, `kpi` metric resolution, the LLM analyzer environment toggles, and `spec_overrides` rules.

- Constrain the search space with `custom_param_ranges`: `references/custom-param-ranges.md` (format table, examples, model-specific search-space rules).
- Opt-in `metric_extractor` / `eval_fn` hooks and WandB tracking: `references/hooks-and-wandb.md`.
- LLM/agentic deep dive (`NLConfigGenerator`, the standalone `LLMAnalyzer`, the five autoresearch agent components, and multi-phase research programs): `references/nl-config-and-research.md`.

All model-specific hyperparameters, metric extractors, and `spec_overrides` come from the model skill.

Step 4: Monitor Progress

`runner.run()` blocks until all recommendations complete; use `on_recommendation` / `on_result` callbacks to report progress. Each rec takes 10–90 minutes, so don't assume failure during long uploads. If the orchestrator dies mid-run, relaunch with the full suffixed `workspace_path` and `resume=True`. Check progress from a separate process with `query_status()`. Callbacks, resume behaviour, and full `query_status()` / `get_status()` usage: `references/monitoring-and-resume.md`.

Step 5: Interpret Results

`runner.run()` returns a plain dict with `best`, `progress`, and `history` keys; metric values are always in the user's original scale. Report the best config, a ranked comparison table, insights, the WandB link if enabled, and next steps. Full result-dict shape, reporting checklist, and all-recs-failed triage: `references/results.md`.
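
A ranked comparison table could be produced along these lines. The entry shape used here (a list of dicts with `id` and `metric` keys) is an assumption for illustration; the real result-dict shape is documented in `references/results.md`, and both function names are hypothetical.

```python
def rank_history(history: list[dict], maximize: bool = True) -> list[dict]:
    """Sort recs with a numeric metric best-first; recs without one are dropped."""
    scored = [rec for rec in history if isinstance(rec.get("metric"), (int, float))]
    return sorted(scored, key=lambda rec: rec["metric"], reverse=maximize)


def format_table(ranked: list[dict]) -> str:
    """Render a small plain-text comparison table."""
    lines = ["rank  rec_id  metric"]
    for rank, rec in enumerate(ranked, start=1):
        lines.append(f"{rank:<4}  {rec['id']:<6}  {rec['metric']:.4f}")
    return "\n".join(lines)
```

Failed recs (no numeric metric) fall out of the ranking here, so they would need to be reported separately, for example as part of the all-recs-failed triage.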

Model-Specific Notes

Model-specific notes do not belong here. For every requested `network_arch`, read `<bank-root>/models/<network>/SKILL.md` and use its Training Requirements, Per-Action Dataset Requirements, Typical Spec Overrides, AutoML / HPO Notes, and Error Patterns sections as the source of truth.

Common Pitfalls

The 15 recurring failure modes are documented with fixes in `references/pitfalls.md`. They include wrong/missing `skill_dir`, wrong LLM endpoint (404), model-specific training failures, workspace collisions, weak proxy metrics, the implicit-direction trap, spec-override typos, mid-sweep orchestrator death, silent random LLM configs, missing `openai`, WandB not logging, and `conda run` buffering. Review them before and during any run.

Example Conversations

Representative agent/user exchanges for optimizing a network, requesting a real task metric, LLM-guided search, fully-autonomous autoresearch, resuming, switching to ASHA with WandB, and generating a config from a goal description: see `references/examples.md`.
