tao-run-automl

TAO AutoML Skill

Run automated hyperparameter optimization (HPO) for any TAO network. The agent uses `AutoMLRunner`, a single interface that manages the full loop: generate recommendations, launch training jobs, extract metrics, and feed results back to the optimizer.
The runner is platform-agnostic: it takes any object implementing the standard SDK shape (`create_job`, `get_job_status`, `get_job_logs`, `get_failure_analysis`) and calls those methods. Pick whichever SDK matches where you want jobs to run:
| SDK | Best for AutoML |
|---|---|
| `LeptonSDK` | Multi-node sweeps on DGX Cloud; managed scheduling |
| `BrevSDK` | Cost-tuned sweeps on Brev instances (single instance per rec, multi-GPU OK). Multi-credential / multi-workspace accounts must pass `cloud_cred_id=` and `workspace_group_id=` to `create_job`; see `skills/platform/tao-run-on-brev/SKILL.md`. |
| `SlurmSDK` | Large sweeps on shared HPC clusters with queue/quota |
| `KubernetesSDK` | Sweeps on EKS / GKE / AKS / on-prem clusters with the NVIDIA GPU Operator |
| `DockerSDK` | Local debugging or single-host sweeps |
Multi-node per rec works on Lepton, SLURM, and K8s (each rec is an N-node distributed training job). Brev and local Docker are single-host per rec: multi-GPU within one host still works (`gpu_count > 1`), but one rec can't span multiple hosts.
Workflow: (1) parse user intent + preflight, (2) select algorithm, (3) configure and run, (4) monitor/resume/query status, (5) interpret results. Each step below links the reference holding its full detail. Failure modes: `references/pitfalls.md`. Example exchanges: `references/examples.md`. Setup detail: `references/prerequisites.md`.
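The "standard SDK shape" can be pictured as a structural interface. The sketch below is illustrative only: the four method names come from the text above, but every parameter, return type, and the `JobSDK` / `FakeSDK` classes are assumptions made for this example, not the actual `nvidia-tao-sdk` signatures.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class JobSDK(Protocol):
    """Hypothetical structural type; only the method names come from the skill text."""

    def create_job(self, **kwargs: Any) -> str: ...
    def get_job_status(self, job_id: str) -> str: ...
    def get_job_logs(self, job_id: str) -> str: ...
    def get_failure_analysis(self, job_id: str) -> str: ...


class FakeSDK:
    """In-memory stand-in used only for this illustration."""

    def __init__(self) -> None:
        self._jobs: dict[str, dict[str, Any]] = {}

    def create_job(self, **kwargs: Any) -> str:
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = kwargs
        return job_id

    def get_job_status(self, job_id: str) -> str:
        # Status strings here are invented for the example.
        return "Complete" if job_id in self._jobs else "Unknown"

    def get_job_logs(self, job_id: str) -> str:
        return f"logs for {job_id}"

    def get_failure_analysis(self, job_id: str) -> str:
        return "no failure recorded"


sdk = FakeSDK()
job_id = sdk.create_job(image="example-image", gpu_count=1)
```

Because the check is structural, any object exposing these four methods satisfies `isinstance(sdk, JobSDK)`, which is roughly what "platform-agnostic" means in practice here.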

Preflight

This skill needs `nvidia-tao-automl` (which pulls `nvidia-tao-sdk` transitively). Both are on public PyPI; pinned versions live in `versions.yaml` (`wheels.tao_automl_*`), resolved via `scripts/resolve_versions_key.py`. Pick the platform extra you want:
```bash
python -c "import tao_automl" 2>/dev/null || {
  SB="${TAO_SKILL_BANK_PATH:?}"
  echo "MISSING: nvidia-tao-automl not installed. Pick the platform extra you need:"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_lepton)\"      # DGX Cloud / Lepton"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_slurm)\"       # on-prem SLURM cluster"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_kubernetes)\"  # K8s (EKS / GKE / on-prem)"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_docker)\"      # local Docker daemon"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_brev)\"        # Brev GPU instances"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_all)\"         # all 5 platforms"
  echo "  (append ,llm or ,wandb to the extra for agentic-search or experiment-tracking deps)"
  exit 1
}
```
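(Local development against a checkout: `pip install -e "$HOME/tao-run-automl[lepton]"`; the path is double-quoted with `$HOME` because `~` is not expanded inside single quotes.) If the package is missing, the agent prompts the user to authorize the install via Bash, then re-runs the preflight before continuing.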
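The same presence check can be sketched in Python, which avoids actually importing the package. This is a minimal illustration, not part of the skill bank; `missing_modules` is a hypothetical helper name, and only the import name `tao_automl` comes from the preflight above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# `tao_automl` is the import name used by the shell preflight.
missing = missing_modules(["tao_automl"])
if missing:
    print("MISSING:", ", ".join(missing))
```

A non-empty result corresponds to the `MISSING:` branch of the shell preflight: stop, ask the user to authorize the install, then check again.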

Prerequisites

Before running AutoML, satisfy all of these. The full detail (per-platform credential filtering, dataset URI formats, the bank-structure tree, and the install commands) is in `references/prerequisites.md`:
  1. Shared launch preflight: run the `tao-launch-workflow` intake pattern first. AutoML must not create runner files, workspaces, state files, logs, compatibility shims, or install dependencies until the selected platform's credentials, access check, dataset visibility, model credentials, container image confirmation, and compute shape are satisfied. This prevents wasting the budget on fake recommendation failures caused by SSH, storage, image, or credential setup.
  2. SDK credentials: env vars sourced from `~/.config/tao/.env` (auto-loaded by the skill bank's SessionStart hook). Filter required vars per platform with `scripts/list_tao_platforms.py --platform <platform> --format text` and ask only for what it lists (S3 only when URIs use `s3://`; `NGC_KEY` for container pulls). The agent never reads values; it only checks presence with `[ -n "$VAR_NAME" ]`. Construct the SDK with no arguments, e.g. `LeptonSDK()`.
  3. Dataset: accessible from the compute backend; URI format depends on the platform (`s3://...` for Lepton, an absolute shared path for SLURM, `azure://...` for Azure, a local path for Docker; never generate `aws://...`). Accept dataset roots or exact spec-key paths, preserving user-supplied keys such as `custom.train_dataset.annotation_path=` without forcing files to share a parent directory.
  4. Skill bank available: the runner takes an explicit `skill_dir` (absolute path to `<bank-root>/models/<network>`, no env-var fallback). Use the same bank root the agent loaded the workflow from. CRITICAL: AutoML requires a packaged, valid `<bank-root>/models/<network>/schemas/train.schema.json`; it is the AutoML support gate (defines `automl_enabled` params, defaults, ranges, options, weights, popular metadata). The runtime must not expect `~/tao-core` to exist; if the packaged train schema is missing, do not run AutoML for that model. `references/spec_template_<action>.yaml` is required for non-TAO-Core models (cosmos-rl, clip) and optional for TAO Core / Hydra-based models (DINO, BEVFusion).
  5. `nvidia-tao-automl` installed with the platform extra you want (public PyPI; pin in `versions.yaml`). Use the install commands from the Preflight block above or `references/prerequisites.md`; append `,llm` to the extra for agentic algorithms.
Verify setup:
```bash
python3 -c "from tao_automl.runner import AutoMLRunner; print('OK')"
python3 -c "from tao_automl.brain.llm_brain import LLMBrain; print('LLM OK')"   # optional, LLM features
python3 -c "import wandb; print('WandB OK')"                                    # optional, WandB
```
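Two of the prerequisites above lend themselves to a small check: credential presence without reading values, and rejecting the forbidden `aws://` scheme. The following is a sketch under assumptions: both helper names are hypothetical, and the list of required variables should come from `scripts/list_tao_platforms.py`, not from this example.

```python
import os


def missing_env_vars(required, environ=None):
    """Return names that are unset or empty; values are never returned or printed."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


def has_forbidden_scheme(uri):
    """The prerequisites say never to generate aws:// URIs."""
    return uri.lower().startswith("aws://")


# NGC_KEY is named in the text; a real run would pass the per-platform list.
print("missing:", missing_env_vars(["NGC_KEY"]))
```

Treating an empty string as missing mirrors the shell test `[ -n "$VAR_NAME" ]`, which also fails for an empty value.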

Concepts: What is TAO AutoML?

TAO AutoML automates the "try different hyperparameter values → train → compare results → repeat" cycle. You tell it what network (`network_arch`), which hyperparameters to search (from the model skill and schema), what metric to optimize (from the model skill or user request), and how many trials (budget). It then picks hyperparameter values with a search algorithm (Bayesian, Hyperband, LLM, etc.), launches a real training job on whichever backend the SDK targets, reads the result metric from training logs, feeds it back so the algorithm learns what works, repeats until budget is exhausted, and returns the best configuration found.
Each "trial" is called a recommendation (rec). One rec = one full training run with a specific set of hyperparameters.
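The cycle described above can be sketched with a toy objective in place of a real training job. This is not how `AutoMLRunner` is implemented: random search is swapped in for brevity (unlike Bayesian or Hyperband methods it ignores the feedback from earlier recs), and `toy_objective`, the `lr` parameter, and its range are made up for illustration.

```python
import random


def toy_objective(params):
    """Stand-in for a training run: returns a loss that is lowest near lr=0.01."""
    return (params["lr"] - 0.01) ** 2


def random_search(budget, seed=0):
    """Recommend -> train -> record -> repeat, then pick the best rec."""
    rng = random.Random(seed)
    history = []
    for rec_id in range(budget):
        params = {"lr": rng.uniform(0.0001, 0.1)}  # one recommendation
        loss = toy_objective(params)               # "training" plus metric extraction
        history.append({"id": rec_id, "params": params, "loss": loss})
    best = min(history, key=lambda rec: rec["loss"])  # direction: minimize
    return best, history


best, history = random_search(budget=10)
print("best rec:", best["id"], "loss:", best["loss"])
```

Each loop iteration corresponds to one rec; the budget is simply the number of iterations.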

Quick Support Queries

When the user asks what models/networks are supported for AutoML, run the packaged model-list helper in AutoML mode. AutoML enablement is model-level metadata (`skills/models/<network>/references/skill_info.yaml` has `automl_enabled: true`), not workflow-level. The helper reads that metadata, then validates whether the model also has a packaged, parseable train dataclass schema:
```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_models.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} --scope automl --format text
```
The compatibility wrapper below is also valid and delegates to the same logic:
```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_automl_support.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} --format text
```
Return both sections from that output: runnable AutoML models and AutoML-enabled models still blocked on schema packaging. The support rule: AutoML is enabled at model level; runnable AutoML also requires `skills/models/<network>/schemas/train.schema.json` to be packaged and valid.
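The two-part support rule can be expressed as a small classifier. This is a sketch, not the logic of `list_tao_models.py`: the function name and the three return labels are hypothetical, the `automl_enabled` flag is passed in instead of being read from `skill_info.yaml`, and "valid" is approximated here as "parses as JSON", which is likely weaker than the real check.

```python
import json
from pathlib import Path


def automl_support(model_dir, automl_enabled):
    """Classify one model directory under the support rule described above."""
    if not automl_enabled:
        return "not_enabled"
    schema_path = Path(model_dir) / "schemas" / "train.schema.json"
    try:
        json.loads(schema_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing, unreadable, or unparseable schema: enabled but not runnable.
        return "blocked_on_schema"
    return "runnable"
```

The `runnable` and `blocked_on_schema` outcomes correspond to the two sections the helper output is expected to report.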

Step 1: Parse User Intent

Default to a quick-start run unless the user explicitly asks to customize AutoML or agrees to a customization offer. Do not present algorithm, budget, or search-space choices as required inputs for a normal "run AutoML" request.
Any workflow/application that reaches a train-capable model skill must consult the selected model's `automl_enabled` metadata. If it is `true`, use this AutoML workflow as the default training path unless the run/workflow setting has `automl_policy: off` or the user explicitly asks for a plain single training run. This keeps AutoML enablement scalable across tao-train-single-step, DEFT, and future workflows without duplicating allowlists in each application skill.
Extract the default-run inputs and apply the quick-start defaults. The full required-field table (`network_arch`, `platform`, dataset URIs / direct spec paths, `image`, `metric`, `direction`, `skill_dir`, `long_running_enabled`, `status_interval_minutes`, credentials, compute shape, and the LLM endpoint/model/key trio), the quick-start defaults (`bayesian`, `10` recs, `None` hyperparameters/ranges, `5`-minute monitoring), the friendly launch-intake prompting checklist, the customization-only fields, the quick-start runner shape, and metric-choice best practices all live in `references/intake-and-inputs.md`.
Key gating policy that always applies:
  • If any required field is missing, ask the user. Do NOT guess dataset paths, skill bank paths, credentials, or hardware that the model skill marks as required.
  • `image`: resolve the default, show it to the user, and require confirmation or `image=<override>` before creating the AutoML runner.
  • `direction`: only needed when the metric name disagrees with the implicit "contains 'loss' → minimize, else maximize" rule.
  • `llm_endpoint`, `llm_model`, `llm_api_key`: MUST prompt for `llm` / `hybrid` / `autoresearch`; the code default `https://integrate.api.nvidia.com/v1` returns 404, so always pass `llm_endpoint` explicitly.
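The implicit direction rule in the list above is easy to misjudge, so a short sketch may help. It restates the quoted rule only; the function name, the returned strings, and the case-insensitive match are assumptions, and the metric names are invented examples.

```python
def infer_direction(metric_name):
    """Implicit rule quoted above: names containing 'loss' are minimized, others maximized."""
    return "minimize" if "loss" in metric_name.lower() else "maximize"


for name in ("val_loss", "accuracy", "val_error"):
    print(name, "->", infer_direction(name))
```

The last case shows the trap: an error-style metric without "loss" in its name would be maximized under the implicit rule, which is exactly when `direction` needs to be passed explicitly.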
Before generating an AutoML script, verify platform access and dataset visibility using the shared launch preflight. For SLURM, that means passwordless SSH to at least one login host and remote `test -e` checks for each required annotation/media path. Verify container image confirmation the same way: the confirmed train image must be passed into `AutoMLRunner.run(..., image=chosen_image, ...)` or the SDK adapter's `create_job(..., image=chosen_image, ...)`; do not rely on an implicit default. Also run any model-specific annotation content checks documented by the model skill. If preflight fails, stop with remediation steps instead of creating a runner that will immediately fail. Missing required annotation fields are a preflight failure, not an AutoML recommendation failure.
Customization gate: After the required quick-start fields are resolved, you may briefly offer customization. If the user declines, proceed with the defaults. If the user chooses customization, present the customization-only fields from `references/intake-and-inputs.md`.
MANDATORY: Read the generated dataclass schema before configuring AutoML. For the selected model/action, read `${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/models/<network>/schemas/train.schema.json` and `.../schemas/manifest.json`. AutoML can run only when `train.schema.json` is packaged and valid. Do not fall back to hand-written notes, old runner scripts, or a local `~/tao-core` checkout. If the schema is missing, stop and report that AutoML is enabled but not runnable until the schema is generated and shipped. Use the schema JSON as the source of truth for `automl_default_parameters`, `automl_disabled_parameters`, per-parameter defaults, ranges, enums, `option_weights`, `math_cond`, `depends_on`, `parent_param`, and `popular`. When `automl_hyperparameters=None`, the runner discovers all params marked `automl_enabled=True` in the schema; each network has its own set, so never hardcode them here.
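To illustrate what "discovering params marked `automl_enabled`" could look like, here is a sketch that walks a nested schema. The layout is an assumption: it supposes a JSON-Schema-like tree where parameters sit under `properties` and carry an `automl_enabled` flag. The real `train.schema.json` layout may differ, and the example schema below is made up, not taken from any TAO network.

```python
def automl_enabled_params(schema, prefix=""):
    """Collect dotted names of properties flagged automl_enabled (assumed layout)."""
    found = []
    for name, node in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if node.get("automl_enabled") is True:
            found.append(path)
        # Recurse into nested objects, if any.
        found.extend(automl_enabled_params(node, prefix=f"{path}."))
    return found


example_schema = {  # invented example, for illustration only
    "properties": {
        "train": {
            "properties": {
                "lr": {"type": "number", "automl_enabled": True},
                "epochs": {"type": "integer", "automl_enabled": False},
            }
        }
    }
}
print(automl_enabled_params(example_schema))
```

The point of the sketch is the design choice, not the traversal: the parameter list is derived from the packaged schema at run time, so nothing network-specific needs to be written into the AutoML skill.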
The following MANDATORY rules gate every run; full text, code patterns, and rationale are in `references/mandatory-rules.md`:
  • MANDATORY prompting for LLM-based algorithms (`llm`, `hybrid`, `autoresearch`): resolve `llm_endpoint`, `llm_model`, and `llm_api_key` before generating the script (precedence chains in the reference). Without valid LLM settings the brain silently falls back to random sampling and wastes GPU budget.
  • MANDATORY: Read the model skill before generating the script. Read `<bank-root>/models/<network>/SKILL.md` and apply its Training Requirements, Per-Action Dataset Requirements, Typical Spec Overrides, AutoML / HPO Notes, and Error Patterns. Do not hardcode model-specific knowledge.
  • MANDATORY: No model-specific constants in this AutoML skill. Hyperparameter names, ranges, defaults, metric names, dataset layouts, spec override keys, images, and metric regexes belong in the schema and model skill, not here.
  • MANDATORY: Timestamped workspace folders. Always suffix `workspace_path` with `datetime.now().strftime("%Y%m%d_%H%M%S")`; never use a flat path.
  • MANDATORY: Fresh runner per new AutoML request, after preflight passes. Every new request creates a new runner script, log, PID file, SDK `state_file`, and `workspace_path` with a unique timestamp; only resume when the user explicitly asks to resume/continue/recover/inspect.
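The timestamped-workspace rule from the list above can be sketched as a small helper. The `strftime` format comes from the rule itself; the helper name, the `_` separator, and the example base path are assumptions.

```python
from datetime import datetime
from pathlib import Path


def timestamped_workspace(base, now=None):
    """Append a %Y%m%d_%H%M%S suffix so each request gets its own folder."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(f"{base}_{stamp}")


# Example base path is hypothetical.
print(timestamped_workspace("/tmp/automl_workspace"))
```

When resuming, the same full suffixed path has to be reused, so it is worth logging the value this returns at launch time.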
Best practice on metric choice: prefer the model skill's recommended validation or task metric over cheap training loss (which overfits on small fine-tuning sets); when using a validation proxy, also apply the model skill's required validation-related `spec_overrides` so the metric is emitted; a real task metric via `eval_fn` is most honest but adds per-rec cost. Details in `references/intake-and-inputs.md`.

Step 2: Select Algorithm

Default to `bayesian`. The full classical and LLM/agentic algorithm tables (use-when, typical budget, how it works), the default/caveat rules, and the decision tree are in `references/algorithms.md`. Present the algorithm guide only in customization mode or when the user names one.

Step 3: Configure and Run

Build the runner from the generic shapes in `references/automl-settings.md`: minimal example, full all-options example, LLM-powered example, the programmatic `AutoML` API, the complete `automl_settings` key table, `kpi` metric resolution, the LLM analyzer environment toggles, and `spec_overrides` rules.
  • Constrain the search space with `custom_param_ranges`: `references/custom-param-ranges.md` (format table, examples, model-specific search-space rules).
  • Opt-in `metric_extractor` / `eval_fn` hooks and WandB tracking: `references/hooks-and-wandb.md`.
  • LLM/agentic deep dive (`NLConfigGenerator`, the standalone `LLMAnalyzer`, the five autoresearch agent components, and multi-phase research programs): `references/nl-config-and-research.md`.
All model-specific hyperparameters, metric extractors, and `spec_overrides` come from the model skill.

Step 4: Monitor Progress

`runner.run()` blocks until all recommendations complete; use `on_recommendation` / `on_result` callbacks to report progress. Each rec takes 10–90 minutes, so don't assume failure during long uploads. If the orchestrator dies mid-run, relaunch with the full suffixed `workspace_path` and `resume=True`. Check progress from a separate process with `query_status()`. Callbacks, resume behaviour, and full `query_status()` / `get_status()` usage: `references/monitoring-and-resume.md`.

Step 5: Interpret Results

`runner.run()` returns a plain dict with `best`, `progress`, and `history` keys; metric values are always in the user's original scale. Report the best config, a ranked comparison table, insights, the WandB link if enabled, and next steps. Full result-dict shape, reporting checklist, and all-recs-failed triage: `references/results.md`.
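A ranked comparison table can be built from the `history` entries along these lines. This is a sketch under assumptions: the per-rec fields `id` and `metric`, and the convention that a failed rec has `metric` set to `None`, are invented for illustration; the actual result-dict shape is documented in `references/results.md`.

```python
def ranked_rows(history, direction="maximize"):
    """Sort completed recs by metric; failed recs (metric None) are listed last."""
    done = [rec for rec in history if rec.get("metric") is not None]
    failed = [rec for rec in history if rec.get("metric") is None]
    done.sort(key=lambda rec: rec["metric"], reverse=(direction == "maximize"))
    return done + failed


# Made-up history entries, not real AutoML output.
history = [
    {"id": 0, "metric": 0.71},
    {"id": 1, "metric": None},
    {"id": 2, "metric": 0.84},
]
for rank, rec in enumerate(ranked_rows(history), start=1):
    print(rank, rec["id"], rec["metric"])
```

Keeping failed recs in the table, instead of dropping them, makes it easier to spot the all-recs-failed case that the triage section covers.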

Model-Specific Notes

Model-specific notes do not belong here. For every requested `network_arch`, read `<bank-root>/models/<network>/SKILL.md` and use its Training Requirements, Per-Action Dataset Requirements, Typical Spec Overrides, AutoML / HPO Notes, and Error Patterns sections as the source of truth.

Common Pitfalls

The 15 recurring failure modes are documented with fixes in `references/pitfalls.md`. They include wrong/missing `skill_dir`, wrong LLM endpoint (404), model-specific training failures, workspace collisions, weak proxy metrics, the implicit-direction trap, spec-override typos, mid-sweep orchestrator death, silent random LLM configs, missing `openai`, WandB not logging, and `conda run` buffering. Review them before and during any run.

Example Conversations

Representative agent/user exchanges for optimizing a network, requesting a real task metric, LLM-guided search, fully-autonomous autoresearch, resuming, switching to ASHA with WandB, and generating a config from a goal description: see `references/examples.md`.