tao-run-automl
TAO AutoML Skill
Run automated hyperparameter optimization (HPO) for any TAO network. The agent uses `AutoMLRunner`, a single interface that manages the full loop: generate recommendations, launch training jobs, extract metrics, and feed results back to the optimizer.

The runner is platform-agnostic: it takes any object implementing the standard SDK shape (`create_job`, `get_job_status`, `get_job_logs`, `get_failure_analysis`) and calls those methods. Pick whichever SDK matches where you want jobs to run:

| SDK | Best for AutoML |
|---|---|
| Lepton | Multi-node sweeps on DGX Cloud; managed scheduling |
| Brev | Cost-tuned sweeps on Brev instances (single-instance per rec, multi-GPU OK). Multi-credential / multi-workspace accounts must pass |
| SLURM | Large sweeps on shared HPC clusters with queue/quota |
| Kubernetes | Sweeps on EKS / GKE / AKS / on-prem clusters with the NVIDIA GPU Operator |
| Docker | Local debugging or single-host sweeps |

Multi-node per rec works on Lepton, SLURM, and K8s (each rec is an N-node distributed training job). Brev and local Docker are single-host per rec: multi-GPU within one host still works (`gpu_count > 1`), but one rec can't span multiple hosts.

Workflow: (1) parse user intent + preflight, (2) select algorithm, (3) configure and run, (4) monitor/resume/query status, (5) interpret results. Each step below links the reference holding its full detail. Failure modes: `references/pitfalls.md`. Example exchanges: `references/examples.md`. Setup detail: `references/prerequisites.md`.
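The four-method SDK shape described above can be pictured as a structural interface. The following is a minimal sketch using `typing.Protocol`. Only the four method names come from this document; the parameters, return types, the `JobSDK` name, and the `FakeSDK` stub are assumptions made for illustration, not the real SDK signatures.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class JobSDK(Protocol):
    """Hypothetical structural type for the SDK shape; signatures are assumed."""

    def create_job(self, **kwargs: Any) -> str: ...
    def get_job_status(self, job_id: str) -> str: ...
    def get_job_logs(self, job_id: str) -> str: ...
    def get_failure_analysis(self, job_id: str) -> str: ...


class FakeSDK:
    """In-memory stand-in used only to show duck typing; not a real backend."""

    def __init__(self) -> None:
        self._jobs: dict[str, str] = {}

    def create_job(self, **kwargs: Any) -> str:
        # Pretend the job finishes immediately; a real backend would queue it.
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = "Complete"
        return job_id

    def get_job_status(self, job_id: str) -> str:
        return self._jobs[job_id]

    def get_job_logs(self, job_id: str) -> str:
        return f"logs for {job_id}"

    def get_failure_analysis(self, job_id: str) -> str:
        return "no failure recorded"


sdk = FakeSDK()
job_id = sdk.create_job(image="example-image")
# The runtime check only verifies that the four methods are present.
print(isinstance(sdk, JobSDK), job_id, sdk.get_job_status(job_id))
```

Because the check is structural, any object exposing these four methods would pass it, which is one way to read "platform-agnostic" here.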
Preflight
This skill needs `nvidia-tao-automl` (which pulls `nvidia-tao-sdk` transitively). Both are on public PyPI; pinned versions live in `versions.yaml` (`wheels.tao_automl_*`), resolved via `scripts/resolve_versions_key.py`. Pick the platform extra you want:

```bash
# Fail fast with install hints when the tao_automl package is not importable.
python -c "import tao_automl" 2>/dev/null || {
  SB="${TAO_SKILL_BANK_PATH:?}"
  echo "MISSING: nvidia-tao-automl not installed. Pick the platform extra you need:"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_lepton)\"      # DGX Cloud / Lepton"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_slurm)\"       # on-prem SLURM cluster"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_kubernetes)\"  # K8s (EKS / GKE / on-prem)"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_docker)\"      # local Docker daemon"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_brev)\"        # Brev GPU instances"
  echo "  pip install \"$($SB/scripts/resolve_versions_key.py wheels.tao_automl_all)\"         # all 5 platforms"
  echo "  (append ,llm or ,wandb to the extra for agentic-search or experiment-tracking deps)"
  exit 1
}
```

(Local development against a checkout: `pip install -e '~/tao-run-automl[lepton]'`.) If missing, the agent prompts the user to authorize the install via Bash, then re-runs the preflight before continuing.
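The same import check can be done from Python without actually importing the package. This is a minimal sketch using the standard library's `importlib.util.find_spec`; the `module_available` and `preflight` helper names are hypothetical and are not part of the skill bank.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True when the top-level module `name` can be located."""
    return importlib.util.find_spec(name) is not None


def preflight(module: str = "tao_automl") -> str:
    """Mirror the shell check: report OK or a MISSING hint, never install."""
    if module_available(module):
        return "OK"
    return f"MISSING: {module} is not importable; install the platform extra first"


# Result depends on the local environment.
print(preflight())
```

Keeping the check free of side effects matches the rule that installation only happens after the user authorizes it.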
Prerequisites
Before running AutoML, satisfy all of these. The full detail (per-platform credential filtering, dataset URI formats, the bank-structure tree, and the install commands) is in `references/prerequisites.md`:

- Shared launch preflight: run the `tao-launch-workflow` intake pattern first. AutoML must not create runner files, workspaces, state files, logs, or compatibility shims, and must not install dependencies, until the selected platform's credentials, access check, dataset visibility, model credentials, container image confirmation, and compute shape are satisfied. This prevents wasting the budget on fake recommendation failures caused by SSH, storage, image, or credential setup.
- SDK credentials: env vars sourced from `~/.config/tao/.env` (auto-loaded by the skill bank's SessionStart hook). Filter required vars per platform with `scripts/list_tao_platforms.py --platform <platform> --format text` and ask only for what it lists (S3 only when URIs use `s3://`; `NGC_KEY` for container pulls). The agent never reads values; it only checks presence with `[ -n "$VAR_NAME" ]`. Construct the SDK with no arguments, e.g. `LeptonSDK()`.
- Dataset: accessible from the compute backend; URI format depends on the platform (`s3://...` for Lepton, an absolute shared path for SLURM, `azure://...` for Azure, a local path for Docker; never generate `aws://...`). Accept dataset roots or exact spec-key paths, preserving user-supplied keys such as `custom.train_dataset.annotation_path=` without forcing files to share a parent directory.
- Skill bank available: the runner takes an explicit `skill_dir` (absolute path to `<bank-root>/models/<network>`, no env-var fallback). Use the same bank root the agent loaded the workflow from. **CRITICAL:** AutoML requires a packaged, valid `<bank-root>/models/<network>/schemas/train.schema.json`. It is the AutoML support gate (defines `automl_enabled` params, defaults, ranges, options, weights, popular metadata). The runtime must not expect `~/tao-core` to exist; if the packaged train schema is missing, do not run AutoML for that model. `references/spec_template_<action>.yaml` is required for non-TAO-Core models (cosmos-rl, clip) and optional for TAO Core / Hydra-based models (DINO, BEVFusion).
- `nvidia-tao-automl` installed with the platform extra you want (public PyPI; pin in `versions.yaml`). Use the install commands from the Preflight block above or `references/prerequisites.md`; append `,llm` to the extra for agentic algorithms.

Verify setup:

```bash
python3 -c "from tao_automl.runner import AutoMLRunner; print('OK')"
python3 -c "from tao_automl.brain.llm_brain import LLMBrain; print('LLM OK')"  # optional, LLM features
python3 -c "import wandb; print('WandB OK')"  # optional, WandB
```
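The "check presence, never read values" rule for credentials can be sketched in Python as follows. This is an illustration only: `missing_vars` is a hypothetical helper, and apart from `NGC_KEY` the variable names are made-up placeholders. Like `[ -n "$VAR_NAME" ]`, it treats an empty string as missing, and it returns names only so no secret value is ever printed.

```python
import os
from typing import Iterable, Mapping, Optional


def missing_vars(
    required: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    # Only truthiness is inspected; values are never returned or logged.
    return [name for name in required if not env.get(name)]


# EXAMPLE_TOKEN and EXAMPLE_WORKSPACE are placeholder names for illustration.
demo_env = {"NGC_KEY": "placeholder", "EXAMPLE_TOKEN": ""}
print(missing_vars(["NGC_KEY", "EXAMPLE_TOKEN", "EXAMPLE_WORKSPACE"], demo_env))
# prints ['EXAMPLE_TOKEN', 'EXAMPLE_WORKSPACE']
```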
Concepts: What is TAO AutoML?
TAO AutoML automates the "try different hyperparameter values → train → compare results → repeat" cycle. You tell it what network (`network_arch`), which hyperparameters to search (from the model skill and schema), what metric to optimize (from the model skill or user request), and how many trials (budget). It then picks hyperparameter values with a search algorithm (Bayesian, Hyperband, LLM, etc.), launches a real training job on whichever backend the SDK targets, reads the result metric from training logs, feeds it back so the algorithm learns what works, repeats until budget is exhausted, and returns the best configuration found.

Each "trial" is called a recommendation (rec). One rec = one full training run with a specific set of hyperparameters.
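The cycle above can be illustrated with a toy loop. This sketch is not the `AutoMLRunner` implementation: it uses plain random search (which records results but does not learn from them, unlike Bayesian or Hyperband search), and `train_and_score` is a made-up quadratic standing in for a real training job. All function names here are hypothetical.

```python
import random


def recommend(rng: random.Random, space: dict) -> dict:
    """Sample one rec: a value for every hyperparameter in the search space."""
    return {name: rng.uniform(low, high) for name, (low, high) in space.items()}


def train_and_score(params: dict) -> float:
    """Toy stand-in for a training run; pretends loss is lowest at lr=0.01."""
    return (params["lr"] - 0.01) ** 2


def run_sweep(space: dict, budget: int, seed: int = 0):
    rng = random.Random(seed)
    history = []
    for rec_id in range(budget):  # one iteration = one rec
        params = recommend(rng, space)
        metric = train_and_score(params)
        history.append({"rec": rec_id, "params": params, "metric": metric})
    # The metric is a loss here, so the best rec is the one with the minimum.
    best = min(history, key=lambda rec: rec["metric"])
    return best, history


best, history = run_sweep({"lr": (0.0, 0.1)}, budget=5)
print(f"best rec: {best['rec']} of {len(history)}")
```

A smarter algorithm would replace `recommend` with something that reads `history` before proposing the next rec; the surrounding loop stays the same.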
Quick Support Queries
When the user asks what models/networks are supported for AutoML, run the packaged model-list helper in AutoML mode. AutoML enablement is model-level metadata (`skills/models/<network>/references/skill_info.yaml` has `automl_enabled: true`), not workflow-level. The helper reads that metadata, then validates whether the model also has a packaged, parseable train dataclass schema:

```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_tao_models.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} --scope automl --format text
```

The compatibility wrapper below is also valid and delegates to the same logic:

```bash
${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/scripts/list_automl_support.py \
  --skill-bank ${TAO_SKILL_BANK_PATH:-~/tao-skills-external} --format text
```

Return both sections from that output: runnable AutoML models and AutoML-enabled models still blocked on schema packaging. The support rule: AutoML is enabled at model level; runnable AutoML also requires `skills/models/<network>/schemas/train.schema.json` to be packaged and valid.
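The two-part support rule can be sketched as a small classifier. This is an illustration, not the logic of `list_tao_models.py`: the `automl_support` function and its three labels are hypothetical, the `automl_enabled` flag is passed in rather than read from `skill_info.yaml`, and "valid" is approximated here as "the file parses as JSON", which is weaker than real schema validation.

```python
import json
import tempfile
from pathlib import Path


def automl_support(model_dir: Path, automl_enabled: bool) -> str:
    """Classify a model as not-enabled, enabled-but-blocked, or runnable."""
    if not automl_enabled:
        return "not-enabled"
    schema = model_dir / "schemas" / "train.schema.json"
    try:
        json.loads(schema.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing file or unparseable JSON: enabled, but not runnable yet.
        return "enabled-but-blocked"
    return "runnable"


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "models" / "example_net"  # placeholder network name
    (model_dir / "schemas").mkdir(parents=True)
    before = automl_support(model_dir, automl_enabled=True)
    (model_dir / "schemas" / "train.schema.json").write_text("{}", encoding="utf-8")
    after = automl_support(model_dir, automl_enabled=True)

print(before, after)
```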
Step 1: Parse User Intent
Default to a quick-start run unless the user explicitly asks to customize AutoML or agrees to a customization offer. Do not present algorithm, budget, or search-space choices as required inputs for a normal "run AutoML" request.

Any workflow/application that reaches a train-capable model skill must consult the selected model's `automl_enabled` metadata. If it is `true`, use this AutoML workflow as the default training path unless the run/workflow setting has `automl_policy: off` or the user explicitly asks for a plain single training run. This keeps AutoML enablement scalable across tao-train-single-step, DEFT, and future workflows without duplicating allowlists in each application skill.

Extract the default-run inputs and apply the quick-start defaults. The full required-field table (`network_arch`, `platform`, dataset URIs / direct spec paths, `image`, `metric`, `direction`, `skill_dir`, `long_running_enabled`, `status_interval_minutes`, credentials, compute shape, and the LLM endpoint/model/key trio), the quick-start defaults (`bayesian`, `10` recs, `None` hyperparameters/ranges, `5`-minute monitoring), the friendly launch-intake prompting checklist, the customization-only fields, the quick-start runner shape, and metric-choice best practices all live in `references/intake-and-inputs.md`.

Key gating policy that always applies:

- If any required field is missing, ask the user. Do NOT guess dataset paths, skill bank paths, credentials, or hardware that the model skill marks as required.
- `image`: resolve the default, show it to the user, and require confirmation or `image=<override>` before creating the AutoML runner.
- `direction`: only needed when the metric name disagrees with the implicit "contains 'loss' → minimize, else maximize" rule.
- `llm_endpoint`, `llm_model`, `llm_api_key`: MUST prompt for `llm`/`hybrid`/`autoresearch`; the code default `https://integrate.api.nvidia.com/v1` returns 404, so always pass `llm_endpoint` explicitly.
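The implicit direction rule in the gating policy above is simple enough to write out. This is a minimal sketch, not the runner's code: the function names and the `"minimize"`/`"maximize"` string values are assumptions, and the case-insensitive match is also an assumption (the rule as stated only says the name "contains 'loss'").

```python
from typing import Optional


def infer_direction(metric: str) -> str:
    """Implicit rule: a metric name containing 'loss' is minimized."""
    return "minimize" if "loss" in metric.lower() else "maximize"


def resolve_direction(metric: str, direction: Optional[str] = None) -> str:
    """Prefer an explicit direction; fall back to the implicit rule."""
    if direction is None:
        return infer_direction(metric)
    if direction not in ("minimize", "maximize"):
        raise ValueError(f"unsupported direction: {direction!r}")
    return direction


print(resolve_direction("val_loss"))                # minimize
print(resolve_direction("mAP"))                     # maximize
print(resolve_direction("error_rate"))              # maximize (the trap)
print(resolve_direction("error_rate", "minimize"))  # minimize
```

The `error_rate` case shows why an explicit `direction` matters: a lower-is-better metric without "loss" in its name would be maximized by default.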
Before generating an AutoML script, verify platform access and dataset visibility using the shared launch preflight. For SLURM, that means passwordless SSH to at least one login host and remote `test -e` checks for each required annotation/media path. Verify container image confirmation the same way: the confirmed train image must be passed into `AutoMLRunner.run(..., image=chosen_image, ...)` or the SDK adapter's `create_job(..., image=chosen_image, ...)`; do not rely on an implicit default. Also run any model-specific annotation content checks documented by the model skill. If preflight fails, stop with remediation steps instead of creating a runner that will immediately fail. Missing required annotation fields are a preflight failure, not an AutoML recommendation failure.

**Customization gate:** After the required quick-start fields are resolved, you may briefly offer customization. If the user declines, proceed with the defaults. If the user chooses customization, present the customization-only fields from `references/intake-and-inputs.md`.

**MANDATORY: Read the generated dataclass schema before configuring AutoML.** For the selected model/action, read `${TAO_SKILL_BANK_PATH:-~/tao-skills-external}/models/<network>/schemas/train.schema.json` and `.../schemas/manifest.json`. AutoML can run only when `train.schema.json` is packaged and valid. Do not fall back to hand-written notes, old runner scripts, or a local `~/tao-core` checkout. If the schema is missing, stop and report that AutoML is enabled but not runnable until the schema is generated and shipped. Use the schema JSON as the source of truth for `automl_default_parameters`, `automl_disabled_parameters`, per-parameter defaults, ranges, enums, `option_weights`, `math_cond`, `depends_on`, `parent_param`, and `popular`. When `automl_hyperparameters=None`, the runner discovers all params marked `automl_enabled=True` in the schema; each network has its own set, so never hardcode them here.

The following MANDATORY rules gate every run. Full text, code patterns, and rationale are in `references/mandatory-rules.md`:

- MANDATORY prompting for LLM-based algorithms (`llm`, `hybrid`, `autoresearch`): resolve `llm_endpoint`, `llm_model`, and `llm_api_key` before generating the script (precedence chains in the reference). Without valid LLM settings the brain silently falls back to random sampling and wastes GPU budget.
- MANDATORY: Read the model skill before generating the script. Read `<bank-root>/models/<network>/SKILL.md` and apply its Training Requirements, Per-Action Dataset Requirements, Typical Spec Overrides, AutoML / HPO Notes, and Error Patterns. Do not hardcode model-specific knowledge.
- MANDATORY: No model-specific constants in this AutoML skill. Hyperparameter names, ranges, defaults, metric names, dataset layouts, spec override keys, images, and metric regexes belong in the schema and model skill, not here.
- MANDATORY: Timestamped workspace folders. Always suffix `workspace_path` with `datetime.now().strftime("%Y%m%d_%H%M%S")`; never use a flat path.
- MANDATORY: Fresh runner per new AutoML request, after preflight passes. Every new request creates a new runner script, log, PID file, SDK `state_file`, and `workspace_path` with a unique timestamp; only resume when the user explicitly asks to resume/continue/recover/inspect.
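The timestamped-workspace rule above might look like the following. This is a sketch under assumptions: the `timestamped_workspace` helper and the example base path are hypothetical, and joining the stamp to the folder name with an underscore is one possible layout; only the `strftime` format string comes from this document.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def timestamped_workspace(base: str, now: Optional[datetime] = None) -> Path:
    """Append a %Y%m%d_%H%M%S stamp to the last component of `base`."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    base_path = Path(base)
    return base_path.with_name(f"{base_path.name}_{stamp}")


# A fixed datetime keeps the example reproducible; real runs omit `now`.
fixed = datetime(2025, 1, 2, 3, 4, 5)
print(timestamped_workspace("/tmp/automl_ws", fixed).name)
# prints automl_ws_20250102_030405
```

Two requests started at least a second apart get distinct folders, which is what prevents the workspace collisions listed among the pitfalls.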
**Best-practice on metric choice:** prefer the model skill's recommended validation or task metric over cheap training loss (which overfits on small fine-tuning sets); when using a validation proxy, also apply the model skill's required validation-related `spec_overrides` so the metric is emitted; a real task metric via `eval_fn` is most honest but adds per-rec cost. Details in `references/intake-and-inputs.md`.
Step 2: Select Algorithm
Default to `bayesian`. The full classical and LLM/agentic algorithm tables (use-when, typical budget, how it works), the default/caveat rules, and the decision tree are in `references/algorithms.md`. Present the algorithm guide only in customization mode or when the user names one.
Step 3: Configure and Run
Build the runner from the generic shapes in `references/automl-settings.md`: minimal example, full all-options example, LLM-powered example, the programmatic `AutoML` API, the complete `automl_settings` key table, `kpi` metric resolution, the LLM analyzer environment toggles, and `spec_overrides` rules.

- Constrain the search space with `custom_param_ranges`: `references/custom-param-ranges.md` (format table, examples, model-specific search-space rules).
- Opt-in `metric_extractor` / `eval_fn` hooks and WandB tracking: `references/hooks-and-wandb.md`.
- LLM/agentic deep dive (`NLConfigGenerator`, the standalone `LLMAnalyzer`, the five autoresearch agent components, and multi-phase research programs): `references/nl-config-and-research.md`.

All model-specific hyperparameters, metric extractors, and `spec_overrides` come from the model skill.
Step 4: Monitor Progress
`runner.run()`, `on_recommendation`, `on_result`, `workspace_path`, `resume=True`, `query_status()`, `get_status()`: see `references/monitoring-and-resume.md`.
Step 5: Interpret Results
`runner.run()`, `best`, `progress`, `history`: see `references/results.md`.
Model-Specific Notes
Model-specific notes do not belong here. For every requested `network_arch`, read `<bank-root>/models/<network>/SKILL.md` and use its Training Requirements, Per-Action Dataset Requirements, Typical Spec Overrides, AutoML / HPO Notes, and Error Patterns sections as the source of truth.
Common Pitfalls
The 15 recurring failure modes are documented with fixes in `references/pitfalls.md`. They include wrong/missing `skill_dir`, wrong LLM endpoint (404), model-specific training failures, workspace collisions, weak proxy metrics, the implicit-direction trap, spec-override typos, mid-sweep orchestrator death, silent random LLM configs, missing `openai`, WandB not logging, and `conda run` buffering. Review them before and during any run.
Example Conversations
Representative agent/user exchanges for optimizing a network, requesting a real task metric, LLM-guided search, fully-autonomous autoresearch, resuming, switching to ASHA with WandB, and generating a config from a goal description: see `references/examples.md`.