nemotron-customize

IMPORTANT: Read this file before answering any nemotron-customize, Nemotron customization, Curator curation, translation, SFT, PEFT, RL, conversion, optimization, checkpoint or existing/hosted-endpoint evaluation, or multi-step pipeline request. This applies whether the user names one step or asks you to compose several steps into a pipeline.
Evaluation requests count even when no training is involved: "evaluate", "benchmark", "smoke test", or "score" an existing/hosted endpoint, an API/model ID, or a deployed model all route to eval/model_eval. Read this skill for those too.

Purpose

Turn a model-customization request into a repo-native Nemotron step pipeline. Plan the DAG, validate artifact wiring, and create only the YAML/config files needed to run existing steps.
Use this skill only for inspecting, configuring, validating, running, or submitting existing Nemotron steps or multi-step training/customization pipelines. For frontend, dashboard, visualization, generic ML advice, billing/access, or unrelated coding tasks, stop with a short scope note and do not inspect the step catalog or edit files in that turn.

Prerequisites

  • A checkout of the Nemotron repo with src/nemotron/steps/ present; run from the repo root.
  • uv available, so that uv run nemotron steps ... can be invoked.
  • For remote execution: an env profile TOML (NEMOTRON_ENV_FILE or env*.toml) with a section matching the selected step.
  • For hosted services (translation, hosted eval): the auth environment variable expected by the step (for example NVIDIA_API_KEY), exported in the environment and never inlined or committed.
  • User-provided concrete values (model/checkpoint, data paths, output dir, hardware/GPU count) before any command is presented as runnable.

Limitations

  • Does not invent new catalog steps. When no existing step, runner, recipe, CLI, or config can satisfy the request, it names the gap (Explorer mode) instead of fabricating a step.
  • Produces YAML/config for existing steps; new Python/shell is out of scope except in Explorer mode after the gap is approved.
  • Not for deployment-only/serving, frontend, dashboards, generic ML advice, or non-Nemotron tasks.
  • Does not guess concrete values (paths, model IDs, GPU counts, profiles); it asks or returns Blocked when they are missing.

Core Rule

Use bundled references first. The references/ folder is the first decision surface for routing, artifacts, patterns, hardware heuristics, and command shape. Use src/nemotron/steps/... only as a live verification/fallback source when you need exact current config fields, manifests, runner imports, or details missing from bundled references.
If sources disagree:
  1. Checked live repo files win for exact execution.
  2. Bundled references win for initial routing and planning.
  3. Upstream docs/context packs are used only for exceptional code generation or library API details.

Before You Begin

  • Read this SKILL.md workflow and the relevant bundled reference before opening repo source files.
  • Route from references/CATALOG.md and references/ARTIFACTS.md before any broad repo exploration. Once a route is determined, verify only the selected live step/config/env files needed for the answer.
  • Do not emit commands with fake paths, placeholder model IDs, guessed task IDs, guessed batch profiles, or default auth variable names presented as facts. Ask for missing concrete values or return a Blocked handoff.
  • Use references/COMMANDS.md as the authoritative checklist before finalizing configs or execution commands.
  • For pipeline requests, plan before editing. Do not create or modify files until the DAG, artifact edges, required inputs, and validation checks are stated and approved.
  • For one-shot command requests, prefer a complete parameterized command in one response over exploratory prose, but only after required inputs are known. If the user already provides the needed values and asks for only a command, answer with the command first and keep explanation minimal.
  • Output discipline (keeps responses tight): emit one command block per step, include only flags the step actually defines, and add no speculative or invented flags. Keep narrative to a few lines — the command plus the required safety/profile callouts, not a tutorial. Do not restate reference content the user did not ask for.
  • Do not spawn subagents for one-shot command lookup. Use the bundled command reference directly; verify only the selected step if needed.

Safety

Keep Bash scoped to repo-safe commands such as uv run nemotron steps ..., targeted tests, git status/diff, and config validation. Never run environment dumps (env, printenv, broad export) or commands that expose secret values. For remote submissions, destructive changes, or expensive launches, confirm before execution.
When inspecting env/config files, avoid printing whole files that may contain secrets. Use targeted reads, report only section names and env-var names, and redact values for fields containing token, key, secret, password, credential, or auth.
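The redaction rule above could look roughly like the following. This is a minimal sketch, not the repo's implementation: the function name redact, the mask string, and the sample profile fields are all hypothetical, and it assumes the config has already been parsed into nested dicts.

```python
# Substrings taken from the rule above; any field whose name contains one is masked.
SENSITIVE_MARKERS = ("token", "key", "secret", "password", "credential", "auth")


def redact(config):
    """Return a copy of a nested dict with sensitive-looking field values masked."""
    redacted = {}
    for name, value in config.items():
        if any(marker in name.lower() for marker in SENSITIVE_MARKERS):
            # Mask the whole value, even if it is a nested table.
            redacted[name] = "<redacted>"
        elif isinstance(value, dict):
            redacted[name] = redact(value)
        else:
            redacted[name] = value
    return redacted


# Hypothetical profile contents, for illustration only.
profile = {"cluster": {"account": "demo", "api_key": "abc123", "nodes": 2}}
print(redact(profile))
# → {'cluster': {'account': 'demo', 'api_key': '<redacted>', 'nodes': 2}}
```

Matching on substrings is deliberately broad: it may mask harmless fields, which is the safer failure mode when reporting on env files.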

Reference Map

| Question | Read first | Live fallback / verification |
| --- | --- | --- |
| Which step or category fits? | references/CATALOG.md | uv run nemotron steps list/show, then selected step.toml |
| Do artifacts chain? | references/ARTIFACTS.md | src/nemotron/steps/types.toml |
| What run shape should I emit? | references/COMMANDS.md | checked-in config YAML plus active profile TOML |
| Remote profile generation or selection | references/COMMANDS.md | active NEMOTRON_ENV_FILE, env.toml, or env.*.toml |
| What hardware/backend should I recommend? | references/HARDWARE.md | selected step [[models]] and [[strategies]] |
| Which cross-step guardrails apply? | references/PATTERNS.md | src/nemotron/steps/patterns/<id>.md |
| How do I run the full workflow? | references/WORKFLOW.md | selected step configs, step.py, and runners |
| Which upstream library API should generated code use? | references/context/index.toml -> matching pack | selected step.py, _runners/, upstream docs |
| New project scaffold, only when existing repo code cannot support the request | references/act/PROJECT.md | existing repo project/recipe shape |
| Per-stage code rules, only when existing repo code cannot support the request | references/act/STAGE.md | selected step.py and shared runner |

Do not start by reading category READMEs or step.toml for ordinary decisions. Select candidates from bundled references, then verify exact live details before writing configs or final commands.

Routing

Use references/CATALOG.md as the authoritative home for step selection and route-specific fast paths. Use ARTIFACTS.md, PATTERNS.md, and HARDWARE.md only to resolve artifact, cross-step, or hardware constraints after the catalog narrows the route.
Each step is independent, and stitching steps together is your job. Compose any pipeline by artifact matching, working back from the user's end goal: chain in an additional step only when the next step consumes an artifact type that nothing upstream already produces. Do not rely on fixed, named step combinations.
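Composition by artifact matching can be sketched as a backward walk from the goal step. The mini-catalog below is hypothetical: the step ids and packed_parquet echo names used in this document, but the other artifact type names (jsonl, megatron_checkpoint) and the plan function are illustrative assumptions, not the contents of references/ARTIFACTS.md.

```python
# Hypothetical catalog: step id -> (artifact types consumed, artifact types produced).
CATALOG = {
    "data_prep/sft_packing": ({"jsonl"}, {"packed_parquet"}),
    "sft/megatron_bridge": ({"packed_parquet"}, {"megatron_checkpoint"}),
}


def plan(goal_step, available):
    """Return steps in run order, adding a producer only for artifacts nobody supplies yet."""
    order = []
    have = set(available)

    def visit(step):
        consumes, produces = CATALOG[step]
        for artifact in sorted(consumes):
            if artifact in have:
                continue  # already supplied by the user or an earlier step
            producers = [s for s, (_, out) in CATALOG.items() if artifact in out]
            if not producers:
                raise LookupError(f"no step produces {artifact!r}; name the gap")
            visit(producers[0])
        order.append(step)
        have.update(produces)

    visit(goal_step)
    return order


print(plan("sft/megatron_bridge", available={"jsonl"}))
# → ['data_prep/sft_packing', 'sft/megatron_bridge']
```

If the user already holds packed_parquet, the same call returns only the SFT step, which is the point of matching on artifacts rather than on named step combinations.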

Instructions

Follow the flow that matches the request: a recommendation/plan, a single-step command, or a multi-step pipeline. In all cases, route from the bundled references first, gather required inputs, and verify the selected live step before presenting anything as runnable.

Recommendation Response

Use this shape for planning answers: Decision, Why, Required inputs, Config/command, Avoid, and Next step. Call out the stack to avoid when the user's constraints make it a poor fit.
Whenever the answer includes a command that touches a hosted service or remote execution, also state, in the answer:
  • The auth env-var name and that its value must be exported in the environment, never inlined or committed (never print the value).
  • For --batch / --run, the env TOML profile prerequisite; if no profile exists, mark the command Blocked or give the local --dry-run shape.

Single-Step Command Flow

  1. Confirm repo root has pyproject.toml and src/nemotron/steps/.
  2. Read references/CATALOG.md and the selected section of references/COMMANDS.md.
  3. Verify the selected live step with uv run nemotron steps show <step_id> when available, or the selected step.toml when the CLI is unavailable.
  4. Read the requested checked-in config or user overlay before emitting the command.
  5. For remote execution, read NEMOTRON_ENV_FILE or repo-root env*.toml and pick an actual section whose profile matches the step.
  6. Emit the full command in one reply with the source tier: Verified, Repo-grounded, Reference-grounded, or Blocked.
Canonical command shapes live in references/COMMANDS.md.
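Step 1 of the flow above is easy to sketch as a pre-flight check. The helper name missing_repo_markers is hypothetical; only the two paths it looks for come from this document.

```python
from pathlib import Path


def missing_repo_markers(root):
    """Return the required repo markers absent under root (an empty list means OK)."""
    root = Path(root)
    required = ["pyproject.toml", "src/nemotron/steps"]
    return [rel for rel in required if not (root / rel).exists()]


missing = missing_repo_markers(".")
if missing:
    # Mirror the document's handoff wording rather than guessing a different root.
    print("Blocked: missing " + ", ".join(missing))
else:
    print("repo root looks valid")
```

Returning the list of missing markers, rather than a bare boolean, lets the Blocked handoff say exactly what was not found.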

Pipeline Workflow

For pipelines with two or more stages, use Orient -> Plan -> Act -> Verify. Read references/WORKFLOW.md for the phase checklist.
  • Orient from bundled references and user constraints.
  • Plan a DAG with artifact types, configs, patterns, and validation checks.
  • Wait for approval before writing configs or code.
  • Act with YAML/config-only changes whenever an existing step can satisfy the request.
  • Verify every generated YAML, artifact edge, command, and README command before reporting completion.
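The Plan and Verify phases both depend on the plan being a real DAG whose edges carry matching artifact types. The following is a rough sketch of such a check using the standard-library graphlib module; the function names and the dict-of-sets representation are assumptions made for illustration, not the repo's data model.

```python
from graphlib import TopologicalSorter


def check_edges(produces, consumes, edges):
    """Return a problem string for every edge whose two ends share no artifact type."""
    problems = []
    for upstream, downstream in edges:
        if not (produces[upstream] & consumes[downstream]):
            problems.append(f"{upstream} -> {downstream}: no matching artifact type")
    return problems


def run_order(edges):
    """Topologically sort the stages; raises graphlib.CycleError if the plan has a cycle."""
    sorter = TopologicalSorter()
    for upstream, downstream in edges:
        sorter.add(downstream, upstream)  # downstream depends on upstream
    return list(sorter.static_order())


# Hypothetical two-stage plan; "checkpoint" is an illustrative type name.
produces = {"data_prep/sft_packing": {"packed_parquet"}, "sft/megatron_bridge": {"checkpoint"}}
consumes = {"data_prep/sft_packing": {"jsonl"}, "sft/megatron_bridge": {"packed_parquet"}}
edges = [("data_prep/sft_packing", "sft/megatron_bridge")]

print(check_edges(produces, consumes, edges))  # → []
print(run_order(edges))  # → ['data_prep/sft_packing', 'sft/megatron_bridge']
```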

Catalog Mode

Use when the request maps to existing steps. Fast path: references/CATALOG.md -> references/ARTIFACTS.md -> references/COMMANDS.md -> verify selected live manifest/config/profile -> add a new named config under the selected step's config/ directory.

Customization Surface

  • Always customize through the step catalog under src/nemotron/steps/. Never divert to alternate recipe CLIs such as src/nemotron/cli/commands/super3/ or .../nano3/, even for Super3/Nano3 work. If a request seems to need those, map it back to the equivalent catalog step (e.g. sft/megatron_bridge).
  • Make customizations as NEW config files inside the selected step's src/nemotron/steps/<cat>/<step>/config/ directory, for example src/nemotron/steps/sft/megatron_bridge/config/my_super3.yaml.
  • Never edit the checked-in default.yaml, tiny.yaml, other shipped configs, step.toml, step.py, or shared runners. Adding a new config file beside them is the expected and only customization write.
  • Base new configs on the checked-in default.yaml schema (read it, copy the needed fields), then override only what the request requires.
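The "add beside, never overwrite" rule above might be enforced with a small path guard. This is a sketch under assumptions: new_config_path is a hypothetical helper, and the SHIPPED set lists only the shipped file names this document mentions, so a real step directory may contain more.

```python
from pathlib import Path

# Shipped config names mentioned in this document; a real step may ship others.
SHIPPED = {"default.yaml", "tiny.yaml"}


def new_config_path(repo_root, category, step, name):
    """Return the path for a NEW config beside the shipped ones, refusing any overwrite."""
    config_dir = Path(repo_root) / "src" / "nemotron" / "steps" / category / step / "config"
    target = config_dir / f"{name}.yaml"
    if target.name in SHIPPED or target.exists():
        raise FileExistsError(f"refusing to touch existing config: {target}")
    return target


print(new_config_path(".", "sft", "megatron_bridge", "my_super3").as_posix())
```

The guard only computes and validates the path; writing the YAML, based on the fields read from default.yaml, remains a separate and approved step.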

Explorer Mode

Use only after confirming no existing step, runner, recipe, CLI, or YAML config surface can satisfy the request. Full procedure lives in references/WORKFLOW.md.

Configuration Alignment

Surface these constraints before commands or config writes:
  • SFT packing pack_size, Megatron-Bridge seq_length, packed sequence size, tokenizer, and chat template must match.
  • Prepared packed_parquet and binidx are tokenizer-locked; rebuild after tokenizer, chat-template, sequence-length, split, or blend changes.
  • Megatron-Bridge global batch size must be divisible by data-parallel size; start distributed validation with micro batch size 1.
  • TP/PP/CP/EP choices must fit GPU count, memory, topology, and model divisibility.
  • LoRA merge requires the exact base checkpoint/model and tokenizer used during adapter training.
  • Conversion/eval of Megatron checkpoints should point at a concrete iter_* checkpoint, not a parent run directory.
  • Hosted eval and translation configs store auth env-var names only, not values.
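Several of these constraints are plain arithmetic and can be checked before any config is written. The sketch below is illustrative only. It makes three labeled assumptions: "match" for pack_size and seq_length means equality, the data-parallel size is the GPU count divided by TP*PP*CP (expert parallelism is left out), and the function name, parameter names other than pack_size and seq_length, and the sample path are all hypothetical.

```python
import re


def alignment_problems(pack_size, seq_length, global_batch_size, gpus,
                       tp=1, pp=1, cp=1, checkpoint=None):
    """Collect readable problems instead of raising, so all can be surfaced at once."""
    problems = []
    if pack_size != seq_length:
        problems.append(f"pack_size {pack_size} != seq_length {seq_length}")
    model_parallel = tp * pp * cp
    if gpus % model_parallel:
        problems.append(f"{gpus} GPUs not divisible by TP*PP*CP = {model_parallel}")
    else:
        # Assumption: data-parallel size = GPU count / (TP * PP * CP).
        dp = gpus // model_parallel
        if global_batch_size % dp:
            problems.append(f"global batch {global_batch_size} not divisible by DP size {dp}")
    if checkpoint is not None:
        leaf = checkpoint.rstrip("/").rsplit("/", 1)[-1]
        if not re.fullmatch(r"iter_.+", leaf):
            problems.append(f"checkpoint {checkpoint!r} is not a concrete iter_* directory")
    return problems


# Hypothetical values: 8 GPUs with TP=2 gives DP=4, which does not divide a batch of 6.
for problem in alignment_problems(4096, 4096, global_batch_size=6, gpus=8, tp=2,
                                  checkpoint="/runs/sft_demo"):
    print(problem)
```

Tokenizer and chat-template agreement cannot be reduced to arithmetic like this and still has to be verified against the selected step's live config.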

Operational Nuances

  • Smoke configs (tiny.yaml, tiny_chat.yaml) are wiring tests, not quality evidence.
  • ${art:...} references belong in recipe-backed configs; standalone YAML uses plain paths.
  • Keep pretraining bin/idx data and blend.json from the same run/release.
  • Write customized configs as new files in the step's src/nemotron/steps/<cat>/<step>/config/ directory; never modify the checked-in default.yaml or other shipped configs.
  • For LoRA, preserve the exact base checkpoint and tokenizer/template metadata needed by later merge/eval.
  • For translation and hosted eval, mention auth environment variable names only, never values.

Boundaries

Do:
  • Always route through the step catalog under src/nemotron/steps/; never use alternate recipe CLIs (src/nemotron/cli/commands/super3|nano3/...).
  • Reuse repo CLIs, runners, recipes, steps, and checked-in configs first.
  • Customize by adding a new config under the step's config/ directory; base it on default.yaml rather than copying it blindly.
  • Validate artifact edges and cite patterns that changed the plan.
  • Ask about hardware/data/backend/output path when missing.
  • Surface tradeoffs such as AutoModel vs Megatron-Bridge and full SFT vs LoRA.
Do not:
  • Invent steps when a catalog step fits.
  • Skip Plan for pipelines with two or more stages.
  • Generate Python or shell when YAML is enough.
  • Add monitoring/W&B unless asked.
  • Assume GPU count, env profile, endpoint type, task ID, or auth value.
  • Generate Slurm/Airflow/Kubeflow wrappers unless the request explicitly needs deployment scaffolding.
  • Edit checked-in step files (default.yaml / tiny.yaml, other shipped configs, step.toml, step.py, runners); only add a new config beside them.
  • Restate all per-step rules in SKILL.md; use bundled references and source fallback.

Examples

Single-step routing (LoRA on a small box). User: "LoRA fine-tune a HF model on 2 GPUs." Route per CATALOG.md -> peft/automodel (HF base + small GPU count); do not offer Megatron-Bridge. Collect base model, JSONL data path, output dir, LoRA rank/alpha, then emit one uv run nemotron steps run peft/automodel -c <config> --dry-run ... command.
Multi-step pipeline (Super3 SFT). User: "data prep + SFT for Super3." This is two stages, so plan first: SFT on Super3 -> Megatron-Bridge, which consumes packed_parquet, so data_prep/sft_packing is required upstream. Present the DAG (sft_packing -> sft/megatron_bridge), align pack_size / seq_length / tokenizer, wait for approval, then add new configs under src/nemotron/steps/<step>/config/<name>.yaml. Super3 needs a remote profile; state the env TOML prerequisite or mark Blocked.
Hosted-endpoint evaluation (no training). User: "benchmark my hosted model endpoint." Route to eval/model_eval with -c tiny_chat. Collect endpoint URL, model id, task IDs, and the auth env-var name (value exported, never inlined). See references/COMMANDS.md Evaluation Examples.

Troubleshooting

| Situation | Action |
| --- | --- |
| Artifact types do not chain | Recheck references/ARTIFACTS.md; insert a converter or change the DAG before writing configs. |
| Remote profile or --batch is unclear | Read active env TOML; do not guess profile names. |
| Config key is unclear | Verify selected checked-in config, step.py, and shared runner before editing. |
| Strategy points to a missing context pack | Skip the pack, use catalog/pattern text, and flag the plan with WARNING: <topic> docs unavailable. |
| Hardware looks too small | Use references/HARDWARE.md; suggest smaller model, AutoModel, then LoRA before full Megatron-Bridge. |
| Two Act attempts fail | Stop, explain what was tried and failed, and ask how to proceed. |
| No existing repo path matches | Check references/context/index.toml and selected source fallback; use Explorer mode only after naming the gap. |