nemo-automodel-recipe-development

NeMo AutoModel Recipe Development

Instructions

For recipe questions, answer with the smallest complete path to action:

1. Name the relevant recipe file or YAML section.
2. List the builder functions or config keys involved.
3. Include a minimal YAML or command example when the question asks how to configure something.
4. End with a local validation command or tiny CPU-compatible test.

For conceptual recipe questions, answer from this skill without inspecting the repository or loading other AutoModel skills unless the user asks you to edit files. Keep the response focused on recipe YAML, builders, CLI routing, tests, and local validation.

Use these compact answer patterns for common questions:

- New finetuning recipe variant: start from the closest file under `nemo_automodel/recipes/`, update the model, dataset or dataloader, optimizer, loss, LR scheduler, step scheduler, and checkpoint builders, register a CLI route only if adding a command or domain alias, add example YAML under `examples/`, then add a tiny CPU-compatible unit test and run `automodel finetune llm -c <config.yaml>`.
- `_target_` fields: describe `_target_` as the fully qualified Python callable, explain that sibling keys become keyword arguments, show optimizer and dataset examples, and mention nested CLI overrides such as `--optimizer.lr`.
- Validation and checkpointing: name `step_scheduler.val_check_interval`, `step_scheduler.checkpoint_interval`, `validation_dataset`, `restore_from.path`, and consolidated safetensors; include the minimal YAML snippet from this skill.

For validation and checkpointing, always name:

- `step_scheduler.val_check_interval` for validation cadence.
- `step_scheduler.checkpoint_interval` for save cadence.
- `validation_dataset` as the validation dataloader source.
- `restore_from.path` for resume.
- Consolidated safetensors as the default checkpoint format for HF ecosystem compatibility.

Routing Boundary

Use this skill for recipe construction and execution-flow questions: YAML structure, `_target_` callables, builder functions, validation datasets, checkpoint configuration, CLI route registration, and recipe-specific tests.

Do not use this skill for standalone distributed strategy selection, cluster launcher configuration, or model architecture onboarding unless the user is asking how those choices appear inside an AutoModel recipe YAML.

Recipe Architecture

Execution Flow

```
CLI (automodel finetune llm -c config.yaml)
  -> app.py parses command + domain + config
    -> recipe script (e.g. train_ft.py) main(config_path)
      -> Recipe class .setup() builds all components
        -> .run_train_validation_loop() executes training
```
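
The first two hops of this flow (parse the command, domain, and config path, then pick a recipe script) can be sketched as a lookup table. This is only an illustration: `ROUTES` and `resolve_recipe()` are hypothetical names, the `finetune vlm` route is an assumption, and the real `app.py` may be organized differently.

```python
# Hypothetical route table: (command, domain) -> recipe script path.
# The table and resolve_recipe() are illustrative, not AutoModel's real API.
ROUTES = {
    ("finetune", "llm"): "nemo_automodel/recipes/llm/train_ft.py",
    ("finetune", "vlm"): "nemo_automodel/recipes/vlm/finetune.py",  # assumed route
}


def resolve_recipe(argv):
    """Split argv into command, domain, and config path, then look up the recipe."""
    command, domain, *rest = argv
    if "-c" not in rest:
        raise ValueError("missing -c <config.yaml>")
    config_path = rest[rest.index("-c") + 1]
    try:
        recipe = ROUTES[(command, domain)]
    except KeyError:
        raise ValueError(f"no recipe registered for '{command} {domain}'") from None
    return recipe, config_path


recipe, config_path = resolve_recipe(["finetune", "llm", "-c", "config.yaml"])
print(recipe, config_path)
```

A command and domain pair that is missing from the table fails immediately, which is the same symptom as an unregistered CLI route.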

Recipe Class

Recipes inherit from `BaseRecipe` and implement two methods:

- `setup()` -- builds model, optimizer, dataloader, loss, LR scheduler, step scheduler, and checkpoint config via builder functions.
- `run_train_validation_loop()` -- executes the training and validation loop.
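
A minimal sketch of that two-method contract follows. The `BaseRecipe` here is a stand-in defined locally so the example runs without AutoModel installed; `ToyRecipe`, its placeholder components, and the `main()` helper are hypothetical.

```python
class BaseRecipe:
    """Stand-in for the real base class, defined here so the sketch runs alone."""

    def setup(self):
        raise NotImplementedError

    def run_train_validation_loop(self):
        raise NotImplementedError


class ToyRecipe(BaseRecipe):
    def __init__(self, cfg):
        self.cfg = cfg
        self.components = {}
        self.log = []

    def setup(self):
        # Real recipes call builder functions here; these are placeholders.
        for name in ("model", "optimizer", "dataloader", "loss_fn",
                     "lr_scheduler", "step_scheduler", "checkpoint_config"):
            self.components[name] = f"<{name} built from cfg>"

    def run_train_validation_loop(self):
        # Steps are counted from 1 in this sketch (an assumption).
        for step in range(1, self.cfg["max_steps"] + 1):
            self.log.append(("train", step))
            if step % self.cfg["val_check_interval"] == 0:
                self.log.append(("validate", step))


def main(cfg):
    recipe = ToyRecipe(cfg)
    recipe.setup()
    recipe.run_train_validation_loop()
    return recipe


recipe = main({"max_steps": 4, "val_check_interval": 2})
print(recipe.log)
```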

Builder Pattern

All components are constructed through dedicated builder functions:

- `build_model()` -- instantiates the model from config
- `build_optimizer()` -- creates optimizer (AdamW, etc.)
- `build_dataloader()` -- sets up train and validation dataloaders
- `build_loss_fn()` -- creates the loss function
- `build_lr_scheduler()` -- creates the learning rate scheduler
- `build_step_scheduler()` -- creates the step scheduler controlling training progression
- `build_checkpoint_config()` -- configures checkpointing

Infrastructure Application Order

Components are applied in this strict order after building:

1. PEFT (LoRA, etc.)
2. FP8 quantization
3. QAT (quantization-aware training)
4. Checkpoint load / restore
5. Parameter freezing
6. Sharding (FSDP2, Megatron-FSDP, DDP)
7. Device placement
8. `torch.compile`
9. Context parallelism hooks
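
One way to keep such an order fixed regardless of how the config lists its features is to iterate over a canonical sequence and apply only the enabled steps. The sketch below assumes hypothetical step names and placeholder callables; it is not how AutoModel itself implements the ordering.

```python
# Canonical order from the list above; the short names are hypothetical.
INFRA_ORDER = [
    "peft", "fp8", "qat", "checkpoint_restore", "freeze",
    "shard", "device_placement", "compile", "cp_hooks",
]


def apply_infrastructure(state, enabled, steps):
    """Apply enabled steps in INFRA_ORDER, whatever order `enabled` was given in."""
    applied = []
    for name in INFRA_ORDER:
        if name in enabled:
            steps[name](state)
            applied.append(name)
    return applied


def make_step(name):
    # Placeholder step that only records that it ran.
    def step(state):
        state.append(name)
    return step


steps = {name: make_step(name) for name in INFRA_ORDER}
state = []
applied = apply_infrastructure(state, {"shard", "peft", "compile"}, steps)
print(applied)  # ['peft', 'shard', 'compile']
```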

YAML Config Anatomy

A complete recipe config follows this structure:

```yaml
step_scheduler:
  max_steps: 1000
  num_epochs: 1
  grad_accumulation_steps: 4
  val_check_interval: 100
  checkpoint_interval: 500
  log_interval: 10

dist_env:
  master_addr: localhost
  master_port: 29500

rng:
  seed: 42

model:
  _target_: nemo_automodel.models.llm.NemotronHForCausalLM
  name_or_path: meta-llama/Llama-3.2-1B
  # additional model kwargs passed to the constructor

compile:
  enabled: false
  backend: inductor

clip_grad_norm:
  max_norm: 1.0

distributed:
  strategy: fsdp2       # fsdp2 | megatron_fsdp | ddp
  dp_size: auto
  tp_size: 1
  cp_size: 1

loss_fn:
  _target_: torch.nn.CrossEntropyLoss

dataset:
  _target_: nemo_automodel.datasets.squad.SquadDataset
  tokenizer_name_or_path: meta-llama/Llama-3.2-1B
  max_seq_length: 2048

validation_dataset:
  _target_: nemo_automodel.datasets.squad.SquadDataset
  split: validation

packed_sequence:
  enabled: false

dataloader:
  batch_size: 4
  num_workers: 4
  pin_memory: true

optimizer:
  _target_: torch.optim.AdamW
  lr: 2.0e-5
  weight_decay: 0.01

lr_scheduler:
  _target_: nemo_automodel.schedulers.CosineAnnealingWarmup
  warmup_steps: 50
  min_lr: 1.0e-6
```

The `_target_` Pattern

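The `_target_` key specifies a fully qualified Python callable. All remaining keys in that section are passed as keyword arguments:

```yaml
optimizer:
  _target_: torch.optim.AdamW   # callable
  lr: 2.0e-5                    # kwarg
  weight_decay: 0.01            # kwarg
```

This is roughly equivalent to the call below. `torch.optim.AdamW` also requires the parameters to optimize as its first argument; the YAML section does not contain them, so the recipe supplies them when it builds the optimizer:

`torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)`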
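
The resolution step can be sketched with `importlib`. This is a simplified illustration, not AutoModel's actual loader: `instantiate()` is a hypothetical helper, it only handles `module.attribute` paths (not targets such as a classmethod on a class), and it uses a standard-library target so it runs without PyTorch.

```python
import importlib


def instantiate(section, *args):
    """Resolve `_target_` to a callable and call it with the sibling keys as kwargs."""
    kwargs = dict(section)
    target = kwargs.pop("_target_")
    module_path, _, attr = target.rpartition(".")
    fn = getattr(importlib.import_module(module_path), attr)
    # Positional args model values the recipe supplies itself, e.g. model parameters.
    return fn(*args, **kwargs)


# Standard-library target so the example runs without PyTorch installed.
delta = instantiate({"_target_": "datetime.timedelta", "hours": 1, "minutes": 30})
print(delta.total_seconds())  # 5400.0
```

In this sketch a misspelled module raises `ImportError` and a misspelled attribute raises `AttributeError`, so a typo in `_target_` surfaces as soon as the section is instantiated.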

CLI Overrides

Any config value can be overridden from the command line:

```bash
automodel finetune llm -c config.yaml \
  --optimizer.lr 1e-4 \
  --step_scheduler.max_steps 500 \
  --distributed.tp_size 2
```
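
Conceptually, each dotted flag walks the nested config and replaces one leaf. The sketch below shows that idea on a plain dict; `apply_overrides()` is a hypothetical helper and the value parsing (via `ast.literal_eval`) is an assumption, so the real CLI may parse types differently.

```python
import ast


def apply_overrides(config, argv):
    """Apply `--a.b.c value` pairs to a nested dict in place."""
    for flag, raw in zip(argv[::2], argv[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"expected --dotted.key, got {flag!r}")
        *parents, leaf = flag[2:].split(".")
        node = config
        for key in parents:
            node = node.setdefault(key, {})
        try:
            node[leaf] = ast.literal_eval(raw)   # numbers, booleans, lists
        except (ValueError, SyntaxError):
            node[leaf] = raw                     # fall back to a plain string
    return config


config = {"optimizer": {"lr": 2.0e-5}, "step_scheduler": {"max_steps": 1000}}
apply_overrides(config, [
    "--optimizer.lr", "1e-4",
    "--step_scheduler.max_steps", "500",
    "--distributed.tp_size", "2",
])
print(config)
```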

Examples

Validation and checkpointing:

```yaml
step_scheduler:
  val_check_interval: 100
  checkpoint_interval: 500

validation_dataset:
  _target_: nemo_automodel.datasets.squad.SquadDataset
  split: validation

restore_from:
  path: /checkpoints/step-500
```

Domain-Specific Notes

LLM

- `nemo_automodel/recipes/llm/train_ft.py` handles both finetuning and pretraining. The distinction is in the config (dataset, learning rate, etc.).
- `nemo_automodel/recipes/llm/kd.py` implements knowledge distillation with a teacher and student model.
- `nemo_automodel/recipes/llm/benchmark.py` runs throughput and latency benchmarks.

VLM

- Uses `NeMoAutoModelForImageTextToText` instead of causal LM classes.
- Config includes a `processor` section instead of a standalone tokenizer.
- Recipe lives in `nemo_automodel/recipes/vlm/finetune.py`.

Diffusion

- Uses `NeMoAutoDiffusionPipeline`.
- Requires a `parallel_scheme` dict in config to define parallelism.
- Only supports DDP and FSDP2 strategies (no Megatron-FSDP).
- Recipe lives in `nemo_automodel/recipes/diffusion/train.py`.

Retrieval

Two encoder patterns:
- Bi-encoder (`nemo_automodel/recipes/retrieval/train_bi_encoder.py`): separate query and document encoders, contrastive loss.
- Cross-encoder (`nemo_automodel/recipes/retrieval/train_cross_encoder.py`): joint encoding, classification head.

Hard negative mining: `nemo_automodel/recipes/retrieval/mine_hard_negatives.py`

Training Loop Details

The training loop follows this structure (pseudocode; `step` is the global optimizer-step counter):

```python
step = 0
for epoch in range(num_epochs):
    for batch_idx in range(batches_per_epoch):
        # --- gradient accumulation inner loop ---
        for micro_batch in micro_batches:
            if pipeline_parallel:
                schedule.step(micro_batch)    # PP schedule
            else:
                loss = model(micro_batch)     # direct forward
                loss.backward()

        # --- optimizer step ---
        scale_grads_and_clip_grad_norm(model, max_norm)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        step += 1                             # one optimizer step completed

        # --- logging ---
        MetricsSample(step, epoch, loss, grad_norm, lr, mem, tps, mfu)

        # --- validation (at configured intervals) ---
        if step % val_check_interval == 0:
            run_validation()

        # --- checkpoint (at configured intervals) ---
        if step % checkpoint_interval == 0:
            save_checkpoint()
```
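
The gradient accumulation inner loop relies on one piece of arithmetic: if each micro-batch gradient is scaled by `1 / grad_accumulation_steps`, the accumulated sum equals the gradient of the full batch. The sketch below checks this on a one-parameter model in pure Python. The helper names are hypothetical, equal-sized micro-batches are assumed, and whether AutoModel scales gradients in exactly this way is not confirmed here.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for y_hat = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, grad_accumulation_steps):
    """Accumulate scaled micro-batch gradients (batch must split evenly)."""
    size = len(batch) // grad_accumulation_steps
    micro_batches = [batch[i * size:(i + 1) * size]
                     for i in range(grad_accumulation_steps)]
    total = 0.0
    for micro_batch in micro_batches:
        # Scale each micro-batch gradient so the sum matches the full-batch mean.
        total += grad_mse(w, micro_batch) / grad_accumulation_steps
    return total


batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
full = grad_mse(0.5, batch)
accum = accumulated_grad(0.5, batch, grad_accumulation_steps=2)
print(full, accum)
```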

StepScheduler

Controls all training progression: total epochs, total steps, gradient accumulation steps, validation interval, checkpoint interval, and logging interval.

Gradient Clipping

Applied via `scale_grads_and_clip_grad_norm()` after the backward pass and before the optimizer step. Controlled by `clip_grad_norm.max_norm` in config.
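
Clipping by global norm can be illustrated on a flat list of gradient values. This is a generic sketch of the technique, not the body of `scale_grads_and_clip_grad_norm()`; `clip_by_global_norm()` and its `eps` parameter are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients so their global L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: the factor is capped at 1.0 when the norm is already small.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads]


large = clip_by_global_norm([3.0, 4.0], max_norm=1.0)   # norm 5.0, gets scaled down
small = clip_by_global_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5, left unchanged
print(math.sqrt(sum(g * g for g in large)), small)
```

All gradients share one scale factor, so the direction of the update is preserved and only its magnitude is bounded.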

Context Parallelism

When `cp_size > 1`, batches are split across the context-parallel group using `make_cp_batch_and_ctx()`. This must happen before the forward pass.
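
As a rough mental model, context parallelism gives each rank a slice of the sequence dimension. The sketch below uses the simplest possible scheme, contiguous equal chunks. This is an assumption made for illustration: `split_for_context_parallel()` is hypothetical, and the actual splitting done by `make_cp_batch_and_ctx()` may use a different layout.

```python
def split_for_context_parallel(tokens, cp_size, cp_rank):
    """Return the contiguous slice of `tokens` owned by `cp_rank`."""
    if len(tokens) % cp_size != 0:
        raise ValueError("sequence length must be divisible by cp_size")
    chunk = len(tokens) // cp_size
    return tokens[cp_rank * chunk:(cp_rank + 1) * chunk]


tokens = list(range(8))
shards = [split_for_context_parallel(tokens, cp_size=2, cp_rank=r) for r in range(2)]
print(shards)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```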

MetricsSample

Each training step produces a `MetricsSample` with fields:

- `step` -- global step count
- `epoch` -- current epoch
- `loss` -- training loss
- `grad_norm` -- gradient norm after clipping
- `lr` -- current learning rate
- `mem` -- GPU memory usage
- `tps` -- tokens per second
- `mfu` -- model FLOPS utilization
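
A record with these fields could be modeled as a dataclass. The sketch below is a local stand-in, not the real class: the field types and the `to_log_line()` helper are assumptions, and the sample values are invented for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass
class MetricsSample:
    """Local stand-in with the field names listed above; types are assumed."""
    step: int
    epoch: int
    loss: float
    grad_norm: float
    lr: float
    mem: float
    tps: float
    mfu: float

    def to_log_line(self):
        # Hypothetical helper: one "key=value" pair per field, in field order.
        return " | ".join(f"{key}={value}" for key, value in asdict(self).items())


sample = MetricsSample(step=10, epoch=0, loss=1.25, grad_norm=0.9,
                       lr=2e-5, mem=12.5, tps=3500.0, mfu=0.41)
line = sample.to_log_line()
print(line)
```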

Validation & Checkpointing

Validation

- Runs at intervals defined by `step_scheduler.val_check_interval`.
- Uses the validation dataloader built from `validation_dataset` config.
- Model is set to eval mode; gradients are disabled.

Checkpointing

- Default format: consolidated safetensors for easy deployment in the HF ecosystem (always prefer this over DCP).
- Checkpoint interval controlled by `step_scheduler.checkpoint_interval`.
- Resume training via the `restore_from` config key pointing to a checkpoint directory.

```yaml
restore_from:
  path: /checkpoints/step-500
```
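
When choosing a value for `restore_from.path`, it can help to pick the most recent checkpoint by step number rather than by name, since `step-1000` sorts before `step-500` as a string. The sketch below assumes directories are named `step-<N>`, which is inferred only from the example path above; `latest_checkpoint()` is a hypothetical helper.

```python
import re


def latest_checkpoint(names):
    """Return the `step-<N>` name with the largest N, or None if there is none."""
    steps = []
    for name in names:
        match = re.fullmatch(r"step-(\d+)", name)
        if match:
            steps.append((int(match.group(1)), name))
    return max(steps)[1] if steps else None


print(latest_checkpoint(["step-500", "step-1000", "tmp"]))  # step-1000
```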

Pitfalls

| Problem | Cause | Fix |
| --- | --- | --- |
| Silent config errors | Typo in `_target_` value | The class path must be a valid, importable Python callable. Double-check the module path and class name. |
| Training crashes at first step | `global_batch_size` not divisible by `local_batch_size * dp_size * grad_accumulation_steps` | Ensure the batch size math is consistent across all dimensions. |
| New recipe not accessible via CLI | Missing CLI command alias registration | Register the new route in the CLI app so `automodel <command> <domain>` resolves correctly. |
| Shape mismatch at forward pass | Dataset collate function output does not match model input signature | Verify that the collate function returns tensors with the keys and shapes the model expects. |
| OOM during validation | Validation batch size too large or gradients not disabled | Wrap validation in `torch.no_grad()` and consider a smaller validation batch size. |
| Checkpoint restore fails | Mismatched model architecture between checkpoint and config | Ensure the model config matches the checkpoint exactly (layer count, hidden dim, vocab size). |
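
The batch-size pitfall can be caught before launching a run with a small pre-flight check. This is a sketch of the divisibility rule stated in the table; `batch_sizes_consistent()` is a hypothetical helper, not part of AutoModel.

```python
def batch_sizes_consistent(global_batch_size, local_batch_size,
                           dp_size, grad_accumulation_steps):
    """True when global_batch_size is a multiple of the per-step sample count."""
    per_step = local_batch_size * dp_size * grad_accumulation_steps
    return global_batch_size % per_step == 0


print(batch_sizes_consistent(32, local_batch_size=4, dp_size=2, grad_accumulation_steps=4))
print(batch_sizes_consistent(30, local_batch_size=4, dp_size=2, grad_accumulation_steps=4))
```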
