nemo-automodel-recipe-development

NeMo AutoModel Recipe Development

Instructions

For recipe questions, answer with the smallest complete path to action:
  1. Name the relevant recipe file or YAML section.
  2. List the builder functions or config keys involved.
  3. Include a minimal YAML or command example when the question asks how to configure something.
  4. End with a local validation command or tiny CPU-compatible test.
For conceptual recipe questions, answer from this skill without inspecting the repository or loading other AutoModel skills unless the user asks you to edit files. Keep the response focused on recipe YAML, builders, CLI routing, tests, and local validation.
Use these compact answer patterns for common questions:
  • New finetuning recipe variant: start from the closest file under `nemo_automodel/recipes/`, update the model, dataset or dataloader, optimizer, loss, LR scheduler, step scheduler, and checkpoint builders, register a CLI route only if adding a command or domain alias, add example YAML under `examples/`, then add a tiny CPU-compatible unit test and run `automodel finetune llm -c <config.yaml>`.
  • `_target_` fields: describe `_target_` as the fully qualified Python callable, explain that sibling keys become keyword arguments, show optimizer and dataset examples, and mention nested CLI overrides such as `--optimizer.lr`.
  • Validation and checkpointing: name `step_scheduler.val_check_interval`, `step_scheduler.checkpoint_interval`, `validation_dataset`, `restore_from.path`, and consolidated safetensors; include the minimal YAML snippet from this skill.
For validation and checkpointing, always name:
  • `step_scheduler.val_check_interval` for validation cadence.
  • `step_scheduler.checkpoint_interval` for save cadence.
  • `validation_dataset` as the validation dataloader source.
  • `restore_from.path` for resume.
  • Consolidated safetensors as the default checkpoint format for HF ecosystem compatibility.

Routing Boundary

Use this skill for recipe construction and execution-flow questions: YAML structure, `_target_` callables, builder functions, validation datasets, checkpoint configuration, CLI route registration, and recipe-specific tests.
Do not use this skill for standalone distributed strategy selection, cluster launcher configuration, or model architecture onboarding unless the user is asking how those choices appear inside an AutoModel recipe YAML.

Recipe Architecture

Execution Flow

CLI (automodel finetune llm -c config.yaml)
  -> app.py parses command + domain + config
    -> recipe script (e.g. train_ft.py) main(config_path)
      -> Recipe class .setup() builds all components
        -> .run_train_validation_loop() executes training

Recipe Class

Recipes inherit from `BaseRecipe` and implement two methods:
  • `setup()` -- builds model, optimizer, dataloader, loss, LR scheduler, step scheduler, and checkpoint config via builder functions.
  • `run_train_validation_loop()` -- executes the training and validation loop.

Builder Pattern

All components are constructed through dedicated builder functions:
  • `build_model()` -- instantiates the model from config
  • `build_optimizer()` -- creates optimizer (AdamW, etc.)
  • `build_dataloader()` -- sets up train and validation dataloaders
  • `build_loss_fn()` -- creates the loss function
  • `build_lr_scheduler()` -- creates the learning rate scheduler
  • `build_step_scheduler()` -- creates the step scheduler controlling training progression
  • `build_checkpoint_config()` -- configures checkpointing

Infrastructure Application Order

Components are applied in this strict order after building:
  1. PEFT (LoRA, etc.)
  2. FP8 quantization
  3. QAT (quantization-aware training)
  4. Checkpoint load / restore
  5. Parameter freezing
  6. Sharding (FSDP2, Megatron-FSDP, DDP)
  7. Device placement
  8. torch.compile
  9. Context parallelism hooks

YAML Config Anatomy

A complete recipe config follows this structure:
```yaml
step_scheduler:
  max_steps: 1000
  num_epochs: 1
  grad_accumulation_steps: 4
  val_check_interval: 100
  checkpoint_interval: 500
  log_interval: 10

dist_env:
  master_addr: localhost
  master_port: 29500

rng:
  seed: 42

model:
  _target_: nemo_automodel.models.llm.NemotronHForCausalLM
  name_or_path: meta-llama/Llama-3.2-1B
  # additional model kwargs passed to the constructor

compile:
  enabled: false
  backend: inductor

clip_grad_norm:
  max_norm: 1.0

distributed:
  strategy: fsdp2       # fsdp2 | megatron_fsdp | ddp
  dp_size: auto
  tp_size: 1
  cp_size: 1

loss_fn:
  _target_: torch.nn.CrossEntropyLoss

dataset:
  _target_: nemo_automodel.datasets.squad.SquadDataset
  tokenizer_name_or_path: meta-llama/Llama-3.2-1B
  max_seq_length: 2048

validation_dataset:
  _target_: nemo_automodel.datasets.squad.SquadDataset
  split: validation

packed_sequence:
  enabled: false

dataloader:
  batch_size: 4
  num_workers: 4
  pin_memory: true

optimizer:
  _target_: torch.optim.AdamW
  lr: 2.0e-5
  weight_decay: 0.01

lr_scheduler:
  _target_: nemo_automodel.schedulers.CosineAnnealingWarmup
  warmup_steps: 50
  min_lr: 1.0e-6
```

The `_target_` Pattern

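The `_target_` key specifies a fully qualified Python callable. All remaining keys in that section are passed as keyword arguments:
```yaml
optimizer:
  _target_: torch.optim.AdamW   # callable
  lr: 2.0e-5                    # kwarg
  weight_decay: 0.01            # kwarg
```
This is equivalent to calling `torch.optim.AdamW(params, lr=2e-5, weight_decay=0.01)`. Here `params` stands for the model parameters, a required argument of `torch.optim.AdamW` that is not a YAML key and so has to be supplied by the recipe code.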
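
To make the mechanism concrete, here is a minimal sketch of how a `_target_` section could be resolved. It is not AutoModel's implementation: `resolve_target` and `instantiate` are hypothetical names, and the sketch assumes the path has the form `package.module.attribute` (a method on a class, such as a `from_pretrained` classmethod, would need extra handling). A standard-library callable stands in for the optimizer so the example runs without PyTorch.

```python
import importlib
from typing import Any, Mapping


def resolve_target(dotted_path: str) -> Any:
    """Import the callable named by a fully qualified dotted path."""
    module_path, _, attr_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"_target_ must be fully qualified, got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


def instantiate(section: Mapping[str, Any], *args: Any) -> Any:
    """Call section['_target_'] with the sibling keys as keyword arguments.

    Extra positional arguments (for example model parameters for an
    optimizer) are passed through by the caller, not read from the config.
    """
    kwargs = {key: value for key, value in section.items() if key != "_target_"}
    return resolve_target(section["_target_"])(*args, **kwargs)


# Standard-library stand-in for a config section.
delta = instantiate({"_target_": "datetime.timedelta", "hours": 1, "minutes": 30})
print(delta.total_seconds())  # 5400.0
```

A mistyped module or attribute name fails here with `ModuleNotFoundError` or `AttributeError`, which is a quick way to check a `_target_` value before launching a run.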

CLI Overrides

Any config value can be overridden from the command line:
```bash
automodel finetune llm -c config.yaml \
  --optimizer.lr 1e-4 \
  --step_scheduler.max_steps 500 \
  --distributed.tp_size 2
```
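
The dotted flags map onto nested config keys. The sketch below illustrates that mapping on a plain dictionary; it is an assumption about the general idea, not AutoModel's actual argument parser, and `parse_value` and `apply_overrides` are hypothetical helpers.

```python
import ast
from typing import Any, Dict, List


def parse_value(raw: str) -> Any:
    """Interpret a CLI string as a Python literal, else keep the raw string."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return raw


def apply_overrides(config: Dict[str, Any], argv: List[str]) -> Dict[str, Any]:
    """Apply ['--a.b', 'value', ...] pairs to a nested dict in place."""
    for flag, raw in zip(argv[0::2], argv[1::2]):
        *parents, leaf = flag.removeprefix("--").split(".")
        node = config
        for key in parents:
            node = node.setdefault(key, {})  # create missing sections
        node[leaf] = parse_value(raw)
    return config


config = {"optimizer": {"lr": 2.0e-5}, "step_scheduler": {"max_steps": 1000}}
apply_overrides(
    config,
    ["--optimizer.lr", "1e-4",
     "--step_scheduler.max_steps", "500",
     "--distributed.tp_size", "2"],
)
print(config)
```

Values that are not Python literals, such as `fsdp2`, stay as strings in this sketch.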

Examples

Validation and checkpointing:
```yaml
step_scheduler:
  val_check_interval: 100
  checkpoint_interval: 500

validation_dataset:
  _target_: nemo_automodel.datasets.squad.SquadDataset
  split: validation

restore_from:
  path: /checkpoints/step-500
```

Domain-Specific Notes

LLM

  • `nemo_automodel/recipes/llm/train_ft.py` handles both finetuning and pretraining. The distinction is in the config (dataset, learning rate, etc.).
  • `nemo_automodel/recipes/llm/kd.py` implements knowledge distillation with a teacher and student model.
  • `nemo_automodel/recipes/llm/benchmark.py` runs throughput and latency benchmarks.

VLM

  • Uses `NeMoAutoModelForImageTextToText` instead of causal LM classes.
  • Config includes a `processor` section instead of a standalone tokenizer.
  • Recipe lives in `nemo_automodel/recipes/vlm/finetune.py`.

Diffusion

  • Uses `NeMoAutoDiffusionPipeline`.
  • Requires a `parallel_scheme` dict in config to define parallelism.
  • Only supports DDP and FSDP2 strategies (no Megatron-FSDP).
  • Recipe lives in `nemo_automodel/recipes/diffusion/train.py`.

Retrieval

  • Two encoder patterns:
    • Bi-encoder (`nemo_automodel/recipes/retrieval/train_bi_encoder.py`): separate query and document encoders, contrastive loss.
    • Cross-encoder (`nemo_automodel/recipes/retrieval/train_cross_encoder.py`): joint encoding, classification head.
  • Hard negative mining: `nemo_automodel/recipes/retrieval/mine_hard_negatives.py`.

Training Loop Details

The training loop follows this structure (pseudocode; `step` counts completed optimizer steps across epochs):
```python
step = 0
for epoch in range(num_epochs):
    for batch_idx in range(batches_per_epoch):
        # --- gradient accumulation inner loop ---
        for micro_batch in micro_batches:
            if pipeline_parallel:
                schedule.step(micro_batch)    # PP schedule
            else:
                loss = model(micro_batch)     # direct forward
                loss.backward()

        # --- optimizer step ---
        scale_grads_and_clip_grad_norm(model, max_norm)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        step += 1                             # one optimizer step completed

        # --- logging ---
        MetricsSample(step, epoch, loss, grad_norm, lr, mem, tps, mfu)

        # --- validation (at configured intervals) ---
        if step % val_check_interval == 0:
            run_validation()

        # --- checkpoint (at configured intervals) ---
        if step % checkpoint_interval == 0:
            save_checkpoint()
```

StepScheduler

Controls all training progression: total epochs, total steps, gradient accumulation steps, validation interval, checkpoint interval, and logging interval.
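
The interval keys can be pictured with a small stand-in class. The sketch below is not AutoModel's `StepScheduler`: `StepSchedule` and its method names are hypothetical, and it assumes steps are counted from 1 after each optimizer step.

```python
from dataclasses import dataclass


@dataclass
class StepSchedule:
    """Hypothetical stand-in for the cadence keys under step_scheduler."""

    max_steps: int
    val_check_interval: int
    checkpoint_interval: int
    log_interval: int

    def is_val_step(self, step: int) -> bool:
        """True when validation should run after this optimizer step."""
        return step > 0 and step % self.val_check_interval == 0

    def is_ckpt_step(self, step: int) -> bool:
        """True when a checkpoint should be saved after this optimizer step."""
        return step > 0 and step % self.checkpoint_interval == 0


schedule = StepSchedule(
    max_steps=1000, val_check_interval=100, checkpoint_interval=500, log_interval=10
)
steps = range(1, schedule.max_steps + 1)
val_steps = [s for s in steps if schedule.is_val_step(s)]
ckpt_steps = [s for s in steps if schedule.is_ckpt_step(s)]
print(len(val_steps), ckpt_steps)  # 10 [500, 1000]
```

With the values from the example config, validation runs ten times and checkpoints are written at steps 500 and 1000.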

Gradient Clipping

Applied via `scale_grads_and_clip_grad_norm()` after the backward pass and before the optimizer step. Controlled by `clip_grad_norm.max_norm` in config.

Context Parallelism

When `cp_size > 1`, batches are split across the context-parallel group using `make_cp_batch_and_ctx()`. This must happen before the forward pass.
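
As an illustration of what splitting a batch across the context-parallel group means, the sketch below gives each rank a contiguous slice of one token sequence. It does not reproduce `make_cp_batch_and_ctx()`: the helper name `split_for_context_parallel` is hypothetical, and the contiguous layout is an assumption (a real implementation may use a different, load-balanced layout).

```python
from typing import List, Sequence


def split_for_context_parallel(
    tokens: Sequence[int], cp_size: int, cp_rank: int
) -> List[int]:
    """Return the contiguous slice of one sequence owned by cp_rank."""
    if not 0 <= cp_rank < cp_size:
        raise ValueError("cp_rank must be in [0, cp_size)")
    if len(tokens) % cp_size != 0:
        raise ValueError("sequence length must be divisible by cp_size (pad first)")
    shard_len = len(tokens) // cp_size
    start = cp_rank * shard_len
    return list(tokens[start:start + shard_len])


tokens = list(range(8))
shards = [split_for_context_parallel(tokens, cp_size=2, cp_rank=r) for r in range(2)]
print(shards)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```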

MetricsSample

Each training step produces a `MetricsSample` with fields:
  • `step` -- global step count
  • `epoch` -- current epoch
  • `loss` -- training loss
  • `grad_norm` -- gradient norm after clipping
  • `lr` -- current learning rate
  • `mem` -- GPU memory usage
  • `tps` -- tokens per second
  • `mfu` -- model FLOPS utilization
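
For a sense of the record's shape, the fields above can be modeled as a small dataclass. This is a sketch, not the library's `MetricsSample`: the class name `MetricsSampleSketch`, the field types, the `to_json` helper, and the GiB unit for `mem` are all assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MetricsSampleSketch:
    """Hypothetical model of the per-step metrics record."""

    step: int         # global step count
    epoch: int        # current epoch
    loss: float       # training loss
    grad_norm: float  # gradient norm
    lr: float         # current learning rate
    mem: float        # GPU memory usage (GiB is an assumed unit)
    tps: float        # tokens per second
    mfu: float        # model FLOPS utilization

    def to_json(self) -> str:
        """Serialize to one JSON line, e.g. for a metrics log file."""
        return json.dumps(asdict(self), sort_keys=True)


sample = MetricsSampleSketch(
    step=10, epoch=0, loss=2.31, grad_norm=0.87, lr=2.0e-5, mem=11.4, tps=5200.0, mfu=0.41
)
print(sample.to_json())
```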

Validation & Checkpointing

Validation

  • Runs at intervals defined by `step_scheduler.val_check_interval`.
  • Uses the validation dataloader built from the `validation_dataset` config.
  • Model is set to eval mode; gradients are disabled.

Checkpointing

  • Default format: consolidated safetensors, for easy deployment in the Hugging Face (HF) ecosystem. Always prefer this over DCP (PyTorch Distributed Checkpoint).
  • Checkpoint interval controlled by `step_scheduler.checkpoint_interval`.
  • Resume training via the `restore_from` config key, whose `path` points to a checkpoint directory.
```yaml
restore_from:
  path: /checkpoints/step-500
```
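
When resuming, it can help to pick the newest checkpoint directory programmatically. The sketch below assumes checkpoints are saved as sibling directories named `step-<N>`, which is extrapolated from the example path above and is not a documented layout. `latest_checkpoint` is a hypothetical helper.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

_STEP_DIR = re.compile(r"^step-(\d+)$")


def latest_checkpoint(root: Path) -> Optional[Path]:
    """Return the step-<N> subdirectory with the largest N, or None."""
    candidates = []
    for child in root.iterdir():
        match = _STEP_DIR.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare numerically: as strings, "step-500" would sort after "step-1500".
    return max(candidates, key=lambda item: item[0])[1]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for step in (500, 1000, 1500):
        (root / f"step-{step}").mkdir()
    newest = latest_checkpoint(root)
    print(newest.name)  # step-1500
```

The returned path is what would go into `restore_from.path`.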

Pitfalls

| Problem | Cause | Fix |
| --- | --- | --- |
| Silent config errors | Typo in `_target_` value | The class path must be a valid, importable Python callable. Double-check the module path and class name. |
| Training crashes at first step | `global_batch_size` not divisible by `local_batch_size * dp_size * grad_accumulation_steps` | Ensure the batch size math is consistent across all dimensions. |
| New recipe not accessible via CLI | Missing CLI command alias registration | Register the new route in the CLI app so `automodel <command> <domain>` resolves correctly. |
| Shape mismatch at forward pass | Dataset collate function output does not match model input signature | Verify that the collate function returns tensors with the keys and shapes the model expects. |
| OOM during validation | Validation batch size too large or gradients not disabled | Wrap validation in `torch.no_grad()` and consider a smaller validation batch size. |
| Checkpoint restore fails | Mismatched model architecture between checkpoint and config | Ensure the model config matches the checkpoint exactly (layer count, hidden dim, vocab size). |
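
The batch size condition from the second pitfall can be checked before launching a run. This is a minimal sketch of that arithmetic only; `batch_math_ok` is a hypothetical helper, not part of AutoModel.

```python
def batch_math_ok(
    global_batch_size: int,
    local_batch_size: int,
    dp_size: int,
    grad_accumulation_steps: int,
) -> bool:
    """True if global_batch_size is divisible by the per-step sample count."""
    per_step = local_batch_size * dp_size * grad_accumulation_steps
    if per_step <= 0:
        raise ValueError("all batch dimensions must be positive")
    return global_batch_size % per_step == 0


# 4 samples per GPU * 4 data-parallel ranks * 4 accumulation steps = 64
print(batch_math_ok(64, local_batch_size=4, dp_size=4, grad_accumulation_steps=4))  # True
print(batch_math_ok(48, local_batch_size=4, dp_size=4, grad_accumulation_steps=4))  # False
```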