nemo-automodel-recipe-development
NeMo AutoModel Recipe Development
Instructions
For recipe questions, answer with the smallest complete path to action:
- Name the relevant recipe file or YAML section.
- List the builder functions or config keys involved.
- Include a minimal YAML or command example when the question asks how to configure something.
- End with a local validation command or tiny CPU-compatible test.
For conceptual recipe questions, answer from this skill without inspecting the
repository or loading other AutoModel skills unless the user asks you to edit
files. Keep the response focused on recipe YAML, builders, CLI routing, tests,
and local validation.
Use these compact answer patterns for common questions:
- New finetuning recipe variant: start from the closest file under `nemo_automodel/recipes/`, update the model, dataset or dataloader, optimizer, loss, LR scheduler, step scheduler, and checkpoint builders, register a CLI route only if adding a command or domain alias, add example YAML under `examples/`, then add a tiny CPU-compatible unit test and run `automodel finetune llm -c <config.yaml>`.
- `_target_` fields: describe `_target_` as the fully qualified Python callable, explain that sibling keys become keyword arguments, show optimizer and dataset examples, and mention nested CLI overrides such as `--optimizer.lr`.
- Validation and checkpointing: name `step_scheduler.val_check_interval`, `step_scheduler.checkpoint_interval`, `validation_dataset`, `restore_from.path`, and consolidated safetensors; include the minimal YAML snippet from this skill.
For validation and checkpointing, always name:
- `step_scheduler.val_check_interval` for validation cadence.
- `step_scheduler.checkpoint_interval` for save cadence.
- `validation_dataset` as the validation dataloader source.
- `restore_from.path` for resume.
- Consolidated safetensors as the default checkpoint format for HF ecosystem compatibility.
Routing Boundary
Use this skill for recipe construction and execution-flow questions: YAML
structure, `_target_` callables, builder functions, validation datasets,
checkpoint configuration, CLI route registration, and recipe-specific tests.
Do not use this skill for standalone distributed strategy selection, cluster
launcher configuration, or model architecture onboarding unless the user is
asking how those choices appear inside an AutoModel recipe YAML.
Recipe Architecture
Execution Flow
```
CLI (automodel finetune llm -c config.yaml)
  -> app.py parses command + domain + config
    -> recipe script (e.g. train_ft.py) main(config_path)
      -> Recipe class .setup() builds all components
      -> .run_train_validation_loop() executes training
```
Recipe Class
Recipes inherit from `BaseRecipe` and implement two methods:
- `setup()` -- builds model, optimizer, dataloader, loss, LR scheduler, step scheduler, and checkpoint config via builder functions.
- `run_train_validation_loop()` -- executes the training and validation loop.
Builder Pattern
All components are constructed through dedicated builder functions:
- `build_model()` -- instantiates the model from config
- `build_optimizer()` -- creates optimizer (AdamW, etc.)
- `build_dataloader()` -- sets up train and validation dataloaders
- `build_loss_fn()` -- creates the loss function
- `build_lr_scheduler()` -- creates the learning rate scheduler
- `build_step_scheduler()` -- creates the step scheduler controlling training progression
- `build_checkpoint_config()` -- configures checkpointing
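The relationship between the builders and the two recipe methods can be sketched as below. This is a toy stand-in rather than AutoModel's code: `ToyRecipe` is hypothetical, the `build_*` bodies are placeholders that return plain dicts, and only two of the builders are shown.

```python
from typing import Any, Dict


def build_model(cfg: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical stand-in: a real builder would instantiate the model class.
    return {"kind": "model", **cfg}


def build_optimizer(cfg: Dict[str, Any], model: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical stand-in: a real builder would bind the model's parameters.
    return {"kind": "optimizer", "model": model["kind"], **cfg}


class ToyRecipe:
    """Mirrors the two-method recipe shape: setup() builds, the loop runs."""

    def __init__(self, cfg: Dict[str, Any]) -> None:
        self.cfg = cfg
        self.steps_run = 0

    def setup(self) -> None:
        # Each component comes from its own builder, fed by one config section.
        self.model = build_model(self.cfg["model"])
        self.optimizer = build_optimizer(self.cfg["optimizer"], self.model)

    def run_train_validation_loop(self) -> None:
        for _ in range(self.cfg["step_scheduler"]["max_steps"]):
            self.steps_run += 1  # a real loop would run forward/backward/step here


recipe = ToyRecipe({
    "model": {"name_or_path": "toy"},
    "optimizer": {"lr": 2.0e-5},
    "step_scheduler": {"max_steps": 3},
})
recipe.setup()
recipe.run_train_validation_loop()
print(recipe.steps_run)  # 3
```

Keeping construction in `setup()` means a unit test can build the components on CPU and inspect them without running any training steps.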
Infrastructure Application Order
Components are applied in this strict order after building:
- PEFT (LoRA, etc.)
- FP8 quantization
- QAT (quantization-aware training)
- Checkpoint load / restore
- Parameter freezing
- Sharding (FSDP2, Megatron-FSDP, DDP)
- Device placement
- `torch.compile`
- Context parallelism hooks
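One way to picture the fixed ordering is a pipeline that always walks the same list, no matter how the enabled steps were registered. The sketch below is illustrative only: `APPLY_ORDER` copies the list above, while `apply_infrastructure` and the lambda steps are hypothetical names that do not come from AutoModel.

```python
from typing import Callable, Dict, List, Tuple

# Order copied from the list above; the identifiers themselves are made up.
APPLY_ORDER = [
    "peft",
    "fp8",
    "qat",
    "checkpoint_restore",
    "freeze_params",
    "sharding",
    "device_placement",
    "torch_compile",
    "context_parallel_hooks",
]


def apply_infrastructure(
    model: List[str], steps: Dict[str, Callable[[List[str]], List[str]]]
) -> Tuple[List[str], List[str]]:
    """Apply the enabled steps in APPLY_ORDER, ignoring registration order."""
    applied = []
    for name in APPLY_ORDER:
        fn = steps.get(name)
        if fn is not None:
            model = fn(model)
            applied.append(name)
    return model, applied


# Steps are registered out of order on purpose; they still run in the fixed order.
steps = {
    "sharding": lambda m: m + ["sharded"],
    "peft": lambda m: m + ["lora"],
    "torch_compile": lambda m: m + ["compiled"],
}
model, applied = apply_infrastructure([], steps)
print(applied)  # ['peft', 'sharding', 'torch_compile']
```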
YAML Config Anatomy
A complete recipe config follows this structure:
```yaml
step_scheduler:
  max_steps: 1000
  num_epochs: 1
  grad_accumulation_steps: 4
  val_check_interval: 100
  checkpoint_interval: 500
  log_interval: 10
dist_env:
  master_addr: localhost
  master_port: 29500
rng:
  seed: 42
model:
  _target_: nemo_automodel.models.llm.NemotronHForCausalLM
  name_or_path: meta-llama/Llama-3.2-1B
  # additional model kwargs passed to the constructor
compile:
  enabled: false
  backend: inductor
clip_grad_norm:
  max_norm: 1.0
distributed:
  strategy: fsdp2  # fsdp2 | megatron_fsdp | ddp
  dp_size: auto
  tp_size: 1
  cp_size: 1
loss_fn:
  _target_: torch.nn.CrossEntropyLoss
dataset:
  _target_: nemo_automodel.datasets.squad.SquadDataset
  tokenizer_name_or_path: meta-llama/Llama-3.2-1B
  max_seq_length: 2048
validation_dataset:
  _target_: nemo_automodel.datasets.squad.SquadDataset
  split: validation
packed_sequence:
  enabled: false
dataloader:
  batch_size: 4
  num_workers: 4
  pin_memory: true
optimizer:
  _target_: torch.optim.AdamW
  lr: 2.0e-5
  weight_decay: 0.01
lr_scheduler:
  _target_: nemo_automodel.schedulers.CosineAnnealingWarmup
  warmup_steps: 50
  min_lr: 1.0e-6
```
The `_target_` Pattern
The `_target_` key specifies a fully qualified Python callable. All remaining keys in that section are passed as keyword arguments:
```yaml
optimizer:
  _target_: torch.optim.AdamW  # callable
  lr: 2.0e-5                   # kwarg
  weight_decay: 0.01           # kwarg
```
This is equivalent to `torch.optim.AdamW(params, lr=2e-5, weight_decay=0.01)`, where `params` stands for the model parameters that the optimizer constructor takes as its first argument.
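The lookup behind a `_target_` key can be approximated with `importlib`. The helper below is a minimal sketch, not AutoModel's loader: `instantiate` is a hypothetical name, it only handles `module.attr` paths (a dotted path that ends in `Class.method` would need extra resolution), and it uses a standard-library target so it runs without torch.

```python
import importlib
from typing import Any, Dict


def instantiate(section: Dict[str, Any], *args: Any) -> Any:
    """Resolve `_target_` to a callable and call it with the sibling keys."""
    kwargs = dict(section)  # copy so the caller's config is left untouched
    target_path = kwargs.pop("_target_")
    module_name, _, attr = target_path.rpartition(".")
    target = getattr(importlib.import_module(module_name), attr)
    # Positional args cover values the config cannot hold, e.g. model parameters.
    return target(*args, **kwargs)


delta = instantiate({"_target_": "datetime.timedelta", "hours": 1, "minutes": 30})
print(delta.total_seconds())  # 5400.0
```

In this sketch a misspelled module raises `ImportError` and a misspelled attribute raises `AttributeError`, so a bad path fails at build time instead of being silently ignored.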
CLI Overrides
Any config value can be overridden from the command line:
```bash
automodel finetune llm -c config.yaml \
  --optimizer.lr 1e-4 \
  --step_scheduler.max_steps 500 \
  --distributed.tp_size 2
```
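Dotted overrides map naturally onto a nested dict. The following sketch shows one plausible way to apply them; `apply_overrides` is a hypothetical helper and the real CLI parser may coerce types or validate keys differently.

```python
import ast
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], argv: List[str]) -> Dict[str, Any]:
    """Apply `--a.b.c value` pairs to a nested dict in place."""
    for flag, raw in zip(argv[0::2], argv[1::2]):
        keys = flag.lstrip("-").split(".")
        try:
            value = ast.literal_eval(raw)  # "1e-4" -> 0.0001, "2" -> 2
        except (ValueError, SyntaxError):
            value = raw  # anything that is not a Python literal stays a string
        node = config
        for key in keys[:-1]:
            node = node.setdefault(key, {})  # create missing sections on the way
        node[keys[-1]] = value
    return config


cfg = {"optimizer": {"lr": 2.0e-5}, "step_scheduler": {"max_steps": 1000}}
apply_overrides(cfg, ["--optimizer.lr", "1e-4", "--distributed.tp_size", "2"])
print(cfg["optimizer"]["lr"], cfg["distributed"]["tp_size"])  # 0.0001 2
```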
Examples
Validation and checkpointing:
```yaml
step_scheduler:
  val_check_interval: 100
  checkpoint_interval: 500
validation_dataset:
  _target_: nemo_automodel.datasets.squad.SquadDataset
  split: validation
restore_from:
  path: /checkpoints/step-500
```
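Before launching a run, a quick local check can confirm that the validation and checkpointing keys are present. The sketch below assumes the config has already been loaded into a plain dict; `missing_keys` and `REQUIRED_KEYS` are hypothetical names, and `restore_from.path` is left out because it only matters when resuming.

```python
from typing import Any, Dict, List

# Dotted keys this skill names for validation and checkpointing.
REQUIRED_KEYS = [
    "step_scheduler.val_check_interval",
    "step_scheduler.checkpoint_interval",
    "validation_dataset._target_",
]


def missing_keys(config: Dict[str, Any], required: List[str] = REQUIRED_KEYS) -> List[str]:
    """Return every dotted key from `required` that the config lacks."""
    missing = []
    for dotted in required:
        node: Any = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing


config = {
    "step_scheduler": {"val_check_interval": 100, "checkpoint_interval": 500},
    "validation_dataset": {
        "_target_": "nemo_automodel.datasets.squad.SquadDataset",
        "split": "validation",
    },
}
print(missing_keys(config))  # []
```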
Domain-Specific Notes
LLM
- `nemo_automodel/recipes/llm/train_ft.py` handles both finetuning and pretraining. The distinction is in the config (dataset, learning rate, etc.).
- `nemo_automodel/recipes/llm/kd.py` implements knowledge distillation with a teacher and student model.
- `nemo_automodel/recipes/llm/benchmark.py` runs throughput and latency benchmarks.
VLM
- Uses `NeMoAutoModelForImageTextToText` instead of causal LM classes.
- Config includes a `processor` section instead of a standalone tokenizer.
- Recipe lives in `nemo_automodel/recipes/vlm/finetune.py`.
Diffusion
- Uses `NeMoAutoDiffusionPipeline`.
- Requires a `parallel_scheme` dict in config to define parallelism.
- Only supports DDP and FSDP2 strategies (no Megatron-FSDP).
- Recipe lives in `nemo_automodel/recipes/diffusion/train.py`.
Retrieval
- Two encoder patterns:
  - Bi-encoder (`nemo_automodel/recipes/retrieval/train_bi_encoder.py`): separate query and document encoders, contrastive loss.
  - Cross-encoder (`nemo_automodel/recipes/retrieval/train_cross_encoder.py`): joint encoding, classification head.
- Hard negative mining: `nemo_automodel/recipes/retrieval/mine_hard_negatives.py`.
Training Loop Details
The training loop follows this structure per epoch:
```python
step = 0  # global optimizer-step counter
for epoch in range(num_epochs):
    for batch_idx in range(batches_per_epoch):
        # --- gradient accumulation inner loop ---
        for micro_batch in micro_batches:
            if pipeline_parallel:
                schedule.step(micro_batch)   # PP schedule
            else:
                loss = model(micro_batch)    # direct forward
                loss.backward()
        # --- optimizer step ---
        scale_grads_and_clip_grad_norm(model, max_norm)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        step += 1
        # --- logging ---
        MetricsSample(step, epoch, loss, grad_norm, lr, mem, tps, mfu)
        # --- validation (at configured intervals) ---
        if step % val_check_interval == 0:
            run_validation()
        # --- checkpoint (at configured intervals) ---
        if step % checkpoint_interval == 0:
            save_checkpoint()
```
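To see the accumulation and interval logic run end to end on CPU, here is a toy version that fits `y = w * x` with hand-written gradients. Everything in it is illustrative: `run_toy_loop` is a hypothetical function, plain SGD replaces the optimizer, and validation and checkpointing are only recorded as events.

```python
from typing import List, Tuple


def run_toy_loop(
    data: List[Tuple[float, float]],
    grad_accumulation_steps: int,
    val_check_interval: int,
    checkpoint_interval: int,
    lr: float = 0.05,
) -> Tuple[float, List[Tuple[str, int]]]:
    """Fit y = w * x with SGD, accumulating gradients over micro-batches."""
    w, step = 0.0, 0
    events = []
    # Each window of micro-batches feeds exactly one optimizer step.
    for start in range(0, len(data), grad_accumulation_steps):
        micro_batches = data[start:start + grad_accumulation_steps]
        grad = 0.0
        for x, y in micro_batches:
            # d/dw of (w*x - y)^2, averaged over the accumulation window
            grad += 2 * (w * x - y) * x / len(micro_batches)
        w -= lr * grad  # stands in for optimizer.step()
        step += 1
        if step % val_check_interval == 0:
            events.append(("validate", step))
        if step % checkpoint_interval == 0:
            events.append(("checkpoint", step))
    return w, events


data = [(1.0, 2.0), (2.0, 4.0)] * 8  # 16 micro-batches drawn from y = 2x
w, events = run_toy_loop(data, grad_accumulation_steps=4,
                         val_check_interval=2, checkpoint_interval=4)
print(events)  # [('validate', 2), ('validate', 4), ('checkpoint', 4)]
```

Sixteen micro-batches with an accumulation window of four give four optimizer steps, which is why validation fires at steps 2 and 4 and the checkpoint only at step 4.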
StepScheduler
Controls all training progression: total epochs, total steps, gradient accumulation steps, validation interval, checkpoint interval, and logging interval.
Gradient Clipping
Applied via `scale_grads_and_clip_grad_norm()` after the backward pass and before the optimizer step. Controlled by `clip_grad_norm.max_norm` in config.
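The clipping half of that call amounts to global-norm clipping, sketched here in pure Python. This does not reproduce `scale_grads_and_clip_grad_norm()`: the gradient-scaling half is omitted, `clip_grad_norm` is a hypothetical helper, and a flat list of floats stands in for parameter gradients.

```python
import math
from typing import List


def clip_grad_norm(grads: List[float], max_norm: float) -> float:
    """Rescale grads in place so their L2 norm is at most max_norm.

    Returns the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for i in range(len(grads)):
            grads[i] *= scale
    return total_norm


grads = [3.0, 4.0]  # L2 norm is 5.0
norm = clip_grad_norm(grads, max_norm=1.0)
print(norm, grads)  # grads now point the same way with norm close to 1.0
```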
Context Parallelism
When `cp_size > 1`, batches are split across the context-parallel group using `make_cp_batch_and_ctx()`. This must happen before the forward pass.
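As a rough picture of what splitting a batch across the context-parallel group means, the sketch below gives each rank one contiguous slice of the sequence. It is an assumption-laden illustration: `split_for_cp` is hypothetical, the signature of `make_cp_batch_and_ctx()` is not shown in this document, and a real implementation may shard in a load-balanced rather than contiguous pattern.

```python
from typing import List


def split_for_cp(tokens: List[int], cp_size: int, cp_rank: int) -> List[int]:
    """Return the contiguous slice of the sequence owned by cp_rank."""
    if len(tokens) % cp_size != 0:
        raise ValueError("sequence length must be divisible by cp_size")
    shard = len(tokens) // cp_size
    return tokens[cp_rank * shard:(cp_rank + 1) * shard]


sequence = list(range(8))
shards = [split_for_cp(sequence, cp_size=2, cp_rank=r) for r in range(2)]
print(shards)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```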
MetricsSample
Each training step produces a `MetricsSample` with fields:
- `step` -- global step count
- `epoch` -- current epoch
- `loss` -- training loss
- `grad_norm` -- gradient norm after clipping
- `lr` -- current learning rate
- `mem` -- GPU memory usage
- `tps` -- tokens per second
- `mfu` -- model FLOPS utilization
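A record with these fields could be modeled as a frozen dataclass. The sketch below uses the field names from the list above, but the class name `ToyMetricsSample`, the field types, and the units are assumptions, not the library's definition.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToyMetricsSample:
    step: int         # global step count
    epoch: int        # current epoch
    loss: float       # training loss
    grad_norm: float  # gradient norm
    lr: float         # current learning rate
    mem: float        # GPU memory usage (unit assumed to be GiB here)
    tps: float        # tokens per second
    mfu: float        # model FLOPS utilization, as a fraction


sample = ToyMetricsSample(step=10, epoch=0, loss=1.25, grad_norm=0.9,
                          lr=2.0e-5, mem=12.5, tps=15000.0, mfu=0.41)
# Flatten the record into a key=value line suitable for a text log.
log_line = " ".join(f"{k}={v}" for k, v in asdict(sample).items())
print(log_line)
```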
Validation & Checkpointing
Validation
- Runs at intervals defined by `step_scheduler.val_check_interval`.
- Uses the validation dataloader built from the `validation_dataset` config.
- Model is set to eval mode; gradients are disabled.
Checkpointing
- Default format: consolidated safetensors for easy deployment in the Hugging Face (HF) ecosystem (always prefer this over PyTorch Distributed Checkpoint, DCP).
- Checkpoint interval controlled by `step_scheduler.checkpoint_interval`.
- Resume training via the `restore_from` config key pointing to a checkpoint directory.
```yaml
restore_from:
  path: /checkpoints/step-500
```
Pitfalls
| Problem | Cause | Fix |
|---|---|---|
| Silent config errors | Typo in the `_target_` path | The class path must be a valid, importable Python callable. Double-check the module path and class name. |
| Training crashes at first step | | Ensure the batch size math is consistent across all dimensions. |
| New recipe not accessible via CLI | Missing CLI command alias registration | Register the new route in the CLI app. |
| Shape mismatch at forward pass | Dataset collate function output does not match model input signature | Verify that the collate function returns tensors with the keys and shapes the model expects. |
| OOM during validation | Validation batch size too large or gradients not disabled | Wrap validation in a no-gradient context (for example `torch.no_grad()`). |
| Checkpoint restore fails | Mismatched model architecture between checkpoint and config | Ensure the model config matches the checkpoint exactly (layer count, hidden dim, vocab size). |