nemo-automodel-model-onboarding

Adding Model Support to NeMo AutoModel

Purpose

This skill guides implementation of new model architectures in NeMo AutoModel. Follow the six phases in order.

Instructions

When answering an onboarding question, keep the response in this order:

1. Classify the architecture from `config.json`.
2. Name the exact implementation files under `components/models/<name>/`.
3. Identify registry and optional custom-config updates.
4. State the validation tests that must be added before full checkpoint use.

For conceptual onboarding questions, answer from this skill without opening the pattern files unless the user asks you to edit code. Mention pattern filenames as references, then give the direct checklist.

Use direct action verbs: classify the model, name the files, map the weights, register the class, and add tests. Do not discuss distributed strategy, launcher configuration, or general recipe authoring unless the user explicitly connects it to onboarding a new architecture.

Examples

Use these compact answer patterns for common questions:

- Dense causal LM: classify as dense only when `architectures` contains a `ForCausalLM` class and expert fields such as `num_local_experts`, `n_routed_experts`, or `num_experts_per_tok` are absent. Create `components/models/<name>/model.py`, `state_dict_adapter.py`, `__init__.py`, and optional `config.py`, register the class in `MODEL_ARCH_MAPPING` in `_transformers/registry.py`, add example YAML, and add tiny-config unit tests plus layer-equivalence tests for rewritten layers.
- MoE state dict: identify expert fields in `config.json`, reference `moe-patterns.md`, map router tensors separately, preserve routed-expert index order, map routed experts, shared experts, and gate/up/down projections, add adapter key-map tests and tiny-config numerical equivalence tests, and do not rely only on `from_pretrained()` or silent tensor reshapes.
- VLM onboarding: classify as VLM only when `vision_config`, `text_config`, and a `ForConditionalGeneration` architecture are present. Reference `vlm-patterns.md` and existing VLM implementations such as `mistral4`, `kimivl`, or `kimi_k25_vl`; check text backbone, vision tower, projector, processor assumptions, text and vision `state_dict_adapter.py` mappings, registry registration, and tiny image-text tests before full checkpoints. Do not treat VLM onboarding as a pure causal-LM path or skip processor/image tests.

For MoE state-dict questions, always include the safety checklist:

- Map router tensors separately from expert tensors.
- Preserve routed-expert index order; never sort, drop, merge, or silently reshape expert weights to make loading pass.
- Map gate, up, and down projections explicitly, including combined projection layouts and shared experts when present.
- Add adapter key-map tests and tiny-config numerical equivalence tests before relying on full checkpoint loading.

For VLM questions, explicitly check `vision_config`, `text_config`, the conditional-generation architecture, text backbone, vision tower, projector, processor assumptions, registry entry, and tiny image-text tests.

Routing Boundary

Use this skill only when the user is adding or modifying model architecture support: model files, custom layers, state-dict adapters, Hugging Face config mapping, registry entries, or model capability flags.

Do not use this skill for standalone training recipe YAML questions about optimizers, datasets, schedulers, validation datasets, or trainer wiring unless they are explicitly part of onboarding a new model architecture. Those recipe questions belong to the nemo-automodel-recipe-development skill.

In-scope examples:

- "Add support for a new Hugging Face causal LM architecture."
- "Map MoE router and expert weights from a Hugging Face checkpoint."
- "Register a new model class in NeMo AutoModel."

Out-of-scope examples:

- "Write a finetuning recipe YAML with optimizer and dataset sections."
- "Choose FSDP2, DDP, tensor parallel, or context parallel settings."
- "Configure Slurm, SkyPilot, containers, mounts, or launch dispatch."

Phase 1: Discovery

Before writing code, gather information about the target model.

1.1 Fetch HuggingFace config.json

Download the model's `config.json` from the HuggingFace Hub (or use `AutoConfig.from_pretrained`). Key fields to extract:

- `architectures` -- determines the class name and registration key (e.g., `"LlamaForCausalLM"`, `"Qwen3MoeForCausalLM"`, `"Mistral3ForConditionalGeneration"`)
- `model_type` -- used for custom config registration in `_CUSTOM_CONFIG_REGISTRATIONS` if HF does not have a built-in config class
- `hidden_size`, `intermediate_size`, `num_hidden_layers`, `num_attention_heads`, `num_key_value_heads` -- sizing
- `vocab_size` -- needed for tiny test configs
- `tie_word_embeddings` -- whether lm_head shares weights with embed_tokens
- `hidden_act` -- activation function (e.g., `"silu"` for SwiGLU)
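
As a rough illustration of the field list above, the sketch below pulls those keys out of a `config.json` string and reports which ones are absent. The helper name `summarize_config` and the sample values are hypothetical, and VLM configs may nest the sizing fields under `text_config`, which this sketch does not handle.

```python
import json
from typing import Any, Dict

# Field names taken from the list above.
KEY_FIELDS = (
    "architectures",
    "model_type",
    "hidden_size",
    "intermediate_size",
    "num_hidden_layers",
    "num_attention_heads",
    "num_key_value_heads",
    "vocab_size",
    "tie_word_embeddings",
    "hidden_act",
)


def summarize_config(raw_json: str) -> Dict[str, Any]:
    """Collect the onboarding-relevant fields from a config.json string."""
    config = json.loads(raw_json)
    summary: Dict[str, Any] = {name: config.get(name) for name in KEY_FIELDS}
    # Record absent keys explicitly so they are not confused with null values.
    summary["missing"] = [name for name in KEY_FIELDS if name not in config]
    return summary


# Hypothetical, abbreviated config used only for illustration.
sample = json.dumps(
    {
        "architectures": ["LlamaForCausalLM"],
        "model_type": "llama",
        "hidden_size": 4096,
        "num_hidden_layers": 32,
        "vocab_size": 32000,
    }
)
summary = summarize_config(sample)
print(summary["architectures"], summary["missing"])
```

Listing the missing keys separately makes it easier to spot fields that must be defaulted or looked up elsewhere before writing a tiny test config.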

1.2 Determine model type

| Type | Indicators | Pattern file |
| --- | --- | --- |
| Dense LLM | `ForCausalLM` in architectures, no expert fields | llm-patterns.md |
| MoE LLM | `n_routed_experts`, `num_local_experts`, `num_experts_per_tok` in config | moe-patterns.md |
| VLM | `ForConditionalGeneration` in architectures, has `vision_config` + `text_config` | vlm-patterns.md |
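
The table can be read as a small decision procedure. The sketch below is one possible encoding of it over a parsed `config.json` dictionary; the function name `classify_architecture` and the `"unknown"` fallback are assumptions made for illustration, and it only inspects top-level keys, so expert fields nested inside a VLM's `text_config` are not considered.

```python
from typing import Any, Mapping

# Expert-related keys named in this guide; other MoE configs may use different ones.
EXPERT_FIELDS = ("n_routed_experts", "num_local_experts", "num_experts_per_tok")


def classify_architecture(config: Mapping[str, Any]) -> str:
    """Return 'vlm', 'moe', 'dense', or 'unknown' for a parsed config.json."""
    architectures = config.get("architectures") or []
    is_cond_gen = any(a.endswith("ForConditionalGeneration") for a in architectures)
    is_causal_lm = any(a.endswith("ForCausalLM") for a in architectures)
    has_experts = any(field in config for field in EXPERT_FIELDS)

    # VLM requires all three indicators from the table.
    if is_cond_gen and "vision_config" in config and "text_config" in config:
        return "vlm"
    if is_causal_lm and has_experts:
        return "moe"
    if is_causal_lm:
        return "dense"
    return "unknown"


print(classify_architecture({"architectures": ["LlamaForCausalLM"]}))
```

Returning `"unknown"` instead of guessing keeps ambiguous configs (for example, a conditional-generation class without `vision_config`) visible for manual review.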

1.3 Check for existing similar architectures

Look in `components/models/` for architectures with similar attention or MLP patterns:

```text
components/models/
  llama/           # Standard GQA + SwiGLU (CombinedQKV + CombinedGateUpMLP)
  qwen2/           # Same as Llama but with attention bias + QKV bias
  baichuan/        # ALiBi attention variant
  deepseek_v3/     # MLA attention + MoE (DeepSeek-style grouped experts)
  mistral4/        # MLA + MoE + VLM (Pixtral vision)
  kimivl/          # DeepSeek-V3 backbone + MoonVit vision
  kimi_k25_vl/     # Updated KimiVL with different projector
  qwen3_moe/       # Qwen3 with MoE layers
  nemotron_v3/     # Hybrid mamba-attention
```

1.4 Identify custom components

Check whether the model needs:

- Custom attention: GQA (standard), MLA (DeepSeek/Mistral4), sliding window, bidirectional
- Custom RoPE: Standard (Llama), YaRN scaling, NTK-aware, complex-number (DeepSeek)
- Custom normalization: RMSNorm (standard), LayerNorm, different eps values
- Custom MLP: SwiGLU (standard), GeGLU, ReLU-squared, MoE routing
- Custom config class: Needed only if HF `AutoConfig` cannot parse the model's `config.json` (check `auto_map` field)

1.5 Note dimensions for test config

For unit tests, create a tiny config. Target: ~1M parameters or less.

Example tiny config for a Llama-like model:

```python
from transformers import LlamaConfig

tiny_config = LlamaConfig(
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=2,
    vocab_size=256,
    max_position_embeddings=128,
)
```
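
To check a tiny config against the ~1M parameter target without instantiating a model, a back-of-the-envelope count can help. The sketch below assumes a Llama-style layout (GQA attention without biases, a three-matrix SwiGLU MLP, two RMSNorm weights per layer plus a final norm); the function name is hypothetical and other architectures will need different terms.

```python
def estimate_llama_params(
    hidden_size: int,
    intermediate_size: int,
    num_hidden_layers: int,
    num_attention_heads: int,
    num_key_value_heads: int,
    vocab_size: int,
    tie_word_embeddings: bool = False,
) -> int:
    """Approximate parameter count for a bias-free Llama-style decoder."""
    head_dim = hidden_size // num_attention_heads
    kv_dim = num_key_value_heads * head_dim

    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # gate_proj, up_proj, and down_proj.
    mlp = 3 * hidden_size * intermediate_size
    # Input and post-attention RMSNorm weights.
    norms = 2 * hidden_size
    per_layer = attention + mlp + norms

    embeddings = vocab_size * hidden_size
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    final_norm = hidden_size
    return embeddings + lm_head + num_hidden_layers * per_layer + final_norm


tiny_params = estimate_llama_params(
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=2,
    vocab_size=256,
)
print(tiny_params)  # 106816 for the tiny config in this section
```

Under these assumptions the example tiny config comes to roughly 0.1M parameters, comfortably below the target.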

---

Phase 2: Implementation

2.1 Create directory structure

```text
components/models/<name>/
  __init__.py
  model.py
  state_dict_adapter.py
  config.py            # Only if HF config is insufficient
  layers.py            # Only for MoE / MLA / other non-standard layers
  rope_utils.py        # Only for custom RoPE
```
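
The layout above can be scaffolded with a few lines of standard-library Python. This is only a convenience sketch: `scaffold_model_dir` is a hypothetical helper, not part of NeMo AutoModel, and it creates empty files without generating any model code.

```python
from pathlib import Path
from typing import Iterable, List

REQUIRED_FILES = ("__init__.py", "model.py", "state_dict_adapter.py")
OPTIONAL_FILES = ("config.py", "layers.py", "rope_utils.py")


def scaffold_model_dir(models_root: Path, name: str, optional: Iterable[str] = ()) -> List[Path]:
    """Create components/models/<name>/-style files; return the newly created paths."""
    optional = tuple(optional)
    unknown = set(optional) - set(OPTIONAL_FILES)
    if unknown:
        raise ValueError(f"Unexpected optional files: {sorted(unknown)}")

    model_dir = models_root / name
    model_dir.mkdir(parents=True, exist_ok=True)

    created = []
    for filename in (*REQUIRED_FILES, *optional):
        path = model_dir / filename
        if not path.exists():  # never overwrite existing work
            path.touch()
            created.append(path)
    return created
```

For example, `scaffold_model_dir(Path("components/models"), "new_model", optional=["layers.py"])` would create the three required files plus `layers.py`, and a second call would create nothing.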

2.2 Implementation order

Implement files in dependency order:

1. config.py (if needed) -- Custom `PretrainedConfig` subclass
2. rope_utils.py (if needed) -- RoPE implementation
3. layers.py (if needed) -- Attention, MLP, decoder block classes
4. model.py -- The main `ForCausalLM` (or `ForConditionalGeneration`) class
5. state_dict_adapter.py -- HF weight conversion
6. `__init__.py` -- Re-export the main model class

See the pattern files for detailed implementation guidance:

- Dense LLM: llm-patterns.md
- MoE: moe-patterns.md
- VLM: vlm-patterns.md

2.3 MoE state-dict adapter checklist

For MoE models, do not stop at generic loading. The adapter must explicitly map:

- Router weights, including gate bias or correction-bias tensors when the Hugging Face model has them.
- Expert weights, preserving expert index order across local and routed experts.
- Gate/up/down projections, including combined or split projection layouts.
- Shared experts separately from routed experts when the architecture has both.

Add tests that assert expected key mappings and run numerical equivalence with tiny configs before trying full checkpoints.

Do not use these shortcuts:

- Do not validate the adapter only by calling `from_pretrained()`.
- Do not accept missing or extra expert keys without an explicit mapping reason.
- Do not change dtype, transpose dimensions, or reshape tensors unless the HF and NeMo layouts require it and a test proves the conversion is reversible.
- Do not skip router or shared-expert tests because dense-layer tests pass.
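
One way to make the checklist above testable is an explicit key map that fails loudly on anything it does not recognize. The sketch below uses regular expressions over key names only. Both the HF-side patterns (`mlp.gate`, `mlp.experts.<i>`, `mlp.shared_experts`) and the native-side names (`moe.router`, `moe.experts`, `moe.shared_experts`) are illustrative assumptions; real checkpoints and the actual NeMo AutoModel layout may differ.

```python
import re
from typing import Dict, Iterable

# (pattern, replacement) pairs; every supported key family is listed explicitly.
_RULES = (
    # Router weight and optional bias, kept separate from expert tensors.
    (re.compile(r"^model\.layers\.(\d+)\.mlp\.gate\.(weight|bias)$"),
     r"model.layers.\1.moe.router.\2"),
    # Routed experts: the expert index (\2) is copied through unchanged.
    (re.compile(r"^model\.layers\.(\d+)\.mlp\.experts\.(\d+)\.(gate|up|down)_proj\.weight$"),
     r"model.layers.\1.moe.experts.\2.\3_proj.weight"),
    # Shared experts, mapped separately from routed experts.
    (re.compile(r"^model\.layers\.(\d+)\.mlp\.shared_experts\.(gate|up|down)_proj\.weight$"),
     r"model.layers.\1.moe.shared_experts.\2_proj.weight"),
)


def map_moe_keys(hf_keys: Iterable[str]) -> Dict[str, str]:
    """Map HF MoE keys to native keys; raise on unmapped or colliding keys."""
    mapping: Dict[str, str] = {}
    unmapped = []
    for key in hf_keys:
        for pattern, template in _RULES:
            if pattern.match(key):
                mapping[key] = pattern.sub(template, key)
                break
        else:
            unmapped.append(key)
    if unmapped:
        raise KeyError(f"No explicit mapping for: {unmapped}")
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("Two HF keys map to the same native key")
    return mapping
```

A key-map test can then assert the exact expected output for a handful of keys and confirm that an unexpected expert key raises instead of being silently dropped. Tensor values still need a separate numerical equivalence test.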

2.4 VLM onboarding checklist

For VLMs, confirm the Hugging Face config has `vision_config` and `text_config` and that `architectures` points to a conditional-generation class. Start from the closest VLM pattern file, usually vlm-patterns.md, and compare existing implementations such as `mistral4`, `kimivl`, or `kimi_k25_vl`.

The implementation should explicitly cover:

- Text backbone, vision tower, projector, and processor or image preprocessing assumptions.
- Weight mapping for both text and vision modules in `state_dict_adapter.py`.
- Registration of the `ForConditionalGeneration` class in `_transformers/registry.py`.
- Tiny tests that exercise image-text inputs and verify the adapter round-trip.

2.5 Register in registry

Add the model to `MODEL_ARCH_MAPPING` in `_transformers/registry.py`:

```python
# In _transformers/registry.py
MODEL_ARCH_MAPPING = OrderedDict([
    # ... existing entries ...
    (
        "NewModelForCausalLM",
        ("nemo_automodel.components.models.new_model.model", "NewModelForCausalLM"),
    ),
])
```

If the model has a custom config class with `auto_map` in its `config.json`, also register in `_CUSTOM_CONFIG_REGISTRATIONS`:

```python
_CUSTOM_CONFIG_REGISTRATIONS: Dict[str, Tuple[str, str]] = {
    # ... existing entries ...
    "new_model": ("nemo_automodel.components.models.new_model.config", "NewModelConfig"),
}
```
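
Because each registry entry is a `(module_path, class_name)` pair, a typo in either string only surfaces when the entry is resolved. The sketch below shows how such a pair can be resolved with `importlib`, which is a convenient basis for a registration smoke test. It is not NeMo AutoModel's actual loader; `DEMO_ARCH_MAPPING` and `resolve_entry` are hypothetical names, and the demo entries point at standard-library classes so the example runs anywhere.

```python
import importlib
from collections import OrderedDict

# Stand-in registry with the same (module_path, class_name) value shape.
DEMO_ARCH_MAPPING = OrderedDict([
    ("DemoOrderedDict", ("collections", "OrderedDict")),
    ("DemoFraction", ("fractions", "Fraction")),
])


def resolve_entry(arch_name: str, registry=DEMO_ARCH_MAPPING) -> type:
    """Import the module for a registry entry and return the named class."""
    try:
        module_path, class_name = registry[arch_name]
    except KeyError:
        raise KeyError(f"{arch_name!r} is not registered") from None
    module = importlib.import_module(module_path)  # raises ImportError on a bad path
    return getattr(module, class_name)  # raises AttributeError on a bad class name


print(resolve_entry("DemoFraction"))
```

A unit test for a new model could apply the same idea to the real mapping and assert that the resolved class has the expected name.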

Phase 3: Onboarding Example Config

This phase is only for adding a minimal example config that proves the newly onboarded architecture can load and run. Use nemo-automodel-recipe-development for general recipe authoring or existing recipe modifications.

3.1 Create example YAML config

Create an example config under `examples/llm_finetune/<name>/` (or `examples/vlm_finetune/<name>/`):

```yaml
model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: <org>/<model-name>

trainer:
  max_steps: 100
  gradient_clip_val: 1.0
  accumulate_grad_batches: 1

# ... data, optimizer config ...
```

3.2 Verify model loads

Test that the model loads from a HuggingFace checkpoint:

```python
from nemo_automodel import NeMoAutoModelForCausalLM

model = NeMoAutoModelForCausalLM.from_pretrained("<org>/<model-name>")
```

3.3 Test with tiny config first

Before using full-size models, verify with a tiny config (1-2 layers, small hidden dim) to catch shape mismatches early.

Phase 4: Tests

Create `tests/unit_tests/models/<name>/` and cover the checks below before loading full checkpoints:

- Forward-shape smoke test with a tiny config.
- State-dict adapter round-trip: `from_hf -> to_hf` preserves mapped names, shapes, dtypes, and values.
- Layer equivalence tests for every rewritten attention, MLP, normalization, RoPE, or MoE layer. Use the model dtype from config, identical seeded weights, identical inputs, and dtype-appropriate `torch.allclose` tolerances.
- Short functional test that verifies loss decreases over a few training steps.
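
The round-trip check can be illustrated without any deep-learning dependency. The toy adapter below merges separate q/k/v projection weights into one combined tensor and splits them back, with nested lists standing in for tensors. It assumes plain row-wise concatenation, which is a simplification and not necessarily the layout used by `CombinedQKVAttentionMixin`; the function names mirror the `from_hf -> to_hf` wording above but are otherwise hypothetical.

```python
from typing import Dict, List

Matrix = List[List[float]]  # rows of a weight matrix


def from_hf(hf_state: Dict[str, Matrix]) -> Dict[str, Matrix]:
    """Merge q/k/v projection rows into one qkv_proj weight per attention block."""
    split_suffixes = (".q_proj.weight", ".k_proj.weight", ".v_proj.weight")
    native = {k: v for k, v in hf_state.items() if not k.endswith(split_suffixes)}
    prefixes = sorted(
        k[: -len("q_proj.weight")] for k in hf_state if k.endswith(".q_proj.weight")
    )
    for prefix in prefixes:
        native[prefix + "qkv_proj.weight"] = (
            hf_state[prefix + "q_proj.weight"]
            + hf_state[prefix + "k_proj.weight"]
            + hf_state[prefix + "v_proj.weight"]
        )
    return native


def to_hf(native_state: Dict[str, Matrix], q_rows: int, kv_rows: int) -> Dict[str, Matrix]:
    """Split each qkv_proj weight back into q/k/v using the known row counts."""
    hf: Dict[str, Matrix] = {}
    for key, value in native_state.items():
        if key.endswith(".qkv_proj.weight"):
            prefix = key[: -len("qkv_proj.weight")]
            hf[prefix + "q_proj.weight"] = value[:q_rows]
            hf[prefix + "k_proj.weight"] = value[q_rows : q_rows + kv_rows]
            hf[prefix + "v_proj.weight"] = value[q_rows + kv_rows :]
        else:
            hf[key] = value
    return hf


hf_state = {
    "layers.0.self_attn.q_proj.weight": [[1.0, 2.0], [3.0, 4.0]],
    "layers.0.self_attn.k_proj.weight": [[5.0, 6.0]],
    "layers.0.self_attn.v_proj.weight": [[7.0, 8.0]],
    "layers.0.self_attn.o_proj.weight": [[9.0, 10.0]],
}
round_tripped = to_hf(from_hf(hf_state), q_rows=2, kv_rows=1)
assert round_tripped == hf_state  # names and values survive the round trip
```

A real test would make the same assertion on tensors and additionally compare shapes and dtypes, as the list above requires.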

Phase 5: Documentation

5.1 Update model coverage page

Edit the appropriate file in `docs/model-coverage/`:

- LLM/MoE: `docs/model-coverage/llm/index.md`
- VLM: `docs/model-coverage/vlm/index.md`

Add a row with the model name, supported features (TP, PP, FSDP, LoRA, QLoRA), and any limitations.

Phase 6: Parity Testing

After implementation and unit tests are complete, run the full parity-testing workflow to verify that the new model produces numerically equivalent results to the reference HuggingFace implementation.

Run three levels of comparison:

1. State-dict round-trip: load a reference HuggingFace checkpoint, convert it into the NeMo AutoModel layout, export it back, and verify that all mapped tensors match the reference names, shapes, dtypes, and values within the expected tolerance.
2. Component-level parity: compare rewritten attention, MLP, normalization, RoPE, and MoE components against the HuggingFace implementation with fixed seeds and identical dtype.
3. End-to-end forward pass: run the full NeMo AutoModel and HuggingFace model on the same tokenized input and compare logits, hidden states, and loss.

Do not skip this phase. A model that passes unit tests can still diverge from HF due to subtle weight-conversion bugs, backend differences, or RoPE mismatches that only surface in a full parity comparison.
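
When a parity comparison fails, it helps to report where and by how much rather than a bare pass/fail. The sketch below applies the same element-wise criterion as `torch.allclose`, `|actual - reference| <= atol + rtol * |reference|`, to flat lists of floats. The helper `parity_report` is hypothetical, and the tolerances in the example are placeholders rather than recommended values.

```python
from typing import Dict, Sequence, Union


def parity_report(
    actual: Sequence[float],
    reference: Sequence[float],
    rtol: float,
    atol: float,
) -> Dict[str, Union[float, int, bool]]:
    """Compare two flat sequences and report the worst element-wise deviation."""
    if len(actual) != len(reference):
        raise ValueError(f"Length mismatch: {len(actual)} vs {len(reference)}")
    if not actual:
        raise ValueError("Nothing to compare")

    diffs = [abs(a - r) for a, r in zip(actual, reference)]
    worst = max(range(len(diffs)), key=diffs.__getitem__)
    passed = all(d <= atol + rtol * abs(r) for d, r in zip(diffs, reference))
    return {"max_abs_diff": diffs[worst], "worst_index": worst, "passed": passed}


# Placeholder values standing in for flattened logits.
report = parity_report([1.0, 2.0, 3.0], [1.0, 2.0, 3.5], rtol=0.0, atol=0.1)
print(report)
```

Applying the same report to logits, hidden states, and loss makes it easier to tell a uniform precision drift from a single mismapped tensor.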

Key Files Reference

| File | Purpose |
| --- | --- |
| `_transformers/registry.py` | `MODEL_ARCH_MAPPING` and `_CUSTOM_CONFIG_REGISTRATIONS` |
| `components/models/common/__init__.py` | Exports `CombinedQKVAttentionMixin`, `CombinedGateUpMLP`, `BackendConfig`, `HFCheckpointingMixin`, etc. |
| `components/models/common/combined_projection/combined_qkv.py` | `CombinedQKVAttentionMixin` with `setup_qkv_projection()` and `compute_qkv()` |
| `components/models/common/combined_projection/combined_mlp.py` | `CombinedGateUpMLP` with interleaved gate/up layout |
| `components/models/common/combined_projection/state_dict_adapter.py` | `CombinedProjectionStateDictAdapter` base class |
| `components/models/common/hf_checkpointing_mixin.py` | `HFCheckpointingMixin` for save/load |
| `components/models/common/utils.py` | `BackendConfig`, `initialize_rms_norm_module`, `initialize_linear_module`, `get_rope_config` |
| `components/moe/config.py` | `MoEConfig` dataclass |
| `components/moe/fsdp_mixin.py` | `MoEFSDPSyncMixin` for distributed expert handling |
| `components/moe/layers.py` | `MoE` layer, `MLP` (dense) for MoE blocks |
| `components/moe/experts.py` | `GroupedExperts`, `GroupedExpertsDeepEP`, `GroupedExpertsTE` |

Checklist

- [ ] Fetched and analyzed `config.json` from HuggingFace
- [ ] Determined model type (dense LLM / MoE / VLM)
- [ ] Identified custom components (attention, RoPE, normalization, MLP)
- [ ] Created `components/models/<name>/` directory
- [ ] Implemented config.py (if custom config needed)
- [ ] Implemented layers.py (if custom layers needed)
- [ ] Implemented rope_utils.py (if custom RoPE needed)
- [ ] Implemented model.py with `HFCheckpointingMixin`
- [ ] Implemented state_dict_adapter.py
- [ ] Implemented `__init__.py` with re-export
- [ ] Registered in `MODEL_ARCH_MAPPING` in `_transformers/registry.py`
- [ ] Registered custom config in `_CUSTOM_CONFIG_REGISTRATIONS` (if applicable)
- [ ] Created example YAML config
- [ ] Verified model loads via `NeMoAutoModelForCausalLM.from_pretrained()`
- [ ] Created unit tests (forward shape, state_dict round-trip)
- [ ] Created layer equivalence tests for every rewritten layer (matching model dtype)
- [ ] Created functional tests (training loss decreases)
- [ ] Updated docs/model-coverage page
- [ ] Ran state-dict round-trip, component parity, and E2E forward-pass parity checks
- [ ] Set `ModelClass = <Name>ForCausalLM` at module bottom
