adding-model-support

Adding New Model Support in Megatron-Bridge

Phase 1: Discovery

Step 1 — Get the HF model link

Ask the user for the HuggingFace model link (e.g. `https://huggingface.co/Qwen/Qwen3.5-VL-27B`). If the model is not public, ask the user to provide the `config.json` file directly.

Step 2 — Fetch and analyze config.json

Read the model's `config.json` from HuggingFace (or from the user-provided file). Key fields to extract:

- `model_type`: used for `@register_bridge(model_type=...)`
- `architectures`: the HF model class name (used for `source=...` in registration)
- `tie_word_embeddings`: critical for weight tying
- Architecture fields: `num_hidden_layers`, `hidden_size`, `intermediate_size`, `num_attention_heads`, `num_key_value_heads`, `vocab_size`, `max_position_embeddings`, `rope_theta`, etc.
- MoE fields (if present): `num_local_experts`, `num_experts_per_tok`, `moe_intermediate_size`
- MLA fields (if present): `q_lora_rank`, `kv_lora_rank`, `qk_nope_head_dim`, `qk_rope_head_dim`

If there are config fields you don't recognize from previously supported models (check `CONFIG_MAPPING` in `model_bridge.py` and existing bridges), this likely indicates a new architectural block (e.g., a novel attention variant, custom normalization, or a new layer type). Ask the user to provide the HuggingFace `modeling_*.py` implementation of that block so you can understand the computation and create the correct Megatron-side mapping or custom module.
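As a rough illustration of the extraction above, the sketch below groups the fields of a parsed `config.json` and lists any keys that fall outside the groups named in this step. The helper name `summarize_hf_config`, the grouping, and the sample values are assumptions made for this example; the helper is not part of Megatron-Bridge, and "unrecognized" here only means "not in the lists above", which is a weaker check than comparing against `CONFIG_MAPPING`.

```python
import json

# Field groups copied from the list above; extend them as needed.
CORE_FIELDS = ["model_type", "architectures", "tie_word_embeddings"]
ARCH_FIELDS = [
    "num_hidden_layers", "hidden_size", "intermediate_size",
    "num_attention_heads", "num_key_value_heads", "vocab_size",
    "max_position_embeddings", "rope_theta",
]
MOE_FIELDS = ["num_local_experts", "num_experts_per_tok", "moe_intermediate_size"]
MLA_FIELDS = ["q_lora_rank", "kv_lora_rank", "qk_nope_head_dim", "qk_rope_head_dim"]


def summarize_hf_config(config: dict) -> dict:
    """Group known fields and report keys outside every group (hypothetical helper)."""
    groups = {"core": CORE_FIELDS, "arch": ARCH_FIELDS, "moe": MOE_FIELDS, "mla": MLA_FIELDS}
    summary = {
        name: {key: config[key] for key in keys if key in config}
        for name, keys in groups.items()
    }
    known = {key for keys in groups.values() for key in keys}
    summary["unrecognized"] = sorted(key for key in config if key not in known)
    return summary


# Made-up sample values standing in for a real config.json.
sample = json.loads("""
{
  "model_type": "my_model",
  "architectures": ["MyModelForCausalLM"],
  "tie_word_embeddings": false,
  "num_hidden_layers": 4,
  "hidden_size": 256,
  "num_local_experts": 8,
  "my_new_attention_window": 128
}
""")
summary = summarize_hf_config(sample)
print(summary["unrecognized"])
```

Any key reported as unrecognized is a prompt to look at the HF `modeling_*.py` file, not proof that a new block exists.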

Step 3 — Determine VLM vs LLM

VLM (Vision-Language Model) if `config.json` contains:

- `text_config` AND `vision_config` sub-configs
- Note: VLMs may or may not have "VL" in the name

LLM (Text-only) if:

- No `text_config` / `vision_config`
- Single flat config for the language model

This distinction affects:

- Which files to create (VLMs need a `model.py` combining vision + language)
- Where to read config fields from (`text_config` vs top-level for VLMs)
- Test patterns (VLMs need vision inputs in functional tests)
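The rule above can be sketched as a small check on the parsed `config.json`. The function names `is_vlm` and `language_config` and the sample dictionaries are hypothetical, chosen only for this illustration.

```python
def is_vlm(config: dict) -> bool:
    """Return True when both sub-configs are present, following the rule above."""
    return "text_config" in config and "vision_config" in config


def language_config(config: dict) -> dict:
    """Return the dict that holds the language-model fields."""
    return config["text_config"] if is_vlm(config) else config


# Made-up configs: one nested (VLM-style), one flat (LLM-style).
vlm_cfg = {"text_config": {"hidden_size": 512}, "vision_config": {"depth": 2}}
llm_cfg = {"hidden_size": 1024}

print(is_vlm(vlm_cfg), language_config(vlm_cfg)["hidden_size"])
print(is_vlm(llm_cfg), language_config(llm_cfg)["hidden_size"])
```

The check relies on the config structure alone, which matches the note that the model name is not a reliable signal.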

Step 4 — Check for quantized weights (FP8 / FP4)

Inspect the HF checkpoint's `model.safetensors` (or `model.safetensors.index.json`) for quantized weight dtypes such as `float8_e4m3fn` (FP8) or `uint8` / `uint4` with accompanying `*_scale_inv` / `*_scale` tensors. Common signs:

- `config.json` mentions `quantization_config` or dtype fields like `"torch_dtype": "float8_e4m3fn"`
- Safetensors contain `weight_scale_inv` keys alongside the main weight keys
- The model card mentions FP8/FP4/INT4 weights

Why this matters: The bridge's `import_ckpt` path does not automatically dequantize; it loads raw quantized values as-is. This produces a silently broken model (random-level loss, huge grad norms) instead of raising an error.

Fix: Dequantize before conversion. Two approaches:

1. Standalone script (recommended for user-facing models): Write a `dequant_fp8_for_bridge.py` in the model's examples folder. Reference: `examples/models/ministral/ministral3/dequant_fp8_for_bridge.py`. The pattern is `w_bf16 = fp8_weight.to(bfloat16) * weight_scale_inv`.
2. In-bridge hook: Override `maybe_modify_loaded_hf_weight()` in the bridge class to dequantize on the fly during import:

python

def maybe_modify_loaded_hf_weight(self, hf_param, hf_state_dict):
    weight = hf_state_dict[hf_param]
    scale_key = hf_param + "_scale_inv"
    if weight.dtype == torch.float8_e4m3fn and scale_key in hf_state_dict:
        return weight.to(torch.bfloat16) * hf_state_dict[scale_key].to(torch.bfloat16)
    return weight

Always add a sanity check in the verification workflow (e.g., print the `std` of a weight tensor: quantized models typically have `std ≈ 13` before dequantization vs `std ≈ 0.006` after).
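The signs listed in this step can be checked before any conversion is attempted. The following is a minimal sketch that uses only metadata: the parsed `config.json` and the `weight_map` dictionary found in `model.safetensors.index.json`. The function name `quantization_signs` and the sample inputs are assumptions for illustration, and the check is a heuristic; it does not read tensor dtypes from the safetensors files themselves.

```python
def quantization_signs(config: dict, weight_map: dict) -> list:
    """Collect hints that a checkpoint stores quantized weights (heuristic sketch)."""
    signs = []
    if "quantization_config" in config:
        signs.append("config.json has quantization_config")
    dtype = str(config.get("torch_dtype", ""))
    if dtype.startswith("float8"):
        signs.append("torch_dtype is " + dtype)
    # Scale tensors stored next to the main weights, e.g. *.weight_scale_inv
    scale_keys = sorted(k for k in weight_map if k.endswith(("_scale_inv", "_scale")))
    if scale_keys:
        signs.append(f"{len(scale_keys)} scale tensor(s), e.g. {scale_keys[0]}")
    return signs


# Made-up metadata standing in for config.json and the index's weight_map.
config = {"quantization_config": {"quant_method": "fp8"}, "torch_dtype": "bfloat16"}
weight_map = {
    "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.mlp.up_proj.weight_scale_inv": "model-00001-of-00002.safetensors",
}

signs = quantization_signs(config, weight_map)
for sign in signs:
    print(sign)
```

An empty result does not prove the checkpoint is unquantized, so the `std` sanity check on real tensors is still worth running.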

Phase 2: Bridge Support

File structure

LLM. Reference: Qwen2 (`src/megatron/bridge/models/qwen/qwen2_bridge.py`)

src/megatron/bridge/models/<model>/
├── __init__.py
├── <model>_bridge.py      # Config + weight mappings (no provider file needed)
└── modeling_<model>/      # (optional) Custom nn.Module implementations if needed
    └── ...

VLM. Reference: Qwen3.5-VL (`src/megatron/bridge/models/qwen_vl/`)

src/megatron/bridge/models/<model>/
├── __init__.py
├── <model>_bridge.py         # Config + weight mappings
├── <model>_provider.py       # Only for VLMs that need custom provide()
└── modeling_<model>/         # If using Megatron vision encoder
    ├── __init__.py
    └── model.py              # Combines vision + language

OR with HF vision encoder (Reference: Gemma3-VL):

src/megatron/bridge/models/<model>/
├── __init__.py
├── <model>_bridge.py
├── <model>_provider.py       # Only for VLMs that need custom provide()
└── modeling_<model>.py       # HF vision + Megatron language wrapper

Model-specific modeling code: If the model requires custom `nn.Module` implementations (e.g. a custom RoPE variant, non-standard transformer config, custom thinker/talker architecture), place them in a `modeling_<model>/` directory or a single `modeling_<model>.py` file inside the model family folder. Use a directory when there are multiple files (model, transformer config, custom ops); use a single file when one module suffices. Never put model-specific modeling code in shared directories or as loose files in the bridge family directory; keep it namespaced under the `modeling_<model>` prefix.

Implementation order

LLM:

1. Bridge only: Register the bridge, then implement `provider_bridge()` and `mapping_registry()`. The bridge calls `super().provider_bridge()` to get a `GPTModelProvider` from `CONFIG_MAPPING`, then sets model-specific attributes on it. Do not create a provider file; the stock provider returned by `super().provider_bridge()` is usually sufficient for LLMs (e.g., `GPTModelProvider`, or another base provider selected via `PROVIDER_CLASS`).

VLM:

1. Bridge: Register the bridge, implement config and weight mappings.
2. Provider (when needed): Only VLMs that require a custom `provide()` to instantiate a combined vision+language model need a provider subclass. The bridge manually calls `hf_config_to_provider_kwargs(text_config)` and instantiates the custom provider.
3. Model class: Combine vision encoder + language decoder.

For detailed patterns, see:

- VLM: @skills/adding-model-support/vlm-patterns.md
- LLM: @skills/adding-model-support/llm-patterns.md

Critical: `tie_word_embeddings` for VLMs

For VLMs, `tie_word_embeddings` lives on the top-level HF config, NOT on `text_config`. Always read from the parent config:

python

provider.share_embeddings_and_output_weights = getattr(hf_config, "tie_word_embeddings", False)


Critical: Config field location for VLMs

When reading HF config for VLMs, check whether each field is in:

- `hf_config` (top-level), e.g. `tie_word_embeddings`, `image_token_id`, `video_token_id`
- `hf_config.text_config`, e.g. `num_hidden_layers`, `hidden_size`, etc.
- `hf_config.vision_config`, e.g. vision encoder dimensions
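To make the three locations concrete, here is a small sketch that reads one field from each level. It uses `types.SimpleNamespace` as a stand-in for a real HF config object; the helper `read_vlm_fields`, the output keys, and all values are hypothetical and only illustrate where each lookup goes.

```python
from types import SimpleNamespace

# Stand-in for a nested HF VLM config (made-up values).
hf_config = SimpleNamespace(
    tie_word_embeddings=True,
    image_token_id=101,
    text_config=SimpleNamespace(num_hidden_layers=4, hidden_size=256),
    vision_config=SimpleNamespace(hidden_size=128),
)


def read_vlm_fields(hf_config) -> dict:
    """Read each field from the level where it lives (illustrative helper)."""
    text_config = hf_config.text_config
    return {
        # Top-level fields
        "share_embeddings_and_output_weights": getattr(hf_config, "tie_word_embeddings", False),
        "image_token_id": hf_config.image_token_id,
        # Language-model fields
        "num_layers": text_config.num_hidden_layers,
        "hidden_size": text_config.hidden_size,
        # Vision encoder fields
        "vision_hidden_size": hf_config.vision_config.hidden_size,
    }


fields = read_vlm_fields(hf_config)

# Reading the flag from text_config silently falls back to the default.
wrong = getattr(hf_config.text_config, "tie_word_embeddings", False)
print(fields["share_embeddings_and_output_weights"], wrong)
```

The last two lines show why the location matters: the `getattr` default hides the mistake instead of raising.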

Encapsulating model-specific layers

When a new model introduces custom or non-standard layers (novel attention variants, custom normalization, fused expert layouts, MTP heads, etc.), keep all model-specific logic inside the model family directory. Do not modify shared files in `src/megatron/bridge/models/conversion/` (e.g. `param_mapping.py`, `model_bridge.py`, `quant_mapping.py`) unless the change is genuinely reusable across multiple model families.

Principle: The bridge and provider files for a model family are your primary extension surface. Shared conversion infrastructure provides hooks and base classes; subclass them locally rather than adding conditionals to shared code.

Strategy 1: Create a local mapping subclass

If the model has a layer whose weight layout doesn't match any existing mapping class, create a private mapping class in the bridge file or a `<model>_mappings.py` file in the family directory.

Example (GLM): the fused expert down-projection disables grouped-export transpose:

python

# src/megatron/bridge/models/glm/glm_moe_mappings.py
class GLMExpertDownProjMapping(FusedExpertMapping):
    def __init__(self, megatron_param, hf_param, permute_dims=None):
        super().__init__(megatron_param, hf_param, permute_dims, transpose_on_export=False)

Example (Nemotron-H): the MTP layers flatten indices during resolve:

python

# Inside nemotron_h_bridge.py (private to the module)
class _MTPFlatteningMapping(MegatronParamMapping):
    def resolve(self, captures):
        return AutoMapping(self._flatten(captures), ...)

Example (MiniMax-M2): non-standard QK norm layout:

python

# Inside minimax_m2_bridge.py (private to the module)
class _FullDimQKNormMapping(MegatronParamMapping):
    def hf_to_megatron(self, hf_weights):
        # Custom scatter logic for full-dim QK norm
        ...

    def megatron_to_hf(self, megatron_weights):
        # Custom gather logic
        ...

Strategy 2: Override bridge hooks

`MegatronModelBridge` provides several override hooks. Use them instead of modifying the base class:

| Hook | When to use |
| --- | --- |
| `mapping_registry()` | Define all weight name mappings (abstract, always overridden) |
| `provider_bridge()` | Configure the provider with model-specific flags (call `super()` then setattr) |
| `maybe_modify_loaded_hf_weight()` | Dequantize, rename, or reshape HF weights before conversion |
| `maybe_modify_converted_hf_weight()` | Synthesize extra HF keys on export (e.g. `inv_freq`) |
| `megatron_to_hf_config()` | Build HF `config.json` for export |
| `hf_config_to_provider_kwargs()` | Override `CONFIG_MAPPING` behavior for specific fields |

Accessing HF config in `mapping_registry()`: The bridge instance has `self.hf_config` available during conversion; it is set automatically by the dispatch system before `mapping_registry()` is called. Use it when your mapping registry needs config-dependent logic (e.g. dynamic MTP layer count, number of experts):

python

def mapping_registry(self) -> MegatronMappingRegistry:
    hf_config = getattr(self, "hf_config", None)
    num_mtp_layers = getattr(hf_config, "num_nextn_predict_layers", 0) if hf_config else 0
    ...

Do not override `build_conversion_tasks()` to stash `self._hf_config`; that pattern is deprecated.

Strategy 3: Custom provider subclass (VLMs only)

Most models do not need a provider file: the stock provider (e.g., `GPTModelProvider`, or another base selected via `PROVIDER_CLASS`) is usually sufficient for LLMs. Only create a provider subclass when a VLM needs custom `provide()` logic to instantiate a combined vision+language model:

python

# src/megatron/bridge/models/<model>/<model>_provider.py
class MyVLModelProvider(GPTModelProvider):
    image_token_id: int = 0

    def provide(self, ...):
        # Custom model construction combining vision encoder + language decoder
        ...

The bridge then references it via `PROVIDER_CLASS = MyVLModelProvider` or instantiates it directly in `provider_bridge()`.

When shared file changes ARE justified

Modify `param_mapping.py` or `model_bridge.py` only when the pattern is reusable by 2+ model families. Examples of justified shared changes:

- `FusedExpertMapping` / `FusedGatedExpertMapping`: used by GLM, DeepSeek, OLMoE, etc.
- `RMSNorm2ZeroCenteredRMSNormMapping`: used by Gemma, Nemotron, etc.
- New `CONFIG_MAPPING` entries: when a standard HF config key maps to a standard provider attribute

If you're tempted to add a model-specific `if model_type == "..."` branch in shared code, or pattern-matching on specific weight names in shared conversion logic, that's a signal to use a local subclass or hook override instead.

Update FLOPs calculator for new architectural blocks

If the model introduces a new computational block that differs from standard attention or MLP (e.g., Gated DeltaNet / GDN linear attention, Multi-Token Prediction / MTP heads, Mamba SSM layers), update the FLOPs calculator in `src/megatron/bridge/training/utils/flop_utils.py` so that training throughput metrics (TFLOPs/GPU) are accurate.

When to update: Any time the new block has different FLOPs-per-token than standard self-attention or standard MLP. Common cases:

- Linear attention variants (GDN, RetNet, RWKV): replace the `O(s²)` attention term with the block's actual operation count
- MTP / speculative decoding heads: add FLOPs for the extra projection and norm layers
- SSM layers (Mamba): different recurrence FLOPs than attention
- Novel MoE routing: may change the effective expert count

How to update:

1. Read the existing `transformer_flops()` function in `flop_utils.py` to understand the structure.
2. Add a conditional block gated on a config attribute (e.g., `experimental_attention_variant`, `mtp_num_layers`). Follow the existing MoE pattern for config validation: raise on invalid types, assert list lengths, and use direct attribute access instead of `getattr` with fallback defaults so that misconfigurations fail explicitly.
3. Compute the per-layer FLOPs for the new block and blend it with the standard attention term based on the layer pattern.
4. Add unit tests in `tests/unit_tests/training/utils/test_flop_utils.py` that verify:
   - New-block FLOPs differ from pure-attention baseline
   - Exact formula matches hand-computed expected values
   - Varying the block ratio (e.g., `linear_attention_freq`) changes FLOPs

Reference PR: #2925 (GDN FLOPs calculator) adds GDN support with both the calculator code and comprehensive tests.
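The blending step can be illustrated with a toy calculation. This sketch is not the code in `flop_utils.py` and does not reproduce the GDN formula: the function names, the per-layer boolean pattern, and the linear-attention cost (a state update plus a readout, `4 * s * h * head_dim`) are simplifying assumptions chosen to show how a per-layer pattern changes the total and how validation can fail loudly.

```python
def softmax_attention_core_flops(seq_len: int, hidden_size: int) -> int:
    """Forward FLOPs of QK^T plus scores @ V for one layer and one sequence."""
    return 4 * seq_len * seq_len * hidden_size


def linear_attention_core_flops(seq_len: int, hidden_size: int, head_dim: int) -> int:
    """Placeholder linear-attention cost: state update plus readout (illustrative only)."""
    return 4 * seq_len * hidden_size * head_dim


def blended_core_flops(layer_is_linear: list, seq_len: int, hidden_size: int, head_dim: int) -> int:
    """Sum attention-core FLOPs over layers, following the per-layer pattern."""
    # Fail explicitly on misconfiguration instead of guessing a default.
    if not isinstance(layer_is_linear, list):
        raise TypeError("layer_is_linear must be a list with one bool per layer")
    if not all(isinstance(flag, bool) for flag in layer_is_linear):
        raise TypeError("every entry of layer_is_linear must be a bool")
    total = 0
    for is_linear in layer_is_linear:
        if is_linear:
            total += linear_attention_core_flops(seq_len, hidden_size, head_dim)
        else:
            total += softmax_attention_core_flops(seq_len, hidden_size)
    return total


# Four layers, tiny made-up sizes so the numbers can be checked by hand.
baseline = blended_core_flops([False, False, False, False], seq_len=8, hidden_size=4, head_dim=2)
hybrid = blended_core_flops([True, True, True, False], seq_len=8, hidden_size=4, head_dim=2)
print(baseline, hybrid)
```

With these sizes one softmax layer costs 4 * 8 * 8 * 4 = 1024 and one linear layer costs 4 * 8 * 4 * 2 = 256, which is the kind of hand-computed value the unit tests in step 4 would assert against.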

Phase 3: Recipe Support

Recipes provide pre-configured training settings for each model size.

LLM recipes: `src/megatron/bridge/recipes/<family>/<model>.py`

VLM recipes: `src/megatron/bridge/recipes/<family>/<model>.py`

Each recipe file defines functions for each model size + training mode:

- `<model>_<size>_sft_config()`: Full supervised fine-tuning
- `<model>_<size>_peft_config()`: LoRA/DoRA parameter-efficient fine-tuning
- `<model>_<size>_pretrain_config()`: Pretraining (LLM only, usually)

For detailed recipe patterns, see @skills/adding-model-support/recipe-patterns.md.

Export checklist

- Family `__init__.py`: import and add to `__all__`
- Top-level `src/megatron/bridge/recipes/__init__.py`: wildcard import
- `train_any_basic.py`: add to `config_map`, docstring, and `--model` choices

Phase 4: Tests

Unit tests (no GPU)

text

tests/unit_tests/models/<model>/
├── __init__.py
├── test_<model>_bridge.py    # Mock HF config → verify provider mapping
└── test_<model>_provider.py  # (optional) Only if custom provider subclass exists

Functional tests (GPU)

text

tests/functional_tests/models/<model>/
├── __init__.py
├── test_<model>_conversion.py  # Toy model HF↔Megatron roundtrip
└── test_<model>_provider.py    # compare_provider_configs (optional)

For detailed test patterns, see @skills/adding-model-support/tests-and-examples.md.


Phase 5: Docs and Examples

Examples

Model examples: `examples/models/<brand>/<model>/`

text

examples/models/<brand>/<model>/
├── README.md
├── conversion.sh        # HF↔Megatron conversion commands (real model)
├── inference.sh         # Generation commands (real model, reasonable output)
├── slurm_sft.sh         # SFT training on SLURM
└── slurm_peft.sh        # PEFT training on SLURM

Key deliverable requirement: `conversion.sh` and `inference.sh` must target a real published model (e.g. `Qwen/Qwen3-8B`, not a toy). The inference script must produce reasonable output: for LLMs a coherent text continuation, for VLMs a plausible image description. This is the acceptance bar: conversion runs cleanly and generation makes sense.

Documentation

Add a model page at `docs/models/<type>/<model>.md` covering:

- Supported variants and sizes
- Conversion commands
- Training examples (SFT, PEFT)
- Known limitations

Verification Workflow

After implementing bridge support, prompt the user to run these commands on the cluster:

1. Smoke test (single GPU)

bash

uv run python -c "
from megatron.bridge import AutoBridge
bridge = AutoBridge.from_hf_pretrained('<org>/<model>')
provider = bridge.to_megatron_provider()
provider.tensor_model_parallel_size = 1
provider.pipeline_model_parallel_size = 1
provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)
bridge.load_hf_weights(model)
for i, (name, tensor) in enumerate(bridge.export_hf_weights(model, cpu=True)):
    print(name, tuple(tensor.shape))
    if i > 10: break
"

2. Conversion roundtrip (multi-GPU)

bash

uv run python examples/conversion/convert_checkpoints.py import \
    --hf-model <org>/<model> \
    --megatron-path /workspace/<model> \
    --torch-dtype bfloat16

uv run python examples/conversion/convert_checkpoints.py export \
    --hf-model <org>/<model> \
    --megatron-path /workspace/<model>/iter_0000000 \
    --hf-path /workspace/<model>-hf-export
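After the export step, one simple way to judge the roundtrip is to compare tensor names and shapes between the original HF checkpoint and the exported one. The sketch below assumes the two `{name: shape}` dictionaries have already been collected by some other means (for example from the name/shape pairs printed in the smoke test); the helper `diff_checkpoints` and the sample entries are hypothetical and not part of the conversion script.

```python
def diff_checkpoints(original: dict, exported: dict) -> dict:
    """Compare {tensor name: shape} mappings taken from two checkpoints."""
    missing = sorted(set(original) - set(exported))
    extra = sorted(set(exported) - set(original))
    mismatched = sorted(
        name
        for name in set(original) & set(exported)
        if tuple(original[name]) != tuple(exported[name])
    )
    return {"missing": missing, "extra": extra, "shape_mismatch": mismatched}


# Made-up shapes: the export dropped one tensor and transposed another.
original = {
    "model.embed_tokens.weight": (1000, 64),
    "model.layers.0.mlp.down_proj.weight": (64, 256),
    "lm_head.weight": (1000, 64),
}
exported = {
    "model.embed_tokens.weight": (1000, 64),
    "model.layers.0.mlp.down_proj.weight": (256, 64),
}

report = diff_checkpoints(original, exported)
print(report)
```

A clean roundtrip gives three empty lists. A missing `lm_head.weight` may be expected when embeddings are tied, so each difference still needs a human reading. Matching shapes also say nothing about values, which is why the generation test remains the acceptance bar.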

3. Generation test

For LLMs:

bash

uv run python examples/conversion/hf_to_megatron_generate_text.py \
    --hf_model_path <org>/<model> --prompt "Hello"

For VLMs:

bash

uv run python examples/conversion/hf_to_megatron_generate_vlm.py \
    --hf_model_path <org>/<model> \
    --image_path "https://example.com/image.jpeg" \
    --prompt "Describe this image."

4. Run tests

bash

uv run python -m pytest tests/unit_tests/models/<model>/ -v
uv run python -m pytest tests/functional_tests/models/<model>/ -v --run-gpu

Quick Decision Tree

User wants to add a model
│
├─ Has HF link? ─── No ──→ Ask for link (or config.json if private)
│
├─ Has text_config + vision_config? ─── Yes ──→ VLM path
│   ├─ Has Megatron vision encoder? ──→ Megatron encoder (Qwen3.5 pattern)
│   └─ No Megatron encoder ──→ HF encoder (Gemma3 pattern)
│
└─ No vision config ──→ LLM path (bridge only, no provider file)
    ├─ Standard GPT-style? ──→ Bridge with stock mappings
    └─ Custom layers? ──→ Bridge + local mapping subclasses / hook overrides
        ├─ Custom weight layout? ──→ Local mapping subclass in family dir
        └─ Custom import/export? ──→ Override bridge hooks (maybe_modify_*)
