adding-model-support

Adding New Model Support in Megatron-Bridge

Phase 1: Discovery

Step 1 — Get the HF model link

Ask the user for the HuggingFace model link (e.g. `https://huggingface.co/Qwen/Qwen3.5-VL-27B`).
If the model is not public, ask the user to provide the `config.json` file directly.

Step 2 — Fetch and analyze config.json

Read the model's `config.json` from HuggingFace (or from the user-provided file). Key fields to extract:
  • `model_type`: used for `@register_bridge(model_type=...)`
  • `architectures`: the HF model class name (used for `source=...` in registration)
  • `tie_word_embeddings`: critical for weight tying
  • Architecture fields: `num_hidden_layers`, `hidden_size`, `intermediate_size`, `num_attention_heads`, `num_key_value_heads`, `vocab_size`, `max_position_embeddings`, `rope_theta`, etc.
  • MoE fields (if present): `num_local_experts`, `num_experts_per_tok`, `moe_intermediate_size`
  • MLA fields (if present): `q_lora_rank`, `kv_lora_rank`, `qk_nope_head_dim`, `qk_rope_head_dim`
If there are config fields you don't recognize from previously supported models (check `CONFIG_MAPPING` in `model_bridge.py` and existing bridges), this likely indicates a new architectural block (e.g., a novel attention variant, custom normalization, or a new layer type). Ask the user to provide the HuggingFace `modeling_*.py` implementation of that block so you can understand the computation and create the correct Megatron-side mapping or custom module.
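As a rough illustration of this step, the sketch below groups the fields listed above out of a config dictionary using only the standard library. The helper name `summarize_hf_config` and the group labels are assumptions made for this example and are not part of Megatron-Bridge; only the field names come from the list above.

```python
import json

# Field groups taken from the list above; the grouping itself is illustrative.
FIELD_GROUPS = {
    "registration": ["model_type", "architectures", "tie_word_embeddings"],
    "architecture": [
        "num_hidden_layers", "hidden_size", "intermediate_size",
        "num_attention_heads", "num_key_value_heads", "vocab_size",
        "max_position_embeddings", "rope_theta",
    ],
    "moe": ["num_local_experts", "num_experts_per_tok", "moe_intermediate_size"],
    "mla": ["q_lora_rank", "kv_lora_rank", "qk_nope_head_dim", "qk_rope_head_dim"],
}


def summarize_hf_config(config: dict) -> dict:
    """Group the fields that are present; absent fields are simply skipped."""
    summary = {}
    for group, names in FIELD_GROUPS.items():
        present = {name: config[name] for name in names if name in config}
        if present:
            summary[group] = present
    known = {name for names in FIELD_GROUPS.values() for name in names}
    # Keys outside the groups above deserve a manual look.
    summary["unrecognized"] = sorted(k for k in config if k not in known)
    return summary


if __name__ == "__main__":
    sample = json.loads('{"model_type": "demo", "hidden_size": 64, "new_block_dim": 8}')
    print(summarize_hf_config(sample))
```

Note that "unrecognized" here only means "not in this example's lists". Whether a key is genuinely new still has to be checked against `CONFIG_MAPPING` and the existing bridges.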

Step 3 — Determine VLM vs LLM

VLM (Vision-Language Model) if config.json contains:
  • `text_config` AND `vision_config` sub-configs
  • Note: VLMs may or may not have "VL" in the name
LLM (Text-only) if:
  • No `text_config` / `vision_config`
  • Single flat config for the language model
This distinction affects:
  • Which files to create (VLMs need a model.py combining vision + language)
  • Where to read config fields from (`text_config` vs top-level for VLMs)
  • Test patterns (VLMs need vision inputs in functional tests)
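The check above can be sketched in a few lines. This is a minimal sketch that assumes the config has already been parsed into a plain dictionary; the function names `classify_model` and `language_config` are hypothetical helpers, not Megatron-Bridge APIs.

```python
def classify_model(config: dict) -> str:
    """Return "vlm" when both sub-configs exist, otherwise "llm"."""
    has_text = isinstance(config.get("text_config"), dict)
    has_vision = isinstance(config.get("vision_config"), dict)
    return "vlm" if has_text and has_vision else "llm"


def language_config(config: dict) -> dict:
    """Pick the dict that holds language-model fields such as hidden_size."""
    return config["text_config"] if classify_model(config) == "vlm" else config


if __name__ == "__main__":
    flat = {"model_type": "demo", "hidden_size": 64}
    nested = {"text_config": {"hidden_size": 128}, "vision_config": {"hidden_size": 32}}
    print(classify_model(flat), language_config(flat)["hidden_size"])
    print(classify_model(nested), language_config(nested)["hidden_size"])
```

The model name is deliberately ignored, since a VLM may not carry "VL" in its name.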

Step 4 — Check for quantized weights (FP8 / FP4)

Inspect the HF checkpoint's `model.safetensors` (or `model.safetensors.index.json`) for quantized weight dtypes such as `float8_e4m3fn` (FP8) or `uint8` / `uint4` with accompanying `*_scale_inv` or `*_scale` tensors. Common signs:
  • `config.json` mentions `quantization_config` or dtype fields like `"torch_dtype": "float8_e4m3fn"`
  • Safetensors contain `weight_scale_inv` keys alongside the main weight keys
  • The model card mentions FP8/FP4/INT4 weights
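The first two signs can be checked without loading any tensors. The sketch below reads `config.json` and the `weight_map` of `model.safetensors.index.json` from a local checkpoint directory using only the standard library. The function name `find_quantization_signs` and the returned keys are assumptions for illustration. The index file lists tensor names but not dtypes, so this only finds scale tensors by name.

```python
import json
from pathlib import Path

# Name suffixes that usually accompany quantized weights (from the text above).
SCALE_SUFFIXES = ("_scale_inv", "_scale")


def find_quantization_signs(checkpoint_dir: str) -> dict:
    """Collect hints of quantized weights from config.json and the safetensors index."""
    root = Path(checkpoint_dir)
    signs = {"quantization_config": False, "fp8_dtype": False, "scale_keys": []}

    config_path = root / "config.json"
    if config_path.exists():
        config = json.loads(config_path.read_text())
        signs["quantization_config"] = "quantization_config" in config
        signs["fp8_dtype"] = "float8" in str(config.get("torch_dtype", ""))

    index_path = root / "model.safetensors.index.json"
    if index_path.exists():
        weight_map = json.loads(index_path.read_text()).get("weight_map", {})
        # Tensor names ending in a scale suffix suggest quantized storage.
        signs["scale_keys"] = sorted(k for k in weight_map if k.endswith(SCALE_SUFFIXES))
    return signs
```

If any of the three entries is truthy, treat the checkpoint as quantized until a dequantized copy has been verified.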
Why this matters: The bridge's `import_ckpt` path does not automatically dequantize; it loads raw quantized values as-is. This produces a silently broken model (random-level loss, huge grad norms) instead of raising an error.
Fix: Dequantize before conversion. Two approaches:
  1. Standalone script (recommended for user-facing models): Write a `dequant_fp8_for_bridge.py` in the model's examples folder. Reference: `examples/models/ministral/ministral3/dequant_fp8_for_bridge.py`. The pattern is: `w_bf16 = fp8_weight.to(bfloat16) * weight_scale_inv`.
  2. In-bridge hook: Override `maybe_modify_loaded_hf_weight()` in the bridge class to dequantize on the fly during import:
    ```python
    import torch  # needed at module level for the dtype checks below

    def maybe_modify_loaded_hf_weight(self, hf_param, hf_state_dict):
        weight = hf_state_dict[hf_param]
        # Scale tensors are stored next to the weight under "<name>_scale_inv".
        scale_key = hf_param + "_scale_inv"
        if weight.dtype == torch.float8_e4m3fn and scale_key in hf_state_dict:
            # Dequantize: cast to bf16, then multiply by the stored inverse scale.
            return weight.to(torch.bfloat16) * hf_state_dict[scale_key].to(torch.bfloat16)
        return weight
    ```
Always add a sanity check in the verification workflow (e.g., print the `std` of a weight tensor; quantized models typically have `std ≈ 13` before dequantization vs `std ≈ 0.006` after).

Phase 2: Bridge Support

File structure

LLM, reference: Qwen2 (`src/megatron/bridge/models/qwen/qwen2_bridge.py`)
```
src/megatron/bridge/models/<model>/
├── __init__.py
├── <model>_bridge.py      # Config + weight mappings (no provider file needed)
└── modeling_<model>/      # (optional) Custom nn.Module implementations if needed
    └── ...
```
VLM, reference: Qwen3.5-VL (`src/megatron/bridge/models/qwen_vl/`)
```
src/megatron/bridge/models/<model>/
├── __init__.py
├── <model>_bridge.py         # Config + weight mappings
├── <model>_provider.py       # Only for VLMs that need custom provide()
└── modeling_<model>/         # If using Megatron vision encoder
    ├── __init__.py
    └── model.py              # Combines vision + language
```
OR with HF vision encoder (Reference: Gemma3-VL):
```
src/megatron/bridge/models/<model>/
├── __init__.py
├── <model>_bridge.py
├── <model>_provider.py       # Only for VLMs that need custom provide()
└── modeling_<model>.py       # HF vision + Megatron language wrapper
```
Model-specific modeling code: If the model requires custom `nn.Module` implementations (e.g. a custom RoPE variant, non-standard transformer config, custom thinker/talker architecture), place them in a `modeling_<model>/` directory or a single `modeling_<model>.py` file inside the model family folder. Use a directory when there are multiple files (model, transformer config, custom ops); use a single file when one module suffices. Never put model-specific modeling code in shared directories or as loose files in the bridge family directory; keep them namespaced under the `modeling_<model>` prefix.
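For the minimal LLM layout, the skeleton can be created with a short script. This is only a convenience sketch: the helper `scaffold_llm_bridge` is hypothetical, it writes empty files, and it assumes the repository root is passed in explicitly.

```python
from pathlib import Path


def scaffold_llm_bridge(repo_root: str, model: str) -> list:
    """Create the minimal LLM layout: __init__.py plus <model>_bridge.py."""
    family_dir = Path(repo_root) / "src" / "megatron" / "bridge" / "models" / model
    family_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for filename in ("__init__.py", f"{model}_bridge.py"):
        path = family_dir / filename
        if not path.exists():  # never overwrite existing work
            path.write_text("")
            created.append(path)
    return created


if __name__ == "__main__":
    import tempfile

    # Run against a throwaway directory so nothing in a real checkout is touched.
    for created_path in scaffold_llm_bridge(tempfile.mkdtemp(), "mymodel"):
        print(created_path)
```

A `modeling_<model>/` directory or `<model>_provider.py` file is intentionally not created here, since both are only needed in the cases described above.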

Implementation order

LLM:
  1. Bridge only: Register the bridge, then implement `provider_bridge()` and `mapping_registry()`. The bridge calls `super().provider_bridge()` to get a `GPTModelProvider` from `CONFIG_MAPPING`, then sets model-specific attributes on it. Do not create a provider file; the stock provider returned by `super().provider_bridge()` is usually sufficient for LLMs (e.g., `GPTModelProvider`, or another base provider selected via `PROVIDER_CLASS`).
VLM:
  1. Bridge: Register the bridge, implement config and weight mappings.
  2. Provider (when needed): Only VLMs that require a custom `provide()` to instantiate a combined vision+language model need a provider subclass. The bridge manually calls `hf_config_to_provider_kwargs(text_config)` and instantiates the custom provider.
  3. Model class: Combine vision encoder + language decoder.
For detailed patterns, see:
  • VLM: @skills/adding-model-support/vlm-patterns.md
  • LLM: @skills/adding-model-support/llm-patterns.md

Critical: `tie_word_embeddings` for VLMs

For VLMs, `tie_word_embeddings` lives on the top-level HF config, NOT on `text_config`. Always read from the parent config:
```python
provider.share_embeddings_and_output_weights = getattr(hf_config, "tie_word_embeddings", False)
```

Critical: Config field location for VLMs

When reading HF config for VLMs, check whether each field is in:
  • `hf_config` (top-level): e.g. `tie_word_embeddings`, `image_token_id`, `video_token_id`
  • `hf_config.text_config`: e.g. `num_hidden_layers`, `hidden_size`, etc.
  • `hf_config.vision_config`: e.g. vision encoder dimensions
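When it is unclear where a field lives, a small inspection helper can list every location that defines it. The sketch below uses `types.SimpleNamespace` as a stand-in for a loaded HF config; `locate_field` is a hypothetical helper and the values are made up. Real config objects may also carry default values on sub-configs, so presence alone does not prove the field was set in `config.json`.

```python
from types import SimpleNamespace


def locate_field(hf_config, name: str) -> list:
    """Return every place a field is defined: "top", "text_config", "vision_config"."""
    places = []
    if hasattr(hf_config, name):
        places.append("top")
    for sub in ("text_config", "vision_config"):
        sub_config = getattr(hf_config, sub, None)
        if sub_config is not None and hasattr(sub_config, name):
            places.append(sub)
    return places


# Stand-in for a loaded HF config object; values are made up for the example.
hf_config = SimpleNamespace(
    tie_word_embeddings=True,
    image_token_id=151655,
    text_config=SimpleNamespace(hidden_size=4096, num_hidden_layers=32),
    vision_config=SimpleNamespace(hidden_size=1152),
)

print(locate_field(hf_config, "tie_word_embeddings"))  # ['top']
print(locate_field(hf_config, "hidden_size"))          # ['text_config', 'vision_config']
```

A field that shows up in more than one place, such as `hidden_size` here, is a reminder to read it from the sub-config that matches the component being configured.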

Encapsulating model-specific layers

When a new model introduces custom or non-standard layers (novel attention variants, custom normalization, fused expert layouts, MTP heads, etc.), keep all model-specific logic inside the model family directory. Do not modify shared files in `src/megatron/bridge/models/conversion/` (e.g. `param_mapping.py`, `model_bridge.py`, `quant_mapping.py`) unless the change is genuinely reusable across multiple model families.
Principle: The bridge and provider files for a model family are your primary extension surface. Shared conversion infrastructure provides hooks and base classes; subclass them locally rather than adding conditionals to shared code.

Strategy 1: Create a local mapping subclass

If the model has a layer whose weight layout doesn't match any existing mapping class, create a private mapping class in the bridge file or a `<model>_mappings.py` file in the family directory.
Example: GLM's fused expert down-projection disables grouped-export transpose:
```python
# src/megatron/bridge/models/glm/glm_moe_mappings.py
class GLMExpertDownProjMapping(FusedExpertMapping):
    def __init__(self, megatron_param, hf_param, permute_dims=None):
        super().__init__(megatron_param, hf_param, permute_dims, transpose_on_export=False)
```
Example: Nemotron-H's MTP layers flatten indices during resolve:
```python
# Inside nemotron_h_bridge.py (private to the module)
class _MTPFlatteningMapping(MegatronParamMapping):
    def resolve(self, captures):
        return AutoMapping(self._flatten(captures), ...)
```
Example: MiniMax-M2's non-standard QK norm layout:
```python
# Inside minimax_m2_bridge.py (private to the module)
class _FullDimQKNormMapping(MegatronParamMapping):
    def hf_to_megatron(self, hf_weights):
        # Custom scatter logic for full-dim QK norm
        ...

    def megatron_to_hf(self, megatron_weights):
        # Custom gather logic
        ...
```

Strategy 2: Override bridge hooks

`MegatronModelBridge` provides several override hooks. Use them instead of modifying the base class:

| Hook | When to use |
|---|---|
| `mapping_registry()` | Define all weight name mappings (abstract, always overridden) |
| `provider_bridge()` | Configure the provider with model-specific flags (call `super()` then setattr) |
| `maybe_modify_loaded_hf_weight()` | Dequantize, rename, or reshape HF weights before conversion |
| `maybe_modify_converted_hf_weight()` | Synthesize extra HF keys on export (e.g. `inv_freq`) |
| `megatron_to_hf_config()` | Build HF `config.json` for export |
| `hf_config_to_provider_kwargs()` | Override CONFIG_MAPPING behavior for specific fields |

Accessing HF config in `mapping_registry()`: The bridge instance has `self.hf_config` available during conversion; it is set automatically by the dispatch system before `mapping_registry()` is called. Use it when your mapping registry needs config-dependent logic (e.g. dynamic MTP layer count, number of experts):
```python
def mapping_registry(self) -> MegatronMappingRegistry:
    hf_config = getattr(self, "hf_config", None)
    num_mtp_layers = getattr(hf_config, "num_nextn_predict_layers", 0) if hf_config else 0
    ...
```
Do not override `build_conversion_tasks()` to stash `self._hf_config`; that pattern is deprecated.

Strategy 3: Custom provider subclass (VLMs only)

Most models do not need a provider file; the stock provider (e.g., `GPTModelProvider`, or another base selected via `PROVIDER_CLASS`) is usually sufficient for LLMs. Only create a provider subclass when a VLM needs custom `provide()` logic to instantiate a combined vision+language model:
```python
# src/megatron/bridge/models/<model>/<model>_provider.py
class MyVLModelProvider(GPTModelProvider):
    image_token_id: int = 0

    def provide(self, ...):
        # Custom model construction combining vision encoder + language decoder
        ...
```
The bridge then references it via `PROVIDER_CLASS = MyVLModelProvider` or instantiates it directly in `provider_bridge()`.

When shared file changes ARE justified

Modify `param_mapping.py` or `model_bridge.py` only when the pattern is reusable by 2+ model families. Examples of justified shared changes:
  • `FusedExpertMapping` / `FusedGatedExpertMapping`: used by GLM, DeepSeek, OLMoE, etc.
  • `RMSNorm2ZeroCenteredRMSNormMapping`: used by Gemma, Nemotron, etc.
  • New `CONFIG_MAPPING` entries: when a standard HF config key maps to a standard provider attribute
If you're tempted to add a model-specific `if model_type == "..."` branch in shared code, or pattern-matching on specific weight names in shared conversion logic, that's a signal to use a local subclass or hook override instead.

Update FLOPs calculator for new architectural blocks

If the model introduces a new computational block that differs from standard attention or MLP (e.g., Gated DeltaNet / GDN linear attention, Multi-Token Prediction / MTP heads, Mamba SSM layers), update the FLOPs calculator in `src/megatron/bridge/training/utils/flop_utils.py` so that training throughput metrics (TFLOPs/GPU) are accurate.
When to update: Any time the new block has different FLOPs-per-token than standard self-attention or standard MLP. Common cases:
  • Linear attention variants (GDN, RetNet, RWKV): replace the `O(s²)` attention term with the block's actual operation count
  • MTP / speculative decoding heads: add FLOPs for the extra projection and norm layers
  • SSM layers (Mamba): different recurrence FLOPs than attention
  • Novel MoE routing: may change the effective expert count
How to update:
  1. Read the existing `transformer_flops()` function in `flop_utils.py` to understand the structure.
  2. Add a conditional block gated on a config attribute (e.g., `experimental_attention_variant`, `mtp_num_layers`). Follow the existing MoE pattern for config validation: raise on invalid types, assert list lengths, and use direct attribute access instead of `getattr` with fallback defaults so that misconfigurations fail explicitly.
  3. Compute the per-layer FLOPs for the new block and blend it with the standard attention term based on the layer pattern.
  4. Add unit tests in `tests/unit_tests/training/utils/test_flop_utils.py` that verify:
    • New-block FLOPs differ from pure-attention baseline
    • Exact formula matches hand-computed expected values
    • Varying the block ratio (e.g., `linear_attention_freq`) changes FLOPs
Reference PR: #2925 (GDN FLOPs calculator) adds GDN support with both the calculator code and comprehensive tests.
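To make step 3 concrete, here is a simplified sketch of blending a per-layer term over a layer pattern. It is not the code in `flop_utils.py`: the function names, the 0/1 list pattern, and the `linear_flops_per_layer` input are assumptions for illustration. Only the attention-core count (two matmuls of `2 * batch * seq_len² * hidden` FLOPs each, forward pass, no causal discount) is standard arithmetic.

```python
def attention_core_flops(batch: int, seq_len: int, hidden: int) -> int:
    """Forward FLOPs of QK^T plus attention-times-V for one layer (no causal discount)."""
    # Two matmuls, each costing 2 * batch * seq_len^2 * hidden FLOPs.
    return 4 * batch * seq_len * seq_len * hidden


def blended_core_flops(layer_is_linear: list, batch: int, seq_len: int, hidden: int,
                       linear_flops_per_layer: int) -> int:
    """Sum per-layer core FLOPs over a pattern of linear vs. standard attention layers."""
    # Fail loudly on misconfiguration instead of silently falling back to defaults.
    if not isinstance(layer_is_linear, list):
        raise TypeError("layer_is_linear must be a list of 0/1 flags")
    assert all(flag in (0, 1) for flag in layer_is_linear), "flags must be 0 or 1"
    standard = attention_core_flops(batch, seq_len, hidden)
    num_linear = sum(layer_is_linear)
    num_standard = len(layer_is_linear) - num_linear
    return num_standard * standard + num_linear * linear_flops_per_layer


if __name__ == "__main__":
    # Four layers, two of them linear attention with a made-up cost of 100 FLOPs each.
    print(blended_core_flops([0, 1, 1, 0], batch=1, seq_len=4, hidden=8,
                             linear_flops_per_layer=100))
```

The three test ideas listed in step 4 map directly onto this shape: compare against the all-zeros pattern, check a hand-computed value, and vary the number of linear layers.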

Phase 3: Recipe Support

Recipes provide pre-configured training settings for each model size.
LLM recipes: `src/megatron/bridge/recipes/<family>/<model>.py`
VLM recipes: `src/megatron/bridge/recipes/<family>/<model>.py`
Each recipe file defines functions for each model size + training mode:
  • `<model>_<size>_sft_config()`: Full supervised fine-tuning
  • `<model>_<size>_peft_config()`: LoRA/DoRA parameter-efficient fine-tuning
  • `<model>_<size>_pretrain_config()`: Pretraining (LLM only, usually)
For detailed recipe patterns, see @skills/adding-model-support/recipe-patterns.md.

Export checklist

  1. Family `__init__.py`: import and add to `__all__`
  2. Top-level `src/megatron/bridge/recipes/__init__.py`: wildcard import
  3. `train_any_basic.py`: add to `config_map`, docstring, and `--model` choices

Phase 4: Tests

Unit tests (no GPU)

```text
tests/unit_tests/models/<model>/
├── __init__.py
├── test_<model>_bridge.py    # Mock HF config → verify provider mapping
└── test_<model>_provider.py  # (optional) Only if custom provider subclass exists
```

Functional tests (GPU)

```text
tests/functional_tests/models/<model>/
├── __init__.py
├── test_<model>_conversion.py  # Toy model HF↔Megatron roundtrip
└── test_<model>_provider.py    # compare_provider_configs (optional)
```
For detailed test patterns, see @skills/adding-model-support/tests-and-examples.md.

Phase 5: Docs and Examples

Examples

Model examples: `examples/models/<brand>/<model>/`
```text
examples/models/<brand>/<model>/
├── README.md
├── conversion.sh        # HF↔Megatron conversion commands (real model)
├── inference.sh         # Generation commands (real model, reasonable output)
├── slurm_sft.sh         # SFT training on SLURM
└── slurm_peft.sh        # PEFT training on SLURM
```
Key deliverable requirement: `conversion.sh` and `inference.sh` must target a real published model (e.g. `Qwen/Qwen3-8B`, not a toy). The inference script must produce reasonable output: for LLMs a coherent text continuation, for VLMs a plausible image description. This is the acceptance bar: conversion runs cleanly and generation makes sense.

Documentation

Add a model page at `docs/models/<type>/<model>.md` covering:
  • Supported variants and sizes
  • Conversion commands
  • Training examples (SFT, PEFT)
  • Known limitations

Verification Workflow

After implementing bridge support, prompt the user to run these commands on the cluster:

1. Smoke test (single GPU)

```bash
uv run python -c "
from megatron.bridge import AutoBridge
bridge = AutoBridge.from_hf_pretrained('<org>/<model>')
provider = bridge.to_megatron_provider()
provider.tensor_model_parallel_size = 1
provider.pipeline_model_parallel_size = 1
provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)
bridge.load_hf_weights(model)
for i, (name, tensor) in enumerate(bridge.export_hf_weights(model, cpu=True)):
    print(name, tuple(tensor.shape))
    if i > 10: break
"
```

2. Conversion roundtrip (multi-GPU)

```bash
uv run python examples/conversion/convert_checkpoints.py import \
    --hf-model <org>/<model> \
    --megatron-path /workspace/<model> \
    --torch-dtype bfloat16

uv run python examples/conversion/convert_checkpoints.py export \
    --hf-model <org>/<model> \
    --megatron-path /workspace/<model>/iter_0000000 \
    --hf-path /workspace/<model>-hf-export
```

3. Generation test

For LLMs:
```bash
uv run python examples/conversion/hf_to_megatron_generate_text.py \
    --hf_model_path <org>/<model> --prompt "Hello"
```
For VLMs:
```bash
uv run python examples/conversion/hf_to_megatron_generate_vlm.py \
    --hf_model_path <org>/<model> \
    --image_path "https://example.com/image.jpeg" \
    --prompt "Describe this image."
```

4. Run tests

```bash
uv run python -m pytest tests/unit_tests/models/<model>/ -v
uv run python -m pytest tests/functional_tests/models/<model>/ -v --run-gpu
```

Quick Decision Tree

User wants to add a model
├─ Has HF link? ─── No ──→ Ask for link (or config.json if private)
├─ Has text_config + vision_config? ─── Yes ──→ VLM path
│   ├─ Has Megatron vision encoder? ──→ Megatron encoder (Qwen3.5 pattern)
│   └─ No Megatron encoder ──→ HF encoder (Gemma3 pattern)
└─ No vision config ──→ LLM path (bridge only, no provider file)
    ├─ Standard GPT-style? ──→ Bridge with stock mappings
    └─ Custom layers? ──→ Bridge + local mapping subclasses / hook overrides
        ├─ Custom weight layout? ──→ Local mapping subclass in family dir
        └─ Custom import/export? ──→ Override bridge hooks (maybe_modify_*)
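The same tree can be expressed as a small function, which may help when scripting the discovery phase. This is a sketch only: `choose_path` and its boolean flags are hypothetical, and the answers to the "custom layers" questions still have to come from reading the HF modeling code.

```python
def choose_path(config: dict, has_megatron_vision_encoder: bool = False,
                custom_weight_layout: bool = False,
                custom_import_export: bool = False) -> list:
    """Walk the decision tree above and return the resulting steps."""
    if "text_config" in config and "vision_config" in config:
        encoder = "Megatron encoder" if has_megatron_vision_encoder else "HF encoder"
        return ["VLM path", encoder]
    steps = ["LLM path (bridge only, no provider file)"]
    if custom_weight_layout:
        steps.append("local mapping subclass in family dir")
    if custom_import_export:
        steps.append("override bridge hooks (maybe_modify_*)")
    if len(steps) == 1:
        # No custom layers reported: the stock mappings should be enough.
        steps.append("bridge with stock mappings")
    return steps


if __name__ == "__main__":
    print(choose_path({"text_config": {}, "vision_config": {}}))
    print(choose_path({"hidden_size": 64}, custom_weight_layout=True))
```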