Adding New Model Support in Megatron-Bridge
Phase 1: Discovery
Step 1: Get the HF model link
Ask the user for the HuggingFace model link (e.g. `https://huggingface.co/Qwen/Qwen3.5-VL-27B`). If the model is not public, ask the user to provide the `config.json` file directly.
Step 2: Fetch and analyze config.json
Read the model's `config.json` from HuggingFace (or from the user-provided file). Key fields to extract:
- `model_type`: used for `@register_bridge(model_type=...)`
- `architectures`: the HF model class name (used for `source=...` in registration)
- `tie_word_embeddings`: critical for weight tying
- Architecture fields: `num_hidden_layers`, `hidden_size`, `intermediate_size`, `num_attention_heads`, `num_key_value_heads`, `vocab_size`, `max_position_embeddings`, `rope_theta`, etc.
- MoE fields (if present): `num_local_experts`, `num_experts_per_tok`, `moe_intermediate_size`
- MLA fields (if present): `q_lora_rank`, `kv_lora_rank`, `qk_nope_head_dim`, `qk_rope_head_dim`
If there are config fields you don't recognize from previously supported models (check `CONFIG_MAPPING` in `model_bridge.py` and existing bridges), this likely indicates a new architectural block (e.g., a novel attention variant, custom normalization, or a new layer type). Ask the user to provide the HuggingFace `modeling_*.py` implementation of that block so you can understand the computation and create the correct Megatron-side mapping or custom module.
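As a rough illustration of the extraction step above, the sketch below groups the listed keys from an already-parsed `config.json` and reports any keys it does not recognize. The helper name `summarize_hf_config`, the `KNOWN_FIELDS` set, and the sample values are hypothetical and not part of Megatron-Bridge; in practice the set of recognized keys would come from `CONFIG_MAPPING` and the existing bridges.

```python
import json

# Hypothetical field groups copied from the list above; not read from Megatron-Bridge.
CORE_FIELDS = ("model_type", "architectures", "tie_word_embeddings")
ARCH_FIELDS = (
    "num_hidden_layers", "hidden_size", "intermediate_size", "num_attention_heads",
    "num_key_value_heads", "vocab_size", "max_position_embeddings", "rope_theta",
)
MOE_FIELDS = ("num_local_experts", "num_experts_per_tok", "moe_intermediate_size")
MLA_FIELDS = ("q_lora_rank", "kv_lora_rank", "qk_nope_head_dim", "qk_rope_head_dim")
KNOWN_FIELDS = set(CORE_FIELDS + ARCH_FIELDS + MOE_FIELDS + MLA_FIELDS)


def summarize_hf_config(config: dict) -> dict:
    """Group the fields listed above and flag keys that are not in KNOWN_FIELDS."""

    def pick(names):
        return {name: config[name] for name in names if name in config}

    return {
        "core": pick(CORE_FIELDS),
        "architecture": pick(ARCH_FIELDS),
        "moe": pick(MOE_FIELDS),
        "mla": pick(MLA_FIELDS),
        "unrecognized": sorted(k for k in config if k not in KNOWN_FIELDS),
    }


# Made-up config contents, for illustration only.
sample_config = json.loads("""
{
  "model_type": "my_model",
  "architectures": ["MyModelForCausalLM"],
  "tie_word_embeddings": false,
  "num_hidden_layers": 4,
  "hidden_size": 256,
  "num_experts_per_tok": 2,
  "novel_attention_window": 128
}
""")
summary = summarize_hf_config(sample_config)
print(summary["unrecognized"])
```

Any name that lands in `unrecognized` is a candidate for the "ask for the `modeling_*.py` implementation" step rather than an error in itself.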
Step 3: Determine VLM vs LLM
VLM (Vision-Language Model) if config.json contains:
- `text_config` AND `vision_config` sub-configs
- Note: VLMs may or may not have "VL" in the name
LLM (Text-only) if:
- No `text_config` / `vision_config`
- Single flat config for the language model
This distinction affects:
- Which files to create (VLMs need a model.py combining vision + language)
- Where to read config fields from (`text_config` vs top-level for VLMs)
- Test patterns (VLMs need vision inputs in functional tests)
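The rule above can be sketched as a small check on the parsed `config.json`. This is a minimal illustration, assuming the config has been loaded as a plain dictionary; `classify_model` and `language_fields` are hypothetical helper names, not Megatron-Bridge functions.

```python
def classify_model(config: dict) -> str:
    """Return "VLM" when both sub-configs are present, otherwise "LLM"."""
    has_text = isinstance(config.get("text_config"), dict)
    has_vision = isinstance(config.get("vision_config"), dict)
    return "VLM" if has_text and has_vision else "LLM"


def language_fields(config: dict) -> dict:
    """Pick the dictionary that holds the language-model fields."""
    return config["text_config"] if classify_model(config) == "VLM" else config


# Made-up configs; note that the VLM example has no "VL" in its model_type.
vlm_config = {
    "model_type": "my_multimodal",
    "text_config": {"hidden_size": 256},
    "vision_config": {"hidden_size": 64},
}
llm_config = {"model_type": "my_llm", "hidden_size": 512}

print(classify_model(vlm_config), language_fields(vlm_config)["hidden_size"])
print(classify_model(llm_config), language_fields(llm_config)["hidden_size"])
```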
Step 4: Check for quantized weights (FP8 / FP4)
Inspect the HF checkpoint's `model.safetensors` (or `model.safetensors.index.json`) for quantized weight dtypes such as `float8_e4m3fn` (FP8) or `uint8` / `uint4` with accompanying `*_scale_inv` or `*_scale` tensors. Common signs:
- `config.json` mentions `quantization_config` or dtype fields like `"torch_dtype": "float8_e4m3fn"`
- Safetensors contain `weight_scale_inv` keys alongside the main weight keys
- The model card mentions FP8/FP4/INT4 weights
Why this matters: The bridge's `import_ckpt` path does not automatically dequantize; it loads raw quantized values as-is. This produces a silently broken model (random-level loss, huge grad norms) instead of raising an error.
Fix: Dequantize before conversion. Two approaches:
- Standalone script (recommended for user-facing models): Write a `dequant_fp8_for_bridge.py` in the model's examples folder. Reference: `examples/models/ministral/ministral3/dequant_fp8_for_bridge.py`. The pattern is: `w_bf16 = fp8_weight.to(bfloat16) * weight_scale_inv`.
- In-bridge hook: Override `maybe_modify_loaded_hf_weight()` in the bridge class to dequantize on the fly during import:
```python
import torch

def maybe_modify_loaded_hf_weight(self, hf_param, hf_state_dict):
    weight = hf_state_dict[hf_param]
    scale_key = hf_param + "_scale_inv"
    # Only FP8 weights with a matching scale tensor are dequantized;
    # the multiply requires the scale to broadcast against the weight.
    if weight.dtype == torch.float8_e4m3fn and scale_key in hf_state_dict:
        return weight.to(torch.bfloat16) * hf_state_dict[scale_key].to(torch.bfloat16)
    return weight
```
Always add a sanity check in the verification workflow (e.g., print the `std` of a weight tensor; quantized models typically have `std ≈ 13` before dequantization vs `std ≈ 0.006` after).
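The common signs listed above can be checked cheaply before any weights are loaded. The sketch below is one hedged way to do that from the parsed `config.json` and the tensor names in the `weight_map` of `model.safetensors.index.json`. It inspects names only, not dtypes, and `find_quantization_signs` and the sample data are hypothetical.

```python
def find_quantization_signs(config: dict, weight_names: list) -> list:
    """Collect hints that a checkpoint stores quantized weights."""
    signs = []
    if "quantization_config" in config:
        signs.append("config.json has quantization_config")
    if str(config.get("torch_dtype", "")).startswith("float8"):
        signs.append("torch_dtype is " + str(config["torch_dtype"]))
    # Scale tensors sit next to the main weights, e.g. "...weight_scale_inv".
    scale_keys = [n for n in weight_names if n.endswith(("_scale_inv", "_scale"))]
    if scale_keys:
        signs.append(f"{len(scale_keys)} scale tensor(s), e.g. {scale_keys[0]}")
    return signs


# Made-up index contents, for illustration only.
index = {
    "weight_map": {
        "model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
        "model.layers.0.mlp.down_proj.weight_scale_inv": "model-00001-of-00002.safetensors",
    }
}
quantized_signs = find_quantization_signs({"quantization_config": {}}, list(index["weight_map"]))
clean_signs = find_quantization_signs({"torch_dtype": "bfloat16"}, ["model.norm.weight"])
print(quantized_signs)
print(clean_signs)
```

An empty list does not prove the checkpoint is unquantized, so the `std` sanity check above is still worth running.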
Phase 2: Bridge Support
File structure
LLM (reference: Qwen2, `src/megatron/bridge/models/qwen/qwen2_bridge.py`):
```text
src/megatron/bridge/models/<model>/
├── __init__.py
├── <model>_bridge.py        # Config + weight mappings (no provider file needed)
└── modeling_<model>/        # (optional) Custom nn.Module implementations if needed
    └── ...
```
VLM (reference: Qwen3.5-VL, `src/megatron/bridge/models/qwen_vl/`):
```text
src/megatron/bridge/models/<model>/
├── __init__.py
├── <model>_bridge.py        # Config + weight mappings
├── <model>_provider.py      # Only for VLMs that need custom provide()
└── modeling_<model>/        # If using Megatron vision encoder
    ├── __init__.py
    └── model.py             # Combines vision + language
```
OR with HF vision encoder (Reference: Gemma3-VL):
```text
src/megatron/bridge/models/<model>/
├── __init__.py
├── <model>_bridge.py
├── <model>_provider.py      # Only for VLMs that need custom provide()
└── modeling_<model>.py      # HF vision + Megatron language wrapper
```
Model-specific modeling code: If the model requires custom `nn.Module` implementations (e.g. a custom RoPE variant, non-standard transformer config, custom thinker/talker architecture), place them in a `modeling_<model>/` directory or a single `modeling_<model>.py` file inside the model family folder. Use a directory when there are multiple files (model, transformer config, custom ops); use a single file when one module suffices. Never put model-specific modeling code in shared directories or as loose files in the bridge family directory; keep them namespaced under the `modeling_<model>` prefix.
Implementation order
LLM:
- Bridge only: Register bridge, implement `provider_bridge()` and `mapping_registry()`. The bridge calls `super().provider_bridge()` to get a `GPTModelProvider` from `CONFIG_MAPPING`, then sets model-specific attributes on it. Do not create a provider file; the stock provider returned by `super().provider_bridge()` is usually sufficient for LLMs (e.g., `GPTModelProvider`, or another base provider selected via `PROVIDER_CLASS`).
VLM:
- Bridge: Register bridge, implement config and weight mappings.
- Provider (when needed): Only VLMs that require a custom `provide()` to instantiate a combined vision+language model need a provider subclass. The bridge manually calls `hf_config_to_provider_kwargs(text_config)` and instantiates the custom provider.
- Model class: Combine vision encoder + language decoder.
For detailed patterns, see:
- VLM: @skills/adding-model-support/vlm-patterns.md
- LLM: @skills/adding-model-support/llm-patterns.md
Critical: tie_word_embeddings for VLMs
For VLMs, `tie_word_embeddings` lives on the top-level HF config, NOT on `text_config`. Always read from the parent config:
```python
provider.share_embeddings_and_output_weights = getattr(hf_config, "tie_word_embeddings", False)
```
Critical: Config field location for VLMs
When reading HF config for VLMs, check whether each field is in:
- `hf_config` (top-level): e.g. `tie_word_embeddings`, `image_token_id`, `video_token_id`
- `hf_config.text_config`: e.g. `num_hidden_layers`, `hidden_size`, etc.
- `hf_config.vision_config`: e.g. vision encoder dimensions
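To make the lookup rule concrete, here is a small sketch that records where each field lives and reads it from that location. It assumes a config object with attribute access (mocked here with `SimpleNamespace`); `FIELD_LOCATIONS` and `read_vlm_field` are hypothetical names, and the table contains only fields named in this section.

```python
from types import SimpleNamespace

# None means "top-level"; otherwise the name of the sub-config holding the field.
FIELD_LOCATIONS = {
    "tie_word_embeddings": None,
    "image_token_id": None,
    "video_token_id": None,
    "num_hidden_layers": "text_config",
    "hidden_size": "text_config",
}


def read_vlm_field(hf_config, name, default=None):
    """Read a field from the location recorded in FIELD_LOCATIONS."""
    location = FIELD_LOCATIONS[name]  # KeyError for unmapped names is intentional
    holder = hf_config if location is None else getattr(hf_config, location)
    return getattr(holder, name, default)


# Mock VLM config: the stale value on text_config must not win.
hf_config = SimpleNamespace(
    tie_word_embeddings=True,
    image_token_id=7,
    text_config=SimpleNamespace(hidden_size=256, num_hidden_layers=4, tie_word_embeddings=False),
    vision_config=SimpleNamespace(hidden_size=64),
)
print(read_vlm_field(hf_config, "tie_word_embeddings", False))
print(read_vlm_field(hf_config, "hidden_size"))
```

Failing loudly on an unmapped field name is a deliberate choice in this sketch: it forces a decision about where the field lives instead of silently reading the wrong level.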
Encapsulating model-specific layers
When a new model introduces custom or non-standard layers (novel attention variants, custom normalization, fused expert layouts, MTP heads, etc.), keep all model-specific logic inside the model family directory. Do not modify shared files in `src/megatron/bridge/models/conversion/` (e.g. `param_mapping.py`, `model_bridge.py`, `quant_mapping.py`) unless the change is genuinely reusable across multiple model families.
Principle: The bridge and provider files for a model family are your primary extension surface. Shared conversion infrastructure provides hooks and base classes; subclass them locally rather than adding conditionals to shared code.
Strategy 1: Create a local mapping subclass
If the model has a layer whose weight layout doesn't match any existing mapping class, create a private mapping class in the bridge file or a `<model>_mappings.py` file in the family directory.
Example: GLM's fused expert down-projection disables grouped-export transpose:
```python
# src/megatron/bridge/models/glm/glm_moe_mappings.py
class GLMExpertDownProjMapping(FusedExpertMapping):
    def __init__(self, megatron_param, hf_param, permute_dims=None):
        super().__init__(megatron_param, hf_param, permute_dims, transpose_on_export=False)
```
Example: Nemotron-H's MTP layers flatten indices during resolve:
```python
# Inside nemotron_h_bridge.py (private to the module)
class _MTPFlatteningMapping(MegatronParamMapping):
    def resolve(self, captures):
        return AutoMapping(self._flatten(captures), ...)
```
Example: MiniMax-M2's non-standard QK norm layout:
```python
# Inside minimax_m2_bridge.py (private to the module)
class _FullDimQKNormMapping(MegatronParamMapping):
    def hf_to_megatron(self, hf_weights):
        # Custom scatter logic for full-dim QK norm
        ...

    def megatron_to_hf(self, megatron_weights):
        # Custom gather logic
        ...
```
Strategy 2: Override bridge hooks
`MegatronModelBridge` hooks:
| Hook | When to use |
|---|---|
| `mapping_registry()` | Define all weight name mappings (abstract, always overridden) |
| `provider_bridge()` | Configure the provider with model-specific flags (call `super().provider_bridge()` first) |
| `maybe_modify_loaded_hf_weight()` | Dequantize, rename, or reshape HF weights before conversion |
| `maybe_modify_converted_hf_weight()` | Synthesize extra HF keys on export (e.g. …) |
| … | Build HF … for export |
| `hf_config_to_provider_kwargs()` | Override CONFIG_MAPPING behavior for specific fields |
Accessing HF config in `mapping_registry()`: The bridge instance has `self.hf_config` available during conversion; it is set automatically by the dispatch system before `mapping_registry()` is called. Use it when your mapping registry needs config-dependent logic (e.g. dynamic MTP layer count, number of experts):
```python
def mapping_registry(self) -> MegatronMappingRegistry:
    hf_config = getattr(self, "hf_config", None)
    num_mtp_layers = getattr(hf_config, "num_nextn_predict_layers", 0) if hf_config else 0
    ...
```
Do not override `build_conversion_tasks()` to stash `self._hf_config`; that pattern is deprecated.
Strategy 3: Custom provider subclass (VLMs only)
Most models do not need a provider file; the stock provider (e.g., `GPTModelProvider`, or another base selected via `PROVIDER_CLASS`) is usually sufficient for LLMs. Only create a provider subclass when a VLM needs custom `provide()` logic to instantiate a combined vision+language model:
```python
# src/megatron/bridge/models/<model>/<model>_provider.py
class MyVLModelProvider(GPTModelProvider):
    image_token_id: int = 0

    # The argument list is elided in this guide; *args/**kwargs is a stand-in.
    def provide(self, *args, **kwargs):
        # Custom model construction combining vision encoder + language decoder
        ...
```
The bridge then references it via `PROVIDER_CLASS = MyVLModelProvider` or instantiates it directly in `provider_bridge()`.
When shared file changes ARE justified
Modify `param_mapping.py` or `model_bridge.py` only when the pattern is reusable by 2+ model families. Examples of justified shared changes:
- `FusedExpertMapping` / `FusedGatedExpertMapping`: used by GLM, DeepSeek, OLMoE, etc.
- `RMSNorm2ZeroCenteredRMSNormMapping`: used by Gemma, Nemotron, etc.
- New `CONFIG_MAPPING` entries: when a standard HF config key maps to a standard provider attribute
If you're tempted to add a model-specific `if model_type == "..."` branch in shared code, or pattern-matching on specific weight names in shared conversion logic, that's a signal to use a local subclass or hook override instead.
Update FLOPs calculator for new architectural blocks
If the model introduces a new computational block that differs from standard attention or MLP (e.g., Gated DeltaNet / GDN linear attention, Multi-Token Prediction / MTP heads, Mamba SSM layers), update the FLOPs calculator in `src/megatron/bridge/training/utils/flop_utils.py` so that training throughput metrics (TFLOPs/GPU) are accurate.
When to update: Any time the new block has different FLOPs-per-token than standard self-attention or standard MLP. Common cases:
- Linear attention variants (GDN, RetNet, RWKV): replace the `O(s²)` attention term with the block's actual operation count
- MTP / speculative decoding heads: add FLOPs for the extra projection and norm layers
- SSM layers (Mamba): different recurrence FLOPs than attention
- Novel MoE routing: may change the effective expert count
How to update:
- Read the existing `transformer_flops()` function in `flop_utils.py` to understand the structure.
- Add a conditional block gated on a config attribute (e.g., `experimental_attention_variant`, `mtp_num_layers`). Follow the existing MoE pattern for config validation: raise on invalid types, assert list lengths, and use direct attribute access instead of `getattr` with fallback defaults so that misconfigurations fail explicitly.
- Compute the per-layer FLOPs for the new block and blend it with the standard attention term based on the layer pattern.
- Add unit tests in `tests/unit_tests/training/utils/test_flop_utils.py` that verify:
  - New-block FLOPs differ from pure-attention baseline
  - Exact formula matches hand-computed expected values
  - Varying the block ratio (e.g., `linear_attention_freq`) changes FLOPs
Reference PR: #2925 (GDN FLOPs calculator) adds GDN support with both the calculator code and comprehensive tests.
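The "blend by layer pattern" step can be illustrated with a toy calculation. The sketch below is not the code in `flop_utils.py` and does not reproduce the GDN formula: `full_attention_every` and `linear_flops_per_token` are hypothetical inputs, and the attention term counts only the two score matmuls (forward pass, multiply-add as 2 FLOPs, `num_heads * head_dim == hidden_size`, no causal discount).

```python
def attention_core_flops(seq_len: int, hidden_size: int) -> int:
    """Forward FLOPs of QK^T plus attention-times-V for one layer and one sequence.

    Each of the two matmuls costs 2 * s^2 * h, giving the O(s^2) term 4 * s^2 * h.
    """
    return 4 * seq_len * seq_len * hidden_size


def blended_attention_flops(
    num_layers: int,
    seq_len: int,
    hidden_size: int,
    full_attention_every: int,
    linear_flops_per_token: int,
) -> int:
    """Mix full-attention layers with linear-attention layers by layer pattern.

    One layer in every `full_attention_every` uses the O(s^2) term; the rest use
    a cost linear in sequence length. Invalid settings fail explicitly.
    """
    if not isinstance(full_attention_every, int) or full_attention_every < 1:
        raise ValueError("full_attention_every must be a positive int")
    if num_layers % full_attention_every != 0:
        raise ValueError("num_layers must be divisible by full_attention_every")
    num_full = num_layers // full_attention_every
    num_linear = num_layers - num_full
    full_term = num_full * attention_core_flops(seq_len, hidden_size)
    linear_term = num_linear * linear_flops_per_token * seq_len
    return full_term + linear_term


baseline = blended_attention_flops(8, 8, 4, 1, 10)  # every layer is full attention
hybrid = blended_attention_flops(8, 8, 4, 4, 10)    # 2 full layers + 6 linear layers
print(baseline, hybrid)
```

The three checks recommended above map directly onto this toy: the hybrid value differs from the baseline, it can be verified by hand (2 × 1024 + 6 × 10 × 8), and changing the ratio changes the result.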
Phase 3: Recipe Support
Recipes provide pre-configured training settings for each model size.
LLM recipes: `src/megatron/bridge/recipes/<family>/<model>.py`
VLM recipes: `src/megatron/bridge/recipes/<family>/<model>.py`
Each recipe file defines functions for each model size + training mode:
- `<model>_<size>_sft_config()`: Full supervised fine-tuning
- `<model>_<size>_peft_config()`: LoRA/DoRA parameter-efficient fine-tuning
- `<model>_<size>_pretrain_config()`: Pretraining (LLM only, usually)
For detailed recipe patterns, see @skills/adding-model-support/recipe-patterns.md.
Export checklist
- Family `__init__.py`: import and add to `__all__`
- Top-level `src/megatron/bridge/recipes/__init__.py`: wildcard import
- `train_any_basic.py`: add to `config_map`, docstring, and `--model` choices
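One way to catch a missed export is to generate the expected recipe function names from the naming pattern and compare them with the family's `__all__`. The sketch below assumes the `<model>_<size>_<mode>_config` pattern described in this phase; `expected_recipe_names`, `missing_exports`, and the sample names are hypothetical.

```python
from itertools import product


def expected_recipe_names(model, sizes, modes=("sft", "peft", "pretrain")):
    """Build <model>_<size>_<mode>_config names for every size and mode."""
    return [f"{model}_{size}_{mode}_config" for size, mode in product(sizes, modes)]


def missing_exports(exported, expected):
    """Return expected names that are absent from an __all__-style list."""
    exported_set = set(exported)
    return [name for name in expected if name not in exported_set]


# Made-up model name and __all__ contents, for illustration only.
expected = expected_recipe_names("mymodel", ["8b"])
current_all = ["mymodel_8b_sft_config"]
print(missing_exports(current_all, expected))
```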
Phase 4: Tests
Unit tests (no GPU)
```text
tests/unit_tests/models/<model>/
├── __init__.py
├── test_<model>_bridge.py       # Mock HF config → verify provider mapping
└── test_<model>_provider.py     # (optional) Only if custom provider subclass exists
```
Functional tests (GPU)
```text
tests/functional_tests/models/<model>/
├── __init__.py
├── test_<model>_conversion.py   # Toy model HF↔Megatron roundtrip
└── test_<model>_provider.py     # compare_provider_configs (optional)
```
For detailed test patterns, see @skills/adding-model-support/tests-and-examples.md.
Phase 5: Docs and Examples
Examples
Model examples: `examples/models/<brand>/<model>/`
```text
examples/models/<brand>/<model>/
├── README.md
├── conversion.sh       # HF↔Megatron conversion commands (real model)
├── inference.sh        # Generation commands (real model, reasonable output)
├── slurm_sft.sh        # SFT training on SLURM
└── slurm_peft.sh       # PEFT training on SLURM
```
Key deliverable requirement: `conversion.sh` and `inference.sh` must target a real published model (e.g. `Qwen/Qwen3-8B`, not a toy). The inference script must produce reasonable output: for LLMs a coherent text continuation, for VLMs a plausible image description. This is the acceptance bar: conversion runs cleanly and generation makes sense.
Documentation
Add a model page at `docs/models/<type>/<model>.md` covering:
- Supported variants and sizes
- Conversion commands
- Training examples (SFT, PEFT)
- Known limitations
Verification Workflow
After implementing bridge support, prompt the user to run these commands on the cluster:
1. Smoke test (single GPU)
```bash
uv run python -c "
from megatron.bridge import AutoBridge
bridge = AutoBridge.from_hf_pretrained('<org>/<model>')
provider = bridge.to_megatron_provider()
provider.tensor_model_parallel_size = 1
provider.pipeline_model_parallel_size = 1
provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)
bridge.load_hf_weights(model)
for i, (name, tensor) in enumerate(bridge.export_hf_weights(model, cpu=True)):
    print(name, tuple(tensor.shape))
    if i > 10: break
"
```
2. Conversion roundtrip (multi-GPU)
```bash
uv run python examples/conversion/convert_checkpoints.py import \
  --hf-model <org>/<model> \
  --megatron-path /workspace/<model> \
  --torch-dtype bfloat16
uv run python examples/conversion/convert_checkpoints.py export \
  --hf-model <org>/<model> \
  --megatron-path /workspace/<model>/iter_0000000 \
  --hf-path /workspace/<model>-hf-export
```
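After the export finishes, a cheap follow-up is to compare tensor names between the original HF checkpoint and the exported one. The sketch below assumes both are sharded checkpoints with a `model.safetensors.index.json` containing a `weight_map`. The helper names and the demo files are hypothetical, and it compares names only, not values.

```python
import json
import tempfile
from pathlib import Path


def load_weight_names(index_path: Path) -> set:
    """Read tensor names from a sharded checkpoint's model.safetensors.index.json."""
    with open(index_path) as f:
        return set(json.load(f)["weight_map"])


def diff_weight_names(original: set, exported: set) -> dict:
    """Report tensor names present on only one side of the roundtrip."""
    return {
        "missing_in_export": sorted(original - exported),
        "extra_in_export": sorted(exported - original),
    }


# Self-contained demo with two made-up index files; real paths would be the
# original HF snapshot and the --hf-path export directory.
with tempfile.TemporaryDirectory() as tmp:
    original_index = Path(tmp) / "original.index.json"
    exported_index = Path(tmp) / "exported.index.json"
    original_index.write_text(json.dumps({"weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00001.safetensors",
        "lm_head.weight": "model-00001-of-00001.safetensors",
    }}))
    exported_index.write_text(json.dumps({"weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00001.safetensors",
    }}))
    report = diff_weight_names(
        load_weight_names(original_index), load_weight_names(exported_index)
    )

print(report)
```

A non-empty report is a prompt to investigate (for example, a missing mapping entry) rather than proof of a bug on its own.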
3. Generation test
For LLMs:
```bash
uv run python examples/conversion/hf_to_megatron_generate_text.py \
  --hf_model_path <org>/<model> --prompt "Hello"
```
For VLMs:
```bash
uv run python examples/conversion/hf_to_megatron_generate_vlm.py \
  --hf_model_path <org>/<model> \
  --image_path "https://example.com/image.jpeg" \
  --prompt "Describe this image."
```
4. Run tests
```bash
uv run python -m pytest tests/unit_tests/models/<model>/ -v
uv run python -m pytest tests/functional_tests/models/<model>/ -v --run-gpu
```
Quick Decision Tree
```
User wants to add a model
│
├─ Has HF link? ─── No ──→ Ask for link (or config.json if private)
│
├─ Has text_config + vision_config? ─── Yes ──→ VLM path
│   ├─ Has Megatron vision encoder? ──→ Megatron encoder (Qwen3.5 pattern)
│   └─ No Megatron encoder ──→ HF encoder (Gemma3 pattern)
│
└─ No vision config ──→ LLM path (bridge only, no provider file)
    ├─ Standard GPT-style? ──→ Bridge with stock mappings
    └─ Custom layers? ──→ Bridge + local mapping subclasses / hook overrides
        ├─ Custom weight layout? ──→ Local mapping subclass in family dir
        └─ Custom import/export? ──→ Override bridge hooks (maybe_modify_*)
```
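For readers who prefer code to diagrams, the same branching can be written as a small function. This is only a restatement of the tree under the assumption that each question has a yes/no answer; `choose_path` and its parameter names are hypothetical.

```python
def choose_path(
    has_text_config: bool,
    has_vision_config: bool,
    has_megatron_vision_encoder: bool = False,
    custom_weight_layout: bool = False,
    custom_import_export: bool = False,
) -> list:
    """Return the branch labels of the decision tree for the given answers."""
    if has_text_config and has_vision_config:
        encoder = (
            "Megatron encoder (Qwen3.5 pattern)"
            if has_megatron_vision_encoder
            else "HF encoder (Gemma3 pattern)"
        )
        return ["VLM path", encoder]

    steps = ["LLM path (bridge only, no provider file)"]
    if custom_weight_layout:
        steps.append("Local mapping subclass in family dir")
    if custom_import_export:
        steps.append("Override bridge hooks (maybe_modify_*)")
    if len(steps) == 1:
        # No custom layers: a standard GPT-style model.
        steps.append("Bridge with stock mappings")
    return steps


print(choose_path(True, True, has_megatron_vision_encoder=False))
print(choose_path(False, False, custom_weight_layout=True))
```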