nemo-mbridge-perf-moe-vlm-training

Practical guidance for training MoE VLMs in Megatron Bridge. Compares FSDP and 3D-parallel approaches, using rounded lessons from Qwen3-VL, Qwen3-Next, and other multimodal experiments.

Source: nvidia/skills

Added on 2026-05-30

NPX Install

npx skill4agent add nvidia/skills nemo-mbridge-perf-moe-vlm-training

SKILL.md Content

MoE VLM Training

Stable docs: @docs/training/moe-optimization.md
Card: @skills/nemo-mbridge-perf-moe-vlm-training/card.yaml

FSDP vs 3D Parallel

Approach	Strength	Best fit
FSDP	Simplest path to a working multimodal run	first bring-up, memory-first tuning, awkward PP boundaries
3D parallel	Higher ceiling after tuning	stable models with a clean PP layout and time for deeper sweeps

For MoE VLMs, the practical workflow is usually:

1. get the first reliable run with FSDP
2. stabilize real-data input, recompute, and memory behavior
3. move to 3D parallel only if the throughput headroom is worth the extra work

Rounded Findings From Recent VLM Runs

Qwen3-VL class models

The main patterns were consistent across the tracker:

FSDP on GB200-class systems can already reach healthy high-teens utilization with a comparatively simple setup
B200 FSDP runs are viable, but more sensitive to recompute choice and frozen vision settings
3D parallel can recover to a similar or better operating point, but only after tuning MBS, recompute, and the real vision path together

Real data vs mock data

Mock-data VLM runs are not trustworthy performance proxies. In the experiments, image-free mock runs were not just slightly optimistic: they looked roughly twice as fast as runs with real multimodal input.

Use real or realistic image payloads before drawing any conclusion about VLM throughput.
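
One way to make this check routine is to run the same config once with mock data and once with real data, then compare the measured throughput. The sketch below is a minimal illustration, not part of Megatron Bridge: the function names, the 1.1 tolerance, and the tokens-per-second inputs are all hypothetical.

```python
def mock_optimism_ratio(mock_tokens_per_sec: float, real_tokens_per_sec: float) -> float:
    """Return how many times faster the mock-data run looks than the real-data run."""
    if mock_tokens_per_sec <= 0 or real_tokens_per_sec <= 0:
        raise ValueError("throughput values must be positive")
    return mock_tokens_per_sec / real_tokens_per_sec


def mock_is_trustworthy(
    mock_tokens_per_sec: float,
    real_tokens_per_sec: float,
    tolerance: float = 1.1,  # hypothetical threshold: mock may look at most 10% faster
) -> bool:
    """True only if the mock run is within `tolerance` of the real-data run."""
    return mock_optimism_ratio(mock_tokens_per_sec, real_tokens_per_sec) <= tolerance
```

A ratio near 2.0, as seen with image-free mock runs, means the mock numbers should be discarded rather than scaled.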

Smaller multimodal MoE runs

The smaller Qwen3.5-style multimodal experiments reinforce the same lessons:

HybridEP is a solid default on GB200
TE-scoped CUDA graphs help once the training loop is stable
larger MBS can pay off, but only if the vision encoder does not become the next bottleneck

Decision Guide

Choose FSDP when

you are bringing up a new VLM for the first time
the model has awkward stage boundaries across embedding, vision, and decoder
memory fit matters more than absolute throughput
you may freeze the vision stack during decoder-focused tuning

Choose 3D parallel when

the model is already stable under FSDP
the PP layout is clear and repeatable
you can sweep MBS, recompute, and CUDA-graph scope together
the goal is best steady-state throughput, not easiest bring-up
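
The two checklists above can be written down as a small helper so the choice is explicit in a launch script. This is a sketch under assumptions: `VlmRunContext` and `suggest_parallelism` are hypothetical names, and the rule that any FSDP condition wins over the 3D-parallel conditions is one reasonable reading of the guide, not a documented policy.

```python
from dataclasses import dataclass


@dataclass
class VlmRunContext:
    """Hypothetical summary of where a VLM training effort currently stands."""
    first_bring_up: bool
    awkward_stage_boundaries: bool  # embedding / vision / decoder split is unclear
    memory_bound: bool              # memory fit matters more than throughput
    stable_under_fsdp: bool
    clean_pp_layout: bool
    can_sweep_jointly: bool         # MBS, recompute, and CUDA-graph scope together
    wants_peak_throughput: bool


def suggest_parallelism(ctx: VlmRunContext) -> str:
    """Return "fsdp" or "3d_parallel" following the decision guide."""
    # Any bring-up, layout, or memory concern keeps the run on FSDP.
    if ctx.first_bring_up or ctx.awkward_stage_boundaries or ctx.memory_bound:
        return "fsdp"
    # 3D parallel is suggested only when every precondition holds.
    ready_for_3d = (
        ctx.stable_under_fsdp
        and ctx.clean_pp_layout
        and ctx.can_sweep_jointly
        and ctx.wants_peak_throughput
    )
    return "3d_parallel" if ready_for_3d else "fsdp"
```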

Key Tuning Knobs

Freeze the vision stack when appropriate: if the work is decoder-focused, freezing the vision side often gives a small but real throughput gain and reduces memory pressure.
Sweep MBS aggressively: VLMs are more MBS-sensitive than text-only MoE runs because the vision path changes the compute-to-overhead balance.
Prefer selective recompute once the model fits: full recompute is a useful bring-up tool, but selective recompute is usually the better steady state.
Match CUDA-graph scope to the workload: `attn moe_router moe_preprocess` is the safer MoE default, while narrower scopes can still be useful for controlled experiments.
Use ETP only when EP alone is insufficient: it can unlock a layout, but it also introduces more communication and more tuning surface.
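
Because these knobs interact, it helps to enumerate them as one joint grid instead of tuning each in isolation. The sketch below only builds the list of candidate settings; the dictionary keys and the candidate values are hypothetical placeholders, not Megatron Bridge config fields, and would need to be mapped onto the real recipe overrides.

```python
from itertools import product

# Hypothetical candidate values; adjust to what fits in memory.
MBS_VALUES = [1, 2, 4]
RECOMPUTE_MODES = ["full", "selective"]
CUDA_GRAPH_SCOPES = [
    (),                                        # CUDA graphs disabled
    ("attn",),                                 # narrower scope for controlled experiments
    ("attn", "moe_router", "moe_preprocess"),  # the safer MoE default named above
]


def build_sweep(freeze_vision_options=(False, True)):
    """Return one dict per combination of the tuning knobs."""
    return [
        {
            "micro_batch_size": mbs,
            "recompute": recompute,
            "cuda_graph_scope": list(scope),
            "freeze_vision": freeze_vision,
        }
        for mbs, recompute, scope, freeze_vision in product(
            MBS_VALUES, RECOMPUTE_MODES, CUDA_GRAPH_SCOPES, freeze_vision_options
        )
    ]
```

With the values shown this yields 3 x 2 x 3 x 2 = 36 candidates, so in practice the grid is usually pruned (for example, dropping full recompute once the model fits) before launching runs.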

Representative Config Families

FSDP-first GB200 path

TP=1  CP=1  PP=1
EP sized to the expert topology, often large
Dispatcher: HybridEP on GB200-class systems
Recompute: start with full, then relax toward selective recompute
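
"EP sized to the expert topology" can be made concrete by listing the EP sizes a given job could use. The sketch below assumes that, with TP=CP=PP=1 and no ETP, the EP size must divide both the number of experts and the number of GPUs; treat that as a working assumption to verify against the framework's own validation. The function name and the example sizes are hypothetical.

```python
def valid_ep_sizes(num_experts: int, world_size: int) -> list[int]:
    """List EP sizes that evenly divide both the expert count and the GPU count."""
    if num_experts <= 0 or world_size <= 0:
        raise ValueError("num_experts and world_size must be positive")
    upper = min(num_experts, world_size)
    return [
        ep
        for ep in range(1, upper + 1)
        if num_experts % ep == 0 and world_size % ep == 0
    ]


# Example: a hypothetical 128-expert model on 64 GPUs.
candidates = valid_ep_sizes(128, 64)
largest_ep = candidates[-1]  # "often large" in the FSDP-first path
```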

3D-parallel GB200 path