nemo-mbridge-perf-moe-optimization-workflow
MoE Training Optimization Workflow
Stable docs: @docs/training/moe-optimization.md
Card: @skills/nemo-mbridge-perf-moe-optimization-workflow/card.yaml
Source: Scalable Training of MoE Models with Megatron Core
Quick Reference
Think in terms of the paper's Three Walls:
- memory wall
- communication wall
- compute and host-overhead wall
MoE tuning is iterative. Fixing one wall usually exposes the next one, so the
best workflow is: fit first, scale second, profile third, then retune.
Phase 1: Make The Run Memory-Feasible
Start with a configuration that fits reliably before chasing throughput.
Recommended order:
- Use the smallest amount of model parallelism that still fits.
- Turn on selective recompute before falling back to full recompute.
- Add offloading only when recompute and parallelism are still insufficient.
- Use `--fake-init-process-group` to sanity-check large parallel layouts on a single GPU before burning cluster time.
Recompute guidance
Prefer selective recompute for MoE runs:
- good first choices: `layernorm`, `core_attn`, `moe_act`, or model-specific modules (`mla_up_proj`, `mlp`, `shared_experts`)
- use full recompute only when the run still does not fit
- revisit recompute after enabling CUDA graphs, because some graph scopes and full recompute paths do not mix well
As a rule of thumb, fine-grained recompute often recovers most of the needed
memory while keeping throughput much closer to the non-recompute baseline than
full-layer recompute does.
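The escalation order above can be sketched as a small helper that emits recompute arguments. This is a minimal sketch: the flag names `--recompute-granularity` and `--recompute-modules` are assumptions that do not appear in this workflow, and `recompute_args` is a hypothetical helper, so verify both against the launcher you actually use.

```python
from typing import List, Optional

# Module names this section lists as selective-recompute candidates.
SELECTIVE_CANDIDATES = (
    "layernorm", "core_attn", "moe_act",
    "mla_up_proj", "mlp", "shared_experts",
)


def recompute_args(modules: Optional[List[str]] = None, full: bool = False) -> List[str]:
    """Build recompute CLI arguments (flag names are assumed, not confirmed)."""
    if full:
        # Fallback only: full recompute may need extra settings not shown here.
        return ["--recompute-granularity", "full"]
    if not modules:
        # No recompute requested.
        return []
    unknown = [m for m in modules if m not in SELECTIVE_CANDIDATES]
    if unknown:
        raise ValueError(f"not a listed selective-recompute module: {unknown}")
    return ["--recompute-granularity", "selective", "--recompute-modules", *modules]


# Start small, and add modules only while the run still does not fit.
first_try = recompute_args(["layernorm", "moe_act"])
```

Starting from a short module list and growing it keeps the run close to the non-recompute baseline for as long as possible.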
Phase 2: Choose Parallelism For Scale
Priority order:
- Maximize DP once the model fits.
- Keep the hot communication path inside the fast interconnect when possible.
- Use PP, plus VPP if needed, for multi-node scaling.
- Prefer EP over extra TP for expert layers.
- Add CP for long context once sequence length makes attention memory dominant.
Parallel Folding
Parallel Folding decouples attention and MoE parallelism so you do not have to
pick a single compromise layout:
```text
Attention: TP × CP × DP × PP
MoE:       ETP × EP × EDP × PP
```

Key knobs:
- `--expert-model-parallel-size`
- `--expert-tensor-parallel-size`
Use it when attention prefers some TP or CP, but expert layers benefit from a
larger EP degree than the dense layers can tolerate.
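Because both factorizations cover the same set of GPUs, the data-parallel degrees follow from the other sizes. The sketch below works through that arithmetic. `folded_layout` is a hypothetical helper, and the 64-GPU layout is an illustrative example rather than a recommendation.

```python
from typing import Dict


def folded_layout(world_size: int, tp: int, cp: int, pp: int, etp: int, ep: int) -> Dict[str, int]:
    """Derive DP (attention) and EDP (MoE) from a shared world size.

    Attention: world_size = TP x CP x DP x PP
    MoE:       world_size = ETP x EP x EDP x PP
    """
    attn_model_parallel = tp * cp * pp
    moe_model_parallel = etp * ep * pp
    if world_size % attn_model_parallel != 0:
        raise ValueError("world_size is not divisible by TP x CP x PP")
    if world_size % moe_model_parallel != 0:
        raise ValueError("world_size is not divisible by ETP x EP x PP")
    return {
        "dp": world_size // attn_model_parallel,
        "edp": world_size // moe_model_parallel,
    }


# 64 GPUs: attention uses TP=2, experts use EP=8 with no expert TP.
layout = folded_layout(world_size=64, tp=2, cp=1, pp=4, etp=1, ep=8)
print(layout)  # {'dp': 8, 'edp': 2}
```

In this example the attention layers keep TP=2 while the expert layers spread across EP=8, which a single shared layout could not express.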
Phase 3: Profile The Dominant Bottleneck
| Bottleneck | What it looks like | Primary fixes |
|---|---|---|
| Memory | Run fits only with aggressive full recompute or OOMs during warmup | selective recompute, FP8, offloading, better PP layout |
| Communication | Nsight shows large all-to-all or collective blocks | DeepEP or HybridEP, EP overlap, DP/TP overlap, better PP layout |
| Host overhead | GPU gaps, launch-bound traces, Python overhead | CUDA graphs, |
| Compute | Low SM utilization after comm and host issues are addressed | grouped GEMM, fusion work, FP8, dispatcher-specific kernel tuning |
Dispatcher And Overlap Guidance
Use dispatcher choice as a bottleneck fix, not as the first tuning knob.
- `moe_token_dispatcher_type="alltoall"`: safest bring-up path, fine for smaller EP sizes
- `moe_token_dispatcher_type="flex"` + `moe_flex_dispatcher_backend="deepep"`: strong default for H100 and B200 style deployments
- `moe_token_dispatcher_type="flex"` + `moe_flex_dispatcher_backend="hybridep"`: strongest starting point on GB200 or GB300 NVL72 systems
If the all-to-all path is visible in profiles, combine dispatcher tuning with:
- `--overlap-moe-expert-parallel-comm`
- `--overlap-grad-reduce`
- `--tp-comm-overlap`
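The dispatcher guidance can be captured as a lookup that returns config overrides per platform. This is a sketch under assumptions: the platform labels (`"bringup"`, `"h100"`, and so on) and the `dispatcher_overrides` helper are hypothetical, and only the two setting names and their values come from the guidance above.

```python
from typing import Dict

_DEEPEP = {"moe_token_dispatcher_type": "flex", "moe_flex_dispatcher_backend": "deepep"}
_HYBRIDEP = {"moe_token_dispatcher_type": "flex", "moe_flex_dispatcher_backend": "hybridep"}

# Hypothetical platform labels mapped to the dispatcher settings named above.
_DISPATCHER_BY_PLATFORM: Dict[str, Dict[str, str]] = {
    "bringup": {"moe_token_dispatcher_type": "alltoall"},
    "h100": _DEEPEP,
    "b200": _DEEPEP,
    "gb200": _HYBRIDEP,
    "gb300": _HYBRIDEP,
}


def dispatcher_overrides(platform: str) -> Dict[str, str]:
    """Return a copy of the suggested dispatcher settings for a platform label."""
    key = platform.lower()
    if key not in _DISPATCHER_BY_PLATFORM:
        raise ValueError(f"unknown platform label: {platform!r}")
    return dict(_DISPATCHER_BY_PLATFORM[key])


print(dispatcher_overrides("GB200"))
```

Keeping `"bringup"` as an explicit entry reflects the advice to treat dispatcher choice as a bottleneck fix: start on `alltoall`, then switch once profiles show the all-to-all path.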
FP8 Recipe Quick Decision
| Platform | Recommended starting recipe |
|---|---|
| Hopper | FP8 blockwise |
| Blackwell | MXFP8 |
| Blackwell, speed-first exploration | NVFP4 after the BF16 or FP8 path is stable |
Keep the router in FP32. The largest wins usually come from expert GEMMs and
other heavy matrix math, not from trying to quantize every small MoE component.
CUDA Graphs For MoE
For dropless MoE, start with partial TE-scoped graphs:
- `attn`
- `moe_router`
- `moe_preprocess`
That path usually gives a meaningful step-time win while keeping the dynamic
expert work outside the graph. Expect a moderate speedup when launch overhead is
visible, but budget several extra GB of memory and verify that shapes remain
static.
Use full-iteration graphs only for graph-friendly workloads such as drop-and-pad
or tightly controlled static-shape experiments.
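One way to read the guidance above is as a small decision rule. The sketch below is an interpretation, not a documented API: `choose_graph_plan`, its parameters, and the `mode` labels are all hypothetical, and only the three scope names come from this section.

```python
from typing import Dict, List, Union

# Scopes this section suggests capturing for dropless MoE.
PARTIAL_SCOPES = ["attn", "moe_router", "moe_preprocess"]

Plan = Dict[str, Union[str, List[str]]]


def choose_graph_plan(dropless: bool, captured_shapes_static: bool) -> Plan:
    """Pick a CUDA graph plan following the rule of thumb in this section."""
    if not captured_shapes_static:
        # Graph capture needs static shapes in the captured region.
        return {"mode": "none", "scopes": []}
    if dropless:
        # Keep the dynamic expert work outside the graph.
        return {"mode": "partial", "scopes": list(PARTIAL_SCOPES)}
    # Drop-and-pad or other static-shape workloads can try full-iteration graphs.
    return {"mode": "full_iteration", "scopes": []}


print(choose_graph_plan(dropless=True, captured_shapes_static=True))
```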
Related references:
- @skills/nemo-mbridge-perf-cuda-graphs/SKILL.md
- @docs/training/cuda-graphs.md
- @docs/training/activation-recomputation.md
Pitfalls
- Do not optimize in the wrong order: fitting the model and selecting sane parallelism matter more than micro-optimizations.
- Platform changes the limiting wall: H100-class runs often feel more communication-bound, while GB200 or GB300 runs often expose CPU or launch overhead earlier.
- FP8 MFU can look misleadingly low: compare absolute throughput as well as MFU when switching precision modes.
- CUDA graphs and recompute interact: TE-scoped graphs are usually paired with selective recompute, not blanket full recompute.
- Parallel Folding is not optional at large scale: once attention and expert layers want clearly different layouts, a single shared TP or EP plan becomes a tax on both.
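The FP8 MFU pitfall is easiest to see with numbers. The sketch below assumes MFU is computed against a precision-specific hardware peak, which is one common convention but is not stated in this document. Every figure is an illustrative round number, not a measurement or a hardware specification.

```python
def mfu(model_flops_per_step: float, step_time_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs utilization: achieved FLOP/s divided by peak FLOP/s."""
    return model_flops_per_step / step_time_s / peak_flops_per_s


# Hypothetical workload: the same model FLOPs per step in both precisions.
MODEL_FLOPS_PER_STEP = 400e12

# Hypothetical peaks: the FP8 peak is assumed to be twice the BF16 peak.
bf16_mfu = mfu(MODEL_FLOPS_PER_STEP, step_time_s=1.0, peak_flops_per_s=1000e12)
fp8_mfu = mfu(MODEL_FLOPS_PER_STEP, step_time_s=0.7, peak_flops_per_s=2000e12)

# Absolute throughput gain from the shorter step time.
speedup = 1.0 / 0.7

print(f"BF16 MFU: {bf16_mfu:.3f}")
print(f"FP8 MFU:  {fp8_mfu:.3f}")
print(f"Step-time speedup: {speedup:.2f}x")
```

Under these assumptions the FP8 run finishes each step faster, yet its MFU is lower because the denominator doubled, which is why absolute throughput belongs next to MFU in any precision comparison.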