nemo-mbridge-perf-moe-optimization-workflow

MoE Training Optimization Workflow

Stable docs: @docs/training/moe-optimization.md
Card: @skills/nemo-mbridge-perf-moe-optimization-workflow/card.yaml
Source: Scalable Training of MoE Models with Megatron Core

Quick Reference

Think in terms of the paper's Three Walls:
  • memory wall
  • communication wall
  • compute and host-overhead wall
MoE tuning is iterative. Fixing one wall usually exposes the next one, so the best workflow is: fit first, scale second, profile third, then retune.

Phase 1: Make The Run Memory-Feasible

Start with a configuration that fits reliably before chasing throughput.
Recommended order:
  1. Use the smallest amount of model parallelism that still fits.
  2. Turn on selective recompute before falling back to full recompute.
  3. Add offloading only when recompute and parallelism are still insufficient.
  4. Use --fake-init-process-group to sanity-check large parallel layouts on a single GPU before burning cluster time.
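The recommended order can be read as an escalation ladder: try the cheapest remedy first and move on only when the run still does not fit. The sketch below is one possible reading, not part of the source workflow. It assumes a hypothetical `fits` callback that reports whether a trial configuration fits in memory (for example, based on a dry run), and the config keys are illustrative labels rather than real framework options.

```python
from typing import Callable, Dict, List, Optional

# Escalation ladder, cheapest remedy first.
# Keys and values are illustrative labels, not real framework options.
LADDER: List[Dict[str, str]] = [
    {"recompute": "none", "offload": "off"},
    {"recompute": "selective", "offload": "off"},
    {"recompute": "full", "offload": "off"},
    {"recompute": "full", "offload": "on"},
]


def first_feasible(fits: Callable[[Dict[str, str]], bool]) -> Optional[Dict[str, str]]:
    """Return the least aggressive setting that fits, or None if nothing does."""
    for step in LADDER:
        if fits(step):
            return step
    return None  # nothing fits: raise model parallelism and walk the ladder again


# Example: pretend that only configurations with some recompute fit.
choice = first_feasible(lambda cfg: cfg["recompute"] != "none")
print(choice)
```

If `first_feasible` returns `None`, the next move under this reading is to increase model parallelism by the smallest step available and repeat.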

Recompute guidance

Prefer selective recompute for MoE runs:
  • good first choices: layernorm, core_attn, moe_act, mlp, or model-specific modules (shared_experts, mla_up_proj)
  • use full recompute only when the run still does not fit
  • revisit recompute after enabling CUDA graphs, because some graph scopes and full recompute paths do not mix well
As a rule of thumb, fine-grained recompute often recovers most of the needed memory while keeping throughput much closer to the non-recompute baseline than full-layer recompute does.
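As an illustration, a small helper can turn a chosen module list into launcher arguments. This is a sketch under assumptions: the flag spellings `--recompute-granularity` and `--recompute-modules` are not stated in this document and should be checked against your Megatron-LM version, and `recompute_args` is a hypothetical helper name.

```python
from typing import List

# Module names listed above as good first choices for selective recompute.
KNOWN_MODULES = {"layernorm", "core_attn", "moe_act", "mlp", "shared_experts", "mla_up_proj"}


def recompute_args(modules: List[str]) -> List[str]:
    """Build CLI arguments for selective recompute of the given modules."""
    unknown = sorted(set(modules) - KNOWN_MODULES)
    if unknown:
        raise ValueError(f"unexpected recompute modules: {unknown}")
    if not modules:
        return []  # no recompute requested
    # Flag names are an assumption; verify them against your Megatron-LM version.
    return ["--recompute-granularity", "selective", "--recompute-modules", *modules]


print(" ".join(recompute_args(["layernorm", "moe_act"])))
```

Starting with a short module list and adding entries one at a time makes it easier to see which module recovers the most memory for the least throughput cost.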

Phase 2: Choose Parallelism For Scale

Priority order:
  1. Maximize DP once the model fits.
  2. Keep the hot communication path inside the fast interconnect when possible.
  3. Use PP, plus VPP if needed, for multi-node scaling.
  4. Prefer EP over extra TP for expert layers.
  5. Add CP for long context once sequence length makes attention memory dominant.

Parallel Folding

Parallel Folding decouples attention and MoE parallelism so you do not have to pick a single compromise layout:
Attention: TP × CP × DP × PP
MoE:       ETP × EP × EDP × PP
Key knobs:
  • --expert-model-parallel-size
  • --expert-tensor-parallel-size
Use it when attention prefers some TP or CP, but expert layers benefit from a larger EP degree than the dense layers can tolerate.
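Both products in the layout above must equal the same world size, so DP and EDP follow from the other degrees. The sketch below works through that arithmetic; `folded_layout` is a hypothetical helper, the example sizes are made up, and real frameworks may impose additional divisibility constraints that are not checked here.

```python
def folded_layout(world_size: int, tp: int, cp: int, pp: int, etp: int, ep: int) -> dict:
    """Derive DP (attention) and EDP (experts) so both layouts cover the same GPUs."""
    attn_mp = tp * cp * pp   # attention side: TP x CP x PP
    moe_mp = etp * ep * pp   # expert side:    ETP x EP x PP
    if world_size % attn_mp or world_size % moe_mp:
        raise ValueError("world size must be divisible by both model-parallel products")
    return {"dp": world_size // attn_mp, "edp": world_size // moe_mp}


# Illustrative numbers only: 64 GPUs, TP=2 for attention, EP=8 for experts.
layout = folded_layout(world_size=64, tp=2, cp=1, pp=4, etp=1, ep=8)
print(layout)  # {'dp': 8, 'edp': 2}
```

In this example the attention layers run with TP=2 and DP=8 while the expert layers run with ETP=1, EP=8, and EDP=2: 2 × 1 × 8 × 4 = 1 × 8 × 2 × 4 = 64.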

Phase 3: Profile The Dominant Bottleneck

| Bottleneck | What it looks like | Primary fixes |
| --- | --- | --- |
| Memory | Run fits only with aggressive full recompute or OOMs during warmup | selective recompute, FP8, offloading, better PP layout |
| Communication | Nsight shows large all-to-all or collective blocks | DeepEP or HybridEP, EP overlap, DP/TP overlap, better PP layout |
| Host overhead | GPU gaps, launch-bound traces, Python overhead | CUDA graphs, --manual-gc, higher MBS, CPU affinity tuning |
| Compute | Low SM utilization after comm and host issues are addressed | grouped GEMM, fusion work, FP8, dispatcher-specific kernel tuning |
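The triage order implied by the table (memory first, then communication or host overhead, compute last) can be sketched as a small decision function. Everything here is an assumption for illustration: the `triage` helper, its inputs (fractions of step time read off a profile), and the 10% threshold are hypothetical and not taken from the source.

```python
from typing import Dict, List

# Primary fixes copied from the bottleneck table.
FIXES: Dict[str, List[str]] = {
    "memory": ["selective recompute", "FP8", "offloading", "better PP layout"],
    "communication": ["DeepEP or HybridEP", "EP overlap", "DP/TP overlap", "better PP layout"],
    "host": ["CUDA graphs", "--manual-gc", "higher MBS", "CPU affinity tuning"],
    "compute": ["grouped GEMM", "fusion work", "FP8", "dispatcher-specific kernel tuning"],
}


def triage(memory_bound: bool, exposed_comm: float, host_gaps: float,
           threshold: float = 0.10) -> str:
    """Pick the wall to attack next.

    exposed_comm and host_gaps are fractions of step time taken from a profile.
    The default threshold is an illustrative assumption, not a recommendation.
    """
    if memory_bound:
        return "memory"
    if max(exposed_comm, host_gaps) >= threshold:
        return "communication" if exposed_comm >= host_gaps else "host"
    return "compute"  # judged only after comm and host issues are addressed


wall = triage(memory_bound=False, exposed_comm=0.25, host_gaps=0.05)
print(wall, "->", ", ".join(FIXES[wall]))
```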

Dispatcher And Overlap Guidance

Use dispatcher choice as a bottleneck fix, not as the first tuning knob.
  • moe_token_dispatcher_type="alltoall": safest bring-up path, fine for smaller EP sizes
  • moe_token_dispatcher_type="flex" + moe_flex_dispatcher_backend="deepep": strong default for H100 and B200 style deployments
  • moe_token_dispatcher_type="flex" + moe_flex_dispatcher_backend="hybridep": strongest starting point on GB200 or GB300 NVL72 systems
If the all-to-all path is visible in profiles, combine dispatcher tuning with:
  • --overlap-moe-expert-parallel-comm
  • --overlap-grad-reduce
  • --tp-comm-overlap
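The dispatcher guidance can be summarized as a lookup that stays on the bring-up path until profiles show the all-to-all as a bottleneck. This is a minimal sketch: the setting names and values come from the list above, while the `dispatcher_settings` helper and the platform labels it accepts are hypothetical.

```python
from typing import Dict

# Platform label -> flex dispatcher backend, following the guidance above.
BACKEND_BY_PLATFORM = {
    "H100": "deepep",
    "B200": "deepep",
    "GB200": "hybridep",
    "GB300": "hybridep",
}


def dispatcher_settings(platform: str, all_to_all_is_bottleneck: bool) -> Dict[str, str]:
    """Return dispatcher settings for a platform label (labels are illustrative)."""
    if not all_to_all_is_bottleneck:
        # Bring-up path: keep the safest dispatcher until profiles say otherwise.
        return {"moe_token_dispatcher_type": "alltoall"}
    if platform not in BACKEND_BY_PLATFORM:
        raise ValueError(f"no guidance for platform {platform!r}")
    return {
        "moe_token_dispatcher_type": "flex",
        "moe_flex_dispatcher_backend": BACKEND_BY_PLATFORM[platform],
    }


print(dispatcher_settings("GB200", all_to_all_is_bottleneck=True))
```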

FP8 Recipe Quick Decision

| Platform | Recommended starting recipe |
| --- | --- |
| Hopper | FP8 blockwise |
| Blackwell | MXFP8 |
| Blackwell, speed-first exploration | NVFP4 after the BF16 or FP8 path is stable |
Keep the router in FP32. The largest wins usually come from expert GEMMs and other heavy matrix math, not from trying to quantize every small MoE component.

CUDA Graphs For MoE

For dropless MoE, start with partial TE-scoped graphs:
  • attn
  • moe_router
  • moe_preprocess
That path usually gives a meaningful step-time win while keeping the dynamic expert work outside the graph. Expect a moderate speedup when launch overhead is visible, but budget several extra GB of memory and verify that shapes remain static.
Use full-iteration graphs only for graph-friendly workloads such as drop-and-pad or tightly controlled static-shape experiments.
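A quick sanity check on a graph plan can catch the two mismatches this section warns about: full-iteration graphs on dropless MoE, and graphs combined with full recompute. The sketch below is illustrative only; `check_graph_plan` and its arguments are hypothetical names, not framework options.

```python
from typing import List

# Partial scopes suggested above for dropless MoE.
PARTIAL_SCOPES = ["attn", "moe_router", "moe_preprocess"]


def check_graph_plan(dropless: bool, full_iteration: bool, recompute: str) -> List[str]:
    """Return warnings for graph plans that contradict the guidance above."""
    warnings: List[str] = []
    if dropless and full_iteration:
        warnings.append(
            "dropless MoE keeps dynamic expert work; prefer partial scopes: "
            + ", ".join(PARTIAL_SCOPES)
        )
    if recompute == "full":
        warnings.append("revisit recompute: graphs usually pair with selective recompute")
    return warnings


for message in check_graph_plan(dropless=True, full_iteration=True, recompute="full"):
    print("warning:", message)
```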
Related references:
  • @skills/nemo-mbridge-perf-cuda-graphs/SKILL.md
  • @docs/training/cuda-graphs.md
  • @docs/training/activation-recomputation.md

Pitfalls

  1. Do not optimize in the wrong order: fitting the model and selecting sane parallelism matter more than micro-optimizations.
  2. Platform changes the limiting wall: H100-class runs often feel more communication-bound, while GB200 or GB300 runs often expose CPU or launch overhead earlier.
  3. FP8 MFU can look misleadingly low: compare absolute throughput as well as MFU when switching precision modes.
  4. CUDA graphs and recompute interact: TE-scoped graphs are usually paired with selective recompute, not blanket full recompute.
  5. Parallel Folding is not optional at large scale: once attention and expert layers want clearly different layouts, a single shared TP or EP plan becomes a tax on both.
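Pitfall 3 is easier to see with numbers. One plausible reason FP8 MFU looks low, assumed here, is that MFU is computed against the higher FP8 peak. All figures below are invented for illustration and are not measurements from any platform.

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs utilization: achieved model FLOP/s divided by peak FLOP/s."""
    return achieved_tflops / peak_tflops


# Hypothetical per-GPU numbers: the FP8 peak is assumed to be twice the BF16 peak.
bf16_achieved, bf16_peak = 400.0, 1000.0
fp8_achieved, fp8_peak = 520.0, 2000.0

bf16_mfu = mfu(bf16_achieved, bf16_peak)
fp8_mfu = mfu(fp8_achieved, fp8_peak)
speedup = fp8_achieved / bf16_achieved

print(f"BF16 MFU {bf16_mfu:.0%}, FP8 MFU {fp8_mfu:.0%}, speedup {speedup:.2f}x")
```

In this made-up case MFU drops from 40% to 26% even though the FP8 run is 1.3 times faster, which is why absolute throughput belongs next to MFU in any precision comparison.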