nemo-mbridge-perf-moe-comm-overlap
MoE Communication Overlap
For the higher-level overview, see:
- @docs/training/communication-overlap.md
- @skills/nemo-mbridge-perf-moe-comm-overlap/card.yaml
Quick Decision
Use MoE communication overlap when:
- `EP > 1`
- token dispatch or combine time is visible in the profile
- the run is already correct and you are now tuning throughput
Avoid turning it on as an early bring-up step. It is easier to validate after
the dispatcher, routing mode, and recompute plan are already stable.
Enablement

```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True

# Optional: delayed wgrad for additional overlap
cfg.comm_overlap.delay_wgrad_compute = True

# IMPORTANT: disable shared expert overlap when using dispatch overlap
cfg.model.moe_shared_expert_overlap = False
```

Prerequisites
- `expert_model_parallel_size > 1`
- `num_moe_experts > 1`
- `moe_token_dispatcher_type` must be `"alltoall"` or `"flex"`
- Precision: BF16 or FP16
- If PP is used, VPP (`virtual_pipeline_model_parallel_size`) must be set (non-`None`)
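
The prerequisite list above can be turned into a quick pre-flight check before launching a run. The sketch below is only an illustration: `MoEOverlapSettings` and `unmet_prerequisites` are hypothetical names, the `precision` and `pipeline_model_parallel_size` fields are assumptions added to express the last two bullets, and none of this is the validation code in Megatron Bridge itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MoEOverlapSettings:
    """Hypothetical container for the fields named in the prerequisite list."""

    expert_model_parallel_size: int
    num_moe_experts: int
    moe_token_dispatcher_type: str
    precision: str  # assumed to be a string such as "bf16" or "fp16"
    pipeline_model_parallel_size: int = 1  # assumed name for the PP degree
    virtual_pipeline_model_parallel_size: Optional[int] = None


def unmet_prerequisites(s: MoEOverlapSettings) -> List[str]:
    """Return one message per unmet prerequisite; an empty list means all pass."""
    problems = []
    if s.expert_model_parallel_size <= 1:
        problems.append("expert_model_parallel_size must be > 1")
    if s.num_moe_experts <= 1:
        problems.append("num_moe_experts must be > 1")
    if s.moe_token_dispatcher_type not in ("alltoall", "flex"):
        problems.append('moe_token_dispatcher_type must be "alltoall" or "flex"')
    if s.precision.lower() not in ("bf16", "fp16"):
        problems.append("precision must be BF16 or FP16")
    uses_pp = s.pipeline_model_parallel_size > 1
    if uses_pp and s.virtual_pipeline_model_parallel_size is None:
        problems.append("virtual_pipeline_model_parallel_size must be set when PP is used")
    return problems


if __name__ == "__main__":
    candidate = MoEOverlapSettings(
        expert_model_parallel_size=16,
        num_moe_experts=8,
        moe_token_dispatcher_type="alltoall",
        precision="bf16",
    )
    for message in unmet_prerequisites(candidate) or ["all prerequisites met"]:
        print(message)
```

Returning a list of messages rather than raising on the first failure makes it easier to fix several settings in one pass.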
Flex dispatcher activation
Setting `moe_flex_dispatcher_backend` alone does not activate flex dispatch.
You must also set `moe_token_dispatcher_type = "flex"`.
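
The two-field requirement above can be illustrated with a small stand-in object. This is a sketch only: `SimpleNamespace` substitutes for `cfg.model`, `flex_backend_in_effect` is a hypothetical helper, and the `"deepep"` value is the backend named in the pitfalls of this document.

```python
from types import SimpleNamespace

# Stand-in for cfg.model; only the two fields discussed here are modeled.
model = SimpleNamespace(
    moe_token_dispatcher_type="alltoall",
    moe_flex_dispatcher_backend=None,
)


def flex_backend_in_effect(m) -> bool:
    """True only when the dispatcher type is "flex" AND a backend is chosen."""
    return (
        m.moe_token_dispatcher_type == "flex"
        and m.moe_flex_dispatcher_backend is not None
    )


# Setting only the backend leaves the dispatcher type at "alltoall".
model.moe_flex_dispatcher_backend = "deepep"
backend_only = flex_backend_in_effect(model)

# Setting the dispatcher type as well completes the activation.
model.moe_token_dispatcher_type = "flex"
both_set = flex_backend_in_effect(model)

print(f"backend only: {backend_only}, both fields set: {both_set}")
```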
Recompute And CUDA Graph Interaction
- Full recompute is not a good companion for the overlap path.
- `delay_wgrad_compute` adds further constraints if CUDA-graph scopes include attention or MoE-router work.
- In practice, selective recompute is the safer pairing when overlap is enabled.
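
The three interaction notes above can be expressed as a small review helper. This is a rough sketch under stated assumptions: `overlap_pairing_notes` is a hypothetical function, the `"full"` and `"selective"` strings assume the recipe selects recompute through a granularity setting with those values, and the scope labels `"attention"` and `"moe_router"` are placeholders rather than confirmed CUDA-graph scope names.

```python
from typing import Iterable, List, Optional

# Placeholder labels for the scopes mentioned above; not confirmed identifiers.
SENSITIVE_SCOPES = {"attention", "moe_router"}


def overlap_pairing_notes(
    overlap_enabled: bool,
    delay_wgrad: bool,
    recompute_granularity: Optional[str],
    cuda_graph_scopes: Iterable[str] = (),
) -> List[str]:
    """Collect review notes for settings that pair poorly with MoE overlap."""
    notes = []
    if overlap_enabled and recompute_granularity == "full":
        notes.append("full recompute with overlap: prefer selective recompute")
    if delay_wgrad and SENSITIVE_SCOPES & set(cuda_graph_scopes):
        notes.append(
            "delay_wgrad_compute with attention or MoE-router CUDA-graph "
            "scopes: check the additional constraints"
        )
    return notes


if __name__ == "__main__":
    for note in overlap_pairing_notes(True, True, "full", ["attention"]):
        print(note)
```

The helper only flags combinations for a human to review; it does not encode what the additional constraints actually are.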
Measured Short-Run Caveat
A 2026-05-18 current-main H100 x16 smoke on Qwen3 30B-A3B mock pretraining
used `EP=16`, `alltoall`, global batch size 1024, CUDA graphs disabled, and
`moe_permute_fusion=false` because the PyTorch 25.11 / TE / Triton stack failed
in Transformer Engine fused permutation in prior bring-up.

Results were directional rather than release-grade:
- no EP overlap: 41.25s steady-state mean over iterations 3-8
- EP overlap: 31.31s steady-state mean over iterations 3-8
- EP overlap plus `delay_wgrad_compute`: 31.20s steady-state mean over iterations 3-8

Treat this as evidence that EP overlap can help an inter-node `alltoall` MoE
shape when communication is exposed. It is not proof that delayed wgrad is a
separate win, and it does not validate the fused permutation path. An earlier
2026-05-16 short smoke on the same shape showed the same pattern.
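
To put the three means above in relative terms, the arithmetic can be sketched as follows. The numbers are the ones reported in this section; the helper name `pct_reduction` is hypothetical.

```python
# Steady-state means (seconds) reported above for iterations 3-8.
baseline = 41.25          # no EP overlap
ep_overlap = 31.31        # EP overlap
ep_overlap_wgrad = 31.20  # EP overlap plus delay_wgrad_compute


def pct_reduction(before: float, after: float) -> float:
    """Percentage drop in step time going from `before` to `after`."""
    return 100.0 * (before - after) / before


overlap_gain = pct_reduction(baseline, ep_overlap)        # about 24.1 %
wgrad_gain = pct_reduction(ep_overlap, ep_overlap_wgrad)  # about 0.35 %
speedup = baseline / ep_overlap                           # about 1.32x

print(f"EP overlap vs. baseline:   {overlap_gain:.1f}% lower step time ({speedup:.2f}x)")
print(f"delayed wgrad vs. overlap: {wgrad_gain:.2f}% lower step time")
```

The second figure is well under one percent over six iterations, which is consistent with the caution that this run does not show delayed wgrad to be a separate win.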
Code Anchors
- Overlap validation: `src/megatron/bridge/training/comm_overlap.py`
- Flex dispatcher backend: `src/megatron/bridge/training/flex_dispatcher_backend.py`
- Config: `src/megatron/bridge/training/config.py`
- Unit tests: `tests/unit_tests/training/test_comm_overlap.py`
- DeepEP tests: `tests/unit_tests/training/test_deepep.py`
Pitfalls
- Shared expert overlap conflict: `moe_shared_expert_overlap` and `overlap_moe_expert_parallel_comm` can conflict. Disable shared expert overlap when using the dispatch overlap path.
- PP without VPP: MoE overlap requires VPP when pipeline parallelism is active. Without it, the overlap scheduling cannot interleave correctly.
- Flex != backend flag: `moe_flex_dispatcher_backend="deepep"` alone does nothing if `moe_token_dispatcher_type` is still `"alltoall"`.
- Conservative recipe defaults: Most public recipes leave MoE overlap disabled. You need to explicitly enable it via overrides.
- Performance gains are workload-dependent: overlap helps most when dispatch communication is already a visible slice of step time. It is not guaranteed to help every small or lightly loaded EP run.
Verification
Look for overlap-related log messages during initialization. The comm overlap
validation in `comm_overlap.py` will raise if prerequisites are not met, so a
clean startup confirms the feature is active.

For a short performance-harness smoke, keep the command shape explicit and vary
only one overlap knob at a time:

```bash
uv run python scripts/performance/run_script.py \
  -m qwen \
  -mr qwen3_30b_a3b \
  --task pretrain \
  -g h100 \
  -c bf16 \
  -ng 16 \
  -gn 8 \
  --max_steps 8 \
  --config_variant v1 \
  --cuda_graph_impl none \
  --moe_flex_dispatcher_backend None \
  --moe_a2a_overlap false \
  --tokenizer_type NullTokenizer \
  comm_overlap.overlap_moe_expert_parallel_comm=true \
  comm_overlap.delay_wgrad_compute=false \
  model.moe_shared_expert_overlap=false
```

If fused MoE permutation fails during bring-up, add
`model.moe_permute_fusion=false` to separate overlap timing from runtime-stack
validation, then retest with the matched production container.
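
The one-knob-at-a-time comparison described above can be laid out as a small override matrix. This sketch only builds the trailing `key=value` overrides from the command shown in this section; the variant names are hypothetical, and keeping `model.moe_shared_expert_overlap=false` in every variant (including the baseline) is an assumption made so that only one overlap knob changes between neighboring runs.

```python
from typing import Dict, List

# Overrides shared by every variant, using the keys from the command above.
BASE: Dict[str, str] = {
    "comm_overlap.overlap_moe_expert_parallel_comm": "false",
    "comm_overlap.delay_wgrad_compute": "false",
    "model.moe_shared_expert_overlap": "false",
}

# Hypothetical variant names; each step changes exactly one knob.
VARIANTS = [
    ("no_ep_overlap", {}),
    ("ep_overlap", {"comm_overlap.overlap_moe_expert_parallel_comm": "true"}),
    (
        "ep_overlap_delay_wgrad",
        {
            "comm_overlap.overlap_moe_expert_parallel_comm": "true",
            "comm_overlap.delay_wgrad_compute": "true",
        },
    ),
]


def override_args(changes: Dict[str, str]) -> List[str]:
    """Merge a variant's changes over BASE and render key=value overrides."""
    merged = {**BASE, **changes}
    return [f"{key}={value}" for key, value in merged.items()]


def changed_knobs(prev: Dict[str, str], curr: Dict[str, str]) -> List[str]:
    """List the override keys whose values differ between two variants."""
    a, b = {**BASE, **prev}, {**BASE, **curr}
    return [key for key in BASE if a[key] != b[key]]


if __name__ == "__main__":
    for (_, prev), (name, curr) in zip(VARIANTS, VARIANTS[1:]):
        if len(changed_knobs(prev, curr)) != 1:
            raise ValueError(f"{name} changes more than one knob")
    for name, changes in VARIANTS:
        print(name, " ".join(override_args(changes)))
```

Each printed line can be appended to the fixed part of the command, so the three runs differ only in the overlap settings under test.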