nemo-mbridge-perf-moe-comm-overlap

MoE Communication Overlap

For the higher-level overview, see:
  • @docs/training/communication-overlap.md
  • @skills/nemo-mbridge-perf-moe-comm-overlap/card.yaml

Quick Decision

Use MoE communication overlap when:
  • EP > 1
  • token dispatch or combine time is visible in the profile
  • the run is already correct and you are now tuning throughput
Avoid turning it on as an early bring-up step. It is easier to validate after the dispatcher, routing mode, and recompute plan are already stable.

Enablement

python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
# Optional: delayed wgrad for additional overlap
cfg.comm_overlap.delay_wgrad_compute = True
# IMPORTANT: disable shared expert overlap when using dispatch overlap
cfg.model.moe_shared_expert_overlap = False

Prerequisites

  • expert_model_parallel_size > 1
  • num_moe_experts > 1
  • moe_token_dispatcher_type must be "alltoall" or "flex"
  • Precision: BF16 or FP16
  • If PP is used, VPP (virtual_pipeline_model_parallel_size) must be set (non-None)
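
The prerequisites above can be sketched as a standalone preflight check. This is a minimal illustration, not the validation in comm_overlap.py: the OverlapPrereqs dataclass, its precision and pipeline_model_parallel_size fields, and the check_moe_overlap_prereqs helper are hypothetical stand-ins, and the example values are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OverlapPrereqs:
    """Hypothetical stand-in for the config fields the prerequisites mention."""

    expert_model_parallel_size: int
    num_moe_experts: int
    moe_token_dispatcher_type: str
    precision: str  # assumed labels such as "bf16", "fp16", "fp32"
    pipeline_model_parallel_size: int = 1
    virtual_pipeline_model_parallel_size: Optional[int] = None


def check_moe_overlap_prereqs(p: OverlapPrereqs) -> List[str]:
    """Return one message per unmet prerequisite; an empty list means all pass."""
    problems: List[str] = []
    if p.expert_model_parallel_size <= 1:
        problems.append("expert_model_parallel_size must be > 1")
    if p.num_moe_experts <= 1:
        problems.append("num_moe_experts must be > 1")
    if p.moe_token_dispatcher_type not in ("alltoall", "flex"):
        problems.append('moe_token_dispatcher_type must be "alltoall" or "flex"')
    if p.precision not in ("bf16", "fp16"):
        problems.append("precision must be BF16 or FP16")
    if (
        p.pipeline_model_parallel_size > 1
        and p.virtual_pipeline_model_parallel_size is None
    ):
        problems.append("PP is used, so virtual_pipeline_model_parallel_size must be set")
    return problems


# Illustrative values only: EP=16 with the alltoall dispatcher in BF16, no PP.
example = OverlapPrereqs(
    expert_model_parallel_size=16,
    num_moe_experts=8,
    moe_token_dispatcher_type="alltoall",
    precision="bf16",
)
print(check_moe_overlap_prereqs(example))
```

Returning a list of messages rather than raising on the first failure makes it easier to see every unmet prerequisite in one pass before launching a job.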

Flex dispatcher activation

Setting moe_flex_dispatcher_backend alone does not activate flex dispatch. You must also set moe_token_dispatcher_type = "flex".
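
The two-field requirement can be illustrated with a small sketch. It uses SimpleNamespace objects as stand-ins for the model config, and flex_dispatch_active is a hypothetical helper, not a library function. The "deepep" backend value is only an example.

```python
from types import SimpleNamespace


def flex_dispatch_active(model_cfg) -> bool:
    """Flex dispatch is selected by the dispatcher type, not by the backend flag."""
    return model_cfg.moe_token_dispatcher_type == "flex"


# Backend set, dispatcher type left at "alltoall": flex dispatch is not active.
backend_only = SimpleNamespace(
    moe_token_dispatcher_type="alltoall",
    moe_flex_dispatcher_backend="deepep",
)

# Both fields set: flex dispatch is active with the chosen backend.
both_set = SimpleNamespace(
    moe_token_dispatcher_type="flex",
    moe_flex_dispatcher_backend="deepep",
)

print(flex_dispatch_active(backend_only))  # False
print(flex_dispatch_active(both_set))      # True
```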

Recompute And CUDA Graph Interaction

  • Full recompute is not a good companion for the overlap path.
  • delay_wgrad_compute adds further constraints if CUDA-graph scopes include attention or MoE-router work.
  • In practice, selective recompute is the safer pairing when overlap is enabled.
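
The interactions above can be expressed as a small advisory check. This is a sketch under stated assumptions: recompute_overlap_warnings is a hypothetical helper, the granularity values "full" and "selective" are assumed labels, and the scope names "attn" and "moe_router" are placeholders for whatever the CUDA-graph scope setting actually uses.

```python
from typing import Iterable, List, Optional

# Assumed placeholder names for the attention and MoE-router CUDA-graph scopes.
RISKY_SCOPES_WITH_DELAY_WGRAD = {"attn", "moe_router"}


def recompute_overlap_warnings(
    overlap_enabled: bool,
    recompute_granularity: Optional[str],
    delay_wgrad_compute: bool = False,
    cuda_graph_scopes: Iterable[str] = (),
) -> List[str]:
    """Collect advisory messages for risky recompute / CUDA-graph pairings."""
    warnings: List[str] = []
    if overlap_enabled and recompute_granularity == "full":
        warnings.append("full recompute with MoE overlap; prefer selective recompute")
    risky = RISKY_SCOPES_WITH_DELAY_WGRAD & set(cuda_graph_scopes)
    if delay_wgrad_compute and risky:
        warnings.append(
            "delay_wgrad_compute with CUDA-graph scopes: " + ", ".join(sorted(risky))
        )
    return warnings


# Selective recompute with overlap and no CUDA graphs raises no advisory.
print(recompute_overlap_warnings(True, "selective"))
```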

Measured Short-Run Caveat

A 2026-05-18 current-main H100 x16 smoke test on Qwen3 30B-A3B mock pretraining used EP=16, alltoall, global batch size 1024, CUDA graphs disabled, and moe_permute_fusion=false because the PyTorch 25.11 / TE / Triton stack failed in Transformer Engine fused permutation in prior bring-up.
Results were directional rather than release-grade:
  • no EP overlap: 41.25s steady-state mean over iterations 3-8
  • EP overlap: 31.31s steady-state mean over iterations 3-8
  • EP overlap plus delay_wgrad_compute: 31.20s steady-state mean over iterations 3-8
Treat this as evidence that EP overlap can help an inter-node alltoall MoE shape when communication is exposed. It is not proof that delayed wgrad is a separate win, and it does not validate the fused permutation path. An earlier 2026-05-16 short smoke test on the same shape showed the same pattern.
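
To put the three reported means in relative terms, the arithmetic can be worked through as below. The helper names are illustrative; the only inputs are the steady-state means quoted above.

```python
def speedup(baseline_s: float, candidate_s: float) -> float:
    """Ratio of baseline step time to candidate step time (>1 means faster)."""
    return baseline_s / candidate_s


def time_reduction_pct(baseline_s: float, candidate_s: float) -> float:
    """Percentage of baseline step time saved by the candidate."""
    return 100.0 * (baseline_s - candidate_s) / baseline_s


# Steady-state means over iterations 3-8, in seconds, as reported above.
no_overlap = 41.25
ep_overlap = 31.31
ep_overlap_delay_wgrad = 31.20

print(
    f"EP overlap vs no overlap: {speedup(no_overlap, ep_overlap):.2f}x, "
    f"{time_reduction_pct(no_overlap, ep_overlap):.1f}% less step time"
)
print(
    "delay_wgrad_compute vs EP overlap alone: "
    f"{time_reduction_pct(ep_overlap, ep_overlap_delay_wgrad):.2f}% less step time"
)
```

EP overlap works out to about 1.32x, or roughly 24% less step time, while adding delay_wgrad_compute changes the mean by about 0.35%. A difference that small over a six-iteration window is hard to separate from run-to-run variation, which is consistent with not treating delayed wgrad as a separate win here.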

Code Anchors

  • Overlap validation: src/megatron/bridge/training/comm_overlap.py
  • Flex dispatcher backend: src/megatron/bridge/training/flex_dispatcher_backend.py
  • Config: src/megatron/bridge/training/config.py
  • Unit tests: tests/unit_tests/training/test_comm_overlap.py
  • DeepEP tests: tests/unit_tests/training/test_deepep.py

Pitfalls

  1. Shared expert overlap conflict: moe_shared_expert_overlap and overlap_moe_expert_parallel_comm can conflict. Disable shared expert overlap when using the dispatch overlap path.
  2. PP without VPP: MoE overlap requires VPP when pipeline parallelism is active. Without it, the overlap scheduling cannot interleave correctly.
  3. Flex != backend flag: moe_flex_dispatcher_backend="deepep" alone does nothing if moe_token_dispatcher_type is still "alltoall".
  4. Conservative recipe defaults: Most public recipes leave MoE overlap disabled. You need to explicitly enable it via overrides.
  5. Performance gains are workload-dependent: overlap helps most when dispatch communication is already a visible slice of step time. It is not guaranteed to help every small or lightly loaded EP run.

Verification

Look for overlap-related log messages during initialization. The comm overlap validation in comm_overlap.py will raise if prerequisites are not met, so a clean startup with the overlap flag set indicates that the prerequisites passed and the feature is active.
For a short performance-harness smoke, keep the command shape explicit and vary only one overlap knob at a time:
bash
uv run python scripts/performance/run_script.py \
  -m qwen \
  -mr qwen3_30b_a3b \
  --task pretrain \
  -g h100 \
  -c bf16 \
  -ng 16 \
  -gn 8 \
  --max_steps 8 \
  --config_variant v1 \
  --cuda_graph_impl none \
  --moe_flex_dispatcher_backend None \
  --moe_a2a_overlap false \
  --tokenizer_type NullTokenizer \
  comm_overlap.overlap_moe_expert_parallel_comm=true \
  comm_overlap.delay_wgrad_compute=false \
  model.moe_shared_expert_overlap=false
If fused MoE permutation fails during bring-up, add model.moe_permute_fusion=false to separate overlap timing from runtime-stack validation, then retest with the matched production container.