nemo-mbridge-perf-moe-comm-overlap

MoE Communication Overlap

For the higher-level overview, see:

@docs/training/communication-overlap.md
@skills/nemo-mbridge-perf-moe-comm-overlap/card.yaml

Quick Decision

Use MoE communication overlap when:

- `EP > 1`
- token dispatch or combine time is visible in the profile
- the run is already correct and you are now tuning throughput

Avoid turning it on as an early bring-up step. It is easier to validate after the dispatcher, routing mode, and recompute plan are already stable.

Enablement

```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True

# Optional: delayed wgrad for additional overlap
cfg.comm_overlap.delay_wgrad_compute = True

# IMPORTANT: disable shared expert overlap when using dispatch overlap
cfg.model.moe_shared_expert_overlap = False
```

Prerequisites

- `expert_model_parallel_size > 1`
- `num_moe_experts > 1`
- `moe_token_dispatcher_type` must be `"alltoall"` or `"flex"`
- Precision: BF16 or FP16
- If PP is used, VPP (`virtual_pipeline_model_parallel_size`) must be set (non-`None`)
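The prerequisites above can be sketched as a standalone pre-flight check. This is an illustration only, not the validation in `comm_overlap.py`: the `MoEOverlapSettings` class, the `unmet_prerequisites` helper, and the `precision` and `pipeline_model_parallel_size` field names are hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MoEOverlapSettings:
    """Hypothetical stand-in for the config fields the prerequisites mention."""
    expert_model_parallel_size: int
    num_moe_experts: int
    moe_token_dispatcher_type: str
    precision: str  # assumed spelling: "bf16" or "fp16"
    pipeline_model_parallel_size: int = 1
    virtual_pipeline_model_parallel_size: Optional[int] = None


def unmet_prerequisites(s: MoEOverlapSettings) -> List[str]:
    """Return one message per unmet prerequisite; empty means all are met."""
    problems = []
    if s.expert_model_parallel_size <= 1:
        problems.append("expert_model_parallel_size must be > 1")
    if s.num_moe_experts <= 1:
        problems.append("num_moe_experts must be > 1")
    if s.moe_token_dispatcher_type not in ("alltoall", "flex"):
        problems.append('moe_token_dispatcher_type must be "alltoall" or "flex"')
    if s.precision.lower() not in ("bf16", "fp16"):
        problems.append("precision must be BF16 or FP16")
    # VPP is only required when pipeline parallelism is actually in use.
    if (s.pipeline_model_parallel_size > 1
            and s.virtual_pipeline_model_parallel_size is None):
        problems.append("PP is used, so virtual_pipeline_model_parallel_size must be set")
    return problems


ok = MoEOverlapSettings(8, 64, "alltoall", "bf16")
bad = MoEOverlapSettings(8, 64, "allgather", "bf16", pipeline_model_parallel_size=2)
print(unmet_prerequisites(ok))
print(unmet_prerequisites(bad))
```

Collecting every failure instead of raising on the first makes it easier to fix a config in one pass; the real validation may behave differently.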

Flex dispatcher activation

Setting `moe_flex_dispatcher_backend` alone does not activate flex dispatch. You must also set `moe_token_dispatcher_type = "flex"`.
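A minimal sketch of the two-setting requirement, using a plain `SimpleNamespace` as a stand-in for the model section of the config. The `flex_dispatch_active` helper is hypothetical, and placing both fields on the same object is an assumption made for illustration.

```python
from types import SimpleNamespace

# Stand-in for the model config section (assumption: both fields live together).
model = SimpleNamespace(
    moe_token_dispatcher_type="alltoall",
    moe_flex_dispatcher_backend=None,
)


def flex_dispatch_active(m) -> bool:
    """Hypothetical helper: only the dispatcher type selects the flex path."""
    return m.moe_token_dispatcher_type == "flex"


# Setting only the backend leaves the dispatcher on "alltoall".
model.moe_flex_dispatcher_backend = "deepep"
backend_only = flex_dispatch_active(model)

# Setting the dispatcher type as well selects flex dispatch.
model.moe_token_dispatcher_type = "flex"
both_set = flex_dispatch_active(model)

print(backend_only, both_set)
```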

Recompute And CUDA Graph Interaction

- Full recompute is not a good companion for the overlap path.
- `delay_wgrad_compute` adds further constraints if CUDA-graph scopes include attention or MoE-router work.
- In practice, selective recompute is the safer pairing when overlap is enabled.
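The recompute guidance can be expressed as a small advisory check. This sketch assumes the model config exposes a `recompute_granularity` field taking `"full"` or `"selective"`; the `recompute_notes` helper is hypothetical, and the CUDA-graph scope constraints for `delay_wgrad_compute` are deliberately not modeled here.

```python
from types import SimpleNamespace
from typing import List


def recompute_notes(model, overlap_enabled: bool) -> List[str]:
    """Return advisory notes; an empty list means nothing to flag."""
    notes = []
    if overlap_enabled and model.recompute_granularity == "full":
        notes.append(
            "full recompute is a poor companion for MoE overlap; "
            "consider recompute_granularity='selective'"
        )
    return notes


# Stand-in for the model config section (assumption, not the real class).
model = SimpleNamespace(recompute_granularity="full")
before = recompute_notes(model, overlap_enabled=True)

model.recompute_granularity = "selective"
after = recompute_notes(model, overlap_enabled=True)

print(before)
print(after)
```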

Measured Short-Run Caveat

A 2026-05-18 smoke run on current main, on H100 x16 with Qwen3 30B-A3B mock pretraining, used `EP=16`, `alltoall`, global batch size 1024, CUDA graphs disabled, and `moe_permute_fusion=false`. Fused permutation was turned off because the PyTorch 25.11 / TE / Triton stack had failed in Transformer Engine fused permutation during prior bring-up.

Results were directional rather than release-grade:

- no EP overlap: 41.25s steady-state mean over iterations 3-8
- EP overlap: 31.31s steady-state mean over iterations 3-8
- EP overlap plus `delay_wgrad_compute`: 31.20s steady-state mean over iterations 3-8

Treat this as evidence that EP overlap can help an inter-node `alltoall` MoE shape when communication is exposed. It is not proof that delayed wgrad is a separate win, and it does not validate the fused permutation path. An earlier 2026-05-16 short smoke on the same shape showed the same pattern.
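To put the reported means in relative terms, the arithmetic below uses only the three numbers quoted above. The `relative_reduction` helper is introduced here for illustration.

```python
# Steady-state means (seconds) as reported for iterations 3-8.
no_overlap_s = 41.25
ep_overlap_s = 31.31
ep_overlap_delay_wgrad_s = 31.20


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction of the baseline step time removed by the candidate."""
    return (baseline - candidate) / baseline


overlap_gain = relative_reduction(no_overlap_s, ep_overlap_s)
delay_wgrad_gain = relative_reduction(ep_overlap_s, ep_overlap_delay_wgrad_s)

print(f"EP overlap vs. no overlap: {overlap_gain:.1%} lower step time")
print(f"delay_wgrad_compute on top of EP overlap: {delay_wgrad_gain:.2%} lower step time")
```

EP overlap removes roughly 24% of the step time, while adding `delay_wgrad_compute` changes it by about 0.35%. A difference that small over a six-iteration window is plausibly within run-to-run noise, which is consistent with not treating delayed wgrad as a separate win.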

Code Anchors

- Overlap validation: `src/megatron/bridge/training/comm_overlap.py`
- Flex dispatcher backend: `src/megatron/bridge/training/flex_dispatcher_backend.py`
- Config: `src/megatron/bridge/training/config.py`
- Unit tests: `tests/unit_tests/training/test_comm_overlap.py`
- DeepEP tests: `tests/unit_tests/training/test_deepep.py`

Pitfalls

- Shared expert overlap conflict: `moe_shared_expert_overlap` and `overlap_moe_expert_parallel_comm` can conflict. Disable shared expert overlap when using the dispatch overlap path.
- PP without VPP: MoE overlap requires VPP when pipeline parallelism is active. Without it, the overlap scheduling cannot interleave correctly.
- Flex != backend flag: `moe_flex_dispatcher_backend="deepep"` alone does nothing if `moe_token_dispatcher_type` is still `"alltoall"`.
- Conservative recipe defaults: Most public recipes leave MoE overlap disabled. You need to explicitly enable it via overrides.
- Performance gains are workload-dependent: overlap helps most when dispatch communication is already a visible slice of step time. It is not guaranteed to help every small or lightly loaded EP run.

Verification

Look for overlap-related log messages during initialization. The comm overlap validation in `comm_overlap.py` will raise if prerequisites are not met, so a clean startup confirms the feature is active.

For a short performance-harness smoke, keep the command shape explicit and vary only one overlap knob at a time:

```bash
uv run python scripts/performance/run_script.py \
  -m qwen \
  -mr qwen3_30b_a3b \
  --task pretrain \
  -g h100 \
  -c bf16 \
  -ng 16 \
  -gn 8 \
  --max_steps 8 \
  --config_variant v1 \
  --cuda_graph_impl none \
  --moe_flex_dispatcher_backend None \
  --moe_a2a_overlap false \
  --tokenizer_type NullTokenizer \
  comm_overlap.overlap_moe_expert_parallel_comm=true \
  comm_overlap.delay_wgrad_compute=false \
  model.moe_shared_expert_overlap=false
```

If fused MoE permutation fails during bring-up, add `model.moe_permute_fusion=false` to separate overlap timing from runtime-stack validation, then retest with the matched production container.
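When comparing short smoke runs, a steady-state mean that skips the warm-up iterations keeps the comparison consistent across variants. The sketch below is one way to compute it, assuming per-iteration step times have already been collected into a list and that iterations are numbered from 1. The `steady_state_mean` helper is hypothetical, and the timings are placeholders, not measurements.

```python
from statistics import mean
from typing import Sequence


def steady_state_mean(step_times_s: Sequence[float],
                      first_iter: int = 3,
                      last_iter: int = 8) -> float:
    """Mean step time over 1-indexed iterations first_iter..last_iter inclusive."""
    window = step_times_s[first_iter - 1:last_iter]
    expected = last_iter - first_iter + 1
    if len(window) != expected:
        raise ValueError(f"need {expected} iterations in the window, got {len(window)}")
    return mean(window)


# Placeholder values for an 8-step run; the first two iterations are warm-up.
example_times = [90.0, 50.0, 31.0, 32.0, 31.0, 32.0, 31.0, 32.0]
result = steady_state_mean(example_times)
print(result)
```

Raising on a short window avoids silently averaging over fewer iterations when a run stops early.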
