perf-moe-comm-overlap

MoE Communication Overlap

For the higher-level overview, see:

@docs/training/communication-overlap.md
@skills/perf-moe-comm-overlap/card.yaml

Quick Decision

Use MoE communication overlap when:

- `EP > 1`
- token dispatch or combine time is visible in the profile
- the run is already correct and you are now tuning throughput

Avoid turning it on as an early bring-up step. It is easier to validate after the dispatcher, routing mode, and recompute plan are already stable.
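
The checklist above can be written as a small gate. This is only a sketch: the function name, its arguments, and the 5% threshold for "visible in the profile" are hypothetical choices for illustration, not values taken from the codebase.

```python
def should_try_moe_overlap(ep_size, comm_fraction, run_is_validated,
                           min_comm_fraction=0.05):
    """Return True when the three quick-decision conditions all hold.

    ep_size:           expert-parallel size (EP)
    comm_fraction:     share of step time spent in token dispatch/combine,
                       read off a profile (0.0 to 1.0)
    run_is_validated:  whether the run is already known to be correct
    min_comm_fraction: hypothetical cut-off for "visible in the profile"
    """
    return (
        ep_size > 1
        and run_is_validated
        and comm_fraction >= min_comm_fraction
    )


# Example: EP=16 with 20% of the step in dispatch/combine on a validated run.
print(should_try_moe_overlap(16, 0.20, True))
```

Pick the threshold from your own profiles; the point is only that all three conditions should hold before the flag is touched.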

Enablement

```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True

# Optional: delayed wgrad for additional overlap
cfg.comm_overlap.delay_wgrad_compute = True

# IMPORTANT: disable shared expert overlap when using dispatch overlap
cfg.model.moe_shared_expert_overlap = False
```

Prerequisites

- `expert_model_parallel_size > 1`
- `num_moe_experts > 1`
- `moe_token_dispatcher_type` must be `"alltoall"` or `"flex"`
- Precision: BF16 or FP16
- If PP is used, VPP (`virtual_pipeline_model_parallel_size`) must be set (non-`None`)
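
As a rough pre-flight check, the prerequisites listed here can be mirrored in a few lines. This is a sketch, not the validation in `comm_overlap.py`: the function name, the `precision` string argument, and the use of a `pipeline_model_parallel_size` attribute to detect PP are assumptions made for illustration.

```python
from types import SimpleNamespace


def moe_overlap_prereq_problems(model_cfg, precision):
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []
    if getattr(model_cfg, "expert_model_parallel_size", 1) <= 1:
        problems.append("expert_model_parallel_size must be > 1")
    if (getattr(model_cfg, "num_moe_experts", None) or 0) <= 1:
        problems.append("num_moe_experts must be > 1")
    if getattr(model_cfg, "moe_token_dispatcher_type", None) not in ("alltoall", "flex"):
        problems.append('moe_token_dispatcher_type must be "alltoall" or "flex"')
    if precision.lower() not in ("bf16", "fp16"):
        problems.append("precision must be BF16 or FP16")
    # Assumed attribute name for the pipeline-parallel size.
    uses_pp = getattr(model_cfg, "pipeline_model_parallel_size", 1) > 1
    if uses_pp and getattr(model_cfg, "virtual_pipeline_model_parallel_size", None) is None:
        problems.append("virtual_pipeline_model_parallel_size must be set when PP is used")
    return problems


# Stand-in config object for the example; a real run would pass its model config.
example = SimpleNamespace(
    expert_model_parallel_size=16,
    num_moe_experts=128,
    moe_token_dispatcher_type="alltoall",
    pipeline_model_parallel_size=2,
    virtual_pipeline_model_parallel_size=None,
)
for problem in moe_overlap_prereq_problems(example, "bf16"):
    print(problem)
```

Returning a list rather than raising on the first failure makes it easier to fix several misconfigurations in one pass.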

Flex dispatcher activation

Setting `moe_flex_dispatcher_backend` alone does not activate flex dispatch. You must also set `moe_token_dispatcher_type = "flex"`.
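
A minimal sketch of that rule, using a stand-in config object: the helper name `flex_dispatch_is_active` is hypothetical, and a real run would set these fields on its own model config rather than on a `SimpleNamespace`.

```python
from types import SimpleNamespace


def flex_dispatch_is_active(model_cfg):
    """Flex dispatch is selected by the dispatcher type, not by the backend flag."""
    return getattr(model_cfg, "moe_token_dispatcher_type", None) == "flex"


# Backend chosen, but the dispatcher type is still "alltoall": flex is not active.
model_cfg = SimpleNamespace(
    moe_token_dispatcher_type="alltoall",
    moe_flex_dispatcher_backend="deepep",
)
print(flex_dispatch_is_active(model_cfg))

# Setting the dispatcher type as well is what actually selects the flex path.
model_cfg.moe_token_dispatcher_type = "flex"
print(flex_dispatch_is_active(model_cfg))
```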

Recompute And CUDA Graph Interaction

- Full recompute is not a good companion for the overlap path.
- `delay_wgrad_compute` adds further constraints if CUDA-graph scopes include attention or MoE-router work.
- In practice, selective recompute is the safer pairing when overlap is enabled.
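
The recompute guidance can be turned into an advisory check. The sketch below assumes the model config exposes a `recompute_granularity` field with values such as `"full"` and `"selective"`; that field name and the helper itself are assumptions for illustration and are not confirmed by this document.

```python
from types import SimpleNamespace


def recompute_advice(model_cfg, overlap_enabled):
    """Return an advisory string, or None when there is nothing to flag."""
    if not overlap_enabled:
        return None
    # Assumed field name; adjust to whatever your config actually exposes.
    granularity = getattr(model_cfg, "recompute_granularity", None)
    if granularity == "full":
        return "full recompute pairs poorly with MoE overlap; prefer selective recompute"
    return None


print(recompute_advice(SimpleNamespace(recompute_granularity="full"), True))
print(recompute_advice(SimpleNamespace(recompute_granularity="selective"), True))
```

The check only advises; whether a given combination is rejected outright is decided by the validation code, not by this sketch.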

Measured Short-Run Caveat

A 2026-05-18 current-main H100 x16 smoke on Qwen3 30B-A3B mock pretraining used `EP=16`, `alltoall`, global batch size 1024, CUDA graphs disabled, and `moe_permute_fusion=false`. Fused permutation was turned off because the PyTorch 25.11 / TE / Triton stack had failed in Transformer Engine fused permutation during prior bring-up.

Results were directional rather than release-grade:

- no EP overlap: 41.25s steady-state mean over iterations 3-8
- EP overlap: 31.31s steady-state mean over iterations 3-8
- EP overlap plus `delay_wgrad_compute`: 31.20s steady-state mean over iterations 3-8

Treat this as evidence that EP overlap can help an inter-node `alltoall` MoE shape when communication is exposed. It is not proof that delayed wgrad is a separate win, and it does not validate the fused permutation path. An earlier 2026-05-16 short smoke on the same shape showed the same pattern.
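
To put the three reported means in proportion, the arithmetic below derives the relative changes from them. Only the three timings come from the smoke run; the variable names are illustrative.

```python
# Steady-state means over iterations 3-8, in seconds, as reported for the smoke run.
baseline_s = 41.25        # no EP overlap
overlap_s = 31.31         # EP overlap
overlap_wgrad_s = 31.20   # EP overlap plus delay_wgrad_compute

speedup = baseline_s / overlap_s
reduction_pct = 100 * (baseline_s - overlap_s) / baseline_s
wgrad_delta_pct = 100 * (overlap_s - overlap_wgrad_s) / overlap_s

print(f"EP overlap speedup: {speedup:.2f}x")
print(f"Step-time reduction from EP overlap: {reduction_pct:.1f}%")
print(f"Extra reduction from delayed wgrad: {wgrad_delta_pct:.2f}%")
```

EP overlap accounts for roughly a 24% drop in step time (about 1.32x), while the further change from delayed wgrad is about 0.35%, which is small enough that a six-iteration mean cannot separate it from noise.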

Code Anchors

- Overlap validation: `src/megatron/bridge/training/comm_overlap.py`
- Flex dispatcher backend: `src/megatron/bridge/training/flex_dispatcher_backend.py`
- Config: `src/megatron/bridge/training/config.py`
- Unit tests: `tests/unit_tests/training/test_comm_overlap.py`
- DeepEP tests: `tests/unit_tests/training/test_deepep.py`

Pitfalls

- Shared expert overlap conflict: `moe_shared_expert_overlap` and `overlap_moe_expert_parallel_comm` can conflict. Disable shared expert overlap when using the dispatch overlap path.
- PP without VPP: MoE overlap requires VPP when pipeline parallelism is active. Without it, the overlap scheduling cannot interleave correctly.
- Flex != backend flag: `moe_flex_dispatcher_backend="deepep"` alone does nothing if `moe_token_dispatcher_type` is still `"alltoall"`.
- Conservative recipe defaults: Most public recipes leave MoE overlap disabled. You need to explicitly enable it via overrides.
- Performance gains are workload-dependent: overlap helps most when dispatch communication is already a visible slice of step time. It is not guaranteed to help every small or lightly loaded EP run.

Verification

Look for overlap-related log messages during initialization. The comm overlap validation in `comm_overlap.py` will raise if prerequisites are not met, so a clean startup confirms the feature is active.

For a short performance-harness smoke, keep the command shape explicit and vary only one overlap knob at a time:

```bash
uv run python scripts/performance/run_script.py \
  -m qwen \
  -mr qwen3_30b_a3b \
  --task pretrain \
  -g h100 \
  -c bf16 \
  -ng 16 \
  -gn 8 \
  --max_steps 8 \
  --config_variant v1 \
  --cuda_graph_impl none \
  --moe_flex_dispatcher_backend None \
  --moe_a2a_overlap false \
  --tokenizer_type NullTokenizer \
  comm_overlap.overlap_moe_expert_parallel_comm=true \
  comm_overlap.delay_wgrad_compute=false \
  model.moe_shared_expert_overlap=false
```

If fused MoE permutation fails during bring-up, add `model.moe_permute_fusion=false` to separate overlap timing from runtime-stack validation, then retest with the matched production container.
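
When comparing runs from an 8-step smoke like this, it helps to average only the steady-state iterations, as the 3-8 window used for the measured results does. The helper below is a sketch: the function name is hypothetical, it takes per-iteration times you have already collected (no log format is assumed), and the sample numbers are made up for illustration rather than measured.

```python
from statistics import mean


def steady_state_mean(step_times_s, first_iter=3, last_iter=8):
    """Mean step time over iterations first_iter..last_iter (1-indexed, inclusive).

    Early iterations are skipped because they typically include warm-up cost.
    """
    if first_iter < 1 or last_iter < first_iter:
        raise ValueError("need 1 <= first_iter <= last_iter")
    if len(step_times_s) < last_iter:
        raise ValueError(f"need at least {last_iter} iterations, got {len(step_times_s)}")
    return mean(step_times_s[first_iter - 1:last_iter])


# Illustrative values only: two slow warm-up steps followed by six steady steps.
times = [90.0, 50.0, 31.0, 31.5, 31.2, 31.4, 31.3, 31.6]
print(f"{steady_state_mean(times):.2f}s")
```

Use the same window for every variant so that the comparison changes only the overlap knob, not the averaging.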
