MoE Dispatcher Selection Guide

Stable docs: @docs/training/moe-optimization.md
Card: @skills/perf-moe-dispatcher-selection/card.yaml

Quick Decision

By hardware

| Hardware | First choice | Why |
| --- | --- | --- |
| H100 | DeepEP, if the runtime package is installed | Strong default for cross-node EP on Hopper |
| B200 | DeepEP, if the runtime package is installed | Good first choice unless a platform-specific HybridEP path is available |
| GB200 / GB300 NVL72 | HybridEP, if the runtime package is installed | Best fit for NVLink-domain-aware dispatch and lower memory pressure |
| Unknown or first bring-up | `alltoall` | Easiest path for correctness and debugging |
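
The hardware table can be read as a small lookup with an availability fallback. The sketch below is illustrative only: the function name `first_choice_dispatcher`, the hardware labels used as keys, and the `installed` set are assumptions made for this example and do not come from any library.

```python
# Illustrative lookup for the hardware table; all names here are hypothetical.
FIRST_CHOICE = {
    "H100": "deepep",
    "B200": "deepep",
    "GB200": "hybridep",
    "GB300": "hybridep",
}


def first_choice_dispatcher(hardware, installed):
    """Return the table's first choice, or "alltoall" when it cannot be used.

    `installed` is the set of flex backends whose runtime packages import
    cleanly in the target container.
    """
    choice = FIRST_CHOICE.get(hardware)  # None for unknown hardware
    if choice is None or choice not in installed:
        return "alltoall"
    return choice


print(first_choice_dispatcher("H100", {"deepep"}))
print(first_choice_dispatcher("GB200", set()))
```

Falling straight back to `alltoall` keeps the sketch aligned with the table's "if the runtime package is installed" condition; a real selection script might also try the other flex backend before giving up.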

By EP degree

| EP size | Guidance |
| --- | --- |
| Small EP | Dispatcher choice is usually second-order; start with `alltoall` or DeepEP |
| Medium EP | DeepEP often becomes worthwhile |
| Large EP | HybridEP is usually the best target on NVL72 systems |

Model-Family Patterns

| Workload | Common best path | Notes |
| --- | --- | --- |
| DSV3 at large scale | HybridEP on GB200 or GB300, DeepEP on H100 | Dispatcher choice matters more as EP and PP both grow |
| Qwen3 235B | DeepEP on H100, HybridEP on GB200 | HybridEP usually wins on GB200 and often uses less memory |
| Qwen3 30B | DeepEP | Smaller models still benefit, but the absolute gap is smaller |
| Qwen3-Next | Close race in BF16, HybridEP stronger in FP8 or memory-tight runs | Good reminder to test, not assume |
| MoE VLMs | Start simple, then test HybridEP on GB200-class systems | Vision workloads are sensitive to both memory and host overhead |

Rounded Evidence Summary

Backend availability gate

Do not interpret a dispatcher timing until the container has proven that the selected backend package is available. `--moe_flex_dispatcher_backend None` selects the standard `alltoall` dispatcher, while `deepep` and `hybridep` select `moe_token_dispatcher_type="flex"` and then require their corresponding runtime packages at model construction time. If DeepEP or HybridEP is missing, record the import failure as an environment limitation and treat `alltoall` as the only measured correctness fallback for that run.
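
One way to apply this gate before launching a job is to try the import up front and record the outcome next to the dispatcher settings. This is a minimal sketch under stated assumptions: `resolve_dispatcher` is a hypothetical helper, the `"alltoall"` value for `moe_token_dispatcher_type` is assumed from the description above, and the module names are supplied by the caller because this guide does not state them. The demo uses stand-in module names so that it runs anywhere.

```python
import importlib


def resolve_dispatcher(backend, module_for_backend):
    """Map a flex backend choice to dispatcher settings, gated on importability.

    backend: None, "deepep", or "hybridep".
    module_for_backend: dict of backend name -> top-level module name to import
        (caller-supplied; confirm the real names in the target container).
    """
    if backend is None:
        # Assumed setting for the standard dispatcher.
        return {
            "moe_token_dispatcher_type": "alltoall",
            "moe_flex_dispatcher_backend": None,
            "available": True,
        }
    try:
        importlib.import_module(module_for_backend[backend])
        available = True
    except ImportError:
        # Record as an environment limitation, not as a performance result.
        available = False
    return {
        "moe_token_dispatcher_type": "flex",
        "moe_flex_dispatcher_backend": backend,
        "available": available,
    }


# Stand-in module names for demonstration only; they are not the real packages.
stand_ins = {"deepep": "json", "hybridep": "no_such_module_for_demo"}
for choice in (None, "deepep", "hybridep"):
    print(choice, resolve_dispatcher(choice, stand_ins))
```

Importing the module, rather than only locating it, matches the failure mode described above: a package that is present but cannot load would still fail at model construction.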

Qwen3 30B A3B on H100

A short 2026-05-17 H100 smoke run used Qwen3 30B A3B in BF16 on 16 GPUs with EP=16, the recipe's Transformer Engine CUDA graph scopes (`moe_router`, `moe_preprocess`), and `model.moe_permute_fusion=false` due to a Triton JIT compatibility issue in the run container. The `alltoall` fallback completed five steps with 45.65 s mean step time after warmup, 132.9 mean TFLOP/s/GPU after warmup, final loss 11.44050, and 61.351 GB peak max allocated memory. DeepEP and HybridEP selected the requested flex backend in the dumped configs but failed before the first iteration because the packages were not installed. This confirms the availability gate; it is not a throughput ranking for flex dispatchers on H100.

DSV3 on GB200 or GB300

The broad trend is more important than any single row in the tracker:
  • plain `alltoall` is usually the conservative baseline
  • DeepEP improves on that baseline once EP communication becomes visible
  • HybridEP adds another step up on NVL72 systems, especially after CUDA graphs, routing improvements, and CPU-side cleanup are already in place
In practice, the stack often moves from roughly "low-teens MFU" territory with an untuned baseline into "high-teens to low-20s MFU" territory after the full dispatcher and kernel stack is tuned.

Qwen3 235B on GB200

For Qwen3 235B, the practical ordering is usually:
  1. `alltoall` for initial bring-up
  2. DeepEP if you want a familiar tuned path
  3. HybridEP for the strongest steady-state result on GB200
HybridEP is usually modestly faster than `alltoall` on this workload and often has noticeably better memory headroom.

Qwen3-Next on GB200

This family is a good reminder that dispatcher wins are workload-dependent:
  • in BF16, `alltoall` and HybridEP can be close
  • in FP8 or memory-constrained settings, HybridEP tends to look better
  • pipeline layout and grouped-GEMM changes can matter almost as much as the dispatcher itself

Tuning Parameters

DeepEP

DeepEP is selected by setting `moe_token_dispatcher_type="flex"` and `moe_flex_dispatcher_backend="deepep"`.

    --moe-deepep-num-sms 20

Tune the SM count allocated to DeepEP communication kernels (default 20). The optimal value depends on the workload and EP degree. First confirm that the DeepEP package imports in the target container; a missing package fails during model construction, before any dispatcher timing is available.

HybridEP

HybridEP is selected by setting `moe_token_dispatcher_type="flex"` and `moe_flex_dispatcher_backend="hybridep"`.

    --moe-hybridep-num-sms 16

Tune the SM count allocated to HybridEP communication (default 16). The performance harness uses 32 for HybridEP workloads. Sweep between 16 and 32 for the target hardware. Set `NUM_OF_HYBRID_EP_RANKS_PER_NVLINK_DOMAIN` to match the NVLink domain size of the deployment. If it does not match the actual topology, performance and sometimes correctness will suffer. First confirm that the HybridEP package imports in the target container; a missing package fails during model construction, before any dispatcher timing is available.
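
The 16-to-32 sweep can be enumerated up front so that every run shares the same environment setting. The sketch below only builds the flag and environment pairs; it does not launch anything. The helper name `hybridep_sweep`, the step of 4, and the domain size of 8 in the demo are assumptions for illustration, so substitute the real NVLink domain size of the deployment.

```python
def hybridep_sweep(nvlink_domain_size, start=16, stop=32, step=4):
    """Build one (env, args) pair per SM count to try, inclusive of both ends."""
    env = {"NUM_OF_HYBRID_EP_RANKS_PER_NVLINK_DOMAIN": str(nvlink_domain_size)}
    return [
        (dict(env), ["--moe-hybridep-num-sms", str(sms)])
        for sms in range(start, stop + 1, step)
    ]


# Placeholder domain size; it must match the actual topology in a real run.
for env, args in hybridep_sweep(nvlink_domain_size=8):
    print(env, " ".join(args))
```

Holding the environment variable fixed across the sweep means the SM count is the only thing that changes between runs.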

Routing mode

    --moe-router-force-load-balancing

For performance benchmarking, force-balance routing is the safer default. It usually outperforms dropless routing in large-scale benchmarks and makes results more comparable across dispatcher backends.

Key Interactions

| Feature | Interaction |
| --- | --- |
| CUDA graphs | Best paired with `attn moe_router moe_preprocess` on dropless MoE |
| EP overlap | Helps when dispatcher time is still visible after backend tuning |
| FP8 | Often increases the relative importance of communication and host overhead |
| CPU affinity | Can matter as much as dispatcher choice on GB200 or GB300 |
| Pipeline layout | Poor PP or VPP layout can erase dispatcher gains |

When To Use Each

alltoall

  • first correctness bring-up
  • small EP configurations
  • debugging communication regressions

DeepEP

  • Hopper or B200 deployments
  • cross-node EP is clearly visible in profiles
  • you want a mature intermediate step before testing HybridEP

HybridEP

  • GB200 or GB300 NVL72 systems
  • large EP degrees
  • memory headroom matters in addition to throughput

Pitfalls

  1. Do not compare dispatchers on different stacks: container, routing mode, PP layout, and CUDA-graph scope can move the result as much as the dispatcher.
  2. HybridEP is topology-sensitive: it is not a universal win outside the hardware it was designed for.
  3. Both dispatchers need SM tuning: the defaults for `moe_deepep_num_sms` (20) and `moe_hybridep_num_sms` (16) are reasonable starting points but rarely optimal.
  4. Force-balance and dropless are not interchangeable baselines: keep the routing mode fixed when comparing dispatcher backends.
  5. Memory and throughput can trade off differently by model: Qwen3-style runs may show a smaller speed delta than DSV3, but still justify HybridEP for memory headroom.
  6. Backend import failures are not performance data: if DeepEP or HybridEP is missing from the container, do not compare its failed job against a completed `alltoall` job. Fix the environment first, then rerun the same stack.
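
Pitfalls 1, 4, and 6 can be turned into a quick pre-check on two run records before their timings are compared. This is a sketch, not an existing tool: the record keys (`container`, `routing`, `pp_layout`, `cuda_graph_scope`, `completed`) and both function names are hypothetical and would need to be mapped onto whatever the run logs actually record.

```python
DISPATCHER_KEYS = {"moe_token_dispatcher_type", "moe_flex_dispatcher_backend"}


def stack_differences(run_a, run_b):
    """Return the sorted non-dispatcher keys whose values differ between two runs."""
    keys = (set(run_a) | set(run_b)) - DISPATCHER_KEYS
    return sorted(k for k in keys if run_a.get(k) != run_b.get(k))


def is_fair_comparison(run_a, run_b):
    """Fair only if both runs completed and nothing but the dispatcher changed."""
    both_completed = run_a.get("completed") is True and run_b.get("completed") is True
    return both_completed and not stack_differences(run_a, run_b)


baseline = {
    "container": "image-a",  # hypothetical label
    "routing": "force_balance",
    "pp_layout": "layout-1",  # hypothetical label
    "cuda_graph_scope": "moe_router moe_preprocess",
    "moe_token_dispatcher_type": "alltoall",
    "completed": True,
}
candidate = dict(
    baseline,
    moe_token_dispatcher_type="flex",
    moe_flex_dispatcher_backend="hybridep",
)
print(is_fair_comparison(baseline, candidate))
print(stack_differences(baseline, dict(candidate, routing="dropless")))
```

A run that failed at import time would carry `completed: False` and be rejected by the same check, which is the behavior pitfall 6 asks for.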