nemo-mbridge-perf-moe-long-context
MoE Long-Context Training
Stable docs: @docs/training/moe-optimization.md
Card: @skills/nemo-mbridge-perf-moe-long-context/card.yaml
What Changes At Long Context
Once sequence length moves well past the 4K-class regime, attention memory and
activation residency become the dominant constraints. For MoE models, that
usually means you need some combination of:
- context parallelism
- selective recompute
- lower precision
- CPU offload for optimizer state
- a dispatcher and PP layout that do not waste the smaller remaining DP budget
Rounded Scaling Patterns
DSV3 on H100
The DSV3 long-context runs show a stable pattern:
- selective recompute works better than full recompute once you move past the shortest contexts
- throughput stays in a fairly narrow band from mid-length through very long contexts if CP is increased appropriately
- the trade-off shifts from "memory fit" to "GPU-count feasibility" as CP grows
In other words, long context does not immediately collapse utilization if the
layout is chosen well, but it does consume the DP budget very quickly.
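The DP-budget effect can be made concrete with a little arithmetic. The sketch below assumes the usual Megatron-style relation `world_size = TP * CP * PP * DP`; the GPU count of 1024 is a hypothetical figure chosen for illustration, not a number from the runs described above.

```python
def dp_size(world_size: int, tp: int, cp: int, pp: int) -> int:
    """Return the DP size left over after TP, CP, and PP are assigned.

    Assumes world_size = TP * CP * PP * DP. Returns 0 when the layout
    does not divide the GPU count evenly (treated as infeasible here).
    """
    model_parallel = tp * cp * pp
    if world_size % model_parallel != 0:
        return 0
    return world_size // model_parallel


if __name__ == "__main__":
    WORLD_SIZE = 1024  # hypothetical cluster size, for illustration only
    for cp in (1, 2, 4, 8, 16, 32, 64):
        dp = dp_size(WORLD_SIZE, tp=1, cp=cp, pp=8)
        print(f"CP={cp:>2} -> DP={dp}")
```

Each doubling of CP halves the DP that remains, which is why very long contexts run into GPU-count limits before they run into utilization limits.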
Qwen3-Next on GB200
Qwen3-Next behaves more like a memory-sensitive medium-scale model:
- 8K and 32K remain practical with moderate CP
- 64K is possible, but the throughput drop is noticeable and memory becomes much tighter
- pipeline layout and grouped-GEMM improvements matter almost as much as CP
Qwen3 235B on GB200
Qwen3 235B shows that long context can still be efficient on NVL72 systems when
TP, CP, and HybridEP are coordinated. The best 128K-class configurations are
not just "fit-only" recipes; they can remain highly efficient if routing,
parallelism, and recompute are balanced.
CP Sizing Rules Of Thumb
- Start from a 4K shard target: a good first guess is `CP ~= seq_len / 4096`, then round to a practical power-of-two layout.
- Keep DP alive if possible: long-context scaling becomes brittle once CP, EP, TP, and PP together squeeze DP down to the floor.
- Prefer selective recompute: recompute modules such as `mlp`, `up_proj`, `norm`, `moe`, or `moe_act` before reaching for full recompute.
- Avoid SDPA-heavy recompute at very long context: recomputing attention internals can add a lot of work for less memory benefit than recomputing smaller MoE and MLP-side modules.
- Use TP as another lever on NVL72 systems: GB200 and GB300 runs can sometimes trade some CP for TP while still staying efficient.
- Assume GBS will need to shrink: as CP rises and DP falls, you may need to reduce global batch size or accept higher GA.
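The first and last rules of thumb reduce to simple arithmetic. This is a minimal sketch, not a tool from the repository. The function names are hypothetical, rounding *up* to the next power of two is one reasonable reading of "round to a practical power-of-two layout", and the GA calculation assumes the standard relation `GBS = MBS * DP * GA`.

```python
import math


def suggest_cp(seq_len: int, shard_target: int = 4096) -> int:
    """First-guess CP: seq_len / shard_target, rounded up to a power of two."""
    raw = max(1, math.ceil(seq_len / shard_target))
    return 1 << (raw - 1).bit_length()


def grad_accum_steps(gbs: int, mbs: int, dp: int) -> int:
    """Gradient accumulation steps implied by GBS = MBS * DP * GA."""
    if gbs % (mbs * dp) != 0:
        raise ValueError("GBS must be a multiple of MBS * DP")
    return gbs // (mbs * dp)


if __name__ == "__main__":
    for seq_len in (8192, 32768, 131072, 262144):
        print(f"seq_len={seq_len:>6} -> first-guess CP={suggest_cp(seq_len)}")

    # Hypothetical batch sizes: the same GBS needs more GA as DP shrinks.
    for dp in (8, 4, 2):
        print(f"DP={dp} -> GA={grad_accum_steps(gbs=1024, mbs=1, dp=dp)}")
```

The result is only a starting point. On NVL72 systems some of that CP may be traded for TP, so a tuned layout can end up well below the first guess.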
Representative Config Families
DSV3 at 128K on H100
```text
TP=1 CP=32 EP=32 PP=8 VPP=4
Precision: FP8-class
Dispatcher: DeepEP
Recompute: up_proj, norm, moe, mlp
Extra memory help: optimizer CPU offload
```
DSV3 at 256K on H100
```text
TP=1 CP=64 EP=32 PP=8 EDP=2 VPP=4
Precision: FP8-class
Dispatcher: DeepEP
Recompute: up_proj, norm, moe, mlp
Extra memory help: optimizer CPU offload
```
Qwen3 235B at 128K on GB200
```text
TP=4 CP=4 EP=32 PP=4 VPP=12
Precision: BF16 or MXFP8
Dispatcher: HybridEP
Recompute: moe_act, norm
CUDA Graph: attn + moe_router + moe_preprocess
```
Recompute And CUDA Graph Guidance
For long-context MoE training:
- start with selective recompute
- add CUDA graphs only after the shapes and routing path are stable
- keep sequence length and MBS fixed when using CUDA graphs
- if the run depends on highly dynamic batches, prefer eager execution
Useful references:
- @docs/training/activation-recomputation.md
- @skills/nemo-mbridge-perf-cuda-graphs/SKILL.md
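One way to apply the "fixed sequence length and MBS" rule is to check the batch shapes a data pipeline produces before enabling CUDA graphs. The helpers below are hypothetical and framework-independent. They only illustrate the decision and do not call any NeMo or Megatron API.

```python
from typing import Iterable, Tuple

Shape = Tuple[int, int]  # (micro_batch_size, seq_len)


def shapes_are_static(batch_shapes: Iterable[Shape]) -> bool:
    """True only if at least one shape was seen and all shapes are identical."""
    return len(set(batch_shapes)) == 1


def choose_execution_mode(batch_shapes: Iterable[Shape]) -> str:
    """Pick 'cuda_graph' for fixed shapes, otherwise fall back to 'eager'."""
    return "cuda_graph" if shapes_are_static(batch_shapes) else "eager"


if __name__ == "__main__":
    fixed = [(1, 131072)] * 4
    variable = [(1, 131072), (1, 65536), (1, 98304)]
    print(choose_execution_mode(fixed))
    print(choose_execution_mode(variable))
```

Sampling shapes from a dry run of the data loader is usually enough to catch variable-length batches before they reach graph capture.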
Pitfalls
- CP does not replace EP or PP: it adds another dimension; it does not make the others disappear.
- A good 4K baseline can still be a bad long-context baseline: routing mode, recompute choice, and offload strategy often need to change.
- GPU-count feasibility becomes the real constraint: very long context can look fine in a single recipe, then become impossible once EP and PP are added honestly across the full model.
- CUDA graphs need static shapes: variable-length batches and opportunistic padding strategies can silently break the path.
- Container and kernel support matters more at 128K+: long-context paths tend to rely on newer kernels and bug fixes than short-context bring-up does.
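The GPU-count pitfall can be checked before launching anything. The sketch below rests on two labeled assumptions: within each pipeline stage the attention-side group `TP * CP * DP` has the same number of ranks as the expert-side group `ETP * EP * EDP`, and expert tensor parallelism (`ETP`) is 1. The `Layout` class and `min_feasible` function are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from math import gcd
from typing import Tuple


@dataclass(frozen=True)
class Layout:
    tp: int
    cp: int
    ep: int
    pp: int
    etp: int = 1  # assumption: expert tensor parallel size of 1


def min_feasible(layout: Layout) -> Tuple[int, int, int]:
    """Return (dp, edp, gpus) for the smallest layout with whole DP and EDP.

    Assumes TP * CP * DP == ETP * EP * EDP inside each pipeline stage.
    """
    attn = layout.tp * layout.cp
    expert = layout.etp * layout.ep
    stage_ranks = attn * expert // gcd(attn, expert)  # least common multiple
    dp = stage_ranks // attn
    edp = stage_ranks // expert
    return dp, edp, stage_ranks * layout.pp


if __name__ == "__main__":
    recipes = {
        "DSV3 128K": Layout(tp=1, cp=32, ep=32, pp=8),
        "DSV3 256K": Layout(tp=1, cp=64, ep=32, pp=8),
        "Qwen3 235B 128K": Layout(tp=4, cp=4, ep=32, pp=4),
    }
    for name, layout in recipes.items():
        dp, edp, gpus = min_feasible(layout)
        print(f"{name}: DP={dp} EDP={edp} min GPUs={gpus}")
```

Under these assumptions the DSV3 256K layout yields EDP=2, which agrees with the `EDP=2` in the representative config. The minimum GPU count is a floor only, because keeping DP above 1 multiplies it further.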