nemo-mbridge-perf-cuda-graphs
CUDA Graphs
Stable docs: @docs/training/cuda-graphs.md
Card: @skills/nemo-mbridge-perf-cuda-graphs/card.yaml
What It Is
CUDA graphs capture GPU operations once and replay them with minimal
host-driver overhead. Bridge supports two implementations:
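| `cuda_graph_impl` | Mechanism | Scope support |
|---|---|---|
| `"local"` | MCore | `full_iteration` |
| `"transformer_engine"` | TE (per layer) | `attn`, `mlp`, `moe_router`, `moe_preprocess` |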
Quick Decision
Start with TE-scoped graphs for most training workloads, then verify replay
timing against eager on the same dispatcher, layout, and container:
- dense models: `attn`, then optionally `mlp`
- dropless MoE: `attn`, `moe_router`, `moe_preprocess`
- VLMs: the same dropless-MoE scope, but only after the real-data path is stable

Use `local` + `full_iteration` only when you specifically want full-iteration
capture and can satisfy the tighter constraints.

For recompute-heavy workloads:
- TE-scoped graphs pair naturally with selective recompute
- full recompute usually pushes you toward `local` full-iteration graphs or away from graphs entirely
Related docs:
- @docs/training/cuda-graphs.md
- @docs/training/activation-recomputation.md
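
The decision rules above can be condensed into a small lookup. This is a minimal sketch only: `recommend_cuda_graph_settings` and the `model_kind` labels are hypothetical names, not part of Bridge. The returned values mirror the `cuda_graph_impl` and `cuda_graph_scope` settings recommended in this section.

```python
def recommend_cuda_graph_settings(model_kind, full_iteration=False, widen=False):
    """Suggest an (impl, scope) pair following the Quick Decision rules.

    Hypothetical helper: model_kind is "dense", "dropless_moe", or "vlm".
    """
    if full_iteration:
        # Full-iteration capture is only available with the local implementation.
        return "local", ["full_iteration"]
    if model_kind == "dense":
        # Start with attention only; widen to the MLP as an optional second step.
        scope = ["attn", "mlp"] if widen else ["attn"]
        return "transformer_engine", scope
    if model_kind in ("dropless_moe", "vlm"):
        # VLMs reuse the dropless-MoE scope once the real-data path is stable.
        return "transformer_engine", ["attn", "moe_router", "moe_preprocess"]
    raise ValueError(f"unknown model_kind: {model_kind!r}")
```

For example, `recommend_cuda_graph_settings("dense")` returns `("transformer_engine", ["attn"])`. The suggestion is only a starting point; replay timing still has to be checked against eager on the target stack.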
Enablement
Local full-iteration graph
```python
cfg.model.cuda_graph_impl = "local"
cfg.model.cuda_graph_scope = ["full_iteration"]
cfg.model.cuda_graph_warmup_steps = 3
cfg.model.use_te_rng_tracker = True
cfg.rng.te_rng_tracker = True
cfg.rerun_state_machine.check_for_nan_in_loss = False
cfg.ddp.check_for_nan_in_grad = False
```
TE scoped graph (dense model)
```python
cfg.model.cuda_graph_impl = "transformer_engine"
cfg.model.cuda_graph_scope = ["attn"]  # or ["attn", "mlp"]
cfg.model.cuda_graph_warmup_steps = 3
cfg.model.use_te_rng_tracker = True
cfg.rng.te_rng_tracker = True
```
TE scoped graph (MoE model)
```python
cfg.model.cuda_graph_impl = "transformer_engine"
cfg.model.cuda_graph_scope = ["attn", "moe_router", "moe_preprocess"]
cfg.model.cuda_graph_warmup_steps = 3
cfg.model.use_te_rng_tracker = True
cfg.rng.te_rng_tracker = True
```
Performance harness CLI
```bash
uv run python scripts/performance/run_script.py \
  -m qwen \
  -mr qwen3_30b_a3b \
  --task pretrain \
  -g h100 \
  -c bf16 \
  -ng 16 \
  --cuda_graph_impl transformer_engine \
  --cuda_graph_scope attn,moe_router,moe_preprocess \
  ...
```

Valid CLI values live in `scripts/performance/argument_parser.py`:
- `VALID_CUDA_GRAPH_IMPLS`: `["none", "local", "transformer_engine"]`
- `VALID_CUDA_GRAPH_SCOPES`: `["full_iteration", "attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba"]`

The performance harness uses a comma-separated `--cuda_graph_scope` value and
auto-enables `model.use_te_rng_tracker` plus `rng.te_rng_tracker` when
`--cuda_graph_impl` is not `none`.
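
The CLI behavior described above (comma-separated scope, RNG tracker auto-enabled for any implementation other than `none`) can be approximated as follows. This is a sketch, not the harness code: `parse_cuda_graph_scope`, `cuda_graph_overrides`, and the dotted-key dictionary are hypothetical stand-ins. Only the two `VALID_*` lists are copied from this document.

```python
VALID_CUDA_GRAPH_IMPLS = ["none", "local", "transformer_engine"]
VALID_CUDA_GRAPH_SCOPES = [
    "full_iteration", "attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba",
]


def parse_cuda_graph_scope(value):
    """Split a comma-separated scope string and reject unknown entries."""
    scopes = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [s for s in scopes if s not in VALID_CUDA_GRAPH_SCOPES]
    if unknown:
        raise ValueError(f"invalid CUDA graph scope(s): {unknown}")
    return scopes


def cuda_graph_overrides(impl, scope_value=""):
    """Build config overrides as a plain dict (hypothetical stand-in for a recipe)."""
    if impl not in VALID_CUDA_GRAPH_IMPLS:
        raise ValueError(f"invalid CUDA graph impl: {impl!r}")
    overrides = {"model.cuda_graph_impl": impl}
    if impl != "none":
        overrides["model.cuda_graph_scope"] = parse_cuda_graph_scope(scope_value)
        # Both RNG tracker flags are switched on whenever graphs are enabled.
        overrides["model.use_te_rng_tracker"] = True
        overrides["rng.te_rng_tracker"] = True
    return overrides
```

Calling `cuda_graph_overrides("transformer_engine", "attn,moe_router,moe_preprocess")` yields the same four settings as the MoE enablement snippet, minus `cuda_graph_warmup_steps`.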
Required constraints
- `use_te_rng_tracker = True` (enforced in `gpt_provider.py`)
- `full_iteration` scope only with `cuda_graph_impl = "local"`
- `full_iteration` scope requires `check_for_nan_in_loss = False`
- Do not combine `moe` scope and `moe_router` scope
- Tensor shapes must be static (fixed seq_length, fixed micro_batch_size)
- MoE token-dropless routing limits graphable scope to dense modules
- With `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, set `NCCL_GRAPH_REGISTER=0` (MCore enforces for local impl on arch < sm_100; TE impl asserts unconditionally)
- CPU offloading is incompatible with CUDA graphs
- `moe_preprocess` scope requires `moe_router` scope to also be set
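
Several of these constraints are purely about the combination of settings, so they can be checked before a job is launched. The sketch below is a hypothetical preflight helper (`check_cuda_graph_constraints` is not a Bridge or MCore function) that works on plain values and covers only the rules listed above that do not depend on runtime state.

```python
def check_cuda_graph_constraints(
    impl,
    scope,
    *,
    use_te_rng_tracker,
    check_for_nan_in_loss,
    cpu_offloading=False,
):
    """Return a list of violated constraints; an empty list means none were found."""
    problems = []
    if impl == "none":
        return problems  # graphs disabled, nothing to check
    scope = set(scope)
    if not use_te_rng_tracker:
        problems.append("use_te_rng_tracker must be True")
    if "full_iteration" in scope:
        if impl != "local":
            problems.append('full_iteration requires cuda_graph_impl = "local"')
        if check_for_nan_in_loss:
            problems.append("full_iteration requires check_for_nan_in_loss = False")
    if {"moe", "moe_router"} <= scope:
        problems.append("moe and moe_router scopes cannot be combined")
    if "moe_preprocess" in scope and "moe_router" not in scope:
        problems.append("moe_preprocess requires moe_router")
    if cpu_offloading:
        problems.append("CPU offloading is incompatible with CUDA graphs")
    return problems
```

A helper like this does not replace the assertions in Bridge and MCore; it only surfaces the same mistakes earlier and all at once.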
Practical bring-up order
- Stabilize the eager run first.
- Fix sequence length and micro-batch size.
- Enable the narrowest useful graph scope.
- Confirm replay is active and memory is still acceptable.
- Compare eager against graph replay iterations after warmup and capture; do not include the capture step in steady-state timing.
- Only then widen scope or combine with overlap features.
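
The timing comparison in the steps above depends on averaging the same steady-state window in both runs. A minimal sketch, assuming per-iteration step times have already been collected into a list; the function names and the single-capture-step default are illustrative assumptions:

```python
from statistics import mean


def steady_state_mean(step_times, warmup_steps, capture_steps=1):
    """Mean step time after dropping warmup and capture iterations."""
    skip = warmup_steps + capture_steps
    steady = step_times[skip:]
    if not steady:
        raise ValueError("no steady-state iterations left to average")
    return mean(steady)


def relative_change(graph_time, eager_time):
    """Fractional change of graph replay versus eager; positive means slower."""
    return (graph_time - eager_time) / eager_time
```

Apply `steady_state_mean` with the same `warmup_steps` and `capture_steps` to the eager run so that both averages cover the same iterations. For the Qwen3 measurement quoted under Pitfalls, `relative_change(42.00, 41.36)` is about 0.015, meaning replay was roughly 1.5% slower than eager.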
Code Anchors
Bridge config and validation
Line 1524:
```python
# CUDA graph scope validation: check_for_nan_in_loss must be disabled with full_iteration graph
if self.model.cuda_graph_impl == "local" and CudaGraphScope.full_iteration in self.model.cuda_graph_scope:
    assert not self.rerun_state_machine.check_for_nan_in_loss, (
        "check_for_nan_in_loss must be disabled when using full_iteration CUDA graph. "
        "Set rerun_state_machine.check_for_nan_in_loss=False."
    )
if self.model.cuda_graph_impl == "none":
    self.model.cuda_graph_scope = []
```
TE RNG tracker requirement
Line 213:
```python
if self.cuda_graph_impl != "none":
    assert getattr(self, "use_te_rng_tracker", False), (
        "Transformer engine's RNG tracker is required for cudagraphs, it can be "
        "enabled with use_te_rng_tracker=True'."
    )
```
Graph creation and capture in training loop
Line 231:
```python
# Capture CUDA Graphs.
cuda_graph_helper = None
if model_config.cuda_graph_impl == "transformer_engine":
    cuda_graph_helper = TECudaGraphHelper(...)
# ...
if config.model.cuda_graph_impl == "local" and CudaGraphScope.full_iteration in config.model.cuda_graph_scope:
    forward_backward_func = FullCudaGraphWrapper(
        forward_backward_func, cuda_graph_warmup_steps=config.model.cuda_graph_warmup_steps
    )
```
TE graph capture after warmup
Line 338:
```python
# Capture CUDA Graphs after warmup.
if (
    model_config.cuda_graph_impl == "transformer_engine"
    and cuda_graph_helper is not None
    and not cuda_graph_helper.graphs_created()
    and global_state.train_state.step - start_iteration == model_config.cuda_graph_warmup_steps
):
    if model_config.cuda_graph_warmup_steps > 0 and should_toggle_forward_pre_hook:
        disable_forward_pre_hook(model, param_sync=False)
    cuda_graph_helper.create_cudagraphs()
    if model_config.cuda_graph_warmup_steps > 0 and should_toggle_forward_pre_hook:
        enable_forward_pre_hook(model)
        cuda_graph_helper.cuda_graph_set_manual_hooks()
```
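
The capture condition above counts warmup relative to `start_iteration`, not from step 0, which matters when resuming from a checkpoint. The toy model below isolates only that comparison; `should_capture` and `find_capture_step` are hypothetical names, and the real loop's ordering of step increment and capture check is not modeled.

```python
def should_capture(step, start_iteration, warmup_steps, graphs_created):
    """Mirror the step comparison used in the TE capture condition."""
    return (not graphs_created) and (step - start_iteration == warmup_steps)


def find_capture_step(start_iteration, warmup_steps, total_steps):
    """Return the first step at which capture would trigger, or None."""
    graphs_created = False
    for step in range(start_iteration, start_iteration + total_steps):
        if should_capture(step, start_iteration, warmup_steps, graphs_created):
            return step
    return None
```

With `cuda_graph_warmup_steps = 3`, a fresh run triggers at step 3 and a run resumed at iteration 100 triggers at step 103. A run shorter than the warmup never captures, so steady-state timing needs at least a few iterations beyond that point.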
RNG initialization
Line 199:
```python
_set_random_seed(
    rng_config.seed,
    rng_config.data_parallel_random_init,
    rng_config.te_rng_tracker,
    rng_config.inference_rng_tracker,
    use_cudagraphable_rng=(model_config.cuda_graph_impl != "none"),
    pg_collection=pg_collection,
)
```
Delayed wgrad + CUDA graph interaction
Line 522:
```python
cuda_graph_scope = getattr(model_cfg, "cuda_graph_scope", []) or []
# ... scope parsing ...
if wgrad_in_graph_scope:
    assert is_te_min_version("2.12.0"), ...
    assert model_cfg.gradient_accumulation_fusion, ...
    if attn_scope_enabled:
        assert not model_cfg.add_bias_linear and not model_cfg.add_qkv_bias, ...
```
Perf harness override helper
Line 102:
```python
def _set_cuda_graph_overrides(
    recipe, cuda_graph_impl=None, cuda_graph_scope=None
):
    # Sets impl, scope, and auto-enables te_rng_tracker
    ...
```
Graph cleanup
Line 1414:
```python
def _delete_cuda_graphs(cuda_graph_helper):
    # Deletes FullCudaGraphWrapper and TE graph objects to free NCCL buffers
    ...
```
MCore classes (in 3rdparty/Megatron-LM)
- `CudaGraphManager`: `megatron/core/transformer/cuda_graphs.py`
- `TECudaGraphHelper`: `megatron/core/transformer/cuda_graphs.py`
- `FullCudaGraphWrapper`: `megatron/core/full_cuda_graph.py`
- `CudaGraphScope` enum: `megatron/core/transformer/enums.py`
Positive recipe anchors
- `scripts/performance/configs/deepseek/deepseek_workload_base_configs.py`
- `scripts/performance/configs/qwen/qwen3_workload_base_configs.py`
- `scripts/performance/configs/gpt_oss/gpt_oss_workload_base_configs.py`
Tests
| File | Coverage |
|---|---|
| | |
| | |
| | TE autocast with CUDA graphs |
| | End-to-end local and TE graph smoke tests |
| | TE + CUDA graph recipe config |
| | TE + CUDA graph recipe config |
| | VLM CUDA graph settings |
Pitfalls
- TE RNG tracker is mandatory: Setting `cuda_graph_impl` without `use_te_rng_tracker=True` and `rng.te_rng_tracker=True` will assert in the provider.
- `full_iteration` requires NaN checks disabled: The entire fwd+bwd is captured, so loss-NaN checking cannot inspect intermediate values.
- MoE scope restrictions: `moe` scope and `moe_router` scope are mutually exclusive. Token-dropless MoE can only graph `moe_router` and `moe_preprocess`, not the full expert dispatch.
- Memory overhead: CUDA graphs pin all intermediate buffers for the graph's lifetime (no memory reuse). TE scoped graphs add a few GB; full-iteration graphs can increase peak memory by 1.5–2×. `PP > 1` compounds overhead since each stage holds its own graph.
- Delayed wgrad interaction: When `delay_wgrad_compute=True` and attention or MoE router is in `cuda_graph_scope`, additional constraints apply: TE >= 2.12.0, `gradient_accumulation_fusion=True`, and no attention bias.
- Variable-length sequences break graphs: Sequence lengths must be constant across steps. Use padded packed sequences if packing is needed.
- Graph cleanup is required: CUDA graph objects hold NCCL buffer references. Bridge handles this in `_delete_cuda_graphs()` at the end of training, but early exits must call it explicitly.
- Older GPU architectures: On GPUs with compute capability < 10.0 (pre-Blackwell), set `NCCL_GRAPH_REGISTER=0` when using `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. Enforced in MCore `CudaGraphManager` (cuda_graphs.py:1428) and `TECudaGraphHelper` (cuda_graphs.py:1697). The TE impl asserts unconditionally regardless of arch.
- CPU offloading incompatible: CUDA graphs cannot be used with CPU offloading. Enforced in MCore `transformer_config.py:1907`.
- MoE recompute + moe_router scope: MoE recompute is not supported with `moe_router` CUDA graph scope when using `cuda_graph_impl = "transformer_engine"`. Enforced in MCore `transformer_config.py:1977`.
- Layer-level recompute requires `full_iteration` scope: Using `recompute_granularity="full"` with `recompute_num_layers` (recompute N whole transformer layers) is incompatible with TE-scoped graphs. MCore calls this "full" granularity even though you are selecting how many layers; the name refers to recomputing the full layer, not the full model. Any TE-scoped scope (`attn`, `mlp`, `moe_router`, etc.) will assert: `AssertionError: full recompute is only supported with full iteration CUDA graph.` This commonly hits FP8 configs that default to TE-scoped graphs (e.g. `LLAMA3_70B_SFT_CONFIG_H100_FP8_CS_V1` uses `cuda_graph_impl="transformer_engine"`, `cuda_graph_scope="mlp"`). Fix: use submodule recompute (`recompute_granularity="selective"` + `recompute_modules`), disable CUDA graphs, or switch to `local` + `full_iteration`. Enforced in MCore `transformer_config.py:2001-2005`. See also @skills/nemo-mbridge-perf-activation-recompute/SKILL.md.
- Benchmark numbers are workload-specific: graph wins are usually real when host overhead is visible, but the exact gain depends on batch shape, PP depth, recompute, dispatcher backend, and whether the eager baseline was already optimized.
- A successful capture is not a speedup guarantee: On 2026-05-18, Qwen3 30B A3B H100 BF16 pretrain with the all-to-all dispatcher captured TE-scoped `attn,moe_router,moe_preprocess` graphs successfully (`48` graphable layers, about `6.9 s` capture time on rank 0), but replay iterations 5-8 averaged `42.00 s` versus `41.36 s` for eager. Treat scoped graphs as a bring-up candidate and validate on the target stack.
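
The `NCCL_GRAPH_REGISTER` pitfall above is an environment-variable combination, so it can be caught before launch. A minimal preflight sketch, assuming the allocator option is spelled exactly `expandable_segments:True`; `nccl_graph_register_ok` is a hypothetical helper and does not reproduce the arch-dependent logic in MCore:

```python
import os


def nccl_graph_register_ok(env=None):
    """Return False when expandable segments are on but NCCL_GRAPH_REGISTER is not 0."""
    env = os.environ if env is None else env
    alloc_conf = env.get("PYTORCH_CUDA_ALLOC_CONF", "")
    if "expandable_segments:True" not in alloc_conf:
        return True  # the constraint only applies with expandable segments
    return env.get("NCCL_GRAPH_REGISTER") == "0"
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment. Because the TE implementation asserts regardless of architecture, treating a `False` result as a hard error is the safer default.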
Verification
Unit tests
```bash
uv run python -m pytest \
  tests/unit_tests/training/test_config.py -k "cuda_graph" \
  tests/unit_tests/training/test_comm_overlap.py -k "cuda_graph" \
  tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py -k "cuda_graph" -q
```
Functional smoke test (requires GPU)
```bash
uv run python -m pytest \
  tests/functional_tests/recipes/test_llama_recipes_pretrain_cuda_graphs.py -q
```
Success criteria
- Unit tests pass, covering config validation for both `local` and `transformer_engine` implementations.
- Functional test completes training steps with both CUDA graph implementations.
- No NCCL errors or illegal memory access in logs.