nemo-mbridge-perf-memory-tuning

Memory Tuning

Stable docs: @docs/parallelisms.md

Card: @skills/nemo-mbridge-perf-memory-tuning/card.yaml

What It Is

GPU OOM failures during training often stem from memory fragmentation rather than raw capacity. PyTorch's default CUDA allocator can leave unusable gaps between allocations. The single most effective fix is:

```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

This tells PyTorch to use expandable (non-fixed-size) memory segments, which dramatically reduces fragmentation and often eliminates borderline OOM without any model or parallelism changes.

Beyond fragmentation, actual peak memory is determined by:

- Parameter + optimizer state memory: controlled by TP, PP, DP sharding (distributed optimizer, FSDP)
- Activation memory: controlled by activation recompute, sequence length, micro-batch size
- Temporary / workspace memory: CUDA kernels, NCCL buffers, CUDA graphs
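
As a rough illustration of the first item, the sketch below estimates parameter, gradient, and optimizer state memory per GPU. `state_memory_gb` is a hypothetical helper, not part of Bridge, and the byte counts are assumptions (bf16 weights, fp32 gradients, fp32 Adam master weights plus two moments). It ignores FP8 specifics, activations, and workspace memory, so treat the result as an order-of-magnitude check only.

```python
def state_memory_gb(num_params: float, tp: int, pp: int, dp: int,
                    distributed_optimizer: bool = True) -> float:
    """Rough per-GPU parameter + gradient + optimizer state memory in GB.

    Assumed byte counts per parameter (not taken from this document):
      2 bytes   bf16 weights
      4 bytes   fp32 gradients
      12 bytes  fp32 master weights + two Adam moments
    """
    params_per_gpu = num_params / (tp * pp)  # TP and PP both shard the weights
    weight_and_grad_bytes = 2 + 4
    # The distributed optimizer shards optimizer state across DP ranks.
    optimizer_bytes = 12 / dp if distributed_optimizer else 12
    return params_per_gpu * (weight_and_grad_bytes + optimizer_bytes) / 1e9


if __name__ == "__main__":
    # Illustrative numbers only: an 8B-parameter model on TP=2, PP=1, DP=4.
    for dist_opt in (False, True):
        gb = state_memory_gb(8e9, tp=2, pp=1, dp=4, distributed_optimizer=dist_opt)
        print(f"distributed_optimizer={dist_opt}: {gb:.1f} GB")
```

Under these assumptions, the optimizer term is the only one that shrinks with DP, which is why the distributed optimizer helps most when DP is large.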

For configuration planning, use the Bridge theoretical estimator before launching large jobs:

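```python
from megatron.bridge.training.utils.theoretical_memory_utils import estimate_training_memory

# cfg is the run's training config and num_microbatches its gradient
# accumulation steps per global batch; both are assumed to be defined already.
estimate = estimate_training_memory(cfg, num_microbatches=num_microbatches)
```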

The estimator reports the most-loaded GPU shard and separates dense/embedding, routed MoE expert, and activation components. It does not include allocator fragmentation, CUDA/NCCL workspace, CUDA graph buffers, token imbalance, or dispatcher workspace, so validate final configs with runtime memory metrics.

Quick Decision

When a training run OOMs or is close to the memory limit:

1. Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` first. This fixes fragmentation-induced OOM with zero performance cost. Most Slurm launch templates already include it.
2. Add selective activation recompute (`recompute_modules=[core_attn]`) if not already enabled. See @skills/nemo-mbridge-perf-activation-recompute/SKILL.md.
3. Avoid increasing TP as a memory fix: doubling TP dramatically increases NVLink all-reduce volume and often kills throughput (-28% on Llama3 70B).
4. Avoid increasing PP at the cost of DP: halving DP doubles gradient accumulation steps and hurts throughput (~6%).
5. Consider `mlp` recompute if still OOM. Saves ~3 GB but costs ~16% GPU utilization on large dense models (Llama3 70B).
6. CPU offloading is blocked when PP > 1.
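
The ordering above can be sketched as a small decision helper. `RunState` and `next_memory_step` are hypothetical names used only for illustration; they encode the order of the list and nothing more.

```python
from dataclasses import dataclass


@dataclass
class RunState:
    """Hypothetical summary of a run that is at or near the memory limit."""
    expandable_segments: bool
    core_attn_recompute: bool
    mlp_recompute: bool
    pp: int


def next_memory_step(state: RunState) -> str:
    """Return the next step to try, following the order of the list above."""
    if not state.expandable_segments:
        return "set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True"
    if not state.core_attn_recompute:
        return "enable selective recompute of core_attn"
    if not state.mlp_recompute:
        return "consider mlp recompute (saves memory, costs throughput)"
    if state.pp > 1:
        return "CPU offloading is blocked with PP > 1; resize parallelism as a last resort"
    return "CPU offloading is an option because PP == 1"


if __name__ == "__main__":
    state = RunState(expandable_segments=False, core_attn_recompute=True,
                     mlp_recompute=False, pp=4)
    print(next_memory_step(state))
```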

Enablement

Expandable segments (recommended first step)

Set in the job's environment before launching:

```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

In Slurm scripts this is typically placed alongside other env vars:

```bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

No model config changes needed. Zero throughput cost.
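
If a launcher already sets other allocator options, a plain `export` would overwrite them. The sketch below merges `expandable_segments:True` into an existing `PYTORCH_CUDA_ALLOC_CONF` value. `with_expandable_segments` is a hypothetical helper, and it assumes simple comma-separated `key:value` options; values that contain commas, such as bracketed lists, are not handled.

```python
import os


def with_expandable_segments(current: str) -> str:
    """Return an allocator config string with expandable_segments:True set.

    Assumes simple comma-separated key:value options. Any existing
    expandable_segments entry is replaced; other options are preserved.
    """
    options = [opt.strip() for opt in current.split(",") if opt.strip()]
    kept = [opt for opt in options if not opt.startswith("expandable_segments:")]
    return ",".join(kept + ["expandable_segments:True"])


if __name__ == "__main__":
    # Build the environment for a child process, e.g. a training launcher.
    env = dict(os.environ)
    env["PYTORCH_CUDA_ALLOC_CONF"] = with_expandable_segments(
        env.get("PYTORCH_CUDA_ALLOC_CONF", "")
    )
    print(env["PYTORCH_CUDA_ALLOC_CONF"])
```

The variable has to be in the environment before the training process initializes CUDA, so the merged value belongs in the launch environment rather than in code that runs after startup.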

Parallelism resizing

If the model genuinely does not fit (not fragmentation), adjust parallelism:

| Strategy | Memory effect | Throughput cost | Notes |
| --- | --- | --- | --- |
| Increase PP (keeping DP) | Fewer layers per stage | Moderate (~6% if DP halved) | Only if GPU count allows |
| Increase TP | Fewer params per GPU | Severe (-28% on 70B) | Last resort |
| Distributed optimizer | Shards optimizer state across DP ranks | ~1-2% | Recommended for large models |
| FSDP | Shards params + grads + optimizer | Varies | See @skills/nemo-mbridge-perf-megatron-fsdp/SKILL.md |
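
To see why resizing TP or PP on a fixed GPU count costs DP, the arithmetic can be sketched as follows. The helper names are hypothetical, and the sketch assumes DP = world size / (TP x PP) with no context or expert parallelism. The example uses the 32-GPU, GBS=32, MBS=1 setup reported under Measured Results.

```python
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """DP size on a fixed GPU count, assuming only TP and PP split the model."""
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(f"world_size={world_size} not divisible by TP*PP={model_parallel}")
    return world_size // model_parallel


def num_microbatches(gbs: int, mbs: int, dp: int) -> int:
    """Gradient accumulation steps per global batch."""
    if gbs % (mbs * dp) != 0:
        raise ValueError(f"GBS={gbs} not divisible by MBS*DP={mbs * dp}")
    return gbs // (mbs * dp)


if __name__ == "__main__":
    for name, tp, pp in [("baseline", 4, 4), ("more PP", 4, 8), ("more TP", 8, 4)]:
        dp = data_parallel_size(32, tp, pp)
        steps = num_microbatches(gbs=32, mbs=1, dp=dp)
        print(f"{name}: TP={tp} PP={pp} DP={dp} accumulation steps={steps}")
```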

Activation recompute

See @skills/nemo-mbridge-perf-activation-recompute/SKILL.md for full details.

CPU offloading

```python
cfg.model.cpu_offloading = True
```

Incompatible with PP > 1. Only usable when `pipeline_model_parallel_size = 1`.
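
A pre-flight check can catch this before a job is queued. The sketch below mirrors the MCore condition using a hypothetical stand-in config (`OffloadSettings`); the real config object has many more fields, and only the two attribute names shown in this document are assumed.

```python
from dataclasses import dataclass


@dataclass
class OffloadSettings:
    """Hypothetical stand-in holding only the two fields this check needs."""
    cpu_offloading: bool = False
    pipeline_model_parallel_size: int = 1


def cpu_offloading_allowed(settings: OffloadSettings) -> bool:
    """True unless CPU offloading is requested together with PP > 1."""
    return not (settings.cpu_offloading and settings.pipeline_model_parallel_size > 1)


if __name__ == "__main__":
    for pp in (1, 4):
        settings = OffloadSettings(cpu_offloading=True, pipeline_model_parallel_size=pp)
        print(f"PP={pp}: allowed={cpu_offloading_allowed(settings)}")
```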

A Note on VPP

Virtual pipeline parallelism (VPP) is primarily a throughput optimization that reduces pipeline bubble overhead by interleaving smaller model chunks. Its effect on peak memory is minimal — changing VPP does not meaningfully change the total activation, parameter, or optimizer memory on a GPU.

In earlier experiments we incorrectly attributed an OOM fix to VPP tuning (VPP 5→10). The actual fix was `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, which eliminated memory fragmentation. The VPP=10 run actually used slightly more peak memory (60.2 GB vs 58.8 GB) but did not OOM because expandable segments prevented fragmentation.

VPP should be tuned for pipeline bubble reduction (see @docs/parallelisms.md), not as a memory fix.

Compatibility and Constraints

- `expandable_segments:True` is incompatible with `--use-nccl-ub` (NCCL user-buffer registration). See Megatron-FSDP docs.
- When using CUDA graphs with `expandable_segments:True`, set `NCCL_GRAPH_REGISTER=0` (required on pre-Blackwell GPUs, enforced by MCore `CudaGraphManager`).
- CPU offloading requires `pipeline_model_parallel_size = 1`.
- Distributed optimizer requires `use_distributed_optimizer = True` in the optimizer config.
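
The first two constraints can be checked from the launch environment alone. The sketch below is a minimal illustration: `find_conflicts` and its `use_nccl_ub` / `cuda_graphs` arguments are hypothetical, and it does not detect the GPU generation, so it flags a missing `NCCL_GRAPH_REGISTER=0` on any hardware.

```python
def find_conflicts(env: dict, use_nccl_ub: bool, cuda_graphs: bool) -> list:
    """Return human-readable descriptions of conflicting settings."""
    alloc_conf = env.get("PYTORCH_CUDA_ALLOC_CONF", "")
    expandable = "expandable_segments:True" in alloc_conf
    problems = []
    if expandable and use_nccl_ub:
        problems.append("expandable_segments:True is incompatible with --use-nccl-ub")
    if expandable and cuda_graphs and env.get("NCCL_GRAPH_REGISTER") != "0":
        problems.append("set NCCL_GRAPH_REGISTER=0 when using CUDA graphs "
                        "with expandable segments")
    return problems


if __name__ == "__main__":
    env = {"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"}
    for problem in find_conflicts(env, use_nccl_ub=True, cuda_graphs=True):
        print("conflict:", problem)
```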

Measured Results

Llama3 70B SFT on 32x H100 80GB, FP8 (Current Scaling):

- Baseline: TP=4, PP=4, VPP=5, DP=2, MBS=1, GBS=32, seq_len=4096
- Golden GPU utilization: 709.93 TFLOP/s/GPU
- Regression threshold: 5%
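
The "vs Golden" percentages reported in this section follow from the golden value and the threshold above. A minimal sketch of that comparison, with hypothetical helper names:

```python
GOLDEN_TFLOPS = 709.93          # golden value from this section
REGRESSION_THRESHOLD_PCT = 5.0  # regression threshold from this section


def delta_vs_golden_pct(measured: float, golden: float = GOLDEN_TFLOPS) -> float:
    """Relative difference to the golden value, in percent (negative = slower)."""
    return (measured / golden - 1.0) * 100.0


def within_threshold(measured: float, golden: float = GOLDEN_TFLOPS,
                     threshold_pct: float = REGRESSION_THRESHOLD_PCT) -> bool:
    """True if the measured throughput is no more than threshold_pct below golden."""
    return delta_vs_golden_pct(measured, golden) >= -threshold_pct


if __name__ == "__main__":
    for tflops in (704.0, 668.0):
        print(f"{tflops}: {delta_vs_golden_pct(tflops):+.1f}% "
              f"within threshold: {within_threshold(tflops)}")
```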

Strategy comparison: parallelism changes for memory reduction

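| Experiment | TP | PP | VPP | DP | TFLOP/s/GPU | vs Golden | Peak Mem (GB) | Result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 4 | 4 | 5 | 2 | ~704 | -0.8% | 58.8 | OOM (fragmentation) |
| More PP | 4 | 8 | 5 | 1 | 668.0 | -5.9% | 53.2 | Borderline perf |
| More TP | 8 | 4 | 5 | 1 | 508.7 | -28.4% | 50.2 | Severe regression |
| Baseline + expandable_segments | 4 | 4 | 5 | 2 | ~704 | -0.8% | ~59 | Passed |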

Key takeaways:

- `expandable_segments:True` is the winner. The baseline OOM was caused by memory fragmentation, not insufficient capacity. Setting this env var eliminated the OOM with zero throughput cost and no parallelism changes.
- PP=8 works for memory but loses DP (2→1), meaning 32 gradient accumulation steps per batch, which hurts throughput by ~6%.
- TP=8 is catastrophic (-28%) because doubling TP increases all-reduce communication volume proportionally across NVLink, and DP=1 means no micro-batch overlap.

CPU offloading: blocked

| Experiment | offload_layers | Result |
| --- | --- | --- |
| Exp 4 | 2 | Incompatible (PP > 1) |
| Exp 5 | 4 | Incompatible (PP > 1) |
| Exp 6 | 6 | Incompatible (PP > 1) |

ValueError: Currently there is no support for Pipeline parallelism with CPU offloading.

This approach is blocked for any model using PP > 1.

Activation recompute: expensive alternative

Selective activation recompute with `mlp` saved ~3 GB peak memory but cost ~16% GPU utilization on this workload. See @skills/nemo-mbridge-perf-activation-recompute/SKILL.md for full results.

Code Anchors

CPU offloading PP incompatibility (MCore)

        if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
            raise ValueError(
                "Currently there is no support for Pipeline parallelism with CPU offloading"
            )

VPP config and layer divisibility validation (MCore)

            if pipeline_parallel_size and self.virtual_pipeline_model_parallel_size is not None:
                num_layers_per_middle_pipeline_rank = num_layers // pipeline_parallel_size
                if (
                    not num_layers_per_middle_pipeline_rank
                    % self.virtual_pipeline_model_parallel_size
                    == 0
                ):
                    raise ValueError(
                        f"number of layers on each middle pipeline rank:"
                        f"{num_layers_per_middle_pipeline_rank} must be divisible by virtual"
                        f"pipeline parallel degree {self.virtual_pipeline_model_parallel_size}"
                    )
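
The divisibility rule enforced above can be used to list the VPP sizes a configuration accepts. The sketch below is a simplification: `valid_vpp_sizes` is a hypothetical helper, it assumes layers split evenly across pipeline ranks, and the 80-layer count is an assumed value for illustration.

```python
def valid_vpp_sizes(num_layers: int, pp: int) -> list:
    """VPP sizes that evenly divide the number of layers on each pipeline rank."""
    if num_layers % pp != 0:
        raise ValueError(f"num_layers={num_layers} not divisible by PP={pp}")
    layers_per_rank = num_layers // pp
    return [v for v in range(1, layers_per_rank + 1) if layers_per_rank % v == 0]


if __name__ == "__main__":
    # Assumed: an 80-layer model.
    for pp in (4, 8):
        print(f"PP={pp}: valid VPP sizes {valid_vpp_sizes(80, pp)}")
```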

Parallelism docs on interleaved pipeline schedule

To minimize the pipeline bubble, the computation on each GPU can be divided into multiple subsets of layers (referred to as model chunks), rather than a single contiguous block. Enable this by setting `virtual_pipeline_model_parallel_size`:

model_config = GPTModelProvider(
    pipeline_model_parallel_size=4,
    virtual_pipeline_model_parallel_size=2,  # 2 model chunks per pipeline stage
    # ... other model parameters
)

Failure Diagnosis

| Symptom | Cause | Confirm | Fix |
| --- | --- | --- | --- |
| OOM on a single rank despite headroom on others | Memory fragmentation | check if `expandable_segments:True` is set | set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` |
| OOM with `expandable_segments` already set | Genuine capacity limit | check `nvidia-smi` for param/optimizer memory | increase PP, use distributed optimizer, or add recompute |
| Estimated memory exceeds GPU capacity before launch | model state or activations genuinely too large | run `estimate_training_memory` and inspect the largest component | adjust PP/TP/CP/EP, distributed optimizer, or recompute before launching |
| `ValueError: PP + CPU offloading` | using cpu_offloading with PP > 1 | check PP config | disable CPU offloading or set PP=1 |
| `RuntimeError` with `--use-nccl-ub` + expandable segments | NCCL UB incompatible with expandable allocator | check env vars | remove `expandable_segments:True` or disable `--use-nccl-ub` |
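
One way to tell the first two rows apart is to compare how much memory the allocator has reserved with how much is actually allocated to tensors. The sketch below takes the two byte counts as plain numbers; in a PyTorch process they could come from `torch.cuda.memory_reserved()` and `torch.cuda.memory_allocated()`. The helper names and the 10% cut-off are assumptions for illustration, not a calibrated rule.

```python
def allocator_slack_gb(reserved_bytes: int, allocated_bytes: int) -> float:
    """Memory held by the caching allocator but not backing live tensors, in GB."""
    return (reserved_bytes - allocated_bytes) / 1e9


def likely_fragmentation(reserved_bytes: int, allocated_bytes: int,
                         slack_fraction: float = 0.10) -> bool:
    """Heuristic: a large reserved-vs-allocated gap hints at fragmentation.

    slack_fraction is an arbitrary illustrative threshold.
    """
    if reserved_bytes <= 0:
        return False
    return (reserved_bytes - allocated_bytes) / reserved_bytes > slack_fraction


if __name__ == "__main__":
    # Example values only; substitute readings taken near the OOM point.
    reserved, allocated = 70_000_000_000, 58_000_000_000
    print(f"slack: {allocator_slack_gb(reserved, allocated):.1f} GB, "
          f"likely fragmentation: {likely_fragmentation(reserved, allocated)}")
```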

Known Limitations

- CPU offloading is blocked when PP > 1
- Parallelism resizing (TP/PP) often has significant throughput costs
- The theoretical estimator is formula-based and does not replace runtime profiling or CUDA memory reports

Verification

Quick check that `expandable_segments:True` is active:

```python
import os
assert "expandable_segments:True" in os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
```

For Slurm jobs, verify the env var is exported before the training command in the launch script.
