perf-tp-dp-comm-overlap

TP / DP / PP Communication Overlap Skill

For stable background and recommendation level, see:

@docs/training/communication-overlap.md

Enablement

Minimal Bridge override:

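```python
from megatron.bridge.training.comm_overlap import CommOverlapConfig

# `cfg` is assumed to be an existing Bridge training config object.

# TP overlap is gated on TP >= 2 and sequence parallelism.
cfg.model.tensor_model_parallel_size = 4
cfg.model.sequence_parallel = True

# With PP > 1 and VPP > 1, Bridge auto-selects overlap_p2p_comm=True.
cfg.model.pipeline_model_parallel_size = 4
cfg.model.virtual_pipeline_model_parallel_size = 2

cfg.comm_overlap = CommOverlapConfig(
    tp_comm_overlap=True,
)

# DP overlap: gradient reduce and parameter gather.
cfg.ddp.use_distributed_optimizer = True
cfg.ddp.overlap_grad_reduce = True
cfg.ddp.overlap_param_gather = True
```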

Optional TP preset:

```python
from megatron.bridge.training.comm_overlap import userbuffers_bf16_h100_h12288_tp4_mbs1_seqlen2048

cfg.comm_overlap.tp_comm_overlap_cfg = userbuffers_bf16_h100_h12288_tp4_mbs1_seqlen2048
```

Precision knobs belong to mixed precision:

```python
cfg.mixed_precision.grad_reduce_in_fp32 = False
cfg.mixed_precision.fp8_param_gather = False
```

Code Anchors

Bridge overlap gating (anchored at source line 439):

```python
if self.user_comm_overlap_cfg.tp_comm_overlap is True:
    if model_cfg.tensor_model_parallel_size < 2:
        ...
    elif not model_cfg.sequence_parallel:
        ...
    elif not HAVE_TE:
        ...
```
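
The gating above can be mirrored as a small standalone predicate for checking a planned configuration before launch. This is a sketch rather than Bridge code: `tp_overlap_effective` and its parameters are hypothetical names, and it assumes that every elided branch simply turns TP overlap off instead of raising.

```python
def tp_overlap_effective(
    requested: bool,
    tp_size: int,
    sequence_parallel: bool,
    have_te: bool,
) -> bool:
    """Hypothetical helper: would a TP-overlap request survive the gating?"""
    if not requested:
        return False
    # Assumption: each gated branch disables overlap rather than raising.
    if tp_size < 2:
        return False
    if not sequence_parallel:
        return False
    if not have_te:
        return False
    return True


# The request survives only when all three preconditions hold.
print(tp_overlap_effective(True, tp_size=4, sequence_parallel=True, have_te=True))
print(tp_overlap_effective(True, tp_size=4, sequence_parallel=False, have_te=True))
```

Because the real gating is silent, a check of this shape is one way to surface a mismatch between the requested and the effective setting.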

PP overlap selection (anchored at source line 451):

```python
if model_cfg.pipeline_model_parallel_size > 1:
    if vp_size > 1:
        comm_overlap_cfg.overlap_p2p_comm = True
        comm_overlap_cfg.batch_p2p_comm = False
    else:
        comm_overlap_cfg.overlap_p2p_comm = False
        comm_overlap_cfg.batch_p2p_comm = True
```
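
The selection rule in this excerpt can be written as a pure function, which makes the two outcomes easy to see side by side. `P2PChoice` and `select_p2p` are hypothetical names used only for illustration. Treating a missing virtual pipeline size as "no VPP" is an assumption, and the excerpt does not show what happens when PP is 1, so the sketch returns `None` in that case.

```python
from typing import NamedTuple, Optional


class P2PChoice(NamedTuple):
    overlap_p2p_comm: bool
    batch_p2p_comm: bool


def select_p2p(pp_size: int, vp_size: Optional[int]) -> Optional[P2PChoice]:
    """Hypothetical mirror of the PP overlap selection shown in the excerpt."""
    if pp_size <= 1:
        # Not covered by the excerpt; no P2P choice is made here.
        return None
    # Assumption: vp_size=None is treated the same as vp_size=1.
    if vp_size is not None and vp_size > 1:
        return P2PChoice(overlap_p2p_comm=True, batch_p2p_comm=False)
    return P2PChoice(overlap_p2p_comm=False, batch_p2p_comm=True)


print(select_p2p(pp_size=4, vp_size=2))
print(select_p2p(pp_size=4, vp_size=1))
```

In both branches of the excerpt the two flags are set to opposite values: overlapped P2P with interleaving, batched P2P without it.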

DP overlap defaults (anchored at source line 572):

```python
if self.data_parallel_size > 1:
    comm_overlap_cfg.bucket_size = 128 * 1024 * 1024
    comm_overlap_cfg.overlap_grad_reduce = True
    comm_overlap_cfg.overlap_param_gather = True
```

Launch-time env tuning (anchored at source line 570):

```python
executor.env_vars["CUDA_DEVICE_MAX_CONNECTIONS"] = str(cuda_device_max_connections)
...
executor.env_vars["NVTE_FWD_LAYERNORM_SM_MARGIN"] = str(self.layernorm_sm_margin)
executor.env_vars["NVTE_BWD_LAYERNORM_SM_MARGIN"] = str(self.layernorm_sm_margin)
```
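
For launches that do not go through the plugin, the same three variables could be assembled by hand. The sketch below assumes nothing about the plugin beyond the variable names in the excerpt: `build_overlap_env` is a hypothetical helper, the values passed in the example are placeholders rather than recommendations, and the elided lines may set further variables that are not reproduced here.

```python
from typing import Dict


def build_overlap_env(
    cuda_device_max_connections: int,
    layernorm_sm_margin: int,
) -> Dict[str, str]:
    """Hypothetical helper: env vars named in the excerpt, as strings."""
    return {
        "CUDA_DEVICE_MAX_CONNECTIONS": str(cuda_device_max_connections),
        # The excerpt uses one margin value for both directions.
        "NVTE_FWD_LAYERNORM_SM_MARGIN": str(layernorm_sm_margin),
        "NVTE_BWD_LAYERNORM_SM_MARGIN": str(layernorm_sm_margin),
    }


# Placeholder values, chosen only to show the shape of the result.
env = build_overlap_env(cuda_device_max_connections=1, layernorm_sm_margin=16)
for name, value in sorted(env.items()):
    print(f"{name}={value}")
```

Variables of this kind generally need to be in the process environment before CUDA is initialized, which is consistent with them being applied at launch time rather than through `CommOverlapConfig`.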

Pitfalls

- TP overlap silently disables itself if `sequence_parallel=False` or Transformer Engine is unavailable.
- PP overlap is not enabled for all PP cases. Bridge only auto-selects `overlap_p2p_comm=True` when `PP > 1` and `VPP > 1`.
- `bucket_size` is a parameter-count knob, not a byte-size knob.
- `grad_reduce_in_fp32` and `fp8_param_gather` should be set through mixed precision, not as standalone DDP tuning first.
- `CUDA_DEVICE_MAX_CONNECTIONS` and LayerNorm SM margin are launch-time plugin settings, not `CommOverlapConfig` fields.
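
To make the `bucket_size` pitfall concrete, the arithmetic below converts the default of `128 * 1024 * 1024` parameters into bytes for two element widths. It is a back-of-the-envelope sketch: `bucket_bytes` is a hypothetical helper, and the assumption that a bucket's footprint is simply parameter count times element width ignores any padding or alignment the real buffers may apply.

```python
MIB = 1024 * 1024

# Default from the DP overlap excerpt; counts parameters, not bytes.
BUCKET_SIZE_PARAMS = 128 * 1024 * 1024


def bucket_bytes(num_params: int, bytes_per_element: int) -> int:
    """Hypothetical helper: approximate bucket footprint in bytes."""
    return num_params * bytes_per_element


for label, width in (("bf16", 2), ("fp32", 4)):
    size_mib = bucket_bytes(BUCKET_SIZE_PARAMS, width) // MIB
    print(f"{label}: {size_mib} MiB")
# prints:
# bf16: 256 MiB
# fp32: 512 MiB
```

Reading the default as "128 MiB" therefore understates the per-bucket payload by a factor equal to the element width.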

Verification

Use the checked-in overlap unit coverage first:

```bash
uv run python -m pytest tests/unit_tests/training/test_comm_overlap.py -q
```

Optional second check if `nemo_run` is available:

```bash
uv run python -m pytest tests/unit_tests/recipes/test_run_plugins.py -q
```

Success criteria:

- first command reports `26 passed`
- second command validates plugin-owned env wiring when not skipped
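
If the first check is scripted, the pass count can be pulled out of the captured pytest output instead of being read by eye. This is a minimal sketch: `passed_count` is a hypothetical helper, the sample string is an invented summary line in pytest's usual `N passed in …s` shape, and capturing the real output (for example with `subprocess.run`) is left to the caller.

```python
import re
from typing import Optional


def passed_count(pytest_output: str) -> Optional[int]:
    """Hypothetical helper: number of passed tests in a pytest summary."""
    match = re.search(r"(\d+) passed", pytest_output)
    return int(match.group(1)) if match else None


# Invented sample; the timing is illustrative only.
sample = "26 passed in 3.21s"
print(passed_count(sample) == 26)
```

A result of `None` means no summary line was found, which should be treated as a failure rather than as zero passes.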
