nemo-mbridge-perf-tp-dp-comm-overlap

TP / DP / PP Communication Overlap Skill

For stable background and recommendation level, see:
  • @docs/training/communication-overlap.md

Enablement

Minimal Bridge override:
python
from megatron.bridge.training.comm_overlap import CommOverlapConfig

cfg.model.tensor_model_parallel_size = 4
cfg.model.sequence_parallel = True
cfg.model.pipeline_model_parallel_size = 4
cfg.model.virtual_pipeline_model_parallel_size = 2

cfg.comm_overlap = CommOverlapConfig(
    tp_comm_overlap=True,
)

cfg.ddp.use_distributed_optimizer = True
cfg.ddp.overlap_grad_reduce = True
cfg.ddp.overlap_param_gather = True
Optional TP preset:
python
from megatron.bridge.training.comm_overlap import userbuffers_bf16_h100_h12288_tp4_mbs1_seqlen2048

cfg.comm_overlap.tp_comm_overlap_cfg = userbuffers_bf16_h100_h12288_tp4_mbs1_seqlen2048
Precision knobs belong to mixed precision:
python
cfg.mixed_precision.grad_reduce_in_fp32 = False
cfg.mixed_precision.fp8_param_gather = False

Code Anchors

Bridge overlap gating (source line 439):
if self.user_comm_overlap_cfg.tp_comm_overlap is True:
    if model_cfg.tensor_model_parallel_size < 2:
        ...
    elif not model_cfg.sequence_parallel:
        ...
    elif not HAVE_TE:
        ...
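
The gating order above can be sketched as a standalone check. This is a minimal illustration, not Bridge code: `tp_overlap_block_reason` is a hypothetical helper, and the reason strings are made up for the example. Only the three conditions and their order come from the anchor.

```python
from typing import Optional


def tp_overlap_block_reason(
    tp_size: int, sequence_parallel: bool, have_te: bool
) -> Optional[str]:
    """Return why TP overlap would be turned off, or None if it can stay on.

    Hypothetical helper: it mirrors the order of the checks in the anchor,
    but it is not part of Megatron Bridge.
    """
    if tp_size < 2:
        return "tensor_model_parallel_size < 2"
    if not sequence_parallel:
        return "sequence_parallel is False"
    if not have_te:
        return "Transformer Engine is unavailable"
    return None


# The minimal override from the Enablement section passes all three checks.
print(tp_overlap_block_reason(4, True, True))   # None
print(tp_overlap_block_reason(4, False, True))  # sequence_parallel is False
```

Running a check like this before launch makes the first pitfall visible up front instead of discovering later that overlap was turned off.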
PP overlap selection (source line 451):
if model_cfg.pipeline_model_parallel_size > 1:
    if vp_size > 1:
        comm_overlap_cfg.overlap_p2p_comm = True
        comm_overlap_cfg.batch_p2p_comm = False
    else:
        comm_overlap_cfg.overlap_p2p_comm = False
        comm_overlap_cfg.batch_p2p_comm = True
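
The same selection can be written as a pure function for quick reasoning about a planned layout. This is a sketch under assumptions: `select_p2p_flags` is a hypothetical name, treating a missing `vp_size` as 1 is an assumption, and the `pp_size <= 1` case returns `None` because the anchor does not show that branch.

```python
from typing import Optional, Tuple


def select_p2p_flags(
    pp_size: int, vp_size: Optional[int]
) -> Optional[Tuple[bool, bool]]:
    """Return (overlap_p2p_comm, batch_p2p_comm) for a PP/VPP layout.

    Hypothetical helper that mirrors the anchor. Returns None when
    pp_size <= 1, since that branch is not shown in the anchor.
    """
    if pp_size <= 1:
        return None
    vp = vp_size or 1  # assumption: an unset virtual pipeline size means 1
    if vp > 1:
        return (True, False)   # interleaved schedule: overlap P2P
    return (False, True)       # no virtual pipeline: batch P2P instead


print(select_p2p_flags(4, 2))     # (True, False)
print(select_p2p_flags(4, None))  # (False, True)
```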
DP overlap defaults (source line 572):
if self.data_parallel_size > 1:
    comm_overlap_cfg.bucket_size = 128 * 1024 * 1024
    comm_overlap_cfg.overlap_grad_reduce = True
    comm_overlap_cfg.overlap_param_gather = True
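
Whether the `data_parallel_size > 1` branch applies depends on how many GPUs remain after model parallelism. A minimal sketch, assuming the usual Megatron-style decomposition where the world size is the product of the TP, PP, CP, and DP sizes; `data_parallel_size` here is a hypothetical standalone function, not the Bridge attribute:

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """Derive the DP size, assuming world_size = TP * PP * CP * DP."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


# With TP=4 and PP=4 from the Enablement section:
print(data_parallel_size(16, tp=4, pp=4))  # 1 -> the DP defaults are not applied
print(data_parallel_size(64, tp=4, pp=4))  # 4 -> the DP defaults are applied
```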
Launch-time env tuning (source line 570):
executor.env_vars["CUDA_DEVICE_MAX_CONNECTIONS"] = str(cuda_device_max_connections)
...
executor.env_vars["NVTE_FWD_LAYERNORM_SM_MARGIN"] = str(self.layernorm_sm_margin)
executor.env_vars["NVTE_BWD_LAYERNORM_SM_MARGIN"] = str(self.layernorm_sm_margin)
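
Environment variables are strings, which is why the anchor wraps each value in `str(...)`. The sketch below collects the three variables named in the anchor into a plain dict. `overlap_env_vars` is a hypothetical helper, and the numeric values are placeholders rather than recommendations.

```python
from typing import Dict


def overlap_env_vars(
    cuda_device_max_connections: int, layernorm_sm_margin: int
) -> Dict[str, str]:
    """Build the launch-time env vars from the anchor as a str -> str mapping.

    Hypothetical helper; the real values are chosen by the launch plugin.
    """
    margin = str(layernorm_sm_margin)
    return {
        "CUDA_DEVICE_MAX_CONNECTIONS": str(cuda_device_max_connections),
        "NVTE_FWD_LAYERNORM_SM_MARGIN": margin,
        "NVTE_BWD_LAYERNORM_SM_MARGIN": margin,
    }


# Placeholder inputs, chosen only to show the shape of the result.
env = overlap_env_vars(cuda_device_max_connections=1, layernorm_sm_margin=16)
print(sorted(env))
```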

Pitfalls

  1. TP overlap silently disables itself if tensor_model_parallel_size < 2, sequence_parallel=False, or Transformer Engine is unavailable.
  2. PP overlap is not enabled for every PP case. Bridge only auto-selects overlap_p2p_comm=True when PP > 1 and VPP > 1; with PP > 1 and no virtual pipeline it selects batch_p2p_comm=True instead.
  3. bucket_size counts parameters, not bytes.
  4. grad_reduce_in_fp32 and fp8_param_gather should be set through mixed precision rather than tuned first as standalone DDP settings.
  5. CUDA_DEVICE_MAX_CONNECTIONS and the LayerNorm SM margin are launch-time plugin settings, not CommOverlapConfig fields.
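
To make the bucket_size pitfall concrete, the arithmetic below converts the default of 128 * 1024 * 1024 parameters into bytes for different gradient dtypes. It is a rough sketch: `bucket_bytes` and the dtype table are illustrative, and real buffer sizes may differ because of padding or alignment.

```python
# Bytes per element for common gradient dtypes (illustrative table).
BYTES_PER_ELEMENT = {"bf16": 2, "fp16": 2, "fp32": 4}


def bucket_bytes(bucket_size_params: int, grad_dtype: str) -> int:
    """Approximate bytes held by one bucket of `bucket_size_params` parameters."""
    return bucket_size_params * BYTES_PER_ELEMENT[grad_dtype]


default_bucket = 128 * 1024 * 1024  # a parameter count, not a byte count

print(bucket_bytes(default_bucket, "bf16") // 2**20)  # 256 (MiB)
print(bucket_bytes(default_bucket, "fp32") // 2**20)  # 512 (MiB)
```

The same bucket_size therefore moves twice as many bytes per bucket when gradients are reduced in fp32, which is one reason to settle the precision settings before tuning the bucket.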

Verification

Run the checked-in overlap unit tests first:
bash
uv run python -m pytest tests/unit_tests/training/test_comm_overlap.py -q
Optional second check if nemo_run is available:
bash
uv run python -m pytest tests/unit_tests/recipes/test_run_plugins.py -q
Success criteria:
  • first command reports 26 passed
  • second command validates plugin-owned env wiring when not skipped
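
If the check is scripted, the first criterion can be tested against pytest's summary text. This is a minimal sketch: `passed_count` and `overlap_tests_ok` are hypothetical helpers, and the sample strings are illustrative rather than captured from a real run.

```python
import re


def passed_count(summary: str) -> int:
    """Extract N from a pytest summary fragment such as 'N passed'."""
    match = re.search(r"(\d+) passed", summary)
    return int(match.group(1)) if match else 0


def overlap_tests_ok(summary: str, expected_passed: int = 26) -> bool:
    """True when the expected number of tests passed and none failed or errored."""
    lowered = summary.lower()
    has_problem = "failed" in lowered or "error" in lowered
    return passed_count(summary) == expected_passed and not has_problem


# Illustrative summary strings, not output from a real run.
print(overlap_tests_ok("26 passed in 1.00s"))            # True
print(overlap_tests_ok("25 passed, 1 failed in 1.00s"))  # False
```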