perf-tp-dp-comm-overlap
TP / DP / PP Communication Overlap Skill
For stable background and recommendation level, see:
- @docs/training/communication-overlap.md
Enablement
Minimal Bridge override:

```python
from megatron.bridge.training.comm_overlap import CommOverlapConfig

cfg.model.tensor_model_parallel_size = 4
cfg.model.sequence_parallel = True
cfg.model.pipeline_model_parallel_size = 4
cfg.model.virtual_pipeline_model_parallel_size = 2

cfg.comm_overlap = CommOverlapConfig(
    tp_comm_overlap=True,
)

cfg.ddp.use_distributed_optimizer = True
cfg.ddp.overlap_grad_reduce = True
cfg.ddp.overlap_param_gather = True
```

Optional TP preset:

```python
from megatron.bridge.training.comm_overlap import userbuffers_bf16_h100_h12288_tp4_mbs1_seqlen2048

cfg.comm_overlap.tp_comm_overlap_cfg = userbuffers_bf16_h100_h12288_tp4_mbs1_seqlen2048
```

Precision knobs belong to mixed precision:

```python
cfg.mixed_precision.grad_reduce_in_fp32 = False
cfg.mixed_precision.fp8_param_gather = False
```
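As a rough sanity check on the parallel layout used in the override, the sketch below works out how many GPUs one model replica occupies and what data-parallel size is left over. It assumes the usual Megatron relation `world_size = TP * PP * CP * DP` with context parallelism left at 1. The helper name `data_parallel_size` is hypothetical and is not part of Bridge.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """Return the data-parallel size implied by a world size and a model-parallel layout.

    Assumption: world_size = tp * pp * cp * dp (context parallel size defaults to 1).
    """
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel


# TP=4 and PP=4 as in the override: one model replica spans 16 GPUs.
for gpus in (16, 32, 64):
    print(gpus, "GPUs -> DP =", data_parallel_size(gpus, tp=4, pp=4))
```

With exactly 16 GPUs the data-parallel size is 1, so there is no data-parallel traffic for `overlap_grad_reduce` or `overlap_param_gather` to hide. The DP overlap defaults listed under Code Anchors are likewise gated on `data_parallel_size > 1`.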
Code Anchors
Bridge overlap gating (source line 439):

```python
if self.user_comm_overlap_cfg.tp_comm_overlap is True:
    if model_cfg.tensor_model_parallel_size < 2:
        ...
    elif not model_cfg.sequence_parallel:
        ...
    elif not HAVE_TE:
        ...
```

PP overlap selection (source line 451):

```python
if model_cfg.pipeline_model_parallel_size > 1:
    if vp_size > 1:
        comm_overlap_cfg.overlap_p2p_comm = True
        comm_overlap_cfg.batch_p2p_comm = False
    else:
        comm_overlap_cfg.overlap_p2p_comm = False
        comm_overlap_cfg.batch_p2p_comm = True
```

DP overlap defaults (source line 572):

```python
if self.data_parallel_size > 1:
    comm_overlap_cfg.bucket_size = 128 * 1024 * 1024
    comm_overlap_cfg.overlap_grad_reduce = True
    comm_overlap_cfg.overlap_param_gather = True
```

Launch-time env tuning (source line 570):

```python
executor.env_vars["CUDA_DEVICE_MAX_CONNECTIONS"] = str(cuda_device_max_connections)
...
executor.env_vars["NVTE_FWD_LAYERNORM_SM_MARGIN"] = str(self.layernorm_sm_margin)
executor.env_vars["NVTE_BWD_LAYERNORM_SM_MARGIN"] = str(self.layernorm_sm_margin)
```
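The TP, PP, and DP gating rules in these anchors can be mirrored in a small standalone function for reasoning about which overlaps a given layout would end up with. This is a simplified sketch, not Bridge code: `ParallelLayout`, `resolve_overlap`, and the returned dictionary keys are hypothetical stand-ins, and each elided `...` branch in the TP gate is assumed to turn TP overlap off.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParallelLayout:
    """Hypothetical stand-in for the parallelism fields the anchors read."""
    tp: int = 1
    pp: int = 1
    vpp: int = 1
    dp: int = 1
    sequence_parallel: bool = False
    have_te: bool = True  # whether Transformer Engine is importable


def resolve_overlap(layout: ParallelLayout, want_tp_overlap: bool) -> dict:
    """Mirror the TP / PP / DP gating shown in the anchors (simplified)."""
    # Assumption: every elided branch in the TP gate disables TP overlap.
    tp_overlap = (
        want_tp_overlap
        and layout.tp >= 2
        and layout.sequence_parallel
        and layout.have_te
    )

    # PP: overlapped P2P only with virtual pipeline stages, batched P2P otherwise.
    overlap_p2p: Optional[bool]
    batch_p2p: Optional[bool]
    if layout.pp > 1:
        overlap_p2p = layout.vpp > 1
        batch_p2p = not overlap_p2p
    else:
        # The anchor does not show what happens when PP == 1, so leave it unset.
        overlap_p2p = batch_p2p = None

    # DP: the grad-reduce / param-gather defaults apply only when DP > 1.
    dp_overlap = layout.dp > 1

    return {
        "tp_comm_overlap": tp_overlap,
        "overlap_p2p_comm": overlap_p2p,
        "batch_p2p_comm": batch_p2p,
        "dp_overlap_defaults": dp_overlap,
    }


layout = ParallelLayout(tp=4, pp=4, vpp=2, dp=2, sequence_parallel=True)
print(resolve_overlap(layout, want_tp_overlap=True))
```

Flipping `sequence_parallel` to `False` or `vpp` to `1` in the example layout is a quick way to see how a single setting changes the resolved overlap flags.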
Pitfalls
- TP overlap silently disables itself if `sequence_parallel=False` or Transformer Engine is unavailable.
- PP overlap is not enabled for all PP cases. Bridge only auto-selects `overlap_p2p_comm=True` when `PP > 1` and `VPP > 1`.
- `bucket_size` is a parameter-count knob, not a byte-size knob.
- `grad_reduce_in_fp32` and `fp8_param_gather` should be set through the mixed-precision config first, not tuned as standalone DDP settings.
- `CUDA_DEVICE_MAX_CONNECTIONS` and LayerNorm SM margin are launch-time plugin settings, not `CommOverlapConfig` fields.
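To make the `bucket_size` pitfall concrete, the arithmetic below converts the `128 * 1024 * 1024` default from the DP anchor into bytes for two gradient dtypes. It takes the statement that the value counts parameters at face value. The helper `bucket_bytes` and the dtype table are illustrative only and do not exist in Bridge.

```python
# Bytes per gradient element for two common reduction dtypes (illustrative table).
BYTES_PER_ELEMENT = {"bf16": 2, "fp32": 4}


def bucket_bytes(bucket_size_params: int, grad_dtype: str) -> int:
    """Approximate payload of one full bucket, in bytes, for a given gradient dtype."""
    return bucket_size_params * BYTES_PER_ELEMENT[grad_dtype]


bucket_size = 128 * 1024 * 1024  # parameters, not bytes
for dtype in ("bf16", "fp32"):
    mib = bucket_bytes(bucket_size, dtype) / (1024 * 1024)
    print(f"{dtype}: {mib:.0f} MiB per bucket")
```

Reading the default as "128 MiB" would therefore understate the bucket payload by 2x or 4x, depending on whether gradients are reduced in bf16 or fp32.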
Verification
Use the checked-in overlap unit coverage first:

```bash
uv run python -m pytest tests/unit_tests/training/test_comm_overlap.py -q
```

Optional second check if `nemo_run` is available:

```bash
uv run python -m pytest tests/unit_tests/recipes/test_run_plugins.py -q
```

Success criteria:
- first command reports `26 passed`
- second command validates plugin-owned env wiring when not skipped