Guide for selecting and configuring distributed training strategies in NeMo AutoModel, including FSDP2, Megatron FSDP, DDP, and parallelism settings.
npx skill4agent add nvidia/skills nemo-automodel-distributed-training

`MeshContext`, `strategy: fsdp2`, `tp_size`, `pp_size`, `cp_size`, `ep_size`, `pipeline`, `dp_size` = `world_size / (tp_size * pp_size * cp_size)`

distributed:
  strategy: fsdp2
  tp_size: 8
  pp_size: 4
  cp_size: 1
  ep_size: 1
  pipeline:
    pp_schedule: interleaved1f1b
    pp_microbatch_size: 1

`strategy: fsdp2`, `ep_size > 1`, `moe_mesh`, `moe`, `ep_size`, `dp_size * cp_size`, `megatron_fsdp`, `ddp`

distributed:
  strategy: fsdp2
  ep_size: 8
  moe:
    reshard_after_forward: false

`sequence_parallel`, `fsdp2`, `distributed.strategy`

| Strategy | YAML value | Best for |
|---|---|---|
| FSDP2 | `fsdp2` | General use, recommended default. Supports TP, PP, CP, EP, HSDP. |
| MegatronFSDP | `megatron_fsdp` | NVIDIA Megatron-style FSDP. No PP, no EP, no `sequence_parallel`. |
| DDP | `ddp` | Simple data parallelism only. No TP, PP, CP, or EP. |
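
The support matrix above can be turned into a small pre-flight check. This is a minimal sketch, not NeMo AutoModel code: `UNSUPPORTED` and `check_strategy` are hypothetical names, and the restrictions are copied from the table. The table does not mention TP or CP for MegatronFSDP, so the sketch assumes they are allowed.

```python
# Hypothetical pre-flight check; names are illustrative, not NeMo AutoModel API.
# Each strategy maps to the `distributed` keys the table says it does not support.
UNSUPPORTED = {
    "fsdp2": set(),
    "megatron_fsdp": {"pp_size", "ep_size", "sequence_parallel"},
    "ddp": {"tp_size", "pp_size", "cp_size", "ep_size"},
}


def _in_use(value) -> bool:
    """A size counts as used when > 1; a boolean flag when it is True."""
    if isinstance(value, bool):
        return value
    return isinstance(value, int) and value > 1


def check_strategy(cfg: dict) -> list[str]:
    """Return the keys in cfg that the chosen strategy does not support."""
    strategy = cfg.get("strategy", "fsdp2")
    if strategy not in UNSUPPORTED:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [key for key in sorted(UNSUPPORTED[strategy]) if _in_use(cfg.get(key))]


print(check_strategy({"strategy": "ddp", "tp_size": 2, "ep_size": 1}))
# -> ['tp_size']
print(check_strategy({"strategy": "megatron_fsdp", "tp_size": 2, "sequence_parallel": True}))
# -> ['sequence_parallel']
```

An empty list means the requested combination is consistent with the table; it says nothing about whether the sizes fit the available GPUs.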

`fsdp2`, `ddp`, `ep_size > 1`, `moe_mesh`, `cp_size > 1`, `distributed.strategy`, `strategy: fsdp2`, `megatron_fsdp`, `tp_size`, `pp_size`, `pipeline:`, `pp_schedule`, `pp_microbatch_size`, `dp_size`, `none`, `world_size / (tp_size * pp_size * cp_size)`, `moe:`, `MoEParallelizerConfig`, `device_mesh`, `ep_size`, `dp_size * cp_size`, `sequence_parallel`, `distributed`, `parse_distributed_section()` (`recipes/_dist_setup.py`)

distributed:
  strategy: fsdp2 # fsdp2 | megatron_fsdp | ddp
  dp_size: none # auto-calculated from world_size / (tp * pp * cp)
  dp_replicate_size: none # FSDP2-only, for HSDP
  tp_size: 1
  pp_size: 1
  cp_size: 1
  ep_size: 1
  # Strategy-specific flags (forwarded to the strategy dataclass):
  sequence_parallel: false
  activation_checkpointing: false
  defer_fsdp_grad_sync: true # FSDP2 only
  # Sub-configs (optional):
  pipeline:
    pp_schedule: 1f1b
    pp_microbatch_size: 1
    # ... see PipelineConfig fields
  moe:
    reshard_after_forward: false
    # ... see MoEParallelizerConfig fields

`dp_size = world_size / (tp_size * pp_size * cp_size)`

YAML distributed section
-> parse_distributed_section() [recipes/_dist_setup.py]
-> setup_distributed() [recipes/_dist_setup.py]
-> create_device_mesh() [components/distributed/device_mesh.py]
-> MeshContext(...) [components/distributed/mesh.py]
-> instantiate_infrastructure() [_transformers/infrastructure.py]
-> _instantiate_distributed() -> FSDP2Manager / MegatronFSDPManager / DDPManager
-> _instantiate_pipeline() -> AutoPipeline (if pp_size > 1)
-> parallelize_fn -> MoE parallelizer (if ep_size > 1) or PP wrapper
-> apply_model_infrastructure() [_transformers/infrastructure.py]
-> _shard_pp() or _shard_ep_fsdp() (applies sharding to the model)

distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1

`dp_size = world_size`, `fully_shard()`

distributed:
  strategy: fsdp2
  tp_size: 4 # 2, 4, or 8 -- must divide GPUs per node
  sequence_parallel: true

config = FSDP2Config(sequence_parallel=True, tp_plan=my_custom_plan)

distributed:
  strategy: fsdp2
  pp_size: 2
  pipeline:
    pp_schedule: interleaved1f1b # 1f1b, gpipe, interleaved_1f1b, etc.
    pp_microbatch_size: 4
    scale_grads_in_schedule: false

`_pp_plan`, `AutoPipeline`

distributed:
  strategy: fsdp2
  dp_replicate_size: 2 # must divide dp_size

`dp_replicate_size < dp_size`

distributed:
  activation_checkpointing: true

`MeshContext.activation_checkpointing`

distributed:
  defer_fsdp_grad_sync: true # default

MixedPrecisionPolicy(param_dtype=bf16, reduce_dtype=bf16, output_dtype=bf16, cast_forward_inputs=True)

import torch
from torch.distributed.fsdp import MixedPrecisionPolicy
config = FSDP2Config(
    mp_policy=MixedPrecisionPolicy(param_dtype=torch.float16, reduce_dtype=torch.float32),
)

`_pp_plan`, `pp_size > 1`, `pipeline`, `PipelineConfig.pp_schedule`, `1f1b`, `gpipe`, `interleaved_1f1b`, `interleaved1f1b`, `looped_bfs`, `dfs`, `v_schedule`, `zero_bubble`

distributed:
  strategy: fsdp2
  pp_size: 2
  pipeline:
    pp_schedule: interleaved1f1b
    pp_microbatch_size: 4
    scale_grads_in_schedule: false
checkpoint:
  model_save_format: safetensors
  save_consolidated: true

`AutoPipeline.build()`, `pipeline_model()`, `_pp_plan`, `PipelineStage`, `schedule.step()`

distributed:
  strategy: fsdp2
  cp_size: 2 # or 4, 8

`is_causal=True`, `attach_context_parallel_hooks()`, `apply_model_infrastructure()`, `make_cp_batch_and_ctx()`, `context_parallel()`, `torch.distributed.tensor.experimental`, `make_cp_batch_for_te()`, `thd_get_partitioned_indices`, `packed_sequence_size`, `cp_size`, `_shard_thd_chunk_for_te()`

packed_sequence:
  packed_sequence_size: 4096 # 0 = disabled
step_scheduler:
  local_batch_size: 1 # must be 1 for packed sequences

`packed_sequence_size > 0`, `local_batch_size`, `ep_size > 1`, `moe_mesh`, `device_mesh`

distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true

`moe_mesh`, `(pp_size, ep_shard_size, ep_size)`, `("pp", "ep_shard", "ep")`, `dp_cp_size`, `dp_size * cp_size`, `ep_size`

distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true
  moe:
    reshard_after_forward: false
    ignore_router_for_ac: false
    wrap_outer_model: true

`moe`, `MoEParallelizerConfig`, `ep_size > 1`

distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1
  pp_size: 1
  ep_size: 8
  sequence_parallel: false
  activation_checkpointing: true

`megatron_fsdp`, `ep_size > 1`, `pp_size > 1`, `sequence_parallel`, `fsdp2`

| Model size | TP | PP | CP | Strategy |
|---|---|---|---|---|
| < 3B | 1 | 1 | 1 | FSDP2 (DP only) |
| 3-13B | 2-4 | 1 | 1 | FSDP2 + TP |
| 13-70B | 4-8 | 2-4 | 1 | FSDP2 + TP + PP |
| 70B+ | 8 | 4-8 | 1 | FSDP2 + TP + PP |
| Any + long seq (8K+) | as above | as above | 2-8 | add CP |
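
To see what a row of the table leaves for data parallelism, the formula `dp_size = world_size / (tp_size * pp_size * cp_size)` can be applied directly. The sketch below is only an illustration: `derive_dp_size` is a hypothetical helper, the 64-GPU cluster is an example value, and treating a non-divisible world size as an error is an assumption rather than documented library behaviour.

```python
def derive_dp_size(world_size: int, tp_size: int = 1, pp_size: int = 1, cp_size: int = 1) -> int:
    """Hypothetical helper: data-parallel size left after TP, PP and CP are assigned."""
    model_parallel = tp_size * pp_size * cp_size
    if world_size % model_parallel != 0:
        # Assumption: an uneven split is treated as a configuration error.
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tp_size * pp_size * cp_size = {model_parallel}"
        )
    return world_size // model_parallel


# 64 GPUs with TP=8 and PP=4 (a 70B+ style row): 64 / 32
print(derive_dp_size(64, tp_size=8, pp_size=4))             # -> 2
# 64 GPUs with TP=4, PP=2 and CP=2 for long sequences: 64 / 16
print(derive_dp_size(64, tp_size=4, pp_size=2, cp_size=2))  # -> 4
# DP-only small model: every GPU is a data-parallel rank
print(derive_dp_size(8))                                    # -> 8
```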

| Model | TP | PP | EP | Notes |
|---|---|---|---|---|
| Small MoE (<10B total) | 1 | 1 | 8 | EP only |
| Medium MoE (10-30B total) | 1-2 | 1 | 8 | small TP for shared layers |
| Large MoE (100B+ total) | 1-2 | 4+ | 8-64 | PP for depth, EP for experts |
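
Whether an EP setting from the table fits a given cluster depends on the constraint `dp_cp_size % ep_size == 0`, where `dp_cp_size = dp_size * cp_size`. The following sketch checks that constraint and reports the `moe_mesh` dimensions `("pp", "ep_shard", "ep")`. `moe_mesh_shape` is a hypothetical helper, and `ep_shard_size = dp_cp_size // ep_size` is an assumption inferred from the dimension names, not something the source states.

```python
def moe_mesh_shape(world_size: int, ep_size: int, tp_size: int = 1,
                   pp_size: int = 1, cp_size: int = 1) -> dict[str, int]:
    """Hypothetical helper: validate ep_size and return the moe_mesh dimensions."""
    model_parallel = tp_size * pp_size * cp_size
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp_size * pp_size * cp_size")
    dp_size = world_size // model_parallel
    dp_cp_size = dp_size * cp_size
    if dp_cp_size % ep_size != 0:
        raise ValueError(
            f"ep_size={ep_size} must divide dp_size * cp_size = {dp_cp_size}"
        )
    # Assumption: the ep_shard dimension is what remains of dp_cp_size after EP.
    ep_shard_size = dp_cp_size // ep_size
    return {"pp": pp_size, "ep_shard": ep_shard_size, "ep": ep_size}


# Small MoE on one 8-GPU node, EP only
print(moe_mesh_shape(8, ep_size=8))              # -> {'pp': 1, 'ep_shard': 1, 'ep': 8}
# Large MoE on 64 GPUs with PP=4 and EP=8
print(moe_mesh_shape(64, ep_size=8, pp_size=4))  # -> {'pp': 4, 'ep_shard': 2, 'ep': 8}
```

Note that `tp_size` reduces `dp_size` but does not appear as a dimension of the returned shape, which matches the three-dimensional `moe_mesh` named in the source.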

`components/distributed/config.py`, `components/distributed/mesh.py`, `components/distributed/device_mesh.py`, `moe_mesh`, `components/distributed/pipelining/config.py`, `components/moe/config.py`, `recipes/_dist_setup.py`

`_pp_plan`, `validate_hf_model_for_pipeline_support()`, `interleaved_1f1b`, `safetensors`, `save_consolidated: true`, `SDPBackend.MATH`, `dp_size * cp_size`, `dp_cp_size % ep_size == 0`, `pp_size > 1`, `ep_size > 1`, `sequence_parallel`, `MeshContext`, `MixedPrecisionPolicy`, `packed_sequence_size`, `cp_size`, `dp_replicate_size`, `megatron_fsdp`, `ddp`, `ValueError`
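
The `dp_replicate_size` rules that appear in this guide (FSDP2-only, must divide `dp_size`, and the condition `dp_replicate_size < dp_size`) can be sketched as one check. This is a hedged illustration: `hsdp_layout` is a hypothetical helper, treating `dp_replicate_size < dp_size` as a hard requirement is an assumption, and the error messages are invented for the example rather than taken from the library.

```python
def hsdp_layout(strategy: str, dp_size: int, dp_replicate_size: int) -> tuple[int, int]:
    """Hypothetical helper: return (replica groups, ranks sharding within each group)."""
    if strategy != "fsdp2":
        # The config reference marks dp_replicate_size as FSDP2-only.
        raise ValueError(f"dp_replicate_size is not supported with strategy={strategy!r}")
    if dp_size % dp_replicate_size != 0:
        raise ValueError(
            f"dp_replicate_size={dp_replicate_size} must divide dp_size={dp_size}"
        )
    if not dp_replicate_size < dp_size:
        # Assumption: treated here as a requirement for HSDP to be meaningful.
        raise ValueError("dp_replicate_size must be smaller than dp_size")
    return dp_replicate_size, dp_size // dp_replicate_size


# 8 data-parallel ranks split into 2 replica groups of 4 sharding ranks each
print(hsdp_layout("fsdp2", dp_size=8, dp_replicate_size=2))  # -> (2, 4)
```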