Distributed Training in NeMo AutoModel

Purpose

NeMo AutoModel uses PyTorch-native distributed training. All parallelism is orchestrated through a single

MeshContext

object that holds device meshes, strategy configs, and axis names.

Instructions

For conceptual distributed-training questions, answer directly from the quick patterns in this skill without inspecting the repository. Start with the strategy choice, then list only the YAML fields and constraints relevant to the question.

Use direct action verbs in the final answer: recommend the strategy, show the minimal YAML, state the sizing constraint, and name the unsupported strategies. Do not discuss model onboarding, recipes, Slurm, SkyPilot, or checkpointing unless the user asks.

Examples

TP plus PP for a large multi-node model

Recommend

strategy: fsdp2

. Mention

tp_size

pp_size

cp_size

ep_size

, and the

pipeline

sub-config. State that

dp_size

is inferred from

world_size / (tp_size * pp_size * cp_size)

yaml

distributed:
  strategy: fsdp2
  tp_size: 8
  pp_size: 4
  cp_size: 1
  ep_size: 1
  pipeline:
    pp_schedule: interleaved1f1b
    pp_microbatch_size: 1

MoE expert parallelism

Recommend

strategy: fsdp2

with

ep_size > 1

. Say this creates a separate

moe_mesh

; include the

moe

sub-config when relevant; state that

ep_size

must divide

dp_size * cp_size

. Do not recommend

megatron_fsdp

ddp

yaml

distributed:
  strategy: fsdp2
  ep_size: 8
  moe:
    reshard_after_forward: false

MegatronFSDP limitations

Say no for pipeline parallelism, expert parallelism, and

sequence_parallel

. Recommend

fsdp2

for PP, EP, or

sequence_parallel

; mention that DDP is only simple data parallelism.

Strategy Selection

Three strategies are available, selected via the

distributed.strategy

YAML key:

Strategy	YAML value	Best for
FSDP2	`fsdp2`	General use, recommended default. Supports TP, PP, CP, EP, HSDP.
MegatronFSDP	`megatron_fsdp`	NVIDIA Megatron-style FSDP. No PP, no EP, no sequence_parallel.
DDP	`ddp`	Simple data parallelism only. No TP, PP, CP, or EP.

Decision tree:

Single GPU: no distributed config needed (FSDP2Manager skips parallelization when world_size=1).
Multi-GPU single node:
```
fsdp2
```
(default). Use
```
ddp
```
only if you need the simplest possible setup.
Multi-node:
```
fsdp2
```
with appropriate TP/PP sizing.
MoE models with expert parallelism:
```
fsdp2
```
with
```
ep_size > 1
```
(creates a separate
```
moe_mesh
```
).
Large models (70B+):
```
fsdp2
```
with PP + TP.
Long sequences (8K+): add CP (
```
cp_size > 1
```
).

When answering strategy-selection questions, state the chosen

distributed.strategy

first, then enumerate the YAML fields the user must set.

Quick TP + PP answer:

Use
```
strategy: fsdp2
```
; do not use
```
megatron_fsdp
```
when pipeline parallelism is required.
Set
```
tp_size
```
for tensor parallelism and
```
pp_size
```
for pipeline parallelism.
Add a
```
pipeline:
```
sub-config with
```
pp_schedule
```
and
```
pp_microbatch_size
```
.

Leave

dp_size

unset or

none

; it is inferred as

world_size / (tp_size * pp_size * cp_size)

Keep TP inside a fast intra-node domain when possible, and use PP across model depth for 70B+ models.

Quick MoE expert-parallel answer:

Start with
```
strategy: fsdp2
```
and
```
ep_size > 1
```
.
Include a
```
moe:
```
sub-config only when
```
ep_size > 1
```
; it maps to
```
MoEParallelizerConfig
```
.
Expect a separate
```
moe_mesh
```
for expert parallelism in addition to the main
```
device_mesh
```
.
Do not recommend
```
megatron_fsdp
```
or
```
ddp
```
for expert parallelism;
```
megatron_fsdp
```
has no EP support.
Before finishing an MoE EP answer, explicitly state that
```
ep_size
```
must divide
```
dp_size * cp_size
```
and that
```
megatron_fsdp
```
does not support EP, PP, or
```
sequence_parallel
```
.

YAML Config Structure

The

distributed

section in the recipe YAML maps directly to

parse_distributed_section()

recipes/_dist_setup.py

yaml

distributed:
  strategy: fsdp2           # fsdp2 | megatron_fsdp | ddp
  dp_size: none             # auto-calculated from world_size / (tp * pp * cp)
  dp_replicate_size: none   # FSDP2-only, for HSDP
  tp_size: 1
  pp_size: 1
  cp_size: 1
  ep_size: 1

  # Strategy-specific flags (forwarded to the strategy dataclass):
  sequence_parallel: false
  activation_checkpointing: false
  defer_fsdp_grad_sync: true   # FSDP2 only

  # Sub-configs (optional):
  pipeline:
    pp_schedule: 1f1b
    pp_microbatch_size: 1
    # ... see PipelineConfig fields

  moe:
    reshard_after_forward: false
    # ... see MoEParallelizerConfig fields

The

dp_size

is always inferred:

dp_size = world_size / (tp_size * pp_size * cp_size)

Infrastructure Flow

YAML distributed section
    -> parse_distributed_section()          [recipes/_dist_setup.py]
    -> setup_distributed()                  [recipes/_dist_setup.py]
        -> create_device_mesh()             [components/distributed/device_mesh.py]
        -> MeshContext(...)                  [components/distributed/mesh.py]
    -> instantiate_infrastructure()         [_transformers/infrastructure.py]
        -> _instantiate_distributed()       -> FSDP2Manager / MegatronFSDPManager / DDPManager
        -> _instantiate_pipeline()          -> AutoPipeline (if pp_size > 1)
        -> parallelize_fn                   -> MoE parallelizer (if ep_size > 1) or PP wrapper
    -> apply_model_infrastructure()         [_transformers/infrastructure.py]
        -> _shard_pp() or _shard_ep_fsdp()  (applies sharding to the model)

FSDP2 Configuration

Basic FSDP2 (data parallelism only)

yaml

distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1

This auto-calculates

dp_size = world_size

and applies

fully_shard()

per transformer block via DTensor-based sharding.

FSDP2 with Tensor Parallelism

Keep TP within a single NVLink domain (typically one node):

yaml

distributed:
  strategy: fsdp2
  tp_size: 4        # 2, 4, or 8 -- must divide GPUs per node
  sequence_parallel: true

The TP plan is auto-selected based on the model type. Pass a custom plan via the Python API if needed:

python

config = FSDP2Config(sequence_parallel=True, tp_plan=my_custom_plan)

FSDP2 with Pipeline Parallelism

yaml

distributed:
  strategy: fsdp2
  pp_size: 2
  pipeline:
    pp_schedule: interleaved1f1b   # 1f1b, gpipe, interleaved_1f1b, etc.
    pp_microbatch_size: 4
    scale_grads_in_schedule: false

The model must have a

_pp_plan

attribute (set on the HF model class) for

AutoPipeline

to know how to split layers across stages. Models without

_pp_plan

are not compatible with PP.

FSDP2 with HSDP (Hybrid Sharded Data Parallel)

Intra-node full sharding + inter-node replication via a 2D DeviceMesh:

yaml

distributed:
  strategy: fsdp2
  dp_replicate_size: 2   # must divide dp_size

Constraint:

dp_replicate_size < dp_size

(pure replication with no sharding is not supported by FSDP2).

Activation Checkpointing

Trades compute for memory by recomputing activations during backward:

yaml

distributed:
  activation_checkpointing: true

This is forwarded to the strategy config for non-EP models, or read from

MeshContext.activation_checkpointing

for EP models.

Gradient Sync Deferral

FSDP2 defers gradient sync to the final micro-batch by default for communication overlap:

yaml

distributed:
  defer_fsdp_grad_sync: true   # default

Mixed Precision

FSDP2Config defaults to bfloat16 for all three precision knobs via

MixedPrecisionPolicy(param_dtype=bf16, reduce_dtype=bf16, output_dtype=bf16, cast_forward_inputs=True)

. Override via the Python API:

python

from torch.distributed.fsdp import MixedPrecisionPolicy
config = FSDP2Config(
    mp_policy=MixedPrecisionPolicy(param_dtype=torch.float16, reduce_dtype=torch.float32),
)

Pipeline Parallelism

Requirements

Model class must define
```
_pp_plan
```
(a dict mapping module FQNs to stages).
```
pp_size > 1
```
in the distributed section.
A
```
pipeline
```
sub-config with schedule and microbatch size.

Supported schedules

Defined in

PipelineConfig.pp_schedule

```
1f1b
```
(one-forward-one-backward, default)
```
gpipe
```
```
interleaved_1f1b
```
/
```
interleaved1f1b
```
```
looped_bfs
```
```
dfs
```
```
v_schedule
```
```
zero_bubble
```

Example (8B model on 8 GPUs, PP=2 + DP=4)

yaml

distributed:
  strategy: fsdp2
  pp_size: 2

  pipeline:
    pp_schedule: interleaved1f1b
    pp_microbatch_size: 4
    scale_grads_in_schedule: false

checkpoint:
  model_save_format: safetensors
  save_consolidated: true

How it works

AutoPipeline.build()

calls

pipeline_model()

which splits the model into stages using the model's

_pp_plan

, creates

PipelineStage

objects, and builds the schedule. During training,

schedule.step()

drives forward and backward through the pipeline.

Context Parallelism

Use CP for long sequences (8K+). CP shards Q/K/V on the sequence dimension as DTensors.

Config

yaml

distributed:
  strategy: fsdp2
  cp_size: 2   # or 4, 8

Requirements

SDPA (Flash Attention or Efficient Attention backend) or Transformer Engine attention. SDPBackend.MATH is not compatible with DTensor.
Attention masks are automatically stripped;
```
is_causal=True
```
is set via forward pre-hooks registered by
```
attach_context_parallel_hooks()
```
.

How it works

After model sharding,
```
apply_model_infrastructure()
```
calls
```
attach_context_parallel_hooks()
```
on each model part (for non-TE models).
At each training step,
```
make_cp_batch_and_ctx()
```
creates a CP context manager that shards the batch along the sequence dimension and sets up
```
context_parallel()
```
from
```
torch.distributed.tensor.experimental
```
.
For TE attention models,
```
make_cp_batch_for_te()
```
uses THD format and TE's
```
thd_get_partitioned_indices
```
for sharding.

CP with Sequence Packing

CP works with packed sequences. The

packed_sequence_size

must be divisible by

cp_size

. When using TE, chunks are sharded per-chunk via

_shard_thd_chunk_for_te()

Sequence Packing

Packing multiple sequences into a single training sample for efficiency.

Config

yaml

packed_sequence:
  packed_sequence_size: 4096   # 0 = disabled

step_scheduler:
  local_batch_size: 1          # must be 1 for packed sequences

When

packed_sequence_size > 0

, the dataset collator packs sequences up to that length.

local_batch_size

must be 1 because each "sample" is already a packed batch.

MoE Distributed Training

Expert Parallelism

Set

ep_size > 1

to distribute experts across GPUs. This creates a separate

moe_mesh

alongside the main

device_mesh

yaml

distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true

The

moe_mesh

shape is

(pp_size, ep_shard_size, ep_size)

with dimension names

("pp", "ep_shard", "ep")

Constraint:

dp_cp_size

dp_size * cp_size

) must be divisible by

ep_size

MoE sub-config

yaml

distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true

  moe:
    reshard_after_forward: false
    ignore_router_for_ac: false
    wrap_outer_model: true

The

moe

sub-section maps to

MoEParallelizerConfig

and is only instantiated when

ep_size > 1

Full MoE example (Qwen3-30B-A3B on 8 GPUs)

yaml

distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1
  pp_size: 1
  ep_size: 8
  sequence_parallel: false
  activation_checkpointing: true

MegatronFSDP limitations

Despite its name,

megatron_fsdp

does not support expert parallelism (

ep_size > 1

), pipeline parallelism (

pp_size > 1

), or

sequence_parallel

. Use

fsdp2

for these features.

Parallelism Sizing Guidelines

Dense models

Model size	TP	PP	CP	Strategy
< 3B	1	1	1	FSDP2 (DP only)
3-13B	2-4	1	1	FSDP2 + TP
13-70B	4-8	2-4	1	FSDP2 + TP + PP
70B+	8	4-8	1	FSDP2 + TP + PP
Any + long seq (8K+)	as above	as above	2-8	add CP

MoE models

MoE models need less TP than dense models of similar total parameter count because only a fraction of parameters are active per token. EP is the primary scaling dimension:

Model	TP	PP	EP	Notes
Small MoE (<10B total)	1	1	8	EP only
Medium MoE (10-30B total)	1-2	1	8	small TP for shared layers
Large MoE (100B+ total)	1-2	4+	8-64	PP for depth, EP for experts

Hardware topology rules

TP must stay within a single NVLink domain (one node, typically 8 GPUs).
Use PP or DP for cross-node scaling.
TP across InfiniBand degrades throughput severely.

Code Anchors

```
components/distributed/config.py
```
: FSDP2Config, MegatronFSDPConfig, DDPConfig.
```
components/distributed/mesh.py
```
: MeshContext, strategy map, and mesh sizes.

components/distributed/device_mesh.py

: device mesh and

moe_mesh

creation.

components/distributed/pipelining/config.py

: PipelineConfig fields.

```
components/moe/config.py
```
: MoEParallelizerConfig and MoEConfig.
```
recipes/_dist_setup.py
```
: YAML parsing and distributed setup.

Pitfalls

TP across nodes destroys throughput. Always keep TP within a single NVLink domain. Use PP or DP for cross-node scaling.
PP requires
_pp_plan
on the model class. Not all HF models have this. Check
```
validate_hf_model_for_pipeline_support()
```
before enabling PP.
PP bubbles reduce GPU utilization. Use interleaved schedules (
```
interleaved_1f1b
```
) and smaller microbatches to reduce bubble time.
FSDP2 requires DTensor-aware state dict saving. Use
```
safetensors
```
with
```
save_consolidated: true
```
for checkpoint compatibility.
CP requires compatible attention. SDPA (Flash Attention or Efficient Attention) or TE attention only.
```
SDPBackend.MATH
```
is not compatible with DTensor.
MoE EP size must evenly divide
dp_size * cp_size
. The device mesh creation asserts
```
dp_cp_size % ep_size == 0
```
.
MegatronFSDP is more limited than FSDP2. It does not support PP (
```
pp_size > 1
```
), EP (
```
ep_size > 1
```
), or
```
sequence_parallel
```
. The
```
MeshContext
```
validation raises on these combinations.
DDP supports nothing beyond data parallelism. No TP, PP, CP, EP, or HSDP. Validation raises on any of these.
Activation checkpointing increases compute. It saves memory by recomputing activations during backward, but adds ~30% compute overhead.
Mixed precision policy must match model expectations. The default bfloat16 policy works for most models. FP16 models may need a custom
```
MixedPrecisionPolicy
```
.
packed_sequence_size
must be divisible by
cp_size
when using CP with packed sequences.

dp_replicate_size
is FSDP2-only. Passing it with

megatron_fsdp

ddp

raises a

ValueError

Verification

Run the smallest recipe that exercises the requested strategy. Success means exit code 0, finite loss, no NCCL timeout, and log output matching the expected TP/PP/CP/EP sizes.

nemo-automodel-distributed-training

NPX Install

Tags

SKILL.md Content

Distributed Training in NeMo AutoModel

Purpose

Instructions

Examples

TP plus PP for a large multi-node model

MoE expert parallelism

MegatronFSDP limitations

Strategy Selection

YAML Config Structure

Infrastructure Flow

FSDP2 Configuration

Basic FSDP2 (data parallelism only)

FSDP2 with Tensor Parallelism

FSDP2 with Pipeline Parallelism

FSDP2 with HSDP (Hybrid Sharded Data Parallel)

Activation Checkpointing

Gradient Sync Deferral

Mixed Precision

Pipeline Parallelism

Requirements

Supported schedules

Example (8B model on 8 GPUs, PP=2 + DP=4)

How it works

Context Parallelism

Config

Requirements

How it works

CP with Sequence Packing

Sequence Packing

Config

MoE Distributed Training

Expert Parallelism

MoE sub-config

Full MoE example (Qwen3-30B-A3B on 8 GPUs)

MegatronFSDP limitations

Parallelism Sizing Guidelines

Dense models

MoE models

Hardware topology rules

Code Anchors

Pitfalls

Verification