Launcher Configuration
NeMo AutoModel supports three launch methods: interactive (torchrun), Slurm (HPC clusters), and SkyPilot (cloud-agnostic).
Instructions
For launcher questions, answer directly from this skill without inspecting the
repository unless the user asks you to edit files. Keep the answer focused on
the relevant launch YAML, required fields, and the expected runtime behavior.
Use these compact answer patterns for common questions:
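- Slurm multi-node: show a YAML slurm block with job_name, nodes,
ntasks_per_node, time, account or partition, container_image, and
optional hf_home, master_port, and env_vars; explain that the launcher
derives WORLD_SIZE = nodes * ntasks_per_node and sets MASTER_ADDR and
MASTER_PORT.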
- SkyPilot spot: show a YAML skypilot block with cloud, accelerators,
num_nodes, use_spot: true, disk_size, region, setup, and env_vars; warn that
spot instances can be preempted, set a short
step_scheduler.checkpoint_interval, and resume with restore_from.
- Nsight Systems on Slurm: show nsys_enabled: true alongside normal
Slurm fields, say the launcher wraps the training command with
nsys profile, and state that it produces a .nsys-rep report file.
Treat profiling as diagnostic-only: use short profiling runs and disable it
for normal production training because it adds overhead and large artifacts.
For Slurm answers, start with this minimal template and then adjust only the
fields the user asked about:
yaml
slurm:
  job_name: llm_finetune
  nodes: 2
  ntasks_per_node: 8
  time: "04:00:00"
  account: my_account
  partition: batch
  container_image: nvcr.io/nvidia/nemo:dev
  hf_home: ~/.cache/huggingface
  master_port: 13742
  env_vars:
    HF_TOKEN: "${HF_TOKEN}"
For Slurm-only questions, do not discuss SkyPilot or profiling unless the user
asks. For profiling questions, say the .nsys-rep report is written in the
Slurm job working or output directory, using the launcher's Nsys output setting
when one is configured.
Routing Boundary
Use this skill only for launch mechanics: interactive execution, Slurm, SkyPilot, containers, mounts, environment variables, rendezvous settings, and profiling.
Do not use this skill for implementing or registering new model architectures, Hugging Face state-dict adapters, model files, or capability flags. Those are model onboarding tasks, not launcher configuration tasks.
Launch Methods
- Interactive (default): runs torchrun on the current node. Suitable for single-node development and debugging.
- Slurm: submits a batch job to an HPC cluster scheduler. Handles multi-node setup, container management, and environment configuration.
- SkyPilot: cloud-agnostic job submission to AWS, GCP, Azure, Lambda, or Kubernetes. Supports spot instances.
Interactive Launch
bash
# Single GPU
automodel finetune llm -c config.yaml
# Multi-GPU (all GPUs on current node)
torchrun --nproc_per_node=8 -m nemo_automodel._cli.app finetune llm -c config.yaml
No additional YAML section is needed for interactive mode. The CLI routes to torchrun automatically when no slurm or skypilot section is present in the config.
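The routing rule above can be sketched as a small function over the parsed config. This is an illustration only, not the CLI's actual code: the function name `select_launcher` is hypothetical, and rejecting a config that contains both sections is an assumption, since the source does not say which one would take precedence.

```python
from typing import Any, Mapping


def select_launcher(config: Mapping[str, Any]) -> str:
    """Pick a launch method from the top-level sections of a parsed YAML config."""
    has_slurm = "slurm" in config
    has_skypilot = "skypilot" in config
    if has_slurm and has_skypilot:
        # Assumption: treat this as an error rather than guessing a precedence.
        raise ValueError("config contains both 'slurm' and 'skypilot' sections")
    if has_slurm:
        return "slurm"
    if has_skypilot:
        return "skypilot"
    # Neither section present: fall back to interactive (torchrun) mode.
    return "interactive"
```

For example, a config with only recipe keys such as `step_scheduler` selects `"interactive"`, while adding a `slurm:` section selects `"slurm"`.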
Slurm Configuration
The SlurmConfig dataclass generates an SBATCH script from a template.
YAML Example
yaml
slurm:
  job_name: llm_finetune
  nodes: 2
  ntasks_per_node: 8
  time: "04:00:00"
  account: my_account
  partition: batch
  container_image: nvcr.io/nvidia/nemo:dev
  hf_home: ~/.cache/huggingface
  extra_mounts:
    - source: /data
      dest: /data
  env_vars:
    WANDB_API_KEY: "${WANDB_API_KEY}"
    HF_TOKEN: "${HF_TOKEN}"
Key Fields
- job_name: Slurm job identifier
- nodes: number of nodes to request
- ntasks_per_node: number of tasks (GPUs) per node
- time: wall-time limit in HH:MM:SS format
- account, partition: Slurm scheduling parameters
- container_image: Enroot/Pyxis container image path
- : mount point for NeMo AutoModel source inside the container
- hf_home: Hugging Face cache directory path
- extra_mounts: list of VolumeMapping(source, dest) for additional container bind mounts
- master_port: port for distributed communication (default 13742)
- env_vars: environment variables passed into the job
- nsys_enabled: when true, wraps the training command with nsys profile for Nsight Systems profiling
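To make the field list concrete, here is a minimal sketch of how such a config could be modeled as dataclasses. It is not the real SlurmConfig from components/launcher/slurm/config.py: the class names `SlurmConfigSketch` and `VolumeMappingSketch`, the helper `slurm_config_from_dict`, and every default other than the documented master_port of 13742 are assumptions, and the source-mount field is left out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Mapping


@dataclass
class VolumeMappingSketch:
    """Illustrative stand-in for VolumeMapping(source, dest)."""
    source: str
    dest: str


@dataclass
class SlurmConfigSketch:
    """Illustrative stand-in for SlurmConfig; the real class may differ."""
    job_name: str
    nodes: int
    ntasks_per_node: int
    time: str
    account: str
    partition: str
    container_image: str
    hf_home: str = "~/.cache/huggingface"  # assumed default
    master_port: int = 13742               # default stated in the field list
    extra_mounts: List[VolumeMappingSketch] = field(default_factory=list)
    env_vars: Dict[str, str] = field(default_factory=dict)
    nsys_enabled: bool = False             # assumed default


def slurm_config_from_dict(section: Mapping[str, Any]) -> SlurmConfigSketch:
    """Build the sketch from an already-parsed `slurm:` YAML section."""
    values = dict(section)
    # Convert each {source, dest} mapping into a typed mount entry.
    mounts = [VolumeMappingSketch(**m) for m in values.pop("extra_mounts", [])]
    return SlurmConfigSketch(extra_mounts=mounts, **values)
```

Modeling the section as a dataclass means an unknown or misspelled key fails at construction time with a TypeError instead of being silently ignored at submission time.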
SkyPilot Configuration
The SkyPilotConfig dataclass defines cloud job parameters.
YAML Example
yaml
skypilot:
  cloud: aws
  accelerators: "H100:8"
  num_nodes: 2
  use_spot: true
  disk_size: 200
  region: us-east-1
  setup: "pip install nemo-automodel"
  env_vars:
    HF_TOKEN: "${HF_TOKEN}"
Key Fields
- cloud: target cloud provider (aws, gcp, azure, lambda, kubernetes)
- accelerators: GPU type and count (e.g., H100:8)
- num_nodes: number of cloud instances
- use_spot: use preemptible/spot instances for cost savings
- disk_size: disk size in GB per node
- region: cloud region for instance placement
- setup: shell commands to run before the training job (e.g., install dependencies)
- env_vars: environment variables for the job
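As a quick sanity check on a SkyPilot section, the accelerators string and num_nodes together determine the total GPU count. The sketch below is a standalone helper, not launcher code: the function names are hypothetical, and it assumes the NAME:COUNT form shown in the example, treating a bare name as a count of 1.

```python
from typing import Tuple


def parse_accelerators(spec: str) -> Tuple[str, int]:
    """Split an accelerators string such as "H100:8" into (name, count)."""
    name, sep, count = spec.partition(":")
    # Assumption: a bare name with no ":COUNT" suffix means one accelerator.
    return name, int(count) if sep else 1


def total_gpus(num_nodes: int, accelerators: str) -> int:
    """Total GPUs across the job: instances times GPUs per instance."""
    _, per_node = parse_accelerators(accelerators)
    return num_nodes * per_node
```

With the example section above (num_nodes: 2, accelerators: "H100:8"), `total_gpus(2, "H100:8")` gives 16 GPUs.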
SkyPilot spot checklist
When using spot or preemptible instances:
- Set use_spot: true in the skypilot section.
- Include cloud, accelerators, num_nodes, disk_size, region, setup, and any required env_vars.
- Use short checkpoint intervals in the recipe, for example via step_scheduler.checkpoint_interval, because spot instances can be preempted.
- Resume from the most recent checkpoint after preemption with the recipe's restore_from setting.
Minimal spot-resume recipe keys:
yaml
step_scheduler:
  checkpoint_interval: 100
restore_from:
  path: /checkpoints/latest
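The checklist can be turned into a small pre-submission check over the parsed config. This is a sketch under assumptions: the function name `spot_config_warnings` is hypothetical, and the 500-step threshold for a "short" interval is an arbitrary illustrative value, not a recommendation from the launcher.

```python
from typing import Any, List, Mapping


def spot_config_warnings(config: Mapping[str, Any], max_interval: int = 500) -> List[str]:
    """Return human-readable warnings for a spot job; empty list if nothing to flag."""
    warnings: List[str] = []
    if not config.get("skypilot", {}).get("use_spot", False):
        return warnings  # Not a spot job, so the checklist does not apply.

    interval = config.get("step_scheduler", {}).get("checkpoint_interval")
    if interval is None:
        warnings.append("step_scheduler.checkpoint_interval is not set")
    elif interval > max_interval:  # max_interval is an assumed threshold
        warnings.append(
            f"checkpoint_interval={interval} may lose a lot of work on preemption"
        )

    if "path" not in config.get("restore_from", {}):
        warnings.append("restore_from.path is not set; the job cannot resume")
    return warnings
```

A config that sets use_spot: true together with the recipe keys shown above produces no warnings.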
Multi-Node Environment
For multi-node training (both Slurm and SkyPilot), the launcher automatically configures:
- MASTER_ADDR: hostname of the first node
- MASTER_PORT: port for rendezvous (default 13742)
- WORLD_SIZE: total number of processes (for Slurm, nodes * ntasks_per_node)
- NCCL environment variables for optimized collective communication
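The derived rendezvous values can be illustrated with a short helper. This is a sketch of the arithmetic only, not the launcher's code: the function name `rendezvous_env` is hypothetical, the first-node hostname is passed in rather than discovered, and NCCL variables are omitted because the source does not list them.

```python
from typing import Dict


def rendezvous_env(
    nodes: int,
    ntasks_per_node: int,
    first_node_hostname: str,
    master_port: int = 13742,
) -> Dict[str, str]:
    """Compute the rendezvous environment for a Slurm-style allocation."""
    return {
        "MASTER_ADDR": first_node_hostname,
        "MASTER_PORT": str(master_port),
        # One process per task, so the world size is nodes times tasks per node.
        "WORLD_SIZE": str(nodes * ntasks_per_node),
    }
```

For the 2-node, 8-task example used throughout this document, WORLD_SIZE comes out as 16.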
Nsys Profiling
Enable Nsight Systems profiling in Slurm jobs:
yaml
slurm:
  job_name: llm_profile
  nodes: 1
  ntasks_per_node: 8
  time: "00:30:00"
  account: my_account
  partition: batch
  container_image: nvcr.io/nvidia/nemo:dev
  nsys_enabled: true
This is a Slurm launcher setting. Normal Slurm fields such as job_name, nodes, ntasks_per_node, time, account or partition, and container_image still apply.
When nsys_enabled is true, the launcher wraps the training command with nsys profile and writes a .nsys-rep report file for performance analysis in the Slurm job working or output directory.
Profiling is diagnostic-only: run it for a short investigation, expect overhead
and large artifacts, and turn it off for normal production training.
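Wrapping a command with the profiler amounts to prefixing its argument list. The sketch below shows the idea using only the `nsys profile -o` output option; the function name `wrap_with_nsys` and the default report name are hypothetical, and the flags the launcher actually passes are not stated in this document.

```python
from typing import List


def wrap_with_nsys(
    command: List[str],
    enabled: bool = True,
    output: str = "profile_report",  # assumed report base name
) -> List[str]:
    """Prefix a training command with `nsys profile` when profiling is enabled."""
    if not enabled:
        # Profiling off: return the command unchanged (as a copy).
        return list(command)
    # `-o` sets the base name of the report file that nsys writes.
    return ["nsys", "profile", "-o", output, *command]
```

Keeping the wrapper conditional on a single flag makes it easy to turn profiling off again for production runs.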
Code Anchors
- components/launcher/slurm/config.py: SlurmConfig dataclass, VolumeMapping
- components/launcher/slurm/template.py: SBATCH script template generation
- components/launcher/slurm/utils.py: Slurm submission utilities
- components/launcher/skypilot/config.py: SkyPilotConfig dataclass
- - CLI entry point and launcher routing logic
Pitfalls
- Port collisions: if the default master_port (13742) is in use by another job on the same node, change it to avoid connection failures.
- Container mounts: the source path in extra_mounts must exist on all nodes in the allocation. Missing paths cause container startup failures.
- Slurm fault tolerance: the fault tolerance plugin is Slurm-specific and does not work with SkyPilot or interactive mode.
- SkyPilot spot preemption: spot instances (use_spot: true) may be preempted by the cloud provider. Enable checkpointing with short intervals to minimize lost work.
- Environment variable syntax: use ${VAR} syntax in YAML for shell variable expansion. Bare variable names will not be expanded.
- Time limit vs async checkpoint: if the Slurm time limit is too short, an in-progress async checkpoint write may be killed before completion, resulting in a corrupted checkpoint. Leave at least 5-10 minutes of margin.
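The time-limit margin from the last pitfall can be computed from the HH:MM:SS string. This is a small planning aid, not part of the launcher: both function names are hypothetical, the 10-minute default is the upper end of the margin suggested above, and only the HH:MM:SS format used in this document is handled.

```python
def slurm_time_to_seconds(limit: str) -> int:
    """Convert an HH:MM:SS wall-time string into seconds."""
    hours, minutes, seconds = (int(part) for part in limit.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def latest_safe_checkpoint_start(limit: str, margin_minutes: int = 10) -> int:
    """Seconds into the job after which a new checkpoint write should not begin."""
    return max(0, slurm_time_to_seconds(limit) - margin_minutes * 60)
```

For the "04:00:00" limit used in the examples, a 10-minute margin means the final checkpoint write should start no later than 13800 seconds (3 h 50 min) into the job.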