SLURM
Remote GPU compute platform for clusters managed by SLURM. Jobs are launched
from the TAO service or SDK host through a login node over SSH, staged on a
shared filesystem, submitted to the scheduler with sbatch, and executed in
containers through Pyxis/Enroot.
Use SLURM when the user has access to a managed GPU cluster, shared Lustre
storage, and scheduler-owned GPU allocation. Do not use SLURM for local files
that exist only on the agent machine; data and outputs must be reachable from
the cluster.
Preflight
bash
# 1. SSH to the login node works without a password prompt
# (only the first host in the comma-separated list is probed here)
SLURM_HOST="${SLURM_HOSTNAME%%,*}"
[ -n "$SLURM_USER" ] && [ -n "$SLURM_HOST" ] || {
    echo "MISSING: set SLURM_USER and SLURM_HOSTNAME (comma-separated for failover) in your env (~/.config/tao/.env)."
    exit 1
}
ssh -o BatchMode=yes -o ConnectTimeout=10 "${SLURM_USER}@${SLURM_HOST}" "true" 2>/dev/null || {
    echo "MISSING: passwordless SSH to ${SLURM_USER}@${SLURM_HOST} not working. See references/ssh-setup.md."
    exit 1
}
# 2. Optional: TAO SDK wrapper for Job handles + S3 wrapping.
# nvidia-tao-sdk is on public PyPI; pin lives in versions.yaml (wheels.tao_sdk_slurm).
PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_slurm)
python -c "import tao_sdk" 2>/dev/null || {
    echo "MISSING: nvidia-tao-sdk not installed. Run:"
    echo " pip install \"$PIN\""
    exit 1
}
If a check fails, the agent prompts the user to authorize the install/fix via Bash.
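The preflight above probes only the first entry of SLURM_HOSTNAME. Since the variable is described as comma-separated for failover, the following is a minimal sketch of how each host could be tried in turn. The function names probe_ssh and first_reachable_host are hypothetical and not part of the TAO SDK. The sketch assumes hostnames contain no whitespace or glob characters.

```bash
# probe_ssh HOST: succeeds when a non-interactive SSH login to HOST works.
# Uses the same ssh options as the preflight check.
probe_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "${SLURM_USER}@$1" true 2>/dev/null
}

# first_reachable_host HOSTS [PROBE]
# HOSTS is a comma-separated list; PROBE is the command used to test each
# host (defaults to probe_ssh). Prints the first host that passes and
# returns 0, or returns 1 when none pass.
first_reachable_host() {
    hosts=$1
    probe=${2:-probe_ssh}
    old_ifs=$IFS
    IFS=,
    set -- $hosts          # split the list on commas into positional params
    IFS=$old_ifs
    for host in "$@"; do
        if "$probe" "$host"; then
            printf '%s\n' "$host"
            return 0
        fi
    done
    return 1
}
```

A caller could then write SLURM_HOST=$(first_reachable_host "$SLURM_HOSTNAME") and fall back to the MISSING message only when that fails.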
A third preflight step applies only to private images: Pyxis on the compute
nodes needs persistent enroot credentials in ~/.config/enroot/.credentials on
the cluster (it does NOT read credentials from the job env). Without them,
auth-gated pulls fail with "Could not process JSON input" at job startup. This
runs once per (cluster, user). The full check and the install pattern, which
keeps the secret out of history, files, and chat output, are documented
separately. Skip this step for public images.
Prerequisites
Before any job is submitted, the host running the TAO service or SDK must be
able to log in to at least one host from SLURM_HOSTNAME over SSH without an
interactive password prompt. The handler runs scheduler commands and log tails
non-interactively, so a password or 2FA prompt fails the job at submit or
status time.
Set this up once per (host, login node, user) tuple: create an SSH keypair,
install the public key on each login host, trust the host key, restrict
private-key permissions to the owner, and verify with a non-interactive login
(ssh -o BatchMode=yes, as in Preflight). See references/ssh-setup.md for the
full step-by-step (including the host alias, the container key-mount note, and
the 2FA fallback). The same file holds the SSH failure remediation prompt to
show the user when passwordless SSH fails.
Credentials
- SLURM_USER (required): SSH username for the login node. In microservices
workspace metadata this is cloud_specific_details.slurm_user.
- SLURM_HOSTNAME (required): Comma-separated login hostnames for failover.
The microservices schema stores this as the list field
cloud_specific_details.slurm_hostname.
- SLURM_PARTITION (required): Partition list for GPU job submission. Ask
for this in the mandatory SLURM intake list. The packaged default is
polar,polar3,polar4,grizzly, all of which are treated as 4-hour queues.
- SSH_KEY_PATH (preferred and expected before launch): Private key path for
non-interactive public-key auth to the login node. If passwordless SSH fails,
ask the user for SSH_KEY_PATH=/path/to/private_key and show the setup steps
in references/ssh-setup.md; do not bury this behind several alternate choices.
- SSH_AUTH_SOCK (advanced fallback): SSH agent socket with an accepted key
already loaded. Prefer SSH_KEY_PATH in user-facing remediation prompts.
- SLURM_BASE_RESULTS_DIR (optional): Base shared filesystem path. The default
convention is /lustre/fsw/portfolios/edgeai/<your-dir>, where <your-dir> is
your per-user directory on the cluster.
- SLURM_ACCOUNT (usually required by site policy): Account charged for the
job.
Do not ask for SLURM_ACCOUNT or SLURM_BASE_RESULTS_DIR in the initial intake
unless the user says their site requires an account, wants a custom results
root, or the workflow cannot proceed without overriding defaults.
Backend Details
Use backend_details.backend_type = "slurm" when routing a job to this
platform. Supported backend details from the microservices schema:
json
{
  "backend_type": "slurm",
  "partition": "polar,polar3,polar4,grizzly",
  "cluster_name": "optional-name"
}
Runtime metadata, such as slurm_job_id, is stored under
backend_details.slurm_metadata. Do not invent these values; they are written
after sbatch returns a scheduler job id.
Storage
SLURM jobs run on the cluster, so local paths from the API host are not valid
dataset paths. Prefer shared filesystem URIs:
- Use shared filesystem URIs for user-provided datasets on Lustre.
- Some paths in microservices metadata are converted to actual Lustre paths
before the container starts.
- Avoid bare local paths and file URIs as dataset locations for SLURM.
Validation rejects local and file paths for remote backends.
Accept either dataset roots or direct spec-key paths:
- Root mode: /lustre/.../<model>/train, which model skills map to the required
annotation file and media path.
- Direct spec mode: exact fields such as
custom.train_dataset.annotation_path=/lustre/.../train.json and
custom.train_dataset.media_path=/lustre/.../videos.tar.gz.
After passwordless SSH succeeds and before generating scripts, validate each
required dataset file/path from the login host:
bash
ssh -o BatchMode=yes <SLURM_USER>@<working-login-host> \
'test -e /lustre/.../annotations.json && test -e /lustre/.../media_or_archive'
If the remote test fails, stop and ask for corrected paths or for the data
to be staged onto shared cluster storage. Do not create runner scripts that will
fail inside the first training job.
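When several dataset paths are required, it can help to check them one by one so the user is told exactly which path is missing. The following is a minimal sketch under stated assumptions: check_remote_paths and remote_exists are hypothetical helper names, SLURM_LOGIN_HOST is a hypothetical variable holding the working login host, and paths are assumed not to contain single quotes.

```bash
# remote_exists PATH: succeeds when PATH exists on the login host.
# SLURM_LOGIN_HOST is a hypothetical variable for the working login host.
remote_exists() {
    ssh -o BatchMode=yes "${SLURM_USER}@${SLURM_LOGIN_HOST}" "test -e '$1'"
}

# check_remote_paths CHECKER PATH...
# Runs CHECKER on every PATH, reports each missing one on stderr, and
# returns 1 if any path is missing (0 when all exist).
check_remote_paths() {
    checker=$1
    shift
    missing=0
    for path in "$@"; do
        if ! "$checker" "$path"; then
            echo "MISSING on cluster: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as check_remote_paths remote_exists followed by the annotation and media paths would then gate script generation on every required path being present.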
Results default to:
text
/lustre/fsw/portfolios/edgeai/<your-dir>/results/<job_id>
<your-dir> is your per-user directory on the cluster.
The runner sets the container's results location to the parent results
directory (without <job_id>) because container code appends the job id when
writing status and artifacts.
Use Lustre, not S3, for SLURM job inputs. SLURM's scheduler enforces a
GPU-idle timeout: a long download at the top of the script can burn the
allocation before training begins, and the scheduler may kill the job.
Stage training data onto Lustre first; S3 / HF / NGC pre-fetch is fine only
for small auxiliary inputs (checkpoints, configs).
Container Execution
The SLURM handler runs TAO containers through Pyxis/Enroot:
- Stage compact JSON files for specs, environment, and cloud metadata under
their respective staging paths.
- Optionally convert the Docker image to a cached SQSH image with
srun -n1 -p <conversion_partition> enroot import.
- Write an sbatch script under <job_dir>/sbatch/job_<job_id>.sbatch.
- Submit with sbatch --export=ALL <script>.
- Run the container with
srun --container-image=<image> --container-mounts=/lustre.
Image formats accepted by the handler:
- docker://registry#image:tag
- ordinary Docker image references, which are converted to Pyxis form when
needed
SQSH conversion is cached by image name, and a cached SQSH image is reused
unless re-conversion is explicitly enabled.
Resource Mapping
- : 1
- : 4
- : 8
- : 16
- : 4
- : 3.8
- : 4
- :
- : true
- : true
When generating launchers or wrapper scripts for SLURM, set the wall-time
defaults explicitly from the packaged platform resource defaults:
bash
export SLURM_TIME_HOURS="${SLURM_TIME_HOURS:-4}"
export SLURM_TIMEOUT_HOURS="${SLURM_TIMEOUT_HOURS:-3.8}"
Do not default to 12 hours on SLURM. If the user supplies a longer wall time
(SLURM_TIME_HOURS), verify that the selected partition supports it before
submitting. For the packaged default partition list
polar,polar3,polar4,grizzly, reject requests above 4 hours and ask for a
different partition only if the user actually wants a longer wall time.
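The 4-hour rule for the default partitions can be enforced before a launcher is generated. The following is a minimal sketch: wall_time_allowed is a hypothetical helper name, and the 4-hour ceiling is taken from the packaged defaults described above. awk is used because the values may be fractional (for example 3.8).

```bash
# wall_time_allowed HOURS [MAX_HOURS]
# Succeeds when HOURS is a positive number no greater than MAX_HOURS
# (default 4, the limit of the packaged default partitions).
# Non-numeric input evaluates to 0 in awk and is therefore rejected.
wall_time_allowed() {
    awk -v h="$1" -v max="${2:-4}" \
        'BEGIN { exit !((h + 0) > 0 && (h + 0) <= (max + 0)) }'
}
```

A wrapper could call wall_time_allowed "$SLURM_TIME_HOURS" and, on failure, ask the user whether they want a different partition rather than submitting a job the scheduler would reject.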
When the requested GPU count is greater than or equal to the number of GPUs
per node, the handler treats the request as exclusive per node and computes
additional nodes from the total GPU count when necessary.
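As an illustration of that node computation, the sketch below uses ceiling division, which is an assumption about how the handler rounds rather than something the source confirms. The function names nodes_needed and is_exclusive are hypothetical.

```bash
# nodes_needed TOTAL_GPUS GPUS_PER_NODE
# Prints the number of nodes required, rounding up (assumed behavior).
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# is_exclusive TOTAL_GPUS GPUS_PER_NODE
# Succeeds when the request fills at least one whole node.
is_exclusive() {
    [ "$1" -ge "$2" ]
}
```

For example, with 8 GPUs per node, a 16-GPU request is exclusive and needs 2 nodes, while a 4-GPU request is not exclusive and fits on 1 node.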
For multi-node jobs, the sbatch script exports the rendezvous environment
variables used by PyTorch distributed, and Cosmos-RL has special multi-node
role handling for controller, policy, and rollout workers. The full sbatch
directives, the rendezvous env-var table and contract, and cluster
requirements are covered in the multi-node reference.
Monitoring
- Scheduler status is queried using the stored SLURM job id.
- TAO terminal status comes from the status file in the shared results folder.
- If the user enabled chat monitoring, continue polling at the requested
interval while the job is queued, running, or otherwise non-terminal.
Do not stop after a fixed elapsed time such as 30 minutes; long queue waits
are normal on shared GPU partitions.
- Do not send a final response for a non-terminal SLURM job when chat
monitoring is enabled. A final response is a detach action; use it only if
the user asked to detach/stop or the job reached terminal state.
- Logs are read over SSH from:
text
<job_dir>/slurm-logs/<slurm_job_name>-<slurm_job_id>/main.out
<job_dir>/slurm-logs/<slurm_job_name>-<slurm_job_id>/main.err
Status mapping:
- ->
- or ->
- -> check
- , , , , -> retry if
logs match retriable infrastructure patterns, otherwise
- , , ->
- ->
- , ->
Cancellation
Cancel by looking up backend_details.slurm_metadata.slurm_job_id and running
the scheduler's cancel command (scancel in standard SLURM) over SSH. Treat
missing or already terminated SLURM jobs as successful cancellation.
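The "already gone counts as success" rule can be sketched as follows. This is an illustration, not the handler's implementation: cancel_slurm_job and run_scancel_over_ssh are hypothetical names, SLURM_LOGIN_HOST is a hypothetical variable, and the matched error strings are the messages standard scancel is believed to print for unknown or finished jobs, so verify them on your cluster. Only plain numeric job ids are accepted, which also keeps untrusted input out of the remote command line.

```bash
# run_scancel_over_ssh JOB_ID: runs scancel on the login host and
# forwards its output (stderr merged into stdout) and exit status.
run_scancel_over_ssh() {
    ssh -o BatchMode=yes "${SLURM_USER}@${SLURM_LOGIN_HOST}" "scancel $1" 2>&1
}

# cancel_slurm_job JOB_ID [RUNNER]
# Returns 0 when the job was cancelled or is already gone, 1 on any
# other scheduler error, 2 when JOB_ID is not a plain number.
cancel_slurm_job() {
    job_id=$1
    runner=${2:-run_scancel_over_ssh}
    case $job_id in
        ''|*[!0-9]*)
            echo "refusing to cancel: invalid job id '$job_id'" >&2
            return 2 ;;
    esac
    if output=$("$runner" "$job_id"); then
        return 0
    fi
    case $output in
        # Assumed scancel messages for unknown or finished jobs.
        *"Invalid job id"*|*"already completing or completed"*)
            return 0 ;;
    esac
    printf '%s\n' "$output" >&2
    return 1
}
```

Passing the runner as an argument keeps the decision logic testable without a cluster; in normal use the default runner performs the SSH call.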
Multi-node training (distributed)
SLURM is the platform of choice for large multi-node runs: request multiple
nodes and the SDK handles the sbatch directives and PyTorch-distributed env
vars automatically. The multi-node reference provides a worked example, the
generated sbatch directives, the rendezvous env-var table, the Cosmos-RL role
note, cluster requirements (Pyxis/Enroot, InfiniBand/NVLink, Lustre), and
upstream reference links.
Running via the TAO SDK
The SDK install is covered in Preflight (pip install 'nvidia-tao-sdk[slurm]').
Use it when you want Job handles, the sbatch and related scheduler plumbing
handled for you, run-folder durability, or convenient cloud-storage I/O.
Without the SDK, drive job submission and monitoring yourself.
Auto-retry is fully automatic: a background monitor polls the scheduler and
resubmits the staged script on infrastructure-looking failures up to the
configured retry limit, while plain training failures surface immediately. In
addition, scheduler requeue is enabled by default. The SDK reference covers
the code example, the Lustre-not-S3 rule, the retriable-failure
classification, and the full auto-retry and requeue behavior.
Failure Modes
Common failures: SSH auth failure, local dataset path rejected, SQSH conversion
timeout, Pyxis/Enroot unavailable, and bad-node / transient GPU failures (which
the handler retries up to the configured limit). See
references/troubleshooting.md for the diagnosis and remediation of each.