SLURM

Remote GPU compute platform for clusters managed by SLURM. Jobs are submitted from the TAO service or SDK host to a login node over SSH, staged on a shared filesystem, submitted with `sbatch`, and executed with `srun` container support.

Use SLURM when the user has access to a managed GPU cluster, shared Lustre storage, and scheduler-owned GPU allocation. Do not use SLURM for local files that exist only on the agent machine; data and outputs must be reachable from the cluster.

Preflight

bash

# 1. SSH to the login node works without a password prompt
SLURM_HOST="${SLURM_HOSTNAME%%,*}"
[ -n "$SLURM_USER" ] && [ -n "$SLURM_HOST" ] || {
  echo "MISSING: set SLURM_USER and SLURM_HOSTNAME (comma-separated for failover) in your env (~/.config/tao/.env)."
  exit 1
}
ssh -o BatchMode=yes -o ConnectTimeout=10 "${SLURM_USER}@${SLURM_HOST}" "true" 2>/dev/null || {
  echo "MISSING: passwordless SSH to ${SLURM_USER}@${SLURM_HOST} not working. See references/ssh-setup.md."
  exit 1
}

# 2. Optional: TAO SDK wrapper for Job handles + S3 wrapping.
# nvidia-tao-sdk is on public PyPI; pin lives in versions.yaml (wheels.tao_sdk_slurm).
PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_slurm)
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}

If a check fails, the agent prompts the user to authorize the install/fix via Bash.

A third preflight step applies only to private `nvcr.io` images: Pyxis on the compute nodes needs persistent enroot credentials in `~/.config/enroot/.credentials` on the cluster (it does NOT read `NGC_KEY` from the job env). Without them, auth-gated pulls fail with "Could not process JSON input" at job startup. This step runs once per (cluster, user). See `references/ssh-setup.md` for the full check and the `printf | ssh` install pattern that keeps `NGC_KEY` out of history, files, and chat output. Skip it for public images.

Prerequisites

Before any job is submitted, the host running the TAO service or SDK must be able to log in to at least one host from `SLURM_HOSTNAME` over SSH without an interactive password prompt. The handler runs `sbatch`, `squeue`, `sacct`, `scancel`, and log tails non-interactively, so password or 2FA prompts will fail the job at submit or status time.

Set this up once per (host, login node, user) tuple: create an SSH keypair, install the public key on each login host, trust the host key, lock private-key permissions with `chmod 600`, and verify with `ssh -o BatchMode=yes ...`. See `references/ssh-setup.md` for the full step-by-step (including the `~/.ssh/config` alias, the container key-mount note, and the 2FA / `SSH_AUTH_SOCK` fallback). The same file holds the SSH failure remediation prompt to show the user when passwordless SSH fails.

Credentials

SLURM_USER (required): SSH username for the login node. In microservices workspace metadata this is `cloud_specific_details.slurm_user`.
SLURM_HOSTNAME (required): Comma-separated login hostnames for failover. The microservices schema stores this as the list field `cloud_specific_details.slurm_hostname`.
SLURM_PARTITION (required): Partition list for GPU job submission. Ask for this in the mandatory SLURM intake list. The packaged default is `polar,polar3,polar4,grizzly`, all of which are treated as 4-hour queues.
SSH_KEY_PATH (preferred and expected before launch): Private key path for non-interactive public-key auth to the login node. If passwordless SSH fails, ask the user for `SSH_KEY_PATH=/path/to/private_key` and show the setup steps in `references/ssh-setup.md`; do not bury this behind several alternate choices.
SSH_AUTH_SOCK (advanced fallback): SSH agent socket with an accepted key already loaded. Prefer `SSH_KEY_PATH` in user-facing remediation prompts.
SLURM_BASE_RESULTS_DIR (optional): Base shared filesystem path. The default convention from `tao-core` is `/lustre/fsw/portfolios/edgeai/<your-dir>`, where `<your-dir>` is your per-user directory on the cluster.
SLURM_ACCOUNT (usually required by site policy): Account charged by `#SBATCH --account`.

Do not ask for `SLURM_ACCOUNT` or `SLURM_BASE_RESULTS_DIR` in the initial intake unless the user says their site requires an account, wants a custom results root, or the workflow cannot proceed without overriding defaults.
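The intake rules above can be sketched as a small settings loader. This is only an illustration: `load_slurm_settings` and the returned dictionary layout are hypothetical and not part of `tao-core` or the SDK. Only the variable names and the packaged default partition list come from this document, and falling back to that default when `SLURM_PARTITION` is unset is an assumption.

```python
import os

# Packaged default partition list quoted in this document.
DEFAULT_PARTITION = "polar,polar3,polar4,grizzly"


def load_slurm_settings(env=None):
    """Collect SLURM connection settings from an environment mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    # SLURM_USER and SLURM_HOSTNAME have no usable default, so fail early.
    missing = [name for name in ("SLURM_USER", "SLURM_HOSTNAME") if not env.get(name)]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))
    # SLURM_HOSTNAME is comma-separated so later hosts can serve as failover.
    hosts = [h.strip() for h in env["SLURM_HOSTNAME"].split(",") if h.strip()]
    if not hosts:
        raise ValueError("SLURM_HOSTNAME contains no usable hostnames")
    return {
        "user": env["SLURM_USER"],
        "hosts": hosts,
        # Assumption: use the packaged default when no partition is supplied.
        "partition": env.get("SLURM_PARTITION") or DEFAULT_PARTITION,
        # Optional values stay None so callers only prompt when they are needed.
        "ssh_key_path": env.get("SSH_KEY_PATH"),
        "account": env.get("SLURM_ACCOUNT"),
        "base_results_dir": env.get("SLURM_BASE_RESULTS_DIR"),
    }
```

Leaving the optional values as `None` mirrors the rule of not asking for them during the initial intake.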

Backend Details

Use `backend_details.backend_type = "slurm"` when routing a job to this platform. Supported backend details from the microservices schema:

json

{
  "backend_type": "slurm",
  "partition": "polar,polar3,polar4,grizzly",
  "cluster_name": "optional-name"
}

Runtime metadata is stored under `backend_details.slurm_metadata`, especially `slurm_job_id` and `job_dir`. Do not invent these values. They are written after `sbatch` returns a scheduler job id.

Storage

SLURM jobs run on the cluster, so local paths from the API host are not valid dataset paths. Prefer shared filesystem URIs:

Use `lustre:///absolute/path` for user-provided datasets on Lustre.
`slurm://` paths may appear in microservices metadata and are converted to actual Lustre paths before the container starts.
Avoid bare `/local/path` and `file://` dataset URIs for SLURM. Validation in `tao-core` rejects local and file paths for remote backends.
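A client-side pre-check in the spirit of these rules might look like the sketch below. `reject_local_dataset_uri` is a hypothetical helper and does not reproduce the actual `tao-core` validation; it only rejects bare local paths and `file://` URIs and passes everything else through unchanged.

```python
from urllib.parse import urlparse


def reject_local_dataset_uri(uri):
    """Return the URI unchanged, or raise ValueError for local-only paths (hypothetical helper)."""
    scheme = urlparse(uri).scheme
    # A bare path such as /local/path parses with an empty scheme.
    if scheme in ("", "file"):
        raise ValueError(
            f"{uri!r} is only reachable on the submitting host; "
            "stage the data on shared storage and use lustre:///absolute/path"
        )
    return uri
```

Running this before staging gives the user a correction prompt on the submitting host instead of a failure at submit time.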

Accept either dataset roots or direct spec-key paths:

Root mode: `/lustre/.../<model>/train`, which model skills map to required files such as `<root>/annotations.json`, with `<root>` as the media path.

Direct spec mode: exact fields such as `custom.train_dataset.annotation_path=/lustre/.../train.json` and `custom.train_dataset.media_path=/lustre/.../videos.tar.gz`.

After passwordless SSH succeeds and before generating scripts, validate each required dataset file/path from the login host:

bash

ssh -o BatchMode=yes <SLURM_USER>@<working-login-host> \
  'test -e /lustre/.../annotations.json && test -e /lustre/.../media_or_archive'

If the remote `test -e` fails, stop and ask for corrected paths or for the data to be staged onto shared cluster storage. Do not create runner scripts that will fail inside the first training job.
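When the list of required paths is built programmatically, the remote check can be assembled with proper quoting. The sketch below uses a hypothetical helper, `build_remote_check`, which returns the `ssh` argument list for the same `test -e` chain shown above. It assumes the working login host has already been chosen.

```python
import shlex


def build_remote_check(user, host, paths):
    """Build an ssh argv that succeeds only if every path exists on the cluster (hypothetical helper)."""
    if not paths:
        raise ValueError("at least one path is required")
    # Quote each path so spaces or shell metacharacters survive the remote shell.
    remote = " && ".join("test -e " + shlex.quote(p) for p in paths)
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", remote]
```

The result can be passed to `subprocess.run(argv).returncode`; any non-zero value means at least one path is missing or SSH itself failed.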

Results default to:

text

/lustre/fsw/portfolios/edgeai/<your-dir>/results/<job_id>

`<your-dir>` is your per-user directory on the cluster.

The runner sets `TAO_API_RESULTS_DIR` to the parent results directory because container code appends the job id when writing status and artifacts.
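The relationship between the two paths can be shown in a few lines. `results_paths` is a hypothetical helper; the `results` subdirectory name follows the default layout quoted above.

```python
import posixpath


def results_paths(base_results_dir, job_id):
    """Return the env value and the final per-job directory (hypothetical helper)."""
    # The environment variable points at the parent directory only.
    parent = posixpath.join(base_results_dir, "results")
    return {
        "TAO_API_RESULTS_DIR": parent,
        # The container code is what appends the job id.
        "job_results_dir": posixpath.join(parent, job_id),
    }
```

Setting the variable to the per-job directory instead would produce a doubled job id in the final path.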

Use Lustre, not S3, for SLURM job inputs. SLURM's scheduler enforces a GPU-idle timeout: a long `s3://` download at the top of the script can burn the allocation before training begins, and the scheduler may kill the job. Stage training data onto Lustre first; S3 / HF / NGC pre-fetch is fine only for small auxiliary inputs (checkpoints, configs). See `references/sdk-usage.md` for the full rationale.

Container Execution

`tao-core` uses the SLURM handler to run TAO containers through Pyxis/Enroot:

Stage compact JSON files for specs, environment, and cloud metadata under `<job_dir>/specs`, `<job_dir>/env`, and `<job_dir>/meta`.
Optionally convert the Docker image to a cached SQSH image with `srun -n1 -p <conversion_partition> enroot import`.
Write an sbatch script under `<job_dir>/sbatch/job_<job_id>.sbatch`.
Submit with `sbatch --export=ALL <script>`.
Run the container with `srun --container-image=<image> --container-mounts=/lustre`.
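For readers driving `sbatch` without the SDK, the script-writing step could be approximated as below. This is a minimal sketch and not the script `tao-core` generates: `render_sbatch`, the `tao-<job_id>` job name, and the choice of directives are assumptions, while the `srun` flags are the ones shown in the steps above.

```python
def render_sbatch(job_id, image, command, partition, nodes=1,
                  time_hours=4, account=None, requeue=True, mounts="/lustre"):
    """Render a minimal sbatch script as text (hypothetical helper)."""
    # Convert fractional hours to HH:MM:00 via whole minutes to avoid ":60".
    hours, minutes = divmod(round(time_hours * 60), 60)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name=tao-{job_id}",  # job name is an assumption
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={hours:02d}:{minutes:02d}:00",
    ]
    if account:
        lines.append(f"#SBATCH --account={account}")
    if requeue:
        lines.append("#SBATCH --requeue")
    # Pyxis adds --container-image / --container-mounts to srun.
    lines.append(f"srun --container-image={image} --container-mounts={mounts} {command}")
    return "\n".join(lines) + "\n"
```

The returned text would be written to `<job_dir>/sbatch/job_<job_id>.sbatch` on the shared filesystem before submission.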

Image formats accepted by the handler:

`/path/to/image.sqsh`
`registry#image:tag`
`docker://registry#image:tag`
ordinary `registry/image:tag`, which is converted to Pyxis form when needed
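The conversion in the last item amounts to replacing the slash after the registry host with `#`. The sketch below is one way to do that and is not the handler's own code: `to_pyxis_image` is a hypothetical name, and the rule for detecting a registry host (a first path component containing `.` or `:`, or equal to `localhost`) is the usual Docker reference heuristic, assumed here.

```python
def to_pyxis_image(image):
    """Convert registry/image:tag to registry#image:tag; leave other forms alone (hypothetical helper)."""
    # SQSH files and names already in Pyxis form need no conversion.
    if image.endswith(".sqsh") or "#" in image:
        return image
    prefix = "docker://" if image.startswith("docker://") else ""
    rest = image[len(prefix):]
    head, sep, tail = rest.partition("/")
    # Assumption: the first component is a registry host only if it looks like one.
    if sep and ("." in head or ":" in head or head == "localhost"):
        return f"{prefix}{head}#{tail}"
    return image
```

Names without a registry component, such as `ubuntu:22.04`, are returned unchanged.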

SQSH conversion is cached by image name. For `:latest` images, the cached SQSH is used unless `force_reconvert_latest` is enabled.

Resource Mapping

Defaults from `tao-core`:

`num_nodes`: 1
`num_gpus`: 4
`max_num_gpus_per_node`: 8
`cpus_per_task`: 16
`time_hours`: 4
`timeout_hours`: 3.8
`max_time_hours`: 4
`container_mounts`: `/lustre`
`use_requeue`: true
`use_sqsh`: true

When generating launchers or wrapper scripts for SLURM, set the wall-time defaults explicitly from the packaged platform resource defaults:

bash

export SLURM_TIME_HOURS="${SLURM_TIME_HOURS:-4}"
export SLURM_TIMEOUT_HOURS="${SLURM_TIMEOUT_HOURS:-3.8}"

Do not default to 12 hours on SLURM. If the user supplies a longer `SLURM_TIME_HOURS`, verify that the selected partition supports it before submitting. For the packaged default partition list `polar,polar3,polar4,grizzly`, reject requests above 4 hours and ask for a different partition only if the user actually wants a longer wall time.
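The wall-time rule can be expressed as a guard that runs before script generation. `check_time_hours` is a hypothetical helper. It only knows the 4-hour limit stated here for the packaged default partitions; limits for any other partition are site-specific and must be verified separately.

```python
DEFAULT_PARTITIONS = {"polar", "polar3", "polar4", "grizzly"}
MAX_DEFAULT_HOURS = 4.0  # limit stated in this document for the default partitions


def check_time_hours(time_hours, partition):
    """Reject wall times above 4 hours on the packaged default partitions (hypothetical helper)."""
    names = {p.strip() for p in partition.split(",") if p.strip()}
    if not names:
        raise ValueError("partition list is empty")
    # Only enforce the limit when every requested partition is a known 4-hour queue.
    if names <= DEFAULT_PARTITIONS and time_hours > MAX_DEFAULT_HOURS:
        raise ValueError(
            f"{time_hours}h exceeds the {MAX_DEFAULT_HOURS:g}h limit of {partition}; "
            "choose a different partition for longer runs"
        )
    return time_hours
```

For partitions outside the default list the function returns the value unchanged, so the caller still has to confirm the site limit.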

When `num_gpus` is greater than or equal to `max_num_gpus_per_node`, the handler treats the request as exclusive per node and computes additional nodes from the total GPU count when necessary.
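One plausible reading of this rule is sketched below. It assumes `num_gpus` is the total GPU count for the job and that the node count is rounded up; the handler's exact arithmetic is not given in this document, so `plan_nodes` is illustrative only.

```python
import math


def plan_nodes(num_gpus, max_num_gpus_per_node=8, num_nodes=1):
    """Derive node count and exclusivity from a GPU request (hypothetical helper)."""
    if num_gpus < max_num_gpus_per_node:
        # Partial-node request: keep the requested node count, share the node.
        return {"nodes": num_nodes, "gpus_per_node": num_gpus, "exclusive": False}
    # Full-node request: take whole nodes and add more if the total needs them.
    needed = math.ceil(num_gpus / max_num_gpus_per_node)
    return {
        "nodes": max(num_nodes, needed),
        "gpus_per_node": max_num_gpus_per_node,
        "exclusive": True,
    }
```

With the defaults listed above (4 GPUs, 8 per node), the request stays on one shared node.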

For multi-node jobs (`num_nodes > 1`), the sbatch script exports `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`, `NODE_RANK`, and `NUM_GPU_PER_NODE`. Cosmos-RL has special multi-node role handling for controller, policy, and rollout workers. See `references/multi-node.md` for the full sbatch directives, the rendezvous env-var table and contract, and cluster requirements.

Monitoring

Scheduler status comes from the stored SLURM job id via `squeue` or `sacct`.
TAO terminal status comes from `status.json` in the shared results folder.
If the user enabled chat monitoring, continue polling at the requested interval while the job is `PENDING`, `RUNNING`, or otherwise non-terminal. Do not stop after a fixed elapsed time such as 30 minutes; long queue waits are normal on shared GPU partitions.
Do not send a final response for a non-terminal SLURM job when chat monitoring is enabled. A final response is a detach action; use it only if the user asked to detach/stop or the job reached terminal state.
Logs are read over SSH from:

text

<job_dir>/slurm-logs/<slurm_job_name>-<slurm_job_id>/main.out
<job_dir>/slurm-logs/<slurm_job_name>-<slurm_job_id>/main.err

Status mapping:

`PENDING` -> `Pending`
`RUNNING` or `COMPLETING` -> `Running`
`COMPLETED` -> check `status.json`
`FAILED`, `BOOT_FAIL`, `DEADLINE`, `OUT_OF_MEMORY`, `NODE_FAIL` -> retry if logs match retriable infrastructure patterns, otherwise `Error`
`CANCELLED`, `PREEMPTED`, `REVOKED` -> `Canceled`
`TIMEOUT` -> `Error`
`SUSPENDED`, `STOPPED` -> `Paused`
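The mapping above translates naturally into a lookup function. This is a sketch and not the handler's code: `map_slurm_state` and the two sentinel results `Retry` and `CheckStatusJson` are hypothetical names for the "retry" and "check `status.json`" outcomes, and the caller is assumed to decide whether the logs match a retriable pattern.

```python
_DIRECT = {
    "PENDING": "Pending",
    "RUNNING": "Running",
    "COMPLETING": "Running",
    "CANCELLED": "Canceled",
    "PREEMPTED": "Canceled",
    "REVOKED": "Canceled",
    "TIMEOUT": "Error",
    "SUSPENDED": "Paused",
    "STOPPED": "Paused",
}
_MAYBE_RETRIABLE = {"FAILED", "BOOT_FAIL", "DEADLINE", "OUT_OF_MEMORY", "NODE_FAIL"}


def map_slurm_state(state, logs_look_retriable=False):
    """Map a scheduler state string to a TAO status (hypothetical helper)."""
    words = state.split()
    if not words:
        raise ValueError("empty scheduler state")
    # sacct may print "CANCELLED by <uid>" or a truncated "CANCELLED+".
    key = words[0].rstrip("+").upper()
    if key == "COMPLETED":
        return "CheckStatusJson"  # terminal status must come from status.json
    if key in _MAYBE_RETRIABLE:
        return "Retry" if logs_look_retriable else "Error"
    if key in _DIRECT:
        return _DIRECT[key]
    raise ValueError(f"unmapped scheduler state: {state!r}")
```

Raising on unmapped states keeps a new scheduler state from being silently reported as success.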

Cancellation

Cancel by looking up `backend_details.slurm_metadata.slurm_job_id` and running `scancel <slurm_job_id>` over SSH. Treat missing or already terminated SLURM jobs as successful cancellation.
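An idempotent cancel along these lines could look like the sketch below. `cancel_job` is a hypothetical helper, the injectable `run` parameter exists only to make it testable, and the two error substrings are the messages Slurm typically prints for unknown or finished jobs; their exact wording may vary by Slurm version, so treat the matching as an assumption.

```python
import subprocess

# Messages scancel commonly prints when there is nothing left to cancel (assumed wording).
_ALREADY_GONE = ("Invalid job id", "already completing or completed")


def cancel_job(slurm_job_id, login, run=subprocess.run):
    """Run scancel over SSH; report success if the job is gone either way (hypothetical helper)."""
    job = str(slurm_job_id)
    if not job.isdigit():
        # Refuse anything that is not a plain numeric id before it reaches a shell.
        raise ValueError(f"unexpected job id: {slurm_job_id!r}")
    result = run(
        ["ssh", "-o", "BatchMode=yes", login, f"scancel {job}"],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return True
    return any(marker in result.stderr for marker in _ALREADY_GONE)
```

A `False` result means the cancel genuinely failed, for example because SSH auth broke, and should be surfaced to the user.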

Multi-node training (distributed)

SLURM is the platform of choice for large multi-node runs: pass `num_nodes > 1` and the SDK handles the sbatch directives and PyTorch-distributed env vars automatically. See `references/multi-node.md` for a worked `create_job` example, the generated sbatch directives, the rendezvous env-var table (`WORLD_SIZE`, `NUM_GPU_PER_NODE`, `NODE_RANK`, `MASTER_ADDR`, `MASTER_PORT`), the Cosmos-RL role note, cluster requirements (Pyxis/Enroot, InfiniBand/NVLink, Lustre), and upstream reference links.

Running via the TAO SDK

The SDK install is covered in Preflight (`pip install 'nvidia-tao-sdk[slurm]'`). Use it when you want Job handles, the `sbatch`/`squeue`/`sacct` plumbing handled for you, run-folder durability via `ActionWorkflow`, or convenient cloud-storage I/O (`s3://`, `hf_model://`, `ngc://`). Without the SDK, drive `sbatch` and `srun` yourself.

Auto-retry is fully automatic: a background monitor polls `squeue`/`sacct` and resubmits the staged script with `sbatch` on infrastructure-looking failures, up to `MAX_JOB_RETRIES = 10`, while plain training failures surface immediately. In addition, `#SBATCH --requeue` is set by default (`SLURM_USE_REQUEUE`, defaults to `true`). See `references/sdk-usage.md` for the `SlurmSDK` / `build_entrypoint` code example, the Lustre-not-S3 rule, the retriable-failure classification, and the full auto-retry and requeue behavior.
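The control flow of such a retry monitor can be sketched with injected callbacks. This is not the SDK's implementation: `run_with_retries` and its three callback parameters are hypothetical, and counting the limit as "up to 10 resubmissions after the first attempt" is an assumption about how `MAX_JOB_RETRIES` is applied.

```python
MAX_JOB_RETRIES = 10  # value quoted in this document


def run_with_retries(submit, wait_for_terminal, is_infrastructure_failure,
                     max_retries=MAX_JOB_RETRIES):
    """Resubmit on infrastructure-looking failures only (hypothetical helper).

    submit() returns a job id, wait_for_terminal(job_id) returns the final
    scheduler state, and is_infrastructure_failure(job_id, state) inspects logs.
    Returns (final_state, retries_used).
    """
    retries = 0
    while True:
        job_id = submit()
        state = wait_for_terminal(job_id)
        if state == "COMPLETED":
            return state, retries
        # Plain training failures and exhausted budgets surface immediately.
        if retries >= max_retries or not is_infrastructure_failure(job_id, state):
            return state, retries
        retries += 1
```

Keeping the classification in a callback leaves the retriable-failure patterns where the document says they live, in `references/sdk-usage.md`.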

Failure Modes

Common failures: SSH auth failure, local dataset path rejected, SQSH conversion timeout, Pyxis/Enroot unavailable, and bad-node / transient GPU failures (which the handler retries up to the configured limit). See `references/troubleshooting.md` for the diagnosis and remediation of each.
