tao-run-on-slurm

SLURM

Remote GPU compute platform for clusters managed by SLURM. Jobs are submitted from the TAO service or SDK host to a login node over SSH, staged on a shared filesystem, submitted with `sbatch`, and executed with `srun` container support.

Use SLURM when the user has access to a managed GPU cluster, shared Lustre storage, and scheduler-owned GPU allocation. Do not use SLURM for local files that exist only on the agent machine; data and outputs must be reachable from the cluster.

Preflight

```bash
# 1. SSH to the login node works without a password prompt
SLURM_HOST="${SLURM_HOSTNAME%%,*}"
[ -n "$SLURM_USER" ] && [ -n "$SLURM_HOST" ] || {
  echo "MISSING: set SLURM_USER and SLURM_HOSTNAME (comma-separated for failover) in your env (~/.config/tao/.env)."
  exit 1
}
ssh -o BatchMode=yes -o ConnectTimeout=10 "${SLURM_USER}@${SLURM_HOST}" "true" 2>/dev/null || {
  echo "MISSING: passwordless SSH to ${SLURM_USER}@${SLURM_HOST} not working. See references/ssh-setup.md."
  exit 1
}

# 2. Optional: TAO SDK wrapper for Job handles + S3 wrapping.
# nvidia-tao-sdk is on public PyPI; pin lives in versions.yaml (wheels.tao_sdk_slurm).
PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_slurm)
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install '${PIN}'"
  exit 1
}
```


If a check fails, the agent prompts the user to authorize the install/fix via Bash.

A third preflight step applies only for **private `nvcr.io` images**: Pyxis on
the compute nodes needs persistent enroot credentials in
`~/.config/enroot/.credentials` on the cluster (it does NOT read `NGC_KEY` from
the job env). Without them, auth-gated pulls fail with "Could not process JSON
input" at job startup. This runs once per (cluster, user). See
`references/ssh-setup.md` for the full check and the `printf | ssh` install
pattern that keeps `NGC_KEY` out of history, files, and chat output. Skip it for
public images.
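
A minimal sketch of the presence check for those credentials, assuming the netrc-style `machine nvcr.io login ... password ...` entry format that enroot reads. The helper name `has_nvcr_credentials` is hypothetical, and the authoritative check lives in `references/ssh-setup.md`. The function only tests that an entry exists and never prints the file contents.

```shell
# Hypothetical helper: succeed if an enroot credentials file has an nvcr.io entry.
# Assumes netrc-style lines such as: machine nvcr.io login <user> password <key>
has_nvcr_credentials() {
  local cred_file="${1:-$HOME/.config/enroot/.credentials}"
  [ -r "$cred_file" ] || return 1
  # -q keeps the matching line (which holds the key) out of the output.
  grep -Eq '^[[:space:]]*machine[[:space:]]+nvcr\.io([[:space:]]|$)' "$cred_file"
}
```

Run it on the cluster side (for example through the same `BatchMode` SSH session used in step 1), since the file must exist where Pyxis runs, not on the agent machine.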

Prerequisites

Before any job is submitted, the host running the TAO service or SDK must be able to log in to at least one host from `SLURM_HOSTNAME` over SSH without an interactive password prompt. The handler runs `sbatch`, `squeue`, `sacct`, `scancel`, and log tails non-interactively, so password or 2FA prompts will fail the job at submit or status time.

Set this up once per (host, login node, user) tuple: create an SSH keypair, install the public key on each login host, trust the host key, lock private-key permissions with `chmod 600`, and verify with `ssh -o BatchMode=yes ...`. See `references/ssh-setup.md` for the full step-by-step (including the `~/.ssh/config` alias, the container key-mount note, and the 2FA / `SSH_AUTH_SOCK` fallback). The same file holds the SSH failure remediation prompt to show the user when passwordless SSH fails.
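
Because `SLURM_HOSTNAME` may list several login hosts, one way to find a working one is to probe each in order. The sketch below assumes failover simply means "use the first host that passes the `BatchMode` check"; `probe_login_host` and `pick_login_host` are hypothetical names, not part of the handler.

```shell
# Hypothetical probe: succeeds when non-interactive SSH to the host works.
probe_login_host() {
  ssh -o BatchMode=yes -o ConnectTimeout=10 "${SLURM_USER}@$1" true 2>/dev/null
}

# Print the first host in a comma-separated list that passes the probe.
# Returns non-zero when no host is reachable.
pick_login_host() {
  local host
  for host in $(printf '%s' "$1" | tr ',' ' '); do
    if probe_login_host "$host"; then
      printf '%s\n' "$host"
      return 0
    fi
  done
  return 1
}
```

A caller would typically run `SLURM_HOST="$(pick_login_host "$SLURM_HOSTNAME")"` and show the remediation prompt from `references/ssh-setup.md` when it fails.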

Credentials

- `SLURM_USER` (required): SSH username for the login node. In microservices workspace metadata this is `cloud_specific_details.slurm_user`.
- `SLURM_HOSTNAME` (required): Comma-separated login hostnames for failover. The microservices schema stores this as the list field `cloud_specific_details.slurm_hostname`.
- `SLURM_PARTITION` (required): Partition list for GPU job submission. Ask for this in the mandatory SLURM intake list. The packaged default is `polar,polar3,polar4,grizzly`, which are treated as 4-hour queues.
- `SSH_KEY_PATH` (preferred and expected before launch): Private key path for non-interactive public-key auth to the login node. If passwordless SSH fails, ask the user for `SSH_KEY_PATH=/path/to/private_key` and show the setup steps in `references/ssh-setup.md`; do not bury this behind several alternate choices.
- `SSH_AUTH_SOCK` (advanced fallback): SSH agent socket with an accepted key already loaded. Prefer `SSH_KEY_PATH` in user-facing remediation prompts.
- `SLURM_BASE_RESULTS_DIR` (optional): Base shared filesystem path. Default convention from `tao-core` is `/lustre/fsw/portfolios/edgeai/<your-dir>`, where `<your-dir>` is your per-user directory on the cluster.
- `SLURM_ACCOUNT` (usually required by site policy): Account charged by `#SBATCH --account`.

Do not ask for `SLURM_ACCOUNT` or `SLURM_BASE_RESULTS_DIR` in the initial intake unless the user says their site requires an account, wants a custom results root, or the workflow cannot proceed without overriding defaults.
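
The intake rule can be checked mechanically before anything is submitted. This is a small sketch, with the hypothetical helper `missing_slurm_vars`, that reports only the three settings marked required above and deliberately ignores the optional ones.

```shell
# Print the names of required SLURM settings that are unset or empty.
# Optional settings (SLURM_ACCOUNT, SLURM_BASE_RESULTS_DIR) are not checked.
missing_slurm_vars() {
  local name value
  for name in SLURM_USER SLURM_HOSTNAME SLURM_PARTITION; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}
```

An empty result means the mandatory intake is complete; any printed name is something to ask the user for.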

Backend Details

Use `backend_details.backend_type = "slurm"` when routing a job to this platform. Supported backend details from the microservices schema:

```json
{
  "backend_type": "slurm",
  "partition": "polar,polar3,polar4,grizzly",
  "cluster_name": "optional-name"
}
```

Runtime metadata is stored under `backend_details.slurm_metadata`, especially `slurm_job_id` and `job_dir`. Do not invent these values. They are written after `sbatch` returns a scheduler job id.

Storage

SLURM jobs run on the cluster, so local paths from the API host are not valid dataset paths. Prefer shared filesystem URIs:

- Use `lustre:///absolute/path` for user-provided datasets on Lustre.
- `slurm://` paths may appear in microservices metadata and are converted to actual Lustre paths before the container starts.
- Avoid bare `/local/path` and `file://` dataset URIs for SLURM. Validation in `tao-core` rejects local and file paths for remote backends.
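
A rough sketch of how a wrapper might apply these URI rules before calling the handler. It is not the `tao-core` validator: `to_cluster_path` is a hypothetical helper that accepts `lustre:///...` URIs and absolute `/lustre/...` paths, and treats everything else (including `slurm://`, which the handler resolves itself) as out of scope.

```shell
# Map a dataset URI to a path on the cluster, or fail for anything this
# sketch does not recognise as shared-filesystem storage.
to_cluster_path() {
  case "$1" in
    lustre:///*) printf '%s\n' "${1#lustre://}" ;;  # strip the scheme, keep the leading /
    /lustre/*)   printf '%s\n' "$1" ;;              # already an absolute Lustre path
    *)           return 1 ;;                         # file://, bare local paths, s3://, ...
  esac
}
```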

Accept either dataset roots or direct spec-key paths:

- Root mode: `/lustre/.../<model>/train`, which model skills map to required files such as `<root>/annotations.json` and `<root>` as media path.
- Direct spec mode: exact fields such as `custom.train_dataset.annotation_path=/lustre/.../train.json` and `custom.train_dataset.media_path=/lustre/.../videos.tar.gz`.

After passwordless SSH succeeds and before generating scripts, validate each required dataset file/path from the login host:

```bash
ssh -o BatchMode=yes <SLURM_USER>@<working-login-host> \
  'test -e /lustre/.../annotations.json && test -e /lustre/.../media_or_archive'
```

If the remote `test -e` fails, stop and ask for corrected paths or for the data to be staged onto shared cluster storage. Do not create runner scripts that will fail inside the first training job.
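
When several files are required, it helps to know which one is missing so the user can be asked about that path specifically. The sketch below checks paths one at a time; `run_remote` and `first_missing_remote_path` are hypothetical helpers, and the sketch assumes paths contain no single quotes. An SSH failure is reported the same way as a missing path, so run it only after the passwordless SSH check has passed.

```shell
# Hypothetical runner: execute one command string on the login host.
run_remote() {
  ssh -o BatchMode=yes "${SLURM_USER}@${SLURM_HOST}" "$1"
}

# Print the first path that does not exist on the cluster and return 1;
# return 0 silently when every path exists.
first_missing_remote_path() {
  local path
  for path in "$@"; do
    if ! run_remote "test -e '$path'"; then
      printf '%s\n' "$path"
      return 1
    fi
  done
  return 0
}
```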

Results default to:

```text
/lustre/fsw/portfolios/edgeai/<your-dir>/results/<job_id>
```

`<your-dir>` is your per-user directory on the cluster.

The runner sets `TAO_API_RESULTS_DIR` to the parent results directory because container code appends the job id when writing status and artifacts.

**Use Lustre, not S3, for SLURM job inputs.** SLURM's scheduler enforces a GPU-idle timeout: a long `s3://` download at the top of the script can burn the allocation before training begins, and the scheduler may kill the job. Stage training data onto Lustre first; S3 / HF / NGC pre-fetch is fine only for small auxiliary inputs (checkpoints, configs). See `references/sdk-usage.md` for the full rationale.

Container Execution

`tao-core` uses the SLURM handler to run TAO containers through Pyxis/Enroot:

1. Stage compact JSON files for specs, environment, and cloud metadata under `<job_dir>/specs`, `<job_dir>/env`, and `<job_dir>/meta`.
2. Optionally convert the Docker image to a cached SQSH image with `srun -n1 -p <conversion_partition> enroot import`.
3. Write an sbatch script under `<job_dir>/sbatch/job_<job_id>.sbatch`.
4. Submit with `sbatch --export=ALL <script>`.
5. Run the container with `srun --container-image=<image> --container-mounts=/lustre`.
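
For readers driving `sbatch` by hand, the script-writing step might look roughly like the sketch below. It is not the script `tao-core` generates: `write_sbatch_script`, the `tao_<job_id>` job name, and the `nvidia-smi` placeholder command are assumptions for illustration, the resource values are the packaged defaults, and some sites expect `--gres=gpu:N` instead of `--gpus-per-node`.

```shell
# Write a minimal sbatch script to <job_dir>/sbatch/job_<job_id>.sbatch
# and print its path. Arguments: job_dir job_id image partition.
write_sbatch_script() {
  local job_dir="$1" job_id="$2" image="$3" partition="$4"
  local script="${job_dir}/sbatch/job_${job_id}.sbatch"
  mkdir -p "${job_dir}/sbatch" || return 1
  cat > "$script" <<EOF
#!/bin/bash
#SBATCH --job-name=tao_${job_id}
#SBATCH --partition=${partition}
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=16
#SBATCH --time=04:00:00
#SBATCH --requeue

# Placeholder command: a real job runs the entrypoint built from the staged specs.
srun --container-image="${image}" --container-mounts=/lustre nvidia-smi
EOF
  printf '%s\n' "$script"
}
```

The printed path can be passed straight to `sbatch --export=ALL` on the login node.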

Image formats accepted by the handler:

- `/path/to/image.sqsh`
- `registry#image:tag`
- `docker://registry#image:tag`
- ordinary `registry/image:tag`, which is converted to Pyxis form when needed
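
The last conversion can be sketched as follows. This is an illustration of the Pyxis `REGISTRY#IMAGE:TAG` notation, not the handler's own logic: `to_pyxis_image` is a hypothetical helper, and it assumes the first path component is a registry only when it contains a dot or a port.

```shell
# Convert an ordinary registry/image:tag reference to registry#image:tag.
# Paths, scheme-prefixed references, and references already containing '#'
# are passed through unchanged.
to_pyxis_image() {
  local ref="$1" first rest
  case "$ref" in
    /*|*://*|*'#'*) printf '%s\n' "$ref"; return 0 ;;
  esac
  first="${ref%%/*}"
  rest="${ref#*/}"
  case "$first" in
    *.*|*:*)
      if [ "$first" != "$ref" ]; then
        printf '%s#%s\n' "$first" "$rest"
        return 0
      fi ;;
  esac
  printf '%s\n' "$ref"   # no registry component, e.g. ubuntu:22.04
}
```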

SQSH conversion is cached by image name. For `:latest` images, cached SQSH is used unless `force_reconvert_latest` is enabled.

Resource Mapping

Defaults from `tao-core`:

- `num_nodes`: 1
- `num_gpus`: 4
- `max_num_gpus_per_node`: 8
- `cpus_per_task`: 16
- `time_hours`: 4
- `timeout_hours`: 3.8
- `max_time_hours`: 4
- `container_mounts`: `/lustre`
- `use_requeue`: true
- `use_sqsh`: true

When generating launchers or wrapper scripts for SLURM, set the wall-time defaults explicitly from the packaged platform resource defaults:

```bash
export SLURM_TIME_HOURS="${SLURM_TIME_HOURS:-4}"
export SLURM_TIMEOUT_HOURS="${SLURM_TIMEOUT_HOURS:-3.8}"
```

Do not default to 12 hours on SLURM. If the user supplies a longer `SLURM_TIME_HOURS`, verify that the selected partition supports it before submitting. For the packaged default partition list `polar,polar3,polar4,grizzly`, reject requests above 4 hours and ask for a different partition only if the user actually wants a longer wall time.
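
A launcher can enforce the wall-time rule with a small guard. In this sketch `wall_time_allowed` is a hypothetical helper; the limit defaults to the 4-hour value of the packaged partitions and can be overridden for a partition known to allow more. `awk` is used because the values may be fractional.

```shell
# Succeed when the requested hours are positive and within the limit.
# Arguments: requested_hours [limit_hours, default 4].
wall_time_allowed() {
  local requested="$1" limit="${2:-4}"
  awk -v r="$requested" -v m="$limit" 'BEGIN { exit !((r + 0) > 0 && (r + 0) <= (m + 0)) }'
}
```

For example, `wall_time_allowed "$SLURM_TIME_HOURS" || echo "ask for a partition with a longer limit"`.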

When `num_gpus` is greater than or equal to `max_num_gpus_per_node`, the handler treats the request as exclusive per node and computes additional nodes from total GPU count when necessary.
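
As a worked example of the node calculation, the sketch below assumes the node count is the ceiling of `num_gpus / max_num_gpus_per_node`. That rule is an assumption for illustration; the handler's exact arithmetic may differ. `nodes_for_gpus` is a hypothetical helper.

```shell
# Ceiling division: nodes needed to host num_gpus at max_per_node GPUs each.
# Arguments: num_gpus [max_per_node, default 8].
nodes_for_gpus() {
  local num_gpus="$1" max_per_node="${2:-8}"
  echo $(( ($num_gpus + $max_per_node - 1) / $max_per_node ))
}
```

Under this assumption, 4 or 8 GPUs fit on one node, while 16 GPUs need two exclusive nodes.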

For multi-node jobs (`num_nodes > 1`), the sbatch script exports `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`, `NODE_RANK`, and `NUM_GPU_PER_NODE`, and Cosmos-RL has special multi-node role handling for controller, policy, and rollout workers. See `references/multi-node.md` for the full sbatch directives, the rendezvous env-var table and contract, and cluster requirements.

Monitoring

- Scheduler status comes from the stored SLURM job id via `squeue` or `sacct`.
- TAO terminal status comes from `status.json` in the shared results folder.
- If the user enabled chat monitoring, continue polling at the requested interval while the job is `PENDING`, `RUNNING`, or otherwise non-terminal. Do not stop after a fixed elapsed time such as 30 minutes; long queue waits are normal on shared GPU partitions.
- Do not send a final response for a non-terminal SLURM job when chat monitoring is enabled. A final response is a detach action; use it only if the user asked to detach/stop or the job reached terminal state.
- Logs are read over SSH from:

```text
<job_dir>/slurm-logs/<slurm_job_name>-<slurm_job_id>/main.out
<job_dir>/slurm-logs/<slurm_job_name>-<slurm_job_id>/main.err
```
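
The polling rule can be sketched as a loop with no elapsed-time cutoff. The helper names here are hypothetical, the set of states treated as terminal is an assumption drawn from the states this document lists, and `fetch_slurm_state` assumes job accounting is enabled so that `sacct` can report jobs that have already left the queue.

```shell
# Hypothetical fetcher: print the scheduler state for a job id.
# sacct may print e.g. "CANCELLED by 1001", so keep only the first word.
fetch_slurm_state() {
  ssh -o BatchMode=yes "${SLURM_USER}@${SLURM_HOST}" \
    "sacct -X -n -P -o State -j $1" | awk 'NR == 1 { print $1 }'
}

# Assumed terminal set; anything else (including an empty reply) keeps polling.
is_terminal_state() {
  case "$1" in
    COMPLETED|FAILED|BOOT_FAIL|DEADLINE|OUT_OF_MEMORY|NODE_FAIL|CANCELLED|PREEMPTED|REVOKED|TIMEOUT) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll until the job is terminal, then print the final state.
# Arguments: job_id [interval_seconds, default 60].
poll_until_terminal() {
  local job_id="$1" interval="${2:-60}" state
  while :; do
    state="$(fetch_slurm_state "$job_id")"
    if is_terminal_state "$state"; then
      printf '%s\n' "$state"
      return 0
    fi
    sleep "$interval"
  done
}
```

With auto-retry or `--requeue` in play, a state such as `NODE_FAIL` may be followed by a resubmission, so a real monitor would consult the retry logic before treating it as final.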

Status mapping:

- `PENDING` -> `Pending`
- `RUNNING` or `COMPLETING` -> `Running`
- `COMPLETED` -> check `status.json`
- `FAILED`, `BOOT_FAIL`, `DEADLINE`, `OUT_OF_MEMORY`, `NODE_FAIL` -> retry if logs match retriable infrastructure patterns, otherwise `Error`
- `CANCELLED`, `PREEMPTED`, `REVOKED` -> `Canceled`
- `TIMEOUT` -> `Error`
- `SUSPENDED`, `STOPPED` -> `Paused`
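
The mapping above translates directly into a `case` statement. In this sketch `map_slurm_state` is a hypothetical helper; the `check-status-json` and `Retry` outputs are placeholder markers for "consult `status.json`" and "resubmit", and the second argument stands in for the result of a separate log classifier that is not shown.

```shell
# Map a SLURM state to a TAO status.
# Arguments: slurm_state [retriable: yes|no, default no].
map_slurm_state() {
  local state="${1%% *}" retriable="${2:-no}"   # drop suffixes like " by 1001"
  case "$state" in
    PENDING)                     echo Pending ;;
    RUNNING|COMPLETING)          echo Running ;;
    COMPLETED)                   echo check-status-json ;;
    FAILED|BOOT_FAIL|DEADLINE|OUT_OF_MEMORY|NODE_FAIL)
      if [ "$retriable" = "yes" ]; then echo Retry; else echo Error; fi ;;
    CANCELLED|PREEMPTED|REVOKED) echo Canceled ;;
    TIMEOUT)                     echo Error ;;
    SUSPENDED|STOPPED)           echo Paused ;;
    *)                           return 1 ;;    # state not covered by the mapping
  esac
}
```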

Cancellation

Cancel by looking up `backend_details.slurm_metadata.slurm_job_id` and running `scancel <slurm_job_id>` over SSH. Treat missing or already terminated SLURM jobs as successful cancellation.
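
One way to implement the "missing counts as success" rule is to fall back to `squeue` when `scancel` reports an error. The sketch assumes that an empty `squeue` reply means the job is no longer known to the scheduler; `run_remote` and `cancel_slurm_job` are hypothetical helpers.

```shell
# Hypothetical runner: execute one command string on the login host.
run_remote() {
  ssh -o BatchMode=yes "${SLURM_USER}@${SLURM_HOST}" "$1"
}

# Cancel a job; succeed if scancel works or the job is already gone.
cancel_slurm_job() {
  local job_id="$1" state
  [ -n "$job_id" ] || return 2                 # no stored job id: nothing to act on
  if run_remote "scancel $job_id" 2>/dev/null; then
    return 0
  fi
  # scancel failed: treat it as success only if the queue no longer lists the job.
  state="$(run_remote "squeue -h -j $job_id -o %T" 2>/dev/null)"
  [ -z "$state" ]
}
```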

Multi-node training (distributed)

SLURM is the platform of choice for large multi-node runs: pass `num_nodes > 1` and the SDK handles the sbatch directives and PyTorch-distributed env vars automatically. See `references/multi-node.md` for a worked `create_job` example, the generated sbatch directives, the rendezvous env-var table (`WORLD_SIZE`, `NUM_GPU_PER_NODE`, `NODE_RANK`, `MASTER_ADDR`, `MASTER_PORT`), the Cosmos-RL role note, cluster requirements (Pyxis/Enroot, InfiniBand/NVLink, Lustre), and upstream reference links.

Running via the TAO SDK

The SDK install is covered in Preflight: `pip install 'nvidia-tao-sdk[slurm]'`. Use it when you want Job handles, the `sbatch`/`squeue`/`sacct` plumbing handled for you, run-folder durability via `ActionWorkflow`, or convenient cloud-storage I/O (`s3://`, `hf_model://`, `ngc://`). Without the SDK, drive `sbatch` and `srun` yourself.

Auto-retry is fully automatic: a background monitor polls `squeue`/`sacct` and resubmits the staged script with `sbatch` on infrastructure-looking failures, up to `MAX_JOB_RETRIES = 10`, while plain training failures surface immediately. In addition, `#SBATCH --requeue` is set by default (`SLURM_USE_REQUEUE`, defaults to `true`). See `references/sdk-usage.md` for the `SlurmSDK` / `build_entrypoint` code example, the Lustre-not-S3 rule, the retriable-failure classification, and the full auto-retry and requeue behavior.
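
For readers working without the SDK, the retry policy can be approximated by a bounded loop. This is a sketch of the behavior described above, not the SDK's monitor: `submit_job` and `failure_is_retriable` are hypothetical hooks the caller must replace, and the sketch assumes the limit counts resubmissions after the first attempt.

```shell
MAX_JOB_RETRIES=10

# Hypothetical hooks; replace both with real implementations.
submit_job() { echo "submit_job: not implemented" >&2; return 1; }   # sbatch and wait
failure_is_retriable() { return 1; }                                  # inspect the logs

# Returns 0 on success, 1 on a non-retriable failure,
# and 2 once the retry budget is exhausted.
run_with_retries() {
  local retries=0
  while :; do
    if submit_job; then
      return 0
    fi
    failure_is_retriable || return 1      # plain training failure: surface immediately
    retries=$((retries + 1))
    [ "$retries" -le "$MAX_JOB_RETRIES" ] || return 2
  done
}
```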

Failure Modes

Common failures: SSH auth failure, local dataset path rejected, SQSH conversion timeout, Pyxis/Enroot unavailable, and bad-node / transient GPU failures (which the handler retries up to the configured limit). See `references/troubleshooting.md` for the diagnosis and remediation of each.
