tao-run-on-slurm


SLURM


Remote GPU compute platform for clusters managed by SLURM. Jobs are submitted from the TAO service or SDK host to a login node over SSH, staged on a shared filesystem, submitted with `sbatch`, and executed with `srun` container support.

Use SLURM when the user has access to a managed GPU cluster, shared Lustre storage, and scheduler-owned GPU allocation. Do not use SLURM for local files that exist only on the agent machine; data and outputs must be reachable from the cluster.

Preflight


1. SSH to the login node works without a password prompt


SLURM_HOST="${SLURM_HOSTNAME%%,*}"
[ -n "$SLURM_USER" ] && [ -n "$SLURM_HOST" ] || {
  echo "MISSING: set SLURM_USER and SLURM_HOSTNAME (comma-separated for failover) in your env (~/.config/tao/.env)."
  exit 1
}
ssh -o BatchMode=yes -o ConnectTimeout=10 "${SLURM_USER}@${SLURM_HOST}" "true" 2>/dev/null || {
  echo "MISSING: passwordless SSH to ${SLURM_USER}@${SLURM_HOST} not working. See references/ssh-setup.md."
  exit 1
}

2. Optional: TAO SDK wrapper for Job handles + S3 wrapping.


`nvidia-tao-sdk` is on public PyPI; its version pin lives in `versions.yaml` (`wheels.tao_sdk_slurm`).


PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_slurm)
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}

If a check fails, the agent prompts the user to authorize the install/fix via Bash.

A third preflight step applies only for **private `nvcr.io` images**: Pyxis on
the compute nodes needs persistent enroot credentials in
`~/.config/enroot/.credentials` on the cluster (it does NOT read `NGC_KEY` from
the job env). Without them, auth-gated pulls fail with "Could not process JSON
input" at job startup. This runs once per (cluster, user). See
`references/ssh-setup.md` for the full check and the `printf | ssh` install
pattern that keeps `NGC_KEY` out of history, files, and chat output. Skip it for
public images.

Prerequisites


Before any job is submitted, the host running the TAO service or SDK must be able to log in to at least one host from `SLURM_HOSTNAME` over SSH without an interactive password prompt. The handler runs `sbatch`, `squeue`, `sacct`, `scancel`, and log tails non-interactively, so password or 2FA prompts will fail the job at submit or status time.

Set this up once per (host, login node, user) tuple: create an SSH keypair, install the public key on each login host, trust the host key, lock private-key permissions with `chmod 600`, and verify with `ssh -o BatchMode=yes ...`. See `references/ssh-setup.md` for the full step-by-step (including the `~/.ssh/config` alias, the container key-mount note, and the 2FA / `SSH_AUTH_SOCK` fallback). The same file holds the SSH failure remediation prompt to show the user when passwordless SSH fails.
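
The preflight check probes only the first entry of `SLURM_HOSTNAME`, while the requirement is that at least one listed host accepts a non-interactive login. A minimal sketch of walking the whole comma-separated list might look like the following. The names `first_reachable_host`, `ssh_probe`, and the `PROBE` override are hypothetical, introduced here so the loop can be exercised without a cluster; this is not the handler's actual failover logic.

```bash
# Hypothetical helper: print the first login host that passes the probe.
# PROBE names the check to run per host; it defaults to ssh_probe below.
first_reachable_host() {
  hosts="$1"                       # comma-separated list, e.g. "$SLURM_HOSTNAME"
  probe="${PROBE:-ssh_probe}"
  while [ -n "$hosts" ]; do
    host="${hosts%%,*}"            # text before the first comma
    case "$hosts" in
      *,*) hosts="${hosts#*,}" ;;  # drop the entry just taken
      *)   hosts="" ;;
    esac
    [ -n "$host" ] || continue     # tolerate empty entries such as "a,,b"
    if "$probe" "$host"; then
      printf '%s\n' "$host"
      return 0
    fi
  done
  return 1                         # no host accepted the connection
}

# Same flags as the preflight check: fail fast instead of prompting.
ssh_probe() {
  ssh -o BatchMode=yes -o ConnectTimeout=10 "${SLURM_USER}@$1" true \
    </dev/null 2>/dev/null
}
```

A caller could then write `SLURM_HOST=$(first_reachable_host "$SLURM_HOSTNAME")` and fall back to the SSH remediation prompt when the function returns non-zero.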

Credentials


  • SLURM_USER (required): SSH username for the login node. In microservices workspace metadata this is `cloud_specific_details.slurm_user`.
  • SLURM_HOSTNAME (required): Comma-separated login hostnames for failover. The microservices schema stores this as the list field `cloud_specific_details.slurm_hostname`.
  • SLURM_PARTITION (required): Partition list for GPU job submission. Ask for this in the mandatory SLURM intake list. The packaged default is `polar,polar3,polar4,grizzly`, which are treated as 4-hour queues.
  • SSH_KEY_PATH (preferred and expected before launch): Private key path for non-interactive public-key auth to the login node. If passwordless SSH fails, ask the user for `SSH_KEY_PATH=/path/to/private_key` and show the setup steps in `references/ssh-setup.md`; do not bury this behind several alternate choices.
  • SSH_AUTH_SOCK (advanced fallback): SSH agent socket with an accepted key already loaded. Prefer `SSH_KEY_PATH` in user-facing remediation prompts.
  • SLURM_BASE_RESULTS_DIR (optional): Base shared filesystem path. The default convention from `tao-core` is `/lustre/fsw/portfolios/edgeai/<your-dir>`, where `<your-dir>` is your per-user directory on the cluster.
  • SLURM_ACCOUNT (usually required by site policy): Account charged by `#SBATCH --account`.

Do not ask for `SLURM_ACCOUNT` or `SLURM_BASE_RESULTS_DIR` in the initial intake unless the user says their site requires an account, wants a custom results root, or the workflow cannot proceed without overriding defaults.
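
As an illustration of the intake rule, a small check could report which of the three required variables are still unset before anything else runs. This is only a sketch: `missing_slurm_vars` is a hypothetical name, and it deliberately ignores the optional variables so that they are never prompted for up front.

```bash
# Hypothetical intake check: print each required variable that is unset or empty.
# Optional settings (SLURM_ACCOUNT, SLURM_BASE_RESULTS_DIR) are not checked here.
missing_slurm_vars() {
  for name in SLURM_USER SLURM_HOSTNAME SLURM_PARTITION; do
    eval "value=\${$name:-}"       # indirect lookup of the variable called $name
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}
```

An empty result means the mandatory intake is complete; any printed name is something to ask the user for.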

Backend Details


Use `backend_details.backend_type = "slurm"` when routing a job to this platform. Supported backend details from the microservices schema:
json
{
  "backend_type": "slurm",
  "partition": "polar,polar3,polar4,grizzly",
  "cluster_name": "optional-name"
}
Runtime metadata is stored under `backend_details.slurm_metadata`, especially `slurm_job_id` and `job_dir`. Do not invent these values. They are written after `sbatch` returns a scheduler job id.

Storage


SLURM jobs run on the cluster, so local paths from the API host are not valid dataset paths. Prefer shared filesystem URIs:
  • Use `lustre:///absolute/path` for user-provided datasets on Lustre.
  • `slurm://` paths may appear in microservices metadata and are converted to actual Lustre paths before the container starts.
  • Avoid bare `/local/path` and `file://` dataset URIs for SLURM. Validation in `tao-core` rejects local and file paths for remote backends.

Accept either dataset roots or direct spec-key paths:
  • Root mode: `/lustre/.../<model>/train`, which model skills map to the required files, such as `<root>/annotations.json` for annotations and `<root>` itself as the media path.
  • Direct spec mode: exact fields such as `custom.train_dataset.annotation_path=/lustre/.../train.json` and `custom.train_dataset.media_path=/lustre/.../videos.tar.gz`.
After passwordless SSH succeeds and before generating scripts, validate each required dataset file/path from the login host:
bash
ssh -o BatchMode=yes <SLURM_USER>@<working-login-host> \
  'test -e /lustre/.../annotations.json && test -e /lustre/.../media_or_archive'
If the remote `test -e` fails, stop and ask for corrected paths or for the data to be staged onto shared cluster storage. Do not create runner scripts that will fail inside the first training job.
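
Before running the remote `test -e`, a dataset URI has to be turned into a plain path on the shared filesystem. The sketch below shows one way to do that and to reject local-only inputs early. It rests on two assumptions that are not confirmed by `tao-core`: that `lustre:///absolute/path` simply denotes `/absolute/path`, and that bare paths are acceptable only under `/lustre`. The name `dataset_uri_to_path` is hypothetical, and `slurm://` is left out because its conversion rule is not described here.

```bash
# Hypothetical helper: map a dataset URI to the path to test on the login host.
# Returns non-zero for inputs that cannot be reached from the cluster.
dataset_uri_to_path() {
  case "$1" in
    lustre:///*) printf '/%s\n' "${1#lustre:///}" ;;  # assumed: scheme + absolute path
    file://*)    return 1 ;;                          # local-only URI
    /lustre/*)   printf '%s\n' "$1" ;;                # assumed shared mount point
    *)           return 1 ;;                          # bare local path or unknown scheme
  esac
}
```

The resulting path can then be passed to the same `ssh -o BatchMode=yes ... 'test -e <path>'` check; a non-zero return from the helper is a reason to ask for corrected paths without contacting the cluster at all.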
Results default to:
text
/lustre/fsw/portfolios/edgeai/<your-dir>/results/<job_id>
`<your-dir>` is your per-user directory on the cluster.

The runner sets `TAO_API_RESULTS_DIR` to the parent results directory because container code appends the job id when writing status and artifacts.
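
The relationship between the two paths can be sketched as follows. Both function names are hypothetical, and the layout is taken only from the default results path shown above, so treat this as an illustration of why the exported value must stop at `results` and must not include the job id.

```bash
# Hypothetical helpers; $1 is the per-user base directory, $2 the job id.
# Where a job's status and artifacts end up: <base>/results/<job_id>
job_results_dir() { printf '%s/results/%s\n' "${1%/}" "$2"; }

# What the runner exports as TAO_API_RESULTS_DIR: the parent directory only,
# because the container code appends the job id itself.
tao_api_results_dir() { printf '%s/results\n' "${1%/}"; }
```

Exporting the per-job directory instead would lead the container to write under `results/<job_id>/<job_id>`.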
Use Lustre, not S3, for SLURM job inputs. SLURM's scheduler enforces a GPU-idle timeout: a long `s3://` download at the top of the script can burn the allocation before training begins, and the scheduler may kill the job. Stage training data onto Lustre first; S3 / HF / NGC pre-fetch is fine only for small auxiliary inputs (checkpoints, configs). See `references/sdk-usage.md` for the full rationale.

Container Execution


`tao-core` uses the SLURM handler to run TAO containers through Pyxis/Enroot:
  1. Stage compact JSON files for specs, environment, and cloud metadata under `<job_dir>/specs`, `<job_dir>/env`, and `<job_dir>/meta`.
  2. Optionally convert the Docker image to a cached SQSH image with `srun -n1 -p <conversion_partition> enroot import`.
  3. Write the sbatch script to `<job_dir>/sbatch/job_<job_id>.sbatch`.
  4. Submit it with `sbatch --export=ALL <script>`.
  5. Run the container with `srun --container-image=<image> --container-mounts=/lustre`.

Image formats accepted by the handler:
  • `/path/to/image.sqsh`
  • `registry#image:tag`
  • `docker://registry#image:tag`
  • ordinary `registry/image:tag`, which is converted to Pyxis form when needed

SQSH conversion is cached by image name. For `:latest` images, the cached SQSH is used unless `force_reconvert_latest` is enabled.
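
The conversion of an ordinary `registry/image:tag` reference to Pyxis form amounts to replacing the slash after the registry host with `#`. A rough sketch of that normalization is shown below. The name `to_pyxis_image` is hypothetical, and the rule used to recognize a registry host (the first path component contains a dot or a port, or is `localhost`) is the usual Docker convention, assumed here and not taken from the handler's code.

```bash
# Hypothetical helper: normalize an image reference for `srun --container-image`.
to_pyxis_image() {
  ref="$1"
  case "$ref" in
    /*.sqsh|*'#'*) printf '%s\n' "$ref"; return 0 ;;  # SQSH path or already Pyxis form
  esac
  first="${ref%%/*}"                  # text before the first slash
  case "$first" in
    "$ref")                           # no slash at all: bare image name
      printf '%s\n' "$ref" ;;
    *.*|*:*|localhost)                # assumed to be a registry host
      printf '%s#%s\n' "$first" "${ref#*/}" ;;
    *)                                # e.g. "org/image": no registry part
      printf '%s\n' "$ref" ;;
  esac
}
```

For example, `to_pyxis_image nvcr.io/org/image:tag` prints `nvcr.io#org/image:tag`, while references that are already in one of the first three accepted formats pass through unchanged.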

Resource Mapping


Defaults from `tao-core`:
  • `num_nodes`: 1
  • `num_gpus`: 4
  • `max_num_gpus_per_node`: 8
  • `cpus_per_task`: 16
  • `time_hours`: 4
  • `timeout_hours`: 3.8
  • `max_time_hours`: 4
  • `container_mounts`: `/lustre`
  • `use_requeue`: true
  • `use_sqsh`: true
When generating launchers or wrapper scripts for SLURM, set the wall-time defaults explicitly from the packaged platform resource defaults:
bash
export SLURM_TIME_HOURS="${SLURM_TIME_HOURS:-4}"
export SLURM_TIMEOUT_HOURS="${SLURM_TIMEOUT_HOURS:-3.8}"
Do not default to 12 hours on SLURM. If the user supplies a longer `SLURM_TIME_HOURS`, verify that the selected partition supports it before submitting. For the packaged default partition list `polar,polar3,polar4,grizzly`, reject requests above 4 hours and ask for a different partition only if the user actually wants a longer wall time.
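
The 4-hour rule can be enforced at the point where fractional hours are turned into a scheduler time string. The sketch below validates the request against a cap and prints it in the `HH:MM:SS` form that `sbatch --time` accepts. The helper name `slurm_time_arg` is hypothetical, and whether the handler passes `time_hours` or `timeout_hours` to the scheduler is not stated in this document, so the example makes no claim about that.

```bash
# Hypothetical helper: $1 = requested hours (may be fractional),
# $2 = partition cap in hours (defaults to the packaged 4-hour limit).
# Prints HH:MM:SS, or returns non-zero when the request is out of range.
slurm_time_arg() {
  hours="$1"
  max_hours="${2:-4}"
  awk -v h="$hours" -v max="$max_hours" 'BEGIN {
    if (h <= 0 || h > max) exit 1        # reject instead of silently clamping
    s = int(h * 3600 + 0.5)              # round to whole seconds
    printf "%02d:%02d:%02d\n", int(s / 3600), int((s % 3600) / 60), s % 60
  }'
}
```

With the packaged defaults, `slurm_time_arg "$SLURM_TIME_HOURS"` yields `04:00:00`, and a 12-hour request fails until a partition with a higher cap is supplied as the second argument.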
When `num_gpus` is greater than or equal to `max_num_gpus_per_node`, the handler treats the request as exclusive per node and computes additional nodes from total GPU count when necessary.
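
One plausible reading of "computes additional nodes from total GPU count" is ceiling division of the GPU request by the per-node maximum. The arithmetic is sketched below as an assumption about the handler, not a copy of it; `nodes_for_gpus` is a hypothetical name.

```bash
# Hypothetical arithmetic: nodes needed for a GPU request, rounding up.
# $1 = total GPUs requested, $2 = max GPUs per node (default 8).
nodes_for_gpus() {
  num_gpus="$1"
  max_per_node="${2:-8}"
  echo $(( (num_gpus + max_per_node - 1) / max_per_node ))
}
```

Under this reading, the default request of 4 GPUs stays on one shared node, 8 GPUs take one node exclusively, and 16 GPUs need two nodes.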
For multi-node jobs (`num_nodes > 1`), the sbatch script exports `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`, `NODE_RANK`, and `NUM_GPU_PER_NODE`. Cosmos-RL has special multi-node role handling for controller, policy, and rollout workers. See `references/multi-node.md` for the full sbatch directives, the rendezvous env-var table and contract, and cluster requirements.

Monitoring


  • Scheduler status comes from the stored SLURM job id via `squeue` or `sacct`.
  • TAO terminal status comes from `status.json` in the shared results folder.
  • If the user enabled chat monitoring, continue polling at the requested interval while the job is `PENDING`, `RUNNING`, or otherwise non-terminal. Do not stop after a fixed elapsed time such as 30 minutes; long queue waits are normal on shared GPU partitions.
  • Do not send a final response for a non-terminal SLURM job when chat monitoring is enabled. A final response is a detach action; use it only if the user asked to detach/stop or the job reached terminal state.
  • Logs are read over SSH from:
text
<job_dir>/slurm-logs/<slurm_job_name>-<slurm_job_id>/main.out
<job_dir>/slurm-logs/<slurm_job_name>-<slurm_job_id>/main.err
Status mapping:
  • `PENDING` -> `Pending`
  • `RUNNING` or `COMPLETING` -> `Running`
  • `COMPLETED` -> check `status.json`
  • `FAILED`, `BOOT_FAIL`, `DEADLINE`, `OUT_OF_MEMORY`, `NODE_FAIL` -> retry if logs match retriable infrastructure patterns, otherwise `Error`
  • `CANCELLED`, `PREEMPTED`, `REVOKED` -> `Canceled`
  • `TIMEOUT` -> `Error`
  • `SUSPENDED`, `STOPPED` -> `Paused`
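
The mapping above can be written as a single `case` statement. The sketch below is illustrative only: `map_slurm_state` is a hypothetical name, and the tokens `CheckStatusJson` and `RetryOrError` are placeholders invented here for the two rows whose outcome depends on `status.json` or on log inspection. It also assumes that `sacct` may append text such as `by <uid>` or a trailing `+` to a state, which is stripped before matching.

```bash
# Hypothetical helper: print the TAO-side status for a raw SLURM job state.
map_slurm_state() {
  state="${1%% *}"     # keep the first word, e.g. "CANCELLED by 1234" -> CANCELLED
  state="${state%+}"   # drop the "+" that marks a truncated sacct column
  case "$state" in
    PENDING)                     echo Pending ;;
    RUNNING|COMPLETING)          echo Running ;;
    COMPLETED)                   echo CheckStatusJson ;;  # placeholder: read status.json
    FAILED|BOOT_FAIL|DEADLINE|OUT_OF_MEMORY|NODE_FAIL)
                                 echo RetryOrError ;;     # placeholder: inspect logs
    CANCELLED|PREEMPTED|REVOKED) echo Canceled ;;
    TIMEOUT)                     echo Error ;;
    SUSPENDED|STOPPED)           echo Paused ;;
    *)                           return 1 ;;              # unknown state: let the caller decide
  esac
}
```

Returning non-zero for unrecognized states keeps the caller from silently treating a new scheduler state as terminal.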

Cancellation


Cancel by looking up `backend_details.slurm_metadata.slurm_job_id` and running `scancel <slurm_job_id>` over SSH. Treat missing or already terminated SLURM jobs as successful cancellation.
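
Treating a missing or finished job as a successful cancel means looking at why `scancel` failed, not only at its exit status. A minimal sketch of that decision is given below. `cancel_succeeded` is a hypothetical helper, and the two message fragments it matches are assumptions about typical `scancel` error text that may differ between Slurm versions and sites.

```bash
# Hypothetical helper: decide whether an scancel attempt counts as success.
# $1 = scancel exit status, $2 = captured stderr text.
cancel_succeeded() {
  [ "$1" -eq 0 ] && return 0
  case "$2" in
    *"Invalid job id"*)                    return 0 ;;  # assumed text: job unknown or purged
    *"already completing or completed"*)   return 0 ;;  # assumed text: job already finished
  esac
  return 1                                 # anything else is a real failure
}
```

A caller would capture stderr from the SSH invocation of `scancel <slurm_job_id>` and pass it in together with the exit status.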

Multi-node training (distributed)


SLURM is the platform of choice for large multi-node runs: pass `num_nodes > 1` and the SDK handles the sbatch directives and PyTorch-distributed env vars automatically. See `references/multi-node.md` for a worked `create_job` example, the generated sbatch directives, the rendezvous env-var table (`WORLD_SIZE`, `NUM_GPU_PER_NODE`, `NODE_RANK`, `MASTER_ADDR`, `MASTER_PORT`), the Cosmos-RL role note, cluster requirements (Pyxis/Enroot, InfiniBand/NVLink, Lustre), and upstream reference links.

Running via the TAO SDK


The SDK install is covered in Preflight: `pip install 'nvidia-tao-sdk[slurm]'`. Use it when you want Job handles, the sbatch/`squeue`/`sacct` plumbing handled for you, run-folder durability via `ActionWorkflow`, or convenient cloud-storage I/O (`s3://`, `hf_model://`, `ngc://`). Without the SDK, drive `sbatch` and `srun` yourself.

Auto-retry is fully automatic: a background monitor polls `squeue`/`sacct` and resubmits the staged script with `sbatch` on infrastructure-looking failures, up to `MAX_JOB_RETRIES = 10` times, while plain training failures surface immediately. In addition, `#SBATCH --requeue` is set by default (`SLURM_USE_REQUEUE`, defaults to `true`). See `references/sdk-usage.md` for the `SlurmSDK` / `build_entrypoint` code example, the Lustre-not-S3 rule, the retriable-failure classification, and the full auto-retry and requeue behavior.

Failure Modes


Common failures: SSH auth failure, local dataset path rejected, SQSH conversion timeout, Pyxis/Enroot unavailable, and bad-node / transient GPU failures (which the handler retries up to the configured limit). See `references/troubleshooting.md` for the diagnosis and remediation of each.