tao-run-on-kubernetes

Kubernetes

Submits TAO container jobs as Kubernetes Jobs. Works on any cluster reachable via kubeconfig (EKS / GKE / AKS / on-prem) or in-cluster service account (when the SDK runs inside a pod).

Single-pod by default; opt into multi-node distributed training via `num_nodes > 1` (uses an Indexed Job plus a headless Service; see Multi-node training below).

Preflight

Four checks: GPU host runtime ready, SDK installed, cluster reachable, GPU Operator/device plugin present.

0. GPU node host runtime.

Run this on each self-managed GPU worker node or in the node image build. Set `TAO_K8S_SKIP_NODE_RUNTIME_CHECK=1` only when using managed GPU nodes whose driver/toolkit lifecycle is owned by the cloud provider or GPU Operator policy.

bash

if [ "${TAO_K8S_SKIP_NODE_RUNTIME_CHECK:-0}" != "1" ]; then
  TAO_SKILL_BANK_ROOT="${TAO_SKILL_BANK_ROOT:-$PWD}"
  SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
  [ -x "$SETUP_SCRIPT" ] || SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"

  bash "$SETUP_SCRIPT" --backend kubernetes --check-only || {
    echo "MISSING: TAO Kubernetes GPU node runtime is not ready."
    echo "For self-managed GPU nodes, run after user approval:"
    echo "  bash \"$SETUP_SCRIPT\" --backend kubernetes --install --yes"
    echo "For managed clusters, verify the node image/GPU Operator policy installs driver 580 and toolkit 1.19.0, then set TAO_K8S_SKIP_NODE_RUNTIME_CHECK=1."
    exit 1
  }
fi

1. SDK + kubernetes extra installed.

`nvidia-tao-sdk` is on public PyPI; the version pin lives in `versions.yaml` (`wheels.tao_sdk_kubernetes`).

bash

PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_kubernetes)
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}
python -c "import kubernetes" 2>/dev/null || {
  echo "MISSING: kubernetes extra not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}

2. Cluster reachable (kubeconfig OR in-cluster service account)

bash

python -c "from kubernetes import config; config.load_kube_config()" 2>/dev/null ||
python -c "from kubernetes import config; config.load_incluster_config()" 2>/dev/null || {
  echo "MISSING: no kubeconfig at ~/.kube/config and not running in a pod."
  echo "Configure kubectl (e.g., 'aws eks update-kubeconfig --name my-cluster') or set \$KUBECONFIG."
  exit 1
}

3. NVIDIA GPU Operator present (soft check — warn if kubectl available, don't fail)

bash

if command -v kubectl >/dev/null 2>&1; then
  gpu=$(kubectl get nodes -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' 2>/dev/null | grep -v '^$' | head -1)
  if [ -z "$gpu" ] || [ "$gpu" = "0" ]; then
    echo "WARN: no nvidia.com/gpu allocatable on this cluster."
    echo "Install the NVIDIA GPU Operator before submitting GPU jobs:"
    echo "  https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html"
  fi
fi


The GPU node runtime check is mandatory for self-managed nodes. For managed
clusters where the client is not running on a GPU worker, verify the provider
node image or GPU Operator policy and set `TAO_K8S_SKIP_NODE_RUNTIME_CHECK=1`
instead of running the installer on the client. The final GPU capacity check is
a warning rather than a hard failure because `kubectl` isn't always installed. The
SDK enforces a hard guard inside `KubernetesSDK.create_job()`, which uses the
kubernetes Python client to verify GPU capacity before submitting.
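
The capacity guard described above can be approximated as follows. This is a minimal sketch, not the SDK's actual implementation: the function names are hypothetical, and it assumes the guard simply sums `nvidia.com/gpu` over every node's allocatable map. The `cluster_allocatables` helper needs the `kubernetes` package and a reachable cluster; the other two functions are pure Python.

```python
from typing import Iterable, Mapping


def total_allocatable_gpus(allocatables: Iterable[Mapping[str, str]]) -> int:
    """Sum nvidia.com/gpu over per-node allocatable maps (missing key counts as 0)."""
    return sum(int(a.get("nvidia.com/gpu", "0")) for a in allocatables)


def has_gpu_capacity(allocatables, gpu_count: int, num_nodes: int = 1) -> bool:
    """True when gpu_count * num_nodes fits within the cluster-wide total."""
    return gpu_count * num_nodes <= total_allocatable_gpus(allocatables)


def cluster_allocatables():
    """Read allocatable maps from the live cluster (hypothetical helper)."""
    from kubernetes import client, config  # imported lazily: optional dependency

    config.load_kube_config()
    nodes = client.CoreV1Api().list_node().items
    return [node.status.allocatable or {} for node in nodes]
```

A cluster-wide sum says nothing about whether a single node can hold `gpu_count` GPUs, so a job can pass this check and still stay Pending.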

Credentials & configuration

- Kubeconfig (one of):
  - `~/.kube/config`: default discovery path
  - `$KUBECONFIG`: alternate path
  - In-cluster service account: used when running inside a pod (no kubeconfig needed)
- `TAO_K8S_NAMESPACE` (optional): default namespace for Job submission. Defaults to `default`.
- `TAO_K8S_CONTEXT` (optional): kubeconfig context name to switch clusters.
- `NGC_KEY` (optional): for nvcr.io image pulls. If you've pre-created an image-pull secret in the target namespace, pass its name to `create_job` via the `image_pull_secret` argument.
- `ACCESS_KEY` / `SECRET_KEY` / `S3_BUCKET_NAME` / `S3_ENDPOINT_URL` (optional): for S3 dataset I/O via the SDK's `inputs` / `outputs` `script_runner` wrapping.

Do not ask for Lepton, Brev, or SLURM credentials for Kubernetes runs. Ask for S3 credentials only when the selected workflow uses `s3://` inputs or outputs, and ask for model-specific credentials such as `HF_TOKEN` only when the selected model requires them. Before launch, verify that the selected namespace can create Jobs, that dataset/result paths are visible from the pod, and that PVC/mounted filesystem paths are proven to be mounted into the job container; a path that exists only on the agent host is not sufficient proof.
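
The "ask only for what the workflow needs" rule can be sketched as a small helper. The function name and the `model_needs_hf_token` flag are hypothetical; the environment variable names and the `inputs` / `outputs` shapes are the ones used elsewhere in this document.

```python
def credentials_to_request(inputs, outputs, model_needs_hf_token=False):
    """Return the env var names worth asking the user for, per the rule above.

    inputs:  mapping of container path -> source URI (as passed to create_job)
    outputs: list of output paths or URIs
    """
    locations = list(inputs.keys()) + list(inputs.values()) + list(outputs)
    needed = []
    # S3 credentials only when something actually lives on s3://
    if any(str(loc).startswith("s3://") for loc in locations):
        needed += ["ACCESS_KEY", "SECRET_KEY", "S3_BUCKET_NAME", "S3_ENDPOINT_URL"]
    # Model-specific credentials only when the selected model requires them
    if model_needs_hf_token:
        needed.append("HF_TOKEN")
    return needed
```

Lepton, Brev, and SLURM credentials never appear in the result, which matches the rule that they are not requested for Kubernetes runs.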

SDK API

K8s is SDK-only: there is no `kubectl`-only launch path. Read `tao-skill-bank:tao-run-platform` before drafting `create_job` calls; it covers `build_entrypoint`, the shared kwarg contract, monitoring, and `ActionWorkflow`.

python

import os

from tao_sdk.platforms.kubernetes import KubernetesSDK

sdk = KubernetesSDK()  # auto-detects auth
job = sdk.create_job(
    image='nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt',
    command='dino train -e /tmp/spec.yaml',
    gpu_count=1,
    env_vars={'NGC_KEY': os.environ['NGC_KEY']},
    inputs={'/data/train.json': 's3://bucket/coco/train.json'},
    outputs=['/results/'],
    namespace='tao-jobs',                       # optional override
    image_pull_secret='ngc-pull-secret',         # optional, pre-created
    node_selector={'gpu-type': 'h100'},          # optional
)

The SDK constructs a `V1Job` with:

- `spec.template.spec.containers[0]`: the requested image and `command=["/bin/bash", "-c", <command>]`.
- `resources.limits["nvidia.com/gpu"]: <gpu_count>`: schedules onto GPU nodes via the NVIDIA Device Plugin / GPU Operator.
- `env_vars` flowed through, plus auto-injected S3/NGC/HF credentials for `script_runner`.
- `restart_policy=Never` and `backoff_limit=0`: failures surface to the user instead of silently retrying.
- `ttl_seconds_after_finished=3600`: the Job is cleaned up automatically 1 hour after reaching a terminal state.
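
For orientation, the fields listed above correspond roughly to the following `batch/v1` Job manifest, written here as a plain dict. This is an illustrative sketch rather than the SDK's code: `build_job_manifest` and the container name `tao` are hypothetical, and credential injection and `script_runner` wrapping are left out.

```python
def build_job_manifest(name, image, command, gpu_count, env_vars=None, namespace="default"):
    """Build a single-pod Job manifest mirroring the properties listed above."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "backoffLimit": 0,                # fail fast, no silent retries
            "ttlSecondsAfterFinished": 3600,  # auto-clean 1 hour after finishing
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "tao",  # hypothetical container name
                            "image": image,
                            "command": ["/bin/bash", "-c", command],
                            "env": [
                                {"name": key, "value": value}
                                for key, value in (env_vars or {}).items()
                            ],
                            "resources": {"limits": {"nvidia.com/gpu": str(gpu_count)}},
                        }
                    ],
                }
            },
        },
    }
```

Dumping the result with `json.dumps` is a convenient way to compare it against what `kubectl get job <name> -o json` reports for a Job the SDK actually created.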

Status & monitoring

python

status = sdk.get_job_status(job.id)
# status.status ∈ {"Pending", "Running", "Complete", "Error", "Canceled", "Unknown"}

logs = sdk.get_job_logs(job.id, tail=200)  # concatenates logs from all pods of the Job

# For stuck-Pending jobs: replica diagnostics
for r in sdk.get_job_replicas(job.id):
    issue = r["status"].get("readiness_issue")
    if issue:
        print(issue["reason"], issue["message"])
        # e.g. "ImagePullBackOff" / "Back-off pulling image..."
        # e.g. "Pending" / "0/3 nodes available: 3 Insufficient nvidia.com/gpu"

# On failure:
analysis = sdk.get_failure_analysis(job.id)
# {"err_class": "ERR_PROGRAM" | "ERR_INFRA",
#  "suggestion": "Container OOM-killed. Reduce batch size...",
#  "job_failure_by_node_event": [{"node_event_name": "OOMKilled", ...}]}
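
A polling loop built on the status call might look like the sketch below. `wait_for_job` is a hypothetical helper, not part of the SDK. It assumes `Complete`, `Error`, and `Canceled` are the terminal states, and it takes the status getter as an argument so it can be used with `sdk.get_job_status` or with a stub.

```python
import time

# Assumed terminal states, taken from the status values listed above.
TERMINAL_STATES = {"Complete", "Error", "Canceled"}


def wait_for_job(get_status, job_id, poll_seconds=30.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Poll get_status(job_id).status until it is terminal, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_status(job_id).status
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {state} after {timeout_seconds}s")
        sleep(poll_seconds)
```

Usage would be `wait_for_job(sdk.get_job_status, job.id)`. If the result is `Error`, `get_failure_analysis` is the natural next call, and a job that stays `Pending` until the timeout is a cue to inspect `get_job_replicas`.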

Cancel & cleanup

python

sdk.cancel_job(job.id)  # delete_namespaced_job with propagation_policy="Foreground"

`ttl_seconds_after_finished=3600` means completed Jobs auto-delete after 1h. To cancel an in-flight Job, call `cancel_job`, which deletes the Job and its pods immediately.

GPU Operator dependency

The SDK refuses to submit GPU jobs to a cluster with no allocatable `nvidia.com/gpu`. For self-managed clusters, first run the `tao-setup-nvidia-gpu-host` install action on every GPU worker node or bake the same package set into the node image:

bash

bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend kubernetes --install --yes

Then install the NVIDIA GPU Operator or device plugin:

bash

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator

Full guide: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html

Multi-node training (distributed)

Pass `num_nodes > 1` to `create_job()` to run distributed training across N pods. The SDK provisions:

- A headless Service named after the Job (selector: `job-name=<job-name>`, `clusterIP: None`, `publishNotReadyAddresses: true` so pods can rendezvous before they're all Ready).
- An Indexed Job with `parallelism = completions = num_nodes` and `completionMode: Indexed`. Each pod gets `JOB_COMPLETION_INDEX` injected by k8s automatically (= the node rank).
- A command wrapper that exports the rendezvous env vars before invoking the user command. Two naming conventions are exported simultaneously:

| Env var | Value | Read by |
| --- | --- | --- |
| `WORLD_SIZE` | `num_nodes` | TAO PyTorch container's `nvidia_tao_pytorch/core/entrypoint.py` (uses this to mean node count, even though PyTorch's own convention is total processes) |
| `NUM_GPU_PER_NODE` | `gpu_count` | TAO PyTorch container's entrypoint |
| `NNODES` | `num_nodes` | `torchrun` and PyTorch-standard rendezvous |
| `NPROC_PER_NODE` | `gpu_count` | `torchrun` |
| `NODE_RANK` | `$JOB_COMPLETION_INDEX` | both |
| `MASTER_ADDR` | `<job-name>-0.<job-name>` (pod-0's DNS) | both |
| `MASTER_PORT` | `29500` | both (TAO's default) |

Both naming conventions are set so TAO entrypoints (`dino train`, etc.) and raw `torchrun` commands work without modification.
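
The command wrapper can be pictured as a function that prepends `export` lines to the user command. This is a sketch under assumptions: `wrap_command` is a hypothetical name, and the SDK's real wrapper may quote or order things differently. The variable names and values are the ones in the table above.

```python
def wrap_command(job_name, command, num_nodes, gpu_count, master_port=29500):
    """Return a shell script that exports both naming conventions, then runs command."""
    exports = {
        # TAO entrypoint convention
        "WORLD_SIZE": str(num_nodes),
        "NUM_GPU_PER_NODE": str(gpu_count),
        # torchrun / PyTorch-standard convention
        "NNODES": str(num_nodes),
        "NPROC_PER_NODE": str(gpu_count),
        # Shared: rank comes from the Indexed Job, resolved at pod runtime
        "NODE_RANK": "$JOB_COMPLETION_INDEX",
        "MASTER_ADDR": f"{job_name}-0.{job_name}",  # pod-0 via the headless Service
        "MASTER_PORT": str(master_port),
    }
    lines = [f'export {name}="{value}"' for name, value in exports.items()]
    return "\n".join(lines + [command])
```

Because `NODE_RANK` is written as `"$JOB_COMPLETION_INDEX"` inside double quotes, bash expands it inside each pod, so every replica sees its own rank even though all pods run the same script.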

python

job = sdk.create_job(
    image='nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt',
    command='dino train -e /tmp/spec.yaml',  # TAO entrypoint reads spec.train.num_nodes; env vars are wired by the container
    gpu_count=8,           # GPUs per node
    num_nodes=4,           # 4 × 8 = 32 GPUs total
    inputs={'/data/train.json': 's3://bucket/coco/train.json'},
    outputs=['/results/'],
)

For raw `torchrun`-based commands (non-TAO containers):

python

job = sdk.create_job(
    image='nvcr.io/nvidia/pytorch:25.08-py3',
    command='torchrun --nnodes=$NNODES --nproc-per-node=$NPROC_PER_NODE --node-rank=$NODE_RANK '
            '--master-addr=$MASTER_ADDR --master-port=$MASTER_PORT train.py',
    gpu_count=8,
    num_nodes=4,
)

The capacity check sums across nodes: `gpu_count × num_nodes` must be ≤ the cluster's total allocatable `nvidia.com/gpu`.
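
To make the multi-node provisioning concrete, the two objects can be sketched as plain manifests. The function names and the port name `rendezvous` are hypothetical, and the SDK's actual output may differ. Setting `subdomain` on the pod template is an assumption based on how Kubernetes resolves `<pod-hostname>.<service>` names for headless Services; this document does not state that the SDK sets it.

```python
def headless_service(job_name, port=29500):
    """Headless Service selecting the Job's pods, published before readiness."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": job_name},
        "spec": {
            "clusterIP": "None",
            "publishNotReadyAddresses": True,
            "selector": {"job-name": job_name},
            "ports": [{"name": "rendezvous", "port": port}],  # hypothetical port name
        },
    }


def indexed_job_spec(job_name, num_nodes, pod_spec):
    """Job spec fields for an Indexed Job with one pod per node rank."""
    return {
        "completionMode": "Indexed",
        "completions": num_nodes,
        "parallelism": num_nodes,
        "backoffLimit": 0,
        "template": {
            "spec": {
                **pod_spec,
                "restartPolicy": "Never",
                # Assumption: gives pods DNS names like <job-name>-0.<job-name>
                "subdomain": job_name,
            }
        },
    }
```

With `num_nodes=4` and `gpu_count=8`, this yields four pods that each request 8 GPUs, which is where the `4 × 8 = 32` total in the capacity check comes from.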

Cluster requirements for multi-node

- k8s 1.28+ is required for stable pod hostnames in Indexed Jobs (the `PodIndexLabel` feature). On older clusters the `MASTER_ADDR=<job>-0.<svc>` DNS lookup fails. Verify with `kubectl version`.
- Pod-to-pod networking must be open on port 29500 (PyTorch default; configurable via the `MASTER_PORT` env var). Most CNIs (Calico, Cilium, AWS VPC CNI) allow this by default; restrictive NetworkPolicies must be relaxed.
- NCCL in the container talks GPU-to-GPU; if the cluster has multi-NIC nodes or RDMA, set `NCCL_SOCKET_IFNAME` / `NCCL_IB_HCA` via `env_vars`.
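
The version requirement above can be checked programmatically. The helper below is a hypothetical sketch that compares a server `gitVersion` string (for example the `serverVersion.gitVersion` field printed by `kubectl version -o json`) against the 1.28 minimum stated in this section.

```python
import re


def meets_min_version(git_version, minimum=(1, 28)):
    """True when a version string such as 'v1.29.4-eks-036c24b' is >= minimum."""
    match = re.match(r"v?(\d+)\.(\d+)", git_version)
    if not match:
        raise ValueError(f"unrecognized Kubernetes version: {git_version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum
```

Comparing `(major, minor)` tuples avoids the string-comparison trap where `"1.9"` sorts after `"1.28"`.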

Reference reading

Kubernetes Indexed Job: https://kubernetes.io/docs/concepts/workloads/controllers/job/#completion-mode
Indexed Job for batch ML: https://kubernetes.io/blog/2022/06/01/indexed-jobs-mpi/
PyTorch distributed (env-var rendezvous): https://pytorch.org/docs/stable/elastic/run.html
NCCL networking tuning (NCCL_SOCKET_IFNAME, NCCL_IB_HCA): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

When to use a Kubernetes operator instead

For more sophisticated topologies (gang scheduling, PyTorch elastic / fault-tolerant training, MPI / Horovod, RDMA setup), reach for an operator instead of plain Indexed Job:

- MPI Operator (https://github.com/kubeflow/mpi-operator): for MPI / Horovod workloads.
- Kubeflow Training Operator (`PyTorchJob`, `TFJob`; https://www.kubeflow.org/docs/components/training/): for elastic PyTorch training with built-in restart logic.
- Volcano (https://volcano.sh/): gang scheduling, queues, fair-share. Useful in shared multi-tenant clusters.
- Kueue (https://kueue.sigs.k8s.io/): quota / queue layer on top of any of the above.

The TAO SDK's Indexed Job path is intentionally simple and dependency-free; if you need elastic restart or gang scheduling, layer one of these on top and submit jobs through the operator's CRD instead.

Common error patterns

- `No nvidia.com/gpu resources allocatable on the cluster`: the GPU Operator (or NVIDIA Device Plugin) isn't installed. Install per the link above; verify with `kubectl get nodes -o jsonpath='{.items[*].status.allocatable}'`.
- `ImagePullBackOff` / `ErrImagePull`: the cluster can't pull the image. For nvcr.io, pre-create an image-pull secret in the namespace and pass its name via the `image_pull_secret` argument:

bash

kubectl create secret docker-registry ngc-pull-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_KEY" -n tao-jobs

- Pod stays `Pending` forever: `get_job_replicas(job_id)` will show the `readiness_issue`. Common causes: insufficient GPU capacity (`Insufficient nvidia.com/gpu`), no node matches `node_selector`, missing image-pull secret, or PVC mount failure.
- `OOMKilled` (exit 137): the container ran out of memory. Reduce batch size, lower `max_length`, or add a memory request/limit and target a larger node.
- `CredentialError: Could not authenticate to a Kubernetes cluster`: neither kubeconfig nor in-cluster auth worked. Run `kubectl get nodes` to verify your config, or set `$KUBECONFIG` to the right path.

What this skill does NOT support (yet)

- Elastic / fault-tolerant training. The Indexed Job has `backoff_limit=0`, so any failure fails the whole training run. For elastic restart (e.g., resume from checkpoint after a node death), use Kubeflow's `PyTorchJob` operator instead.
- Gang scheduling. Indexed Job pods are scheduled independently, with no all-or-nothing guarantee. Multi-node training will partially start if only some pods can be scheduled (rank-0 will hang waiting for peers). For all-or-nothing scheduling on shared clusters, use Volcano or Kueue.
- MPI / Horovod. Use the MPI Operator. The Indexed Job path here is PyTorch-distributed-shaped (env-var rendezvous on `MASTER_ADDR:MASTER_PORT`).
- Persistent volumes for shared storage. S3 only, via the `script_runner`. PVC support is a follow-up.
- Auto-creating image-pull secrets from `$NGC_KEY`. You pre-create the secret in the target namespace and pass the name. Lepton does this automatically; we don't here because k8s namespace conventions vary widely.
