tao-run-on-kubernetes

Kubernetes

Submits TAO container jobs as Kubernetes Jobs. Works on any cluster reachable via kubeconfig (EKS / GKE / AKS / on-prem) or in-cluster service account (when the SDK runs inside a pod).
Single-pod by default; opt into multi-node distributed training via `num_nodes > 1` (uses Indexed Job + headless Service, see Multi-node training below).

Preflight

Four checks: GPU host runtime ready, SDK installed, cluster reachable, GPU Operator/device plugin present.

0. GPU node host runtime.

Run this on each self-managed GPU worker node or in the node image build. Set `TAO_K8S_SKIP_NODE_RUNTIME_CHECK=1` only when using managed GPU nodes whose driver/toolkit lifecycle is owned by the cloud provider or GPU Operator policy.

bash
if [ "${TAO_K8S_SKIP_NODE_RUNTIME_CHECK:-0}" != "1" ]; then
  TAO_SKILL_BANK_ROOT="${TAO_SKILL_BANK_ROOT:-$PWD}"
  SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
  # Fall back to the alternate layout if the first path is not executable.
  [ -x "$SETUP_SCRIPT" ] || SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
  bash "$SETUP_SCRIPT" --backend kubernetes --check-only || {
    echo "MISSING: TAO Kubernetes GPU node runtime is not ready."
    echo "For self-managed GPU nodes, run after user approval:"
    echo "  bash \"$SETUP_SCRIPT\" --backend kubernetes --install --yes"
    echo "For managed clusters, verify the node image/GPU Operator policy installs driver 580 and toolkit 1.19.0, then set TAO_K8S_SKIP_NODE_RUNTIME_CHECK=1."
    exit 1
  }
fi

1. SDK + kubernetes extra installed.

`nvidia-tao-sdk` is on public PyPI; the pin lives in `versions.yaml` (`wheels.tao_sdk_kubernetes`).

bash
PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_kubernetes)
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}
python -c "import kubernetes" 2>/dev/null || {
  echo "MISSING: kubernetes extra not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}

2. Cluster reachable (kubeconfig OR in-cluster service account)

bash
python -c "from kubernetes import config; config.load_kube_config()" 2>/dev/null ||
python -c "from kubernetes import config; config.load_incluster_config()" 2>/dev/null || {
  echo "MISSING: no kubeconfig at ~/.kube/config and not running in a pod."
  # The dollar sign is escaped so the variable name is printed literally.
  echo "Configure kubectl (e.g., 'aws eks update-kubeconfig --name my-cluster') or set \$KUBECONFIG."
  exit 1
}

3. NVIDIA GPU Operator present (soft check: warn if kubectl is available, don't fail)

bash
if command -v kubectl >/dev/null 2>&1; then
  # Dots inside the resource name must be escaped in kubectl JSONPath.
  gpu=$(kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' \
    2>/dev/null | grep -v '^$' | head -1)
  if [ -z "$gpu" ] || [ "$gpu" = "0" ]; then
    echo "WARN: no nvidia.com/gpu allocatable on this cluster."
    echo "Install the NVIDIA GPU Operator before submitting GPU jobs:"
    echo "  https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html"
  fi
fi

The GPU node runtime check is mandatory for self-managed nodes. For managed
clusters where the client is not running on a GPU worker, verify the provider
node image or GPU Operator policy and set `TAO_K8S_SKIP_NODE_RUNTIME_CHECK=1`
instead of running the installer on the client. The final GPU capacity check is
a warning rather than a hard fail because `kubectl` isn't always installed. The
hard guard lives in `KubernetesSDK.create_job()`, which uses the kubernetes
Python client to verify GPU capacity before submitting.
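The soft check above looks only at the first non-empty value that `kubectl` prints. As a rough illustration of a cluster-wide check, the sketch below sums allocatable GPUs across every node in the JSON returned by `kubectl get nodes -o json`. The helper name `total_allocatable_gpus` and the sample node list are assumptions made for this example; this is not the SDK's own guard.

```python
import json


def total_allocatable_gpus(node_list: dict) -> int:
    """Sum status.allocatable["nvidia.com/gpu"] over all nodes in a NodeList."""
    total = 0
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Quantities arrive as strings; nodes without the device plugin omit the key.
        total += int(allocatable.get("nvidia.com/gpu", "0"))
    return total


# Hypothetical two-node cluster: one CPU-only node and one 8-GPU node.
sample = json.loads("""
{"items": [
  {"status": {"allocatable": {"cpu": "16"}}},
  {"status": {"allocatable": {"cpu": "96", "nvidia.com/gpu": "8"}}}
]}
""")
print(total_allocatable_gpus(sample))  # prints 8
```

In practice the input would come from `json.loads(subprocess.check_output(["kubectl", "get", "nodes", "-o", "json"]))`, which needs a reachable cluster.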

Credentials & configuration

  • Kubeconfig (one of):
    • `~/.kube/config`: default discovery path
    • `$KUBECONFIG`: alternate path
    • In-cluster service account: used when running inside a pod (no kubeconfig needed)
  • `TAO_K8S_NAMESPACE` (optional): default namespace for Job submission. Defaults to `default`.
  • `TAO_K8S_CONTEXT` (optional): kubeconfig context name to switch clusters.
  • `NGC_KEY` (optional): for nvcr.io image pulls. If you've pre-created an image-pull secret in the target namespace, pass its name to `create_job` via the `image_pull_secret` argument.
  • `ACCESS_KEY` / `SECRET_KEY` / `S3_BUCKET_NAME` / `S3_ENDPOINT_URL` (optional): for S3 dataset I/O via the SDK's `inputs` / `outputs` `script_runner` wrapping.
Do not ask for Lepton, Brev, or SLURM credentials for Kubernetes runs. Ask for S3 credentials only when the selected workflow uses `s3://` inputs or outputs, and ask for model-specific credentials such as `HF_TOKEN` only when the selected model requires them. Before launch, verify that the selected namespace can create Jobs, that dataset/result paths are visible from the pod, and that PVC/mounted filesystem paths are proven to be mounted into the job container; an agent-host local path is not sufficient proof.
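The rules above can be sketched as two small helpers: one that resolves the optional namespace and context variables, and one that decides which credentials are worth asking for. Both function names and the `model_needs_hf_token` flag are hypothetical and exist only for this illustration; the SDK may resolve these settings differently.

```python
import os


def resolve_k8s_settings(env=os.environ):
    """Return (namespace, context) from the optional TAO_K8S_* variables."""
    namespace = env.get("TAO_K8S_NAMESPACE") or "default"
    context = env.get("TAO_K8S_CONTEXT") or None  # None means the current context
    return namespace, context


def credentials_to_request(inputs, outputs, model_needs_hf_token=False):
    """List the env var names to ask the user for, following the rules above."""
    uris = list(inputs.values()) + list(outputs)
    needed = []
    # S3 credentials only matter when some input or output is an s3:// URI.
    if any(str(uri).startswith("s3://") for uri in uris):
        needed += ["ACCESS_KEY", "SECRET_KEY", "S3_BUCKET_NAME", "S3_ENDPOINT_URL"]
    # Model-specific credentials only when the selected model requires them.
    if model_needs_hf_token:
        needed.append("HF_TOKEN")
    return needed


print(resolve_k8s_settings({}))
print(credentials_to_request({"/data/train.json": "s3://bucket/coco/train.json"}, ["/results/"]))
```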

SDK API

K8s is SDK-only: there is no `kubectl`-only launch path. Read `tao-skill-bank:tao-run-platform` before drafting `create_job` calls; it covers `build_entrypoint`, the shared kwarg contract, monitoring, and `ActionWorkflow`.
python
import os

from tao_sdk.platforms.kubernetes import KubernetesSDK

sdk = KubernetesSDK()  # auto-detects auth
job = sdk.create_job(
    image='nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt',
    command='dino train -e /tmp/spec.yaml',
    gpu_count=1,
    env_vars={'NGC_KEY': os.environ['NGC_KEY']},
    inputs={'/data/train.json': 's3://bucket/coco/train.json'},
    outputs=['/results/'],
    namespace='tao-jobs',                       # optional override
    image_pull_secret='ngc-pull-secret',         # optional, pre-created
    node_selector={'gpu-type': 'h100'},          # optional
)
The SDK constructs a `V1Job` with:
  • `spec.template.spec.containers[0]`: the requested image and `command=["/bin/bash", "-c", <command>]`.
  • `resources.limits["nvidia.com/gpu"]: <gpu_count>`: schedules onto GPU nodes via the NVIDIA Device Plugin / GPU Operator.
  • `env_vars` flowed through, plus auto-injected S3/NGC/HF credentials for `script_runner`.
  • `restart_policy=Never` and `backoff_limit=0`: failures surface to the user instead of silently retrying.
  • `ttl_seconds_after_finished=3600`: the Job auto-cleans 1 hour after reaching a terminal state.
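To make the field list above concrete, here is a minimal sketch of an equivalent `batch/v1` Job manifest built as a plain dict. It is an approximation for reading purposes only: `build_job_manifest`, the container name `main`, and the job name in the usage line are assumptions, and the SDK's real object may carry additional fields such as labels, injected credentials, and the `script_runner` wrapper.

```python
def build_job_manifest(name, image, command, gpu_count, env_vars=None):
    """Return a batch/v1 Job manifest (plain dict) with the fields listed above."""
    container = {
        "name": "main",  # arbitrary container name chosen for this sketch
        "image": image,
        "command": ["/bin/bash", "-c", command],
        "env": [{"name": k, "value": v} for k, v in (env_vars or {}).items()],
        "resources": {"limits": {"nvidia.com/gpu": str(gpu_count)}},
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,                # no silent retries
            "ttlSecondsAfterFinished": 3600,  # auto-clean after one hour
            "template": {
                "spec": {"restartPolicy": "Never", "containers": [container]},
            },
        },
    }


manifest = build_job_manifest(
    name="dino-train-demo",  # hypothetical job name
    image="nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt",
    command="dino train -e /tmp/spec.yaml",
    gpu_count=1,
)
print(manifest["spec"]["template"]["spec"]["containers"][0]["command"])
```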

Status & monitoring

python
status = sdk.get_job_status(job.id)
# status.status ∈ {"Pending", "Running", "Complete", "Error", "Canceled", "Unknown"}

logs = sdk.get_job_logs(job.id, tail=200)  # concatenates logs from all pods of the Job

# For stuck-Pending jobs, replica diagnostics:
for r in sdk.get_job_replicas(job.id):
    issue = r["status"].get("readiness_issue")
    if issue:
        print(issue["reason"], issue["message"])
        # e.g. "ImagePullBackOff" / "Back-off pulling image..."
        # e.g. "Pending" / "0/3 nodes available: 3 Insufficient nvidia.com/gpu"

# On failure:
analysis = sdk.get_failure_analysis(job.id)
# {"err_class": "ERR_PROGRAM" | "ERR_INFRA",
#  "suggestion": "Container OOM-killed. Reduce batch size...",
#  "job_failure_by_node_event": [{"node_event_name": "OOMKilled", ...}]}
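A status call is usually wrapped in a polling loop. The sketch below shows one way to do that without depending on the SDK: it takes any callable that returns the status string. Treating `Complete`, `Error`, and `Canceled` as the terminal states is an assumption drawn from the status values listed above, and `wait_for_terminal` is a hypothetical helper, not an SDK function.

```python
import time

# Assumed terminal states; "Pending", "Running" and "Unknown" keep the loop going.
TERMINAL_STATES = {"Complete", "Error", "Canceled"}


def wait_for_terminal(get_status, poll_seconds=30.0, timeout_seconds=3600.0,
                      sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_status()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {state!r} after {timeout_seconds}s")
        sleep(poll_seconds)


# Stand-in for the SDK: a canned sequence of status values.
fake_states = iter(["Pending", "Running", "Complete"])
print(wait_for_terminal(lambda: next(fake_states), poll_seconds=0))  # prints Complete
```

With the real SDK the callable would be `lambda: sdk.get_job_status(job.id).status`; on `Error`, follow up with `sdk.get_failure_analysis(job.id)`.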

Cancel & cleanup

python
sdk.cancel_job(job.id)  # delete_namespaced_job with propagation_policy="Foreground"
`ttl_seconds_after_finished=3600` means completed Jobs auto-delete after 1h. To cancel an in-flight Job, call `cancel_job`, which deletes the Job and its pods immediately.

GPU Operator dependency

The SDK refuses to submit GPU jobs to a cluster with no `nvidia.com/gpu` allocatable. For self-managed clusters, first run the `tao-setup-nvidia-gpu-host` install action on every GPU worker node or bake the same package set into the node image:
bash
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend kubernetes --install --yes
Then install the NVIDIA GPU Operator or device plugin:
bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator

Multi-node training (distributed)

Pass `num_nodes > 1` to `create_job()` to run distributed training across N pods. The SDK provisions:
  1. A headless Service named after the Job (selector: `job-name=<job-name>`, `clusterIP: None`, `publishNotReadyAddresses: true` so pods can rendezvous before they're all Ready).
  2. An Indexed Job with `parallelism = completions = num_nodes`, `completionMode: Indexed`. Each pod gets `JOB_COMPLETION_INDEX` injected by k8s automatically (= the node rank).
  3. A command wrapper that exports the rendezvous env vars before invoking the user command. Two naming conventions are exported simultaneously:

    | Env var | Value | Read by |
    | --- | --- | --- |
    | `WORLD_SIZE` | `num_nodes` | TAO PyTorch container's `nvidia_tao_pytorch/core/entrypoint.py` (uses this to mean node count, even though PyTorch's own convention is total processes) |
    | `NUM_GPU_PER_NODE` | `gpu_count` | TAO PyTorch container's entrypoint |
    | `NNODES` | `num_nodes` | `torchrun` and PyTorch-standard rendezvous |
    | `NPROC_PER_NODE` | `gpu_count` | `torchrun` |
    | `NODE_RANK` | `$JOB_COMPLETION_INDEX` | both |
    | `MASTER_ADDR` | `<job-name>-0.<job-name>` (pod-0's DNS) | both |
    | `MASTER_PORT` | `29500` | both (TAO's default) |

    Both naming conventions are set so TAO entrypoints (`dino train`, etc.) and raw `torchrun` commands work without modification.
python
job = sdk.create_job(
    image='nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt',
    command='dino train -e /tmp/spec.yaml',  # TAO entrypoint reads spec.train.num_nodes; env vars are wired by the container
    gpu_count=8,           # GPUs per node
    num_nodes=4,           # 4 × 8 = 32 GPUs total
    inputs={'/data/train.json': 's3://bucket/coco/train.json'},
    outputs=['/results/'],
)
For raw `torchrun`-based commands (non-TAO containers):
python
job = sdk.create_job(
    image='nvcr.io/nvidia/pytorch:25.08-py3',
    command='torchrun --nnodes=$NNODES --nproc-per-node=$NPROC_PER_NODE --node-rank=$NODE_RANK '
            '--master-addr=$MASTER_ADDR --master-port=$MASTER_PORT train.py',
    gpu_count=8,
    num_nodes=4,
)
The capacity check sums across nodes: `gpu_count × num_nodes` ≤ cluster's allocatable `nvidia.com/gpu`.
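As a rough model of what each pod ends up seeing, the sketch below computes the rendezvous variables listed earlier for a given completion index, along with the capacity inequality. Both `rendezvous_env` and `fits_cluster` are hypothetical helpers written for this example, and the job name `dino-train` is made up; the SDK's actual command wrapper exports these values in shell rather than Python.

```python
def rendezvous_env(job_name, num_nodes, gpu_count, completion_index, master_port=29500):
    """Return the rendezvous env vars for the pod with the given completion index."""
    return {
        # TAO entrypoint convention (WORLD_SIZE means node count here).
        "WORLD_SIZE": str(num_nodes),
        "NUM_GPU_PER_NODE": str(gpu_count),
        # torchrun convention.
        "NNODES": str(num_nodes),
        "NPROC_PER_NODE": str(gpu_count),
        # Shared by both conventions.
        "NODE_RANK": str(completion_index),
        "MASTER_ADDR": f"{job_name}-0.{job_name}",  # pod-0 behind the headless Service
        "MASTER_PORT": str(master_port),
    }


def fits_cluster(gpu_count, num_nodes, allocatable_gpus):
    """Capacity check: total requested GPUs must not exceed what is allocatable."""
    return gpu_count * num_nodes <= allocatable_gpus


env = rendezvous_env("dino-train", num_nodes=4, gpu_count=8, completion_index=2)
print(env["MASTER_ADDR"], env["NODE_RANK"])  # prints dino-train-0.dino-train 2
print(fits_cluster(gpu_count=8, num_nodes=4, allocatable_gpus=32))  # prints True
```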

Cluster requirements for multi-node

  • k8s 1.28+ is required for stable pod hostnames in Indexed Jobs (the `PodIndexLabel` feature). On older clusters the `MASTER_ADDR=<job>-0.<svc>` DNS lookup fails. Verify with `kubectl version`.
  • Pod-to-pod networking must be open on port 29500 (PyTorch default; configurable via the `MASTER_PORT` env var). Most CNIs (Calico, Cilium, AWS VPC CNI) allow this by default; restrictive NetworkPolicies must be relaxed.
  • NCCL in the container talks GPU-to-GPU; if the cluster has multi-NIC nodes or RDMA, set `NCCL_SOCKET_IFNAME` / `NCCL_IB_HCA` via `env_vars`.
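The version requirement is easy to get wrong with a plain string comparison, since "1.9" sorts after "1.28" as text. A small sketch of a numeric check on the `serverVersion.gitVersion` string reported by `kubectl version -o json` follows; `meets_min_version` is a hypothetical helper, not part of the SDK.

```python
import re


def meets_min_version(git_version, minimum=(1, 28)):
    """Compare a Kubernetes gitVersion string against a (major, minor) minimum."""
    match = re.match(r"v?(\d+)\.(\d+)", git_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {git_version!r}")
    # Tuples compare numerically, so (1, 9) is correctly older than (1, 28).
    return (int(match.group(1)), int(match.group(2))) >= minimum


print(meets_min_version("v1.28.3"))             # prints True
print(meets_min_version("v1.27.9-eks-abc123"))  # prints False
```

Vendor suffixes such as `-eks-...` or `+k3s1` are ignored because only the leading major and minor numbers are parsed.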

Reference reading

When to use a Kubernetes operator instead

For more sophisticated topologies (gang scheduling, PyTorch elastic / fault-tolerant training, MPI / Horovod, RDMA setup), reach for an operator instead of a plain Indexed Job, such as Kubeflow's `PyTorchJob` operator, Volcano or Kueue, or the MPI Operator.
The TAO SDK's Indexed Job path is intentionally simple and dependency-free; if you need elastic restart or gang scheduling, layer one of these on top and submit jobs through the operator's CRD instead.

Common error patterns

`No nvidia.com/gpu resources allocatable on the cluster`: the GPU Operator (or NVIDIA Device Plugin) isn't installed. Install per the link above; verify with `kubectl get nodes -o jsonpath='{.items[*].status.allocatable}'`.
`ImagePullBackOff` / `ErrImagePull`: the cluster can't pull the image. For nvcr.io, pre-create an image-pull secret in the namespace and pass its name via the `image_pull_secret` argument:
bash
kubectl create secret docker-registry ngc-pull-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_KEY" -n tao-jobs
Pod stays `Pending` forever: `get_job_replicas(job_id)` will show the `readiness_issue`. Common causes: insufficient GPU capacity (`Insufficient nvidia.com/gpu`), no node matches `node_selector`, missing image-pull secret, or PVC mount failure.
`OOMKilled` (exit 137): the container exceeded memory. Reduce batch size, lower max_length, or add a memory request/limit and target a larger node.
`CredentialError: Could not authenticate to a Kubernetes cluster`: neither kubeconfig nor in-cluster auth worked. Run `kubectl get nodes` to verify your config, or set `$KUBECONFIG` to the right path.
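The patterns above lend themselves to a small lookup keyed on the `reason` and `message` fields of a `readiness_issue`. The sketch below is one hypothetical way to turn them into hints; `suggest_fix` and the hint wording are illustrative and are not what `get_failure_analysis` returns.

```python
# Hints paraphrase the error patterns described above.
HINTS = {
    "ImagePullBackOff": "Pre-create an image-pull secret and pass it via image_pull_secret.",
    "ErrImagePull": "Pre-create an image-pull secret and pass it via image_pull_secret.",
    "OOMKilled": "Reduce batch size or target a node with more memory.",
}


def suggest_fix(reason, message=""):
    """Map a readiness_issue reason/message pair to a short remediation hint."""
    if reason in HINTS:
        return HINTS[reason]
    if reason == "Pending" and "Insufficient nvidia.com/gpu" in message:
        return "Not enough allocatable GPUs; lower gpu_count or num_nodes, or add GPU nodes."
    if reason == "Pending":
        return "Check node_selector, the image-pull secret, and PVC mounts."
    return "No known pattern; inspect the job logs."


print(suggest_fix("Pending", "0/3 nodes available: 3 Insufficient nvidia.com/gpu"))
```

It would typically be called inside the `get_job_replicas` loop with `issue["reason"]` and `issue["message"]`.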

What this skill does NOT support (yet)

  • Elastic / fault-tolerant training. The Indexed Job has `backoff_limit=0`, so any failure fails the whole training run. For elastic restart (e.g., resume from checkpoint after a node death), use Kubeflow's `PyTorchJob` operator instead.
  • Gang scheduling. Indexed Job pods are scheduled independently, with no all-or-nothing guarantee. Multi-node training will partially start if only some pods can be scheduled (rank-0 will hang waiting for peers). For all-or-nothing scheduling on shared clusters, use Volcano or Kueue.
  • MPI / Horovod. Use the MPI Operator. The Indexed Job path here is PyTorch-distributed-shaped (env-var rendezvous on `MASTER_ADDR:MASTER_PORT`).
  • Persistent volumes for shared storage. S3 only via the `script_runner`. PVC support is a follow-up.
  • Auto-creating image-pull secrets from `$NGC_KEY`. You pre-create the secret in the target namespace and pass the name. Lepton does this automatically; this skill does not, because k8s namespace conventions vary widely.