tao-run-on-kubernetes
Kubernetes
Submits TAO container jobs as Kubernetes Jobs. Works on any cluster reachable via kubeconfig (EKS / GKE / AKS / on-prem) or in-cluster service account (when the SDK runs inside a pod).
Single-pod by default; opt into multi-node distributed training via `num_nodes > 1` (uses Indexed Job + headless Service, see Multi-node training below).

Preflight
Four checks: GPU host runtime ready, SDK installed, cluster reachable, GPU
Operator/device plugin present.
0. GPU node host runtime.

Run this on each self-managed GPU worker node or in the node image build. Set `TAO_K8S_SKIP_NODE_RUNTIME_CHECK=1` only when using managed GPU nodes whose driver/toolkit lifecycle is owned by the cloud provider or GPU Operator policy.

```bash
if [ "${TAO_K8S_SKIP_NODE_RUNTIME_CHECK:-0}" != "1" ]; then
  TAO_SKILL_BANK_ROOT="${TAO_SKILL_BANK_ROOT:-$PWD}"
  SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
  # Fall back to the platform/ layout when the skills/ path is not executable.
  [ -x "$SETUP_SCRIPT" ] || SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
  bash "$SETUP_SCRIPT" --backend kubernetes --check-only || {
    echo "MISSING: TAO Kubernetes GPU node runtime is not ready."
    echo "For self-managed GPU nodes, run after user approval:"
    echo "  bash \"$SETUP_SCRIPT\" --backend kubernetes --install --yes"
    echo "For managed clusters, verify the node image/GPU Operator policy installs driver 580 and toolkit 1.19.0, then set TAO_K8S_SKIP_NODE_RUNTIME_CHECK=1."
    exit 1
  }
fi
```
1. SDK + kubernetes extra installed.

`nvidia-tao-sdk` is on public PyPI; the pin lives in `versions.yaml` (`wheels.tao_sdk_kubernetes`).

```bash
PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_kubernetes)
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}
python -c "import kubernetes" 2>/dev/null || {
  echo "MISSING: kubernetes extra not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}
```
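The two import probes above can also be run from Python. The following is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `missing_modules` is hypothetical and not part of the TAO SDK.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that are not found on the import path."""
    # find_spec returns None when a top-level module cannot be located.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["tao_sdk", "kubernetes"])
    if missing:
        print("MISSING:", ", ".join(missing))
    else:
        print("tao_sdk and kubernetes are importable")
```

Note that `find_spec` only locates a module; it does not execute it, so a package that is present but broken would still pass this probe.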
2. Cluster reachable (kubeconfig OR in-cluster service account).

```bash
python -c "from kubernetes import config; config.load_kube_config()" 2>/dev/null ||
python -c "from kubernetes import config; config.load_incluster_config()" 2>/dev/null || {
  echo "MISSING: no kubeconfig at ~/.kube/config and not running in a pod."
  echo "Configure kubectl (e.g., 'aws eks update-kubeconfig --name my-cluster') or set \$KUBECONFIG."
  exit 1
}
```
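The kubeconfig-then-in-cluster fallback above can be sketched as a small Python helper. This is an illustration only: `first_working_loader` is a hypothetical name, and the loaders are passed in as callables so the sketch runs without the kubernetes client installed.

```python
from typing import Callable, List, Sequence, Tuple

Loader = Tuple[str, Callable[[], None]]


def first_working_loader(loaders: Sequence[Loader]) -> str:
    """Try each credential loader in order and return the name of the first that succeeds."""
    errors: List[str] = []
    for name, load in loaders:
        try:
            load()
            return name
        except Exception as exc:  # a loader signals "not usable" by raising
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no usable Kubernetes credentials (" + "; ".join(errors) + ")")
```

With the kubernetes Python client installed, the call would presumably look like `first_working_loader([("kubeconfig", config.load_kube_config), ("in-cluster", config.load_incluster_config)])` after `from kubernetes import config`.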
3. NVIDIA GPU Operator present (soft check: warn if kubectl is available, don't fail).

```bash
if command -v kubectl >/dev/null 2>&1; then
  # The dot in "nvidia.com/gpu" must be escaped inside a kubectl JSONPath expression.
  gpu=$(kubectl get nodes -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' 2>/dev/null | grep -v '^$' | head -1)
  if [ -z "$gpu" ] || [ "$gpu" = "0" ]; then
    echo "WARN: no nvidia.com/gpu allocatable on this cluster."
    echo "Install the NVIDIA GPU Operator before submitting GPU jobs:"
    echo "  https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html"
  fi
fi
```

The GPU node runtime check is mandatory for self-managed nodes. For managed clusters where the client is not running on a GPU worker, verify the provider node image or GPU Operator policy and set `TAO_K8S_SKIP_NODE_RUNTIME_CHECK=1` instead of running the installer on the client. The final GPU capacity check is a warning rather than a hard fail because `kubectl` isn't always installed. The SDK does a hard guard inside `KubernetesSDK.create_job()` that uses the kubernetes Python client to verify GPU capacity before submitting.
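For readers who prefer to inspect capacity from Python, here is a minimal sketch that sums allocatable GPUs from a NodeList document shaped like the output of `kubectl get nodes -o json`. The function name `total_allocatable_gpus` is hypothetical, and this is not the SDK's own guard; unlike the shell check, which looks at the first non-empty value only, it sums across all nodes.

```python
import json
from typing import Any, Dict


def total_allocatable_gpus(node_list: Dict[str, Any]) -> int:
    """Sum status.allocatable["nvidia.com/gpu"] over every node in a NodeList."""
    total = 0
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Nodes without a device plugin lack the key; count them as zero.
        total += int(allocatable.get("nvidia.com/gpu", "0"))
    return total


if __name__ == "__main__":
    # Made-up sample data in the NodeList shape.
    sample = json.loads("""
    {"items": [
      {"status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "4"}}},
      {"status": {"allocatable": {"cpu": "8"}}}
    ]}
    """)
    print(total_allocatable_gpus(sample))
```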
Credentials & configuration
- Kubeconfig (one of):
  - `~/.kube/config`: default discovery path
  - `$KUBECONFIG`: alternate path
  - In-cluster service account: used when running inside a pod (no kubeconfig needed)
- `TAO_K8S_NAMESPACE` (optional): default namespace for Job submission. Defaults to `default`.
- `TAO_K8S_CONTEXT` (optional): kubeconfig context name to switch clusters.
- `NGC_KEY` (optional): for nvcr.io image pulls. If you've pre-created an image-pull secret in the target namespace, pass its name to `create_job` via the `image_pull_secret` argument.
- `ACCESS_KEY` / `SECRET_KEY` / `S3_BUCKET_NAME` / `S3_ENDPOINT_URL` (optional): for S3 dataset I/O via the SDK's `inputs` / `outputs` script_runner wrapping.

Do not ask for Lepton, Brev, or SLURM credentials for Kubernetes runs. Ask for S3 credentials only when the selected workflow uses `s3://` inputs or outputs, and ask for model-specific credentials such as `HF_TOKEN` only when the selected model requires them. Before launch, verify that the selected namespace can create Jobs, that dataset/result paths are visible from the pod, and that PVC/mounted filesystem paths are proven to be mounted into the job container; an agent-host local path is not sufficient proof.
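The rule "ask for S3 credentials only when `s3://` paths are involved" can be sketched as below. The helper names (`resolve_namespace`, `uses_s3`, `unset_s3_variables`) are hypothetical, and treating every unset S3 variable as worth asking about is an assumption; some deployments may not need `S3_ENDPOINT_URL`.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the credentials list above.
S3_ENV_VARS = ["ACCESS_KEY", "SECRET_KEY", "S3_BUCKET_NAME", "S3_ENDPOINT_URL"]


def resolve_namespace(env: Mapping[str, str]) -> str:
    """TAO_K8S_NAMESPACE if set and non-empty, otherwise "default"."""
    return env.get("TAO_K8S_NAMESPACE") or "default"


def uses_s3(inputs: Mapping[str, str], outputs: Iterable[str]) -> bool:
    """True when any input source or output target is an s3:// URI."""
    paths = list(inputs.values()) + list(outputs)
    return any(path.startswith("s3://") for path in paths)


def unset_s3_variables(inputs: Mapping[str, str], outputs: Iterable[str],
                       env: Mapping[str, str]) -> List[str]:
    """S3 variables that are unset; always empty when no s3:// path is used."""
    if not uses_s3(inputs, outputs):
        return []
    return [name for name in S3_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    inputs = {"/data/train.json": "s3://bucket/coco/train.json"}
    print(resolve_namespace(os.environ))
    print(unset_s3_variables(inputs, ["/results/"], os.environ))
```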
SDK API
K8s is SDK-only: there is no `kubectl`-only launch path. Read `tao-skill-bank:tao-run-platform` before drafting `create_job` calls; it covers `build_entrypoint`, the shared kwarg contract, monitoring, and `ActionWorkflow`.

```python
import os

from tao_sdk.platforms.kubernetes import KubernetesSDK

sdk = KubernetesSDK()  # auto-detects auth
job = sdk.create_job(
    image='nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt',
    command='dino train -e /tmp/spec.yaml',
    gpu_count=1,
    env_vars={'NGC_KEY': os.environ['NGC_KEY']},
    inputs={'/data/train.json': 's3://bucket/coco/train.json'},
    outputs=['/results/'],
    namespace='tao-jobs',                 # optional override
    image_pull_secret='ngc-pull-secret',  # optional, pre-created
    node_selector={'gpu-type': 'h100'},   # optional
)
```

The SDK constructs a `V1Job` with:

- `spec.template.spec.containers[0]`: the requested image and `command=["/bin/bash", "-c", <command>]`.
- `resources.limits["nvidia.com/gpu"]: <gpu_count>`: schedules onto GPU nodes via the NVIDIA Device Plugin / GPU Operator.
- `env_vars` flowed through, plus auto-injected S3/NGC/HF credentials for `script_runner`.
- `restart_policy=Never` and `backoff_limit=0`: failures surface to the user instead of silently retrying.
- `ttl_seconds_after_finished=3600`: the Job auto-cleans 1 hour after reaching a terminal state.
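To make the listed fields concrete, the sketch below assembles an equivalent `batch/v1` Job manifest as a plain dictionary. It is an illustration of the shape described above, not the SDK's implementation; the function name `sketch_job_manifest` and the container name `main` are hypothetical.

```python
import json
from typing import Any, Dict


def sketch_job_manifest(name: str, image: str, command: str, gpu_count: int) -> Dict[str, Any]:
    """Build a batch/v1 Job manifest carrying the fields listed above."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,                # no silent retries
            "ttlSecondsAfterFinished": 3600,  # cleaned up 1 hour after finishing
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "main",  # hypothetical container name
                        "image": image,
                        "command": ["/bin/bash", "-c", command],
                        "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = sketch_job_manifest(
        "dino-train",
        "nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt",
        "dino train -e /tmp/spec.yaml",
        gpu_count=1,
    )
    print(json.dumps(manifest, indent=2))
```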
Status & monitoring
```python
status = sdk.get_job_status(job.id)
# status.status ∈ {"Pending", "Running", "Complete", "Error", "Canceled", "Unknown"}

logs = sdk.get_job_logs(job.id, tail=200)  # concatenates logs from all pods of the Job
```
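A status call like the one above is usually wrapped in a polling loop. The sketch below is one way to do that; `wait_for_terminal` is a hypothetical helper, and treating `Complete`, `Error`, and `Canceled` as the terminal states is an assumption based on the status values listed above.

```python
import time
from typing import Callable

# Assumption: these three states never change once reached.
TERMINAL_STATES = {"Complete", "Error", "Canceled"}


def wait_for_terminal(get_status: Callable[[], str],
                      poll_seconds: float = 10.0,
                      timeout_seconds: float = 3600.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status() until it returns a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_status()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {state!r} after {timeout_seconds} s")
        sleep(poll_seconds)
```

With the SDK objects from the example above, the call would be `wait_for_terminal(lambda: sdk.get_job_status(job.id).status)`.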
For stuck-Pending jobs, use replica diagnostics:

```python
for r in sdk.get_job_replicas(job.id):
    issue = r["status"].get("readiness_issue")
    if issue:
        print(issue["reason"], issue["message"])
        # e.g. "ImagePullBackOff" / "Back-off pulling image..."
        # e.g. "Pending" / "0/3 nodes available: 3 Insufficient nvidia.com/gpu"
```

On failure:

```python
analysis = sdk.get_failure_analysis(job.id)
# {"err_class": "ERR_PROGRAM" | "ERR_INFRA",
#  "suggestion": "Container OOM-killed. Reduce batch size...",
#  "job_failure_by_node_event": [{"node_event_name": "OOMKilled", ...}]}
```

Cancel & cleanup
```python
sdk.cancel_job(job.id)  # delete_namespaced_job with propagation_policy="Foreground"
```

Jobs that reach a terminal state are removed automatically one hour later (`ttl_seconds_after_finished=3600`), so `cancel_job` is only needed to stop a Job that is still pending or running.

GPU Operator dependency
The SDK refuses to submit GPU jobs to a cluster with no `nvidia.com/gpu` allocatable. For self-managed clusters, first run the `tao-setup-nvidia-gpu-host` install action on every GPU worker node or bake the same package set into the node image:

```bash
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend kubernetes --install --yes
```

Then install the NVIDIA GPU Operator or device plugin:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator
```

Multi-node training (distributed)
Pass `num_nodes > 1` to `create_job()` to run distributed training across N pods. The SDK provisions:

- A headless Service named after the Job (selector: `job-name=<job-name>`, `clusterIP: None`, `publishNotReadyAddresses: true` so pods can rendezvous before they're all Ready).
- An Indexed Job with `parallelism = completions = num_nodes`, `completionMode: Indexed`. Each pod gets `JOB_COMPLETION_INDEX` injected by k8s automatically (= the node rank).
- A command wrapper that exports the rendezvous env vars before invoking the user command. Two naming conventions are exported simultaneously:

| Env var | Value | Read by |
| --- | --- | --- |
| `WORLD_SIZE` | `num_nodes` | TAO PyTorch container's `nvidia_tao_pytorch/core/entrypoint.py` (uses this to mean node count, even though PyTorch's own convention is total processes) |
| `NUM_GPU_PER_NODE` | `gpu_count` | TAO PyTorch container's entrypoint |
| `NNODES` | `num_nodes` | `torchrun` and PyTorch-standard rendezvous |
| `NPROC_PER_NODE` | `gpu_count` | `torchrun` |
| `NODE_RANK` | `$JOB_COMPLETION_INDEX` | both |
| `MASTER_ADDR` | `<job-name>-0.<job-name>` (pod-0's DNS) | both |
| `MASTER_PORT` | `29500` | both (TAO's default) |

Both naming conventions are set so TAO entrypoints (`dino train`, etc.) and raw `torchrun` commands work without modification.
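The exported variables described above can be written out as a small function. This is a sketch of the mapping only, with a hypothetical name `rendezvous_env`; the SDK's actual wrapper is a shell prologue and may differ in detail.

```python
from typing import Dict


def rendezvous_env(job_name: str, num_nodes: int, gpu_count: int,
                   completion_index: int, master_port: int = 29500) -> Dict[str, str]:
    """Return the rendezvous env vars for one pod of the Indexed Job."""
    return {
        # TAO entrypoint convention (WORLD_SIZE means node count here)
        "WORLD_SIZE": str(num_nodes),
        "NUM_GPU_PER_NODE": str(gpu_count),
        # torchrun convention
        "NNODES": str(num_nodes),
        "NPROC_PER_NODE": str(gpu_count),
        # shared by both
        "NODE_RANK": str(completion_index),
        "MASTER_ADDR": f"{job_name}-0.{job_name}",  # pod-0 behind the headless Service
        "MASTER_PORT": str(master_port),
    }


if __name__ == "__main__":
    for key, value in rendezvous_env("dino-train", num_nodes=4, gpu_count=8,
                                     completion_index=2).items():
        print(f"{key}={value}")
```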
```python
job = sdk.create_job(
    image='nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt',
    # TAO entrypoint reads spec.train.num_nodes; env vars are wired by the container
    command='dino train -e /tmp/spec.yaml',
    gpu_count=8,   # GPUs per node
    num_nodes=4,   # 4 × 8 = 32 GPUs total
    inputs={'/data/train.json': 's3://bucket/coco/train.json'},
    outputs=['/results/'],
)
```

For raw `torchrun`-based commands (non-TAO containers):

```python
job = sdk.create_job(
    image='nvcr.io/nvidia/pytorch:25.08-py3',
    command='torchrun --nnodes=$NNODES --nproc-per-node=$NPROC_PER_NODE --node-rank=$NODE_RANK '
            '--master-addr=$MASTER_ADDR --master-port=$MASTER_PORT train.py',
    gpu_count=8,
    num_nodes=4,
)
```

The capacity check sums across nodes: `gpu_count × num_nodes` ≤ the cluster's allocatable `nvidia.com/gpu`.
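The capacity rule above amounts to one multiplication and one comparison. A minimal sketch, assuming the cluster-wide allocatable total is already known; `check_gpu_capacity` is a hypothetical helper, not the SDK's guard.

```python
def check_gpu_capacity(gpu_count: int, num_nodes: int, allocatable_gpus: int) -> int:
    """Return the total GPUs requested, or raise if the cluster cannot hold them."""
    requested = gpu_count * num_nodes
    if requested > allocatable_gpus:
        raise ValueError(
            f"requested {requested} GPUs ({num_nodes} nodes x {gpu_count}) "
            f"but only {allocatable_gpus} are allocatable"
        )
    return requested


if __name__ == "__main__":
    print(check_gpu_capacity(gpu_count=8, num_nodes=4, allocatable_gpus=32))
```

A cluster-wide sum is a necessary condition only: each pod still has to fit on a single node with `gpu_count` free GPUs, so a job can pass this check and still leave pods Pending.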
Cluster requirements for multi-node
- k8s 1.28+ is required for stable pod hostnames in Indexed Jobs (the `PodIndexLabel` feature). On older clusters the `MASTER_ADDR=<job>-0.<svc>` DNS lookup fails. Verify with `kubectl version`.
- Pod-to-pod networking must be open on port 29500 (PyTorch default; configurable via the `MASTER_PORT` env var). Most CNIs (Calico, Cilium, AWS VPC CNI) allow this by default; restrictive NetworkPolicies must be relaxed.
- NCCL in the container talks GPU-to-GPU; if the cluster has multi-NIC nodes or RDMA, set `NCCL_SOCKET_IFNAME` / `NCCL_IB_HCA` via `env_vars`.
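The 1.28 floor can be checked programmatically once the server version string is known (for example the `gitVersion` value reported by `kubectl version`). The helpers below are a hypothetical sketch; they parse only the major and minor components and ignore vendor suffixes.

```python
import re
from typing import Tuple


def parse_major_minor(git_version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as "v1.28.3" or "v1.27.9-eks-1234"."""
    match = re.match(r"v?(\d+)\.(\d+)", git_version.strip())
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {git_version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_multi_node(git_version: str, minimum: Tuple[int, int] = (1, 28)) -> bool:
    """True when the server version meets the multi-node minimum stated above."""
    return parse_major_minor(git_version) >= minimum


if __name__ == "__main__":
    for version in ("v1.27.9-eks-1234", "v1.28.3", "v1.30.0+k3s1"):
        print(version, supports_multi_node(version))
```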
Reference reading
- Kubernetes Indexed Job: https://kubernetes.io/docs/concepts/workloads/controllers/job/#completion-mode
- Indexed Job for batch ML: https://kubernetes.io/blog/2022/06/01/indexed-jobs-mpi/
- PyTorch distributed (env-var rendezvous): https://pytorch.org/docs/stable/elastic/run.html
- NCCL networking tuning (NCCL_SOCKET_IFNAME, NCCL_IB_HCA): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
When to use a Kubernetes operator instead
For more sophisticated topologies (gang scheduling, PyTorch elastic / fault-tolerant training, MPI / Horovod, RDMA setup), reach for an operator instead of a plain Indexed Job:

- MPI Operator (https://github.com/kubeflow/mpi-operator): for MPI / Horovod workloads.
- Kubeflow Training Operator (`PyTorchJob`, `TFJob`; https://www.kubeflow.org/docs/components/training/): for elastic PyTorch training with built-in restart logic.
- Volcano (https://volcano.sh/): gang scheduling, queues, fair-share. Useful in shared multi-tenant clusters.
- Kueue (https://kueue.sigs.k8s.io/): quota / queue layer on top of any of the above.

The TAO SDK's Indexed Job path is intentionally simple and dependency-free; if you need elastic restart or gang scheduling, layer one of these on top and submit jobs through the operator's CRD instead.
Common error patterns
- `No nvidia.com/gpu resources allocatable on the cluster`: no node advertises GPUs to the scheduler. Inspect what the nodes report with `kubectl get nodes -o jsonpath='{.items[*].status.allocatable}'`, then install the GPU Operator or device plugin (see GPU Operator dependency).
- `ImagePullBackOff` / `ErrImagePull`: the image could not be pulled. For nvcr.io images, pre-create a pull secret in the target namespace and pass its name as `image_pull_secret`:

```bash
kubectl create secret docker-registry ngc-pull-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_KEY" -n tao-jobs
```

- Pod stays `Pending` forever: `get_job_replicas(job_id)` will show the readiness_issue. Common causes: insufficient GPU capacity (`Insufficient nvidia.com/gpu`), no node matches `node_selector`, missing image-pull secret, or PVC mount failure.
- `OOMKilled`: the container ran out of memory; `get_failure_analysis` reports the event and suggests reducing the batch size.
- `CredentialError: Could not authenticate to a Kubernetes cluster`: confirm that `kubectl get nodes` works and that `$KUBECONFIG` (or `~/.kube/config`) points at a valid kubeconfig.
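The patterns above lend themselves to a small lookup when scripting around replica diagnostics. The sketch below is illustrative only: `triage` is a hypothetical helper, the matched strings are the ones quoted in this document, and the remediation wording is an assumption rather than SDK output.

```python
def triage(reason: str, message: str = "") -> str:
    """Map a readiness or failure reason to a short remediation hint."""
    if reason in ("ImagePullBackOff", "ErrImagePull"):
        return "pre-create an image-pull secret and pass it as image_pull_secret"
    if "Insufficient nvidia.com/gpu" in message:
        return "not enough allocatable GPUs; lower the request or add GPU nodes"
    if reason == "OOMKilled":
        return "container ran out of memory; reduce batch size"
    return "unrecognized; inspect get_job_replicas() and get_failure_analysis()"


if __name__ == "__main__":
    print(triage("Pending", "0/3 nodes available: 3 Insufficient nvidia.com/gpu"))
    print(triage("ImagePullBackOff", "Back-off pulling image..."))
```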
What this skill does NOT support (yet)
- Elastic / fault-tolerant training. The Indexed Job has `backoff_limit=0`, so any failure fails the whole training run. For elastic restart (e.g., resume from checkpoint after a node death), use Kubeflow's `PyTorchJob` operator instead.
- Gang scheduling. Indexed Job pods are scheduled independently, with no all-or-nothing guarantee. Multi-node training will partially start if only some pods can be scheduled (rank-0 will hang waiting for peers). For all-or-nothing scheduling on shared clusters, use Volcano or Kueue.
- MPI / Horovod. Use the MPI Operator. The Indexed Job path here is PyTorch-distributed-shaped (env-var rendezvous on `MASTER_ADDR:MASTER_PORT`).
- Persistent volumes for shared storage. S3 only, via the script_runner. PVC support is a follow-up.
- Auto-creating image-pull secrets from `$NGC_KEY`. You pre-create the secret in the target namespace and pass its name. Lepton does this automatically; we don't here because k8s namespace conventions vary widely.