tao-setup-nvidia-gpu-host

NVIDIA GPU Host Setup

Use this setup skill before TAO workflows run on the `docker`, `local-docker`, or `kubernetes` backend. It standardizes the host GPU runtime on:

- NVIDIA driver branch `580` (open kernel module preferred)
- CUDA Toolkit package `cuda-toolkit-13-0`
- NVIDIA Container Toolkit `1.19.0`
- Docker engine: installed only for the `docker` / `local-docker` backends, and only when Docker is missing. The package depends on the distro family (`docker.io` on Debian-family by default; `moby-engine`, or `docker-ce` from `download.docker.com`, on RHEL-family; `docker` on SUSE-family). Pass `--skip-docker-install` to opt out.

The check is safe and read-only by default. It works on any Linux distribution because it only probes `nvidia-smi`, the CUDA toolkit path, the installed Container Toolkit package version (via `dpkg`, `rpm`, or the `nvidia-ctk` binary version), and the Docker daemon's NVIDIA runtime.
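
As a rough illustration of what a read-only probe can look like, the sketch below extracts a version number from tool output without changing anything on the host. The helper names `first_version` and `probe_ctk_version` are hypothetical and are not taken from the skill's script; only `nvidia-ctk --version` is a real command.

```shell
# Hypothetical helper: print the first X.Y.Z version found on stdin.
first_version() {
  grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Hypothetical read-only probe: query the binary if present, never install.
probe_ctk_version() {
  if command -v nvidia-ctk >/dev/null 2>&1; then
    nvidia-ctk --version 2>/dev/null | first_version
  else
    echo "missing"
  fi
}
```

Because the probe only runs `command -v` and a `--version` query, it should be safe to call repeatedly on any distribution.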

Installation must be explicitly authorized by the user and then rerun with `--install`. The install path is automated for these distro families:

| Family | Tested distros | Manager | Notes |
| --- | --- | --- | --- |
| debian | Ubuntu 22.04 / 24.04, Debian 12 (and derivatives such as Pop!_OS, Mint, Zorin, Raspbian, and KDE Neon, via `UBUNTU_CODENAME` / `VERSION_CODENAME`) | `apt-get` | Adds NVIDIA `cuda-keyring` + Container Toolkit `.list`. Docker via `docker.io` (override `$DOCKER_PACKAGE_DEBIAN`). |
| rhel | Fedora 39+, RHEL / Rocky / AlmaLinux 9 and 10 | `dnf` (or `yum`) | Adds NVIDIA `cuda-<distro>.repo` + Container Toolkit `.repo`. Docker via Fedora `moby-engine` when available, otherwise `docker-ce` from `download.docker.com`. |
| suse | openSUSE Leap 15, SLES 15 | `zypper` | Adds the same NVIDIA `.repo` files. Docker via the distribution `docker` package. |
| other (Arch, Alpine, Gentoo, NixOS, FreeBSD, …) | n/a | n/a | `--install` exits with a clear error listing the version targets and the NVIDIA install-guide URLs. Install manually, then rerun `--check-only`. |
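
The family dispatch in the table could be sketched roughly as follows. This is a minimal illustration, assuming the family is derived from the `ID` and `ID_LIKE` fields of an os-release file; the function name `detect_family` and the exact token list are assumptions, not the script's actual logic.

```shell
# Hypothetical helper: classify a host as debian / rhel / suse / other.
detect_family() {
  local os_release="${1:-/etc/os-release}"
  # No readable os-release file (e.g. macOS) means no automated install path.
  [ -r "$os_release" ] || { echo "other"; return 0; }

  # Source in a subshell so the file's variables do not leak into the caller.
  (
    ID="" ID_LIKE=""
    . "$os_release"
    # Check the distro's own ID first, then the families it claims to resemble.
    for token in $ID $ID_LIKE; do
      case "$token" in
        debian|ubuntu)                      echo "debian"; exit 0 ;;
        rhel|fedora|centos|rocky|almalinux) echo "rhel";   exit 0 ;;
        suse|opensuse*|sles)                echo "suse";   exit 0 ;;
      esac
    done
    echo "other"
  )
}
```

Calling `detect_family` with no argument reads `/etc/os-release`; passing a path makes the function easy to exercise against sample files.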

Quick Start

From the skill bank root:

```bash
# Check the local Docker backend host.
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend docker --check-only

# Install or repair after user approval (prompts for confirmation; see the note below for non-interactive runs).
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend docker --install

# Check a Kubernetes GPU worker host.
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend kubernetes --check-only
```

> ⚠️ **Note — running non-interactively (agent / skill runs):** a skill run has
> no terminal, so the installer's `Continue? [y/N]` confirmation cannot be
> answered. After running `--check-only` to preview what is missing and getting
> the user's explicit approval, append the assume-yes flag (`--yes`) to the
> `--install` command so it proceeds without a prompt. That auto-confirms
> installation of system packages (NVIDIA driver branch 580, CUDA Toolkit 13.0,
> NVIDIA Container Toolkit, and — for Docker backends — Docker) and modifies the
> host: it adds NVIDIA package repositories, may restart Docker, and adds the
> invoking user to the `docker` group, so only do this on a host you control and
> have the privileges to change. When a person runs `--install` directly at a
> terminal, the script instead prompts with the exact package list before making
> any changes.
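
One way a wrapper might decide when to append `--yes` is sketched below. The function `install_flags` and its two arguments are hypothetical; only the `--install` and `--yes` flags come from the note above.

```shell
# Hypothetical helper: choose install flags from approval and terminal state.
#   $1 = "yes" if the user explicitly approved the install
#   $2 = "yes" if a terminal is attached and can answer the prompt
install_flags() {
  local approved="$1" has_tty="$2"
  if [ "$approved" != "yes" ]; then
    echo "refusing: install was not approved by the user" >&2
    return 1
  fi
  if [ "$has_tty" = "yes" ]; then
    echo "--install"          # let the script show its own confirmation prompt
  else
    echo "--install --yes"    # no terminal: auto-confirm after approval
  fi
}
```

A caller could set the second argument from `[ -t 0 ]`, which is true only when standard input is a terminal.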

In an installed plugin copy that exposes `skills/`, use:

```bash
bash skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend docker --check-only
```

Workflow Contract

Docker and Kubernetes workflows must run the check before submitting GPU work:

```bash
# Try the skills/ layout first, then the platform/ layout.
SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT:-$PWD}/skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
[ -x "$SETUP_SCRIPT" ] || SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT:-$PWD}/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"

# Read-only check; on failure, print the install command and stop.
bash "$SETUP_SCRIPT" --backend docker --check-only || {
  echo "MISSING: TAO GPU host runtime is not ready."
  echo "After user approval, run (append --yes for non-interactive agent runs):"
  echo "  bash \"$SETUP_SCRIPT\" --backend docker --install"
  exit 1
}
```

Never install silently. If the check fails, explain what is missing, ask the user to authorize the fix, then run the install command and rerun the check.
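
The check, ask, install, re-check sequence described above might be wired together as in the following sketch. The function `ensure_gpu_host` and its return codes are assumptions made for illustration; it presumes a person is available to answer on standard input.

```shell
# Hypothetical helper: check, ask for authorization, install, then re-check.
#   $1 = path to the setup script, $2 = backend name
# Returns 0 when the host is ready, 2 when the user declines, 1 on install failure.
ensure_gpu_host() {
  local script="$1" backend="$2" answer
  # The read-only check runs first and never consumes the user's answer.
  bash "$script" --backend "$backend" --check-only </dev/null && return 0

  printf 'GPU host runtime is not ready. Authorize install? [y/N] ' >&2
  read -r answer || answer=""
  case "$answer" in
    y|Y|yes) ;;
    *) echo "Install not authorized; stopping." >&2; return 2 ;;
  esac

  bash "$script" --backend "$backend" --install || return 1
  # Re-run the check so success is confirmed rather than assumed.
  bash "$script" --backend "$backend" --check-only </dev/null
}
```

Declining the prompt leaves the host untouched, which keeps the "never install silently" rule intact.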

What The Installer Does

The installer dispatches on the detected distribution family. On every supported family it adds NVIDIA's CUDA and Container Toolkit repositories (if missing), installs the pinned runtime packages, optionally installs Docker, wires the NVIDIA Docker runtime, and adds the invoking user to the `docker` group.

Common steps (all families):

1. Adds NVIDIA's CUDA repository if missing (the `cuda-keyring` deb for apt, `cuda-<distro>.repo` for dnf/zypper).
2. Adds NVIDIA's Container Toolkit repository if missing (`.list` for apt, `.repo` for dnf/zypper).
3. Installs the matching kernel header / devel package for the running kernel.
4. Installs the driver branch 580 packages, `cuda-toolkit-13-0`, and the Container Toolkit pinned to `1.19.0` (the dpkg-suffixed `1.19.0-1` is the same upstream version expressed for apt).
5. For Docker backends and when Docker is missing, installs Docker (override / opt-out flags below), enables and starts the daemon, then runs `nvidia-ctk runtime configure --runtime=docker` and restarts Docker when `systemctl` is available.
6. Adds the invoking user (`$SUDO_USER` if available, else `$USER`) to the `docker` group so subsequent shells can run `docker` without `sudo`. Opt out with `--skip-docker-group`. The new group membership does not take effect in the current shell: log out and back in, or run `newgrp docker` in each new shell.
7. Attempts `modprobe nvidia` so verification can pass before reboot.

Family-specific package selections:

| Step | debian-family | rhel-family | suse-family |
| --- | --- | --- | --- |
| Kernel headers | `linux-headers-$(uname -r)` | `kernel-devel-$(uname -r)`, `kernel-headers-$(uname -r)` | `kernel-default-devel` |
| Driver | `nvidia-driver-pinning-580`, `nvidia-open-580` (override: `$NVIDIA_DRIVER_PACKAGE_DEBIAN`) | `nvidia-driver-cuda`, `kmod-nvidia-open-dkms` (override: `$NVIDIA_DRIVER_PACKAGE_RHEL`, `$NVIDIA_DRIVER_KMOD_RHEL`) | `nvidia-open-driver-G06-signed-kmp-default` (override: `$NVIDIA_DRIVER_PACKAGE_SUSE`) |
| CUDA toolkit | `cuda-toolkit-13-0` | `cuda-toolkit-13-0` | `cuda-toolkit-13-0` |
| Container Toolkit | `nvidia-container-toolkit=1.19.0-1` + base/tools/libs | `nvidia-container-toolkit-1.19.0` + base/tools/libs | same as rhel |
| Docker | `docker.io` (override: `$DOCKER_PACKAGE_DEBIAN`) | `moby-engine` + `moby-cli` on Fedora when available, else `docker-ce docker-ce-cli containerd.io` from `download.docker.com` | `docker` |
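
The difference between the apt and dnf/zypper pin syntax in the Container Toolkit row can be illustrated with a small formatter. The function `ctk_pin_spec` is hypothetical; the two output shapes are the ones shown in the table.

```shell
# Hypothetical helper: format the Container Toolkit pin for a package manager.
#   $1 = family (debian | rhel | suse), $2 = upstream version such as 1.19.0
ctk_pin_spec() {
  local family="$1" version="$2" pkg="nvidia-container-toolkit"
  case "$family" in
    debian)    echo "${pkg}=${version}-1" ;;  # apt: name=version-revision
    rhel|suse) echo "${pkg}-${version}" ;;    # dnf/zypper: name-version
    *)         echo "unsupported family: $family" >&2; return 1 ;;
  esac
}
```

For example, `ctk_pin_spec debian 1.19.0` yields the apt form, while `ctk_pin_spec rhel 1.19.0` yields the form used for dnf and zypper.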

Verification

After installation, verify:

```bash
nvidia-smi
/usr/local/cuda-13.0/bin/nvcc --version
docker info --format '{{json .Runtimes}}' | grep nvidia
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```

Expected `nvidia-smi` output includes driver `580.x` and CUDA Version `13.0`. Expected `nvcc` output includes `release 13.0`.
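
If these expectations need to be checked by a script rather than by eye, something like the sketch below could work. The helper names are hypothetical, and the parsing assumes the driver version is a dotted string and that `nvcc` prints a line containing `release X.Y`.

```shell
# Hypothetical helper: reduce a dotted driver version to its branch number.
driver_branch() {
  printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: succeed only when the driver is on the wanted branch.
check_driver_branch() {
  local version="$1" want="${2:-580}"
  [ "$(driver_branch "$version")" = "$want" ]
}

# Hypothetical helper: print the X.Y after "release" from nvcc output on stdin.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}
```

The driver version string can be obtained with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, and `nvcc --version` can be piped into `nvcc_release`.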

Kubernetes Notes

For self-managed Kubernetes clusters, run the host installer on every GPU worker node or bake the same package set into the node image before installing the NVIDIA GPU Operator or device plugin.

The workflow check also warns if `kubectl` is available but the cluster reports no `nvidia.com/gpu` allocatable capacity. In that case, install and configure the NVIDIA GPU Operator after the worker host runtime is ready:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator
```

Managed Kubernetes providers may own driver installation through node images or GPU Operator policy. Do not overwrite a provider-managed GPU node without user approval and a rollback plan.
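
The `nvidia.com/gpu` capacity check mentioned earlier in this section can be approximated by hand. The sketch below is an assumption about how such a check might be written, not the workflow's actual code; `sum_gpu_allocatable` is a hypothetical helper that totals tab-separated `node<TAB>count` lines.

```shell
# Hypothetical helper: sum the second tab-separated column, treating blanks as 0.
sum_gpu_allocatable() {
  awk -F '\t' '{ total += ($2 == "" ? 0 : $2) } END { print total + 0 }'
}

# Example pipeline (requires cluster access, so it is left commented out):
# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' \
#   | sum_gpu_allocatable
```

A total of `0` would correspond to the situation where the cluster reports no allocatable GPU capacity.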

Failure Modes

**Unsupported distribution family:** `--install` automates debian-, rhel-, and suse-family hosts. On Arch, Alpine, Gentoo, NixOS, FreeBSD, or anything without `/etc/os-release` (e.g. macOS), the script exits with a clear error that lists the four version targets and the upstream NVIDIA and Docker install-guide URLs:

- https://docs.nvidia.com/cuda/cuda-installation-guide-linux/
- https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
- https://docs.docker.com/engine/install/

Install those four pieces using your distribution's package manager and rerun the script with `--check-only` to verify. The check is portable across distributions because it only queries binaries and package databases, so once the runtime is in place the workflow contract is satisfied regardless of the underlying distro.

**Unsupported Ubuntu/Debian derivative:** When `ID` is e.g. `pop`, `mint`, `zorin`, `raspbian`, or another debian-family derivative, the script maps the host onto the upstream Ubuntu/Debian CUDA repo via `UBUNTU_CODENAME` / `VERSION_CODENAME` (`focal` / `jammy` / `noble` → Ubuntu 20.04 / 22.04 / 24.04; `bullseye` / `bookworm` / `trixie` → Debian 11 / 12 / 12). If the host's codename doesn't match a known upstream release, `--install` exits with the same manual-install guidance described above.
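
The codename lookup described above might look something like this sketch. The function names are hypothetical, and the mapping simply mirrors the pairs listed in the text, including `trixie` resolving to Debian 12 as stated there.

```shell
# Hypothetical helper: map a codename to the upstream release named in the text.
upstream_release() {
  case "$1" in
    focal)    echo "ubuntu 20.04" ;;
    jammy)    echo "ubuntu 22.04" ;;
    noble)    echo "ubuntu 24.04" ;;
    bullseye) echo "debian 11" ;;
    bookworm) echo "debian 12" ;;
    trixie)   echo "debian 12" ;;  # as listed in the text above
    *)        return 1 ;;          # unknown codename: fall back to manual install
  esac
}

# Hypothetical helper: prefer UBUNTU_CODENAME, then VERSION_CODENAME.
#   $1 = UBUNTU_CODENAME value (may be empty), $2 = VERSION_CODENAME value
resolve_upstream() {
  upstream_release "${1:-$2}"
}
```

A derivative that sets `UBUNTU_CODENAME=jammy` alongside its own `VERSION_CODENAME` would resolve to Ubuntu 22.04 under this scheme.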

**Docker not installed:** `--check-only` reports `MISSING: Docker is not installed` and prints the exact rerun command appropriate to the detected distro family. The default `--install` path installs Docker (`docker.io`, `moby-engine`, `docker-ce`, or `docker`, depending on family), enables and starts the daemon, configures the NVIDIA runtime, and adds the invoking user to the `docker` group. If you prefer to manage Docker yourself, install it before rerunning the script or pass `--skip-docker-install`.

**Docker installed but `docker run` still needs sudo:** The script adds the invoking user to the `docker` group, but Linux only refreshes group membership on a new login session. Log out and back in, or run `newgrp docker` in each new shell, until the new membership is active.

**Docker runtime still missing:** Restart Docker, then rerun `nvidia-ctk runtime configure --runtime=docker`.

**Driver branch detected != 580:** The driver-branch pin is exact on debian-family (`nvidia-open-580`). On rhel-/suse-family the script installs the latest open driver shipped in NVIDIA's CUDA 13.0 repo for the detected distro, which is always ≥ 580. If your host needs a stricter pin, set `$NVIDIA_DRIVER_PACKAGE_RHEL`, `$NVIDIA_DRIVER_KMOD_RHEL`, or `$NVIDIA_DRIVER_PACKAGE_SUSE` to the exact package names you want before running `--install`.
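
The override variables presumably act as defaults-with-fallback. The sketch below shows that pattern under stated assumptions: `driver_packages` is a hypothetical helper, the default names are the ones from the package table, and whether `$NVIDIA_DRIVER_PACKAGE_DEBIAN` replaces both Debian packages is a guess.

```shell
# Hypothetical helper: resolve driver packages, honoring override variables.
driver_packages() {
  case "$1" in
    debian)
      # Assumption: the override replaces the whole default package set.
      echo "${NVIDIA_DRIVER_PACKAGE_DEBIAN:-nvidia-driver-pinning-580 nvidia-open-580}" ;;
    rhel)
      echo "${NVIDIA_DRIVER_PACKAGE_RHEL:-nvidia-driver-cuda} ${NVIDIA_DRIVER_KMOD_RHEL:-kmod-nvidia-open-dkms}" ;;
    suse)
      echo "${NVIDIA_DRIVER_PACKAGE_SUSE:-nvidia-open-driver-G06-signed-kmp-default}" ;;
    *)
      return 1 ;;
  esac
}
```

Exporting one of the variables before running `--install` would then swap in the stricter package name while leaving the other defaults untouched.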

**Driver installed but `nvidia-smi` fails:** Load the module with `sudo modprobe nvidia` or reboot. Secure Boot may require MOK enrollment on systems where it is enabled.

**Kubernetes still has no GPU capacity:** Confirm the driver works on each GPU node with `nvidia-smi`, then check the GPU Operator/device plugin pods and node labels.
