tao-setup-nvidia-gpu-host

NVIDIA GPU Host Setup

Use this setup skill before TAO workflows run on the `docker`, `local-docker`, or `kubernetes` backend. It standardizes the host GPU runtime on:
  • NVIDIA driver branch `580` (open kernel module preferred)
  • CUDA Toolkit package `cuda-toolkit-13-0`
  • NVIDIA Container Toolkit `1.19.0`
  • Docker engine: installed only for the `docker` / `local-docker` backends, and only when Docker is missing. The package picked depends on the distro family (`docker.io` on Debian-family by default, `moby-engine` / `docker-ce` from `download.docker.com` on RHEL-family, `docker` on SUSE-family). Pass `--skip-docker-install` to opt out.

The check is safe and read-only by default. It works on any Linux distribution because it only probes `nvidia-smi`, the CUDA toolkit path, the installed Container Toolkit package version (via `dpkg`, `rpm`, or the `nvidia-ctk` binary version), and the Docker daemon's NVIDIA runtime.

Installation must be explicitly authorized by the user and then rerun with `--install`. The install path is automated for these distro families:

| Family | Tested distros | Manager | Notes |
|---|---|---|---|
| debian | Ubuntu 22.04 / 24.04, Debian 12 (and derivatives such as Pop!_OS, Mint, Zorin, Raspbian, and KDE Neon, via `UBUNTU_CODENAME` / `VERSION_CODENAME`) | `apt-get` | Adds NVIDIA `cuda-keyring` + Container Toolkit `.list`. Docker via `docker.io` (override `$DOCKER_PACKAGE_DEBIAN`). |
| rhel | Fedora 39+, RHEL / Rocky / AlmaLinux 9 and 10 | `dnf` (or `yum`) | Adds NVIDIA `cuda-<distro>.repo` + Container Toolkit `.repo`. Docker via Fedora `moby-engine` when available, otherwise `docker-ce` from `download.docker.com`. |
| suse | openSUSE Leap 15, SLES 15 | `zypper` | Adds the same NVIDIA `.repo` files. Docker via the distribution `docker` package. |
| other (Arch, Alpine, Gentoo, NixOS, FreeBSD, …) | n/a | n/a | `--install` exits with a clear error listing the version targets and the NVIDIA install-guide URLs. Install manually, then rerun `--check-only`. |
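
The family dispatch in the table can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `classify_family` and the exact `ID` / `ID_LIKE` patterns are assumptions chosen to match the families listed above.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: map os-release ID / ID_LIKE values onto the families
# named in the table. The real script's matching rules may differ.
classify_family() {
  local id="${1:-}" id_like="${2:-}" token
  # Check ID first, then each word of ID_LIKE, so that derivatives fall
  # through to their parent family. Word splitting here is intentional.
  for token in $id $id_like; do
    case "$token" in
      debian|ubuntu)                      echo debian; return 0 ;;
      rhel|fedora|centos|rocky|almalinux) echo rhel;   return 0 ;;
      suse|opensuse*|sles)                echo suse;   return 0 ;;
    esac
  done
  echo other
}

# Usage on a real host (sourced in a subshell so variables do not leak):
#   ( . /etc/os-release; classify_family "${ID:-}" "${ID_LIKE:-}" )
```

Matching on `ID_LIKE` as well as `ID` is what would let a derivative such as Pop!_OS resolve to the debian family without being listed by name.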

Quick Start

From the skill bank root:

```bash
# Check the local Docker backend host.
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend docker --check-only

# Install or repair after user approval (prompts for confirmation; see the note below for non-interactive runs).
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend docker --install

# Check a Kubernetes GPU worker host.
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend kubernetes --check-only
```

> ⚠️ **Note: running non-interactively (agent / skill runs).** A skill run has
> no terminal, so the installer's `Continue? [y/N]` confirmation cannot be
> answered. After running `--check-only` to preview what is missing and getting
> the user's explicit approval, append the assume-yes flag (`--yes`) to the
> `--install` command so it proceeds without a prompt. That auto-confirms
> installation of system packages (NVIDIA driver branch 580, CUDA Toolkit 13.0,
> NVIDIA Container Toolkit, and, for Docker backends, Docker) and modifies the
> host: it adds NVIDIA package repositories, may restart Docker, and adds the
> invoking user to the `docker` group. Only do this on a host you control and
> have the privileges to change. When a person runs `--install` directly at a
> terminal, the script instead prompts with the exact package list before making
> any changes.

In an installed plugin copy that exposes `skills/`, use:

```bash
bash skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend docker --check-only
```

Workflow Contract

Docker and Kubernetes workflows must run the check before submitting GPU work:

```bash
# Try the installed-plugin layout first, then fall back to the platform/ layout.
SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT:-$PWD}/skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
[ -x "$SETUP_SCRIPT" ] || SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT:-$PWD}/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"

# Run the read-only check; on failure, print the install command and stop.
bash "$SETUP_SCRIPT" --backend docker --check-only || {
  echo "MISSING: TAO GPU host runtime is not ready."
  echo "After user approval, run (append --yes for non-interactive agent runs):"
  echo "  bash \"$SETUP_SCRIPT\" --backend docker --install"
  exit 1
}
```

Never install silently. If the check fails, explain what is missing, ask the user to authorize the fix, then run the install command and rerun the check.
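
One way an agent wrapper might honor the "never install silently" rule is to build the install command only after approval, and to append `--yes` only when no terminal is attached. The sketch below is an illustration under assumptions: `build_install_cmd` and its `approved` / `interactive` arguments are hypothetical, and only the `--backend`, `--install`, and `--yes` flags come from this document.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the install command for an approved run.
# Returns status 1 (and prints nothing on stdout) without user approval.
build_install_cmd() {
  local script="$1" backend="$2" approved="$3" interactive="$4"
  if [ "$approved" != "yes" ]; then
    echo "refusing: user approval is required before --install" >&2
    return 1
  fi
  local cmd="bash \"$script\" --backend $backend --install"
  # Without a terminal, the Continue? [y/N] prompt cannot be answered.
  if [ "$interactive" = "no" ]; then
    cmd="$cmd --yes"
  fi
  echo "$cmd"
}

# Usage: detect a terminal on stdin and pass the result along.
#   tty_state="$([ -t 0 ] && echo yes || echo no)"
#   build_install_cmd "$SETUP_SCRIPT" docker yes "$tty_state"
```

Keeping approval as an explicit argument means the caller has to record the user's answer before any install command can be produced.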

What The Installer Does

The installer dispatches on the detected distribution family. On every supported family it adds NVIDIA's CUDA and Container Toolkit repositories (if missing), installs the pinned runtime packages, optionally installs Docker, wires the NVIDIA Docker runtime, and adds the invoking user to the `docker` group.

Common steps (all families):
  1. Adds NVIDIA's CUDA repository if missing (the `cuda-keyring` deb for apt, `cuda-<distro>.repo` for dnf/zypper).
  2. Adds NVIDIA's Container Toolkit repository if missing (`.list` for apt, `.repo` for dnf/zypper).
  3. Installs the matching kernel header / devel package for the running kernel.
  4. Installs the driver branch 580 packages, `cuda-toolkit-13-0`, and the Container Toolkit pinned to `1.19.0` (the dpkg-suffixed `1.19.0-1` is the same upstream version expressed for apt).
  5. For Docker backends and when Docker is missing, installs Docker (override / opt-out flags below), enables/starts the daemon, then runs `nvidia-ctk runtime configure --runtime=docker` and restarts Docker when `systemctl` is available.
  6. Adds the invoking user (`$SUDO_USER` if available, else `$USER`) to the `docker` group so subsequent shells can run `docker` without `sudo` (opt out with `--skip-docker-group`). The new group membership does not take effect in the current shell: log out and back in, or run `newgrp docker` in each new shell.
  7. Attempts `modprobe nvidia` so verification can pass before reboot.

Family-specific package selections:

| Step | debian-family | rhel-family | suse-family |
|---|---|---|---|
| Kernel headers | `linux-headers-$(uname -r)` | `kernel-devel-$(uname -r)`, `kernel-headers-$(uname -r)` | `kernel-default-devel` |
| Driver | `nvidia-driver-pinning-580`, `nvidia-open-580` (override: `$NVIDIA_DRIVER_PACKAGE_DEBIAN`) | `nvidia-driver-cuda`, `kmod-nvidia-open-dkms` (override: `$NVIDIA_DRIVER_PACKAGE_RHEL`, `$NVIDIA_DRIVER_KMOD_RHEL`) | `nvidia-open-driver-G06-signed-kmp-default` (override: `$NVIDIA_DRIVER_PACKAGE_SUSE`) |
| CUDA toolkit | `cuda-toolkit-13-0` | `cuda-toolkit-13-0` | `cuda-toolkit-13-0` |
| Container Toolkit | `nvidia-container-toolkit=1.19.0-1` + base/tools/libs | `nvidia-container-toolkit-1.19.0` + base/tools/libs | same as rhel |
| Docker | `docker.io` (override: `$DOCKER_PACKAGE_DEBIAN`) | `moby-engine` + `moby-cli` on Fedora when available, else `docker-ce docker-ce-cli containerd.io` from `download.docker.com` | `docker` |
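
The Container Toolkit row shows two spellings of the same pin: `name=version-1` for apt and `name-version` for dnf/zypper. A small sketch of that formatting rule follows; `pinned_pkg_spec` is a hypothetical helper written for illustration, not a function taken from the script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: format a version-pinned package spec per distro family,
# following the two spellings shown in the Container Toolkit row.
pinned_pkg_spec() {
  local family="$1" pkg="$2" version="$3"
  case "$family" in
    debian)    echo "${pkg}=${version}-1" ;;  # apt: name=version-revision
    rhel|suse) echo "${pkg}-${version}" ;;    # dnf/zypper: name-version
    *)
      echo "unsupported family: $family" >&2
      return 1
      ;;
  esac
}

# Usage:
#   pinned_pkg_spec debian nvidia-container-toolkit 1.19.0
#   pinned_pkg_spec rhel   nvidia-container-toolkit 1.19.0
```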

Verification

After installation, verify:

```bash
nvidia-smi
/usr/local/cuda-13.0/bin/nvcc --version
docker info --format '{{json .Runtimes}}' | grep nvidia
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```

Expected `nvidia-smi` output includes driver `580.x` and CUDA Version `13.0`. Expected `nvcc` output includes `release 13.0`.
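
To check the driver expectation in a script rather than by eye, the branch can be compared against the part of the version string before the first dot. The helper below is a hypothetical sketch, and the version numbers in its usage notes are illustrative values, not output captured from a real host.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when a driver version string such as
# "580.65.06" belongs to the expected branch (the text before the first dot).
driver_on_branch() {
  local version="$1" branch="$2"
  [ "${version%%.*}" = "$branch" ]
}

# Usage on a GPU host (reads the first GPU only):
#   ver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
#   driver_on_branch "$ver" 580 && echo "driver branch OK: $ver"
```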
Kubernetes Notes

For self-managed Kubernetes clusters, run the host installer on every GPU worker node or bake the same package set into the node image before installing the NVIDIA GPU Operator or device plugin.

The workflow check also warns if `kubectl` is available but the cluster reports no `nvidia.com/gpu` allocatable capacity. In that case, install/configure the NVIDIA GPU Operator after the worker host runtime is ready:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator
```

Managed Kubernetes providers may own driver installation through node images or GPU Operator policy. Do not overwrite a provider-managed GPU node without user approval and a rollback plan.
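
The "no `nvidia.com/gpu` allocatable capacity" condition can be checked by hand by summing the per-node allocatable values. The sketch below is one possible approach, not the script's own check: `sum_allocatable_gpus` is a hypothetical helper, and the `kubectl` invocation in the usage note should be verified against your cluster version.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sum the second column of "NAME GPU" lines from stdin.
# Nodes without GPUs show "<none>" in custom-columns output and count as 0.
sum_allocatable_gpus() {
  awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Usage against a live cluster:
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | sum_allocatable_gpus
```

A total of `0` corresponds to the warning case described above.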

Failure Modes

**Unsupported distribution family:** `--install` automates debian-, rhel-, and suse-family hosts. On Arch, Alpine, Gentoo, NixOS, FreeBSD, or anything without `/etc/os-release` (e.g. macOS), the script exits with a clear error that lists the four version targets and the upstream install-guide URLs:
  • https://docs.nvidia.com/cuda/cuda-installation-guide-linux/
  • https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
  • https://docs.docker.com/engine/install/

Install those four pieces using your distribution's package manager and rerun the script with `--check-only` to verify. The check is portable across distributions because it only queries the binaries and package databases, so once the runtime is in place the workflow contract is satisfied regardless of the underlying distro.

**Unsupported Ubuntu/Debian derivative:** When `ID` is e.g. `pop`, `mint`, `zorin`, `raspbian`, or another debian-family derivative, the script maps the host onto the upstream Ubuntu/Debian CUDA repo via `UBUNTU_CODENAME` / `VERSION_CODENAME` (`focal` / `jammy` / `noble` → Ubuntu 20.04 / 22.04 / 24.04; `bullseye` / `bookworm` / `trixie` → Debian 11 / 12 / 13). If the host's codename doesn't match a known upstream release, `--install` exits with the same manual-install guidance described above.

**Docker not installed:** `--check-only` reports `MISSING: Docker is not installed` and prints the exact rerun command appropriate to the detected distro family. The default `--install` path installs Docker (`docker.io` / `moby-engine` / `docker-ce` / `docker` depending on family), enables/starts the daemon, configures the NVIDIA runtime, and adds the invoking user to the `docker` group. If you prefer to manage Docker yourself, install it before rerunning the script or pass `--skip-docker-install`.

**Docker installed but `docker run` still needs sudo:** The script adds the invoking user to the `docker` group, but Linux only refreshes group membership on a new login session. Log out and back in, or run `newgrp docker` in each new shell, until the new membership is active.

**Docker runtime still missing:** Restart Docker, then rerun `nvidia-ctk runtime configure --runtime=docker`.

**Driver branch detected != 580:** The driver-branch pin is exact on debian-family (`nvidia-open-580`). On rhel-/suse-family the script installs the latest open driver shipped in NVIDIA's CUDA 13.0 repo for the detected distro, which is always ≥ 580. If your host needs a stricter pin, set `$NVIDIA_DRIVER_PACKAGE_RHEL` / `$NVIDIA_DRIVER_KMOD_RHEL` / `$NVIDIA_DRIVER_PACKAGE_SUSE` to the exact package names you want before running `--install`.

**Driver installed but `nvidia-smi` fails:** Load the module with `sudo modprobe nvidia` or reboot. Secure Boot may require MOK enrollment on systems where it is enabled.

**Kubernetes still has no GPU capacity:** Confirm the driver works on each GPU node with `nvidia-smi`, then check the GPU Operator/device plugin pods and node labels.
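
The codename mapping described under "Unsupported Ubuntu/Debian derivative" can be sketched as a simple lookup. This is an illustrative assumption about how such a mapping might be written: `upstream_release_for_codename` is a hypothetical name, and the output format ("distro version") is chosen for readability rather than taken from the script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a release codename to its upstream distro and
# version. Returns status 1 for unknown codenames, which is the case where
# the document says --install falls back to manual-install guidance.
upstream_release_for_codename() {
  case "$1" in
    focal)    echo "ubuntu 20.04" ;;
    jammy)    echo "ubuntu 22.04" ;;
    noble)    echo "ubuntu 24.04" ;;
    bullseye) echo "debian 11" ;;
    bookworm) echo "debian 12" ;;
    trixie)   echo "debian 13" ;;
    *)        return 1 ;;
  esac
}

# Usage on a real host, preferring UBUNTU_CODENAME when a derivative sets it:
#   ( . /etc/os-release
#     upstream_release_for_codename "${UBUNTU_CODENAME:-${VERSION_CODENAME:-}}" )
```

Preferring `UBUNTU_CODENAME` matters on Ubuntu derivatives, where `VERSION_CODENAME` may hold the derivative's own codename instead of the upstream one.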