tao-setup-nvidia-gpu-host
NVIDIA GPU Host Setup
Use this setup skill before TAO workflows run on the `docker`, `local-docker`,
or `kubernetes` backend. It standardizes the host GPU runtime on:

- NVIDIA driver branch `580` (open kernel module preferred)
- CUDA Toolkit package `cuda-toolkit-13-0`
- NVIDIA Container Toolkit `1.19.0`
- Docker engine: only installed for `docker` / `local-docker` backends and only when Docker is missing. The package picked depends on the distro family (`docker.io` on Debian-family by default, `moby-engine` / `docker-ce` from `download.docker.com` on RHEL-family, `docker` on SUSE-family). Pass `--skip-docker-install` to opt out.

The check is safe and read-only by default. It works on any Linux
distribution because it only probes `nvidia-smi`, the CUDA toolkit path,
the installed container-toolkit package version (via `dpkg`/`rpm`/the
`nvidia-ctk` binary version), and the Docker daemon's NVIDIA runtime.

Installation must be explicitly authorized by the user and rerun with
`--install`. The install path is automated for these distro families:

| Family | Tested distros | Manager | Notes |
|---|---|---|---|
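| debian | Ubuntu 22.04 / 24.04, Debian 12 (and derivatives Pop!_OS, Mint, Zorin, Raspbian, KDE Neon, etc., mapped onto the upstream release by codename) | `apt` | Adds NVIDIA CUDA and Container Toolkit repositories |
| rhel | Fedora 39+, RHEL / Rocky / AlmaLinux 9 and 10 | `dnf` | Adds NVIDIA CUDA and Container Toolkit repositories |
| suse | openSUSE Leap 15, SLES 15 | `zypper` | Adds the same NVIDIA repositories |
| other (Arch, Alpine, Gentoo, NixOS, FreeBSD, …) | n/a | n/a | `--install` exits with manual-install guidance; `--check-only` still works |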
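
The family dispatch in the table can be pictured with a small sketch. This is a minimal illustration, assuming the family is derived from the `ID` and `ID_LIKE` fields of `/etc/os-release`; the function name `detect_family` and the exact list of matched IDs are hypothetical and are not taken from the script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a host from os-release ID / ID_LIKE values.
# The real script's detection logic is not shown in this document.
detect_family() {
  local id="$1" id_like="${2:-}"
  local token
  # Check ID first, then each space-separated ID_LIKE entry.
  for token in $id $id_like; do
    case "$token" in
      debian|ubuntu) echo debian; return 0 ;;
      rhel|fedora|centos|rocky|almalinux) echo rhel; return 0 ;;
      suse|sles|opensuse|opensuse-leap) echo suse; return 0 ;;
    esac
  done
  echo other
}

# A Pop!_OS-style host: ID=pop, ID_LIKE="ubuntu debian".
detect_family pop "ubuntu debian"   # prints: debian
```

On a real host the two arguments would come from sourcing `/etc/os-release` in a subshell, which keeps its variables out of the calling environment.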
Quick Start
From the skill bank root:

```bash
# Check the local Docker backend host.
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend docker --check-only

# Install or repair after user approval (prompts for confirmation; see the note below for non-interactive runs).
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend docker --install

# Check a Kubernetes GPU worker host.
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend kubernetes --check-only
```

> ⚠️ **Note — running non-interactively (agent / skill runs):** a skill run has
> no terminal, so the installer's `Continue? [y/N]` confirmation cannot be
> answered. After running `--check-only` to preview what is missing and getting
> the user's explicit approval, append the assume-yes flag (`--yes`) to the
> `--install` command so it proceeds without a prompt. That auto-confirms
> installation of system packages (NVIDIA driver branch 580, CUDA Toolkit 13.0,
> NVIDIA Container Toolkit, and — for Docker backends — Docker) and modifies the
> host: it adds NVIDIA package repositories, may restart Docker, and adds the
> invoking user to the `docker` group, so only do this on a host you control and
> have the privileges to change. When a person runs `--install` directly at a
> terminal, the script instead prompts with the exact package list before making
> any changes.
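
The decision described in the note can be sketched as a small wrapper. This is only an illustration: the function name `install_args` and the `APPROVED` variable are hypothetical stand-ins for however an agent records the user's approval; only `--backend`, `--check-only`, `--install`, and `--yes` come from the text above.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: build the installer arguments for a given run mode.
# APPROVED is an assumed variable that records the user's explicit approval;
# it is not part of the documented script interface.
install_args() {
  local backend="$1" interactive="$2" approved="${3:-no}"
  if [ "$interactive" = "yes" ]; then
    # A person at a terminal answers the installer's own prompt.
    echo "--backend $backend --install"
  elif [ "$approved" = "yes" ]; then
    # No terminal: --yes is appended only after explicit approval.
    echo "--backend $backend --install --yes"
  else
    # No terminal and no approval: stay read-only.
    echo "--backend $backend --check-only"
  fi
}

# [ -t 0 ] is true when stdin is attached to a terminal.
if [ -t 0 ]; then interactive=yes; else interactive=no; fi
install_args docker "$interactive" "${APPROVED:-no}"
```

Without recorded approval, the fallback is the read-only check, so an unattended run never modifies the host.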

In an installed plugin copy that exposes `skills/`, use:

```bash
bash skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend docker --check-only
```

Workflow Contract
Docker and Kubernetes workflows must run the check before submitting GPU work:

```bash
# Try the skills/ layout first, then fall back to the platform/ layout.
SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT:-$PWD}/skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
[ -x "$SETUP_SCRIPT" ] || SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT:-$PWD}/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
# Read-only check; on failure, print the install command and stop before GPU work.
bash "$SETUP_SCRIPT" --backend docker --check-only || {
  echo "MISSING: TAO GPU host runtime is not ready."
  echo "After user approval, run (append --yes for non-interactive agent runs):"
  echo "  bash \"$SETUP_SCRIPT\" --backend docker --install"
  exit 1
}
```

Never install silently. If the check fails, explain what is missing, ask the
user to authorize the fix, then run the install command and rerun the check.

What The Installer Does
The installer dispatches on the detected distribution family. On every
supported family it adds NVIDIA's CUDA and Container Toolkit repositories
(if missing), installs the pinned runtime packages, optionally installs
Docker, wires the NVIDIA Docker runtime, and adds the invoking user to
the `docker` group.

Common steps (all families):

- Adds NVIDIA's CUDA repository if missing (apt `cuda-keyring` deb, `cuda-<distro>.repo` for dnf/zypper).
- Adds NVIDIA's Container Toolkit repository if missing (`.list` for apt, `.repo` for dnf/zypper).
- Installs the matching kernel header / devel package for the running kernel.
- Installs the driver branch 580 packages, `cuda-toolkit-13-0`, and the Container Toolkit pinned to `1.19.0` (the dpkg-suffixed `1.19.0-1` is the same upstream version expressed for apt).
- For Docker backends and when Docker is missing, installs Docker (override / opt-out flags below), enables/starts the daemon, then runs `nvidia-ctk runtime configure --runtime=docker` and restarts Docker when `systemctl` is available.
- Adds the invoking user (`$SUDO_USER` if available, else `$USER`) to the `docker` group so subsequent shells can run `docker` without `sudo` (opt out with `--skip-docker-group`). The new group membership does not take effect in the current shell: log out and back in, or run `newgrp docker` in each new shell.
- Attempts `modprobe nvidia` so verification can pass before reboot.

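
Two details from the steps above are easy to get wrong: treating `1.19.0-1` and `1.19.0` as the same pinned version, and choosing which account joins the `docker` group. The sketch below illustrates both under stated assumptions; `upstream_version` and `target_user` are hypothetical names, not functions from the installer.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not taken from the installer itself.

# Strip a dpkg revision suffix so apt and rpm versions compare equal:
# "1.19.0-1" becomes "1.19.0"; a version without a suffix is unchanged.
upstream_version() {
  printf '%s\n' "${1%%-*}"
}

# Pick the account to add to the docker group: the user who invoked sudo
# when SUDO_USER is set, otherwise the current user.
target_user() {
  printf '%s\n' "${SUDO_USER:-${USER:-}}"
}

upstream_version "1.19.0-1"   # prints: 1.19.0
upstream_version "1.19.0"     # prints: 1.19.0
```

Preferring `SUDO_USER` matters because under `sudo` the current user is `root`, and adding `root` to the `docker` group would not help the person who ran the command.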
Family-specific package selections:
| Step | debian-family | rhel-family | suse-family |
|---|---|---|---|
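| Kernel headers | | | |
| Driver | `nvidia-open-580` (exact pin) | latest open driver in NVIDIA's CUDA 13.0 repo | latest open driver in NVIDIA's CUDA 13.0 repo |
| CUDA toolkit | `cuda-toolkit-13-0` | `cuda-toolkit-13-0` | `cuda-toolkit-13-0` |
| Container Toolkit | pinned to `1.19.0-1` | pinned to `1.19.0` | same as rhel |
| Docker | `docker.io` | `moby-engine` (Fedora, when available) or `docker-ce` from `download.docker.com` | `docker` |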
Verification
After installation, verify:

```bash
nvidia-smi
/usr/local/cuda-13.0/bin/nvcc --version
docker info --format '{{json .Runtimes}}' | grep nvidia
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```

Expected `nvidia-smi` output includes driver `580.x` and CUDA Version `13.0`.
Expected `nvcc` output includes `release 13.0`.

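
The expected values can also be checked mechanically. The following is a rough sketch rather than the script's own verification: `driver_branch` and `nvcc_is_13_0` are hypothetical helper names, and the live check runs only when `nvidia-smi` is on the `PATH`.

```bash
#!/usr/bin/env bash
# Hypothetical checks; the function names are illustrative only.

# Extract the driver branch (major version) from a version string such as
# the one printed by:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
driver_branch() {
  printf '%s\n' "${1%%.*}"
}

# Succeed when `nvcc --version` output (passed as $1) reports release 13.0.
nvcc_is_13_0() {
  printf '%s\n' "$1" | grep -q 'release 13\.0'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  [ "$(driver_branch "$version")" = "580" ] || echo "driver branch is not 580: $version"
fi
```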
Kubernetes Notes
For self-managed Kubernetes clusters, run the host installer on every GPU
worker node or bake the same package set into the node image before installing
the NVIDIA GPU Operator or device plugin.

The workflow check also warns if `kubectl` is available but the cluster reports
no allocatable `nvidia.com/gpu` capacity. In that case, install/configure the
NVIDIA GPU Operator after the worker host runtime is ready:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator
```

Managed Kubernetes providers may own driver installation through node images or
GPU Operator policy. Do not overwrite a provider-managed GPU node without user
approval and a rollback plan.
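
One way to inspect allocatable GPU capacity by hand is sketched below. It is an assumption about how such a check could look, not the workflow's actual implementation; `total_allocatable_gpus` is a hypothetical helper that sums the second column of `kubectl` output.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sum the GPU column of "NAME GPU" lines read from stdin.
# Nodes without the resource show "<none>" and count as zero.
total_allocatable_gpus() {
  awk '{ if ($2 ~ /^[0-9]+$/) total += $2 } END { print total + 0 }'
}

if command -v kubectl >/dev/null 2>&1; then
  # The backslash escapes the dot inside the nvidia.com/gpu resource name.
  kubectl get nodes --no-headers \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
    | total_allocatable_gpus
fi
```

A total of zero on a cluster with GPU nodes points at the GPU Operator or device plugin rather than at the host driver.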
Failure Modes
**Unsupported distribution family:** `--install` automates debian-, rhel-,
and suse-family hosts. On Arch, Alpine, Gentoo, NixOS, FreeBSD, or anything
without `/etc/os-release` (e.g. macOS), the script exits with a clear error
that lists the four version targets and the upstream NVIDIA install-guide
URLs:

- https://docs.nvidia.com/cuda/cuda-installation-guide-linux/
- https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
- https://docs.docker.com/engine/install/

Install those four pieces using your distribution's package manager and
rerun the script with `--check-only` to verify. The check is universally
portable because it only queries the binaries / package databases, so once the
runtime is in place the workflow contract is satisfied regardless of the
underlying distro.

**Unsupported Ubuntu/Debian derivative:** When `ID` is e.g. `pop`, `mint`,
`zorin`, `raspbian`, or another debian-family derivative, the script maps
the host onto the upstream Ubuntu/Debian CUDA repo via `UBUNTU_CODENAME` /
`VERSION_CODENAME` (`focal`/`jammy`/`noble` → Ubuntu 20.04/22.04/24.04;
`bullseye`/`bookworm`/`trixie` → Debian 11/12/12). If the host's codename
doesn't match a known upstream release, `--install` exits with the same
manual-install guidance described above.

**Docker not installed:** `--check-only` reports `MISSING: Docker is not installed`
and prints the exact rerun command appropriate to the detected
distro family. The default `--install` path installs Docker (`docker.io` /
`moby-engine` / `docker-ce` / `docker` depending on family), enables/starts
the daemon, configures the NVIDIA runtime, and adds the invoking user to
the `docker` group. If you prefer to manage Docker yourself, install it
before rerunning the script or pass `--skip-docker-install`.

**Docker installed but `docker run` still needs sudo:** The script adds the
invoking user to the `docker` group, but Linux only refreshes group
membership on a new login session. Log out and back in, or run
`newgrp docker` in each new shell, until the new membership is active.

**Docker runtime still missing:** Rerun
`nvidia-ctk runtime configure --runtime=docker`, then restart Docker so the
daemon loads the updated configuration.

**Driver branch detected != 580:** The driver-branch pin is exact on
debian-family (`nvidia-open-580`). On rhel-/suse-family the script
installs the latest open driver shipped in NVIDIA's CUDA 13.0 repo for
the detected distro, which is always ≥ 580. If your host needs a stricter
pin, set `$NVIDIA_DRIVER_PACKAGE_RHEL` / `$NVIDIA_DRIVER_KMOD_RHEL` /
`$NVIDIA_DRIVER_PACKAGE_SUSE` to the exact package names you want before
running `--install`.

**Driver installed but `nvidia-smi` fails:** Load the module with
`sudo modprobe nvidia` or reboot. Secure Boot may require MOK enrollment on
systems where it is enabled.

**Kubernetes still has no GPU capacity:** Confirm the driver works on each GPU
node with `nvidia-smi`, then check the GPU Operator/device plugin pods and node
labels.
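
The derivative mapping described under "Unsupported Ubuntu/Debian derivative" can be pictured as a lookup table. This is a sketch under assumptions: `map_codename` is a hypothetical name, the printed strings are illustrative rather than the script's real repository identifiers, and preferring `UBUNTU_CODENAME` over `VERSION_CODENAME` is a guess at the precedence.

```bash
#!/usr/bin/env bash
# Hypothetical mapping from a codename to the upstream release whose CUDA
# repository is used. The pairs mirror the failure-mode text; the output
# format ("ubuntu 22.04") is an assumption, not the script's repo identifier.
map_codename() {
  case "$1" in
    focal)    echo "ubuntu 20.04" ;;
    jammy)    echo "ubuntu 22.04" ;;
    noble)    echo "ubuntu 24.04" ;;
    bullseye) echo "debian 11" ;;
    bookworm) echo "debian 12" ;;
    trixie)   echo "debian 12" ;;  # per the text, trixie maps to Debian 12
    *)        return 1 ;;          # unknown codename: manual install path
  esac
}

# Assumed precedence: derivatives usually expose the upstream Ubuntu codename
# in UBUNTU_CODENAME, while VERSION_CODENAME may hold their own codename.
codename="${UBUNTU_CODENAME:-${VERSION_CODENAME:-}}"
map_codename "$codename" || echo "no upstream match for '${codename}'"
```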