Local Docker

Single-node execution platform that runs TAO jobs as named Docker containers on the local Docker daemon. It is useful for development, debugging, small runs, and machines where the agent host already has the required GPUs, NVIDIA driver, Docker, and NVIDIA Container Toolkit.

Use local Docker when the data is local to the Docker host or accessible through mounted volumes/cloud credentials. Do not use it for remote cluster scheduling, multi-node training, or jobs that need SLURM queueing.

Preflight

The workflow must verify the host GPU runtime before starting Docker jobs. If the check fails, prompt the user to approve the install, run the printed install command, and rerun the preflight.

```bash
# Host GPU runtime: NVIDIA driver 580, CUDA 13.0, NVIDIA Container Toolkit 1.19.0.
TAO_SKILL_BANK_ROOT="${TAO_SKILL_BANK_ROOT:-$PWD}"
SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
[ -x "$SETUP_SCRIPT" ] || SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"

bash "$SETUP_SCRIPT" --backend docker --check-only || {
  echo "MISSING: TAO GPU host runtime is not ready."
  echo "After user approval, run:"
  echo "  bash \"$SETUP_SCRIPT\" --backend docker --install --yes"
  exit 1
}
```

```bash
# Mode 1: direct docker (no Python). All you need is docker + the GPU runtime.
docker info >/dev/null 2>&1 || { echo "MISSING: docker daemon not reachable. Start Docker."; exit 1; }
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi >/dev/null 2>&1 || {
  echo "MISSING: NVIDIA Container Toolkit not installed/configured. See:"
  echo "  bash \"$SETUP_SCRIPT\" --backend docker --install --yes"
  exit 1
}
```

```bash
# Mode 2: TAO SDK wrapper. Adds Job handles, S3 I/O wrapping, ActionWorkflow.
# Skip this block if Mode 1 is sufficient for the user's request.
# When Mode 2 is in scope, read `tao-skill-bank:tao-run-platform` for the DockerSDK
# kwarg contract, build_entrypoint, and monitoring patterns.
# nvidia-tao-sdk is on public PyPI; pin lives in versions.yaml (wheels.tao_sdk_docker).
PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_docker)
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}
python -c "import docker" 2>/dev/null || {
  echo "MISSING: docker Python client not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}
```

```bash
# DockerSDK attaches every job container to ${DOCKER_NETWORK:-tao_default}. If
# the network does not exist, container start fails instantly with
# `network <name> not found` for every create_job.
DOCKER_NETWORK_NAME="${DOCKER_NETWORK:-tao_default}"
docker network ls --format '{{.Name}}' | grep -qx "$DOCKER_NETWORK_NAME" || {
  echo "MISSING: docker network '$DOCKER_NETWORK_NAME' not found. After user approval, run:"
  echo "  docker network create $DOCKER_NETWORK_NAME"
  exit 1
}
```


If a check fails, the agent prompts the user to authorize the install/fix via Bash before proceeding.

Credentials

There are no platform credentials required beyond access to the Docker daemon.

Optional environment:

DOCKER_HOST: Optional Docker daemon URL. If unset, the SDK uses the Docker Python client's normal environment/default socket resolution.
DOCKER_NETWORK: Docker network for job containers. Default is `tao_default`.
DOCKER_USERNAME: Registry username. Default is `$oauthtoken` for NGC.
NGC_KEY: Used when pulling private images from `nvcr.io`.
HOST_SSH_PATH: Mounted into AutoML brain containers when they need SSH keys to monitor remote SLURM child jobs.
ACCESS_KEY, SECRET_KEY, S3_ENDPOINT_URL, S3_BUCKET_NAME: Optional S3-compatible storage settings for jobs that still read/write cloud storage from a local container.
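
As a rough illustration of how these optional settings resolve, the sketch below collects them from an environment mapping and applies the two documented defaults. The function name `resolve_docker_env` and the returned dictionary layout are assumptions made for this example; they are not part of the TAO SDK.

```python
import os

# Defaults documented in the list above; every other variable is optional.
DEFAULTS = {"DOCKER_NETWORK": "tao_default", "DOCKER_USERNAME": "$oauthtoken"}
OPTIONAL = (
    "DOCKER_HOST", "NGC_KEY", "HOST_SSH_PATH",
    "ACCESS_KEY", "SECRET_KEY", "S3_ENDPOINT_URL", "S3_BUCKET_NAME",
)


def resolve_docker_env(environ=None):
    """Return the effective settings (hypothetical helper, not an SDK call).

    Unset or empty optional variables are simply omitted from the result.
    """
    environ = os.environ if environ is None else environ
    settings = {key: environ.get(key) or default for key, default in DEFAULTS.items()}
    for key in OPTIONAL:
        if environ.get(key):
            settings[key] = environ[key]
    return settings
```

Calling `resolve_docker_env({})` yields only the two defaults, which matches the statement that nothing beyond Docker daemon access is required.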

Launch Preflight

Before generating scripts or starting containers:

Verify the Docker daemon is reachable and the NVIDIA runtime can see GPUs.
Verify every local/file dataset annotation and media path exists on the Docker host.
For `s3://` datasets/results, verify `ACCESS_KEY` and `SECRET_KEY` are set and the exact paths are readable with `aws s3 ls`.
Verify model-specific credentials such as `HF_TOKEN` before launch.
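
Part of this checklist can be sketched in a few lines of Python. The helper below is hypothetical (its name, arguments, and message strings are assumptions for illustration) and covers only the local-path and S3-credential checks; it does not probe the Docker daemon, the GPU runtime, or run `aws s3 ls`.

```python
import os


def launch_preflight(local_paths, uses_s3, environ=None):
    """Return a list of problems; an empty list means these checks passed."""
    environ = os.environ if environ is None else environ
    # Every local annotation or media path must exist on the Docker host.
    problems = [f"missing path: {p}" for p in local_paths if not os.path.exists(p)]
    # S3 datasets/results need both credentials to be set.
    if uses_s3:
        for key in ("ACCESS_KEY", "SECRET_KEY"):
            if not environ.get(key):
                problems.append(f"{key} is not set")
    return problems
```

Returning a list rather than raising on the first failure lets the caller report every missing prerequisite in one pass before asking the user to fix anything.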

Multi-GPU and multi-node

Multi-node is not supported on local Docker. One job runs on the local Docker daemon's host with no cross-host coordination.

Multi-GPU on the local host is supported via the NVIDIA Container Toolkit's `--gpus` flag (`--gpus all` or `--gpus '"device=0,1,2,3"'`). `DockerSDK.create_job(gpu_count=N)` plumbs through to `--gpus`. Single-host distributed init uses `localhost`; `torchrun --nproc-per-node=N` or PyTorch DDP work as usual.
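
The quoting of the device list is easy to get wrong when the command is built programmatically, so here is a minimal sketch of assembling the `--gpus` value for a direct `docker run`. The helper names are hypothetical; only the `--gpus all` and `--gpus '"device=..."'` forms come from the text above.

```python
def gpus_flag_value(device_ids=None):
    """Return the value to pass to docker's --gpus flag."""
    if not device_ids:
        return "all"
    # Docker parses the value as comma-separated fields, so a device list
    # must be wrapped in literal double quotes to stay a single field.
    return '"device=' + ",".join(str(d) for d in device_ids) + '"'


def docker_run_argv(image, command, device_ids=None):
    """Build an argv list suitable for subprocess.run (no shell quoting needed)."""
    return ["docker", "run", "--rm", "--gpus", gpus_flag_value(device_ids), image, *command]
```

Because the result is an argv list, the outer single quotes used on a shell command line are not needed; only the inner double quotes are part of the value.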

Backend Details

Use the SDK backend value `local-docker`. The local backend schema has no extra backend details, so most routing is controlled by environment and job parameters:

```json
{
  "backend_type": "local-docker",
  "num_gpu": 1
}
```

Following the Lepton/Brev SDK design, platform/control-plane values stay in SDK state and Docker labels. The SDK does not inject `BACKEND`, `HOST_PLATFORM`, `MONGOSECRET`, `DOCKER_HOST`, or `DOCKER_NETWORK` into the training container.
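
One way to picture this rule is as a filter applied to the job environment before the container starts. The sketch below is an illustration under that assumption; `training_container_env` is a hypothetical name, and this is not the SDK's actual implementation.

```python
# Control-plane variables that, per the text above, never reach the training container.
CONTROL_PLANE_KEYS = frozenset(
    {"BACKEND", "HOST_PLATFORM", "MONGOSECRET", "DOCKER_HOST", "DOCKER_NETWORK"}
)


def training_container_env(job_env):
    """Return a copy of job_env without control-plane keys (hypothetical helper)."""
    return {k: v for k, v in job_env.items() if k not in CONTROL_PLANE_KEYS}
```

Job-level values such as storage credentials pass through unchanged, while routing details stay on the host side.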

Container Execution

The TAO SDK local Docker handler starts containers through the Docker Python client:

Backend job name uses the `tao-job-<job_id>` form used by SDK handlers.
Command is usually `["/bin/bash", "-c", "<job command>"]`.
Containers run detached. The SDK keeps containers by default so status and logs remain inspectable, unless `DOCKER_AUTO_REMOVE=true`.
`/dev/shm` is mounted as tmpfs.
The configured Docker network is applied by the Docker daemon for the job container; it is not passed through as a process environment variable.
Existing containers with the same job id are stopped and removed before a replacement starts.

For GPU access, the handler auto-detects the host type:

Tegra or Jetson hosts use `runtime="nvidia"` plus `NVIDIA_VISIBLE_DEVICES` and `NVIDIA_DRIVER_CAPABILITIES=all`.
Standard x86 hosts use Docker `device_requests` with GPU capabilities.

If `num_gpus` is `0`, no GPUs are assigned. If `num_gpus` is `-1`, all visible GPUs are requested. Prefer explicit GPU counts for shared development machines.
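
The branching described above can be sketched as a function that produces keyword arguments in the shape the Docker Python client's `containers.run()` accepts (`runtime`, `environment`, `device_requests`). This is an assumption-heavy illustration, not the handler's code: the function name is hypothetical, the sketch treats a count of 0 as the no-GPU case, the way `NVIDIA_VISIBLE_DEVICES` is derived from the count is a guess, and the device request is shown as a plain dictionary whose keys mirror the `count` and `capabilities` arguments of `docker.types.DeviceRequest`.

```python
def gpu_run_kwargs(num_gpus, is_tegra=False):
    """Return GPU-related keyword arguments for a container start (sketch only)."""
    if num_gpus == 0:
        return {}  # no GPUs assigned
    if is_tegra:
        # Assumption: expose the first N device indices, or "all" for -1.
        visible = "all" if num_gpus == -1 else ",".join(str(i) for i in range(num_gpus))
        return {
            "runtime": "nvidia",
            "environment": {
                "NVIDIA_VISIBLE_DEVICES": visible,
                "NVIDIA_DRIVER_CAPABILITIES": "all",
            },
        }
    # Standard x86 hosts: a device request with the "gpu" capability;
    # count=-1 asks Docker for all visible GPUs.
    return {"device_requests": [{"count": num_gpus, "capabilities": [["gpu"]]}]}
```

The sketch does not talk to Docker, so it can be exercised on a machine without a GPU to confirm which branch a given configuration would take.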

Storage

Local Docker accepts local and `file://` paths because the container runs on the same Docker host. Make sure every path in the spec is either:

mounted into the container by the handler or surrounding service,
reachable from inside the container already, or
a cloud URI with matching credentials.

For remote/shared filesystems, prefer the platform that owns that filesystem. For example, use SLURM plus `lustre:///...` for Lustre paths on a cluster.
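
A small classifier can make this routing decision explicit before a job is launched. The sketch below is hypothetical (function name and return labels are assumptions) and only inspects the URI scheme; it does not check mounts or credentials.

```python
from urllib.parse import urlparse


def classify_path(path):
    """Classify a dataset/result path by its URI scheme (illustrative only)."""
    scheme = urlparse(path).scheme
    if scheme in ("", "file"):
        return "local"      # must exist on, or be mounted from, the Docker host
    if scheme == "s3":
        return "cloud"      # needs matching S3 credentials
    if scheme == "lustre":
        return "use-slurm"  # belongs to the platform that owns the filesystem
    return "unknown"
```

A spec containing any `use-slurm` path is a hint that local Docker is the wrong platform for that job.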

Monitoring

The SDK handler maps Docker container state directly: created -> Pending, running/restarting -> Running, paused -> Paused, exit code 0 -> Complete, nonzero exit -> Error.
Logs come directly from the named container through the Docker Python client (the CLI equivalent is `docker logs tao-job-<job_id>`).

If the container has exited, died, is being removed, or cannot be found, status reconciliation treats the backend process as terminated.
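
The state mapping and the termination rule can be written down as a small lookup. This is a sketch based only on the mapping stated above; the function names are hypothetical, and treating any unlisted state as `Error` is an assumption of this example.

```python
# Docker container states that map directly to a job status.
STATE_TO_STATUS = {
    "created": "Pending",
    "running": "Running",
    "restarting": "Running",
    "paused": "Paused",
}
# States in which the backend process is considered terminated.
TERMINATED_STATES = frozenset({"exited", "dead", "removing"})


def map_status(state, exit_code=None):
    """Map a Docker container state (and exit code) to a job status."""
    if state in STATE_TO_STATUS:
        return STATE_TO_STATUS[state]
    if state == "exited":
        return "Complete" if exit_code == 0 else "Error"
    return "Error"  # assumption: anything else is reported as an error


def backend_terminated(state):
    """Return True when reconciliation should treat the process as gone.

    Pass state=None when the container cannot be found.
    """
    return state is None or state in TERMINATED_STATES
```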

Cancellation

Cancellation stops the named container. GPU ownership is managed by Docker / the NVIDIA runtime, not by TAO Core's local GPU manager.

Optional: via the TAO SDK

If you want Job handles, S3 I/O wrapping via the SDK's `script_runner`, or durability across sessions:

```python
from tao_sdk.platforms.docker import DockerSDK

sdk = DockerSDK()  # reads DOCKER_HOST, NGC_KEY, S3 creds from env
job = sdk.create_job(
    image='nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt',
    command='dino train -e /tmp/spec.yaml',
    gpu_count=1,
    inputs={'/data/train.json': 's3://bucket/coco/train.json'},
    outputs=['/results/'],
)

status = sdk.get_job_status(job.id)
logs = sdk.get_job_logs(job.id, tail=200)
```

This wraps the same `docker run` invocation under a `Job` handle and routes the entrypoint through `script_runner` so `inputs` / `outputs` get downloaded from / uploaded to S3 automatically. If you don't need those, just use `docker run` directly; no SDK install is required.

Failure Modes

Docker client not initialized: Verify the Docker Python package is installed, set `DOCKER_HOST` if you are not using the default local socket, and confirm the process can talk to the daemon.

GPU assignment failed: Requested GPUs are unavailable, the NVIDIA Container Toolkit is not configured, or the Docker daemon cannot create GPU device requests. Use fewer GPUs, wait for another job to finish, or verify `docker run --gpus ...` works on the host.

Image pull auth failed: Set a valid `NGC_KEY` for private `nvcr.io` images or run `docker login nvcr.io -u '$oauthtoken'` on the Docker host.

Container exited unexpectedly: Check `docker logs tao-job-<job_id>`, the configured `DOCKER_NETWORK`, and the command produced by the SDK action runner.

Path missing inside container: A local path on the host is not necessarily mounted into the job container. Use a path convention supported by the action runner or configure an explicit volume through the surrounding service.
