tao-run-on-local-docker

Local Docker

Single-node execution platform that runs TAO jobs as named Docker containers on the local Docker daemon. It is useful for development, debugging, small runs, and machines where the agent host already has the required GPUs, NVIDIA driver, Docker, and NVIDIA Container Toolkit.
Use local Docker when the data is local to the Docker host or accessible through mounted volumes/cloud credentials. Do not use it for remote cluster scheduling, multi-node training, or jobs that need SLURM queueing.

Preflight

The workflow must verify the host GPU runtime before starting Docker jobs. If the check fails, prompt the user to approve the install, run the printed install command, and rerun the preflight.
bash
# Host GPU runtime: NVIDIA driver 580, CUDA 13.0, NVIDIA Container Toolkit 1.19.0.
TAO_SKILL_BANK_ROOT="${TAO_SKILL_BANK_ROOT:-$PWD}"
SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
[ -x "$SETUP_SCRIPT" ] || SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
bash "$SETUP_SCRIPT" --backend docker --check-only || {
  echo "MISSING: TAO GPU host runtime is not ready."
  echo "After user approval, run:"
  echo "  bash \"$SETUP_SCRIPT\" --backend docker --install --yes"
  exit 1
}

# Mode 1: direct docker (no Python). All you need is docker + the GPU runtime.
docker info >/dev/null 2>&1 || { echo "MISSING: docker daemon not reachable. Start Docker."; exit 1; }
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi >/dev/null 2>&1 || {
  echo "MISSING: NVIDIA Container Toolkit not installed/configured. See:"
  echo "  bash \"$SETUP_SCRIPT\" --backend docker --install --yes"
  exit 1
}

# Mode 2: TAO SDK wrapper. Adds Job handles, S3 I/O wrapping, ActionWorkflow.
# Skip this block if Mode 1 is sufficient for the user's request.
# When Mode 2 is in scope, read tao-skill-bank:tao-run-platform for the DockerSDK
# kwarg contract, build_entrypoint, and monitoring patterns.
# nvidia-tao-sdk is on public PyPI; pin lives in versions.yaml (wheels.tao_sdk_docker).
PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_docker)
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}
python -c "import docker" 2>/dev/null || {
  echo "MISSING: docker Python client not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}

# DockerSDK attaches every job container to ${DOCKER_NETWORK:-tao_default}. If
# the network does not exist, container start fails instantly with
# "network <name> not found" for every create_job.
DOCKER_NETWORK_NAME="${DOCKER_NETWORK:-tao_default}"
docker network ls --format '{{.Name}}' | grep -qx "$DOCKER_NETWORK_NAME" || {
  echo "MISSING: docker network '$DOCKER_NETWORK_NAME' not found. After user approval, run:"
  echo "  docker network create $DOCKER_NETWORK_NAME"
  exit 1
}
If a check fails, the agent prompts the user to authorize the install/fix via Bash before proceeding.

Credentials

There are no platform credentials required beyond access to the Docker daemon.
Optional environment:
  • DOCKER_HOST: Optional Docker daemon URL. If unset, the SDK uses the Docker Python client's normal environment/default socket resolution.
  • DOCKER_NETWORK: Docker network for job containers. Default is tao_default.
  • DOCKER_USERNAME: Registry username. Default is $oauthtoken for NGC.
  • NGC_KEY: Used when pulling private images from nvcr.io.
  • HOST_SSH_PATH: Mounted into AutoML brain containers when they need SSH keys to monitor remote SLURM child jobs.
  • ACCESS_KEY, SECRET_KEY, S3_ENDPOINT_URL, S3_BUCKET_NAME: Optional S3-compatible storage settings for jobs that still read/write cloud storage from a local container.
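As a rough illustration of the defaults listed above, the sketch below resolves the optional settings from an environment mapping. The function name `resolve_docker_settings` and the returned dictionary keys are hypothetical and are not part of the TAO SDK; only the variable names and default values come from the list.

```python
import os
from typing import Mapping, Optional


def resolve_docker_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: collect optional local-Docker settings with defaults."""
    env = os.environ if env is None else env
    return {
        # None means: let the Docker Python client resolve its default socket.
        "docker_host": env.get("DOCKER_HOST"),
        # Network the job containers are attached to.
        "network": env.get("DOCKER_NETWORK", "tao_default"),
        # NGC registry logins use the literal username "$oauthtoken".
        "registry_username": env.get("DOCKER_USERNAME", "$oauthtoken"),
        # Only needed when pulling private images from nvcr.io.
        "ngc_key": env.get("NGC_KEY"),
    }
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the helper easy to exercise in isolation.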

Launch Preflight

Before generating scripts or starting containers:
  1. Verify the Docker daemon is reachable and the NVIDIA runtime can see GPUs.
  2. Verify every local/file dataset annotation and media path exists on the Docker host.
  3. For s3:// datasets/results, verify ACCESS_KEY and SECRET_KEY are set and the exact paths are readable with aws s3 ls.
  4. Verify model-specific credentials such as HF_TOKEN before launch.
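Steps 2 and 3 of the checklist above can be sketched in a few lines of Python. This is a minimal sketch, not the workflow's actual implementation: the function names are hypothetical, it only checks that the S3 variables are set (it does not run `aws s3 ls`), and it does not probe the Docker daemon.

```python
import os
from typing import Iterable, List, Mapping


def missing_local_paths(paths: Iterable[str]) -> List[str]:
    """Return the local or file:// paths that do not exist on this host."""
    missing = []
    for path in paths:
        # Strip the file:// prefix so the remainder is a plain filesystem path.
        local = path[len("file://"):] if path.startswith("file://") else path
        if not os.path.exists(local):
            missing.append(path)
    return missing


def missing_env_vars(names: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the variable names that are unset or empty in env."""
    return [name for name in names if not env.get(name)]


def launch_preflight(paths: Iterable[str], uses_s3: bool,
                     env: Mapping[str, str]) -> List[str]:
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = [f"missing path: {p}" for p in missing_local_paths(paths)]
    if uses_s3:
        required = ["ACCESS_KEY", "SECRET_KEY"]
        problems += [f"unset variable: {n}" for n in missing_env_vars(required, env)]
    return problems
```

Returning a list of problems rather than raising on the first failure lets the agent report every missing path and credential in one pass.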

Multi-GPU and multi-node

Multi-node is not supported on local Docker. One job runs on the local Docker daemon's host with no cross-host coordination.
Multi-GPU on the local host is supported via the NVIDIA Container Toolkit's --gpus flag (--gpus all or --gpus '"device=0,1,2,3"'). DockerSDK.create_job(gpu_count=N) plumbs through to --gpus. Single-host distributed init uses localhost; torchrun --nproc-per-node=N or PyTorch DDP work as usual.
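The quoting of the device list is easy to get wrong when the command is built programmatically. The sketch below, with hypothetical helper names, builds the `--gpus` arguments and a single-host `torchrun` command as argv lists (for example for `subprocess.run` without a shell). The script name passed to `torchrun_command` is a placeholder.

```python
from typing import List, Sequence


def gpus_flag(device_ids: Sequence[int] = ()) -> List[str]:
    """Build `docker run --gpus` arguments for an argv list (no shell involved)."""
    if not device_ids:
        return ["--gpus", "all"]
    ids = ",".join(str(i) for i in device_ids)
    # The inner double quotes are part of the value: they keep Docker's
    # comma-separated option parser from splitting the device list.
    return ["--gpus", f'"device={ids}"']


def torchrun_command(nproc: int, script: str) -> List[str]:
    """Single-host launch: one worker per GPU, rendezvous on the local machine."""
    return ["torchrun", "--standalone", f"--nproc-per-node={nproc}", script]
```

In a shell, the same device value needs the extra single quotes shown above so that the double quotes survive shell parsing.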

Backend Details

Use the SDK backend value local-docker. The local backend schema has no extra backend details, so most routing is controlled by environment and job parameters:
json
{
  "backend_type": "local-docker",
  "num_gpu": 1
}
Following the Lepton/Brev SDK design, platform/control-plane values stay in SDK state and Docker labels. The SDK does not inject BACKEND, HOST_PLATFORM, MONGOSECRET, DOCKER_HOST, or DOCKER_NETWORK into the training container.
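One way to picture this separation is a filter that drops the control-plane names before an environment is handed to the container. This is only an illustrative sketch: `training_container_env` and `CONTROL_PLANE_VARS` are hypothetical names, and the SDK's real mechanism may differ.

```python
from typing import Dict, Mapping

# Variable names taken from the paragraph above.
CONTROL_PLANE_VARS = frozenset(
    {"BACKEND", "HOST_PLATFORM", "MONGOSECRET", "DOCKER_HOST", "DOCKER_NETWORK"}
)


def training_container_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of env without the platform/control-plane variables."""
    return {k: v for k, v in env.items() if k not in CONTROL_PLANE_VARS}
```

Job-level values such as storage credentials pass through unchanged, while the daemon URL and network name stay on the host side.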

Container Execution

The TAO SDK local Docker handler starts containers through the Docker Python client:
  • Backend job name uses the tao-job-<job_id> form used by SDK handlers.
  • Command is usually ["/bin/bash", "-c", "<job command>"].
  • Containers run detached. The SDK keeps containers by default so status and logs remain inspectable, unless DOCKER_AUTO_REMOVE=true.
  • /dev/shm is mounted as tmpfs.
  • The configured Docker network is applied by the Docker daemon for the job container; it is not passed through as a process environment variable.
  • Existing containers with the same job id are stopped and removed before a replacement starts.
For GPU access, the handler auto-detects the host type:
  • Tegra or Jetson hosts use runtime="nvidia" plus NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES=all.
  • Standard x86 hosts use Docker device_requests with GPU capabilities.
If num_gpus is 0, no GPUs are assigned. If num_gpus is -1, all visible GPUs are requested. Prefer explicit GPU counts for shared development machines.
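The `num_gpus` convention for standard x86 hosts can be sketched as a small mapping onto the keyword arguments of the Docker Python client's `docker.types.DeviceRequest`. The helper name `gpu_request_kwargs` is hypothetical, and the handler's real code may be structured differently; the Tegra/Jetson path is not covered here.

```python
from typing import Optional


def gpu_request_kwargs(num_gpus: int) -> Optional[dict]:
    """Hypothetical helper: kwargs for docker.types.DeviceRequest, or None for CPU-only."""
    if num_gpus == 0:
        # No device request at all: the container sees no GPUs.
        return None
    if num_gpus < -1:
        raise ValueError("num_gpus must be -1 (all), 0 (none), or a positive count")
    # In the Docker API a count of -1 requests every available GPU.
    return {"count": num_gpus, "capabilities": [["gpu"]]}
```

With the `docker` package installed, a non-None result could be passed as `device_requests=[docker.types.DeviceRequest(**kwargs)]` to `client.containers.run(...)`.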

Storage

Local Docker accepts local and file:// paths because the container runs on the same Docker host. Make sure every path in the spec is either:
  • mounted into the container by the handler or surrounding service,
  • reachable from inside the container already, or
  • a cloud URI with matching credentials.
For remote/shared filesystems, prefer the platform that owns that filesystem. For example, use SLURM plus lustre:///... for Lustre paths on a cluster.
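A spec can be screened for path types before launch. The sketch below classifies a path by its URI scheme; the function name and the category labels are hypothetical, it assumes POSIX-style local paths, and it covers only the schemes mentioned in this section.

```python
from urllib.parse import urlparse


def classify_spec_path(path: str) -> str:
    """Classify a spec path by URI scheme (labels are illustrative only)."""
    scheme = urlparse(path).scheme
    if scheme in ("", "file"):
        # Plain paths and file:// URIs live on the Docker host.
        return "local"
    if scheme == "s3":
        # Needs matching S3 credentials inside the container.
        return "cloud"
    if scheme == "lustre":
        # Better served by the platform that owns the filesystem, e.g. SLURM.
        return "shared-filesystem"
    return "unknown"
```

A path classified as "local" still has to be mounted into, or otherwise reachable from, the job container.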

Monitoring

  • The SDK handler maps Docker container state directly: created -> Pending, running/restarting -> Running, paused -> Paused, exit code 0 -> Complete, nonzero exit -> Error.
  • Logs come directly from the named container through the Docker Python client (docker logs tao-job-<job_id>).
If the container has exited, died, is being removed, or cannot be found, status reconciliation treats the backend process as terminated.
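The state mapping described above can be written as a small pure function. This is a sketch under assumptions: `map_container_status` is a hypothetical name, the state strings are Docker's container states, and the "Terminated" label for dead, removing, or missing containers is invented here for illustration.

```python
from typing import Optional


def map_container_status(state: Optional[str], exit_code: Optional[int] = None) -> str:
    """Map a Docker container state (None if not found) to a job status."""
    if state == "created":
        return "Pending"
    if state in ("running", "restarting"):
        return "Running"
    if state == "paused":
        return "Paused"
    if state == "exited":
        # Exit code 0 is success; anything else is treated as a failure.
        return "Complete" if exit_code == 0 else "Error"
    # dead, removing, or container not found: the backend process is gone.
    return "Terminated"
```

With the Docker Python client, `state` would typically come from `container.status` and the exit code from `container.attrs["State"]["ExitCode"]`.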

Cancellation

Cancellation stops the named container. GPU ownership is managed by Docker / the NVIDIA runtime, not by TAO Core's local GPU manager.

Optional: via the TAO SDK

If you want Job handles, S3 I/O wrapping via the SDK's script_runner, or durability across sessions:
python
from tao_sdk.platforms.docker import DockerSDK

sdk = DockerSDK()  # reads DOCKER_HOST, NGC_KEY, S3 creds from env
job = sdk.create_job(
    image='nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt',
    command='dino train -e /tmp/spec.yaml',
    gpu_count=1,
    inputs={'/data/train.json': 's3://bucket/coco/train.json'},
    outputs=['/results/'],
)

status = sdk.get_job_status(job.id)
logs = sdk.get_job_logs(job.id, tail=200)
This wraps the same docker run invocation under a Job handle and routes the entrypoint through script_runner so inputs/outputs get downloaded from / uploaded to S3 automatically. If you don't need those, just use docker run directly; no SDK install is required.

Failure Modes

Docker client not initialized: Verify the Docker Python package is installed, set DOCKER_HOST if you are not using the default local socket, and confirm the process can talk to the daemon.
GPU assignment failed: Requested GPUs are unavailable, the NVIDIA Container Toolkit is not configured, or the Docker daemon cannot create GPU device requests. Use fewer GPUs, wait for another job to finish, or verify docker run --gpus ... works on the host.
Image pull auth failed: Set a valid NGC_KEY for private nvcr.io images or run docker login nvcr.io -u '$oauthtoken' on the Docker host.
Container exited unexpectedly: Check docker logs tao-job-<job_id>, the configured DOCKER_NETWORK, and the command produced by the SDK action runner.
Path missing inside container: A local path on the host is not necessarily mounted into the job container. Use a path convention supported by the action runner or configure an explicit volume through the surrounding service.