tao-run-on-local-docker
Local Docker
Single-node execution platform that runs TAO jobs as named Docker containers on
the local Docker daemon. It is useful for development, debugging, small runs,
and machines where the agent host already has the required GPUs, NVIDIA driver,
Docker, and NVIDIA Container Toolkit.
Use local Docker when the data is local to the Docker host or accessible through
mounted volumes/cloud credentials. Do not use it for remote cluster scheduling,
multi-node training, or jobs that need SLURM queueing.
Preflight
The workflow must verify the host GPU runtime before starting Docker jobs. If
the check fails, prompt the user to approve the install, run the printed install
command, and rerun the preflight.
```bash
# Host GPU runtime: NVIDIA driver 580, CUDA 13.0, NVIDIA Container Toolkit 1.19.0.
TAO_SKILL_BANK_ROOT="${TAO_SKILL_BANK_ROOT:-$PWD}"
SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
[ -x "$SETUP_SCRIPT" ] || SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
bash "$SETUP_SCRIPT" --backend docker --check-only || {
  echo "MISSING: TAO GPU host runtime is not ready."
  echo "After user approval, run:"
  echo "  bash \"$SETUP_SCRIPT\" --backend docker --install --yes"
  exit 1
}

# Mode 1: direct docker (no Python). All you need is docker + the GPU runtime.
docker info >/dev/null 2>&1 || { echo "MISSING: docker daemon not reachable. Start Docker."; exit 1; }
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi >/dev/null 2>&1 || {
  echo "MISSING: NVIDIA Container Toolkit not installed/configured. See:"
  echo "  bash \"$SETUP_SCRIPT\" --backend docker --install --yes"
  exit 1
}

# Mode 2: TAO SDK wrapper. Adds Job handles, S3 I/O wrapping, ActionWorkflow.
# Skip this block if Mode 1 is sufficient for the user's request.
# When Mode 2 is in scope, read tao-skill-bank:tao-run-platform for the DockerSDK
# kwarg contract, build_entrypoint, and monitoring patterns.
# nvidia-tao-sdk is on public PyPI; pin lives in versions.yaml (wheels.tao_sdk_docker).
PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_docker)
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}
python -c "import docker" 2>/dev/null || {
  echo "MISSING: docker Python client not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}

# DockerSDK attaches every job container to ${DOCKER_NETWORK:-tao_default}. If
# the network does not exist, container start fails instantly with
# "network <name> not found" for every create_job.
DOCKER_NETWORK_NAME="${DOCKER_NETWORK:-tao_default}"
docker network ls --format '{{.Name}}' | grep -Fqx "$DOCKER_NETWORK_NAME" || {
  echo "MISSING: docker network '$DOCKER_NETWORK_NAME' not found. After user approval, run:"
  echo "  docker network create $DOCKER_NETWORK_NAME"
  exit 1
}
```
If a check fails, the agent prompts the user to authorize the install/fix via Bash before proceeding.
Credentials
There are no platform credentials required beyond access to the Docker daemon.
Optional environment:
- `DOCKER_HOST`: Optional Docker daemon URL. If unset, the SDK uses the Docker Python client's normal environment/default socket resolution.
- `DOCKER_NETWORK`: Docker network for job containers. Default is `tao_default`.
- `DOCKER_USERNAME`: Registry username. Default is `$oauthtoken` for NGC.
- `NGC_KEY`: Used when pulling private images from `nvcr.io`.
- `HOST_SSH_PATH`: Mounted into AutoML brain containers when they need SSH keys to monitor remote SLURM child jobs.
- `ACCESS_KEY`, `SECRET_KEY`, `S3_ENDPOINT_URL`, `S3_BUCKET_NAME`: Optional S3-compatible storage settings for jobs that still read/write cloud storage from a local container.
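The optional settings above can be collected in one place before a job is launched. The following is a minimal sketch, not part of the TAO SDK: `resolve_docker_env` is a hypothetical helper, and the only defaults it applies are the two stated in the list (`tao_default` and `$oauthtoken`).

```python
import os
from typing import Mapping, Optional

# Defaults taken from the list above; every other setting is optional.
DEFAULTS = {
    "DOCKER_NETWORK": "tao_default",
    "DOCKER_USERNAME": "$oauthtoken",
}
S3_KEYS = ("ACCESS_KEY", "SECRET_KEY", "S3_ENDPOINT_URL", "S3_BUCKET_NAME")


def resolve_docker_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: summarize the optional local Docker settings."""
    env = os.environ if env is None else env
    resolved = {key: env.get(key) or default for key, default in DEFAULTS.items()}
    # DOCKER_HOST may stay unset; the Docker client then resolves its default socket.
    resolved["DOCKER_HOST"] = env.get("DOCKER_HOST")
    # NGC_KEY only matters when private nvcr.io images are pulled.
    resolved["can_pull_private_images"] = bool(env.get("NGC_KEY"))
    # Report unset S3 settings so the caller can decide whether the job needs them.
    resolved["missing_s3_settings"] = [key for key in S3_KEYS if not env.get(key)]
    return resolved
```

Passing an explicit mapping instead of reading `os.environ` keeps the helper easy to exercise in isolation.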
Launch Preflight
Before generating scripts or starting containers:
- Verify the Docker daemon is reachable and the NVIDIA runtime can see GPUs.
- Verify every local/file dataset annotation and media path exists on the Docker host.
- For `s3://` datasets/results, verify `ACCESS_KEY` and `SECRET_KEY` are set and the exact paths are readable with `aws s3 ls`.
- Verify model-specific credentials such as `HF_TOKEN` before launch.
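The path and credential checks in this list can be approximated with a small standard-library function. This is a sketch under stated assumptions: `launch_preflight_problems` is a hypothetical name, it only checks that local paths exist and that the two S3 keys are set, and it does not replace the `aws s3 ls` readability check or the Docker/GPU check.

```python
import os
from typing import Iterable, List, Mapping


def launch_preflight_problems(paths: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Hypothetical helper: list problems to fix before starting a container."""
    problems = []
    needs_s3 = False
    for path in paths:
        if path.startswith("s3://"):
            # Readability of the exact path still has to be checked with `aws s3 ls`.
            needs_s3 = True
            continue
        # Treat file:// URIs and bare paths as paths on the Docker host.
        local = path[len("file://"):] if path.startswith("file://") else path
        if not os.path.exists(local):
            problems.append(f"missing on Docker host: {path}")
    if needs_s3:
        for key in ("ACCESS_KEY", "SECRET_KEY"):
            if not env.get(key):
                problems.append(f"{key} is not set")
    return problems
```

An empty list means these particular checks passed; anything else should be reported to the user before scripts are generated.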
Multi-GPU and multi-node
Multi-node is not supported on local Docker. One job runs on the local Docker daemon's host with no cross-host coordination.
Multi-GPU on the local host is supported via the NVIDIA Container Toolkit's `--gpus` flag (`--gpus all` or `--gpus '"device=0,1,2,3"'`). `DockerSDK.create_job(gpu_count=N)` plumbs through to `--gpus`. Single-host distributed init uses `localhost`; `torchrun --nproc-per-node=N` or PyTorch DDP work as usual.
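For direct `docker run` use, the GPU selection and the single-host launch command can be assembled as below. This is an illustrative sketch: `docker_gpu_args` and `torchrun_command` are hypothetical helpers, and `train.py` in the usage note is a placeholder script name.

```python
import shlex
from typing import List, Optional, Sequence


def docker_gpu_args(gpu_count: Optional[int] = None, devices: Sequence[int] = ()) -> List[str]:
    """Hypothetical helper: build the `--gpus` arguments for `docker run`."""
    if devices:
        # Docker expects the device list wrapped in literal double quotes,
        # which is why the shell form is --gpus '"device=0,1,2,3"'.
        return ["--gpus", '"device={}"'.format(",".join(str(d) for d in devices))]
    if gpu_count is None:
        return ["--gpus", "all"]
    return ["--gpus", str(gpu_count)]


def torchrun_command(nproc: int, script: str) -> str:
    """Single-host launch: one worker process per GPU on the local machine."""
    return shlex.join(["torchrun", f"--nproc-per-node={nproc}", script])
```

For example, `docker_gpu_args(devices=[0, 1, 2, 3])` pairs naturally with `torchrun_command(4, "train.py")` as the container command. The argument list form is meant for `subprocess`, where no shell strips the quotes.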
Backend Details
Use the SDK backend value `local-docker`. The local backend schema has no extra backend details, so most routing is controlled by environment and job parameters:
```json
{
  "backend_type": "local-docker",
  "num_gpu": 1
}
```
Following the Lepton/Brev SDK design, platform/control-plane values stay in SDK state and Docker labels. The SDK does not inject `BACKEND`, `HOST_PLATFORM`, `MONGOSECRET`, `DOCKER_HOST`, or `DOCKER_NETWORK` into the training container.
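The separation between control-plane values and the training environment can be pictured as a simple filter. The sketch below is not the SDK's implementation: `training_container_env` is a hypothetical function that only illustrates the rule that the five keys named above never reach the container.

```python
from typing import Dict, Mapping

# Control-plane keys named above that stay out of the training container.
CONTROL_PLANE_KEYS = frozenset(
    {"BACKEND", "HOST_PLATFORM", "MONGOSECRET", "DOCKER_HOST", "DOCKER_NETWORK"}
)


def training_container_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Hypothetical filter: drop control-plane keys, keep everything else."""
    return {key: value for key, value in env.items() if key not in CONTROL_PLANE_KEYS}
```

Job-level settings such as S3 credentials pass through unchanged, while routing details stay with the process that talks to the Docker daemon.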
Container Execution
The TAO SDK local Docker handler starts containers through the Docker Python client:
- Backend job name uses the `tao-job-<job_id>` form used by SDK handlers.
- Command is usually `["/bin/bash", "-c", "<job command>"]`.
- Containers run detached. The SDK keeps containers by default so status and logs remain inspectable, unless `DOCKER_AUTO_REMOVE=true`.
- `/dev/shm` is mounted as tmpfs.
- The configured Docker network is applied by the Docker daemon for the job container; it is not passed through as a process environment variable.
- Existing containers with the same job id are stopped and removed before a replacement starts.
For GPU access, the handler auto-detects the host type:
- Tegra or Jetson hosts use `runtime="nvidia"` plus `NVIDIA_VISIBLE_DEVICES` and `NVIDIA_DRIVER_CAPABILITIES=all`.
- Standard x86 hosts use Docker `device_requests` with GPU capabilities.
If `num_gpus` is `0`, no GPUs are assigned. If `num_gpus` is `-1`, all visible GPUs are requested. Prefer explicit GPU counts for shared development machines.
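The GPU rules above can be restated as a function that returns container keyword arguments. This is a sketch with labeled assumptions, not the handler's code: `gpu_run_kwargs` is a hypothetical name, the `is_tegra` flag stands in for the handler's auto-detection, the `NVIDIA_VISIBLE_DEVICES` value is a guess (indices `0..N-1`, or `all`), and the device request is a plain dict rather than a `docker.types.DeviceRequest` object so the example has no third-party dependency.

```python
from typing import Any, Dict


def gpu_run_kwargs(num_gpus: int, is_tegra: bool = False) -> Dict[str, Any]:
    """Hypothetical sketch of the GPU selection rules described above."""
    if num_gpus == 0:
        return {}  # no GPUs assigned
    if is_tegra:
        # Assumption: expose GPUs by index, or "all" when num_gpus is -1.
        visible = "all" if num_gpus == -1 else ",".join(str(i) for i in range(num_gpus))
        return {
            "runtime": "nvidia",
            "environment": {
                "NVIDIA_VISIBLE_DEVICES": visible,
                "NVIDIA_DRIVER_CAPABILITIES": "all",
            },
        }
    # Plain-dict stand-in for a Docker device request; a count of -1 asks for all GPUs.
    return {"device_requests": [{"count": num_gpus, "capabilities": [["gpu"]]}]}
```

On a shared machine, calling this with an explicit count such as `gpu_run_kwargs(2)` avoids claiming every visible GPU.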
Storage
Local Docker accepts local and `file://` paths because the container runs on the same Docker host. Make sure every path in the spec is either:
- mounted into the container by the handler or surrounding service,
- reachable from inside the container already, or
- a cloud URI with matching credentials.
For remote/shared filesystems, prefer the platform that owns that filesystem. For example, use SLURM plus `lustre:///...` for Lustre paths on a cluster.
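A spec can be screened for path types before launch. The following sketch is illustrative only: `classify_spec_path` and its return labels are hypothetical, and it looks at the URI scheme alone without checking mounts or credentials.

```python
from urllib.parse import urlparse


def classify_spec_path(path: str) -> str:
    """Hypothetical helper: decide how local Docker should treat a spec path."""
    scheme = urlparse(path).scheme
    if scheme in ("", "file"):
        # Must be mounted into the container or already visible inside it.
        return "local"
    if scheme == "s3":
        # Usable from a local container when matching credentials are set.
        return "cloud"
    if scheme == "lustre":
        # Prefer the platform that owns the filesystem, for example SLURM.
        return "other-platform"
    return "unknown"
```

Any path classified as `other-platform` or `unknown` is a signal to reconsider whether local Docker is the right backend for the job.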
Monitoring
- The SDK handler maps Docker container state directly: created -> Pending, running/restarting -> Running, paused -> Paused, exit code 0 -> Complete, nonzero exit -> Error.
- Logs come directly from the named container through the Docker Python client (`docker logs tao-job-<job_id>`).
If the container has exited, died, is being removed, or cannot be found, status
reconciliation treats the backend process as terminated.
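The state mapping and the termination rule can be written out as two small functions. This is a sketch of the behavior described above rather than the SDK handler itself: `map_status` and `is_terminated` are hypothetical names, `None` stands for a container that cannot be found, and states the text gives no job status for return `None`.

```python
from typing import Optional


def map_status(state: Optional[str], exit_code: Optional[int] = None) -> Optional[str]:
    """Map a Docker container state to the job status named in the text."""
    if state == "created":
        return "Pending"
    if state in ("running", "restarting"):
        return "Running"
    if state == "paused":
        return "Paused"
    if state == "exited":
        # Exit code 0 is a clean finish; any other exit code is an error.
        return "Complete" if exit_code == 0 else "Error"
    return None  # no status is stated for other states


def is_terminated(state: Optional[str]) -> bool:
    """Exited, dead, removing, or missing containers count as terminated."""
    return state in ("exited", "dead", "removing") or state is None
```

Keeping the two questions separate mirrors the text: one answers what to display, the other whether the backend process still exists.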
Cancellation
Cancellation stops the named container. GPU ownership is managed by Docker /
the NVIDIA runtime, not by TAO Core's local GPU manager.
Optional: via the TAO SDK
If you want Job handles, S3 I/O wrapping via the SDK's `script_runner`, or durability across sessions:
```python
from tao_sdk.platforms.docker import DockerSDK

sdk = DockerSDK()  # reads DOCKER_HOST, NGC_KEY, S3 creds from env

# Inputs map a container path to its S3 source; outputs are uploaded after the run.
job = sdk.create_job(
    image='nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt',
    command='dino train -e /tmp/spec.yaml',
    gpu_count=1,
    inputs={'/data/train.json': 's3://bucket/coco/train.json'},
    outputs=['/results/'],
)

status = sdk.get_job_status(job.id)
logs = sdk.get_job_logs(job.id, tail=200)
```
This wraps the same `docker run` invocation under a `Job` handle and routes the entrypoint through `script_runner` so `inputs` / `outputs` get downloaded from / uploaded to S3 automatically. If you don't need those, just use `docker run` directly; no SDK install is required.
Failure Modes
Docker client not initialized: Verify the Docker Python package is installed, set `DOCKER_HOST` if you are not using the default local socket, and confirm the process can talk to the daemon.
GPU assignment failed: Requested GPUs are unavailable, the NVIDIA Container Toolkit is not configured, or the Docker daemon cannot create GPU device requests. Use fewer GPUs, wait for another job to finish, or verify `docker run --gpus ...` works on the host.
Image pull auth failed: Set a valid `NGC_KEY` for private `nvcr.io` images or run `docker login nvcr.io -u '$oauthtoken'` on the Docker host.
Container exited unexpectedly: Check `docker logs tao-job-<job_id>`, the configured `DOCKER_NETWORK`, and the command produced by the SDK action runner.
Path missing inside container: A local path on the host is not necessarily mounted into the job container. Use a path convention supported by the action runner or configure an explicit volume through the surrounding service.