tao-run-on-lepton


Lepton

Managed GPU compute platform on DGX Cloud. Jobs are submitted as container workloads that run on dedicated or shared GPU node groups. Lepton handles scheduling, image pulling, log collection, and job lifecycle.

Use Lepton when you need cloud-based GPU compute without managing Kubernetes or SLURM infrastructure directly.


Preflight


Lepton is API-first: there is no docker-run alternative. This skill needs the TAO SDK with the Lepton extra. `nvidia-tao-sdk` is on public PyPI; the pinned version lives in `versions.yaml` (`wheels.tao_sdk_lepton`), resolved via `scripts/resolve_versions_key.py`:

```bash
PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_lepton)
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}
python -c "import leptonai" 2>/dev/null || {
  echo "MISSING: lepton extra not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}
```

If missing, the agent prompts the user to authorize the install via Bash, then re-runs the preflight before continuing.
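
The same import check can be expressed in Python, which may be convenient when the agent is already running inside an interpreter. This is a minimal sketch that assumes only the two module names used by the shell preflight (`tao_sdk`, `leptonai`); the helper name `missing_modules` is hypothetical and not part of the skill.

```python
import importlib.util


def missing_modules(names=("tao_sdk", "leptonai")):
    """Return the module names that cannot be found in this environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("MISSING:", ", ".join(missing))
```

An empty list means both imports would succeed; anything else should trigger the install prompt described above.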


Credentials


LEPTON_WORKSPACE_ID (required): Determines which cluster and billing account the job runs under.
LEPTON_AUTH_TOKEN (required): API token for authenticating with the Lepton control plane.
NGC_KEY (optional): Used to create image pull secrets for pulling TAO container images from nvcr.io.
ACCESS_KEY / SECRET_KEY (optional): S3-compatible storage keys for dataset and checkpoint URIs.
S3_ENDPOINT_URL (optional): Custom S3 endpoint (e.g., for MinIO or non-AWS S3).
S3_BUCKET_NAME (optional): Bucket for job output artifacts.
CLOUD_REGION (optional): Storage region (e.g., us-east-1).
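
A short sketch of how the list above could be checked before any API call is made. The variable names come from the list; the function name `check_credentials` and its return shape are assumptions made for illustration.

```python
import os

REQUIRED = ("LEPTON_WORKSPACE_ID", "LEPTON_AUTH_TOKEN")
OPTIONAL = ("NGC_KEY", "ACCESS_KEY", "SECRET_KEY",
            "S3_ENDPOINT_URL", "S3_BUCKET_NAME", "CLOUD_REGION")


def check_credentials(env=None):
    """Split the documented variables into missing-required and unset-optional."""
    env = os.environ if env is None else env
    # Empty strings are treated the same as unset.
    missing = [name for name in REQUIRED if not env.get(name)]
    unset_optional = [name for name in OPTIONAL if not env.get(name)]
    return missing, unset_optional
```

A non-empty first list should block submission; the second list is informational, since whether an optional variable matters depends on the job (for example, `NGC_KEY` only matters when the image is pulled from nvcr.io).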


Launch Preflight


Before generating scripts or submitting jobs:

Verify `LEPTON_WORKSPACE_ID` and `LEPTON_AUTH_TOKEN` are set.
Verify the workspace API is reachable with the packaged helper: `scripts/check_tao_launch_preflight.py --platform lepton ...`.
For `s3://` datasets/results, verify `ACCESS_KEY` and `SECRET_KEY` are set and the exact paths are readable with `aws s3 ls`.
For NFS/Lustre mounted paths, require proof from Lepton volume/storage permissions that the path will be mounted into the job. Do not treat a local filesystem `test -e` on the agent host as proof for Lepton jobs.
Verify model-specific credentials such as `HF_TOKEN` before launch.
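
The `aws s3 ls` readability check can be scripted. The sketch below only builds the command and its environment; it does not describe what the packaged helper does internally. The AWS CLI reads `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION` and accepts `--endpoint-url`; mapping this skill's `ACCESS_KEY` / `SECRET_KEY` / `CLOUD_REGION` / `S3_ENDPOINT_URL` onto them is an assumption, and `build_s3_ls` is a hypothetical name.

```python
import os


def build_s3_ls(uri, env=None):
    """Return (argv, child_env) for an `aws s3 ls` readability check."""
    env = dict(os.environ if env is None else env)
    if not uri.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    for name in ("ACCESS_KEY", "SECRET_KEY"):
        if not env.get(name):
            raise KeyError(f"{name} is not set")
    # Assumed mapping: the AWS CLI only reads its own variable names.
    env["AWS_ACCESS_KEY_ID"] = env["ACCESS_KEY"]
    env["AWS_SECRET_ACCESS_KEY"] = env["SECRET_KEY"]
    if env.get("CLOUD_REGION"):
        env["AWS_DEFAULT_REGION"] = env["CLOUD_REGION"]
    argv = ["aws", "s3", "ls", uri]
    if env.get("S3_ENDPOINT_URL"):
        argv += ["--endpoint-url", env["S3_ENDPOINT_URL"]]
    return argv, env
```

Running it is then `subprocess.run(argv, env=child_env, check=True)`; a non-zero exit means the exact path is not readable with the supplied keys.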


Backend Details


`LeptonSDK.create_job` accepts these Lepton-specific kwargs (in addition to the platform-agnostic ones: `image`, `command`, `gpu_count`, `env_vars`, `inputs`, `outputs`, `hooks`):

`resource_shape`: explicit GPU resource shape ID (e.g., `"gpu.8xh100-sxm"`). When set, skips the auto-resolution from `gpu_count`. The format is opaque (whatever Lepton's API returns as the instance `metadata.id`); discover valid IDs via `sdk.list_resource_shapes()`.
`dedicated_node_group`: node group ID for guaranteed GPU allocation (no preemption). Omit for shared resources.
`num_nodes`: number of nodes for distributed training. Default 1. When > 1, enables intra-job communication and PyTorch distributed initialization (see Multi-node training).
`mounts`: pre-built `Mount` objects for NFS / Lustre. Auto-detected from the node group when not set.


Discovering the workspace's shapes / volumes


```python
shapes = sdk.list_resource_shapes()
# {<platform_id>: {"cluster": ..., "gpu_type": "gpu.8xh100-sxm",
#                  "gpu_count": 8, "instance_type": ..., ...}, ...}

volumes = sdk.get_volumes(node_group_id="my-h100-pool")
# [{"name": "lustre", "from_path": "/lustre", "type": "Lustre"}, ...]

prefixes = sdk.get_storage_permissions("lustre", "my-h100-pool")
# ["/lustre/fsw/portfolios/edgeai/...", ...]
```
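
Once the shapes dictionary is in hand, filtering it is plain Python. The sketch below assumes the return value has the layout shown in the comments above (a dict keyed by platform ID whose values carry a `gpu_count` field); the helper name and the sample data in the usage note are hypothetical. It returns the dictionary keys and makes no claim about which field should be passed as `resource_shape`.

```python
def shapes_with_gpu_count(shapes, gpu_count):
    """Return the keys of the shape entries that report the requested GPU count."""
    return sorted(
        shape_id
        for shape_id, info in shapes.items()
        if info.get("gpu_count") == gpu_count
    )
```

For example, `shapes_with_gpu_count({"a": {"gpu_count": 8}, "b": {"gpu_count": 1}}, 8)` keeps only the `"a"` entry.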

Multi-node training (distributed)


Pass `num_nodes > 1` to `create_job` for multi-node distributed training. The Lepton handler (`tao_sdk/platforms/lepton/handler.py`) configures the underlying `LeptonJob` by setting `intra_job_communication=True` (opens pod-to-pod networking), `parallelism=num_nodes` and `completions=num_nodes` (Lepton schedules N replicas), and exports `WORLD_SIZE=num_nodes` as a container env var.

Lepton's native per-replica env vars use Lepton-specific names (`LEPTON_JOB_WORKER_INDEX`, `LEPTON_JOB_TOTAL_WORKERS`, `LEPTON_JOB_WORKER_PREFIX`, `LEPTON_SUBDOMAIN`), so the handler prepends a bootstrap that sources Lepton's official translation script:

```bash
wget -O init.sh https://raw.githubusercontent.com/leptonai/scripts/main/lepton_env_to_pytorch.sh
chmod +x init.sh
source init.sh
# user command runs here
```


After sourcing, the following env vars are set:

| Env var | Source | Value |
|---|---|---|
| `MASTER_ADDR` | script | `${LEPTON_JOB_WORKER_PREFIX}-0.${LEPTON_SUBDOMAIN}` |
| `MASTER_PORT` | script | `29400` |
| `NNODES` | script | `${LEPTON_JOB_TOTAL_WORKERS}` |
| `NODE_RANK` | script | `${LEPTON_JOB_WORKER_INDEX}` |
| `WORKER_ADDRS` | script | comma-separated list of non-master worker hostnames |
| `WORLD_SIZE` | TAO SDK handler | `num_nodes` (TAO container's convention — same value as `NNODES`) |
| `NUM_GPU_PER_NODE` | TAO SDK handler | `gpu_count` (read by TAO container's entrypoint) |

```python
job = sdk.create_job(
    image='nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt',
    command='dino train -e /tmp/spec.yaml',  # TAO entrypoint reads WORLD_SIZE + NUM_GPU_PER_NODE
    gpu_count=8,                          # GPUs per node
    num_nodes=4,                          # 4 × 8 = 32 GPUs total
    dedicated_node_group='my-h100-pool',
    inputs={'/data/train.json': 's3://bucket/coco/train.json'},
    outputs=['/results/'],
)
```

For raw `torchrun`-based commands (non-TAO containers):

```python
# Pass this string as the `command` kwarg of create_job.
command = (
    'torchrun --nnodes=$NNODES --nproc-per-node=8 --node-rank=$NODE_RANK '
    '--master-addr=$MASTER_ADDR --master-port=$MASTER_PORT train.py'
)
```
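
For readers who want to see the variable mapping from the table without reading the shell script, here is a Python restatement of the rows whose source is the script. It is a sketch derived from the table only, not a port of `lepton_env_to_pytorch.sh`; `WORKER_ADDRS` is left out because the table does not give its exact format, and the function name is hypothetical.

```python
def pytorch_env_from_lepton(env):
    """Derive the rendezvous variables in the table from Lepton's per-replica ones."""
    prefix = env["LEPTON_JOB_WORKER_PREFIX"]
    subdomain = env["LEPTON_SUBDOMAIN"]
    return {
        "MASTER_ADDR": f"{prefix}-0.{subdomain}",  # replica 0 acts as master
        "MASTER_PORT": "29400",
        "NNODES": env["LEPTON_JOB_TOTAL_WORKERS"],
        "NODE_RANK": env["LEPTON_JOB_WORKER_INDEX"],
    }
```

Calling it with `os.environ` inside a replica would reproduce the values that `torchrun` consumes in the command above.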

Two ways to run distributed jobs on Lepton


| Path | When to use |
|---|---|
| TAO SDK `create_job(num_nodes=N)` (this skill) | Programmatic submission from agent code; you want the SDK's S3 wrapping, monitoring, failure analysis, and JobStore. |
| Lepton "Torchrun" job type (Lepton UI / lep CLI) | Hand-crafted submission via the Lepton console. Lepton's UI has a first-class "Torchrun" mode that wires up the rendezvous for you, so no bootstrap script is needed. See the official example. |


Reference reading


NVIDIA's Lepton multi-node PyTorch example (UI / Torchrun mode): https://docs.nvidia.com/dgx-cloud/lepton/examples/batch-job/distributed-training-with-pytorch/
The translation script the SDK sources: https://github.com/leptonai/scripts/blob/main/lepton_env_to_pytorch.sh
PyTorch distributed (env-var rendezvous): https://pytorch.org/docs/stable/elastic/run.html
NCCL networking tuning: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html


Notes


Prefer `dedicated_node_group` for multi-node to keep replicas on the same low-latency interconnect (NVLink / InfiniBand).
If a replica is preempted on a shared node group, the whole job fails — Lepton doesn't elastically restart in v1. Use a dedicated node group for long runs.
For Lustre-backed datasets, the same mount is exposed to every replica — no per-replica I/O wrapping needed.


Cloud Storage


Even though the platform is Lepton, the storage layer is S3-compatible. Always use `aws` as the `cloud_metadata` key and `s3://` as the URI protocol for both datasets and `results_dir`.

Correct: `s3://bucket-name/path`
Incorrect: `lepton://bucket-name/path`

The container's `get_cloud_storage_class_object()` parses the URI protocol to look up credentials in `CLOUD_METADATA[protocol][bucket]`.
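
A URI can be validated before it reaches the container. The sketch below uses only the standard library and rejects anything that is not `s3://`; it does not reproduce `get_cloud_storage_class_object()`, whose internals are not shown here, and `split_s3_uri` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Return (bucket, key) for an s3:// URI, rejecting any other protocol."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    if not parsed.netloc:
        raise ValueError(f"missing bucket name in {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")
```

With this check in place, `lepton://bucket-name/path` fails fast on the agent host instead of failing later inside the job.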


Shared Storage (NFS/Lustre)


Node groups can have NFS or Lustre volumes attached. The SDK auto-detects these and mounts them into containers for persistent cross-job data sharing.


SDK Functions


`sdk.get_volumes(node_group_id=None)`: returns available volumes (name, from_path, type) from the node group spec
`sdk.get_storage_permissions(volume_name, node_group_id)`: returns allowed path prefixes for a volume

`LeptonSDK.create_job()` calls these automatically to detect mounts and build the appropriate `Mount` objects for job specs.
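
The prefixes returned by `sdk.get_storage_permissions(...)` can serve as the proof that a shared-storage path will be visible to the job. The sketch below assumes the prefixes are absolute POSIX paths and that "allowed" means the path equals a prefix or sits beneath it; both points are assumptions, and `path_is_permitted` is a hypothetical helper.

```python
import posixpath


def path_is_permitted(path, prefixes):
    """True if `path` equals or sits under one of the allowed prefixes."""
    # normpath collapses "..", so "/a/b/../c" is checked as "/a/c".
    path = posixpath.normpath(path)
    for prefix in prefixes:
        prefix = posixpath.normpath(prefix)
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            return True
    return False
```

Comparing on whole path components avoids treating `/lustre/team2` as covered by a `/lustre/team` prefix.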


How the script runner uses mounts


When a Lustre mount is available:

Inputs: S3 paths are mapped to Lustre (`s3://bucket/path` → `/mnt/lustre/bucket/path`). If the file exists on Lustre, it's used directly (zero download). If missing, it's downloaded from S3 to Lustre and persists for future jobs.
Outputs: Results write to Lustre first (fast, persistent), then upload to S3 (durable). Downstream jobs (e.g., gap analysis) can read results directly from Lustre without an S3 round-trip.
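
The input mapping can be written as a small pure function. This sketch follows the `s3://bucket/path` → `/mnt/lustre/bucket/path` example; the default mount root is taken from that example, and the function name is hypothetical rather than the script runner's own.

```python
import posixpath
from urllib.parse import urlparse


def s3_to_lustre(uri, mount_root="/mnt/lustre"):
    """Map s3://bucket/path to <mount_root>/bucket/path."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected s3://bucket/path, got {uri!r}")
    return posixpath.join(mount_root, parsed.netloc, parsed.path.lstrip("/"))
```

Inside the job, an `os.path.exists(s3_to_lustre(uri))` test would then decide between using the cached copy and downloading from S3.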


Volume preference order


lustre > filestore > first available
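
The preference order can be sketched as a selection over the list returned by `sdk.get_volumes(...)`. Matching on the volume's `name` field, case-insensitively, is an assumption; the SDK may match on `type` or some other field instead. `pick_volume` is a hypothetical name.

```python
def pick_volume(volumes, preference=("lustre", "filestore")):
    """Pick a volume by the documented order, else the first one listed."""
    by_name = {v["name"].lower(): v for v in volumes}
    for name in preference:
        if name in by_name:
            return by_name[name]
    return volumes[0] if volumes else None
```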


Lustre Cache Invalidation


Lustre caches files persistently across jobs. There is no built-in invalidation. If upstream data changes but the S3 path stays the same, Lustre serves the stale cached version. To force a cache miss:

Rename the file on S3 (e.g., `prompt_v2.txt` instead of overwriting `prompt.txt`)
Use a new storage_root between iterations to avoid cross-iteration staleness
Use a new path for any regenerated artifacts
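
One way to guarantee a new path whenever content changes is to embed a content hash in the file name. This is a suggested technique, not something the SDK does for you; the naming scheme and the helper `versioned_name` are hypothetical.

```python
import hashlib
import posixpath


def versioned_name(filename, content):
    """Insert a short content hash before the extension: prompt.txt -> prompt.<hash>.txt."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    stem, ext = posixpath.splitext(filename)
    return f"{stem}.{digest}{ext}"
```

Identical content maps to the same name, so unchanged inputs still hit the Lustre cache, while any edit produces a fresh path and therefore a cache miss.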


Monitoring


Job Status


Use `sdk.get_job_status(job_id)` for high-level status (Pending, Running, Complete, Error).


Replica Status


Use `sdk.get_job_replicas(job_id)` during startup for detailed replica-level info. Each replica is a dict:

```python
replicas = sdk.get_job_replicas(job_id)
for r in replicas:
    node = r["status"]["node"]["name"]           # e.g., "node-ip-10-50-111-24"
    node_group = r["status"]["node"]["node_group_id"]
    cpu = r["status"]["cpu"]                      # e.g., 2
    memory_mb = r["status"]["memory_in_mb"]       # e.g., 8192
    readiness = r["status"].get("readiness_issue")
    if readiness:
        reason = readiness["reason"]   # "InProgress", "Failed", "ConfigError"
        message = readiness["message"] # "Pulling image", "Mount point not found", etc.
```

Key `readiness_issue` patterns:

`reason="InProgress"`, `message="Pulling image"`: image pull in progress (normal for large images)
`reason="Failed"`: image pull failed (check NGC_KEY)
`reason="ConfigError"`: node issue (mount failure, GPU error)
No `readiness_issue`: replica is running

Replica status is especially useful when a job is stuck in Pending — it reveals whether the issue is image pulling, resource scheduling, or node health.
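
The patterns above can be folded into a small classifier for logging or alerting. The sketch assumes replica dicts shaped like the ones in the loop above; the returned strings and the name `classify_replica` are illustrative choices, not SDK output.

```python
def classify_replica(replica):
    """Turn one replica dict into a short startup diagnosis."""
    issue = replica.get("status", {}).get("readiness_issue")
    if not issue:
        return "running"
    reason = issue.get("reason")
    message = issue.get("message", "")
    if reason == "InProgress":
        return f"starting: {message}" if message else "starting"
    if reason == "Failed":
        return "image pull failed: check NGC_KEY"
    if reason == "ConfigError":
        return f"node issue: {message}" if message else "node issue"
    # Reasons not listed in this document are surfaced rather than guessed at.
    return f"unknown reason {reason!r}"
```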


Job Logs


Use `sdk.get_job_logs(job_id, tail=N)` for the most recent N log lines. Logs are fetched from Lepton's log collection service.


Parallel Jobs


For workflow stages that run in parallel (e.g., video generation x8):

Launch: Call `execute_step(plan, step_id, extra_args={"split_id": i})` for each split. Each call returns immediately with a job_id.
Monitor: Poll all jobs by calling `sdk.get_job_status(job_id)` for each. Use `get_job_replicas(job_id)` for startup diagnostics.
Completion: All jobs are done when every status is `Complete` or `Error`.
Partial failure: Retry only failed splits; successful splits don't need re-running. Pass the same `split_id` to `execute_step`.
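
The monitor and partial-failure steps can be sketched as a polling loop. The status call is injected as a callable so the loop is not tied to a live SDK object; it is assumed to return the status strings named in this document (`Pending`, `Running`, `Complete`, `Error`). The function name, the polling interval, and the absence of a timeout are illustrative choices.

```python
import time

TERMINAL = {"Complete", "Error"}


def wait_for_splits(jobs, get_status, poll_seconds=30, sleep=time.sleep):
    """Poll until every job is terminal; return the split_ids that ended in Error.

    jobs maps split_id -> job_id; get_status is a callable such as sdk.get_job_status.
    """
    status = {}
    while True:
        for split_id, job_id in jobs.items():
            # Jobs that already finished are not polled again.
            if status.get(split_id) not in TERMINAL:
                status[split_id] = get_status(job_id)
        if all(s in TERMINAL for s in status.values()):
            return sorted(s for s, st in status.items() if st == "Error")
        sleep(poll_seconds)
```

The returned list is exactly the set of `split_id` values to pass back to `execute_step` for a retry.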


Failure Analysis


When a job fails, use `sdk.get_failure_analysis(job_id)` for automatic root cause detection:

```python
analysis = sdk.get_failure_analysis(job_id)
if analysis:
    print(analysis["err_class"])    # e.g., "ERR_PROGRAM"
    print(analysis["suggestion"])   # Human-readable fix
    for event in analysis.get("job_failure_by_node_event", []):
        print(event["node_event_name"], event["message"])
        # e.g., "OOM", "OOM encountered, victim process: cosmos-rl-evalu, pid: 3368483"
```

Returns:

`err_class`: Error classification (`ERR_PROGRAM`, `ERR_INFRA`, etc.)
`suggestion`: What likely went wrong and how to fix it
`job_failure_by_node_event`: Node-level events (OOM kills, GPU errors, mount failures)
`log_streams`: Relevant log snippets with error context

Always call this on failed jobs before retrying — it distinguishes user errors (bad config, OOM) from infrastructure issues (node failure, eviction).
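
A retry policy can be layered on top of the analysis dict. The mapping below (OOM and `ERR_PROGRAM` mean "fix the config", `ERR_INFRA` means "retry") is an assumed policy for illustration, built only from the fields listed above; other `err_class` values exist and are treated conservatively. `retry_decision` is a hypothetical helper.

```python
def retry_decision(analysis):
    """Suggest a next step from a failure-analysis dict (or None)."""
    if not analysis:
        return "inspect logs"  # nothing to go on; fall back to get_job_logs
    events = {
        e.get("node_event_name")
        for e in analysis.get("job_failure_by_node_event", [])
    }
    if "OOM" in events:
        return "fix config"  # resubmitting unchanged would hit the same limit
    if analysis.get("err_class") == "ERR_INFRA":
        return "retry"
    # ERR_PROGRAM and any unrecognized class: do not retry blindly.
    return "fix config"
```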


Failure Modes


OOM killed: Container exceeded GPU or system memory. Detection: `get_failure_analysis()` returns `node_event_name: "OOM"`. Common causes: `evaluation.batch_size` too high, `max_length` too large for available KV cache. Recovery: reduce batch_size, add GPUs with tensor parallelism, or reduce max_length.

Image pull failure: The TAO container image cannot be pulled from nvcr.io. Usually caused by a missing or expired image pull secret. The SDK auto-provisions the secret from NGC_KEY, but if NGC_KEY is invalid, the job will fail. Detection: check `get_job_replicas()`; `readiness_issue.reason` will show `InProgress` with `message = "Pulling image"` for extended periods, or `Failed` if the pull fails. Recovery: verify NGC_KEY is valid.

Resource unavailable: The requested GPU shape is not available. Job enters Queueing state indefinitely. Detection: Pending > 15 minutes, replicas show no node assignment. Recovery: try a different resource_shape or dedicated_node_group, or wait for resources.

Auth failure: Invalid or expired LEPTON_AUTH_TOKEN. All API calls fail with 401/403. Detection: job creation raises an exception immediately. Recovery: refresh the token and reinitialize the SDK.

Unhealthy node: The assigned node has infrastructure issues (mount failures, GPU errors, network problems). Detection: check `get_job_replicas()` for `readiness_issue.reason = "ConfigError"` with messages like `"Mount point not found"`. The job stays Pending indefinitely on the bad node. Recovery: cancel the job and resubmit; Lepton will schedule on a different node. If the issue recurs, try a different `dedicated_node_group` or `resource_shape`.

Job eviction: On shared node groups, Lepton may evict jobs under resource pressure. Detection: job unexpectedly transitions from Running to Error. Recovery: retry, or use a dedicated_node_group.
