tao-run-on-lepton
Lepton
Managed GPU compute platform on DGX Cloud. Jobs are submitted as container workloads that run on dedicated or shared GPU node groups. Lepton handles scheduling, image pulling, log collection, and job lifecycle.
Use Lepton when you need cloud-based GPU compute without managing Kubernetes or SLURM infrastructure directly.
Preflight
Lepton is API-first: there is no docker-run alternative. This skill needs the TAO SDK with the Lepton extra. `nvidia-tao-sdk` is on public PyPI; the pinned version lives in `versions.yaml` (`wheels.tao_sdk_lepton`), resolved via `scripts/resolve_versions_key.py`:

```bash
PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_lepton)
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}
python -c "import leptonai" 2>/dev/null || {
  echo "MISSING: lepton extra not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}
```

If missing, the agent prompts the user to authorize the install via Bash, then re-runs the preflight before continuing.
Credentials
- LEPTON_WORKSPACE_ID (required): Determines which cluster and billing account the job runs under.
- LEPTON_AUTH_TOKEN (required): API token for authenticating with the Lepton control plane.
- NGC_KEY (optional): Used to create image pull secrets for pulling TAO container images from nvcr.io.
- ACCESS_KEY / SECRET_KEY (optional): S3-compatible storage keys for dataset and checkpoint URIs.
- S3_ENDPOINT_URL (optional): Custom S3 endpoint (e.g., for MinIO or non-AWS S3).
- S3_BUCKET_NAME (optional): Bucket for job output artifacts.
- CLOUD_REGION (optional): Storage region (e.g., us-east-1).
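The required/optional split above can be checked before any job is submitted. The following is a minimal sketch, assuming the credentials are read from environment variables of the same names; the helper `missing_credentials` is hypothetical and not part of the TAO SDK.

```python
import os

REQUIRED = ("LEPTON_WORKSPACE_ID", "LEPTON_AUTH_TOKEN")
OPTIONAL = ("NGC_KEY", "ACCESS_KEY", "SECRET_KEY",
            "S3_ENDPOINT_URL", "S3_BUCKET_NAME", "CLOUD_REGION")


def missing_credentials(env=None):
    """Return (missing_required, missing_optional) as lists of variable names.

    Hypothetical helper: `env` defaults to the process environment, and a
    variable that is set to an empty string counts as missing.
    """
    env = os.environ if env is None else env
    missing_required = [name for name in REQUIRED if not env.get(name)]
    missing_optional = [name for name in OPTIONAL if not env.get(name)]
    return missing_required, missing_optional
```

A caller would typically stop when the first list is non-empty and only warn about the second.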
Launch Preflight
Before generating scripts or submitting jobs:
- Verify `LEPTON_WORKSPACE_ID` and `LEPTON_AUTH_TOKEN` are set.
- Verify the workspace API is reachable with the packaged helper: `scripts/check_tao_launch_preflight.py --platform lepton ...`.
- For `s3://` datasets/results, verify `ACCESS_KEY` and `SECRET_KEY` are set and the exact paths are readable with `aws s3 ls`.
- For NFS/Lustre mounted paths, require proof from Lepton volume/storage permissions that the path will be mounted into the job. Do not treat a local filesystem `test -e` on the agent host as proof for Lepton jobs.
- Verify model-specific credentials such as `HF_TOKEN` before launch.
Backend Details
`LeptonSDK.create_job` takes `image`, `command`, `gpu_count`, `env_vars`, `inputs`, `outputs`, and `hooks`, plus the following options:
- `resource_shape`: explicit GPU resource shape ID (e.g., `"gpu.8xh100-sxm"`). When set, skips the auto-resolution from `gpu_count`. The format is opaque (whatever Lepton's API returns as instance metadata.id); discover valid IDs via `sdk.list_resource_shapes()`.
- `dedicated_node_group`: node group ID for guaranteed GPU allocation (no preemption). Omit for shared resources.
- `num_nodes`: number of nodes for distributed training. Default 1. When > 1, enables intra-job communication and PyTorch distributed initialization (see Multi-node training).
- `mounts`: pre-built `Mount` objects for NFS / Lustre. Auto-detected from the node group when not set.
Discovering the workspace's shapes / volumes
```python
shapes = sdk.list_resource_shapes()
# {<platform_id>: {"cluster": ..., "gpu_type": "gpu.8xh100-sxm",
#                  "gpu_count": 8, "instance_type": ..., ...}, ...}

volumes = sdk.get_volumes(node_group_id="my-h100-pool")
# [{"name": "lustre", "from_path": "/lustre", "type": "Lustre"}, ...]

prefixes = sdk.get_storage_permissions("lustre", "my-h100-pool")
# ["/lustre/fsw/portfolios/edgeai/...", ...]
```
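To narrow that listing down to candidates for a job, one option is to filter on `gpu_count`. This is a minimal sketch that assumes the return value of `sdk.list_resource_shapes()` has the dictionary layout shown above; the helper name and the sample entries are hypothetical, and it works on plain dictionaries rather than calling the SDK.

```python
def shapes_with_gpu_count(shapes, gpu_count):
    """Return the entries of `shapes` that report exactly `gpu_count` GPUs.

    `shapes` is assumed to map a platform ID to a dict containing at least
    a "gpu_count" key, as in the listing above.
    """
    return {
        platform_id: info
        for platform_id, info in shapes.items()
        if info.get("gpu_count") == gpu_count
    }


# Made-up sample data in the documented layout.
sample_shapes = {
    "shape-a": {"cluster": "c1", "gpu_type": "gpu.8xh100-sxm", "gpu_count": 8},
    "shape-b": {"cluster": "c1", "gpu_type": "example-1gpu-shape", "gpu_count": 1},
}
eight_gpu_shapes = shapes_with_gpu_count(sample_shapes, 8)
```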
Multi-node training (distributed)
Pass `num_nodes > 1` to `create_job` for multi-node distributed training. The Lepton handler (`tao_sdk/platforms/lepton/handler.py`) configures the underlying `LeptonJob` by setting `intra_job_communication=True` (opens pod-to-pod networking), `parallelism=num_nodes` and `completions=num_nodes` (Lepton schedules N replicas), and exports `WORLD_SIZE=num_nodes` as a container env var.

Lepton's native per-replica env vars use Lepton-specific names (`LEPTON_JOB_WORKER_INDEX`, `LEPTON_JOB_TOTAL_WORKERS`, `LEPTON_JOB_WORKER_PREFIX`, `LEPTON_SUBDOMAIN`), so the handler prepends a bootstrap that sources Lepton's official translation script:

```bash
wget -O init.sh https://raw.githubusercontent.com/leptonai/scripts/main/lepton_env_to_pytorch.sh
chmod +x init.sh
source init.sh
# user command runs here
```
After sourcing, the following env vars are set:
| Env var | Source | Value |
|---|---|---|
| `MASTER_ADDR` | script | `${LEPTON_JOB_WORKER_PREFIX}-0.${LEPTON_SUBDOMAIN}` |
| `MASTER_PORT` | script | `29400` |
| `NNODES` | script | `${LEPTON_JOB_TOTAL_WORKERS}` |
| `NODE_RANK` | script | `${LEPTON_JOB_WORKER_INDEX}` |
| `WORKER_ADDRS` | script | comma-separated list of non-master worker hostnames |
| `WORLD_SIZE` | TAO SDK handler | `num_nodes` (TAO container's convention — same value as `NNODES`) |
| `NUM_GPU_PER_NODE` | TAO SDK handler | `gpu_count` (read by TAO container's entrypoint) |
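The mapping in the table can be mirrored in Python, for example to unit-test launcher logic without a cluster. This sketch reproduces the table only and is not the contents of `lepton_env_to_pytorch.sh`; the function name is hypothetical, and the hostname pattern used for `WORKER_ADDRS` is an assumption extrapolated from the `MASTER_ADDR` format.

```python
def pytorch_env_from_lepton(env, num_nodes, gpu_count):
    """Build the PyTorch-style variables from Lepton's per-replica variables."""
    prefix = env["LEPTON_JOB_WORKER_PREFIX"]
    subdomain = env["LEPTON_SUBDOMAIN"]
    total = int(env["LEPTON_JOB_TOTAL_WORKERS"])
    # Assumption: worker i is reachable at "<prefix>-<i>.<subdomain>",
    # following the documented form for the master (i = 0).
    hosts = [f"{prefix}-{i}.{subdomain}" for i in range(total)]
    return {
        "MASTER_ADDR": hosts[0],
        "MASTER_PORT": "29400",
        "NNODES": str(total),
        "NODE_RANK": env["LEPTON_JOB_WORKER_INDEX"],
        "WORKER_ADDRS": ",".join(hosts[1:]),
        # These two are set by the TAO SDK handler rather than by the script.
        "WORLD_SIZE": str(num_nodes),
        "NUM_GPU_PER_NODE": str(gpu_count),
    }
```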
```python
job = sdk.create_job(
    image='nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt',
    command='dino train -e /tmp/spec.yaml',   # TAO entrypoint reads WORLD_SIZE + NUM_GPU_PER_NODE
    gpu_count=8,                               # GPUs per node
    num_nodes=4,                               # 4 × 8 = 32 GPUs total
    dedicated_node_group='my-h100-pool',
    inputs={'/data/train.json': 's3://bucket/coco/train.json'},
    outputs=['/results/'],
)
```

For raw `torchrun`-based commands (non-TAO containers):

```python
# Parenthesized so the two string literals concatenate into one command.
command = (
    'torchrun --nnodes=$NNODES --nproc-per-node=8 --node-rank=$NODE_RANK '
    '--master-addr=$MASTER_ADDR --master-port=$MASTER_PORT train.py'
)
```
Two ways to run distributed jobs on Lepton
| Path | When to use |
|---|---|
| TAO SDK | Programmatic submission from agent code; you want the SDK's S3 wrapping, monitoring, failure analysis, and JobStore. |
| Lepton "Torchrun" job type (Lepton UI / lep CLI) | Hand-crafted submission via the Lepton console. Lepton's UI has a first-class "Torchrun" mode that wires up the rendezvous for you — no bootstrap script needed. See the official example. |
Reference reading
- NVIDIA's Lepton multi-node PyTorch example (UI / Torchrun mode): https://docs.nvidia.com/dgx-cloud/lepton/examples/batch-job/distributed-training-with-pytorch/
- The translation script the SDK sources: https://github.com/leptonai/scripts/blob/main/lepton_env_to_pytorch.sh
- PyTorch distributed (env-var rendezvous): https://pytorch.org/docs/stable/elastic/run.html
- NCCL networking tuning: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
Notes
- Prefer `dedicated_node_group` for multi-node to keep replicas on the same low-latency interconnect (NVLink / InfiniBand).
- If a replica is preempted on a shared node group, the whole job fails; Lepton doesn't elastically restart in v1. Use a dedicated node group for long runs.
- For Lustre-backed datasets, the same mount is exposed to every replica — no per-replica I/O wrapping needed.
Cloud Storage
Even though the platform is Lepton, the storage layer is S3-compatible. Always use `aws` as the `cloud_metadata` key and `s3://` as the URI protocol for both datasets and `results_dir`.
- Correct: `s3://bucket-name/path`
- Incorrect: `lepton://bucket-name/path`

The container's `get_cloud_storage_class_object()` parses the URI protocol to look up credentials in `CLOUD_METADATA[protocol][bucket]`.
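A wrong protocol can be caught before submission rather than inside the container. The following is a minimal sketch using only the standard library; `check_storage_uri` is a hypothetical helper, not a TAO SDK function, and it only validates the URI shape without contacting any storage backend.

```python
from urllib.parse import urlparse


def check_storage_uri(uri):
    """Return (bucket, key) for an s3:// URI; raise ValueError for anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    if not parsed.netloc:
        raise ValueError(f"missing bucket name in {uri!r}")
    # The bucket is the URI authority; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")
```

Running this over every dataset URI and `results_dir` rejects values such as `lepton://bucket-name/path` early.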
Shared Storage (NFS/Lustre)
Node groups can have NFS or Lustre volumes attached. The SDK auto-detects these and mounts them into containers for persistent cross-job data sharing.
SDK Functions
- `sdk.get_volumes(node_group_id=None)`: returns available volumes (name, from_path, type) from the node group spec
- `sdk.get_storage_permissions(volume_name, node_group_id)`: returns allowed path prefixes for a volume

`LeptonSDK.create_job()` takes pre-built `Mount` objects through `mounts`; when `mounts` is not set, they are auto-detected from the node group.

How the script runner uses mounts
When a Lustre mount is available:
- Inputs: S3 paths are mapped to Lustre (`s3://bucket/path` → `/mnt/lustre/bucket/path`). If the file exists on Lustre, it's used directly (zero download). If missing, it's downloaded from S3 to Lustre and persists for future jobs.
- Outputs: Results write to Lustre first (fast, persistent), then upload to S3 (durable). Downstream jobs (e.g., gap analysis) can read results directly from Lustre without an S3 round-trip.
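The input mapping and the download-on-miss behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the script runner's actual code: the function names are hypothetical, the mount root defaults to the `/mnt/lustre` prefix from the example, and the existence check and download are injected callables so the sketch touches neither the filesystem nor S3.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def s3_to_lustre(uri, mount_root="/mnt/lustre"):
    """Map s3://bucket/path to <mount_root>/bucket/path."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return str(PurePosixPath(mount_root) / parsed.netloc / parsed.path.lstrip("/"))


def resolve_input(uri, exists, download, mount_root="/mnt/lustre"):
    """Return the Lustre path for `uri`, downloading only on a cache miss.

    `exists(path)` and `download(uri, path)` are caller-supplied callables,
    for example os.path.exists and a wrapper around an S3 client.
    """
    path = s3_to_lustre(uri, mount_root)
    if not exists(path):
        download(uri, path)
    return path
```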
Volume preference order
lustre > filestore > first available
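That ordering can be expressed as a small selection function over the list returned by `sdk.get_volumes()`. This is a sketch only: `pick_volume` is a hypothetical name, and matching on the volume's `name` field (case-insensitively) is an assumption, since the text does not say which field the SDK compares.

```python
def pick_volume(volumes):
    """Pick a volume using the order lustre > filestore > first available.

    `volumes` is a list of dicts with at least a "name" key. Returns None
    when the list is empty.
    """
    by_name = {volume["name"].lower(): volume for volume in volumes}
    for preferred in ("lustre", "filestore"):
        if preferred in by_name:
            return by_name[preferred]
    return volumes[0] if volumes else None
```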
Lustre Cache Invalidation
Lustre caches files persistently across jobs. There is no built-in invalidation. If upstream data changes but the S3 path stays the same, Lustre serves the stale cached version. To force a cache miss:
- Rename the file on S3 (e.g., `prompt_v2.txt` instead of overwriting `prompt.txt`)
- Use a new storage_root between iterations to avoid cross-iteration staleness
- Use a new path for any regenerated artifacts
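One way to guarantee a new path whenever the content changes is to derive part of the file name from a content hash instead of a hand-maintained version suffix. This is a suggestion layered on the rules above, not something the SDK does; `versioned_key` is a hypothetical helper.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_key(key, content):
    """Return `key` with a short SHA-256 prefix of `content` before the extension.

    `content` must be bytes. Identical content always yields the same key,
    so unchanged files still hit the Lustre cache.
    """
    path = PurePosixPath(key)
    digest = hashlib.sha256(content).hexdigest()[:8]
    return str(path.with_name(f"{path.stem}_{digest}{path.suffix}"))
```

Uploading under the returned key means a modified file never reuses the S3 path of a stale cached copy.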
Monitoring
Job Status
Use `sdk.get_job_status(job_id)` for high-level status (Pending, Running, Complete, Error).
Replica Status
Use `sdk.get_job_replicas(job_id)` during startup for detailed replica-level info. Each replica is a dict:

```python
replicas = sdk.get_job_replicas(job_id)
for r in replicas:
    node = r["status"]["node"]["name"]             # e.g., "node-ip-10-50-111-24"
    node_group = r["status"]["node"]["node_group_id"]
    cpu = r["status"]["cpu"]                        # e.g., 2
    memory_mb = r["status"]["memory_in_mb"]         # e.g., 8192
    readiness = r["status"].get("readiness_issue")
    if readiness:
        reason = readiness["reason"]                # "InProgress", "Failed", "ConfigError"
        message = readiness["message"]              # "Pulling image", "Mount point not found", etc.
```

Key readiness_issue patterns:
- `reason="InProgress"`, `message="Pulling image"`: image pull in progress (normal for large images)
- `reason="Failed"`: image pull failed (check NGC_KEY)
- `reason="ConfigError"`: node issue (mount failure, GPU error)
- No `readiness_issue`: replica is running

Replica status is especially useful when a job is stuck in Pending: it reveals whether the issue is image pulling, resource scheduling, or node health.
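The readiness patterns above translate directly into a small classifier. This sketch assumes the replica dictionary layout shown in this section and operates on plain dicts; `describe_replica` is a hypothetical helper name.

```python
def describe_replica(replica):
    """Summarise one replica dict using the readiness_issue patterns."""
    issue = replica["status"].get("readiness_issue")
    if not issue:
        return "running"
    reason = issue.get("reason")
    message = issue.get("message", "")
    if reason == "InProgress":
        return f"starting: {message}"
    if reason == "Failed":
        return f"image pull failed (check NGC_KEY): {message}"
    if reason == "ConfigError":
        return f"node issue: {message}"
    # Reasons not listed in this document are reported rather than guessed at.
    return f"unrecognized reason {reason!r}: {message}"
```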
Job Logs
Use `sdk.get_job_logs(job_id, tail=N)` for the most recent N log lines. Logs are fetched from Lepton's log collection service.
Parallel Jobs
For workflow stages that run in parallel (e.g., video generation x8):
- Launch: Call `execute_step(plan, step_id, extra_args={"split_id": i})` for each split. Each call returns immediately with a job_id.
- Monitor: Poll all jobs with `sdk.get_job_status(job_id)` for each. Use `get_job_replicas(job_id)` for startup diagnostics.
- Completion: All jobs are done when every status is `Complete` or `Error`.
- Partial failure: Retry only failed splits; successful splits don't need re-running. Pass the same `split_id` to `execute_step`.
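The monitor, completion, and partial-failure steps above can be sketched as one polling loop. The status lookup is passed in as a callable so the sketch does not depend on the SDK; in practice it could be `sdk.get_job_status`, which is assumed here to return the status as a plain string. The function name and the 30-second default interval are hypothetical.

```python
import time

TERMINAL = {"Complete", "Error"}


def wait_for_splits(job_ids, get_status, poll_seconds=30.0, sleep=time.sleep):
    """Poll until every job is terminal; return the split_ids that ended in Error.

    `job_ids` maps split_id -> job_id. `sleep` is injectable so the loop can
    be tested without waiting.
    """
    pending = dict(job_ids)
    failed = []
    while pending:
        for split_id, job_id in list(pending.items()):
            status = get_status(job_id)
            if status in TERMINAL:
                del pending[split_id]
                if status == "Error":
                    failed.append(split_id)
        if pending:
            sleep(poll_seconds)
    return sorted(failed)
```

The returned list is what would be passed back to `execute_step`, one call per failed `split_id`.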
Failure Analysis
When a job fails, use `sdk.get_failure_analysis(job_id)` for automatic root cause detection:

```python
analysis = sdk.get_failure_analysis(job_id)
if analysis:
    print(analysis["err_class"])    # e.g., "ERR_PROGRAM"
    print(analysis["suggestion"])   # Human-readable fix
    for event in analysis.get("job_failure_by_node_event", []):
        print(event["node_event_name"], event["message"])
        # e.g., "OOM", "OOM encountered, victim process: cosmos-rl-evalu, pid: 3368483"
```

Returns:
- `err_class`: Error classification (`ERR_PROGRAM`, `ERR_INFRA`, etc.)
- `suggestion`: What likely went wrong and how to fix it
- `job_failure_by_node_event`: Node-level events (OOM kills, GPU errors, mount failures)
- `log_streams`: Relevant log snippets with error context

Always call this on failed jobs before retrying: it distinguishes user errors (bad config, OOM) from infrastructure issues (node failure, eviction).
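A retry policy built on that distinction might look like the sketch below. It rests on two assumptions that the text does not confirm: that `ERR_INFRA` marks infrastructure issues while `ERR_PROGRAM` marks user errors, and that an `OOM` node event should always be treated as a user error. The function name is hypothetical.

```python
def should_resubmit_unchanged(analysis):
    """Return True when resubmitting the same job is reasonable.

    False means the job likely needs a config change (or manual inspection)
    before it is retried.
    """
    if not analysis:
        return False  # no analysis available: inspect logs before retrying
    events = analysis.get("job_failure_by_node_event", [])
    if any(event.get("node_event_name") == "OOM" for event in events):
        return False  # OOM: change batch size or resources first
    # Assumption: ERR_INFRA is the infrastructure class.
    return analysis.get("err_class") == "ERR_INFRA"
```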
Failure Modes
OOM killed: Container exceeded GPU or system memory. Detection: `get_failure_analysis()` returns `node_event_name: "OOM"`. Common causes: `evaluation.batch_size` too high, `max_length` too large for available KV cache. Recovery: reduce batch_size, add GPUs with tensor parallelism, or reduce max_length.
Image pull failure: The TAO container image cannot be pulled from nvcr.io. Usually caused by a missing or expired image pull secret. The SDK auto-provisions the secret from NGC_KEY, but if NGC_KEY is invalid, the job will fail. Detection: check `get_job_replicas()`; `readiness_issue.reason` will show `InProgress` with `message = "Pulling image"` for extended periods, or `Failed` if the pull fails. Recovery: verify NGC_KEY is valid.
Resource unavailable: The requested GPU shape is not available. Job enters Queueing state indefinitely. Detection: Pending > 15 minutes, replicas show no node assignment. Recovery: try a different resource_shape or dedicated_node_group, or wait for resources.
Auth failure: Invalid or expired LEPTON_AUTH_TOKEN. All API calls fail with 401/403. Detection: job creation raises an exception immediately. Recovery: refresh the token and reinitialize the SDK.
Unhealthy node: The assigned node has infrastructure issues (mount failures, GPU errors, network problems). Detection: check `get_job_replicas()` for `readiness_issue.reason = "ConfigError"` with messages like `"Mount point not found"`. The job stays Pending indefinitely on the bad node. Recovery: cancel the job and resubmit; Lepton will schedule on a different node. If the issue recurs, try a different `dedicated_node_group` or `resource_shape`.
Job eviction: On shared node groups, Lepton may evict jobs under resource pressure. Detection: job unexpectedly transitions from Running to Error. Recovery: retry, or use a dedicated_node_group.
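The "Resource unavailable" detection rule above (Pending for more than 15 minutes with no node assignment) can be sketched as a predicate. It assumes the replica layout from the Replica Status section and that the caller tracks how long the job has been Pending; the function name and its parameters are hypothetical.

```python
PENDING_LIMIT_SECONDS = 15 * 60


def looks_resource_starved(status, pending_seconds, replicas):
    """True when a job has been Pending past the limit and no replica has a node."""
    if status != "Pending" or pending_seconds <= PENDING_LIMIT_SECONDS:
        return False
    for replica in replicas:
        node = replica.get("status", {}).get("node") or {}
        if node.get("name"):
            return False  # at least one replica was scheduled onto a node
    return True
```

When it returns True, the recovery options listed above apply: a different `resource_shape` or `dedicated_node_group`, or waiting for capacity.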