dynamo-troubleshoot

Dynamo Troubleshoot

<!-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: CC-BY-4.0 -->

Purpose

Turn a Dynamo failure into a clear problem class, strongest signal, and next action. Start with read-only evidence, avoid secrets, and fix one layer at a time.

Prerequisites

  • Python 3.10+ on the operator machine.
  • `kubectl` configured with read access to the target namespace.
  • Permission to read pods, events, jobs, PVCs, and `DynamoGraphDeployment` resources (NOT secrets).
  • Network reachability to the cluster API server.
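
The prerequisites above can be verified before any evidence is collected. The following is a minimal sketch, not part of the skill itself: the helper names are hypothetical, and the resource names (in particular the plural `dynamographdeployments`) are assumptions that may differ on a given cluster. It only builds `kubectl auth can-i` commands; it does not run them.

```python
import shutil
import sys

# Assumption: resource names as accepted by `kubectl auth can-i`; the plural
# name of the DynamoGraphDeployment custom resource is a guess.
READ_RESOURCES = [
    "pods",
    "events",
    "jobs",
    "persistentvolumeclaims",
    "dynamographdeployments",
]


def python_ok(version_info=sys.version_info):
    """True when the interpreter is Python 3.10 or newer."""
    return tuple(version_info[:2]) >= (3, 10)


def kubectl_on_path():
    """True when a kubectl binary can be found on PATH."""
    return shutil.which("kubectl") is not None


def can_i_commands(namespace, resources=READ_RESOURCES):
    """Build (not run) one `kubectl auth can-i get` command per resource."""
    return [
        ["kubectl", "auth", "can-i", "get", resource, "--namespace", namespace]
        for resource in resources
    ]
```

Each returned command can be run with `subprocess.run`; `kubectl auth can-i` answers `yes` or `no`, so a `no` for any listed resource points at missing read RBAC.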

Instructions

1. Collect A Read-Only Bundle

Run:

```bash
python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace "${NAMESPACE}"
```

If the user names a deployment, include it:

```bash
python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace "${NAMESPACE}" \
  --deployment-name <deployment-name>
```

Do not collect Kubernetes secrets. Do not print Hugging Face tokens.
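
One way to honor the no-token rule when quoting logs or events is to redact anything that looks like a Hugging Face token before it is printed. This is a hedged sketch: the `redact` helper is hypothetical, and the pattern assumes tokens begin with `hf_` followed by alphanumeric characters, which may not cover every token format.

```python
import re

# Assumption: Hugging Face tokens look like "hf_" plus alphanumerics.
HF_TOKEN_RE = re.compile(r"\bhf_[A-Za-z0-9]{8,}")


def redact(text):
    """Replace anything that looks like a Hugging Face token with a marker."""
    return HF_TOKEN_RE.sub("hf_***REDACTED***", text)
```

Applying `redact` to every line before it reaches the terminal or a report keeps evidence readable without exposing credentials.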

2. Classify The Failure

Use `references/failure-decision-tree.md` and classify into one primary bucket:
  • cluster/platform
  • namespace/secret
  • model cache/PVC/download
  • image pull/runtime image
  • GPU scheduling/resources
  • operator/DynamoGraphDeployment reconciliation
  • frontend/router
  • worker/backend
  • endpoint/API
  • benchmark/perf job
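
As an illustration of mapping a strong signal to one of the buckets above, the sketch below matches a few common Kubernetes event and status messages. The rules are hypothetical and cover only four buckets; the authoritative checks are in `references/failure-decision-tree.md`, which is not reproduced here.

```python
import re

# Hypothetical signal-to-bucket rules, matched against lowercased text.
# The first matching rule wins, so order expresses priority.
RULES = [
    ("image pull/runtime image", re.compile(r"imagepullbackoff|errimagepull")),
    ("GPU scheduling/resources", re.compile(r"insufficient nvidia\.com/gpu")),
    ("model cache/PVC/download", re.compile(r"unbound immediate persistentvolumeclaims")),
    ("namespace/secret", re.compile(r'secret "[^"]+" not found')),
]


def classify(signal):
    """Return the first bucket whose pattern matches, else 'unclassified'."""
    text = signal.lower()
    for bucket, pattern in RULES:
        if pattern.search(text):
            return bucket
    return "unclassified"
```

Returning `unclassified` rather than guessing keeps the report honest when no rule matches and the decision tree has to be consulted by hand.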

3. Debug Top Down

Check in this order:
  1. namespace, storage class, GPU nodes, and HF secret existence
  2. PVC and model-download job
  3. `DynamoGraphDeployment` status and events
  4. pod status, `describe pod`, and container logs
  5. frontend service and port-forward
  6. `/v1/models`
  7. `/v1/chat/completions`
  8. benchmark job only after endpoint smoke test passes
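
The ordering above amounts to "stop at the first layer that fails". A minimal sketch of that control flow follows; the layer labels mirror the list, while the check callables are placeholders that a caller would supply (for example, wrappers around read-only `kubectl get` calls).

```python
# Layer labels taken from the ordered list above.
LAYERS = [
    "namespace, storage class, GPU nodes, HF secret",
    "PVC and model-download job",
    "DynamoGraphDeployment status and events",
    "pod status, describe pod, container logs",
    "frontend service and port-forward",
    "/v1/models",
    "/v1/chat/completions",
    "benchmark job",
]


def first_failing_layer(checks):
    """Run (name, check) pairs in order; return the first failing name.

    Each check is a zero-argument callable returning True on success.
    Deeper checks are not run once a layer fails.
    """
    for name, check in checks:
        if not check():
            return name
    return None
```

Because evaluation stops at the first failure, a broken PVC is reported before anyone spends time reading worker logs that only fail as a consequence.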

4. Fix One Layer At A Time

Prefer the smallest reversible change:
  • create missing namespace or HF secret
  • patch `storageClassName`
  • patch image tag or image pull secret
  • reduce GPU request only if the recipe can still be valid
  • switch KV router to approximate mode only if workers do not publish events
  • restart failed jobs after fixing the underlying config
After each fix, rerun the relevant readiness check before moving deeper.
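
To make "smallest reversible change" concrete, the sketch below builds a JSON merge patch that touches exactly one field, together with the patch that reverts it. The helper names and the field path used in the usage note are illustrative assumptions, not the skill's actual patch format.

```python
import json


def nested(path, value):
    """Wrap value in nested dicts, e.g. ["spec", "x"] -> {"spec": {"x": value}}."""
    result = value
    for key in reversed(path):
        result = {key: result}
    return result


def reversible_patch(path, old_value, new_value):
    """Return (forward, revert) JSON merge patches for a single field."""
    return nested(path, new_value), nested(path, old_value)


def as_patch_argument(patch):
    """Serialize a patch for use as the value of `kubectl patch --type merge -p`."""
    return json.dumps(patch)
```

For example, `reversible_patch(["spec", "storageClassName"], "standard", "fast-ssd")` yields a forward patch and its inverse; recording both alongside the readiness check that was rerun makes each fix easy to undo.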

Available Scripts

| Script | Purpose | Arguments |
| --- | --- | --- |
| `scripts/collect_dynamo_debug_bundle.py` | Collect a read-only debug bundle (pods, events, jobs, PVCs, CR status) | `--namespace`, `--deployment-name`, `--output-dir` |

Invoke via the agentskills.io `run_script()` protocol:

```python
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo"])
```
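
The argument list passed to the script can be assembled from the three documented flags. The helper below is a hypothetical convenience, and treating `--namespace` as required is an assumption based on the examples in this document rather than on the script's source.

```python
def bundle_args(namespace, deployment_name=None, output_dir=None):
    """Build the args list for collect_dynamo_debug_bundle.py."""
    if not namespace:
        # Assumption: the script needs a namespace to scope collection.
        raise ValueError("namespace is required")
    args = ["--namespace", namespace]
    if deployment_name:
        args += ["--deployment-name", deployment_name]
    if output_dir:
        args += ["--output-dir", output_dir]
    return args
```

The result can be handed to `run_script(..., args=bundle_args("dynamo-demo"))` or appended to a `python3 scripts/collect_dynamo_debug_bundle.py` command line.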

Examples

Collect everything in a namespace for triage:

```bash
python3 scripts/collect_dynamo_debug_bundle.py --namespace dynamo-demo
```

Scope to a single failing deployment:

```bash
python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace dynamo-demo \
  --deployment-name qwen-vllm-disagg
```

Equivalent through the agent protocol:

```python
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo", "--deployment-name", "qwen-vllm-disagg"])
```

Output Contract

Return:
  • problem class
  • evidence checked
  • strongest signal
  • likely cause
  • exact next command or patch
  • what was ruled out
  • whether it is safe to continue deployment or benchmarking
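
The seven items above could be carried in a small record so that no field is forgotten. This is a sketch only: the class name, field names, and rendering format are assumptions, not a schema defined by the skill.

```python
from dataclasses import dataclass, field


@dataclass
class TriageReport:
    """Hypothetical container for the seven items of the output contract."""

    problem_class: str
    evidence_checked: list
    strongest_signal: str
    likely_cause: str
    next_action: str
    ruled_out: list = field(default_factory=list)
    safe_to_continue: bool = False

    def render(self):
        """Format the report as one labeled line per contract item."""
        return "\n".join([
            f"problem class: {self.problem_class}",
            f"evidence checked: {', '.join(self.evidence_checked)}",
            f"strongest signal: {self.strongest_signal}",
            f"likely cause: {self.likely_cause}",
            f"next command or patch: {self.next_action}",
            f"ruled out: {', '.join(self.ruled_out) or 'nothing yet'}",
            f"safe to continue: {'yes' if self.safe_to_continue else 'no'}",
        ])
```

Defaulting `safe_to_continue` to `False` means a report never implies that deployment or benchmarking may proceed unless that was decided explicitly.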

Limitations

  • Read-only. Never mutates the cluster; remediation commands are returned, not executed.
  • Will not collect secrets or print Hugging Face tokens; some failure modes (auth) may need user-side inspection.
  • Bundle size grows with deployment size; on very large namespaces, scope with `--deployment-name`.
  • Does not validate disaggregated transport; use `dynamo-interconnect-check` for that.

Troubleshooting

| Symptom | Likely cause | Next step |
| --- | --- | --- |
| `kubectl` returns Forbidden on events/pods | Service account lacks read RBAC | Ask operator for read-only role binding on the namespace |
| Bundle missing `DynamoGraphDeployment` status | Operator not installed or different namespace | Verify `dynamo-platform` operator is installed and watching the namespace |
| Model-download job in `Pending` | PVC unbound or HF secret missing | Fix PVC binding or create the named HF secret, then rerun the job |
| Worker pods `CrashLoopBackOff` | Image/runtime mismatch or GPU not available | Inspect container logs; check `nvidia.com/gpu` allocatable on nodes |
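
For the last row, the allocatable GPU count per node can be read from the JSON that `kubectl get nodes -o json` prints. The sketch below parses an already-loaded document; the function name is hypothetical, and nodes that do not advertise `nvidia.com/gpu` are reported as 0.

```python
def gpu_allocatable(node_list):
    """Map node name -> allocatable nvidia.com/gpu count.

    node_list is the parsed output of `kubectl get nodes -o json`.
    Kubernetes reports the quantity as a string, so it is converted to int.
    """
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result
```

If every node reports 0, the device plugin or GPU node pool is the likelier cause; if the total is positive but smaller than the worker request, the problem is one of scheduling capacity.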

Benchmark

See `BENCHMARK.md` for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run `/nvskills-ci` on an upstream PR touching this skill.

References

  • Read `references/failure-decision-tree.md` for bucket-specific checks.
  • Use `scripts/collect_dynamo_debug_bundle.py` for read-only bundle collection.