dynamo-troubleshoot

Dynamo Troubleshoot

Purpose

Turn a Dynamo failure into a clear problem class, strongest signal, and next action. Start with read-only evidence, avoid secrets, and fix one layer at a time.

Prerequisites

- Python 3.10+ on the operator machine.
- `kubectl` configured with read access to the target namespace.
- Permission to read pods, events, jobs, PVCs, and `DynamoGraphDeployment` resources (NOT secrets).
- Network reachability to the cluster API server.

Instructions

1. Collect A Read-Only Bundle

Run:

```bash
python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace "${NAMESPACE}"
```

If the user names a deployment, include it:

```bash
python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace "${NAMESPACE}" \
  --deployment-name <deployment-name>
```

Do not collect Kubernetes secrets. Do not print Hugging Face tokens.
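
One way to back the rule against printing Hugging Face tokens is a redaction pass over any text before it is shown. The following is a minimal sketch and not part of the skill's script: `redact_hf_tokens` is a hypothetical helper, and the pattern assumes tokens start with `hf_` followed by letters and digits.

```python
import re

# Assumption: Hugging Face user access tokens look like "hf_" followed by
# a run of letters and digits. Adjust the pattern if your tokens differ.
_HF_TOKEN = re.compile(r"hf_[A-Za-z0-9]{8,}")


def redact_hf_tokens(text: str) -> str:
    """Replace anything that looks like a Hugging Face token with a marker."""
    return _HF_TOKEN.sub("hf_***REDACTED***", text)


if __name__ == "__main__":
    # Made-up log line for illustration only.
    line = "using token hf_abcdefghijklmnopqrstuvwxyz123456 for download"
    print(redact_hf_tokens(line))
```

Applying the filter at the point of output, rather than at collection time, keeps the check in one place even when new log sources are added.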

2. Classify The Failure

Use `references/failure-decision-tree.md` and classify into one primary bucket:

- cluster/platform
- namespace/secret
- model cache/PVC/download
- image pull/runtime image
- GPU scheduling/resources
- operator/DynamoGraphDeployment reconciliation
- frontend/router
- worker/backend
- endpoint/API
- benchmark/perf job
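
The bucket list above can be represented as an enumeration so that a report always names exactly one of the ten classes. This is a sketch under stated assumptions: the `Bucket` enum, `REASON_HINTS`, and `first_guess` are hypothetical names, and the hint table maps a few standard Kubernetes status reasons to a first guess only. It does not reproduce the contents of `references/failure-decision-tree.md`.

```python
from enum import Enum
from typing import Optional


class Bucket(Enum):
    """The ten primary buckets from the classification step."""
    CLUSTER_PLATFORM = "cluster/platform"
    NAMESPACE_SECRET = "namespace/secret"
    MODEL_CACHE = "model cache/PVC/download"
    IMAGE = "image pull/runtime image"
    GPU = "GPU scheduling/resources"
    OPERATOR = "operator/DynamoGraphDeployment reconciliation"
    FRONTEND = "frontend/router"
    WORKER = "worker/backend"
    ENDPOINT = "endpoint/API"
    BENCHMARK = "benchmark/perf job"


# Illustrative hints only (assumption): a first-guess bucket for a few
# standard Kubernetes status reasons. The real decision tree may differ.
REASON_HINTS = {
    "ErrImagePull": Bucket.IMAGE,
    "ImagePullBackOff": Bucket.IMAGE,
    "Unschedulable": Bucket.GPU,
    "CrashLoopBackOff": Bucket.WORKER,
}


def first_guess(reason: str) -> Optional[Bucket]:
    """Return a first-guess bucket for a status reason, or None if unknown."""
    return REASON_HINTS.get(reason)


if __name__ == "__main__":
    print(first_guess("ImagePullBackOff"))
```

A first guess is only a starting point; the decision tree and the collected evidence still decide the final class.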

3. Debug Top Down

Check in this order:

1. namespace, storage class, GPU nodes, and HF secret existence
2. PVC and model-download job
3. `DynamoGraphDeployment` status and events
4. pod status, `describe pod`, and container logs
5. frontend service and port-forward
6. `/v1/models`
7. `/v1/chat/completions`
8. benchmark job only after endpoint smoke test passes
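
The top-down order can be sketched as a list of named checks that stops at the first failure. This is an illustration, not the skill's implementation: `first_failing_layer` is a hypothetical helper, and the lambdas stand in for real probes such as `kubectl` queries or HTTP requests.

```python
from typing import Callable, Optional, Sequence, Tuple

Check = Tuple[str, Callable[[], bool]]


def first_failing_layer(checks: Sequence[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails.

    Later checks are skipped, since a failure in an upper layer makes
    results from deeper layers hard to interpret.
    """
    for name, check in checks:
        if not check():
            return name
    return None


if __name__ == "__main__":
    # Placeholder results for illustration; real checks would query the cluster.
    checks = [
        ("namespace, storage class, GPU nodes, HF secret", lambda: True),
        ("PVC and model-download job", lambda: True),
        ("DynamoGraphDeployment status and events", lambda: False),
        ("pod status and container logs", lambda: True),
    ]
    print(first_failing_layer(checks))
```

Returning only the first failing layer matches the rule that the benchmark job is examined only once everything above it passes.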

4. Fix One Layer At A Time

Prefer the smallest reversible change:

- create missing namespace or HF secret
- patch `storageClassName`
- patch image tag or image pull secret
- reduce GPU request only if the recipe can still be valid
- switch KV router to approximate mode only if workers do not publish events
- restart failed jobs after fixing the underlying config

After each fix, rerun the relevant readiness check before moving deeper.

Available Scripts

| Script | Purpose | Arguments |
| --- | --- | --- |
| `scripts/collect_dynamo_debug_bundle.py` | Collect a read-only debug bundle (pods, events, jobs, PVCs, CR status) | `--namespace`, `--deployment-name`, `--output-dir` |

Invoke via the agentskills.io `run_script()` protocol:

```python
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo"])
```

Examples

Collect everything in a namespace for triage:

```bash
python3 scripts/collect_dynamo_debug_bundle.py --namespace dynamo-demo
```

Scope to a single failing deployment:

```bash
python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace dynamo-demo \
  --deployment-name qwen-vllm-disagg
```

Equivalent through the agent protocol:

```python
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo", "--deployment-name", "qwen-vllm-disagg"])
```
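
Since the shell form and the agent-protocol form pass the same flags, the argument list can be built in one place. The sketch below is hypothetical: `bundle_args` is not part of the skill, it assumes `--namespace` is required, and it assumes `--output-dir` takes a directory path. Only the three flags listed under Available Scripts are used.

```python
from typing import List, Optional


def bundle_args(namespace: str,
                deployment_name: Optional[str] = None,
                output_dir: Optional[str] = None) -> List[str]:
    """Build the argument list for collect_dynamo_debug_bundle.py."""
    if not namespace:
        # Assumption: the script cannot run without a namespace.
        raise ValueError("namespace is required")
    args = ["--namespace", namespace]
    if deployment_name:
        args += ["--deployment-name", deployment_name]
    if output_dir:
        # Assumption: --output-dir expects a directory path.
        args += ["--output-dir", output_dir]
    return args


if __name__ == "__main__":
    print(bundle_args("dynamo-demo", deployment_name="qwen-vllm-disagg"))
```

The returned list can be passed as `args=` to `run_script()` or appended to the `python3 scripts/collect_dynamo_debug_bundle.py` command line.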

Output Contract

Return:

- problem class
- evidence checked
- strongest signal
- likely cause
- exact next command or patch
- what was ruled out
- whether it is safe to continue deployment or benchmarking
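
The seven items above map naturally onto a small record type. The following is a sketch only: `TroubleshootReport` and its field names are hypothetical, and the sample values are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TroubleshootReport:
    """One field per item in the output contract."""
    problem_class: str
    evidence_checked: List[str]
    strongest_signal: str
    likely_cause: str
    next_action: str          # exact next command or patch
    ruled_out: List[str]
    safe_to_continue: bool    # deployment or benchmarking may proceed

    def render(self) -> str:
        """Format the report as plain text, one labeled line per field."""
        lines = [
            f"Problem class: {self.problem_class}",
            f"Evidence checked: {', '.join(self.evidence_checked)}",
            f"Strongest signal: {self.strongest_signal}",
            f"Likely cause: {self.likely_cause}",
            f"Next action: {self.next_action}",
            f"Ruled out: {', '.join(self.ruled_out)}",
            f"Safe to continue: {'yes' if self.safe_to_continue else 'no'}",
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example values.
    report = TroubleshootReport(
        problem_class="image pull/runtime image",
        evidence_checked=["pod status", "events"],
        strongest_signal="ImagePullBackOff on a worker pod",
        likely_cause="image tag does not exist",
        next_action="patch the image tag",
        ruled_out=["PVC binding"],
        safe_to_continue=False,
    )
    print(report.render())
```

Making every field mandatory means an incomplete answer fails at construction time instead of silently omitting a contract item.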

Limitations

- Read-only. Never mutates the cluster; remediation commands are returned, not executed.
- Will not collect secrets or print Hugging Face tokens; some failure modes (auth) may need user-side inspection.
- Bundle size grows with deployment size; on very large namespaces, scope with `--deployment-name`.
- Does not validate disagg transport; use `dynamo-interconnect-check` for that.

Troubleshooting

| Symptom | Likely cause | Next step |
| --- | --- | --- |
| `kubectl` returns Forbidden on events/pods | Service account lacks read RBAC | Ask operator for read-only role binding on the namespace |
| Bundle missing `DynamoGraphDeployment` status | Operator not installed or different namespace | Verify `dynamo-platform` operator is installed and watching the namespace |
| Model-download job in `Pending` | PVC unbound or HF secret missing | Fix PVC binding or create the named HF secret, then rerun the job |
| Worker pods `CrashLoopBackOff` | Image/runtime mismatch or GPU not available | Inspect container logs; check `nvidia.com/gpu` allocatable on nodes |
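
For the last row, the `nvidia.com/gpu` allocatable count can be read from node objects. The sketch below assumes the input is the JSON text produced by `kubectl get nodes -o json`; `gpu_allocatable` is a hypothetical helper and not part of the bundle script.

```python
import json
from typing import Dict


def gpu_allocatable(nodes_json: str) -> Dict[str, int]:
    """Map node name to its allocatable nvidia.com/gpu count.

    Nodes that do not advertise the resource are reported as 0.
    """
    nodes = json.loads(nodes_json)
    result: Dict[str, int] = {}
    for item in nodes.get("items", []):
        name = item["metadata"]["name"]
        allocatable = item.get("status", {}).get("allocatable", {})
        # Kubernetes serializes resource quantities as strings, e.g. "8".
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


if __name__ == "__main__":
    # Made-up two-node cluster for illustration.
    sample = json.dumps({"items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "64", "nvidia.com/gpu": "8"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "32"}}},
    ]})
    print(gpu_allocatable(sample))
```

A total of zero across all nodes points toward the GPU scheduling/resources bucket rather than a worker image problem.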

Benchmark

See `BENCHMARK.md` for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run `/nvskills-ci` on an upstream PR touching this skill.

References

- Read `references/failure-decision-tree.md` for bucket-specific checks.
- Use `scripts/collect_dynamo_debug_bundle.py` for read-only bundle collection.
