dynamo-troubleshoot
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseDynamo Troubleshoot
Dynamo 故障排查
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: CC-BY-4.0
-->
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: CC-BY-4.0
-->
Purpose
目的
Turn a Dynamo failure into a clear problem class, strongest signal, and next
action. Start with read-only evidence, avoid secrets, and fix one layer at a
time.
将Dynamo故障转化为明确的问题类别、最关键的信号以及下一步操作。从只读证据入手,避免涉及密钥,逐层排查修复。
Prerequisites
前提条件
- Python 3.10+ on the operator machine.
- configured with read access to the target namespace.
kubectl - Permission to read pods, events, jobs, PVCs, and resources (NOT secrets).
DynamoGraphDeployment - Network reachability to the cluster API server.
- 操作机器上安装Python 3.10及以上版本。
- 已配置,拥有目标命名空间的只读权限。
kubectl - 具备读取pod、事件、任务、PVC和资源的权限(不包括密钥)。
DynamoGraphDeployment - 能够连通集群API服务器。
Instructions
操作步骤
1. Collect A Read-Only Bundle
1. 收集只读调试包
Run:
bash
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace "${NAMESPACE}"If the user names a deployment, include it:
bash
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace "${NAMESPACE}" \
--deployment-name <deployment-name>Do not collect Kubernetes secrets. Do not print Hugging Face tokens.
运行以下命令:
bash
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace "${NAMESPACE}"如果用户指定了部署名称,需包含该参数:
bash
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace "${NAMESPACE}" \
--deployment-name <deployment-name>请勿收集Kubernetes密钥,请勿打印Hugging Face令牌。
2. Classify The Failure
2. 分类故障类型
Use and classify into one primary bucket:
references/failure-decision-tree.md- cluster/platform
- namespace/secret
- model cache/PVC/download
- image pull/runtime image
- GPU scheduling/resources
- operator/DynamoGraphDeployment reconciliation
- frontend/router
- worker/backend
- endpoint/API
- benchmark/perf job
使用将故障归类到以下主要类别之一:
references/failure-decision-tree.md- 集群/平台
- 命名空间/密钥
- 模型缓存/PVC/下载
- 镜像拉取/运行时镜像
- GPU调度/资源
- Operator/DynamoGraphDeployment 协调
- 前端/路由
- 工作节点/后端
- 端点/API
- 基准测试/性能任务
3. Debug Top Down
3. 自上而下排查
Check in this order:
- namespace, storage class, GPU nodes, and HF secret existence
- PVC and model-download job
- status and events
DynamoGraphDeployment - pod status, , and container logs
describe pod - frontend service and port-forward
/v1/models/v1/chat/completions- benchmark job only after endpoint smoke test passes
按以下顺序检查:
- 命名空间、存储类、GPU节点和HF密钥是否存在
- PVC和模型下载任务
- 的状态和事件
DynamoGraphDeployment - Pod状态、命令输出和容器日志
describe pod - 前端服务和端口转发
- 接口
/v1/models - 接口
/v1/chat/completions - 仅当端点冒烟测试通过后,再检查基准测试任务
4. Fix One Layer At A Time
4. 逐层修复
Prefer the smallest reversible change:
- create missing namespace or HF secret
- patch
storageClassName - patch image tag or image pull secret
- reduce GPU request only if the recipe can still be valid
- switch KV router to approximate mode only if workers do not publish events
- restart failed jobs after fixing the underlying config
After each fix, rerun the relevant readiness check before moving deeper.
优先选择最小的可逆变更:
- 创建缺失的命名空间或HF密钥
- 修补配置
storageClassName - 修补镜像标签或镜像拉取密钥
- 仅当仍能保证配置有效性时,减少GPU资源请求
- 仅当工作节点不发布事件时,将KV路由切换为近似模式
- 修复底层配置后,重启失败的任务
每次修复后,重新运行相关的就绪检查,再进行更深层的排查。
Available Scripts
可用脚本
| Script | Purpose | Arguments |
|---|---|---|
| Collect a read-only debug bundle (pods, events, jobs, PVCs, CR status) | |
Invoke via the agentskills.io protocol:
run_script()python
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo"])| 脚本 | 用途 | 参数 |
|---|---|---|
| 收集只读调试包(包含pod、事件、任务、PVC、自定义资源状态) | |
通过agentskills.io的协议调用:
run_script()python
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo"])Examples
示例
Collect everything in a namespace for triage:
bash
python3 scripts/collect_dynamo_debug_bundle.py --namespace dynamo-demoScope to a single failing deployment:
bash
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace dynamo-demo \
--deployment-name qwen-vllm-disaggEquivalent through the agent protocol:
python
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo", "--deployment-name", "qwen-vllm-disagg"])收集命名空间内的所有信息用于分类排查:
bash
python3 scripts/collect_dynamo_debug_bundle.py --namespace dynamo-demo仅针对单个故障部署收集信息:
bash
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace dynamo-demo \
--deployment-name qwen-vllm-disagg通过Agent协议的等效调用:
python
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo", "--deployment-name", "qwen-vllm-disagg"])Output Contract
输出约定
Return:
- problem class
- evidence checked
- strongest signal
- likely cause
- exact next command or patch
- what was ruled out
- whether it is safe to continue deployment or benchmarking
返回内容需包含:
- 问题类别
- 已检查的证据
- 最关键的信号
- 可能的原因
- 具体的下一步命令或修补操作
- 已排除的可能性
- 是否可以继续部署或基准测试
Limitations
局限性
- Read-only. Never mutates the cluster; remediation commands are returned, not executed.
- Will not collect secrets or print Hugging Face tokens; some failure modes (auth) may need user-side inspection.
- Bundle size grows with deployment size; on very large namespaces, scope with .
--deployment-name - Does not validate disagg transport — use for that.
dynamo-interconnect-check
- 只读模式:绝不会修改集群;仅返回修复命令,不会自动执行。
- 不会收集密钥或打印Hugging Face令牌;某些故障模式(如认证问题)可能需要用户自行检查。
- 调试包大小随部署规模增长;对于大型命名空间,请使用参数缩小范围。
--deployment-name - 不验证解耦传输——请使用工具进行该检查。
dynamo-interconnect-check
Troubleshooting
故障排查
| Symptom | Likely cause | Next step |
|---|---|---|
| Service account lacks read RBAC | Ask operator for read-only role binding on the namespace |
Bundle missing | Operator not installed or different namespace | Verify |
Model-download job in | PVC unbound or HF secret missing | Fix PVC binding or create the named HF secret, then rerun the job |
Worker pods | Image/runtime mismatch or GPU not available | Inspect container logs; check |
| 症状 | 可能原因 | 下一步操作 |
|---|---|---|
| 服务账号缺少只读RBAC权限 | 请求运维人员为该命名空间添加只读角色绑定 |
调试包中缺少 | Operator未安装或不在对应命名空间 | 确认 |
模型下载任务处于 | PVC未绑定或HF密钥缺失 | 修复PVC绑定或创建指定的HF密钥,然后重新运行任务 |
工作节点Pod处于 | 镜像/运行时不匹配或GPU不可用 | 检查容器日志;查看节点上 |
Benchmark
基准测试
See for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run on an upstream PR touching this skill.
BENCHMARK.md/nvskills-ci有关NVCARPS-EVAL性能报告,请查看(由NVSkills CI流水线自动生成)。如需刷新报告,请在触及本技能的上游PR上重新运行。
BENCHMARK.md/nvskills-ciReferences
参考资料
- Read for bucket-specific checks.
references/failure-decision-tree.md - Use for read-only bundle collection.
scripts/collect_dynamo_debug_bundle.py
- 查看获取特定类别故障的检查方法。
references/failure-decision-tree.md - 使用收集只读调试包。
scripts/collect_dynamo_debug_bundle.py