Dynamo Troubleshoot
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: CC-BY-4.0
-->
Purpose
Turn a Dynamo failure into a clear problem class, strongest signal, and next
action. Start with read-only evidence, avoid secrets, and fix one layer at a
time.
Prerequisites
- Python 3.10+ on the operator machine.
- kubectl configured with read access to the target namespace.
- Permission to read pods, events, jobs, PVCs, and DynamoGraphDeployment resources (NOT secrets).
- Network reachability to the cluster API server.
Instructions
1. Collect A Read-Only Bundle
Run:
bash
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace "${NAMESPACE}"
If the user names a deployment, include it:
bash
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace "${NAMESPACE}" \
--deployment-name <deployment-name>
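The two invocations above differ only in the optional deployment flag. A minimal sketch of assembling that argument list programmatically follows; the helper name `build_bundle_command` is hypothetical, and only the script path and the two flags shown above are taken from this document.

```python
def build_bundle_command(namespace, deployment_name=None):
    """Return an argv list for the read-only bundle collector.

    Hypothetical helper: it only mirrors the two command lines shown above.
    """
    if not namespace:
        raise ValueError("namespace is required")
    command = [
        "python3",
        "scripts/collect_dynamo_debug_bundle.py",
        "--namespace",
        namespace,
    ]
    # Add the deployment scope only when the user named one.
    if deployment_name:
        command += ["--deployment-name", deployment_name]
    return command


if __name__ == "__main__":
    print(build_bundle_command("dynamo-demo", "qwen-vllm-disagg"))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on the operator machine.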
Do not collect Kubernetes secrets. Do not print Hugging Face tokens.
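As a safeguard for the rule above, collected text can be scrubbed before it is shown to anyone. The sketch below assumes Hugging Face tokens carry the `hf_` prefix followed by alphanumeric characters; the pattern and the helper name `redact_hf_tokens` are illustrative and not part of the bundle script.

```python
import re

# Assumption: tokens look like "hf_" followed by at least 8 alphanumerics.
HF_TOKEN_PATTERN = re.compile(r"\bhf_[A-Za-z0-9]{8,}\b")


def redact_hf_tokens(text):
    """Replace anything that looks like a Hugging Face token with a placeholder."""
    return HF_TOKEN_PATTERN.sub("hf_[REDACTED]", text)


if __name__ == "__main__":
    print(redact_hf_tokens("env HF_TOKEN=hf_abcdEFGH1234 set on worker"))
```

Redacting on output is a second line of defense only; the primary rule remains not reading secrets in the first place.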
2. Classify The Failure
Use references/failure-decision-tree.md and classify into one primary bucket:
- cluster/platform
- namespace/secret
- model cache/PVC/download
- image pull/runtime image
- GPU scheduling/resources
- operator/DynamoGraphDeployment reconciliation
- frontend/router
- worker/backend
- endpoint/API
- benchmark/perf job
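To keep classifications consistent across reports, the bucket names above can be held in one place and validated. This is a minimal sketch; the names `FAILURE_BUCKETS` and `normalize_bucket` are hypothetical, and the decision logic itself stays in references/failure-decision-tree.md.

```python
# The ten primary buckets, copied from the list above.
FAILURE_BUCKETS = (
    "cluster/platform",
    "namespace/secret",
    "model cache/PVC/download",
    "image pull/runtime image",
    "GPU scheduling/resources",
    "operator/DynamoGraphDeployment reconciliation",
    "frontend/router",
    "worker/backend",
    "endpoint/API",
    "benchmark/perf job",
)


def normalize_bucket(label):
    """Return the canonical bucket name, ignoring case and outer whitespace."""
    cleaned = label.strip().lower()
    for bucket in FAILURE_BUCKETS:
        if bucket.lower() == cleaned:
            return bucket
    raise ValueError(f"unknown failure bucket: {label!r}")


if __name__ == "__main__":
    print(normalize_bucket("  gpu scheduling/resources "))
```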
3. Debug Top Down
Check in this order:
- namespace, storage class, GPU nodes, and HF secret existence
- PVC and model-download job
- DynamoGraphDeployment status and events
- pod status, , and container logs
- frontend service and port-forward
- benchmark job only after endpoint smoke test passes
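The ordering above amounts to "stop at the first unhealthy layer". A minimal sketch of that rule follows, assuming each layer's health has already been determined from the bundle; the short layer keys and the function name are hypothetical labels for the six steps, not identifiers from the script.

```python
# Hypothetical short keys for the six steps above, in check order.
DEBUG_ORDER = (
    "prerequisites",    # namespace, storage class, GPU nodes, HF secret
    "storage",          # PVC and model-download job
    "resource_status",  # custom resource status and events
    "pods",             # pod status and container logs
    "frontend",         # frontend service and port-forward
    "benchmark",        # only after the endpoint smoke test passes
)


def first_failing_layer(results):
    """Return the first layer that is unhealthy or not yet checked.

    `results` maps a layer key to True (healthy) or False (unhealthy).
    Returns None when every layer is healthy.
    """
    for layer in DEBUG_ORDER:
        if not results.get(layer, False):
            return layer
    return None


if __name__ == "__main__":
    print(first_failing_layer({"prerequisites": True, "storage": False}))
```

Treating an unchecked layer as failing keeps the investigation from skipping ahead to pods or benchmarks before the lower layers are confirmed.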
4. Fix One Layer At A Time
Prefer the smallest reversible change:
- create missing namespace or HF secret
- patch
- patch image tag or image pull secret
- reduce the GPU request only if the recipe remains valid afterward
- switch KV router to approximate mode only if workers do not publish events
- restart failed jobs after fixing the underlying config
After each fix, rerun the relevant readiness check before moving deeper.
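The fix-then-recheck loop can be sketched as below. This models the operator's workflow under stated assumptions: each fix and each readiness check is a caller-supplied callable, and the name `apply_fixes_in_order` is hypothetical. Nothing here is executed by the skill itself.

```python
def apply_fixes_in_order(fixes):
    """Apply fixes one layer at a time, rechecking readiness after each.

    `fixes` is a list of (name, apply_fix, readiness_check) tuples, where
    both callables take no arguments and readiness_check returns a bool.
    Returns (verified_names, blocked_name); blocked_name is None when
    every readiness check passed.
    """
    verified = []
    for name, apply_fix, readiness_check in fixes:
        apply_fix()
        # Do not move deeper until this layer is confirmed healthy.
        if not readiness_check():
            return verified, name
        verified.append(name)
    return verified, None


if __name__ == "__main__":
    demo = [
        ("hf-secret", lambda: None, lambda: True),
        ("image-tag", lambda: None, lambda: False),
        ("restart-job", lambda: None, lambda: True),
    ]
    print(apply_fixes_in_order(demo))
```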
Available Scripts
| Script | Purpose | Arguments |
|---|---|---|
| scripts/collect_dynamo_debug_bundle.py | Collect a read-only debug bundle (pods, events, jobs, PVCs, CR status) | , , |
Invoke via the agentskills.io protocol:
python
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo"])
Examples
Collect everything in a namespace for triage:
bash
python3 scripts/collect_dynamo_debug_bundle.py --namespace dynamo-demo
Scope to a single failing deployment:
bash
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace dynamo-demo \
--deployment-name qwen-vllm-disagg
Equivalent through the agent protocol:
python
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo", "--deployment-name", "qwen-vllm-disagg"])
Output Contract
Return:
- problem class
- evidence checked
- strongest signal
- likely cause
- exact next command or patch
- what was ruled out
- whether it is safe to continue deployment or benchmarking
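One way to make the contract hard to violate is to give it a fixed shape. The sketch below maps the seven items above onto a dataclass; the class and field names are hypothetical, and the sample values are illustrative only, not output from a real cluster.

```python
from dataclasses import asdict, dataclass


@dataclass
class TriageReport:
    """Hypothetical container for the seven items in the output contract."""

    problem_class: str
    evidence_checked: list[str]
    strongest_signal: str
    likely_cause: str
    next_action: str          # exact next command or patch
    ruled_out: list[str]
    safe_to_continue: bool    # deployment or benchmarking may proceed


if __name__ == "__main__":
    report = TriageReport(
        problem_class="model cache/PVC/download",
        evidence_checked=["PVC status", "model-download job events"],
        strongest_signal="PVC is unbound",
        likely_cause="no matching storage class",
        next_action="fix PVC binding, then rerun the model-download job",
        ruled_out=["image pull/runtime image"],
        safe_to_continue=False,
    )
    print(asdict(report))
```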
Limitations
- Read-only. Never mutates the cluster; remediation commands are returned, not executed.
- Will not collect secrets or print Hugging Face tokens; some failure modes (auth) may need user-side inspection.
- Bundle size grows with deployment size; on very large namespaces, scope with --deployment-name.
- Does not validate disagg transport; use dynamo-interconnect-check for that.
Troubleshooting
| Symptom | Likely cause | Next step |
|---|---|---|
| kubectl returns Forbidden on events/pods | Service account lacks read RBAC | Ask operator for read-only role binding on the namespace |
| Bundle missing DynamoGraphDeployment status | Operator not installed or watching a different namespace | Verify operator is installed and watching the namespace |
| Model-download job in | PVC unbound or HF secret missing | Fix PVC binding or create the named HF secret, then rerun the job |
| Worker pods | Image/runtime mismatch or GPU not available | Inspect container logs; check allocatable on nodes |
Benchmark
See
for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run
on an upstream PR touching this skill.
References
- Read references/failure-decision-tree.md for bucket-specific checks.
- Use scripts/collect_dynamo_debug_bundle.py for read-only bundle collection.