Diagnose failed or unhealthy Dynamo deployments. Use when pods, model-cache jobs, PVCs, workers, frontend/router health, endpoints, or benchmark jobs fail; use recipe-runner/router-starter before this for normal bring-up.
    npx skill4agent add nvidia/skills dynamo-troubleshoot

`kubectl`, `DynamoGraphDeployment`

    python3 scripts/collect_dynamo_debug_bundle.py \
      --namespace "${NAMESPACE}"

    python3 scripts/collect_dynamo_debug_bundle.py \
      --namespace "${NAMESPACE}" \
      --deployment-name <deployment-name>

`references/failure-decision-tree.md`, `DynamoGraphDeployment`, `describe pod`, `/v1/models`, `/v1/chat/completions`, `storageClassName`

| Script | Purpose | Arguments |
|---|---|---|
| `scripts/collect_dynamo_debug_bundle.py` | Collect a read-only debug bundle (pods, events, jobs, PVCs, CR status) | `--namespace`, `--deployment-name` |
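
The collector's two flags can also be assembled programmatically. The following is a minimal sketch, not part of the skill itself: `build_bundle_command` and `collect_bundle` are hypothetical helper names, and the sketch assumes `--namespace` is always passed while `--deployment-name` is optional, as the invocations in this document suggest.

```python
import subprocess
from typing import List, Optional

# Script path as given in this document; run from the skill's root directory.
SCRIPT = "scripts/collect_dynamo_debug_bundle.py"


def build_bundle_command(namespace: str, deployment_name: Optional[str] = None) -> List[str]:
    """Build the argv list for the debug-bundle collector (hypothetical helper)."""
    if not namespace:
        # Assumption: the collector always needs a namespace.
        raise ValueError("namespace is required")
    cmd = ["python3", SCRIPT, "--namespace", namespace]
    if deployment_name:
        # Narrow the bundle to a single deployment when a name is given.
        cmd += ["--deployment-name", deployment_name]
    return cmd


def collect_bundle(namespace: str, deployment_name: Optional[str] = None) -> None:
    """Run the collector and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_bundle_command(namespace, deployment_name), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try anywhere.
    print(" ".join(build_bundle_command("dynamo-demo", "qwen-vllm-disagg")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems if a namespace or deployment name ever contains unusual characters.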
`run_script()`

    run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo"])

    python3 scripts/collect_dynamo_debug_bundle.py --namespace dynamo-demo

    python3 scripts/collect_dynamo_debug_bundle.py \
      --namespace dynamo-demo \
      --deployment-name qwen-vllm-disagg

    run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo", "--deployment-name", "qwen-vllm-disagg"])

`--deployment-name`, `dynamo-interconnect-check`

| Symptom | Likely cause | Next step |
|---|---|---|
| | Service account lacks read RBAC | Ask operator for read-only role binding on the namespace |
| Bundle missing | Operator not installed or different namespace | Verify |
| Model-download job in | PVC unbound or HF secret missing | Fix PVC binding or create the named HF secret, then rerun the job |
| Worker pods | Image/runtime mismatch or GPU not available | Inspect container logs; check |
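
After working through the symptoms above, one way to confirm that the frontend is serving again is to query the `/v1/models` path this skill refers to. The sketch below is illustrative only: `models_url` and `list_models` are hypothetical helpers, the base URL `http://localhost:8000` is an assumption (for example, a local port-forward to the frontend), and no particular response schema is assumed.

```python
import json
import urllib.error
import urllib.request


def models_url(base_url: str) -> str:
    """Join a frontend base URL with the /v1/models path."""
    return base_url.rstrip("/") + "/v1/models"


def list_models(base_url: str, timeout: float = 5.0):
    """Return the parsed /v1/models response, or None if the request fails."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or a non-JSON body.
        return None


if __name__ == "__main__":
    # Assumed address; replace with wherever the frontend is actually reachable.
    result = list_models("http://localhost:8000")
    print("frontend unreachable" if result is None else result)
```

A `None` result only says the request failed; the pod events and container logs in the debug bundle are still the place to find out why.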
`BENCHMARK.md`, `/nvskills-ci`, `references/failure-decision-tree.md`, `scripts/collect_dynamo_debug_bundle.py`