k8s-troubleshoot
Debug Kubernetes pods, nodes, and workloads. Use when pods are failing, containers crash, nodes are unhealthy, or users mention debugging, troubleshooting, or diagnosing Kubernetes issues.
NPX Install
npx skill4agent add rohitg00/kubectl-mcp-server k8s-troubleshoot
Kubernetes Troubleshooting
Expert debugging and diagnostics for Kubernetes clusters using kubectl-mcp-server tools.
When to Apply
Use this skill when:
- User mentions: "debug", "troubleshoot", "diagnose", "failing", "crash", "not starting", "broken"
- Pod states: Pending, CrashLoopBackOff, ImagePullBackOff, OOMKilled, Error, Unknown
- Node issues: NotReady, MemoryPressure, DiskPressure, NetworkUnavailable, PIDPressure
- Keywords: "logs", "events", "describe", "why isn't working", "stuck", "not responding"
Priority Rules
| Priority | Rule | Impact | Tools |
|---|---|---|---|
| 1 | Check pod status first | CRITICAL | |
| 2 | View recent events | CRITICAL | |
| 3 | Inspect logs (including previous) | HIGH | |
| 4 | Check resource metrics | HIGH | |
| 5 | Verify endpoints | MEDIUM | |
| 6 | Review network policies | MEDIUM | |
| 7 | Examine node status | LOW | |
Quick Reference
| Symptom | First Tool | Next Steps |
|---|---|---|
| Pod Pending | | Check events, node capacity, resource requests |
| CrashLoopBackOff | | Check exit code, resources, liveness probes |
| ImagePullBackOff | | Verify image name, registry auth, network |
| OOMKilled | | Increase memory limits, check for memory leaks |
| ContainerCreating | | Check PVC binding, secrets, configmaps |
| Terminating (stuck) | | Check finalizers, PDBs, preStop hooks |
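For scripted triage, the table above can be kept as a plain lookup. The sketch below is only an illustration: `NEXT_STEPS` and `next_steps` are hypothetical names, not part of kubectl-mcp-server, and the entries are copied from the Next Steps column.

```python
from typing import Dict, List

# Hypothetical lookup table; entries copied from the Quick Reference table.
NEXT_STEPS: Dict[str, List[str]] = {
    "Pod Pending": ["events", "node capacity", "resource requests"],
    "CrashLoopBackOff": ["exit code", "resources", "liveness probes"],
    "ImagePullBackOff": ["image name", "registry auth", "network"],
    "OOMKilled": ["memory limits", "memory leaks"],
    "ContainerCreating": ["PVC binding", "secrets", "configmaps"],
    "Terminating (stuck)": ["finalizers", "PDBs", "preStop hooks"],
}


def next_steps(symptom: str) -> List[str]:
    """Return the checks for a symptom, or an empty list if it is not in the table."""
    return NEXT_STEPS.get(symptom, [])
```

An unknown symptom returns an empty list rather than raising, so a caller can fall back to the general priority order.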
Diagnostic Workflows
Pod Not Starting
1. get_pods(namespace, label_selector) - Get pod status
2. describe_pod(name, namespace) - See events and conditions
3. get_events(namespace, field_selector="involvedObject.name=<pod>") - Check events
4. get_pod_logs(name, namespace, previous=True) - For crash loops
Common Pod States
| State | Likely Cause | Tools to Use |
|---|---|---|
| Pending | Scheduling issues | |
| ImagePullBackOff | Registry/auth | |
| CrashLoopBackOff | App crash | |
| OOMKilled | Memory limit | |
| ContainerCreating | Volume/network | |
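The states in this table can usually be read straight from a pod object. The following is a minimal sketch, assuming the pod is available as a dict in the shape `kubectl get pod -o json` returns. `pod_symptom` is a hypothetical helper, and whether the kubectl-mcp-server tools return this exact shape is an assumption.

```python
from typing import Dict


def pod_symptom(pod: Dict) -> str:
    """Map a pod object (kubectl JSON shape, assumed) to one of the states above."""
    status = pod.get("status", {})
    for cs in status.get("containerStatuses", []):
        # The previous termination reason reveals OOM kills even while the
        # container is currently waiting in CrashLoopBackOff.
        terminated = cs.get("lastState", {}).get("terminated")
        if terminated and terminated.get("reason") == "OOMKilled":
            return "OOMKilled"
        # Waiting reasons include CrashLoopBackOff, ImagePullBackOff
        # and ContainerCreating.
        waiting = cs.get("state", {}).get("waiting")
        if waiting and waiting.get("reason"):
            return waiting["reason"]
    # No container-level signal: fall back to the pod phase (e.g. Pending).
    return status.get("phase", "Unknown")
```

Checking `lastState` before `state` is a deliberate choice here: a container that was OOM-killed and is now backing off would otherwise be reported only as CrashLoopBackOff.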
Node Issues
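The condition check in the steps that follow amounts to comparing each node condition with its healthy value. A minimal sketch, assuming a node dict in the shape `kubectl get node -o json` returns. `HEALTHY` and `unhealthy_conditions` are hypothetical names, not kubectl-mcp-server tools.

```python
from typing import Dict, List

# Healthy value for each condition: Ready should be "True",
# the pressure and availability conditions should be "False".
HEALTHY = {
    "Ready": "True",
    "MemoryPressure": "False",
    "DiskPressure": "False",
    "PIDPressure": "False",
    "NetworkUnavailable": "False",
}


def unhealthy_conditions(node: Dict) -> List[str]:
    """Return the condition types whose status differs from the healthy value."""
    bad = []
    for cond in node.get("status", {}).get("conditions", []):
        expected = HEALTHY.get(cond.get("type"))
        # A status of "Unknown" also counts as unhealthy.
        if expected is not None and cond.get("status") != expected:
            bad.append(cond["type"])
    return bad
```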
1. get_nodes() - List nodes and status
2. describe_node(name) - See conditions and capacity
3. Check: Ready, MemoryPressure, DiskPressure, PIDPressure
4. node_logs_tool(name, "kubelet") - Kubelet logs
Deep Debugging Workflows
CrashLoopBackOff Investigation
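When the crash turns out to be an OOM kill, this investigation ends with comparing memory usage against the configured limit. A rough sketch of that comparison, assuming both values arrive as Kubernetes quantity strings such as `480Mi`. `memory_bytes` and `usage_ratio` are hypothetical helpers and handle only the common suffixes.

```python
# Multipliers for the common Kubernetes quantity suffixes (not exhaustive).
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def memory_bytes(quantity: str) -> int:
    """Convert a quantity string such as '512Mi' or '1G' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    # No recognised suffix: treat the value as plain bytes.
    return int(float(quantity))


def usage_ratio(usage: str, limit: str) -> float:
    """Fraction of the memory limit in use; values near 1.0 suggest an OOM risk."""
    return memory_bytes(usage) / memory_bytes(limit)
```

A ratio that sits close to 1.0 before each restart points toward raising the limit or looking for a leak, while a low ratio suggests the crash has another cause.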
1. get_pod_logs(name, namespace, previous=True) - See why it crashed
2. describe_pod(name, namespace) - Check resource limits, probes
3. get_pod_metrics(name, namespace) - Memory/CPU at crash time
4. If OOM: compare requests/limits to actual usage
5. If app error: check logs for stack trace
Networking Issues
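Empty endpoints usually mean the Service selector matches no pod labels. A minimal sketch of that comparison, assuming Service and pod dicts in the shape `kubectl get ... -o json` returns. `selector_matches` and `matching_pods` are hypothetical helpers, not kubectl-mcp-server tools.

```python
from typing import Dict, List


def selector_matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """True if every key/value pair in the selector is present in the labels."""
    # A Service without a selector does not select any pods on its own.
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def matching_pods(service: Dict, pods: List[Dict]) -> List[str]:
    """Return the names of pods whose labels satisfy the Service selector."""
    selector = service.get("spec", {}).get("selector") or {}
    return [
        pod["metadata"]["name"]
        for pod in pods
        if selector_matches(selector, pod.get("metadata", {}).get("labels") or {})
    ]
```

An empty result for a Service that should have backends points at a label or selector typo. Pods that match but are not Ready are also left out of the endpoints, which this sketch does not check.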
1. get_services(namespace) - Verify service exists
2. get_endpoints(namespace) - Check endpoint backends
3. If empty endpoints: pods don't match selector
4. get_network_policies(namespace) - Check traffic rules
5. For Cilium: cilium_endpoints_list_tool(), hubble_flows_query_tool()
Storage Problems
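A Pending PVC is most often traced to its storage class or access modes. The sketch below collects what to look at, assuming a PVC dict in the shape `kubectl get pvc -o json` returns and a list of StorageClass names gathered separately. `pvc_hints` is a hypothetical helper, and its messages are illustrative only.

```python
from typing import Dict, List


def pvc_hints(pvc: Dict, storage_class_names: List[str]) -> List[str]:
    """Return things to check for a Pending PVC; empty list if it is not Pending."""
    if pvc.get("status", {}).get("phase") != "Pending":
        return []
    spec = pvc.get("spec", {})
    hints = []
    sc = spec.get("storageClassName")
    if sc is None:
        hints.append("no storageClassName set: check that a default StorageClass exists")
    elif sc == "":
        hints.append("empty storageClassName: only pre-provisioned PVs without a class can bind")
    elif sc not in storage_class_names:
        hints.append(f"StorageClass {sc!r} not found")
    # Access modes must be supported by the backing volume type.
    modes = spec.get("accessModes", [])
    hints.append("requested access modes: " + ", ".join(modes))
    return hints
```

A PVC can also stay Pending by design when its StorageClass delays binding until a pod is scheduled, so the hints are a starting point rather than a verdict.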
1. get_pvc(namespace) - Check PVC status
2. describe_pvc(name, namespace) - See binding issues
3. get_storage_classes() - Verify provisioner exists
4. If Pending: check storage class, access modes
DNS Resolution
1. kubectl_exec(pod, namespace, "nslookup kubernetes.default") - Test DNS
2. If fails: check coredns pods in kube-system
3. get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
4. get_pod_logs(name="coredns-*", namespace="kube-system")
Multi-Cluster Debugging
All tools support a `context` parameter for targeting different clusters:
```python
get_pods(namespace="kube-system", context="production-cluster")
get_events(namespace="default", context="staging-cluster")
describe_pod(name="myapp-xyz", namespace="prod", context="prod-east")
```
Diagnostic Scripts
For comprehensive diagnostics, run the bundled scripts:
- See scripts/diagnose-pod.py for automated pod analysis
- See scripts/health-check.sh for cluster health checks
Decision Tree
See references/DECISION-TREE.md for visual troubleshooting flowcharts.
Common Errors Reference
See references/COMMON-ERRORS.md for error message explanations and fixes.
Related Tools
Core Diagnostics
- get_pods, describe_pod, get_pod_logs, get_pod_metrics
- get_events, get_nodes, describe_node
- get_resource_usage, compare_namespaces
Advanced (Ecosystem)
- Cilium: cilium_endpoints_list_tool, hubble_flows_query_tool
- Istio: istio_proxy_status_tool, istio_analyze_tool
Related Skills
- k8s-diagnostics - Metrics and health checks
- k8s-incident - Emergency runbooks
- k8s-networking - Network troubleshooting