k8s-troubleshoot

Original🇺🇸 English
Translated
2 scripts

Debug Kubernetes pods, nodes, and workloads. Use when pods are failing, containers crash, nodes are unhealthy, or users mention debugging, troubleshooting, or diagnosing Kubernetes issues.

6installs
Added on

NPX Install

npx skill4agent add rohitg00/kubectl-mcp-server k8s-troubleshoot

Kubernetes Troubleshooting

Expert debugging and diagnostics for Kubernetes clusters using kubectl-mcp-server tools.

When to Apply

Use this skill when:
  • User mentions: "debug", "troubleshoot", "diagnose", "failing", "crash", "not starting", "broken"
  • Pod states: Pending, CrashLoopBackOff, ImagePullBackOff, OOMKilled, Error, Unknown
  • Node issues: NotReady, MemoryPressure, DiskPressure, NetworkUnavailable, PIDPressure
  • Keywords: "logs", "events", "describe", "why isn't working", "stuck", "not responding"

Priority Rules

PriorityRuleImpactTools
1Check pod status firstCRITICAL
get_pods
,
describe_pod
2View recent eventsCRITICAL
get_events
3Inspect logs (including previous)HIGH
get_pod_logs
4Check resource metricsHIGH
get_pod_metrics
5Verify endpointsMEDIUM
get_endpoints
6Review network policiesMEDIUM
get_network_policies
7Examine node statusLOW
get_nodes
,
describe_node

Quick Reference

SymptomFirst ToolNext Steps
Pod Pending
describe_pod
Check events, node capacity, resource requests
CrashLoopBackOff
get_pod_logs(previous=True)
Check exit code, resources, liveness probes
ImagePullBackOff
describe_pod
Verify image name, registry auth, network
OOMKilled
get_pod_metrics
Increase memory limits, check for memory leaks
ContainerCreating
describe_pod
Check PVC binding, secrets, configmaps
Terminating (stuck)
describe_pod
Check finalizers, PDBs, preStop hooks

Diagnostic Workflows

Pod Not Starting

1. get_pods(namespace, label_selector) - Get pod status
2. describe_pod(name, namespace) - See events and conditions
3. get_events(namespace, field_selector="involvedObject.name=<pod>") - Check events
4. get_pod_logs(name, namespace, previous=True) - For crash loops

Common Pod States

StateLikely CauseTools to Use
PendingScheduling issues
describe_pod
,
get_nodes
,
get_events
ImagePullBackOffRegistry/auth
describe_pod
, check image name
CrashLoopBackOffApp crash
get_pod_logs(previous=True)
OOMKilledMemory limit
get_pod_metrics
, adjust limits
ContainerCreatingVolume/network
describe_pod
,
get_pvc

Node Issues

1. get_nodes() - List nodes and status
2. describe_node(name) - See conditions and capacity
3. Check: Ready, MemoryPressure, DiskPressure, PIDPressure
4. node_logs_tool(name, "kubelet") - Kubelet logs

Deep Debugging Workflows

CrashLoopBackOff Investigation

1. get_pod_logs(name, namespace, previous=True) - See why it crashed
2. describe_pod(name, namespace) - Check resource limits, probes
3. get_pod_metrics(name, namespace) - Memory/CPU at crash time
4. If OOM: compare requests/limits to actual usage
5. If app error: check logs for stack trace

Networking Issues

1. get_services(namespace) - Verify service exists
2. get_endpoints(namespace) - Check endpoint backends
3. If empty endpoints: pods don't match selector
4. get_network_policies(namespace) - Check traffic rules
5. For Cilium: cilium_endpoints_list_tool(), hubble_flows_query_tool()

Storage Problems

1. get_pvc(namespace) - Check PVC status
2. describe_pvc(name, namespace) - See binding issues
3. get_storage_classes() - Verify provisioner exists
4. If Pending: check storage class, access modes

DNS Resolution

1. kubectl_exec(pod, namespace, "nslookup kubernetes.default") - Test DNS
2. If fails: check coredns pods in kube-system
3. get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
4. get_pod_logs(name="coredns-*", namespace="kube-system")

Multi-Cluster Debugging

All tools support
context
parameter for targeting different clusters:
python
get_pods(namespace="kube-system", context="production-cluster")
get_events(namespace="default", context="staging-cluster")
describe_pod(name="myapp-xyz", namespace="prod", context="prod-east")

Diagnostic Scripts

For comprehensive diagnostics, run the bundled scripts:
  • See scripts/diagnose-pod.py for automated pod analysis
  • See scripts/health-check.sh for cluster health checks

Decision Tree

See references/DECISION-TREE.md for visual troubleshooting flowcharts.

Common Errors Reference

See references/COMMON-ERRORS.md for error message explanations and fixes.

Related Tools

Core Diagnostics

  • get_pods
    ,
    describe_pod
    ,
    get_pod_logs
    ,
    get_pod_metrics
  • get_events
    ,
    get_nodes
    ,
    describe_node
  • get_resource_usage
    ,
    compare_namespaces

Advanced (Ecosystem)

  • Cilium:
    cilium_endpoints_list_tool
    ,
    hubble_flows_query_tool
  • Istio:
    istio_proxy_status_tool
    ,
    istio_analyze_tool

Related Skills

  • k8s-diagnostics - Metrics and health checks
  • k8s-incident - Emergency runbooks
  • k8s-networking - Network troubleshooting