k8s-troubleshoot
Kubernetes Troubleshooting
Expert debugging and diagnostics for Kubernetes clusters using kubectl-mcp-server tools.
When to Apply
Use this skill when:
- User mentions: "debug", "troubleshoot", "diagnose", "failing", "crash", "not starting", "broken"
- Pod states: Pending, CrashLoopBackOff, ImagePullBackOff, OOMKilled, Error, Unknown
- Node issues: NotReady, MemoryPressure, DiskPressure, NetworkUnavailable, PIDPressure
- Keywords: "logs", "events", "describe", "why isn't working", "stuck", "not responding"
Priority Rules
| Priority | Rule | Impact | Tools |
|---|---|---|---|
| 1 | Check pod status first | CRITICAL | |
| 2 | View recent events | CRITICAL | |
| 3 | Inspect logs (including previous) | HIGH | |
| 4 | Check resource metrics | HIGH | |
| 5 | Verify endpoints | MEDIUM | |
| 6 | Review network policies | MEDIUM | |
| 7 | Examine node status | LOW | |
Quick Reference
| Symptom | First Tool | Next Steps |
|---|---|---|
| Pod Pending | | Check events, node capacity, resource requests |
| CrashLoopBackOff | | Check exit code, resources, liveness probes |
| ImagePullBackOff | | Verify image name, registry auth, network |
| OOMKilled | | Increase memory limits, check for memory leaks |
| ContainerCreating | | Check PVC binding, secrets, configmaps |
| Terminating (stuck) | | Check finalizers, PDBs, preStop hooks |
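The table above can be encoded as a simple lookup so that an observed symptom maps straight to its follow-up checks. This is a minimal sketch: `NEXT_STEPS` and `next_steps` are hypothetical names used for illustration and are not part of kubectl-mcp-server.

```python
# Hypothetical lookup built from the Quick Reference table above.
NEXT_STEPS = {
    "Pending": ["events", "node capacity", "resource requests"],
    "CrashLoopBackOff": ["exit code", "resources", "liveness probes"],
    "ImagePullBackOff": ["image name", "registry auth", "network"],
    "OOMKilled": ["memory limits", "memory leaks"],
    "ContainerCreating": ["PVC binding", "secrets", "configmaps"],
    "Terminating": ["finalizers", "PDBs", "preStop hooks"],
}


def next_steps(symptom: str) -> list[str]:
    """Return the follow-up checks for a symptom, or an empty list if unknown."""
    return NEXT_STEPS.get(symptom, [])
```

An unknown symptom returns an empty list rather than raising, so the caller can fall back to the general priority rules.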
Diagnostic Workflows
Pod Not Starting
1. get_pods(namespace, label_selector) - Get pod status
2. describe_pod(name, namespace) - See events and conditions
3. get_events(namespace, field_selector="involvedObject.name=<pod>") - Check events
4. get_pod_logs(name, namespace, previous=True) - For crash loops
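The four steps can be chained in a small driver. The sketch below assumes the tools are reachable as Python callables collected in a `tools` mapping; that mapping and the `diagnose_pod` helper are hypothetical, since how the MCP tools are actually invoked depends on the client.

```python
from typing import Any, Callable, Dict, Optional


def diagnose_pod(
    tools: Dict[str, Callable[..., Any]],
    name: str,
    namespace: str,
    label_selector: Optional[str] = None,
) -> Dict[str, Any]:
    """Run the four steps in order and collect each result under its tool name.

    `tools` is an assumed mapping of tool name -> callable, not a documented API.
    """
    report: Dict[str, Any] = {}
    # Step 1: overall pod status in the namespace.
    report["get_pods"] = tools["get_pods"](
        namespace=namespace, label_selector=label_selector
    )
    # Step 2: events and conditions for the specific pod.
    report["describe_pod"] = tools["describe_pod"](name=name, namespace=namespace)
    # Step 3: events filtered to this pod only.
    report["get_events"] = tools["get_events"](
        namespace=namespace, field_selector=f"involvedObject.name={name}"
    )
    # Step 4: logs from the previous container instance, useful for crash loops.
    report["get_pod_logs"] = tools["get_pod_logs"](
        name=name, namespace=namespace, previous=True
    )
    return report
```

Passing the tools in as a mapping keeps the ordering logic testable with stubs and independent of any particular client.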
Common Pod States
| State | Likely Cause | Tools to Use |
|---|---|---|
| Pending | Scheduling issues | |
| ImagePullBackOff | Registry/auth | |
| CrashLoopBackOff | App crash | |
| OOMKilled | Memory limit | |
| ContainerCreating | Volume/network | |
Node Issues
1. get_nodes() - List nodes and status
2. describe_node(name) - See conditions and capacity
3. Check: Ready, MemoryPressure, DiskPressure, PIDPressure
4. node_logs_tool(name, "kubelet") - Kubelet logs
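Step 3 amounts to a simple rule: Ready should be "True" and every pressure condition should be "False". The sketch below assumes the conditions have already been extracted into a list of dicts shaped like the Kubernetes Node `status.conditions` entries; the output format of describe_node itself is not specified here, and `unhealthy_conditions` is a hypothetical helper.

```python
# Conditions that indicate a problem when their status is anything but "False".
PRESSURE_CONDITIONS = {"MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable"}


def unhealthy_conditions(conditions: list[dict]) -> list[str]:
    """Return 'Type=Status' strings for every node condition that looks unhealthy."""
    problems = []
    for cond in conditions:
        ctype, status = cond.get("type"), cond.get("status")
        if ctype == "Ready" and status != "True":
            # Ready may be "False" or "Unknown"; both mean the node is not usable.
            problems.append(f"Ready={status}")
        elif ctype in PRESSURE_CONDITIONS and status != "False":
            problems.append(f"{ctype}={status}")
    return problems
```

An empty result means none of the listed conditions is raised; kubelet logs (step 4) are mainly worth pulling when the list is non-empty.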
Deep Debugging Workflows
CrashLoopBackOff Investigation
1. get_pod_logs(name, namespace, previous=True) - See why it crashed
2. describe_pod(name, namespace) - Check resource limits, probes
3. get_pod_metrics(name, namespace) - Memory/CPU at crash time
4. If OOM: compare requests/limits to actual usage
5. If app error: check logs for stack trace
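The branch in steps 4 and 5 can be sketched as a small check. It assumes the termination reason and exit code have been read from the container's last terminated state, and that memory usage and limit are already converted to bytes. The `looks_like_oom` helper and its 0.9 threshold are illustrative assumptions, not values from the tool.

```python
def looks_like_oom(
    reason: str,
    exit_code: int,
    usage_bytes: int,
    limit_bytes: int,
    threshold: float = 0.9,  # assumed cut-off for "close to the limit"
) -> bool:
    """Guess whether a crash was memory-related rather than an application error."""
    if reason == "OOMKilled":
        # Kubernetes reports this reason explicitly when the kernel OOM killer fires.
        return True
    # Exit code 137 is 128 + SIGKILL(9); treat it as OOM only if usage was near the limit.
    near_limit = limit_bytes > 0 and usage_bytes / limit_bytes >= threshold
    return exit_code == 137 and near_limit
```

When this returns False, the crash is more likely an application error, and the previous container logs are the place to look for a stack trace.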
Networking Issues
1. get_services(namespace) - Verify service exists
2. get_endpoints(namespace) - Check endpoint backends
3. If empty endpoints: pods don't match selector
4. get_network_policies(namespace) - Check traffic rules
5. For Cilium: cilium_endpoints_list_tool(), hubble_flows_query_tool()
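Step 3 relies on how a Service selects its backends: a pod is included only if its labels contain every key/value pair in the Service selector. The sketch below assumes the selector and pod labels have already been pulled out as plain dicts; `selector_matches` and `matching_pods` are hypothetical helper names.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """True if every selector key/value pair is present in the pod's labels."""
    # A Service without a selector gets no automatically managed endpoints.
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def matching_pods(selector: dict, pods: dict) -> list[str]:
    """Return names of pods (name -> labels mapping) that the selector would pick."""
    return [name for name, labels in pods.items() if selector_matches(selector, labels)]
```

If `matching_pods` returns an empty list for a Service with empty endpoints, the selector or the pod labels are the likely culprit; if it returns pods, check their readiness instead, since unready pods are not served as backends.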
Storage Problems
1. get_pvc(namespace) - Check PVC status
2. describe_pvc(name, namespace) - See binding issues
3. get_storage_classes() - Verify provisioner exists
4. If Pending: check storage class, access modes
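The storage class part of step 4 can be sketched as follows. It assumes the PVC has been reduced to a dict with `phase` and `storageClassName` keys and that the storage class names are available as a list; both shapes and the `pending_pvc_hints` helper are assumptions for illustration.

```python
def pending_pvc_hints(pvc: dict, storage_classes: list[str]) -> list[str]:
    """Return hints explaining why a PVC might be stuck in Pending."""
    if pvc.get("phase") != "Pending":
        return []
    hints = []
    sc = pvc.get("storageClassName")
    if sc is None:
        # Without an explicit class, binding depends on a default storage class.
        hints.append("no storageClassName set; a default storage class is required")
    elif sc not in storage_classes:
        hints.append(f"storage class {sc!r} does not exist")
    return hints
```

Whether the requested access modes are supported depends on the provisioner behind the storage class, so that check is left to the describe_pvc events.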
DNS Resolution
1. kubectl_exec(pod, namespace, "nslookup kubernetes.default") - Test DNS
2. If fails: check coredns pods in kube-system
3. get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
4. get_pod_logs(name="coredns-*", namespace="kube-system")
Multi-Cluster Debugging
All tools support a context parameter for targeting different clusters:
get_pods(namespace="kube-system", context="production-cluster")
get_events(namespace="default", context="staging-cluster")
describe_pod(name="myapp-xyz", namespace="prod", context="prod-east")
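To compare the same namespace across several clusters, the calls above can be looped over a list of contexts. This sketch assumes the tool is available as a Python callable passed in as `get_pods`; the `pods_by_context` helper is hypothetical.

```python
from typing import Any, Callable, Dict, Iterable


def pods_by_context(
    get_pods: Callable[..., Any],
    namespace: str,
    contexts: Iterable[str],
) -> Dict[str, Any]:
    """Call get_pods once per cluster context and key the results by context name."""
    return {ctx: get_pods(namespace=namespace, context=ctx) for ctx in contexts}
```

Keying results by context makes it easy to spot a cluster whose output differs from the others.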
Diagnostic Scripts
For comprehensive diagnostics, run the bundled scripts:
- See scripts/diagnose-pod.py for automated pod analysis
- See scripts/health-check.sh for cluster health checks
Decision Tree
See references/DECISION-TREE.md for visual troubleshooting flowcharts.
Common Errors Reference
See references/COMMON-ERRORS.md for error message explanations and fixes.
Related Tools
Core Diagnostics
- get_pods, describe_pod, get_pod_logs, get_pod_metrics
- get_events, get_nodes, describe_node
- get_resource_usage, compare_namespaces
Advanced (Ecosystem)
- Cilium: cilium_endpoints_list_tool, hubble_flows_query_tool
- Istio: istio_proxy_status_tool, istio_analyze_tool
Related Skills
- k8s-diagnostics - Metrics and health checks
- k8s-incident - Emergency runbooks
- k8s-networking - Network troubleshooting