k8s-debug
Kubernetes Debugging Skill
Overview
A systematic toolkit for debugging and troubleshooting Kubernetes clusters, pods, services, and deployments. It provides scripts, workflows, and reference guides for identifying and resolving common Kubernetes issues efficiently.
When to Use This Skill
Invoke this skill when encountering:
- Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
- Service connectivity or DNS resolution issues
- Network policy or ingress problems
- Volume and storage mount failures
- Deployment rollout issues
- Cluster health or performance degradation
- Resource exhaustion (CPU/memory)
- Configuration problems (ConfigMaps, Secrets, RBAC)
Debugging Workflow
Follow this systematic approach for any Kubernetes issue:
1. Identify the Problem Layer
Categorize the issue:
- Application Layer: Application crashes, errors, bugs
- Pod Layer: Pod not starting, restarting, or pending
- Service Layer: Network connectivity, DNS issues
- Node Layer: Node not ready, resource exhaustion
- Cluster Layer: Control plane issues, API problems
- Storage Layer: Volume mount failures, PVC issues
- Configuration Layer: ConfigMap, Secret, RBAC issues
2. Gather Diagnostic Information
Use the appropriate diagnostic script based on scope:
Pod-Level Diagnostics
Use scripts/pod_diagnostics.py for comprehensive pod analysis:
```bash
python3 scripts/pod_diagnostics.py <pod-name> -n <namespace>
```
This script gathers:
- Pod status and description
- Pod events
- Container logs (current and previous)
- Resource usage
- Node information
- YAML configuration
Output can be saved for analysis:
```bash
python3 scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt
```
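The source of scripts/pod_diagnostics.py is not shown here, so the following is only a rough manual equivalent, sketched under the assumption that the categories listed above map onto standard kubectl subcommands. The function name `collect_pod_diagnostics` and the `KUBECTL` override are hypothetical and not part of the skill.

```bash
# Hypothetical helper (not part of the skill): runs plain kubectl commands
# that roughly correspond to the data categories listed above.
# Set KUBECTL=echo for a dry run that only prints the arguments.
collect_pod_diagnostics() (
  pod=$1
  ns=${2:-default}
  k=${KUBECTL:-kubectl}

  "$k" get pod "$pod" -n "$ns" -o wide               # status, pod IP, node name
  "$k" describe pod "$pod" -n "$ns"                  # description and events
  "$k" logs "$pod" -n "$ns"                          # current container logs
  "$k" logs "$pod" -n "$ns" --previous || true       # previous logs, if the container restarted
  "$k" top pod "$pod" -n "$ns" --containers || true  # resource usage (requires metrics-server)
  "$k" get pod "$pod" -n "$ns" -o yaml               # full YAML configuration
)

# Usage: collect_pod_diagnostics <pod-name> <namespace> > diagnostics.txt
```

The `|| true` guards keep the sequence going when a pod has no previous container or the cluster has no metrics server.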
Cluster-Level Health Check
Use scripts/cluster_health.sh for overall cluster diagnostics:
```bash
./scripts/cluster_health.sh
```
This script checks:
- Cluster info and version
- Node status and resources
- Pods across all namespaces
- Failed/pending pods
- Recent events
- Deployments, services, statefulsets, daemonsets
- PVCs and PVs
- Component health
- Common error states (CrashLoopBackOff, ImagePullBackOff)
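One of the checks above, finding failed or pending pods, can be approximated by hand. The sketch below is not taken from cluster_health.sh; it assumes the usual column layout of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, ...), and the function name `unhealthy_pods` is hypothetical.

```bash
# Hypothetical filter: reads `kubectl get pods -A --no-headers` output on stdin
# and prints "namespace/name status" for every pod whose STATUS column
# (4th field) is neither Running nor Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage: kubectl get pods -A --no-headers | unhealthy_pods
```

This only looks at the STATUS column, so a pod that is Running but not ready (for example 0/1 in the READY column) would not be reported.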
Network Diagnostics
Use scripts/network_debug.sh for connectivity issues:
```bash
./scripts/network_debug.sh <namespace> <pod-name>
```
This script analyzes:
- Pod network configuration
- DNS setup and resolution
- Service endpoints
- Network policies
- Connectivity tests
- CoreDNS logs
3. Follow Issue-Specific Workflow
Based on the identified issue, consult references/troubleshooting_workflow.md for detailed workflows:
- Pod Pending: Resource/scheduling workflow
- CrashLoopBackOff: Application crash workflow
- ImagePullBackOff: Image pull workflow
- Service issues: Network connectivity workflow
- DNS failures: DNS troubleshooting workflow
- Resource exhaustion: Performance investigation workflow
- Storage issues: PVC binding workflow
- Deployment stuck: Rollout workflow
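For the pod-level entries, the mapping above can be expressed as a small lookup keyed on the STATUS value that `kubectl get pods` prints. This is an illustrative sketch only: `workflow_for_issue` is a hypothetical name, and treating `ErrImagePull` like `ImagePullBackOff` is an assumption added here rather than something the list states.

```bash
# Hypothetical lookup: maps a pod STATUS value to the workflow named in the
# list above. ErrImagePull is grouped with ImagePullBackOff as an assumption.
workflow_for_issue() {
  case "$1" in
    Pending)                       echo "Resource/scheduling workflow" ;;
    CrashLoopBackOff)              echo "Application crash workflow" ;;
    ImagePullBackOff|ErrImagePull) echo "Image pull workflow" ;;
    *)                             echo "No direct match; review pod events first" ;;
  esac
}

# Usage: workflow_for_issue CrashLoopBackOff
```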
4. Apply Targeted Fixes
Refer to references/common_issues.md for specific solutions to common problems.
Common Debugging Patterns
Pattern 1: Pod Not Starting
```bash
# Quick assessment
kubectl get pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

# Detailed diagnostics
python3 scripts/pod_diagnostics.py <pod-name> -n <namespace>

# Check common causes:
# - ImagePullBackOff: Verify image exists and credentials
# - CrashLoopBackOff: Check logs with --previous flag
# - Pending: Check node resources and scheduling
```
Pattern 2: Service Connectivity Issues
```bash
# Verify service and endpoints
kubectl get svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>

# Network diagnostics
./scripts/network_debug.sh <namespace> <pod-name>

# Test connectivity from debug pod
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
# Inside: curl <service-name>.<namespace>.svc.cluster.local:<port>

# Check network policies
kubectl get networkpolicies -n <namespace>
```
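Two steps in this pattern lend themselves to small helpers: building the in-cluster DNS name used in the curl test, and deciding whether a service has any endpoints at all. The sketch below uses hypothetical function names and assumes the default cluster domain `cluster.local` and the usual NAME / ENDPOINTS / AGE layout of `kubectl get endpoints --no-headers`, where an empty service shows `<none>`.

```bash
# Hypothetical helper: builds the service DNS name used in the curl test.
# Assumes the default cluster domain (cluster.local).
service_fqdn() {
  printf '%s.%s.svc.cluster.local\n' "$1" "${2:-default}"
}

# Hypothetical check: reads `kubectl get endpoints <service> --no-headers`
# on stdin and succeeds only if the ENDPOINTS column lists an address.
has_endpoints() {
  awk '$2 != "" && $2 != "<none>" { found = 1 } END { exit !found }'
}

# Usage:
#   kubectl get endpoints <service-name> -n <namespace> --no-headers | has_endpoints \
#     || echo "no endpoints: compare the service selector with the pod labels"
```

An empty ENDPOINTS column usually means the service selector matches no ready pods, so it is worth ruling that out before looking at network policies.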
Pattern 3: Application Performance Issues
```bash
# Check resource usage
kubectl top nodes
kubectl top pods -n <namespace> --containers

# Inspect the pod's configured resource requests and limits
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 resources

# Check for OOMKilled
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 lastState

# Review application logs
kubectl logs <pod-name> -n <namespace> --tail=100
```
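The OOMKilled check above relies on reading YAML by eye. A more targeted variant, sketched here with hypothetical function names, asks kubectl for just the last termination reason of each container through a JSONPath query and then tests for the `OOMKilled` reason.

```bash
# Hypothetical helper: prints the last termination reason of each container
# in the pod, space-separated (empty if no container has terminated).
last_termination_reasons() {
  "${KUBECTL:-kubectl}" get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Hypothetical check: succeeds if stdin contains OOMKilled as a whole word.
was_oom_killed() {
  grep -qw 'OOMKilled'
}

# Usage:
#   last_termination_reasons <pod-name> <namespace> | was_oom_killed \
#     && echo "container was OOM-killed: review its memory limit and usage"
```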
Pattern 4: Cluster Health Assessment
```bash
# Run comprehensive health check
./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt

# Review output for:
# - Node conditions and resource pressure
# - Failed or pending pods
# - Recent error events
# - Component health status
# - Resource quota usage
```
Essential Manual Commands
While scripts automate diagnostics, understand these core commands:
Pod Debugging
```bash
# View pod status
kubectl get pods -n <namespace> -o wide

# Detailed pod information
kubectl describe pod <pod-name> -n <namespace>

# View logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # Previous container
kubectl logs <pod-name> -n <namespace> -c <container>  # Specific container

# Execute commands in pod
kubectl exec <pod-name> -n <namespace> -it -- /bin/sh

# Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml
```
Service and Network Debugging
```bash
# Check services
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>

# Check endpoints
kubectl get endpoints -n <namespace>

# Test DNS
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default

# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```
Resource Monitoring
```bash
# Node resources
kubectl top nodes
kubectl describe nodes

# Pod resources
kubectl top pods -n <namespace>
kubectl top pod <pod-name> -n <namespace> --containers
```
Emergency Operations
```bash
# Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>

# Rollback deployment
kubectl rollout undo deployment/<name> -n <namespace>

# Force delete stuck pod
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

# Drain node (maintenance)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Cordon node (prevent scheduling)
kubectl cordon <node-name>
```
Reference Documentation
Detailed Troubleshooting Guides
Consult references/troubleshooting_workflow.md for:
- Step-by-step workflows for each issue type
- Decision trees for diagnosis
- Command sequences for systematic debugging
- Quick reference command cheat sheet
Common Issues Database
Consult references/common_issues.md for:
- Detailed explanations of each common issue
- Symptoms and causes
- Specific debugging steps
- Solutions and fixes
- Prevention strategies
Best Practices
Systematic Approach
- Observe: Gather facts before making changes
- Analyze: Use diagnostic scripts to collect comprehensive data
- Hypothesize: Form theory about root cause
- Test: Verify hypothesis with targeted commands
- Fix: Apply appropriate solution
- Verify: Confirm issue is resolved
- Document: Record findings for future reference
Data Collection
- Save diagnostic output to files for analysis
- Capture logs before restarting failing pods
- Record events timeline for incident reports
- Export resource metrics for trend analysis
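Capturing logs before a restart matters because a restarted pod's earlier output may no longer be retrievable. The sketch below shows one way to order the steps; the function name `capture_then_restart` and the `KUBECTL` and `OUTDIR` overrides are hypothetical, and the file naming simply reuses the timestamp format from the cluster health example.

```bash
# Hypothetical helper: saves current and previous logs plus the namespace
# event timeline to files, then restarts the deployment.
# KUBECTL and OUTDIR are illustrative overrides, not part of the skill.
capture_then_restart() (
  deploy=$1
  pod=$2
  ns=${3:-default}
  k=${KUBECTL:-kubectl}
  outdir=${OUTDIR:-.}
  stamp=$(date +%Y%m%d-%H%M%S)

  # Evidence first: logs and events are written to timestamped files.
  "$k" logs "$pod" -n "$ns" > "$outdir/$pod-$stamp.log" 2>&1
  "$k" logs "$pod" -n "$ns" --previous > "$outdir/$pod-$stamp.previous.log" 2>&1 || true
  "$k" get events -n "$ns" --sort-by='.lastTimestamp' > "$outdir/$ns-events-$stamp.txt"

  # Only then trigger the restart.
  "$k" rollout restart "deployment/$deploy" -n "$ns"
)

# Usage: capture_then_restart <deployment> <pod-name> <namespace>
```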
Prevention
- Set appropriate resource requests and limits
- Implement health checks (liveness/readiness probes)
- Use proper logging and monitoring
- Apply network policies incrementally
- Test changes in non-production environments
- Maintain documentation of cluster architecture
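To act on the first item, it helps to know which pods currently run without resource limits. The following is a sketch rather than a complete audit: `pods_without_limits` is a hypothetical name, and the JSONPath query in the usage comment prints a pod's limits as one combined field, so a pod is reported only when none of its containers sets a limit.

```bash
# Hypothetical filter: reads tab-separated "namespace/name<TAB>limits" lines
# on stdin and prints the pods whose limits field is empty.
pods_without_limits() {
  awk -F'\t' '$2 == "" { print $1 }'
}

# Usage:
#   kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits}{"\n"}{end}' \
#     | pods_without_limits
```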
Advanced Debugging Techniques
Debug Containers (Kubernetes 1.23+)
```bash
# Attach ephemeral debug container
kubectl debug <pod-name> -n <namespace> -it --image=nicolaka/netshoot

# Create debug copy of pod
kubectl debug <pod-name> -n <namespace> -it --copy-to=<debug-pod-name> --container=<container>
```
Port Forwarding for Testing
```bash
# Forward pod port to local machine
kubectl port-forward pod/<pod-name> -n <namespace> <local-port>:<pod-port>

# Forward service port
kubectl port-forward svc/<service-name> -n <namespace> <local-port>:<service-port>
```
Proxy for API Access
```bash
# Start kubectl proxy
kubectl proxy --port=8080

# Access API
curl http://localhost:8080/api/v1/namespaces/<namespace>/pods/<pod-name>
```
Custom Column Output
```bash
# Custom pod info
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP

# Node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```
Troubleshooting Checklist
Before escalating issues, verify:
- Reviewed pod events: kubectl describe pod
- Checked pod logs (current and previous)
- Verified resource availability on nodes
- Confirmed image exists and is accessible
- Validated service selectors match pod labels
- Tested DNS resolution from pods
- Checked network policies
- Reviewed recent cluster events
- Confirmed ConfigMaps/Secrets exist
- Validated RBAC permissions
- Checked for resource quotas/limits
- Reviewed cluster component health
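The RBAC item is one of the few checklist entries without a command elsewhere in this document. A minimal sketch, assuming the workload runs under a named service account and that the caller is allowed to impersonate it, uses `kubectl auth can-i`; the function name `can_service_account` is hypothetical.

```bash
# Hypothetical helper: asks the API server whether a service account may
# perform a verb on a resource in a namespace. Prints "yes" or "no" when
# run against a real cluster.
can_service_account() (
  sa=$1
  ns=$2
  verb=$3
  resource=$4
  "${KUBECTL:-kubectl}" auth can-i "$verb" "$resource" -n "$ns" \
    --as="system:serviceaccount:$ns:$sa"
)

# Usage: can_service_account <service-account> <namespace> get secrets
```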
Related Tools
Useful additional tools for Kubernetes debugging:
- kubectl-debug: Advanced debugging plugin
- stern: Multi-pod log tailing
- kubectx/kubens: Context and namespace switching
- k9s: Terminal UI for Kubernetes
- Lens: Desktop IDE for Kubernetes
- Prometheus/Grafana: Monitoring and alerting
- Jaeger/Zipkin: Distributed tracing