kubernetes-operations
Kubernetes Operations
Comprehensive kubectl assistance for debugging, resource management, and cluster operations with token-efficient scripts.
BEFORE YOU START
This skill prevents 5 common errors and saves ~70% tokens.
| Metric | Without Skill | With Skill |
|---|---|---|
| Pod Debugging | ~1200 tokens | ~400 tokens |
| Resource Listing | ~800 tokens | ~200 tokens |
| Cluster Health | ~1500 tokens | ~300 tokens |
Known Issues This Skill Prevents
- Running kubectl commands in wrong namespace/context
- Verbose output flooding context with unnecessary data
- Missing critical debugging steps (events, previous logs)
- Exposing secrets in plain text output
- Destructive operations without dry-run verification
Quick Start
Step 1: Verify Context
```bash
kubectl config current-context
kubectl config get-contexts
```
Why this matters: Running commands in the wrong cluster can cause production incidents.
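The context check can also be enforced in code before any command runs. The following is a minimal Python sketch, not part of the bundled scripts: the names `ALLOWED_CONTEXTS`, `current_context`, and `assert_safe_context`, and the example context names, are hypothetical. It assumes `kubectl` is on the PATH when `current_context()` is called.

```python
import subprocess

# Hypothetical allowlist; replace with the contexts you intend to operate on.
ALLOWED_CONTEXTS = {"dev-cluster", "staging-cluster"}


def current_context() -> str:
    """Return the active kubectl context (requires kubectl on PATH)."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def assert_safe_context(context: str, allowed=ALLOWED_CONTEXTS) -> str:
    """Raise if the context is not explicitly allowed; otherwise return it."""
    if context not in allowed:
        raise RuntimeError(f"Refusing to run against context {context!r}")
    return context
```

A caller would typically run `assert_safe_context(current_context())` once at startup, so that an unexpected cluster stops the session before anything is changed.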
Step 2: Debug a Pod
```bash
uv run scripts/debug_pod.py <pod-name> [-n namespace]
```
Why this matters: The script combines describe, logs, and events into a condensed summary, saving ~800 tokens.
Step 3: Check Cluster Health
```bash
uv run scripts/cluster_health.py
```
Why this matters: Quick overview of node status and unhealthy pods without verbose output.
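To illustrate what a condensed health overview involves, here is a rough Python sketch. It is not the bundled `cluster_health.py`, whose contents are not shown here. The function names are hypothetical. The inputs are assumed to be the parsed JSON from `kubectl get nodes -o json` and `kubectl get pods -A -o json`. The pod check uses only `.status.phase`, which is coarse: a pod in CrashLoopBackOff can still report the `Running` phase.

```python
def not_ready_nodes(node_list: dict) -> list:
    """Names of nodes whose Ready condition is not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            names.append(node["metadata"]["name"])
    return names


def unhealthy_pods(pod_list: dict) -> list:
    """One short line per pod whose phase is neither Running nor Succeeded."""
    bad = []
    for pod in pod_list.get("items", []):
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase not in ("Running", "Succeeded"):
            meta = pod["metadata"]
            bad.append(f"{meta.get('namespace', 'default')}/{meta['name']}: {phase}")
    return bad
```

Returning one short string per problem, instead of the full objects, is what keeps the output small.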
Critical Rules
Always Do
- Always verify `kubectl config current-context` before operations
- Always use `-n namespace` to be explicit about target
- Always use `--dry-run=client -o yaml` before applying changes
- Always check events when debugging: `kubectl get events --sort-by='.lastTimestamp'`
- Always use `--previous` flag when pod is in CrashLoopBackOff
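Two of these rules, the explicit namespace and the dry run, can be built into a small command builder. The sketch below is illustrative only: `kubectl_cmd` is a hypothetical helper, not part of the skill. It relies on `-n`, `--context`, `--dry-run=client`, and `-o yaml` being standard kubectl flags.

```python
def kubectl_cmd(args, namespace, context=None, dry_run=False):
    """Build a kubectl argv list with an explicit namespace (and optional context)."""
    if not namespace:
        raise ValueError("namespace must be given explicitly")
    cmd = ["kubectl"]
    if context:
        cmd += ["--context", context]
    cmd += list(args) + ["-n", namespace]
    if dry_run:
        # Client-side dry run: print the resulting object instead of applying it.
        cmd += ["--dry-run=client", "-o", "yaml"]
    return cmd
```

For example, `kubectl_cmd(["apply", "-f", "deploy.yaml"], "web", dry_run=True)` returns an argv list ending in `-n web --dry-run=client -o yaml`, which can be passed to `subprocess.run`.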
Never Do
- Never run `kubectl delete` without `--dry-run` first in production
- Never output secrets without filtering: avoid `kubectl get secret -o yaml`
- Never assume default namespace - always specify `-n`
- Never ignore resource limits when debugging OOMKilled pods
- Never skip `describe` when logs show no errors
Common Mistakes
Wrong:
```bash
kubectl logs my-pod
```
Correct:
```bash
kubectl logs my-pod -n my-namespace --tail=100 --timestamps
```
Why: Default namespace may not be correct, unlimited logs flood context, timestamps help correlate with events.
Known Issues Prevention
| Issue | Root Cause | Solution |
|---|---|---|
| CrashLoopBackOff | App crash on startup | Check |
| ImagePullBackOff | Registry auth or image tag | Verify image exists and check pull secrets |
| Pending pods | No schedulable nodes | Check node resources and pod affinity/tolerations |
| OOMKilled | Memory limit exceeded | Check container limits vs actual usage with |
| Connection refused | Service selector mismatch | Verify pod labels match service selector |
Debugging Workflows
Pod Not Starting
```bash
# 1. Get pod status and events
kubectl describe pod <name> -n <namespace>

# 2. Check logs (current or previous)
kubectl logs <name> -n <namespace> --tail=100
kubectl logs <name> -n <namespace> --previous  # If restarting

# 3. Check events for scheduling issues
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep <name>

# 4. Interactive debugging
kubectl exec -it <name> -n <namespace> -- /bin/sh
```
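Step 2 depends on whether the pod has restarted. As a hedged sketch, the decision can be read from the pod object itself. `log_commands` is a hypothetical helper. Its input is assumed to be the parsed output of `kubectl get pod <name> -n <namespace> -o json`, where `status.containerStatuses[].restartCount` records restarts.

```python
def log_commands(pod: dict, tail: int = 100) -> list:
    """Return the kubectl logs commands worth running for this pod."""
    meta = pod["metadata"]
    base = ["kubectl", "logs", meta["name"], "-n", meta["namespace"], f"--tail={tail}"]
    commands = [base]
    statuses = pod.get("status", {}).get("containerStatuses", [])
    # A restarted container's earlier output is only reachable with --previous.
    if any(s.get("restartCount", 0) > 0 for s in statuses):
        commands.append(base + ["--previous"])
    return commands
```

A pod that has never restarted yields a single command. A restarting pod yields both, so the output of the crashed instance is not missed.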
Service Connectivity
```bash
# 1. Verify service exists and has endpoints
kubectl get svc <name> -n <namespace>
kubectl get endpoints <name> -n <namespace>

# 2. Check pod labels match service selector
kubectl get pods -n <namespace> --show-labels

# 3. Test from within cluster (temporary pod in the same namespace, removed on exit)
kubectl run debug --rm -it --restart=Never --image=busybox -n <namespace> -- wget -qO- http://<service>:<port>

# 4. Port-forward for local testing
kubectl port-forward svc/<name> 8080:80 -n <namespace>
```
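The label check in step 2 is easy to get wrong by eye. The sketch below shows the matching rule for equality-based Service selectors: every selector key/value must be present in the pod's labels. The function names are hypothetical. The inputs are assumed to be parsed JSON from `kubectl get svc <name> -o json` and `kubectl get pods -o json`.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """True when every key/value in the Service selector appears in the pod labels."""
    if not selector:
        return False  # a Service without a selector does not select any pods
    return all(labels.get(key) == value for key, value in selector.items())


def matching_pods(service: dict, pod_list: dict) -> list:
    """Names of pods that the Service's selector would pick up."""
    selector = service.get("spec", {}).get("selector") or {}
    return [
        pod["metadata"]["name"]
        for pod in pod_list.get("items", [])
        if selector_matches(selector, pod["metadata"].get("labels") or {})
    ]
```

An empty result from `matching_pods` corresponds to the "Service selector mismatch" case: the Service exists but has no endpoints.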
Resource Management
Deployments
```bash
# List deployments
kubectl get deployments -n <namespace>

# Scale
kubectl scale deployment <name> --replicas=3 -n <namespace>

# Rollout status
kubectl rollout status deployment/<name> -n <namespace>

# Rollback
kubectl rollout undo deployment/<name> -n <namespace>

# History
kubectl rollout history deployment/<name> -n <namespace>
```
ConfigMaps and Secrets
```bash
# List
kubectl get configmaps -n <namespace>
kubectl get secrets -n <namespace>

# View ConfigMap data
kubectl get configmap <name> -n <namespace> -o jsonpath='{.data}'

# View Secret keys (NOT values)
kubectl get secret <name> -n <namespace> -o jsonpath='{.data}' | jq 'keys'

# Create from file
kubectl create configmap <name> --from-file=<path> -n <namespace> --dry-run=client -o yaml
```
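When a Secret object has to pass through a script, the same "keys, not values" idea can be applied in code. This is a minimal sketch with hypothetical helper names. It assumes the input is the parsed JSON of a Secret. It also drops the `kubectl.kubernetes.io/last-applied-configuration` annotation, because that annotation can carry a copy of the data when the Secret was created with `kubectl apply`.

```python
import copy

LAST_APPLIED = "kubectl.kubernetes.io/last-applied-configuration"


def secret_keys(secret: dict) -> list:
    """Sorted key names of a Secret, without touching the values."""
    return sorted((secret.get("data") or {}).keys())


def redact_secret(secret: dict) -> dict:
    """Copy of the Secret with every data value replaced by a placeholder."""
    redacted = copy.deepcopy(secret)
    redacted["data"] = {key: "<redacted>" for key in (secret.get("data") or {})}
    annotations = redacted.get("metadata", {}).get("annotations") or {}
    annotations.pop(LAST_APPLIED, None)  # may embed the original manifest
    return redacted
```

The input object is left unmodified, so the redacted copy can be logged or printed while the original stays in memory only.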
Cluster Operations
Node Management
```bash
# List nodes with status
kubectl get nodes -o wide

# Node details
kubectl describe node <name>

# Cordon (prevent scheduling)
kubectl cordon <node>

# Drain (evict pods)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# Uncordon
kubectl uncordon <node>
```
Resource Usage
```bash
# Node resources
kubectl top nodes

# Pod resources
kubectl top pods -n <namespace>

# Sort by memory
kubectl top pods -n <namespace> --sort-by=memory
```
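Comparing `kubectl top` output against container limits, as the OOMKilled guidance suggests, requires converting quantities such as `128Mi` into bytes. The sketch below handles only the common binary and decimal suffixes. `memory_bytes`, `over_limit`, and the 0.9 threshold are illustrative assumptions, not part of the skill.

```python
_UNITS = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def memory_bytes(quantity: str) -> int:
    """Convert a Kubernetes memory quantity such as '128Mi' to bytes."""
    # Binary suffixes are checked first so that 'Mi' is not mistaken for 'M'.
    for suffix in ("Ki", "Mi", "Gi", "Ti", "k", "M", "G", "T"):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _UNITS[suffix])
    return int(float(quantity))  # plain byte count


def over_limit(usage: str, limit: str, threshold: float = 0.9) -> bool:
    """True when usage is at or above the given fraction of the limit."""
    return memory_bytes(usage) >= threshold * memory_bytes(limit)
```

For instance, a container using `950Mi` against a `1Gi` limit is above the 90% mark and is a likely OOMKilled candidate.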
Bundled Resources
Scripts
Located in `scripts/`:
- `debug_pod.py` - Comprehensive pod debugging with condensed output
- `get_resources.py` - Resource summary using jsonpath for minimal tokens
- `cluster_health.py` - Quick cluster status overview
References
Located in `references/`:
- `kubectl-cheatsheet.md` - Condensed command reference
- `jsonpath-patterns.md` - Common JSONPath expressions
- `debugging-flowchart.md` - Decision tree for pod issues
Note: For deep dives on specific topics, see the reference files above.
Dependencies
Required
| Package | Version | Purpose |
|---|---|---|
| kubectl | 1.25+ | Kubernetes CLI |
| jq | 1.6+ | JSON parsing for scripts |
Optional
| Package | Version | Purpose |
|---|---|---|
| k9s | 0.27+ | Terminal UI for Kubernetes |
| stern | 1.25+ | Multi-pod log tailing |
Official Documentation
Troubleshooting
kubectl command not found
Symptoms: `command not found: kubectl`
Solution:
```bash
# macOS
brew install kubectl

# Verify
kubectl version --client
```
Context not set
Symptoms: `error: no context is currently set`
Solution:
```bash
# List available contexts
kubectl config get-contexts

# Set context
kubectl config use-context <context-name>
```
Permission denied
Symptoms: `Error from server (Forbidden)`
Solution:
```bash
# Check current user (needs a recent kubectl; `auth whoami` is absent from older releases such as 1.25)
kubectl auth whoami

# Check permissions
kubectl auth can-i get pods -n <namespace>
kubectl auth can-i --list -n <namespace>
```
Timeout connecting to cluster
Symptoms: `Unable to connect to the server: dial tcp: i/o timeout`
Solution:
```bash
# Check cluster endpoint
kubectl cluster-info

# Verify network connectivity
curl -k https://<cluster-api-endpoint>/healthz

# Check kubeconfig (config view redacts credentials; cat ~/.kube/config would print them in plain text)
kubectl config view --minify
```
Setup Checklist
Before using this skill, verify:
- `kubectl` installed (`kubectl version --client`)
- Kubeconfig configured (`~/.kube/config` exists)
- Context set to correct cluster (`kubectl config current-context`)
- Permissions verified (`kubectl auth can-i get pods`)
- `jq` installed for JSON parsing (`jq --version`)
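The installation items in this checklist can be verified programmatically. The following Python sketch is illustrative only. `missing_prerequisites` is a hypothetical helper, and it assumes the kubeconfig lives at the default `~/.kube/config` path, ignoring any `KUBECONFIG` override. The context and permission checks still need the kubectl commands listed above.

```python
import shutil
from pathlib import Path


def missing_prerequisites(
    tools=("kubectl", "jq"),
    kubeconfig=Path.home() / ".kube" / "config",
    which=shutil.which,
):
    """Return the names of tools not on PATH, plus the kubeconfig path if absent."""
    missing = [tool for tool in tools if which(tool) is None]
    if not Path(kubeconfig).exists():
        missing.append(str(kubeconfig))
    return missing
```

An empty list means the tools and kubeconfig are in place. The `which` parameter exists only so the lookup can be replaced in tests.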