kubernetes-production
Kubernetes Production
<!-- dual-compat-start -->
Use When
- Use when operating production Kubernetes — Helm, autoscaling (HPA/VPA), resource management, StatefulSets, external-secrets, observability (Prometheus/Grafana/Loki), RBAC, Pod Security Standards, NetworkPolicies, admission control, backup (Velero), and cost control.
- The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.
Do Not Use When
- The task is unrelated to `kubernetes-production` or would be better handled by a more specific companion skill.
- The request only needs a trivial answer and none of this skill's constraints or references materially help.
Required Inputs
- Gather relevant project context, constraints, and the concrete problem to solve; load `references/` content only as needed.
- Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.
Workflow
- Read this `SKILL.md` first, then load only the referenced deep-dive files that are necessary for the task.
- Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
- Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.
Quality Standards
- Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
- Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
- Prefer deterministic, reviewable steps over vague advice or tool-specific magic.
Anti-Patterns
- Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
- Loading every reference file by default instead of using progressive disclosure.
Outputs
- A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
- Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
- References used, companion skills, or follow-up actions when they materially improve execution.
Evidence Produced
| Category | Artifact | Format | Example |
|---|---|---|---|
| Operability | K8s deployment runbook | Markdown doc | |
| Operability | Resource and autoscaler config note | Markdown doc covering requests/limits, HPA/VPA, and PDB rationale | |
References
- Use the `references/` directory for deep detail after reading the core workflow below.

The operational bar for running K8s in production — not just running Pods, but running them predictably, securely, observably, and cheaply.
Prerequisites: Load `kubernetes-fundamentals` first.
When this skill applies
- Moving a POC cluster into production.
- Auditing an existing cluster for production readiness.
- Adding observability, security, or cost controls.
- Reviewing Helm charts before deployment.
The production checklist
```text
[ ] Helm (or Kustomize) — not raw manifests
[ ] requests + limits on every container
[ ] HPA on stateless workloads
[ ] PodDisruptionBudget on every workload with >1 replica
[ ] External secrets (Vault / AWS SM / GCP SM) — not inline Secret manifests
[ ] Observability: Prometheus + Grafana + Loki + Alertmanager
[ ] RBAC: least-privilege ServiceAccounts
[ ] Pod Security Standards enforced (restricted for app workloads)
[ ] NetworkPolicies: default-deny + explicit allow
[ ] Admission control: OPA Gatekeeper or Kyverno
[ ] Image scanning in CI (Trivy, Grype)
[ ] Backups: Velero with off-cluster storage
[ ] Cluster autoscaler or karpenter
[ ] SLOs + runbooks per service
```
Request sizing heuristics
Measure first; do not guess. Steady-state and peak come from `kubectl top` plus `container_memory_working_set_bytes` and `rate(container_cpu_usage_seconds_total[5m])` over a representative week.

```text
Memory request = p95 working set
Memory limit   = p99 working set + 30% headroom
CPU request    = p95 utilisation in cores
CPU limit      = 2x request, or unset (with PriorityClass + tested cluster) for burstable workloads
```

QoS classes (set deliberately):

```text
Guaranteed   requests == limits    -> tier-1 services, evicted last
Burstable    requests < limits     -> default for most apps
BestEffort   no requests/limits    -> never use in production
```

HPA does not work without CPU requests. VPA in `recommend` mode for one week tells you whether you are over- or under-provisioned.
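As a sketch, those heuristics might translate to a container spec like this (workload name and numbers are illustrative, derived from a week of measured p95/p99 usage):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: api, namespace: production }
spec:
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.4.2
          resources:
            requests:
              cpu: 500m        # p95 utilisation
              memory: 512Mi    # p95 working set
            limits:
              cpu: "1"         # 2x the CPU request
              memory: 680Mi    # p99 working set + ~30% headroom
```

With requests below limits, this lands in the Burstable QoS class — the default recommended above for most apps.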
When to reach for an operator
```text
Stateful system you operate (Postgres, Kafka, Redis, ES) -> install a vetted operator (CloudNativePG, Strimzi, ECK)
Repeated multi-step ops you do by hand -> consider an operator
Watch a ConfigMap and bounce Pods -> CronJob is enough
Tenant lifecycle in SaaS -> GitOps + ApplicationSet first; operator only if dynamic
```

Rule: install before you build. See `references/crd-operators.md` for the build/install/avoid matrix and CRD hygiene.
Cluster and node upgrades
Upgrades are releases, not chores. Pre-flight: run `kube-no-trouble`/`pluto` against Git for removed APIs, verify a Velero restore in a sandbox, drain-test one canary node.
Order: control plane -> node groups one at a time -> add-ons (CNI first, mesh last). Every workload with replicas > 1 has a PDB; a PDB of `minAvailable: 100%` blocks drain forever. See `references/upgrade-runbook.md`.
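A drain-safe PDB, by contrast, might look like this (name, namespace, and selector are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: api, namespace: production }
spec:
  minAvailable: 2              # with 3 replicas, one Pod can be evicted at a time
  selector:
    matchLabels: { app: api }
```

An absolute `minAvailable` below the replica count (or `maxUnavailable: 1`) always leaves room for a voluntary eviction, so node drains make progress.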
Helm vs Kustomize
```text
Package to distribute to many users / tenants -> Helm
Env overlays (dev/staging/prod), single app -> Kustomize
Complex templating + conditionals -> Helm
Strict "WYSIWYG" manifests -> Kustomize
```

We use Helm for shipped packages and Kustomize for simple in-house env overlays. Never both for the same workload.
See `references/helm-vs-kustomize.md`.
Resource management
Requests: guaranteed minimum; scheduler uses this to place Pods.
Limits: cap; exceeded CPU = throttled, exceeded memory = OOMKilled.
Rules:
- Always set both.
- Memory: limit = expected peak + headroom (~30%).
- CPU: request = steady-state; limit = 2–5× request or unset (in carefully-configured clusters).
- Measure before setting — `kubectl top` and Prometheus `container_memory_working_set_bytes`, `rate(container_cpu_usage_seconds_total[5m])`.
- Gradually reduce over-provisioning using VPA in `recommend` mode.

See `references/resource-management.md`.
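The recommendation-only VPA mentioned above can be sketched like this (assumes the VPA components are installed in the cluster; `updateMode: "Off"` is the recommendation-only setting — it records suggestions without evicting Pods):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: api, namespace: production }
spec:
  targetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  updatePolicy:
    updateMode: "Off"   # recommend only; read results with `kubectl describe vpa api`
```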
HPA — Horizontal Pod Autoscaler
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api, namespace: production }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 60 } }
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies: [{ type: Percent, value: 50, periodSeconds: 60 }]
    scaleUp:
      stabilizationWindowSeconds: 0
      policies: [{ type: Percent, value: 100, periodSeconds: 30 }]
```

- Target 60–70% CPU utilisation to leave headroom for spikes.
- `minReplicas: 2` minimum (HA).
- Scale up fast, scale down slow.
- For queue-depth-based or custom metrics: install Prometheus Adapter.

See `references/autoscaling-hpa-vpa.md`.
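For the queue-depth case, once Prometheus Adapter exposes the metric through the external metrics API, an HPA might consume it like this (metric name and target are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: worker, namespace: production }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: worker }
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric: { name: queue_depth }
        target: { type: AverageValue, averageValue: "30" }  # aim for ~30 messages per Pod
```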
Stateful workloads
StatefulSet + PVC for databases, caches, brokers:
- Stable names (`pod-0`, `pod-1`).
- Ordered rollout/rollback.
- Each Pod has its own PersistentVolume.
- Headless Service for stable DNS per Pod.
- PodDisruptionBudget (`minAvailable: N-1`).
- Anti-affinity across nodes/zones.
- Backups to object storage (never rely on PV alone).

For databases, strongly consider managed (RDS, Cloud SQL, Neon) before in-cluster. In-cluster DBs make sense only with serious ops maturity.
See `references/stateful-workloads.md`.
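A minimal sketch combining several of these points, using Redis as a stand-in workload (names, sizes, and image are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata: { name: redis-headless, namespace: production }
spec:
  clusterIP: None                  # headless: stable per-Pod DNS
  selector: { app: redis }
  ports: [{ name: redis, port: 6379 }]
---
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: redis, namespace: production }
spec:
  serviceName: redis-headless
  replicas: 3
  selector:
    matchLabels: { app: redis }
  template:
    metadata:
      labels: { app: redis }
    spec:
      affinity:
        podAntiAffinity:           # spread replicas across nodes
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels: { app: redis }
      containers:
        - name: redis
          image: redis:7
          ports: [{ containerPort: 6379 }]
          volumeMounts: [{ name: data, mountPath: /data }]
  volumeClaimTemplates:            # one PV per Pod
    - metadata: { name: data }
      spec:
        accessModes: [ReadWriteOnce]
        resources: { requests: { storage: 10Gi } }
```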
External secrets
Never commit `Secret` manifests to Git (even base64 is not encryption). Options:
- external-secrets operator + Vault / AWS Secrets Manager / GCP Secret Manager / 1Password.
- Sealed Secrets (Bitnami) — encrypted manifests safe for Git.
- SOPS + age for GitOps-friendly encryption.

Pattern: external-secrets + cloud secret manager is our default in cloud; SOPS for air-gapped or self-hosted.
See `references/secrets-external-secrets.md`.
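With the external-secrets operator, the default pattern might look like this (the store name and remote key path are illustrative; the operator and a matching `ClusterSecretStore` must already exist):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata: { name: api-credentials, namespace: production }
spec:
  refreshInterval: 1h
  secretStoreRef: { name: aws-sm, kind: ClusterSecretStore }
  target:
    name: api-credentials          # the K8s Secret created and kept in sync
  data:
    - secretKey: DATABASE_URL
      remoteRef: { key: prod/api/database-url }
```

The operator materialises the Secret in-cluster, so nothing sensitive is ever committed to Git.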
Observability stack
Minimum stack:
- Prometheus — metrics scraping + storage (or Mimir/Thanos for long-term).
- Grafana — dashboards.
- Loki — logs.
- Alertmanager — alert routing.
- OpenTelemetry Collector — receive OTLP, fan out to backends.
- Tempo or Jaeger — traces.
Install via kube-prometheus-stack Helm chart.
Golden signals per service: latency, traffic, errors, saturation.
See `references/observability-stack.md`.
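As one example of alerting on the golden signals, a rule shipped via the kube-prometheus-stack CRDs might look like this (the metric name and threshold are illustrative and depend on your instrumentation):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata: { name: api-golden-signals, namespace: production }
spec:
  groups:
    - name: api.errors
      rules:
        - alert: ApiHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="api"}[5m])) > 0.01
          for: 10m                 # sustained, not a blip
          labels: { severity: page }
          annotations:
            summary: "api 5xx error rate above 1% for 10m"
```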
RBAC + Pod Security
RBAC principles:
- One ServiceAccount per workload.
- No default ServiceAccount usage (set `automountServiceAccountToken: false` when not needed).
- Roles — least privilege.
- Audit unused ClusterRoleBindings periodically.

Pod Security Standards — enforce at namespace level:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

Restricted disallows: privilege escalation, hostPath, hostNetwork, running as root, etc.
See `references/rbac-and-pod-security.md`.
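A least-privilege ServiceAccount wired to a narrowly scoped Role might look like this (names are illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata: { name: api, namespace: production }
automountServiceAccountToken: false   # opt in per-Pod only when the API is actually used
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata: { name: api-config-reader, namespace: production }
rules:
  - apiGroups: [""]
    resources: [configmaps]
    verbs: [get, list, watch]         # read-only, one resource type
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata: { name: api-config-reader, namespace: production }
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: api-config-reader
subjects:
  - kind: ServiceAccount
    name: api
    namespace: production
```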
NetworkPolicies — default deny
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny, namespace: production }
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
```

Then add explicit allows per workload (e.g., api can reach db, web can reach api, everything can reach DNS + external API).
Requires a CNI that enforces NetworkPolicy: Calico, Cilium, or Azure CNI. Flannel does not.
See `references/network-policies.md`.
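An explicit allow on top of default-deny — e.g. letting api reach db plus cluster DNS — might look like this (labels and ports are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: api-egress, namespace: production }
spec:
  podSelector:
    matchLabels: { app: api }
  policyTypes: [Egress]
  egress:
    - to: [{ podSelector: { matchLabels: { app: db } } }]   # api -> db only
      ports: [{ protocol: TCP, port: 5432 }]
    - to:                                                   # allow DNS, or nothing resolves
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: kube-system }
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
```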
Admission control — OPA Gatekeeper or Kyverno
Policy-as-code enforcement at cluster admission:
- Deny `:latest` images.
- Require `resources.requests` and `resources.limits`.
- Require readinessProbe.
- Require specific labels (`app.kubernetes.io/*`, `owner`).
- Restrict which registries are allowed.
- Deny hostPath volumes.

Kyverno — YAML-based policies, easier for most teams.
OPA Gatekeeper — Rego-based, more powerful, steeper curve.
See `references/admission-control-opa-kyverno.md`.
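Two of the rules above, sketched as a Kyverno policy (a sketch, not a drop-in: verify the pattern syntax against your Kyverno version and run in audit mode first):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: { name: baseline-workload-rules }
spec:
  validationFailureAction: Enforce
  rules:
    - name: disallow-latest-tag
      match:
        any:
          - resources: { kinds: [Pod] }
      validate:
        message: "Images must use a pinned tag, not :latest."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
    - name: require-requests-limits
      match:
        any:
          - resources: { kinds: [Pod] }
      validate:
        message: "CPU and memory requests/limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests: { cpu: "?*", memory: "?*" }
                  limits: { cpu: "?*", memory: "?*" }
```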
Image scanning
- Trivy or Grype in CI for container images and IaC.
- Fail builds on HIGH/CRITICAL.
- Sign images (cosign) and verify at admission (policy-controller or Kyverno).
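In GitHub Actions, the scan-and-fail step might look like this (registry and action reference are illustrative; pin a release tag in real CI and adapt to your system):

```yaml
- name: Scan image with Trivy
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: registry.example.com/api:${{ github.sha }}
    severity: HIGH,CRITICAL
    exit-code: "1"        # non-zero exit fails the build on findings
    ignore-unfixed: true  # skip vulns with no available fix
```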
Backup — Velero
```bash
velero backup create weekly-$(date +%Y%m%d) --include-namespaces production --ttl 720h
```

- Backup + restore cluster resources and PersistentVolumes.
- Storage in off-cluster bucket (S3/GCS/Azure Blob).
- Scheduled backups with Velero Schedule.
- Regular restore drills — backups you never restore are hope, not a backup.

See `references/backup-velero.md`.
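A scheduled daily backup to the configured off-cluster bucket might look like this (schedule and TTL are illustrative):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata: { name: production-daily, namespace: velero }
spec:
  schedule: "0 2 * * *"        # daily at 02:00
  template:
    includedNamespaces: [production]
    ttl: 720h0m0s              # retain for 30 days
```

Pair it with a recurring restore drill into a scratch namespace or sandbox cluster so the retention window is actually proven.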
Cost control
- Cluster autoscaler (or karpenter on AWS) — nodes scale with demand.
- Right-sizing — VPA in recommend mode, kubectl-cost, kubecost for cost dashboards.
- Spot / preemptible nodes — for stateless, fault-tolerant workloads. Use node taints + tolerations to steer.
- Idle resource alerts — requests far above usage = over-provisioning.
See `references/cost-control.md`.
references/cost-control.md- Cluster autoscaler(AWS上使用karpenter)——节点随需求自动扩缩容。
- 资源优化——使用VPA的recommend模式、kubectl-cost、kubecost查看成本仪表盘。
- Spot/抢占式节点——适用于无状态、容错性高的工作负载。使用节点污点+容忍度进行调度。
- 闲置资源告警——请求远高于实际使用量即表示过度配置。
详情见。
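Steering a stateless Deployment onto spot capacity might look like this Pod-spec fragment (the taint key and label are illustrative conventions, not Kubernetes defaults — use whatever your node groups actually set):

```yaml
# Fragment of a Deployment's Pod template; assumes spot nodes are
# tainted node-role/spot=true:NoSchedule and labeled node-role/spot: "true".
spec:
  template:
    spec:
      tolerations:
        - key: node-role/spot
          operator: Equal
          value: "true"
          effect: NoSchedule   # tolerate the spot taint
      nodeSelector:
        node-role/spot: "true" # and actively prefer those nodes
```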
Anti-patterns
- Relying on default ServiceAccount tokens.
- Mounting every ConfigMap/Secret as env vars (file mounts are more secure, support rotation).
- Using Deployment for stateful workloads.
- HPA on workloads with startup > 60s without startup probe.
- Default-allow NetworkPolicy posture.
- Skipping backup/restore drills.
- No PodDisruptionBudget on critical services.
Read next
- `kubernetes-saas-delivery` — multi-tenant SaaS on K8s, GitOps.
- `observability-monitoring` — SLO design and alert discipline across the stack.
- `reliability-engineering` — incident response + runbooks.
References
- references/helm-vs-kustomize.md
- references/resource-management.md
- references/autoscaling-hpa-vpa.md
- references/stateful-workloads.md
- references/secrets-external-secrets.md
- references/observability-stack.md
- references/rbac-and-pod-security.md
- references/network-policies.md
- references/admission-control-opa-kyverno.md
- references/backup-velero.md
- references/cost-control.md
- references/upgrade-runbook.md
- references/crd-operators.md