kubernetes-production
Kubernetes Production
<!-- dual-compat-start -->
Use When
- Use when operating production Kubernetes — Helm, autoscaling (HPA/VPA), resource management, StatefulSets, external-secrets, observability (Prometheus/Grafana/Loki), RBAC, Pod Security Standards, NetworkPolicies, admission control, backup (Velero), and cost control.
- The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.
Do Not Use When
- The task is unrelated to `kubernetes-production` or would be better handled by a more specific companion skill.
- The request only needs a trivial answer and none of this skill's constraints or references materially help.
Required Inputs
- Gather relevant project context, constraints, and the concrete problem to solve; load `references/` content only as needed.
- Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.
Workflow
- Read this `SKILL.md` first, then load only the referenced deep-dive files that are necessary for the task.
- Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
- Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.
Quality Standards
- Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
- Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
- Prefer deterministic, reviewable steps over vague advice or tool-specific magic.
Anti-Patterns
- Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
- Loading every reference file by default instead of using progressive disclosure.
Outputs
- A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
- Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
- References used, companion skills, or follow-up actions when they materially improve execution.
Evidence Produced
| Category | Artifact | Format | Example |
|---|---|---|---|
| Operability | K8s deployment runbook | Markdown doc | |
| Operability | Resource and autoscaler config note | Markdown doc covering requests/limits, HPA/VPA, and PDB rationale | |
References
- Use the `references/` directory for deep detail after reading the core workflow below.

The operational bar for running K8s in production — not just running Pods, but running them predictably, securely, observably, and cheaply.
Prerequisites: Load `kubernetes-fundamentals` first.
When this skill applies
- Moving a POC cluster into production.
- Auditing an existing cluster for production readiness.
- Adding observability, security, or cost controls.
- Reviewing Helm charts before deployment.
The production checklist
```text
[ ] Helm (or Kustomize) — not raw manifests
[ ] requests + limits on every container
[ ] HPA on stateless workloads
[ ] PodDisruptionBudget on every workload with >1 replica
[ ] External secrets (Vault / AWS SM / GCP SM) — not inline Secret manifests
[ ] Observability: Prometheus + Grafana + Loki + Alertmanager
[ ] RBAC: least-privilege ServiceAccounts
[ ] Pod Security Standards enforced (restricted for app workloads)
[ ] NetworkPolicies: default-deny + explicit allow
[ ] Admission control: OPA Gatekeeper or Kyverno
[ ] Image scanning in CI (Trivy, Grype)
[ ] Backups: Velero with off-cluster storage
[ ] Cluster autoscaler or karpenter
[ ] SLOs + runbooks per service
```
Request sizing heuristics
Measure first; do not guess. Steady-state and peak come from `kubectl top` plus `container_memory_working_set_bytes` and `rate(container_cpu_usage_seconds_total[5m])` over a representative week.

```text
Memory request = p95 working set
Memory limit   = p99 working set + 30% headroom
CPU request    = p95 utilisation in cores
CPU limit      = 2x request, or unset (with PriorityClass + tested cluster) for burstable workloads
```

QoS classes (set deliberately):

```text
Guaranteed   requests == limits    -> tier-1 services, evicted last
Burstable    requests < limits     -> default for most apps
BestEffort   no requests/limits    -> never use in production
```

HPA does not work without CPU requests. VPA in `recommend` mode for one week tells you whether you are over- or under-provisioned.
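As a sketch, those heuristics might translate to a container spec like this (workload name and numbers are illustrative, derived from a week of measured p95/p99 usage):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: api, namespace: production }
spec:
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.4.2
          resources:
            requests:
              cpu: 500m        # p95 utilisation
              memory: 512Mi    # p95 working set
            limits:
              cpu: "1"         # 2x the CPU request
              memory: 680Mi    # p99 working set + ~30% headroom
```

With requests below limits, this lands in the Burstable QoS class — the default recommended above for most apps.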
When to reach for an operator
```text
Stateful system you operate (Postgres, Kafka, Redis, ES) -> install a vetted operator (CloudNativePG, Strimzi, ECK)
Repeated multi-step ops you do by hand -> consider an operator
Watch a ConfigMap and bounce Pods -> CronJob is enough
Tenant lifecycle in SaaS -> GitOps + ApplicationSet first; operator only if dynamic
```

Rule: install before you build. See `references/crd-operators.md` for the build/install/avoid matrix and CRD hygiene.
Cluster and node upgrades
Upgrades are releases, not chores. Pre-flight: run `kube-no-trouble`/`pluto` against Git for removed APIs, verify a Velero restore in a sandbox, drain-test one canary node.
Order: control plane -> node groups one at a time -> add-ons (CNI first, mesh last). Every workload with replicas > 1 has a PDB; a PDB of `minAvailable: 100%` blocks drain forever. See `references/upgrade-runbook.md`.
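A drain-safe PDB, by contrast, might look like this (name, namespace, and selector are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: api, namespace: production }
spec:
  minAvailable: 2              # with 3 replicas, one Pod can be evicted at a time
  selector:
    matchLabels: { app: api }
```

An absolute `minAvailable` below the replica count (or `maxUnavailable: 1`) always leaves room for a voluntary eviction, so node drains make progress.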
Helm vs Kustomize
```text
Package to distribute to many users / tenants -> Helm
Env overlays (dev/staging/prod), single app -> Kustomize
Complex templating + conditionals -> Helm
Strict "WYSIWYG" manifests -> Kustomize
```

We use Helm for shipped packages and Kustomize for simple in-house env overlays. Never both for the same workload.
See `references/helm-vs-kustomize.md`.
Resource management
Requests: guaranteed minimum; scheduler uses this to place Pods.
Limits: cap; exceeded CPU = throttled, exceeded memory = OOMKilled.
Rules:
- Always set both.
- Memory: limit = expected peak + headroom (~30%).
- CPU: request = steady-state; limit = 2–5× request or unset (in carefully-configured clusters).
- Measure before setting — `kubectl top` and Prometheus `container_memory_working_set_bytes`, `rate(container_cpu_usage_seconds_total[5m])`.
- Gradually reduce over-provisioning using VPA in `recommend` mode.

See `references/resource-management.md`.
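The recommendation-only VPA mentioned above can be sketched like this (assumes the VPA components are installed in the cluster; `updateMode: "Off"` is the recommendation-only setting — it records suggestions without evicting Pods):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: api, namespace: production }
spec:
  targetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  updatePolicy:
    updateMode: "Off"   # recommend only; read results with `kubectl describe vpa api`
```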
HPA — Horizontal Pod Autoscaler
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api, namespace: production }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 60 } }
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies: [{ type: Percent, value: 50, periodSeconds: 60 }]
    scaleUp:
      stabilizationWindowSeconds: 0
      policies: [{ type: Percent, value: 100, periodSeconds: 30 }]
```

- Target 60–70% CPU utilisation to leave headroom for spikes.
- `minReplicas: 2` minimum (HA).
- Scale up fast, scale down slow.
- For queue-depth-based or custom metrics: install Prometheus Adapter.

See `references/autoscaling-hpa-vpa.md`.
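For the queue-depth case, once Prometheus Adapter exposes the metric through the external metrics API, an HPA might consume it like this (metric name and target are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: worker, namespace: production }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: worker }
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric: { name: queue_depth }
        target: { type: AverageValue, averageValue: "30" }  # aim for ~30 messages per Pod
```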
Stateful workloads
StatefulSet + PVC for databases, caches, brokers:
- Stable names (`pod-0`, `pod-1`).
- Ordered rollout/rollback.
- Each Pod has its own PersistentVolume.
- Headless Service for stable DNS per Pod.
- PodDisruptionBudget (`minAvailable: N-1`).
- Anti-affinity across nodes/zones.
- Backups to object storage (never rely on PV alone).

For databases, strongly consider managed (RDS, Cloud SQL, Neon) before in-cluster. In-cluster DBs make sense only with serious ops maturity.
See `references/stateful-workloads.md`.
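A minimal sketch combining several of these points, using Redis as a stand-in workload (names, sizes, and image are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata: { name: redis-headless, namespace: production }
spec:
  clusterIP: None                  # headless: stable per-Pod DNS
  selector: { app: redis }
  ports: [{ name: redis, port: 6379 }]
---
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: redis, namespace: production }
spec:
  serviceName: redis-headless
  replicas: 3
  selector:
    matchLabels: { app: redis }
  template:
    metadata:
      labels: { app: redis }
    spec:
      affinity:
        podAntiAffinity:           # spread replicas across nodes
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels: { app: redis }
      containers:
        - name: redis
          image: redis:7
          ports: [{ containerPort: 6379 }]
          volumeMounts: [{ name: data, mountPath: /data }]
  volumeClaimTemplates:            # one PV per Pod
    - metadata: { name: data }
      spec:
        accessModes: [ReadWriteOnce]
        resources: { requests: { storage: 10Gi } }
```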
External secrets
Never commit `Secret` manifests to Git (even base64 is not encryption). Options:
- external-secrets operator + Vault / AWS Secrets Manager / GCP Secret Manager / 1Password.
- Sealed Secrets (Bitnami) — encrypted manifests safe for Git.
- SOPS + age for GitOps-friendly encryption.

Pattern: external-secrets + cloud secret manager is our default in cloud; SOPS for air-gapped or self-hosted.
See `references/secrets-external-secrets.md`.
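With the external-secrets operator, the default pattern might look like this (the store name and remote key path are illustrative; the operator and a matching `ClusterSecretStore` must already exist):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata: { name: api-credentials, namespace: production }
spec:
  refreshInterval: 1h
  secretStoreRef: { name: aws-sm, kind: ClusterSecretStore }
  target:
    name: api-credentials          # the K8s Secret created and kept in sync
  data:
    - secretKey: DATABASE_URL
      remoteRef: { key: prod/api/database-url }
```

The operator materialises the Secret in-cluster, so nothing sensitive is ever committed to Git.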
Observability stack
Minimum stack:
- Prometheus — metrics scraping + storage (or Mimir/Thanos for long-term).
- Grafana — dashboards.
- Loki — logs.
- Alertmanager — alert routing.
- OpenTelemetry Collector — receive OTLP, fan out to backends.
- Tempo or Jaeger — traces.
Install via kube-prometheus-stack Helm chart.
Golden signals per service: latency, traffic, errors, saturation.
See `references/observability-stack.md`.
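As one example of alerting on the golden signals, a rule shipped via the kube-prometheus-stack CRDs might look like this (the metric name and threshold are illustrative and depend on your instrumentation):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata: { name: api-golden-signals, namespace: production }
spec:
  groups:
    - name: api.errors
      rules:
        - alert: ApiHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="api"}[5m])) > 0.01
          for: 10m                 # sustained, not a blip
          labels: { severity: page }
          annotations:
            summary: "api 5xx error rate above 1% for 10m"
```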
RBAC + Pod Security
RBAC principles:
- One ServiceAccount per workload.
- No default ServiceAccount usage (set `automountServiceAccountToken: false` when not needed).
- Roles — least privilege.
- Audit unused ClusterRoleBindings periodically.

Pod Security Standards — enforce at namespace level:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

Restricted disallows: privilege escalation, hostPath, hostNetwork, running as root, etc.
See `references/rbac-and-pod-security.md`.
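A least-privilege ServiceAccount wired to a narrowly scoped Role might look like this (names are illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata: { name: api, namespace: production }
automountServiceAccountToken: false   # opt in per-Pod only when the API is actually used
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata: { name: api-config-reader, namespace: production }
rules:
  - apiGroups: [""]
    resources: [configmaps]
    verbs: [get, list, watch]         # read-only, one resource type
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata: { name: api-config-reader, namespace: production }
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: api-config-reader
subjects:
  - kind: ServiceAccount
    name: api
    namespace: production
```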
NetworkPolicies — default deny
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny, namespace: production }
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
```

Then add explicit allows per workload (e.g., api can reach db, web can reach api, everything can reach DNS + external API).
Requires a CNI that enforces NetworkPolicy: Calico, Cilium, or Azure CNI. Flannel does not.
See `references/network-policies.md`.
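An explicit allow on top of default-deny — e.g. letting api reach db plus cluster DNS — might look like this (labels and ports are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: api-egress, namespace: production }
spec:
  podSelector:
    matchLabels: { app: api }
  policyTypes: [Egress]
  egress:
    - to: [{ podSelector: { matchLabels: { app: db } } }]   # api -> db only
      ports: [{ protocol: TCP, port: 5432 }]
    - to:                                                   # allow DNS, or nothing resolves
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: kube-system }
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
```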
Admission control — OPA Gatekeeper or Kyverno
Policy-as-code enforcement at cluster admission:
- Deny `:latest` images.
- Require `resources.requests` and `resources.limits`.
- Require readinessProbe.
- Require specific labels (`app.kubernetes.io/*`, `owner`).
- Restrict which registries are allowed.
- Deny hostPath volumes.

Kyverno — YAML-based policies, easier for most teams.
OPA Gatekeeper — Rego-based, more powerful, steeper curve.
See `references/admission-control-opa-kyverno.md`.
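Two of the rules above, sketched as a Kyverno policy (a sketch, not a drop-in: verify the pattern syntax against your Kyverno version and run in audit mode first):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: { name: baseline-workload-rules }
spec:
  validationFailureAction: Enforce
  rules:
    - name: disallow-latest-tag
      match:
        any:
          - resources: { kinds: [Pod] }
      validate:
        message: "Images must use a pinned tag, not :latest."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
    - name: require-requests-limits
      match:
        any:
          - resources: { kinds: [Pod] }
      validate:
        message: "CPU and memory requests/limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests: { cpu: "?*", memory: "?*" }
                  limits: { cpu: "?*", memory: "?*" }
```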
Image scanning
- Trivy or Grype in CI for container images and IaC.
- Fail builds on HIGH/CRITICAL.
- Sign images (cosign) and verify at admission (policy-controller or Kyverno).
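In GitHub Actions, the scan-and-fail step might look like this (registry and action reference are illustrative; pin a release tag in real CI and adapt to your system):

```yaml
- name: Scan image with Trivy
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: registry.example.com/api:${{ github.sha }}
    severity: HIGH,CRITICAL
    exit-code: "1"        # non-zero exit fails the build on findings
    ignore-unfixed: true  # skip vulns with no available fix
```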
Backup — Velero
```bash
velero backup create weekly-$(date +%Y%m%d) --include-namespaces production --ttl 720h
```

- Backup + restore cluster resources and PersistentVolumes.
- Storage in off-cluster bucket (S3/GCS/Azure Blob).
- Scheduled backups with Velero Schedule.
- Regular restore drills — backups you never restore are hope, not a backup.

See `references/backup-velero.md`.
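A scheduled daily backup to the configured off-cluster bucket might look like this (schedule and TTL are illustrative):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata: { name: production-daily, namespace: velero }
spec:
  schedule: "0 2 * * *"        # daily at 02:00
  template:
    includedNamespaces: [production]
    ttl: 720h0m0s              # retain for 30 days
```

Pair it with a recurring restore drill into a scratch namespace or sandbox cluster so the retention window is actually proven.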
Cost control
- Cluster autoscaler (or karpenter on AWS) — nodes scale with demand.
- Right-sizing — VPA in recommend mode, kubectl-cost, kubecost for cost dashboards.
- Spot / preemptible nodes — for stateless, fault-tolerant workloads. Use node taints + tolerations to steer.
- Idle resource alerts — requests far above usage = over-provisioning.
See `references/cost-control.md`.
references/cost-control.md- Cluster autoscaler(AWS上使用karpenter)——节点随需求自动扩缩容。
- 资源优化——使用VPA的recommend模式、kubectl-cost、kubecost查看成本仪表盘。
- Spot/抢占式节点——适用于无状态、容错性高的工作负载。使用节点污点+容忍度进行调度。
- 闲置资源告警——请求远高于实际使用量即表示过度配置。
详情见。
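Steering a stateless Deployment onto spot capacity might look like this Pod-spec fragment (the taint key and label are illustrative conventions, not Kubernetes defaults — use whatever your node groups actually set):

```yaml
# Fragment of a Deployment's Pod template; assumes spot nodes are
# tainted node-role/spot=true:NoSchedule and labeled node-role/spot: "true".
spec:
  template:
    spec:
      tolerations:
        - key: node-role/spot
          operator: Equal
          value: "true"
          effect: NoSchedule   # tolerate the spot taint
      nodeSelector:
        node-role/spot: "true" # and actively prefer those nodes
```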
Anti-patterns
- Relying on default ServiceAccount tokens.
- Mounting every ConfigMap/Secret as env vars (file mounts are more secure, support rotation).
- Using Deployment for stateful workloads.
- HPA on workloads with startup > 60s without startup probe.
- Default-allow NetworkPolicy posture.
- Skipping backup/restore drills.
- No PodDisruptionBudget on critical services.
Read next
- `kubernetes-saas-delivery` — multi-tenant SaaS on K8s, GitOps.
- `observability-monitoring` — SLO design and alert discipline across the stack.
- `reliability-engineering` — incident response + runbooks.
References
- references/helm-vs-kustomize.md
- references/resource-management.md
- references/autoscaling-hpa-vpa.md
- references/stateful-workloads.md
- references/secrets-external-secrets.md
- references/observability-stack.md
- references/rbac-and-pod-security.md
- references/network-policies.md
- references/admission-control-opa-kyverno.md
- references/backup-velero.md
- references/cost-control.md
- references/upgrade-runbook.md
- references/crd-operators.md