kubernetes-production


Kubernetes Production


<!-- dual-compat-start -->

Use When


  • Use when operating production Kubernetes — Helm, autoscaling (HPA/VPA), resource management, StatefulSets, external-secrets, observability (Prometheus/Grafana/Loki), RBAC, Pod Security Standards, NetworkPolicies, admission control, backup (Velero), and cost control.
  • The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.

Do Not Use When


  • The task is unrelated to kubernetes-production or would be better handled by a more specific companion skill.
  • The request only needs a trivial answer and none of this skill's constraints or references materially help.

Required Inputs


  • Gather relevant project context, constraints, and the concrete problem to solve; load `references/` only as needed.
  • Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.

Workflow


  • Read this `SKILL.md` first, then load only the referenced deep-dive files that are necessary for the task.
  • Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
  • Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.

Quality Standards


  • Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
  • Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
  • Prefer deterministic, reviewable steps over vague advice or tool-specific magic.

Anti-Patterns


  • Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
  • Loading every reference file by default instead of using progressive disclosure.

Outputs


  • A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
  • Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
  • References used, companion skills, or follow-up actions when they materially improve execution.

Evidence Produced


| Category | Artifact | Format | Example |
| --- | --- | --- | --- |
| Operability | K8s deployment runbook | Markdown doc per `skill-composition-standards/references/runbook-template.md` covering rollout, scaling, secret rotation, and PDB review | `docs/k8s/deployment-runbook.md` |
| Operability | Resource and autoscaler config note | Markdown doc covering requests/limits, HPA/VPA, and PDB rationale | `docs/k8s/resources-config.md` |

References


  • Use the `references/` directory for deep detail after reading the core workflow below.
<!-- dual-compat-end -->
The operational bar for running K8s in production — not just running Pods, but running them predictably, securely, observably, and cheaply.
Prerequisites: Load `kubernetes-fundamentals` first.

When this skill applies


  • Moving a POC cluster into production.
  • Auditing an existing cluster for production readiness.
  • Adding observability, security, or cost controls.
  • Reviewing Helm charts before deployment.

The production checklist


```text
[ ] Helm (or Kustomize) — not raw manifests
[ ] requests + limits on every container
[ ] HPA on stateless workloads
[ ] PodDisruptionBudget on every workload with >1 replica
[ ] External secrets (Vault / AWS SM / GCP SM) — not inline Secret manifests
[ ] Observability: Prometheus + Grafana + Loki + Alertmanager
[ ] RBAC: least-privilege ServiceAccounts
[ ] Pod Security Standards enforced (restricted for app workloads)
[ ] NetworkPolicies: default-deny + explicit allow
[ ] Admission control: OPA Gatekeeper or Kyverno
[ ] Image scanning in CI (Trivy, Grype)
[ ] Backups: Velero with off-cluster storage
[ ] Cluster autoscaler or karpenter
[ ] SLOs + runbooks per service
```

Request sizing heuristics


Measure first; do not guess. Steady-state and peak come from `kubectl top` plus `container_memory_working_set_bytes` and `rate(container_cpu_usage_seconds_total[5m])` over a representative week.

```text
Memory request  = p95 working set
Memory limit    = p99 working set + 30% headroom
CPU request     = p95 utilisation in cores
CPU limit       = 2x request, or unset (with PriorityClass + tested cluster) for burstable workloads
```
QoS classes (set deliberately):
```text
Guaranteed  requests == limits          -> tier-1 services, evicted last
Burstable   requests < limits           -> default for most apps
BestEffort  no requests/limits          -> never use in production
```
HPA does not work without CPU requests. VPA in `recommend` mode for one week tells you whether you are over- or under-provisioned.
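A recommendation-only VPA can be sketched as follows. This assumes the VPA components are installed in the cluster; the `api` Deployment name is illustrative:

```yaml
# Recommendation-only VPA: updateMode "Off" computes target requests
# without ever evicting or mutating Pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # recommend only; never applies changes
```

Read the computed targets with `kubectl describe vpa api -n production` before changing any requests by hand.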

When to reach for an operator


```text
Stateful system you operate (Postgres, Kafka, Redis, ES) -> install a vetted operator (CloudNativePG, Strimzi, ECK)
Repeated multi-step ops you do by hand                   -> consider an operator
Watch a ConfigMap and bounce Pods                        -> CronJob is enough
Tenant lifecycle in SaaS                                 -> GitOps + ApplicationSet first; operator only if dynamic
```

Rule: install before you build. See `references/crd-operators.md` for the build/install/avoid matrix and CRD hygiene.
text
自行运维的有状态系统(Postgres、Kafka、Redis、ES) -> 安装经过验证的Operator(CloudNativePG、Strimzi、ECK)
手动重复执行的多步运维操作                   -> 考虑使用Operator
监听ConfigMap并重启Pod                        -> CronJob即可满足需求
SaaS中的租户生命周期管理                                 -> 优先使用GitOps + ApplicationSet;仅在需要动态管理时使用Operator
规则:优先安装,而非自行构建。查看
references/crd-operators.md
获取构建/安装/避免矩阵及CRD规范。

Cluster and node upgrades


Upgrades are releases, not chores. Pre-flight: run `kube-no-trouble` / `pluto` against Git for removed APIs, verify a Velero restore in a sandbox, drain-test one canary node.
Order: control plane -> node groups one at a time -> add-ons (CNI first, mesh last). Every workload with replicas > 1 has a PDB; a PDB of `minAvailable: 100%` blocks drain forever. See `references/upgrade-runbook.md`.
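A drain-friendly PDB can be sketched as below; the `api` name and label are illustrative. `maxUnavailable: 1` always leaves room for an eviction, whereas `minAvailable: 100%` (or `minAvailable` equal to the replica count) never does:

```yaml
# Lets a node drain evict one Pod at a time while the rest keep serving.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: api
```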

Helm vs Kustomize


```text
Package to distribute to many users / tenants  -> Helm
Env overlays (dev/staging/prod), single app    -> Kustomize
Complex templating + conditionals              -> Helm
Strict "WYSIWYG" manifests                     -> Kustomize
```

We use Helm for shipped packages and Kustomize for simple in-house env overlays. Never both for the same workload. See `references/helm-vs-kustomize.md`.
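A minimal env-overlay layout under these rules might look like the following; the file paths and `api` Deployment name are illustrative:

```yaml
# overlays/prod/kustomization.yaml — patches a shared base for prod.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: replica-count.yaml   # strategic-merge patch file in this overlay
    target:
      kind: Deployment
      name: api
```

Render with `kubectl kustomize overlays/prod` or apply directly with `kubectl apply -k overlays/prod`.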

Resource management


Requests: guaranteed minimum; the scheduler uses this to place Pods. Limits: cap; exceeded CPU = throttled, exceeded memory = OOMKilled.
Rules:
  • Always set both.
  • Memory: limit = expected peak + headroom (~30%).
  • CPU: request = steady-state; limit = 2–5× request or unset (in carefully-configured clusters).
  • Measure before setting — `kubectl top` and Prometheus `container_memory_working_set_bytes`, `rate(container_cpu_usage_seconds_total[5m])`.
  • Gradually reduce over-provisioning using VPA in recommend mode.
See `references/resource-management.md`.
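Applied to a concrete workload, the rules above might look like this sketch. The image, names, and numbers are illustrative, assuming a measured p95 of roughly 200m CPU and 400Mi memory:

```yaml
# request = steady-state, limit = peak + ~30% headroom -> Burstable QoS.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.4.2   # pinned tag, not :latest
          resources:
            requests: { cpu: 200m, memory: 400Mi }
            limits:   { cpu: 500m, memory: 520Mi }
```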

HPA — Horizontal Pod Autoscaler


```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api, namespace: production }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 60 } }
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies: [{ type: Percent, value: 50, periodSeconds: 60 }]
    scaleUp:
      stabilizationWindowSeconds: 0
      policies: [{ type: Percent, value: 100, periodSeconds: 30 }]
```

  • Target 60–70% CPU utilisation to leave headroom for spikes.
  • `minReplicas: 2` minimum (HA).
  • Scale up fast, scale down slow.
  • For queue-depth-based or custom metrics: install Prometheus Adapter.

See `references/autoscaling-hpa-vpa.md`.

Stateful workloads


StatefulSet + PVC for databases, caches, brokers:
  • Stable names (`pod-0`, `pod-1`).
  • Ordered rollout/rollback.
  • Each Pod has its own PersistentVolume.
  • Headless Service for stable DNS per Pod.
  • PodDisruptionBudget (`minAvailable: N-1`).
  • Anti-affinity across nodes/zones.
  • Backups to object storage (never rely on PV alone).
For databases, strongly consider managed (RDS, Cloud SQL, Neon) before in-cluster. In-cluster DBs make sense only with serious ops maturity.
See `references/stateful-workloads.md`.
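The first four bullets can be sketched together; the `redis` name is illustrative and real deployments should add probes, resources, and anti-affinity:

```yaml
# Headless Service: stable per-Pod DNS (redis-0.redis.production.svc).
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: production
spec:
  clusterIP: None          # headless — no load-balanced VIP
  selector: { app: redis }
  ports: [{ port: 6379 }]
---
# StatefulSet: ordered rollout, stable names, one PV per Pod.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  namespace: production
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels: { app: redis }
  template:
    metadata:
      labels: { app: redis }
    spec:
      containers:
        - name: redis
          image: redis:7
          ports: [{ containerPort: 6379 }]
  volumeClaimTemplates:      # each replica gets its own PVC/PV
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        resources: { requests: { storage: 10Gi } }
```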

External secrets


Never commit `Secret` manifests to Git (even base64 is not encryption). Options:
  • external-secrets operator + Vault / AWS Secrets Manager / GCP Secret Manager / 1Password.
  • Sealed Secrets (Bitnami) — encrypted manifests safe for Git.
  • SOPS + age for GitOps-friendly encryption.
Pattern: external-secrets + cloud secret manager is our default in cloud; SOPS for air-gapped or self-hosted.
See `references/secrets-external-secrets.md`.
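With the external-secrets operator, the default pattern might be sketched as follows. It assumes a `ClusterSecretStore` named `aws-sm` has already been configured; the secret names and remote key are illustrative:

```yaml
# Syncs one value from the cloud secret manager into a K8s Secret;
# only this manifest (no secret material) lives in Git.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-db-credentials
  namespace: production
spec:
  refreshInterval: 1h          # re-sync hourly to pick up rotation
  secretStoreRef:
    name: aws-sm
    kind: ClusterSecretStore
  target:
    name: api-db-credentials   # Secret created/updated in-cluster
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/api/database-url
```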

Observability stack


Minimum stack:
  • Prometheus — metrics scraping + storage (or Mimir/Thanos for long-term).
  • Grafana — dashboards.
  • Loki — logs.
  • Alertmanager — alert routing.
  • OpenTelemetry Collector — receive OTLP, fan out to backends.
  • Tempo or Jaeger — traces.
Install via kube-prometheus-stack Helm chart.
Golden signals per service: latency, traffic, errors, saturation.
See `references/observability-stack.md`.
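Getting a service scraped then comes down to a ServiceMonitor. This sketch assumes kube-prometheus-stack's default behaviour of selecting ServiceMonitors by the Helm `release` label (verify against your values file); the `api` names are illustrative:

```yaml
# Tells the operator-managed Prometheus to scrape the api Service's
# "metrics" port every 30s.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
  namespace: production
  labels:
    release: kube-prometheus-stack   # must match Prometheus' selector
spec:
  selector:
    matchLabels: { app: api }
  endpoints:
    - port: metrics   # named port on the Service
      interval: 30s
```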

RBAC + Pod Security


RBAC principles:
  • One ServiceAccount per workload.
  • No default ServiceAccount usage (set `automountServiceAccountToken: false` when not needed).
  • Roles — least privilege.
  • Audit unused ClusterRoleBindings periodically.
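A least-privilege setup per these principles might be sketched as follows; the `api` workload and the ConfigMap-read permission are illustrative:

```yaml
# Dedicated ServiceAccount; token not mounted unless a container opts in.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
  namespace: production
automountServiceAccountToken: false
---
# Namespaced Role granting only what the workload actually needs.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: api-config-reader
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-config-reader
  namespace: production
subjects:
  - kind: ServiceAccount
    name: api
    namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: api-config-reader
```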
Pod Security Standards — enforce at namespace level:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```
Restricted disallows: privilege escalation, hostPath, hostNetwork, running as root, etc.
See `references/rbac-and-pod-security.md`.

NetworkPolicies — default deny


```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny, namespace: production }
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
```

Then add explicit allows per workload (e.g., api can reach db, web can reach api, everything can reach DNS + external API).
Requires a CNI that enforces NetworkPolicy: Calico, Cilium, or Azure CNI. Flannel does not.
See `references/network-policies.md`.
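An explicit allow for the api-to-db example might be sketched as below; the labels and port are illustrative:

```yaml
# After default-deny: only Pods labelled app=api may reach db on 5432.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: production
spec:
  podSelector:
    matchLabels: { app: db }   # applies to the db Pods
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels: { app: api }
      ports:
        - protocol: TCP
          port: 5432
```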

Admission control — OPA Gatekeeper or Kyverno


Policy-as-code enforcement at cluster admission:
  • Deny `:latest` images.
  • Require `resources.requests` and `resources.limits`.
  • Require readinessProbe.
  • Require specific labels (`app.kubernetes.io/*`, `owner`).
  • Restrict which registries are allowed.
  • Deny hostPath volumes.
Kyverno — YAML-based policies, easier for most teams. OPA Gatekeeper — Rego-based, more powerful, steeper curve.
See `references/admission-control-opa-kyverno.md`.
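As one Kyverno sketch, the first bullet (deny `:latest`) might look like this, following the shape of Kyverno's validate-pattern policies:

```yaml
# Rejects at admission any Pod whose container image uses :latest.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce   # block, don't just audit
  rules:
    - name: validate-image-tag
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Using a mutable image tag (:latest) is not allowed."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```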

Image scanning


  • Trivy or Grype in CI for container images and IaC.
  • Fail builds on HIGH/CRITICAL.
  • Sign images (cosign) and verify at admission (policy-controller or Kyverno).
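In a GitHub Actions pipeline, the fail-on-HIGH/CRITICAL step might be sketched as below. The action name and inputs follow `aquasecurity/trivy-action`; the image ref is illustrative, and other CI systems would call the `trivy` CLI directly:

```yaml
# Fails the build (exit-code 1) on HIGH/CRITICAL findings.
- name: Scan image with Trivy
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: registry.example.com/api:${{ github.sha }}
    severity: HIGH,CRITICAL
    exit-code: "1"
    ignore-unfixed: true   # skip vulns with no available fix
```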

Backup — Velero


```bash
velero backup create weekly-$(date +%Y%m%d) --include-namespaces production --ttl 720h
```

  • Backup + restore cluster resources and PersistentVolumes.
  • Storage in off-cluster bucket (S3/GCS/Azure Blob).
  • Scheduled backups with Velero Schedule.
  • Regular restore drills — backups you never restore are hope, not a backup.

See `references/backup-velero.md`.
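The ad-hoc command above becomes recurring with a Velero `Schedule`; the name and cron expression are illustrative:

```yaml
# Weekly backup of the production namespace, retained 30 days (720h).
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-weekly
  namespace: velero
spec:
  schedule: "0 3 * * 0"        # 03:00 every Sunday
  template:                     # same fields as an ad-hoc Backup spec
    includedNamespaces: [production]
    ttl: 720h
```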

Cost control


  • Cluster autoscaler (or karpenter on AWS) — nodes scale with demand.
  • Right-sizing — VPA in recommend mode, kubectl-cost, kubecost for cost dashboards.
  • Spot / preemptible nodes — for stateless, fault-tolerant workloads. Use node taints + tolerations to steer.
  • Idle resource alerts — requests far above usage = over-provisioning.
See `references/cost-control.md`.
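The taint + toleration steering can be sketched as a Pod-template fragment. The taint key `spot=true:NoSchedule` and the node label are assumptions; actual names depend on your cloud or node provisioner:

```yaml
# Pod template fragment: lands a fault-tolerant worker on spot nodes
# and tolerates the taint that keeps everything else off them.
spec:
  nodeSelector:
    node.example.com/capacity-type: spot
  tolerations:
    - key: spot
      operator: Equal
      value: "true"
      effect: NoSchedule
```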

Anti-patterns


  • Relying on default ServiceAccount tokens.
  • Mounting every ConfigMap/Secret as env vars (file mounts are more secure, support rotation).
  • Using Deployment for stateful workloads.
  • HPA on workloads with startup > 60s without startup probe.
  • Default-allow NetworkPolicy posture.
  • Skipping backup/restore drills.
  • No PodDisruptionBudget on critical services.

Read next


  • kubernetes-saas-delivery — multi-tenant SaaS on K8s, GitOps.
  • observability-monitoring — SLO design and alert discipline across the stack.
  • reliability-engineering — incident response + runbooks.

References


  • references/helm-vs-kustomize.md
  • references/resource-management.md
  • references/autoscaling-hpa-vpa.md
  • references/stateful-workloads.md
  • references/secrets-external-secrets.md
  • references/observability-stack.md
  • references/rbac-and-pod-security.md
  • references/network-policies.md
  • references/admission-control-opa-kyverno.md
  • references/backup-velero.md
  • references/cost-control.md
  • references/upgrade-runbook.md
  • references/crd-operators.md