Kubernetes Production

Use When

Use when operating production Kubernetes — Helm, autoscaling (HPA/VPA), resource management, StatefulSets, external-secrets, observability (Prometheus/Grafana/Loki), RBAC, Pod Security Standards, NetworkPolicies, admission control, backup (Velero), and cost control.
The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.

Do Not Use When

The task is unrelated to
```
kubernetes-production
```
or would be better handled by a more specific companion skill.
The request only needs a trivial answer and none of this skill's constraints or references materially help.

Required Inputs

Gather relevant project context, constraints, and the concrete problem to solve; load
```
references
```
only as needed.
Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.

Workflow

Read this
```
SKILL.md
```
first, then load only the referenced deep-dive files that are necessary for the task.
Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.

Quality Standards

Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
Prefer deterministic, reviewable steps over vague advice or tool-specific magic.

Anti-Patterns

Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
Loading every reference file by default instead of using progressive disclosure.

Outputs

A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
References used, companion skills, or follow-up actions when they materially improve execution.

Evidence Produced

Category	Artifact	Format	Example
Operability	K8s deployment runbook	Markdown doc per `skill-composition-standards/references/runbook-template.md` covering rollout, scaling, secret rotation, and PDB review	`docs/k8s/deployment-runbook.md`
Operability	Resource and autoscaler config note	Markdown doc covering requests/limits, HPA/VPA, and PDB rationale	`docs/k8s/resources-config.md`

References

Use the
```
references/
```
directory for deep detail after reading the core workflow below.

The operational bar for running K8s in production — not just running Pods, but running them predictably, securely, observably, and cheaply.

Prerequisites: Load

kubernetes-fundamentals

first.

When this skill applies

Moving a POC cluster into production.
Auditing an existing cluster for production readiness.
Adding observability, security, or cost controls.
Reviewing Helm charts before deployment.

The production checklist

text

[ ] Helm (or Kustomize) — not raw manifests
[ ] requests + limits on every container
[ ] HPA on stateless workloads
[ ] PodDisruptionBudget on every workload with >1 replica
[ ] External secrets (Vault / AWS SM / GCP SM) — not inline Secret manifests
[ ] Observability: Prometheus + Grafana + Loki + Alertmanager
[ ] RBAC: least-privilege ServiceAccounts
[ ] Pod Security Standards enforced (restricted for app workloads)
[ ] NetworkPolicies: default-deny + explicit allow
[ ] Admission control: OPA Gatekeeper or Kyverno
[ ] Image scanning in CI (Trivy, Grype)
[ ] Backups: Velero with off-cluster storage
[ ] Cluster autoscaler or karpenter
[ ] SLOs + runbooks per service

Request sizing heuristics

Measure first; do not guess. Steady-state and peak come from

kubectl top

plus

container_memory_working_set_bytes

and

rate(container_cpu_usage_seconds_total[5m])

over a representative week.

text

Memory request  = p95 working set
Memory limit    = p99 working set + 30% headroom
CPU request     = p95 utilisation in cores
CPU limit       = 2x request, or unset (with PriorityClass + tested cluster) for burstable workloads

QoS classes (set deliberately):

text

Guaranteed  requests == limits          -> tier-1 services, evicted last
Burstable   requests < limits           -> default for most apps
BestEffort  no requests/limits          -> never use in production

HPA does not work without CPU requests. VPA in

recommend

mode for one week tells you whether you are over- or under-provisioned.

When to reach for an operator

text

Stateful system you operate (Postgres, Kafka, Redis, ES) -> install a vetted operator (CloudNativePG, Strimzi, ECK)
Repeated multi-step ops you do by hand                   -> consider an operator
Watch a ConfigMap and bounce Pods                        -> CronJob is enough
Tenant lifecycle in SaaS                                 -> GitOps + ApplicationSet first; operator only if dynamic

Rule: install before you build. See

references/crd-operators.md

for the build/install/avoid matrix and CRD hygiene.

Cluster and node upgrades

Upgrades are releases, not chores. Pre-flight: run

kube-no-trouble

pluto

against Git for removed APIs, verify a Velero restore in a sandbox, drain-test one canary node.

Order: control plane -> node groups one at a time -> add-ons (CNI first, mesh last). Every workload with replicas > 1 has a PDB; a PDB of

minAvailable: 100%

blocks drain forever. See

references/upgrade-runbook.md

Helm vs Kustomize

text

Package to distribute to many users / tenants  -> Helm
Env overlays (dev/staging/prod), single app    -> Kustomize
Complex templating + conditionals              -> Helm
Strict "WYSIWYG" manifests                     -> Kustomize

We use Helm for shipped packages and Kustomize for simple in-house env overlays. Never both for the same workload.

See

references/helm-vs-kustomize.md

Resource management

Requests: guaranteed minimum; scheduler uses this to place Pods. Limits: cap; exceeded CPU = throttled, exceeded memory = OOMKilled.

Rules:

Always set both.
Memory: limit = expected peak + headroom (~30%).
CPU: request = steady-state; limit = 2–5× request or unset (in carefully-configured clusters).

Measure before setting —

kubectl top

and Prometheus

container_memory_working_set_bytes

rate(container_cpu_usage_seconds_total[5m])

Gradually reduce over-provisioning using VPA in recommend mode.

See

references/resource-management.md

HPA — Horizontal Pod Autoscaler

yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api, namespace: production }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 60 } }
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies: [{ type: Percent, value: 50, periodSeconds: 60 }]
    scaleUp:
      stabilizationWindowSeconds: 0
      policies: [{ type: Percent, value: 100, periodSeconds: 30 }]

Target 60–70% CPU utilisation to leave headroom for spikes.
```
minReplicas: 2
```
minimum (HA).
Scale up fast, scale down slow.
For queue-depth-based or custom metrics: install Prometheus Adapter.

See

references/autoscaling-hpa-vpa.md

Stateful workloads

StatefulSet + PVC for databases, caches, brokers:

Stable names (
```
pod-0
```
,
```
pod-1
```
).
Ordered rollout/rollback.
Each Pod has its own PersistentVolume.
Headless Service for stable DNS per Pod.
PodDisruptionBudget (
```
minAvailable: N-1
```
).
Anti-affinity across nodes/zones.
Backups to object storage (never rely on PV alone).

For databases, strongly consider managed (RDS, Cloud SQL, Neon) before in-cluster. In-cluster DBs make sense only with serious ops maturity.

See

references/stateful-workloads.md

External secrets

Never commit

Secret

manifests to Git (even base64 is not encryption). Options:

external-secrets operator + Vault / AWS Secrets Manager / GCP Secret Manager / 1Password.
Sealed Secrets (Bitnami) — encrypted manifests safe for Git.
SOPS + age for GitOps-friendly encryption.

Pattern: external-secrets + cloud secret manager is our default in cloud; SOPS for air-gapped or self-hosted.

See

references/secrets-external-secrets.md

Observability stack

Minimum stack:

Prometheus — metrics scraping + storage (or Mimir/Thanos for long-term).
Grafana — dashboards.
Loki — logs.
Alertmanager — alert routing.
OpenTelemetry Collector — receive OTLP, fan out to backends.
Tempo or Jaeger — traces.

Install via kube-prometheus-stack Helm chart.

Golden signals per service: latency, traffic, errors, saturation.

See

references/observability-stack.md

RBAC + Pod Security

RBAC principles:

One ServiceAccount per workload.
No default ServiceAccount usage (set
```
automountServiceAccountToken: false
```
when not needed).
Roles — least privilege.
Audit unused ClusterRoleBindings periodically.

Pod Security Standards — enforce at namespace level:

yaml

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Restricted disallows: privilege escalation, hostPath, hostNetwork, running as root, etc.

See

references/rbac-and-pod-security.md

NetworkPolicies — default deny

yaml

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny, namespace: production }
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]

Then add explicit allows per workload (e.g., api can reach db, web can reach api, everything can reach DNS + external API).

Requires a CNI that enforces NetworkPolicy: Calico, Cilium, or Azure CNI. Flannel does not.

See

references/network-policies.md

Admission control — OPA Gatekeeper or Kyverno

Policy-as-code enforcement at cluster admission:

Deny
```
:latest
```
images.
Require
```
resources.requests
```
and
```
resources.limits
```
.
Require readinessProbe.
Require specific labels (
```
app.kubernetes.io/*
```
,
```
owner
```
).
Restrict which registries are allowed.
Deny hostPath volumes.

Kyverno — YAML-based policies, easier for most teams. OPA Gatekeeper — Rego-based, more powerful, steeper curve.

See

references/admission-control-opa-kyverno.md

Image scanning

Trivy or Grype in CI for container images and IaC.
Fail builds on HIGH/CRITICAL.
Sign images (cosign) and verify at admission (policy-controller or Kyverno).

Backup — Velero

bash

velero backup create weekly-$(date +%Y%m%d) --include-namespaces production --ttl 720h

Backup + restore cluster resources and PersistentVolumes.
Storage in off-cluster bucket (S3/GCS/Azure Blob).
Scheduled backups with Velero Schedule.
Regular restore drills — backups you never restore are hope, not a backup.

See

references/backup-velero.md

Cost control

Cluster autoscaler (or karpenter on AWS) — nodes scale with demand.
Right-sizing — VPA in recommend mode, kubectl-cost, kubecost for cost dashboards.
Spot / preemptible nodes — for stateless, fault-tolerant workloads. Use node taints + tolerations to steer.
Idle resource alerts — requests far above usage = over-provisioning.

See

references/cost-control.md

Anti-patterns

Relying on default ServiceAccount tokens.
Mounting every ConfigMap/Secret as env vars (file mounts are more secure, support rotation).
Using Deployment for stateful workloads.
HPA on workloads with startup > 60s without startup probe.
Default-allow NetworkPolicy posture.
Skipping backup/restore drills.
No PodDisruptionBudget on critical services.

References

```
references/helm-vs-kustomize.md
```
```
references/resource-management.md
```
```
references/autoscaling-hpa-vpa.md
```
```
references/stateful-workloads.md
```
```
references/secrets-external-secrets.md
```
```
references/observability-stack.md
```
```
references/rbac-and-pod-security.md
```
```
references/network-policies.md
```

references/admission-control-opa-kyverno.md

```
references/backup-velero.md
```
```
references/cost-control.md
```
```
references/upgrade-runbook.md
```
```
references/crd-operators.md
```

kubernetes-production

NPX Install

Tags

SKILL.md Content

Kubernetes Production

Use When

Do Not Use When

Required Inputs

Workflow

Quality Standards

Anti-Patterns

Outputs

Evidence Produced

References

When this skill applies

The production checklist

Request sizing heuristics

When to reach for an operator

Cluster and node upgrades

Helm vs Kustomize

Resource management

HPA — Horizontal Pod Autoscaler

Stateful workloads

External secrets

Observability stack

RBAC + Pod Security

NetworkPolicies — default deny

Admission control — OPA Gatekeeper or Kyverno

Image scanning

Backup — Velero

Cost control

Anti-patterns

Read next

References