kubernetes-operator

Kubernetes Operator

Build operators that reconcile correctly. Most operator bugs are not Kubernetes bugs — they are reconcile-loop bugs: missing finalizers, blocking calls, no requeue on transient errors, status drift, RBAC over-grants. This skill catches them deterministically before they reach a cluster.

When to use

  • Building a new Kubernetes Operator (controller for a CRD)
  • Reviewing an existing operator for capability-level gaps
  • Auditing a CRD spec for status/conditions/finalizer correctness
  • Choosing a framework (controller-runtime / kubebuilder / operator-sdk / metacontroller / KOPF)
  • Designing the API surface of a Custom Resource
  • Hardening RBAC, leader election, or webhook validation

When NOT to use

  • Plain Helm chart packaging → use helm-chart-builder
  • Standard kubectl operations / blue-green deploys → use senior-devops
  • General k8s security posture → use cloud-security
  • "I want to run a workload" — that's a Deployment / Job, not an operator

Core principle: an operator is a reconcile loop, not a script

observe(actual) → desired = read(spec) → diff(actual, desired) → act → update(status)
                                                                   requeue / done
Operators that fail are the ones that:
  1. Treat reconcile as imperative (do this, then this, then this) instead of declarative (make actual=desired, idempotently)
  2. Don't requeue transient failures
  3. Don't use finalizers, leaving orphan resources
  4. Mutate spec instead of status
  5. Don't use the status subresource (status updates trigger spec reconciles → loop)
  6. Block in reconcile (long HTTP calls, locks)
  7. Forget leader election → split-brain on multi-replica deploys
The 3 tools below catch each of these.
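The loop above can be sketched as a toy model in Python. This is not controller-runtime code: the `Result` type, the `reconcile` signature, and the dict-based "cluster" are hypothetical names chosen only to show the observe, diff, act, update-status shape and why it is idempotent.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical requeue decision (loosely modelled on ctrl.Result)."""
    requeue_after: float = 0.0  # seconds; 0.0 means done until the next event


def reconcile(spec: dict, actual: dict, status: dict) -> Result:
    """Make `actual` match `spec`. Safe to call any number of times."""
    # observe + diff: which desired fields differ from what exists?
    drift = {k: v for k, v in spec.items() if actual.get(k) != v}

    # act: assign the desired value (never "add one more"), so repeats are no-ops
    for key, value in drift.items():
        actual[key] = value

    # update(status): record what was observed; spec is never written
    status["ready"] = not drift
    status["lastDrift"] = sorted(drift)

    # requeue to verify the change on the next pass, or finish
    return Result(requeue_after=5.0) if drift else Result()
```

Calling `reconcile` a second time with the same inputs finds no drift, changes nothing, and returns an empty `Result`, which is the property items 1 to 4 in the list depend on.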

Quick start

bash
SKILL=engineering/kubernetes-operator/skills/kubernetes-operator

# Validate a CRD design
python "$SKILL/scripts/crd_validator.py" --crd config/crd/myapp.yaml

# Lint a Go reconcile function
python "$SKILL/scripts/reconcile_lint.py" --controller controllers/myapp_controller.go

# Score against OperatorHub Capability Levels (1-5)
python "$SKILL/scripts/operator_capability_audit.py" --operator-dir .

The 3 Python tools

All three are stdlib-only. Run any of them with --help for usage.

crd_validator.py

Validates a CRD YAML against operator-pattern best practices.
bash
python scripts/crd_validator.py --crd config/crd/myapp.yaml
python scripts/crd_validator.py --crd config/crd/ --format json
Checks:
  • spec.versions[*].subresources.status is set (status subresource)
  • spec.scope is Namespaced (not Cluster) unless explicitly justified
  • spec.names.singular and spec.names.listKind are defined
  • spec.versions[*].schema.openAPIV3Schema has type definitions (no x-kubernetes-preserve-unknown-fields: true at top level)
  • A version is marked both served: true and storage: true
  • A conditions array is in the schema (allows metav1.Condition entries)
  • Printer columns include Age and Status / Phase
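A few of the checks above can be sketched against a CRD that has already been parsed into a Python dict. This is an illustrative approximation, not the source of crd_validator.py: the function name `check_crd`, the message wording, and the choice of which findings are FAIL versus WARN are all assumptions.

```python
def check_crd(crd: dict) -> list[str]:
    """Return a list of findings for a parsed CustomResourceDefinition."""
    findings: list[str] = []
    spec = crd.get("spec") or {}
    names = spec.get("names") or {}
    versions = spec.get("versions") or []

    if spec.get("scope") != "Namespaced":
        findings.append("WARN spec.scope is not Namespaced")
    for field in ("singular", "listKind"):
        if not names.get(field):
            findings.append(f"WARN spec.names.{field} is not set")
    if not any(v.get("served") and v.get("storage") for v in versions):
        findings.append("FAIL no version is both served and storage")

    for v in versions:
        name = v.get("name", "<unnamed>")
        # the status subresource is declared per version
        if "status" not in (v.get("subresources") or {}):
            findings.append(f"FAIL {name}: status subresource is not enabled")
        schema = (v.get("schema") or {}).get("openAPIV3Schema") or {}
        if schema.get("x-kubernetes-preserve-unknown-fields") is True:
            findings.append(f"WARN {name}: schema root preserves unknown fields")
    return findings
```

Working on a dict keeps the sketch stdlib-only; how the real tool reads YAML is not stated in this document.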

reconcile_lint.py

Lints a Go controller reconcile function for anti-patterns.
bash
python scripts/reconcile_lint.py --controller controllers/myapp_controller.go
Checks (regex-based heuristics):
  • Returns have the (ctrl.Result, error) shape
  • Errors trigger a non-zero requeue (return ctrl.Result{Requeue: true}, err)
  • client.Update() on the spec object is flagged (controllers should update only status)
  • time.Sleep inside reconcile is flagged (use RequeueAfter)
  • HTTP calls without context cancellation are flagged
  • Missing defer after a finalizer add is flagged
  • No IsConditionTrue / SetCondition calls when conditions are present in the CRD
  • Reconcile function exceeds 80 lines (extract subroutines)
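To give a feel for what "regex-based heuristics" means here, the sketch below flags two of the patterns listed above in a Go source string. The `RULES` table, the `lint` function, and both regular expressions are hypothetical; the real reconcile_lint.py may work quite differently.

```python
import re

# (message, pattern) pairs; both patterns are illustrative heuristics
RULES = [
    ("time.Sleep blocks the worker; return RequeueAfter instead",
     re.compile(r"\btime\.Sleep\(")),
    # match .Update( unless it is immediately preceded by Status()
    ("Update() on the whole object; use Status().Update() for status",
     re.compile(r"(?<!Status\(\))\.Update\(")),
]


def lint(go_source: str) -> list[tuple[int, str]]:
    """Return (line number, message) for every rule that matches a line."""
    findings = []
    for lineno, line in enumerate(go_source.splitlines(), start=1):
        if line.lstrip().startswith("//"):
            continue  # skip full-line comments
        for message, pattern in RULES:
            if pattern.search(line):
                findings.append((lineno, message))
    return findings
```

Line-by-line regexes are cheap but blunt: they cannot tell whether an `Update` call touches spec or only metadata (for example when adding a finalizer), so findings from a heuristic like this are prompts for review rather than proofs.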

operator_capability_audit.py

Scores an operator against OperatorHub's 5 Capability Levels.
bash
python scripts/operator_capability_audit.py --operator-dir .
Levels:
  • L1 — Basic Install: CRD defined, controller deploys it
  • L2 — Seamless Upgrades: PDBs, conversion webhooks, version skew strategy
  • L3 — Full Lifecycle: backups, restores, failure recovery
  • L4 — Deep Insights: metrics endpoint, Prometheus rules, alerts
  • L5 — Auto Pilot: auto-scaling, auto-tuning, anomaly detection
Reports current level + concrete next steps to advance one level.
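One plausible way to turn per-level checks into a single score is to treat the levels as cumulative, so a gap at a lower level caps the result. That cumulative rule is an assumption made for this sketch, as are the names `capability_level` and `next_step`; the document does not say how operator_capability_audit.py computes its score.

```python
# Level names as listed above
LEVELS = {
    1: "Basic Install",
    2: "Seamless Upgrades",
    3: "Full Lifecycle",
    4: "Deep Insights",
    5: "Auto Pilot",
}


def capability_level(satisfied: set[int]) -> int:
    """Highest level N such that every level 1..N is satisfied (0 if none)."""
    level = 0
    for n in sorted(LEVELS):
        if n not in satisfied:
            break
        level = n
    return level


def next_step(satisfied: set[int]) -> str:
    """Name the level to work on next."""
    level = capability_level(satisfied)
    if level == max(LEVELS):
        return "Already at L5 (Auto Pilot)"
    return f"Next: L{level + 1} ({LEVELS[level + 1]})"
```

Under this rule an operator with metrics (L4) but no backup story (L3) still scores L2, which matches the "advance one level" framing of the report.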

Tooling landscape

Pick a framework based on language and complexity. See references/tooling_landscape.md.

| Framework | Language | Best for | Maintenance |
|---|---|---|---|
| controller-runtime | Go | Production-grade, low-level control | Active (sig-api-machinery) |
| kubebuilder | Go | Standard scaffolding, opinionated | Active (Kubernetes SIGs) |
| operator-sdk | Go / Helm / Ansible | OpenShift / mixed-paradigm teams | Active (Red Hat) |
| metacontroller | Any (webhook-based) | Polyglot teams, avoiding Go | Less active |
| KOPF | Python | Python shops, async-first | Active (community) |
| java-operator-sdk | Java | JVM shops | Active (Red Hat / Java SIG) |

Decision rules:
  • New operator + Go shop → kubebuilder
  • New operator + Python shop → KOPF
  • New operator + can't pick a language → metacontroller
  • OpenShift target → operator-sdk

CRD design principles

See references/crd_design.md for full detail. Quick rules:
  1. status is the source of truth for the controller's view of the world. Spec is what the user wants; status is what the controller observed.
  2. Use the status subresource. Without it, every status write is an ordinary update to the object and re-triggers reconcile (loop).
  3. Use Conditions: Ready, Reconciling, Degraded. Each carries a reason and message.
  4. Add finalizers. Without finalizers, deletion races the controller and orphans external resources.
  5. Version your CRD from day 1: v1alpha1 → v1beta1 → v1. Plan a conversion webhook.
  6. Validate via OpenAPI v3 schema. Don't rely on the controller for validation that should fail at admission.
  7. Use additionalPrinterColumns for kubectl get. Show Age, Phase, Ready at minimum.
  8. Namespace your CRDs unless they manage cluster-scoped resources.
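Rule 3 is easier to apply with a small helper that keeps one entry per condition type and only moves the transition timestamp when the status actually flips. The sketch below models conditions as plain dicts; `set_condition` and its `now` parameter are hypothetical names, and the behaviour is written to resemble the usual Kubernetes condition convention rather than to reproduce any library.

```python
from datetime import datetime, timezone
from typing import Optional


def set_condition(conditions: list, type_: str, status: str,
                  reason: str, message: str, now: Optional[str] = None) -> None:
    """Insert or update the condition of the given type, in place."""
    now = now or datetime.now(timezone.utc).isoformat()
    for cond in conditions:
        if cond["type"] == type_:
            if cond["status"] != status:
                # only a status flip counts as a transition
                cond["status"] = status
                cond["lastTransitionTime"] = now
            cond["reason"] = reason
            cond["message"] = message
            return
    conditions.append({
        "type": type_,
        "status": status,
        "reason": reason,
        "message": message,
        "lastTransitionTime": now,
    })
```

Keeping `lastTransitionTime` stable across repeated reconciles means a status write with unchanged conditions produces an identical object, which fits the idempotency goal.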

Reconcile loop principles

See references/reconcile_loop.md for full detail. Quick rules:
  1. Idempotent. Reconciling the same state twice → same result, no extra side effects.
  2. Read once, decide, act. Don't observe the world repeatedly during reconcile.
  3. Update status, not spec. Spec belongs to the user.
  4. Return errors that requeue. Use ctrl.Result{RequeueAfter: ...} for known transient cases.
  5. Never block. No time.Sleep. No long HTTP calls without context.
  6. Use the cache. Read via the controller's cached client; only escape the cache for a specific reason.
  7. Leader-elect when running >1 replica. Otherwise run a single replica.
  8. Set OwnerReferences. Cascading deletion is the operator pattern's free gift.
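Rules 4 and 5 amount to "return a delay instead of waiting". A minimal Python model of that decision is shown below; `TransientError`, `next_requeue`, and the specific delays are hypothetical, and the float return value stands in for a RequeueAfter duration.

```python
class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, 503s)."""


def next_requeue(get_backend_state) -> float:
    """Return seconds until the next reconcile; 0.0 means done."""
    try:
        state = get_backend_state()   # one read per reconcile
    except TransientError:
        return 10.0                   # retry soon; do not loop or sleep here
    if state == "provisioning":
        return 30.0                   # the requeue analogue of sleeping 30s
    return 0.0
```

The function returns immediately in every branch, so the worker that called it is free to process other objects while this one waits in the queue.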

Workflows

Workflow 1: Bootstrap a new operator (Go + kubebuilder)

1. Pick a Group/Version/Kind: e.g., apps.example.com/v1alpha1, kind=MyApp
2. kubebuilder init --domain example.com --repo github.com/org/myapp-operator
3. kubebuilder create api --group apps --version v1alpha1 --kind MyApp
4. Run crd_validator.py on config/crd/bases/apps.example.com_myapps.yaml
   → Fix every WARN before writing controller code
5. Implement the reconcile function (Karpathy principle 2: simplest correct version first)
6. Run reconcile_lint.py on controllers/myapp_controller.go
7. Run operator_capability_audit.py --operator-dir . — confirm L1
8. Test in a kind cluster: kubectl apply -f config/samples/
9. Add status conditions; aim for L2 in the same PR

Workflow 2: Audit an existing operator

1. Run operator_capability_audit.py --operator-dir <path>
2. Run crd_validator.py --crd config/crd/
3. Run reconcile_lint.py --controller controllers/
4. Triage findings:
   - FAIL → block release; fix before next deploy
   - WARN → file an issue; fix in next 30 days
5. Document current capability level in README; commit
6. Plan one capability level advancement per quarter

Workflow 3: Choose a framework

1. Identify primary language constraint (team skill)
2. Identify deployment target (vanilla k8s vs OpenShift)
3. Identify operator complexity (single CRD vs multi-CRD vs cluster-wide)
4. Cross-reference with references/tooling_landscape.md
5. Build a 1-week proof-of-concept before committing

References

  • references/operator_pattern.md: what an operator IS, when to use vs alternatives
  • references/crd_design.md: CRD design principles, versioning, conversion webhooks
  • references/reconcile_loop.md: reconcile patterns, error handling, idempotency
  • references/tooling_landscape.md: framework comparison + decision tree

Slash command

/operator-audit: run all 3 tools on an operator repo and produce a markdown report.

Asset templates

  • assets/crd_template.yaml: CRD with status subresource, conditions, finalizer hint, printer columns
  • assets/reconcile_skeleton.go: Go controller reconcile function with idempotency, conditions, finalizers, requeue patterns

Anti-patterns

  • time.Sleep(30 * time.Second) inside reconcile: blocks the worker and delays other reconciles. Use RequeueAfter.
  • r.Client.Update(ctx, obj) to set status: use r.Status().Update(ctx, obj) instead.
  • No leader election + 2+ replicas: split-brain.
  • No finalizer: external resources are orphaned on deletion.
  • CRD without status subresource: status updates trigger spec reconciles (infinite loop).
  • Reconcile function > 200 lines: extract reconcileXxx subroutines per condition.
  • x-kubernetes-preserve-unknown-fields: true on spec root: defeats validation.
  • Imperative reconcile: "if creating, do A; if updating, do B; if deleting, do C". Wrong shape. Reconcile = make actual=desired, regardless of how we got here.
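The "no finalizer" item above is the one most often fixed with a small, fixed-shape branch at the top of reconcile. The sketch below models an object as a dict with Kubernetes-style `metadata`; the finalizer name, the `reconcile_finalizer` function, and the `delete_external` callback are hypothetical, and a real controller would persist each metadata change through the API server.

```python
FINALIZER = "apps.example.com/cleanup"  # hypothetical finalizer name


def reconcile_finalizer(obj: dict, delete_external) -> str:
    """Ensure the finalizer while the object lives; clean up once on deletion."""
    meta = obj["metadata"]
    finalizers = meta.setdefault("finalizers", [])

    if meta.get("deletionTimestamp") is None:
        # object is live: make sure deletion will wait for our cleanup
        if FINALIZER not in finalizers:
            finalizers.append(FINALIZER)
            return "finalizer-added"
        return "active"

    if FINALIZER in finalizers:
        # object is being deleted: release external resources, then let go
        delete_external(meta["name"])   # should itself be idempotent
        finalizers.remove(FINALIZER)
        return "cleaned-up"
    return "gone"
```

Note that the branch is driven by observed state (is a deletion timestamp set, is the finalizer present) rather than by which event arrived, which is the declarative shape the last bullet asks for.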

Verifiable success

A team using this skill should achieve:
  • 100% of new CRDs pass crd_validator.py before merge
  • All reconcile functions pass reconcile_lint.py in strict mode
  • Operators reach OperatorHub Capability Level 3 (Full Lifecycle) before public release
  • Mean time to fix a reconcile bug: <1 day (no infinite loops in production)