kubernetes-operator
Kubernetes Operator
Build operators that reconcile correctly. Most operator bugs are not Kubernetes bugs — they are reconcile-loop bugs: missing finalizers, blocking calls, no requeue on transient errors, status drift, RBAC over-grants. This skill catches them deterministically before they reach a cluster.
When to use
- Building a new Kubernetes Operator (controller for a CRD)
- Reviewing an existing operator for capability-level gaps
- Auditing a CRD spec for status/conditions/finalizer correctness
- Choosing a framework (controller-runtime / kubebuilder / operator-sdk / metacontroller / KOPF)
- Designing the API surface of a Custom Resource
- Hardening RBAC, leader election, or webhook validation
When NOT to use
- Plain Helm chart packaging → use helm-chart-builder
- Standard kubectl operations / blue-green deploys → use senior-devops
- General k8s security posture → use cloud-security
- "I want to run a workload": that's a Deployment / Job, not an operator
Core principle: an operator is a reconcile loop, not a script
observe(actual) → desired = read(spec) → diff(actual, desired) → act → update(status)
                                                                          ↓
                                                                   requeue / done
Operators that fail are the ones that:
- Treat reconcile as imperative (do this, then this, then this) instead of declarative (make actual=desired, idempotently)
- Don't requeue transient failures
- Don't use finalizers, leaving orphan resources
- Mutate spec instead of status
- Don't use the status subresource (without it, every status write also bumps metadata.generation, so the controller cannot filter out its own updates → loop)
- Block in reconcile (long HTTP calls, locks)
- Forget leader election → split-brain on multi-replica deploys
The 3 tools below catch each of these.
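The loop above can be sketched in a few lines. This is a minimal illustration in Python, not code from the skill: FakeCluster, Result, and the replicas field are hypothetical stand-ins for the API server, ctrl.Result, and a real spec. It shows the declarative shape: the function compares actual to desired, acts only on a difference, writes status (never spec), and is safe to call repeatedly.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    requeue_after: float = 0.0  # seconds; 0 means "done until the next event"


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for the API server."""
    replicas: dict = field(default_factory=dict)  # name -> actual replica count
    status: dict = field(default_factory=dict)    # name -> status written by the controller
    writes: int = 0                               # counts mutating calls


def reconcile(cluster: FakeCluster, name: str, spec: dict) -> Result:
    actual = cluster.replicas.get(name, 0)   # observe(actual)
    desired = spec["replicas"]               # desired = read(spec)
    if actual != desired:                    # diff(actual, desired)
        cluster.replicas[name] = desired     # act
        cluster.writes += 1
        cluster.status[name] = {"phase": "Progressing", "observedReplicas": actual}
        return Result(requeue_after=5.0)     # requeue: look again shortly
    cluster.status[name] = {"phase": "Ready", "observedReplicas": actual}
    return Result()                          # done


cluster = FakeCluster()
spec = {"replicas": 3}
first = reconcile(cluster, "myapp", spec)   # acts, asks for a requeue
second = reconcile(cluster, "myapp", spec)  # nothing left to do
third = reconcile(cluster, "myapp", spec)   # same answer again: idempotent
```

Calling reconcile a second and third time performs no further writes, which is the property the imperative "do this, then this" style tends to lose.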
Quick start
bash
SKILL=engineering/kubernetes-operator/skills/kubernetes-operator

# Validate a CRD design
python "$SKILL/scripts/crd_validator.py" --crd config/crd/myapp.yaml

# Lint a Go reconcile function
python "$SKILL/scripts/reconcile_lint.py" --controller controllers/myapp_controller.go

# Score against OperatorHub Capability Levels (1-5)
python "$SKILL/scripts/operator_capability_audit.py" --operator-dir .
The 3 Python tools
All stdlib-only. Run with --help.
crd_validator.py
Validates a CRD YAML against operator-pattern best practices.
bash
python scripts/crd_validator.py --crd config/crd/myapp.yaml
python scripts/crd_validator.py --crd config/crd/ --format json
Checks:
- spec.versions[*].subresources.status is set (status subresource)
- spec.scope is Namespaced (not Cluster) unless explicitly justified
- Singular and listKind defined
- spec.versions[*].schema.openAPIV3Schema has type definitions (no x-kubernetes-preserve-unknown-fields: true at top level)
- A version is marked served: true AND storage: true
- Conditions array is in the schema (allows metav1.Condition entries)
- Printer columns include Age and Status/Phase
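To make the checks concrete, here is a rough sketch of how a few of them could be expressed. It is not the implementation of crd_validator.py: the function name check_crd, the finding strings, and the WARN/FAIL split are assumptions, and the CRD is taken as an already-parsed dict because the standard library has no YAML parser.

```python
def check_crd(crd: dict) -> list[str]:
    """Return findings for a CRD that has already been parsed into a dict."""
    findings = []
    spec = crd.get("spec", {})
    versions = spec.get("versions", [])

    if spec.get("scope") != "Namespaced":
        findings.append("WARN: spec.scope is not Namespaced")
    names = spec.get("names", {})
    for key in ("singular", "listKind"):
        if not names.get(key):
            findings.append(f"WARN: spec.names.{key} is missing")
    if not any(v.get("served") and v.get("storage") for v in versions):
        findings.append("FAIL: no version is both served and storage")
    for v in versions:
        label = v.get("name", "?")
        if "status" not in v.get("subresources", {}):
            findings.append(f"FAIL: {label}: status subresource not enabled")
        schema = v.get("schema", {}).get("openAPIV3Schema", {})
        if schema.get("x-kubernetes-preserve-unknown-fields") is True:
            findings.append(f"FAIL: {label}: preserve-unknown-fields at schema root")
    return findings


good = {
    "spec": {
        "scope": "Namespaced",
        "names": {"singular": "myapp", "listKind": "MyAppList"},
        "versions": [{
            "name": "v1alpha1", "served": True, "storage": True,
            "subresources": {"status": {}},
            "schema": {"openAPIV3Schema": {"type": "object"}},
        }],
    }
}
bad = {"spec": {"scope": "Cluster", "versions": [{"name": "v1alpha1", "served": True}]}}

good_findings = check_crd(good)
bad_findings = check_crd(bad)
```

Here good yields no findings, while bad is reported for its scope, missing names, missing storage version, and missing status subresource.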
reconcile_lint.py
Lints a Go controller reconcile function for anti-patterns.
bash
python scripts/reconcile_lint.py --controller controllers/myapp_controller.go
Checks (regex-based heuristics):
- Returns are (ctrl.Result, error) shape
- Errors trigger a non-zero requeue (return ctrl.Result{Requeue: true}, err)
- client.Update() on the spec object is flagged (controllers should update only status)
- time.Sleep inside reconcile is flagged (use RequeueAfter)
- HTTP calls without context cancellation are flagged
- Missing defer after a finalizer add
- No IsConditionTrue / SetCondition calls when conditions present in CRD
- Reconcile function exceeds 80 lines (extract subroutines)
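A regex-based heuristic of this kind might look like the sketch below. The rule table, rule ids, and messages are invented for illustration and are not taken from reconcile_lint.py. Only two of the checks are shown, and the comment stripping is deliberately crude (it would also cut a "//" inside a string literal).

```python
import re

# Hypothetical rule table: (rule id, pattern, message).
RULES = [
    ("sleep-in-reconcile", re.compile(r"\btime\.Sleep\("),
     "use RequeueAfter instead of time.Sleep"),
    ("update-not-status", re.compile(r"\br\.(?:Client\.)?Update\("),
     "prefer r.Status().Update for status writes"),
]


def lint_go_source(source: str) -> list[tuple[int, str, str]]:
    """Return (line number, rule id, message) for every heuristic match."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("//", 1)[0]  # crude: drop trailing line comments
        for rule_id, pattern, message in RULES:
            if pattern.search(code):
                findings.append((lineno, rule_id, message))
    return findings


SAMPLE = """\
func (r *MyAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    time.Sleep(30 * time.Second)
    if err := r.Client.Update(ctx, obj); err != nil {
        return ctrl.Result{}, err
    }
    // time.Sleep(1) in a comment is ignored
    return ctrl.Result{}, r.Status().Update(ctx, obj)
}
"""

findings = lint_go_source(SAMPLE)
hits = [(lineno, rule_id) for lineno, rule_id, _ in findings]
```

Because the rules are plain patterns, they cannot tell a spec write from a legitimate metadata update (adding a finalizer, for instance), which is why such findings are best treated as prompts for review rather than proof of a bug.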
operator_capability_audit.py
Scores an operator against OperatorHub's 5 Capability Levels.
bash
python scripts/operator_capability_audit.py --operator-dir .
Levels:
- L1 — Basic Install: CRD defined, controller deploys it
- L2 — Seamless Upgrades: PDBs, conversion webhooks, version skew strategy
- L3 — Full Lifecycle: backups, restores, failure recovery
- L4 — Deep Insights: metrics endpoint, Prometheus rules, alerts
- L5 — Auto Pilot: auto-scaling, auto-tuning, anomaly detection
Reports current level + concrete next steps to advance one level.
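One plausible way to turn per-level checks into a single score is sketched below. It assumes the levels are cumulative (a gap at L3 caps the score at L2 even if L4 checks pass). That is an assumption about the scoring rule, not a description of operator_capability_audit.py, and capability_level is a hypothetical name.

```python
from typing import Optional

LEVELS = [
    (1, "Basic Install"),
    (2, "Seamless Upgrades"),
    (3, "Full Lifecycle"),
    (4, "Deep Insights"),
    (5, "Auto Pilot"),
]


def capability_level(passed: set[int]) -> tuple[int, Optional[str]]:
    """Return (current level, name of the next level to work on)."""
    current = 0
    for number, _name in LEVELS:
        if number not in passed:
            break  # levels are treated as cumulative: stop at the first gap
        current = number
    next_name = dict(LEVELS).get(current + 1)  # None once L5 is reached
    return current, next_name


partial = capability_level({1, 2, 4})      # L3 is missing, so L4 does not count
nothing = capability_level(set())
complete = capability_level({1, 2, 3, 4, 5})
```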
Tooling landscape
Pick a framework based on language and complexity. See references/tooling_landscape.md.
| Framework | Language | Best for | Maintenance |
|---|---|---|---|
| controller-runtime | Go | Production-grade, low-level control | Active (sig-api-machinery) |
| kubebuilder | Go | Standard scaffolding, opinionated | Active (Kubernetes SIGs) |
| operator-sdk | Go / Helm / Ansible | OpenShift / mixed-paradigm teams | Active (Red Hat) |
| metacontroller | Any (webhook-based) | Polyglot teams, avoiding Go | Less active |
| KOPF | Python | Python shops, async-first | Active (community) |
| java-operator-sdk | Java | JVM shops | Active (Red Hat / Java SIG) |
Decision rules:
- New operator + Go shop → kubebuilder
- New operator + Python shop → KOPF
- New operator + can't pick a language → metacontroller
- OpenShift target → operator-sdk
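The decision rules can be written down as a small lookup. In this sketch the function name choose_framework is hypothetical, the Java entry is taken from the table rather than from the rules, and letting the OpenShift rule win over the language rule is an assumption the text does not spell out.

```python
from typing import Optional


def choose_framework(language: Optional[str], openshift: bool = False) -> str:
    """Map the decision rules to a framework name."""
    if openshift:                 # assumed to take precedence over language
        return "operator-sdk"
    if language is None:          # "can't pick a language"
        return "metacontroller"
    by_language = {
        "go": "kubebuilder",
        "python": "KOPF",
        "java": "java-operator-sdk",  # from the table, not the decision rules
    }
    # Languages not listed fall back to the webhook-based option (assumption).
    return by_language.get(language.lower(), "metacontroller")


go_pick = choose_framework("Go")
python_pick = choose_framework("python")
undecided_pick = choose_framework(None)
openshift_pick = choose_framework("Go", openshift=True)
```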
CRD design principles
See references/crd_design.md for full detail. Quick rules:
- status is the source of truth for the controller's view of the world. Spec is what the user wants; status is what the controller observed.
- Use the status subresource. Without it, status writes bump metadata.generation and re-trigger reconcile (loop).
- Use Conditions. Ready, Reconciling, Degraded. Each carries a reason and message.
- Add finalizers. Without finalizers, deletion races the controller and orphans external resources.
- Version your CRD from day 1. v1alpha1 → v1beta1 → v1. Plan a conversion webhook.
- Validate via OpenAPI v3 schema. Don't rely on the controller for validation that should fail at admission.
- Use additionalPrinterColumns for kubectl get. Show Age, Phase, Ready at minimum.
- Namespace your CRDs unless they manage cluster-scoped resources.
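The Conditions rule is easier to follow with a worked example. The sketch below models a condition as a plain dict and upserts it the way Go controllers commonly do with meta.SetStatusCondition: lastTransitionTime moves only when the status value flips. The helper set_condition and the reason strings are hypothetical, and only a subset of the real metav1.Condition fields is shown.

```python
from datetime import datetime, timezone


def set_condition(conditions: list, ctype: str, status: str, reason: str,
                  message: str, now: datetime) -> bool:
    """Upsert a condition in place; return True if anything changed."""
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    for cond in conditions:
        if cond["type"] != ctype:
            continue
        changed = False
        if cond["status"] != status:
            cond["status"] = status
            cond["lastTransitionTime"] = stamp  # only moves on a status flip
            changed = True
        if cond["reason"] != reason or cond["message"] != message:
            cond["reason"], cond["message"] = reason, message
            changed = True
        return changed
    conditions.append({"type": ctype, "status": status, "reason": reason,
                       "message": message, "lastTransitionTime": stamp})
    return True


conds = []
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)
a = set_condition(conds, "Ready", "False", "Provisioning", "waiting for database", t0)
b = set_condition(conds, "Ready", "False", "Provisioning", "waiting for database", t1)
after_b = conds[0]["lastTransitionTime"]
c = set_condition(conds, "Ready", "True", "Available", "all replicas ready", t1)
after_c = conds[0]["lastTransitionTime"]
```

The boolean return value lets a controller skip the status write entirely when nothing changed, which keeps repeated reconciles from generating needless API traffic.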
Reconcile loop principles
See references/reconcile_loop.md for full detail. Quick rules:
- Idempotent. Reconciling the same state twice → same result, no additional side effects.
- Read once, decide, act. Don't observe the world repeatedly during reconcile.
- Update status, not spec. Spec belongs to the user.
- Return errors that requeue. Use ctrl.Result{RequeueAfter: ...} for known transient cases.
- Never block. No time.Sleep. No long HTTP calls without context.
- Use the cache. Read via the controller's cached client; only escape the cache for a specific reason.
- Leader-elect when running >1 replica. Otherwise run a single replica.
- Set OwnerReferences. Cascading deletion is the operator pattern's free gift.
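The "requeue, never block" rules can be illustrated with a small model. TransientError, Result, backoff, and run_step are hypothetical names, and the delay values are arbitrary. In a real Go controller, returning a non-nil error already causes a rate-limited retry, so an explicit RequeueAfter is mainly useful when the wait time is known.

```python
from dataclasses import dataclass


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 429s, 503s)."""


@dataclass
class Result:
    requeue_after: float = 0.0  # seconds; stands in for ctrl.Result.RequeueAfter


def backoff(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential delay in seconds, capped so retries never drift too far apart."""
    return min(cap, base * 2 ** attempt)


def run_step(step, attempt: int) -> Result:
    """Run one reconcile step; turn transient failures into a delayed requeue."""
    try:
        step()
    except TransientError:
        return Result(requeue_after=backoff(attempt))  # come back later, no sleep
    return Result()  # success; any other exception propagates to the caller


def flaky():
    raise TransientError("upstream returned 503")


def healthy():
    return None


retry = run_step(flaky, attempt=2)
done = run_step(healthy, attempt=0)
```

The worker returns immediately in both cases; the delay is carried in the result instead of being spent inside the function.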
Workflows
Workflow 1: Bootstrap a new operator (Go + kubebuilder)
1. Pick a Group/Version/Kind: e.g., apps.example.com/v1alpha1, kind=MyApp
2. kubebuilder init --domain example.com --repo github.com/org/myapp-operator
3. kubebuilder create api --group apps --version v1alpha1 --kind MyApp
4. Run make manifests to generate the CRD, then run crd_validator.py on config/crd/bases/apps.example.com_myapps.yaml
→ Fix every WARN before writing controller code
5. Implement the reconcile function (Karpathy principle 2: simplest correct version first)
6. Run reconcile_lint.py on controllers/myapp_controller.go
7. Run operator_capability_audit.py --operator-dir . — confirm L1
8. Test in a kind cluster: kubectl apply -f config/samples/
9. Add status conditions; aim for L2 in the same PR
Workflow 2: Audit an existing operator
1. Run operator_capability_audit.py --operator-dir <path>
2. Run crd_validator.py --crd config/crd/
3. Run reconcile_lint.py --controller controllers/
4. Triage findings:
- FAIL → block release; fix before next deploy
- WARN → file an issue; fix in next 30 days
5. Document current capability level in README; commit
6. Plan one capability level advancement per quarter
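The triage rule in step 4 lends itself to a CI gate. The sketch below assumes each finding is a string prefixed with FAIL or WARN. That output format and the function name triage are assumptions, not the documented behaviour of the three tools.

```python
def triage(findings: list[str]) -> tuple[int, list[str]]:
    """Return (exit code, warnings to file as issues) for a list of findings."""
    fails = [f for f in findings if f.startswith("FAIL")]
    warns = [f for f in findings if f.startswith("WARN")]
    exit_code = 1 if fails else 0  # any FAIL blocks the release
    return exit_code, warns


blocked = triage([
    "FAIL: v1alpha1: status subresource not enabled",
    "WARN: spec.scope is not Namespaced",
])
clean = triage(["WARN: spec.names.singular is missing"])
```

A pipeline could exit with the returned code and open one issue per returned warning.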
Workflow 3: Choose a framework
1. Identify primary language constraint (team skill)
2. Identify deployment target (vanilla k8s vs OpenShift)
3. Identify operator complexity (single CRD vs multi-CRD vs cluster-wide)
4. Cross-reference with references/tooling_landscape.md
5. Build a 1-week proof-of-concept before committing
References
- references/operator_pattern.md: what an operator IS, when to use vs alternatives
- references/crd_design.md: CRD design principles, versioning, conversion webhooks
- references/reconcile_loop.md: reconcile patterns, error handling, idempotency
- references/tooling_landscape.md: framework comparison + decision tree
Slash command
/operator-audit
Asset templates
- assets/crd_template.yaml: CRD with status subresource, conditions, finalizer hint, printer columns
- assets/reconcile_skeleton.go: Go controller reconcile function with idempotency, conditions, finalizers, requeue patterns
Anti-patterns
- time.Sleep(30 * time.Second) inside reconcile: blocks other reconciles. Use RequeueAfter.
- r.Client.Update(ctx, obj) to set status: use r.Status().Update(ctx, obj) instead.
- No leader election + 2+ replicas: split-brain.
- No finalizer: external resources orphan on deletion.
- CRD without status subresource: status updates trigger spec reconciles (infinite loop).
- Reconcile function > 200 lines: extract reconcileXxx subroutines per condition.
- x-kubernetes-preserve-unknown-fields: true on spec root: defeats validation.
- Imperative reconcile: "if creating, do A; if updating, do B; if deleting, do C". Wrong shape. Reconcile = make actual=desired, regardless of how we got here.
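The "no finalizer" anti-pattern is avoided with a short, fixed sequence at the top of reconcile. The sketch below models the object as a dict. The finalizer name and the helper handle_finalizer are hypothetical, and delete_external stands for whatever cleanup the operator owns.

```python
FINALIZER = "apps.example.com/cleanup"  # hypothetical finalizer name


def handle_finalizer(obj: dict, delete_external) -> str:
    """Add the finalizer on live objects; clean up and release on deletion."""
    meta = obj.setdefault("metadata", {})
    finalizers = meta.setdefault("finalizers", [])
    if meta.get("deletionTimestamp"):
        if FINALIZER in finalizers:
            delete_external(obj)          # must itself be idempotent
            finalizers.remove(FINALIZER)  # lets the API server finish the delete
            return "finalized"
        return "nothing-to-do"
    if FINALIZER not in finalizers:
        finalizers.append(FINALIZER)
        return "finalizer-added"
    return "continue"


deleted = []
cleanup = lambda o: deleted.append(o["metadata"]["name"])
obj = {"metadata": {"name": "myapp"}}

s1 = handle_finalizer(obj, cleanup)  # first pass registers the finalizer
s2 = handle_finalizer(obj, cleanup)  # steady state
obj["metadata"]["deletionTimestamp"] = "2024-01-01T12:00:00Z"
s3 = handle_finalizer(obj, cleanup)  # deletion requested: clean up, then release
s4 = handle_finalizer(obj, cleanup)  # a repeated call does no further work
```

The same function handles create, update, and delete without branching on how the object got into its current state, which is the declarative shape the last bullet asks for.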
Verifiable success
A team using this skill should achieve:
- 100% of new CRDs pass crd_validator.py before merge
- All reconcile functions pass reconcile_lint.py strict mode
- Operators reach OperatorHub Capability Level 3 (Full Lifecycle) before public release
- Mean time to fix a reconcile bug: <1 day (no infinite loops in production)