kubernetes-operator

Kubernetes Operator

Build operators that reconcile correctly. Most operator bugs are not Kubernetes bugs — they are reconcile-loop bugs: missing finalizers, blocking calls, no requeue on transient errors, status drift, RBAC over-grants. This skill catches them deterministically before they reach a cluster.

When to use

Building a new Kubernetes Operator (controller for a CRD)
Reviewing an existing operator for capability-level gaps
Auditing a CRD spec for status/conditions/finalizer correctness
Choosing a framework (controller-runtime / kubebuilder / operator-sdk / metacontroller / KOPF)
Designing the API surface of a Custom Resource
Hardening RBAC, leader election, or webhook validation

When NOT to use

Plain Helm chart packaging → use `helm-chart-builder`
Standard kubectl operations / blue-green deploys → use `senior-devops`
General k8s security posture → use `cloud-security`
"I want to run a workload" is a Deployment / Job, not an operator

Core principle: an operator is a reconcile loop, not a script

observe(actual) → desired = read(spec) → diff(actual, desired) → act → update(status)
                                                                          ↓
                                                                   requeue / done

Operators that fail are the ones that:

Treat reconcile as imperative (do this, then this, then this) instead of declarative (make actual=desired, idempotently)
Don't requeue transient failures
Don't use finalizers, leaving orphan resources
Mutate spec instead of status
Don't use the status subresource (status updates trigger spec reconciles → loop)
Block in reconcile (long HTTP calls, locks)
Forget leader election → split-brain on multi-replica deploys

The 3 tools below catch each of these.
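
The observe → diff → act → update(status) shape above can be illustrated with a small in-memory model. This is only a sketch: `FakeCluster`, `Result`, and the 5-second requeue are hypothetical names and values chosen for illustration, not part of this skill or of controller-runtime.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    """Outcome of one reconcile pass (loosely modeled on ctrl.Result)."""
    requeue_after: float = 0.0  # seconds; 0 means "done until the next event"


@dataclass
class FakeCluster:
    """Hypothetical stand-in for the API server: maps name -> replica count."""
    deployments: dict = field(default_factory=dict)
    writes: int = 0  # counts mutating calls so idempotency is observable

    def apply(self, name: str, replicas: int) -> None:
        self.deployments[name] = replicas
        self.writes += 1


def reconcile(spec: dict, status: dict, cluster: FakeCluster) -> Result:
    """Drive actual state toward spec; safe to call any number of times."""
    name, desired = spec["name"], spec["replicas"]
    actual = cluster.deployments.get(name)      # observe
    if actual != desired:                       # diff
        cluster.apply(name, desired)            # act
        status["phase"] = "Reconciling"
        return Result(requeue_after=5.0)        # look again shortly
    status["phase"] = "Ready"                   # write status, never spec
    return Result()


cluster = FakeCluster()
spec, status = {"name": "myapp", "replicas": 3}, {}
first = reconcile(spec, status, cluster)   # creates the deployment, requeues
second = reconcile(spec, status, cluster)  # nothing to do, marks Ready
third = reconcile(spec, status, cluster)   # still nothing to do
```

Calling `reconcile` repeatedly performs exactly one write, which is the property the imperative "do this, then this" style tends to lose.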

Quick start

bash

SKILL=engineering/kubernetes-operator/skills/kubernetes-operator

Validate a CRD design

python "$SKILL/scripts/crd_validator.py" --crd config/crd/myapp.yaml

Lint a Go reconcile function

python "$SKILL/scripts/reconcile_lint.py" --controller controllers/myapp_controller.go

Score against OperatorHub Capability Levels (1-5)

python "$SKILL/scripts/operator_capability_audit.py" --operator-dir .

The 3 Python tools

All stdlib-only. Run any tool with `--help` to see its options.

crd_validator.py

Validates a CRD YAML against operator-pattern best practices.

bash

python scripts/crd_validator.py --crd config/crd/myapp.yaml
python scripts/crd_validator.py --crd config/crd/ --format json

Checks:

`spec.versions[*].subresources.status` is set (status subresource)
`spec.scope` is `Namespaced` (not `Cluster`) unless explicitly justified
`singular` and `listKind` are defined
`spec.versions[*].schema.openAPIV3Schema` has type definitions (no `x-kubernetes-preserve-unknown-fields: true` at top level)
A version is marked `served: true` AND `storage: true`
A conditions array is in the schema (allows `[]metav1.Condition`)
Printer columns include `Age` and `Status` / `Phase`
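
A few of these checks can be sketched against a CRD that has already been parsed into a dict. The function name `check_crd`, the FAIL/WARN severity assigned to each check, and the message wording are assumptions made for illustration; the real `crd_validator.py` may structure its findings differently.

```python
def check_crd(crd: dict) -> list:
    """Return human-readable findings for a CRD parsed into a dict."""
    findings = []
    spec = crd.get("spec") or {}
    names = spec.get("names") or {}
    versions = spec.get("versions") or []

    if spec.get("scope") != "Namespaced":
        findings.append("WARN scope is not Namespaced")
    for key in ("singular", "listKind"):
        if not names.get(key):
            findings.append(f"WARN spec.names.{key} is missing")
    for version in versions:
        label = version.get("name", "<unnamed>")
        if "status" not in (version.get("subresources") or {}):
            findings.append(f"FAIL {label}: status subresource not enabled")
        schema = (version.get("schema") or {}).get("openAPIV3Schema") or {}
        if schema.get("x-kubernetes-preserve-unknown-fields") is True:
            findings.append(f"FAIL {label}: preserve-unknown-fields at schema root")
    if not any(v.get("served") and v.get("storage") for v in versions):
        findings.append("FAIL no version is both served and storage")
    return findings


# A deliberately flawed CRD: cluster-scoped, no listKind, no status subresource.
crd = {
    "spec": {
        "scope": "Cluster",
        "names": {"kind": "MyApp", "plural": "myapps", "singular": "myapp"},
        "versions": [{
            "name": "v1alpha1",
            "served": True,
            "storage": True,
            "schema": {"openAPIV3Schema": {"type": "object"}},
        }],
    }
}
findings = check_crd(crd)
```

The standard library has no YAML parser, so the sketch starts from a dict rather than from a file on disk.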

reconcile_lint.py

Lints a Go controller reconcile function for anti-patterns.

bash

python scripts/reconcile_lint.py --controller controllers/myapp_controller.go

Checks (regex-based heuristics):

Returns have the `(ctrl.Result, error)` shape
Errors trigger a non-zero requeue (`return ctrl.Result{Requeue: true}, err`)
`client.Update()` on the spec object is flagged (controllers should update only status)
`time.Sleep` inside reconcile is flagged (use `RequeueAfter`)
HTTP calls without context cancellation are flagged
Missing `defer` after a finalizer add is flagged
No `IsConditionTrue` / `SetCondition` calls when conditions are present in the CRD
Reconcile function exceeds 80 lines (extract subroutines)
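
To give a feel for what a regex-based heuristic looks like, here is a minimal sketch covering two of the checks (`time.Sleep` and a non-status `Update`). The rule ids, the `RULES` table, and the `lint` function are hypothetical; they are not the actual contents of `reconcile_lint.py`, and a line-by-line regex pass like this will produce false positives and negatives on real code.

```python
import re

# Hypothetical rule table: (rule id, compiled pattern, advice).
RULES = [
    ("sleep-in-reconcile", re.compile(r"\btime\.Sleep\("),
     "use RequeueAfter instead of blocking"),
    # Matches .Update( unless it is immediately preceded by Status().
    ("spec-update", re.compile(r"(?<!Status\(\))\.Update\("),
     "write status via r.Status().Update"),
]


def lint(source: str) -> list:
    """Return (line number, rule id, advice) for every rule hit in Go source."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("//", 1)[0]  # crude: drop trailing line comments
        for rule_id, pattern, advice in RULES:
            if pattern.search(code):
                findings.append((lineno, rule_id, advice))
    return findings


GO_SNIPPET = '''\
func (r *MyAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    time.Sleep(30 * time.Second)
    if err := r.Client.Update(ctx, obj); err != nil {
        return ctrl.Result{}, err
    }
    if err := r.Status().Update(ctx, obj); err != nil {
        return ctrl.Result{}, err
    }
    return ctrl.Result{}, nil
}
'''
findings = lint(GO_SNIPPET)
```

The negative lookbehind is what lets `r.Status().Update(ctx, obj)` pass while `r.Client.Update(ctx, obj)` is flagged.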

operator_capability_audit.py

Scores an operator against OperatorHub's 5 Capability Levels.

bash

python scripts/operator_capability_audit.py --operator-dir .

Levels:

L1 — Basic Install: CRD defined, controller deploys it
L2 — Seamless Upgrades: PDBs, conversion webhooks, version skew strategy
L3 — Full Lifecycle: backups, restores, failure recovery
L4 — Deep Insights: metrics endpoint, Prometheus rules, alerts
L5 — Auto Pilot: auto-scaling, auto-tuning, anomaly detection

Reports current level + concrete next steps to advance one level.
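
One plausible way to turn per-level check results into a single score is to treat the levels as cumulative: the current level is the highest N for which levels 1 through N all pass. That cumulative rule, and the helper names `current_level` and `next_target`, are assumptions for this sketch rather than a description of how `operator_capability_audit.py` is implemented.

```python
LEVELS = {
    1: "Basic Install",
    2: "Seamless Upgrades",
    3: "Full Lifecycle",
    4: "Deep Insights",
    5: "Auto Pilot",
}


def current_level(passed: set) -> int:
    """Highest level N such that every level from 1 to N is in `passed`."""
    level = 0
    for n in sorted(LEVELS):
        if n not in passed:
            break
        level = n
    return level


def next_target(passed: set):
    """Name of the level to work on next, or None once level 5 is reached."""
    return LEVELS.get(current_level(passed) + 1)


# Metrics exist (L4) but backups do not (L3): the operator still scores L2.
score = current_level({1, 2, 4})
target = next_target({1, 2, 4})
```

Under this rule an operator with a metrics endpoint but no backup/restore story stays at L2 until the L3 gap is closed.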

Tooling landscape

Pick a framework based on language and complexity. See `references/tooling_landscape.md`.

| Framework | Language | Best for | Maintenance |
| --- | --- | --- | --- |
| controller-runtime | Go | Production-grade, low-level control | Active (sig-api-machinery) |
| kubebuilder | Go | Standard scaffolding, opinionated | Active (Kubernetes SIGs) |
| operator-sdk | Go / Helm / Ansible | OpenShift / mixed-paradigm teams | Active (Red Hat) |
| metacontroller | Any (webhook-based) | Polyglot teams, avoiding Go | Less active |
| KOPF | Python | Python shops, async-first | Active (community) |
| java-operator-sdk | Java | JVM shops | Active (Red Hat / Java SIG) |

Decision rules:

New operator + Go shop → kubebuilder
New operator + Python shop → KOPF
New operator + can't pick a language → metacontroller
OpenShift target → operator-sdk

CRD design principles

See `references/crd_design.md` for full detail. Quick rules:

status is the source of truth for the controller's view of the world. Spec is what the user wants; status is what the controller observed.
Use the status subresource. Without it, status updates re-trigger reconcile (loop).
Use Conditions, such as `Ready`, `Reconciling`, and `Degraded`. Each carries a reason and message.
Add finalizers. Without finalizers, deletion races the controller and orphans external resources.
Version your CRD from day 1: `v1alpha1` → `v1beta1` → `v1`. Plan a conversion webhook.
Validate via OpenAPI v3 schema. Don't rely on the controller for validation that should fail at admission.
Use `additionalPrinterColumns` for `kubectl get`. Show `Age`, `Phase`, and `Ready` at minimum.
Namespace your CRDs unless they manage cluster-scoped resources.
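
The Conditions rule is easier to apply with a concrete update routine in mind. The sketch below upserts a condition and only moves `lastTransitionTime` when the status value actually flips, which is the behaviour Go operators usually get from the apimachinery condition helpers. The function name `set_condition` and the dict layout are illustrative assumptions.

```python
from datetime import datetime, timezone
from typing import Optional


def set_condition(conditions: list, type_: str, status: str, reason: str,
                  message: str, now: Optional[datetime] = None) -> bool:
    """Upsert a condition in place; return True if it is new or its status flipped."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    for cond in conditions:
        if cond["type"] == type_:
            changed = cond["status"] != status
            if changed:
                # Only a real status transition moves the timestamp.
                cond["lastTransitionTime"] = stamp
            cond.update(status=status, reason=reason, message=message)
            return changed
    conditions.append({"type": type_, "status": status, "reason": reason,
                       "message": message, "lastTransitionTime": stamp})
    return True


conds = []
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 2, tzinfo=timezone.utc)
set_condition(conds, "Ready", "False", "Provisioning", "waiting for database", now=t0)
set_condition(conds, "Ready", "False", "Provisioning", "still waiting", now=t1)
stamp_after_repeat = conds[0]["lastTransitionTime"]   # unchanged: status did not flip
set_condition(conds, "Ready", "True", "Available", "all replicas ready", now=t1)
stamp_after_flip = conds[0]["lastTransitionTime"]     # moved: False -> True
```

Keeping the timestamp stable across repeated reconciles is what makes "how long has this been not Ready" answerable from status alone.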

Reconcile loop principles

See `references/reconcile_loop.md` for full detail. Quick rules:

Idempotent. Reconciling the same state twice → same result, no additional side effects.
Read once, decide, act. Don't observe the world repeatedly during reconcile.
Update status, not spec. Spec belongs to the user.
Return errors that requeue. Use `ctrl.Result{RequeueAfter: ...}` for known transient cases.
Never block. No `time.Sleep`. No long HTTP calls without context.
Use the cache. Read via the controller's cached client; only escape the cache for a specific reason.
Leader-elect when running >1 replica. Otherwise enable single-replica mode.
Set OwnerReferences. Cascading deletion is the operator pattern's free gift.
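
Two of these rules, requeueing on transient errors and never blocking, meet in the deletion path. The sketch below models a finalizer flow on a plain dict: add the finalizer while the object is live, and on deletion clean up the external resource before removing it, returning a requeue delay instead of sleeping when cleanup fails. The finalizer name, `TransientError`, the 30-second delay, and `reconcile_finalizer` are all hypothetical.

```python
FINALIZER = "apps.example.com/cleanup"  # hypothetical finalizer name


class TransientError(Exception):
    """Raised by the (hypothetical) external API when a retry may succeed."""


def reconcile_finalizer(obj: dict, delete_external) -> dict:
    """Handle the finalizer for one object; returns {'requeue_after': seconds}."""
    meta = obj["metadata"]
    finalizers = meta.setdefault("finalizers", [])

    if meta.get("deletionTimestamp"):
        if FINALIZER in finalizers:
            try:
                delete_external(meta["name"])  # must itself be idempotent
            except TransientError:
                return {"requeue_after": 30}   # retry later, do not sleep here
            finalizers.remove(FINALIZER)       # lets the API server finish deletion
        return {"requeue_after": 0}

    if FINALIZER not in finalizers:
        finalizers.append(FINALIZER)
    return {"requeue_after": 0}


calls = []

def flaky_delete(name: str) -> None:
    """Fails on the first call, succeeds afterwards."""
    calls.append(name)
    if len(calls) == 1:
        raise TransientError("external API timed out")


obj = {"metadata": {"name": "myapp"}}
reconcile_finalizer(obj, flaky_delete)                 # live object: finalizer added
obj["metadata"]["deletionTimestamp"] = "2024-01-01T00:00:00Z"
first = reconcile_finalizer(obj, flaky_delete)         # cleanup fails: requeue
still_held = FINALIZER in obj["metadata"]["finalizers"]
second = reconcile_finalizer(obj, flaky_delete)        # cleanup succeeds: released
```

The object stays in the API server until the second pass succeeds, so the external resource is never orphaned by a transient failure.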

Workflows

Workflow 1: Bootstrap a new operator (Go + kubebuilder)

1. Pick a Group/Version/Kind: e.g., apps.example.com/v1alpha1, kind=MyApp
2. kubebuilder init --domain example.com --repo github.com/org/myapp-operator
3. kubebuilder create api --group apps --version v1alpha1 --kind MyApp
4. Run crd_validator.py on config/crd/bases/apps.example.com_myapps.yaml
   → Fix every WARN before writing controller code
5. Implement the reconcile function (Karpathy principle 2: simplest correct version first)
6. Run reconcile_lint.py on controllers/myapp_controller.go
7. Run operator_capability_audit.py --operator-dir . — confirm L1
8. Test in a kind cluster: kubectl apply -f config/samples/
9. Add status conditions; aim for L2 in the same PR

Workflow 2: Audit an existing operator

1. Run operator_capability_audit.py --operator-dir <path>
2. Run crd_validator.py --crd config/crd/
3. Run reconcile_lint.py --controller controllers/
4. Triage findings:
   - FAIL → block release; fix before next deploy
   - WARN → file an issue; fix in next 30 days
5. Document current capability level in README; commit
6. Plan one capability level advancement per quarter
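
The triage rule in step 4 can be wired into CI as a small gate. This sketch assumes each finding is a dict with a `severity` key set to `FAIL` or `WARN`; the actual JSON shape emitted by the tools is not specified here, so treat that layout and the `triage` helper as hypothetical.

```python
def triage(findings: list) -> tuple:
    """Split findings by severity; exit code is 1 if any FAIL is present."""
    buckets = {"FAIL": [], "WARN": []}
    for finding in findings:
        severity = str(finding.get("severity", "")).upper()
        if severity in buckets:
            buckets[severity].append(finding)
    exit_code = 1 if buckets["FAIL"] else 0
    return exit_code, buckets


# Hypothetical findings merged from all three tools.
findings = [
    {"severity": "FAIL", "check": "status-subresource"},
    {"severity": "WARN", "check": "printer-columns"},
    {"severity": "WARN", "check": "reconcile-length"},
]
exit_code, buckets = triage(findings)
```

A non-zero exit code blocks the release, while the WARN bucket can be turned into issues with a 30-day due date.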

Workflow 3: Choose a framework

1. Identify primary language constraint (team skill)
2. Identify deployment target (vanilla k8s vs OpenShift)
3. Identify operator complexity (single CRD vs multi-CRD vs cluster-wide)
4. Cross-reference with references/tooling_landscape.md
5. Build a 1-week proof-of-concept before committing

References

`references/operator_pattern.md`: what an operator IS, when to use vs alternatives
`references/crd_design.md`: CRD design principles, versioning, conversion webhooks
`references/reconcile_loop.md`: reconcile patterns, error handling, idempotency
`references/tooling_landscape.md`: framework comparison + decision tree

Slash command

`/operator-audit`: run all 3 tools on an operator repo and produce a markdown report.

Asset templates

`assets/crd_template.yaml`: CRD with status subresource, conditions, finalizer hint, printer columns
`assets/reconcile_skeleton.go`: Go controller reconcile function with idempotency, conditions, finalizers, requeue patterns

Anti-patterns

`time.Sleep(30 * time.Second)` inside reconcile: blocks other reconciles. Use `RequeueAfter`.
`r.Client.Update(ctx, obj)` to set status: use `r.Status().Update(ctx, obj)` instead.
No leader election + 2+ replicas: split-brain.
No finalizer: external resources are orphaned on deletion.
CRD without status subresource: status updates trigger spec reconciles (infinite loop).
Reconcile function > 200 lines: extract reconcileXxx subroutines per condition.
`x-kubernetes-preserve-unknown-fields: true` on spec root: defeats validation.
Imperative reconcile: "if creating, do A; if updating, do B; if deleting, do C". Wrong shape. Reconcile = make actual=desired, regardless of how we got here.

Verifiable success

A team using this skill should achieve:

100% of new CRDs pass `crd_validator.py` before merge
All reconcile functions pass `reconcile_lint.py` in strict mode
Operators reach OperatorHub Capability Level 3 (Full Lifecycle) before public release
Mean time to fix a reconcile bug: <1 day (no infinite loops in production)
