performing-cluster-maintenance


Performing Cluster Maintenance


Manages planned cluster maintenance across all deployment tiers. For Self-Hosted, this means draining and restarting individual nodes. For Advanced/BYOC, this means configuring and managing maintenance windows for CRL-applied patches. For Standard and Basic, maintenance is fully managed with no customer action required.

When to Use This Skill


  • Planning OS patching, hardware changes, or configuration updates (Self-Hosted)
  • Configuring or modifying a maintenance window (Advanced, BYOC)
  • Setting patch deferral policies (Advanced, BYOC)
  • Monitoring during a CRL-managed maintenance event (Advanced, BYOC)
  • Running pre-maintenance validation checks (Self-Hosted, Advanced, BYOC)
  • Understanding how maintenance affects your application (all tiers)
  • Preparing applications for maintenance events (all tiers)
For permanent node removal: Use managing-cluster-capacity. For pre-maintenance health check: Use reviewing-cluster-health. For version upgrades: Use upgrading-cluster-version.


Step 1: Gather Context


Required Context


| Question | Options | Why It Matters |
| --- | --- | --- |
| Deployment tier? | Self-Hosted, Advanced, BYOC, Standard, Basic | Determines maintenance procedure |
| Goal? | Plan maintenance, Configure maintenance window, Defer a patch, Monitor during maintenance, Prepare application | Routes to the right procedure |

Additional Context (by tier)


If Self-Hosted:

| Question | Options | Why It Matters |
| --- | --- | --- |
| Maintenance type? | OS patching, Hardware change, Binary upgrade, Config change, Planned restart | Affects sequencing and post-maintenance steps |
| Deployment platform? | Bare metal, VMs, Kubernetes (Operator/Helm/manual) | Changes drain and restart commands |
| Process manager? | systemd, manual, container orchestrator | Changes stop/start commands |
| Target node ID? | Node ID | Required for drain command |
| Long-running queries expected? | Yes (increase drain timeout), No (default timeout) | Determines drain-wait parameter |

If Advanced or BYOC:

| Question | Options | Why It Matters |
| --- | --- | --- |
| Maintenance window configured? | Yes (what schedule), No | Determines if window needs setup |
| Patch pending? | Yes, No, Don't know | Determines urgency |
| Cloud provider? (BYOC only) | AWS, GCP, Azure | For infrastructure-level monitoring |

If Standard or Basic: No context needed — maintenance is fully managed.

Context-Driven Routing


Self-Hosted Node Maintenance


Applies when: Tier = Self-Hosted
Self-Hosted operators manage all maintenance directly. The core operation is draining a node to safely move leases and connections before stopping it.

Pre-Maintenance Checks


Run all checks before any maintenance operation. Stop if any check fails.
```sql
-- Check 1: All nodes live (STOP if any node is not live)
SELECT n.node_id, n.is_live
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY n.node_id;

-- Check 2: No other nodes currently draining (STOP if any draining)
SELECT node_id FROM crdb_internal.gossip_liveness WHERE draining = true;

-- Check 3: Ranges fully replicated (STOP if under-replicated ranges exist)
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
            ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;

-- Check 4: No disruptive jobs running (WAIT or pause before proceeding)
WITH j AS (SHOW JOBS)
SELECT job_id, job_type, status, now() - created AS running_for FROM j
WHERE status IN ('running', 'paused')
  AND job_type IN ('SCHEMA CHANGE', 'BACKUP', 'RESTORE', 'IMPORT', 'NEW SCHEMA CHANGE');

-- Check 5: Not mid-upgrade (STOP if versions differ)
SELECT DISTINCT build_tag FROM crdb_internal.gossip_nodes;

-- Check 6: Storage utilization safe (WARNING if any node > 70%)
SELECT node_id,
  ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
```
Stop conditions: Do not proceed with maintenance if any node is not live, ranges are under-replicated, another node is draining, or a rolling upgrade is in progress. Wait for running jobs to complete or pause them.
See maintenance-prechecks reference for a consolidated precheck script.
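The stop/wait conditions above can be folded into a small go/no-go helper. This is a minimal sketch, not part of the skill itself; the field names (`all_nodes_live`, `draining_nodes`, and so on) are illustrative stand-ins for the six query results.

```python
# Hypothetical decision helper mirroring the six prechecks above.
# `checks` maps illustrative field names to the corresponding query results.

def precheck_decision(checks: dict) -> str:
    """Return 'STOP', 'WAIT', or 'PROCEED' based on precheck results."""
    if not checks["all_nodes_live"]:                  # Check 1
        return "STOP"
    if checks["draining_nodes"]:                      # Check 2: node IDs draining
        return "STOP"
    if checks["under_replicated_ranges"] > 0:         # Check 3
        return "STOP"
    if len(checks["build_tags"]) > 1:                 # Check 5: mixed versions
        return "STOP"
    if checks["disruptive_jobs"]:                     # Check 4: wait or pause first
        return "WAIT"
    if max(checks["utilization_pct"].values()) > 70:  # Check 6: warning only
        return "PROCEED (warning: storage > 70%)"
    return "PROCEED"
```

Note that storage utilization above 70% produces a warning rather than a hard stop, matching the check comments above.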

Execute Drain


If platform = bare metal or VMs:

```bash
cockroach node drain --self --certs-dir=<certs-dir> --host=<node-address>
```

If long-running queries expected:

```bash
cockroach node drain --self --certs-dir=<certs-dir> --host=<node-address> --drain-wait=60s
```
If platform = Kubernetes:

```bash
# Operator handles drain automatically during pod eviction
kubectl delete pod <pod-name>

# Or for rolling restart:
kubectl rollout restart statefulset cockroachdb
```

Stop, Maintain, Restart


If process manager = systemd:

```bash
sudo systemctl stop cockroachdb

# ... perform maintenance ...

sudo systemctl start cockroachdb
```

If process manager = manual:

```bash
kill -TERM $(pgrep -f 'cockroach start')

# ... perform maintenance ...

cockroach start --certs-dir=<certs-dir> --store=<path> --join=<addresses> --background
```

Never use `kill -9` unless the process is unresponsive to SIGTERM.

Post-Restart Verification


```sql
SELECT node_id, is_live FROM crdb_internal.gossip_nodes WHERE node_id = <node_id>;
-- is_live = true

SELECT node_id, lease_count FROM crdb_internal.kv_store_status WHERE node_id = <node_id>;
-- lease_count should increase over minutes as leases rebalance
```
See drain-details reference for drain phases, timeout configuration, and advanced monitoring.

Storage Maintenance


Periodic storage maintenance for Self-Hosted clusters:
Ballast file verification:
```bash
ls -lh <store-path>/auxiliary/EMERGENCY_BALLAST
```

If missing, create: `cockroach debug ballast <store-path>/auxiliary/EMERGENCY_BALLAST --size=1GiB`



Disk utilization check:

```sql
SELECT node_id,
  ROUND(capacity / 1073741824.0, 2) AS total_gb,
  ROUND(available / 1073741824.0, 2) AS available_gb,
  ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
```
Nodes above 70% utilization should be addressed before maintenance — draining a node temporarily increases load on remaining nodes.
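The utilization formula in the query above is simple enough to check by hand. A minimal sketch (function names are illustrative):

```python
# Mirrors the SQL expression: (1 - available / capacity) * 100, rounded to 2 places.
def utilization_pct(capacity_bytes: int, available_bytes: int) -> float:
    return round((1 - available_bytes / capacity_bytes) * 100, 2)

def nodes_to_address(stores: dict, threshold: float = 70.0) -> list:
    """Return node IDs whose utilization exceeds the threshold (default 70%)."""
    return [node_id for node_id, (cap, avail) in sorted(stores.items())
            if utilization_pct(cap, avail) > threshold]
```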



Advanced Maintenance Management


Applies when: Tier = Advanced
Advanced clusters are managed by Cockroach Labs. CRL applies patches and performs infrastructure maintenance during the configured maintenance window. You do not drain or restart nodes — CRL handles this using rolling restarts.

Configure a Maintenance Window


  1. Cloud Console → Cluster → Settings → Maintenance
  2. Set a weekly 6-hour window
    • Choose day of week (e.g., Sunday)
    • Choose start time in UTC (e.g., 02:00 UTC)
    • Window duration is 6 hours
If no window is configured, CRL applies patches at a time of their choosing.
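Given a weekly day-of-week plus UTC start time, the next window occurrence is straightforward to compute. A hedged sketch using Python's `datetime` (the function and its interface are illustrative, not part of any CRL tooling):

```python
from datetime import datetime, timedelta

def next_window(now: datetime, weekday: int, start_hour: int,
                duration_hours: int = 6) -> tuple:
    """Next (start, end) of a weekly window. weekday: Monday=0 .. Sunday=6."""
    start = now.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    start += timedelta(days=(weekday - now.weekday()) % 7)
    if start <= now:                      # this week's window already began
        start += timedelta(days=7)        # take next week's occurrence
    return start, start + timedelta(hours=duration_hours)
```

For example, with a Sunday 02:00 UTC window, a call made mid-week returns the coming Sunday 02:00–08:00 UTC.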

View Current Maintenance Window


Cloud Console → Cluster → Settings → Maintenance shows the current schedule.
Cloud API:
```bash
curl -s -H "Authorization: Bearer $COCKROACH_API_KEY" \
  "https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>" | jq '.maintenance_window'
```

Defer Patches


If a pending patch needs to be delayed (e.g., for testing):
  1. Cloud Console → Cluster → Settings → Upgrades
  2. Select deferral period: 30, 60, or 90 days
Deferred patches still apply at the end of the deferral period. Deferral only delays — it does not skip.
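Since deferral only delays, the latest possible apply date is just the pending date plus the deferral period. A trivial sketch (names are illustrative):

```python
from datetime import date, timedelta

def deferral_deadline(pending_since: date, days: int) -> date:
    """Deferral only delays: the patch still applies after 30/60/90 days."""
    assert days in (30, 60, 90)
    return pending_since + timedelta(days=days)
```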

What Happens During Maintenance


  1. CRL applies the patch using rolling restarts — one node at a time
  2. Each node is drained (connections and leases moved), updated, and restarted
  3. Cluster remains available throughout (multi-node clusters)
  4. Performance may be slightly degraded during the window due to temporarily reduced capacity
Single-node clusters experience downtime during maintenance. Consider scaling to 3+ nodes for production workloads.
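The availability claim above follows from quorum arithmetic: with one node down at a time, a range stays available as long as a majority of its replicas remain live. A simplified sketch (the function is illustrative and ignores multi-store placement details):

```python
# Why a rolling restart keeps multi-node clusters available but not 1-node ones:
# each range needs a majority of its replicas live, and at most one is down.
def available_during_rolling_restart(n_nodes: int, replication_factor: int = 3) -> bool:
    if n_nodes == 1:
        return False                       # the only node is restarting
    replicas = min(replication_factor, n_nodes)
    live = replicas - 1                    # one replica down at a time
    return live >= replicas // 2 + 1       # majority still live?
```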

Monitor During Maintenance


Cloud Console:
  • Cluster Overview shows node status during rolling restarts
  • Metrics page shows temporary dips in QPS and capacity
  • Alerts may fire for transient node unavailability
SQL (during maintenance):
```sql
-- Check which nodes are currently live
SELECT node_id, build_tag, is_live
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY node_id;
```

Best Practices


  • Schedule during your lowest-traffic period
  • Monitor P99 latency during and after the window
  • Test patches in a staging cluster before production
  • Use deferral to align with your testing and release cadence
  • Configure alerting to notify during maintenance windows
  • Ensure applications implement connection retry with exponential backoff
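The last practice above (connection retry with exponential backoff) can be sketched as follows. This is a minimal, driver-agnostic illustration: `connect` is any callable that raises on transient failure, and the parameter names are illustrative.

```python
import random
import time

def connect_with_backoff(connect, max_attempts=5, base_delay=0.1, cap=5.0,
                         sleep=time.sleep):
    """Retry a connect() callable with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                                  # out of attempts
            delay = min(cap, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))    # jitter avoids thundering herd
```

Injecting `sleep` keeps the helper testable; real applications would pass their driver's connect function.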


BYOC Maintenance Management


Applies when: Tier = BYOC
BYOC maintenance follows the same CRL-managed process as Advanced. Follow all Advanced Maintenance Management steps for maintenance window configuration, patch deferral, and monitoring.

Cloud Provider Visibility


Since BYOC clusters run in your cloud account, you can directly observe maintenance operations:
If AWS:
  • EC2 console shows instance restarts during rolling patches
  • CloudWatch metrics show brief dips during node cycling
  • Set up CloudWatch Alarms for instance state changes
If GCP:
  • Compute Engine console shows VM restarts
  • Cloud Monitoring shows instance-level events
  • Configure alerting policies for instance uptime
If Azure:
  • Azure portal shows VM cycling
  • Azure Monitor captures instance restart events
  • Set up Azure Alerts for VM availability

BYOC Infrastructure Maintenance


For infrastructure changes in your cloud account that CRL does not manage (VPC, security groups, IAM, DNS):
  • Coordinate with CRL before making changes that could affect the cluster
  • Do not modify CRL-managed resources (instances, disks, network interfaces)
  • Test infrastructure changes in a staging BYOC cluster first
  • Changes to networking (PrivateLink, PSC, VPC Peering) may require CRL coordination


Standard Maintenance


Applies when: Tier = Standard
Standard is a multi-tenant managed service. There are no nodes, no maintenance windows to configure, and no patches to defer. Cockroach Labs manages all maintenance transparently.

What to Expect


  • Patches are applied during low-traffic periods chosen by CRL
  • No downtime during maintenance
  • No customer notification required for routine patches
  • Major version upgrades are also automatic

Application Preparation


  • Implement connection retry logic with exponential backoff
  • Handle brief latency variations gracefully
  • Monitor Cloud Console for any service notifications


Basic Maintenance


Applies when: Tier = Basic
Basic is a serverless offering. All maintenance is fully managed by Cockroach Labs. The serverless architecture is designed for zero-downtime maintenance.

What to Expect


  • All patches and upgrades are transparent
  • No customer action required
  • No maintenance notifications needed

Application Preparation


  • Implement connection retry logic (recommended for all production applications)
  • Be aware that idle clusters may scale to zero — first reconnection after inactivity may have higher latency (this is not maintenance-related)


Safety Considerations


Read-only monitoring queries are safe on all tiers.
Self-Hosted node maintenance:
  • Only drain one node at a time
  • Drain cannot be canceled once started
  • Applications must have connection retry logic
  • Load balancer detects a drained node via `/health?ready=1` returning an error
  • Never SIGKILL unless process is unresponsive to SIGTERM
Advanced/BYOC maintenance windows:
  • Single-node clusters experience downtime during maintenance
  • Deferring patches too long delays security fixes — evaluate CVE impact
  • Do not modify CRL-managed infrastructure during a maintenance window
Standard/Basic: No maintenance risk for customers — fully managed by CRL.
See safety-guide reference for detailed risk matrix.
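The load-balancer check listed above can also be scripted by an operator: poll the node's readiness endpoint until it stops returning HTTP 200 (`/health?ready=1` returns a non-200 status while a node is draining or not ready). A minimal sketch with an injected `probe` callable standing in for the HTTP request:

```python
def wait_for_drain(probe, max_polls: int = 60) -> bool:
    """Return True once the readiness probe reports non-200 (node drained).

    `probe` is an illustrative callable returning an HTTP status code,
    e.g. from GET /health?ready=1 on the draining node.
    """
    for _ in range(max_polls):
        if probe() != 200:
            return True    # load balancer will stop routing here
    return False           # still ready after max_polls checks
```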

Troubleshooting


| Issue | Tier | Fix |
| --- | --- | --- |
| Drain very slow | SH | Check `SHOW CLUSTER STATEMENTS` for stuck queries |
| Drain hangs | SH | Check logs; SIGTERM if unresponsive |
| Node won't rejoin after restart | SH | Verify `--join` flag; check network connectivity |
| Leases not returning to node | SH | Wait 5-10 min; monitor lease_count |
| Clients not reconnecting | SH | Verify load balancer health check is passing |
| Maintenance window missed | ADV/BYOC | Contact support |
| Unexpected maintenance outside window | ADV/BYOC | Emergency patches may be applied outside windows; check Cloud Console notifications |
| Latency during maintenance | ADV/BYOC | Expected (temporarily reduced capacity); monitor and verify recovery after the window |

References


Skill references:
  • Drain phases and timeouts
  • Maintenance prechecks
  • Safety guide
Related skills:
  • reviewing-cluster-health
  • managing-cluster-capacity
  • upgrading-cluster-version
Official CockroachDB Documentation: