performing-cluster-maintenance


Performing Cluster Maintenance


Manages planned cluster maintenance across all deployment tiers. For Self-Hosted, this means draining and restarting individual nodes. For Advanced/BYOC, this means configuring and managing maintenance windows for CRL-applied patches. For Standard and Basic, maintenance is fully managed with no customer action required.

When to Use This Skill


  • Planning OS patching, hardware changes, or configuration updates (Self-Hosted)
  • Configuring or modifying a maintenance window (Advanced, BYOC)
  • Setting patch deferral policies (Advanced, BYOC)
  • Monitoring during a CRL-managed maintenance event (Advanced, BYOC)
  • Running pre-maintenance validation checks (Self-Hosted, Advanced, BYOC)
  • Understanding how maintenance affects your application (all tiers)
  • Preparing applications for maintenance events (all tiers)
For permanent node removal: Use managing-cluster-capacity. For pre-maintenance health check: Use reviewing-cluster-health. For version upgrades: Use upgrading-cluster-version.


Step 1: Gather Context


Required Context


| Question | Options | Why It Matters |
| --- | --- | --- |
| Deployment tier? | Self-Hosted, Advanced, BYOC, Standard, Basic | Determines maintenance procedure |
| Goal? | Plan maintenance, Configure maintenance window, Defer a patch, Monitor during maintenance, Prepare application | Routes to the right procedure |

Additional Context (by tier)


If Self-Hosted:

| Question | Options | Why It Matters |
| --- | --- | --- |
| Maintenance type? | OS patching, Hardware change, Binary upgrade, Config change, Planned restart | Affects sequencing and post-maintenance steps |
| Deployment platform? | Bare metal, VMs, Kubernetes (Operator/Helm/manual) | Changes drain and restart commands |
| Process manager? | systemd, manual, container orchestrator | Changes stop/start commands |
| Target node ID? | Node ID | Required for drain command |
| Long-running queries expected? | Yes (increase drain timeout), No (default timeout) | Determines drain-wait parameter |

If Advanced or BYOC:

| Question | Options | Why It Matters |
| --- | --- | --- |
| Maintenance window configured? | Yes (what schedule), No | Determines if window needs setup |
| Patch pending? | Yes, No, Don't know | Determines urgency |
| Cloud provider? (BYOC only) | AWS, GCP, Azure | For infrastructure-level monitoring |

If Standard or Basic: No context needed — maintenance is fully managed.

Context-Driven Routing


Self-Hosted Node Maintenance


Applies when: Tier = Self-Hosted
Self-Hosted operators manage all maintenance directly. The core operation is draining a node to safely move leases and connections before stopping it.

Pre-Maintenance Checks


Run all checks before any maintenance operation. Stop if any check fails.
```sql
-- Check 1: All nodes live (STOP if any node is not live)
SELECT n.node_id, n.is_live
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY n.node_id;

-- Check 2: No other nodes currently draining (STOP if any draining)
SELECT node_id FROM crdb_internal.gossip_liveness WHERE draining = true;

-- Check 3: Ranges fully replicated (STOP if under-replicated ranges exist)
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
            ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;

-- Check 4: No disruptive jobs running (WAIT or pause before proceeding)
WITH j AS (SHOW JOBS)
SELECT job_id, job_type, status, now() - created AS running_for FROM j
WHERE status IN ('running', 'paused')
  AND job_type IN ('SCHEMA CHANGE', 'BACKUP', 'RESTORE', 'IMPORT', 'NEW SCHEMA CHANGE');

-- Check 5: Not mid-upgrade (STOP if versions differ)
SELECT DISTINCT build_tag FROM crdb_internal.gossip_nodes;

-- Check 6: Storage utilization safe (WARNING if any node > 70%)
SELECT node_id,
  ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
```
Stop conditions: Do not proceed with maintenance if any node is not live, ranges are under-replicated, another node is draining, or a rolling upgrade is in progress. Wait for running jobs to complete or pause them.
See maintenance-prechecks reference for a consolidated precheck script.
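The stop/wait conditions above can be folded into a small go/no-go helper. This is a minimal sketch, not part of the skill itself; the field names (`all_nodes_live`, `draining_nodes`, and so on) are illustrative stand-ins for the six query results.

```python
# Hypothetical decision helper mirroring the six prechecks above.
# `checks` maps illustrative field names to the corresponding query results.

def precheck_decision(checks: dict) -> str:
    """Return 'STOP', 'WAIT', or 'PROCEED' based on precheck results."""
    if not checks["all_nodes_live"]:                  # Check 1
        return "STOP"
    if checks["draining_nodes"]:                      # Check 2: node IDs draining
        return "STOP"
    if checks["under_replicated_ranges"] > 0:         # Check 3
        return "STOP"
    if len(checks["build_tags"]) > 1:                 # Check 5: mixed versions
        return "STOP"
    if checks["disruptive_jobs"]:                     # Check 4: wait or pause first
        return "WAIT"
    if max(checks["utilization_pct"].values()) > 70:  # Check 6: warning only
        return "PROCEED (warning: storage > 70%)"
    return "PROCEED"
```

Note that storage utilization above 70% produces a warning rather than a hard stop, matching the check comments above.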

Execute Drain


If platform = bare metal or VMs:

```bash
cockroach node drain --self --certs-dir=<certs-dir> --host=<node-address>
```

If long-running queries expected:

```bash
cockroach node drain --self --certs-dir=<certs-dir> --host=<node-address> --drain-wait=60s
```
If platform = Kubernetes:

```bash
# Operator handles drain automatically during pod eviction
kubectl delete pod <pod-name>

# Or for rolling restart:
kubectl rollout restart statefulset cockroachdb
```

Stop, Maintain, Restart


If process manager = systemd:

```bash
sudo systemctl stop cockroachdb

# ... perform maintenance ...

sudo systemctl start cockroachdb
```

If process manager = manual:

```bash
kill -TERM $(pgrep -f 'cockroach start')

# ... perform maintenance ...

cockroach start --certs-dir=<certs-dir> --store=<path> --join=<addresses> --background
```

Never use `kill -9` unless the process is unresponsive to SIGTERM.

Post-Restart Verification


```sql
SELECT node_id, is_live FROM crdb_internal.gossip_nodes WHERE node_id = <node_id>;
-- is_live = true

SELECT node_id, lease_count FROM crdb_internal.kv_store_status WHERE node_id = <node_id>;
-- lease_count should increase over minutes as leases rebalance
```
See drain-details reference for drain phases, timeout configuration, and advanced monitoring.

Storage Maintenance


Periodic storage maintenance for Self-Hosted clusters:
Ballast file verification:
```bash
ls -lh <store-path>/auxiliary/EMERGENCY_BALLAST
```

If missing, create: `cockroach debug ballast <store-path>/auxiliary/EMERGENCY_BALLAST --size=1GiB`



Disk utilization check:

```sql
SELECT node_id,
  ROUND(capacity / 1073741824.0, 2) AS total_gb,
  ROUND(available / 1073741824.0, 2) AS available_gb,
  ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
```
Nodes above 70% utilization should be addressed before maintenance — draining a node temporarily increases load on remaining nodes.
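The utilization formula in the query above is simple enough to check by hand. A minimal sketch (function names are illustrative):

```python
# Mirrors the SQL expression: (1 - available / capacity) * 100, rounded to 2 places.
def utilization_pct(capacity_bytes: int, available_bytes: int) -> float:
    return round((1 - available_bytes / capacity_bytes) * 100, 2)

def nodes_to_address(stores: dict, threshold: float = 70.0) -> list:
    """Return node IDs whose utilization exceeds the threshold (default 70%)."""
    return [node_id for node_id, (cap, avail) in sorted(stores.items())
            if utilization_pct(cap, avail) > threshold]
```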



Advanced Maintenance Management


Applies when: Tier = Advanced
Advanced clusters are managed by Cockroach Labs. CRL applies patches and performs infrastructure maintenance during the configured maintenance window. You do not drain or restart nodes — CRL handles this using rolling restarts.

Configure a Maintenance Window


  1. Cloud Console → Cluster → Settings → Maintenance
  2. Set a weekly 6-hour window
    • Choose day of week (e.g., Sunday)
    • Choose start time in UTC (e.g., 02:00 UTC)
    • Window duration is 6 hours
If no window is configured, CRL applies patches at a time of their choosing.
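Given a weekly day-of-week plus UTC start time, the next window occurrence is straightforward to compute. A hedged sketch using Python's `datetime` (the function and its interface are illustrative, not part of any CRL tooling):

```python
from datetime import datetime, timedelta

def next_window(now: datetime, weekday: int, start_hour: int,
                duration_hours: int = 6) -> tuple:
    """Next (start, end) of a weekly window. weekday: Monday=0 .. Sunday=6."""
    start = now.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    start += timedelta(days=(weekday - now.weekday()) % 7)
    if start <= now:                      # this week's window already began
        start += timedelta(days=7)        # take next week's occurrence
    return start, start + timedelta(hours=duration_hours)
```

For example, with a Sunday 02:00 UTC window, a call made mid-week returns the coming Sunday 02:00–08:00 UTC.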

View Current Maintenance Window


Cloud Console → Cluster → Settings → Maintenance shows the current schedule.
Cloud API:
```bash
curl -s -H "Authorization: Bearer $COCKROACH_API_KEY" \
  "https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>" | jq '.maintenance_window'
```

Defer Patches


If a pending patch needs to be delayed (e.g., for testing):
  1. Cloud Console → Cluster → Settings → Upgrades
  2. Select deferral period: 30, 60, or 90 days
Deferred patches still apply at the end of the deferral period. Deferral only delays — it does not skip.
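Since deferral only delays, the latest possible apply date is just the pending date plus the deferral period. A trivial sketch (names are illustrative):

```python
from datetime import date, timedelta

def deferral_deadline(pending_since: date, days: int) -> date:
    """Deferral only delays: the patch still applies after 30/60/90 days."""
    assert days in (30, 60, 90)
    return pending_since + timedelta(days=days)
```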

What Happens During Maintenance


  1. CRL applies the patch using rolling restarts — one node at a time
  2. Each node is drained (connections and leases moved), updated, and restarted
  3. Cluster remains available throughout (multi-node clusters)
  4. Performance may be slightly degraded during the window due to temporarily reduced capacity
Single-node clusters experience downtime during maintenance. Consider scaling to 3+ nodes for production workloads.
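The availability claim above follows from quorum arithmetic: with one node down at a time, a range stays available as long as a majority of its replicas remain live. A simplified sketch (the function is illustrative and ignores multi-store placement details):

```python
# Why a rolling restart keeps multi-node clusters available but not 1-node ones:
# each range needs a majority of its replicas live, and at most one is down.
def available_during_rolling_restart(n_nodes: int, replication_factor: int = 3) -> bool:
    if n_nodes == 1:
        return False                       # the only node is restarting
    replicas = min(replication_factor, n_nodes)
    live = replicas - 1                    # one replica down at a time
    return live >= replicas // 2 + 1       # majority still live?
```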

Monitor During Maintenance


Cloud Console:
  • Cluster Overview shows node status during rolling restarts
  • Metrics page shows temporary dips in QPS and capacity
  • Alerts may fire for transient node unavailability
SQL (during maintenance):
```sql
-- Check which nodes are currently live
SELECT node_id, build_tag, is_live
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY node_id;
```

Best Practices


  • Schedule during your lowest-traffic period
  • Monitor P99 latency during and after the window
  • Test patches in a staging cluster before production
  • Use deferral to align with your testing and release cadence
  • Configure alerting to notify during maintenance windows
  • Ensure applications implement connection retry with exponential backoff
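The last practice above (connection retry with exponential backoff) can be sketched as follows. This is a minimal, driver-agnostic illustration: `connect` is any callable that raises on transient failure, and the parameter names are illustrative.

```python
import random
import time

def connect_with_backoff(connect, max_attempts=5, base_delay=0.1, cap=5.0,
                         sleep=time.sleep):
    """Retry a connect() callable with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                                  # out of attempts
            delay = min(cap, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))    # jitter avoids thundering herd
```

Injecting `sleep` keeps the helper testable; real applications would pass their driver's connect function.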


BYOC Maintenance Management


Applies when: Tier = BYOC
BYOC maintenance follows the same CRL-managed process as Advanced. Follow all Advanced Maintenance Management steps for maintenance window configuration, patch deferral, and monitoring.

Cloud Provider Visibility


Since BYOC clusters run in your cloud account, you can directly observe maintenance operations:
If AWS:
  • EC2 console shows instance restarts during rolling patches
  • CloudWatch metrics show brief dips during node cycling
  • Set up CloudWatch Alarms for instance state changes
If GCP:
  • Compute Engine console shows VM restarts
  • Cloud Monitoring shows instance-level events
  • Configure alerting policies for instance uptime
If Azure:
  • Azure portal shows VM cycling
  • Azure Monitor captures instance restart events
  • Set up Azure Alerts for VM availability

BYOC Infrastructure Maintenance


For infrastructure changes in your cloud account that CRL does not manage (VPC, security groups, IAM, DNS):
  • Coordinate with CRL before making changes that could affect the cluster
  • Do not modify CRL-managed resources (instances, disks, network interfaces)
  • Test infrastructure changes in a staging BYOC cluster first
  • Changes to networking (PrivateLink, PSC, VPC Peering) may require CRL coordination


Standard Maintenance


Applies when: Tier = Standard
Standard is a multi-tenant managed service. There are no nodes, no maintenance windows to configure, and no patches to defer. Cockroach Labs manages all maintenance transparently.

What to Expect


  • Patches are applied during low-traffic periods chosen by CRL
  • No downtime during maintenance
  • No customer notification required for routine patches
  • Major version upgrades are also automatic

Application Preparation


  • Implement connection retry logic with exponential backoff
  • Handle brief latency variations gracefully
  • Monitor Cloud Console for any service notifications


Basic Maintenance


Applies when: Tier = Basic
Basic is a serverless offering. All maintenance is fully managed by Cockroach Labs. The serverless architecture is designed for zero-downtime maintenance.

What to Expect


  • All patches and upgrades are transparent
  • No customer action required
  • No maintenance notifications needed

Application Preparation


  • Implement connection retry logic (recommended for all production applications)
  • Be aware that idle clusters may scale to zero — first reconnection after inactivity may have higher latency (this is not maintenance-related)


Safety Considerations


Read-only monitoring queries are safe on all tiers.
Self-Hosted node maintenance:
  • Only drain one node at a time
  • Drain cannot be canceled once started
  • Applications must have connection retry logic
  • Load balancer detects a drained node via `/health?ready=1` returning an error
  • Never SIGKILL unless process is unresponsive to SIGTERM
Advanced/BYOC maintenance windows:
  • Single-node clusters experience downtime during maintenance
  • Deferring patches too long delays security fixes — evaluate CVE impact
  • Do not modify CRL-managed infrastructure during a maintenance window
Standard/Basic: No maintenance risk for customers — fully managed by CRL.
See safety-guide reference for detailed risk matrix.
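The load-balancer check listed above can also be scripted by an operator: poll the node's readiness endpoint until it stops returning HTTP 200 (`/health?ready=1` returns a non-200 status while a node is draining or not ready). A minimal sketch with an injected `probe` callable standing in for the HTTP request:

```python
def wait_for_drain(probe, max_polls: int = 60) -> bool:
    """Return True once the readiness probe reports non-200 (node drained).

    `probe` is an illustrative callable returning an HTTP status code,
    e.g. from GET /health?ready=1 on the draining node.
    """
    for _ in range(max_polls):
        if probe() != 200:
            return True    # load balancer will stop routing here
    return False           # still ready after max_polls checks
```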

Troubleshooting


| Issue | Tier | Fix |
| --- | --- | --- |
| Drain very slow | SH | Check `SHOW CLUSTER STATEMENTS` for stuck queries |
| Drain hangs | SH | Check logs; SIGTERM if unresponsive |
| Node won't rejoin after restart | SH | Verify `--join` flag; check network connectivity |
| Leases not returning to node | SH | Wait 5-10 min; monitor lease_count |
| Clients not reconnecting | SH | Verify load balancer health check is passing |
| Maintenance window missed | ADV/BYOC | Contact support |
| Unexpected maintenance outside window | ADV/BYOC | Emergency patches may be applied outside windows; check Cloud Console notifications |
| Latency during maintenance | ADV/BYOC | Expected (temporarily reduced capacity); monitor and verify recovery after the window |

References


Skill references:
  • Drain phases and timeouts
  • Maintenance prechecks
  • Safety guide
Related skills:
  • reviewing-cluster-health
  • managing-cluster-capacity
  • upgrading-cluster-version
Official CockroachDB Documentation: