
Managing Cluster Capacity


Manages cluster capacity across all CockroachDB deployment tiers. What "capacity" means varies by tier — Self-Hosted manages individual nodes, Advanced/BYOC manage node count and machine size, Standard manages provisioned vCPUs, and Basic auto-scales with cost controls.

When to Use This Skill


  • Permanently removing a node from a cluster (Self-Hosted)
  • Adding nodes to increase capacity (Self-Hosted)
  • Scaling cluster node count or machine size (Advanced, BYOC)
  • Adjusting provisioned compute (Standard)
  • Managing costs on a serverless cluster (Basic)
  • Replacing hardware or migrating infrastructure (Self-Hosted, BYOC)
  • Replacing a failed or dead node (Self-Hosted)
  • Managing storage utilization and disk pressure (Self-Hosted)
For temporary maintenance (not capacity changes): Use performing-cluster-maintenance. For pre-operation health check: Use reviewing-cluster-health.


Step 1: Gather Context


Required Context


| Question | Options | Why It Matters |
| --- | --- | --- |
| Deployment tier? | Self-Hosted, Advanced, BYOC, Standard, Basic | Different capacity model per tier |
| Direction? | Scale up (add capacity), Scale down (reduce capacity) | Determines procedure |

Additional Context (by tier)


If Self-Hosted (scaling down):
| Question | Options | Why It Matters |
| --- | --- | --- |
| How many nodes to remove? | 1, multiple | Multi-node decommission should be done simultaneously |
| Target node IDs? | Node IDs from `cockroach node status` | Required for CLI commands |
| Is the node alive or dead? | Alive, Dead | Dead nodes use a different procedure |
| Deployment platform? | Bare metal, VMs, Kubernetes | Changes CLI and cleanup steps |
| Current replication factor? | 3, 5, custom | Must have enough nodes remaining |
| Current node count? | Number | Validates remaining capacity |
| Storage utilization? | Low (<60%), Medium (60-80%), High (>80%) | Determines urgency and whether storage maintenance is needed |
If Advanced or BYOC:
| Question | Options | Why It Matters |
| --- | --- | --- |
| Scale method? | Cloud Console, API, Terraform | Determines procedure |
| Current and target configuration? | e.g., 5 nodes → 3 nodes, or 4 vCPU → 8 vCPU | Validates constraints |
| Cloud provider? (BYOC only) | AWS, GCP, Azure | Affects infrastructure verification |
If Standard:
| Question | Options | Why It Matters |
| --- | --- | --- |
| Current provisioned vCPUs? | Number | Context for scaling decision |
| Target vCPUs? | Number | Validates workload will fit |
If Basic: Gather cost management goals — Basic auto-scales with no manual capacity control.

Context-Driven Routing


Self-Hosted Capacity Management


Applies when: Tier = Self-Hosted

Scaling Down: Decommission Nodes


Pre-Decommission Validation


```sql
-- All nodes live
SELECT n.node_id, n.is_live, n.build_tag
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY n.node_id;

-- Ranges fully replicated
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
            ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;

-- Remaining capacity check
SELECT node_id, store_id,
  ROUND(capacity / 1073741824.0, 2) AS total_gb,
  ROUND(available / 1073741824.0, 2) AS available_gb,
  ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;

-- Replication factor
SHOW ZONE CONFIGURATION FOR RANGE default;
```

Remaining nodes must stay below 60% utilization after absorbing the data, and the node count after decommission must be at least the replication factor.
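The two rules above can be sketched as a quick pre-flight computation. This is a minimal illustration, not part of the official tooling: feed it the `capacity`/`available` figures from the `kv_store_status` query; the store sizes below are hypothetical sample values.

```python
# Sketch: can the remaining nodes absorb a decommissioned node's data?
# Checks the two rules stated above: projected utilization < 60% and
# remaining node count >= replication factor.

def safe_to_decommission(stores, leaving_ids, replication_factor=3, max_util=0.60):
    """stores: {node_id: (capacity_bytes, available_bytes)}"""
    leaving_used = sum(cap - avail
                       for nid, (cap, avail) in stores.items() if nid in leaving_ids)
    remaining = {nid: s for nid, s in stores.items() if nid not in leaving_ids}
    if len(remaining) < replication_factor:
        return False  # would drop below the replication factor
    rem_capacity = sum(cap for cap, _ in remaining.values())
    rem_used = sum(cap - avail for cap, avail in remaining.values())
    # Assume the leaving node's data spreads across the remaining nodes.
    projected = (rem_used + leaving_used) / rem_capacity
    return projected < max_util

GB = 1024 ** 3
# Hypothetical 4-node cluster: (capacity, available) per store.
stores = {1: (500 * GB, 350 * GB), 2: (500 * GB, 340 * GB),
          3: (500 * GB, 360 * GB), 4: (500 * GB, 345 * GB)}
print(safe_to_decommission(stores, {4}))
```

Removing two nodes from this sample cluster would leave only two, below a replication factor of 3, so the same check would refuse it.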

If Node Is Alive: Drain Then Decommission



Step 1: Drain


```bash
cockroach node drain <node_id> --certs-dir=<certs-dir> --host=<any-live-node>
```

Step 2: Decommission (single node)


```bash
cockroach node decommission <node_id> --certs-dir=<certs-dir> --host=<any-live-node>
```

Step 2: Decommission (multiple nodes — more efficient, do simultaneously)


```bash
cockroach node decommission <id_1> <id_2> <id_3> --certs-dir=<certs-dir> --host=<any-live-node>
```

If Node Is Dead: Replace Failed Node


When a node has been dead longer than `server.time_until_store_dead` (default 5m), CockroachDB automatically re-replicates its data to surviving nodes. Use this procedure to clean up the dead node and optionally add a replacement.
Step 1: Confirm the node is dead and data is safe
```sql
-- Confirm node is dead
SELECT node_id, is_live FROM crdb_internal.gossip_nodes WHERE node_id = <dead_node_id>;

-- Verify all ranges are fully replicated (no under-replicated after re-replication)
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
            ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;

-- Check remaining capacity can handle the load
SELECT node_id, ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
```
If under-replicated ranges exist, wait for re-replication to complete before proceeding.
Step 2: Decommission the dead node (metadata cleanup)
```bash
cockroach node decommission <dead_node_id> --certs-dir=<certs-dir> --host=<any-live-node>
```
Step 3: Add a replacement node (recommended)
If remaining nodes are above 60% utilization, provision a replacement node using the Scaling Up: Add Nodes procedure.
Multiple dead nodes: Decommission all dead nodes simultaneously:
```bash
cockroach node decommission <id_1> <id_2> --certs-dir=<certs-dir> --host=<any-live-node>
```
See replacing-failed-nodes reference for detailed failure scenarios and recovery procedures.

Monitor Decommission Progress


```bash
cockroach node status --decommission --certs-dir=<certs-dir> --host=<any-live-node>
```
Wait for `gossiped_replicas = 0` and `membership = 'decommissioned'`, then stop the process on the decommissioned node.
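The completion condition can be expressed as a small predicate over the status rows. A minimal sketch, assuming you have parsed `cockroach node status --decommission` output into dicts keyed by its column names; the sample row is hypothetical.

```python
# Sketch: the completion check described above, applied to parsed status rows.
# A node is done only when it gossips zero replicas AND its membership has
# moved from 'decommissioning' to 'decommissioned'.

def decommission_complete(rows):
    """rows: list of dicts with 'gossiped_replicas' and 'membership' fields."""
    return all(r["gossiped_replicas"] == 0 and r["membership"] == "decommissioned"
               for r in rows)

rows = [{"node_id": 4, "gossiped_replicas": 12, "membership": "decommissioning"}]
print(decommission_complete(rows))  # replicas still draining

rows[0].update(gossiped_replicas=0, membership="decommissioned")
print(decommission_complete(rows))
```

Poll until this returns true before stopping the process; stopping early leaves the node in the `decommissioning` state.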

Cancel a Decommission


```bash
cockroach node recommission <node_id> --certs-dir=<certs-dir> --host=<any-live-node>
```
This only works while the node is still in the `decommissioning` state.

Scaling Up: Add Nodes


  1. Provision new hardware/VM with the same specs as existing nodes
  2. Install the same CockroachDB version (confirm with `cockroach version`)
  3. Start the node with `--join` pointing to existing cluster nodes
  4. Verify the join:
     ```sql
     SELECT node_id, address, is_live FROM crdb_internal.gossip_nodes n
     JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY node_id;
     ```
  5. Data rebalances automatically; monitor with:
     ```sql
     SELECT node_id, range_count, lease_count
     FROM crdb_internal.kv_store_status ORDER BY node_id;
     ```
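One way to judge when rebalancing has settled is to compare each node's `range_count` against the cluster mean. This is an illustrative heuristic only; the 10% tolerance below is an assumed threshold, not an official one.

```python
# Sketch: a rough "has rebalancing converged?" check on the range_count
# column from the kv_store_status query above.

def rebalanced(range_counts, tolerance=0.10):
    """range_counts: {node_id: range_count}. True when every node is
    within `tolerance` of the mean range count."""
    counts = list(range_counts.values())
    mean = sum(counts) / len(counts)
    return all(abs(c - mean) <= tolerance * mean for c in counts)

print(rebalanced({1: 98, 2: 102, 3: 100, 4: 95}))   # counts near the mean
print(rebalanced({1: 130, 2: 130, 3: 130, 4: 10}))  # new node still filling up
```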

Post-Scaling Verification


```sql
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
            ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;

SELECT node_id, range_count, lease_count,
  ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
```


Advanced Scaling


Applies when: Tier = Advanced
Advanced clusters are managed by Cockroach Labs. Capacity is adjusted by changing node count or machine size.

Via Cloud Console


  1. Cluster → Capacity
  2. Adjust node count or machine type (vCPUs per node)
  3. CRL handles all node operations (drain, decommission, provisioning) safely
  4. Monitor progress in Cloud Console

Via Cloud API



Scale node count


```bash
curl -X PATCH \
  -H "Authorization: Bearer $COCKROACH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"config": {"num_nodes": <new_count>}}' \
  "https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>"
```
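For scripting, the same PATCH call can be built with the Python standard library. A minimal sketch: the request is constructed but not sent, and the cluster ID and key below are placeholders, not real values.

```python
# Sketch: build the cluster-scaling PATCH request shown in the curl command
# above, using only the standard library. Nothing is sent on construction.
import json
import urllib.request

def build_scale_request(cluster_id, num_nodes, api_key):
    body = json.dumps({"config": {"num_nodes": num_nodes}}).encode()
    return urllib.request.Request(
        f"https://cockroachlabs.cloud/api/v1/clusters/{cluster_id}",
        data=body,
        method="PATCH",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

req = build_scale_request("example-cluster-id", 5, "dummy-key")
print(req.method, req.full_url)
# To actually send it: urllib.request.urlopen(req)
```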

Via Terraform


```hcl
resource "cockroach_cluster" "example" {
  dedicated {
    num_virtual_cpus = 8     # vCPUs per node
    storage_gib      = 150
    num_nodes        = 5     # total nodes
  }
}
```

Pre-Scaling Check


```sql
-- Ensure no disruptive jobs are running before scaling down
WITH j AS (SHOW JOBS)
SELECT job_type, status, COUNT(*) FROM j WHERE status = 'running' GROUP BY 1, 2;
```

Constraints


  • Minimum: 3 nodes x 4 vCPUs (12 vCPUs total)
  • Scale down: Data must fit on remaining nodes; zone configs must be satisfiable
  • Scale up: Additional nodes available within your plan limits

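The minimum-size constraint can be encoded as a trivial guard before issuing an API or Terraform change. A sketch of only the published minimums (3 nodes, 4 vCPUs per node); the data-fit and plan-limit checks listed above still require the cluster-specific queries.

```python
# Sketch: validate a target Advanced/BYOC configuration against the
# minimums listed above (3 nodes x 4 vCPUs). Data-fit and zone-config
# satisfiability must still be checked separately.

def valid_advanced_config(num_nodes, vcpus_per_node):
    return num_nodes >= 3 and vcpus_per_node >= 4

print(valid_advanced_config(5, 8))
print(valid_advanced_config(2, 8))  # below the 3-node minimum
```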

BYOC Scaling


Applies when: Tier = BYOC
Follow all Advanced Scaling steps. BYOC scaling is managed through the same Cloud Console/API/Terraform interfaces.

Cloud Provider Verification (after scaling down)


If AWS:
```bash
aws ec2 describe-instances --filters "Name=tag:cockroach-cluster,Values=<cluster-name>" \
  --query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name}'
```
If GCP:
```bash
gcloud compute instances list --filter="labels.cockroach-cluster=<cluster-name>"
```
If Azure:
```bash
az vm list --resource-group <rg> --query "[?tags.cockroachCluster=='<name>']"
```

Additional BYOC Considerations


  • Verify security groups/firewall rules after scaling
  • Update reserved instance or committed use discount allocations
  • Verify network connectivity (PrivateLink/PSC/VPC Peering) is unaffected
  • Check cloud billing reflects the new instance count


Standard Compute Management


Applies when: Tier = Standard
Standard is a multi-tenant managed service. There are no individual nodes. Capacity is managed by adjusting provisioned compute (vCPUs).

Adjust Provisioned vCPUs


  1. Cloud Console → Cluster → Capacity
  2. Increase or decrease provisioned vCPUs
  3. Change takes effect without downtime

Before Scaling Down


  • Review CPU utilization in Cloud Console — ensure workload fits within reduced compute
  • Storage is usage-based and unaffected by compute changes

After Scaling


Monitor P99 latency and QPS in Cloud Console for 24-48 hours. If latency increases after scaling down, scale compute back up.


Basic Cost Management


Applies when: Tier = Basic
Basic is a serverless offering that auto-scales. There are no nodes or provisioned compute to manage. Capacity scales automatically based on demand. Cost is managed through spending controls.

Manage Spending


  • Set spending limits: Cloud Console → Cluster → Settings → configure monthly spending cap
  • Review usage: Cloud Console shows Request Unit (RU) consumption over time
  • Optimize queries: Reduce RU consumption through query tuning and indexing
  • Archive data: Delete unused tables or databases to reduce storage costs
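To see whether current usage will stay under a monthly spending cap, you can project month-end spend from RU consumption so far. A heavily hedged sketch: the per-RU price and all sample numbers below are hypothetical placeholders, not published rates; take real prices and RU totals from the Cloud Console.

```python
# Sketch: project month-end cost from Request Unit (RU) consumption so far,
# to compare against the monthly spending cap set in the Cloud Console.
# The RU price here is a made-up placeholder, not an official rate.

def projected_monthly_cost(rus_so_far, day_of_month, days_in_month, usd_per_million_ru):
    daily_rate = rus_so_far / day_of_month
    return daily_rate * days_in_month / 1_000_000 * usd_per_million_ru

cost = projected_monthly_cost(rus_so_far=450_000_000, day_of_month=10,
                              days_in_month=30, usd_per_million_ru=0.20)
print(round(cost, 2))
```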

When to Consider Upgrading


If you need explicit control over compute capacity (guaranteed vCPUs), consider upgrading to Standard. If you need dedicated infrastructure, consider Advanced.


Safety Considerations


| Operation | Tier | Reversible? |
| --- | --- | --- |
| `cockroach node decommission` | SH | Recommission only before completion |
| Stop decommissioned node | SH | No (must rejoin as new node) |
| Add node to cluster | SH | Yes (decommission to remove) |
| Scale via Console/API | ADV/BYOC | Contact support to reverse |
| Adjust provisioned vCPUs | STD | Yes (scale back) |
| Set spending limit | BAS | Yes (adjust anytime) |
Critical (Self-Hosted):
  • Never decommission below the replication factor
  • Always drain before decommission (for live nodes)
  • Decommission multiple nodes simultaneously (not sequentially)
  • Verify remaining capacity can absorb the data
  • For dead nodes: wait for re-replication to complete before decommissioning
  • Monitor storage utilization — nodes above 80% risk performance degradation

Troubleshooting


| Issue | Tier | Fix |
| --- | --- | --- |
| Decommission hangs | SH | Check zone config constraints; investigate stalled ranges |
| Recommission fails | SH | Node already fully decommissioned; must rejoin as new |
| New node not rebalancing | SH | Wait for automatic rebalancing; check `range_count` |
| Scale-down rejected | ADV/BYOC | Below minimum or data won't fit |
| Latency spike after reduction | STD | Scale provisioned vCPUs back up |
| Cloud instances not cleaned up | BYOC | Contact support; verify in cloud console |
| Dead node not re-replicating | SH | Check `server.time_until_store_dead`; verify surviving nodes have capacity |
| Storage utilization high after scale-down | SH | Add replacement node or increase disk size |

References


Skill references:
  • Replacing failed nodes
  • Storage management
Related skills:
  • reviewing-cluster-health — Pre/post health checks
  • performing-cluster-maintenance — Drain procedure (SH)
  • upgrading-cluster-version — Upgrades and lifecycle
Official CockroachDB Documentation: