
Managing Cluster Capacity


Manages cluster capacity across all CockroachDB deployment tiers. What "capacity" means varies by tier — Self-Hosted manages individual nodes, Advanced/BYOC manage node count and machine size, Standard manages provisioned vCPUs, and Basic auto-scales with cost controls.

When to Use This Skill


  • Permanently removing a node from a cluster (Self-Hosted)
  • Adding nodes to increase capacity (Self-Hosted)
  • Scaling cluster node count or machine size (Advanced, BYOC)
  • Adjusting provisioned compute (Standard)
  • Managing costs on a serverless cluster (Basic)
  • Replacing hardware or migrating infrastructure (Self-Hosted, BYOC)
  • Replacing a failed or dead node (Self-Hosted)
  • Managing storage utilization and disk pressure (Self-Hosted)
For temporary maintenance (not capacity changes): Use performing-cluster-maintenance. For pre-operation health check: Use reviewing-cluster-health.


Step 1: Gather Context


Required Context


| Question | Options | Why It Matters |
| --- | --- | --- |
| Deployment tier? | Self-Hosted, Advanced, BYOC, Standard, Basic | Different capacity model per tier |
| Direction? | Scale up (add capacity), Scale down (reduce capacity) | Determines procedure |

Additional Context (by tier)


If Self-Hosted (scaling down):
| Question | Options | Why It Matters |
| --- | --- | --- |
| How many nodes to remove? | 1, multiple | Multi-node decommission should be done simultaneously |
| Target node IDs? | Node IDs from `cockroach node status` | Required for CLI commands |
| Is the node alive or dead? | Alive, Dead | Dead nodes use a different procedure |
| Deployment platform? | Bare metal, VMs, Kubernetes | Changes CLI and cleanup steps |
| Current replication factor? | 3, 5, custom | Must have enough nodes remaining |
| Current node count? | Number | Validates remaining capacity |
| Storage utilization? | Low (<60%), Medium (60-80%), High (>80%) | Determines urgency and whether storage maintenance is needed |
If Advanced or BYOC:
| Question | Options | Why It Matters |
| --- | --- | --- |
| Scale method? | Cloud Console, API, Terraform | Determines procedure |
| Current and target configuration? | e.g., 5 nodes → 3 nodes, or 4 vCPU → 8 vCPU | Validates constraints |
| Cloud provider? (BYOC only) | AWS, GCP, Azure | Affects infrastructure verification |
If Standard:
| Question | Options | Why It Matters |
| --- | --- | --- |
| Current provisioned vCPUs? | Number | Context for scaling decision |
| Target vCPUs? | Number | Validates workload will fit |
If Basic: Gather cost management goals — Basic auto-scales with no manual capacity control.

Context-Driven Routing


Self-Hosted Capacity Management


Applies when: Tier = Self-Hosted

Scaling Down: Decommission Nodes


Pre-Decommission Validation


```sql
-- All nodes live
SELECT n.node_id, n.is_live, n.build_tag
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY n.node_id;

-- Ranges fully replicated
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
            ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;

-- Remaining capacity check
SELECT node_id, store_id,
  ROUND(capacity / 1073741824.0, 2) AS total_gb,
  ROUND(available / 1073741824.0, 2) AS available_gb,
  ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;

-- Replication factor
SHOW ZONE CONFIGURATION FOR RANGE default;
```

Remaining nodes must stay below 60% utilization after absorbing the data, and the node count after decommission must be at least the replication factor.
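The two rules above can be sketched as a quick pre-flight computation. This is a minimal illustration, not part of the official tooling: feed it the `capacity`/`available` figures from the `kv_store_status` query; the store sizes below are hypothetical sample values.

```python
# Sketch: can the remaining nodes absorb a decommissioned node's data?
# Checks the two rules stated above: projected utilization < 60% and
# remaining node count >= replication factor.

def safe_to_decommission(stores, leaving_ids, replication_factor=3, max_util=0.60):
    """stores: {node_id: (capacity_bytes, available_bytes)}"""
    leaving_used = sum(cap - avail
                       for nid, (cap, avail) in stores.items() if nid in leaving_ids)
    remaining = {nid: s for nid, s in stores.items() if nid not in leaving_ids}
    if len(remaining) < replication_factor:
        return False  # would drop below the replication factor
    rem_capacity = sum(cap for cap, _ in remaining.values())
    rem_used = sum(cap - avail for cap, avail in remaining.values())
    # Assume the leaving node's data spreads across the remaining nodes.
    projected = (rem_used + leaving_used) / rem_capacity
    return projected < max_util

GB = 1024 ** 3
# Hypothetical 4-node cluster: (capacity, available) per store.
stores = {1: (500 * GB, 350 * GB), 2: (500 * GB, 340 * GB),
          3: (500 * GB, 360 * GB), 4: (500 * GB, 345 * GB)}
print(safe_to_decommission(stores, {4}))
```

Removing two nodes from this sample cluster would leave only two, below a replication factor of 3, so the same check would refuse it.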

If Node Is Alive: Drain Then Decommission



Step 1: Drain


```bash
cockroach node drain <node_id> --certs-dir=<certs-dir> --host=<any-live-node>
```

Step 2: Decommission (single node)


```bash
cockroach node decommission <node_id> --certs-dir=<certs-dir> --host=<any-live-node>
```

Step 2: Decommission (multiple nodes — more efficient, do simultaneously)


```bash
cockroach node decommission <id_1> <id_2> <id_3> --certs-dir=<certs-dir> --host=<any-live-node>
```

If Node Is Dead: Replace Failed Node


When a node has been dead longer than `server.time_until_store_dead` (default 5m), CockroachDB automatically re-replicates its data to surviving nodes. Use this procedure to clean up the dead node and optionally add a replacement.
Step 1: Confirm the node is dead and data is safe
```sql
-- Confirm node is dead
SELECT node_id, is_live FROM crdb_internal.gossip_nodes WHERE node_id = <dead_node_id>;

-- Verify all ranges are fully replicated (no under-replicated after re-replication)
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
            ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;

-- Check remaining capacity can handle the load
SELECT node_id, ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
```
If under-replicated ranges exist, wait for re-replication to complete before proceeding.
Step 2: Decommission the dead node (metadata cleanup)
```bash
cockroach node decommission <dead_node_id> --certs-dir=<certs-dir> --host=<any-live-node>
```
Step 3: Add a replacement node (recommended)
If remaining nodes are above 60% utilization, provision a replacement node using the Scaling Up: Add Nodes procedure.
Multiple dead nodes: Decommission all dead nodes simultaneously:
```bash
cockroach node decommission <id_1> <id_2> --certs-dir=<certs-dir> --host=<any-live-node>
```
See replacing-failed-nodes reference for detailed failure scenarios and recovery procedures.

Monitor Decommission Progress


```bash
cockroach node status --decommission --certs-dir=<certs-dir> --host=<any-live-node>
```
Wait for `gossiped_replicas = 0` and `membership = 'decommissioned'`, then stop the process on the decommissioned node.
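The completion condition can be expressed as a small predicate over the status rows. A minimal sketch, assuming you have parsed `cockroach node status --decommission` output into dicts keyed by its column names; the sample row is hypothetical.

```python
# Sketch: the completion check described above, applied to parsed status rows.
# A node is done only when it gossips zero replicas AND its membership has
# moved from 'decommissioning' to 'decommissioned'.

def decommission_complete(rows):
    """rows: list of dicts with 'gossiped_replicas' and 'membership' fields."""
    return all(r["gossiped_replicas"] == 0 and r["membership"] == "decommissioned"
               for r in rows)

rows = [{"node_id": 4, "gossiped_replicas": 12, "membership": "decommissioning"}]
print(decommission_complete(rows))  # replicas still draining

rows[0].update(gossiped_replicas=0, membership="decommissioned")
print(decommission_complete(rows))
```

Poll until this returns true before stopping the process; stopping early leaves the node in the `decommissioning` state.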

Cancel a Decommission


```bash
cockroach node recommission <node_id> --certs-dir=<certs-dir> --host=<any-live-node>
```
This only works while the node is still in the `decommissioning` state.

Scaling Up: Add Nodes


  1. Provision new hardware/VM with the same specs as existing nodes
  2. Install the same CockroachDB version (confirm with `cockroach version`)
  3. Start the node with `--join` pointing to existing cluster nodes
  4. Verify the join:
     ```sql
     SELECT node_id, address, is_live FROM crdb_internal.gossip_nodes n
     JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY node_id;
     ```
  5. Data rebalances automatically; monitor with:
     ```sql
     SELECT node_id, range_count, lease_count
     FROM crdb_internal.kv_store_status ORDER BY node_id;
     ```
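One way to judge when rebalancing has settled is to compare each node's `range_count` against the cluster mean. This is an illustrative heuristic only; the 10% tolerance below is an assumed threshold, not an official one.

```python
# Sketch: a rough "has rebalancing converged?" check on the range_count
# column from the kv_store_status query above.

def rebalanced(range_counts, tolerance=0.10):
    """range_counts: {node_id: range_count}. True when every node is
    within `tolerance` of the mean range count."""
    counts = list(range_counts.values())
    mean = sum(counts) / len(counts)
    return all(abs(c - mean) <= tolerance * mean for c in counts)

print(rebalanced({1: 98, 2: 102, 3: 100, 4: 95}))   # counts near the mean
print(rebalanced({1: 130, 2: 130, 3: 130, 4: 10}))  # new node still filling up
```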

Post-Scaling Verification


```sql
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
            ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;

SELECT node_id, range_count, lease_count,
  ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
```


Advanced Scaling


Applies when: Tier = Advanced
Advanced clusters are managed by Cockroach Labs. Capacity is adjusted by changing node count or machine size.

Via Cloud Console


  1. Cluster → Capacity
  2. Adjust node count or machine type (vCPUs per node)
  3. CRL handles all node operations (drain, decommission, provisioning) safely
  4. Monitor progress in Cloud Console

Via Cloud API



Scale node count


```bash
curl -X PATCH \
  -H "Authorization: Bearer $COCKROACH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"config": {"num_nodes": <new_count>}}' \
  "https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>"
```
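For scripting, the same PATCH call can be built with the Python standard library. A minimal sketch: the request is constructed but not sent, and the cluster ID and key below are placeholders, not real values.

```python
# Sketch: build the cluster-scaling PATCH request shown in the curl command
# above, using only the standard library. Nothing is sent on construction.
import json
import urllib.request

def build_scale_request(cluster_id, num_nodes, api_key):
    body = json.dumps({"config": {"num_nodes": num_nodes}}).encode()
    return urllib.request.Request(
        f"https://cockroachlabs.cloud/api/v1/clusters/{cluster_id}",
        data=body,
        method="PATCH",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

req = build_scale_request("example-cluster-id", 5, "dummy-key")
print(req.method, req.full_url)
# To actually send it: urllib.request.urlopen(req)
```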

Via Terraform


```hcl
resource "cockroach_cluster" "example" {
  dedicated {
    num_virtual_cpus = 8     # vCPUs per node
    storage_gib      = 150
    num_nodes        = 5     # total nodes
  }
}
```

Pre-Scaling Check


```sql
-- Ensure no disruptive jobs are running before scaling down
WITH j AS (SHOW JOBS)
SELECT job_type, status, COUNT(*) FROM j WHERE status = 'running' GROUP BY 1, 2;
```

Constraints


  • Minimum: 3 nodes x 4 vCPUs (12 vCPUs total)
  • Scale down: Data must fit on remaining nodes; zone configs must be satisfiable
  • Scale up: Additional nodes available within your plan limits

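The minimum-size constraint can be encoded as a trivial guard before issuing an API or Terraform change. A sketch of only the published minimums (3 nodes, 4 vCPUs per node); the data-fit and plan-limit checks listed above still require the cluster-specific queries.

```python
# Sketch: validate a target Advanced/BYOC configuration against the
# minimums listed above (3 nodes x 4 vCPUs). Data-fit and zone-config
# satisfiability must still be checked separately.

def valid_advanced_config(num_nodes, vcpus_per_node):
    return num_nodes >= 3 and vcpus_per_node >= 4

print(valid_advanced_config(5, 8))
print(valid_advanced_config(2, 8))  # below the 3-node minimum
```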

BYOC Scaling


Applies when: Tier = BYOC
Follow all Advanced Scaling steps. BYOC scaling is managed through the same Cloud Console/API/Terraform interfaces.

Cloud Provider Verification (after scaling down)


If AWS:
```bash
aws ec2 describe-instances --filters "Name=tag:cockroach-cluster,Values=<cluster-name>" \
  --query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name}'
```
If GCP:
```bash
gcloud compute instances list --filter="labels.cockroach-cluster=<cluster-name>"
```
If Azure:
```bash
az vm list --resource-group <rg> --query "[?tags.cockroachCluster=='<name>']"
```

Additional BYOC Considerations


  • Verify security groups/firewall rules after scaling
  • Update reserved instance or committed use discount allocations
  • Verify network connectivity (PrivateLink/PSC/VPC Peering) is unaffected
  • Check cloud billing reflects the new instance count


Standard Compute Management


Applies when: Tier = Standard
Standard is a multi-tenant managed service. There are no individual nodes. Capacity is managed by adjusting provisioned compute (vCPUs).

Adjust Provisioned vCPUs


  1. Cloud Console → Cluster → Capacity
  2. Increase or decrease provisioned vCPUs
  3. Change takes effect without downtime

Before Scaling Down


  • Review CPU utilization in Cloud Console — ensure workload fits within reduced compute
  • Storage is usage-based and unaffected by compute changes

After Scaling


Monitor P99 latency and QPS in Cloud Console for 24-48 hours. If latency increases after scaling down, scale compute back up.


Basic Cost Management


Applies when: Tier = Basic
Basic is a serverless offering that auto-scales. There are no nodes or provisioned compute to manage. Capacity scales automatically based on demand. Cost is managed through spending controls.

Manage Spending


  • Set spending limits: Cloud Console → Cluster → Settings → configure monthly spending cap
  • Review usage: Cloud Console shows Request Unit (RU) consumption over time
  • Optimize queries: Reduce RU consumption through query tuning and indexing
  • Archive data: Delete unused tables or databases to reduce storage costs
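To see whether current usage will stay under a monthly spending cap, you can project month-end spend from RU consumption so far. A heavily hedged sketch: the per-RU price and all sample numbers below are hypothetical placeholders, not published rates; take real prices and RU totals from the Cloud Console.

```python
# Sketch: project month-end cost from Request Unit (RU) consumption so far,
# to compare against the monthly spending cap set in the Cloud Console.
# The RU price here is a made-up placeholder, not an official rate.

def projected_monthly_cost(rus_so_far, day_of_month, days_in_month, usd_per_million_ru):
    daily_rate = rus_so_far / day_of_month
    return daily_rate * days_in_month / 1_000_000 * usd_per_million_ru

cost = projected_monthly_cost(rus_so_far=450_000_000, day_of_month=10,
                              days_in_month=30, usd_per_million_ru=0.20)
print(round(cost, 2))
```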

When to Consider Upgrading


If you need explicit control over compute capacity (guaranteed vCPUs), consider upgrading to Standard. If you need dedicated infrastructure, consider Advanced.


Safety Considerations


| Operation | Tier | Reversible? |
| --- | --- | --- |
| `cockroach node decommission` | SH | Recommission only before completion |
| Stop decommissioned node | SH | No (must rejoin as new node) |
| Add node to cluster | SH | Yes (decommission to remove) |
| Scale via Console/API | ADV/BYOC | Contact support to reverse |
| Adjust provisioned vCPUs | STD | Yes (scale back) |
| Set spending limit | BAS | Yes (adjust anytime) |
Critical (Self-Hosted):
  • Never decommission below the replication factor
  • Always drain before decommission (for live nodes)
  • Decommission multiple nodes simultaneously (not sequentially)
  • Verify remaining capacity can absorb the data
  • For dead nodes: wait for re-replication to complete before decommissioning
  • Monitor storage utilization — nodes above 80% risk performance degradation

Troubleshooting


| Issue | Tier | Fix |
| --- | --- | --- |
| Decommission hangs | SH | Check zone config constraints; investigate stalled ranges |
| Recommission fails | SH | Node already fully decommissioned; must rejoin as new |
| New node not rebalancing | SH | Wait for automatic rebalancing; check `range_count` |
| Scale-down rejected | ADV/BYOC | Below minimum or data won't fit |
| Latency spike after reduction | STD | Scale provisioned vCPUs back up |
| Cloud instances not cleaned up | BYOC | Contact support; verify in cloud console |
| Dead node not re-replicating | SH | Check `server.time_until_store_dead`; verify surviving nodes have capacity |
| Storage utilization high after scale-down | SH | Add replacement node or increase disk size |

References


Skill references:
  • Replacing failed nodes
  • Storage management
Related skills:
  • reviewing-cluster-health — Pre/post health checks
  • performing-cluster-maintenance — Drain procedure (SH)
  • upgrading-cluster-version — Upgrades and lifecycle
Official CockroachDB Documentation: