managing-cluster-capacity
Managing Cluster Capacity
Manages cluster capacity across all CockroachDB deployment tiers. What "capacity" means varies by tier — Self-Hosted manages individual nodes, Advanced/BYOC manage node count and machine size, Standard manages provisioned vCPUs, and Basic auto-scales with cost controls.
When to Use This Skill
何时使用此技能
- Permanently removing a node from a cluster (Self-Hosted)
- Adding nodes to increase capacity (Self-Hosted)
- Scaling cluster node count or machine size (Advanced, BYOC)
- Adjusting provisioned compute (Standard)
- Managing costs on a serverless cluster (Basic)
- Replacing hardware or migrating infrastructure (Self-Hosted, BYOC)
- Replacing a failed or dead node (Self-Hosted)
- Managing storage utilization and disk pressure (Self-Hosted)
For temporary maintenance (not capacity changes): Use performing-cluster-maintenance.
For pre-operation health check: Use reviewing-cluster-health.
Step 1: Gather Context
步骤1:收集上下文信息
Required Context
必填上下文
| Question | Options | Why It Matters |
|---|---|---|
| Deployment tier? | Self-Hosted, Advanced, BYOC, Standard, Basic | Different capacity model per tier |
| Direction? | Scale up (add capacity), Scale down (reduce capacity) | Determines procedure |
Additional Context (by tier)
各层级补充上下文
If Self-Hosted (scaling down):
| Question | Options | Why It Matters |
|---|---|---|
| How many nodes to remove? | 1, multiple | Multi-node decommission should be done simultaneously |
| Target node IDs? | Node IDs from `cockroach node status` | Required for CLI commands |
| Is the node alive or dead? | Alive, Dead | Dead nodes use a different procedure |
| Deployment platform? | Bare metal, VMs, Kubernetes | Changes CLI and cleanup steps |
| Current replication factor? | 3, 5, custom | Must have enough nodes remaining |
| Current node count? | Number | Validates remaining capacity |
| Storage utilization? | Low (<60%), Medium (60-80%), High (>80%) | Determines urgency and whether storage maintenance is needed |
If Advanced or BYOC:
| Question | Options | Why It Matters |
|---|---|---|
| Scale method? | Cloud Console, API, Terraform | Determines procedure |
| Current and target configuration? | e.g., 5 nodes → 3 nodes, or 4 vCPU → 8 vCPU | Validates constraints |
| Cloud provider? (BYOC only) | AWS, GCP, Azure | Affects infrastructure verification |
If Standard:
| Question | Options | Why It Matters |
|---|---|---|
| Current provisioned vCPUs? | Number | Context for scaling decision |
| Target vCPUs? | Number | Validates workload will fit |
If Basic: Gather cost management goals — Basic auto-scales with no manual capacity control.
Context-Driven Routing
上下文驱动的流程路由
| Tier | Go To |
|---|---|
| Self-Hosted | Self-Hosted Capacity Management |
| Advanced | Advanced Scaling |
| BYOC | BYOC Scaling |
| Standard | Standard Compute Management |
| Basic | Basic Cost Management |
Self-Hosted Capacity Management
Applies when: Tier = Self-Hosted
Scaling Down: Decommission Nodes
Pre-Decommission Validation
```sql
-- All nodes live
SELECT n.node_id, n.is_live, n.build_tag
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY n.node_id;
-- Ranges fully replicated
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
       ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;
-- Remaining capacity check
SELECT node_id, store_id,
       ROUND(capacity / 1073741824.0, 2) AS total_gb,
       ROUND(available / 1073741824.0, 2) AS available_gb,
       ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
-- Replication factor
SHOW ZONE CONFIGURATION FOR RANGE default;
```

Remaining nodes must stay below 60% utilization after absorbing the data, and the node count after decommission must be at least the replication factor.
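The two numeric rules above can be turned into a quick pre-check. A minimal sketch, assuming the removed nodes' used bytes spread evenly over the survivors (real rebalancing follows zone configs); the store figures are hypothetical:

```python
def can_decommission(stores, remove_ids, replication_factor=3):
    """Project post-decommission utilization from kv_store_status-style rows.
    Even-spread assumption: removed bytes are split equally among survivors."""
    survivors = [s for s in stores if s["node_id"] not in remove_ids]
    removed = [s for s in stores if s["node_id"] in remove_ids]
    if len(survivors) < replication_factor:
        return False, "node count would drop below the replication factor"
    moved = sum(s["capacity"] - s["available"] for s in removed)
    extra = moved / len(survivors)  # even-spread assumption
    for s in survivors:
        used = (s["capacity"] - s["available"]) + extra
        if used / s["capacity"] >= 0.60:
            return False, f"node {s['node_id']} would exceed 60% utilization"
    return True, "ok"

# Hypothetical 5-node cluster: 100 GiB stores, each 40% used
gib = 1024 ** 3
stores = [{"node_id": i, "capacity": 100 * gib, "available": 60 * gib}
          for i in range(1, 6)]
print(can_decommission(stores, {5}))     # survivors land at 50%: safe
print(can_decommission(stores, {4, 5}))  # survivors would hit ~67%: unsafe
```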
If Node Is Alive: Drain Then Decommission
Step 1: Drain

```bash
cockroach node drain <node_id> --certs-dir=<certs-dir> --host=<any-live-node>
```

Step 2: Decommission (single node)

```bash
cockroach node decommission <node_id> --certs-dir=<certs-dir> --host=<any-live-node>
```

Step 2: Decommission (multiple nodes; decommissioning simultaneously is more efficient)

```bash
cockroach node decommission <id_1> <id_2> <id_3> --certs-dir=<certs-dir> --host=<any-live-node>
```
If Node Is Dead: Replace Failed Node
When a node has been dead longer than `server.time_until_store_dead` (default 5m), CockroachDB automatically re-replicates its data to surviving nodes. Use this procedure to clean up the dead node and optionally add a replacement.
Step 1: Confirm the node is dead and data is safe
```sql
-- Confirm node is dead
SELECT node_id, is_live FROM crdb_internal.gossip_nodes WHERE node_id = <dead_node_id>;
-- Verify all ranges are fully replicated (none under-replicated after re-replication)
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
       ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;
-- Check remaining capacity can handle the load
SELECT node_id, ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
```

If under-replicated ranges exist, wait for re-replication to complete before proceeding.
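This wait can be scripted as a simple poll. A sketch under the assumption that `run_sql` is a caller-supplied helper (e.g., wrapping `cockroach sql --execute`) that returns the under-replicated range count as an integer:

```python
import time

def wait_until_fully_replicated(run_sql, poll_seconds=30, timeout_seconds=3600):
    """Poll until no range has fewer than 3 replicas, or until timeout."""
    query = ("SELECT COUNT(*) FROM crdb_internal.ranges_no_leases "
             "WHERE array_length(replicas, 1) < 3")
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if run_sql(query) == 0:
            return True
        time.sleep(poll_seconds)
    return False

# Stubbed run_sql: pretends re-replication finishes on the third poll
counts = iter([12, 3, 0])
print(wait_until_fully_replicated(lambda q: next(counts), poll_seconds=0))  # True
```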
Step 2: Decommission the dead node (metadata cleanup)
```bash
cockroach node decommission <dead_node_id> --certs-dir=<certs-dir> --host=<any-live-node>
```

Step 3: Add a replacement node (recommended)
If remaining nodes are above 60% utilization, provision a replacement node using the Scaling Up: Add Nodes procedure.
Multiple dead nodes: Decommission all dead nodes simultaneously:
```bash
cockroach node decommission <id_1> <id_2> --certs-dir=<certs-dir> --host=<any-live-node>
```

See the replacing-failed-nodes reference for detailed failure scenarios and recovery procedures.
Monitor Decommission Progress
```bash
cockroach node status --decommission --certs-dir=<certs-dir> --host=<any-live-node>
```

Wait for `gossiped_replicas = 0` and `membership = 'decommissioned'`, then stop the cockroach process on the decommissioned node.
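The wait can be automated by parsing the status output. A sketch assuming `--format=csv` output with `gossiped_replicas` and `membership` columns (verify the column names against your CockroachDB version):

```python
import csv
import io

def decommission_done(status_csv):
    """True when every node in `cockroach node status --decommission
    --format=csv` output reports zero replicas and 'decommissioned'."""
    rows = list(csv.DictReader(io.StringIO(status_csv)))
    return bool(rows) and all(
        int(r["gossiped_replicas"]) == 0 and r["membership"] == "decommissioned"
        for r in rows
    )

done = "id,is_live,gossiped_replicas,membership,draining\n4,false,0,decommissioned,true\n"
pending = "id,is_live,gossiped_replicas,membership,draining\n4,true,12,decommissioning,true\n"
print(decommission_done(done), decommission_done(pending))  # True False
```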
Cancel a Decommission
```bash
cockroach node recommission <node_id> --certs-dir=<certs-dir> --host=<any-live-node>
```

Recommission only works while the node is still in the `decommissioning` state.
Scaling Up: Add Nodes
- Provision new hardware/VM with same specs as existing nodes
- Install the same CockroachDB version (run `cockroach version` to confirm)
- Start the node with `--join` pointing to existing cluster nodes
- Verify the join:

```sql
SELECT node_id, address, is_live FROM crdb_internal.gossip_nodes n JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY node_id;
```

- Data rebalances automatically; monitor with:

```sql
SELECT node_id, range_count, lease_count FROM crdb_internal.kv_store_status ORDER BY node_id;
```
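One way to decide the new node has caught up is to compare per-node `range_count` values from the query above. A sketch with an arbitrary 10% tolerance (tune for your cluster):

```python
def rebalanced(range_counts, tolerance=0.10):
    """True when every node's range count is within `tolerance` of the mean."""
    mean = sum(range_counts) / len(range_counts)
    return all(abs(c - mean) <= tolerance * mean for c in range_counts)

print(rebalanced([120, 118, 122, 119]))  # tight spread: converged
print(rebalanced([150, 150, 150, 5]))    # new node still catching up
```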
Post-Scaling Verification
```sql
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
       ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;
SELECT node_id, range_count, lease_count,
       ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
```

Advanced Scaling
Applies when: Tier = Advanced
Advanced clusters are managed by Cockroach Labs. Capacity is adjusted by changing node count or machine size.
Via Cloud Console
- Cluster → Capacity
- Adjust node count or machine type (vCPUs per node)
- CRL handles all node operations (drain, decommission, provisioning) safely
- Monitor progress in Cloud Console
Via Cloud API
Scale node count

```bash
curl -X PATCH -H "Authorization: Bearer $COCKROACH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"config": {"num_nodes": <new_count>}}' \
  "https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>"
```

Via Terraform
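For scripting, the same PATCH call can be assembled and inspected before sending. A sketch that only builds the request (the URL and payload shape come from the curl example above; actually sending it, e.g. with `urllib.request`, is left out):

```python
import json

API_BASE = "https://cockroachlabs.cloud/api/v1"

def build_scale_request(cluster_id, num_nodes, api_key):
    """Return the method, URL, headers, and JSON body for a node-count change."""
    return {
        "method": "PATCH",
        "url": f"{API_BASE}/clusters/{cluster_id}",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"config": {"num_nodes": num_nodes}}),
    }

req = build_scale_request("my-cluster-id", 5, "my-api-key")
print(req["method"], req["url"])
```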
```hcl
resource "cockroach_cluster" "example" {
  dedicated {
    num_virtual_cpus = 8   # vCPUs per node
    storage_gib      = 150
    num_nodes        = 5   # total nodes
  }
}
```

Pre-Scaling Check
```sql
-- Ensure no disruptive jobs are running before scaling down
WITH j AS (SHOW JOBS)
SELECT job_type, status, COUNT(*) FROM j WHERE status = 'running' GROUP BY 1, 2;
```

Constraints
- Minimum: 3 nodes x 4 vCPUs (12 vCPUs total)
- Scale down: Data must fit on remaining nodes; zone configs must be satisfiable
- Scale up: Additional nodes available within your plan limits
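The constraints above can be encoded as a pre-flight check. A minimal sketch; the data-fit rule is simplified (it ignores replication overhead and zone constraints):

```python
def valid_advanced_scale(num_nodes, vcpus_per_node, data_gib=0, storage_gib_per_node=None):
    """Enforce the 3-node x 4-vCPU floor; optionally check that data fits."""
    if num_nodes < 3 or vcpus_per_node < 4:
        return False  # below the 12-vCPU minimum footprint
    if storage_gib_per_node is not None:
        return data_gib <= num_nodes * storage_gib_per_node
    return True

print(valid_advanced_scale(5, 8))  # meets the floor
print(valid_advanced_scale(2, 8))  # too few nodes
print(valid_advanced_scale(3, 4, data_gib=500, storage_gib_per_node=150))  # 500 > 450
```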
BYOC Scaling
Applies when: Tier = BYOC
Follow all Advanced Scaling steps. BYOC scaling is managed through the same Cloud Console/API/Terraform interfaces.
Cloud Provider Verification (after scaling down)
If AWS:
```bash
aws ec2 describe-instances --filters "Name=tag:cockroach-cluster,Values=<cluster-name>" \
  --query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name}'
```

If GCP:

```bash
gcloud compute instances list --filter="labels.cockroach-cluster=<cluster-name>"
```

If Azure:

```bash
az vm list --resource-group <rg> --query "[?tags.cockroachCluster=='<name>']"
```

Additional BYOC Considerations
- Verify security groups/firewall rules after scaling
- Update reserved instance or committed use discount allocations
- Verify network connectivity (PrivateLink/PSC/VPC Peering) is unaffected
- Check cloud billing reflects the new instance count
Standard Compute Management
Applies when: Tier = Standard
Standard is a multi-tenant managed service. There are no individual nodes. Capacity is managed by adjusting provisioned compute (vCPUs).
Adjust Provisioned vCPUs
- Cloud Console → Cluster → Capacity
- Increase or decrease provisioned vCPUs
- Change takes effect without downtime
Before Scaling Down
- Review CPU utilization in Cloud Console — ensure workload fits within reduced compute
- Storage is usage-based and unaffected by compute changes
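The "workload fits" judgment can be sketched from current utilization. The 20% headroom figure below is a hypothetical policy, not a documented threshold:

```python
def fits_after_downsize(current_vcpus, target_vcpus, cpu_utilization_pct, headroom_pct=20):
    """Linearly project CPU utilization at the lower vCPU count and
    require some headroom below 100%."""
    projected = cpu_utilization_pct * current_vcpus / target_vcpus
    return projected <= 100 - headroom_pct

print(fits_after_downsize(8, 4, 30))  # projects to 60%: fits
print(fits_after_downsize(8, 4, 55))  # projects to 110%: does not fit
```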
After Scaling
Monitor P99 latency and QPS in Cloud Console for 24-48 hours. If latency increases after scaling down, scale compute back up.
Basic Cost Management
Applies when: Tier = Basic
Basic is a serverless offering that auto-scales. There are no nodes or provisioned compute to manage. Capacity scales automatically based on demand. Cost is managed through spending controls.
Manage Spending
- Set spending limits: Cloud Console → Cluster → Settings → configure monthly spending cap
- Review usage: Cloud Console shows Request Unit (RU) consumption over time
- Optimize queries: Reduce RU consumption through query tuning and indexing
- Archive data: Delete unused tables or databases to reduce storage costs
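To decide whether a spending cap is at risk, RU consumption can be projected linearly to month end. The per-RU price below is a placeholder; take the real rate from your Cloud Console billing page:

```python
def projected_monthly_spend(rus_so_far, day_of_month, days_in_month, usd_per_million_ru):
    """Linear month-end projection of Request Unit spend in USD."""
    projected_rus = rus_so_far / day_of_month * days_in_month
    return projected_rus / 1_000_000 * usd_per_million_ru

# Hypothetical: 450M RUs by day 10 of a 30-day month at $0.20 per million RUs
spend = projected_monthly_spend(450_000_000, 10, 30, 0.20)
print(f"${spend:.2f} projected")  # $270.00 projected
```

Compare the projection against the configured monthly cap to decide whether to tune queries or raise the limit.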
When to Consider Upgrading
If you need explicit control over compute capacity (guaranteed vCPUs), consider upgrading to Standard. If you need dedicated infrastructure, consider Advanced.
Safety Considerations
| Operation | Tier | Reversible? |
|---|---|---|
| Decommission node | SH | Recommission only before completion |
| Stop decommissioned node | SH | No (must rejoin as new node) |
| Add node to cluster | SH | Yes (decommission to remove) |
| Scale via Console/API | ADV/BYOC | Contact support to reverse |
| Adjust provisioned vCPUs | STD | Yes (scale back) |
| Set spending limit | BAS | Yes (adjust anytime) |
Critical (Self-Hosted):
- Never decommission below the replication factor
- Always drain before decommission (for live nodes)
- Decommission multiple nodes simultaneously (not sequentially)
- Verify remaining capacity can absorb the data
- For dead nodes: wait for re-replication to complete before decommissioning
- Monitor storage utilization — nodes above 80% risk performance degradation
Troubleshooting
| Issue | Tier | Fix |
|---|---|---|
| Decommission hangs | SH | Check zone config constraints; investigate stalled ranges |
| Recommission fails | SH | Node already fully decommissioned; must rejoin as new |
| New node not rebalancing | SH | Wait for automatic rebalancing; check `range_count` in `crdb_internal.kv_store_status` |
| Scale-down rejected | ADV/BYOC | Below minimum or data won't fit |
| Latency spike after reduction | STD | Scale provisioned vCPUs back up |
| Cloud instances not cleaned up | BYOC | Contact support; verify in cloud console |
| Dead node not re-replicating | SH | Check `server.time_until_store_dead` and node liveness |
| Storage utilization high after scale-down | SH | Add replacement node or increase disk size |
References
Skill references:
- Replacing failed nodes
- Storage management
Related skills:
- reviewing-cluster-health — Pre/post health checks
- performing-cluster-maintenance — Drain procedure (SH)
- upgrading-cluster-version — Upgrades and lifecycle
Official CockroachDB Documentation: