Loading...
Loading...
Compare original and translation side by side
| Question | Options | Why It Matters |
|---|---|---|
| Deployment tier? | Self-Hosted, Advanced, BYOC, Standard, Basic | Determines available diagnostics and operator responsibilities |
| Reason for health check? | Daily check, Pre-maintenance, Post-incident, Pre-upgrade | Prioritizes which dimensions to check first |
| 问题 | 选项 | 重要性说明 |
|---|---|---|
| 部署层级? | Self-Hosted、Advanced、BYOC、Standard、Basic | 决定可用的诊断工具及运维人员职责 |
| 健康检查原因? | 日常检查、维护前、事件后、升级前 | 确定优先检查的维度 |
| Question | Options | Why It Matters |
|---|---|---|
| Access available? | SQL + CLI, SQL only | Determines which tools can be used |
| Cloud provider? | AWS, GCP, Azure, On-Premises | Affects infrastructure-level checks |
| Kubernetes deployment? | Yes (Operator, Helm, manual), No | Changes CLI commands and monitoring |
| Node count and regions? | e.g., 9 nodes, 3 regions | Sets expectations for query results |
| Question | Options | Why It Matters |
|---|---|---|
| Cloud provider? (BYOC only) | AWS, GCP, Azure | For infrastructure-level monitoring in your cloud account |
| Question | Options | Why It Matters |
|---|---|---|
| Current provisioned vCPUs? | Number | Context for compute utilization assessment |
| 问题 | 选项 | 重要性说明 |
|---|---|---|
| 可用访问方式? | SQL + CLI、仅SQL | 决定可使用的工具 |
| 云服务商? | AWS、GCP、Azure、本地部署 | 影响基础设施级别的检查 |
| 是否基于Kubernetes部署? 是(Operator、Helm、手动)、否 | 会改变CLI命令与监控方式 | |
| 节点数量和区域? 例如:9节点,3个区域 | 为查询结果设定预期 |
| 问题 | 选项 | 重要性说明 |
|---|---|---|
| 云服务商?(仅BYOC) | AWS、GCP、Azure | 用于在您的云账户中监控基础设施 |
| 问题 | 选项 | 重要性说明 |
|---|---|---|
| 当前已配置的vCPU数量? | 具体数值 | 为计算资源利用率评估提供上下文 |
| Tier | Go To |
|---|---|
| Self-Hosted | Self-Hosted Health Check |
| Advanced | Advanced Health Check |
| BYOC | BYOC Health Check |
| Standard | Standard Health Check |
| Basic | Basic Health Check |
| 层级 | 跳转至 |
|---|---|
| Self-Hosted | 自托管集群健康检查 |
| Advanced | 高级版集群健康检查 |
| BYOC | BYOC集群健康检查 |
| Standard | 标准版集群健康检查 |
| Basic | 基础版集群健康检查 |
SELECT
n.node_id, n.address, n.build_tag AS version, n.locality,
n.is_live, l.epoch,
CASE WHEN n.is_live THEN 'HEALTHY'
WHEN n.is_live IS NULL THEN 'UNKNOWN'
ELSE 'DOWN' END AS health_status
FROM crdb_internal.gossip_nodes n
LEFT JOIN crdb_internal.gossip_liveness l ON n.node_id = l.node_id
ORDER BY n.node_id;is_live = falsegossip_nodesepochcockroach node status --certs-dir=<certs-dir> --host=<node-address>SELECT
n.node_id, n.address, n.build_tag AS version, n.locality,
n.is_live, l.epoch,
CASE WHEN n.is_live THEN 'HEALTHY'
WHEN n.is_live IS NULL THEN 'UNKNOWN'
ELSE 'DOWN' END AS health_status
FROM crdb_internal.gossip_nodes n
LEFT JOIN crdb_internal.gossip_liveness l ON n.node_id = l.node_id
ORDER BY n.node_id;is_live = falsegossip_nodesepochcockroach node status --certs-dir=<certs-dir> --host=<node-address>SELECT build_tag AS version, COUNT(*) AS node_count,
array_agg(node_id ORDER BY node_id) AS node_ids
FROM crdb_internal.gossip_nodes GROUP BY build_tag;SELECT build_tag AS version, COUNT(*) AS node_count,
array_agg(node_id ORDER BY node_id) AS node_ids
FROM crdb_internal.gossip_nodes GROUP BY build_tag;SELECT node_id, store_id,
ROUND(capacity / 1073741824.0, 2) AS total_gb,
ROUND(available / 1073741824.0, 2) AS available_gb,
ROUND((1 - (available::FLOAT / capacity::FLOAT)) * 100, 2) AS utilization_pct,
CASE WHEN (available::FLOAT / capacity::FLOAT) < 0.10 THEN 'CRITICAL'
WHEN (available::FLOAT / capacity::FLOAT) < 0.30 THEN 'WARNING'
ELSE 'OK' END AS capacity_status,
range_count, lease_count
FROM crdb_internal.kv_store_status ORDER BY utilization_pct DESC;SELECT node_id, store_id,
ROUND(capacity / 1073741824.0, 2) AS total_gb,
ROUND(available / 1073741824.0, 2) AS available_gb,
ROUND((1 - (available::FLOAT / capacity::FLOAT)) * 100, 2) AS utilization_pct,
CASE WHEN (available::FLOAT / capacity::FLOAT) < 0.10 THEN 'CRITICAL'
WHEN (available::FLOAT / capacity::FLOAT) < 0.30 THEN 'WARNING'
ELSE 'OK' END AS capacity_status,
range_count, lease_count
FROM crdb_internal.kv_store_status ORDER BY utilization_pct DESC;SELECT
CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
WHEN array_length(replicas, 1) = 2 THEN 'under_replicated'
WHEN array_length(replicas, 1) = 1 THEN 'critically_under_replicated'
ELSE 'unknown' END AS replication_status,
COUNT(*) AS range_count
FROM crdb_internal.ranges_no_leases GROUP BY 1 ORDER BY 1;SELECT
CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
WHEN array_length(replicas, 1) = 2 THEN 'under_replicated'
WHEN array_length(replicas, 1) = 1 THEN 'critically_under_replicated'
ELSE 'unknown' END AS replication_status,
COUNT(*) AS range_count
FROM crdb_internal.ranges_no_leases GROUP BY 1 ORDER BY 1;SELECT node_id,
to_timestamp((metrics->>'security.certificate.expiration.ca')::FLOAT)::TIMESTAMPTZ AS ca_expires,
to_timestamp((metrics->>'security.certificate.expiration.node')::FLOAT)::TIMESTAMPTZ AS node_cert_expires,
CASE WHEN to_timestamp((metrics->>'security.certificate.expiration.node')::FLOAT)::TIMESTAMPTZ
< now() + INTERVAL '90 days' THEN 'EXPIRING_SOON'
ELSE 'OK' END AS cert_status
FROM crdb_internal.kv_node_status ORDER BY node_cert_expires;SELECT node_id,
to_timestamp((metrics->>'security.certificate.expiration.ca')::FLOAT)::TIMESTAMPTZ AS ca_expires,
to_timestamp((metrics->>'security.certificate.expiration.node')::FLOAT)::TIMESTAMPTZ AS node_cert_expires,
CASE WHEN to_timestamp((metrics->>'security.certificate.expiration.node')::FLOAT)::TIMESTAMPTZ
< now() + INTERVAL '90 days' THEN 'EXPIRING_SOON'
ELSE 'OK' END AS cert_status
FROM crdb_internal.kv_node_status ORDER BY node_cert_expires;SELECT variable, value FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
'server.time_until_store_dead', 'admission.kv.enabled',
'cluster.preserve_downgrade_option', 'gc.ttlseconds'
) ORDER BY variable;SELECT variable, value FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
'server.time_until_store_dead', 'admission.kv.enabled',
'cluster.preserve_downgrade_option', 'gc.ttlseconds'
) ORDER BY variable;SELECT 'live_nodes' AS metric, COUNT(*)::TEXT AS value
FROM crdb_internal.gossip_nodes WHERE is_live = true
UNION ALL SELECT 'dead_nodes', COUNT(*)::TEXT
FROM crdb_internal.gossip_nodes WHERE is_live = false
UNION ALL SELECT 'distinct_versions', COUNT(DISTINCT build_tag)::TEXT
FROM crdb_internal.gossip_nodes
UNION ALL SELECT 'total_ranges', COUNT(*)::TEXT
FROM crdb_internal.ranges_no_leases
UNION ALL SELECT 'min_store_available_pct',
ROUND(MIN(available::FLOAT / capacity::FLOAT) * 100, 2)::TEXT
FROM crdb_internal.kv_store_status
UNION ALL SELECT 'cluster_version', value
FROM [SHOW CLUSTER SETTING version];WITH j AS (SHOW JOBS)
SELECT job_type, COUNT(*) FROM j WHERE status = 'running' GROUP BY job_type;SELECT 'live_nodes' AS metric, COUNT(*)::TEXT AS value
FROM crdb_internal.gossip_nodes WHERE is_live = true
UNION ALL SELECT 'dead_nodes', COUNT(*)::TEXT
FROM crdb_internal.gossip_nodes WHERE is_live = false
UNION ALL SELECT 'distinct_versions', COUNT(DISTINCT build_tag)::TEXT
FROM crdb_internal.gossip_nodes
UNION ALL SELECT 'total_ranges', COUNT(*)::TEXT
FROM crdb_internal.ranges_no_leases
UNION ALL SELECT 'min_store_available_pct',
ROUND(MIN(available::FLOAT / capacity::FLOAT) * 100, 2)::TEXT
FROM crdb_internal.kv_store_status
UNION ALL SELECT 'cluster_version', value
FROM [SHOW CLUSTER SETTING version];WITH j AS (SHOW JOBS)
SELECT job_type, COUNT(*) FROM j WHERE status = 'running' GROUP BY job_type;-- Node count and replication (minimum 3 nodes for production)
SELECT COUNT(*) AS total_nodes,
COUNT(*) FILTER (WHERE n.is_live) AS live_nodes,
COUNT(DISTINCT n.locality) AS distinct_localities
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id);
-- Critical production settings check
SELECT variable, value,
CASE
WHEN variable = 'kv.rangefeed.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'kv.rangefeed.enabled' AND value = 'false' THEN 'WARN: should be true for CDC'
WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'false' THEN 'WARN: should be true'
WHEN variable = 'admission.kv.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'admission.kv.enabled' AND value = 'false' THEN 'WARN: recommended for production'
WHEN variable = 'cluster.preserve_downgrade_option' AND value != '' THEN 'INFO: finalization pending'
ELSE 'OK'
END AS assessment
FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
'admission.kv.enabled', 'cluster.preserve_downgrade_option',
'server.time_until_store_dead', 'gc.ttlseconds'
) ORDER BY variable;
-- Enterprise license status (Self-Hosted only)
SELECT value AS organization FROM [SHOW CLUSTER SETTING cluster.organization];-- 节点数量与副本情况(生产环境至少需要3个节点)
SELECT COUNT(*) AS total_nodes,
COUNT(*) FILTER (WHERE n.is_live) AS live_nodes,
COUNT(DISTINCT n.locality) AS distinct_localities
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id);
-- 关键生产配置检查
SELECT variable, value,
CASE
WHEN variable = 'kv.rangefeed.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'kv.rangefeed.enabled' AND value = 'false' THEN 'WARN: should be true for CDC'
WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'false' THEN 'WARN: should be true'
WHEN variable = 'admission.kv.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'admission.kv.enabled' AND value = 'false' THEN 'WARN: recommended for production'
WHEN variable = 'cluster.preserve_downgrade_option' AND value != '' THEN 'INFO: finalization pending'
ELSE 'OK'
END AS assessment
FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
'admission.kv.enabled', 'cluster.preserve_downgrade_option',
'server.time_until_store_dead', 'gc.ttlseconds'
) ORDER BY variable;
-- 企业版许可证状态(仅Self-Hosted)
SELECT value AS organization FROM [SHOW CLUSTER SETTING cluster.organization];-- Node liveness (nodes are visible on Advanced)
SELECT n.node_id, n.build_tag, n.is_live
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY n.node_id;
-- Version consistency
SELECT build_tag AS version, COUNT(*) FROM crdb_internal.gossip_nodes GROUP BY 1;
-- Range health
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;
-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, COUNT(*) FROM j
WHERE status IN ('running', 'failed') AND created > now() - INTERVAL '24 hours'
GROUP BY job_type, status;-- 节点存活状态(Advanced集群可见节点)
SELECT n.node_id, n.build_tag, n.is_live
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY n.node_id;
-- 版本一致性
SELECT build_tag AS version, COUNT(*) FROM crdb_internal.gossip_nodes GROUP BY 1;
-- 数据范围健康状态
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;
-- 近期失败的任务
WITH j AS (SHOW JOBS)
SELECT job_type, status, COUNT(*) FROM j
WHERE status IN ('running', 'failed') AND created > now() - INTERVAL '24 hours'
GROUP BY job_type, status;curl -s -H "Authorization: Bearer $COCKROACH_API_KEY" \
"https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>" | jq '.state, .cockroach_version'curl -s -H "Authorization: Bearer $COCKROACH_API_KEY" \
"https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>" | jq '.state, .cockroach_version'aws ec2 describe-instance-status --filters "Name=tag:cockroach-cluster,Values=<cluster-name>"gcloud compute instances list --filter="labels.cockroach-cluster=<cluster-name>"az vm list --resource-group <rg> --query "[?tags.cockroachCluster=='<cluster-name>']"aws ec2 describe-instance-status --filters "Name=tag:cockroach-cluster,Values=<cluster-name>"gcloud compute instances list --filter="labels.cockroach-cluster=<cluster-name>"az vm list --resource-group <rg> --query "[?tags.cockroachCluster=='<cluster-name>']"RUNNINGRUNNING-- Verify connectivity
SELECT 1;
-- Current version
SELECT version();
-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';-- 验证连通性
SELECT 1;
-- 当前版本
SELECT version();
-- 近期失败的任务
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';crdb_internal.gossip_nodeskv_store_statuscrdb_internal.gossip_nodeskv_store_statusRUNNINGRUNNING-- Verify connectivity
SELECT 1;
-- Current version
SELECT version();
-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';-- 验证连通性
SELECT 1;
-- 当前版本
SELECT version();
-- 近期失败的任务
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';crdb_internal.ranges_no_leasesLIMITcrdb_internal.ranges_no_leasesLIMIT| Issue | Tier | Fix |
|---|---|---|
| SH | Grant admin or VIEWCLUSTERMETADATA |
| STD/BAS | Expected — use Cloud Console |
| Node missing from gossip_nodes | SH | Check node process; verify --join address |
| Cloud Console shows degraded | ADV/BYOC | Check Cloud status page; contact support |
| High RU consumption | BAS | Profile queries; set spending limits |
| Cloud API returns 401 | ADV/BYOC | Regenerate API key |
| High latency on first connection | BAS | Expected cold start after idle period |
| 问题 | 层级 | 解决方法 |
|---|---|---|
| SH | 授予管理员权限或VIEWCLUSTERMETADATA权限 |
找不到 | STD/BAS | 正常情况 — 使用Cloud Console |
| 节点未出现在gossip_nodes中 | SH | 检查节点进程;验证--join地址 |
| Cloud Console显示集群状态降级 | ADV/BYOC | 查看云状态页面;联系支持团队 |
| RU消耗过高 | BAS | 分析查询性能;设置支出限额 |
| Cloud API返回401 | ADV/BYOC | 重新生成API密钥 |
| 首次连接延迟高 | BAS | 闲置后的冷启动属于正常情况 |