monitoring-background-jobs
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseMonitoring Background Jobs
监控后台任务
Monitors background job health by identifying failed, paused, and long-running jobs that are distinct from user queries. Uses SQL-only interfaces (SHOW JOBS and SHOW AUTOMATIC JOBS) to surface schema changes, backups/restores, automatic statistics collection, and SQL stats compaction without requiring DB Console access.
通过识别与用户查询不同的失败、暂停和长时间运行的任务,监控后台任务的健康状态。使用纯SQL接口(SHOW JOBS和SHOW AUTOMATIC JOBS),无需访问DB Console即可查看模式变更、备份/恢复、自动统计信息收集以及SQL统计信息压缩任务。
When to Use This Skill
何时使用该技能
- Schema changes appear stuck or delayed (ALTER TABLE, CREATE INDEX, DROP operations)
- Backups or restores are failing or taking longer than expected
- Need to verify automatic statistics collection is running
- Investigating "waiting for MVCC GC" status in schema change cleanup
- Troubleshooting failed jobs without DB Console access
- Monitoring long-running operations that don't appear in query metrics
For live query monitoring: Use triaging-live-sql-activity to monitor currently executing user queries. Note that background jobs execute statements that may not appear in SHOW CLUSTER STATEMENTS.
For historical query analysis: Use profiling-statement-fingerprints for query pattern trends. Note that background jobs are excluded from statement statistics.
- 模式变更(ALTER TABLE、CREATE INDEX、DROP操作)出现停滞或延迟
- 备份或恢复失败或耗时超出预期
- 需要验证自动统计信息收集是否在运行
- 排查模式变更清理中的“waiting for MVCC GC”状态
- 在无DB Console访问权限时排查失败任务
- 监控未出现在查询指标中的长时间运行操作
实时查询监控: 使用triaging-live-sql-activity监控当前正在执行的用户查询。注意,后台任务执行的语句可能不会出现在SHOW CLUSTER STATEMENTS中。
历史查询分析: 使用profiling-statement-fingerprints分析查询模式趋势。注意,后台任务不包含在语句统计信息中。
Prerequisites
前提条件
Required SQL access:
- Connection to any CockroachDB node
- For cluster-wide job visibility: system privilege (read-only monitoring)
VIEWJOB - For job control operations: role option (pause/cancel/resume jobs)
CONTROLJOB - Without these: Limited visibility into jobs you created
Check your privileges:
sql
SHOW GRANTS ON ROLE <username>;Look for:
- in the
VIEWJOBcolumn (system privilege)privilege_type - in role options (check with
CONTROLJOB)SHOW USERS
See permissions reference for detailed RBAC setup.
所需SQL权限:
- 连接到任意CockroachDB节点
- 集群级任务可见性:系统权限(只读监控)
VIEWJOB - 任务控制操作:角色选项(暂停/取消/恢复任务)
CONTROLJOB - 无上述权限:仅能查看自己创建的任务
检查你的权限:
sql
SHOW GRANTS ON ROLE <username>;查找:
- 列中的
privilege_type(系统权限)VIEWJOB - 角色选项中的(使用
CONTROLJOB查看)SHOW USERS
详细RBAC设置请参考权限参考文档。
Core Concepts
核心概念
Jobs vs Statements
任务与语句
Key distinction:
- Statements: User-initiated SQL queries tracked by SHOW CLUSTER STATEMENTS and statement statistics
- Background jobs: Long-running operations tracked separately by SHOW JOBS
Background jobs are excluded from:
- (live query monitoring)
SHOW CLUSTER STATEMENTS - (historical query analysis)
crdb_internal.statement_statistics - Statement fingerprint metrics and DB Console Statements page
Common job types:
| Category | Job Types | Examples |
|---|---|---|
| User-initiated | SCHEMA CHANGE, BACKUP, RESTORE, IMPORT, CHANGEFEED | ALTER TABLE, CREATE INDEX, BACKUP DATABASE, RESTORE |
| Automatic | SCHEMA CHANGE GC, AUTO CREATE STATS, AUTO SQL STATS COMPACTION | Post-DROP cleanup, table statistics refresh, stats table maintenance |
See job types reference for complete catalog.
关键区别:
- 语句: 用户发起的SQL查询,由SHOW CLUSTER STATEMENTS和语句统计信息跟踪
- 后台任务: 长时间运行的操作,由SHOW JOBS单独跟踪
后台任务不包含在以下内容中:
- (实时查询监控)
SHOW CLUSTER STATEMENTS - (历史查询分析)
crdb_internal.statement_statistics - 语句指纹指标和DB Console的语句页面
常见任务类型:
| 类别 | 任务类型 | 示例 |
|---|---|---|
| 用户发起 | SCHEMA CHANGE、BACKUP、RESTORE、IMPORT、CHANGEFEED | ALTER TABLE、CREATE INDEX、BACKUP DATABASE、RESTORE |
| 自动任务 | SCHEMA CHANGE GC、AUTO CREATE STATS、AUTO SQL STATS COMPACTION | DROP操作后的清理、表统计信息刷新、统计信息表维护 |
完整任务类型目录请参考任务类型参考文档。
SHOW JOBS vs SHOW AUTOMATIC JOBS
SHOW JOBS与SHOW AUTOMATIC JOBS对比
| Interface | Scope | Time Window | Use Case |
|---|---|---|---|
| User-initiated + automatic | Last 12 hours (default) | Monitor backups, schema changes, user operations |
| Automatic only | Configurable (recommend 24h) | Monitor AUTO CREATE STATS, AUTO SQL STATS COMPACTION, SCHEMA CHANGE GC |
Time retention:
- Default retention: 14 days in table
crdb_internal.jobs - display window: 12 hours (configurable with
SHOW JOBS)SHOW JOBS SELECT * FROM [SHOW JOBS] WHERE ... - display window: Configurable with
SHOW AUTOMATIC JOBSWHERE created > now() - INTERVAL '...'
| 接口 | 范围 | 时间窗口 | 使用场景 |
|---|---|---|---|
| 用户发起+自动任务 | 默认最近12小时 | 监控备份、模式变更、用户操作 |
| 仅自动任务 | 可配置(推荐24小时) | 监控AUTO CREATE STATS、AUTO SQL STATS COMPACTION、SCHEMA CHANGE GC |
时间保留规则:
- 默认保留:表中保留14天
crdb_internal.jobs - 显示窗口:12小时(可通过
SHOW JOBS配置)SHOW JOBS SELECT * FROM [SHOW JOBS] WHERE ... - 显示窗口:可通过
SHOW AUTOMATIC JOBS配置WHERE created > now() - INTERVAL '...'
Job Status Values
任务状态值
| Status | Meaning | Action Required |
|---|---|---|
| Job is actively executing | Monitor progress via |
| Job completed successfully | None |
| Job encountered an error | Investigate |
| Job manually paused | Resume with |
| Job was canceled (terminal state) | Retry operation if needed |
| Job queued but not started | Monitor; may indicate resource constraints |
| Job failed and is rolling back changes | Wait for completion; check error after |
Running status sub-states:
- : Backup job actively transferring data
performing backup - : Restore job actively applying data
restoring - : SCHEMA CHANGE GC waiting for garbage collection eligibility
waiting for MVCC GC
See job states reference for detailed state transitions and "waiting for MVCC GC" explanation.
| 状态 | 含义 | 所需操作 |
|---|---|---|
| 任务正在执行 | 通过 |
| 任务成功完成 | 无 |
| 任务遇到错误 | 查看 |
| 任务被手动暂停 | 若合适,使用 |
| 任务已被取消(终端状态) | 若需要,重新执行操作 |
| 任务已排队但未启动 | 监控;可能表示资源受限 |
| 任务失败并正在回滚变更 | 等待完成;之后检查错误 |
运行状态子状态:
- :备份任务正在传输数据
performing backup - :恢复任务正在应用数据
restoring - :SCHEMA CHANGE GC等待垃圾回收资格
waiting for MVCC GC
详细状态转换和“waiting for MVCC GC”解释请参考任务状态参考文档。
Core Diagnostic Queries
核心诊断查询
Query 1: Failed Jobs (Last 12 Hours)
查询1:最近12小时内的失败任务
Identify jobs that failed with error messages:
sql
-- Failed jobs in last 12 hours
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
finished,
now() - created AS total_duration,
error
FROM j
WHERE status = 'failed'
AND created > now() - INTERVAL '12 hours'
ORDER BY created DESC
LIMIT 50;Key columns:
- : Failure reason (check for permission errors, disk space, network issues)
error - : Human-readable description of what the job was doing
description - : How long the job ran before failing
total_duration
Common failure patterns:
- Permission denied: User lacks required privileges
- Disk space: Backup destination full
- Network timeout: External storage unreachable
- Constraint violation: Restore conflicts with existing data
识别带有错误信息的失败任务:
sql
-- Failed jobs in last 12 hours
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
finished,
now() - created AS total_duration,
error
FROM j
WHERE status = 'failed'
AND created > now() - INTERVAL '12 hours'
ORDER BY created DESC
LIMIT 50;关键列:
- :失败原因(检查权限错误、磁盘空间、网络问题)
error - :任务执行内容的可读描述
description - :任务失败前运行的时长
total_duration
常见失败模式:
- 权限拒绝:用户缺少所需权限
- 磁盘空间不足:备份目标存储已满
- 网络超时:外部存储无法访问
- 约束冲突:恢复与现有数据冲突
Query 2: Long-Running Jobs
查询2:长时间运行的任务
Find jobs running longer than expected threshold:
sql
-- Jobs running longer than 1 hour
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
status,
running_status,
created,
now() - created AS running_for,
fraction_completed,
coordinator_id
FROM j
WHERE status = 'running'
AND created < now() - INTERVAL '1 hour'
ORDER BY created
LIMIT 50;Key columns:
- : Total elapsed time since job started
running_for - : Progress estimate (0.0 to 1.0, NULL if unavailable)
fraction_completed - : Sub-state details (e.g., "waiting for MVCC GC")
running_status
Customizable thresholds:
- Schema changes: 30 minutes to several hours (depends on table size)
- Backups: 1-6+ hours (depends on data volume)
- Automatic jobs: Usually < 30 minutes
查找运行时长超出预期阈值的任务:
sql
-- Jobs running longer than 1 hour
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
status,
running_status,
created,
now() - created AS running_for,
fraction_completed,
coordinator_id
FROM j
WHERE status = 'running'
AND created < now() - INTERVAL '1 hour'
ORDER BY created
LIMIT 50;关键列:
- :任务启动后的总耗时
running_for - :进度估计值(0.0到1.0,不可用时为NULL)
fraction_completed - :子状态详情(如“waiting for MVCC GC”)
running_status
可自定义阈值:
- 模式变更:30分钟到数小时(取决于表大小)
- 备份:1-6小时以上(取决于数据量)
- 自动任务:通常少于30分钟
Query 3: Paused Jobs
查询3:暂停的任务
Identify jobs that are paused and may need attention:
sql
-- Paused jobs needing resume
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS paused_for,
coordinator_id
FROM j
WHERE status = 'paused'
ORDER BY created
LIMIT 50;Action required:
Resume with after verifying the pause reason.
RESUME JOB <job_id>Common reasons for paused jobs:
- Manual user pause for maintenance
- Resource constraints (cluster paused the job)
- Error requiring manual intervention
识别已暂停且可能需要关注的任务:
sql
-- Paused jobs needing resume
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS paused_for,
coordinator_id
FROM j
WHERE status = 'paused'
ORDER BY created
LIMIT 50;所需操作:
确认暂停原因后,使用恢复任务。
RESUME JOB <job_id>任务暂停的常见原因:
- 用户为维护手动暂停
- 资源受限(集群暂停任务)
- 需要手动干预的错误
Query 4: Schema Changes Waiting for MVCC GC
查询4:等待MVCC GC的模式变更任务
Find SCHEMA CHANGE GC jobs waiting for garbage collection:
sql
-- Schema change cleanup jobs waiting for GC
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS waiting_for,
running_status
FROM j
WHERE status = 'running'
AND job_type = 'SCHEMA CHANGE GC'
AND running_status LIKE '%waiting for MVCC GC%'
ORDER BY created
LIMIT 50;Interpretation:
- Normal: SCHEMA CHANGE GC jobs wait for data to become garbage-collectable based on setting (default 25 hours)
gc.ttlseconds - Expected duration: Up to + some overhead
gc.ttlseconds - When to worry: Waiting > 2x (check setting with
gc.ttlseconds)SHOW CLUSTER SETTING gc.ttlseconds
Why this happens:
After DROP TABLE/INDEX operations, CockroachDB must wait for all reads at older timestamps to complete before physically removing data. This prevents "time-travel" queries from failing.
See job states reference for detailed MVCC GC explanation.
查找等待垃圾回收的SCHEMA CHANGE GC任务:
sql
-- Schema change cleanup jobs waiting for GC
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS waiting_for,
running_status
FROM j
WHERE status = 'running'
AND job_type = 'SCHEMA CHANGE GC'
AND running_status LIKE '%waiting for MVCC GC%'
ORDER BY created
LIMIT 50;解读:
- 正常情况: SCHEMA CHANGE GC任务会根据设置(默认25小时)等待数据变为可垃圾回收状态
gc.ttlseconds - 预期时长: 最多加上一些额外开销
gc.ttlseconds - 需关注的情况: 等待时长超过的2倍(使用
gc.ttlseconds查看设置)SHOW CLUSTER SETTING gc.ttlseconds
原因:
执行DROP TABLE/INDEX操作后,CockroachDB必须等待所有基于旧时间戳的读取完成,才能物理删除数据。这可以防止“时间旅行”查询失败。
详细MVCC GC解释请参考任务状态参考文档。
Query 5: Automatic Job Health (24h Window)
查询5:自动任务健康状态(24小时窗口)
Monitor automatic background jobs like statistics collection:
sql
-- Automatic jobs in last 24 hours
SELECT
job_id,
job_type,
description,
status,
created,
finished,
COALESCE(finished, now()) - created AS duration
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
AND job_type IN ('AUTO CREATE STATS', 'AUTO SQL STATS COMPACTION')
ORDER BY created DESC
LIMIT 50;Key job types:
- : Automatic table statistics refresh (critical for query optimizer)
AUTO CREATE STATS - : Periodic cleanup of statement/transaction statistics tables
AUTO SQL STATS COMPACTION
Health indicators:
- Healthy: Regular successful executions (every few hours)
- Unhealthy: No recent executions, or high failure rate
- Impact of failure: Stale statistics lead to poor query plans and slow queries
监控自动后台任务,如统计信息收集:
sql
-- Automatic jobs in last 24 hours
SELECT
job_id,
job_type,
description,
status,
created,
finished,
COALESCE(finished, now()) - created AS duration
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
AND job_type IN ('AUTO CREATE STATS', 'AUTO SQL STATS COMPACTION')
ORDER BY created DESC
LIMIT 50;关键任务类型:
- :自动表统计信息刷新(对查询优化器至关重要)
AUTO CREATE STATS - :定期清理语句/事务统计信息表
AUTO SQL STATS COMPACTION
健康指标:
- 健康状态: 定期成功执行(每几小时一次)
- 不健康状态: 近期无执行记录,或失败率高
- 失败影响: 过时的统计信息会导致查询计划不佳和查询缓慢
Query 6: Jobs by Type and Status
查询6:按类型和状态分类的任务
Aggregated view for pattern analysis:
sql
-- Job distribution by type and status (last 24h)
WITH j AS (SHOW JOBS)
SELECT
job_type,
status,
COUNT(*) AS job_count,
MIN(created) AS oldest,
MAX(created) AS newest
FROM j
WHERE created > now() - INTERVAL '24 hours'
GROUP BY job_type, status
ORDER BY job_type, status;Use case:
- Identify patterns (e.g., all BACKUP jobs failing, multiple schema changes stuck)
- Spot anomalies (e.g., unusual job type volume)
- Track job success rates by type
用于模式分析的聚合视图:
sql
-- Job distribution by type and status (last 24h)
WITH j AS (SHOW JOBS)
SELECT
job_type,
status,
COUNT(*) AS job_count,
MIN(created) AS oldest,
MAX(created) AS newest
FROM j
WHERE created > now() - INTERVAL '24 hours'
GROUP BY job_type, status
ORDER BY job_type, status;使用场景:
- 识别模式(如所有BACKUP任务失败、多个模式变更停滞)
- 发现异常(如异常的任务类型数量)
- 按类型跟踪任务成功率
Query 7: Backup and Restore Progress
查询7:备份与恢复进度
Track progress of backup/restore jobs:
sql
-- Active backup/restore jobs with progress
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS running_for,
ROUND(COALESCE(fraction_completed, 0) * 100, 2) AS percent_complete,
CASE
WHEN fraction_completed > 0 AND fraction_completed < 1 THEN
((now() - created) / fraction_completed) - (now() - created)
ELSE NULL
END AS estimated_time_remaining,
running_status
FROM j
WHERE status = 'running'
AND job_type IN ('BACKUP', 'RESTORE')
ORDER BY created
LIMIT 50;Key columns:
- : Progress percentage (0-100)
percent_complete - : Rough estimate based on current progress rate
estimated_time_remaining - : Detailed status (e.g., "performing backup to s3://...")
running_status
Note: may be NULL for some job types or early in execution.
fraction_completed跟踪备份/恢复任务的进度:
sql
-- Active backup/restore jobs with progress
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS running_for,
ROUND(COALESCE(fraction_completed, 0) * 100, 2) AS percent_complete,
CASE
WHEN fraction_completed > 0 AND fraction_completed < 1 THEN
((now() - created) / fraction_completed) - (now() - created)
ELSE NULL
END AS estimated_time_remaining,
running_status
FROM j
WHERE status = 'running'
AND job_type IN ('BACKUP', 'RESTORE')
ORDER BY created
LIMIT 50;关键列:
- :进度百分比(0-100)
percent_complete - :基于当前进度的大致剩余时间估计
estimated_time_remaining - :详细状态(如“performing backup to s3://...”)
running_status
注意: 部分任务类型或执行初期,可能为NULL。
fraction_completedCommon Workflows
常见工作流
Workflow 1: Schema Change Stuck Investigation
工作流1:排查停滞的模式变更
Scenario: User reports ALTER TABLE or CREATE INDEX appears stuck.
-
Check for running schema changes:sql
WITH j AS (SHOW JOBS) SELECT job_id, description, created, now() - created AS running_for, fraction_completed, running_status FROM j WHERE status = 'running' AND job_type IN ('SCHEMA CHANGE', 'NEW SCHEMA CHANGE') ORDER BY created; -
Identify MVCC GC waits:sql
-- Use Query 4 to find "waiting for MVCC GC" jobs -
Interpret results:
- If = "waiting for MVCC GC": Normal for post-DROP cleanup (wait up to
running_status)gc.ttlseconds - If long-running with low : Check for contention, large table size, or resource constraints
fraction_completed - If failed: Check column for specific failure reason
error
- If
-
Next steps:
- MVCC GC wait: Verify and wait
SHOW CLUSTER SETTING gc.ttlseconds - Resource constraints: Check cluster CPU/memory usage
- Failed job: Address error (permissions, constraints) and retry operation
- MVCC GC wait: Verify
场景: 用户反馈ALTER TABLE或CREATE INDEX似乎停滞。
-
检查运行中的模式变更:sql
WITH j AS (SHOW JOBS) SELECT job_id, description, created, now() - created AS running_for, fraction_completed, running_status FROM j WHERE status = 'running' AND job_type IN ('SCHEMA CHANGE', 'NEW SCHEMA CHANGE') ORDER BY created; -
识别MVCC GC等待:sql
-- 使用查询4查找“waiting for MVCC GC”任务 -
解读结果:
- 如果= "waiting for MVCC GC":DROP操作后的清理属于正常情况(等待时长不超过
running_status)gc.ttlseconds - 如果长时间运行且较低:检查是否存在竞争、表过大或资源受限
fraction_completed - 如果任务失败:查看列获取具体失败原因
error
- 如果
-
下一步操作:
- MVCC GC等待:验证并等待
SHOW CLUSTER SETTING gc.ttlseconds - 资源受限:检查集群CPU/内存使用情况
- 任务失败:解决错误(权限、约束)并重试操作
- MVCC GC等待:验证
Workflow 2: Failed Backup Triage
工作流2:排查失败的备份
Scenario: Scheduled backup job failed.
-
Find recent failed backups:sql
-- Use Query 1 filtered for BACKUP job type WITH j AS (SHOW JOBS) SELECT job_id, description, created, finished, error FROM j WHERE status = 'failed' AND job_type = 'BACKUP' AND created > now() - INTERVAL '24 hours' ORDER BY created DESC; -
Analyze error messages:
- "permission denied": Check external storage credentials
- "timeout": Network connectivity to backup destination
- "no space left": Destination storage full
- "connection refused": External storage endpoint unreachable
-
Verify backup destination:sql
-- Check SHOW BACKUP for successful backups to same destination SHOW BACKUP 's3://bucket/path'; -
Remediate and retry:
- Fix underlying issue (credentials, storage, network)
- Re-run backup command
- Monitor with Query 7 for progress
场景: 定时备份任务失败。
-
查找最近失败的备份:sql
-- 使用查询1筛选BACKUP任务类型 WITH j AS (SHOW JOBS) SELECT job_id, description, created, finished, error FROM j WHERE status = 'failed' AND job_type = 'BACKUP' AND created > now() - INTERVAL '24 hours' ORDER BY created DESC; -
分析错误信息:
- "permission denied":检查外部存储凭证
- "timeout":与备份目标的网络连接问题
- "no space left":目标存储已满
- "connection refused":外部存储端点无法访问
-
验证备份目标:sql
-- 检查同一目标的成功备份记录 SHOW BACKUP 's3://bucket/path'; -
修复并重试:
- 修复根本问题(凭证、存储、网络)
- 重新运行备份命令
- 使用查询7监控进度
Workflow 3: Automatic Job Health Check
工作流3:自动任务健康检查
Scenario: Proactive monitoring of automatic background jobs.
-
Check AUTO CREATE STATS frequency:sql
-- Use Query 5 to see recent automatic statistics jobs SELECT job_type, status, COUNT(*) AS job_count, MAX(created) AS most_recent FROM [SHOW AUTOMATIC JOBS] WHERE created > now() - INTERVAL '24 hours' AND job_type = 'AUTO CREATE STATS' GROUP BY job_type, status; -
Expected pattern:
- Multiple successful AUTO CREATE STATS jobs per day (depends on table update frequency)
- Regular AUTO SQL STATS COMPACTION (typically once per hour)
-
Warning signs:
- No AUTO CREATE STATS in last 24h: Statistics collection may be disabled
- High failure rate: Check cluster resource constraints or permission issues
- No AUTO SQL STATS COMPACTION: Stats table may grow unbounded
-
Verify settings:sql
SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled; -- Should be true SHOW CLUSTER SETTING sql.stats.automatic_collection.min_stale_rows;
场景: 主动监控自动后台任务。
-
检查AUTO CREATE STATS执行频率:sql
-- 使用查询5查看最近的自动统计信息任务 SELECT job_type, status, COUNT(*) AS job_count, MAX(created) AS most_recent FROM [SHOW AUTOMATIC JOBS] WHERE created > now() - INTERVAL '24 hours' AND job_type = 'AUTO CREATE STATS' GROUP BY job_type, status; -
预期模式:
- 每天多次成功执行AUTO CREATE STATS任务(取决于表更新频率)
- 定期执行AUTO SQL STATS COMPACTION(通常每小时一次)
-
警告信号:
- 过去24小时内无AUTO CREATE STATS任务:统计信息收集可能已禁用
- 高失败率:检查集群资源受限或权限问题
- 无AUTO SQL STATS COMPACTION:统计信息表可能无限增长
-
验证设置:sql
SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled; -- 应设为true SHOW CLUSTER SETTING sql.stats.automatic_collection.min_stale_rows;
Workflow 4: Long-Running Job Monitoring
工作流4:长时间运行任务监控
Scenario: Track progress of expected long-running operations.
-
Identify long-running jobs:sql
-- Use Query 2 with custom threshold WITH j AS (SHOW JOBS) SELECT job_id, job_type, description, now() - created AS running_for, fraction_completed FROM j WHERE status = 'running' AND created < now() - INTERVAL '30 minutes' ORDER BY created; -
Monitor progress over time:sql
-- Re-run every 10-15 minutes, track fraction_completed changes -- Example: 0.25 → 0.40 → 0.55 indicates steady progress -
Estimate completion:sql
-- Use Query 7 for backup/restore jobs with time estimates -
Decide on action:
- Steady progress: Continue monitoring
- Stalled progress (fraction_completed not increasing): Investigate with triaging-live-sql-activity
- Failed: Use Query 1 to check error
场景: 跟踪预期长时间运行操作的进度。
-
识别长时间运行的任务:sql
-- 使用查询2并自定义阈值 WITH j AS (SHOW JOBS) SELECT job_id, job_type, description, now() - created AS running_for, fraction_completed FROM j WHERE status = 'running' AND created < now() - INTERVAL '30 minutes' ORDER BY created; -
随时间监控进度:sql
-- 每10-15分钟重新运行一次,跟踪fraction_completed的变化 -- 示例:0.25 → 0.40 → 0.55表示进度稳定 -
估计完成时间:sql
-- 使用查询7获取备份/恢复任务的时间估计 -
决定操作:
- 进度稳定:继续监控
- 进度停滞(fraction_completed未增加):使用triaging-live-sql-activity排查
- 任务失败:使用查询1检查错误
Safety Considerations
安全注意事项
Read-only operations (all diagnostic queries):
All and queries are read-only and safe to run in production. No performance impact on cluster operations.
SHOW JOBSSHOW AUTOMATIC JOBSJob control operations (opt-in):
CAUTION: Pausing or canceling jobs can have data integrity implications
Only proceed with job control if:
- You have role option
CONTROLJOB - You understand the implications (e.g., canceling a schema change mid-execution may require manual cleanup)
- You have authorization to interrupt cluster operations
- You've verified the job is truly problematic (not just slow)
Job control commands:
sql
-- Pause a running job (can be resumed later)
PAUSE JOB <job_id>;
-- Resume a paused job
RESUME JOB <job_id>;
-- Cancel a job (terminal - cannot be resumed)
CANCEL JOB <job_id>;Risks by job type:
- SCHEMA CHANGE: Canceling may leave schema in inconsistent state; prefer PAUSE and investigation
- BACKUP: Canceling is safe (can retry); pausing is better for temporary issues
- RESTORE: Canceling may leave database partially restored; requires cleanup
- AUTO CREATE STATS: Canceling is safe (will retry later automatically)
Best practice: Focus on monitoring and diagnosis; only use control operations when explicitly required and authorized.
See permissions reference for CONTROLJOB role option setup.
只读操作(所有诊断查询):
所有和查询均为只读,可安全在生产环境运行,不会对集群操作造成性能影响。
SHOW JOBSSHOW AUTOMATIC JOBS任务控制操作(可选):
注意:暂停或取消任务可能影响数据完整性
仅在以下情况下执行任务控制:
- 你拥有角色选项
CONTROLJOB - 你了解潜在影响(如中途取消模式变更可能需要手动清理)
- 你有权中断集群操作
- 你已确认任务确实存在问题(并非只是运行缓慢)
任务控制命令:
sql
-- 暂停运行中的任务(之后可恢复)
PAUSE JOB <job_id>;
-- 恢复暂停的任务
RESUME JOB <job_id>;
-- 取消任务(终端操作 - 无法恢复)
CANCEL JOB <job_id>;按任务类型划分的风险:
- SCHEMA CHANGE: 取消可能导致模式处于不一致状态;优先选择暂停并排查问题
- BACKUP: 取消是安全的(可重试);临时问题优先选择暂停
- RESTORE: 取消可能导致数据库部分恢复;需要清理
- AUTO CREATE STATS: 取消是安全的(系统会自动重试)
最佳实践: 重点关注监控和诊断;仅在明确需要且获得授权时使用控制操作。
CONTROLJOBTroubleshooting
故障排查
| Issue | Cause | Fix |
|---|---|---|
| No jobs in last 12h, or insufficient privileges | Grant |
| "waiting for MVCC GC" for many hours | Normal behavior for SCHEMA CHANGE GC after DROP operations | Wait up to |
| Can't pause/resume job: "permission denied" | Missing | Use |
Job stuck at same | Job may be processing large batch, or actually stuck | Wait 15-30 min and re-check; if no change, investigate with live query triage |
| No AUTO CREATE STATS jobs | Automatic collection disabled | Check |
| Need to filter by time window | Add |
| Failed job with "schema change GC" error | Expected for post-DROP cleanup failures | Usually safe to ignore; job will retry automatically |
| Job error: "concurrent schema change" | Multiple schema changes on same table | Wait for first schema change to complete, then retry |
| 问题 | 原因 | 解决方法 |
|---|---|---|
| 最近12小时无任务,或权限不足 | 授予 |
| "waiting for MVCC GC"持续数小时 | DROP操作后SCHEMA CHANGE GC的正常行为 | 等待时长不超过 |
| 无法暂停/恢复任务:"permission denied" | 缺少 | 使用 |
任务的 | 任务可能在处理大数据批,或确实停滞 | 等待15-30分钟后重新检查;若无变化,使用实时查询排查工具 |
| 无AUTO CREATE STATS任务 | 自动收集已禁用 | 检查 |
| 需要按时间窗口筛选 | 添加 |
| 任务出现"schema change GC"错误 | DROP操作后清理失败的预期情况 | 通常可忽略;任务会自动重试 |
| 任务错误:"concurrent schema change" | 同一表存在多个模式变更 | 等待第一个模式变更完成后重试 |
Key Considerations
关键注意事项
- Jobs vs queries: Background jobs execute statements that don't appear in SHOW STATEMENTS or statement statistics
- Time windows: SHOW JOBS default 12h retention; use for up to 14 days
crdb_internal.jobs - MVCC GC waiting: Normal and expected for post-DROP cleanup; duration tied to
gc.ttlseconds - LIMIT clauses: Always include for production safety (prevents overwhelming output)
- Privilege model: VIEWJOB (system privilege) for read-only; CONTROLJOB (role option) for control
- Automatic job health: Regular AUTO CREATE STATS is critical for query optimizer performance
- Progress estimates: may be NULL or sparse for some job types
fraction_completed - Job control risks: PAUSE is safer than CANCEL; some cancellations require manual cleanup
- 任务与查询: 后台任务执行的语句不会出现在SHOW STATEMENTS或语句统计信息中
- 时间窗口: SHOW JOBS默认保留12小时;使用可查看最多14天的任务
crdb_internal.jobs - MVCC GC等待: DROP操作后的清理属于正常预期行为;时长与相关
gc.ttlseconds - LIMIT子句: 生产环境中务必包含该子句(防止输出过多)
- 权限模型: VIEWJOB(系统权限)用于只读访问;CONTROLJOB(角色选项)用于控制操作
- 自动任务健康: 定期执行AUTO CREATE STATS对查询优化器性能至关重要
- 进度估计: 部分任务类型的可能为NULL或不连续
fraction_completed - 任务控制风险: 暂停比取消更安全;部分取消操作需要手动清理
References
参考资料
Skill references:
- RBAC and privilege setup
- Job states and transitions
- Job types catalog
- SQL query variations
Related skills:
- triaging-live-sql-activity - For live query monitoring (job-executed statements may not appear)
- profiling-statement-fingerprints - For historical query analysis (background jobs excluded)
Official CockroachDB Documentation:
技能参考:
- RBAC与权限设置
- 任务状态与转换
- 任务类型目录
- SQL查询变体
相关技能:
- triaging-live-sql-activity - 用于实时查询监控(任务执行的语句可能不会显示)
- profiling-statement-fingerprints - 用于历史查询分析(不包含后台任务)
官方CockroachDB文档: