altinity-expert-clickhouse-merges

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Merge Performance and Part Management

合并性能与分区管理

Diagnose merge performance, backlog issues, and part management problems.

诊断合并性能、积压问题以及分区管理故障。

Diagnostics

诊断步骤

Run all queries from the file checks.sql and analyze the results.

运行checks.sql文件中的所有查询并分析结果。

Problem-Specific Investigation

特定问题排查

"Too Many Parts" Error Investigation

“分区过多”错误排查

For deep investigation of a specific table, use these ad-hoc queries:
sql
-- Check part creation rate (should be < 1/second)
select
    toStartOfMinute(event_time) as minute,
    count() as new_parts,
    round(avg(rows)) as avg_rows_per_part
from system.part_log
where event_type = 'NewPart'
  and database = '{database}'
  and table = '{table}'
  and event_time > now() - interval 1 hour
group by minute
order by minute desc
limit 30
sql
-- Check if merges are keeping up
select
    toStartOfMinute(event_time) as minute,
    countIf(event_type = 'NewPart') as new_parts,
    countIf(event_type = 'MergeParts') as merges,
    countIf(event_type = 'MergeParts') - countIf(event_type = 'NewPart') as net_reduction
from system.part_log
where database = '{database}'
  and table = '{table}'
  and event_time > now() - interval 1 hour
group by minute
order by minute desc
limit 30
If
net_reduction
is negative consistently
→ Inserts outpace merges. Solutions:
  • Increase batch size
  • Check
    max_parts_to_merge_at_once
    setting
  • Verify sufficient CPU for background merges

如需对特定表进行深度排查,可使用以下临时查询:
sql
-- 检查分区创建速率(应小于1个/秒)
select
    toStartOfMinute(event_time) as minute,
    count() as new_parts,
    round(avg(rows)) as avg_rows_per_part
from system.part_log
where event_type = 'NewPart'
  and database = '{database}'
  and table = '{table}'
  and event_time > now() - interval 1 hour
group by minute
order by minute desc
limit 30
sql
-- 检查合并操作是否跟上节奏
select
    toStartOfMinute(event_time) as minute,
    countIf(event_type = 'NewPart') as new_parts,
    countIf(event_type = 'MergeParts') as merges,
    countIf(event_type = 'MergeParts') - countIf(event_type = 'NewPart') as net_reduction
from system.part_log
where database = '{database}'
  and table = '{table}'
  and event_time > now() - interval 1 hour
group by minute
order by minute desc
limit 30
如果
net_reduction
持续为负值
→ 插入速度超过合并速度。解决方案:
  • 增大批量插入的大小
  • 检查
    max_parts_to_merge_at_once
    配置
  • 确认有足够的CPU资源用于后台合并操作

Ad-Hoc Query Guidelines

临时查询指南

Required Safeguards

必要的安全措施

sql
-- Always include LIMIT
limit 100

-- Always time-bound historical queries
where event_date >= today() - 7

-- For part_log, always filter event_type
where event_type in ('NewPart', 'MergeParts', 'MutatePart')
sql
-- 始终添加LIMIT限制
limit 100

-- 历史查询必须添加时间范围
where event_date >= today() - 7

-- 查询part_log时,始终过滤event_type
where event_type in ('NewPart', 'MergeParts', 'MutatePart')

Safe Exploration Patterns

安全探索模式

sql
-- Discover available merge_reason values
select distinct merge_reason
from system.part_log
where event_type = 'MergeParts'
  and event_date = today()
limit 100

-- Check table engine
select
    database,
    name,
    engine,
    partition_key,
    sorting_key
from system.tables
where database = '{database}'
  and name = '{table}'
sql
-- 查看可用的merge_reason取值
select distinct merge_reason
from system.part_log
where event_type = 'MergeParts'
  and event_date = today()
limit 100

-- 检查表引擎信息
select
    database,
    name,
    engine,
    partition_key,
    sorting_key
from system.tables
where database = '{database}'
  and name = '{table}'

Avoid

注意避免

  • select * from system.part_log
    → Huge, crashes context
  • Queries without time bounds on
    *_log
    tables
  • Joining large result sets in context (do aggregation in SQL)

  • select * from system.part_log
    → 数据量极大,会导致上下文崩溃
  • *_log
    表执行无时间范围限制的查询
  • 在上下文中关联大型结果集(应在SQL中完成聚合操作)

Cross-Module Triggers

跨模块触发条件

FindingLoad ModuleReason
Slow merges, normal disk
altinity-expert-clickhouse-schema
Check ORDER BY, partitioning
Slow merges, high disk IO
altinity-expert-clickhouse-storage
Storage bottleneck analysis
Merges blocked by mutations
altinity-expert-clickhouse-mutations
Stuck mutation investigation
High memory during merges
altinity-expert-clickhouse-memory
Memory limits, settings
Replication lag + merge issues
altinity-expert-clickhouse-replication
Replica queue analysis

发现的问题加载模块原因
合并缓慢,磁盘状态正常
altinity-expert-clickhouse-schema
检查ORDER BY、分区设置
合并缓慢,磁盘IO高
altinity-expert-clickhouse-storage
存储瓶颈分析
合并被mutations阻塞
altinity-expert-clickhouse-mutations
停滞的mutation排查
合并期间内存占用高
altinity-expert-clickhouse-memory
内存限制、配置检查
复制延迟 + 合并问题
altinity-expert-clickhouse-replication
副本队列分析

Key Settings Reference

关键配置参考

SettingDefaultImpact
max_parts_to_merge_at_once
100Max parts in single merge
number_of_free_entries_in_pool_to_lower_max_size_of_merge
8Throttles large merges when busy
background_pool_size
16Merge threads
parts_to_throw_insert
300Error threshold
parts_to_delay_insert
150Delay threshold
max_bytes_to_merge_at_max_space_in_pool
150GBMax merge size
配置项默认值影响
max_parts_to_merge_at_once
100单次合并的最大分区数
number_of_free_entries_in_pool_to_lower_max_size_of_merge
8系统繁忙时限制大型合并操作
background_pool_size
16合并线程数
parts_to_throw_insert
300触发错误的阈值
parts_to_delay_insert
150触发延迟的阈值
max_bytes_to_merge_at_max_space_in_pool
150GB最大合并数据量