altinity-expert-clickhouse-index-analysis

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Diagnostics

诊断步骤

Run all queries from the file checks.sql and analyze the results.

运行checks.sql文件中的所有查询并分析结果。

Deep Dive Queries (Placeholder-Based)

深度分析查询(基于占位符)

EXPLAIN Index Usage for Specific Query

特定查询的EXPLAIN索引使用分析

sql
EXPLAIN indexes = 1
{query_without_format}
Look for:
  • PrimaryKey
    condition should not be
    true
    (means no filtering)
  • Granules: X/Y
    ratio shows selectivity (low X/Y = good)
  • Skip
    indexes should reduce parts/granules further
sql
EXPLAIN indexes = 1
{query_without_format}
需要关注:
  • PrimaryKey
    条件不应为
    true
    (表示未进行过滤)
  • Granules: X/Y
    比值表示筛选性(X/Y值越小越好)
  • 跳数索引(Skip indexes)应进一步减少数据分片/颗粒数

Column Cardinality Analysis

列基数分析

sql
SELECT 
    {columns} APPLY uniq
FROM {database}.{table}
WHERE {time_column} > now() - INTERVAL {days} DAY
Optimal ORDER BY ordering: Low cardinality columns first, high cardinality last.
sql
SELECT 
    {columns} APPLY uniq
FROM {database}.{table}
WHERE {time_column} > now() - INTERVAL {days} DAY
最优ORDER BY排序规则: 低基数列在前,高基数列在后。

Query Pattern WHERE Columns Extraction

查询模式中WHERE子句列提取

sql
WITH
    any(query) AS q,
    arrayJoin(extractAll(query, '\\b(?:PRE)?WHERE\\s+(.*?)\\s+(?:GROUP BY|ORDER BY|UNION|SETTINGS|FORMAT|$)')) AS w,
    arrayFilter(x -> (position(w, extract(x, '\\.(`[^`]+`|[^\\.]+)$')) > 0), columns) AS c,
    arrayJoin(c) AS c2
SELECT
    c2,
    count() AS usage_count
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
  AND arrayExists(x -> x LIKE '%{table}%', tables)
  AND query ILIKE 'SELECT%'
  AND type = 'QueryFinish'
GROUP BY c2
ORDER BY usage_count DESC
FORMAT PrettyCompactMonoBlock
sql
WITH
    any(query) AS q,
    arrayJoin(extractAll(query, '\\b(?:PRE)?WHERE\\s+(.*?)\\s+(?:GROUP BY|ORDER BY|UNION|SETTINGS|FORMAT|$)')) AS w,
    arrayFilter(x -> (position(w, extract(x, '\\.(`[^`]+`|[^\\.]+)$')) > 0), columns) AS c,
    arrayJoin(c) AS c2
SELECT
    c2,
    count() AS usage_count
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
  AND arrayExists(x -> x LIKE '%{table}%', tables)
  AND query ILIKE 'SELECT%'
  AND type = 'QueryFinish'
GROUP BY c2
ORDER BY usage_count DESC
FORMAT PrettyCompactMonoBlock

Normalized WHERE Clause Patterns

标准化WHERE子句模式

sql
WITH
    arrayJoin(extractAll(normalizeQuery(query), '\\b(?:PRE)?WHERE\\s+(.*?)\\s+(?:GROUP BY|ORDER BY|UNION|SETTINGS|FORMAT|$)')) AS w
SELECT
    w AS where_pattern,
    count() AS frequency
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
  AND arrayExists(x -> x LIKE '%{table}%', tables)
  AND query ILIKE 'SELECT%'
  AND type = 'QueryFinish'
GROUP BY w
ORDER BY frequency DESC
LIMIT 20
sql
WITH
    arrayJoin(extractAll(normalizeQuery(query), '\\b(?:PRE)?WHERE\\s+(.*?)\\s+(?:GROUP BY|ORDER BY|UNION|SETTINGS|FORMAT|$)')) AS w
SELECT
    w AS where_pattern,
    count() AS frequency
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
  AND arrayExists(x -> x LIKE '%{table}%', tables)
  AND query ILIKE 'SELECT%'
  AND type = 'QueryFinish'
GROUP BY w
ORDER BY frequency DESC
LIMIT 20

Granule Selectivity from Query Log

从查询日志分析颗粒筛选性

sql
SELECT
    query_id,
    normalized_query_hash,
    selected_parts,
    selected_marks,
    read_rows,
    round(read_rows / nullIf(selected_marks, 0)) AS rows_per_mark,
    query_duration_ms,
    formatReadableSize(read_bytes) AS read_bytes
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
  AND arrayExists(x -> x LIKE '%{table}%', tables)
  AND query ILIKE 'SELECT%'
  AND type = 'QueryFinish'
ORDER BY selected_marks DESC
LIMIT 20
High
selected_marks
/ total marks = poor index utilization.

sql
SELECT
    query_id,
    normalized_query_hash,
    selected_parts,
    selected_marks,
    read_rows,
    round(read_rows / nullIf(selected_marks, 0)) AS rows_per_mark,
    query_duration_ms,
    formatReadableSize(read_bytes) AS read_bytes
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
  AND arrayExists(x -> x LIKE '%{table}%', tables)
  AND query ILIKE 'SELECT%'
  AND type = 'QueryFinish'
ORDER BY selected_marks DESC
LIMIT 20
selected_marks
/总标记数比值高 = 索引利用率低。

Analysis Workflow

分析流程

Step 1: Check Current Indexes

步骤1:检查当前索引

sql
-- Table structure with ORDER BY, PRIMARY KEY, indexes
SHOW CREATE TABLE {database}.{table}
sql
-- Skipping indexes
SELECT name, type, expr, granularity
FROM system.data_skipping_indices
WHERE database = '{database}' AND table = '{table}'
sql
-- 包含ORDER BY、PRIMARY KEY、索引的表结构
SHOW CREATE TABLE {database}.{table}
sql
-- 跳数索引
SELECT name, type, expr, granularity
FROM system.data_skipping_indices
WHERE database = '{database}' AND table = '{table}'

Step 2: Extract Query Patterns

步骤2:提取查询模式

Run the WHERE column extraction and normalized pattern queries to understand:
  • Which columns appear most frequently in WHERE clauses
  • What condition combinations are common
运行WHERE列提取和标准化模式查询,以了解:
  • WHERE子句中最常出现的列
  • 常见的条件组合

Step 3: Check Column Cardinalities

步骤3:检查列基数

Compare cardinalities of columns in:
  • Current ORDER BY key
  • Frequently filtered columns from Step 2
比较以下列的基数:
  • 当前ORDER BY键中的列
  • 步骤2中筛选出的高频过滤列

Step 4: Evaluate Index Alignment

步骤4:评估索引匹配度

Query PatternIndex SupportAction
Filters on ORDER BY prefix✅ GoodNone
Filters on non-ORDER BY cols⚠️ Skip index?Add bloom_filter or projection
Time range + entity⚠️ Check orderTime in ORDER BY or partition?
High-cardinality first in ORDER BY❌ BadReorder (low→high cardinality)

查询模式索引支持情况操作建议
对ORDER BY前缀列进行过滤✅ 良好无需操作
对非ORDER BY列进行过滤⚠️ 是否使用跳数索引?添加bloom_filter或投影
时间范围+实体筛选⚠️ 检查顺序时间列是否在ORDER BY或分区键中?
ORDER BY中高基数列在前❌ 不合理重新排序(低→高基数)

ORDER BY Design Guidelines

ORDER BY设计指南

Column Order Principles

列排序原则

  1. Lowest cardinality first - maximizes granule skipping
  2. Most frequently filtered - columns in WHERE should be in ORDER BY
  3. Time column considerations:
    • If most queries filter on time ranges → include in ORDER BY (possibly with lower resolution like
      toDate(ts)
      )
    • If partition key handles time filtering → may not need in ORDER BY
  1. 最低基数列优先 - 最大化颗粒跳过效率
  2. 高频过滤列优先 - WHERE子句中的列应包含在ORDER BY中
  3. 时间列注意事项:
    • 如果大多数查询针对时间范围过滤 → 将其加入ORDER BY(可使用更低精度的格式如
      toDate(ts)
    • 如果分区键已处理时间过滤 → 可能无需加入ORDER BY

Common Anti-Patterns

常见反模式

Anti-PatternProblemFix
High-cardinality UUID firstNo granule skippingMove after low-cardinality columns
DateTime64 microseconds firstToo granularUse
toDate()
or
toStartOfHour()
Column in WHERE not in ORDER BYFull scanAdd to ORDER BY or create projection
Bloom filter on ORDER BY columnRedundantRemove skip index
Time not in ORDER BY or partitionRange queries scan allAdd
toDate(ts)
to ORDER BY prefix
反模式问题修复方案
高基数UUID列排在首位无法跳过颗粒移至低基数列之后
DateTime64微秒列排在首位粒度太细使用
toDate()
toStartOfHour()
WHERE子句中的列未在ORDER BY中全表扫描加入ORDER BY或创建投影
ORDER BY列上创建Bloom过滤器冗余删除跳数索引
时间列未在ORDER BY或分区键中范围查询扫描全表
toDate(ts)
加入ORDER BY前缀

Cardinality Ordering Example

基数排序示例

Given cardinalities:
  • entity_type
    : 6
  • entity
    : 18,588
  • cast_hash
    : 335,620
Recommended ORDER BY:
(entity_type, entity, cast_hash, ...)

给定基数:
  • entity_type
    : 6
  • entity
    : 18,588
  • cast_hash
    : 335,620
推荐ORDER BY顺序:
(entity_type, entity, cast_hash, ...)

Skipping Index Guidelines

跳数索引指南

When Skip Indexes Help

跳数索引适用场景

  • Column NOT in ORDER BY
  • Column values correlate with physical data order
  • Low false-positive rate for the index type
  • 列未包含在ORDER BY中
  • 列值与物理数据顺序相关
  • 索引类型的误报率低

When Skip Indexes Don't Help

跳数索引不适用场景

  • Column already in ORDER BY prefix (use PRIMARY KEY instead)
  • Column values randomly distributed (no correlation with ORDER BY)
  • Very high cardinality with set/bloom_filter
  • 列已在ORDER BY前缀中(改用PRIMARY KEY)
  • 列值随机分布(与ORDER BY无关联)
  • 使用set/bloom_filter的极高基数列