altinity-expert-clickhouse-index-analysis

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Diagnostics

诊断步骤

Run all queries from the file checks.sql and analyze the results.

运行checks.sql文件中的所有查询并分析结果。

Deep Dive Queries (Placeholder-Based)

深度分析查询（基于占位符）

EXPLAIN Index Usage for Specific Query

特定查询的EXPLAIN索引使用分析

sql

EXPLAIN indexes = 1
{query_without_format}

Look for:

```
PrimaryKey
```
condition should not be
```
true
```
(means no filtering)
```
Granules: X/Y
```
ratio shows selectivity (low X/Y = good)
```
Skip
```
indexes should reduce parts/granules further

sql

EXPLAIN indexes = 1
{query_without_format}

需要关注：

```
PrimaryKey
```
条件不应为
```
true
```
（表示未进行过滤）
```
Granules: X/Y
```
比值表示筛选性（X/Y值越小越好）
跳数索引（Skip indexes）应进一步减少数据分片/颗粒数

Column Cardinality Analysis

列基数分析

sql

SELECT 
    {columns} APPLY uniq
FROM {database}.{table}
WHERE {time_column} > now() - INTERVAL {days} DAY

Optimal ORDER BY ordering: Low cardinality columns first, high cardinality last.

sql

SELECT 
    {columns} APPLY uniq
FROM {database}.{table}
WHERE {time_column} > now() - INTERVAL {days} DAY

最优ORDER BY排序规则： 低基数列在前，高基数列在后。

Query Pattern WHERE Columns Extraction

查询模式中WHERE子句列提取

sql

WITH
    any(query) AS q,
    arrayJoin(extractAll(query, '\\b(?:PRE)?WHERE\\s+(.*?)\\s+(?:GROUP BY|ORDER BY|UNION|SETTINGS|FORMAT|$)')) AS w,
    arrayFilter(x -> (position(w, extract(x, '\\.(`[^`]+`|[^\\.]+)$')) > 0), columns) AS c,
    arrayJoin(c) AS c2
SELECT
    c2,
    count() AS usage_count
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
  AND arrayExists(x -> x LIKE '%{table}%', tables)
  AND query ILIKE 'SELECT%'
  AND type = 'QueryFinish'
GROUP BY c2
ORDER BY usage_count DESC
FORMAT PrettyCompactMonoBlock

sql

WITH
    any(query) AS q,
    arrayJoin(extractAll(query, '\\b(?:PRE)?WHERE\\s+(.*?)\\s+(?:GROUP BY|ORDER BY|UNION|SETTINGS|FORMAT|$)')) AS w,
    arrayFilter(x -> (position(w, extract(x, '\\.(`[^`]+`|[^\\.]+)$')) > 0), columns) AS c,
    arrayJoin(c) AS c2
SELECT
    c2,
    count() AS usage_count
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
  AND arrayExists(x -> x LIKE '%{table}%', tables)
  AND query ILIKE 'SELECT%'
  AND type = 'QueryFinish'
GROUP BY c2
ORDER BY usage_count DESC
FORMAT PrettyCompactMonoBlock

Normalized WHERE Clause Patterns

标准化WHERE子句模式

sql

WITH
    arrayJoin(extractAll(normalizeQuery(query), '\\b(?:PRE)?WHERE\\s+(.*?)\\s+(?:GROUP BY|ORDER BY|UNION|SETTINGS|FORMAT|$)')) AS w
SELECT
    w AS where_pattern,
    count() AS frequency
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
  AND arrayExists(x -> x LIKE '%{table}%', tables)
  AND query ILIKE 'SELECT%'
  AND type = 'QueryFinish'
GROUP BY w
ORDER BY frequency DESC
LIMIT 20

sql

WITH
    arrayJoin(extractAll(normalizeQuery(query), '\\b(?:PRE)?WHERE\\s+(.*?)\\s+(?:GROUP BY|ORDER BY|UNION|SETTINGS|FORMAT|$)')) AS w
SELECT
    w AS where_pattern,
    count() AS frequency
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
  AND arrayExists(x -> x LIKE '%{table}%', tables)
  AND query ILIKE 'SELECT%'
  AND type = 'QueryFinish'
GROUP BY w
ORDER BY frequency DESC
LIMIT 20

Granule Selectivity from Query Log

从查询日志分析颗粒筛选性

sql

SELECT
    query_id,
    normalized_query_hash,
    selected_parts,
    selected_marks,
    read_rows,
    round(read_rows / nullIf(selected_marks, 0)) AS rows_per_mark,
    query_duration_ms,
    formatReadableSize(read_bytes) AS read_bytes
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
  AND arrayExists(x -> x LIKE '%{table}%', tables)
  AND query ILIKE 'SELECT%'
  AND type = 'QueryFinish'
ORDER BY selected_marks DESC
LIMIT 20

High
selected_marks
/ total marks = poor index utilization.

sql

SELECT
    query_id,
    normalized_query_hash,
    selected_parts,
    selected_marks,
    read_rows,
    round(read_rows / nullIf(selected_marks, 0)) AS rows_per_mark,
    query_duration_ms,
    formatReadableSize(read_bytes) AS read_bytes
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
  AND arrayExists(x -> x LIKE '%{table}%', tables)
  AND query ILIKE 'SELECT%'
  AND type = 'QueryFinish'
ORDER BY selected_marks DESC
LIMIT 20

selected_marks
/总标记数比值高 = 索引利用率低。

Analysis Workflow

分析流程

Step 1: Check Current Indexes

步骤1：检查当前索引

sql

-- Table structure with ORDER BY, PRIMARY KEY, indexes
SHOW CREATE TABLE {database}.{table}

sql

-- Skipping indexes
SELECT name, type, expr, granularity
FROM system.data_skipping_indices
WHERE database = '{database}' AND table = '{table}'

sql

-- 包含ORDER BY、PRIMARY KEY、索引的表结构
SHOW CREATE TABLE {database}.{table}

sql

-- 跳数索引
SELECT name, type, expr, granularity
FROM system.data_skipping_indices
WHERE database = '{database}' AND table = '{table}'

Step 2: Extract Query Patterns

步骤2：提取查询模式

Run the WHERE column extraction and normalized pattern queries to understand:

Which columns appear most frequently in WHERE clauses
What condition combinations are common

运行WHERE列提取和标准化模式查询，以了解：

WHERE子句中最常出现的列
常见的条件组合

Step 3: Check Column Cardinalities

步骤3：检查列基数

Compare cardinalities of columns in:

Current ORDER BY key
Frequently filtered columns from Step 2

比较以下列的基数：

当前ORDER BY键中的列
步骤2中筛选出的高频过滤列

Step 4: Evaluate Index Alignment

步骤4：评估索引匹配度

Query Pattern	Index Support	Action
Filters on ORDER BY prefix	✅ Good	None
Filters on non-ORDER BY cols	⚠️ Skip index?	Add bloom_filter or projection
Time range + entity	⚠️ Check order	Time in ORDER BY or partition?
High-cardinality first in ORDER BY	❌ Bad	Reorder (low→high cardinality)

查询模式	索引支持情况	操作建议
对ORDER BY前缀列进行过滤	✅ 良好	无需操作
对非ORDER BY列进行过滤	⚠️ 是否使用跳数索引？	添加bloom_filter或投影
时间范围+实体筛选	⚠️ 检查顺序	时间列是否在ORDER BY或分区键中？
ORDER BY中高基数列在前	❌ 不合理	重新排序（低→高基数）

ORDER BY Design Guidelines

ORDER BY设计指南

Column Order Principles

列排序原则

Lowest cardinality first - maximizes granule skipping
Most frequently filtered - columns in WHERE should be in ORDER BY
Time column considerations:
- If most queries filter on time ranges → include in ORDER BY (possibly with lower resolution like
```
toDate(ts)
```
  )
- If partition key handles time filtering → may not need in ORDER BY

最低基数列优先 - 最大化颗粒跳过效率
高频过滤列优先 - WHERE子句中的列应包含在ORDER BY中
时间列注意事项：
- 如果大多数查询针对时间范围过滤 → 将其加入ORDER BY（可使用更低精度的格式如
```
toDate(ts)
```
  ）
- 如果分区键已处理时间过滤 → 可能无需加入ORDER BY

Common Anti-Patterns

常见反模式

Anti-Pattern	Problem	Fix
High-cardinality UUID first	No granule skipping	Move after low-cardinality columns
DateTime64 microseconds first	Too granular	Use `toDate()` or `toStartOfHour()`
Column in WHERE not in ORDER BY	Full scan	Add to ORDER BY or create projection
Bloom filter on ORDER BY column	Redundant	Remove skip index
Time not in ORDER BY or partition	Range queries scan all	Add `toDate(ts)` to ORDER BY prefix

反模式	问题	修复方案
高基数UUID列排在首位	无法跳过颗粒	移至低基数列之后
DateTime64微秒列排在首位	粒度太细	使用 `toDate()` 或 `toStartOfHour()`
WHERE子句中的列未在ORDER BY中	全表扫描	加入ORDER BY或创建投影
ORDER BY列上创建Bloom过滤器	冗余	删除跳数索引
时间列未在ORDER BY或分区键中	范围查询扫描全表	将 `toDate(ts)` 加入ORDER BY前缀

Cardinality Ordering Example

基数排序示例

Given cardinalities:

```
entity_type
```
: 6
```
entity
```
: 18,588
```
cast_hash
```
: 335,620

Recommended ORDER BY:

(entity_type, entity, cast_hash, ...)

给定基数：

```
entity_type
```
: 6
```
entity
```
: 18,588
```
cast_hash
```
: 335,620

推荐ORDER BY顺序：

(entity_type, entity, cast_hash, ...)