clickhouse-query-optimization
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseClickHouse Query Optimization
ClickHouse 查询优化
Quick Start
快速入门
Check your query plan:
sql
EXPLAIN
SELECT user_id, COUNT()
FROM events
WHERE timestamp >= '2024-01-01'
GROUP BY user_id;This shows which parts of the index are used, how many partitions are read, and the aggregation strategy.
查看你的查询计划:
sql
EXPLAIN
SELECT user_id, COUNT()
FROM events
WHERE timestamp >= '2024-01-01'
GROUP BY user_id;这会展示索引的哪些部分被使用、读取了多少分区以及聚合策略。
When to Use
适用场景
- Write fast ClickHouse queries
- Design table schemas
- Analyze slow queries
- Add data skipping indexes
- Implement partitioning strategies
- Use projections for multiple access patterns
- 编写高性能ClickHouse查询
- 设计表结构
- 分析慢查询
- 添加数据跳过索引
- 实现分区策略
- 为多访问模式配置投影
Core Principles
核心原则
1. Primary Key Design
1. 主键设计
The primary key defines sort order (not uniqueness). Order columns by low → high cardinality.
sql
-- Good: country (low) → user_id → timestamp (high)
CREATE TABLE events (
user_id UInt32,
timestamp DateTime,
country String
)
ENGINE = MergeTree()
ORDER BY (country, user_id, timestamp);Key principle: Queries must filter on primary key prefix to use index.
sql
-- ✅ Fast: Uses index (country first)
SELECT * FROM events WHERE country = 'US';
-- ❌ Slow: Skips index (missing country)
SELECT * FROM events WHERE user_id = 12345;主键定义排序顺序(而非唯一性)。按低→高基数对列排序。
sql
-- 推荐:country(低基数)→ user_id → timestamp(高基数)
CREATE TABLE events (
user_id UInt32,
timestamp DateTime,
country String
)
ENGINE = MergeTree()
ORDER BY (country, user_id, timestamp);核心原则:查询必须基于主键前缀进行过滤才能使用索引。
sql
-- ✅ 快速:使用索引(country在前)
SELECT * FROM events WHERE country = 'US';
-- ❌ 缓慢:跳过索引(缺少country条件)
SELECT * FROM events WHERE user_id = 12345;2. Data Skipping Indexes
2. 数据跳过索引
For non-primary-key columns:
sql
-- Numeric ranges
ALTER TABLE events ADD INDEX idx_duration session_duration TYPE minmax GRANULARITY 4;
-- Categorical (low cardinality)
ALTER TABLE events ADD INDEX idx_event_type event_type TYPE set(100) GRANULARITY 4;
-- String equality
ALTER TABLE events ADD INDEX idx_url url TYPE bloom_filter(0.01) GRANULARITY 4;
-- Substring search
ALTER TABLE logs ADD INDEX idx_message message TYPE ngrambf_v1(4, 512, 3, 0) GRANULARITY 1;针对非主键列:
sql
-- 数值范围
ALTER TABLE events ADD INDEX idx_duration session_duration TYPE minmax GRANULARITY 4;
-- 分类数据(低基数)
ALTER TABLE events ADD INDEX idx_event_type event_type TYPE set(100) GRANULARITY 4;
-- 字符串等值匹配
ALTER TABLE events ADD INDEX idx_url url TYPE bloom_filter(0.01) GRANULARITY 4;
-- 子字符串搜索
ALTER TABLE logs ADD INDEX idx_message message TYPE ngrambf_v1(4, 512, 3, 0) GRANULARITY 1;3. Partitioning for Lifecycle Management
3. 用于生命周期管理的分区
sql
CREATE TABLE events (
timestamp DateTime,
user_id UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, timestamp);
-- Drop old data instantly
ALTER TABLE events DROP PARTITION '202401';
-- Or use TTL
ALTER TABLE events MODIFY TTL timestamp + INTERVAL 90 DAY;sql
CREATE TABLE events (
timestamp DateTime,
user_id UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, timestamp);
-- 立即删除旧数据
ALTER TABLE events DROP PARTITION '202401';
-- 或使用TTL
ALTER TABLE events MODIFY TTL timestamp + INTERVAL 90 DAY;4. Projections for Multiple Access Patterns
4. 为多访问模式配置投影
sql
-- Main table sorted by user_id
CREATE TABLE events (
user_id UInt32,
product_id UInt32,
timestamp DateTime
)
ENGINE = MergeTree()
ORDER BY (user_id, timestamp);
-- Add projection for product queries
ALTER TABLE events ADD PROJECTION proj_by_product (
SELECT *
ORDER BY (product_id, timestamp)
);
ALTER TABLE events MATERIALIZE PROJECTION proj_by_product;
-- Both queries now fast:
SELECT * FROM events WHERE user_id = 12345; -- Uses main table
SELECT * FROM events WHERE product_id = 789; -- Uses projectionsql
-- 主表按user_id排序
CREATE TABLE events (
user_id UInt32,
product_id UInt32,
timestamp DateTime
)
ENGINE = MergeTree()
ORDER BY (user_id, timestamp);
-- 添加针对product查询的投影
ALTER TABLE events ADD PROJECTION proj_by_product (
SELECT *
ORDER BY (product_id, timestamp)
);
ALTER TABLE events MATERIALIZE PROJECTION proj_by_product;
-- 现在两个查询都很快:
SELECT * FROM events WHERE user_id = 12345; -- 使用主表
SELECT * FROM events WHERE product_id = 789; -- 使用投影5. Query Optimization
5. 查询优化
PREWHERE for Early Filtering:
sql
SELECT user_id, event_type, properties
FROM events
PREWHERE timestamp >= '2024-01-01' AND country = 'US' -- Small columns first
WHERE event_type IN ('purchase', 'signup'); -- Complex logicApproximate Functions:
sql
-- 10-100x faster, ~2% error
SELECT uniq(user_id) FROM events; -- vs COUNT(DISTINCT)
SELECT topK(10)(product_id) FROM events; -- Approximate top-K
SELECT quantile(0.95)(response_time) FROM events; -- Approximate percentileSelect Only Needed Columns:
sql
-- Bad: Reads all columns
SELECT * FROM events WHERE user_id = 12345;
-- Good: Columnar advantage
SELECT user_id, timestamp, event_type FROM events WHERE user_id = 12345;使用PREWHERE提前过滤:
sql
SELECT user_id, event_type, properties
FROM events
PREWHERE timestamp >= '2024-01-01' AND country = 'US' -- 优先过滤小列
WHERE event_type IN ('purchase', 'signup'); -- 复杂逻辑后置近似函数:
sql
-- 速度快10-100倍,误差约2%
SELECT uniq(user_id) FROM events; -- 替代COUNT(DISTINCT)
SELECT topK(10)(product_id) FROM events; -- 近似Top-K
SELECT quantile(0.95)(response_time) FROM events; -- 近似分位数仅选择需要的列:
sql
-- 不佳:读取所有列
SELECT * FROM events WHERE user_id = 12345;
-- 推荐:利用列式存储优势
SELECT user_id, timestamp, event_type FROM events WHERE user_id = 12345;6. Profile and Debug
6. 性能分析与调试
sql
-- View execution plan
EXPLAIN SELECT COUNT() FROM events WHERE country = 'US';
-- Check performance
SELECT
query,
query_duration_ms,
read_rows,
read_bytes
FROM system.query_log
WHERE query LIKE '%events%'
ORDER BY event_time DESC
LIMIT 1;sql
-- 查看执行计划
EXPLAIN SELECT COUNT() FROM events WHERE country = 'US';
-- 检查性能指标
SELECT
query,
query_duration_ms,
read_rows,
read_bytes
FROM system.query_log
WHERE query LIKE '%events%'
ORDER BY event_time DESC
LIMIT 1;Common Patterns
常见模式
| Technique | Problem Solved | Impact | When to Use |
|---|---|---|---|
| Primary Key Design | Index doesn't cover queries | Foundation | Always (design first) |
| Data Skipping Indexes | Non-primary filtering slow | 10-100x | After primary key |
| Partitioning | Need to delete old data | Instant deletion | Time-series with retention |
| Projections | Multiple query patterns | 100-1000x | Different sort orders |
| Query Syntax | Large columns read unnecessarily | 2-10x | Per-query optimization |
| Profiling | Don't know why slow | Insight | When optimization unclear |
| 技术 | 解决的问题 | 影响 | 适用场景 |
|---|---|---|---|
| 主键设计 | 索引无法覆盖查询 | 基础 | 始终(优先设计) |
| 数据跳过索引 | 非主键过滤缓慢 | 10-100倍提升 | 主键设计完成后 |
| 分区 | 需要删除旧数据 | 即时删除 | 带保留策略的时序数据 |
| 投影 | 多查询模式需求 | 100-1000倍提升 | 不同排序需求的场景 |
| 查询语法优化 | 不必要地读取大列 | 2-10倍提升 | 单查询优化 |
| 性能分析 | 不清楚查询缓慢的原因 | 提供洞察 | 优化方向不明确时 |
Supporting Files
配套文件
| File | Purpose |
|---|---|
| examples/examples.md | Real-world optimization scenarios with metrics |
| references/reference.md | Technical guides and decision trees |
| 文件 | 用途 |
|---|---|
| examples/examples.md | 带指标的真实场景优化案例 |
| references/reference.md | 技术指南与决策树 |
Requirements
要求
- ClickHouse 21.4+
- Understanding of SQL and aggregation
- Knowledge of query patterns
- ClickHouse 21.4+
- 具备SQL与聚合的基础知识
- 了解查询模式
Integration Tips
集成技巧
- Design tables first (use EXPLAIN before/after)
- Monitor query_log (alert on > 100M rows read)
- Profile inserts (more indexes = slower writes)
- Test projections (use EXPLAIN to confirm optimizer choice)
- 优先设计表结构(优化前后均使用EXPLAIN)
- 监控query_log(当读取行数>1亿时触发告警)
- 分析插入性能(索引越多,写入越慢)
- 测试投影(使用EXPLAIN确认优化器选择)
See Also
相关链接
- Examples: examples/examples.md - E-commerce, user events, time-series
- Reference: references/reference.md - Index selection matrix, EXPLAIN guide, TTL config
- 示例:examples/examples.md - 电商、用户事件、时序数据场景
- 参考:references/reference.md - 索引选择矩阵、EXPLAIN指南、TTL配置