data-sql-optimization

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

SQL Query Optimization

SQL查询优化

Framework

框架

IRON LAW: Measure Before Optimizing

NEVER guess which query is slow or why. Use EXPLAIN (EXPLAIN ANALYZE in
PostgreSQL) to see the actual execution plan. The database's plan often
differs from what you expect — a query you think is efficient may do
a full table scan, and a complex-looking query may use an index perfectly.

Measure → identify bottleneck → fix → measure again.

铁律：优化前先测量

永远不要猜测哪个查询慢或者为什么慢。使用EXPLAIN（PostgreSQL中使用EXPLAIN ANALYZE）查看实际执行计划。数据库的计划往往和你预期的不同——你认为高效的查询可能在做全表扫描，而看起来复杂的查询可能完美地使用了索引。

测量→识别瓶颈→修复→再次测量。

EXPLAIN Output Reading

EXPLAIN输出解读

Key metrics in EXPLAIN ANALYZE (PostgreSQL):

Metric	What It Means	Red Flag
Seq Scan	Full table scan	On large tables (>100K rows)
Index Scan	Using an index	Expected for filtered queries
Nested Loop	Join method (row-by-row)	On large tables without index
Hash Join	Join method (hash table)	Normal for larger tables
Sort	Sorting results	Without index support on large sets
Actual Time	Milliseconds for this step	Compare to identify bottleneck
Rows	Actual rows processed vs estimated	Large mismatch = stale statistics

PostgreSQL中EXPLAIN ANALYZE的关键指标：

指标	含义	危险信号
Seq Scan	全表扫描	在大表（>10万行）上出现
Index Scan	索引扫描	过滤查询的预期行为
Nested Loop	嵌套循环（逐行连接）	在无索引的大表上使用
Hash Join	哈希连接	大表的正常连接方式
Sort	排序操作	在大数据集上无索引支持时出现
Actual Time	实际耗时（毫秒）	用于对比识别瓶颈
Rows	实际扫描行数与预估行数	差异过大=统计信息过时

Indexing Strategy

索引策略

When to Index	Index Type	Example
WHERE clause column	B-Tree (default)	`CREATE INDEX idx_user_email ON users(email)`
JOIN column	B-Tree	`CREATE INDEX idx_order_user ON orders(user_id)`
Composite filter	Composite index	`CREATE INDEX idx_order_status_date ON orders(status, created_at)`
Text search	GIN / Full-text	`CREATE INDEX idx_product_name_gin ON products USING gin(name gin_trgm_ops)`
Range queries	B-Tree	Columns used with `BETWEEN` , `>` , `<`

Composite index column order matters: Put the most selective (highest cardinality) column first.

INDEX(status, date)

is good if you always filter by status.

INDEX(date, status)

is better if you always filter by date range first.

何时创建索引	索引类型	示例
WHERE子句列	B-Tree（默认）	`CREATE INDEX idx_user_email ON users(email)`
JOIN关联列	B-Tree	`CREATE INDEX idx_order_user ON orders(user_id)`
复合过滤场景	复合索引	`CREATE INDEX idx_order_status_date ON orders(status, created_at)`
文本搜索场景	GIN/全文索引	`CREATE INDEX idx_product_name_gin ON products USING gin(name gin_trgm_ops)`
范围查询场景	B-Tree	用于使用 `BETWEEN` 、 `>` , `<` 的列

复合索引的列顺序很重要：将选择性最高（基数最大）的列放在前面。如果总是按status过滤，

INDEX(status, date)

很合适；如果总是先按日期范围过滤，

INDEX(date, status)

更优。

Common Anti-Patterns

常见反模式

Anti-Pattern	Problem	Fix
`SELECT *`	Reads all columns, prevents index-only scans	Select only needed columns
Subquery in WHERE	Re-executes for each row	Rewrite as JOIN or CTE
`OR` in WHERE	Prevents index use	Rewrite as UNION or separate queries
Function on indexed column	`WHERE YEAR(date) = 2024` bypasses index	`WHERE date >= '2024-01-01' AND date < '2025-01-01'`
N+1 queries	1 query for list + N queries for details	JOIN or batch query with `IN`
Missing pagination	Fetching all rows when only showing 20	`LIMIT` + `OFFSET` or keyset pagination
Implicit type conversion	`WHERE id = '123'` (string vs int)	Use correct type: `WHERE id = 123`

反模式	问题	修复方案
`SELECT *`	读取所有列，无法使用仅索引扫描	仅选择需要的列
WHERE子句中的子查询	为每一行重新执行	重写为JOIN或CTE
WHERE中的 `OR`	无法使用索引	重写为UNION或拆分查询
索引列上使用函数	`WHERE YEAR(date) = 2024` 会绕过索引	`WHERE date >= '2024-01-01' AND date < '2025-01-01'`
N+1查询	1次列表查询+N次详情查询	使用JOIN或 `IN` 进行批量查询
缺少分页	仅展示20行却获取所有行	使用 `LIMIT` + `OFFSET` 或键集分页
隐式类型转换	`WHERE id = '123'` （字符串vs整数）	使用正确类型： `WHERE id = 123`

Optimization Workflow

优化工作流

Identify slow queries: Database slow query log (pg_stat_statements, MySQL slow log)
Run EXPLAIN ANALYZE on the slowest
Find the bottleneck: Seq Scan on large table? Missing index? Expensive sort?
Apply fix: Add index, rewrite query, or restructure schema
Verify: Run EXPLAIN ANALYZE again — confirm improvement
Monitor: Check that fix didn't degrade other queries

识别慢查询：数据库慢查询日志（pg_stat_statements、MySQL慢查询日志）
对最慢查询运行EXPLAIN ANALYZE
定位瓶颈：大表全表扫描？缺少索引？昂贵的排序操作？
应用修复：添加索引、重写查询或重构Schema
验证效果：再次运行EXPLAIN ANALYZE——确认性能提升
持续监控：检查修复是否影响其他查询的性能

Partitioning (Large Tables)

分区（大表场景）

When tables exceed millions of rows:

Strategy	How It Works	Best For
Range partition	Split by date range (monthly, yearly)	Time-series data, logs
Hash partition	Distribute by hash of a column	Even distribution, high-throughput
List partition	Split by specific values	Multi-tenant, status-based

当表行数超过百万时：

策略	工作原理	适用场景
范围分区	按日期范围拆分（月度、年度）	时序数据、日志
哈希分区	按列的哈希值分布数据	均匀分布、高吞吐量场景
列表分区	按特定值拆分	多租户、基于状态的场景

Output Format

输出格式

markdown

undefined

markdown

undefined

Query Optimization: {Context}

查询优化：{上下文}

Slow Query

慢查询

sql

{the original slow query}

Execution time: {current ms}
Rows scanned: {N}
Problem: {what EXPLAIN revealed}

sql

{原始慢查询}

执行耗时：{当前毫秒数}
扫描行数：{N}
问题：{EXPLAIN分析结果}

Fix Applied

修复措施

{What was changed — new index, query rewrite, etc.}

{修改内容——新增索引、重写查询等}

Result

优化结果

Execution time: {original ms} → {optimized ms} ({X% improvement})
Rows scanned: {original N} → {optimized N}

undefined

执行耗时：{原毫秒数} → {优化后毫秒数}（提升{X%}）
扫描行数：{原行数} → {优化后行数}

undefined

Gotchas

注意事项

Indexes have write cost: Every INSERT/UPDATE must update all indexes. Over-indexing slows writes. Index what you query, not everything.
Statistics can be stale: If EXPLAIN estimates are way off from actuals, run
```
ANALYZE
```
(PostgreSQL) or
```
ANALYZE TABLE
```
(MySQL) to update statistics.
Query cache hides problems: A query may appear fast because it's cached. Test with cache cleared or cold start.
ORM-generated queries: ORMs (Django, SQLAlchemy, ActiveRecord) generate SQL that may not be optimal. Always inspect the actual SQL for performance-critical paths.
Connection pooling: Sometimes the bottleneck isn't the query but connection overhead. Use connection pooling (PgBouncer, ProxySQL) for high-concurrency applications.

索引存在写入开销：每次INSERT/UPDATE都必须更新所有索引。过度索引会减慢写入速度。只为查询用到的列创建索引，而非所有列。
统计信息可能过时：如果EXPLAIN的预估行数与实际差异过大，运行
```
ANALYZE
```
（PostgreSQL）或
```
ANALYZE TABLE
```
（MySQL）更新统计信息。
查询缓存会掩盖问题：查询看起来很快可能是因为命中了缓存。请在清空缓存或冷启动状态下测试。
ORM生成的查询：ORM（Django、SQLAlchemy、ActiveRecord）生成的SQL可能并非最优。对于性能关键路径，务必检查实际生成的SQL。
连接池优化：有时瓶颈并非查询本身，而是连接开销。对于高并发应用，请使用连接池（PgBouncer、ProxySQL）。

References

参考资料

For PostgreSQL-specific optimization, see
```
references/pg-optimization.md
```
For CTE vs temp table performance comparison, see
```
references/cte-vs-temp.md
```

针对PostgreSQL的特定优化，请参考
```
references/pg-optimization.md
```
关于CTE与临时表的性能对比，请参考
```
references/cte-vs-temp.md
```