data-sql-optimization

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

SQL Query Optimization

SQL查询优化

Framework

框架

IRON LAW: Measure Before Optimizing

NEVER guess which query is slow or why. Use EXPLAIN (EXPLAIN ANALYZE in
PostgreSQL) to see the actual execution plan. The database's plan often
differs from what you expect — a query you think is efficient may do
a full table scan, and a complex-looking query may use an index perfectly.

Measure → identify bottleneck → fix → measure again.
铁律:优化前先测量

永远不要猜测哪个查询慢或者为什么慢。使用EXPLAIN(PostgreSQL中使用EXPLAIN ANALYZE)查看实际执行计划。数据库的计划往往和你预期的不同——你认为高效的查询可能在做全表扫描,而看起来复杂的查询可能完美地使用了索引。

测量→识别瓶颈→修复→再次测量。

EXPLAIN Output Reading

EXPLAIN输出解读

Key metrics in EXPLAIN ANALYZE (PostgreSQL):
MetricWhat It MeansRed Flag
Seq ScanFull table scanOn large tables (>100K rows)
Index ScanUsing an indexExpected for filtered queries
Nested LoopJoin method (row-by-row)On large tables without index
Hash JoinJoin method (hash table)Normal for larger tables
SortSorting resultsWithout index support on large sets
Actual TimeMilliseconds for this stepCompare to identify bottleneck
RowsActual rows processed vs estimatedLarge mismatch = stale statistics
PostgreSQL中EXPLAIN ANALYZE的关键指标:
指标含义危险信号
Seq Scan全表扫描在大表(>10万行)上出现
Index Scan索引扫描过滤查询的预期行为
Nested Loop嵌套循环(逐行连接)在无索引的大表上使用
Hash Join哈希连接大表的正常连接方式
Sort排序操作在大数据集上无索引支持时出现
Actual Time实际耗时(毫秒)用于对比识别瓶颈
Rows实际扫描行数与预估行数差异过大=统计信息过时

Indexing Strategy

索引策略

When to IndexIndex TypeExample
WHERE clause columnB-Tree (default)
CREATE INDEX idx_user_email ON users(email)
JOIN columnB-Tree
CREATE INDEX idx_order_user ON orders(user_id)
Composite filterComposite index
CREATE INDEX idx_order_status_date ON orders(status, created_at)
Text searchGIN / Full-text
CREATE INDEX idx_product_name_gin ON products USING gin(name gin_trgm_ops)
Range queriesB-TreeColumns used with
BETWEEN
,
>
,
<
Composite index column order matters: Put the most selective (highest cardinality) column first.
INDEX(status, date)
is good if you always filter by status.
INDEX(date, status)
is better if you always filter by date range first.
何时创建索引索引类型示例
WHERE子句列B-Tree(默认)
CREATE INDEX idx_user_email ON users(email)
JOIN关联列B-Tree
CREATE INDEX idx_order_user ON orders(user_id)
复合过滤场景复合索引
CREATE INDEX idx_order_status_date ON orders(status, created_at)
文本搜索场景GIN/全文索引
CREATE INDEX idx_product_name_gin ON products USING gin(name gin_trgm_ops)
范围查询场景B-Tree用于使用
BETWEEN
>
,
<
的列
复合索引的列顺序很重要:将选择性最高(基数最大)的列放在前面。如果总是按status过滤,
INDEX(status, date)
很合适;如果总是先按日期范围过滤,
INDEX(date, status)
更优。

Common Anti-Patterns

常见反模式

Anti-PatternProblemFix
SELECT *
Reads all columns, prevents index-only scansSelect only needed columns
Subquery in WHERERe-executes for each rowRewrite as JOIN or CTE
OR
in WHERE
Prevents index useRewrite as UNION or separate queries
Function on indexed column
WHERE YEAR(date) = 2024
bypasses index
WHERE date >= '2024-01-01' AND date < '2025-01-01'
N+1 queries1 query for list + N queries for detailsJOIN or batch query with
IN
Missing paginationFetching all rows when only showing 20
LIMIT
+
OFFSET
or keyset pagination
Implicit type conversion
WHERE id = '123'
(string vs int)
Use correct type:
WHERE id = 123
反模式问题修复方案
SELECT *
读取所有列,无法使用仅索引扫描仅选择需要的列
WHERE子句中的子查询为每一行重新执行重写为JOIN或CTE
WHERE中的
OR
无法使用索引重写为UNION或拆分查询
索引列上使用函数
WHERE YEAR(date) = 2024
会绕过索引
WHERE date >= '2024-01-01' AND date < '2025-01-01'
N+1查询1次列表查询+N次详情查询使用JOIN或
IN
进行批量查询
缺少分页仅展示20行却获取所有行使用
LIMIT
+
OFFSET
或键集分页
隐式类型转换
WHERE id = '123'
(字符串vs整数)
使用正确类型:
WHERE id = 123

Optimization Workflow

优化工作流

  1. Identify slow queries: Database slow query log (pg_stat_statements, MySQL slow log)
  2. Run EXPLAIN ANALYZE on the slowest
  3. Find the bottleneck: Seq Scan on large table? Missing index? Expensive sort?
  4. Apply fix: Add index, rewrite query, or restructure schema
  5. Verify: Run EXPLAIN ANALYZE again — confirm improvement
  6. Monitor: Check that fix didn't degrade other queries
  1. 识别慢查询:数据库慢查询日志(pg_stat_statements、MySQL慢查询日志)
  2. 对最慢查询运行EXPLAIN ANALYZE
  3. 定位瓶颈:大表全表扫描?缺少索引?昂贵的排序操作?
  4. 应用修复:添加索引、重写查询或重构Schema
  5. 验证效果:再次运行EXPLAIN ANALYZE——确认性能提升
  6. 持续监控:检查修复是否影响其他查询的性能

Partitioning (Large Tables)

分区(大表场景)

When tables exceed millions of rows:
StrategyHow It WorksBest For
Range partitionSplit by date range (monthly, yearly)Time-series data, logs
Hash partitionDistribute by hash of a columnEven distribution, high-throughput
List partitionSplit by specific valuesMulti-tenant, status-based
当表行数超过百万时:
策略工作原理适用场景
范围分区按日期范围拆分(月度、年度)时序数据、日志
哈希分区按列的哈希值分布数据均匀分布、高吞吐量场景
列表分区按特定值拆分多租户、基于状态的场景

Output Format

输出格式

markdown
undefined
markdown
undefined

Query Optimization: {Context}

查询优化:{上下文}

Slow Query

慢查询

sql
{the original slow query}
  • Execution time: {current ms}
  • Rows scanned: {N}
  • Problem: {what EXPLAIN revealed}
sql
{原始慢查询}
  • 执行耗时:{当前毫秒数}
  • 扫描行数:{N}
  • 问题:{EXPLAIN分析结果}

Fix Applied

修复措施

{What was changed — new index, query rewrite, etc.}
{修改内容——新增索引、重写查询等}

Result

优化结果

  • Execution time: {original ms} → {optimized ms} ({X% improvement})
  • Rows scanned: {original N} → {optimized N}
undefined
  • 执行耗时:{原毫秒数} → {优化后毫秒数}(提升{X%})
  • 扫描行数:{原行数} → {优化后行数}
undefined

Gotchas

注意事项

  • Indexes have write cost: Every INSERT/UPDATE must update all indexes. Over-indexing slows writes. Index what you query, not everything.
  • Statistics can be stale: If EXPLAIN estimates are way off from actuals, run
    ANALYZE
    (PostgreSQL) or
    ANALYZE TABLE
    (MySQL) to update statistics.
  • Query cache hides problems: A query may appear fast because it's cached. Test with cache cleared or cold start.
  • ORM-generated queries: ORMs (Django, SQLAlchemy, ActiveRecord) generate SQL that may not be optimal. Always inspect the actual SQL for performance-critical paths.
  • Connection pooling: Sometimes the bottleneck isn't the query but connection overhead. Use connection pooling (PgBouncer, ProxySQL) for high-concurrency applications.
  • 索引存在写入开销:每次INSERT/UPDATE都必须更新所有索引。过度索引会减慢写入速度。只为查询用到的列创建索引,而非所有列。
  • 统计信息可能过时:如果EXPLAIN的预估行数与实际差异过大,运行
    ANALYZE
    (PostgreSQL)或
    ANALYZE TABLE
    (MySQL)更新统计信息。
  • 查询缓存会掩盖问题:查询看起来很快可能是因为命中了缓存。请在清空缓存或冷启动状态下测试。
  • ORM生成的查询:ORM(Django、SQLAlchemy、ActiveRecord)生成的SQL可能并非最优。对于性能关键路径,务必检查实际生成的SQL。
  • 连接池优化:有时瓶颈并非查询本身,而是连接开销。对于高并发应用,请使用连接池(PgBouncer、ProxySQL)。

References

参考资料

  • For PostgreSQL-specific optimization, see
    references/pg-optimization.md
  • For CTE vs temp table performance comparison, see
    references/cte-vs-temp.md
  • 针对PostgreSQL的特定优化,请参考
    references/pg-optimization.md
  • 关于CTE与临时表的性能对比,请参考
    references/cte-vs-temp.md