
Rust Performance Best Practices

Expert-level performance optimization guide for Rust. Contains 45+ rules across 9 categories with real benchmarks, failure modes, and profiling workflows.

When to Apply

Reference these guidelines when:
  • Investigating slow Rust programs or high latency
  • Optimizing build times or binary size
  • Reviewing allocation-heavy code
  • Debugging lock contention or thread scaling issues
  • Setting up release profiles for production
  • Working with async runtimes (Tokio, async-std)

When NOT to Apply

Skip these optimizations when:
  • Code isn't in a hot path (profile first!)
  • Readability would suffer significantly
  • You haven't measured a performance problem
  • The optimization requires unsafe code you can't verify
  • Premature optimization would delay shipping

The Optimization Workflow

CRITICAL: Most Rust code doesn't need optimization. Profile first, optimize second.
┌─────────────────────────────────────────────────────────────┐
│                   OPTIMIZATION WORKFLOW                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. MEASURE FIRST                                           │
│     └── Profile before changing anything                   │
│     └── Use cargo flamegraph, perf, or heaptrack           │
│     └── Identify actual bottlenecks (don't guess!)         │
│                                                             │
│  2. CHECK BUILD SETTINGS                                    │
│     └── Release mode? (10-100x vs debug)                   │
│     └── LTO enabled? (5-20% improvement)                   │
│     └── Target CPU? (10-30% for SIMD)                      │
│                                                             │
│  3. FIX ALGORITHMIC ISSUES                                  │
│     └── O(n²) → O(n log n) matters more than micro-opts   │
│     └── Check data structure choices                       │
│     └── Avoid unnecessary work                             │
│                                                             │
│  4. REDUCE ALLOCATIONS                                      │
│     └── Pre-size collections (with_capacity)               │
│     └── Reuse buffers (clear + reuse)                      │
│     └── Avoid cloning (borrow instead)                     │
│                                                             │
│  5. OPTIMIZE HOT LOOPS                                      │
│     └── Iterators over indices                             │
│     └── Reduce lock scope                                  │
│     └── Batch I/O operations                               │
│                                                             │
│  6. MEASURE AGAIN                                           │
│     └── Verify improvement with benchmarks                 │
│     └── Check for regressions elsewhere                    │
│     └── Document the optimization                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Quick Profiling Commands

```bash
# CPU profiling (Linux)
cargo flamegraph --bin myapp
perf record -g ./target/release/myapp && perf report

# Memory profiling
heaptrack ./target/release/myapp && heaptrack_gui heaptrack.myapp.*.gz
DHAT_LOG_FILE=dhat.out cargo run --release && dh_view.py dhat.out

# Benchmark
cargo bench                  # All benchmarks
cargo bench hot_function     # Specific benchmark

# Check allocations
MALLOC_TRACE=/tmp/mtrace.log ./target/release/myapp
mtrace ./target/release/myapp /tmp/mtrace.log

# Assembly inspection
cargo asm my_crate::hot_function --rust

# syscall count
strace -c ./target/release/myapp 2>&1 | head -20
```

Common Scenarios → Rules

"My Rust program is slow"

Is it running in debug mode?
├── YES → build-release-profile (10-100x speedup)
└── NO
    Where does flamegraph show time?
    ├── malloc/free → alloc-* rules (with_capacity, reuse buffers)
    ├── Mutex::lock → sync-* rules (RwLock, atomics, shorter scope)
    ├── read/write syscalls → io-* rules (BufReader/BufWriter)
    ├── clone/drop → alloc-avoid-clone, use references
    └── Your code → iter-* rules, algorithm improvements

"My binary is too large"

1. Enable LTO: build-enable-lto (10-20% smaller)
2. Set opt-level = 'z': build-opt-level (optimizes for size)
3. panic = 'abort': build-panic-abort (removes unwinding code)
4. Strip symbols: strip = true in Cargo.toml
5. Remove debug info: debug = 0
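Taken together, the checklist above maps onto a size-focused `[profile.release]` section; this is a sketch with illustrative values, not a universal recommendation:

```toml
# Cargo.toml — size-focused release profile (illustrative)
[profile.release]
opt-level = "z"     # optimize for size, not speed
lto = true          # cross-crate optimization also shrinks the binary
codegen-units = 1   # better optimization at the cost of compile time
panic = "abort"     # drop unwinding machinery
strip = true        # strip symbols from the binary
debug = 0           # no debug info
```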

"High memory usage"

1. Pre-size collections: alloc-*-with-capacity
2. Reuse allocations: alloc-reuse-buffers
3. Avoid cloning: alloc-avoid-clone
4. Use slices in APIs: alloc-use-slices-in-apis
5. Consider arena allocators: bumpalo crate

"Lock contention / thread scaling"

1. Profile: lock_api::ReentrantMutex or parking_lot profiling
2. Reduce lock scope: sync-keep-lock-scope-short
3. Read-heavy? → sync-use-rwlock
4. Simple counters? → sync-use-atomics
5. Message passing? → sync-use-channels
6. Thread-local + periodic flush for stats

"Slow file I/O"

1. Wrap in BufReader/BufWriter: io-use-bufreader, io-use-bufwriter
2. Flush before returning: io-flush-bufwriter (data loss prevention!)
3. Reuse line buffer: io-read-line-with-bufread
4. Consider mmap for random access: memmap2 crate

Rule Categories

| Priority | Category | Typical Impact | Prefix |
|----------|----------|----------------|--------|
| 1 | Build Profiles | 10-100x (debug→release) | `build-` |
| 2 | Benchmarking | Enables measurement | `bench-` |
| 3 | Allocation | 2-50x for allocation-heavy code | `alloc-` |
| 4 | Data Structures | 2-10x for hot paths | `data-` |
| 5 | Iteration | 2-5x for loop-heavy code | `iter-` |
| 6 | Synchronization | 5-100x for contended code | `sync-` |
| 7 | I/O | 10-100x for I/O-bound code | `io-` |
| 8 | Async/Await | Prevents executor stalls | `async-` |
| 9 | Unsafe | 5-30% in tight loops (experts only) | `unsafe-` |

1. Build Profiles (CRITICAL)

These apply to ALL Rust code. Check these first.

| Rule | Impact | One-liner |
|------|--------|-----------|
| `build-release-profile` | 10-100x | Always ship release builds |
| `build-opt-level` | 2-5x | `opt-level = 3` for speed, `"z"` for size |
| `build-enable-lto` | 5-20% | LTO enables cross-crate optimization |
| `build-codegen-units` | 5-15% | `codegen-units = 1` for max optimization |
| `build-panic-abort` | Binary size | `panic = "abort"` removes unwinding |
| `build-target-cpu` | 10-30% | `target-cpu=native` for SIMD |
| `build-pgo` | 5-20% | Profile-guided optimization |
| `build-incremental-off` | 5-10% | Disable for release builds |
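Several of these rules combine into a speed-focused profile; a sketch with illustrative values (note that `target-cpu` is a codegen flag, not a profile key):

```toml
# Cargo.toml — speed-focused release profile (illustrative)
[profile.release]
opt-level = 3       # optimize for speed
lto = "fat"         # whole-program cross-crate optimization
codegen-units = 1   # maximize optimization scope
incremental = false # incremental caching is for dev builds

# target-cpu is set via RUSTFLAGS, e.g.:
#   RUSTFLAGS="-C target-cpu=native" cargo build --release
```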

2. Benchmarking (REQUIRED)

You can't optimize what you don't measure.

| Rule | Purpose |
|------|---------|
| `bench-cargo-bench` | Use `cargo bench` with criterion |
| `bench-bench-profile` | The bench profile enables optimizations |
| `bench-black-box` | Prevent dead-code elimination |
| `bench-avoid-io` | I/O variance destroys measurements |
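As a minimal illustration of `bench-black-box`, this sketch uses `std::hint::black_box` from the standard library with a hypothetical workload `sum_of_squares`; real benchmarks should use criterion rather than hand-rolled timing:

```rust
use std::hint::black_box;
use std::time::Instant;

// Hypothetical workload to measure.
fn sum_of_squares(n: u64) -> u64 {
    (1..=n).map(|x| x * x).sum()
}

fn main() {
    let start = Instant::now();
    for _ in 0..1_000 {
        // Without black_box, LLVM may constant-fold or delete the call
        // entirely, since its result is otherwise unused.
        black_box(sum_of_squares(black_box(10_000)));
    }
    println!("1000 iterations in {:?}", start.elapsed());
}
```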

3. Allocation

Heap allocation is expensive (allocator bookkeeping, cache misses, and sometimes a syscall). Reduce it.

| Rule | Impact | Pattern |
|------|--------|---------|
| `alloc-vec-with-capacity` | 2-10x | `Vec::with_capacity(n)`, not `Vec::new()` |
| `alloc-string-with-capacity` | 2-5x | `String::with_capacity(n)` |
| `alloc-hashmap-with-capacity` | 2-5x | `HashMap::with_capacity(n)` |
| `alloc-reuse-buffers` | 2-10x | `.clear()` and reuse, don't reallocate (up to 50x in tight loops) |
| `alloc-use-slices-in-apis` | Flexibility | Take `&[T]`, not `Vec<T>`, in parameters |
| `alloc-avoid-clone` | 2-10x | Borrow `&T` instead of `clone()` (benefits scale with data size) |
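A sketch combining `alloc-vec-with-capacity` and `alloc-reuse-buffers`; both function names are illustrative:

```rust
use std::fmt::Write;

// Pre-size: one allocation up front instead of log2(n) regrowths.
fn squares(n: usize) -> Vec<u64> {
    let mut v = Vec::with_capacity(n);
    for i in 0..n as u64 {
        v.push(i * i);
    }
    v
}

// Reuse: one scratch String for the whole loop; clear() keeps capacity.
fn total_rendered_len(items: &[u32]) -> usize {
    let mut buf = String::with_capacity(16); // allocated once
    let mut total = 0;
    for item in items {
        buf.clear(); // drops contents, keeps the allocation
        write!(buf, "item={}", item).unwrap();
        total += buf.len();
    }
    total
}

fn main() {
    assert_eq!(squares(3), vec![0, 1, 4]);
    // "item=1" (6 chars) + "item=22" (7 chars)
    assert_eq!(total_rendered_len(&[1, 22]), 13);
}
```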

4. Data Structures

The right data structure beats micro-optimization.

| Rule | When |
|------|------|
| `data-avoid-linkedlist` | Almost always (`Vec` wins) |
| `data-choose-vecdeque-for-queue` | FIFO queues |
| `data-choose-map-type` | `HashMap` = O(1), `BTreeMap` = sorted |
| `data-use-entry-api` | Insert-or-update patterns |
| `data-repr-transparent` | FFI newtypes |
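A sketch of `data-use-entry-api` using word counting, the classic insert-or-update pattern; one hash lookup per word instead of a `contains_key` plus `insert` pair:

```rust
use std::collections::HashMap;

// Count word frequencies with a single hash probe per word.
fn word_counts(text: &str) -> HashMap<&str, u32> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        // entry() does insert-or-update in one lookup.
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = word_counts("a b a");
    assert_eq!(counts["a"], 2);
    assert_eq!(counts["b"], 1);
}
```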

5. Iteration

Iterators are as fast as loops, and safer.

| Rule | Impact | Pattern |
|------|--------|---------|
| `iter-avoid-collect-then-loop` | 2-3x | Chain iterators, don't collect |
| `iter-use-lazy-iterators` | 2-3x | `.filter().map()`, not intermediate `Vec`s |
| `iter-use-any-find` | Short-circuit | `.any()`, not `.filter().count() > 0` |
| `iter-use-retain` | In-place | `.retain()`, not `.filter().collect()` |
| `iter-use-binary-search` | O(log n) | `.binary_search()` on sorted data |
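The first three rows can be sketched together; both functions are illustrative:

```rust
// Lazy chain: no intermediate Vec is built between filter and map.
fn sum_even_squares(nums: &[i64]) -> i64 {
    nums.iter()
        .filter(|&&x| x % 2 == 0) // lazily skip odd values
        .map(|&x| x * x)          // lazily square the rest
        .sum()                    // single pass, zero allocations
}

// Short-circuits on the first match instead of scanning everything.
fn has_negative(nums: &[i64]) -> bool {
    nums.iter().any(|&x| x < 0)
}

fn main() {
    assert_eq!(sum_even_squares(&[1, 2, 3, 4]), 20); // 4 + 16
    assert!(has_negative(&[5, -1, 7]));
}
```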

6. Synchronization

Locks are expensive. Minimize contention.

| Rule | Impact | When |
|------|--------|------|
| `sync-share-with-arc` | Avoids copying | Share large (>64 B) data across threads |
| `sync-use-rwlock` | 2-8x for reads | >80% reads, few writes; consider `parking_lot` |
| `sync-keep-lock-scope-short` | 4x | Minimize code under lock |
| `sync-use-channels` | 3-4x | Message passing vs shared state |
| `sync-use-atomics` | 20x | Simple counters, flags |
| `sync-use-parking-lot` | 1.5-5x | Prefer `parking_lot` over std sync primitives |
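A sketch of `sync-use-atomics`: a shared counter incremented from several threads without taking a `Mutex` (the function name is illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Sum increments across threads with an atomic instead of Mutex<u64>.
fn parallel_count(threads: usize, iters: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    c.fetch_add(1, Ordering::Relaxed); // lock-free increment
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(parallel_count(4, 10_000), 40_000);
}
```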

7. I/O

Every syscall costs. Buffer them.

| Rule | Impact | Pattern |
|------|--------|---------|
| `io-use-bufreader` | 50x | Wrap `File` in `BufReader` |
| `io-use-bufwriter` | 18x | Wrap `File` in `BufWriter` |
| `io-flush-bufwriter` | CRITICAL | Must flush or lose data! |
| `io-read-line-with-bufread` | 53x | Reuse a `String` buffer with `read_line` |
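A sketch of the BufWriter-plus-flush pattern; it writes to an in-memory `Vec<u8>` so it stays self-contained, but the same shape applies to a `File::create` handle:

```rust
use std::io::{BufWriter, Write};

// Buffer many small writes, then flush once; with a File this turns
// thousands of write(2) syscalls into a handful.
fn write_report(lines: &[&str]) -> std::io::Result<Vec<u8>> {
    let mut w = BufWriter::new(Vec::new()); // stand-in for File::create(...)
    for line in lines {
        writeln!(w, "{}", line)?; // lands in the buffer, not the OS
    }
    w.flush()?; // CRITICAL: unflushed data is lost if the writer leaks
    Ok(w.into_inner().expect("flushed, so no pending data"))
}

fn main() {
    let bytes = write_report(&["a", "b"]).unwrap();
    assert_eq!(bytes, b"a\nb\n".to_vec());
}
```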

8. Async/Await (HIGH)

Critical for Tokio and async-std applications.

| Rule | Impact | Pattern |
|------|--------|---------|
| `async-spawn-blocking` | Prevents hangs | Use `spawn_blocking` for CPU-bound work |
| `async-cooperative` | Latency | Yield periodically in long computations |
| `async-mutex-choice` | Correctness | `tokio::sync::Mutex` across `.await` points |
| `async-avoid-blocking-io` | Throughput | Use async I/O, not `std::fs`, in async contexts |
| `async-bounded-channels` | Backpressure | Prefer bounded channels for flow control |
Key insight: The async runtime is cooperative. Blocking the executor thread starves all other tasks.
```rust
// BAD: Blocks the async runtime
async fn process(data: &[u8]) -> Result<Hash> {
    let hash = expensive_hash(data);  // CPU-bound, blocks executor!
    Ok(hash)
}

// GOOD: Offload to the blocking thread pool
async fn process(data: Vec<u8>) -> Result<Hash> {
    let hash = tokio::task::spawn_blocking(move || expensive_hash(&data)).await?;
    Ok(hash)
}
```

9. Unsafe (Expert Only)

Only use these after profiling proves they matter.

| Rule | Impact | Risk |
|------|--------|------|
| `unsafe-get-unchecked` | 5-30% | UB if bounds are wrong |
| `unsafe-use-maybeuninit` | 20-100x alloc | UB if read before write |
| `unsafe-avoid-transmute` | Correctness | Prefer safe alternatives |
| `unsafe-repr-transparent` | Zero-cost | Required for FFI newtypes |

Decision Trees

When to use with_capacity?

Do you know the size?
├── YES, exact → with_capacity(exact)
├── YES, approximate → with_capacity(estimate)
└── NO
    Will it grow frequently?
    ├── YES → Start bigger or use reserve()
    └── NO → Vec::new() is fine

Mutex vs RwLock vs Atomics?

Is it a simple counter/flag?
├── YES → Atomics (20x faster)
└── NO
    What's the read/write ratio?
    ├── Mostly reads (>90%) → RwLock
    ├── Mostly writes → Mutex
    └── Mixed → Mutex (simpler)

    Consider: parking_lot > std for all of these
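The read-heavy branch of the tree can be sketched with `std::sync::RwLock`; the `Config` type here is illustrative:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Read-mostly store: RwLock lets many readers proceed in parallel,
// where a Mutex would serialize them.
struct Config {
    values: RwLock<HashMap<String, String>>,
}

impl Config {
    fn new() -> Self {
        Config { values: RwLock::new(HashMap::new()) }
    }

    fn get(&self, key: &str) -> Option<String> {
        // Shared read lock: concurrent with other readers.
        self.values.read().unwrap().get(key).cloned()
    }

    fn set(&self, key: &str, val: &str) {
        // Exclusive write lock: keep this scope as short as possible.
        self.values.write().unwrap().insert(key.to_string(), val.to_string());
    }
}

fn main() {
    let cfg = Config::new();
    cfg.set("mode", "fast");
    assert_eq!(cfg.get("mode").as_deref(), Some("fast"));
    assert_eq!(cfg.get("missing"), None);
}
```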

When is unsafe get_unchecked worth it?

Did you profile and find bounds checks are the bottleneck?
├── NO → Don't use it
└── YES
    Did you check if LLVM already removed the bounds check?
    ├── NO → Check assembly first (cargo asm)
    └── YES, still there
        Can you use iterators instead?
        ├── YES → Use iterators (same speed, safe)
        └── NO → get_unchecked with documented invariants
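The last two branches can be sketched as follows; `sum_indices` is a hypothetical function whose index set is guaranteed in-bounds by its caller:

```rust
// Preferred: the iterator version has no bounds checks to begin with.
fn sum_iter(data: &[u64]) -> u64 {
    data.iter().sum()
}

// Expert path: index math that LLVM could not prove in-bounds.
// Invariant: every element of `idx` is < data.len().
fn sum_indices(data: &[u64], idx: &[usize]) -> u64 {
    idx.iter()
        .map(|&i| {
            debug_assert!(i < data.len()); // checked in debug builds only
            // SAFETY: caller guarantees every index is < data.len().
            unsafe { *data.get_unchecked(i) }
        })
        .sum()
}

fn main() {
    let data = [10, 20, 30];
    assert_eq!(sum_iter(&data), 60);
    assert_eq!(sum_indices(&data, &[0, 2]), 40);
}
```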

Reading Rules

Each rule file in `rules/` contains:
  • Quantified impact with real benchmark numbers
  • Visual explanations of how the optimization works
  • Incorrect examples showing common mistakes
  • Correct examples with best practices
  • When NOT to apply - trade-offs and edge cases
  • Common mistakes to avoid
  • Profiling commands to identify the issue
  • References to official docs

Full Compiled Document

For all rules in a single file: `AGENTS.md`