
Rust Performance Best Practices

Expert-level performance optimization guide for Rust. Contains 45+ rules across 9 categories with real benchmarks, failure modes, and profiling workflows.

When to Apply

Reference these guidelines when:
  • Investigating slow Rust programs or high latency
  • Optimizing build times or binary size
  • Reviewing allocation-heavy code
  • Debugging lock contention or thread scaling issues
  • Setting up release profiles for production
  • Working with async runtimes (Tokio, async-std)

When NOT to Apply

Skip these optimizations when:
  • Code isn't in a hot path (profile first!)
  • Readability would suffer significantly
  • You haven't measured a performance problem
  • The optimization requires unsafe code you can't verify
  • Premature optimization would delay shipping

The Optimization Workflow

CRITICAL: Most Rust code doesn't need optimization. Profile first, optimize second.
┌─────────────────────────────────────────────────────────────┐
│                   OPTIMIZATION WORKFLOW                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. MEASURE FIRST                                           │
│     └── Profile before changing anything                   │
│     └── Use cargo flamegraph, perf, or heaptrack           │
│     └── Identify actual bottlenecks (don't guess!)         │
│                                                             │
│  2. CHECK BUILD SETTINGS                                    │
│     └── Release mode? (10-100x vs debug)                   │
│     └── LTO enabled? (5-20% improvement)                   │
│     └── Target CPU? (10-30% for SIMD)                      │
│                                                             │
│  3. FIX ALGORITHMIC ISSUES                                  │
│     └── O(n²) → O(n log n) matters more than micro-opts   │
│     └── Check data structure choices                       │
│     └── Avoid unnecessary work                             │
│                                                             │
│  4. REDUCE ALLOCATIONS                                      │
│     └── Pre-size collections (with_capacity)               │
│     └── Reuse buffers (clear + reuse)                      │
│     └── Avoid cloning (borrow instead)                     │
│                                                             │
│  5. OPTIMIZE HOT LOOPS                                      │
│     └── Iterators over indices                             │
│     └── Reduce lock scope                                  │
│     └── Batch I/O operations                               │
│                                                             │
│  6. MEASURE AGAIN                                           │
│     └── Verify improvement with benchmarks                 │
│     └── Check for regressions elsewhere                    │
│     └── Document the optimization                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Quick Profiling Commands

```bash
# CPU profiling (Linux)
cargo flamegraph --bin myapp
perf record -g ./target/release/myapp && perf report

# Memory profiling
heaptrack ./target/release/myapp && heaptrack_gui heaptrack.myapp.*.gz
DHAT_LOG_FILE=dhat.out cargo run --release && dh_view.py dhat.out

# Benchmark
cargo bench                  # All benchmarks
cargo bench hot_function     # Specific benchmark

# Check allocations
MALLOC_TRACE=/tmp/mtrace.log ./target/release/myapp
mtrace ./target/release/myapp /tmp/mtrace.log

# Assembly inspection
cargo asm my_crate::hot_function --rust

# syscall count
strace -c ./target/release/myapp 2>&1 | head -20
```

Common Scenarios → Rules

"My Rust program is slow"

Is it running in debug mode?
├── YES → build-release-profile (10-100x speedup)
└── NO
    Where does flamegraph show time?
    ├── malloc/free → alloc-* rules (with_capacity, reuse buffers)
    ├── Mutex::lock → sync-* rules (RwLock, atomics, shorter scope)
    ├── read/write syscalls → io-* rules (BufReader/BufWriter)
    ├── clone/drop → alloc-avoid-clone, use references
    └── Your code → iter-* rules, algorithm improvements

"My binary is too large"

1. Enable LTO: build-enable-lto (10-20% smaller)
2. Set opt-level = 'z': build-opt-level (optimizes for size)
3. panic = 'abort': build-panic-abort (removes unwinding code)
4. Strip symbols: strip = true in Cargo.toml
5. Remove debug info: debug = 0
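Taken together, the checklist above maps onto a size-focused `[profile.release]` section; this is a sketch with illustrative values, not a universal recommendation:

```toml
# Cargo.toml — size-focused release profile (illustrative)
[profile.release]
opt-level = "z"     # optimize for size, not speed
lto = true          # cross-crate optimization also shrinks the binary
codegen-units = 1   # better optimization at the cost of compile time
panic = "abort"     # drop unwinding machinery
strip = true        # strip symbols from the binary
debug = 0           # no debug info
```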

"High memory usage"

1. Pre-size collections: alloc-*-with-capacity
2. Reuse allocations: alloc-reuse-buffers
3. Avoid cloning: alloc-avoid-clone
4. Use slices in APIs: alloc-use-slices-in-apis
5. Consider arena allocators: bumpalo crate

"Lock contention / thread scaling"

1. Profile: lock_api::ReentrantMutex or parking_lot profiling
2. Reduce lock scope: sync-keep-lock-scope-short
3. Read-heavy? → sync-use-rwlock
4. Simple counters? → sync-use-atomics
5. Message passing? → sync-use-channels
6. Thread-local + periodic flush for stats

"Slow file I/O"

1. Wrap in BufReader/BufWriter: io-use-bufreader, io-use-bufwriter
2. Flush before returning: io-flush-bufwriter (data loss prevention!)
3. Reuse line buffer: io-read-line-with-bufread
4. Consider mmap for random access: memmap2 crate

Rule Categories

| Priority | Category | Typical Impact | Prefix |
|----------|----------|----------------|--------|
| 1 | Build Profiles | 10-100x (debug→release) | `build-` |
| 2 | Benchmarking | Enables measurement | `bench-` |
| 3 | Allocation | 2-50x for allocation-heavy code | `alloc-` |
| 4 | Data Structures | 2-10x for hot paths | `data-` |
| 5 | Iteration | 2-5x for loop-heavy code | `iter-` |
| 6 | Synchronization | 5-100x for contended code | `sync-` |
| 7 | I/O | 10-100x for I/O-bound code | `io-` |
| 8 | Async/Await | Prevents executor stalls | `async-` |
| 9 | Unsafe | 5-30% in tight loops (experts only) | `unsafe-` |

1. Build Profiles (CRITICAL)

These apply to ALL Rust code. Check these first.

| Rule | Impact | One-liner |
|------|--------|-----------|
| `build-release-profile` | 10-100x | Always ship release builds |
| `build-opt-level` | 2-5x | `opt-level = 3` for speed, `"z"` for size |
| `build-enable-lto` | 5-20% | LTO enables cross-crate optimization |
| `build-codegen-units` | 5-15% | `codegen-units = 1` for max optimization |
| `build-panic-abort` | Binary size | `panic = "abort"` removes unwinding |
| `build-target-cpu` | 10-30% | `target-cpu=native` for SIMD |
| `build-pgo` | 5-20% | Profile-guided optimization |
| `build-incremental-off` | 5-10% | Disable for release builds |
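Several of these rules combine into a speed-focused profile; a sketch with illustrative values (note that `target-cpu` is a codegen flag, not a profile key):

```toml
# Cargo.toml — speed-focused release profile (illustrative)
[profile.release]
opt-level = 3       # optimize for speed
lto = "fat"         # whole-program cross-crate optimization
codegen-units = 1   # maximize optimization scope
incremental = false # incremental caching is for dev builds

# target-cpu is set via RUSTFLAGS, e.g.:
#   RUSTFLAGS="-C target-cpu=native" cargo build --release
```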

2. Benchmarking (REQUIRED)

You can't optimize what you don't measure.

| Rule | Purpose |
|------|---------|
| `bench-cargo-bench` | Use `cargo bench` with criterion |
| `bench-bench-profile` | The bench profile enables optimizations |
| `bench-black-box` | Prevent dead-code elimination |
| `bench-avoid-io` | I/O variance destroys measurements |
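As a minimal illustration of `bench-black-box`, this sketch uses `std::hint::black_box` from the standard library with a hypothetical workload `sum_of_squares`; real benchmarks should use criterion rather than hand-rolled timing:

```rust
use std::hint::black_box;
use std::time::Instant;

// Hypothetical workload to measure.
fn sum_of_squares(n: u64) -> u64 {
    (1..=n).map(|x| x * x).sum()
}

fn main() {
    let start = Instant::now();
    for _ in 0..1_000 {
        // Without black_box, LLVM may constant-fold or delete the call
        // entirely, since its result is otherwise unused.
        black_box(sum_of_squares(black_box(10_000)));
    }
    println!("1000 iterations in {:?}", start.elapsed());
}
```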

3. Allocation

Heap allocation is expensive (allocator bookkeeping, cache misses, and sometimes a syscall). Reduce it.

| Rule | Impact | Pattern |
|------|--------|---------|
| `alloc-vec-with-capacity` | 2-10x | `Vec::with_capacity(n)`, not `Vec::new()` |
| `alloc-string-with-capacity` | 2-5x | `String::with_capacity(n)` |
| `alloc-hashmap-with-capacity` | 2-5x | `HashMap::with_capacity(n)` |
| `alloc-reuse-buffers` | 2-10x | `.clear()` and reuse, don't reallocate (up to 50x in tight loops) |
| `alloc-use-slices-in-apis` | Flexibility | Take `&[T]`, not `Vec<T>`, in parameters |
| `alloc-avoid-clone` | 2-10x | Borrow `&T` instead of `clone()` (benefits scale with data size) |
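A sketch combining `alloc-vec-with-capacity` and `alloc-reuse-buffers`; both function names are illustrative:

```rust
use std::fmt::Write;

// Pre-size: one allocation up front instead of log2(n) regrowths.
fn squares(n: usize) -> Vec<u64> {
    let mut v = Vec::with_capacity(n);
    for i in 0..n as u64 {
        v.push(i * i);
    }
    v
}

// Reuse: one scratch String for the whole loop; clear() keeps capacity.
fn total_rendered_len(items: &[u32]) -> usize {
    let mut buf = String::with_capacity(16); // allocated once
    let mut total = 0;
    for item in items {
        buf.clear(); // drops contents, keeps the allocation
        write!(buf, "item={}", item).unwrap();
        total += buf.len();
    }
    total
}

fn main() {
    assert_eq!(squares(3), vec![0, 1, 4]);
    // "item=1" (6 chars) + "item=22" (7 chars)
    assert_eq!(total_rendered_len(&[1, 22]), 13);
}
```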

4. Data Structures

The right data structure beats micro-optimization.

| Rule | When |
|------|------|
| `data-avoid-linkedlist` | Almost always (`Vec` wins) |
| `data-choose-vecdeque-for-queue` | FIFO queues |
| `data-choose-map-type` | `HashMap` = O(1), `BTreeMap` = sorted |
| `data-use-entry-api` | Insert-or-update patterns |
| `data-repr-transparent` | FFI newtypes |
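A sketch of `data-use-entry-api` using word counting, the classic insert-or-update pattern; one hash lookup per word instead of a `contains_key` plus `insert` pair:

```rust
use std::collections::HashMap;

// Count word frequencies with a single hash probe per word.
fn word_counts(text: &str) -> HashMap<&str, u32> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        // entry() does insert-or-update in one lookup.
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = word_counts("a b a");
    assert_eq!(counts["a"], 2);
    assert_eq!(counts["b"], 1);
}
```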

5. Iteration

Iterators are as fast as loops, and safer.

| Rule | Impact | Pattern |
|------|--------|---------|
| `iter-avoid-collect-then-loop` | 2-3x | Chain iterators, don't collect |
| `iter-use-lazy-iterators` | 2-3x | `.filter().map()`, not intermediate `Vec`s |
| `iter-use-any-find` | Short-circuit | `.any()`, not `.filter().count() > 0` |
| `iter-use-retain` | In-place | `.retain()`, not `.filter().collect()` |
| `iter-use-binary-search` | O(log n) | `.binary_search()` on sorted data |
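The first three rows can be sketched together; both functions are illustrative:

```rust
// Lazy chain: no intermediate Vec is built between filter and map.
fn sum_even_squares(nums: &[i64]) -> i64 {
    nums.iter()
        .filter(|&&x| x % 2 == 0) // lazily skip odd values
        .map(|&x| x * x)          // lazily square the rest
        .sum()                    // single pass, zero allocations
}

// Short-circuits on the first match instead of scanning everything.
fn has_negative(nums: &[i64]) -> bool {
    nums.iter().any(|&x| x < 0)
}

fn main() {
    assert_eq!(sum_even_squares(&[1, 2, 3, 4]), 20); // 4 + 16
    assert!(has_negative(&[5, -1, 7]));
}
```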

6. Synchronization

Locks are expensive. Minimize contention.

| Rule | Impact | When |
|------|--------|------|
| `sync-share-with-arc` | Avoids copying | Share large (>64 B) data across threads |
| `sync-use-rwlock` | 2-8x for reads | >80% reads, few writes; consider `parking_lot` |
| `sync-keep-lock-scope-short` | 4x | Minimize code under lock |
| `sync-use-channels` | 3-4x | Message passing vs shared state |
| `sync-use-atomics` | 20x | Simple counters, flags |
| `sync-use-parking-lot` | 1.5-5x | Prefer `parking_lot` over std sync primitives |
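A sketch of `sync-use-atomics`: a shared counter incremented from several threads without taking a `Mutex` (the function name is illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Sum increments across threads with an atomic instead of Mutex<u64>.
fn parallel_count(threads: usize, iters: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    c.fetch_add(1, Ordering::Relaxed); // lock-free increment
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(parallel_count(4, 10_000), 40_000);
}
```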

7. I/O

Every syscall costs. Buffer them.

| Rule | Impact | Pattern |
|------|--------|---------|
| `io-use-bufreader` | 50x | Wrap `File` in `BufReader` |
| `io-use-bufwriter` | 18x | Wrap `File` in `BufWriter` |
| `io-flush-bufwriter` | CRITICAL | Must flush or lose data! |
| `io-read-line-with-bufread` | 53x | Reuse a `String` buffer with `read_line` |
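A sketch of the BufWriter-plus-flush pattern; it writes to an in-memory `Vec<u8>` so it stays self-contained, but the same shape applies to a `File::create` handle:

```rust
use std::io::{BufWriter, Write};

// Buffer many small writes, then flush once; with a File this turns
// thousands of write(2) syscalls into a handful.
fn write_report(lines: &[&str]) -> std::io::Result<Vec<u8>> {
    let mut w = BufWriter::new(Vec::new()); // stand-in for File::create(...)
    for line in lines {
        writeln!(w, "{}", line)?; // lands in the buffer, not the OS
    }
    w.flush()?; // CRITICAL: unflushed data is lost if the writer leaks
    Ok(w.into_inner().expect("flushed, so no pending data"))
}

fn main() {
    let bytes = write_report(&["a", "b"]).unwrap();
    assert_eq!(bytes, b"a\nb\n".to_vec());
}
```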

8. Async/Await (HIGH)

Critical for Tokio and async-std applications.

| Rule | Impact | Pattern |
|------|--------|---------|
| `async-spawn-blocking` | Prevents hangs | Use `spawn_blocking` for CPU-bound work |
| `async-cooperative` | Latency | Yield periodically in long computations |
| `async-mutex-choice` | Correctness | `tokio::sync::Mutex` across `.await` points |
| `async-avoid-blocking-io` | Throughput | Use async I/O, not `std::fs`, in async contexts |
| `async-bounded-channels` | Backpressure | Prefer bounded channels for flow control |
Key insight: The async runtime is cooperative. Blocking the executor thread starves all other tasks.
```rust
// BAD: Blocks the async runtime
async fn process(data: &[u8]) -> Result<Hash> {
    let hash = expensive_hash(data);  // CPU-bound, blocks executor!
    Ok(hash)
}

// GOOD: Offload to the blocking thread pool
async fn process(data: Vec<u8>) -> Result<Hash> {
    let hash = tokio::task::spawn_blocking(move || expensive_hash(&data)).await?;
    Ok(hash)
}
```

9. Unsafe (Expert Only)

Only use these after profiling proves they matter.

| Rule | Impact | Risk |
|------|--------|------|
| `unsafe-get-unchecked` | 5-30% | UB if bounds are wrong |
| `unsafe-use-maybeuninit` | 20-100x alloc | UB if read before write |
| `unsafe-avoid-transmute` | Correctness | Prefer safe alternatives |
| `unsafe-repr-transparent` | Zero-cost | Required for FFI newtypes |

Decision Trees

When to use with_capacity?

Do you know the size?
├── YES, exact → with_capacity(exact)
├── YES, approximate → with_capacity(estimate)
└── NO
    Will it grow frequently?
    ├── YES → Start bigger or use reserve()
    └── NO → Vec::new() is fine

Mutex vs RwLock vs Atomics?

Is it a simple counter/flag?
├── YES → Atomics (20x faster)
└── NO
    What's the read/write ratio?
    ├── Mostly reads (>90%) → RwLock
    ├── Mostly writes → Mutex
    └── Mixed → Mutex (simpler)

    Consider: parking_lot > std for all of these
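The read-heavy branch of the tree can be sketched with `std::sync::RwLock`; the `Config` type here is illustrative:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Read-mostly store: RwLock lets many readers proceed in parallel,
// where a Mutex would serialize them.
struct Config {
    values: RwLock<HashMap<String, String>>,
}

impl Config {
    fn new() -> Self {
        Config { values: RwLock::new(HashMap::new()) }
    }

    fn get(&self, key: &str) -> Option<String> {
        // Shared read lock: concurrent with other readers.
        self.values.read().unwrap().get(key).cloned()
    }

    fn set(&self, key: &str, val: &str) {
        // Exclusive write lock: keep this scope as short as possible.
        self.values.write().unwrap().insert(key.to_string(), val.to_string());
    }
}

fn main() {
    let cfg = Config::new();
    cfg.set("mode", "fast");
    assert_eq!(cfg.get("mode").as_deref(), Some("fast"));
    assert_eq!(cfg.get("missing"), None);
}
```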

When is unsafe get_unchecked worth it?

Did you profile and find bounds checks are the bottleneck?
├── NO → Don't use it
└── YES
    Did you check if LLVM already removed the bounds check?
    ├── NO → Check assembly first (cargo asm)
    └── YES, still there
        Can you use iterators instead?
        ├── YES → Use iterators (same speed, safe)
        └── NO → get_unchecked with documented invariants
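The last two branches can be sketched as follows; `sum_indices` is a hypothetical function whose index set is guaranteed in-bounds by its caller:

```rust
// Preferred: the iterator version has no bounds checks to begin with.
fn sum_iter(data: &[u64]) -> u64 {
    data.iter().sum()
}

// Expert path: index math that LLVM could not prove in-bounds.
// Invariant: every element of `idx` is < data.len().
fn sum_indices(data: &[u64], idx: &[usize]) -> u64 {
    idx.iter()
        .map(|&i| {
            debug_assert!(i < data.len()); // checked in debug builds only
            // SAFETY: caller guarantees every index is < data.len().
            unsafe { *data.get_unchecked(i) }
        })
        .sum()
}

fn main() {
    let data = [10, 20, 30];
    assert_eq!(sum_iter(&data), 60);
    assert_eq!(sum_indices(&data, &[0, 2]), 40);
}
```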

Reading Rules

Each rule file in `rules/` contains:
  • Quantified impact with real benchmark numbers
  • Visual explanations of how the optimization works
  • Incorrect examples showing common mistakes
  • Correct examples with best practices
  • When NOT to apply - trade-offs and edge cases
  • Common mistakes to avoid
  • Profiling commands to identify the issue
  • References to official docs

Full Compiled Document

For all rules in a single file: `AGENTS.md`