rust-performance-best-practices
Rust Performance Best Practices
Expert-level performance optimization guide for Rust. Contains 45+ rules across 9 categories with real benchmarks, failure modes, and profiling workflows.
When to Apply
Reference these guidelines when:
- Investigating slow Rust programs or high latency
- Optimizing build times or binary size
- Reviewing allocation-heavy code
- Debugging lock contention or thread scaling issues
- Setting up release profiles for production
- Working with async runtimes (Tokio, async-std)
When NOT to Apply
Skip these optimizations when:
- Code isn't in a hot path (profile first!)
- Readability would suffer significantly
- You haven't measured a performance problem
- The optimization requires unsafe code you can't verify
- Premature optimization would delay shipping
The Optimization Workflow
CRITICAL: Most Rust code doesn't need optimization. Profile first, optimize second.
```
OPTIMIZATION WORKFLOW

1. MEASURE FIRST
   └── Profile before changing anything
   └── Use cargo flamegraph, perf, or heaptrack
   └── Identify actual bottlenecks (don't guess!)

2. CHECK BUILD SETTINGS
   └── Release mode? (10-100x vs debug)
   └── LTO enabled? (5-20% improvement)
   └── Target CPU? (10-30% for SIMD)

3. FIX ALGORITHMIC ISSUES
   └── O(n²) → O(n log n) matters more than micro-opts
   └── Check data structure choices
   └── Avoid unnecessary work

4. REDUCE ALLOCATIONS
   └── Pre-size collections (with_capacity)
   └── Reuse buffers (clear + reuse)
   └── Avoid cloning (borrow instead)

5. OPTIMIZE HOT LOOPS
   └── Iterators over indices
   └── Reduce lock scope
   └── Batch I/O operations

6. MEASURE AGAIN
   └── Verify improvement with benchmarks
   └── Check for regressions elsewhere
   └── Document the optimization
```
Quick Profiling Commands
```bash
# CPU profiling (Linux)
cargo flamegraph --bin myapp
perf record -g ./target/release/myapp && perf report

# Memory profiling
heaptrack ./target/release/myapp && heaptrack_gui heaptrack.myapp.*.gz
DHAT_LOG_FILE=dhat.out cargo run --release && dh_view.py dhat.out

# Benchmarks
cargo bench                  # All benchmarks
cargo bench hot_function     # Specific benchmark

# Check allocations
MALLOC_TRACE=/tmp/mtrace.log ./target/release/myapp
mtrace ./target/release/myapp /tmp/mtrace.log

# Assembly inspection
cargo asm my_crate::hot_function --rust

# Syscall counts
strace -c ./target/release/myapp 2>&1 | head -20
```
Common Scenarios → Rules
"My Rust program is slow"
```
Is it running in debug mode?
├── YES → build-release-profile (10-100x speedup)
└── NO
    │
    Where does flamegraph show time?
    ├── malloc/free → alloc-* rules (with_capacity, reuse buffers)
    ├── Mutex::lock → sync-* rules (RwLock, atomics, shorter scope)
    ├── read/write syscalls → io-* rules (BufReader/BufWriter)
    ├── clone/drop → alloc-avoid-clone, use references
    └── Your code → iter-* rules, algorithm improvements
```

"My binary is too large"
1. Enable LTO: build-enable-lto (10-20% smaller)
2. Set opt-level = 'z': build-opt-level (optimizes for size)
3. panic = 'abort': build-panic-abort (removes unwinding code)
4. Strip symbols: strip = true in Cargo.toml
5. Remove debug info: debug = 0

"High memory usage"
1. Pre-size collections: alloc-*-with-capacity
2. Reuse allocations: alloc-reuse-buffers
3. Avoid cloning: alloc-avoid-clone
4. Use slices in APIs: alloc-use-slices-in-apis
5. Consider arena allocators: bumpalo crate

"Lock contention / thread scaling"
1. Profile: lock_api::ReentrantMutex or parking_lot profiling
2. Reduce lock scope: sync-keep-lock-scope-short
3. Read-heavy? → sync-use-rwlock
4. Simple counters? → sync-use-atomics
5. Message passing? → sync-use-channels
6. Thread-local + periodic flush for stats

"Slow file I/O"
1. Wrap in BufReader/BufWriter: io-use-bufreader, io-use-bufwriter
2. Flush before returning: io-flush-bufwriter (data loss prevention!)
3. Reuse line buffer: io-read-line-with-bufread
4. Consider mmap for random access: memmap2 crate

Rule Categories
| Priority | Category | Typical Impact | Prefix |
|---|---|---|---|
| 1 | Build Profiles | 10-100x (debug→release) | build- |
| 2 | Benchmarking | Enables measurement | |
| 3 | Allocation | 2-50x for allocation-heavy code | alloc- |
| 4 | Data Structures | 2-10x for hot paths | |
| 5 | Iteration | 2-5x for loop-heavy code | iter- |
| 6 | Synchronization | 5-100x for contended code | sync- |
| 7 | I/O | 10-100x for I/O-bound code | io- |
| 8 | Async/Await | Hang prevention, latency, throughput | |
| 9 | Unsafe | 5-30% in tight loops (experts only) | |
1. Build Profiles (CRITICAL)
These apply to ALL Rust code. Check these first.
| Rule | Impact | One-liner |
|---|---|---|
| build-release-profile | 10-100x | Always ship release builds |
| build-opt-level | 2-5x | opt-level=3 for speed, 'z' for size |
| build-enable-lto | 5-20% | LTO enables cross-crate optimization |
| | 5-15% | codegen-units=1 for max optimization |
| build-panic-abort | Binary size | panic='abort' removes unwinding |
| | 10-30% | target-cpu=native for SIMD |
| | 5-20% | Profile-guided optimization |
| | 5-10% | Disable incremental for release builds |
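Taken together, the rules above typically end up as a release profile like this sketch (illustrative settings, tune per project; note that target-cpu is a compiler flag, not a Cargo.toml key):

```toml
# Cargo.toml — sketch of a speed-oriented release profile
[profile.release]
opt-level = 3        # or "z" when binary size matters more than speed
lto = "fat"          # cross-crate inlining; "thin" is a faster-compiling middle ground
codegen-units = 1    # one codegen unit = maximum optimization, slower builds
panic = "abort"      # removes unwinding machinery (incompatible with catch_unwind)
strip = true         # strip symbols from the binary
debug = 0            # no debug info
incremental = false  # incremental caching helps dev loops, not release artifacts
```

For target-cpu, pass it via RUSTFLAGS, e.g. `RUSTFLAGS="-C target-cpu=native" cargo build --release` — the resulting binary then requires the build machine's CPU features.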
2. Benchmarking (REQUIRED)
You can't optimize what you don't measure.
| Rule | Purpose |
|---|---|
| | Use criterion for benchmarks |
| | Bench profile enables optimizations |
| | Prevent dead code elimination |
| | I/O variance destroys measurements |
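The dead-code point is easy to trip over: if a benchmark's result is unused, LLVM may delete the very work being measured. A minimal sketch using only std (`hot_function` and the iteration count are illustrative; a real harness would use criterion, which applies the same black_box trick internally):

```rust
use std::hint::black_box;
use std::time::Instant;

// Illustrative workload — stands in for whatever you actually benchmark.
fn hot_function(input: &[u64]) -> u64 {
    input.iter().sum()
}

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();
    let start = Instant::now();
    let mut acc = 0u64;
    for _ in 0..100 {
        // black_box on the input stops constant folding/hoisting; black_box
        // on the output stops the call from being removed as dead code.
        acc = acc.wrapping_add(black_box(hot_function(black_box(&data))));
    }
    println!("100 iterations in {:?} (acc = {})", start.elapsed(), acc);
}
```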
3. Allocation
Every allocation is a syscall. Reduce them.
| Rule | Impact | Pattern |
|---|---|---|
| alloc-*-with-capacity | 2-10x | Pre-size Vec |
| alloc-*-with-capacity | 2-5x | Pre-size String |
| alloc-*-with-capacity | 2-5x | Pre-size HashMap |
| alloc-reuse-buffers | 2-10x | clear() and refill instead of reallocating |
| alloc-use-slices-in-apis | Flexibility | Accept slice parameters |
| alloc-avoid-clone | 2-10x | Borrow instead of cloning |
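The with_capacity and buffer-reuse rules in miniature (sizes are arbitrary): with_capacity allocates once up front, and clear() keeps the allocation so the next fill does not reallocate.

```rust
// Sketch: pre-sizing and buffer reuse (illustrative function names).
fn fill_presized(n: usize) -> Vec<u64> {
    let mut v = Vec::with_capacity(n); // one allocation up front
    for i in 0..n as u64 {
        v.push(i); // never reallocates while len <= capacity
    }
    v
}

// clear() resets len to 0 but keeps the allocation, so refilling is free.
fn refill(v: &mut Vec<u64>, n: usize) {
    v.clear();
    v.extend(0..n as u64);
}

fn main() {
    let n = 10_000;
    let mut v = fill_presized(n);
    assert!(v.capacity() >= n);

    let cap_before = v.capacity();
    refill(&mut v, n); // no new allocation: capacity already suffices
    assert_eq!(v.capacity(), cap_before);
    assert_eq!(v.len(), n);
    println!("ok");
}
```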
4. Data Structures
The right data structure beats micro-optimization.
| Rule | When |
|---|---|
| | Almost always (Vec wins) |
| | FIFO queues |
| | HashMap=O(1), BTreeMap=sorted |
| | Insert-or-update patterns |
| | FFI newtypes |
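For the insert-or-update row, the entry API is the idiomatic pattern: it does one hash lookup where a contains_key-then-insert pair does two. A small sketch (`word_counts` is an illustrative name):

```rust
use std::collections::HashMap;

// entry() returns a handle to the slot, occupied or vacant, so the
// hash is computed once per key instead of twice.
fn word_counts(words: &[&str]) -> HashMap<String, u32> {
    let mut counts = HashMap::new();
    for w in words {
        *counts.entry(w.to_string()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = word_counts(&["a", "b", "a"]);
    assert_eq!(counts["a"], 2);
    assert_eq!(counts["b"], 1);
    println!("{} distinct words", counts.len());
}
```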
5. Iteration
Iterators are as fast as loops and safer.
| Rule | Impact | Pattern |
|---|---|---|
| | 2-3x | Chain iterators, don't collect |
| | 2-3x | |
| | Short-circuit | |
| | In-place | |
| | O(log n) | Use on sorted data |
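The "chain, don't collect" row in practice: each intermediate collect() allocates a Vec that is immediately thrown away, while the chained version is a single pass with no allocation. A sketch with illustrative functions:

```rust
// Single pass, no intermediate allocation.
fn sum_of_even_squares_chained(data: &[u64]) -> u64 {
    data.iter()
        .filter(|&&x| x % 2 == 0)
        .map(|&x| x * x)
        .sum()
}

// Equivalent result, but each collect() builds a throwaway Vec.
fn sum_of_even_squares_collected(data: &[u64]) -> u64 {
    let evens: Vec<u64> = data.iter().copied().filter(|x| x % 2 == 0).collect();
    let squares: Vec<u64> = evens.iter().map(|x| x * x).collect();
    squares.iter().sum()
}

fn main() {
    let data: Vec<u64> = (1..=10).collect();
    // evens 2,4,6,8,10 → squares 4+16+36+64+100 = 220
    assert_eq!(sum_of_even_squares_chained(&data), 220);
    assert_eq!(
        sum_of_even_squares_chained(&data),
        sum_of_even_squares_collected(&data)
    );
    println!("ok");
}
```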
6. Synchronization
Locks are expensive. Minimize contention.
| Rule | Impact | When |
|---|---|---|
| | Avoids copying | Share large (>64B) data across threads |
| sync-use-rwlock | 2-8x for reads | >80% reads, few writes; consider parking_lot |
| sync-keep-lock-scope-short | 4x | Minimize code under lock |
| sync-use-channels | 3-4x | Message passing vs shared state |
| sync-use-atomics | 20x | Simple counters, flags |
| | 1.5-5x | Prefer parking_lot over std locks |
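For the counters row, a sketch of the atomic version (thread and iteration counts are arbitrary). A Mutex&lt;u64&gt; would serialize every increment through a lock acquisition; fetch_add stays lock-free:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Shared counter incremented from several threads without a lock.
fn count_atomic(threads: usize, iters: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    // Relaxed is enough for a plain counter: we only need
                    // the final total, not ordering against other memory.
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(count_atomic(4, 10_000), 40_000);
    println!("atomic counter ok");
}
```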
7. I/O
Every syscall costs. Buffer them.
| Rule | Impact | Pattern |
|---|---|---|
| io-use-bufreader | 50x | Wrap File in BufReader |
| io-use-bufwriter | 18x | Wrap File in BufWriter |
| io-flush-bufwriter | CRITICAL | Must flush or lose data! |
| io-read-line-with-bufread | 53x | Reuse String buffer with read_line |
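The BufWriter/flush and read_line rules combine into a pattern like this sketch (path and line format are illustrative):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};
use std::path::Path;

// Buffered writes with an explicit flush, then buffered reads that
// reuse a single String buffer across all lines.
fn write_then_count(path: &Path, lines: usize) -> std::io::Result<usize> {
    // BufWriter batches small writes into few large syscalls.
    let mut w = BufWriter::new(File::create(path)?);
    for i in 0..lines {
        writeln!(w, "line {}", i)?;
    }
    w.flush()?; // without this, buffered bytes can be lost on early exit

    // BufReader + read_line: one String allocation reused for every line.
    let mut r = BufReader::new(File::open(path)?);
    let mut buf = String::new();
    let mut count = 0;
    while r.read_line(&mut buf)? > 0 {
        count += 1;
        buf.clear(); // keep the allocation, drop the contents
    }
    Ok(count)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("io_demo.txt");
    let n = write_then_count(&path, 3)?;
    assert_eq!(n, 3);
    println!("read {} lines", n);
    Ok(())
}
```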
8. Async/Await (HIGH)
Critical for Tokio and async-std applications.
| Rule | Impact | Pattern |
|---|---|---|
| | Prevents hang | Use spawn_blocking for CPU-bound work |
| | Latency | Yield periodically in long computations |
| | Correctness | |
| | Throughput | Use async I/O, not std::fs in async contexts |
| | Backpressure | Prefer bounded channels for flow control |
Key insight: The async runtime is cooperative. Blocking the executor thread starves all other tasks.
```rust
// BAD: Blocks the async runtime
async fn process(data: &[u8]) -> Result<Hash> {
    let hash = expensive_hash(data); // CPU-bound, blocks executor!
    Ok(hash)
}

// GOOD: Offload to blocking thread pool
async fn process(data: Vec<u8>) -> Result<Hash> {
    let hash = tokio::task::spawn_blocking(move || expensive_hash(&data)).await?;
    Ok(hash)
}
```
9. Unsafe (Expert Only)
Only after profiling proves these matter.
| Rule | Impact | Risk |
|---|---|---|
| | 5-30% | UB if bounds wrong (get_unchecked) |
| | 20-100x alloc | UB if read before write |
| | Correctness | Prefer safe alternatives |
| | Zero-cost | Required for FFI newtypes |
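If profiling really does point at bounds checks, the invariant must be established and documented before any get_unchecked call. A sketch (`sum_pairs` is illustrative; the safe iterator version usually compiles to identical code and should be the first attempt):

```rust
// Only worth it when profiling shows bounds checks in the hot loop
// AND the loop cannot be expressed with iterators.
fn sum_pairs(a: &[u64], b: &[u64]) -> u64 {
    assert_eq!(a.len(), b.len()); // establish the invariant once, up front
    let mut total = 0;
    for i in 0..a.len() {
        // SAFETY: i < a.len() by the loop bound, and b.len() == a.len()
        // was asserted above, so both accesses are in bounds.
        total += unsafe { a.get_unchecked(i) + b.get_unchecked(i) };
    }
    total
}

// The safe version — try this first and compare the assembly (cargo asm).
fn sum_pairs_safe(a: &[u64], b: &[u64]) -> u64 {
    a.iter().zip(b).map(|(x, y)| x + y).sum()
}

fn main() {
    let a = [1, 2, 3];
    let b = [10, 20, 30];
    assert_eq!(sum_pairs(&a, &b), 66);
    assert_eq!(sum_pairs_safe(&a, &b), 66);
    println!("ok");
}
```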
Decision Trees
When to use with_capacity?
```
Do you know the size?
├── YES, exact → with_capacity(exact)
├── YES, approximate → with_capacity(estimate)
└── NO
    │
    Will it grow frequently?
    ├── YES → Start bigger or use reserve()
    └── NO → Vec::new() is fine
```

Mutex vs RwLock vs Atomics?
```
Is it a simple counter/flag?
├── YES → Atomics (20x faster)
└── NO
    │
    What's the read/write ratio?
    ├── Mostly reads (>90%) → RwLock
    ├── Mostly writes → Mutex
    └── Mixed → Mutex (simpler)

Consider: parking_lot > std for all of these
```

When is unsafe get_unchecked worth it?
```
Did you profile and find bounds checks are the bottleneck?
├── NO → Don't use it
└── YES
    │
    Did you check if LLVM already removed the bounds check?
    ├── NO → Check assembly first (cargo asm)
    └── YES, still there
        │
        Can you use iterators instead?
        ├── YES → Use iterators (same speed, safe)
        └── NO → get_unchecked with documented invariants
```

Reading Rules
Each rule file in rules/ contains:
- Quantified impact with real benchmark numbers
- Visual explanations of how the optimization works
- Incorrect examples showing common mistakes
- Correct examples with best practices
- When NOT to apply - trade-offs and edge cases
- Common mistakes to avoid
- Profiling commands to identify the issue
- References to official docs
Full Compiled Document
For all rules in a single file:
AGENTS.md