rust-performance
Optimization Priority
1. Algorithm choice (10x - 1000x) ← Biggest impact
2. Data structure (2x - 10x)
3. Reduce allocations (2x - 5x)
4. Cache optimization (1.5x - 3x)
5. SIMD/parallelism (2x - 8x)

Warning: Premature optimization is the root of all evil. Make it work first, then optimize hot paths.
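To make the top of this list concrete, here is a small illustrative sketch (the functions are invented for this example, not part of the skill): replacing a quadratic membership scan with a hash set is a pure algorithm-level change worth far more than any micro-tuning below it.

```rust
use std::collections::HashSet;

// ❌ O(n²): scans the whole output for every element
fn dedup_quadratic(data: &[i32]) -> Vec<i32> {
    let mut out = Vec::new();
    for &x in data {
        if !out.contains(&x) {
            out.push(x);
        }
    }
    out
}

// ✅ O(n): one hash lookup per element
fn dedup_linear(data: &[i32]) -> Vec<i32> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &x in data {
        if seen.insert(x) {
            out.push(x);
        }
    }
    out
}

fn main() {
    // Both preserve first-occurrence order; only the complexity differs.
    assert_eq!(dedup_quadratic(&[3, 1, 3, 2, 1]), vec![3, 1, 2]);
    assert_eq!(dedup_linear(&[3, 1, 3, 2, 1]), vec![3, 1, 2]);
}
```

On small inputs the two are indistinguishable; the gap only shows up at scale, which is why profiling comes before optimizing.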
Solution Patterns
Pattern 1: Pre-allocation
```rust
// ❌ Bad: grows dynamically
let mut vec = Vec::new();
for i in 0..1000 {
    vec.push(i);
}

// ✅ Good: pre-allocate known size
let mut vec = Vec::with_capacity(1000);
for i in 0..1000 {
    vec.push(i);
}
```

Pattern 2: Avoid Unnecessary Clones
```rust
// ❌ Bad: unnecessary clone
fn process(item: &Item) {
    let data = item.data.clone();
    // use data...
}

// ✅ Good: use reference
fn process(item: &Item) {
    let data = &item.data;
    // use data...
}
```

Pattern 3: Batch Operations
```rust
// ❌ Bad: multiple database calls
for user_id in user_ids {
    db.update(user_id, status)?;
}

// ✅ Good: batch update
db.update_all(user_ids, status)?;
```

Pattern 4: Small Object Optimization
```rust
use smallvec::SmallVec;

// ✅ No heap allocation for ≤16 items
let mut vec: SmallVec<[u8; 16]> = SmallVec::new();
```

Pattern 5: Parallel Processing
```rust
use rayon::prelude::*;

let sum: i32 = data
    .par_iter()
    .map(|x| expensive_computation(x))
    .sum();
```

Profiling Tools
| Tool | Purpose |
|---|---|
| criterion | Benchmarks |
| cargo flamegraph | CPU flame graphs |
| heaptrack | Allocation tracking |
| cachegrind | Cache analysis |
| dhat | Heap allocation profiling |
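criterion (via cargo bench) is the right way to get trustworthy numbers. As a rough illustration of what such a benchmark measures, here is a crude stdlib timing of the pre-allocation pattern; it lacks warm-up, repetition, and statistical analysis, so treat it as a sketch only.

```rust
use std::time::Instant;

// Fill a Vec with 0..n, optionally pre-allocating its capacity.
fn fill(n: usize, prealloc: bool) -> Vec<usize> {
    let mut v = if prealloc { Vec::with_capacity(n) } else { Vec::new() };
    for i in 0..n {
        v.push(i);
    }
    v
}

fn main() {
    const N: usize = 1_000_000;

    let t = Instant::now();
    let a = fill(N, false); // grows by repeated reallocation
    let grow = t.elapsed();

    let t = Instant::now();
    let b = fill(N, true); // single up-front allocation
    let prealloc = t.elapsed();

    assert_eq!(a, b); // same result either way; only the cost differs
    println!("grow: {grow:?}, prealloc: {prealloc:?}");
}
```

One-shot wall-clock timings like this are noisy; for real decisions, prefer criterion's benchmarked comparisons.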
Common Optimizations
Anti-Patterns to Fix
| Anti-Pattern | Why Bad | Correct Approach |
|---|---|---|
| Clone to avoid lifetimes | Performance cost | Proper ownership design |
| Box everything | Indirection overhead | Prefer stack allocation |
| HashMap for small data | Hash overhead too high | Vec + linear search |
| String concatenation in loop | O(n²) | String::with_capacity + push_str |
| LinkedList | Cache-unfriendly | Vec or VecDeque |
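For the loop-concatenation row, the standard fix is to reserve the final capacity once and append in place; a sketch (function names are illustrative):

```rust
// ❌ O(n²): each `+` reallocates and copies the accumulated string
fn join_slow(pieces: &[&str]) -> String {
    let mut s = String::new();
    for &piece in pieces {
        s = s + piece; // copies all of `s` every iteration
    }
    s
}

// ✅ Amortized O(n): reserve once, then push_str in place
fn join_fast(pieces: &[&str]) -> String {
    let total: usize = pieces.iter().map(|p| p.len()).sum();
    let mut s = String::with_capacity(total);
    for &piece in pieces {
        s.push_str(piece); // no reallocation: capacity reserved up front
    }
    s
}

fn main() {
    assert_eq!(join_fast(&["foo", "bar"]), "foobar");
    assert_eq!(join_slow(&["foo", "bar"]), join_fast(&["foo", "bar"]));
}
```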
Advanced: False Sharing
Symptom
```rust
// ❌ Problem: multiple AtomicU64 in one struct
struct ShardCounters {
    inflight: AtomicU64,
    completed: AtomicU64,
}
```

- One CPU core at 90%+
- High LLC miss rate in perf
- Many atomic RMW operations
- Adding threads makes it slower
Diagnosis
```bash
# Perf analysis
perf stat -d your_program
# Look for LLC-load-misses and locked-instrs

# Flamegraph
cargo flamegraph
# Find atomic fetch_add hotspots
```

Solution: Cache Line Padding
```rust
// ✅ Each field in separate cache line
#[repr(align(64))]
struct PaddedAtomicU64(AtomicU64);

struct ShardCounters {
    inflight: PaddedAtomicU64,
    completed: PaddedAtomicU64,
}
```

Lock Contention Optimization
Symptom
```rust
// ❌ All threads compete for single lock
let shared: Arc<Mutex<HashMap<String, usize>>> =
    Arc::new(Mutex::new(HashMap::new()));
```

- Most time spent in mutex lock/unlock
- Performance degrades with more threads
- High system time percentage
Solution: Thread-Local Sharding
```rust
use std::collections::HashMap;
use std::thread;

// ✅ Each thread builds a local HashMap, merged at the end
pub fn parallel_count(data: &[String], num_threads: usize) -> HashMap<String, usize> {
    // Guard against a zero chunk size for small inputs or num_threads == 0.
    let chunk_size = (data.len() / num_threads.max(1)).max(1);
    thread::scope(|s| {
        // Scoped threads may borrow `data`; plain thread::spawn would
        // require 'static ownership of each chunk.
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    let mut local = HashMap::new();
                    for key in chunk {
                        *local.entry(key.clone()).or_insert(0) += 1;
                    }
                    local // Return local counts
                })
            })
            .collect();

        // Merge all local results
        let mut result = HashMap::new();
        for handle in handles {
            for (k, v) in handle.join().unwrap() {
                *result.entry(k).or_insert(0) += v;
            }
        }
        result
    })
}
```

NUMA Awareness
Problem
```rust
// Multi-socket server, memory allocated on remote NUMA node
let pool = ArenaPool::new(num_threads);
// Rayon work-stealing causes tasks to run on any thread
// Cross-NUMA access causes severe memory migration latency
```

Solution
```rust
// 1. NUMA node binding
let numa_node = detect_numa_node();
let pool = NumaAwarePool::new(numa_node);

// 2. Use unified allocator (jemalloc)
#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

// 3. Avoid cross-NUMA object clones
// Borrow directly, don't copy data
```

Tools
```bash
# Check NUMA topology
numactl --hardware

# Bind to NUMA node
numactl --cpunodebind=0 --membind=0 ./my_program
```

Data Structure Selection
| Scenario | Choice | Reason |
|---|---|---|
| High-concurrency writes | DashMap or sharding | Reduces lock contention |
| Read-heavy, few writes | RwLock<HashMap> | Read locks don't block |
| Small dataset | Vec + linear search | HashMap overhead higher |
| Fixed keys | Enum + array | Zero hash overhead |
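A sketch of the fixed-keys row (Metric and Counters are illustrative names): the enum discriminant indexes directly into an array, so every lookup is a plain array access with no hashing at all.

```rust
// ✅ Fixed key set: index an array by enum discriminant, zero hashing
#[derive(Clone, Copy)]
enum Metric {
    Requests,
    Errors,
    Timeouts,
}

const METRIC_COUNT: usize = 3;

struct Counters([u64; METRIC_COUNT]);

impl Counters {
    fn new() -> Self {
        Counters([0; METRIC_COUNT])
    }

    fn inc(&mut self, m: Metric) {
        self.0[m as usize] += 1; // direct index, no hash
    }

    fn get(&self, m: Metric) -> u64 {
        self.0[m as usize]
    }
}

fn main() {
    let mut c = Counters::new();
    c.inc(Metric::Requests);
    c.inc(Metric::Requests);
    c.inc(Metric::Errors);
    assert_eq!(c.get(Metric::Requests), 2);
    assert_eq!(c.get(Metric::Errors), 1);
    assert_eq!(c.get(Metric::Timeouts), 0);
}
```

Adding a key means adding an enum variant, so the compiler catches any site that forgets to handle it.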
Read-Heavy Example
```rust
// ✅ Many reads, few updates
struct Config {
    map: RwLock<HashMap<String, ConfigValue>>,
}

impl Config {
    pub fn get(&self, key: &str) -> Option<ConfigValue> {
        self.map.read().unwrap().get(key).cloned()
    }

    pub fn update(&self, key: String, value: ConfigValue) {
        self.map.write().unwrap().insert(key, value);
    }
}
```

Common Performance Traps
| Trap | Symptom | Solution |
|---|---|---|
| Adjacent atomic variables | False sharing | Cache-line padding (#[repr(align(64))]) |
| Global Mutex | Lock contention | Thread-local + merge |
| Cross-NUMA allocation | Memory migration | NUMA-aware allocation |
| Frequent small allocations | Allocator pressure | Object pooling |
| Dynamic string keys | Extra allocations | Use integer IDs |
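For the frequent-small-allocations row, an object pool can be as simple as a free list of recycled buffers; a minimal sketch (BufferPool is a hypothetical type, not from this skill):

```rust
// ✅ Reuse buffers instead of allocating one per request
struct BufferPool {
    free: Vec<Vec<u8>>,
}

impl BufferPool {
    fn new() -> Self {
        BufferPool { free: Vec::new() }
    }

    // Hand out a recycled buffer if available, else allocate fresh.
    fn acquire(&mut self) -> Vec<u8> {
        self.free
            .pop()
            .unwrap_or_else(|| Vec::with_capacity(4096))
    }

    // Drop the contents but keep the capacity for the next user.
    fn release(&mut self, mut buf: Vec<u8>) {
        buf.clear();
        self.free.push(buf);
    }
}

fn main() {
    let mut pool = BufferPool::new();
    let mut buf = pool.acquire();
    buf.extend_from_slice(b"hello");
    pool.release(buf);

    // The recycled buffer comes back empty but with its capacity intact.
    let buf = pool.acquire();
    assert!(buf.is_empty());
    assert!(buf.capacity() >= 4096);
}
```

A production pool would bound its size and typically sit behind a lock or in thread-local storage.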
Review Checklist
When optimizing performance:
- Profiled to identify bottleneck
- Bottleneck confirmed with measurements
- Algorithm is optimal for use case
- Data structure appropriate
- Unnecessary allocations removed
- Parallelism exploited where beneficial
- Cache-friendly data layout
- Lock contention minimized
- Benchmarks show improvement
- Code still readable and maintainable
Verification Commands
```bash
# Benchmark
cargo bench

# Profile with perf
perf stat -d ./target/release/your_program

# Generate flamegraph
cargo flamegraph --release

# Heap profiling
valgrind --tool=dhat ./target/release/your_program

# Cache analysis
valgrind --tool=cachegrind ./target/release/your_program

# NUMA topology
numactl --hardware
```

Common Pitfalls
1. Premature Optimization
Symptom: Optimizing before profiling
Fix: Profile first, optimize hot paths only
2. Micro-optimizing Cold Paths
Symptom: Spending time on code that rarely runs
Fix: Focus on hot loops (90% of time in 10% of code)
3. Trading Readability for Minimal Gains
Symptom: Complex code for <5% improvement
Fix: Only optimize if gain is significant (>20%)
Performance Diagnostic Workflow
1. Identify symptom (slow, high CPU, high memory)
   ↓
2. Profile with appropriate tool
   - CPU → perf/flamegraph
   - Memory → heaptrack/dhat
   - Cache → cachegrind
   ↓
3. Find hotspot (function/line)
   ↓
4. Understand why it's slow
   - Algorithm? Data structure? Allocation?
   ↓
5. Apply targeted optimization
   ↓
6. Benchmark to confirm improvement
   ↓
7. Repeat if not fast enough

Related Skills
- rust-concurrency - Parallel processing patterns
- rust-async - Async performance optimization
- rust-unsafe - Zero-cost abstractions with unsafe
- rust-coding - Writing performant idiomatic code
- rust-anti-pattern - Performance anti-patterns to avoid
Localized Reference

- Chinese version: SKILL_ZH.md - complete Chinese version with all content