rust-performance


Optimization Priority

1. Algorithm choice      (10x - 1000x)   ← Biggest impact
2. Data structure        (2x - 10x)
3. Reduce allocations    (2x - 5x)
4. Cache optimization    (1.5x - 3x)
5. SIMD/parallelism      (2x - 8x)
Warning: Premature optimization is the root of all evil. Make it work first, then optimize hot paths.
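To make the first row concrete, here is a std-only sketch (illustrative function names, not from this skill) of how swapping a linear membership scan for a `HashSet` turns O(q·n) queries into O(n + q) — the kind of win no micro-optimization can match:

```rust
use std::collections::HashSet;

// O(n) scan per query: fine for tiny haystacks, quadratic-ish overall
fn count_hits_linear(haystack: &[u64], queries: &[u64]) -> usize {
    queries.iter().filter(|&&q| haystack.contains(&q)).count()
}

// O(n) set build once, then O(1) expected per query
fn count_hits_hashed(haystack: &[u64], queries: &[u64]) -> usize {
    let set: HashSet<u64> = haystack.iter().copied().collect();
    queries.iter().filter(|&&q| set.contains(&q)).count()
}
```

Both return the same answer; only the asymptotics differ, which is why algorithm choice sits above every other item on the list.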

Solution Patterns

Pattern 1: Pre-allocation

rust
// ❌ Bad: grows dynamically
let mut vec = Vec::new();
for i in 0..1000 {
    vec.push(i);
}

// ✅ Good: pre-allocate known size
let mut vec = Vec::with_capacity(1000);
for i in 0..1000 {
    vec.push(i);
}

Pattern 2: Avoid Unnecessary Clones

rust
// ❌ Bad: unnecessary clone
fn process(item: &Item) {
    let data = item.data.clone();
    // use data...
}

// ✅ Good: use reference
fn process(item: &Item) {
    let data = &item.data;
    // use data...
}

Pattern 3: Batch Operations

rust
// ❌ Bad: multiple database calls
for user_id in user_ids {
    db.update(user_id, status)?;
}

// ✅ Good: batch update
db.update_all(user_ids, status)?;
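`db.update_all` above is schematic; the same principle applies to any shared resource. As an illustrative std-only sketch, batching turns N lock acquisitions into one:

```rust
use std::sync::Mutex;

// One lock acquisition covers the whole batch,
// instead of one lock/unlock cycle per item
fn batch_append(store: &Mutex<Vec<u32>>, items: &[u32]) {
    let mut guard = store.lock().unwrap(); // lock once
    guard.extend_from_slice(items);        // N writes under a single guard
}
```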

Pattern 4: Small Object Optimization

rust
use smallvec::SmallVec;

// ✅ No heap allocation for ≤16 items
let mut vec: SmallVec<[u8; 16]> = SmallVec::new();

Pattern 5: Parallel Processing

rust
use rayon::prelude::*;

let sum: i32 = data
    .par_iter()
    .map(|x| expensive_computation(x))
    .sum();

Profiling Tools

Tool                          Purpose
cargo bench                   Criterion benchmarks
perf / flamegraph             CPU flame graphs
heaptrack                     Allocation tracking
valgrind --tool=cachegrind    Cache analysis
dhat                          Heap allocation profiling

Common Optimizations

Anti-Patterns to Fix

Anti-Pattern                  Why Bad                 Correct Approach
Clone to avoid lifetimes      Performance cost        Proper ownership design
Box everything                Indirection overhead    Prefer stack allocation
HashMap for small data        Hash overhead too high  Vec + linear search
String concatenation in loop  O(n²)                   with_capacity or format!
LinkedList                    Cache-unfriendly        Vec or VecDeque
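For the string-concatenation row, a sketch of the pre-sized variant (illustrative `join_with_commas` helper). Repeated `s = s + &part` re-copies the whole string each iteration, which is where the O(n²) comes from:

```rust
// Pre-size the buffer when the final length is computable,
// so the loop never reallocates
fn join_with_commas(words: &[&str]) -> String {
    let total: usize = words.iter().map(|w| w.len() + 1).sum();
    let mut out = String::with_capacity(total);
    for w in words {
        out.push_str(w); // amortized O(len), no reallocation
        out.push(',');
    }
    out
}
```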

Advanced: False Sharing

Symptom

rust
// ❌ Problem: multiple AtomicU64 in one struct
struct ShardCounters {
    inflight: AtomicU64,
    completed: AtomicU64,
}
  • One CPU core at 90%+
  • High LLC miss rate in perf
  • Many atomic RMW operations
  • Adding threads makes it slower

Diagnosis

bash
# Perf analysis
perf stat -d your_program
# Look for LLC-load-misses and locked-instrs

# Flamegraph
cargo flamegraph
# Find atomic fetch_add hotspots

Solution: Cache Line Padding

rust
// ✅ Each field in separate cache line
#[repr(align(64))]
struct PaddedAtomicU64(AtomicU64);

struct ShardCounters {
    inflight: PaddedAtomicU64,
    completed: PaddedAtomicU64,
}

Lock Contention Optimization

Symptom

rust
// ❌ All threads compete for single lock
let shared: Arc<Mutex<HashMap<String, usize>>> =
    Arc::new(Mutex::new(HashMap::new()));
  • Most time spent in mutex lock/unlock
  • Performance degrades with more threads
  • High system time percentage

Solution: Thread-Local Sharding

rust
use std::collections::HashMap;
use std::thread;

// ✅ Each thread builds a local HashMap; merge once at the end
pub fn parallel_count(data: &[String], num_threads: usize)
    -> HashMap<String, usize>
{
    // Round up so every element lands in a chunk (and never pass chunk size 0)
    let chunk_size = data.len().div_ceil(num_threads.max(1)).max(1);

    // Scoped threads may borrow `data`; plain thread::spawn would require 'static
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    let mut local = HashMap::new();
                    for key in chunk {
                        *local.entry(key.clone()).or_insert(0) += 1;
                    }
                    local  // Return local counts
                })
            })
            .collect();

        // Merge all local results
        let mut result = HashMap::new();
        for handle in handles {
            for (k, v) in handle.join().unwrap() {
                *result.entry(k).or_insert(0) += v;
            }
        }
        result
    })
}

NUMA Awareness

Problem

rust
// Multi-socket server, memory allocated on remote NUMA node
let pool = ArenaPool::new(num_threads);
// Rayon work-stealing causes tasks to run on any thread
// Cross-NUMA access causes severe memory migration latency

Solution

rust
// 1. NUMA node binding
let numa_node = detect_numa_node();
let pool = NumaAwarePool::new(numa_node);

// 2. Use unified allocator (jemalloc)
#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

// 3. Avoid cross-NUMA object clones
// Borrow directly, don't copy data

Tools

bash
# Check NUMA topology
numactl --hardware

# Bind to NUMA node
numactl --cpunodebind=0 --membind=0 ./my_program

Data Structure Selection

Scenario                   Choice                 Reason
High-concurrency writes    DashMap or sharding    Reduces lock contention
Read-heavy, few writes     RwLock<HashMap>        Read locks don't block
Small dataset              Vec + linear search    HashMap overhead higher
Fixed keys                 Enum + array           Zero hash overhead
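For the fixed-keys row, a sketch of the enum-plus-array pattern (`Stat` and `Counters` are illustrative names): the enum discriminant indexes the array directly, so lookup is a single bounds-checked load with no hashing at all.

```rust
// Fixed key set: index an array by enum discriminant, zero hash overhead
#[derive(Clone, Copy)]
enum Stat {
    Hits,
    Misses,
    Evictions,
}

struct Counters([u64; 3]);

impl Counters {
    fn new() -> Self {
        Counters([0; 3])
    }
    fn incr(&mut self, s: Stat) {
        self.0[s as usize] += 1; // discriminant is the index
    }
    fn get(&self, s: Stat) -> u64 {
        self.0[s as usize]
    }
}
```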

Read-Heavy Example

rust
// ✅ Many reads, few updates
struct Config {
    map: RwLock<HashMap<String, ConfigValue>>,
}

impl Config {
    pub fn get(&self, key: &str) -> Option<ConfigValue> {
        self.map.read().unwrap().get(key).cloned()
    }

    pub fn update(&self, key: String, value: ConfigValue) {
        self.map.write().unwrap().insert(key, value);
    }
}

Common Performance Traps

Trap                          Symptom               Solution
Adjacent atomic variables     False sharing         #[repr(align(64))]
Global Mutex                  Lock contention       Thread-local + merge
Cross-NUMA allocation         Memory migration      NUMA-aware allocation
Frequent small allocations    Allocator pressure    Object pooling
Dynamic string keys           Extra allocations     Use integer IDs
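For the dynamic-string-keys row, one common fix is a small interner that hands out dense integer IDs. This `Interner` is an illustrative std-only sketch, not a library API: hash and allocate each distinct string once, then carry the cheap `u32` through hot paths.

```rust
use std::collections::HashMap;

// Map each distinct key to a small integer ID exactly once;
// hot paths then compare/hash u32s instead of Strings
struct Interner {
    ids: HashMap<String, u32>,
    names: Vec<String>, // ID -> original string, for reverse lookup
}

impl Interner {
    fn new() -> Self {
        Interner { ids: HashMap::new(), names: Vec::new() }
    }

    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.ids.get(s) {
            return id; // already known: no allocation
        }
        let id = self.names.len() as u32;
        self.names.push(s.to_string());
        self.ids.insert(s.to_string(), id);
        id
    }
}
```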

Review Checklist

When optimizing performance:
  • Profiled to identify bottleneck
  • Bottleneck confirmed with measurements
  • Algorithm is optimal for use case
  • Data structure appropriate
  • Unnecessary allocations removed
  • Parallelism exploited where beneficial
  • Cache-friendly data layout
  • Lock contention minimized
  • Benchmarks show improvement
  • Code still readable and maintainable

Verification Commands

bash
# Benchmark
cargo bench

# Profile with perf
perf stat -d ./target/release/your_program

# Generate flamegraph
cargo flamegraph --release

# Heap profiling
valgrind --tool=dhat ./target/release/your_program

# Cache analysis
valgrind --tool=cachegrind ./target/release/your_program

# NUMA topology
numactl --hardware

Common Pitfalls

1. Premature Optimization

Symptom: Optimizing before profiling
Fix: Profile first, optimize hot paths only

2. Micro-optimizing Cold Paths

Symptom: Spending time on code that rarely runs
Fix: Focus on hot loops (90% of time in 10% of code)

3. Trading Readability for Minimal Gains

Symptom: Complex code for <5% improvement
Fix: Only optimize if gain is significant (>20%)

Performance Diagnostic Workflow

1. Identify symptom (slow, high CPU, high memory)
2. Profile with appropriate tool
   - CPU → perf/flamegraph
   - Memory → heaptrack/dhat
   - Cache → cachegrind
3. Find hotspot (function/line)
4. Understand why it's slow
   - Algorithm? Data structure? Allocation?
5. Apply targeted optimization
6. Benchmark to confirm improvement
7. Repeat if not fast enough

Related Skills

  • rust-concurrency - Parallel processing patterns
  • rust-async - Async performance optimization
  • rust-unsafe - Zero-cost abstractions with unsafe
  • rust-coding - Writing performant idiomatic code
  • rust-anti-pattern - Performance anti-patterns to avoid

Localized Reference

  • Chinese version: SKILL_ZH.md - complete Chinese version with all content