rust-performance


Optimization Priority

1. Algorithm choice      (10x - 1000x)   ← Biggest impact
2. Data structure        (2x - 10x)
3. Reduce allocations    (2x - 5x)
4. Cache optimization    (1.5x - 3x)
5. SIMD/parallelism      (2x - 8x)
Warning: Premature optimization is the root of all evil. Make it work first, then optimize hot paths.
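To make the first row concrete, here is a std-only sketch (illustrative function names, not from this skill) of how swapping a linear membership scan for a `HashSet` turns O(q·n) queries into O(n + q) — the kind of win no micro-optimization can match:

```rust
use std::collections::HashSet;

// O(n) scan per query: fine for tiny haystacks, quadratic-ish overall
fn count_hits_linear(haystack: &[u64], queries: &[u64]) -> usize {
    queries.iter().filter(|&&q| haystack.contains(&q)).count()
}

// O(n) set build once, then O(1) expected per query
fn count_hits_hashed(haystack: &[u64], queries: &[u64]) -> usize {
    let set: HashSet<u64> = haystack.iter().copied().collect();
    queries.iter().filter(|&&q| set.contains(&q)).count()
}
```

Both return the same answer; only the asymptotics differ, which is why algorithm choice sits above every other item on the list.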

Solution Patterns

Pattern 1: Pre-allocation

rust
// ❌ Bad: grows dynamically
let mut vec = Vec::new();
for i in 0..1000 {
    vec.push(i);
}

// ✅ Good: pre-allocate known size
let mut vec = Vec::with_capacity(1000);
for i in 0..1000 {
    vec.push(i);
}

Pattern 2: Avoid Unnecessary Clones

rust
// ❌ Bad: unnecessary clone
fn process(item: &Item) {
    let data = item.data.clone();
    // use data...
}

// ✅ Good: use reference
fn process(item: &Item) {
    let data = &item.data;
    // use data...
}

Pattern 3: Batch Operations

rust
// ❌ Bad: multiple database calls
for user_id in user_ids {
    db.update(user_id, status)?;
}

// ✅ Good: batch update
db.update_all(user_ids, status)?;
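`db.update_all` above is schematic; the same principle applies to any shared resource. As an illustrative std-only sketch, batching turns N lock acquisitions into one:

```rust
use std::sync::Mutex;

// One lock acquisition covers the whole batch,
// instead of one lock/unlock cycle per item
fn batch_append(store: &Mutex<Vec<u32>>, items: &[u32]) {
    let mut guard = store.lock().unwrap(); // lock once
    guard.extend_from_slice(items);        // N writes under a single guard
}
```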

Pattern 4: Small Object Optimization

rust
use smallvec::SmallVec;

// ✅ No heap allocation for ≤16 items
let mut vec: SmallVec<[u8; 16]> = SmallVec::new();

Pattern 5: Parallel Processing

rust
use rayon::prelude::*;

let sum: i32 = data
    .par_iter()
    .map(|x| expensive_computation(x))
    .sum();

Profiling Tools

Tool                          Purpose
cargo bench                   Criterion benchmarks
perf / flamegraph             CPU flame graphs
heaptrack                     Allocation tracking
valgrind --tool=cachegrind    Cache analysis
dhat                          Heap allocation profiling

Common Optimizations

Anti-Patterns to Fix

Anti-Pattern                  Why Bad                 Correct Approach
Clone to avoid lifetimes      Performance cost        Proper ownership design
Box everything                Indirection overhead    Prefer stack allocation
HashMap for small data        Hash overhead too high  Vec + linear search
String concatenation in loop  O(n²)                   with_capacity or format!
LinkedList                    Cache-unfriendly        Vec or VecDeque
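For the string-concatenation row, a sketch of the pre-sized variant (illustrative `join_with_commas` helper). Repeated `s = s + &part` re-copies the whole string each iteration, which is where the O(n²) comes from:

```rust
// Pre-size the buffer when the final length is computable,
// so the loop never reallocates
fn join_with_commas(words: &[&str]) -> String {
    let total: usize = words.iter().map(|w| w.len() + 1).sum();
    let mut out = String::with_capacity(total);
    for w in words {
        out.push_str(w); // amortized O(len), no reallocation
        out.push(',');
    }
    out
}
```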

Advanced: False Sharing

Symptom

rust
// ❌ Problem: multiple AtomicU64 in one struct
struct ShardCounters {
    inflight: AtomicU64,
    completed: AtomicU64,
}
  • One CPU core at 90%+
  • High LLC miss rate in perf
  • Many atomic RMW operations
  • Adding threads makes it slower

Diagnosis

bash
# Perf analysis
perf stat -d your_program
# Look for LLC-load-misses and locked-instrs

# Flamegraph
cargo flamegraph
# Find atomic fetch_add hotspots

Solution: Cache Line Padding

rust
// ✅ Each field in separate cache line
#[repr(align(64))]
struct PaddedAtomicU64(AtomicU64);

struct ShardCounters {
    inflight: PaddedAtomicU64,
    completed: PaddedAtomicU64,
}

Lock Contention Optimization

Symptom

rust
// ❌ All threads compete for single lock
let shared: Arc<Mutex<HashMap<String, usize>>> =
    Arc::new(Mutex::new(HashMap::new()));
  • Most time spent in mutex lock/unlock
  • Performance degrades with more threads
  • High system time percentage

Solution: Thread-Local Sharding

rust
use std::collections::HashMap;
use std::thread;

// ✅ Each thread builds a local HashMap; merge once at the end
pub fn parallel_count(data: &[String], num_threads: usize)
    -> HashMap<String, usize>
{
    // Round up so every element lands in a chunk (and never pass chunk size 0)
    let chunk_size = data.len().div_ceil(num_threads.max(1)).max(1);

    // Scoped threads may borrow `data`; plain thread::spawn would require 'static
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    let mut local = HashMap::new();
                    for key in chunk {
                        *local.entry(key.clone()).or_insert(0) += 1;
                    }
                    local  // Return local counts
                })
            })
            .collect();

        // Merge all local results
        let mut result = HashMap::new();
        for handle in handles {
            for (k, v) in handle.join().unwrap() {
                *result.entry(k).or_insert(0) += v;
            }
        }
        result
    })
}

NUMA Awareness

Problem

rust
// Multi-socket server, memory allocated on remote NUMA node
let pool = ArenaPool::new(num_threads);
// Rayon work-stealing causes tasks to run on any thread
// Cross-NUMA access causes severe memory migration latency

Solution

rust
// 1. NUMA node binding
let numa_node = detect_numa_node();
let pool = NumaAwarePool::new(numa_node);

// 2. Use unified allocator (jemalloc)
#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

// 3. Avoid cross-NUMA object clones
// Borrow directly, don't copy data

Tools

bash
# Check NUMA topology
numactl --hardware

# Bind to NUMA node
numactl --cpunodebind=0 --membind=0 ./my_program

Data Structure Selection

Scenario                   Choice                 Reason
High-concurrency writes    DashMap or sharding    Reduces lock contention
Read-heavy, few writes     RwLock<HashMap>        Read locks don't block
Small dataset              Vec + linear search    HashMap overhead higher
Fixed keys                 Enum + array           Zero hash overhead
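For the fixed-keys row, a sketch of the enum-plus-array pattern (`Stat` and `Counters` are illustrative names): the enum discriminant indexes the array directly, so lookup is a single bounds-checked load with no hashing at all.

```rust
// Fixed key set: index an array by enum discriminant, zero hash overhead
#[derive(Clone, Copy)]
enum Stat {
    Hits,
    Misses,
    Evictions,
}

struct Counters([u64; 3]);

impl Counters {
    fn new() -> Self {
        Counters([0; 3])
    }
    fn incr(&mut self, s: Stat) {
        self.0[s as usize] += 1; // discriminant is the index
    }
    fn get(&self, s: Stat) -> u64 {
        self.0[s as usize]
    }
}
```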

Read-Heavy Example

rust
// ✅ Many reads, few updates
struct Config {
    map: RwLock<HashMap<String, ConfigValue>>,
}

impl Config {
    pub fn get(&self, key: &str) -> Option<ConfigValue> {
        self.map.read().unwrap().get(key).cloned()
    }

    pub fn update(&self, key: String, value: ConfigValue) {
        self.map.write().unwrap().insert(key, value);
    }
}

Common Performance Traps

Trap                          Symptom               Solution
Adjacent atomic variables     False sharing         #[repr(align(64))]
Global Mutex                  Lock contention       Thread-local + merge
Cross-NUMA allocation         Memory migration      NUMA-aware allocation
Frequent small allocations    Allocator pressure    Object pooling
Dynamic string keys           Extra allocations     Use integer IDs
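For the dynamic-string-keys row, one common fix is a small interner that hands out dense integer IDs. This `Interner` is an illustrative std-only sketch, not a library API: hash and allocate each distinct string once, then carry the cheap `u32` through hot paths.

```rust
use std::collections::HashMap;

// Map each distinct key to a small integer ID exactly once;
// hot paths then compare/hash u32s instead of Strings
struct Interner {
    ids: HashMap<String, u32>,
    names: Vec<String>, // ID -> original string, for reverse lookup
}

impl Interner {
    fn new() -> Self {
        Interner { ids: HashMap::new(), names: Vec::new() }
    }

    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.ids.get(s) {
            return id; // already known: no allocation
        }
        let id = self.names.len() as u32;
        self.names.push(s.to_string());
        self.ids.insert(s.to_string(), id);
        id
    }
}
```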

Review Checklist

When optimizing performance:
  • Profiled to identify bottleneck
  • Bottleneck confirmed with measurements
  • Algorithm is optimal for use case
  • Data structure appropriate
  • Unnecessary allocations removed
  • Parallelism exploited where beneficial
  • Cache-friendly data layout
  • Lock contention minimized
  • Benchmarks show improvement
  • Code still readable and maintainable

Verification Commands

bash
# Benchmark
cargo bench

# Profile with perf
perf stat -d ./target/release/your_program

# Generate flamegraph
cargo flamegraph --release

# Heap profiling
valgrind --tool=dhat ./target/release/your_program

# Cache analysis
valgrind --tool=cachegrind ./target/release/your_program

# NUMA topology
numactl --hardware

Common Pitfalls

1. Premature Optimization

Symptom: Optimizing before profiling
Fix: Profile first, optimize hot paths only

2. Micro-optimizing Cold Paths

Symptom: Spending time on code that rarely runs
Fix: Focus on hot loops (90% of time in 10% of code)

3. Trading Readability for Minimal Gains

Symptom: Complex code for <5% improvement
Fix: Only optimize if gain is significant (>20%)

Performance Diagnostic Workflow

1. Identify symptom (slow, high CPU, high memory)
2. Profile with appropriate tool
   - CPU → perf/flamegraph
   - Memory → heaptrack/dhat
   - Cache → cachegrind
3. Find hotspot (function/line)
4. Understand why it's slow
   - Algorithm? Data structure? Allocation?
5. Apply targeted optimization
6. Benchmark to confirm improvement
7. Repeat if not fast enough

Related Skills

  • rust-concurrency - Parallel processing patterns
  • rust-async - Async performance optimization
  • rust-unsafe - Zero-cost abstractions with unsafe
  • rust-coding - Writing performant idiomatic code
  • rust-anti-pattern - Performance anti-patterns to avoid

Localized Reference

  • Chinese version: SKILL_ZH.md - complete Chinese version with all content