cpu-cache-opt

CPU Cache Optimization

Purpose

Guide agents through cache-aware programming: diagnosing cache misses with perf, data layout transformations (AoS→SoA), false sharing detection and fixes, prefetching, and cache-friendly algorithm design.

Triggers

  • "My program has high cache miss rates — how do I fix it?"
  • "What is false sharing and how do I detect it?"
  • "Should I use AoS or SoA data layout?"
  • "How do I measure cache performance with perf?"
  • "How do I use __builtin_prefetch?"
  • "My multithreaded program is slower than single-threaded due to cache"

Workflow

1. Measure cache performance

```bash
# Basic cache counters
perf stat -e cache-references,cache-misses,cycles,instructions ./prog

# L1/L2/L3 miss breakdown
perf stat -e L1-dcache-load-misses,L1-dcache-loads,L2-dcache-load-misses,LLC-load-misses,LLC-loads ./prog

# False sharing detection (Intel-specific events)
perf stat -e machine_clears.memory_ordering,mem_load_l3_hit_retired.xsnp_hitm ./prog
```

Cache miss rate = L1-dcache-load-misses / L1-dcache-loads. A rate above 5% is concerning; above 20% is severe.

2. Cache line basics

  • Cache line size: 64 bytes on x86-64, ARM (most platforms)
  • L1 cache: 32–64 KB, ~4 cycles latency
  • L2 cache: 256 KB–1 MB, ~12 cycles latency
  • L3 cache: 6–64 MB, ~40 cycles latency
  • Main memory: ~200–300 cycles latency
```c
// Check cache line size (POSIX)
long cache_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

// Align data to a cache line (C++)
struct alignas(64) HotData {
    int counter;
    // ... 60 more bytes of data that fit in the same line
};

// C equivalent
typedef struct {
    int x;
} __attribute__((aligned(64))) AlignedData;
```

3. AoS vs SoA data layout

```c
// AoS (Array of Structures) — default layout
struct Particle {
    float x, y, z;     // position (12 bytes)
    float vx, vy, vz;  // velocity (12 bytes)
    float mass;        // (4 bytes)
    int   flags;       // (4 bytes)
};
Particle particles[N];  // Bad for loops that only need position

// Problem: accessing particles[i].x loads x,y,z,vx,vy,vz,mass,flags
// But we only need x,y,z → 75% of loaded data is wasted

// SoA (Structure of Arrays) — cache-friendly for SIMD + sequential access
struct ParticlesSoA {
    float *x, *y, *z;
    float *vx, *vy, *vz;
    float *mass;
    int   *flags;
};

// Accessing x[i] for i=0..N loads 16 consecutive x values per line → 0% waste
// Also auto-vectorizes better
```

4. Common cache-unfriendly patterns

```c
// BAD: random access (linked list traversal)
Node *node = head;
while (node) {
    process(node->data);
    node = node->next;  // pointer chasing = cache miss per node
}

// BETTER: pool allocate nodes contiguously
// Or: rewrite as contiguous array with indices

// BAD: stride > cache line in matrix traversal
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        sum += matrix[j][i];  // column-major access on row-major array

// GOOD: row-major access
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        sum += matrix[i][j];

// BAD: large struct with hot + cold fields
struct Record {
    int id;           // hot: accessed every iteration
    char name[128];   // cold: accessed rarely
    int value;        // hot
    char desc[256];   // cold
};

// GOOD: separate hot and cold data
struct RecordHot { int id; int value; };
struct RecordCold { char name[128]; char desc[256]; };
RecordHot hot_data[N];
RecordCold cold_data[N];
```

5. False sharing

False sharing occurs when two threads write to different variables that share a cache line, causing constant cache-line invalidations.
```c
// BAD: counters likely on the same cache line (4 bytes each, line = 64 bytes)
int counter_a;  // thread A's counter
int counter_b;  // thread B's counter

// Both on the same cache line → every write invalidates the other thread's copy

// GOOD: pad to separate cache lines
struct alignas(64) PaddedCounter {
    int value;
    char padding[60];  // ensure the next counter is on a different cache line
};

PaddedCounter counters[NUM_THREADS];
// Thread i: counters[i].value++

// C++17 standard approach
struct alignas(std::hardware_destructive_interference_size) PaddedCounter {
    int value;
};
```

6. Prefetching

Manual prefetch hints to hide memory latency:
```c
#include <immintrin.h>  // for _mm_prefetch; __builtin_prefetch needs no header

// __builtin_prefetch(addr, rw, locality): rw 0=read 1=write,
// locality 0=non-temporal ... 3=high temporal
__builtin_prefetch(ptr, 0, 3);  // prefetch for read, high locality
__builtin_prefetch(ptr, 1, 3);  // prefetch for write, high locality

// SSE prefetch (x86)
_mm_prefetch((char*)ptr, _MM_HINT_T0);   // L1
_mm_prefetch((char*)ptr, _MM_HINT_T1);   // L2
_mm_prefetch((char*)ptr, _MM_HINT_T2);   // L3
_mm_prefetch((char*)ptr, _MM_HINT_NTA);  // non-temporal (streaming)

// Typical pattern: prefetch N iterations ahead
#define PREFETCH_DIST 8
for (int i = 0; i < N; i++) {
    if (i + PREFETCH_DIST < N)
        __builtin_prefetch(&data[i + PREFETCH_DIST], 0, 3);
    process(data[i]);
}
```
Prefetching rules:
  • Too early: the prefetched line is evicted again before it is used
  • Too late: the data still isn't there when the load executes, so no benefit
  • Prefetch distance ≈ memory latency / time per iteration (typically 8–32 elements)

7. Cache-friendly algorithm design

```c
// Loop blocking / tiling for matrix operations
// Process cache-fitting blocks instead of full rows/columns
#define BLOCK 64  // tuned to L1 cache size

void matrix_mult_blocked(float *C, float *A, float *B, int N) {
    for (int i = 0; i < N; i += BLOCK)
    for (int k = 0; k < N; k += BLOCK)
    for (int j = 0; j < N; j += BLOCK)
        // Inner block fits in L1 cache
        for (int ii = i; ii < i + BLOCK && ii < N; ii++)
        for (int kk = k; kk < k + BLOCK && kk < N; kk++)
        for (int jj = j; jj < j + BLOCK && jj < N; jj++)
            C[ii*N+jj] += A[ii*N+kk] * B[kk*N+jj];
}
```
For perf cache event reference and false sharing detection patterns, see references/cache-counters.md.

Related skills

  • Use skills/profilers/linux-perf for perf stat and perf record cache measurements
  • Use skills/profilers/valgrind — cachegrind simulates cache behaviour
  • Use skills/low-level-programming/simd-intrinsics — SoA layout pairs with SIMD vectorization
  • Use skills/low-level-programming/memory-model for false sharing in concurrent contexts