hardware-counters

Hardware Performance Counters

Purpose

Guide agents through hardware performance counter analysis: collecting PMU events with `perf stat -e`, using the PAPI library for portable counter access, interpreting cache miss rates and branch misprediction ratios, computing IPC, and correlating events to source lines with `perf annotate`.

Triggers

  • "How do I measure cache miss rate with perf?"
  • "How do I count branch mispredictions?"
  • "How do I compute IPC (instructions per clock) with perf?"
  • "How do I use the PAPI library for hardware counters?"
  • "How do I see which source lines cause the most cache misses?"
  • "How do I measure memory bandwidth with performance counters?"

Workflow

1. perf stat — basic counter collection

```bash
# Basic hardware event summary
perf stat ./prog

# Output:
# Performance counter stats for './prog':
#
#   1,234,567,890  instructions
#     456,789,012  cycles
#      12,345,678  cache-misses     # 1.23 % of all cache refs
#      23,456,789  branch-misses    # 2.34 % of all branches
#
#   0.456789012 seconds time elapsed

# Derived metrics (computed from the output)
# IPC = instructions / cycles = 1,234,567,890 / 456,789,012 ≈ 2.70
# CPI = cycles / instructions ≈ 0.37
```
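The IPC and CPI arithmetic can be scripted rather than done by hand. The following is a minimal sketch, assuming the machine-readable CSV output of `perf stat -x,`, where the first field is the counter value and the third is the event name. The helper name `ipc_from_csv` is hypothetical, and the sample lines reuse the instruction and cycle counts from the example output; the remaining CSV fields are placeholders.

```shell
#!/bin/sh
# Compute IPC and CPI from `perf stat -x,` CSV lines read on stdin.
# CSV fields used: $1 = counter value, $3 = event name.
# ipc_from_csv is a hypothetical helper name, not part of perf.
ipc_from_csv() {
    awk -F, '
        $3 ~ /^instructions/ { ins = $1 + 0 }
        $3 ~ /^cycles/       { cyc = $1 + 0 }
        END {
            if (ins <= 0 || cyc <= 0) {
                print "instructions or cycles count missing" > "/dev/stderr"
                exit 1
            }
            printf "IPC=%.2f CPI=%.2f\n", ins / cyc, cyc / ins
        }'
}

# Sample input: counts from the example output, other fields are placeholders.
printf '%s\n' \
    '1234567890,,instructions,456000000,100.00,,' \
    '456789012,,cycles,456000000,100.00,,' | ipc_from_csv
# prints: IPC=2.70 CPI=0.37
```

With a real program, note that perf writes its statistics to stderr, so a pipeline such as `perf stat -x, -e instructions,cycles ./prog 2>&1 >/dev/null | ipc_from_csv` should feed the helper.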

2. Specifying PMU events with -e

```bash
# Specific hardware events
perf stat -e instructions,cycles,cache-misses,branch-misses ./prog

# L1 and last-level (L3) cache events
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./prog
# perf has no generic L2 aliases; look up CPU-specific names with `perf list`
# (for example l2_rqsts.miss on many Intel cores)

# Memory bandwidth (Intel); uncore counters are per socket, so count system-wide with -a
perf stat -a -e uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ ./prog

# TLB misses
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./prog

# Branch misprediction rate
perf stat -e branches,branch-misses ./prog
# Rate = branch-misses / branches × 100%

# Available events (varies by CPU)
perf list hardware   # generic hardware events
perf list cache      # cache events
perf list pmu        # CPU-specific PMU events
```
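The rate formula (misses divided by total accesses) applies to every miss/total pair in this step. The following is a rough sketch, again assuming `perf stat -x,` CSV input; `miss_rate` is a hypothetical helper name and the counts are made up for illustration, not measured.

```shell
#!/bin/sh
# Print a miss/total pair as a percentage from `perf stat -x,` CSV on stdin.
# $1 = miss event name, $2 = total event name (exact match on CSV field 3).
# miss_rate is a hypothetical helper name, not part of perf.
miss_rate() {
    awk -F, -v num="$1" -v den="$2" '
        $3 == num { n = $1 + 0 }
        $3 == den { d = $1 + 0 }
        END {
            if (d <= 0) {
                print "no count for " den > "/dev/stderr"
                exit 1
            }
            printf "%s / %s = %.2f%%\n", num, den, 100 * n / d
        }'
}

# Illustrative counts (made up for this example).
printf '%s\n' \
    '200000000,,branches,1000000,100.00,,' \
    '5000000,,branch-misses,1000000,100.00,,' | miss_rate branch-misses branches
# prints: branch-misses / branches = 2.50%
```

The same helper should work for `L1-dcache-load-misses` over `L1-dcache-loads` or `dTLB-load-misses` over `dTLB-loads`. If events were collected with a modifier, pass the name exactly as perf prints it (for example `branches:u`).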

3. Key metrics and thresholds

| Metric | Formula | Healthy | Concerning |
|--------|---------|---------|------------|
| IPC | instructions / cycles | > 2.0 (modern x86) | < 1.0 |
| L1 miss rate | L1-misses / L1-accesses | < 1% | > 5% |
| LLC miss rate | LLC-misses / LLC-accesses | < 1% | > 10% |
| Branch miss rate | branch-misses / branches | < 1% | > 5% |
| MPKI | misses per 1K instructions | | L3 MPKI > 10 = memory bound |

```bash
# Compute MPKI (Misses Per Kilo-Instructions)
perf stat -e instructions,LLC-load-misses ./prog
# MPKI = LLC-load-misses / (instructions / 1000)
```
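As a worked instance of the MPKI formula, the sketch below takes a miss count and an instruction count and applies the "L3 MPKI > 10" rule of thumb from the thresholds table. The helper name `mpki` and the sample counts are assumptions made for illustration.

```shell
#!/bin/sh
# MPKI = misses / (instructions / 1000).
# $1 = miss count, $2 = instruction count; thousands separators are stripped.
# mpki is a hypothetical helper name.
mpki() {
    misses=$(printf '%s' "$1" | tr -d ',')
    instructions=$(printf '%s' "$2" | tr -d ',')
    awk -v m="$misses" -v i="$instructions" 'BEGIN {
        if (i <= 0) {
            print "instruction count must be positive" > "/dev/stderr"
            exit 1
        }
        v = m / (i / 1000)
        # Rule of thumb from the thresholds table: L3 MPKI > 10 suggests memory bound.
        printf "MPKI=%.2f %s\n", v, (v > 10 ? "(likely memory bound)" : "(below the 10 MPKI threshold)")
    }'
}

# Illustrative counts (made up): 30M LLC load misses over 2G instructions.
mpki 30,000,000 2,000,000,000
# prints: MPKI=15.00 (likely memory bound)
```

Because MPKI normalizes by instructions rather than by cache accesses, it stays comparable between runs whose access counts differ.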

4. Raw PMU events (CPU-specific)

For events not in the generic aliases, use raw event codes:

```bash
# Intel: use perf list or look up the encoding in the Intel SDM
# Format: rXXYY where XX=umask, YY=event code
perf stat -e r0124 ./prog    # example Intel raw event

# List Intel events by name with ocperf (part of pmu-tools)
git clone https://github.com/andikleen/pmu-tools
./pmu-tools/ocperf.py list | grep "mem_load"

# Use libpfm4 for event names and encodings
# (showevtinfo and check_events are built in libpfm4's examples/ directory)
./showevtinfo | grep "MEM_LOAD"
./check_events MEM_LOAD_RETIRED:L3_MISS    # prints the raw code to pass as -e r<code>

# AMD: similar approach
perf stat -e r04041 ./prog   # AMD raw event
```
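The `rXXYY` layout can be assembled mechanically from a umask and an event code. This is a small sketch of that encoding only; `raw_event` is a hypothetical helper name, and it covers just the two-byte umask/event form described in this step, not extra fields such as cmask or edge bits.

```shell
#!/bin/sh
# Build an Intel-style raw event string rXXYY (XX = umask, YY = event code).
# $1 = umask, $2 = event code; decimal or 0x-prefixed hex both work.
# raw_event is a hypothetical helper name.
raw_event() {
    printf 'r%02x%02x\n' "$1" "$2"
}

raw_event 0x01 0x24
# prints: r0124
```

On Intel systems perf usually also accepts the more readable form `cpu/event=0x24,umask=0x01/`, which avoids getting the byte order wrong; the PMU name (`cpu` here) is an assumption that varies by platform.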

5. Source-level annotation with perf record/annotate

```bash
# Record with hardware events
perf record -e LLC-load-misses -g ./prog

# Annotate: show source lines with their share of cache-miss samples
perf annotate --stdio

# Interactive (requires debug symbols)
perf report
# Press 'a' on a function to annotate it

# Combined: record hotspot + annotate
perf record -e cycles:u -g ./prog
perf annotate --symbol=my_function --stdio 2>/dev/null | head -40

# Example annotate output:
# Percent | Source code
#   45.23 | for (int i = 0; i < N; i++)
#    3.12 | sum += data[i];   ← cache miss here (strided access)
# Samples can skid to the instruction after the one that missed,
# so read the percentages of neighboring lines together.
```

6. PAPI — Portable API for hardware counters

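PAPI provides a portable C API across different CPU architectures:

```c
#include <papi.h>
#include <stdio.h>

/* Placeholder workload: replace with the code you want to measure. */
static void do_work(void) {
    volatile double sum = 0.0;
    for (int i = 0; i < 10000000; i++)
        sum += i * 0.5;
}

int main(void) {
    int events[] = {PAPI_TOT_INS, PAPI_TOT_CYC,
                    PAPI_L2_TCM,  PAPI_BR_MSP};
    long long values[4];
    int event_set = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }

    /* PAPI_start_counters()/PAPI_stop_counters() were removed in PAPI 6.0,
       so use the event-set API. Adding events fails if a preset is not
       available on this CPU (check with papi_avail). */
    if (PAPI_create_eventset(&event_set) != PAPI_OK ||
        PAPI_add_events(event_set, events, 4) != PAPI_OK ||
        PAPI_start(event_set) != PAPI_OK) {
        fprintf(stderr, "PAPI event setup failed\n");
        return 1;
    }

    // --- Code to measure ---
    do_work();
    // -----------------------

    if (PAPI_stop(event_set, values) != PAPI_OK) {
        fprintf(stderr, "PAPI_stop failed\n");
        return 1;
    }

    printf("Instructions:      %lld\n", values[0]);
    printf("Cycles:            %lld\n", values[1]);
    printf("IPC:               %.2f\n", (double)values[0] / values[1]);
    printf("L2 cache misses:   %lld\n", values[2]);
    printf("Branch mispred:    %lld\n", values[3]);

    return 0;
}
```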

```bash
# Build with PAPI
gcc -O2 -g -o prog prog.c -lpapi

# Available PAPI events on your system
papi_avail -a | head -30
papi_native_avail | grep "L3"   # native events with "L3"
```

Common PAPI presets:

| Preset | Event |
|--------|-------|
| `PAPI_TOT_INS` | Total instructions |
| `PAPI_TOT_CYC` | Total cycles |
| `PAPI_L1_DCM` | L1 data cache misses |
| `PAPI_L2_TCM` | L2 total cache misses |
| `PAPI_L3_TCM` | L3 total cache misses |
| `PAPI_BR_MSP` | Branch mispredictions |
| `PAPI_TLB_DM` | Data TLB misses |
| `PAPI_FP_INS` | Floating point instructions |
| `PAPI_VEC_INS` | Vector/SIMD instructions |

7. Intel PCM (Performance Counter Monitor)

```bash
# Intel PCM: system-wide counters; usually needs root, or MSR/perf access
# configured for non-root users
git clone --recursive https://github.com/intel/pcm
cd pcm && cmake -S . -B build && cmake --build build

# Measure memory bandwidth
./build/bin/pcm-memory 1   # sample every 1 second

# Core utilization + IPC
./build/bin/pcm 1

# Cache miss breakdown per socket
./build/bin/pcm 1 -csv | head -20
```
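As a cross-check on `pcm-memory`, DRAM bandwidth can also be estimated from the uncore CAS counts collected in step 2, since each CAS command transfers one 64-byte cache line. The sketch below assumes raw, unscaled CAS counts; `dram_bandwidth` is a hypothetical helper name and the inputs are made up. On many kernels perf already scales the named `cas_count_read` and `cas_count_write` events to MiB, in which case dividing the reported value by the elapsed time is enough.

```shell
#!/bin/sh
# Approximate DRAM bandwidth from raw CAS counts, assuming 64 bytes per CAS.
# $1 = read CAS count, $2 = write CAS count, $3 = elapsed seconds.
# dram_bandwidth is a hypothetical helper name.
dram_bandwidth() {
    awk -v r="$1" -v w="$2" -v t="$3" 'BEGIN {
        if (t <= 0) {
            print "elapsed time must be positive" > "/dev/stderr"
            exit 1
        }
        printf "%.2f GB/s\n", (r + w) * 64 / t / 1e9
    }'
}

# Illustrative counts (made up): 100M reads and 50M writes over 2 seconds.
dram_bandwidth 100000000 50000000 2.0
# prints: 4.80 GB/s
```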

Related skills

  • Use `skills/profilers/intel-vtune-amd-uprof` for guided microarchitecture analysis
  • Use `skills/profilers/linux-perf` for perf record/report and flamegraph generation
  • Use `skills/low-level-programming/cpu-cache-opt` for applying cache optimization patterns
  • Use `skills/low-level-programming/simd-intrinsics` for improving FLOPS/cycle metrics