Hardware Performance Counters

Purpose

Guide agents through hardware performance counter analysis: collecting PMU events with `perf stat -e`, using the PAPI library for portable counter access, interpreting cache miss rates and branch misprediction ratios, computing IPC, and correlating events to source lines with `perf annotate`.

Triggers

"How do I measure cache miss rate with perf?"
"How do I count branch mispredictions?"
"How do I compute IPC (instructions per clock) with perf?"
"How do I use the PAPI library for hardware counters?"
"How do I see which source lines cause the most cache misses?"
"How do I measure memory bandwidth with performance counters?"

Workflow

1. perf stat — basic counter collection

```bash
# Basic hardware event summary
perf stat ./prog

# Output:
#   Performance counter stats for './prog':
#
#     1,234,567,890  instructions
#       456,789,012  cycles
#        12,345,678  cache-misses     # 1.23 % of all cache refs
#        23,456,789  branch-misses    # 2.34 % of all branches
#
#     0.456789012 seconds time elapsed

# Derived metrics (computed from the output)
# IPC = instructions / cycles = 1,234,567,890 / 456,789,012 ≈ 2.70
# CPI = cycles / instructions ≈ 0.37
```
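The derived metrics are plain ratios, so they are easy to script. The sketch below assumes two hypothetical helper functions, `ipc` and `cpi` (they are not part of perf), that take raw instruction and cycle counts and use awk for the floating-point division.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of perf): derive IPC and CPI from raw counts.

# ipc INSTRUCTIONS CYCLES -> instructions per cycle, two decimals
ipc() {
    awk -v ins="$1" -v cyc="$2" 'BEGIN {
        if (cyc == 0) { print "n/a"; exit }
        printf "%.2f\n", ins / cyc
    }'
}

# cpi INSTRUCTIONS CYCLES -> cycles per instruction, two decimals
cpi() {
    awk -v ins="$1" -v cyc="$2" 'BEGIN {
        if (ins == 0) { print "n/a"; exit }
        printf "%.2f\n", cyc / ins
    }'
}

# Counts taken from the sample perf stat output in this step
ipc 1234567890 456789012   # prints 2.70
cpi 1234567890 456789012   # prints 0.37
```

Passing the counts as arguments keeps the helpers independent of how the numbers were collected, so the same functions work on values copied by hand or extracted by a script.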

2. Specifying PMU events with -e

```bash
# Specific hardware events
perf stat -e instructions,cycles,cache-misses,branch-misses ./prog

# L1 and last-level cache events
# (perf has no generic L2 alias; L2 counts need a CPU-specific event from `perf list`)
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./prog

# Memory bandwidth (Intel)
# Uncore events are system-wide, so -a is needed and root is usually required
perf stat -a -e uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ ./prog

# TLB misses
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./prog

# Branch misprediction rate
perf stat -e branches,branch-misses ./prog
# Rate = branch-misses / branches × 100%

# Available events (varies by CPU)
perf list hardware   # generic hardware events
perf list cache      # cache events
perf list pmu        # raw PMU events for your CPU
```
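For scripted analysis, `perf stat -x,` emits CSV (on stderr by default) with the counter value in the first field and the event name in the third. The sketch below post-processes that format to compute a miss rate. `event_count` and `rate_pct` are hypothetical helper names, and the inline CSV is made-up sample data standing in for real perf output.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for post-processing `perf stat -x,` CSV output.

# event_count EVENT: read perf CSV on stdin, print the count for EVENT.
event_count() {
    awk -F, -v ev="$1" '$3 == ev { print $1; exit }'
}

# rate_pct MISSES TOTAL: misses as a percentage of total, two decimals.
rate_pct() {
    awk -v m="$1" -v t="$2" 'BEGIN {
        if (t == 0) { print "n/a"; exit }
        printf "%.2f\n", 100 * m / t
    }'
}

# Illustrative CSV standing in for:
#   perf stat -x, -e branches,branch-misses ./prog 2>stats.csv
sample_csv='1000000000,,branches,456789012,100.00,,
23400000,,branch-misses,456789012,100.00,2.34,of all branches'

branches=$(printf '%s\n' "$sample_csv" | event_count branches)
misses=$(printf '%s\n' "$sample_csv" | event_count branch-misses)
echo "branch miss rate: $(rate_pct "$misses" "$branches")%"
```

The same two helpers cover the cache and TLB events as well: pass the `-load-misses` count as the first argument to `rate_pct` and the matching `-loads` count as the second.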

3. Key metrics and thresholds

| Metric | Formula | Healthy | Concerning |
|--------|---------|---------|------------|
| IPC | instructions / cycles | > 2.0 (modern x86) | < 1.0 |
| L1 miss rate | L1-misses / L1-accesses | < 1% | > 5% |
| LLC miss rate | LLC-misses / LLC-accesses | < 1% | > 10% |
| Branch miss rate | branch-misses / branches | < 1% | > 5% |
| MPKI | misses per 1K instructions | n/a | L3 MPKI > 10 = memory bound |

```bash
# Compute MPKI (Misses Per Kilo-Instructions)
perf stat -e instructions,LLC-load-misses ./prog
# MPKI = LLC-load-misses / (instructions / 1000)
```
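The MPKI formula and the memory-bound rule of thumb from the table can be combined into a small check. This is a minimal sketch: `mpki` and `is_memory_bound` are hypothetical helper names, the threshold of 10 is the heuristic from the table rather than a hard limit, and the counts are illustrative, not measured.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: compute MPKI and apply the rule-of-thumb threshold.

# mpki MISSES INSTRUCTIONS -> misses per 1000 instructions, two decimals
mpki() {
    awk -v m="$1" -v ins="$2" 'BEGIN {
        if (ins == 0) { print "n/a"; exit }
        printf "%.2f\n", m / (ins / 1000)
    }'
}

# is_memory_bound MPKI -> exit status 0 when MPKI exceeds 10
is_memory_bound() {
    awk -v v="$1" 'BEGIN { exit !(v > 10) }'
}

value=$(mpki 25000000 1000000000)   # illustrative counts, not measured data
if is_memory_bound "$value"; then
    echo "L3 MPKI $value: likely memory bound"
else
    echo "L3 MPKI $value: below the memory-bound threshold"
fi
```

MPKI is normalized by instruction count rather than by cache accesses, so it stays comparable between runs that issue very different numbers of loads.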

4. Raw PMU events (CPU-specific)

For events not in the generic aliases, use raw event codes:

```bash
# Intel: use perf list or look up the event in the Intel SDM
# Format: rXXYY where XX = umask, YY = event code (both hexadecimal)
perf stat -e r0124 ./prog   # example Intel raw event

# List Intel events with ocperf.py from pmu-tools
# (a perf wrapper that accepts Intel's symbolic event names)
git clone https://github.com/andikleen/pmu-tools
./pmu-tools/ocperf.py list | grep -i "mem_load"

# Use libpfm4 for event names: run showevtinfo from a built libpfm4 source tree
examples/showevtinfo | grep "MEM_LOAD"
# Recent perf builds also accept the symbolic name directly on CPUs that define it
perf stat -e mem_load_retired.l3_miss ./prog

# AMD: similar approach (encodings are specific to the CPU family)
perf stat -e r04041 ./prog  # AMD raw event
```
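The raw-event string is the umask and event code written as two hex bytes. The sketch below assembles it from the two values. `raw_event` and `pmu_event` are hypothetical helper names, and the sketch covers only the 8-bit umask and event code, not extended fields such as cmask, edge, or inv. The `cpu/event=...,umask=.../` form is perf's more readable equivalent on x86, though the PMU name can differ on some systems (for example hybrid CPUs).

```bash
#!/usr/bin/env bash
# Hypothetical helpers: build perf event strings from a umask and event code.

# raw_event UMASK EVENT -> rUUEE (umask first, then event code, both hex)
raw_event() {
    printf 'r%02x%02x\n' "$1" "$2"
}

# pmu_event UMASK EVENT -> the named-field form of the same event
pmu_event() {
    printf 'cpu/event=0x%02x,umask=0x%02x/\n' "$2" "$1"
}

raw_event 0x01 0x24   # prints r0124, the example code used in this step
pmu_event 0x01 0x24   # prints cpu/event=0x24,umask=0x01/
```

Either string can be passed to `perf stat -e`. The named-field form is harder to get wrong because the byte order is explicit.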

5. Source-level annotation with perf record/annotate

```bash
# Record with hardware events
perf record -e LLC-load-misses -g ./prog

# Annotate: show the share of sampled cache misses per source line
perf annotate --stdio

# Interactive (requires debug symbols)
perf report
# Press 'a' on a function to annotate it

# Combined: record hotspot + annotate
perf record -e cycles:u -g ./prog
perf annotate --symbol=my_function --stdio 2>/dev/null | head -40

# Example annotate output:
# Percent | Source code
#   45.23 |   for (int i = 0; i < N; i++)
#    3.12 |     sum += data[i];   ← cache miss here (strided access)
```
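Annotate output for a large function can run to hundreds of lines, so it helps to keep only the hot ones. The sketch below filters on the leading percentage column. `hot_lines` is a hypothetical helper, and it assumes each annotated line starts with a numeric percentage, as in the simplified example in this step; the exact layout of `perf annotate --stdio` varies between perf versions, so treat the pattern as a starting point.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep annotate lines at or above a percentage threshold.

# hot_lines THRESHOLD: read annotate output on stdin, print the hot lines
hot_lines() {
    awk -v min="$1" '
        # Keep lines whose first field is a plain number such as 45.23
        $1 ~ /^[0-9]+(\.[0-9]+)?$/ && $1 + 0 >= min + 0 { print }
    '
}

# Sample input in the simplified layout from this step
printf '%s\n' \
    'Percent | Source code' \
    '  45.23 | for (int i = 0; i < N; i++)' \
    '   3.12 | sum += data[i];' |
    hot_lines 5
```

With real data the filter would sit at the end of the pipeline, for example `perf annotate --stdio 2>/dev/null | hot_lines 5`.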

6. PAPI — Portable API for hardware counters

PAPI provides a portable C API across different CPU architectures:

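```c
#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

/* Placeholder workload; replace with the code you want to measure. */
static void do_work(void) {
    volatile double sum = 0.0;
    for (int i = 0; i < 10000000; i++)
        sum += i * 0.5;
}

/* Exit with PAPI's error string if a call did not return PAPI_OK. */
static void check(int rc, const char *what) {
    if (rc != PAPI_OK) {
        fprintf(stderr, "%s failed: %s\n", what, PAPI_strerror(rc));
        exit(1);
    }
}

int main(void) {
    int events[] = {PAPI_TOT_INS, PAPI_TOT_CYC,
                    PAPI_L2_TCM,  PAPI_BR_MSP};
    long long values[4];
    int event_set = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }

    /* Low-level event-set API; the older PAPI_start_counters and
       PAPI_stop_counters calls were removed in PAPI 6.0. */
    check(PAPI_create_eventset(&event_set), "PAPI_create_eventset");
    for (int i = 0; i < 4; i++)
        check(PAPI_add_event(event_set, events[i]), "PAPI_add_event");
    check(PAPI_start(event_set), "PAPI_start");

    do_work();  /* code to measure */

    check(PAPI_stop(event_set, values), "PAPI_stop");

    printf("Instructions:      %lld\n", values[0]);
    printf("Cycles:            %lld\n", values[1]);
    printf("IPC:               %.2f\n", (double)values[0] / values[1]);
    printf("L2 cache misses:   %lld\n", values[2]);
    printf("Branch mispred:    %lld\n", values[3]);

    return 0;
}
```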

```bash
# Build with PAPI
gcc -O2 -g -o prog prog.c -lpapi

# Available PAPI events on your system
papi_avail -a | head -30
papi_native_avail | grep "L3"   # native events with "L3"
```


Common PAPI presets:

| Preset | Event |
|--------|-------|
| `PAPI_TOT_INS` | Total instructions |
| `PAPI_TOT_CYC` | Total cycles |
| `PAPI_L1_DCM` | L1 data cache misses |
| `PAPI_L2_TCM` | L2 total cache misses |
| `PAPI_L3_TCM` | L3 total cache misses |
| `PAPI_BR_MSP` | Branch mispredictions |
| `PAPI_TLB_DM` | Data TLB misses |
| `PAPI_FP_INS` | Floating point instructions |
| `PAPI_VEC_INS` | Vector/SIMD instructions |

7. Intel PCM (Performance Counter Monitor)

```bash
# Intel PCM: system-wide counters
# (typically needs root privileges, or relaxed MSR/perf access)
git clone --recursive https://github.com/intel/pcm
cd pcm && cmake -S . -B build && cmake --build build

# Measure memory bandwidth
./build/bin/pcm-memory 1   # sample every 1 second

# Core utilization + IPC
./build/bin/pcm 1

# Cache miss breakdown per socket
./build/bin/pcm 1 -csv | head -20
```
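`pcm-memory` reports bandwidth directly. When only raw CAS counts are available, such as from the `uncore_imc` events in step 2, the figure can be estimated by hand. The sketch below assumes each CAS command transfers one 64-byte cache line, which is typical for DDR-based Intel platforms but should be checked for your hardware. `cas_bandwidth_gbs` is a hypothetical helper and the counts are illustrative. perf may already scale these events to MiB in its own output, in which case no conversion is needed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimate DRAM bandwidth from raw CAS command counts.

# cas_bandwidth_gbs READ_CAS WRITE_CAS SECONDS -> GB/s, two decimals
# Assumes 64 bytes moved per CAS command (one cache line).
cas_bandwidth_gbs() {
    awk -v r="$1" -v w="$2" -v s="$3" 'BEGIN {
        if (s <= 0) { print "n/a"; exit }
        printf "%.2f\n", (r + w) * 64 / s / 1e9
    }'
}

# Illustrative counts: 100M read + 50M write CAS commands over 1 second
echo "$(cas_bandwidth_gbs 100000000 50000000 1) GB/s"   # prints 9.60 GB/s
```

Comparing the estimate against the platform's peak DRAM bandwidth shows how close a workload is to saturating memory.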
