hardware-counters
Hardware Performance Counters
Purpose
Guide agents through hardware performance counter analysis: collecting PMU events with `perf stat -e`, using the PAPI library for portable counter access, interpreting cache miss rates and branch misprediction ratios, computing IPC, and correlating events to source lines with `perf annotate`.
Triggers
- "How do I measure cache miss rate with perf?"
- "How do I count branch mispredictions?"
- "How do I compute IPC (instructions per clock) with perf?"
- "How do I use the PAPI library for hardware counters?"
- "How do I see which source lines cause the most cache misses?"
- "How do I measure memory bandwidth with performance counters?"
Workflow
1. perf stat — basic counter collection
```bash
# Basic hardware event summary
perf stat ./prog

# Output:
# Performance counter stats for './prog':
#   1,234,567,890  instructions
#     456,789,012  cycles
#      12,345,678  cache-misses    # 1.23 % of all cache refs
#      23,456,789  branch-misses   # 2.34 % of all branches
#     0.456789012 seconds time elapsed

# Derived metrics (computed from the output)
# IPC = instructions / cycles = 1,234,567,890 / 456,789,012 ≈ 2.70
# CPI = cycles / instructions ≈ 0.37
```
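The derived metrics can also be computed mechanically. The sketch below assumes the counters were captured in CSV form (for example with `perf stat -x, -e instructions,cycles -o stat.csv ./prog`) and that each row starts with `value,unit,event`. The helper name `ipc_from_csv` is hypothetical, and the sample rows simply restate the numbers from the summary above.

```bash
# Compute IPC and CPI from perf stat CSV rows (value,unit,event,...).
# Reads the named files, or stdin when no file is given.
ipc_from_csv() {
  awk -F, '
    {
      split($3, parts, ":")            # drop modifiers such as ":u"
      if (parts[1] == "instructions") ins = $1
      if (parts[1] == "cycles")       cyc = $1
    }
    END {
      # Missing or "<not supported>" counters evaluate to 0 here
      if (ins + 0 == 0 || cyc + 0 == 0) { print "missing counters"; exit 1 }
      printf "IPC=%.2f CPI=%.2f\n", ins / cyc, cyc / ins
    }' "$@"
}

# Sample rows built from the summary above (not a real capture)
sample='1234567890,,instructions,456789012,100.00
456789012,,cycles,456789012,100.00'

printf '%s\n' "$sample" | ipc_from_csv
```

On hybrid CPUs the event column may carry a PMU prefix, so the name match would need adjusting there.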
2. Specifying PMU events with -e
```bash
# Specific hardware events
perf stat -e instructions,cycles,cache-misses,branch-misses ./prog

# L1 and last-level cache events
# (perf defines no generic L2 alias; pick a CPU-specific L2 event from `perf list`)
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./prog

# Memory bandwidth (Intel server uncore; counted system-wide, so use -a with sufficient privileges)
perf stat -a -e uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ ./prog

# TLB misses
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./prog

# Branch misprediction rate
perf stat -e branches,branch-misses ./prog
# Rate = branch-misses / branches × 100%

# Available events (varies by CPU)
perf list hardware    # generic hardware events
perf list cache       # cache events
perf list pmu         # raw PMU events for your CPU
```
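Because event availability varies by CPU, a rate calculation should tolerate counters that perf could not open. The following sketch assumes CSV output from `perf stat -x,` in which an unavailable counter appears as `<not supported>` in the value column. The function name `miss_rate` and the sample counts are illustrative, not measurements.

```bash
# miss_rate MISS_EVENT TOTAL_EVENT
# Reads perf stat CSV rows on stdin and prints MISS_EVENT as a
# percentage of TOTAL_EVENT, or a note when the pair is unusable.
miss_rate() {
  awk -F, -v m="$1" -v t="$2" '
    $3 == m { miss = $1 }
    $3 == t { total = $1 }
    END {
      if (miss ~ /^</ || total ~ /^</) { print m ": not supported on this CPU"; exit }
      if (total + 0 == 0) { print m ": no data"; exit }
      printf "%s: %.2f%%\n", m, 100 * miss / total
    }'
}

# Made-up sample rows for illustration
sample='1000000,,branches,500000000,100.00
25000,,branch-misses,500000000,100.00
<not supported>,,LLC-loads,0,100.00
<not supported>,,LLC-load-misses,0,100.00'

printf '%s\n' "$sample" | miss_rate branch-misses branches
printf '%s\n' "$sample" | miss_rate LLC-load-misses LLC-loads
```

The same helper applies to any miss/access pair, such as `L1-dcache-load-misses` over `L1-dcache-loads`.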
3. Key metrics and thresholds
| Metric | Formula | Healthy | Concerning |
|---|---|---|---|
| IPC | instructions / cycles | > 2.0 (modern x86) | < 1.0 |
| L1 miss rate | L1-misses / L1-accesses | < 1% | > 5% |
| LLC miss rate | LLC-misses / LLC-accesses | < 1% | > 10% |
| Branch miss rate | branch-misses / branches | < 1% | > 5% |
| MPKI | misses per 1K instructions | — | L3 MPKI > 10 = memory bound |
```bash
# Compute MPKI (Misses Per Kilo-Instructions)
perf stat -e instructions,LLC-load-misses ./prog
# MPKI = LLC-load-misses / (instructions / 1000)
```
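As a worked instance of the MPKI formula, the sketch below computes the value from two counts and applies the "L3 MPKI > 10" rule of thumb from the table. The helper names `mpki` and `classify_mpki` are hypothetical, and the counts are made-up sample values rather than measurements.

```bash
# mpki MISSES INSTRUCTIONS -> misses per 1000 instructions, two decimals
mpki() {
  awk -v m="$1" -v i="$2" 'BEGIN {
    if (i <= 0) { print "n/a"; exit 1 }
    printf "%.2f\n", m / (i / 1000)
  }'
}

# classify_mpki VALUE -> applies the table threshold of 10 misses per kilo-instruction
classify_mpki() {
  awk -v v="$1" 'BEGIN {
    if (v > 10) print "likely memory bound"
    else        print "below threshold"
  }'
}

# 24,000,000 LLC load misses over 1,200,000,000 instructions (sample values)
value=$(mpki 24000000 1200000000)
echo "LLC MPKI = $value ($(classify_mpki "$value"))"
```

The threshold is a heuristic; workloads with heavy prefetching or large working sets may need a different cut-off.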
4. Raw PMU events (CPU-specific)
For events not in the generic aliases, use raw event codes:
```bash
# Intel: use perf list or look up in Intel SDM
# Format: rXXYY where XX=umask, YY=event code
perf stat -e r0124 ./prog    # example Intel raw event

# List Intel events with ocperf (part of Andi Kleen's pmu-tools, not a pip package)
git clone https://github.com/andikleen/pmu-tools
./pmu-tools/ocperf.py list | grep -i "mem_load"

# Use libpfm4 for event names and encodings
# (showevtinfo and check_events are built from its examples/ directory)
showevtinfo | grep "MEM_LOAD"
# Recent perf builds also accept the vendor event name directly
perf stat -e mem_load_retired.l3_miss ./prog

# AMD: similar approach
perf stat -e r04041 ./prog   # AMD raw event
```
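To make the `rXXYY` layout concrete, the sketch below assembles the raw string from an event code and a umask, and also prints the equivalent named-field form that perf accepts for the x86 `cpu` PMU. The helper names `raw_event` and `pmu_event` are hypothetical, and the `cpu/.../` form is an assumption about the target machine: hybrid Intel parts expose `cpu_core` and `cpu_atom` instead.

```bash
# raw_event EVENT UMASK -> rXXYY (XX = umask, YY = event code)
raw_event() {
  printf 'r%02x%02x\n' "$2" "$1"
}

# pmu_event EVENT UMASK -> named-field syntax for the "cpu" PMU
pmu_event() {
  printf 'cpu/event=0x%02x,umask=0x%02x/\n' "$1" "$2"
}

raw_event 0x24 0x01   # prints r0124
pmu_event 0x24 0x01   # prints cpu/event=0x24,umask=0x01/
```

Either string can then be passed to `perf stat -e`; the named-field form is easier to read back when several raw events are mixed.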
5. Source-level annotation with perf record/annotate
```bash
# Record with hardware events
perf record -e LLC-load-misses -g ./prog

# Annotate: show each source line's share of the cache-miss samples
perf annotate --stdio

# Interactive (requires debug symbols)
perf report
# Press 'a' on a function to annotate it

# Combined: record hotspot + annotate
perf record -e cycles:u -g ./prog
perf annotate --symbol=my_function --stdio 2>/dev/null | head -40

# Example annotate output:
# Percent | Source code
#   45.23 | for (int i = 0; i < N; i++)
#    3.12 | sum += data[i];   ← cache miss here (strided access)
```
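When the annotated listing is long, it can help to rank lines by their sample share. The sketch below assumes the simplified two-column `percent | source` layout of the example output above; real `perf annotate --stdio` output has additional columns and differs between perf versions, so the parsing would need adapting. The function name `top_lines` is hypothetical.

```bash
# top_lines N: read "percent | source" rows on stdin, print the N hottest.
top_lines() {
  awk '{
    bar = index($0, "|")                 # split on the first "|" only
    if (bar == 0) next
    pct = substr($0, 1, bar - 1) + 0     # header text converts to 0
    if (pct > 0) printf "%6.2f |%s\n", pct, substr($0, bar + 1)
  }' | sort -rn | head -n "$1"
}

# Sample rows in the simplified layout (illustrative only)
sample='Percent | Source code
  3.12 | sum += data[i];
 45.23 | for (int i = 0; i < N; i++)
  0.00 | return sum;'

printf '%s\n' "$sample" | top_lines 2
```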
6. PAPI (Performance Application Programming Interface)
PAPI provides a portable C API across different CPU architectures:
```c
#include <papi.h>
#include <stdio.h>

/* Placeholder workload; replace with the code you want to measure. */
static void do_work(void) {
    volatile double sum = 0.0;
    for (int i = 0; i < 10000000; i++) sum += i * 0.5;
}

int main(void) {
    int events[] = {PAPI_TOT_INS, PAPI_TOT_CYC,
                    PAPI_L2_TCM, PAPI_BR_MSP};
    long long values[4];
    int event_set = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }
    /* Low-level API; PAPI_start_counters/PAPI_stop_counters were removed in PAPI 6.0. */
    if (PAPI_create_eventset(&event_set) != PAPI_OK ||
        PAPI_add_events(event_set, events, 4) != PAPI_OK ||
        PAPI_start(event_set) != PAPI_OK) {
        fprintf(stderr, "PAPI event setup failed\n");
        return 1;
    }
    // --- Code to measure ---
    do_work();
    // -----------------------
    if (PAPI_stop(event_set, values) != PAPI_OK) {
        fprintf(stderr, "PAPI stop failed\n");
        return 1;
    }
    printf("Instructions:    %lld\n", values[0]);
    printf("Cycles:          %lld\n", values[1]);
    printf("IPC:             %.2f\n", (double)values[0] / values[1]);
    printf("L2 cache misses: %lld\n", values[2]);
    printf("Branch mispred:  %lld\n", values[3]);
    return 0;
}
```
```bash
# Build with PAPI
gcc -O2 -g -o prog prog.c -lpapi

# Available PAPI events on your system
papi_avail -a | head -30
papi_native_avail | grep "L3"    # native events with "L3"
```
Common PAPI presets:
| Preset | Event |
|--------|-------|
| `PAPI_TOT_INS` | Total instructions |
| `PAPI_TOT_CYC` | Total cycles |
| `PAPI_L1_DCM` | L1 data cache misses |
| `PAPI_L2_TCM` | L2 total cache misses |
| `PAPI_L3_TCM` | L3 total cache misses |
| `PAPI_BR_MSP` | Branch mispredictions |
| `PAPI_TLB_DM` | Data TLB misses |
| `PAPI_FP_INS` | Floating point instructions |
| `PAPI_VEC_INS` | Vector/SIMD instructions |

7. Intel PCM (Performance Counter Monitor)
```bash
# Intel PCM: system-wide counters; usually needs root or equivalent MSR/perf access
git clone --recursive https://github.com/intel/pcm
cd pcm && cmake -S . -B build && cmake --build build

# Measure memory bandwidth
./build/bin/pcm-memory 1    # sample every 1 second

# Core utilization + IPC
./build/bin/pcm 1

# Cache miss breakdown per socket
./build/bin/pcm 1 -csv | head -20
```
Related skills
- Use `skills/profilers/intel-vtune-amd-uprof` for guided microarchitecture analysis
- Use `skills/profilers/linux-perf` for perf record/report and flamegraph generation
- Use `skills/low-level-programming/cpu-cache-opt` for applying cache optimization patterns
- Use `skills/low-level-programming/simd-intrinsics` for improving FLOPS/cycle metrics