intel-vtune-amd-uprof
Intel VTune & AMD uProf
Purpose
Guide agents through CPU microarchitecture profiling with Intel VTune Profiler (free Community Edition) and AMD uProf: hotspot identification, microarchitecture analysis, memory access pattern optimization, pipeline stall diagnosis, and roofline model analysis.
Triggers
- "How do I use Intel VTune to profile my code?"
- "What are pipeline stalls and how do I reduce them?"
- "How do I analyze memory bandwidth with VTune?"
- "What is the roofline model and how do I use it?"
- "How do I use AMD uProf as a free alternative to VTune?"
- "My code has good cache hit rates but is still slow"
Workflow
1. VTune setup (free Community Edition)
```bash
# Download Intel VTune Profiler (Community Edition — free)
# Install on Linux, then source the environment script:
source /opt/intel/oneapi/vtune/latest/env/vars.sh

# CLI usage
vtune -collect hotspots ./prog
vtune -collect microarchitecture-exploration ./prog
vtune -collect memory-access ./prog

# View results in GUI
vtune-gui &
# File → Open Result → select .vtune directory

# Or use amplxe-cl (legacy CLI)
amplxe-cl -collect hotspots ./prog
amplxe-cl -report hotspots -r result/
```
2. Analysis types
| Analysis | What it finds | When to use |
|---|---|---|
| Hotspots | CPU-bound functions | First step — find where time is spent |
| Microarchitecture Exploration | IPC, pipeline stalls, retired instructions | After hotspot — why is the hotspot slow? |
| Memory Access | Cache misses, DRAM bandwidth, NUMA | Memory-bound code |
| Threading | Lock contention, parallel efficiency | Multithreaded code |
| HPC Performance | Vectorization, memory, roofline | HPC / scientific code |
| I/O | Disk and network bottlenecks | I/O-bound code |
3. Hotspot analysis
```bash
# Collect and report hotspots
vtune -collect hotspots -result-dir hotspots_result ./prog

# Report top functions by CPU time
vtune -report hotspots -r hotspots_result -format csv | head -20
```

CLI output example:

```
Function       CPU Time   Module
compute_fft    4.532s     libfft.so
matrix_mult    2.108s     prog
parse_input    0.234s     prog
```
Build with debug info for meaningful symbols:

```bash
gcc -O2 -g ./prog.c -o prog                                        # symbols visible in VTune
gcc -O2 -g -gsplit-dwarf -fno-omit-frame-pointer ./prog.c -o prog  # better stacks
```

4. Microarchitecture exploration — pipeline stalls
```bash
vtune -collect microarchitecture-exploration -r micro_result ./prog
vtune -report summary -r micro_result
```

Key metrics to examine:
| Metric | Meaning | Good value |
|---|---|---|
| IPC (Instructions Per Clock) | How many instructions retire per cycle | x86: aim for > 2.0 |
| CPI (Clocks Per Instruction) | Inverse of IPC | Lower is better |
| Bad Speculation | Branch mispredictions | < 5% |
| Front-End Bound | Instruction decode bottleneck | < 15% |
| Back-End Bound | Execution unit or memory stall | < 30% |
| Retiring | Useful work fraction | > 70% ideal |
| Memory Bound | % cycles waiting for memory | < 20% |
Pipeline Analysis (Top-Down Methodology):

```
├── Retiring (good, useful work)
├── Bad Speculation (branch mispredictions)
├── Front-End Bound
│   ├── Fetch Latency (I-cache misses, branch mispredicts)
│   └── Fetch Bandwidth
└── Back-End Bound
    ├── Memory Bound
    │   ├── L1 Bound → L1 cache misses
    │   ├── L2 Bound → L2 cache misses
    │   ├── L3 Bound → L3 cache misses
    │   └── DRAM Bound → main memory bandwidth limited
    └── Core Bound → ALU/compute bound
```
5. Memory access analysis
```bash
# Collect memory access profile
vtune -collect memory-access -r mem_result ./prog
```
Key output sections:
- Memory Bound: % time waiting for memory
- LLC (Last Level Cache) Miss Rate
- DRAM Bandwidth: GB/s achieved vs theoretical peak
- NUMA: cross-socket accesses (for multi-socket systems)
Reading DRAM bandwidth:

```
DRAM Bandwidth:   18.4 GB/s
Peak Theoretical: 51.2 GB/s
Utilization:      36% — likely not DRAM-bound
```

If DRAM-bound: optimize data layout (AoS → SoA), reduce the working set, improve spatial locality.
6. AMD uProf — free alternative for AMD CPUs
```bash
# Download AMD uProf
# CLI profiling
AMDuProfCLI collect --config tbp ./prog     # time-based profiling
AMDuProfCLI collect --config assess ./prog  # microarchitecture assessment
AMDuProfCLI collect --config memory ./prog  # memory access

# Generate report
AMDuProfCLI report -i /tmp/uprof_result/ -o report.html

# Open GUI
AMDuProf &
```
AMD uProf metrics map to VTune equivalents:
- `Retired Instructions` → IPC analysis
- `Branch Mispredictions` → Bad Speculation
- `L1/L2/L3 Cache Misses` → Memory Bound levels
- `Data Cache Accesses` → Cache efficiency
7. Roofline model
The roofline model shows whether code is compute-bound or memory-bound by comparing achieved performance against hardware limits:

```
Performance (GFLOPS/s)
      |              _______________
Peak  |             /
Perf  |            /   compute bound
      |           /
      |          /
      |         /   memory bandwidth bound
      |        /
      +------------------------------→
          Arithmetic Intensity (FLOPS/Byte)
```
```bash
# VTune roofline collection
vtune -collect hpc-performance -r roofline_result ./prog
```

Then: VTune GUI → Roofline view
For manual calculation:
- Arithmetic Intensity = FLOPS / memory_bytes_accessed
- Peak FLOPS = sockets × cores_per_socket × frequency × FLOPS_per_cycle_per_core
- Peak bandwidth: from the hardware spec (e.g., 51.2 GB/s for dual-channel DDR4-3200)
```bash
# likwid-perfctr for manual roofline data (Linux)
likwid-perfctr -C 0 -g FLOPS_DP ./prog  # double-precision FLOPS
likwid-perfctr -C 0 -g MEM ./prog       # memory bandwidth
```
Related skills
- skills/profilers/hardware-counters - Use for raw PMU event collection with perf stat
- skills/profilers/linux-perf - Use for perf-based profiling on Linux
- skills/low-level-programming/cpu-cache-opt - Use for memory access pattern optimization
- skills/low-level-programming/simd-intrinsics - Use for vectorization to increase FLOPS