intel-vtune-amd-uprof

Intel VTune & AMD uProf

Purpose

Guide agents through CPU microarchitecture profiling with Intel VTune Profiler (free Community Edition) and AMD uProf: hotspot identification, microarchitecture analysis, memory access pattern optimization, pipeline stall diagnosis, and roofline model analysis.

Triggers

  • "How do I use Intel VTune to profile my code?"
  • "What are pipeline stalls and how do I reduce them?"
  • "How do I analyze memory bandwidth with VTune?"
  • "What is the roofline model and how do I use it?"
  • "How do I use AMD uProf as a free alternative to VTune?"
  • "My code has good cache hit rates but is still slow"

Workflow

1. VTune setup (free Community Edition)

```bash
# Download Intel VTune Profiler (Community Edition, free)
# Install on Linux, then load the environment
source /opt/intel/oneapi/vtune/latest/env/vars.sh

# CLI usage
vtune -collect hotspots ./prog
vtune -collect uarch-exploration ./prog   # Microarchitecture Exploration
vtune -collect memory-access ./prog

# View results in GUI
vtune-gui &
# File → Open Result → select the .vtune file inside the result directory

# Or use amplxe-cl (legacy CLI)
amplxe-cl -collect hotspots ./prog
amplxe-cl -report hotspots -r result/
```
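Before collecting, it can help to confirm which profiler CLI is on `PATH` once the environment script has been sourced. The following is a minimal sketch; `pick_profiler` is a hypothetical helper name, and `vtune` and `AMDuProfCLI` are the command names used in this guide.

```bash
# Print the first profiler CLI found on PATH, preferring vtune.
# pick_profiler is a hypothetical helper, not part of either tool.
pick_profiler() {
  for _tool in vtune AMDuProfCLI; do
    if command -v "$_tool" >/dev/null 2>&1; then
      printf '%s\n' "$_tool"
      return 0
    fi
  done
  printf 'none\n'
  return 1
}
```

Calling `pick_profiler` prints `vtune`, `AMDuProfCLI`, or `none`, and returns non-zero when neither tool is found.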

2. Analysis types

| Analysis | What it finds | When to use |
|---|---|---|
| Hotspots | CPU-bound functions | First step: find where time is spent |
| Microarchitecture Exploration | IPC, pipeline stalls, retired instructions | After hotspots: why is the hotspot slow? |
| Memory Access | Cache misses, DRAM bandwidth, NUMA | Memory-bound code |
| Threading | Lock contention, parallel efficiency | Multithreaded code |
| HPC Performance | Vectorization, memory, roofline | HPC / scientific code |
| I/O | Disk and network bottlenecks | I/O-bound code |
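The table can be turned into a small lookup when scripting collections. This is only a sketch: the symptom keywords on the left are hypothetical labels invented here, and the analysis type names on the right are assumed to match what `vtune -help collect` lists on your installed version.

```bash
# Map a coarse symptom keyword to an analysis type for `vtune -collect`.
# The keywords (cpu, pipeline, ...) are hypothetical labels for this sketch.
vtune_analysis_for() {
  case "$1" in
    cpu)      echo hotspots ;;
    pipeline) echo uarch-exploration ;;
    memory)   echo memory-access ;;
    threads)  echo threading ;;
    hpc)      echo hpc-performance ;;
    io)       echo io ;;
    *)        echo "unknown symptom: $1" >&2; return 1 ;;
  esac
}

# Example: vtune -collect "$(vtune_analysis_for memory)" ./prog
```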

3. Hotspot analysis

```bash
# Collect and report hotspots
vtune -collect hotspots -result-dir hotspots_result ./prog

# Report top functions by CPU time
vtune -report hotspots -r hotspots_result -format csv | head -20

# CLI output example:
# Function       CPU Time   Module
# compute_fft    4.532s     libfft.so
# matrix_mult    2.108s     prog
# parse_input    0.234s     prog
```

Build with debug info for meaningful symbols:
```bash
gcc -O2 -g ./prog.c -o prog     # symbols visible in VTune
gcc -O2 -g -gsplit-dwarf -fno-omit-frame-pointer ./prog.c -o prog  # frame pointers give better stacks
```
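Absolute CPU times are easier to act on when expressed as a share of the total. The sketch below assumes a simplified comma-separated table with a header row and `Function,CPU Time,Module` columns; a real VTune CSV report may use a different delimiter or column set, so treat the layout and the name `hotspot_share` as assumptions.

```bash
# Read a comma-separated hotspots table on stdin (header row, then
# Function,CPU Time,Module) and print each function's share of total CPU time.
# The three-column layout is an assumption made for this sketch.
hotspot_share() {
  awk -F',' '
    NR == 1 { next }                        # skip the header row
    NF >= 2 { name[++n] = $1; secs[n] = $2 + 0; total += secs[n] }
    END {
      if (total <= 0) exit 1                # nothing to report
      for (i = 1; i <= n; i++)
        printf "%s %.1f%%\n", name[i], 100 * secs[i] / total
    }'
}

# Example using the figures from the sample CLI output
printf '%s\n' \
  'Function,CPU Time,Module' \
  'compute_fft,4.532,libfft.so' \
  'matrix_mult,2.108,prog' \
  'parse_input,0.234,prog' | hotspot_share
# prints:
#   compute_fft 65.9%
#   matrix_mult 30.7%
#   parse_input 3.4%
```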

4. Microarchitecture exploration — pipeline stalls

```bash
vtune -collect uarch-exploration -r micro_result ./prog
vtune -report summary -r micro_result
```

Key metrics to examine (the "good" values are rough rules of thumb, not hard limits):

| Metric | Meaning | Good value |
|---|---|---|
| IPC (Instructions Per Clock) | How many instructions retire per cycle | x86: aim for > 2.0 |
| CPI (Clocks Per Instruction) | Inverse of IPC | Lower is better |
| Bad Speculation | Work wasted on branch mispredictions | < 5% |
| Front-End Bound | Instruction fetch/decode bottleneck | < 15% |
| Back-End Bound | Execution unit or memory stall | < 30% |
| Retiring | Useful work fraction | > 70% ideal |
| Memory Bound | % cycles waiting for memory | < 20% |

Pipeline Analysis (Top-Down Methodology):

```
├── Retiring (good, useful work)
├── Bad Speculation (branch mispredictions)
├── Front-End Bound
│   ├── Fetch Latency (I-cache misses, branch resteers)
│   └── Fetch Bandwidth
└── Back-End Bound
    ├── Memory Bound
    │   ├── L1 Bound → stalls on loads even though they hit L1
    │   ├── L2 Bound → loads that miss L1 and are served by L2
    │   ├── L3 Bound → loads that miss L2 and are served by L3
    │   └── DRAM Bound → loads served by main memory (bandwidth or latency)
    └── Core Bound → ALU/compute bound
```
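The guideline values from the metrics table can be applied mechanically to a summary report. A minimal sketch follows, assuming you have already read the three top-level percentages off the summary; `topdown_flags` is a hypothetical helper name, and the thresholds are the rule-of-thumb values from the table rather than limits defined by VTune.

```bash
# Flag top-level Top-Down categories that exceed the rule-of-thumb limits
# from the metrics table. Arguments are percentages:
#   bad_speculation  front_end_bound  back_end_bound
# topdown_flags is a hypothetical helper name.
topdown_flags() {
  awk -v bs="$1" -v fe="$2" -v be="$3" 'BEGIN {
    if (bs + 0 >= 5)  { print "Bad Speculation"; hit = 1 }
    if (fe + 0 >= 15) { print "Front-End Bound"; hit = 1 }
    if (be + 0 >= 30) { print "Back-End Bound";  hit = 1 }
    if (!hit) print "within guidelines"
  }'
}

topdown_flags 3 10 47    # prints: Back-End Bound
```

A flagged category is where to descend in the tree; for example, a high Back-End Bound value points at the Memory Bound and Core Bound children next.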

5. Memory access analysis

```bash
# Collect memory access profile
vtune -collect memory-access -r mem_result ./prog

# Key output sections:
# - Memory Bound: % time waiting for memory
# - LLC (Last Level Cache) Miss Rate
# - DRAM Bandwidth: GB/s achieved vs theoretical peak
# - NUMA: cross-socket accesses (for multi-socket systems)
```

Reading DRAM bandwidth:
```
DRAM Bandwidth: 18.4 GB/s
Peak Theoretical: 51.2 GB/s
Utilization: 36% (likely not DRAM-bound)
```

If DRAM-bound: optimize data layout (AoS → SoA), reduce working set, improve spatial locality.
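The utilization figure in the DRAM bandwidth example is simply achieved bandwidth divided by theoretical peak. The sketch below reproduces that arithmetic; the helper names are hypothetical, and the peak formula assumes 64-bit (8-byte) memory channels.

```bash
# Theoretical peak DRAM bandwidth in GB/s:
#   transfer rate (MT/s) x 8 bytes per transfer x channels / 1000
# Assumes 64-bit (8-byte) channels; helper names are hypothetical.
peak_dram_gbs() {
  awk -v mts="$1" -v ch="$2" 'BEGIN { printf "%.1f\n", mts * 8 * ch / 1000 }'
}

# Achieved bandwidth as a percentage of the theoretical peak.
dram_utilization() {
  awk -v got="$1" -v peak="$2" 'BEGIN { printf "%.0f\n", 100 * got / peak }'
}

peak_dram_gbs 3200 2          # DDR4-3200, dual channel: prints 51.2
dram_utilization 18.4 51.2    # prints 36
```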

6. AMD uProf — free alternative for AMD CPUs

```bash
# Download AMD uProf

# CLI profiling
AMDuProfCLI collect --config tbp ./prog      # time-based profiling
AMDuProfCLI collect --config assess ./prog   # microarchitecture assessment
AMDuProfCLI collect --config memory ./prog   # memory access

# Generate report
AMDuProfCLI report -i /tmp/uprof_result/ -o report.html

# Open GUI
AMDuProf &
```

AMD uProf metrics map to VTune equivalents:
- `Retired Instructions` → IPC analysis
- `Branch Mispredictions` → Bad Speculation
- `L1/L2/L3 Cache Misses` → Memory Bound levels
- `Data Cache Accesses` → Cache efficiency
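Since uProf reports raw counts such as retired instructions, IPC has to be derived by dividing by the cycle count. A minimal sketch of that calculation; `ipc_cpi` is a hypothetical helper and the counter totals in the example are made-up values, not tool output.

```bash
# Derive IPC and CPI from raw counter totals.
# Arguments: retired instructions, core clock cycles.
# ipc_cpi is a hypothetical helper name.
ipc_cpi() {
  awk -v ins="$1" -v cyc="$2" 'BEGIN {
    if (ins + 0 <= 0 || cyc + 0 <= 0) exit 1   # reject empty or zero counts
    printf "IPC %.2f CPI %.2f\n", ins / cyc, cyc / ins
  }'
}

ipc_cpi 8000000000 4000000000    # prints: IPC 2.00 CPI 0.50
```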

7. Roofline model

The roofline model shows whether code is compute-bound or memory-bound by comparing achieved performance against hardware limits:

```
Performance (GFLOP/s)
     |                    _______________
Peak |                 /
Perf |              /  compute bound
     |           /
     |        /
     |     /  memory bandwidth bound
     |  /
     +------------------------------→
        Arithmetic Intensity (FLOP/Byte)
```

```bash
# VTune roofline collection
vtune -collect hpc-performance -r roofline_result ./prog
# Then: VTune GUI → Roofline view
# (if your VTune version has no Roofline view, the dedicated roofline chart
#  is part of Intel Advisor: advisor --collect=roofline)

# For manual calculation:
# Arithmetic Intensity = FLOPs / memory_bytes_accessed
# Peak FLOPS = sockets × cores_per_socket × freq × FLOPs_per_cycle_per_core
# Peak BW = from hardware spec (e.g., 51.2 GB/s for DDR4-3200 dual channel)

# likwid-perfctr for manual roofline data (Linux)
likwid-perfctr -C 0 -g FLOPS_DP ./prog   # double-precision FLOPS
likwid-perfctr -C 0 -g MEM ./prog        # memory bandwidth
```
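The manual calculation comes down to taking the lower of the two roofs: attainable performance is min(peak FLOPS, arithmetic intensity × peak bandwidth). The sketch below works that through; `roofline` is a hypothetical helper, and the 384 GFLOP/s machine is an invented example, not a measured figure.

```bash
# Attainable performance under the roofline model:
#   min(peak_gflops, arithmetic_intensity * peak_bandwidth)
# Arguments: peak GFLOP/s, peak bandwidth in GB/s, arithmetic intensity in FLOP/byte.
# roofline is a hypothetical helper name.
roofline() {
  awk -v peak="$1" -v bw="$2" -v ai="$3" 'BEGIN {
    mem_roof = ai * bw                      # bandwidth-limited ceiling
    if (mem_roof < peak) printf "%.1f GFLOP/s memory-bound\n", mem_roof
    else                 printf "%.1f GFLOP/s compute-bound\n", peak
  }'
}

# Hypothetical machine: 1 socket x 8 cores x 3.0 GHz x 16 FLOP/cycle = 384 GFLOP/s
roofline 384 51.2 0.25    # prints: 12.8 GFLOP/s memory-bound
roofline 384 51.2 10      # prints: 384.0 GFLOP/s compute-bound
```

For this example the two roofs meet at 384 / 51.2 = 7.5 FLOP/byte; kernels with lower arithmetic intensity are limited by memory bandwidth, so raising intensity (blocking, data reuse) pays off more there than adding vectorization.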

Related skills

  • Use `skills/profilers/hardware-counters` for raw PMU event collection with perf stat
  • Use `skills/profilers/linux-perf` for perf-based profiling on Linux
  • Use `skills/low-level-programming/cpu-cache-opt` for memory access pattern optimization
  • Use `skills/low-level-programming/simd-intrinsics` for vectorization to increase FLOPS