intel-vtune-amd-uprof
Intel VTune & AMD uProf
Purpose
Guide agents through CPU microarchitecture profiling with Intel VTune Profiler (free Community Edition) and AMD uProf: hotspot identification, microarchitecture analysis, memory access pattern optimization, pipeline stall diagnosis, and roofline model analysis.
Triggers
- "How do I use Intel VTune to profile my code?"
- "What are pipeline stalls and how do I reduce them?"
- "How do I analyze memory bandwidth with VTune?"
- "What is the roofline model and how do I use it?"
- "How do I use AMD uProf as a free alternative to VTune?"
- "My code has good cache hit rates but is still slow"
Workflow
1. VTune setup (free Community Edition)
```bash
# Download Intel VTune Profiler (Community Edition — free)
# Install on Linux, then source the environment script:
source /opt/intel/oneapi/vtune/latest/env/vars.sh

# CLI usage
vtune -collect hotspots ./prog
vtune -collect microarchitecture-exploration ./prog
vtune -collect memory-access ./prog

# View results in GUI
vtune-gui &
# File → Open Result → select .vtune directory

# Or use amplxe-cl (legacy CLI)
amplxe-cl -collect hotspots ./prog
amplxe-cl -report hotspots -r result/
```
2. Analysis types
| Analysis | What it finds | When to use |
|---|---|---|
| Hotspots | CPU-bound functions | First step — find where time is spent |
| Microarchitecture Exploration | IPC, pipeline stalls, retired instructions | After hotspot — why is the hotspot slow? |
| Memory Access | Cache misses, DRAM bandwidth, NUMA | Memory-bound code |
| Threading | Lock contention, parallel efficiency | Multithreaded code |
| HPC Performance | Vectorization, memory, roofline | HPC / scientific code |
| I/O | Disk and network bottlenecks | I/O-bound code |
3. Hotspot analysis
```bash
# Collect and report hotspots
vtune -collect hotspots -result-dir hotspots_result ./prog

# Report top functions by CPU time
vtune -report hotspots -r hotspots_result -format csv | head -20
```

CLI output example:

```
Function       CPU Time   Module
compute_fft    4.532s     libfft.so
matrix_mult    2.108s     prog
parse_input    0.234s     prog
```
Build with debug info for meaningful symbols:

```bash
gcc -O2 -g ./prog.c -o prog                                        # symbols visible in VTune
gcc -O2 -g -gsplit-dwarf -fno-omit-frame-pointer ./prog.c -o prog  # better stacks
```

4. Microarchitecture exploration — pipeline stalls
```bash
vtune -collect microarchitecture-exploration -r micro_result ./prog
vtune -report summary -r micro_result
```

Key metrics to examine:
| Metric | Meaning | Good value |
|---|---|---|
| IPC (Instructions Per Clock) | How many instructions retire per cycle | x86: aim for > 2.0 |
| CPI (Clocks Per Instruction) | Inverse of IPC | Lower is better |
| Bad Speculation | Branch mispredictions | < 5% |
| Front-End Bound | Instruction decode bottleneck | < 15% |
| Back-End Bound | Execution unit or memory stall | < 30% |
| Retiring | Useful work fraction | > 70% ideal |
| Memory Bound | % cycles waiting for memory | < 20% |
Pipeline Analysis (Top-Down Methodology):

```
├── Retiring (good, useful work)
├── Bad Speculation (branch mispredictions)
├── Front-End Bound
│   ├── Fetch Latency (I-cache misses, branch mispredicts)
│   └── Fetch Bandwidth
└── Back-End Bound
    ├── Memory Bound
    │   ├── L1 Bound → L1 cache misses
    │   ├── L2 Bound → L2 cache misses
    │   ├── L3 Bound → L3 cache misses
    │   └── DRAM Bound → main memory bandwidth limited
    └── Core Bound → ALU/compute bound
```
5. Memory access analysis
```bash
# Collect memory access profile
vtune -collect memory-access -r mem_result ./prog
```
Key output sections:
- Memory Bound: % time waiting for memory
- LLC (Last Level Cache) Miss Rate
- DRAM Bandwidth: GB/s achieved vs theoretical peak
- NUMA: cross-socket accesses (for multi-socket systems)
Reading DRAM bandwidth:

```
DRAM Bandwidth:   18.4 GB/s
Peak Theoretical: 51.2 GB/s
Utilization:      36% — likely not DRAM-bound
```

If DRAM-bound: optimize data layout (AoS → SoA), reduce the working set, improve spatial locality.
6. AMD uProf — free alternative for AMD CPUs
```bash
# Download AMD uProf
# CLI profiling
AMDuProfCLI collect --config tbp ./prog     # time-based profiling
AMDuProfCLI collect --config assess ./prog  # microarchitecture assessment
AMDuProfCLI collect --config memory ./prog  # memory access

# Generate report
AMDuProfCLI report -i /tmp/uprof_result/ -o report.html

# Open GUI
AMDuProf &
```
AMD uProf metrics map to VTune equivalents:
- `Retired Instructions` → IPC analysis
- `Branch Mispredictions` → Bad Speculation
- `L1/L2/L3 Cache Misses` → Memory Bound levels
- `Data Cache Accesses` → Cache efficiency
7. Roofline model
The roofline model shows whether code is compute-bound or memory-bound by comparing achieved performance against hardware limits:

```
Performance (GFLOPS/s)
      |              _______________
Peak  |             /
Perf  |            /   compute bound
      |           /
      |          /
      |         /   memory bandwidth bound
      |        /
      +------------------------------→
          Arithmetic Intensity (FLOPS/Byte)
```
```bash
# VTune roofline collection
vtune -collect hpc-performance -r roofline_result ./prog
```

Then: VTune GUI → Roofline view
For manual calculation:
- Arithmetic Intensity = FLOPS / memory_bytes_accessed
- Peak FLOPS = sockets × cores_per_socket × frequency × FLOPS_per_cycle_per_core
- Peak bandwidth: from the hardware spec (e.g., 51.2 GB/s for dual-channel DDR4-3200)
```bash
# likwid-perfctr for manual roofline data (Linux)
likwid-perfctr -C 0 -g FLOPS_DP ./prog  # double-precision FLOPS
likwid-perfctr -C 0 -g MEM ./prog       # memory bandwidth
```
Related skills
- skills/profilers/hardware-counters - Use for raw PMU event collection with perf stat
- skills/profilers/linux-perf - Use for perf-based profiling on Linux
- skills/low-level-programming/cpu-cache-opt - Use for memory access pattern optimization
- skills/low-level-programming/simd-intrinsics - Use for vectorization to increase FLOPS