Intel VTune and AMD uProf profiling skill for microarchitecture analysis. Use when analyzing hotspots, microarchitecture bottlenecks, memory access patterns, pipeline stalls, or using the roofline model. Covers VTune Community Edition (free) and AMD uProf as a free alternative. Activates on queries about VTune, uProf, microarchitecture analysis, pipeline stalls, memory bandwidth, roofline model, or hardware performance analysis.
npx skill4agent add mohitmishra786/low-level-dev-skills intel-vtune-amd-uprof

# Download Intel VTune Profiler (Community Edition, free)
# https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html
# Set up the VTune environment on Linux (after installation)
source /opt/intel/oneapi/vtune/latest/env/vars.sh
# CLI usage
vtune -collect hotspots ./prog
vtune -collect uarch-exploration ./prog   # Microarchitecture Exploration
vtune -collect memory-access ./prog
# View results in GUI
vtune-gui &
# File → Open Result → select .vtune directory
# Or use amplxe-cl (legacy CLI)
amplxe-cl -collect hotspots ./prog
amplxe-cl -report hotspots -r result/

| Analysis | What it finds | When to use |
|---|---|---|
| Hotspots | CPU-bound functions | First step — find where time is spent |
| Microarchitecture Exploration | IPC, pipeline stalls, retired instructions | After hotspot — why is the hotspot slow? |
| Memory Access | Cache misses, DRAM bandwidth, NUMA | Memory-bound code |
| Threading | Lock contention, parallel efficiency | Multithreaded code |
| HPC Performance | Vectorization, memory, roofline | HPC / scientific code |
| I/O | Disk and network bottlenecks | I/O-bound code |
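
Each row of the table corresponds to a `-collect` analysis type on the command line. The following is a minimal sketch of that mapping, not part of the original skill. It assumes the analysis-type names used by recent VTune releases (`uarch-exploration`, `threading`, and `io` do not appear in this document's own commands and should be checked with `vtune -help collect`). The helper name `vtune_collect_cmd` is hypothetical.

```python
import shlex

# Analysis names from the table, mapped to `vtune -collect` types.
# Assumption: names follow recent VTune releases; verify with `vtune -help collect`.
ANALYSIS_TYPES = {
    "Hotspots": "hotspots",
    "Microarchitecture Exploration": "uarch-exploration",
    "Memory Access": "memory-access",
    "Threading": "threading",
    "HPC Performance": "hpc-performance",
    "I/O": "io",
}


def vtune_collect_cmd(analysis, result_dir, target):
    """Build (but do not run) the argv for one collection. Hypothetical helper."""
    return ["vtune", "-collect", ANALYSIS_TYPES[analysis],
            "-result-dir", result_dir, "--", *target]


cmd = vtune_collect_cmd("Memory Access", "mem_result", ["./prog"])
print(shlex.join(cmd))  # vtune -collect memory-access -result-dir mem_result -- ./prog
```

Returning an argument list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.
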
# Collect and report hotspots
vtune -collect hotspots -result-dir hotspots_result ./prog
# Report top functions by CPU time
vtune -report hotspots -r hotspots_result -format csv | head -20
# CLI output example:
# Function CPU Time Module
# compute_fft 4.532s libfft.so
# matrix_mult 2.108s prog
# parse_input 0.234s prog

gcc -O2 -g ./prog.c -o prog # symbols visible in VTune
gcc -O2 -g -gsplit-dwarf -fno-omit-frame-pointer ./prog.c -o prog # better stacks

vtune -collect uarch-exploration -r micro_result ./prog
vtune -report summary -r micro_result

| Metric | Meaning | Good value |
|---|---|---|
| IPC (Instructions Per Clock) | How many instructions retire per cycle | x86: aim for > 2.0 |
| CPI (Clocks Per Instruction) | Inverse of IPC | Lower is better |
| Bad Speculation | Slots wasted on branch mispredictions and machine clears | < 5% |
| Front-End Bound | Instruction fetch/decode cannot keep the back end fed | < 15% |
| Back-End Bound | Execution unit or memory stall | < 30% |
| Retiring | Useful work fraction | > 70% ideal |
| Memory Bound | % cycles waiting for memory | < 20% |
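
The rules of thumb in the table can be applied mechanically to the numbers in a summary report. Below is a minimal sketch, assuming the top-level fractions have already been read off the report by hand. The names `CEILINGS`, `ipc`, and `flag_bottlenecks` are hypothetical, and the sample values are made up for illustration, not measured.

```python
from __future__ import annotations

# Rule-of-thumb ceilings from the table above, as fractions of pipeline slots.
CEILINGS = {
    "bad_speculation": 0.05,
    "front_end_bound": 0.15,
    "back_end_bound": 0.30,
}


def ipc(instructions_retired: int, cycles: int) -> float:
    """Instructions per clock; CPI is simply 1 / IPC."""
    return instructions_retired / cycles


def flag_bottlenecks(fractions: dict[str, float]) -> list[str]:
    """Return the categories above their ceiling, worst excess first."""
    excess = {
        name: fractions[name] - limit
        for name, limit in CEILINGS.items()
        if fractions[name] > limit
    }
    return sorted(excess, key=excess.get, reverse=True)


# Hypothetical summary values, not measured data.
sample = {"retiring": 0.35, "bad_speculation": 0.04,
          "front_end_bound": 0.10, "back_end_bound": 0.51}
print(ipc(3_000_000, 2_000_000))  # 1.5
print(flag_bottlenecks(sample))   # ['back_end_bound']
```

A flagged category is a pointer to the next level of the breakdown, not a verdict: the ceilings are guidelines and vary by workload.
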
Pipeline Analysis (Top-Down Methodology):
├── Retiring (good, useful work)
├── Bad Speculation (branch mispredictions)
├── Front-End Bound
│   ├── Fetch Latency (I-cache misses, branch resteers)
│   └── Fetch Bandwidth
└── Back-End Bound
    ├── Memory Bound
    │   ├── L1 Bound → stalls on loads that still hit L1
    │   ├── L2 Bound → loads that miss L1 and are served by L2
    │   ├── L3 Bound → loads that miss L2 and are served by L3
    │   └── DRAM Bound → loads served by main memory (bandwidth or latency limited)
    └── Core Bound → ALU/compute bound

# Collect memory access profile
vtune -collect memory-access -r mem_result ./prog
# Key output sections:
# - Memory Bound: % time waiting for memory
# - LLC (Last Level Cache) Miss Rate
# - DRAM Bandwidth: GB/s achieved vs theoretical peak
# - NUMA: cross-socket accesses (for multi-socket systems)

DRAM Bandwidth: 18.4 GB/s
Peak Theoretical: 51.2 GB/s
Utilization: 36%, likely not DRAM-bound

# Download AMD uProf
# https://www.amd.com/en/developer/uprof.html
# CLI profiling
AMDuProfCLI collect --config tbp ./prog # time-based profiling
AMDuProfCLI collect --config assess ./prog # microarchitecture assessment
AMDuProfCLI collect --config memory ./prog # memory access
# Generate report
AMDuProfCLI report -i /tmp/uprof_result/ -o report.html
# Open GUI
AMDuProf &

- Retired Instructions
- Branch Mispredictions
- L1/L2/L3 Cache Misses
- Data Cache Accesses

Performance (GFLOP/s)
     |              _______________
Peak |             /
Perf |            /   compute bound
     |           /
     |          /
     |         /   memory bandwidth bound
     |        /
     +------------------------------→
       Arithmetic Intensity (FLOP/byte)

# VTune roofline collection
vtune -collect hpc-performance -r roofline_result ./prog
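# Then: open the result in the VTune GUI for FLOP rates, vectorization, and memory-bound metrics
# Note: the interactive Roofline chart is an Intel Advisor feature (advisor --collect=roofline)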
# For manual calculation:
# Arithmetic Intensity = FLOPs executed / bytes moved to and from memory
# Peak FLOP/s = sockets × cores_per_socket × frequency × FLOPs_per_cycle_per_core
# Peak BW = from hardware spec (e.g., 51.2 GB/s for DDR4-3200 dual channel)
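
The manual calculation above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a measurement: the machine (1 socket, 8 cores, 3.0 GHz, 16 double-precision FLOPs per cycle per core) is hypothetical, the function names are made up for this example, and the DRAM figure assumes a 64-bit (8-byte) bus per channel.

```python
def peak_gflops(sockets, cores_per_socket, ghz, flops_per_cycle_per_core):
    """Compute roof in GFLOP/s."""
    return sockets * cores_per_socket * ghz * flops_per_cycle_per_core


def peak_dram_gbs(mega_transfers_per_s, channels, bytes_per_transfer=8):
    """Memory roof in GB/s (assumes 8 bytes per transfer per channel)."""
    return mega_transfers_per_s * bytes_per_transfer * channels / 1000.0


def attainable_gflops(arithmetic_intensity, peak_flops, peak_bw):
    """Roofline: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, arithmetic_intensity * peak_bw)


# Hypothetical machine; DDR4-3200 dual channel as in the comment above.
compute_roof = peak_gflops(1, 8, 3.0, 16)
memory_roof = peak_dram_gbs(3200, 2)
ridge = compute_roof / memory_roof  # intensity where the two roofs meet

print(f"compute roof: {compute_roof:.1f} GFLOP/s")  # compute roof: 384.0 GFLOP/s
print(f"memory roof:  {memory_roof:.1f} GB/s")      # memory roof:  51.2 GB/s
print(f"ridge point:  {ridge:.1f} FLOP/byte")       # ridge point:  7.5 FLOP/byte

# A kernel at 0.25 FLOP/byte sits far left of the ridge: bandwidth bound.
print(f"{attainable_gflops(0.25, compute_roof, memory_roof):.1f} GFLOP/s")  # 12.8 GFLOP/s

# Achieved vs. peak bandwidth, using the 18.4 GB/s figure from the sample output.
print(f"{18.4 / memory_roof:.0%}")  # 36%
```

A kernel whose arithmetic intensity is below the ridge point is limited by memory bandwidth; above it, by compute throughput.
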
# likwid-perfctr for manual roofline data (Linux)
likwid-perfctr -C 0 -g FLOPS_DP ./prog # double-precision FLOPS
likwid-perfctr -C 0 -g MEM ./prog # memory bandwidth

- skills/profilers/hardware-counters
- skills/profilers/linux-perf
- skills/low-level-programming/cpu-cache-opt
- skills/low-level-programming/simd-intrinsics