| Dependency | Version | Notes |
|---|---|---|
| CUDA Toolkit | >=11.0 | Includes `ncu` |
| `ncu` | Match CUDA version | Or set `NCU` env var |
| NVIDIA GPU | Kepler+ | Volta+ recommended |

`ncu` needs access to GPU performance counters, which may require elevated privileges: run it with `sudo`, grant `CAP_SYS_ADMIN`, or start the container with `--privileged`. Verify the installation with `ncu -v`.

| Compute % | Memory % | Bottleneck | Next Step |
|---|---|---|---|
| >60 | <40 | Compute-bound | ComputeWorkloadAnalysis section |
| <40 | >60 | Memory-bound | MemoryWorkloadAnalysis section |
| <40 | <40 | Latency-bound | LaunchStats + Occupancy sections |
| 40-60 | 40-60 | Balanced | Profile deeper with detailed sections |
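
The thresholds in the table above can be expressed as a small helper. This is a minimal sketch, not part of the original guide: the function name `classify_bottleneck` and the `"unclassified"` fallback for combinations the table does not list (for example both values above 60) are assumptions made for illustration.

```python
def classify_bottleneck(compute_pct: float, memory_pct: float) -> str:
    """Map SpeedOfLight compute/memory percentages to a bottleneck label."""
    if compute_pct > 60 and memory_pct < 40:
        return "compute-bound"
    if memory_pct > 60 and compute_pct < 40:
        return "memory-bound"
    if compute_pct < 40 and memory_pct < 40:
        return "latency-bound"
    if 40 <= compute_pct <= 60 and 40 <= memory_pct <= 60:
        return "balanced"
    # Assumption: the table does not cover every combination, so anything
    # else is reported as unclassified rather than guessed.
    return "unclassified"


print(classify_bottleneck(78.5, 35.2))
```

Keeping the fallback explicit avoids silently labelling a kernel that the table does not actually describe.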

SOL% (Speed of Light) rating:

| SOL% | Level | Action |
|---|---|---|
| >80% | Excellent | Minor tuning only |
| 60-80% | Good | Targeted optimization |
| 40-60% | Fair | Significant optimization needed |
| <40% | Poor | Major rework needed |
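
The rating bands above translate directly into a lookup. A minimal sketch, assuming a hypothetical helper named `sol_level`; the table leaves the boundary values ambiguous (60 appears in two bands), so this sketch assigns exactly 60 and exactly 80 to "Good" and exactly 40 to "Fair", which is an assumption.

```python
from typing import Tuple


def sol_level(sol_pct: float) -> Tuple[str, str]:
    """Return (level, action) for a Speed of Light percentage."""
    if sol_pct > 80:
        return ("Excellent", "Minor tuning only")
    if sol_pct >= 60:  # assumption: 60 and 80 both count as "Good"
        return ("Good", "Targeted optimization")
    if sol_pct >= 40:  # assumption: 40 counts as "Fair"
        return ("Fair", "Significant optimization needed")
    return ("Poor", "Major rework needed")


print(sol_level(72.0))
```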

Select what to collect with `--section` or `--set` (for example `--set basic` or `--set detailed`).

| Tool | Scope | Overhead | Purpose |
|---|---|---|---|
| nsys | System-level | 5-10% | Find which kernels to optimize |
| ncu | Kernel-level | 10-100x slower | Understand why a kernel is slow |

Check that `ncu` is installed:

```bash
ncu -v
```
If `ncu` is not found, ensure the CUDA Toolkit is installed or set the `NCU` env var to the binary path.

Run a first pass with the `SpeedOfLight` section:

```bash
ncu --section SpeedOfLight --csv \
  --kernel-name regex:"KERNEL" \
  --launch-skip 5 --launch-count 3 \
  -- COMMAND
```

Compare `Compute (SM) Throughput` and `Memory Throughput` in the output to classify the kernel, then add sections accordingly:

| Classification | Sections to Add |
|---|---|
| Compute-bound | `ComputeWorkloadAnalysis` |
| Memory-bound | `MemoryWorkloadAnalysis` |
| Latency-bound | `LaunchStats`, `Occupancy` |
| Warp stalls | |
| Need instruction breakdown | |
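
Choosing extra sections from a classification can be scripted. The following is a minimal sketch, not the guide's own tooling: `build_ncu_command` and `SECTIONS_BY_CLASS` are hypothetical names, and the mapping covers only the three pairings stated in the bottleneck table earlier in this document (compute-bound, memory-bound, latency-bound). It uses only flags that appear in the commands shown here.

```python
import shlex
from typing import List, Optional, Sequence

# Assumption: only the three mappings stated in the bottleneck table.
SECTIONS_BY_CLASS = {
    "compute-bound": ["ComputeWorkloadAnalysis"],
    "memory-bound": ["MemoryWorkloadAnalysis"],
    "latency-bound": ["LaunchStats", "Occupancy"],
}


def build_ncu_command(
    kernel_regex: str,
    command: Sequence[str],
    classification: Optional[str] = None,
    launch_skip: int = 0,
    launch_count: int = 3,
) -> List[str]:
    """Assemble an ncu argv list: SpeedOfLight plus class-specific sections."""
    argv = ["ncu", "--section", "SpeedOfLight"]
    for section in SECTIONS_BY_CLASS.get(classification, []):
        argv += ["--section", section]
    argv += ["--csv", "--kernel-name", f"regex:{kernel_regex}"]
    if launch_skip:
        argv += ["--launch-skip", str(launch_skip)]
    argv += ["--launch-count", str(launch_count)]
    argv += ["--", *command]
    return argv


# Print a shell-ready command line for a compute-bound GEMM kernel.
print(shlex.join(build_ncu_command("gemm", ["python", "script.py"], "compute-bound")))
```

Returning an argv list rather than a string keeps the result safe to pass to `subprocess.run` without shell quoting concerns.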

Memory-bound example:

```bash
ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
  --kernel-name regex:"embedding_lookup" \
  --launch-count 3 \
  -- python script.py
```

Compute-bound example:

```bash
ncu --section SpeedOfLight --section ComputeWorkloadAnalysis --csv \
  --kernel-name regex:"gemm" \
  --launch-count 3 -- python script.py
```

Latency-bound example (`LaunchStats` and `Occupancy`):

```bash
ncu --section SpeedOfLight --section LaunchStats --section Occupancy --csv \
  --kernel-name regex:"small_kernel" \
  -- python script.py
```

Roofline chart:

```bash
ncu --section SpeedOfLight_RooflineChart \
  --kernel-name regex:"KERNEL" -- COMMAND
```

Interpretation: kernel left of ridge point = memory-bound; right = compute-bound;
far below both roofs = latency/occupancy issue. See `references/roofline-analysis.md`.
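
The roofline reading above can be checked numerically. This is a minimal sketch under stated assumptions: the ridge point is taken as peak FLOP/s divided by peak memory bandwidth, the peak figures are illustrative placeholders rather than values from this guide, and the 0.5 cutoff for "far below both roofs" is a hypothetical threshold chosen for the example.

```python
def roofline_classify(
    flops: float,
    bytes_moved: float,
    seconds: float,
    peak_flops: float,
    peak_bw: float,
    far_below: float = 0.5,  # assumption: cutoff for "far below both roofs"
) -> str:
    """Classify a kernel by its position relative to the roofline."""
    intensity = flops / bytes_moved      # arithmetic intensity, FLOP per byte
    ridge = peak_flops / peak_bw         # ridge point, FLOP per byte
    attainable = min(peak_flops, intensity * peak_bw)  # roof at this intensity
    achieved = flops / seconds           # measured FLOP/s
    if achieved < far_below * attainable:
        return "latency/occupancy issue"
    return "memory-bound" if intensity < ridge else "compute-bound"


# Illustrative peaks only (assumed): 19.5 TFLOP/s and 1.555 TB/s.
PEAK_FLOPS = 19.5e12
PEAK_BW = 1.555e12

print(roofline_classify(1e9, 1e9, 1e-3, PEAK_FLOPS, PEAK_BW))
```

Real peak values depend on the GPU and data type, so substitute the figures for the device being profiled.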

For optimization guidance by bottleneck type, see `references/bottleneck-guide.md`. After optimizing, re-profile to confirm the change:

```bash
ncu --section SpeedOfLight --csv \
  --kernel-name regex:"optimized_kernel" \
  --launch-count 3 \
  -- python optimized_script.py
```

To profile only a region of interest, bracket it with `cudaProfilerStart()` and `cudaProfilerStop()` (in PyTorch, call `torch.cuda.synchronize()` before starting) and pass `--profile-from-start off`:

```bash
ncu --profile-from-start off --section SpeedOfLight --csv \
  --kernel-name regex:"target_kernel" \
  --launch-count 3 -- python script.py
```

Alternatively, skip warmup launches with `--launch-skip N`. See `references/advanced-profiling.md`.

`.ncu-rep` reports can be read from Python with the `ncu_report` module, which ships under `extras/python/` in the Nsight Compute installation:

```python
import ncu_report  # requires extras/python/ on PYTHONPATH

ctx = ncu_report.load_report("report.ncu-rep")
for rng in ctx:
    for action in rng:  # one action per profiled kernel launch
        name = action.name()
        compute = action["sm__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
        memory = action["dram__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
        duration = action["gpu__time_duration.sum"].as_uint64()  # nanoseconds
        # Thresholds follow the bottleneck classification table.
        if compute > 60:
            classification = "compute-bound"
        elif memory > 60:
            classification = "memory-bound"
        elif compute < 40 and memory < 40:
            classification = "latency-bound"
        else:
            classification = "balanced"
        print(f"{name}: {classification} (compute={compute:.1f}%, mem={memory:.1f}%, {duration}ns)")
```

See `references/python-report-api.md`.

Print CSV directly:

```bash
ncu --csv --section SpeedOfLight --kernel-name regex:"KERNEL" -- COMMAND
ncu --csv --page raw --section SpeedOfLight -- COMMAND  # All metrics flat
```

Save a report, then export it:

```bash
ncu -o report --section SpeedOfLight -- COMMAND
ncu --import report.ncu-rep --csv --page raw  # Export to CSV
```

| Column | Meaning |
|---|---|
| `Kernel Name` | CUDA kernel function name |
| `Duration` | Execution time (nanoseconds) |
| `Compute (SM) Throughput` | % of peak compute |
| `Memory Throughput` | % of peak memory bandwidth |
| | Active warps / max warps (%) |
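
CSV output of this kind is straightforward to post-process. The sketch below is illustrative only: it assumes a wide CSV whose header uses the names `Kernel Name`, `Duration`, `Compute (SM) Throughput`, and `Memory Throughput`, as in this document's sample output, and the helper name `summarize` is hypothetical. Real `ncu --csv` output may be laid out differently depending on the page and version, so adjust the column names to what the tool actually emits.

```python
import csv
import io

# Sample rows copied from this document's example output.
SAMPLE = '''"Kernel Name","Duration","Compute (SM) Throughput","Memory Throughput"
"ampere_fp16_gemm",1250000,78.5,35.2
'''


def summarize(csv_text: str) -> list:
    """Parse the CSV text into one dict per kernel row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "kernel": row["Kernel Name"],
            "duration_ms": float(row["Duration"]) / 1e6,  # ns -> ms
            "compute_pct": float(row["Compute (SM) Throughput"]),
            "memory_pct": float(row["Memory Throughput"]),
        })
    return rows


for entry in summarize(SAMPLE):
    print(entry)
```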

Example: triaging a GEMM kernel.

```bash
ncu --section SpeedOfLight --csv \
  --kernel-name regex:"gemm" \
  --launch-skip 5 --launch-count 3 \
  -- python train.py
```

```
"Kernel Name","Duration","Compute (SM) Throughput","Memory Throughput"
"ampere_fp16_gemm",1250000,78.5,35.2
```

Compute throughput is 78.5% and memory throughput is 35.2%, so the kernel is compute-bound; add `--section ComputeWorkloadAnalysis` for the next pass.

Example: profiling an embedding kernel with memory analysis.

```bash
ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
  --kernel-name regex:"embedding" \
  --launch-count 3 -- python train.py
```

| Error | Cause | Fix |
|---|---|---|
| | Not in PATH | |
| | Needs elevated privileges | |
| No kernels captured | Name regex doesn't match | Run without |
| Profiling extremely slow | Using | Use |
| Autotuning pollutes results | JIT kernel warmup captured | Use |
| Metrics show 0% tensor cores | Kernel doesn't use tensor cores | Check with |
| Report file too large | | Use targeted sections; limit with |
| Out-of-range metric values | Async GPU activity or short kernels | Profile on isolated GPU; increase workload size |
| | Dependent kernels across ranks | Use |

Reference files in `references/`: `references/cli-reference.md`, `references/metrics-guide.md`, `references/sections-guide.md` (`--section`), `references/bottleneck-guide.md`, `references/memory-analysis.md`, `references/roofline-analysis.md`, `references/advanced-profiling.md`, `references/python-report-api.md` (`ncu_report`).