CUDA Programming Skill

Core Philosophy

- **Measure before guessing.** GPU performance is deeply counterintuitive. Profile first, hypothesize second, change third, verify fourth.
- **Small, isolated changes.** CUDA bugs compound. Make one change, test it, commit it. Resist the urge to "fix everything at once."
- **printf is your strongest tool.** When debuggers fail, when tools produce inscrutable output, printf in device code reveals truth. Don't be embarrassed to use it extensively.
- **Sometimes, stare at the diff.** Inscrutable segfaults are common. Tools often don't help. The human approach: minimize the diff, read it carefully, see the bug. This is legitimate and often faster than tooling.

Debugging Workflow

First Response to a Bug

1. **Reproduce minimally** — Isolate the failing kernel with the smallest possible input
2. **Add printf** — Before any tool, add `printf` in device code to trace execution
3. **Run compute-sanitizer** — Catch memory errors non-interactively:

   ```bash
   compute-sanitizer --tool memcheck ./your_program
   compute-sanitizer --tool racecheck ./your_program  # for race conditions
   compute-sanitizer --tool initcheck ./your_program  # uninitialized memory
   ```

4. **If still stuck**, try cuda-gdb non-interactively for a backtrace:

   ```bash
   cuda-gdb -batch -ex "run" -ex "bt" ./your_program
   ```

5. **When tools fail** — Minimize the diff between working and broken code. Read it. The bug is in the diff.
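The "reproduce minimally" step is worth making concrete. This is a sketch of a self-contained repro harness, not part of this skill's references: the kernel and sizes are placeholders, and the `CUDA_CHECK` macro is an illustrative helper so that every API error surfaces at the call site instead of silently propagating.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper: abort with file/line on any CUDA error.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);      \
            exit(1);                                                    \
        }                                                               \
    } while (0)

__global__ void suspectKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}

int main() {
    const int n = 32;  // smallest input that still reproduces the bug
    float* d_data;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(float)));
    suspectKernel<<<1, 32>>>(d_data, n);
    CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches async kernel errors
    CUDA_CHECK(cudaFree(d_data));
    printf("repro ran to completion\n");
    return 0;
}
```

A harness this small is also the ideal input for compute-sanitizer and cuda-gdb in the steps that follow.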

printf in Device Code

```cuda
__global__ void myKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx == 0) {  // Limit output
        printf("Kernel launched, n=%d, data[0]=%f\n", n, data[0]);
    }
    // ... kernel logic ...
    if (idx < 10) {  // Sample a few threads
        printf("Thread %d: result=%f\n", idx, someValue);
    }
}
```
Key patterns:

- Guard with `if (idx == 0)` or `if (idx < N)` to avoid output flood
- Print at kernel entry to confirm launch
- Print intermediate values at suspected failure points
- Flush is automatic at kernel completion

compute-sanitizer Quick Reference

**Common gotcha:** "Invalid shared write... out of bounds" usually means insufficient dynamic shared memory allocation in the kernel launch, not wrong array indexing. Check `<<<grid, block, smem_size>>>`.
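The gotcha above can be sketched with a hypothetical kernel: anything declared `extern __shared__` gets its size from the third launch parameter, so forgetting (or undersizing) it produces exactly this sanitizer report even though the indexing is correct.

```cuda
__global__ void reduceKernel(const float* in, float* out, int n) {
    extern __shared__ float tile[];  // size comes from the launch config
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-bounds shared write if smem_size was omitted or too small:
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();
    // ... reduction over tile ...
}

// Wrong: no dynamic shared memory -> "Invalid __shared__ write"
// reduceKernel<<<grid, block>>>(d_in, d_out, n);

// Right: one float per thread in the block
// reduceKernel<<<grid, block, block.x * sizeof(float)>>>(d_in, d_out, n);
```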

Memory errors (most common):

```bash
compute-sanitizer --tool memcheck ./program
```

Other tools: racecheck, initcheck, synccheck

For detailed options, see references/debugging-tools.md

cuda-gdb Non-Interactive

Get a backtrace on crash:

```bash
cuda-gdb -batch -ex "run" -ex "bt" ./program
```

For breakpoints, thread inspection, see references/debugging-tools.md

**Compile with debug info:**

```bash
nvcc -g -G -lineinfo program.cu -o program
```

cuobjdump for Binary Inspection

Dump PTX and SASS:

```bash
cuobjdump -ptx ./program
cuobjdump -sass ./program
```

For resource usage, symbol listing, see references/debugging-tools.md

**For complete debugging tool reference:** See `references/debugging-tools.md` for detailed compute-sanitizer options, cuda-gdb workflows, and cuobjdump analysis patterns.

Performance Optimization Workflow

Golden Rule

**Never optimize without profiling first.** Intuition about GPU bottlenecks is almost always wrong. The profile → fix → verify loop is the actual optimization work, not a preliminary step.

Performance Investigation Steps

1. **Establish baseline** — Time the operation, record it
2. **Profile with nsys** — Get a timeline, identify which kernels matter
3. **Deep-dive with ncu** — Analyze specific bottleneck kernels
4. **Hypothesize** — Based on metrics, form a specific hypothesis
5. **Change one thing** — Make a single targeted change
6. **Verify** — Re-profile, confirm improvement
7. **Repeat**
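Step 1, establishing a baseline, can be done with CUDA events, which time GPU work without CPU-side noise. A minimal sketch; the kernel, its body, and the sizes are placeholders, not part of this skill:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void workKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = data[idx] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    workKernel<<<(n + 255) / 256, 256>>>(d_data, n);  // operation being baselined
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // block until 'stop' has happened on the GPU

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("baseline: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Record the number somewhere versioned; every later "verify" step compares against it.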

nsys (Nsight Systems) — Timeline Profiling

Use nsys for: "Where is time being spent?" — CPU/GPU interaction, kernel launch patterns, memory transfers, overall timeline.

Basic profile:

```bash
nsys profile -o report ./program
nsys stats report.nsys-rep --report cuda_gpu_kern_sum
```

With NVTX markers:

```bash
nsys profile --trace=cuda,nvtx -o report ./program
```

Key reports: cuda_gpu_kern_sum, cuda_api_sum, cuda_gpu_mem_time_sum, nvtx_sum

For detailed usage, see references/nsys-guide.md

**For detailed nsys analysis patterns:** See `references/nsys-guide.md` for timeline interpretation, identifying common bottlenecks, and analysis workflows.

ncu (Nsight Compute) — Kernel Analysis

Use ncu for: "Why is this kernel slow?" — Detailed metrics, roofline, memory analysis, occupancy.

Profile a specific kernel:

```bash
ncu --kernel-name "myKernel" -o report ./program
```

Quick summary to stdout:

```bash
ncu --set basic ./program
```

Sets: basic, full, memory, launch, roofline

Sections: ComputeWorkloadAnalysis, MemoryWorkloadAnalysis, Occupancy

For detailed metrics and interpretation, see references/ncu-guide.md

**Warning:** ncu expert system recommendations can be misleading. Always verify with actual metrics and experiments.

**Scale matters:** Optimizations that help at large scale can hurt at small scale. Always profile at your actual problem size, not theoretical maximums.

**For detailed ncu metric interpretation:** See `references/ncu-guide.md` for understanding roofline analysis, memory bottlenecks, occupancy limits, and warp scheduling.
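When ncu reports occupancy limited by register pressure, one single-change experiment is `__launch_bounds__`, which caps registers per thread for a promised block size. The kernel and the bound values below are illustrative, not a recommendation:

```cuda
// Promise the compiler: at most 256 threads per block, and aim for at
// least 4 resident blocks per SM. The compiler may spill registers to
// local memory to meet this, so re-profile: this can help or hurt.
__global__ void __launch_bounds__(256, 4)
tunedKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // ... kernel logic ...
    }
}
```

Confirm the effect with `--ptxas-options=-v` (register count) and a fresh ncu occupancy section, not with intuition.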


NVTX for Custom Instrumentation

When you need finer granularity than kernel-level, use NVTX:

```cuda
#include <nvtx3/nvToolsExt.h>

nvtxRangePush("Operation Name");
// ... code to profile ...
nvtxRangePop();
```

Compile with `-lnvToolsExt`; profile with `nsys profile --trace=cuda,nvtx`.

**For complete patterns:** See `references/nvtx-patterns.md` for nested ranges, colors, and analysis workflows.
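As a minimal illustration of nesting (the function and range names are made up for this sketch): ranges may be nested freely, and each push must be matched by a pop on the same host thread.

```cuda
#include <nvtx3/nvToolsExt.h>

void trainStep() {
    nvtxRangePush("train_step");      // outer range spans the whole step

    nvtxRangePush("h2d_copy");
    // ... cudaMemcpyAsync host-to-device ...
    nvtxRangePop();                   // closes "h2d_copy"

    nvtxRangePush("forward");
    // ... kernel launches ...
    nvtxRangePop();                   // closes "forward"

    nvtxRangePop();                   // closes "train_step"
}
```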

Common Performance Patterns

| Symptom | Likely Cause | Investigation |
|---|---|---|
| Low GPU utilization | Kernel launch overhead, CPU bottleneck | nsys timeline, look for gaps |
| Memory bound | Poor access patterns, low cache hit | ncu memory section, check coalescing |
| Compute bound but slow | Low occupancy, register pressure | ncu occupancy, reduce registers |
| Lots of small kernels | Launch overhead dominates | nsys timeline, consider fusion |
| High memcpy time | Excessive H2D/D2H transfers | nsys cuda_gpu_mem, batch transfers |
| Most cycles stalled | Bank conflicts, memory stalls | ncu SchedulerStatistics, check shared memory |
| High sectors/request | Poor coalescing (>4 sectors/req) | ncu memory metrics, use vectorized loads |
**Critical traps:** Bank conflicts and memory coalescing issues often dominate performance but aren't obvious without profiling. See `references/performance-traps.md` for detailed diagnosis and fixes.

**Reality check:** Budget 80% of optimization time for problems you didn't predict. Profile-driven iteration discovers the real bottlenecks.
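The "high sectors/request" row can be sketched with two hypothetical kernels. In the strided version, neighboring threads touch addresses far apart, so each memory request pulls many 32-byte sectors for little useful data; the `float4` version keeps neighboring threads on neighboring addresses (assumes the element count is divisible by 4 and the pointers are 16-byte aligned).

```cuda
// Strided access: thread i reads data[i * stride]. Each request spans
// many sectors that serve only one thread, so sectors/request is high.
__global__ void stridedRead(const float* data, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[i * stride];
}

// Coalesced + vectorized: consecutive threads read consecutive float4s,
// so every 128-byte transaction is fully used.
__global__ void vectorizedRead(const float4* data, float4* out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) out[i] = data[i];
}
```

Verify the difference in ncu's memory workload section rather than assuming it; at small problem sizes the gap may be hidden by launch overhead.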

Compilation Reference

Debug build:

```bash
nvcc -g -G -lineinfo -O0 program.cu -o program_debug
```

Release build:

```bash
nvcc -O3 -lineinfo program.cu -o program
```

Specific architecture:

```bash
nvcc -arch=sm_80 program.cu -o program  # Ampere
nvcc -arch=sm_89 program.cu -o program  # Ada Lovelace
nvcc -arch=sm_90 program.cu -o program  # Hopper
```

Generate PTX (to inspect it):

```bash
nvcc -ptx program.cu
```

Verbose compilation (see register usage):

```bash
nvcc --ptxas-options=-v program.cu
```

With NVTX:

```bash
nvcc program.cu -lnvToolsExt -o program
```

**Always compile with `-lineinfo` for production profiling** — minimal overhead, enables source correlation.

Local API Documentation

Complete reference documentation is available for grep-based search:

**PTX ISA 9.1** — `references/ptx-docs/` (405 files, 2.3MB)
- Search guide: `references/ptx-isa.md`
- Use for: instruction-level optimization, inline PTX, TensorCore operations (WMMA, WGMMA, TMA), memory swizzling

**CUDA Runtime API 13.1** — `references/cuda-runtime-docs/` (107 files, 0.9MB)
- Search guide: `references/cuda-runtime.md`
- Use for: error codes, API parameters, device properties (`cudaDeviceProp`), memory management, stream behavior

**CUDA Driver API 13.1** — `references/cuda-driver-docs/` (128 files, 0.8MB)
- Search guide: `references/cuda-driver.md`
- Use for: context management (`cuCtxCreate`), module loading (`cuModuleLoad`), virtual memory, Driver errors (`CUDA_ERROR_*`), advanced features

Each search guide contains grep examples, documentation structure, and common usage patterns.

**Search strategy:** Use grep/ripgrep to search directly in the `*-docs/` directories. The search guides (`.md` files) provide navigation patterns and common queries.

Additional References

- `references/performance-traps.md` — Bank conflicts, memory coalescing, scale-dependent optimizations
- `references/debugging-tools.md` — compute-sanitizer, cuda-gdb, cuobjdump detailed usage
- `references/nsys-guide.md` — nsys timeline analysis and bottleneck identification
- `references/ncu-guide.md` — ncu metrics, roofline, occupancy interpretation
- `references/nvtx-patterns.md` — NVTX instrumentation and profiling patterns

Checklist Before Optimizing

- [ ] Established a reproducible baseline timing
- [ ] Profiled with nsys to identify hotspots
- [ ] Know which kernel(s) dominate runtime
- [ ] Profiled the target kernel with ncu
- [ ] Identified the specific bottleneck (memory? compute? latency?)
- [ ] Formed a specific, testable hypothesis
- [ ] Plan to change ONE thing