cuda-skill

CUDA & PTX Reference

Documentation Locations

All documentation is under the references/ directory within this skill's install location. The base path depends on which agent tool is used:
  • Cursor: ~/.cursor/skills/cuda-skill/references/
  • Claude Code: ~/.claude/skills/cuda-skill/references/
  • Codex: ~/.agents/skills/cuda-skill/references/
Determine the actual path at runtime:
bash
CUDA_REFS="$(dirname "$(find ~/.cursor/skills ~/.claude/skills ~/.agents/skills -name 'cuda-skill' -type d 2>/dev/null | head -1)")/cuda-skill/references"
All rg examples below use ~/.cursor/skills/cuda-skill/references/ as a placeholder. Replace it with the actual path.
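The find one-liner above still yields a path (./cuda-skill/references) when nothing is installed. As a minimal sketch of a stricter alternative, the hypothetical helper below checks the three install roots in order and fails loudly when none contains the skill. The function name resolve_cuda_refs is an assumption for illustration, not part of the skill.

```bash
# Hypothetical helper (not shipped with the skill): print the first existing
# references/ directory, or fail with a message on stderr.
resolve_cuda_refs() {
    # With no arguments, fall back to the three install roots listed above.
    if [ "$#" -eq 0 ]; then
        set -- "$HOME/.cursor/skills" "$HOME/.claude/skills" "$HOME/.agents/skills"
    fi
    for root in "$@"; do
        if [ -d "$root/cuda-skill/references" ]; then
            printf '%s\n' "$root/cuda-skill/references"
            return 0
        fi
    done
    echo "cuda-skill references directory not found" >&2
    return 1
}
```

Typical use would be CUDA_REFS="$(resolve_cuda_refs)" || exit 1, after which "$CUDA_REFS" can stand in for the placeholder path.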
references/
├── ptx-docs/              # PTX ISA 9.1 full spec (405 files, 2.3MB)
├── ptx-simple/            # PTX condensed quick-ref (13 files, 149KB)
├── cuda-runtime-docs/     # CUDA Runtime API 13.1 (107 files, 0.9MB)
├── cuda-driver-docs/      # CUDA Driver API 13.1 (128 files, 0.8MB)
├── cuda-guide/            # CUDA Programming Guide v13.1 (39 pages, 1.6MB)
│   ├── 01-introduction/   # Programming model, CUDA platform
│   ├── 02-basics/         # CUDA C++, kernels, async, memory, nvcc
│   ├── 03-advanced/       # Advanced APIs, kernel programming, driver API, multi-GPU
│   ├── 04-special-topics/ # Graphs, Unified Memory, Coop Groups, TMA, etc.
│   ├── 05-appendices/     # Compute Capabilities, C++ extensions, math funcs
│   └── INDEX.md
├── best-practices-guide/  # CUDA C++ Best Practices Guide
├── ncu-docs/              # Nsight Compute full docs (ProfilingGuide, CLI, etc.)
├── nsys-docs/             # Nsight Systems full docs (UserGuide, etc.)
├── ptx-isa.md             # PTX search guide
├── cuda-runtime.md        # Runtime API search guide
├── cuda-driver.md         # Driver API search guide
├── nsys-guide.md          # Nsight Systems quick reference
├── ncu-guide.md           # Nsight Compute quick reference
├── debugging-tools.md     # compute-sanitizer, cuda-gdb
├── nvtx-patterns.md       # NVTX instrumentation
└── performance-traps.md   # Bank conflicts, coalescing

ptx-simple/ Contents (Condensed Quick-Ref)

ptx-simple/
├── ptx-isa-arithmetic.md       # add, sub, mul, mad, fma, div, min, max
├── ptx-isa-data-types.md       # Types, cvt, rounding, pack
├── ptx-isa-memory-spaces.md    # .reg, .global, .shared, fences
├── ptx-isa-load-store.md       # ld, st, prefetch
├── ptx-isa-control-flow.md     # @p, setp, bra, call, ret, exit
├── ptx-isa-tensor-cores.md     # mma.sync, ldmatrix, wgmma
├── ptx-isa-async-copy.md       # cp.async, cp.async.bulk, TMA
├── ptx-isa-barriers.md         # bar.sync, mbarrier
├── ptx-isa-warp-ops.md         # shfl, vote, match, redux
├── ptx-isa-cache-hints.md      # Cache control
├── ptx-isa-sm90-hopper.md      # Hopper-specific (sm_90)
├── ptx-isa-sm100-blackwell.md  # Blackwell-specific (sm_100, tcgen05)
└── ptx-isa-misc.md             # Other instructions

Search Strategy

Use the Grep tool to search the documentation. Never load entire files into context.
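One way to keep search output from flooding the context is to cap the number of matching lines. The sketch below is an illustration under assumptions: the function name search_refs and the default cap of 20 lines are hypothetical choices, and it falls back to grep when rg is not installed.

```bash
# Hypothetical bounded search: print at most LIMIT matching lines.
# Usage: search_refs PATTERN DIR [LIMIT]
search_refs() {
    pattern=$1
    dir=$2
    limit=${3:-20}   # assumed default cap; tune to taste
    if command -v rg >/dev/null 2>&1; then
        rg --line-number --no-heading -- "$pattern" "$dir" | head -n "$limit"
    else
        grep -rn -- "$pattern" "$dir" | head -n "$limit"
    fi
}
```

For example, search_refs "swizzle_mode" "$CUDA_REFS/ptx-docs" 10 would show only the first ten hits, assuming CUDA_REFS holds the resolved references path.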

PTX Instruction Lookup

bash
# Find specific instruction
rg "mbarrier.init" ~/.cursor/skills/cuda-skill/references/ptx-docs/9-instruction-set/

# Find WGMMA register fragments
rg "register fragment" ~/.cursor/skills/cuda-skill/references/ptx-docs/9-instruction-set/ | rg -i wgmma

# Find TMA swizzling modes
rg "swizzle_mode" ~/.cursor/skills/cuda-skill/references/ptx-docs/

# Quick PTX syntax lookup (condensed)
rg "wgmma" ~/.cursor/skills/cuda-skill/references/ptx-simple/ptx-isa-tensor-cores.md
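The condensed and full PTX trees can be combined into a two-stage lookup: try ptx-simple/ first and fall back to the full instruction set only when the quick reference has no hit. The sketch below is one possible arrangement, not something the skill provides. The name ptx_lookup and the 20-line cap are assumptions, and it uses grep -F (rg -F behaves the same way) so that the dot in a name like mbarrier.init is matched literally rather than as a regex wildcard.

```bash
# Hypothetical two-stage PTX lookup.
# Usage: ptx_lookup PATTERN REFS_DIR
ptx_lookup() {
    pattern=$1
    refs=$2
    # Stage 1: condensed quick-ref. The trailing "grep ." passes lines through
    # and makes the pipeline succeed only if at least one line was found.
    if grep -rnF -- "$pattern" "$refs/ptx-simple" 2>/dev/null | head -n 20 | grep .; then
        return 0
    fi
    # Stage 2: full instruction-set docs; returns non-zero when nothing matches.
    grep -rnF -- "$pattern" "$refs/ptx-docs/9-instruction-set" 2>/dev/null | head -n 20 | grep .
}
```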

CUDA Runtime API Lookup

bash
# Error code meaning
rg "cudaErrorInvalidValue" ~/.cursor/skills/cuda-skill/references/cuda-runtime-docs/

# Function documentation
rg -A 20 "cudaStreamSynchronize" ~/.cursor/skills/cuda-skill/references/cuda-runtime-docs/modules/group__cudart__stream.md

# Struct fields
rg "" ~/.cursor/skills/cuda-skill/references/cuda-runtime-docs/data-structures/structcudadeviceprop.md

CUDA Driver API Lookup

bash
# Context management
rg -A 20 "cuCtxCreate" ~/.cursor/skills/cuda-skill/references/cuda-driver-docs/modules/group__cuda__ctx.md

# Module loading
rg "cuModuleLoad" ~/.cursor/skills/cuda-skill/references/cuda-driver-docs/modules/group__cuda__module.md

# Virtual memory
rg "cuMemMap" ~/.cursor/skills/cuda-skill/references/cuda-driver-docs/modules/group__cuda__va.md

CUDA Programming Guide Lookup

bash
# Compute Capabilities table
rg -A 5 "sm_90" ~/.cursor/skills/cuda-skill/references/cuda-guide/05-appendices/compute-capabilities.md

# CUDA Graphs usage
rg "cudaGraph" ~/.cursor/skills/cuda-skill/references/cuda-guide/04-special-topics/cuda-graphs.md

# Cooperative Groups
rg "cooperative" ~/.cursor/skills/cuda-skill/references/cuda-guide/04-special-topics/cooperative-groups.md

# Unified Memory behavior
rg "managed" ~/.cursor/skills/cuda-skill/references/cuda-guide/04-special-topics/unified-memory.md

# Thread Block Clusters (Hopper+)
rg "cluster" ~/.cursor/skills/cuda-skill/references/cuda-guide/01-introduction/programming-model.md

# Programming Guide index (discover all topics)
cat ~/.cursor/skills/cuda-skill/references/cuda-guide/INDEX.md

Best Practices Guide Lookup

bash
# Memory coalescing best practices
rg -i "coalescing" ~/.cursor/skills/cuda-skill/references/best-practices-guide/

# Occupancy optimization
rg -i "occupancy" ~/.cursor/skills/cuda-skill/references/best-practices-guide/

# Shared memory usage patterns
rg -i "shared memory" ~/.cursor/skills/cuda-skill/references/best-practices-guide/

Nsight Compute Lookup

bash
# Metric meanings and collection
rg -i "metric" ~/.cursor/skills/cuda-skill/references/ncu-docs/ProfilingGuide.md

# CLI usage and options
rg -i "section" ~/.cursor/skills/cuda-skill/references/ncu-docs/NsightComputeCli.md

# Roofline analysis
rg -i "roofline" ~/.cursor/skills/cuda-skill/references/ncu-docs/ProfilingGuide.md

Nsight Systems Lookup

bash
# CLI profiling options
rg -i "nsys profile" ~/.cursor/skills/cuda-skill/references/nsys-docs/UserGuide.md

# CUDA trace analysis
rg -i "cuda.*trace" ~/.cursor/skills/cuda-skill/references/nsys-docs/UserGuide.md

When to Use Each Source

| Need | Source | Path shorthand |
|---|---|---|
| PTX instruction syntax/semantics | Full PTX docs | ptx-docs/9-instruction-set/ |
| Quick PTX syntax check | Condensed PTX | ptx-simple/ |
| State spaces, data types | Full PTX docs | ptx-docs/5-state-spaces-types-and-variables/ |
| Memory consistency model | Full PTX docs | ptx-docs/8-memory-consistency-model/ |
| Special registers (%tid, etc.) | Full PTX docs | ptx-docs/10-special-registers/ |
| Directives (.version, .target) | Full PTX docs | ptx-docs/11-directives/ |
| CUDA Runtime functions | Runtime docs | cuda-runtime-docs/modules/ |
| CUDA structs (cudaDeviceProp) | Runtime docs | cuda-runtime-docs/data-structures/ |
| Driver API (cuCtx, cuModule) | Driver docs | cuda-driver-docs/modules/ |
| sm_90 / Hopper specifics | Condensed PTX | ptx-simple/ptx-isa-sm90-hopper.md |
| sm_100 / Blackwell / tcgen05 | Condensed PTX | ptx-simple/ptx-isa-sm100-blackwell.md |
| CUDA C++ programming concepts | Programming Guide | cuda-guide/02-basics/ |
| Thread/block/grid model | Programming Guide | cuda-guide/01-introduction/programming-model.md |
| Compute Capabilities table | Programming Guide | cuda-guide/05-appendices/compute-capabilities.md |
| CUDA Graphs usage | Programming Guide | cuda-guide/04-special-topics/cuda-graphs.md |
| Unified Memory | Programming Guide | cuda-guide/04-special-topics/unified-memory.md |
| Cooperative Groups | Programming Guide | cuda-guide/04-special-topics/cooperative-groups.md |
| Async barriers/pipelines (C++) | Programming Guide | cuda-guide/04-special-topics/async-barriers.md |
| L2 cache control | Programming Guide | cuda-guide/04-special-topics/l2-cache-control.md |
| Dynamic parallelism | Programming Guide | cuda-guide/04-special-topics/dynamic-parallelism.md |
| C++ language extensions | Programming Guide | cuda-guide/05-appendices/cpp-language-extensions.md |
| Math functions (device) | Programming Guide | cuda-guide/05-appendices/mathematical-functions.md |
| Multi-GPU programming | Programming Guide | cuda-guide/03-advanced/multi-gpu-systems.md |
| Environment variables | Programming Guide | cuda-guide/05-appendices/environment-variables.md |
| Memory optimization practices | Best Practices | best-practices-guide/ |
| Performance profiling strategy | Best Practices | best-practices-guide/ |
| ncu metrics, sections, roofline | Nsight Compute | ncu-docs/ProfilingGuide.md |
| ncu CLI options and workflows | Nsight Compute | ncu-docs/NsightComputeCli.md |
| nsys profiling and tracing | Nsight Systems | nsys-docs/UserGuide.md |
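Part of the mapping above can be encoded as a small lookup so that a script picks the right subtree before searching. This is only a sketch covering a handful of rows. The function name ref_path_for and the topic keywords are hypothetical; the paths on the right-hand side are the shorthands from the table.

```bash
# Hypothetical topic-to-path lookup covering a subset of the table.
# Prints a path relative to references/, or fails for unknown topics.
ref_path_for() {
    case "$1" in
        ptx)                        echo "ptx-docs/9-instruction-set/" ;;
        ptx-quick)                  echo "ptx-simple/" ;;
        hopper|sm_90)               echo "ptx-simple/ptx-isa-sm90-hopper.md" ;;
        blackwell|sm_100|tcgen05)   echo "ptx-simple/ptx-isa-sm100-blackwell.md" ;;
        runtime)                    echo "cuda-runtime-docs/modules/" ;;
        driver)                     echo "cuda-driver-docs/modules/" ;;
        graphs)                     echo "cuda-guide/04-special-topics/cuda-graphs.md" ;;
        unified-memory)             echo "cuda-guide/04-special-topics/unified-memory.md" ;;
        ncu)                        echo "ncu-docs/ProfilingGuide.md" ;;
        nsys)                       echo "nsys-docs/UserGuide.md" ;;
        *) echo "no mapping for topic: $1" >&2; return 1 ;;
    esac
}
```

A caller could then run rg "cudaGraph" "$CUDA_REFS/$(ref_path_for graphs)", assuming CUDA_REFS holds the resolved references path.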

Debugging Workflow

  1. Reproduce minimally: isolate the failing kernel with the smallest input that triggers the failure
  2. Add printf: use if (idx == 0) printf(...) in device code
  3. Run compute-sanitizer:
    bash
    compute-sanitizer --tool memcheck ./program
    compute-sanitizer --tool racecheck ./program
  4. cuda-gdb backtrace (non-interactive):
    bash
    cuda-gdb -batch -ex "run" -ex "bt" ./program
  5. When tools fail: minimize the diff between the working and broken code, then read it carefully
For detailed tool options, read ~/.cursor/skills/cuda-skill/references/debugging-tools.md.
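Step 3 of the workflow can be wrapped so that the two sanitizer passes run in order and the sequence stops at the first failing tool. The sketch below is an illustration, not part of the skill: run_sanitizers and the SANITIZER override variable are hypothetical names (the override exists so the wrapper can be dry-run with echo). It also passes --error-exitcode 1 so that errors reported by the tool surface as a non-zero exit status; verify that option against your installed compute-sanitizer version.

```bash
# Hypothetical wrapper: run memcheck, then racecheck, stopping at the first failure.
# Usage: run_sanitizers ./program [program args...]
run_sanitizers() {
    sanitizer=${SANITIZER:-compute-sanitizer}   # override for dry runs, e.g. SANITIZER=echo
    for tool in memcheck racecheck; do
        echo "== $tool ==" >&2
        "$sanitizer" --tool "$tool" --error-exitcode 1 "$@" || return 1
    done
}
```

Running SANITIZER=echo run_sanitizers ./program prints the two command lines without launching anything, which is a cheap way to check the arguments first.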

Performance Optimization Workflow

Never optimize without profiling. GPU bottleneck intuition is almost always wrong.
  1. Establish baseline timing
  2. nsys — Where is time spent?
    bash
    nsys profile -o report ./program
    nsys stats report.nsys-rep --report cuda_gpu_kern_sum
  3. ncu — Why is this kernel slow?
    bash
    ncu --kernel-name "myKernel" --set full -o report ./program
  4. Hypothesize based on metrics, change ONE thing, verify
| Symptom | Likely Cause | Tool |
|---|---|---|
| Low GPU utilization | Launch overhead, CPU bottleneck | nsys timeline |
| Memory bound | Poor coalescing, low cache hit rate | ncu memory section |
| Compute bound but slow | Low occupancy, register pressure | ncu occupancy |
| High sectors/request (>4) | Poor coalescing | ncu memory metrics |
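The threshold of 4 in the last row follows from simple arithmetic, assuming 32-byte sectors and 4-byte loads per thread: a fully coalesced warp reads 32 × 4 = 128 bytes, which is 128 / 32 = 4 sectors per request. The sketch below applies that check to sector and request counts taken from an ncu report. The function name sectors_per_request is hypothetical, and the threshold would differ for wider loads.

```bash
# Hypothetical check: compare sectors/request against the 4-sector threshold
# (assumes 32-byte sectors and 4-byte per-thread accesses).
# Usage: sectors_per_request SECTORS REQUESTS
sectors_per_request() {
    awk -v s="$1" -v r="$2" 'BEGIN {
        if (r <= 0) { print "requests must be positive" > "/dev/stderr"; exit 1 }
        ratio = s / r
        verdict = (ratio > 4) ? "likely poor coalescing" : "at or below the 4-sector threshold"
        printf "%.1f sectors/request: %s\n", ratio, verdict
    }'
}
```

For instance, 3200 sectors over 100 requests gives 32.0 sectors/request, the worst case where every thread in the warp touches a different sector.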
For detailed guides, read:
  • ~/.cursor/skills/cuda-skill/references/nsys-guide.md (quick reference)
  • ~/.cursor/skills/cuda-skill/references/ncu-guide.md (quick reference)
  • ~/.cursor/skills/cuda-skill/references/performance-traps.md
  • ~/.cursor/skills/cuda-skill/references/ncu-docs/ProfilingGuide.md (full Nsight Compute profiling guide)
  • ~/.cursor/skills/cuda-skill/references/nsys-docs/UserGuide.md (full Nsight Systems user guide)
  • ~/.cursor/skills/cuda-skill/references/best-practices-guide/ (CUDA C++ Best Practices)

Compilation Reference

bash
# Debug (-G overrides -lineinfo, so -lineinfo is omitted here)
nvcc -g -G -O0 program.cu -o program_debug

# Release with line info (always use -lineinfo for profiling)
nvcc -O3 -lineinfo program.cu -o program

# Target architecture
nvcc -arch=sm_80 program.cu   # Ampere
nvcc -arch=sm_90 program.cu   # Hopper
nvcc -arch=sm_100 program.cu  # Blackwell

# Generate PTX / inspect binary
nvcc -ptx program.cu
cuobjdump -ptx ./program
cuobjdump -sass ./program
nvcc --ptxas-options=-v program.cu  # Register usage
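The release and architecture flags above can be combined into a single command. As a small sketch, the hypothetical helper below maps an architecture name to its -arch value and prints the resulting nvcc line instead of running it, so it can be inspected first. The name nvcc_release_cmd and the lowercase architecture keywords are assumptions for illustration.

```bash
# Hypothetical helper: print (do not run) a release build command.
# Usage: nvcc_release_cmd ampere|hopper|blackwell SOURCE.cu
nvcc_release_cmd() {
    case "$1" in
        ampere)    arch=sm_80 ;;
        hopper)    arch=sm_90 ;;
        blackwell) arch=sm_100 ;;
        *) echo "unknown architecture: $1" >&2; return 1 ;;
    esac
    src=$2
    # Output name is the source name without its .cu suffix.
    echo "nvcc -O3 -lineinfo -arch=$arch $src -o ${src%.cu}"
}
```

Calling nvcc_release_cmd hopper program.cu prints nvcc -O3 -lineinfo -arch=sm_90 program.cu -o program.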

Inline PTX in CUDA

cuda
__device__ int myAdd(int a, int b) {
    int result;
    asm("add.s32 %0, %1, %2;"
        : "=r"(result)
        : "r"(a), "r"(b));
    return result;
}
// Constraint codes: r=32b reg, l=64b reg, f=f32, d=f64, n=immediate

PTX Documentation Structure

ptx-docs/
├── 1-introduction/
├── 2-programming-model/          # Thread hierarchy, memory
├── 3-ptx-machine-model/          # SIMT architecture
├── 4-syntax/                     # PTX syntax rules
├── 5-state-spaces-types-and-variables/  # Memory spaces, data types
├── 6-instruction-operands/       # Operand types
├── 7-abstracting-the-abi/        # Functions, calling conventions
├── 8-memory-consistency-model/   # Memory ordering, atomics
├── 9-instruction-set/            # 186 instruction files
│   ├── 9.7.1-*   Integer arithmetic
│   ├── 9.7.3-*   Floating point
│   ├── 9.7.9-*   Data movement (includes TMA)
│   ├── 9.7.14-*  WMMA (sm_70+)
│   ├── 9.7.15-*  WGMMA (sm_90+)
│   └── 9.7.16-*  TensorCore Gen5 (sm_100+)
├── 10-special-registers/         # %tid, %ctaid, %clock64
├── 11-directives/                # .version, .target, .entry
├── 12-descriptions-ofpragmastrings/
└── 13-release-notes/

Updating Documentation

bash
cd /path/to/cursor-gpu-skills

# Update everything
uv run scrape_docs.py all --force

# Or update individually:
uv run scrape_docs.py ptx-simple --force      # Condensed PTX from triton repo
uv run scrape_docs.py ptx                     # Full PTX ISA from NVIDIA
uv run scrape_docs.py runtime                 # CUDA Runtime API
uv run scrape_docs.py driver                  # CUDA Driver API
uv run scrape_docs.py guide --force           # CUDA Programming Guide v13.1
uv run scrape_docs.py best-practices --force  # CUDA C++ Best Practices Guide
uv run scrape_docs.py ncu-docs --force        # Nsight Compute docs
uv run scrape_docs.py nsys-docs --force       # Nsight Systems docs

Additional References

For deeper investigation, read the search guide files:
  • PTX search workflow: ~/.cursor/skills/cuda-skill/references/ptx-isa.md
  • Runtime API guide: ~/.cursor/skills/cuda-skill/references/cuda-runtime.md
  • Driver API guide: ~/.cursor/skills/cuda-skill/references/cuda-driver.md