cuda-skill
CUDA & PTX Reference
Documentation Locations
All documentation is under the `references/` directory within this skill's install location. The base path depends on which agent tool is used:
- Cursor: `~/.cursor/skills/cuda-skill/references/`
- Claude Code: `~/.claude/skills/cuda-skill/references/`
- Codex: `~/.agents/skills/cuda-skill/references/`
Determine the actual path at runtime:
```bash
CUDA_REFS="$(dirname "$(find ~/.cursor/skills ~/.claude/skills ~/.agents/skills -name 'cuda-skill' -type d 2>/dev/null | head -1)")/cuda-skill/references"
```
All `rg` examples below use `~/.cursor/skills/cuda-skill/references/` as a placeholder. Replace it with the actual path.
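The `find` one-liner above still produces a path when nothing is installed (`dirname` of an empty string is `.`), so a more defensive lookup may be useful. The following is a minimal sketch, not part of the skill: `find_cuda_refs` is a hypothetical helper that checks the three install locations listed above in order and fails if none of them exists.

```bash
# Hypothetical helper (not part of the skill): print the first existing
# references/ directory among the three documented install locations.
find_cuda_refs() {
  local base
  for base in "$HOME/.cursor/skills" "$HOME/.claude/skills" "$HOME/.agents/skills"; do
    if [ -d "$base/cuda-skill/references" ]; then
      printf '%s\n' "$base/cuda-skill/references"
      return 0
    fi
  done
  echo "cuda-skill references directory not found" >&2
  return 1
}

# Only use the result when the lookup succeeded.
if CUDA_REFS="$(find_cuda_refs)"; then
  echo "Using docs at: $CUDA_REFS"
else
  echo "Install cuda-skill before running the searches below" >&2
fi
```

Checking for the directory explicitly avoids silently searching `./cuda-skill/references` when the skill is missing.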
```
references/
├── ptx-docs/ # PTX ISA 9.1 full spec (405 files, 2.3MB)
├── ptx-simple/ # PTX condensed quick-ref (13 files, 149KB)
├── cuda-runtime-docs/ # CUDA Runtime API 13.1 (107 files, 0.9MB)
├── cuda-driver-docs/ # CUDA Driver API 13.1 (128 files, 0.8MB)
├── cuda-guide/ # CUDA Programming Guide v13.1 (39 pages, 1.6MB)
│ ├── 01-introduction/ # Programming model, CUDA platform
│ ├── 02-basics/ # CUDA C++, kernels, async, memory, nvcc
│ ├── 03-advanced/ # Advanced APIs, kernel programming, driver API, multi-GPU
│ ├── 04-special-topics/ # Graphs, Unified Memory, Coop Groups, TMA, etc.
│ ├── 05-appendices/ # Compute Capabilities, C++ extensions, math funcs
│ └── INDEX.md
├── best-practices-guide/ # CUDA C++ Best Practices Guide
├── ncu-docs/ # Nsight Compute full docs (ProfilingGuide, CLI, etc.)
├── nsys-docs/ # Nsight Systems full docs (UserGuide, etc.)
├── ptx-isa.md # PTX search guide
├── cuda-runtime.md # Runtime API search guide
├── cuda-driver.md # Driver API search guide
├── nsys-guide.md # Nsight Systems quick reference
├── ncu-guide.md # Nsight Compute quick reference
├── debugging-tools.md # compute-sanitizer, cuda-gdb
├── nvtx-patterns.md # NVTX instrumentation
└── performance-traps.md # Bank conflicts, coalescing
```
ptx-simple/ Contents (Condensed Quick-Ref)
```
ptx-simple/
├── ptx-isa-arithmetic.md # add, sub, mul, mad, fma, div, min, max
├── ptx-isa-data-types.md # Types, cvt, rounding, pack
├── ptx-isa-memory-spaces.md # .reg, .global, .shared, fences
├── ptx-isa-load-store.md # ld, st, prefetch
├── ptx-isa-control-flow.md # @p, setp, bra, call, ret, exit
├── ptx-isa-tensor-cores.md # mma.sync, ldmatrix, wgmma
├── ptx-isa-async-copy.md # cp.async, cp.async.bulk, TMA
├── ptx-isa-barriers.md # bar.sync, mbarrier
├── ptx-isa-warp-ops.md # shfl, vote, match, redux
├── ptx-isa-cache-hints.md # Cache control
├── ptx-isa-sm90-hopper.md # Hopper-specific (sm_90)
├── ptx-isa-sm100-blackwell.md # Blackwell-specific (sm_100, tcgen05)
└── ptx-isa-misc.md # Other instructions
```
Search Strategy
Use the Grep tool to search the documentation. Never load entire files into context.
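One way to follow the "never load entire files" rule from a shell is to cap how many matching lines a search may return. A minimal sketch, under stated assumptions: `search_docs` is a hypothetical wrapper, the default cap of 20 lines is arbitrary, and it falls back to `grep -rn` when `rg` is not installed.

```bash
# Hypothetical helper (not part of the skill): search a docs directory but
# return at most LIMIT matching lines so the output stays small.
search_docs() {
  local pattern="$1" dir="$2" limit="${3:-20}"
  if command -v rg >/dev/null 2>&1; then
    rg -n --no-heading -- "$pattern" "$dir" | head -n "$limit"
  else
    grep -rn -- "$pattern" "$dir" | head -n "$limit"
  fi
}
```

For example, `search_docs "wgmma" "$CUDA_REFS/ptx-simple" 10` would print at most ten matching lines, each prefixed with its file name and line number.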
PTX Instruction Lookup
```bash
# Find specific instruction
rg "mbarrier.init" ~/.cursor/skills/cuda-skill/references/ptx-docs/9-instruction-set/

# Find WGMMA register fragments
rg "register fragment" ~/.cursor/skills/cuda-skill/references/ptx-docs/9-instruction-set/ | rg -i wgmma

# Find TMA swizzling modes
rg "swizzle_mode" ~/.cursor/skills/cuda-skill/references/ptx-docs/

# Quick PTX syntax lookup (condensed)
rg "wgmma" ~/.cursor/skills/cuda-skill/references/ptx-simple/ptx-isa-tensor-cores.md
```
CUDA Runtime API Lookup
```bash
# Error code meaning
rg "cudaErrorInvalidValue" ~/.cursor/skills/cuda-skill/references/cuda-runtime-docs/

# Function documentation
rg -A 20 "cudaStreamSynchronize" ~/.cursor/skills/cuda-skill/references/cuda-runtime-docs/modules/group__cudart__stream.md

# Struct fields
rg "" ~/.cursor/skills/cuda-skill/references/cuda-runtime-docs/data-structures/structcudadeviceprop.md
```
CUDA Driver API Lookup
```bash
# Context management
rg -A 20 "cuCtxCreate" ~/.cursor/skills/cuda-skill/references/cuda-driver-docs/modules/group__cuda__ctx.md

# Module loading
rg "cuModuleLoad" ~/.cursor/skills/cuda-skill/references/cuda-driver-docs/modules/group__cuda__module.md

# Virtual memory
rg "cuMemMap" ~/.cursor/skills/cuda-skill/references/cuda-driver-docs/modules/group__cuda__va.md
```
CUDA Programming Guide Lookup
```bash
# Compute Capabilities table
rg -A 5 "sm_90" ~/.cursor/skills/cuda-skill/references/cuda-guide/05-appendices/compute-capabilities.md

# CUDA Graphs usage
rg "cudaGraph" ~/.cursor/skills/cuda-skill/references/cuda-guide/04-special-topics/cuda-graphs.md

# Cooperative Groups
rg "cooperative" ~/.cursor/skills/cuda-skill/references/cuda-guide/04-special-topics/cooperative-groups.md

# Unified Memory behavior
rg "managed" ~/.cursor/skills/cuda-skill/references/cuda-guide/04-special-topics/unified-memory.md

# Thread Block Clusters (Hopper+)
rg "cluster" ~/.cursor/skills/cuda-skill/references/cuda-guide/01-introduction/programming-model.md

# Programming Guide index (discover all topics)
cat ~/.cursor/skills/cuda-skill/references/cuda-guide/INDEX.md
```
Best Practices Guide Lookup
```bash
# Memory coalescing best practices
rg -i "coalescing" ~/.cursor/skills/cuda-skill/references/best-practices-guide/

# Occupancy optimization
rg -i "occupancy" ~/.cursor/skills/cuda-skill/references/best-practices-guide/

# Shared memory usage patterns
rg -i "shared memory" ~/.cursor/skills/cuda-skill/references/best-practices-guide/
```
Nsight Compute Lookup
```bash
# Metric meanings and collection
rg -i "metric" ~/.cursor/skills/cuda-skill/references/ncu-docs/ProfilingGuide.md

# CLI usage and options
rg -i "section" ~/.cursor/skills/cuda-skill/references/ncu-docs/NsightComputeCli.md

# Roofline analysis
rg -i "roofline" ~/.cursor/skills/cuda-skill/references/ncu-docs/ProfilingGuide.md
```
Nsight Systems Lookup
```bash
# CLI profiling options
rg -i "nsys profile" ~/.cursor/skills/cuda-skill/references/nsys-docs/UserGuide.md

# CUDA trace analysis
rg -i "cuda.*trace" ~/.cursor/skills/cuda-skill/references/nsys-docs/UserGuide.md
```
When to Use Each Source
| Need | Source | Path shorthand |
|---|---|---|
| PTX instruction syntax/semantics | Full PTX docs | |
| Quick PTX syntax check | Condensed PTX | |
| State spaces, data types | Full PTX docs | |
| Memory consistency model | Full PTX docs | |
| Special registers (%tid, etc.) | Full PTX docs | |
| Directives (.version, .target) | Full PTX docs | |
| CUDA Runtime functions | Runtime docs | |
| CUDA structs (cudaDeviceProp) | Runtime docs | |
| Driver API (cuCtx, cuModule) | Driver docs | |
| sm_90 / Hopper specifics | Condensed PTX | |
| sm_100 / Blackwell / tcgen05 | Condensed PTX | |
| CUDA C++ programming concepts | Programming Guide | |
| Thread/block/grid model | Programming Guide | |
| Compute Capabilities table | Programming Guide | |
| CUDA Graphs usage | Programming Guide | |
| Unified Memory | Programming Guide | |
| Cooperative Groups | Programming Guide | |
| Async barriers/pipelines (C++) | Programming Guide | |
| L2 cache control | Programming Guide | |
| Dynamic parallelism | Programming Guide | |
| C++ language extensions | Programming Guide | |
| Math functions (device) | Programming Guide | |
| Multi-GPU programming | Programming Guide | |
| Environment variables | Programming Guide | |
| Memory optimization practices | Best Practices | |
| Performance profiling strategy | Best Practices | |
| ncu metrics, sections, roofline | Nsight Compute | |
| ncu CLI options and workflows | Nsight Compute | |
| nsys profiling and tracing | Nsight Systems | |
Debugging Workflow
- Reproduce minimally: isolate the failing kernel with the smallest input.
- Add printf in device code: `if (idx == 0) printf(...)`
- Run compute-sanitizer:
  ```bash
  compute-sanitizer --tool memcheck ./program
  compute-sanitizer --tool racecheck ./program
  ```
- cuda-gdb backtrace (non-interactive):
  ```bash
  cuda-gdb -batch -ex "run" -ex "bt" ./program
  ```
- When tools fail: minimize the diff between working and broken code, and read it carefully.
For detailed tool options, read `~/.cursor/skills/cuda-skill/references/debugging-tools.md`.
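The compute-sanitizer and cuda-gdb steps above can be chained so that a backtrace is collected only when a sanitizer tool reports a problem. This is a rough sketch rather than a recommended script: `debug_cuda` is a hypothetical name, and it assumes `--error-exitcode` is passed so that tool-reported errors show up as a non-zero exit status.

```bash
# Hypothetical wrapper (not part of the skill): run both sanitizer tools and
# collect a non-interactive backtrace as soon as one of them reports errors.
debug_cuda() {
  local prog="$1" tool
  for tool in memcheck racecheck; do
    echo "== compute-sanitizer --tool $tool"
    # --error-exitcode makes tool-reported errors visible as a non-zero status.
    if ! compute-sanitizer --tool "$tool" --error-exitcode 1 "$prog"; then
      echo "== $tool reported errors; collecting backtrace"
      cuda-gdb -batch -ex "run" -ex "bt" "$prog"
      return 1
    fi
  done
  echo "== no sanitizer errors reported"
}
```

A call such as `debug_cuda ./program` stops at the first failing tool. A program that itself exits with a non-zero status is treated the same way as a sanitizer error.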
Performance Optimization Workflow
Never optimize without profiling. Intuition about GPU bottlenecks is almost always wrong.
- Establish baseline timing.
- nsys: where is time spent?
  ```bash
  nsys profile -o report ./program
  nsys stats report.nsys-rep --report cuda_gpu_kern_sum
  ```
- ncu: why is this kernel slow?
  ```bash
  ncu --kernel-name "myKernel" --set full -o report ./program
  ```
- Hypothesize based on metrics, change ONE thing, verify.
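The nsys and ncu steps above can be wrapped into one function so that the same sequence is repeated after every change. A minimal sketch, assuming only the commands already shown: `profile_kernel` and its argument order are hypothetical, and the report base name defaults to `report` as in the examples.

```bash
# Hypothetical wrapper (not part of the skill): run the system-level profile
# first, then the kernel-level profile, stopping if a step fails.
profile_kernel() {
  local prog="$1" kernel="$2" out="${3:-report}"
  # Where is time spent?
  nsys profile -o "$out" "$prog" || return 1
  nsys stats "$out.nsys-rep" --report cuda_gpu_kern_sum || return 1
  # Why is this kernel slow?
  ncu --kernel-name "$kernel" --set full -o "$out" "$prog"
}
```

Calling `profile_kernel ./program myKernel` before and after a single change keeps the comparison tied to identical profiling commands.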
| Symptom | Likely Cause | Tool |
|---|---|---|
| Low GPU utilization | Launch overhead, CPU bottleneck | nsys timeline |
| Memory bound | Poor coalescing, low cache hit | ncu memory section |
| Compute bound but slow | Low occupancy, register pressure | ncu occupancy |
| High sectors/request (>4) | Poor coalescing | ncu memory metrics |
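The `>4` threshold in the last row follows from simple arithmetic, assuming 32-byte sectors, a 32-thread warp, and 4-byte elements: a fully coalesced warp load covers 32 × 4 = 128 bytes, which is exactly 4 sectors. The sketch below counts the sectors touched by one warp-wide request for a given element stride. The function name is illustrative, and it assumes a sector-aligned base address and elements that never straddle a sector boundary.

```bash
# Illustrative model (assumptions: 32-byte sectors, 32 lanes, aligned base).
# Lane i accesses byte offset i * stride * elem_bytes.
sectors_per_request() {
  local stride="$1" elem_bytes="${2:-4}"
  local sector_bytes=32 warp_size=32
  local lane=0 count=0 prev=-1 sector
  while [ "$lane" -lt "$warp_size" ]; do
    sector=$(( lane * stride * elem_bytes / sector_bytes ))
    # Offsets grow with the lane index, so a new sector differs from the last one.
    if [ "$sector" -ne "$prev" ]; then
      count=$(( count + 1 ))
      prev=$sector
    fi
    lane=$(( lane + 1 ))
  done
  echo "$count"
}

sectors_per_request 1   # contiguous floats: prints 4
sectors_per_request 2   # every other float: prints 8
sectors_per_request 8   # one float per sector: prints 32
```

Under these assumptions, any value above 4 for 4-byte accesses means the warp is fetching sectors it only partly uses.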
For detailed guides, read:
- `~/.cursor/skills/cuda-skill/references/nsys-guide.md` (quick reference)
- `~/.cursor/skills/cuda-skill/references/ncu-guide.md` (quick reference)
- `~/.cursor/skills/cuda-skill/references/performance-traps.md`
- `~/.cursor/skills/cuda-skill/references/ncu-docs/ProfilingGuide.md` (full Nsight Compute profiling guide)
- `~/.cursor/skills/cuda-skill/references/nsys-docs/UserGuide.md` (full Nsight Systems user guide)
- `~/.cursor/skills/cuda-skill/references/best-practices-guide/` (CUDA C++ Best Practices)
Compilation Reference
```bash
# Debug (-G overrides -lineinfo, so -lineinfo is omitted here)
nvcc -g -G -O0 program.cu -o program_debug

# Release with line info (always use -lineinfo for profiling)
nvcc -O3 -lineinfo program.cu -o program

# Target architecture
nvcc -arch=sm_80 program.cu # Ampere
nvcc -arch=sm_90 program.cu # Hopper
nvcc -arch=sm_100 program.cu # Blackwell

# Generate PTX / inspect binary
nvcc -ptx program.cu
cuobjdump -ptx ./program
cuobjdump -sass ./program
nvcc --ptxas-options=-v program.cu # Register usage
```
Inline PTX in CUDA
```cuda
__device__ int myAdd(int a, int b) {
    int result;
    // %0 is the output operand; %1 and %2 are the inputs, in the order listed.
    asm("add.s32 %0, %1, %2;"
        : "=r"(result)
        : "r"(a), "r"(b));
    return result;
}
// Constraint codes: r=32b reg, l=64b reg, f=f32, d=f64, n=immediate
```
PTX Documentation Structure
```
ptx-docs/
├── 1-introduction/
├── 2-programming-model/ # Thread hierarchy, memory
├── 3-ptx-machine-model/ # SIMT architecture
├── 4-syntax/ # PTX syntax rules
├── 5-state-spaces-types-and-variables/ # Memory spaces, data types
├── 6-instruction-operands/ # Operand types
├── 7-abstracting-the-abi/ # Functions, calling conventions
├── 8-memory-consistency-model/ # Memory ordering, atomics
├── 9-instruction-set/ # 186 instruction files
│ ├── 9.7.1-* Integer arithmetic
│ ├── 9.7.3-* Floating point
│ ├── 9.7.9-* Data movement (includes TMA)
│ ├── 9.7.14-* WMMA (sm_70+)
│ ├── 9.7.15-* WGMMA (sm_90+)
│ └── 9.7.16-* TensorCore Gen5 (sm_100+)
├── 10-special-registers/ # %tid, %ctaid, %clock64
├── 11-directives/ # .version, .target, .entry
├── 12-descriptions-ofpragmastrings/
└── 13-release-notes/
```
Updating Documentation
```bash
cd /path/to/cursor-gpu-skills

# Update everything
uv run scrape_docs.py all --force

# Or update individually:
uv run scrape_docs.py ptx-simple --force # Condensed PTX from triton repo
uv run scrape_docs.py ptx # Full PTX ISA from NVIDIA
uv run scrape_docs.py runtime # CUDA Runtime API
uv run scrape_docs.py driver # CUDA Driver API
uv run scrape_docs.py guide --force # CUDA Programming Guide v13.1
uv run scrape_docs.py best-practices --force # CUDA C++ Best Practices Guide
uv run scrape_docs.py ncu-docs --force # Nsight Compute docs
uv run scrape_docs.py nsys-docs --force # Nsight Systems docs
```
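When only a few doc sets need refreshing, the individual commands above can be looped over. A small sketch under stated assumptions: `update_docs` is a hypothetical helper, it must be run from the repository directory, and it passes `--force` only for the targets that are shown with that flag above.

```bash
# Hypothetical helper (not part of the repo): refresh selected doc sets,
# mirroring the flags used in the individual commands above.
update_docs() {
  local target
  for target in "$@"; do
    case "$target" in
      ptx|runtime|driver) uv run scrape_docs.py "$target" ;;
      *)                  uv run scrape_docs.py "$target" --force ;;
    esac || { echo "update failed: $target" >&2; return 1; }
  done
}
```

For example, `update_docs ptx guide ncu-docs` refreshes three sets and stops at the first failure.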
Additional References
For deeper investigation, read the search guide files:
- PTX search workflow: `~/.cursor/skills/cuda-skill/references/ptx-isa.md`
- Runtime API guide: `~/.cursor/skills/cuda-skill/references/cuda-runtime.md`
- Driver API guide: `~/.cursor/skills/cuda-skill/references/cuda-driver.md`