CUDA Guide
Applies to: CUDA 11+, GPU Computing, Deep Learning, Scientific Computing, HPC
Core Principles
- Parallelism First: Design algorithms for thousands of concurrent threads; serial thinking is the primary enemy of GPU performance
- Memory Hierarchy Awareness: Global memory is 100x slower than shared memory and 1000x slower than registers; every kernel design starts with memory access planning
- Coalesced Access: Adjacent threads must access adjacent memory addresses; a single misaligned access pattern can reduce bandwidth by 32x
- Occupancy Over Cleverness: Maximize active warps per SM by managing register count, shared memory usage, and block dimensions together
- Minimize Host-Device Transfers: PCIe bandwidth is often the bottleneck; overlap transfers with computation using streams and pinned memory
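The last principle can be sketched with pinned memory and a stream. This is a minimal sketch: `process` is an illustrative kernel name, launch dimensions are arbitrary, and error checks are omitted for brevity.

```cuda
// Overlap host-to-device transfer with compute using pinned memory + a stream.
void transfer_and_process(int n) {
  float* h_buf;
  float* d_buf;
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaMallocHost(&h_buf, n * sizeof(float)); // Pinned: required for truly async copies
  cudaMalloc(&d_buf, n * sizeof(float));

  // ... fill h_buf on the host ...
  cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                  cudaMemcpyHostToDevice, stream);        // Returns immediately
  process<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n); // Ordered after the copy in-stream
  cudaStreamSynchronize(stream);                          // Wait for this stream only

  cudaFreeHost(h_buf);
  cudaFree(d_buf);
  cudaStreamDestroy(stream);
}
```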
Guardrails
Error Checking
- ALWAYS check CUDA API return values with a macro wrapper
- ALWAYS call `cudaGetLastError()` after every kernel launch
- ALWAYS call `cudaDeviceSynchronize()` before reading kernel results on the host
- Use `compute-sanitizer` (successor to `cuda-memcheck`) in development builds
- Handle `cudaErrorMemoryAllocation` gracefully; never assume GPU memory is infinite

```cuda
#define CUDA_CHECK(call) \
  do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
      fprintf(stderr, "CUDA error at %s:%d: %s\n", \
              __FILE__, __LINE__, cudaGetErrorString(err)); \
      exit(EXIT_FAILURE); \
    } \
  } while (0)

#define CUDA_CHECK_KERNEL() \
  do { \
    cudaError_t err = cudaGetLastError(); \
    if (err != cudaSuccess) { \
      fprintf(stderr, "Kernel launch error at %s:%d: %s\n", \
              __FILE__, __LINE__, cudaGetErrorString(err)); \
      exit(EXIT_FAILURE); \
    } \
  } while (0)
```
Memory Management
- Pair every `cudaMalloc` with a `cudaFree`; prefer RAII wrappers in C++ host code
- Use `cudaMallocManaged` (Unified Memory) for prototyping; switch to explicit transfers for production
- Use `cudaMallocHost` (pinned memory) when streaming data to the GPU; pageable memory cannot overlap with compute
- Prefer `cudaMemcpyAsync` with streams over synchronous `cudaMemcpy`
- Never access device pointers from host code or host pointers from device code (except Unified Memory)
- Call `cudaMemset` or `cudaMemsetAsync` to zero-initialize device buffers
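In C++ host code, the `cudaMalloc`/`cudaFree` pairing can be enforced with a small RAII wrapper. This `DeviceBuffer` is a sketch of our own, not a CUDA API:

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// RAII wrapper: cudaFree is guaranteed on scope exit, even on early returns.
template <typename T>
class DeviceBuffer {
 public:
  explicit DeviceBuffer(std::size_t n) : n_(n) {
    if (cudaMalloc(&ptr_, n * sizeof(T)) != cudaSuccess) {
      ptr_ = nullptr; // Caller checks valid(); never assume GPU memory is infinite
    }
  }
  ~DeviceBuffer() { if (ptr_) cudaFree(ptr_); }
  DeviceBuffer(const DeviceBuffer&) = delete;            // No accidental double-free
  DeviceBuffer& operator=(const DeviceBuffer&) = delete;
  bool valid() const { return ptr_ != nullptr; }
  T* get() const { return ptr_; }
  std::size_t size() const { return n_; }
 private:
  T* ptr_ = nullptr;
  std::size_t n_;
};

// Usage sketch: DeviceBuffer<float> d_x(1 << 20);
//               my_kernel<<<grid, block>>>(d_x.get(), (int)d_x.size());
```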
Kernel Design
- Block size must be a multiple of warp size (32); prefer 128, 256, or 512
- Calculate grid size as `(n + block_size - 1) / block_size`
- Always include bounds checking: `if (idx < n)` at the top of every kernel
- Use grid-stride loops for kernels that must handle arbitrary data sizes
- Document thread mapping: which dimension maps to which data axis
- Mark device-only helpers as `__device__`, host+device helpers as `__host__ __device__`

```cuda
// Grid-stride loop: works with any grid size, any data size
__global__ void saxpy(float a, const float* x, float* y, int n) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x;
       i < n;
       i += blockDim.x * gridDim.x) {
    y[i] = a * x[i] + y[i];
  }
}
```

Synchronization
- Use `__syncthreads()` after every shared memory write, before any thread reads another thread's value
- Never place `__syncthreads()` inside a conditional branch that not all threads in a block will reach (deadlock)
- Use `__syncwarp()` (CUDA 9+) for warp-level synchronization instead of relying on implicit warp-synchronous execution
- Use `cudaDeviceSynchronize()` sparingly in production; prefer stream synchronization with `cudaStreamSynchronize()`
- Use CUDA events (`cudaEventRecord`/`cudaEventSynchronize`) for fine-grained inter-stream ordering
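A minimal sketch of event-based inter-stream ordering; `producer` and `consumer` are illustrative kernel names, and error checks are omitted for brevity:

```cuda
// Make stream s2 wait for work in stream s1 without blocking the host.
void ordered_streams(float* d_buf, int n) {
  cudaStream_t s1, s2;
  cudaEvent_t ready;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  cudaEventCreate(&ready);

  producer<<<(n + 255) / 256, 256, 0, s1>>>(d_buf, n);
  cudaEventRecord(ready, s1);        // Marks the point in s1 that s2 must wait for
  cudaStreamWaitEvent(s2, ready, 0); // s2 stalls on the event; the host does not
  consumer<<<(n + 255) / 256, 256, 0, s2>>>(d_buf, n);

  cudaStreamSynchronize(s2);         // Host blocks only where results are needed
  cudaEventDestroy(ready);
  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);
}
```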
Performance
- Profile before optimizing: use Nsight Compute for kernel analysis, Nsight Systems for system-level view
- Target >50% theoretical occupancy; use the CUDA Occupancy Calculator to tune block dimensions
- Aim for >60% of peak memory bandwidth in memory-bound kernels
- Avoid warp divergence: ensure threads within a warp take the same branch when possible
- Prefer `float` over `double` on consumer GPUs (FP64 throughput is heavily rate-limited there, often 1/32 of FP32 or worse)
- Minimize atomic operations on global memory; use shared memory atomics with a final reduction
Memory Hierarchy
Understanding the memory hierarchy is the single most important factor in CUDA performance.
| Memory Type | Scope | Latency (cycles) | Size | Cached | Read/Write |
|---|---|---|---|---|---|
| Registers | Thread | 1 | ~255 per thread | N/A | R/W |
| Shared | Block | ~5 | 48-164 KB per SM | N/A | R/W |
| L1 Cache | SM | ~28 | 48-192 KB per SM | Auto | R |
| L2 Cache | Device | ~200 | 4-40 MB | Auto | R/W |
| Global | Device | ~400-600 | 4-80 GB (HBM/GDDR) | Yes | R/W |
| Constant | Device | ~5 (cached) | 64 KB | Yes (broadcast) | R |
| Texture | Device | ~400 (cached) | Global pool | Yes (spatial) | R |
Decision guide:
- Data reused within a thread -> registers (automatic via local variables)
- Data shared across threads in a block -> `__shared__` memory
- Read-only data broadcast to all threads -> `__constant__` memory
- Large read-only data with spatial locality -> texture memory
- Everything else -> global memory with coalesced access patterns
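As a sketch of the `__constant__` case: polynomial coefficients broadcast to every thread (the names here are illustrative, not from this guide):

```cuda
__constant__ float c_poly[8]; // Up to degree-7 polynomial, highest power first

__global__ void eval_poly(const float* x, float* y, int n, int degree) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    float acc = c_poly[0];
    for (int j = 1; j <= degree; ++j) {
      acc = acc * x[idx] + c_poly[j]; // Horner's rule; every thread reads the
                                      // same c_poly[j], so it is one broadcast
    }
    y[idx] = acc;
  }
}

// Host side: upload coefficients once before launching
// float h_poly[8] = { /* ... */ };
// cudaMemcpyToSymbol(c_poly, h_poly, sizeof(h_poly));
```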
Key Patterns
Kernel Launch Configuration
```cuda
// Query device for optimal configuration
void launch_optimized(const float* input, float* output, int n) {
  int block_size;
  int min_grid_size;
  // Let the runtime suggest optimal block size for maximum occupancy
  cudaOccupancyMaxPotentialBlockSize(
      &min_grid_size, &block_size, my_kernel, 0, n);
  int grid_size = (n + block_size - 1) / block_size;
  my_kernel<<<grid_size, block_size>>>(input, output, n);
  CUDA_CHECK_KERNEL();
}
```

Coalesced Memory Access
```cuda
// BAD: Strided access -- adjacent threads access non-adjacent memory.
// Each warp issues 32 separate memory transactions.
__global__ void transpose_naive(const float* in, float* out, int W, int H) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < W && y < H) {
    out[x * H + y] = in[y * W + x]; // Write is strided
  }
}

// GOOD: Use shared memory to coalesce both reads and writes
__global__ void transpose_coalesced(
    const float* in, float* out, int W, int H
) {
  __shared__ float tile[32][33]; // +1 padding avoids bank conflicts
  int x = blockIdx.x * 32 + threadIdx.x;
  int y = blockIdx.y * 32 + threadIdx.y;
  if (x < W && y < H) {
    tile[threadIdx.y][threadIdx.x] = in[y * W + x]; // Coalesced read
  }
  __syncthreads();
  x = blockIdx.y * 32 + threadIdx.x;
  y = blockIdx.x * 32 + threadIdx.y;
  if (x < H && y < W) {
    out[y * H + x] = tile[threadIdx.x][threadIdx.y]; // Coalesced write
  }
}
```

Shared Memory Tiling
```cuda
// Dot product of two vectors using shared memory reduction.
// Assumes blockDim.x == 256 (matches cache[]) and a power-of-two block size;
// *result must be zero-initialized before launch.
__global__ void dot_product(
    const float* a, const float* b, float* result, int n
) {
  __shared__ float cache[256];
  int tid = threadIdx.x;
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Each thread computes its partial sum via grid-stride
  float partial = 0.0f;
  for (int i = idx; i < n; i += blockDim.x * gridDim.x) {
    partial += a[i] * b[i];
  }
  cache[tid] = partial;
  __syncthreads();
  // Tree reduction in shared memory
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
      cache[tid] += cache[tid + s];
    }
    __syncthreads();
  }
  if (tid == 0) {
    atomicAdd(result, cache[0]);
  }
}
```

Warp-Level Primitives (CUDA 9+)
```cuda
// Warp-level reduction using shuffle instructions -- no shared memory needed
__device__ float warp_reduce_sum(float val) {
  for (int offset = warpSize / 2; offset > 0; offset /= 2) {
    val += __shfl_down_sync(0xFFFFFFFF, val, offset);
  }
  return val;
}

// Block-level reduction combining warp shuffles and shared memory
__device__ float block_reduce_sum(float val) {
  __shared__ float warp_sums[32]; // One slot per warp (max 32 warps/block)
  int lane = threadIdx.x % warpSize;
  int warp_id = threadIdx.x / warpSize;
  val = warp_reduce_sum(val);
  if (lane == 0) {
    warp_sums[warp_id] = val;
  }
  __syncthreads();
  // First warp reduces the warp sums
  int num_warps = (blockDim.x + warpSize - 1) / warpSize;
  val = (threadIdx.x < num_warps) ? warp_sums[threadIdx.x] : 0.0f;
  if (warp_id == 0) {
    val = warp_reduce_sum(val);
  }
  return val;
}
```

Performance
Occupancy Calculator
```cuda
// Query occupancy at runtime for tuning
void report_occupancy() {
  int block_size = 256;
  int num_blocks;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &num_blocks, my_kernel, block_size, 0);
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  int active_warps = num_blocks * (block_size / prop.warpSize);
  int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
  float occupancy = (float)active_warps / max_warps;
  printf("Occupancy: %.1f%% (%d/%d warps)\n",
         occupancy * 100, active_warps, max_warps);
}
```

Nsight Profiling Workflow
```bash
# System-level trace: find CPU/GPU idle gaps, stream concurrency
nsys profile -o trace ./program
nsys stats trace.nsys-rep

# Kernel-level analysis: roofline, memory throughput, occupancy
ncu --set full -o kernel_report ./program
ncu -i kernel_report.ncu-rep  # Open in Nsight Compute GUI

# Quick single-metric check
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./program
```

Memory Bandwidth Measurement
```cuda
// Measure effective bandwidth of a kernel using CUDA events.
// Assumes copy_kernel, d_in, d_out, grid, and block are defined elsewhere.
void measure_bandwidth(int n) {
  size_t bytes = 2 * (size_t)n * sizeof(float); // Read A + write B
  cudaEvent_t start, stop;
  CUDA_CHECK(cudaEventCreate(&start));
  CUDA_CHECK(cudaEventCreate(&stop));
  CUDA_CHECK(cudaEventRecord(start));
  copy_kernel<<<grid, block>>>(d_in, d_out, n);
  CUDA_CHECK(cudaEventRecord(stop));
  CUDA_CHECK(cudaEventSynchronize(stop));
  float ms = 0;
  CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
  float gb_per_sec = bytes / (ms * 1e6);
  printf("Effective bandwidth: %.2f GB/s\n", gb_per_sec);
  CUDA_CHECK(cudaEventDestroy(start));
  CUDA_CHECK(cudaEventDestroy(stop));
}
```

Tooling
Essential Commands
```bash
# Compile CUDA code
nvcc -arch=sm_80 -O3 -o program main.cu         # Single file
nvcc -arch=native -lineinfo -o program main.cu  # With debug line info

# CMake build
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="70;80;86"
cmake --build build -j$(nproc)

# Runtime debugging
compute-sanitizer ./program                   # Memory errors (replaces cuda-memcheck)
compute-sanitizer --tool racecheck ./program  # Shared memory race conditions
compute-sanitizer --tool initcheck ./program  # Uninitialized device memory reads
compute-sanitizer --tool synccheck ./program  # Synchronization errors

# Profiling
nsys profile ./program  # System-level timeline
ncu ./program           # Kernel-level metrics
ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 ./program

# Device info
nvidia-smi      # GPU status and memory usage
nvcc --version  # CUDA compiler version
```

CMakeLists.txt Template
```cmake
cmake_minimum_required(VERSION 3.18)
project(myproject LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_ARCHITECTURES 70 80 86)
set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)

find_package(CUDAToolkit REQUIRED)

add_library(kernels src/kernels.cu)
target_include_directories(kernels PUBLIC include)

add_executable(main src/main.cpp)
target_link_libraries(main kernels CUDA::cudart)

enable_testing()
add_executable(tests tests/test_kernels.cu)
target_link_libraries(tests kernels CUDA::cudart)
add_test(NAME gpu_tests COMMAND tests)
```

References
For detailed patterns and examples, see:
- references/patterns.md -- Tiled matrix multiply, parallel reduction tree, stream overlap pipeline