cuda-guide

CUDA Guide


Applies to: CUDA 11+, GPU Computing, Deep Learning, Scientific Computing, HPC

Core Principles

核心原则

  1. Parallelism First: Design algorithms for thousands of concurrent threads; serial thinking is the primary enemy of GPU performance
  2. Memory Hierarchy Awareness: Global memory is 100x slower than shared memory and 1000x slower than registers; every kernel design starts with memory access planning
  3. Coalesced Access: Adjacent threads must access adjacent memory addresses; a single misaligned access pattern can reduce bandwidth by 32x
  4. Occupancy Over Cleverness: Maximize active warps per SM by managing register count, shared memory usage, and block dimensions together
  5. Minimize Host-Device Transfers: PCIe bandwidth is the bottleneck; overlap transfers with computation using streams and pinned memory
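Principle 5 in practice can be sketched as a double-buffered pipeline. This is a minimal illustration, not code from this guide: `process_chunk` and `pipeline` are hypothetical names, and it assumes `src`/`dst` were allocated with `cudaMallocHost` (pageable memory would serialize the copies).

```cuda
#include <cuda_runtime.h>

// Placeholder kernel for illustration only.
__global__ void process_chunk(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Double-buffered transfer/compute overlap across two streams.
void pipeline(const float* src, float* dst, int n) {
    const int CHUNK = 1 << 20;
    float* d_buf;
    cudaMalloc(&d_buf, 2 * CHUNK * sizeof(float)); // one device buffer per stream

    cudaStream_t streams[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&streams[i]);

    for (int off = 0, s = 0; off < n; off += CHUNK, s ^= 1) {
        int count = (n - off < CHUNK) ? n - off : CHUNK;
        float* d_chunk = d_buf + s * CHUNK;
        // Copy in, compute, copy out -- ordered within stream s,
        // but overlapping with the other stream's work:
        cudaMemcpyAsync(d_chunk, src + off, count * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process_chunk<<<(count + 255) / 256, 256, 0, streams[s]>>>(d_chunk, count);
        cudaMemcpyAsync(dst + off, d_chunk, count * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
    cudaFree(d_buf);
}
```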

Guardrails


Error Checking


  • ALWAYS check CUDA API return values with a macro wrapper
  • ALWAYS call `cudaGetLastError()` after every kernel launch
  • ALWAYS call `cudaDeviceSynchronize()` before reading kernel results on the host
  • Use `compute-sanitizer` (successor to `cuda-memcheck`) in development builds
  • Handle `cudaErrorMemoryAllocation` gracefully; never assume GPU memory is infinite
```cuda
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err = (call);                                           \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                    \
                    __FILE__, __LINE__, cudaGetErrorString(err));           \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

#define CUDA_CHECK_KERNEL()                                                 \
    do {                                                                    \
        cudaError_t err = cudaGetLastError();                               \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "Kernel launch error at %s:%d: %s\n",           \
                    __FILE__, __LINE__, cudaGetErrorString(err));           \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)
```

Memory Management


  • Pair every `cudaMalloc` with a `cudaFree`; prefer RAII wrappers in C++ host code
  • Use `cudaMallocManaged` (Unified Memory) for prototyping; switch to explicit transfers for production
  • Use `cudaMallocHost` (pinned memory) when streaming data to the GPU; pageable memory cannot overlap with compute
  • Prefer `cudaMemcpyAsync` with streams over synchronous `cudaMemcpy`
  • Never access device pointers from host code or host pointers from device code (except Unified Memory)
  • Call `cudaMemset` or `cudaMemsetAsync` to zero-initialize device buffers
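The RAII rule above can be sketched with a small owning wrapper. This is an illustrative type, not a library class, and it reuses the `CUDA_CHECK` macro defined earlier in this guide:

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Minimal RAII owner for device memory: cudaFree runs on every exit path.
template <typename T>
class DeviceBuffer {
public:
    explicit DeviceBuffer(size_t count) : size_(count) {
        CUDA_CHECK(cudaMalloc(&ptr_, count * sizeof(T)));
    }
    ~DeviceBuffer() { cudaFree(ptr_); } // no CUDA_CHECK: destructors must not exit
    DeviceBuffer(const DeviceBuffer&) = delete;            // prevent double-free
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;

    T* get() const { return ptr_; }
    size_t size() const { return size_; }

private:
    T* ptr_ = nullptr;
    size_t size_ = 0;
};

// Usage sketch:
//   DeviceBuffer<float> d_x(n);
//   my_kernel<<<grid, block>>>(d_x.get(), n);   // freed automatically at scope exit
```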

Kernel Design


  • Block size must be a multiple of the warp size (32); prefer 128, 256, or 512
  • Calculate grid size as `(n + block_size - 1) / block_size`
  • Always include bounds checking (`if (idx < n)`) at the top of every kernel
  • Use grid-stride loops for kernels that must handle arbitrary data sizes
  • Document thread mapping: which dimension maps to which data axis
  • Mark device-only helpers as `__device__`, host+device helpers as `__host__ __device__`
```cuda
// Grid-stride loop: works with any grid size, any data size
__global__ void saxpy(float a, const float* x, float* y, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}
```

Synchronization


  • Use `__syncthreads()` after every shared memory write, before any thread reads another thread's value
  • Never place `__syncthreads()` inside a conditional branch that not all threads in a block will reach (deadlock)
  • Use `__syncwarp()` (CUDA 9+) for warp-level synchronization instead of relying on implicit warp-synchronous execution
  • Use `cudaDeviceSynchronize()` sparingly in production; prefer stream synchronization with `cudaStreamSynchronize()`
  • Use CUDA events (`cudaEventRecord` / `cudaEventSynchronize`) for fine-grained inter-stream ordering
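The event-based ordering in the last bullet can be sketched for one producer/consumer pair. The kernel names, launch dimensions, and `ordered_streams` function are placeholders for illustration:

```cuda
#include <cuda_runtime.h>

// Placeholder kernels for illustration only.
__global__ void producer_kernel(float* d) { /* writes d */ }
__global__ void consumer_kernel(float* d) { /* reads d */ }

// Make stream_b wait for a point in stream_a without blocking the host.
void ordered_streams(float* d_data) {
    cudaStream_t stream_a, stream_b;
    cudaStreamCreate(&stream_a);
    cudaStreamCreate(&stream_b);

    cudaEvent_t ready;
    cudaEventCreateWithFlags(&ready, cudaEventDisableTiming); // timing off: lower overhead

    producer_kernel<<<64, 256, 0, stream_a>>>(d_data);
    cudaEventRecord(ready, stream_a);         // marks the producer's completion point
    cudaStreamWaitEvent(stream_b, ready, 0);  // stream_b waits on the GPU; host never blocks
    consumer_kernel<<<64, 256, 0, stream_b>>>(d_data);

    cudaStreamSynchronize(stream_b);
    cudaEventDestroy(ready);
    cudaStreamDestroy(stream_a);
    cudaStreamDestroy(stream_b);
}
```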

Performance


  • Profile before optimizing: use Nsight Compute for kernel analysis, Nsight Systems for a system-level view
  • Target >50% theoretical occupancy; use the CUDA Occupancy Calculator to tune block dimensions
  • Aim for >60% of peak memory bandwidth in memory-bound kernels
  • Avoid warp divergence: ensure threads within a warp take the same branch when possible
  • Prefer `float` over `double` on consumer GPUs, where double-precision throughput is a small fraction (often 1/32) of single-precision
  • Minimize atomic operations on global memory; use shared memory atomics with a final reduction
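The last bullet (shared-memory atomics with a final reduction) is the standard histogram pattern. A sketch, assuming 256 bins over byte-valued input:

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 256

// Per-block histogram in shared memory, then one global atomicAdd per bin.
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins) {
    __shared__ unsigned int local[NUM_BINS];

    // Cooperatively zero the shared bins.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        local[i] = 0;
    __syncthreads();

    // Cheap shared-memory atomics absorb most of the contention.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // One global atomic per bin per block instead of one per element.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&bins[i], local[i]);
}
```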

Memory Hierarchy


Understanding the memory hierarchy is the single most important factor in CUDA performance.
| Memory Type | Scope  | Latency (cycles) | Size                | Cached          | Read/Write |
|-------------|--------|------------------|---------------------|-----------------|------------|
| Registers   | Thread | 1                | ~255 per thread     | N/A             | R/W        |
| Shared      | Block  | ~5               | 48-164 KB per SM    | N/A             | R/W        |
| L1 Cache    | SM     | ~28              | 48-192 KB per SM    | Auto            | R          |
| L2 Cache    | Device | ~200             | 4-40 MB             | Auto            | R/W        |
| Global      | Device | ~400-600         | 4-80 GB (HBM/GDDR)  | Yes             | R/W        |
| Constant    | Device | ~5 (cached)      | 64 KB               | Yes (broadcast) | R          |
| Texture     | Device | ~400 (cached)    | Global pool         | Yes (spatial)   | R          |
Decision guide:
  • Data reused within a thread -> registers (automatic via local variables)
  • Data shared across threads in a block -> `__shared__` memory
  • Read-only data broadcast to all threads -> `__constant__` memory
  • Large read-only data with spatial locality -> texture memory
  • Everything else -> global memory with coalesced access patterns
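The `__constant__` branch of the decision guide, sketched as a tiny 3-tap filter (kernel and coefficient names are illustrative):

```cuda
#include <cuda_runtime.h>

// Broadcast read-only coefficients through the constant cache: when all
// threads in a warp read the same address, it is served as one broadcast.
__constant__ float c_coeffs[3];

__global__ void convolve3(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 1 || i >= n - 1) return;
    float acc = 0.0f;
    for (int k = -1; k <= 1; ++k)
        acc += c_coeffs[k + 1] * in[i + k];   // same address across the warp
    out[i] = acc;
}

// Host side: populate constant memory before launching.
// float h_coeffs[3] = {0.25f, 0.5f, 0.25f};
// cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));
```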

Key Patterns


Kernel Launch Configuration


```cuda
// Query device for optimal configuration
void launch_optimized(const float* input, float* output, int n) {
    int block_size;
    int min_grid_size;

    // Let the runtime suggest optimal block size for maximum occupancy
    cudaOccupancyMaxPotentialBlockSize(
        &min_grid_size, &block_size, my_kernel, 0, n);

    int grid_size = (n + block_size - 1) / block_size;
    my_kernel<<<grid_size, block_size>>>(input, output, n);
    CUDA_CHECK_KERNEL();
}
```

Coalesced Memory Access


```cuda
// BAD: Strided access -- adjacent threads access non-adjacent memory
// Each warp issues 32 separate memory transactions
__global__ void transpose_naive(const float* in, float* out, int W, int H) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < W && y < H) {
        out[x * H + y] = in[y * W + x];  // Write is strided
    }
}

// GOOD: Use shared memory to coalesce both reads and writes
__global__ void transpose_coalesced(
    const float* in, float* out, int W, int H
) {
    __shared__ float tile[32][33]; // +1 padding avoids bank conflicts

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;

    if (x < W && y < H) {
        tile[threadIdx.y][threadIdx.x] = in[y * W + x]; // Coalesced read
    }
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;

    if (x < H && y < W) {
        out[y * H + x] = tile[threadIdx.x][threadIdx.y]; // Coalesced write
    }
}
```

Shared Memory Tiling


```cuda
// Dot product of two vectors using shared memory reduction.
// Assumes blockDim.x == 256 (matching the cache size) and a power of two,
// so the tree reduction halves cleanly.
__global__ void dot_product(
    const float* a, const float* b, float* result, int n
) {
    __shared__ float cache[256];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread computes its partial sum via grid-stride
    float partial = 0.0f;
    for (int i = idx; i < n; i += blockDim.x * gridDim.x) {
        partial += a[i] * b[i];
    }
    cache[tid] = partial;
    __syncthreads();

    // Tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            cache[tid] += cache[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        atomicAdd(result, cache[0]);
    }
}
```

Warp-Level Primitives (CUDA 9+)


```cuda
// Warp-level reduction using shuffle instructions -- no shared memory needed
__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    }
    return val;
}

// Block-level reduction combining warp shuffles and shared memory
__device__ float block_reduce_sum(float val) {
    __shared__ float warp_sums[32]; // One slot per warp (max 32 warps/block)

    int lane = threadIdx.x % warpSize;
    int warp_id = threadIdx.x / warpSize;

    val = warp_reduce_sum(val);

    if (lane == 0) {
        warp_sums[warp_id] = val;
    }
    __syncthreads();

    // First warp reduces the warp sums
    int num_warps = (blockDim.x + warpSize - 1) / warpSize;
    val = (threadIdx.x < num_warps) ? warp_sums[threadIdx.x] : 0.0f;
    if (warp_id == 0) {
        val = warp_reduce_sum(val);
    }

    return val;
}
```

Performance Measurement


Occupancy Calculator


```cuda
// Query occupancy at runtime for tuning
void report_occupancy() {
    int block_size = 256;
    int num_blocks;

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &num_blocks, my_kernel, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int active_warps = num_blocks * (block_size / prop.warpSize);
    int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    float occupancy = (float)active_warps / max_warps;

    printf("Occupancy: %.1f%% (%d/%d warps)\n",
           occupancy * 100, active_warps, max_warps);
}
```

Nsight Profiling Workflow


```bash
# System-level trace: find CPU/GPU idle gaps, stream concurrency
nsys profile -o trace ./program
nsys stats trace.nsys-rep

# Kernel-level analysis: roofline, memory throughput, occupancy
ncu --set full -o kernel_report ./program
ncu -i kernel_report.ncu-rep   # Open in Nsight Compute GUI

# Quick single-metric check
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./program
```

Memory Bandwidth Measurement


```cuda
// Measure effective bandwidth of a kernel.
// copy_kernel, d_in, d_out, grid, and block are assumed defined elsewhere.
void measure_bandwidth(int n) {
    size_t bytes = 2 * n * sizeof(float); // Read A + Write B

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    CUDA_CHECK(cudaEventRecord(start));
    copy_kernel<<<grid, block>>>(d_in, d_out, n);
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    float gb_per_sec = bytes / (ms * 1e6);
    printf("Effective bandwidth: %.2f GB/s\n", gb_per_sec);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
}
```
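To put the measured number in context (the >60%-of-peak target above), theoretical peak bandwidth can be derived from device properties. A sketch; the factor 2 assumes double-data-rate memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Theoretical peak = 2 (DDR) * memory clock * bus width in bytes.
// memoryClockRate is reported in kHz, memoryBusWidth in bits.
void report_peak_bandwidth() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    double peak_gb_s = 2.0 * prop.memoryClockRate * 1e3   // clock in Hz
                     * (prop.memoryBusWidth / 8.0)        // bytes per transfer
                     / 1e9;
    printf("Theoretical peak bandwidth: %.1f GB/s\n", peak_gb_s);
}
```

Compare `measure_bandwidth`'s output against this figure to judge how memory-bound a kernel really is.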

Tooling


Essential Commands


```bash
# Compile CUDA code
nvcc -arch=sm_80 -O3 -o program main.cu         # Single file
nvcc -arch=native -lineinfo -o program main.cu  # With debug line info

# CMake build
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="70;80;86"
cmake --build build -j$(nproc)

# Runtime debugging
compute-sanitizer ./program                    # Memory errors (replaces cuda-memcheck)
compute-sanitizer --tool racecheck ./program   # Shared memory race conditions
compute-sanitizer --tool initcheck ./program   # Uninitialized device memory reads
compute-sanitizer --tool synccheck ./program   # Synchronization errors

# Profiling
nsys profile ./program   # System-level timeline
ncu ./program            # Kernel-level metrics
ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 ./program

# Device info
nvidia-smi      # GPU status and memory usage
nvcc --version  # CUDA compiler version
```

CMakeLists.txt Template


```cmake
cmake_minimum_required(VERSION 3.18)
project(myproject LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_ARCHITECTURES 70 80 86)
set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)

find_package(CUDAToolkit REQUIRED)

add_library(kernels src/kernels.cu)
target_include_directories(kernels PUBLIC include)

add_executable(main src/main.cpp)
target_link_libraries(main kernels CUDA::cudart)

enable_testing()
add_executable(tests tests/test_kernels.cu)
target_link_libraries(tests kernels CUDA::cudart)
add_test(NAME gpu_tests COMMAND tests)
```

References


For detailed patterns and examples, see:
  • references/patterns.md -- Tiled matrix multiply, parallel reduction tree, stream overlap pipeline

External References
