CUDA Guide
Applies to: CUDA 11+, GPU Computing, Deep Learning, Scientific Computing, HPC
Core Principles
- Parallelism First: Design algorithms for thousands of concurrent threads; serial thinking is the primary enemy of GPU performance
- Memory Hierarchy Awareness: Global memory is 100x slower than shared memory and 1000x slower than registers; every kernel design starts with memory access planning
- Coalesced Access: Adjacent threads must access adjacent memory addresses; a single misaligned access pattern can reduce bandwidth by 32x
- Occupancy Over Cleverness: Maximize active warps per SM by managing register count, shared memory usage, and block dimensions together
- Minimize Host-Device Transfers: PCIe bandwidth is often the bottleneck; overlap transfers with computation using streams and pinned memory
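The last principle can be sketched with pinned memory and a stream. This is a minimal sketch: `process` is an illustrative kernel name, launch dimensions are arbitrary, and error checks are omitted for brevity.

```cuda
// Overlap host-to-device transfer with compute using pinned memory + a stream.
void transfer_and_process(int n) {
  float* h_buf;
  float* d_buf;
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaMallocHost(&h_buf, n * sizeof(float)); // Pinned: required for truly async copies
  cudaMalloc(&d_buf, n * sizeof(float));

  // ... fill h_buf on the host ...
  cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                  cudaMemcpyHostToDevice, stream);        // Returns immediately
  process<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n); // Ordered after the copy in-stream
  cudaStreamSynchronize(stream);                          // Wait for this stream only

  cudaFreeHost(h_buf);
  cudaFree(d_buf);
  cudaStreamDestroy(stream);
}
```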
Guardrails
Error Checking
- ALWAYS check CUDA API return values with a macro wrapper
- ALWAYS call `cudaGetLastError()` after every kernel launch
- ALWAYS call `cudaDeviceSynchronize()` before reading kernel results on the host
- Use `compute-sanitizer` (successor to `cuda-memcheck`) in development builds
- Handle `cudaErrorMemoryAllocation` gracefully; never assume GPU memory is infinite

```cuda
#define CUDA_CHECK(call) \
  do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
      fprintf(stderr, "CUDA error at %s:%d: %s\n", \
              __FILE__, __LINE__, cudaGetErrorString(err)); \
      exit(EXIT_FAILURE); \
    } \
  } while (0)

#define CUDA_CHECK_KERNEL() \
  do { \
    cudaError_t err = cudaGetLastError(); \
    if (err != cudaSuccess) { \
      fprintf(stderr, "Kernel launch error at %s:%d: %s\n", \
              __FILE__, __LINE__, cudaGetErrorString(err)); \
      exit(EXIT_FAILURE); \
    } \
  } while (0)
```
Memory Management
- Pair every `cudaMalloc` with a `cudaFree`; prefer RAII wrappers in C++ host code
- Use `cudaMallocManaged` (Unified Memory) for prototyping; switch to explicit transfers for production
- Use `cudaMallocHost` (pinned memory) when streaming data to the GPU; pageable memory cannot overlap with compute
- Prefer `cudaMemcpyAsync` with streams over synchronous `cudaMemcpy`
- Never access device pointers from host code or host pointers from device code (except Unified Memory)
- Call `cudaMemset` or `cudaMemsetAsync` to zero-initialize device buffers
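In C++ host code, the `cudaMalloc`/`cudaFree` pairing can be enforced with a small RAII wrapper. This `DeviceBuffer` is a sketch of our own, not a CUDA API:

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// RAII wrapper: cudaFree is guaranteed on scope exit, even on early returns.
template <typename T>
class DeviceBuffer {
 public:
  explicit DeviceBuffer(std::size_t n) : n_(n) {
    if (cudaMalloc(&ptr_, n * sizeof(T)) != cudaSuccess) {
      ptr_ = nullptr; // Caller checks valid(); never assume GPU memory is infinite
    }
  }
  ~DeviceBuffer() { if (ptr_) cudaFree(ptr_); }
  DeviceBuffer(const DeviceBuffer&) = delete;            // No accidental double-free
  DeviceBuffer& operator=(const DeviceBuffer&) = delete;
  bool valid() const { return ptr_ != nullptr; }
  T* get() const { return ptr_; }
  std::size_t size() const { return n_; }
 private:
  T* ptr_ = nullptr;
  std::size_t n_;
};

// Usage sketch: DeviceBuffer<float> d_x(1 << 20);
//               my_kernel<<<grid, block>>>(d_x.get(), (int)d_x.size());
```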
Kernel Design
- Block size must be a multiple of warp size (32); prefer 128, 256, or 512
- Calculate grid size as `(n + block_size - 1) / block_size`
- Always include bounds checking: `if (idx < n)` at the top of every kernel
- Use grid-stride loops for kernels that must handle arbitrary data sizes
- Document thread mapping: which dimension maps to which data axis
- Mark device-only helpers as `__device__`, host+device helpers as `__host__ __device__`

```cuda
// Grid-stride loop: works with any grid size, any data size
__global__ void saxpy(float a, const float* x, float* y, int n) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x;
       i < n;
       i += blockDim.x * gridDim.x) {
    y[i] = a * x[i] + y[i];
  }
}
```

Synchronization
- Use `__syncthreads()` after every shared memory write, before any thread reads another thread's value
- Never place `__syncthreads()` inside a conditional branch that not all threads in a block will reach (deadlock)
- Use `__syncwarp()` (CUDA 9+) for warp-level synchronization instead of relying on implicit warp-synchronous execution
- Use `cudaDeviceSynchronize()` sparingly in production; prefer stream synchronization with `cudaStreamSynchronize()`
- Use CUDA events (`cudaEventRecord`/`cudaEventSynchronize`) for fine-grained inter-stream ordering
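A minimal sketch of event-based inter-stream ordering; `producer` and `consumer` are illustrative kernel names, and error checks are omitted for brevity:

```cuda
// Make stream s2 wait for work in stream s1 without blocking the host.
void ordered_streams(float* d_buf, int n) {
  cudaStream_t s1, s2;
  cudaEvent_t ready;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  cudaEventCreate(&ready);

  producer<<<(n + 255) / 256, 256, 0, s1>>>(d_buf, n);
  cudaEventRecord(ready, s1);        // Marks the point in s1 that s2 must wait for
  cudaStreamWaitEvent(s2, ready, 0); // s2 stalls on the event; the host does not
  consumer<<<(n + 255) / 256, 256, 0, s2>>>(d_buf, n);

  cudaStreamSynchronize(s2);         // Host blocks only where results are needed
  cudaEventDestroy(ready);
  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);
}
```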
Performance
- Profile before optimizing: use Nsight Compute for kernel analysis, Nsight Systems for system-level view
- Target >50% theoretical occupancy; use the CUDA Occupancy Calculator to tune block dimensions
- Aim for >60% of peak memory bandwidth in memory-bound kernels
- Avoid warp divergence: ensure threads within a warp take the same branch when possible
- Prefer `float` over `double` on consumer GPUs (FP64 throughput is heavily rate-limited there, often 1/32 of FP32 or worse)
- Minimize atomic operations on global memory; use shared memory atomics with a final reduction
Memory Hierarchy
Understanding the memory hierarchy is the single most important factor in CUDA performance.
| Memory Type | Scope | Latency (cycles) | Size | Cached | Read/Write |
|---|---|---|---|---|---|
| Registers | Thread | 1 | ~255 per thread | N/A | R/W |
| Shared | Block | ~5 | 48-164 KB per SM | N/A | R/W |
| L1 Cache | SM | ~28 | 48-192 KB per SM | Auto | R |
| L2 Cache | Device | ~200 | 4-40 MB | Auto | R/W |
| Global | Device | ~400-600 | 4-80 GB (HBM/GDDR) | Yes | R/W |
| Constant | Device | ~5 (cached) | 64 KB | Yes (broadcast) | R |
| Texture | Device | ~400 (cached) | Global pool | Yes (spatial) | R |
Decision guide:
- Data reused within a thread -> registers (automatic via local variables)
- Data shared across threads in a block -> `__shared__` memory
- Read-only data broadcast to all threads -> `__constant__` memory
- Large read-only data with spatial locality -> texture memory
- Everything else -> global memory with coalesced access patterns
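As a sketch of the `__constant__` case: polynomial coefficients broadcast to every thread (the names here are illustrative, not from this guide):

```cuda
__constant__ float c_poly[8]; // Up to degree-7 polynomial, highest power first

__global__ void eval_poly(const float* x, float* y, int n, int degree) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    float acc = c_poly[0];
    for (int j = 1; j <= degree; ++j) {
      acc = acc * x[idx] + c_poly[j]; // Horner's rule; every thread reads the
                                      // same c_poly[j], so it is one broadcast
    }
    y[idx] = acc;
  }
}

// Host side: upload coefficients once before launching
// float h_poly[8] = { /* ... */ };
// cudaMemcpyToSymbol(c_poly, h_poly, sizeof(h_poly));
```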
Key Patterns
Kernel Launch Configuration
```cuda
// Query device for optimal configuration
void launch_optimized(const float* input, float* output, int n) {
  int block_size;
  int min_grid_size;
  // Let the runtime suggest optimal block size for maximum occupancy
  cudaOccupancyMaxPotentialBlockSize(
      &min_grid_size, &block_size, my_kernel, 0, n);
  int grid_size = (n + block_size - 1) / block_size;
  my_kernel<<<grid_size, block_size>>>(input, output, n);
  CUDA_CHECK_KERNEL();
}
```

Coalesced Memory Access
```cuda
// BAD: Strided access -- adjacent threads access non-adjacent memory.
// Each warp issues 32 separate memory transactions.
__global__ void transpose_naive(const float* in, float* out, int W, int H) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < W && y < H) {
    out[x * H + y] = in[y * W + x]; // Write is strided
  }
}

// GOOD: Use shared memory to coalesce both reads and writes
__global__ void transpose_coalesced(
    const float* in, float* out, int W, int H
) {
  __shared__ float tile[32][33]; // +1 padding avoids bank conflicts
  int x = blockIdx.x * 32 + threadIdx.x;
  int y = blockIdx.y * 32 + threadIdx.y;
  if (x < W && y < H) {
    tile[threadIdx.y][threadIdx.x] = in[y * W + x]; // Coalesced read
  }
  __syncthreads();
  x = blockIdx.y * 32 + threadIdx.x;
  y = blockIdx.x * 32 + threadIdx.y;
  if (x < H && y < W) {
    out[y * H + x] = tile[threadIdx.x][threadIdx.y]; // Coalesced write
  }
}
```

Shared Memory Tiling
```cuda
// Dot product of two vectors using shared memory reduction.
// Assumes blockDim.x == 256 (matches cache[]) and a power-of-two block size;
// *result must be zero-initialized before launch.
__global__ void dot_product(
    const float* a, const float* b, float* result, int n
) {
  __shared__ float cache[256];
  int tid = threadIdx.x;
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Each thread computes its partial sum via grid-stride
  float partial = 0.0f;
  for (int i = idx; i < n; i += blockDim.x * gridDim.x) {
    partial += a[i] * b[i];
  }
  cache[tid] = partial;
  __syncthreads();
  // Tree reduction in shared memory
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
      cache[tid] += cache[tid + s];
    }
    __syncthreads();
  }
  if (tid == 0) {
    atomicAdd(result, cache[0]);
  }
}
```

Warp-Level Primitives (CUDA 9+)
```cuda
// Warp-level reduction using shuffle instructions -- no shared memory needed
__device__ float warp_reduce_sum(float val) {
  for (int offset = warpSize / 2; offset > 0; offset /= 2) {
    val += __shfl_down_sync(0xFFFFFFFF, val, offset);
  }
  return val;
}

// Block-level reduction combining warp shuffles and shared memory
__device__ float block_reduce_sum(float val) {
  __shared__ float warp_sums[32]; // One slot per warp (max 32 warps/block)
  int lane = threadIdx.x % warpSize;
  int warp_id = threadIdx.x / warpSize;
  val = warp_reduce_sum(val);
  if (lane == 0) {
    warp_sums[warp_id] = val;
  }
  __syncthreads();
  // First warp reduces the warp sums
  int num_warps = (blockDim.x + warpSize - 1) / warpSize;
  val = (threadIdx.x < num_warps) ? warp_sums[threadIdx.x] : 0.0f;
  if (warp_id == 0) {
    val = warp_reduce_sum(val);
  }
  return val;
}
```

Performance
Occupancy Calculator
```cuda
// Query occupancy at runtime for tuning
void report_occupancy() {
  int block_size = 256;
  int num_blocks;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &num_blocks, my_kernel, block_size, 0);
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  int active_warps = num_blocks * (block_size / prop.warpSize);
  int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
  float occupancy = (float)active_warps / max_warps;
  printf("Occupancy: %.1f%% (%d/%d warps)\n",
         occupancy * 100, active_warps, max_warps);
}
```

Nsight Profiling Workflow
```bash
# System-level trace: find CPU/GPU idle gaps, stream concurrency
nsys profile -o trace ./program
nsys stats trace.nsys-rep

# Kernel-level analysis: roofline, memory throughput, occupancy
ncu --set full -o kernel_report ./program
ncu -i kernel_report.ncu-rep  # Open in Nsight Compute GUI

# Quick single-metric check
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./program
```

Memory Bandwidth Measurement
```cuda
// Measure effective bandwidth of a kernel using CUDA events.
// Assumes copy_kernel, d_in, d_out, grid, and block are defined elsewhere.
void measure_bandwidth(int n) {
  size_t bytes = 2 * (size_t)n * sizeof(float); // Read A + write B
  cudaEvent_t start, stop;
  CUDA_CHECK(cudaEventCreate(&start));
  CUDA_CHECK(cudaEventCreate(&stop));
  CUDA_CHECK(cudaEventRecord(start));
  copy_kernel<<<grid, block>>>(d_in, d_out, n);
  CUDA_CHECK(cudaEventRecord(stop));
  CUDA_CHECK(cudaEventSynchronize(stop));
  float ms = 0;
  CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
  float gb_per_sec = bytes / (ms * 1e6);
  printf("Effective bandwidth: %.2f GB/s\n", gb_per_sec);
  CUDA_CHECK(cudaEventDestroy(start));
  CUDA_CHECK(cudaEventDestroy(stop));
}
```

Tooling
Essential Commands
```bash
# Compile CUDA code
nvcc -arch=sm_80 -O3 -o program main.cu         # Single file
nvcc -arch=native -lineinfo -o program main.cu  # With debug line info

# CMake build
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="70;80;86"
cmake --build build -j$(nproc)

# Runtime debugging
compute-sanitizer ./program                   # Memory errors (replaces cuda-memcheck)
compute-sanitizer --tool racecheck ./program  # Shared memory race conditions
compute-sanitizer --tool initcheck ./program  # Uninitialized device memory reads
compute-sanitizer --tool synccheck ./program  # Synchronization errors

# Profiling
nsys profile ./program  # System-level timeline
ncu ./program           # Kernel-level metrics
ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 ./program

# Device info
nvidia-smi      # GPU status and memory usage
nvcc --version  # CUDA compiler version
```

CMakeLists.txt Template
```cmake
cmake_minimum_required(VERSION 3.18)
project(myproject LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_ARCHITECTURES 70 80 86)
set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)

find_package(CUDAToolkit REQUIRED)

add_library(kernels src/kernels.cu)
target_include_directories(kernels PUBLIC include)

add_executable(main src/main.cpp)
target_link_libraries(main kernels CUDA::cudart)

enable_testing()
add_executable(tests tests/test_kernels.cu)
target_link_libraries(tests kernels CUDA::cudart)
add_test(NAME gpu_tests COMMAND tests)
```

References
For detailed patterns and examples, see:
- references/patterns.md -- Tiled matrix multiply, parallel reduction tree, stream overlap pipeline