ta-kernel-butler


Ascend NPU Kernel Butler


Expert guide for Ascend NPU hardware architecture and triton-ascend kernel development. Avoid confusion with GPU concepts by understanding the fundamental differences between Ascend NPU and GPU architectures.

Overview


Ascend NPU (Neural Processing Unit) is Huawei's AI accelerator with a fundamentally different architecture from GPUs. This skill provides accurate, NPU-specific guidance for kernel development using triton-ascend, ensuring code correctness and optimal performance.
Critical: When answering Ascend-related questions, always use NPU-specific terminology and concepts. Do not map GPU concepts (warp, SM, shared memory) directly to NPU architecture.

Ascend Hardware Architecture


AI Core Structure


The AI Core is the fundamental computing unit in Ascend NPU, organized differently from GPU Streaming Multiprocessors (SM):
| Component | Function | GPU Equivalent |
| --- | --- | --- |
| Cube Unit | Matrix computation (16x16 FP16 matmul per cycle) | Tensor Core |
| Vector Unit | SIMD vector operations | CUDA Core |
| Scalar Unit | Control flow, instruction scheduling (mini-CPU) | Warp Scheduler |
| MTE1/MTE2/MTE3 | Data transfer between storage levels | Memory Controller |
| FixPipe | On-the-fly format/type conversion | N/A |

Operating Modes


Coupled Mode (A1 series):
  • Single Scalar unit schedules both Cube and Vector
Decoupled Mode (A2/A3 series):
  • Independent Scalar units for Cube and Vector
  • Higher parallelism potential

Memory Hierarchy


```
Global Memory (GM)
        ↓ MTE2
     L1 Buffer
        ↓ MTE1
    ┌────┴────┐
    ↓         ↓
   L0A       L0B      (Cube inputs)
    └────┬────┘
         ↓ Cube
        L0C           (Cube output)
         ↓ FixPipe
  Unified Buffer (UB)
         ↓ MTE3
  Global Memory (GM)
```

Vector-unit data takes a shorter round trip: GM → (MTE2) → Unified Buffer (UB) → Vector → (MTE3) → GM. MTE2 moves data inbound from Global Memory, MTE1 feeds the Cube's L0 buffers from L1, and MTE3 writes results back out to Global Memory.
Key differences from GPU:
  • No unified memory space
  • Explicit data movement between levels (MTE units)
  • Strict data flow paths
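The storage levels and the engine that moves data between each adjacent pair can be written down as a small table of hops; checking that each hop ends where the next one starts is a quick sanity check when planning a kernel's data flow. A minimal sketch (the constants and helper below are descriptive, not triton-ascend APIs):

```python
# (source, transfer engine, destination) hops for the Cube data path
CUBE_PATH = [
    ("GM",      "MTE2",    "L1"),
    ("L1",      "MTE1",    "L0A/L0B"),
    ("L0A/L0B", "Cube",    "L0C"),
    ("L0C",     "FixPipe", "UB"),
    ("UB",      "MTE3",    "GM"),
]

# The Vector unit takes a shorter round trip through the Unified Buffer.
VECTOR_PATH = [
    ("GM", "MTE2",   "UB"),
    ("UB", "Vector", "UB"),
    ("UB", "MTE3",   "GM"),
]

def path_is_connected(path) -> bool:
    """Each hop must start where the previous hop ended."""
    return all(a[2] == b[0] for a, b in zip(path, path[1:]))

print(path_is_connected(CUBE_PATH), path_is_connected(VECTOR_PATH))  # → True True
```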

Common GPU vs NPU Confusions


1. Memory Model


GPU: Unified memory space; shared memory is accessible by all threads in a block
  • Shared memory → fast, software-managed cache
NPU: Multi-level storage hierarchy; explicit data movement required
  • Unified Buffer (UB) → general-purpose data staging
  • L0A/L0B/L0C → Cube-unit-specific buffers
  • L1 Buffer → intermediate storage

2. Threading Model


GPU: Thread blocks, warps (32 threads), SIMT execution
  • `tl.program_id(axis)` → block ID
  • `tl.arange()` → thread IDs within the block
NPU: Block-based execution, no warp concept
  • Blocks are the fundamental execution unit
  • No SIMT warp-level synchronization
  • Use block-level barriers instead

3. Synchronization


GPU: `__syncthreads()`, warp-level primitives
  • `tl.atomic_*` for shared-memory atomics
NPU: `PipeBarrier` and `SetFlag`/`WaitFlag` for pipeline synchronization
  • Different synchronization semantics
  • Avoid GPU synchronization patterns

4. Data Access Patterns


GPU: Flexible memory access; coalescing is important
  • Arbitrary access patterns are possible (at a performance cost)
NPU: Strict alignment requirements
  • Vector instructions require 32B alignment
  • Cache-line alignment improves load efficiency
  • Plan data movement carefully
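A small helper makes the padding arithmetic concrete: round a buffer's byte size up to the next alignment boundary before planning data movement (`align_up` is an illustrative helper, not a triton-ascend API):

```python
def align_up(nbytes: int, alignment: int = 32) -> int:
    """Round nbytes up to the next multiple of `alignment` (a power of two)."""
    return (nbytes + alignment - 1) & ~(alignment - 1)

# 100 fp16 elements occupy 200 bytes; padding to the 32 B vector
# requirement gives 224 bytes, i.e. 112 elements.
print(align_up(200))      # → 224
print(align_up(256, 64))  # → 256
```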

triton-ascend Development


Basic Kernel Structure


```python
import triton
import triton.language as tl

@triton.jit
def npu_kernel(
    x_ptr, y_ptr, z_ptr,
    n_elements,
    BLOCK_SIZE: tl.constexpr,
):
    # Block ID (different from the GPU thread-block concept)
    pid = tl.program_id(axis=0)

    # Offset calculation
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the final partial block

    # Load from Global Memory into the Unified Buffer
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)

    # Compute in the Vector unit
    z = x + y

    # Store back to Global Memory
    tl.store(z_ptr + offsets, z, mask=mask)
```
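The indexing logic above can be modeled in plain Python to see how the grid is sized and which elements each block touches (`grid_size` and `block_offsets` are illustrative helpers, not triton-ascend APIs):

```python
def grid_size(n_elements: int, block_size: int) -> int:
    """Ceiling division: number of blocks needed to cover n_elements."""
    return (n_elements + block_size - 1) // block_size

def block_offsets(pid: int, block_size: int, n_elements: int) -> list[int]:
    """Model of the kernel's offsets + mask: elements block `pid` processes."""
    start = pid * block_size
    offsets = range(start, start + block_size)
    return [o for o in offsets if o < n_elements]  # mask drops out-of-range offsets

# 1000 elements with BLOCK_SIZE=256 need 4 blocks; the last block is partial.
print(grid_size(1000, 256))              # → 4
print(len(block_offsets(3, 256, 1000)))  # → 232
```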

GPU to NPU Migration Checklist


When migrating GPU Triton kernels to NPU:
  • Replace `tl.dot()` with explicit `tl.matmul()` for NPU
  • Check data alignment (32B for Vector, 64B for Cube)
  • Verify memory access patterns match the NPU hierarchy
  • Remove GPU-specific synchronization primitives
  • Use NPU-specific intrinsic functions when needed
  • Consider multi-buffering for pipeline efficiency

Performance Optimization


Reduce Scalar Computation


Scalar units have limited throughput. Minimize:
  • Complex branching logic
  • Runtime-dependent calculations
  • Dynamic loop conditions
**Good**: precompute at compile time

```python
TILE_SIZE: tl.constexpr = 64
```

**Avoid**: runtime calculation

```python
tile_size = tl.sqrt(n_elements).to(tl.int32)
```

Data Alignment


  • Vector instructions: 32B alignment minimum
  • Cache-line alignment: 64B for better performance
  • Use `tl.contiguous()` to ensure memory layout

Cache Utilization


Maximize ICache (instruction cache) and DCache (data cache):
  • Keep kernels compact
  • Reuse loaded data
  • Minimize Global Memory access
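To make the data-reuse point concrete, the back-of-the-envelope arithmetic below compares Global Memory traffic for a naive matmul against one that stages T×T tiles in on-chip buffers (the helper names are illustrative, not triton-ascend APIs):

```python
def gm_loads_naive(M: int, N: int, K: int) -> int:
    """Every output element re-reads a K-row of A and a K-column of B from GM."""
    return M * N * 2 * K

def gm_loads_tiled(M: int, N: int, K: int, T: int) -> int:
    """With T x T tiles staged on-chip, each tile pair is read from GM once."""
    assert M % T == 0 and N % T == 0 and K % T == 0
    return (M // T) * (N // T) * (K // T) * 2 * T * T

# Staging 16x16 tiles cuts GM traffic by a factor of T = 16.
naive = gm_loads_naive(64, 64, 64)      # 524288 element loads
tiled = gm_loads_tiled(64, 64, 64, 16)  # 32768 element loads
print(naive // tiled)  # → 16
```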

Key Intrinsic Functions


| Function | Purpose | Notes |
| --- | --- | --- |
| `tl.program_id(axis)` | Get block index | Not a thread ID |
| `tl.arange(start, stop)` | Generate offset sequence | Block-local |
| `tl.load(ptr)` | Load from GM to UB | Respects alignment |
| `tl.store(ptr, val)` | Store from UB to GM | Respects alignment |
| `tl.matmul(a, b)` | Matrix multiplication | Uses the Cube unit |
| `tl.exp(x)`, `tl.sqrt(x)` | Math functions | Vector unit |

Migration from GPU Triton


For detailed migration guidance, refer to:
  • [references/migrate-from-gpu.md](references/migrate-from-gpu.md) - Step-by-step migration guide
  • [references/architecture-difference.md](references/architecture-difference.md) - Detailed architecture comparison
When migrating kernels:
  1. Analyze memory access patterns
  2. Verify data flow through storage hierarchy
  3. Replace GPU-specific operations with NPU equivalents
  4. Test with small inputs first
  5. Profile and optimize based on NPU-specific counters

Additional Resources


Official Documentation


Reference Files in This Skill


  • [references/hardware-architecture.md](references/hardware-architecture.md) - Detailed hardware architecture
  • [references/triton-ascend-guide.md](references/triton-ascend-guide.md) - Development workflow
  • [references/gpu-npu-differences.md](references/gpu-npu-differences.md) - Comprehensive comparison

Example Code


Working examples in `examples/`:
  • [kernel-example.py](examples/kernel-example.py) - Basic NPU kernel template

Common Pitfalls


  1. Using GPU terminology → Always use NPU-specific terms (AI Core, not SM; UB, not shared memory)
  2. Ignoring alignment → Vector ops require 32B alignment, Cache Line is 64B
  3. Wrong synchronization → No warps on NPU, use block-level barriers
  4. Excessive Scalar computation → Scalar units are slow, precompute at compile time
  5. Poor data reuse → Minimize GM access, maximize UB/L1 utilization