Ascend NPU Kernel Butler
Expert guide for Ascend NPU hardware architecture and triton-ascend kernel development. Avoid confusion with GPU concepts by understanding the fundamental differences between Ascend NPU and GPU architectures.
Overview
Ascend NPU (Neural Processing Unit) is Huawei's AI accelerator with a fundamentally different architecture from GPUs. This skill provides accurate, NPU-specific guidance for kernel development using triton-ascend, ensuring code correctness and optimal performance.
Critical: When answering Ascend-related questions, always use NPU-specific terminology and concepts. Do not map GPU concepts (warp, SM, shared memory) directly to NPU architecture.
Ascend Hardware Architecture
AI Core Structure
The AI Core is the fundamental computing unit in Ascend NPU, organized differently from GPU Streaming Multiprocessors (SM):
| Component | Function | GPU Equivalent |
|---|---|---|
| Cube Unit | Matrix computation (16x16 FP16 matmul per cycle) | Tensor Core |
| Vector Unit | SIMD vector operations | CUDA Core |
| Scalar Unit | Control flow, instruction scheduling (mini-CPU) | Warp Scheduler |
| MTE1/MTE2/MTE3 | Data transfer between storage levels | Memory Controller |
| FixPipe | On-the-fly format/type conversion | N/A |
Operating Modes
Coupled Mode (A1 series):
- Single Scalar unit schedules both Cube and Vector
Decoupled Mode (A2/A3 series):
- Independent Scalar units for Cube and Vector
- Higher parallelism potential
Memory Hierarchy
```
Global Memory (GM)
      ↓ MTE2
Unified Buffer (UB)
      ↓ MTE2
L1 Buffer
      ↓ MTE1
┌─────────┬─────────┐
↓         ↓         ↓
L0A       L0B       L0C
(Cube     (Cube     (Cube
 input)   input)    output)
└─────────┴─────────┘
      ↓ FixPipe
Unified Buffer (UB)
      ↓ MTE3
Global Memory (GM)
```

Key differences from GPU:
- No unified memory space
- Explicit data movement between levels (MTE units)
- Strict data flow paths
Common GPU vs NPU Confusions
1. Memory Model
GPU: Unified memory space, shared memory accessible by all threads in a block
- `shared memory` → fast, software-managed cache
NPU: Multi-level storage hierarchy, explicit data movement required
- Unified Buffer (UB) → general-purpose data staging
- L0A/L0B/L0C → Cube unit specific buffers
- L1 Buffer → intermediate storage
2. Threading Model
GPU: Thread blocks, warps (32 threads), SIMT execution
- `tl.program_id(axis)` → block ID
- `tl.arange()` → thread ID within block
NPU: Block-based execution, no warp concept
- Blocks are the fundamental execution unit
- No SIMT warp-level synchronization
- Use block-level barriers instead
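The block-based model above can be sketched in plain Python. This is a host-side illustration only (the function name is hypothetical, not a triton-ascend API): each program instance `pid` covers one contiguous block of offsets, with no finer-grained warp partitioning underneath.

```python
def block_offsets(pid: int, block_size: int, n_elements: int) -> list:
    """Offsets covered by one program instance; mirrors
    pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE), masked to n_elements."""
    start = pid * block_size
    return [off for off in range(start, start + block_size) if off < n_elements]

# With n_elements=10 and block_size=4, three blocks cover all elements:
# block 0 -> [0, 1, 2, 3], block 1 -> [4, 5, 6, 7], block 2 -> [8, 9]
```

The last block is shorter than `block_size`, which is why NPU kernels mask their loads and stores instead of relying on per-thread bounds checks.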
3. Synchronization
GPU: `__syncthreads()`, warp-level primitives
- `tl.atomic_*` for shared memory atomics
NPU: PipeBarrier and SetFlag/WaitFlag for pipeline synchronization
- Different synchronization semantics
- Avoid GPU synchronization patterns
4. Data Access Patterns
GPU: Flexible memory access, coalescing important
- Arbitrary access patterns possible (with performance cost)
NPU: Strict alignment requirements
- Vector instructions require 32B alignment
- Cache Line alignment improves load efficiency
- Plan data movement carefully
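The alignment rules above reduce to a simple round-up. The helper below is a plain-Python sketch (the name is illustrative, not a triton-ascend API) that converts a byte-alignment requirement into an element-count multiple:

```python
def aligned_elements(n: int, elem_bytes: int, align_bytes: int = 32) -> int:
    """Round n up so the block's byte size is a multiple of align_bytes.
    E.g. for FP16 (2 bytes) and 32B alignment, n is padded to a multiple of 16."""
    per_align = align_bytes // elem_bytes  # elements per alignment unit
    return ((n + per_align - 1) // per_align) * per_align

# aligned_elements(100, 2) pads a 100-element FP16 block to 112 elements (224 bytes)
```

Choosing `BLOCK_SIZE` as a padded value like this up front avoids misaligned vector accesses without per-iteration checks.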
triton-ascend Development
Basic Kernel Structure
```python
import triton
import triton.language as tl


@triton.jit
def npu_kernel(
    x_ptr, y_ptr, z_ptr,
    n_elements,
    BLOCK_SIZE: tl.constexpr,
):
    # Block ID (different from GPU thread block concept)
    pid = tl.program_id(axis=0)
    # Offset calculation
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Guard against out-of-bounds access when n_elements is not
    # a multiple of BLOCK_SIZE
    mask = offsets < n_elements
    # Load from Global Memory to Unified Buffer
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Compute in Vector Unit
    z = x + y
    # Store back to Global Memory
    tl.store(z_ptr + offsets, z, mask=mask)
```
GPU to NPU Migration Checklist
When migrating GPU Triton kernels to NPU:
- Replace `tl.dot()` with explicit `tl.matmul()` for NPU
- Check data alignment (32B for Vector, 64B for Cube)
- Verify memory access patterns match NPU hierarchy
- Remove GPU-specific synchronization primitives
- Use NPU-specific intrinsic functions when needed
- Consider multi-buffering for pipeline efficiency
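Launch configuration is usually the first thing to re-derive during migration. The helper below is a plain-Python sketch mirroring the ceiling division that `triton.cdiv` performs when sizing the block grid:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, as used to compute how many blocks cover a tensor."""
    return (a + b - 1) // b

# A 1000-element tensor with BLOCK_SIZE=256 needs cdiv(1000, 256) = 4 blocks;
# the last, partially filled block relies on masked loads/stores to stay in bounds.
```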
Performance Optimization
Reduce Scalar Computation
Scalar units have limited throughput. Minimize:
- Complex branching logic
- Runtime-dependent calculations
- Dynamic loop conditions
**Good**:
```python
# Precompute at compile time
TILE_SIZE: tl.constexpr = 64
```

**Avoid**:
```python
# Runtime calculation
tile_size = tl.sqrt(n_elements).to(tl.int32)
```

Data Alignment
- Vector instructions: 32B alignment minimum
- Cache Line alignment: 64B for better performance
- Use `tl.contiguous()` to ensure memory layout
Cache Utilization
Maximize ICache (instruction cache) and DCache (data cache):
- Keep kernels compact
- Reuse loaded data
- Minimize Global Memory access
Key Intrinsic Functions
| Function | Purpose | Notes |
|---|---|---|
| `tl.program_id()` | Get block index | Not thread ID |
| `tl.arange()` | Generate offset sequence | Block-local |
| `tl.load()` | Load from GM to UB | Respects alignment |
| `tl.store()` | Store from UB to GM | Respects alignment |
| `tl.dot()` | Matrix multiplication | Uses Cube unit |
| `tl.exp()`, `tl.sqrt()`, etc. | Math functions | Vector unit |
Migration from GPU Triton
For detailed migration guidance, refer to:
- [references/migrate-from-gpu.md](references/migrate-from-gpu.md) - Step-by-step migration guide
- [references/architecture-difference.md](references/architecture-difference.md) - Detailed architecture comparison
When migrating kernels:
- Analyze memory access patterns
- Verify data flow through storage hierarchy
- Replace GPU-specific operations with NPU equivalents
- Test with small inputs first
- Profile and optimize based on NPU-specific counters
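"Test with small inputs first" can be as simple as comparing kernel output against a pure-Python reference. The harness below is an illustrative sketch (the function names are hypothetical, not part of triton-ascend), shown for the vector-add kernel from the basic structure section:

```python
def allclose(actual, expected, rtol=1e-3, atol=1e-5):
    """Element-wise tolerance check over two equal-length sequences."""
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))

def check_add_kernel(run_kernel, xs, ys):
    """Compare a vector-add kernel's output against the obvious reference."""
    expected = [x + y for x, y in zip(xs, ys)]
    return allclose(run_kernel(xs, ys), expected)

# Usage with a trivial stand-in for the real kernel launch:
# check_add_kernel(lambda xs, ys: [x + y for x, y in zip(xs, ys)],
#                  [1.0, 2.0], [3.0, 4.0])
```

Once the small-input comparison passes, scale up to sizes that are not multiples of `BLOCK_SIZE` to exercise the masked edge block.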
Additional Resources
Official Documentation
- Ascend Basic Architecture - Hardware fundamentals
- Abstract Hardware Architecture - Programming model
- Architecture Difference - GPU vs NPU comparison
- Migration Guide - Kernel migration
- Performance Guidelines - Optimization tips
- Core Features - triton-ascend design
Reference Files in This Skill
- [references/hardware-architecture.md](references/hardware-architecture.md) - Detailed hardware architecture
- [references/triton-ascend-guide.md](references/triton-ascend-guide.md) - Development workflow
- [references/gpu-npu-differences.md](references/gpu-npu-differences.md) - Comprehensive comparison
Example Code
Working examples in `examples/`:
- [kernel-example.py](examples/kernel-example.py) - Basic NPU kernel template
Common Pitfalls
- Using GPU terminology → Always use NPU-specific terms (AI Core, not SM; UB, not shared memory)
- Ignoring alignment → Vector ops require 32B alignment, Cache Line is 64B
- Wrong synchronization → No warps on NPU, use block-level barriers
- Excessive Scalar computation → Scalar units are slow, precompute at compile time
- Poor data reuse → Minimize GM access, maximize UB/L1 utilization