Ascend NPU Kernel Butler
Expert guide for Ascend NPU hardware architecture and triton-ascend kernel development. Avoid confusion with GPU concepts by understanding the fundamental differences between Ascend NPU and GPU architectures.
Overview
Ascend NPU (Neural Processing Unit) is Huawei's AI accelerator with a fundamentally different architecture from GPUs. This skill provides accurate, NPU-specific guidance for kernel development using triton-ascend, ensuring code correctness and optimal performance.
Critical: When answering Ascend-related questions, always use NPU-specific terminology and concepts. Do not map GPU concepts (warp, SM, shared memory) directly to NPU architecture.
Ascend Hardware Architecture
AI Core Structure
The AI Core is the fundamental computing unit in Ascend NPU, organized differently from GPU Streaming Multiprocessors (SM):
| Component | Function | GPU Equivalent |
|---|---|---|
| Cube Unit | Matrix computation (16x16 FP16 matmul per cycle) | Tensor Core |
| Vector Unit | SIMD vector operations | CUDA Core |
| Scalar Unit | Control flow, instruction scheduling (mini-CPU) | Warp Scheduler |
| MTE1/MTE2/MTE3 | Data transfer between storage levels | Memory Controller |
| FixPipe | On-the-fly format/type conversion | N/A |
Operating Modes
Coupled Mode (A1 series):
- Single Scalar unit schedules both Cube and Vector
Decoupled Mode (A2/A3 series):
- Independent Scalar units for Cube and Vector
- Higher parallelism potential
Memory Hierarchy
```
Global Memory (GM)
      ↓ MTE2
Unified Buffer (UB)
      ↓ MTE2
L1 Buffer
      ↓ MTE1
┌─────────┬─────────┐
↓         ↓         ↓
L0A       L0B       L0C
(Cube     (Cube     (Cube
 input)   input)    output)
└─────────┴─────────┘
      ↓ FixPipe
Unified Buffer (UB)
      ↓ MTE3
Global Memory (GM)
```

Key differences from GPU:
- No unified memory space
- Explicit data movement between levels (MTE units)
- Strict data flow paths
Common GPU vs NPU Confusions
1. Memory Model
GPU: Unified memory space, shared memory accessible by all threads in a block
- `shared memory` → fast, software-managed cache
NPU: Multi-level storage hierarchy, explicit data movement required
- Unified Buffer (UB) → general-purpose data staging
- L0A/L0B/L0C → Cube unit specific buffers
- L1 Buffer → intermediate storage
2. Threading Model
GPU: Thread blocks, warps (32 threads), SIMT execution
- `tl.program_id(axis)` → block ID
- `tl.arange()` → thread ID within block
NPU: Block-based execution, no warp concept
- Blocks are the fundamental execution unit
- No SIMT warp-level synchronization
- Use block-level barriers instead
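The block-based model above can be sketched in plain Python. This is a host-side illustration only (the function name is hypothetical, not a triton-ascend API): each program instance `pid` covers one contiguous block of offsets, with no finer-grained warp partitioning underneath.

```python
def block_offsets(pid: int, block_size: int, n_elements: int) -> list:
    """Offsets covered by one program instance; mirrors
    pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE), masked to n_elements."""
    start = pid * block_size
    return [off for off in range(start, start + block_size) if off < n_elements]

# With n_elements=10 and block_size=4, three blocks cover all elements:
# block 0 -> [0, 1, 2, 3], block 1 -> [4, 5, 6, 7], block 2 -> [8, 9]
```

The last block is shorter than `block_size`, which is why NPU kernels mask their loads and stores instead of relying on per-thread bounds checks.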
3. Synchronization
GPU: `__syncthreads()`, warp-level primitives
- `tl.atomic_*` for shared memory atomics
NPU: PipeBarrier and SetFlag/WaitFlag for pipeline synchronization
- Different synchronization semantics
- Avoid GPU synchronization patterns
4. Data Access Patterns
GPU: Flexible memory access, coalescing important
- Arbitrary access patterns possible (with performance cost)
NPU: Strict alignment requirements
- Vector instructions require 32B alignment
- Cache Line alignment improves load efficiency
- Plan data movement carefully
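The alignment rules above reduce to a simple round-up. The helper below is a plain-Python sketch (the name is illustrative, not a triton-ascend API) that converts a byte-alignment requirement into an element-count multiple:

```python
def aligned_elements(n: int, elem_bytes: int, align_bytes: int = 32) -> int:
    """Round n up so the block's byte size is a multiple of align_bytes.
    E.g. for FP16 (2 bytes) and 32B alignment, n is padded to a multiple of 16."""
    per_align = align_bytes // elem_bytes  # elements per alignment unit
    return ((n + per_align - 1) // per_align) * per_align

# aligned_elements(100, 2) pads a 100-element FP16 block to 112 elements (224 bytes)
```

Choosing `BLOCK_SIZE` as a padded value like this up front avoids misaligned vector accesses without per-iteration checks.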
triton-ascend Development
Basic Kernel Structure
```python
import triton
import triton.language as tl


@triton.jit
def npu_kernel(
    x_ptr, y_ptr, z_ptr,
    n_elements,
    BLOCK_SIZE: tl.constexpr,
):
    # Block ID (different from GPU thread block concept)
    pid = tl.program_id(axis=0)
    # Offset calculation
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Guard against out-of-bounds access when n_elements is not
    # a multiple of BLOCK_SIZE
    mask = offsets < n_elements
    # Load from Global Memory to Unified Buffer
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Compute in Vector Unit
    z = x + y
    # Store back to Global Memory
    tl.store(z_ptr + offsets, z, mask=mask)
```
GPU to NPU Migration Checklist
When migrating GPU Triton kernels to NPU:
- Replace `tl.dot()` with explicit `tl.matmul()` for NPU
- Check data alignment (32B for Vector, 64B for Cube)
- Verify memory access patterns match NPU hierarchy
- Remove GPU-specific synchronization primitives
- Use NPU-specific intrinsic functions when needed
- Consider multi-buffering for pipeline efficiency
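Launch configuration is usually the first thing to re-derive during migration. The helper below is a plain-Python sketch mirroring the ceiling division that `triton.cdiv` performs when sizing the block grid:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, as used to compute how many blocks cover a tensor."""
    return (a + b - 1) // b

# A 1000-element tensor with BLOCK_SIZE=256 needs cdiv(1000, 256) = 4 blocks;
# the last, partially filled block relies on masked loads/stores to stay in bounds.
```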
Performance Optimization
Reduce Scalar Computation
Scalar units have limited throughput. Minimize:
- Complex branching logic
- Runtime-dependent calculations
- Dynamic loop conditions
**Good**:
```python
# Precompute at compile time
TILE_SIZE: tl.constexpr = 64
```

**Avoid**:
```python
# Runtime calculation
tile_size = tl.sqrt(n_elements).to(tl.int32)
```

Data Alignment
- Vector instructions: 32B alignment minimum
- Cache Line alignment: 64B for better performance
- Use `tl.contiguous()` to ensure memory layout
Cache Utilization
Maximize ICache (instruction cache) and DCache (data cache):
- Keep kernels compact
- Reuse loaded data
- Minimize Global Memory access
Key Intrinsic Functions
| Function | Purpose | Notes |
|---|---|---|
| `tl.program_id()` | Get block index | Not thread ID |
| `tl.arange()` | Generate offset sequence | Block-local |
| `tl.load()` | Load from GM to UB | Respects alignment |
| `tl.store()` | Store from UB to GM | Respects alignment |
| `tl.dot()` | Matrix multiplication | Uses Cube unit |
| `tl.exp()`, `tl.sqrt()`, etc. | Math functions | Vector unit |
Migration from GPU Triton
For detailed migration guidance, refer to:
- [references/migrate-from-gpu.md](references/migrate-from-gpu.md) - Step-by-step migration guide
- [references/architecture-difference.md](references/architecture-difference.md) - Detailed architecture comparison
When migrating kernels:
- Analyze memory access patterns
- Verify data flow through storage hierarchy
- Replace GPU-specific operations with NPU equivalents
- Test with small inputs first
- Profile and optimize based on NPU-specific counters
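"Test with small inputs first" can be as simple as comparing kernel output against a pure-Python reference. The harness below is an illustrative sketch (the function names are hypothetical, not part of triton-ascend), shown for the vector-add kernel from the basic structure section:

```python
def allclose(actual, expected, rtol=1e-3, atol=1e-5):
    """Element-wise tolerance check over two equal-length sequences."""
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))

def check_add_kernel(run_kernel, xs, ys):
    """Compare a vector-add kernel's output against the obvious reference."""
    expected = [x + y for x, y in zip(xs, ys)]
    return allclose(run_kernel(xs, ys), expected)

# Usage with a trivial stand-in for the real kernel launch:
# check_add_kernel(lambda xs, ys: [x + y for x, y in zip(xs, ys)],
#                  [1.0, 2.0], [3.0, 4.0])
```

Once the small-input comparison passes, scale up to sizes that are not multiples of `BLOCK_SIZE` to exercise the masked edge block.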
Additional Resources
Official Documentation
- Ascend Basic Architecture - Hardware fundamentals
- Abstract Hardware Architecture - Programming model
- Architecture Difference - GPU vs NPU comparison
- Migration Guide - Kernel migration
- Performance Guidelines - Optimization tips
- Core Features - triton-ascend design
Reference Files in This Skill
- [references/hardware-architecture.md](references/hardware-architecture.md) - Detailed hardware architecture
- [references/triton-ascend-guide.md](references/triton-ascend-guide.md) - Development workflow
- [references/gpu-npu-differences.md](references/gpu-npu-differences.md) - Comprehensive comparison
Example Code
Working examples in `examples/`:
- [kernel-example.py](examples/kernel-example.py) - Basic NPU kernel template
Common Pitfalls
- Using GPU terminology → Always use NPU-specific terms (AI Core, not SM; UB, not shared memory)
- Ignoring alignment → Vector ops require 32B alignment, Cache Line is 64B
- Wrong synchronization → No warps on NPU, use block-level barriers
- Excessive Scalar computation → Scalar units are slow, precompute at compile time
- Poor data reuse → Minimize GM access, maximize UB/L1 utilization