triton-operator-code-gen


Triton Operator Code Generation

Core Principles

Computing Logic → Tiling Strategy → Code Implementation
This order must never be reversed. Incorrect computing logic produces completely wrong results, while an incorrect tiling strategy causes performance problems or memory overflow.

Reference Resource Loading Routing Table

MANDATORY - Load On Demand: load the corresponding reference documents for each task stage.

| Phase | Must Load | Do Not Load |
| --- | --- | --- |
| Understand Requirement Documents | None | All references |
| Confirm Computing Logic | None | All references |
| Design Tiling Strategy | hardware-architecture.md | templates.md |
| Generate Kernel Code | templates.md | hardware-architecture.md |
| Generate Test Code | None | All references |

Workflow

Phase 1: Understand Requirement Documents

Extract: mathematical formulas, input/output specifications, constraints, tiling strategies

Phase 2: Confirm Computing Logic

  1. Describe the computing process in pseudocode
  2. Confirm the data dependencies
  3. Confirm precision handling (reductions must use FP32)
Output: the confirmed computing logic (must be agreed with the user)
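As an illustration, the pseudocode confirmed in this phase might look like the following NumPy sketch for a hypothetical row-wise softmax operator; note the FP32 upcast before both reductions, per the precision rule above:

```python
import numpy as np

def softmax_reference(x: np.ndarray) -> np.ndarray:
    # Upcast to FP32 before any reduction, even for FP16/BF16 input
    x32 = x.astype(np.float32)
    row_max = x32.max(axis=-1, keepdims=True)   # reduction 1: row max
    e = np.exp(x32 - row_max)
    out = e / e.sum(axis=-1, keepdims=True)     # reduction 2: row sum
    return out.astype(x.dtype)                  # cast back to the input dtype
```

Writing the reference this way also doubles as the test baseline for Phase 5.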

Phase 3: Design Tiling Strategy

MANDATORY - READ ENTIRE FILE: You must read the entire hardware-architecture.md before designing the tiling strategy. Never set any line limits.
Inter-core Partitioning Principles (Must Follow):
  1. grid = number of physical cores: ensure every core is utilized and no resources are wasted
  2. Balanced intra-core load: each core computes for itself which data it should process, achieving load balance

```python
core_num = get_npu_aicore_num()  # or get_npu_vectorcore_num()

grid = (core_num,)  # Principle 1: grid must equal the number of physical cores

@triton.jit
def xxx_fwd(
    # ... pointer arguments omitted ...
    M, N,
    BLOCK_M: tl.constexpr,
    BLOCK_N: tl.constexpr,
):
    pid = tl.program_id(0)
    num_core = tl.num_programs(0)

    num_block_m = tl.cdiv(M, BLOCK_M)
    num_block_n = tl.cdiv(N, BLOCK_N)

    total_blocks = num_block_m * num_block_n

    # Principle 2: each core loops over multiple blocks,
    # computing for itself which data to process
    for block_idx in range(pid, total_blocks, num_core):
        pid_m = block_idx // num_block_n
        pid_n = block_idx % num_block_n
```

UB Space Calculation:
Total UB size: 192KB (A2/A3)
Safe BLOCK_SIZE = (196608 - 32) / (number of buffers × data type size) × 0.8
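The UB formula above can be wrapped in a small helper. This is a sketch; the function and argument names are illustrative, with the A2/A3 constants taken from the formula:

```python
def safe_block_size(num_buffers: int, dtype_bytes: int,
                    ub_bytes: int = 192 * 1024,   # total UB on A2/A3
                    reserved: int = 32,           # reserved bytes
                    margin: float = 0.8) -> int:  # safety margin
    """Estimate a safe BLOCK_SIZE (in elements) per buffer."""
    usable = ub_bytes - reserved
    return int(usable / (num_buffers * dtype_bytes) * margin)

# e.g. a kernel holding 3 FP32 buffers: safe_block_size(3, 4) -> 13105
```

The 0.8 margin leaves headroom for compiler-allocated temporaries that the buffer count does not capture.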

Phase 4: Generate Kernel Code

MANDATORY - READ ENTIRE FILE: You must read the entire templates.md before generating code. Never set any line limits.
Flexibly refer to the corresponding template based on the operator type:

| Operator Type | Features | Core Type | Template |
| --- | --- | --- | --- |
| Reduction | sum/max/min reduction | vector core | Template 1 |
| GEMM | tl.dot() matrix multiplication | AI core | Template 2 |
| Activation Function | Element-wise computation | vector core | Template 3 |
| Loss Function | softmax + reduction | vector core | Template 4 |
| Index Transformation | Index computation, conditional branching | vector core | Template 5 |
| Attention | QK^T + SV multi-stage | AI core | Template 6 |
| MoE | Gating mechanism | vector core | Template 7 |
| Post-processing | Simple data transformation | vector core | Template 8 |
| Convolution | State update, sliding window | AI core | Template 9 |

Phase 5: Generate Test Code

Anti-Pattern List (NEVER)

  • ❌ Writing code before the computing logic is confirmed
  • ❌ Ignoring the UB size limit (192KB)
  • ❌ Performing reductions without FP32 precision
  • ❌ Using the int64 data type (extremely poor performance)
  • ❌ Letting the grid size exceed 65535
  • ❌ Using third-party libraries inside the kernel
  • ❌ Computing one element at a time
  • ❌ Overly complex optimizations (e.g., diagonal core partitioning)
  • ❌ Calling third-party functions to get the core count
  • ❌ Confusing the roles of Vector Core and Cube Core
  • ❌ Implementing the operator with PyTorch instead of Triton
  • ❌ Skipping operator correctness tests
  • ❌ Not testing the operator on the NPU
  • ❌ Not verifying the accuracy of the test baseline
  • ❌ A grid size unequal to the number of physical cores (violates Inter-core Partitioning Principle 1)
  • ❌ Unbalanced inter-core load (violates Inter-core Partitioning Principle 2)

Common Pitfalls

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| Incorrect computing logic | Output does not match expectations | Describe the computing process in pseudocode and confirm with the user |
| UB overflow | Runtime error "ub overflow" | Compute the total buffer size and reduce BLOCK_SIZE |
| coreDim exceeded | Runtime error "coreDim can't be greater than UINT16_MAX" | Increase BLOCK_SIZE or set TRITON_ALL_BLOCKS_PARALLEL=1 |
| Precision loss | Inaccurate results with FP16 input | Upcast to FP32 before reduction operations |
| Insufficient index width | D-cache error | For very large shapes the int32 index range is insufficient; switch to int64 |
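For the coreDim pitfall, the environment-variable workaround mentioned above would be applied at launch time, e.g.:

```shell
# Workaround when the block count would exceed UINT16_MAX coreDim
export TRITON_ALL_BLOCKS_PARALLEL=1
```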

Checklist

Computing Logic

  • Mathematical formulas are correctly understood
  • Pseudocode is consistent with formulas
  • Boundary conditions are handled correctly
  • Data type conversions are correct

Tiling Strategy

  • grid = number of physical cores (Principle 1)
  • Intra-core loop handles multiple tasks with balanced load (Principle 2)
  • UB space calculation is correct
  • BLOCK_SIZE is reasonably selected

Kernel Implementation

  • The core-count function is called correctly
  • Pointer arithmetic is correct
  • Mask handling is correct
  • Precision handling is correct (FP32 for reductions)
  • No third-party library dependencies

Test Code

  • PyTorch reference implementation is correct
  • Test cases cover multiple shapes
  • Test cases cover multiple data types
  • Precision tolerance is set reasonably
  • Execute test code to ensure the operator runs correctly
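A minimal sketch of such a test harness follows. The names `my_op` and `reference_op` are hypothetical; on real hardware `my_op` would launch the Triton kernel on the NPU while the reference would normally be the PyTorch implementation (NumPy is used here only to keep the sketch self-contained):

```python
import numpy as np

SHAPES = [(1, 32), (17, 128), (256, 1024)]   # cover small, odd, and large shapes
DTYPES = {"float16": 1e-3, "float32": 1e-5}  # per-dtype accuracy tolerance

def accuracy_sweep(my_op, reference_op):
    """Compare an operator against a golden reference over shapes and dtypes."""
    for shape in SHAPES:
        for dtype, tol in DTYPES.items():
            x = np.random.default_rng(0).standard_normal(shape).astype(dtype)
            out = np.asarray(my_op(x), dtype=np.float64)
            ref = np.asarray(reference_op(x), dtype=np.float64)
            assert np.allclose(out, ref, rtol=tol, atol=tol), (shape, dtype)
```

Looser tolerances for FP16 reflect its reduced mantissa; tolerances that are too tight cause false failures, while tolerances that are too loose mask real precision bugs.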