# Static Code Inspection for Triton Operators (Ascend NPU)

## Inspection Principles

- **Ascend-specific constraints first**: the agent already has general Triton knowledge; focus on Ascend hardware differences.
- **Static analysis only**: identify issues solely by reading code; compile-time and runtime behavior are out of scope.
- **Zero tolerance for mask issues**: Ascend has zero tolerance for out-of-bounds access; this is the most critical difference.

## Severity Classification

Issues found during inspection are classified into the following levels, which must be marked in the report:

| Level | Meaning | Typical Issues |
| --- | --- | --- |
| P0 Critical | Guaranteed incorrect results or crashes | Missing mask, core type mismatch, atomic-in-loop deadlock |
| P1 Severe | High probability of precision or functional issues | Reduction without precision promotion, `tl.dot` without accumulator, Softmax without max subtraction |
| P2 Recommendation | Affects performance or maintainability | Redundant memory access, non-contiguous memory access, unaligned BLOCK |

## Inspection Workflow

### Phase 1: Host Side Inspection

MANDATORY - READ ENTIRE FILE: before inspecting the Host side, read
`ascend-triton-api-constraints.md` in full.

#### 1.1 Grid Configuration (P0)

| Check Item | How to Identify in Code |
| --- | --- |
| Hard-coded core count | Literals such as `grid = (20,)` or `grid = (24,)` |
| Core type mismatch | A kernel containing `tl.dot` uses `num_vectorcore` |
| Grid dimension | Unnecessary use of a 2D/3D grid (1D is recommended) |

Core type quick reference:

| Operator Type | Should Use | How to Obtain |
| --- | --- | --- |
| Contains `tl.dot` | AI Core | `get_device_properties(device)["num_aicore"]` |
| Element-wise / reduction / activation | Vector Core | `get_device_properties(device)["num_vectorcore"]` |

❌ P0: hard-coded core count + core type mismatch

```python
core_num = driver.active.utils.get_device_properties(device)["num_vectorcore"]
grid = (20,)  # but tl.dot is used in the kernel
```

✅ Correct

```python
core_num = driver.active.utils.get_device_properties(device)["num_aicore"]
grid = (min(core_num, triton.cdiv(n_elements, BLOCK_SIZE)),)
```
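In line with the static-analysis-only principle, the two P0 checks above can be sketched as a plain text scan over the operator's source. This is a minimal illustration, not part of any Ascend tooling: the helper name `find_grid_issues` and the regexes are assumptions, and a real checker would need more robust parsing.

```python
import re

def find_grid_issues(source: str) -> list:
    """Flag hard-coded core counts and core-type mismatches by reading code only."""
    issues = []
    # P0: a grid literal such as grid = (20,) or grid = (24,)
    if re.search(r"grid\s*=\s*\(\s*\d+\s*,?\s*\)", source):
        issues.append("P0: hard-coded core count in grid")
    # P0: kernel uses tl.dot but the host queries num_vectorcore
    if "tl.dot" in source and '"num_vectorcore"' in source:
        issues.append("P0: tl.dot kernel paired with num_vectorcore")
    return issues

bad = 'core_num = props["num_vectorcore"]\ngrid = (20,)  # kernel uses tl.dot'
print(find_grid_issues(bad))  # both P0 findings fire
```

The same scan returns an empty list for the correct pattern, since `grid = (min(core_num, ...),)` contains no bare integer literal.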

#### 1.2 Block Size Configuration (P1-P2)

| Check Item | Level |
| --- | --- |
| `BLOCK_SIZE` not declared as `tl.constexpr` | P1 |
| BLOCK_M/N/K for matrix operations not multiples of 16 | P2 (Cube unit granularity) |
| BLOCK_K not aligned to `kalign = 32 // dtype_bytes` | P2 |
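The two numeric rules in this table are mechanical enough to check directly. A minimal sketch, assuming the tile sizes and element width are already known from the code under review (the helper name `check_block_config` is illustrative):

```python
def check_block_config(block_m, block_n, block_k, dtype_bytes):
    """Static checks for matmul tile sizes; 16 is the Cube unit granularity."""
    findings = []
    # P2: BLOCK_M/N/K must be multiples of 16
    for name, v in (("BLOCK_M", block_m), ("BLOCK_N", block_n), ("BLOCK_K", block_k)):
        if v % 16 != 0:
            findings.append(f"P2: {name}={v} is not a multiple of 16")
    # P2: BLOCK_K should also align to kalign = 32 // dtype_bytes
    kalign = 32 // dtype_bytes
    if block_k % kalign != 0:
        findings.append(f"P2: BLOCK_K={block_k} not aligned to kalign={kalign}")
    return findings

print(check_block_config(64, 64, 24, 2))  # fp16: dtype_bytes=2, so kalign = 16
```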

### Phase 2: Device Side Inspection

#### 2.1 Mask Integrity (P0)

Ascend has zero tolerance for out-of-bounds access. Search every `tl.load` / `tl.store` and confirm each one satisfies at least one of the following:

- Has a `mask=` argument (`tl.load` also needs `other=`)
- Uses `make_block_ptr` (boundaries are handled automatically)

❌ P0: missing mask

```python
x = tl.load(x_ptr + offsets)
```

✅ Explicit mask

```python
x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
```

✅ make_block_ptr (automatic boundary handling)

```python
block_ptr = tl.make_block_ptr(base=ptr, shape=(M, N), ...)
x = tl.load(block_ptr)
```
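The "search every `tl.load` / `tl.store`" step can itself be sketched as a static scan. The helper below is a simplified illustration (the name `find_unmasked_access` is hypothetical): it only handles single-line calls and treats any use of `make_block_ptr` as covering the whole kernel, whereas a real checker would parse the AST.

```python
import re

def find_unmasked_access(kernel_src: str) -> list:
    """Flag tl.load/tl.store calls that carry neither mask= nor a block pointer."""
    findings = []
    uses_block_ptr = "make_block_ptr" in kernel_src
    for lineno, line in enumerate(kernel_src.splitlines(), 1):
        for call in re.findall(r"tl\.(?:load|store)\([^)]*\)", line):
            if "mask=" not in call and not uses_block_ptr:
                findings.append(f"P0 line {lineno}: {call} has no mask")
    return findings

src = "x = tl.load(x_ptr + offs)\ntl.store(y_ptr + offs, x, mask=m)"
print(find_unmasked_access(src))  # only the first line is flagged
```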

#### 2.2 Data Type Compliance (P0-P1)

MANDATORY - READ ENTIRE FILE: on first inspection, read
`ascend-api-dtype-matrix.md` in full.

| Code Pattern | Issue | Level |
| --- | --- | --- |
| `tl.dot(a_int32, b_int32)` | Inputs only support int8/fp16/fp32/bf16 | P0 |
| `dot_scaled(...)` | Not supported | P0 |
| `permute`/`trans` on int64 | Not supported | P0 |
| `tl.dot(a, b)` without explicit `out_dtype` | Floating point defaults to fp32; for int8 only int32 is available, so an explicit value is unnecessary | P2 |
| 3D `(2, 1, 0)` `permute`/`trans` | Compatibility risk | P1 |
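The first row of this table is a pure set-membership check once the input dtypes are known. A minimal sketch (the constant and helper name are illustrative; the allowed set is taken from the table above):

```python
# Allowed tl.dot input dtypes on Ascend, per the table above
DOT_INPUT_DTYPES = {"int8", "fp16", "fp32", "bf16"}

def check_dot_inputs(a_dtype: str, b_dtype: str) -> list:
    """P0 check: both tl.dot inputs must use a supported dtype."""
    findings = []
    for name, dt in (("a", a_dtype), ("b", b_dtype)):
        if dt not in DOT_INPUT_DTYPES:
            findings.append(f"P0: tl.dot input {name} has unsupported dtype {dt}")
    return findings

print(check_dot_inputs("int32", "int32"))  # both inputs are flagged
```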

#### 2.3 Precision Processing (P1)

❌ P1: direct FP16 reduction → convert to `tl.float32` first

```python
sum_x = tl.sum(x_fp16, axis=-1)
```

❌ P1: Softmax without subtracting the maximum → numerically unstable

```python
exp_x = tl.exp(x)
```

✅ Correct precision patterns

```python
x_fp32 = x_fp16.to(tl.float32)
sum_x = tl.sum(x_fp32, axis=-1)

acc = tl.dot(a, b, acc)

max_x = tl.max(x, axis=-1, keepdims=True)
exp_x = tl.exp(x - max_x)
```
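Why the max subtraction matters can be reproduced with plain Python floats, outside Triton: `exp` overflows for large inputs, while the shifted form is mathematically identical and stays finite. This is a numerical illustration only, not Ascend-specific code.

```python
import math

def softmax_stable(xs):
    # Subtract the row max first: exp(x - max) never overflows upward
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

xs = [1000.0, 1001.0, 1002.0]
try:
    naive = [math.exp(x) for x in xs]  # exp(1000) exceeds the float64 range
except OverflowError:
    naive = None
print(naive)               # None: the unshifted form blew up
print(softmax_stable(xs))  # finite probabilities that sum to 1
```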

#### 2.4 Code Patterns (P0-P2)

| Code Pattern | Issue | Level |
| --- | --- | --- |
| `for ...: tl.atomic_cas/or/xor/and/xchg(...)` | Not supported inside loops; may deadlock | P0 |
| Return value of `tl.atomic_add` used in a multi-core kernel | Multi-core add plus saving the intermediate result is not supported | P0 |
| `import numpy` inside a kernel | Third-party libraries cannot be called inside kernels | P0 |
| `for i in range(N):` inside a kernel (small, fixed loop count) | Consider `tl.static_range`; with large loop counts the benefit is negligible or even negative, so do not replace blindly | P2 |
| `tensor[i].item()` on the Host hot path | Triggers CPU-NPU synchronization | P2 |
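The atomic-in-loop row is a pattern that can be caught statically with an indentation-based scan. The sketch below is a deliberately simplified illustration (the helper name `find_atomic_in_loop` is hypothetical, and indentation tracking stands in for real scope analysis):

```python
import re

def find_atomic_in_loop(kernel_src: str) -> list:
    """Flag unsupported atomics that appear inside a for-loop body (P0 deadlock risk)."""
    findings = []
    loop_depths = []  # indentation levels of currently open for-loops
    atomic = re.compile(r"tl\.atomic_(?:cas|or|xor|and|xchg)\(")
    for lineno, line in enumerate(kernel_src.splitlines(), 1):
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        # A loop stays open only while lines are indented deeper than its header
        loop_depths = [d for d in loop_depths if indent > d]
        if atomic.search(line) and loop_depths:
            findings.append(f"P0 line {lineno}: atomic op inside a loop")
        if re.match(r"\s*for\b", line):
            loop_depths.append(indent)
    return findings

src = "for i in range(4):\n    old = tl.atomic_cas(lock, 0, 1)"
print(find_atomic_in_loop(src))  # the atomic_cas on line 2 is flagged
```

Note that `tl.atomic_add` is intentionally excluded from the pattern: per the table, its problem is a different one (using its return value in a multi-core kernel), not loop placement.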

### Phase 3: Performance Risk Inspection (P2)

| Code Feature | Risk |
| --- | --- |
| Multiple `tl.load` calls on the same ptr | Redundant GM access |
| `tl.arange(0, N) * stride` with stride > 1 | Non-contiguous memory access |
| `pid` mapped directly to a block, with no inter-core loop distribution | Load imbalance |
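The redundant-load row lends itself to a simple textual heuristic: count how often the same pointer expression is loaded. This is a rough sketch (the helper name is illustrative, and textual equality of pointer expressions is only an approximation of true aliasing):

```python
import re
from collections import Counter

def find_redundant_loads(kernel_src: str) -> list:
    """P2 heuristic: report pointer expressions loaded more than once."""
    # Capture the first argument of each tl.load call (up to the first comma)
    loads = re.findall(r"tl\.load\(([^),]+)", kernel_src)
    counts = Counter(p.strip() for p in loads)
    return [ptr for ptr, n in counts.items() if n > 1]

src = "a = tl.load(x_ptr + offs, mask=m)\nb = tl.load(x_ptr + offs, mask=m)"
print(find_redundant_loads(src))  # the repeated pointer expression is reported
```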

## Anti-Pattern List (NEVER)

### Host Side

- ❌ Hard-coded core count `grid = (20,)` — P0
- ❌ Using `num_vectorcore` for matrix multiplication (kernels containing `tl.dot` should use AI Core) — P0
- ❌ `BLOCK_SIZE` not declared as `tl.constexpr` — P1

### Device Side

- ❌ `tl.load` / `tl.store` without `mask=` (and no `make_block_ptr`) — P0
- ❌ int32/int16/int64 inputs to `tl.dot` — P0
- ❌ `dot_scaled` (not supported) — P0
- ❌ `atomic_or/xor/and/xchg/cas` inside a `for` loop body — P0
- ❌ Calling third-party libraries inside a kernel — P0
- ❌ FP16/BF16 reduction without promotion to FP32 — P1
- ❌ Softmax without subtracting the maximum — P1
- ⚠️ `tl.dot` without explicit `out_dtype` (floating point defaults to fp32; for int8 only int32 is available, so it is unnecessary) — P2
- ⚠️ `for i in range(N):` may be replaced with `tl.static_range` only when the loop count is small and fixed; with large loop counts it may degrade performance; not mandatory — P2

## Inspection Report

After completing the inspection, output the report according to
`code-review-report-template.md`.

## Reference Resources

### On-Demand Loading

| Workflow Phase | Load Document | Do Not Load |
| --- | --- | --- |
| Phase 1: Host side | ascend-triton-api-constraints.md | dtype-matrix, test-patterns |
| Phase 2: Device side | ascend-api-dtype-matrix.md | test-patterns |
| Item-by-item check | code-review-checklist.md | test-patterns, dtype-matrix |
| Referencing official implementations | ascend-test-patterns.md | |

Loading principle: only load the documents required for the current inspection phase; do not load everything at once.

### Official Documents