# Static Code Inspection for Triton Operators (Ascend NPU)

## Inspection Principles

- **Ascend-specific constraints first**: the agent already has general Triton knowledge; focus on Ascend hardware differences.
- **Static analysis only**: identify issues solely by reading code; compile-time and runtime behavior are out of scope.
- **Zero tolerance for mask issues**: Ascend has zero tolerance for out-of-bounds access; this is the most critical difference.

## Severity Classification

Issues found during inspection are classified into the following levels, which must be marked in the report:

| Level | Meaning | Typical Issues |
| --- | --- | --- |
| P0 Critical | Guaranteed incorrect results or crashes | Missing mask, core type mismatch, atomic-in-loop deadlock |
| P1 Severe | High probability of precision or functional issues | Reduction without precision promotion, `tl.dot` without accumulator, Softmax without max subtraction |
| P2 Recommendation | Affects performance or maintainability | Redundant memory access, non-contiguous memory access, unaligned BLOCK |

## Inspection Workflow

### Phase 1: Host Side Inspection

MANDATORY - READ ENTIRE FILE: before inspecting the Host side, read
`ascend-triton-api-constraints.md` in full.

#### 1.1 Grid Configuration (P0)

| Check Item | How to Identify in Code |
| --- | --- |
| Hard-coded core count | Literals such as `grid = (20,)` or `grid = (24,)` |
| Core type mismatch | A kernel containing `tl.dot` uses `num_vectorcore` |
| Grid dimension | Unnecessary use of a 2D/3D grid (1D is recommended) |

Core type quick reference:

| Operator Type | Should Use | How to Obtain |
| --- | --- | --- |
| Contains `tl.dot` | AI Core | `get_device_properties(device)["num_aicore"]` |
| Element-wise / reduction / activation | Vector Core | `get_device_properties(device)["num_vectorcore"]` |

❌ P0: hard-coded core count + core type mismatch

```python
core_num = driver.active.utils.get_device_properties(device)["num_vectorcore"]
grid = (20,)  # but tl.dot is used in the kernel
```

✅ Correct

```python
core_num = driver.active.utils.get_device_properties(device)["num_aicore"]
grid = (min(core_num, triton.cdiv(n_elements, BLOCK_SIZE)),)
```
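In line with the static-analysis-only principle, the two P0 checks above can be sketched as a plain text scan over the operator's source. This is a minimal illustration, not part of any Ascend tooling: the helper name `find_grid_issues` and the regexes are assumptions, and a real checker would need more robust parsing.

```python
import re

def find_grid_issues(source: str) -> list:
    """Flag hard-coded core counts and core-type mismatches by reading code only."""
    issues = []
    # P0: a grid literal such as grid = (20,) or grid = (24,)
    if re.search(r"grid\s*=\s*\(\s*\d+\s*,?\s*\)", source):
        issues.append("P0: hard-coded core count in grid")
    # P0: kernel uses tl.dot but the host queries num_vectorcore
    if "tl.dot" in source and '"num_vectorcore"' in source:
        issues.append("P0: tl.dot kernel paired with num_vectorcore")
    return issues

bad = 'core_num = props["num_vectorcore"]\ngrid = (20,)  # kernel uses tl.dot'
print(find_grid_issues(bad))  # both P0 findings fire
```

The same scan returns an empty list for the correct pattern, since `grid = (min(core_num, ...),)` contains no bare integer literal.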

#### 1.2 Block Size Configuration (P1-P2)

| Check Item | Level |
| --- | --- |
| `BLOCK_SIZE` not declared as `tl.constexpr` | P1 |
| BLOCK_M/N/K for matrix operations not multiples of 16 | P2 (Cube unit granularity) |
| BLOCK_K not aligned to `kalign = 32 // dtype_bytes` | P2 |
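The two numeric rules in this table are mechanical enough to check directly. A minimal sketch, assuming the tile sizes and element width are already known from the code under review (the helper name `check_block_config` is illustrative):

```python
def check_block_config(block_m, block_n, block_k, dtype_bytes):
    """Static checks for matmul tile sizes; 16 is the Cube unit granularity."""
    findings = []
    # P2: BLOCK_M/N/K must be multiples of 16
    for name, v in (("BLOCK_M", block_m), ("BLOCK_N", block_n), ("BLOCK_K", block_k)):
        if v % 16 != 0:
            findings.append(f"P2: {name}={v} is not a multiple of 16")
    # P2: BLOCK_K should also align to kalign = 32 // dtype_bytes
    kalign = 32 // dtype_bytes
    if block_k % kalign != 0:
        findings.append(f"P2: BLOCK_K={block_k} not aligned to kalign={kalign}")
    return findings

print(check_block_config(64, 64, 24, 2))  # fp16: dtype_bytes=2, so kalign = 16
```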

### Phase 2: Device Side Inspection

#### 2.1 Mask Integrity (P0)

Ascend has zero tolerance for out-of-bounds access. Search every `tl.load` / `tl.store` and confirm each one satisfies at least one of the following:

- Has a `mask=` argument (`tl.load` also needs `other=`)
- Uses `make_block_ptr` (boundaries are handled automatically)

❌ P0: missing mask

```python
x = tl.load(x_ptr + offsets)
```

✅ Explicit mask

```python
x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
```

✅ make_block_ptr (automatic boundary handling)

```python
block_ptr = tl.make_block_ptr(base=ptr, shape=(M, N), ...)
x = tl.load(block_ptr)
```
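The "search every `tl.load` / `tl.store`" step can itself be sketched as a static scan. The helper below is a simplified illustration (the name `find_unmasked_access` is hypothetical): it only handles single-line calls and treats any use of `make_block_ptr` as covering the whole kernel, whereas a real checker would parse the AST.

```python
import re

def find_unmasked_access(kernel_src: str) -> list:
    """Flag tl.load/tl.store calls that carry neither mask= nor a block pointer."""
    findings = []
    uses_block_ptr = "make_block_ptr" in kernel_src
    for lineno, line in enumerate(kernel_src.splitlines(), 1):
        for call in re.findall(r"tl\.(?:load|store)\([^)]*\)", line):
            if "mask=" not in call and not uses_block_ptr:
                findings.append(f"P0 line {lineno}: {call} has no mask")
    return findings

src = "x = tl.load(x_ptr + offs)\ntl.store(y_ptr + offs, x, mask=m)"
print(find_unmasked_access(src))  # only the first line is flagged
```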

#### 2.2 Data Type Compliance (P0-P1)

MANDATORY - READ ENTIRE FILE: on first inspection, read
`ascend-api-dtype-matrix.md` in full.

| Code Pattern | Issue | Level |
| --- | --- | --- |
| `tl.dot(a_int32, b_int32)` | Inputs only support int8/fp16/fp32/bf16 | P0 |
| `dot_scaled(...)` | Not supported | P0 |
| `permute`/`trans` on int64 | Not supported | P0 |
| `tl.dot(a, b)` without explicit `out_dtype` | Floating point defaults to fp32; for int8 only int32 is available, so an explicit value is unnecessary | P2 |
| 3D `(2, 1, 0)` `permute`/`trans` | Compatibility risk | P1 |
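The first row of this table is a pure set-membership check once the input dtypes are known. A minimal sketch (the constant and helper name are illustrative; the allowed set is taken from the table above):

```python
# Allowed tl.dot input dtypes on Ascend, per the table above
DOT_INPUT_DTYPES = {"int8", "fp16", "fp32", "bf16"}

def check_dot_inputs(a_dtype: str, b_dtype: str) -> list:
    """P0 check: both tl.dot inputs must use a supported dtype."""
    findings = []
    for name, dt in (("a", a_dtype), ("b", b_dtype)):
        if dt not in DOT_INPUT_DTYPES:
            findings.append(f"P0: tl.dot input {name} has unsupported dtype {dt}")
    return findings

print(check_dot_inputs("int32", "int32"))  # both inputs are flagged
```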

#### 2.3 Precision Processing (P1)

❌ P1: direct FP16 reduction → convert to `tl.float32` first

```python
sum_x = tl.sum(x_fp16, axis=-1)
```

❌ P1: Softmax without subtracting the maximum → numerically unstable

```python
exp_x = tl.exp(x)
```

✅ Correct precision patterns

```python
x_fp32 = x_fp16.to(tl.float32)
sum_x = tl.sum(x_fp32, axis=-1)

acc = tl.dot(a, b, acc)

max_x = tl.max(x, axis=-1, keepdims=True)
exp_x = tl.exp(x - max_x)
```
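Why the max subtraction matters can be reproduced with plain Python floats, outside Triton: `exp` overflows for large inputs, while the shifted form is mathematically identical and stays finite. This is a numerical illustration only, not Ascend-specific code.

```python
import math

def softmax_stable(xs):
    # Subtract the row max first: exp(x - max) never overflows upward
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

xs = [1000.0, 1001.0, 1002.0]
try:
    naive = [math.exp(x) for x in xs]  # exp(1000) exceeds the float64 range
except OverflowError:
    naive = None
print(naive)               # None: the unshifted form blew up
print(softmax_stable(xs))  # finite probabilities that sum to 1
```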

#### 2.4 Code Patterns (P0-P2)

| Code Pattern | Issue | Level |
| --- | --- | --- |
| `for ...: tl.atomic_cas/or/xor/and/xchg(...)` | Not supported inside loops; may deadlock | P0 |
| Return value of `tl.atomic_add` used in a multi-core kernel | Multi-core add plus saving the intermediate result is not supported | P0 |
| `import numpy` inside a kernel | Third-party libraries cannot be called inside kernels | P0 |
| `for i in range(N):` inside a kernel (small, fixed loop count) | Consider `tl.static_range`; with large loop counts the benefit is negligible or even negative, so do not replace blindly | P2 |
| `tensor[i].item()` on the Host hot path | Triggers CPU-NPU synchronization | P2 |
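The atomic-in-loop row is a pattern that can be caught statically with an indentation-based scan. The sketch below is a deliberately simplified illustration (the helper name `find_atomic_in_loop` is hypothetical, and indentation tracking stands in for real scope analysis):

```python
import re

def find_atomic_in_loop(kernel_src: str) -> list:
    """Flag unsupported atomics that appear inside a for-loop body (P0 deadlock risk)."""
    findings = []
    loop_depths = []  # indentation levels of currently open for-loops
    atomic = re.compile(r"tl\.atomic_(?:cas|or|xor|and|xchg)\(")
    for lineno, line in enumerate(kernel_src.splitlines(), 1):
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        # A loop stays open only while lines are indented deeper than its header
        loop_depths = [d for d in loop_depths if indent > d]
        if atomic.search(line) and loop_depths:
            findings.append(f"P0 line {lineno}: atomic op inside a loop")
        if re.match(r"\s*for\b", line):
            loop_depths.append(indent)
    return findings

src = "for i in range(4):\n    old = tl.atomic_cas(lock, 0, 1)"
print(find_atomic_in_loop(src))  # the atomic_cas on line 2 is flagged
```

Note that `tl.atomic_add` is intentionally excluded from the pattern: per the table, its problem is a different one (using its return value in a multi-core kernel), not loop placement.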

### Phase 3: Performance Risk Inspection (P2)

| Code Feature | Risk |
| --- | --- |
| Multiple `tl.load` calls on the same ptr | Redundant GM access |
| `tl.arange(0, N) * stride` with stride > 1 | Non-contiguous memory access |
| `pid` mapped directly to a block, with no inter-core loop distribution | Load imbalance |
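The redundant-load row lends itself to a simple textual heuristic: count how often the same pointer expression is loaded. This is a rough sketch (the helper name is illustrative, and textual equality of pointer expressions is only an approximation of true aliasing):

```python
import re
from collections import Counter

def find_redundant_loads(kernel_src: str) -> list:
    """P2 heuristic: report pointer expressions loaded more than once."""
    # Capture the first argument of each tl.load call (up to the first comma)
    loads = re.findall(r"tl\.load\(([^),]+)", kernel_src)
    counts = Counter(p.strip() for p in loads)
    return [ptr for ptr, n in counts.items() if n > 1]

src = "a = tl.load(x_ptr + offs, mask=m)\nb = tl.load(x_ptr + offs, mask=m)"
print(find_redundant_loads(src))  # the repeated pointer expression is reported
```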

## Anti-Pattern List (NEVER)

### Host Side

- ❌ Hard-coded core count `grid = (20,)` — P0
- ❌ Using `num_vectorcore` for matrix multiplication (kernels containing `tl.dot` should use AI Core) — P0
- ❌ `BLOCK_SIZE` not declared as `tl.constexpr` — P1

### Device Side

- ❌ `tl.load` / `tl.store` without `mask=` (and no `make_block_ptr`) — P0
- ❌ int32/int16/int64 inputs to `tl.dot` — P0
- ❌ `dot_scaled` (not supported) — P0
- ❌ `atomic_or/xor/and/xchg/cas` inside a `for` loop body — P0
- ❌ Calling third-party libraries inside a kernel — P0
- ❌ FP16/BF16 reduction without promotion to FP32 — P1
- ❌ Softmax without subtracting the maximum — P1
- ⚠️ `tl.dot` without explicit `out_dtype` (floating point defaults to fp32; for int8 only int32 is available, so it is unnecessary) — P2
- ⚠️ `for i in range(N):` may be replaced with `tl.static_range` only when the loop count is small and fixed; with large loop counts it may degrade performance; not mandatory — P2

## Inspection Report

After completing the inspection, output the report according to
`code-review-report-template.md`.

## Reference Resources

### On-Demand Loading

| Workflow Phase | Load Document | Do Not Load |
| --- | --- | --- |
| Phase 1: Host side | ascend-triton-api-constraints.md | dtype-matrix, test-patterns |
| Phase 2: Device side | ascend-api-dtype-matrix.md | test-patterns |
| Item-by-item check | code-review-checklist.md | test-patterns, dtype-matrix |
| Referencing official implementations | ascend-test-patterns.md | |

Loading principle: only load the documents required for the current inspection phase; do not load everything at once.

### Official Documents