# Static Code Inspection for Triton Operators (Ascend NPU)
## Inspection Principles

- Ascend-specific constraints first: the agent already has general Triton knowledge; focus on Ascend hardware differences.
- Static analysis only: identify issues solely by reading code; compile-time and runtime behavior are out of scope.
- Zero tolerance for mask issues: Ascend does not tolerate out-of-bounds access; this is the most critical difference.
## Severity Classification

Issues found during inspection are classified into the following levels, which must be marked in reports:

| Level | Meaning | Typical Issues |
|---|---|---|
| P0 Critical | Will definitely lead to incorrect results or crashes | Missing mask, core type mismatch, atomic-in-loop deadlock |
| P1 Severe | High probability of precision or functional issues | Reduction without precision promotion, dot without accumulator, Softmax without max subtraction |
| P2 Recommendation | Affects performance or maintainability | Redundant memory access, non-contiguous memory access, unaligned BLOCK |
## Inspection Workflow

### Phase 1: Host Side Inspection

MANDATORY - READ ENTIRE FILE: Before inspecting the Host side, read `ascend-triton-api-constraints.md` in full.

#### 1.1 Grid Configuration (P0)
| Check Item | How to Identify in Code |
|---|---|
| Hard-coded core count | Literals such as `grid = (20,)` |
| Core type mismatch | Kernel contains `tl.dot` but the core count comes from `num_vectorcore` |
| Grid dimension | Unnecessary use of 2D/3D grid (1D is recommended) |

Core Type Quick Reference:

| Operator Type | Should Use | Acquisition Method |
|---|---|---|
| Contains `tl.dot` | AI Core | `get_device_properties(device)["num_aicore"]` |
| Element-wise/Reduction/Activation | Vector Core | `get_device_properties(device)["num_vectorcore"]` |
❌ P0: Hard-coded core count + core type mismatch

```python
core_num = driver.active.utils.get_device_properties(device)["num_vectorcore"]
grid = (20,)  # but tl.dot is used in the kernel
```

✅ Correct

```python
core_num = driver.active.utils.get_device_properties(device)["num_aicore"]
grid = (min(core_num, triton.cdiv(n_elements, BLOCK_SIZE)),)
```
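The grid-size rule above can be simulated in plain Python. This is a sketch with hypothetical helper names (`cdiv` mirrors `triton.cdiv`), not NPU code:

```python
def cdiv(a, b):
    # Ceiling division, mirroring triton.cdiv
    return (a + b - 1) // b

def make_grid(core_num, n_elements, block_size):
    # Clamp the number of launched programs to the physical core count
    return (min(core_num, cdiv(n_elements, block_size)),)

# Few blocks: grid is bound by the number of blocks
print(make_grid(40, 1024, 256))     # -> (4,)
# Many blocks: grid is capped at the core count
print(make_grid(40, 1 << 20, 256))  # -> (40,)
```

The `min(...)` is the point of the check: it prevents launching more programs than physical cores while still covering all elements.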
#### 1.2 Block Size Configuration (P1-P2)

| Check Item | Level |
|---|---|
| `BLOCK_SIZE` not declared as `tl.constexpr` | P1 |
| BLOCK_M/N/K for matrix operations not multiples of 16 | P2 (Cube unit granularity) |
| `BLOCK_K` not aligned | P2 |
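A minimal host-side sanity check for the Cube-alignment rule above (the helper name is hypothetical; in the kernel signature itself, block sizes should be declared as `BLOCK_SIZE: tl.constexpr`):

```python
def cube_aligned(*block_dims):
    # Cube unit granularity on Ascend is 16:
    # BLOCK_M/N/K should all be multiples of 16
    return all(d % 16 == 0 for d in block_dims)

print(cube_aligned(64, 128, 32))  # -> True
print(cube_aligned(64, 100, 32))  # -> False (100 is not a multiple of 16)
```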
### Phase 2: Device Side Inspection

#### 2.1 Mask Integrity (P0)

Ascend has zero tolerance for out-of-bounds access. Search for every `tl.load`/`tl.store` and confirm that each one meets at least one of the following:

- Has a `mask=` parameter (`tl.load` also requires `other=`)
- Uses `make_block_ptr` (boundaries are handled automatically)
❌ P0: Missing mask

```python
x = tl.load(x_ptr + offsets)
```

✅ Explicit mask

```python
x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
```

✅ make_block_ptr (automatic handling)

```python
block_ptr = tl.make_block_ptr(base=ptr, shape=(M, N), ...)
x = tl.load(block_ptr)
```
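The snippets above assume `offsets` and `mask` already exist. In a typical kernel they come from `offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)` and `mask = offsets < n_elements`. The plain-Python sketch below mimics that logic for the last (partial) block, which is where a missing mask goes out of bounds:

```python
BLOCK_SIZE = 4
n_elements = 10
pid = 2  # the last program: its block extends past n_elements

# Mimics: offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
# Mimics: mask = offsets < n_elements
mask = [o < n_elements for o in offsets]

print(offsets)  # -> [8, 9, 10, 11]  (10 and 11 are out of bounds)
print(mask)     # -> [True, True, False, False]
```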
#### 2.2 Data Type Compliance (P0-P1)

MANDATORY - READ ENTIRE FILE: When inspecting for the first time, read `ascend-api-dtype-matrix.md` in full.

| Code Pattern | Issue | Level |
|---|---|---|
| `tl.dot` | Input only supports int8/fp16/fp32/bf16 | P0 |
| `dot_scaled` | Not supported | P0 |
| | Not supported | P0 |
| `tl.dot` without explicit `out_dtype` | Floating-point defaults to fp32, only int32 is optional for int8; explicit specification is unnecessary | P2 |
| 3D `(2,1,0)` | Compatibility risk | P1 |
#### 2.3 Precision Handling (P1)

❌ P1: FP16 reduced directly → convert to `tl.float32` first

```python
sum_x = tl.sum(x_fp16, axis=-1)
```

❌ P1: Softmax without subtracting the maximum value → numerically unstable

```python
exp_x = tl.exp(x)
```

✅ Correct precision patterns

```python
x_fp32 = x_fp16.to(tl.float32)
sum_x = tl.sum(x_fp32, axis=-1)

# out_dtype: floating-point defaults to fp32, only int32 is optional for int8;
# explicit specification is unnecessary
acc = tl.dot(a, b, acc)

max_x = tl.max(x, axis=-1, keepdims=True)
exp_x = tl.exp(x - max_x)
```
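To see why the max subtraction matters, here is the same pattern in plain Python (a sketch, not Triton code): `math.exp(1000.0)` overflows, while subtracting the row max keeps every exponent at or below zero.

```python
import math

def stable_softmax(xs):
    # Subtract the max so the largest exponent is exp(0) = 1 (no overflow)
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Naive softmax would call math.exp(1000.0) and raise OverflowError
print(stable_softmax([1000.0, 1000.0]))  # -> [0.5, 0.5]
```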
#### 2.4 Code Patterns (P0-P2)

| Code Pattern | Issue | Level |
|---|---|---|
| `atomic_or/xor/and/xchg/cas` | Not supported inside loops; may deadlock | P0 |
| Return value of … in a multi-core kernel | Multi-core add + saving intermediate results is not supported | P0 |
| | Third-party libraries cannot be called inside kernels | P0 |
| `for i in range(N):` | Consider `tl.static_range` | P2 |
| | Triggers CPU-NPU synchronization | P2 |
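A common fix for the atomic-in-loop pattern is to accumulate into a local variable inside the loop and issue a single atomic update afterwards. The plain-Python sketch below (hypothetical helper; in a real kernel the final line would be one atomic update outside the loop) shows the shape of the fix, not actual NPU code:

```python
def reduce_with_single_update(values, global_buf):
    # Accumulate locally inside the loop: no atomic op per iteration
    local = 0
    for v in values:
        local += v
    # One update to shared/global state after the loop
    # (in a kernel: a single atomic update outside the loop body)
    global_buf[0] += local
    return global_buf

print(reduce_with_single_update([1, 2, 3, 4], [0]))  # -> [10]
```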
### Phase 3: Performance Risk Inspection (P2)

| Code Feature | Risk |
|---|---|
| The same ptr loaded multiple times | Redundant GM access |
| | Non-contiguous memory access |
| | Load imbalance |
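A typical instance of the redundant-access pattern: the same address is loaded twice instead of reusing the first result. The toy counter below (a hypothetical `FakeGM` class, purely illustrative) only counts how many global-memory loads each variant issues:

```python
class FakeGM:
    """Counts loads to illustrate redundant global-memory access."""
    def __init__(self, data):
        self.data = data
        self.loads = 0
    def load(self, i):
        self.loads += 1
        return self.data[i]

gm = FakeGM([5, 7])
y = gm.load(0) + gm.load(0)  # ❌ redundant: same element loaded twice
print(gm.loads)              # -> 2

gm = FakeGM([5, 7])
x = gm.load(0)               # ✅ load once, reuse the register value
y = x + x
print(gm.loads)              # -> 1
```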
## Anti-Pattern List (NEVER)

### Host Side

- ❌ Hard-coded core count `grid = (20,)` — P0
- ❌ Using `num_vectorcore` for matrix multiplication (kernels containing `tl.dot` should use AI Core) — P0
- ❌ `BLOCK_SIZE` not declared as `tl.constexpr` — P1

### Device Side

- ❌ `tl.load`/`tl.store` without `mask=` (and no `make_block_ptr`) — P0
- ❌ int32/int16/int64 inputs to `tl.dot` — P0
- ❌ `dot_scaled` (not supported) — P0
- ❌ `atomic_or/xor/and/xchg/cas` inside a `for` loop — P0
- ❌ Calling third-party libraries inside a kernel — P0
- ❌ FP16/BF16 reduction without precision promotion to FP32 — P1
- ⚠️ `tl.dot` without explicit `out_dtype` (floating-point defaults to fp32, only int32 is optional for int8; not necessary) — P2
- ❌ Softmax without subtracting the maximum value — P1
- ⚠️ `for i in range(N):` can be replaced with `tl.static_range`, but it only pays off when the loop count is small and fixed; with large loop counts it may degrade performance; not mandatory — P2
## Inspection Report

After completing the inspection, output the report according to `code-review-report-template.md`.
## Reference Resources

### On-Demand Loading

| Workflow Phase | Load Document | Do Not Load |
|---|---|---|
| Phase 1: Host Side | `ascend-triton-api-constraints.md` | dtype-matrix, test-patterns |
| Phase 2: Device Side | `ascend-api-dtype-matrix.md` | test-patterns |
| Item-by-item Check | | test-patterns, dtype-matrix |
| Need to reference official implementations | | — |

Loading Principle: Only load the documents required for the current inspection phase; do not load all documents at once.