Static inspection of Triton operator code quality (host side + device side) for Ascend NPU. Use this Skill to identify potential bugs, API misuses, and performance risks by reading code. Core capabilities: (1) Ascend API constraint compliance check; (2) mask integrity verification; (3) precision handling review; (4) code pattern recognition. Note: this Skill covers static code analysis only; compile-time and runtime issues are handled by other Skills.
```shell
npx skill4agent add ascend/agent-skills triton-operator-code-review
```

| Level | Meaning | Typical Issues |
|---|---|---|
| P0 Critical | Will definitely lead to incorrect results or crashes | Missing Mask, core type mismatch, Atomic loop deadlock |
| P1 Severe | High probability of causing precision or functional issues | Reduction without precision promotion, dot without accumulator, Softmax without max subtraction |
| P2 Recommendation | Affects performance or maintainability | Redundant memory access, non-contiguous memory access, unaligned BLOCK |
Reference: `ascend-triton-api-constraints.md`

| Check Item | How to Identify in Code |
|---|---|
| Hard-coded core count | Literals like `grid = (20,)` instead of querying the driver |
| Core type mismatch | Kernel contains `tl.dot` but the core count is read from `num_vectorcore` |
| Grid dimension | Unnecessary use of 2D/3D Grid (1D is recommended) |
| Operator Type | Should Use | Acquisition Method |
|---|---|---|
| Contains `tl.dot` (matrix computation) | AI Core | `get_device_properties(device)["num_aicore"]` |
| Element-wise / Reduction / Activation | Vector Core | `get_device_properties(device)["num_vectorcore"]` |
```python
# ❌ P0: hard-coded grid + core type mismatch
core_num = driver.active.utils.get_device_properties(device)["num_vectorcore"]
grid = (20,)  # but tl.dot is used in the kernel

# ✅ Correct: query the matching core type and cap the grid at the core count
core_num = driver.active.utils.get_device_properties(device)["num_aicore"]
grid = (min(core_num, triton.cdiv(n_elements, BLOCK_SIZE)),)
```

| Check Item | Level |
|---|---|
| `BLOCK_SIZE` not declared as `tl.constexpr` | P1 |
| BLOCK_M/N/K for matrix operations not multiples of 16 | P2 (Cube unit granularity) |
| `BLOCK_K` not aligned with | P2 |
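The host-side grid rules above can be checked with a small runnable sketch. This is plain Python: the `cdiv` helper mirrors `triton.cdiv`, and the core count is hardcoded purely for illustration (real code queries it from the driver as in the example above).

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, mirroring triton.cdiv."""
    return (a + b - 1) // b

core_num = 40            # illustrative only; real code reads num_aicore/num_vectorcore
n_elements = 100_000
BLOCK_SIZE = 1024        # would be a tl.constexpr in the kernel signature

# Cap the grid at the physical core count instead of hard-coding it.
num_blocks = cdiv(n_elements, BLOCK_SIZE)   # 98 blocks of work
grid = (min(core_num, num_blocks),)         # (40,)
```

With 98 blocks and 40 cores the grid is `(40,)`, so each program instance loops over several blocks; small inputs naturally shrink the grid instead of oversubscribing cores.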
`tl.load`/`tl.store` should carry an explicit `mask=` (for `tl.load`, also an `other=` fill value), or use `make_block_ptr`, which handles boundaries automatically.

```python
# ❌ P0: missing mask, so out-of-bounds lanes read undefined memory
x = tl.load(x_ptr + offsets)

# ✅ Explicit mask with fill value
x = tl.load(x_ptr + offsets, mask=mask, other=0.0)

# ✅ make_block_ptr (automatic boundary handling)
block_ptr = tl.make_block_ptr(base=ptr, shape=(M, N), ...)
x = tl.load(block_ptr)
```

Reference: `ascend-api-dtype-matrix.md`

| Code Pattern | Issue | Level |
|---|---|---|
| `tl.dot` | Input only supports int8/fp16/fp32/bf16 | P0 |
| `dot_scaled` | Not supported | P0 |
| `atomic_or/xor/and/xchg/cas` | Not supported | P0 |
| `out_dtype` of `tl.dot` | Floating-point defaults to fp32, only int32 is optional for int8; explicit specification is unnecessary | P2 |
| 3D (2,1,0) | Compatibility risk | P1 |
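To make the `mask=`/`other=` rule from the load/store section concrete, here is a NumPy emulation of a masked tail load. The `masked_load` helper is hypothetical and exists only for this sketch; on device, this is the behavior `tl.load(..., mask=mask, other=0.0)` guarantees.

```python
import numpy as np

def masked_load(arr, offsets, mask, other=0.0):
    """Emulate tl.load(ptr + offsets, mask=mask, other=other):
    out-of-range lanes receive `other` instead of undefined memory."""
    out = np.full(offsets.shape, other, dtype=arr.dtype)
    out[mask] = arr[offsets[mask]]
    return out

x = np.arange(10, dtype=np.float32)        # 10 valid elements
BLOCK = 8
pid = 1                                    # second program instance
offsets = pid * BLOCK + np.arange(BLOCK)   # 8..15, partly out of range
mask = offsets < x.shape[0]                # valid only for offsets 8 and 9
tile = masked_load(x, offsets, mask)       # [8, 9, 0, 0, 0, 0, 0, 0]
```

Without the mask, the last six lanes of this tile would be whatever happens to sit past the end of the buffer, which is exactly the P0 class of bug the checklist targets.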
```python
# ❌ P1: FP16 reduced directly → should convert to tl.float32 first
sum_x = tl.sum(x_fp16, axis=-1)
# ❌ P1: Softmax without subtracting the maximum → numerically unstable
exp_x = tl.exp(x)

# ✅ Correct precision handling
x_fp32 = x_fp16.to(tl.float32)
sum_x = tl.sum(x_fp32, axis=-1)
# out_dtype defaults to fp32 (int32 is optional only for int8); no need to specify it
acc = tl.dot(a, b, acc)
max_x = tl.max(x, axis=-1, keepdims=True)
exp_x = tl.exp(x - max_x)
```

| Code Pattern | Issue | Level |
|---|---|---|
| Atomics inside `for` loops | Not supported in loops, may cause deadlock | P0 |
| Return value of `tl.atomic_add` | Does not support multi-core add + saving intermediate results | P0 |
| | Third-party libraries cannot be called inside kernels | P0 |
| `for i in range(N):` | Consider `tl.static_range` | P2 |
| | Triggers CPU-NPU synchronization | P2 |
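Both P1 precision rules from the section above (promote before reducing; subtract the max before `exp`) can be demonstrated in plain NumPy, independent of any NPU:

```python
import numpy as np

# 1) Reducing fp16 directly can overflow (fp16 max is 65504);
#    promoting to fp32 first yields the exact sum.
x = np.full(1000, 100.0, dtype=np.float16)     # true sum = 100000
bad = x.sum(dtype=np.float16)                  # accumulates in fp16 -> inf
good = x.astype(np.float32).sum()              # 100000.0

# 2) Softmax without max subtraction overflows exp();
#    subtracting the row max keeps every exponent <= 0.
logits = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)
naive = np.exp(logits)                         # inf for every element
stable = np.exp(logits - logits.max())
probs = stable / stable.sum()                  # a valid distribution
```

The stable variant is mathematically identical to the naive softmax (the subtracted constant cancels in the ratio), which is why the review flags the missing max subtraction as a pure bug, not a trade-off.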
| Code Feature | Risk |
|---|---|
| Multiple `tl.load` of the same data | Redundant GM access |
| | Non-contiguous memory access |
| | Load imbalance |
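For the load-imbalance row, an even split of work blocks over cores keeps the per-core skew at one block at most. A pure-Python sketch (`partition` is a hypothetical helper, not a Triton API):

```python
def partition(n_blocks: int, core_num: int) -> list[int]:
    """Distribute n_blocks of work over core_num cores as evenly as possible."""
    base, rem = divmod(n_blocks, core_num)
    return [base + 1 if i < rem else base for i in range(core_num)]

# 98 blocks over 40 cores: 18 cores take 3 blocks, 22 take 2.
counts = partition(98, 40)
```

A naive split that gives the remainder to a single core would leave one core doing far more work; the `divmod`-based split bounds the skew at one block regardless of input size.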
Reference: `code-review-report-template.md`

| Workflow Phase | Load Document | Do Not Load |
|---|---|---|
| Phase 1: Host Side | `ascend-triton-api-constraints.md` | dtype-matrix, test-patterns |
| Phase 2: Device Side | `ascend-api-dtype-matrix.md` | test-patterns |
| Item-by-item Check | | test-patterns, dtype-matrix |
| Need to reference official implementations | test-patterns | — |