Static inspection of Triton operator code quality (host side + device side) for Ascend NPU. Use this Skill to identify potential bugs, API misuses, and performance risks by reading code. Core capabilities: (1) Ascend API constraint compliance check; (2) mask integrity verification; (3) precision handling review; (4) code pattern recognition. Note: this Skill covers static code analysis only; compile-time and runtime issues are handled by other Skills.
```shell
npx skill4agent add ascend/agent-skills triton-operator-code-review
```

| Level | Meaning | Typical Issues |
|---|---|---|
| P0 Critical | Will definitely lead to incorrect results or crashes | Missing Mask, core type mismatch, Atomic loop deadlock |
| P1 Severe | High probability of causing precision or functional issues | Reduction without precision promotion, dot without accumulator, Softmax without max subtraction |
| P2 Recommendation | Affects performance or maintainability | Redundant memory access, non-contiguous memory access, unaligned BLOCK |
Reference: `ascend-triton-api-constraints.md`

| Check Item | How to Identify in Code |
|---|---|
| Hard-coded core count | Literals like `grid = (20,)` instead of querying the driver |
| Core type mismatch | Kernel contains `tl.dot` but the core count is read from `num_vectorcore` |
| Grid dimension | Unnecessary use of 2D/3D Grid (1D is recommended) |
| Operator Type | Should Use | Acquisition Method |
|---|---|---|
| Contains `tl.dot` (matrix computation) | AI Core | `get_device_properties(device)["num_aicore"]` |
| Element-wise / Reduction / Activation | Vector Core | `get_device_properties(device)["num_vectorcore"]` |
```python
# ❌ P0: hard-coded grid + core type mismatch
core_num = driver.active.utils.get_device_properties(device)["num_vectorcore"]
grid = (20,)  # but tl.dot is used in the kernel

# ✅ Correct: query the matching core type and cap the grid at the core count
core_num = driver.active.utils.get_device_properties(device)["num_aicore"]
grid = (min(core_num, triton.cdiv(n_elements, BLOCK_SIZE)),)
```

| Check Item | Level |
|---|---|
| `BLOCK_SIZE` not declared as `tl.constexpr` | P1 |
| BLOCK_M/N/K for matrix operations not multiples of 16 | P2 (Cube unit granularity) |
| `BLOCK_K` not aligned with | P2 |
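The host-side grid rules above can be checked with a small runnable sketch. This is plain Python: the `cdiv` helper mirrors `triton.cdiv`, and the core count is hardcoded purely for illustration (real code queries it from the driver as in the example above).

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, mirroring triton.cdiv."""
    return (a + b - 1) // b

core_num = 40            # illustrative only; real code reads num_aicore/num_vectorcore
n_elements = 100_000
BLOCK_SIZE = 1024        # would be a tl.constexpr in the kernel signature

# Cap the grid at the physical core count instead of hard-coding it.
num_blocks = cdiv(n_elements, BLOCK_SIZE)   # 98 blocks of work
grid = (min(core_num, num_blocks),)         # (40,)
```

With 98 blocks and 40 cores the grid is `(40,)`, so each program instance loops over several blocks; small inputs naturally shrink the grid instead of oversubscribing cores.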
`tl.load`/`tl.store` should carry an explicit `mask=` (for `tl.load`, also an `other=` fill value), or use `make_block_ptr`, which handles boundaries automatically.

```python
# ❌ P0: missing mask, so out-of-bounds lanes read undefined memory
x = tl.load(x_ptr + offsets)

# ✅ Explicit mask with fill value
x = tl.load(x_ptr + offsets, mask=mask, other=0.0)

# ✅ make_block_ptr (automatic boundary handling)
block_ptr = tl.make_block_ptr(base=ptr, shape=(M, N), ...)
x = tl.load(block_ptr)
```

Reference: `ascend-api-dtype-matrix.md`

| Code Pattern | Issue | Level |
|---|---|---|
| `tl.dot` | Input only supports int8/fp16/fp32/bf16 | P0 |
| `dot_scaled` | Not supported | P0 |
| `atomic_or/xor/and/xchg/cas` | Not supported | P0 |
| `out_dtype` of `tl.dot` | Floating-point defaults to fp32, only int32 is optional for int8; explicit specification is unnecessary | P2 |
| 3D (2,1,0) | Compatibility risk | P1 |
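To make the `mask=`/`other=` rule from the load/store section concrete, here is a NumPy emulation of a masked tail load. The `masked_load` helper is hypothetical and exists only for this sketch; on device, this is the behavior `tl.load(..., mask=mask, other=0.0)` guarantees.

```python
import numpy as np

def masked_load(arr, offsets, mask, other=0.0):
    """Emulate tl.load(ptr + offsets, mask=mask, other=other):
    out-of-range lanes receive `other` instead of undefined memory."""
    out = np.full(offsets.shape, other, dtype=arr.dtype)
    out[mask] = arr[offsets[mask]]
    return out

x = np.arange(10, dtype=np.float32)        # 10 valid elements
BLOCK = 8
pid = 1                                    # second program instance
offsets = pid * BLOCK + np.arange(BLOCK)   # 8..15, partly out of range
mask = offsets < x.shape[0]                # valid only for offsets 8 and 9
tile = masked_load(x, offsets, mask)       # [8, 9, 0, 0, 0, 0, 0, 0]
```

Without the mask, the last six lanes of this tile would be whatever happens to sit past the end of the buffer, which is exactly the P0 class of bug the checklist targets.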
```python
# ❌ P1: FP16 reduced directly → should convert to tl.float32 first
sum_x = tl.sum(x_fp16, axis=-1)
# ❌ P1: Softmax without subtracting the maximum → numerically unstable
exp_x = tl.exp(x)

# ✅ Correct precision handling
x_fp32 = x_fp16.to(tl.float32)
sum_x = tl.sum(x_fp32, axis=-1)
# out_dtype defaults to fp32 (int32 is optional only for int8); no need to specify it
acc = tl.dot(a, b, acc)
max_x = tl.max(x, axis=-1, keepdims=True)
exp_x = tl.exp(x - max_x)
```

| Code Pattern | Issue | Level |
|---|---|---|
| Atomics inside `for` loops | Not supported in loops, may cause deadlock | P0 |
| Return value of `tl.atomic_add` | Does not support multi-core add + saving intermediate results | P0 |
| | Third-party libraries cannot be called inside kernels | P0 |
| `for i in range(N):` | Consider `tl.static_range` | P2 |
| | Triggers CPU-NPU synchronization | P2 |
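Both P1 precision rules from the section above (promote before reducing; subtract the max before `exp`) can be demonstrated in plain NumPy, independent of any NPU:

```python
import numpy as np

# 1) Reducing fp16 directly can overflow (fp16 max is 65504);
#    promoting to fp32 first yields the exact sum.
x = np.full(1000, 100.0, dtype=np.float16)     # true sum = 100000
bad = x.sum(dtype=np.float16)                  # accumulates in fp16 -> inf
good = x.astype(np.float32).sum()              # 100000.0

# 2) Softmax without max subtraction overflows exp();
#    subtracting the row max keeps every exponent <= 0.
logits = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)
naive = np.exp(logits)                         # inf for every element
stable = np.exp(logits - logits.max())
probs = stable / stable.sum()                  # a valid distribution
```

The stable variant is mathematically identical to the naive softmax (the subtracted constant cancels in the ratio), which is why the review flags the missing max subtraction as a pure bug, not a trade-off.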
| Code Feature | Risk |
|---|---|
| Multiple `tl.load` of the same data | Redundant GM access |
| | Non-contiguous memory access |
| | Load imbalance |
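For the load-imbalance row, an even split of work blocks over cores keeps the per-core skew at one block at most. A pure-Python sketch (`partition` is a hypothetical helper, not a Triton API):

```python
def partition(n_blocks: int, core_num: int) -> list[int]:
    """Distribute n_blocks of work over core_num cores as evenly as possible."""
    base, rem = divmod(n_blocks, core_num)
    return [base + 1 if i < rem else base for i in range(core_num)]

# 98 blocks over 40 cores: 18 cores take 3 blocks, 22 take 2.
counts = partition(98, 40)
```

A naive split that gives the remainder to a single core would leave one core doing far more work; the `divmod`-based split bounds the skew at one block regardless of input size.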
Reference: `code-review-report-template.md`

| Workflow Phase | Load Document | Do Not Load |
|---|---|---|
| Phase 1: Host Side | `ascend-triton-api-constraints.md` | dtype-matrix, test-patterns |
| Phase 2: Device Side | `ascend-api-dtype-matrix.md` | test-patterns |
| Item-by-item Check | | test-patterns, dtype-matrix |
| Need to reference official implementations | test-patterns | — |