Loading...
Loading...
Generate Triton kernel code for Ascend NPU based on operator design documents. Used when users need to implement Triton operator kernels and convert requirement documents into executable code. Core capabilities: (1) Parse requirement documents to confirm computing logic (2) Design tiling partitioning strategy (3) Generate high-performance kernel code (4) Generate test code to verify correctness.
npx skill4agent add ascend/agent-skills triton-operator-code-gen| Phase | Must Load | Do Not Load |
|---|---|---|
| Understand Requirement Documents | None | All references |
| Confirm Computing Logic | None | All references |
| Design Tiling Strategy | | |
| Generate Kernel Code | | |
| Generate Test Code | None | All references |
hardware-architecture.mdcore_num = get_npu_aicore_num() # or get_npu_vectorcore_num()
grid = (core_num,) # Principle 1: grid must equal the number of physical cores
@triton.jit
def xxx_fwd(
......
M, N,
BLOCK_M: tl.constexpr,
BLOCK_N: tl.constexpr,
):
pid = tl.program_id(0)
num_core = tl.num_programs(0)
num_block_m = tl.cdiv(M, BLOCK_M)
num_block_n = tl.cdiv(N, BLOCK_N)
total_blocks = num_block_m * num_block_n
# Principle 2: Intra-core loop handles multiple tasks, each core calculates the data it needs to process
for block_idx in range(pid, total_blocks, num_core):
pid_m = block_idx // num_block_n
pid_n = block_idx % num_block_nTotal UB size: 192KB (A2/A3)
Safe BLOCK_SIZE = (196608 - 32) / (number of buffers × data type size) × 0.8templates.md| Operator Type | Features | Core Type | Template |
|---|---|---|---|
| Reduction | sum/max/min reduction | vector core | Template 1 |
| GEMM | tl.dot() matrix multiplication | AI core | Template 2 |
| Activation Function | Element-wise calculation | vector core | Template 3 |
| Loss Function | softmax + reduction | vector core | Template 4 |
| Index Transformation | Index calculation, conditional branching | vector core | Template 5 |
| Attention | QK^T + SV multi-stage | AI core | Template 6 |
| MoE | Gating mechanism | vector core | Template 7 |
| Post-processing | Simple data transformation | vector core | Template 8 |
| Convolution | State update, sliding window | AI core | Template 9 |
| Pitfall | Symptom | Solution |
|---|---|---|
| Incorrect computing logic | Output results do not match expectations | Describe the computing process with pseudocode and confirm with the user |
| UB overflow | Runtime error "ub overflow" | Calculate total buffer size and reduce BLOCK_SIZE |
| coreDim exceeded | Runtime error "coreDim can't be greater than UINT16_MAX" | Increase BLOCK_SIZE or set |
| Precision loss | Inaccurate results with FP16 input | Upgrade precision to FP32 before reduction operations |
| Insufficient index length | D-cache error | Replace int32 with int64 for index when dealing with super-large shapes |