triton-operator-code-gen

Generates Triton kernel code for Ascend NPU from operator design documents. Use it when users need to implement Triton operator kernels and turn requirement documents into executable code. Core capabilities: (1) parse requirement documents to confirm the computing logic; (2) design the tiling strategy; (3) generate high-performance kernel code; (4) generate test code to verify correctness.

5 installs

Source: ascend/agent-skills

Added on: 2026-04-15

NPX Install

npx skill4agent add ascend/agent-skills triton-operator-code-gen

SKILL.md Content (Chinese)

Triton Operator Code Generation

Core Principles

Computing Logic → Tiling Strategy → Code Implementation

This order must not be reversed. Incorrect computing logic produces completely wrong results, while an incorrect tiling strategy causes performance issues or memory overflow.

Reference Resource Loading Routing Table

MANDATORY - Load On Demand: Load only the reference documents listed for the current task phase.

Phase	Must Load	Do Not Load
Understand Requirement Documents	None	All references
Confirm Computing Logic	None	All references
Design Tiling Strategy	`hardware-architecture.md`	`templates.md`
Generate Kernel Code	`templates.md`	`hardware-architecture.md`
Generate Test Code	None	All references

Workflow

Phase 1: Understand Requirement Documents

Extract: mathematical formulas, input/output specifications, constraints, tiling strategies

Phase 2: Confirm Computing Logic

Describe the computing process with pseudocode
Confirm data dependency relationships
Confirm precision handling (reduction operations must use FP32)

Output: Computing logic confirmation (must be confirmed with the user)
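
To illustrate why Phase 2 insists on FP32 for reductions, the sketch below emulates a half-precision and a single-precision accumulator using only Python's standard `struct` module, so it needs neither Triton nor an NPU. The helper names `round_fp16`, `round_fp32`, and `accumulate` are hypothetical and exist only for this illustration.

```python
import struct


def round_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def round_fp32(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def accumulate(values, round_fn):
    """Sum `values` sequentially, rounding the accumulator after every add."""
    acc = 0.0
    for v in values:
        acc = round_fn(acc + v)
    return acc


# 4096 FP16 inputs, each the half-precision value closest to 0.1.
inputs = [round_fp16(0.1)] * 4096

reference = sum(inputs)                    # float64 reference sum
fp16_sum = accumulate(inputs, round_fp16)  # accumulator kept in FP16
fp32_sum = accumulate(inputs, round_fp32)  # accumulator upgraded to FP32

print(f"reference={reference}, fp16={fp16_sum}, fp32={fp32_sum}")
```

Once the FP16 accumulator is large enough that a single input is smaller than half the gap between adjacent FP16 values, further additions are rounded away. The FP16 total therefore falls short of the reference, while the FP32 accumulator matches it for this input.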

Phase 3: Design Tiling Strategy

MANDATORY - READ ENTIRE FILE: You must read the entire `hardware-architecture.md` before designing the tiling strategy.

Never set any line limits.

Inter-core Partitioning Principles (Must Follow):

Principle 1: grid = number of physical cores. Every core is used, so no compute resources are wasted.
Principle 2: Balanced inter-core load. Each core computes for itself which blocks to process, so the work is spread evenly across cores.

```python
import triton
import triton.language as tl

# Core-count helper named in the original; it is not defined in this snippet.
core_num = get_npu_aicore_num()  # or get_npu_vectorcore_num()

grid = (core_num,)  # Principle 1: grid must equal the number of physical cores

@triton.jit
def xxx_fwd(
    # ...... (operator-specific arguments, elided in the original)
    M, N,
    BLOCK_M: tl.constexpr,
    BLOCK_N: tl.constexpr,
):
    pid = tl.program_id(0)
    num_core = tl.num_programs(0)

    num_block_m = tl.cdiv(M, BLOCK_M)
    num_block_n = tl.cdiv(N, BLOCK_N)

    total_blocks = num_block_m * num_block_n

    # Principle 2: each core loops over its own share of blocks,
    # striding by the number of cores so the load stays balanced
    for block_idx in range(pid, total_blocks, num_core):
        pid_m = block_idx // num_block_n  # row index of the tile
        pid_n = block_idx % num_block_n   # column index of the tile
        # (per-tile load, compute, and store follow here)
```
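
The block distribution used by the kernel loop can be checked on the host without any NPU. The following is a minimal pure-Python simulation of `range(pid, total_blocks, num_core)`. The shapes, block sizes, and the core count of 40 are hypothetical values chosen for illustration, and `blocks_for_core` and `cdiv` are illustrative helpers rather than part of the skill.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, mirroring what tl.cdiv computes for positive ints."""
    return -(-a // b)


def blocks_for_core(pid: int, total_blocks: int, num_core: int) -> list:
    """Block indices visited by program `pid` in the strided kernel loop."""
    return list(range(pid, total_blocks, num_core))


# Hypothetical problem size and launch configuration.
M, N = 1000, 512
BLOCK_M, BLOCK_N = 64, 128
num_core = 40

num_block_m = cdiv(M, BLOCK_M)
num_block_n = cdiv(N, BLOCK_N)
total_blocks = num_block_m * num_block_n

assignment = {
    pid: blocks_for_core(pid, total_blocks, num_core) for pid in range(num_core)
}
counts = [len(blocks) for blocks in assignment.values()]

# Tile coordinates for the blocks handled by core 0.
tiles_core0 = [divmod(idx, num_block_n) for idx in assignment[0]]

print(f"total_blocks={total_blocks}, min={min(counts)}, max={max(counts)}")
print(f"core 0 tiles (pid_m, pid_n): {tiles_core0}")
```

Because consecutive block indices go to consecutive cores, the per-core block counts can differ by at most one, and every block is visited exactly once.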

UB Space Calculation:

Total UB size: 192KB (A2/A3), i.e. 196608 bytes
Safe BLOCK_SIZE (in elements) = (196608 - 32) / (number of buffers × data type size in bytes) × 0.8
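
The UB formula can be evaluated directly. The sketch below is one way to do so; `safe_block_size` and `floor_power_of_two` are hypothetical helper names. Rounding the result down to a power of two is an extra assumption that is not stated in this document: upstream Triton's `tl.arange` expects power-of-two extents, and the Ascend backend may or may not impose the same rule.

```python
UB_BYTES = 192 * 1024  # 196608 bytes of UB on A2/A3, as stated above


def safe_block_size(
    num_buffers: int,
    dtype_bytes: int,
    ub_bytes: int = UB_BYTES,
    reserved_bytes: int = 32,
    safety: float = 0.8,
) -> int:
    """Elements per buffer allowed by the UB formula, rounded down."""
    if num_buffers <= 0 or dtype_bytes <= 0:
        raise ValueError("num_buffers and dtype_bytes must be positive")
    usable = ub_bytes - reserved_bytes
    return int(usable / (num_buffers * dtype_bytes) * safety)


def floor_power_of_two(n: int) -> int:
    """Largest power of two <= n (assumption: block extents are powers of two)."""
    return 1 << (n.bit_length() - 1) if n > 0 else 0


# Example: three FP32 buffers (e.g. two inputs and one output) resident in UB.
limit = safe_block_size(num_buffers=3, dtype_bytes=4)
block_size = floor_power_of_two(limit)
print(f"limit={limit} elements, power-of-two BLOCK_SIZE={block_size}")
```

Every buffer that is live at the same time counts toward `num_buffers`, including intermediates such as FP32 upcasts of FP16 inputs.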

Phase 4: Generate Kernel Code

MANDATORY - READ ENTIRE FILE: You must read the entire `templates.md` before generating code.

Never set any line limits.

Flexibly refer to the corresponding template based on operator type:

Operator Type	Features	Core Type	Template
Reduction	sum/max/min reduction	vector core	Template 1
GEMM	tl.dot() matrix multiplication	AI core	Template 2
Activation Function	Element-wise calculation	vector core	Template 3
Loss Function	softmax + reduction	vector core	Template 4
Index Transformation	Index calculation, conditional branching	vector core	Template 5
Attention	QK^T + SV multi-stage	AI core	Template 6
MoE	Gating mechanism	vector core	Template 7
Post-processing	Simple data transformation	vector core	Template 8
Convolution	State update, sliding window	AI core	Template 9

Phase 5: Generate Test Code

Anti-Pattern List (NEVER)

❌ Writing code before the computing logic is confirmed
❌ Ignoring the UB size limit (192KB)
❌ Running reduction operations without FP32 precision
❌ Using the int64 data type (extremely poor performance)
❌ Launching a grid larger than 65535
❌ Using third-party libraries inside the kernel
❌ Computing element by element
❌ Applying overly complex optimizations (e.g., diagonal core partitioning)
❌ Calling third-party functions to get the core count
❌ Confusing the usage of Vector Core and Cube Core
❌ Implementing operators with PyTorch instead of Triton
❌ Skipping operator correctness tests
❌ Skipping tests on the NPU
❌ Leaving the accuracy of the test benchmarks unverified
❌ Using a grid size that differs from the number of physical cores (violates Inter-core Partitioning Principle 1)
❌ Leaving the inter-core load unbalanced (violates Inter-core Partitioning Principle 2)

Common Pitfalls

Pitfall	Symptom	Solution
Incorrect computing logic	Output results do not match expectations	Describe the computing process with pseudocode and confirm with the user
UB overflow	Runtime error "ub overflow"	Calculate total buffer size and reduce BLOCK_SIZE
coreDim exceeded	Runtime error "coreDim can't be greater than UINT16_MAX"	Increase BLOCK_SIZE or set `TRITON_ALL_BLOCKS_PARALLEL=1`
Precision loss	Inaccurate results with FP16 input	Upgrade precision to FP32 before reduction operations
Insufficient index length	D-cache error	Use int64 instead of int32 for indices when handling very large shapes
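
Two of the pitfalls above, the coreDim limit and the index width, can be screened on the host before launching a kernel. The following is a minimal sketch with hypothetical helper names. It assumes indices are element offsets into a contiguous tensor and that the grid limit is `UINT16_MAX` (65535), as the error message quoted in the table suggests.

```python
import math

UINT16_MAX = 65535     # grid-size limit named in the coreDim error message
INT32_MAX = 2**31 - 1  # largest offset representable as int32


def grid_size(total_elems: int, block_size: int) -> int:
    """Number of programs when each program handles exactly one block."""
    return -(-total_elems // block_size)


def min_block_size_for_grid(total_elems: int, max_grid: int = UINT16_MAX) -> int:
    """Smallest block size that keeps a one-block-per-program grid in range."""
    return -(-total_elems // max_grid)


def needs_int64_index(shape) -> bool:
    """True if the largest element offset of a contiguous tensor overflows int32."""
    return math.prod(shape) - 1 > INT32_MAX


total = 100_000_000
print(grid_size(total, 1024))          # exceeds UINT16_MAX for this size
print(min_block_size_for_grid(total))  # smallest block size that fits the limit
print(needs_int64_index((50_000, 50_000)))
```

When the grid is fixed to the number of physical cores, as the Inter-core Partitioning Principles require, the first check matters mainly for kernels that still launch one program per block.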

Checklist

Computing Logic

Mathematical formulas are correctly understood
Pseudocode is consistent with formulas
Boundary conditions are handled correctly
Data type conversions are correct

Tiling Strategy

grid = number of physical cores (Principle 1)
Intra-core loop handles multiple tasks with balanced load (Principle 2)
UB space calculation is correct
BLOCK_SIZE is reasonably selected

Kernel Implementation

Core count acquisition function is called correctly
Pointer calculation is correct
Mask handling is correct
Precision handling is correct (use FP32 for reduction)
No third-party library dependencies

Test Code

PyTorch reference implementation is correct
Test cases cover multiple shapes
Test cases cover multiple data types
Precision tolerance is set reasonably
Execute test code to ensure the operator runs correctly
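
The test-code checklist can be organized as a shape × dtype matrix with per-dtype tolerances. The sketch below shows only that scaffolding in plain Python. The shapes and the `(rtol, atol)` values are hypothetical placeholders to tune per operator, and `all_close` mirrors the usual `|actual - expected| <= atol + rtol * |expected|` criterion. A real test would compare the Triton kernel's output against a PyTorch reference on the NPU.

```python
import itertools

# Hypothetical (rtol, atol) per dtype; these are assumptions, not values from the skill.
TOLERANCES = {
    "float16": (1e-3, 1e-3),
    "bfloat16": (1e-2, 1e-2),
    "float32": (1e-5, 1e-5),
}

# Hypothetical shapes: tiny, non-aligned, aligned, and large.
SHAPES = [(1, 1), (7, 33), (128, 256), (1000, 4096)]


def build_test_cases():
    """Cartesian product of shapes and dtypes, each with its tolerance."""
    return [
        {"shape": shape, "dtype": dtype, "rtol": tol[0], "atol": tol[1]}
        for shape, (dtype, tol) in itertools.product(SHAPES, TOLERANCES.items())
    ]


def all_close(actual, expected, rtol: float, atol: float) -> bool:
    """Element-wise closeness check over two flat sequences of floats."""
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


cases = build_test_cases()
print(f"{len(cases)} test cases, first: {cases[0]}")
```

Including non-aligned shapes such as `(7, 33)` exercises the mask handling listed under Kernel Implementation.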