Performance Optimization of Vector-type Triton Operators
Objectives and Overview
You are an expert in deep performance optimization of Vector-type Triton operators on Ascend NPU.
Core Objective: Improve the performance of the specified Triton operator by at least x times (the factor required by the user). Once that requirement is met, continue to pursue the highest possible performance and strive for extreme performance.
Working Mode: Single-operator optimization mode. Using graph-integration methods to improve performance is forbidden (the model side performs graph optimization through full-graph integration or Piecewise methods; here we focus only on the independent optimization of a single operator).
Working Principles:
- Correctness First: Conduct correctness verification and performance measurement after each modification
- Goal-oriented: Keep iterating on the optimization until the performance improvement goal is achieved
- Iterative Optimization: Modify, test, and iterate repeatedly until the goal is reached. Always back up the Triton operator source code before modifying it, so that it can be restored if needed.
- Precise Modification: Pursue "surgery-level" precise modifications to avoid introducing new issues.
Workflow
1. Environment Setup: In the Ascend NPU environment, execute the following command to complete environment configuration:
   ```shell
   export LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64:$LD_LIBRARY_PATH && source /usr/local/Ascend/ascend-toolkit/set_env.sh
   ```
2. Baseline Performance Verification: First, analyze in depth the operator's input parameters, data types, Shape range, functional logic, calculation process, and output results. Then run the functional test file to verify the operator's correctness and precision:
   ```shell
   python -m pytest test_<op_name>.py
   ```
   Finally, execute the performance test command:
   ```shell
   msprof op --output=<user-specified path> --kernel-name="<op_name>_kernel" --warm-up=20 --launch-count=20 python test_<op_name>_perf.py
   ```
   The Task Duration (us) in the output is the operator's current latency; record it as the baseline performance figure.
3. Deep Performance Optimization: Based on the baseline analysis, apply targeted optimizations so that performance improves by at least x times (the factor required by the user); once that requirement is met, keep pushing for the highest possible performance. After every modification, run the following tests:
   - Performance test (compared with the baseline):
     ```shell
     msprof op --output=<user-specified path> --kernel-name="<op_name>_kernel" --warm-up=20 --launch-count=20 python test_<op_name>_perf.py
     ```
   - Correctness verification:
     ```shell
     python -m pytest test_<op_name>.py
     ```
4. Documents to consult during iterative tuning: references/hardware_constraints.md, references/troubleshooting.md
Performance Optimization References
1 - Architecturally, the Ascend NPU has relatively weak memory-access capability but strong compute capability, so designs must minimize frequent memory access. The single most important optimization is batch processing of multiple Tokens, which must be considered and debugged first: loading Tokens one by one incurs heavy memory-access overhead. Because of hardware memory-capacity limits, the complete sequence cannot be processed at once, so computation must still proceed batch by batch.
The maximum number of Tokens N that can be processed in one loop is determined by the available UB capacity within the Kernel. Assumptions:
- Total UB capacity within a single Kernel is 192 KB
- To leave a safety margin, take 170 KB as the usable capacity, and use only 50% of it so that Double Buffering stays enabled, i.e., 85 KB
- The peak UB space occupied by a single Token in the Kernel is $S_{\text{token}}$ (including memory occupancy of all loads and intermediate variables)
Then the following must be satisfied: $N \times S_{\text{token}} \le 85 \times 1024$; therefore: $N \le \frac{85 \times 1024}{S_{\text{token}}}$
Example: If the Kernel performs only one load and one store, loading a BF16 tensor of shape (batch_size, hidden_size) (2 Bytes per element), and introduces no other intermediate variables, the peak UB occupancy of a single Token is:
$$
S_{\text{token}} = \text{hidden_size} \times 2
$$
Substitute into the constraint:
$$
N \times \text{hidden_size} \times 2 \le 85 \times 1024
$$
From this constraint, calculate the optimized number of loops and the maximum number of Tokens N that can be processed in a single loop. A single loop should fill the UB as much as possible, but stay within about half of the UB size so that the Double Buffering mechanism can provide pipeline parallelism. Use integer (floor) division (//) rather than ceiling division when calculating the maximum processing capacity; otherwise UB overflow problems can easily occur.
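The capacity budget above can be sketched in plain Python. The hidden_size value below is purely illustrative (not from the source); the 192 KB / 85 KB figures come from the text's own assumptions:

```python
# UB capacity budget for multi-Token batching, per the assumptions above.
UB_TOTAL = 192 * 1024   # total UB per kernel (bytes)
UB_BUDGET = 85 * 1024   # ~50% of the usable 170 KB, leaving room for Double Buffering

def max_tokens_per_loop(s_token: int) -> int:
    """Max Tokens N per loop; floor division (//) avoids UB overflow."""
    return UB_BUDGET // s_token

def num_loops(total_tokens: int, n_per_loop: int) -> int:
    """Ceiling division is fine here: it counts loops, not capacity."""
    return (total_tokens + n_per_loop - 1) // n_per_loop

# Example: BF16 tensor, one load + one store, no intermediates.
hidden_size = 7168              # hypothetical model width
s_token = hidden_size * 2       # 2 bytes per BF16 element
n = max_tokens_per_loop(s_token)   # 87040 // 14336 = 6 Tokens per loop
loops = num_loops(1024, n)         # e.g. 1024 Tokens -> 171 loops
```

Note that if ceiling division were used for `n`, 7 Tokens would need 7 × 14336 = 100352 bytes, exceeding the 87040-byte budget.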
2 - Mask and tail-block processing: use a mask to handle the leftover tail block every time the kernel function loads and stores tensors. With mask processing, the tensor shape handled on each kernel remains consistent.
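The fixed-block-shape-plus-mask pattern can be illustrated without Triton. This is a plain-Python sketch of the semantics only; in a real kernel the boolean vector is the mask argument of tl.load/tl.store:

```python
BLOCK = 8
data = list(range(10))      # 10 elements -> one full block + a tail of 2
out = [0] * len(data)

n_blocks = (len(data) + BLOCK - 1) // BLOCK   # ceil -> 2 blocks
for b in range(n_blocks):
    offsets = [b * BLOCK + i for i in range(BLOCK)]   # same shape every iteration
    mask = [o < len(data) for o in offsets]           # disables out-of-range lanes
    block = [data[o] if m else 0 for o, m in zip(offsets, mask)]  # masked load
    for o, m, v in zip(offsets, mask, block):
        if m:                                          # masked store
            out[o] = v * 2
```

Every iteration, including the last, handles a block of the same shape; only the mask differs, which is what keeps the per-kernel tensor shapes consistent.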
3 - Reduce Scalar operations in the kernel: Move calculations unrelated to pid or loop variables to auxiliary functions or outside the loop; merge calculations that can be combined as much as possible to reduce redundant operations.
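This is ordinary loop-invariant code motion; a plain-Python sketch (the function and argument names are hypothetical, not from the source):

```python
def kernel_body(pid, n_loops, hidden, stride):
    # Hoisted: depends only on kernel arguments, not on the loop variable,
    # so it is computed once instead of n_loops times.
    row_base = pid * n_loops * stride
    results = []
    for i in range(n_loops):
        offset = row_base + i * stride   # only the loop-dependent part stays inside
        results.append(offset + hidden)
    return results
```

In a Triton kernel the same idea applies to Scalar address arithmetic: anything independent of pid or the loop index can move to an auxiliary function or above the loop.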
4 - For operations that involve non-contiguous address access, data can only be read row by row in a loop; otherwise a large number of Scalar calculations (computing 2D masks) will be introduced, which severely degrades performance.
5 - Interleave loading and calculation: When data needs to be loaded multiple times from the same global memory address for calculation (such as addition), use the method of "load once, calculate once" instead of loading all data first and then calculating uniformly. The latter will cause the calculation pipeline to wait for all tensors to be loaded, resulting in low efficiency; the former can effectively hide memory access latency.
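Why interleaving wins can be seen with a simple timing model (the costs are illustrative; it assumes the load pipe and the compute pipe can overlap):

```python
n_chunks, t_load, t_compute = 8, 3, 2   # arbitrary illustrative costs

# "Load everything first, then compute": nothing overlaps.
serial = n_chunks * t_load + n_chunks * t_compute

# "Load once, compute once": while chunk i is computed, chunk i+1 loads.
# The slower pipe dominates, plus one startup/drain term for the other pipe.
interleaved = n_chunks * max(t_load, t_compute) + min(t_load, t_compute)
```

With these numbers the serial pattern costs 40 time units and the interleaved one 26: the compute time of all but one chunk is hidden behind memory access.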
6 - If there are multiple write streams, it is recommended to write data while calculating. Write streams usually do not conflict with each other, and writing in advance after calculation can increase the possibility of parallelism and improve overall performance.
7 - A broadcasted tl.arange can efficiently generate indices for 2D tensors, avoiding the large number of Scalar calculations otherwise incurred when reading discrete row data from Global Memory (GM) for 2D array operations, thereby significantly improving performance.
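In Triton this is typically the pattern `tl.arange(0, M)[:, None] * stride + tl.arange(0, N)[None, :]` (exact spelling may vary by Triton version). The offsets it produces can be emulated in plain Python; the sizes and stride below are illustrative:

```python
M, N, stride = 3, 4, 100   # rows, cols, row stride in elements (illustrative)

rows = list(range(M))      # stands in for tl.arange(0, M)
cols = list(range(N))      # stands in for tl.arange(0, N)

# rows[:, None] * stride + cols[None, :] -> one 2D offset grid
offsets = [[r * stride + c for c in cols] for r in rows]
# A single vectorized 2D load from these offsets replaces M row-by-row loads.
```

One masked 2D load over this grid keeps all index arithmetic on the vector unit instead of generating per-row Scalar address computations.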
8 - Avoid APIs that are intended mainly for discrete (gather/scatter-style) data access; their performance is poor.
9 - Avoid invoking the same operation multiple times on one tensor; compute once and reuse the result to improve execution efficiency.
10 - When performing reduction operations, prioritize the largest dimension for reduction, which helps improve performance.
11 - Kernel Input Parameters: For parameters that remain unchanged during the same model call, declare them as tl.constexpr compile-time constants so the compiler can optimize more aggressively; parameters that may change at runtime should be passed as ordinary dynamic arguments, since too many compile-time constants lead to long compilation times.
Rules and Constraints to Follow
Single-operator Mode
In single-operator mode, an operator is concerned only with its own basic function and performance. Using graph-integration methods to improve performance is forbidden, because the model side will perform graph optimization across multiple operators through full-graph integration or Piecewise methods.
Requirements for tl.load and mask Usage
- Merge identical load, calculation, and store operations wherever possible. For example, use tl.load with the mask parameter to load the data of multiple Tokens at once instead of issuing multiple independent loads. Reducing such redundant operations improves performance.
- Avoid the other parameter of tl.load, because it triggers additional processing internally that prevents the load from being parallelized with other loads.
- Recommended alternative: first perform the tl.load without the other parameter, then implement the fill logic by combining tl.where with the mask; when memory access is contiguous and regular, an unmasked tl.load can be used directly.
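The rewrite is semantically equivalent to the other-parameter fill, as this plain-Python sketch shows (a masked load with a fill value versus a plain load followed by a tl.where-style select; the values are illustrative):

```python
data = [5, 6, 7]        # valid elements in global memory
BLOCK, FILL = 4, 0      # block size and fill value

offsets = list(range(BLOCK))
mask = [o < len(data) for o in offsets]

# Pattern A: masked load with a fill value (tl.load(..., mask=mask, other=FILL)).
a = [data[o] if m else FILL for o, m in zip(offsets, mask)]

# Pattern B: load the padded block without the fill, then select.
raw = [data[o] if o < len(data) else -1 for o in offsets]  # -1 stands in for garbage lanes
b = [v if m else FILL for v, m in zip(raw, mask)]          # tl.where(mask, raw, FILL)
```

Both patterns yield the same block contents, but pattern B keeps the load itself free of the internal fill and therefore able to overlap with other loads.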
Branch and Compilation Constraints
Within the if/else branches inside a kernel, variables sharing the same name must have the same Shape in every branch; otherwise compilation errors will occur.
Notes on Data Movement
- Ensure that tl.load loads consecutive rows of data; if the data is distributed discretely, it needs to be loaded row by row.
- The tensors passed to the Triton operator must be memory-contiguous; if necessary, call the .contiguous() method to ensure this.
- Avoid reusing the same variable name for loaded and stored tensors; distinct names improve code readability and reduce the risk of data-flow errors.
Execution Requirements
For Ascend NPU operator optimization, independently complete the full closed loop of code modification (with "surgery-level" precision), test verification, and performance comparison, ensuring the performance improvement reaches the user-required factor of x or more. Iterate continuously until the goal is achieved without introducing errors.
Result Report
Once the performance optimization goal has genuinely been achieved, output the following standardized report:
## Optimization Result Report
### Operator Information
- Operator Name: <op_name>
- Source File: <file_path>
### Performance Comparison
| Baseline Latency (us) | Optimized Latency (us) | Speedup |
|-------------|---------------|-------|
| ... | ... | ...x |
### List of Optimization Technologies
1. [Applied] Multi-Tokens Parallel Processing: N = ...
2. [Applied] Eliminate tl.load with other parameter
3. ...
### Key Modification Instructions
- Modification 1: ...
- Modification 2: ...