vector-triton-ascend-ops-optimizer
Performance Optimization of Vector-type Triton Operators
Objectives and Overview
Expert in deep performance optimization of Vector-type Triton operators on Ascend NPU.
Core Objective: Improve the performance of the specified Triton operator by at least x times (the factor required by the user). Once that requirement is met, keep pushing for the highest performance achievable.
Working Mode: Single-operator optimization mode. Using graph capture to improve performance is forbidden (the model side performs graph optimization through full-graph or Piecewise capture; here the focus is solely on independent optimization of a single operator).
Working Principles:
- Correctness First: Verify correctness and measure performance after every modification
- Goal-Oriented: Keep optimizing, without stopping the iteration, until the performance target is reached
- Iterative Optimization: Modify, test, and iterate repeatedly until the goal is reached. Always back up the Triton operator source code before modifying it, so it can be restored if needed.
- Precise Modification: Pursue "surgical" precision in every modification to avoid introducing new issues.
Workflow
- Environment setup: In the Ascend NPU environment, execute the following command to complete environment configuration:
  `export LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64:$LD_LIBRARY_PATH && source /usr/local/Ascend/ascend-toolkit/set_env.sh`
- Baseline Performance Verification: First, conduct an in-depth analysis of the operator's input parameters, data types, Shape ranges, functional logic, computation flow, and outputs. Then run the functional test file to verify the operator's correctness and precision:
  `python -m pytest test_<op_name>.py`
  Finally, execute the performance test command below; the Task Duration(us) in its output is the operator's current latency, which should be recorded as the baseline:
  `msprof op --output=<user-specified path> --kernel-name="<op_name>_kernel" --warm-up=20 --launch-count=20 python test_<op_name>_perf.py`
- Deep Performance Optimization: Based on the baseline analysis, apply targeted optimizations to the operator source `<op_name>.py`, ensuring performance improves by at least x times (the factor required by the user); beyond that, pursue the highest performance achievable. After each change, run:
  - Correctness verification: `python -m pytest test_<op_name>.py`
  - Performance test (compared with the baseline): `msprof op --output=<user-specified path> --kernel-name="<op_name>_kernel" --warm-up=20 --launch-count=20 python test_<op_name>_perf.py`
- Documents to consult as needed during iterative tuning: `references/hardware_constraints.md`, `references/troubleshooting.md`
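Both the baseline and post-optimization latencies come from the Task Duration(us) field of the msprof output, so the comparison step can be scripted. A minimal host-side sketch (the exact msprof report layout varies by version, so the `Task Duration(us): <value>` line format and the helper names here are assumptions, not part of the tool's documented interface):

```python
import re

def parse_task_duration_us(msprof_output: str) -> float:
    """Extract the first 'Task Duration(us)' value from msprof text output.

    Assumes a line shaped like 'Task Duration(us): 12.34'; adjust the
    pattern to match the actual report of your msprof version.
    """
    m = re.search(r"Task Duration\(us\)\s*[:=]?\s*([0-9]+(?:\.[0-9]+)?)",
                  msprof_output)
    if m is None:
        raise ValueError("Task Duration(us) not found in msprof output")
    return float(m.group(1))

def speedup(baseline_us: float, optimized_us: float) -> float:
    """Speedup factor to compare against the user-required x-times target."""
    return baseline_us / optimized_us
```

For example, a baseline of 42.0 us and an optimized latency of 10.5 us yields a 4.0x speedup, which is then checked against the required factor.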
Performance Optimization References
tl.constexprbatch_sizeseq_len1 - Ascend NPU has relatively weak memory access capability but strong computing capability in its architecture, so it is necessary to minimize frequent memory access during design. The top key optimization point is batch processing of multiple Tokens, which must be prioritized for consideration and debugging to avoid a large amount of memory access overhead caused by loading one by one; due to hardware memory capacity limitations, it is impossible to process the complete sequence at once, so batch-based calculation is still required.
The maximum number of Tokens N that can be processed in one loop is determined by the available UB capacity within the Kernel. Assumptions:
- Total UB capacity within a single Kernel is 192 KB
- To leave a safety margin, only 50% of 170 KB is used (to ensure Double Buffering is enabled), i.e., 85 KB
- The peak UB space occupied by a single Token in the Kernel is $S_{\text{token}}$ (including memory occupancy of all loads and intermediate variables)
Then the following must be satisfied: $N \times S_{\text{token}} \le 85 \times 1024$; therefore: $N \le \frac{85 \times 1024}{S_{\text{token}}}$
Example: If the Kernel performs only one load and one store, loading a BF16 Tensor of shape (batch_size, hidden_size) (2 Bytes per element), and no other intermediate variables are introduced, the peak UB occupancy of a single Token is:
$$
S_{\text{token}} = \text{hidden_size} \times 2
$$
Substituting into the constraint:
$$
N \times \text{hidden_size} \times 2 \le 85 \times 1024
$$
Calculate the optimized number of loops `reduced_loops` and the maximum number of Tokens N that can be processed in a single loop accordingly. A single loop should fill the UB as much as possible, but must be kept within about half of the UB size so that the Double Buffering mechanism can provide pipeline parallelism. Use integer division (`//`) rather than `tl.cdiv` when computing the maximum processing capacity, otherwise UB overflow is likely.
2 - Mask and tail-block processing: Use a `mask` every time the kernel loads and stores a tensor, to handle tail blocks that need no computation. After mask processing, the tensor shape on each core remains consistent.
3 - Reduce Scalar operations inside the kernel: Move calculations that do not depend on pid or loop variables into auxiliary functions or outside the loop; merge calculations wherever possible to cut redundant operations.
4 - For operations involving non-contiguous address access such as `index_select`, data can only be read row by row in a loop; otherwise a large number of Scalar calculations (to compute a 2D mask) will be introduced, which seriously hurts performance.
5 - Interleave loading and computation: When data must be loaded several times from the same global memory address and then combined (e.g., by addition), use a "load once, compute once" pattern instead of loading everything first and computing afterwards. The latter makes the compute pipeline wait for all tensors to finish loading, which is inefficient; the former effectively hides memory access latency.
6 - If there are multiple write streams, write data out as it is computed. Write streams usually do not conflict with each other, and storing results as soon as they are ready increases the opportunity for parallelism and improves overall performance.
7 - Using `tl.arange` can efficiently generate indices for a 2D tensor, avoiding the large number of Scalar calculations incurred when discrete rows are read from Global Memory (GM) for 2D array operations, thereby significantly improving performance.
8 - Avoid `tl.where` where possible; it is mainly suited to discrete data processing and performs poorly.
9 - Avoid calling `insert_slice` multiple times on the same tensor, to improve execution efficiency.
10 - When performing reductions, prefer reducing over the largest dimension, which helps performance.
11 - Kernel input parameters: For parameters that stay constant across calls of the same model, declare them as `tl.constexpr` compile-time constants so the compiler can optimize better; parameters that may change (such as `batch_size`, `seq_len`, etc.) should be passed as ordinary dynamic arguments, since too many compile-time constants lead to long compilation times.
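The UB-budget constraint from point 1 can be reduced to a few lines of host-side arithmetic. A minimal sketch, assuming the 85 KB usable-UB budget and BF16 element size stated above (the `hidden_size` and token counts used in the test are example values only):

```python
# Sketch of the UB budget arithmetic from point 1. The 85 KB figure comes
# from the text (about half of 170 KB, leaving room for Double Buffering).
UB_USABLE_BYTES = 85 * 1024
BYTES_PER_ELEM = 2  # BF16

def max_tokens_per_loop(hidden_size: int) -> int:
    # S_token = hidden_size * 2 bytes for a single load + single store kernel.
    s_token = hidden_size * BYTES_PER_ELEM
    # Integer division (//), NOT ceiling division, or the block overflows UB.
    return UB_USABLE_BYTES // s_token

def reduced_loops(total_tokens: int, hidden_size: int) -> int:
    n = max_tokens_per_loop(hidden_size)
    # Ceiling division is fine for the loop count: the tail is handled by mask.
    return -(-total_tokens // n)
```

For example, with hidden_size = 4096, S_token = 8192 bytes, so N = 87040 // 8192 = 10 Tokens per loop, and 1000 Tokens require 100 loops.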
Rules and Constraints to Follow
Single-operator Mode
A single operator focuses only on basic functionality and performance in single-operator mode. Using graph capture to improve performance is forbidden, because the model side performs graph optimization over multiple operators through full-graph or Piecewise capture.
Requirements for tl.load and mask Usage
- Merge identical load, compute, and store operations wherever possible. For example, `tl.load` with the mask parameter can load the data of multiple Tokens at once, avoiding multiple independent loads. Reducing such redundant operations helps performance.
- Avoid the other parameter of `tl.load`: it triggers a `tl.where` internally, which prevents the load from running in parallel with other loads.
- Recommended alternative: first perform a `tl.load` without a mask, then implement the mask logic with a `tl.where` + mask combination; when the memory access pattern is regular and contiguous, use `tl.insert_slice` instead.
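The offset/mask arithmetic that such a masked block load relies on can be illustrated host-side. A sketch in NumPy rather than Triton (`BLOCK` and `n_tokens` are example values; inside a kernel the same shapes would come from `tl.arange`):

```python
import numpy as np

# Host-side illustration of the index/mask arithmetic behind a masked block
# load: each program instance covers BLOCK tokens, and the mask disables the
# tail positions beyond n_tokens.
BLOCK = 8
n_tokens = 13  # example: deliberately not a multiple of BLOCK

def block_offsets_and_mask(pid: int):
    offsets = pid * BLOCK + np.arange(BLOCK)  # tl.arange equivalent
    mask = offsets < n_tokens                 # tail-block mask
    return offsets, mask
```

For pid = 1 the offsets cover tokens 8..15 and the mask keeps only 8..12, so every program instance works on a tensor of the same shape regardless of the tail.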
Branch and Compilation Constraints
In the `if-else` branches inside a kernel, variables with the same name must have the same Shape; otherwise compilation errors will occur.
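A Triton-style pseudocode sketch of this constraint (variable names and block sizes are hypothetical):

```
# BAD: `acc` has shape (BLOCK, 1) in one branch and (BLOCK,) in the other,
# so the compiler cannot assign it a single consistent shape.
if cond:
    acc = tl.zeros((BLOCK, 1), dtype=tl.float32)
else:
    acc = tl.zeros((BLOCK,), dtype=tl.float32)   # shape mismatch -> compile error

# GOOD: both branches produce the same shape.
if cond:
    acc = tl.zeros((BLOCK,), dtype=tl.float32)
else:
    acc = tl.full((BLOCK,), 1.0, dtype=tl.float32)
```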
Notes on Data Movement
- Ensure that tl.load reads contiguous multi-row data; if the data layout is discrete, load it row by row.
- Tensors passed to a Triton operator must be memory-contiguous; when necessary, guarantee this with the `.contiguous()` method.
- Avoid reusing variable names between `tl.load` and `tl.store`; distinct names improve readability and reduce the risk of data-flow errors.
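The contiguity requirement can be checked on the host before launch. A sketch using NumPy as a stand-in (for real operator inputs, PyTorch's `Tensor.is_contiguous()` and `.contiguous()` play the same roles as the flags check and copy below):

```python
import numpy as np

def ensure_contiguous(arr: np.ndarray) -> np.ndarray:
    # A transposed view is typically not C-contiguous; copy it into a
    # contiguous buffer before handing it to the kernel launch.
    if not arr.flags["C_CONTIGUOUS"]:
        arr = np.ascontiguousarray(arr)
    return arr
```

For instance, `np.zeros((4, 8)).T` is a non-contiguous view; passing it through `ensure_contiguous` yields a contiguous buffer with the same shape and values.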
Execution Requirements
In Ascend NPU operator optimization, independently complete the full closed loop from code modification (with "surgical" precision) through test verification to performance comparison, ensuring the performance improvement reaches at least the x times required by the user. Iterate continuously, without introducing errors, until the goal is achieved.
Result Report
After the performance optimization goal has truly been achieved, output the standardized report below:
Optimization Result Report
Operator Information
- Operator Name: <op_name>
- Source File: <file_path>
Performance Comparison
| Baseline Latency (us) | Optimized Latency (us) | Speedup |
|---|---|---|
| ... | ... | ...x |
List of Optimization Techniques
- [Applied] Multi-Token parallel processing: N = ...
- [Applied] Eliminated tl.load with the other parameter
- ...
Key Modification Notes
- Modification 1: ...
- Modification 2: ...