# Triton Operator Performance Optimization (Ascend NPU)

## Golden Rules: Accuracy and Generalization Are Non-Negotiable

No performance optimization may breach the following two bottom lines:

1. **Accuracy**: the optimized operator must match the native PyTorch-NPU implementation (rtol=1e-3, atol=1e-3). Reductions must be upcast to FP32, and matrix-multiplication accumulators must use FP32. Any optimization that fails accuracy verification must be rolled back.
2. **Generalization**: the optimized operator must support all of the original input shapes and data types. Do not hardcode for specific sizes at the cost of losing support for non-aligned dimensions and edge cases. Optimizations must leave the operator's interface and semantics unchanged.

Priority: **Correctness > Generalization > Performance**. When the three conflict, trade off in this order.

## Core Principles

The compiler will do its best to optimize, but it may start from incorrect assumptions. Your job is to give the compiler instructions it cannot misinterpret.

Optimization should aim to let the hardware do what it is good at (Cube for matrix multiplication, Vector for element-wise operations), eliminate redundant GM accesses, and maximize data reuse in UB.

## Optimization Workflow

### Phase 1: Receive Performance Evaluation Conclusions (MANDATORY)

You must obtain performance evaluation results before starting optimization. Performance collection and analysis are handled by the `triton-operator-performance-eval` Skill.

⚠️ **The only trusted sources of performance data are `msprof` and `msprof op`.** Performance data collected any other way (including but not limited to Python `time.time()`, `torch.npu.Event` timing, `triton.testing.do_bench`, and custom timing decorators) is never acceptable and must never be used as a basis for optimization. Such data lacks precision, cannot exclude interference from system scheduling and JIT compilation, and has no reference value. Reject optimization requests based on such data and require re-collection with `msprof` / `msprof op` first.

Extract the following from the evaluation conclusions:

- **Bottleneck type**: Memory-Bound / Compute-Bound / Latency-Bound
- **Key bottleneck metrics**: which hardware utilizations are low, and which resources are contended
- **Performance issue list**: a prioritized list of issues to optimize

**Reference to load**: read `ascend-terminology.md` to understand the hardware architecture and terminology.
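The bottleneck classification above can be sketched as a small host-side helper. This is an illustrative sketch only: the metric names (`mte2_ratio`, `cube_ratio`, `vec_ratio`) and the 30% threshold are hypothetical, not actual msprof field names — real pipeline-busy ratios must come from the `msprof op` report.

```python
def classify_bottleneck(mte2_ratio: float, cube_ratio: float,
                        vec_ratio: float) -> str:
    """Pick the dominant pipeline-busy ratio; when no pipeline is
    kept busy, the kernel is limited by latency, not throughput."""
    busiest = max(mte2_ratio, cube_ratio, vec_ratio)
    if busiest < 0.3:
        return "Latency-Bound"   # nothing is busy -> parallelism/sync issue
    if busiest == mte2_ratio:
        return "Memory-Bound"    # GM <-> UB data movement dominates
    return "Compute-Bound"       # Cube or Vector pipeline dominates
```

The same decision could be driven by any monotone utilization metric; the point is that the bottleneck type, not raw latency, selects the Phase 2 strategy.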

### Phase 2: Select Optimization Strategies Based on Bottlenecks

| Bottleneck Type | Optimization Focus | Key Methods |
| --- | --- | --- |
| Memory-Bound | Memory access patterns and data reuse | Vectorized memory access, UB cache reuse, operator fusion |
| Compute-Bound | Compute unit utilization | Cube unit adaptation, block size tuning |
| Latency-Bound | Parallelism and synchronization overhead | Increase parallelism, reduce CPU-NPU synchronization |

**Four basic tuning steps** (check in order):

1. **Block size and grid size** — adapt to UB capacity and Cube granularity
2. **Vectorized memory access** — contiguous access + masks + alignment
3. **UB caching and data reuse** — intra-core re-tiling to fit the 192KB UB
4. **Compile-time constants and loop unrolling** — `tl.constexpr` + `tl.static_range`

**Checkpoint**: after each step, verify that accuracy has not regressed before moving on to the next.

**Reference to load**: read `optimization-patterns.md` for detailed code patterns for the four steps.
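Step 1 (block/grid sizing) can be illustrated with a host-side helper. A minimal sketch, assuming a one-dimensional launch: when there are more blocks than physical cores, the grid is clamped to the core count and each core is expected to loop over several blocks inside the kernel, rather than oversubscribing the grid.

```python
from math import ceil

def choose_grid(n_elements: int, block_size: int, num_cores: int) -> int:
    """Clamp the 1-D grid to the physical core count.

    num_blocks is how many BLOCK_SIZE tiles cover the input; launching
    more programs than cores only adds scheduling overhead on Ascend.
    """
    num_blocks = ceil(n_elements / block_size)
    return min(num_blocks, num_cores)
```

For example, with 40 cores and BLOCK_SIZE=1024, a 10M-element input still launches a grid of 40, and each program strides through its share of the ~9.8k blocks.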

### Phase 3: Hardware-Specific Optimization

- **Cube unit adaptation**: BLOCK_M/N/K must be multiples of 16, and accumulators must use FP32
- **UB space management**: compute the total size of all buffers and ensure it is < 192KB; single-value buffers must be 32B-aligned
- **Grid configuration**: grid dimensions must not exceed the number of physical cores; use `driver.active.utils.get_device_properties("npu")` to query the core count

**Example: querying the physical core count**

```python
from triton.runtime import driver

# Operators containing tl.dot run on AI Cores (Cube)
core_num = driver.active.utils.get_device_properties("npu")["num_aicore"]
# All other operators run on Vector cores
core_num = driver.active.utils.get_device_properties("npu")["num_vectorcore"]
```

**Checkpoint**: after hardware-specific optimization, verify with multiple input shapes that generalization has not been compromised.

**References to load**:
- Read [`triton-ascend-api.md`](references/triton-ascend-api.md) for Ascend-specific APIs and high-performance implementation patterns
- Read [`tiling-strategies.md`](references/tiling-strategies.md) to understand tiling strategy design
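The UB budget check described above is back-of-envelope arithmetic that is worth doing before touching kernel code. A minimal sketch, with illustrative names; note that double buffering (Phase 4) would double each tile's footprint:

```python
DTYPE_BYTES = {"fp16": 2, "bf16": 2, "fp32": 4, "int32": 4}
UB_CAPACITY = 192 * 1024  # 192KB Unified Buffer per core

def align_up(nbytes: int, alignment: int = 32) -> int:
    """Even a single-value buffer occupies one 32B-aligned slot in UB."""
    return (nbytes + alignment - 1) // alignment * alignment

def ub_budget_ok(buffers: list[tuple[int, str]]) -> bool:
    """buffers: (element_count, dtype) pairs for every UB-resident tile."""
    total = sum(align_up(count * DTYPE_BYTES[dtype])
                for count, dtype in buffers)
    return total < UB_CAPACITY
```

For instance, an 8192-element FP16 input tile plus an 8192-element FP32 accumulator uses 48KB and fits comfortably; a single 64K-element FP32 tile (256KB) does not.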

### Phase 4: Advanced Optimization (As Needed)

- **Operator fusion**: merge multiple GM accesses into one and reuse intermediate results in UB
- **Double buffering**: ping-pong loading to hide memory-access latency

**Reference to load**: read the "Advanced Optimization Techniques" section in `optimization-patterns.md`.
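Why double buffering helps can be seen with a deliberately simplified cost model: with a single buffer, every tile pays load + compute serially; with ping-pong buffers, the next tile's load overlaps the current tile's compute, so steady-state cost is bounded by the slower stage. This is a toy sketch, not a model of real MTE/Vector pipeline behavior:

```python
def pipeline_time(load_us: float, compute_us: float, n_tiles: int,
                  double_buffer: bool) -> float:
    """Toy model of per-tile GM->UB load overlapped with compute."""
    if not double_buffer:
        # Serial: every tile waits for its load, then computes.
        return n_tiles * (load_us + compute_us)
    # Ping-pong: steady state is bounded by the slower stage,
    # plus one fill/drain of the faster stage.
    return min(load_us, compute_us) + n_tiles * max(load_us, compute_us)
```

With load and compute both at 10µs over 8 tiles, the serial version costs 160µs while the ping-pong version costs about 90µs; the benefit shrinks as one stage dominates the other.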

### Phase 5: Verification (MANDATORY)

1. **Accuracy verification**: compare against the native PyTorch-NPU implementation (rtol=1e-3, atol=1e-3)
2. **Generalization verification**: test non-aligned dimensions and edge cases (non-power-of-two sizes such as 127, 255, 1023)
3. **Performance verification**: re-run `triton-operator-performance-eval` to confirm the optimization's effect
4. **End-to-end performance regression**: if the optimization fuses operators in a way that changes the computation graph, or adds tensor preprocessing outside the kernel (e.g. reshape, transpose, contiguous), you must compare end-to-end latency before and after via the msprof function-level profiling in `triton-operator-performance-eval`, to ensure in-kernel speedups are not offset by out-of-kernel overhead
5. **Regression check**: ensure the optimization has not changed the operator's interface or semantics
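In practice, item 1 is typically a `torch.testing.assert_close(out, ref, rtol=1e-3, atol=1e-3)` call against the PyTorch-NPU reference. The element-wise criterion behind that check is, in essence:

```python
def within_tolerance(actual: float, expected: float,
                     rtol: float = 1e-3, atol: float = 1e-3) -> bool:
    """Element-wise tolerance test: |actual - expected| must not
    exceed atol + rtol * |expected| (the torch.allclose criterion)."""
    return abs(actual - expected) <= atol + rtol * abs(expected)
```

Note that the relative term scales with the reference value, so small FP16 reduction errors that pass at magnitude 1.0 can fail near zero; this is exactly why reductions must accumulate in FP32.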

## Anti-Pattern List (NEVER)

- NEVER optimize directly from evaluation conclusions that do not come from the `triton-operator-performance-eval` Skill (bottleneck diagnosis data must come first)
- NEVER accept or use performance data collected by any means other than `msprof` / `msprof op` (including `time.time()`, `torch.npu.Event` timing, `triton.testing.do_bench`, custom timers, etc.) — such data is never acceptable and must not inform performance evaluation or optimization decisions
- NEVER sacrifice accuracy for performance (accuracy is a non-negotiable bottom line)
- NEVER hardcode for specific sizes at the expense of generalization (an optimization must be valid for all legal inputs)
- NEVER reduce directly in FP16 (upcast to FP32 first)
- NEVER use BLOCK sizes that are not multiples of 16 for matrix multiplication
- NEVER forget masks (Ascend has zero tolerance for out-of-bounds access)
- NEVER let BLOCK_SIZE exceed UB capacity (192KB)
- NEVER use non-contiguous memory access patterns
- NEVER call `tensor.item()` on a hot path (it triggers CPU-NPU synchronization)
- NEVER submit optimized code that has not passed accuracy verification
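The mask rule above can be visualized with a host-side model of `tl.load`'s semantics. This is a pure-Python sketch, not Triton code: out-of-bounds lanes yield the `other` fill value instead of performing the invalid read that an unmasked load would attempt.

```python
def masked_load(data: list[float], block_start: int, block_size: int,
                other: float = 0.0) -> list[float]:
    """Model of tl.load(ptr + offs, mask=offs < n, other=other):
    lanes whose offset falls past the end of the tensor read `other`
    rather than touching memory outside the allocation."""
    n = len(data)
    offs = range(block_start, block_start + block_size)
    return [data[i] if i < n else other for i in offs]
```

For a 3-element tensor and BLOCK_SIZE=4, the last block loads one real element and three fill values; without the mask, those three lanes would be out-of-bounds accesses.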

## Common Pitfalls and Troubleshooting

| Issue | Symptom | Root Cause | Solution |
| --- | --- | --- | --- |
| UB overflow | Compilation error / runtime OOM | BLOCK_SIZE too large | Reduce BLOCK_SIZE or re-tile within the core |
| Cube miss | Performance only ~10% of theoretical | BLOCK not a multiple of 16 | Force BLOCK_M/N/K to multiples of 16 |
| Accuracy loss | Large deviations in FP16 results | Reduction not upcast | Use FP32 accumulators |
| Non-contiguous access | Bandwidth utilization only ~20% | Jumping addresses | Rearrange the data layout to be contiguous |
| Inter-core communication overhead | Performance degrades with more grids | Data movement between AI Core clusters | Increase block granularity |

## Optimization Checklist

### Bottom-Line Checks (MANDATORY)

- Accuracy aligned with the native PyTorch-NPU implementation (rtol=1e-3, atol=1e-3)?
- Non-aligned dimensions and edge cases pass their tests?
- Operator interface and semantics unchanged?

### Compile Time

- Grid guaranteed to be at most the number of hardware cores?
- BLOCK_SIZE a compile-time constant (`tl.constexpr`)?
- Loops using `tl.static_range`?

### Memory

- Total size of all buffers < UB capacity (192KB)?
- Single-value buffers allocated 32B?
- Addresses 32-byte aligned?
- Masks added to every load/store?

### Computation

- Reductions upcast to FP32?
- Matrix-multiplication BLOCK sizes multiples of 16?
- Data reuse fully exploited?

### Verification

- Performance evaluation re-run after optimization?
- Multiple input sizes, including edge cases, tested?
- If the computation graph changed or tensor preprocessing was added, has end-to-end non-degradation been confirmed via msprof function-level profiling?

## Reference Resources

### On-Demand Document Loading

| Scenario | Load | Do Not Load |
| --- | --- | --- |
| Understand hardware architecture and terminology | `ascend-terminology.md` | All others |
| Basic tuning and advanced optimization code patterns | `optimization-patterns.md` | `ascend-terminology.md` |
| Tiling strategy design | `tiling-strategies.md` | `triton-ascend-api.md` |
| Ascend-specific APIs and implementation patterns | `triton-ascend-api.md` | `tiling-strategies.md` |

### Related Skills

- `triton-operator-performance-eval` — performance collection and evaluation (msprof usage, bottleneck diagnosis, performance report generation)

### Official Resources