triton-operator-performance-eval
Triton Operator Performance Evaluation (Ascend NPU)
Core Concepts
Performance Data Principle: trust only measured data, never your intuition or assumptions. You must profile with `msprof` to locate the real performance hotspots before evaluation.
⚠️ The only reliable performance collection methods: `msprof` and `msprof op`

Performance data collected through any non-`msprof`/`msprof op` method (including but not limited to Python `time.time()`, `torch.npu.Event` timing, `triton.testing.do_bench`, custom timing decorators, etc.) is unacceptable and must never be used for performance evaluation or optimization decisions. These methods lack precision and cannot exclude interference from system scheduling and JIT compilation, so the resulting data has no reference value. All performance analysis must be based solely on the output of `msprof` (function-level) or `msprof op` (operator-level).

Evaluation Objectives:
- Identify performance bottleneck types (Memory-Bound vs Compute-Bound)
- Quantify hardware utilization
- Compare performance differences between different implementations
- Verify optimization effects
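Quantifying hardware utilization reduces to comparing a measured value against the hardware's theoretical peak. A minimal sketch (the bandwidth numbers below are hypothetical placeholders, not real Ascend figures; substitute values reported by `msprof` and your model's datasheet peak):

```python
def utilization_pct(achieved: float, theoretical_peak: float) -> float:
    """Return achieved/theoretical-peak as a percentage."""
    if theoretical_peak <= 0:
        raise ValueError("theoretical peak must be positive")
    return 100.0 * achieved / theoretical_peak

# Hypothetical numbers: 612 GB/s measured vs. an assumed 1800 GB/s peak.
bw_util = utilization_pct(achieved=612.0, theoretical_peak=1800.0)
print(f"memory bandwidth utilization: {bw_util:.1f}%")  # 34.0%
```

The same helper applies to Cube/Vector utilization when the profiler reports absolute throughput rather than a ready-made percentage.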
Performance Evaluation Workflow
Function-Level Performance Collection (Preferred)
Perform function-level performance analysis using `msprof`:
```bash
msprof --application="python my_script.py" --output=./profiling_result
```

Applicable Scenarios:
- Compare performance between multiple PyTorch operators vs fused Triton operators
- Analyze function-level performance bottlenecks
- Generate visualized performance reports
- Full-link performance analysis (Host + Device)
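The per-operator summary that `msprof` produces can be post-processed to rank operators by total device time, which is the usual starting point for a PyTorch-vs-Triton comparison. A minimal sketch, assuming `op_summary.csv` contains `Op Name` and `Task Duration(us)` columns (the actual header names depend on your CANN/msprof version; adjust accordingly):

```python
import csv
from collections import defaultdict

def rank_operators(csv_path: str) -> list[tuple[str, float]]:
    """Sum device time per operator and rank descending.

    Assumes 'Op Name' and 'Task Duration(us)' columns; adapt the keys
    to whatever headers your msprof version actually emits.
    """
    totals: dict[str, float] = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["Op Name"]] += float(row["Task Duration(us)"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

The top entries of the ranking tell you which kernels are worth the deeper `msprof op` analysis described below.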
Detailed Usage and Examples: Load `msprof-function-level.md`
Operator-Level Performance Collection (In-Depth Analysis)
Perform in-depth operator-level analysis using `msprof op`:
```bash
msprof op --kernel-name={jit_kernel_name} {application}
```

Applicable Scenarios:
- Analyze hardware utilization of a single Triton kernel
- Diagnose Cube/Vector unit performance
- Analyze UB cache and memory access patterns
- Locate hardware issues such as Bank Conflict
Detailed Usage and Examples: Load `msprof-op-level.md`
Performance Data Analysis
Key Metrics:
- Bottleneck type judgment (Memory-Bound vs Compute-Bound)
- Memory bandwidth utilization
- Computing unit utilization (Cube/Vector)
- UB conflict analysis
Detailed Analysis Methods: Load `performance-data-analysis.md`
Performance Summary Output
After completing the performance evaluation, you must output a structured performance summary following the template below:
Performance Evaluation Summary
Basic Information
| Item | Value |
|---|---|
| Operator Name | {kernel_name} |
| Input Size | {shape, dtype} |
| Test Hardware | {Ascend Model} |
| Measurement Method | {msprof / msprof op} |
Performance Metrics
| Metric | Value | Reference Value | Utilization |
|---|---|---|---|
| Execution Time | {X} us | - | - |
| Memory Bandwidth | {X} GB/s | {Theoretical Peak} GB/s | {X}% |
| Cube Utilization | - | - | {X}% |
| Vector Utilization | - | - | {X}% |
| L2 Cache Hit Rate | - | - | {X}% |
| Bank Conflict Ratio | - | - | {X}% |
Bottleneck Diagnosis
- Bottleneck Type: {Memory-Bound / Compute-Bound}
- Judgment Basis: {Analysis based on Arithmetic Intensity and hardware utilization data}
- Key Evidence: {Cite specific CSV data}
Performance Issue List
| Priority | Problem | Evidence | Optimization Direction |
|---|---|---|---|
| P0 | {Most Critical Issue} | {Data Source} | {Specific Suggestions} |
| P1 | ... | ... | ... |
Optimization Suggestions
- {Highest-priority optimization suggestion and expected benefit}
- ...
**Output Principles**:
- All conclusions must be supported by Profiling data, no subjective guesses allowed
- Metrics with utilization below 30% are marked as **Key Focus**
- Comparison scenarios must list both baseline and optimized results
Reference Resource Loading Guide
MANDATORY - Load On Demand: Load the corresponding reference documents based on task type

| Task Type | Must Load | Do Not Load |
|---|---|---|
| Function-Level Performance Operator Comparison | `msprof-function-level.md` | |
| Operator-Level Hardware Analysis | `msprof-op-level.md` | |
| Performance Bottleneck Diagnosis | `performance-data-analysis.md` | - |
| Understand Hardware Terminology | | - |
| Complete Performance Optimization Process | All references | - |
Performance Evaluation Checklist
Basic Checks
- Is msprof used for performance collection?
- Has warm-up been performed (to avoid first-time compilation overhead)?
- Have multiple measurements been taken to get statistical values?
- Has the NPU device been synchronized (`torch.npu.synchronize()`)?
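The "multiple measurements" checklist item can be supported with a small aggregation helper: take the per-run durations reported by `msprof`, drop the cold (JIT-compiled) first run, and report robust statistics rather than a single number. A sketch with hypothetical timings:

```python
import statistics

def summarize_times(times_us: list[float]) -> dict[str, float]:
    """Aggregate repeated kernel timings (e.g. per-run durations taken
    from msprof output) into robust summary statistics."""
    return {
        "median_us": statistics.median(times_us),
        "mean_us": statistics.fmean(times_us),
        "stdev_us": statistics.stdev(times_us) if len(times_us) > 1 else 0.0,
        "min_us": min(times_us),
    }

# Hypothetical durations: the first run includes compilation overhead,
# so it is excluded before aggregating.
runs = [812.4, 53.1, 52.8, 53.0, 52.9, 53.2]
steady = runs[1:]  # drop warm-up run
print(summarize_times(steady))
```

Prefer the median over the mean when quoting a headline number: it is insensitive to a single run perturbed by system scheduling.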
Profiler Checks
- Is the correct `--application` or `--kernel-name` specified?
- Has the appropriate `--aic-metrics` been selected?
- Have all key performance metrics been analyzed?
- Has the bottleneck type (Memory-Bound vs Compute-Bound) been identified?
Performance Metric Checks
- Is memory bandwidth utilization reasonable?
- Is computing unit utilization reasonable?
- Is there a high Bank Conflict rate?
- Is L2 Cache hit rate reasonable?
Anti-Pattern List (NEVER)
- NEVER use any non-`msprof`/`msprof op` method for timing or performance evaluation (including `time.time()`, `torch.npu.Event`, `triton.testing.do_bench`, custom timers, etc.): the data these methods collect lacks precision and cannot exclude interference from system scheduling and JIT compilation, and must never be used for any performance evaluation or optimization decision
- NEVER skip warm-up in performance tests (the first execution includes compilation overhead)
- NEVER draw conclusions based on only one test (multiple measurements are needed to get statistical values)
- NEVER include printing or logging in performance tests (I/O will seriously affect results)
- NEVER forget to synchronize the NPU device (`torch.npu.synchronize()`)
- NEVER compare performance results across different hardware environments
- NEVER confuse the `msprof` and `msprof op` commands (the former is function-level global analysis, the latter operator-level in-depth analysis)
- NEVER give optimization suggestions without profiling data support
Common Pitfalls and Notes
| Pitfall | Manifestation | Correct Approach |
|---|---|---|
| Using `msprof op` for operator comparison | Only a single kernel is visible; no comparison is possible | Use `msprof` for operator comparison |
| Incorrect `--kernel-name` | | Confirm the kernel name matches the Triton function definition |
| Failing to distinguish first-time compilation and steady-state performance | Abnormally high execution time in the first run | Collect data only after at least 5 warm-up runs |
| Testing performance with small-scale inputs | Startup overhead accounts for a large proportion, conclusions have no reference value | Use production-scale inputs for evaluation |
| Ignoring the impact of dtype on performance | Significant performance difference between FP16 and FP32 | Fix dtype for comparison and evaluate separately |
Performance Issues and Optimization Directions
Bottleneck Types and Optimization Strategies
| Bottleneck Type | Judgment Criteria | Core Optimization Direction |
|---|---|---|
| Memory-Bound | AI < hardware balance point; high bandwidth utilization, low computing utilization | Reduce data transfer volume, improve data reuse, optimize memory access patterns |
| Compute-Bound | AI > hardware balance point; high computing utilization, low bandwidth utilization | Optimize computation instruction efficiency, improve Cube/Vector utilization |
| Latency-Bound | Both bandwidth and computing utilization are low | Increase parallelism (Grid Size), reduce synchronization overhead |
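The decision rule in the table above can be sketched as a roofline-style check: compare the kernel's arithmetic intensity (FLOPs per byte moved) against the hardware balance point (peak compute divided by peak bandwidth), and flag the latency-bound case when neither unit is kept busy. The peak numbers and the 30% threshold below are illustrative assumptions, not Ascend specifications:

```python
def classify_bottleneck(flops: float, bytes_moved: float,
                        peak_tflops: float, peak_bw_gbs: float,
                        compute_util: float, bw_util: float) -> str:
    """Roofline-style classification mirroring the table above.

    Peak values are placeholders for your Ascend model's datasheet
    numbers; utilizations are fractions in [0, 1] from msprof op output.
    """
    ai = flops / bytes_moved                             # FLOP per byte
    balance = peak_tflops * 1e12 / (peak_bw_gbs * 1e9)   # ridge point, FLOP per byte
    if compute_util < 0.3 and bw_util < 0.3:
        return "Latency-Bound"    # neither compute nor bandwidth is saturated
    return "Memory-Bound" if ai < balance else "Compute-Bound"
```

For example, an element-wise fp16 add moves 6 bytes per FLOP (two loads, one store), giving AI ≈ 0.17, far below any realistic balance point, so it classifies as Memory-Bound whenever the bandwidth pipe is actually busy.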
Common Performance Problem Diagnosis
| Problem | Symptom | Diagnosis Data Source | Solution Direction |
|---|---|---|---|
| UB Overflow | Compilation error/runtime OOM | Check BLOCK_SIZE configuration | Reduce BLOCK_SIZE or split blocks within the kernel |
| Cube Miss | Performance is only 10% of theoretical value | ArithmeticUtilization.csv | Force BLOCK_M/N/K to be multiples of 16 |
| Precision Loss | Large deviation in FP16 results | Compare with PyTorch results | Use FP32 for accumulators |
| Non-Contiguous Memory Access | Bandwidth utilization is only 20% | Memory.csv | Adjust data layout to be contiguous |
| Low Parallelism | Low AI Core utilization | PipeUtilization.csv | Increase Grid Size |
| High Bank Conflict | Resource conflict rate > 10% | ResourceConflictRatio.csv | Adjust data block size and alignment method |
| Low L2 Cache Hit Rate | Frequent GM access | L2Cache.csv | Optimize Tiling strategy to improve data locality |
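The "UB Overflow" row can be pre-checked before compilation by estimating a tile's Unified Buffer footprint. The UB capacity and buffer count below are hypothetical parameters (UB size varies by Ascend model, and the compiler's actual allocation may differ), so treat this as a rough screen, not a guarantee:

```python
def ub_fits(block_elems: int, dtype_bytes: int,
            live_buffers: int, ub_capacity_bytes: int) -> bool:
    """Rough check that a tile's working set fits in the Unified Buffer.

    live_buffers counts simultaneously-live tiles (inputs, output,
    temporaries); ub_capacity_bytes is your Ascend model's UB size.
    """
    return block_elems * dtype_bytes * live_buffers <= ub_capacity_bytes

# Illustrative: BLOCK_SIZE=8192 fp16 elements, 3 live tiles, assumed 192 KiB UB.
print(ub_fits(8192, 2, 3, 192 * 1024))  # 49152 <= 196608 -> True
```

If the check fails, the table's remedy applies: reduce BLOCK_SIZE or split the block again inside the kernel.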
Optimization Direction Quick Reference
Memory-Bound Operator Optimization Path:
- Check memory access pattern → Ensure contiguous memory access (Memory.csv bandwidth utilization)
- Reduce data transfer → Fuse operators to reduce the number of GM reads/writes
- Improve data reuse → Optimize Tiling strategy to enable multiple uses of data in UB/L1
- Eliminate Bank Conflict → Adjust alignment method (ResourceConflictRatio.csv)
Compute-Bound Operator Optimization Path:
- Hit Cube units → Set BLOCK dimensions to multiples of 16 (ArithmeticUtilization.csv)
- Reduce type conversions → Avoid unnecessary upcast/downcast
- Pipeline optimization → Check Pipe utilization, balance computation and data transfer (PipeUtilization.csv)
- Vectorization → Ensure Vector operations fully utilize SIMD width
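The "multiples of 16" rule for engaging the Cube unit can be verified mechanically before launching a kernel. A trivial sketch (the fractal size of 16 follows the rule stated above; confirm it for your dtype and hardware generation):

```python
def cube_aligned(block_m: int, block_n: int, block_k: int, unit: int = 16) -> bool:
    """True when all matmul tile dimensions are multiples of the assumed
    Cube fractal size (16, per the optimization path above)."""
    return all(d % unit == 0 for d in (block_m, block_n, block_k))

print(cube_aligned(64, 128, 32))  # True
print(cube_aligned(64, 100, 32))  # False: 100 is not a multiple of 16
```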
Reference Resources
`msprof` vs `msprof op` Command Comparison
These two commands operate at completely different analysis levels; choosing the wrong one leads to invalid analysis:
| Dimension | `msprof` | `msprof op` |
|---|---|---|
| Command Format | `msprof --application="..." --output=...` | `msprof op --kernel-name=... {application}` |
| Analysis Granularity | All operators of the entire application | A specified single kernel |
| Core Output | op_summary.csv, timeline_trace.json, report.html | ArithmeticUtilization.csv, Memory.csv, PipeUtilization.csv, etc. |
| Provided Information | Per-operator execution-time ranking, Host/Device full-link timeline | Hardware utilization, memory bandwidth, Bank Conflict, and other microarchitecture metrics |
| Typical Usage | Compare overall performance of PyTorch vs Triton operators | Diagnose hardware bottlenecks of a single kernel |
| Required Parameter | `--application` | `--kernel-name` |
Selection Decision:
- Need to know "which operator is the slowest" → use `msprof`
- Need to know "why this kernel is slow" → use `msprof op`
- Complete optimization workflow: first use `msprof` to locate hotspots, then use `msprof op` for in-depth analysis
Reference Documents
| Document | Content | Associated Command |
|---|---|---|
| `msprof-function-level.md` | Function-level performance collection usage and output analysis | `msprof` |
| `msprof-op-level.md` | Operator-level in-depth analysis usage and hardware metrics | `msprof op` |
| `performance-data-analysis.md` | Detailed analysis methods for profiling data | |
| | Overview of the performance analysis toolchain and workflow | Both |
| | Ascend hardware terminology and architecture concepts | - |