# triton-operator-performance-eval
Evaluate the performance of Triton operators on Ascend NPU. It is used when users need to analyze operator performance bottlenecks, collect and compare operator performance using msprof/msprof op, diagnose Memory-Bound/Compute-Bound bottlenecks, measure hardware utilization metrics, and generate performance evaluation reports.
Source: ascend/agent-skills
Install: `npx skill4agent add ascend/agent-skills triton-operator-performance-eval`
# Triton Operator Performance Evaluation (Ascend NPU)
## Core Concepts

Performance Data Principle: Only trust measured data, never intuition or assumptions. You must use msprof profiling to locate the real performance hotspots before evaluation.
⚠️ The only reliable performance collection methods: `msprof` and `msprof op`.

Performance data collected through any non-msprof/msprof op method (including but not limited to Python `time.time()`, `torch.npu.Event` timing, `triton.testing.do_bench`, custom timing decorators, etc.) is absolutely unacceptable and must not be used for performance evaluation or optimization decisions. These methods lack precision and cannot eliminate interference from system scheduling and JIT compilation, so the resulting data has no reference value. All performance analysis must be based solely on output data from `msprof` (function-level) or `msprof op` (operator-level).

Evaluation Objectives:
- Identify performance bottleneck types (Memory-Bound vs Compute-Bound)
- Quantify hardware utilization
- Compare performance differences between different implementations
- Verify optimization effects
## Performance Evaluation Workflow
### Function-Level Performance Collection (Preferred)
Perform function-level performance analysis using msprof:
```bash
msprof --application="python my_script.py" --output=./profiling_result
```

Applicable Scenarios:
- Compare performance between multiple PyTorch operators vs fused Triton operators
- Analyze function-level performance bottlenecks
- Generate visualized performance reports
- Full-link performance analysis (Host + Device)
Detailed Usage and Examples: Load `msprof-function-level.md`

### Operator-Level Performance Collection (In-Depth Analysis)
Perform in-depth operator-level analysis using msprof op:
```bash
msprof op --kernel-name={jit_kernel_name} {application}
```

Applicable Scenarios:
- Analyze hardware utilization of a single Triton kernel
- Diagnose Cube/Vector unit performance
- Analyze UB cache and memory access patterns
- Locate hardware issues such as Bank Conflict
Detailed Usage and Examples: Load `msprof-op-level.md`

### Performance Data Analysis
Key Metrics:
- Bottleneck type judgment (Memory-Bound vs Compute-Bound)
- Memory bandwidth utilization
- Computing unit utilization (Cube/Vector)
- UB conflict analysis
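As a minimal sketch of how these metrics relate, arithmetic intensity and bandwidth utilization can be derived from msprof-measured execution time and the known transfer volume. The peak bandwidth value and tensor sizes below are placeholders for illustration, not real Ascend hardware specs.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved between global memory and the core."""
    return flops / bytes_moved

def bandwidth_utilization(bytes_moved: float, time_us: float, peak_gbps: float) -> float:
    """Achieved bandwidth (from msprof-measured time) as a fraction of peak."""
    achieved_gbps = bytes_moved / (time_us * 1e-6) / 1e9
    return achieved_gbps / peak_gbps

# Example: a 1024x1024 FP16 elementwise add (read A, read B, write C).
n = 1024 * 1024
bytes_moved = 3 * n * 2            # three FP16 tensors, 2 bytes each
flops = n                          # one add per element
ai = arithmetic_intensity(flops, bytes_moved)    # ~0.17 FLOP/byte: memory-bound territory
# time_us would come from msprof output; 10 us and 1000 GB/s peak are assumed values.
util = bandwidth_utilization(bytes_moved, time_us=10.0, peak_gbps=1000.0)
```

A very low arithmetic intensity like this is the typical signature of a Memory-Bound elementwise operator.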
Detailed Analysis Methods: Load `performance-data-analysis.md`

### Performance Summary Output
After completing the performance evaluation, you must output a structured performance summary according to the following template:
```markdown
## Performance Evaluation Summary

### Basic Information

| Item | Value |
|------|-----|
| Operator Name | {kernel_name} |
| Input Scale | {shape, dtype} |
| Test Hardware | {Ascend Model} |
| Measurement Method | {msprof / msprof op} |

### Performance Metrics

| Metric | Value | Reference Value | Utilization |
|------|-----|--------|--------|
| Execution Time | {X} us | - | - |
| Memory Bandwidth | {X} GB/s | {Theoretical Peak} GB/s | {X}% |
| Cube Utilization | - | - | {X}% |
| Vector Utilization | - | - | {X}% |
| L2 Cache Hit Rate | - | - | {X}% |
| Bank Conflict Ratio | - | - | {X}% |

### Bottleneck Diagnosis

- **Bottleneck Type**: {Memory-Bound / Compute-Bound}
- **Judgment Basis**: {Analysis based on Arithmetic Intensity and hardware utilization data}
- **Key Evidence**: {Cite specific CSV data}

### Performance Issue List

| Priority | Problem | Evidence | Optimization Direction |
|--------|------|------|----------|
| P0 | {Most Critical Issue} | {Data Source} | {Specific Suggestions} |
| P1 | ... | ... | ... |

### Optimization Suggestions

1. {Highest Priority Optimization Suggestion and Expected Benefit}
2. ...
```

Output Principles:
- All conclusions must be supported by Profiling data, no subjective guesses allowed
- Metrics with utilization below 30% are marked as Key Focus
- Both baseline and optimized results must be listed in comparison scenarios
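The 30% "Key Focus" rule above can be sketched as a small helper. The metric names and percentages are illustrative, not real msprof output:

```python
KEY_FOCUS_THRESHOLD = 30.0  # percent, per the output principles above

def flag_key_focus(metrics: dict) -> list:
    """Return the names of utilization metrics that fall below the
    Key Focus threshold and should be called out in the summary."""
    return [name for name, pct in metrics.items() if pct < KEY_FOCUS_THRESHOLD]

flags = flag_key_focus({
    "Memory Bandwidth Utilization": 72.5,
    "Cube Utilization": 12.0,       # below 30%, so marked Key Focus
    "Vector Utilization": 45.0,
})
# flags == ["Cube Utilization"]
```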
## Reference Resource Loading Guide
MANDATORY - Load On Demand: Load corresponding reference documents based on task type
| Task Type | Must Load | Do Not Load |
|---|---|---|
| Function-Level Performance Operator Comparison | `msprof-function-level.md` | `msprof-op-level.md` |
| Operator-Level Hardware Analysis | `msprof-op-level.md` | `msprof-function-level.md` |
| Performance Bottleneck Diagnosis | `performance-data-analysis.md` | - |
| Understand Hardware Terminology | hardware terminology reference (see Reference Documents) | - |
| Complete Performance Optimization Process | All references | - |
## Performance Evaluation Checklist

### Basic Checks
- Is msprof used for performance collection?
- Has warm-up been performed (to avoid first-time compilation overhead)?
- Have multiple measurements been taken to get statistical values?
- Has the NPU device been synchronized (`torch.npu.synchronize()`)?
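The warm-up item can be wrapped as a small helper that is run before launching msprof. This is a sketch; the callable is arbitrary, the iteration count follows the "at least 5 warm-up runs" guidance in the pitfalls table below, and the synchronize step assumes torch_npu is installed on the target machine:

```python
def warm_up(run_kernel, iters: int = 5) -> None:
    """Execute the kernel several times before msprof collection so that
    JIT compilation and first-touch overhead are excluded from the data."""
    for _ in range(iters):
        run_kernel()
    # On real hardware you would also call torch.npu.synchronize() here so
    # that all queued kernels finish before profiling starts (assumption:
    # torch_npu is available; synchronization is for correctness, not timing).

calls = []
warm_up(lambda: calls.append(1))
# len(calls) == 5
```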
### Profiler Checks
- Is the correct `--application` or `--kernel-name` specified?
- Has the appropriate `--aic-metrics` been selected?
- Have all key performance metrics been analyzed?
- Has the bottleneck type (Memory-Bound vs Compute-Bound) been identified?
### Performance Metric Checks
- Is memory bandwidth utilization reasonable?
- Is computing unit utilization reasonable?
- Is there a high Bank Conflict ratio?
- Is L2 Cache hit rate reasonable?
## Anti-Pattern List (NEVER)
- NEVER use any non-msprof/msprof op method for timing or performance evaluation (including `time.time()`, `torch.npu.Event`, `triton.testing.do_bench`, custom timers, etc.): the data collected by these methods lacks precision, cannot eliminate interference from system scheduling and JIT compilation, and must not be used for any performance evaluation or optimization decision
- NEVER skip warm-up in performance tests (the first execution includes compilation overhead)
- NEVER draw conclusions based on only one test (multiple measurements are needed to get statistical values)
- NEVER include printing or logging in performance tests (I/O will seriously affect results)
- NEVER forget to synchronize the NPU device (`torch.npu.synchronize()`)
- NEVER compare performance results across different hardware environments
- NEVER confuse the `msprof` and `msprof op` commands (the former is for function-level global analysis, the latter for operator-level in-depth analysis)
- NEVER give optimization suggestions without Profiling data support
## Common Pitfalls and Notes
| Pitfall | Manifestation | Correct Approach |
|---|---|---|
| Using `msprof op` to compare multiple operators | Only a single kernel can be seen, no comparison possible | Use `msprof` function-level collection |
| Incorrect `--kernel-name` | No data is collected for the target kernel | Confirm the kernel name matches the Triton function definition |
| Failing to distinguish first-time compilation and steady-state performance | Abnormally high execution time in the first run | Collect data only after at least 5 warm-up runs |
| Testing performance with small-scale inputs | Startup overhead accounts for a large proportion, conclusions have no reference value | Use production-scale inputs for evaluation |
| Ignoring the impact of dtype on performance | Significant performance difference between FP16 and FP32 | Fix dtype for comparison and evaluate separately |
## Performance Issues and Optimization Directions

### Bottleneck Types and Optimization Strategies
| Bottleneck Type | Judgment Criteria | Core Optimization Direction |
|---|---|---|
| Memory-Bound | AI < hardware balance point; high bandwidth utilization, low computing utilization | Reduce data transfer volume, improve data reuse, optimize memory access patterns |
| Compute-Bound | AI > hardware balance point; high computing utilization, low bandwidth utilization | Optimize computation instruction efficiency, improve Cube/Vector utilization |
| Latency-Bound | Both bandwidth and computing utilization are low | Increase parallelism (Grid Size), reduce synchronization overhead |
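The decision rules in the table above can be sketched as a classifier. The "low utilization" threshold and the placeholder peak FLOPs/bandwidth used to compute the hardware balance point are illustrative assumptions, not real Ascend specs:

```python
def classify_bottleneck(ai: float, balance_point: float,
                        bw_util: float, compute_util: float,
                        low: float = 0.30) -> str:
    """Apply the table's decision rules: both utilizations low -> Latency-Bound;
    otherwise compare arithmetic intensity (AI) against the hardware balance
    point. `low` is an illustrative threshold for "both utilizations are low"."""
    if bw_util < low and compute_util < low:
        return "Latency-Bound"
    if ai < balance_point:
        return "Memory-Bound"
    return "Compute-Bound"

# Balance point = peak FLOPs / peak bandwidth (placeholder numbers).
balance = 100e12 / 1000e9   # 100 FLOP/byte
kind = classify_bottleneck(ai=0.17, balance_point=balance,
                           bw_util=0.85, compute_util=0.05)
# kind == "Memory-Bound": high bandwidth utilization, low compute utilization, AI below balance
```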
### Common Performance Problem Diagnosis
| Problem | Symptom | Diagnosis Data Source | Solution Direction |
|---|---|---|---|
| UB Overflow | Compilation error/runtime OOM | Check BLOCK_SIZE configuration | Reduce BLOCK_SIZE or split blocks within the kernel |
| Cube Miss | Performance is only 10% of theoretical value | ArithmeticUtilization.csv | Force BLOCK_M/N/K to be multiples of 16 |
| Precision Loss | Large deviation in FP16 results | Compare with PyTorch results | Use FP32 for accumulators |
| Non-Contiguous Memory Access | Bandwidth utilization is only 20% | Memory.csv | Adjust data layout to be contiguous |
| Low Parallelism | Low AI Core utilization | PipeUtilization.csv | Increase Grid Size |
| High Bank Conflict | Resource conflict rate > 10% | ResourceConflictRatio.csv | Adjust data block size and alignment method |
| Low L2 Cache Hit Rate | Frequent GM access | L2Cache.csv | Optimize Tiling strategy to improve data locality |
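Checks like the ">10% resource conflict rate" rule can be automated over the msprof CSV output. This is a sketch only: the column names and kernel names below are hypothetical, not the real msprof schema, so adapt them to the actual CSV headers on your system.

```python
import csv
import io

# Hypothetical ResourceConflictRatio.csv content; real msprof columns may differ.
sample = io.StringIO(
    "kernel_name,conflict_ratio_pct\n"
    "triton_add_kernel,3.2\n"
    "triton_matmul_kernel,14.8\n"
)

BANK_CONFLICT_LIMIT = 10.0  # percent, per the diagnosis table above

# Flag kernels whose conflict ratio exceeds the limit.
offenders = [
    row["kernel_name"]
    for row in csv.DictReader(sample)
    if float(row["conflict_ratio_pct"]) > BANK_CONFLICT_LIMIT
]
# offenders == ["triton_matmul_kernel"]
```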
### Optimization Direction Quick Reference
Memory-Bound Operator Optimization Path:
- Check memory access pattern → Ensure contiguous memory access (Memory.csv bandwidth utilization)
- Reduce data transfer → Operator fusion to reduce GM read/write times
- Improve data reuse → Optimize Tiling strategy to enable multiple uses of data in UB/L1
- Eliminate Bank Conflict → Adjust alignment method (ResourceConflictRatio.csv)
Compute-Bound Operator Optimization Path:
- Hit Cube units → Set BLOCK dimensions to multiples of 16 (ArithmeticUtilization.csv)
- Reduce type conversions → Avoid unnecessary upcast/downcast
- Pipeline optimization → Check Pipe utilization, balance computation and data transfer (PipeUtilization.csv)
- Vectorization → Ensure Vector operations fully utilize SIMD width
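The "BLOCK dimensions as multiples of 16" step above can be sketched as a rounding helper; the helper name and the choice of 16 as the Cube alignment granularity follow the tables above and are used here for illustration:

```python
def align_block(dim: int, multiple: int = 16) -> int:
    """Round a BLOCK_M/N/K value up to a multiple of 16 so that matmul
    tiles map cleanly onto the Cube unit (alignment rule from the
    optimization paths above)."""
    return ((dim + multiple - 1) // multiple) * multiple

# e.g. align_block(100) == 112, align_block(128) == 128
```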
## Reference Resources
### msprof vs msprof op Command Comparison

These two commands operate at completely different analysis levels; choosing the wrong one will lead to invalid analysis:
| Dimension | `msprof` | `msprof op` |
|---|---|---|
| Command Format | `msprof --application="..." --output=...` | `msprof op --kernel-name={jit_kernel_name} {application}` |
| Analysis Granularity | All operators of the entire application | Specified single kernel |
| Core Output | op_summary.csv, timeline_trace.json, report.html | ArithmeticUtilization.csv, Memory.csv, PipeUtilization.csv, etc. |
| Provided Information | Operator execution time ranking, Host/Device full-link timeline | Hardware utilization, memory bandwidth, Bank Conflict and other microarchitecture metrics |
| Typical Usage | Compare overall performance of PyTorch vs Triton operators | Diagnose hardware bottlenecks of a single kernel |
| Required Parameter | `--application` | `--kernel-name` |
Selection Decision:
- Need to know "which operator is the slowest" → Use `msprof`
- Need to know "why this kernel is slow" → Use `msprof op`
- Complete optimization process: First use `msprof` to locate hotspots, then use `msprof op` for in-depth analysis
### Reference Documents
| Document | Content | Associated Command |
|---|---|---|
| `msprof-function-level.md` | Function-level performance collection usage and output analysis | `msprof` |
| `msprof-op-level.md` | Operator-level in-depth analysis usage and hardware metrics | `msprof op` |
| `performance-data-analysis.md` | Detailed analysis methods for performance data | Both |
| | Overview of performance analysis toolchain and workflow | Both |
| | Ascend hardware terminology and architecture concepts | - |