triton-operator-performance-eval


Evaluate the performance of Triton operators on Ascend NPU. It is used when users need to analyze operator performance bottlenecks, collect and compare operator performance using msprof/msprof op, diagnose Memory-Bound/Compute-Bound bottlenecks, measure hardware utilization metrics, and generate performance evaluation reports.


NPX Install

```bash
npx skill4agent add ascend/agent-skills triton-operator-performance-eval
```

SKILL.md Content (translated from Chinese)


Triton Operator Performance Evaluation (Ascend NPU)

Core Concepts

Performance Data Principle: Trust only measured data, never intuition or assumptions. You must use profiling to locate the real performance hotspots before evaluating.
⚠️ The only reliable performance collection methods are `msprof` and `msprof op`. Performance data collected by any other means (including but not limited to Python `time.time()`, `torch.npu.Event` timing, `triton.testing.do_bench`, and custom timing decorators) is unacceptable and must not be used for performance evaluation or optimization decisions. These methods lack precision and cannot eliminate interference from system scheduling and JIT compilation, so the resulting numbers have no reference value. All performance analysis must be based solely on the output of `msprof` (function-level) or `msprof op` (operator-level).
Evaluation Objectives:
  • Identify performance bottleneck types (Memory-Bound vs Compute-Bound)
  • Quantify hardware utilization
  • Compare performance differences between different implementations
  • Verify optimization effects

Performance Evaluation Workflow

Function-Level Performance Collection (Preferred)

Perform function-level performance analysis using `msprof`:

```bash
msprof --application="python my_script.py" --output=./profiling_result
```
Applicable Scenarios:
  • Compare performance between multiple PyTorch operators vs fused Triton operators
  • Analyze function-level performance bottlenecks
  • Generate visualized performance reports
  • Full-link performance analysis (Host + Device)
Detailed Usage and Examples: Load `msprof-function-level.md`
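A function-level run produces an operator summary CSV (op_summary.csv) that can be ranked to find hotspots. The sketch below is a minimal, hypothetical post-processing helper: the column names `Op Name` and `Task Duration(us)` are assumptions, not guaranteed msprof output — check the header row of your actual CSV and adjust.

```python
import csv
from io import StringIO

def top_operators(csv_text: str, n: int = 5):
    """Rank operators by total duration from an op_summary-style CSV.

    The column names ("Op Name", "Task Duration(us)") are assumptions;
    verify them against the header of your actual msprof output.
    """
    totals = {}
    for row in csv.DictReader(StringIO(csv_text)):
        name = row["Op Name"]
        totals[name] = totals.get(name, 0.0) + float(row["Task Duration(us)"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Illustrative data only -- not real msprof output.
sample = """Op Name,Task Duration(us)
triton_add_kernel,120.5
aclnnMatmul,80.2
triton_add_kernel,118.3
"""
print(top_operators(sample))
```

Ranking by accumulated duration (rather than a single call) avoids being misled by operators that are individually fast but invoked many times.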

Operator-Level Performance Collection (In-Depth Analysis)

Perform in-depth operator-level analysis using `msprof op`:

```bash
msprof op --kernel-name={jit_kernel_name} {application}
```
Applicable Scenarios:
  • Analyze hardware utilization of a single Triton kernel
  • Diagnose Cube/Vector unit performance
  • Analyze UB cache and memory access patterns
  • Locate hardware issues such as Bank Conflict
Detailed Usage and Examples: Load `msprof-op-level.md`

Performance Data Analysis

Key Metrics:
  • Bottleneck type judgment (Memory-Bound vs Compute-Bound)
  • Memory bandwidth utilization
  • Computing unit utilization (Cube/Vector)
  • UB conflict analysis
Detailed Analysis Methods: Load `performance-data-analysis.md`
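The Memory-Bound vs Compute-Bound judgment hinges on arithmetic intensity (AI = FLOPs per byte moved) relative to the hardware balance point (peak FLOPS / peak bandwidth). A minimal sketch of that roofline check follows; the peak numbers are made up for illustration and are NOT real Ascend specifications — substitute the values for your actual device.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte transferred between global memory and the core."""
    return flops / bytes_moved

def balance_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Hardware balance point (roofline ridge point) in FLOPs/byte."""
    return peak_flops / peak_bandwidth

# Hypothetical peaks for illustration only (NOT real Ascend specs):
PEAK_FLOPS = 256e12   # 256 TFLOPS
PEAK_BW = 1.2e12      # 1.2 TB/s

# Example: elementwise add of two FP16 tensors.
# Per element: 1 FLOP; 2 loads + 1 store at 2 bytes each = 6 bytes moved.
ai = arithmetic_intensity(flops=1, bytes_moved=6)
bound = "Memory-Bound" if ai < balance_point(PEAK_FLOPS, PEAK_BW) else "Compute-Bound"
print(f"AI={ai:.3f} FLOPs/byte -> {bound}")
```

An elementwise add sits far below any realistic balance point, which is why fusing such operators (reducing bytes moved) matters more than speeding up their arithmetic.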

Performance Summary Output

After completing the performance evaluation, you must output a structured performance summary according to the following template:
```markdown
## Performance Evaluation Summary

### Basic Information
| Item | Value |
|------|-------|
| Operator Name | {kernel_name} |
| Input Scale | {shape, dtype} |
| Test Hardware | {Ascend Model} |
| Measurement Method | {msprof / msprof op} |

### Performance Metrics
| Metric | Value | Reference Value | Utilization |
|--------|-------|-----------------|-------------|
| Execution Time | {X} us | - | - |
| Memory Bandwidth | {X} GB/s | {Theoretical Peak} GB/s | {X}% |
| Cube Utilization | - | - | {X}% |
| Vector Utilization | - | - | {X}% |
| L2 Cache Hit Rate | - | - | {X}% |
| Bank Conflict Ratio | - | - | {X}% |

### Bottleneck Diagnosis
- **Bottleneck Type**: {Memory-Bound / Compute-Bound}
- **Judgment Basis**: {Analysis based on Arithmetic Intensity and hardware utilization data}
- **Key Evidence**: {Cite specific CSV data}

### Performance Issue List
| Priority | Problem | Evidence | Optimization Direction |
|----------|---------|----------|------------------------|
| P0 | {Most Critical Issue} | {Data Source} | {Specific Suggestions} |
| P1 | ... | ... | ... |

### Optimization Suggestions
1. {Highest Priority Optimization Suggestion and Expected Benefit}
2. ...
```
Output Principles:
  • All conclusions must be supported by Profiling data, no subjective guesses allowed
  • Metrics with utilization below 30% are marked as Key Focus
  • Both baseline and optimized results must be listed in comparison scenarios
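The "utilization below 30% is a Key Focus" rule above is mechanical enough to automate when assembling the summary. A minimal sketch, assuming you have already extracted the utilization percentages from the msprof CSVs into a dict:

```python
def flag_key_focus(metrics, threshold=30.0):
    """Return the metric names whose utilization (%) falls below the threshold.

    Implements the output principle above: utilization under 30% is
    marked as a Key Focus item in the performance summary.
    """
    return [name for name, pct in metrics.items() if pct < threshold]

# Illustrative values, not real measurements.
report = {"Memory Bandwidth": 72.4, "Cube Utilization": 11.8, "Vector Utilization": 45.0}
print(flag_key_focus(report))
```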

Reference Resource Loading Guide

MANDATORY - Load On Demand: Load the corresponding reference documents based on task type.

| Task Type | Must Load | Do Not Load |
|-----------|-----------|-------------|
| Function-level performance / operator comparison | `msprof-function-level.md` | `msprof-op-level.md` |
| Operator-level hardware analysis | `msprof-op-level.md`, `performance-data-analysis.md` | `msprof-function-level.md` |
| Performance bottleneck diagnosis | `performance-data-analysis.md` | - |
| Understand hardware terminology | `ascend-terminology.md` | - |
| Complete performance optimization process | All references | - |

Performance Evaluation Checklist

Basic Checks

  • Is msprof used for performance collection?
  • Has warm-up been performed (to avoid first-time compilation overhead)?
  • Have multiple measurements been taken to get statistical values?
  • Has the NPU device been synchronized (`torch.npu.synchronize()`)?

Profiler Checks

  • Is the correct `--application` or `--kernel-name` specified?
  • Has the appropriate `--aic-metrics` option been selected?
  • Have all key performance metrics been analyzed?
  • Has the bottleneck type (Memory-Bound vs Compute-Bound) been identified?

Performance Metric Checks

  • Is memory bandwidth utilization reasonable?
  • Is computing unit utilization reasonable?
  • Does high Bank Conflict exist?
  • Is L2 Cache hit rate reasonable?

Anti-Pattern List (NEVER)

  • NEVER use any non-msprof/msprof op method for timing or performance evaluation (including `time.time()`, `torch.npu.Event`, `triton.testing.do_bench`, and custom timers). Data collected this way lacks precision and cannot eliminate interference from system scheduling and JIT compilation; it must not be used for any performance evaluation or optimization decision
  • NEVER skip warm-up in performance tests (the first execution includes compilation overhead)
  • NEVER draw conclusions based on only one test (multiple measurements are needed to get statistical values)
  • NEVER include printing or logging in performance tests (I/O will seriously affect results)
  • NEVER forget to synchronize the NPU device (`torch.npu.synchronize()`)
  • NEVER compare performance results across different hardware environments
  • NEVER confuse the `msprof` and `msprof op` commands (the former is for function-level global analysis, the latter for operator-level in-depth analysis)
  • NEVER give optimization suggestions without Profiling data support

Common Pitfalls and Notes

| Pitfall | Manifestation | Correct Approach |
|---------|---------------|------------------|
| Using `msprof op` for operator comparison | Only a single kernel is visible; no comparison possible | Use `msprof` for operator comparison, `msprof op` for in-depth analysis |
| Misspelled `--kernel-name` | `msprof op` completes silently but produces no data | Confirm the kernel name matches the Triton function definition |
| Not distinguishing first-run compilation from steady-state performance | Abnormally high execution time on the first run | Collect data only after at least 5 warm-up runs |
| Testing with small-scale inputs | Startup overhead dominates; conclusions have no reference value | Use production-scale inputs for evaluation |
| Ignoring the impact of dtype on performance | Significant performance difference between FP16 and FP32 | Fix the dtype for comparisons and evaluate each dtype separately |

Performance Issues and Optimization Directions

Bottleneck Types and Optimization Strategies

| Bottleneck Type | Judgment Criteria | Core Optimization Direction |
|-----------------|-------------------|-----------------------------|
| Memory-Bound | AI < hardware balance point; high bandwidth utilization, low compute utilization | Reduce data transfer volume, improve data reuse, optimize memory access patterns |
| Compute-Bound | AI > hardware balance point; high compute utilization, low bandwidth utilization | Optimize compute instruction efficiency, improve Cube/Vector utilization |
| Latency-Bound | Both bandwidth and compute utilization are low | Increase parallelism (Grid Size), reduce synchronization overhead |
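The utilization-based side of this triage can be sketched as a small decision function. The 60%/30% thresholds below are illustrative assumptions, not official Ascend guidance — calibrate them against your hardware's measured roofline before relying on the verdict.

```python
def classify_bottleneck(bandwidth_util_pct, compute_util_pct,
                        high=60.0, low=30.0):
    """Rough bottleneck triage from utilization percentages (0-100).

    Thresholds (high=60%, low=30%) are illustrative placeholders;
    a confident diagnosis still requires checking arithmetic intensity
    against the hardware balance point, as the table above states.
    """
    if bandwidth_util_pct >= high and compute_util_pct < low:
        return "Memory-Bound"
    if compute_util_pct >= high and bandwidth_util_pct < low:
        return "Compute-Bound"
    if bandwidth_util_pct < low and compute_util_pct < low:
        return "Latency-Bound"
    return "Mixed / inconclusive"

print(classify_bottleneck(85.0, 12.0))
```

A "Mixed / inconclusive" result is a prompt to dig into the per-pipe CSVs rather than force a label onto the kernel.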

Common Performance Problem Diagnosis

| Problem | Symptom | Diagnosis Data Source | Solution Direction |
|---------|---------|-----------------------|--------------------|
| UB overflow | Compilation error / runtime OOM | Check BLOCK_SIZE configuration | Reduce BLOCK_SIZE or split blocks within the kernel |
| Cube miss | Performance is only ~10% of the theoretical value | ArithmeticUtilization.csv | Force BLOCK_M/N/K to be multiples of 16 |
| Precision loss | Large deviation in FP16 results | Compare with PyTorch results | Use FP32 accumulators |
| Non-contiguous memory access | Bandwidth utilization is only ~20% | Memory.csv | Adjust the data layout to be contiguous |
| Low parallelism | Low AI Core utilization | PipeUtilization.csv | Increase Grid Size |
| High Bank Conflict | Resource conflict rate > 10% | ResourceConflictRatio.csv | Adjust data block size and alignment |
| Low L2 Cache hit rate | Frequent GM access | L2Cache.csv | Optimize the tiling strategy to improve data locality |

Optimization Direction Quick Reference

Memory-Bound Operator Optimization Path:
  1. Check memory access pattern → Ensure contiguous memory access (Memory.csv bandwidth utilization)
  2. Reduce data transfer → Operator fusion to reduce GM read/write times
  3. Improve data reuse → Optimize Tiling strategy to enable multiple uses of data in UB/L1
  4. Eliminate Bank Conflict → Adjust alignment method (ResourceConflictRatio.csv)
Compute-Bound Operator Optimization Path:
  1. Hit Cube units → Set BLOCK dimensions to multiples of 16 (ArithmeticUtilization.csv)
  2. Reduce type conversions → Avoid unnecessary upcast/downcast
  3. Pipeline optimization → Check Pipe utilization, balance computation and data transfer (PipeUtilization.csv)
  4. Vectorization → Ensure Vector operations fully utilize SIMD width

Reference Resources

`msprof` vs `msprof op` Command Comparison

These two commands operate at completely different analysis levels; choosing the wrong one leads to invalid analysis:

| Dimension | `msprof` (Function-Level) | `msprof op` (Operator-Level) |
|-----------|---------------------------|------------------------------|
| Command format | `msprof --application="python x.py"` | `msprof op --kernel-name=K python x.py` |
| Analysis granularity | All operators of the entire application | A single specified kernel |
| Core output | op_summary.csv, timeline_trace.json, report.html | ArithmeticUtilization.csv, Memory.csv, PipeUtilization.csv, etc. |
| Information provided | Operator execution-time ranking; Host/Device full-link timeline | Hardware utilization, memory bandwidth, Bank Conflict, and other microarchitecture metrics |
| Typical usage | Compare overall performance of PyTorch vs Triton operators | Diagnose hardware bottlenecks of a single kernel |
| Required parameter | `--application` | `--kernel-name` |
Selection Decision:
  • Need to know "which operator is the slowest" → use `msprof`
  • Need to know "why this kernel is slow" → use `msprof op`
  • Complete optimization process: first use `msprof` to locate hotspots, then use `msprof op` for in-depth analysis

Reference Documents

| Document | Content | Associated Command |
|----------|---------|--------------------|
| `msprof-function-level.md` | Function-level performance collection usage and output analysis | `msprof` |
| `msprof-op-level.md` | Operator-level in-depth analysis usage and hardware metrics | `msprof op` |
| `performance-data-analysis.md` | Detailed analysis methods for `msprof op` output CSVs | `msprof op` |
| `profiling-tools.md` | Overview of the performance analysis toolchain and workflow | Both |
| `ascend-terminology.md` | Ascend hardware terminology and architecture concepts | - |
