# triton-operator-performance-eval
Evaluate the performance of Triton operators on Ascend NPU. It is used when users need to analyze operator performance bottlenecks, collect and compare operator performance using msprof/msprof op, diagnose Memory-Bound/Compute-Bound bottlenecks, measure hardware utilization metrics, and generate performance evaluation reports.
Source: ascend/agent-skills
Install: `npx skill4agent add ascend/agent-skills triton-operator-performance-eval`
# Triton Operator Performance Evaluation (Ascend NPU)
## Core Concepts

Performance Data Principle: Only trust measured data, never intuition or assumptions. You must use msprof profiling to locate the real performance hotspots before evaluation.
⚠️ The only reliable performance collection methods: `msprof` and `msprof op`.

Performance data collected through any non-msprof/msprof op method (including but not limited to Python `time.time()`, `torch.npu.Event` timing, `triton.testing.do_bench`, custom timing decorators, etc.) is absolutely unacceptable and must not be used for performance evaluation or optimization decisions. These methods lack precision and cannot eliminate interference from system scheduling and JIT compilation, so the resulting data has no reference value. All performance analysis must be based solely on output data from `msprof` (function-level) or `msprof op` (operator-level).

Evaluation Objectives:
- Identify performance bottleneck types (Memory-Bound vs Compute-Bound)
- Quantify hardware utilization
- Compare performance differences between different implementations
- Verify optimization effects
## Performance Evaluation Workflow
### Function-Level Performance Collection (Preferred)
Perform function-level performance analysis using msprof:
```bash
msprof --application="python my_script.py" --output=./profiling_result
```

Applicable Scenarios:
- Compare performance between multiple PyTorch operators vs fused Triton operators
- Analyze function-level performance bottlenecks
- Generate visualized performance reports
- Full-link performance analysis (Host + Device)
Detailed Usage and Examples: Load `msprof-function-level.md`

### Operator-Level Performance Collection (In-Depth Analysis)
Perform in-depth operator-level analysis using msprof op:
```bash
msprof op --kernel-name={jit_kernel_name} {application}
```

Applicable Scenarios:
- Analyze hardware utilization of a single Triton kernel
- Diagnose Cube/Vector unit performance
- Analyze UB cache and memory access patterns
- Locate hardware issues such as Bank Conflict
Detailed Usage and Examples: Load `msprof-op-level.md`

### Performance Data Analysis
Key Metrics:
- Bottleneck type judgment (Memory-Bound vs Compute-Bound)
- Memory bandwidth utilization
- Computing unit utilization (Cube/Vector)
- UB conflict analysis
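As a minimal sketch of how these metrics relate, arithmetic intensity and bandwidth utilization can be derived from msprof-measured execution time and the known transfer volume. The peak bandwidth value and tensor sizes below are placeholders for illustration, not real Ascend hardware specs.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved between global memory and the core."""
    return flops / bytes_moved

def bandwidth_utilization(bytes_moved: float, time_us: float, peak_gbps: float) -> float:
    """Achieved bandwidth (from msprof-measured time) as a fraction of peak."""
    achieved_gbps = bytes_moved / (time_us * 1e-6) / 1e9
    return achieved_gbps / peak_gbps

# Example: a 1024x1024 FP16 elementwise add (read A, read B, write C).
n = 1024 * 1024
bytes_moved = 3 * n * 2            # three FP16 tensors, 2 bytes each
flops = n                          # one add per element
ai = arithmetic_intensity(flops, bytes_moved)    # ~0.17 FLOP/byte: memory-bound territory
# time_us would come from msprof output; 10 us and 1000 GB/s peak are assumed values.
util = bandwidth_utilization(bytes_moved, time_us=10.0, peak_gbps=1000.0)
```

A very low arithmetic intensity like this is the typical signature of a Memory-Bound elementwise operator.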
Detailed Analysis Methods: Load `performance-data-analysis.md`

### Performance Summary Output
After completing the performance evaluation, you must output a structured performance summary according to the following template:
```markdown
## Performance Evaluation Summary

### Basic Information

| Item | Value |
|------|-----|
| Operator Name | {kernel_name} |
| Input Scale | {shape, dtype} |
| Test Hardware | {Ascend Model} |
| Measurement Method | {msprof / msprof op} |

### Performance Metrics

| Metric | Value | Reference Value | Utilization |
|------|-----|--------|--------|
| Execution Time | {X} us | - | - |
| Memory Bandwidth | {X} GB/s | {Theoretical Peak} GB/s | {X}% |
| Cube Utilization | - | - | {X}% |
| Vector Utilization | - | - | {X}% |
| L2 Cache Hit Rate | - | - | {X}% |
| Bank Conflict Ratio | - | - | {X}% |

### Bottleneck Diagnosis

- **Bottleneck Type**: {Memory-Bound / Compute-Bound}
- **Judgment Basis**: {Analysis based on Arithmetic Intensity and hardware utilization data}
- **Key Evidence**: {Cite specific CSV data}

### Performance Issue List

| Priority | Problem | Evidence | Optimization Direction |
|--------|------|------|----------|
| P0 | {Most Critical Issue} | {Data Source} | {Specific Suggestions} |
| P1 | ... | ... | ... |

### Optimization Suggestions

1. {Highest Priority Optimization Suggestion and Expected Benefit}
2. ...
```

Output Principles:
- All conclusions must be supported by Profiling data, no subjective guesses allowed
- Metrics with utilization below 30% are marked as Key Focus
- Both baseline and optimized results must be listed in comparison scenarios
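The 30% "Key Focus" rule above can be sketched as a small helper. The metric names and percentages are illustrative, not real msprof output:

```python
KEY_FOCUS_THRESHOLD = 30.0  # percent, per the output principles above

def flag_key_focus(metrics: dict) -> list:
    """Return the names of utilization metrics that fall below the
    Key Focus threshold and should be called out in the summary."""
    return [name for name, pct in metrics.items() if pct < KEY_FOCUS_THRESHOLD]

flags = flag_key_focus({
    "Memory Bandwidth Utilization": 72.5,
    "Cube Utilization": 12.0,       # below 30%, so marked Key Focus
    "Vector Utilization": 45.0,
})
# flags == ["Cube Utilization"]
```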
## Reference Resource Loading Guide
MANDATORY - Load On Demand: Load corresponding reference documents based on task type
| Task Type | Must Load | Do Not Load |
|---|---|---|
| Function-Level Performance Operator Comparison | `msprof-function-level.md` | `msprof-op-level.md` |
| Operator-Level Hardware Analysis | `msprof-op-level.md` | `msprof-function-level.md` |
| Performance Bottleneck Diagnosis | `performance-data-analysis.md` | - |
| Understand Hardware Terminology | hardware terminology reference (see Reference Documents) | - |
| Complete Performance Optimization Process | All references | - |
## Performance Evaluation Checklist

### Basic Checks
- Is msprof used for performance collection?
- Has warm-up been performed (to avoid first-time compilation overhead)?
- Have multiple measurements been taken to get statistical values?
- Has the NPU device been synchronized (`torch.npu.synchronize()`)?
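The warm-up item can be wrapped as a small helper that is run before launching msprof. This is a sketch; the callable is arbitrary, the iteration count follows the "at least 5 warm-up runs" guidance in the pitfalls table below, and the synchronize step assumes torch_npu is installed on the target machine:

```python
def warm_up(run_kernel, iters: int = 5) -> None:
    """Execute the kernel several times before msprof collection so that
    JIT compilation and first-touch overhead are excluded from the data."""
    for _ in range(iters):
        run_kernel()
    # On real hardware you would also call torch.npu.synchronize() here so
    # that all queued kernels finish before profiling starts (assumption:
    # torch_npu is available; synchronization is for correctness, not timing).

calls = []
warm_up(lambda: calls.append(1))
# len(calls) == 5
```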
### Profiler Checks
- Is the correct `--application` or `--kernel-name` specified?
- Has the appropriate `--aic-metrics` been selected?
- Have all key performance metrics been analyzed?
- Has the bottleneck type (Memory-Bound vs Compute-Bound) been identified?
### Performance Metric Checks
- Is memory bandwidth utilization reasonable?
- Is computing unit utilization reasonable?
- Is there a high Bank Conflict ratio?
- Is L2 Cache hit rate reasonable?
## Anti-Pattern List (NEVER)
- NEVER use any non-msprof/msprof op method for timing or performance evaluation (including `time.time()`, `torch.npu.Event`, `triton.testing.do_bench`, custom timers, etc.): the data collected by these methods lacks precision, cannot eliminate interference from system scheduling and JIT compilation, and must not be used for any performance evaluation or optimization decision
- NEVER skip warm-up in performance tests (the first execution includes compilation overhead)
- NEVER draw conclusions based on only one test (multiple measurements are needed to get statistical values)
- NEVER include printing or logging in performance tests (I/O will seriously affect results)
- NEVER forget to synchronize the NPU device (`torch.npu.synchronize()`)
- NEVER compare performance results across different hardware environments
- NEVER confuse the `msprof` and `msprof op` commands (the former is for function-level global analysis, the latter for operator-level in-depth analysis)
- NEVER give optimization suggestions without Profiling data support
## Common Pitfalls and Notes
| Pitfall | Manifestation | Correct Approach |
|---|---|---|
| Using `msprof op` to compare multiple operators | Only a single kernel can be seen, no comparison possible | Use `msprof` function-level collection |
| Incorrect `--kernel-name` | No data is collected for the target kernel | Confirm the kernel name matches the Triton function definition |
| Failing to distinguish first-time compilation and steady-state performance | Abnormally high execution time in the first run | Collect data only after at least 5 warm-up runs |
| Testing performance with small-scale inputs | Startup overhead accounts for a large proportion, conclusions have no reference value | Use production-scale inputs for evaluation |
| Ignoring the impact of dtype on performance | Significant performance difference between FP16 and FP32 | Fix dtype for comparison and evaluate separately |
## Performance Issues and Optimization Directions

### Bottleneck Types and Optimization Strategies
| Bottleneck Type | Judgment Criteria | Core Optimization Direction |
|---|---|---|
| Memory-Bound | AI < hardware balance point; high bandwidth utilization, low computing utilization | Reduce data transfer volume, improve data reuse, optimize memory access patterns |
| Compute-Bound | AI > hardware balance point; high computing utilization, low bandwidth utilization | Optimize computation instruction efficiency, improve Cube/Vector utilization |
| Latency-Bound | Both bandwidth and computing utilization are low | Increase parallelism (Grid Size), reduce synchronization overhead |
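The decision rules in the table above can be sketched as a classifier. The "low utilization" threshold and the placeholder peak FLOPs/bandwidth used to compute the hardware balance point are illustrative assumptions, not real Ascend specs:

```python
def classify_bottleneck(ai: float, balance_point: float,
                        bw_util: float, compute_util: float,
                        low: float = 0.30) -> str:
    """Apply the table's decision rules: both utilizations low -> Latency-Bound;
    otherwise compare arithmetic intensity (AI) against the hardware balance
    point. `low` is an illustrative threshold for "both utilizations are low"."""
    if bw_util < low and compute_util < low:
        return "Latency-Bound"
    if ai < balance_point:
        return "Memory-Bound"
    return "Compute-Bound"

# Balance point = peak FLOPs / peak bandwidth (placeholder numbers).
balance = 100e12 / 1000e9   # 100 FLOP/byte
kind = classify_bottleneck(ai=0.17, balance_point=balance,
                           bw_util=0.85, compute_util=0.05)
# kind == "Memory-Bound": high bandwidth utilization, low compute utilization, AI below balance
```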
### Common Performance Problem Diagnosis
| Problem | Symptom | Diagnosis Data Source | Solution Direction |
|---|---|---|---|
| UB Overflow | Compilation error/runtime OOM | Check BLOCK_SIZE configuration | Reduce BLOCK_SIZE or split blocks within the kernel |
| Cube Miss | Performance is only 10% of theoretical value | ArithmeticUtilization.csv | Force BLOCK_M/N/K to be multiples of 16 |
| Precision Loss | Large deviation in FP16 results | Compare with PyTorch results | Use FP32 for accumulators |
| Non-Contiguous Memory Access | Bandwidth utilization is only 20% | Memory.csv | Adjust data layout to be contiguous |
| Low Parallelism | Low AI Core utilization | PipeUtilization.csv | Increase Grid Size |
| High Bank Conflict | Resource conflict rate > 10% | ResourceConflictRatio.csv | Adjust data block size and alignment method |
| Low L2 Cache Hit Rate | Frequent GM access | L2Cache.csv | Optimize Tiling strategy to improve data locality |
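Checks like the ">10% resource conflict rate" rule can be automated over the msprof CSV output. This is a sketch only: the column names and kernel names below are hypothetical, not the real msprof schema, so adapt them to the actual CSV headers on your system.

```python
import csv
import io

# Hypothetical ResourceConflictRatio.csv content; real msprof columns may differ.
sample = io.StringIO(
    "kernel_name,conflict_ratio_pct\n"
    "triton_add_kernel,3.2\n"
    "triton_matmul_kernel,14.8\n"
)

BANK_CONFLICT_LIMIT = 10.0  # percent, per the diagnosis table above

# Flag kernels whose conflict ratio exceeds the limit.
offenders = [
    row["kernel_name"]
    for row in csv.DictReader(sample)
    if float(row["conflict_ratio_pct"]) > BANK_CONFLICT_LIMIT
]
# offenders == ["triton_matmul_kernel"]
```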
### Optimization Direction Quick Reference
Memory-Bound Operator Optimization Path:
- Check memory access pattern → Ensure contiguous memory access (Memory.csv bandwidth utilization)
- Reduce data transfer → Operator fusion to reduce GM read/write times
- Improve data reuse → Optimize Tiling strategy to enable multiple uses of data in UB/L1
- Eliminate Bank Conflict → Adjust alignment method (ResourceConflictRatio.csv)
Compute-Bound Operator Optimization Path:
- Hit Cube units → Set BLOCK dimensions to multiples of 16 (ArithmeticUtilization.csv)
- Reduce type conversions → Avoid unnecessary upcast/downcast
- Pipeline optimization → Check Pipe utilization, balance computation and data transfer (PipeUtilization.csv)
- Vectorization → Ensure Vector operations fully utilize SIMD width
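The "BLOCK dimensions as multiples of 16" step above can be sketched as a rounding helper; the helper name and the choice of 16 as the Cube alignment granularity follow the tables above and are used here for illustration:

```python
def align_block(dim: int, multiple: int = 16) -> int:
    """Round a BLOCK_M/N/K value up to a multiple of 16 so that matmul
    tiles map cleanly onto the Cube unit (alignment rule from the
    optimization paths above)."""
    return ((dim + multiple - 1) // multiple) * multiple

# e.g. align_block(100) == 112, align_block(128) == 128
```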
## Reference Resources
### msprof vs msprof op Command Comparison

These two commands operate at completely different analysis levels; choosing the wrong one will lead to invalid analysis:
| Dimension | `msprof` | `msprof op` |
|---|---|---|
| Command Format | `msprof --application="..." --output=...` | `msprof op --kernel-name={jit_kernel_name} {application}` |
| Analysis Granularity | All operators of the entire application | Specified single kernel |
| Core Output | op_summary.csv, timeline_trace.json, report.html | ArithmeticUtilization.csv, Memory.csv, PipeUtilization.csv, etc. |
| Provided Information | Operator execution time ranking, Host/Device full-link timeline | Hardware utilization, memory bandwidth, Bank Conflict and other microarchitecture metrics |
| Typical Usage | Compare overall performance of PyTorch vs Triton operators | Diagnose hardware bottlenecks of a single kernel |
| Required Parameter | `--application` | `--kernel-name` |
Selection Decision:
- Need to know "which operator is the slowest" → Use `msprof`
- Need to know "why this kernel is slow" → Use `msprof op`
- Complete optimization process: First use `msprof` to locate hotspots, then use `msprof op` for in-depth analysis
### Reference Documents
| Document | Content | Associated Command |
|---|---|---|
| `msprof-function-level.md` | Function-level performance collection usage and output analysis | `msprof` |
| `msprof-op-level.md` | Operator-level in-depth analysis usage and hardware metrics | `msprof op` |
| `performance-data-analysis.md` | Detailed analysis methods for performance data | Both |
| | Overview of performance analysis toolchain and workflow | Both |
| | Ascend hardware terminology and architecture concepts | - |