triton-operator-performance-eval


Triton Operator Performance Evaluation (Ascend NPU)

Core Concepts

Performance Data Principle: Trust only measured data, never intuition or assumptions. Before evaluating, you must use profiling to locate the real performance hotspots.

⚠️ The only acceptable performance collection methods are msprof and msprof op.

Performance data collected by any other means (including but not limited to Python time.time(), torch.npu.Event timing, triton.testing.do_bench, custom timing decorators, etc.) is unacceptable and must never be used for performance evaluation or optimization decisions. These methods lack precision and cannot exclude interference from system scheduling and JIT compilation, so the resulting data has no reference value. All performance analysis must be based solely on the output of msprof (function-level) or msprof op (operator-level).

Evaluation Objectives:
  • Identify the performance bottleneck type (Memory-Bound vs Compute-Bound)
  • Quantify hardware utilization
  • Compare performance across different implementations
  • Verify optimization effects

Performance Evaluation Workflow

Function-Level Performance Collection (Preferred)

Perform function-level performance analysis using msprof:

```bash
msprof --application="python my_script.py" --output=./profiling_result
```

Applicable scenarios:
  • Compare the performance of multiple PyTorch operators vs a fused Triton operator
  • Analyze function-level performance bottlenecks
  • Generate visualized performance reports
  • Full-pipeline performance analysis (Host + Device)
Detailed usage and examples: load msprof-function-level.md
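Once msprof has produced its op_summary CSV, ranking operators by total time is a short scripting exercise. The sketch below is illustrative only: the column names ("Op Name", "Task Duration(us)") and the sample rows are assumptions to verify against the headers of your actual msprof output.

```python
import csv
import io

def rank_operators(op_summary_csv: str, top_n: int = 5):
    """Rank operators by total duration from an op_summary-style CSV.

    Column names are assumptions for illustration; check the headers
    of your actual msprof output before relying on them.
    """
    totals = {}
    for row in csv.DictReader(io.StringIO(op_summary_csv)):
        name = row["Op Name"]
        totals[name] = totals.get(name, 0.0) + float(row["Task Duration(us)"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical sample rows standing in for real msprof output
sample = """Op Name,Task Duration(us)
aten::add,12.5
triton_fused_kernel,8.1
aten::add,11.9
aten::mul,3.0
"""
print(rank_operators(sample))
# → [('aten::add', 24.4), ('triton_fused_kernel', 8.1), ('aten::mul', 3.0)]
```

Summing per-name rather than per-row matters because the same operator typically appears once per invocation in the summary.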

Operator-Level Performance Collection (In-Depth Analysis)

Perform in-depth operator-level analysis using msprof op:

```bash
msprof op --kernel-name={jit_kernel_name} {application}
```

Applicable scenarios:
  • Analyze the hardware utilization of a single Triton kernel
  • Diagnose Cube/Vector unit performance
  • Analyze UB cache and memory access patterns
  • Locate hardware issues such as Bank Conflict
Detailed usage and examples: load msprof-op-level.md

Performance Data Analysis

Key metrics:
  • Bottleneck type identification (Memory-Bound vs Compute-Bound)
  • Memory bandwidth utilization
  • Compute unit utilization (Cube/Vector)
  • UB conflict analysis
Detailed analysis methods: load performance-data-analysis.md
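The two numbers these metrics hinge on can be computed directly. A minimal sketch of the arithmetic, with purely hypothetical peak figures (not real Ascend specifications):

```python
def bandwidth_utilization(measured_gbs: float, peak_gbs: float) -> float:
    """Utilization as a percentage of the theoretical peak bandwidth."""
    return 100.0 * measured_gbs / peak_gbs

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs per byte moved between global memory and the compute units."""
    return flops / bytes_moved

# Hypothetical numbers for illustration only
print(bandwidth_utilization(400.0, 1000.0))  # → 40.0
print(arithmetic_intensity(2e9, 4e8))        # → 5.0
```

Comparing the arithmetic intensity against the hardware's balance point (peak FLOPs divided by peak bandwidth) is what separates Memory-Bound from Compute-Bound in the analysis below.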

Performance Summary Output

After completing the performance evaluation, you must output a structured performance summary according to the following template:

Performance Evaluation Summary

Basic Information

| Item | Value |
| --- | --- |
| Operator Name | {kernel_name} |
| Input Shape | {shape, dtype} |
| Test Hardware | {Ascend model} |
| Measurement Method | {msprof / msprof op} |

Performance Metrics

| Metric | Value | Reference | Utilization |
| --- | --- | --- | --- |
| Execution Time | {X} us | - | - |
| Memory Bandwidth | {X} GB/s | {theoretical peak} GB/s | {X}% |
| Cube Utilization | - | - | {X}% |
| Vector Utilization | - | - | {X}% |
| L2 Cache Hit Rate | - | - | {X}% |
| Bank Conflict Ratio | - | - | {X}% |

Bottleneck Diagnosis

  • Bottleneck Type: {Memory-Bound / Compute-Bound}
  • Rationale: {analysis based on Arithmetic Intensity and hardware utilization data}
  • Key Evidence: {cite specific CSV data}

Performance Issue List

| Priority | Issue | Evidence | Optimization Direction |
| --- | --- | --- | --- |
| P0 | {most critical issue} | {data source} | {specific suggestion} |
| P1 | ... | ... | ... |

Optimization Suggestions

  1. {Highest Priority Optimization Suggestion and Expected Benefit}
  2. ...

**Output Principles**:
- Every conclusion must be backed by profiling data; no subjective guessing
- Mark any metric with utilization below 30% as **Key Focus**
- Comparison scenarios must list both the baseline and the optimized results
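The "below 30%" rule is mechanical enough to automate when filling in the summary table. A minimal sketch, with the report dictionary being hypothetical profiling results:

```python
def key_focus_metrics(utilizations: dict, threshold: float = 30.0):
    """Return the metric names whose utilization falls below the threshold,
    i.e. the rows to mark as **Key Focus** in the summary table."""
    return sorted(name for name, pct in utilizations.items() if pct < threshold)

# Hypothetical utilization percentages pulled from msprof op output
report = {"Memory Bandwidth": 72.0, "Cube": 12.0, "Vector": 45.0, "L2 Hit Rate": 28.0}
print(key_focus_metrics(report))  # → ['Cube', 'L2 Hit Rate']
```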

Reference Resource Loading Guide

MANDATORY - Load on demand: load the reference document(s) matching the task type.

| Task Type | Must Load | Do Not Load |
| --- | --- | --- |
| Function-level performance / operator comparison | msprof-function-level.md | msprof-op-level.md |
| Operator-level hardware analysis | msprof-op-level.md, performance-data-analysis.md | msprof-function-level.md |
| Performance bottleneck diagnosis | performance-data-analysis.md | - |
| Understanding hardware terminology | ascend-terminology.md | - |
| Complete performance optimization workflow | All references | - |

Performance Evaluation Checklist

Basic Checks

  • Was msprof used for performance collection?
  • Was warm-up performed (to avoid first-run compilation overhead)?
  • Were multiple measurements taken and aggregated statistically?
  • Was the NPU device synchronized (torch.npu.synchronize())?
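The warm-up and synchronization discipline from this checklist can be sketched as a generic harness. This is an illustrative sketch, not the document's prescribed tooling: on a real NPU you would pass torch.npu.synchronize as `sync`, and msprof (wrapping the whole application) remains the only acceptable source of timing data; the harness only controls what executes inside the measured window.

```python
def run_with_warmup(kernel, warmup: int = 5, iters: int = 20, sync=lambda: None):
    """Run `kernel` with warm-up iterations preceding the measured region.

    `sync` should block until the device is idle (torch.npu.synchronize on
    Ascend; a no-op stand-in here keeps the sketch self-contained). Timing
    itself must still come from msprof, never from this function.
    """
    for _ in range(warmup):   # absorb JIT compilation / first-run overhead
        kernel()
    sync()                    # drain the device before the steady-state region
    for _ in range(iters):    # steady-state iterations for msprof to observe
        kernel()
    sync()

# Self-contained demonstration with a counting stand-in for a Triton kernel
calls = []
run_with_warmup(lambda: calls.append(1), warmup=5, iters=20)
print(len(calls))  # → 25
```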

Profiler Checks

  • Is the correct --application or --kernel-name specified?
  • Was an appropriate --aic-metrics option selected?
  • Have all key performance metrics been analyzed?
  • Has the bottleneck type (Memory-Bound vs Compute-Bound) been identified?

Performance Metric Checks

  • Is memory bandwidth utilization reasonable?
  • Is compute unit utilization reasonable?
  • Is there a high Bank Conflict rate?
  • Is the L2 Cache hit rate reasonable?

Anti-Pattern List (NEVER)

  • NEVER use any method other than msprof/msprof op for timing or performance evaluation (including time.time(), torch.npu.Event, triton.testing.do_bench, custom timers, etc.): data collected this way lacks precision, cannot exclude interference from system scheduling and JIT compilation, and must never be used for any performance evaluation or optimization decision
  • NEVER skip warm-up in performance tests (the first execution includes compilation overhead)
  • NEVER draw conclusions from a single run (take multiple measurements and aggregate statistically)
  • NEVER include printing or logging in performance tests (I/O severely distorts results)
  • NEVER forget to synchronize the NPU device (torch.npu.synchronize())
  • NEVER compare performance results across different hardware environments
  • NEVER confuse the msprof and msprof op commands (the former is function-level global analysis, the latter operator-level in-depth analysis)
  • NEVER give optimization suggestions without supporting profiling data

Common Pitfalls and Notes

| Pitfall | Symptom | Correct Approach |
| --- | --- | --- |
| Using msprof op for operator comparison | Only a single kernel is visible; no comparison possible | Use msprof for operator comparison, msprof op for in-depth analysis |
| Misspelled --kernel-name | msprof op completes silently with no data | Confirm the kernel name matches the Triton function definition |
| Not separating first-run compilation from steady-state performance | Abnormally high execution time on the first run | Collect data only after at least 5 warm-up runs |
| Testing with small-scale inputs | Launch overhead dominates; conclusions have no reference value | Evaluate with production-scale inputs |
| Ignoring the impact of dtype on performance | FP16 and FP32 performance differ significantly | Fix the dtype when comparing; evaluate each separately |

Performance Issues and Optimization Directions

Bottleneck Types and Optimization Strategies

| Bottleneck Type | Criteria | Core Optimization Direction |
| --- | --- | --- |
| Memory-Bound | AI < hardware balance point; high bandwidth utilization, low compute utilization | Reduce data movement, improve data reuse, optimize memory access patterns |
| Compute-Bound | AI > hardware balance point; high compute utilization, low bandwidth utilization | Improve computation instruction efficiency, raise Cube/Vector utilization |
| Latency-Bound | Both bandwidth and compute utilization are low | Increase parallelism (Grid Size), reduce synchronization overhead |
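These criteria map directly onto profiling numbers. A minimal sketch of the decision; the balance point (peak FLOPs / peak bandwidth) and the 30% "low utilization" cutoff are illustrative assumptions to calibrate for your specific Ascend model:

```python
def classify_bottleneck(ai: float, balance_point: float,
                        bw_util: float, compute_util: float,
                        low: float = 30.0) -> str:
    """Classify a kernel using the criteria in the table above.

    `balance_point` and the `low` cutoff are assumed values for
    illustration, not real Ascend hardware constants.
    """
    if bw_util < low and compute_util < low:
        return "Latency-Bound"   # neither pipe is kept busy
    if ai < balance_point:
        return "Memory-Bound"    # too few FLOPs per byte moved
    return "Compute-Bound"       # compute pipe is the limiter

print(classify_bottleneck(ai=2.0, balance_point=10.0, bw_util=85.0, compute_util=15.0))   # → Memory-Bound
print(classify_bottleneck(ai=40.0, balance_point=10.0, bw_util=20.0, compute_util=90.0))  # → Compute-Bound
print(classify_bottleneck(ai=5.0, balance_point=10.0, bw_util=10.0, compute_util=8.0))    # → Latency-Bound
```

Checking the Latency-Bound case first matters: a latency-bound kernel also tends to have a low AI reading, and would otherwise be misreported as Memory-Bound.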

Common Performance Problem Diagnosis

| Problem | Symptom | Diagnostic Data Source | Solution Direction |
| --- | --- | --- | --- |
| UB overflow | Compilation error / runtime OOM | Check the BLOCK_SIZE configuration | Reduce BLOCK_SIZE or tile again within the kernel |
| Cube unit not engaged | Performance is only 10% of the theoretical value | ArithmeticUtilization.csv | Force BLOCK_M/N/K to multiples of 16 |
| Precision loss | Large deviation in FP16 results | Compare against PyTorch results | Use FP32 accumulators |
| Non-contiguous memory access | Bandwidth utilization is only 20% | Memory.csv | Adjust the data layout to be contiguous |
| Low parallelism | Low AI Core utilization | PipeUtilization.csv | Increase Grid Size |
| High Bank Conflict | Resource conflict rate > 10% | ResourceConflictRatio.csv | Adjust data block size and alignment |
| Low L2 Cache hit rate | Frequent GM accesses | L2Cache.csv | Optimize the Tiling strategy to improve data locality |

Optimization Direction Quick Reference

Memory-Bound operator optimization path:
  1. Check memory access patterns → ensure contiguous access (Memory.csv bandwidth utilization)
  2. Reduce data movement → fuse operators to cut GM reads and writes
  3. Improve data reuse → optimize the Tiling strategy so data in UB/L1 is used multiple times
  4. Eliminate Bank Conflict → adjust alignment (ResourceConflictRatio.csv)
Compute-Bound operator optimization path:
  1. Engage the Cube unit → set BLOCK dimensions to multiples of 16 (ArithmeticUtilization.csv)
  2. Reduce type conversions → avoid unnecessary upcast/downcast
  3. Pipeline optimization → check pipe utilization and balance computation against data movement (PipeUtilization.csv)
  4. Vectorization → ensure Vector operations fully utilize the SIMD width
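The multiple-of-16 rule for engaging the Cube unit is easy to check mechanically before launching a profiling run. A minimal sketch (the function name is ours, not part of any Triton or Ascend API):

```python
def cube_friendly(block_m: int, block_n: int, block_k: int) -> bool:
    """Check that matmul block dimensions are positive multiples of 16,
    the alignment recommended above for engaging the Cube unit."""
    return all(dim > 0 and dim % 16 == 0 for dim in (block_m, block_n, block_k))

print(cube_friendly(128, 64, 32))  # → True
print(cube_friendly(128, 60, 32))  # → False (60 is not a multiple of 16)
```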

Reference Resources

msprof vs msprof op Command Comparison

These two commands operate at completely different analysis levels; choosing the wrong one produces an invalid analysis:

| Dimension | msprof (function-level) | msprof op (operator-level) |
| --- | --- | --- |
| Command format | `msprof --application="python x.py"` | `msprof op --kernel-name=K python x.py` |
| Analysis granularity | All operators in the entire application | A specified single kernel |
| Core output | op_summary.csv, timeline_trace.json, report.html | ArithmeticUtilization.csv, Memory.csv, PipeUtilization.csv, etc. |
| Information provided | Per-operator time ranking, Host/Device full-pipeline timeline | Hardware utilization, memory bandwidth, Bank Conflict, and other microarchitecture metrics |
| Typical use | Compare overall PyTorch vs Triton operator performance | Diagnose the hardware bottleneck of a single kernel |
| Required parameter | --application | --kernel-name |
Selection decision:
  • Need to know "which operator is slowest" → use msprof
  • Need to know "why this kernel is slow" → use msprof op
  • Complete optimization workflow: first locate hotspots with msprof, then analyze in depth with msprof op

Reference Documents

| Document | Content | Associated Command |
| --- | --- | --- |
| msprof-function-level.md | Function-level performance collection usage and output analysis | msprof |
| msprof-op-level.md | Operator-level in-depth analysis usage and hardware metrics | msprof op |
| performance-data-analysis.md | Detailed analysis methods for msprof op output CSVs | msprof op |
| profiling-tools.md | Overview of the profiling toolchain and workflow | Both |
| ascend-terminology.md | Ascend hardware terminology and architecture concepts | - |

Official Resources