triton-operator-performance-eval
Triton Operator Performance Evaluation (Ascend NPU)
Core Concepts
Performance Data Principle: trust only measured data, never your intuition or assumptions. You must profile with `msprof` to locate the real performance hotspots before evaluation.
⚠️ The only reliable performance collection methods: `msprof` and `msprof op`

Performance data collected through any non-`msprof`/`msprof op` method (including but not limited to Python `time.time()`, `torch.npu.Event` timing, `triton.testing.do_bench`, custom timing decorators, etc.) is unacceptable and must never be used for performance evaluation or optimization decisions. These methods lack precision and cannot exclude interference from system scheduling and JIT compilation, so the resulting data has no reference value. All performance analysis must be based solely on the output of `msprof` (function-level) or `msprof op` (operator-level).

Evaluation Objectives:
- Identify performance bottleneck types (Memory-Bound vs Compute-Bound)
- Quantify hardware utilization
- Compare performance differences between different implementations
- Verify optimization effects
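Quantifying hardware utilization reduces to comparing a measured value against the hardware's theoretical peak. A minimal sketch (the bandwidth numbers below are hypothetical placeholders, not real Ascend figures; substitute values reported by `msprof` and your model's datasheet peak):

```python
def utilization_pct(achieved: float, theoretical_peak: float) -> float:
    """Return achieved/theoretical-peak as a percentage."""
    if theoretical_peak <= 0:
        raise ValueError("theoretical peak must be positive")
    return 100.0 * achieved / theoretical_peak

# Hypothetical numbers: 612 GB/s measured vs. an assumed 1800 GB/s peak.
bw_util = utilization_pct(achieved=612.0, theoretical_peak=1800.0)
print(f"memory bandwidth utilization: {bw_util:.1f}%")  # 34.0%
```

The same helper applies to Cube/Vector utilization when the profiler reports absolute throughput rather than a ready-made percentage.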
Performance Evaluation Workflow
Function-Level Performance Collection (Preferred)
Perform function-level performance analysis using `msprof`:
```bash
msprof --application="python my_script.py" --output=./profiling_result
```

Applicable Scenarios:
- Compare performance between multiple PyTorch operators vs fused Triton operators
- Analyze function-level performance bottlenecks
- Generate visualized performance reports
- Full-link performance analysis (Host + Device)
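The per-operator summary that `msprof` produces can be post-processed to rank operators by total device time, which is the usual starting point for a PyTorch-vs-Triton comparison. A minimal sketch, assuming `op_summary.csv` contains `Op Name` and `Task Duration(us)` columns (the actual header names depend on your CANN/msprof version; adjust accordingly):

```python
import csv
from collections import defaultdict

def rank_operators(csv_path: str) -> list[tuple[str, float]]:
    """Sum device time per operator and rank descending.

    Assumes 'Op Name' and 'Task Duration(us)' columns; adapt the keys
    to whatever headers your msprof version actually emits.
    """
    totals: dict[str, float] = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["Op Name"]] += float(row["Task Duration(us)"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

The top entries of the ranking tell you which kernels are worth the deeper `msprof op` analysis described below.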
Detailed Usage and Examples: Load `msprof-function-level.md`
Operator-Level Performance Collection (In-Depth Analysis)
Perform in-depth operator-level analysis using `msprof op`:
```bash
msprof op --kernel-name={jit_kernel_name} {application}
```

Applicable Scenarios:
- Analyze hardware utilization of a single Triton kernel
- Diagnose Cube/Vector unit performance
- Analyze UB cache and memory access patterns
- Locate hardware issues such as Bank Conflict
Detailed Usage and Examples: Load `msprof-op-level.md`
Performance Data Analysis
Key Metrics:
- Bottleneck type judgment (Memory-Bound vs Compute-Bound)
- Memory bandwidth utilization
- Computing unit utilization (Cube/Vector)
- UB conflict analysis
Detailed Analysis Methods: Load `performance-data-analysis.md`
Performance Summary Output
After completing the performance evaluation, you must output a structured performance summary following the template below:
Performance Evaluation Summary
Basic Information
| Item | Value |
|---|---|
| Operator Name | {kernel_name} |
| Input Size | {shape, dtype} |
| Test Hardware | {Ascend Model} |
| Measurement Method | {msprof / msprof op} |
Performance Metrics
| Metric | Value | Reference Value | Utilization |
|---|---|---|---|
| Execution Time | {X} us | - | - |
| Memory Bandwidth | {X} GB/s | {Theoretical Peak} GB/s | {X}% |
| Cube Utilization | - | - | {X}% |
| Vector Utilization | - | - | {X}% |
| L2 Cache Hit Rate | - | - | {X}% |
| Bank Conflict Ratio | - | - | {X}% |
Bottleneck Diagnosis
- Bottleneck Type: {Memory-Bound / Compute-Bound}
- Judgment Basis: {Analysis based on Arithmetic Intensity and hardware utilization data}
- Key Evidence: {Cite specific CSV data}
Performance Issue List
| Priority | Problem | Evidence | Optimization Direction |
|---|---|---|---|
| P0 | {Most Critical Issue} | {Data Source} | {Specific Suggestions} |
| P1 | ... | ... | ... |
Optimization Suggestions
- {Highest-priority optimization suggestion and expected benefit}
- ...
**Output Principles**:
- All conclusions must be supported by Profiling data, no subjective guesses allowed
- Metrics with utilization below 30% are marked as **Key Focus**
- Comparison scenarios must list both baseline and optimized results
Reference Resource Loading Guide
MANDATORY - Load On Demand: Load the corresponding reference documents based on task type

| Task Type | Must Load | Do Not Load |
|---|---|---|
| Function-Level Performance Operator Comparison | `msprof-function-level.md` | |
| Operator-Level Hardware Analysis | `msprof-op-level.md` | |
| Performance Bottleneck Diagnosis | `performance-data-analysis.md` | - |
| Understand Hardware Terminology | | - |
| Complete Performance Optimization Process | All references | - |
Performance Evaluation Checklist
Basic Checks
- Is msprof used for performance collection?
- Has warm-up been performed (to avoid first-time compilation overhead)?
- Have multiple measurements been taken to get statistical values?
- Has the NPU device been synchronized (`torch.npu.synchronize()`)?
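The "multiple measurements" checklist item can be supported with a small aggregation helper: take the per-run durations reported by `msprof`, drop the cold (JIT-compiled) first run, and report robust statistics rather than a single number. A sketch with hypothetical timings:

```python
import statistics

def summarize_times(times_us: list[float]) -> dict[str, float]:
    """Aggregate repeated kernel timings (e.g. per-run durations taken
    from msprof output) into robust summary statistics."""
    return {
        "median_us": statistics.median(times_us),
        "mean_us": statistics.fmean(times_us),
        "stdev_us": statistics.stdev(times_us) if len(times_us) > 1 else 0.0,
        "min_us": min(times_us),
    }

# Hypothetical durations: the first run includes compilation overhead,
# so it is excluded before aggregating.
runs = [812.4, 53.1, 52.8, 53.0, 52.9, 53.2]
steady = runs[1:]  # drop warm-up run
print(summarize_times(steady))
```

Prefer the median over the mean when quoting a headline number: it is insensitive to a single run perturbed by system scheduling.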
Profiler Checks
- Is the correct `--application` or `--kernel-name` specified?
- Has the appropriate `--aic-metrics` been selected?
- Have all key performance metrics been analyzed?
- Has the bottleneck type (Memory-Bound vs Compute-Bound) been identified?
Performance Metric Checks
- Is memory bandwidth utilization reasonable?
- Is computing unit utilization reasonable?
- Is there a high Bank Conflict rate?
- Is L2 Cache hit rate reasonable?
Anti-Pattern List (NEVER)
- NEVER use any non-`msprof`/`msprof op` method for timing or performance evaluation (including `time.time()`, `torch.npu.Event`, `triton.testing.do_bench`, custom timers, etc.): the data these methods collect lacks precision and cannot exclude interference from system scheduling and JIT compilation, and must never be used for any performance evaluation or optimization decision
- NEVER skip warm-up in performance tests (the first execution includes compilation overhead)
- NEVER draw conclusions based on only one test (multiple measurements are needed to get statistical values)
- NEVER include printing or logging in performance tests (I/O will seriously affect results)
- NEVER forget to synchronize the NPU device (`torch.npu.synchronize()`)
- NEVER compare performance results across different hardware environments
- NEVER confuse the `msprof` and `msprof op` commands (the former is function-level global analysis, the latter operator-level in-depth analysis)
- NEVER give optimization suggestions without profiling data support
Common Pitfalls and Notes
| Pitfall | Manifestation | Correct Approach |
|---|---|---|
| Using `msprof op` for operator comparison | Only a single kernel is visible; no comparison is possible | Use `msprof` for operator comparison |
| Incorrect `--kernel-name` | | Confirm the kernel name matches the Triton function definition |
| Failing to distinguish first-time compilation and steady-state performance | Abnormally high execution time in the first run | Collect data only after at least 5 warm-up runs |
| Testing performance with small-scale inputs | Startup overhead accounts for a large proportion, conclusions have no reference value | Use production-scale inputs for evaluation |
| Ignoring the impact of dtype on performance | Significant performance difference between FP16 and FP32 | Fix dtype for comparison and evaluate separately |
Performance Issues and Optimization Directions
Bottleneck Types and Optimization Strategies
| Bottleneck Type | Judgment Criteria | Core Optimization Direction |
|---|---|---|
| Memory-Bound | AI < hardware balance point; high bandwidth utilization, low computing utilization | Reduce data transfer volume, improve data reuse, optimize memory access patterns |
| Compute-Bound | AI > hardware balance point; high computing utilization, low bandwidth utilization | Optimize computation instruction efficiency, improve Cube/Vector utilization |
| Latency-Bound | Both bandwidth and computing utilization are low | Increase parallelism (Grid Size), reduce synchronization overhead |
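The decision rule in the table above can be sketched as a roofline-style check: compare the kernel's arithmetic intensity (FLOPs per byte moved) against the hardware balance point (peak compute divided by peak bandwidth), and flag the latency-bound case when neither unit is kept busy. The peak numbers and the 30% threshold below are illustrative assumptions, not Ascend specifications:

```python
def classify_bottleneck(flops: float, bytes_moved: float,
                        peak_tflops: float, peak_bw_gbs: float,
                        compute_util: float, bw_util: float) -> str:
    """Roofline-style classification mirroring the table above.

    Peak values are placeholders for your Ascend model's datasheet
    numbers; utilizations are fractions in [0, 1] from msprof op output.
    """
    ai = flops / bytes_moved                             # FLOP per byte
    balance = peak_tflops * 1e12 / (peak_bw_gbs * 1e9)   # ridge point, FLOP per byte
    if compute_util < 0.3 and bw_util < 0.3:
        return "Latency-Bound"    # neither compute nor bandwidth is saturated
    return "Memory-Bound" if ai < balance else "Compute-Bound"
```

For example, an element-wise fp16 add moves 6 bytes per FLOP (two loads, one store), giving AI ≈ 0.17, far below any realistic balance point, so it classifies as Memory-Bound whenever the bandwidth pipe is actually busy.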
Common Performance Problem Diagnosis
| Problem | Symptom | Diagnosis Data Source | Solution Direction |
|---|---|---|---|
| UB Overflow | Compilation error/runtime OOM | Check BLOCK_SIZE configuration | Reduce BLOCK_SIZE or split blocks within the kernel |
| Cube Miss | Performance is only 10% of theoretical value | ArithmeticUtilization.csv | Force BLOCK_M/N/K to be multiples of 16 |
| Precision Loss | Large deviation in FP16 results | Compare with PyTorch results | Use FP32 for accumulators |
| Non-Contiguous Memory Access | Bandwidth utilization is only 20% | Memory.csv | Adjust data layout to be contiguous |
| Low Parallelism | Low AI Core utilization | PipeUtilization.csv | Increase Grid Size |
| High Bank Conflict | Resource conflict rate > 10% | ResourceConflictRatio.csv | Adjust data block size and alignment method |
| Low L2 Cache Hit Rate | Frequent GM access | L2Cache.csv | Optimize Tiling strategy to improve data locality |
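The "UB Overflow" row can be pre-checked before compilation by estimating a tile's Unified Buffer footprint. The UB capacity and buffer count below are hypothetical parameters (UB size varies by Ascend model, and the compiler's actual allocation may differ), so treat this as a rough screen, not a guarantee:

```python
def ub_fits(block_elems: int, dtype_bytes: int,
            live_buffers: int, ub_capacity_bytes: int) -> bool:
    """Rough check that a tile's working set fits in the Unified Buffer.

    live_buffers counts simultaneously-live tiles (inputs, output,
    temporaries); ub_capacity_bytes is your Ascend model's UB size.
    """
    return block_elems * dtype_bytes * live_buffers <= ub_capacity_bytes

# Illustrative: BLOCK_SIZE=8192 fp16 elements, 3 live tiles, assumed 192 KiB UB.
print(ub_fits(8192, 2, 3, 192 * 1024))  # 49152 <= 196608 -> True
```

If the check fails, the table's remedy applies: reduce BLOCK_SIZE or split the block again inside the kernel.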
Optimization Direction Quick Reference
Memory-Bound Operator Optimization Path:
- Check memory access pattern → Ensure contiguous memory access (Memory.csv bandwidth utilization)
- Reduce data transfer → Fuse operators to reduce the number of GM reads/writes
- Improve data reuse → Optimize Tiling strategy to enable multiple uses of data in UB/L1
- Eliminate Bank Conflict → Adjust alignment method (ResourceConflictRatio.csv)
Compute-Bound Operator Optimization Path:
- Hit Cube units → Set BLOCK dimensions to multiples of 16 (ArithmeticUtilization.csv)
- Reduce type conversions → Avoid unnecessary upcast/downcast
- Pipeline optimization → Check Pipe utilization, balance computation and data transfer (PipeUtilization.csv)
- Vectorization → Ensure Vector operations fully utilize SIMD width
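The "multiples of 16" rule for engaging the Cube unit can be verified mechanically before launching a kernel. A trivial sketch (the fractal size of 16 follows the rule stated above; confirm it for your dtype and hardware generation):

```python
def cube_aligned(block_m: int, block_n: int, block_k: int, unit: int = 16) -> bool:
    """True when all matmul tile dimensions are multiples of the assumed
    Cube fractal size (16, per the optimization path above)."""
    return all(d % unit == 0 for d in (block_m, block_n, block_k))

print(cube_aligned(64, 128, 32))  # True
print(cube_aligned(64, 100, 32))  # False: 100 is not a multiple of 16
```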
Reference Resources
`msprof` vs `msprof op` Command Comparison
These two commands operate at completely different analysis levels; choosing the wrong one leads to invalid analysis:
| Dimension | `msprof` | `msprof op` |
|---|---|---|
| Command Format | `msprof --application="..." --output=...` | `msprof op --kernel-name=... {application}` |
| Analysis Granularity | All operators of the entire application | A specified single kernel |
| Core Output | op_summary.csv, timeline_trace.json, report.html | ArithmeticUtilization.csv, Memory.csv, PipeUtilization.csv, etc. |
| Provided Information | Per-operator execution-time ranking, Host/Device full-link timeline | Hardware utilization, memory bandwidth, Bank Conflict, and other microarchitecture metrics |
| Typical Usage | Compare overall performance of PyTorch vs Triton operators | Diagnose hardware bottlenecks of a single kernel |
| Required Parameter | `--application` | `--kernel-name` |
Selection Decision:
- Need to know "which operator is the slowest" → use `msprof`
- Need to know "why this kernel is slow" → use `msprof op`
- Complete optimization workflow: first use `msprof` to locate hotspots, then use `msprof op` for in-depth analysis
Reference Documents
| Document | Content | Associated Command |
|---|---|---|
| `msprof-function-level.md` | Function-level performance collection usage and output analysis | `msprof` |
| `msprof-op-level.md` | Operator-level in-depth analysis usage and hardware metrics | `msprof op` |
| `performance-data-analysis.md` | Detailed analysis methods for profiling data | |
| | Overview of the performance analysis toolchain and workflow | Both |
| | Ascend hardware terminology and architecture concepts | - |