Ascend C Operator Performance Optimization (Troubleshoot → Modify → Validate Closed Loop)
This skill not only troubleshoots performance issues but also modifies the code and verifies the optimization effect. The complete workflow is:
Phase 1: Troubleshoot - review the code and study the design document to identify optimization points
Phase 2: Baseline - save the current performance test results (custom operator vs. benchmark)
Phase 3: Optimization - modify the operator code after studying the code-gen knowledge
Phase 4: Precision - verify precision (ensure the operator is still functionally correct after optimization)
Phase 5: Performance - compare performance on the same cases (post-optimization vs. benchmark)
Phase 6: Iteration - if there is no improvement, continue optimizing, for at most 3 rounds
Phase 1: Troubleshoot - Identify Optimization Points
1.1 Study the Operator Design Document
MANDATORY: understand the operator design before troubleshooting:
- Read ascend-kernel/csrc/ops/<op_name>/design.md (if it exists) and extract:
  - Operator type (elementwise / row processing / Cube)
  - Tiling strategy (inter-core splitting / intra-core splitting)
  - UB space allocation scheme
  - Computation logic and data flow
- Read all source code of and
1.2 Troubleshoot Stage by Stage
Review the operator code stage by stage in the following order. For each stage, load the corresponding reference file and check the code against it item by item.
- [ ] 1. Tiling: how data is split across multiple cores and the L2Cache
- [ ] 2. Data copy: bandwidth utilization of DataCopy
- [ ] 3. API usage: efficient use of the Ascend C API
- [ ] 4. Memory: where data is placed in the storage hierarchy
- [ ] 5. Pipeline: overlapped execution of CopyIn / Compute / CopyOut
Each stage has its own reference file; load only the current stage's file while troubleshooting:
- Stage 1: references/tiling-prof.md
- Stage 2: references/data-copy-prof.md
- Stage 3: references/api-usage-prof.md
- Stage 4: references/memory-prof.md
- Stage 5: references/pipeline-prof.md
Detailed example: references/tiling-prof.md
Troubleshooting items:
Detailed example: references/data-copy-prof.md
Troubleshooting items:
Detailed example: references/api-usage-prof.md
Troubleshooting items:
Detailed example: references/memory-prof.md
Troubleshooting items:
Detailed example: references/pipeline-prof.md
Troubleshooting items:
1.3 Output the Troubleshooting Report
After all stages have been checked, output a summary in the following format:
Optimization Troubleshooting Report
Identified Issues (sorted by expected benefit)
- [Stage X.Y] <issue description> - <expected benefit>
- [Stage X.Y] <issue description> - <expected benefit>
...
Confirmed No Issues
- [Stage X.Y] <check item description>
...
Sort the issues by expected benefit from highest to lowest, then choose the target items for this round of optimization.
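The ranking step can be sketched as follows. This is only an illustration: the `Finding` class, the numeric `expected_benefit` score, and `render_report` are hypothetical names introduced here, and the skill itself does not prescribe how the expected benefit is quantified.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    ref: str                 # check-item reference "X.Y" from the report template
    description: str
    expected_benefit: float  # hypothetical score: estimated fraction of run time saved


def render_report(issues: list[Finding], no_issue: list[tuple[str, str]]) -> str:
    """Render the report with issues ordered from highest to lowest expected benefit."""
    ranked = sorted(issues, key=lambda f: f.expected_benefit, reverse=True)
    lines = ["Optimization Troubleshooting Report",
             "Identified Issues (sorted by expected benefit)"]
    for f in ranked:
        lines.append(f"- [{f.ref}] {f.description} - ~{f.expected_benefit:.0%}")
    lines.append("Confirmed No Issues")
    lines += [f"- [{ref}] {item}" for ref, item in no_issue]
    return "\n".join(lines)
```

The first entries of the rendered issue list are then the natural candidates for the current optimization round.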
Phase 2: Baseline - Save Current Performance Test Results
Save the performance baseline before optimizing so that the post-optimization comparison is accurate.
2.1 Check Existing Performance Verification Results
Check whether the following files exist under :
- <op_name>_perf_cases.jsonl: performance test cases
- <op_name>_torch_npu_profiler_report.md: performance comparison report
2.2 Run a Performance Evaluation if No Results Exist
If the above files do not exist or the results are outdated (e.g., the code has been updated but the report has not been regenerated), you MUST call the ascendc-operator-performance-eval skill to complete a full performance evaluation:
- Read the SKILL.md of ascendc-operator-performance-eval
- Follow its process to generate the performance cases (JSONL), run the profiler, and generate the comparison report
- Ensure the report contains complete comparison data for the custom operator vs. the benchmark
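The "missing or outdated" check can be approximated with file timestamps. The sketch below is an assumption about what "outdated" means (a source file modified after the report was written); `needs_perf_eval` and its parameters are hypothetical names, not part of either skill.

```python
from pathlib import Path


def needs_perf_eval(test_dir: Path, op_name: str, source_files: list[Path]) -> bool:
    """Return True if the perf artifacts are missing or older than any source file."""
    cases = test_dir / f"{op_name}_perf_cases.jsonl"
    report = test_dir / f"{op_name}_torch_npu_profiler_report.md"
    if not (cases.is_file() and report.is_file()):
        return True
    report_mtime = report.stat().st_mtime
    # Assumption: "outdated" means some source file was modified after the report.
    return any(src.stat().st_mtime > report_mtime
               for src in source_files if src.is_file())
```

Modification times are only a heuristic; a checkout or copy can reset them, so when in doubt it is safer to rerun the evaluation.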
2.3 Save Baseline Snapshot
Back up the current performance report as a baseline file named <op_name>_baseline_report.md, stored in the same directory. This file is used later to measure the optimization effect.
csrc/ops/<op_name>/test/
├── <op_name>_perf_cases.jsonl ← Performance cases (shared before and after optimization)
├── <op_name>_torch_npu_profiler_report.md ← Current report (will be overwritten)
└── <op_name>_baseline_report.md ← Baseline snapshot (performance data before optimization)
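A minimal sketch of the snapshot step, assuming the file names shown in the tree above. The function name `save_baseline` is hypothetical, and the choice not to overwrite an existing baseline is a design assumption rather than something the skill states.

```python
import shutil
from pathlib import Path


def save_baseline(test_dir: Path, op_name: str) -> Path:
    """Copy the current profiler report to the baseline file in the same directory."""
    current = test_dir / f"{op_name}_torch_npu_profiler_report.md"
    baseline = test_dir / f"{op_name}_baseline_report.md"
    if not current.is_file():
        raise FileNotFoundError(f"no performance report to snapshot: {current}")
    # Assumption: an existing baseline is kept, so that later rounds still
    # compare against the numbers measured before the first modification.
    if not baseline.exists():
        shutil.copy2(current, baseline)
    return baseline
```

Keeping the first snapshot matters because the current report is overwritten by every later profiler run.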
Phase 3: Optimization - Modify Code After Learning the Required Knowledge
3.1 Learn Operator Development Knowledge (MANDATORY)
Before modifying code, you MUST load the reference files of the ascendc-operator-code-gen skill to ensure an accurate understanding of the Ascend C API, data transfer, synchronization control, etc.
Load the following references as needed (located in ascendc-operator-code-gen/references/):
| Reference File | Purpose |
|---|---|
| | Overview: template selection, code generation process |
| | DataCopy/DataCopyPad API in detail |
| | Vector computation API in detail |
| | TQue/Pipe synchronization control |
| resource-management-api.md | TPipe/TBuf resource management |
| basic-data-structures-api.md | Basic structures such as LocalTensor/GlobalTensor |
| | Kernel programming constraints and common pitfalls |
Selectively load the relevant references based on the optimization points identified in Phase 1. For example:
- Optimize data copy → load
- Optimize pipeline → load + resource-management-api.md
- Optimize computation → load
3.2 Formulate a Modification Plan
For each optimization point in the Phase 1 troubleshooting report, formulate a specific code modification plan:
Optimization Point [X.Y]: <issue description>
├── Files to modify: op_host / op_kernel / both
├── Changes: <description of the specific code changes>
├── Expected effect: <quantified expectation (e.g., 30% reduction in data-copy time)>
└── Risk assessment: <whether precision may be affected / whether tiling needs to change>
3.3 Execute the Code Modification
Modify the code item by item according to the modification plan, following these rules:
You MUST comply with the code-gen anti-pattern list:
- NEVER let FP16/BF16 participate directly in complex math computations; cast to FP32 first
- NEVER pass rvalues in EXEC_KERNEL_CMD
- NEVER use DataCopy for GM↔UB transfers; use DataCopyPad instead
- NEVER reuse the source tensor directly after ReduceSum/ReduceMax
- NEVER use standard library functions such as std::min/max/abs/sqrt/exp in the kernel
- NEVER pass repeatTime > 255 to a high-dimensional splitting API
- NEVER modify files under or
- NEVER hardcode the core count or UB size
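One of these rules lends itself to a mechanical pre-check. The sketch below scans kernel source text for the five standard-library calls named in the list; it is a hypothetical helper (`find_std_calls` is not part of any skill), it covers only those five names, and it is a text match rather than a real C++ parser.

```python
import re

# Standard-library calls named in the anti-pattern list.
_STD_CALL = re.compile(r"\bstd::(min|max|abs|sqrt|exp)\s*\(")


def find_std_calls(kernel_source: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_call) pairs for std:: math calls in kernel code."""
    hits = []
    for lineno, line in enumerate(kernel_source.splitlines(), start=1):
        code = line.split("//", 1)[0]  # ignore trailing line comments
        for m in _STD_CALL.finditer(code):
            hits.append((lineno, f"std::{m.group(1)}"))
    return hits
```

An empty result does not prove compliance (block comments, macros, and `using` declarations are not handled), so it complements the manual review rather than replacing it.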
3.4 Compile and Install
After modification, you must recompile and reinstall:
source ${ASCEND_HOME_PATH}/set_env.sh
cd task/ascend-kernel
bash build.sh
pip install output/ascend_kernel*.whl --force-reinstall --no-deps
If compilation fails, enter the troubleshooting loop (up to 3 times).
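The bounded troubleshooting loop could look like the sketch below. Reading "up to 3 times" as three fix-and-rebuild rounds after the first failed build is an assumption, and `fix` is a hypothetical hook standing in for whatever error analysis and code repair is done between builds.

```python
import subprocess
from typing import Callable

MAX_FIX_ROUNDS = 3  # assumption: three fix-and-rebuild rounds after the first failure


def run_build() -> bool:
    """Run build.sh in the current directory; True on exit code 0."""
    return subprocess.run(["bash", "build.sh"]).returncode == 0


def build_with_retries(build: Callable[[], bool], fix: Callable[[int], None]) -> bool:
    """Build once; on failure call fix(round_no) and rebuild, up to MAX_FIX_ROUNDS times."""
    if build():
        return True
    for round_no in range(1, MAX_FIX_ROUNDS + 1):
        fix(round_no)  # hypothetical hook: inspect the error and patch the code
        if build():
            return True
    return False
```

Passing the build step in as a callable keeps the loop testable without an Ascend toolchain; in real use it would be `build_with_retries(run_build, fix)`.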
Phase 4: Precision Verification - Ensure Correct Functionality After Optimization
MANDATORY: after optimization, precision verification must pass before any performance comparison.
4.1 Call the Precision Evaluation Skill
Read the SKILL.md of ascendc-operator-precision-eval and execute its complete process:
- Generate precision test cases (≥30 cases, covering all dtypes)
- Run the pytest precision tests
- Generate the precision report (Markdown + JSON)
- Display the overview, failure summary, and key findings in the current conversation
4.2 Precision Determination
| Result | Handling |
|---|---|
| All passed | Proceed to Phase 5 performance verification |
| Some failed | Analyze the failure causes, roll back or fix the code, and re-enter Phase 3 |
| Widespread failures | Roll back all modifications from this round and re-analyze the optimization plan |
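The table can be read as a three-way decision on the pass count. In the sketch below the 50% threshold is a placeholder assumption, since the table does not say how many failures separate its second row from its third; `precision_action` is a hypothetical name.

```python
def precision_action(passed: int, total: int, mass_failure_ratio: float = 0.5) -> str:
    """Map precision test results to the next step of the workflow."""
    if total <= 0:
        raise ValueError("no precision cases were run")
    failed = total - passed
    if failed == 0:
        return "proceed to Phase 5"
    # Assumption: failures at or above this ratio count as widespread.
    if failed / total >= mass_failure_ratio:
        return "roll back this round and re-analyze the plan"
    return "analyze failures, fix or roll back, re-enter Phase 3"
```

For example, with 30 cases, 28 passes lead back to Phase 3, while 5 passes trigger a full rollback of the round.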
Phase 5: Performance Verification - Confirm the Optimization Effect
5.1 Run Performance Tests with the Same Cases
Using the same performance cases as in Phase 2 (<op_name>_perf_cases.jsonl), call the ascendc-operator-performance-eval skill to rerun the performance evaluation.
Key requirements:
- MUST use exactly the same perf_cases.jsonl as the baseline (no cases added or removed)
- MUST generate a new <op_name>_torch_npu_profiler_report.md
- MUST display the comparison table, summary, and conclusion in the current conversation
5.2 Comparative Analysis
Compare the post-optimization performance data with the baseline saved in Phase 2:
Optimization Effect Comparison
| Case | Shape | dtype | Baseline per-step (us) | Post-optimization per-step (us) | Improvement ratio | Benchmark per-step (us) | vs. benchmark |
|---|---|---|---|---|---|---|---|
| ... | ... | ... | ... | ... | ... | ... | ... |
- Average improvement: X%
- Maximum improvement: X% (Case Y)
- Average ratio vs. benchmark: A before optimization → B after optimization
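The summary figures can be derived from the per-case timings as sketched below. Two definitions here are assumptions, since the text does not spell them out: the improvement ratio is taken as the fraction of baseline time saved, and "vs. benchmark" as custom-operator time divided by benchmark time. `CaseTiming` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CaseTiming:
    name: str
    baseline_us: float   # custom operator before optimization, per step
    optimized_us: float  # custom operator after optimization, per step
    benchmark_us: float  # benchmark operator, per step


def improvement(c: CaseTiming) -> float:
    """Fraction of baseline time saved; positive means the optimized kernel is faster."""
    return (c.baseline_us - c.optimized_us) / c.baseline_us


def summarize(cases: list[CaseTiming]) -> dict:
    """Compute the average/maximum improvement and the average ratio vs. benchmark."""
    gains = [improvement(c) for c in cases]
    best = max(cases, key=improvement)
    n = len(cases)
    return {
        "avg_improvement": sum(gains) / n,
        "max_improvement": improvement(best),
        "max_case": best.name,
        # Custom-operator time over benchmark time; below 1.0 beats the benchmark.
        "vs_benchmark_before": sum(c.baseline_us / c.benchmark_us for c in cases) / n,
        "vs_benchmark_after": sum(c.optimized_us / c.benchmark_us for c in cases) / n,
    }
```

Averaging per-case ratios weights every case equally regardless of its absolute run time, which matches a per-case table but hides where most of the wall-clock time goes.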
5.3 Performance Determination
| Result | Handling |
|---|---|
| Performance improved (most cases are faster after optimization) | Optimization succeeded; output the final report |
| Performance not improved, or regressed | Enter Phase 6 iterative optimization |
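A minimal sketch of this decision, assuming "most cases" means a strict majority and that any reduction in per-step time counts as faster; a real check would probably also want a noise margin, which the text does not define. `performance_improved` is a hypothetical name.

```python
def performance_improved(baseline_us: list[float], optimized_us: list[float]) -> bool:
    """True if more than half of the cases run faster after optimization."""
    if len(baseline_us) != len(optimized_us) or not baseline_us:
        raise ValueError("both runs must cover the same non-empty set of cases")
    faster = sum(1 for b, o in zip(baseline_us, optimized_us) if o < b)
    return faster > len(baseline_us) / 2
```

The length check mirrors the requirement that both runs use exactly the same performance cases.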
Phase 6: Iterative Optimization (Up to 3 Rounds)
If Phase 5 finds no performance improvement, enter the iteration:
Current round: N (N ∈ {1, 2, 3})
├── N < 3: return to Phase 1 and select the next-priority optimization point or adjust the plan
│ ├── Re-troubleshoot and analyze why the previous round's modification had no effect
│ ├── Select a new optimization point or adjust the previous round's plan
│ └── Repeat Phase 3 → Phase 4 → Phase 5
│
└── N = 3: stop iterating and output the final report (including records of all rounds)
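The control flow above amounts to a loop with a hard cap of three rounds. In this sketch `run_round` is a hypothetical callable standing in for one pass through Phase 3 → Phase 4 → Phase 5 that reports whether performance improved.

```python
from typing import Callable

MAX_ROUNDS = 3


def run_optimization(run_round: Callable[[int], bool]) -> list[tuple[int, bool]]:
    """Run rounds until one improves performance or MAX_ROUNDS is reached."""
    history = []
    for n in range(1, MAX_ROUNDS + 1):
        improved = run_round(n)  # hypothetical: Phase 3 -> Phase 4 -> Phase 5 for round n
        history.append((n, improved))
        if improved:
            break
    return history
```

The returned history keeps every round, including unsuccessful ones, so the final report can list them all.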
Each iteration must be recorded:
Round N Optimization
- Optimization target: [Stage X.Y] <description>
- Modification content: <summary of the code changes>
- Precision result: Passed / Failed
- Performance result: Improved X% / Not improved / Regressed Y%
- Decision: Keep this round's modifications / Roll back / Proceed to the next round
After all rounds are complete (performance improved or the 3-round limit reached), output the final summary report.
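One way to keep the per-round records uniform is a small data structure that renders the fields listed above. The `RoundRecord` class and its field names are hypothetical; the signed `perf_change` convention (fraction of baseline time saved, negative for a regression) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RoundRecord:
    round_no: int
    target: str            # optimization target, e.g. "[2.1] DataCopy granularity too small"
    changes: str           # summary of the code changes
    precision_passed: bool
    perf_change: float     # fraction of baseline time saved; negative means regression
    decision: str          # e.g. "keep", "roll back", "next round"

    def to_lines(self) -> list[str]:
        """Render the record in the per-round format."""
        if self.perf_change > 0:
            perf = f"Improved {self.perf_change:.0%}"
        elif self.perf_change < 0:
            perf = f"Regressed {-self.perf_change:.0%}"
        else:
            perf = "Not improved"
        return [
            f"Round {self.round_no} Optimization",
            f"- Optimization target: {self.target}",
            f"- Modification content: {self.changes}",
            f"- Precision result: {'Passed' if self.precision_passed else 'Failed'}",
            f"- Performance result: {perf}",
            f"- Decision: {self.decision}",
        ]
```

Concatenating `to_lines()` for every round yields the iteration history section of the final summary.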
Display in the Current Conversation (MANDATORY)
You MUST display the following content in the conversation; NEVER output only file paths:
- Troubleshooting summary: all identified issues and their handling status
- Performance comparison table: three-way comparison of baseline → post-optimization → benchmark
- Iteration history summary: the optimization target, result, and decision for each round
- ≥3 key conclusions: main bottlenecks, distribution of optimization gains, remaining optimization headroom, etc.
- File paths last: paths of the reports and code files
csrc/ops/<op_name>/test/
├── <op_name>_perf_cases.jsonl ← Performance cases
├── <op_name>_baseline_report.md ← Pre-optimization baseline
├── <op_name>_torch_npu_profiler_report.md ← Final post-optimization performance report
├── <op_name>_precision_report.md ← Precision verification report
└── <op_name>_optim_summary.md ← Optimization iteration summary report (new)
csrc/ops/<op_name>/
├── op_host/<op_name>.cpp ← Optimized host code
└── op_kernel/<op_name>.cpp ← Optimized kernel code
Optimization Iteration Summary Report Structure
<op_name>_optim_summary.md must include:
<op_name> Performance Optimization Report
Troubleshooting Findings
(Content of the Phase 1 troubleshooting report)
Pre-optimization Baseline
(Summary of the Phase 2 performance data)
- Optimization target: ...
- Code modifications: ...
- Precision result: ...
- Performance result: ...
Final Performance Comparison
(Three-way comparison table: pre-optimization vs. post-optimization vs. benchmark)
Checklist (Assistant Self-check)
Phase 1: Troubleshoot
Phase 2: Baseline
Phase 3: Optimization
Phase 4: Precision
Phase 5: Performance
Phase 6: Iteration