Troubleshoot and optimize the performance of Ascend C operators. This skill applies when users develop, review, or optimize Ascend C kernel operators, and is triggered when users mention keywords such as Ascend C performance optimization, operator optimization, tiling, pipeline, data copy, memory optimization, or NPU/Ascend.
Install:

```shell
npx skill4agent add ascend/agent-skills ascendc-operator-performance-optim
```

Phase 1: Troubleshoot — Review code and study design documents to identify optimization points
Phase 2: Baseline — Save current performance test results (custom operator vs benchmark)
Phase 3: Optimization — Modify operator code after learning code-gen knowledge
Phase 4: Precision — Precision verification (ensure correct functionality after optimization)
Phase 5: Performance — Performance comparison with the same cases (post-optimization vs benchmark)
Phase 6: Iteration — Continue optimization if no improvement is achieved, up to 3 rounds

Key inputs for troubleshooting:
- `ascend-kernel/csrc/ops/<op_name>/design.md`
- `op_host/<op_name>.cpp`
- `op_kernel/<op_name>.cpp`

Troubleshooting checklist:
- [ ] 1. Tiling — Splitting strategy of data between multi-cores and L2Cache
- [ ] 2. Data Copy — Bandwidth utilization of DataCopy
- [ ] 3. API Usage — Efficient usage of Ascend C API
- [ ] 4. Memory — Placement strategy of data in storage hierarchy
- [ ] 5. Pipeline — Overlapped execution of CopyIn / Compute / CopyOut

### 1. Tiling (detailed example: `references/tiling-prof.md`)
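The core-splitting arithmetic behind the tiling check can be sketched in plain C++. This is illustrative host-side logic, not the Ascend C tiling API: `SplitAcrossCores` and `LenForCore` are hypothetical helpers, and on real hardware the core count would come from platform queries such as `GetCoreNumAiv()`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the Ascend C API): split a 1-D element
// count across vector cores, with the first cores absorbing the remainder.
struct CoreSplit {
    uint64_t blockDim;   // number of cores actually launched
    uint64_t baseLen;    // elements per core before remainder distribution
    uint64_t tailExtra;  // first `tailExtra` cores process one extra element
};

inline CoreSplit SplitAcrossCores(uint64_t totalLen, uint64_t coreNum) {
    CoreSplit s{};
    if (totalLen == 0 || coreNum == 0) { return s; }
    // Never launch more cores than there are elements.
    s.blockDim = totalLen < coreNum ? totalLen : coreNum;
    s.baseLen = totalLen / s.blockDim;
    s.tailExtra = totalLen % s.blockDim;
    return s;
}

// Length handled by a given core index under this split.
inline uint64_t LenForCore(const CoreSplit& s, uint64_t coreIdx) {
    return s.baseLen + (coreIdx < s.tailExtra ? 1 : 0);
}
```

A common defect this check catches is a hard-coded `blockDim` that leaves cores idle for small shapes, which the `totalLen < coreNum` branch above guards against.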
Check the `blockDim` setting against `GetCoreNumAiv()` / `GetCoreNumAic()`, and whether `input + output > L2Cache capacity`.

### 2. Data Copy (detailed example: `references/data-copy-prof.md`)
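A frequent bandwidth pitfall is expressing a strided row copy with the wrong block parameters. The sketch below mimics the parameter math with a stand-in for `DataCopyParams`: the struct name `DataCopyParams` is real Ascend C, but `CopyParamsSketch` and `MakeRowCopyParams` are hypothetical, and the assumption that lengths and strides are counted in 32-byte blocks should be verified against the toolkit version in use.

```cpp
#include <cassert>
#include <cstdint>

// Mimic of the DataCopyParams fields (the real struct lives in the Ascend C
// headers). Assumption: blockLen and strides are in 32-byte blocks.
struct CopyParamsSketch {
    uint16_t blockCount;  // number of contiguous chunks (e.g. matrix rows)
    uint16_t blockLen;    // length of each chunk, in 32-byte blocks
    uint16_t srcStride;   // gap between chunks in the source, in 32-byte blocks
    uint16_t dstStride;   // gap between chunks in the destination
};

// Build params for copying `rows` rows of `rowBytes` bytes each out of a
// source whose physical row pitch is `srcPitchBytes`, into a densely packed
// destination. Returns blockCount == 0 if the shape is not 32-byte aligned
// (the plain copy cannot express it; a padded copy path would be needed).
inline CopyParamsSketch MakeRowCopyParams(uint32_t rows, uint32_t rowBytes,
                                          uint32_t srcPitchBytes) {
    CopyParamsSketch p{};
    constexpr uint32_t kBlock = 32;
    if (rowBytes % kBlock != 0 || srcPitchBytes % kBlock != 0 ||
        srcPitchBytes < rowBytes) {
        return p;  // unaligned: fall back to a padded copy path
    }
    p.blockCount = static_cast<uint16_t>(rows);
    p.blockLen = static_cast<uint16_t>(rowBytes / kBlock);
    p.srcStride = static_cast<uint16_t>((srcPitchBytes - rowBytes) / kBlock);
    p.dstStride = 0;  // destination is packed
    return p;
}
```

The optimization angle: one multi-block copy with correct strides replaces a per-row copy loop, which is where most of the bandwidth loss usually comes from.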
Check the `DataCopy` parameters (`DataCopyParams`) for bandwidth utilization.

### 3. API Usage (detailed example: `references/api-usage-prof.md`)
Check `TPipe` usage, `TQueBind<VECIN, VECOUT>` versus separate `TQue<VECIN>` / `TQue<VECOUT>`, `aiv_vec_time`, `enAtomic=1`, `IterateAll` / `GetTensorC`, and `BlockReduceSum` versus `WholeReduceSum`.

### 4. Memory (detailed example: `references/memory-prof.md`)
Check accumulation chains such as `A1*B1 + A2*B2 + ...` and the use of `Mmad` / `Fixpipe`.

### 5. Pipeline (detailed example: `references/pipeline-prof.md`)
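An idealised cost model shows why overlapping CopyIn / Compute / CopyOut pays off: once the pipeline is full, each additional tile costs only the slowest stage. All names below (`StageCost`, `SerialMakespan`, `PipelinedMakespan`) are hypothetical, and the model deliberately ignores synchronization overhead and buffer limits.

```cpp
#include <cassert>
#include <algorithm>
#include <cstdint>

// Toy model (not a simulator of real NPU behaviour): per-tile costs of the
// three stages, in arbitrary time units.
struct StageCost { uint64_t copyIn, compute, copyOut; };

// Makespan with no overlap: every tile runs its three stages back to back.
inline uint64_t SerialMakespan(StageCost c, uint64_t tiles) {
    return tiles * (c.copyIn + c.compute + c.copyOut);
}

// Idealised 3-stage pipeline (copy-in engine, vector unit, and copy-out
// engine working concurrently, with enough buffering to keep them busy):
// the first tile fills the pipeline, every further tile costs only the
// bottleneck stage.
inline uint64_t PipelinedMakespan(StageCost c, uint64_t tiles) {
    if (tiles == 0) { return 0; }
    uint64_t fill = c.copyIn + c.compute + c.copyOut;
    uint64_t bottleneck = std::max({c.copyIn, c.compute, c.copyOut});
    return fill + (tiles - 1) * bottleneck;
}
```

With costs 3/4/2 over 10 tiles, the serial schedule takes 90 units while the pipelined one takes 45, which is the kind of gap the pipeline check is hunting for.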
Check `TQue` / `InitBuffer` setup and the use of `Iterate<false>()` / `IterateAll<false>()`.

## Optimization Troubleshooting Report
### Identified Issues (Sorted by Expected Benefit)
1. [Phase X.Y] <Issue Description> — <Expected Benefit>
2. [Phase X.Y] <Issue Description> — <Expected Benefit>
...
### Confirmed No Issues
- [Phase X.Y] <Check Item Description>
...
### Optimization Plan
Sort by expected benefit from highest to lowest and determine the target items for this round of optimization.

Run the cases in `csrc/ops/<op_name>/test/<op_name>_perf_cases.jsonl` with the `ascendc-operator-performance-eval` skill to generate `<op_name>_torch_npu_profiler_report.md`, then snapshot it as `<op_name>_baseline_report.md` under `test/`:

csrc/ops/<op_name>/test/
├── <op_name>_perf_cases.jsonl ← Performance cases (shared before and after optimization)
├── <op_name>_torch_npu_profiler_report.md ← Current report (will be overwritten)
└── <op_name>_baseline_report.md ← Baseline snapshot (performance data before optimization)

Before modifying code, study the code-generation knowledge of the `ascendc-operator-code-gen` skill under `ascendc-operator-code-gen/references/`:

| Reference File | Purpose |
|---|---|
| | Overview: Template selection, code generation process |
| `data-copy-api.md` | Detailed explanation of the DataCopy/DataCopyPad API |
| `vector-compute-api.md` | Detailed explanation of the Vector computation API |
| `sync-control-api.md` | TQue/Pipe synchronization control |
| `resource-management-api.md` | TPipe/TBuf resource management |
| | Basic structures such as LocalTensor/GlobalTensor |
| | Kernel programming constraints and common pitfalls |

Optimization Point [X.Y]: <Issue Description>
├── Modified Files: op_host / op_kernel / both
├── Modification Content: <Specific code change description>
├── Expected Effect: <Quantified expectation (e.g., 30% reduction in transfer time)>
└── Risk Assessment: <Whether precision may be affected / whether tiling needs to be modified>

Note: kernel-side code cannot rely on host standard-library calls such as `std::min/max/abs/sqrt/exp`; build scripts live under `cmake/` and shared helpers under `csrc/utils/`. After modifying code, rebuild and reinstall:

```shell
source ${ASCEND_HOME_PATH}/set_env.sh
cd task/ascend-kernel
bash build.sh
pip install output/ascend_kernel*.whl --force-reinstall --no-deps
```

Run precision verification with the `ascendc-operator-precision-eval` skill:

| Result | Handling |
|---|---|
| All Passed | Proceed to Phase 5 Performance Verification |
| Partially Failed | Analyze failure causes, roll back or fix code, and re-enter Phase 3 |
| Mostly Failed | Roll back all modifications in this round and re-analyze the optimization plan |
Re-run the same cases in `<op_name>_perf_cases.jsonl` with the `ascendc-operator-performance-eval` skill to regenerate `<op_name>_torch_npu_profiler_report.md`, then compare:

## Optimization Effect Comparison
| Case | Shape | dtype | Baseline per-step(us) | Post-optimization per-step(us) | Improvement Ratio | Benchmark per-step(us) | vs Benchmark |
|------|-------|-------|-------------------|--------------------|---------|--------------------|---------|
| ... | ... | ... | ... | ... | ... | ... | ... |
### Summary
- Average Improvement: X%
- Maximum Improvement: X% (Case Y)
- vs Benchmark Average Ratio: Before optimization A → After optimization B

| Result | Handling |
|---|---|
| Performance Improved (most cases are faster after optimization) | Optimization successful, output final report |
| Performance Not Improved or Regressed | Enter Phase 6 Iterative Optimization |
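For consistency across rounds, the Improvement Ratio and vs Benchmark columns can be computed as below. The formulas are an assumption; the skill text does not mandate a particular definition, so agree on one before filling the table.

```cpp
#include <cassert>
#include <cmath>

// Improvement relative to the pre-optimization baseline, from per-step
// times in microseconds. Positive means the optimized kernel is faster.
inline double ImprovementPercent(double baselineUs, double postUs) {
    return (baselineUs - postUs) / baselineUs * 100.0;
}

// "vs Benchmark" as a ratio of per-step times: values below 1.0 mean the
// optimized kernel is faster than the benchmark kernel.
inline double VsBenchmark(double postUs, double benchmarkUs) {
    return postUs / benchmarkUs;
}
```

For example, a case that drops from 100 us to 70 us per step is a 30% improvement; against a 50 us benchmark it still runs at a 1.4x ratio, i.e. slower than the benchmark.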
Current Round: N (N ∈ {1, 2, 3})
├── N < 3: Return to Phase 1, select next priority optimization point or adjust plan
│ ├── Re-troubleshoot, analyze why previous round's modification did not take effect
│ ├── Select new optimization points or adjust previous round's plan
│ └── Repeat Phase 3 → Phase 4 → Phase 5
│
└── N = 3: Stop iteration, output final report (including records of all rounds)

### Round N Optimization
- Optimization Target: [Phase X.Y] <Description>
- Modification Content: <Code change summary>
- Precision Result: Passed / Failed
- Performance Result: Improved X% / Not improved / Regressed Y%
└── Decision: Keep current round's modification / Roll back / Proceed to next round

Final deliverables:

csrc/ops/<op_name>/test/
├── <op_name>_perf_cases.jsonl ← Performance cases
├── <op_name>_baseline_report.md ← Pre-optimization baseline
├── <op_name>_torch_npu_profiler_report.md ← Final post-optimization performance report
├── <op_name>_precision_report.md ← Precision verification report
└── <op_name>_optim_summary.md ← Optimization iteration summary report (newly added)
csrc/ops/<op_name>/
├── op_host/<op_name>.cpp ← Optimized host code
└── op_kernel/<op_name>.cpp ← Optimized kernel code

Template for `<op_name>_optim_summary.md`:

# <op_name> Performance Optimization Report
## Troubleshooting Findings
(Content of Phase 1 troubleshooting report)
## Pre-optimization Baseline
(Summary of Phase 2 performance data)
## Iteration History
### Round 1
- Optimization Target: ...
- Code Modifications: ...
- Precision Result: ...
- Performance Result: ...
### Round N
...
## Final Performance Comparison
(Three-way comparison table of pre-optimization vs post-optimization vs benchmark)
## Conclusions
(≥3 key findings)