Troubleshoot and optimize the performance of Ascend C operators. This skill applies when users develop, review, or optimize Ascend C kernel operators, and is triggered when users mention keywords such as Ascend C performance optimization, operator optimization, tiling, pipeline, data copy, memory optimization, or NPU/Ascend.
Install:

```shell
npx skill4agent add ascend/agent-skills ascendc-operator-performance-optim
```

Phase 1: Troubleshoot — Review code and study design documents to identify optimization points
Phase 2: Baseline — Save current performance test results (custom operator vs benchmark)
Phase 3: Optimization — Modify operator code after learning code-gen knowledge
Phase 4: Precision — Precision verification (ensure correct functionality after optimization)
Phase 5: Performance — Performance comparison with the same cases (post-optimization vs benchmark)
Phase 6: Iteration — Continue optimization if no improvement is achieved, up to 3 rounds

Key inputs for troubleshooting:
- `ascend-kernel/csrc/ops/<op_name>/design.md`
- `op_host/<op_name>.cpp`
- `op_kernel/<op_name>.cpp`

Troubleshooting checklist:
- [ ] 1. Tiling — Splitting strategy of data between multi-cores and L2Cache
- [ ] 2. Data Copy — Bandwidth utilization of DataCopy
- [ ] 3. API Usage — Efficient usage of Ascend C API
- [ ] 4. Memory — Placement strategy of data in storage hierarchy
- [ ] 5. Pipeline — Overlapped execution of CopyIn / Compute / CopyOut

### 1. Tiling (detailed example: `references/tiling-prof.md`)
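The core-splitting arithmetic behind the tiling check can be sketched in plain C++. This is illustrative host-side logic, not the Ascend C tiling API: `SplitAcrossCores` and `LenForCore` are hypothetical helpers, and on real hardware the core count would come from platform queries such as `GetCoreNumAiv()`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the Ascend C API): split a 1-D element
// count across vector cores, with the first cores absorbing the remainder.
struct CoreSplit {
    uint64_t blockDim;   // number of cores actually launched
    uint64_t baseLen;    // elements per core before remainder distribution
    uint64_t tailExtra;  // first `tailExtra` cores process one extra element
};

inline CoreSplit SplitAcrossCores(uint64_t totalLen, uint64_t coreNum) {
    CoreSplit s{};
    if (totalLen == 0 || coreNum == 0) { return s; }
    // Never launch more cores than there are elements.
    s.blockDim = totalLen < coreNum ? totalLen : coreNum;
    s.baseLen = totalLen / s.blockDim;
    s.tailExtra = totalLen % s.blockDim;
    return s;
}

// Length handled by a given core index under this split.
inline uint64_t LenForCore(const CoreSplit& s, uint64_t coreIdx) {
    return s.baseLen + (coreIdx < s.tailExtra ? 1 : 0);
}
```

A common defect this check catches is a hard-coded `blockDim` that leaves cores idle for small shapes, which the `totalLen < coreNum` branch above guards against.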
Check the `blockDim` setting against `GetCoreNumAiv()` / `GetCoreNumAic()`, and whether `input + output > L2Cache capacity`.

### 2. Data Copy (detailed example: `references/data-copy-prof.md`)
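A frequent bandwidth pitfall is expressing a strided row copy with the wrong block parameters. The sketch below mimics the parameter math with a stand-in for `DataCopyParams`: the struct name `DataCopyParams` is real Ascend C, but `CopyParamsSketch` and `MakeRowCopyParams` are hypothetical, and the assumption that lengths and strides are counted in 32-byte blocks should be verified against the toolkit version in use.

```cpp
#include <cassert>
#include <cstdint>

// Mimic of the DataCopyParams fields (the real struct lives in the Ascend C
// headers). Assumption: blockLen and strides are in 32-byte blocks.
struct CopyParamsSketch {
    uint16_t blockCount;  // number of contiguous chunks (e.g. matrix rows)
    uint16_t blockLen;    // length of each chunk, in 32-byte blocks
    uint16_t srcStride;   // gap between chunks in the source, in 32-byte blocks
    uint16_t dstStride;   // gap between chunks in the destination
};

// Build params for copying `rows` rows of `rowBytes` bytes each out of a
// source whose physical row pitch is `srcPitchBytes`, into a densely packed
// destination. Returns blockCount == 0 if the shape is not 32-byte aligned
// (the plain copy cannot express it; a padded copy path would be needed).
inline CopyParamsSketch MakeRowCopyParams(uint32_t rows, uint32_t rowBytes,
                                          uint32_t srcPitchBytes) {
    CopyParamsSketch p{};
    constexpr uint32_t kBlock = 32;
    if (rowBytes % kBlock != 0 || srcPitchBytes % kBlock != 0 ||
        srcPitchBytes < rowBytes) {
        return p;  // unaligned: fall back to a padded copy path
    }
    p.blockCount = static_cast<uint16_t>(rows);
    p.blockLen = static_cast<uint16_t>(rowBytes / kBlock);
    p.srcStride = static_cast<uint16_t>((srcPitchBytes - rowBytes) / kBlock);
    p.dstStride = 0;  // destination is packed
    return p;
}
```

The optimization angle: one multi-block copy with correct strides replaces a per-row copy loop, which is where most of the bandwidth loss usually comes from.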
Check the `DataCopy` parameters (`DataCopyParams`) for bandwidth utilization.

### 3. API Usage (detailed example: `references/api-usage-prof.md`)
Check `TPipe` usage, `TQueBind<VECIN, VECOUT>` versus separate `TQue<VECIN>` / `TQue<VECOUT>`, `aiv_vec_time`, `enAtomic=1`, `IterateAll` / `GetTensorC`, and `BlockReduceSum` versus `WholeReduceSum`.

### 4. Memory (detailed example: `references/memory-prof.md`)
Check accumulation chains such as `A1*B1 + A2*B2 + ...` and the use of `Mmad` / `Fixpipe`.

### 5. Pipeline (detailed example: `references/pipeline-prof.md`)
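An idealised cost model shows why overlapping CopyIn / Compute / CopyOut pays off: once the pipeline is full, each additional tile costs only the slowest stage. All names below (`StageCost`, `SerialMakespan`, `PipelinedMakespan`) are hypothetical, and the model deliberately ignores synchronization overhead and buffer limits.

```cpp
#include <cassert>
#include <algorithm>
#include <cstdint>

// Toy model (not a simulator of real NPU behaviour): per-tile costs of the
// three stages, in arbitrary time units.
struct StageCost { uint64_t copyIn, compute, copyOut; };

// Makespan with no overlap: every tile runs its three stages back to back.
inline uint64_t SerialMakespan(StageCost c, uint64_t tiles) {
    return tiles * (c.copyIn + c.compute + c.copyOut);
}

// Idealised 3-stage pipeline (copy-in engine, vector unit, and copy-out
// engine working concurrently, with enough buffering to keep them busy):
// the first tile fills the pipeline, every further tile costs only the
// bottleneck stage.
inline uint64_t PipelinedMakespan(StageCost c, uint64_t tiles) {
    if (tiles == 0) { return 0; }
    uint64_t fill = c.copyIn + c.compute + c.copyOut;
    uint64_t bottleneck = std::max({c.copyIn, c.compute, c.copyOut});
    return fill + (tiles - 1) * bottleneck;
}
```

With costs 3/4/2 over 10 tiles, the serial schedule takes 90 units while the pipelined one takes 45, which is the kind of gap the pipeline check is hunting for.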
Check `TQue` / `InitBuffer` setup and the use of `Iterate<false>()` / `IterateAll<false>()`.

## Optimization Troubleshooting Report
### Identified Issues (Sorted by Expected Benefit)
1. [Phase X.Y] <Issue Description> — <Expected Benefit>
2. [Phase X.Y] <Issue Description> — <Expected Benefit>
...
### Confirmed No Issues
- [Phase X.Y] <Check Item Description>
...
### Optimization Plan
Sort by expected benefit from highest to lowest and determine the target items for this round of optimization.

Run the cases in `csrc/ops/<op_name>/test/<op_name>_perf_cases.jsonl` with the `ascendc-operator-performance-eval` skill to generate `<op_name>_torch_npu_profiler_report.md`, then snapshot it as `<op_name>_baseline_report.md` under `test/`:

csrc/ops/<op_name>/test/
├── <op_name>_perf_cases.jsonl ← Performance cases (shared before and after optimization)
├── <op_name>_torch_npu_profiler_report.md ← Current report (will be overwritten)
└── <op_name>_baseline_report.md ← Baseline snapshot (performance data before optimization)

Before modifying code, study the code-generation knowledge of the `ascendc-operator-code-gen` skill under `ascendc-operator-code-gen/references/`:

| Reference File | Purpose |
|---|---|
| | Overview: Template selection, code generation process |
| `data-copy-api.md` | Detailed explanation of the DataCopy/DataCopyPad API |
| `vector-compute-api.md` | Detailed explanation of the Vector computation API |
| `sync-control-api.md` | TQue/Pipe synchronization control |
| `resource-management-api.md` | TPipe/TBuf resource management |
| | Basic structures such as LocalTensor/GlobalTensor |
| | Kernel programming constraints and common pitfalls |

Optimization Point [X.Y]: <Issue Description>
├── Modified Files: op_host / op_kernel / both
├── Modification Content: <Specific code change description>
├── Expected Effect: <Quantified expectation (e.g., 30% reduction in transfer time)>
└── Risk Assessment: <Whether precision may be affected / whether tiling needs to be modified>

Note: kernel-side code cannot rely on host standard-library calls such as `std::min/max/abs/sqrt/exp`; build scripts live under `cmake/` and shared helpers under `csrc/utils/`. After modifying code, rebuild and reinstall:

```shell
source ${ASCEND_HOME_PATH}/set_env.sh
cd task/ascend-kernel
bash build.sh
pip install output/ascend_kernel*.whl --force-reinstall --no-deps
```

Run precision verification with the `ascendc-operator-precision-eval` skill:

| Result | Handling |
|---|---|
| All Passed | Proceed to Phase 5 Performance Verification |
| Partially Failed | Analyze failure causes, roll back or fix code, and re-enter Phase 3 |
| Mostly Failed | Roll back all modifications in this round and re-analyze the optimization plan |
Re-run the same cases in `<op_name>_perf_cases.jsonl` with the `ascendc-operator-performance-eval` skill to regenerate `<op_name>_torch_npu_profiler_report.md`, then compare:

## Optimization Effect Comparison
| Case | Shape | dtype | Baseline per-step(us) | Post-optimization per-step(us) | Improvement Ratio | Benchmark per-step(us) | vs Benchmark |
|------|-------|-------|-------------------|--------------------|---------|--------------------|---------|
| ... | ... | ... | ... | ... | ... | ... | ... |
### Summary
- Average Improvement: X%
- Maximum Improvement: X% (Case Y)
- vs Benchmark Average Ratio: Before optimization A → After optimization B

| Result | Handling |
|---|---|
| Performance Improved (most cases are faster after optimization) | Optimization successful, output final report |
| Performance Not Improved or Regressed | Enter Phase 6 Iterative Optimization |
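For consistency across rounds, the Improvement Ratio and vs Benchmark columns can be computed as below. The formulas are an assumption; the skill text does not mandate a particular definition, so agree on one before filling the table.

```cpp
#include <cassert>
#include <cmath>

// Improvement relative to the pre-optimization baseline, from per-step
// times in microseconds. Positive means the optimized kernel is faster.
inline double ImprovementPercent(double baselineUs, double postUs) {
    return (baselineUs - postUs) / baselineUs * 100.0;
}

// "vs Benchmark" as a ratio of per-step times: values below 1.0 mean the
// optimized kernel is faster than the benchmark kernel.
inline double VsBenchmark(double postUs, double benchmarkUs) {
    return postUs / benchmarkUs;
}
```

For example, a case that drops from 100 us to 70 us per step is a 30% improvement; against a 50 us benchmark it still runs at a 1.4x ratio, i.e. slower than the benchmark.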
Current Round: N (N ∈ {1, 2, 3})
├── N < 3: Return to Phase 1, select next priority optimization point or adjust plan
│ ├── Re-troubleshoot, analyze why previous round's modification did not take effect
│ ├── Select new optimization points or adjust previous round's plan
│ └── Repeat Phase 3 → Phase 4 → Phase 5
│
└── N = 3: Stop iteration, output final report (including records of all rounds)

### Round N Optimization
- Optimization Target: [Phase X.Y] <Description>
- Modification Content: <Code change summary>
- Precision Result: Passed / Failed
- Performance Result: Improved X% / Not improved / Regressed Y%
└── Decision: Keep current round's modification / Roll back / Proceed to next round

Final deliverables:

csrc/ops/<op_name>/test/
├── <op_name>_perf_cases.jsonl ← Performance cases
├── <op_name>_baseline_report.md ← Pre-optimization baseline
├── <op_name>_torch_npu_profiler_report.md ← Final post-optimization performance report
├── <op_name>_precision_report.md ← Precision verification report
└── <op_name>_optim_summary.md ← Optimization iteration summary report (newly added)
csrc/ops/<op_name>/
├── op_host/<op_name>.cpp ← Optimized host code
└── op_kernel/<op_name>.cpp ← Optimized kernel code

Template for `<op_name>_optim_summary.md`:

# <op_name> Performance Optimization Report
## Troubleshooting Findings
(Content of Phase 1 troubleshooting report)
## Pre-optimization Baseline
(Summary of Phase 2 performance data)
## Iteration History
### Round 1
- Optimization Target: ...
- Code Modifications: ...
- Precision Result: ...
- Performance Result: ...
### Round N
...
## Final Performance Comparison
(Three-way comparison table of pre-optimization vs post-optimization vs benchmark)
## Conclusions
(≥3 key findings)