# Catlass Operator Performance Tuning

## Core Workflow

Read optimization guide → Obtain baseline data → Modify tiling configuration → Recompile and run → Profiler dual-path collection → Output performance comparison report (save to disk + display in chat) → Iterate and compare → Determine optimal configuration
## Preconditions

| Check Item | Description |
|---|---|
| Project Directory | OPS_PROJECT_ROOT exists, and catlass/docs/1_Practice/10_matmul_optimization.md is present under it |
| Operator Status | The operator compiles and runs normally |
| Performance Data | The baseline is preferably the <op>_torch_npu_profiler_report.md collected according to the ascendc-operator-performance-eval specification (or an equivalent Markdown report generated by the project's benchmark_*_torch_npu_profiler.py). ascendc-operator-precision-eval covers precision only and must not be used as a performance baseline |
Ask for clarification if conditions are unclear.
## Tuning Principles

- Tuning strategies are based on the official Catlass optimization guide
- Change only one variable per modification so that the effect of each change can be evaluated accurately
- Record every attempted configuration and its corresponding performance data
## NEVER / ALWAYS

NEVER:
- Modify tiling parameters without reading the Catlass optimization guide
- Modify multiple variables at once
- Ignore hardware resource limits
- Skip recompilation after modifying code
- Modify code without generating a comparison report when an NPU and profiler are available

ALWAYS:
- Read the Catlass optimization guide before taking action
- Record the configuration and performance data for each iteration
- Roll back the configuration if performance degrades
- Recompile using ascendc-operator-compile-debug after modifying code
- Generate and display the performance comparison report (save to disk + display in chat) after completing each round of submittable performance optimization
## Steps

### 1. Read Optimization Guide

Read the following document under OPS_PROJECT_ROOT:

<OPS_PROJECT_ROOT>/catlass/docs/1_Practice/10_matmul_optimization.md

Understand the adjustable parameters: TileShape, DispatchPolicy, Swizzle, etc. Be sure to read the documentation and make optimization attempts that follow its suggestions, rather than modifying parameters at random.
### 2. Obtain Baseline Data

The iteration baseline is derived from the per-case latency and the native/custom ratio in the comparison table of reports generated by ascendc-operator-performance-eval (or by equivalent torch_npu.profiler scripts in the operator directory). Before tuning, complete data collection in the same directory so that the initial report serves as the "pre-optimization" reference.
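A minimal sketch of how the per-case baseline can be derived, assuming the report's ratio column is benchmark per-step divided by custom per-step; the case names and latencies below are hypothetical placeholders, not real profiler output:

```python
# Sketch: compute the per-case ratio from per-step latencies (microseconds).
# Assumes ratio = benchmark / custom; all values below are hypothetical.

def ratio_table(cases):
    """cases: (name, custom_us, benchmark_us). Return rows with the ratio appended."""
    rows = []
    for name, custom_us, benchmark_us in cases:
        rows.append((name, custom_us, benchmark_us, benchmark_us / custom_us))
    return rows

baseline = ratio_table([
    ("case_0 (1280x16384x4096)", 85.0, 80.0),  # hypothetical per-step latencies
    ("case_1 (512x512x512)", 6.0, 6.3),
])
for name, custom, bench, ratio in baseline:
    print(f"{name}: custom={custom}us benchmark={bench}us ratio={ratio:.3f}")
```

A ratio below 1.0 here would mean the benchmark path is faster than the custom operator for that case, i.e., there is still tuning headroom.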
### 3. Modify Tiling Configuration

Modify the operator's tiling shape configuration according to the suggestions in the optimization guide:

| Modification Item | Description |
|---|---|
| TileShape Size | Adjust the L1/L0-layer TileShape |
| DispatchPolicy | Select a different scheduling policy |
| Swizzle Strategy | Try modifying the data-movement Swizzle |
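The "one variable per modification" principle can be sketched as a candidate enumerator; the parameter names and values below are hypothetical illustrations, not actual Catlass identifiers:

```python
# Sketch: enumerate candidate configurations that differ from the current
# best in exactly one variable, per the one-variable-per-modification
# principle. All parameter names and values are hypothetical.

def one_variable_candidates(current, search_space):
    """Yield configs differing from `current` in exactly one key."""
    for key, values in search_space.items():
        for value in values:
            if value != current[key]:
                candidate = dict(current)
                candidate[key] = value
                yield candidate

current = {"l1_tile": (128, 256, 256), "dispatch_policy": "default", "swizzle": 0}
space = {
    "l1_tile": [(128, 256, 256), (256, 128, 256)],
    "dispatch_policy": ["default", "preload"],
    "swizzle": [0, 1, 3],
}
candidates = list(one_variable_candidates(current, space))
```

Each candidate is then compiled, measured, and either kept or rolled back before the next variable is touched.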
4. Recompile and Collect Data
After modifying the code, you must recompile (using ascendc-operator-compile-debug), then run the operator to collect performance data and compare with the baseline:
- Performance improved → Record the configuration and continue trying other optimization suggestions
- Performance degraded → Roll back the configuration and try other solutions
- Performance stable → Fine-tune around the current configuration
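The keep/roll-back/fine-tune decision above can be sketched as follows; the 2% noise threshold is a hypothetical guard, to be tuned to the observed run-to-run variance of the profiler data:

```python
# Sketch of the keep / roll-back / fine-tune decision. The threshold is a
# hypothetical guard against measurement noise, not a Catlass constant.

NOISE_THRESHOLD = 0.02  # 2% -- hypothetical; calibrate against run-to-run variance

def decide(baseline_us, candidate_us, threshold=NOISE_THRESHOLD):
    """Return 'keep', 'rollback', or 'fine-tune' for a candidate config."""
    delta = (candidate_us - baseline_us) / baseline_us
    if delta < -threshold:
        return "keep"       # clearly faster: record the config and continue
    if delta > threshold:
        return "rollback"   # clearly slower: revert the change
    return "fine-tune"      # within noise: explore nearby configurations
```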
### 5. Iterate and Determine Optimal Configuration

Within a reasonable number of iterations, find the tiling configuration with the best performance. Record the detailed parameters of the optimal configuration and the corresponding performance metrics.
### 6. Mandatory Delivery of Performance Comparison Report

Every time a round of submittable performance optimization is completed (or when delivering tuning conclusions to the user), provided an NPU and torch_npu.profiler are available, you MUST do the following:

| No. | Requirement |
|---|---|
| 6.1 | Follow the ascendc-operator-performance-eval convention: warmup=5, active=5; collect data on the NPU for both the custom-operator and benchmark paths; the benchmark can be an equivalent API or a composition of small operators (consistent with the eval skill). |
| 6.2 | Save the Markdown performance comparison report to disk (the file name can match the project script, e.g., <op>_torch_npu_profiler_report.md). The report must include a comparison table with Case / Shape / Custom per-step / Benchmark per-step / Ratio, and must specify the test-case JSONL, the trace root directory, and the schedule. |
| 6.3 | Display in the current conversation: at minimum, paste the main body of the comparison table (or the row for the target shape) plus the full paths of the report and the JSONL, and explain the conclusion in 1–3 sentences (e.g., which path is faster, the native/custom ratio of key cases, whether expectations are met). Do not write only "optimized" without attaching data. |
| 6.4 | If the environment is unavailable (no NPU, or the profiler failed): honestly explain the reason, list the modified configurations and the collection commands the user should run locally, and do not forge reports. |
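A minimal sketch of rendering the §6.2 report to disk; the case data, paths, and output file name are hypothetical, and real values come from the torch_npu.profiler collection:

```python
# Sketch: render the comparison table as Markdown and save it to disk.
# All case data and paths below are hypothetical placeholders.
from pathlib import Path

def render_report(cases, jsonl_path, trace_root, schedule="warmup=5, active=5"):
    lines = [
        "# Performance Comparison Report",
        "",
        f"- Test-case JSONL: {jsonl_path}",
        f"- Trace root: {trace_root}",
        f"- Schedule: {schedule}",
        "",
        "| Case | Shape | Custom per-step (us) | Benchmark per-step (us) | Ratio |",
        "|---|---|---|---|---|",
    ]
    for case, shape, custom, bench in cases:
        lines.append(f"| {case} | {shape} | {custom:.2f} | {bench:.2f} | {bench / custom:.3f} |")
    return "\n".join(lines) + "\n"

report = render_report(
    [("case_0", "1280x16384x4096", 85.0, 80.0)],  # hypothetical numbers
    jsonl_path="cases/matmul_cases.jsonl",        # hypothetical path
    trace_root="traces/",
)
Path("demo_report.md").write_text(report)  # e.g., <op>_torch_npu_profiler_report.md
```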
### 6.5 Pre- vs. Post-Optimization Comparison (Mandatory When Comparing to the Previous Kernel Version)

When the current round of tasks includes conclusions about "performance changes relative to pre-optimization" (rather than only comparing with the benchmark), you MUST additionally meet the following:

| No. | Requirement |
|---|---|
| 6.5.1 | Before modifying the kernel/tiling: under the same CANN / driver / clock conditions, run a full torch_npu.profiler pass with the current binary (same convention as §6.1) and save the pre-optimization report to disk; the recommended name is <op>_torch_npu_profiler_report_PRE.md. Keep the corresponding trace directory for future reference. |
| 6.5.2 | After modifying the code and recompiling/installing: run another profiler round and save the post-optimization report to disk; the recommended name is <op>_torch_npu_profiler_report_POST.md. |
| 6.5.3 | Combined delivery: generate a pre- vs. post-optimization comparison table that includes at least Case / Shape / Project custom per-step (pre) / Project custom per-step (post) / Project Δ%; the benchmark column can be taken from the post-optimization report. You can use the merge script in the operator's directory (e.g., merge_profiler_before_after.py) or an equivalent table. |
| 6.5.4 | Display this comparison table in the current conversation (covering at least the shapes specified by the user), and honestly explain any cases where performance regressed (Δ% > 0); do not select only the improved cases and hide regressions. |

Note: This coexists with the dual-path report comparing to the benchmark (§6.2): the latter answers "how far is it from the framework", while §6.5 answers "how much better is this kernel modification than before".

Explanation: This step aligns with Phase 6 of catlass-operator-dev ("Display performance summary in chat"); this skill emphasizes that after tuning iterations you must also deliver a reviewable Markdown report plus a conversation summary.
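The PRE/POST merge of §6.5.3 can be sketched as follows; this is a standalone illustration with hypothetical data, not the project's merge_profiler_before_after.py:

```python
# Sketch of the PRE/POST merge: join per-case custom per-step latencies
# from the two reports and compute delta% = (post - pre) / pre * 100, so
# a negative delta% means the kernel got faster. Data is hypothetical.

def merge_pre_post(pre, post):
    """pre/post: {case: custom_per_step_us}. Return rows (case, pre, post, delta_pct)."""
    rows = []
    for case in sorted(pre.keys() & post.keys()):
        delta_pct = (post[case] - pre[case]) / pre[case] * 100.0
        rows.append((case, pre[case], post[case], delta_pct))
    return rows

rows = merge_pre_post(
    {"case_0": 85.0, "case_1": 6.0},  # hypothetical PRE per-step latencies (us)
    {"case_0": 78.2, "case_1": 6.3},  # hypothetical POST per-step latencies (us)
)
for case, p, q, d in rows:
    flag = "REGRESSED" if d > 0 else "improved"
    print(f"{case}: pre={p}us post={q}us delta={d:+.1f}% ({flag})")
```

Per §6.5.4, rows with positive Δ% must be reported, not filtered out.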
## Quality Verification

## Reference Materials

| Document | Purpose |
|---|---|
| <ASCEND_KERNEL_ROOT>/catlass/docs/1_Practice/10_matmul_optimization.md | Tuning strategies are based on this document |
| ascendc-operator-performance-eval | Performance-baseline and profiler-comparison process, JSONL specification |
| ascendc-operator-compile-debug | Recompile and install the whl after code modification |
## Recommended Continuous Tiling Search

If the target performance has not been achieved, maintain a tuning-notes document (or equivalent): list verified invalid/regressed configurations and pending items (Swizzle offset, Preload, Split-K, TLA, etc.) according to 10_matmul_optimization.md, and check off each item after the PRE/POST comparison, to avoid repeating the same mistakes.
## Tiling Exploration for a Single Target Shape (Note Solidification)

When the user only cares about a few large shapes (e.g., a single M×N×K) and wants a better L1/L0/Swizzle combination than the Catlass example / default tiling in the documentation:

| Requirement | Description |
|---|---|
| Exploration Notes | Add or maintain TILING_EXPLORATION_<brief>.md (e.g., TILING_EXPLORATION_1280x16384x4096.md), and add a one-line summary plus a link in the "Verified Conclusions" section of the notes document, to avoid mixing with full-shape parameter tuning. |
| Table Fields | Recommended columns: Date, Tag, L1 (M,N,K), L0 (if different from the default), Swizzle, Project custom per-step (μs), Benchmark (μs), Δ% relative to the default, and a conclusion (optimal / regressed / to be retested). Append a new row for each exploration; do not overwrite history. |
| Isolated Exploration (Full-File Kernel Replacement) | To run only one tile, you may temporarily replace the kernel source with a minimal kernel that uses a single instantiation for the entire problem. You must restore the production version with shape dispatch before the end of the same task round, then recompile and install; do not deliver the minimal exploration kernel as the final product. |
| Relationship with §6 | Before officially adopting a configuration, you must still run PRE/POST (§6.5) for the complete JSONL or the user-specified case set and save the report to disk; the single-shape table only records the search process and does not replace the full regression. |
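The append-only exploration-notes convention above can be sketched as follows; the file name, column set, and row values are hypothetical examples, not the project's exact schema:

```python
# Sketch: append one exploration row to a TILING_EXPLORATION_<brief>.md
# table without overwriting history. File name, columns, and values are
# hypothetical illustrations of the note convention.
from datetime import date
from pathlib import Path

HEADER = (
    "| Date | Tag | L1 (M,N,K) | Swizzle | Custom per-step (us) | Benchmark (us) | D% vs default | Conclusion |\n"
    "|---|---|---|---|---|---|---|---|\n"
)

def append_row(path, tag, l1, swizzle, custom_us, bench_us, delta_pct, conclusion):
    p = Path(path)
    if not p.exists():
        p.write_text(HEADER)          # create the table on first use
    row = (f"| {date.today().isoformat()} | {tag} | {l1} | {swizzle} "
           f"| {custom_us:.2f} | {bench_us:.2f} | {delta_pct:+.1f}% | {conclusion} |\n")
    with p.open("a") as f:            # append-only: never overwrite earlier rows
        f.write(row)

append_row("TILING_EXPLORATION_demo.md", "l1-256x128", "(256,128,256)",
           1, 78.20, 80.00, -8.0, "optimal")
```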