# Catlass Operator Performance Tuning

## Core Workflow

Read optimization guide → Obtain baseline data → Modify tiling configuration → Recompile and run → Profiler dual-path collection → Output performance comparison report (save to disk + display in chat) → Iterate and compare → Determine optimal configuration
## Preconditions

| Check Item | Description |
|---|---|
| Project Directory | OPS_PROJECT_ROOT exists, and catlass/docs/1_Practice/10_matmul_optimization.md is present under it |
| Operator Status | The operator compiles and runs normally |
| Performance Data | The baseline is preferably the <op>_torch_npu_profiler_report.md collected according to the ascendc-operator-performance-eval specification (or an equivalent Markdown report generated by the project's benchmark_*_torch_npu_profiler.py). ascendc-operator-precision-eval covers precision only and must not be used as a performance baseline |
Ask for clarification if conditions are unclear.
## Tuning Principles

- Tuning strategies are based on the official Catlass optimization guide
- Change only one variable per modification so that the effect of each change can be evaluated accurately
- Record every attempted configuration and its corresponding performance data
## NEVER / ALWAYS

NEVER:
- Modify tiling parameters without reading the Catlass optimization guide
- Modify multiple variables at once
- Ignore hardware resource limits
- Skip recompilation after modifying code
- Modify code without generating a comparison report when an NPU and profiler are available

ALWAYS:
- Read the Catlass optimization guide before taking action
- Record the configuration and performance data for each iteration
- Roll back the configuration if performance degrades
- Recompile using ascendc-operator-compile-debug after modifying code
- Generate and display the performance comparison report (save to disk + display in chat) after completing each round of submittable performance optimization
## Steps

### 1. Read Optimization Guide

Read the following document under OPS_PROJECT_ROOT:

<OPS_PROJECT_ROOT>/catlass/docs/1_Practice/10_matmul_optimization.md

Understand the adjustable parameters: TileShape, DispatchPolicy, Swizzle, etc. Be sure to read the documentation and make optimization attempts that follow its suggestions, rather than modifying parameters at random.
### 2. Obtain Baseline Data

The iteration baseline is derived from the per-case latency and the native/custom ratio in the comparison table of reports generated by ascendc-operator-performance-eval (or by equivalent torch_npu.profiler scripts in the operator directory). Before tuning, complete data collection in the same directory so that the initial report serves as the "pre-optimization" reference.
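A minimal sketch of how the per-case baseline can be derived, assuming the report's ratio column is benchmark per-step divided by custom per-step; the case names and latencies below are hypothetical placeholders, not real profiler output:

```python
# Sketch: compute the per-case ratio from per-step latencies (microseconds).
# Assumes ratio = benchmark / custom; all values below are hypothetical.

def ratio_table(cases):
    """cases: (name, custom_us, benchmark_us). Return rows with the ratio appended."""
    rows = []
    for name, custom_us, benchmark_us in cases:
        rows.append((name, custom_us, benchmark_us, benchmark_us / custom_us))
    return rows

baseline = ratio_table([
    ("case_0 (1280x16384x4096)", 85.0, 80.0),  # hypothetical per-step latencies
    ("case_1 (512x512x512)", 6.0, 6.3),
])
for name, custom, bench, ratio in baseline:
    print(f"{name}: custom={custom}us benchmark={bench}us ratio={ratio:.3f}")
```

A ratio below 1.0 here would mean the benchmark path is faster than the custom operator for that case, i.e., there is still tuning headroom.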
### 3. Modify Tiling Configuration

Modify the operator's tiling shape configuration according to the suggestions in the optimization guide:

| Modification Item | Description |
|---|---|
| TileShape Size | Adjust the L1/L0-layer TileShape |
| DispatchPolicy | Select a different scheduling policy |
| Swizzle Strategy | Try modifying the data-movement Swizzle |
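The "one variable per modification" principle can be sketched as a candidate enumerator; the parameter names and values below are hypothetical illustrations, not actual Catlass identifiers:

```python
# Sketch: enumerate candidate configurations that differ from the current
# best in exactly one variable, per the one-variable-per-modification
# principle. All parameter names and values are hypothetical.

def one_variable_candidates(current, search_space):
    """Yield configs differing from `current` in exactly one key."""
    for key, values in search_space.items():
        for value in values:
            if value != current[key]:
                candidate = dict(current)
                candidate[key] = value
                yield candidate

current = {"l1_tile": (128, 256, 256), "dispatch_policy": "default", "swizzle": 0}
space = {
    "l1_tile": [(128, 256, 256), (256, 128, 256)],
    "dispatch_policy": ["default", "preload"],
    "swizzle": [0, 1, 3],
}
candidates = list(one_variable_candidates(current, space))
```

Each candidate is then compiled, measured, and either kept or rolled back before the next variable is touched.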
4. Recompile and Collect Data
After modifying the code, you must recompile (using ascendc-operator-compile-debug), then run the operator to collect performance data and compare with the baseline:
- Performance improved → Record the configuration and continue trying other optimization suggestions
- Performance degraded → Roll back the configuration and try other solutions
- Performance stable → Fine-tune around the current configuration
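The keep/roll-back/fine-tune decision above can be sketched as follows; the 2% noise threshold is a hypothetical guard, to be tuned to the observed run-to-run variance of the profiler data:

```python
# Sketch of the keep / roll-back / fine-tune decision. The threshold is a
# hypothetical guard against measurement noise, not a Catlass constant.

NOISE_THRESHOLD = 0.02  # 2% -- hypothetical; calibrate against run-to-run variance

def decide(baseline_us, candidate_us, threshold=NOISE_THRESHOLD):
    """Return 'keep', 'rollback', or 'fine-tune' for a candidate config."""
    delta = (candidate_us - baseline_us) / baseline_us
    if delta < -threshold:
        return "keep"       # clearly faster: record the config and continue
    if delta > threshold:
        return "rollback"   # clearly slower: revert the change
    return "fine-tune"      # within noise: explore nearby configurations
```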
### 5. Iterate and Determine Optimal Configuration

Within a reasonable number of iterations, find the tiling configuration with the best performance. Record the detailed parameters of the optimal configuration and the corresponding performance metrics.
### 6. Mandatory Delivery of Performance Comparison Report

Every time a round of submittable performance optimization is completed (or when delivering tuning conclusions to the user), provided an NPU and torch_npu.profiler are available, you MUST do the following:

| No. | Requirement |
|---|---|
| 6.1 | Follow the ascendc-operator-performance-eval convention: warmup=5, active=5; collect data on the NPU for both the custom-operator and benchmark paths; the benchmark can be an equivalent API or a composition of small operators (consistent with the eval skill). |
| 6.2 | Save the Markdown performance comparison report to disk (the file name can match the project script, e.g., <op>_torch_npu_profiler_report.md). The report must include a comparison table with Case / Shape / Custom per-step / Benchmark per-step / Ratio, and must specify the test-case JSONL, the trace root directory, and the schedule. |
| 6.3 | Display in the current conversation: at minimum, paste the main body of the comparison table (or the row for the target shape) plus the full paths of the report and the JSONL, and explain the conclusion in 1–3 sentences (e.g., which path is faster, the native/custom ratio of key cases, whether expectations are met). Do not write only "optimized" without attaching data. |
| 6.4 | If the environment is unavailable (no NPU, or the profiler failed): honestly explain the reason, list the modified configurations and the collection commands the user should run locally, and do not forge reports. |
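A minimal sketch of rendering the §6.2 report to disk; the case data, paths, and output file name are hypothetical, and real values come from the torch_npu.profiler collection:

```python
# Sketch: render the comparison table as Markdown and save it to disk.
# All case data and paths below are hypothetical placeholders.
from pathlib import Path

def render_report(cases, jsonl_path, trace_root, schedule="warmup=5, active=5"):
    lines = [
        "# Performance Comparison Report",
        "",
        f"- Test-case JSONL: {jsonl_path}",
        f"- Trace root: {trace_root}",
        f"- Schedule: {schedule}",
        "",
        "| Case | Shape | Custom per-step (us) | Benchmark per-step (us) | Ratio |",
        "|---|---|---|---|---|",
    ]
    for case, shape, custom, bench in cases:
        lines.append(f"| {case} | {shape} | {custom:.2f} | {bench:.2f} | {bench / custom:.3f} |")
    return "\n".join(lines) + "\n"

report = render_report(
    [("case_0", "1280x16384x4096", 85.0, 80.0)],  # hypothetical numbers
    jsonl_path="cases/matmul_cases.jsonl",        # hypothetical path
    trace_root="traces/",
)
Path("demo_report.md").write_text(report)  # e.g., <op>_torch_npu_profiler_report.md
```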
### 6.5 Pre- vs. Post-Optimization Comparison (Mandatory When Comparing to the Previous Kernel Version)

When the current round of tasks includes conclusions about "performance changes relative to pre-optimization" (rather than only comparing with the benchmark), you MUST additionally meet the following:

| No. | Requirement |
|---|---|
| 6.5.1 | Before modifying the kernel/tiling: under the same CANN / driver / clock conditions, run a full torch_npu.profiler pass with the current binary (same convention as §6.1) and save the pre-optimization report to disk; the recommended name is <op>_torch_npu_profiler_report_PRE.md. Keep the corresponding trace directory for future reference. |
| 6.5.2 | After modifying the code and recompiling/installing: run another profiler round and save the post-optimization report to disk; the recommended name is <op>_torch_npu_profiler_report_POST.md. |
| 6.5.3 | Combined delivery: generate a pre- vs. post-optimization comparison table that includes at least Case / Shape / Project custom per-step (pre) / Project custom per-step (post) / Project Δ%; the benchmark column can be taken from the post-optimization report. You can use the merge script in the operator's directory (e.g., merge_profiler_before_after.py) or an equivalent table. |
| 6.5.4 | Display this comparison table in the current conversation (covering at least the shapes specified by the user), and honestly explain any cases where performance regressed (Δ% > 0); do not select only the improved cases and hide regressions. |

Note: This coexists with the dual-path report comparing to the benchmark (§6.2): the latter answers "how far is it from the framework", while §6.5 answers "how much better is this kernel modification than before".

Explanation: This step aligns with Phase 6 of catlass-operator-dev ("Display performance summary in chat"); this skill emphasizes that after tuning iterations you must also deliver a reviewable Markdown report plus a conversation summary.
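The PRE/POST merge of §6.5.3 can be sketched as follows; this is a standalone illustration with hypothetical data, not the project's merge_profiler_before_after.py:

```python
# Sketch of the PRE/POST merge: join per-case custom per-step latencies
# from the two reports and compute delta% = (post - pre) / pre * 100, so
# a negative delta% means the kernel got faster. Data is hypothetical.

def merge_pre_post(pre, post):
    """pre/post: {case: custom_per_step_us}. Return rows (case, pre, post, delta_pct)."""
    rows = []
    for case in sorted(pre.keys() & post.keys()):
        delta_pct = (post[case] - pre[case]) / pre[case] * 100.0
        rows.append((case, pre[case], post[case], delta_pct))
    return rows

rows = merge_pre_post(
    {"case_0": 85.0, "case_1": 6.0},  # hypothetical PRE per-step latencies (us)
    {"case_0": 78.2, "case_1": 6.3},  # hypothetical POST per-step latencies (us)
)
for case, p, q, d in rows:
    flag = "REGRESSED" if d > 0 else "improved"
    print(f"{case}: pre={p}us post={q}us delta={d:+.1f}% ({flag})")
```

Per §6.5.4, rows with positive Δ% must be reported, not filtered out.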
## Quality Verification

## Reference Materials

| Document | Purpose |
|---|---|
| <ASCEND_KERNEL_ROOT>/catlass/docs/1_Practice/10_matmul_optimization.md | Tuning strategies are based on this document |
| ascendc-operator-performance-eval | Performance-baseline and profiler-comparison process, JSONL specification |
| ascendc-operator-compile-debug | Recompile and install the whl after code modification |
## Recommended Continuous Tiling Search

If the target performance has not been achieved, maintain a tuning-notes document (or equivalent): list verified invalid/regressed configurations and pending items (Swizzle offset, Preload, Split-K, TLA, etc.) according to 10_matmul_optimization.md, and check off each item after the PRE/POST comparison, to avoid repeating the same mistakes.
## Tiling Exploration for a Single Target Shape (Note Solidification)

When the user only cares about a few large shapes (e.g., a single M×N×K) and wants a better L1/L0/Swizzle combination than the Catlass example / default tiling in the documentation:

| Requirement | Description |
|---|---|
| Exploration Notes | Add or maintain TILING_EXPLORATION_<brief>.md (e.g., TILING_EXPLORATION_1280x16384x4096.md), and add a one-line summary plus a link in the "Verified Conclusions" section of the notes document, to avoid mixing with full-shape parameter tuning. |
| Table Fields | Recommended columns: Date, Tag, L1 (M,N,K), L0 (if different from the default), Swizzle, Project custom per-step (μs), Benchmark (μs), Δ% relative to the default, and a conclusion (optimal / regressed / to be retested). Append a new row for each exploration; do not overwrite history. |
| Isolated Exploration (Full-File Kernel Replacement) | To run only one tile, you may temporarily replace the kernel source with a minimal kernel that uses a single instantiation for the entire problem. You must restore the production version with shape dispatch before the end of the same task round, then recompile and install; do not deliver the minimal exploration kernel as the final product. |
| Relationship with §6 | Before officially adopting a configuration, you must still run PRE/POST (§6.5) for the complete JSONL or the user-specified case set and save the report to disk; the single-shape table only records the search process and does not replace the full regression. |
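The append-only exploration-notes convention above can be sketched as follows; the file name, column set, and row values are hypothetical examples, not the project's exact schema:

```python
# Sketch: append one exploration row to a TILING_EXPLORATION_<brief>.md
# table without overwriting history. File name, columns, and values are
# hypothetical illustrations of the note convention.
from datetime import date
from pathlib import Path

HEADER = (
    "| Date | Tag | L1 (M,N,K) | Swizzle | Custom per-step (us) | Benchmark (us) | D% vs default | Conclusion |\n"
    "|---|---|---|---|---|---|---|---|\n"
)

def append_row(path, tag, l1, swizzle, custom_us, bench_us, delta_pct, conclusion):
    p = Path(path)
    if not p.exists():
        p.write_text(HEADER)          # create the table on first use
    row = (f"| {date.today().isoformat()} | {tag} | {l1} | {swizzle} "
           f"| {custom_us:.2f} | {bench_us:.2f} | {delta_pct:+.1f}% | {conclusion} |\n")
    with p.open("a") as f:            # append-only: never overwrite earlier rows
        f.write(row)

append_row("TILING_EXPLORATION_demo.md", "l1-256x128", "(256,128,256)",
           1, 78.20, 80.00, -8.0, "optimal")
```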