Catlass Operator End-to-End Development Orchestration

Skill Type: Process-oriented (six-phase workflow; Catlass source code preparation is integrated into Phase 1; sub-skills are orchestrated serially)

This skill orchestrates the development of Catlass operators on ascend-kernel, from scratch to production readiness. General capabilities (project skeleton, compilation debugging, interface documentation, precision, performance) reuse the `ascendc-*` sub-skills. Catlass-specific capabilities (source tree, design, Device/Host implementation) use the `catlass-*` sub-skills.

Core Principles

Six-phase Serial Execution: Project Initialization (including Catlass source code) → Design Documentation → Code Generation & Compilation Testing → Interface Documentation → Precision Evaluation → Performance Evaluation, executed in strict order
Sub-skill Execution: Each phase MUST open and follow the corresponding sub-skill, and must not substitute an implementation of its own
Phase Gating: Enter the next phase only after all checkpoints of the previous phase are passed
Design-Driven Coding: Code generation depends on the finalized `design.md` from catlass-operator-design and on the selection of catlass/examples
No Pre-written Design Documentation Required: The design phase generates and saves the document via catlass-operator-design
Documentation Closed Loop: After passing compilation testing, MUST generate PyTorch-style Chinese interface documentation (Phase 4) and display it in the chat interface
Precision Closed Loop: The operator must pass ≥30 comprehensive precision evaluation cases (Phase 5) to be considered complete
Performance Closed Loop: The operator must complete the torch_npu.profiler comparison evaluation and output a performance report (Phase 6); conclusions are governed by ascendc-operator-performance-eval
Result Visualization: Key results of Phases 3/4/5/6 MUST be displayed directly in the chat interface in Markdown or a similar format; do not output only file paths
Operator Naming: `op_name` (snake_case) MUST contain the substring `catlass`, consistent with the convention of existing Catlass operators in ascend-kernel
Honest Shutdown: When work cannot continue because of environment or dependency problems, stop and explain the specific reason and the steps already completed

Catlass Compilation and Operation (Common Pitfalls)

Build:

Set `BUILD_CATLASS_MODULE=ON`. CMake must use the Python interpreter that has torch_npu installed (e.g., via `-DPYTHON_EXECUTABLE` / `ASCEND_BUILD_PYTHON`). `CATLASS_ARCH` must match the chip (see `catlass-operator-code-gen/references/compile-catlass.md`). CANN can be a bundle root plus `cann-*/set_env.sh`.

pytest / torch_npu: If an error about `ASCEND_RUNTIME_PATH` is reported, set:

```bash
export ASCEND_RUNTIME_PATH="${ASCEND_TOOLKIT_HOME}/runtime"
```

Design/Code: Align with examples that compile against `catlass/include` and `catlass/examples`; for details, see `compile-catlass.md`.
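The Build line can be sketched as a small shell helper. This is a hypothetical illustration, not part of `build.sh`. The function names are made up, and the sketch assumes `BUILD_CATLASS_MODULE` and `CATLASS_ARCH` are passed as CMake `-D` cache variables, with `PYTHON_EXECUTABLE` selecting the interpreter. Take the actual `CATLASS_ARCH` value for your chip from compile-catlass.md.

```bash
# Hypothetical helper: print the CMake -D options named in "Build", one per line.
# Assumption: the build accepts them as CMake cache variables.
catlass_cmake_args() {
    local arch="$1"   # CATLASS_ARCH value for the target chip (see compile-catlass.md)
    local py="$2"     # Python interpreter that has torch_npu installed
    if [ -z "$arch" ] || [ -z "$py" ]; then
        echo "usage: catlass_cmake_args <CATLASS_ARCH> <python>" >&2
        return 1
    fi
    printf '%s\n' "-DBUILD_CATLASS_MODULE=ON" "-DCATLASS_ARCH=${arch}" "-DPYTHON_EXECUTABLE=${py}"
}

# Succeeds only if the given interpreter can import torch_npu.
python_has_torch_npu() {
    "$1" -c "import torch_npu" >/dev/null 2>&1
}
```

For example, `python_has_torch_npu "$(command -v python)"` should succeed inside the activated conda environment before the printed options are handed to CMake.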

Available Sub-skill List

Skill	Path	Responsibilities
`ascendc-operator-project-init`	`ascendc-operator-project-init/SKILL.md`	Detect/create ascend-kernel, generate operator skeleton in `csrc/ops/<op_name>/`
(none)	(Steps within Phase 1)	Clone `catlass/` under ASCEND_KERNEL_ROOT (at the same level as `csrc/`) so that `include/` and `examples/` are available
`catlass-operator-design`	`catlass-operator-design/SKILL.md`	Convert Catlass requirements into a finalized design document (recommended path: `csrc/ops/<op_name>/design.md`)
`catlass-operator-code-gen`	`catlass-operator-code-gen/SKILL.md`	Implement op_host / op_kernel and framework adaptation according to `design.md` and catlass/examples; internally calls the compile-and-test skill
`ascendc-operator-compile-debug`	`ascendc-operator-compile-debug/SKILL.md`	Compile, install the whl, and generate/run `tests/test_<op_name>.py` (called in Phase 5 of catlass-operator-code-gen; do not bypass code-gen and claim completion)
`ascendc-operator-doc-gen`	`ascendc-operator-doc-gen/SKILL.md`	Generate PyTorch-style Chinese API documentation `README.md` (mandatory phase)
`ascendc-operator-precision-eval`	`ascendc-operator-precision-eval/SKILL.md`	≥30 precision test cases and precision verification report (mandatory phase)
`ascendc-operator-performance-eval`	`ascendc-operator-performance-eval/SKILL.md`	JSONL test cases + torch_npu.profiler (warmup/active=5) + `op_statistic.csv` summary, output custom vs benchmark Markdown report (mandatory phase)
`catlass-operator-performance-optim`	`catlass-operator-performance-optim/SKILL.md`	Optional after delivery: Perform tiling/performance iteration according to Catlass documentation; after code changes, re-run the closed loop starting from Phase 3

Project Directory Terminology (Aligned with AscendC)

Term	Meaning
ASCEND_KERNEL_ROOT	ascend-kernel root directory: contains `build.sh`, `CMakeLists.txt`, `csrc/`
Operator Directory	`<ASCEND_KERNEL_ROOT>/csrc/ops/<op_name>/`
Catlass Source Code	`<ASCEND_KERNEL_ROOT>/catlass/` (must not be cloned inside `csrc/ops/<op_name>/`)

Workflow Overview

```
Phase 1  Project Init + Catlass Source  (ascendc-operator-project-init + clone catlass)
   │
   ▼
Phase 2  Catlass Design  (catlass-operator-design)
   │
   ▼
Phase 3  Code Gen + Framework Adapt + Compile Test  (catlass-operator-code-gen → ascendc-operator-compile-debug)
   │
   ▼
Phase 4  Interface Doc Gen  (ascendc-operator-doc-gen)
   │
   ▼
Phase 5  Precision Eval Report  (ascendc-operator-precision-eval)
   │
   ▼
Phase 6  Performance Eval Report  (ascendc-operator-performance-eval, profiler)
```

Input: Operator name (contains catlass) + Function description + Environment confirmation
Output: Deliverable operator + README + Precision report + Profiler performance report

Anti-pattern List (NEVER DO THESE)

❌ Do not skip Catlass source code preparation (i.e., do not do design or code generation without `catlass/include` and `catlass/examples`)
❌ Do not clone Catlass inside `csrc/ops/<op_name>/`; it must be cloned into `catlass/` under the project root
❌ Do not skip the design phase and directly write kernel/host code
❌ Do not implement the entire operator on your own without following the catlass-operator-code-gen process
❌ Do not modify framework registration without authorization before code generation (follow the conventions of project-init / code-gen)
❌ Do not manually replace the build, install, and basic-test closed loop owned by compile-debug (it must be triggered via Phase 5 of code-gen)
❌ Do not skip the interface documentation phase (Phase 4 must be executed after Phase 3 passes)
❌ Do not skip the precision evaluation phase (Phase 5 must be executed after Phase 4 passes)
❌ Do not skip the performance evaluation phase (Phase 6 must be executed after Phase 5 passes)
❌ Do not base the final performance conclusion on a data collection method that differs from ascendc-operator-performance-eval
❌ Do not reference non-existent skills

Phase 0: Requirements Collection

Goal: Confirm the minimum information set and operating environment for Catlass operator development (aligned with ascendc-operator-dev Phase 0, with additional Catlass naming constraints).

Step 0.1: Environment Confirmation (MUST be completed before any development actions)

CANN Environment

Check `ASCEND_HOME_PATH` (`echo $ASCEND_HOME_PATH`):
Already set: use it as `CANN_PATH`; no need to ask again
Not set: MUST ask the user for the CANN path (e.g., `/usr/local/Ascend/ascend-toolkit`)

```bash
source ${CANN_PATH}/*/set_env.sh
```
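Because `${CANN_PATH}/*/set_env.sh` can match more than one file (for example, when there are several versioned directories), a more defensive variant resolves exactly one script first. The minimal sketch below assumes the script sits either directly under the CANN path or one directory below it, as in a bundle root with `cann-*/set_env.sh`. `find_cann_set_env` is a hypothetical name.

```bash
# Hypothetical helper: print exactly one set_env.sh path for a CANN root.
# Order of preference (an assumption): <root>/set_env.sh, then the first
# <root>/*/set_env.sh in glob order (e.g. a bundle root with cann-*/set_env.sh).
find_cann_set_env() {
    local root="$1" f
    if [ -z "$root" ]; then
        echo "CANN path unknown: ask the user" >&2
        return 1
    fi
    if [ -f "$root/set_env.sh" ]; then
        printf '%s\n' "$root/set_env.sh"
        return 0
    fi
    for f in "$root"/*/set_env.sh; do
        # With no match, the unexpanded pattern simply fails the -f test.
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    echo "no set_env.sh found under $root" >&2
    return 1
}
```

Typical use: `env_sh=$(find_cann_set_env "$CANN_PATH") && source "$env_sh"`.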

Conda Environment

Check `CONDA_DEFAULT_ENV`:
Activated and not `base`: use it directly
Not activated, or `base`: MUST ask the user for the conda environment name

```bash
conda activate <env_name>
```
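The conda rule above reduces to a single test on `CONDA_DEFAULT_ENV`. A minimal sketch follows; the function name is hypothetical.

```bash
# Hypothetical helper: succeed only when a non-base conda env is active.
# On failure, the workflow must ask the user for an environment name.
conda_env_ok() {
    case "${CONDA_DEFAULT_ENV:-}" in
        "" | base) return 1 ;;  # not activated, or still on base
        *) return 0 ;;
    esac
}
```

For example: `conda_env_ok || echo "ask the user for an env name, then run conda activate"`.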

Environment Confirmation Checkpoints

CANN path is confirmed and `set_env.sh` can be sourced
Conda environment is confirmed and can be activated

Step 0.2: Operator Requirements Collection

Information	Format Requirement	Mandatory	Description
CANN Path	Absolute path	Yes	Same as ascendc, can be detected automatically
Conda Environment	String	Yes	Same as ascendc, can be detected automatically
Operator Name	snake_case, contains `catlass`	Yes	e.g., `catlass_matmul_basic`
Function Description	Text/Formula/Benchmark Example	Yes	Consistent with Catlass capability scope

Optional: supported dtypes and SoC; defaults can follow catlass-operator-design / the platform API.
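The naming requirement can be checked mechanically before any files are created. In this sketch, snake_case is taken to mean lowercase letters and digits in words separated by single underscores, starting with a letter. That definition and the function name are assumptions, not rules stated by ascend-kernel.

```bash
# Hypothetical check for the naming rule: snake_case (assumed here to mean
# lowercase letters/digits in underscore-separated words, starting with a
# letter) and containing the substring "catlass".
valid_catlass_op_name() {
    printf '%s\n' "$1" | grep -Eq '^[a-z][a-z0-9]*(_[a-z0-9]+)*$' || return 1
    case "$1" in
        *catlass*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `valid_catlass_op_name catlass_matmul_basic` succeeds, while `matmul_basic` (no `catlass`) and `CatlassMatmul` (not snake_case) are rejected.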

Decision Tree

User Request	Handling Method
"Develop/generate a Catlass operator"	Complete Step 0.1 → Verify name contains `catlass` → Confirm function → Execute full workflow
"Continue Catlass operator development"	Complete Step 0.1 → Detect current phase according to Error Recovery and resume execution

Acceptance Criteria

CANN + Conda are confirmed
`op_name` is confirmed and contains `catlass`
Function description is clear

Phase 1: Project Initialization + Catlass Source Code Preparation

Step 1.1: Project Skeleton

Call Skill: `ascendc-operator-project-init`

MANDATORY: Execute according to ascendc-operator-project-init:
1. Detect or create ascend-kernel
2. Create operator skeleton in csrc/ops/<op_name>/
3. List the registration points to update (implemented later by catlass-operator-code-gen)

Checkpoints (Step 1.1)

`ASCEND_KERNEL_ROOT` contains `build.sh`, `CMakeLists.txt`, and `csrc/`
`csrc/ops/<op_name>/` is created and contains placeholder `design.md`, `op_host/`, `op_kernel/`, `CMakeLists.txt`, etc. (as defined by ascendc-operator-project-init)
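The Step 1.1 checkpoints can be verified with plain file tests. A hypothetical sketch: it checks only the entries listed above, and the full skeleton remains whatever ascendc-operator-project-init generates.

```bash
# Hypothetical helper: check the Step 1.1 checkpoints for <root> and <op_name>.
check_step_1_1() {
    local root="$1" op="$2" item
    # Project-level entries required in ASCEND_KERNEL_ROOT.
    for item in build.sh CMakeLists.txt csrc; do
        if [ ! -e "$root/$item" ]; then
            echo "missing: $root/$item" >&2
            return 1
        fi
    done
    # Operator skeleton entries named in the checkpoint.
    for item in design.md op_host op_kernel CMakeLists.txt; do
        if [ ! -e "$root/csrc/ops/$op/$item" ]; then
            echo "missing: $root/csrc/ops/$op/$item" >&2
            return 1
        fi
    done
    return 0
}
```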

Step 1.2: Catlass Source Code

This step has no standalone skill file, but it must be executed according to the following requirements.

Prerequisite: Step 1.1 is completed

Execution Content

Ensure `catlass/` exists under `ASCEND_KERNEL_ROOT` and contains `catlass/include` and `catlass/examples`.

If it does not exist, MUST run the following in the project root (cloning inside `csrc/ops/<op_name>/` is prohibited):

```bash
git clone https://gitcode.com/cann/catlass.git catlass
```

Checkpoints (Step 1.2)

`<ASCEND_KERNEL_ROOT>/catlass/include` exists
`<ASCEND_KERNEL_ROOT>/catlass/examples` exists
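Step 1.2 can be made safe to re-run with a guard that clones only when `catlass/` is absent. The function name is hypothetical. Refusing to touch an existing but incomplete `catlass/` directory is a design choice of this sketch, not a rule from the skill.

```bash
# Hypothetical helper: make <root>/catlass available, cloning only if absent.
ensure_catlass() {
    local root="$1"
    # Already satisfied: both checkpoint directories exist.
    if [ -d "$root/catlass/include" ] && [ -d "$root/catlass/examples" ]; then
        return 0
    fi
    # Partial or unrelated directory: leave it for a human to inspect.
    if [ -e "$root/catlass" ]; then
        echo "$root/catlass exists but lacks include/ or examples/" >&2
        return 1
    fi
    # Clone at the project root, never inside csrc/ops/<op_name>/.
    ( cd "$root" && git clone https://gitcode.com/cann/catlass.git catlass ) || return 1
    [ -d "$root/catlass/include" ] && [ -d "$root/catlass/examples" ]
}
```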

All Phase 1 checkpoints passed → Enter Phase 2

Phase 2: Catlass Design Documentation

Call Skill: `catlass-operator-design`

Execution Content

MANDATORY: Execute according to catlass-operator-design:
1. Analyze requirements and Catlass component boundaries
2. Align with implementable paths of catlass/examples and catlass/include
3. Finalize the design and save it to the recommended path csrc/ops/<op_name>/design.md (the path that doc-gen / precision-eval / performance-eval read)

Checkpoints

`csrc/ops/<op_name>/design.md` is finalized (non-empty and no longer a placeholder)
The reference example path, Kernel/Host contract, dtype/shape constraints, etc. are clearly specified (as required by catlass-operator-design)
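A coarse pre-check for these checkpoints can catch an empty design file or one that never names a reference example. This is a hypothetical heuristic. It cannot judge whether the Kernel/Host contract or the dtype/shape constraints are complete, so catlass-operator-design remains the authority.

```bash
# Hypothetical heuristic: design.md must be non-empty and must mention at
# least one catlass/examples/ path as its reference example.
design_md_ready() {
    local design="$1"
    if [ ! -s "$design" ]; then
        echo "design.md missing or empty: $design" >&2
        return 1
    fi
    if ! grep -q 'catlass/examples/' "$design"; then
        echo "design.md does not reference a catlass/examples path" >&2
        return 1
    fi
    return 0
}
```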

All checkpoints passed → Enter Phase 3

Phase 3: Code Generation + Framework Adaptation + Compilation Testing

Call Skill: `catlass-operator-code-gen` (its Phase 5 MUST call `ascendc-operator-compile-debug`)

Execution Content

MANDATORY: Execute according to catlass-operator-code-gen (aligned with ascendc-operator-code-gen phase structure):

Phase 1: Load GUIDE / references (including compile-catlass, chapters aligned with ascendc code-gen)
Phase 2: Read design.md, lock catlass/examples path and type system
Phase 3: Generate op_kernel + op_host and register the Catlass compilation options in CMake (BUILD_CATLASS_MODULE, CATLASS_ARCH, etc.; see compile-catlass.md)
Phase 4: Framework adaptation: ops.h, register.cpp, csrc/CMakeLists.txt
Phase 5: Compilation, installation, and testing: call ascendc-operator-compile-debug (build.sh, pip install, tests/test_<op_name>.py; failure troubleshooting follows that skill)

Checkpoints

`op_host` and `op_kernel` are consistent with `design.md` and the selected example
Framework registration is consistent with the repository template (e.g., `namespace ascend_kernel`)
Compilation succeeds and the whl can be installed
`tests/test_<op_name>.py` exists and passes (exit code 0)
Key compilation/test results are summarized and displayed in chat
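The two test checkpoints (the test file exists, and it exits with code 0) can be expressed as follows. This is a hypothetical helper. It assumes pytest is installed in the active conda environment and that `tests/` sits under `ASCEND_KERNEL_ROOT`. Building and installing the whl remain the job of ascendc-operator-compile-debug.

```bash
# Hypothetical helper: run tests/test_<op_name>.py and return pytest's exit code.
# Fails without running pytest if the test file does not exist.
run_op_test() {
    local root="$1" op="$2"
    if [ ! -f "$root/tests/test_${op}.py" ]; then
        echo "missing $root/tests/test_${op}.py" >&2
        return 1
    fi
    # Subshell keeps the caller's working directory unchanged.
    ( cd "$root" && python -m pytest -q "tests/test_${op}.py" )
}
```

For example: `run_op_test "$ASCEND_KERNEL_ROOT" catlass_matmul_basic && echo "basic test passed"`.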

All checkpoints passed → Enter Phase 4

Phase 4: Interface Documentation Generation

Call Skill: `ascendc-operator-doc-gen`

Execution Content

MANDATORY: Execute according to ascendc-operator-doc-gen:
- Extract interface information from register.cpp, ops.h, design.md, op_host, tests
- Generate csrc/ops/<op_name>/README.md (PyTorch-style Chinese)
- Display document key points or full text in chat interface

Checkpoints

`README.md` is written to the operator directory
Consistent with the `m.def` registration and actual Python calls
Displayed in the chat interface
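The "consistent with `m.def`" checkpoint can be partially automated by confirming that every operator name registered in register.cpp appears in README.md. This is a coarse, hypothetical check. It assumes each registration is written as `m.def("<name>(...` on one line, and it cannot verify argument lists or semantics.

```bash
# Hypothetical check: every name registered via m.def("<name>(...") must be
# mentioned somewhere in README.md. Prints each missing name to stderr.
readme_covers_registered_ops() {
    local register_cpp="$1" readme="$2" name missing=0
    for name in $(grep -o 'm\.def("[A-Za-z0-9_]*' "$register_cpp" | sed 's/^m\.def("//'); do
        if ! grep -qF "$name" "$readme"; then
            echo "README.md does not mention $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```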

All checkpoints passed → Enter Phase 5

Phase 5: Precision Evaluation Report

Call Skill: `ascendc-operator-precision-eval`

Execution Content

MANDATORY: Execute according to ascendc-operator-precision-eval:
- Number of test cases ≥ 30, covering shapes × dtypes × boundaries
- Output to csrc/ops/<op_name>/test/, generate Markdown precision report
- Display overview, failure summary and key findings in chat interface (do not only provide paths)
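The "≥ 30 test cases" rule can be gated before the full run by counting the test IDs pytest collects. This is a hypothetical sketch. It assumes `python -m pytest --collect-only -q` prints one `file::test[params]` ID per line, which is what recent pytest versions do. The function names are made up.

```bash
# Count collected pytest node IDs read from stdin (lines containing "::").
count_collected_cases() {
    grep -c '::'
}

# Fail if fewer than <min> cases (default 30, per this skill) were collected.
precision_case_gate() {
    local n="$1" min="${2:-30}"
    if [ "$n" -lt "$min" ]; then
        echo "only $n precision cases; need >= $min" >&2
        return 1
    fi
    return 0
}
```

For example: `n=$(python -m pytest --collect-only -q csrc/ops/<op_name>/test | count_collected_cases) && precision_case_gate "$n"`.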

Checkpoints

All pytest precision test cases pass
`<op_name>_precision_report.md` (or the report name specified by ascendc-operator-precision-eval) is generated
The precision result summary is displayed in chat

FAIL Closed Loop: Root cause analysis → Revise design (Phase 2) or code (Phase 3) → Retest via Phase 4 and Phase 5 (after any code change, rebuild and retest through Phase 3 first)

All checkpoints passed → Enter Phase 6

Phase 6: Performance Evaluation Report

Call Skill: `ascendc-operator-performance-eval`

Execution Content

MANDATORY: Follow only the procedure in ascendc-operator-performance-eval SKILL.md:
- Maintain JSONL test cases in csrc/ops/<op_name>/test/; read design.md before generating them
- Use torch_npu.profiler with warmup=5, active=5
- Summarize metrics from ASCEND_PROFILER_OUTPUT/op_statistic.csv and related outputs, and produce a custom operator vs. benchmark Markdown report
- Display comparison table and brief conclusion in chat interface
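For the summary step, a small shell sketch can locate every `op_statistic.csv` under a profiler output root and pull one named column for one operator type. The function names are hypothetical. The sketch assumes each CSV has a header row and no quoted commas. The column names in the example below are placeholders, so read the real header of your `op_statistic.csv` first. The report itself must still follow ascendc-operator-performance-eval.

```bash
# List every op_statistic.csv written under a profiler output root.
find_op_statistic() {
    find "$1" -type f -path '*/ASCEND_PROFILER_OUTPUT/op_statistic.csv' | sort
}

# Print column <col_name> for each row whose <key_col> column equals <key>.
# Column positions are looked up by name in the header row.
csv_lookup() {
    local file="$1" key_col="$2" key="$3" col_name="$4"
    awk -F',' -v kc="$key_col" -v k="$key" -v c="$col_name" '
        NR == 1 { for (i = 1; i <= NF; i++) { if ($i == kc) ki = i; if ($i == c) ci = i }; next }
        ki && ci && $ki == k { print $ci }
    ' "$file"
}
```

For example: `csv_lookup "$f" "OP Type" "MatMul" "Avg Time(us)"`, where every argument after the file name is a placeholder to be matched against the real CSV header.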

Checkpoints

Test cases and report format meet the requirements of ascendc-operator-performance-eval (including DType, dual-path comparison, etc.)
The report file is saved in the operator's `test/` directory
The performance summary is displayed in chat

All checkpoints passed → Catlass operator main workflow completed

Optional Post-delivery: Performance Optimization

Call Skill: `catlass-operator-performance-optim`

MUST ask the user whether to enter tuning; never skip this question by default.

User agrees → Modify tiling/implementation according to catlass-operator-performance-optim; any code change → Re-run the workflow starting from Phase 3 (Phase 3→4→5→6) until it meets the standards again
User refuses → End

Inter-phase Data Flow

```
Phase 1 ──▶ Phase 2
  Output: ascend-kernel + ops/<op>/ skeleton + catlass/include, examples
  Input:  operator name, catlass/ available

Phase 2 ──▶ Phase 3
  Output: design.md (finalized)
  Input:  example path, types, Host contract

Phase 3 ──▶ Phase 4
  Output: installed whl + test_<op>.py
  Input:  register.cpp / ops.h / design.md / op_host

Phase 4 ──▶ Phase 5
  Output: README.md
  Input:  interface, dtype, constraints, calling method

Phase 5 ──▶ Phase 6
  Output: precision passed + report
  Input:  operator name, benchmark API, JSONL and profiler process

Phase 6 ──▶ (optional) catlass-operator-performance-optim
  Output: performance report (profiler)
  Next:   enter tuning only after user confirmation
```

Status Tracking Table

Phase	Prerequisites	Called Skill / Action	Key Deliverables
0. Requirements Collection	None	—	CANN + Conda + `op_name` (contains catlass) + Function description
1. Project + Catlass	Phase 0	`ascendc-operator-project-init` + root directory `catlass/`	Skeleton + Catlass source tree
2. Design	Phase 1	`catlass-operator-design`	`design.md`
3. Code & Testing	Phase 2	`catlass-operator-code-gen` → `compile-debug`	Runnable operator + basic test passed
4. Interface Documentation	Phase 3	`ascendc-operator-doc-gen`	`README.md`
5. Precision Evaluation	Phase 4	`ascendc-operator-precision-eval`	≥30 cases + Precision report
6. Performance Evaluation	Phase 5	`ascendc-operator-performance-eval`	JSONL + Profiler report
(Optional) Tuning	Phase 6 + User confirmation	`catlass-operator-performance-optim`	Iterated implementation and report

Error Recovery

Resume from Breakpoint

When the user says "Continue Catlass operator development":

Detection Condition	Determined Phase	Recovery Action
`csrc/ops/<op_name>/` does not exist	Phase 1 incomplete	Start from Phase 1 Step 1.1
`catlass/examples` does not exist	Phase 1 incomplete	Complete Step 1.2 cloning
`design.md` is empty or placeholder	Phase 2 incomplete	Start from Phase 2
`op_host` / `op_kernel` are still skeletons or inconsistent with the design	Phase 3 incomplete	Start from Phase 3
whl not installed or `tests/test_<op_name>.py` fails	Phase 3 incomplete	Recover within the compile-debug workflow
No `README.md`	Phase 4 incomplete	Start from Phase 4
No precision report in `test/`, or not all precision cases pass	Phase 5 incomplete	Recover from Phase 5
No performance report or does not meet performance-eval requirements	Phase 6 incomplete	Recover from Phase 6
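The file-based rows of this table can be sketched as a single detector that prints the step to resume from. This is a hypothetical helper. It assumes `tests/` sits under `ASCEND_KERNEL_ROOT`, and its report file-name patterns (`*precision_report*.md`, `*perf*.md`) are guesses. Whether code is still a skeleton, and whether tests or precision cases actually pass, cannot be decided from file existence. Those rows still need the corresponding sub-skill.

```bash
# True if at least one of the (glob-expanded) arguments is an existing file.
has_match() {
    local f
    for f in "$@"; do
        [ -f "$f" ] && return 0
    done
    return 1
}

# Hypothetical: print the phase/step to resume from, following the table above.
detect_resume_phase() {
    local root="$1" op="$2" d
    d="$root/csrc/ops/$op"
    [ -d "$d" ] || { echo "1.1"; return 0; }
    [ -d "$root/catlass/examples" ] || { echo "1.2"; return 0; }
    [ -s "$d/design.md" ] || { echo "2"; return 0; }
    [ -f "$root/tests/test_${op}.py" ] || { echo "3"; return 0; }
    [ -f "$d/README.md" ] || { echo "4"; return 0; }
    has_match "$d/test/"*precision_report*.md || { echo "5"; return 0; }
    has_match "$d/test/"*perf*.md || { echo "6"; return 0; }
    echo "done"
}
```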

Compilation/Testing Failure

Handled by ascendc-operator-compile-debug (triggered via catlass-operator-code-gen); retry limits and troubleshooting follow the compile-debug skill.

catlass-operator-dev
