external-gitcode-ascend-catlass-operator-dev

Catlass 算子端到端开发编排

Catlass Operator End-to-End Development Orchestration

Skill 类型：流程导向型（六阶段工作流；Catlass 源码准备并入 Phase 1，子技能串行编排）

本 skill 编排 ascend-kernel 上 Catlass 算子从零到生产可用；通用能力（工程骨架、编译调试、接口文档、精度、性能）复用 ascendc-* 子 skill，Catlass 专属（源码树、设计、Device/Host 落地）使用 catlass-* 子 skill。

Skill Type: Process-oriented (six-phase workflow; Catlass source code preparation is folded into Phase 1; sub-skills are orchestrated serially)

This skill orchestrates the development of a Catlass operator on ascend-kernel from scratch to production-ready. General capabilities (project skeleton, compilation and debugging, interface documentation, precision, performance) reuse the ascendc-* sub-skills; Catlass-specific work (source tree, design, Device/Host implementation) uses the catlass-* sub-skills.

核心原则

Core Principles

六阶段串行：工程初始化（含 Catlass 源码）→ 设计文档 → 代码生成与编译测试 → 接口文档 → 精度评估 → 性能评测，严格顺序执行
子技能执行：每个阶段 MUST 打开并遵循对应子 skill，不得自行替代实现
阶段门控：前一阶段检查点全部通过后才进入下一阶段
设计驱动编码：代码生成依赖 catlass-operator-design 定稿的
```
design.md
```
与 catlass/examples 选型
无需用户预先手写设计文档：设计阶段由 catlass-operator-design 生成并落盘
文档闭环：编译测试通过后 MUST 生成 PyTorch 风格中文接口文档（Phase 4），并在聊天界面展示
精度闭环：算子必须通过 ≥30 例全面精度评估（Phase 5）才算完成
性能闭环：算子必须完成 torch_npu.profiler 对比评测并输出性能报告（Phase 6）；结论以 ascendc-operator-performance-eval 为准
结果可视化：Phase 3/4/5/6 的关键结果 MUST 以 Markdown 等形式直接展示在聊天界面，不要仅输出路径
算子命名：
```
op_name
```
（snake_case）必须包含子串 catlass
，与 ascend-kernel 内既有 Catlass 算子约定一致
诚实停机：因环境或依赖无法继续时，说明具体原因与已完成步骤后停止

Six-phase serial execution: Project initialization (including Catlass source code) → Design documentation → Code generation & compilation testing → Interface documentation → Precision evaluation → Performance evaluation, executed in strict order
Sub-skill execution: Each phase MUST open and follow the corresponding sub-skill and must not substitute its own implementation
Phase gating: Enter the next phase only after all checkpoints of the previous phase have passed
Design-driven coding: Code generation depends on the `design.md` finalized by catlass-operator-design and on the example selected from catlass/examples
No pre-written design document required: The design phase generates and saves the design document via catlass-operator-design
Documentation closed loop: After compilation and testing pass, MUST generate PyTorch-style Chinese interface documentation (Phase 4) and display it in the chat interface
Precision closed loop: The operator must pass a comprehensive precision evaluation of ≥30 cases (Phase 5) to be considered complete
Performance closed loop: The operator must complete the torch_npu.profiler comparative evaluation and output a performance report (Phase 6); conclusions follow ascendc-operator-performance-eval
Result visualization: Key results of Phases 3/4/5/6 MUST be shown directly in the chat interface as Markdown or a similar format; do not output only file paths
Operator naming: `op_name` (snake_case) MUST contain the substring catlass, consistent with the convention of existing Catlass operators in ascend-kernel
Honest stop: When unable to continue because of the environment or dependencies, explain the specific reason and the steps already completed, then stop
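
The serial order and phase gating described above can be sketched as a small loop. This is only an illustration: the phase identifiers and the `checks` callables are hypothetical stand-ins for the real checkpoints, which each sub-skill defines.

```python
from typing import Callable, Dict, List

# Phase order taken from the six-phase principle above; the identifiers
# themselves are hypothetical names chosen for this sketch.
PHASES: List[str] = [
    "project_init",
    "design",
    "code_gen_and_test",
    "interface_doc",
    "precision_eval",
    "performance_eval",
]


def run_gated(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run phases in order and stop at the first failed checkpoint.

    `checks` maps a phase name to a callable that returns True only when
    every checkpoint of that phase has passed.
    """
    completed: List[str] = []
    for phase in PHASES:
        if not checks[phase]():
            break  # gate closed: later phases must not start
        completed.append(phase)
    return completed
```

The returned list makes it easy to report which phases completed before a stop, which matches the requirement to state the completed steps when halting.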

Catlass 编译与运行（易错摘要）

Catlass Build and Run (Common Pitfalls)

构建：

BUILD_CATLASS_MODULE=ON

；CMake 使用含 torch_npu 的 Python（如 -DPYTHON_EXECUTABLE
/
ASCEND_BUILD_PYTHON
）；CATLASS_ARCH
与芯片一致（见 catlass-operator-code-gen/references/compile-catlass.md
）；CANN 可为 bundle 根 + cann-*/set_env.sh
。

pytest / torch_npu：若报 ASCEND_RUNTIME_PATH
：

export ASCEND_RUNTIME_PATH="${ASCEND_TOOLKIT_HOME}/runtime"

。

设计/代码：与 catlass/include
、catlass/examples
可对齐编译的示例一致，细则见 compile-catlass.md
。

Build: set `BUILD_CATLASS_MODULE=ON`; CMake must use a Python that has torch_npu installed (e.g., via `-DPYTHON_EXECUTABLE` / `ASCEND_BUILD_PYTHON`); `CATLASS_ARCH` must match the chip (see catlass-operator-code-gen/references/compile-catlass.md); CANN may be the bundle root plus `cann-*/set_env.sh`.

pytest / torch_npu: If an error about `ASCEND_RUNTIME_PATH` is reported, run:

export ASCEND_RUNTIME_PATH="${ASCEND_TOOLKIT_HOME}/runtime"

Design/Code: Stay consistent with the compilable examples in catlass/include and catlass/examples; see compile-catlass.md for details.
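
When tests are launched from a Python driver rather than an interactive shell, the `ASCEND_RUNTIME_PATH` workaround can be applied to the child environment. The helper below is a minimal sketch; the function name `with_runtime_path` is hypothetical, and it only mirrors the `export` line above.

```python
from typing import Dict, Mapping


def with_runtime_path(env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of env with ASCEND_RUNTIME_PATH filled in when missing."""
    result = dict(env)
    toolkit_home = result.get("ASCEND_TOOLKIT_HOME")
    if "ASCEND_RUNTIME_PATH" not in result and toolkit_home:
        # Mirrors: export ASCEND_RUNTIME_PATH="${ASCEND_TOOLKIT_HOME}/runtime"
        result["ASCEND_RUNTIME_PATH"] = f"{toolkit_home}/runtime"
    return result
```

A caller could pass `with_runtime_path(os.environ)` as the `env` argument of `subprocess.run` when invoking pytest. An existing `ASCEND_RUNTIME_PATH` is left untouched.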

可用子 Skill 清单

Available Sub-Skill List

Skill	路径	职责
`ascendc-operator-project-init`	`ascendc-operator-project-init/SKILL.md`	检测/创建 ascend-kernel，在 `csrc/ops/<op_name>/` 生成算子骨架
—	（Phase 1 内步骤）	在 ASCEND_KERNEL_ROOT 克隆 `catlass/` （与 `csrc/` 同级），使 `include/` 、 `examples/` 可用
`catlass-operator-design`	`catlass-operator-design/SKILL.md`	将 Catlass 需求转为定稿设计文档（推荐 `csrc/ops/<op_name>/design.md` ）
`catlass-operator-code-gen`	`catlass-operator-code-gen/SKILL.md`	按 `design.md` 与 catlass/examples 落地 op_host / op_kernel、框架适配，并内部调用编译测试 skill
`ascendc-operator-compile-debug`	`ascendc-operator-compile-debug/SKILL.md`	编译、安装 whl、生成/运行 `tests/test_<op_name>.py` （由 catlass-operator-code-gen 阶段 5 调用，勿单独跳过 code-gen 直接宣称完成）
`ascendc-operator-doc-gen`	`ascendc-operator-doc-gen/SKILL.md`	生成 PyTorch 风格中文 API 文档 `README.md` （必选阶段）
`ascendc-operator-precision-eval`	`ascendc-operator-precision-eval/SKILL.md`	≥30 例精度测试与精度验证报告（必选阶段）
`ascendc-operator-performance-eval`	`ascendc-operator-performance-eval/SKILL.md`	JSONL 用例 + torch_npu.profiler（warmup/active=5）+ `op_statistic.csv` 汇总，输出自定义 vs 标杆 Markdown 报告（必选阶段）
`catlass-operator-performance-optim`	`catlass-operator-performance-optim/SKILL.md`	交付后可选：按 Catlass 文档做 tiling/性能迭代；代码变更后须回到 Phase 3 起复跑闭环

Skill	Path	Responsibility
`ascendc-operator-project-init`	`ascendc-operator-project-init/SKILL.md`	Detect/create ascend-kernel, generate operator skeleton in `csrc/ops/<op_name>/`
—	(Step within Phase 1)	Clone `catlass/` in ASCEND_KERNEL_ROOT (at the same level as `csrc/` ) to make `include/` and `examples/` available
`catlass-operator-design`	`catlass-operator-design/SKILL.md`	Convert Catlass requirements into finalized design documents (recommended path: `csrc/ops/<op_name>/design.md` )
`catlass-operator-code-gen`	`catlass-operator-code-gen/SKILL.md`	Implement op_host / op_kernel and framework adaptation according to `design.md` and catlass/examples, and internally call the compilation testing skill
`ascendc-operator-compile-debug`	`ascendc-operator-compile-debug/SKILL.md`	Compile, install the whl package, generate/run `tests/test_<op_name>.py` (called in Stage 5 of catlass-operator-code-gen; do not bypass code-gen and claim completion)
`ascendc-operator-doc-gen`	`ascendc-operator-doc-gen/SKILL.md`	Generate PyTorch-style Chinese API document `README.md` (mandatory phase)
`ascendc-operator-precision-eval`	`ascendc-operator-precision-eval/SKILL.md`	≥30 precision test cases and precision verification report (mandatory phase)
`ascendc-operator-performance-eval`	`ascendc-operator-performance-eval/SKILL.md`	JSONL test cases + torch_npu.profiler (warmup/active=5) + `op_statistic.csv` summary, output Markdown report of custom vs benchmark operators (mandatory phase)
`catlass-operator-performance-optim`	`catlass-operator-performance-optim/SKILL.md`	Optional after delivery: Perform tiling/performance iteration according to Catlass documentation; after code changes, re-run the closed loop starting from Phase 3

工程目录术语（与 AscendC 对齐）

Project Directory Terminology (Aligned with AscendC)

术语	含义
ASCEND_KERNEL_ROOT	ascend-kernel 根目录：含 `build.sh` 、 `CMakeLists.txt` 、 `csrc/`
算子目录	`<ASCEND_KERNEL_ROOT>/csrc/ops/<op_name>/`
Catlass 源码	`<ASCEND_KERNEL_ROOT>/catlass/` （禁止在 `csrc/ops/<op>/` 内克隆）

Term	Meaning
ASCEND_KERNEL_ROOT	Root directory of ascend-kernel: contains `build.sh` , `CMakeLists.txt` , `csrc/`
Operator Directory	`<ASCEND_KERNEL_ROOT>/csrc/ops/<op_name>/`
Catlass Source Code	`<ASCEND_KERNEL_ROOT>/catlass/` (must not be cloned inside `csrc/ops/<op>/` )
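
The directory terms above map directly onto path helpers. The following is a sketch using only `pathlib`; the function names are hypothetical and are not part of any sub-skill.

```python
from pathlib import Path


def is_ascend_kernel_root(root: Path) -> bool:
    """True when root contains build.sh, CMakeLists.txt and csrc/."""
    return (
        (root / "build.sh").is_file()
        and (root / "CMakeLists.txt").is_file()
        and (root / "csrc").is_dir()
    )


def operator_dir(root: Path, op_name: str) -> Path:
    """<ASCEND_KERNEL_ROOT>/csrc/ops/<op_name>/"""
    return root / "csrc" / "ops" / op_name


def catlass_dir(root: Path) -> Path:
    """<ASCEND_KERNEL_ROOT>/catlass/, a sibling of csrc/ and never inside it."""
    return root / "catlass"
```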

工作流总览

Workflow Overview

┌─────────────────────────────┐   ┌──────────────┐   ┌───────────────────────────┐   ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│  Phase 1                    │   │  Phase 2     │   │  Phase 3                  │   │  Phase 4         │   │  Phase 5         │   │  Phase 6         │
│  工程初始化 + Catlass 源码   │──▶│  Catlass 设计 │──▶│  代码生成+框架适配+编译测试 │──▶│  接口文档生成     │──▶│  精度评估报告     │──▶│  性能评测报告     │
│  project-init + clone      │   │  catlass-    │   │  catlass-code-gen →       │   │  doc-gen         │   │  precision-eval  │   │  performance-eval│
│  catlass                   │   │  design      │   │  compile-debug            │   │                  │   │                  │   │  (profiler)      │
└─────────────────────────────┘   └──────────────┘   └───────────────────────────┘   └──────────────────┘   └──────────────────┘   └──────────────────┘

输入: 算子名(含 catlass) + 功能描述 + 环境确认          输出: 可交付算子 + README + 精度报告 + profiler 性能报告

┌─────────────────────────────┐   ┌──────────────────┐   ┌──────────────────────────────────────────────────┐   ┌─────────────────────┐   ┌─────────────────────────┐   ┌───────────────────────────┐
│  Phase 1                    │   │  Phase 2         │   │  Phase 3                                         │   │  Phase 4            │   │  Phase 5                │   │  Phase 6                  │
│  Project Init + Catlass Src │──▶│  Catlass Design  │──▶│  Code Gen + Framework Adaptation + Compile Test  │──▶│  Interface Doc Gen  │──▶│  Precision Eval Report  │──▶│  Performance Eval Report  │
│  project-init + clone       │   │  catlass-        │   │  catlass-code-gen →                              │   │  doc-gen            │   │  precision-eval         │   │  performance-eval         │
│  catlass                    │   │  design          │   │  compile-debug                                   │   │                     │   │                         │   │  (profiler)               │
└─────────────────────────────┘   └──────────────────┘   └──────────────────────────────────────────────────┘   └─────────────────────┘   └─────────────────────────┘   └───────────────────────────┘

Input: Operator name (containing catlass) + function description + environment confirmation          Output: Deliverable operator + README + precision report + profiler performance report

反模式清单（NEVER DO THESE）

Anti-Pattern List (NEVER DO THESE)

❌ 不要跳过 Catlass 源码准备（无 catlass/include
、catlass/examples
就做设计或代码生成）
❌ 不要在 csrc/ops/<op_name>/
内克隆 Catlass，必须在 工程根 下
```
catlass/
```
❌ 不要跳过设计阶段直接写 kernel/host
❌ 不要自行实现整套算子落地而不遵循 catlass-operator-code-gen 流程
❌ 不要在代码生成前擅自修改框架注册（以 project-init / code-gen 约定为准）
❌ 不要手动替代 compile-debug 所负责的编译安装与基础测试闭环（应通过 code-gen 阶段 5 触发）
❌ 不要跳过接口文档阶段（Phase 3 通过后必须 Phase 4）
❌ 不要跳过精度评估阶段（Phase 4 通过后必须 Phase 5）
❌ 不要跳过性能评测阶段（Phase 5 通过后必须 Phase 6）
❌ 不要使用与 ascendc-operator-performance-eval 不一致的采集方式作为最终性能结论
❌ 不要引用不存在的 skill

❌ Do not skip Catlass source code preparation (do not proceed with design or code generation without catlass/include and catlass/examples)
❌ Do not clone Catlass inside csrc/ops/<op_name>/; it must live in `catlass/` under the project root
❌ Do not skip the design phase and write kernel/host code directly
❌ Do not implement the entire operator on your own without following the catlass-operator-code-gen process
❌ Do not modify framework registration on your own before code generation (follow the conventions of project-init / code-gen)
❌ Do not manually replace the build, install, and basic test loop that compile-debug is responsible for (it should be triggered via code-gen Stage 5)
❌ Do not skip the interface documentation phase (Phase 4 must be executed after Phase 3 passes)
❌ Do not skip the precision evaluation phase (Phase 5 must be executed after Phase 4 passes)
❌ Do not skip the performance evaluation phase (Phase 6 must be executed after Phase 5 passes)
❌ Do not base the final performance conclusion on a collection method that is inconsistent with ascendc-operator-performance-eval
❌ Do not reference non-existent skills

Phase 0：需求收集

Phase 0: Requirements Collection

目标：确认 Catlass 算子开发的最小信息集与运行环境（与 ascendc-operator-dev Phase 0 对齐，并增加 Catlass 命名约束）。

Objective: Confirm the minimum information set and operating environment for Catlass operator development (aligned with ascendc-operator-dev Phase 0, plus Catlass naming constraints).

Step 0.1：环境确认（MUST 在任何开发动作之前完成）

Step 0.1: Environment Confirmation (MUST be completed before any development actions)

CANN 环境

CANN Environment

检查
```
ASCEND_HOME_PATH
```
（
```
echo $ASCEND_HOME_PATH
```
）
已设置：作为
```
CANN_PATH
```
，无需重复询问
未设置：MUST 询问用户 CANN 路径（如
```
/usr/local/Ascend/ascend-toolkit
```
）

bash

source ${CANN_PATH}/*/set_env.sh

Check `ASCEND_HOME_PATH` (run `echo $ASCEND_HOME_PATH`)
Already set: Use it as `CANN_PATH`; no need to ask again
Not set: MUST ask the user for the CANN path (e.g., `/usr/local/Ascend/ascend-toolkit`)

```bash
source ${CANN_PATH}/*/set_env.sh
```

Conda 环境

Conda Environment

检查
```
CONDA_DEFAULT_ENV
```
已激活且非
base
：直接使用
未激活或为
base
：MUST 询问 conda 环境名

bash

conda activate <env_name>

Check `CONDA_DEFAULT_ENV`
Activated and not base: Use it directly
Not activated, or it is base: MUST ask the user for the conda environment name

```bash
conda activate <env_name>
```

环境确认检查点

Environment Confirmation Checkpoints

CANN 路径已确定且
```
set_env.sh
```
可执行
Conda 环境已确定且可激活

CANN path is confirmed and `set_env.sh` is executable
Conda environment is confirmed and can be activated
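
The CANN and Conda rules in Step 0.1 reduce to two environment-variable checks. The sketch below only decides what still has to be asked of the user; the function name and the wording of the returned items are hypothetical.

```python
from typing import List, Mapping


def environment_questions(env: Mapping[str, str]) -> List[str]:
    """List what must still be asked before any development action starts."""
    questions: List[str] = []
    # ASCEND_HOME_PATH set: reuse it as CANN_PATH; otherwise ask the user.
    if not env.get("ASCEND_HOME_PATH"):
        questions.append("CANN path")
    # An inactive environment or "base" both require asking for a name.
    conda_env = env.get("CONDA_DEFAULT_ENV", "")
    if conda_env in ("", "base"):
        questions.append("conda environment name")
    return questions
```

Calling `environment_questions(os.environ)` returns an empty list when both checks pass, in which case no question is needed.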

Step 0.2：算子需求收集

Step 0.2: Operator Requirements Collection

信息	格式要求	必填	说明
CANN 路径	绝对路径	是	同 ascendc，可自动检测
Conda 环境	字符串	是	同 ascendc，可自动检测
算子名称	snake_case，含 `catlass`	是	如 `catlass_matmul_basic`
功能描述	文本/公式/对标示例	是	与 Catlass 能力范围一致

可选：支持 dtype、SoC —— 默认值与 catlass-operator-design / 平台 API 一致即可。

Information	Format Requirement	Mandatory	Description
CANN Path	Absolute path	Yes	Aligned with ascendc, can be detected automatically
Conda Environment	String	Yes	Aligned with ascendc, can be detected automatically
Operator Name	snake_case, contains `catlass`	Yes	e.g., `catlass_matmul_basic`
Function Description	Text/Formula/Benchmark Example	Yes	Consistent with Catlass capability scope

Optional: supported dtypes and SoC; defaults may simply follow catlass-operator-design / the platform APIs.
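
The operator naming constraint from the table above can be checked mechanically. This is a sketch: the exact snake_case pattern is an assumption (lowercase letters, digits, single underscores), since the skill only says "snake_case, contains catlass".

```python
import re

# Assumed snake_case definition: starts with a lowercase letter, then
# lowercase letters/digits separated by single underscores.
_SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def is_valid_op_name(op_name: str) -> bool:
    """True when op_name is snake_case and contains the substring 'catlass'."""
    return bool(_SNAKE_CASE.match(op_name)) and "catlass" in op_name
```

For example, `catlass_matmul_basic` passes, while `matmul_basic` fails because the substring is missing.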

决策树

Decision Tree

用户请求	处理方式
「开发/生成某 Catlass 算子」	完成 Step 0.1 → 校验名称含 `catlass` → 确认功能 → 执行全流程
「继续 Catlass 算子开发」	完成 Step 0.1 → 按错误恢复检测当前阶段并续跑

User Request	Handling Method
"Develop/generate a certain Catlass operator"	Complete Step 0.1 → Validate that the name contains `catlass` → Confirm function → Execute full workflow
"Continue Catlass operator development"	Complete Step 0.1 → Detect current phase according to Error Recovery and resume

验收标准

Acceptance Criteria

CANN + Conda 已确认
```
op_name
```
已确认且包含 catlass
功能描述明确

CANN + Conda are confirmed
`op_name` is confirmed and contains catlass
Function description is clear

Phase 1：工程初始化 + Catlass 源码准备

Phase 1: Project Initialization + Catlass Source Code Preparation

Step 1.1：工程骨架

Step 1.1: Project Skeleton

调用 Skill：

ascendc-operator-project-init

MANDATORY: 按 ascendc-operator-project-init 执行：
1. 检测或创建 ascend-kernel
2. 在 csrc/ops/<op_name>/ 创建算子骨架
3. 提示注册更新点（后续由 catlass-operator-code-gen 落实）

检查点（Step 1.1）

ASCEND_KERNEL_ROOT

含

build.sh

、

CMakeLists.txt

、

csrc/

csrc/ops/<op_name>/

已创建，含占位

design.md

、

op_host/

、

op_kernel/

、

CMakeLists.txt

等（以该 skill 为准）

Call Skill: ascendc-operator-project-init

MANDATORY: Execute according to ascendc-operator-project-init:
1. Detect or create ascend-kernel
2. Create the operator skeleton in csrc/ops/<op_name>/
3. Point out the registration update points (to be implemented later by catlass-operator-code-gen)

Checkpoints (Step 1.1)

ASCEND_KERNEL_ROOT contains `build.sh`, `CMakeLists.txt`, and `csrc/`
`csrc/ops/<op_name>/` is created and contains placeholder `design.md`, `op_host/`, `op_kernel/`, `CMakeLists.txt`, etc. (as defined by that skill)

Step 1.2：Catlass 源码

Step 1.2: Catlass Source Code

本步骤不对应独立 skill文件，但必须按下列要求执行。

前置：Step 1.1 完成

执行内容

在 ASCEND_KERNEL_ROOT
下确保存在 catlass/
，且含 catlass/include
、catlass/examples

若不存在：MUST 在工程根执行（禁止在

csrc/ops/<op_name>/

内克隆）

git clone https://gitcode.com/cann/catlass.git catlass

检查点（Step 1.2）

```
<ASCEND_KERNEL_ROOT>/catlass/include
```
存在
```
<ASCEND_KERNEL_ROOT>/catlass/examples
```
存在

Phase 1 全部通过 → 进入 Phase 2

This step does not correspond to a standalone skill file, but it must be executed according to the following requirements.

Prerequisite: Step 1.1 is completed

Execution Content

Ensure that catlass/ exists under ASCEND_KERNEL_ROOT and contains catlass/include and catlass/examples

If it does not exist: MUST run the following in the project root (cloning inside csrc/ops/<op_name>/ is prohibited)

git clone https://gitcode.com/cann/catlass.git catlass

Checkpoints (Step 1.2)

`<ASCEND_KERNEL_ROOT>/catlass/include` exists
`<ASCEND_KERNEL_ROOT>/catlass/examples` exists

All Phase 1 checkpoints passed → Enter Phase 2
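
The Step 1.2 checkpoints and the clone decision can be sketched as follows. The helper names are hypothetical; the sketch only builds the command and leaves running it (for example with `subprocess.run(cmd, cwd=root, check=True)`) to the caller. It does not handle a `catlass/` directory that exists but is incomplete, where `git clone` would refuse to overwrite it.

```python
from pathlib import Path
from typing import List, Optional

CATLASS_URL = "https://gitcode.com/cann/catlass.git"


def catlass_ready(root: Path) -> bool:
    """Both Step 1.2 checkpoints: catlass/include and catlass/examples exist."""
    catlass = root / "catlass"
    return (catlass / "include").is_dir() and (catlass / "examples").is_dir()


def clone_command(root: Path) -> Optional[List[str]]:
    """Return the git command to run in the project root, or None if not needed."""
    if catlass_ready(root):
        return None
    # Must be run with the project root as the working directory,
    # never inside csrc/ops/<op_name>/.
    return ["git", "clone", CATLASS_URL, "catlass"]
```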

Phase 2：Catlass 设计文档

Phase 2: Catlass Design Document

调用 Skill：

catlass-operator-design

Call Skill:

catlass-operator-design

执行内容

Execution Content

MANDATORY: 按 catlass-operator-design 执行：
1. 分析需求与 Catlass 组件边界
2. 对齐 catlass/examples 与 catlass/include 的可实现路径
3. 定稿并落盘推荐路径：csrc/ops/<op_name>/design.md（与 doc-gen / precision-eval / performance-eval 读取一致）

MANDATORY: Execute according to catlass-operator-design:
1. Analyze requirements and Catlass component boundaries
2. Align with the implementable paths of catlass/examples and catlass/include
3. Finalize the design and save it to the recommended path: csrc/ops/<op_name>/design.md (the same path that doc-gen / precision-eval / performance-eval read from)

检查点

Checkpoints

```
csrc/ops/<op_name>/design.md
```
已定稿（非空占位）
写清参考 example 路径、Kernel/Host 契约、dtype/shape 约束等（以 catlass-operator-design 为准）

全部通过 → 进入 Phase 3

`csrc/ops/<op_name>/design.md` is finalized (not an empty placeholder)
It clearly states the reference example path, Kernel/Host contract, dtype/shape constraints, etc. (as defined by catlass-operator-design)

All checkpoints passed → Enter Phase 3

Phase 3：代码生成 + 框架适配 + 编译测试

Phase 3: Code Generation + Framework Adaptation + Compile Test

调用 Skill：

catlass-operator-code-gen

（阶段 5 MUST 调用

ascendc-operator-compile-debug

）

Call Skill: catlass-operator-code-gen (its Stage 5 MUST call ascendc-operator-compile-debug)

执行内容

Execution Content

MANDATORY: 按 catlass-operator-code-gen 执行（与 ascendc-operator-code-gen 阶段结构对齐）：

阶段 1: 加载 GUIDE / references（含 compile-catlass、与 ascendc code-gen 对齐章节）
阶段 2: 读取 design.md，锁定 catlass/examples 路径与类型系统
阶段 3: 生成 op_kernel + op_host，CMake 登记 Catlass 编译选项（BUILD_CATLASS_MODULE、CATLASS_ARCH 等见 compile-catlass.md）
阶段 4: 框架适配 — ops.h、register.cpp、csrc/CMakeLists.txt
阶段 5: 编译安装与测试 — 调用 ascendc-operator-compile-debug（build.sh、pip install、tests/test_<op_name>.py，失败排错以该 skill 为准）

MANDATORY: Execute according to catlass-operator-code-gen (its stage structure is aligned with ascendc-operator-code-gen):

Stage 1: Load GUIDE / references (including compile-catlass and the chapters aligned with ascendc code-gen)
Stage 2: Read design.md; lock the catlass/examples path and the type system
Stage 3: Generate op_kernel + op_host; register the Catlass compilation options in CMake (BUILD_CATLASS_MODULE, CATLASS_ARCH, etc.; see compile-catlass.md)
Stage 4: Framework adaptation: ops.h, register.cpp, csrc/CMakeLists.txt
Stage 5: Build, install, and test: call ascendc-operator-compile-debug (build.sh, pip install, tests/test_<op_name>.py; troubleshooting of failures follows that skill)

检查点

Checkpoints

```
op_host
```
、
```
op_kernel
```
与
```
design.md
```
、选定 example 一致
框架注册与仓库模板一致（
```
namespace ascend_kernel
```
等）
编译成功，whl 可安装
```
tests/test_<op_name>.py
```
存在且通过（exit code 0）
关键编译/测试结果在聊天中有摘要展示

全部通过 → 进入 Phase 4

`op_host` and `op_kernel` are consistent with `design.md` and the selected example
Framework registration is consistent with the repository template (e.g., `namespace ascend_kernel`)
Compilation succeeds and the whl package can be installed
`tests/test_<op_name>.py` exists and passes (exit code 0)
Key compilation/test results are summarized in the chat

All checkpoints passed → Enter Phase 4

Phase 4：接口文档生成

Phase 4: Interface Document Generation

调用 Skill：

ascendc-operator-doc-gen

Call Skill:

ascendc-operator-doc-gen

执行内容

Execution Content

MANDATORY: 按 ascendc-operator-doc-gen 执行：
- 从 register.cpp、ops.h、design.md、op_host、tests 提取接口信息
- 生成 csrc/ops/<op_name>/README.md（PyTorch 风格中文）
- 在聊天界面展示文档要点或全文

MANDATORY: Execute according to ascendc-operator-doc-gen:
- Extract interface information from register.cpp, ops.h, design.md, op_host, tests
- Generate csrc/ops/<op_name>/README.md (PyTorch-style Chinese)
- Display document key points or full text in the chat interface

检查点

Checkpoints

```
README.md
```
已写入算子目录
与
```
m.def
```
/ 实际 Python 调用一致
已在聊天界面展示

全部通过 → 进入 Phase 5

`README.md` is written to the operator directory
It is consistent with `m.def` / the actual Python calls
It has been displayed in the chat interface

All checkpoints passed → Enter Phase 5

Phase 5：精度评估报告

Phase 5: Precision Evaluation Report

调用 Skill：

ascendc-operator-precision-eval

Call Skill:

ascendc-operator-precision-eval

执行内容

Execution Content

MANDATORY: 按 ascendc-operator-precision-eval 执行：
- 用例数 ≥ 30，覆盖 shapes × dtypes × 边界
- 输出到 csrc/ops/<op_name>/test/，生成 Markdown 精度报告
- 在聊天界面展示总览、失败摘要与关键发现（不得仅给路径）

MANDATORY: Execute according to ascendc-operator-precision-eval:
- Number of test cases ≥30, covering shapes × dtypes × boundaries
- Output to csrc/ops/<op_name>/test/, generate Markdown precision report
- Display overview, failure summary and key findings in the chat interface (do not only provide paths)

检查点

Checkpoints

pytest 精度用例全部通过
```
<op_name>_precision_report.md
```
（或该 skill 规定的报告名）已生成
聊天中已展示精度结果摘要

FAIL 闭环：根因分析 → 修正设计（Phase 2）或代码（Phase 3）→ 再经 Phase 4、Phase 5 复测

全部通过 → 进入 Phase 6

All pytest precision test cases pass
`<op_name>_precision_report.md` (or the report name specified by that skill) is generated
A precision result summary is displayed in the chat

FAIL closed loop: Root cause analysis → Revise the design (Phase 2) or the code (Phase 3) → Re-test via Phase 4 and Phase 5

All checkpoints passed → Enter Phase 6
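
The FAIL closed loop can be read as a rule for which phases to repeat once the root cause is known. The sketch below is one interpretation: it assumes that a design revision also requires regenerating the code in Phase 3, which the text implies through design-driven coding but does not state outright. The function name and the `fix_target` values are hypothetical.

```python
from typing import List


def rerun_plan(fix_target: str) -> List[int]:
    """Phases to repeat after a precision failure has been root-caused."""
    if fix_target == "design":
        # Assumption: a design change invalidates the generated code too.
        return [2, 3, 4, 5]
    if fix_target == "code":
        return [3, 4, 5]
    raise ValueError(f"unknown fix target: {fix_target!r}")
```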

Phase 6：性能评测报告

Phase 6: Performance Evaluation Report

调用 Skill：

ascendc-operator-performance-eval

Call Skill:

ascendc-operator-performance-eval

执行内容

Execution Content

MANDATORY: 以 ascendc-operator-performance-eval SKILL.md 为唯一细则：
- 在 csrc/ops/<op_name>/test/ 维护 JSONL 用例；生成前先读 design.md
- 使用 torch_npu.profiler，warmup=5、active=5
- 汇总 ASCEND_PROFILER_OUTPUT/op_statistic.csv 等指标，输出自定义算子 vs 标杆的 Markdown 报告
- 在聊天界面展示对比表与简要结论

MANDATORY: Treat the ascendc-operator-performance-eval SKILL.md as the sole authority on details:
- Maintain JSONL test cases in csrc/ops/<op_name>/test/; read design.md before generation
- Use torch_npu.profiler, warmup=5, active=5
- Summarize indicators such as ASCEND_PROFILER_OUTPUT/op_statistic.csv, output Markdown report of custom operator vs benchmark
- Display comparison table and brief conclusion in the chat interface
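
For the last step, showing the comparison in the chat, a Markdown table can be rendered from per-case timings. This is a sketch only: the column headings, the microsecond unit, and the `(case, custom_us, baseline_us)` row shape are assumptions made for illustration. The actual report layout and the way timings are derived from `op_statistic.csv` are defined by ascendc-operator-performance-eval.

```python
from typing import Iterable, Tuple


def comparison_table(rows: Iterable[Tuple[str, float, float]]) -> str:
    """Render (case, custom_us, baseline_us) rows as a Markdown table."""
    lines = [
        "| Case | Custom (us) | Baseline (us) | Baseline / Custom |",
        "| --- | --- | --- | --- |",
    ]
    for case, custom_us, baseline_us in rows:
        # A ratio above 1 means the custom operator was faster on this case.
        ratio = baseline_us / custom_us
        lines.append(f"| {case} | {custom_us:.2f} | {baseline_us:.2f} | {ratio:.2f} |")
    return "\n".join(lines)
```

The function expects positive timings; a zero `custom_us` would raise `ZeroDivisionError`, which is acceptable for a sketch but should be handled in a real report generator.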

检查点

Checkpoints

用例与报告形态符合该 skill（含 DType、双路径对比等）
报告文件已落盘于算子
```
test/
```
目录
聊天中已展示性能摘要

全部通过 → Catlass 算子主流程完成

Test cases and report format comply with that skill (including DType, dual-path comparison, etc.)
The report file is saved in the operator's `test/` directory
A performance summary is displayed in the chat

All checkpoints passed → The main Catlass operator workflow is complete

交付后可选：性能优化

Post-Delivery Optional: Performance Optimization

调用 Skill：

catlass-operator-performance-optim

须询问用户是否进入调优；不得默认跳过询问。

用户同意 → 按 catlass-operator-performance-optim 修改 tiling/实现；凡改代码 → 从 Phase 3 起复跑（Phase 3→4→5→6），直至再次达标
用户拒绝 → 结束

Call Skill:

catlass-operator-performance-optim

Must ask the user whether to enter tuning; do not skip the question by default.

User agrees → Modify tiling/implementation according to catlass-operator-performance-optim; any code change → Re-run from Phase 3 (Phase 3→4→5→6) until it meets the standards again
User refuses → End

阶段间数据流

Inter-Phase Data Flow

Phase 1 输出                         Phase 2 输入
  ascend-kernel + ops/<op>/骨架       算子名、catlass/ 可引用
  + catlass/include、examples   ────▶

Phase 2 输出                         Phase 3 输入
  design.md（定稿）            ────▶  example 路径、类型与 Host 契约

Phase 3 输出                         Phase 4 输入
  已安装 whl + test_<op>.py     ────▶  register.cpp / ops.h / design.md / op_host

Phase 4 输出                         Phase 5 输入
  README.md                    ────▶  接口、dtype、约束、调用方式

Phase 5 输出                         Phase 6 输入
  精度通过 + 报告                ────▶  算子名、标杆 API、JSONL 与 profiler 流程

Phase 6 输出
  性能报告（profiler）           ────▶  可选：用户确认后进入 catlass-operator-performance-optim

Phase 1 Output                         Phase 2 Input
  ascend-kernel + ops/<op>/skeleton       Operator name, catlass/ is referenceable
  + catlass/include, examples   ────▶

Phase 2 Output                         Phase 3 Input
  design.md (finalized)            ────▶  Example path, type and Host contract

Phase 3 Output                         Phase 4 Input
  Installed whl + test_<op>.py     ────▶  register.cpp / ops.h / design.md / op_host

Phase 4 Output                         Phase 5 Input
  README.md                    ────▶  Interface, dtype, constraints, calling method

Phase 5 Output                         Phase 6 Input
  Precision passed + Report                ────▶  Operator name, benchmark API, JSONL and profiler workflow

Phase 6 Output
  Performance Report (profiler)           ────▶  Optional: Enter catlass-operator-performance-optim after user confirmation

状态跟踪表

Status Tracking Table

Phase	前置条件	调用 Skill / 动作	关键产出物
0. 需求收集	无	—	CANN + Conda + `op_name` （含 catlass）+ 功能描述
1. 工程 + Catlass	Phase 0	`ascendc-operator-project-init` + 根目录 `catlass/`	骨架 + Catlass 源码树
2. 设计	Phase 1	`catlass-operator-design`	`design.md`
3. 代码与测试	Phase 2	`catlass-operator-code-gen` → `compile-debug`	可运行算子 + 基础测试通过
4. 接口文档	Phase 3	`ascendc-operator-doc-gen`	`README.md`
5. 精度评估	Phase 4	`ascendc-operator-precision-eval`	≥30 例 + 精度报告
6. 性能评测	Phase 5	`ascendc-operator-performance-eval`	JSONL + profiler 报告
（可选）调优	Phase 6 + 用户确认	`catlass-operator-performance-optim`	迭代后的实现与报告

Phase	Prerequisite	Called Skill / Action	Key Deliverables
0. Requirements Collection	None	—	CANN + Conda + `op_name` (contains catlass) + Function description
1. Project + Catlass	Phase 0	`ascendc-operator-project-init` + root directory `catlass/`	Skeleton + Catlass source code tree
2. Design	Phase 1	`catlass-operator-design`	`design.md`
3. Code & Test	Phase 2	`catlass-operator-code-gen` → `compile-debug`	Runnable operator + Basic test passed
4. Interface Doc	Phase 3	`ascendc-operator-doc-gen`	`README.md`
5. Precision Eval	Phase 4	`ascendc-operator-precision-eval`	≥30 cases + Precision report
6. Performance Eval	Phase 5	`ascendc-operator-performance-eval`	JSONL + Profiler report
(Optional) Tuning	Phase 6 + User Confirmation	`catlass-operator-performance-optim`	Iterated implementation and report

错误恢复

Error Recovery

从中断点恢复

Resume from Breakpoint

当用户说「继续 Catlass 算子开发」时：

检测条件	判定阶段	恢复动作
`csrc/ops/<op_name>/` 不存在	Phase 1 未完成	从 Phase 1 Step 1.1 开始
`catlass/examples` 不存在	Phase 1 未完成	完成 Step 1.2 克隆
`design.md` 为空或占位	Phase 2 未完成	从 Phase 2 开始
`op_host` / `op_kernel` 仍为骨架或与 design 不符	Phase 3 未完成	从 Phase 3 开始
whl 未安装或 `tests/test_<op_name>.py` 失败	Phase 3 未完成	在 compile-debug 流程内恢复
无 `README.md`	Phase 4 未完成	从 Phase 4 开始
`test/` 无精度报告或精度未全过	Phase 5 未完成	从 Phase 5 恢复
无性能报告或不符合 performance-eval 要求	Phase 6 未完成	从 Phase 6 恢复

When the user says "Continue Catlass operator development":

Detection Condition	Determined Phase	Recovery Action
`csrc/ops/<op_name>/` does not exist	Phase 1 not completed	Start from Phase 1 Step 1.1
`catlass/examples` does not exist	Phase 1 not completed	Complete Step 1.2 cloning
`design.md` is empty or placeholder	Phase 2 not completed	Start from Phase 2
`op_host` / `op_kernel` are still skeleton or inconsistent with design	Phase 3 not completed	Start from Phase 3
whl not installed or `tests/test_<op_name>.py` fails	Phase 3 not completed	Recover within compile-debug workflow
No `README.md`	Phase 4 not completed	Start from Phase 4
No precision report in `test/` or precision not fully passed	Phase 5 not completed	Recover from Phase 5
No performance report or does not meet performance-eval requirements	Phase 6 not completed	Recover from Phase 6
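
The file-based rows of the recovery table can be sketched as a detector that returns the earliest phase needing work. Several points are assumptions: `tests/test_<op_name>.py` is looked up under the project root, a "placeholder" `design.md` is detected only when the file is empty, and the precision report is matched by the `*precision_report.md` suffix. Conditions that cannot be read from the file system (whl installed, tests passing, skeleton code, Phase 6 report conformance) are not covered, so a result of 6 means "verify Phase 6 by hand".

```python
from pathlib import Path


def detect_resume_phase(root: Path, op_name: str) -> int:
    """Return the earliest phase (1-6) whose file-system evidence is missing."""
    op_dir = root / "csrc" / "ops" / op_name
    if not op_dir.is_dir() or not (root / "catlass" / "examples").is_dir():
        return 1
    design = op_dir / "design.md"
    if not design.is_file() or not design.read_text(encoding="utf-8").strip():
        return 2
    # Assumed location of the basic test generated by compile-debug.
    if not (root / "tests" / f"test_{op_name}.py").is_file():
        return 3
    if not (op_dir / "README.md").is_file():
        return 4
    test_dir = op_dir / "test"
    if not (test_dir.is_dir() and any(test_dir.glob("*precision_report.md"))):
        return 5
    return 6
```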

编译/测试失败

Compile/Test Failure

由 ascendc-operator-compile-debug（经 catlass-operator-code-gen 触发）处理；重试与排错上限以 compile-debug skill 为准。

Handled by ascendc-operator-compile-debug (triggered via catlass-operator-code-gen); retry and troubleshooting limits are subject to compile-debug skill.