ascendc-operator-dev

AscendC 算子端到端开发编排

AscendC Operator End-to-End Development Orchestration

Skill类型：流程导向型（七阶段工作流，子技能串行编排）

本 skill 编排七个子 skill，驱动 ascend-kernel 算子从零到生产可用。

Skill Type: Process-oriented (seven-stage workflow with serial orchestration of sub-skills)

This skill orchestrates seven sub-skills to drive ascend-kernel operators from scratch to production-ready.

核心原则

Core Principles

七阶段串行：工程初始化 → 设计文档 → 用例生成 → 代码生成&测试 → 接口文档 → 精度评估 → 性能评测，严格顺序执行
子技能执行：每个阶段 MUST 调用对应子 skill，不得自行实现
阶段门控：前一阶段检查点全部通过后才进入下一阶段
设计驱动编码：代码生成依赖设计文档中的 Tiling 策略和 UB 分配表
自动化设计：无需用户预先提供设计文档，设计阶段自动生成
用例统一生成：设计完成后立即生成测试用例文档，供后续精度评估和性能评测复用
文档闭环：编译测试通过后 MUST 生成 PyTorch 风格的中文接口文档，并在聊天界面展示
精度闭环：算子必须通过 ≥30 例全面精度评估才算完成
性能闭环：算子必须通过 msprof 性能对比评测，输出性能报告
结果可视化：Phase 4/5/6/7 的结果 MUST 以 Markdown 形式直接展示在聊天界面中，不要仅输出路径

Seven-stage Serial Execution: Project Initialization → Design Documentation → Test Case Generation → Code Generation & Testing → Interface Documentation → Precision Evaluation → Performance Benchmarking, executed in strict order
Sub-skill Execution: Each stage MUST call its corresponding sub-skill; do not implement a stage yourself
Stage Gating: Proceed to the next stage only after every checkpoint of the previous stage has passed
Design-driven Coding: Code generation depends on the Tiling strategy and UB allocation table in the design document
Automated Design: Users do not need to supply a design document in advance; the design stage generates it automatically
Unified Test Case Generation: Generate the test case document immediately after the design is complete, so that precision evaluation and performance benchmarking can reuse it
Documentation Closure: Once compilation and testing pass, MUST generate a PyTorch-style interface document in Chinese and display it in the chat interface
Precision Closure: An operator counts as complete only after it passes a comprehensive precision evaluation of ≥30 cases
Performance Closure: An operator must go through an msprof performance comparison that produces a performance report
Result Visualization: Results of Phases 4/5/6/7 MUST be displayed directly in the chat interface as Markdown; do not output only file paths

可用子 Skill 清单

Available Sub-skill List

Skill	路径	职责
`ascendc-operator-project-init`	`ascendc-operator-project-init/SKILL.md`	检测/创建 ascend-kernel 项目，生成算子骨架目录
`ascendc-operator-design`	`ascendc-operator-design/SKILL.md`	分析算子需求，生成设计文档（含 Tiling 策略、UB 分配表）
`ascendc-operator-testcase-gen`	`ascendc-operator-testcase-gen/SKILL.md`	根据设计文档生成统一测试用例文档，供精度评估和性能评测复用
`ascendc-operator-code-gen`	`ascendc-operator-code-gen/SKILL.md`	根据设计文档生成 op_host/op_kernel 代码、框架适配、编译测试
`ascendc-operator-compile-debug`	`ascendc-operator-compile-debug/SKILL.md`	编译、安装 whl、生成测试文件、运行精度测试（由 code-gen 内部调用）
`ascendc-operator-doc-gen`	`ascendc-operator-doc-gen/SKILL.md`	从源码提取接口信息，生成 PyTorch 风格中文 API 文档（必选阶段）
`ascendc-operator-precision-eval`	`ascendc-operator-precision-eval/SKILL.md`	生成 ≥30 例精度测试、运行并输出精度验证报告（必选阶段）
`ascendc-operator-performance-eval`	`ascendc-operator-performance-eval/SKILL.md`	使用 msprof 对比工程算子与原生算子性能，输出性能评测报告（必选阶段）

Skill	Path	Responsibility
`ascendc-operator-project-init`	`ascendc-operator-project-init/SKILL.md`	Detect/create ascend-kernel project, generate operator skeleton directory
`ascendc-operator-design`	`ascendc-operator-design/SKILL.md`	Analyze operator requirements, generate design document (including Tiling strategy, UB allocation table)
`ascendc-operator-testcase-gen`	`ascendc-operator-testcase-gen/SKILL.md`	Generate unified test case document based on design document for reuse in precision evaluation and performance benchmarking
`ascendc-operator-code-gen`	`ascendc-operator-code-gen/SKILL.md`	Generate op_host/op_kernel code, framework adaptation, compilation testing
`ascendc-operator-compile-debug`	`ascendc-operator-compile-debug/SKILL.md`	Compile, install whl package, generate test files, run precision tests (called internally by code-gen)
`ascendc-operator-doc-gen`	`ascendc-operator-doc-gen/SKILL.md`	Extract interface information from source code, generate Chinese API documents in PyTorch style (mandatory stage)
`ascendc-operator-precision-eval`	`ascendc-operator-precision-eval/SKILL.md`	Generate ≥30 precision test cases, run them and output precision verification report (mandatory stage)
`ascendc-operator-performance-eval`	`ascendc-operator-performance-eval/SKILL.md`	Use msprof to compare performance between project operators and native operators, output performance benchmarking report (mandatory stage)

工作流总览

Workflow Overview

Phase 1        Phase 2        Phase 3        Phase 4                      Phase 5        Phase 6         Phase 7
工程初始化  ──▶ 设计文档  ──▶ 用例生成  ──▶ 代码生成+框架适配+编译测试  ──▶ 接口文档  ──▶ 精度评估报告  ──▶ 性能评测报告
project-init   design         testcase-gen   code-gen → compile-debug      doc-gen        precision-eval  performance-eval

输入: 算子名称 + 功能描述                              输出: 生产可用算子 + 用例文档 + 接口文档 + 精度报告 + 性能报告

Phase 1          Phase 2        Phase 3           Phase 4                                            Phase 5           Phase 6                   Phase 7
Project Init ──▶ Design Doc ──▶ Test Case Gen ──▶ Code Gen + Framework Adaptation + Compile Test ──▶ Interface Doc ──▶ Precision Eval Report ──▶ Performance Benchmark Report
project-init     design         testcase-gen      code-gen → compile-debug                           doc-gen           precision-eval            performance-eval

Input: Operator Name + Function Description                              Output: Production-ready Operator + Test Case Doc + Interface Doc + Precision Report + Performance Report
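
The strict ordering and stage gating described above can be sketched in a few lines of Python. This is only an illustration: `PHASES` reuses the short sub-skill names from the diagram, while `run_pipeline` and the `gates` mapping are hypothetical names, and each gate stands in for "all checkpoints of this phase passed".

```python
from typing import Callable, Dict, List

# Short sub-skill names, in the order shown in the workflow overview.
PHASES: List[str] = [
    "project-init", "design", "testcase-gen", "code-gen",
    "doc-gen", "precision-eval", "performance-eval",
]


def run_pipeline(gates: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run phases in order and stop at the first phase whose gate fails.

    Returns the list of phases that completed successfully.
    """
    completed: List[str] = []
    for phase in PHASES:
        if not gates[phase]():
            break  # later phases must not start
        completed.append(phase)
    return completed


# Hypothetical example: every gate passes except code-gen.
gates = {name: (lambda: True) for name in PHASES}
gates["code-gen"] = lambda: False
completed = run_pipeline(gates)
print(completed)
```

In this example the pipeline stops before `code-gen`, so documentation, precision, and performance phases never run.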

反模式清单（NEVER DO THESE）

Anti-pattern List (NEVER DO THESE)

❌ 不要跳过设计阶段直接写代码
❌ 不要跳过用例生成阶段，Phase 2 通过后必须执行 Phase 3（testcase-gen）
❌ 不要自行实现任何算子代码，必须调用子 skill
❌ 不要在代码生成之前修改框架文件（ops.h / register.cpp / CMakeLists.txt）
❌ 不要手动执行编译和测试，统一由 compile-debug skill 处理
❌ 不要引用不存在的 skill
❌ 不要跳过检查点验证
❌ 不要跳过接口文档阶段，Phase 4 通过后必须执行 Phase 5
❌ 不要跳过精度评估阶段，Phase 5 通过后必须执行 Phase 6
❌ 不要跳过性能评测阶段，Phase 6 通过后必须执行 Phase 7
❌ 不要使用非 msprof 的计时方式作为性能结论
❌ 精度评估和性能评测不要自行设计用例，必须先读取 testcase-gen 生成的用例文档

❌ Do not skip the design stage and write code directly
❌ Do not skip the test case generation stage; Phase 3 (testcase-gen) must be executed after Phase 2 passes
❌ Do not implement any operator code yourself; always call the sub-skills
❌ Do not modify framework files (ops.h / register.cpp / CMakeLists.txt) before code generation
❌ Do not run compilation or tests manually; both are handled by the compile-debug skill
❌ Do not reference skills that do not exist
❌ Do not skip checkpoint verification
❌ Do not skip the interface documentation stage; Phase 5 must be executed after Phase 4 passes
❌ Do not skip the precision evaluation stage; Phase 6 must be executed after Phase 5 passes
❌ Do not skip the performance benchmarking stage; Phase 7 must be executed after Phase 6 passes
❌ Do not base performance conclusions on any timing method other than msprof
❌ Do not design your own test cases for precision evaluation or performance benchmarking; first read the test case document generated by testcase-gen

Phase 0：需求收集

Phase 0: Requirements Collection

目标：确认算子开发所需的最小信息集，包括开发环境和算子需求

Goal: Confirm the minimum information set required for operator development, including development environment and operator requirements

Step 0.1：环境确认（MUST 在任何开发动作之前完成）

Step 0.1: Environment Confirmation (MUST be completed before any development action)

开发环境是所有后续阶段的前置依赖，必须首先确认。

The development environment is a prerequisite for all subsequent stages and must be confirmed first.

CANN 环境

CANN Environment

自动检测流程：

检查环境变量 `ASCEND_HOME_PATH` 是否已设置（`echo $ASCEND_HOME_PATH`）
若已设置：直接使用，无需询问用户，将其作为 `CANN_PATH`
若未设置：MUST 向用户询问 CANN 安装路径（如 `/usr/local/Ascend/ascend-toolkit`）

激活方式：

```bash
source ${CANN_PATH}/*/set_env.sh
```

在每个需要编译或运行算子的 Shell 会话中，都必须先执行此激活命令。

Automatic Detection Process:

Check whether the environment variable `ASCEND_HOME_PATH` is set (`echo $ASCEND_HOME_PATH`)
If set: Use it directly as `CANN_PATH` without asking the user
If not set: MUST ask the user for the CANN installation path (e.g., `/usr/local/Ascend/ascend-toolkit`)

Activation Method:

```bash
source ${CANN_PATH}/*/set_env.sh
```

In every Shell session that requires compiling or running operators, this activation command must be executed first.
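
The detection rule above can be sketched as a small Python helper. It is a minimal illustration, not part of the skill: `detect_cann_path` is a hypothetical name, and the only fact taken from the text is that `ASCEND_HOME_PATH` is the variable to check.

```python
import os
from typing import Mapping, Optional


def detect_cann_path(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the CANN path from ASCEND_HOME_PATH, or None if the user must be asked."""
    path = env.get("ASCEND_HOME_PATH", "").strip()
    return path or None


# A None result means: ask the user for the CANN installation path.
print(detect_cann_path({"ASCEND_HOME_PATH": "/usr/local/Ascend/ascend-toolkit"}))
print(detect_cann_path({}))
```

Passing the environment as a parameter keeps the helper easy to test without touching the real shell environment.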

Conda 环境

Conda Environment

自动检测流程：

检查当前是否已激活 conda 环境（`echo $CONDA_DEFAULT_ENV`）
若已激活（值非 `base` 且非空）：直接使用当前环境，无需询问用户
若未激活或为 `base`：MUST 向用户询问要使用的 conda 环境名称

激活方式：

```bash
conda activate <env_name>
```

在每个需要编译或运行算子的 Shell 会话中，都必须先激活 conda 环境。

Automatic Detection Process:

Check whether a conda environment is currently activated (`echo $CONDA_DEFAULT_ENV`)
If activated (the value is non-empty and not `base`): Use the current environment directly without asking the user
If not activated, or the value is `base`: MUST ask the user for the name of the conda environment to use

Activation Method:

```bash
conda activate <env_name>
```

In every Shell session that requires compiling or running operators, the conda environment must be activated first.
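
As a rough sketch of the conda rule above, the helper below treats an empty value and `base` the same way: both mean the user has to be asked. The function name `usable_conda_env` is hypothetical; `CONDA_DEFAULT_ENV` is the variable named in the text.

```python
import os
from typing import Mapping, Optional


def usable_conda_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the active conda environment name, or None if the user must be asked."""
    name = env.get("CONDA_DEFAULT_ENV", "").strip()
    if not name or name == "base":
        return None  # not activated, or only the base environment is active
    return name


print(usable_conda_env({"CONDA_DEFAULT_ENV": "ascend-dev"}))  # hypothetical env name
print(usable_conda_env({"CONDA_DEFAULT_ENV": "base"}))
```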

环境确认检查点

Environment Confirmation Checkpoints

CANN 路径已确定（自动检测或用户提供）
`source ${CANN_PATH}/*/set_env.sh` 可正常执行
Conda 环境已确定（自动检测或用户提供）
`conda activate <env_name>` 可正常执行

CANN path is confirmed (auto-detected or provided by the user)
`source ${CANN_PATH}/*/set_env.sh` runs without errors
Conda environment name is confirmed (auto-detected or provided by the user)
`conda activate <env_name>` runs without errors

Step 0.2：算子需求收集

Step 0.2: Operator Requirements Collection

必须确认的信息

Mandatory Information to Confirm

信息	格式要求	必填	说明
CANN 环境路径	绝对路径	是	自动检测 `$ASCEND_HOME_PATH` ，未设置则询问用户
Conda 环境名称	字符串	是	自动检测 `$CONDA_DEFAULT_ENV` ，未激活则询问用户
算子名称	snake_case	是	如 `acosh` , `rms_norm` , `flash_attn`
功能描述	文本/数学公式	是	如 "反双曲余弦 acosh(x) = ln(x + sqrt(x²-1))"

可选信息（有默认值）：

信息	默认值	说明
支持的数据类型	float16, float32	可扩展 bfloat16
SoC平台	ascend910b	通过平台 API 自动获取

Information	Format Requirement	Mandatory	Description
CANN Environment Path	Absolute path	Yes	Auto-detect `$ASCEND_HOME_PATH` , ask user if not set
Conda Environment Name	String	Yes	Auto-detect `$CONDA_DEFAULT_ENV` , ask user if not activated
Operator Name	snake_case	Yes	e.g., `acosh` , `rms_norm` , `flash_attn`
Function Description	Text/Mathematical Formula	Yes	e.g., "Inverse hyperbolic cosine acosh(x) = ln(x + sqrt(x²-1))"

Optional Information (with default values):

Information	Default Value	Description
Supported Data Types	float16, float32	Can be extended to bfloat16
SoC Platform	ascend910b	Auto-obtained via platform API

决策树

Decision Tree

用户请求	处理方式
"生成 X 算子" / "开发 X 算子"	先完成环境确认（Step 0.1），再从算子名推断功能，确认后直接执行全流程
"帮我开发新算子"（无具体名称）	先完成环境确认（Step 0.1），再询问算子名称和功能描述
"继续算子开发"	先完成环境确认（Step 0.1），再检查已有文件判断阶段，从中断处继续

User Request	Handling Method
"Generate X operator" / "Develop X operator"	Complete environment confirmation (Step 0.1) first, then infer the function from the operator name, and execute the full process directly after confirmation
"Help me develop a new operator" (no specific name)	Complete environment confirmation (Step 0.1) first, then ask for the operator name and function description
"Continue operator development"	Complete environment confirmation (Step 0.1) first, then check existing files to determine the stage and resume from the interrupted point

验收标准

Acceptance Criteria

CANN 环境路径已确定且可激活
Conda 环境名称已确定且可激活
算子名称已确认（snake_case 格式）
功能描述已明确（含数学公式或计算逻辑）

CANN environment path is confirmed and can be activated
Conda environment name is confirmed and can be activated
Operator name is confirmed (snake_case format)
Function description is clear (including mathematical formula or calculation logic)
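
The snake_case requirement on the operator name can be checked mechanically. The sketch below is one possible reading of "snake_case" (lowercase letters and digits, single underscores between segments, starting with a letter); the exact rule used by the skill is not stated, so treat the pattern as an assumption.

```python
import re

# Assumed definition: lowercase segments joined by single underscores.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_snake_case(name: str) -> bool:
    """Return True if the whole name matches the assumed snake_case pattern."""
    return SNAKE_CASE.fullmatch(name) is not None


for candidate in ["acosh", "rms_norm", "flash_attn", "RmsNorm", "rms__norm"]:
    print(candidate, is_snake_case(candidate))
```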

Phase 1：工程初始化

Phase 1: Project Initialization

调用 Skill：

ascendc-operator-project-init

Called Skill:

ascendc-operator-project-init

执行内容

Execution Content

MANDATORY: 按 ascendc-operator-project-init skill 流程执行：
1. 检测 ascend-kernel 项目是否存在
2. 不存在则从模板复制
3. 在 csrc/ops/<op_name>/ 下创建算子骨架
4. 提示三处注册更新点

MANDATORY: Execute according to the ascendc-operator-project-init skill process:
1. Detect if the ascend-kernel project exists
2. Copy from template if it does not exist
3. Create operator skeleton under csrc/ops/<op_name>/
4. Point out the three places where registration must be updated

检查点

Checkpoints

ascend-kernel 项目存在（build.sh、CMakeLists.txt、csrc/）
`csrc/ops/<op_name>/` 目录已创建
包含 `op_host/<op_name>.cpp`、`op_kernel/<op_name>.cpp`、`CMakeLists.txt`、`design.md`

全部通过 → 进入 Phase 2

ascend-kernel project exists (build.sh, CMakeLists.txt, csrc/)
`csrc/ops/<op_name>/` directory has been created
It contains `op_host/<op_name>.cpp`, `op_kernel/<op_name>.cpp`, `CMakeLists.txt`, and `design.md`

All passed → Proceed to Phase 2

Phase 2：设计文档生成

Phase 2: Design Document Generation

调用 Skill：

ascendc-operator-design

Called Skill:

ascendc-operator-design

执行内容

Execution Content

MANDATORY: 按 ascendc-operator-design skill 流程执行：
1. 分析算子需求（名称、功能、数据类型）
2. 确定实现路径（AscendC Kernel / CATLASS / ACLNN）
3. 设计 Tiling 策略（Block级 + UB级）
4. 填写 UB 分配表，推导 bufferCoefficient
5. 生成完整设计文档到 csrc/ops/<op_name>/design.md

MANDATORY: Execute according to the ascendc-operator-design skill process:
1. Analyze operator requirements (name, function, data types)
2. Determine implementation path (AscendC Kernel / CATLASS / ACLNN)
3. Design Tiling strategy (Block-level + UB-level)
4. Fill in UB allocation table, derive bufferCoefficient
5. Generate complete design document to csrc/ops/<op_name>/design.md

检查点

Checkpoints

`csrc/ops/<op_name>/design.md` 内容完整
包含函数签名和支持的数据类型
包含计算逻辑伪代码（AscendC API 调用序列）
包含 UB 分配表（列出所有 buffer 及总系数）
包含 bufferCoefficient（每种 dtype 的值）

全部通过 → 进入 Phase 3

`csrc/ops/<op_name>/design.md` is complete
Contains function signature and supported data types
Contains calculation logic pseudocode (AscendC API call sequence)
Contains UB allocation table (lists all buffers and total coefficients)
Contains bufferCoefficient (value for each dtype)

All passed → Proceed to Phase 3

Phase 3：测试用例生成

Phase 3: Test Case Generation

调用 Skill：

ascendc-operator-testcase-gen

Called Skill:

ascendc-operator-testcase-gen

执行内容

Execution Content

MANDATORY: 按 ascendc-operator-testcase-gen skill 流程执行：
1. 读取 csrc/ops/<op_name>/design.md，提取参数约束、支持的 dtype、典型 shape
2. 生成 TEST_SHAPES（常规 shape）、GENERAL_SHAPES（泛化 shape）、BOUNDARY_VALUES（边界值）
3. 生成算子标杆（CPU 参考实现、NPU 调用方式）
4. 输出用例文档到 csrc/ops/<op_name>/test/<op_name>-test-cases.md

MANDATORY: Execute according to the ascendc-operator-testcase-gen skill process:
1. Read csrc/ops/<op_name>/design.md, extract parameter constraints, supported dtypes, typical shapes
2. Generate TEST_SHAPES (regular shapes), GENERAL_SHAPES (generalized shapes), BOUNDARY_VALUES (boundary values)
3. Generate the operator reference baseline (CPU reference implementation, NPU calling method)
4. Output test case document to csrc/ops/<op_name>/test/<op_name>-test-cases.md

检查点

Checkpoints

`csrc/ops/<op_name>/test/<op_name>-test-cases.md` 已生成

包含 SUPPORTED_DTYPES、TEST_SHAPES、GENERAL_SHAPES、BOUNDARY_VALUES
包含算子标杆（NPU 调用方式 + CPU 参考实现）
shape 和参数值均在 design.md 约束范围内

全部通过 → 进入 Phase 4

`csrc/ops/<op_name>/test/<op_name>-test-cases.md` has been generated

Contains SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES, BOUNDARY_VALUES
Contains the operator reference baseline (NPU calling method + CPU reference implementation)
Shapes and parameter values are within the constraints of design.md

All passed → Proceed to Phase 4

Phase 4：代码生成 + 框架适配 + 编译测试

Phase 4: Code Generation + Framework Adaptation + Compile Test

调用 Skill：

`ascendc-operator-code-gen`（内部自动调用 `ascendc-operator-compile-debug`）

Called Skill:

`ascendc-operator-code-gen` (which automatically calls `ascendc-operator-compile-debug` internally)

执行内容

Execution Content

MANDATORY: 按 ascendc-operator-code-gen skill 流程执行：

阶段 1: 加载参考文档
  - 读取 references/GUIDE.md
  - 按算子类型加载对应 reference

阶段 2: 读取设计文档
  - 提取函数签名、UB 分配表、计算伪代码

阶段 3: 选择模板并生成代码
  - 选择 elementwise / row 模板
  - 生成 op_host/<op_name>.cpp（含 Tiling 计算逻辑）
  - 生成 op_kernel/<op_name>.cpp（含 Compute 计算逻辑）

阶段 4: 框架适配
  - 更新 csrc/ops.h（函数声明）
  - 更新 csrc/register.cpp（m.def + m.impl）
  - 更新 csrc/CMakeLists.txt（OP_SRCS + ascendc_library）

阶段 5: 编译安装与测试（调用 compile-debug skill）
  - ./build.sh 编译
  - pip install whl 安装
  - 生成 tests/test_<op_name>.py
  - 运行功能测试和精度测试
  - 编译/测试失败最多排错 3 次

MANDATORY: Execute according to the ascendc-operator-code-gen skill process:

Stage 1: Load Reference Documents
  - Read references/GUIDE.md
  - Load corresponding reference according to operator type

Stage 2: Read Design Document
  - Extract function signature, UB allocation table, calculation pseudocode

Stage 3: Select Template and Generate Code
  - Select elementwise / row template
  - Generate op_host/<op_name>.cpp (includes Tiling calculation logic)
  - Generate op_kernel/<op_name>.cpp (includes Compute calculation logic)

Stage 4: Framework Adaptation
  - Update csrc/ops.h (function declaration)
  - Update csrc/register.cpp (m.def + m.impl)
  - Update csrc/CMakeLists.txt (OP_SRCS + ascendc_library)

Stage 5: Compilation, Installation and Testing (call compile-debug skill)
  - Compile via ./build.sh
  - Install via pip install whl
  - Generate tests/test_<op_name>.py
  - Run functional tests and precision tests
  - Debug up to 3 times if compilation/test fails

检查点

Checkpoints

`op_host/<op_name>.cpp` 使用平台 API 获取硬件参数
`op_kernel/<op_name>.cpp` 包含完整 CopyIn → Compute → CopyOut 流水线
`ops.h` 已添加函数声明
`register.cpp` 已添加 `m.def` 和 `m.impl`
`csrc/CMakeLists.txt` 已添加 host 和 kernel 源文件
编译成功（whl 包已生成）
功能测试通过（exit code 0）
精度测试全部通过（pytest 全绿）

全部通过 → 进入 Phase 5

`op_host/<op_name>.cpp` uses the platform API to obtain hardware parameters
`op_kernel/<op_name>.cpp` contains a complete CopyIn → Compute → CopyOut pipeline
The function declaration has been added to `ops.h`
`m.def` and `m.impl` have been added to `register.cpp`
Host and kernel source files have been added to `csrc/CMakeLists.txt`
Compilation is successful (whl package has been generated)
Functional tests pass (exit code 0)
All precision tests pass (pytest all green)

All passed → Proceed to Phase 5

Phase 5：接口文档生成

Phase 5: Interface Document Generation

调用 Skill：

ascendc-operator-doc-gen

Called Skill:

ascendc-operator-doc-gen

执行内容

Execution Content

MANDATORY: 按 ascendc-operator-doc-gen skill 流程执行：

阶段 1: 信息提取
  - 从 register.cpp 提取 Python 调用签名（m.def schema）
  - 从 ops.h 提取 C++ 函数声明和返回类型
  - 从 design.md 提取算法描述、参数说明、dtype 支持、约束条件
  - 从 op_host 提取 TORCH_CHECK 约束
  - 从 tests/test_<op_name>.py 提取使用示例

阶段 2: 文档结构组装
  - 按 PyTorch 官方文档风格组装中文接口文档
  - 包含：标题签名 + 功能描述 + 参数说明 + 支持的数据类型 + Shape + 约束条件 + 使用示例 + 返回值

阶段 3: 文件生成
  - 生成 csrc/ops/<op_name>/README.md

阶段 4: 在交互界面展示完整文档内容

MANDATORY: Execute according to the ascendc-operator-doc-gen skill process:

Stage 1: Information Extraction
  - Extract Python calling signature (m.def schema) from register.cpp
  - Extract C++ function declaration and return type from ops.h
  - Extract algorithm description, parameter description, dtype support, constraint conditions from design.md
  - Extract TORCH_CHECK constraints from op_host
  - Extract usage examples from tests/test_<op_name>.py

Stage 2: Document Structure Assembly
  - Assemble Chinese interface documents in PyTorch official documentation style
  - Includes: Title Signature + Function Description + Parameter Description + Supported Data Types + Shape + Constraint Conditions + Usage Examples + Return Value

Stage 3: File Generation
  - Generate csrc/ops/<op_name>/README.md

Stage 4: Display complete document content in the interactive interface

检查点

Checkpoints

从源代码提取了完整的接口信息（签名、参数、dtype、shape、约束）
README.md 包含完整的 8 个段落（标题签名 + 功能描述 + 参数说明 + 支持的数据类型 + Shape + 约束条件 + 使用示例 + 返回值）
Python 调用签名与 `register.cpp` 的 `m.def` 一致
参数说明使用 PyTorch 文档风格，描述使用中文
使用示例中的代码可运行
README.md 已写入 `csrc/ops/<op_name>/README.md`
接口文档已在聊天界面完整展示

全部通过 → 进入 Phase 6

Complete interface information has been extracted from source code (signature, parameters, dtype, shape, constraints)
README.md contains all 8 sections (title signature + function description + parameter description + supported data types + shape + constraint conditions + usage examples + return value)
The Python calling signature is consistent with `m.def` in `register.cpp`
Parameter descriptions follow the PyTorch documentation style and are written in Chinese
Code in the usage examples is runnable
README.md has been written to `csrc/ops/<op_name>/README.md`
Interface document has been fully displayed in the chat interface

All passed → Proceed to Phase 6

Phase 6：精度评估报告

Phase 6: Precision Evaluation Report

调用 Skill：

ascendc-operator-precision-eval

Called Skill:

ascendc-operator-precision-eval

执行内容

Execution Content

MANDATORY: 按 ascendc-operator-precision-eval skill 流程执行：

阶段 1: 加载用例文档 + 信息收集
  - 读取 csrc/ops/<op_name>/test/<op_name>-test-cases.md（testcase-gen 产出）
  - 提取 SUPPORTED_DTYPES、TEST_SHAPES、GENERAL_SHAPES、BOUNDARY_VALUES、算子标杆
  - 从已有代码补充提取精度阈值等信息

阶段 2: 用例适配（(shapes + boundary) × dtypes ≥ 30 例）
  - 直接复用 testcase-gen 的 TEST_SHAPES 和 BOUNDARY_VALUES
  - 每个 shape / 边界值遍历算子支持的全部 dtype

阶段 3: 测试脚本生成（输出到算子目录 csrc/ops/<op_name>/test/）
  - 基于模板生成 test_<op_name>_precision.py（pytest 格式）
  - 基于模板生成 run_<op_name>_precision_report.py（报告生成器）

阶段 4: 执行
  - 运行 pytest 全部通过
  - 运行报告生成器输出 JSON

阶段 5: 报告生成
  - 生成 <op_name>_precision_report.md（含常规 shape + 边界值表格 + 汇总 + 关键发现）
  - 向用户提示报告路径

MANDATORY: Execute according to the ascendc-operator-precision-eval skill process:

Stage 1: Load Test Case Document + Information Collection
  - Read csrc/ops/<op_name>/test/<op_name>-test-cases.md (output from testcase-gen)
  - Extract SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES, BOUNDARY_VALUES, and the operator reference baseline
  - Supplement and extract information such as precision thresholds from existing code

Stage 2: Test Case Adaptation ((shapes + boundary) × dtypes ≥ 30 cases)
  - Directly reuse TEST_SHAPES and BOUNDARY_VALUES from testcase-gen
  - Traverse all dtypes supported by the operator for each shape / boundary value

Stage 3: Test Script Generation (output to operator directory csrc/ops/<op_name>/test/)
  - Generate test_<op_name>_precision.py (pytest format) based on template
  - Generate run_<op_name>_precision_report.py (report generator) based on template

Stage 4: Execution
  - Run pytest; all tests must pass
  - Run report generator to output JSON

Stage 5: Report Generation
  - Generate <op_name>_precision_report.md (includes regular shape + boundary value table + summary + key findings)
  - Tell the user where the report is saved
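
The case-count rule in Stage 2, (shapes + boundary) × dtypes ≥ 30, can be illustrated as follows. The shapes, boundary values, and the `build_cases` helper are made up for the example; the real values come from the test case document produced by testcase-gen.

```python
from itertools import product
from typing import List, Sequence, Tuple


def build_cases(shapes: Sequence, boundary_values: Sequence, dtypes: Sequence[str]) -> List[Tuple]:
    """Pair every shape and every boundary value with every supported dtype."""
    items = [("shape", s) for s in shapes] + [("boundary", b) for b in boundary_values]
    return [(kind, item, dtype) for (kind, item), dtype in product(items, dtypes)]


# Hypothetical inputs: 10 shapes, 5 boundary values, 2 dtypes.
shapes = [(2 ** k,) for k in range(4, 14)]
boundary_values = [1.0, 1.001, 10.0, 1e4, 65504.0]
dtypes = ["float16", "float32"]

cases = build_cases(shapes, boundary_values, dtypes)
print(len(cases))  # (10 + 5) * 2
assert len(cases) >= 30, "precision evaluation needs at least 30 cases"
```

With fewer dtypes, more shapes or boundary values are needed to reach the threshold.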

检查点

Checkpoints

用例数 = (shapes + boundary) × dtypes ≥ 30
算子支持的每种 dtype 都已测试
pytest 精度测试全部通过
JSON 报告生成（含 5 个精度指标: MaxAbsErr / MeanAbsErr / MaxRelErr / MeanRelErr / CosineSim）

Markdown 报告生成于 `csrc/ops/<op_name>/test/<op_name>_precision_report.md`

精度测试结果已以 Markdown 表格形式展示在聊天界面
已向用户提示精度报告路径

全部通过 → 进入 Phase 7

Number of test cases = (shapes + boundary) × dtypes ≥ 30
Each dtype supported by the operator has been tested
All pytest precision tests pass
JSON report is generated (includes 5 precision metrics: MaxAbsErr / MeanAbsErr / MaxRelErr / MeanRelErr / CosineSim)

Markdown report is generated at `csrc/ops/<op_name>/test/<op_name>_precision_report.md`

Precision test results have been displayed in the chat interface as Markdown tables
The user has been told where the precision report is saved

All passed → Proceed to Phase 7
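
For orientation, the five metrics named in the JSON report could be computed roughly as below. The source does not define them, so the formulas here are assumptions (in particular the `eps` term in the relative error and the handling of zero vectors), written in plain Python rather than with the tensors the real scripts would use.

```python
import math
from typing import Dict, Sequence


def precision_metrics(actual: Sequence[float], expected: Sequence[float],
                      eps: float = 1e-12) -> Dict[str, float]:
    """Compute the five metrics under assumed definitions."""
    if len(actual) != len(expected) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_err = [abs(a - e) for a, e in zip(actual, expected)]
    # Assumed: relative error is measured against the reference value.
    rel_err = [d / (abs(e) + eps) for d, e in zip(abs_err, expected)]
    dot = sum(a * e for a, e in zip(actual, expected))
    norm = math.sqrt(sum(a * a for a in actual)) * math.sqrt(sum(e * e for e in expected))
    return {
        "MaxAbsErr": max(abs_err),
        "MeanAbsErr": sum(abs_err) / len(abs_err),
        "MaxRelErr": max(rel_err),
        "MeanRelErr": sum(rel_err) / len(rel_err),
        "CosineSim": dot / norm if norm > 0 else float("nan"),
    }


metrics = precision_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 3.0])
print(metrics)
```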

Phase 7：性能评测报告

Phase 7: Performance Benchmarking Report

调用 Skill：

ascendc-operator-performance-eval

Called Skill:

ascendc-operator-performance-eval

执行内容

Execution Content

MANDATORY: 按 ascendc-operator-performance-eval skill 流程执行：

阶段 1: 加载用例文档 + 信息收集
  - 读取 csrc/ops/<op_name>/test/<op_name>-test-cases.md（testcase-gen 产出）
  - 提取 SUPPORTED_DTYPES、TEST_SHAPES、GENERAL_SHAPES、算子标杆
  - 从已有代码补充提取 OP Type 关键字等信息

阶段 2: 用例适配（JSONL 格式，≥8 case）
  - 从 testcase-gen 的 TEST_SHAPES + GENERAL_SHAPES 中选取代表性 shape
  - 覆盖算子支持的全部 dtype
  - 转换为 JSONL 格式

阶段 3: 脚本生成（输出到算子目录 csrc/ops/<op_name>/test/）
  - 基于模板生成 run_<op_name>_case.py（单 case msprof 执行器）
  - 基于模板生成 benchmark_<op_name>_msprof.py（总控脚本）
  - 生成 <op_name>_cases.jsonl

阶段 4: 执行采集
  - 运行总控脚本，每 case 20 次迭代（前 10 次预热）
  - 按 OP Type 从 op_summary_*.csv 提取 Task Duration(us) 和硬件指标
  - 输出 JSON 结果

阶段 5: 报告生成
  - 生成 <op_name>_perf_report.md（含结果表格 + 汇总 + 简短分析）
  - 向用户提示报告路径

MANDATORY: Execute according to the ascendc-operator-performance-eval skill process:

Stage 1: Load Test Case Document + Information Collection
  - Read csrc/ops/<op_name>/test/<op_name>-test-cases.md (output from testcase-gen)
  - Extract SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES, and the operator reference baseline
  - Supplement and extract information such as OP Type keywords from existing code

Stage 2: Test Case Adaptation (JSONL format, ≥8 cases)
  - Select representative shapes from TEST_SHAPES + GENERAL_SHAPES of testcase-gen
  - Cover all dtypes supported by the operator
  - Convert to JSONL format

Stage 3: Script Generation (output to operator directory csrc/ops/<op_name>/test/)
  - Generate run_<op_name>_case.py (single case msprof executor) based on template
  - Generate benchmark_<op_name>_msprof.py (master control script) based on template
  - Generate <op_name>_cases.jsonl

Stage 4: Execution and Collection
  - Run the master control script, 20 iterations per case (first 10 for warm-up)
  - Extract Task Duration(us) and hardware metrics from op_summary_*.csv by OP Type
  - Output JSON results

Stage 5: Report Generation
  - Generate <op_name>_perf_report.md (includes result table + summary + brief analysis)
  - Tell the user where the report is saved
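
Stage 2 converts the selected shape and dtype combinations into JSONL. A minimal sketch of that conversion is shown below; the field names `shape` and `dtype`, the sample shapes, and the helper `build_perf_cases` are assumptions, since the actual schema is defined by the performance-eval skill's templates.

```python
import json
from itertools import product
from typing import List, Sequence, Tuple


def build_perf_cases(shapes: Sequence[Tuple[int, ...]], dtypes: Sequence[str]) -> List[str]:
    """Return one JSON object per line for every shape x dtype combination."""
    return [json.dumps({"shape": list(shape), "dtype": dtype})
            for shape, dtype in product(shapes, dtypes)]


# Hypothetical representative shapes; 4 shapes x 2 dtypes gives 8 cases.
shapes = [(1024,), (32, 512), (8, 128, 256), (4096, 4096)]
dtypes = ["float16", "float32"]

lines = build_perf_cases(shapes, dtypes)
assert len(lines) >= 8, "performance evaluation needs at least 8 cases"
jsonl_text = "\n".join(lines) + "\n"
print(jsonl_text)
```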

检查点

Checkpoints

JSONL 用例覆盖多种 shape × dtype（≥ 8 case）
使用 `msprof` 采集，非其他计时方式
按 `OP Type` 筛选目标算子（非 Op Name）
采用每 case 20 次迭代、前 10 次预热的统计策略
JSON 报告生成（含 Task Duration + 硬件指标）

Markdown 报告生成于 `csrc/ops/<op_name>/test/<op_name>_perf_report.md`

报告包含简短分析（≥ 3 条结论）
性能测试结果已以 Markdown 表格形式展示在聊天界面
已向用户提示性能报告路径

全部通过 → 算子开发完成

JSONL test cases cover multiple shape × dtype combinations (≥8 cases)
Data is collected with `msprof`, not with any other timing method
The target operator is filtered by `OP Type` (not by Op Name)
Each case runs 20 iterations, with the first 10 used as warm-up
JSON report is generated (includes Task Duration + hardware metrics)

Markdown report is generated at `csrc/ops/<op_name>/test/<op_name>_perf_report.md`

Report contains a brief analysis (≥3 conclusions)
Performance test results have been displayed in the chat interface as Markdown tables
The user has been told where the performance report is saved

All passed → Operator development is complete
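
The filtering and warm-up rules in the checkpoints can be sketched with the standard `csv` module. The column names `OP Type` and `Task Duration(us)` are the ones named above; everything else is an assumption for illustration: that each iteration contributes one row, that rows appear in execution order, that the iterations after warm-up are the ones averaged, and the sample CSV itself (a real `op_summary_*.csv` has many more columns).

```python
import csv
import io
from statistics import mean


def mean_task_duration(csv_text: str, op_type_keyword: str, warmup: int = 10) -> float:
    """Average Task Duration(us) for rows matching the OP Type keyword, skipping warm-up rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    durations = [float(row["Task Duration(us)"])
                 for row in reader if op_type_keyword in row["OP Type"]]
    measured = durations[warmup:]  # drop the warm-up iterations
    if not measured:
        raise ValueError("no measured rows left after warm-up")
    return mean(measured)


# Made-up data: 20 iterations of one operator plus an unrelated row.
rows = ["Op Name,OP Type,Task Duration(us)"]
rows += [f"acosh_{i},Acosh,{100.0 if i < 10 else 12.5}" for i in range(20)]
rows.append("cast_0,Cast,3.0")
result = mean_task_duration("\n".join(rows), "Acosh")
print(result)
```

Filtering on `OP Type` rather than on the operator name keeps unrelated kernels, such as the `Cast` row here, out of the average.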

阶段间数据流

Inter-stage Data Flow

Phase 1 输出                    Phase 2 输入
  csrc/ops/<op_name>/    ────▶    算子名称、目录结构
  design.md (占位)

Phase 2 输出                    Phase 3 输入
  design.md (完整)       ────▶    参数约束、支持的 dtype、典型 shape
                                  → 生成统一测试用例文档

Phase 3 输出                    Phase 4 输入
  <op_name>-test-cases.md ────▶    design.md (完整)
  （用例文档，供后续复用）          函数签名、UB 分配表 → bufferCoefficient
                                  计算伪代码 → Compute 逻辑
                                  Tiling 策略 → Block/UB 切分参数

Phase 4 输出                    Phase 5 输入
  已安装的算子 whl        ────▶    register.cpp / ops.h / design.md /
  tests/test_<op_name>.py        op_host / test 文件
                                  → 提取接口信息生成文档

Phase 5 输出                    Phase 6 输入
  csrc/ops/<op>/README.md ────▶    <op_name>-test-cases.md（来自 Phase 3）
  接口文档完成                     算子名、调用方式、输入域约束
                                  支持的全部 dtype、精度阈值
                                  → 输出到 csrc/ops/<op_name>/test/

Phase 6 输出                    Phase 7 输入
  精度报告通过             ────▶    <op_name>-test-cases.md（来自 Phase 3）
  csrc/ops/<op>/test/            算子名、工程/原生调用方式
                                  支持的全部 dtype、OP Type 关键字
                                  → 输出到 csrc/ops/<op_name>/test/

Phase 1 Output                             Phase 2 Input
  csrc/ops/<op_name>/           ────▶    Operator name, directory structure
  design.md (placeholder)

Phase 2 Output                             Phase 3 Input
  design.md (complete)          ────▶    Parameter constraints, supported dtypes, typical shapes
                                           → Generate unified test case document

Phase 3 Output                             Phase 4 Input
  <op_name>-test-cases.md       ────▶    design.md (complete)
  (test case document, reused later)       Function signature, UB allocation table → bufferCoefficient
                                           Calculation pseudocode → Compute logic
                                           Tiling strategy → Block/UB splitting parameters

Phase 4 Output                             Phase 5 Input
  Installed operator whl        ────▶    register.cpp / ops.h / design.md /
  tests/test_<op_name>.py                  op_host / test files
                                           → Extract interface information to generate documents

Phase 5 Output                             Phase 6 Input
  csrc/ops/<op_name>/README.md  ────▶    <op_name>-test-cases.md (from Phase 3)
  Interface document completed             Operator name, calling method, input domain constraints
                                           All supported dtypes, precision thresholds
                                           → Output to csrc/ops/<op_name>/test/

Phase 6 Output                             Phase 7 Input
  Precision report passed       ────▶    <op_name>-test-cases.md (from Phase 3)
  csrc/ops/<op_name>/test/                 Operator name, project/native calling method
                                           All supported dtypes, OP Type keywords
                                           → Output to csrc/ops/<op_name>/test/

状态跟踪表

Status Tracking Table

Phase	前置条件	调用 Skill	关键产出物
0. 需求收集	无	—	CANN 路径 + Conda 环境 + 算子名称 + 功能描述
1. 工程初始化	Phase 0	`ascendc-operator-project-init`	算子骨架目录
2. 设计文档	Phase 1	`ascendc-operator-design`	design.md（含 Tiling + UB 分配表）
3. 用例生成	Phase 2	`ascendc-operator-testcase-gen`	`<op_name>-test-cases.md` （统一用例文档）
4. 代码&测试	Phase 3	`ascendc-operator-code-gen` → `compile-debug`	可运行算子 + 基本测试通过
5. 接口文档	Phase 4	`ascendc-operator-doc-gen`	PyTorch 风格中文 API 文档 (README.md)
6. 精度评估	Phase 5	`ascendc-operator-precision-eval`	≥30 例精度测试 + 精度报告
7. 性能评测	Phase 6	`ascendc-operator-performance-eval`	msprof 性能对比 + 性能报告

Phase	Precondition	Called Skill	Key Deliverables
0. Requirements Collection	None	—	CANN path + Conda environment + Operator name + Function description
1. Project Initialization	Phase 0	`ascendc-operator-project-init`	Operator skeleton directory
2. Design Document	Phase 1	`ascendc-operator-design`	design.md (includes Tiling + UB allocation table)
3. Test Case Generation	Phase 2	`ascendc-operator-testcase-gen`	`<op_name>-test-cases.md` (unified test case document)
4. Code & Testing	Phase 3	`ascendc-operator-code-gen` → `compile-debug`	Runnable operator + basic tests passed
5. Interface Document	Phase 4	`ascendc-operator-doc-gen`	PyTorch-style Chinese API document (README.md)
6. Precision Evaluation	Phase 5	`ascendc-operator-precision-eval`	≥30 precision test cases + precision report
7. Performance Benchmarking	Phase 6	`ascendc-operator-performance-eval`	msprof performance comparison + performance report

错误恢复

Error Recovery

从中断点恢复

Resume from Interrupted Point

当用户说"继续算子开发"时：

检测条件	判定阶段	恢复动作
`csrc/ops/<op_name>/` 不存在	Phase 1 未完成	从 Phase 1 开始
`design.md` 为占位或空	Phase 2 未完成	从 Phase 2 开始
`csrc/ops/<op_name>/test/<op_name>-test-cases.md` 不存在	Phase 3 未完成	从 Phase 3 开始
`op_host/` 仍为骨架代码	Phase 4 未完成	从 Phase 4 开始
whl 未生成	Phase 4 编译未完成	从编译步骤恢复
基本测试未通过	Phase 4 测试未完成	从测试步骤恢复
`csrc/ops/<op_name>/README.md` 不存在	Phase 5 未完成	从 Phase 5 开始
`csrc/ops/<op_name>/test/` 无精度报告	Phase 6 未开始	从 Phase 6 开始
精度报告不存在或精度测试未全部通过	Phase 6 未完成	从 Phase 6 恢复
精度报告存在但性能报告不存在	Phase 7 未开始	从 Phase 7 开始
`<op_name>_perf_report.md` 不存在或不完整	Phase 7 未完成	从 Phase 7 恢复

When the user says "Continue operator development":

Detection Condition	Determined Stage	Recovery Action
`csrc/ops/<op_name>/` does not exist	Phase 1 not completed	Start from Phase 1
`design.md` is placeholder or empty	Phase 2 not completed	Start from Phase 2
`csrc/ops/<op_name>/test/<op_name>-test-cases.md` does not exist	Phase 3 not completed	Start from Phase 3
`op_host/` still contains skeleton code	Phase 4 not completed	Start from Phase 4
whl package not generated	Phase 4 compilation not completed	Resume from compilation step
Basic tests not passed	Phase 4 testing not completed	Resume from testing step
`csrc/ops/<op_name>/README.md` does not exist	Phase 5 not completed	Start from Phase 5
No precision report in `csrc/ops/<op_name>/test/`	Phase 6 not started	Start from Phase 6
Precision report does not exist or precision tests not all passed	Phase 6 not completed	Resume from Phase 6
Precision report exists but performance report does not	Phase 7 not started	Start from Phase 7
`<op_name>_perf_report.md` does not exist or is incomplete	Phase 7 not completed	Resume from Phase 7
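
A simplified version of the resume table can be expressed as a chain of file checks. The sketch below covers only the checks that reduce to "does this file exist"; whether Phase 4 is done (skeleton code replaced, whl built, basic tests passing) is passed in as a flag, placeholder detection for `design.md` is reduced to an emptiness check, and report completeness is not inspected. The function name `resume_phase` is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resume_phase(op_dir: Path, op_name: str, phase4_done: bool) -> Optional[int]:
    """Return the phase to resume from, or None if every artifact is present."""
    test_dir = op_dir / "test"
    design = op_dir / "design.md"
    if not op_dir.is_dir():
        return 1
    if not design.is_file() or not design.read_text(encoding="utf-8").strip():
        return 2
    if not (test_dir / f"{op_name}-test-cases.md").is_file():
        return 3
    if not phase4_done:
        return 4
    if not (op_dir / "README.md").is_file():
        return 5
    if not (test_dir / f"{op_name}_precision_report.md").is_file():
        return 6
    if not (test_dir / f"{op_name}_perf_report.md").is_file():
        return 7
    return None


with tempfile.TemporaryDirectory() as tmp:
    op_dir = Path(tmp) / "csrc" / "ops" / "acosh"
    before_init = resume_phase(op_dir, "acosh", phase4_done=False)
    op_dir.mkdir(parents=True)
    (op_dir / "design.md").write_text("# acosh design\n", encoding="utf-8")
    after_design = resume_phase(op_dir, "acosh", phase4_done=False)
print(before_init, after_design)
```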

编译/测试失败

Compilation/Test Failure

由 `ascendc-operator-compile-debug` skill 内部处理，最多排错 3 次。3 次仍失败则停止并向用户报告详细错误。

Handled internally by the `ascendc-operator-compile-debug` skill, with up to 3 debugging attempts. If the failure persists after 3 attempts, stop and report the detailed errors to the user.
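
The bounded debugging loop might look like the sketch below. It assumes that "up to 3 debugging attempts" means one initial run followed by at most three fix-and-rerun rounds; the text leaves this open, and `run_with_debug`, `step`, and `fix` are hypothetical names rather than part of the compile-debug skill.

```python
from typing import Callable, List, Tuple


def run_with_debug(step: Callable[[], Tuple[bool, str]],
                   fix: Callable[[str], None],
                   max_fixes: int = 3) -> Tuple[bool, List[str]]:
    """Run a step, applying at most max_fixes fix-and-rerun rounds on failure.

    Returns (success, collected error messages) so failures can be reported in detail.
    """
    errors: List[str] = []
    ok, message = step()
    while not ok and len(errors) < max_fixes:
        errors.append(message)
        fix(message)
        ok, message = step()
    if not ok:
        errors.append(message)  # final failure, reported to the user
    return ok, errors


# Hypothetical step that succeeds on its third run.
calls = {"runs": 0, "fixes": 0}


def flaky_build() -> Tuple[bool, str]:
    calls["runs"] += 1
    return calls["runs"] >= 3, f"failure on run {calls['runs']}"


def apply_fix(message: str) -> None:
    calls["fixes"] += 1


ok, errors = run_with_debug(flaky_build, apply_fix)
print(ok, errors, calls)
```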