AscendC Operator End-to-End Development Orchestration

Skill Type: Process-oriented (seven-stage workflow with serial orchestration of sub-skills)

This skill orchestrates seven stage-level sub-skills (plus compile-debug, which code-gen calls internally) to take ascend-kernel operators from scratch to production readiness.

Core Principles

Seven-stage Serial Execution: Project Initialization → Design Documentation → Test Case Generation → Code Generation & Testing → Interface Documentation → Precision Evaluation → Performance Benchmarking, executed in strict order
Sub-skill Execution: Each stage MUST call the corresponding sub-skill; self-implementation is not allowed
Stage Gating: Proceed to the next stage only after all checkpoints of the previous stage are passed
Design-driven Coding: Code generation depends on the Tiling strategy and UB allocation table in the design document
Automated Design: No need for users to provide pre-prepared design documents; the design stage generates them automatically
Unified Test Case Generation: Generate test case documents immediately after design completion for reuse in subsequent precision evaluation and performance benchmarking
Documentation Closure: After passing compilation and testing, MUST generate Chinese interface documents in PyTorch style and display them in the chat interface
Precision Closure: Operators must pass ≥30 comprehensive precision evaluation cases to be considered complete
Performance Closure: Operators must pass msprof performance comparison and benchmarking, with a performance report output
Result Visualization: Results of Phases 4/5/6/7 MUST be displayed directly in the chat interface in Markdown format; outputting only file paths is not sufficient
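
The serial execution and stage gating principles above can be sketched as a small loop. This is a minimal illustration, not the skill's actual implementation: the `Stage` tuple and its `run` / `checkpoints_pass` callables are hypothetical stand-ins for "call the sub-skill" and "verify its checkpoints".

```python
from typing import Callable, List, Tuple

# Hypothetical stage record: (name, run, checkpoints_pass).
# Neither callable is defined by this skill; they only illustrate the flow.
Stage = Tuple[str, Callable[[], None], Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first stage whose checkpoints fail."""
    completed = []
    for name, run, checkpoints_pass in stages:
        run()
        if not checkpoints_pass():
            break  # stage gating: later stages never start
        completed.append(name)
    return completed


demo = [
    ("project-init", lambda: None, lambda: True),
    ("design", lambda: None, lambda: False),  # simulated checkpoint failure
    ("testcase-gen", lambda: None, lambda: True),
]
completed = run_pipeline(demo)
print(completed)
```

In this demo only `project-init` is reported as completed, because the simulated failure in `design` blocks every later stage.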

Available Sub-skill List

Skill	Path	Responsibility
`ascendc-operator-project-init`	`ascendc-operator-project-init/SKILL.md`	Detect/create ascend-kernel project, generate operator skeleton directory
`ascendc-operator-design`	`ascendc-operator-design/SKILL.md`	Analyze operator requirements, generate design document (including Tiling strategy, UB allocation table)
`ascendc-operator-testcase-gen`	`ascendc-operator-testcase-gen/SKILL.md`	Generate unified test case document based on design document for reuse in precision evaluation and performance benchmarking
`ascendc-operator-code-gen`	`ascendc-operator-code-gen/SKILL.md`	Generate op_host/op_kernel code, framework adaptation, compilation testing
`ascendc-operator-compile-debug`	`ascendc-operator-compile-debug/SKILL.md`	Compile, install whl package, generate test files, run precision tests (called internally by code-gen)
`ascendc-operator-doc-gen`	`ascendc-operator-doc-gen/SKILL.md`	Extract interface information from source code, generate Chinese API documents in PyTorch style (mandatory stage)
`ascendc-operator-precision-eval`	`ascendc-operator-precision-eval/SKILL.md`	Generate ≥30 precision test cases, run them and output precision verification report (mandatory stage)
`ascendc-operator-performance-eval`	`ascendc-operator-performance-eval/SKILL.md`	Use msprof to compare performance between project operators and native operators, output performance benchmarking report (mandatory stage)

Workflow Overview

Phase 1 ──▶ Phase 2 ──▶ Phase 3 ──▶ Phase 4 ──▶ Phase 5 ──▶ Phase 6 ──▶ Phase 7

Phase 1: Project Init (project-init)
Phase 2: Design Doc (design)
Phase 3: Test Case Gen (testcase-gen)
Phase 4: Code Gen + Framework Adaptation + Compile Test (code-gen → compile-debug)
Phase 5: Interface Doc (doc-gen)
Phase 6: Precision Eval Report (precision-eval)
Phase 7: Performance Benchmark Report (performance-eval)

Input: Operator Name + Function Description
Output: Production-ready Operator + Test Case Doc + Interface Doc + Precision Report + Performance Report

Anti-pattern List (NEVER DO THESE)

❌ Do not skip the design stage and directly write code
❌ Do not skip the test case generation stage; Phase 3 (testcase-gen) must be executed after Phase 2 is passed
❌ Do not implement any operator code yourself; always call the sub-skills
❌ Do not modify framework files (ops.h / register.cpp / CMakeLists.txt) before code generation
❌ Do not compile or test manually; both are handled uniformly by the compile-debug skill
❌ Do not reference non-existent skills
❌ Do not skip checkpoint verification
❌ Do not skip the interface documentation stage; Phase 5 must be executed after Phase 4 is passed
❌ Do not skip the precision evaluation stage; Phase 6 must be executed after Phase 5 is passed
❌ Do not skip the performance benchmarking stage; Phase 7 must be executed after Phase 6 is passed
❌ Do not base performance conclusions on any timing method other than msprof
❌ Do not design test cases for precision evaluation or performance benchmarking yourself; first read the test case document generated by testcase-gen

Phase 0: Requirements Collection

Goal: Confirm the minimum information set required for operator development, including development environment and operator requirements

Step 0.1: Environment Confirmation (MUST be completed before any development action)

The development environment is a prerequisite for all subsequent stages and must be confirmed first.

CANN Environment

Automatic Detection Process:

Check whether the environment variable `ASCEND_HOME_PATH` is set (`echo $ASCEND_HOME_PATH`)
If set: Use it directly as `CANN_PATH` without asking the user
If not set: MUST ask the user for the CANN installation path (e.g., `/usr/local/Ascend/ascend-toolkit`)

Activation Method:

bash

source ${CANN_PATH}/*/set_env.sh

In every Shell session that requires compiling or running operators, this activation command must be executed first.
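
The CANN detection rule above can be expressed as a small helper. This is a sketch under stated assumptions: the function names `resolve_cann_path` and `activation_command` are hypothetical, and returning `None` is used here to mean "ask the user".

```python
import os
from typing import Mapping, Optional


def resolve_cann_path(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return ASCEND_HOME_PATH if it is set and non-empty, else None (ask the user)."""
    value = env.get("ASCEND_HOME_PATH", "").strip()
    return value or None


def activation_command(cann_path: str) -> str:
    """Build the activation line quoted in this section for a given CANN_PATH."""
    return f"source {cann_path}/*/set_env.sh"


detected = resolve_cann_path({"ASCEND_HOME_PATH": "/usr/local/Ascend/ascend-toolkit"})
print(detected, "|", activation_command(detected))
```

Passing the environment as a mapping keeps the helper testable without touching the real shell environment.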

Conda Environment

Automatic Detection Process:

Check whether a conda environment is currently activated (`echo $CONDA_DEFAULT_ENV`)
If activated (value is neither `base` nor empty): Use the current environment directly without asking the user
If not activated or the value is `base`: MUST ask the user for the name of the conda environment to use

Activation Method:

bash

conda activate <env_name>

In every Shell session that requires compiling or running operators, the conda environment must be activated first.
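
The conda rule above (accept any active environment except `base`) could look like the following sketch. The helper name `resolve_conda_env` is hypothetical, and `None` again stands for "ask the user".

```python
import os
from typing import Mapping, Optional


def resolve_conda_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the active conda environment name, or None if it is unset, empty, or 'base'."""
    name = env.get("CONDA_DEFAULT_ENV", "").strip()
    if not name or name == "base":
        return None
    return name


for candidate in ({"CONDA_DEFAULT_ENV": "ascend-dev"}, {"CONDA_DEFAULT_ENV": "base"}, {}):
    print(candidate, "->", resolve_conda_env(candidate))
```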

Environment Confirmation Checkpoints

CANN path is confirmed (auto-detected or provided by user)
`source ${CANN_PATH}/*/set_env.sh` can be executed normally
Conda environment name is confirmed (auto-detected or provided by user)
`conda activate <env_name>` can be executed normally

Step 0.2: Operator Requirements Collection

Mandatory Information to Confirm

Information	Format Requirement	Mandatory	Description
CANN Environment Path	Absolute path	Yes	Auto-detect `$ASCEND_HOME_PATH`; ask the user if not set
Conda Environment Name	String	Yes	Auto-detect `$CONDA_DEFAULT_ENV`; ask the user if not activated
Operator Name	snake_case	Yes	e.g., `acosh`, `rms_norm`, `flash_attn`
Function Description	Text/Mathematical Formula	Yes	e.g., "Inverse hyperbolic cosine acosh(x) = ln(x + sqrt(x²-1))"

Optional Information (with default values):

Information	Default Value	Description
Supported Data Types	float16, float32	Can be extended to bfloat16
SoC Platform	ascend910b	Auto-obtained via platform API

Decision Tree

User Request	Handling Method
"Generate X operator" / "Develop X operator"	Complete environment confirmation (Step 0.1) first, then infer the function from the operator name, and execute the full process directly after confirmation
"Help me develop a new operator" (no specific name)	Complete environment confirmation (Step 0.1) first, then ask for the operator name and function description
"Continue operator development"	Complete environment confirmation (Step 0.1) first, then check existing files to determine the stage and resume from the interrupted point

Acceptance Criteria

CANN environment path is confirmed and can be activated
Conda environment name is confirmed and can be activated
Operator name is confirmed (snake_case format)
Function description is clear (including mathematical formula or calculation logic)
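
The snake_case requirement on the operator name can be checked mechanically. The pattern below is one reasonable reading of "snake_case" (lowercase letters and digits, single underscores between words); the skill does not spell out the exact rule, so treat it as an assumption.

```python
import re

# Assumed rule: starts with a lowercase letter, words separated by single underscores.
_SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_snake_case(name: str) -> bool:
    """Return True if the operator name matches the assumed snake_case pattern."""
    return _SNAKE_CASE.fullmatch(name) is not None


for name in ("acosh", "rms_norm", "flash_attn", "RmsNorm", "rms__norm"):
    print(f"{name}: {is_snake_case(name)}")
```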

Phase 1: Project Initialization

Called Skill: `ascendc-operator-project-init`

Execution Content

MANDATORY: Execute according to the ascendc-operator-project-init skill process:
1. Detect if the ascend-kernel project exists
2. Copy from template if it does not exist
3. Create operator skeleton under csrc/ops/<op_name>/
4. Prompt the user about the three registration update points (ops.h, register.cpp, CMakeLists.txt)

Checkpoints

ascend-kernel project exists (build.sh, CMakeLists.txt, csrc/)
`csrc/ops/<op_name>/` directory has been created
Directory contains `op_host/<op_name>.cpp`, `op_kernel/<op_name>.cpp`, `CMakeLists.txt`, and `design.md`

All passed → Proceed to Phase 2
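
The Phase 1 skeleton checkpoint amounts to a file-existence check. The sketch below lists which of the four expected skeleton files are missing; the function name `missing_skeleton_files` is hypothetical, and the demo builds a throwaway directory rather than a real ascend-kernel project.

```python
import tempfile
from pathlib import Path
from typing import List


def missing_skeleton_files(project_root: Path, op_name: str) -> List[str]:
    """Return the expected skeleton files (relative paths) that do not exist yet."""
    op_dir = project_root / "csrc" / "ops" / op_name
    expected = [
        op_dir / "op_host" / f"{op_name}.cpp",
        op_dir / "op_kernel" / f"{op_name}.cpp",
        op_dir / "CMakeLists.txt",
        op_dir / "design.md",
    ]
    return [p.relative_to(project_root).as_posix() for p in expected if not p.is_file()]


# Demo: only the host file exists, so the other three are reported as missing.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    host_dir = root / "csrc" / "ops" / "acosh" / "op_host"
    host_dir.mkdir(parents=True)
    (host_dir / "acosh.cpp").write_text("// skeleton\n", encoding="utf-8")
    missing = missing_skeleton_files(root, "acosh")
print(missing)
```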

Phase 2: Design Document Generation

Called Skill: `ascendc-operator-design`

Execution Content

MANDATORY: Execute according to the ascendc-operator-design skill process:
1. Analyze operator requirements (name, function, data types)
2. Determine implementation path (AscendC Kernel / CATLASS / ACLNN)
3. Design Tiling strategy (Block-level + UB-level)
4. Fill in UB allocation table, derive bufferCoefficient
5. Generate complete design document to csrc/ops/<op_name>/design.md

Checkpoints

`csrc/ops/<op_name>/design.md` is complete in content
Contains function signature and supported data types
Contains calculation logic pseudocode (AscendC API call sequence)
Contains UB allocation table (lists all buffers and total coefficients)
Contains bufferCoefficient (value for each dtype)

All passed → Proceed to Phase 3
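
To make the relationship between the UB allocation table and bufferCoefficient concrete, here is a worked arithmetic sketch. Everything in it is an assumption for illustration: bufferCoefficient is taken to mean "bytes of UB consumed per tile element, summed over all buffers", the UB capacity and 32-byte alignment are placeholder values (the real ones come from the platform API and the design sub-skill), and the buffer names are made up.

```python
UB_BYTES = 192 * 1024  # assumed UB capacity; query the platform API in real code
ALIGN_BYTES = 32       # assumed alignment granularity


def buffer_coefficient(rows):
    """rows: (buffer name, number of copies, bytes per element). Returns bytes per element."""
    return sum(copies * elem_bytes for _, copies, elem_bytes in rows)


def max_tile_elements(coefficient, dtype_bytes):
    """Largest aligned element count whose buffers fit into UB under the assumptions above."""
    raw = UB_BYTES // coefficient
    per_block = ALIGN_BYTES // dtype_bytes  # elements per aligned block
    return raw // per_block * per_block


# Hypothetical UB allocation table for a float16 elementwise operator.
fp16_rows = [
    ("inQueueX", 2, 2),   # double-buffered float16 input
    ("outQueueY", 2, 2),  # double-buffered float16 output
    ("tmpFp32", 2, 4),    # two float32 scratch buffers
]
coeff = buffer_coefficient(fp16_rows)
tile = max_tile_elements(coeff, dtype_bytes=2)
print(coeff, tile)
```

Under these placeholder numbers the coefficient is 16 bytes per element and the tile holds 12288 elements; a different dtype changes both values, which is why the checkpoint asks for one coefficient per dtype.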

Phase 3: Test Case Generation

Called Skill: `ascendc-operator-testcase-gen`

Execution Content

MANDATORY: Execute according to the ascendc-operator-testcase-gen skill process:
1. Read csrc/ops/<op_name>/design.md, extract parameter constraints, supported dtypes, typical shapes
2. Generate TEST_SHAPES (regular shapes), GENERAL_SHAPES (generalized shapes), BOUNDARY_VALUES (boundary values)
3. Generate operator benchmarks (CPU reference implementation, NPU calling method)
4. Output test case document to csrc/ops/<op_name>/test/<op_name>-test-cases.md

Checkpoints

`csrc/ops/<op_name>/test/<op_name>-test-cases.md` has been generated

Contains SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES, BOUNDARY_VALUES
Contains operator benchmarks (NPU calling method + CPU reference implementation)
Shapes and parameter values are within the constraints of design.md

All passed → Proceed to Phase 4

Phase 4: Code Generation + Framework Adaptation + Compile Test

Called Skill: `ascendc-operator-code-gen` (internally calls `ascendc-operator-compile-debug` automatically)

Execution Content

MANDATORY: Execute according to the ascendc-operator-code-gen skill process:

Stage 1: Load Reference Documents
  - Read references/GUIDE.md
  - Load corresponding reference according to operator type

Stage 2: Read Design Document
  - Extract function signature, UB allocation table, calculation pseudocode

Stage 3: Select Template and Generate Code
  - Select elementwise / row template
  - Generate op_host/<op_name>.cpp (includes Tiling calculation logic)
  - Generate op_kernel/<op_name>.cpp (includes Compute calculation logic)

Stage 4: Framework Adaptation
  - Update csrc/ops.h (function declaration)
  - Update csrc/register.cpp (m.def + m.impl)
  - Update csrc/CMakeLists.txt (OP_SRCS + ascendc_library)

Stage 5: Compilation, Installation and Testing (call compile-debug skill)
  - Compile via ./build.sh
  - Install via pip install whl
  - Generate tests/test_<op_name>.py
  - Run functional tests and precision tests
  - Debug up to 3 times if compilation/test fails

Checkpoints

`op_host/<op_name>.cpp` uses the platform API to obtain hardware parameters
`op_kernel/<op_name>.cpp` contains a complete CopyIn → Compute → CopyOut pipeline
Function declaration has been added to `ops.h`
`m.def` and `m.impl` have been added to `register.cpp`
Host and kernel source files have been added to `csrc/CMakeLists.txt`
Compilation is successful (whl package has been generated)
Functional tests pass (exit code 0)
All precision tests pass (pytest all green)

All passed → Proceed to Phase 5

Phase 5: Interface Document Generation

Called Skill: `ascendc-operator-doc-gen`

Execution Content

MANDATORY: Execute according to the ascendc-operator-doc-gen skill process:

Stage 1: Information Extraction
  - Extract Python calling signature (m.def schema) from register.cpp
  - Extract C++ function declaration and return type from ops.h
  - Extract algorithm description, parameter description, dtype support, constraint conditions from design.md
  - Extract TORCH_CHECK constraints from op_host
  - Extract usage examples from tests/test_<op_name>.py

Stage 2: Document Structure Assembly
  - Assemble Chinese interface documents in PyTorch official documentation style
  - Includes: Title Signature + Function Description + Parameter Description + Supported Data Types + Shape + Constraint Conditions + Usage Examples + Return Value

Stage 3: File Generation
  - Generate csrc/ops/<op_name>/README.md

Stage 4: Display complete document content in the interactive interface

Checkpoints

Complete interface information has been extracted from source code (signature, parameters, dtype, shape, constraints)
README.md contains all 8 sections (title signature + function description + parameter description + supported data types + shape + constraint conditions + usage examples + return value)
Python calling signature is consistent with `m.def` in `register.cpp`
Parameter descriptions use PyTorch documentation style, described in Chinese
Code in usage examples is runnable
README.md has been written to `csrc/ops/<op_name>/README.md`
Interface document has been fully displayed in the chat interface

All passed → Proceed to Phase 6

Phase 6: Precision Evaluation Report

Called Skill: `ascendc-operator-precision-eval`

Execution Content

MANDATORY: Execute according to the ascendc-operator-precision-eval skill process:

Stage 1: Load Test Case Document + Information Collection
  - Read csrc/ops/<op_name>/test/<op_name>-test-cases.md (output from testcase-gen)
  - Extract SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES, BOUNDARY_VALUES, operator benchmarks
  - Supplement and extract information such as precision thresholds from existing code

Stage 2: Test Case Adaptation ((shapes + boundary) × dtypes ≥ 30 cases)
  - Directly reuse TEST_SHAPES and BOUNDARY_VALUES from testcase-gen
  - Traverse all dtypes supported by the operator for each shape / boundary value

Stage 3: Test Script Generation (output to operator directory csrc/ops/<op_name>/test/)
  - Generate test_<op_name>_precision.py (pytest format) based on template
  - Generate run_<op_name>_precision_report.py (report generator) based on template

Stage 4: Execution
  - Run pytest; all tests must pass
  - Run report generator to output JSON

Stage 5: Report Generation
  - Generate <op_name>_precision_report.md (includes regular shape + boundary value table + summary + key findings)
  - Tell the user where the report is located
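
The case-count rule in Stage 2, (shapes + boundary) × dtypes ≥ 30, can be sketched as a simple enumeration. The shape and boundary values below are placeholders, not the output of testcase-gen, and the dictionary layout of a case is hypothetical.

```python
from itertools import product


def build_precision_cases(shapes, boundary_values, dtypes):
    """Pair every shape and every boundary value with every supported dtype."""
    cases = [{"kind": "shape", "value": s, "dtype": d} for s, d in product(shapes, dtypes)]
    cases += [{"kind": "boundary", "value": b, "dtype": d} for b, d in product(boundary_values, dtypes)]
    return cases


# Placeholder inputs; real values are read from <op_name>-test-cases.md.
TEST_SHAPES = [(1,), (128,), (1023,), (1024,), (4096,), (7, 13), (32, 64), (64, 128), (8, 16, 32), (2, 4, 8, 16)]
BOUNDARY_VALUES = [1.0, 1.001, 2.0, 1e2, 1e4]
SUPPORTED_DTYPES = ["float16", "float32"]

cases = build_precision_cases(TEST_SHAPES, BOUNDARY_VALUES, SUPPORTED_DTYPES)
print(len(cases), len(cases) >= 30)
```

With 10 shapes, 5 boundary values, and 2 dtypes the enumeration yields exactly 30 cases, the minimum the checkpoint accepts.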

Checkpoints

Number of test cases = (shapes + boundary) × dtypes ≥ 30
Each dtype supported by the operator has been tested
All pytest precision tests pass
JSON report is generated (includes 5 precision metrics: MaxAbsErr / MeanAbsErr / MaxRelErr / MeanRelErr / CosineSim)

Markdown report is generated at `csrc/ops/<op_name>/test/<op_name>_precision_report.md`

Precision test results have been displayed in the chat interface in Markdown table format
The user has been told where the precision report is located

All passed → Proceed to Phase 7
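
The five metrics named in the JSON report checkpoint can be computed as follows. The formulas are common definitions and are an assumption here; the precision-eval sub-skill may define them differently (for example, a different epsilon in the relative error). Plain Python lists are used to keep the sketch dependency-free.

```python
import math


def precision_metrics(actual, expected, eps=1e-12):
    """Compute the five report metrics for two equal-length sequences of floats."""
    if not actual or len(actual) != len(expected):
        raise ValueError("inputs must be non-empty and of equal length")
    abs_err = [abs(a - e) for a, e in zip(actual, expected)]
    rel_err = [d / (abs(e) + eps) for d, e in zip(abs_err, expected)]
    dot = sum(a * e for a, e in zip(actual, expected))
    norm = math.sqrt(sum(a * a for a in actual)) * math.sqrt(sum(e * e for e in expected))
    return {
        "MaxAbsErr": max(abs_err),
        "MeanAbsErr": sum(abs_err) / len(abs_err),
        "MaxRelErr": max(rel_err),
        "MeanRelErr": sum(rel_err) / len(rel_err),
        "CosineSim": dot / norm if norm else float("nan"),
    }


metrics = precision_metrics([1.0, 2.5], [1.0, 2.0])
for key, value in metrics.items():
    print(f"{key}: {value:.6f}")
```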

Phase 7: Performance Benchmarking Report

Called Skill: `ascendc-operator-performance-eval`

Execution Content

MANDATORY: Execute according to the ascendc-operator-performance-eval skill process:

Stage 1: Load Test Case Document + Information Collection
  - Read csrc/ops/<op_name>/test/<op_name>-test-cases.md (output from testcase-gen)
  - Extract SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES, operator benchmarks
  - Supplement and extract information such as OP Type keywords from existing code

Stage 2: Test Case Adaptation (JSONL format, ≥8 cases)
  - Select representative shapes from TEST_SHAPES + GENERAL_SHAPES of testcase-gen
  - Cover all dtypes supported by the operator
  - Convert to JSONL format

Stage 3: Script Generation (output to operator directory csrc/ops/<op_name>/test/)
  - Generate run_<op_name>_case.py (single case msprof executor) based on template
  - Generate benchmark_<op_name>_msprof.py (master control script) based on template
  - Generate <op_name>_cases.jsonl

Stage 4: Execution and Collection
  - Run the master control script, 20 iterations per case (first 10 for warm-up)
  - Extract Task Duration(us) and hardware metrics from op_summary_*.csv by OP Type
  - Output JSON results

Stage 5: Report Generation
  - Generate <op_name>_perf_report.md (includes result table + summary + brief analysis)
  - Tell the user where the report is located
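
The JSONL conversion in Stage 2 means one JSON object per line, one line per case. The sketch below shows that layout only; the field names `case_id`, `shape`, and `dtype` are hypothetical, and the actual schema is defined by the performance-eval sub-skill's templates.

```python
import json


def to_jsonl(cases):
    """Serialize cases as JSON Lines: one compact JSON object per line."""
    return "\n".join(json.dumps(c, sort_keys=True) for c in cases) + "\n"


# Placeholder shapes; real ones are selected from TEST_SHAPES + GENERAL_SHAPES.
shapes = [[1024], [4096], [32, 64], [8, 16, 32]]
dtypes = ["float16", "float32"]
perf_cases = [
    {"case_id": "x".join(map(str, s)) + "_" + d, "shape": s, "dtype": d}
    for s in shapes
    for d in dtypes
]
jsonl_text = to_jsonl(perf_cases)
print(jsonl_text, end="")
```

Four shapes across two dtypes give 8 lines, which meets the ≥8 case requirement while covering every supported dtype.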

Checkpoints

JSONL test cases cover multiple shape × dtype combinations (≥8 cases)
Uses `msprof` for collection, no other timing methods
Filters target operators by `OP Type` (not Op Name)
Each case runs 20 iterations, with the first 10 used for warm-up and the remaining 10 for statistics
JSON report is generated (includes Task Duration + hardware metrics)

Markdown report is generated at `csrc/ops/<op_name>/test/<op_name>_perf_report.md`

Report contains brief analysis (≥3 conclusions)
Performance test results have been displayed in the chat interface in Markdown table format
The user has been told where the performance report is located

All passed → Operator development is complete
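
The warm-up strategy in Phase 7 (20 iterations, first 10 discarded) reduces to averaging the tail of the duration list. This sketch assumes the per-iteration Task Duration values have already been extracted from the msprof output; the sample numbers are made up and the helper name is hypothetical.

```python
from statistics import mean


def steady_state_mean(durations_us, warmup=10):
    """Average the durations that remain after dropping the warm-up iterations."""
    if len(durations_us) <= warmup:
        raise ValueError("need more iterations than warm-up runs")
    return mean(durations_us[warmup:])


# Made-up data: 10 slow warm-up iterations followed by 10 steady-state ones.
samples_us = [50.0] * 10 + [20.0, 22.0] * 5
avg_us = steady_state_mean(samples_us)
print(avg_us)
```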

Inter-stage Data Flow

Phase 1 Output                                 Phase 2 Input
  csrc/ops/<op_name>/                  ────▶   Operator name, directory structure
  design.md (placeholder)

Phase 2 Output                                 Phase 3 Input
  design.md (complete)                 ────▶   Parameter constraints, supported dtypes, typical shapes
                                               → Generate unified test case document

Phase 3 Output                                 Phase 4 Input
  <op_name>-test-cases.md              ────▶   design.md (complete)
  (test case document for                      Function signature, UB allocation table → bufferCoefficient
  subsequent reuse)                            Calculation pseudocode → Compute logic
                                               Tiling strategy → Block/UB splitting parameters

Phase 4 Output                                 Phase 5 Input
  Installed operator whl               ────▶   register.cpp / ops.h / design.md /
  tests/test_<op_name>.py                      op_host / test files
                                               → Extract interface information to generate documents

Phase 5 Output                                 Phase 6 Input
  csrc/ops/<op_name>/README.md         ────▶   <op_name>-test-cases.md (from Phase 3)
  Interface document completed                 Operator name, calling method, input domain constraints
                                               All supported dtypes, precision thresholds
                                               → Output to csrc/ops/<op_name>/test/

Phase 6 Output                                 Phase 7 Input
  Precision report passed              ────▶   <op_name>-test-cases.md (from Phase 3)
  csrc/ops/<op_name>/test/                     Operator name, project/native calling method
                                               All supported dtypes, OP Type keywords
                                               → Output to csrc/ops/<op_name>/test/

Status Tracking Table

Phase	Precondition	Called Skill	Key Deliverables
0. Requirements Collection	None	—	CANN path + Conda environment + Operator name + Function description
1. Project Initialization	Phase 0	`ascendc-operator-project-init`	Operator skeleton directory
2. Design Document	Phase 1	`ascendc-operator-design`	design.md (includes Tiling + UB allocation table)
3. Test Case Generation	Phase 2	`ascendc-operator-testcase-gen`	`<op_name>-test-cases.md` (unified test case document)
4. Code & Testing	Phase 3	`ascendc-operator-code-gen` → `compile-debug`	Runnable operator + basic tests passed
5. Interface Document	Phase 4	`ascendc-operator-doc-gen`	PyTorch-style Chinese API document (README.md)
6. Precision Evaluation	Phase 5	`ascendc-operator-precision-eval`	≥30 precision test cases + precision report
7. Performance Benchmarking	Phase 6	`ascendc-operator-performance-eval`	msprof performance comparison + performance report

Error Recovery

Resume from Interrupted Point

When the user says "Continue operator development":

Detection Condition	Determined Stage	Recovery Action
`csrc/ops/<op_name>/` does not exist	Phase 1 not completed	Start from Phase 1
`design.md` is placeholder or empty	Phase 2 not completed	Start from Phase 2
`csrc/ops/<op_name>/test/<op_name>-test-cases.md` does not exist	Phase 3 not completed	Start from Phase 3
`op_host/` still contains skeleton code	Phase 4 not completed	Start from Phase 4
whl package not generated	Phase 4 compilation not completed	Resume from compilation step
Basic tests not passed	Phase 4 testing not completed	Resume from testing step
`csrc/ops/<op_name>/README.md` does not exist	Phase 5 not completed	Start from Phase 5
No precision report in `csrc/ops/<op_name>/test/`	Phase 6 not started	Start from Phase 6
Precision report does not exist or precision tests not all passed	Phase 6 not completed	Resume from Phase 6
Precision report exists but performance report does not	Phase 7 not started	Start from Phase 7
`<op_name>_perf_report.md` does not exist or is incomplete	Phase 7 not completed	Resume from Phase 7
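
The file-based rows of the table above can be turned into a detection routine. This is a simplified sketch: it treats an empty `design.md` as the placeholder, uses the presence of `tests/test_<op_name>.py` as a stand-in for "Phase 4 completed", and does not inspect skeleton code, the whl package, or test results. The function name `detect_resume_phase` is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def detect_resume_phase(root: Path, op_name: str) -> Optional[int]:
    """Return the first phase whose deliverable is missing, or None if all exist."""
    op_dir = root / "csrc" / "ops" / op_name
    test_dir = op_dir / "test"
    design = op_dir / "design.md"
    if not op_dir.is_dir():
        return 1
    if not design.is_file() or not design.read_text(encoding="utf-8").strip():
        return 2
    if not (test_dir / f"{op_name}-test-cases.md").is_file():
        return 3
    if not (root / "tests" / f"test_{op_name}.py").is_file():
        return 4  # simplification: stands in for the code, whl, and test checks
    if not (op_dir / "README.md").is_file():
        return 5
    if not (test_dir / f"{op_name}_precision_report.md").is_file():
        return 6
    if not (test_dir / f"{op_name}_perf_report.md").is_file():
        return 7
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = detect_resume_phase(root, "acosh")  # nothing exists yet
    op_dir = root / "csrc" / "ops" / "acosh"
    op_dir.mkdir(parents=True)
    (op_dir / "design.md").write_text("# acosh design\n", encoding="utf-8")
    after = detect_resume_phase(root, "acosh")   # design exists, test cases do not
print(before, after)
```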

Compilation/Test Failure

Handled internally by the `ascendc-operator-compile-debug` skill, with up to 3 debugging attempts. If it still fails after 3 attempts, stop and report detailed errors to the user.
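
The bounded retry policy can be sketched as below. It assumes "up to 3 debugging attempts" means three attempts in total, and the `attempt` callable is a hypothetical stand-in for one compile-and-test round of the compile-debug skill.

```python
from typing import Callable, List, Tuple


def run_with_retries(attempt: Callable[[], Tuple[bool, str]], max_attempts: int = 3):
    """Call attempt() until it succeeds or max_attempts is reached; collect error messages."""
    errors: List[str] = []
    for _ in range(max_attempts):
        ok, message = attempt()
        if ok:
            return True, errors
        errors.append(message)
    return False, errors  # caller should stop and report the errors to the user


# Demo 1: succeeds on the third attempt.
outcomes = iter([(False, "link error"), (False, "precision mismatch"), (True, "")])
ok, errors = run_with_retries(lambda: next(outcomes))

# Demo 2: never succeeds, so all three error messages are returned for reporting.
failed_ok, failed_errors = run_with_retries(lambda: (False, "compile error"))
print(ok, errors, failed_ok, failed_errors)
```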
