AscendC Operator Design Skill

Generate a complete design document (design.md) based on operator requirements for subsequent consumption by the code-gen skill.

Usage Scenarios

Call this skill after `ascendc-operator-project-init` creates the skeleton and before `ascendc-operator-code-gen` generates code.

Design Process

1. Operator Requirement Analysis

If called by the scheduling skill, the operator name and function description are already determined; proceed directly to Step 2.

If called independently, confirm the following information with the user:

Information	Mandatory	Description
Operator Name (snake_case)	Yes	e.g., `acosh` , `rms_norm`
Function Description / Mathematical Formula	Yes	e.g., "acosh(x) = ln(x + sqrt(x²-1))"
Supported Data Types	No	Default: float16 + float32

MANDATORY: Check whether an interface with the same name exists in PyTorch / NumPy. If it does, the interface signature and semantics must align with it (e.g., `torch.acosh`, `torch.softmax`).

2. Select Implementation Path

Recommend appropriate implementation methods based on operator characteristics:

Implementation Path	Applicable Scenarios	Judgment Criteria
AscendC Kernel	Pure vector operators	No matrix multiplication involved
CATLASS Template Library	GEMM / FlashAttention	Contains cube matrix computation
ACLNN Encapsulation	Built-in operators already available in CANN	No custom kernel required

New operators default to the AscendC Kernel path unless matrix multiplication is explicitly involved.
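The selection rule in the table can be sketched as a small decision function. This is only an illustration: `ImplPath`, `OperatorTraits`, and `selectImplPath` are hypothetical names, and the order of the checks (built-in operator first, then cube computation) is an assumption, since the table does not state a priority.

```cpp
// Hypothetical names for illustration; none of them belong to AscendC,
// CATLASS, or CANN.
enum class ImplPath { AscendCKernel, CatlassTemplate, AclnnEncapsulation };

struct OperatorTraits {
    bool builtinInCann;   // a built-in CANN operator already covers the request
    bool usesCubeMatmul;  // GEMM / FlashAttention style cube matrix computation
};

// Assumption: an existing built-in operator is preferred over any custom
// kernel; the source table does not rank the three paths.
inline ImplPath selectImplPath(const OperatorTraits& traits) {
    if (traits.builtinInCann) {
        return ImplPath::AclnnEncapsulation;
    }
    if (traits.usesCubeMatmul) {
        return ImplPath::CatlassTemplate;
    }
    return ImplPath::AscendCKernel;  // default for new pure vector operators
}
```

For example, `selectImplPath({false, false})` yields `ImplPath::AscendCKernel`, which matches the default stated above.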

3. Generate Detailed Design Document

MANDATORY: Before generating the design document, you must read the following reference documents:

Required: `templates/design-template.md` (design document template)
Optional reading by operator type:
- Element-wise operations (add/relu/acosh/sigmoid...) → `references/elementwise-tiling.md`
- Reduction operations (softmax/layernorm...) → `references/reduction-tiling.md`
General Reference: `references/general-tiling-principles.md`

Never skip reading the reference documents.

3.1 Design Document Structure

The design document includes the following core sections:

Operator Interface Definition — Function signature, parameter description, supported data types
Computation Logic Design — Algorithm description, AscendC API call pseudocode, implementation path selection
Tiling Strategy — Two-level Tiling design (Block-level + UB-level), UB allocation table, tileLength calculation
Workspace Requirements — Workspace size calculation
Performance Optimization — Key optimization points, operator characteristic analysis
Kernel-side Implementation Key Points — Offset calculation, execution process, FP16/BF16 precision lifting process
Implementation Checklist — File structure, code key points, test key points

3.2 Computation Logic Pseudocode (Key Output)

The design document must decompose each mathematical formula into a sequence of AscendC API calls. This pseudocode is the direct input for the code-gen skill.

Common Mathematical Function to AscendC API Mapping:

Mathematical Operation	AscendC API	Remarks
x + y	`Add(dst, src0, src1, len)`	Dual input
x - y	`Sub(dst, src0, src1, len)`	Dual input
x * y	`Mul(dst, src0, src1, len)`	Dual input
x / y	`Div(dst, src0, src1, len)`	Dual input
x + scalar	`Adds(dst, src, scalar, len)`	Scalar operation; preferred when one operand is a scalar
x * scalar	`Muls(dst, src, scalar, len)`	Scalar operation; preferred when one operand is a scalar
abs(x)	`Abs(dst, src, len)`
exp(x)	`Exp(dst, src, len)`
ln(x)	`Ln(dst, src, len)`
sqrt(x)	`Sqrt(dst, src, len)`
1/x	`Reciprocal(dst, src, len)`
1/sqrt(x)	`Rsqrt(dst, src, len)`
tanh(x)	`Tanh(dst, src, len)`
relu(x)	`Relu(dst, src, len)`
max(x,y)	`Max(dst, src0, src1, len)`
min(x,y)	`Min(dst, src0, src1, len)`
fp16→fp32	`Cast(dst, src, RoundMode::CAST_NONE, len)`	Precision lifting, lossless
fp32→fp16	`Cast(dst, src, RoundMode::CAST_ROUND, len)`	Precision reduction, lossy (rounding)

Example — API Call Sequence for acosh(x):

```cpp
Mul(tmp, x, x, len);         // tmp = x²
Adds(tmp, tmp, -1.0f, len);  // tmp = x² - 1
Sqrt(tmp, tmp, len);         // tmp = sqrt(x² - 1)
Add(tmp, tmp, x, len);       // tmp = x + sqrt(x² - 1)
Ln(y, tmp, len);             // y = ln(x + sqrt(x² - 1))
```

Note: Several steps in the sequence above use the same buffer for dst and src (in-place operation). Most AscendC APIs support this, but confirm it for each specific API before relying on it.
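To check that a decomposition is mathematically right before writing the kernel, the same sequence can be replayed on the host. The sketch below assumes plain `std::vector<float>` loops as stand-ins for the vector APIs; the functions `Mul`, `Adds`, `Sqrt`, `Add`, `Ln`, and `acoshReference` are hypothetical host helpers that only mirror the argument order from the table, not the real AscendC calls.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side stand-ins (hypothetical): dst first, element count last,
// mirroring the argument order used in the mapping table.
using Buf = std::vector<float>;

inline void Mul(Buf& dst, const Buf& a, const Buf& b, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) dst[i] = a[i] * b[i];
}
inline void Adds(Buf& dst, const Buf& src, float scalar, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) dst[i] = src[i] + scalar;
}
inline void Sqrt(Buf& dst, const Buf& src, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) dst[i] = std::sqrt(src[i]);
}
inline void Add(Buf& dst, const Buf& a, const Buf& b, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) dst[i] = a[i] + b[i];
}
inline void Ln(Buf& dst, const Buf& src, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) dst[i] = std::log(src[i]);
}

// Replays the five-step acosh sequence; tmp is reused in place,
// which is safe here because every step is element-wise.
inline Buf acoshReference(const Buf& x) {
    const std::size_t len = x.size();
    Buf tmp(len);
    Buf y(len);
    Mul(tmp, x, x, len);         // tmp = x^2
    Adds(tmp, tmp, -1.0f, len);  // tmp = x^2 - 1
    Sqrt(tmp, tmp, len);         // tmp = sqrt(x^2 - 1)
    Add(tmp, tmp, x, len);       // tmp = x + sqrt(x^2 - 1)
    Ln(y, tmp, len);             // y = ln(x + sqrt(x^2 - 1))
    return y;
}
```

Comparing `acoshReference` against `std::acosh` on a few inputs with x >= 1 gives a quick sanity check of the pseudocode before it is handed to the code-gen skill.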

3.2 Tiling Strategy

Important: AscendC operators adopt a two-level Tiling strategy, illustrated below. For implementation details, read the reference document that matches the operator type (see Reference Documents at the end of this section).

┌─────────────────────────────────────────────────────────────┐
│                    Global Memory (GM)                        │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Data of totalLength elements               │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
    ┌──────────┐     ┌──────────┐     ┌──────────┐
    │  Core 0  │     │  Core 1  │ ... │ Core 39  │   ← Block-level Tiling (Inter-core Splitting)
    └──────────┘     └──────────┘     └──────────┘
          │                │                │
          ▼                ▼                ▼
    ┌──────────┐     ┌──────────┐     ┌──────────┐
    │   UB 0   │     │   UB 1   │     │  UB 39   │   ← UB-level Tiling (Intra-core Splitting)
    └──────────┘     └──────────┘     └──────────┘

Block-level Tiling (Inter-core Splitting):

Allocate data to multiple AI Cores for parallel processing
Load Balancing: Use a full-core/tail-core strategy, in which each tail core processes no more data than a full core
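One possible way to compute a full-core/tail-core split is sketched below. It assumes a balanced split in which full cores take one extra element each and cache line alignment is ignored; `BlockSplit` and `splitAcrossCores` are hypothetical names, while the field names follow the tiling struct shown in the Notes section.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type; field names follow the tiling struct in the Notes.
struct BlockSplit {
    int64_t formerNum;     // number of full cores
    int64_t formerLength;  // elements per full core
    int64_t tailNum;       // number of tail cores
    int64_t tailLength;    // elements per tail core (<= formerLength)
};

// Assumption: balanced split without alignment. The remainder is spread
// one element at a time over the first (totalLength % coreNum) cores.
inline BlockSplit splitAcrossCores(int64_t totalLength, int64_t coreNum) {
    assert(totalLength >= 0 && coreNum > 0);
    const int64_t base = totalLength / coreNum;
    const int64_t rem = totalLength % coreNum;
    if (rem == 0) {
        // Evenly divisible: every core is a full core.
        return {coreNum, base, 0, 0};
    }
    return {rem, base + 1, coreNum - rem, base};
}
```

With 1003 elements on 40 cores, this gives 3 full cores of 26 elements and 37 tail cores of 25 elements, which sums back to 1003.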

UB-level Tiling (Intra-core Splitting):

Split data for processing within each Core
UB Alignment: 32 bytes
UB Capacity: Do not exceed UB_SIZE_LIMIT (query the value through an interface during actual coding; 192KB is only an example value)
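Within one core, the intra-core split reduces to a loop count and the length of the last (possibly shorter) tile. The sketch below is a minimal illustration; `TileLoop` and `planTiles` are hypothetical names, and both lengths are assumed to be element counts.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type describing the per-core processing loop.
struct TileLoop {
    int64_t tileNum;         // number of UB-sized iterations
    int64_t lastTileLength;  // elements handled by the final iteration
};

// Splits coreLength elements into tiles of at most tileLength elements.
inline TileLoop planTiles(int64_t coreLength, int64_t tileLength) {
    assert(coreLength >= 0 && tileLength > 0);
    if (coreLength == 0) {
        return {0, 0};
    }
    const int64_t tileNum = (coreLength + tileLength - 1) / tileLength;  // ceil
    const int64_t lastTileLength = coreLength - (tileNum - 1) * tileLength;
    return {tileNum, lastTileLength};
}
```

For a core that owns 25000 elements and a tile of 9824 elements, the loop runs 3 times and the last tile holds 5352 elements.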

Reference Documents:

Element-wise Operations: Read `references/elementwise-tiling.md` (includes complete two-level Tiling implementation)
Reduction Operations: Read `references/reduction-tiling.md`
General Principles: Refer to `references/general-tiling-principles.md`

3.3 Hardware Constraint Description

UB Buffer: Must be aligned to 32 bytes, even if only a small amount of data needs to be stored logically
Reduction Operators: A single-value buffer requires 32B of space
Precision Processing:
- FP32 Input: No precision lifting needed, compute directly
- FP16 Input: Must lift precision to FP32 for computation to ensure calculation accuracy
- BF16 Input: Must lift precision to FP32 for computation, because vector computation units do not support direct bfloat16 computation
Workspace Requirements:
- Elementwise Operators: SYSTEM_WORKSPACE_SIZE (usually 16MB)
- Other Operator Types: Calculate based on actual tiling data size
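The 32-byte rule for UB buffers can be illustrated with a small helper that rounds a buffer's byte size up to the next multiple of 32. `UB_ALIGN_BYTES` and `alignedBufferBytes` are hypothetical names used only for this sketch.

```cpp
#include <cstdint>

// Hypothetical constant: the UB alignment stated above.
constexpr int64_t UB_ALIGN_BYTES = 32;

// Rounds elemCount * dtypeSize up to a multiple of 32 bytes.
constexpr int64_t alignedBufferBytes(int64_t elemCount, int64_t dtypeSize) {
    return ((elemCount * dtypeSize + UB_ALIGN_BYTES - 1) / UB_ALIGN_BYTES) *
           UB_ALIGN_BYTES;
}

// A single float32 value (for example a reduction result) still takes 32 bytes.
static_assert(alignedBufferBytes(1, 4) == 32, "single value occupies 32B");
```

Nine float32 elements (36 bytes) therefore occupy 64 bytes, and a lone reduction result occupies a full 32 bytes.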

3.4 Quick Reference Table for UB Allocation of Common Operator Types

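Quickly determine bufferCoefficient based on the number of operator inputs and data types. As the UB Layout column shows, bufferCoefficient is the number of UB bytes needed per tile element, summed over all buffers as (buffer count × bytes per element):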

Single-input Single-output Elementwise (acosh, relu, sigmoid, exp, ln, sqrt, abs...):

Data Type	UB Layout	bufferCoefficient
float32	inQ(2×4) + outQ(2×4) + tmpBuf(1×4) = 20	20
float16	inQ(2×2) + outQ(2×2) + tmpBuf1(1×4) + tmpBuf2(1×4) = 16	16

Dual-input Single-output Elementwise (add, mul, sub, div...):

Data Type	UB Layout	bufferCoefficient
float32	inQ_X(2×4) + inQ_Y(2×4) + outQ(2×4) + tmpBuf(2×4) = 32	32
float16	inQ_X(2×2) + inQ_Y(2×2) + outQ(2×2) + tmpBuf(3×4) = 24	24

Practical Experience: bufferCoefficient is the most critical parameter in the code-gen phase. The design document must clearly specify its value for each dtype; otherwise code generation cannot calculate tileLength correctly.
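The following sketch shows how the coefficients in the tables can be derived from a UB layout and then turned into a tileLength. The formula is an assumption (divide the UB size by bufferCoefficient, then round down to a 32-byte multiple of the input dtype); the actual rule is defined in the tiling reference documents. `BufferSpec`, `bufferCoefficient`, and `tileLengthFor` are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical description of one row term such as inQ(2×4).
struct BufferSpec {
    int64_t bufferNum;        // 2 for double-buffered queues, 1 for temporaries
    int64_t bytesPerElement;  // 4 for float32, 2 for float16
};

// Sum of bufferNum * bytesPerElement over all buffers in the layout.
inline int64_t bufferCoefficient(const std::vector<BufferSpec>& layout) {
    int64_t sum = 0;
    for (const BufferSpec& b : layout) {
        sum += b.bufferNum * b.bytesPerElement;
    }
    return sum;
}

// Assumption: tileLength = floor(ubSizeBytes / coefficient), rounded down so
// that one tile of the input dtype is a multiple of 32 bytes.
inline int64_t tileLengthFor(int64_t ubSizeBytes, int64_t coefficient,
                             int64_t dtypeSize) {
    const int64_t alignElements = 32 / dtypeSize;
    const int64_t raw = ubSizeBytes / coefficient;
    return (raw / alignElements) * alignElements;
}
```

With the example UB size of 192KB, the single-input float32 layout (coefficient 20) gives a tileLength of 9824 elements, and the float16 layout (coefficient 16) gives 12288 elements.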

3.5 Generate Design Document

Based on the collected information, read the `templates/design-template.md` template, fill in all sections, and output to `csrc/ops/<op_name>/design.md`.

Output Location: `ascend-kernel/csrc/ops/<op_name>/design.md` (overwrite the placeholder file from the initialization phase)

Interaction Process

When called by the scheduling skill (recommended process):

Receive operator name and function description
Automatically select implementation path
Read reference documents and generate complete design document
Output to design.md

When called independently:

Requirement Collection: Understand operator requirements through dialogue
Solution Recommendation: Recommend implementation path based on requirements
Detailed Design: Generate complete design document
Check and Confirm: Confirm design key points with the user
Handover to Development: Generate checklist and prepare for coding phase

Notes

Tiling Parameter Design Principles

Parameter Structuring:

```cpp
// Good practice: group tiling parameters in a struct
struct MyOperatorTilingData {
    int64_t totalLength;        // Total data length

    int64_t formerNum;          // Number of full cores
    int64_t formerLength;       // Data length per full core
    int64_t tailNum;            // Number of tail cores
    int64_t tailLength;         // Data length per tail core

    int64_t tileLength;         // UB single processing length
};

// Avoid: passing a large number of independent parameters
void KernelFunc(int64_t totalLength, int64_t tileNum, int64_t tileLength, ...);
```

Two-level Alignment:

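```cpp
// Assumed units: totalLengthCore and tileLength are element counts,
// dtypeSize is the size of one element in bytes.

// Block-level: Cache Line alignment (512 bytes), converted to elements
constexpr int64_t CACHE_LINE_BYTE_LENGTH = 512;
int64_t cacheLineElements = CACHE_LINE_BYTE_LENGTH / dtypeSize;
int64_t totalLengthCoreAlign = ((totalLengthCore + cacheLineElements - 1) / cacheLineElements) * cacheLineElements;

// UB-level: 32-byte alignment, converted to elements
int64_t ubAlignElements = 32 / dtypeSize;
int64_t tileLengthAligned = ((tileLength + ubAlignElements - 1) / ubAlignElements) * ubAlignElements;
```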

UB Allocation Table: Each operator design must include a UB allocation table, clearly listing:
- Names and purposes of all buffers
- Size (in bytes) of each buffer
- Number of buffers (single buffer or double buffer)
- Total UB usage and constraint verification
Double Buffer: Use BUFFER_NUM=2 to implement double buffer and hide memory latency
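The constraint verification that closes a UB allocation table can be sketched as a sum over the table rows. This is an illustration under stated assumptions: `UbBuffer`, `totalUbBytes`, and `fitsInUb` are hypothetical names, every buffer is assumed to hold tileLength elements, and each buffer is rounded up to 32 bytes.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical row of a UB allocation table.
struct UbBuffer {
    const char* name;         // buffer name, e.g. "inQ"
    int64_t bytesPerElement;  // element size stored in this buffer
    int64_t bufferNum;        // 2 = double buffer, 1 = single buffer
};

// Total UB bytes used when every buffer holds tileLength elements,
// with each buffer rounded up to a 32-byte multiple.
inline int64_t totalUbBytes(const std::vector<UbBuffer>& table,
                            int64_t tileLength) {
    int64_t total = 0;
    for (const UbBuffer& b : table) {
        const int64_t raw = tileLength * b.bytesPerElement;
        const int64_t aligned = ((raw + 31) / 32) * 32;
        total += aligned * b.bufferNum;
    }
    return total;
}

// Constraint check: total usage must not exceed the UB size limit.
inline bool fitsInUb(const std::vector<UbBuffer>& table, int64_t tileLength,
                     int64_t ubSizeLimit) {
    return totalUbBytes(table, tileLength) <= ubSizeLimit;
}
```

For the single-input float32 layout (inQ and outQ double-buffered, one tmpBuf), a tileLength of 9824 uses 196480 bytes and fits in the 192KB example limit, while 9832 does not.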

Other Notes

Data Type Alignment: Ensure PyTorch tensor types match AscendC kernel types (half ↔ float16, float ↔ float32)
Memory Alignment: AscendC requires memory address alignment (UB 32B, Cache Line 512B)
Shape Constraints: Some operators have special shape requirements (e.g., need to be divisible by tile size)
Performance Trade-off: Find a balance between code complexity and performance
Interface Definition: Check whether a similar operator interface exists in libraries such as PyTorch or NumPy. If one exists, follow it when defining the interface
Test Input Range: Specify the valid input range of the operator in the design document (e.g., acosh requires x >= 1), and test cases should generate data accordingly
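As an illustration of range-aware test data, the sketch below draws inputs from a caller-supplied interval, so an acosh test can request values starting at 1. `makeInputs` is a hypothetical helper, and the upper bound and seed used in the example are arbitrary choices.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: n pseudo-random floats drawn uniformly between lo and hi.
// A fixed seed keeps the generated test data reproducible.
inline std::vector<float> makeInputs(std::size_t n, float lo, float hi,
                                     unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(lo, hi);
    std::vector<float> values(n);
    for (float& v : values) {
        v = dist(gen);
    }
    return values;
}
```

For acosh, `makeInputs(1024, 1.0f, 100.0f, 42u)` produces only values in the valid domain x >= 1.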

Definition of Done (DoD)

The generated design document must include the following key deliverables (for direct consumption by the code-gen skill):

Function Signature: Complete declaration of `at::Tensor op_name(const at::Tensor &self, ...)`

Supported Data Types: Clearly listed (e.g., float16, float32)
AscendC API Call Pseudocode: Map each computation step to a specific API
UB Allocation Table: Buffer layout and bufferCoefficient for each dtype
Tiling Parameter Struct: Field definitions and calculation formulas
FP16/BF16 Precision Lifting Process: If half-precision is supported, the Cast path must be described
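The precision lifting decision from the hardware constraints can be recorded in the design as a simple per-dtype rule. The sketch below is only an illustration; `DType`, `elementSize`, and `needsFp32Lift` are hypothetical names rather than AscendC or PyTorch types.

```cpp
#include <cstdint>

// Hypothetical dtype tag for the design document.
enum class DType { Float16, BFloat16, Float32 };

// Bytes per element in global memory and in the input/output queues.
constexpr int64_t elementSize(DType t) {
    return t == DType::Float32 ? 4 : 2;
}

// Half-precision inputs are cast to FP32 before computation and cast back
// afterwards; FP32 inputs are computed directly.
constexpr bool needsFp32Lift(DType t) {
    return t != DType::Float32;
}

static_assert(!needsFp32Lift(DType::Float32), "FP32 computes directly");
```

A dtype for which `needsFp32Lift` returns true needs the Cast path (input → FP32 → compute → output dtype) described in the design, plus FP32 temporary buffers in the UB allocation table.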

Next Step

After design completion, use the `ascendc-operator-code-gen` skill to generate the specific code implementation.
