AscendC Operator Design Skill
Generate a complete design document (design.md) based on operator requirements for subsequent consumption by the code-gen skill.
Usage Scenarios
Call after ascendc-operator-project-init creates the skeleton and before ascendc-operator-code-gen generates code.
Design Process
1. Operator Requirement Analysis
If called by the scheduling skill, the operator name and function description are already determined; proceed directly to Step 2.
If called independently, confirm the following information with the user:
| Information | Mandatory | Description |
|---|---|---|
| Operator Name (snake_case) | Yes | e.g., , |
| Function Description / Mathematical Formula | Yes | e.g., "acosh(x) = ln(x + sqrt(x²-1))" |
| Supported Data Types | No | Default: float16 + float32 |
MANDATORY: Check whether an interface with the same name exists in PyTorch / NumPy. If it does, the interface signature and semantics must align with it (e.g., , ).
2. Select Implementation Path
Recommend appropriate implementation methods based on operator characteristics:
| Implementation Path | Applicable Scenarios | Judgment Criteria |
|---|---|---|
| AscendC Kernel | Pure vector operators | No matrix multiplication involved |
| CATLASS Template Library | GEMM / FlashAttention | Contains cube matrix computation |
| ACLNN Encapsulation | Built-in operators already available in CANN | No custom kernel required |
New operators default to the AscendC Kernel path unless matrix multiplication is explicitly involved.
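The selection rule in the table can be sketched as a small decision function. This is only an illustration: the type and function names are hypothetical, and checking for a built-in CANN operator before checking for matrix multiplication is an assumed priority that the table itself does not state.

```cpp
// Hypothetical names; the order of the two checks is an assumption.
enum class ImplPath { AscendCKernel, CatlassTemplate, AclnnWrapper };

struct OperatorTraits {
    bool hasCannBuiltin;  // a built-in CANN operator already covers the request
    bool usesCubeMatmul;  // GEMM / FlashAttention-style cube matrix computation
};

constexpr ImplPath SelectImplPath(const OperatorTraits& t) {
    if (t.hasCannBuiltin) {
        return ImplPath::AclnnWrapper;     // no custom kernel required
    }
    if (t.usesCubeMatmul) {
        return ImplPath::CatlassTemplate;  // contains cube matrix computation
    }
    return ImplPath::AscendCKernel;        // default: pure vector operator
}
```

A new element-wise operator such as acosh has neither trait set, so it falls through to the default AscendC Kernel path.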
3. Generate Detailed Design Document
MANDATORY: Before generating the design document, you must read the following reference documents:
- Required: templates/design-template.md (design document template)
- Optional reading by operator type:
  - Element-wise operations (add/relu/acosh/sigmoid...) → references/elementwise-tiling.md
  - Reduction operations (softmax/layernorm...) → references/reduction-tiling.md
- General Reference: references/general-tiling-principles.md
Never skip reading the reference documents.
3.1 Design Document Structure
The design document includes the following core sections:
- Operator Interface Definition — Function signature, parameter description, supported data types
- Computation Logic Design — Algorithm description, AscendC API call pseudocode, implementation path selection
- Tiling Strategy — Two-level Tiling design (Block-level + UB-level), UB allocation table, tileLength calculation
- Workspace Requirements — Workspace size calculation
- Performance Optimization — Key optimization points, operator characteristic analysis
- Kernel-side Implementation Key Points — Offset calculation, execution process, FP16/BF16 precision lifting process
- Implementation Checklist — File structure, code key points, test key points
3.2 Computation Logic Pseudocode (Key Output)
Must decompose mathematical formulas into AscendC API call sequences. This is the direct input for the code-gen skill.
Common Mathematical Function to AscendC API Mapping:
| Mathematical Operation | AscendC API | Remarks |
|---|---|---|
| x + y | Add(dst, src0, src1, len) | Dual input |
| x - y | Sub(dst, src0, src1, len) | Dual input |
| x * y | Mul(dst, src0, src1, len) | Dual input |
| x / y | Div(dst, src0, src1, len) | Dual input |
| x + scalar | Adds(dst, src, scalar, len) | Scalar operation (preferred) |
| x * scalar | Muls(dst, src, scalar, len) | Scalar operation (preferred) |
| abs(x) | | |
| exp(x) | | |
| ln(x) | Ln(dst, src, len) | |
| sqrt(x) | Sqrt(dst, src, len) | |
| 1/x | Reciprocal(dst, src, len) | |
| 1/sqrt(x) | | |
| tanh(x) | | |
| relu(x) | | |
| max(x,y) | Max(dst, src0, src1, len) | |
| min(x,y) | Min(dst, src0, src1, len) | |
| fp16→fp32 | Cast(dst, src, CAST_NONE, len) | Precision lifting without loss |
| fp32→fp16 | Cast(dst, src, CAST_ROUND, len) | Precision reduction with loss |
Example API call sequence for acosh(x):
```cpp
Mul(tmp, x, x, len);          // tmp = x²
Adds(tmp, tmp, -1.0f, len);   // tmp = x² - 1
Sqrt(tmp, tmp, len);          // tmp = sqrt(x² - 1)
Add(tmp, tmp, x, len);        // tmp = x + sqrt(x² - 1)
Ln(y, tmp, len);              // y = ln(x + sqrt(x² - 1))
```
Note: If the dst and src are the same in a certain step of the computation sequence (in-place operation), most AscendC APIs support this, but you need to confirm the specific API.
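Before handing a decomposition to code-gen, it can help to check it numerically on the host. The following is a minimal sketch in plain C++, not AscendC code: each loop stands in for one vector call of the acosh sequence, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side scalar stand-in for the five-step acosh sequence.
// Each loop mirrors one vector call; this is NOT AscendC code.
std::vector<float> AcoshBySteps(const std::vector<float>& x) {
    const std::size_t len = x.size();
    std::vector<float> tmp(len), y(len);
    for (std::size_t i = 0; i < len; ++i) tmp[i] = x[i] * x[i];        // Mul
    for (std::size_t i = 0; i < len; ++i) tmp[i] = tmp[i] + (-1.0f);   // Adds
    for (std::size_t i = 0; i < len; ++i) tmp[i] = std::sqrt(tmp[i]);  // Sqrt (in place)
    for (std::size_t i = 0; i < len; ++i) tmp[i] = tmp[i] + x[i];      // Add
    for (std::size_t i = 0; i < len; ++i) y[i] = std::log(tmp[i]);     // Ln
    return y;
}

// Largest absolute difference against std::acosh for inputs x >= 1.
float MaxAbsError(const std::vector<float>& x) {
    const std::vector<float> y = AcoshBySteps(x);
    float maxErr = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        assert(x[i] >= 1.0f);  // acosh is only defined for x >= 1
        maxErr = std::fmax(maxErr, std::fabs(y[i] - std::acosh(x[i])));
    }
    return maxErr;
}
```

If the error against the library function is larger than expected for the dtype, the decomposition (not the kernel) is the first thing to revisit.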
3.2 Tiling Strategy
Important: AscendC operators adopt a two-level Tiling strategy. Refer to the corresponding documents based on operator types:
```
┌─────────────────────────────────────────────────────────────┐
│                     Global Memory (GM)                      │
│   ┌─────────────────────────────────────────────────────┐   │
│   │            Data of totalLength elements             │   │
│   └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
         ┌──────────┐     ┌──────────┐     ┌──────────┐
         │  Core 0  │     │  Core 1  │ ... │ Core 39  │  ← Block-level Tiling (Inter-core Splitting)
         └──────────┘     └──────────┘     └──────────┘
              │                │                │
              ▼                ▼                ▼
         ┌──────────┐     ┌──────────┐     ┌──────────┐
         │   UB 0   │     │   UB 1   │     │  UB 39   │  ← UB-level Tiling (Intra-core Splitting)
         └──────────┘     └──────────┘     └──────────┘
```
Block-level Tiling (Inter-core Splitting):
- Allocate data to multiple AI Cores for parallel processing
- Load Balancing: Full-core/tail-core strategy; a tail core processes no more data than a full core
UB-level Tiling (Intra-core Splitting):
- Split data for processing within each Core
- UB Alignment: 32 bytes
- UB Capacity: Do not exceed UB_SIZE_LIMIT (obtained via interface during actual coding, example value: 192KB)
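The two levels can be sketched as host-side integer arithmetic. This is a simplified illustration under stated assumptions: the split rule (the remainder is spread one element per core) and the function names are hypothetical, the field names follow the tiling struct used in this document, and the sketch ignores the alignment requirements that a real tiling function also has to apply.

```cpp
#include <cstdint>

// Hypothetical host-side sketch; the split rule is an assumption.
struct BlockSplit {
    int64_t formerNum;     // number of full cores
    int64_t formerLength;  // elements per full core
    int64_t tailNum;       // number of tail cores
    int64_t tailLength;    // elements per tail core (<= formerLength)
};

// Block-level: divide totalLength elements across coreNum cores (coreNum > 0).
constexpr BlockSplit SplitAcrossCores(int64_t totalLength, int64_t coreNum) {
    const int64_t base = totalLength / coreNum;
    const int64_t rem = totalLength % coreNum;
    // With a remainder, `rem` cores take one extra element; otherwise all cores are full.
    const int64_t formerNum = (rem == 0) ? coreNum : rem;
    const int64_t formerLength = (rem == 0) ? base : base + 1;
    const int64_t tailNum = coreNum - formerNum;
    return BlockSplit{formerNum, formerLength, tailNum, tailNum > 0 ? base : 0};
}

// GM element offset at which core `blockIdx` starts.
constexpr int64_t CoreOffset(const BlockSplit& s, int64_t blockIdx) {
    return (blockIdx < s.formerNum)
               ? blockIdx * s.formerLength
               : s.formerNum * s.formerLength + (blockIdx - s.formerNum) * s.tailLength;
}

// UB-level: number of tiles one core needs for `coreLength` elements.
constexpr int64_t TileCount(int64_t coreLength, int64_t tileLength) {
    return (coreLength + tileLength - 1) / tileLength;
}
```

For example, 1003 elements on 40 cores gives 3 full cores of 26 elements and 37 tail cores of 25 elements, so the first tail core starts at element 78.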
Reference Documents:
- Element-wise Operations: Read references/elementwise-tiling.md (includes complete two-level Tiling implementation)
- Reduction Operations: Read references/reduction-tiling.md
- General Principles: Refer to references/general-tiling-principles.md
3.3 Hardware Constraint Description
- UB Buffer: Must be aligned to 32 bytes, even if only a small amount of data needs to be stored logically
- Reduction Operators: A single-value buffer requires 32B of space
- Precision Processing:
  - FP32 Input: No precision lifting needed, compute directly
  - FP16 Input: Must lift precision to FP32 for computation to ensure calculation accuracy
  - BF16 Input: Must lift precision to FP32 for computation, vector computation units do not support direct bfloat16 computation
- Workspace Requirements:
  - Elementwise Operators: SYSTEM_WORKSPACE_SIZE (usually 16MB)
  - Other Operator Types: Calculate based on actual tiling data size
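The 32-byte rule for UB buffers amounts to rounding each buffer's byte size up to the next multiple of 32. A minimal sketch, with a hypothetical helper name:

```cpp
#include <cstdint>

constexpr int64_t UB_ALIGN_BYTES = 32;  // UB alignment stated in the constraints

// Bytes to reserve in UB for `numElements` elements of `dtypeSize` bytes each,
// rounded up to the next multiple of 32 bytes.
constexpr int64_t UbBufferBytes(int64_t numElements, int64_t dtypeSize) {
    return (numElements * dtypeSize + UB_ALIGN_BYTES - 1) / UB_ALIGN_BYTES * UB_ALIGN_BYTES;
}
```

This is why a reduction result holding a single float (4 bytes of logical data) still occupies a 32-byte buffer.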
3.4 Quick Reference Table for UB Allocation of Common Operator Types
Quickly determine bufferCoefficient based on the number of operator inputs and data types:
Single-input Single-output Elementwise (acosh, relu, sigmoid, exp, ln, sqrt, abs...):
| Data Type | UB Layout | bufferCoefficient |
|---|---|---|
| float32 | inQ(2×4) + outQ(2×4) + tmpBuf(1×4) = 20 | 20 |
| float16 | inQ(2×2) + outQ(2×2) + tmpBuf1(1×4) + tmpBuf2(1×4) = 16 | 16 |
Dual-input Single-output Elementwise (add, mul, sub, div...):
| Data Type | UB Layout | bufferCoefficient |
|---|---|---|
| float32 | inQ_X(2×4) + inQ_Y(2×4) + outQ(2×4) + tmpBuf(2×4) = 32 | 32 |
| float16 | inQ_X(2×2) + inQ_Y(2×2) + outQ(2×2) + tmpBuf(3×4) = 24 | 24 |
Practical Experience: bufferCoefficient is the most critical parameter in the code-gen phase. The design document must clearly specify the value for each dtype, otherwise code generation cannot calculate tileLength correctly.
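To make the role of bufferCoefficient concrete, the sketch below derives it from a UB layout and then computes a tileLength from it. The formula used here (UB capacity divided by the coefficient, rounded down to a whole number of 32-byte blocks) is an assumption about how code-gen uses the value; the names are hypothetical, and 192KB is only the example capacity mentioned earlier.

```cpp
#include <cstddef>
#include <cstdint>

// One UB buffer group: `count` copies (2 = double buffer), each holding
// `bytesPerElement` bytes for every element of the tile.
struct BufferSpec {
    int64_t count;
    int64_t bytesPerElement;
};

// bufferCoefficient = UB bytes consumed per tile element, summed over all buffers.
template <std::size_t N>
constexpr int64_t BufferCoefficient(const BufferSpec (&specs)[N]) {
    int64_t sum = 0;
    for (std::size_t i = 0; i < N; ++i) {
        sum += specs[i].count * specs[i].bytesPerElement;
    }
    return sum;
}

// Assumed rule: largest tile that fits in `ubBytes`, rounded DOWN to a whole
// number of 32-byte blocks of the input dtype.
constexpr int64_t TileLength(int64_t ubBytes, int64_t coefficient, int64_t dtypeSize) {
    const int64_t alignElements = 32 / dtypeSize;
    return ubBytes / coefficient / alignElements * alignElements;
}

// float32 single-input layout from the table: inQ(2x4) + outQ(2x4) + tmpBuf(1x4).
constexpr BufferSpec kFp32Unary[] = {{2, 4}, {2, 4}, {1, 4}};
```

With the example 192KB capacity, the float32 single-input layout (coefficient 20) yields a tileLength of 9824 elements, and the float16 layout (coefficient 16) yields 12288.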
3.5 Generate Design Document
Based on the collected information, read the templates/design-template.md template, fill in all sections, and output to csrc/ops/<op_name>/design.md.
Output Location: ascend-kernel/csrc/ops/<op_name>/design.md (overwrite the placeholder file from the initialization phase)
Interaction Process
When called by scheduling skill (recommended process):
- Receive operator name and function description
- Automatically select implementation path
- Read reference documents and generate complete design document
- Output to design.md
When called independently:
- Requirement Collection: Understand operator requirements through dialogue
- Solution Recommendation: Recommend implementation path based on requirements
- Detailed Design: Generate complete design document
- Check and Confirm: Confirm design key points with the user
- Handover to Development: Generate checklist and prepare for coding phase
Notes
Tiling Parameter Design Principles
- Parameter Structuring:
  ```cpp
  // Good practice: Use struct
  struct MyOperatorTilingData {
      int64_t totalLength;   // Total data length
      int64_t formerNum;     // Number of full cores
      int64_t formerLength;  // Data length per full core
      int64_t tailNum;       // Number of tail cores
      int64_t tailLength;    // Data length per tail core
      int64_t tileLength;    // UB single processing length
  };
  // Avoid: Use a large number of independent parameters
  void KernelFunc(int64_t totalLength, int64_t tileNum, int64_t tileLength, ...);
  ```
- Two-level Alignment:
  ```cpp
  // Block-level: Cache Line alignment (512 bytes)
  constexpr int64_t CACHE_LINE_BYTE_LENGTH = 512;
  int64_t totalLengthCoreAlign =
      ((totalLengthCore + CACHE_LINE_BYTE_LENGTH - 1) / CACHE_LINE_BYTE_LENGTH) * CACHE_LINE_BYTE_LENGTH;
  // UB-level: 32-byte alignment
  int64_t ubAlignElements = 32 / dtypeSize;
  int64_t tileLengthAligned = ((tileLength + ubAlignElements - 1) / ubAlignElements) * ubAlignElements;
  ```
- UB Allocation Table: Each operator design must include a UB allocation table, clearly listing:
  - Names and purposes of all buffers
  - Size (in bytes) of each buffer
  - Number of buffers (single buffer or double buffer)
  - Total UB usage and constraint verification
- Double Buffer: Use BUFFER_NUM=2 to implement double buffer and hide memory latency
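The "total UB usage and constraint verification" row of a UB allocation table can be checked mechanically. The following is a minimal sketch, assuming each buffer is rounded up to 32 bytes and that the limit is supplied by the caller; the struct, function, and buffer names are illustrative only.

```cpp
#include <cstddef>
#include <cstdint>

// One row of a UB allocation table (names are illustrative).
struct UbBufferRow {
    const char* name;      // buffer name
    int64_t elementBytes;  // bytes per element held in this buffer
    int64_t bufferNum;     // 1 = single buffer, 2 = double buffer
};

// Total UB bytes for all rows at a given tileLength, each buffer rounded up to 32 bytes.
template <std::size_t N>
constexpr int64_t TotalUbBytes(const UbBufferRow (&rows)[N], int64_t tileLength) {
    int64_t total = 0;
    for (std::size_t i = 0; i < N; ++i) {
        const int64_t oneBuffer = (tileLength * rows[i].elementBytes + 31) / 32 * 32;
        total += oneBuffer * rows[i].bufferNum;
    }
    return total;
}

// Constraint check: the table fits if the total does not exceed the UB limit.
template <std::size_t N>
constexpr bool FitsInUb(const UbBufferRow (&rows)[N], int64_t tileLength, int64_t ubLimitBytes) {
    return TotalUbBytes(rows, tileLength) <= ubLimitBytes;
}

// float16 single-input example: double-buffered fp16 queues plus two fp32 temporaries.
constexpr UbBufferRow kFp16UnaryTable[] = {
    {"inQueueX", 2, 2}, {"outQueueY", 2, 2}, {"tmpBuf1", 4, 1}, {"tmpBuf2", 4, 1}};
```

With a tileLength of 12288 this example table uses exactly 196608 bytes, which fits a 192KB limit; one more 16-element step would not.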
Other Notes
- Data Type Alignment: Ensure PyTorch tensor types match AscendC kernel types (half ↔ float16, float ↔ float32)
- Memory Alignment: AscendC requires memory address alignment (UB 32B, Cache Line 512B)
- Shape Constraints: Some operators have special shape requirements (e.g., need to be divisible by tile size)
- Performance Trade-off: Find a balance between code complexity and performance
- Interface Definition: Check whether similar operator interfaces exist in libraries such as PyTorch/NumPy. If they do, follow PyTorch/NumPy for the interface definition
- Test Input Range: Specify the valid input range of the operator in the design document (e.g., acosh requires x >= 1), and test cases should generate data accordingly
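The test input range note can be illustrated with a small data generator. This is a hypothetical helper, not part of any test framework named in this document: it draws values from a caller-supplied range so that, for acosh, every sample satisfies x >= 1.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical test-data helper: `count` floats drawn uniformly between lo and hi.
// For acosh, pass lo = 1.0f so every sample lies in the valid domain x >= 1.
std::vector<float> MakeInputs(std::size_t count, float lo, float hi, unsigned seed) {
    assert(lo < hi);
    std::mt19937 rng(seed);  // fixed seed keeps test runs reproducible
    std::uniform_real_distribution<float> dist(lo, hi);
    std::vector<float> data(count);
    for (float& v : data) {
        v = dist(rng);
    }
    return data;
}
```

Recording the range in the design document lets the test cases call such a helper with the right bounds instead of feeding the kernel values for which the reference result is NaN.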
Definition of Done (DoD)
The generated design document must include the following key deliverables (for direct consumption by the code-gen skill):
Next Step
After design completion, use the ascendc-operator-code-gen skill to generate specific code implementation.