ascendc-operator-design


AscendC Operator Design Skill

Generate a complete design document (design.md) based on operator requirements for subsequent consumption by the code-gen skill.

Usage Scenarios

Call after ascendc-operator-project-init creates the skeleton and before ascendc-operator-code-gen generates code.

Design Process

1. Operator Requirement Analysis

If called by the scheduling skill, the operator name and function description are already determined; proceed directly to Step 2.
If called independently, confirm the following information with the user:

| Information | Mandatory | Description |
| --- | --- | --- |
| Operator name (snake_case) | Yes | e.g., acosh, rms_norm |
| Function description / mathematical formula | Yes | e.g., "acosh(x) = ln(x + sqrt(x²-1))" |
| Supported data types | No | Default: float16 + float32 |

MANDATORY: Check whether PyTorch / NumPy exposes an interface with the same name. If one exists, the interface signature and semantics must align with it (e.g., torch.acosh, torch.softmax).

2. Select Implementation Path

Recommend an implementation path based on the operator's characteristics:

| Implementation path | Applicable scenario | Criterion |
| --- | --- | --- |
| AscendC Kernel | Pure vector operators | No matrix multiplication involved |
| CATLASS template library | GEMM / FlashAttention | Contains cube (matrix) computation |
| ACLNN wrapper | Operator already built into CANN | No custom kernel required |

New operators default to the AscendC Kernel path unless matrix multiplication is explicitly involved.

3. Generate Detailed Design Document

MANDATORY: Before generating the design document, you must read the following reference documents:
  1. Required: templates/design-template.md — the design document template
  2. Selected by operator type:
    • Element-wise operations (add/relu/acosh/sigmoid...) → references/elementwise-tiling.md
    • Reduction operations (softmax/layernorm...) → references/reduction-tiling.md
  3. General reference: references/general-tiling-principles.md
Never skip reading the reference documents.

3.1 Design Document Structure

The design document includes the following core sections:
  1. Operator Interface Definition — Function signature, parameter description, supported data types
  2. Computation Logic Design — Algorithm description, AscendC API call pseudocode, implementation path selection
  3. Tiling Strategy — Two-level Tiling design (Block-level + UB-level), UB allocation table, tileLength calculation
  4. Workspace Requirements — Workspace size calculation
  5. Performance Optimization — Key optimization points, operator characteristic analysis
  6. Kernel-side Implementation Key Points — Offset calculation, execution process, FP16/BF16 precision lifting process
  7. Implementation Checklist — File structure, code key points, test key points

3.2 Computation Logic Pseudocode (Key Output)

You must decompose the mathematical formula into a sequence of AscendC API calls. This sequence is the direct input to the code-gen skill.
Common mathematical function to AscendC API mapping:

| Mathematical operation | AscendC API | Remarks |
| --- | --- | --- |
| x + y | Add(dst, src0, src1, len) | Two inputs |
| x - y | Sub(dst, src0, src1, len) | Two inputs |
| x * y | Mul(dst, src0, src1, len) | Two inputs |
| x / y | Div(dst, src0, src1, len) | Two inputs |
| x + scalar | Adds(dst, src, scalar, len) | Scalar form; prefer when applicable |
| x * scalar | Muls(dst, src, scalar, len) | Scalar form; prefer when applicable |
| abs(x) | Abs(dst, src, len) | |
| exp(x) | Exp(dst, src, len) | |
| ln(x) | Ln(dst, src, len) | |
| sqrt(x) | Sqrt(dst, src, len) | |
| 1/x | Reciprocal(dst, src, len) | |
| 1/sqrt(x) | Rsqrt(dst, src, len) | |
| tanh(x) | Tanh(dst, src, len) | |
| relu(x) | Relu(dst, src, len) | |
| max(x, y) | Max(dst, src0, src1, len) | |
| min(x, y) | Min(dst, src0, src1, len) | |
| fp16→fp32 | Cast(dst, src, CAST_NONE, len) | Lossless precision lift |
| fp32→fp16 | Cast(dst, src, CAST_ROUND, len) | Lossy precision reduction |

Example — API call sequence for acosh(x):

```cpp
Mul(tmp, x, x, len);        // tmp = x²
Adds(tmp, tmp, -1.0f, len); // tmp = x² - 1
Sqrt(tmp, tmp, len);        // tmp = sqrt(x² - 1)
Add(tmp, tmp, x, len);      // tmp = x + sqrt(x² - 1)
Ln(y, tmp, len);            // y = ln(x + sqrt(x² - 1))
```

Note: most AscendC APIs allow dst and src to be the same buffer (in-place operation), but confirm this for each specific API.
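To sanity-check such a sequence before code generation, a host-side scalar reference that mirrors it step by step can serve as a golden-data generator for tests. This is a minimal sketch; AcoshGolden is a hypothetical helper name, not part of any AscendC API.

```cpp
#include <cmath>
#include <cstddef>

// Mirrors the AscendC call sequence above, one scalar at a time.
// Intended only for generating golden outputs on the host in tests.
void AcoshGolden(const float* x, float* y, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        float tmp = x[i] * x[i];   // Mul:  tmp = x²
        tmp = tmp + (-1.0f);       // Adds: tmp = x² - 1
        tmp = std::sqrt(tmp);      // Sqrt: tmp = sqrt(x² - 1)
        tmp = tmp + x[i];          // Add:  tmp = x + sqrt(x² - 1)
        y[i] = std::log(tmp);      // Ln:   y = ln(x + sqrt(x² - 1))
    }
}
```

For valid inputs (x >= 1) the result should match std::acosh to within float rounding error.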

3.3 Tiling Strategy

Important: AscendC operators adopt a two-level tiling strategy; refer to the document matching your operator type:

```
┌───────────────────────────────────────────────┐
│            Global Memory (GM)                 │
│  ┌─────────────────────────────────────────┐  │
│  │        totalLength elements             │  │
│  └─────────────────────────────────────────┘  │
└───────────────────────────────────────────────┘
        │               │               │
        ▼               ▼               ▼
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Core 0  │    │  Core 1  │ ...│ Core 39  │   ← Block-level tiling (inter-core split)
  └──────────┘    └──────────┘    └──────────┘
        │               │               │
        ▼               ▼               ▼
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │   UB 0   │    │   UB 1   │    │  UB 39   │   ← UB-level tiling (intra-core split)
  └──────────┘    └──────────┘    └──────────┘
```
Block-level Tiling (Inter-core Splitting):
  • Allocate data to multiple AI Cores for parallel processing
  • Load Balancing: Full-core/tail-core strategy, the data volume processed by the tail core is less than or equal to that of the full core
UB-level Tiling (Intra-core Splitting):
  • Split data for processing within each Core
  • UB Alignment: 32 bytes
  • UB Capacity: Do not exceed UB_SIZE_LIMIT (obtained via interface during actual coding, example value: 192KB)
Reference documents:
  • Element-wise operations: read references/elementwise-tiling.md (includes a complete two-level tiling implementation)
  • Reduction operations: read references/reduction-tiling.md
  • General principles: see references/general-tiling-principles.md
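The full-core/tail-core split described above can be sketched as plain host-side arithmetic. This is an illustrative policy, not the canonical AscendC implementation; the field names follow the tiling struct used elsewhere in this document.

```cpp
#include <cstdint>

// One possible full-core/tail-core split: cores with index < formerNum process
// formerLength elements each, the rest process tailLength (tailLength <= formerLength).
struct BlockSplit {
    int64_t formerNum;
    int64_t formerLength;
    int64_t tailNum;
    int64_t tailLength;
};

BlockSplit SplitAcrossCores(int64_t totalLength, int64_t coreNum) {
    BlockSplit s{};
    s.formerLength = (totalLength + coreNum - 1) / coreNum; // ceil(total / cores)
    s.tailLength   = totalLength / coreNum;                 // floor(total / cores)
    s.formerNum    = totalLength - s.tailLength * coreNum;  // cores taking one extra element
    s.tailNum      = coreNum - s.formerNum;
    // Invariant: formerNum*formerLength + tailNum*tailLength == totalLength.
    return s;
}
```

When totalLength divides evenly, formerNum is 0 and every core processes the same tailLength, so the invariant still holds.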

3.4 Hardware Constraints

  • UB Buffer: Must be aligned to 32 bytes, even if only a small amount of data needs to be stored logically
  • Reduction Operators: A single-value buffer requires 32B of space
  • Precision Processing:
    • FP32 Input: No precision lifting needed, compute directly
    • FP16 Input: Must lift precision to FP32 for computation to ensure calculation accuracy
    • BF16 Input: Must lift precision to FP32 for computation, vector computation units do not support direct bfloat16 computation
  • Workspace Requirements:
    • Elementwise Operators: SYSTEM_WORKSPACE_SIZE (usually 16MB)
    • Other Operator Types: Calculate based on actual tiling data size

3.5 Quick Reference: UB Allocation for Common Operator Types

Quickly determine bufferCoefficient from the number of operator inputs and the data type:

Single-input, single-output elementwise (acosh, relu, sigmoid, exp, ln, sqrt, abs...):

| Data type | UB layout | bufferCoefficient |
| --- | --- | --- |
| float32 | inQ(2×4) + outQ(2×4) + tmpBuf(1×4) = 20 | 20 |
| float16 | inQ(2×2) + outQ(2×2) + tmpBuf1(1×4) + tmpBuf2(1×4) = 16 | 16 |

Dual-input, single-output elementwise (add, mul, sub, div...):

| Data type | UB layout | bufferCoefficient |
| --- | --- | --- |
| float32 | inQ_X(2×4) + inQ_Y(2×4) + outQ(2×4) + tmpBuf(2×4) = 32 | 32 |
| float16 | inQ_X(2×2) + inQ_Y(2×2) + outQ(2×2) + tmpBuf(3×4) = 24 | 24 |

Practical experience: bufferCoefficient is the most critical parameter in the code-gen phase. The design document must state the value for each dtype; otherwise code generation cannot compute tileLength correctly.
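Because bufferCoefficient is the number of UB bytes consumed per element across all queues and temp buffers, tileLength falls out of a single division. A minimal sketch, assuming availableUbBytes is the UB budget queried at runtime (192KB is the example value used in this document):

```cpp
#include <cstdint>

// tileLength = UB budget / bytes-per-element, rounded DOWN to the 32-byte
// UB alignment so the resulting tile never overflows the budget.
int64_t ComputeTileLength(int64_t availableUbBytes, int64_t bufferCoefficient,
                          int64_t dtypeSize) {
    int64_t rawTileLength = availableUbBytes / bufferCoefficient;
    int64_t ubAlignElements = 32 / dtypeSize;   // elements per 32-byte block
    return (rawTileLength / ubAlignElements) * ubAlignElements;
}
```

With the float32 single-input layout above (bufferCoefficient = 20) and a 192KB budget, this gives ComputeTileLength(196608, 20, 4) = 9824 elements.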

3.6 Generate the Design Document

Based on the collected information, read the templates/design-template.md template, fill in all sections, and write the result to csrc/ops/<op_name>/design.md.
Output location: ascend-kernel/csrc/ops/<op_name>/design.md (overwriting the placeholder file from the initialization phase).

Interaction Process

When called by the scheduling skill (recommended flow):
  1. Receive operator name and function description
  2. Automatically select implementation path
  3. Read reference documents and generate complete design document
  4. Output to design.md
When called independently:
  1. Requirement Collection: Understand operator requirements through dialogue
  2. Solution Recommendation: Recommend implementation path based on requirements
  3. Detailed Design: Generate complete design document
  4. Check and Confirm: Confirm design key points with the user
  5. Handover to Development: Generate checklist and prepare for coding phase

Notes

Tiling Parameter Design Principles

  1. Parameter Structuring:

    ```cpp
    // Good practice: use a struct
    struct MyOperatorTilingData {
        int64_t totalLength;        // Total data length

        int64_t formerNum;          // Number of full cores
        int64_t formerLength;       // Data length per full core
        int64_t tailNum;            // Number of tail cores
        int64_t tailLength;         // Data length per tail core

        int64_t tileLength;         // UB per-iteration processing length
    };

    // Avoid: a long list of independent parameters
    void KernelFunc(int64_t totalLength, int64_t tileNum, int64_t tileLength, ...);
    ```

  2. Two-level Alignment:

    ```cpp
    // Block level: cache-line alignment (512 bytes)
    constexpr int64_t CACHE_LINE_BYTE_LENGTH = 512;
    int64_t totalLengthCoreAlign = ((totalLengthCore + CACHE_LINE_BYTE_LENGTH - 1) / CACHE_LINE_BYTE_LENGTH) * CACHE_LINE_BYTE_LENGTH;

    // UB level: 32-byte alignment
    int64_t ubAlignElements = 32 / dtypeSize;
    int64_t tileLengthAligned = ((tileLength + ubAlignElements - 1) / ubAlignElements) * ubAlignElements;
    ```

  3. UB Allocation Table: each operator design must include a UB allocation table that lists:
    • Name and purpose of every buffer
    • Size (in bytes) of each buffer
    • Number of buffers (single buffer or double buffer)
    • Total UB usage and verification against the capacity constraint
  4. Double Buffer: use BUFFER_NUM=2 to enable double buffering and hide memory-access latency
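Both formulas in the two-level alignment item follow the same round-up idiom; a tiny runnable check (the helper names here are illustrative, not from any AscendC header):

```cpp
#include <cstdint>

// AlignUp(n, a) = ((n + a - 1) / a) * a — the round-up idiom used above.
int64_t AlignUp(int64_t value, int64_t alignment) {
    return ((value + alignment - 1) / alignment) * alignment;
}

// Block level: per-core byte count rounded up to a 512-byte cache line.
int64_t AlignToCacheLine(int64_t bytes) { return AlignUp(bytes, 512); }

// UB level: element count rounded up to a 32-byte boundary for a given dtype.
int64_t AlignTileLength(int64_t tileLength, int64_t dtypeSize) {
    return AlignUp(tileLength, 32 / dtypeSize);
}
```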

Other Notes

  1. Data Type Alignment: Ensure PyTorch tensor types match AscendC kernel types (half ↔ float16, float ↔ float32)
  2. Memory Alignment: AscendC requires memory address alignment (UB 32B, Cache Line 512B)
  3. Shape Constraints: Some operators have special shape requirements (e.g., need to be divisible by tile size)
  4. Performance Trade-off: Find a balance between code complexity and performance
  5. Interface Definition: Check whether a similar operator interface exists in libraries such as PyTorch / NumPy; if it does, model the interface definition on theirs
  6. Test Input Range: Specify the valid input range of the operator in the design document (e.g., acosh requires x >= 1), and test cases should generate data accordingly

Definition of Done (DoD)

After the design document is generated, it must include the following key deliverables (for direct consumption by the code-gen skill):
  • Function signature: complete declaration, e.g., at::Tensor op_name(const at::Tensor &self, ...)
  • Supported data types: listed explicitly (e.g., float16, float32)
  • AscendC API call pseudocode: each computation step mapped to a specific API
  • UB allocation table: buffer layout and bufferCoefficient for each dtype
  • Tiling parameter struct: field definitions and calculation formulas
  • FP16/BF16 precision-lifting flow: if half precision is supported, the Cast path must be described

Next Step

After the design is complete, use the ascendc-operator-code-gen skill to generate the code implementation.