ascendc-operator-design
AscendC Operator Design Skill
Generate a complete design document (design.md) based on operator requirements for subsequent consumption by the code-gen skill.
Usage Scenarios
Call after `ascendc-operator-project-init` creates the skeleton and before `ascendc-operator-code-gen` generates code.
Design Process
1. Operator Requirement Analysis
If called by the scheduling skill, the operator name and function description are already determined; proceed directly to Step 2.
If called independently, confirm the following information with the user:
| Information | Mandatory | Description |
|---|---|---|
| Operator name (snake_case) | Yes | e.g., `acosh` |
| Function description / mathematical formula | Yes | e.g., "acosh(x) = ln(x + sqrt(x²-1))" |
| Supported data types | No | Default: float16 + float32 |
MANDATORY: Check whether PyTorch / NumPy exposes an interface with the same name. If so, the interface signature and semantics must align with it (e.g., `torch.acosh`, `torch.softmax`).
2. Select Implementation Path
Recommend a suitable implementation path based on the operator's characteristics:
| Implementation Path | Applicable Scenarios | Judgment Criteria |
|---|---|---|
| AscendC Kernel | Pure vector operators | No matrix multiplication involved |
| CATLASS Template Library | GEMM / FlashAttention | Contains cube matrix computation |
| ACLNN Encapsulation | Built-in operators already available in CANN | No custom kernel required |
New operators default to the AscendC Kernel path unless matrix multiplication is explicitly involved.
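The three-way decision above can be encoded as a small helper. The enum and function names here are illustrative, not part of any CANN API:

```cpp
// Illustrative encoding of the path-selection rule: matrix multiplication
// selects CATLASS; an existing CANN built-in selects the ACLNN wrapper;
// everything else falls through to the default AscendC Kernel path.
enum class ImplPath { AscendCKernel, Catlass, Aclnn };

ImplPath SelectPath(bool usesMatmul, bool hasCannBuiltin) {
    if (usesMatmul) return ImplPath::Catlass;   // cube matrix computation
    if (hasCannBuiltin) return ImplPath::Aclnn; // no custom kernel required
    return ImplPath::AscendCKernel;             // pure vector operator
}
```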
3. Generate the Detailed Design Document

MANDATORY: Before generating the design document, you must read the following reference documents:
- Required: `templates/design-template.md` — the design document template
- Optional, by operator type:
  - Element-wise operations (add/relu/acosh/sigmoid...) → `references/elementwise-tiling.md`
  - Reduction operations (softmax/layernorm...) → `references/reduction-tiling.md`
- General reference: `references/general-tiling-principles.md`
Never skip reading the reference documents.
3.1 Design Document Structure
The design document includes the following core sections:
- Operator Interface Definition — Function signature, parameter description, supported data types
- Computation Logic Design — Algorithm description, AscendC API call pseudocode, implementation path selection
- Tiling Strategy — Two-level Tiling design (Block-level + UB-level), UB allocation table, tileLength calculation
- Workspace Requirements — Workspace size calculation
- Performance Optimization — Key optimization points, operator characteristic analysis
- Kernel-side Implementation Key Points — Offset calculation, execution process, FP16/BF16 precision lifting process
- Implementation Checklist — File structure, code key points, test key points
3.2 Computation Logic Pseudocode (Key Output)
You must decompose the mathematical formula into a sequence of AscendC API calls. This is the direct input for the code-gen skill.
Common mathematical function → AscendC API mapping:
| Mathematical Operation | AscendC API | Remarks |
|---|---|---|
| x + y | `Add` | Dual input |
| x - y | `Sub` | Dual input |
| x * y | `Mul` | Dual input |
| x / y | `Div` | Dual input |
| x + scalar | `Adds` | Scalar operation, preferred |
| x * scalar | `Muls` | Scalar operation, preferred |
| abs(x) | `Abs` | |
| exp(x) | `Exp` | |
| ln(x) | `Ln` | |
| sqrt(x) | `Sqrt` | |
| 1/x | `Reciprocal` | |
| 1/sqrt(x) | `Rsqrt` | |
| tanh(x) | `Tanh` | |
| relu(x) | `Relu` | |
| max(x,y) | `Max` | |
| min(x,y) | `Min` | |
| fp16→fp32 | `Cast` | Lossless precision lifting |
| fp32→fp16 | `Cast` | Lossy precision reduction |
Example — API call sequence for acosh(x):
```cpp
Mul(tmp, x, x, len);        // tmp = x²
Adds(tmp, tmp, -1.0f, len); // tmp = x² - 1
Sqrt(tmp, tmp, len);        // tmp = sqrt(x² - 1)
Add(tmp, tmp, x, len);      // tmp = x + sqrt(x² - 1)
Ln(y, tmp, len);            // y = ln(x + sqrt(x² - 1))
```
Note: When a step's dst and src refer to the same buffer (in-place operation), most AscendC APIs support it, but confirm for the specific API.
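As a sanity check before code generation, the five-step sequence can be mirrored on the host in plain C++ and compared against `std::acosh`. `acosh_reference` is an illustrative helper, not AscendC code:

```cpp
#include <cmath>
#include <cstddef>

// Host-side reference of the five-step acosh decomposition, applied element
// by element in the same order as the AscendC sequence (Mul, Adds, Sqrt,
// Add, Ln), so the algebra can be checked against std::acosh.
void acosh_reference(const float* x, float* y, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        float tmp = x[i] * x[i];  // Mul:  tmp = x²
        tmp = tmp + (-1.0f);      // Adds: tmp = x² - 1
        tmp = std::sqrt(tmp);     // Sqrt: tmp = sqrt(x² - 1)
        tmp = tmp + x[i];         // Add:  tmp = x + sqrt(x² - 1)
        y[i] = std::log(tmp);     // Ln:   y = ln(x + sqrt(x² - 1))
    }
}
```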
3.3 Tiling Strategy
Important: AscendC operators adopt a two-level tiling strategy. Refer to the corresponding document for the operator type:
┌─────────────────────────────────────────────────────────────┐
│ Global Memory (GM) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Data of totalLength elements │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
┌────────────────┼────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Core 0 │ │ Core 1 │ ... │ Core 39 │ ← Block-level Tiling (Inter-core Splitting)
└──────────┘ └──────────┘ └──────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ UB 0 │ │ UB 1 │ │ UB 39 │ ← UB-level Tiling (Intra-core Splitting)
└──────────┘ └──────────┘ └──────────┘
Block-level tiling (inter-core splitting):
- Distributes data across multiple AI Cores for parallel processing
- Load balancing: full-core/tail-core strategy; each tail core processes no more data than a full core
UB-level tiling (intra-core splitting):
- Each core processes its share of the data in tiles
- UB alignment: 32 bytes
- UB capacity: must not exceed UB_SIZE_LIMIT (query it via the platform interface in real code; example value: 192 KB)
Reference documents:
- Element-wise operations: read `references/elementwise-tiling.md` (includes a complete two-level tiling implementation)
- Reduction operations: read `references/reduction-tiling.md`
- General principles: see `references/general-tiling-principles.md`
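The full-core/tail-core split above can be sketched in plain C++. `BlockSplit` and `SplitAcrossCores` are illustrative names (the field names follow the tiling struct shown later in this document), and the sketch assumes each tail core processes exactly one element less than a full core:

```cpp
#include <cstdint>

// Sketch of block-level (inter-core) tiling: formerNum cores take the
// ceiling share, tailNum cores take the floor share, so every element
// is covered and the tail load never exceeds the full-core load.
struct BlockSplit {
    int64_t formerNum;    // number of full cores
    int64_t formerLength; // elements per full core
    int64_t tailNum;      // number of tail cores
    int64_t tailLength;   // elements per tail core (<= formerLength)
};

BlockSplit SplitAcrossCores(int64_t totalLength, int64_t coreNum) {
    BlockSplit s{};
    s.formerLength = (totalLength + coreNum - 1) / coreNum; // ceil share
    s.tailLength   = totalLength / coreNum;                 // floor share
    s.formerNum    = totalLength - s.tailLength * coreNum;  // cores carrying the remainder
    s.tailNum      = coreNum - s.formerNum;
    return s;
}
```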
3.4 Hardware Constraints
- UB buffers: must be 32-byte aligned, even when only a small amount of data is logically stored
- Reduction operators: a single-value buffer must still be allocated 32 B
- Precision handling:
  - FP32 input: compute directly, no precision lifting needed
  - FP16 input: must be lifted to FP32 for computation to preserve accuracy
  - BF16 input: must be lifted to FP32 for computation; the vector unit does not support bfloat16 arithmetic directly
- Workspace requirements:
  - Elementwise operators: SYSTEM_WORKSPACE_SIZE (usually 16 MB)
  - Other operator types: calculate from the actual tiling data size
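The 32-byte UB rule can be expressed as a small rounding helper; `AlignUpToUb` is an illustrative name. For example, a single 4-byte reduction result still occupies a full 32-byte slot:

```cpp
#include <cstdint>

// Round a logical byte size up to the 32-byte UB allocation granularity.
int64_t AlignUpToUb(int64_t bytes) {
    constexpr int64_t kUbAlign = 32;
    return ((bytes + kUbAlign - 1) / kUbAlign) * kUbAlign;
}
```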
3.5 Quick Reference: UB Allocation for Common Operator Types
Quickly determine bufferCoefficient based on the number of operator inputs and data types:
Single-input Single-output Elementwise (acosh, relu, sigmoid, exp, ln, sqrt, abs...):
| Data Type | UB Layout | bufferCoefficient |
|---|---|---|
| float32 | inQ(2×4) + outQ(2×4) + tmpBuf(1×4) = 20 | 20 |
| float16 | inQ(2×2) + outQ(2×2) + tmpBuf1(1×4) + tmpBuf2(1×4) = 16 | 16 |
Dual-input Single-output Elementwise (add, mul, sub, div...):
| Data Type | UB Layout | bufferCoefficient |
|---|---|---|
| float32 | inQ_X(2×4) + inQ_Y(2×4) + outQ(2×4) + tmpBuf(2×4) = 32 | 32 |
| float16 | inQ_X(2×2) + inQ_Y(2×2) + outQ(2×2) + tmpBuf(3×4) = 24 | 24 |
Practical Experience: bufferCoefficient is the most critical parameter in the code-gen phase. The design document must clearly specify the value for each dtype, otherwise code generation cannot calculate tileLength correctly.
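How tileLength falls out of the bufferCoefficient can be sketched as follows. `ComputeTileLength` is an illustrative helper, and 192 KB is only the example UB limit used earlier; real code should query the platform interface:

```cpp
#include <cstdint>

constexpr int64_t kUbSizeLimit = 192 * 1024; // example value; query at runtime in real code
constexpr int64_t kUbAlignBytes = 32;

// Divide the available UB bytes by the per-element byte cost
// (bufferCoefficient), then round down to a 32-byte-aligned element count.
int64_t ComputeTileLength(int64_t bufferCoefficient, int64_t dtypeSize) {
    int64_t maxElems = kUbSizeLimit / bufferCoefficient; // elements that fit in UB
    int64_t alignElems = kUbAlignBytes / dtypeSize;      // elements per 32-byte unit
    return (maxElems / alignElems) * alignElems;         // round down to alignment
}
```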
3.6 Generate the Design Document

Based on the collected information, read the `templates/design-template.md` template, fill in all sections, and write the output to `csrc/ops/<op_name>/design.md`.
Output location: `ascend-kernel/csrc/ops/<op_name>/design.md` (overwrites the placeholder file created during initialization)

Interaction Process
When called by the scheduling skill (recommended flow):
- Receive operator name and function description
- Automatically select implementation path
- Read reference documents and generate complete design document
- Output to design.md
When called independently:
- Requirement Collection: Understand operator requirements through dialogue
- Solution Recommendation: Recommend implementation path based on requirements
- Detailed Design: Generate complete design document
- Check and Confirm: Confirm design key points with the user
- Handover to Development: Generate checklist and prepare for coding phase
Notes
Tiling Parameter Design Principles
- Parameter structuring:
```cpp
// Good practice: use a struct
struct MyOperatorTilingData {
    int64_t totalLength;  // total data length
    int64_t formerNum;    // number of full cores
    int64_t formerLength; // data length per full core
    int64_t tailNum;      // number of tail cores
    int64_t tailLength;   // data length per tail core
    int64_t tileLength;   // length processed per UB pass
};

// Avoid: a long list of independent parameters
void KernelFunc(int64_t totalLength, int64_t tileNum, int64_t tileLength, ...);
```
- Two-level alignment:
```cpp
// Block level: cache-line alignment (512 bytes)
constexpr int64_t CACHE_LINE_BYTE_LENGTH = 512;
int64_t totalLengthCoreAlign = ((totalLengthCore + CACHE_LINE_BYTE_LENGTH - 1) /
                                CACHE_LINE_BYTE_LENGTH) * CACHE_LINE_BYTE_LENGTH;

// UB level: 32-byte alignment
int64_t ubAlignElements = 32 / dtypeSize;
int64_t tileLengthAligned = ((tileLength + ubAlignElements - 1) /
                             ubAlignElements) * ubAlignElements;
```
- UB allocation table: every operator design must include a UB allocation table that lists:
  - the name and purpose of every buffer
  - the size (in bytes) of each buffer
  - the buffer count (single or double buffer)
  - total UB usage and verification against the capacity constraint
- Double buffer: use BUFFER_NUM=2 to enable double buffering and hide memory latency
Other Notes
- Data Type Alignment: Ensure PyTorch tensor types match AscendC kernel types (half ↔ float16, float ↔ float32)
- Memory Alignment: AscendC requires memory address alignment (UB 32B, Cache Line 512B)
- Shape Constraints: Some operators have special shape requirements (e.g., need to be divisible by tile size)
- Performance Trade-off: Find a balance between code complexity and performance
- Interface Definition: Check whether PyTorch/NumPy or similar libraries expose a comparable operator interface; if so, model the interface definition on theirs
- Test Input Range: Specify the valid input range of the operator in the design document (e.g., acosh requires x >= 1), and test cases should generate data accordingly
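The last note can be made concrete with a small generator: test inputs for a domain-restricted operator like acosh should be drawn only from the valid range. `MakeAcoshInputs` and the [1, 10] sampling range are illustrative choices:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Generate test inputs restricted to acosh's valid domain (x >= 1).
// The upper bound of 10 is an arbitrary example choice.
std::vector<float> MakeAcoshInputs(std::size_t n, uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(1.0f, 10.0f); // stay within x >= 1
    std::vector<float> data(n);
    for (auto& v : data) v = dist(rng);
    return data;
}
```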
Definition of Done (DoD)
After generation, the design document must include the following key deliverables (for direct consumption by the code-gen skill):
- Function signature: a complete declaration, e.g. `at::Tensor op_name(const at::Tensor &self, ...)`
- Supported data types: listed explicitly (e.g., float16, float32)
- AscendC API call pseudocode: every computation step mapped to a specific API
- UB allocation table: buffer layout and bufferCoefficient for each dtype
- Tiling parameter struct: field definitions and calculation formulas
- FP16/BF16 precision-lifting flow: if half precision is supported, the Cast path must be described
Next Step

After the design is complete, use the `ascendc-operator-code-gen` skill to generate the concrete code implementation.