ascendc-operator-design


AscendC Operator Design Skill

Generate a complete design document (design.md) based on operator requirements for subsequent consumption by the code-gen skill.

Usage Scenarios

Call after ascendc-operator-project-init creates the skeleton and before ascendc-operator-code-gen generates code.

Design Process

1. Operator Requirement Analysis

If called by the scheduling skill, the operator name and function description are already determined; proceed directly to Step 2.
If called independently, confirm the following information with the user:

| Information | Mandatory | Description |
| --- | --- | --- |
| Operator name (snake_case) | Yes | e.g., acosh, rms_norm |
| Function description / mathematical formula | Yes | e.g., "acosh(x) = ln(x + sqrt(x²-1))" |
| Supported data types | No | Default: float16 + float32 |

MANDATORY: Check whether PyTorch / NumPy exposes an interface with the same name. If one exists, the interface signature and semantics must align with it (e.g., torch.acosh, torch.softmax).

2. Select Implementation Path

Recommend an implementation path based on the operator's characteristics:

| Implementation path | Applicable scenario | Criterion |
| --- | --- | --- |
| AscendC Kernel | Pure vector operators | No matrix multiplication involved |
| CATLASS template library | GEMM / FlashAttention | Contains cube (matrix) computation |
| ACLNN wrapper | Operator already built into CANN | No custom kernel required |

New operators default to the AscendC Kernel path unless matrix multiplication is explicitly involved.

3. Generate Detailed Design Document

MANDATORY: Before generating the design document, you must read the following reference documents:
  1. Required: templates/design-template.md — the design document template
  2. Selected by operator type:
    • Element-wise operations (add/relu/acosh/sigmoid...) → references/elementwise-tiling.md
    • Reduction operations (softmax/layernorm...) → references/reduction-tiling.md
  3. General reference: references/general-tiling-principles.md
Never skip reading the reference documents.

3.1 Design Document Structure

The design document includes the following core sections:
  1. Operator Interface Definition — Function signature, parameter description, supported data types
  2. Computation Logic Design — Algorithm description, AscendC API call pseudocode, implementation path selection
  3. Tiling Strategy — Two-level Tiling design (Block-level + UB-level), UB allocation table, tileLength calculation
  4. Workspace Requirements — Workspace size calculation
  5. Performance Optimization — Key optimization points, operator characteristic analysis
  6. Kernel-side Implementation Key Points — Offset calculation, execution process, FP16/BF16 precision lifting process
  7. Implementation Checklist — File structure, code key points, test key points

3.2 Computation Logic Pseudocode (Key Output)

You must decompose the mathematical formula into a sequence of AscendC API calls. This sequence is the direct input to the code-gen skill.
Common mathematical function to AscendC API mapping:

| Mathematical operation | AscendC API | Remarks |
| --- | --- | --- |
| x + y | Add(dst, src0, src1, len) | Two inputs |
| x - y | Sub(dst, src0, src1, len) | Two inputs |
| x * y | Mul(dst, src0, src1, len) | Two inputs |
| x / y | Div(dst, src0, src1, len) | Two inputs |
| x + scalar | Adds(dst, src, scalar, len) | Scalar form; prefer when applicable |
| x * scalar | Muls(dst, src, scalar, len) | Scalar form; prefer when applicable |
| abs(x) | Abs(dst, src, len) | |
| exp(x) | Exp(dst, src, len) | |
| ln(x) | Ln(dst, src, len) | |
| sqrt(x) | Sqrt(dst, src, len) | |
| 1/x | Reciprocal(dst, src, len) | |
| 1/sqrt(x) | Rsqrt(dst, src, len) | |
| tanh(x) | Tanh(dst, src, len) | |
| relu(x) | Relu(dst, src, len) | |
| max(x, y) | Max(dst, src0, src1, len) | |
| min(x, y) | Min(dst, src0, src1, len) | |
| fp16→fp32 | Cast(dst, src, CAST_NONE, len) | Lossless precision lift |
| fp32→fp16 | Cast(dst, src, CAST_ROUND, len) | Lossy precision reduction |

Example — API call sequence for acosh(x):

```cpp
Mul(tmp, x, x, len);        // tmp = x²
Adds(tmp, tmp, -1.0f, len); // tmp = x² - 1
Sqrt(tmp, tmp, len);        // tmp = sqrt(x² - 1)
Add(tmp, tmp, x, len);      // tmp = x + sqrt(x² - 1)
Ln(y, tmp, len);            // y = ln(x + sqrt(x² - 1))
```

Note: most AscendC APIs allow dst and src to be the same buffer (in-place operation), but confirm this for each specific API.
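To sanity-check such a sequence before code generation, a host-side scalar reference that mirrors it step by step can serve as a golden-data generator for tests. This is a minimal sketch; AcoshGolden is a hypothetical helper name, not part of any AscendC API.

```cpp
#include <cmath>
#include <cstddef>

// Mirrors the AscendC call sequence above, one scalar at a time.
// Intended only for generating golden outputs on the host in tests.
void AcoshGolden(const float* x, float* y, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        float tmp = x[i] * x[i];   // Mul:  tmp = x²
        tmp = tmp + (-1.0f);       // Adds: tmp = x² - 1
        tmp = std::sqrt(tmp);      // Sqrt: tmp = sqrt(x² - 1)
        tmp = tmp + x[i];          // Add:  tmp = x + sqrt(x² - 1)
        y[i] = std::log(tmp);      // Ln:   y = ln(x + sqrt(x² - 1))
    }
}
```

For valid inputs (x >= 1) the result should match std::acosh to within float rounding error.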

3.3 Tiling Strategy

Important: AscendC operators adopt a two-level tiling strategy; refer to the document matching your operator type:

```
┌───────────────────────────────────────────────┐
│            Global Memory (GM)                 │
│  ┌─────────────────────────────────────────┐  │
│  │        totalLength elements             │  │
│  └─────────────────────────────────────────┘  │
└───────────────────────────────────────────────┘
        │               │               │
        ▼               ▼               ▼
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Core 0  │    │  Core 1  │ ...│ Core 39  │   ← Block-level tiling (inter-core split)
  └──────────┘    └──────────┘    └──────────┘
        │               │               │
        ▼               ▼               ▼
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │   UB 0   │    │   UB 1   │    │  UB 39   │   ← UB-level tiling (intra-core split)
  └──────────┘    └──────────┘    └──────────┘
```
Block-level Tiling (Inter-core Splitting):
  • Allocate data to multiple AI Cores for parallel processing
  • Load Balancing: Full-core/tail-core strategy, the data volume processed by the tail core is less than or equal to that of the full core
UB-level Tiling (Intra-core Splitting):
  • Split data for processing within each Core
  • UB Alignment: 32 bytes
  • UB Capacity: Do not exceed UB_SIZE_LIMIT (obtained via interface during actual coding, example value: 192KB)
Reference documents:
  • Element-wise operations: read references/elementwise-tiling.md (includes a complete two-level tiling implementation)
  • Reduction operations: read references/reduction-tiling.md
  • General principles: see references/general-tiling-principles.md
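The full-core/tail-core split described above can be sketched as plain host-side arithmetic. This is an illustrative policy, not the canonical AscendC implementation; the field names follow the tiling struct used elsewhere in this document.

```cpp
#include <cstdint>

// One possible full-core/tail-core split: cores with index < formerNum process
// formerLength elements each, the rest process tailLength (tailLength <= formerLength).
struct BlockSplit {
    int64_t formerNum;
    int64_t formerLength;
    int64_t tailNum;
    int64_t tailLength;
};

BlockSplit SplitAcrossCores(int64_t totalLength, int64_t coreNum) {
    BlockSplit s{};
    s.formerLength = (totalLength + coreNum - 1) / coreNum; // ceil(total / cores)
    s.tailLength   = totalLength / coreNum;                 // floor(total / cores)
    s.formerNum    = totalLength - s.tailLength * coreNum;  // cores taking one extra element
    s.tailNum      = coreNum - s.formerNum;
    // Invariant: formerNum*formerLength + tailNum*tailLength == totalLength.
    return s;
}
```

When totalLength divides evenly, formerNum is 0 and every core processes the same tailLength, so the invariant still holds.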

3.4 Hardware Constraints

  • UB Buffer: Must be aligned to 32 bytes, even if only a small amount of data needs to be stored logically
  • Reduction Operators: A single-value buffer requires 32B of space
  • Precision Processing:
    • FP32 Input: No precision lifting needed, compute directly
    • FP16 Input: Must lift precision to FP32 for computation to ensure calculation accuracy
    • BF16 Input: Must lift precision to FP32 for computation, vector computation units do not support direct bfloat16 computation
  • Workspace Requirements:
    • Elementwise Operators: SYSTEM_WORKSPACE_SIZE (usually 16MB)
    • Other Operator Types: Calculate based on actual tiling data size

3.5 Quick Reference: UB Allocation for Common Operator Types

Quickly determine bufferCoefficient from the number of operator inputs and the data type:

Single-input, single-output elementwise (acosh, relu, sigmoid, exp, ln, sqrt, abs...):

| Data type | UB layout | bufferCoefficient |
| --- | --- | --- |
| float32 | inQ(2×4) + outQ(2×4) + tmpBuf(1×4) = 20 | 20 |
| float16 | inQ(2×2) + outQ(2×2) + tmpBuf1(1×4) + tmpBuf2(1×4) = 16 | 16 |

Dual-input, single-output elementwise (add, mul, sub, div...):

| Data type | UB layout | bufferCoefficient |
| --- | --- | --- |
| float32 | inQ_X(2×4) + inQ_Y(2×4) + outQ(2×4) + tmpBuf(2×4) = 32 | 32 |
| float16 | inQ_X(2×2) + inQ_Y(2×2) + outQ(2×2) + tmpBuf(3×4) = 24 | 24 |

Practical experience: bufferCoefficient is the most critical parameter in the code-gen phase. The design document must state the value for each dtype; otherwise code generation cannot compute tileLength correctly.
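Because bufferCoefficient is the number of UB bytes consumed per element across all queues and temp buffers, tileLength falls out of a single division. A minimal sketch, assuming availableUbBytes is the UB budget queried at runtime (192KB is the example value used in this document):

```cpp
#include <cstdint>

// tileLength = UB budget / bytes-per-element, rounded DOWN to the 32-byte
// UB alignment so the resulting tile never overflows the budget.
int64_t ComputeTileLength(int64_t availableUbBytes, int64_t bufferCoefficient,
                          int64_t dtypeSize) {
    int64_t rawTileLength = availableUbBytes / bufferCoefficient;
    int64_t ubAlignElements = 32 / dtypeSize;   // elements per 32-byte block
    return (rawTileLength / ubAlignElements) * ubAlignElements;
}
```

With the float32 single-input layout above (bufferCoefficient = 20) and a 192KB budget, this gives ComputeTileLength(196608, 20, 4) = 9824 elements.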

3.6 Generate the Design Document

Based on the collected information, read the templates/design-template.md template, fill in all sections, and write the result to csrc/ops/<op_name>/design.md.
Output location: ascend-kernel/csrc/ops/<op_name>/design.md (overwriting the placeholder file from the initialization phase).

Interaction Process

When called by the scheduling skill (recommended flow):
  1. Receive operator name and function description
  2. Automatically select implementation path
  3. Read reference documents and generate complete design document
  4. Output to design.md
When called independently:
  1. Requirement Collection: Understand operator requirements through dialogue
  2. Solution Recommendation: Recommend implementation path based on requirements
  3. Detailed Design: Generate complete design document
  4. Check and Confirm: Confirm design key points with the user
  5. Handover to Development: Generate checklist and prepare for coding phase

Notes

Tiling Parameter Design Principles

  1. Parameter Structuring:

    ```cpp
    // Good practice: use a struct
    struct MyOperatorTilingData {
        int64_t totalLength;        // Total data length

        int64_t formerNum;          // Number of full cores
        int64_t formerLength;       // Data length per full core
        int64_t tailNum;            // Number of tail cores
        int64_t tailLength;         // Data length per tail core

        int64_t tileLength;         // UB per-iteration processing length
    };

    // Avoid: a long list of independent parameters
    void KernelFunc(int64_t totalLength, int64_t tileNum, int64_t tileLength, ...);
    ```

  2. Two-level Alignment:

    ```cpp
    // Block level: cache-line alignment (512 bytes)
    constexpr int64_t CACHE_LINE_BYTE_LENGTH = 512;
    int64_t totalLengthCoreAlign = ((totalLengthCore + CACHE_LINE_BYTE_LENGTH - 1) / CACHE_LINE_BYTE_LENGTH) * CACHE_LINE_BYTE_LENGTH;

    // UB level: 32-byte alignment
    int64_t ubAlignElements = 32 / dtypeSize;
    int64_t tileLengthAligned = ((tileLength + ubAlignElements - 1) / ubAlignElements) * ubAlignElements;
    ```

  3. UB Allocation Table: each operator design must include a UB allocation table that lists:
    • Name and purpose of every buffer
    • Size (in bytes) of each buffer
    • Number of buffers (single buffer or double buffer)
    • Total UB usage and verification against the capacity constraint
  4. Double Buffer: use BUFFER_NUM=2 to enable double buffering and hide memory-access latency
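Both formulas in the two-level alignment item follow the same round-up idiom; a tiny runnable check (the helper names here are illustrative, not from any AscendC header):

```cpp
#include <cstdint>

// AlignUp(n, a) = ((n + a - 1) / a) * a — the round-up idiom used above.
int64_t AlignUp(int64_t value, int64_t alignment) {
    return ((value + alignment - 1) / alignment) * alignment;
}

// Block level: per-core byte count rounded up to a 512-byte cache line.
int64_t AlignToCacheLine(int64_t bytes) { return AlignUp(bytes, 512); }

// UB level: element count rounded up to a 32-byte boundary for a given dtype.
int64_t AlignTileLength(int64_t tileLength, int64_t dtypeSize) {
    return AlignUp(tileLength, 32 / dtypeSize);
}
```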

Other Notes

  1. Data Type Alignment: Ensure PyTorch tensor types match AscendC kernel types (half ↔ float16, float ↔ float32)
  2. Memory Alignment: AscendC requires memory address alignment (UB 32B, Cache Line 512B)
  3. Shape Constraints: Some operators have special shape requirements (e.g., need to be divisible by tile size)
  4. Performance Trade-off: Find a balance between code complexity and performance
  5. Interface Definition: Check whether a similar operator interface exists in libraries such as PyTorch / NumPy; if it does, model the interface definition on theirs
  6. Test Input Range: Specify the valid input range of the operator in the design document (e.g., acosh requires x >= 1), and test cases should generate data accordingly

Definition of Done (DoD)

After the design document is generated, it must include the following key deliverables (for direct consumption by the code-gen skill):
  • Function signature: complete declaration, e.g., at::Tensor op_name(const at::Tensor &self, ...)
  • Supported data types: listed explicitly (e.g., float16, float32)
  • AscendC API call pseudocode: each computation step mapped to a specific API
  • UB allocation table: buffer layout and bufferCoefficient for each dtype
  • Tiling parameter struct: field definitions and calculation formulas
  • FP16/BF16 precision-lifting flow: if half precision is supported, the Cast path must be described

Next Step

After the design is complete, use the ascendc-operator-code-gen skill to generate the code implementation.