triton-operator-design

Generation of Triton Operator Requirement Documents

Workflow Overview

Generating a Triton operator requirement document proceeds through the following phases:
  1. Requirement Analysis → Deliverables: Function Definition, Competitor Comparison
  2. Prototype Design → Deliverables: API Interface Definition
  3. Specification Constraints → Deliverables: Input/Output Constraints, Hardware Limitations
  4. Feature Implementation → Deliverables: Tiling Strategy, Kernel Implementation Scheme

Phase 1: Requirement Analysis

1.1 Functional Analysis

Must include:
  • Operator Function Description: clearly describe what the operator does and its application scenarios
  • Mathematical Formula: give the core calculation formulas in standard mathematical notation
  • Variable Description Table:

| Variable | Type | Meaning | Constraints |
| --- | --- | --- | --- |
| [Variable Name] | [Type] | [Meaning] | [Constraints] |

Key Terms (must be used accurately):
  • GM (Global Memory): global memory; large-capacity storage in DDR
  • UB (Unified Buffer): unified buffer; the high-speed cache inside the Vector Core of an AI Core
  • L1 Buffer: level-1 buffer; the cache inside the Cube Core of an AI Core
  • AI Core: the computing core of an Ascend processor; A2/A3 typically have 24, each containing 1 Cube computing core and 2 Vector computing cores
  • Tiling: the data-splitting strategy that decomposes a large task into small chunks
  • Reduction Operation: dimensionality-reducing calculations such as sum, mean, and max
  • Precision Upcasting/Downcasting: type conversions such as FP16→FP32 or FP32→FP16

1.2 Competitor Solution Analysis

Must include:
  • Competitor Operator List:

| Competitor Name | Source Framework | Interface Definition | Implemented Function | Constraints |
| --- | --- | --- | --- | --- |
| [Name] | [Framework] | [Interface] | [Function] | [Constraints] |

  • Comparison Analysis:
    • Function comparison: differences in the functionality supported by each framework
    • Performance comparison: performance on different hardware platforms
    • Design reference: strong designs worth borrowing

Phase 2: Prototype Design

2.1 Interface Definition

Triton Interface Features:
  • Defined as Python functions
  • Supports automatic differentiation
  • Supports multiple data types
Interface Example:

```python
from typing import Optional

import torch


def triton_operator(
    input: torch.Tensor,
    param1: torch.Tensor,
    param2: Optional[torch.Tensor] = None,
    eps: float = 1e-6,
) -> torch.Tensor:
    """
    [Operator Function Description]

    Args:
        input: Input tensor with shape [..., D]
        param1: Parameter 1 with shape [D]
        param2: Parameter 2 (optional) with shape [D]
        eps: Small constant for numerical stability

    Returns:
        Output tensor with the same shape as input
    """
    pass
```

2.2 Interface Description Table

| Parameter Name | Type | Input/Output | Description | Constraints |
| --- | --- | --- | --- | --- |
| [Parameter Name] | [Type] | [Input/Output] | [Description] | [Constraints] |

2.3 Data Type Support

| Interface Type | Supported Data Types | Data Format |
| --- | --- | --- |
| Triton | FLOAT16, BF16, FLOAT | ND |

Phase 3: Specification Constraints

3.1 Input Tensor Constraints

| Constraint Item | Constraint Conditions | Description |
| --- | --- | --- |
| Shape | [Specific Constraints] | [Description] |
| Data Type | [Supported Types] | [Description] |
| Data Format | ND | ND format is used uniformly |
| Memory Alignment | 16-byte or 32-byte | Hardware requirement |

3.2 Output Tensor Constraints

| Constraint Item | Constraint Conditions | Description |
| --- | --- | --- |
| Shape | [Specific Constraints] | [Description] |
| Data Type | [Specific Constraints] | [Description] |

3.3 Hardware Constraints

Hardware limitations that must be considered:
  • AI Core architecture:
    • A2/A3 typically have 24 AI Cores
    • Each AI Core contains 1 Cube computing core and 2 Vector computing cores
    • The Cube Core is dedicated to matrix computation; the Vector Core to vector computation
  • UB size: 192KB (A2/A3), private to the Vector Core
  • L1 Buffer size: usually 1MB (A2/A3), private to the Cube Core
  • Memory alignment requirements:
    • UB buffers must be 32-byte aligned
    • Single-value buffers (such as a mean) require 32B of space (even if only 4B is needed logically)
  • Data type sizes: FP16=2B, BF16=2B, FP32=4B

Phase 4: Feature Implementation Scheme

4.1 Tiling Splitting

This is the most critical part and must be explained in detail.

4.1.1 Inter-Core Splitting Strategy

Must include:
  1. Splitting principles:
    • How the task is divided across multiple AI Cores
    • Why this splitting method was chosen
    • How load balancing is ensured
  2. Calculation method:
    Input: x[B, D]
    
    // Step 1: compute the amount of data each Core processes
    data_per_core = ceil(total_size / num_cores)
    
    // Step 2: compute the data range of the current Core
    core_start = core_id * data_per_core
    core_end = min((core_id + 1) * data_per_core, total_size)
  3. Example:
    • Give concrete input shapes
    • Show the resulting split
    • State the data range each Core processes
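The calculation method above can be sketched on the host side. This is a minimal illustration, not kernel code; `num_cores=24` follows the A2/A3 figure quoted in this document:

```python
import math


def split_across_cores(total_size: int, num_cores: int = 24):
    """Partition `total_size` work items into contiguous per-core ranges."""
    data_per_core = math.ceil(total_size / num_cores)
    ranges = []
    for core_id in range(num_cores):
        core_start = core_id * data_per_core
        core_end = min((core_id + 1) * data_per_core, total_size)
        if core_start < core_end:  # trailing cores may receive no work
            ranges.append((core_id, core_start, core_end))
    return ranges


# Example: 100 rows over 24 cores -> ceil(100/24) = 5 rows per core,
# so cores 0..19 each take 5 rows and cores 20..23 stay idle.
ranges = split_across_cores(100)
```

Note the load-balance consequence visible in the example: ceiling division can leave trailing cores idle, which is exactly the kind of detail the "splitting principles" section should justify.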

4.1.2 Intra-Core Loop Strategy

Must include:
  1. UB Space Calculation:
    Total UB size: 192KB
    Data type size: FP16=2B, FP32=4B
    
    Buffers required for a single loop:
    - Input buffer: [Size] × [Type Size]
    - Intermediate buffer: [Size] × [Type Size]
    - Output buffer: [Size] × [Type Size]
    
    Amount of data processed per loop = Total UB size / Total space per loop
  2. Buffer Allocation Strategy:
    • List all required buffers
    • Explain the size and purpose of each buffer
    • Consider alignment requirements
  3. Precision Processing Strategy:
    • Whether precision upcasting (FP16→FP32) is required
    • At which stage to upcast precision
    • At which stage to downcast precision
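The UB space calculation above can be made concrete. The sketch below assumes a hypothetical row-wise operator with one FP16 input row, one FP32 intermediate row, one FP16 output row, and a single 32B-aligned scalar buffer; the 192KB UB size and 32B alignment are the A2/A3 figures from this document:

```python
UB_BYTES = 192 * 1024  # total UB per Vector Core (A2/A3)
ALIGN = 32             # UB alignment requirement in bytes


def align_up(nbytes: int, align: int = ALIGN) -> int:
    """Round a byte count up to the next multiple of `align`."""
    return (nbytes + align - 1) // align * align


def rows_per_loop(D: int, in_bytes: int = 2, acc_bytes: int = 4) -> int:
    """How many length-D rows one inner loop can hold in UB (hypothetical layout)."""
    per_row = (align_up(D * in_bytes)      # input buffer (FP16)
               + align_up(D * acc_bytes)   # upcast intermediate (FP32)
               + align_up(D * in_bytes))   # output buffer (FP16)
    scalar = align_up(4)                   # a single FP32 value still occupies 32B
    return (UB_BYTES - scalar) // per_row


# D = 1024: per-row footprint is 2KB + 4KB + 2KB = 8KB,
# so (192KB - 32B) // 8KB = 23 rows fit per loop.
```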

4.2 Kernel Implementation

4.2.1 Computational Flow Diagram

Must draw a data flow diagram:

Input Tensor (GM)       Parameter Tensor (GM)
        │                       │
        ▼                       ▼
  [Load to UB]            [Load to UB]
        │                       │
        ▼                       ▼
[Calculation Step 1]     [Preprocessing]
        │                       │
        ▼                       │
[Calculation Step 2] ◄──────────┘
        │
        ▼
 [Final Calculation]
        │
        ▼
 Output Tensor (GM)

Key Points:
  • Label the data type of each step
  • Label data transfers between GM and UB
  • Label where precision conversions occur

4.2.2 Core Implementation Logic

Explain separately for each input data type:
FP32 Input Type:
  1. Inter-core task allocation
  2. UB buffer management (list all buffers)
  3. Calculation process (explain in detail step by step)
FP16/BF16 Input Type:
  1. Inter-core task allocation
  2. UB buffer management (including upcasting/downcasting buffers)
  3. Calculation process (including precision conversion steps)
Hardware Optimization Points:
  • Vectorized computation
  • Data reuse
  • Memory access optimization
  • Alignment processing

Things Absolutely Not to Do

  • ❌ Using vague terms (such as "appropriate splitting" or "reasonable allocation")
  • ❌ Ignoring hardware constraints (UB size, alignment requirements)
  • ❌ Failing to explain the specific Tiling calculation method
  • ❌ Failing to distinguish processing strategies for different data types
  • ❌ Failing to label data types in the data flow diagram
  • ❌ Ignoring alignment requirements for reduction operations (must be 32B)
  • ❌ Confusing the roles of the Vector Core and the Cube Core (Vector Core for vector computation, Cube Core for matrix computation)
  • ❌ Ignoring the difference between UB and L1 (UB is private to the Vector Core; L1 is private to the Cube Core)

Common Pitfalls

Pitfall 1: Ignoring UB Size Limitations

Symptom: The designed scheme exceeds UB capacity
Solution:
  1. Calculate the total size of all buffers
  2. Ensure the total size < Total UB size
  3. If exceeded, adjust the amount of data processed per loop
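Step 3 ("adjust the amount of data processed per loop") can be sketched as a shrink-until-fits loop. This is a hypothetical helper; the 192KB limit is this document's A2/A3 UB size, and the per-row byte count would come from the buffer inventory in section 4.1.2:

```python
UB_BYTES = 192 * 1024  # total UB per Vector Core (A2/A3)


def choose_rows_per_loop(bytes_per_row: int, start_rows: int = 64) -> int:
    """Halve the per-loop row count until all buffers fit in UB."""
    rows = start_rows
    while rows > 1 and rows * bytes_per_row > UB_BYTES:
        rows //= 2
    if rows * bytes_per_row > UB_BYTES:
        raise ValueError("a single row's buffers exceed UB capacity")
    return rows


# bytes_per_row = 8KB: 64 rows need 512KB, 32 need 256KB, 16 need 128KB -> 16 fits.
```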

Pitfall 2: Ignoring Memory Alignment

Symptom: Hardware errors or performance degradation
Solution:
  1. Align UB buffers to 32 bytes
  2. Allocate 32B of space for single-value buffers (mean, variance, etc.)
  3. Compute the allocated space by rounding the actual size up to the next multiple of 32
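Rounding a byte count up to a multiple of 32 can be written as a one-line helper (a small sketch of the allocation rule above; the function name is illustrative):

```python
def align_up(nbytes: int, align: int = 32) -> int:
    """Round a byte count up to the next multiple of `align`."""
    return (nbytes + align - 1) // align * align


# A 4B FP32 scalar is allocated 32B; a 1000B buffer is padded to 1024B.
```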

Pitfall 3: Precision Loss

Symptom: Inaccurate calculation results when input is FP16
Solution:
  1. Upcast precision to FP32 before reduction operations
  2. Complete all calculations in FP32 precision
  3. Finally downcast precision to the output type
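The failure mode can be reproduced on the host with Python's IEEE 754 half-precision pack/unpack. This is an illustrative simulation of FP16 rounding, not Ascend kernel code: once the accumulator grows, each small addend falls below half an FP16 ulp and is rounded away, while an FP32-style accumulation stays accurate.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


vals = [to_fp16(0.001)] * 10_000  # true sum is about 10.0

# Naive FP16 reduction: every partial sum is rounded back to FP16,
# so the accumulator stalls far below the true sum.
acc_fp16 = 0.0
for v in vals:
    acc_fp16 = to_fp16(acc_fp16 + v)

# Upcast strategy: accumulate in higher precision, downcast once at the end.
acc_fp32 = to_fp16(sum(vals))
```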

Pitfall 4: Unreasonable Tiling Strategy

Symptom: Poor performance or inability to handle large shapes
Solution:
  1. Choose splitting dimensions based on operator characteristics
  2. Ensure each Core completes calculations independently
  3. Avoid cross-Core data dependencies

Quality Checklist

After completing the document, check the following items:

Requirement Analysis

  • Operator function description is clear
  • Mathematical formulas are correct and complete
  • Variable description table includes all key variables
  • Competitor analysis covers mainstream frameworks
  • Terms are used accurately (GM, UB, AI Core, etc.)

Prototype Design

  • Interface definition is complete
  • Parameter descriptions are detailed
  • Data type support is clear

Specification Constraints

  • Input/output constraints are complete
  • Hardware constraints are clear (UB size, L1 size, alignment requirements)
  • Distinguish the uses of Vector Core and Cube Core
  • Boundary conditions are clearly explained

Tiling Splitting

  • Inter-core splitting strategy has specific calculation methods
  • Intra-core loop strategy includes UB space calculation
  • Buffer allocation is detailed
  • Has specific examples

Kernel Implementation

  • Computational flow diagram is clear
  • Data type labels are complete
  • Different input types are explained separately
  • Hardware optimization points are clear

Reference Resources

For detailed design guidelines and examples, please refer to:
  • triton-operator-template.md - Complete document template
  • ascend-terminology.md - Ascend Terminology Glossary
  • tiling-strategies.md - Detailed Tiling Strategies
Official Documentation: