# triton-operator-design

Generation of Triton Operator Requirement Documents

## Workflow Overview
Generating a Triton operator requirement document proceeds through the following phases:
- Requirement Analysis → Deliverables: Function Definition, Competitor Comparison
- Prototype Design → Deliverables: API Interface Definition
- Specification Constraints → Deliverables: Input/Output Constraints, Hardware Limitations
- Feature Implementation → Deliverables: Tiling Strategy, Kernel Implementation Scheme
## Phase 1: Requirement Analysis

### 1.1 Functional Analysis
Must include:
- Operator Function Description: Clearly describe the role and application scenarios of the operator
- Mathematical Formula: Provide core calculation formulas using standard mathematical symbols
- Variable Description Table:
| Variable | Type | Meaning | Constraints |
|---|---|---|---|
| [Variable Name] | [Type] | [Meaning] | [Constraints] |
Key Terms (must be used accurately):
- GM (Global Memory): Global memory, large-capacity storage on DDR
- UB (Unified Buffer): Unified buffer, high-speed cache inside the Vector Core of AI Core
- L1 Buffer: Level 1 buffer, cache inside the Cube Core of AI Core
- AI Core: compute core of the Ascend processor; an A2/A3 typically has 24, each containing 1 Cube compute core and 2 Vector compute cores
- Tiling: data-splitting strategy that decomposes large tasks into small chunks
- Reduction Operation: dimension-reducing computations such as sum, mean, and max
- Precision Upcasting/Downcasting: type conversion FP16→FP32 or FP32→FP16
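As a hypothetical illustration of the "Mathematical Formula" deliverable (the operator and symbols below are an example, not part of this spec), an RMSNorm-style operator could be specified as:

```latex
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\frac{1}{D}\sum_{j=1}^{D} x_j^2 + \varepsilon}} \cdot \gamma_i
```

Here $x \in \mathbb{R}^{\dots \times D}$ is the input, $\gamma \in \mathbb{R}^{D}$ the scale parameter, and $\varepsilon$ the small constant; each symbol would then get a row in the variable description table.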
### 1.2 Competitor Solution Analysis
Must include:
- Competitor Operator List:
| Competitor Name | Source Framework | Interface Definition | Implemented Function | Constraints |
|---|---|---|---|---|
| [Name] | [Framework] | [Interface] | [Function] | [Constraints] |
- Comparison Analysis:
  - Function comparison: differences in the features each framework supports
  - Performance comparison: performance across hardware platforms
  - Design takeaways: notable designs worth borrowing
## Phase 2: Prototype Design

### 2.1 Interface Definition
Triton Interface Features:
- Defined using Python functions
- Supports automatic differentiation
- Supports multiple data types
Interface Example:
```python
from typing import Optional

import torch

def triton_operator(
    input: torch.Tensor,
    param1: torch.Tensor,
    param2: Optional[torch.Tensor] = None,
    eps: float = 1e-6,
) -> torch.Tensor:
    """
    [Operator Function Description]

    Args:
        input: Input tensor with shape [..., D]
        param1: Parameter 1 with shape [D]
        param2: Parameter 2 (optional), shape [D]
        eps: Small constant for numerical stability
    Returns:
        Output tensor with the same shape as input
    """
    pass
```

### 2.2 Interface Description Table
| Parameter Name | Type | Input/Output | Description | Constraints |
|---|---|---|---|---|
| [Parameter Name] | [Type] | [Input/Output] | [Description] | [Constraints] |
### 2.3 Data Type Support
| Interface Type | Supported Data Types | Data Format |
|---|---|---|
| Triton | FLOAT16, BF16, FLOAT | ND |
## Phase 3: Specification Constraints

### 3.1 Input Tensor Constraints
| Constraint Item | Constraint Conditions | Description |
|---|---|---|
| Shape | [Specific Constraints] | [Description] |
| Data Type | [Supported Types] | [Description] |
| Data Format | ND | ND format is used throughout |
| Memory Alignment | 16-byte or 32-byte | Hardware requirement |
### 3.2 Output Tensor Constraints
| Constraint Item | Constraint Conditions | Description |
|---|---|---|
| Shape | [Specific Constraints] | [Description] |
| Data Type | [Specific Constraints] | [Description] |
### 3.3 Hardware Constraints
Hardware Limitations that must be considered:
- AI Core architecture:
  - An A2/A3 typically has 24 AI Cores
  - Each AI Core contains 1 Cube compute core and 2 Vector compute cores
  - The Cube Core handles matrix computation; the Vector Core handles vector computation
- UB buffer size: 192KB (A2/A3), dedicated to the Vector Core
- L1 Buffer size: typically 1MB (A2/A3), dedicated to the Cube Core
- Memory alignment requirements:
  - UB buffers must be 32-byte aligned
  - Single-value buffers (e.g. a mean) occupy 32B even though only 4B is logically needed
  - Data type sizes: FP16=2B, BF16=2B, FP32=4B
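The alignment rules above can be sketched in Python. Constants come from this section; the function names are illustrative, not a vendor API:

```python
UB_BYTES = 192 * 1024                      # A2/A3 Unified Buffer size (192KB)
ALIGN = 32                                 # UB alignment requirement
DTYPE_BYTES = {"fp16": 2, "bf16": 2, "fp32": 4}

def align_up(nbytes: int, align: int = ALIGN) -> int:
    """Round a byte count up to the next multiple of `align`."""
    return (nbytes + align - 1) // align * align

def buffer_bytes(num_elems: int, dtype: str) -> int:
    """Aligned UB footprint of a buffer of `num_elems` elements."""
    return align_up(num_elems * DTYPE_BYTES[dtype])

# A single-value FP32 buffer (e.g. a mean) logically needs 4B,
# but after 32-byte alignment it occupies a full 32B slot.
```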
## Phase 4: Feature Implementation Scheme

### 4.1 Tiling
This is the most critical part and must be explained in detail.
#### 4.1.1 Inter-Core Splitting Strategy
Must include:

- Splitting principles:
  - How tasks are divided across multiple AI Cores
  - Why this splitting scheme was chosen
  - How load balancing is ensured
- Calculation method:

  ```
  Input: x[B, D]
  // Step 1: amount of data handled by each Core
  data_per_core = ceil(total_size / num_cores)
  // Step 2: data range of the current Core
  core_start = core_id * data_per_core
  core_end   = min((core_id + 1) * data_per_core, total_size)
  ```

- Example:
  - Give a concrete input shape
  - Show the resulting split
  - State the data range each Core handles
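The two-step calculation above can be sketched in Python (names such as `core_range` are illustrative):

```python
import math

NUM_CORES = 24  # typical AI Core count on A2/A3

def core_range(core_id: int, total_size: int, num_cores: int = NUM_CORES):
    """[start, end) range of elements handled by one AI Core."""
    data_per_core = math.ceil(total_size / num_cores)          # step 1
    core_start = core_id * data_per_core                       # step 2
    core_end = min((core_id + 1) * data_per_core, total_size)
    return core_start, core_end

# e.g. 1000 rows over 24 cores: cores 0..22 take 42 rows each,
# core 23 takes the remaining 34.
```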
#### 4.1.2 Intra-Core Loop Strategy
Must include:

- UB space calculation:

  ```
  Total UB size: 192KB
  Data type sizes: FP16=2B, FP32=4B
  Buffers required per loop iteration:
    - Input buffer:        [size] × [type size]
    - Intermediate buffer: [size] × [type size]
    - Output buffer:       [size] × [type size]
  Data processed per loop = total UB size / total space per loop
  ```

- Buffer allocation strategy:
  - List all required buffers
  - State the size and purpose of each buffer
  - Account for alignment requirements
- Precision strategy:
  - Whether upcasting (FP16→FP32) is required
  - At which stage to upcast
  - At which stage to downcast
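The UB budgeting arithmetic above can be sketched as follows. The buffer layout (FP16 input row, FP32 intermediate, FP16 output) is an assumed example, not a fixed rule:

```python
UB_BYTES = 192 * 1024  # total UB size (192KB)

def rows_per_loop(D: int, in_b: int = 2, mid_b: int = 4, out_b: int = 2) -> int:
    """Rows of length D that fit in UB per loop iteration, given
    per-element sizes of the input, intermediate, and output buffers."""
    bytes_per_row = D * (in_b + mid_b + out_b)
    return UB_BYTES // bytes_per_row

# D = 4096 with FP16 in/out and an FP32 intermediate:
# 4096 * (2 + 4 + 2) = 32768 bytes per row -> 6 rows per loop.
```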
### 4.2 Kernel Implementation

#### 4.2.1 Computational Flow Diagram
Must draw a data flow diagram:

```
Input Tensor (GM)       Parameter Tensor (GM)
        │                       │
        ▼                       ▼
  [Load to UB]            [Load to UB]
        │                       │
        ▼                       ▼
[Calculation Step 1]     [Preprocessing]
        │                       │
        ▼                       │
[Calculation Step 2] ───────────┘
        │
        ▼
 [Final Calculation]
        │
        ▼
 Output Tensor (GM)
```

Key Points:
- Label the data type at each step
- Label GM↔UB data transfers
- Label where precision conversions occur
#### 4.2.2 Core Implementation Logic
Explain separately for each input data type:
FP32 Input Type:
- Inter-core task allocation
- UB buffer management (list all buffers)
- Calculation process (explain in detail step by step)
FP16/BF16 Input Type:
- Inter-core task allocation
- UB buffer management (including upcasting/downcasting buffers)
- Calculation process (including precision conversion steps)
Hardware Optimization Points:
- Vectorized computation
- Data reuse
- Memory access optimization
- Alignment processing
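The FP16/BF16 precision strategy described above can be modeled numerically with NumPy; this models the numerics only, not the on-device kernel:

```python
import numpy as np

def mean_fp16_safe(x_fp16: np.ndarray) -> np.ndarray:
    """Mean along the last axis: upcast before the reduction,
    reduce in FP32, downcast only at the end."""
    x32 = x_fp16.astype(np.float32)   # upcast FP16 -> FP32
    m32 = x32.mean(axis=-1)           # reduction performed in FP32
    return m32.astype(np.float16)     # downcast to the output type

# Accumulating 4096 FP16 values directly would lose precision;
# FP32 accumulation keeps the mean close to the true value.
x = np.full(4096, 0.1, dtype=np.float16)
m = mean_fp16_safe(x)
```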
## Things You Must Never Do
- ❌ Using vague terms (e.g. "split appropriately", "allocate reasonably")
- ❌ Ignoring hardware constraints (UB size, alignment requirements)
- ❌ Omitting the concrete calculation method for Tiling
- ❌ Failing to distinguish processing strategies for different data types
- ❌ Leaving data types unlabeled in the data flow diagram
- ❌ Ignoring the alignment requirement for reduction operations (must be 32B)
- ❌ Confusing the roles of Vector Core and Cube Core (Vector Core for vector computation, Cube Core for matrix computation)
- ❌ Conflating UB and L1 (UB is dedicated to the Vector Core; L1 to the Cube Core)
## Common Pitfalls

### Pitfall 1: Ignoring the UB Size Limit
Symptom: The designed scheme exceeds UB capacity
Solution:
- Calculate the total size of all buffers
- Ensure the total size < Total UB size
- If exceeded, adjust the amount of data processed per loop
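A quick way to apply the check above, assuming buffer sizes are already aligned (helper names are illustrative):

```python
UB_BYTES = 192 * 1024  # total UB size on A2/A3

def fits_in_ub(buffer_sizes: list[int]) -> bool:
    """True if the summed (already-aligned) buffer sizes fit in UB."""
    return sum(buffer_sizes) <= UB_BYTES

def shrink_tile(per_row_bytes: int) -> int:
    """If a plan exceeds UB, the largest row count per loop that fits."""
    return UB_BYTES // per_row_bytes

# Three 64KB buffers exactly fill UB; a fourth would overflow it.
```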
### Pitfall 2: Ignoring Memory Alignment
Symptom: hardware errors or degraded performance
Solution:
- Align UB buffers to 32 bytes
- Allocate 32B for single-value buffers (mean, variance, etc.)
- Compute allocated space as `ceil(actual_size, 32)`, i.e. round the actual size up to the next multiple of 32 bytes
### Pitfall 3: Precision Loss
Symptom: Inaccurate calculation results when input is FP16
Solution:
- Upcast precision to FP32 before reduction operations
- Complete all calculations in FP32 precision
- Finally downcast precision to the output type
### Pitfall 4: Poorly Chosen Tiling Strategy
Symptom: Poor performance or inability to handle large shapes
Solution:
- Choose splitting dimensions based on operator characteristics
- Ensure each Core completes calculations independently
- Avoid cross-Core data dependencies
## Quality Checklist

After completing the document, check the following:
### Requirement Analysis
- Operator function description is clear
- Mathematical formulas are correct and complete
- Variable description table includes all key variables
- Competitor analysis covers mainstream frameworks
- Terms are used accurately (GM, UB, AI Core, etc.)
### Prototype Design
- Interface definition is complete
- Parameter descriptions are detailed
- Data type support is clear
### Specification Constraints
- Input/output constraints are complete
- Hardware constraints are clear (UB size, L1 size, alignment requirements)
- The uses of Vector Core and Cube Core are distinguished
- Boundary conditions are clearly explained
### Tiling
- Inter-core splitting strategy has specific calculation methods
- Intra-core loop strategy includes UB space calculation
- Buffer allocation is detailed
- Specific examples are provided
### Kernel Implementation
- Computational flow diagram is clear
- Data type labels are complete
- Different input types are explained separately
- Hardware optimization points are clear
## Reference Resources
For detailed design guidelines and examples, please refer to:
- triton-operator-template.md - Complete document template
- ascend-terminology.md - Ascend Terminology Glossary
- tiling-strategies.md - Detailed Tiling Strategies
Official Documentation: