# triton-operator-design

Generation of Triton Operator Requirement Documents

## Workflow Overview
Generating a Triton operator requirement document proceeds through the following phases:
- Requirement Analysis → Deliverables: Function Definition, Competitor Comparison
- Prototype Design → Deliverables: API Interface Definition
- Specification Constraints → Deliverables: Input/Output Constraints, Hardware Limitations
- Feature Implementation → Deliverables: Tiling Strategy, Kernel Implementation Scheme
## Phase 1: Requirement Analysis

### 1.1 Functional Analysis
Must include:
- Operator Function Description: Clearly describe the role and application scenarios of the operator
- Mathematical Formula: Provide core calculation formulas using standard mathematical symbols
- Variable Description Table:
| Variable | Type | Meaning | Constraints |
|---|---|---|---|
| [Variable Name] | [Type] | [Meaning] | [Constraints] |
Key Terms (must be used accurately):
- GM (Global Memory): Global memory, large-capacity storage on DDR
- UB (Unified Buffer): Unified buffer, high-speed cache inside the Vector Core of AI Core
- L1 Buffer: Level 1 buffer, cache inside the Cube Core of AI Core
- AI Core: compute core of the Ascend processor; an A2/A3 typically has 24, each containing 1 Cube compute core and 2 Vector compute cores
- Tiling: data-splitting strategy that decomposes large tasks into small chunks
- Reduction Operation: dimension-reducing computations such as sum, mean, and max
- Precision Upcasting/Downcasting: type conversion FP16→FP32 or FP32→FP16
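As a hypothetical illustration of the "Mathematical Formula" deliverable (the operator and symbols below are an example, not part of this spec), an RMSNorm-style operator could be specified as:

```latex
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\frac{1}{D}\sum_{j=1}^{D} x_j^2 + \varepsilon}} \cdot \gamma_i
```

Here $x \in \mathbb{R}^{\dots \times D}$ is the input, $\gamma \in \mathbb{R}^{D}$ the scale parameter, and $\varepsilon$ the small constant; each symbol would then get a row in the variable description table.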
### 1.2 Competitor Solution Analysis
Must include:
- Competitor Operator List:
| Competitor Name | Source Framework | Interface Definition | Implemented Function | Constraints |
|---|---|---|---|---|
| [Name] | [Framework] | [Interface] | [Function] | [Constraints] |
- Comparison Analysis:
  - Function comparison: differences in the features each framework supports
  - Performance comparison: performance across hardware platforms
  - Design takeaways: notable designs worth borrowing
## Phase 2: Prototype Design

### 2.1 Interface Definition
Triton Interface Features:
- Defined using Python functions
- Supports automatic differentiation
- Supports multiple data types
Interface Example:
```python
from typing import Optional

import torch

def triton_operator(
    input: torch.Tensor,
    param1: torch.Tensor,
    param2: Optional[torch.Tensor] = None,
    eps: float = 1e-6,
) -> torch.Tensor:
    """
    [Operator Function Description]

    Args:
        input: Input tensor with shape [..., D]
        param1: Parameter 1 with shape [D]
        param2: Parameter 2 (optional), shape [D]
        eps: Small constant for numerical stability
    Returns:
        Output tensor with the same shape as input
    """
    pass
```

### 2.2 Interface Description Table
| Parameter Name | Type | Input/Output | Description | Constraints |
|---|---|---|---|---|
| [Parameter Name] | [Type] | [Input/Output] | [Description] | [Constraints] |
### 2.3 Data Type Support
| Interface Type | Supported Data Types | Data Format |
|---|---|---|
| Triton | FLOAT16, BF16, FLOAT | ND |
## Phase 3: Specification Constraints

### 3.1 Input Tensor Constraints
| Constraint Item | Constraint Conditions | Description |
|---|---|---|
| Shape | [Specific Constraints] | [Description] |
| Data Type | [Supported Types] | [Description] |
| Data Format | ND | ND format is used throughout |
| Memory Alignment | 16-byte or 32-byte | Hardware requirement |
### 3.2 Output Tensor Constraints
| Constraint Item | Constraint Conditions | Description |
|---|---|---|
| Shape | [Specific Constraints] | [Description] |
| Data Type | [Specific Constraints] | [Description] |
### 3.3 Hardware Constraints
Hardware Limitations that must be considered:
- AI Core architecture:
  - An A2/A3 typically has 24 AI Cores
  - Each AI Core contains 1 Cube compute core and 2 Vector compute cores
  - The Cube Core handles matrix computation; the Vector Core handles vector computation
- UB buffer size: 192KB (A2/A3), dedicated to the Vector Core
- L1 Buffer size: typically 1MB (A2/A3), dedicated to the Cube Core
- Memory alignment requirements:
  - UB buffers must be 32-byte aligned
  - Single-value buffers (e.g. a mean) occupy 32B even though only 4B is logically needed
  - Data type sizes: FP16=2B, BF16=2B, FP32=4B
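The alignment rules above can be sketched in Python. Constants come from this section; the function names are illustrative, not a vendor API:

```python
UB_BYTES = 192 * 1024                      # A2/A3 Unified Buffer size (192KB)
ALIGN = 32                                 # UB alignment requirement
DTYPE_BYTES = {"fp16": 2, "bf16": 2, "fp32": 4}

def align_up(nbytes: int, align: int = ALIGN) -> int:
    """Round a byte count up to the next multiple of `align`."""
    return (nbytes + align - 1) // align * align

def buffer_bytes(num_elems: int, dtype: str) -> int:
    """Aligned UB footprint of a buffer of `num_elems` elements."""
    return align_up(num_elems * DTYPE_BYTES[dtype])

# A single-value FP32 buffer (e.g. a mean) logically needs 4B,
# but after 32-byte alignment it occupies a full 32B slot.
```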
## Phase 4: Feature Implementation Scheme

### 4.1 Tiling
This is the most critical part and must be explained in detail.
#### 4.1.1 Inter-Core Splitting Strategy
Must include:

- Splitting principles:
  - How tasks are divided across multiple AI Cores
  - Why this splitting scheme was chosen
  - How load balancing is ensured
- Calculation method:

  ```
  Input: x[B, D]
  // Step 1: amount of data handled by each Core
  data_per_core = ceil(total_size / num_cores)
  // Step 2: data range of the current Core
  core_start = core_id * data_per_core
  core_end   = min((core_id + 1) * data_per_core, total_size)
  ```

- Example:
  - Give a concrete input shape
  - Show the resulting split
  - State the data range each Core handles
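The two-step calculation above can be sketched in Python (names such as `core_range` are illustrative):

```python
import math

NUM_CORES = 24  # typical AI Core count on A2/A3

def core_range(core_id: int, total_size: int, num_cores: int = NUM_CORES):
    """[start, end) range of elements handled by one AI Core."""
    data_per_core = math.ceil(total_size / num_cores)          # step 1
    core_start = core_id * data_per_core                       # step 2
    core_end = min((core_id + 1) * data_per_core, total_size)
    return core_start, core_end

# e.g. 1000 rows over 24 cores: cores 0..22 take 42 rows each,
# core 23 takes the remaining 34.
```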
#### 4.1.2 Intra-Core Loop Strategy
Must include:

- UB space calculation:

  ```
  Total UB size: 192KB
  Data type sizes: FP16=2B, FP32=4B
  Buffers required per loop iteration:
    - Input buffer:        [size] × [type size]
    - Intermediate buffer: [size] × [type size]
    - Output buffer:       [size] × [type size]
  Data processed per loop = total UB size / total space per loop
  ```

- Buffer allocation strategy:
  - List all required buffers
  - State the size and purpose of each buffer
  - Account for alignment requirements
- Precision strategy:
  - Whether upcasting (FP16→FP32) is required
  - At which stage to upcast
  - At which stage to downcast
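The UB budgeting arithmetic above can be sketched as follows. The buffer layout (FP16 input row, FP32 intermediate, FP16 output) is an assumed example, not a fixed rule:

```python
UB_BYTES = 192 * 1024  # total UB size (192KB)

def rows_per_loop(D: int, in_b: int = 2, mid_b: int = 4, out_b: int = 2) -> int:
    """Rows of length D that fit in UB per loop iteration, given
    per-element sizes of the input, intermediate, and output buffers."""
    bytes_per_row = D * (in_b + mid_b + out_b)
    return UB_BYTES // bytes_per_row

# D = 4096 with FP16 in/out and an FP32 intermediate:
# 4096 * (2 + 4 + 2) = 32768 bytes per row -> 6 rows per loop.
```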
### 4.2 Kernel Implementation

#### 4.2.1 Computational Flow Diagram
Must draw a data flow diagram:

```
Input Tensor (GM)       Parameter Tensor (GM)
        │                       │
        ▼                       ▼
  [Load to UB]            [Load to UB]
        │                       │
        ▼                       ▼
[Calculation Step 1]     [Preprocessing]
        │                       │
        ▼                       │
[Calculation Step 2] ───────────┘
        │
        ▼
 [Final Calculation]
        │
        ▼
 Output Tensor (GM)
```

Key Points:
- Label the data type at each step
- Label GM↔UB data transfers
- Label where precision conversions occur
#### 4.2.2 Core Implementation Logic
Explain separately for each input data type:
FP32 Input Type:
- Inter-core task allocation
- UB buffer management (list all buffers)
- Calculation process (explain in detail step by step)
FP16/BF16 Input Type:
- Inter-core task allocation
- UB buffer management (including upcasting/downcasting buffers)
- Calculation process (including precision conversion steps)
Hardware Optimization Points:
- Vectorized computation
- Data reuse
- Memory access optimization
- Alignment processing
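The FP16/BF16 precision strategy described above can be modeled numerically with NumPy; this models the numerics only, not the on-device kernel:

```python
import numpy as np

def mean_fp16_safe(x_fp16: np.ndarray) -> np.ndarray:
    """Mean along the last axis: upcast before the reduction,
    reduce in FP32, downcast only at the end."""
    x32 = x_fp16.astype(np.float32)   # upcast FP16 -> FP32
    m32 = x32.mean(axis=-1)           # reduction performed in FP32
    return m32.astype(np.float16)     # downcast to the output type

# Accumulating 4096 FP16 values directly would lose precision;
# FP32 accumulation keeps the mean close to the true value.
x = np.full(4096, 0.1, dtype=np.float16)
m = mean_fp16_safe(x)
```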
## Things You Must Never Do
- ❌ Using vague terms (e.g. "split appropriately", "allocate reasonably")
- ❌ Ignoring hardware constraints (UB size, alignment requirements)
- ❌ Omitting the concrete calculation method for Tiling
- ❌ Failing to distinguish processing strategies for different data types
- ❌ Leaving data types unlabeled in the data flow diagram
- ❌ Ignoring the alignment requirement for reduction operations (must be 32B)
- ❌ Confusing the roles of Vector Core and Cube Core (Vector Core for vector computation, Cube Core for matrix computation)
- ❌ Conflating UB and L1 (UB is dedicated to the Vector Core; L1 to the Cube Core)
## Common Pitfalls

### Pitfall 1: Ignoring the UB Size Limit
Symptom: The designed scheme exceeds UB capacity
Solution:
- Calculate the total size of all buffers
- Ensure the total size < Total UB size
- If exceeded, adjust the amount of data processed per loop
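A quick way to apply the check above, assuming buffer sizes are already aligned (helper names are illustrative):

```python
UB_BYTES = 192 * 1024  # total UB size on A2/A3

def fits_in_ub(buffer_sizes: list[int]) -> bool:
    """True if the summed (already-aligned) buffer sizes fit in UB."""
    return sum(buffer_sizes) <= UB_BYTES

def shrink_tile(per_row_bytes: int) -> int:
    """If a plan exceeds UB, the largest row count per loop that fits."""
    return UB_BYTES // per_row_bytes

# Three 64KB buffers exactly fill UB; a fourth would overflow it.
```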
### Pitfall 2: Ignoring Memory Alignment
Symptom: hardware errors or degraded performance
Solution:
- Align UB buffers to 32 bytes
- Allocate 32B for single-value buffers (mean, variance, etc.)
- Compute allocated space as `ceil(actual_size, 32)`, i.e. round the actual size up to the next multiple of 32 bytes
### Pitfall 3: Precision Loss
Symptom: Inaccurate calculation results when input is FP16
Solution:
- Upcast precision to FP32 before reduction operations
- Complete all calculations in FP32 precision
- Finally downcast precision to the output type
### Pitfall 4: Poorly Chosen Tiling Strategy
Symptom: Poor performance or inability to handle large shapes
Solution:
- Choose splitting dimensions based on operator characteristics
- Ensure each Core completes calculations independently
- Avoid cross-Core data dependencies
## Quality Checklist

After completing the document, check the following:
### Requirement Analysis
- Operator function description is clear
- Mathematical formulas are correct and complete
- Variable description table includes all key variables
- Competitor analysis covers mainstream frameworks
- Terms are used accurately (GM, UB, AI Core, etc.)
### Prototype Design
- Interface definition is complete
- Parameter descriptions are detailed
- Data type support is clear
### Specification Constraints
- Input/output constraints are complete
- Hardware constraints are clear (UB size, L1 size, alignment requirements)
- The uses of Vector Core and Cube Core are distinguished
- Boundary conditions are clearly explained
### Tiling
- Inter-core splitting strategy has specific calculation methods
- Intra-core loop strategy includes UB space calculation
- Buffer allocation is detailed
- Specific examples are provided
### Kernel Implementation
- Computational flow diagram is clear
- Data type labels are complete
- Different input types are explained separately
- Hardware optimization points are clear
## Reference Resources
For detailed design guidelines and examples, please refer to:
- triton-operator-template.md - Complete document template
- ascend-terminology.md - Ascend Terminology Glossary
- tiling-strategies.md - Detailed Tiling Strategies
Official Documentation: