Generation of Triton Operator Requirement Documents
Workflow Overview
Generating a Triton operator requirement document proceeds through the following phases:
- Requirement Analysis → Deliverables: Function Definition, Competitor Comparison
- Prototype Design → Deliverables: API Interface Definition
- Specification Constraints → Deliverables: Input/Output Constraints, Hardware Limitations
- Feature Implementation → Deliverables: Tiling Strategy, Kernel Implementation Scheme
Phase 1: Requirement Analysis
1.1 Functional Analysis
Must include:
- Operator Function Description: Clearly describe the role and application scenarios of the operator
- Mathematical Formula: Provide core calculation formulas using standard mathematical symbols
- Variable Description Table:
| Variable | Type | Meaning | Constraints |
|---|---|---|---|
| [Variable Name] | [Type] | [Meaning] | [Constraints] |
Key Terms (must be used accurately):
- GM (Global Memory): Global memory, large-capacity storage on DDR
- UB (Unified Buffer): Unified buffer, high-speed cache inside the Vector Core of AI Core
- L1 Buffer: Level 1 buffer, cache inside the Cube Core of AI Core
- AI Core: Computing core of Ascend processor, A2/A3 usually has 24 units, each containing 1 Cube computing core and 2 Vector computing cores
- Tiling: Data splitting strategy, decomposing large tasks into small chunks
- Reduction Operation: Dimensionality reduction calculations such as sum, mean, max, etc.
- Precision Upcasting/Downcasting: Type conversion such as FP16→FP32 (upcast) or FP32→FP16 (downcast)
1.2 Competitor Solution Analysis
Must include:
- Competitor Operator List:
| Competitor Name | Source Framework | Interface Definition | Implemented Function | Constraints |
|---|---|---|---|---|
| [Name] | [Framework] | [Interface] | [Function] | [Constraints] |
- Comparison Analysis:
- Function Comparison: Differences in functions supported by each framework
- Performance Comparison: Performance on different hardware platforms
- Design Reference: Excellent designs that can be referenced
Phase 2: Prototype Design
2.1 Interface Definition
Triton Interface Features:
- Defined using Python functions
- Supports automatic differentiation
- Supports multiple data types
Interface Example:
```python
from typing import Optional

import torch


def triton_operator(
    input: torch.Tensor,
    param1: torch.Tensor,
    param2: Optional[torch.Tensor] = None,
    eps: float = 1e-6,
) -> torch.Tensor:
    """[Operator Function Description]

    Args:
        input: Input tensor with shape [..., D]
        param1: Parameter 1 with shape [D]
        param2: Parameter 2 (optional) with shape [D]
        eps: Small constant for numerical stability
    Returns:
        Output tensor with the same shape as input
    """
    pass
```
2.2 Interface Description Table
| Parameter Name | Type | Input/Output | Description | Constraints |
|---|---|---|---|---|
| [Parameter Name] | [Type] | [Input/Output] | [Description] | [Constraints] |
2.3 Data Type Support
| Interface Type | Supported Data Types | Data Format |
|---|---|---|
| Triton | FLOAT16, BF16, FLOAT | ND |
Phase 3: Specification Constraints
3.1 Input Tensor Constraints
| Constraint Item | Constraint Conditions | Description |
|---|---|---|
| Shape | [Specific Constraints] | [Description] |
| Data Type | [Supported Types] | [Description] |
| Data Format | ND | Unified use of ND format |
| Memory Alignment | 16-byte or 32-byte | Hardware requirement |
3.2 Output Tensor Constraints
| Constraint Item | Constraint Conditions | Description |
|---|---|---|
| Shape | [Specific Constraints] | [Description] |
| Data Type | [Specific Constraints] | [Description] |
3.3 Hardware Constraints
Hardware Limitations that must be considered:
- AI Core Architecture:
- A2/A3 usually has 24 AI Cores
- Each AI Core contains 1 Cube computing core and 2 Vector computing cores
- Cube Core is dedicated to matrix computation, Vector Core is dedicated to vector computation
- UB Buffer Size: 192KB (A2/A3), dedicated to Vector Core
- L1 Buffer Size: Usually 1MB (A2/A3), dedicated to Cube Core
- Memory Alignment Requirements:
- UB buffer must be 32-byte aligned
- Single-value buffers (such as mean) require 32B space (even if only 4B is needed logically)
- Data Type Size: FP16=2B, BF16=2B, FP32=4B
Phase 4: Feature Implementation Scheme
4.1 Tiling Splitting
This is the most critical part and must be explained in detail.
4.1.1 Inter-Core Splitting Strategy
Must include:
- Splitting Principles:
  - How to divide tasks across multiple AI Cores
  - Why this splitting method is chosen
  - How to ensure load balancing
- Calculation Method:
  Input: x[B, D]
  // Step 1: Calculate the amount of data processed by each Core
  data_per_core = ceil(total_size / num_cores)
  // Step 2: Calculate the data range of the current Core
  core_start = core_id * data_per_core
  core_end = min((core_id + 1) * data_per_core, total_size)
- Example:
  - Provide specific input shapes
  - Show the splitting results
  - Explain the data range processed by each Core
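The calculation method above can be sketched in plain Python (24 AI Cores assumed, per the A2/A3 description; `split_across_cores` is an illustrative helper, not a Triton API):

```python
import math

def split_across_cores(total_size: int, num_cores: int = 24):
    """Even inter-core split: each core handles ceil(total/num_cores)
    elements; the last active core may get a smaller remainder."""
    data_per_core = math.ceil(total_size / num_cores)
    ranges = []
    for core_id in range(num_cores):
        core_start = core_id * data_per_core
        core_end = min((core_id + 1) * data_per_core, total_size)
        if core_start < core_end:  # skip cores that receive no work
            ranges.append((core_start, core_end))
    return ranges

# Example: x[B=100, D] split along B: ceil(100/24) = 5 rows per core,
# so only 20 of the 24 cores are active.
ranges = split_across_cores(100)
```

Note the load-balancing implication this exposes: with B = 100, cores 0..19 each process 5 rows while cores 20..23 idle, which is exactly the kind of detail the Example should call out.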
4.1.2 Intra-Core Loop Strategy
Must include:
- UB Space Calculation:
  Total UB size: 192KB
  Data type sizes: FP16=2B, FP32=4B
  Buffers required for a single loop:
  - Input buffer: [Size] × [Type Size]
  - Intermediate buffer: [Size] × [Type Size]
  - Output buffer: [Size] × [Type Size]
  Elements processed per loop = Total UB size / Bytes required per element
- Buffer Allocation Strategy:
  - List all required buffers
  - Explain the size and purpose of each buffer
  - Account for alignment requirements
- Precision Processing Strategy:
  - Whether precision upcasting (FP16→FP32) is required
  - At which stage to upcast precision
  - At which stage to downcast precision
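As a concrete illustration of the UB budget arithmetic, here is a sketch for a hypothetical FP16 row-wise operator over x[B, D] that assumes a three-buffer layout (FP16 input, FP32 upcast intermediate, FP16 output); real operators will have different buffer sets:

```python
UB_BYTES = 192 * 1024  # UB size per Vector Core on A2/A3
FP16_B, FP32_B = 2, 4  # bytes per element

def rows_per_loop(d: int) -> int:
    """Rows of x[B, D] that fit in UB per loop iteration, given an
    FP16 input buffer, an FP32 intermediate (upcast) buffer, and an
    FP16 output buffer, each holding D elements per row."""
    bytes_per_row = d * FP16_B + d * FP32_B + d * FP16_B
    return UB_BYTES // bytes_per_row

# Example: D = 1024 gives 8192 bytes per row, so 24 rows fit per loop.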
4.2 Kernel Implementation
4.2.1 Computational Flow Diagram
Must draw a data flow diagram:
Input Tensor (GM) Parameter Tensor (GM)
│ │
▼ ▼
[Load to UB] [Load to UB]
│ │
▼ ▼
[Calculation Step 1] [Preprocessing]
│ │
▼ │
[Calculation Step 2] ───┘
│
▼
[Final Calculation]
│
▼
Output Tensor (GM)
Key Points:
- Label the data type of each step
- Label data transmission between GM↔UB
- Label the location of precision conversion
4.2.2 Core Implementation Logic
Explain the implementation separately for each input data type:
FP32 Input Type:
- Inter-core task allocation
- UB buffer management (list all buffers)
- Calculation process (explain in detail step by step)
FP16/BF16 Input Type:
- Inter-core task allocation
- UB buffer management (including upcasting/downcasting buffers)
- Calculation process (including precision conversion steps)
Hardware Optimization Points:
- Vectorized computation
- Data reuse
- Memory access optimization
- Alignment processing
Things Absolutely Not to Do
- ❌ Using vague terms (such as "appropriate splitting", "reasonable allocation")
- ❌ Ignoring hardware constraints (UB size, alignment requirements)
- ❌ Failing to explain the specific Tiling calculation method
- ❌ Failing to distinguish processing strategies for different data types
- ❌ Omitting data types from the data flow diagram
- ❌ Ignoring alignment requirements for reduction operations (must be 32B)
- ❌ Confusing the roles of Vector Core and Cube Core (Vector Core for vector computation, Cube Core for matrix computation)
- ❌ Conflating UB and L1 (UB is dedicated to the Vector Core, L1 to the Cube Core)
Common Pitfalls
Pitfall 1: Ignoring UB Size Limitations
Symptom: The designed scheme exceeds UB capacity
Solution:
- Calculate the total size of all buffers
- Ensure the total size < Total UB size
- If exceeded, adjust the amount of data processed per loop
Pitfall 2: Ignoring Memory Alignment
Symptom: Hardware errors or performance degradation
Solution:
- UB buffers are aligned to 32 bytes
- Allocate 32B space for single-value buffers (mean, variance, etc.)
- Use `ceil(size / 32) × 32` to calculate the allocated space
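The alignment rule reduces to a one-line helper (a sketch; the name `align_up` is illustrative, not a framework API):

```python
UB_ALIGN = 32  # UB buffers must be 32-byte aligned

def align_up(nbytes: int, alignment: int = UB_ALIGN) -> int:
    """Round a byte count up to the next multiple of `alignment`."""
    return (nbytes + alignment - 1) // alignment * alignment

# A single FP32 scalar (e.g. a mean) logically needs 4 B
# but still occupies a full 32 B slot in UB: align_up(4) == 32
```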
Pitfall 3: Precision Loss
Symptom: Inaccurate calculation results when input is FP16
Solution:
- Upcast precision to FP32 before reduction operations
- Complete all calculations in FP32 precision
- Finally downcast precision to the output type
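The effect is easy to demonstrate on the host with NumPy (assuming NumPy is available; this simulates the kernel's accumulator behavior in software, not actual UB hardware):

```python
import numpy as np

x = np.full(10000, 0.1, dtype=np.float16)

# Naive FP16 accumulator: once the running sum reaches 256, adding
# ~0.1 rounds to a no-op (the FP16 ulp at 256 is 0.25), so the sum stalls.
acc16 = np.float16(0.0)
for v in x:
    acc16 = np.float16(acc16 + v)

# Upcasting to FP32 before the reduction keeps the result near 1000.
acc32 = x.astype(np.float32).sum()
```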
Pitfall 4: Unreasonable Tiling Strategy
Symptom: Poor performance or inability to handle large shapes
Solution:
- Choose splitting dimensions based on operator characteristics
- Ensure each Core completes calculations independently
- Avoid cross-Core data dependencies
Quality Check Checklist
After completing the document, check the following items:
- [ ] Requirement Analysis: function description, mathematical formula, variable table, competitor comparison
- [ ] Prototype Design: interface definition, parameter table, data type support
- [ ] Specification Constraints: input/output constraints, hardware limitations
- [ ] Tiling Splitting: inter-core strategy, intra-core loop, UB budget, precision strategy
- [ ] Kernel Implementation: data flow diagram, per-dtype logic, optimization points
Reference Resources
For detailed design guidelines and examples, please refer to:
- triton-operator-template.md - Complete document template
- ascend-terminology.md - Ascend Terminology Glossary
- tiling-strategies.md - Detailed Tiling Strategies
Official Documentation: