triton-operator-dev

End-to-End Development of Triton Operators

Task Orchestration

Stage	Skill	Output
1	triton-operator-env-config	Working development environment
2	triton-operator-design	Operator requirement document
3	triton-operator-code-gen	Executable code
4	triton-operator-code-review	Code review report
5	triton-operator-precision-eval	Precision verification report
6	triton-operator-performance-eval	Performance evaluation report
7	triton-operator-doc-gen	Interface documentation
8	triton-operator-performance-optim	Optimized code

Sub-Skill Overview

1. triton-operator-env-config

Trigger: First-time development, or the environment is misbehaving
Core: Check CANN → Python → torch → triton-ascend, in that order
Verification: Run `01-vector-add.py`
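
The check order above can be sketched as a small script. This is a minimal sketch, not the skill's actual implementation: using the `ASCEND_HOME_PATH` environment variable as the CANN indicator, the minimum Python version, and the helper names `check_environment` and `first_failure` are all assumptions made for illustration, and triton-ascend is assumed to be importable under the module name `triton`.

```python
import importlib.util
import os
import sys

# Hypothetical minimum version, used only for illustration.
MIN_PYTHON = (3, 8)


def check_environment(env=None):
    """Return (component, ok) pairs in the order CANN, Python, torch, triton-ascend."""
    env = os.environ if env is None else env
    results = []
    # Assumption: a sourced CANN environment exports ASCEND_HOME_PATH.
    results.append(("CANN", bool(env.get("ASCEND_HOME_PATH"))))
    results.append(("Python", sys.version_info[:2] >= MIN_PYTHON))
    # find_spec only locates the package; it does not import it.
    results.append(("torch", importlib.util.find_spec("torch") is not None))
    # Assumption: triton-ascend installs under the module name "triton".
    results.append(("triton-ascend", importlib.util.find_spec("triton") is not None))
    return results


def first_failure(results):
    """Return the name of the first failing component, or None if all pass."""
    for name, ok in results:
        if not ok:
            return name
    return None


if __name__ == "__main__":
    report = check_environment()
    for name, ok in report:
        print(f"{name}: {'ok' if ok else 'MISSING'}")
    print("first failure:", first_failure(report))
```

Reporting the first failing component mirrors the sequential order: a missing CANN install usually makes the later checks meaningless, so it is worth fixing before looking further down the list.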

2. triton-operator-design

Trigger: A new operator needs to be designed
Core: Requirement analysis → prototype design → specification constraints → feature implementation
Key: The design must spell out the concrete calculation method for the Tiling strategy
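
A tiling calculation of the kind the design document is expected to spell out might look like the following. It is a sketch under stated assumptions: the 192KB UB capacity is the figure quoted in this document's anti-pattern list, while the count of live buffers, the alignment value, and the helper names `max_block_elems` and `num_tiles` are hypothetical.

```python
UB_BYTES = 192 * 1024  # UB capacity quoted in this document


def max_block_elems(dtype_bytes, live_buffers, align=16, ub_bytes=UB_BYTES):
    """Largest per-tile element count such that all live buffers fit in UB.

    live_buffers: number of tile-sized buffers resident at once
                  (inputs, outputs, temporaries); hypothetical parameter.
    align: the result is rounded down to a multiple of this value.
    """
    if dtype_bytes <= 0 or live_buffers <= 0 or align <= 0:
        raise ValueError("dtype_bytes, live_buffers and align must be positive")
    raw = ub_bytes // (dtype_bytes * live_buffers)
    return (raw // align) * align


def num_tiles(n_elems, block):
    """Ceiling division: number of tiles needed to cover n_elems."""
    return (n_elems + block - 1) // block


if __name__ == "__main__":
    # Vector add in FP32: two inputs and one output are live at once.
    block = max_block_elems(dtype_bytes=4, live_buffers=3)
    print(block, num_tiles(1_000_000, block))  # prints "16384 62"
```

A real budget would likely need headroom below the full UB size for temporaries the compiler introduces, so the value returned here is best read as an upper bound rather than a recommended tile size.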

3. triton-operator-code-gen

Trigger: A requirement document exists and code needs to be generated
Process: Confirm the computation logic → design Tiling → generate the Kernel → generate tests

Dependency: Read references/hardware-architecture.md and references/templates.md first
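
The block-and-mask structure that a generated kernel and its test typically share can be illustrated without an NPU. The following is a plain-Python emulation, not Triton code and not output of this skill; `vector_add_blocked`, `reference_add`, and their parameters are hypothetical names chosen for the example.

```python
def vector_add_blocked(x, y, block):
    """Emulate a 1-D blocked kernel: one loop iteration per program instance."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    n = len(x)
    out = [0.0] * n
    grid = (n + block - 1) // block  # number of program instances
    for pid in range(grid):
        offsets = [pid * block + i for i in range(block)]
        # The mask disables lanes that fall past the end of the data.
        mask = [off < n for off in offsets]
        for off, valid in zip(offsets, mask):
            if valid:
                out[off] = x[off] + y[off]
    return out


def reference_add(x, y):
    """Straightforward reference that a generated test would compare against."""
    return [a + b for a, b in zip(x, y)]


if __name__ == "__main__":
    # A length that is not a multiple of the block size exercises the tail mask.
    x = [float(i) for i in range(37)]
    y = [2.0 * i for i in range(37)]
    assert vector_add_blocked(x, y, block=16) == reference_add(x, y)
    print("blocked result matches reference")
```

Testing with a length such as 37 against a block of 16 matters because the last program instance covers offsets 32 to 47, of which only 32 to 36 are valid.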

4. triton-operator-code-review

Trigger: After code generation completes, before precision verification
Core: Host-side review → Device-side review → performance-risk review
Key: Static analysis of compliance with Ascend API constraints, Mask completeness, and precision handling
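
One narrow slice of the Mask check can be sketched with Python's standard `ast` module. This is an illustrative sketch only, not the review skill's implementation: the helper `find_unmasked_accesses` is hypothetical, it assumes the kernel refers to the Triton language module as `tl`, and it only recognizes a mask passed as the keyword argument `mask=`, so a mask passed positionally would be reported as missing.

```python
import ast


def find_unmasked_accesses(source):
    """Return (line, call) pairs for tl.load/tl.store calls without mask=."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_tl_access = (
            isinstance(func, ast.Attribute)
            and func.attr in {"load", "store"}
            and isinstance(func.value, ast.Name)
            and func.value.id == "tl"
        )
        # Only keyword masks are detected; positional masks are not.
        if is_tl_access and not any(kw.arg == "mask" for kw in node.keywords):
            findings.append((node.lineno, f"tl.{func.attr}"))
    return sorted(findings)


# Sample kernel source; it is parsed, never executed.
SAMPLE = '''
def kernel(x_ptr, out_ptr, n, BLOCK):
    offs = tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x)
'''

if __name__ == "__main__":
    for line, call in find_unmasked_accesses(SAMPLE):
        print(f"line {line}: {call} has no mask argument")
```

Because the source is only parsed, the check runs on any machine, which suits a review stage that comes before anything is executed on the device.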

5. triton-operator-precision-eval

Trigger: After code review passes, before performance evaluation
Core: Compare against a PyTorch reference implementation → compute error metrics → generate the precision report
Key: Reduction operations must use FP32; confirm that rtol/atol meet the thresholds
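
The error metrics and the rtol/atol check can be sketched in plain Python. This is a minimal sketch, not the skill's report generator: the helper names are hypothetical, the tolerance values in the usage section are illustrative rather than thresholds taken from this document, and the acceptance rule |actual - expected| <= atol + rtol * |expected| is the same form that `torch.allclose` uses.

```python
import math


def error_metrics(actual, expected):
    """Return (max absolute error, max relative error) between two sequences."""
    if len(actual) != len(expected):
        raise ValueError("length mismatch")
    max_abs = 0.0
    max_rel = 0.0
    for a, e in zip(actual, expected):
        abs_err = abs(a - e)
        max_abs = max(max_abs, abs_err)
        if e != 0.0:  # relative error is undefined for a zero reference
            max_rel = max(max_rel, abs_err / abs(e))
    return max_abs, max_rel


def within_tolerance(actual, expected, rtol, atol):
    """Element-wise check |a - e| <= atol + rtol * |e|; non-finite values fail."""
    return all(
        math.isfinite(a) and abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


if __name__ == "__main__":
    expected = [1.0, 2.0, 4.0]   # stands in for the PyTorch reference output
    actual = [1.0005, 2.0, 3.998]  # stands in for the Triton kernel output
    print(error_metrics(actual, expected))
    print(within_tolerance(actual, expected, rtol=1e-3, atol=1e-5))
    print(within_tolerance(actual, expected, rtol=1e-4, atol=1e-5))
```

With these sample values the looser tolerance passes and the tighter one fails, which is the kind of pass/fail line a precision report would record alongside the raw metrics.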

6. triton-operator-performance-eval

Trigger: After precision verification passes, before performance optimization
Core: msprof performance collection → bottleneck diagnosis → hardware utilization analysis → performance report
Key: Profiling must use msprof/msprof op; no other timing method is accepted

7. triton-operator-doc-gen

Trigger: Interface documentation needs to be generated
Output: Standardized Ascend NPU interface documentation (product support table, parameter descriptions, call examples)

8. triton-operator-performance-optim

Trigger: Performance does not meet the target
Process: Performance diagnosis → basic tuning → hardware specialization → advanced optimization
Key: Locate the bottleneck with msprof first, and re-verify precision after optimizing

Quick Decision

Scenario	Stages to Skip
Environment already configured	1
Design document already exists	2
Documentation only	1,2,3,4,5,6,8
Code only	1,2,4,5,6,7,8
Optimization only	1,2,3,4,5,6,7
Skip review and verification	4,5,6
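
The skip table can be read as a set operation over the eight stages. The sketch below encodes the table as given; the scenario keys (`docs_only` and so on) and the `plan` helper are hypothetical identifiers introduced for the example, not names the skills define.

```python
STAGES = {
    1: "triton-operator-env-config",
    2: "triton-operator-design",
    3: "triton-operator-code-gen",
    4: "triton-operator-code-review",
    5: "triton-operator-precision-eval",
    6: "triton-operator-performance-eval",
    7: "triton-operator-doc-gen",
    8: "triton-operator-performance-optim",
}

# Hypothetical scenario keys; the skipped stage numbers come from the table.
SKIP = {
    "environment_configured": {1},
    "design_exists": {2},
    "docs_only": {1, 2, 3, 4, 5, 6, 8},
    "code_only": {1, 2, 4, 5, 6, 7, 8},
    "optimization_only": {1, 2, 3, 4, 5, 6, 7},
    "skip_review_and_verification": {4, 5, 6},
}


def plan(*scenarios):
    """Return the skills to run, in stage order, after applying all skips."""
    skipped = set()
    for scenario in scenarios:
        skipped |= SKIP[scenario]  # scenarios combine by union of skipped stages
    return [STAGES[i] for i in sorted(STAGES) if i not in skipped]


if __name__ == "__main__":
    print(plan("docs_only"))
    print(plan("environment_configured", "design_exists"))
```

Combining scenarios by union means that, for example, an already configured environment plus an existing design document starts the run at stage 3.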

Common Anti-Patterns

❌ Ignoring the UB size (192KB)
❌ Not using FP32 for reduction operations
❌ Using a BLOCK size that is not a multiple of 16 (Cube unit)
❌ Forgetting the Mask (Ascend has zero fault tolerance)
❌ Confusing the roles of Vector Core and Cube Core
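
Why the FP32 rule for reductions matters can be shown with a small host-side emulation. This sketch uses Python's `struct` module to round each partial sum to half or single precision; it illustrates IEEE 754 rounding only and makes no claim about how the Ascend hardware schedules a reduction. The helper names are hypothetical.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_fp32(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def accumulate(values, rounder):
    """Sum values sequentially, rounding the running total after every add."""
    total = 0.0
    for v in values:
        total = rounder(total + v)
    return total


if __name__ == "__main__":
    ones = [1.0] * 4096
    # FP16 has an 11-bit significand, so 2049 is not representable and the
    # running total stops growing once it reaches 2048.
    print(accumulate(ones, to_fp16))  # prints 2048.0
    print(accumulate(ones, to_fp32))  # prints 4096.0
```

The inputs here are exactly representable in both formats, so the whole error comes from the accumulator's precision, which is the part the FP32 rule addresses.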