triton-operator-dev
Full-Process Development of Triton Operators
Task Orchestration
| Stage | Skill | Output |
|---|---|---|
| 1 | triton-operator-env-config | Working development environment |
| 2 | triton-operator-design | Operator requirement document |
| 3 | triton-operator-code-gen | Executable code |
| 4 | triton-operator-code-review | Code inspection report |
| 5 | triton-operator-precision-eval | Precision verification report |
| 6 | triton-operator-performance-eval | Performance evaluation report |
| 7 | triton-operator-doc-gen | Interface document |
| 8 | triton-operator-performance-optim | Optimized code |
Sub-Skill Overview
1. triton-operator-env-config
- Trigger: First-time development or environment exceptions
- Core: Check CANN → Python → torch → triton-ascend in sequence
- Verification: Run `01-vector-add.py`
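The check order above (CANN → Python → torch → triton-ascend) can be scripted. The following is a minimal sketch, not the skill's actual checker. The helper names `check_environment` and `first_failure` are hypothetical, and three details are assumptions rather than facts from this document: that CANN exposes the `ASCEND_TOOLKIT_HOME` environment variable, that the minimum Python version is 3.8, and that triton-ascend is importable as `triton`. Adjust them to your installation.

```python
import importlib.util
import os
import sys


def check_environment(env=None, find_spec=importlib.util.find_spec,
                      python_version=None, min_python=(3, 8)):
    """Return (component, ok) pairs in the order CANN -> Python -> torch -> triton-ascend.

    Assumptions (not from the original document): CANN is detected through the
    ASCEND_TOOLKIT_HOME variable, min_python is a placeholder, and
    triton-ascend is importable as `triton`.
    """
    env = os.environ if env is None else env
    if python_version is None:
        python_version = sys.version_info[:2]

    def importable(module_name):
        # find_spec returns None when the top-level module cannot be located.
        return find_spec(module_name) is not None

    return [
        ("CANN", bool(env.get("ASCEND_TOOLKIT_HOME"))),
        ("Python", tuple(python_version) >= tuple(min_python)),
        ("torch", importable("torch")),
        ("triton-ascend", importable("triton")),
    ]


def first_failure(results):
    """Name of the first failing component, or None if every check passed."""
    for name, ok in results:
        if not ok:
            return name
    return None


if __name__ == "__main__":
    results = check_environment()
    for name, ok in results:
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Because the checks are ordered, `first_failure` points at the earliest broken layer, which is the one to fix before running `01-vector-add.py`.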
2. triton-operator-design
- Trigger: Need to design a new operator
- Core: Requirement analysis → prototype design → specification constraints → feature implementation
- Key: Must include the concrete calculation method for the Tiling strategy
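As an illustration of what a "concrete calculation method" for tiling might look like, here is a minimal sketch that derives a tile length from the 192KB UB size quoted in this document's anti-pattern list. The memory model (every live buffer holds one tile of the same dtype), the `align` parameter, and the function names are assumptions for illustration, not the design skill's prescribed formula.

```python
UB_BYTES = 192 * 1024  # UB size quoted in this document's anti-pattern list


def max_tile_elements(dtype_bytes, live_buffers, align=16, ub_bytes=UB_BYTES):
    """Largest per-tile element count such that `live_buffers` tiles fit in UB.

    Hypothetical model: each live buffer holds one tile of the same dtype,
    and the tile length is rounded down to a multiple of `align`.
    """
    if dtype_bytes <= 0 or live_buffers <= 0 or align <= 0:
        raise ValueError("dtype_bytes, live_buffers and align must be positive")
    raw = ub_bytes // (dtype_bytes * live_buffers)
    return (raw // align) * align


def num_tiles(total_elements, tile_elements):
    """Ceiling division: number of tiles needed to cover the whole input."""
    if tile_elements <= 0:
        raise ValueError("tile_elements must be positive")
    return -(-total_elements // tile_elements)


if __name__ == "__main__":
    # Example: FP32 vector add with two input tiles and one output tile live.
    tile = max_tile_elements(dtype_bytes=4, live_buffers=3)
    print(tile, num_tiles(100_000, tile))
```

A real design document would also account for temporaries the compiler allocates, so treat the result as an upper bound to validate, not a guaranteed fit.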
3. triton-operator-code-gen
- Trigger: A requirement document already exists and code needs to be generated
- Process: Confirm calculation logic → design Tiling → generate Kernel → generate test cases
- Dependency: Must first read `references/hardware-architecture.md` and `references/templates.md`
4. triton-operator-code-review
- Trigger: After code generation, before precision verification
- Core: Host-side inspection → Device-side inspection → performance risk inspection
- Key: Static analysis of Ascend API constraint compliance, Mask integrity, and precision processing
5. triton-operator-precision-eval
- Trigger: After code inspection passes, before performance evaluation
- Core: Compare with PyTorch reference implementation → calculate error metrics → generate precision report
- Key: Reduction operations must use FP32; make sure rtol/atol meet the thresholds
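The "calculate error metrics" step can be sketched in plain Python as follows. This is an illustrative helper, not the skill's reporting code: the name `error_metrics` and the default thresholds are hypothetical. The pass criterion `|a - e| <= atol + rtol * |e|` has the same form as the one documented for `torch.allclose`, and in practice the inputs would be the flattened kernel output and the PyTorch reference output.

```python
def error_metrics(actual, expected, rtol=1e-3, atol=1e-3):
    """Compare two equal-length float sequences element by element.

    Pass criterion per element: |a - e| <= atol + rtol * |e|.
    The default rtol/atol values are placeholders, not official thresholds.
    """
    if len(actual) != len(expected):
        raise ValueError("length mismatch")
    max_abs = 0.0
    max_rel = 0.0
    mismatches = 0
    for a, e in zip(actual, expected):
        abs_err = abs(a - e)
        max_abs = max(max_abs, abs_err)
        if e != 0:
            # Relative error is undefined where the reference is exactly zero.
            max_rel = max(max_rel, abs_err / abs(e))
        if abs_err > atol + rtol * abs(e):
            mismatches += 1
    return {
        "max_abs_err": max_abs,
        "max_rel_err": max_rel,
        "mismatches": mismatches,
        "passed": mismatches == 0,
    }


if __name__ == "__main__":
    print(error_metrics([1.0, 2.0, 3.5], [1.0, 2.0, 3.0]))
```

Reporting the mismatch count alongside the maxima makes it easier to tell a single outlier from a systematic precision problem.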
6. triton-operator-performance-eval
- Trigger: After precision verification passes, before performance optimization
- Core: msprof performance collection → bottleneck diagnosis → hardware utilization analysis → performance report
- Key: Must use msprof / msprof op; other timing methods are not accepted
7. triton-operator-doc-gen
- Trigger: Need to generate interface documents
- Output: Standardized Ascend NPU interface documents (product support table, parameter description, call examples)
8. triton-operator-performance-optim
- Trigger: Performance does not meet standards
- Process: Performance diagnosis → basic tuning → hardware specialization → advanced optimization
- Key: Locate bottlenecks with msprof first, and re-verify precision after optimizing
Quick Decision
| Scenario | Skip Stages |
|---|---|
| Environment already configured | 1 |
| Design document already exists | 2 |
| Only need documents | 1,2,3,4,5,6,8 |
| Only need code | 1,2,4,5,6,7,8 |
| Only need optimization | 1,2,3,4,5,6,7 |
| Skip inspection and verification | 4,5,6 |
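The skip table can be read as a filter over the eight stages. A minimal sketch, assuming stages run in numeric order; the `stages_to_run` helper is hypothetical and only the stage names come from the orchestration table:

```python
STAGES = {
    1: "triton-operator-env-config",
    2: "triton-operator-design",
    3: "triton-operator-code-gen",
    4: "triton-operator-code-review",
    5: "triton-operator-precision-eval",
    6: "triton-operator-performance-eval",
    7: "triton-operator-doc-gen",
    8: "triton-operator-performance-optim",
}


def stages_to_run(skip=()):
    """Return the skill names that remain after removing the skipped stages."""
    skip = set(skip)
    unknown = skip - set(STAGES)
    if unknown:
        raise ValueError(f"unknown stage numbers: {sorted(unknown)}")
    return [STAGES[i] for i in sorted(STAGES) if i not in skip]


if __name__ == "__main__":
    # "Only need documents" scenario from the table.
    print(stages_to_run({1, 2, 3, 4, 5, 6, 8}))
```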
Common Anti-Patterns
- ❌ Ignoring the UB size (192KB)
- ❌ Not using FP32 for reduction operations
- ❌ Using a BLOCK size that is not a multiple of 16 (Cube unit)
- ❌ Forgetting the Mask (Ascend has zero fault tolerance)
- ❌ Confusing the purposes of the Vector Core and the Cube Core
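The FP32 reduction rule in the list above can be illustrated without any NPU. The sketch below emulates an FP16 accumulator using Python's `struct` half-precision format (`"e"`) and compares it with a wider accumulator. It is a numerical illustration only, not Triton or Ascend code; the function names are hypothetical, and the Python float (FP64) accumulator merely stands in for FP32.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_with_fp16_accumulator(values):
    """Sum with the running total rounded to FP16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


def sum_with_wide_accumulator(values):
    """Sum FP16 inputs in a wider accumulator (FP64 here, standing in for FP32)."""
    acc = 0.0
    for v in values:
        acc += to_fp16(v)
    return acc


if __name__ == "__main__":
    ones = [1.0] * 4096
    print(sum_with_fp16_accumulator(ones))  # 2048.0
    print(sum_with_wide_accumulator(ones))  # 4096.0
```

The FP16 accumulator stalls at 2048 because FP16 values between 2048 and 4096 are spaced 2 apart, so adding 1 rounds back to the same value. This is the kind of silent error that accumulating reductions in FP32 avoids.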