triton-operator-dev

Full-Process Development of Triton Operators

Task Orchestration

| Stage | Skill | Output |
|---|---|---|
| 1 | triton-operator-env-config | Working development environment |
| 2 | triton-operator-design | Operator requirement document |
| 3 | triton-operator-code-gen | Executable code |
| 4 | triton-operator-code-review | Code review report |
| 5 | triton-operator-precision-eval | Precision verification report |
| 6 | triton-operator-performance-eval | Performance evaluation report |
| 7 | triton-operator-doc-gen | Interface documentation |
| 8 | triton-operator-performance-optim | Optimized code |

Sub-Skill Overview

1. triton-operator-env-config

  • Trigger: First-time development, or the environment is misbehaving
  • Core: Check CANN → Python → torch → triton-ascend, in that order
  • Verification: Run 01-vector-add.py
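
The check order above can be sketched as a small script. This is only an illustration: the environment variable names used to detect CANN (`ASCEND_HOME_PATH`, `ASCEND_TOOLKIT_HOME`) and the assumption that triton-ascend is importable as `triton` are not stated in this document, and the function name `check_environment` is hypothetical.

```python
import importlib.util
import os
import sys


def check_environment():
    """Run the four checks in order and return (name, ok, detail) tuples."""
    results = []

    # CANN: assumed to be detectable through environment variables exported
    # by the toolkit's setup script (an assumption, not a specification).
    cann_home = os.environ.get("ASCEND_HOME_PATH") or os.environ.get("ASCEND_TOOLKIT_HOME")
    results.append(("CANN", bool(cann_home), cann_home or "toolkit path variable not set"))

    # Python: report the version of the running interpreter.
    results.append(("Python", True, sys.version.split()[0]))

    # torch and triton-ascend: only check that the modules can be located,
    # without importing them (an import may fail noisily on a broken setup).
    for name, module in (("torch", "torch"), ("triton-ascend", "triton")):
        found = importlib.util.find_spec(module) is not None
        results.append((name, found, "found" if found else f"module '{module}' not found"))
    return results


if __name__ == "__main__":
    for name, ok, detail in check_environment():
        print(f"{'OK  ' if ok else 'FAIL'} {name}: {detail}")
```

A passing report only shows that the pieces are present; running 01-vector-add.py remains the actual verification step.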

2. triton-operator-design

  • Trigger: A new operator needs to be designed
  • Core: Requirement analysis → prototype design → specification constraints → feature implementation
  • Key: The design must include the concrete calculation method for the Tiling strategy
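
As an illustration of what a "concrete calculation method" for Tiling might look like, here is a minimal sketch for a 1-D elementwise operator. It uses the 192KB UB size and the multiple-of-16 rule quoted later in this document. The parameters `live_buffers` and `ub_fraction` are hypothetical, and the real usable UB budget depends on the compiler and the kernel.

```python
UB_BYTES = 192 * 1024  # UB size quoted in this document


def plan_tiling(n_elements, dtype_bytes, live_buffers, align=16, ub_fraction=0.5):
    """Pick a 1-D block size that fits in UB, plus the resulting grid size.

    live_buffers: number of block-sized tensors resident in UB at once
                  (inputs, outputs, temporaries); hypothetical parameter.
    ub_fraction:  share of UB budgeted for them; hypothetical safety margin.
    """
    budget = int(UB_BYTES * ub_fraction)
    max_elems = budget // (dtype_bytes * live_buffers)
    block = (max_elems // align) * align              # round down to a multiple of `align`
    if block == 0:
        raise ValueError("UB budget too small for even one aligned block")
    padded = -(-n_elements // align) * align          # problem size rounded up to `align`
    block = min(block, padded)                        # never larger than the padded problem
    grid = -(-n_elements // block)                    # ceiling division
    return block, grid


if __name__ == "__main__":
    # FP32 vector add: two inputs and one output live in UB per block.
    print(plan_tiling(1_000_000, dtype_bytes=4, live_buffers=3))
```

A design document would state the equivalent formula and the reasoning behind each constant, so that the code-gen stage can reproduce the numbers.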

3. triton-operator-code-gen

  • Trigger: A requirement document exists and code needs to be generated
  • Process: Confirm the computation logic → design the Tiling → generate the Kernel → generate tests
  • Dependency: Read references/hardware-architecture.md and references/templates.md first
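
To make the logic → Tiling → Kernel → test sequence concrete, the sketch below emulates a blocked 1-D vector add in plain Python. It is not Triton code and does not come from references/templates.md. It only mirrors the usual pattern of one program instance per block, per-lane offsets, and a mask for the ragged last block. All function names are hypothetical.

```python
def vector_add_program(pid, x, y, out, n_elements, block_size):
    """Emulate one program instance of a 1-D blocked vector add."""
    offsets = [pid * block_size + i for i in range(block_size)]
    mask = [off < n_elements for off in offsets]    # guards the ragged last block
    for off, valid in zip(offsets, mask):
        if valid:                                   # masked-off lanes touch no memory
            out[off] = x[off] + y[off]


def vector_add(x, y, block_size=16):
    """Host side: size the grid with ceiling division and launch each program."""
    n_elements = len(x)
    out = [0.0] * n_elements
    grid = -(-n_elements // block_size)
    for pid in range(grid):
        vector_add_program(pid, x, y, out, n_elements, block_size)
    return out


if __name__ == "__main__":
    # Generated test: compare against a straightforward reference.
    x = [float(i) for i in range(37)]               # 37 is not a multiple of 16
    y = [2.0 * v for v in x]
    assert vector_add(x, y) == [a + b for a, b in zip(x, y)]
    print("ok")
```

Picking a length that is not a multiple of the block size in the test is deliberate: it exercises the masked tail, which is where missing masks show up.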

4. triton-operator-code-review

  • Trigger: After code generation, before precision verification
  • Core: Host-side review → Device-side review → performance-risk review
  • Key: Static analysis of compliance with Ascend API constraints, Mask completeness, and precision handling

5. triton-operator-precision-eval

  • Trigger: After the code review passes, before performance evaluation
  • Core: Compare against the PyTorch reference implementation → compute error metrics → generate the precision report
  • Key: Reduction operations must use FP32, and rtol/atol must meet the thresholds
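
The "compute error metrics" step can be sketched in plain Python as follows. The closeness rule is the one `torch.allclose` documents, |actual − expected| ≤ atol + rtol × |expected|. The default thresholds of 1e-3 and the report field names are hypothetical; this document does not specify the thresholds.

```python
def error_report(actual, expected, rtol=1e-3, atol=1e-3):
    """Compare two equal-length sequences of floats element by element."""
    if len(actual) != len(expected):
        raise ValueError("length mismatch")
    max_abs = 0.0
    max_rel = 0.0
    mismatches = 0
    for a, e in zip(actual, expected):
        abs_err = abs(a - e)
        max_abs = max(max_abs, abs_err)
        if e != 0.0:                                 # relative error is undefined at 0
            max_rel = max(max_rel, abs_err / abs(e))
        # Same closeness rule as torch.allclose.
        if abs_err > atol + rtol * abs(e):
            mismatches += 1
    return {
        "max_abs_err": max_abs,
        "max_rel_err": max_rel,
        "mismatches": mismatches,
        "passed": mismatches == 0,
    }


if __name__ == "__main__":
    print(error_report([1.0, 2.0005, 3.0], [1.0, 2.0, 3.0]))
```

In practice `expected` would come from the PyTorch reference implementation and `actual` from the Triton operator, both flattened to the same shape.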

6. triton-operator-performance-eval

  • Trigger: After precision verification passes, before performance optimization
  • Core: msprof performance collection → bottleneck diagnosis → hardware utilization analysis → performance report
  • Key: Use msprof / msprof op; other timing methods are not accepted

7. triton-operator-doc-gen

  • Trigger: Interface documentation needs to be generated
  • Output: Standardized Ascend NPU interface documentation (product support table, parameter descriptions, call examples)

8. triton-operator-performance-optim

  • Trigger: Performance does not meet the target
  • Process: Performance diagnosis → basic tuning → hardware specialization → advanced optimization
  • Key: Locate bottlenecks with msprof first, and re-verify precision after optimizing

Quick Decision

| Scenario | Stages to skip |
|---|---|
| Environment already configured | 1 |
| Design document already exists | 2 |
| Only documentation needed | 1, 2, 3, 4, 5, 6, 8 |
| Only code needed | 1, 2, 4, 5, 6, 7, 8 |
| Only optimization needed | 1, 2, 3, 4, 5, 6, 7 |
| Skip review and verification | 4, 5, 6 |
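
The skip table can also be read as a lookup from scenario to the stages that still run. The sketch below encodes the table as data. Combining several scenarios by taking the union of their skipped stages is an assumption, and the scenario keys and function name are hypothetical.

```python
STAGES = {
    1: "triton-operator-env-config",
    2: "triton-operator-design",
    3: "triton-operator-code-gen",
    4: "triton-operator-code-review",
    5: "triton-operator-precision-eval",
    6: "triton-operator-performance-eval",
    7: "triton-operator-doc-gen",
    8: "triton-operator-performance-optim",
}

SKIP = {
    "environment already configured": {1},
    "design document already exists": {2},
    "only documentation needed": {1, 2, 3, 4, 5, 6, 8},
    "only code needed": {1, 2, 4, 5, 6, 7, 8},
    "only optimization needed": {1, 2, 3, 4, 5, 6, 7},
    "skip review and verification": {4, 5, 6},
}


def stages_to_run(*scenarios):
    """Return the skills that remain after applying every scenario's skips."""
    skipped = set()
    for scenario in scenarios:
        skipped |= SKIP[scenario]        # assumption: skips from scenarios add up
    return [STAGES[i] for i in sorted(STAGES) if i not in skipped]


if __name__ == "__main__":
    print(stages_to_run("environment already configured",
                        "design document already exists"))
```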

Common Anti-Patterns

  • ❌ Ignoring the UB size limit (192KB)
  • ❌ Running reduction operations without FP32
  • ❌ Using a BLOCK size that is not a multiple of 16 (Cube unit)
  • ❌ Forgetting the Mask (Ascend has zero fault tolerance)
  • ❌ Confusing the roles of the Vector Core and the Cube Core
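
The FP32 reduction rule is easy to motivate with a small numerical experiment. The sketch below is a pure-Python emulation using the `struct` module's half-precision (`'e'`) and single-precision (`'f'`) formats, not NPU code. It sums the same FP16 inputs with an FP16 accumulator and with an FP32 accumulator, then compares both against a double-precision reference. The input value and length are arbitrary choices for illustration.

```python
import struct


def round_to(fmt, value):
    """Round a Python float to IEEE half ('e') or single ('f') precision."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def accumulate(values, fmt):
    """Sum `values`, rounding the running total to `fmt` after every add."""
    total = 0.0
    for v in values:
        total = round_to(fmt, total + v)
    return total


values = [round_to("e", 0.1)] * 4096       # FP16 inputs
exact = sum(values)                         # double-precision reference
fp16_sum = accumulate(values, "e")          # FP16 accumulator
fp32_sum = accumulate(values, "f")          # FP32 accumulator

if __name__ == "__main__":
    print(f"reference={exact:.4f} fp16={fp16_sum:.4f} fp32={fp32_sum:.4f}")
```

The FP16 accumulator stops growing once each addend is smaller than half the spacing between representable values near the running total, so the error grows with the length of the reduction. Keeping the accumulator in FP32 avoids this for inputs of this size.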