Accepts a Triton operator implementation, automatically invokes the corresponding Torch small-operator implementation (on CPU or NPU) for precision comparison, and generates a precision report. Use this skill when you need to verify the correctness and precision of a Triton operator implementation, compare it against the PyTorch implementation, and produce a standardized precision report.
```bash
npx skill4agent add ascend/agent-skills triton-operator-precision-eval
```

Workflow:

```
┌──────────────────────┐     ┌──────────────────────┐     ┌──────────────────────┐
│ Triton operator impl │────▶│  Generate test data  │────▶│ Run Torch reference  │
└──────────────────────┘     └──────────────────────┘     └──────────────────────┘
           ▲                            │                            │
           │                            ▼                            ▼
           │                 ┌──────────────────────┐     ┌──────────────────────┐
           │                 │   Run Triton impl    │     │ Compute error metrics│
           │                 └──────────────────────┘     └──────────────────────┘
           │                            │                            │
           └────────────────────────────┼────────────────────────────┘
                                        │
                                        ▼
                             ┌──────────────────────┐
                             │   Generate report    │
                             └──────────────────────┘
```

Test data is produced by `test_common.generate_numpy()`, and results are compared by `test_common.validate_cmp()`; kernels run on the NPU through `torch_npu`. A complete example, `test_abs.py`:

```python
import triton
import triton.language as tl
import numpy as np
import torch
import torch_npu  # registers the NPU device with torch
import pytest
import test_common


def torch_pointwise(x0):
    # Torch reference implementation corresponding to the Triton operator
    return torch.abs(x0)


@triton.jit
def triton_abs(in_ptr0, out_ptr0, XBLOCK: tl.constexpr, XBLOCK_SUB: tl.constexpr):
    # Each program instance processes one XBLOCK-sized slice of the input,
    # iterating over it in XBLOCK_SUB-sized chunks.
    offset = tl.program_id(0) * XBLOCK
    base1 = tl.arange(0, XBLOCK_SUB)
    loops1: tl.constexpr = (XBLOCK + XBLOCK_SUB - 1) // XBLOCK_SUB
    for loop1 in range(loops1):
        x0 = offset + (loop1 * XBLOCK_SUB) + base1
        tmp0 = tl.load(in_ptr0 + (x0), None)
        tmp2 = tl.abs(tmp0)
        tl.store(out_ptr0 + (x0), tmp2, None)


@pytest.mark.parametrize('param_list',
                         [
                             ['float16', (2, 4096, 8), 32, 2048, 64],
                             ['float32', (2, 4096, 8), 32, 2048, 64],
                             ['int8', (2, 4096, 8), 32, 2048, 64],
                             ['uint8', (2, 4096, 8), 32, 2048, 64],
                         ])
def test_case(param_list):
    dtype, shape, ncore, xblock, xblock_sub = param_list
    np_x0 = test_common.generate_numpy(shape, dtype)
    x0 = torch.from_numpy(np_x0).to(getattr(torch, dtype)).npu()
    y_ref = torch_pointwise(x0)
    y_cal = torch.zeros(shape, dtype=getattr(torch, dtype)).npu()
    triton_abs[ncore, 1, 1](x0, y_cal, xblock, xblock_sub)
    test_common.validate_cmp(dtype, y_cal, y_ref)
```

```bash
# Run a single test file
pytest test_abs.py -v
# Run all test files
pytest ./examples/ -v
```

| Data Type | Verification Method | Error Threshold |
|---|---|---|
| float16 | Relative Error | rtol=1e-03, atol=1e-03 |
| float32 | Relative Error | rtol=1e-04, atol=1e-04 |
| bfloat16 | Relative Error | rtol=1e-02, atol=1e-02 |
| int32/int64/int16/int8 | Exact Match | - |
| uint32/uint64/uint16/uint8 | Exact Match | - |
| bool | Exact Match | - |
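
The `test_common` helpers used in the example ship with the skill; as a rough, NumPy-only sketch of their behavior per the table above (the real helpers operate on NPU torch tensors, and these bodies and signatures are assumptions):

```python
import numpy as np

# Hypothetical sketch of the test_common helpers; the names mirror the
# example above, but the implementations are assumptions based on the table.
_TOLERANCES = {
    "float16": (1e-3, 1e-3),   # (rtol, atol)
    "float32": (1e-4, 1e-4),
    "bfloat16": (1e-2, 1e-2),
}

def generate_numpy(shape, dtype):
    """Generate random test data of the given shape and dtype."""
    if dtype == "bool":
        return np.random.randint(0, 2, shape).astype(bool)
    if dtype.startswith("float") or dtype == "bfloat16":
        # NumPy has no bfloat16; generate as float32 in that case
        np_dtype = "float32" if dtype == "bfloat16" else dtype
        return np.random.uniform(-2.0, 2.0, shape).astype(np_dtype)
    # Integer types: exact values from a range valid for every dtype
    info = np.iinfo(dtype)
    low, high = max(info.min, -128), min(info.max, 127)
    return np.random.randint(low, high + 1, shape).astype(dtype)

def validate_cmp(dtype, y_cal, y_ref):
    """Compare computed vs. reference results per the precision table."""
    y_cal, y_ref = np.asarray(y_cal), np.asarray(y_ref)
    if dtype in _TOLERANCES:
        rtol, atol = _TOLERANCES[dtype]
        np.testing.assert_allclose(y_cal, y_ref, rtol=rtol, atol=atol)
    else:
        # Integer and bool dtypes must match exactly
        np.testing.assert_array_equal(y_cal, y_ref)
```

In the real harness, `y_cal` and `y_ref` would first be moved off the NPU (e.g. `.cpu().numpy()`) before comparison.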
A precision report is also written to `eco_report.txt`:

```
================================================================================
Triton Operator Precision Verification Report
--------------------------------------------------------------------------------
[Verification Configuration]:
  Data type: float32 (Single Precision)
  MERE threshold: 1.220703e-04
  MARE threshold: 1.220703e-03 (10 × MERE threshold)
  Small-value-domain threshold: 1.000000e-07
--------------------------------------------------------------------------------
[Precision Standards]:
  float16: relative error rtol=1e-03, atol=1e-03
  float32: relative error rtol=1e-04, atol=1e-04
  bfloat16: relative error rtol=1e-02, atol=1e-02
  int32/int64/int16/int8: exact match
  uint32/uint64/uint16/uint8: exact match
  bool: exact match
--------------------------------------------------------------------------------
[Verification Result]:
  Result: FAIL
  Total samples: 4096
--------------------------------------------------------------------------------
[Error Metrics]:
  Mean relative error (MERE): 6.642197e-03
  Threshold requirement: MERE < 1.220703e-04
  Maximum relative error (MARE): 3.458786e+00
  Threshold requirement: MARE < 1.220703e-03
--------------------------------------------------------------------------------
[Pass Criteria]:
  ✓ MERE < threshold: False
  ✓ MARE < 10 × threshold: False
  ✓ Overall result: False
================================================================================
```

| Problem | Possible Cause | Solution |
|---|---|---|
| Triton kernel compilation failed | Triton syntax error or version incompatibility | Check Triton syntax, ensure Triton version is compatible with the code |
| Precision verification failed | Incorrect operator implementation logic or precision loss | Check the operator implementation, adjust the algorithm to improve precision |
| NPU device unavailable | `torch_npu` or the NPU driver is not installed | Install `torch_npu` and the matching NPU driver, then verify the NPU environment |
| Insufficient memory | Test data is too large | Reduce test data scale or adjust parameter configuration |
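
The MERE/MARE figures in the report can be reproduced with a straightforward relative-error computation. A sketch follows; the exact formula the skill uses, including how the small-value-domain threshold of 1e-07 is applied, is an assumption (note that the float32 MERE threshold 1.220703e-04 equals 2**-13):

```python
import numpy as np

def relative_error_metrics(y_cal, y_ref, small=1e-7):
    """Return (MERE, MARE): mean and maximum relative error.

    Denominators smaller than `small` are clamped so near-zero references
    do not blow up the relative error; this is assumed to be the role of
    the report's small-value-domain threshold."""
    y_cal = np.asarray(y_cal, dtype=np.float64)
    y_ref = np.asarray(y_ref, dtype=np.float64)
    denom = np.maximum(np.abs(y_ref), small)
    rel = np.abs(y_cal - y_ref) / denom
    return float(rel.mean()), float(rel.max())

# Pass criteria from the report (float32 case):
# MERE < 2**-13 (~1.220703e-04) and MARE < 10 × that threshold.
def passes_float32(y_cal, y_ref, mere_thresh=2.0 ** -13):
    mere, mare = relative_error_metrics(y_cal, y_ref)
    return mere < mere_thresh and mare < 10 * mere_thresh
```

The failed report above follows this pattern: its MERE (6.64e-03) and MARE (3.46e+00) both exceed their thresholds, so the overall result is False.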