ascendc-operator-performance-eval

Maintain JSONL-only profiler performance test cases under csrc/ops/<op>/test in ascend-kernel. Collect data using torch_npu.profiler (with fixed warmup=5 and active=5), aggregate the Total Time(us) from ASCEND_PROFILER_OUTPUT/op_statistic.csv, and output a unified Markdown comparison report (custom operator vs baseline) that includes a DType column. Do not generate perf_cases.json or *_profiler_results.json. Refer to examples/layer_norm_profiler_reference/ for the reference implementation.

NPX Install

npx skill4agent add ascend/agent-skills ascendc-operator-performance-eval

Performance Evaluation of AscendC Operators with torch_npu.profiler

Reference Files in This Skill Directory

When executing this skill, prioritize materials in this directory:
| File / Directory | Purpose |
| --- | --- |
| SKILL.md (this file) | Process, directory conventions, complete JSONL test case specification, report structure, fixed schedule |
| references/REFERENCE_JSON_CASE_FORMAT.md | Identical to the "JSONL Specification for Performance Test Cases" section below |
| references/REFERENCE_PROFILER_AND_METRICS.md | `torch_npu.profiler`, `op_statistic.csv`, and the `*_ascend_pt` path |
| examples/sample_perf_cases.jsonl | Minimal LayerNorm-style JSONL; can be copied and renamed |
| examples/layer_norm_profiler_reference/ | Layer Norm reference implementation (`layer_norm_profiler_common.py`, `benchmark_layer_norm_torch_npu_profiler.py`, test case JSONL, instructions); new operators can copy this directory to `csrc/ops/<op>/test/` and replace the forward logic and filenames |

Role

In ascend-kernel, establish a reusable process for profiler performance test cases and Markdown comparison reports (custom operator vs baseline) for `csrc/ops/<operator-name>/`. Data collection must use `torch_npu.profiler`, with `warmup` and `active` fixed at 5 (see the next section). For details, refer to `references/REFERENCE_PROFILER_AND_METRICS.md`.

Core Principles (Two Mandatory Constraints):
  1. A comparison report must always be presented: regardless of whether the baseline path is a baseline API or a combination of small operators, the final report must include a dual-path comparison table of custom operator vs baseline. Single-path reports are not allowed, and the baseline must run on NPU.
  2. Test case generation must first read `design.md`: before generating any JSONL test cases, read `csrc/ops/<op>/design.md` in the operator directory and extract parameter constraints, typical shapes, supported dtypes, and key attribute values. Test cases must cover all execution modes described in the design document.

Test Case Source: Load from testcase-gen Test Case Document (MANDATORY)

Before generating or modifying any JSONL test cases, MUST first read the test case document generated by testcase-gen:

Step 0: Read testcase-gen Test Case Document

READ csrc/ops/<op>/test/<op>-test-cases.md
Extract the following from it:
| Extracted Item | Location in Test Case Document | Purpose |
| --- | --- | --- |
| SUPPORTED_DTYPES | §Test Configuration | dtype coverage for JSONL test cases |
| TEST_SHAPES | §Test Configuration | Benchmark for selecting small/medium/large-scale shapes |
| GENERAL_SHAPES | §Test Configuration | Generalized shapes; can be supplemented for performance scenarios |
| NPU Calling Method | §Operator Baseline | Forward call for the custom operator |
| CPU Reference Implementation | §Operator Baseline | Reference implementation for the baseline path |

Conversion Rules from testcase-gen Output to JSONL Test Cases

  1. Select representative shapes from TEST_SHAPES + GENERAL_SHAPES (covering small/medium/large scales); avoid duplicates
  2. Iterate through all dtypes in SUPPORTED_DTYPES for each shape
  3. Fill the `inputs` field of the JSONL with attribute values from design.md (such as block_size, eps, etc.)
  4. Total number of JSONL test cases ≥ 8
  5. The NPU calling method and CPU reference implementation in the operator baseline are used to build the custom operator path and the baseline path

If `<op>-test-cases.md` does not exist: fall back to designing test cases entirely from design.md (following the process below), but note "Test cases are self-designed, not generated by testcase-gen" in the report.
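The conversion rules above can be sketched in plain Python. The shapes, dtypes, and attribute names below are illustrative (they follow the Layer Norm example in this skill, not any particular operator's real test case document):

```python
import json

# Hypothetical values that would come from <op>-test-cases.md / design.md.
SUPPORTED_DTYPES = ["float16", "float32"]
TEST_SHAPES = [[8, 128], [128, 4096], [2, 1024, 4096]]  # small / medium / large

def build_case(shape, dtype, eps=1e-5):
    """One JSONL test case object in the required `inputs` schema."""
    return {
        "inputs": [
            {"name": "x", "type": "tensor", "required": True,
             "dtype": dtype, "shape": shape},
            {"name": "normalized_shape", "type": "attr", "required": True,
             "dtype": "int", "value": [shape[-1]]},
            {"name": "eps", "type": "attr", "required": False,
             "dtype": "float", "value": eps},
        ]
    }

def write_jsonl(path, shapes, dtypes):
    """Write one JSON object per line: every dtype is iterated for every shape."""
    cases = [build_case(s, d) for d in dtypes for s in shapes]
    with open(path, "w", encoding="utf-8") as f:
        for case in cases:
            f.write(json.dumps(case) + "\n")
    return len(cases)
```

With 3 shapes and 2 dtypes this yields 6 cases; a real suite must reach the ≥ 8 minimum by adding shapes or attribute-mode variants.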

Test Case Generation: Must First Read design.md (Mandatory)

Before generating or modifying any JSONL test cases (whether or not the testcase-gen test case document has been loaded), perform the following steps:

Step 1: Read Design Document

READ csrc/ops/<op>/design.md
Extract the following from it:
| Extracted Item | Location in design.md | Purpose |
| --- | --- | --- |
| Supported data types | §1 "Supported Data Types" | dtype coverage for test cases |
| Parameter constraints and value ranges | Constraint column in §1 "Parameter Description" | Valid range for attribute values (e.g., block_size ≤ 128) |
| Typical shapes / input scales | §2 "Computation Logic" / §3 "Tiling Strategy" | Benchmark for small/medium/large-scale test case shapes |
| Mode combinations of key attributes | §2 "Pseudocode" / §1 "Parameter Description" | Execution paths that need individual coverage (e.g., do_transpose=True/False, is_input_split=True/False) |
| Performance key points | §6 "Performance Optimization" / §3 "Tiling Strategy" | Branches that affect performance (e.g., transpose vs non-transpose using different DMA paths) |

Step 2: Test Case Design Rules

| Rule | Description |
| --- | --- |
| Cover all execution modes | When design.md describes multiple execution paths (e.g., transpose/non-transpose, input_split mode), there must be at least one test case for each mode |
| Cover all supported dtypes | At least one set of test cases per supported data type, using a typical medium-scale shape |
| One set each for small/medium/large shapes | Must cover small scale (kernel launch overhead dominant), medium scale (typical production scenario), and large scale (memory bandwidth dominant) |
| Parameter values from constraint ranges | Attribute values (such as block_size) must be selected from the constraints in design.md; they cannot be set arbitrarily |
| Integer/index tensor values must be semantically valid | Specific values of tensors like win_lengths and offsets must comply with operator semantics (e.g., offsets must be valid window start offsets) |

Step 3: Validate Test Cases

After generating test cases, check:
  • All dtypes are covered
  • All execution modes (defined by design.md) have corresponding test cases
  • Parameter values (including attribute values and integer tensor values) are within the constraints of design.md
  • Includes at least one "small shape" test case and one "large shape" test case
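The validation checks above can be automated over the loaded JSONL cases. This is a minimal sketch; the `block_size ≤ 128` constraint and the shape-spread heuristic are illustrative stand-ins for whatever design.md actually specifies:

```python
def validate_cases(cases, supported_dtypes, max_block_size=128):
    """Return a list of human-readable problems; an empty list means the suite passes."""
    problems = []
    dtypes_seen = set()
    numels = []
    for i, case in enumerate(cases):
        names = [spec["name"] for spec in case["inputs"]]
        if len(names) != len(set(names)):
            problems.append(f"case {i}: duplicate input names")
        for spec in case["inputs"]:
            if spec["type"] == "tensor":
                dtypes_seen.add(spec["dtype"])
                n = 1
                for dim in spec["shape"]:
                    n *= dim
                numels.append(n)
            # Illustrative attribute-range check (constraint comes from design.md).
            if spec["name"] == "block_size" and spec["value"] > max_block_size:
                problems.append(f"case {i}: block_size exceeds design.md constraint")
    missing = set(supported_dtypes) - dtypes_seen
    if missing:
        problems.append(f"dtypes not covered: {sorted(missing)}")
    # Crude proxy for "has both a small and a large shape".
    if numels and max(numels) < 16 * min(numels):
        problems.append("shape spread too narrow: add a small and a large shape")
    return problems
```

Execution-mode coverage (e.g., transpose=True/False) is operator-specific and would be checked the same way, against the attribute values design.md names.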

Reference Path Decision Tree (Mandatory)

Performance evaluation always requires dual-path comparison (custom operator vs baseline). Determine the baseline path in the following order:
Does the operator have an equivalent baseline API?
  ├─ Yes (e.g., torch.nn.functional.*, built-in torch_npu operators)
  │    └─ Use the baseline API as the baseline path
  └─ No (no equivalent baseline interface)
       └─ Must implement a "small operator combination" baseline path ← Mandatory requirement of this skill
            └─ Implement using PyTorch basic operator combinations from the "Reference Implementation" or "Pseudocode" section of the design document

Requirements for Small Operator Combination Baseline Path

When there is no equivalent baseline interface, you must:
  1. Read the reference implementation from design.md: the design document usually contains a PyTorch reference implementation (pseudocode or a Python function); use it as the basis for the baseline path.
  2. Use combinations of PyTorch basic operators: `torch.zeros`, slice assignment, `.permute()`, `torch.cat`, etc. Do not use loops with Python scalar assignment (otherwise the profiler collects CPU operators instead of NPU operators, making fair comparison impossible); the entire baseline implementation must be dominated by tensor operations and executable on NPU.
  3. Clearly label it in the report: the report header must state "No equivalent baseline interface; baseline path is a combination of small operators" and list the basic operators used.
  4. The comparison table must be presented: do not degrade to a single-path report because of "no baseline interface"; the three columns "Custom Operator per-step", "Baseline per-step", and "Ratio" must be retained.

NEVER: output a single-path report or skip the comparison table on the grounds of "no equivalent baseline interface".

Fixed Profiler Steps (Mandatory)

| Parameter | Value | Description |
| --- | --- | --- |
| warmup | 5 | Do not modify to other values via script or CLI |
| active | 5 | Do not modify to other values via script or CLI |
| wait | Default 0 | Can be retained in the CLI or as a constant; adjust as needed |
| repeat | Default 1 | Fixed to 1 for simple scenarios; if repeat>1, the CSV selection semantics must be explained in the document |

Call `prof.step()` at the end of each step; total number of loop steps = `repeat * (wait + warmup + active)`.
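The fixed schedule can be sketched as follows. The `torch_npu.profiler` calls appear only as comments, since they require an NPU environment (the `schedule`/`tensorboard_trace_handler` usage mirrors the standard torch-profiler API and should be checked against `references/REFERENCE_PROFILER_AND_METRICS.md`); the step bookkeeping itself is plain Python:

```python
WAIT, WARMUP, ACTIVE, REPEAT = 0, 5, 5, 1  # fixed by this skill; do not override via CLI

def total_steps(wait=WAIT, warmup=WARMUP, active=ACTIVE, repeat=REPEAT):
    """Total loop steps the schedule consumes: repeat * (wait + warmup + active)."""
    return repeat * (wait + warmup + active)

def run_profiled(step_fn):
    """Drive step_fn for exactly total_steps() iterations.

    In the real benchmark script this loop sits inside:
        with torch_npu.profiler.profile(
            schedule=torch_npu.profiler.schedule(
                wait=WAIT, warmup=WARMUP, active=ACTIVE, repeat=REPEAT),
            on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(trace_dir),
        ) as prof:
    and `prof.step()` is called at the end of every iteration.
    """
    for step in range(total_steps()):
        step_fn(step)   # forward call of the custom operator or the baseline
        # prof.step()   # mandatory: advances the profiler schedule
```

With the fixed values, `total_steps()` is `1 * (0 + 5 + 5) = 10`, of which only the 5 `active` steps contribute to `op_statistic.csv`.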

File Placement (Unified in the Operator test/ Subdirectory)

All the following artifacts are placed in `ascend-kernel/csrc/ops/<op>/test/`:

| Category | Naming Convention (`<op>` e.g., `layer_norm`) |
| --- | --- |
| Test cases (JSONL only) | `<op>_perf_cases.jsonl` (one JSON object per line); do not maintain or generate `<op>_perf_cases.json` |
| Markdown report | `<op>_torch_npu_profiler_report.md` (the only structured result to be saved; do not generate `<op>_torch_npu_profiler_results.json`) |
| Profiler export root directory | `test/profiler_trace/` (or override with `--trace-root`) |

Performance scripts and common modules are placed in the same `test/` directory as the files above.

Complete JSONL Specification for Performance Test Cases

The following is the complete field and type description for test case files; use only `.jsonl` as the test case carrier.

1. File Format

| Format | Description |
| --- | --- |
| JSONL | Each line contains one JSON object with a trailing newline; empty lines are ignored. File extension `.jsonl`. Do not generate or maintain `.json` array files in sync with the test cases in this process. |

2. Top-level Structure of a Single Test Case

Each test case object must contain the key `"inputs"`, whose value is an array.

Layer Norm example (the `build_inputs` convention varies by operator, but the structure must still include the `inputs` array):

```json
{
  "inputs": [
    { "name": "x", "type": "tensor", "required": true, "dtype": "float16", "shape": [8, 128] },
    { "name": "normalized_shape", "type": "attr", "required": true, "dtype": "int", "value": [128] },
    { "name": "use_affine", "type": "attr", "required": false, "dtype": "bool", "value": true },
    { "name": "eps", "type": "attr", "required": false, "dtype": "float", "value": 1e-05 }
  ]
}
```
  • The `name` of each element in `inputs` is unique within the same test case.
  • For other operators: rules for tensors, `tensor_list`, `attr`, integer tensor `range`, etc. are in `references/REFERENCE_JSON_CASE_FORMAT.md`.
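Loading such a file and splitting each case into tensor specs and attribute values is straightforward stdlib Python. This is a sketch of the common-module side (the real `build_inputs` would then materialize the tensor specs on NPU):

```python
import json

def load_cases(path):
    """Parse a .jsonl file: one JSON object per non-empty line."""
    cases = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # empty lines are ignored per the spec
                cases.append(json.loads(line))
    return cases

def split_inputs(case):
    """Separate tensor specs (to be materialized on NPU) from attribute values."""
    tensors, attrs = {}, {}
    for spec in case["inputs"]:
        if spec["type"] == "tensor":
            tensors[spec["name"]] = (spec["dtype"], spec["shape"])
        elif spec["type"] == "attr":
            attrs[spec["name"]] = spec["value"]
    return tensors, attrs
```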

3. Complete JSONL Example (Two Lines, Layer Norm)

```json
{"inputs":[{"name":"x","type":"tensor","required":true,"dtype":"float16","shape":[2,128]},{"name":"normalized_shape","type":"attr","required":true,"dtype":"int","value":[128]},{"name":"use_affine","type":"attr","required":false,"dtype":"bool","value":true},{"name":"eps","type":"attr","required":false,"dtype":"float","value":1e-05}]}
{"inputs":[{"name":"x","type":"tensor","required":true,"dtype":"float16","shape":[4,256]},{"name":"normalized_shape","type":"attr","required":true,"dtype":"int","value":[256]},{"name":"use_affine","type":"attr","required":false,"dtype":"bool","value":false},{"name":"eps","type":"attr","required":false,"dtype":"float","value":1e-05}]}
```

For more complete field descriptions (such as `tensor_list`, `int` tensor `range`, etc.), see `references/REFERENCE_JSON_CASE_FORMAT.md`.

Profiler and Directory Semantics (Summary)

  • Each `with torch_npu.profiler.profile(...)` generates an export directory with the `_ascend_pt` suffix under the handler directory; the CSV path is `…/*_ascend_pt/ASCEND_PROFILER_OUTPUT/op_statistic.csv`.
  • Each test case and each implementation (e.g., `custom` / `baseline`) uses an independent `with`; the recommended subpath is `{trace_root}/{op_trace_tag}/{custom|baseline}/case_XXX/`; clear `case_XXX` before running.

For details, see `references/REFERENCE_PROFILER_AND_METRICS.md`.
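The directory conventions above can be captured in two small helpers. This sketch assumes the export layout just described (one `*_ascend_pt` directory per `with` block); the function names are illustrative:

```python
from pathlib import Path

def case_trace_dir(trace_root, op_trace_tag, impl, case_idx):
    """Per-(case, implementation) handler directory; cleared before each run."""
    return Path(trace_root) / op_trace_tag / impl / f"case_{case_idx:03d}"

def find_op_statistic_csv(handler_dir):
    """Locate op_statistic.csv inside the *_ascend_pt export under handler_dir."""
    matches = list(Path(handler_dir).glob(
        "*_ascend_pt/ASCEND_PROFILER_OUTPUT/op_statistic.csv"))
    if not matches:
        raise FileNotFoundError(f"no op_statistic.csv under {handler_dir}")
    # repeat > 1 can leave several exports; take the most recent by mtime
    # (this selection semantics must then be stated in the document).
    return max(matches, key=lambda p: p.stat().st_mtime)
```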

Metrics (Summary)

  1. For the CSV corresponding to a single `with`: sum the Total Time(us) of each operator row.
  2. Divide by `active × repeat` (when `divisor_mode=active_steps`) or by `active` only (when `active_only`). This skill fixes `active=5`; if `repeat=1`, then `divisor = 5`.
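The two steps above reduce to a few lines of stdlib Python. The header row used in the usage example below is illustrative (real `op_statistic.csv` files carry more columns); the BOM handling addresses the compatibility pitfall listed under Common Mistakes:

```python
import csv
import io

def total_time_us(csv_text):
    """Sum the Total Time(us) column of op_statistic.csv; tolerate a BOM in the header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    stripped = [name.lstrip("\ufeff") for name in reader.fieldnames]
    # Match by substring so minor column-name variations still resolve.
    col = next(name for name in stripped if "Total Time" in name)
    total = 0.0
    for row in reader:
        row = {k.lstrip("\ufeff"): v for k, v in row.items()}
        total += float(row[col])
    return total

def per_step_us(csv_text, active=5, repeat=1, divisor_mode="active_steps"):
    """Per-step latency: total time divided by active*repeat (or active only)."""
    divisor = active * repeat if divisor_mode == "active_steps" else active
    return total_time_us(csv_text) / divisor
```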

Mandatory Structure of Performance Comparison Report (Markdown)

The report format strictly follows `examples/sample_report.md`, with the following structure:

1. Title

```markdown
# Performance Evaluation Results
```

2. Comparison Table (Unified Single Table, Mandatory Dual Path)

All test cases are displayed in the same table, with fixed headers: `Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio`.

Example:

```markdown
## Performance Comparison

| Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio |
| ---- | ----- | ----- | ------------- | -------- | -------------- |
| 0 | [128, 4096] | float16 | 9.75 | 10.10 | 1.036 |
| 1 | [128, 5120] | float16 | 10.52 | 9.39 | 0.893 |
| 2 | [128, 6144] | float16 | 10.99 | 14.36 | 1.307 |
| 3 | [64, 6400] | float16 | 9.13 | 9.49 | 1.040 |
| 4 | [2, 1024, 4096] | float16 | 57.01 | 84.92 | 1.490 |
| 5 | [2, 1024, 6144] | float16 | 73.80 | 139.56 | 1.891 |
| 6 | [1, 2048, 6400] | float16 | 75.60 | 143.09 | 1.893 |
| 7 | [64, 4096] | float32 | 8.45 | 7.14 | 0.846 |
```
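Rendering this table from per-path measurements is a simple formatting step; the sketch below assumes per-step times in microseconds and defines the speedup ratio as baseline over custom (so values above 1 mean the custom operator is faster, matching the report semantics):

```python
def comparison_table(rows):
    """rows: (case_id, shape, dtype, custom_us, baseline_us) tuples -> Markdown table."""
    lines = [
        "| Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio |",
        "| ---- | ----- | ----- | ------------- | -------- | -------------- |",
    ]
    for case_id, shape, dtype, custom_us, baseline_us in rows:
        ratio = baseline_us / custom_us  # >1 means the custom operator is faster
        lines.append(
            f"| {case_id} | {list(shape)} | {dtype} "
            f"| {custom_us:.2f} | {baseline_us:.2f} | {ratio:.3f} |"
        )
    return "\n".join(lines)
```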

3. Full Summary

Use the second-level title `## Full Summary`, which contains a key-value table:

```markdown
## Full Summary

| Metric | Value |
| ---- | -- |
| Number of Test Cases | N |
| Average Speedup Ratio (>1 means custom operator is faster) | X.XXX |
| Custom Operator Better (Ratio>1) | M |
| Baseline Better (Ratio<1) | K |
```

Immediately below, use the third-level title `### Summary by Data Type` to display the summary table grouped by dtype:

```markdown
### Summary by Data Type

| DType | Number of Test Cases | Average Speedup Ratio | Custom Operator Better | Baseline Better |
| ----- | ------ | ------------------- | ------------- | -------- |
| float16 | 7 | 1.364 | 6 | 1 |
| float32 | 1 | 0.846 | 0 | 1 |
```
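The per-dtype aggregation behind that table is a plain group-by over (dtype, ratio) pairs; a minimal sketch:

```python
from collections import defaultdict

def summarize_by_dtype(rows):
    """rows: (dtype, speedup_ratio) pairs
    -> {dtype: (count, avg_ratio, custom_better, baseline_better)}."""
    grouped = defaultdict(list)
    for dtype, ratio in rows:
        grouped[dtype].append(ratio)
    summary = {}
    for dtype, ratios in grouped.items():
        summary[dtype] = (
            len(ratios),
            sum(ratios) / len(ratios),
            sum(1 for r in ratios if r > 1),   # custom operator better
            sum(1 for r in ratios if r < 1),   # baseline better
        )
    return summary
```

Fed the ratios from the example comparison table, this reproduces the example summary rows (7 float16 cases averaging 1.364, one float32 case at 0.846).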

4. Brief Analysis

Use the second-level title `## Brief Analysis` and list ≥3 short conclusions as an unordered list, covering overall trends, differences across dtypes/shape scales, memory access and computation characteristics, etc.

```markdown
## Brief Analysis

- The average speedup ratio is greater than 1, so the custom operator has a slight overall advantage.
- The custom operator's advantage is more pronounced at large shapes, where the vector path is more fully utilized.
- The custom operator is slightly slower than the baseline in the float32 small-shape scenario, likely because kernel launch overhead dominates.
```

Other Conventions

  • Do not write `*_profiler_results.json`, which would duplicate the report; intermediate statistics exist only in script memory and are written to Markdown.

Display Results in Conversation (MANDATORY)

After generating `csrc/ops/<op>/test/<op>_torch_npu_profiler_report.md` (or after updating an existing one in this run), the assistant MUST also complete the following in the current conversation reply; it must not merely output "Report generated" and the path without displaying data:
  1. Paste the key performance content (so users can read the conclusions without opening files; the displayed content must match the report structure):
    • Unified comparison table: header `Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio`, with all dtypes displayed in the same table. Truncate and note "see report for the rest" if there are many cases.
    • Full summary: summary metrics in key-value format (number of test cases, average ratio, number of cases where the custom operator/baseline is better) plus the summary table by data type.
    • Brief analysis: ≥3 unordered-list conclusions covering overall trends, differences across dtypes/shape scales, memory access and computation characteristics, etc.

NEVER: reply with only the report path; NEVER replace displaying the core numbers and conclusions in the conversation with "please open the Markdown file yourself".

Common Mistakes

  • `warmup` / `active` changed to values other than 5, inconsistent with the skill conventions.
  • `torch_npu.profiler` not used, or `prof.step()` inconsistent with the schedule.
  • `repeat>1` may generate multiple `*_ascend_pt` exports; the selection semantics must be explained when choosing the CSV by mtime.
  • The aggregation must stay compatible with Total Time(us) when the CSV header carries a BOM or column names change.
  • Running only the baseline path because the custom operator is not registered; load the custom library before comparing.
  • Designing test cases directly without reading `<op>-test-cases.md`: testcase-gen has already generated a unified test case document; extract shapes and dtypes from it first to avoid duplicate design.
  • Generating test cases directly without reading design.md: shapes end up violating constraints and key execution modes are missed (such as omitting the transpose=True path).
  • Outputting a single-path report on the grounds of "no equivalent baseline interface": a small-operator-combination baseline path must be implemented, and a dual-path comparison table must always be output.
  • Using Python loops for scalar-by-scalar assignment in the small operator combination: the profiler then collects CPU logic instead of NPU operators, distorting the baseline path latency; the baseline implementation should be dominated by tensor operations.
  • The assistant outputs only the report path and does not display the key tables and summary conclusions in the current conversation (violates "Display Results in Conversation").

Reference Implementation (`examples/layer_norm_profiler_reference/`)

This directory is isomorphic to the profiler-related files in `ascend-kernel/csrc/ops/layer_norm/test/`, including:
  • `layer_norm_profiler_common.py`, `benchmark_layer_norm_torch_npu_profiler.py`
  • `layer_norm_perf_cases.jsonl` (JSONL only, no `.json`)
  • `LAYER_NORM_PROFILER_PERF_GUIDE.md`, `README.md`

For new operators, copy this directory in its entirety to `csrc/ops/<op>/test/`, then replace the operator name, forward call, `build_inputs`, and trace subdirectory name. If `layer_norm/test/` is updated in the repository, synchronize the changes back to `examples/layer_norm_profiler_reference/`.

Checklist (Assistant Self-Check)

  • Has read `csrc/ops/<op>/test/<op>-test-cases.md` (if it exists) and extracted SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES, and the operator baseline from it
  • Has read `csrc/ops/<op>/design.md` and extracted dtypes, parameter constraints, typical shapes, and execution modes from it
  • Test cases cover all execution modes described in design.md (such as transpose/non-transpose, input_split mode, etc.)
  • Parameter values in test cases (including attribute values and integer tensor values) are within the constraints of design.md
  • Has confirmed the baseline path type (baseline API or small operator combination) and clearly labeled it in the report header
  • If there is no equivalent baseline interface, has implemented the small-operator-combination baseline path, and the baseline implementation is dominated by tensor operations (not Python scalar loops)
  • Has used `torch_npu.profiler`, and `warmup=5` / `active=5` have not been modified
  • Has generated or updated `<op>_torch_npu_profiler_report.md`, with format consistent with `examples/sample_report.md`: unified comparison table with DType column + full summary + summary by data type + brief analysis
  • Has displayed the unified comparison table with DType column, the full summary, the summary by data type, and ≥3 brief-analysis conclusions in the current conversation, not just the path
  • Has explained the fixed step convention and the metric definitions