ascendc-operator-performance-eval (source: ascend/agent-skills; translated from Chinese)

Maintain JSONL-only profiler performance test cases under `csrc/ops/<op>/test` in ascend-kernel. Collect data using `torch_npu.profiler` (with fixed `warmup=5` and `active=5`), aggregate the Total Time(us) from `ASCEND_PROFILER_OUTPUT/op_statistic.csv`, and output a unified Markdown comparison report (custom operator vs baseline) that includes a DType column. Do not generate `perf_cases.json` or `*_profiler_results.json`. Refer to `examples/layer_norm_profiler_reference/` for the reference implementation.
# Performance Evaluation of AscendC Operators with torch_npu.profiler
## Reference Files in This Skill Directory
When executing this skill, prioritize materials in this directory:
| File / Directory | Purpose |
|---|---|
| `references/REFERENCE_PROFILER_AND_METRICS.md` | Process, directory conventions, complete JSONL test case specifications, report structure, fixed schedule |
| `references/REFERENCE_JSON_CASE_FORMAT.md` | Identical to the "JSONL Specification for Performance Test Cases" section below |
| `examples/sample_report.md` | Report structure template |
| (example JSONL) | Minimal LayerNorm-style JSONL, can be copied and renamed |
| `examples/layer_norm_profiler_reference/` | Layer Norm reference implementation (can be copied in its entirety) |
## Role

In ascend-kernel, establish a reusable process for profiler performance test cases and Markdown reports comparing custom operators vs baselines under `csrc/ops/<operator-name>/`. Data collection must use `torch_npu.profiler`, with fixed `warmup` and `active` values of 5 (see next section). For details, refer to `references/REFERENCE_PROFILER_AND_METRICS.md`.

### Core Principles (Two Mandatory Constraints)
- Comparison report must always be presented: Regardless of whether the baseline path is a baseline API or a combination of small operators, the final report must include a dual-path comparison table of custom operator vs baseline. Single-path reports are not allowed, and the baseline must run on NPU.
- Test case generation must first read `design.md`: Before generating any JSONL test cases, read `csrc/ops/<op>/design.md` in the operator directory and extract parameter constraints, typical shapes, supported dtypes, and key attribute values. Test cases must cover all execution modes described in the design document.
## Test Case Source: Load from the testcase-gen Test Case Document (MANDATORY)
Before generating or modifying any JSONL test cases, MUST first read the test case document generated by testcase-gen:
### Step 0: Read the testcase-gen Test Case Document
`READ csrc/ops/<op>/test/<op>-test-cases.md`

Extract the following from it:
| Extracted Item | Location in Test Case Document | Purpose |
|---|---|---|
| SUPPORTED_DTYPES | §Test Configuration | Coverage of dtypes for JSONL test cases |
| TEST_SHAPES | §Test Configuration | Benchmark for selecting small/medium/large-scale shapes |
| GENERAL_SHAPES | §Test Configuration | Generalized shapes, can be supplemented for performance scenarios |
| NPU Calling Method | §Operator Baseline | Forward call for custom operator |
| CPU Reference Implementation | §Operator Baseline | Reference implementation for baseline path |
### Conversion Rules from testcase-gen Output to JSONL Test Cases
- Select representative shapes from TEST_SHAPES + GENERAL_SHAPES (covering small/medium/large scales), avoid duplicates
- Iterate through all dtypes in SUPPORTED_DTYPES for each shape
- Fill the `inputs` field of the JSONL with attribute values (such as block_size, eps) taken from design.md
- Total number of JSONL test cases ≥ 8
- The NPU calling method and CPU reference implementation in the operator baseline are used to build the custom operator path and baseline path
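The conversion rules above can be sketched as a small generator. This is an illustrative sketch only: the function and parameter names (`build_perf_cases`, the single `"x"` tensor input) are assumptions, not part of the skill's actual scripts.

```python
import json

def build_perf_cases(test_shapes, supported_dtypes, attr_inputs):
    """Cross representative shapes with every supported dtype and emit one
    JSONL line per combination. attr_inputs holds attr entries whose values
    come from design.md constraints (e.g. eps, block_size)."""
    lines = []
    for shape in test_shapes:
        for dtype in supported_dtypes:
            inputs = [{"name": "x", "type": "tensor", "required": True,
                       "dtype": dtype, "shape": list(shape)}] + attr_inputs
            # Compact one-object-per-line encoding, as the JSONL spec requires.
            lines.append(json.dumps({"inputs": inputs}, separators=(",", ":")))
    return lines
```

With four representative shapes and two supported dtypes this yields eight cases, meeting the ≥ 8 minimum.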
If `<op>-test-cases.md` does not exist: fall back to designing test cases entirely from design.md (following the process below), but note "Test cases are self-designed, not generated by testcase-gen" in the report.
## Test Case Generation: Must First Read design.md (Mandatory)
Before generating or modifying any JSONL test cases (whether or not the testcase-gen test case document has been loaded), perform the following steps:
### Step 1: Read the Design Document

`READ csrc/ops/<op>/design.md`

Extract the following from it:
| Extracted Item | Location in design.md | Purpose |
|---|---|---|
| Supported Data Types | §1 "Supported Data Types" | Coverage of dtypes for test cases |
| Parameter Constraints and Value Ranges | Constraint column in §1 "Parameter Description" | Valid range for attribute values (e.g., block_size ≤ 128) |
| Typical Shapes / Input Scales | §2 "Computation Logic" / §3 "Tiling Strategy" | Benchmark for small/medium/large-scale test case shapes |
| Mode Combinations of Key Attributes | §2 "Pseudocode" / §1 "Parameter Description" | Execution paths that need to be covered individually (e.g., do_transpose=True/False, is_input_split=True/False) |
| Performance Key Points | §6 "Performance Optimization" / §3 "Tiling Strategy" | Branches that affect performance (e.g., transpose vs non-transpose using different DMA paths) |
### Step 2: Test Case Design Rules
| Rule | Description |
|---|---|
| Cover all execution modes | When design.md describes multiple execution paths (e.g., transpose/non-transpose, input_split mode), there must be at least one test case for each mode |
| Cover all supported dtypes | At least one set of test cases for each supported data type, using a typical medium-scale shape |
| One set each for small/medium/large-scale shapes | Must cover small scale (kernel launch overhead dominant), medium scale (typical production scenario), and large scale (memory bandwidth dominant) |
| Parameter values from constraint ranges | Attribute values (such as block_size) must be selected from the constraints in design.md, cannot be set arbitrarily |
| Integer/index tensor values must be semantically valid | Specific values of tensors like win_lengths, offsets must comply with operator semantics (e.g., offsets must be valid window start offsets) |
### Step 3: Validate Test Cases
After generating test cases, check:
- All dtypes are covered
- All execution modes (defined by design.md) have corresponding test cases
- Parameter values (including attribute values and integer tensor values) are within the constraints of design.md
- Includes at least one "small shape" test case and one "large shape" test case
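A validation pass over the four checks above might look like this sketch. The `mode` key and the 16× size spread between smallest and largest case are illustrative assumptions, not requirements stated by the skill.

```python
from math import prod

def validate_cases(cases, supported_dtypes, required_modes):
    """cases: list of dicts like {"dtype": ..., "shape": [...], "mode": ...}.
    Raises AssertionError when a coverage rule is violated."""
    assert {c["dtype"] for c in cases} >= set(supported_dtypes), \
        "missing dtype coverage"
    assert {c["mode"] for c in cases} >= set(required_modes), \
        "missing execution-mode coverage"
    numels = sorted(prod(c["shape"]) for c in cases)
    # Rough small-vs-large check: the largest case should be much bigger
    # than the smallest (16x here is an arbitrary illustrative threshold).
    assert numels[-1] >= 16 * numels[0], "need both a small and a large shape"
```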
## Reference Path Decision Tree (Mandatory)
Performance evaluation always requires dual-path comparison (custom operator vs baseline). Determine the baseline path in the following order:
```text
Does the operator have an equivalent baseline API?
├─ Yes (e.g., torch.nn.functional.*, built-in torch_npu operators)
│  └─ Use the baseline API as the baseline path
└─ No (no equivalent baseline interface)
   └─ Must implement a "small operator combination" baseline path ← mandatory requirement of this skill
      └─ Implement using PyTorch basic operator combinations from the "Reference Implementation" or "Pseudocode" section of the design document
```

### Requirements for the Small Operator Combination Baseline Path
When there is no equivalent baseline interface, must:
- Read the reference implementation from design.md: The design document usually contains a PyTorch reference implementation (pseudocode or Python function), use this as the basis to build the baseline path.
- Use PyTorch basic operator combinations: `torch.zeros`, slice assignment, `.permute()`, `torch.cat`, etc. Do not use loops plus Python scalar assignment (otherwise the profiler collects CPU operators instead of NPU operators, making fair comparison impossible); the entire baseline implementation must be dominated by tensor operations and executable on NPU.
- Clearly label in the report: the header of the report must state "No equivalent baseline interface; baseline path is a combination of small operators" and list the basic operators used.
- Comparison table must be presented: Do not degrade to a single-path report due to "no baseline interface", must retain the three columns: "Custom Operator per-step", "Baseline per-step", "Ratio".
NEVER: Output a single-path report or skip the comparison table on the grounds of "no equivalent baseline interface".
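As an illustration of "tensor operations, not scalar loops", here is a LayerNorm-style baseline sketch. NumPy stands in for PyTorch-on-NPU so the sketch runs anywhere; the function name is illustrative and the real baseline would use `torch` tensors on the NPU device.

```python
import numpy as np

def layer_norm_baseline(x, eps=1e-5):
    """Baseline composed of whole-tensor ops (mean / var / broadcast), the
    style required of a small-operator-combination path. No per-element
    Python loops: those would make the profiler record CPU work instead of
    device operators."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```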
## Fixed Profiler Steps (Mandatory)
| Parameter | Value | Description |
|---|---|---|
| `warmup` | 5 | Do not modify to other values via script or CLI |
| `active` | 5 | Do not modify to other values via script or CLI |
| `wait` | Default | Can be retained in the CLI or as a constant; adjust as needed |
| `repeat` | Default | Fixed to 1 for simple scenarios; if `repeat>1`, multiple `*_ascend_pt` exports are generated |
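The fixed schedule arithmetic can be sanity-checked without an NPU. `FakeSchedule` below is a plain-Python stand-in that mimics the schedule accounting of a profiler (with `wait=0` assumed as the default); it is not the real `torch_npu.profiler` API.

```python
WAIT, WARMUP, ACTIVE, REPEAT = 0, 5, 5, 1  # warmup/active fixed at 5 by this skill

class FakeSchedule:
    """Counts how many loop steps fall inside the 'active' window.
    Illustration of the step arithmetic only, not the torch_npu API."""
    def __init__(self, wait, warmup, active, repeat):
        self.wait, self.warmup, self.active, self.repeat = wait, warmup, active, repeat
        self.step_count = 0   # total prof.step() calls seen
        self.recorded = 0     # steps whose data would be collected

    def step(self):
        # Position within one wait -> warmup -> active cycle.
        pos = self.step_count % (self.wait + self.warmup + self.active)
        if pos >= self.wait + self.warmup:
            self.recorded += 1
        self.step_count += 1

sched = FakeSchedule(WAIT, WARMUP, ACTIVE, REPEAT)
total_steps = sched.repeat * (sched.wait + sched.warmup + sched.active)
for _ in range(total_steps):
    # ... run the custom operator / baseline here, then:
    sched.step()
```

With `wait=0`, `warmup=5`, `active=5`, `repeat=1`, the loop runs 10 steps and exactly 5 of them are recorded, which is the divisor used in the Metrics section.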
Must call `prof.step()` at the end of each step; total number of loop steps = `repeat * (wait + warmup + active)`.

## File Placement (Unified in the Operator `test/` Subdirectory)
All of the following products are placed in `ascend-kernel/csrc/ops/<op>/test/`:

| Category | Naming Convention |
|---|---|
| Test cases (JSONL only) | `<op>_perf_cases.jsonl` |
| Markdown report | `<op>_torch_npu_profiler_report.md` |
| Profiler export root directory | |
Performance scripts and common modules are placed in the same `test/` directory as the above files.

## Complete JSONL Specification for Performance Test Cases
The following is the complete field and type description for test case files; use only `.jsonl` as the test case carrier.

### 1. File Format
| Format | Description |
|---|---|
| JSONL | Each line contains one JSON object, with a newline at the end; empty lines are ignored. File extension is `.jsonl` |
It is prohibited to generate or maintain `.json` array files in sync with the test cases in this process.

### 2. Top-level Structure of a Single Test Case
Each test case object must contain the key `"inputs"`, whose value is an array.

Layer Norm example (replace the `build_inputs` convention for different operators, but the structure must still include the `inputs` array):

```json
{
  "inputs": [
    { "name": "x", "type": "tensor", "required": true, "dtype": "float16", "shape": [8, 128] },
    { "name": "normalized_shape", "type": "attr", "required": true, "dtype": "int", "value": [128] },
    { "name": "use_affine", "type": "attr", "required": false, "dtype": "bool", "value": true },
    { "name": "eps", "type": "attr", "required": false, "dtype": "float", "value": 1e-05 }
  ]
}
```

- `name` of each element in `inputs` is unique within the same test case.
- For other operators: rules for `tensor` / `tensor_list` / integer tensor / `attr` / `range`, etc. are in `references/REFERENCE_JSON_CASE_FORMAT.md`.
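A minimal loader for this structure might look like the following sketch (the function name is illustrative; the reference implementation's own loader may differ):

```python
import json

def load_perf_cases(path):
    """Parse a JSONL perf-case file into (tensor_specs, attrs) pairs,
    ignoring empty lines as the file-format spec allows."""
    cases = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # empty lines are ignored
            obj = json.loads(line)
            tensors = [i for i in obj["inputs"] if i["type"] == "tensor"]
            attrs = {i["name"]: i["value"]
                     for i in obj["inputs"] if i["type"] == "attr"}
            cases.append((tensors, attrs))
    return cases
```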
### 3. Complete JSONL Example (Two Lines, Layer Norm)

```json
{"inputs":[{"name":"x","type":"tensor","required":true,"dtype":"float16","shape":[2,128]},{"name":"normalized_shape","type":"attr","required":true,"dtype":"int","value":[128]},{"name":"use_affine","type":"attr","required":false,"dtype":"bool","value":true},{"name":"eps","type":"attr","required":false,"dtype":"float","value":1e-05}]}
{"inputs":[{"name":"x","type":"tensor","required":true,"dtype":"float16","shape":[4,256]},{"name":"normalized_shape","type":"attr","required":true,"dtype":"int","value":[256]},{"name":"use_affine","type":"attr","required":false,"dtype":"bool","value":false},{"name":"eps","type":"attr","required":false,"dtype":"float","value":1e-05}]}
```

For more complete field descriptions (such as `tensor_list`, tensor `int` / `range`, etc.), see `references/REFERENCE_JSON_CASE_FORMAT.md`.

## Profiler and Directory Semantics (Summary)
- Each `with torch_npu.profiler.profile(...)` generates an export directory with the suffix `_ascend_pt` under the handler directory; the CSV path is `…/*_ascend_pt/ASCEND_PROFILER_OUTPUT/op_statistic.csv`.
- Each test case and each implementation (e.g., `custom` / `baseline`) uses an independent `with`; the recommended subpath is `{trace_root}/{op_trace_tag}/{custom|baseline}/case_XXX/`; clear `case_XXX` before running.
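The per-case directory convention, including the "clear `case_XXX` before running" rule, can be sketched as follows (the helper name is illustrative):

```python
import os
import shutil

def case_trace_dir(trace_root, op_trace_tag, impl, case_idx):
    """Build {trace_root}/{op_trace_tag}/{custom|baseline}/case_XXX/ and
    clear any stale export from a previous run before the profiler writes."""
    d = os.path.join(trace_root, op_trace_tag, impl, f"case_{case_idx:03d}")
    shutil.rmtree(d, ignore_errors=True)  # remove stale *_ascend_pt exports
    os.makedirs(d)
    return d
```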
For details, see `references/REFERENCE_PROFILER_AND_METRICS.md`.

## Metrics (Summary)
- For the CSV corresponding to a single `with`: sum the Total Time(us) of each operator row.
- Divide by `active×repeat` (when `divisor_mode=active_steps`) or by `active` only (when `active_only`). This skill fixes `active=5`; if `repeat=1`, then `divisor = 5`.
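The aggregation can be sketched as below. The BOM stripping mirrors the compatibility note under Common Mistakes; the function name and the CSV columns beyond `Total Time(us)` are illustrative.

```python
import csv
import io

DIVISOR = 5  # fixed: active=5, repeat=1

def per_step_us(csv_text):
    """Sum Total Time(us) across operator rows of an op_statistic.csv
    and divide by the fixed divisor to get per-step latency in us."""
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0.0
    for row in reader:
        # Header cells may carry a BOM or stray whitespace; normalize keys
        # before looking up the Total Time(us) column.
        cleaned = {k.strip().lstrip("\ufeff"): v for k, v in row.items()}
        total += float(cleaned["Total Time(us)"])
    return total / DIVISOR
```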
## Mandatory Structure of the Performance Comparison Report (Markdown)

The report format strictly follows `examples/sample_report.md`, with the following structure:

### 1. Title

```markdown
# Performance Evaluation Results
```

### 2. Comparison Table (Unified Single Table, Mandatory Dual Path)

All test cases are displayed in the same table, with fixed headers: `Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio`.

Example:

```markdown
## Performance Comparison
| Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio |
| ---- | ----- | ----- | ------------- | -------- | -------------- |
| 0 | [128, 4096] | float16 | 9.75 | 10.10 | 1.036 |
| 1 | [128, 5120] | float16 | 10.52 | 9.39 | 0.893 |
| 2 | [128, 6144] | float16 | 10.99 | 14.36 | 1.307 |
| 3 | [64, 6400] | float16 | 9.13 | 9.49 | 1.040 |
| 4 | [2, 1024, 4096] | float16 | 57.01 | 84.92 | 1.490 |
| 5 | [2, 1024, 6144] | float16 | 73.80 | 139.56 | 1.891 |
| 6 | [1, 2048, 6400] | float16 | 75.60 | 143.09 | 1.893 |
| 7 | [64, 4096] | float32 | 8.45 | 7.14 | 0.846 |
```

### 3. Full Summary
Use the second-level title `## Full Summary`, which contains a key-value table:

```markdown
## Full Summary
| Metric | Value |
| ---- | -- |
| Number of Test Cases | N |
| Average Speedup Ratio (>1 means custom operator is faster) | X.XXX |
| Custom Operator Better (Ratio>1) | M |
| Baseline Better (Ratio<1) | K |
```

Immediately below, use the third-level title `### Summary by Data Type` to display the summary table grouped by dtype:

```markdown
### Summary by Data Type
| DType | Number of Test Cases | Average Speedup Ratio | Custom Operator Better | Baseline Better |
| ----- | ------ | ------------------- | ------------- | -------- |
| float16 | 7 | 1.364 | 6 | 1 |
| float32 | 1 | 0.846 | 0 | 1 |
```

### 4. Brief Analysis
Use the second-level title `## Brief Analysis` and list at least three short conclusions in unordered-list format, covering overall trends, differences between dtypes/shape scales, memory access and computation characteristics, etc.

```markdown
## Brief Analysis
- The average speedup ratio is greater than 1, so the custom operator has a slight overall advantage.
- The custom operator has a more obvious advantage in large-shape scenarios, as the vector path is more fully utilized.
- The custom operator is slightly inferior to the baseline in the float32 small-shape scenario, which may be related to the high proportion of kernel launch overhead.
```

## Other Conventions
- Do not write `*_profiler_results.json`, which would duplicate the report; intermediate statistics exist only in script memory and are written to Markdown.
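For example, the per-dtype summary can be computed in memory and rendered straight to Markdown. The tuple layout and function name below are assumptions for illustration; the ratio convention (baseline ÷ custom, >1 means the custom operator is faster) follows the report tables above.

```python
from collections import defaultdict

def summarize_by_dtype(cases):
    """cases: list of (case_id, shape, dtype, custom_us, baseline_us).
    Returns {dtype: (count, avg_ratio, custom_better, baseline_better)},
    i.e. the rows of the 'Summary by Data Type' table."""
    groups = defaultdict(list)
    for _, _, dtype, custom_us, baseline_us in cases:
        groups[dtype].append(baseline_us / custom_us)  # >1: custom op faster
    summary = {}
    for dtype, ratios in groups.items():
        avg = sum(ratios) / len(ratios)
        summary[dtype] = (len(ratios),
                          round(avg, 3),
                          sum(r > 1 for r in ratios),
                          sum(r < 1 for r in ratios))
    return summary
```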
## Display Results in Conversation (MANDATORY)
After generating `csrc/ops/<op>/test/<op>_torch_npu_profiler_report.md` (or if it already exists and has been updated in this run), the assistant MUST complete the following in the current conversation reply; it must not only output "Report generated" plus the path without displaying data.

Paste the key performance content (so users can read the conclusions without opening files; the displayed content is consistent with the report structure):
- Unified comparison table: header `Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio`, with all dtypes displayed in the same table. Truncate and note "See report for the rest" if there are many cases.
- Full summary: summary metrics in key-value format (number of test cases, average ratio, number of cases where the custom operator/baseline is better) and the summary table by data type.
- Brief analysis: at least three unordered-list conclusions covering overall trends, differences between dtypes/shape scales, memory access and computation characteristics, etc.
NEVER: Only reply with the report path; NEVER replace displaying core numbers and conclusions in the conversation with "Please open the Markdown file yourself".
## Common Mistakes
- `warmup` / `active` changed to non-5 values, inconsistent with the skill conventions.
- `torch_npu.profiler` not used, or `prof.step()` inconsistent with the schedule.
- `repeat>1` may generate multiple `*_ascend_pt` exports; the selection semantics must be explained when picking the CSV by mtime.
- Not staying compatible with Total Time(us) when the CSV header carries a BOM or column names change.
- Only the baseline path runs because the custom operator is not registered; load the custom operator library before comparison.
- Designing test cases directly without reading `<op>-test-cases.md`: testcase-gen has already generated a unified test case document; shapes and dtypes should first be extracted from it to avoid duplicated design.
- Generating test cases directly without reading design.md: results in shapes that violate constraints and missing coverage of key execution modes (such as omitting the transpose=True path).
- Outputting a single-path report on the grounds of "no equivalent baseline interface": Must implement a small operator combination baseline path, always output a dual-path comparison table.
- Using Python loops for scalar-by-scalar assignment in small operator combination: Profiler collects CPU logic instead of NPU operators, leading to distorted baseline path latency; baseline implementation should be dominated by tensor operations.
- Assistant only outputs the report path, does not display key tables and summary conclusions in the current conversation (violates "Display Results in Conversation").
## Reference Implementation (`examples/layer_norm_profiler_reference/`)

`examples/layer_norm_profiler_reference/` is isomorphic to the profiler-related files in `ascend-kernel/csrc/ops/layer_norm/test/`, including:

- `layer_norm_profiler_common.py`, `benchmark_layer_norm_torch_npu_profiler.py`
- `layer_norm_perf_cases.jsonl` (JSONL only, no `.json`)
- `LAYER_NORM_PROFILER_PERF_GUIDE.md`, `README.md`

For new operators, copy this directory in its entirety to `csrc/ops/<op>/test/`, then replace the operator name, forward call, `build_inputs`, and trace subdirectory name. If there are updates in `layer_norm/test/` in the repository, synchronize them back to `examples/layer_norm_profiler_reference/`.

## Checklist (Assistant Self-Check)
- Has read `csrc/ops/<op>/test/<op>-test-cases.md` (if it exists) and extracted SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES, and the operator baseline from it
- Has read `csrc/ops/<op>/design.md` and extracted dtypes, parameter constraints, typical shapes, and execution modes from it
- Test cases cover all execution modes described in design.md (such as transpose/non-transpose, input_split mode, etc.)
- Parameter values (including attribute values and integer tensor values) in test cases are within the constraints of design.md
- Has confirmed the baseline path type (baseline API or small operator combination) and clearly labeled it in the report header
- If there is no equivalent baseline interface, has implemented the small operator combination baseline path, and the baseline implementation is dominated by tensor operations (not Python scalar loops)
- Has used `torch_npu.profiler`, and `warmup=5` / `active=5` have not been modified
- Has generated or updated `<op>_torch_npu_profiler_report.md`, with format consistent with `examples/sample_report.md`: includes the unified comparison table with a DType column, the full summary, the summary by data type, and the brief analysis
- Has displayed the unified comparison table with DType column, the full summary, the summary by data type, and at least three brief-analysis conclusions in the current conversation, not just attached the path
- Has explained the fixed step convention and the metric caliber