| File / Directory | Purpose |
|---|---|
| | Process, directory conventions, complete JSONL test case specifications, report structure, fixed schedule |
| | Identical to the "JSONL Specification for Performance Test Cases" section below |
| | |
| | Minimal LayerNorm-style JSONL, can be copied and renamed |
| | Layer Norm reference implementation ( |
`csrc/ops/<operator-name>/`, `torch_npu.profiler`, `warmup`, `active`, `references/REFERENCE_PROFILER_AND_METRICS.md`, `design.md`, `csrc/ops/<op>/design.md`

`READ csrc/ops/<op>/test/<op>-test-cases.md`

| Extracted Item | Location in Test Case Document | Purpose |
|---|---|---|
| SUPPORTED_DTYPES | §Test Configuration | dtype coverage for the JSONL test cases |
| TEST_SHAPES | §Test Configuration | Basis for selecting small/medium/large-scale shapes |
| GENERAL_SHAPES | §Test Configuration | Generalized shapes; can supplement the performance scenarios |
| NPU calling method | §Operator Baseline | Forward call of the custom operator |
| CPU reference implementation | §Operator Baseline | Reference implementation for the baseline path |
If `<op>-test-cases.md` does not exist: fall back to designing the test cases entirely from `design.md` (following the process below), and note in the report that "Test cases are self-designed, not generated by testcase-gen".
`READ csrc/ops/<op>/design.md`

| Extracted Item | Location in design.md | Purpose |
|---|---|---|
| Supported data types | §1 "Supported Data Types" | dtype coverage for the test cases |
| Parameter constraints and value ranges | Constraint column in §1 "Parameter Description" | Valid range for attribute values (e.g., block_size ≤ 128) |
| Typical shapes / input scales | §2 "Computation Logic" / §3 "Tiling Strategy" | Basis for the small/medium/large-scale test case shapes |
| Mode combinations of key attributes | §2 "Pseudocode" / §1 "Parameter Description" | Execution paths that each need their own coverage (e.g., do_transpose=True/False, is_input_split=True/False) |
| Performance key points | §6 "Performance Optimization" / §3 "Tiling Strategy" | Branches that affect performance (e.g., transpose and non-transpose take different DMA paths) |
| Rule | Description |
|---|---|
| Cover all execution modes | When design.md describes multiple execution paths (e.g., transpose/non-transpose, input_split mode), each mode must have at least one test case |
| Cover all supported dtypes | Each supported data type needs at least one set of test cases, using a typical medium-scale shape |
| One set each for small/medium/large-scale shapes | Cover small scale (kernel launch overhead dominates), medium scale (typical production scenario), and large scale (memory bandwidth dominates) |
| Parameter values come from the constraint ranges | Attribute values (such as block_size) must be chosen from the constraints in design.md and must not be set arbitrarily |
| Integer/index tensor values must be semantically valid | The concrete values of tensors such as win_lengths and offsets must satisfy the operator semantics (e.g., offsets must be valid window start offsets) |
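
The coverage rules above can be met mechanically by enumerating the case matrix. The following is a minimal sketch, not the skill's own generator: `make_case`, `generate_cases`, and `to_jsonl` are hypothetical names, the dtypes and shapes are illustrative placeholders (real values come from `design.md` or `<op>-test-cases.md`), and `use_affine` merely stands in for an execution-mode attribute. It emits cases in the LayerNorm-style `inputs` layout that this document's JSONL specification uses.

```python
import itertools
import json

# Illustrative values only; real ones come from design.md / <op>-test-cases.md.
DTYPES = ["float16", "float32"]
SHAPES = {"small": [2, 128], "medium": [128, 4096], "large": [2, 1024, 6144]}
MODES = [True, False]  # stands in for an execution-mode attribute (here: use_affine)


def make_case(dtype, shape, use_affine, eps=1e-5):
    """Build one case object in the LayerNorm-style `inputs` layout."""
    return {
        "inputs": [
            {"name": "x", "type": "tensor", "required": True,
             "dtype": dtype, "shape": list(shape)},
            {"name": "normalized_shape", "type": "attr", "required": True,
             "dtype": "int", "value": [shape[-1]]},
            {"name": "use_affine", "type": "attr", "required": False,
             "dtype": "bool", "value": use_affine},
            {"name": "eps", "type": "attr", "required": False,
             "dtype": "float", "value": eps},
        ]
    }


def generate_cases(dtypes=DTYPES, shapes=SHAPES, modes=MODES):
    """Full cross product: every dtype, every scale, and every mode is covered."""
    combos = itertools.product(dtypes, shapes.values(), modes)
    return [make_case(dtype, shape, mode) for dtype, shape, mode in combos]


def to_jsonl(cases):
    """Serialize one compact JSON object per line, each newline-terminated."""
    return "".join(json.dumps(c, separators=(",", ":")) + "\n" for c in cases)
```

A full cross product is the simplest way to satisfy every rule at once; for operators with many modes it may be worth pruning to one medium-scale case per dtype plus one case per mode.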
    Does the operator have an equivalent baseline API?
    ├─ Yes (e.g., torch.nn.functional.*, built-in torch_npu operators)
    │  └─ Use the baseline API as the baseline path
    └─ No (no equivalent baseline interface)
       └─ Must implement a "small operator combination" baseline path ← Mandatory requirement of this skill
          └─ Implement using PyTorch basic operator combinations from the "Reference Implementation" or "Pseudocode" section of the design document

`torch.zeros`, `.permute()`, `torch.cat`
| Parameter | Value | Description |
|---|---|---|
| | 5 | Must not be changed to another value by the script or CLI |
| | 5 | Must not be changed to another value by the script or CLI |
| | Default | May be kept as a CLI option or a constant, as needed |
| | Default | Fixed to 1 for simple scenarios; if |
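
To make the fixed schedule concrete, here is a small sketch of the step arithmetic. It assumes the two rows pinned to 5 are `warmup` and `active` (the document elsewhere mentions `warmup=5` and `active=5`), that the remaining two parameters are `wait` and `repeat`, and that `wait` defaults to 0; the `Schedule` class itself is hypothetical and does not call `torch_npu`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    """Hypothetical holder for the profiler schedule parameters."""
    wait: int = 0     # assumed default; the source does not state it
    warmup: int = 5   # fixed by the table above
    active: int = 5   # fixed by the table above
    repeat: int = 1   # 1 for simple scenarios

    def __post_init__(self):
        # Reject any attempt to override the two fixed values.
        if self.warmup != 5 or self.active != 5:
            raise ValueError("warmup and active are fixed at 5")

    def total_steps(self) -> int:
        """Number of prof.step() calls: repeat * (wait + warmup + active)."""
        return self.repeat * (self.wait + self.warmup + self.active)
```

With the defaults, `Schedule().total_steps()` is 10, so the benchmark loop would call `prof.step()` ten times per case.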
`prof.step()`, `repeat * (wait + warmup + active)`

`test/`, `ascend-kernel/csrc/ops/<op>/test/`

| Category | Naming Convention ( |
|---|---|
| Test cases (JSONL only) | |
| Markdown report | |
| Profiler export root directory | |
`test/`, `.jsonl`

| Format | Description |
|---|---|
| JSONL | Each line contains one JSON object, with a newline at the end; empty lines are ignored. File extension |
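
The format rule above (one JSON object per line, empty lines ignored) can be read with a few lines of standard-library Python. This is a sketch only: `load_jsonl_cases` is a hypothetical helper, and the check for an `inputs` key assumes every case object carries the `inputs` list this document describes.

```python
import json


def load_jsonl_cases(text):
    """Parse JSONL text: one JSON object per line, blank lines skipped."""
    cases = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # empty lines are ignored by the format
        obj = json.loads(line)
        if not isinstance(obj, dict) or "inputs" not in obj:
            raise ValueError(f"line {lineno}: expected an object with an 'inputs' key")
        cases.append(obj)
    return cases
```

Taking the file content as a string keeps the helper easy to test; a caller would typically pass `Path(case_file).read_text(encoding="utf-8")`.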
`.json`, `"inputs"`, `build_inputs`, `inputs`

    {
      "inputs": [
        { "name": "x", "type": "tensor", "required": true, "dtype": "float16", "shape": [8, 128] },
        { "name": "normalized_shape", "type": "attr", "required": true, "dtype": "int", "value": [128] },
        { "name": "use_affine", "type": "attr", "required": false, "dtype": "bool", "value": true },
        { "name": "eps", "type": "attr", "required": false, "dtype": "float", "value": 1e-05 }
      ]
    }

`inputs`, `name`, `tensor_list`, `attr`, `range`, `references/REFERENCE_JSON_CASE_FORMAT.md`

    {"inputs":[{"name":"x","type":"tensor","required":true,"dtype":"float16","shape":[2,128]},{"name":"normalized_shape","type":"attr","required":true,"dtype":"int","value":[128]},{"name":"use_affine","type":"attr","required":false,"dtype":"bool","value":true},{"name":"eps","type":"attr","required":false,"dtype":"float","value":1e-05}]}
    {"inputs":[{"name":"x","type":"tensor","required":true,"dtype":"float16","shape":[4,256]},{"name":"normalized_shape","type":"attr","required":true,"dtype":"int","value":[256]},{"name":"use_affine","type":"attr","required":false,"dtype":"bool","value":false},{"name":"eps","type":"attr","required":false,"dtype":"float","value":1e-05}]}

`tensor_list`, `int`, `range`, `references/REFERENCE_JSON_CASE_FORMAT.md`

`with torch_npu.profiler.profile(...)`, `_ascend_pt`, `…/*_ascend_pt/ASCEND_PROFILER_OUTPUT/op_statistic.csv`, `custom`, `baseline`, `with`, `{trace_root}/{op_trace_tag}/{custom|baseline}/case_XXX/`, `case_XXX`, `references/REFERENCE_PROFILER_AND_METRICS.md`

`with`, `active×repeat`, `divisor_mode=active_steps`, `active`, `active_only`, `active=5`, `repeat=1`, `divisor = 5`

`examples/sample_report.md`

`Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio`

| Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio |
|---|---|---|---|---|---|
| 0 | [128, 4096] | float16 | 9.75 | 10.10 | 1.036 |
| 1 | [128, 5120] | float16 | 10.52 | 9.39 | 0.893 |
| 2 | [128, 6144] | float16 | 10.99 | 14.36 | 1.307 |
| 3 | [64, 6400] | float16 | 9.13 | 9.49 | 1.040 |
| 4 | [2, 1024, 4096] | float16 | 57.01 | 84.92 | 1.490 |
| 5 | [2, 1024, 6144] | float16 | 73.80 | 139.56 | 1.891 |
| 6 | [1, 2048, 6400] | float16 | 75.60 | 143.09 | 1.893 |
| 7 | [64, 4096] | float32 | 8.45 | 7.14 | 0.846 |
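
The speedup column in the table is consistent with baseline time divided by custom-operator time (for case 0, 10.10 / 9.75 ≈ 1.036). A hedged sketch of that arithmetic follows; the function names are hypothetical, and `per_step_us` assumes the profiler total spans all active steps, so it divides by `active × repeat` (5 with `active=5`, `repeat=1`).

```python
def per_step_us(total_us, active=5, repeat=1):
    """Average one step's time by dividing a profiler total by active x repeat."""
    return total_us / (active * repeat)


def speedup(custom_us, baseline_us):
    """Speedup ratio = baseline / custom; values above 1 favor the custom operator."""
    if custom_us <= 0:
        raise ValueError("custom_us must be positive")
    return baseline_us / custom_us


def report_row(case, shape, dtype, custom_us, baseline_us):
    """Format one markdown row in the report's column order."""
    ratio = speedup(custom_us, baseline_us)
    return (f"| {case} | {shape} | {dtype} | {custom_us:.2f} | "
            f"{baseline_us:.2f} | {ratio:.3f} |")
```

For example, `report_row(0, [128, 4096], "float16", 9.75, 10.10)` reproduces the first row of the sample table.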
`## Full Summary`

```markdown
| Metric | Value |
|---|---|
| Number of Test Cases | N |
| Average Speedup Ratio (>1 means custom operator is faster) | X.XXX |
| Custom Operator Better (Ratio>1) | M |
| Baseline Better (Ratio<1) | K |
```

Immediately below it, use the third-level heading `### Summary by Data Type` to present the summary table grouped by dtype:

```markdown
| DType | Number of Test Cases | Average Speedup Ratio | Custom Operator Better | Baseline Better |
|---|---|---|---|---|
| float16 | 7 | 1.364 | 6 | 1 |
| float32 | 1 | 0.846 | 0 | 1 |
```
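
The per-dtype summary can be derived directly from the per-case speedup ratios. The sketch below is one way to do it, with a hypothetical `summarize` helper; it assumes "average" means the arithmetic mean, which reproduces the 1.364 shown for float16 in the sample.

```python
from collections import defaultdict


def summarize(rows):
    """Aggregate (dtype, speedup) pairs into the per-dtype summary columns."""
    groups = defaultdict(list)
    for dtype, ratio in rows:
        groups[dtype].append(ratio)
    summary = {}
    for dtype, ratios in groups.items():
        summary[dtype] = {
            "cases": len(ratios),
            "mean_speedup": round(sum(ratios) / len(ratios), 3),
            "custom_better": sum(r > 1 for r in ratios),    # ratio > 1
            "baseline_better": sum(r < 1 for r in ratios),  # ratio < 1
        }
    return summary
```

The overall summary table is the same computation with every row placed in a single group.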
`## Brief Analysis`

`*_profiler_results.json`

`csrc/ops/<op>/test/<op>_torch_npu_profiler_report.md`, `Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio`

`warmup`, `active`, `torch_npu.profiler`, `prof.step()`, `repeat>1`, `*_ascend_pt`, `<op>-test-cases.md`

`examples/layer_norm_profiler_reference/`

`ascend-kernel/csrc/ops/layer_norm/test/`, `layer_norm_profiler_common.py`, `benchmark_layer_norm_torch_npu_profiler.py`, `layer_norm_perf_cases.jsonl`, `.json`, `LAYER_NORM_PROFILER_PERF_GUIDE.md`, `README.md`, `csrc/ops/<op>/test/`, `build_inputs`, `layer_norm/test/`, `examples/layer_norm_profiler_reference/`

`csrc/ops/<op>/test/<op>-test-cases.md`, `csrc/ops/<op>/design.md`, `torch_npu.profiler`, `warmup=5`, `active=5`, `<op>_torch_npu_profiler_report.md`, `examples/sample_report.md`