ascendc-operator-dev
AscendC Operator End-to-End Development Orchestration
Skill Type: Process-oriented (seven-stage workflow with serial orchestration of sub-skills)
This skill orchestrates seven sub-skills to drive ascend-kernel operators from scratch to production-ready.
Core Principles
- Seven-stage Serial Execution: Project Initialization → Design Documentation → Test Case Generation → Code Generation & Testing → Interface Documentation → Precision Evaluation → Performance Benchmarking, executed in strict order
- Sub-skill Execution: Each stage MUST call the corresponding sub-skill, no self-implementation allowed
- Stage Gating: Proceed to the next stage only after all checkpoints of the previous stage are passed
- Design-driven Coding: Code generation depends on the Tiling strategy and UB allocation table in the design document
- Automated Design: No need for users to provide pre-prepared design documents; the design stage generates them automatically
- Unified Test Case Generation: Generate test case documents immediately after design completion for reuse in subsequent precision evaluation and performance benchmarking
- Documentation Closure: After passing compilation and testing, MUST generate Chinese interface documents in PyTorch style and display them in the chat interface
- Precision Closure: Operators must pass ≥30 comprehensive precision evaluation cases to be considered complete
- Performance Closure: Operators must pass msprof performance comparison and benchmarking, with a performance report output
- Result Visualization: Results of Phase 4/5/6/7 MUST be directly displayed in the chat interface in Markdown format, do not only output paths
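The serial-execution and stage-gating principles above can be sketched as a small driver loop. This is only an illustration under stated assumptions, not the skill's actual implementation: `run_pipeline` and the `(name, action, checkpoints)` tuple layout are hypothetical names chosen for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical stage record: (name, action, list of checkpoint predicates).
Stage = Tuple[str, Callable[[], None], List[Callable[[], bool]]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in strict order and stop at the first failed gate.

    Returns the names of the stages whose checkpoints all passed.
    """
    completed: List[str] = []
    for name, action, checkpoints in stages:
        action()
        # Stage gate: every checkpoint must pass before the next stage starts.
        if not all(check() for check in checkpoints):
            break
        completed.append(name)
    return completed
```

A stage whose checkpoints fail still runs its own action, but no later stage is started.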
Available Sub-skill List
| Skill | Path | Responsibility |
|---|---|---|
| `ascendc-operator-project-init` | | Detect/create ascend-kernel project, generate operator skeleton directory |
| `ascendc-operator-design` | | Analyze operator requirements, generate design document (including Tiling strategy, UB allocation table) |
| `ascendc-operator-testcase-gen` | | Generate unified test case document based on design document for reuse in precision evaluation and performance benchmarking |
| `ascendc-operator-code-gen` | | Generate op_host/op_kernel code based on design document, framework adaptation, compilation testing |
| `ascendc-operator-compile-debug` | | Compile, install whl package, generate test files, run precision tests (called internally by code-gen) |
| `ascendc-operator-doc-gen` | | Extract interface information from source code, generate Chinese API documents in PyTorch style (mandatory stage) |
| `ascendc-operator-precision-eval` | | Generate ≥30 precision test cases, run them and output precision verification report (mandatory stage) |
| `ascendc-operator-performance-eval` | | Use msprof to compare performance between project operators and native operators, output performance benchmarking report (mandatory stage) |
Workflow Overview
Phase 1 ──▶ Phase 2 ──▶ Phase 3 ──▶ Phase 4 ──▶ Phase 5 ──▶ Phase 6 ──▶ Phase 7
| Phase | Stage | Sub-skill |
|---|---|---|
| Phase 1 | Project Init | project-init |
| Phase 2 | Design Doc | design |
| Phase 3 | Test Case Gen | testcase-gen |
| Phase 4 | Code Gen + Framework Adaptation + Compile Test | code-gen → compile-debug |
| Phase 5 | Interface Doc | doc-gen |
| Phase 6 | Precision Eval Report | precision-eval |
| Phase 7 | Performance Benchmark Report | performance-eval |
Input: Operator Name + Function Description
Output: Production-ready Operator + Test Case Doc + Interface Doc + Precision Report + Performance Report
Anti-pattern List (NEVER DO THESE)
- ❌ Do not skip the design stage and directly write code
- ❌ Do not skip the test case generation stage; Phase 3 (testcase-gen) must be executed after Phase 2 is passed
- ❌ Do not implement any operator code yourself; you must call the sub-skills
- ❌ Do not modify framework files (ops.h / register.cpp / CMakeLists.txt) before code generation
- ❌ Do not compile or test manually; the compile-debug skill handles both
- ❌ Do not reference non-existent skills
- ❌ Do not skip checkpoint verification
- ❌ Do not skip the interface documentation stage; Phase 5 must be executed after Phase 4 is passed
- ❌ Do not skip the precision evaluation stage; Phase 6 must be executed after Phase 5 is passed
- ❌ Do not skip the performance benchmarking stage; Phase 7 must be executed after Phase 6 is passed
- ❌ Do not base performance conclusions on any timing method other than msprof
- ❌ Do not design your own test cases for precision evaluation or performance benchmarking; first read the test case document generated by testcase-gen
Phase 0: Requirements Collection
Goal: Confirm the minimum information set required for operator development, including development environment and operator requirements
Step 0.1: Environment Confirmation (MUST be completed before any development action)
The development environment is a prerequisite for all subsequent stages and must be confirmed first.
CANN Environment
Automatic detection process:
- Check whether the environment variable `ASCEND_HOME_PATH` is set (`echo $ASCEND_HOME_PATH`)
- If set: use it directly as `CANN_PATH` without asking the user
- If not set: MUST ask the user for the CANN installation path (e.g., `/usr/local/Ascend/ascend-toolkit`)
Activation method: `source ${CANN_PATH}/*/set_env.sh`
This activation command must be executed first in every shell session that compiles or runs operators.
Conda Environment
Automatic detection process:
- Check whether a conda environment is currently activated (`echo $CONDA_DEFAULT_ENV`)
- If activated (value is non-empty and not `base`): use the current environment directly without asking the user
- If not activated or the value is `base`: MUST ask the user for the name of the conda environment to use
Activation method: `conda activate <env_name>`
The conda environment must be activated first in every shell session that compiles or runs operators.
Environment Confirmation Checkpoints
- CANN path is confirmed (auto-detected or provided by user)
- `source ${CANN_PATH}/*/set_env.sh` executes normally
- Conda environment name is confirmed (auto-detected or provided by user)
- `conda activate <env_name>` executes normally
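The two auto-detection rules for Step 0.1 can be sketched as follows. This is a minimal sketch, not part of the skill itself: the function names `detect_cann_path` and `detect_conda_env` are hypothetical, and a return value of `None` stands for "ask the user".

```python
import os
from typing import Mapping, Optional


def detect_cann_path(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return ASCEND_HOME_PATH if it is set and non-empty, else None."""
    return env.get("ASCEND_HOME_PATH") or None


def detect_conda_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the active conda environment unless it is empty or 'base'."""
    name = env.get("CONDA_DEFAULT_ENV", "")
    return name if name and name != "base" else None
```

Passing the environment as a mapping keeps the logic testable without touching the real shell environment.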
Step 0.2: Operator Requirements Collection
Mandatory Information to Confirm
| Information | Format Requirement | Mandatory | Description |
|---|---|---|---|
| CANN Environment Path | Absolute path | Yes | Auto-detect |
| Conda Environment Name | String | Yes | Auto-detect |
| Operator Name | snake_case | Yes | e.g., |
| Function Description | Text/Mathematical Formula | Yes | e.g., "Inverse hyperbolic cosine acosh(x) = ln(x + sqrt(x²-1))" |
Optional Information (with default values):
| Information | Default Value | Description |
|---|---|---|
| Supported Data Types | float16, float32 | Can be extended to bfloat16 |
| SoC Platform | ascend910b | Auto-obtained via platform API |
Decision Tree
| User Request | Handling Method |
|---|---|
| "Generate X operator" / "Develop X operator" | Complete environment confirmation (Step 0.1) first, then infer the function from the operator name, and execute the full process directly after confirmation |
| "Help me develop a new operator" (no specific name) | Complete environment confirmation (Step 0.1) first, then ask for the operator name and function description |
| "Continue operator development" | Complete environment confirmation (Step 0.1) first, then check existing files to determine the stage and resume from the interrupted point |
Acceptance Criteria
- CANN environment path is confirmed and can be activated
- Conda environment name is confirmed and can be activated
- Operator name is confirmed (snake_case format)
- Function description is clear (including mathematical formula or calculation logic)
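The snake_case criterion for operator names can be checked mechanically. The document does not define snake_case precisely, so the sketch below assumes one common definition (lowercase letters and digits, words joined by single underscores, starting with a letter); `is_snake_case` is a hypothetical helper name.

```python
import re

# Assumed definition: lowercase alphanumeric words joined by single
# underscores, with the first character a letter.
_SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_snake_case(name: str) -> bool:
    """Return True if the whole name matches the assumed snake_case pattern."""
    return _SNAKE_CASE.fullmatch(name) is not None
```

Under this definition `acosh` and `layer_norm` are accepted, while `LayerNorm` and `layer__norm` are rejected.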
Phase 1: Project Initialization
Called Skill: `ascendc-operator-project-init`
Execution Content
MANDATORY: Execute according to the ascendc-operator-project-init skill process:
1. Detect whether the ascend-kernel project exists
2. If it does not exist, copy it from the template
3. Create the operator skeleton under csrc/ops/<op_name>/
4. Prompt the three registration update points
Checkpoints
- ascend-kernel project exists (build.sh, CMakeLists.txt, csrc/)
- `csrc/ops/<op_name>/` directory has been created
- It contains `op_host/<op_name>.cpp`, `op_kernel/<op_name>.cpp`, `CMakeLists.txt`, and `design.md`
All passed → Proceed to Phase 2
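The skeleton checkpoint above can be verified with a short file-existence check. This is a sketch under the assumption that all four files sit directly under `csrc/ops/<op_name>/` as listed; `missing_skeleton_files` is a hypothetical helper, not part of the project-init skill.

```python
from pathlib import Path
from typing import List


def missing_skeleton_files(project_root: str, op_name: str) -> List[str]:
    """List skeleton files (relative to the operator directory) that are absent."""
    op_dir = Path(project_root) / "csrc" / "ops" / op_name
    expected = [
        op_dir / "op_host" / f"{op_name}.cpp",
        op_dir / "op_kernel" / f"{op_name}.cpp",
        op_dir / "CMakeLists.txt",
        op_dir / "design.md",
    ]
    # An empty result means the Phase 1 file checkpoint passes.
    return [p.relative_to(op_dir).as_posix() for p in expected if not p.is_file()]
```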
Phase 2: Design Document Generation
Called Skill: `ascendc-operator-design`
Execution Content
MANDATORY: Execute according to the ascendc-operator-design skill process:
1. Analyze operator requirements (name, function, data types)
2. Determine implementation path (AscendC Kernel / CATLASS / ACLNN)
3. Design Tiling strategy (Block-level + UB-level)
4. Fill in UB allocation table, derive bufferCoefficient
5. Generate complete design document to csrc/ops/<op_name>/design.md
Checkpoints
- `csrc/ops/<op_name>/design.md` is complete
- Contains function signature and supported data types
- Contains calculation logic pseudocode (AscendC API call sequence)
- Contains UB allocation table (lists all buffers and the total coefficient)
- Contains bufferCoefficient (value for each dtype)
All passed → Proceed to Phase 3
Phase 3: Test Case Generation
Called Skill: `ascendc-operator-testcase-gen`
Execution Content
MANDATORY: Execute according to the ascendc-operator-testcase-gen skill process:
1. Read csrc/ops/<op_name>/design.md, extract parameter constraints, supported dtypes, typical shapes
2. Generate TEST_SHAPES (regular shapes), GENERAL_SHAPES (generalized shapes), BOUNDARY_VALUES (boundary values)
3. Generate operator benchmarks (CPU reference implementation, NPU calling method)
4. Output test case document to csrc/ops/<op_name>/test/<op_name>-test-cases.md
Checkpoints
- `csrc/ops/<op_name>/test/<op_name>-test-cases.md` has been generated
- Contains SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES, BOUNDARY_VALUES
- Contains operator benchmarks (NPU calling method + CPU reference implementation)
- Shapes and parameter values are within the constraints of design.md
All passed → Proceed to Phase 4
Phase 4: Code Generation + Framework Adaptation + Compile Test
Called Skill: `ascendc-operator-code-gen` (internally calls `ascendc-operator-compile-debug` automatically)
Execution Content
MANDATORY: Execute according to the ascendc-operator-code-gen skill process:
Stage 1: Load Reference Documents
- Read references/GUIDE.md
- Load corresponding reference according to operator type
Stage 2: Read Design Document
- Extract function signature, UB allocation table, calculation pseudocode
Stage 3: Select Template and Generate Code
- Select elementwise / row template
- Generate op_host/<op_name>.cpp (includes Tiling calculation logic)
- Generate op_kernel/<op_name>.cpp (includes Compute calculation logic)
Stage 4: Framework Adaptation
- Update csrc/ops.h (function declaration)
- Update csrc/register.cpp (m.def + m.impl)
- Update csrc/CMakeLists.txt (OP_SRCS + ascendc_library)
Stage 5: Compilation, Installation and Testing (call compile-debug skill)
- Compile via ./build.sh
- Install via pip install whl
- Generate tests/test_<op_name>.py
- Run functional tests and precision tests
- Debug up to 3 times if compilation/test fails
Checkpoints
- `op_host/<op_name>.cpp` uses the platform API to obtain hardware parameters
- `op_kernel/<op_name>.cpp` contains the complete CopyIn → Compute → CopyOut pipeline
- Function declaration has been added to `ops.h`
- `m.def` and `m.impl` have been added to `register.cpp`
- Host and kernel source files have been added to `csrc/CMakeLists.txt`
- Compilation is successful (whl package has been generated)
- Functional tests pass (exit code 0)
- All precision tests pass (pytest all green)
All passed → Proceed to Phase 5
Phase 5: Interface Document Generation
Called Skill: `ascendc-operator-doc-gen`
Execution Content
MANDATORY: Execute according to the ascendc-operator-doc-gen skill process:
Stage 1: Information Extraction
- Extract Python calling signature (m.def schema) from register.cpp
- Extract C++ function declaration and return type from ops.h
- Extract algorithm description, parameter description, dtype support, constraint conditions from design.md
- Extract TORCH_CHECK constraints from op_host
- Extract usage examples from tests/test_<op_name>.py
Stage 2: Document Structure Assembly
- Assemble Chinese interface documents in PyTorch official documentation style
- Includes: Title Signature + Function Description + Parameter Description + Supported Data Types + Shape + Constraint Conditions + Usage Examples + Return Value
Stage 3: File Generation
- Generate csrc/ops/<op_name>/README.md
Stage 4: Display complete document content in the interactive interface
Checkpoints
- Complete interface information has been extracted from source code (signature, parameters, dtype, shape, constraints)
- README.md contains all required sections (title signature + function description + parameter description + supported data types + shape + constraint conditions + usage examples + return value)
- Python calling signature is consistent with `m.def` in `register.cpp`
- Parameter descriptions use PyTorch documentation style and are written in Chinese
- Code in usage examples is runnable
- README.md has been written to `csrc/ops/<op_name>/README.md`
- Interface document has been fully displayed in the chat interface
All passed → Proceed to Phase 6
Phase 6: Precision Evaluation Report
Called Skill: `ascendc-operator-precision-eval`
Execution Content
MANDATORY: Execute according to the ascendc-operator-precision-eval skill process:
Stage 1: Load Test Case Document + Information Collection
- Read csrc/ops/<op_name>/test/<op_name>-test-cases.md (output from testcase-gen)
- Extract SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES, BOUNDARY_VALUES, operator benchmarks
- Supplement and extract information such as precision thresholds from existing code
Stage 2: Test Case Adaptation ((shapes + boundary) × dtypes ≥ 30 cases)
- Directly reuse TEST_SHAPES and BOUNDARY_VALUES from testcase-gen
- Traverse all dtypes supported by the operator for each shape / boundary value
Stage 3: Test Script Generation (output to operator directory csrc/ops/<op_name>/test/)
- Generate test_<op_name>_precision.py (pytest format) based on template
- Generate run_<op_name>_precision_report.py (report generator) based on template
Stage 4: Execution
- Run pytest and all tests pass
- Run report generator to output JSON
Stage 5: Report Generation
- Generate <op_name>_precision_report.md (includes regular shape + boundary value table + summary + key findings)
- Tell the user the report path
Checkpoints
- Number of test cases = (shapes + boundary) × dtypes ≥ 30
- Each dtype supported by the operator has been tested
- All pytest precision tests pass
- JSON report is generated (includes 5 precision metrics: MaxAbsErr / MeanAbsErr / MaxRelErr / MeanRelErr / CosineSim)
- Markdown report is generated at `csrc/ops/<op_name>/test/<op_name>_precision_report.md`
- Precision test results have been displayed in the chat interface in Markdown table format
- The user has been told the precision report path
All passed → Proceed to Phase 7
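The case-count rule and the five metric names above can be illustrated in plain Python. The document names the metrics but does not define them, so the formulas below are assumptions (in particular the `eps` guard in the relative error and the treatment of zero vectors); `case_count` and `precision_metrics` are hypothetical helpers, not the precision-eval skill's code.

```python
import math
from typing import Dict, Sequence


def case_count(num_shapes: int, num_boundary: int, num_dtypes: int) -> int:
    """Number of cases = (shapes + boundary) x dtypes; the checkpoint needs >= 30."""
    return (num_shapes + num_boundary) * num_dtypes


def precision_metrics(actual: Sequence[float], expected: Sequence[float],
                      eps: float = 1e-12) -> Dict[str, float]:
    """Compute the five metrics for two equal-length, non-empty sequences."""
    abs_err = [abs(a - e) for a, e in zip(actual, expected)]
    # Assumed relative error: |a - e| / (|e| + eps) to avoid division by zero.
    rel_err = [d / (abs(e) + eps) for d, e in zip(abs_err, expected)]
    dot = sum(a * e for a, e in zip(actual, expected))
    norm = math.sqrt(sum(a * a for a in actual)) * math.sqrt(sum(e * e for e in expected))
    return {
        "MaxAbsErr": max(abs_err),
        "MeanAbsErr": sum(abs_err) / len(abs_err),
        "MaxRelErr": max(rel_err),
        "MeanRelErr": sum(rel_err) / len(rel_err),
        # Assumption: two all-zero vectors are treated as perfectly similar.
        "CosineSim": dot / norm if norm > 0 else 1.0,
    }
```

For example, 10 regular shapes plus 5 boundary values across 2 dtypes gives exactly 30 cases, the minimum the checkpoint accepts.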
Phase 7: Performance Benchmarking Report
Called Skill: `ascendc-operator-performance-eval`
Execution Content
MANDATORY: Execute according to the ascendc-operator-performance-eval skill process:
Stage 1: Load Test Case Document + Information Collection
- Read csrc/ops/<op_name>/test/<op_name>-test-cases.md (output from testcase-gen)
- Extract SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES, operator benchmarks
- Supplement and extract information such as OP Type keywords from existing code
Stage 2: Test Case Adaptation (JSONL format, ≥8 cases)
- Select representative shapes from TEST_SHAPES + GENERAL_SHAPES of testcase-gen
- Cover all dtypes supported by the operator
- Convert to JSONL format
Stage 3: Script Generation (output to operator directory csrc/ops/<op_name>/test/)
- Generate run_<op_name>_case.py (single case msprof executor) based on template
- Generate benchmark_<op_name>_msprof.py (master control script) based on template
- Generate <op_name>_cases.jsonl
Stage 4: Execution and Collection
- Run the master control script, 20 iterations per case (first 10 for warm-up)
- Extract Task Duration(us) and hardware metrics from op_summary_*.csv by OP Type
- Output JSON results
Stage 5: Report Generation
- Generate <op_name>_perf_report.md (includes result table + summary + brief analysis)
- Tell the user the report path
Checkpoints
- JSONL test cases cover multiple shape × dtype combinations (≥8 cases)
- Uses `msprof` for collection, not any other timing method
- Filters target operators by `OP Type` (not Op Name)
- Uses the 20/10 strategy (20 iterations per case, the first 10 for warm-up)
- JSON report is generated (includes Task Duration + hardware metrics)
- Markdown report is generated at `csrc/ops/<op_name>/test/<op_name>_perf_report.md`
- Report contains brief analysis (≥3 conclusions)
- Performance test results have been displayed in the chat interface in Markdown table format
- The user has been told the performance report path
All passed → Operator development is complete
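The 20/10 strategy and the `OP Type` filter can be sketched as a small CSV reduction. This is a sketch, not the performance-eval skill's script: it assumes the summary CSV has `OP Type` and `Task Duration(us)` columns with one row per kernel launch in execution order, and that the OP Type keyword is matched as a substring; `mean_task_duration_us` is a hypothetical helper.

```python
import csv
import io
from statistics import mean
from typing import List


def mean_task_duration_us(csv_text: str, op_type_keyword: str, warmup: int = 10) -> float:
    """Average Task Duration(us) for matching rows, skipping warm-up rows."""
    rows = csv.DictReader(io.StringIO(csv_text))
    durations: List[float] = [
        float(row["Task Duration(us)"])
        for row in rows
        if op_type_keyword in row["OP Type"]  # filter by OP Type, not Op Name
    ]
    measured = durations[warmup:]  # drop the first `warmup` iterations
    if not measured:
        raise ValueError(f"no measured rows for OP Type keyword {op_type_keyword!r}")
    return mean(measured)
```

With 20 iterations per case and `warmup=10`, the result averages only the last 10 launches.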
Inter-stage Data Flow
| Transition | Output of earlier phase | Input to later phase |
|---|---|---|
| Phase 1 → Phase 2 | `csrc/ops/<op_name>/`, design.md (placeholder) | Operator name, directory structure |
| Phase 2 → Phase 3 | design.md (complete) | Parameter constraints, supported dtypes, typical shapes → generate unified test case document |
| Phase 3 → Phase 4 | `<op_name>-test-cases.md` (test case document for subsequent reuse) | design.md (complete): function signature, UB allocation table → bufferCoefficient; calculation pseudocode → Compute logic; Tiling strategy → Block/UB splitting parameters |
| Phase 4 → Phase 5 | Installed operator whl, `tests/test_<op_name>.py` | register.cpp / ops.h / design.md / op_host / test files → extract interface information to generate documents |
| Phase 5 → Phase 6 | `csrc/ops/<op>/README.md` (interface document completed) | `<op_name>-test-cases.md` (from Phase 3); operator name, calling method, input domain constraints; all supported dtypes, precision thresholds → output to `csrc/ops/<op_name>/test/` |
| Phase 6 → Phase 7 | Precision report passed, `csrc/ops/<op>/test/` | `<op_name>-test-cases.md` (from Phase 3); operator name, project/native calling method; all supported dtypes, OP Type keywords → output to `csrc/ops/<op_name>/test/` |
Status Tracking Table
| Phase | Precondition | Called Skill | Key Deliverables |
|---|---|---|---|
| 0. Requirements Collection | None | (none) | CANN path + Conda environment + Operator name + Function description |
| 1. Project Initialization | Phase 0 | `ascendc-operator-project-init` | Operator skeleton directory |
| 2. Design Document | Phase 1 | `ascendc-operator-design` | design.md (includes Tiling + UB allocation table) |
| 3. Test Case Generation | Phase 2 | `ascendc-operator-testcase-gen` | |
| 4. Code & Testing | Phase 3 | `ascendc-operator-code-gen` | Runnable operator + basic tests passed |
| 5. Interface Document | Phase 4 | `ascendc-operator-doc-gen` | PyTorch-style Chinese API document (README.md) |
| 6. Precision Evaluation | Phase 5 | `ascendc-operator-precision-eval` | ≥30 precision test cases + precision report |
| 7. Performance Benchmarking | Phase 6 | `ascendc-operator-performance-eval` | msprof performance comparison + performance report |
Error Recovery
Resume from Interrupted Point
When the user says "Continue operator development":
| Detection Condition | Determined Stage | Recovery Action |
|---|---|---|
| | Phase 1 not completed | Start from Phase 1 |
| | Phase 2 not completed | Start from Phase 2 |
| | Phase 3 not completed | Start from Phase 3 |
| | Phase 4 not completed | Start from Phase 4 |
| whl package not generated | Phase 4 compilation not completed | Resume from compilation step |
| Basic tests not passed | Phase 4 testing not completed | Resume from testing step |
| | Phase 5 not completed | Start from Phase 5 |
| No precision report in | Phase 6 not started | Start from Phase 6 |
| Precision report does not exist or precision tests not all passed | Phase 6 not completed | Resume from Phase 6 |
| Precision report exists but performance report does not | Phase 7 not started | Start from Phase 7 |
| | Phase 7 not completed | Resume from Phase 7 |
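One way to approximate the resume decision is to look for each phase's output file in order. This is a coarse sketch under loud assumptions: the exact detection conditions are not all recoverable from the table, so the artifact chosen per phase is an assumption based on file names mentioned elsewhere in this document, `tests/` is assumed to sit at the project root, and file existence alone cannot tell a placeholder `design.md` from a complete one or a passing test from a failing one. `first_incomplete_phase` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def first_incomplete_phase(project_root: str, op_name: str) -> Optional[int]:
    """Return the first phase whose assumed artifact is missing, or None if all exist."""
    root = Path(project_root)
    op_dir = root / "csrc" / "ops" / op_name
    test_dir = op_dir / "test"
    # Assumed (phase, artifact) pairs; existence is only a proxy for completion.
    artifacts = [
        (1, op_dir / "op_host" / f"{op_name}.cpp"),
        (2, op_dir / "design.md"),
        (3, test_dir / f"{op_name}-test-cases.md"),
        (4, root / "tests" / f"test_{op_name}.py"),
        (5, op_dir / "README.md"),
        (6, test_dir / f"{op_name}_precision_report.md"),
        (7, test_dir / f"{op_name}_perf_report.md"),
    ]
    for phase, path in artifacts:
        if not path.exists():
            return phase
    return None
```

A real implementation would still need the content checks from each phase's checkpoint list before trusting the result.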
Compilation/Test Failure
Handled internally by the `ascendc-operator-compile-debug` skill, with up to 3 debugging attempts. If it still fails after 3 attempts, stop and report the detailed errors to the user.
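The bounded retry policy can be sketched as a small loop. This is an illustration only, not the compile-debug skill's code: `build_with_retries` and the `attempt` callback returning `(ok, message)` are hypothetical, and the sketch assumes "up to 3" means three attempts in total.

```python
from typing import Callable, List, Tuple


def build_with_retries(attempt: Callable[[], Tuple[bool, str]],
                       max_attempts: int = 3) -> Tuple[bool, List[str]]:
    """Call `attempt` until it succeeds or `max_attempts` is reached.

    Returns (success, collected error messages) so a final failure can be
    reported to the user in detail.
    """
    errors: List[str] = []
    for _ in range(max_attempts):
        ok, message = attempt()
        if ok:
            return True, errors
        errors.append(message)
    return False, errors
```

On final failure the caller stops the workflow and shows the collected messages instead of retrying further.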