Found 11 Skills
Static inspection of Triton operator code quality (Host side + Device side) for Ascend NPU. Used when users need to identify potential bugs, API misuses, and performance risks by reading code. Core capabilities: (1) Ascend API constraint compliance check (2) Mask integrity verification (3) Precision processing review (4) Code pattern recognition. Note: This Skill only focuses on static code analysis; compile-time and runtime issues are handled by other Skills.
Evaluate the performance of Triton operators on Ascend NPU. It is used when users need to analyze operator performance bottlenecks, collect and compare operator performance using msprof/msprof op, diagnose Memory-Bound/Compute-Bound bottlenecks, measure hardware utilization metrics, and generate performance evaluation reports.
Generate Triton operator requirement documents suitable for Ascend NPU. Used when users need to design new Triton operators, write operator requirement documents, or perform operator performance optimization design.
Task Orchestration for Full-Process Development of Ascend Triton Operators. Used when users need to develop Triton Operators, covering the complete workflow of environment configuration → requirement design → code generation → static inspection → precision verification → performance evaluation → document generation → performance optimization.
Verify and build the required environment for Triton operator development on the Ascend platform, including configurations of dependencies such as CANN, Python/torch/torch_npu/triton-ascend and PATH environment variables. This is used when users need to configure the Triton operator development environment, check the installation of CANN/torch/triton-ascend, or verify whether the environment is available.
Optimize the performance of Triton operators on Ascend NPU. Used when users need to optimize Triton operator performance on Ascend NPU, resolve UB overflow, improve Cube unit utilization, or design Tiling strategies.
Generate Triton kernel code for Ascend NPU based on operator design documents. Used when users need to implement Triton operator kernels and convert requirement documents into executable code. Core capabilities: (1) Parse requirement documents to confirm computing logic (2) Design tiling partitioning strategy (3) Generate high-performance kernel code (4) Generate test code to verify correctness.
Accepts a Triton operator implementation, automatically runs a reference implementation composed of small (fine-grained) Torch operators on CPU or NPU for precision comparison, and generates a precision report. Used when users need to verify the correctness and precision of Triton operator implementations, compare precision with PyTorch implementations, and generate standardized precision reports.
Deep performance optimization Skill for Triton operators on Ascend NPU, dedicated to reaching the performance improvement the user requires. Core techniques include, but are not limited to, Unified Buffer (UB) capacity planning, multi-token parallel processing, MTE/Vector pipeline parallelism, and mask optimization. This Skill must be triggered when the user mentions performance optimization of Vector-type Triton operators on Ascend NPU.
Generate interface documents for Triton operators on Ascend NPU. Used when users need to create or update such interface documents. Core capabilities: (1) Generate standardized documents based on templates (2) Support the list of Ascend NPU product models (3) Provide specifications for operator parameter descriptions (4) Generate skeleton call examples.
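Migrate simple Vector-type Triton operators from GPU to Ascend NPU. Used when users need to migrate Triton code to NPU, or mention GPU-to-NPU migration, Triton migration, or Ascend adaptation. Note: operators with compilation issues cannot be migrated automatically.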
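The environment-verification Skill listed above checks that CANN, torch, torch_npu, and triton-ascend are installed and that PATH is configured. A minimal sketch of such a check might look like the following. It is not the Skill's actual implementation: the function names `check_environment` and `missing_items` are hypothetical, the `triton-ascend` package is assumed to be importable as `triton`, and the `ASCEND_HOME_PATH` variable is assumed to be what a CANN installation exports.

```python
import importlib.util
import os
import shutil

# Assumption: the triton-ascend distribution is importable as `triton`.
REQUIRED_MODULES = ("torch", "torch_npu", "triton")
# `msprof` is the profiler named in the performance-evaluation description.
REQUIRED_TOOLS = ("msprof",)
# Assumption: a CANN installation exports this variable; adjust to your setup.
REQUIRED_ENV_VARS = ("ASCEND_HOME_PATH",)


def check_environment(modules=REQUIRED_MODULES,
                      tools=REQUIRED_TOOLS,
                      env_vars=REQUIRED_ENV_VARS):
    """Return a dict mapping each check label to True (found) or False (missing)."""
    report = {}
    for name in modules:
        # find_spec locates a top-level module without importing it.
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which searches the directories listed in PATH.
        report[f"tool:{tool}"] = shutil.which(tool) is not None
    for var in env_vars:
        # An unset or empty variable counts as missing.
        report[f"env:{var}"] = bool(os.environ.get(var))
    return report


def missing_items(report):
    """Return the sorted labels of all failed checks."""
    return sorted(label for label, ok in report.items() if not ok)


if __name__ == "__main__":
    result = check_environment()
    for label, ok in sorted(result.items()):
        print(f"{'OK     ' if ok else 'MISSING'} {label}")
```

Running the script prints one line per check. An empty result from `missing_items` only shows that the pieces are discoverable, not that their versions are compatible with each other.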