npu-adapter-reviewer


NPU Adapter Reviewer - GPU to Ascend NPU Adaptation Review Expert

This is an Agent Skill specifically designed for adapting GPU code to Huawei Ascend NPU. This skill covers the complete adaptation workflow: code analysis, bottleneck identification, adaptation script writing, verification plan design, and final report generation.


Core Workflow


Phase 1: Code Repository Acquisition and Analysis

**Task 1.1: Obtain Source Code**

Acquire the complete code repository based on user input (local path or GitHub link):

```bash
# If it's a GitHub link, clone first
git clone <repo_url> /tmp/gpu_code_base
cd /tmp/gpu_code_base

# If it's a local path, analyze directly
ls -la <local_path>
```

**Task 1.2: Comprehensive Code Scanning**

Analyze the code structure using parallel exploration:

1. **Exploration Agent 1 - Code Structure Analysis**
   - Identify all Python files, CUDA files, and C++ files
   - Recognize the project directory structure
   - Locate main entry files and configuration files

2. **Exploration Agent 2 - GPU Dependency Identification**
   - Search for CUDA API calls (`cudaMalloc`, `cudaMemcpy`, `kernel<<<...>>>`, `torch.cuda`, etc.)
   - Search for PyTorch GPU-related code (`.cuda()`, `.to('cuda')`, `torch.device('cuda')`, etc.)
   - Search for TensorRT-related code
   - Search for deep learning framework-specific APIs (Transformer Engine, Flash Attention, etc.)

3. **Exploration Agent 3 - External Library Dependencies**
   - Search for `import` and `from ... import` statements
   - Identify all third-party library dependencies
   - Check for libraries not supported by NPU

**Task 1.3: Generate Code Structure Report**

Output the following information:
- Total number of project files and lines of code
- File type distribution (Python/CUDA/C++/others)
- List of main dependent libraries
- Core modules and their function descriptions
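The three exploration passes in Task 1.2 reduce to pattern scans over the repository tree. A minimal single-threaded sketch of the GPU-dependency pass (patterns drawn from the checklist above; illustrative, not exhaustive):

```python
import re
from pathlib import Path

# GPU-specific patterns from the Task 1.2 checklist; extend as needed.
GPU_PATTERNS = [
    r"cudaMalloc", r"cudaMemcpy", r"<<<.*>>>",
    r"torch\.cuda", r"\.cuda\(\)", r"\.to\(['\"]cuda",
    r"torch\.device\(['\"]cuda", r"tensorrt", r"flash_attn",
]
GPU_RE = re.compile("|".join(GPU_PATTERNS))

def scan_gpu_dependencies(root: str):
    """Yield (path, line_no, line) for every GPU-specific hit under root."""
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in {".py", ".cu", ".cuh", ".cpp", ".h"}:
            continue
        for no, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if GPU_RE.search(line):
                yield (str(path), no, line.strip())
```

In practice each exploration agent would run such a scan concurrently over its own pattern set and merge results into the Task 1.3 report.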


Phase 2: Identify Bottlenecks in GPU-to-NPU Migration

**Task 2.1: Operator Compatibility Analysis**

Identify the compatibility of GPU-specific operators on NPU by category:

| Bottleneck Category | Typical GPU Implementation | NPU Alternative | Migration Difficulty |
|---|---|---|---|
| CUDA core operators | `__global__`, `__device__` functions | Ascend C operator / ATB | High |
| Memory operations | `cudaMallocHost`, `cudaMallocManaged` | `aclrtMalloc`, `HI_MPI_MALLOC` | Medium |
| Streams and events | `cudaStream_t`, `cudaEvent_t` | `aclrtStream`, `aclrtEvent` | Medium |
| cuBLAS/cuDNN | `cublasGemmEx`, `cudnnConvolutionForward` | `aclblasGemmEx`, operator fusion | High |
| Flash Attention | `flash_attn_varlen_func` | Ascend Flash Attention operator | Medium |
| Custom operators | PyTorch CUDA extensions | ATC/ACL operators | High |
| AMP/mixed precision | `torch.cuda.amp` | `ascend_mixed_precision` | Low |

**Task 2.2: Identify Specific Bottlenecks**

Generate the following analysis for each GPU API call:
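The table above can be distilled into a lookup used while scanning; a minimal sketch (entries and difficulty labels mirror the table and are illustrative, not an exhaustive migration database):

```python
# CUDA-to-Ascend mapping distilled from the compatibility table above.
# Values are (npu_alternative, migration_difficulty).
CUDA_TO_NPU = {
    "cudaMallocHost": ("aclrtMalloc", "Medium"),
    "cudaMallocManaged": ("aclrtMalloc", "Medium"),
    "cudaStream_t": ("aclrtStream", "Medium"),
    "cudaEvent_t": ("aclrtEvent", "Medium"),
    "cublasGemmEx": ("aclblasGemmEx", "High"),
    "flash_attn_varlen_func": ("Ascend Flash Attention operator", "Medium"),
    "torch.cuda.amp": ("ascend_mixed_precision", "Low"),
}

def lookup(api_name):
    """Return (npu_alternative, difficulty), or None if the API is unmapped."""
    return CUDA_TO_NPU.get(api_name)
```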

Bottleneck ID: #001

  • File Location: `src/attention/cuda_impl.cu:142`
  • GPU API: `cudaStreamCreate(&stream)`
  • NPU Alternative: `aclrtCreateStream(&stream)`
  • Migration Plan:
    1. Replace the header file with `aclrt.h`
    2. Replace the API calls
    3. Handle error-code differences
  • Estimated Workload: 0.5 person-days
  • Impact Scope: Global stream management

**Task 2.3: Generate Bottleneck List**

Output the complete list of bottlenecks, sorted by impact scope and migration difficulty.
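The ordering in Task 2.3 can be produced with a simple composite sort key; a sketch (record fields and rank orderings are illustrative assumptions):

```python
# Sort bottleneck records by impact scope first, then migration difficulty.
# Field names and rank orderings are illustrative assumptions.
IMPACT_RANK = {"global": 0, "module": 1, "local": 2}
DIFFICULTY_RANK = {"High": 0, "Medium": 1, "Low": 2}

def sort_bottlenecks(bottlenecks):
    """Widest impact first; within equal impact, hardest migration first."""
    return sorted(
        bottlenecks,
        key=lambda b: (IMPACT_RANK[b["impact"]], DIFFICULTY_RANK[b["difficulty"]]),
    )

bottlenecks = [
    {"id": "#004", "impact": "local", "difficulty": "High"},
    {"id": "#001", "impact": "global", "difficulty": "Medium"},
]
ordered = sort_bottlenecks(bottlenecks)  # "#001" first: wider impact wins
```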


Phase 3: Write Adaptation Scripts

**Task 3.1: Create NPU Adaptation Layer**

Create adaptation scripts based on the identified bottlenecks:

1. Create `npu_compat.py` - Python-layer compatibility adaptation:

   ```python
   # Auto-detect the available device
   def get_device():
       if is_npu_available():
           return "npu"
       elif is_cuda_available():
           return "cuda"
       else:
           return "cpu"

   # Replace torch.cuda calls with device-agnostic placement
   def to_device(tensor):
       device = get_device()
       if device == "npu":
           return tensor.npu()
       elif device == "cuda":
           return tensor.cuda()
       return tensor
   ```

2. Create `npu_ops.py` - NPU operator wrappers:
   • Wrap all CUDA core operators as NPU versions
   • Keep the original interfaces; implement the NPU adaptation internally

3. Create `build_npu.sh` - build script:
   • Ascend C operator compilation commands
   • Dependency and environment checks
   • Error diagnosis

**Task 3.2: Modify Original Code**

Generate modified code files, keeping the originals and creating `.npu` versions:
  • Replace all GPU-specific calls
  • Add device-detection logic
  • Add a fallback mechanism
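The "fallback mechanism" bullet can be realized as a wrapper that tries the NPU implementation first and falls back to a reference implementation on failure. A dependency-free sketch (function names and exception choices are illustrative assumptions; real code should catch the narrowest errors the NPU runtime raises):

```python
import functools
import math

def with_fallback(reference_impl):
    """Decorator: run the NPU implementation, fall back to the reference
    implementation if the NPU path raises. Exception types are illustrative."""
    def decorate(npu_impl):
        @functools.wraps(npu_impl)
        def wrapper(*args, **kwargs):
            try:
                return npu_impl(*args, **kwargs)
            except (RuntimeError, NotImplementedError):
                return reference_impl(*args, **kwargs)
        return wrapper
    return decorate

def softmax_cpu(xs):
    # Numerically stable reference implementation.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

@with_fallback(softmax_cpu)
def softmax(xs):
    # Stand-in for an NPU kernel; raises here to demonstrate the fallback.
    raise NotImplementedError("NPU operator unavailable in this sketch")
```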


Phase 4: Design Verification Plan

**Task 4.1: Create Verification Scripts**

Generate the verification script `verify_npu.sh` based on the adaptation content:

```bash
#!/bin/bash
# NPU Adaptation Verification Script

echo "=== 1. Environment Check ==="
check_npu_env() {
    # Check the NPU driver
    ls -la /dev/npu 2>/dev/null || echo "Warning: NPU device not found"
    # Check CANN
    echo "$ASCEND_TOOLKIT_HOME"
    # Check Python packages
    python3 -c "import torch; print('PyTorch version:', torch.__version__)"
    python3 -c "import torch_npu; print('torch_npu installed')"
}

echo "=== 2. Module Import Test ==="
test_imports() {
    cd <project_path>
    python3 -c "import npu_compat; print('npu_compat OK')"
    python3 -c "import npu_ops; print('npu_ops OK')"
}

echo "=== 3. Function Verification ==="
test_functions() {
    # Run basic tests
    python3 -m pytest tests/test_npu_*.py -v
    # Verify operator precision
    python3 scripts/verify_precision.py
}

echo "=== 4. Performance Benchmark ==="
benchmark() {
    python3 scripts/benchmark.py --device npu --compare cuda
}
```

**Task 4.2: Precision Verification Script**

Generate `verify_precision.py`:
```python
import numpy as np

def verify_npu_precision(cuda_result, npu_result, rtol=1e-3, atol=1e-3):
    """Verify precision difference between NPU and GPU outputs"""
    diff = np.abs(cuda_result - npu_result)
    max_diff = np.max(diff)
    mean_diff = np.mean(diff)
    
    passed = np.allclose(cuda_result, npu_result, rtol=rtol, atol=atol)
    return {
        "passed": passed,
        "max_diff": max_diff,
        "mean_diff": mean_diff,
        "rtol": rtol,
        "atol": atol
    }
```


Phase 5: Generate Review Report

**Task 5.1: Generate Markdown Report**

Generate a complete review report based on the verification results, following the Markdown template below:

GPU to Ascend NPU Adaptation Review Report

`CodeReview_Results_YYYY-MM-DD.md`


1. Executive Summary

| Item | Content |
|---|---|
| Original Code Repository | `<repo_url>` or `<local_path>` |
| Review Date | YYYY-MM-DD |
| Adaptation Status | ✅ Fully Adapted / ⚠️ Partially Adapted / ❌ Adaptation Failed |
| Total Identified Bottlenecks | XX |
| Adapted Bottlenecks | XX |
| Remaining Bottlenecks | XX |

2. Original Code Analysis

2.1 Code Structure Overview

  • Total Files: XX
  • Lines of Python Code: XX
  • Lines of CUDA/C++ Code: XX
  • Core Modules: ...


2.2 Dependency Analysis

| Library Name | Version | NPU Compatibility | Alternative |
|---|---|---|---|
| torch | 2.x | ✅ Compatible | torch_npu |
| flash-attn | 2.x | ⚠️ Partial | Ascend Flash Attention |

3. Detailed Analysis of Migration Bottlenecks

3.1 Operator Compatibility Issues

Issue #001: CUDA Stream Management

  • File: `src/utils/stream_manager.py:45`
  • GPU API: `cudaStreamCreate`
  • Issue Description: Uses CUDA streams to manage asynchronous execution
  • NPU Alternative: `aclrtCreateStream`
  • Impact Scope: Global; affects all asynchronous operations
  • Migration Suggestion:

    ```python
    # Before
    import torch.cuda
    stream = torch.cuda.Stream()

    # After
    import torch_npu
    stream = torch.npu.Stream()
    ```
  • Status: ✅ Adapted / ⚠️ Pending


Issue #002: Flash Attention Operator

  • File: `src/attention/flash_attn_impl.py:78`
  • GPU API: `flash_attn_varlen_func`
  • Issue Description: Uses Flash Attention to accelerate attention computation
  • NPU Alternative: Ascend flash_attn operator or MindSpore flash_attention
  • Impact Scope: High; core inference performance
  • Migration Suggestion:

    ```python
    # Before
    from flash_attn import flash_attn_func
    output = flash_attn_func(q, k, v)

    # After
    # Option 1: Use torch_npu operators
    import torch_npu
    output = torch_npu.npu_flash_attention(q, k, v)

    # Option 2: Use the ATB library
    from ascend_toolkit import flash_attention
    output = flash_attention(q, k, v)
    ```
  • Status: ✅ Adapted / ⚠️ Pending

3.2 Model Loading and Weight Management Issues

Issue #003: GPU Weight Format

  • File: `src/model/loader.py:112`
  • Issue Description: Weights stored in CUDA format fail to load directly
  • Migration Suggestion:

    ```python
    # Before
    state_dict = torch.load(weights_path)
    model.load_state_dict(state_dict)

    # After
    state_dict = torch.load(weights_path, map_location='cpu')
    # Convert weights to NPU
    for k, v in state_dict.items():
        if isinstance(v, torch.Tensor):
            state_dict[k] = v.npu()
    model.load_state_dict(state_dict)
    ```
  • Status: ✅ Adapted

3.3 Computing Performance Bottlenecks

Issue #004: Missing Operator Fusion

  • File: `src/model/inference.py:89`
  • Issue Description: Multiple independent operators cause performance degradation
  • Migration Suggestion: Use ATC for operator-fusion optimization
  • Estimated Performance Improvement: 20-30%
  • Status: ⚠️ Pending

3.4 NPU Memory and KV Cache Management

Issue #005: Dynamic Memory Allocation

  • File: `src/cache/kv_cache.py:56`
  • Issue Description: Uses CUDA dynamic memory allocation
  • Migration Suggestion: Use a fixed memory pool
  • Status: ⚠️ Pending
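The fixed-memory-pool suggestion for #005 can be sketched as a pre-allocated block pool that hands out and recycles fixed-size cache slots instead of allocating per request. A framework-agnostic sketch (class name and sizes are illustrative; `bytearray` stands in for pre-allocated NPU buffers):

```python
class KVCachePool:
    """Pre-allocates a fixed number of cache blocks and recycles them,
    avoiding per-request dynamic allocation."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        # In real code each block would be a buffer pre-allocated on the
        # NPU device; bytearray is a host-side stand-in for this sketch.
        self._free = [bytearray(block_size) for _ in range(num_blocks)]
        self._in_use = {}

    def acquire(self, request_id):
        if not self._free:
            raise RuntimeError("KV cache pool exhausted; reduce batch size")
        block = self._free.pop()
        self._in_use[request_id] = block
        return block

    def release(self, request_id):
        self._free.append(self._in_use.pop(request_id))
```

Exhaustion surfaces as an explicit error at admission time rather than an out-of-memory failure mid-inference, which matches the troubleshooting guidance later in this report.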

3.5 Python-C++ Boundary Issues

Issue #006: C++ Extension Compilation

  • File: `src/utils/gpu_ext.cpp:145`
  • Issue Description: CUDA C++ extensions need to be recompiled
  • Migration Suggestion: Rewrite with Ascend C or use ATB
  • Status: ⚠️ Pending

3.6 Concurrency and Asynchronous Issues

Issue #007: Multi-Stream Concurrency

  • File: `src/server/request_handler.py:78`
  • Issue Description: Uses CUDA streams to implement concurrency
  • Migration Suggestion: Refactor to process-level concurrency
  • Status: ⚠️ Pending
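The process-level refactor suggested for #007 replaces per-stream concurrency with a worker pool, where each worker owns its own device context. A stdlib-only sketch (the handler body is a stand-in for real inference; in practice each worker would initialize its NPU context once via a pool initializer):

```python
from multiprocessing import Pool

def handle_request(request):
    # Stand-in for inference: a real worker would bind its own NPU context
    # and run the model here. Must be a top-level function to be picklable.
    return {"id": request["id"], "status": "ok"}

def serve(requests, workers=2):
    """Fan requests out across worker processes and collect results in order."""
    with Pool(processes=workers) as pool:
        return pool.map(handle_request, requests)
```

Compared with multi-stream concurrency, each process has an isolated device context, so a crash or device error in one worker does not corrupt the others.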

3.7 Configuration and Maintainability Issues

Issue #008: Hard-Coded Device

  • File: `src/config.py:23`
  • Issue Description: `cuda:0` hard-coded in the configuration
  • Migration Suggestion: Change to device detection
  • Status: ✅ Adapted

4. Adaptation Code List

4.1 Newly Added Files

| File Name | Function | Status |
|---|---|---|
| `npu_compat.py` | Device detection and compatibility layer | |
| `npu_ops.py` | NPU operator wrappers | |
| `build_npu.sh` | Build script | |
| `verify_npu.sh` | Verification script | |


4.2 Modified Files

| File Name | Modification | Status |
|---|---|---|
| `src/attention/flash_attn.py` | Replaced with NPU operators | |
| `src/model/loader.py` | Added weight conversion | |
| `src/utils/stream_manager.py` | Stream adaptation | |

5. Verification Results

5.1 Environment Verification

  • NPU driver installed
  • CANN Toolkit configured
  • torch_npu installed
  • Python modules importable

5.2 Function Verification
  • Basic module import tests passed
  • Device detection function normal
  • Forward inference executed successfully
  • Weight loading and conversion normal

5.3 Precision Verification
  • Inference result difference from GPU < 0.1%
  • Performance test pending (requires NPU hardware)

5.4 Issue Summary

| Issue Type | Quantity | Severity |
|---|---|---|
| Resolved | XX | - |
| Unresolved | XX | High/Medium/Low |

6. Adaptation Guide

6.1 Prerequisites

1. Install CANN Toolkit
2. Install torch_npu:

   ```bash
   pip install torch torch_npu
   ```

3. Verify the installation:

   ```bash
   python3 -c "import torch; import torch_npu; print('NPU available:', torch_npu.is_npu_available())"
   ```


6.2 Quick Adaptation Steps

Step 1: Clone and enter the project

```bash
git clone <repo_url>
cd <project_name>
```

Step 2: Install dependencies

```bash
pip install -r requirements-npu.txt
```

Step 3: Run verification

```bash
bash verify_npu.sh
```

Step 4: Execute inference

```bash
python3 run_npu.py --model <model_path> --input <input_data>
```

6.3 Common Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| Import failure | CANN not installed correctly | Reconfigure environment variables |
| Operator not supported | NPU does not support the operator | Use an ATB alternative or a custom operator |
| Out of memory | Batch size too large | Reduce `batch_size` |
| Precision below target | Mixed-precision configuration issue | Check the AMP configuration |

7. Follow-Up Work Suggestions

7.1 Short-Term (Within 1 Week)

  • Complete adaptation of remaining bottlenecks
  • Conduct performance tests on real NPU hardware
  • Optimize operator fusion

7.2 Mid-Term (Within 1 Month)
  • Improve error handling mechanism
  • Add logging and monitoring
  • Performance tuning

7.3 Long-Term
  • Keep up with CANN updates
  • Automate testing processes
  • Improve documentation

Report Generation Time: YYYY-MM-DD HH:mm:ss Adaptation Engineer: AI Agent (NPU Adapter Reviewer) Report Version: v1.0

**Task 5.2: Output Report**

Save the report to the current working directory as `CodeReview_Results_YYYY-MM-DD.md`.


Output Requirements

  1. Report Format: Must be Markdown
  2. File Naming: `CodeReview_Results_YYYY-MM-DD.md` (YYYY-MM-DD is the run date)
  3. Save Location: Current working directory
  4. Content Completeness: Must include all sections listed above
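The dated file name can be produced mechanically; a minimal sketch (the helper name is illustrative):

```python
from datetime import date
from pathlib import Path

def report_path(run_date=None):
    """Build the report path per the naming rule above:
    CodeReview_Results_YYYY-MM-DD.md in the current working directory."""
    d = run_date or date.today()
    return Path.cwd() / f"CodeReview_Results_{d.isoformat()}.md"
```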

Special Handling Rules

If Verification Passes Completely

  • Output "Adaptation Successful" status
  • Provide complete adaptation guide
  • Include end-to-end execution instructions


If Verification Does Not Pass Completely

  • Detail each failed item
  • Provide specific repair suggestions
  • Provide modified code
  • Mark parts requiring manual intervention


Knowledge References

During execution, refer to the following materials (load on demand):
  • references/ascend_npu_best_practices.md
    - Ascend NPU Best Practices
  • references/cann_migration_guide.md
    - CANN Migration Guide
  • references/npu_python_api.md
    - NPU Python API Reference
Please use this skill to complete the full adaptation review work for GPU to Ascend NPU.