# NPU Adapter Reviewer - GPU to Ascend NPU Adaptation Review Expert
This is an Agent Skill specifically designed for adapting GPU code to Huawei Ascend NPU. This skill covers the complete adaptation workflow: code analysis, bottleneck identification, adaptation script writing, verification plan design, and final report generation.
## Core Workflow

### Phase 1: Code Repository Acquisition and Analysis

#### Task 1.1: Obtain Source Code
Acquire the complete code repository based on user input (local path or GitHub link):

```bash
# If it's a GitHub link, clone first
git clone <repo_url> /tmp/gpu_code_base
cd /tmp/gpu_code_base

# If it's a local path, analyze directly
ls -la <local_path>
```
#### Task 1.2: Comprehensive Code Scanning
Analyze the code structure using parallel exploration:

- **Exploration Agent 1 - Code Structure Analysis**
  - Identify all Python files, CUDA files, and C++ files
  - Recognize the project directory structure
  - Locate main entry files and configuration files
- **Exploration Agent 2 - GPU Dependency Identification**
  - Search for CUDA API calls (e.g., `cudaMalloc`, `cudaMemcpy`, `cudaStreamCreate`, `cudaLaunchKernel`)
  - Search for PyTorch GPU-related code (e.g., `torch.cuda`, `.cuda()`, `torch.cuda.amp`)
  - Search for TensorRT-related code
  - Search for deep learning framework-specific APIs (Transformer Engine, Flash Attention, etc.)
- **Exploration Agent 3 - External Library Dependencies**
  - Search for `import` and `from ... import` statements
  - Identify all third-party library dependencies
  - Check for libraries not supported by the NPU
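The GPU-dependency scan above can be sketched as a simple pattern search. This is a minimal illustration, not the skill's actual tooling; the pattern set and file extensions are assumptions to extend for a real scan:

```python
import re
from pathlib import Path

# Hypothetical starter patterns; extend per project.
GPU_PATTERNS = {
    "cuda_runtime": re.compile(r"\bcuda[A-Z]\w+"),            # cudaMalloc, cudaStreamCreate, ...
    "torch_cuda": re.compile(r"\btorch\.cuda\b|\.cuda\(\)"),  # torch.cuda.*, tensor.cuda()
    "flash_attn": re.compile(r"\b(?:from|import)\s+flash_attn\b"),
}

def scan_gpu_dependencies(root: str) -> list:
    """Walk source files and record (file, line_no, category) hits."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.suffix not in {".py", ".cu", ".cuh", ".cpp", ".h"}:
            continue
        for no, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for category, pattern in GPU_PATTERNS.items():
                if pattern.search(line):
                    hits.append((str(path), no, category))
    return hits
```

Each hit maps directly onto a bottleneck entry in Phase 2 (file location plus the offending API).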
#### Task 1.3: Generate Code Structure Report
Output the following information:
- Total number of project files and lines of code
- File type distribution (Python/CUDA/C++/others)
- List of main dependent libraries
- Core modules and their function descriptions
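The file and line counts for this report can be gathered with a short helper. A minimal sketch, assuming the same source-file extensions as the scan in Task 1.2:

```python
from collections import Counter
from pathlib import Path

def code_structure_stats(root: str) -> dict:
    """Count files and lines of code per extension (rough, no comment stripping)."""
    files = Counter()
    lines = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in {".py", ".cu", ".cuh", ".cpp", ".h"}:
            files[path.suffix] += 1
            lines[path.suffix] += sum(1 for _ in path.open(errors="ignore"))
    return {
        "total_files": sum(files.values()),
        "files_by_type": dict(files),
        "lines_by_type": dict(lines),
    }
```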
### Phase 2: Identify Bottlenecks in GPU-to-NPU Migration

#### Task 2.1: Operator Compatibility Analysis
Identify the compatibility of GPU-specific operators on the NPU by category (API names below are typical examples):

| Bottleneck Category | Typical GPU Implementation | NPU Alternative | Migration Difficulty |
|---|---|---|---|
| CUDA Core Operators | custom `__global__` kernel functions | Ascend C Operator / ATB | High |
| Memory Operations | `cudaMalloc`, `cudaMemcpy` | `aclrtMalloc`, `aclrtMemcpy` | Medium |
| Streams and Events | `cudaStreamCreate`, `cudaEventRecord` | `aclrtCreateStream`, `aclrtRecordEvent` | Medium |
| cuBLAS/cuDNN | `cublasGemmEx`, `cudnnConvolutionForward` | CANN built-in operators, Operator Fusion | High |
| Flash Attention | `flash_attn_func` | Ascend Flash Attention Operator | Medium |
| Custom Operators | PyTorch CUDA Extensions | ATC/ACL Operators | High |
| AMP/Mixed Precision | `torch.cuda.amp` | NPU AMP via `torch_npu` | Low |
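As a sketch, the runtime-API rows of the table can be captured in a lookup table for a first-pass mechanical rewrite. The ACL names come from the CANN AscendCL runtime; confirm each one against your CANN version before relying on this, and note that plain string substitution is deliberately naive:

```python
# CUDA runtime -> ACL runtime API mapping (first pass; verify per CANN version)
CUDA_TO_ACL = {
    "cudaMalloc": "aclrtMalloc",
    "cudaFree": "aclrtFree",
    "cudaMemcpy": "aclrtMemcpy",
    "cudaStreamCreate": "aclrtCreateStream",
    "cudaStreamDestroy": "aclrtDestroyStream",
    "cudaStreamSynchronize": "aclrtSynchronizeStream",
    "cudaEventCreate": "aclrtCreateEvent",
    "cudaEventRecord": "aclrtRecordEvent",
}

def rewrite_cuda_calls(source: str) -> str:
    """Naively substitute CUDA runtime calls with their ACL equivalents."""
    for cuda_api, acl_api in CUDA_TO_ACL.items():
        source = source.replace(cuda_api, acl_api)
    return source
```

A real adaptation pass would work on the AST or at least on token boundaries; this table is only the seed data.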
#### Task 2.2: Identify Specific Bottlenecks
Generate the following analysis for each GPU API call:
### Bottleneck ID: #001
- **File Location**: `src/attention/cuda_impl.cu:142`
- **GPU API**: `cudaStreamCreate(&stream)`
- **NPU Alternative**: `aclrtCreateStream(&stream)`
- **Migration Plan**:
  1. Replace CUDA headers with the ACL runtime header `acl/acl_rt.h`
  2. Replace API calls (`cudaStreamCreate` → `aclrtCreateStream`)
  3. Handle error-code differences (`cudaError_t` vs `aclError`)
- **Estimated Workload**: 0.5 person-days
- **Impact Scope**: Global stream management
#### Task 2.3: Generate Bottleneck List
Output the complete list of bottlenecks, sorted by impact scope and migration difficulty.
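The ordering rule can be sketched as a sort key. The field names `difficulty` and `impact` are hypothetical, mirroring the bottleneck template in Task 2.2:

```python
def sort_bottlenecks(bottlenecks: list) -> list:
    """Order bottlenecks hardest-first, widest-impact-first within a difficulty."""
    rank = {"High": 0, "Medium": 1, "Low": 2}
    return sorted(bottlenecks, key=lambda b: (rank[b["difficulty"]], -b["impact"]))
```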
### Phase 3: Write Adaptation Scripts

#### Task 3.1: Create NPU Adaptation Layer
Create adaptation scripts based on identified bottlenecks:

1. **Create `npu_compat.py` - Python Layer Compatibility Adaptation**

```python
import torch

try:
    import torch_npu  # Ascend PyTorch plugin; registers the torch.npu namespace
    _HAS_NPU = True
except ImportError:
    _HAS_NPU = False

def get_device():
    """Auto-detect the running device."""
    if _HAS_NPU and torch.npu.is_available():
        return "npu"
    elif torch.cuda.is_available():
        return "cuda"
    return "cpu"

def to_device(tensor):
    """Replace direct torch.cuda calls with device-agnostic placement."""
    device = get_device()
    if device == "npu":
        return tensor.npu()
    elif device == "cuda":
        return tensor.cuda()
    return tensor
```
2. **Create `npu_ops.py` - NPU Operator Encapsulation**
   - Encapsulate all CUDA core operators as NPU versions
   - Retain original interfaces with internal NPU adaptation implementation

3. **Create a compilation script**
   - Ascend C operator compilation commands
   - Dependency environment checks
   - Error diagnosis
#### Task 3.2: Modify Original Code
Generate modified code files; retain the original files and create adapted copies (e.g., with an `_npu` suffix):
- Replace all GPU-specific calls
- Add device detection logic
- Add a fallback mechanism
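The fallback mechanism could look like the decorator below. This is a minimal sketch; the convention that adapted functions accept a `device` keyword is an assumption about the modified code, not part of the original project:

```python
import functools

def with_cpu_fallback(fn):
    """Retry a device-specific function on CPU if the accelerator path fails."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except RuntimeError:
            # Hypothetical convention: adapted functions accept device=...
            kwargs["device"] = "cpu"
            return fn(*args, **kwargs)
    return wrapper
```

In practice the fallback should also log which operator fell back, so unsupported NPU operators surface in the Phase 5 report.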
### Phase 4: Design Verification Plan

#### Task 4.1: Create Verification Scripts
Generate a verification script based on the adaptation content:

```bash
#!/bin/bash
# NPU Adaptation Verification Script

echo "=== 1. Environment Check ==="
check_npu_env() {
    # Check NPU driver (Ascend devices typically appear as /dev/davinci*)
    ls -la /dev/davinci* 2>/dev/null || echo "Warning: NPU device not found"
    # Check CANN
    echo "$ASCEND_TOOLKIT_HOME"
    # Check Python packages
    python3 -c "import torch; print('PyTorch version:', torch.__version__)"
    python3 -c "import torch_npu; print('torch_npu installed')"
}
check_npu_env

echo "=== 2. Module Import Test ==="
test_imports() {
    cd <project_path>
    python3 -c "import npu_compat; print('npu_compat OK')"
    python3 -c "import npu_ops; print('npu_ops OK')"
}
test_imports

echo "=== 3. Function Verification ==="
test_functions() {
    # Run basic tests
    python3 -m pytest tests/test_npu_*.py -v
    # Verify operator precision
    python3 scripts/verify_precision.py
}
test_functions

echo "=== 4. Performance Benchmark ==="
benchmark() {
    python3 scripts/benchmark.py --device npu --compare cuda
}
benchmark
```
#### Task 4.2: Precision Verification Script
```python
import numpy as np

def verify_npu_precision(cuda_result, npu_result, rtol=1e-3, atol=1e-3):
    """Verify precision difference between NPU and GPU outputs"""
    diff = np.abs(cuda_result - npu_result)
    max_diff = np.max(diff)
    mean_diff = np.mean(diff)
    passed = np.allclose(cuda_result, npu_result, rtol=rtol, atol=atol)
    return {
        "passed": passed,
        "max_diff": max_diff,
        "mean_diff": mean_diff,
        "rtol": rtol,
        "atol": atol,
    }
```
### Phase 5: Generate Review Report

#### Task 5.1: Generate Markdown Report
Generate a complete review report based on the verification results, following this template:
# GPU to Ascend NPU Adaptation Review Report
*File: `CodeReview_Results_YYYY-MM-DD.md`*
## 1. Executive Summary
| Item | Content |
|-----|------|
| Original Code Repository | `<repo_url>` or `<local_path>` |
| Review Date | YYYY-MM-DD |
| Adaptation Status | ✅ Fully Adapted / ⚠️ Partially Adapted / ❌ Adaptation Failed |
| Total Identified Bottlenecks | XX |
| Adapted Bottlenecks | XX |
| Remaining Bottlenecks | XX |
## 2. Original Code Analysis
### 2.1 Code Structure Overview
- Total Files: XX
- Lines of Python Code: XX
- Lines of CUDA/C++ Code: XX
- Core Modules: ...
### 2.2 Dependency Analysis
| Library | Version | NPU Compatibility | Alternative |
|-----|------|----------|---------|
| torch | 2.x | ✅ Compatible | torch_npu |
| flash-attn | 2.x | ⚠️ Partial | Ascend Flash Attention |
## 3. Detailed Analysis of Migration Bottlenecks
### 3.1 Operator Compatibility Issues
#### Issue #001: CUDA Stream Management
- **File**: `src/utils/stream_manager.py:45`
- **GPU API**: `cudaStreamCreate`
- **Issue Description**: Uses CUDA streams to manage asynchronous execution
- **NPU Alternative**: `aclrtCreateStream`
- **Impact Scope**: Global, affects all asynchronous operations
- **Migration Suggestion**:
```python
# Before modification
import torch.cuda
stream = torch.cuda.Stream()

# After modification
import torch_npu
stream = torch.npu.Stream()
```
- **Status**: ✅ Adapted / ⚠️ Pending
#### Issue #002: Flash Attention Operator
- **File**: `src/attention/flash_attn_impl.py:78`
- **GPU API**: `flash_attn_func`
- **Issue Description**: Uses Flash Attention to accelerate attention computation
- **NPU Alternative**: Ascend flash_attn operator or MindSpore flash_attention
- **Impact Scope**: High, core inference performance
- **Migration Suggestion**:
```python
# Before modification
from flash_attn import flash_attn_func
output = flash_attn_func(q, k, v)

# After modification
# Option 1: Use torch_npu operators
import torch_npu
output = torch_npu.npu_flash_attention(q, k, v)

# Option 2: Use the ATB library
from ascend_toolkit import flash_attention
output = flash_attention(q, k, v)
```
- **Status**: ✅ Adapted / ⚠️ Pending
### 3.2 Model Loading and Weight Management Issues

#### Issue #003: GPU Weight Format
- **File**: `<file_path>`
- **Issue Description**: Weights stored in CUDA format will fail to load directly
- **Migration Suggestion**:
```python
# Before modification
state_dict = torch.load(weights_path)
model.load_state_dict(state_dict)

# After modification
state_dict = torch.load(weights_path, map_location='cpu')
# Convert weights
for k, v in state_dict.items():
    if isinstance(v, torch.Tensor):
        state_dict[k] = v.npu()
model.load_state_dict(state_dict)
```
- **Status**: ✅ Adapted
### 3.3 Computing Performance Bottlenecks

#### Issue #004: Missing Operator Fusion
- **File**: `src/model/inference.py:89`
- **Issue Description**: Multiple independent operators cause performance degradation
- **Migration Suggestion**: Use ATC for operator fusion optimization
- **Estimated Performance Improvement**: 20-30%
- **Status**: ⚠️ Pending
### 3.4 NPU Memory and KV Cache Management

#### Issue #005: Dynamic Memory Allocation
- **File**: `<file_path>`
- **Issue Description**: Uses CUDA dynamic memory allocation
- **Migration Suggestion**: Use a fixed memory pool
- **Status**: ⚠️ Pending
### 3.5 Python-C++ Boundary Issues

#### Issue #006: C++ Extension Compilation
- **File**: `src/utils/gpu_ext.cpp:145`
- **Issue Description**: CUDA C++ extensions need to be recompiled
- **Migration Suggestion**: Rewrite with Ascend C or use ATB
- **Status**: ⚠️ Pending
### 3.6 Concurrency and Asynchronous Issues

#### Issue #007: Multi-Stream Concurrency
- **File**: `src/server/request_handler.py:78`
- **Issue Description**: Uses CUDA streams to implement concurrency
- **Migration Suggestion**: Refactor to process-level concurrency
- **Status**: ⚠️ Pending
### 3.7 Configuration and Maintainability Issues

#### Issue #008: Hard-Coded Device
- **File**: `<file_path>`
- **Issue Description**: Device hard-coded in configuration (e.g., `device = "cuda"`)
- **Migration Suggestion**: Change to automatic device detection
- **Status**: ✅ Adapted
## 4. Adaptation Code List

### 4.1 Newly Added Files

| File Name | Function | Status |
|---|---|---|
| `npu_compat.py` | Device detection and compatibility layer | ✅ |
| `npu_ops.py` | NPU operator encapsulation | ✅ |
| `<compile_script>` | Compilation script | ✅ |
| `<verify_script>` | Verification script | ✅ |
### 4.2 Modified Files

| File Name | Modification Content | Status |
|---|---|---|
| `src/attention/flash_attn.py` | Replaced with NPU operators | ✅ |
| `<file_path>` | Added weight conversion | ✅ |
| `src/utils/stream_manager.py` | Stream adaptation | ✅ |
## 5. Verification Results

### 5.1 Environment Verification

### 5.2 Function Verification

### 5.3 Precision Verification

### 5.4 Issue Summary

| Issue Type | Quantity | Severity |
|---|---|---|
| Resolved | XX | - |
| Unresolved | XX | High/Medium/Low |
## 6. Adaptation Guide

### 6.1 Prerequisites

```bash
# 1. Install CANN Toolkit
# Download URL: https://www.hiascend.com/software/aiengine

# 2. Install torch_npu
pip install torch torch_npu

# 3. Verify installation
python3 -c "import torch; import torch_npu; print('NPU available:', torch.npu.is_available())"
```
### 6.2 Quick Adaptation Steps

**Step 1**: Clone and enter the project
```bash
git clone <repo_url>
cd <project_name>
```

**Step 2**: Install dependencies
```bash
pip install -r requirements-npu.txt
```

**Step 3**: Run the verification script

**Step 4**: Execute inference
```bash
python3 run_npu.py --model <model_path> --input <input_data>
```
### 6.3 Common Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| Import failure | CANN not installed correctly | Reconfigure environment variables |
| Operator not supported | NPU does not support the operator | Use an ATB alternative or a self-developed operator |
| Out of memory | Batch size too large | Reduce `batch_size` |
| Precision not up to standard | Mixed precision configuration issue | Check the AMP configuration |
## 7. Follow-Up Work Suggestions

### 7.1 Short-Term (Within 1 Week)

### 7.2 Mid-Term (Within 1 Month)

### 7.3 Long-Term

**Report Generation Time**: YYYY-MM-DD HH:mm:ss
**Adaptation Engineer**: AI Agent (NPU Adapter Reviewer)
**Report Version**: v1.0
#### Task 5.2: Output Report
Save the report to the current working directory as `CodeReview_Results_YYYY-MM-DD.md`.
## Output Requirements
1. **Report Format**: Must be Markdown format
2. **File Naming**: `CodeReview_Results_YYYY-MM-DD.md` (format: YYYY-MM-DD for the current run date)
3. **Save Location**: Current working directory
4. **Content Completeness**: Must include all sections mentioned above
## Special Handling Rules
### If Verification Passes Completely
- Output "Adaptation Successful" status
- Provide complete adaptation guide
- Include end-to-end execution instructions
### If Verification Does Not Pass Completely
- Detail each failed item
- Provide specific repair suggestions
- Provide modified code
- Mark parts requiring manual intervention
## Knowledge References
During execution, refer to the following materials (load on demand):
- `references/ascend_npu_best_practices.md` - Ascend NPU Best Practices
- `references/cann_migration_guide.md` - CANN Migration Guide
- `references/npu_python_api.md` - NPU Python API Reference
Please use this skill to complete the full adaptation review work for GPU to Ascend NPU.