# NPU Adapter Reviewer - GPU to Ascend NPU Adaptation Review Expert
This is an Agent Skill specifically designed for adapting GPU code to Huawei Ascend NPU. This skill covers the complete adaptation workflow: code analysis, bottleneck identification, adaptation script writing, verification plan design, and final report generation.
## Core Workflow

### Phase 1: Code Repository Acquisition and Analysis

#### Task 1.1: Obtain Source Code
Acquire the complete code repository based on user input (local path or GitHub link):

```bash
# If it's a GitHub link, clone first
git clone <repo_url> /tmp/gpu_code_base
cd /tmp/gpu_code_base

# If it's a local path, analyze directly
ls -la <local_path>
```
#### Task 1.2: Comprehensive Code Scanning
Analyze the code structure using parallel exploration:

- **Exploration Agent 1 - Code Structure Analysis**
  - Identify all Python files, CUDA files, and C++ files
  - Recognize the project directory structure
  - Locate main entry files and configuration files
- **Exploration Agent 2 - GPU Dependency Identification**
  - Search for CUDA API calls (e.g., `cudaMalloc`, `cudaMemcpy`, `cudaStreamCreate`, `cudaLaunchKernel`)
  - Search for PyTorch GPU-related code (e.g., `torch.cuda`, `.cuda()`, `torch.cuda.amp`)
  - Search for TensorRT-related code
  - Search for deep learning framework-specific APIs (Transformer Engine, Flash Attention, etc.)
- **Exploration Agent 3 - External Library Dependencies**
  - Search for `import` and `from ... import` statements
  - Identify all third-party library dependencies
  - Check for libraries not supported by the NPU
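The GPU-dependency scan above can be sketched as a simple pattern search. This is a minimal illustration, not the skill's actual tooling; the pattern set and file extensions are assumptions to extend for a real scan:

```python
import re
from pathlib import Path

# Hypothetical starter patterns; extend per project.
GPU_PATTERNS = {
    "cuda_runtime": re.compile(r"\bcuda[A-Z]\w+"),            # cudaMalloc, cudaStreamCreate, ...
    "torch_cuda": re.compile(r"\btorch\.cuda\b|\.cuda\(\)"),  # torch.cuda.*, tensor.cuda()
    "flash_attn": re.compile(r"\b(?:from|import)\s+flash_attn\b"),
}

def scan_gpu_dependencies(root: str) -> list:
    """Walk source files and record (file, line_no, category) hits."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.suffix not in {".py", ".cu", ".cuh", ".cpp", ".h"}:
            continue
        for no, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for category, pattern in GPU_PATTERNS.items():
                if pattern.search(line):
                    hits.append((str(path), no, category))
    return hits
```

Each hit maps directly onto a bottleneck entry in Phase 2 (file location plus the offending API).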
#### Task 1.3: Generate Code Structure Report
Output the following information:
- Total number of project files and lines of code
- File type distribution (Python/CUDA/C++/others)
- List of main dependent libraries
- Core modules and their function descriptions
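The file and line counts for this report can be gathered with a short helper. A minimal sketch, assuming the same source-file extensions as the scan in Task 1.2:

```python
from collections import Counter
from pathlib import Path

def code_structure_stats(root: str) -> dict:
    """Count files and lines of code per extension (rough, no comment stripping)."""
    files = Counter()
    lines = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in {".py", ".cu", ".cuh", ".cpp", ".h"}:
            files[path.suffix] += 1
            lines[path.suffix] += sum(1 for _ in path.open(errors="ignore"))
    return {
        "total_files": sum(files.values()),
        "files_by_type": dict(files),
        "lines_by_type": dict(lines),
    }
```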
### Phase 2: Identify Bottlenecks in GPU-to-NPU Migration

#### Task 2.1: Operator Compatibility Analysis
Identify the compatibility of GPU-specific operators on the NPU by category (API names below are typical examples):

| Bottleneck Category | Typical GPU Implementation | NPU Alternative | Migration Difficulty |
|---|---|---|---|
| CUDA Core Operators | custom `__global__` kernel functions | Ascend C Operator / ATB | High |
| Memory Operations | `cudaMalloc`, `cudaMemcpy` | `aclrtMalloc`, `aclrtMemcpy` | Medium |
| Streams and Events | `cudaStreamCreate`, `cudaEventRecord` | `aclrtCreateStream`, `aclrtRecordEvent` | Medium |
| cuBLAS/cuDNN | `cublasGemmEx`, `cudnnConvolutionForward` | CANN built-in operators, Operator Fusion | High |
| Flash Attention | `flash_attn_func` | Ascend Flash Attention Operator | Medium |
| Custom Operators | PyTorch CUDA Extensions | ATC/ACL Operators | High |
| AMP/Mixed Precision | `torch.cuda.amp` | NPU AMP via `torch_npu` | Low |
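As a sketch, the runtime-API rows of the table can be captured in a lookup table for a first-pass mechanical rewrite. The ACL names come from the CANN AscendCL runtime; confirm each one against your CANN version before relying on this, and note that plain string substitution is deliberately naive:

```python
# CUDA runtime -> ACL runtime API mapping (first pass; verify per CANN version)
CUDA_TO_ACL = {
    "cudaMalloc": "aclrtMalloc",
    "cudaFree": "aclrtFree",
    "cudaMemcpy": "aclrtMemcpy",
    "cudaStreamCreate": "aclrtCreateStream",
    "cudaStreamDestroy": "aclrtDestroyStream",
    "cudaStreamSynchronize": "aclrtSynchronizeStream",
    "cudaEventCreate": "aclrtCreateEvent",
    "cudaEventRecord": "aclrtRecordEvent",
}

def rewrite_cuda_calls(source: str) -> str:
    """Naively substitute CUDA runtime calls with their ACL equivalents."""
    for cuda_api, acl_api in CUDA_TO_ACL.items():
        source = source.replace(cuda_api, acl_api)
    return source
```

A real adaptation pass would work on the AST or at least on token boundaries; this table is only the seed data.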
#### Task 2.2: Identify Specific Bottlenecks
Generate the following analysis for each GPU API call:
### Bottleneck ID: #001
- **File Location**: `src/attention/cuda_impl.cu:142`
- **GPU API**: `cudaStreamCreate(&stream)`
- **NPU Alternative**: `aclrtCreateStream(&stream)`
- **Migration Plan**:
  1. Replace CUDA headers with the ACL runtime header `acl/acl_rt.h`
  2. Replace API calls (`cudaStreamCreate` → `aclrtCreateStream`)
  3. Handle error-code differences (`cudaError_t` vs `aclError`)
- **Estimated Workload**: 0.5 person-days
- **Impact Scope**: Global stream management
#### Task 2.3: Generate Bottleneck List
Output the complete list of bottlenecks, sorted by impact scope and migration difficulty.
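The ordering rule can be sketched as a sort key. The field names `difficulty` and `impact` are hypothetical, mirroring the bottleneck template in Task 2.2:

```python
def sort_bottlenecks(bottlenecks: list) -> list:
    """Order bottlenecks hardest-first, widest-impact-first within a difficulty."""
    rank = {"High": 0, "Medium": 1, "Low": 2}
    return sorted(bottlenecks, key=lambda b: (rank[b["difficulty"]], -b["impact"]))
```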
### Phase 3: Write Adaptation Scripts

#### Task 3.1: Create NPU Adaptation Layer
Create adaptation scripts based on identified bottlenecks:

1. **Create `npu_compat.py` - Python Layer Compatibility Adaptation**

```python
import torch

try:
    import torch_npu  # Ascend PyTorch plugin; registers the torch.npu namespace
    _HAS_NPU = True
except ImportError:
    _HAS_NPU = False

def get_device():
    """Auto-detect the running device."""
    if _HAS_NPU and torch.npu.is_available():
        return "npu"
    elif torch.cuda.is_available():
        return "cuda"
    return "cpu"

def to_device(tensor):
    """Replace direct torch.cuda calls with device-agnostic placement."""
    device = get_device()
    if device == "npu":
        return tensor.npu()
    elif device == "cuda":
        return tensor.cuda()
    return tensor
```
2. **Create `npu_ops.py` - NPU Operator Encapsulation**
   - Encapsulate all CUDA core operators as NPU versions
   - Retain original interfaces with internal NPU adaptation implementation

3. **Create a compilation script**
   - Ascend C operator compilation commands
   - Dependency environment checks
   - Error diagnosis
#### Task 3.2: Modify Original Code
Generate modified code files; retain the original files and create adapted copies (e.g., with an `_npu` suffix):
- Replace all GPU-specific calls
- Add device detection logic
- Add a fallback mechanism
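The fallback mechanism could look like the decorator below. This is a minimal sketch; the convention that adapted functions accept a `device` keyword is an assumption about the modified code, not part of the original project:

```python
import functools

def with_cpu_fallback(fn):
    """Retry a device-specific function on CPU if the accelerator path fails."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except RuntimeError:
            # Hypothetical convention: adapted functions accept device=...
            kwargs["device"] = "cpu"
            return fn(*args, **kwargs)
    return wrapper
```

In practice the fallback should also log which operator fell back, so unsupported NPU operators surface in the Phase 5 report.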
### Phase 4: Design Verification Plan

#### Task 4.1: Create Verification Scripts
Generate a verification script based on the adaptation content:

```bash
#!/bin/bash
# NPU Adaptation Verification Script

echo "=== 1. Environment Check ==="
check_npu_env() {
    # Check NPU driver (Ascend devices typically appear as /dev/davinci*)
    ls -la /dev/davinci* 2>/dev/null || echo "Warning: NPU device not found"
    # Check CANN
    echo "$ASCEND_TOOLKIT_HOME"
    # Check Python packages
    python3 -c "import torch; print('PyTorch version:', torch.__version__)"
    python3 -c "import torch_npu; print('torch_npu installed')"
}
check_npu_env

echo "=== 2. Module Import Test ==="
test_imports() {
    cd <project_path>
    python3 -c "import npu_compat; print('npu_compat OK')"
    python3 -c "import npu_ops; print('npu_ops OK')"
}
test_imports

echo "=== 3. Function Verification ==="
test_functions() {
    # Run basic tests
    python3 -m pytest tests/test_npu_*.py -v
    # Verify operator precision
    python3 scripts/verify_precision.py
}
test_functions

echo "=== 4. Performance Benchmark ==="
benchmark() {
    python3 scripts/benchmark.py --device npu --compare cuda
}
benchmark
```
#### Task 4.2: Precision Verification Script
```python
import numpy as np

def verify_npu_precision(cuda_result, npu_result, rtol=1e-3, atol=1e-3):
    """Verify precision difference between NPU and GPU outputs"""
    diff = np.abs(cuda_result - npu_result)
    max_diff = np.max(diff)
    mean_diff = np.mean(diff)
    passed = np.allclose(cuda_result, npu_result, rtol=rtol, atol=atol)
    return {
        "passed": passed,
        "max_diff": max_diff,
        "mean_diff": mean_diff,
        "rtol": rtol,
        "atol": atol,
    }
```
### Phase 5: Generate Review Report

#### Task 5.1: Generate Markdown Report
Generate a complete review report based on the verification results, following this template:
# GPU to Ascend NPU Adaptation Review Report
*File: `CodeReview_Results_YYYY-MM-DD.md`*
## 1. Executive Summary
| Item | Content |
|-----|------|
| Original Code Repository | `<repo_url>` or `<local_path>` |
| Review Date | YYYY-MM-DD |
| Adaptation Status | ✅ Fully Adapted / ⚠️ Partially Adapted / ❌ Adaptation Failed |
| Total Identified Bottlenecks | XX |
| Adapted Bottlenecks | XX |
| Remaining Bottlenecks | XX |
## 2. Original Code Analysis
### 2.1 Code Structure Overview
- Total Files: XX
- Lines of Python Code: XX
- Lines of CUDA/C++ Code: XX
- Core Modules: ...
### 2.2 Dependency Analysis
| Library | Version | NPU Compatibility | Alternative |
|-----|------|----------|---------|
| torch | 2.x | ✅ Compatible | torch_npu |
| flash-attn | 2.x | ⚠️ Partial | Ascend Flash Attention |
## 3. Detailed Analysis of Migration Bottlenecks
### 3.1 Operator Compatibility Issues
#### Issue #001: CUDA Stream Management
- **File**: `src/utils/stream_manager.py:45`
- **GPU API**: `cudaStreamCreate`
- **Issue Description**: Uses CUDA streams to manage asynchronous execution
- **NPU Alternative**: `aclrtCreateStream`
- **Impact Scope**: Global, affects all asynchronous operations
- **Migration Suggestion**:
```python
# Before modification
import torch.cuda
stream = torch.cuda.Stream()

# After modification
import torch_npu
stream = torch.npu.Stream()
```
- **Status**: ✅ Adapted / ⚠️ Pending
#### Issue #002: Flash Attention Operator
- **File**: `src/attention/flash_attn_impl.py:78`
- **GPU API**: `flash_attn_func`
- **Issue Description**: Uses Flash Attention to accelerate attention computation
- **NPU Alternative**: Ascend flash_attn operator or MindSpore flash_attention
- **Impact Scope**: High, core inference performance
- **Migration Suggestion**:
```python
# Before modification
from flash_attn import flash_attn_func
output = flash_attn_func(q, k, v)

# After modification
# Option 1: Use torch_npu operators
import torch_npu
output = torch_npu.npu_flash_attention(q, k, v)

# Option 2: Use the ATB library
from ascend_toolkit import flash_attention
output = flash_attention(q, k, v)
```
- **Status**: ✅ Adapted / ⚠️ Pending
### 3.2 Model Loading and Weight Management Issues

#### Issue #003: GPU Weight Format
- **File**: `<file_path>`
- **Issue Description**: Weights stored in CUDA format will fail to load directly
- **Migration Suggestion**:
```python
# Before modification
state_dict = torch.load(weights_path)
model.load_state_dict(state_dict)

# After modification
state_dict = torch.load(weights_path, map_location='cpu')
# Convert weights
for k, v in state_dict.items():
    if isinstance(v, torch.Tensor):
        state_dict[k] = v.npu()
model.load_state_dict(state_dict)
```
- **Status**: ✅ Adapted
### 3.3 Computing Performance Bottlenecks

#### Issue #004: Missing Operator Fusion
- **File**: `src/model/inference.py:89`
- **Issue Description**: Multiple independent operators cause performance degradation
- **Migration Suggestion**: Use ATC for operator fusion optimization
- **Estimated Performance Improvement**: 20-30%
- **Status**: ⚠️ Pending
### 3.4 NPU Memory and KV Cache Management

#### Issue #005: Dynamic Memory Allocation
- **File**: `<file_path>`
- **Issue Description**: Uses CUDA dynamic memory allocation
- **Migration Suggestion**: Use a fixed memory pool
- **Status**: ⚠️ Pending
### 3.5 Python-C++ Boundary Issues

#### Issue #006: C++ Extension Compilation
- **File**: `src/utils/gpu_ext.cpp:145`
- **Issue Description**: CUDA C++ extensions need to be recompiled
- **Migration Suggestion**: Rewrite with Ascend C or use ATB
- **Status**: ⚠️ Pending
### 3.6 Concurrency and Asynchronous Issues

#### Issue #007: Multi-Stream Concurrency
- **File**: `src/server/request_handler.py:78`
- **Issue Description**: Uses CUDA streams to implement concurrency
- **Migration Suggestion**: Refactor to process-level concurrency
- **Status**: ⚠️ Pending
### 3.7 Configuration and Maintainability Issues

#### Issue #008: Hard-Coded Device
- **File**: `<file_path>`
- **Issue Description**: Device hard-coded in configuration (e.g., `device = "cuda"`)
- **Migration Suggestion**: Change to automatic device detection
- **Status**: ✅ Adapted
## 4. Adaptation Code List

### 4.1 Newly Added Files

| File Name | Function | Status |
|---|---|---|
| `npu_compat.py` | Device detection and compatibility layer | ✅ |
| `npu_ops.py` | NPU operator encapsulation | ✅ |
| `<compile_script>` | Compilation script | ✅ |
| `<verify_script>` | Verification script | ✅ |
### 4.2 Modified Files

| File Name | Modification Content | Status |
|---|---|---|
| `src/attention/flash_attn.py` | Replaced with NPU operators | ✅ |
| `<file_path>` | Added weight conversion | ✅ |
| `src/utils/stream_manager.py` | Stream adaptation | ✅ |
## 5. Verification Results

### 5.1 Environment Verification

### 5.2 Function Verification

### 5.3 Precision Verification

### 5.4 Issue Summary

| Issue Type | Quantity | Severity |
|---|---|---|
| Resolved | XX | - |
| Unresolved | XX | High/Medium/Low |
## 6. Adaptation Guide

### 6.1 Prerequisites

```bash
# 1. Install CANN Toolkit
# Download URL: https://www.hiascend.com/software/aiengine

# 2. Install torch_npu
pip install torch torch_npu

# 3. Verify installation
python3 -c "import torch; import torch_npu; print('NPU available:', torch.npu.is_available())"
```
### 6.2 Quick Adaptation Steps

**Step 1**: Clone and enter the project
```bash
git clone <repo_url>
cd <project_name>
```

**Step 2**: Install dependencies
```bash
pip install -r requirements-npu.txt
```

**Step 3**: Run the verification script

**Step 4**: Execute inference
```bash
python3 run_npu.py --model <model_path> --input <input_data>
```
### 6.3 Common Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| Import failure | CANN not installed correctly | Reconfigure environment variables |
| Operator not supported | NPU does not support the operator | Use an ATB alternative or a self-developed operator |
| Out of memory | Batch size too large | Reduce `batch_size` |
| Precision not up to standard | Mixed precision configuration issue | Check the AMP configuration |
## 7. Follow-Up Work Suggestions

### 7.1 Short-Term (Within 1 Week)

### 7.2 Mid-Term (Within 1 Month)

### 7.3 Long-Term

**Report Generation Time**: YYYY-MM-DD HH:mm:ss
**Adaptation Engineer**: AI Agent (NPU Adapter Reviewer)
**Report Version**: v1.0
#### Task 5.2: Output Report
Save the report to the current working directory as `CodeReview_Results_YYYY-MM-DD.md`.
## Output Requirements
1. **Report Format**: Must be Markdown format
2. **File Naming**: `CodeReview_Results_YYYY-MM-DD.md` (format: YYYY-MM-DD for the current run date)
3. **Save Location**: Current working directory
4. **Content Completeness**: Must include all sections mentioned above
## Special Handling Rules
### If Verification Passes Completely
- Output "Adaptation Successful" status
- Provide complete adaptation guide
- Include end-to-end execution instructions
### If Verification Does Not Pass Completely
- Detail each failed item
- Provide specific repair suggestions
- Provide modified code
- Mark parts requiring manual intervention
## Knowledge References
During execution, refer to the following materials (load on demand):
- `references/ascend_npu_best_practices.md` - Ascend NPU Best Practices
- `references/cann_migration_guide.md` - CANN Migration Guide
- `references/npu_python_api.md` - NPU Python API Reference
Please use this skill to complete the full adaptation review work for GPU to Ascend NPU.