wan-ascend-adaptation
# Wan-Series Model Ascend NPU Adaptation Skill
## Purpose
Provide a systematic, step-by-step guide for adapting Wan-series (and similar DiT-based) video generation models from NVIDIA CUDA/GPU to Huawei Ascend NPU. The skill encodes 9 major adaptation domains covering every layer of the inference stack, from device initialization to distributed parallelism.
## When to Use
- Porting a Wan-series (Wan2.1 / Wan2.2) model from CUDA to Ascend NPU
- Adapting any DiT-based video diffusion model for Ascend hardware
- Optimizing inference performance on Ascend NPU (attention, quantization, VAE parallel)
- Setting up multi-card distributed inference on Atlas 800 series hardware
- Integrating MindIE SD acceleration library into a PyTorch video generation pipeline
## Adaptation Domains Overview
The adaptation work is organized into 9 domains. Each domain has a dedicated reference file under `references/` with detailed instructions, code patterns, and pitfalls.

| # | Domain | Reference File | Priority |
|---|---|---|---|
| 1 | Device Layer Adaptation | `01-device-layer.md` | P0 — Must |
| 2 | Operator Replacement | `02-operator-replacement.md` | P0 — Must |
| 3 | Precision Strategy | `03-precision-strategy.md` | P0 — Must |
| 4 | Attention Mechanism | `04-attention-mechanism.md` | P1 — Critical |
| 5 | Distributed Parallelism | `05-distributed-parallelism.md` | P1 — Critical |
| 6 | VAE Patch Parallel | `06-vae-patch-parallel.md` | P2 — Important |
| 7 | Model Quantization | `07-model-quantization.md` | P2 — Important |
| 8 | Sparse Attention (RainFusion) | `08-sparse-attention.md` | P2 — Important |
| 9 | Inference Pipeline Integration | `09-pipeline-integration.md` | P1 — Critical |
## Workflow
To adapt a Wan-series model to Ascend, follow these steps in order:
### Step 1: Device Layer Adaptation (Domain 1)
Read `references/01-device-layer.md` for complete guidance.

Key actions:
- Import `torch_npu` and `transfer_to_npu` at the entry point
- Configure NPU compile mode and internal format settings
- Replace `dist.init_process_group(backend="nccl")` with `backend="hccl"`
- Replace all `torch.amp.autocast('cuda', ...)` with `autocast('npu', ...)`
- Replace device type checks from `'cuda'` to `'npu'`
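The entry-point pattern can be sketched as a platform probe. This is a minimal sketch, not the actual port: `resolve_backend` is a hypothetical helper, and it assumes the standard `torch_npu` import convention where `transfer_to_npu` transparently remaps existing CUDA calls.

```python
import importlib.util

def resolve_backend():
    """Pick the device string and distributed backend for the current host.

    On an Ascend host, importing torch_npu registers the 'npu' device, and
    the transfer_to_npu patch redirects existing 'cuda' call sites:
        import torch_npu
        from torch_npu.contrib import transfer_to_npu  # remaps cuda -> npu
    HCCL replaces NCCL as the collective-communication backend.
    """
    if importlib.util.find_spec("torch_npu") is not None:
        return {"device": "npu", "dist_backend": "hccl"}
    return {"device": "cuda", "dist_backend": "nccl"}
```

Keeping this probe at the entry point means the rest of the pipeline can stay device-agnostic and read the device string from one place.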
### Step 2: Operator Replacement (Domain 2)
Read `references/02-operator-replacement.md` for complete guidance.

Key actions:
- Replace RMSNorm with `torch_npu.npu_rms_norm()`
- Rewrite the LayerNorm forward to remove `.float()` type casting
- Replace RoPE with the fused operator `mindiesd.rotary_position_embedding()`
- Optionally enable `mindiesd.fast_layernorm` via the `FAST_LAYERNORM` env var
- Replace Flash Attention with the multi-backend dispatch `mindiesd.attention_forward()`
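For reference, the math that the fused `torch_npu.npu_rms_norm()` kernel computes is plain RMSNorm. A pure-Python reference (illustrative only, operating on lists rather than tensors) is useful for validating the fused operator's output during the port:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm: x / sqrt(mean(x^2) + eps) * weight.

    The fused NPU kernel computes this in one launch, avoiding the
    intermediate .float() upcast that the eager PyTorch version needs.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

Comparing this reference against the fused operator on a few sample tensors is a quick sanity check before trusting it across the whole model.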
### Step 3: Precision Strategy (Domain 3)
Read `references/03-precision-strategy.md` for complete guidance.

Key actions:
- Lower the sinusoidal embedding from float64 to float32
- Lower RoPE frequencies from complex128 to complex64
- Change the autocast dtype from float32 to bfloat16
- Remove `.float()` type conversions in normalization layers
- Use the `PRECISION` env var to control the random-number device for cross-platform reproducibility
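To see why the float64-to-float32 lowering is safe for sinusoidal embeddings, the computation can be sketched in pure Python, using `struct` to emulate float32 rounding (the helper names here are illustrative, not from the codebase):

```python
import math
import struct

def as_float32(x):
    """Round-trip a Python float (float64) through IEEE-754 float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sinusoidal_embedding(pos, dim, max_period=10000.0):
    """Sinusoidal timestep embedding computed at float32 precision.

    The frequencies span ~[1e-4, 1], comfortably inside float32 range,
    so the lowering costs only ~1e-7 relative error per element.
    """
    half = dim // 2
    freqs = [as_float32(math.exp(-math.log(max_period) * i / half)) for i in range(half)]
    args = [as_float32(pos * f) for f in freqs]
    return [as_float32(math.sin(a)) for a in args] + [as_float32(math.cos(a)) for a in args]
```

The same argument applies to the complex128-to-complex64 RoPE lowering: each complex frequency is just a (cos, sin) pair of well-conditioned float values.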
### Step 4: Attention Mechanism Adaptation (Domain 4)
Read `references/04-attention-mechanism.md` for complete guidance.

Key actions:
- Implement multi-backend attention dispatch via the `ALGO` env var (0/1/3)
- Create `xFuserLongContextAttention`, combining Ulysses + Ring Attention
- Integrate Attention Cache via `mindiesd.CacheAgent`
- Add sub-head splitting support via the `USE_SUB_HEAD` env var
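The `ALGO` dispatch can be sketched as a small lookup keyed on the env var. The backend names come from this skill's environment-variable table; the default of 0 and the helper name are assumptions for illustration:

```python
import os

# Backend ids documented for the ALGO env var in this skill.
_ATTENTION_BACKENDS = {
    0: "fused_attn_score",
    1: "ascend_laser_attention",
    3: "npu_fused_infer",
}

def select_attention_backend(env=os.environ):
    """Resolve the attention backend from the ALGO env var (assumed default: 0)."""
    algo = int(env.get("ALGO", "0"))
    if algo not in _ATTENTION_BACKENDS:
        raise ValueError(f"ALGO={algo} is not one of {sorted(_ATTENTION_BACKENDS)}")
    return _ATTENTION_BACKENDS[algo]
```

Failing fast on an unknown `ALGO` value avoids silently falling back to a slower backend mid-run.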
### Step 5: Distributed Parallelism Refactoring (Domain 5)
Read `references/05-distributed-parallelism.md` for complete guidance.

Key actions:
- Implement `ParallelConfig` with 4D parallelism: TP × SP × CFG
- Create `RankGenerator` for orthogonal process group assignment
- Create `GroupCoordinator` with dual-channel communication (HCCL + Gloo)
- Implement `TensorParallelApplicator` for automatic model sharding
- Implement CFG parallel to halve the sampling loop's forward passes
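The orthogonal-group assignment that `RankGenerator` performs is pure index arithmetic. A sketch, assuming the common convention that TP is the fastest-varying dimension in the rank grid (the actual layout in the Ascend port may differ):

```python
from itertools import product

def orthogonal_groups(tp, sp, cfg):
    """Enumerate orthogonal process groups for a TP x SP x CFG rank grid.

    Each rank belongs to exactly one group per parallel dimension, so
    collectives along one dimension never cross another dimension's groups.
    """
    def rank(c, s, t):
        return (c * sp + s) * tp + t

    tp_groups = [[rank(c, s, t) for t in range(tp)] for c, s in product(range(cfg), range(sp))]
    sp_groups = [[rank(c, s, t) for s in range(sp)] for c, t in product(range(cfg), range(tp))]
    cfg_groups = [[rank(c, s, t) for c in range(cfg)] for s, t in product(range(sp), range(tp))]
    return tp_groups, sp_groups, cfg_groups
```

On an 8-card Atlas node, for example, `tp=2, sp=2, cfg=2` yields four 2-rank groups per dimension, and each `dist.new_group` call maps to one of these lists.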
### Step 6: VAE Patch Parallel (Domain 6)
Read `references/06-vae-patch-parallel.md` for complete guidance.

Key actions:
- Implement spatial H×W slicing across NPUs
- Monkey-patch `F.conv3d`, `F.conv2d`, `F.interpolate`, and `F.pad` for boundary exchange
- Use P2P communication for neighbor boundary data exchange
- Adjust the VAE CausalConv3d padding strategy for compatibility
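The slicing index math can be sketched as follows. This shows only the partitioning logic along one spatial axis; the halo width, the helper name, and the last-rank-takes-remainder policy are assumptions, and the actual boundary exchange uses P2P send/recv between neighbor ranks:

```python
def patch_slices(height, world_size, halo):
    """Split `height` rows across `world_size` NPUs with `halo` rows of overlap.

    Returns (start, stop) row ranges including the halo rows each rank must
    receive from its neighbors so that convolutions near patch edges see the
    same context as in the unsharded tensor.
    """
    base = height // world_size
    slices = []
    for rank in range(world_size):
        start = rank * base
        # Last rank absorbs the remainder when height % world_size != 0.
        stop = height if rank == world_size - 1 else start + base
        slices.append((max(0, start - halo), min(height, stop + halo)))
    return slices
```

The monkey-patched `F.conv3d`/`F.conv2d` then operate on these extended patches and crop the halo rows from their outputs.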
### Step 7: Model Quantization (Domain 7)
Read `references/07-model-quantization.md` for complete guidance.

Key actions:
- Use `msmodelslim` for W8A8 dynamic quantization
- Integrate `mindiesd.quantize()` for runtime quantization loading
- Handle FSDP + float8 compatibility via `patch_cast_buffers_for_float8()`
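The math underlying W8A8 dynamic quantization is symmetric int8 scaling. This is the textbook formulation, shown here only as orientation; `msmodelslim`'s actual recipe (per-channel vs. per-token granularity, rounding mode, calibration) may differ:

```python
def quantize_w8a8_dynamic(row):
    """Per-row symmetric int8 quantization: scale = max|x| / 127.

    "Dynamic" means the activation scale is computed per forward pass
    from the live tensor, rather than fixed at calibration time.
    """
    amax = max(abs(v) for v in row) or 1.0  # guard against an all-zero row
    scale = amax / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in row]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes and their scale."""
    return [v * scale for v in q]
```

Weights are quantized once offline with the same math; activations go through it on every step, which is why the fused int8 matmul path matters for throughput.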
### Step 8: Sparse Attention — RainFusion (Domain 8)
Read `references/08-sparse-attention.md` for complete guidance.

Key actions:
- Implement RainFusion v1 (window-based Local/Global adaptive)
- Implement RainFusion v2 (blockwise Top-K sparse)
- Configure `skip_timesteps` for the quality-speed tradeoff
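The blockwise Top-K selection at the heart of the v2 scheme reduces to a per-row argsort over block scores. A sketch under the assumption that `block_scores[q][j]` is some cheap proxy score (e.g. pooled QK^T) for query block `q` attending to key block `j`:

```python
def topk_blocks(block_scores, k):
    """RainFusion-v2-style block selection: keep the k highest-scoring
    key blocks per query block; attention for all other blocks is skipped.

    Returns the kept key-block indices per query block, in ascending order
    so downstream gather/scatter stays contiguous.
    """
    kept = []
    for row in block_scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        kept.append(sorted(order[:k]))
    return kept
```

With block size B, this turns an O(S^2) attention map into O(S * k * B) work, which is where the speedup comes from; `skip_timesteps` controls on which denoising steps the full (dense) attention is still used.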
### Step 9: Pipeline Integration (Domain 9)
Read `references/09-pipeline-integration.md` for complete guidance.

Key actions:
- Add warm-up generation steps for NPU operator compilation
- Configure `T5_LOAD_CPU` for a flexible T5 loading strategy
- Add RoPE frequency cache (`freqs_list`) lifecycle management
- Implement the multi-resolution VAE decode condition (`rank < 8`)
- Add performance timing with `stream.synchronize()`
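The warm-up-then-time pattern can be sketched as a generic wrapper. The helper is hypothetical; `synchronize` stands in for `stream.synchronize()`, which must run before reading the clock on an asynchronous device, or the measurement only captures kernel launch time:

```python
import time

def timed_generate(pipeline, warmup=2, iters=1, synchronize=lambda: None):
    """Run warm-up passes before timing a generation pipeline.

    The first NPU passes trigger operator compilation and caching, so
    their latency is not representative and is discarded here.
    """
    for _ in range(warmup):
        pipeline()  # compile-and-cache pass; result and timing discarded
    synchronize()
    start = time.perf_counter()
    results = [pipeline() for _ in range(iters)]
    synchronize()  # drain the device queue before stopping the clock
    return results, (time.perf_counter() - start) / iters
```

On the real pipeline, `synchronize` would be the NPU stream's synchronize call and `pipeline` a closure over the fixed prompt and generation config.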
## Key Environment Variables
| Variable | Default | Description |
|---|---|---|
| `ALGO` | - | Attention algorithm: 0=fused_attn_score, 1=ascend_laser_attention, 3=npu_fused_infer |
| `FAST_LAYERNORM` | - | Enable mindiesd fast LayerNorm |
| `USE_SUB_HEAD` | - | Sub-head group size for attention splitting |
| `T5_LOAD_CPU` | - | Load T5 model on CPU to save NPU memory |
| `PRECISION` | - | Generate random numbers on CPU for cross-platform reproducibility |
| - | - | Enable FA-AllToAll communication overlap |
| - | - | NPU memory allocation strategy |
| - | - | NPU task queue optimization |
| - | - | CPU affinity configuration |
## Key Dependencies
| Library | Purpose |
|---|---|
| `torch_npu` | PyTorch Ascend NPU backend |
| `mindiesd` | MindIE Stable Diffusion acceleration (FA, RoPE, LayerNorm, quantize) |
| `msmodelslim` | Huawei model compression toolkit (W8A8 quantization) |
| - | Sequence parallel framework (Ulysses + Ring Attention) |
| - | Ascend Transformer Boost operators |
| - | ATB fused matmul-allreduce operators |
## Notes
- This skill is derived from comparing Wan2.2-Original (CUDA) and Wan2.2-Ascend (NPU) codebases
- The Ascend version removes S2V (Speech-to-Video) and Animate tasks, focusing on T2V, I2V, and TI2V
- Hardware target: Atlas 800I A2 / Atlas 800T A2 with 8×64G NPU
- All adaptation patterns are applicable to similar DiT-based video diffusion architectures