wan-ascend-adaptation
# Wan-Series Model Ascend NPU Adaptation Skill
## Purpose
Provide a systematic, step-by-step guide for adapting Wan-series (and similar DiT-based) video generation models from NVIDIA CUDA/GPU to Huawei Ascend NPU. The skill encodes 9 major adaptation domains covering every layer of the inference stack, from device initialization to distributed parallelism.
## When to Use
- Porting a Wan-series (Wan2.1 / Wan2.2) model from CUDA to Ascend NPU
- Adapting any DiT-based video diffusion model for Ascend hardware
- Optimizing inference performance on Ascend NPU (attention, quantization, VAE parallel)
- Setting up multi-card distributed inference on Atlas 800 series hardware
- Integrating MindIE SD acceleration library into a PyTorch video generation pipeline
## Adaptation Domains Overview
The adaptation work is organized into 9 domains. Each domain has a dedicated reference file under `references/` with detailed instructions, code patterns, and pitfalls.

| # | Domain | Reference File | Priority |
|---|---|---|---|
| 1 | Device Layer Adaptation | `01-device-layer.md` | P0 — Must |
| 2 | Operator Replacement | `02-operator-replacement.md` | P0 — Must |
| 3 | Precision Strategy | `03-precision-strategy.md` | P0 — Must |
| 4 | Attention Mechanism | `04-attention-mechanism.md` | P1 — Critical |
| 5 | Distributed Parallelism | `05-distributed-parallelism.md` | P1 — Critical |
| 6 | VAE Patch Parallel | `06-vae-patch-parallel.md` | P2 — Important |
| 7 | Model Quantization | `07-model-quantization.md` | P2 — Important |
| 8 | Sparse Attention (RainFusion) | `08-sparse-attention.md` | P2 — Important |
| 9 | Inference Pipeline Integration | `09-pipeline-integration.md` | P1 — Critical |
## Workflow
To adapt a Wan-series model to Ascend, follow these steps in order:
### Step 1: Device Layer Adaptation (Domain 1)
Read `references/01-device-layer.md` for complete guidance.

Key actions:
- Import `torch_npu` and `transfer_to_npu` at the entry point
- Configure NPU compile mode and internal format settings
- Replace `dist.init_process_group(backend="nccl")` with `backend="hccl"`
- Replace all `torch.amp.autocast('cuda', ...)` with `autocast('npu', ...)`
- Replace device type checks from `'cuda'` to `'npu'`
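The entry-point pattern can be sketched as a platform probe. This is a minimal sketch, not the actual port: `resolve_backend` is a hypothetical helper, and it assumes the standard `torch_npu` import convention where `transfer_to_npu` transparently remaps existing CUDA calls.

```python
import importlib.util

def resolve_backend():
    """Pick the device string and distributed backend for the current host.

    On an Ascend host, importing torch_npu registers the 'npu' device, and
    the transfer_to_npu patch redirects existing 'cuda' call sites:
        import torch_npu
        from torch_npu.contrib import transfer_to_npu  # remaps cuda -> npu
    HCCL replaces NCCL as the collective-communication backend.
    """
    if importlib.util.find_spec("torch_npu") is not None:
        return {"device": "npu", "dist_backend": "hccl"}
    return {"device": "cuda", "dist_backend": "nccl"}
```

Keeping this probe at the entry point means the rest of the pipeline can stay device-agnostic and read the device string from one place.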
### Step 2: Operator Replacement (Domain 2)
Read `references/02-operator-replacement.md` for complete guidance.

Key actions:
- Replace RMSNorm with `torch_npu.npu_rms_norm()`
- Rewrite the LayerNorm forward to remove `.float()` type casting
- Replace RoPE with the fused operator `mindiesd.rotary_position_embedding()`
- Optionally enable `mindiesd.fast_layernorm` via the `FAST_LAYERNORM` env var
- Replace Flash Attention with the multi-backend dispatch `mindiesd.attention_forward()`
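For reference, the math that the fused `torch_npu.npu_rms_norm()` kernel computes is plain RMSNorm. A pure-Python reference (illustrative only, operating on lists rather than tensors) is useful for validating the fused operator's output during the port:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm: x / sqrt(mean(x^2) + eps) * weight.

    The fused NPU kernel computes this in one launch, avoiding the
    intermediate .float() upcast that the eager PyTorch version needs.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

Comparing this reference against the fused operator on a few sample tensors is a quick sanity check before trusting it across the whole model.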
### Step 3: Precision Strategy (Domain 3)
Read `references/03-precision-strategy.md` for complete guidance.

Key actions:
- Lower the sinusoidal embedding from float64 to float32
- Lower RoPE frequencies from complex128 to complex64
- Change the autocast dtype from float32 to bfloat16
- Remove `.float()` type conversions in normalization layers
- Use the `PRECISION` env var to control the random-number device for cross-platform reproducibility
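To see why the float64-to-float32 lowering is safe for sinusoidal embeddings, the computation can be sketched in pure Python, using `struct` to emulate float32 rounding (the helper names here are illustrative, not from the codebase):

```python
import math
import struct

def as_float32(x):
    """Round-trip a Python float (float64) through IEEE-754 float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sinusoidal_embedding(pos, dim, max_period=10000.0):
    """Sinusoidal timestep embedding computed at float32 precision.

    The frequencies span ~[1e-4, 1], comfortably inside float32 range,
    so the lowering costs only ~1e-7 relative error per element.
    """
    half = dim // 2
    freqs = [as_float32(math.exp(-math.log(max_period) * i / half)) for i in range(half)]
    args = [as_float32(pos * f) for f in freqs]
    return [as_float32(math.sin(a)) for a in args] + [as_float32(math.cos(a)) for a in args]
```

The same argument applies to the complex128-to-complex64 RoPE lowering: each complex frequency is just a (cos, sin) pair of well-conditioned float values.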
### Step 4: Attention Mechanism Adaptation (Domain 4)
Read `references/04-attention-mechanism.md` for complete guidance.

Key actions:
- Implement multi-backend attention dispatch via the `ALGO` env var (0/1/3)
- Create `xFuserLongContextAttention`, combining Ulysses + Ring Attention
- Integrate Attention Cache via `mindiesd.CacheAgent`
- Add sub-head splitting support via the `USE_SUB_HEAD` env var
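The `ALGO` dispatch can be sketched as a small lookup keyed on the env var. The backend names come from this skill's environment-variable table; the default of 0 and the helper name are assumptions for illustration:

```python
import os

# Backend ids documented for the ALGO env var in this skill.
_ATTENTION_BACKENDS = {
    0: "fused_attn_score",
    1: "ascend_laser_attention",
    3: "npu_fused_infer",
}

def select_attention_backend(env=os.environ):
    """Resolve the attention backend from the ALGO env var (assumed default: 0)."""
    algo = int(env.get("ALGO", "0"))
    if algo not in _ATTENTION_BACKENDS:
        raise ValueError(f"ALGO={algo} is not one of {sorted(_ATTENTION_BACKENDS)}")
    return _ATTENTION_BACKENDS[algo]
```

Failing fast on an unknown `ALGO` value avoids silently falling back to a slower backend mid-run.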
### Step 5: Distributed Parallelism Refactoring (Domain 5)
Read `references/05-distributed-parallelism.md` for complete guidance.

Key actions:
- Implement `ParallelConfig` with 4D parallelism: TP × SP × CFG
- Create `RankGenerator` for orthogonal process group assignment
- Create `GroupCoordinator` with dual-channel communication (HCCL + Gloo)
- Implement `TensorParallelApplicator` for automatic model sharding
- Implement CFG parallel to halve the sampling loop's forward passes
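The orthogonal-group assignment that `RankGenerator` performs is pure index arithmetic. A sketch, assuming the common convention that TP is the fastest-varying dimension in the rank grid (the actual layout in the Ascend port may differ):

```python
from itertools import product

def orthogonal_groups(tp, sp, cfg):
    """Enumerate orthogonal process groups for a TP x SP x CFG rank grid.

    Each rank belongs to exactly one group per parallel dimension, so
    collectives along one dimension never cross another dimension's groups.
    """
    def rank(c, s, t):
        return (c * sp + s) * tp + t

    tp_groups = [[rank(c, s, t) for t in range(tp)] for c, s in product(range(cfg), range(sp))]
    sp_groups = [[rank(c, s, t) for s in range(sp)] for c, t in product(range(cfg), range(tp))]
    cfg_groups = [[rank(c, s, t) for c in range(cfg)] for s, t in product(range(sp), range(tp))]
    return tp_groups, sp_groups, cfg_groups
```

On an 8-card Atlas node, for example, `tp=2, sp=2, cfg=2` yields four 2-rank groups per dimension, and each `dist.new_group` call maps to one of these lists.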
### Step 6: VAE Patch Parallel (Domain 6)
Read `references/06-vae-patch-parallel.md` for complete guidance.

Key actions:
- Implement spatial H×W slicing across NPUs
- Monkey-patch `F.conv3d`, `F.conv2d`, `F.interpolate`, and `F.pad` for boundary exchange
- Use P2P communication for neighbor boundary data exchange
- Adjust the VAE CausalConv3d padding strategy for compatibility
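The slicing index math can be sketched as follows. This shows only the partitioning logic along one spatial axis; the halo width, the helper name, and the last-rank-takes-remainder policy are assumptions, and the actual boundary exchange uses P2P send/recv between neighbor ranks:

```python
def patch_slices(height, world_size, halo):
    """Split `height` rows across `world_size` NPUs with `halo` rows of overlap.

    Returns (start, stop) row ranges including the halo rows each rank must
    receive from its neighbors so that convolutions near patch edges see the
    same context as in the unsharded tensor.
    """
    base = height // world_size
    slices = []
    for rank in range(world_size):
        start = rank * base
        # Last rank absorbs the remainder when height % world_size != 0.
        stop = height if rank == world_size - 1 else start + base
        slices.append((max(0, start - halo), min(height, stop + halo)))
    return slices
```

The monkey-patched `F.conv3d`/`F.conv2d` then operate on these extended patches and crop the halo rows from their outputs.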
### Step 7: Model Quantization (Domain 7)
Read `references/07-model-quantization.md` for complete guidance.

Key actions:
- Use `msmodelslim` for W8A8 dynamic quantization
- Integrate `mindiesd.quantize()` for runtime quantization loading
- Handle FSDP + float8 compatibility via `patch_cast_buffers_for_float8()`
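The math underlying W8A8 dynamic quantization is symmetric int8 scaling. This is the textbook formulation, shown here only as orientation; `msmodelslim`'s actual recipe (per-channel vs. per-token granularity, rounding mode, calibration) may differ:

```python
def quantize_w8a8_dynamic(row):
    """Per-row symmetric int8 quantization: scale = max|x| / 127.

    "Dynamic" means the activation scale is computed per forward pass
    from the live tensor, rather than fixed at calibration time.
    """
    amax = max(abs(v) for v in row) or 1.0  # guard against an all-zero row
    scale = amax / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in row]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes and their scale."""
    return [v * scale for v in q]
```

Weights are quantized once offline with the same math; activations go through it on every step, which is why the fused int8 matmul path matters for throughput.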
### Step 8: Sparse Attention — RainFusion (Domain 8)
Read `references/08-sparse-attention.md` for complete guidance.

Key actions:
- Implement RainFusion v1 (window-based Local/Global adaptive)
- Implement RainFusion v2 (blockwise Top-K sparse)
- Configure `skip_timesteps` for the quality-speed tradeoff
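The blockwise Top-K selection at the heart of the v2 scheme reduces to a per-row argsort over block scores. A sketch under the assumption that `block_scores[q][j]` is some cheap proxy score (e.g. pooled QK^T) for query block `q` attending to key block `j`:

```python
def topk_blocks(block_scores, k):
    """RainFusion-v2-style block selection: keep the k highest-scoring
    key blocks per query block; attention for all other blocks is skipped.

    Returns the kept key-block indices per query block, in ascending order
    so downstream gather/scatter stays contiguous.
    """
    kept = []
    for row in block_scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        kept.append(sorted(order[:k]))
    return kept
```

With block size B, this turns an O(S^2) attention map into O(S * k * B) work, which is where the speedup comes from; `skip_timesteps` controls on which denoising steps the full (dense) attention is still used.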
### Step 9: Pipeline Integration (Domain 9)
Read `references/09-pipeline-integration.md` for complete guidance.

Key actions:
- Add warm-up generation steps for NPU operator compilation
- Configure `T5_LOAD_CPU` for a flexible T5 loading strategy
- Add RoPE frequency cache (`freqs_list`) lifecycle management
- Implement the multi-resolution VAE decode condition (`rank < 8`)
- Add performance timing with `stream.synchronize()`
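The warm-up-then-time pattern can be sketched as a generic wrapper. The helper is hypothetical; `synchronize` stands in for `stream.synchronize()`, which must run before reading the clock on an asynchronous device, or the measurement only captures kernel launch time:

```python
import time

def timed_generate(pipeline, warmup=2, iters=1, synchronize=lambda: None):
    """Run warm-up passes before timing a generation pipeline.

    The first NPU passes trigger operator compilation and caching, so
    their latency is not representative and is discarded here.
    """
    for _ in range(warmup):
        pipeline()  # compile-and-cache pass; result and timing discarded
    synchronize()
    start = time.perf_counter()
    results = [pipeline() for _ in range(iters)]
    synchronize()  # drain the device queue before stopping the clock
    return results, (time.perf_counter() - start) / iters
```

On the real pipeline, `synchronize` would be the NPU stream's synchronize call and `pipeline` a closure over the fixed prompt and generation config.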
## Key Environment Variables
| Variable | Default | Description |
|---|---|---|
| `ALGO` | - | Attention algorithm: 0=fused_attn_score, 1=ascend_laser_attention, 3=npu_fused_infer |
| `FAST_LAYERNORM` | - | Enable mindiesd fast LayerNorm |
| `USE_SUB_HEAD` | - | Sub-head group size for attention splitting |
| `T5_LOAD_CPU` | - | Load T5 model on CPU to save NPU memory |
| `PRECISION` | - | Generate random numbers on CPU for cross-platform reproducibility |
| - | - | Enable FA-AllToAll communication overlap |
| - | - | NPU memory allocation strategy |
| - | - | NPU task queue optimization |
| - | - | CPU affinity configuration |
## Key Dependencies
| Library | Purpose |
|---|---|
| `torch_npu` | PyTorch Ascend NPU backend |
| `mindiesd` | MindIE Stable Diffusion acceleration (FA, RoPE, LayerNorm, quantize) |
| `msmodelslim` | Huawei model compression toolkit (W8A8 quantization) |
| - | Sequence parallel framework (Ulysses + Ring Attention) |
| - | Ascend Transformer Boost operators |
| - | ATB fused matmul-allreduce operators |
## Notes
- This skill is derived from comparing Wan2.2-Original (CUDA) and Wan2.2-Ascend (NPU) codebases
- The Ascend version removes S2V (Speech-to-Video) and Animate tasks, focusing on T2V, I2V, and TI2V
- Hardware target: Atlas 800I A2 / Atlas 800T A2 with 8×64G NPU
- All adaptation patterns are applicable to similar DiT-based video diffusion architectures