wan-ascend-adaptation


Wan-Series Model Ascend NPU Adaptation Skill


Purpose


Provide a systematic, step-by-step guide for adapting Wan-series (and similar DiT-based) video generation models from NVIDIA CUDA/GPU to Huawei Ascend NPU. The skill encodes 9 major adaptation domains covering every layer of the inference stack, from device initialization to distributed parallelism.

When to Use


  • Porting a Wan-series (Wan2.1 / Wan2.2) model from CUDA to Ascend NPU
  • Adapting any DiT-based video diffusion model for Ascend hardware
  • Optimizing inference performance on Ascend NPU (attention, quantization, VAE parallel)
  • Setting up multi-card distributed inference on Atlas 800 series hardware
  • Integrating MindIE SD acceleration library into a PyTorch video generation pipeline

Adaptation Domains Overview


The adaptation work is organized into 9 domains. Each domain has a dedicated reference file under `references/` with detailed instructions, code patterns, and pitfalls.

| # | Domain | Reference File | Priority |
|---|--------|----------------|----------|
| 1 | Device Layer Adaptation | `references/01-device-layer.md` | P0 — Must |
| 2 | Operator Replacement | `references/02-operator-replacement.md` | P0 — Must |
| 3 | Precision Strategy | `references/03-precision-strategy.md` | P0 — Must |
| 4 | Attention Mechanism | `references/04-attention-mechanism.md` | P1 — Critical |
| 5 | Distributed Parallelism | `references/05-distributed-parallelism.md` | P1 — Critical |
| 6 | VAE Patch Parallel | `references/06-vae-patch-parallel.md` | P2 — Important |
| 7 | Model Quantization | `references/07-model-quantization.md` | P2 — Important |
| 8 | Sparse Attention (RainFusion) | `references/08-sparse-attention.md` | P2 — Important |
| 9 | Inference Pipeline Integration | `references/09-pipeline-integration.md` | P1 — Critical |

Workflow


To adapt a Wan-series model to Ascend, follow these steps in order:

Step 1: Device Layer Adaptation (Domain 1)


Read `references/01-device-layer.md` for complete guidance.
Key actions:
  • Import `torch_npu` and `transfer_to_npu` at the entry point
  • Configure NPU compile mode and internal format settings
  • Replace `dist.init_process_group(backend="nccl")` with `backend="hccl"`
  • Replace all `torch.amp.autocast('cuda', ...)` with `autocast('npu', ...)`
  • Change device-type checks from `'cuda'` to `'npu'`
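The first two actions are one-off setup; the remaining ones are mechanical substitutions. A minimal sketch of those substitutions (the helper and its replacement table are illustrative, not the project's actual tooling; real ports edit each call site directly):

```python
# Illustrative CUDA -> Ascend substitution table for Step 1.
# Ordering matters: the autocast-specific rule must fire before the
# generic 'cuda' -> 'npu' rule so both end up correct.
REPLACEMENTS = [
    ('backend="nccl"', 'backend="hccl"'),                        # process-group backend
    ("torch.amp.autocast('cuda'", "torch.amp.autocast('npu'"),   # autocast device type
    ("'cuda'", "'npu'"),                                         # remaining device checks
]

def port_source(src: str) -> str:
    """Apply the CUDA -> NPU textual substitutions in order."""
    for old, new in REPLACEMENTS:
        src = src.replace(old, new)
    return src
```

For example, `port_source('dist.init_process_group(backend="nccl")')` yields the HCCL form, and a `device.type == 'cuda'` check becomes a `'npu'` check.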

Step 2: Operator Replacement (Domain 2)


Read `references/02-operator-replacement.md` for complete guidance.
Key actions:
  • Replace RMSNorm with `torch_npu.npu_rms_norm()`
  • Rewrite the LayerNorm forward to remove the `.float()` type casting
  • Replace RoPE with the `mindiesd.rotary_position_embedding()` fused operator
  • Optionally enable `mindiesd.fast_layernorm` via the `FAST_LAYERNORM` env var
  • Replace Flash Attention with `mindiesd.attention_forward()` multi-backend dispatch
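For intuition about the first replacement, this is the normalize-and-scale math that the fused NPU kernel collapses into a single call, written out on plain Python lists (a reference sketch, not the NPU code path). Note the absence of a `.float()` upcast, matching the LayerNorm change above:

```python
import math

def rms_norm_reference(x, weight, eps=1e-6):
    """Reference RMSNorm over one hidden vector. On Ascend, this whole
    sequence is replaced by a single fused torch_npu.npu_rms_norm call,
    computed in the tensor's native dtype (no .float() upcast)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]
```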

Step 3: Precision Strategy (Domain 3)


Read `references/03-precision-strategy.md` for complete guidance.
Key actions:
  • Lower sinusoidal embedding from float64 to float32
  • Lower RoPE frequency from complex128 to complex64
  • Change the autocast dtype from float32 to bfloat16
  • Remove `.float()` type conversions in normalization layers
  • Use the `PRECISION` env var to control the random-number device for cross-platform reproducibility
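To make the first bullet concrete, here is a sinusoidal timestep embedding in stdlib Python (the concatenated `[sin | cos]` layout is illustrative; Wan's actual layout may differ). Its values are bounded and smooth, which is why dropping from float64 to float32 is safe, especially with bfloat16 autocast downstream:

```python
import math

def sinusoidal_embedding(t, dim, theta=10000.0):
    """Sinusoidal timestep embedding, concatenated [sin | cos] layout.
    Python floats are float64; on Ascend the tensor version is computed
    directly in float32, since the downstream bfloat16 autocast makes
    the extra float64 precision unobservable."""
    half = dim // 2
    freqs = [theta ** (-i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]
```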

Step 4: Attention Mechanism Adaptation (Domain 4)


Read `references/04-attention-mechanism.md` for complete guidance.
Key actions:
  • Implement multi-backend attention dispatch via the `ALGO` env var (0/1/3)
  • Create `xFuserLongContextAttention` combining Ulysses + Ring Attention
  • Integrate Attention Cache via `mindiesd.CacheAgent`
  • Add sub-head splitting support via the `USE_SUB_HEAD` env var
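The `ALGO`-gated dispatch can be sketched as a small lookup; the backend names follow the `ALGO` values listed in the environment-variable table later in this document, while the function name and error handling are illustrative:

```python
import os

# Sketch of ALGO-gated attention backend selection (values 0/1/3).
_ATTENTION_BACKENDS = {
    "0": "fused_attn_score",
    "1": "ascend_laser_attention",
    "3": "npu_fused_infer",
}

def select_attention_backend() -> str:
    """Resolve the attention backend from the ALGO env var (default 0)."""
    algo = os.environ.get("ALGO", "0")
    if algo not in _ATTENTION_BACKENDS:
        raise ValueError(
            f"Unsupported ALGO={algo!r}; expected one of {sorted(_ATTENTION_BACKENDS)}"
        )
    return _ATTENTION_BACKENDS[algo]
```

Failing fast on an unknown value avoids silently falling back to a slower backend mid-run.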

Step 5: Distributed Parallelism Refactoring (Domain 5)


Read `references/05-distributed-parallelism.md` for complete guidance.
Key actions:
  • Implement `ParallelConfig` with 4D parallelism: TP × SP × CFG
  • Create `RankGenerator` for orthogonal process group assignment
  • Create `GroupCoordinator` with dual-channel communication (HCCL + Gloo)
  • Implement `TensorParallelApplicator` for automatic model sharding
  • Implement CFG parallel to halve sampling-loop forward passes
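Orthogonal group assignment can be sketched in a few lines. Everything here is illustrative (the layout convention of TP fastest-varying, then SP, then CFG, and all names); the real `RankGenerator` is documented in `references/05-distributed-parallelism.md`:

```python
# Minimal sketch of orthogonal process-group assignment for TP x SP x CFG.
def make_groups(world_size, tp, sp, cfg):
    assert tp * sp * cfg == world_size, "parallel dims must tile the world"
    ranks = list(range(world_size))
    # TP groups: contiguous ranks (fastest-varying dimension).
    tp_groups = [ranks[g * tp:(g + 1) * tp] for g in range(world_size // tp)]
    # SP groups: stride tp within each CFG replica.
    sp_groups = [[c * tp * sp + i * tp + t for i in range(sp)]
                 for c in range(cfg) for t in range(tp)]
    # CFG groups: stride tp*sp across replicas.
    cfg_groups = [[r + c * tp * sp for c in range(cfg)] for r in range(tp * sp)]
    return tp_groups, sp_groups, cfg_groups
```

With `world_size=8, tp=2, sp=2, cfg=2`, rank 0 belongs to TP group `[0, 1]`, SP group `[0, 2]`, and CFG group `[0, 4]`; the groups are orthogonal, so each collective involves only the ranks that share the other two coordinates.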

Step 6: VAE Patch Parallel (Domain 6)


Read `references/06-vae-patch-parallel.md` for complete guidance.
Key actions:
  • Implement spatial H×W slicing across NPUs
  • Monkey-patch `F.conv3d`, `F.conv2d`, `F.interpolate`, and `F.pad` for boundary exchange
  • Use P2P communication for neighbor boundary data exchange
  • Adjust the VAE CausalConv3d padding strategy for compatibility
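The slicing itself reduces to bookkeeping: each rank owns a contiguous band of rows plus a small halo received from its neighbors over P2P. A stdlib sketch for one spatial dimension (function name and halo width are illustrative; the real halo depends on the conv receptive field):

```python
# Sketch of H-dimension slicing for VAE patch parallel.
def patch_bounds(height, num_ranks, rank, halo=1):
    """Return the (start, stop) row range rank decodes, halo included."""
    base, extra = divmod(height, num_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    # Interior edges gain `halo` rows from each neighbor via P2P; the
    # outer image boundary is covered by the (adjusted) conv padding.
    return max(0, start - halo), min(height, stop + halo)
```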

Step 7: Model Quantization (Domain 7)


Read `references/07-model-quantization.md` for complete guidance.
Key actions:
  • Use `msmodelslim` for W8A8 dynamic quantization
  • Integrate `mindiesd.quantize()` for runtime quantization loading
  • Handle FSDP + float8 compatibility via `patch_cast_buffers_for_float8()`
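To ground the terminology: W8A8 means both weights and activations run in int8, with activation scales computed per call ("dynamic"). A back-of-the-envelope symmetric-quantization sketch, not msmodelslim's actual algorithm (which also handles outliers and per-channel scales):

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: choose the scale so the
    largest magnitude maps to +/-127, then round and clamp."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values from int8 codes."""
    return [v * scale for v in q]
```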

Step 8: Sparse Attention — RainFusion (Domain 8)


Read `references/08-sparse-attention.md` for complete guidance.
Key actions:
  • Implement RainFusion v1 (window-based Local/Global adaptive)
  • Implement RainFusion v2 (blockwise Top-K sparse)
  • Configure `skip_timesteps` for the quality-speed tradeoff
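The selection step at the heart of the v2 variant fits in a few lines: score key blocks per query block, keep the Top-K, and attend only within the kept blocks. A stdlib sketch of that step (how blocks are scored, e.g. from pooled QK estimates, is out of scope here); in practice sparsity is enabled only after `skip_timesteps` dense denoising steps, which is the quality-speed knob above:

```python
def topk_blocks(block_scores, k):
    """Return the sorted indices of the k highest-scoring key blocks
    for one query block; attention is computed only over these."""
    order = sorted(range(len(block_scores)),
                   key=lambda i: block_scores[i], reverse=True)
    return sorted(order[:k])
```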

Step 9: Pipeline Integration (Domain 9)


Read `references/09-pipeline-integration.md` for complete guidance.
Key actions:
  • Add warm-up generation steps for NPU operator compilation
  • Configure `T5_LOAD_CPU` for a flexible T5 loading strategy
  • Add RoPE frequency cache (`freqs_list`) lifecycle management
  • Implement the multi-resolution VAE decode condition (`rank < 8`)
  • Add performance timing with `stream.synchronize()`
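The warm-up and timing bullets combine into one generic pattern, sketched below with a wall-clock timer (names are illustrative). On Ascend, bracket each reading with a stream synchronize (e.g. `torch.npu.synchronize()`) so queued kernels are flushed; otherwise the timer measures only kernel launch, not execution:

```python
import time

def timed_generate(generate, warmup=1, runs=3):
    """Run `generate` warmup times (absorbing one-off NPU operator
    compilation), then return the mean wall-clock time per run."""
    for _ in range(warmup):
        generate()
    start = time.perf_counter()
    for _ in range(runs):
        generate()
    return (time.perf_counter() - start) / runs
```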

Key Environment Variables


| Variable | Default | Description |
|----------|---------|-------------|
| `ALGO` | `0` | Attention algorithm: 0=fused_attn_score, 1=ascend_laser_attention, 3=npu_fused_infer |
| `FAST_LAYERNORM` | `0` | Enable mindiesd fast LayerNorm |
| `USE_SUB_HEAD` | `0` | Sub-head group size for attention splitting |
| `T5_LOAD_CPU` | `0` | Load the T5 model on CPU to save NPU memory |
| `PRECISION` | `0` | Generate random numbers on CPU for cross-platform reproducibility |
| `OVERLAP` | `0` | Enable FA-AllToAll communication overlap |
| `PYTORCH_NPU_ALLOC_CONF` | - | NPU memory allocation strategy |
| `TASK_QUEUE_ENABLE` | - | NPU task queue optimization |
| `CPU_AFFINITY_CONF` | - | CPU affinity configuration |

Key Dependencies


| Library | Purpose |
|---------|---------|
| `torch_npu` | PyTorch Ascend NPU backend |
| `mindiesd` | MindIE Stable Diffusion acceleration (FA, RoPE, LayerNorm, quantize) |
| `msmodelslim` | Huawei model compression toolkit (W8A8 quantization) |
| `yunchang` | Sequence parallel framework (Ulysses + Ring Attention) |
| `torch_atb` | Ascend Transformer Boost operators |
| `atb_ops` | ATB fused matmul-allreduce operators |

Notes


  • This skill is derived from comparing Wan2.2-Original (CUDA) and Wan2.2-Ascend (NPU) codebases
  • The Ascend version removes S2V (Speech-to-Video) and Animate tasks, focusing on T2V, I2V, and TI2V
  • Hardware target: Atlas 800I A2 / Atlas 800T A2 with 8×64G NPU
  • All adaptation patterns are applicable to similar DiT-based video diffusion architectures