This skill provides comprehensive guidance for adapting Wan-series video generation models (Wan2.1/Wan2.2) from NVIDIA CUDA to Huawei Ascend NPU. It should be used when performing NPU migration of DiT-based video diffusion models, including device layer adaptation, operator replacement, distributed parallelism refactoring, attention optimization, VAE parallelization, and model quantization. This skill covers 9 major adaptation domains derived from real-world Wan2.2 CUDA-to-Ascend porting experience.
Install the skill with:

```bash
npx skill4agent add ascend-ai-coding/awesome-ascend-skills wan-ascend-adaptation
```

The guidance is organized into nine reference files under `references/`:

| # | Domain | Reference File | Priority |
|---|---|---|---|
| 1 | Device Layer Adaptation | references/01-device-layer.md | P0 — Must |
| 2 | Operator Replacement | references/02-operator-replacement.md | P0 — Must |
| 3 | Precision Strategy | references/03-precision-strategy.md | P0 — Must |
| 4 | Attention Mechanism | references/04-attention-mechanism.md | P1 — Critical |
| 5 | Distributed Parallelism | references/05-distributed-parallelism.md | P1 — Critical |
| 6 | VAE Patch Parallel | references/06-vae-patch-parallel.md | P2 — Important |
| 7 | Model Quantization | references/07-model-quantization.md | P2 — Important |
| 8 | Sparse Attention (RainFusion) | references/08-sparse-attention.md | P2 — Important |
| 9 | Inference Pipeline Integration | references/09-pipeline-integration.md | P1 — Critical |
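
For domain 1 (Device Layer Adaptation), the core substitutions are largely mechanical. The sketch below is a minimal illustration, assuming `torch_npu` is installed on an Ascend host and the script is launched via `torchrun`; the `local_rank` handling, tensor shapes, and bfloat16 choice are illustrative placeholders rather than part of the skill itself.

```python
import os
import torch
import torch_npu  # registers the "npu" device type and the torch.npu.* namespace
import torch.distributed as dist
# Alternatively, `from torch_npu.contrib import transfer_to_npu` monkey-patches
# most "cuda" call sites to their NPU equivalents automatically.

local_rank = int(os.environ.get("LOCAL_RANK", "0"))

# CUDA original: dist.init_process_group(backend="nccl")
dist.init_process_group(backend="hccl")      # HCCL is the Ascend collective backend

# CUDA original: device = torch.device("cuda", local_rank)
device = torch.device("npu", local_rank)
torch.npu.set_device(device)

# CUDA original: torch.amp.autocast('cuda', dtype=torch.bfloat16)
with torch.amp.autocast('npu', dtype=torch.bfloat16):
    x = torch.randn(2, 16, 32, 32, device=device)
    y = torch.nn.functional.silu(x)          # the DiT forward pass would run here
```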
Each reference file lists the key symbols and substitutions it covers:

- `references/01-device-layer.md` — `torch_npu`, `transfer_to_npu`, `dist.init_process_group(backend="nccl")` → `backend="hccl"`, `torch.amp.autocast('cuda', ...)` → `autocast('npu', ...)`, `'cuda'` → `'npu'` device strings
- `references/02-operator-replacement.md` — `torch_npu.npu_rms_norm()`, `.float()` upcasts, `mindiesd.rotary_position_embedding()`, `mindiesd.fast_layernorm` (`FAST_LAYERNORM`), `mindiesd.attention_forward()` (see the sketch below this list)
- `references/03-precision-strategy.md` — `.float()`, `PRECISION`
- `references/04-attention-mechanism.md` — `ALGO`, `xFuserLongContextAttention`, `mindiesd.CacheAgent`, `USE_SUB_HEAD`
- `references/05-distributed-parallelism.md` — `ParallelConfig`, `RankGenerator`, `GroupCoordinator`, `TensorParallelApplicator`
- `references/06-vae-patch-parallel.md` — `F.conv3d`, `F.conv2d`, `F.interpolate`, `F.pad`
- `references/07-model-quantization.md` — `msmodelslim`, `mindiesd.quantize()`, `patch_cast_buffers_for_float8()`
- `references/08-sparse-attention.md` — RainFusion sparse attention
- `references/09-pipeline-integration.md` — `T5_LOAD_CPU`, `freqs_list`, `rank < 8`, `stream.synchronize()`
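
As an example of the operator-replacement pattern listed above, this sketch swaps a hand-written RMSNorm (with its `.float()` upcast) for the fused `torch_npu.npu_rms_norm` kernel when running on NPU. The module name, hidden size, and epsilon are illustrative; the mindiesd RoPE, LayerNorm, and attention substitutions follow the same pattern, but their exact signatures should be taken from `references/02-operator-replacement.md`.

```python
import torch
import torch_npu


class RMSNorm(torch.nn.Module):
    """RMSNorm with a fused Ascend path and a reference fallback."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.device.type == "npu":
            # Fused kernel: applies gamma internally and returns (output, rstd).
            gamma = self.weight.to(dtype=x.dtype)
            return torch_npu.npu_rms_norm(x, gamma, epsilon=self.eps)[0]
        # Reference path, as in the original CUDA code: upcast to fp32 for the reduction.
        x32 = x.float()
        x32 = x32 * torch.rsqrt(x32.pow(2).mean(-1, keepdim=True) + self.eps)
        return (x32 * self.weight.float()).type_as(x)
```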
Runtime behavior is controlled through environment variables:

| Variable | Default | Description |
|---|---|---|
| `ALGO` | | Attention algorithm: 0=fused_attn_score, 1=ascend_laser_attention, 3=npu_fused_infer |
| `FAST_LAYERNORM` | | Enable mindiesd fast LayerNorm |
| `USE_SUB_HEAD` | | Sub-head group size for attention splitting |
| `T5_LOAD_CPU` | | Load T5 model on CPU to save NPU memory |
| | | Generate random numbers on CPU for cross-platform reproducibility |
| | | Enable FA-AllToAll communication overlap |
| `PYTORCH_NPU_ALLOC_CONF` | - | NPU memory allocation strategy |
| `TASK_QUEUE_ENABLE` | - | NPU task queue optimization |
| `CPU_AFFINITY_CONF` | - | CPU affinity configuration |
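
A minimal sketch of how these switches can be consumed at pipeline start-up; the default values below are illustrative assumptions, not the defaults shipped with the skill.

```python
import os
import torch
import torch_npu  # needed so the "npu" device below is available

# Illustrative defaults only; consult the table above for the real ones.
ALGO = int(os.environ.get("ALGO", "0"))                        # attention backend selector
FAST_LAYERNORM = os.environ.get("FAST_LAYERNORM", "0") == "1"  # mindiesd fast LayerNorm
USE_SUB_HEAD = int(os.environ.get("USE_SUB_HEAD", "0"))        # sub-head group size (0 = off)
T5_LOAD_CPU = os.environ.get("T5_LOAD_CPU", "0") == "1"        # keep T5 encoder in host memory

t5_device = torch.device("cpu") if T5_LOAD_CPU else torch.device("npu")
print(f"algo={ALGO} fast_layernorm={FAST_LAYERNORM} "
      f"sub_head={USE_SUB_HEAD} t5_device={t5_device}")
```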
The adaptation relies on the following libraries:

| Library | Purpose |
|---|---|
| torch_npu | PyTorch Ascend NPU backend |
| mindiesd | MindIE Stable Diffusion acceleration (FA, RoPE, LayerNorm, quantize) |
| msmodelslim | Huawei model compression toolkit (W8A8 quantization) |
| xfuser | Sequence parallel framework (Ulysses + Ring Attention) |
| | Ascend Transformer Boost operators |
| | ATB fused matmul-allreduce operators |
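
A quick way to verify the Python-level dependencies before running the pipeline. The module names below are the common import names and are an assumption to adjust if your environment packages them differently; the ATB operator libraries are loaded at the CANN level and are not probed here.

```python
import importlib

# Probe the Python packages from the table above and report their versions.
for name in ("torch_npu", "mindiesd", "msmodelslim", "xfuser"):
    try:
        module = importlib.import_module(name)
        print(f"{name}: {getattr(module, '__version__', 'version unknown')}")
    except ImportError:
        print(f"{name}: not installed")
```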