# Triton-Ascend Migration
## Quick Start
When handling migration requests, follow this sequence:
- First identify the input method:
  - File path / specified code snippet
  - User directly pastes code
- Then identify the input source:
  - GPU/CUDA Triton kernel
  - Python/PyTorch operator implementation
- Then identify the operator type.
- First create a minimally runnable version:
  - Add the required imports
  - Remove GPU-specific device logic
  - Prefer a 1D grid
- For simple tutorial examples, default to the "minimal diff migration version".
- Perform Ascend-side optimization after the code runs successfully:
  - Physical core binding
  - `BLOCK_SIZE_SUB` / `XBLOCK_SUB` sub-block tiling
  - Contiguous/aligned memory access
  - Troubleshooting for `coreDim` / UB / dtype / mask
- If clear optimization opportunities exist, output the optimized implementation directly instead of only offering suggestions.
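The steps above can be sketched for the classic vector-add tutorial kernel. This is a minimal sketch assuming the standard open-source Triton API (`triton`, `triton.language`); with `triton-ascend` installed, the same kernel form launches over a 1D grid once the tensors are on the NPU device. The names `add_kernel`, `add`, and `BLOCK_SIZE=1024` are illustrative choices, not a prescribed configuration.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # 1D grid only
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the non-divisible tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    # No device assertions or GPU-specific logic: the output follows the inputs.
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = (triton.cdiv(n_elements, 1024),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

Note that nothing here mentions a device: the wrapper works on whatever device the input tensors already live on, which is exactly what the "remove GPU-specific device logic" step asks for.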
## How to Use This Skill
If users ask "How do I use this skill?", do not immediately dive into a lengthy migration analysis; first provide a concise usage guide of 3 to 6 lines, then proceed based on the user's input.
Keep only these points in the concise guide:
- Users can provide code, a reference implementation, a file path, or error/performance logs.
- Users are advised to specify the runtime environment: local command line, existing container, CI, or code generation only (no execution).
- Users can also state a preference: get it running first and optimize later, or directly provide the optimized version.
- Outputs are matched to the scenario: Triton-Ascend implementation, minimal validation script, and optimization instructions.
If users follow up with questions like "How exactly should I ask?", "What command do I write?", or "How do I run this in a container?", refer to the usage reference listed under Additional Resources and provide local commands, container commands, and example questions as needed; do not paste the entire long guide into regular responses.
Copy this checklist to track progress:

```text
Migration Progress
- [ ] Identify input source and operator type
- [ ] First perform minimal migration or semantic rewriting
- [ ] Adjust to Ascend-friendly parallelism and grid
- [ ] Redesign block / tiling
- [ ] Review stride / block_ptr / alignment
- [ ] Handle coreDim / UB / scalar degradation
- [ ] Implement feasible optimizations directly
- [ ] Generate and save a minimal NPU validation script
- [ ] Actually execute the validation script
- [ ] Output results and optimization instructions
```
## Input Identification
First answer these three questions:
- Is the user providing a file path or directly pasting code?
- Is it a complete script, partial snippet, or single kernel?
- Is it GPU Triton migration or Python/PyTorch semantic rewriting?
Details about input methods, default handling when information is missing, and priority when a file path conflicts with pasted code can be found in `references/input-modes.md`.
### Scenario A: GPU Triton -> Triton-Ascend
Prioritize checking:
- Whether exists
- Whether there is GPU-specific device acquisition or assertion logic
- Whether a GPU-style multi-dimensional free grid is retained
- Whether is used
- Whether complex `shape/stride/block_ptr/order` handling exists
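To make the checklist above concrete, the following throwaway helper (hypothetical — not part of any Triton or Ascend toolchain) flags some of these GPU-specific patterns in a kernel's source text. The regexes are rough illustrative heuristics only.

```python
import re

# Rough heuristics for the Scenario A checklist; illustrative only.
GPU_PATTERNS = {
    "cuda device logic": r"\bcuda\b|torch\.cuda",
    "GPU launch tuning": r"num_warps|num_stages",
    "multi-dim grid": r"program_id\(axis=[12]\)|program_id\([12]\)",
}

def flag_gpu_specifics(source):
    """Return the checklist items whose pattern appears in the kernel source."""
    return [name for name, pattern in GPU_PATTERNS.items()
            if re.search(pattern, source)]

snippet = "pid = tl.program_id(0)\nout = torch.empty_like(x, device='cuda')"
print(flag_gpu_specifics(snippet))  # → ['cuda device logic']
```

In practice this is a reading aid, not a gate: a human pass over the kernel is still needed for device assertions and free multi-dimensional grids that the patterns miss.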
### Scenario B: Python/PyTorch -> Triton-Ascend
First extract semantics, then write Triton code:
- Input-output tensor relationship
- Indexing and broadcasting method
- Mask / reduce logic
- dtype and precision requirements
- Whether the original PyTorch implementation already has naturally continuous memory access
If the original operator is only a reference implementation, first write a semantically equivalent Triton-Ascend version, then proceed with optimization.
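For instance, the extracted semantics can be written down as an executable reference before any Triton code exists. The masked row-sum below is a hypothetical example op; NumPy stands in for the PyTorch reference so the sketch stays self-contained.

```python
import numpy as np

# Extracted semantics for a hypothetical op:
#   inputs:  x (M, N) float32, mask (M, N) bool
#   output:  (M,) float32; no broadcasting
#   reduce:  sum over the last axis; masked-out lanes contribute 0
def masked_row_sum_ref(x, mask):
    return np.where(mask, x, np.float32(0.0)).sum(axis=-1)

x = np.arange(6, dtype=np.float32).reshape(2, 3)       # [[0,1,2],[3,4,5]]
mask = np.array([[True, False, True], [True, True, False]])
print(masked_row_sum_ref(x, mask))  # → [2. 7.]
```

Only once this reference pins down the indexing, masking, reduce axis, and dtype does it make sense to write the semantically equivalent Triton-Ascend kernel against it.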
## Migration Process
### 1. Collect Minimal Necessary Information
Prioritize collecting this information; supplement what is missing:
- Input code or minimal reproduction case
- Input method: file path / specified code snippet / user directly pastes code
- shape, dtype, stride
- Whether there is mask, broadcast, reduce
- Current error or performance issue
- Whether exact precision consistency is required
- Runtime environment: local command line, inside container, CI, or code generation only without execution
If information is incomplete, supplement in this order:
- First infer from existing code
- Then use minimal reasonable assumptions to complete the validation script
- Finally ask users for necessary information
If the "execution location" information is missing, infer in this order:
- First check whether the user provided a container name, container path, or image information
- Then check whether the user provided a local file path, current directory, or terminal command
- If still undetermined, ask: "Should I write the validation steps for a local command line or for a container environment?"
### 2. First Perform Minimal Migration or Semantic Rewriting

Default to pursuing "semantic alignment and successful execution":
- GPU Triton: First change to a minimally runnable version:
  - Fix the imports
  - Remove GPU-specific device logic
  - For documentation/tutorial-style simple examples, keep the original kernel name, wrapper name, grid construction, and main code structure unchanged
  - Do not proactively add extra assertions, function renaming, or engineering packaging in the first version, unless the user explicitly requests an "enhanced/production version", or these changes are necessary to fix deterministic issues on NPU
- Python/PyTorch: First rewrite into the most straightforward Triton kernel that matches the original computation semantics
Do not over-rewrite in the first step.
If users explicitly mention these keywords:
- Official documentation style
- Strict minimal migration
- Minimal diff
- No engineering enhanced version
- Only refer to official migration examples
Then this "minimal migration mode" overrides the later generalized optimization requirements:
- Only make the necessary code modifications
- The optimization instructions can be 1 to 3 lines, clearly stating "no in-depth optimization was performed for this task"
- Do not pad the response with items like `TRITON_ALL_BLOCKS_PARALLEL` or physical core binding just to complete the template
- Do not let the response style drift from "documentation diff" to "engineering optimization overview"
- Keep the validation script "minimally runnable"; do not write it as an engineering test framework by default
Details about documentation-style minimal migration, single-file example organization, and validation-script naming and saving rules can be found in `references/output-and-validation.md`.
### 3. Rewrite Parallelism Model
On the Ascend side, follow these rules first:
- Prefer a 1D grid
- Switch from GPU logical-grid thinking to Ascend physical-core-binding thinking
- Elementwise/vector operators should target the Vector Core path first
- Operators containing `tl.dot` should target the AI Core path first
Further judge based on this set of "general convergence rules"; do not mechanically retain all implementation branches from the GPU version:
- If the original implementation has multiple kernels, environment-variable branches, or automatic dispatch across different data paths, first distinguish which are "semantically necessary" and which are merely "performance strategies on GPU"
- For performance branches that are clearly no longer needed on Ascend, converge to a single kernel or fewer paths; focus on preserving semantics rather than every historical branch
- If an operator is essentially elementwise, but the original implementation uses a complex 2D/3D grid, additional tiling, or multi-version kernels, first evaluate whether it can be rewritten into a more straightforward 1D-grid, fixed-configuration, single-path implementation
- If an operator uses a multi-dimensional grid, do not just think about "compressing the multi-dimensional grid into 1D"; first judge which grid dimensions are only logical chunk / token / tile dimensions, and whether they are better moved into the kernel's internal loop to reduce scheduling dimensions
- Do not mechanically classify based on "`tl.dot` appears in the source code"; if `tl.dot` is only used to implement intermediate techniques such as prefix-sum, local scan, or triangular-mask aggregation, first judge whether the operator's main semantics are really a reduction/scan, or whether it should indeed follow the AI Core path
- If the operator naturally has structures like chunk, tile, window, prefix-sum, or local reduction, do not just follow the original block-pointer logic; also evaluate whether "rearrange the layout first, then perform vectorized computation" is a better fit for Ascend
- If an auxiliary tensor (such as gate, mask, bias, index, state-gate) is not contiguous in the current access direction, first perform a lightweight, semantics-preserving layout rearrangement on the wrapper side, then access it inside the kernel with a simpler linear pointer or a more regular access pattern
- If the main loop order is rearranged, such as changing from "K first, then T" to "T first, then K", re-review the `shape/stride/block_ptr/order` of state tensors, cache tensors, and historical block tensors at the same time; do not just change the scheduling order while keeping the old views and patching over them with additional indexing
- If common capabilities such as device-attribute tools or shared layout helpers already exist in the current project, prioritize reusing the project helpers instead of writing inline replacements by default
- However, if the current output target is an "independent runnable script" or "minimal validation script", check whether these helpers rely on extra initialization; if they depend on project initialization steps, either add that initialization or clearly state the preconditions in the result
- When you decide to "delete branches / converge the implementation", explain the reason in the result: whether the branch only serves GPU autotune, only serves shared-memory selection, or has no clear benefit on Ascend
- If the runtime log of the migrated Triton-Ascend kernel shows warnings like `Please DO NOT tune args ['num_warps']`, first check whether GPU-style launch/tuning parameters are still mechanically retained; for a minimally runnable implementation on Ascend, do not keep these parameters by default unless you can point to a clear compilation requirement or a measured benefit
- Do not use only one set of generic shapes in the validation script; derive the test set from the operator's features, covering at least one non-divisible block case, one case most likely to trigger branch differences, and one case closer to the real working set
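As an illustration of the last rule, a test-shape set for a hypothetical 1D elementwise kernel with `BLOCK_SIZE = 256` might look like this (the concrete sizes are assumptions chosen per feature, not taken from any real suite):

```python
BLOCK_SIZE = 256  # assumed main block size for this sketch

test_shapes = [
    (1024,),     # divisible: the plain happy path
    (1000,),     # non-divisible tail: exercises the mask branch
    (1,),        # degenerate size: most likely to expose branch differences
    (1 << 22,),  # larger size closer to a real working set
]

# The rule is actually satisfied: at least one non-divisible block case.
assert any(n % BLOCK_SIZE != 0 for (n,) in test_shapes)
```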
If the user provides a 2D/3D grid, first evaluate whether it can be folded into a 1D grid with the indices recovered inside the kernel. Details about `coreDim`, UB, `shape/stride/block_ptr/order`, and `TRITON_ALL_BLOCKS_PARALLEL` can be found in the migration and optimization reference listed under Additional Resources.
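The fold-then-recover pattern amounts to simple index arithmetic. The sketch below models it in plain Python; inside a Triton kernel the same two lines would operate on `tl.program_id(0)`.

```python
# A (grid_m, grid_n) launch folded into a 1D grid of grid_m * grid_n programs:
# each program recovers its logical 2D block indices from the flat program id.
def unfold_pid(pid, grid_n):
    pid_m = pid // grid_n   # row-block index
    pid_n = pid % grid_n    # column-block index
    return pid_m, pid_n

grid_m, grid_n = 3, 4
pairs = [unfold_pid(p, grid_n) for p in range(grid_m * grid_n)]
print(pairs[5], pairs[-1])  # → (1, 1) (2, 3)
```

The launch side then degenerates to a single dimension, `grid = (grid_m * grid_n,)`, which matches the 1D-grid preference above.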
## Optimization and Troubleshooting
### Default Rules for Direct Optimization
Directly provide the optimized implementation if any of the following holds:
- The physical core count (`coreDim`) is clearly exceeded
- UB usage is clearly too large
- Memory access is scattered but can be restructured into contiguous access
- The mask load/store has a clearly better formulation
- The dtype clearly causes vector operations to degrade to scalar operations
If none of these apply, especially for simple examples like vector addition, do not output an enhanced wrapped version by default just to "look more complete". Provide the minimal migration version first, then list the enhancements under "Optional Optimization".
### Optimization Priority
- Adjust the grid and number of cores
- Adjust the main block size
- Introduce or restructure sub-block loops
- Correct `shape/stride/block_ptr/order`
- Evaluate `multibuffer`
- Evaluate `TRITON_ALL_BLOCKS_PARALLEL`
- Evaluate `care_padding=False` and related compilation optimization items
- Adjust the dtype path without breaking semantics
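The sub-block loop in the third item can be pictured as follows. This plain-Python model (an assumption for illustration, not Triton-Ascend API) shows the loop structure a kernel would use when one program's logical block exceeds what fits in UB at once.

```python
# One program owns [block_start, block_start + block_size); UB only fits
# sub_size elements per trip, so the block is processed in sub-block chunks,
# clipped at both the block end and the total element count.
def sub_block_ranges(block_start, block_size, sub_size, n_elements):
    ranges = []
    for offset in range(0, block_size, sub_size):
        lo = block_start + offset
        hi = min(lo + sub_size, block_start + block_size, n_elements)
        if lo < hi:
            ranges.append((lo, hi))
    return ranges

# BLOCK_SIZE=1024 split into BLOCK_SIZE_SUB=256 trips, with a 1000-element tail.
print(sub_block_ranges(0, 1024, 256, 1000))
# → [(0, 256), (256, 512), (512, 768), (768, 1000)]
```

In an actual kernel the same structure becomes a `for` loop over sub-block offsets with masked `tl.load`/`tl.store` per trip; the parameter pair mirrors the `BLOCK_SIZE_SUB`/`XBLOCK_SUB` naming used earlier in this document.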
### Key Points to Cover
The output must cover these points:
- 1D grid
- Physical core binding
- Distinction between vector operators and operators containing `tl.dot`
- UB limit
- Contiguous / aligned memory access
- Re-review of `shape/stride/block_ptr/order`
- `TRITON_ALL_BLOCKS_PARALLEL`
- Scalar degradation caused by dtype
## Fixed Output Template
Always output in this structure:
```markdown
## Migration Conclusion
- Input Source:
- Operator Type:
- Main Migration Actions:

## Triton-Ascend Implementation
- Provide the final kernel and calling wrapper code
- For basic migration scenarios, first provide the "minimal diff migration version"
- Only additionally provide an "engineering enhanced/optimized version" when users request it, or when there are clear optimization opportunities
- If clear optimization opportunities exist, directly provide the optimized version
- Explain the save path and naming of the generated file

## Validation Script
- Provide a minimally executable validation script
- Compare against a PyTorch reference
- Include at least an `allclose` check or a maximum-error printout
- Explain the save path of the validation script
- Clearly state whether it was actually executed, along with the execution commands and results

## Optimization Instructions
- Explain the reasons for adjusting the grid / number of cores / block / sub-block
- State whether `coreDim`, UB, memory access, dtype, and mask performance issues were handled
- State whether `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`, `care_padding=False` are used

If the current task is a "documentation-style minimal migration", this section can be extremely concise:
- Only state that minimal migration was performed first
- Briefly note that optimization items like `coreDim` / UB / `multibuffer` were not expanded for this task
- Do not expand into lengthy optimization analysis just to fit the template

## Risks and Limitations
- List unvalidated boundary conditions
- List information that users need to supplement
- If the script fails to run, clearly state which step it is stuck on
```
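A validation script following this template can be as small as the sketch below. Since it must stay self-contained here, a plain function stands in for the migrated wrapper and NumPy stands in for the PyTorch reference; on hardware, replace `migrated_add` with the real Triton-Ascend wrapper call and run on the NPU device. All names are illustrative.

```python
import numpy as np

def migrated_add(x, y):
    # Stand-in for the migrated Triton-Ascend wrapper (assumption for this sketch).
    return x + y

def reference_add(x, y):
    # Reference semantics (PyTorch in practice; NumPy here for self-containment).
    return x + y

# Cover both a divisible and a non-divisible size for an assumed BLOCK_SIZE=256.
for n in (256, 1000):
    rng = np.random.default_rng(0)
    x = rng.random(n).astype(np.float32)
    y = rng.random(n).astype(np.float32)
    out, ref = migrated_add(x, y), reference_add(x, y)
    max_err = float(np.abs(out - ref).max())
    assert np.allclose(out, ref, rtol=1e-5, atol=1e-6), max_err
    print(f"n={n}: max abs error {max_err:.3e}")
```

This covers the template's minimum bar: a reference comparison, an `allclose` check, and a maximum-error printout per shape.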
If the user's question is "How to use this skill", add a minimal "Usage" section before the fixed template, limited to 3 to 6 lines, explaining:
- What input the user should provide
- Whether to handle it according to local or container scenario
- What outputs will be generated next
Then proceed to the normal migration output.
If users ask about command lines, containers, directory switching, or validation command templates, refer to the usage, local commands, and container scenarios reference under Additional Resources; do not include these details in every migration response by default.
## Additional Resources
For detailed rules, refer to:
- Usage, Local Commands and Container Scenarios
- Input Methods and Context Completion
- Output, Naming and Minimal Validation Script
- Migration and Optimization Reference
- Typical Examples and Output Samples
- Manual Review Test Checklist