# Triton-Ascend Migration
## Quick Start
When handling migration requests, follow this sequence:
- First identify the input method:
  - File path / specified code snippet
  - User directly pastes code
- Then identify the input source:
  - GPU/CUDA Triton kernel
  - Python/PyTorch operator implementation
- Then identify the operator type.
- First create a minimally runnable version:
  - Add the required imports
  - Remove GPU-specific device logic
  - Prefer a 1D grid
- For simple tutorial examples, default to the "minimal diff migration version".
- Perform Ascend-side optimization after the code runs successfully:
  - Physical core binding
  - `BLOCK_SIZE_SUB` / `XBLOCK_SUB` sub-block tiling
  - Contiguous/aligned memory access
  - Troubleshooting for `coreDim` / UB / dtype / mask
- If clear optimization opportunities exist, output the optimized implementation directly instead of only offering suggestions.
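The steps above can be sketched for the classic vector-add tutorial kernel. This is a minimal sketch assuming the standard open-source Triton API (`triton`, `triton.language`); with `triton-ascend` installed, the same kernel form launches over a 1D grid once the tensors are on the NPU device. The names `add_kernel`, `add`, and `BLOCK_SIZE=1024` are illustrative choices, not a prescribed configuration.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # 1D grid only
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the non-divisible tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    # No device assertions or GPU-specific logic: the output follows the inputs.
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = (triton.cdiv(n_elements, 1024),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

Note that nothing here mentions a device: the wrapper works on whatever device the input tensors already live on, which is exactly what the "remove GPU-specific device logic" step asks for.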
## How to Use This Skill
If users ask "How do I use this skill?", do not immediately dive into a lengthy migration analysis; first provide a concise usage guide of 3 to 6 lines, then proceed based on the user's input.
Keep only these points in the concise guide:
- Users can provide code, a reference implementation, a file path, or error/performance logs.
- Users are advised to specify the runtime environment: local command line, existing container, CI, or code generation only (no execution).
- Users can also state a preference: get it running first and optimize later, or directly provide the optimized version.
- Outputs are matched to the scenario: Triton-Ascend implementation, minimal validation script, and optimization instructions.
If users follow up with questions like "How exactly should I ask?", "What command do I write?", or "How do I run this in a container?", refer to the usage reference listed under Additional Resources and provide local commands, container commands, and example questions as needed; do not paste the entire long guide into regular responses.
Copy this checklist to track progress:

```text
Migration Progress
- [ ] Identify input source and operator type
- [ ] First perform minimal migration or semantic rewriting
- [ ] Adjust to Ascend-friendly parallelism and grid
- [ ] Redesign block / tiling
- [ ] Review stride / block_ptr / alignment
- [ ] Handle coreDim / UB / scalar degradation
- [ ] Implement feasible optimizations directly
- [ ] Generate and save a minimal NPU validation script
- [ ] Actually execute the validation script
- [ ] Output results and optimization instructions
```
## Input Identification
First answer these three questions:
- Is the user providing a file path or directly pasting code?
- Is it a complete script, partial snippet, or single kernel?
- Is it GPU Triton migration or Python/PyTorch semantic rewriting?
Details about input methods, default handling when information is missing, and priority when a file path conflicts with pasted code can be found in `references/input-modes.md`.
### Scenario A: GPU Triton -> Triton-Ascend
Prioritize checking:
- Whether exists
- Whether there is GPU-specific device acquisition or assertion logic
- Whether a GPU-style multi-dimensional free grid is retained
- Whether is used
- Whether complex `shape/stride/block_ptr/order` handling exists
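To make the checklist above concrete, the following throwaway helper (hypothetical — not part of any Triton or Ascend toolchain) flags some of these GPU-specific patterns in a kernel's source text. The regexes are rough illustrative heuristics only.

```python
import re

# Rough heuristics for the Scenario A checklist; illustrative only.
GPU_PATTERNS = {
    "cuda device logic": r"\bcuda\b|torch\.cuda",
    "GPU launch tuning": r"num_warps|num_stages",
    "multi-dim grid": r"program_id\(axis=[12]\)|program_id\([12]\)",
}

def flag_gpu_specifics(source):
    """Return the checklist items whose pattern appears in the kernel source."""
    return [name for name, pattern in GPU_PATTERNS.items()
            if re.search(pattern, source)]

snippet = "pid = tl.program_id(0)\nout = torch.empty_like(x, device='cuda')"
print(flag_gpu_specifics(snippet))  # → ['cuda device logic']
```

In practice this is a reading aid, not a gate: a human pass over the kernel is still needed for device assertions and free multi-dimensional grids that the patterns miss.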
### Scenario B: Python/PyTorch -> Triton-Ascend
First extract semantics, then write Triton code:
- Input-output tensor relationship
- Indexing and broadcasting method
- Mask / reduce logic
- dtype and precision requirements
- Whether the original PyTorch implementation already has naturally continuous memory access
If the original operator is only a reference implementation, first write a semantically equivalent Triton-Ascend version, then proceed with optimization.
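For instance, the extracted semantics can be written down as an executable reference before any Triton code exists. The masked row-sum below is a hypothetical example op; NumPy stands in for the PyTorch reference so the sketch stays self-contained.

```python
import numpy as np

# Extracted semantics for a hypothetical op:
#   inputs:  x (M, N) float32, mask (M, N) bool
#   output:  (M,) float32; no broadcasting
#   reduce:  sum over the last axis; masked-out lanes contribute 0
def masked_row_sum_ref(x, mask):
    return np.where(mask, x, np.float32(0.0)).sum(axis=-1)

x = np.arange(6, dtype=np.float32).reshape(2, 3)       # [[0,1,2],[3,4,5]]
mask = np.array([[True, False, True], [True, True, False]])
print(masked_row_sum_ref(x, mask))  # → [2. 7.]
```

Only once this reference pins down the indexing, masking, reduce axis, and dtype does it make sense to write the semantically equivalent Triton-Ascend kernel against it.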
## Migration Process
### 1. Collect Minimal Necessary Information
Prioritize collecting this information; supplement what is missing:
- Input code or minimal reproduction case
- Input method: file path / specified code snippet / user directly pastes code
- shape, dtype, stride
- Whether there is mask, broadcast, reduce
- Current error or performance issue
- Whether exact precision consistency is required
- Runtime environment: local command line, inside container, CI, or code generation only without execution
If information is incomplete, supplement in this order:
- First infer from existing code
- Then use minimal reasonable assumptions to complete the validation script
- Finally ask users for necessary information
If the "execution location" information is missing, infer in this order:
- First check whether the user provided a container name, container path, or image information
- Then check whether the user provided a local file path, current directory, or terminal command
- If still undetermined, ask: "Should I write the validation steps for a local command line or for a container environment?"
### 2. First Perform Minimal Migration or Semantic Rewriting

Default to pursuing "semantic alignment and successful execution":
- GPU Triton: First change to a minimally runnable version:
  - Fix the imports
  - Remove GPU-specific device logic
  - For documentation/tutorial-style simple examples, keep the original kernel name, wrapper name, grid construction, and main code structure unchanged
  - Do not proactively add extra assertions, function renaming, or engineering packaging in the first version, unless the user explicitly requests an "enhanced/production version", or these changes are necessary to fix deterministic issues on NPU
- Python/PyTorch: First rewrite into the most straightforward Triton kernel that matches the original computation semantics
Do not over-rewrite in the first step.
If users explicitly mention these keywords:
- Official documentation style
- Strict minimal migration
- Minimal diff
- No engineering enhanced version
- Only refer to official migration examples
Then this "minimal migration mode" overrides the later generalized optimization requirements:
- Only make the necessary code modifications
- The optimization instructions can be 1 to 3 lines, clearly stating "no in-depth optimization was performed for this task"
- Do not pad the response with items like `TRITON_ALL_BLOCKS_PARALLEL` or physical core binding just to complete the template
- Do not let the response style drift from "documentation diff" to "engineering optimization overview"
- Keep the validation script "minimally runnable"; do not write it as an engineering test framework by default
Details about documentation-style minimal migration, single-file example organization, and validation-script naming and saving rules can be found in `references/output-and-validation.md`.
### 3. Rewrite Parallelism Model
On the Ascend side, follow these rules first:
- Prefer a 1D grid
- Switch from GPU logical-grid thinking to Ascend physical-core-binding thinking
- Elementwise/vector operators should target the Vector Core path first
- Operators containing `tl.dot` should target the AI Core path first
Further judge based on this set of "general convergence rules"; do not mechanically retain all implementation branches from the GPU version:
- If the original implementation has multiple kernels, environment-variable branches, or automatic dispatch across different data paths, first distinguish which are "semantically necessary" and which are merely "performance strategies on GPU"
- For performance branches that are clearly no longer needed on Ascend, converge to a single kernel or fewer paths; focus on preserving semantics rather than every historical branch
- If an operator is essentially elementwise, but the original implementation uses a complex 2D/3D grid, additional tiling, or multi-version kernels, first evaluate whether it can be rewritten into a more straightforward 1D-grid, fixed-configuration, single-path implementation
- If an operator uses a multi-dimensional grid, do not just think about "compressing the multi-dimensional grid into 1D"; first judge which grid dimensions are only logical chunk / token / tile dimensions, and whether they are better moved into the kernel's internal loop to reduce scheduling dimensions
- Do not mechanically classify based on "`tl.dot` appears in the source code"; if `tl.dot` is only used to implement intermediate techniques such as prefix-sum, local scan, or triangular-mask aggregation, first judge whether the operator's main semantics are really a reduction/scan, or whether it should indeed follow the AI Core path
- If the operator naturally has structures like chunk, tile, window, prefix-sum, or local reduction, do not just follow the original block-pointer logic; also evaluate whether "rearrange the layout first, then perform vectorized computation" is a better fit for Ascend
- If an auxiliary tensor (such as gate, mask, bias, index, state-gate) is not contiguous in the current access direction, first perform a lightweight, semantics-preserving layout rearrangement on the wrapper side, then access it inside the kernel with a simpler linear pointer or a more regular access pattern
- If the main loop order is rearranged, such as changing from "K first, then T" to "T first, then K", re-review the `shape/stride/block_ptr/order` of state tensors, cache tensors, and historical block tensors at the same time; do not just change the scheduling order while keeping the old views and patching over them with additional indexing
- If common capabilities such as device-attribute tools or shared layout helpers already exist in the current project, prioritize reusing the project helpers instead of writing inline replacements by default
- However, if the current output target is an "independent runnable script" or "minimal validation script", check whether these helpers rely on extra initialization; if they depend on project initialization steps, either add that initialization or clearly state the preconditions in the result
- When you decide to "delete branches / converge the implementation", explain the reason in the result: whether the branch only serves GPU autotune, only serves shared-memory selection, or has no clear benefit on Ascend
- If the runtime log of the migrated Triton-Ascend kernel shows warnings like `Please DO NOT tune args ['num_warps']`, first check whether GPU-style launch/tuning parameters are still mechanically retained; for a minimally runnable implementation on Ascend, do not keep these parameters by default unless you can point to a clear compilation requirement or a measured benefit
- Do not use only one set of generic shapes in the validation script; derive the test set from the operator's features, covering at least one non-divisible block case, one case most likely to trigger branch differences, and one case closer to the real working set
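As an illustration of the last rule, a test-shape set for a hypothetical 1D elementwise kernel with `BLOCK_SIZE = 256` might look like this (the concrete sizes are assumptions chosen per feature, not taken from any real suite):

```python
BLOCK_SIZE = 256  # assumed main block size for this sketch

test_shapes = [
    (1024,),     # divisible: the plain happy path
    (1000,),     # non-divisible tail: exercises the mask branch
    (1,),        # degenerate size: most likely to expose branch differences
    (1 << 22,),  # larger size closer to a real working set
]

# The rule is actually satisfied: at least one non-divisible block case.
assert any(n % BLOCK_SIZE != 0 for (n,) in test_shapes)
```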
If the user provides a 2D/3D grid, first evaluate whether it can be folded into a 1D grid with the indices recovered inside the kernel. Details about `coreDim`, UB, `shape/stride/block_ptr/order`, and `TRITON_ALL_BLOCKS_PARALLEL` can be found in the migration and optimization reference listed under Additional Resources.
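The fold-then-recover pattern amounts to simple index arithmetic. The sketch below models it in plain Python; inside a Triton kernel the same two lines would operate on `tl.program_id(0)`.

```python
# A (grid_m, grid_n) launch folded into a 1D grid of grid_m * grid_n programs:
# each program recovers its logical 2D block indices from the flat program id.
def unfold_pid(pid, grid_n):
    pid_m = pid // grid_n   # row-block index
    pid_n = pid % grid_n    # column-block index
    return pid_m, pid_n

grid_m, grid_n = 3, 4
pairs = [unfold_pid(p, grid_n) for p in range(grid_m * grid_n)]
print(pairs[5], pairs[-1])  # → (1, 1) (2, 3)
```

The launch side then degenerates to a single dimension, `grid = (grid_m * grid_n,)`, which matches the 1D-grid preference above.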
## Optimization and Troubleshooting
### Default Rules for Direct Optimization
Directly provide the optimized implementation if any of the following holds:
- The physical core count (`coreDim`) is clearly exceeded
- UB usage is clearly too large
- Memory access is scattered but can be restructured into contiguous access
- The mask load/store has a clearly better formulation
- The dtype clearly causes vector operations to degrade to scalar operations
If none of these apply, especially for simple examples like vector addition, do not output an enhanced wrapped version by default just to "look more complete". Provide the minimal migration version first, then list the enhancements under "Optional Optimization".
### Optimization Priority
- Adjust the grid and number of cores
- Adjust the main block size
- Introduce or restructure sub-block loops
- Correct `shape/stride/block_ptr/order`
- Evaluate `multibuffer`
- Evaluate `TRITON_ALL_BLOCKS_PARALLEL`
- Evaluate `care_padding=False` and related compilation optimization items
- Adjust the dtype path without breaking semantics
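The sub-block loop in the third item can be pictured as follows. This plain-Python model (an assumption for illustration, not Triton-Ascend API) shows the loop structure a kernel would use when one program's logical block exceeds what fits in UB at once.

```python
# One program owns [block_start, block_start + block_size); UB only fits
# sub_size elements per trip, so the block is processed in sub-block chunks,
# clipped at both the block end and the total element count.
def sub_block_ranges(block_start, block_size, sub_size, n_elements):
    ranges = []
    for offset in range(0, block_size, sub_size):
        lo = block_start + offset
        hi = min(lo + sub_size, block_start + block_size, n_elements)
        if lo < hi:
            ranges.append((lo, hi))
    return ranges

# BLOCK_SIZE=1024 split into BLOCK_SIZE_SUB=256 trips, with a 1000-element tail.
print(sub_block_ranges(0, 1024, 256, 1000))
# → [(0, 256), (256, 512), (512, 768), (768, 1000)]
```

In an actual kernel the same structure becomes a `for` loop over sub-block offsets with masked `tl.load`/`tl.store` per trip; the parameter pair mirrors the `BLOCK_SIZE_SUB`/`XBLOCK_SUB` naming used earlier in this document.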
### Key Points to Cover
The output must cover these points:
- 1D grid
- Physical core binding
- Distinction between vector operators and operators containing `tl.dot`
- UB limit
- Contiguous / aligned memory access
- Re-review of `shape/stride/block_ptr/order`
- `TRITON_ALL_BLOCKS_PARALLEL`
- Scalar degradation caused by dtype
## Fixed Output Template
Always output in this structure:
```markdown
## Migration Conclusion
- Input Source:
- Operator Type:
- Main Migration Actions:

## Triton-Ascend Implementation
- Provide the final kernel and calling wrapper code
- For basic migration scenarios, first provide the "minimal diff migration version"
- Only additionally provide an "engineering enhanced/optimized version" when users request it, or when there are clear optimization opportunities
- If clear optimization opportunities exist, directly provide the optimized version
- Explain the save path and naming of the generated file

## Validation Script
- Provide a minimally executable validation script
- Compare against a PyTorch reference
- Include at least an `allclose` check or a maximum-error printout
- Explain the save path of the validation script
- Clearly state whether it was actually executed, along with the execution commands and results

## Optimization Instructions
- Explain the reasons for adjusting the grid / number of cores / block / sub-block
- State whether `coreDim`, UB, memory access, dtype, and mask performance issues were handled
- State whether `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`, `care_padding=False` are used

If the current task is a "documentation-style minimal migration", this section can be extremely concise:
- Only state that minimal migration was performed first
- Briefly note that optimization items like `coreDim` / UB / `multibuffer` were not expanded for this task
- Do not expand into lengthy optimization analysis just to fit the template

## Risks and Limitations
- List unvalidated boundary conditions
- List information that users need to supplement
- If the script fails to run, clearly state which step it is stuck on
```
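A validation script following this template can be as small as the sketch below. Since it must stay self-contained here, a plain function stands in for the migrated wrapper and NumPy stands in for the PyTorch reference; on hardware, replace `migrated_add` with the real Triton-Ascend wrapper call and run on the NPU device. All names are illustrative.

```python
import numpy as np

def migrated_add(x, y):
    # Stand-in for the migrated Triton-Ascend wrapper (assumption for this sketch).
    return x + y

def reference_add(x, y):
    # Reference semantics (PyTorch in practice; NumPy here for self-containment).
    return x + y

# Cover both a divisible and a non-divisible size for an assumed BLOCK_SIZE=256.
for n in (256, 1000):
    rng = np.random.default_rng(0)
    x = rng.random(n).astype(np.float32)
    y = rng.random(n).astype(np.float32)
    out, ref = migrated_add(x, y), reference_add(x, y)
    max_err = float(np.abs(out - ref).max())
    assert np.allclose(out, ref, rtol=1e-5, atol=1e-6), max_err
    print(f"n={n}: max abs error {max_err:.3e}")
```

This covers the template's minimum bar: a reference comparison, an `allclose` check, and a maximum-error printout per shape.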
If the user's question is "How to use this skill", add a minimal "Usage" section before the fixed template, limited to 3 to 6 lines, explaining:
- What input the user should provide
- Whether to handle it according to local or container scenario
- What outputs will be generated next
Then proceed to the normal migration output.
If users ask about command lines, containers, directory switching, or validation command templates, refer to the usage, local commands, and container scenarios reference under Additional Resources; do not include these details in every migration response by default.
## Additional Resources
For detailed rules, refer to:
- Usage, Local Commands and Container Scenarios
- Input Methods and Context Completion
- Output, Naming and Minimal Validation Script
- Migration and Optimization Reference
- Typical Examples and Output Samples
- Manual Review Test Checklist