Triton-Ascend Migration
Quick Start
When handling a migration request, follow this sequence:
- First identify the input method:
  - File path / specified code snippet
  - Code pasted directly by the user
- Then identify the input source:
  - GPU/CUDA Triton kernel
  - Python/PyTorch operator implementation
- Then identify the operator type:
  - `elementwise`
  - `broadcast / mask`
  - `reduce`
  - contains `tl.dot`
- First create a minimally runnable version:
  - Change `cuda` -> `npu`
  - Add `import torch_npu`
  - Remove GPU-specific device logic
  - Prefer a 1D grid
  - For simple tutorial examples, default to the "minimal diff migration version"
- Perform Ascend-side optimization after the code runs successfully:
  - Physical core binding
  - Tune `BLOCK_SIZE/XBLOCK` and `BLOCK_SIZE_SUB/XBLOCK_SUB`
  - Contiguous/aligned memory access
  - Troubleshoot `coreDim` / UB / dtype / mask issues
- If clear optimization opportunities exist, output the optimized implementation directly instead of stopping at suggestions.
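As a sketch of the minimal-diff step, a typical tutorial vector-add wrapper would change like this; the surrounding code is illustrative, not taken from any specific source file:

```diff
 import torch
+import torch_npu  # Ascend PyTorch plugin; enables the 'npu' device
 import triton
 import triton.language as tl

 def add(x: torch.Tensor, y: torch.Tensor):
-    output = torch.empty(x.shape, device='cuda', dtype=x.dtype)
+    output = torch.empty(x.shape, device='npu', dtype=x.dtype)
     n_elements = output.numel()
     grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
     add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
     return output
```

The kernel body, names, and `BLOCK_SIZE` stay untouched; only the device string changes and `torch_npu` is imported.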
How to Use This Skill
If the user asks "How do I use this skill?", do not immediately dive into a lengthy migration analysis; first give a concise usage guide in 3 to 6 lines, then continue based on the input the user provides.
Only keep these points in the concise guide:
- The user can provide Triton/CUDA code, a PyTorch reference implementation, a file path, or error/performance logs.
- The user should ideally also state the runtime environment: local command line, an existing container, CI, or code generation only without execution.
- If the user has preferences, they should state them too: minimal diff migration, documentation style, directly provide the optimized version, or get it running first and optimize later.
- You will output, depending on the scenario: a Triton-Ascend implementation, a minimal validation script, execution commands, and optimization notes.
If the user follows up with questions like "How exactly should I ask?", "What commands do I write?", or "How do I run this in a container?", then read `references/usage.md` and provide local commands, container commands, and example prompts as needed; do not copy the entire long guide into regular answers.
Copy this checklist and track progress:
```text
Migration Progress
- [ ] Identify input source and operator type
- [ ] First perform minimal migration or semantic rewriting
- [ ] Adjust to Ascend-friendly parallelism and grid
- [ ] Redesign block / tiling
- [ ] Review stride / block_ptr / alignment
- [ ] Handle coreDim / UB / scalar degradation
- [ ] Implement feasible optimizations directly
- [ ] Generate and save a minimal NPU validation script
- [ ] Actually execute the validation script
- [ ] Output results and optimization notes
```
Input Identification
First answer these three questions:
- Is the user providing a file path or directly pasting code?
- Is it a complete script, partial snippet, or single kernel?
- Is it GPU Triton migration or Python/PyTorch semantic rewriting?
Details about input methods, default handling when information is missing, and priority when a file path conflicts with pasted code are in `references/input-modes.md`.
Scenario A: GPU Triton -> Triton-Ascend
Check first:
- Whether `device='cuda'` exists
- Whether there is GPU-specific device acquisition or assertion logic
- Whether a GPU-style free-form multi-dimensional grid is retained
- Whether `tl.dot` is used
- Whether complex `shape/stride/block_ptr/order` usage exists
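A quick textual pre-scan can flag these candidates before a close read. The helper below is hypothetical and its regexes are deliberately rough (the grid pattern in particular can over-match); it only surfaces lines worth inspecting, it does not replace reading the kernel:

```python
# Sketch: flag GPU-specific patterns in Triton kernel source (illustrative helper).
import re

GPU_PATTERNS = {
    "cuda device string": r"device\s*=\s*['\"]cuda",
    "torch.cuda call": r"torch\.cuda\.",
    "tl.dot usage": r"\btl\.dot\s*\(",
    "multi-dim grid": r"grid\s*=\s*\([^)]*,[^)]*\)",  # rough: any grid tuple with a comma
}

def scan_kernel_source(source: str):
    """Return the checklist items that textually match the kernel source."""
    return [name for name, pat in GPU_PATTERNS.items() if re.search(pat, source)]

src = """
output = torch.empty_like(x, device='cuda')
grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
acc = tl.dot(a, b, acc)
"""
print(scan_kernel_source(src))
# -> ['cuda device string', 'tl.dot usage', 'multi-dim grid']
```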
Scenario B: Python/PyTorch -> Triton-Ascend
Extract the semantics first, then write the Triton code:
- Relationship between input and output tensors
- Indexing and broadcasting scheme
- Mask / reduce logic
- dtype and precision requirements
- Whether the original PyTorch implementation already has naturally contiguous memory access
If the original operator is only a reference implementation, first write a semantically equivalent Triton-Ascend version, then optimize.
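The extraction step can be pinned down as a tiny executable reference before any Triton code is written. The operator below (a masked add) is illustrative, not from any particular source:

```python
# Sketch: capture operator semantics as a plain-Python reference first.

def masked_add_reference(x, y, mask):
    """out[i] = x[i] + y[i] where mask[i], else x[i]; same shape and dtype."""
    assert len(x) == len(y) == len(mask)
    return [xi + yi if m else xi for xi, yi, m in zip(x, y, mask)]

# The reference fixes: I/O relationship (elementwise, same shape),
# indexing (flat), mask logic (select), and precision (plain float add).
out = masked_add_reference([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [True, False, True])
assert out == [11.0, 2.0, 33.0]
```

With the semantics frozen in a few lines like this, the later validation script can compare the Triton-Ascend kernel against it directly.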
Migration Process

1. Collect Minimal Necessary Information
Collect this information first; supplement whatever is missing:
- Input code or a minimal reproduction case
- Input method: file path / specified code snippet / code pasted directly by the user
- shape, dtype, stride
- Whether there is mask, broadcast, reduce
- The current error or performance issue
- Whether exact precision consistency is required
- Runtime environment: local command line, inside a container, CI, or code generation only without execution
If information is incomplete, supplement it in this order:
- First infer from the existing code
- Then complete the validation script with minimal reasonable assumptions
- Only then ask the user for the information that is truly required
If the missing information is the "execution location", infer it in this order:
- First check whether the user provided a container name, `docker exec` usage, a container path, or image information
- Then check whether the user provided a local file path, the current directory, or terminal commands
- If still undetermined, ask: "Should I write the validation steps for the local command line or for a container environment?"
2. First Perform Minimal Migration or Semantic Rewriting
By default, aim first for "semantic alignment and successful execution":
- GPU Triton: first change `cuda` to `npu`
- Add `import torch_npu`
- Remove GPU-specific device logic
- For documentation/tutorial-style simple examples, keep the original kernel name, wrapper name, `BLOCK_SIZE`, grid style, and overall code structure unchanged
- In the first version, do not proactively add `contiguous()`, extra assertions, function renames, or engineering packaging, unless the user explicitly requests an "enhanced/production version" or these changes are required to fix a deterministic issue on the NPU
- Python/PyTorch: first rewrite the original computation semantics into the most straightforward Triton kernel
Do not over-rewrite in the first step.
If the user explicitly uses these signal phrases:
- Official documentation style
- Strict minimal migration
- Minimal diff
- No engineering enhanced version
- Only refer to official migration examples
then this "minimal migration mode" overrides the more general optimization requirements that follow:
- Make only the necessary code modifications
- The optimization notes may be just 1 to 3 lines, clearly stating "no in-depth optimization is performed for this task"
- Do not force in `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`, `care_padding=False`, physical core binding, and similar content just to complete the template
- Do not let the response style drift from "documentation diff" into "engineering optimization overview"
- The validation script should likewise stay "minimally runnable"; do not default to an engineering test framework
Details on documentation-style minimal migration, single-file example organization, and validation-script naming and saving rules are in `references/output-and-validation.md`.
3. Rewrite Parallelism Model
On the Ascend side, follow these rules first:
- Prefer a 1D grid
- Switch from GPU logical-grid thinking to Ascend physical-core-binding thinking
- Design Vector-only operators around the Vector Core path first
- Design operators containing `tl.dot` around the AI Core path first
Then apply this set of "general convergence rules"; do not mechanically retain every implementation branch from the GPU:
- If the original implementation has multiple kernels, `autotune`, environment-variable branches, or automatic dispatch across data paths, first distinguish which are "semantically necessary" and which are merely "performance strategies on GPU"
- For performance branches that are clearly no longer necessary on Ascend, converge to a single kernel or fewer paths; focus on preserving semantics rather than every historical branch
- If an operator is essentially Vector-only but the original implementation uses complex `block_ptr`, 2D/3D grids, extra tiling, or multiple kernel versions, first evaluate whether it can become a more direct 1D-grid, fixed-configuration, single-path implementation
- If an operator contains `tl.dot`, do not think only about "compressing the multi-dimensional grid into 1D"; first judge which grid dimensions are merely logical chunk / token / tile dimensions and whether they are better moved into the kernel's inner loop to reduce scheduling dimensions
- Do not classify mechanically just because `tl.dot` appears in the source; if `tl.dot` is only used for intermediate techniques such as prefix-sum, local scan, or triangular-mask aggregation, still judge from the operator's main semantics whether it is closer to a Vector-only reduction/scan or genuinely belongs on the AI Core path
- If the operator naturally carries chunk, tile, window, prefix-sum, or local-reduction structure, do not just carry over the original per-block pointer logic; also evaluate whether "rearrange the layout first, then do vectorized computation" suits Ascend better
- If an auxiliary tensor (such as gate, mask, bias, index, state-gate) is not contiguous along the current access direction, first do a lightweight `transpose/contiguous` or equivalent layout rearrangement on the wrapper side, then access it inside the kernel with a simpler linear ptr or a more regular `block_ptr`
- If the main loop order is rearranged, for example from "K first, then T" to "T first, then K", re-review the `shape/stride/block_ptr/order` of state tensors, cache tensors, and historical-block tensors at the same time; do not change only the scheduling order while keeping the old view and patching it up with `trans` or extra indexing
- If the current project already provides shared capabilities such as `get_vectorcore_num()`, device-attribute utilities, or common layout helpers, reuse the project helpers instead of hand-writing inline replacements by default
- However, if the current output target is an "independent runnable script" or a "minimal validation script", also check whether those helpers depend on extra initialization; if they rely on project initialization steps, either add the initialization or state the preconditions clearly in the result
- When you decide to "delete branches / converge the implementation", explain why in the result: whether the branch only served GPU autotune, only served shared-memory selection, or has no clear benefit on Ascend
- If the runtime log of the migrated Triton-Ascend code shows warnings like `Please DO NOT tune args ['num_warps']` / `['num_stages']`, first check whether GPU-style launch/tuning parameters were mechanically retained; for a minimally runnable Ascend implementation, do not keep these parameters by default unless you can cite a clear compilation requirement or a measured benefit
- Do not use only one set of generic shapes in the validation script; derive the test set from the operator's characteristics, covering at least one non-divisible block, one case most likely to trigger branch differences, and one case closer to the real working set
If the user provides a 2D/3D grid, first evaluate whether it can be folded into a 1D grid with the indices recovered inside the kernel. Details on `coreDim`, UB, `shape/stride/block_ptr/order`, `care_padding=False`, `TRITON_ALL_BLOCKS_PARALLEL`, and `multibuffer` are in `references/reference.md`.
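The 2D-to-1D grid folding can be sketched in plain Python standing in for the Triton index arithmetic; the names are illustrative:

```python
# Sketch: fold a GPU-style 2D launch grid into 1D and recover indices in-kernel.

def fold_grid(num_blocks_m, num_blocks_n):
    """2D grid (num_blocks_m, num_blocks_n) -> flat 1D grid size."""
    return num_blocks_m * num_blocks_n

def recover_indices(pid, num_blocks_n):
    """Inside the kernel: recover (pid_m, pid_n) from the flat program id."""
    pid_m = pid // num_blocks_n
    pid_n = pid % num_blocks_n
    return pid_m, pid_n

# Every (pid_m, pid_n) pair of the original 2D grid appears exactly once.
grid_1d = fold_grid(3, 4)
pairs = [recover_indices(pid, 4) for pid in range(grid_1d)]
assert sorted(pairs) == [(m, n) for m in range(3) for n in range(4)]
```

In a real kernel the same arithmetic runs on `tl.program_id(0)`, with `num_blocks_n` passed in as a scalar argument or `tl.constexpr`.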
Optimization and Troubleshooting
Default Rules for Direct Optimization
Provide the optimized implementation directly if any of the following conditions hold:
- `coreDim` is clearly exceeded
- UB usage is clearly too large
- Memory access is scattered but can be restructured into contiguous access
- The mask load/store has a better formulation
- The dtype clearly causes vector operations to degrade to scalar operations
If none of these conditions hold, especially for simple examples like vector addition, do not output an enhanced wrapped version by default just to "look more complete". Give the minimal migration version first, then list the enhancements under "Optional Optimizations".
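The "coreDim exceeded" fix usually means binding one program per physical core and looping inside the kernel. The planner below is a sketch with an assumed 40-core Vector Core count and illustrative sizes; the real core count should come from the device, e.g. a project helper like `get_vectorcore_num()`:

```python
# Sketch: clamp the launch grid to the physical core count instead of
# over-subscribing; each program then loops over its share of blocks.
CORE_DIM_LIMIT = 65535  # coreDim upper bound noted in this document

def plan_grid(n_elements, block_size, num_cores=40):
    """Return (grid, blocks_per_program) for a core-bound 1D launch."""
    total_blocks = -(-n_elements // block_size)    # ceil division
    grid = min(total_blocks, num_cores)
    blocks_per_program = -(-total_blocks // grid)  # in-kernel loop count
    assert grid <= CORE_DIM_LIMIT
    return grid, blocks_per_program

grid, loops = plan_grid(n_elements=10_000_000, block_size=1024)
# 9766 logical blocks collapse onto 40 programs, each looping 245 times
```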
Optimization Priority
- Adjust the grid and number of cores
- Adjust the main block size
- Introduce or restructure sub-block loops
- Correct `shape/stride/block_ptr/order`
- Evaluate `care_padding=False`
- Evaluate `TRITON_ALL_BLOCKS_PARALLEL`
- Evaluate `multibuffer` and related compilation optimization options
- Adjust the dtype path without breaking semantics
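The "main block + sub-block loop" pattern behind the second and third items can be sketched in plain Python, with illustrative names standing in for `BLOCK_SIZE` / `BLOCK_SIZE_SUB`:

```python
# Sketch: cover one main block in UB-sized sub-block chunks.

def sub_block_offsets(block_start, block_size, sub_block_size):
    """Yield (sub_start, sub_len) pairs covering one main block, so the
    kernel can process a large BLOCK_SIZE in BLOCK_SIZE_SUB-sized pieces."""
    for sub_start in range(block_start, block_start + block_size, sub_block_size):
        sub_len = min(sub_block_size, block_start + block_size - sub_start)
        yield sub_start, sub_len

# A 1000-element main block processed in 256-element sub-blocks;
# the last tile is short, which is what the load/store mask handles.
tiles = list(sub_block_offsets(block_start=0, block_size=1000, sub_block_size=256))
# -> [(0, 256), (256, 256), (512, 256), (768, 232)]
```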
Key Points to Cover
The output must cover:
- `cuda` -> `npu` and `torch_npu`
- 1D grid
- Physical core binding
- The distinction between Vector-only operators and operators containing `tl.dot`
- `coreDim <= 65535`
- UB limits
- Contiguous / aligned memory access
- Re-review of `shape/stride/block_ptr/order`
- `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`, `care_padding=False`
- Scalar degradation caused by dtype
Fixed Output Template
Always output in this structure:
Migration Conclusion
- Input Source:
- Operator Type:
- Main Migration Actions:
Triton-Ascend Implementation
- Provide the final kernel and the calling wrapper code
- If the scenario is only a basic migration, give the "minimal diff migration version" first
- Provide an additional "engineering enhanced/optimized version" only when the user requests it or there is clearly identified optimization headroom
- If clear optimization opportunities exist, provide the optimized version directly
- State the save path and naming of the generated file
Validation Script
- Provide a minimal executable validation script
- Compare against a PyTorch reference
- Include at least `allclose` or a maximum-error printout
- State the save path of the validation script
- State clearly whether it was actually executed, along with the execution commands and results
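The comparison logic such a script needs is small; the sketch below uses plain Python lists so it is self-contained, whereas a real script would apply the same `allclose` / max-error check to torch tensors (NPU output vs. PyTorch reference):

```python
# Sketch: the core of a minimal validation script's comparison step.

def max_abs_error(reference, actual):
    """Largest elementwise absolute difference."""
    return max(abs(r - a) for r, a in zip(reference, actual))

def allclose(reference, actual, rtol=1e-5, atol=1e-8):
    """Elementwise |r - a| <= atol + rtol * |r|, mirroring torch.allclose."""
    return all(abs(r - a) <= atol + rtol * abs(r) for r, a in zip(reference, actual))

reference = [1.0, 2.0, 3.0]   # would come from the PyTorch reference op
actual = [1.0, 2.0000001, 3.0]  # would come from the Triton-Ascend kernel
print("max abs error:", max_abs_error(reference, actual))
assert allclose(reference, actual)
```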
Optimization Instructions
- Explain why the grid / core count / block / sub-block were adjusted
- State whether `coreDim`, UB, memory-access, dtype, and mask performance issues were handled
- State whether `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`, or `care_padding=False` are used
If the current task is a "documentation-style minimal migration", this section can be extremely concise:
- Only state that minimal migration is being done first
- State in one sentence that optimizations such as `coreDim` / UB / `multibuffer` are not expanded for this task
- Do not expand into lengthy optimization analysis just to fit the template
Risks and Limitations
- List unvalidated boundary conditions
- List information the user needs to supplement
- If the script fails to run, state clearly which step it is stuck on
If the user's question itself is "How do I use this skill?", add a minimal "Usage" section before the formal template, limited to 3 to 6 lines, explaining:
- What input the user should provide
- Whether the local or the container scenario is being handled
- What you will produce next
Then proceed to the normal migration output.
If the user asks further about command lines, containers, directory switching, or validation command templates, read `references/usage.md`; do not include these details in every migration response by default.
Additional Resources
For detailed rules, refer to:
- Usage, Local Commands and Container Scenarios
- Input Methods and Context Completion
- Output, Naming and Minimal Validation Script
- Migration and Optimization Reference
- Typical Examples and Output Samples
- Manual Review Test Checklist