triton-ascend-migration

Triton-Ascend Migration

Quick Start

When handling migration requests, follow the sequence below:
  1. First identify the input method:
    • File path / specified code snippet
    • User directly pastes code
  2. Then identify the input source:
    • GPU/CUDA Triton kernel
    • Python/PyTorch operator implementation
  3. Then identify the operator type:
    • elementwise
    • broadcast / mask
    • reduce
    • Contains `tl.dot`
  4. First create a minimally runnable version:
    • cuda -> npu
    • Add the `torch_npu` import
    • Remove GPU-specific device logic
    • Prioritize 1D grid
    • For simple tutorial examples, default to the "minimal diff migration version"
  5. Perform Ascend-side optimization after the code runs successfully:
    • Physical core binding
    • BLOCK_SIZE/XBLOCK
    • BLOCK_SIZE_SUB/XBLOCK_SUB
    • Contiguous/aligned memory access
    • Troubleshoot `coreDim` / UB / dtype / mask issues
  6. If clear optimization opportunities exist, directly output the optimized implementation instead of just providing suggestions.
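Step 4 above can be sketched as a purely textual first pass (an illustrative sketch only; `minimal_migrate` and its replacement rules are hypothetical, and a real migration still needs manual review):

```python
def minimal_migrate(src: str) -> str:
    """Apply the minimal-diff migration rules to Triton source text.

    Illustrative only: covers cuda -> npu, the torch_npu import, and
    dropping GPU-specific device logic. Real code needs manual review.
    """
    lines = []
    for line in src.splitlines():
        # Drop GPU-specific device assertions instead of porting them.
        if "torch.cuda." in line:
            continue
        # cuda -> npu for device strings.
        line = line.replace("device='cuda'", "device='npu'")
        line = line.replace('device="cuda"', 'device="npu"')
        lines.append(line)
    out = "\n".join(lines)
    # Add the torch_npu import right after the torch import.
    if "import torch_npu" not in out:
        out = out.replace("import torch\n", "import torch\nimport torch_npu\n", 1)
    return out

src = (
    "import torch\n"
    "import triton\n"
    "assert torch.cuda.is_available()\n"
    "x = torch.rand(1024, device='cuda')\n"
)
print(minimal_migrate(src))
```

Everything beyond these textual edits (grid shape, core binding, block sizes) belongs to step 5, after the minimal version runs.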

How to Use This Skill

If users ask "How do I use this skill?", do not immediately dive into a lengthy migration analysis; first give a concise usage guide in 3 to 6 lines, then proceed based on the user's input.
Only keep these points in the concise guide:
  • Users can provide Triton/CUDA code, a PyTorch reference implementation, a file path, or error/performance logs.
  • Users should ideally also state the runtime environment: local command line, existing container, CI, or code generation only without execution.
  • Users with preferences should state them: minimal diff migration, documentation style, get it running first then optimize, or directly provide the optimized version.
  • Output follows the scenario: a Triton-Ascend implementation, a minimal validation script, execution commands, and optimization notes.
If users follow up with "How do I phrase the request?", "What command do I write?", or "How do I run this in a container?", then read `references/usage.md` and give local commands, container commands, and example prompts as needed; do not paste the entire long guide into regular answers.
Copy this checklist to track progress:

```text
Migration Progress
- [ ] Identify the input source and operator type
- [ ] Perform the minimal migration or semantic rewrite first
- [ ] Adjust to Ascend-friendly parallelism and grid
- [ ] Redo block / tiling
- [ ] Review stride / block_ptr / alignment
- [ ] Handle coreDim / UB / scalar degradation
- [ ] Land feasible optimizations directly
- [ ] Generate and save a minimal NPU validation script
- [ ] Actually execute the validation script
- [ ] Output the results and optimization notes
```

Input Identification

First answer these three questions:
  1. Is the user providing a file path or directly pasting code?
  2. Is it a complete script, a partial snippet, or a single kernel?
  3. Is it a GPU Triton migration or a Python/PyTorch semantic rewrite?
Details on input methods, default handling when information is missing, and the priority when a file path conflicts with pasted code can be found in `references/input-modes.md`.
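The three questions can be pre-answered with a rough heuristic (a sketch; `classify_input` and the markers it checks are illustrative, not exhaustive):

```python
def classify_input(code):
    """Roughly classify a pasted snippet before planning the migration."""
    is_triton = "@triton.jit" in code or "tl." in code
    has_wrapper = "def " in code and "grid" in code
    return {
        # GPU Triton migration vs. Python/PyTorch semantic rewrite.
        "source": "triton" if is_triton else "pytorch",
        # Single kernel vs. kernel plus launch wrapper.
        "scope": "kernel+wrapper" if (is_triton and has_wrapper) else "snippet",
    }

snippet = "@triton.jit\ndef add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr): ..."
print(classify_input(snippet))
```

A file path, by contrast, is simply read first and then classified the same way.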

Scenario A: GPU Triton -> Triton-Ascend

Prioritize checking:
  • Whether `device='cuda'` exists
  • Whether there is GPU-specific device acquisition or assertion logic
  • Whether a GPU-style free multi-dimensional grid is retained
  • Whether `tl.dot` is used
  • Whether complex `shape/stride/block_ptr/order` usage exists
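These checks can be run as an automated first pass over the source (a sketch; `scan_gpu_idioms` and its patterns are illustrative):

```python
import re

def scan_gpu_idioms(src):
    """Flag GPU-specific idioms that usually need attention on Ascend."""
    findings = []
    if re.search(r"device\s*=\s*['\"]cuda['\"]", src):
        findings.append("device='cuda'")
    if "torch.cuda" in src:
        findings.append("GPU-specific device logic")
    if "tl.dot" in src:
        findings.append("tl.dot (consider the AI Core path)")
    if "make_block_ptr" in src:
        findings.append("block_ptr (re-review shape/stride/order)")
    return findings

src = "x = torch.empty(n, device='cuda')\nacc = tl.dot(a, b)\n"
print(scan_gpu_idioms(src))
```

Anything flagged here feeds directly into the minimal-migration step; anything not flagged is usually safe to carry over unchanged.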

Scenario B: Python/PyTorch -> Triton-Ascend

First extract semantics, then write Triton code:
  • Input-output tensor relationship
  • Indexing and broadcasting method
  • Mask / reduce logic
  • dtype and precision requirements
  • Whether the original PyTorch implementation already has naturally contiguous memory access
If the original operator is only a reference implementation, first write a semantically equivalent Triton-Ascend version, then proceed with optimization.
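Writing the semantics down as a tiny dependency-free reference before any Triton code can look like this (a hypothetical example; `masked_row_sum` stands in for whatever operator is being migrated):

```python
def masked_row_sum(rows, mask):
    """Reference semantics: sum each row, skipping positions where mask is 0.

    Captures the input/output relationship, masking, and reduce logic that
    the Triton-Ascend kernel must reproduce before any optimization.
    """
    out = []
    for row, m in zip(rows, mask):
        out.append(sum(v for v, keep in zip(row, m) if keep))
    return out

rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
mask = [[1, 0, 1], [1, 1, 0]]
print(masked_row_sum(rows, mask))  # [4.0, 9.0]
```

The Triton-Ascend version is then validated against this reference first; only once they agree does tuning begin.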

Migration Process

1. Collect Minimal Necessary Information

Prioritize collecting this information; supplement what is missing:
  • Input code or minimal reproduction case
  • Input method: file path / specified code snippet / user directly pastes code
  • shape, dtype, stride
  • Whether there is mask, broadcast, reduce
  • Current error or performance issue
  • Whether exact precision consistency is required
  • Runtime environment: local command line, inside container, CI, or code generation only without execution
If information is incomplete, supplement in this order:
  1. First infer from existing code
  2. Then use minimal reasonable assumptions to complete the validation script
  3. Finally ask users for necessary information
If the "execution location" information is missing, infer in this order:
  1. First check whether the user provided a container name, `docker exec`, a container path, or image information
  2. Then check whether the user provided a local file path, the current directory, or a terminal command
  3. If still undetermined, ask: "Should I write the validation steps for the local command line or for a container environment?"
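The inference order can be sketched as follows (illustrative; the clue strings checked here are assumptions, not a fixed rule set):

```python
def infer_exec_location(message):
    """Infer where validation commands should target, per the order above."""
    msg = message.lower()
    # 1. Container clues first.
    if "docker exec" in msg or "container" in msg:
        return "container"
    # 2. Then local clues: paths or shell prompts.
    if "/" in msg or msg.startswith("$"):
        return "local"
    # 3. Otherwise, ask the user.
    return "ask: local command line or container?"

print(infer_exec_location("run it with docker exec dev python add.py"))
```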

2. First Perform Minimal Migration or Semantic Rewriting

Default to pursuing "semantic alignment and successful execution" first:
  • GPU Triton: first change `cuda` to `npu`
  • Import `torch_npu`
  • Remove GPU-specific device logic
  • For documentation/tutorial-style simple examples, keep the original kernel name, wrapper name, `BLOCK_SIZE`, grid expression, and overall code structure unchanged as far as possible
  • Do not add `contiguous()`, extra assertions, function renames, or engineering packaging in the first version, unless the user explicitly requests an "enhanced/production version" or the change is required to fix a deterministic problem on the NPU
  • Python/PyTorch: first rewrite the original computation semantics into the most straightforward Triton kernel
Do not over-rewrite in the first step.
If the user explicitly uses any of these signal phrases:
  • Official documentation style
  • Strict minimal migration
  • Minimal diff
  • No engineering enhanced version
  • Only refer to official migration examples
then this "minimal migration mode" overrides the more general optimization requirements that follow:
  • Make only the necessary code changes
  • The optimization notes may be just 1 to 3 lines, clearly stating "no in-depth optimization for this task"
  • Do not force in content such as `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`, `care_padding=False`, or physical core binding just to complete the template
  • Do not let the response drift from a "documentation diff" into an "engineering optimization overview"
  • Keep the validation script "minimally runnable" as well; do not default to an engineering test framework
Documentation-style minimal migration, single-file example organization, and validation-script naming and saving rules are covered in `references/output-and-validation.md`.
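Detecting the signal phrases can be as simple as a substring check (a sketch; `wants_minimal_mode` and the phrase list mirror the bullets above):

```python
MINIMAL_MODE_SIGNALS = (
    "official documentation style",
    "strict minimal migration",
    "minimal diff",
    "no engineering enhanced version",
    "only refer to official migration examples",
)

def wants_minimal_mode(request):
    """True if the request contains any of the minimal-migration signal phrases."""
    req = request.lower()
    return any(signal in req for signal in MINIMAL_MODE_SIGNALS)

print(wants_minimal_mode("Please do a minimal diff port of this tutorial kernel"))
```

When this returns True, the minimal-migration mode takes precedence over the template's optimization sections.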

3. Rewrite Parallelism Model

On the Ascend side, follow these rules first:
  • Prefer a 1D grid
  • Switch from GPU logical-grid thinking to Ascend physical-core-binding thinking
  • Think of Vector-only operators along the Vector Core path first
  • Think of operators containing `tl.dot` along the AI Core path first
Then apply this set of "general convergence rules"; do not mechanically retain every implementation branch from the GPU version:
  • If the original implementation has multiple kernels, `autotune`, environment-variable branches, or automatic dispatch across data paths, first distinguish which parts are "semantically necessary" and which are just "performance strategies on GPU"
  • For performance branches that are clearly no longer necessary on Ascend, converge to a single kernel or fewer paths; preserve semantics, not every historical branch
  • If an operator is essentially Vector-only but the original implementation uses complex `block_ptr`, a 2D/3D grid, extra tiling, or multi-version kernels, first evaluate whether it can be rewritten as a more direct 1D-grid, fixed-configuration, single-path implementation
  • If an operator contains `tl.dot`, do not just think about "compressing the multi-dimensional grid into 1D"; first judge which grid dimensions are only logical chunk / token / tile dimensions and whether they are better moved into the kernel's inner loop to reduce scheduling dimensions
  • Do not classify mechanically just because `tl.dot` appears in the source; if `tl.dot` is only used as an intermediate trick for prefix-sum, local scan, or triangular-mask aggregation, still judge by the operator's main semantics whether it is closer to a Vector-only reduction/scan or genuinely belongs on the AI Core path
  • If the operator naturally has chunk, tile, window, prefix-sum, or local-reduction structure, do not just carry over the original per-block pointer logic; also evaluate whether "rearrange the layout first, then do vectorized computation" suits Ascend better
  • If an auxiliary tensor (such as gate, mask, bias, index, state-gate) is not contiguous in the current access direction, first do a lightweight `transpose/contiguous` or equivalent layout rearrangement on the wrapper side, then access it inside the kernel with a simpler linear ptr or a more regular `block_ptr`
  • If the main loop order is rearranged, for example from "K first, then T" to "T first, then K", re-review the `shape/stride/block_ptr/order` of state tensors, cache tensors, and history-block tensors at the same time; do not change only the scheduling order while keeping the old views and patching them up with `trans` or extra indexing
  • If the current project already has shared helpers such as `get_vectorcore_num()`, device-attribute tools, or common layout helpers, reuse them instead of hand-writing inline replacements by default
  • However, if the current output target is a "standalone runnable script" or "minimal validation script", also check whether those helpers need extra initialization; if they depend on project initialization steps, either add the initialization or clearly state the precondition in the result
  • When you decide to "delete branches / converge the implementation", explain why in the result: the branch only served GPU autotune, only served shared-memory selection, or has no clear benefit on Ascend
  • If the runtime log of the migrated Triton-Ascend kernel shows warnings like `Please DO NOT tune args ['num_warps'] ['num_stages']` or similar, first check whether GPU-style launch/tuning parameters were mechanically retained; for a minimally runnable Ascend implementation, do not keep these parameters by default unless you can point to a clear compilation requirement or a measured benefit
  • Do not use only one set of generic shapes in the validation script; derive the test set from the operator's characteristics, covering at least one non-divisible block, one case most likely to trigger branch differences, and one case closer to the real working set
If the user provides a 2D/3D grid, first evaluate whether it can be folded into a 1D grid with the indices recovered inside the kernel. Details on `coreDim`, UB, `shape/stride/block_ptr/order`, `care_padding=False`, `TRITON_ALL_BLOCKS_PARALLEL`, and `multibuffer` can be found in `references/reference.md`.
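The fold-and-recover arithmetic is ordinary row-major indexing; sketched in plain Python (inside a real kernel the same math would run on `tl.program_id(0)`):

```python
def fold_2d_grid(grid_m, grid_n):
    """Launch grid_m * grid_n programs on a 1D grid, recovering 2D indices."""
    total = grid_m * grid_n  # 1D launch size
    for pid in range(total):
        # Recover the original 2D coordinates inside the "kernel".
        pid_m, pid_n = divmod(pid, grid_n)
        yield pid, pid_m, pid_n

# With a 2x3 logical grid, pid 4 maps back to block (1, 1).
print(list(fold_2d_grid(2, 3))[4])
```

The same pattern extends to 3D: fold all dimensions into one product at launch, then peel them off with successive `divmod` calls in the kernel.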

Optimization and Troubleshooting

Default Rules for Direct Optimization

Directly provide the optimized implementation if any of the following holds:
  • `coreDim` clearly exceeds its limit
  • UB usage is clearly too large
  • Memory access is scattered and can be restructured into contiguous access
  • The mask load/store has a better formulation
  • The dtype clearly causes vector operations to degrade into scalar operations
If none of these holds, especially for simple examples like vector addition, do not output an enhanced wrapped version by default just to "look more complete". Provide the minimal migration version first, then list the enhancements under "Optional Optimization".

Optimization Priority

  1. Adjust the grid and the number of cores
  2. Adjust the main block size
  3. Introduce or restructure sub-block loops
  4. Correct `shape/stride/block_ptr/order`
  5. Evaluate `care_padding=False`
  6. Evaluate `TRITON_ALL_BLOCKS_PARALLEL`
  7. Evaluate `multibuffer` and related compilation optimizations
  8. Adjust the dtype path without breaking semantics
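Steps 2 and 3 together amount to two-level tiling; the index arithmetic can be sketched in plain Python (`block` and `block_sub` play the roles of BLOCK_SIZE and BLOCK_SIZE_SUB):

```python
def two_level_tiles(n, block, block_sub):
    """Yield (start, stop) ranges: each main block is walked in sub-blocks.

    Mirrors a BLOCK_SIZE / BLOCK_SIZE_SUB loop: the main block bounds what
    sits in UB at once, while the sub-block bounds each vector operation.
    """
    for base in range(0, n, block):
        hi = min(base + block, n)  # the last main block may be a tail block
        for sub in range(base, hi, block_sub):
            yield sub, min(sub + block_sub, hi)

print(list(two_level_tiles(10, 8, 4)))
```

Note the tail handling: both levels clamp to `n`, which is exactly the case a mask (or `care_padding`) has to cover in the real kernel.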

Key Points to Cover

The output must cover:
  • cuda -> npu
  • `torch_npu`
  • 1D grid
  • Physical core binding
  • The distinction between Vector-only operators and operators containing `tl.dot`
  • coreDim <= 65535
  • UB limits
  • Contiguous / aligned memory access
  • Re-review of `shape/stride/block_ptr/order`
  • `TRITON_ALL_BLOCKS_PARALLEL`
  • `multibuffer`
  • `care_padding=False`
  • Scalar degradation caused by dtype
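The coreDim <= 65535 bound and physical core binding can be combined into one small grid-planning helper (a sketch; `plan_grid` is a hypothetical name and `num_cores=40` is an illustrative vector-core count):

```python
CORE_DIM_LIMIT = 65535  # hard upper bound on coreDim

def plan_grid(n_elements, block_size, num_cores=40):
    """Pick a 1D grid bound to physical cores, keeping coreDim in range.

    num_cores=40 is only an illustrative vector-core count; query the
    real device (e.g. a project helper like get_vectorcore_num()) in practice.
    """
    n_blocks = -(-n_elements // block_size)  # ceil division
    # Bind to physical cores: each core loops over several blocks.
    grid = min(n_blocks, num_cores, CORE_DIM_LIMIT)
    blocks_per_core = -(-n_blocks // grid)
    return grid, blocks_per_core

print(plan_grid(1_000_000, 1024))
```

Capping the grid at the physical core count, and looping inside the kernel instead, is what turns the GPU's "one program per block" launch into an Ascend-friendly one.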

Fixed Output Template

Always output in this structure:

Migration Conclusion

  • Input Source:
  • Operator Type:
  • Main Migration Actions:

Triton-Ascend Implementation

  • Provide the final kernel and its calling wrapper code
  • For basic migration scenarios, provide the "minimal diff migration version" first
  • Add an "engineering enhanced/optimized version" only when the user requests it or there is a clear optimization opportunity
  • If a clear optimization opportunity exists, provide the optimized version directly
  • State the save path and name of the generated file

Validation Script

  • Provide a minimal executable validation script
  • Compare against a PyTorch reference
  • Include at least an `allclose` check or a maximum-error printout
  • State the save path of the validation script
  • Clearly state whether it was actually executed, along with the command and the result
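The error check itself can stay tiny (a pure-Python stand-in for `torch.allclose` plus a max-error printout; the tolerances are illustrative):

```python
def max_abs_err(out, ref):
    """Maximum absolute error between kernel output and the reference."""
    return max(abs(a - b) for a, b in zip(out, ref))

def allclose(out, ref, rtol=1e-5, atol=1e-8):
    """Same pass/fail rule as torch.allclose, element by element."""
    return all(abs(a - b) <= atol + rtol * abs(b) for a, b in zip(out, ref))

out = [1.0, 2.0, 3.000001]
ref = [1.0, 2.0, 3.0]
print(max_abs_err(out, ref), allclose(out, ref))
```

In the real script these operate on flattened tensors from the NPU kernel and the PyTorch reference; printing the max error alongside the pass/fail flag makes precision drift visible even when `allclose` passes.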

Optimization Notes

  • Explain why grid / core count / block / sub-block settings were adjusted
  • State whether `coreDim`, UB, memory access, dtype, and mask performance issues were handled
  • State whether `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`, or `care_padding=False` is used
If the current task is a "documentation-style minimal migration", this section can be extremely concise:
  • State only that a minimal migration is performed first
  • Note in one sentence that optimizations such as `coreDim` / UB / `multibuffer` are not expanded for this task
  • Do not expand into a lengthy optimization analysis just to fit the template

Risks and Limitations

  • List unvalidated boundary conditions
  • List information that needs to be supplemented by users
  • If the script fails to run, clearly state which step it is stuck on

If the user's question itself is "How do I use this skill?", add a minimal "Usage" section before the regular template, limited to 3 to 6 lines, explaining:

- What input the user should provide
- Whether to handle it according to local or container scenario
- What outputs will be generated next

Then proceed to the normal migration output.

If users ask about command lines, containers, directory switching, or validation command templates, read `references/usage.md`; do not stuff these details into every migration response by default.

Additional Resources

For detailed rules, refer to:
  • Usage, Local Commands and Container Scenarios
  • Input Methods and Context Completion
  • Output, Naming and Minimal Validation Script
  • Migration and Optimization Reference
  • Typical Examples and Output Samples
  • Manual Review Test Checklist