Triton-Ascend Migration
Quick Start
When handling a migration request, follow this sequence:
- First identify the input method:
  - File path / specified code snippet
  - Code pasted directly by the user
- Then identify the input source:
  - GPU/CUDA Triton kernel
  - Python/PyTorch operator implementation
- Then identify the operator type:
  - `elementwise`
  - `broadcast / mask`
  - `reduce`
  - contains `tl.dot`
- First create a minimally runnable version:
  - Change `cuda` -> `npu`
  - Add `import torch_npu`
  - Remove GPU-specific device logic
  - Prefer a 1D grid
  - For simple tutorial examples, default to the "minimal diff migration version"
- Perform Ascend-side optimization after the code runs successfully:
  - Physical core binding
  - Tune `BLOCK_SIZE/XBLOCK` and `BLOCK_SIZE_SUB/XBLOCK_SUB`
  - Contiguous/aligned memory access
  - Troubleshoot `coreDim` / UB / dtype / mask issues
- If clear optimization opportunities exist, output the optimized implementation directly instead of stopping at suggestions.
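As a sketch of the minimal-diff step, a typical tutorial vector-add wrapper would change like this; the surrounding code is illustrative, not taken from any specific source file:

```diff
 import torch
+import torch_npu  # Ascend PyTorch plugin; enables the 'npu' device
 import triton
 import triton.language as tl

 def add(x: torch.Tensor, y: torch.Tensor):
-    output = torch.empty(x.shape, device='cuda', dtype=x.dtype)
+    output = torch.empty(x.shape, device='npu', dtype=x.dtype)
     n_elements = output.numel()
     grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
     add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
     return output
```

The kernel body, names, and `BLOCK_SIZE` stay untouched; only the device string changes and `torch_npu` is imported.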
How to Use This Skill
If the user asks "How do I use this skill?", do not immediately dive into a lengthy migration analysis; first give a concise usage guide in 3 to 6 lines, then continue based on the input the user provides.
Only keep these points in the concise guide:
- The user can provide Triton/CUDA code, a PyTorch reference implementation, a file path, or error/performance logs.
- The user should ideally also state the runtime environment: local command line, an existing container, CI, or code generation only without execution.
- If the user has preferences, they should state them too: minimal diff migration, documentation style, directly provide the optimized version, or get it running first and optimize later.
- You will output, depending on the scenario: a Triton-Ascend implementation, a minimal validation script, execution commands, and optimization notes.
If the user follows up with questions like "How exactly should I ask?", "What commands do I write?", or "How do I run this in a container?", then read `references/usage.md` and provide local commands, container commands, and example prompts as needed; do not copy the entire long guide into regular answers.
Copy this checklist and track progress:
```text
Migration Progress
- [ ] Identify input source and operator type
- [ ] First perform minimal migration or semantic rewriting
- [ ] Adjust to Ascend-friendly parallelism and grid
- [ ] Redesign block / tiling
- [ ] Review stride / block_ptr / alignment
- [ ] Handle coreDim / UB / scalar degradation
- [ ] Implement feasible optimizations directly
- [ ] Generate and save a minimal NPU validation script
- [ ] Actually execute the validation script
- [ ] Output results and optimization notes
```
Input Identification
First answer these three questions:
- Is the user providing a file path or directly pasting code?
- Is it a complete script, partial snippet, or single kernel?
- Is it GPU Triton migration or Python/PyTorch semantic rewriting?
Details about input methods, default handling when information is missing, and priority when a file path conflicts with pasted code are in `references/input-modes.md`.
Scenario A: GPU Triton -> Triton-Ascend
Check first:
- Whether `device='cuda'` exists
- Whether there is GPU-specific device acquisition or assertion logic
- Whether a GPU-style free-form multi-dimensional grid is retained
- Whether `tl.dot` is used
- Whether complex `shape/stride/block_ptr/order` usage exists
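A quick textual pre-scan can flag these candidates before a close read. The helper below is hypothetical and its regexes are deliberately rough (the grid pattern in particular can over-match); it only surfaces lines worth inspecting, it does not replace reading the kernel:

```python
# Sketch: flag GPU-specific patterns in Triton kernel source (illustrative helper).
import re

GPU_PATTERNS = {
    "cuda device string": r"device\s*=\s*['\"]cuda",
    "torch.cuda call": r"torch\.cuda\.",
    "tl.dot usage": r"\btl\.dot\s*\(",
    "multi-dim grid": r"grid\s*=\s*\([^)]*,[^)]*\)",  # rough: any grid tuple with a comma
}

def scan_kernel_source(source: str):
    """Return the checklist items that textually match the kernel source."""
    return [name for name, pat in GPU_PATTERNS.items() if re.search(pat, source)]

src = """
output = torch.empty_like(x, device='cuda')
grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
acc = tl.dot(a, b, acc)
"""
print(scan_kernel_source(src))
# -> ['cuda device string', 'tl.dot usage', 'multi-dim grid']
```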
Scenario B: Python/PyTorch -> Triton-Ascend
Extract the semantics first, then write the Triton code:
- Relationship between input and output tensors
- Indexing and broadcasting scheme
- Mask / reduce logic
- dtype and precision requirements
- Whether the original PyTorch implementation already has naturally contiguous memory access
If the original operator is only a reference implementation, first write a semantically equivalent Triton-Ascend version, then optimize.
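The extraction step can be pinned down as a tiny executable reference before any Triton code is written. The operator below (a masked add) is illustrative, not from any particular source:

```python
# Sketch: capture operator semantics as a plain-Python reference first.

def masked_add_reference(x, y, mask):
    """out[i] = x[i] + y[i] where mask[i], else x[i]; same shape and dtype."""
    assert len(x) == len(y) == len(mask)
    return [xi + yi if m else xi for xi, yi, m in zip(x, y, mask)]

# The reference fixes: I/O relationship (elementwise, same shape),
# indexing (flat), mask logic (select), and precision (plain float add).
out = masked_add_reference([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [True, False, True])
assert out == [11.0, 2.0, 33.0]
```

With the semantics frozen in a few lines like this, the later validation script can compare the Triton-Ascend kernel against it directly.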
Migration Process

1. Collect Minimal Necessary Information
Collect this information first; supplement whatever is missing:
- Input code or a minimal reproduction case
- Input method: file path / specified code snippet / code pasted directly by the user
- shape, dtype, stride
- Whether there is mask, broadcast, reduce
- The current error or performance issue
- Whether exact precision consistency is required
- Runtime environment: local command line, inside a container, CI, or code generation only without execution
If information is incomplete, supplement it in this order:
- First infer from the existing code
- Then complete the validation script with minimal reasonable assumptions
- Only then ask the user for the information that is truly required
If the missing information is the "execution location", infer it in this order:
- First check whether the user provided a container name, `docker exec` usage, a container path, or image information
- Then check whether the user provided a local file path, the current directory, or terminal commands
- If still undetermined, ask: "Should I write the validation steps for the local command line or for a container environment?"
2. First Perform Minimal Migration or Semantic Rewriting
By default, aim first for "semantic alignment and successful execution":
- GPU Triton: first change `cuda` to `npu`
- Add `import torch_npu`
- Remove GPU-specific device logic
- For documentation/tutorial-style simple examples, keep the original kernel name, wrapper name, `BLOCK_SIZE`, grid style, and overall code structure unchanged
- In the first version, do not proactively add `contiguous()`, extra assertions, function renames, or engineering packaging, unless the user explicitly requests an "enhanced/production version" or these changes are required to fix a deterministic issue on the NPU
- Python/PyTorch: first rewrite the original computation semantics into the most straightforward Triton kernel
Do not over-rewrite in the first step.
If the user explicitly uses these signal phrases:
- Official documentation style
- Strict minimal migration
- Minimal diff
- No engineering enhanced version
- Only refer to official migration examples
then this "minimal migration mode" overrides the more general optimization requirements that follow:
- Make only the necessary code modifications
- The optimization notes may be just 1 to 3 lines, clearly stating "no in-depth optimization is performed for this task"
- Do not force in `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`, `care_padding=False`, physical core binding, and similar content just to complete the template
- Do not let the response style drift from "documentation diff" into "engineering optimization overview"
- The validation script should likewise stay "minimally runnable"; do not default to an engineering test framework
Details on documentation-style minimal migration, single-file example organization, and validation-script naming and saving rules are in `references/output-and-validation.md`.
3. Rewrite Parallelism Model
On the Ascend side, follow these rules first:
- Prefer a 1D grid
- Switch from GPU logical-grid thinking to Ascend physical-core-binding thinking
- Design Vector-only operators around the Vector Core path first
- Design operators containing `tl.dot` around the AI Core path first
Then apply this set of "general convergence rules"; do not mechanically retain every implementation branch from the GPU:
- If the original implementation has multiple kernels, `autotune`, environment-variable branches, or automatic dispatch across data paths, first distinguish which are "semantically necessary" and which are merely "performance strategies on GPU"
- For performance branches that are clearly no longer necessary on Ascend, converge to a single kernel or fewer paths; focus on preserving semantics rather than every historical branch
- If an operator is essentially Vector-only but the original implementation uses complex `block_ptr`, 2D/3D grids, extra tiling, or multiple kernel versions, first evaluate whether it can become a more direct 1D-grid, fixed-configuration, single-path implementation
- If an operator contains `tl.dot`, do not think only about "compressing the multi-dimensional grid into 1D"; first judge which grid dimensions are merely logical chunk / token / tile dimensions and whether they are better moved into the kernel's inner loop to reduce scheduling dimensions
- Do not classify mechanically just because `tl.dot` appears in the source; if `tl.dot` is only used for intermediate techniques such as prefix-sum, local scan, or triangular-mask aggregation, still judge from the operator's main semantics whether it is closer to a Vector-only reduction/scan or genuinely belongs on the AI Core path
- If the operator naturally carries chunk, tile, window, prefix-sum, or local-reduction structure, do not just carry over the original per-block pointer logic; also evaluate whether "rearrange the layout first, then do vectorized computation" suits Ascend better
- If an auxiliary tensor (such as gate, mask, bias, index, state-gate) is not contiguous along the current access direction, first do a lightweight `transpose/contiguous` or equivalent layout rearrangement on the wrapper side, then access it inside the kernel with a simpler linear ptr or a more regular `block_ptr`
- If the main loop order is rearranged, for example from "K first, then T" to "T first, then K", re-review the `shape/stride/block_ptr/order` of state tensors, cache tensors, and historical-block tensors at the same time; do not change only the scheduling order while keeping the old view and patching it up with `trans` or extra indexing
- If the current project already provides shared capabilities such as `get_vectorcore_num()`, device-attribute utilities, or common layout helpers, reuse the project helpers instead of hand-writing inline replacements by default
- However, if the current output target is an "independent runnable script" or a "minimal validation script", also check whether those helpers depend on extra initialization; if they rely on project initialization steps, either add the initialization or state the preconditions clearly in the result
- When you decide to "delete branches / converge the implementation", explain why in the result: whether the branch only served GPU autotune, only served shared-memory selection, or has no clear benefit on Ascend
- If the runtime log of the migrated Triton-Ascend code shows warnings like `Please DO NOT tune args ['num_warps']` / `['num_stages']`, first check whether GPU-style launch/tuning parameters were mechanically retained; for a minimally runnable Ascend implementation, do not keep these parameters by default unless you can cite a clear compilation requirement or a measured benefit
- Do not use only one set of generic shapes in the validation script; derive the test set from the operator's characteristics, covering at least one non-divisible block, one case most likely to trigger branch differences, and one case closer to the real working set
If the user provides a 2D/3D grid, first evaluate whether it can be folded into a 1D grid with the indices recovered inside the kernel. Details on `coreDim`, UB, `shape/stride/block_ptr/order`, `care_padding=False`, `TRITON_ALL_BLOCKS_PARALLEL`, and `multibuffer` are in `references/reference.md`.
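The 2D-to-1D grid folding can be sketched in plain Python standing in for the Triton index arithmetic; the names are illustrative:

```python
# Sketch: fold a GPU-style 2D launch grid into 1D and recover indices in-kernel.

def fold_grid(num_blocks_m, num_blocks_n):
    """2D grid (num_blocks_m, num_blocks_n) -> flat 1D grid size."""
    return num_blocks_m * num_blocks_n

def recover_indices(pid, num_blocks_n):
    """Inside the kernel: recover (pid_m, pid_n) from the flat program id."""
    pid_m = pid // num_blocks_n
    pid_n = pid % num_blocks_n
    return pid_m, pid_n

# Every (pid_m, pid_n) pair of the original 2D grid appears exactly once.
grid_1d = fold_grid(3, 4)
pairs = [recover_indices(pid, 4) for pid in range(grid_1d)]
assert sorted(pairs) == [(m, n) for m in range(3) for n in range(4)]
```

In a real kernel the same arithmetic runs on `tl.program_id(0)`, with `num_blocks_n` passed in as a scalar argument or `tl.constexpr`.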
Optimization and Troubleshooting
Default Rules for Direct Optimization
Provide the optimized implementation directly if any of the following conditions hold:
- `coreDim` is clearly exceeded
- UB usage is clearly too large
- Memory access is scattered but can be restructured into contiguous access
- The mask load/store has a better formulation
- The dtype clearly causes vector operations to degrade to scalar operations
If none of these conditions hold, especially for simple examples like vector addition, do not output an enhanced wrapped version by default just to "look more complete". Give the minimal migration version first, then list the enhancements under "Optional Optimizations".
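The "coreDim exceeded" fix usually means binding one program per physical core and looping inside the kernel. The planner below is a sketch with an assumed 40-core Vector Core count and illustrative sizes; the real core count should come from the device, e.g. a project helper like `get_vectorcore_num()`:

```python
# Sketch: clamp the launch grid to the physical core count instead of
# over-subscribing; each program then loops over its share of blocks.
CORE_DIM_LIMIT = 65535  # coreDim upper bound noted in this document

def plan_grid(n_elements, block_size, num_cores=40):
    """Return (grid, blocks_per_program) for a core-bound 1D launch."""
    total_blocks = -(-n_elements // block_size)    # ceil division
    grid = min(total_blocks, num_cores)
    blocks_per_program = -(-total_blocks // grid)  # in-kernel loop count
    assert grid <= CORE_DIM_LIMIT
    return grid, blocks_per_program

grid, loops = plan_grid(n_elements=10_000_000, block_size=1024)
# 9766 logical blocks collapse onto 40 programs, each looping 245 times
```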
Optimization Priority
- Adjust the grid and number of cores
- Adjust the main block size
- Introduce or restructure sub-block loops
- Correct `shape/stride/block_ptr/order`
- Evaluate `care_padding=False`
- Evaluate `TRITON_ALL_BLOCKS_PARALLEL`
- Evaluate `multibuffer` and related compilation optimization options
- Adjust the dtype path without breaking semantics
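The "main block + sub-block loop" pattern behind the second and third items can be sketched in plain Python, with illustrative names standing in for `BLOCK_SIZE` / `BLOCK_SIZE_SUB`:

```python
# Sketch: cover one main block in UB-sized sub-block chunks.

def sub_block_offsets(block_start, block_size, sub_block_size):
    """Yield (sub_start, sub_len) pairs covering one main block, so the
    kernel can process a large BLOCK_SIZE in BLOCK_SIZE_SUB-sized pieces."""
    for sub_start in range(block_start, block_start + block_size, sub_block_size):
        sub_len = min(sub_block_size, block_start + block_size - sub_start)
        yield sub_start, sub_len

# A 1000-element main block processed in 256-element sub-blocks;
# the last tile is short, which is what the load/store mask handles.
tiles = list(sub_block_offsets(block_start=0, block_size=1000, sub_block_size=256))
# -> [(0, 256), (256, 256), (512, 256), (768, 232)]
```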
Key Points to Cover
The output must cover:
- `cuda` -> `npu` and `torch_npu`
- 1D grid
- Physical core binding
- The distinction between Vector-only operators and operators containing `tl.dot`
- `coreDim <= 65535`
- UB limits
- Contiguous / aligned memory access
- Re-review of `shape/stride/block_ptr/order`
- `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`, `care_padding=False`
- Scalar degradation caused by dtype
Fixed Output Template
Always output in this structure:
Migration Conclusion
- Input Source:
- Operator Type:
- Main Migration Actions:
Triton-Ascend Implementation
- Provide the final kernel and the calling wrapper code
- If the scenario is only a basic migration, give the "minimal diff migration version" first
- Provide an additional "engineering enhanced/optimized version" only when the user requests it or there is clearly identified optimization headroom
- If clear optimization opportunities exist, provide the optimized version directly
- State the save path and naming of the generated file
Validation Script
- Provide a minimal executable validation script
- Compare against a PyTorch reference
- Include at least `allclose` or a maximum-error printout
- State the save path of the validation script
- State clearly whether it was actually executed, along with the execution commands and results
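The comparison logic such a script needs is small; the sketch below uses plain Python lists so it is self-contained, whereas a real script would apply the same `allclose` / max-error check to torch tensors (NPU output vs. PyTorch reference):

```python
# Sketch: the core of a minimal validation script's comparison step.

def max_abs_error(reference, actual):
    """Largest elementwise absolute difference."""
    return max(abs(r - a) for r, a in zip(reference, actual))

def allclose(reference, actual, rtol=1e-5, atol=1e-8):
    """Elementwise |r - a| <= atol + rtol * |r|, mirroring torch.allclose."""
    return all(abs(r - a) <= atol + rtol * abs(r) for r, a in zip(reference, actual))

reference = [1.0, 2.0, 3.0]   # would come from the PyTorch reference op
actual = [1.0, 2.0000001, 3.0]  # would come from the Triton-Ascend kernel
print("max abs error:", max_abs_error(reference, actual))
assert allclose(reference, actual)
```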
Optimization Instructions
- Explain why the grid / core count / block / sub-block were adjusted
- State whether `coreDim`, UB, memory-access, dtype, and mask performance issues were handled
- State whether `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`, or `care_padding=False` are used
If the current task is a "documentation-style minimal migration", this section can be extremely concise:
- Only state that minimal migration is being done first
- State in one sentence that optimizations such as `coreDim` / UB / `multibuffer` are not expanded for this task
- Do not expand into lengthy optimization analysis just to fit the template
Risks and Limitations
- List unvalidated boundary conditions
- List information the user needs to supplement
- If the script fails to run, state clearly which step it is stuck on
If the user's question itself is "How do I use this skill?", add a minimal "Usage" section before the formal template, limited to 3 to 6 lines, explaining:
- What input the user should provide
- Whether the local or the container scenario is being handled
- What you will produce next
Then proceed to the normal migration output.
If the user asks further about command lines, containers, directory switching, or validation command templates, read `references/usage.md`; do not include these details in every migration response by default.
Additional Resources
For detailed rules, refer to:
- Usage, Local Commands and Container Scenarios
- Input Methods and Context Completion
- Output, Naming and Minimal Validation Script
- Migration and Optimization Reference
- Typical Examples and Output Samples
- Manual Review Test Checklist