converting-cutile-to-triton
cuTile → Triton Conversion
Convert `@ct.kernel` kernels to `@triton.jit`. API mapping: references/api-mapping.md (cuTile → Triton).
In this skill's Markdown, Triton launch syntax `kernel[grid](…)` uses Unicode brackets so link checkers do not parse `[grid](…)` as a hyperlink; use normal ASCII brackets in real Triton code.
Instructions
Follow the phase-gated workflow in translations/workflow.md. Every conversion should go through analyze → convert → validate → test → benchmark, with explicit gates before moving on. Use the documents in Workflow Selection when the task matches a special case (errors, layout flags, perf).
1. Optimization strategy (perf-sensitive / attention): If the op is attention, FMHA, sliding window, soft cap, or GQA (e.g. Gemma `gemma_attention`), read references/optimization-strategy.md before converting the inner loop, then apply the §4 Gemma FMHA checklist. For other GEMM/BMM/attention-adjacent kernels, still skim §2–§3 of that file after TMA is done.
2. Select path: Existing TileGym op: standard mode in `translations/workflow.md`. If the cuTile source uses `transpose`/`transpose_v`, dual layouts, or MLA-style paths, read translations/advanced-patterns.md before writing Triton (two kernels + `META` grid, not one kernel + `tl.trans`).
3. Pre-flight: Run the Pre-flight Analysis grep commands on the cuTile source. Count `@ct.kernel` definitions; note TMA-relevant `ct.load`/`ct.store`, `ct.launch`, `Constant`, and layout flags.
4. Read mapping: Keep references/api-mapping.md open for cuTile → Triton API pairs. For runtime failures (illegal address, dtype, strides), use references/debugging.md.
5. Convert: Copy the Conversion Checklist into a todo list and execute in order. Structure and file placement: translations/file-structure.md. Mandatory: any 2D+ block-shaped tile load/store uses `tl.make_tensor_descriptor` (TMA), not raw `tl.load(ptr+offs, mask=…)` for full tiles; skipping this is the most common source of large regressions. Host side: Triton bracket launch <code>kernel[grid](args)</code> with a tuple, or `lambda META: (…)` for autotune; no `ct.launch`.
6. Validate: Syntax-check the new Triton module; run the relevant TileGym pytest targets for the op: `pytest tests/ops/test_<op>.py -k "triton" -vs`. Fix failures before benchmarking.
7. Benchmark: Compare Triton vs cuTile on perf tests. If Triton is clearly slower, follow PERFORMANCE ANALYSIS (Phase c2t-5) in translations/workflow.md and references/optimizing-reference.md for GEMM/BMM/attention; use references/optimization-strategy.md as the ordered checklist. If you see 10–50× slowdowns, read CRITICAL PERFORMANCE PATTERNS in that same workflow file first.
Execution rules (MUST):
- Create and track the conversion checklist (e.g. TodoWrite) before editing kernel code; complete steps in order—do not skip pre-flight or TMA decisions.
- For attention / FMHA / Gemma / GQA / soft cap / sliding window: read references/optimization-strategy.md and apply §4 before treating the conversion as optimized.
- Do not ship raw pointer+mask 2D+ tile loads where TMA applies; document any intentional exception.
- If tests or benchmarks fail a gate, stop and fix before declaring the conversion done—do not stack unverified changes.
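To make the host-side rule in the Convert step concrete, the sketch below shows the two grid forms in plain Python. It is an illustration under assumptions, not TileGym code: `cdiv` is a local stand-in with the same ceiling-division behavior as `triton.cdiv`, and `M`, `N`, `BLOCK_M`, and `BLOCK_N` are hypothetical names. In a real wrapper the tuple or callable would be passed in the bracket launch, and the autotuner would supply `META`.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division; local stand-in for triton.cdiv so this runs without Triton."""
    return (a + b - 1) // b


M, N = 1000, 512  # hypothetical problem size

# Fixed tile sizes: the grid can be a plain tuple.
BLOCK_M, BLOCK_N = 128, 64
grid_tuple = (cdiv(M, BLOCK_M), cdiv(N, BLOCK_N))

# Autotuned tile sizes: the grid is a callable, because the block sizes are
# only known once a config has been chosen and handed in as META.
grid_fn = lambda META: (cdiv(M, META["BLOCK_M"]), cdiv(N, META["BLOCK_N"]))

print(grid_tuple)                                # (8, 8)
print(grid_fn({"BLOCK_M": 64, "BLOCK_N": 128}))  # (16, 4)
```

A hard-coded tuple combined with autotuned block sizes would launch a program count that no longer matches the chosen tiles, which is why the callable form is the one to use whenever autotune is enabled.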
Workflow Selection
- Existing TileGym op → Standard Mode: translations/workflow.md
- Errors (`cudaErrorIllegalAddress`, shape mismatch, numerical mismatch) → references/debugging.md
- Advanced patterns (TMA, dual layout flags `transpose`, autotune + `META` grid, Array.slice, ct.gather().item()) → translations/advanced-patterns.md (MLA-style two kernels, avoid 3–15× regression on `transpose=False`).
- Performance (Triton kernel slower than cuTile, autotuning, profiling) → translations/workflow.md (section PERFORMANCE ANALYSIS (Phase c2t-5))
- Optimization strategy hub (ordered checklist: advanced-patterns + optimizing-reference) → references/optimization-strategy.md — read first for attention/FMHA/Gemma; then drill into the two source docs as needed
- Optimizing GEMM/BMM/attention (after TMA, or Triton 10–20% slower) → references/optimizing-reference.md — EVEN_K fast path, transpose via pointer arithmetic, grid layout, autotune breadth, epilogue subtile; use these patterns during conversion and before perf sign-off (summarized in optimization-strategy §2–§3)
- Gemma attention / GQA FMHA conversion → references/optimization-strategy.md §4
- Blackwell optimization (complex kernels with iterative algorithms, register pressure, loop unrolling) → references/optimizing-reference.md §9: TMA descriptors, `loop_unroll_factor`, occupancy autotuning, TMEM-friendly block sizes, slab allocator, dual-path kernel design
- ⚠️ 10-50x REGRESSION (catastrophic slowdown after conversion) → translations/workflow.md, section CRITICAL PERFORMANCE PATTERNS (AVOID 10-50x REGRESSION)
- ⚠️ Good perf on `transpose=True` only, collapse on `transpose=False` (or opposite) → translations/advanced-patterns.md, §1 Dual layout flag; two `@triton.jit` kernels + `grid = lambda META: (... META["BLOCK_H"] ...)`
Pre-flight Analysis (Run BEFORE converting)
```bash
# Count kernels (only main kernel gets @triton.jit, helpers stay plain def)
grep "@ct.kernel" source.py | wc -l

# Check for patterns needing special handling
# (-E enables extended regex so that | means alternation)
grep -E "ct.transpose|ct.permute" source.py  # → use tl.trans/tl.permute
grep "ct.astype" source.py                   # → use .to(dtype)
grep -E "ct.load|ct.store" source.py         # → TMA for 2D+ (tl.make_tensor_descriptor), NOT raw tl.load(ptr+offs)
grep "ct.launch" source.py                   # → bracket launch: kernel then [grid] then (args)
grep -E "ct.Constant|ct.ConstInt" source.py  # → tl.constexpr
grep "ct.cdiv" source.py                     # → triton.cdiv (host) or Python (a+b-1)//b
grep -E "ct.bid|ct.num_blocks" source.py     # → tl.program_id/tl.num_programs
grep -E "1 << .*\.bit_length" source.py      # → triton.next_power_of_2 if needed
grep -E "transpose|transpose_v" source.py    # → if hit, read translations/advanced-patterns.md (dual kernels + META grid)
```
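The same pre-flight scan can be scripted so the counts go straight into the todo list. The sketch below is a minimal Python equivalent of the grep commands above. `PREFLIGHT_PATTERNS` and `preflight` are hypothetical names introduced for illustration, the hints only restate the grep comments, and the sample source is a made-up cuTile-like snippet used purely as scan input.

```python
import re

# Pattern -> conversion hint, restating the grep comments above.
# PREFLIGHT_PATTERNS and preflight() are illustrative names, not part of TileGym.
PREFLIGHT_PATTERNS = {
    r"@ct\.kernel": "kernel definition (count these)",
    r"ct\.(transpose|permute)": "tl.trans / tl.permute",
    r"ct\.astype": ".to(dtype)",
    r"ct\.(load|store)": "TMA for 2D+ tiles (tl.make_tensor_descriptor)",
    r"ct\.launch": "bracket launch: kernel, [grid], (args)",
    r"ct\.(Constant|ConstInt)": "tl.constexpr",
    r"ct\.cdiv": "triton.cdiv on the host",
    r"ct\.(bid|num_blocks)": "tl.program_id / tl.num_programs",
    r"1 << .*\.bit_length": "triton.next_power_of_2 if needed",
    r"transpose": "read translations/advanced-patterns.md",
}


def preflight(source: str) -> dict:
    """Return {pattern: number of matching lines}, like `grep ... | wc -l`."""
    lines = source.splitlines()
    return {
        pattern: sum(1 for line in lines if re.search(pattern, line))
        for pattern in PREFLIGHT_PATTERNS
    }


if __name__ == "__main__":
    # Made-up snippet; only the identifiers matter for the scan.
    sample = (
        "@ct.kernel\n"
        "def add(a, b, out, N: ct.Constant[int]):\n"
        "    i = ct.bid(0)\n"
        "    x = ct.load(a, index=(i,), shape=(128,))\n"
        "    ct.store(out, index=(i,), tile=x)\n"
    )
    for pattern, count in preflight(sample).items():
        if count:
            print(f"{count:3d}  {pattern:28s} -> {PREFLIGHT_PATTERNS[pattern]}")
```

Any non-zero count for the `ct.load`/`ct.store` or `transpose` rows is a prompt to settle the TMA and layout decisions before touching kernel code.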
Conversion Checklist
Copy this checklist and track progress:
Conversion Progress:
[ ] Step 0 (attention / Gemma FMHA / GQA / soft cap / sliding window): Read [references/optimization-strategy.md](./references/optimization-strategy.md) and apply §4 checklist before inner-loop Triton
[ ] Step 1: Pre-flight — run grep commands above, note special patterns and 2D+ loads (→ TMA)
[ ] Step 2: Analyze source cuTile kernel (identify patterns, shapes, dtypes)
[ ] Step 3: Create Triton file with correct structure (see translations/file-structure.md)
[ ] Step 4: Convert kernel signature (tensor args → pointer args, Constant → constexpr)
[ ] Step 4b: TMA (MANDATORY for 2D+ loads) — use tl.make_tensor_descriptor for every 2D+ tile load/store; do NOT ship raw tl.load(ptr+offs,mask) for block-shaped access (see workflow.md § TMA OPTIMIZATION)
[ ] Step 5: Convert kernel body (apply gotchas table below + API mapping)
[ ] Step 6: Convert host wrapper (grid tuple/lambda, bracket-style launch: kernel, grid, then arguments; no ct.launch); call triton.set_allocator(alloc_fn) if using TMA
[ ] Step 7: Validate — run pytest or syntax check on Triton file
[ ] Step 8: Test — run pytest, verify X passed 0 failed
[ ] Step 9: If test fails → fix → re-validate → re-test (loop until green)
[ ] Step 10: Benchmark — run perf test, compare vs cuTile (see workflow.md § PERFORMANCE ANALYSIS)
[ ] Step 10b: If GEMM/BMM/attention and Triton >20% slower → walk [references/optimization-strategy.md](./references/optimization-strategy.md) §2–§3 then [references/optimizing-reference.md](./references/optimizing-reference.md) (EVEN_K, transpose, grid, autotune, epilogue subtile), then re-benchmark
[ ] Step 10c: If op has `transpose` / layout flag → read [translations/advanced-patterns.md](./translations/advanced-patterns.md); verify **separate kernels** per layout (not transpose-kernel + `tl.trans`); **autotuned** launches use `lambda META: (triton.cdiv(..., META["BLOCK_H"]), ...)` — no fixed `BLOCK_H`/`BLOCK_N` through `apply()` unless autotune is disabled
Post-conversion Verification (TMA is mandatory for 2D+ loads):
[ ] TMA: All 2D+ tile loads use tl.make_tensor_descriptor(...).load([...]); no raw ptr+mask for block-shaped 2D+ access (else 5x-20x regression)
[ ] Grid uses tuple or lambda (not 3-tuple required like cuTile)
[ ] Triton autotune added if cuTile op used kernel_configs/autotune (see workflow § PERFORMANCE ANALYSIS)
[ ] Host grid uses triton.cdiv where appropriate (not (a+b-1)//b only)
[ ] Pointer/offset indexing: Triton uses element offsets (ptr + offs), not block index in tl.load (or use TMA descriptor)
[ ] ct.astype(x, dtype) → x.to(dtype) in Triton
[ ] ct.mma(a, b, acc=acc) → tl.dot(a, b, acc) (accumulator passed positionally as the third argument)
[ ] Optional/None args: Triton allows None in kernel args if desired (cuTile required dummy+flag)
[ ] Masking applied when BLOCK_SIZE > actual dimension (same as cuTile); with TMA, masks can often be removed for full tiles
[ ] Reduction divisor uses actual_size, NOT BLOCK_SIZE
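[ ] fp32/tf32: tl.dot uses TF32 for fp32 inputs by default on NVIDIA GPUs (allow_tf32=True in older Triton releases, input_precision="tf32" in newer ones); match cuTile behavior if you had explicit tf32 cast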
[ ] If any 2D+ load uses raw ptr+mask (exception only): document WHY TMA was not used
[ ] tl.assume() alignment hints added for strides and pointers
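Two of the verification items, masking when BLOCK_SIZE exceeds the real dimension and dividing a reduction by the actual size, are easy to get wrong together. The plain-Python sketch below emulates a single padded block. `load_block`, `block_mean_wrong`, and `block_mean_right` are hypothetical names, and the list operations only stand in for what a masked `tl.load` with a zero fill value and a `tl.sum` would do inside a kernel.

```python
def next_power_of_2(n: int) -> int:
    """The `1 << (n - 1).bit_length()` idiom flagged in pre-flight, for n >= 1."""
    return 1 << (n - 1).bit_length()


def load_block(row, block_size):
    """Emulate a masked load: out-of-range lanes are filled with 0.0."""
    return [row[i] if i < len(row) else 0.0 for i in range(block_size)]


def block_mean_wrong(row):
    block_size = next_power_of_2(len(row))
    block = load_block(row, block_size)
    return sum(block) / block_size  # BUG: divides by the padded block size


def block_mean_right(row):
    block_size = next_power_of_2(len(row))
    block = load_block(row, block_size)
    return sum(block) / len(row)    # divides by the actual size


row = [2.0, 4.0, 6.0, 8.0, 10.0]    # actual_size = 5, BLOCK_SIZE = 8
print(block_mean_wrong(row))        # 3.75
print(block_mean_right(row))        # 6.0
```

The zero fill keeps the sum correct, but only the divisor tells the reduction how many real elements contributed, so a mean or variance computed with BLOCK_SIZE is silently scaled down.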
Gotchas (Most Common Translation Errors) {#gotchas-most-common-translation-errors}
Comprehensive table of patterns that frequently break or regress when porting `@ct.kernel` to `@triton.jit`: mma accumulator, type cast, grid, TMA usage, dtype handling, layout flags, batched matmul, etc.
See: references/gotchas.md (read this BEFORE writing the Triton kernel).
Performance Gotchas (10-50x Regression Risk) {#performance-gotchas-10-50x-regression-risk}
⚠️ These cause CATASTROPHIC slowdowns. Check BEFORE benchmarking.
Patterns and their impact: TMA vs raw ptr+mask (5-20×), autotune vs fixed tile sizes (2-3×), `broadcast_to + tl.dot` (10-50×), `extract_slice` chains (2-5×), and more.
See: references/performance-gotchas.md for the full regression-risk table.
Full details: translations/workflow.md — section CRITICAL PERFORMANCE PATTERNS (AVOID 10-50x REGRESSION).
Full API mapping: references/api-mapping.md.
Triton math dtype (erf/erfc/exp/log/sqrt) and the "don't substitute erf with tanh" pattern: references/debugging.md — section Triton Math Function Dtype Requirements (CRITICAL).
Optimization strategy (hub)
File: references/optimization-strategy.md
Summarizes translations/advanced-patterns.md (layout flags, dual kernels, autotune + `META`, batched launch, Blackwell pointers) and references/optimizing-reference.md (post-TMA micro-opts, §9) into §1–§3 plus a mandatory §4 Gemma FMHA checklist.
Rule: For attention / FMHA / Gemma-style conversions, open optimization-strategy in the same session as workflow; do not rely on TMA alone for perf sign-off.
Reference Documents {#reference-documents}
Read from the cuTile → Triton perspective. Core files live in this skill's directory.
| Category | Document | Content |
|---|---|---|
| Strategy | optimization-strategy.md | Ordered hub: advanced-patterns + optimizing-reference; §4 Gemma FMHA mandatory checklist |
| Workflows | translations/workflow.md | Standard c2t conversion (phases + checklist) |
| | translations/file-structure.md | Where to place Triton files when converting from cuTile |
| | translations/advanced-patterns.md | Dual layout flags (transpose), autotune + `META` |
| API | api-mapping.md | cuTile → Triton mapping |
| | optimizing-reference.md | GEMM/BMM/attention optimizations (EVEN_K, transpose, grid, autotune, epilogue subtile) |
| Gotchas | gotchas.md | Common cuTile→Triton translation errors (mma, dtype, grid, TMA, layout flags) |
| | performance-gotchas.md | 10-50× regression-risk table (TMA vs ptr+mask, broadcast_to, extract_slice chains, autotune) |
| Testing & errors | references/debugging.md | Triton runtime errors (cudaErrorIllegalAddress, pointer type, stride overflow) |
Worked Examples
Use cutile_kernel.py as source and triton_kernel.py as target:
| Example | Directory | Complexity |
|---|---|---|
| Vector Add | examples/01_vector_add/ | Basic |
| Softmax | examples/02_softmax/ | Intermediate |
| LayerNorm | examples/03_layernorm/ | Intermediate |
| MatMul | examples/04_matmul/ | Advanced |
| Attention | examples/05_attention/ | Advanced |
Read `cutile_kernel.py` first, then `triton_kernel.py`, to see the inverse mapping.
⚠️ MANDATORY COMPLETION CHECKLIST (DO NOT SKIP)
A conversion is NOT COMPLETE until ALL items are checked. Copy and complete:
MANDATORY COMPLETION GATES:
[ ] 1. CORRECTNESS: pytest passes with 0 failures
Command: python -m pytest {test_path} -k "test_op and triton" -vs --tb=short
Gate: "X passed, 0 failed"
[ ] 2. TMA OPTIMIZATION: All 2D+ tile loads use tl.make_tensor_descriptor
Verify: grep -n "tl.load.*mask" triton_file.py | wc -l # Should be 0 for 2D+ ops
Skip = 5-20x performance regression
[ ] 3. PERFORMANCE TEST: Triton within 20% of cuTile baseline
Command: python -m pytest {test_path} -k "test_perf" --print-record -v
OR: Run benchmark script: cd tests/benchmark && python bench_{op}.py
Gate: Triton TFLOPS >= 0.8 * CuTile TFLOPS
[ ] 4. PERFORMANCE COMPARISON RECORDED:
Document results:
| Config | Triton (TFLOPS) | CuTile (TFLOPS) | Ratio |
|--------|-----------------|-----------------|-------|
| [fill] | [fill] | [fill] | [fill]|
CONVERSION COMPLETE: All 4 gates passed? → YES / NO
Why this matters:
- Gate 1 catches functional bugs
- Gate 2 prevents catastrophic 5-20x regressions (most common mistake)
- Gate 3 validates that optimization was effective
- Gate 4 creates accountability record
If any gate fails: Fix and re-verify before declaring complete.
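Gates 3 and 4 come down to a ratio check plus one table row per configuration. The sketch below is one possible way to compute both, not the project's benchmark tooling: `perf_gate` is a hypothetical helper, the TFLOPS figures are made-up placeholders, and the 0.8 threshold is the one stated in gate 3.

```python
def perf_gate(config: str, triton_tflops: float, cutile_tflops: float,
              threshold: float = 0.8):
    """Return (passed, markdown_row) for one benchmark configuration."""
    ratio = triton_tflops / cutile_tflops
    passed = ratio >= threshold
    row = f"| {config} | {triton_tflops:.1f} | {cutile_tflops:.1f} | {ratio:.2f} |"
    return passed, row


# Made-up numbers, for illustration only; substitute measured results.
results = [
    ("config A", 410.0, 450.0),
    ("config B", 300.0, 480.0),
]

print("| Config | Triton (TFLOPS) | CuTile (TFLOPS) | Ratio |")
print("|--------|-----------------|-----------------|-------|")
all_passed = True
for config, triton_tflops, cutile_tflops in results:
    passed, row = perf_gate(config, triton_tflops, cutile_tflops)
    all_passed = all_passed and passed
    print(row)

print("Gate 3:", "PASS" if all_passed else "FAIL")
```

Requiring every configuration to clear the threshold, rather than the average, keeps a single collapsed shape (for example one layout flag value) from being hidden by good results elsewhere.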