converting-cutile-to-triton

cuTile → Triton Conversion


Convert `@ct.kernel` kernels to `@triton.jit`. API mapping: references/api-mapping.md (cuTile → Triton).
In this skill’s Markdown, the Triton launch syntax `kernel[grid](…)` uses Unicode brackets so link checkers do not parse `[grid](…)` as a hyperlink; use normal ASCII brackets in real Triton code.

Instructions


Follow the phase-gated workflow in translations/workflow.md. Every conversion should go through analyze → convert → validate → test → benchmark, with explicit gates before moving on. Use the documents in Workflow Selection when the task matches a special case (errors, layout flags, perf).
  1. Optimization strategy (perf-sensitive / attention): If the op is attention, FMHA, sliding window, soft cap, or GQA (e.g. Gemma `gemma_attention`), read references/optimization-strategy.md before converting the inner loop, then apply the §4 Gemma FMHA checklist. For other GEMM/BMM/attention-adjacent kernels, still skim §2–§3 of that file after TMA is done.
  2. Select path: For an existing TileGym op, use standard mode in translations/workflow.md. If the cuTile source uses `transpose` / `transpose_v`, dual layouts, or MLA-style paths, read translations/advanced-patterns.md before writing Triton (two kernels + `META` grid, not one kernel + `tl.trans`).
  3. Pre-flight: Run the Pre-flight Analysis grep commands on the cuTile source. Count `@ct.kernel` definitions; note TMA-relevant `ct.load` / `ct.store`, `ct.launch`, `Constant`, and layout flags.
  4. Read mapping: Keep references/api-mapping.md open for cuTile → Triton API pairs. For runtime failures (illegal address, dtype, strides), use references/debugging.md.
  5. Convert: Copy the Conversion Checklist into a todo list and execute in order. Structure and file placement: translations/file-structure.md. Mandatory: any 2D+ block-shaped tile load/store uses `tl.make_tensor_descriptor` (TMA), not raw `tl.load(ptr+offs, mask=…)` for full tiles; skipping this is the most common source of large regressions. Host side: Triton bracket launch `kernel[grid](args)` with a tuple or `lambda META: (…)` for autotune; no `ct.launch`.
  6. Validate: Syntax-check the new Triton module; run the relevant TileGym pytest targets for the op: `pytest tests/ops/test_<op>.py -k "triton" -vs`. Fix failures before benchmarking.
  7. Benchmark: Compare Triton vs cuTile on perf tests. If Triton is clearly slower, follow PERFORMANCE ANALYSIS (Phase c2t-5) in translations/workflow.md and references/optimizing-reference.md for GEMM/BMM/attention; use references/optimization-strategy.md as the ordered checklist. If you see 10–50× slowdowns, read CRITICAL PERFORMANCE PATTERNS in that same workflow file first.
Execution rules (MUST):
  • Create and track the conversion checklist (e.g. TodoWrite) before editing kernel code; complete steps in order and do not skip pre-flight or TMA decisions.
  • For attention / FMHA / Gemma / GQA / soft cap / sliding window: read references/optimization-strategy.md and apply §4 before treating the conversion as optimized.
  • Do not ship raw pointer+mask 2D+ tile loads where TMA applies; document any intentional exception.
  • If tests or benchmarks fail a gate, stop and fix before declaring the conversion done; do not stack unverified changes.
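The host-side grid described in step 5 can be sketched as follows. This is a minimal illustration under stated assumptions: `cdiv` is a pure-Python stand-in for `triton.cdiv` so the snippet runs without a GPU, and the problem size and the `BLOCK_M` / `BLOCK_N` meta-parameter names are hypothetical rather than taken from any TileGym op.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division; stand-in for triton.cdiv so this runs without Triton."""
    return (a + b - 1) // b


# Hypothetical problem size (not from the source document).
M, N = 1000, 512

# Fixed grid: a plain tuple works when block sizes are known on the host.
BLOCK_M, BLOCK_N = 128, 64
fixed_grid = (cdiv(M, BLOCK_M), cdiv(N, BLOCK_N))

# Autotuned grid: a callable that receives the chosen meta-parameters,
# so the grid follows whichever config the autotuner picks.
grid = lambda META: (cdiv(M, META["BLOCK_M"]), cdiv(N, META["BLOCK_N"]))

print(fixed_grid)                             # (8, 8)
print(grid({"BLOCK_M": 64, "BLOCK_N": 128}))  # (16, 4)
```

In real Triton code the callable itself is passed in the bracket launch (ASCII brackets), and Triton calls it with the selected config; the host never hard-codes the block sizes when autotune is active.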

Workflow Selection


  • Existing TileGym op → Standard Mode: translations/workflow.md
  • Errors (`cudaErrorIllegalAddress`, shape mismatch, numerical mismatch) → references/debugging.md
  • Advanced patterns (TMA, dual layout flags `transpose`, autotune + `META` grid, Array.slice, ct.gather().item()) → translations/advanced-patterns.md (MLA-style two kernels, avoid 3–15× regression on `transpose=False`).
  • Performance (Triton kernel slower than cuTile, autotuning, profiling) → translations/workflow.md (section PERFORMANCE ANALYSIS (Phase c2t-5))
  • Optimization strategy hub (ordered checklist: advanced-patterns + optimizing-reference) → references/optimization-strategy.md: read first for attention/FMHA/Gemma; then drill into the two source docs as needed
  • Optimizing GEMM/BMM/attention (after TMA, or Triton 10–20% slower) → references/optimizing-reference.md: EVEN_K fast path, transpose via pointer arithmetic, grid layout, autotune breadth, epilogue subtile; use these patterns during conversion and before perf sign-off (summarized in optimization-strategy §2–§3)
  • Gemma attention / GQA FMHA conversion → references/optimization-strategy.md §4
  • Blackwell optimization (complex kernels with iterative algorithms, register pressure, loop unrolling) → references/optimizing-reference.md §9: TMA descriptors, `loop_unroll_factor`, occupancy autotuning, TMEM-friendly block sizes, slab allocator, dual-path kernel design
  • ⚠️ 10-50x REGRESSION (catastrophic slowdown after conversion) → translations/workflow.md, section CRITICAL PERFORMANCE PATTERNS (AVOID 10-50x REGRESSION)
  • ⚠️ Good perf on `transpose=True` only, collapse on `transpose=False` (or opposite) → translations/advanced-patterns.md, §1 Dual layout flag; two `@triton.jit` kernels + `grid = lambda META: (... META["BLOCK_H"] ...)`

Pre-flight Analysis (Run BEFORE converting)


    # Count kernels (only main kernel gets @triton.jit, helpers stay plain def)
    grep "@ct.kernel" source.py | wc -l

    # Check for patterns needing special handling
    # (-E enables alternation with |; plain grep treats | as a literal character)
    grep -E "ct.transpose|ct.permute" source.py   # → use tl.trans/tl.permute
    grep "ct.astype" source.py                    # → use .to(dtype)
    grep -E "ct.load|ct.store" source.py          # → TMA for 2D+ (tl.make_tensor_descriptor), NOT raw tl.load(ptr+offs)
    grep "ct.launch" source.py                    # → bracket launch: kernel then [grid] then (args)
    grep -E "ct.Constant|ct.ConstInt" source.py   # → tl.constexpr
    grep "ct.cdiv" source.py                      # → triton.cdiv (host) or Python (a+b-1)//b
    grep -E "ct.bid|ct.num_blocks" source.py      # → tl.program_id/tl.num_programs
    grep "1 << .*.bit_length" source.py           # → triton.next_power_of_2 if needed
    grep -E "transpose|transpose_v" source.py     # → if hit, read translations/advanced-patterns.md (dual kernels + META grid)
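The pre-flight scan can also be scripted so the counts land in one report. The sketch below mirrors several of the grep patterns in Python; the `preflight` helper and the `SAMPLE` snippet are hypothetical, and the cuTile call signatures inside `SAMPLE` are illustrative text only, not checked against the cuTile API.

```python
import re

# Patterns mirror the grep commands; each hint paraphrases the matching comment.
PREFLIGHT_PATTERNS = {
    r"@ct\.kernel": "count kernels",
    r"ct\.(transpose|permute)": "tl.trans / tl.permute",
    r"ct\.astype": ".to(dtype)",
    r"ct\.(load|store)": "TMA for 2D+ (tl.make_tensor_descriptor)",
    r"ct\.launch": "bracket launch",
    r"ct\.(Constant|ConstInt)": "tl.constexpr",
    r"ct\.cdiv": "triton.cdiv (host)",
    r"ct\.(bid|num_blocks)": "tl.program_id / tl.num_programs",
}


def preflight(source: str) -> dict[str, int]:
    """Return {pattern: number of matching lines}, like `grep PATTERN | wc -l`."""
    lines = source.splitlines()
    return {
        pattern: sum(1 for line in lines if re.search(pattern, line))
        for pattern in PREFLIGHT_PATTERNS
    }


# Hypothetical cuTile-style snippet used only to exercise the scanner;
# it is treated as text and never executed.
SAMPLE = """
@ct.kernel
def add(a, b, out, N: ct.Constant[int]):
    pid = ct.bid(0)
    x = ct.load(a, index=(pid,), shape=(N,))
    y = ct.load(b, index=(pid,), shape=(N,))
    ct.store(out, index=(pid,), tile=x + y)
"""

counts = preflight(SAMPLE)
for pattern, n in counts.items():
    if n:
        print(f"{n:2d}  {pattern:28s} -> {PREFLIGHT_PATTERNS[pattern]}")
```

Counting matching lines rather than individual matches keeps the numbers comparable with the `grep ... | wc -l` output.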

Conversion Checklist


Copy this checklist and track progress:
Conversion Progress:
 [ ] Step 0 (attention / Gemma FMHA / GQA / soft cap / sliding window): Read [references/optimization-strategy.md](./references/optimization-strategy.md) and apply §4 checklist before inner-loop Triton
 [ ] Step 1: Pre-flight — run grep commands above, note special patterns and 2D+ loads (→ TMA)
 [ ] Step 2: Analyze source cuTile kernel (identify patterns, shapes, dtypes)
 [ ] Step 3: Create Triton file with correct structure (see translations/file-structure.md)
 [ ] Step 4: Convert kernel signature (tensor args → pointer args, Constant → constexpr)
 [ ] Step 4b: TMA (MANDATORY for 2D+ loads) — use tl.make_tensor_descriptor for every 2D+ tile load/store; do NOT ship raw tl.load(ptr+offs,mask) for block-shaped access (see workflow.md § TMA OPTIMIZATION)
 [ ] Step 5: Convert kernel body (apply gotchas table below + API mapping)
 [ ] Step 6: Convert host wrapper (grid tuple/lambda, bracket-style launch: kernel, grid, then arguments; no ct.launch); call triton.set_allocator(alloc_fn) if using TMA
 [ ] Step 7: Validate — run pytest or syntax check on Triton file
 [ ] Step 8: Test — run pytest, verify X passed 0 failed
 [ ] Step 9: If test fails → fix → re-validate → re-test (loop until green)
 [ ] Step 10: Benchmark — run perf test, compare vs cuTile (see workflow.md § PERFORMANCE ANALYSIS)
 [ ] Step 10b: If GEMM/BMM/attention and Triton >20% slower → walk [references/optimization-strategy.md](./references/optimization-strategy.md) §2–§3 then [references/optimizing-reference.md](./references/optimizing-reference.md) (EVEN_K, transpose, grid, autotune, epilogue subtile), then re-benchmark
 [ ] Step 10c: If op has `transpose` / layout flag → read [translations/advanced-patterns.md](./translations/advanced-patterns.md); verify **separate kernels** per layout (not transpose-kernel + `tl.trans`); **autotuned** launches use `lambda META: (triton.cdiv(..., META["BLOCK_H"]), ...)` — no fixed `BLOCK_H`/`BLOCK_N` through `apply()` unless autotune is disabled

Post-conversion Verification (TMA is mandatory for 2D+ loads):
 [ ] TMA: All 2D+ tile loads use tl.make_tensor_descriptor(...).load([...]); no raw ptr+mask for block-shaped 2D+ access (else 5x-20x regression)
 [ ] Grid uses a tuple or lambda (a 3-tuple is not required, unlike cuTile)
 [ ] Triton autotune added if cuTile op used kernel_configs/autotune (see workflow § PERFORMANCE ANALYSIS)
 [ ] Host grid uses triton.cdiv where appropriate (not (a+b-1)//b only)
 [ ] Pointer/offset indexing: Triton uses element offsets (ptr + offs), not block index in tl.load (or use TMA descriptor)
 [ ] ct.astype(x, dtype) → x.to(dtype) in Triton
 [ ] ct.mma(a, b, acc=acc) → tl.dot(a, b, acc) (no keyword in Triton)
 [ ] Optional/None args: Triton allows None in kernel args if desired (cuTile required dummy+flag)
 [ ] Masking applied when BLOCK_SIZE > actual dimension (same as cuTile); with TMA, masks can often be removed for full tiles
 [ ] Reduction divisor uses actual_size, NOT BLOCK_SIZE
 [ ] fp32/tf32: Triton defaults allow_tf32=True; match cuTile behavior if you had explicit tf32 cast
 [ ] If any 2D+ load uses raw ptr+mask (exception only): document WHY TMA was not used
 [ ] tl.assume() alignment hints added for strides and pointers
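The "reduction divisor" item in the verification list is easy to get wrong, so here is a small worked example in plain Python. It only emulates a masked tile load with a list; `BLOCK_SIZE`, `row`, and the zero fill value are illustrative assumptions, not code from a TileGym kernel.

```python
BLOCK_SIZE = 8                      # padded tile width (power of two)
row = [2.0, 4.0, 6.0, 8.0, 10.0]    # actual_size = 5 valid elements
actual_size = len(row)

# Emulate a masked load: out-of-bounds lanes are filled with 0.0.
tile = [row[i] if i < actual_size else 0.0 for i in range(BLOCK_SIZE)]

total = sum(tile)                   # padding contributes nothing to the sum
wrong_mean = total / BLOCK_SIZE     # divides by the padded width: too small
right_mean = total / actual_size    # divides by the number of real elements

print(wrong_mean)  # 3.75
print(right_mean)  # 6.0
```

The sum is unaffected by the padding, but the mean (and anything built on it, such as a variance) is only correct when the divisor is the real element count.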

Gotchas (Most Common Translation Errors) {#gotchas-most-common-translation-errors}


Comprehensive table of patterns that frequently break or regress when porting `@ct.kernel` to `@triton.jit`: mma accumulator, type cast, grid, TMA usage, dtype handling, layout flags, batched matmul, etc.
See: references/gotchas.md (read this BEFORE writing the Triton kernel).

Performance Gotchas (10-50x Regression Risk) {#performance-gotchas-10-50x-regression-risk}


⚠️ These cause CATASTROPHIC slowdowns. Check BEFORE benchmarking.
Patterns and their impact: TMA vs raw ptr+mask (5-20×), autotune vs fixed tile sizes (2-3×), `broadcast_to + tl.dot` (10-50×), `extract_slice` chains (2-5×), and more.
See: references/performance-gotchas.md — full regression-risk table.
Full details: translations/workflow.md — section CRITICAL PERFORMANCE PATTERNS (AVOID 10-50x REGRESSION).
Full API mapping: references/api-mapping.md.
Triton math dtype (erf/erfc/exp/log/sqrt) and the "don't substitute erf with tanh" pattern: references/debugging.md — section Triton Math Function Dtype Requirements (CRITICAL).
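One way to spot raw pointer+mask loads before benchmarking is to scan the Triton source for `tl.load` calls that pass a `mask`. The sketch below is a line-based heuristic only: `find_raw_masked_loads` and the `SAMPLE` text are hypothetical, a load split across several lines would be missed, and a hit on a 1D load is not by itself a problem.

```python
import re

# Matches a tl.load(...) call that mentions `mask` on the same line.
RAW_MASKED_LOAD = re.compile(r"tl\.load\(.*mask")


def find_raw_masked_loads(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for each tl.load(...) that passes a mask."""
    return [
        (lineno, line.strip())
        for lineno, line in enumerate(source.splitlines(), start=1)
        if RAW_MASKED_LOAD.search(line)
    ]


# Hypothetical kernel body, treated as plain text (never executed here).
SAMPLE = """\
a = tl.load(a_ptr + offs_m[:, None] * stride_m + offs_k[None, :], mask=m_mask)
b = b_desc.load([pid_k * BLOCK_K, pid_n * BLOCK_N])
c = tl.load(bias_ptr + offs_n, mask=offs_n < N)
"""

hits = find_raw_masked_loads(SAMPLE)
for lineno, line in hits:
    print(f"line {lineno}: {line}")
```

Here line 1 is a 2D tile load that should be reviewed for a TMA descriptor, line 2 already uses a descriptor, and line 3 is a 1D load that needs a human decision rather than an automatic failure.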

Optimization strategy (hub)


File: references/optimization-strategy.md
Summarizes translations/advanced-patterns.md (layout flags, dual kernels, autotune+`META`, batched launch, Blackwell pointers) and references/optimizing-reference.md (post-TMA micro-opts, §9) into §1–§3 plus a mandatory §4 Gemma FMHA checklist.
Rule: For attention / FMHA / Gemma-style conversions, open optimization-strategy in the same session as workflow — do not rely on TMA alone for perf sign-off.

Reference Documents {#reference-documents}


Read from cuTile → Triton perspective. Core files live in this skill under ``.
| Category | Document | Content |
|----------|----------|---------|
| Strategy | optimization-strategy.md | Ordered hub: advanced-patterns + optimizing-reference; §4 Gemma FMHA mandatory checklist |
| Workflows | translations/workflow.md | Standard c2t conversion (phases + checklist) |
| | translations/file-structure.md | Where to place Triton files when converting from cuTile |
| | translations/advanced-patterns.md | Dual layout flags (transpose), autotune + `META` grid, MLA-style two kernels |
| API | api-mapping.md | cuTile → Triton mapping |
| | optimizing-reference.md | GEMM/BMM/attention optimizations (EVEN_K, transpose, grid, autotune, epilogue subtile) |
| Gotchas | gotchas.md | Common cuTile→Triton translation errors (mma, dtype, grid, TMA, layout flags) |
| | performance-gotchas.md | 10-50× regression-risk table (TMA vs ptr+mask, broadcast_to, extract_slice chains, autotune) |
| Testing & errors | references/debugging.md | Triton runtime errors (cudaErrorIllegalAddress, pointer type, stride overflow) |

Worked Examples


Use cutile_kernel.py as source and triton_kernel.py as target:
| Example | Directory | Complexity |
|---------|-----------|------------|
| Vector Add | examples/01_vector_add/ | Basic |
| Softmax | examples/02_softmax/ | Intermediate |
| LayerNorm | examples/03_layernorm/ | Intermediate |
| MatMul | examples/04_matmul/ | Advanced |
| Attention | examples/05_attention/ | Advanced |

Read `cutile_kernel.py` first, then `triton_kernel.py`, to see the inverse mapping.

⚠️ MANDATORY COMPLETION CHECKLIST (DO NOT SKIP)


A conversion is NOT COMPLETE until ALL items are checked. Copy and complete:
MANDATORY COMPLETION GATES:
 [ ] 1. CORRECTNESS: pytest passes with 0 failures
     Command: python -m pytest {test_path} -k "test_op and triton" -vs --tb=short
     Gate: "X passed, 0 failed"

 [ ] 2. TMA OPTIMIZATION: All 2D+ tile loads use tl.make_tensor_descriptor
     Verify: grep -n "tl.load.*mask" triton_file.py | wc -l  # Should be 0 for 2D+ ops
     Skip = 5-20x performance regression

 [ ] 3. PERFORMANCE TEST: Triton within 20% of cuTile baseline
     Command: python -m pytest {test_path} -k "test_perf" --print-record -v
     OR: Run benchmark script: cd tests/benchmark && python bench_{op}.py
     Gate: Triton TFLOPS >= 0.8 * CuTile TFLOPS

 [ ] 4. PERFORMANCE COMPARISON RECORDED:
     Document results:
     | Config | Triton (TFLOPS) | CuTile (TFLOPS) | Ratio |
     |--------|-----------------|-----------------|-------|
     | [fill] | [fill]          | [fill]          | [fill]|

CONVERSION COMPLETE: All 4 gates passed? → YES / NO
Why this matters:
  • Gate 1 catches functional bugs
  • Gate 2 prevents catastrophic 5-20x regressions (most common mistake)
  • Gate 3 validates that optimization was effective
  • Gate 4 creates accountability record
If any gate fails: Fix and re-verify before declaring complete.
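Gates 3 and 4 reduce to a ratio check and a table row, which can be scripted as below. This is a sketch under stated assumptions: `perf_gate` and `table_row` are hypothetical helper names, and the TFLOPS values are placeholders for illustration, not measured results.

```python
def perf_gate(triton_tflops: float, cutile_tflops: float, threshold: float = 0.8):
    """Gate 3: pass when Triton reaches at least `threshold` of the cuTile baseline."""
    ratio = triton_tflops / cutile_tflops
    return ratio >= threshold, ratio


def table_row(config: str, triton_tflops: float, cutile_tflops: float) -> str:
    """Gate 4: format one row of the results table."""
    _, ratio = perf_gate(triton_tflops, cutile_tflops)
    return f"| {config} | {triton_tflops:.1f} | {cutile_tflops:.1f} | {ratio:.2f} |"


# Placeholder numbers for illustration only; these are not measured results.
passed, ratio = perf_gate(triton_tflops=410.0, cutile_tflops=500.0)
print(passed, round(ratio, 2))                # True 0.82
print(table_row("M=N=K=4096", 410.0, 500.0))  # | M=N=K=4096 | 410.0 | 500.0 | 0.82 |
print(perf_gate(350.0, 500.0)[0])             # False
```

Recording the ratio alongside the raw numbers makes it obvious later which configurations sat close to the 0.8 threshold.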