cutile-autotuning

CuTile Autotuning

Add autotuning to CuTile kernels using the `exhaustive_search` API with the tune-once/cache/direct-launch pattern.

Instructions

Follow the decision tree to classify the kernel, design a search space, implement the tune-once/cache/launch pattern, and validate performance.
  1. Classify: use the Decision Tree to determine search dimensions (occupancy-only vs. full tile search).
  2. Design the search space: select the matching template from `references/kernel-type-templates.md`; prune to ≤ 30 configs via arch filters.
  3. Implement: add `exhaustive_search` + cache + `ct.launch` following the Step-by-Step Workflow; handle in-place writes with a split buffer if needed.
  4. Test: run correctness tests with autotune enabled and with `DISABLE_AUTOTUNE=1`.
  5. Validate: A/B benchmark against the fixed best-known config; see `references/search-strategies.md`.

Task Router — Jump to What You Need

| What are you trying to do? | Go to |
|---|---|
| Add autotune to a new kernel (most common) | Quick Reference below → Workflow: Adding Autotune → `references/kernel-type-templates.md` (pick by kernel type: T1=elementwise, T2=in-place, T3=matmul, T4=persistent, T5=FMHA, T6=FP8, T7=grouped GEMM, T8=varlen attention, T9=dual-GEMM fusion) |
| Debug: data corruption / wrong results after first run | Pitfall #1 (In-Place Kernel) |
| Debug: autotune taking 5+ minutes | Pitfall #2 (Compilation Timeout) |
| Debug: search space generator returning zero configs | Pitfall #5 first; also check arch filters, size guards, and `num_ctas` constraints |
| Optimize an existing autotune config | Workflow: Optimizing an Existing Config |

Quick Reference — Occupancy-Only Autotune (Tune-Once/Cache/Launch)

Most CuTile kernels (elementwise, reduction, LayerNorm) need only occupancy tuning. Copy this pattern:

```python
from types import SimpleNamespace
from cuda.tile.tune import exhaustive_search
import cuda.tile as ct
import torch

def _my_autotune_configs():
    for occ in [1, 2, 4, 8]:
        yield SimpleNamespace(occupancy=occ)

# Module-level cache: tune once, launch fast forever after
_autotune_cache = {}

def my_op(x, output):
    stream = torch.cuda.current_stream()
    NUM_SM = torch.cuda.get_device_properties(x.device).multi_processor_count
    M = x.shape[0]  # assumption: one work item per row; adapt to your kernel

    # Cache key: anything that affects optimal config (use str() for device)
    cache_key = (x.shape, x.dtype, str(x.device))

    if cache_key not in _autotune_cache:
        configs = list(_my_autotune_configs())
        result = exhaustive_search(
            configs,
            stream,
            grid_fn=lambda cfg: (min(NUM_SM * cfg.occupancy, M), 1, 1),
            kernel=my_kernel,
            args_fn=lambda cfg: (x, output, ...),
            hints_fn=lambda cfg: {"occupancy": cfg.occupancy},
        )
        best_cfg = result.best.config
        tuned_kernel = my_kernel.replace_hints(occupancy=best_cfg.occupancy)
        _autotune_cache[cache_key] = (best_cfg, tuned_kernel)  # cache BOTH

    cfg, tuned_kernel = _autotune_cache[cache_key]
    grid = (min(NUM_SM * cfg.occupancy, M), 1, 1)
    ct.launch(stream, grid, tuned_kernel, (x, output, ...))
```

Key rules:
- **Tune once, cache, launch directly** — `exhaustive_search` runs only on first call per shape; subsequent calls use cached config + `ct.launch` with zero overhead
- For in-place kernels use split-buffer during search (separate input/output tensors)
- Keep ≤ 30 configs total
- `exhaustive_search` requires a `Sequence` (list/tuple) — convert generators with `list()`
- **Search space must include the original fixed config** — this guarantees autotuning never makes performance worse

**When to use this pattern**: Kernel has fixed block size (not tile-size tunable). Includes: elementwise (SwiGLU, GeGLU), reduction (RMSNorm, LayerNorm), RoPE, and persistent kernels with heuristic block sizes (grouped GEMM).

For complex kernels (matmul with tile sizes, FMHA, FP8 with num_ctas), read the full guide below + [`kernel-type-templates.md`](references/kernel-type-templates.md).

> **⚠️ Two pitfalls catch almost everyone — check before submitting:**
> - **In-place kernel** (writes back to input tensor)? → MUST use split-buffer pattern during search → Pitfall #1
> - **Search space empty?** → Check arch filters and `num_ctas` constraints → Pitfall #5
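
The rule that the search space must include the original fixed config can be checked mechanically. The following is a minimal sketch, assuming configs are `SimpleNamespace` objects as in the pattern above; `with_baseline` is a hypothetical helper name and not part of CuTile.

```python
from types import SimpleNamespace

def with_baseline(configs, baseline, max_configs=30):
    """Return configs as a list that is guaranteed to contain `baseline`.

    Illustrative helper only; not a CuTile API.
    """
    out = list(configs)  # also converts a generator into a Sequence
    if all(vars(c) != vars(baseline) for c in out):
        out.insert(0, baseline)  # original fixed config goes first
    if len(out) > max_configs:
        raise ValueError(f"{len(out)} configs exceeds the limit of {max_configs}")
    return out

candidates = (SimpleNamespace(occupancy=o) for o in [1, 4, 8])
configs = with_baseline(candidates, SimpleNamespace(occupancy=2))
print([c.occupancy for c in configs])
```

Putting the baseline first is a design choice of this sketch: it keeps the known-good config easy to spot in logs and gives a natural fallback entry.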

Reading Guide

  • Occupancy-only kernels (elementwise, reduction, persistent with fixed block sizes): Quick Reference + Pitfall Checklist is sufficient; skip the `references/` docs. For in-place kernels, also read Pitfall #1.
  • Complex kernels (matmul with tunable tile sizes, FMHA, FP8 with `num_ctas`): Quick Reference → Decision Tree → API Reference → Step-by-Step Workflow → relevant `references/` docs.

5-step summary: Classify kernel → Design search space (`parameter-space-design.md`) → Implement using template (`kernel-type-templates.md`) → Validate with A/B test → Check Pitfall Checklist.

Design Philosophy

Build a small, precise search space bottom-up rather than trimming down a large one. CuTile compilation is much heavier than Triton's (~0.5-1s per config), so 30 configs is the hard upper limit. Classify the kernel type first, then construct only the configs relevant to that type and architecture. Never start with a large Cartesian product and prune; start with the minimum viable space and expand only if data shows it is needed.
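
To make the bottom-up approach concrete, here is a small sketch that builds a space per kernel type and checks it against the 30-config limit. The tile values and the `build_space` name are illustrative assumptions, not recommended configs; real values belong in `parameter-space-design.md`.

```python
from types import SimpleNamespace

MAX_CONFIGS = 30  # hard upper limit stated above

def build_space(kernel_type):
    """Construct only the configs relevant to one kernel type (illustrative values)."""
    if kernel_type == "occupancy_only":
        return [SimpleNamespace(occupancy=o) for o in (1, 2, 4, 8)]
    if kernel_type == "matmul":
        # Hypothetical minimal tile set chosen for the example only.
        tiles = [(128, 128, 64), (128, 256, 64), (256, 128, 64)]
        return [
            SimpleNamespace(TILE_SIZE_M=m, TILE_SIZE_N=n, TILE_SIZE_K=k,
                            occupancy=occ, num_ctas=ctas)
            for (m, n, k) in tiles
            for occ in (1, 2)
            for ctas in (1, 2)
        ]
    raise ValueError(f"unknown kernel type: {kernel_type}")

space = build_space("matmul")
assert len(space) <= MAX_CONFIGS, "expand only when data shows it is needed"
print(len(space))
```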

Decision Tree: What Search Dimensions Does This Kernel Need?


All kernels should have autotuning added. The question is not whether to autotune, but what dimensions to search:
What type of kernel is this?
├── Compute-bound (matmul, GEMM, FMHA) → Does it have multiple tunable dimensions (tile sizes)?
│   ├── YES → Is it a fused multi-GEMM kernel (dual-GEMM, e.g. Linear+GLUAct)?
│   │   ├── YES → Template 9: low occupancy (1–2), conservative tiles (2× SHMEM/register pressure)
│   │   └── NO  → Full search: TILE_M × TILE_N × (TILE_K) × occupancy × num_ctas
│   │             (see matmul/FMHA templates in kernel-type-templates.md)
│   └── NO  → Occupancy-only search: [1, 2, 4, 8]
│             (see Quick Reference above)
├── Balanced (LayerNorm, reduction + compute) →
│   Occupancy-only search: [1, 2, 4, 8]
│   Expected benefit: 2-15%
└── Memory-bound (CE Loss, pure elementwise) →
    Occupancy-only search: [1, 2, 4, 8]
    Expected benefit: 0-15% (varies by kernel; zero-cost after tuning)
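
The tree above can also be written as a small lookup function. This is a sketch only: the function name, the string labels, and the returned dictionary layout are assumptions made for illustration.

```python
def search_dimensions(kernel_class, tunable_tiles=False, fused_multi_gemm=False):
    """Map a kernel classification to the dimensions worth searching.

    Mirrors the decision tree above; labels and return shape are illustrative.
    """
    if kernel_class == "compute-bound" and tunable_tiles:
        if fused_multi_gemm:
            # Template 9: low occupancy, conservative tiles
            return {"template": 9, "occupancy": [1, 2], "tiles": "conservative"}
        return {"dims": ["TILE_M", "TILE_N", "TILE_K", "occupancy", "num_ctas"]}
    if kernel_class in ("compute-bound", "balanced", "memory-bound"):
        return {"dims": ["occupancy"], "occupancy": [1, 2, 4, 8]}
    raise ValueError(f"unknown kernel class: {kernel_class}")

print(search_dimensions("memory-bound"))
print(search_dimensions("compute-bound", tunable_tiles=True))
```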
Why memory-bound kernels only search occupancy (not `num_ctas` or tile sizes):
  • `num_ctas` has zero benefit: `num_ctas > 1` enables TMA multicast, where multiple CTAs share tile data in shared memory (e.g., matmul A/B tiles reused across CTAs). Memory-bound kernels use per-element `ct.gather` / `ct.scatter` with no tile reuse, so multi-CTA cooperation adds overhead with no data-sharing benefit.
  • Tile sizes are pre-determined: BLOCK_SIZE for memory-bound kernels is determined by an offline sweep (e.g., 1024 is globally optimal on B200 across [256, 512, 1024, 2048, 4096, 8192]). This is a constant, not a runtime tunable.
  • Occupancy is the only effective knob: higher occupancy lets the GPU hide memory latency by switching to another CTA while one is stalled on a memory request.

Evidence (CE Loss experiment): A 12-config search (occupancy × `num_ctas`) on Cross-Entropy Loss yielded only a 2.5% gain (0.79x → 0.81x vs Triton). The `num_ctas` dimension contributed nothing; the result was reverted because compilation cost outweighed the marginal benefit. Occupancy-only (4 configs) achieves the same result in one third of the compilation time.

Note on memory-bound kernels: Adding occupancy-only autotune is always worthwhile because:
  • The tune-once/cache/launch pattern has zero runtime overhead after the first call
  • The search space is tiny (4 configs, ~2-4s compilation)
  • Even small improvements have value at scale
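
The compilation-time figures above follow directly from the ~0.5-1s per-config estimate. A quick back-of-the-envelope sketch (it counts compilation only and ignores benchmarking time, which is an assumption of this example):

```python
def compile_time_range_s(num_configs, per_config_s=(0.5, 1.0)):
    """Estimated (low, high) first-call compilation time in seconds."""
    lo, hi = per_config_s
    return (num_configs * lo, num_configs * hi)

for n in (4, 12, 30):
    lo, hi = compile_time_range_s(n)
    print(f"{n:>2} configs: ~{lo:.0f}-{hi:.0f}s")
```

The 4-config and 12-config rows correspond to the occupancy-only search and the CE Loss experiment; the 30-config row is the hard limit.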

Occupancy Selection Guide

Occupancy controls how many CTAs run concurrently per SM. Use this as a starting point when designing the occupancy search space:

| Occupancy Range | Best For | Example Kernels |
|---|---|---|
| 1–4 | Compute-bound (heavy math) | Complex transforms, matmul |
| 4–8 | Balanced (GEMM, TMA) | Matrix multiply, FMHA |
| 8–16 | Memory-bound (reductions) | Softmax, LayerNorm |
| 16–32 | Very light (copies, casts) | Type conversions, elementwise |

Use these ranges to seed your initial search space. For occupancy-only kernels, `[1, 2, 4, 8]` covers most cases; see Quick Reference above.
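
One possible way to turn the table into seed values is to take the powers of two inside each range. Restricting to powers of two is an assumption of this sketch, as are the `OCCUPANCY_RANGES` and `seed_occupancies` names.

```python
OCCUPANCY_RANGES = {
    "compute-bound": (1, 4),
    "balanced": (4, 8),
    "memory-bound": (8, 16),
    "very-light": (16, 32),
}

def seed_occupancies(kernel_class):
    """Powers of two inside the table's range for this kernel class."""
    lo, hi = OCCUPANCY_RANGES[kernel_class]
    values, occ = [], 1
    while occ <= hi:
        if occ >= lo:
            values.append(occ)
        occ *= 2
    return values

print(seed_occupancies("compute-bound"))
print(seed_occupancies("memory-bound"))
```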

exhaustive_search API Reference

⚠️ Deprecated API: `cuda.tile_experimental.autotune_launch()` (aka `ct_experimental.autotune_launch`) is deprecated and should NOT be used. It combines search + launch in one call with random sampling, which produces less reproducible results and worse config selection than `exhaustive_search`. Always use `cuda.tile.tune.exhaustive_search` (the current API below) with explicit caching and `ct.launch`.

Current API (`cuda.tile.tune`)

```python
from cuda.tile.tune import exhaustive_search, TuningResult

# Annotated signature. The bare * marks `quiet` as keyword-only; omit it in a real call.
result: TuningResult = exhaustive_search(
    search_space,   # Sequence[T]: list or tuple of configs (NOT a generator)
    stream,         # torch.cuda.current_stream()
    grid_fn,        # callable(cfg) → tuple[int, ...]
    kernel,         # @ct.kernel decorated function
    args_fn,        # callable(cfg) → tuple of kernel args
    hints_fn=None,  # callable(cfg) → {"occupancy": int, "num_ctas": int}
    *,
    quiet=False     # suppress output
)
```

TuningResult

```python
@dataclass
class TuningResult[T]:
    best: Measurement       # best config + timing (mean_us, error_margin_us, num_samples)
    successes: Sequence[Measurement]   # all successful configs (sorted by performance)
    failures: Sequence[tuple[T, str, str]]  # (config, exception_type, message)
```

Key properties:
  • Exhaustive: evaluates ALL configs in order; no random sampling, no skipped configs
  • Search only: does not perform the final production launch. It executes trial runs internally for benchmarking, but you call `ct.launch` separately for the actual production invocation
  • No built-in cache: you manage caching explicitly (see tune-once/cache/launch pattern)
  • Deterministic: same search space always produces the same evaluation order
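
For logging or debugging, it can help to summarize both successes and failures after a search. The sketch below uses stand-in dataclasses that only mirror the fields listed above (`best`, `successes`, `failures`, `config`, `mean_us`); they are not the classes from `cuda.tile.tune`, and `summarize` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace

@dataclass
class FakeMeasurement:          # stand-in, not the library's Measurement
    config: SimpleNamespace
    mean_us: float

@dataclass
class FakeTuningResult:         # stand-in with the fields listed above
    best: FakeMeasurement
    successes: list
    failures: list = field(default_factory=list)

def summarize(result):
    """One line per config: timing for successes, error type for failures."""
    lines = [f"occ={m.config.occupancy}: {m.mean_us:.1f} us" for m in result.successes]
    lines += [f"occ={cfg.occupancy}: FAILED ({exc_type})"
              for cfg, exc_type, _msg in result.failures]
    return lines

ms = [FakeMeasurement(SimpleNamespace(occupancy=o), t) for o, t in [(2, 10.5), (1, 12.0)]]
result = FakeTuningResult(
    best=ms[0],
    successes=ms,
    failures=[(SimpleNamespace(occupancy=8), "RuntimeError", "out of resources")],
)
print(result.best.config.occupancy)
print(summarize(result))
```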

Tune-Once / Cache / Launch Pattern

This is the recommended pattern for all autotuned kernels. It ensures:
  • First call: runs `exhaustive_search` to find the best config (~2-30s depending on space size)
  • Subsequent calls: uses the cached config with `ct.launch`, with no autotune overhead (the launch is identical to a fixed `ct.launch`)
```python
_cache = {}

def run_kernel_autotuned(x, *args):  # *args stands in for the remaining kernel arguments
    stream = torch.cuda.current_stream()
    cache_key = (x.shape, x.dtype, str(x.device))

    if cache_key not in _cache:
        configs = list(_my_autotune_configs())
        result = exhaustive_search(
            configs, stream,
            grid_fn=lambda cfg: ...,   # fill in: grid tuple for this config
            kernel=my_kernel,
            args_fn=lambda cfg: ...,   # fill in: kernel argument tuple for this config
            hints_fn=lambda cfg: {"occupancy": cfg.occupancy},
        )
        best_cfg = result.best.config
        tuned_kernel = my_kernel.replace_hints(occupancy=best_cfg.occupancy)
        _cache[cache_key] = (best_cfg, tuned_kernel)  # cache BOTH config and compiled kernel

    cfg, tuned_kernel = _cache[cache_key]
    grid = compute_grid(cfg)
    ct.launch(stream, grid, tuned_kernel, (x, *args))
```
Why this pattern matters: The `ct.launch` call in the fast path is identical to what you'd write for a fixed-config kernel. It adds no autotune-specific per-call work: no lock, no lambda invocation, no re-benchmarking. The only extra cost is building the cache key and one Python dict lookup for `_cache[cache_key]`.

⚠️ Critical: always cache the tuned kernel object, not just the config. `replace_hints()` returns a new kernel object with its own independent JIT cache. Calling it on every invocation triggers recompilation each time, degrading performance by 100–500×. Call `replace_hints()` once after `exhaustive_search`, store the returned kernel in the cache alongside the config, and reuse it directly on the fast path. See Pitfall #7.

replace_hints

After finding the best config, use `kernel.replace_hints()` to create a kernel variant with the optimal hints:

```python
# For occupancy-only:
tuned_kernel = my_kernel.replace_hints(occupancy=cfg.occupancy)

# For occupancy + num_ctas:
tuned_kernel = my_kernel.replace_hints(occupancy=cfg.occupancy, num_ctas=cfg.num_ctas)
```

`replace_hints` accepts only `occupancy` and `num_ctas`; these are the only compiler hints controllable via the autotune API.

**`ByTarget` wrapping for cross-architecture portability**: When creating tuned kernel variants via `ct.kernel()`, prefer wrapping hint values in `ct.ByTarget` for portability across GPU architectures:

```python
# Preferred: explicit architecture targeting (portable)
tuned_kernel = ct.kernel(
    my_kernel._pyfunc,
    occupancy=ct.ByTarget(sm_100=best_cfg.occupancy),
    num_ctas=ct.ByTarget(sm_100=best_cfg.num_ctas, default=1),
)

# Also acceptable: plain integers (when targeting a single architecture)
tuned_kernel = ct.kernel(my_kernel._pyfunc, occupancy=best_cfg.occupancy)
```

When targeting only the current GPU (the common case in autotuning), plain integers work fine. Use `ByTarget` when the code may run on multiple architectures or when following production conventions (TileGym production code consistently uses `ByTarget`).

Kernel Hints

CuTile kernel performance is controlled by two compile-time hints:
  • `occupancy`: Number of CTAs per SM. Higher occupancy = more parallelism but less shared memory per CTA.
  • `num_ctas`: Number of CTAs in a CGA (Cooperative Grid Array). Used for multi-CTA cooperation (e.g., TMA multicast). Only supported on sm90+.

Three ways to set hints:

```python
# 1. Fixed value in decorator (no autotune needed)
@ct.kernel(occupancy=2, num_ctas=1)
def my_kernel(...): ...

# 2. Architecture-specific fixed value (no autotune needed)
@ct.kernel(num_ctas=ct.ByTarget(sm_100=2, sm_120=1, default=1))
def my_kernel(...): ...

# 3. Runtime autotune via exhaustive_search + replace_hints
# IMPORTANT: Remove fixed hints from decorator first!
@ct.kernel
def my_kernel(...): ...

# Then in the host wrapper:
tuned_kernel = my_kernel.replace_hints(occupancy=best_occ, num_ctas=best_ctas)
ct.launch(stream, grid, tuned_kernel, args)
```

**Important**: `replace_hints` correctly overrides decorator hints (it uses `dataclasses.replace()` internally). However, if you forget to call `replace_hints`, the decorator's fixed values are used instead of the autotuned values. To avoid this confusion, always remove fixed hints from the `@ct.kernel(...)` decorator before adding autotuning, so that hints come only from the autotune path.
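
Since `replace_hints` is described as using `dataclasses.replace()` internally, the standard-library semantics are worth seeing in isolation. The sketch below uses a stand-in dataclass (`FakeKernel` is hypothetical and unrelated to the real kernel class) to show that the override wins and that the original object is left untouched.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class FakeKernel:               # stand-in for a kernel object, not a CuTile class
    name: str
    occupancy: int = 1
    num_ctas: int = 1

decorated = FakeKernel("my_kernel", occupancy=2, num_ctas=1)  # fixed decorator hints
tuned = replace(decorated, occupancy=8)                       # analogous to replace_hints

print(tuned.occupancy, decorated.occupancy)
print(tuned is decorated)
```

Because the result is a new object, launching `decorated` by mistake still uses the old hints, which is the silent fallback the paragraph above warns about.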

search_space Design

The search space is a list of `SimpleNamespace` objects. Each namespace holds config fields that `grid_fn`, `args_fn`, and `hints_fn` can read.

```python
from types import SimpleNamespace

# Occupancy-only (elementwise kernels)
def autotune_configs():
    for occ in [1, 2, 4, 8]:
        yield SimpleNamespace(occupancy=occ)

# Full matmul search space: see parameter-space-design.md for complete per-architecture configs
# Pattern: yield SimpleNamespace(TILE_SIZE_M=..., TILE_SIZE_N=..., TILE_SIZE_K=..., num_ctas=..., occupancy=...)
```

**Note**: `exhaustive_search` requires a `Sequence` (list/tuple), not a generator. Always convert with `list()`:
```python
configs = list(autotune_configs())
result = exhaustive_search(configs, ...)
```
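
The difference between a generator and a `Sequence` is easy to verify in plain Python. This sketch uses only the standard library and the `autotune_configs` generator shown above; it does not call `exhaustive_search`.

```python
from collections.abc import Sequence
from types import SimpleNamespace

def autotune_configs():
    for occ in [1, 2, 4, 8]:
        yield SimpleNamespace(occupancy=occ)

gen = autotune_configs()
configs = list(autotune_configs())

print(isinstance(gen, Sequence))      # generators have no len() or indexing
print(isinstance(configs, Sequence))
print(len(configs), configs[0].occupancy)

# A generator is also single-use: a second pass yields nothing.
first_pass = list(gen)
second_pass = list(gen)
print(len(first_pass), len(second_pass))
```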

grid_fn Patterns

```python
from math import ceil
import torch

# Pattern A: Simple tile coverage (matmul, elementwise)
grid_fn=lambda cfg: (ceil(M / cfg.TILE_SIZE_M) * ceil(N / cfg.TILE_SIZE_N), 1, 1)

# Pattern B: Persistent matmul (static_persistent_matmul_kernel)
NUM_SMS = torch.cuda.get_device_properties("cuda").multi_processor_count
grid_fn=lambda cfg: (
    min(NUM_SMS // cfg.num_ctas, ceil(M / cfg.TILE_M) * ceil(N / cfg.TILE_N)) * cfg.occupancy,
    1,
    1,
)

# Pattern C: 2D grid (FMHA: one dim for seq tiles, one for batch*heads)
grid_fn=lambda cfg: (ceil(q_len / cfg.TILE_M), batch_size * num_heads, 1)

# Pattern D: 1D elementwise (cdiv = math.ceil(a/b), from ct_ops.py)
grid_fn=lambda cfg: (cdiv(n_elements, BLOCK_SIZE),)

# Pattern E: Grouped GEMM persistent (grid fixed at NUM_SMS, occupancy via hints_fn only)
grid_fn=lambda cfg: (NUM_SMS, 1, 1)
```
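
As a worked example of Patterns A and B, the sketch below evaluates both grids for one concrete config. The SM count of 148, the problem size, and the function names are assumptions chosen for illustration; real code should query the device as shown above.

```python
from math import ceil
from types import SimpleNamespace

NUM_SMS = 148            # assumed SM count for illustration only
M, N = 8192, 8192        # assumed problem size

def grid_tile_coverage(cfg):                 # Pattern A
    return (ceil(M / cfg.TILE_M) * ceil(N / cfg.TILE_N), 1, 1)

def grid_persistent(cfg):                    # Pattern B
    tiles = ceil(M / cfg.TILE_M) * ceil(N / cfg.TILE_N)
    return (min(NUM_SMS // cfg.num_ctas, tiles) * cfg.occupancy, 1, 1)

cfg = SimpleNamespace(TILE_M=128, TILE_N=256, num_ctas=2, occupancy=2)
print(grid_tile_coverage(cfg))   # 64 row tiles * 32 column tiles
print(grid_persistent(cfg))      # min(148 // 2, 2048) * 2
```

With these numbers Pattern A launches one CTA per output tile, while Pattern B caps the grid by the SM count and scales it by occupancy.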

Step-by-Step Workflow

Adding Autotune to a New Kernel

  1. Classify the kernel using the decision tree above.
    • VERIFY: You know whether this is occupancy-only or requires tile-size tuning.
  2. Remove hardcoded hints from decorator (strongly recommended): If the kernel currently has hardcoded hints in its decorator (e.g. `@ct.kernel(occupancy=2, num_ctas=1)`), remove those fixed hints and change to a bare `@ct.kernel` before adding autotuning. While `replace_hints` does correctly override decorator values at runtime, leaving them creates a silent fallback trap: if any code path (e.g., `DISABLE_AUTOTUNE`, error handling, or a future refactor) skips `replace_hints`, the decorator's fixed hints are used instead of the autotuned values, and this produces no error, just silently worse performance. Removing them makes the failure mode explicit (missing hints → compiler defaults) rather than silent (wrong fixed hints used).
    • VERIFY: The `@ct.kernel` decorator has no `occupancy=` or `num_ctas=` arguments before proceeding. Use a bare `@ct.kernel` instead.
  3. Check for in-place writes: If the kernel modifies input tensors in-place, you MUST use the split-buffer pattern during `exhaustive_search`; see Pitfall #1.
    • VERIFY: Either the kernel is not in-place, or you have added a split-buffer scratch tensor for the search phase.
  4. Select the template from `kernel-type-templates.md` based on kernel type.
  5. Design the search space following `parameter-space-design.md`:
    • Start from reference configs, not from scratch. Clone configs from existing production kernels of the same type (e.g., `ops/cutile/matmul.py` for GEMM) and adapt. For GEMM-class kernels, `nvMatmulHeuristics` can suggest 8-16 high-quality candidates that reach 96-99% peak performance; see `parameter-space-design.md` for details.
    • Detect the current GPU architecture with `torch.cuda.get_device_capability()`.
    • Target one architecture at a time. Generate configs only for the detected arch. Do NOT add branches for other architectures: they cannot be tested on this machine, and untested code paths are unreliable. If multi-arch support is needed later, add it in a separate pass on the appropriate hardware.
    • Identify tunable parameters (tile sizes, occupancy, `num_ctas`).
    • Ensure the search space includes the original fixed config (or an equivalent). This guarantees that the autotuned result is at least as good as the original, so no performance regression is possible.
    • If the generated set exceeds 30, apply tile size filters and pruning rules to reduce it to ≤ 30.
    • VERIFY: Total configs ≤ 30 (hard limit: CuTile compilation is heavy, and more than 30 configs will time out).
  6. Implement the tune-once/cache/launch pattern:
    • Define a `_cache` dict at module level.
    • Define a cache key that captures all parameters affecting the optimal config (shapes, dtypes, device, any flags like `is_causal`). ⚠️ Use `str(x.device)`, not `x.device`, in the cache key: `torch.device` objects are not reliably hashable and can cause `TypeError: unhashable type` at runtime. Always convert to string: `cache_key = (..., x.dtype, str(x.device))`. Tip: For GEMM-class kernels, round dimensions to the next power of 2 in the cache key (e.g., `cache_key = (next_pow2(M), next_pow2(N), next_pow2(K), dtype, str(device))`) to reduce the number of unique keys and avoid re-tuning for similar shapes.
    • Call `exhaustive_search(list(configs), ...)` only on a cache miss.
    • Use `kernel.replace_hints(...)` once, right after the search, to create the tuned kernel variant.
    • Store both `result.best.config` and the tuned kernel in the cache (see the note on caching the kernel object in the Tune-Once / Cache / Launch Pattern section).
    • Use `ct.launch()` for the actual kernel invocation.
    • `grid_fn` correctly computes the grid from the config.
    • `args_fn` passes all kernel arguments, including tile sizes as `ct.Constant[int]`.
    • `hints_fn` passes `occupancy` and/or `num_ctas` from the config.
    • VERIFY: `exhaustive_search` receives a `list()` of configs, not a raw generator.
  7. (MANDATORY) Add DISABLE_AUTOTUNE support for CI and profiling: check `os.environ.get("DISABLE_AUTOTUNE", "0") == "1"`. When set, skip `exhaustive_search` entirely and fall back to `ct.launch` with the first valid config. This is required for:
    • CI determinism (autotune adds variable wall time)
    • NCU profiling (prevents autotune trial runs from cluttering the trace; see Pitfall #4)
    • Debugging (isolates kernel correctness from autotune behavior)
    Place the check before the cache lookup so that `DISABLE_AUTOTUNE=1` bypasses all autotune logic. Provide a hardcoded fallback config in case the generator yields zero configs.
    • VERIFY: Running with `DISABLE_AUTOTUNE=1` produces correct results and does not call `exhaustive_search`.
  8. Test: Run correctness tests first (
    pytest -k "test_op and cutile"
    ), then benchmark.
    • VERIFY: Correctness passes with autotune enabled AND with
      DISABLE_AUTOTUNE=1
      .
  9. Validate with A/B test: Compare autotune version vs fixed best-known config. See
    search-strategies.md
    for methodology.
    • VERIFY: Autotune version ≥ baseline (or within noise). If worse, check that the search space includes the original fixed config, and that
      replace_hints
      is being used correctly.
  10. (MANDATORY) Run the test and verify performance before submitting.
    Execute the provided test script (e.g.
    ENABLE_TILE=1 python3 test.py
    ) and check:
    • correctness: PASS
    • speedup_over_fixed >= 1.0
      (autotuned must not be slower than fixed baseline)
    If
    speedup_over_fixed < 1.0
    :
    • Check that the search space includes the original fixed config (this guarantees no regression)
    • Check if
      replace_hints
      is being called on every code path — revisit Step 2 (if any path skips
      replace_hints
      , the decorator's fixed hints are used instead of autotuned values)
    • Expand search space if all configs perform similarly (see
      references/parameter-space-design.md
      → "Adapting Search Space")
    ⚠️ DO NOT submit without running the test at least once. Writing correct-looking code is not sufficient — autotuning bugs (silent hint override, split-buffer omission) are only caught at runtime.
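
The control flow of steps 6 and 7 can be sketched without any GPU dependency. This is a minimal illustration, not the project's implementation: `pick_config`, `autotune_disabled`, `_FALLBACK`, `demo_configs`, and `fake_search` are hypothetical names, and `search_fn` merely stands in for a real `exhaustive_search` call that returns the best config.

```python
import os
from types import SimpleNamespace

# Hypothetical module-level cache: key -> best config.
_cache = {}

# Hardcoded fallback used when the generator yields nothing (illustrative value).
_FALLBACK = SimpleNamespace(occupancy=1)


def autotune_disabled(env=None):
    """True when DISABLE_AUTOTUNE=1 is set; env is injectable for testing."""
    env = os.environ if env is None else env
    return env.get("DISABLE_AUTOTUNE", "0") == "1"


def pick_config(key, config_gen, search_fn, env=None):
    """Return the config to launch with.

    search_fn is a stand-in for the real search call: it receives a list
    of configs and returns the best one.
    """
    # The disable check comes before the cache lookup, so it bypasses
    # every piece of autotune logic and uses the first valid config.
    if autotune_disabled(env):
        return next(iter(config_gen()), _FALLBACK)
    # Tune only on a cache miss, and always hand over a list.
    if key not in _cache:
        configs = list(config_gen())
        _cache[key] = search_fn(configs) if configs else _FALLBACK
    return _cache[key]


def demo_configs():
    """Occupancy-only search space used for the demonstration."""
    for occ in (1, 2, 4, 8):
        yield SimpleNamespace(occupancy=occ)


search_calls = []


def fake_search(configs):
    """Pretend benchmark: records the call and declares occupancy=4 the winner."""
    search_calls.append(len(configs))
    return configs[2]
```

In real code the returned config would feed `replace_hints` and `ct.launch`; the sketch only shows where the environment check, the cache lookup, and the search sit relative to each other.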

Integration with torch.autograd.Function

When the kernel is used inside a `torch.autograd.Function`:
  • By default, place the tune-once/cache/launch logic in `forward()` only. The cached config is reused across calls.
  • In `backward()`, using `ct.launch` with a fixed or cached config is often sufficient. However, if backward has its own independent search space (e.g. grouped GEMM dX and dW have separate optimal configs), autotuning is appropriate there too.
  • Example: in `rope_embedding.py`, forward uses `exhaustive_search` + cache with split-buffer, and backward uses `ct.launch` with same-buffer (Q_in=Q_out).

Cross-Backend Config Transfer (Triton → CuTile)

Use `src/tilegym/autotune.py`: it maps `BLOCK_SIZE_M/N/K` → `TILE_SIZE_M/N/K`; `num_warps`/`num_stages` have no CuTile equivalent and are dropped.
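
To make the mapping concrete, here is a small sketch of the renaming rule described above. It does not reproduce the actual API of `src/tilegym/autotune.py`: `triton_to_cutile` is a hypothetical helper, and supplying a default `occupancy` is an assumption made so the result looks like a CuTile-style config.

```python
from types import SimpleNamespace

# Keys renamed when moving a Triton config to CuTile (mapping from the text above).
_RENAMES = {
    "BLOCK_SIZE_M": "TILE_SIZE_M",
    "BLOCK_SIZE_N": "TILE_SIZE_N",
    "BLOCK_SIZE_K": "TILE_SIZE_K",
}
# Triton-only knobs with no CuTile equivalent; they are dropped.
_DROPPED = {"num_warps", "num_stages"}


def triton_to_cutile(triton_kwargs, occupancy=1):
    """Hypothetical helper: translate one Triton config dict to a CuTile-style config."""
    fields = {"occupancy": occupancy}  # assumption: caller supplies a starting occupancy
    for name, value in triton_kwargs.items():
        if name in _DROPPED:
            continue
        fields[_RENAMES.get(name, name)] = value
    return SimpleNamespace(**fields)


cfg = triton_to_cutile(
    {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 32,
     "num_warps": 4, "num_stages": 3}
)
```

A transferred config is only a starting point; it still has to go through the search like any other candidate.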

Optimizing an Existing Autotune Config

  1. Profile first: Use NCU (set `DISABLE_AUTOTUNE=1`).
  2. Expand (too narrow): add tile sizes, `num_ctas` (sm90+), `swap_ab`.
  3. Prune (too slow): remove suboptimal configs, use arch-conditional yield, add size filters.
  4. Re-validate: A/B test to confirm improvement.
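
The re-validation step can be sketched as a plain timing comparison. This is an illustration under assumptions, not the methodology from `search-strategies.md`: `median_ms`, `speedup_over_fixed`, and `passes_ab` are hypothetical helpers, the 2% noise band is an arbitrary example value, and the timed callable is assumed to block until all GPU work has finished (for example by synchronizing the stream).

```python
import statistics
import time


def median_ms(fn, warmup=3, iters=20):
    """Median wall time of fn() in milliseconds, after a few warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def speedup_over_fixed(fixed_ms, tuned_ms):
    """Values above 1.0 mean the autotuned version is faster."""
    return fixed_ms / tuned_ms


def passes_ab(fixed_ms, tuned_ms, noise=0.02):
    """Accept the autotuned version if it is at least as fast, within a noise band."""
    return speedup_over_fixed(fixed_ms, tuned_ms) >= 1.0 - noise
```

Using the median instead of the mean keeps a single slow outlier (such as a cold first run) from deciding the comparison.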

Pitfall Checklist

Before submitting code with autotune, verify these:

Pitfall #1: In-Place Kernel Data Corruption

Problem: `exhaustive_search` runs the kernel multiple times to benchmark. If the kernel modifies input tensors in-place, the data is corrupted after the first trial run.

Solution: Split-buffer pattern. Use a separate read-only input and write-only output during search:

```python
# During exhaustive_search: use separate output buffer
Q_scratch = torch.empty_like(Q)
configs = list(_rope_autotune_configs())
result = exhaustive_search(
    configs,
    stream,
    grid_fn=...,
    kernel=rope_kernel,
    args_fn=lambda cfg: (Q, Q_scratch, ...),  # Q_in != Q_out
    hints_fn=...,
)

# After search: launch with in-place args using tuned config
cfg = result.best.config
tuned_kernel = rope_kernel.replace_hints(occupancy=cfg.occupancy)
ct.launch(stream, grid, tuned_kernel, (Q, Q, ...))  # Q_in == Q_out (in-place)
```

**Real example**: `rope_embedding.py`. Search uses split-buffer, final launch uses same-buffer.

**Also wrong**: Using `Q.clone()` in `args_fn`. This adds ~4us per clone, which is fatal for small kernels (~5us). The clone+copy pattern caused 0.48x performance in RoPE.

**Tip: isolating output buffers in `args_fn`**: For kernels that write to a dedicated output tensor (not in-place), use `c.clone()` inside `args_fn` to prevent trial runs from overwriting the final output buffer:

```python
# Output tensor c will be overwritten by each trial; clone it so trials don't
# corrupt the buffer the caller expects to use after exhaustive_search returns.
result = exhaustive_search(
    configs,
    stream,
    grid_fn=...,
    kernel=my_kernel,
    args_fn=lambda cfg: (a, b, c.clone()),  # each trial gets a fresh output
    hints_fn=...,
)
```

This is safe because the clone cost (~4us) is negligible relative to compute-bound kernel execution time (~50us+). Only avoid `clone()` for very small, memory-bound kernels where 4us is a significant fraction of runtime. In that case, pre-allocate a single scratch buffer outside `args_fn` (as in the split-buffer pattern above).
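
Why repeated in-place trials corrupt the input can be shown with a toy simulation on plain Python lists. Nothing here touches CuTile: `scale_kernel` and `run_trials` are hypothetical stand-ins for a kernel and for the benchmark loop.

```python
def scale_kernel(src, dst, factor=2.0):
    """Toy stand-in for a GPU kernel: writes factor * src into dst (dst may alias src)."""
    for i, value in enumerate(src):
        dst[i] = value * factor


def run_trials(src, dst, trials):
    """Mimic a benchmark loop that launches the same kernel several times."""
    for _ in range(trials):
        scale_kernel(src, dst)


# In-place trials: every trial re-scales data that was already scaled.
q_inplace = [1.0, 2.0, 3.0]
run_trials(q_inplace, q_inplace, trials=3)

# Split-buffer trials: the input stays intact and the scratch absorbs all writes.
q = [1.0, 2.0, 3.0]
scratch = [0.0] * len(q)
run_trials(q, scratch, trials=3)

# One real in-place launch after the search has finished.
scale_kernel(q, q)
```

After three in-place trials the first list has been scaled by 2³ instead of 2, while the split-buffer path applies the kernel to the real data exactly once.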

Pitfall #2: Compilation Timeout

Problem: More than 30 configs causes compilation to exceed 5 minutes. CuTile compilation is heavier than Triton.
Solution:
  • Keep the total search space ≤ 30 configs: apply arch filters, tile size filters, and pruning rules until you are under the limit
  • Use architecture-conditional yield to generate only the configs relevant to the detected GPU
Real example: Grouped GEMM expanded from 4 to 32 configs → all backward tests timed out. Reverted to occupancy-only (4 configs) with no performance loss.
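
A pruned generator with a hard size check might look like the sketch below. The tile values, the pruning threshold, and the names `matmul_configs` and `build_search_space` are illustrative assumptions, not the project's templates. The compute capability is passed in as a `(major, minor)` tuple so the example runs without a GPU.

```python
from itertools import product
from types import SimpleNamespace

MAX_CONFIGS = 30  # limit stated above


def matmul_configs(capability, M, N):
    """Hypothetical generator: yields pruned configs for one detected architecture."""
    major, _minor = capability
    tiles = (64, 128, 256)
    # Architecture-conditional yield: num_ctas > 1 is only tried on sm90+.
    ctas_options = (1, 2) if major >= 9 else (1,)
    for tm, tn, occ, ctas in product(tiles, tiles, (1, 2), ctas_options):
        if tm > M or tn > N:       # size filter: skip tiles larger than the problem
            continue
        if tm * tn > 128 * 128:    # pruning rule (illustrative threshold)
            continue
        yield SimpleNamespace(TILE_M=tm, TILE_N=tn, occupancy=occ, num_ctas=ctas)


def build_search_space(capability, M, N):
    """Materialize the generator and fail loudly if the space is too large."""
    configs = list(matmul_configs(capability, M, N))
    if len(configs) > MAX_CONFIGS:
        raise ValueError(f"{len(configs)} configs exceeds the limit of {MAX_CONFIGS}")
    return configs


space_sm80 = build_search_space((8, 0), M=4096, N=4096)
space_sm90 = build_search_space((9, 0), M=4096, N=4096)
space_small = build_search_space((8, 0), M=64, N=64)
```

Raising instead of silently truncating keeps the decision about which configs to drop in the generator, where the pruning rules are visible.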

Pitfall #3: Cold-Cache Performance Skew

Problem: The first run in a process is slower because driver/JIT caches are cold. This can cause the wrong config to be selected.
Solution: Always warm up before measuring. `exhaustive_search` has built-in warmup, but first-process cold start is unavoidable. Re-run if you suspect the initial result was affected.

Pitfall #4: NCU Profiling Interference

Problem: NCU profiles autotune trial runs, cluttering the trace.
Solution: Set `DISABLE_AUTOTUNE=1` before profiling, or use `ncu --launch-skip N`.

Pitfall #5: search_space as Generator (Exhaustion)

Problem: `exhaustive_search` requires a `Sequence` (list/tuple), not a generator. Passing a generator directly will fail or produce unexpected results.

Solution: Always convert to list:

```python
# CORRECT: convert generator to list
configs = list(_matmul_autotune_configs())
result = exhaustive_search(configs, ...)

# WRONG: passing generator directly
result = exhaustive_search(_matmul_autotune_configs(), ...)
```

Pitfall #6: FP8 Precision Loss

Problem: The hardware `/` operator breaks FP8 quantization bucket boundaries.
Solution: Use `ct.truediv(x, y, rounding_mode=RoundingMode.FULL)` for IEEE-compliant division in FP8 kernels. Never use the `/` operator for FP8 scale computation.

Pitfall #7: replace_hints on Hot Path (Recompilation)

Problem: `replace_hints()` returns a new kernel object with its own JIT cache (internally it uses `dataclasses.replace()`, which creates a fresh instance). Calling it on every kernel invocation, even with the same arguments, triggers recompilation every time. This is the most common autotune performance bug: `cutile_ms` jumps from ~0.04ms to 16–39ms (100–500× slower).

Incorrect (recompiles on every call):

```python
_cache[key] = result.best.config  # only stores config

cfg = _cache[key]
tuned = my_kernel.replace_hints(occupancy=cfg.occupancy)  # NEW kernel each time!
ct.launch(stream, grid, tuned, ...)
```

Correct (compile once, reuse forever):

```python
best_cfg = result.best.config
tuned = my_kernel.replace_hints(occupancy=best_cfg.occupancy)  # compile ONCE
_cache[key] = (best_cfg, tuned)  # cache both

cfg, tuned = _cache[key]
ct.launch(stream, grid, tuned, ...)  # reuse compiled kernel
```

Rule: Call `replace_hints` exactly once per config (immediately after `exhaustive_search`), cache the returned kernel object, and never call `replace_hints` again on the fast path.
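
The recompilation effect can be modeled with a small stand-in class. `FakeKernel` is hypothetical and is not the CuTile kernel type; it only mirrors the behavior described above, where `dataclasses.replace()` yields a fresh instance whose JIT cache starts empty.

```python
from dataclasses import dataclass, field, replace


@dataclass
class FakeKernel:
    """Stand-in for a JIT-compiled kernel object (not the CuTile class)."""
    occupancy: int = 1
    jit_cache: dict = field(default_factory=dict)

    def replace_hints(self, **hints):
        # Fresh instance with an empty JIT cache, as described in the text.
        return replace(self, jit_cache={}, **hints)

    def launch(self, signature):
        """Return True when this launch had to compile, False on a cache hit."""
        compiled = signature not in self.jit_cache
        self.jit_cache[signature] = True
        return compiled


base = FakeKernel()

# Hot path calls replace_hints every time: every launch compiles.
slow = [base.replace_hints(occupancy=2).launch("f16") for _ in range(3)]

# replace_hints once, then reuse the object: only the first launch compiles.
tuned = base.replace_hints(occupancy=2)
fast = [tuned.launch("f16") for _ in range(3)]
```

The cached object, not the cached config, is what carries the compiled code, which is why the cache entry should hold both.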

Scope and Boundaries

This skill covers only autotune configuration: search space design, `exhaustive_search` invocation, caching, and `ct.launch` with tuned hints. It does not modify kernel code.
In scope (autotune config):
  • Search space generator functions
  • `exhaustive_search()` calls and result handling
  • `kernel.replace_hints()` for applying tuned hints
  • Cache logic (key design, dict management)
  • `ct.launch()` with tuned kernel
  • `DISABLE_AUTOTUNE` fallback path
Out of scope (kernel code modifications; do NOT make these changes):
  • Math flags (`flush_to_zero`, `rounding_mode`)
  • Performance Hints (`slice_hint`, `buffer_depth`, `copy_config`)
  • Memory access patterns (2D→1D gather/scatter conversion)
  • Codegen optimizations (`safe_offs` → `padding_value`)
  • Algorithm changes (K-loop split, load balancing)

Further Optimization Suggestions

After adding autotuning, the following kernel-level optimizations may yield additional gains. These are outside the scope of this skill: mention them to the user as potential next steps, but do not implement them as part of autotuning.
  • Math flags: `flush_to_zero=True` + `rounding_mode=APPROX` can provide 34-72% improvement for FMHA-class kernels (set via environment variables `TILEIR_ENABLE_FTZ=1 TILEIR_ENABLE_APPROX=1` or in kernel code). Causal chain: larger tiles initially decrease performance by 18-43% due to subnormal handling overhead; enabling FTZ+APPROX rescues this and flips the result to +34-72%. Math flags are therefore a prerequisite for large-tile configs to be effective on FMHA-class kernels.
  • Performance Hints: `slice_hint`, `buffer_depth`, `copy_config` (these require modifying kernel IR code)
  • Memory access patterns: Using TMA loads (`ct.load`) instead of `ct.gather`; removing unnecessary bounds checks (`check_bounds=False` when safe)
  • Codegen quality: Using the `padding_value` parameter instead of manual `ct.where` masking; removing `safe_offs`
  • Algorithm restructuring: K-loop split, load balancing, algebraic simplification

Differences from Triton Autotune

Key differences:
  • Triton uses the `@triton.autotune` decorator with `Config(...)` objects; CuTile uses `exhaustive_search()` with `SimpleNamespace` configs + separate cache + `ct.launch`.
  • CuTile has no `num_warps`/`num_stages` (the compiler decides); only tile sizes + `occupancy` + `num_ctas` are tunable.
  • CuTile compilation is heavier (keep ≤30 configs).
  • The CuTile cache is user-managed and in-memory (no automatic persistence).
  • CuTile separates `args_fn` (kernel args) from `hints_fn` (compiler hints).

Reference Documents

| Category | Document | Content |
| --- | --- | --- |
| Parameter Design | `parameter-space-design.md` | Per-kernel-type parameter spaces, cross-arch patterns, grid_fn patterns, pruning rules |
| Search Strategies | `search-strategies.md` | Exhaustive search, A/B test methodology, DISABLE_AUTOTUNE pattern |
| Templates | `kernel-type-templates.md` | Copy-paste autotune templates for 8 kernel types |
| Hardware | `hardware-constraints.md` | Per-architecture constraints, tile size ranges, num_ctas rules, TMA requirements |

Source Code References

Key files:
  • `ops/cutile/matmul.py` (matmul autotune)
  • `ops/cutile/attention.py` (FMHA autotune)
  • `suites/unsloth/cutile/ct_ops.py` (shared `autotune_configs()`, occupancy=[1,2,4,8])
  • `suites/unsloth/cutile/swiglu.py` (elementwise example)
  • `suites/unsloth/cutile/rope_embedding.py` (split-buffer pattern)
  • `suites/unsloth/cutile/grouped_gemm.py` (persistent GEMM, occupancy-only)

Worked Examples

Each example shows the before → after pattern: `fixed_launch.py` (hardcoded `ct.launch`) and `autotuned_launch.py` (refactored to tune-once/cache/launch).

| Directory | Kernel | Autotune Pattern | Complexity | Key Teaching Point |
| --- | --- | --- | --- | --- |
| `assets/examples/01_rmsnorm_occupancy_only/` | RMSNorm (reduction) | Occupancy-only `[1,2,4,8]` | Low | Most common pattern: no tile tuning, just find the best occupancy. Grid = `NUM_SM * cfg.occupancy`. Not in-place. |
| `assets/examples/02_matmul_full_search/` | GEMM C=A@B | Full: `TILE_M/N/K` + `occupancy` + `num_ctas` (sm90+) | High | Compute-bound kernel with multiple tunable dimensions. `args_fn` passes tile sizes as `ct.Constant[int]`. `grid_fn` depends on `cfg`. ≤30 configs. |
| `assets/examples/03_rope_inplace_splitbuffer/` | RoPE embedding (in-place) | Occupancy-only, with split-buffer | Medium | In-place kernel MUST use split-buffer during search to avoid corruption. Search writes to scratch; final `ct.launch` uses real in-place args. |
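
The two grid patterns named in the table can be sketched as plain functions. These are illustrations under assumptions, not code from the example directories: the `NUM_SM` value is a placeholder, the 3-tuple grid shape and the one-block-per-output-tile layout for matmul are assumed, and `occupancy_grid` / `matmul_grid` are hypothetical names.

```python
from math import ceil
from types import SimpleNamespace

NUM_SM = 132  # placeholder; query the real SM count from the device


def occupancy_grid(cfg):
    """Occupancy-only pattern: grid = NUM_SM * cfg.occupancy."""
    return (NUM_SM * cfg.occupancy, 1, 1)


def matmul_grid(cfg, M, N):
    """Full-search pattern: one block per output tile, so the grid depends on cfg."""
    tiles_m = ceil(M / cfg.TILE_M)
    tiles_n = ceil(N / cfg.TILE_N)
    return (tiles_m * tiles_n, 1, 1)


rmsnorm_grid = occupancy_grid(SimpleNamespace(occupancy=4))
gemm_grid = matmul_grid(SimpleNamespace(TILE_M=128, TILE_N=256), M=1000, N=512)
```

Because the matmul grid changes with the tile sizes, the grid must be recomputed from the winning config for the final launch instead of being reused from a fixed-config version.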