Loading...
Loading...
Found 6 Skills
Iteratively optimize cuTile kernel performance through systematic profiling, bottleneck analysis, IR comparison, and targeted tuning. Covers tile sizes, occupancy, autotune configs, TMA, latency hints, persistent scheduling, num_ctas, flush_to_zero, and IR-level debugging. Use when asked to "optimize cutile kernel", "improve kernel perf", "tune cutile performance", "make kernel faster", or iteratively benchmark and refine a cuTile GPU kernel in the TileGym project.
Use when adding, modifying, optimizing, or debugging CuTile autotuning code. Trigger signals: `exhaustive_search` / `replace_hints` / `hints_fn` / `cuda.tile.tune` in code, `autotune` in filenames, or correctness/performance issues in autotuned CuTile kernels. Covers: tune-once/cache/launch pattern, per-architecture configs (sm80–sm120), parameter space design (tile sizes, occupancy, num_ctas), and 7 common pitfalls with solutions.
Optimize existing Triton kernels for NVIDIA TileIR backend on Blackwell GPUs (sm_100+). Adds TileIR-specific autotune configs: occupancy, num_ctas, TMA descriptors. Covers kernel classification (dot-related, norm-like, elementwise, reduction), type-specific transformations, and PTX-vs-TileIR benchmarking. Triggered by: "optimize for TileIR", "add TileIR configs", "Blackwell optimization", "TMA descriptors", "2CTA mode", "occupancy tuning". Kernels use standard `import triton`; TileIR activates via ENABLE_TILE=1 when nvtriton is installed.
Converts cuTile GPU kernels (@ct.kernel) to Triton (@triton.jit). Handles standard in-repo conversion, debugging (cudaErrorIllegalAddress, shape mismatch, numerical mismatch), and mapping cuTile idioms (ct.load/ct.store, ct.Constant, ct.launch) to Triton equivalents. Covers dual-kernel layout flags (e.g. transpose=True/False + autotune grid via META) per translations/advanced-patterns.md. Use when converting, porting, or translating cuTile kernels to Triton, or debugging existing Triton translations.
Write, debug, and optimize Triton and Gluon GPU kernels using local source code, tutorials, and kernel references. Use when the user mentions Triton, Gluon, tl.load, tl.store, tl.dot, triton.jit, gluon.jit, wgmma, tcgen05, TMA, tensor descriptor, persistent kernel, warp specialization, fused attention, matmul kernel, kernel fusion, tl.program_id, triton autotune, MXFP, FP8, FP4, block-scaled matmul, SwiGLU, top-k, or asks about writing GPU kernels in Python.
ONLY for OpenAI Triton (@triton.jit) kernel development. NEVER use for CUDA C++ kernels, TileIR, or profiling tools (ncu, nsys). The user's request must involve Triton explicitly. Covers Triton-specific patterns: fused elementwise, reductions (softmax, LayerNorm, RMSNorm), tiled GEMM with triton.autotune, and flash attention. Workflow: design, write, verify (with fast-path for explicit requests).