Loading...
Loading...
Found 7,750 Skills
Clean up text while preserving the writer's voice - minimal edits only
Query NVIDIA PTX ISA 9.1, CUDA Runtime API 13.1, Driver API 13.1, Programming Guide v13.1, Best Practices Guide, Nsight Compute, Nsight Systems local documentation. Debug and optimize GPU kernels with nsys/ncu/compute-sanitizer workflows. Use when writing, debugging, or optimizing CUDA code, GPU kernels, PTX instructions, inline PTX, TensorCore operations (WMMA, WGMMA, TMA, tcgen05), or when the user mentions CUDA API functions, error codes, device properties, memory management, profiling, GPU performance, compute capabilities, CUDA Graphs, Cooperative Groups, Unified Memory, dynamic parallelism, or CUDA programming model concepts.
Develop, debug, and optimize SGLang LLM serving engine. Use when the user mentions SGLang, sglang, srt, sgl-kernel, LLM serving, model inference, KV cache, attention backend, FlashInfer, MLA, MoE routing, speculative decoding, disaggregated serving, TP/PP/EP, radix cache, continuous batching, chunked prefill, CUDA graph, model loading, quantization FP8/GPTQ/AWQ, JIT kernel, triton kernel SGLang, or asks about serving LLMs with SGLang.
Write, debug, and optimize CUTLASS and CuTeDSL GPU kernels using local source code, examples, and header references. Use when the user mentions CUTLASS, CuTe, CuTeDSL, cute::Layout, cute::Tensor, TiledMMA, TiledCopy, CollectiveMainloop, CollectiveEpilogue, GEMM kernel, grouped GEMM, sparse GEMM, flash attention CUTLASS, blackwell GEMM, hopper GEMM, FP8 GEMM, blockwise scaling, MoE GEMM, StreamK, warp specialization CUTLASS, TMA CUTLASS, or asks about writing high-performance CUDA kernels with CUTLASS/CuTe templates.
Write, debug, and optimize Triton and Gluon GPU kernels using local source code, tutorials, and kernel references. Use when the user mentions Triton, Gluon, tl.load, tl.store, tl.dot, triton.jit, gluon.jit, wgmma, tcgen05, TMA, tensor descriptor, persistent kernel, warp specialization, fused attention, matmul kernel, kernel fusion, tl.program_id, triton autotune, MXFP, FP8, FP4, block-scaled matmul, SwiGLU, top-k, or asks about writing GPU kernels in Python.
Review code for bugs, security issues, and best practices. Use when asked to review a PR, diff, or code snippet.
Use when converting Java source files to idiomatic Kotlin, when user mentions "java to kotlin", "j2k", "convert java", "migrate java to kotlin", or when working with .java files that need to become .kt files. Handles framework-aware conversion for Spring, Lombok, Hibernate, Jackson, Micronaut, Quarkus, Dagger/Hilt, RxJava, JUnit, Guice, Retrofit, and Mockito.
Use Claude Code to simplify the current pull request by safely reducing unnecessary scope, complexity, and noise while preserving the intended outcome.
Run Gemini CLI review against the current branch, fix only the review comments that are still valid for the current codebase, and leave invalid comments unchanged.
Complete GitHub pull requests by iterating on CI and review feedback until the PR is ready.
Run Codex, Claude Code, and Gemini CLI reviews against the current branch concurrently, deduplicate the findings, and report only the review comments that are still valid for the current codebase.
Run Codex code review against the current branch and report only the review comments that are still valid for the current codebase, without applying fixes.