Search Results: triton-kernel

Found 6 Skills

perf-optimization

Performance optimization coordination playbook. Contains specialist routing table, TileIR two-step pipeline, kernel generation specialist selection, prioritization criteria, and safe modification workflow. Use when the user asks to apply optimizations, write kernels, or improve performance. Covers both user-specified optimization and autopilot-driven iterative optimization.

AI & Machine Learning | pytorch/pytorch

pt2-bug-basher

Debug PyTorch 2 compiler stack failures including Dynamo graph breaks, Inductor codegen errors, AOTAutograd crashes, and accuracy mismatches. Use when encountering torch.compile errors, BackendCompilerFailed exceptions, recompilation issues, Triton kernel failures, FX graph problems, or when the user mentions debugging PT2, Dynamo, Inductor, or compiled model issues.

AI & Machine Learning | nvidia/skills

ad-add-fusion-transformation

Claude Code skill (trtllm-agent-toolkit): implement or extend TensorRT-LLM AutoDeploy fusion transforms under transform/library/ in a TensorRT-LLM checkout. Prefer existing kernels and custom ops; use Triton only when no viable existing-kernel path exists. Use ad-graph-dump for AD_DUMP_GRAPHS_DIR workflows. Covers TRT-LLM paths, registry, default.yaml registration, graph validation, tests, and a review checklist — without prescribing profiling tools or throughput targets.

AI & Machine Learning | nvidia/skills

kernel-triton-writing

ONLY for OpenAI Triton (@triton.jit) kernel development. NEVER use for CUDA C++ kernels, TileIR, or profiling tools (ncu, nsys). The user's request must involve Triton explicitly. Covers Triton-specific patterns: fused elementwise, reductions (softmax, LayerNorm, RMSNorm), tiled GEMM with triton.autotune, and flash attention. Workflow: design, write, verify (with fast-path for explicit requests).

3 scripts/Attention

AI & Machine Learning | slowlyc/agent-gpu-skills

triton-skill

Write, debug, and optimize Triton and Gluon GPU kernels using local source code, tutorials, and kernel references. Use when the user mentions Triton, Gluon, tl.load, tl.store, tl.dot, triton.jit, gluon.jit, wgmma, tcgen05, TMA, tensor descriptor, persistent kernel, warp specialization, fused attention, matmul kernel, kernel fusion, tl.program_id, triton autotune, MXFP, FP8, FP4, block-scaled matmul, SwiGLU, top-k, or asks about writing GPU kernels in Python.

1 scripts/Attention

AI & Machine Learning | ascend/agent-skills

triton-operator-code-gen

Generate Triton kernel code for Ascend NPU based on operator design documents. Use when users need to implement Triton operator kernels and convert requirement documents into executable code. Core capabilities: (1) parse requirement documents to confirm the computing logic; (2) design a tiling and partitioning strategy; (3) generate high-performance kernel code; (4) generate test code to verify correctness.