All Skills

Total: 50,320 skills. AI & Machine Learning: 8,453 skills.

Showing 12 of 8,453 skills

perf-nsight-systems

Nsight Systems (nsys) CLI for system-level timeline profiling. Use when the user wants to run nsys profile, analyze .nsys-rep reports, use nsys stats/analyze/recipe commands, diagnose GPU idle time from timeline traces, or profile distributed training with NCCL overlap analysis. NOT for kernel-level metrics like SOL%, occupancy, or roofline (use perf-nsight-compute-analysis for ncu). NOT for writing or generating kernels. NOT for applying optimizations like CUDA Graphs.

AI & Machine Learning | nvidia/skills

perf-cuda-graphs

Validate and use CUDA graph capture in Megatron Bridge, including local full-iteration graphs and Transformer Engine scoped graphs for attention, MLP, and MoE modules.

AI & Machine Learning | nvidia/skills

perf-torch-cuda-graphs

Apply CUDA Graphs to PyTorch workloads — API selection (torch.compile, PyTorch make_graphed_callables, TE make_graphed_callables, MCore CudaGraphManager, FullCudaGraphWrapper, manual torch.cuda.graph), code compatibility, capture workflows, dynamic pattern handling, and troubleshooting. Triggers: CUDA graph, torch.cuda.graph, make_graphed_callables, reduce-overhead, graph capture, graph replay, kernel launch overhead, CudaGraphManager, FullCudaGraphWrapper, full-iteration graph, stream capture.

1 script / Checked

AI & Machine Learning | nvidia/skills

cudaq-guide

CUDA-Q onboarding guide for installation, test programs, GPU simulation, QPU hardware, and quantum applications.

AI & Machine Learning | nvidia/skills

ad-accuracy-debug

Debug AutoDeploy accuracy regressions vs a reference score (PyTorch backend or published baseline). Use when an AutoDeploy model's eval score is significantly below the reference and the root cause is unknown.

AI & Machine Learning | nvidia/skills

kernel-cute-writing

Write and implement GPU kernels using NVIDIA CuTe DSL (CUTLASS 4.x Python API) — NOT for Triton, CUDA C++, or conceptual explanations. Trigger only when the user wants to write or implement a kernel, not when asking questions about CuTe DSL concepts or layouts. CuTe DSL uses cute.jit/cute.kernel decorators and cutlass.cute imports. Covers element-wise kernels, GEMM patterns, reductions, memory hierarchy (global/shared/register/TMA), MMA tensor core operations, software pipelining, and framework integration.

3 scripts / Attention

AI & Machine Learning | nvidia/skills

trtllm-flashinfer-upgrade

Upgrade flashinfer-python version in TensorRT-LLM. Fetches the latest releases from GitHub (stable and nightly), compares with the current pinned version, lets the user pick a target version, and updates all version references across the repo. Use when the user wants to bump or upgrade flashinfer.

AI & Machine Learning | nvidia/skills

multi-node-slurm

Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation.

AI & Machine Learning | nvidia/skills

perf-megatron-fsdp

Operational guide for enabling Megatron FSDP in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.

AI & Machine Learning | nvidia/skills

perf-analysis

Performance analysis coordination workflow. Guides profiling delegation, bottleneck classification (compute/memory/launch/communication/sync), and structured report generation. Use when the user asks to analyze performance, profile a workload, check MFU/SOL, or diagnose bottlenecks.

AI & Machine Learning | aliyun/alibabacloud-aiops...

alibabacloud-wxz-website-builder

Use when building or modifying websites with AI Staff (零号员工/万小智) via Alibaba Cloud OpenAPI. Supports conversation creation, async chat with requirement collection, PRD generation, code generation, and incremental SSE event polling.

1 script / Checked

AI & Machine Learning | nvidia/skills

perf-sequence-packing

Validate and use packed sequences and long-context training in Megatron-Bridge, distinguishing offline packed SFT for LLMs from in-batch packing for VLMs, and applying the right CP constraints.