Search Results: nvidia-gpu

Found 5 Skills

AI & Machine Learninghuggingface/kernels

cuda-kernels

Provides guidance for writing and benchmarking optimized CUDA kernels for NVIDIA GPUs (H100, A100, T4) targeting HuggingFace diffusers and transformers libraries. Supports models like LTX-Video, Stable Diffusion, LLaMA, Mistral, and Qwen. Includes integration with HuggingFace Kernels Hub (get_kernel) for loading pre-compiled kernels. Includes benchmarking scripts to compare kernel performance against baseline implementations.

🇺🇸|EnglishTranslated

5 scripts/Checked

AI & Machine Learningkiterlin/intelligent-dete...

training-llms-megatron

Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies. Use when training models >1B parameters, need maximum GPU efficiency (47% MFU on H100), or require tensor/pipeline/sequence/context/expert parallelism. Production-ready framework used for Nemotron, LLaMA, DeepSeek.

🇺🇸|EnglishTranslated

AI & Machine Learningpepperu96/hyper-mla

cute-dsl-ref

CuTe Python DSL API reference and implementation patterns for NVIDIA GPU kernel programming. Provides execution model, core API table, key constraints, common patterns, and documentation index. Use when: (1) writing or modifying CuTe DSL kernel code, (2) looking up CuTe DSL API syntax, (3) implementing attention/GEMM/MLA patterns in CuTe DSL, (4) understanding CuTe DSL execution model and compilation pipeline, (5) checking what CuTe DSL can and cannot do.

🇺🇸|EnglishTranslated

AI & Machine Learningvllm-project/vllm-skills

vllm-deploy-docker

Deploy vLLM using Docker (pre-built images or build-from-source) with NVIDIA GPU support and run the OpenAI-compatible server.

🇺🇸|EnglishTranslated

AI & Machine Learningkiterlin/intelligent-dete...

tensorrt-llm

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

🇺🇸|EnglishTranslated