Search Results: inference-optimization

Found 13 Skills

AI & Machine Learning | davila7/claude-code-templ...

gptq

Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.

AI & Machine Learning | ancoleman/ai-design-compo...

model-serving

LLM and ML model deployment for inference. Use when serving models in production, building AI APIs, or optimizing inference. Covers vLLM (LLM serving), TensorRT-LLM (GPU optimization), Ollama (local), BentoML (ML deployment), Triton (multi-model), LangChain (orchestration), LlamaIndex (RAG), and streaming patterns.

5 scripts / Attention

AI & Machine Learning | martinholovsky/claude-ski...

llm-integration

Expert skill for integrating local Large Language Models using llama.cpp and Ollama. Covers secure model loading, inference optimization, prompt handling, and protection against LLM-specific vulnerabilities including prompt injection, model theft, and denial of service attacks.

AI & Machine Learning | openai/codex

code-review-context

Model-visible context.

AI & Machine Learning | pluginagentmarketplace/cu...

model-deployment

LLM deployment strategies including vLLM, TGI, and cloud inference endpoints.

1 script / Checked

AI & Machine Learning | ascend-ai-coding/awesome-...

wan-ascend-adaptation

Guidance for adapting Wan-series video generation models (Wan2.1/Wan2.2) from NVIDIA CUDA to Huawei Ascend NPU. Use when migrating DiT-based video diffusion models to NPU, including device layer adaptation, operator replacement, distributed parallelism refactoring, attention optimization, VAE parallelization, and model quantization. Covers 9 major adaptation domains derived from real-world Wan2.2 CUDA-to-Ascend porting experience.

AI & Machine Learning | davila7/claude-code-templ...

knowledge-distillation

Compress large language models using knowledge distillation from teacher to student models. Use when deploying smaller models with retained performance, transferring GPT-4 capabilities to open-source models, or reducing inference costs. Covers temperature scaling, soft targets, reverse KLD, logit distillation, and MiniLLM training strategies.

AI & Machine Learning | davila7/claude-code-templ...

moe-training

Train Mixture of Experts (MoE) models using DeepSpeed or HuggingFace. Use when training large-scale models with limited compute (5× cost reduction vs dense models), implementing sparse architectures like Mixtral 8x7B or DeepSeek-V3, or scaling model capacity without proportional compute increase. Covers MoE architectures, routing mechanisms, load balancing, expert parallelism, and inference optimization.

AI & Machine Learning | davila7/claude-code-templ...

mamba-architecture

State-space model with O(n) complexity vs Transformers' O(n²). 5× faster inference, million-token sequences, no KV cache. Selective SSM with hardware-aware design. Mamba-1 (d_state=16) and Mamba-2 (d_state=128, multi-head). Models 130M-2.8B on HuggingFace.

AI & Machine Learning | davila7/claude-code-templ...

awq-quantization

Activation-aware weight quantization for 4-bit LLM compression with 3× speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.

AI & Machine Learning | vuralserhat86/antigravity...

huggingface_transformers

Hugging Face Transformers best practices including model loading, tokenization, fine-tuning workflows, and inference optimization. Use when working with transformer models, fine-tuning LLMs, implementing NLP tasks, or optimizing transformer inference.

AI & Machine Learning | nvidia/skills

ad-graph-dump

Enable and interpret TensorRT-LLM AutoDeploy FX graph text dumps via AD_DUMP_GRAPHS_DIR. Use when you need before/after graphs per transform, to locate subgraphs, or to confirm a rewrite ran. Paths and behavior are grounded in tensorrt_llm/_torch/auto_deploy (GraphWriter, BaseTransform). Complements ad-add-fusion-transformation.
