Search Results: quantization

Found 33 Skills

AI & Machine Learningmartinholovsky/claude-ski...

model-quantization

Expert skill for AI model quantization and optimization. Covers 4-bit/8-bit quantization, GGUF conversion, memory optimization, and quality-performance tradeoffs for deploying LLMs in resource-constrained JARVIS environments.

🇺🇸|EnglishTranslated

AI & Machine Learningomer-metin/skills-for-ant...

model-optimization

Use when reducing model size, improving inference speed, or deploying to edge devices - covers quantization, pruning, knowledge distillation, ONNX export, and TensorRT optimizationUse when ", " mentioned.

🇺🇸|EnglishTranslated

AI & Machine Learningdoanchienthangdev/omgkit

efficient-ai

Efficient AI techniques including model compression, quantization, pruning, knowledge distillation, and hardware-aware optimization for production systems.

🇺🇸|EnglishTranslated

AI & Machine Learningwshobson/agents

vector-index-tuning

Optimize vector index performance for latency, recall, and memory. Use when tuning HNSW parameters, selecting quantization strategies, or scaling vector search infrastructure.

🇺🇸|EnglishTranslated

AI & Machine Learningnvidia/skills

tao-train-rtdetr

RT-DETR (Real-Time DEtection TRansformer) for 2D object detection. Designed for real-time inference with competitive accuracy and supports distillation and quantization for deployment optimization. Use when training, evaluating, distilling, quantizing, exporting, or running inference for a TAO RT-DETR model. Trigger phrases include "train RT-DETR", "real-time DETR", "low-latency object detection", "RT-DETR distillation / quantization".

🇺🇸|EnglishTranslated

AI & Machine Learningsickn33/antigravity-aweso...

local-llm-expert

Master local LLM inference, model selection, VRAM optimization, and local deployment using Ollama, llama.cpp, vLLM, and LM Studio. Expert in quantization formats (GGUF, EXL2) and local AI privacy.

🇺🇸|EnglishTranslated

AI & Machine Learningdavila7/claude-code-templ...

gptq

Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.

🇺🇸|EnglishTranslated

AI & Machine Learningdavila7/claude-code-templ...

hqq-quantization

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

🇺🇸|EnglishTranslated

AI & Machine Learningdavila7/claude-code-templ...

llama-cpp

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

🇺🇸|EnglishTranslated

AI & Machine Learningtimescale/pg-aiguide

pgvector-semantic-search

Use this skill for setting up vector similarity search with pgvector for AI/ML embeddings, RAG applications, or semantic search. **Trigger when user asks to:** - Store or search vector embeddings in PostgreSQL - Set up semantic search, similarity search, or nearest neighbor search - Create HNSW or IVFFlat indexes for vectors - Implement RAG (Retrieval Augmented Generation) with PostgreSQL - Optimize pgvector performance, recall, or memory usage - Use binary quantization for large vector datasets **Keywords:** pgvector, embeddings, semantic search, vector similarity, HNSW, IVFFlat, halfvec, cosine distance, nearest neighbor, RAG, LLM, AI search Covers: halfvec storage, HNSW index configuration (m, ef_construction, ef_search), quantization strategies, filtered search, bulk loading, and performance tuning.

🇺🇸|EnglishTranslated

AI & Machine Learningbagelhole/devops-security...

llm-cost-optimization

Reduce LLM API and infrastructure costs through model selection, prompt caching, batching, caching, quantization, and self-hosting strategies. Track spend by team and model, set budgets, and implement cost-aware routing.

🇺🇸|EnglishTranslated

AI & Machine Learningitsmostafa/llm-engineerin...

qlora

Memory-efficient fine-tuning with 4-bit quantization and LoRA adapters. Use when fine-tuning large models (7B+) on consumer GPUs, when VRAM is limited, or when standard LoRA still exceeds memory. Builds on the lora skill.

🇺🇸|EnglishTranslated