Search Results: model-quantization

Found 10 Skills

AI & Machine Learningmartinholovsky/claude-ski...

model-quantization

Expert skill for AI model quantization and optimization. Covers 4-bit/8-bit quantization, GGUF conversion, memory optimization, and quality-performance tradeoffs for deploying LLMs in resource-constrained JARVIS environments.

🇺🇸|EnglishTranslated

AI & Machine Learninghuggingface/skills

huggingface-local-models

Use to select models to run locally with llama.cpp and GGUF on CPU, Mac Metal, CUDA, or ROCm. Covers finding GGUFs, quant selection, running servers, exact GGUF file lookup, conversion, and OpenAI-compatible local serving.

🇺🇸|EnglishTranslated

AI & Machine Learningdavila7/claude-code-templ...

serving-llms-vllm

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

🇺🇸|EnglishTranslated

AI & Machine Learningthebushidocollective/han

tensorflow-model-deployment

Deploy and serve TensorFlow models

🇺🇸|EnglishTranslated

AI & Machine Learningdavidcastagnetoa/skills

fp16_int8_quantization

Cuantización de modelos ML a FP16/INT8 para reducir memoria y acelerar inferencia en el pipeline KYC

🇺🇸|EnglishTranslated

AI & Machine Learningjakerains/agentskills

onnx-webgpu-converter

Convert HuggingFace transformer models to ONNX format for browser inference with Transformers.js and WebGPU. Use when given a HuggingFace model link to convert to ONNX, when setting up optimum-cli for ONNX export, when quantizing models (fp16, q8, q4) for web deployment, when configuring Transformers.js with WebGPU acceleration, or when troubleshooting ONNX conversion errors. Triggers on mentions of ONNX conversion, Transformers.js, WebGPU inference, optimum export, model quantization for browser, or running ML models in the browser.

🇺🇸|EnglishTranslated

1 scripts/Checked

AI & Machine Learningnvidia/skills

ptq

This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt.

🇺🇸|EnglishTranslated

AI & Machine Learningluongnv89/skills

ollama-optimizer

Optimize Ollama configuration for maximum performance on the current machine. Use when asked to "optimize Ollama", "configure Ollama", "speed up Ollama", "tune LLM performance", "setup local LLM", "fix Ollama performance", "Ollama running slow", or when users want to maximize inference speed, reduce memory usage, or select appropriate models for their hardware. Analyzes system hardware (GPU, RAM, CPU) and provides tailored recommendations.

🇺🇸|EnglishTranslated

2 scripts/Checked

AI & Machine Learningdavila7/claude-code-templ...

awq-quantization

Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.

🇺🇸|EnglishTranslated

AI & Machine Learningkiterlin/intelligent-dete...

tensorrt-llm

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

🇺🇸|EnglishTranslated