Search Results: vllm

Found 44 Skills

vllm-studio-backend

Use when working on vLLM Studio backend architecture (controller runtime, Pi-mono agent loop, OpenAI-compatible endpoints, LiteLLM gateway, inference process, and debugging commands).

🇺🇸 | English | Translated

AI & Machine Learning | winsorllc/upgraded-carniv...

local-llm-provider

Connect to local LLM endpoints (Ollama, llama.cpp, vLLM) with automatic provider fallback. Use when: (1) you need to run LLM inference locally for privacy/cost, (2) you want to use models not available via cloud APIs, (3) you need offline capability, (4) you want automatic fallback to cloud providers when local fails.

🇺🇸 | English | Translated

2 scripts / Checked

AI & Machine Learning | vllm-project/vllm-skills

vllm-deploy-simple

Quickly install and deploy vLLM, start serving a simple LLM, and test the OpenAI-compatible API.

🇺🇸 | English | Translated

1 script / Attention

AI & Machine Learning | vllm-project/vllm-skills

vllm-deploy-docker

Deploy vLLM using Docker (pre-built images or build-from-source) with NVIDIA GPU support and run the OpenAI-compatible server.

🇺🇸 | English | Translated

AI & Machine Learning | vllm-project/vllm-skills

vllm-bench-serve

Benchmark vLLM or OpenAI-compatible serving endpoints using vllm bench serve. Supports multiple datasets (random, sharegpt, sonnet, HF), backends (openai, openai-chat, vllm-pooling, embeddings), throughput/latency testing with request-rate control, and result saving. Use when benchmarking LLM serving performance, measuring TTFT/TPOT, or load testing inference APIs.

🇺🇸 | English | Translated

AI & Machine Learning | nvidia/skills

deployment

Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. Do NOT use for quantizing models (use ptq) or evaluating accuracy (use evaluation).

🇺🇸 | English | Translated

1 script / Attention

AI & Machine Learning | sickn33/antigravity-aweso...

local-llm-expert

Master local LLM inference, model selection, VRAM optimization, and local deployment using Ollama, llama.cpp, vLLM, and LM Studio. Expert in quantization formats (GGUF, EXL2) and local AI privacy.

🇺🇸 | English | Translated

AI & Machine Learning | ancoleman/ai-design-compo...

model-serving

LLM and ML model deployment for inference. Use when serving models in production, building AI APIs, or optimizing inference. Covers vLLM (LLM serving), TensorRT-LLM (GPU optimization), Ollama (local), BentoML (ML deployment), Triton (multi-model), LangChain (orchestration), LlamaIndex (RAG), and streaming patterns.

🇺🇸 | English | Translated

5 scripts / Attention

AI & Machine Learning | davila7/claude-code-templ...

outlines

Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines, dottxt.ai's structured generation library.

🇺🇸 | English | Translated

Code Quality | ascend/agent-skills

python-refactoring

Python code refactoring skill covering code smell identification, design pattern application, readability improvement, and practical experience. Use when users request "refactor code", "refactor", "code optimization", "improve code quality", "code smell review", "apply design patterns", or "enhance readability", or when they submit code review requests. It can generate a structured refactoring document once refactoring is complete ("output refactoring document", "generate refactoring report"). It includes practical patterns extracted from 20+ real refactoring PRs in the vllm-ascend repository.

🇨🇳 | Chinese | Translated

AI & Machine Learning | orchestra-research/ai-res...

llamaguard

Meta's 7-8B specialized moderation model for LLM input/output filtering. Six safety categories: violence/hate, sexual content, weapons, substances, self-harm, criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, or SageMaker. Integrates with NeMo Guardrails.

🇺🇸 | English | Translated

Documentation & Writing | bbuf/sglang-auto-driven-s...

model-pr-diff-dossier

Use when creating or revising model PR optimization history documents for SGLang, vLLM, or another serving framework that cite GitHub PRs. Requires manual, per-PR source-diff review and documentation of motivation, key implementation approach, most important code excerpts, reviewed files, and validation implications instead of generated or one-line summaries.

🇺🇸 | English | Translated