serving-llms-vllm
vLLM - High-Performance LLM Serving
Quick start
vLLM achieves up to 24x higher throughput than standard HuggingFace Transformers through PagedAttention (a block-based KV cache) and continuous batching (mixing prefill and decode requests in the same batch).
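The idea behind PagedAttention can be sketched with a toy block allocator (illustrative only, not vLLM's actual code): the KV cache is divided into fixed-size blocks, each sequence keeps a block table instead of one contiguous max-length slab, and blocks from finished sequences return to a shared pool for reuse.

```python
# Toy sketch of block-based KV cache allocation (the PagedAttention idea).
# Block size and pool size here are illustrative.

BLOCK_SIZE = 16  # tokens per KV cache block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, position):
        """Allocate a new block only when a sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:  # first token of a new block
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_SIZE]

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=64)
for pos in range(40):  # a 40-token sequence
    alloc.append_token("seq0", pos)
print(len(alloc.block_tables["seq0"]))  # 3 blocks (ceil(40/16)), not a max-len slab
alloc.free("seq0")
print(len(alloc.free_blocks))  # all 64 blocks back in the pool
```

Because memory is allocated per block on demand, many concurrent sequences can share one GPU without reserving worst-case context length for each.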
Installation:
```bash
pip install vllm
```

Basic offline inference:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
```

OpenAI-compatible server:
```bash
vllm serve meta-llama/Llama-3-8B-Instruct
```

Query with OpenAI SDK
```bash
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
    model='meta-llama/Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"
```
Common workflows
Workflow 1: Production API deployment
Copy this checklist and track progress:
Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics

**Step 1: Configure server settings**
Choose configuration based on your model size:
```bash
# For 7B-13B models on a single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --port 8000
```
```bash
# For 30B-70B models with tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --quantization awq \
  --port 8000
```
```bash
# For production with caching and metrics
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching \
  --enable-metrics \
  --metrics-port 9090 \
  --port 8000 \
  --host 0.0.0.0
```
**Step 2: Test with limited traffic**
Run load test before production:
```bash
# Install load testing tool
pip install locust

# Create test_load.py with sample requests, then run:
locust -f test_load.py --host http://localhost:8000
```
Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
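These targets are easiest to check against recorded per-request timings. A small helper sketch (the numbers in the example are synthetic; feed it the TTFT samples and wall-clock duration your load test collects):

```python
def check_slo(ttft_samples, total_requests, duration_s,
              ttft_target=0.5, throughput_target=100):
    """Check p95 TTFT and throughput against deployment targets."""
    ranked = sorted(ttft_samples)
    p95 = ranked[int(0.95 * (len(ranked) - 1))]  # nearest-rank p95
    throughput = total_requests / duration_s
    return {
        "p95_ttft_s": p95,
        "throughput_rps": throughput,
        "pass": p95 < ttft_target and throughput > throughput_target,
    }

# Example with synthetic measurements:
result = check_slo([0.12, 0.2, 0.31, 0.45, 0.48],
                   total_requests=1200, duration_s=10)
print(result["pass"])  # True: p95 TTFT 0.45s < 0.5s and 120 req/s > 100
```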
**Step 3: Enable monitoring**
vLLM exposes Prometheus metrics on port 9090:
```bash
curl http://localhost:9090/metrics | grep vllm
```

Key metrics to monitor:
- `vllm:time_to_first_token_seconds` - Latency
- `vllm:num_requests_running` - Active requests
- `vllm:gpu_cache_usage_perc` - KV cache utilization
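The endpoint returns Prometheus text format, which is simple to scrape ad hoc. A minimal parser sketch (the sample payload is illustrative; real output also includes histogram `_bucket` series that this sketch ignores):

```python
def parse_prom(text, prefix="vllm:"):
    """Extract simple gauge/counter samples from Prometheus text format."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.startswith(prefix):
            continue  # skip HELP/TYPE comments and non-vLLM metrics
        name, _, value = line.rpartition(" ")
        # Strip any {label="..."} clause from the metric name
        metrics[name.split("{")[0]] = float(value)
    return metrics

sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running 12
vllm:gpu_cache_usage_perc 0.73
"""
print(parse_prom(sample))
```

In production you would point Prometheus at the endpoint directly; a parser like this is handy for quick shell-level health checks.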
**Step 4: Deploy to production**
Use Docker for consistent deployment:
```bash
# Run vLLM in Docker
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching
```
**Step 5: Verify performance metrics**
Check that deployment meets targets:
- TTFT < 500ms (for short prompts)
- Throughput > target req/sec
- GPU utilization > 80%
- No OOM errors in logs

Workflow 2: Offline batch inference
For processing large datasets without server overhead.
Copy this checklist:
Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results

**Step 1: Prepare input data**
```python
# Load prompts from file
prompts = []
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f]
print(f"Loaded {len(prompts)} prompts")
```
**Step 2: Configure LLM engine**
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,  # Use 2 GPUs
    gpu_memory_utilization=0.9,
    max_model_len=4096
)
sampling = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "\n\n"]
)
```

**Step 3: Run batch inference**
vLLM automatically batches requests for efficiency:
```python
# Process all prompts in one call - vLLM handles batching internally,
# so there is no need to manually chunk the prompt list
outputs = llm.generate(prompts, sampling)
```
**Step 4: Process results**
```python
# Extract generated text
results = []
for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    results.append({
        "prompt": prompt,
        "generated": generated,
        "tokens": len(output.outputs[0].token_ids)
    })

# Save to file
import json
with open("results.jsonl", "w") as f:
    for result in results:
        f.write(json.dumps(result) + "\n")
print(f"Processed {len(results)} prompts")
```

Workflow 3: Quantized model serving
Fit large models in limited GPU memory.
Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy

**Step 1: Choose quantization method**
- AWQ: Best for 70B models, minimal accuracy loss
- GPTQ: Wide model support, good compression
- FP8: Fastest on H100 GPUs
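The savings follow directly from bits per weight. A back-of-the-envelope sketch (weights only, ignoring KV cache and runtime overhead):

```python
def weight_memory_gb(params_b, bits_per_weight):
    """Approximate weight storage for a model with params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8/FP8", 8), ("AWQ/GPTQ 4-bit", 4)]:
    print(f"70B @ {name}: ~{weight_memory_gb(70, bits):.0f} GB")
# FP16 needs ~140 GB for a 70B model; 4-bit quantization cuts that to ~35 GB,
# which is why a quantized 70B model fits in ~40GB VRAM (weights plus overhead)
```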
**Step 2: Find or create quantized model**
Use pre-quantized models from HuggingFace:
```bash
# Search for AWQ models on the HuggingFace Hub
# Example: TheBloke/Llama-2-70B-AWQ
```
**Step 3: Launch with quantization flag**
```bash
# Using pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95
```

Result: the 70B model runs in ~40GB VRAM.
**Step 4: Verify accuracy**
Test outputs match expected quality:
```python
# Compare quantized vs non-quantized responses on a held-out prompt set
# Verify task-specific performance is unchanged
```

When to use vs alternatives
Use vLLM when:
- Deploying production LLM APIs (100+ req/sec)
- Serving OpenAI-compatible endpoints
- Limited GPU memory but need large models
- Multi-user applications (chatbots, assistants)
- Need low latency with high throughput
Use alternatives instead:
- llama.cpp: CPU/edge inference, single-user
- HuggingFace transformers: Research, prototyping, one-off generation
- TensorRT-LLM: NVIDIA-only, need absolute maximum performance
- Text-Generation-Inference: Already in HuggingFace ecosystem
Common issues
Issue: Out of memory during model loading
Reduce memory usage:
```bash
vllm serve MODEL \
  --gpu-memory-utilization 0.7 \
  --max-model-len 4096
```

Or use quantization:
```bash
vllm serve MODEL --quantization awq
```

Issue: Slow first token (TTFT > 1 second)
Enable prefix caching for repeated prompts:
```bash
vllm serve MODEL --enable-prefix-caching
```

For long prompts, enable chunked prefill:
```bash
vllm serve MODEL --enable-chunked-prefill
```

Issue: Model not found error
Use `--trust-remote-code` for custom models:

```bash
vllm serve MODEL --trust-remote-code
```

Issue: Low throughput (<50 req/sec)
Increase concurrent sequences:
```bash
vllm serve MODEL --max-num-seqs 512
```

Check GPU utilization with `nvidia-smi` - it should be >80%.

Issue: Inference slower than expected
Verify that the tensor-parallel size is a power of 2:
```bash
vllm serve MODEL --tensor-parallel-size 4  # Not 3
```

Enable speculative decoding for faster generation:

```bash
vllm serve MODEL --speculative-model DRAFT_MODEL
```
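The gain from speculative decoding depends on how often the target model accepts the draft model's tokens. Under a simple geometric acceptance model (a standard back-of-the-envelope estimate, not vLLM's internal accounting), the expected number of tokens emitted per target-model forward pass with k draft tokens and per-token acceptance rate a is (1 - a^(k+1)) / (1 - a):

```python
def expected_tokens_per_step(accept_rate, k):
    """Expected tokens emitted per target-model forward pass
    with k speculative draft tokens (geometric acceptance model)."""
    if accept_rate == 1.0:
        return k + 1  # limit of the geometric series
    return (1 - accept_rate ** (k + 1)) / (1 - accept_rate)

# With a well-matched draft model (80% acceptance) and 4 draft tokens,
# each target-model pass yields ~3.4 tokens instead of 1:
print(round(expected_tokens_per_step(0.8, 4), 2))
```

A poorly matched draft model pushes the acceptance rate down and can erase the gain, which is why the draft model choice matters.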
Advanced topics
Server deployment patterns: See references/server-deployment.md for Docker, Kubernetes, and load balancing configurations.
Performance optimization: See references/optimization.md for PagedAttention tuning, continuous batching details, and benchmark results.
Quantization guide: See references/quantization.md for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.
Troubleshooting: See references/troubleshooting.md for detailed error messages, debugging steps, and performance diagnostics.
Hardware requirements
- Small models (7B-13B): 1x A10 (24GB) or A100 (40GB)
- Medium models (30B-40B): 2x A100 (40GB) with tensor parallelism
- Large models (70B+): 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ
Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs
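These sizings follow from weights plus KV cache. A rough estimator sketch (assumes FP16 weights and KV cache and ignores activation overhead; the Llama-style shape parameters in the example are assumed values for illustration):

```python
def serving_memory_gb(params_b, layers, kv_heads, head_dim,
                      seq_len, batch, bytes_per=2):
    """Rough GPU memory need: FP16 weights + KV cache.
    The leading 2 in the KV term counts keys and values."""
    weights = params_b * 1e9 * bytes_per
    kv = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per
    return (weights + kv) / 1e9

# An 8B Llama-style model (32 layers, 8 KV heads, head_dim 128, assumed)
# serving 16 concurrent sequences at 8192-token context:
print(round(serving_memory_gb(8, 32, 8, 128, seq_len=8192, batch=16), 1))
# ~16 GB of weights plus ~17 GB of KV cache - comfortable on a 40GB A100,
# tight on a 24GB A10 unless you lower max-model-len or concurrency
```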
Resources
- Official docs: https://docs.vllm.ai
- GitHub: https://github.com/vllm-project/vllm
- Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
- Community: https://discuss.vllm.ai