# Model Deployment
Deploy LLMs to production with optimal performance.
## Quick Start

### vLLM Server
```bash
# Install
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000 \
    --tensor-parallel-size 1

# Query (OpenAI-compatible)
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "prompt": "Hello, how are you?", "max_tokens": 100}'
```
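Because the endpoint is OpenAI-compatible, any OpenAI-style client can call it from code as well as curl. A minimal stdlib-only sketch (the helper names and payload defaults here are illustrative, not part of vLLM's API):

```python
import json
import urllib.request

def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 100) -> tuple[str, bytes]:
    """Build the URL and JSON body for an OpenAI-compatible /v1/completions call."""
    url = f"{base_url}/v1/completions"
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return url, json.dumps(payload).encode("utf-8")

def complete(base_url: str, model: str, prompt: str) -> str:
    """POST the request and return the first completion's text."""
    url, body = build_completion_request(base_url, model, prompt)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

# With the server above running:
# complete("http://localhost:8000", "meta-llama/Llama-2-7b-chat-hf", "Hello, how are you?")
```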
### Text Generation Inference (TGI)
```bash
# Docker deployment
docker run --gpus all -p 8080:80 \
    -v ./data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --quantize bitsandbytes-nf4 \
    --max-input-length 4096 \
    --max-total-tokens 8192

# Query
curl http://localhost:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 100}}'
```
### Ollama (Local Deployment)
```bash
# Install and run
curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama2

# API usage
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'
```
## Deployment Options Comparison
| Platform | Ease | Cost | Scale | Latency | Best For |
|---|---|---|---|---|---|
| vLLM | ⭐⭐ | Self-host | High | Low | Production |
| TGI | ⭐⭐ | Self-host | High | Low | HuggingFace ecosystem |
| Ollama | ⭐⭐⭐ | Free | Low | Medium | Local dev |
| OpenAI | ⭐⭐⭐ | Pay-per-token | Very High | Low | Quick start |
| AWS Bedrock | ⭐⭐ | Pay-per-token | Very High | Medium | Enterprise |
| Replicate | ⭐⭐⭐ | Pay-per-second | High | Medium | Prototyping |
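The "Cost" column hides a break-even question: at what monthly token volume does a self-hosted GPU beat a pay-per-token API? A back-of-envelope sketch; all prices below are hypothetical placeholders, substitute your provider's real rates:

```python
def breakeven_tokens_per_month(gpu_cost_per_month: float,
                               api_cost_per_1k_tokens: float) -> float:
    """Token volume at which self-hosting and pay-per-token cost the same."""
    return gpu_cost_per_month / api_cost_per_1k_tokens * 1000

# e.g. a $1200/month GPU server vs. a (made-up) $0.002 per 1K tokens
tokens = breakeven_tokens_per_month(1200.0, 0.002)
print(f"Break-even at {tokens / 1e6:.0f}M tokens/month")
```

Below that volume the managed API is cheaper; above it, self-hosting with vLLM or TGI starts to pay off (ignoring ops cost).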
## FastAPI Inference Server
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

# Load model at startup
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 100
    temperature: float = 0.7
    top_p: float = 0.9

class GenerateResponse(BaseModel):
    text: str
    tokens_used: int

@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
    try:
        inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                top_p=request.top_p,
                do_sample=True,
            )
        new_tokens = len(outputs[0]) - len(inputs.input_ids[0])
        # Decode only the newly generated tokens, not the echoed prompt
        generated = tokenizer.decode(
            outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
        )
        return GenerateResponse(text=generated, tokens_used=new_tokens)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy", "model": model_name}
```
## Docker Deployment

### Dockerfile
```dockerfile
FROM nvidia/cuda:12.1-runtime-ubuntu22.04

WORKDIR /app

# Install Python
RUN apt-get update && apt-get install -y python3 python3-pip

# Install dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Download model at build time (or mount a volume instead)
RUN python3 -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf')"

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
### Docker Compose
```yaml
version: '3.8'

services:
  llm-server:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```

## Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: llm
          image: llm-inference:latest
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
            requests:
              nvidia.com/gpu: 1
              memory: "24Gi"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 30
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm-inference
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer
```

## Optimization Techniques
### Quantization
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"

# 8-bit quantization
config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit quantization (QLoRA-style)
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=config_4bit,
    device_map="auto",
)
```
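To see why quantization matters for deployment, a back-of-envelope estimate of weight memory for a 7B-parameter model at each precision (weights only; activations and KV cache come on top):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB: parameters x bits, converted to bytes."""
    return n_params * bits_per_param / 8 / 1e9

for name, bits in [("fp16", 16), ("8-bit", 8), ("4-bit (nf4)", 4)]:
    print(f"{name:12s} ~{weight_memory_gb(7e9, bits):.1f} GB")
# fp16 ~14 GB, 8-bit ~7 GB, 4-bit ~3.5 GB
```

This is why a 7B model in fp16 needs a 16 GB+ GPU, while the 4-bit config above fits comfortably on consumer cards.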
### Batching
```python
# Dynamic batching with vLLM
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")

# vLLM automatically batches concurrent requests
prompts = ["Prompt 1", "Prompt 2", "Prompt 3"]
sampling = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(prompts, sampling)  # Batched execution
```
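When you are not behind vLLM's scheduler (for example, calling `model.generate` directly as in the FastAPI server above), simple static micro-batching still cuts per-request overhead. A dependency-free sketch of the chunking step (`micro_batches` is an illustrative helper, not a library function):

```python
from typing import Iterator, List

def micro_batches(prompts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield fixed-size chunks of prompts for batched generate() calls."""
    for i in range(0, len(prompts), batch_size):
        yield prompts[i:i + batch_size]

batches = list(micro_batches(["p1", "p2", "p3", "p4", "p5"], 2))
print(batches)  # [['p1', 'p2'], ['p3', 'p4'], ['p5']]
```

Each chunk would then be tokenized together (with padding) and passed to `model.generate` as one batch.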
### KV Cache Optimization
```python
# vLLM with PagedAttention
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.9,
    max_num_batched_tokens=4096,
)
```
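PagedAttention matters because the KV cache dominates GPU memory at long context. A quick sizing sketch, assuming Llama-2-7B's published shape (32 layers, 32 KV heads, head dimension 128) in fp16:

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache bytes per token: 2 (K and V) x layers x heads x head_dim x dtype size."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

per_token = kv_cache_bytes_per_token(32, 32, 128)  # Llama-2-7B, fp16
print(per_token)  # 524288 bytes = 0.5 MiB per token
print(f"{per_token * 4096 / 2**30:.1f} GiB for a 4096-token context")
```

At 0.5 MiB per token, one full 4096-token sequence costs 2 GiB of cache, which is why `gpu_memory_utilization` and paged allocation have such a large effect on achievable batch size.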
## Monitoring
```python
from prometheus_client import Counter, Histogram, start_http_server
import time

REQUEST_COUNT = Counter('inference_requests_total', 'Total requests')
LATENCY = Histogram('inference_latency_seconds', 'Request latency')
TOKENS = Counter('tokens_generated_total', 'Total tokens generated')

# Extends the FastAPI app defined above
@app.middleware("http")
async def metrics_middleware(request, call_next):
    REQUEST_COUNT.inc()
    start = time.time()
    response = await call_next(request)
    LATENCY.observe(time.time() - start)
    return response

# Start metrics server
start_http_server(9090)
```
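Raw latency samples are easier to reason about as percentiles, which is what you would alert on from the histogram above. A dependency-free sketch of a nearest-rank p50/p95/p99 summary:

```python
import math

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile of latency samples (p in [0, 100])."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies = [0.08, 0.11, 0.09, 0.35, 0.10, 0.12, 0.95, 0.10, 0.13, 0.09]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies, p):.2f}s")
```

In practice Prometheus computes these for you via `histogram_quantile` over the `inference_latency_seconds` buckets; the function above just shows what that summary means.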
## Best Practices
- Use quantization: 4-bit for dev, 8-bit for production
- Implement batching: vLLM/TGI handle this automatically
- Monitor everything: Latency, throughput, errors, GPU utilization
- Cache responses: For repeated queries
- Set timeouts: Prevent hung requests
- Load balance: Multiple replicas for high availability
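The "cache responses" item can be as simple as an in-memory TTL map keyed on the prompt. A minimal sketch with no eviction beyond expiry (`TTLCache` and `cached_generate` are illustrative names, not a library API):

```python
import time
from typing import Callable, Optional

class TTLCache:
    """Tiny in-memory cache for repeated identical prompts."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (stored_at, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired: drop and miss
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._store[key] = (time.monotonic(), value)

def cached_generate(cache: TTLCache, prompt: str,
                    generate: Callable[[str], str]) -> str:
    """Serve from cache when possible, otherwise call the model and store."""
    hit = cache.get(prompt)
    if hit is not None:
        return hit
    result = generate(prompt)
    cache.put(prompt, result)
    return result
```

For multi-replica deployments you would back this with a shared store such as Redis instead, but the lookup-then-generate shape stays the same.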
## Error Handling & Retry
```python
from tenacity import retry, stop_after_attempt, wait_exponential
import httpx

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
async def call_inference_api(prompt: str):
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/generate",
            json={"prompt": prompt},
            timeout=30.0,
        )
        response.raise_for_status()  # raise on HTTP errors so the retry fires
        return response.json()
```

## Troubleshooting
| Symptom | Cause | Solution |
|---|---|---|
| OOM on load | Model too large | Use quantization |
| High latency | No batching | Enable vLLM batching |
| Connection refused | Server not started | Check health endpoint |
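The "check health endpoint" fix can be automated at startup: poll `/health` until it answers or a deadline passes. A sketch with the probe injected as a callable so the logic is testable without a live server (`wait_until_healthy` is an illustrative helper):

```python
import time
from typing import Callable

def wait_until_healthy(probe: Callable[[], bool], timeout: float = 120.0,
                       interval: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() every `interval` seconds until it returns True or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        sleep(interval)
    return False

# In production the probe would be e.g.:
#   lambda: httpx.get("http://localhost:8000/health").status_code == 200
```

The generous default timeout matches the Kubernetes `initialDelaySeconds: 120` above: large models genuinely take minutes to load.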
## Unit Test Template
```python
from fastapi.testclient import TestClient
from main import app  # the FastAPI app defined above

client = TestClient(app)

def test_health_endpoint():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"
```