# vllm-deploy-simple: vLLM Simple Deployment
A simple skill to quickly install vLLM, start a server, and validate the OpenAI-compatible API.
## What this skill does
This skill provides a streamlined workflow to:
- Detect hardware backend (NVIDIA CUDA, AMD ROCm, Google TPU, or CPU)
- Install vLLM with appropriate backend support
- Start the vLLM server with configurable model and port
- Test the OpenAI-compatible API endpoint
- Validate the deployment is working correctly
- Support virtual environment isolation
## Prerequisites
- Python 3.10+
- GPU (NVIDIA CUDA, AMD ROCm) (recommended) or TPU or CPU
- pip or uv package manager
- curl (for API testing)
- Virtual environment (optional but recommended)
## Usage
### Create a venv
If the user did not specify a venv path, or asked to deploy in the current environment, create a venv using uv with Python 3.12 in the current folder. If uv is not found, create a folder at that path and use python to create a virtual environment there.
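That fallback logic can be sketched as a short shell snippet. This is illustrative only, not the skill's actual code; `VENV_DIR` is a placeholder name, and it assumes `python3` is on the PATH:

```bash
# Create a venv with uv (Python 3.12) if available, else fall back to python3
VENV_DIR="${1:-.venv}"

if command -v uv >/dev/null 2>&1 && uv venv --python 3.12 "$VENV_DIR" 2>/dev/null; then
    echo "venv created with uv at $VENV_DIR"
else
    # Fallback: stdlib venv with whatever python3 is available
    mkdir -p "$VENV_DIR"
    python3 -m venv "$VENV_DIR"
    echo "venv created with python3 at $VENV_DIR"
fi
```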
### Run the complete workflow (suggested)
If the user did not specify the venv path, model, or port, use the default options:

```bash
# Default deployment options:
# --venv "." --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
scripts/quickstart.sh
```
Or with custom options:

```bash
# Use custom virtual environment
scripts/quickstart.sh --venv /path/to/venv

# Use custom model and port
scripts/quickstart.sh --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000

# Use custom GPU memory utilization
scripts/quickstart.sh --gpu_memory_utilization 0.6

# Combine all options
scripts/quickstart.sh --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
```
This will:

1. Activate the virtual environment (if specified)
2. Detect hardware backend (CUDA/ROCm/TPU/CPU)
3. Install vLLM with appropriate backend support
4. Start the vLLM server in the background
5. Wait for the server to be ready
6. Test the API with a sample request
7. Display the server status
### Run individual commands (for step-by-step usage or troubleshooting)
**Install vLLM:**

```bash
scripts/quickstart.sh install

# Or with virtual environment
scripts/quickstart.sh install --venv /path/to/venv
```
**Start the server:**

```bash
scripts/quickstart.sh start

# Or with custom options
scripts/quickstart.sh start --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
```
**Test the API:**

```bash
scripts/quickstart.sh test

# Or with custom port
scripts/quickstart.sh test --port 8000
```
**Stop the server:**

```bash
scripts/quickstart.sh stop

# Or with virtual environment
scripts/quickstart.sh stop --venv /path/to/venv
```
**Check server status:**

```bash
scripts/quickstart.sh status
```

**Restart the server:**

```bash
scripts/quickstart.sh restart

# Or with custom options
scripts/quickstart.sh restart --venv /path/to/venv --port 8000 --gpu_memory_utilization 0.8
```

## Configuration
The script supports the following command-line options:

```bash
scripts/quickstart.sh [command] [OPTIONS]

Commands:
  install  - Install vLLM and dependencies
  start    - Start the vLLM server
  stop     - Stop the vLLM server
  test     - Test the OpenAI-compatible API
  status   - Show server status
  restart  - Restart the server
  all      - Run complete workflow (default)

Options:
  --model MODEL                    Model to use (default: Qwen/Qwen2.5-1.5B-Instruct)
  --port PORT                      Port to run server on (default: 8000)
  --venv VENV_PATH                 Virtual environment path (default: .)
  --gpu_memory_utilization VRAM    GPU memory utilization (default: 0.8)
```
## Hardware Backend Detection
The script automatically detects your hardware and installs the appropriate vLLM version:

- NVIDIA CUDA: Detected via the `nvidia-smi` command
- AMD ROCm: Detected via the `/dev/kfd` and `/dev/dri` devices
- Google TPU: Detected via the `TPU_NAME` environment variable or the `gcloud` command
- CPU: Fallback if no GPU/TPU detected

For Google TPU, the script installs `vllm-tpu` instead of the standard `vllm` package.
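The detection order described above can be sketched as a small shell function. This is an illustrative sketch, not the skill's actual code; the real quickstart.sh may use additional checks:

```bash
# Pick a backend in the priority order: CUDA, ROCm, TPU, then CPU fallback
detect_backend() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "cuda"
    elif [ -e /dev/kfd ] && [ -d /dev/dri ]; then
        echo "rocm"
    elif [ -n "$TPU_NAME" ] || command -v gcloud >/dev/null 2>&1; then
        echo "tpu"
    else
        echo "cpu"
    fi
}

BACKEND="$(detect_backend)"
echo "Detected backend: $BACKEND"
```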
## API Testing
The test script sends a simple chat completion request:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Say hello!"}],
    "max_tokens": 50
  }'
```
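The reply comes back as OpenAI-style JSON, with the assistant's text at `choices[0].message.content`. A self-contained sketch of extracting it with `python3` (using a canned sample response here, since the model's actual reply varies per run; in practice, pipe curl's output instead):

```bash
# Extract the assistant's reply from a chat-completions response.
# RESPONSE is a canned sample standing in for curl's output.
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"Hello!"}}]}'

REPLY="$(printf '%s' "$RESPONSE" | python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])')"
echo "$REPLY"
```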
## Troubleshooting
**Virtual environment not found:**
- Ensure the path provided with `--venv` exists and is a valid virtual environment
- Check that the activation script exists (`bin/activate` on Linux/macOS or `Scripts/activate` on Windows)
- Check for and install uv, then create a new virtual environment with uv: `uv venv /path/to/venv` (suggested); or with pip: `python3 -m venv /path/to/venv`

**Server won't start:**
- Check if the port is already in use: `lsof -i :8000`
- Verify GPU availability: `nvidia-smi` (for NVIDIA) or `rocm-smi` (for AMD)
- Check the vLLM installation: `python -c "import vllm; print(vllm.__version__)"`
- Review server logs at `$VENV_PATH/tmp/vllm-server.log`

**API returns errors:**
- Wait a few seconds for the model to load
- Check server logs: `cat $VENV_PATH/tmp/vllm-server.log`
- Verify the server is running: `scripts/quickstart.sh status`

**Out of memory:**
- Use a smaller model (e.g., Qwen2.5-0.5B-Instruct)
- Reduce the `--gpu-memory-utilization` parameter
- Close other GPU-intensive applications

**Wrong backend detected:**
- For NVIDIA: Ensure `nvidia-smi` is in your PATH
- For AMD: Check that ROCm drivers are properly installed
- For TPU: Set the `TPU_NAME` environment variable or install `gcloud`
## Notes
- The server runs in the background and logs to `$VENV_PATH/tmp/vllm-server.log`
- The PID is stored in `$VENV_PATH/tmp/vllm-server.pid` for easy management
- First run will download the model (~3GB for Qwen2.5-1.5B-Instruct)
- Subsequent runs will use the cached model
- The script automatically detects and uses `uv` if available, otherwise falls back to `pip`
- Virtual environment support allows isolation from system Python packages
- Arguments can be specified in any order (e.g., `scripts/quickstart.sh --port 8080 start --venv /path/to/venv`)
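Given the log and PID-file conventions above, the server can also be inspected by hand. A minimal sketch, assuming the default `--venv "."` layout:

```bash
# Inspect the server using the PID file described above
VENV_PATH="."
PID_FILE="$VENV_PATH/tmp/vllm-server.pid"
LOG_FILE="$VENV_PATH/tmp/vllm-server.log"

# kill -0 only checks that the process exists; it sends no signal
if [ -f "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null; then
    echo "server running (pid $(cat "$PID_FILE"))"
else
    echo "server not running"
fi

# Follow the log while the model loads:
# tail -f "$LOG_FILE"
```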