vllm-deploy-simple

# vLLM Simple Deployment

A simple skill to quickly install vLLM, start a server, and validate the OpenAI-compatible API.

## What this skill does

This skill provides a streamlined workflow to:

- Detect the hardware backend (NVIDIA CUDA, AMD ROCm, Google TPU, or CPU)
- Install vLLM with the appropriate backend support
- Start the vLLM server with a configurable model and port
- Test the OpenAI-compatible API endpoint
- Validate that the deployment is working correctly
- Support virtual environment isolation

## Prerequisites

- Python 3.10+
- GPU (NVIDIA CUDA or AMD ROCm, recommended), TPU, or CPU
- pip or uv package manager
- curl (for API testing)
- Virtual environment (optional but recommended)

## Usage

### Create a venv

If the user did not specify a venv path, or asked to deploy in the current environment, create a venv with uv (Python 3.12) in the current folder. If uv is not found, create the folder at that path and use python to create the virtual environment.
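That fallback logic can be sketched as below. This is an illustrative sketch, not the exact code in `scripts/quickstart.sh`; the `.venv` directory name and the `VENV_PATH` variable are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: create a venv with uv if available (and Python 3.12 obtainable),
# otherwise fall back to the stdlib venv module.
VENV_PATH="${1:-.}"

if command -v uv >/dev/null 2>&1 && uv venv --python 3.12 "$VENV_PATH/.venv"; then
  : # uv created the venv
else
  mkdir -p "$VENV_PATH"
  python3 -m venv "$VENV_PATH/.venv"
fi
echo "venv ready at $VENV_PATH/.venv"
```

Chaining the commands with `&&`/`||` keeps the fallback working even when uv is installed but cannot provide Python 3.12.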

### Run the complete workflow (suggested)

If the user did not specify the venv path, model, or port, use the default options:

```bash
# Default deployment options (--venv "." --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8)
scripts/quickstart.sh
```

Or with custom options:

```bash
# Use a custom virtual environment
scripts/quickstart.sh --venv /path/to/venv

# Use a custom model and port
scripts/quickstart.sh --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000

# Use a custom GPU memory utilization
scripts/quickstart.sh --gpu_memory_utilization 0.6

# Combine all options
scripts/quickstart.sh --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
```

This will:
1. Activate the virtual environment (if specified)
2. Detect hardware backend (CUDA/ROCm/TPU/CPU)
3. Install vLLM with appropriate backend support
4. Start the vLLM server in the background
5. Wait for the server to be ready
6. Test the API with a sample request
7. Display the server status
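Step 5 of this workflow (waiting for the server to be ready) is worth sketching, since it is the step most deployments get wrong. The sketch below assumes the server answers HTTP requests once it is up; a local `python3 -m http.server` stands in for the vLLM server so the example is self-contained. The `wait_for_url` helper name is illustrative, not from `quickstart.sh`.

```bash
#!/usr/bin/env bash
# Sketch: poll a URL until it responds, or give up after a timeout.

wait_for_url() {                       # wait_for_url URL TIMEOUT_SECONDS
  local url=$1 timeout=${2:-60} waited=0
  until curl -sf "$url" >/dev/null 2>&1; do
    sleep 1
    waited=$((waited + 1))
    if [ "$waited" -ge "$timeout" ]; then return 1; fi
  done
}

# Stand-in for the vLLM server:
python3 -m http.server 8123 --bind 127.0.0.1 >/dev/null 2>&1 &
SERVER_PID=$!

if wait_for_url http://127.0.0.1:8123 15; then STATUS=ready; else STATUS=timeout; fi
kill "$SERVER_PID" 2>/dev/null
echo "server is $STATUS"
```

Against a real vLLM deployment you would poll the server's own URL on the configured port instead of the stand-in.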

### Run individual commands (for step-by-step usage or troubleshooting)

**Install vLLM:**
```bash
scripts/quickstart.sh install

# Or with a virtual environment
scripts/quickstart.sh install --venv /path/to/venv
```

**Start the server:**
```bash
scripts/quickstart.sh start

# Or with custom options
scripts/quickstart.sh start --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
```

**Test the API:**
```bash
scripts/quickstart.sh test

# Or with a custom port
scripts/quickstart.sh test --port 8000
```

**Stop the server:**
```bash
scripts/quickstart.sh stop

# Or with a virtual environment
scripts/quickstart.sh stop --venv /path/to/venv
```

**Check server status:**
```bash
scripts/quickstart.sh status
```

**Restart the server:**
```bash
scripts/quickstart.sh restart

# Or with custom options
scripts/quickstart.sh restart --venv /path/to/venv --port 8000 --gpu_memory_utilization 0.8
```

## Configuration

The script supports the following command-line options:

```bash
scripts/quickstart.sh [command] [OPTIONS]

Commands:
  install  - Install vLLM and dependencies
  start    - Start the vLLM server
  stop     - Stop the vLLM server
  test     - Test the OpenAI-compatible API
  status   - Show server status
  restart  - Restart the server
  all      - Run complete workflow (default)

Options:
  --model MODEL                 Model to use (default: Qwen/Qwen2.5-1.5B-Instruct)
  --port PORT                   Port to run the server on (default: 8000)
  --venv VENV_PATH              Virtual environment path (default: .)
  --gpu_memory_utilization VRAM GPU memory utilization (default: 0.8)
```
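Since the command word and options may be given in any order, the parsing can be sketched as a single loop over the arguments. This is an illustrative sketch with hypothetical variable and function names, not the actual code in `scripts/quickstart.sh`:

```bash
#!/usr/bin/env bash
# Sketch: order-independent parsing of a command word plus --key value options.
parse_and_echo() {
  local COMMAND="all" MODEL="Qwen/Qwen2.5-1.5B-Instruct" PORT=8000 VENV="." GPU_UTIL=0.8
  while [ $# -gt 0 ]; do
    case "$1" in
      --model) MODEL="$2"; shift 2 ;;
      --port) PORT="$2"; shift 2 ;;
      --venv) VENV="$2"; shift 2 ;;
      --gpu_memory_utilization) GPU_UTIL="$2"; shift 2 ;;
      install|start|stop|test|status|restart|all) COMMAND="$1"; shift ;;
      *) echo "unknown argument: $1" >&2; return 1 ;;
    esac
  done
  echo "$COMMAND $MODEL $PORT $VENV $GPU_UTIL"
}

# Options and command may appear in any order:
parse_and_echo --port 8080 start --venv /path/to/venv
```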

## Hardware Backend Detection

The script automatically detects your hardware and installs the appropriate vLLM version:

- **NVIDIA CUDA**: Detected via the `nvidia-smi` command
- **AMD ROCm**: Detected via the `/dev/kfd` and `/dev/dri` devices
- **Google TPU**: Detected via the `TPU_NAME` environment variable or the `gcloud` command
- **CPU**: Fallback if no GPU/TPU is detected

For Google TPU, the script installs `vllm-tpu` instead of the standard `vllm` package.
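The detection order above can be sketched as a chain of checks. The exact logic in `scripts/quickstart.sh` may differ; this mirrors only what is documented, and the `detect_backend` name is illustrative:

```bash
#!/usr/bin/env bash
# Sketch: probe backends in priority order, fall back to CPU.
detect_backend() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    echo cuda                                  # NVIDIA driver responds
  elif [ -e /dev/kfd ] && [ -e /dev/dri ]; then
    echo rocm                                  # ROCm kernel devices present
  elif [ -n "${TPU_NAME:-}" ] || command -v gcloud >/dev/null 2>&1; then
    echo tpu                                   # TPU env var or gcloud installed
  else
    echo cpu                                   # nothing else detected
  fi
}

BACKEND=$(detect_backend)
echo "detected backend: $BACKEND"
```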

## API Testing

The test script sends a simple chat completion request:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Say hello!"}],
    "max_tokens": 50
  }'
```
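In the JSON response, the assistant's reply is at `choices[0].message.content`. A small helper can extract it with `python3` (avoiding a `jq` dependency); the `extract_reply` name is illustrative, and a canned response stands in here for live curl output:

```bash
#!/usr/bin/env bash
# extract_reply: read a chat-completion JSON response on stdin, print the reply text.
extract_reply() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Canned response standing in for the curl output above:
response='{"choices":[{"message":{"role":"assistant","content":"Hello there!"}}]}'
printf '%s' "$response" | extract_reply
```

In real use, pipe the `curl -s` output from the request above straight into the helper.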

## Troubleshooting

**Virtual environment not found:**

- Ensure the path provided with `--venv` exists and is a valid virtual environment
- Check that the activation script exists (`bin/activate` on Linux/macOS or `Scripts/activate` on Windows)
- Install uv and create a new virtual environment with `uv venv /path/to/venv` (suggested); or use Python's built-in venv module: `python3 -m venv /path/to/venv`

**Server won't start:**

- Check if the port is already in use: `lsof -i :8000`
- Verify GPU availability: `nvidia-smi` (NVIDIA) or `rocm-smi` (AMD)
- Check the vLLM installation: `python -c "import vllm; print(vllm.__version__)"`
- Review the server logs at `$VENV_PATH/tmp/vllm-server.log`

**API returns errors:**

- Wait a few seconds for the model to load
- Check the server logs: `cat $VENV_PATH/tmp/vllm-server.log`
- Verify the server is running: `scripts/quickstart.sh status`

**Out of memory:**

- Use a smaller model (e.g., Qwen2.5-0.5B-Instruct)
- Reduce the `--gpu_memory_utilization` parameter
- Close other GPU-intensive applications

**Wrong backend detected:**

- For NVIDIA: Ensure `nvidia-smi` is in your PATH
- For AMD: Check that the ROCm drivers are properly installed
- For TPU: Set the `TPU_NAME` environment variable or install `gcloud`

## Notes

- The server runs in the background and logs to `$VENV_PATH/tmp/vllm-server.log`
- The PID is stored in `$VENV_PATH/tmp/vllm-server.pid` for easy management
- The first run will download the model (~3 GB for Qwen2.5-1.5B-Instruct)
- Subsequent runs will use the cached model
- The script automatically detects and uses `uv` if available, otherwise falls back to `pip`
- Virtual environment support allows isolation from system Python packages
- Arguments can be specified in any order (e.g., `scripts/quickstart.sh --port 8080 start --venv /path/to/venv`)
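The PID-file convention makes `stop` idempotent: a second stop is a no-op rather than an error. A sketch of stop-by-PID-file, demonstrated against a background `sleep` standing in for the server (the `stop_server` name is illustrative; the `tmp/` layout mirrors the notes above):

```bash
#!/usr/bin/env bash
# stop_server: kill the process recorded in the PID file, if it is still alive.
stop_server() {
  local pid_file=$1
  if [ -f "$pid_file" ] && kill -0 "$(cat "$pid_file")" 2>/dev/null; then
    kill "$(cat "$pid_file")"      # process exists: terminate it
    rm -f "$pid_file"              # and forget its PID
    echo "stopped"
  else
    echo "not running"             # no PID file, or process already gone
  fi
}

# Demo: a background sleep stands in for the vLLM server.
mkdir -p tmp
sleep 60 & echo $! > tmp/vllm-server.pid
FIRST=$(stop_server tmp/vllm-server.pid)     # terminates the stand-in
SECOND=$(stop_server tmp/vllm-server.pid)    # PID file is gone now
echo "$FIRST / $SECOND"
```

The `kill -0` probe checks liveness without sending a signal, which guards against stale PID files left by a crashed server.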