# vllm-deploy-simple: vLLM Simple Deployment
A simple skill to quickly install vLLM, start a server, and validate the OpenAI-compatible API.
## What this skill does
This skill provides a streamlined workflow to:
- Detect hardware backend (NVIDIA CUDA, AMD ROCm, Google TPU, or CPU)
- Install vLLM with appropriate backend support
- Start the vLLM server with configurable model and port
- Test the OpenAI-compatible API endpoint
- Validate the deployment is working correctly
- Support virtual environment isolation
## Prerequisites
- Python 3.10+
- GPU (NVIDIA CUDA, AMD ROCm) (recommended) or TPU or CPU
- pip or uv package manager
- curl (for API testing)
- Virtual environment (optional but recommended)
## Usage
### Create a venv
If the user did not specify a venv path, or asked to deploy in the current environment, create a venv using uv with Python 3.12 in the current folder. If uv is not found, create a folder at that path and use python to create a virtual environment there.
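That fallback logic can be sketched as a short shell snippet. This is illustrative only, not the skill's actual code; `VENV_DIR` is a placeholder name, and it assumes `python3` is on the PATH:

```bash
# Create a venv with uv (Python 3.12) if available, else fall back to python3
VENV_DIR="${1:-.venv}"

if command -v uv >/dev/null 2>&1 && uv venv --python 3.12 "$VENV_DIR" 2>/dev/null; then
    echo "venv created with uv at $VENV_DIR"
else
    # Fallback: stdlib venv with whatever python3 is available
    mkdir -p "$VENV_DIR"
    python3 -m venv "$VENV_DIR"
    echo "venv created with python3 at $VENV_DIR"
fi
```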
### Run the complete workflow (suggested)
If the user did not specify the venv path, model, or port, use the default options:

```bash
# Default deployment options:
# --venv "." --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
scripts/quickstart.sh
```
Or with custom options:

```bash
# Use custom virtual environment
scripts/quickstart.sh --venv /path/to/venv

# Use custom model and port
scripts/quickstart.sh --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000

# Use custom GPU memory utilization
scripts/quickstart.sh --gpu_memory_utilization 0.6

# Combine all options
scripts/quickstart.sh --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
```
This will:

1. Activate the virtual environment (if specified)
2. Detect hardware backend (CUDA/ROCm/TPU/CPU)
3. Install vLLM with appropriate backend support
4. Start the vLLM server in the background
5. Wait for the server to be ready
6. Test the API with a sample request
7. Display the server status
### Run individual commands (for step-by-step usage or troubleshooting)
**Install vLLM:**

```bash
scripts/quickstart.sh install

# Or with virtual environment
scripts/quickstart.sh install --venv /path/to/venv
```
**Start the server:**

```bash
scripts/quickstart.sh start

# Or with custom options
scripts/quickstart.sh start --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
```
**Test the API:**

```bash
scripts/quickstart.sh test

# Or with custom port
scripts/quickstart.sh test --port 8000
```
**Stop the server:**

```bash
scripts/quickstart.sh stop

# Or with virtual environment
scripts/quickstart.sh stop --venv /path/to/venv
```
**Check server status:**

```bash
scripts/quickstart.sh status
```

**Restart the server:**

```bash
scripts/quickstart.sh restart

# Or with custom options
scripts/quickstart.sh restart --venv /path/to/venv --port 8000 --gpu_memory_utilization 0.8
```

## Configuration
The script supports the following command-line options:

```bash
scripts/quickstart.sh [command] [OPTIONS]

Commands:
  install  - Install vLLM and dependencies
  start    - Start the vLLM server
  stop     - Stop the vLLM server
  test     - Test the OpenAI-compatible API
  status   - Show server status
  restart  - Restart the server
  all      - Run complete workflow (default)

Options:
  --model MODEL                    Model to use (default: Qwen/Qwen2.5-1.5B-Instruct)
  --port PORT                      Port to run server on (default: 8000)
  --venv VENV_PATH                 Virtual environment path (default: .)
  --gpu_memory_utilization VRAM    GPU memory utilization (default: 0.8)
```
## Hardware Backend Detection
The script automatically detects your hardware and installs the appropriate vLLM version:

- NVIDIA CUDA: Detected via the `nvidia-smi` command
- AMD ROCm: Detected via the `/dev/kfd` and `/dev/dri` devices
- Google TPU: Detected via the `TPU_NAME` environment variable or the `gcloud` command
- CPU: Fallback if no GPU/TPU detected

For Google TPU, the script installs `vllm-tpu` instead of the standard `vllm` package.
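The detection order described above can be sketched as a small shell function. This is an illustrative sketch, not the skill's actual code; the real quickstart.sh may use additional checks:

```bash
# Pick a backend in the priority order: CUDA, ROCm, TPU, then CPU fallback
detect_backend() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "cuda"
    elif [ -e /dev/kfd ] && [ -d /dev/dri ]; then
        echo "rocm"
    elif [ -n "$TPU_NAME" ] || command -v gcloud >/dev/null 2>&1; then
        echo "tpu"
    else
        echo "cpu"
    fi
}

BACKEND="$(detect_backend)"
echo "Detected backend: $BACKEND"
```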
## API Testing
The test script sends a simple chat completion request:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Say hello!"}],
    "max_tokens": 50
  }'
```
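The reply comes back as OpenAI-style JSON, with the assistant's text at `choices[0].message.content`. A self-contained sketch of extracting it with `python3` (using a canned sample response here, since the model's actual reply varies per run; in practice, pipe curl's output instead):

```bash
# Extract the assistant's reply from a chat-completions response.
# RESPONSE is a canned sample standing in for curl's output.
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"Hello!"}}]}'

REPLY="$(printf '%s' "$RESPONSE" | python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])')"
echo "$REPLY"
```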
## Troubleshooting
**Virtual environment not found:**
- Ensure the path provided with `--venv` exists and is a valid virtual environment
- Check that the activation script exists (`bin/activate` on Linux/macOS or `Scripts/activate` on Windows)
- Check for and install uv, then create a new virtual environment with uv: `uv venv /path/to/venv` (suggested); or with pip: `python3 -m venv /path/to/venv`

**Server won't start:**
- Check if the port is already in use: `lsof -i :8000`
- Verify GPU availability: `nvidia-smi` (for NVIDIA) or `rocm-smi` (for AMD)
- Check the vLLM installation: `python -c "import vllm; print(vllm.__version__)"`
- Review server logs at `$VENV_PATH/tmp/vllm-server.log`

**API returns errors:**
- Wait a few seconds for the model to load
- Check server logs: `cat $VENV_PATH/tmp/vllm-server.log`
- Verify the server is running: `scripts/quickstart.sh status`

**Out of memory:**
- Use a smaller model (e.g., Qwen2.5-0.5B-Instruct)
- Reduce the `--gpu-memory-utilization` parameter
- Close other GPU-intensive applications

**Wrong backend detected:**
- For NVIDIA: Ensure `nvidia-smi` is in your PATH
- For AMD: Check that ROCm drivers are properly installed
- For TPU: Set the `TPU_NAME` environment variable or install `gcloud`
## Notes
- The server runs in the background and logs to `$VENV_PATH/tmp/vllm-server.log`
- The PID is stored in `$VENV_PATH/tmp/vllm-server.pid` for easy management
- First run will download the model (~3GB for Qwen2.5-1.5B-Instruct)
- Subsequent runs will use the cached model
- The script automatically detects and uses `uv` if available, otherwise falls back to `pip`
- Virtual environment support allows isolation from system Python packages
- Arguments can be specified in any order (e.g., `scripts/quickstart.sh --port 8080 start --venv /path/to/venv`)
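Given the log and PID-file conventions above, the server can also be inspected by hand. A minimal sketch, assuming the default `--venv "."` layout:

```bash
# Inspect the server using the PID file described above
VENV_PATH="."
PID_FILE="$VENV_PATH/tmp/vllm-server.pid"
LOG_FILE="$VENV_PATH/tmp/vllm-server.log"

# kill -0 only checks that the process exists; it sends no signal
if [ -f "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null; then
    echo "server running (pid $(cat "$PID_FILE"))"
else
    echo "server not running"
fi

# Follow the log while the model loads:
# tail -f "$LOG_FILE"
```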