vllm-deploy-docker


vLLM Docker Deployment


A Claude skill describing how to deploy vLLM with Docker on NVIDIA GPUs with CUDA, using either the official pre-built images or an image built from source. Instructions cover NVIDIA CUDA support, an example `docker run` command, a minimal `docker-compose` snippet, recommended flags, and troubleshooting notes. For AMD, Intel, or other accelerators, refer to the vLLM documentation for alternative deployment methods.

What this skill does


  • Deploy vLLM with Docker using pre-built images (recommended for most users) or build from source for custom configurations
  • Provide example commands for running the OpenAI-compatible server with GPU access and a mounted Hugging Face cache
  • Point to build-from-source instructions when a custom image or optional dependencies are needed
  • Explain common flags: `--ipc=host`, shared cache mounts, and `HF_TOKEN` handling

Prerequisites


  • Docker Engine installed (Docker 20.10+ recommended)
  • NVIDIA GPU(s) with appropriate drivers and the CUDA toolkit installed
  • Optional: `curl` for API tests
  • A Hugging Face token (`HF_TOKEN`) if pulling private models or to avoid rate limits

Quickstart using Pre-built Image (recommended)


Run a vLLM OpenAI-compatible server with GPU access, mounting the HF cache and forwarding port 8000:

```bash
docker run --rm --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-1.5B-Instruct
```

  • `--gpus all` exposes all GPUs to the container. Adjust if you need specific GPUs.
  • `--ipc=host` or an appropriately large `--shm-size` is recommended so PyTorch and vLLM can share host shared memory.
  • Mounting `~/.cache/huggingface` avoids re-downloading models inside the container.

**Note:** vLLM and this skill recommend using the latest Docker image (`vllm/vllm-openai:latest`). For older versions, refer to the image tags on Docker Hub.
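The same server can also be run with Docker Compose. A minimal sketch, assuming Compose v2 with the NVIDIA Container Toolkit installed so that GPU device reservations work:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    ipc: host                      # same role as --ipc=host above
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
    command: ["--model", "Qwen/Qwen2.5-1.5B-Instruct"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all           # expose all GPUs
              capabilities: [gpu]
```

Start it with `docker compose up`. To expose only specific GPUs instead of all of them, replace `count: all` with `device_ids: ["0", "1"]` (the `docker run` equivalent is `--gpus '"device=0,1"'`).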

Build Docker image from source


You can build and run vLLM from source using the provided `docker/Dockerfile`. First, check the host machine's hardware and ensure the necessary dependencies are installed (e.g., NVIDIA drivers, CUDA toolkit, Docker with BuildKit support). For ARM64/aarch64 builds, refer to the "Building for ARM64/aarch64" section.

Basic build command


```bash
DOCKER_BUILDKIT=1 docker build . \
  --target vllm-openai \
  --tag vllm/vllm-openai \
  --file docker/Dockerfile
```

The `--target vllm-openai` flag specifies that you are building the OpenAI-compatible server image. The `DOCKER_BUILDKIT=1` environment variable enables BuildKit, which provides better caching and faster builds.

Build arguments and options


  • `--build-arg max_jobs=<N>` — sets the number of parallel compilation jobs for building CUDA kernels. Useful for speeding up builds on multi-core systems.
  • `--build-arg nvcc_threads=<N>` — controls CUDA compiler threads. Use a smaller value than `max_jobs` to avoid excessive memory usage.
  • `--build-arg torch_cuda_arch_list=""` — if set to an empty string, vLLM will detect and build only for the current GPU's compute capability. By default, vLLM builds for all GPU types for wider distribution.
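Putting these arguments together, a build sketch that limits parallelism and compiles only for the GPUs present on the build host (the `:custom` tag is illustrative; the command assumes you are in a vLLM source checkout):

```shell
DOCKER_BUILDKIT=1 docker build . \
  --target vllm-openai \
  --tag vllm/vllm-openai:custom \
  --file docker/Dockerfile \
  --build-arg max_jobs=8 \
  --build-arg nvcc_threads=2 \
  --build-arg torch_cuda_arch_list=""
```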

Using precompiled wheels to speed up builds


If you have not changed any C++ or CUDA kernel code, you can use precompiled wheels to significantly reduce Docker build time:

  • Enable precompiled wheels: add `--build-arg VLLM_USE_PRECOMPILED="1"` to your build command.
  • How it works: by default, vLLM automatically finds the correct precompiled wheels from the nightly builds by using the merge-base commit with the upstream `main` branch.
  • Specify a commit: to use wheels from a specific commit, add `--build-arg VLLM_PRECOMPILED_WHEEL_COMMIT=<commit_hash>`.

Example with precompiled wheels and options for fast compilation:

```bash
DOCKER_BUILDKIT=1 docker build . \
  --target vllm-openai \
  --tag vllm/vllm-openai \
  --file docker/Dockerfile \
  --build-arg max_jobs=8 \
  --build-arg nvcc_threads=2 \
  --build-arg VLLM_USE_PRECOMPILED="1"
```

Building with optional dependencies (optional)


vLLM does not include optional dependencies (e.g., audio processing) in the pre-built image to avoid licensing issues. If you need optional dependencies, create a custom Dockerfile that extends the base image:

**Example: adding audio optional dependencies**

```dockerfile
# NOTE: MAKE SURE the version of vLLM matches the base image!
FROM vllm/vllm-openai:0.11.0

# Install audio optional dependencies
RUN uv pip install --system vllm[audio]==0.11.0
```

**Example: using a development version of transformers:**

```dockerfile
FROM vllm/vllm-openai:latest

# Install development version of Transformers from source
RUN uv pip install --system git+https://github.com/huggingface/transformers.git
```

Build this custom Dockerfile with:

```bash
docker build -t my-vllm-custom:latest -f Dockerfile .
```

Then use it like any other vLLM image:

```bash
docker run --rm --gpus all \
  -p 8000:8000 \
  --ipc=host \
  my-vllm-custom:latest \
  --model Qwen/Qwen2.5-1.5B-Instruct
```

Building for ARM64/aarch64


A Docker container can be built for ARM64 systems (e.g., NVIDIA Grace-Hopper and Grace-Blackwell). Use the flag `--platform "linux/arm64"`:

```bash
DOCKER_BUILDKIT=1 docker build . \
  --target vllm-openai \
  --tag vllm/vllm-openai \
  --file docker/Dockerfile \
  --platform "linux/arm64"
```

**Note:** multiple modules must be compiled, so this process can take longer. Use build arguments like `--build-arg max_jobs=8 --build-arg nvcc_threads=2` to speed up the process (ensure `max_jobs` is substantially larger than `nvcc_threads`). Monitor memory usage, as parallel jobs can require significant RAM.

For cross-compilation (building ARM64 on an x86_64 host), register QEMU user-static handlers first:

```bash
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
```

Then use the `--platform "linux/arm64"` flag in your build command.
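For example, a full cross-compile build on an x86_64 host might look like the following (run the QEMU registration command first, as above; the `:arm64` tag is illustrative):

```shell
DOCKER_BUILDKIT=1 docker build . \
  --target vllm-openai \
  --tag vllm/vllm-openai:arm64 \
  --file docker/Dockerfile \
  --platform "linux/arm64" \
  --build-arg max_jobs=8 \
  --build-arg nvcc_threads=2
```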

Running your custom-built image


After building, run your image just like the pre-built image:

```bash
docker run --rm --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai \
  --model Qwen/Qwen2.5-1.5B-Instruct
```

Replace `vllm/vllm-openai` with the tag you specified during the build (e.g., `my-vllm-custom:latest`).

**Note:** `--runtime nvidia` is deprecated for most environments. Prefer `--gpus ...` with the NVIDIA Container Toolkit. Use `--runtime nvidia` only for legacy Docker configurations.

Common server flags


  • `--model <MODEL_ID>` — model to load (HF ID or local path)
  • `--port <PORT>` — server port (default 8000 for the OpenAI-compatible server)
  • `--log-level` — adjust verbosity
  • You may pass additional `engine_args` after the image tag; see the vLLM docs for tuning options.
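For instance, engine arguments are appended after the image tag in the same way as `--model`. A sketch using two common vLLM tuning flags (the values are illustrative; check the vLLM engine-arguments reference for your version):

```shell
docker run --rm --gpus all \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90
```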

Testing the API


After the container starts, make a quick test request against the OpenAI-compatible endpoint:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","messages":[{"role":"user","content":"Who are you?"}],"max_tokens":128}'
```
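If the request fails, a quick sanity check is to list the models the server actually loaded; this listing endpoint is part of the OpenAI-compatible API:

```shell
curl -s http://localhost:8000/v1/models
```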

Security and operational notes


  • Keep `HF_TOKEN` secret; prefer passing it via environment variables or a secret manager.
  • For production, run behind a reverse proxy (e.g., Nginx) with TLS and authentication.
  • Mount only necessary host paths into the container.
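As an illustration of the reverse-proxy point, a minimal Nginx sketch with TLS and a static bearer-token check (the hostname, certificate paths, and token are placeholders; a real deployment should use a proper auth layer):

```nginx
server {
    listen 443 ssl;
    server_name vllm.example.com;                       # placeholder hostname

    ssl_certificate     /etc/nginx/certs/fullchain.pem; # placeholder cert paths
    ssl_certificate_key /etc/nginx/certs/privkey.pem;

    location /v1/ {
        # Reject requests without the expected bearer token (replace CHANGE_ME)
        if ($http_authorization != "Bearer CHANGE_ME") {
            return 401;
        }
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
    }
}
```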

Troubleshooting


  • Container can't access GPUs: ensure `nvidia-container-toolkit` is installed and restart Docker.
  • Model download failures: check `HF_TOKEN` and network; mount the cache directory to persist downloads.
  • Memory / OOM errors: try a smaller model or add more GPU memory; check `--shm-size`.
  • If the container fails with NCCL library path issues (rare): set `VLLM_NCCL_SO_PATH` per upstream guidance.
  • Permission issues with Docker for the current user: ensure your user is in the `docker` group, or add it manually as follows:

```bash
# 1. Create docker group if it doesn't exist (may already exist on some systems)
sudo groupadd docker

# 2. Add current user to the docker group (replace $USER with your username if needed)
sudo usermod -aG docker $USER

# 3. Apply the new group membership (you may need to log out and log back in for this to take effect)
newgrp docker

# 4. Verify that the user is in the docker group (output should include docker)
groups $USER
```

  • Docker pull failed due to a network timeout connecting to Docker Hub: check the network connection and try again. If the issue persists, suggest and show the user how to use a mirror for Docker Hub.
  • Error while downloading models from Hugging Face: check that `HF_TOKEN` is passed to the container and is valid. If the host is behind a proxy, check that `HTTP_PROXY` and `HTTPS_PROXY` are passed to the container. Also verify that the model ID is correct and that the model is public or accessible with the provided token.
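For the proxy case above, a sketch of forwarding the host's proxy settings into the container (the variable names assume they are already set in the host environment):

```shell
docker run --rm --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "HTTP_PROXY=$HTTP_PROXY" \
  --env "HTTPS_PROXY=$HTTPS_PROXY" \
  --env "NO_PROXY=localhost,127.0.0.1" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-1.5B-Instruct
```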

References
