vllm-deploy-docker
vLLM Docker Deployment
A Claude skill describing how to deploy vLLM with Docker, using the official pre-built images or building the image from source, with support for NVIDIA GPUs with CUDA. Instructions cover NVIDIA CUDA support, example `docker run` commands and a minimal `docker-compose` snippet, recommended flags, and troubleshooting notes. For AMD, Intel, or other accelerators, please refer to the vLLM documentation for alternative deployment methods.
What this skill does
- Deploy vLLM with docker using pre-built images (recommended for most users) or build from source for custom configurations
- Provide example commands for running the OpenAI-compatible server with GPU access and mounted Hugging Face cache
- Point to build-from-source instructions when a custom image or optional dependencies are needed
- Explain common flags: `--ipc=host`, shared cache mounts, and `HF_TOKEN` handling
Prerequisites
- Docker Engine installed (Docker 20.10+ recommended)
- NVIDIA GPU(s) with appropriate drivers and CUDA toolkit installed
- Optional: `curl` for API tests
- A Hugging Face token (`HF_TOKEN`) if pulling private models or to avoid rate limits
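A quick way to sanity-check these prerequisites before pulling any vLLM image (the CUDA base-image tag below is only an illustrative example; any CUDA image works for the toolkit check):

```bash
docker --version   # expect 20.10 or newer
nvidia-smi         # confirms the driver sees your GPU(s)
# End-to-end check that Docker can hand GPUs to a container
# via the NVIDIA Container Toolkit:
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```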
Quickstart using Pre-built Image (recommended)
Run a vLLM OpenAI-compatible server with GPU access, mounting the HF cache and forwarding port 8000:
```bash
docker run --rm --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-1.5B-Instruct
```

- `--gpus all` exposes all GPUs to the container. Adjust if you need specific GPUs.
- `--ipc=host` (or an appropriately large `--shm-size`) is recommended so PyTorch and vLLM can share host shared memory.
- Mounting `~/.cache/huggingface` avoids re-downloading models inside the container.

Note: vLLM and this skill recommend using the latest Docker image (`vllm/vllm-openai:latest`). For legacy version images, you may refer to the Docker Hub image tags.
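The same quickstart can be expressed as a minimal `docker-compose` snippet. This is a sketch mirroring the `docker run` command above (service name and file layout are up to you); it assumes Docker Compose v2 with GPU device reservations:

```yaml
# docker-compose.yml — minimal sketch, adjust to your environment
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: ["--model", "Qwen/Qwen2.5-1.5B-Instruct"]
    ports:
      - "8000:8000"
    ipc: host
    environment:
      - HF_TOKEN=${HF_TOKEN}
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Start it with `docker compose up`.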
Build Docker image from source
You can build and run vLLM from source by using the provided docker/Dockerfile.
First, check the hardware of the host machine and ensure you have the necessary dependencies installed (e.g., NVIDIA drivers, CUDA toolkit, Docker with BuildKit support). For ARM64/aarch64 builds, refer to the "Building for ARM64/aarch64" section.
Basic build command
```bash
DOCKER_BUILDKIT=1 docker build . \
  --target vllm-openai \
  --tag vllm/vllm-openai \
  --file docker/Dockerfile
```

The `--target vllm-openai` flag specifies that you are building the OpenAI-compatible server image. The `DOCKER_BUILDKIT=1` environment variable enables BuildKit, which provides better caching and faster builds.
Build arguments and options
- `--build-arg max_jobs=<N>` — sets the number of parallel compilation jobs for building CUDA kernels. Useful for speeding up builds on multi-core systems.
- `--build-arg nvcc_threads=<N>` — controls CUDA compiler threads. Recommended to use a smaller value than `max_jobs` to avoid excessive memory usage.
- `--build-arg torch_cuda_arch_list=""` — if set to an empty string, vLLM will detect and build only for the current GPU's compute capability. By default, vLLM builds for all GPU types for wider distribution.
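One way to pick values for these two arguments is to derive them from the host's core count. The heuristic below (one job per two cores, `nvcc_threads` kept below `max_jobs`) is an assumption for illustration, not official vLLM guidance:

```shell
# Derive build parallelism from the host's CPU count.
CORES=$(nproc)
MAX_JOBS=$(( CORES / 2 ))
if [ "$MAX_JOBS" -lt 1 ]; then MAX_JOBS=1; fi
# Keep nvcc's own thread parallelism below the job count so memory
# usage doesn't multiply (jobs x threads compilers run at once).
NVCC_THREADS=2
if [ "$NVCC_THREADS" -ge "$MAX_JOBS" ]; then NVCC_THREADS=1; fi
echo "--build-arg max_jobs=$MAX_JOBS --build-arg nvcc_threads=$NVCC_THREADS"
```

Paste the echoed arguments into the `docker build` command.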
Using precompiled wheels to speed up builds
If you have not changed any C++ or CUDA kernel code, you can use precompiled wheels to significantly reduce Docker build time:
- Enable precompiled wheels: add `--build-arg VLLM_USE_PRECOMPILED="1"` to your build command.
- How it works: by default, vLLM automatically finds the correct precompiled wheels from the Nightly Builds by using the merge-base commit with the upstream `main` branch.
- Specify a commit: to use wheels from a specific commit, add `--build-arg VLLM_PRECOMPILED_WHEEL_COMMIT=<commit_hash>`.
Example with precompiled wheels and options for fast compilation:
```bash
DOCKER_BUILDKIT=1 docker build . \
  --target vllm-openai \
  --tag vllm/vllm-openai \
  --file docker/Dockerfile \
  --build-arg max_jobs=8 \
  --build-arg nvcc_threads=2 \
  --build-arg VLLM_USE_PRECOMPILED="1"
```
Building with optional dependencies (optional)
vLLM does not include optional dependencies (e.g., audio processing) in the pre-built image to avoid licensing issues. If you need optional dependencies, create a custom Dockerfile that extends the base image:
Example: adding audio optional dependencies
```dockerfile
# NOTE: MAKE SURE the version of vLLM matches the base image!
FROM vllm/vllm-openai:0.11.0

# Install audio optional dependencies
RUN uv pip install --system vllm[audio]==0.11.0
```

**Example: using development version of transformers:**

```dockerfile
FROM vllm/vllm-openai:latest

# Install development version of Transformers from source
RUN uv pip install --system git+https://github.com/huggingface/transformers.git
```
Build this custom Dockerfile with:
```bash
docker build -t my-vllm-custom:latest -f Dockerfile .
```

Then use it like any other vLLM image:

```bash
docker run --rm --gpus all \
  -p 8000:8000 \
  --ipc=host \
  my-vllm-custom:latest \
  --model Qwen/Qwen2.5-1.5B-Instruct
```
Building for ARM64/aarch64
A Docker container can be built for ARM64 systems (e.g., NVIDIA Grace-Hopper and Grace-Blackwell). Use the `--platform "linux/arm64"` flag:

```bash
DOCKER_BUILDKIT=1 docker build . \
  --target vllm-openai \
  --tag vllm/vllm-openai \
  --file docker/Dockerfile \
  --platform "linux/arm64"
```

Note: Multiple modules must be compiled, so this process can take longer. Use build arguments like `--build-arg max_jobs=8 --build-arg nvcc_threads=2` to speed up the process (ensure `max_jobs` is substantially larger than `nvcc_threads`). Monitor memory usage, as parallel jobs can require significant RAM.

For cross-compilation (building ARM64 on an x86_64 host), register QEMU user-static handlers first:

```bash
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
```

Then use the `--platform "linux/arm64"` flag in your build command.
Running your custom-built image
After building, run your image just like the pre-built image:
```bash
docker run --rm --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai \
  --model Qwen/Qwen2.5-1.5B-Instruct
```

Replace `vllm/vllm-openai` with the tag you specified during the build (e.g., `my-vllm-custom:latest`).

Note: `--runtime nvidia` is deprecated for most environments. Prefer `--gpus ...` with the NVIDIA Container Toolkit. Use `--runtime nvidia` only for legacy Docker configurations.
Common server flags
- `--model <MODEL_ID>` — model to load (HF ID or local path)
- `--port <PORT>` — server port (default 8000 for the OpenAI-compatible server)
- `--log-level` — adjust verbosity
- You may pass additional `engine_args` after the image tag; see vLLM docs for tuning options.
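For instance, engine arguments such as `--max-model-len` and `--gpu-memory-utilization` (both documented vLLM engine flags; the values here are illustrative) go after the image tag:

```bash
docker run --rm --gpus all -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```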
Testing the API
After the container starts, make a quick test request against the OpenAI-compatible endpoint:
```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","messages":[{"role":"user","content":"Who are you?"}],"max_tokens":128}'
```
Security and operational notes
- Keep `HF_TOKEN` secret; prefer passing it via environment variables or a secret manager.
- For production, run behind a reverse proxy (e.g., Nginx) with TLS and authentication.
- Mount only necessary host paths into the container.
Troubleshooting
- Container can't access GPUs: ensure `nvidia-container-toolkit` is installed and restart Docker.
- Model download failures: check `HF_TOKEN` and network; mount the cache directory to persist downloads.
- Memory / OOM errors: try a smaller model or add more GPU memory; check `--shm-size`.
- If the container fails with NCCL library path issues (rare): set `VLLM_NCCL_SO_PATH` per upstream guidance.
- Permission issues of the current user with Docker: ensure your user is in the `docker` group, or suggest the user add themselves to the `docker` group manually:

```bash
# 1. Create docker group if it doesn't exist (may already exist on some systems)
sudo groupadd docker

# 2. Add current user to the docker group (replace $USER with your username if needed)
sudo usermod -aG docker $USER

# 3. Apply the new group membership (you may need to log out and log back in for this to take effect)
newgrp docker

# 4. Verify that the user is in the docker group (output should include docker)
groups $USER
```
- Docker pull failed due to a network timeout connecting to Docker Hub: check the network connection and try again. If the issue persists, suggest and show the user how to use a mirror for Docker Hub.
- Error during downloading models from Hugging Face: check if the `HF_TOKEN` is passed to the container and is valid. Check whether `HTTP_PROXY` and `HTTPS_PROXY` are passed to the container if the host is behind a proxy. Also, verify that the model ID is correct and that the model is public or accessible with the provided token.
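When the host is behind a proxy, the proxy variables can be forwarded to the container like this (the `NO_PROXY` value is a typical example; adjust for your network):

```bash
docker run --rm --gpus all \
  --env "HTTP_PROXY=$HTTP_PROXY" \
  --env "HTTPS_PROXY=$HTTPS_PROXY" \
  --env "NO_PROXY=localhost,127.0.0.1" \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-1.5B-Instruct
```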
References
- vLLM repository (docker/Dockerfile): https://github.com/vllm-project/vllm/tree/main/docker
- NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
- Up-to-date deployment instructions and troubleshooting: https://docs.vllm.ai/en/latest/deployment/docker/