vllm-deploy-k8s

# vLLM Kubernetes Deployment


A Claude skill for deploying vLLM to Kubernetes using YAML templates. Deploys a vLLM OpenAI-compatible server as a Kubernetes Deployment with a ClusterIP Service, GPU resources, and health probes.

## What this skill does


- Deploy vLLM as a Kubernetes Deployment + Service with NVIDIA GPU support
- Check if a vLLM deployment already exists before deploying
- Check if the Hugging Face token secret exists, and ask the user for their token if not
- Use the `vllm/vllm-openai:latest` image by default (the user can specify a different version)
- Provide sensible default configuration that users can customize (model, replicas, GPU count, extra vLLM flags, etc.)

## Prerequisites


- `kubectl` configured with access to a Kubernetes cluster
- NVIDIA GPU Operator or device plugin installed on the cluster nodes
- Hugging Face token (required for gated models like Llama; optional for public models)

## Deployment Steps


### Step 1: Check HF token secret


Before deploying, check whether the `hf-token` Kubernetes secret exists in the target namespace:

```bash
kubectl get secret hf-token -n <namespace>
```

- If the secret exists: proceed to Step 2.
- If the secret does not exist: ask the user to provide their Hugging Face token, then create the secret:

```bash
kubectl create secret generic hf-token --from-literal=HF_TOKEN="<user-provided-token>" -n <namespace>
```

This is required for gated models (e.g., `meta-llama/Meta-Llama-3.1-8B`). For public models, the secret is optional but recommended to avoid Hugging Face rate limits.
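The check-then-create flow above can be sketched as a single shell snippet. `NAMESPACE` and `HF_TOKEN` are placeholder environment variables assumed to be supplied by the caller; the secret name `hf-token` matches the templates:

```shell
# Sketch: check for the hf-token secret, creating it only if a token was supplied.
NAMESPACE="${NAMESPACE:-default}"

if kubectl get secret hf-token -n "$NAMESPACE" >/dev/null 2>&1; then
  SECRET_STATUS="exists"
elif [ -n "${HF_TOKEN:-}" ]; then
  # A token was provided: create the secret on the spot.
  kubectl create secret generic hf-token \
    --from-literal=HF_TOKEN="$HF_TOKEN" -n "$NAMESPACE" || true
  SECRET_STATUS="created"
else
  # No secret and no token: the user must be asked before proceeding.
  SECRET_STATUS="missing"
fi
echo "hf-token secret: $SECRET_STATUS"
```

If the result is `missing`, stop and ask the user for their token rather than deploying a gated model without credentials.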

### Step 2: Check if deployment already exists


Before applying, check whether a vLLM deployment already exists:

```bash
kubectl get deployment vllm -n <namespace>
```

- If it exists: inform the user that the deployment already exists, show the current image and status, and ask whether to update it or skip.
- If it does not exist: proceed to deploy.
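One way to surface the current image when a deployment already exists is a JSONPath query. This is a sketch: it assumes the vLLM container is the first container in the pod spec, and `NAMESPACE` is a placeholder:

```shell
# Sketch: detect an existing deployment and report its image for the user.
NAMESPACE="${NAMESPACE:-default}"

if kubectl get deployment vllm -n "$NAMESPACE" >/dev/null 2>&1; then
  # Deployment exists: pull the image of the first container in the pod template.
  IMAGE="$(kubectl get deployment vllm -n "$NAMESPACE" \
    -o jsonpath='{.spec.template.spec.containers[0].image}')"
  DECISION="exists (image: $IMAGE); ask the user whether to update or skip"
else
  DECISION="absent; proceed to deploy"
fi
echo "vllm deployment: $DECISION"
```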

### Step 3: Deploy


Apply the template YAML files to deploy vLLM:

```bash
kubectl apply -f templates/vllm-service.yaml -n <namespace>
kubectl apply -f templates/vllm-deployment.yaml -n <namespace>
```
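Optionally, the templates can be validated with a client-side dry run before the real apply, which catches YAML syntax and schema errors without creating any resources. A sketch, assuming the template paths above:

```shell
# Sketch: validate both templates with a client-side dry run before applying.
NAMESPACE="${NAMESPACE:-default}"
TEMPLATES_OK="yes"

for f in templates/vllm-service.yaml templates/vllm-deployment.yaml; do
  kubectl apply --dry-run=client -f "$f" -n "$NAMESPACE" >/dev/null 2>&1 || TEMPLATES_OK="no"
done
echo "templates valid: $TEMPLATES_OK"
```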

### Step 4: Wait and verify


Wait for the deployment to roll out:

```bash
kubectl rollout status deployment/vllm -n <namespace> --timeout=600s
```

Verify the pod is running and ready:

```bash
kubectl get pods -n <namespace> -l app=vllm
```

Confirm the pod shows `READY 1/1` and `STATUS Running`. If the pod is not ready yet, wait and check again. If it is in `CrashLoopBackOff` or `Error`, check the logs with `kubectl logs -n <namespace> -l app=vllm`.
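The wait-and-diagnose flow above can be sketched as one snippet (`NAMESPACE` is a placeholder; the label selector matches the templates):

```shell
# Sketch: wait for rollout, then show pods on success or recent logs on failure.
NAMESPACE="${NAMESPACE:-default}"

if kubectl rollout status deployment/vllm -n "$NAMESPACE" --timeout=600s >/dev/null 2>&1; then
  ROLLOUT="complete"
  kubectl get pods -n "$NAMESPACE" -l app=vllm || true
else
  ROLLOUT="not complete"
  # Pull recent logs to diagnose CrashLoopBackOff / Error states.
  kubectl logs -n "$NAMESPACE" -l app=vllm --tail=50 2>/dev/null || true
fi
echo "rollout: $ROLLOUT"
```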

### Step 5: Print deployment summary


Once the pod is ready, print a summary message to the user in this format (replace placeholders with actual values):
🎉 **vLLM Deployment Successful!**

| Resource | Name | Status |
|----------|------|--------|
| Deployment | <deployment-name> | <ready>/<total> Ready |
| Service | <service-name> | ClusterIP:<port> |
| Pod | <pod-name> | Running |
| Image | <image> | |
| Model | <model> | |

&nbsp;

**To test the API, run these two commands in your terminal:**

**1. Open a port-forward** (this connects your local port <port> to the vLLM service inside the cluster):

```bash
kubectl port-forward svc/vllm-svc <port>:<port> -n <namespace>
```

**2. In a separate terminal**, send a test request to the OpenAI-compatible API:

```bash
curl -s http://localhost:<port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"<model>","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}' | python3 -m json.tool
```

If everything is working, you'll get a JSON response with the model's reply.
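vLLM's OpenAI-compatible server also exposes `GET /v1/models`, which makes a lighter smoke test through the same port-forward than a full chat completion. A sketch, with the template's default port 8000 as a placeholder:

```shell
# Sketch: probe the model-listing endpoint through an open port-forward.
PORT="${PORT:-8000}"

# --max-time keeps the check from hanging if the port-forward is not running.
if curl -s --max-time 5 "http://localhost:$PORT/v1/models" >/dev/null 2>&1; then
  API_STATUS="reachable"
else
  API_STATUS="unreachable (is the port-forward running?)"
fi
echo "vLLM API: $API_STATUS"
```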

## Default Configuration


The templates use the following defaults:

| Parameter | Default Value |
|-----------|---------------|
| Image | `vllm/vllm-openai:latest` |
| Model | `Qwen/Qwen2.5-1.5B-Instruct` |
| Port | `8000` |
| Replicas | `1` |
| GPU count | `1` |
| GPU memory utilization | `0.85` |
| Tensor parallel size | `1` |
| CPU request / limit | `12` / `128` |
| Memory request / limit | `100Gi` / `400Gi` |
| Shared memory (dshm) | `80Gi` |

## Customization


When the user requests changes, modify the template YAML files before applying. The following can be customized:

- **Image version**: Change `image: vllm/vllm-openai:<version>` in `templates/vllm-deployment.yaml` (default: `latest`). Use a specific version tag like `v0.17.1` if the user requests it.
- **Model**: Change the model name in the `vllm serve` command inside the Deployment `args`.
- **Extra vLLM flags**: Append additional flags to the `vllm serve` command in the Deployment `args` (e.g., `--max-model-len 4096`, `--kv-cache-dtype fp8`, `--enforce-eager`, `--generation-config vllm`).
- **Replicas**: Change `replicas:` in the Deployment spec.
- **GPU count**: Change `nvidia.com/gpu` in both `requests` and `limits` under `resources`.
- **Tensor parallel size**: Change the `--tensor-parallel-size` flag to match the GPU count.
- **CPU/Memory resources**: Change the `cpu` and `memory` values under `requests` and `limits`.
- **Port**: Change `containerPort` in the Deployment, `port`/`targetPort` in the Service, the `port` in all health probes (liveness, readiness, startup), AND add `--port <port>` to the `vllm serve` command in `args`. All four must match.
- **Namespace**: Apply to a specific namespace using `-n <namespace>`.
- **Shared memory size**: Change the `sizeLimit` of the `dshm` emptyDir volume.

Edit the template files using the Edit tool, then apply the modified templates.
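For an image-only change on an already-running deployment, `kubectl set image` is an alternative to re-editing the templates (it triggers a rolling update but leaves the template files out of sync, so editing the templates remains the primary workflow). A sketch; the container name `vllm` is an assumption about the template:

```shell
# Sketch: patch the live deployment's image instead of re-applying templates.
NAMESPACE="${NAMESPACE:-default}"
TAG="${TAG:-v0.17.1}"   # example tag from the list above

if kubectl set image deployment/vllm "vllm=vllm/vllm-openai:$TAG" \
    -n "$NAMESPACE" >/dev/null 2>&1; then
  SET_IMAGE="updated to vllm/vllm-openai:$TAG"
else
  SET_IMAGE="failed (deployment missing or no cluster access)"
fi
echo "image update: $SET_IMAGE"
```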

## Status Check


```bash
kubectl get deployment,svc,pods -n <namespace> -l app=vllm
```

## Cleanup


When the user asks to clean up or delete the vLLM deployment, run the following steps:

1. Delete the Deployment and Service:

   ```bash
   kubectl delete -f templates/vllm-deployment.yaml -n <namespace>
   kubectl delete -f templates/vllm-service.yaml -n <namespace>
   ```

2. Ask the user if they also want to delete the HF token secret. If yes:

   ```bash
   kubectl delete secret hf-token -n <namespace>
   ```

3. Verify everything is cleaned up:

   ```bash
   kubectl get deployment,svc,pods -n <namespace> -l app=vllm
   ```

4. Print a summary message to the user:

   ```
   vLLM deployment has been cleaned up from namespace <namespace>.
   Deleted: Deployment/vllm, Service/vllm-svc
   HF token secret: <kept/deleted>
   ```
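The cleanup steps above can be collapsed into one sketch. `NAMESPACE` is a placeholder, `DELETE_SECRET` stands in for the user's answer in step 2, and `--ignore-not-found` keeps repeat runs quiet:

```shell
# Sketch: delete the vLLM resources; only delete the secret when the user confirmed.
NAMESPACE="${NAMESPACE:-default}"
DELETE_SECRET="${DELETE_SECRET:-no}"

kubectl delete -f templates/vllm-deployment.yaml -n "$NAMESPACE" --ignore-not-found 2>/dev/null || true
kubectl delete -f templates/vllm-service.yaml -n "$NAMESPACE" --ignore-not-found 2>/dev/null || true

if [ "$DELETE_SECRET" = "yes" ]; then
  kubectl delete secret hf-token -n "$NAMESPACE" --ignore-not-found 2>/dev/null || true
  SECRET_FATE="deleted"
else
  SECRET_FATE="kept"
fi
echo "cleanup done in $NAMESPACE; HF token secret: $SECRET_FATE"
```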

## Troubleshooting


- **Pod stuck in Pending**: No GPU nodes available. Check `kubectl describe pod <pod-name>` for scheduling errors. Ensure the NVIDIA GPU Operator or device plugin is installed.
- **Pod OOMKilled**: Increase the `memory` limits in the Deployment, or use a smaller model.
- **ImagePullBackOff**: Check the image name and tag. Verify the node has access to Docker Hub / the container registry.
- **Startup probe failures (CrashLoopBackOff)**: Model download may be slow. Check logs with `kubectl logs <pod-name>`. Ensure the `hf-token` secret exists for gated models. Increase `failureThreshold` on the startup probe if needed.
- **HF_TOKEN not working**: Verify the secret exists: `kubectl get secret hf-token -n <namespace>`. Check that the token is valid.
- **GPU not detected in container**: Ensure the `nvidia.com/gpu` resource is requested and the NVIDIA device plugin is running on the node.
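A useful first triage pass for most of the failures above is to read recent namespace events alongside the pods' own event sections, since events usually name the failing condition directly (FailedScheduling, OOMKilled, ErrImagePull, probe failures). A sketch:

```shell
# Sketch: surface the most recent cluster events and per-pod events for triage.
NAMESPACE="${NAMESPACE:-default}"

kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp 2>/dev/null | tail -n 20 || true
kubectl describe pods -n "$NAMESPACE" -l app=vllm 2>/dev/null | grep -A 5 "Events:" || true
TRIAGE="done"
echo "event triage: $TRIAGE"
```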

## References
