vllm-deploy-k8s

# vLLM Kubernetes Deployment


A Claude skill for deploying vLLM to Kubernetes using YAML templates. Deploys a vLLM OpenAI-compatible server as a Kubernetes Deployment with a ClusterIP Service, GPU resources, and health probes.

## What this skill does


- Deploy vLLM as a Kubernetes Deployment + Service with NVIDIA GPU support
- Check if a vLLM deployment already exists before deploying
- Check if the Hugging Face token secret exists, and ask the user for their token if not
- Use the `vllm/vllm-openai:latest` image by default (the user can specify a different version)
- Provide sensible default configuration that users can customize (model, replicas, GPU count, extra vLLM flags, etc.)

## Prerequisites


- `kubectl` configured with access to a Kubernetes cluster
- NVIDIA GPU Operator or device plugin installed on the cluster nodes
- Hugging Face token (required for gated models like Llama; optional for public models)

## Deployment Steps


### Step 1: Check HF token secret


Before deploying, check whether the `hf-token` Kubernetes secret exists in the target namespace:

```bash
kubectl get secret hf-token -n <namespace>
```

- If the secret exists: proceed to Step 2.
- If the secret does not exist: ask the user to provide their Hugging Face token, then create the secret:

```bash
kubectl create secret generic hf-token --from-literal=HF_TOKEN="<user-provided-token>" -n <namespace>
```

This is required for gated models (e.g., `meta-llama/Meta-Llama-3.1-8B`). For public models, the secret is optional but recommended to avoid Hugging Face rate limits.
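The check-then-create flow above can be sketched as a single shell snippet. `NAMESPACE` and `HF_TOKEN` are placeholder environment variables assumed to be supplied by the caller; the secret name `hf-token` matches the templates:

```shell
# Sketch: check for the hf-token secret, creating it only if a token was supplied.
NAMESPACE="${NAMESPACE:-default}"

if kubectl get secret hf-token -n "$NAMESPACE" >/dev/null 2>&1; then
  SECRET_STATUS="exists"
elif [ -n "${HF_TOKEN:-}" ]; then
  # A token was provided: create the secret on the spot.
  kubectl create secret generic hf-token \
    --from-literal=HF_TOKEN="$HF_TOKEN" -n "$NAMESPACE" || true
  SECRET_STATUS="created"
else
  # No secret and no token: the user must be asked before proceeding.
  SECRET_STATUS="missing"
fi
echo "hf-token secret: $SECRET_STATUS"
```

If the result is `missing`, stop and ask the user for their token rather than deploying a gated model without credentials.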

### Step 2: Check if deployment already exists


Before applying, check whether a vLLM deployment already exists:

```bash
kubectl get deployment vllm -n <namespace>
```

- If it exists: inform the user that the deployment already exists, show the current image and status, and ask whether to update it or skip.
- If it does not exist: proceed to deploy.
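One way to surface the current image when a deployment already exists is a JSONPath query. This is a sketch: it assumes the vLLM container is the first container in the pod spec, and `NAMESPACE` is a placeholder:

```shell
# Sketch: detect an existing deployment and report its image for the user.
NAMESPACE="${NAMESPACE:-default}"

if kubectl get deployment vllm -n "$NAMESPACE" >/dev/null 2>&1; then
  # Deployment exists: pull the image of the first container in the pod template.
  IMAGE="$(kubectl get deployment vllm -n "$NAMESPACE" \
    -o jsonpath='{.spec.template.spec.containers[0].image}')"
  DECISION="exists (image: $IMAGE); ask the user whether to update or skip"
else
  DECISION="absent; proceed to deploy"
fi
echo "vllm deployment: $DECISION"
```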

### Step 3: Deploy


Apply the template YAML files to deploy vLLM:

```bash
kubectl apply -f templates/vllm-service.yaml -n <namespace>
kubectl apply -f templates/vllm-deployment.yaml -n <namespace>
```
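Optionally, the templates can be validated with a client-side dry run before the real apply, which catches YAML syntax and schema errors without creating any resources. A sketch, assuming the template paths above:

```shell
# Sketch: validate both templates with a client-side dry run before applying.
NAMESPACE="${NAMESPACE:-default}"
TEMPLATES_OK="yes"

for f in templates/vllm-service.yaml templates/vllm-deployment.yaml; do
  kubectl apply --dry-run=client -f "$f" -n "$NAMESPACE" >/dev/null 2>&1 || TEMPLATES_OK="no"
done
echo "templates valid: $TEMPLATES_OK"
```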

### Step 4: Wait and verify


Wait for the deployment to roll out:

```bash
kubectl rollout status deployment/vllm -n <namespace> --timeout=600s
```

Verify the pod is running and ready:

```bash
kubectl get pods -n <namespace> -l app=vllm
```

Confirm the pod shows `READY 1/1` and `STATUS Running`. If the pod is not ready yet, wait and check again. If it is in `CrashLoopBackOff` or `Error`, check the logs with `kubectl logs -n <namespace> -l app=vllm`.
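The wait-and-diagnose flow above can be sketched as one snippet (`NAMESPACE` is a placeholder; the label selector matches the templates):

```shell
# Sketch: wait for rollout, then show pods on success or recent logs on failure.
NAMESPACE="${NAMESPACE:-default}"

if kubectl rollout status deployment/vllm -n "$NAMESPACE" --timeout=600s >/dev/null 2>&1; then
  ROLLOUT="complete"
  kubectl get pods -n "$NAMESPACE" -l app=vllm || true
else
  ROLLOUT="not complete"
  # Pull recent logs to diagnose CrashLoopBackOff / Error states.
  kubectl logs -n "$NAMESPACE" -l app=vllm --tail=50 2>/dev/null || true
fi
echo "rollout: $ROLLOUT"
```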

### Step 5: Print deployment summary


Once the pod is ready, print a summary message to the user in this format (replace placeholders with actual values):
🎉 **vLLM Deployment Successful!**

| Resource | Name | Status |
|----------|------|--------|
| Deployment | <deployment-name> | <ready>/<total> Ready |
| Service | <service-name> | ClusterIP:<port> |
| Pod | <pod-name> | Running |
| Image | <image> | |
| Model | <model> | |

&nbsp;

**To test the API, run these two commands in your terminal:**

**1. Open a port-forward** (this connects your local port <port> to the vLLM service inside the cluster):

```bash
kubectl port-forward svc/vllm-svc <port>:<port> -n <namespace>
```

**2. In a separate terminal**, send a test request to the OpenAI-compatible API:

```bash
curl -s http://localhost:<port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"<model>","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}' | python3 -m json.tool
```

If everything is working, you'll get a JSON response with the model's reply.
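vLLM's OpenAI-compatible server also exposes `GET /v1/models`, which makes a lighter smoke test through the same port-forward than a full chat completion. A sketch, with the template's default port 8000 as a placeholder:

```shell
# Sketch: probe the model-listing endpoint through an open port-forward.
PORT="${PORT:-8000}"

# --max-time keeps the check from hanging if the port-forward is not running.
if curl -s --max-time 5 "http://localhost:$PORT/v1/models" >/dev/null 2>&1; then
  API_STATUS="reachable"
else
  API_STATUS="unreachable (is the port-forward running?)"
fi
echo "vLLM API: $API_STATUS"
```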

## Default Configuration


The templates use the following defaults:

| Parameter | Default Value |
|-----------|---------------|
| Image | `vllm/vllm-openai:latest` |
| Model | `Qwen/Qwen2.5-1.5B-Instruct` |
| Port | `8000` |
| Replicas | `1` |
| GPU count | `1` |
| GPU memory utilization | `0.85` |
| Tensor parallel size | `1` |
| CPU request / limit | `12` / `128` |
| Memory request / limit | `100Gi` / `400Gi` |
| Shared memory (dshm) | `80Gi` |

## Customization


When the user requests changes, modify the template YAML files before applying. The following can be customized:

- **Image version**: Change `image: vllm/vllm-openai:<version>` in `templates/vllm-deployment.yaml` (default: `latest`). Use a specific version tag like `v0.17.1` if the user requests it.
- **Model**: Change the model name in the `vllm serve` command inside the Deployment `args`.
- **Extra vLLM flags**: Append additional flags to the `vllm serve` command in the Deployment `args` (e.g., `--max-model-len 4096`, `--kv-cache-dtype fp8`, `--enforce-eager`, `--generation-config vllm`).
- **Replicas**: Change `replicas:` in the Deployment spec.
- **GPU count**: Change `nvidia.com/gpu` in both `requests` and `limits` under `resources`.
- **Tensor parallel size**: Change the `--tensor-parallel-size` flag to match the GPU count.
- **CPU/Memory resources**: Change the `cpu` and `memory` values under `requests` and `limits`.
- **Port**: Change `containerPort` in the Deployment, `port`/`targetPort` in the Service, the `port` in all health probes (liveness, readiness, startup), AND add `--port <port>` to the `vllm serve` command in `args`. All four must match.
- **Namespace**: Apply to a specific namespace using `-n <namespace>`.
- **Shared memory size**: Change the `sizeLimit` of the `dshm` emptyDir volume.

Edit the template files using the Edit tool, then apply the modified templates.
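For an image-only change on an already-running deployment, `kubectl set image` is an alternative to re-editing the templates (it triggers a rolling update but leaves the template files out of sync, so editing the templates remains the primary workflow). A sketch; the container name `vllm` is an assumption about the template:

```shell
# Sketch: patch the live deployment's image instead of re-applying templates.
NAMESPACE="${NAMESPACE:-default}"
TAG="${TAG:-v0.17.1}"   # example tag from the list above

if kubectl set image deployment/vllm "vllm=vllm/vllm-openai:$TAG" \
    -n "$NAMESPACE" >/dev/null 2>&1; then
  SET_IMAGE="updated to vllm/vllm-openai:$TAG"
else
  SET_IMAGE="failed (deployment missing or no cluster access)"
fi
echo "image update: $SET_IMAGE"
```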

## Status Check


```bash
kubectl get deployment,svc,pods -n <namespace> -l app=vllm
```

## Cleanup


When the user asks to clean up or delete the vLLM deployment, run the following steps:

1. Delete the Deployment and Service:

   ```bash
   kubectl delete -f templates/vllm-deployment.yaml -n <namespace>
   kubectl delete -f templates/vllm-service.yaml -n <namespace>
   ```

2. Ask the user if they also want to delete the HF token secret. If yes:

   ```bash
   kubectl delete secret hf-token -n <namespace>
   ```

3. Verify everything is cleaned up:

   ```bash
   kubectl get deployment,svc,pods -n <namespace> -l app=vllm
   ```

4. Print a summary message to the user:

   ```
   vLLM deployment has been cleaned up from namespace <namespace>.
   Deleted: Deployment/vllm, Service/vllm-svc
   HF token secret: <kept/deleted>
   ```
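The cleanup steps above can be collapsed into one sketch. `NAMESPACE` is a placeholder, `DELETE_SECRET` stands in for the user's answer in step 2, and `--ignore-not-found` keeps repeat runs quiet:

```shell
# Sketch: delete the vLLM resources; only delete the secret when the user confirmed.
NAMESPACE="${NAMESPACE:-default}"
DELETE_SECRET="${DELETE_SECRET:-no}"

kubectl delete -f templates/vllm-deployment.yaml -n "$NAMESPACE" --ignore-not-found 2>/dev/null || true
kubectl delete -f templates/vllm-service.yaml -n "$NAMESPACE" --ignore-not-found 2>/dev/null || true

if [ "$DELETE_SECRET" = "yes" ]; then
  kubectl delete secret hf-token -n "$NAMESPACE" --ignore-not-found 2>/dev/null || true
  SECRET_FATE="deleted"
else
  SECRET_FATE="kept"
fi
echo "cleanup done in $NAMESPACE; HF token secret: $SECRET_FATE"
```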

## Troubleshooting


- **Pod stuck in Pending**: No GPU nodes available. Check `kubectl describe pod <pod-name>` for scheduling errors. Ensure the NVIDIA GPU Operator or device plugin is installed.
- **Pod OOMKilled**: Increase the `memory` limits in the Deployment, or use a smaller model.
- **ImagePullBackOff**: Check the image name and tag. Verify the node has access to Docker Hub / the container registry.
- **Startup probe failures (CrashLoopBackOff)**: Model download may be slow. Check logs with `kubectl logs <pod-name>`. Ensure the `hf-token` secret exists for gated models. Increase `failureThreshold` on the startup probe if needed.
- **HF_TOKEN not working**: Verify the secret exists: `kubectl get secret hf-token -n <namespace>`. Check that the token is valid.
- **GPU not detected in container**: Ensure the `nvidia.com/gpu` resource is requested and the NVIDIA device plugin is running on the node.
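A useful first triage pass for most of the failures above is to read recent namespace events alongside the pods' own event sections, since events usually name the failing condition directly (FailedScheduling, OOMKilled, ErrImagePull, probe failures). A sketch:

```shell
# Sketch: surface the most recent cluster events and per-pod events for triage.
NAMESPACE="${NAMESPACE:-default}"

kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp 2>/dev/null | tail -n 20 || true
kubectl describe pods -n "$NAMESPACE" -l app=vllm 2>/dev/null | grep -A 5 "Events:" || true
TRIAGE="done"
echo "event triage: $TRIAGE"
```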

## References
