vllm-bench-serve
Benchmark vLLM or OpenAI-compatible serving endpoints using vllm bench serve. Supports multiple datasets (random, sharegpt, sonnet, HF), backends (openai, openai-chat, vllm-pooling, embeddings), throughput/latency testing with request-rate control, and result saving. Use when benchmarking LLM serving performance, measuring TTFT/TPOT, or load testing inference APIs.
Source: vllm-project/vllm-skills
vLLM Bench Serve
Benchmark vLLM or any OpenAI-compatible serving endpoint using the `vllm bench serve` CLI. Measures throughput, latency (TTFT, TPOT), and goodput against configurable request load.
Reference: vLLM Bench Serve Documentation
Prerequisites
- vLLM installed (or any OpenAI-compatible server running)
- A vLLM server or API endpoint already serving a model
- Python environment with vLLM for the benchmark client
Quick Start
Basic benchmark against local vLLM server (default random dataset, 1000 prompts):
```bash
vllm bench serve \
  --backend openai-chat \
  --host 127.0.0.1 \
  --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions
```
Save results to JSON:
```bash
vllm bench serve \
  --backend openai-chat \
  --host 127.0.0.1 \
  --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --save-result \
  --result-dir ./bench-results \
  --metadata "version=0.6.0" "tp=1"
```
Note: When using `--backend openai-chat`, you must specify `--endpoint /v1/chat/completions` (default is `/v1/completions`).
Core Arguments
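| Argument | Default | Description |
|---|---|---|
| `--backend` | `openai` | Backend type, e.g. `openai`, `openai-chat`, `openai-embeddings`, `vllm-pooling` |
| `--host` | `127.0.0.1` | Server host |
| `--port` | `8000` | Server port |
| `--base-url` | - | Alternative: full base URL instead of host:port |
| `--endpoint` | `/v1/completions` | API endpoint; use `/v1/chat/completions` with `--backend openai-chat` |
| `--model` | (from /v1/models) | Model name |
| `--num-prompts` | `1000` | Number of prompts to process |
| `--request-rate` | `inf` | Requests per second; `inf` sends all requests at once (burst) |
| `--max-concurrency` | - | Max concurrent requests (caps parallelism) |
| `--num-warmups` | `0` | Warmup requests before measuring |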
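How `--base-url` and `--endpoint` combine can be sketched in shell. This is a minimal illustration that assumes the client builds the request URL by appending the endpoint path to the base URL (or to `http://host:port`). `bench_url` is a hypothetical helper written for this example, not part of vLLM.

```bash
# Hypothetical helper: compose the target URL under the assumption that
# the client appends the endpoint path directly to the base URL.
bench_url() {
  base="$1"
  endpoint="$2"
  # Strip one trailing slash so the result does not contain "//v1/...".
  base="${base%/}"
  printf '%s%s\n' "$base" "$endpoint"
}

# host:port form
bench_url "http://127.0.0.1:8000" "/v1/chat/completions"

# A base URL that already ends in /v1 would yield a doubled /v1 path.
bench_url "https://api.example.com/v1" "/v1/chat/completions"
```

If that assumption holds, a base URL should carry only the scheme and host (plus any path prefix that is not already repeated in `--endpoint`).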
Datasets
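| Dataset (`--dataset-name`) | Use Case |
|---|---|
| `random` | Synthetic random prompts (default) |
| `sharegpt` | ShareGPT conversation format; requires `--dataset-path` |
| `sonnet` | Sonnet-style prompts |
| `hf` | HuggingFace dataset; requires `--dataset-path` |
| `custom` | Custom dataset; requires `--dataset-path` |
| `prefix_repetition` | Prefix repetition benchmark |
| `random-mm` | Random multimodal (images/videos) |
| `spec_bench` | Spec bench dataset |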
Dataset-specific options (examples):
```bash
# Random: control input/output length
--dataset-name random --random-input-len 1024 --random-output-len 128
# Sonnet defaults: input 550, output 150, prefix 200
--dataset-name sonnet --sonnet-input-len 550 --sonnet-output-len 150
# HuggingFace dataset
--dataset-name hf --dataset-path "lmarena-ai/VisionArena-Chat" --hf-split test
# General overrides (map to dataset-specific args)
--input-len 512 --output-len 256
```
Load Control
```bash
# Fixed request rate (Poisson process)
--request-rate 10
# More bursty arrivals (gamma distribution, burstiness < 1)
--request-rate 10 --burstiness 0.5
# Ramp-up from low to high RPS
--ramp-up-strategy linear --ramp-up-start-rps 1 --ramp-up-end-rps 50
# Limit concurrency (useful for rate-limited APIs)
--max-concurrency 32
```
Results and Metrics
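| Argument | Description |
|---|---|
| `--save-result` | Save benchmark results to JSON |
| `--save-detailed` | Include per-request TTFT, TPOT, errors in JSON |
| `--append-result` | Append to existing result file |
| `--result-dir` | Directory for result files |
| `--result-filename` | Custom filename (a default name is generated when omitted) |
| `--percentile-metrics` | Metrics for percentiles, e.g. `ttft,tpot` |
| `--metric-percentiles` | Percentile values, e.g. `50,99` |
| `--goodput` | SLO for goodput as `metric:value` pairs in milliseconds, e.g. `ttft:500 tpot:50` |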
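The load-control and result-saving flags combine naturally into a request-rate sweep. The following is a sketch only: `run`, `sweep_rates`, the `DRY_RUN` switch, and the `rate-N.json` naming are illustrative choices for this example, and the model, host, and port are the ones used elsewhere in this guide. By default it only prints the commands it would run.

```bash
# DRY_RUN=1 (default in this sketch) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
MODEL="${MODEL:-Qwen/Qwen2.5-1.5B-Instruct}"
RESULT_DIR="${RESULT_DIR:-./bench-results}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# One benchmark per request rate, each saved to its own result file.
sweep_rates() {
  for rate in "$@"; do
    run vllm bench serve \
      --backend openai-chat --host 127.0.0.1 --port 8000 \
      --model "$MODEL" \
      --endpoint /v1/chat/completions \
      --request-rate "$rate" --num-prompts 200 \
      --save-result --result-dir "$RESULT_DIR" \
      --result-filename "rate-${rate}.json"
  done
}

sweep_rates 1 5 10
```

Set `DRY_RUN=0` to execute the sweep against a running server. Writing one file per rate keeps runs separate, which makes it easier to compare latency percentiles across load levels.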
Sampling Parameters (OpenAI-compatible backends)
```bash
--temperature 0.7 --top-p 0.95 --top-k 50
--frequency-penalty 0 --presence-penalty 0 --repetition-penalty 1.0
```
Common Workflows
1. Throughput test with random dataset (burst):
```bash
vllm bench serve --backend openai-chat --host 127.0.0.1 --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 500 --random-input-len 512 --random-output-len 128
```
2. Latency test with fixed QPS:
```bash
vllm bench serve --backend openai-chat --host 127.0.0.1 --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --request-rate 5 --num-prompts 200 \
  --save-result --percentile-metrics ttft,tpot --metric-percentiles 50,99
```
3. Benchmark against remote API (base-url):
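```bash
# --endpoint is required with openai-chat; the base URL omits /v1 because the endpoint already includes it
vllm bench serve --backend openai-chat \
  --base-url "https://api.example.com" \
  --endpoint /v1/chat/completions \
  --model my-model \
  --header "Authorization=Bearer $API_KEY"
```
4. Run inside Docker (when vLLM client not on host):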
```bash
docker exec <container-name> vllm bench serve \
  --backend openai-chat --host 127.0.0.1 --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random --num-prompts 100
```
Troubleshooting
- Connection refused: Ensure the server is running and `--host`/`--port` or `--base-url` are correct.
- Model not found: Pass `--model` explicitly or ensure `/v1/models` returns the model.
- URL must end with chat/completions: Use `--endpoint /v1/chat/completions` when using `--backend openai-chat`.
- Rate limit / 429: Reduce `--request-rate` or `--max-concurrency`.
- Ready check: Use `--ready-check-timeout-sec 60` to wait for the endpoint before benchmarking.
- SSL: Use `--insecure` for self-signed certificates.
Notes
- For embeddings/rerank benchmarks, use `--backend openai-embeddings`, `vllm-pooling`, or `vllm-rerank`.
- `--profile` requires `--profiler-config` on the server for vLLM profiling.
- Goodput SLOs are useful for SLA-style analysis; see DistServe paper for details.
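To make the goodput idea concrete, here is a minimal sketch that counts a request as "good" only if it meets every SLO, then divides by the benchmark duration. The `goodput` function, the two-column input format, and the sample numbers are all hypothetical and chosen for illustration; they are not the benchmark's actual output format.

```bash
# Usage: goodput TTFT_SLO_MS TPOT_SLO_MS DURATION_S < requests.txt
# Input: one request per line as "ttft_ms tpot_ms" (hypothetical format).
goodput() {
  awk -v ttft_slo="$1" -v tpot_slo="$2" -v dur="$3" '
    # A request is good only when it meets both SLOs.
    $1 <= ttft_slo && $2 <= tpot_slo { good++ }
    END {
      printf "%d good of %d requests, goodput %.2f req/s\n", good, NR, good / dur
    }
  '
}

# Four made-up requests measured over a 2-second window,
# checked against SLOs of 200 ms TTFT and 50 ms TPOT.
printf '%s\n' "120 30" "450 28" "90 75" "200 40" | goodput 200 50 2
# prints: 2 good of 4 requests, goodput 1.00 req/s
```

The second request misses the TTFT SLO and the third misses the TPOT SLO, so only two of the four count toward goodput even though all four completed.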