# LLM Serving Auto Benchmark

## Overview
Use this skill to compare LLM serving frameworks such as SGLang, vLLM, and
TensorRT-LLM for the same model and workload.
Use a config-driven workflow:
- keep launch-only capacity choices in each framework's `server_command`
- put the search knobs in the `search` section
- run the same dataset scenarios for every framework
- generate a bounded candidate list from the `search` section, with the
  baseline candidate included first
- keep failed candidates in the result file
- pick the best SLA-passing candidate after normalizing the results
For model-specific starting points, prefer the shipped configs in
`configs/cookbook-llm/`. They define a framework-neutral LLM serving cookbook
model set and translate each entry into framework-native SGLang, vLLM, and
TensorRT-LLM server flags. Validate those configs before a real run:

```bash
python skills/llm-serving-auto-benchmark/scripts/validate_cookbook_configs.py \
  skills/llm-serving-auto-benchmark/configs/cookbook-llm
```
If you have captured target-environment `--help` files, add
`--help-dir <artifact-help-dir>` to that command. The check only loads configs,
verifies the server flag names, and renders candidate commands; it does not
launch model servers.
Prefer native tooling when it gives better coverage:
- SGLang: `python -m sglang.auto_benchmark` when available, otherwise
  `python -m sglang.bench_serving`
- vLLM: `vllm bench sweep serve` for server-parameter sweeps, otherwise
  `vllm serve` plus `vllm bench serve`
- TensorRT-LLM: `trtllm-serve serve` for the OpenAI-compatible server plus the
  TensorRT-LLM serving benchmark client or a common OpenAI-compatible benchmark
  client
TensorRT-LLM has one hard scope rule in this skill: the server backend is fixed
to `trtllm-serve serve --backend pytorch`. Do not search the TensorRT-LLM
backend choice. If a request, config, or candidate asks for an engine backend or
any other non-PyTorch TensorRT-LLM server backend, reject that candidate as
unsupported for this skill and record the reason. This does not change the
benchmark client backend; the TensorRT-LLM benchmark client still uses
OpenAI-compatible modes.
Only pick a winner after each requested framework has had its main serving knobs
tuned.
The parameter lists in this skill are not a compatibility contract. They are
version-sensitive candidate knob families. Before every real run, record the
exact framework version or git commit and verify the concrete CLI flag names
with `--help` in the target environment.
The default search style is framework-neutral: start from a mostly pure-TP
baseline, sweep a small set of high-impact runtime knobs, and cap the first
pass around 10 candidates per framework. Do not search memory fractions by
default.
## Validation Environment
This skill is target-agnostic. It assumes any one of the following is
available, and nothing more:
- a local GPU host with Docker/Podman and the target framework images pulled;
- a remote GPU host reached via SSH with the framework images already
  running in a container there;
- a CI runner that can exec into a pre-built image for each framework.
Do not assume a specific operator host name inside this skill's own workflow.
The concrete SSH wiring, container names, workspace paths, and HF token
plumbing for a given box live in the operator-side per-host skills; this skill
only requires that the caller can reach a shell inside a container with SGLang,
vLLM, or TensorRT-LLM installed.
Historical validation snapshots are evidence of which flag names and failure
modes were seen in specific images; they are not a requirement that the next
run happen on the same hardware or framework version.
## Skill Scope

This skill is a playbook plus a config+validator toolchain, not a turn-key
orchestrator. The `scripts/` directory contains exactly two tools:

- `validate_cookbook_configs.py`: reads YAML, renders bounded candidate server
  commands, and checks flag names against captured snapshots. It never
  launches a model server.
- `compare_benchmark_results.py`: takes the normalized per-candidate JSONL and
  emits the markdown tables described in the Output Contract.

Launching servers, driving the workload, and writing one JSONL row per
candidate are the operator's responsibility; the skill tells you how to do
them, and the validator keeps your inputs honest.
The cookbook configs under `configs/cookbook-llm/` and the sample runtime plan
at references/example-plan.yaml use related but not identical schemas:
- Cookbook configs carry nested per-framework sections including
  `frameworks.*.server_command`; they must pass
  `validate_cookbook_configs.py`.
- The example plan is a shorter runtime shape without the nested cookbook
  structure. It is the skeleton a caller fills in for a one-off run and is not
  expected to pass the cookbook validator as-is.
Either shape can feed a benchmark run; the SLA key names in
references/result-schema.md are the single source of truth.
## Required Inputs
Collect these before starting a long run:
- model path or Hugging Face repo id
- tokenizer path if it differs from the model
- target frameworks: any subset of SGLang, vLLM, and TensorRT-LLM
- GPU model, GPU count, and whether multi-node is allowed
- precision and quantization constraints
- endpoint shape: completions, chat completions, responses, or custom
- workload source: real traffic JSONL, ShareGPT, random synthetic, or generated
shared-prefix synthetic
- dataset scenarios when synthetic traffic is used, for example `chat` and
  `summarization`
- SLA target: TTFT, TPOT/ITL, end-to-end latency, success rate, or goodput
- search budget: quick smoke, default search, or exhaustive search
- output directory for logs and result artifacts
Also collect a version manifest:
- framework package version and git commit when available
- container image or Python environment identifier
- `--help` snapshots for the server command and benchmark command
- whether each parameter in the search plan was accepted by that exact CLI
If real production traffic is the goal, use the real request distribution. A
synthetic workload is fine for bring-up and first-pass comparison, but it is not
enough for a production choice.
## Known Gotchas
Short list of failure modes that have bitten past validation runs. Check these
before starting a long sweep.
- Some SGLang attention backends need Hopper or newer. On A100, L40S, RTX
  5090, and older GPUs, drop the Hopper-only backend from the SGLang launch
  command and keep FlashInfer (or another supported backend when FlashInfer is
  unavailable).
- SGLang's benchmark client has two SGLang-facing backends: one for the native
  endpoint and one for the OpenAI-compatible endpoint. For cross-framework
  comparisons, prefer the OpenAI-compatible path so every framework is measured
  on the same request path.
- vLLM DBO only works when the target vLLM image is built with a supported
  all2all backend. Keep DBO out of the default candidate list unless the
  operator has verified the image.
- vLLM `--max-num-partial-prefills > 1` is model- and runtime-gated. Keep it at
  1 in the default pass; raise it only after a preflight with the actual model.
- In the validated TensorRT-LLM 1.0.0 image, `trtllm-serve serve` accepts
  `--kv_cache_free_gpu_memory_fraction`; the older `--free_gpu_memory_fraction`
  exits with a CLI error. Re-check the accepted flag name via `--help` on the
  target image before a real run.
- TensorRT-LLM 1.0.0 multi-GPU PyTorch-backend servers need several environment
  variables set (for single-node) or an equivalent NCCL setup.
- The TensorRT-LLM 1.0.0 benchmark client takes only its OpenAI-compatible
  backends; other values are rejected. This is separate from the server
  backend, which is pinned to `pytorch` by this skill.
- `benchmark_serving --dataset-name random` silently falls back to ShareGPT
  sampling when the required random-dataset flags are omitted.
- The per-framework sequence-length candidates must cover
  `max(input_len + output_len)` across every scenario, including values inside
  the scenario lists, not just the baseline. The validator checks this; do not
  bypass it.
## Secrets Hygiene

- Never print `HF_TOKEN` or any upstream API key into a saved artifact. Pass
  them through container environment flags (unquoted on the right side so the
  host value is inherited) and keep them out of the command fields written to
  the result JSONL.
- When a framework echoes the full argv at startup, scrub the log or redact
  token-shaped substrings before uploading the artifact.
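A minimal redaction pass over a log can look like the sketch below. The
patterns are illustrative starting points, not a complete secret taxonomy;
extend them for the providers you actually use:

```python
import re

# Sketch: redact token-shaped substrings before saving a server log as an
# artifact. Patterns are examples only, not exhaustive.
TOKEN_PATTERNS = [
    re.compile(r"hf_[A-Za-z0-9]{20,}"),    # Hugging Face token shape
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),  # common API-key shape
]

def scrub(text: str) -> str:
    """Replace anything token-shaped with a placeholder."""
    for pat in TOKEN_PATTERNS:
        text = pat.sub("<REDACTED>", text)
    return text
```

Run every server log and echoed argv through `scrub` before it leaves the box.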
## Fairness Rules
Use these rules throughout the benchmark:
- Run every framework on the same GPU type, GPU count, model weights, tokenizer,
precision, quantization policy, prompt distribution, output length target, and
sampling settings.
- Record framework version, git commit, container image, CUDA/NCCL versions, GPU
driver, visible GPU ids, launch command, and benchmark command.
- Warm the server before measuring. Restart or clear state between candidate
configurations when cache effects would bias the comparison.
- Compare steady-state fixed-QPS runs separately from burst throughput runs.
- Keep failed candidates in the final results with their failure reason.
- Report both raw throughput and SLA-passing throughput. The fastest failing
candidate is not the best deployment command.
## Workflow

### 1. Preflight

Verify all requested frameworks before starting a search:

```bash
python -m sglang.launch_server --help
python -m sglang.bench_serving --help
vllm serve --help
vllm serve --help=all
vllm bench serve --help
vllm bench serve --help=all
vllm bench sweep serve --help=all
trtllm-serve serve --help
python -m tensorrt_llm.serve.scripts.benchmark_serving --help
```
Use the framework-specific `--help` output in the target environment as the
source of truth. Do not keep a stale launch flag just because it appears in an
old note.

vLLM 0.19 and newer use grouped help. Plain `--help` only shows the groups, so
capture `--help=all` before deciding whether a search knob exists.

Save these `--help` outputs into the run artifact directory. If a listed search
knob is missing from the current CLI, remove or translate that knob before
running the benchmark. Do not silently pass unknown flags.
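The "no unknown flags" rule can be enforced mechanically against the saved
help captures. This sketch assumes `help_text` holds a saved help dump; the
flag names are examples, not a validated list:

```python
# Sketch: check planned flags against a captured `--help` output before a run.

def missing_flags(help_text: str, planned_flags: list[str]) -> list[str]:
    """Return planned flags that never appear in the captured help output."""
    return [f for f in planned_flags if f not in help_text]

help_text = "--max-num-seqs N  --gpu-memory-utilization F"
print(missing_flags(help_text, ["--max-num-seqs", "--free_gpu_memory_fraction"]))
# A non-empty result means: remove or translate the knob, do not pass it blindly.
```

A substring check is crude but catches renamed or removed flags before a long
sweep wastes GPU time.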
For TensorRT-LLM, also confirm that `trtllm-serve serve --help` accepts
`--backend pytorch`. If it does not, mark TensorRT-LLM unsupported in that
environment rather than falling back to a different server backend.
For each framework:
- Launch a minimal server.
- Confirm the health endpoint or the framework-native model-info endpoint works.
- Send one streaming request and verify TTFT can be measured.
- Run one tiny benchmark with at least 5 requests.
- Save the launch command, benchmark command, server log, and benchmark output.
Before any GPU-backed smoke run, check the requested GPU ids directly. If a
requested GPU is already in use, stop and record that fact.
Do not silently borrow a different GPU count for a performance comparison. It is
fine to run a smaller one-GPU smoke only when the result is clearly labeled as a
flow check rather than a fair throughput comparison.
If the target environment runs through containers, follow
references/container-runbook.md. Save the
image tags, pull commands, launch commands, server logs, benchmark logs, and
cleanup commands in the artifact directory.
### 2. Normalize The Workload

Use one canonical workload for all frameworks. Recommended JSONL row shape:

```json
{"prompt": [{"role": "user", "content": "Summarize this text."}], "output_len": 256}
{"prompt": "Write a short explanation of CUDA graphs.", "output_len": 128}
```
Optional fields:

```json
{
  "prompt": [{"role": "user", "content": "Use low temperature."}],
  "output_len": 256,
  "extra_request_body": {"temperature": 0.0, "top_p": 0.95},
  "metadata": {"source": "prod-sample"}
}
```
When converting user data:
- inspect at least 3 rows before conversion
- preserve request-level sampling options in `extra_request_body`
- do not include the final assistant answer in the prompt when that answer is
  the target completion
- keep multimodal or tool-call payloads only if all requested frameworks
  support the chosen endpoint shape
For synthetic bring-up, use the shipped two-scenario shape:

```yaml
dataset:
  kind: random
  num_prompts: 80
  scenario_names: [chat, summarization]
  input_len: [1000, 8000]
  output_len: [1000, 1000]
```
Each aligned `input_len`/`output_len` pair is one scenario. Do not take the
cartesian product unless the user asks for that.
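The pairing rule amounts to a positional zip over the config lists above:

```python
# Sketch: expand the dataset block into scenarios by aligned position,
# not by cartesian product.
scenario_names = ["chat", "summarization"]
input_len = [1000, 8000]
output_len = [1000, 1000]

scenarios = [
    {"name": n, "input_len": i, "output_len": o}
    for n, i, o in zip(scenario_names, input_len, output_len)
]
print(scenarios)
```

This yields exactly two scenarios, chat (1000/1000) and summarization
(8000/1000), not four.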
Before searching any sequence-length limit, compute the largest
`input_len + output_len` in the dataset. The SGLang, vLLM, and TensorRT-LLM
maximum sequence-length settings must be at least that value for every
candidate that is expected to run all scenarios.
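The required floor is a one-line reduction over the scenarios:

```python
# Sketch: the smallest sequence-length limit that covers every scenario.
scenarios = [
    {"name": "chat", "input_len": 1000, "output_len": 1000},
    {"name": "summarization", "input_len": 8000, "output_len": 1000},
]
required = max(s["input_len"] + s["output_len"] for s in scenarios)
print(required)  # 9000: every candidate's max sequence length must cover this
```

Any candidate whose sequence-length setting is below `required` cannot run all
scenarios and should be rejected up front.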
### 3. Pick A Search Tier
Use the smallest tier that can answer the user's question:
- Tier 1: smoke and sanity. One baseline plus a few high-impact knobs.
- Tier 2: default. A bounded sweep over the most likely server settings.
- Tier 3: exhaustive. Only when the search space is already tight and the user
accepts a long run.
Default budget:
- the default prompt count for the cross-framework comparison; a smaller count
  per scenario is acceptable for a smoke/flow check and must be labeled as
  such in the artifact (not as a performance result)
- `search.max_candidates_per_framework: 10` for the first useful pass
- candidate generation: baseline first, then a bounded product or ordered
  candidate list from the search plan
- at most 5 QPS search rounds unless the user asks for more
- stop early when every candidate in one framework is clearly OOM or fails the
  basic health check
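The baseline-first, bounded-product generation above can be sketched as
follows. The knob names (`tp`, `chunked_prefill`) are illustrative
placeholders, not validated flags for any framework version:

```python
from itertools import product

# Sketch: baseline-first bounded candidate generation.

def generate_candidates(baseline: dict, grid: dict, cap: int = 10) -> list[dict]:
    """Baseline first, then deduplicated grid points, truncated to `cap`."""
    candidates = [dict(baseline)]
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        cand = {**baseline, **dict(zip(keys, values))}
        if cand not in candidates:
            candidates.append(cand)
        if len(candidates) == cap:
            break
    return candidates

cands = generate_candidates(
    baseline={"tp": 4, "chunked_prefill": 8192},
    grid={"tp": [2, 4], "chunked_prefill": [4096, 8192, 16384]},
    cap=10,
)
print(len(cands), cands[0])
```

Putting the baseline first guarantees it is always run even when the cap
truncates the grid.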
Keep these fixed at their baseline values unless the user specifically wants a
capacity or memory study:
- the SGLang memory-fraction knobs
- the vLLM GPU memory-utilization knob
- TensorRT-LLM `kv_cache_free_gpu_memory_fraction`
These are real knobs, but they widen the search quickly and often turn a serving
comparison into a memory-limit study.
### 4. Tune SGLang

Prefer the SGLang auto-benchmark runner when the target checkout supports it:

```bash
python -m sglang.auto_benchmark run --config /path/to/sglang.yaml
```
Otherwise launch the server manually and benchmark with:

```bash
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 256 \
  --num-prompts 80 \
  --request-rate 8 \
  --output-file /path/to/sglang/results.json \
  --output-details
```
Version-sensitive SGLang knob families to verify:
- parallelism, scheduler, KV-cache, and chunked-prefill settings
- attention-backend settings, including `prefill_attention_backend`
- CUDA graph and piecewise CUDA graph settings
- speculative or EAGLE settings only after the non-speculative baseline is tuned
Keep the memory-fraction and context-length settings pinned in the default
pass, matching the shared cookbook config style.
For quick smoke tests, it is reasonable to disable CUDA graph and piecewise CUDA
graph startup work if the goal is only to prove the framework flow. Record those
flags in the artifact. Do not carry that smoke setting into a performance winner
unless the user asked to tune eager-mode serving.
### 5. Tune vLLM

Use vLLM's sweep runner when available:

```bash
vllm bench sweep serve \
  --serve-cmd 'vllm serve <model> --port 8000' \
  --bench-cmd 'vllm bench serve --backend vllm --model <model> --port 8000 --dataset-name random --num-prompts 80' \
  --serve-params /path/to/vllm_serve_params.json \
  --bench-params /path/to/vllm_bench_params.json \
  --output-dir /path/to/vllm_results
```
If sweep support is unavailable, run `vllm serve` for each candidate and
measure with `vllm bench serve`.
Version-sensitive vLLM knob families to verify:
- tensor, pipeline, data, decode-context, and expert parallelism
- partial prefill limits and DBO thresholds
- KV cache dtype and block size
- dtype and quantization settings
- CUDA graph capture sizes or eager-mode toggles when relevant
- prefix cache and speculative decoding settings only when the workload needs
those features
vLLM should get a normal sweep, not one baseline command. See
references/parameter-coverage.md for the validated flag families. The
historical audit happens to use an H100 host, but the flag-family coverage is
not H100-specific; confirm each flag on the target image's `--help` before a
run.
Keep the GPU memory-utilization setting in the baseline for the default pass.
Search it only when the question is explicitly about fitting the model or
trading capacity against throughput.
Keep DBO and all2all backend settings out of the default pass unless the target
vLLM environment is already set up for them. They are real tuning knobs, but a
candidate can fail at startup if the required all2all backend is not available.
Also preflight concurrent partial prefill before raising
`--max-num-partial-prefills` above 1; some model/runtime combinations reject it
at startup.
### 6. Tune TensorRT-LLM

Use `trtllm-serve serve` as the server entrypoint when the target environment
supports it:

```bash
trtllm-serve serve <model> \
  --backend pytorch \
  --tp_size <tp> \
  --pp_size <pp> \
  --kv_cache_free_gpu_memory_fraction 0.75 \
  --host 0.0.0.0 \
  --port 8000
```
Then benchmark the OpenAI-compatible endpoint with the TensorRT-LLM serving
benchmark client or with the same OpenAI-compatible client used for the other
frameworks.
For TensorRT-LLM 1.0.0, `benchmark_serving --dataset-name random` samples from
ShareGPT unless you pass the synthetic length flags explicitly; verify the
exact flag names in its `--help` before a fast synthetic smoke test.
TensorRT-LLM flag names are especially version-sensitive. In the validated
TensorRT-LLM 1.0.0 image, the KV-cache memory flag accepted by
`trtllm-serve serve` is `--kv_cache_free_gpu_memory_fraction`, not
`--free_gpu_memory_fraction`. Verify this with `trtllm-serve serve --help`
before running a search on any GPU target.
TensorRT-LLM backend policy for this skill:
- launch the server with `--backend pytorch`
- keep the backend pinned in every candidate
- do not add backend choices to the search space
- reject engine-backed serving or any other non-PyTorch TensorRT-LLM server
  backend as unsupported for this skill
Version-sensitive TensorRT-LLM knob families to verify:
- `--tp_size`, `--pp_size`, and related parallelism flags
- max batch size, max sequence length, max number of tokens, and KV-cache budget
- inflight batching and scheduler options
- extra LLM API options YAML used by `trtllm-serve` with the PyTorch backend
The `trtllm-serve` CLI exposes fewer direct runtime knobs than SGLang or vLLM.
Use direct flags when they exist, then use the extra LLM API options YAML for
PyTorch-backend settings that are not top-level CLI flags. Keep unsupported
backend or engine requests in the failure table instead of translating them.
Keep `kv_cache_free_gpu_memory_fraction` in the baseline for the default pass.
Search the parallelism and batching limits and validated PyTorch-backend config
options first. The server backend remains fixed to `pytorch`.
### 7. Normalize Results

Write one JSONL row per candidate using the schema in
references/result-schema.md. Then run:

```bash
python skills/llm-serving-auto-benchmark/scripts/compare_benchmark_results.py \
  --input /path/to/candidates.jsonl \
  --output /path/to/summary.md
```
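One normalized row might look like the sketch below. Every field name here is
illustrative only; references/result-schema.md remains the authoritative
schema:

```python
import json

# Sketch of one normalized per-candidate row (field names are hypothetical).
row = {
    "framework": "sglang",
    "scenario": "chat",
    "server_command": "python -m sglang.launch_server ...",
    "status": "ok",            # failed candidates stay in the file too
    "sla_pass": True,
    "request_throughput": 7.9,
    "output_token_throughput": 2011.0,
    "ttft_p99_ms": 412.0,
    "tpot_p99_ms": 21.5,
}
print(json.dumps(row))
```

One such line per candidate, including failures with a reason, is what the
comparison script consumes.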
Rank candidates in this order:
- SLA passed
- highest request throughput or goodput
- highest output token throughput
- lower p99 TTFT
- lower p99 TPOT/ITL
- lower GPU count or simpler deployment if performance is close
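The ranking order above can be expressed as a single sort key. Metric names
are illustrative and should match whatever the normalized schema actually
uses; goodput could stand in for request throughput:

```python
# Sketch: implement the candidate ranking as a tuple sort key.

def rank_key(c: dict):
    return (
        not c["sla_pass"],               # SLA-passing candidates first
        -c["request_throughput"],        # then highest request throughput
        -c["output_token_throughput"],   # then highest token throughput
        c["ttft_p99_ms"],                # then lower p99 TTFT
        c["tpot_p99_ms"],                # then lower p99 TPOT/ITL
        c.get("gpu_count", 0),           # prefer fewer GPUs when close
    )

candidates = [
    {"sla_pass": False, "request_throughput": 12.0,
     "output_token_throughput": 3000, "ttft_p99_ms": 900, "tpot_p99_ms": 40},
    {"sla_pass": True, "request_throughput": 8.0,
     "output_token_throughput": 2000, "ttft_p99_ms": 400, "tpot_p99_ms": 20},
]
best = min(candidates, key=rank_key)
print(best["sla_pass"])  # True: the fastest failing candidate does not win
```

Because the SLA flag leads the tuple, a slower passing candidate always beats
a faster failing one.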
## Output Contract
Return a compact report with:
- workload and SLA used
- hardware and framework versions
- for each framework, one table listing the best deployment command for each
dataset scenario and all relevant performance metrics
- one cross-framework comparison table for the selected best command per
framework and scenario, including the command, so the deployment choice is
clear for each dataset
- failed or excluded candidates with reasons. Explain that this table is a
  record of tried configs that were not selected: candidates that failed, were
  skipped by policy, or completed but missed the SLA.
- exact launch command and benchmark command for each winner
- artifact paths: canonical workload, raw results JSONL, normalized JSONL, CSV or
markdown summary, and server logs needed to debug winners or failures
- a caveat if the workload was synthetic, if any framework did not complete a
fair search, or if any framework needed framework-specific parameter
substitutions
Use references/framework-matrix.md when you
need command templates or source links for each framework. Use
references/example-plan.yaml as the starting
point for a full cross-framework run plan. Use
references/version-notes.md to understand which
source snapshots informed this skill and what has or has not been smoke-tested.