llm-serving-auto-benchmark
LLM Serving Auto Benchmark
Overview
Use this skill to compare LLM serving frameworks such as SGLang, vLLM, and
TensorRT-LLM for the same model and workload.
Use a config-driven workflow:
- keep launch-only capacity choices in each framework's `base_server_flags`
- put the search knobs in `search_space`
- run the same dataset scenarios for every framework
- generate a bounded candidate list from `search_space`, with the baseline candidate included first
- keep failed candidates in the result file
- pick the best SLA-passing candidate after normalizing the results
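The candidate-generation rule above (baseline first, bounded list) can be sketched in Python. The knob names and the cap value below are illustrative assumptions, not the validator's actual implementation:

```python
from itertools import product

def generate_candidates(baseline, search_space, max_candidates=10):
    """Baseline candidate first, then a bounded product over search_space overrides."""
    candidates = [dict(baseline)]
    knobs = sorted(search_space)
    for values in product(*(search_space[k] for k in knobs)):
        candidate = {**baseline, **dict(zip(knobs, values))}
        if candidate not in candidates:
            candidates.append(candidate)
        if len(candidates) >= max_candidates:
            break
    return candidates

# Hypothetical SGLang-style knobs, for illustration only.
baseline = {"tp_size": 8, "attention_backend": "flashinfer"}
space = {"attention_backend": ["fa3", "flashinfer"],
         "chunked_prefill_size": [4096, 8192]}
cands = generate_candidates(baseline, space, max_candidates=4)
```

The baseline stays at index 0 so a failed sweep still reports a usable reference command.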
For model-specific starting points, prefer the shipped configs in
`configs/cookbook-llm/`. They define a framework-neutral LLM serving cookbook
model set and translate each entry into framework-native SGLang, vLLM, and
TensorRT-LLM server flags. Validate those configs before a real run:

```bash
python skills/llm-serving-auto-benchmark/scripts/validate_cookbook_configs.py \
  skills/llm-serving-auto-benchmark/configs/cookbook-llm
```

If you have captured target-environment `--help` files, add
`--help-dir <artifact-help-dir>`. That check only loads configs, verifies the
server flag names, and renders candidate commands; it does not launch model
servers.
Prefer native tooling when it gives better coverage:
- SGLang: `python -m sglang.auto_benchmark` when available, otherwise `python -m sglang.bench_serving`
- vLLM: `vllm bench sweep serve` for server-parameter sweeps, otherwise `vllm serve` plus `vllm bench serve`
- TensorRT-LLM: `trtllm-serve` for the OpenAI-compatible server, plus the TensorRT-LLM serving benchmark client or a common OpenAI-compatible benchmark client
TensorRT-LLM has one hard scope rule in this skill: the server backend is fixed
to `trtllm-serve serve --backend pytorch`. Do not search the TensorRT-LLM
backend choice. If a request, config, or candidate asks for `trt`, an engine
backend, or any other non-PyTorch TensorRT-LLM server backend, reject that
candidate as unsupported for this skill and record the reason. This does not
change the benchmark client backend; the TensorRT-LLM benchmark client still
uses OpenAI-compatible modes such as `--backend openai` or `--backend openai-chat`.

Only pick a winner after each requested framework has had its main serving knobs
tuned.
The parameter lists in this skill are not a compatibility contract. They are
version-sensitive candidate knob families. Before every real run, record the
exact framework version or git commit and verify the concrete CLI flag names
with `--help` in the target environment.

The default search style is framework-neutral: start from a mostly pure-TP
baseline, sweep a small set of high-impact runtime knobs, and cap the first
pass around 10 candidates per framework. Do not search memory fractions by
default.
Validation Environment
This skill is target-agnostic. It assumes any one of the following is
available, and nothing more:
- a local GPU host with Docker/Podman and the target framework images pulled;
- a remote GPU host reached via `ssh <host>` with the framework images already running in a container there;
- a CI runner that can exec into a pre-built image for each framework.

Do not assume a specific operator host name (`h100_sglang`, `b200_*`,
`radixark*`, `rtx5090_*`, etc.) inside this skill's own workflow. The concrete
SSH wiring, container names, workspace paths, and HF token plumbing for a given
box live in the operator-side per-host skills (for example `h100`,
`h100-sglang-diffusion`, `b200`, `rtx5090`, `radixark02`, `radixark03`); this
skill only requires that the caller can reach a shell inside a container with
`sglang`, `vllm`, or `tensorrt_llm` installed.

Historical validation snapshots in `references/` are evidence of which flag
names and failure modes were seen in specific images and are not a requirement
that the next run happens on the same hardware or framework version.
Skill Scope
This skill is a playbook plus a config+validator toolchain, not a
turn-key orchestrator. The `scripts/` directory contains exactly two tools:
- `validate_cookbook_configs.py`: reads YAML, renders bounded candidate server commands, and checks flag names against captured `--help` snapshots. It never launches a model server.
- `compare_benchmark_results.py`: takes the normalized per-candidate JSONL and emits the markdown tables described in the Output Contract.

Launching servers, driving the workload, and writing one JSONL row per
candidate are the operator's responsibility; the skill tells you how to do
them, and the validator keeps your inputs honest.

The cookbook configs under `configs/cookbook-llm/` and the sample runtime plan
at `references/example-plan.yaml` use related but not identical schemas:
- Cookbook configs carry `schema_version: 1`, `source.kind` set to `llm_serving_cookbook`, nested `frameworks.*.server_command`, and `benchmark.sla`; they must pass `validate_cookbook_configs.py`.
- `example-plan.yaml` is a shorter runtime plan shape with top-level `server_command` and no `sla`. It is the skeleton a caller fills in for a one-off run and is not expected to pass the cookbook validator as-is.

Either shape can feed a benchmark run; the SLA key names in
references/result-schema.md are the single source of truth.
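The two shapes can be told apart mechanically. This is a rough sketch mirroring the description above, not the real validator's logic:

```python
COOKBOOK_REQUIRED = {"schema_version", "source", "frameworks", "benchmark"}

def looks_like_cookbook(cfg: dict) -> bool:
    """Rough shape check: cookbook configs carry the versioned nested schema."""
    return (
        COOKBOOK_REQUIRED <= cfg.keys()
        and cfg.get("schema_version") == 1
        and cfg.get("source", {}).get("kind") == "llm_serving_cookbook"
    )

# A runtime-plan-style dict (top-level server_command, no sla) vs a cookbook-style dict.
plan = {"server_command": "vllm serve my-model", "dataset": {}}
cookbook = {"schema_version": 1,
            "source": {"kind": "llm_serving_cookbook"},
            "frameworks": {}, "benchmark": {"sla": {}}}
```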
Required Inputs
Collect these before starting a long run:
- model path or Hugging Face repo id
- tokenizer path if it differs from the model
- target frameworks: any subset of `sglang`, `vllm`, `tensorrt-llm`
- GPU model, GPU count, and whether multi-node is allowed
- precision and quantization constraints
- endpoint shape: completions, chat completions, responses, or custom
- workload source: real traffic JSONL, ShareGPT, random synthetic, or generated shared-prefix synthetic
- dataset scenarios when synthetic traffic is used, for example `chat` and `summarization`
- SLA target: TTFT, TPOT/ITL, end-to-end latency, success rate, or goodput
- search budget: quick smoke, default search, or exhaustive search
- output directory for logs and result artifacts

Also collect a version manifest:
- framework package version and git commit when available
- container image or Python environment identifier
- `--help` snapshots for the server command and benchmark command
- whether each parameter in the search plan was accepted by that exact CLI

If real production traffic is the goal, use the real request distribution. A
synthetic workload is fine for bring-up and first-pass comparison, but it is not
enough for a production choice.
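The `--help` snapshots in the version manifest can be captured with a small helper; the file-naming scheme here is an assumption of this sketch, and the demo uses the local Python interpreter in place of a framework CLI:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def snapshot_help(commands, artifact_dir):
    """Run each help command and save its stdout+stderr into the artifact dir."""
    artifact_dir = Path(artifact_dir)
    artifact_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for argv in commands:
        out = subprocess.run(argv, capture_output=True, text=True)
        name = "_".join(argv).replace("/", "_").replace(" ", "_") + ".txt"
        path = artifact_dir / name
        path.write_text(out.stdout + out.stderr)
        paths.append(path)
    return paths

# Real usage would pass e.g. [["vllm", "serve", "--help=all"]].
demo_dir = tempfile.mkdtemp()
saved = snapshot_help([[sys.executable, "--version"]], demo_dir)
```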
Known Gotchas
Short list of failure modes that have bitten past validation runs. Check these
before starting a long sweep.
- SGLang `fa3` attention backends need Hopper or newer. On A100, L40S, RTX 5090, and older GPUs, drop `fa3` from the SGLang `search_space` and keep `flashinfer` (or `triton` when FlashInfer is unavailable).
- SGLang `bench_serving` has two SGLang-facing backends: `--backend sglang` for the native `/generate` endpoint and `--backend sglang-oai` for the OpenAI-compatible endpoint. For cross-framework comparisons, prefer `--backend sglang-oai` so every framework is measured on the same request path.
- vLLM `--enable-dbo` only works when the target vLLM image is built with a supported all2all backend. Keep DBO out of the default candidate list unless the operator has verified the image.
- vLLM `--max-num-partial-prefills > 1` is model- and runtime-gated. Keep `1` in the default pass; raise it only after a preflight with the actual model.
- In the validated TensorRT-LLM 1.0.0 image, `trtllm-serve serve` accepts `--kv_cache_free_gpu_memory_fraction`; the older `--free_gpu_memory_fraction` exits with a CLI error. Re-check the accepted flag name via `--help` on the target image before a real run.
- TensorRT-LLM 1.0.0 multi-GPU PyTorch-backend servers need `--ipc=host`, `--ulimit memlock=-1`, `--ulimit stack=67108864`, `--shm-size=16g`, and `NCCL_IB_DISABLE=1` (for single-node) or an equivalent NCCL setup.
- The TensorRT-LLM 1.0.0 benchmark client takes `--backend openai` or `--backend openai-chat`; `--backend trtllm` is rejected. This is separate from the server backend, which is pinned to `pytorch` by this skill.
- TensorRT-LLM `benchmark_serving --dataset-name random` silently falls back to ShareGPT sampling without `--download-path` (or `--random-ids`).
- `max_seq_len`/`max_model_len`/`context_length` candidates must cover `max(input_len + output_len)` across every scenario, including values inside `search_space`, not just the baseline. The validator checks this; do not bypass it.
Secrets Hygiene
- Never print `HF_TOKEN`, `HUGGINGFACE_HUB_TOKEN`, or any upstream API key into a saved artifact. Pass them through container `-e VAR` (unquoted on the right side so the host value is inherited) and keep them out of `server_command` and `benchmark_command` fields written to the result JSONL.
- When a framework echoes the full argv at startup, scrub the log or redact token-shaped substrings before uploading the artifact.
Fairness Rules
Use these rules throughout the benchmark:
- Run every framework on the same GPU type, GPU count, model weights, tokenizer, precision, quantization policy, prompt distribution, output length target, and sampling settings.
- Record framework version, git commit, container image, CUDA/NCCL versions, GPU driver, visible GPU ids, launch command, and benchmark command.
- Warm the server before measuring. Restart or clear state between candidate configurations when cache effects would bias the comparison.
- Compare steady-state fixed-QPS runs separately from burst throughput runs.
- Keep failed candidates in the final results with their failure reason.
- Report both raw throughput and SLA-passing throughput. The fastest failing candidate is not the best deployment command.
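The raw-vs-SLA distinction in the last rule can be made concrete: goodput counts only requests whose metrics meet every SLA bound. A sketch, assuming per-request records with hypothetical `ttft_ms`/`tpot_ms` fields:

```python
def goodput(records, sla, duration_s):
    """Requests per second, counting only successful requests that pass every SLA bound."""
    passed = [
        r for r in records
        if r["success"] and all(r[k] <= bound for k, bound in sla.items())
    ]
    return len(passed) / duration_s

records = [
    {"success": True, "ttft_ms": 120.0, "tpot_ms": 18.0},
    {"success": True, "ttft_ms": 900.0, "tpot_ms": 22.0},  # misses the TTFT bound
    {"success": False, "ttft_ms": 50.0, "tpot_ms": 10.0},  # failed request
]
sla = {"ttft_ms": 500.0, "tpot_ms": 25.0}
```

Raw throughput would report 3 requests over the window; goodput reports only the one that both succeeded and met the SLA.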
Workflow
1. Preflight
Verify all requested frameworks before starting a search:

```bash
python -m sglang.launch_server --help
python -m sglang.bench_serving --help
vllm serve --help
vllm serve --help=all
vllm bench serve --help
vllm bench serve --help=all
vllm bench sweep serve --help=all
trtllm-serve serve --help
python -m tensorrt_llm.serve.scripts.benchmark_serving --help
```

Use the framework-specific `--help` output in the target environment as the
source of truth. Do not keep a stale launch flag just because it appears in an
old note.

vLLM 0.19 and newer use grouped help. Plain `vllm serve --help` only shows the
groups, so capture `--help=all` before deciding whether a search knob exists.

Save these `--help` outputs into the run artifact directory. If a listed search
knob is missing from the current CLI, remove or translate that knob before
running the benchmark. Do not silently pass unknown flags.

For TensorRT-LLM, also confirm that `trtllm-serve serve --help` shows
`--backend pytorch` is accepted. If it does not, mark TensorRT-LLM unsupported
in that environment rather than falling back to a different server backend.

For each framework:
- Launch a minimal server.
- Confirm or the framework-native model-info endpoint works.
/v1/models - Send one streaming request and verify TTFT can be measured.
- Run one tiny benchmark with at least 5 requests.
- Save the launch command, benchmark command, server log, and benchmark output.
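The streaming check above hinges on being able to compute TTFT and inter-token latencies from timestamped token events. This helper is a sketch of the measurement itself, not any framework's benchmark client:

```python
def stream_latencies(request_start, token_times):
    """TTFT is first-token time minus send time; ITLs are the gaps between tokens."""
    ttft = token_times[0] - request_start
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    return ttft, itls

# Timestamps (seconds) recorded when the request was sent and each token arrived.
ttft, itls = stream_latencies(10.0, [10.25, 10.5, 10.75])
```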
Before any GPU-backed smoke run, check the requested GPU ids directly with
`nvidia-smi`. If a requested GPU is already in use, stop and record that fact.
Do not silently borrow a different GPU count for a performance comparison. It is
fine to run a smaller one-GPU smoke only when the result is clearly labeled as a
flow check rather than a fair throughput comparison.

If the target environment runs through containers, follow
references/container-runbook.md. Save the image tags, pull commands, launch
commands, server logs, benchmark logs, and cleanup commands in the artifact
directory.
2. Normalize The Workload
Use one canonical workload for all frameworks. Recommended JSONL row shape:

```json
{"prompt": [{"role": "user", "content": "Summarize this text."}], "output_len": 256}
{"prompt": "Write a short explanation of CUDA graphs.", "output_len": 128}
```

Optional fields:

```json
{
  "prompt": [{"role": "user", "content": "Use low temperature."}],
  "output_len": 256,
  "extra_request_body": {"temperature": 0.0, "top_p": 0.95},
  "metadata": {"source": "prod-sample"}
}
```

When converting user data:
- inspect at least 3 rows before conversion
- preserve request-level sampling options in `extra_request_body`
- do not include the final assistant answer in the prompt when that answer is the target completion
- keep multimodal or tool-call payloads only if all requested frameworks support the chosen endpoint shape
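A conversion following those rules might look like this sketch. The source-row field names (`messages`, `temperature`, `top_p`) are hypothetical; map your own data's fields accordingly:

```python
import json

def to_canonical(row, default_output_len=256):
    """Map a hypothetical source row onto the canonical JSONL row shape."""
    msgs = list(row["messages"])
    # The final assistant turn is the target completion, not part of the prompt.
    if msgs and msgs[-1]["role"] == "assistant":
        msgs = msgs[:-1]
    out = {"prompt": msgs,
           "output_len": row.get("output_len", default_output_len)}
    # Preserve request-level sampling options in extra_request_body.
    sampling = {k: row[k] for k in ("temperature", "top_p") if k in row}
    if sampling:
        out["extra_request_body"] = sampling
    return out

row = {"messages": [{"role": "user", "content": "hi"},
                    {"role": "assistant", "content": "hello"}],
       "temperature": 0.0}
line = json.dumps(to_canonical(row))
```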
For synthetic bring-up, use the shipped two-scenario shape:

```yaml
dataset:
  kind: random
  num_prompts: 80
  scenario_names: [chat, summarization]
  input_len: [1000, 8000]
  output_len: [1000, 1000]
```

Each aligned `input_len`/`output_len` pair is one scenario. Do not take the
cartesian product unless the user asks for that.

Before searching any sequence-length limit, compute the largest
`input_len + output_len` in the dataset. SGLang `context_length`, vLLM
`max_model_len`, and TensorRT-LLM `max_seq_len` must be at least that value for
every candidate that is expected to run all scenarios.
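That check is direct to state in code; the values below come from the YAML shape above, and the flag names quoted in the comment are the framework-specific limits from the text:

```python
def required_context_len(input_lens, output_lens):
    """Largest input_len + output_len over aligned scenario pairs (not a product)."""
    return max(i + o for i, o in zip(input_lens, output_lens))

need = required_context_len([1000, 8000], [1000, 1000])
# Every candidate's context_length / max_model_len / max_seq_len must be >= need,
# including candidate values generated from search_space, not just the baseline.
```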
3. Pick A Search Tier
Use the smallest tier that can answer the user's question:
- Tier 1: smoke and sanity. One baseline plus a few high-impact knobs.
- Tier 2: default. A bounded sweep over the most likely server settings.
- Tier 3: exhaustive. Only when the search space is already tight and the user accepts a long run.

Default budget:
- `num_prompts: 80` for the default cross-framework comparison; `num_prompts: 20` per scenario is acceptable for a smoke/flow check and must be labeled as such in the artifact (not as a performance result).
- `search.max_candidates_per_framework: 10` for the first useful pass
- candidate generation: baseline first, then a bounded product or ordered candidate list from `search_space`
- at most 5 QPS search rounds unless the user asks for more
- stop early when every candidate in one framework is clearly OOM or fails the basic health check

Keep these in `base_server_flags` unless the user specifically wants a capacity
or memory study:
- SGLang `mem_fraction_static`
- SGLang `schedule_policy`
- vLLM `gpu_memory_utilization`
- TensorRT-LLM `kv_cache_free_gpu_memory_fraction`

These are real knobs, but they widen the search quickly and often turn a serving
comparison into a memory-limit study.
4. Tune SGLang
Prefer the SGLang auto-benchmark runner when the target checkout supports it:

```bash
python -m sglang.auto_benchmark run --config /path/to/sglang.yaml
```

Otherwise launch the server manually and benchmark with:

```bash
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 256 \
  --num-prompts 80 \
  --request-rate 8 \
  --output-file /path/to/sglang/results.json \
  --output-details
```

Version-sensitive SGLang knob families to verify:
- `tp_size`, `pp_size`, `dp_size`, `ep_size`
- `attention_backend`, `prefill_attention_backend`, `decode_attention_backend`, `sampling_backend`
- `max_running_requests`, `max_queued_requests`
- `chunked_prefill_size`, `prefill_max_requests`, `max_prefill_tokens`
- `max_total_tokens`, `page_size`
- CUDA graph and piecewise CUDA graph settings
- speculative or EAGLE settings only after the non-speculative baseline is tuned

Keep `mem_fraction_static` and `schedule_policy` pinned in the default pass,
matching the shared cookbook config style.

For quick smoke tests, it is reasonable to disable CUDA graph and piecewise CUDA
graph startup work if the goal is only to prove the framework flow. Record those
flags in the artifact. Do not carry that smoke setting into a performance winner
unless the user asked to tune eager-mode serving.
5. Tune vLLM
Use vLLM's sweep runner when available:

```bash
vllm bench sweep serve \
  --serve-cmd 'vllm serve <model> --port 8000' \
  --bench-cmd 'vllm bench serve --backend vllm --model <model> --port 8000 --dataset-name random --num-prompts 80' \
  --serve-params /path/to/vllm_serve_params.json \
  --bench-params /path/to/vllm_bench_params.json \
  --output-dir /path/to/vllm_results
```

If sweep support is unavailable, run `vllm serve` for each candidate and measure
with `vllm bench serve`.

Version-sensitive vLLM knob families to verify:
- tensor, pipeline, data, decode-context, and expert parallelism
- `gpu_memory_utilization`, `max_num_seqs`, `max_num_batched_tokens`, `max_model_len`
- `enable_chunked_prefill`, partial prefill limits, and DBO thresholds
- KV cache dtype and block size
- dtype and quantization settings
- CUDA graph capture sizes or eager-mode toggles when relevant
- prefix cache and speculative decoding settings only when the workload needs those features

vLLM should get a normal sweep, not one baseline command. See
references/parameter-coverage.md for the validated flag families. The
historical audit happens to use an H100 host, but the flag-family coverage is
not H100-specific; confirm each flag on the target image's `--help` before a
run.

Keep `gpu_memory_utilization` in the baseline for the default pass. Search it
only when the question is explicitly about fitting the model or trading capacity
against throughput.

Keep DBO and all2all backend settings out of the default pass unless the target
vLLM environment is already set up for them. They are real tuning knobs, but a
candidate can fail at startup if the required all2all backend is not available.
Also preflight concurrent partial prefill before raising
`max_num_partial_prefills` above 1; some model/runtime combinations reject it at
startup.
max_num_partial_prefills若可用,请使用vLLM的扫描运行器:
bash
vllm bench sweep serve \
--serve-cmd 'vllm serve <model> --port 8000' \
--bench-cmd 'vllm bench serve --backend vllm --model <model> --port 8000 --dataset-name random --num-prompts 80' \
--serve-params /path/to/vllm_serve_params.json \
--bench-params /path/to/vllm_bench_params.json \
--output-dir /path/to/vllm_results若不支持扫描功能,请为每个候选运行并使用进行测量。
vllm servevllm bench serve需验证的与版本相关的vLLM参数组:
- 张量、流水线、数据、解码上下文和专家并行
gpu_memory_utilizationmax_num_seqsmax_num_batched_tokensmax_model_len- 、部分预填充限制和DBO阈值
enable_chunked_prefill - KV缓存数据类型和块大小
- 数据类型和量化设置
- 相关时的CUDA图捕获大小或 eager 模式切换
- 仅当工作负载需要时,才考虑前缀缓存和推测解码设置
vLLM应进行常规扫描,而非仅使用一个基准命令。请查看references/parameter-coverage.md获取已验证的参数组。历史审计使用了H100主机,但参数组覆盖并非H100专属;运行前请确认目标镜像中的每个参数。
--help默认轮次中,请将保留在基准中。仅当问题明确涉及模型适配或吞吐量与容量权衡时,才搜索该参数。
gpu_memory_utilization除非目标vLLM环境已配置完成,否则请勿将DBO和all2all后端设置加入默认轮次。它们是真实的调优参数,但如果所需的all2all后端不可用,候选可能在启动时失败。此外,在将提高到1以上前,请预检查并发部分预填充;某些模型/运行时组合会在启动时拒绝该设置。
6. Tune TensorRT-LLM
Use `trtllm-serve serve` as the server entrypoint when the target environment
supports it:

```bash
trtllm-serve serve <model> \
  --backend pytorch \
  --tp_size <tp> \
  --pp_size <pp> \
  --kv_cache_free_gpu_memory_fraction 0.75 \
  --host 0.0.0.0 \
  --port 8000
```

Then benchmark the OpenAI-compatible endpoint with the TensorRT-LLM serving
benchmark client or with the same OpenAI-compatible client used for the other
frameworks.

For TensorRT-LLM 1.0.0, `benchmark_serving --dataset-name random` samples from
ShareGPT unless you pass either `--download-path` or `--random-ids`. For a fast
synthetic smoke test, pass `--random-ids`.

TensorRT-LLM flag names are especially version-sensitive. In the validated
TensorRT-LLM 1.0.0 image, the KV-cache memory flag accepted by
`trtllm-serve serve` is `--kv_cache_free_gpu_memory_fraction`, not
`--free_gpu_memory_fraction`. Verify this with `trtllm-serve serve --help`
before running a search on any GPU target.

TensorRT-LLM backend policy for this skill:
- launch the server with `--backend pytorch`
- keep `backend: pytorch` in `base_server_flags`
- do not add `backend` to `search_space`
- reject `trt`, engine-backed serving, or any other non-PyTorch TensorRT-LLM server backend as unsupported for this skill

Version-sensitive TensorRT-LLM knob families to verify:
- `tp_size`, `pp_size`, and `ep_size`
- max batch size, max sequence length, max number of tokens, and KV-cache budget
- inflight batching and scheduler options
- extra LLM API options YAML used by `trtllm-serve` with the PyTorch backend

The `trtllm-serve serve` CLI exposes fewer direct runtime knobs than SGLang or
vLLM. Use direct flags when they exist, then use `--extra_llm_api_options` for
PyTorch-backend settings that are not top-level CLI flags. Keep unsupported
backend or engine requests in the failure table instead of translating them.

Keep `kv_cache_free_gpu_memory_fraction` in the baseline for the default pass.
Search `max_batch_size`, `max_num_tokens`, `max_seq_len`, and validated
PyTorch-backend config options first. The server backend remains fixed to
`pytorch`.
7. Normalize Results
Write one JSONL row per candidate using the schema in
references/result-schema.md. Then run:

```bash
python skills/llm-serving-auto-benchmark/scripts/compare_benchmark_results.py \
  --input /path/to/candidates.jsonl \
  --output /path/to/summary.md
```

Rank candidates in this order:
- SLA passed
- highest request throughput or goodput
- highest output token throughput
- lower p99 TTFT
- lower p99 TPOT/ITL
- lower GPU count or simpler deployment if performance is close
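That ordering can be expressed as one sort key over normalized rows. The field names below are assumptions of this sketch (they loosely follow references/result-schema.md, which remains the source of truth):

```python
def rank_key(row):
    """Sort best-first: SLA pass, then high throughput, then low tail latency."""
    return (
        not row["sla_passed"],               # passing candidates sort first
        -row["request_throughput"],
        -row["output_token_throughput"],
        row["p99_ttft_ms"],
        row["p99_tpot_ms"],
        row["gpu_count"],                    # simpler deployment breaks ties
    )

rows = [
    {"sla_passed": False, "request_throughput": 12.0, "output_token_throughput": 9000,
     "p99_ttft_ms": 300, "p99_tpot_ms": 20, "gpu_count": 8},
    {"sla_passed": True, "request_throughput": 10.0, "output_token_throughput": 8000,
     "p99_ttft_ms": 400, "p99_tpot_ms": 25, "gpu_count": 8},
]
best = sorted(rows, key=rank_key)[0]
```

Note that the faster candidate loses here because it misses the SLA, which matches the rule that the fastest failing candidate is not the best deployment command.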
Output Contract
Return a compact report with:
- workload and SLA used
- hardware and framework versions
- for each framework, one table listing the best deployment command for each dataset scenario and all relevant performance metrics
- one cross-framework comparison table for the selected best command per framework and scenario, including the command, so the deployment choice is clear for each dataset
- failed or excluded candidates with reasons. Explain that this table is a record of tried configs that were not selected: candidates that failed, were skipped by policy, or completed but missed the SLA.
- exact launch command and benchmark command for each winner
- artifact paths: canonical workload, raw results JSONL, normalized JSONL, CSV or markdown summary, and server logs needed to debug winners or failures
- a caveat if the workload was synthetic, if any framework did not complete a fair search, or if any framework needed framework-specific parameter substitutions
Use references/framework-matrix.md when you
need command templates or source links for each framework. Use
references/example-plan.yaml as the starting
point for a full cross-framework run plan. Use
references/version-notes.md to understand which
source snapshots informed this skill and what has or has not been smoke-tested.