# LLM Serving Auto Benchmark

## Overview

Use this skill to compare LLM serving frameworks such as SGLang, vLLM, and TensorRT-LLM on the same model and workload.

Use a config-driven workflow:

- keep launch-only capacity choices in each framework's `base_server_flags`
- put the search knobs in `search_space`
- run the same dataset scenarios for every framework
- generate a bounded candidate list from `search_space`, with the baseline candidate included first
- keep failed candidates in the result file
- pick the best SLA-passing candidate after normalizing the results

For model-specific starting points, prefer the shipped configs in `configs/cookbook-llm/`. They define a framework-neutral LLM serving cookbook model set and translate each entry into framework-native SGLang, vLLM, and TensorRT-LLM server flags. Validate those configs before a real run:

```bash
python skills/llm-serving-auto-benchmark/scripts/validate_cookbook_configs.py \
  skills/llm-serving-auto-benchmark/configs/cookbook-llm
```

If you have captured target-environment `--help` files, add `--help-dir <artifact-help-dir>`. That check only loads configs, verifies the server flag names, and renders candidate commands; it does not launch model servers.

Prefer native tooling when it gives better coverage:

- SGLang: `python -m sglang.auto_benchmark` when available, otherwise `python -m sglang.bench_serving`
- vLLM: `vllm bench sweep serve` for server-parameter sweeps, otherwise `vllm serve` plus `vllm bench serve`
- TensorRT-LLM: `trtllm-serve` for the OpenAI-compatible server, plus the TensorRT-LLM serving benchmark client or a common OpenAI-compatible benchmark client

TensorRT-LLM has one hard scope rule in this skill: the server backend is fixed to `trtllm-serve serve --backend pytorch`. Do not search the TensorRT-LLM backend choice. If a request, config, or candidate asks for `trt`, an engine backend, or any other non-PyTorch TensorRT-LLM server backend, reject that candidate as unsupported for this skill and record the reason. This does not change the benchmark client backend; the TensorRT-LLM benchmark client still uses OpenAI-compatible modes such as `--backend openai` or `--backend openai-chat`.

Only pick a winner after each requested framework has had its main serving knobs tuned.

The parameter lists in this skill are not a compatibility contract. They are version-sensitive candidate knob families. Before every real run, record the exact framework version or git commit and verify the concrete CLI flag names with `--help` in the target environment.

The default search style is framework-neutral: start from a mostly pure-TP baseline, sweep a small set of high-impact runtime knobs, and cap the first pass around 10 candidates per framework. Do not search memory fractions by default.

## Validation Environment

This skill is target-agnostic. It assumes any one of the following is available, and nothing more:

- a local GPU host with Docker/Podman and the target framework images pulled;
- a remote GPU host reached via `ssh <host>` with the framework images already running in a container there;
- a CI runner that can exec into a pre-built image for each framework.

Do not assume a specific operator host name (`h100_sglang`, `b200_*`, `radixark*`, `rtx5090_*`, etc.) inside this skill's own workflow. The concrete SSH wiring, container names, workspace paths, and HF token plumbing for a given box live in the operator-side per-host skills (for example `h100`, `h100-sglang-diffusion`, `b200`, `rtx5090`, `radixark02`, `radixark03`); this skill only requires that the caller can reach a shell inside a container with `sglang`, `vllm`, or `tensorrt_llm` installed.

Historical validation snapshots in `references/` are evidence of which flag names and failure modes were seen in specific images; they are not a requirement that the next run happen on the same hardware or framework version.

## Skill Scope

This skill is a playbook plus a config+validator toolchain, not a turn-key orchestrator. The `scripts/` directory contains exactly two tools:

- `validate_cookbook_configs.py`: reads YAML, renders bounded candidate server commands, and checks flag names against captured `--help` snapshots. It never launches a model server.
- `compare_benchmark_results.py`: takes the normalized per-candidate JSONL and emits the markdown tables described in the Output Contract.

Launching servers, driving the workload, and writing one JSONL row per candidate are the operator's responsibility; the skill tells you how to do them, and the validator keeps your inputs honest.

The cookbook configs under `configs/cookbook-llm/` and the sample runtime plan at `references/example-plan.yaml` use related but not identical schemas:

- Cookbook configs carry `schema_version: 1`, `source.kind` set to `llm_serving_cookbook`, a nested `benchmark.sla`, and `frameworks.*.server_command`; they must pass `validate_cookbook_configs.py`.
- `example-plan.yaml` is a shorter runtime-plan shape with a top-level `sla` and no `server_command`. It is the skeleton a caller fills in for a one-off run and is not expected to pass the cookbook validator as-is.

Either shape can feed a benchmark run; the SLA key names in `references/result-schema.md` are the single source of truth.

## Required Inputs

Collect these before starting a long run:

- model path or Hugging Face repo id
- tokenizer path if it differs from the model
- target frameworks: any subset of `sglang`, `vllm`, `tensorrt-llm`
- GPU model, GPU count, and whether multi-node is allowed
- precision and quantization constraints
- endpoint shape: completions, chat completions, responses, or custom
- workload source: real traffic JSONL, ShareGPT, random synthetic, or generated shared-prefix synthetic
- dataset scenarios when synthetic traffic is used, for example `chat` and `summarization`
- SLA target: TTFT, TPOT/ITL, end-to-end latency, success rate, or goodput
- search budget: quick smoke, default search, or exhaustive search
- output directory for logs and result artifacts

Also collect a version manifest:

- framework package version and git commit when available
- container image or Python environment identifier
- `--help` snapshots for the server command and benchmark command
- whether each parameter in the search plan was accepted by that exact CLI

If real production traffic is the goal, use the real request distribution. A synthetic workload is fine for bring-up and first-pass comparison, but it is not enough for a production choice.

## Known Gotchas

A short list of failure modes that have bitten past validation runs. Check these before starting a long sweep.

- The SGLang `fa3` attention backend needs Hopper or newer. On A100, L40S, RTX 5090, and older GPUs, drop `fa3` from the SGLang `search_space` and keep `flashinfer` (or `triton` when FlashInfer is unavailable).
- SGLang `bench_serving` has two SGLang-facing backends: `--backend sglang` for the native `/generate` endpoint and `--backend sglang-oai` for the OpenAI-compatible endpoint. For cross-framework comparisons, prefer `sglang-oai` so every framework is measured on the same request path.
- vLLM `--enable-dbo` only works when the target vLLM image is built with a supported all2all backend. Keep DBO out of the default candidate list unless the operator has verified the image.
- vLLM `--max-num-partial-prefills > 1` is model- and runtime-gated. Keep `1` in the default pass; raise it only after a preflight with the actual model.
- In the validated TensorRT-LLM 1.0.0 image, `trtllm-serve serve` accepts `--kv_cache_free_gpu_memory_fraction`; the older `--free_gpu_memory_fraction` exits with a CLI error. Re-check the accepted flag name via `--help` on the target image before a real run.
- TensorRT-LLM 1.0.0 multi-GPU PyTorch-backend servers need `--ipc=host`, `--ulimit memlock=-1`, `--ulimit stack=67108864`, `--shm-size=16g`, and `NCCL_IB_DISABLE=1` (for single-node) or an equivalent NCCL setup.
- The TensorRT-LLM 1.0.0 benchmark client takes `--backend openai` or `--backend openai-chat`; `--backend trtllm` is rejected. This is separate from the server backend, which is pinned to `pytorch` by this skill.
- The trtllm `benchmark_serving --dataset-name random` mode silently falls back to ShareGPT sampling without `--random-ids` (or `--download-path`).
- `max_seq_len` / `max_model_len` / `context_length` candidates must cover `max(input_len + output_len)` across every scenario, including values inside `search_space`, not just the baseline. The validator checks this; do not bypass it.

## Secrets Hygiene

- Never print `HF_TOKEN`, `HUGGINGFACE_HUB_TOKEN`, or any upstream API key into a saved artifact. Pass them through container `-e VAR` (unquoted on the right side so the host value is inherited) and keep them out of the `server_command` and `benchmark_command` fields written to the result JSONL.
- When a framework echoes the full argv at startup, scrub the log or redact token-shaped substrings before uploading the artifact.
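The argv-scrubbing step can be automated before logs leave the host. A minimal sketch, assuming Hugging Face-style tokens prefixed with `hf_` and a generic `sk-`-style API-key shape; both patterns are illustrative, so extend them for whatever secret formats your environment actually uses:

```python
import re

# Token-shaped substrings to redact before saving logs as artifacts.
# These patterns are illustrative assumptions, not an exhaustive list.
_SECRET_PATTERNS = [
    re.compile(r"hf_[A-Za-z0-9]{10,}"),    # Hugging Face-style tokens
    re.compile(r"sk-[A-Za-z0-9_-]{10,}"),  # generic API-key shape
]

def redact_secrets(text: str) -> str:
    """Replace token-shaped substrings with a fixed placeholder."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

Run server logs through a filter like this before copying them into the artifact directory.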

## Fairness Rules

Use these rules throughout the benchmark:

- Run every framework on the same GPU type, GPU count, model weights, tokenizer, precision, quantization policy, prompt distribution, output length target, and sampling settings.
- Record framework version, git commit, container image, CUDA/NCCL versions, GPU driver, visible GPU ids, launch command, and benchmark command.
- Warm the server before measuring. Restart or clear state between candidate configurations when cache effects would bias the comparison.
- Compare steady-state fixed-QPS runs separately from burst throughput runs.
- Keep failed candidates in the final results with their failure reason.
- Report both raw throughput and SLA-passing throughput. The fastest failing candidate is not the best deployment command.
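The raw-versus-SLA-passing distinction can be made concrete. A small sketch, assuming per-request records carrying an illustrative `latency_ok` flag from the SLA check (the real key names live in references/result-schema.md):

```python
def throughputs(requests, duration_s):
    """Return (raw request throughput, SLA-passing goodput) in req/s.

    `requests` is a list of dicts with an illustrative `latency_ok` flag;
    goodput counts only requests that met the SLA.
    """
    raw = len(requests) / duration_s
    goodput = sum(1 for r in requests if r["latency_ok"]) / duration_s
    return raw, goodput
```

Report both numbers per candidate; ranking on raw throughput alone can crown a command that never meets the SLA.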

## Workflow

### 1. Preflight

Verify all requested frameworks before starting a search:

```bash
python -m sglang.launch_server --help
python -m sglang.bench_serving --help
vllm serve --help
vllm serve --help=all
vllm bench serve --help
vllm bench serve --help=all
vllm bench sweep serve --help=all
trtllm-serve serve --help
python -m tensorrt_llm.serve.scripts.benchmark_serving --help
```

Use the framework-specific `--help` output in the target environment as the source of truth. Do not keep a stale launch flag just because it appears in an old note.

vLLM 0.19 and newer use grouped help. Plain `vllm serve --help` only shows the groups, so capture `--help=all` before deciding whether a search knob exists.

Save these `--help` outputs into the run artifact directory. If a listed search knob is missing from the current CLI, remove or translate that knob before running the benchmark. Do not silently pass unknown flags.

For TensorRT-LLM, also confirm that `trtllm-serve serve --help` accepts `--backend pytorch`. If it does not, mark TensorRT-LLM unsupported in that environment rather than falling back to a different server backend.

For each framework:

1. Launch a minimal server.
2. Confirm `/v1/models` or the framework-native model-info endpoint works.
3. Send one streaming request and verify TTFT can be measured.
4. Run one tiny benchmark with at least 5 requests.
5. Save the launch command, benchmark command, server log, and benchmark output.

Before any GPU-backed smoke run, check the requested GPU ids directly with `nvidia-smi`. If a requested GPU is already in use, stop and record that fact. Do not silently borrow a different GPU count for a performance comparison. It is fine to run a smaller one-GPU smoke only when the result is clearly labeled as a flow check rather than a fair throughput comparison.

If the target environment runs through containers, follow `references/container-runbook.md`. Save the image tags, pull commands, launch commands, server logs, benchmark logs, and cleanup commands in the artifact directory.
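The GPU-availability check above can be scripted. A sketch that parses `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits` output and flags GPUs that already have significant memory allocated; the 1 GiB threshold is an arbitrary assumption, not a skill requirement:

```python
import subprocess

BUSY_MIB = 1024  # assumed threshold: >1 GiB already used means the GPU is taken

def query_nvidia_smi() -> str:
    """Capture per-GPU memory usage as CSV text (run on the target host)."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout

def parse_gpu_usage(csv_text: str) -> dict:
    """Map GPU index -> used memory in MiB from the CSV output."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, mem_used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(mem_used)
    return usage

def busy_gpus(requested_ids, csv_text) -> list:
    """Return the requested GPU ids that already have memory allocated."""
    usage = parse_gpu_usage(csv_text)
    return [i for i in requested_ids if usage.get(i, 0) > BUSY_MIB]
```

If `busy_gpus` returns a non-empty list, stop and record the conflict in the artifact directory rather than borrowing other GPUs.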

### 2. Normalize The Workload

Use one canonical workload for all frameworks. Recommended JSONL row shape:

```json
{"prompt": [{"role": "user", "content": "Summarize this text."}], "output_len": 256}
{"prompt": "Write a short explanation of CUDA graphs.", "output_len": 128}
```

Optional fields:

```json
{
  "prompt": [{"role": "user", "content": "Use low temperature."}],
  "output_len": 256,
  "extra_request_body": {"temperature": 0.0, "top_p": 0.95},
  "metadata": {"source": "prod-sample"}
}
```

When converting user data:

- inspect at least 3 rows before conversion
- preserve request-level sampling options in `extra_request_body`
- do not include the final assistant answer in the prompt when that answer is the target completion
- keep multimodal or tool-call payloads only if all requested frameworks support the chosen endpoint shape
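The conversion rules above can be sketched as a row converter. This assumes an illustrative raw shape with `messages`, `output_len`, and `sampling` fields; real traffic dumps will differ, which is why the checklist starts with inspecting rows:

```python
def to_canonical_row(raw: dict) -> dict:
    """Convert one illustrative raw traffic record to the canonical JSONL shape.

    Drops a trailing assistant turn (it is the target completion) and keeps
    request-level sampling options under extra_request_body.
    """
    messages = raw["messages"]
    # Do not include the final assistant answer in the prompt.
    if messages and messages[-1]["role"] == "assistant":
        messages = messages[:-1]
    row = {"prompt": messages, "output_len": raw["output_len"]}
    if raw.get("sampling"):
        row["extra_request_body"] = raw["sampling"]
    return row
```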
For synthetic bring-up, use the shipped two-scenario shape:

```yaml
dataset:
  kind: random
  num_prompts: 80
  scenario_names: [chat, summarization]
  input_len: [1000, 8000]
  output_len: [1000, 1000]
```

Each aligned `input_len`/`output_len` pair is one scenario. Do not take the cartesian product unless the user asks for that.

Before searching any sequence-length limit, compute the largest `input_len + output_len` in the dataset. SGLang `context_length`, vLLM `max_model_len`, and TensorRT-LLM `max_seq_len` must be at least that value for every candidate that is expected to run all scenarios.
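The sequence-length floor can be computed directly from the scenario lists. A minimal sketch of the check the validator performs, using the aligned-pair convention from the YAML above:

```python
def required_seq_len(input_lens, output_lens):
    """Smallest max_seq_len / max_model_len / context_length that covers
    every aligned (input_len, output_len) scenario pair."""
    return max(i + o for i, o in zip(input_lens, output_lens))

def covers_all_scenarios(candidate_limit, input_lens, output_lens):
    """True when a candidate sequence-length limit can run all scenarios."""
    return candidate_limit >= required_seq_len(input_lens, output_lens)
```

For the shipped shape above, `required_seq_len([1000, 8000], [1000, 1000])` is 9000, so an 8192 context-length candidate cannot run the summarization scenario.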

### 3. Pick A Search Tier

Use the smallest tier that can answer the user's question:

- Tier 1: smoke and sanity. One baseline plus a few high-impact knobs.
- Tier 2: default. A bounded sweep over the most likely server settings.
- Tier 3: exhaustive. Only when the search space is already tight and the user accepts a long run.

Default budget:

- `num_prompts: 80` for the default cross-framework comparison; `num_prompts: 20` per scenario is acceptable for a smoke/flow check and must be labeled as such in the artifact (not as a performance result).
- `search.max_candidates_per_framework: 10` for the first useful pass
- candidate generation: baseline first, then a bounded product or ordered candidate list from `search_space`
- at most 5 QPS search rounds unless the user asks for more
- stop early when every candidate in one framework is clearly OOM or fails the basic health check
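The candidate-generation rule (baseline first, bounded product, hard cap) can be sketched as follows, assuming `search_space` maps flag names to value lists as in the cookbook configs:

```python
from itertools import product

def generate_candidates(baseline: dict, search_space: dict,
                        max_candidates: int = 10) -> list:
    """Baseline-first bounded candidate list from a search_space mapping.

    Each candidate is the baseline server flags with one combination of
    search_space values overlaid; the untouched baseline always comes
    first, and the list is capped at max_candidates.
    """
    candidates = [dict(baseline)]
    keys = sorted(search_space)
    for values in product(*(search_space[k] for k in keys)):
        candidate = {**baseline, **dict(zip(keys, values))}
        if candidate not in candidates:
            candidates.append(candidate)
        if len(candidates) >= max_candidates:
            break
    return candidates
```

Keeping the baseline first means the comparison tables always contain an untuned reference point even when the cap truncates the sweep.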
Keep these in `base_server_flags` unless the user specifically wants a capacity or memory study:

- SGLang `mem_fraction_static`
- SGLang `schedule_policy`
- vLLM `gpu_memory_utilization`
- TensorRT-LLM `kv_cache_free_gpu_memory_fraction`

These are real knobs, but they widen the search quickly and often turn a serving comparison into a memory-limit study.

### 4. Tune SGLang

Prefer the SGLang auto-benchmark runner when the target checkout supports it:

```bash
python -m sglang.auto_benchmark run --config /path/to/sglang.yaml
```

Otherwise launch the server manually and benchmark with:

```bash
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 256 \
  --num-prompts 80 \
  --request-rate 8 \
  --output-file /path/to/sglang/results.json \
  --output-details
```

Version-sensitive SGLang knob families to verify:

- `tp_size`, `pp_size`, `dp_size`, `ep_size`
- `attention_backend`, `prefill_attention_backend`, `decode_attention_backend`
- `sampling_backend`
- `max_running_requests`, `max_queued_requests`
- `chunked_prefill_size`, `prefill_max_requests`, `max_prefill_tokens`
- `max_total_tokens`, `page_size`
- CUDA graph and piecewise CUDA graph settings
- speculative or EAGLE settings only after the non-speculative baseline is tuned

Keep `mem_fraction_static` and `schedule_policy` pinned in the default pass, matching the shared cookbook config style.

For quick smoke tests, it is reasonable to disable CUDA graph and piecewise CUDA graph startup work if the goal is only to prove the framework flow. Record those flags in the artifact. Do not carry that smoke setting into a performance winner unless the user asked to tune eager-mode serving.

### 5. Tune vLLM

Use vLLM's sweep runner when available:

```bash
vllm bench sweep serve \
  --serve-cmd 'vllm serve <model> --port 8000' \
  --bench-cmd 'vllm bench serve --backend vllm --model <model> --port 8000 --dataset-name random --num-prompts 80' \
  --serve-params /path/to/vllm_serve_params.json \
  --bench-params /path/to/vllm_bench_params.json \
  --output-dir /path/to/vllm_results
```

If sweep support is unavailable, run `vllm serve` for each candidate and measure with `vllm bench serve`.

Version-sensitive vLLM knob families to verify:

- tensor, pipeline, data, decode-context, and expert parallelism
- `gpu_memory_utilization`
- `max_num_seqs`
- `max_num_batched_tokens`
- `max_model_len`
- `enable_chunked_prefill`, partial prefill limits, and DBO thresholds
- KV cache dtype and block size
- dtype and quantization settings
- CUDA graph capture sizes or eager-mode toggles when relevant
- prefix cache and speculative decoding settings only when the workload needs those features

vLLM should get a normal sweep, not one baseline command. See `references/parameter-coverage.md` for the validated flag families. The historical audit happens to use an H100 host, but the flag-family coverage is not H100-specific; confirm each flag on the target image's `--help` before a run.

Keep `gpu_memory_utilization` in the baseline for the default pass. Search it only when the question is explicitly about fitting the model or trading capacity against throughput.

Keep DBO and all2all backend settings out of the default pass unless the target vLLM environment is already set up for them. They are real tuning knobs, but a candidate can fail at startup if the required all2all backend is not available. Also preflight concurrent partial prefill before raising `max_num_partial_prefills` above 1; some model/runtime combinations reject it at startup.

### 6. Tune TensorRT-LLM

Use `trtllm-serve serve` as the server entrypoint when the target environment supports it:

```bash
trtllm-serve serve <model> \
  --backend pytorch \
  --tp_size <tp> \
  --pp_size <pp> \
  --kv_cache_free_gpu_memory_fraction 0.75 \
  --host 0.0.0.0 \
  --port 8000
```

Then benchmark the OpenAI-compatible endpoint with the TensorRT-LLM serving benchmark client or with the same OpenAI-compatible client used for the other frameworks.

For TensorRT-LLM 1.0.0, `benchmark_serving --dataset-name random` samples from ShareGPT unless you pass either `--download-path` or `--random-ids`. For a fast synthetic smoke test, pass `--random-ids`.

TensorRT-LLM flag names are especially version-sensitive. In the validated TensorRT-LLM 1.0.0 image, the KV-cache memory flag accepted by `trtllm-serve serve` is `--kv_cache_free_gpu_memory_fraction`, not `--free_gpu_memory_fraction`. Verify this with `trtllm-serve serve --help` before running a search on any GPU target.

TensorRT-LLM backend policy for this skill:

- launch the server with `--backend pytorch`
- keep `backend: pytorch` in `base_server_flags`
- do not add `backend` to `search_space`
- reject `trt`, engine-backed serving, or any other non-PyTorch TensorRT-LLM server backend as unsupported for this skill

Version-sensitive TensorRT-LLM knob families to verify:

- `tp_size`, `pp_size`, and `ep_size`
- max batch size, max sequence length, max number of tokens, and KV-cache budget
- inflight batching and scheduler options
- the extra LLM API options YAML used by `trtllm-serve` with the PyTorch backend

The `trtllm-serve serve` CLI exposes fewer direct runtime knobs than SGLang or vLLM. Use direct flags when they exist, then use `--extra_llm_api_options` for PyTorch-backend settings that are not top-level CLI flags. Keep unsupported backend or engine requests in the failure table instead of translating them.

Keep `kv_cache_free_gpu_memory_fraction` in the baseline for the default pass. Search `max_batch_size`, `max_num_tokens`, `max_seq_len`, and validated PyTorch-backend config options first. The server backend remains fixed to `pytorch`.

### 7. Normalize Results

Write one JSONL row per candidate using the schema in `references/result-schema.md`. Then run:

```bash
python skills/llm-serving-auto-benchmark/scripts/compare_benchmark_results.py \
  --input /path/to/candidates.jsonl \
  --output /path/to/summary.md
```

Rank candidates in this order:

1. SLA passed
2. highest request throughput or goodput
3. highest output token throughput
4. lower p99 TTFT
5. lower p99 TPOT/ITL
6. lower GPU count or simpler deployment if performance is close
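The ranking order maps to a single sort key. A sketch assuming normalized per-candidate dicts with illustrative field names (`sla_pass`, `request_throughput`, `output_token_throughput`, `p99_ttft_ms`, `p99_tpot_ms`, `gpu_count`); the actual key names in references/result-schema.md are the source of truth:

```python
def rank_candidates(rows):
    """Sort normalized candidate rows best-first per the ordering above.

    Field names here are illustrative; use the keys defined in
    references/result-schema.md for real runs.
    """
    return sorted(
        rows,
        key=lambda r: (
            not r["sla_pass"],              # SLA passers first
            -r["request_throughput"],       # then highest request throughput
            -r["output_token_throughput"],  # then highest token throughput
            r["p99_ttft_ms"],               # then lower p99 TTFT
            r["p99_tpot_ms"],               # then lower p99 TPOT/ITL
            r.get("gpu_count", 0),          # then fewer GPUs
        ),
    )
```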

## Output Contract

Return a compact report with:

- workload and SLA used
- hardware and framework versions
- for each framework, one table listing the best deployment command for each dataset scenario and all relevant performance metrics
- one cross-framework comparison table for the selected best command per framework and scenario, including the command, so the deployment choice is clear for each dataset
- failed or excluded candidates with reasons. Explain that this table is a record of tried configs that were not selected: candidates that failed, were skipped by policy, or completed but missed the SLA.
- the exact launch command and benchmark command for each winner
- artifact paths: canonical workload, raw results JSONL, normalized JSONL, CSV or markdown summary, and the server logs needed to debug winners or failures
- a caveat if the workload was synthetic, if any framework did not complete a fair search, or if any framework needed framework-specific parameter substitutions

Use `references/framework-matrix.md` when you need command templates or source links for each framework. Use `references/example-plan.yaml` as the starting point for a full cross-framework run plan. Use `references/version-notes.md` to understand which source snapshots informed this skill and what has or has not been smoke-tested.