# LLM Serving Auto Benchmark

## Overview

Use this skill to compare LLM serving frameworks such as SGLang, vLLM, and TensorRT-LLM on the same model and workload.

Use a config-driven workflow:

- keep launch-only capacity choices in each framework's `base_server_flags`
- put the search knobs in `search_space`
- run the same dataset scenarios for every framework
- generate a bounded candidate list from `search_space`, with the baseline candidate included first
- keep failed candidates in the result file
- pick the best SLA-passing candidate after normalizing the results

For model-specific starting points, prefer the shipped configs in `configs/cookbook-llm/`. They define a framework-neutral LLM serving cookbook model set and translate each entry into framework-native SGLang, vLLM, and TensorRT-LLM server flags. Validate those configs before a real run:

```bash
python skills/llm-serving-auto-benchmark/scripts/validate_cookbook_configs.py \
  skills/llm-serving-auto-benchmark/configs/cookbook-llm
```

If you have captured target-environment `--help` files, add `--help-dir <artifact-help-dir>`. That check only loads configs, verifies the server flag names, and renders candidate commands; it does not launch model servers.

Prefer native tooling when it gives better coverage:

- SGLang: `python -m sglang.auto_benchmark` when available, otherwise `python -m sglang.bench_serving`
- vLLM: `vllm bench sweep serve` for server-parameter sweeps, otherwise `vllm serve` plus `vllm bench serve`
- TensorRT-LLM: `trtllm-serve` for the OpenAI-compatible server, plus the TensorRT-LLM serving benchmark client or a common OpenAI-compatible benchmark client

TensorRT-LLM has one hard scope rule in this skill: the server backend is fixed to `trtllm-serve serve --backend pytorch`. Do not search the TensorRT-LLM backend choice. If a request, config, or candidate asks for `trt`, an engine backend, or any other non-PyTorch TensorRT-LLM server backend, reject that candidate as unsupported for this skill and record the reason. This does not change the benchmark client backend; the TensorRT-LLM benchmark client still uses OpenAI-compatible modes such as `--backend openai` or `--backend openai-chat`.

Only pick a winner after each requested framework has had its main serving knobs tuned.

The parameter lists in this skill are not a compatibility contract. They are version-sensitive candidate knob families. Before every real run, record the exact framework version or git commit and verify the concrete CLI flag names with `--help` in the target environment.

The default search style is framework-neutral: start from a mostly pure-TP baseline, sweep a small set of high-impact runtime knobs, and cap the first pass around 10 candidates per framework. Do not search memory fractions by default.

## Validation Environment

This skill is target-agnostic. It assumes any one of the following is available, and nothing more:

- a local GPU host with Docker/Podman and the target framework images pulled;
- a remote GPU host reached via `ssh <host>` with the framework images already running in a container there;
- a CI runner that can exec into a pre-built image for each framework.

Do not assume a specific operator host name (`h100_sglang`, `b200_*`, `radixark*`, `rtx5090_*`, etc.) inside this skill's own workflow. The concrete SSH wiring, container names, workspace paths, and HF token plumbing for a given box live in the operator-side per-host skills (for example `h100`, `h100-sglang-diffusion`, `b200`, `rtx5090`, `radixark02`, `radixark03`); this skill only requires that the caller can reach a shell inside a container with `sglang`, `vllm`, or `tensorrt_llm` installed.

Historical validation snapshots in `references/` are evidence of which flag names and failure modes were seen in specific images; they are not a requirement that the next run happen on the same hardware or framework version.

## Skill Scope

This skill is a playbook plus a config+validator toolchain, not a turn-key orchestrator. The `scripts/` directory contains exactly two tools:

- `validate_cookbook_configs.py`: reads YAML, renders bounded candidate server commands, and checks flag names against captured `--help` snapshots. It never launches a model server.
- `compare_benchmark_results.py`: takes the normalized per-candidate JSONL and emits the markdown tables described in the Output Contract.

Launching servers, driving the workload, and writing one JSONL row per candidate are the operator's responsibility; the skill tells you how to do them, and the validator keeps your inputs honest.

The cookbook configs under `configs/cookbook-llm/` and the sample runtime plan at `references/example-plan.yaml` use related but not identical schemas:

- Cookbook configs carry `schema_version: 1`, `source.kind` set to `llm_serving_cookbook`, a nested `benchmark.sla`, and `frameworks.*.server_command`; they must pass `validate_cookbook_configs.py`.
- `example-plan.yaml` is a shorter runtime-plan shape with a top-level `sla` and no `server_command`. It is the skeleton a caller fills in for a one-off run and is not expected to pass the cookbook validator as-is.

Either shape can feed a benchmark run; the SLA key names in `references/result-schema.md` are the single source of truth.

## Required Inputs

Collect these before starting a long run:

- model path or Hugging Face repo id
- tokenizer path if it differs from the model
- target frameworks: any subset of `sglang`, `vllm`, `tensorrt-llm`
- GPU model, GPU count, and whether multi-node is allowed
- precision and quantization constraints
- endpoint shape: completions, chat completions, responses, or custom
- workload source: real traffic JSONL, ShareGPT, random synthetic, or generated shared-prefix synthetic
- dataset scenarios when synthetic traffic is used, for example `chat` and `summarization`
- SLA target: TTFT, TPOT/ITL, end-to-end latency, success rate, or goodput
- search budget: quick smoke, default search, or exhaustive search
- output directory for logs and result artifacts

Also collect a version manifest:

- framework package version and git commit when available
- container image or Python environment identifier
- `--help` snapshots for the server command and benchmark command
- whether each parameter in the search plan was accepted by that exact CLI

If real production traffic is the goal, use the real request distribution. A synthetic workload is fine for bring-up and first-pass comparison, but it is not enough for a production choice.

## Known Gotchas

A short list of failure modes that have bitten past validation runs. Check these before starting a long sweep.

- The SGLang `fa3` attention backend needs Hopper or newer. On A100, L40S, RTX 5090, and older GPUs, drop `fa3` from the SGLang `search_space` and keep `flashinfer` (or `triton` when FlashInfer is unavailable).
- SGLang `bench_serving` has two SGLang-facing backends: `--backend sglang` for the native `/generate` endpoint and `--backend sglang-oai` for the OpenAI-compatible endpoint. For cross-framework comparisons, prefer `sglang-oai` so every framework is measured on the same request path.
- vLLM `--enable-dbo` only works when the target vLLM image is built with a supported all2all backend. Keep DBO out of the default candidate list unless the operator has verified the image.
- vLLM `--max-num-partial-prefills > 1` is model- and runtime-gated. Keep `1` in the default pass; raise it only after a preflight with the actual model.
- In the validated TensorRT-LLM 1.0.0 image, `trtllm-serve serve` accepts `--kv_cache_free_gpu_memory_fraction`; the older `--free_gpu_memory_fraction` exits with a CLI error. Re-check the accepted flag name via `--help` on the target image before a real run.
- TensorRT-LLM 1.0.0 multi-GPU PyTorch-backend servers need `--ipc=host`, `--ulimit memlock=-1`, `--ulimit stack=67108864`, `--shm-size=16g`, and `NCCL_IB_DISABLE=1` (for single-node) or an equivalent NCCL setup.
- The TensorRT-LLM 1.0.0 benchmark client takes `--backend openai` or `--backend openai-chat`; `--backend trtllm` is rejected. This is separate from the server backend, which is pinned to `pytorch` by this skill.
- The trtllm `benchmark_serving --dataset-name random` mode silently falls back to ShareGPT sampling without `--random-ids` (or `--download-path`).
- `max_seq_len` / `max_model_len` / `context_length` candidates must cover `max(input_len + output_len)` across every scenario, including values inside `search_space`, not just the baseline. The validator checks this; do not bypass it.

## Secrets Hygiene

- Never print `HF_TOKEN`, `HUGGINGFACE_HUB_TOKEN`, or any upstream API key into a saved artifact. Pass them through container `-e VAR` (unquoted on the right side so the host value is inherited) and keep them out of the `server_command` and `benchmark_command` fields written to the result JSONL.
- When a framework echoes the full argv at startup, scrub the log or redact token-shaped substrings before uploading the artifact.
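The argv-scrubbing step can be automated before logs leave the host. A minimal sketch, assuming Hugging Face-style tokens prefixed with `hf_` and a generic `sk-`-style API-key shape; both patterns are illustrative, so extend them for whatever secret formats your environment actually uses:

```python
import re

# Token-shaped substrings to redact before saving logs as artifacts.
# These patterns are illustrative assumptions, not an exhaustive list.
_SECRET_PATTERNS = [
    re.compile(r"hf_[A-Za-z0-9]{10,}"),    # Hugging Face-style tokens
    re.compile(r"sk-[A-Za-z0-9_-]{10,}"),  # generic API-key shape
]

def redact_secrets(text: str) -> str:
    """Replace token-shaped substrings with a fixed placeholder."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

Run server logs through a filter like this before copying them into the artifact directory.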

## Fairness Rules

Use these rules throughout the benchmark:

- Run every framework on the same GPU type, GPU count, model weights, tokenizer, precision, quantization policy, prompt distribution, output length target, and sampling settings.
- Record framework version, git commit, container image, CUDA/NCCL versions, GPU driver, visible GPU ids, launch command, and benchmark command.
- Warm the server before measuring. Restart or clear state between candidate configurations when cache effects would bias the comparison.
- Compare steady-state fixed-QPS runs separately from burst throughput runs.
- Keep failed candidates in the final results with their failure reason.
- Report both raw throughput and SLA-passing throughput. The fastest failing candidate is not the best deployment command.
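The raw-versus-SLA-passing distinction can be made concrete. A small sketch, assuming per-request records carrying an illustrative `latency_ok` flag from the SLA check (the real key names live in references/result-schema.md):

```python
def throughputs(requests, duration_s):
    """Return (raw request throughput, SLA-passing goodput) in req/s.

    `requests` is a list of dicts with an illustrative `latency_ok` flag;
    goodput counts only requests that met the SLA.
    """
    raw = len(requests) / duration_s
    goodput = sum(1 for r in requests if r["latency_ok"]) / duration_s
    return raw, goodput
```

Report both numbers per candidate; ranking on raw throughput alone can crown a command that never meets the SLA.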

## Workflow

### 1. Preflight

Verify all requested frameworks before starting a search:

```bash
python -m sglang.launch_server --help
python -m sglang.bench_serving --help
vllm serve --help
vllm serve --help=all
vllm bench serve --help
vllm bench serve --help=all
vllm bench sweep serve --help=all
trtllm-serve serve --help
python -m tensorrt_llm.serve.scripts.benchmark_serving --help
```

Use the framework-specific `--help` output in the target environment as the source of truth. Do not keep a stale launch flag just because it appears in an old note.

vLLM 0.19 and newer use grouped help. Plain `vllm serve --help` only shows the groups, so capture `--help=all` before deciding whether a search knob exists.

Save these `--help` outputs into the run artifact directory. If a listed search knob is missing from the current CLI, remove or translate that knob before running the benchmark. Do not silently pass unknown flags.

For TensorRT-LLM, also confirm that `trtllm-serve serve --help` accepts `--backend pytorch`. If it does not, mark TensorRT-LLM unsupported in that environment rather than falling back to a different server backend.

For each framework:

1. Launch a minimal server.
2. Confirm `/v1/models` or the framework-native model-info endpoint works.
3. Send one streaming request and verify TTFT can be measured.
4. Run one tiny benchmark with at least 5 requests.
5. Save the launch command, benchmark command, server log, and benchmark output.

Before any GPU-backed smoke run, check the requested GPU ids directly with `nvidia-smi`. If a requested GPU is already in use, stop and record that fact. Do not silently borrow a different GPU count for a performance comparison. It is fine to run a smaller one-GPU smoke only when the result is clearly labeled as a flow check rather than a fair throughput comparison.

If the target environment runs through containers, follow `references/container-runbook.md`. Save the image tags, pull commands, launch commands, server logs, benchmark logs, and cleanup commands in the artifact directory.
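The GPU-availability check above can be scripted. A sketch that parses `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits` output and flags GPUs that already have significant memory allocated; the 1 GiB threshold is an arbitrary assumption, not a skill requirement:

```python
import subprocess

BUSY_MIB = 1024  # assumed threshold: >1 GiB already used means the GPU is taken

def query_nvidia_smi() -> str:
    """Capture per-GPU memory usage as CSV text (run on the target host)."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout

def parse_gpu_usage(csv_text: str) -> dict:
    """Map GPU index -> used memory in MiB from the CSV output."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, mem_used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(mem_used)
    return usage

def busy_gpus(requested_ids, csv_text) -> list:
    """Return the requested GPU ids that already have memory allocated."""
    usage = parse_gpu_usage(csv_text)
    return [i for i in requested_ids if usage.get(i, 0) > BUSY_MIB]
```

If `busy_gpus` returns a non-empty list, stop and record the conflict in the artifact directory rather than borrowing other GPUs.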

### 2. Normalize The Workload

Use one canonical workload for all frameworks. Recommended JSONL row shape:

```json
{"prompt": [{"role": "user", "content": "Summarize this text."}], "output_len": 256}
{"prompt": "Write a short explanation of CUDA graphs.", "output_len": 128}
```

Optional fields:

```json
{
  "prompt": [{"role": "user", "content": "Use low temperature."}],
  "output_len": 256,
  "extra_request_body": {"temperature": 0.0, "top_p": 0.95},
  "metadata": {"source": "prod-sample"}
}
```

When converting user data:

- inspect at least 3 rows before conversion
- preserve request-level sampling options in `extra_request_body`
- do not include the final assistant answer in the prompt when that answer is the target completion
- keep multimodal or tool-call payloads only if all requested frameworks support the chosen endpoint shape
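The conversion rules above can be sketched as a row converter. This assumes an illustrative raw shape with `messages`, `output_len`, and `sampling` fields; real traffic dumps will differ, which is why the checklist starts with inspecting rows:

```python
def to_canonical_row(raw: dict) -> dict:
    """Convert one illustrative raw traffic record to the canonical JSONL shape.

    Drops a trailing assistant turn (it is the target completion) and keeps
    request-level sampling options under extra_request_body.
    """
    messages = raw["messages"]
    # Do not include the final assistant answer in the prompt.
    if messages and messages[-1]["role"] == "assistant":
        messages = messages[:-1]
    row = {"prompt": messages, "output_len": raw["output_len"]}
    if raw.get("sampling"):
        row["extra_request_body"] = raw["sampling"]
    return row
```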
For synthetic bring-up, use the shipped two-scenario shape:

```yaml
dataset:
  kind: random
  num_prompts: 80
  scenario_names: [chat, summarization]
  input_len: [1000, 8000]
  output_len: [1000, 1000]
```

Each aligned `input_len`/`output_len` pair is one scenario. Do not take the cartesian product unless the user asks for that.

Before searching any sequence-length limit, compute the largest `input_len + output_len` in the dataset. SGLang `context_length`, vLLM `max_model_len`, and TensorRT-LLM `max_seq_len` must be at least that value for every candidate that is expected to run all scenarios.
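The sequence-length floor can be computed directly from the scenario lists. A minimal sketch of the check the validator performs, using the aligned-pair convention from the YAML above:

```python
def required_seq_len(input_lens, output_lens):
    """Smallest max_seq_len / max_model_len / context_length that covers
    every aligned (input_len, output_len) scenario pair."""
    return max(i + o for i, o in zip(input_lens, output_lens))

def covers_all_scenarios(candidate_limit, input_lens, output_lens):
    """True when a candidate sequence-length limit can run all scenarios."""
    return candidate_limit >= required_seq_len(input_lens, output_lens)
```

For the shipped shape above, `required_seq_len([1000, 8000], [1000, 1000])` is 9000, so an 8192 context-length candidate cannot run the summarization scenario.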

### 3. Pick A Search Tier

Use the smallest tier that can answer the user's question:

- Tier 1: smoke and sanity. One baseline plus a few high-impact knobs.
- Tier 2: default. A bounded sweep over the most likely server settings.
- Tier 3: exhaustive. Only when the search space is already tight and the user accepts a long run.

Default budget:

- `num_prompts: 80` for the default cross-framework comparison; `num_prompts: 20` per scenario is acceptable for a smoke/flow check and must be labeled as such in the artifact (not as a performance result).
- `search.max_candidates_per_framework: 10` for the first useful pass
- candidate generation: baseline first, then a bounded product or ordered candidate list from `search_space`
- at most 5 QPS search rounds unless the user asks for more
- stop early when every candidate in one framework is clearly OOM or fails the basic health check
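The candidate-generation rule (baseline first, bounded product, hard cap) can be sketched as follows, assuming `search_space` maps flag names to value lists as in the cookbook configs:

```python
from itertools import product

def generate_candidates(baseline: dict, search_space: dict,
                        max_candidates: int = 10) -> list:
    """Baseline-first bounded candidate list from a search_space mapping.

    Each candidate is the baseline server flags with one combination of
    search_space values overlaid; the untouched baseline always comes
    first, and the list is capped at max_candidates.
    """
    candidates = [dict(baseline)]
    keys = sorted(search_space)
    for values in product(*(search_space[k] for k in keys)):
        candidate = {**baseline, **dict(zip(keys, values))}
        if candidate not in candidates:
            candidates.append(candidate)
        if len(candidates) >= max_candidates:
            break
    return candidates
```

Keeping the baseline first means the comparison tables always contain an untuned reference point even when the cap truncates the sweep.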
Keep these in `base_server_flags` unless the user specifically wants a capacity or memory study:

- SGLang `mem_fraction_static`
- SGLang `schedule_policy`
- vLLM `gpu_memory_utilization`
- TensorRT-LLM `kv_cache_free_gpu_memory_fraction`

These are real knobs, but they widen the search quickly and often turn a serving comparison into a memory-limit study.

### 4. Tune SGLang

Prefer the SGLang auto-benchmark runner when the target checkout supports it:

```bash
python -m sglang.auto_benchmark run --config /path/to/sglang.yaml
```

Otherwise launch the server manually and benchmark with:

```bash
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 256 \
  --num-prompts 80 \
  --request-rate 8 \
  --output-file /path/to/sglang/results.json \
  --output-details
```

Version-sensitive SGLang knob families to verify:

- `tp_size`, `pp_size`, `dp_size`, `ep_size`
- `attention_backend`, `prefill_attention_backend`, `decode_attention_backend`
- `sampling_backend`
- `max_running_requests`, `max_queued_requests`
- `chunked_prefill_size`, `prefill_max_requests`, `max_prefill_tokens`
- `max_total_tokens`, `page_size`
- CUDA graph and piecewise CUDA graph settings
- speculative or EAGLE settings only after the non-speculative baseline is tuned

Keep `mem_fraction_static` and `schedule_policy` pinned in the default pass, matching the shared cookbook config style.

For quick smoke tests, it is reasonable to disable CUDA graph and piecewise CUDA graph startup work if the goal is only to prove the framework flow. Record those flags in the artifact. Do not carry that smoke setting into a performance winner unless the user asked to tune eager-mode serving.

### 5. Tune vLLM

Use vLLM's sweep runner when available:

```bash
vllm bench sweep serve \
  --serve-cmd 'vllm serve <model> --port 8000' \
  --bench-cmd 'vllm bench serve --backend vllm --model <model> --port 8000 --dataset-name random --num-prompts 80' \
  --serve-params /path/to/vllm_serve_params.json \
  --bench-params /path/to/vllm_bench_params.json \
  --output-dir /path/to/vllm_results
```

If sweep support is unavailable, run `vllm serve` for each candidate and measure with `vllm bench serve`.

Version-sensitive vLLM knob families to verify:

- tensor, pipeline, data, decode-context, and expert parallelism
- `gpu_memory_utilization`
- `max_num_seqs`
- `max_num_batched_tokens`
- `max_model_len`
- `enable_chunked_prefill`, partial prefill limits, and DBO thresholds
- KV cache dtype and block size
- dtype and quantization settings
- CUDA graph capture sizes or eager-mode toggles when relevant
- prefix cache and speculative decoding settings only when the workload needs those features

vLLM should get a normal sweep, not one baseline command. See `references/parameter-coverage.md` for the validated flag families. The historical audit happens to use an H100 host, but the flag-family coverage is not H100-specific; confirm each flag on the target image's `--help` before a run.

Keep `gpu_memory_utilization` in the baseline for the default pass. Search it only when the question is explicitly about fitting the model or trading capacity against throughput.

Keep DBO and all2all backend settings out of the default pass unless the target vLLM environment is already set up for them. They are real tuning knobs, but a candidate can fail at startup if the required all2all backend is not available. Also preflight concurrent partial prefill before raising `max_num_partial_prefills` above 1; some model/runtime combinations reject it at startup.

### 6. Tune TensorRT-LLM

Use `trtllm-serve serve` as the server entrypoint when the target environment supports it:

```bash
trtllm-serve serve <model> \
  --backend pytorch \
  --tp_size <tp> \
  --pp_size <pp> \
  --kv_cache_free_gpu_memory_fraction 0.75 \
  --host 0.0.0.0 \
  --port 8000
```

Then benchmark the OpenAI-compatible endpoint with the TensorRT-LLM serving benchmark client or with the same OpenAI-compatible client used for the other frameworks.

For TensorRT-LLM 1.0.0, `benchmark_serving --dataset-name random` samples from ShareGPT unless you pass either `--download-path` or `--random-ids`. For a fast synthetic smoke test, pass `--random-ids`.

TensorRT-LLM flag names are especially version-sensitive. In the validated TensorRT-LLM 1.0.0 image, the KV-cache memory flag accepted by `trtllm-serve serve` is `--kv_cache_free_gpu_memory_fraction`, not `--free_gpu_memory_fraction`. Verify this with `trtllm-serve serve --help` before running a search on any GPU target.

TensorRT-LLM backend policy for this skill:

- launch the server with `--backend pytorch`
- keep `backend: pytorch` in `base_server_flags`
- do not add `backend` to `search_space`
- reject `trt`, engine-backed serving, or any other non-PyTorch TensorRT-LLM server backend as unsupported for this skill

Version-sensitive TensorRT-LLM knob families to verify:

- `tp_size`, `pp_size`, and `ep_size`
- max batch size, max sequence length, max number of tokens, and KV-cache budget
- inflight batching and scheduler options
- the extra LLM API options YAML used by `trtllm-serve` with the PyTorch backend

The `trtllm-serve serve` CLI exposes fewer direct runtime knobs than SGLang or vLLM. Use direct flags when they exist, then use `--extra_llm_api_options` for PyTorch-backend settings that are not top-level CLI flags. Keep unsupported backend or engine requests in the failure table instead of translating them.

Keep `kv_cache_free_gpu_memory_fraction` in the baseline for the default pass. Search `max_batch_size`, `max_num_tokens`, `max_seq_len`, and validated PyTorch-backend config options first. The server backend remains fixed to `pytorch`.

### 7. Normalize Results

Write one JSONL row per candidate using the schema in `references/result-schema.md`. Then run:

```bash
python skills/llm-serving-auto-benchmark/scripts/compare_benchmark_results.py \
  --input /path/to/candidates.jsonl \
  --output /path/to/summary.md
```

Rank candidates in this order:

1. SLA passed
2. highest request throughput or goodput
3. highest output token throughput
4. lower p99 TTFT
5. lower p99 TPOT/ITL
6. lower GPU count or simpler deployment if performance is close
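The ranking order maps to a single sort key. A sketch assuming normalized per-candidate dicts with illustrative field names (`sla_pass`, `request_throughput`, `output_token_throughput`, `p99_ttft_ms`, `p99_tpot_ms`, `gpu_count`); the actual key names in references/result-schema.md are the source of truth:

```python
def rank_candidates(rows):
    """Sort normalized candidate rows best-first per the ordering above.

    Field names here are illustrative; use the keys defined in
    references/result-schema.md for real runs.
    """
    return sorted(
        rows,
        key=lambda r: (
            not r["sla_pass"],              # SLA passers first
            -r["request_throughput"],       # then highest request throughput
            -r["output_token_throughput"],  # then highest token throughput
            r["p99_ttft_ms"],               # then lower p99 TTFT
            r["p99_tpot_ms"],               # then lower p99 TPOT/ITL
            r.get("gpu_count", 0),          # then fewer GPUs
        ),
    )
```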

## Output Contract

Return a compact report with:

- workload and SLA used
- hardware and framework versions
- for each framework, one table listing the best deployment command for each dataset scenario and all relevant performance metrics
- one cross-framework comparison table for the selected best command per framework and scenario, including the command, so the deployment choice is clear for each dataset
- failed or excluded candidates with reasons. Explain that this table is a record of tried configs that were not selected: candidates that failed, were skipped by policy, or completed but missed the SLA.
- the exact launch command and benchmark command for each winner
- artifact paths: canonical workload, raw results JSONL, normalized JSONL, CSV or markdown summary, and the server logs needed to debug winners or failures
- a caveat if the workload was synthetic, if any framework did not complete a fair search, or if any framework needed framework-specific parameter substitutions

Use `references/framework-matrix.md` when you need command templates or source links for each framework. Use `references/example-plan.yaml` as the starting point for a full cross-framework run plan. Use `references/version-notes.md` to understand which source snapshots informed this skill and what has or has not been smoke-tested.