sglang-sota-performance
# SGLang SOTA Performance
## Overview
Use this skill as the top-level optimization loop for one model at a time.
It composes two lower-level skills:

- `llm-serving-auto-benchmark`: search and compare best deployment commands across SGLang, vLLM, and TensorRT-LLM.
- `llm-torch-profiler-analysis`: capture or analyze torch-profiler traces and produce kernel, overlap-opportunity, and fuse-pattern tables.
This skill's goal is not "run one benchmark." Its goal is a reproducible
SGLang improvement loop: tune every framework fairly, prove whether SGLang is
behind, explain the gap with profiler evidence, patch SGLang, and re-run the
same model workload until the result is SOTA for the target environment.
Treat "SOTA" as "best observed, reproducible performance under the recorded
model, workload, hardware, framework commits, precision, and SLA." Do not claim
global SOTA without enough external evidence.
## Required Companion Reads

Before a real run, read only the needed sections from:

- ../llm-serving-auto-benchmark/SKILL.md
- ../llm-torch-profiler-analysis/SKILL.md

If the run uses a remote GPU host, also read the matching host skill such as
`h100`, `b200`, `rtx5090`, or another operator-side skill that gives SSH,
container, workspace, and artifact-path conventions.
## Required Inputs
Collect or infer these before starting a long search:
- model id or local checkpoint path, tokenizer path, precision, quantization, trust-remote-code policy, and max context length
- target GPU type/count, single-node or multi-node allowance, and VRAM budget
- workload distribution: dataset, input/output lengths, request rate or concurrency mode, sampling settings, endpoint style, and SLA target
- frameworks to compare: default to SGLang, vLLM, and TensorRT-LLM when all are available in the target environment
- artifact root for commands, logs, benchmark JSONL, profiles, analysis reports, patches, and final comparison tables
If the user only provides a model, choose a reasonable first workload and state
it explicitly. Prefer the closest cookbook config from
`llm-serving-auto-benchmark/configs/cookbook-llm/` when available.
## Artifact Layout
Use one run directory per model and date, for example:

```text
runs/YYYYMMDD_<model_slug>_sota_loop/
  manifest.txt
  help/
  benchmark/
  profiles/
  analysis/
  patches/
  final_report.md
```

Record exact framework versions, git commits, container names/images, CUDA/NCCL
versions, GPU ids, launch commands, benchmark commands, and environment knobs.
Never write Hugging Face tokens or other secrets into artifacts.
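The layout above can be initialized with a small helper. This is a sketch: the directory names follow the tree shown here, and the `manifest.txt` fields are illustrative key/value pairs, not a fixed schema. It also refuses obviously secret-looking keys so tokens never land in artifacts.

```python
import os
from datetime import date

def init_run_dir(root: str, model_slug: str, manifest_fields: dict) -> str:
    """Create the per-model run directory tree and write a plain-text manifest.

    `manifest_fields` should record framework versions, git commits, container
    images, CUDA/NCCL versions, GPU ids, and launch/benchmark commands.
    Keys that look like secrets (tokens) are rejected outright.
    """
    run_dir = os.path.join(root, f"{date.today():%Y%m%d}_{model_slug}_sota_loop")
    for sub in ("help", "benchmark", "profiles", "analysis", "patches"):
        os.makedirs(os.path.join(run_dir, sub), exist_ok=True)
    with open(os.path.join(run_dir, "manifest.txt"), "w") as f:
        for key, value in manifest_fields.items():
            if "token" in key.lower() or "secret" in key.lower():
                raise ValueError(f"refusing to write secret field: {key}")
            f.write(f"{key}: {value}\n")
    return run_dir
```

`final_report.md` is written at the end of the loop, so it is deliberately not pre-created here.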
## Workflow
### 1. Preflight The Model And Environment
Verify the model can be loaded by each framework before launching a sweep.
Capture each framework's current `--help` output and version. Remove candidate
flags that are not accepted by that exact environment.

For TensorRT-LLM, keep the server backend within the scope of
`llm-serving-auto-benchmark`: `trtllm-serve serve --backend pytorch`.
If that backend is unavailable, mark TensorRT-LLM unsupported for the run
instead of silently switching to a different serving stack.
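The `--help` snapshots can be captured with a generic helper. The exact CLI entrypoints vary per install (the commented example is illustrative, not a verified command line); what matters is saving the real output so candidate flags can be pruned against what this environment actually accepts.

```python
import subprocess
from pathlib import Path

def snapshot_command(argv: list[str], out_path: str, timeout: int = 60) -> bool:
    """Run `argv`, save combined stdout/stderr to `out_path`, return True on exit 0.

    Intended for capturing `<framework> --help` and version output under help/
    so later candidate generation only uses flags this environment accepts.
    """
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    Path(out_path).write_text(proc.stdout + proc.stderr)
    return proc.returncode == 0

# Example (entrypoint is an assumption; use whatever this install provides):
# snapshot_command(["python", "-m", "sglang.launch_server", "--help"],
#                  "runs/<run_dir>/help/sglang_help.txt")
```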
### 2. Search Each Framework's Best Command
Use `llm-serving-auto-benchmark` as the source of truth for benchmark fairness,
candidate generation, result schema, and comparison tables.

Run a bounded search for every available framework. Do not compare SGLang's
tuned command against competitor defaults. Each framework must get a real chance
to find its best deployment command under the same:
- model weights and tokenizer
- precision and quantization policy
- GPU type/count and memory budget
- dataset and request distribution
- endpoint path and sampling settings
- SLA target and measurement window
Keep failed candidates and their failure reasons. The fastest SLA-failing
candidate is not the winner.
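One way to keep the search bounded and fair is to expand the same-size flag grid for every framework and keep every outcome, pass or fail. A minimal sketch; the flag names in the test are placeholders, not verified server options:

```python
from itertools import product

def bounded_candidates(base_cmd: list[str], grid: dict, limit: int = 16) -> list:
    """Expand a flag grid into at most `limit` candidate launch commands.

    `grid` maps a flag name to a list of values to try. Giving each framework
    the same budget avoids comparing one framework's tuned command against
    another's defaults.
    """
    keys = sorted(grid)
    out = []
    for combo in product(*(grid[k] for k in keys)):
        cmd = list(base_cmd)
        for key, value in zip(keys, combo):
            cmd += [key, str(value)]
        out.append(cmd)
        if len(out) >= limit:  # hard budget: never exceed the search bound
            break
    return out
```

Each candidate's result record should keep its full command plus pass/fail status and failure reason, so losing candidates remain auditable.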
### 3. Compare The Best Commands
Normalize the benchmark output with
`llm-serving-auto-benchmark/scripts/compare_benchmark_results.py`.

The comparison must include:
- best server command per framework
- benchmark command and workload settings
- SLA pass/fail status
- throughput and goodput
- TTFT, ITL, end-to-end latency, and p95/p99 where available
- peak memory or allocator evidence when available
- failed candidate summary
If SGLang is within benchmark noise of the best framework, rerun enough samples
to decide whether the difference is real. Use a default regression threshold of
3-5% unless the user specifies a tighter target.
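The threshold check can be made explicit. A minimal sketch, assuming a higher-is-better metric (e.g. goodput) and a handful of repeated runs per framework: treat the gap as real only when the mean relative difference exceeds the threshold and the observed ranges do not overlap. For stricter decisions, replace the range check with a proper significance test over more samples.

```python
from statistics import mean

def gap_is_real(sglang_runs: list, best_runs: list, threshold: float = 0.03) -> bool:
    """True when SGLang's mean lags the best framework's mean by more than
    `threshold` (relative) AND the per-run ranges do not overlap."""
    gap = (mean(best_runs) - mean(sglang_runs)) / mean(best_runs)
    ranges_overlap = max(sglang_runs) >= min(best_runs)
    return gap > threshold and not ranges_overlap
```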
### 4. Profile SGLang When It Is Behind
If SGLang is meaningfully slower, fails SLA while another framework passes, or
uses much more memory for the same workload, run profiler triage before patching.
Use `llm-torch-profiler-analysis` against the SGLang best command first:

- capture live SGLang profiles with `--profile-by-stage` when possible
- keep separate `extend/prefill` and `decode` traces when the server supports it
- run mapping+formal triage if single-trace output cannot map kernels to useful Python source locations
- save the kernel, overlap-opportunity, and fuse-pattern tables in artifacts
Profile the winning competitor too when the SGLang table alone cannot explain
why the other framework is faster. Compare stage by stage, not just total QPS.
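The stage-by-stage comparison can be as simple as diffing per-stage GPU-time tables extracted from both traces. The table shape here (stage name mapped to kernel-category milliseconds) is an assumption about what the analysis step emits, not its actual schema:

```python
def stage_gaps(sglang_table: dict, competitor_table: dict) -> dict:
    """Per stage and kernel category, SGLang GPU time minus competitor GPU time (ms).

    Positive entries show where SGLang spends extra time; the largest positive
    entry is the first table row to investigate.
    """
    gaps = {}
    for stage in sglang_table:
        if stage not in competitor_table:
            continue  # can only compare stages present in both traces
        comp = competitor_table[stage]
        gaps[stage] = {
            cat: ms - comp.get(cat, 0.0)
            for cat, ms in sglang_table[stage].items()
        }
    return gaps
```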
### 5. Turn Tables Into A Root Cause
Use the profiler tables to identify the narrowest plausible bottleneck.
Typical signals:
- kernel table: attention, MoE routing, quantization, sampling, GEMM shape, cache update, communication, or framework overhead dominates GPU time
- overlap-opportunity table: CPU scheduling, host-to-device work, collectives, or decode bookkeeping leaves GPU idle time
- fuse-pattern table: a known fusion or overlap path should have applied but did not, or competitor traces show a fused path SGLang lacks
- source map: hot kernels map to a concrete SGLang Python/CUDA/Triton path that can be patched
Do not patch from vibes. State the table row, stage, source location, and
benchmark symptom that justify the code change.
### 6. Patch SGLang Conservatively
Patch SGLang only after the benchmark gap and profiler evidence agree.
Good patch candidates:
- enable or select a better existing kernel for the model/hardware shape
- fix a missed fast path, fusion, overlap, or batching condition
- reduce unnecessary synchronization, CPU scheduling overhead, or tensor copies
- improve model-specific routing, quantization, attention, or cache handling
- add a guarded heuristic that is backed by benchmark and profiler evidence
Avoid changes that merely make the benchmark easier:
- weakening correctness, output quality, safety checks, or tokenizer handling
- changing only the workload or SLA after seeing results
- disabling features for SGLang but not competitors
- claiming SOTA from synthetic data when the user asked for production traffic
Keep patches minimal and local. Add focused tests when behavior changes, and add
microbenchmarks or profiler evidence when performance is the only intended
change.
### 7. Revalidate The Patch
After patching, rerun:
- the relevant unit or integration tests
- the SGLang candidate that exposed the gap
- the same cross-framework benchmark comparison
- the profiler triage if the original gap was diagnosed from profiler tables
If the patch changes SGLang's available knobs, re-search SGLang's best command.
If competitor versions or commands changed during the work, rerun their best
commands too. Preserve before/after artifacts.
## Stop Conditions
Stop with a clear report when any of these is true:
- SGLang is the best SLA-passing framework for the target workload
- SGLang is within noise of the best framework and the remaining gap is not statistically stable
- SGLang remains behind but the root cause is external to SGLang, such as missing model weights, unavailable backend dependencies, or an unsupported hardware feature
- a patch improves SGLang but still does not reach SOTA; report the next table row or source path to investigate
## Final Report Contract
Return a compact report with:
- model, hardware, framework versions, workload, and artifact root
- best deployment command per framework
- benchmark comparison table before patch and after patch
- SGLang gap analysis, including exact profiler table rows and source paths
- patch summary with changed files and correctness tests
- real-model validation result and whether SGLang reached target-environment SOTA
If no code patch was needed, say why and include the benchmark evidence.
If a patch was attempted but not enough, be explicit about the remaining gap.
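One possible skeleton for `final_report.md` covering the items above; the section names are suggestions, not a fixed format:

```text
# <model> SOTA Loop Report

## Setup
model / hardware / framework versions / workload / artifact root

## Best Commands
one launch command per framework

## Benchmark Comparison
before-patch and after-patch tables (SLA, throughput, goodput, TTFT, ITL, p95/p99)

## Gap Analysis
profiler table rows and source paths that explain the gap

## Patch Summary
changed files, correctness tests, microbenchmarks

## Outcome
validation result; SOTA reached, or the next table row / source path to investigate
```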