# SGLang SOTA Performance

## Overview
Use this skill as the top-level optimization loop for one model at a time.
It composes two lower-level skills:

- `llm-serving-auto-benchmark`: search and compare best deployment commands across SGLang, vLLM, and TensorRT-LLM.
- `llm-torch-profiler-analysis`: capture or analyze torch-profiler traces and produce kernel, overlap-opportunity, and fuse-pattern tables.
This skill's goal is not "run one benchmark." Its goal is a reproducible
SGLang improvement loop: tune every framework fairly, prove whether SGLang is
behind, explain the gap with profiler evidence, patch SGLang, and re-run the
same model workload until the result is SOTA for the target environment.
Treat "SOTA" as "best observed, reproducible performance under the recorded
model, workload, hardware, framework commits, precision, and SLA." Do not claim
global SOTA without enough external evidence.
## Required Companion Reads
Before a real run, read only the needed sections from:

- `../llm-serving-auto-benchmark/SKILL.md`
- `../llm-torch-profiler-analysis/SKILL.md`
If the run uses a remote GPU host, also read the matching operator-side host skill that gives SSH, container, workspace, and artifact-path conventions.
## Required Inputs
Collect or infer these before starting a long search:
- model id or local checkpoint path, tokenizer path, precision, quantization,
trust-remote-code policy, and max context length
- target GPU type/count, single-node or multi-node allowance, and VRAM budget
- workload distribution: dataset, input/output lengths, request rate or
concurrency mode, sampling settings, endpoint style, and SLA target
- frameworks to compare: default to SGLang, vLLM, and TensorRT-LLM when all are
available in the target environment
- artifact root for commands, logs, benchmark JSONL, profiles, analysis reports,
patches, and final comparison tables
If the user only provides a model, choose a reasonable first workload and state it explicitly. Prefer the closest cookbook config from `llm-serving-auto-benchmark/configs/cookbook-llm/` when available.
## Artifact Layout

Use one run directory per model and date, for example:

```text
runs/YYYYMMDD_<model_slug>_sota_loop/
  manifest.txt
  help/
  benchmark/
  profiles/
  analysis/
  patches/
  final_report.md
```
Record exact framework versions, git commits, container names/images, CUDA/NCCL
versions, GPU ids, launch commands, benchmark commands, and environment knobs.
Never write Hugging Face tokens or other secrets into artifacts.
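The layout and redaction rules above can be sketched as a small helper. This is an illustrative sketch only: the manifest field names and the token pattern are assumptions, not the skill's actual implementation.

```python
import datetime
import os
import re

# Subdirectories from the run-directory layout above.
SUBDIRS = ["help", "benchmark", "profiles", "analysis", "patches"]
# Illustrative pattern for Hugging Face-style tokens; adjust for other secrets.
TOKEN_RE = re.compile(r"hf_[A-Za-z0-9]+")

def init_run_dir(root: str, model_slug: str, manifest_fields: dict) -> str:
    """Create one run directory per model and date, with a redacted manifest."""
    date = datetime.date.today().strftime("%Y%m%d")
    run_dir = os.path.join(root, f"{date}_{model_slug}_sota_loop")
    for sub in SUBDIRS:
        os.makedirs(os.path.join(run_dir, sub), exist_ok=True)
    lines = []
    for key, value in sorted(manifest_fields.items()):
        # Never write tokens or other secrets into artifacts.
        lines.append(f"{key}: {TOKEN_RE.sub('<redacted>', str(value))}")
    with open(os.path.join(run_dir, "manifest.txt"), "w") as f:
        f.write("\n".join(lines) + "\n")
    return run_dir
```

The manifest fields would carry the framework commits, CUDA/NCCL versions, GPU ids, and commands listed above.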
## Workflow

### 1. Preflight The Model And Environment
Verify the model can be loaded by each framework before launching a sweep. Capture each framework's current `--help` output and version, and remove candidate flags that are not accepted by that exact environment.
For TensorRT-LLM, keep the server backend within the scope of `llm-serving-auto-benchmark`: `trtllm-serve serve --backend pytorch`. If that backend is unavailable, mark TensorRT-LLM unsupported for the run instead of silently switching to a different serving stack.
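One way to automate the help capture and flag filtering is sketched below. The literal-substring match is a deliberate simplification, and the help command is a placeholder for whichever server binary is under test.

```python
import subprocess

def accepted_flags(help_cmd: list[str], candidate_flags: list[str],
                   help_path: str) -> list[str]:
    """Record a server's --help output and keep only flags it mentions."""
    proc = subprocess.run(help_cmd, capture_output=True, text=True)
    help_text = proc.stdout + proc.stderr
    # Save the raw help output as a run artifact (the help/ directory).
    with open(help_path, "w") as f:
        f.write(help_text)
    # Simplification: a flag counts as accepted if it appears verbatim.
    return [flag for flag in candidate_flags if flag in help_text]
```

In practice the help command would be each framework's launcher, and unmatched candidate flags would be dropped from the search space before the sweep starts.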
### 2. Search Each Framework's Best Command
Use `llm-serving-auto-benchmark` as the source of truth for benchmark fairness, candidate generation, result schema, and comparison tables.
Run a bounded search for every available framework. Do not compare SGLang's
tuned command against competitor defaults. Each framework must get a real chance
to find its best deployment command under the same:
- model weights and tokenizer
- precision and quantization policy
- GPU type/count and memory budget
- dataset and request distribution
- endpoint path and sampling settings
- SLA target and measurement window
Keep failed candidates and their failure reasons. The fastest SLA-failing
candidate is not the winner.
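A minimal sketch of the bookkeeping this implies, assuming a simple in-memory record with illustrative field names: failed candidates stay in the result set, and the winner is the fastest SLA-passing candidate, never the fastest overall.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CandidateResult:
    framework: str
    server_cmd: str
    sla_pass: bool
    throughput_tok_s: Optional[float] = None
    failure_reason: Optional[str] = None  # kept even for failed candidates

def best_candidate(results: List[CandidateResult]) -> Optional[CandidateResult]:
    """Pick the fastest candidate among those that pass SLA."""
    passing = [r for r in results
               if r.sla_pass and r.throughput_tok_s is not None]
    return max(passing, key=lambda r: r.throughput_tok_s, default=None)
```

Note that a candidate with the highest throughput but `sla_pass=False` is never returned; it is only recorded with its failure reason.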
### 3. Compare The Best Commands
Normalize the benchmark output with `llm-serving-auto-benchmark/scripts/compare_benchmark_results.py`.
The comparison must include:
- best server command per framework
- benchmark command and workload settings
- SLA pass/fail status
- throughput and goodput
- TTFT, ITL, end-to-end latency, and p95/p99 where available
- peak memory or allocator evidence when available
- failed candidate summary
If SGLang is within benchmark noise of the best framework, rerun enough samples
to decide whether the difference is real. Use a default regression threshold of
3-5% unless the user specifies a tighter target.
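The noise decision can be sketched as a rough heuristic. This is not a proper statistical test, just an illustration: the default threshold mirrors the lower end of the 3-5% range above, and the gap must also exceed the observed run-to-run spread.

```python
import statistics

def gap_is_real(sglang_samples: list[float], best_samples: list[float],
                threshold: float = 0.03) -> bool:
    """Decide whether SGLang's throughput gap exceeds noise and threshold."""
    sg = statistics.mean(sglang_samples)
    best = statistics.mean(best_samples)
    if best <= 0:
        return False
    rel_gap = (best - sg) / best
    # Run-to-run spread across both frameworks, relative to the best mean.
    spread = statistics.pstdev(sglang_samples + best_samples) / best
    return rel_gap > max(threshold, spread)
```

If this returns False, rerun enough samples to tighten the spread before declaring SGLang behind or ahead.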
### 4. Profile SGLang When It Is Behind
If SGLang is meaningfully slower, fails SLA while another framework passes, or
uses much more memory for the same workload, run profiler triage before patching.
Use `llm-torch-profiler-analysis` against SGLang's best command first:

- capture live profiles from the running SGLang server when possible
- keep separate per-stage traces (e.g., prefill and decode) when the server supports it
- run mapping plus formal triage if single-trace output cannot map kernels to useful Python source locations
- save the kernel, overlap-opportunity, and fuse-pattern tables in artifacts
Profile the winning competitor too when the SGLang table alone cannot explain
why the other framework is faster. Compare stage by stage, not just total QPS.
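To make the tables concrete, here is an illustrative reduction of chrome-trace-style events into a kernel table (total GPU time and share per kernel name). The `cat`/`dur`/`name` fields follow the common trace-event layout; this is not the companion skill's actual script.

```python
from collections import defaultdict

def kernel_table(events: list[dict]) -> list[tuple[str, float, float]]:
    """Aggregate trace events into (kernel name, total time, share of GPU time)."""
    totals: dict[str, float] = defaultdict(float)
    for ev in events:
        if ev.get("cat") == "kernel":  # GPU kernel events only
            totals[ev["name"]] += ev["dur"]
    grand = sum(totals.values()) or 1.0
    table = [(name, dur, dur / grand) for name, dur in totals.items()]
    table.sort(key=lambda row: row[1], reverse=True)  # hottest kernels first
    return table
```

The same reduction applied to the competitor's trace lets the stage-by-stage comparison point at a specific dominating kernel rather than total QPS.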
### 5. Turn Tables Into A Root Cause
Use the profiler tables to identify the narrowest plausible bottleneck.
Typical signals:
- kernel table: attention, MoE routing, quantization, sampling, GEMM shape,
cache update, communication, or framework overhead dominates GPU time
- overlap-opportunity table: CPU scheduling, host-to-device work, collectives,
or decode bookkeeping leaves GPU idle time
- fuse-pattern table: a known fusion or overlap path should have applied but did
not, or competitor traces show a fused path SGLang lacks
- source map: hot kernels map to a concrete SGLang Python/CUDA/Triton path that
can be patched
Do not patch from vibes. State the table row, stage, source location, and
benchmark symptom that justify the code change.
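One hedged way to enforce "do not patch from vibes" is a required evidence record whose fields mirror the list above. The record type and field names are purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class RootCause:
    table_row: str    # e.g. the dominating kernel-table entry
    stage: str        # prefill, decode, scheduling, ...
    source_path: str  # SGLang source location the hot kernels map to
    symptom: str      # benchmark symptom: TTFT, ITL, throughput, memory

    def is_actionable(self) -> bool:
        """A patch proposal is actionable only when every field is filled."""
        return all([self.table_row, self.stage, self.source_path, self.symptom])
```

An empty field means the triage is incomplete and the next step is more profiling, not a code change.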
### 6. Patch SGLang Conservatively
Patch SGLang only after the benchmark gap and profiler evidence agree.
Good patch candidates:
- enable or select a better existing kernel for the model/hardware shape
- fix a missed fast path, fusion, overlap, or batching condition
- reduce unnecessary synchronization, CPU scheduling overhead, or tensor copies
- improve model-specific routing, quantization, attention, or cache handling
- add a guarded heuristic that is backed by benchmark and profiler evidence
Avoid changes that merely make the benchmark easier:
- weakening correctness, output quality, safety checks, or tokenizer handling
- changing only the workload or SLA after seeing results
- disabling features for SGLang but not competitors
- claiming SOTA from synthetic data when the user asked for production traffic
Keep patches minimal and local. Add focused tests when behavior changes, and add
microbenchmarks or profiler evidence when performance is the only intended
change.
### 7. Revalidate The Patch
After patching, rerun:
- the relevant unit or integration tests
- the SGLang candidate that exposed the gap
- the same cross-framework benchmark comparison
- the profiler triage if the original gap was diagnosed from profiler tables
If the patch changes SGLang's available knobs, re-search SGLang's best command.
If competitor versions or commands changed during the work, rerun their best
commands too. Preserve before/after artifacts.
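A sketch of the before/after check, assuming flat metric dicts with illustrative names: a patch counts as a win only if the metric that exposed the gap improved and no tracked metric regressed past the noise threshold (reusing the 3-5% default from step 3).

```python
# Metrics where larger is better; everything else is latency-style.
HIGHER_IS_BETTER = {"throughput_tok_s", "goodput_req_s"}

def patch_verdict(before: dict, after: dict, noise: float = 0.03) -> dict:
    """Label each tracked metric as improved, regressed, or unchanged."""
    verdict = {}
    for name, old in before.items():
        new = after[name]
        delta = (new - old) / old if old else 0.0
        if name not in HIGHER_IS_BETTER:
            delta = -delta  # latency-style metrics improve when they drop
        verdict[name] = ("improved" if delta > noise
                         else "regressed" if delta < -noise
                         else "unchanged")
    return verdict
```

Any "regressed" entry means the patch is not done, even if the headline metric improved.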
## Stop Conditions
Stop with a clear report when any of these is true:
- SGLang is the best SLA-passing framework for the target workload
- SGLang is within noise of the best framework and the remaining gap is not
statistically stable
- SGLang remains behind but the root cause is external to SGLang, such as missing
model weights, unavailable backend dependencies, or an unsupported hardware
feature
- a patch improves SGLang but still does not reach SOTA; report the next table
row or source path to investigate
## Final Report Contract
Return a compact report with:
- model, hardware, framework versions, workload, and artifact root
- best deployment command per framework
- benchmark comparison table before patch and after patch
- SGLang gap analysis, including exact profiler table rows and source paths
- patch summary with changed files and correctness tests
- real-model validation result and whether SGLang reached target-environment SOTA
If no code patch was needed, say why and include the benchmark evidence.
If a patch was attempted but not enough, be explicit about the remaining gap.