SGLang
High-performance serving framework for LLMs and VLMs with RadixAttention for automatic prefix caching.

When to use SGLang


Use SGLang when:
  • Need structured outputs (JSON, regex, grammar)
  • Building agents with repeated prefixes (system prompts, tools)
  • Agentic workflows with function calling
  • Multi-turn conversations with shared context
  • Need faster JSON decoding (3× vs standard)
Use vLLM instead when:
  • Simple text generation without structure
  • Don't need prefix caching
  • Want mature, widely-tested production system
Use TensorRT-LLM instead when:
  • Lowest single-request latency (no batching needed)
  • NVIDIA-only deployment
  • Need FP8/INT4 quantization on H100

Quick start


Installation


```bash
# pip install (recommended)
pip install "sglang[all]"

# With FlashInfer (faster, CUDA 11.8/12.1)
pip install "sglang[all]" flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

# From source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```

Launch server


```bash
# Basic server (Llama 3-8B)
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --port 30000

# With RadixAttention (automatic prefix caching)
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --port 30000 \
    --enable-radix-cache  # Default: enabled

# Multi-GPU (tensor parallelism)
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-70B-Instruct \
    --tp 4 \
    --port 30000
```

Basic inference


```python
import sglang as sgl

# Set backend
sgl.set_default_backend(sgl.OpenAI("http://localhost:30000/v1"))

# Simple generation
@sgl.function
def simple_gen(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", max_tokens=100)

# Run
state = simple_gen.run(question="What is the capital of France?")
print(state["answer"])
# Output: "The capital of France is Paris."
```

Structured JSON output


```python
import sglang as sgl

@sgl.function
def extract_person(s, text):
    s += f"Extract person information from: {text}\n"
    s += "Output JSON:\n"

    # Constrained JSON generation
    s += sgl.gen(
        "json_output",
        max_tokens=200,
        regex=r'\{"name": "[^"]+", "age": \d+, "occupation": "[^"]+"\}'
    )

# Run
state = extract_person.run(
    text="John Smith is a 35-year-old software engineer."
)
print(state["json_output"])
# Output: {"name": "John Smith", "age": 35, "occupation": "software engineer"}
```

RadixAttention (Key Innovation)


What it does: Automatically caches and reuses common prefixes across requests.
Performance:
  • 5× faster for agentic workloads with shared system prompts
  • 10× faster for few-shot prompting with repeated examples
  • Zero configuration - works automatically
How it works:
  1. Builds radix tree of all processed tokens
  2. Automatically detects shared prefixes
  3. Reuses KV cache for matching prefixes
  4. Only computes new tokens
Example (Agent with system prompt):
Request 1: [SYSTEM_PROMPT] + "What's the weather?"
→ Computes full prompt (1000 tokens)

Request 2: [SAME_SYSTEM_PROMPT] + "Book a flight"
→ Reuses system prompt KV cache (998 tokens)
→ Only computes 2 new tokens
→ 5× faster!
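The caching steps above can be sketched with a token-level trie. This is illustrative only — the real RadixAttention manages actual KV tensors in a radix tree with eviction, while `RadixNode`, `PrefixCache`, and `kv_slot` below are hypothetical names standing in for that machinery:

```python
class RadixNode:
    """One node per token; children keyed by token id."""
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.kv_slot = None  # placeholder for a cached KV entry

class PrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record every prefix of `tokens` as cached."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, RadixNode())
            node.kv_slot = object()

cache = PrefixCache()
system_prompt = list(range(998))      # pretend token ids for a long system prompt
req1 = system_prompt + [5001, 5002]   # "What's the weather?"
print(cache.match_prefix(req1))       # 0 -> full prompt must be computed
cache.insert(req1)

req2 = system_prompt + [6001, 6002]   # "Book a flight"
print(cache.match_prefix(req2))       # 998 -> only 2 new tokens to compute
```

Request 2 pays only for its two new tokens, which is the source of the speedup described above.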

Structured generation patterns


JSON with schema


```python
@sgl.function
def structured_extraction(s, article):
    s += f"Article: {article}\n\n"
    s += "Extract key information as JSON:\n"

    # JSON schema constraint
    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"},
            "summary": {"type": "string"},
            "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]}
        },
        "required": ["title", "author", "summary", "sentiment"]
    }

    s += sgl.gen("info", max_tokens=300, json_schema=schema)

state = structured_extraction.run(article="...")
print(state["info"])
# Output: valid JSON matching the schema
```
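The `json_schema` constraint should make invalid output impossible, but a cheap client-side re-check is a reasonable safety net before downstream use. A standard-library-only sketch (the `validate_info` helper and its field table are illustrative, not part of SGLang):

```python
import json

REQUIRED = {"title": str, "author": str, "summary": str, "sentiment": str}
SENTIMENTS = {"positive", "negative", "neutral"}

def validate_info(raw: str) -> dict:
    """Parse constrained model output and re-check the schema's requirements."""
    obj = json.loads(raw)
    for key, typ in REQUIRED.items():
        if not isinstance(obj.get(key), typ):
            raise ValueError(f"bad or missing field: {key}")
    if obj["sentiment"] not in SENTIMENTS:
        raise ValueError("sentiment outside allowed enum")
    return obj

sample = '{"title": "T", "author": "A", "summary": "S", "sentiment": "neutral"}'
print(validate_info(sample)["sentiment"])  # neutral
```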

Regex-constrained generation


```python
@sgl.function
def extract_email(s, text):
    s += f"Extract email from: {text}\n"
    s += "Email: "

    # Email regex pattern
    s += sgl.gen(
        "email",
        max_tokens=50,
        regex=r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    )

state = extract_email.run(text="Contact john.doe@example.com for details")
print(state["email"])
# Output: "john.doe@example.com"
```
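Since the constraint is an ordinary regular expression, the same pattern can double as a client-side sanity check with Python's `re` module (a trivial sketch; `check_email` is an illustrative helper, not an SGLang API):

```python
import re

EMAIL_RE = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

def check_email(candidate: str) -> bool:
    """True iff the whole string matches the constraint pattern."""
    return re.fullmatch(EMAIL_RE, candidate) is not None

print(check_email("john.doe@example.com"))  # True
print(check_email("not an email"))          # False
```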

Grammar-based generation


```python
@sgl.function
def generate_code(s, description):
    s += f"Generate Python code for: {description}\n"
    s += "```python\n"

    # EBNF grammar for Python function definitions
    python_grammar = """
    ?start: function_def
    function_def: "def" NAME "(" [parameters] "):" suite
    parameters: parameter ("," parameter)*
    parameter: NAME
    suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
    """

    s += sgl.gen("code", max_tokens=200, grammar=python_grammar)
    s += "\n```"
```

Agent workflows with function calling


```python
import sglang as sgl

# Define tools
tools = [
    {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            }
        }
    },
    {
        "name": "book_flight",
        "description": "Book a flight",
        "parameters": {
            "type": "object",
            "properties": {
                "from": {"type": "string"},
                "to": {"type": "string"},
                "date": {"type": "string"}
            }
        }
    }
]

@sgl.function
def agent_workflow(s, user_query, tools):
    # System prompt (cached with RadixAttention)
    s += "You are a helpful assistant with access to tools.\n"
    s += f"Available tools: {tools}\n\n"

    # User query
    s += f"User: {user_query}\n"
    s += "Assistant: "

    # Generate with function calling
    s += sgl.gen(
        "response",
        max_tokens=200,
        tools=tools,  # SGLang handles tool call format
        stop=["User:", "\n\n"]
    )

# Multiple queries reuse the system prompt
state1 = agent_workflow.run(
    user_query="What's the weather in NYC?",
    tools=tools
)
# First call: computes the full system prompt

state2 = agent_workflow.run(
    user_query="Book a flight to LA",
    tools=tools
)
# Second call: reuses the system prompt KV cache (5× faster)
```

Performance benchmarks


RadixAttention speedup


Few-shot prompting (10 examples in prompt):
  • vLLM: 2.5 sec/request
  • SGLang: 0.25 sec/request (10× faster)
  • Throughput: 4× higher
Agent workflows (1000-token system prompt):
  • vLLM: 1.8 sec/request
  • SGLang: 0.35 sec/request (5× faster)
JSON decoding:
  • Standard: 45 tok/s
  • SGLang: 135 tok/s (3× faster)

Throughput (Llama 3-8B, A100)


| Workload | vLLM | SGLang | Speedup |
|---|---|---|---|
| Simple generation | 2500 tok/s | 2800 tok/s | 1.12× |
| Few-shot (10 examples) | 500 tok/s | 5000 tok/s | 10× |
| Agent (tool calls) | 800 tok/s | 4000 tok/s | 5× |
| JSON output | 600 tok/s | 2400 tok/s | 4× |

Multi-turn conversations


```python
@sgl.function
def multi_turn_chat(s, history, new_message):
    # System prompt (always cached)
    s += "You are a helpful AI assistant.\n\n"

    # Conversation history (cached as it grows)
    for msg in history:
        s += f"{msg['role']}: {msg['content']}\n"

    # New user message (only new part)
    s += f"User: {new_message}\n"
    s += "Assistant: "
    s += sgl.gen("response", max_tokens=200)

# Turn 1
history = []
state = multi_turn_chat.run(history=history, new_message="Hi there!")
history.append({"role": "User", "content": "Hi there!"})
history.append({"role": "Assistant", "content": state["response"]})

# Turn 2 (reuses Turn 1 KV cache; only the new message is computed)
state = multi_turn_chat.run(history=history, new_message="What's 2+2?")

# Turn 3 (reuses Turn 1 + Turn 2 KV cache)
state = multi_turn_chat.run(history=history, new_message="Tell me a joke")
# Progressively faster as history grows
```

Advanced features


Speculative decoding


```bash
# Launch with a draft model (2-3× faster)
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-70B-Instruct \
    --speculative-model meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-num-steps 5
```
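The draft-then-verify loop behind speculative decoding can be illustrated with toy deterministic stand-ins for the two models (nothing below is SGLang's implementation): the draft proposes several tokens, the target checks them, and the longest agreeing prefix is kept plus one guaranteed token from the target.

```python
def target_next(prefix):
    """Toy 'target model': a deterministic next-token rule."""
    return (sum(prefix) + 1) % 7

def draft_propose(prefix, k):
    """Toy 'draft model': same rule, but wrong on its 3rd token
    so that verification has something to reject."""
    out = []
    for i in range(k):
        tok = target_next(prefix + out)
        if i == 2:                # injected draft error
            tok = (tok + 1) % 7
        out.append(tok)
    return out

def speculative_step(prefix, k=5):
    """One step: accept the draft's longest agreeing prefix,
    then append one token from the target (progress is always >= 1)."""
    proposal = draft_propose(prefix, k)
    accepted = []
    for tok in proposal:
        if tok != target_next(prefix + accepted):
            break
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))
    return accepted

out = speculative_step([1, 2, 3], k=5)
print(len(out))  # 3 tokens gained from one target verification pass
```

In a real server the verification is a single batched forward pass of the large model over all proposed positions, which is where the 2-3× speedup comes from.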

Multi-modal (vision models)


```python
@sgl.function
def describe_image(s, image_path):
    s += sgl.image(image_path)
    s += "Describe this image in detail: "
    s += sgl.gen("description", max_tokens=200)

state = describe_image.run(image_path="photo.jpg")
print(state["description"])
```

Batching and parallel requests


```python
# Automatic batching (continuous batching)
states = sgl.run_batch(
    [
        simple_gen.bind(question="What is AI?"),
        simple_gen.bind(question="What is ML?"),
        simple_gen.bind(question="What is DL?"),
    ]
)
# All 3 requests are processed in a single batch (efficient)
```

OpenAI-compatible API


```bash
# Start server with OpenAI-compatible API
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --port 30000

# Use with the chat completions endpoint
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "default",
        "messages": [
            {"role": "system", "content": "You are helpful"},
            {"role": "user", "content": "Hello"}
        ],
        "temperature": 0.7,
        "max_tokens": 100
    }'
```

Works with the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello"}]
)
```

Supported models


Text models:
  • Llama 2, Llama 3, Llama 3.1, Llama 3.2
  • Mistral, Mixtral
  • Qwen, Qwen2, QwQ
  • DeepSeek-V2, DeepSeek-V3
  • Gemma, Phi-3
Vision models:
  • LLaVA, LLaVA-OneVision
  • Phi-3-Vision
  • Qwen2-VL
100+ models from HuggingFace

Hardware support


  • NVIDIA: A100, H100, L4, T4 (CUDA 11.8+)
  • AMD: MI300, MI250 (ROCm 6.0+)
  • Intel: Xeon with GPU (coming soon)
  • Apple: M1/M2/M3 via MPS (experimental)

References


  • Structured Generation Guide - JSON schemas, regex, grammars, validation
  • RadixAttention Deep Dive - How it works, optimization, benchmarks
  • Production Deployment - Multi-GPU, monitoring, autoscaling

Resources
