sglang-skill


SGLang Development


Source Code Locations


The SGLang source lives under repos/sglang/ inside this skill's install directory. The actual path depends on the tool in use:
  • Cursor: ~/.cursor/skills/sglang-skill/repos/sglang/
  • Claude Code: ~/.claude/skills/sglang-skill/repos/sglang/
  • Codex: ~/.agents/skills/sglang-skill/repos/sglang/
SGLANG_REPO: the examples below use ~/.cursor/skills/sglang-skill/repos/sglang/ as a placeholder; substitute the actual path.
If the path does not exist, run bash update-repos.sh sglang from the project directory.

Core Runtime (SRT)


SGLANG_REPO/python/sglang/srt/
├── layers/
│   ├── attention/          # Attention backends
│   │   ├── flashinfer_backend.py      # FlashInfer (default)
│   │   ├── flashinfer_mla_backend.py  # FlashInfer MLA (DeepSeek)
│   │   ├── cutlass_mla_backend.py     # CUTLASS MLA
│   │   ├── flashattention_backend.py  # FlashAttention
│   │   ├── triton_backend.py          # Triton attention
│   │   ├── flashmla_backend.py        # FlashMLA
│   │   ├── nsa_backend.py             # Native Sparse Attention
│   │   ├── tbo_backend.py             # TBO
│   │   ├── fla/                       # Flash Linear Attention
│   │   ├── triton_ops/                # Triton attention ops
│   │   └── wave_ops/                  # Wave attention ops
│   ├── moe/                # MoE routing and dispatch
│   ├── quantization/       # FP8, GPTQ, AWQ, Marlin, etc.
│   ├── deep_gemm_wrapper/  # DeepGEMM integration
│   └── utils/
├── models/                 # Model implementations (LLaMA, DeepSeek, Qwen, etc.)
│   └── deepseek_common/    # Shared DeepSeek V2/V3 components
├── managers/               # Scheduler, TokenizerManager, Detokenizer
├── mem_cache/              # KV cache, radix cache
├── model_executor/         # Model executor, forward batch
├── model_loader/           # Model loading, weight mapping
├── entrypoints/            # Entry points: Engine, OpenAI API server
├── speculative/            # Speculative decoding
├── disaggregation/         # Disaggregated prefill/decode
├── distributed/            # TP/PP/EP distributed execution
├── compilation/            # CUDA Graph, torch.compile
├── configs/                # Model configs
├── lora/                   # LoRA inference
├── eplb/                   # Expert-level load balancing
├── hardware_backend/       # Hardware backends (CUDA, ROCm, XPU)
└── utils/                  # Utilities

JIT Kernels (Python CUDA/Triton Kernels)


SGLANG_REPO/python/sglang/jit_kernel/
├── flash_attention/        # Custom Flash Attention implementations
├── flash_attention_v4.py   # Flash Attention v4
├── cutedsl_gdn.py          # CuTeDSL GDN kernel
├── concat_mla.py           # MLA concat kernel
├── norm.py                 # Normalization kernels
├── rope.py                 # RoPE position encoding
├── pos_enc.py              # Position encoding
├── per_tensor_quant_fp8.py # FP8 quantization
├── kvcache.py              # KV cache kernels
├── hicache.py              # HiCache kernels
├── gptq_marlin.py          # GPTQ Marlin kernel
├── cuda_wait_value.py      # CUDA sync primitives
└── diffusion/              # Diffusion model kernels

sgl-kernel (C++/CUDA Custom Kernels)


SGLANG_REPO/sgl-kernel/
├── csrc/
│   ├── attention/          # Custom attention CUDA kernels
│   ├── cutlass_extensions/ # CUTLASS GEMM extensions
│   ├── gemm/               # GEMM kernels
│   ├── moe/                # MoE dispatch/combine kernels
│   ├── quantization/       # Quantization CUDA kernels
│   ├── allreduce/          # AllReduce CUDA kernels
│   ├── speculative/        # Speculative decoding kernels
│   ├── kvcacheio/          # KV cache I/O
│   ├── mamba/              # Mamba SSM kernels
│   ├── memory/             # Memory management
│   └── grammar/            # Grammar-guided generation
├── include/                # C++ headers
├── python/                 # Python bindings
├── tests/                  # Kernel tests
└── benchmark/              # Kernel benchmarks

Frontend Language


SGLANG_REPO/python/sglang/lang/   # SGLang frontend DSL
SGLANG_REPO/examples/             # Usage examples
SGLANG_REPO/benchmark/            # Benchmarks
SGLANG_REPO/test/                 # Test suite
SGLANG_REPO/docs/                 # Documentation

Search Strategy


Search with the Grep tool instead of loading whole files.

Attention and MLA

```bash
SGLANG_REPO="$HOME/.cursor/skills/sglang-skill/repos/sglang"

# Find attention backend registration
rg "register|Backend" $SGLANG_REPO/python/sglang/srt/layers/attention/attention_registry.py

# Find the FlashInfer MLA implementation
rg "forward|mla" $SGLANG_REPO/python/sglang/srt/layers/attention/flashinfer_mla_backend.py

# Find CUTLASS MLA
rg "cutlass|mla" $SGLANG_REPO/python/sglang/srt/layers/attention/cutlass_mla_backend.py

# Find the common attention interface
rg "class.*Backend|def forward" $SGLANG_REPO/python/sglang/srt/layers/attention/base_attn_backend.py
```

Scheduler and Batching

```bash
# Scheduler core logic
rg "class Scheduler|def get_next_batch" $SGLANG_REPO/python/sglang/srt/managers/

# Continuous batching and chunked prefill
rg "chunk|prefill|extend" $SGLANG_REPO/python/sglang/srt/managers/

# CUDA Graph
rg "cuda_graph|CudaGraph" $SGLANG_REPO/python/sglang/srt/compilation/
```

KV Cache and Memory

```bash
# Radix cache implementation
rg "RadixCache|radix" $SGLANG_REPO/python/sglang/srt/mem_cache/

# KV cache management
rg "class.*Pool|allocate|free" $SGLANG_REPO/python/sglang/srt/mem_cache/

# HiCache (hierarchical cache)
rg "HiCache|hicache" $SGLANG_REPO/python/sglang/srt/mem_cache/
```

Models

```bash
# Find a specific model implementation
rg "class.*ForCausalLM" $SGLANG_REPO/python/sglang/srt/models/

# DeepSeek V2/V3 implementation
rg "DeepSeek|MLA|MoE" $SGLANG_REPO/python/sglang/srt/models/deepseek_v2.py

# Model loading and weight mapping
rg "load_weight|weight_map" $SGLANG_REPO/python/sglang/srt/model_loader/
```

MoE


```bash
# MoE routing
rg "TopK|router|expert" $SGLANG_REPO/python/sglang/srt/layers/moe/

# MoE CUDA kernels
rg "moe" $SGLANG_REPO/sgl-kernel/csrc/moe/
```

Quantization

```bash
# FP8 quantization
rg "fp8|float8" $SGLANG_REPO/python/sglang/srt/layers/quantization/

# GPTQ/AWQ/Marlin
rg "gptq|awq|marlin" $SGLANG_REPO/python/sglang/srt/layers/quantization/
```

Speculative Decoding


```bash
rg "speculative|draft|verify" $SGLANG_REPO/python/sglang/srt/speculative/
```

Distributed

```bash
# TP/PP/EP
rg "tensor_parallel|pipeline_parallel|expert_parallel" $SGLANG_REPO/python/sglang/srt/distributed/

# Disaggregated serving
rg "disagg|prefill_worker|decode_worker" $SGLANG_REPO/python/sglang/srt/disaggregation/
```

When to Use Each Source


| Need | Source | Path |
|------|--------|------|
| Attention backend interface | SRT layers | srt/layers/attention/base_attn_backend.py |
| FlashInfer attention | SRT layers | srt/layers/attention/flashinfer_backend.py |
| MLA (DeepSeek) | SRT layers | srt/layers/attention/*mla*.py |
| MoE routing/dispatch | SRT layers | srt/layers/moe/ |
| Quantization (FP8/GPTQ/AWQ) | SRT layers | srt/layers/quantization/ |
| Scheduler | SRT managers | srt/managers/ |
| KV cache / radix cache | SRT mem_cache | srt/mem_cache/ |
| Model implementations | SRT models | srt/models/ |
| DeepSeek V2/V3 | SRT models | srt/models/deepseek_v2.py, deepseek_common/ |
| Speculative decoding | SRT speculative | srt/speculative/ |
| Disaggregated serving | SRT disagg | srt/disaggregation/ |
| TP/PP/EP distributed | SRT distributed | srt/distributed/ |
| CUDA Graph | SRT compilation | srt/compilation/ |
| Model loading | SRT model_loader | srt/model_loader/ |
| Entry points | SRT entrypoints | srt/entrypoints/ |
| JIT Triton kernels | jit_kernel | jit_kernel/ |
| Custom CUDA kernels | sgl-kernel | sgl-kernel/csrc/ |
| CUTLASS extensions | sgl-kernel | sgl-kernel/csrc/cutlass_extensions/ |
| Frontend DSL | lang | python/sglang/lang/ |
| Usage examples | examples | examples/ |

Common Development Scenarios

Adding a New Attention Backend

  1. Subclass AttnBackend from base_attn_backend.py
  2. Implement the forward() method
  3. Register the backend in attention_registry.py
  4. Use flashinfer_backend.py as a template
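The registry pattern behind these steps can be sketched in plain Python. This is an illustrative mock, not SGLang's actual code: the names AttnBackend, ATTENTION_BACKENDS, and register_backend are stand-ins for the real interfaces in base_attn_backend.py and attention_registry.py.

```python
# Illustrative sketch of the backend-registry pattern; names are stand-ins
# for SGLang's real classes, not copied from the source tree.
from typing import Callable, Dict, List


class AttnBackend:
    """Base class: every backend implements forward()."""

    def forward(self, q: List[float], k: List[float], v: List[float]) -> List[float]:
        raise NotImplementedError


# Registry mapping backend names to constructors (cf. attention_registry.py).
ATTENTION_BACKENDS: Dict[str, Callable[[], AttnBackend]] = {}


def register_backend(name: str):
    def decorator(cls):
        ATTENTION_BACKENDS[name] = cls
        return cls
    return decorator


@register_backend("naive")
class NaiveBackend(AttnBackend):
    """Toy backend: returns v scaled by a dot-product score of q and k."""

    def forward(self, q, k, v):
        score = sum(qi * ki for qi, ki in zip(q, k))
        return [score * vi for vi in v]


# Look a backend up by name, as a server flag would.
backend = ATTENTION_BACKENDS["naive"]()
out = backend.forward([1.0, 0.0], [1.0, 1.0], [2.0, 3.0])
print(out)  # [2.0, 3.0]
```

The real interface takes batched tensors and cache metadata rather than lists, but the lookup-by-name flow is the part the registration step above wires up.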

Adding a New Model

  1. Create a model file under srt/models/
  2. Implement the ForCausalLM class
  3. Implement the load_weights() method
  4. Use srt/models/llama.py as a template
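A minimal sketch of the load_weights() pattern, with plain lists standing in for tensors. MyModelForCausalLM and its mapping are hypothetical, modeled loosely on how srt/models/llama.py maps separate q/k/v checkpoint tensors into one fused projection parameter.

```python
# Hypothetical model skeleton illustrating the load_weights() pattern.
# Real SGLang models copy checkpoint tensors into torch parameters; plain
# dicts and lists stand in for tensors here.
from typing import Dict, Iterable, List, Tuple


class MyModelForCausalLM:
    def __init__(self):
        # Named parameters of the model (placeholder values).
        self.params: Dict[str, List[float]] = {
            "embed_tokens.weight": [],
            "qkv_proj.weight": [],   # fused q/k/v, filled from 3 checkpoint tensors
            "lm_head.weight": [],
        }
        # Checkpoint name -> (model param, shard index), in the spirit of
        # llama.py's stacked-params mapping for fused projections.
        self.stacked_mapping = {
            "q_proj.weight": ("qkv_proj.weight", 0),
            "k_proj.weight": ("qkv_proj.weight", 1),
            "v_proj.weight": ("qkv_proj.weight", 2),
        }

    def load_weights(self, weights: Iterable[Tuple[str, List[float]]]):
        for name, tensor in weights:
            if name in self.stacked_mapping:
                target, _shard = self.stacked_mapping[name]
                self.params[target].extend(tensor)  # append the shard
            elif name in self.params:
                self.params[name] = tensor
            # Unknown names (e.g. rotary caches) are silently skipped.


model = MyModelForCausalLM()
model.load_weights([
    ("q_proj.weight", [1.0]),
    ("k_proj.weight", [2.0]),
    ("v_proj.weight", [3.0]),
    ("lm_head.weight", [9.0]),
])
print(model.params["qkv_proj.weight"])  # [1.0, 2.0, 3.0]
```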

Adding a New Quantization Method

  1. Add a quantization module under srt/layers/quantization/
  2. Register it with the quantization factory
  3. Use fp8_kernel.py and gptq.py as references
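The core of a per-tensor scheme like the one in per_tensor_quant_fp8.py is a scale plus a round-trip. A pure-Python sketch, assuming symmetric quantization against FP8-E4M3's maximum finite value of 448.0 (the real kernels do this on GPU and actually cast to 8-bit floats):

```python
# Pure-Python sketch of symmetric per-tensor quantization; the CUDA
# kernels additionally cast the scaled values to the fp8 e4m3 format.
from typing import List, Tuple

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3


def quantize_per_tensor(x: List[float]) -> Tuple[List[float], float]:
    """Scale x so its max magnitude maps to FP8_E4M3_MAX."""
    amax = max(abs(v) for v in x)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [v / scale for v in x], scale


def dequantize(q: List[float], scale: float) -> List[float]:
    return [v * scale for v in q]


x = [0.5, -1.0, 0.25]
q, scale = quantize_per_tensor(x)
assert abs(max(abs(v) for v in q) - FP8_E4M3_MAX) < 1e-6
# Round-trip is near-exact here; real fp8 rounding loses precision.
assert all(abs(a - b) < 1e-9 for a, b in zip(dequantize(q, scale), x))
```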

Launching and Debugging


Launch an OpenAI-compatible API server

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 1
```

Use the Engine API (Python)

```python
from sglang import Engine

engine = Engine(model_path="meta-llama/Meta-Llama-3-8B-Instruct")
```

Profiling


```bash
python -m sglang.launch_server --model-path ... --enable-torch-compile
nsys profile -o report python -m sglang.launch_server ...
```

Updating the SGLang Source

```bash
# Run from the cursor-gpu-skills project directory
bash update-repos.sh sglang
```

Additional References
