llm-torch-profiler-analysis


Unified LLM Torch Profiler Analysis

Overview

Use this skill for torch.profiler analysis across:
  • sglang
  • vllm
  • TensorRT-LLM
There is only one public workflow:
  • triage
Preferred unified entrypoint:
  • scripts/analyze_llm_torch_profile.py
Backwards-compatibility shim (kept so older docker exec ... analyze_sglang_torch_profile.py ... calls keep working; it just forwards to the unified entrypoint):
  • scripts/analyze_sglang_torch_profile.py
Markdown bundling helper:
  • scripts/render_triage_markdown_bundle.py
triage always prints the same three tables:
  • kernel table
  • overlap-opportunity table
  • fuse-pattern table
By default, all three tables only render rows at or above 1.0% cumulative GPU-time share; rows below that threshold stay hidden unless the user asks for a lower cutoff.
Keep the fuse-pattern table source-backed and deterministic. Do not turn it into a fuzzy matcher.
If exact source-backed matching is weak but a kernel cluster is still close to a known family, add one short note after the tables rating the similarity as exactly one of:
  • high
  • medium
  • low
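The 1.0% cutoff behaves roughly like the sketch below. The row format and function name are illustrative, not the script's actual internals, and "share" is interpreted here as each row's fraction of total GPU time:

```python
def filter_rows(rows, cutoff_pct=1.0):
    """Keep only rows whose share of total GPU time is at or above
    cutoff_pct percent, sorted by GPU time descending.
    `rows` is a list of (name, gpu_time_us) pairs; this mirrors the
    documented default behavior, not the script's real code."""
    total = sum(t for _, t in rows)
    if total == 0:
        return []
    return [
        (name, t, 100.0 * t / total)
        for name, t in sorted(rows, key=lambda r: r[1], reverse=True)
        if 100.0 * t / total >= cutoff_pct
    ]
```

A lower cutoff, as mentioned above, simply widens the kept set.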

Capability Matrix

Capability | SGLang | vLLM | TensorRT-LLM
Existing trace triage | yes | yes | yes
Single-trace live capture | yes | yes, if torch profiler is enabled on server | requires profiler control endpoints
Two-trace mapping + formal triage | yes | yes | yes
Stage-aware live capture | yes | no | no
--profile-prefix control | yes | usually ignored on HTTP profiler route | usually ignored on HTTP profiler route
For TensorRT-LLM, live capture only works when the server exposes /start_profile and /stop_profile, and when the deployment already provides a shared trace path plus the required env vars.
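That precondition can be expressed as a small check. The route and env-var names come from this document; the function itself is illustrative:

```python
def trtllm_live_capture_ready(exposed_routes, env):
    """Return True only when the server exposes both profiler control
    endpoints and the deployment provides the trace-path env vars this
    document requires. Illustrative gate, not a real trtllm API."""
    routes_ok = {"/start_profile", "/stop_profile"} <= set(exposed_routes)
    env_ok = env.get("TLLM_PROFILE_START_STOP") == "1" and bool(
        env.get("TLLM_TORCH_PROFILE_TRACE")
    )
    return routes_ok and env_ok
```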

Real H100 Validation

The current reference run is the 4x H100 matrix captured on 2026-04-23 on h100_sglang under:
  • /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3
Rendered markdown bundle:
  • /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3/h100_large_model_matrix_v3_bundle.md
Validated model directories:
  • mixtral_8x7b_instruct
  • qwen2_5_32b_instruct
  • qwen3_32b
Each model directory contains:
  • analysis_sglang.txt
  • analysis_vllm.txt
  • analysis_trtllm.txt
  • framework-specific trace roots and probe artifacts
Validated matrix:
Model | SGLang | vLLM | TensorRT-LLM | Result
mistralai/Mixtral-8x7B-Instruct-v0.1 | 4x H100 | 4x H100 | 4x H100 | three tables rendered correctly on all three frameworks; benchmark probes returned direct, non-empty text
Qwen/Qwen2.5-32B-Instruct | 4x H100 | 4x H100 | 4x H100 | three tables rendered correctly on all three frameworks; benchmark probes returned direct, non-empty text
Qwen/Qwen3-32B | 4x H100 | 4x H100 | 4x H100 | three tables rendered correctly on all three frameworks; vLLM and TensorRT-LLM chat probes often emitted <think> prefixes
Use this run as the main H100 reference. The older 2026-04-22 single-card Qwen3 matrix is still useful for bring-up, but it is no longer the default reference.
Checked-in sample outputs:
  • references/validated_outputs/20260422_h100_qwen3_matrix/qwen3_30b_a3b
To render a validated run into one markdown document:
bash
python3 scripts/render_triage_markdown_bundle.py \
  --analysis-root /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3 \
  --output /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3/h100_large_model_matrix_v3_bundle.md
The bundle groups by model and keeps the three tables for each framework.
H100 notes:
  • all three frameworks now render kernel, overlap, and fuse tables with separate extend/prefill and decode sections when the trace contains a clean stage split
  • SGLang live capture is validated and calls the server profiler API directly instead of shelling out to sglang.profiler
  • SGLang trace flush can lag well beyond a few seconds, so the runner waits longer for artifacts than the earlier implementation
  • SGLang kernel-site reconstruction keeps sampling disabled in the mapping path so the optimized parser does not perturb SGLang table output; equality rechecks matched for Mixtral-8x7B-Instruct-v0.1, Qwen3-32B, and nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
  • vLLM live capture requires --output-dir to match the server torch_profiler_dir; the validated H100 flow uses --profiler-config {"profiler":"torch","torch_profiler_dir":"..."} and then drives /start_profile and /stop_profile
  • TensorRT-LLM validation stays on --backend pytorch; the H100 flow writes the trace with TLLM_TORCH_PROFILE_TRACE and then analyzes the saved trace
  • the current TensorRT-LLM py_executor.py profiler setup still needs a with_stack=True override for table-quality Python locations, and the matrix runner generates that override under /data/bbuf/validate/unified_llm_profiler_skill/overrides/trtllm
  • on this host, keep all trace roots under /data/..., not /home/...

When To Use It

  • inspect a torch.profiler trace or profile directory from sglang, vllm, or TensorRT-LLM
  • profile a live serving endpoint and analyze the result
  • summarize which kernel families dominate prefill or decode
  • map kernels back to Python code paths
  • judge whether a code path still leaves overlap opportunity
  • check whether an already-known fusion or overlap path should have applied

Diffusion Backend Gate

For diffusion benchmark or profiling work, only analyze traces produced by the native SGLang diffusion backend.
If the run that generated the trace logs any of:
  • Falling back to diffusers backend
  • Using diffusers backend
  • Loaded diffusers pipeline
stop the workflow instead of analyzing the trace. Handle it as a backend-selection issue, not as native-kernel profiler evidence.
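This gate is easy to automate; the marker strings come straight from this section, while the function name is illustrative:

```python
DIFFUSERS_FALLBACK_MARKERS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)

def trace_is_native_diffusion(server_log_text):
    """Return False when the run that produced the trace logged any
    diffusers-backend marker, meaning the trace must not be analyzed
    as native-kernel profiler evidence."""
    return not any(m in server_log_text for m in DIFFUSERS_FALLBACK_MARKERS)
```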

Main Flows

1. Single-trace triage from an existing profile dir or trace

bash
python3 scripts/analyze_llm_torch_profile.py \
  --input /path/to/profile_dir_or_trace.json.gz
Use this when one trace is enough. The overlap table stays conservative in single-trace mode and will tell you when a mapping/formal pair is needed.

2. Single-trace live capture from SGLang

bash
python3 scripts/analyze_llm_torch_profile.py \
  --framework sglang \
  --url http://127.0.0.1:30000 \
  --output-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example/sglang_profile_live \
  --num-steps 5 \
  --profile-by-stage
The script sends POST /start_profile to the SGLang server directly. Keep --output-dir under /data/... so later analysis and docs can see the trace. The script writes server_args.json, sends the probe requests after profiling is armed, and waits longer for trace flush than the earlier implementation.
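The live-capture call amounts to one POST against the server. A hedged sketch follows; the request field names are assumptions inferred from this section's CLI flags, not a verified SGLang API schema:

```python
def build_start_profile_payload(output_dir, num_steps=5, profile_by_stage=True):
    """Assemble a JSON body for POST /start_profile. Field names here
    mirror this skill's CLI flags; the real SGLang request schema may
    differ, so treat every key as an assumption."""
    return {
        "output_dir": output_dir,
        "num_steps": num_steps,
        "profile_by_stage": profile_by_stage,
    }

# Hypothetical usage (not executed here):
#   import requests
#   requests.post(url + "/start_profile",
#                 json=build_start_profile_payload("/data/.../sglang_profile_live"))
```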

3. Single-trace live capture from vLLM

Launch vLLM with torch profiler enabled, for example:
bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --profiler-config '{"profiler":"torch","torch_profiler_dir":"/data/bbuf/validate/unified_llm_profiler_skill/runs/example/vllm_profile"}'
Then run:
bash
python3 scripts/analyze_llm_torch_profile.py \
  --framework vllm \
  --url http://127.0.0.1:8000 \
  --output-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example/vllm_profile \
  --num-steps 5 \
  --no-profile-by-stage
For vLLM, --output-dir must point to the same torch_profiler_dir the server uses. The current vLLM profiler config already defaults torch_profiler_with_stack=true, so the runner only needs to set torch_profiler_dir. On h100_sglang, external vLLM containers should mount both:
  • /data/.cache/huggingface:/root/.cache/huggingface
  • /data/bbuf/validate/unified_llm_profiler_skill:/data/bbuf/validate/unified_llm_profiler_skill

4. Single-trace live capture from TensorRT-LLM

Use this only when the server exposes POST /start_profile and POST /stop_profile, and the trace path is shared with the current machine.
Typical env expectations are:
  • TLLM_PROFILE_START_STOP=1
  • TLLM_TORCH_PROFILE_TRACE=/shared/path/trace.json or .json.gz
Then run:
bash
python3 scripts/analyze_llm_torch_profile.py \
  --framework trtllm \
  --url http://127.0.0.1:8000 \
  --output-dir /shared/path \
  --num-steps 5 \
  --no-profile-by-stage
If the deployment does not expose the profiler control endpoints, fall back to analyzing an existing trace instead of trying live capture.
On the current TensorRT-LLM mainline path, py_executor.py creates the torch profiler with record_shapes=True and with_modules=True but not with_stack=True. For table-quality validation, use the override generator:
bash
python3 scripts/make_trtllm_py_executor_override.py \
  --source /path/to/original/py_executor.py \
  --output /data/bbuf/validate/unified_llm_profiler_skill/overrides/trtllm/py_executor_with_stack.py
The matrix runner does this automatically on H100 before TensorRT-LLM capture starts.
This is the validated TensorRT-LLM flow on h100_sglang:
  1. launch trtllm-serve with TLLM_TORCH_PROFILE_TRACE=/data/.../trace.json
  2. run a few benchmark requests
  3. analyze the emitted trace with --input /data/.../trace.json

5. Two-trace triage from existing profile dirs or traces

bash
python3 scripts/analyze_llm_torch_profile.py triage \
  --mapping-input /path/to/graph_off_profile_dir \
  --formal-input /path/to/graph_on_profile_dir
Use this when you need stronger overlap attribution and kernel-to-source mapping.

6. Two-trace triage from running servers

bash
python3 scripts/analyze_llm_torch_profile.py triage \
  --framework sglang \
  --mapping-url http://127.0.0.1:31025 \
  --formal-url http://127.0.0.1:31026 \
  --num-steps 5 \
  --profile-by-stage
For vllm or TensorRT-LLM, use the same shape but pass:
  • --framework vllm or --framework trtllm
  • --mapping-output-dir ...
  • --formal-output-dir ...
  • --no-profile-by-stage

profile_by_stage

--profile-by-stage is only meaningful on the SGLang live-capture path.
  • On ordinary non-PD SGLang serving, it is still useful because prefill and decode usually have very different bottlenecks.
  • On the current profile-v2 path inside SGLang, stage-based profiling is effectively the normal path.
  • PD-disaggregated serving adds one extra rule: prefill workers and decode workers must be profiled separately. That is stricter than ordinary profile_by_stage.
  • For vllm and TensorRT-LLM, disable it with --no-profile-by-stage.
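The per-framework rule above reduces to one branch. A sketch (framework labels match this skill's --framework values):

```python
def stage_flag(framework):
    """Return the stage flag this document recommends per framework:
    SGLang live capture keeps stage-aware profiling on; vLLM and
    TensorRT-LLM must disable it."""
    if framework == "sglang":
        return "--profile-by-stage"
    if framework in ("vllm", "trtllm"):
        return "--no-profile-by-stage"
    raise ValueError(f"unknown framework: {framework!r}")
```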

How To Choose The Triage Shape

Single-trace triage

Use when you want the lowest-friction report:
  • one trace is already available
  • you mainly want kernel share and fusion clues
  • you are comparing two runs side by side by running triage once per trace
Prefer this by default.

Two-trace triage

Use when you need:
  • a stronger overlap answer
  • graph-off source mapping plus graph-on final behavior
  • more trustworthy overlap recommendations in the middle table
Capture order:
  1. mapping trace with graph disabled or with the lower-fusion / more-readable config
  2. formal trace with the real serving optimizations enabled
Do not call the mapping pass a "fast profile". It exists to recover kernel -> cpu_op -> python scope.
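The kernel -> cpu_op recovery relies on event correlation in the exported trace. A simplified sketch over already-parsed chrome-trace events follows; the "correlation" args key matches common torch.profiler chrome exports, but treat the exact key names as assumptions for any given version, and note that real traces also need the python_function stack above each cpu_op:

```python
def map_kernels_to_cpu_ops(events):
    """Given chrome-trace events (dicts), link each GPU kernel event to
    the CPU op that launched it via a shared correlation id.
    Simplified illustration of the mapping pass, not the script."""
    launches = {
        ev["args"]["correlation"]: ev["name"]
        for ev in events
        if ev.get("cat") == "cpu_op" and "correlation" in ev.get("args", {})
    }
    return {
        ev["name"]: launches.get(ev["args"]["correlation"])
        for ev in events
        if ev.get("cat") == "kernel" and "correlation" in ev.get("args", {})
    }
```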

Workflow

Single-trace workflow

  1. If the user only wants a diagnosis, one trace is enough.
  2. Prefer one-rank traces over merged traces whenever the profiler emitted both.
  3. For a live server, let the script drive the profiler only when the framework-specific prerequisites are already met.
  4. Prefer SGLang --profile-by-stage unless the user explicitly wants an all-stage mixed trace.
  5. When on h100_sglang, create or clean the target trace directory through docker exec sglang_bbuf ... so the path is definitely writable under /data.

Two-trace workflow

  1. Produce a mapping trace first with graph disabled or the lower-fusion configuration.
  2. Produce a formal trace second with the real serving optimizations enabled.
  3. Run
    triage
    for the three-table report.
  4. Read the results in this order:
    • kernel table
    • overlap-opportunity table
    • fuse-pattern table
  5. Before calling something a "new" optimization idea, compare the top rows against both references/fuse-overlap-catalog.md and references/overlap-catalog.md. Check mainline rows first, then the PR-backed / in-flight sections. Prefer reporting:
    • an existing fused or overlap path that should already apply here
    • an existing path that appears disabled, unsupported, or regressed in this trace
    • an upstream pattern that is mainline elsewhere but missing locally, or still open upstream
    • a truly new opportunity only when no catalog entry fits
  6. If no exact pattern fully matches but the trace is still close to a known family, add one flat similarity note after the tables. Use high, medium, or low only. Base that note on the full pattern shape, not on one kernel name alone. Prefer semantic cues such as producer-consumer chain, source locations, CPU op names, TP context, and model-specific structure. Do not rewrite the script table itself to include these heuristic judgments.
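The similarity note in step 6 has exactly three allowed levels; a small formatter that enforces that contract (the note wording is illustrative):

```python
ALLOWED_LEVELS = ("high", "medium", "low")

def similarity_note(family, level, cues):
    """Render the one-line note added after the tables. Rejects any
    level outside the three allowed values; `cues` should name the
    semantic evidence (producer-consumer chain, source locations, ...)."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"level must be one of {ALLOWED_LEVELS}, got {level!r}")
    return f"similarity to {family}: {level} (cues: {', '.join(cues)})"
```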

References

Load these only when needed:
  • references/source-map.md
    • upstream SGLang profiler entrypoints and trace-writing paths; still most useful for SGLang-specific source follow-up
  • references/heuristics.md
    • overlap labels, dependency-risk interpretation, and limits
  • references/fuse-overlap-catalog.md
    • mixed source-backed catalog of existing fuse and overlap patterns, including mainline rows plus PR-backed / in-flight rows
  • references/overlap-catalog.md
    • overlap-only lookup table across LLM, VLM, diffusion, disaggregation, HiSparse, and speculative scheduling

Output Contract

Return:
  • trace path or generated profile path
  • framework
  • model/server args when available
  • kernel table
  • overlap-opportunity table
  • fuse-pattern table
  • optional similarity note with high / medium / low when exact matching is inconclusive
  • one short summary of what dominates the run
  • whether the overlap read came from single-trace triage or mapping/formal two-trace triage
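The returned fields can be captured in one structure. This is a hypothetical container mirroring the contract above, not a class the scripts actually define:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TriageReport:
    """Mirrors the output contract above; all names are illustrative."""
    trace_path: str
    framework: str
    kernel_table: str
    overlap_table: str
    fuse_table: str
    overlap_source: str  # "single-trace" or "mapping/formal two-trace"
    summary: str
    model_server_args: Optional[dict] = None
    similarity_note: Optional[str] = None  # one of high/medium/low when present
```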