llm-torch-profiler-analysis


Unified LLM Torch Profiler Analysis

Overview

Use this skill for torch.profiler analysis across:
  • sglang
  • vllm
  • TensorRT-LLM
There is only one public workflow:
  • triage
Preferred unified entrypoint:
  • scripts/analyze_llm_torch_profile.py
Backwards-compatibility shim (kept so older docker exec ... analyze_sglang_torch_profile.py ... calls keep working; it just forwards to the unified entrypoint):
  • scripts/analyze_sglang_torch_profile.py
Markdown bundling helper:
  • scripts/render_triage_markdown_bundle.py
triage always prints the same three tables:
  • kernel table
  • overlap-opportunity table
  • fuse-pattern table
By default, all three tables only render rows at or above 1.0% cumulative GPU-time share; rows below that threshold stay hidden unless the user asks for a lower cutoff.
Keep the fuse-pattern table source-backed and deterministic. Do not turn it into a fuzzy matcher.
If exact source-backed matching is weak but a kernel cluster is still close to a known family, add one short note after the tables rating the similarity as exactly one of:
  • high
  • medium
  • low
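The 1.0% cutoff behaves roughly like the sketch below. The row format and function name are illustrative, not the script's actual internals, and "share" is interpreted here as each row's fraction of total GPU time:

```python
def filter_rows(rows, cutoff_pct=1.0):
    """Keep only rows whose share of total GPU time is at or above
    cutoff_pct percent, sorted by GPU time descending.
    `rows` is a list of (name, gpu_time_us) pairs; this mirrors the
    documented default behavior, not the script's real code."""
    total = sum(t for _, t in rows)
    if total == 0:
        return []
    return [
        (name, t, 100.0 * t / total)
        for name, t in sorted(rows, key=lambda r: r[1], reverse=True)
        if 100.0 * t / total >= cutoff_pct
    ]
```

A lower cutoff, as mentioned above, simply widens the kept set.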

Capability Matrix

Capability | SGLang | vLLM | TensorRT-LLM
Existing trace triage | yes | yes | yes
Single-trace live capture | yes | yes, if torch profiler is enabled on server | requires profiler control endpoints
Two-trace mapping + formal triage | yes | yes | yes
Stage-aware live capture | yes | no | no
--profile-prefix control | yes | usually ignored on HTTP profiler route | usually ignored on HTTP profiler route
For TensorRT-LLM, live capture only works when the server exposes /start_profile and /stop_profile, and when the deployment already provides a shared trace path plus the required env vars.
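That precondition can be expressed as a small check. The route and env-var names come from this document; the function itself is illustrative:

```python
def trtllm_live_capture_ready(exposed_routes, env):
    """Return True only when the server exposes both profiler control
    endpoints and the deployment provides the trace-path env vars this
    document requires. Illustrative gate, not a real trtllm API."""
    routes_ok = {"/start_profile", "/stop_profile"} <= set(exposed_routes)
    env_ok = env.get("TLLM_PROFILE_START_STOP") == "1" and bool(
        env.get("TLLM_TORCH_PROFILE_TRACE")
    )
    return routes_ok and env_ok
```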

Real H100 Validation

The current reference run is the 4x H100 matrix captured on 2026-04-23 on h100_sglang under:
  • /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3
Rendered markdown bundle:
  • /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3/h100_large_model_matrix_v3_bundle.md
Validated model directories:
  • mixtral_8x7b_instruct
  • qwen2_5_32b_instruct
  • qwen3_32b
Each model directory contains:
  • analysis_sglang.txt
  • analysis_vllm.txt
  • analysis_trtllm.txt
  • framework-specific trace roots and probe artifacts
Validated matrix:
Model | SGLang | vLLM | TensorRT-LLM | Result
mistralai/Mixtral-8x7B-Instruct-v0.1 | 4x H100 | 4x H100 | 4x H100 | three tables rendered correctly on all three frameworks; benchmark probes returned direct, non-empty text
Qwen/Qwen2.5-32B-Instruct | 4x H100 | 4x H100 | 4x H100 | three tables rendered correctly on all three frameworks; benchmark probes returned direct, non-empty text
Qwen/Qwen3-32B | 4x H100 | 4x H100 | 4x H100 | three tables rendered correctly on all three frameworks; vLLM and TensorRT-LLM chat probes often emitted <think> prefixes
Use this run as the main H100 reference. The older 2026-04-22 single-card Qwen3 matrix is still useful for bring-up, but it is no longer the default reference.
Checked-in sample outputs:
  • references/validated_outputs/20260422_h100_qwen3_matrix/qwen3_30b_a3b
To render a validated run into one markdown document:
bash
python3 scripts/render_triage_markdown_bundle.py \
  --analysis-root /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3 \
  --output /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3/h100_large_model_matrix_v3_bundle.md
The bundle groups by model and keeps the three tables for each framework.
H100 notes:
  • all three frameworks now render kernel, overlap, and fuse tables with separate extend/prefill and decode sections when the trace contains a clean stage split
  • SGLang live capture is validated and calls the server profiler API directly instead of shelling out to sglang.profiler
  • SGLang trace flush can lag well beyond a few seconds, so the runner waits longer for artifacts than the earlier implementation
  • SGLang kernel-site reconstruction keeps sampling disabled in the mapping path so the optimized parser does not perturb SGLang table output; equality rechecks matched for Mixtral-8x7B-Instruct-v0.1, Qwen3-32B, and nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
  • vLLM live capture requires --output-dir to match the server torch_profiler_dir; the validated H100 flow uses --profiler-config {"profiler":"torch","torch_profiler_dir":"..."} and then drives /start_profile and /stop_profile
  • TensorRT-LLM validation stays on --backend pytorch; the H100 flow writes the trace with TLLM_TORCH_PROFILE_TRACE and then analyzes the saved trace
  • the current TensorRT-LLM py_executor.py profiler setup still needs a with_stack=True override for table-quality Python locations, and the matrix runner generates that override under /data/bbuf/validate/unified_llm_profiler_skill/overrides/trtllm
  • on this host, keep all trace roots under /data/..., not /home/...

When To Use It

  • inspect a torch.profiler trace or profile directory from sglang, vllm, or TensorRT-LLM
  • profile a live serving endpoint and analyze the result
  • summarize which kernel families dominate prefill or decode
  • map kernels back to Python code paths
  • judge whether a code path still leaves overlap opportunity
  • check whether an already-known fusion or overlap path should have applied

Diffusion Backend Gate

For diffusion benchmark or profiling work, only analyze traces produced by the native SGLang diffusion backend.
If the run that generated the trace logs any of:
  • Falling back to diffusers backend
  • Using diffusers backend
  • Loaded diffusers pipeline
stop the workflow instead of analyzing the trace. Handle it as a backend-selection issue, not as native-kernel profiler evidence.
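This gate is easy to automate; the marker strings come straight from this section, while the function name is illustrative:

```python
DIFFUSERS_FALLBACK_MARKERS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)

def trace_is_native_diffusion(server_log_text):
    """Return False when the run that produced the trace logged any
    diffusers-backend marker, meaning the trace must not be analyzed
    as native-kernel profiler evidence."""
    return not any(m in server_log_text for m in DIFFUSERS_FALLBACK_MARKERS)
```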

Main Flows

1. Single-trace triage from an existing profile dir or trace

bash
python3 scripts/analyze_llm_torch_profile.py \
  --input /path/to/profile_dir_or_trace.json.gz
Use this when one trace is enough. The overlap table stays conservative in single-trace mode and will tell you when a mapping/formal pair is needed.

2. Single-trace live capture from SGLang

bash
python3 scripts/analyze_llm_torch_profile.py \
  --framework sglang \
  --url http://127.0.0.1:30000 \
  --output-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example/sglang_profile_live \
  --num-steps 5 \
  --profile-by-stage
The script sends POST /start_profile to the SGLang server directly. Keep --output-dir under /data/... so later analysis and docs can see the trace. The script writes server_args.json, sends the probe requests after profiling is armed, and waits longer for trace flush than the earlier implementation.
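The live-capture call amounts to one POST against the server. A hedged sketch follows; the request field names are assumptions inferred from this section's CLI flags, not a verified SGLang API schema:

```python
def build_start_profile_payload(output_dir, num_steps=5, profile_by_stage=True):
    """Assemble a JSON body for POST /start_profile. Field names here
    mirror this skill's CLI flags; the real SGLang request schema may
    differ, so treat every key as an assumption."""
    return {
        "output_dir": output_dir,
        "num_steps": num_steps,
        "profile_by_stage": profile_by_stage,
    }

# Hypothetical usage (not executed here):
#   import requests
#   requests.post(url + "/start_profile",
#                 json=build_start_profile_payload("/data/.../sglang_profile_live"))
```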

3. Single-trace live capture from vLLM

Launch vLLM with torch profiler enabled, for example:
bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --profiler-config '{"profiler":"torch","torch_profiler_dir":"/data/bbuf/validate/unified_llm_profiler_skill/runs/example/vllm_profile"}'
Then run:
bash
python3 scripts/analyze_llm_torch_profile.py \
  --framework vllm \
  --url http://127.0.0.1:8000 \
  --output-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example/vllm_profile \
  --num-steps 5 \
  --no-profile-by-stage
For vLLM, --output-dir must point to the same torch_profiler_dir the server uses. The current vLLM profiler config already defaults torch_profiler_with_stack=true, so the runner only needs to set torch_profiler_dir. On h100_sglang, external vLLM containers should mount both:
  • /data/.cache/huggingface:/root/.cache/huggingface
  • /data/bbuf/validate/unified_llm_profiler_skill:/data/bbuf/validate/unified_llm_profiler_skill

4. Single-trace live capture from TensorRT-LLM

Use this only when the server exposes POST /start_profile and POST /stop_profile, and the trace path is shared with the current machine.
Typical env expectations are:
  • TLLM_PROFILE_START_STOP=1
  • TLLM_TORCH_PROFILE_TRACE=/shared/path/trace.json or .json.gz
Then run:
bash
python3 scripts/analyze_llm_torch_profile.py \
  --framework trtllm \
  --url http://127.0.0.1:8000 \
  --output-dir /shared/path \
  --num-steps 5 \
  --no-profile-by-stage
If the deployment does not expose the profiler control endpoints, fall back to analyzing an existing trace instead of trying live capture.
On the current TensorRT-LLM mainline path, py_executor.py creates the torch profiler with record_shapes=True and with_modules=True but not with_stack=True. For table-quality validation, use the override generator:
bash
python3 scripts/make_trtllm_py_executor_override.py \
  --source /path/to/original/py_executor.py \
  --output /data/bbuf/validate/unified_llm_profiler_skill/overrides/trtllm/py_executor_with_stack.py
The matrix runner does this automatically on H100 before TensorRT-LLM capture starts.
This is the validated TensorRT-LLM flow on h100_sglang:
  1. launch trtllm-serve with TLLM_TORCH_PROFILE_TRACE=/data/.../trace.json
  2. run a few benchmark requests
  3. analyze the emitted trace with --input /data/.../trace.json

5. Two-trace triage from existing profile dirs or traces

bash
python3 scripts/analyze_llm_torch_profile.py triage \
  --mapping-input /path/to/graph_off_profile_dir \
  --formal-input /path/to/graph_on_profile_dir
Use this when you need stronger overlap attribution and kernel-to-source mapping.

6. Two-trace triage from running servers

bash
python3 scripts/analyze_llm_torch_profile.py triage \
  --framework sglang \
  --mapping-url http://127.0.0.1:31025 \
  --formal-url http://127.0.0.1:31026 \
  --num-steps 5 \
  --profile-by-stage
For vllm or TensorRT-LLM, use the same shape but pass:
  • --framework vllm or --framework trtllm
  • --mapping-output-dir ...
  • --formal-output-dir ...
  • --no-profile-by-stage

profile_by_stage

--profile-by-stage is only meaningful on the SGLang live-capture path.
  • On ordinary non-PD SGLang serving, it is still useful because prefill and decode usually have very different bottlenecks.
  • On the current profile-v2 path inside SGLang, stage-based profiling is effectively the normal path.
  • PD-disaggregated serving adds one extra rule: prefill workers and decode workers must be profiled separately. That is stricter than ordinary profile_by_stage.
  • For vllm and TensorRT-LLM, disable it with --no-profile-by-stage.
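The per-framework rule above reduces to one branch. A sketch (framework labels match this skill's --framework values):

```python
def stage_flag(framework):
    """Return the stage flag this document recommends per framework:
    SGLang live capture keeps stage-aware profiling on; vLLM and
    TensorRT-LLM must disable it."""
    if framework == "sglang":
        return "--profile-by-stage"
    if framework in ("vllm", "trtllm"):
        return "--no-profile-by-stage"
    raise ValueError(f"unknown framework: {framework!r}")
```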

How To Choose The Triage Shape

Single-trace triage

Use when you want the lowest-friction report:
  • one trace is already available
  • you mainly want kernel share and fusion clues
  • you are comparing two runs side by side by running triage once per trace
Prefer this by default.

Two-trace triage

Use when you need:
  • a stronger overlap answer
  • graph-off source mapping plus graph-on final behavior
  • more trustworthy overlap recommendations in the middle table
Capture order:
  1. mapping trace with graph disabled or with the lower-fusion / more-readable config
  2. formal trace with the real serving optimizations enabled
Do not call the mapping pass a "fast profile". It exists to recover kernel -> cpu_op -> python scope.
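The kernel -> cpu_op recovery relies on event correlation in the exported trace. A simplified sketch over already-parsed chrome-trace events follows; the "correlation" args key matches common torch.profiler chrome exports, but treat the exact key names as assumptions for any given version, and note that real traces also need the python_function stack above each cpu_op:

```python
def map_kernels_to_cpu_ops(events):
    """Given chrome-trace events (dicts), link each GPU kernel event to
    the CPU op that launched it via a shared correlation id.
    Simplified illustration of the mapping pass, not the script."""
    launches = {
        ev["args"]["correlation"]: ev["name"]
        for ev in events
        if ev.get("cat") == "cpu_op" and "correlation" in ev.get("args", {})
    }
    return {
        ev["name"]: launches.get(ev["args"]["correlation"])
        for ev in events
        if ev.get("cat") == "kernel" and "correlation" in ev.get("args", {})
    }
```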

Workflow

Single-trace workflow

  1. If the user only wants a diagnosis, one trace is enough.
  2. Prefer one-rank traces over merged traces whenever the profiler emitted both.
  3. For a live server, let the script drive the profiler only when the framework-specific prerequisites are already met.
  4. Prefer SGLang --profile-by-stage unless the user explicitly wants an all-stage mixed trace.
  5. When on h100_sglang, create or clean the target trace directory through docker exec sglang_bbuf ... so the path is definitely writable under /data.

Two-trace workflow

  1. Produce a mapping trace first with graph disabled or the lower-fusion configuration.
  2. Produce a formal trace second with the real serving optimizations enabled.
  3. Run
    triage
    for the three-table report.
  4. Read the results in this order:
    • kernel table
    • overlap-opportunity table
    • fuse-pattern table
  5. Before calling something a "new" optimization idea, compare the top rows against both references/fuse-overlap-catalog.md and references/overlap-catalog.md. Check mainline rows first, then the PR-backed / in-flight sections. Prefer reporting:
    • an existing fused or overlap path that should already apply here
    • an existing path that appears disabled, unsupported, or regressed in this trace
    • an upstream pattern that is mainline elsewhere but missing locally, or still open upstream
    • a truly new opportunity only when no catalog entry fits
  6. If no exact pattern fully matches but the trace is still close to a known family, add one flat similarity note after the tables. Use high, medium, or low only. Base that note on the full pattern shape, not on one kernel name alone. Prefer semantic cues such as producer-consumer chain, source locations, CPU op names, TP context, and model-specific structure. Do not rewrite the script table itself to include these heuristic judgments.
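The similarity note in step 6 has exactly three allowed levels; a small formatter that enforces that contract (the note wording is illustrative):

```python
ALLOWED_LEVELS = ("high", "medium", "low")

def similarity_note(family, level, cues):
    """Render the one-line note added after the tables. Rejects any
    level outside the three allowed values; `cues` should name the
    semantic evidence (producer-consumer chain, source locations, ...)."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"level must be one of {ALLOWED_LEVELS}, got {level!r}")
    return f"similarity to {family}: {level} (cues: {', '.join(cues)})"
```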

References

Load these only when needed:
  • references/source-map.md
    • upstream SGLang profiler entrypoints and trace-writing paths; still most useful for SGLang-specific source follow-up
  • references/heuristics.md
    • overlap labels, dependency-risk interpretation, and limits
  • references/fuse-overlap-catalog.md
    • mixed source-backed catalog of existing fuse and overlap patterns, including mainline rows plus PR-backed / in-flight rows
  • references/overlap-catalog.md
    • overlap-only lookup table across LLM, VLM, diffusion, disaggregation, HiSparse, and speculative scheduling

Output Contract

Return:
  • trace path or generated profile path
  • framework
  • model/server args when available
  • kernel table
  • overlap-opportunity table
  • fuse-pattern table
  • optional similarity note with high / medium / low when exact matching is inconclusive
  • one short summary of what dominates the run
  • whether the overlap read came from single-trace triage or mapping/formal two-trace triage
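The returned fields can be captured in one structure. This is a hypothetical container mirroring the contract above, not a class the scripts actually define:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TriageReport:
    """Mirrors the output contract above; all names are illustrative."""
    trace_path: str
    framework: str
    kernel_table: str
    overlap_table: str
    fuse_table: str
    overlap_source: str  # "single-trace" or "mapping/formal two-trace"
    summary: str
    model_server_args: Optional[dict] = None
    similarity_note: Optional[str] = None  # one of high/medium/low when present
```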