llm-torch-profiler-analysis
Unified LLM Torch Profiler Analysis
Overview
Use this skill for analysis across:
- `torch.profiler`
- `sglang`
- `vllm`
- `TensorRT-LLM`

There is only one public workflow:
- `triage`
Preferred unified entrypoint:
- scripts/analyze_llm_torch_profile.py
Backwards-compatibility shim (kept so older `docker exec ... analyze_sglang_torch_profile.py ...` calls keep working; it just forwards to the unified entrypoint):
- scripts/analyze_sglang_torch_profile.py
Markdown bundling helper:
- scripts/render_triage_markdown_bundle.py
`triage` renders three tables:
- kernel table
- overlap-opportunity table
- fuse-pattern table

By default, all three tables only render rows at or above 1.0% cumulative GPU-time share.
Rows below that are hidden by default unless the user asks for a lower cutoff.

Keep the fuse-pattern table source-backed and deterministic.
Do not turn it into a fuzzy matcher.
If exact source-backed matching is weak but a kernel cluster is still close to a known family,
add one short note after the tables with exactly one of: `high`, `medium`, `low`.
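The cumulative-share cutoff amounts to a simple filter over per-kernel GPU-time aggregates. As a rough illustration only (the function name, row shape, and kernel names below are made up, not the script's real API):

```python
# Hypothetical sketch of the 1.0% row cutoff: aggregate GPU time per kernel
# family, then keep only rows whose share of total GPU time meets the cutoff.

def filter_rows(rows, cutoff=0.01):
    """rows: list of (kernel_name, total_gpu_us). Keep rows with share >= cutoff."""
    total = sum(us for _, us in rows)
    if total == 0:
        return []
    kept = []
    for name, us in sorted(rows, key=lambda r: r[1], reverse=True):
        if us / total >= cutoff:
            kept.append((name, us, us / total))
    return kept

# Illustrative aggregates, not real profiler output.
rows = [
    ("flash_attn_fwd", 600.0),
    ("gemm_fp16", 380.0),
    ("tiny_copy", 5.0),        # 0.5% of total -> hidden by default
    ("elementwise_add", 15.0), # 1.5% of total -> shown
]
for name, us, share in filter_rows(rows):
    print(f"{name:16s} {us:8.1f} us  {share:6.1%}")
```

Lowering the cutoff (when the user asks) is just passing a smaller `cutoff` value.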
Capability Matrix
| Capability | SGLang | vLLM | TensorRT-LLM |
|---|---|---|---|
| Existing trace triage | yes | yes | yes |
| Single-trace live capture | yes | yes, if torch profiler is enabled on server | requires profiler control endpoints |
| Two-trace mapping+formal triage | yes | yes | yes |
| Stage-aware live capture | yes | no | no |
| `--profile-by-stage` | yes | usually ignored on HTTP profiler route | usually ignored on HTTP profiler route |
For TensorRT-LLM, live capture only works when the server exposes `/start_profile` and
`/stop_profile`, and when the deployment already provides a shared trace path plus the
required env vars.
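The capability decision above reduces to a small pre-flight check. A minimal sketch, assuming you already know which routes the server answers (the helper name and flag are illustrative, not part of the real script):

```python
# Illustrative pre-flight check: TensorRT-LLM live capture needs both profiler
# control endpoints plus a trace path shared with the current machine;
# otherwise fall back to analyzing an existing trace.

def live_capture_possible(endpoints, trace_path_shared):
    """endpoints: set of routes the server exposes; trace_path_shared: bool."""
    needed = {"/start_profile", "/stop_profile"}
    return needed.issubset(endpoints) and trace_path_shared

mode = ("live"
        if live_capture_possible({"/start_profile", "/stop_profile", "/health"}, True)
        else "existing-trace")
print(mode)
```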
Real H100 Validation
The current reference run is the 4x H100 matrix captured on 2026-04-23 on `h100_sglang` under:
/data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3
Rendered markdown bundle:
/data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3/h100_large_model_matrix_v3_bundle.md
Validated model directories:
- mixtral_8x7b_instruct
- qwen2_5_32b_instruct
- qwen3_32b
Each model directory contains:
- analysis_sglang.txt
- analysis_vllm.txt
- analysis_trtllm.txt
- framework-specific trace roots and probe artifacts
Validated matrix:
| Model | SGLang | vLLM | TensorRT-LLM | Result |
|---|---|---|---|---|
| mixtral_8x7b_instruct | | | | three tables rendered correctly on all three frameworks; benchmark probes returned direct, non-empty text |
| qwen2_5_32b_instruct | | | | three tables rendered correctly on all three frameworks; benchmark probes returned direct, non-empty text |
| qwen3_32b | | | | three tables rendered correctly on all three frameworks; vLLM and TensorRT-LLM chat probes often emitted |
Use this run as the main H100 reference.
The older 2026-04-22 single-card Qwen3 matrix is still useful for bring-up, but it is
not the default reference anymore.

Checked-in sample outputs:
references/validated_outputs/20260422_h100_qwen3_matrix/qwen3_30b_a3b
To render a validated run into one markdown document:
```bash
python3 scripts/render_triage_markdown_bundle.py \
  --analysis-root /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3 \
  --output /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3/h100_large_model_matrix_v3_bundle.md
```

The bundle groups by model and keeps the three tables for each framework.
H100 notes:
- all three frameworks now render kernel, overlap, and fuse tables with separate `extend/prefill` and `decode` sections when the trace contains a clean stage split
- SGLang live capture is validated and calls the server profiler API directly instead of shelling out to `sglang.profiler`
- SGLang trace flush can lag well beyond a few seconds, so the runner waits longer for artifacts than the earlier implementation
- SGLang kernel-site reconstruction keeps sampling disabled in the mapping path so the optimized parser does not perturb SGLang table output; equality rechecks matched for `Mixtral-8x7B-Instruct-v0.1`, `Qwen3-32B`, and `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8`
- vLLM live capture requires `--output-dir` to match the server `torch_profiler_dir`; the validated H100 flow uses `--profiler-config '{"profiler":"torch","torch_profiler_dir":"..."}'` and then drives `/start_profile` and `/stop_profile`
- TensorRT-LLM validation stays on `--backend pytorch`; the H100 flow writes the trace with `TLLM_TORCH_PROFILE_TRACE` and then analyzes the saved trace
- the current TensorRT-LLM profiler setup still needs a `py_executor.py` override for table-quality `with_stack=True` Python locations, and the matrix runner generates that override under `/data/bbuf/validate/unified_llm_profiler_skill/overrides/trtllm`
- on this host, keep all trace roots under `/data/...`, not `/home/...`
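The "waits longer for artifacts" note amounts to polling the output directory until a trace file appears and its size stops changing, instead of assuming the flush finishes within a fixed couple of seconds. A sketch of that idea (the helper name, glob pattern, and defaults here are illustrative, not the runner's real implementation):

```python
# Poll for a trace artifact until its size is stable for `settle_s` seconds,
# up to a `timeout_s` deadline. Returns the path, or None on timeout.
import pathlib
import time

def wait_for_trace(output_dir, pattern="*.trace.json.gz",
                   timeout_s=120.0, settle_s=2.0, poll_s=0.5):
    deadline = time.monotonic() + timeout_s
    last_size, stable_since = -1, None
    while time.monotonic() < deadline:
        traces = sorted(pathlib.Path(output_dir).glob(pattern))
        if traces:
            size = traces[-1].stat().st_size
            if size > 0 and size == last_size:
                if stable_since is None:
                    stable_since = time.monotonic()
                elif time.monotonic() - stable_since >= settle_s:
                    return traces[-1]  # non-empty and stable: flush done
            else:
                # file grew (or first sighting): reset the settle window
                last_size, stable_since = size, None
        time.sleep(poll_s)
    return None
```

A slow writer only delays the return; a fixed short sleep would instead hand the parser a truncated trace.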
When To Use It
- inspect a trace or profile directory from `torch.profiler`, `sglang`, `vllm`, or `TensorRT-LLM`
- profile a live serving endpoint and analyze the result
- summarize which kernel families dominate prefill or decode
- map kernels back to Python code paths
- judge whether a code path still leaves overlap opportunity
- check whether an already-known fusion or overlap path should have applied
Diffusion Backend Gate
For diffusion benchmark or profiling work, only analyze traces produced by the native
SGLang diffusion backend.
If the run that generated the trace logs any of:
- `Falling back to diffusers backend`
- `Using diffusers backend`
- `Loaded diffusers pipeline`
stop the workflow instead of analyzing the trace.
Handle it as a backend-selection issue, not as native-kernel profiler evidence.
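The gate above is a plain substring scan over the run log. A minimal sketch (the function name is made up; the marker strings are the ones listed above):

```python
# Refuse to treat a trace as native-kernel evidence when the run that
# produced it logged any diffusers fallback marker.

DIFFUSERS_MARKERS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)

def trace_is_native(run_log: str) -> bool:
    """True only when no diffusers fallback marker appears in the log."""
    return not any(marker in run_log for marker in DIFFUSERS_MARKERS)

log = "server up\nFalling back to diffusers backend\nprofiling armed\n"
if not trace_is_native(log):
    print("backend-selection issue: skip triage for this trace")
```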
Main Flows
1. Single-trace triage from an existing profile dir or trace
```bash
python3 scripts/analyze_llm_torch_profile.py \
  --input /path/to/profile_dir_or_trace.json.gz
```

Use this when one trace is enough.
The overlap table stays conservative in single-trace mode and will tell you when a
mapping/formal pair is needed.
2. Single-trace live capture from SGLang
```bash
python3 scripts/analyze_llm_torch_profile.py \
  --framework sglang \
  --url http://127.0.0.1:30000 \
  --output-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example/sglang_profile_live \
  --num-steps 5 \
  --profile-by-stage
```

The script sends `POST /start_profile` to the SGLang server directly.
Keep `--output-dir` under `/data/...` so later analysis and docs can see the trace.
The script writes `server_args.json`, sends the probe requests after profiling is armed,
and waits longer for trace flush than the earlier implementation.
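At its core, the live-capture step is one POST to the server's `/start_profile` route, probe traffic, then waiting for the flush. The exact request fields vary by SGLang version, so treat the payload keys in this sketch as illustrative rather than a stable API contract:

```python
# Build an (illustrative) JSON body for POST /start_profile. Field names are
# assumptions for this sketch; check your SGLang version's server API.
import json

def start_profile_request(output_dir, num_steps, profile_by_stage):
    return json.dumps({
        "output_dir": output_dir,
        "num_steps": num_steps,
        "profile_by_stage": profile_by_stage,
    })

body = start_profile_request(
    "/data/bbuf/validate/unified_llm_profiler_skill/runs/example/sglang_profile_live",
    5, True)
# Sending it would look roughly like:
#   urllib.request.urlopen(urllib.request.Request(
#       url + "/start_profile", body.encode(),
#       {"Content-Type": "application/json"}))
```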
3. Single-trace live capture from vLLM
Launch vLLM with torch profiler enabled, for example:
```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --profiler-config '{"profiler":"torch","torch_profiler_dir":"/data/bbuf/validate/unified_llm_profiler_skill/runs/example/vllm_profile"}'
```

Then run:

```bash
python3 scripts/analyze_llm_torch_profile.py \
  --framework vllm \
  --url http://127.0.0.1:8000 \
  --output-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example/vllm_profile \
  --num-steps 5 \
  --no-profile-by-stage
```

For vLLM, `--output-dir` must point to the same `torch_profiler_dir` the server uses.
The current vLLM profiler config already defaults `torch_profiler_with_stack=true`,
so the runner only needs to set `torch_profiler_dir`.
On `h100_sglang`, external vLLM containers should mount both:
- `/data/.cache/huggingface:/root/.cache/huggingface`
- `/data/bbuf/validate/unified_llm_profiler_skill:/data/bbuf/validate/unified_llm_profiler_skill`
4. Single-trace live capture from TensorRT-LLM
Use this only when the server exposes `POST /start_profile` and `POST /stop_profile`,
and the trace path is shared with the current machine.

Typical env expectations are:
- `TLLM_PROFILE_START_STOP=1`
- `TLLM_TORCH_PROFILE_TRACE=/shared/path/trace.json` or `.json.gz`
Then run:
```bash
python3 scripts/analyze_llm_torch_profile.py \
  --framework trtllm \
  --url http://127.0.0.1:8000 \
  --output-dir /shared/path \
  --num-steps 5 \
  --no-profile-by-stage
```

If the deployment does not expose the profiler control endpoints, fall back to analyzing
an existing trace instead of trying live capture.

On the current TensorRT-LLM mainline path, `py_executor.py` creates the torch profiler
with `record_shapes=True` and `with_modules=True` but not `with_stack=True`.
For table-quality validation, use the override generator:

```bash
python3 scripts/make_trtllm_py_executor_override.py \
  --source /path/to/original/py_executor.py \
  --output /data/bbuf/validate/unified_llm_profiler_skill/overrides/trtllm/py_executor_with_stack.py
```

The matrix runner does this automatically on H100 before TensorRT-LLM capture starts.
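Conceptually, the override generator rewrites the profiler construction so `with_stack=True` is present. The real script operates on the actual `py_executor.py`; this toy string transform only illustrates the idea (the function name is made up):

```python
# Toy illustration of the override: add with_stack=True to profiler
# construction source that already sets with_modules=True, and leave
# source that already configures with_stack untouched.

def add_with_stack(source: str) -> str:
    if "with_stack=" in source:
        return source  # already configured either way: do not touch it
    return source.replace("with_modules=True",
                          "with_modules=True, with_stack=True")

snippet = "torch.profiler.profile(record_shapes=True, with_modules=True)"
print(add_with_stack(snippet))
```

`with_stack=True` is what gives the fuse-pattern and overlap tables usable Python source locations; without it, kernels map back to opaque CPU ops only.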
This is the validated TensorRT-LLM flow on `h100_sglang`:
- launch `trtllm-serve` with `TLLM_TORCH_PROFILE_TRACE=/data/.../trace.json`
- run a few benchmark requests
- analyze the emitted trace with `--input /data/.../trace.json`
5. Two-trace triage from existing profile dirs or traces
```bash
python3 scripts/analyze_llm_torch_profile.py triage \
  --mapping-input /path/to/graph_off_profile_dir \
  --formal-input /path/to/graph_on_profile_dir
```

Use this when you need stronger overlap attribution and kernel-to-source mapping.
6. Two-trace triage from running servers
```bash
python3 scripts/analyze_llm_torch_profile.py triage \
  --framework sglang \
  --mapping-url http://127.0.0.1:31025 \
  --formal-url http://127.0.0.1:31026 \
  --num-steps 5 \
  --profile-by-stage
```

For `vllm` or `TensorRT-LLM`, use the same shape but pass:
- `--framework vllm` or `--framework trtllm`
- `--mapping-output-dir ...` and `--formal-output-dir ...`
- `--no-profile-by-stage`
profile_by_stage
- On ordinary non-PD SGLang serving, `--profile-by-stage` is still useful because prefill and decode usually have very different bottlenecks.
- On the current profile-v2 path inside SGLang, stage-based profiling is effectively the normal path.
- PD-disaggregated serving adds one extra rule: prefill workers and decode workers must be profiled separately. That is stricter than ordinary `profile_by_stage`.
- For `vllm` and `TensorRT-LLM`, disable it with `--no-profile-by-stage`.
How To Choose The Triage Shape
Single-trace triage
Use when you want the lowest-friction report:
- one trace is already available
- you mainly want kernel share and fusion clues
- you are comparing two runs side by side by running triage once per trace
Prefer this by default.
Two-trace triage
Use when you need:
- a stronger overlap answer
- graph-off source mapping plus graph-on final behavior
- more trustworthy overlap recommendations in the middle table
- mapping trace with graph disabled or with the lower-fusion / more-readable config
- formal trace with the real serving optimizations enabled
Do not call the mapping pass a "fast profile".
It exists to recover the `kernel -> cpu_op -> python scope` chain.
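In miniature, that recovery works because in a torch.profiler chrome trace a GPU kernel event and the CPU op that launched it share a correlation id in their `args`, and `with_stack=True` adds Python call-site information on the CPU side. This toy join uses simplified events (real traces carry more fields and an extra runtime-launch hop):

```python
# Join GPU kernel events to their launching CPU ops by correlation id,
# carrying the Python call site along. Event shapes here are simplified.

def map_kernels_to_source(events):
    launchers = {e["args"]["correlation"]: e
                 for e in events if e.get("cat") == "cpu_op"}
    mapping = {}
    for e in events:
        if e.get("cat") == "kernel":
            op = launchers.get(e["args"]["correlation"])
            if op is not None:
                mapping[e["name"]] = (op["name"],
                                      op["args"].get("python_site", "?"))
    return mapping

events = [
    {"cat": "cpu_op", "name": "aten::mm",
     "args": {"correlation": 7, "python_site": "model.py:42 (forward)"}},
    {"cat": "kernel", "name": "sm90_gemm_f16", "args": {"correlation": 7}},
]
print(map_kernels_to_source(events))
```

Graph capture and aggressive fusion collapse or rename these CPU-side events, which is exactly why the mapping trace runs with graph disabled.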
Workflow
Single-trace workflow
- If the user only wants a diagnosis, one trace is enough.
- Prefer one-rank traces over merged traces whenever the profiler emitted both.
- For a live server, let the script drive the profiler only when the framework-specific prerequisites are already met.
- Prefer SGLang `--profile-by-stage` unless the user explicitly wants an all-stage mixed trace.
- When on `h100_sglang`, create or clean the target trace directory through `docker exec sglang_bbuf ...` so the path is definitely writable under `/data`.
Two-trace workflow
- Produce a mapping trace first with graph disabled or the lower-fusion configuration.
- Produce a formal trace second with the real serving optimizations enabled.
- Run `triage` for the three-table report.
- Read the results in this order:
- kernel table
- overlap-opportunity table
- fuse-pattern table
- Before calling something a "new" optimization idea, compare the top rows against both references/fuse-overlap-catalog.md and references/overlap-catalog.md. Check mainline rows first, then the `PR-backed / in-flight` sections. Prefer reporting:
  - an existing fused or overlap path that should already apply here
  - an existing path that appears disabled, unsupported, or regressed in this trace
  - an upstream pattern that is mainline elsewhere but missing locally, or still open upstream
  - a truly new opportunity only when no catalog entry fits
- If no exact pattern fully matches but the trace is still close to a known family, add one flat similarity note after the tables. Use `high`, `medium`, or `low` only. Base that note on the full pattern shape, not on one kernel name alone. Prefer semantic cues such as producer-consumer chain, source locations, CPU op names, TP context, and model-specific structure. Do not rewrite the script table itself to include these heuristic judgments.
References
Load these only when needed:
- references/source-map.md
- upstream SGLang profiler entrypoints and trace-writing paths; still most useful for SGLang-specific source follow-up
- references/heuristics.md
- overlap labels, dependency-risk interpretation, and limits
- references/fuse-overlap-catalog.md
- mixed source-backed catalog of existing fuse and overlap patterns, including mainline rows plus PR-backed / in-flight rows
- references/overlap-catalog.md
- overlap-only lookup table across LLM, VLM, diffusion, disaggregation, HiSparse, and speculative scheduling
Output Contract
Return:
- trace path or generated profile path
- framework
- model/server args when available
- kernel table
- overlap-opportunity table
- fuse-pattern table
- optional similarity note with `high`/`medium`/`low` when exact matching is inconclusive
- one short summary of what dominates the run
- whether the overlap read came from single-trace triage or mapping/formal two-trace triage