perf-host-analysis

Host Performance Analysis

Analyze host/CPU overhead in TensorRT-LLM inference workloads from nsys traces. This skill operates in two phases:

Phase	Question	Input	Output
Detection	Is host overhead the bottleneck?	Single nsys trace	YES/NO verdict with metric evidence
Root Cause	What specifically regressed?	One or two nsys traces	NVTX per-step breakdown, regression sources, optional kernel-level drill-down

When to Use

Before starting host optimization work -- confirms the bottleneck is real (Detection)
As a sub-step of `perf-analysis` for bottleneck classification (Detection)
When GPU utilization is suspiciously low and you need to know why (Detection)
When throughput regressed but GPU kernel execution times are unchanged (Root Cause)
When the gap between forward step iterations has increased (Root Cause)
To compare inter-iteration overhead between two versions of the inference engine (Root Cause)
When you need sub-operation granularity on inter-kernel gaps or graph coverage (Root Cause, kernel-level drill-down)
When piecewise CUDA graph coverage is unexpectedly low (Root Cause, kernel-level drill-down)
When multi-rank inference shows unexplained performance asymmetry (Root Cause, kernel-level drill-down)

Do NOT use when:

The regression is in individual kernel performance (use `perf-nsight-compute-analysis`)
You need to profile a workload from scratch (use `workload-instrumentation` first)
The issue is NCCL communication (use distributed analysis)

Prerequisites

An nsys trace file (`.sqlite` or `.nsys-rep`) from a TRT-LLM benchmark run
For Root Cause comparison: two traces (baseline and target)
Python 3 with sqlite3 support

Key Concepts

Host Overhead in LLM Inference

In an LLM inference loop, each iteration consists of:

[inter-step gap] -> [_forward_step] -> [inter-step gap] -> [_forward_step] -> ...

The forward step includes GPU kernel execution (GEMM, attention, normalization, allreduce) plus host-side preparation. The inter-step gap includes host-side work between forward steps (scheduling, request fetching, broadcasting, sampling, response handling).

See references/trtllm-nvtx-ranges.md for the full per-operation breakdown and timing ranges.

Hidden vs Exposed Host Overhead

Host overhead only hurts performance when it is exposed -- the GPU is idle waiting for work. When host prep overlaps with GPU execution, it is hidden and free. See references/metrics.md (M3 section) for diagrams and the exposed/hidden computation.
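The exposed/hidden split can be illustrated with a small interval-intersection sketch. It assumes host prep and GPU busy periods are already available as `(start_ns, end_ns)` pairs; the function names are hypothetical, and the authoritative M3 computation is the one in references/metrics.md, which may differ.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start_ns, end_ns)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_ns(a: List[Interval], b: List[Interval]) -> int:
    """Total overlap between two sorted, disjoint interval lists."""
    total, i, j = 0, 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def split_host_prep(host_prep: List[Interval],
                    gpu_busy: List[Interval]) -> Dict[str, int]:
    """Split host prep time into hidden (GPU busy) and exposed (GPU idle)."""
    host = merge_intervals(host_prep)
    gpu = merge_intervals(gpu_busy)
    total = sum(end - start for start, end in host)
    hidden = overlap_ns(host, gpu)
    return {"hidden_ns": hidden, "exposed_ns": total - hidden}
```

For example, `split_host_prep([(0, 100), (200, 300)], [(50, 250)])` returns `{"hidden_ns": 100, "exposed_ns": 100}`: half of the host prep overlaps GPU execution and costs nothing, and the other half leaves the GPU waiting.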

只有当主机开销处于暴露状态时才会影响性能——即GPU处于空闲状态等待工作。当主机准备工作与GPU执行重叠时，该开销是隐藏的，不会影响性能。有关暴露/隐藏计算的示意图和说明，请参考references/metrics.md的M3章节。

Forward Step Isolation

In TP configurations, forward steps are isolated via allreduce kernel grouping (deterministic count per transformer layer). For TP=1, NVTX `_forward_step` ranges are used directly. See references/iteration-isolation-techniques.md for the full algorithm.
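As a rough illustration of the grouping idea, the sketch below chunks allreduce kernels into forward steps by a fixed count. It is a naive version, not the algorithm from references/iteration-isolation-techniques.md: `allreduces_per_step` is a hypothetical parameter (for example, the per-layer count times the number of layers), and the input is assumed to start on a step boundary with warmup allreduces already excluded.

```python
from typing import List, Tuple

Kernel = Tuple[int, int]  # (start_ns, end_ns) of one allreduce kernel


def group_forward_steps(allreduce_kernels: List[Kernel],
                        allreduces_per_step: int) -> List[Tuple[int, int]]:
    """Chunk allreduce kernels into fixed-size groups, one per forward step.

    Returns one (first_start_ns, last_end_ns) window per complete group;
    a trailing partial group is dropped.
    """
    if allreduces_per_step <= 0:
        raise ValueError("allreduces_per_step must be positive")
    ordered = sorted(allreduce_kernels)
    complete = len(ordered) - len(ordered) % allreduces_per_step
    steps = []
    for i in range(0, complete, allreduces_per_step):
        chunk = ordered[i:i + allreduces_per_step]
        steps.append((chunk[0][0], max(end for _, end in chunk)))
    return steps
```

Dropping the partial trailing group keeps a truncated final iteration from skewing per-step averages.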

Phase Classification (Context vs Generation)

Iterations are classified by NVTX marker text into context (eager, no CUDA graphs) and generation (CUDA graph replay). Per-phase analysis is critical because aggregate metrics can mask phase-specific bottlenecks. See references/phase-classification.md.

Phase 1: Detection (YES/NO Verdict)

Determine whether host overhead is the primary bottleneck.

Detection Metrics

Six metrics in four categories (M1 through M4, with M3 split into M3a, M3b, and M3c), plus M5 as a caveat for communication. See references/metrics.md for full definitions, formulas, and SQL queries.

#	Metric	Threshold	What it answers
M1	GPU idle ratio	> 0.30	Is the GPU starved for work?
M2	Launch overhead ratio	> 0.10	Is kernel launch itself expensive?
M3a	Host prep exposed ratio	> 0.50	How well is host prep pipelined?
M3b	Host prep perf impact	> 0.05	How much throughput does exposed prep cost?
M3c	Host prep idle attribution	> 0.50	Is host prep the main cause of GPU idle?
M4	GPU utilization	< 0.60	Is GPU utilization too low?
M5	NCCL ratio (caveat)	> 0.20	Is communication a confounding factor?

Host prep confirmation rule: Host prep is a confirmed bottleneck only when both M3b AND M3c cross their thresholds.

Thresholds are configurable with per-phase variants. See references/thresholds.md.
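The threshold table and the host prep confirmation rule can be sketched as follows. The threshold values are copied from the table above; the function names and the dictionary-based input are assumptions for illustration, and the sketch does not reproduce the full verdict logic of `scripts/detect_host_overhead.py`.

```python
import operator
from typing import Dict

# (threshold, comparison) per metric, taken from the detection metrics table.
THRESHOLDS = {
    "M1": (0.30, operator.gt),
    "M2": (0.10, operator.gt),
    "M3a": (0.50, operator.gt),
    "M3b": (0.05, operator.gt),
    "M3c": (0.50, operator.gt),
    "M4": (0.60, operator.lt),  # utilization is flagged when it is too LOW
    "M5": (0.20, operator.gt),
}


def crossed_thresholds(metrics: Dict[str, float]) -> Dict[str, bool]:
    """Report, for each supplied metric, whether it crossed its threshold."""
    return {
        name: compare(metrics[name], limit)
        for name, (limit, compare) in THRESHOLDS.items()
        if name in metrics
    }


def host_prep_confirmed(metrics: Dict[str, float]) -> bool:
    """Host prep is confirmed only when both M3b and M3c cross."""
    flags = crossed_thresholds(metrics)
    return flags.get("M3b", False) and flags.get("M3c", False)
```

With `{"M3b": 0.08, "M3c": 0.35}`, M3b crosses but M3c does not, so host prep is not confirmed even though exposed prep costs measurable throughput.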

Detection Workflow

Step 1: Input Validation

bash

# Accept .sqlite or .nsys-rep
ls -la <trace_file>

# If .nsys-rep, export to SQLite first
nsys export -t sqlite -o <output.sqlite> <input.nsys-rep>

Step 2: Run Detection Script

bash

python scripts/detect_host_overhead.py \
  --trace /path/to/trace.sqlite \
  --output /path/to/verdict.json

The script computes M1, M2, M4, M5 from SQL, optionally M3 via range intersection, applies the verdict logic, and outputs structured JSON. See references/output-format.md for the output schema.

For manual metric extraction via SQL, see references/nsys-schema.md.
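As a minimal sketch of manual extraction, the snippet below estimates a GPU idle ratio directly from the kernel table of an exported trace. It assumes M1 is defined as one minus the merged kernel busy time divided by the span from the first kernel start to the last kernel end; the actual formula lives in references/metrics.md and may differ (for instance by device, stream, or analysis window). The function name is hypothetical.

```python
import sqlite3


def gpu_idle_ratio(conn: sqlite3.Connection) -> float:
    """Estimate GPU idle ratio over the window spanned by all kernels.

    Assumption: all kernels are treated as one timeline; a multi-GPU
    trace would need a per-device filter.
    """
    rows = conn.execute(
        'SELECT start, "end" FROM CUPTI_ACTIVITY_KIND_KERNEL ORDER BY start'
    ).fetchall()
    if not rows:
        raise ValueError("no kernel records in trace")

    window_start = rows[0][0]
    window_end = max(end for _, end in rows)

    # Merge overlapping kernels so concurrent execution is not double counted.
    busy = 0
    cur_start, cur_end = rows[0]
    for start, end in rows[1:]:
        if start <= cur_end:
            cur_end = max(cur_end, end)
        else:
            busy += cur_end - cur_start
            cur_start, cur_end = start, end
    busy += cur_end - cur_start

    return 1.0 - busy / (window_end - window_start)
```

A typical call would be `gpu_idle_ratio(sqlite3.connect("/path/to/trace.sqlite"))`, with the result compared against the M1 threshold.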

Step 3: Interpret Verdict

Overall Verdict:

if aggregate_verdict == YES or context_verdict == YES or generation_verdict == YES:
    overall_verdict = YES

Per-phase analysis can elevate the verdict but never demote it.
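A runnable form of the rule above, assuming each verdict is represented as the string `"YES"` or `"NO"` (the actual JSON representation is defined in references/output-format.md and is not shown here):

```python
def overall_verdict(aggregate: str, context: str, generation: str) -> str:
    """Return "YES" if any of the three verdicts is "YES", otherwise "NO".

    A phase verdict can raise the overall result to "YES", but an
    aggregate "YES" is never lowered by a phase "NO".
    """
    return "YES" if "YES" in (aggregate, context, generation) else "NO"
```

For instance, `overall_verdict("NO", "YES", "NO")` yields `"YES"`, which is the case where a context-only bottleneck is masked by the aggregate metrics.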

Format using the template in references/output-format.md.

Next Steps:

If YES -> Proceed to Phase 2 (Root Cause) below, then use the `perf-host-optimization` skill
If NO -> Use `perf-nsight-compute-analysis` for kernel SOL% or `trace-interpretation` for full classification

Phase 2: Root Cause Analysis

Identify which specific host operations regressed and by how much. Works with a single trace (breakdown) or two traces (comparison).

Principles

Isolate forward steps, not the full trace. nsys traces contain warmup, JIT, model loading, and teardown.
Use structural kernel patterns for iteration detection. Allreduce grouping is more robust than kernel density.
Compare steady-state iterations. Filter to identical workload (same batch size, same ctx/gen mix).
Per-step metrics, not totals. Always compare per-step averages.

Root Cause Workflow

Step 1: Collect nsys Traces

Profile both versions (if comparing) with identical settings:

bash

nsys profile -o /path/to/trace \
  -t cuda,nvtx,osrt \
  --force-overwrite=true \
  --cuda-memory-usage=true \
  -w true \
  <benchmark_command> --num_requests 500

Step 2: Export to SQLite

bash

nsys export --type=sqlite --force-overwrite=true -o trace.sqlite trace.nsys-rep

Step 3: Run Host Overhead Analysis

bash

# Two-trace comparison
python scripts/analyze_host_overhead.py \
  --baseline /path/to/baseline/trace.sqlite \
  --target /path/to/target/trace.sqlite \
  --baseline-label "v1.1" \
  --target-label "main" \
  --output /path/to/output/analysis.txt

# Single-trace breakdown
python scripts/analyze_host_overhead.py \
  --baseline /path/to/trace.sqlite \
  --baseline-label "current"

Step 4: Interpret Results

The script produces:

Allreduce-based iteration detection -- confirms forward step boundaries
Per-step wall time comparison -- quantifies the regression
NVTX per-step breakdown -- identifies which host operations regressed
GPU kernel comparison -- confirms GPU execution is unchanged
CUDA API comparison -- detects kernel launch overhead changes

Reading the Output

Per-Step Wall Time:

Avg wall time per step: 3,317 us (baseline) vs 3,978 us (target)  +19.9%

This is the primary regression metric.

NVTX Breakdown:

Operation           | baseline (us/step) | target (us/step) | Delta    | Status
_fetch_new_requests |               36   |             270  | +234     | REGRESSION
broadcast_requests  |                -   |             250  | +250     | NEW
_update_requests    |              413   |             723  | +310     | REGRESSION

Focus on operations with large absolute deltas.

GPU Kernel Comparison:

Kernels per step (launched): 6.2 (baseline) vs 21.9 (target)  +253%

More individual launches = more host-side launch overhead.
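The arithmetic behind these tables can be reproduced with a short sketch. The per-operation numbers come from the sample output above; the `min_delta_us` cutoff for labelling a regression is a hypothetical value, since the script's actual status rule is not documented here.

```python
from typing import Dict, List, Tuple


def pct_change(baseline: float, target: float) -> float:
    """Relative change of target versus baseline, in percent."""
    return (target - baseline) / baseline * 100.0


def compare_ops(baseline: Dict[str, float],
                target: Dict[str, float],
                min_delta_us: float = 50.0) -> List[Tuple[str, float, str]]:
    """Rank operations by absolute per-step delta (us/step), largest first."""
    rows = []
    for op, tgt in target.items():
        base = baseline.get(op)
        delta = tgt - (base or 0.0)
        if base is None:
            status = "NEW"  # operation does not exist in the baseline
        elif delta >= min_delta_us:
            status = "REGRESSION"
        else:
            status = "OK"
        rows.append((op, delta, status))
    return sorted(rows, key=lambda row: row[1], reverse=True)


baseline = {"_fetch_new_requests": 36, "_update_requests": 413}
target = {"_fetch_new_requests": 270, "broadcast_requests": 250,
          "_update_requests": 723}
ranked = compare_ops(baseline, target)
```

Here `ranked[0]` is `("_update_requests", 310, "REGRESSION")`, and `round(pct_change(3317, 3978), 1)` gives `19.9`, matching the per-step wall time line. Sorting by absolute delta rather than by ratio keeps the focus on the operations that add the most time per step.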

Step 5: Kernel-Level Drill-Down (Optional)

When the NVTX breakdown identifies a regressing operation but does not reveal why (the overhead is inside the GPU dispatch, not between NVTX ranges), drill below NVTX operations into individual GPU kernel launches.

See references/kernel-level-analysis.md for full technique details, SQL queries, and examples.

When to drill down:

An operation has high wall time but the overhead is inside GPU dispatch, not between NVTX ranges
You need to understand how much of the forward pass is graph-captured vs eager
Per-layer overhead is significant and you need to map kernels to functional groups
Multi-rank inference shows unexplained performance asymmetry

Kernel-Level Techniques

Technique	Question	Key Output
Inter-Kernel Gap Analysis	Where is the GPU idle between kernels?	Gap bucket distribution, top-N largest gaps with source mapping
Eager vs Graph Classification	What fraction of kernels are graph-captured?	Graph coverage ratio, list of eager kernels with source attribution
Repeating-Pattern Mapping	Which functional group within a layer has the most overhead?	Per-group gap totals, priority ranking
Straggler Detection	Is one rank consistently slower?	Straggler rank ID, root cause (extra host work, queue depth feedback loop)

Workflow

Start with Inter-Kernel Gap Analysis — bucket the gap distribution to understand the dominant overhead type (graph dispatch, Python interpreter, host-device sync)
If piecewise graph is in use, run Eager vs Graph Classification to measure graph coverage and identify unnecessary eager kernels
For per-layer overhead, use Repeating-Pattern Mapping to isolate the highest-overhead functional group within a single layer
For multi-rank setups, run Straggler Detection if per-step wall time varies across ranks
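The first step of this workflow, bucketing inter-kernel gaps, might look like the sketch below. The bucket edges are hypothetical placeholders; the source does not state which gap sizes correspond to graph dispatch, Python interpreter overhead, or host-device sync, so see references/kernel-level-analysis.md for the real technique.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical bucket edges in microseconds; tune them to the workload.
BUCKETS_US = [
    (0.0, 5.0, "<5us"),
    (5.0, 50.0, "5-50us"),
    (50.0, 500.0, "50-500us"),
    (500.0, float("inf"), ">=500us"),
]


def gap_histogram(kernels: List[Tuple[int, int]]) -> Counter:
    """Count idle gaps between consecutive kernels, grouped by size.

    kernels: (start_ns, end_ns) pairs from a single stream or device.
    """
    ordered = sorted(kernels)
    hist: Counter = Counter()
    if not ordered:
        return hist
    prev_end = ordered[0][1]
    for start, end in ordered[1:]:
        gap_us = (start - prev_end) / 1000.0
        if gap_us > 0:  # overlapping kernels leave no idle gap
            for low, high, label in BUCKETS_US:
                if low <= gap_us < high:
                    hist[label] += 1
                    break
        prev_end = max(prev_end, end)
    return hist
```

Tracking `prev_end` as a running maximum avoids reporting spurious gaps when a long kernel overlaps several short ones.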

Kernel-Level Findings to Optimization Patterns

Finding	Optimization Pattern
Large gaps from Python tensor view chains	CUSTOM_OP — replace with C++ custom op
Graph-capturable kernels running eagerly	GRAPH_EXPAND — fix partition poisoning
Monolithic custom op blocking graph capture	GRAPH_SPLIT — split into capturable + eager parts
Host-device sync ( `.item()` ) in per-layer code	SYNC (Pattern 1: pre-compute on CPU) + HOIST (Variant B: pass from step level)
Per-layer buffer allocation	ALLOC — pre-allocate at init
Straggler rank with extra host work	Apply targeted optimization to coordinator-only code paths

Common Patterns and Root Causes

Pattern 1: Request Management Refactor

Symptom: `_fetch_new_requests` regressed 5-10x, new `broadcast_requests` operation. Cause: Request fetching refactored for multi-rank broadcasting in TP. Mitigation: Optimize broadcast path; batch request state updates.

Pattern 2: Increased Kernel Launch Count

Symptom: 3-5x more `cudaLaunchKernel` calls per step, similar GPU time. Cause: Operations that were fused or graph-captured are now individual launches. Mitigation: Re-fuse kernels; extend CUDA graph capture scope.

Pattern 3: New Bookkeeping Operations

Symptom: New NVTX ranges like `_write_finish_reasons`, `handle_additional_outputs`. Cause: New features added to the inference loop without overhead budgeting. Mitigation: Defer non-critical bookkeeping to async paths; batch updates.

Pattern 4: Flashinfer JIT Warmup Masquerading as Inference

Symptom: Massive elementwise/reduce kernel counts in "steady state" analysis. Cause: Analysis window includes flashinfer JIT compilation phase. Fix: Use allreduce-based iteration isolation, not kernel density or time windows.

Pattern 5: Context-Only Bottleneck (Masked by Aggregate)

Symptom: Aggregate metrics below threshold, but context iterations have 50% GPU idle. Cause: Generation iterations dilute the context-phase bottleneck. Fix: Per-phase analysis in Detection phase catches this.

Pitfalls

1. shortName is an Integer ID

`CUPTI_ACTIVITY_KIND_KERNEL.shortName` is an integer referencing `StringIds.id`. Always join. See references/nsys-schema.md.
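A minimal sketch of that join, using Python's `sqlite3` module against an exported trace. The aggregation (launch count and total duration per kernel name) and the function name are illustrative choices, not queries taken from references/nsys-schema.md.

```python
import sqlite3

# Resolve the integer shortName to its string via StringIds before grouping.
KERNEL_NAME_QUERY = """
SELECT s.value AS name,
       COUNT(*) AS launches,
       SUM(k."end" - k.start) AS total_ns
FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
JOIN StringIds AS s ON k.shortName = s.id
GROUP BY s.value
ORDER BY total_ns DESC
"""


def kernel_summary(conn: sqlite3.Connection):
    """Return (kernel name, launch count, total duration in ns) rows."""
    return conn.execute(KERNEL_NAME_QUERY).fetchall()
```

Filtering on `shortName` with a string literal, without the join, silently matches nothing because the column holds integer IDs.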

2. NVTX textId vs text

Most NVTX events have `textId` (integer) but NULL `text`. Join with StringIds. See references/nsys-schema.md.
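One way to handle both cases is a left join with `COALESCE`, sketched below. It assumes that events without an end timestamp are instantaneous markers rather than ranges and skips them; the function name is hypothetical.

```python
import sqlite3

# Prefer the inline text when present, otherwise fall back to StringIds.
NVTX_RANGE_QUERY = """
SELECT COALESCE(n.text, s.value) AS name, n.start, n."end"
FROM NVTX_EVENTS AS n
LEFT JOIN StringIds AS s ON n.textId = s.id
WHERE n."end" IS NOT NULL
ORDER BY n.start
"""


def nvtx_ranges(conn: sqlite3.Connection):
    """Return (name, start_ns, end_ns) for every NVTX range in the trace."""
    return conn.execute(NVTX_RANGE_QUERY).fetchall()
```

The left join keeps events that carry only inline `text`, which an inner join on `textId` would drop.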

3. Duplicate NVTX Ranges from TP Ranks

In TP configurations, each rank reports NVTX ranges independently. De-duplicate by grouping entries within 100us of each other.

4. Negative Inter-Step Gaps

When TP ranks report overlapping NVTX ranges, `gap = next_start - prev_end` can be negative. Use the maximum end time when de-duplicating.
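Pitfalls 3 and 4 can be handled together, as in the sketch below. It interprets "within 100us of each other" as comparing each range's start time with the start of the current group, which is an assumption; the helper names are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (start_ns, end_ns) of one NVTX range

DEDUP_WINDOW_NS = 100_000  # 100 us, per the de-duplication guidance


def dedupe_ranges(ranges: List[Range],
                  window_ns: int = DEDUP_WINDOW_NS) -> List[Range]:
    """Collapse per-rank copies of the same step into one range."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][0] <= window_ns:
            # Same logical step seen from another rank: keep the latest end.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def inter_step_gaps(ranges: List[Range]) -> List[int]:
    """Gap in ns between each de-duplicated step and the next one."""
    steps = dedupe_ranges(ranges)
    return [nxt[0] - prev[1] for prev, nxt in zip(steps, steps[1:])]
```

For two ranks reporting `(0, 3_000_000)` and `(20_000, 3_050_000)` for the first step and `(3_500_000, 6_500_000)` and `(3_520_000, 6_400_000)` for the second, `inter_step_gaps` returns `[450_000]`, whereas pairing consecutive raw rows would produce a negative gap between the two copies of the first step.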

5. Benchmark Window Selection

The allreduce-based window captures context+generation phases; steady-state NVTX filtering captures generation-only. Both are valid; use the appropriate one for your comparison goal.

Handoff to Optimization

When analysis is complete and the verdict is YES, hand off to the `perf-host-optimization` skill with:

Detection verdict and evidence: Which metrics crossed thresholds (M1-M5), whether host prep was confirmed (M3b+M3c), and per-phase breakdown.
NVTX-based triage (from Root Cause): Top regressing operations by absolute delta (us/step). Map NVTX range names to source functions -- see references/trtllm-nvtx-ranges.md.
Handoff data block: Include structured data from references/output-format.md (see "Handoff to Optimization" section).
Kernel-level findings (from drill-down, if performed): Inter-kernel gap distribution, graph coverage ratio, per-group overhead map, and straggler rank identification. Map findings to optimization patterns using the table in the Root Cause kernel-level drill-down section above.

Reference

File	Contents
references/metrics.md	Full metric definitions, formulas, SQL queries, M3 sub-metric analysis
references/thresholds.md	Aggregate and per-phase threshold tables
references/phase-classification.md	NVTX marker parsing, iteration classification, per-phase aggregation
references/output-format.md	Report template and integration JSON schema
references/examples.md	Worked scenarios (aggregate, phase-specific, and case study)
references/iteration-isolation-techniques.md	Allreduce, NVTX, and kernel-density iteration isolation techniques
references/trtllm-nvtx-ranges.md	TRT-LLM NVTX range reference with per-operation timings
references/kernel-level-analysis.md	Kernel-level drill-down techniques: gap analysis, graph classification, pattern mapping, straggler detection
references/nsys-schema.md	nsys SQLite schema reference and useful queries
scripts/analyze_host_overhead.py	Python script for Phase 2 root cause analysis
scripts/detect_host_overhead.py	Python script for Phase 1 detection verdict
