perf-host-analysis

Host Performance Analysis

Analyze host/CPU overhead in TensorRT-LLM inference workloads from nsys traces. This skill operates in two phases:
| Phase | Question | Input | Output |
|---|---|---|---|
| Detection | Is host overhead the bottleneck? | Single nsys trace | YES/NO verdict with metric evidence |
| Root Cause | What specifically regressed? | One or two nsys traces | NVTX per-step breakdown, regression sources, optional kernel-level drill-down |

When to Use

  • Before starting host optimization work -- confirms the bottleneck is real (Detection)
  • As a sub-step of perf-analysis for bottleneck classification (Detection)
  • When GPU utilization is suspiciously low and you need to know why (Detection)
  • When throughput regressed but GPU kernel execution times are unchanged (Root Cause)
  • When the gap between forward step iterations has increased (Root Cause)
  • To compare inter-iteration overhead between two versions of the inference engine (Root Cause)
  • When you need sub-operation granularity on inter-kernel gaps or graph coverage (Root Cause, kernel-level drill-down)
  • When piecewise CUDA graph coverage is unexpectedly low (Root Cause, kernel-level drill-down)
  • When multi-rank inference shows unexplained performance asymmetry (Root Cause, kernel-level drill-down)
Do NOT use when:
  • The regression is in individual kernel performance (use perf-nsight-compute-analysis)
  • You need to profile a workload from scratch (use workload-instrumentation first)
  • The issue is NCCL communication (use distributed analysis)

Prerequisites

  • An nsys trace file (.sqlite or .nsys-rep) from a TRT-LLM benchmark run
  • For Root Cause comparison: two traces (baseline and target)
  • Python 3 with sqlite3 support

Key Concepts

Host Overhead in LLM Inference

In an LLM inference loop, each iteration consists of:
[inter-step gap] -> [_forward_step] -> [inter-step gap] -> [_forward_step] -> ...
The forward step includes GPU kernel execution (GEMM, attention, normalization, allreduce) plus host-side preparation. The inter-step gap includes host-side work between forward steps (scheduling, request fetching, broadcasting, sampling, response handling).
See references/trtllm-nvtx-ranges.md for the full per-operation breakdown and timing ranges.
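Once forward-step boundaries are known, the inter-step gap follows directly from the loop structure above. The following is a minimal sketch, assuming the forward steps of a single rank are already available as (start_ns, end_ns) pairs; the helper name `inter_step_gaps` is illustrative and is not part of the skill's scripts.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (start_ns, end_ns) of one _forward_step range


def inter_step_gaps(forward_steps: List[Range]) -> List[int]:
    """Return the gap in ns between each forward step's end and the next one's start."""
    steps = sorted(forward_steps)
    return [nxt[0] - cur[1] for cur, nxt in zip(steps, steps[1:])]


# Three synthetic forward steps of 3000 ns each.
steps = [(0, 3000), (3400, 6400), (6900, 9900)]
print(inter_step_gaps(steps))  # [400, 500]
```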

Hidden vs Exposed Host Overhead

Host overhead only hurts performance when it is exposed -- the GPU is idle waiting for work. When host prep overlaps with GPU execution, it is hidden and free. See references/metrics.md (M3 section) for diagrams and the exposed/hidden computation.
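The hidden/exposed split can be pictured as an interval intersection between a host prep range and the GPU busy intervals. The sketch below is illustrative only: the authoritative computation is the one in references/metrics.md, and the function names and the assumption that `gpu_busy` has already been merged into non-overlapping intervals are mine.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of [a_start, a_end) and [b_start, b_end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def split_host_prep(prep, gpu_busy):
    """Split one host prep range into (hidden, exposed) time.

    prep: (start, end) of the host prep range.
    gpu_busy: non-overlapping (start, end) intervals where the GPU is executing.
    """
    total = prep[1] - prep[0]
    hidden = sum(overlap(prep[0], prep[1], s, e) for s, e in gpu_busy)
    return hidden, total - hidden


# Host prep runs from 100 to 500; the GPU is busy during 0-250 and 400-600.
hidden, exposed = split_host_prep((100, 500), [(0, 250), (400, 600)])
print(hidden, exposed)  # 250 150
```

In this example 250 time units of prep overlap GPU work and cost nothing, while the remaining 150 leave the GPU idle.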

Forward Step Isolation

In TP configurations, forward steps are isolated via allreduce kernel grouping (deterministic count per transformer layer). For TP=1, NVTX _forward_step ranges are used directly. See references/iteration-isolation-techniques.md for the full algorithm.
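As a rough illustration of why a deterministic allreduce count helps, the sketch below chunks a sorted list of allreduce kernels into fixed-size groups and treats each group as one forward step. It is not the algorithm from references/iteration-isolation-techniques.md: the function name, the default `allreduces_per_layer=2`, and the assumption that the list starts exactly at a step boundary are all hypothetical.

```python
def group_forward_steps(allreduce_kernels, num_layers, allreduces_per_layer=2):
    """Group allreduce kernels into forward steps by their fixed per-step count.

    allreduce_kernels: (start_ns, end_ns) pairs for allreduce kernels on one rank.
    Returns one (first_start, last_end) span per complete group.
    """
    per_step = num_layers * allreduces_per_layer
    kernels = sorted(allreduce_kernels)
    steps = []
    # Step through complete groups only; a trailing partial group is dropped.
    for i in range(0, len(kernels) - per_step + 1, per_step):
        chunk = kernels[i:i + per_step]
        steps.append((chunk[0][0], chunk[-1][1]))
    return steps


# Two layers with two allreduces each: four allreduce kernels per forward step.
kernels = [(i * 10, i * 10 + 5) for i in range(9)]
print(group_forward_steps(kernels, num_layers=2))  # [(0, 35), (40, 75)]
```

The spans cover only the first-to-last allreduce of each step, so they approximate the forward step rather than delimit it exactly.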

Phase Classification (Context vs Generation)

Iterations are classified by NVTX marker text into context (eager, no CUDA graphs) and generation (CUDA graph replay). Per-phase analysis is critical because aggregate metrics can mask phase-specific bottlenecks. See references/phase-classification.md.

Phase 1: Detection (YES/NO Verdict)

Determine whether host overhead is the primary bottleneck.

Detection Metrics

Six metrics in four categories (M1, M2, M3a-c, M4), plus M5 as an NCCL caveat check. See references/metrics.md for full definitions, formulas, and SQL queries.
| # | Metric | Threshold | What it answers |
|---|---|---|---|
| M1 | GPU idle ratio | > 0.30 | Is the GPU starved for work? |
| M2 | Launch overhead ratio | > 0.10 | Is kernel launch itself expensive? |
| M3a | Host prep exposed ratio | > 0.50 | How well is host prep pipelined? |
| M3b | Host prep perf impact | > 0.05 | How much throughput does exposed prep cost? |
| M3c | Host prep idle attribution | > 0.50 | Is host prep the main cause of GPU idle? |
| M4 | GPU utilization | < 0.60 | Is GPU utilization too low? |
| M5 | NCCL ratio (caveat) | > 0.20 | Is communication a confounding factor? |
Host prep confirmation rule: Host prep is a confirmed bottleneck only when both M3b AND M3c cross their thresholds.
Thresholds are configurable with per-phase variants. See references/thresholds.md.
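The threshold table and the host prep confirmation rule can be expressed as a small lookup. This is a sketch using the aggregate thresholds listed above; the dictionary layout and function names are illustrative, and it does not reproduce the script's full verdict logic or the per-phase variants from references/thresholds.md.

```python
import operator

# (comparison, limit) per metric, taken from the Detection Metrics table.
THRESHOLDS = {
    "M1": (operator.gt, 0.30),
    "M2": (operator.gt, 0.10),
    "M3a": (operator.gt, 0.50),
    "M3b": (operator.gt, 0.05),
    "M3c": (operator.gt, 0.50),
    "M4": (operator.lt, 0.60),
    "M5": (operator.gt, 0.20),
}


def crossed(metrics):
    """Map each supplied metric name to True if it crosses its threshold."""
    return {
        name: compare(metrics[name], limit)
        for name, (compare, limit) in THRESHOLDS.items()
        if name in metrics
    }


def host_prep_confirmed(metrics):
    """Host prep is confirmed only when both M3b and M3c cross their thresholds."""
    flags = crossed(metrics)
    return flags.get("M3b", False) and flags.get("M3c", False)


sample = {"M1": 0.42, "M2": 0.04, "M3b": 0.08, "M3c": 0.35, "M4": 0.55}
print(crossed(sample))
print(host_prep_confirmed(sample))  # False: M3c does not cross 0.50
```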

Detection Workflow

Step 1: Input Validation

bash
# Accept .sqlite or .nsys-rep
ls -la <trace_file>

# If .nsys-rep, export to SQLite first
nsys export -t sqlite -o <output.sqlite> <input.nsys-rep>

Step 2: Run Detection Script

bash
python scripts/detect_host_overhead.py \
  --trace /path/to/trace.sqlite \
  --output /path/to/verdict.json
The script computes M1, M2, M4, M5 from SQL, optionally M3 via range intersection, applies the verdict logic, and outputs structured JSON. See references/output-format.md for the output schema.
For manual metric extraction via SQL, see references/nsys-schema.md.
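For a quick manual check without the script, a GPU idle ratio can be approximated straight from the kernel table. The sketch below assumes one plausible definition (one minus the union of kernel busy time over the span from first kernel start to last kernel end, all kernels on one device); the authoritative M1 formula is in references/metrics.md, and this version looks at the whole table rather than isolated forward steps. The in-memory table is synthetic and carries only the two columns the query needs.

```python
import sqlite3


def gpu_idle_ratio(conn):
    """Return 1 - (union of kernel busy time / kernel span), or None if no kernels."""
    rows = conn.execute(
        'SELECT start, "end" FROM CUPTI_ACTIVITY_KIND_KERNEL ORDER BY start'
    ).fetchall()
    if not rows:
        return None
    first_start = rows[0][0]
    cur_start, cur_end = rows[0]
    busy = 0
    for start, end in rows[1:]:
        if start > cur_end:
            # Disjoint from the current busy interval: close it and start a new one.
            busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping kernels (for example on different streams): extend.
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start
    span = cur_end - first_start
    return 1.0 - busy / span if span else 0.0


# Synthetic stand-in for an exported trace.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (start INTEGER, "end" INTEGER)')
conn.executemany(
    "INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (?, ?)",
    [(0, 100), (50, 150), (300, 400)],
)
print(gpu_idle_ratio(conn))  # 0.375
```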

Step 3: Interpret Verdict

Overall Verdict:
if aggregate_verdict == YES or context_verdict == YES or generation_verdict == YES:
    overall_verdict = YES
Per-phase analysis can elevate the verdict but never demote it.
Format using the template in references/output-format.md.
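The overall verdict rule above can be written as a small function. This is a sketch: the function name is illustrative, and the explicit "NO" fallback is an assumption inferred from the statement that per-phase analysis never demotes the verdict.

```python
def overall_verdict(aggregate, context=None, generation=None):
    """Return "YES" if any of the aggregate or per-phase verdicts is "YES"."""
    verdicts = (aggregate, context, generation)
    return "YES" if any(v == "YES" for v in verdicts) else "NO"


# A context-only bottleneck elevates the overall verdict.
print(overall_verdict("NO", context="YES", generation="NO"))  # YES
```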
Next Steps:
  • If YES -> Proceed to Phase 2 (Root Cause) below, then use the perf-host-optimization skill
  • If NO -> Use perf-nsight-compute-analysis for kernel SOL% or trace-interpretation for full classification

Phase 2: Root Cause Analysis

Identify which specific host operations regressed and by how much. Works with a single trace (breakdown) or two traces (comparison).

Principles

  1. Isolate forward steps, not the full trace. nsys traces contain warmup, JIT, model loading, and teardown.
  2. Use structural kernel patterns for iteration detection. Allreduce grouping is more robust than kernel density.
  3. Compare steady-state iterations. Filter to identical workload (same batch size, same ctx/gen mix).
  4. Per-step metrics, not totals. Always compare per-step averages.

Root Cause Workflow

Step 1: Collect nsys Traces

Profile both versions (if comparing) with identical settings:
bash
nsys profile -o /path/to/trace \
  -t cuda,nvtx,osrt \
  --force-overwrite=true \
  --cuda-memory-usage=true \
  -w true \
  <benchmark_command> --num_requests 500

Step 2: Export to SQLite

bash
nsys export --type=sqlite --force-overwrite=true -o trace.sqlite trace.nsys-rep

Step 3: Run Host Overhead Analysis

bash
# Two-trace comparison
python scripts/analyze_host_overhead.py \
  --baseline /path/to/baseline/trace.sqlite \
  --target /path/to/target/trace.sqlite \
  --baseline-label "v1.1" \
  --target-label "main" \
  --output /path/to/output/analysis.txt

# Single-trace breakdown
python scripts/analyze_host_overhead.py \
  --baseline /path/to/trace.sqlite \
  --baseline-label "current"

Step 4: Interpret Results

The script produces:
  1. Allreduce-based iteration detection -- confirms forward step boundaries
  2. Per-step wall time comparison -- quantifies the regression
  3. NVTX per-step breakdown -- identifies which host operations regressed
  4. GPU kernel comparison -- confirms GPU execution is unchanged
  5. CUDA API comparison -- detects kernel launch overhead changes

Reading the Output

Per-Step Wall Time:
Avg wall time per step: 3,317 us (baseline) vs 3,978 us (target)  +19.9%
This is the primary regression metric.
NVTX Breakdown:
Operation           | baseline (us/step) | target (us/step) | Delta    | Status
_fetch_new_requests |               36   |             270  | +234     | REGRESSION
broadcast_requests  |                -   |             250  | +250     | NEW
_update_requests    |              413   |             723  | +310     | REGRESSION
Focus on operations with large absolute deltas.
GPU Kernel Comparison:
Kernels per step (launched): 6.2 (baseline) vs 21.9 (target)  +253%
More individual launches = more host-side launch overhead.
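The deltas and percentages in the sample output above can be reproduced with a few lines of arithmetic. The sketch below is not the script's implementation: the function names, the "OK" status, and the `min_delta_us` cutoff of 50 us are hypothetical choices made for illustration.

```python
def pct_change(baseline, target):
    """Relative change from baseline to target, in percent."""
    return (target - baseline) / baseline * 100.0


def compare_ops(baseline, target, min_delta_us=50.0):
    """Return (operation, delta_us, status) rows sorted by largest delta first."""
    rows = []
    for op in set(baseline) | set(target):
        base = baseline.get(op)
        tgt = target.get(op, 0.0)
        delta = tgt - (base or 0.0)
        if base is None:
            status = "NEW"          # operation absent from the baseline trace
        elif delta >= min_delta_us:
            status = "REGRESSION"
        else:
            status = "OK"
        rows.append((op, delta, status))
    return sorted(rows, key=lambda row: row[1], reverse=True)


baseline = {"_fetch_new_requests": 36, "_update_requests": 413}
target = {"_fetch_new_requests": 270, "broadcast_requests": 250, "_update_requests": 723}

for row in compare_ops(baseline, target):
    print(row)
print(round(pct_change(3317, 3978), 1))  # 19.9
```

Sorting by absolute delta puts _update_requests (+310 us/step) first, which matches the advice to focus on large absolute deltas rather than large ratios.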

Step 5: Kernel-Level Drill-Down (Optional)

When the NVTX breakdown identifies a regressing operation but does not reveal why (the overhead is inside the GPU dispatch, not between NVTX ranges), drill below NVTX operations into individual GPU kernel launches.
See references/kernel-level-analysis.md for full technique details, SQL queries, and examples.
When to drill down:
  • An operation has high wall time but the overhead is inside GPU dispatch, not between NVTX ranges
  • You need to understand how much of the forward pass is graph-captured vs eager
  • Per-layer overhead is significant and you need to map kernels to functional groups
  • Multi-rank inference shows unexplained performance asymmetry

Kernel-Level Techniques

| Technique | Question | Key Output |
|---|---|---|
| Inter-Kernel Gap Analysis | Where is the GPU idle between kernels? | Gap bucket distribution, top-N largest gaps with source mapping |
| Eager vs Graph Classification | What fraction of kernels are graph-captured? | Graph coverage ratio, list of eager kernels with source attribution |
| Repeating-Pattern Mapping | Which functional group within a layer has the most overhead? | Per-group gap totals, priority ranking |
| Straggler Detection | Is one rank consistently slower? | Straggler rank ID, root cause (extra host work, queue depth feedback loop) |

Workflow

  1. Start with Inter-Kernel Gap Analysis — bucket the gap distribution to understand the dominant overhead type (graph dispatch, Python interpreter, host-device sync)
  2. If piecewise graph is in use, run Eager vs Graph Classification to measure graph coverage and identify unnecessary eager kernels
  3. For per-layer overhead, use Repeating-Pattern Mapping to isolate the highest-overhead functional group within a single layer
  4. For multi-rank setups, run Straggler Detection if per-step wall time varies across ranks
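The first step of the workflow above, bucketing the inter-kernel gap distribution, might look like the sketch below. The bucket edges and labels are hypothetical and chosen only for illustration; the technique as actually used, including how buckets map to overhead types, is described in references/kernel-level-analysis.md.

```python
from collections import Counter

# Hypothetical bucket edges in microseconds: (low, high, label).
BUCKETS_US = [
    (0, 5, "<5us"),
    (5, 20, "5-20us"),
    (20, 100, "20-100us"),
    (100, float("inf"), ">=100us"),
]


def gap_histogram(kernels_ns):
    """Count positive gaps between consecutive kernels, grouped by bucket label."""
    kernels = sorted(kernels_ns)
    counts = Counter()
    if not kernels:
        return counts
    prev_end = kernels[0][1]
    for start, end in kernels[1:]:
        gap_us = (start - prev_end) / 1000
        if gap_us > 0:
            label = next(lbl for lo, hi, lbl in BUCKETS_US if lo <= gap_us < hi)
            counts[label] += 1
        # Track the furthest end seen so overlapping kernels do not create fake gaps.
        prev_end = max(prev_end, end)
    return counts


kernels = [(0, 1000), (3000, 4000), (16000, 17000), (200000, 201000)]
print(dict(gap_histogram(kernels)))  # {'<5us': 1, '5-20us': 1, '>=100us': 1}
```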

Kernel-Level Findings to Optimization Patterns

| Finding | Optimization Pattern |
|---|---|
| Large gaps from Python tensor view chains | CUSTOM_OP: replace with C++ custom op |
| Graph-capturable kernels running eagerly | GRAPH_EXPAND: fix partition poisoning |
| Monolithic custom op blocking graph capture | GRAPH_SPLIT: split into capturable + eager parts |
| Host-device sync (.item()) in per-layer code | SYNC (Pattern 1: pre-compute on CPU) + HOIST (Variant B: pass from step level) |
| Per-layer buffer allocation | ALLOC: pre-allocate at init |
| Straggler rank with extra host work | Apply targeted optimization to coordinator-only code paths |

Common Patterns and Root Causes

Pattern 1: Request Management Refactor

Symptom: _fetch_new_requests regressed 5-10x, new broadcast_requests operation.
Cause: Request fetching refactored for multi-rank broadcasting in TP.
Mitigation: Optimize broadcast path; batch request state updates.

Pattern 2: Increased Kernel Launch Count

Symptom: 3-5x more cudaLaunchKernel calls per step, similar GPU time.
Cause: Operations that were fused or graph-captured are now individual launches.
Mitigation: Re-fuse kernels; extend CUDA graph capture scope.

Pattern 3: New Bookkeeping Operations

Symptom: New NVTX ranges like _write_finish_reasons, handle_additional_outputs.
Cause: New features added to the inference loop without overhead budgeting.
Mitigation: Defer non-critical bookkeeping to async paths; batch updates.

Pattern 4: Flashinfer JIT Warmup Masquerading as Inference

Symptom: Massive elementwise/reduce kernel counts in "steady state" analysis.
Cause: Analysis window includes flashinfer JIT compilation phase.
Fix: Use allreduce-based iteration isolation, not kernel density or time windows.

Pattern 5: Context-Only Bottleneck (Masked by Aggregate)

Symptom: Aggregate metrics below threshold, but context iterations have 50% GPU idle.
Cause: Generation iterations dilute the context-phase bottleneck.
Fix: Per-phase analysis in Detection phase catches this.

Pitfalls

1. shortName is an Integer ID

In CUPTI_ACTIVITY_KIND_KERNEL, shortName is an integer referencing StringIds.id. Always join. See references/nsys-schema.md.

2. NVTX textId vs text

Most NVTX events have textId (integer) but NULL text. Join with StringIds. See references/nsys-schema.md.
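The two joins from Pitfalls 1 and 2 can be sketched as follows. The in-memory tables are a trimmed, synthetic stand-in for a real export (real tables carry many more columns), and the constant names are illustrative; see references/nsys-schema.md for the queries the skill actually uses.

```python
import sqlite3

# Pitfall 1: resolve the integer shortName through StringIds.
KERNEL_NAMES_SQL = """
SELECT s.value AS name, COUNT(*) AS launches
FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
JOIN StringIds AS s ON s.id = k.shortName
GROUP BY s.value
ORDER BY launches DESC, name
"""

# Pitfall 2: prefer inline text, fall back to the StringIds lookup via textId.
NVTX_RANGES_SQL = """
SELECT COALESCE(n.text, s.value) AS name, n.start, n."end"
FROM NVTX_EVENTS AS n
LEFT JOIN StringIds AS s ON s.id = n.textId
WHERE COALESCE(n.text, s.value) = ?
ORDER BY n.start
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (start INTEGER, "end" INTEGER, shortName INTEGER);
CREATE TABLE NVTX_EVENTS (start INTEGER, "end" INTEGER, text TEXT, textId INTEGER);
INSERT INTO StringIds VALUES (1, 'gemm_kernel'), (2, 'allreduce_kernel'), (3, '_forward_step');
INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (0, 10, 1), (20, 30, 1), (40, 50, 2);
INSERT INTO NVTX_EVENTS VALUES (0, 60, NULL, 3), (70, 130, '_forward_step', NULL);
""")

kernel_counts = conn.execute(KERNEL_NAMES_SQL).fetchall()
forward_steps = conn.execute(NVTX_RANGES_SQL, ("_forward_step",)).fetchall()
print(kernel_counts)
print(forward_steps)
```

The LEFT JOIN keeps events whose text is stored inline, so both storage styles are matched by the same query.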

3. Duplicate NVTX Ranges from TP Ranks

In TP configurations, each rank reports NVTX ranges independently. De-duplicate by grouping entries within 100us of each other.

4. Negative Inter-Step Gaps

When TP ranks report overlapping NVTX ranges, gap = next_start - prev_end can be negative. Use the maximum end time when de-duplicating.
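Pitfalls 3 and 4 combine naturally into one de-duplication pass: group ranges whose starts fall within 100 us of each other and keep the maximum end time per group. The sketch below is one way to do that, assuming nanosecond timestamps and that "within 100us" is measured from the first start in each group; the function name is illustrative.

```python
def dedupe_ranges(ranges_ns, tolerance_ns=100_000):
    """Merge per-rank duplicates of the same NVTX range.

    Ranges whose start is within tolerance_ns of the current group's first start
    are treated as the same step; the merged range keeps the maximum end time.
    """
    merged = []
    for start, end in sorted(ranges_ns):
        if merged and start - merged[-1][0] <= tolerance_ns:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Two TP ranks each report the same two forward steps with slight skew.
ranges = [
    (0, 3_000_000), (40_000, 3_050_000),
    (3_400_000, 6_400_000), (3_430_000, 6_380_000),
]
steps = dedupe_ranges(ranges)
gaps = [nxt[0] - cur[1] for cur, nxt in zip(steps, steps[1:])]
print(steps)  # [(0, 3050000), (3400000, 6400000)]
print(gaps)   # [350000]
```

Taking the maximum end keeps the gap non-negative here: the slower rank's end time (3,050,000 ns) is used instead of the faster rank's.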

5. Benchmark Window Selection

The allreduce-based window captures context+generation phases; steady-state NVTX filtering captures generation-only. Both are valid; use the appropriate one for your comparison goal.

Handoff to Optimization

When analysis is complete and the verdict is YES, hand off to the perf-host-optimization skill with:
  1. Detection verdict and evidence: Which metrics crossed thresholds (M1-M5), whether host prep was confirmed (M3b+M3c), and per-phase breakdown.
  2. NVTX-based triage (from Root Cause): Top regressing operations by absolute delta (us/step). Map NVTX range names to source functions -- see references/trtllm-nvtx-ranges.md.
  3. Handoff data block: Include structured data from references/output-format.md (see "Handoff to Optimization" section).
  4. Kernel-level findings (from drill-down, if performed): Inter-kernel gap distribution, graph coverage ratio, per-group overhead map, and straggler rank identification. Map findings to optimization patterns using the table in the Root Cause kernel-level drill-down section above.

Reference

| File | Contents |
|---|---|
| references/metrics.md | Full metric definitions, formulas, SQL queries, M3 sub-metric analysis |
| references/thresholds.md | Aggregate and per-phase threshold tables |
| references/phase-classification.md | NVTX marker parsing, iteration classification, per-phase aggregation |
| references/output-format.md | Report template and integration JSON schema |
| references/examples.md | Worked scenarios (aggregate, phase-specific, and case study) |
| references/iteration-isolation-techniques.md | Allreduce, NVTX, and kernel-density iteration isolation techniques |
| references/trtllm-nvtx-ranges.md | TRT-LLM NVTX range reference with per-operation timings |
| references/kernel-level-analysis.md | Kernel-level drill-down techniques: gap analysis, graph classification, pattern mapping, straggler detection |
| references/nsys-schema.md | nsys SQLite schema reference and useful queries |
| scripts/analyze_host_overhead.py | Python script for Phase 2 root cause analysis |
| scripts/detect_host_overhead.py | Python script for Phase 1 detection verdict |