Host Performance Analysis
Analyze host/CPU overhead in TensorRT-LLM inference workloads from nsys traces. This skill operates in two phases:
| Phase | Question | Input | Output |
|---|---|---|---|
| Detection | Is host overhead the bottleneck? | Single nsys trace | YES/NO verdict with metric evidence |
| Root Cause | What specifically regressed? | One or two nsys traces | NVTX per-step breakdown, regression sources, optional kernel-level drill-down |
When to Use
- Before starting host optimization work -- confirms the bottleneck is real (Detection)
- As a sub-step of for bottleneck classification (Detection)
- When GPU utilization is suspiciously low and you need to know why (Detection)
- When throughput regressed but GPU kernel execution times are unchanged (Root Cause)
- When the gap between forward step iterations has increased (Root Cause)
- To compare inter-iteration overhead between two versions of the inference engine (Root Cause)
- When you need sub-operation granularity on inter-kernel gaps or graph coverage (Root Cause, kernel-level drill-down)
- When piecewise CUDA graph coverage is unexpectedly low (Root Cause, kernel-level drill-down)
- When multi-rank inference shows unexplained performance asymmetry (Root Cause, kernel-level drill-down)
Do NOT use when:
- The regression is in individual kernel performance (use perf-nsight-compute-analysis)
- You need to profile a workload from scratch (use first)
- The issue is NCCL communication (use distributed analysis)
Prerequisites
- An nsys trace file (.nsys-rep or .sqlite) from a TRT-LLM benchmark run
- For Root Cause comparison: two traces (baseline and target)
- Python 3 with sqlite3 support
Key Concepts
Host Overhead in LLM Inference
In an LLM inference loop, each iteration consists of:
[inter-step gap] -> [_forward_step] -> [inter-step gap] -> [_forward_step] -> ...
The forward step includes GPU kernel execution (GEMM, attention, normalization, allreduce) plus host-side preparation. The inter-step gap includes host-side work between forward steps (scheduling, request fetching, broadcasting, sampling, response handling).
See references/trtllm-nvtx-ranges.md for the full per-operation breakdown and timing ranges.
Hidden vs Exposed Host Overhead
Host overhead only hurts performance when it is exposed -- the GPU is idle waiting for work. When host prep overlaps with GPU execution, it is hidden and free. See references/metrics.md (M3 section) for diagrams and the exposed/hidden computation.
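As a rough illustration of the hidden/exposed split, the sketch below intersects one host-prep interval with the intervals where the GPU is busy. It is not the computation from references/metrics.md; the function names and the example timeline are hypothetical.

```python
def overlap_us(a_start, a_end, b_start, b_end):
    """Length of the intersection of two [start, end) intervals, never negative."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def split_host_prep(prep, gpu_busy):
    """Split one host-prep interval into (hidden_us, exposed_us).

    prep: (start, end) of the host-prep range.
    gpu_busy: non-overlapping (start, end) intervals where the GPU is executing.
    """
    start, end = prep
    hidden = sum(overlap_us(start, end, s, e) for s, e in gpu_busy)
    exposed = (end - start) - hidden
    return hidden, exposed


# Hypothetical timeline in microseconds: the GPU is busy for the first 300 us
# of a 400 us host-prep range, so the last 100 us leave the GPU idle.
hidden, exposed = split_host_prep((1000, 1400), [(900, 1300)])
exposed_ratio = exposed / (hidden + exposed)
print(hidden, exposed, exposed_ratio)
```

Only the exposed portion counts against throughput; the hidden portion overlaps GPU work that would have run anyway.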
Forward Step Isolation
In TP configurations, forward steps are isolated via allreduce kernel grouping (deterministic count per transformer layer). For TP=1, NVTX ranges are used directly. See references/iteration-isolation-techniques.md for the full algorithm.
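The grouping idea can be sketched as follows, assuming the allreduce kernels of one rank have already been extracted and that the number of allreduces per forward step is known and constant. The function name and the `allreduces_per_step` parameter are hypothetical, and the returned windows span only the first to the last allreduce of each step; the reference algorithm may differ.

```python
def group_forward_steps(allreduce_kernels, allreduces_per_step):
    """Chunk time-ordered allreduce kernels into forward-step windows.

    allreduce_kernels: (start, end) tuples for one rank.
    allreduces_per_step: assumed constant, e.g. num_layers * allreduces_per_layer.
    Returns (window_start, window_end) per step; a trailing partial group is dropped.
    """
    kernels = sorted(allreduce_kernels)
    steps = []
    for i in range(0, len(kernels) - allreduces_per_step + 1, allreduces_per_step):
        group = kernels[i:i + allreduces_per_step]
        steps.append((group[0][0], max(end for _, end in group)))
    return steps


# Hypothetical trace: 9 allreduce kernels, 4 per step (2 layers x 2 allreduces).
kernels = [(i * 10, i * 10 + 5) for i in range(9)]
steps = group_forward_steps(kernels, allreduces_per_step=4)
print(steps)
```

The ninth kernel belongs to an incomplete step and is ignored, which keeps truncated iterations at the end of a trace out of the per-step averages.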
Phase Classification (Context vs Generation)
Iterations are classified by NVTX marker text into context (eager, no CUDA graphs) and generation (CUDA graph replay). Per-phase analysis is critical because aggregate metrics can mask phase-specific bottlenecks. See references/phase-classification.md.
Phase 1: Detection (YES/NO Verdict)
Determine whether host overhead is the primary bottleneck.
Detection Metrics
Six metrics in four categories (M1-M4), plus M5 as a confounder check. See references/metrics.md for full definitions, formulas, and SQL queries.
| # | Metric | Threshold | What it answers |
|---|---|---|---|
| M1 | GPU idle ratio | > 0.30 | Is the GPU starved for work? |
| M2 | Launch overhead ratio | > 0.10 | Is kernel launch itself expensive? |
| M3a | Host prep exposed ratio | > 0.50 | How well is host prep pipelined? |
| M3b | Host prep perf impact | > 0.05 | How much throughput does exposed prep cost? |
| M3c | Host prep idle attribution | > 0.50 | Is host prep the main cause of GPU idle? |
| M4 | GPU utilization | < 0.60 | Is GPU utilization too low? |
| M5 | NCCL ratio (caveat) | > 0.20 | Is communication a confounding factor? |
Host prep confirmation rule: Host prep is a confirmed bottleneck only when both M3b AND M3c cross their thresholds.
Thresholds are configurable with per-phase variants. See references/thresholds.md.
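A minimal sketch of how the table and the confirmation rule could be applied to a set of computed metrics. The threshold values are copied from the table above; the function names and the example metric values are hypothetical, and the real script's verdict logic may weigh the metrics differently.

```python
import operator

# (threshold, comparison) per metric, copied from the Detection Metrics table.
THRESHOLDS = {
    "M1": (0.30, operator.gt),
    "M2": (0.10, operator.gt),
    "M3a": (0.50, operator.gt),
    "M3b": (0.05, operator.gt),
    "M3c": (0.50, operator.gt),
    "M4": (0.60, operator.lt),
    "M5": (0.20, operator.gt),
}


def crossed_metrics(metrics):
    """Names of the metrics whose value crosses its threshold."""
    crossed = []
    for name, value in metrics.items():
        if name in THRESHOLDS:
            threshold, compare = THRESHOLDS[name]
            if compare(value, threshold):
                crossed.append(name)
    return sorted(crossed)


def host_prep_confirmed(metrics):
    """Host prep is confirmed only when both M3b and M3c cross their thresholds."""
    crossed = crossed_metrics(metrics)
    return "M3b" in crossed and "M3c" in crossed


# Hypothetical measurements for one trace.
example = {"M1": 0.42, "M2": 0.04, "M3a": 0.61, "M3b": 0.08,
           "M3c": 0.35, "M4": 0.55, "M5": 0.02}
print(crossed_metrics(example), host_prep_confirmed(example))
```

In this example M3b crosses its threshold but M3c does not, so host prep costs throughput without being the main cause of GPU idle, and it is not confirmed as the bottleneck.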
Detection Workflow
Step 1: Input Validation
bash
# Accept .sqlite or .nsys-rep
ls -la <trace_file>
# If .nsys-rep, export to SQLite first
nsys export -t sqlite -o <output.sqlite> <input.nsys-rep>
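After export, a quick sanity check can confirm that the SQLite file contains the tables the analysis relies on. This is a sketch, not part of the skill's scripts: the helper name is hypothetical, and the table list assumes the standard nsys export names (`CUPTI_ACTIVITY_KIND_KERNEL`, `NVTX_EVENTS`, `StringIds`). The demo uses an in-memory database as a stand-in for a real trace.

```python
import sqlite3

# Tables assumed to be needed for kernel and NVTX analysis.
REQUIRED_TABLES = ("CUPTI_ACTIVITY_KIND_KERNEL", "NVTX_EVENTS", "StringIds")


def missing_tables(conn):
    """Return the required tables that are absent from the opened trace."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    present = {name for (name,) in rows}
    return [table for table in REQUIRED_TABLES if table not in present]


# Stand-in for sqlite3.connect("/path/to/trace.sqlite"): a trace that was
# collected without CUDA tracing has no kernel table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE NVTX_EVENTS (start INTEGER, textId INTEGER)")
missing = missing_tables(conn)
print(missing)
```

A non-empty result usually means the trace was collected without the corresponding `-t` option and needs to be re-profiled.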
Step 2: Run Detection Script
bash
python scripts/detect_host_overhead.py \
    --trace /path/to/trace.sqlite \
    --output /path/to/verdict.json
The script computes M1, M2, M4, M5 from SQL, optionally M3 via range intersection, applies the verdict logic, and outputs structured JSON. See references/output-format.md for the output schema.
For manual metric extraction via SQL, see references/nsys-schema.md.
Step 3: Interpret Verdict
Overall Verdict:
if aggregate_verdict == "YES" or context_verdict == "YES" or generation_verdict == "YES":
    overall_verdict = "YES"
else:
    overall_verdict = "NO"
Per-phase analysis can elevate the verdict but never demote it.
Format using the template in references/output-format.md.
Next Steps:
- If YES -> Proceed to Phase 2 (Root Cause) below, then use skill
- If NO -> Use perf-nsight-compute-analysis for kernel SOL% or for full classification
Phase 2: Root Cause Analysis
Identify which specific host operations regressed and by how much. Works with a single trace (breakdown) or two traces (comparison).
Principles
- Isolate forward steps, not the full trace. nsys traces contain warmup, JIT, model loading, and teardown.
- Use structural kernel patterns for iteration detection. Allreduce grouping is more robust than kernel density.
- Compare steady-state iterations. Filter to identical workload (same batch size, same ctx/gen mix).
- Per-step metrics, not totals. Always compare per-step averages.
Root Cause Workflow
Step 1: Collect nsys Traces
Profile both versions (if comparing) with identical settings:
bash
nsys profile -o /path/to/trace \
    -t cuda,nvtx,osrt \
    --force-overwrite=true \
    --cuda-memory-usage=true \
    -w true \
    <benchmark_command> --num_requests 500
Step 2: Export to SQLite
bash
nsys export --type=sqlite --force-overwrite=true -o trace.sqlite trace.nsys-rep
Step 3: Run Host Overhead Analysis
bash
# Two-trace comparison
python scripts/analyze_host_overhead.py \
    --baseline /path/to/baseline/trace.sqlite \
    --target /path/to/target/trace.sqlite \
    --baseline-label "v1.1" \
    --target-label "main" \
    --output /path/to/output/analysis.txt
# Single-trace breakdown
python scripts/analyze_host_overhead.py \
    --baseline /path/to/trace.sqlite \
    --baseline-label "current"
Step 4: Interpret Results
The script produces:
- Allreduce-based iteration detection -- confirms forward step boundaries
- Per-step wall time comparison -- quantifies the regression
- NVTX per-step breakdown -- identifies which host operations regressed
- GPU kernel comparison -- confirms GPU execution is unchanged
- CUDA API comparison -- detects kernel launch overhead changes
Reading the Output
Per-Step Wall Time:
Avg wall time per step: 3,317 us (baseline) vs 3,978 us (target) +19.9%
This is the primary regression metric.
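The percentage is the relative change of the target against the baseline. A small sketch with a hypothetical helper name, using the wall-time figures from the sample output:

```python
def pct_change(baseline, target):
    """Relative change of target versus baseline, in percent."""
    return (target - baseline) / baseline * 100.0


wall_time_delta = f"{pct_change(3317, 3978):+.1f}%"
print(wall_time_delta)  # +19.9%
```

The same calculation applies to any other per-step figure compared between the two traces.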
NVTX Breakdown:
Operation | baseline (us/step) | target (us/step) | Delta | Status
_fetch_new_requests | 36 | 270 | +234 | REGRESSION
broadcast_requests | - | 250 | +250 | NEW
_update_requests | 413 | 723 | +310 | REGRESSION
Focus on operations with large absolute deltas.
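The ranking by absolute delta can be sketched as below, using the per-step values from the sample breakdown. The function name is hypothetical, and the `REMOVED` and `OK` labels are assumptions added for completeness; only `REGRESSION` and `NEW` appear in the sample output, and the real script may apply a minimum-delta threshold before flagging a regression.

```python
def nvtx_deltas(baseline, target):
    """Rows of (operation, baseline_us, target_us, delta_us, status),
    ordered by absolute delta, largest first."""
    rows = []
    for op in sorted(set(baseline) | set(target)):
        base = baseline.get(op)
        tgt = target.get(op)
        delta = (tgt or 0) - (base or 0)
        if base is None:
            status = "NEW"
        elif tgt is None:
            status = "REMOVED"      # assumed label, not in the sample output
        elif delta > 0:
            status = "REGRESSION"
        else:
            status = "OK"           # assumed label, not in the sample output
        rows.append((op, base, tgt, delta, status))
    return sorted(rows, key=lambda row: abs(row[3]), reverse=True)


# Per-step averages (us/step) taken from the sample breakdown.
baseline = {"_fetch_new_requests": 36, "_update_requests": 413}
target = {"_fetch_new_requests": 270, "broadcast_requests": 250,
          "_update_requests": 723}
rows = nvtx_deltas(baseline, target)
for row in rows:
    print(row)
```

Sorting by absolute rather than relative delta keeps a 7x regression on a cheap operation from outranking a smaller relative change that costs more microseconds per step.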
GPU Kernel Comparison:
Kernels per step (launched): 6.2 (baseline) vs 21.9 (target) +253%
More individual launches = more host-side launch overhead.
Step 5: Kernel-Level Drill-Down (Optional)
When the NVTX breakdown identifies a regressing operation but does not reveal why (the overhead is inside the GPU dispatch, not between NVTX ranges), drill below NVTX operations into individual GPU kernel launches.
See references/kernel-level-analysis.md for full technique details, SQL queries, and examples.
When to drill down:
- An operation has high wall time but the overhead is inside GPU dispatch, not between NVTX ranges
- You need to understand how much of the forward pass is graph-captured vs eager
- Per-layer overhead is significant and you need to map kernels to functional groups
- Multi-rank inference shows unexplained performance asymmetry
Kernel-Level Techniques
| Technique | Question | Key Output |
|---|---|---|
| Inter-Kernel Gap Analysis | Where is the GPU idle between kernels? | Gap bucket distribution, top-N largest gaps with source mapping |
| Eager vs Graph Classification | What fraction of kernels are graph-captured? | Graph coverage ratio, list of eager kernels with source attribution |
| Repeating-Pattern Mapping | Which functional group within a layer has the most overhead? | Per-group gap totals, priority ranking |
| Straggler Detection | Is one rank consistently slower? | Straggler rank ID, root cause (extra host work, queue depth feedback loop) |
Workflow
- Start with Inter-Kernel Gap Analysis — bucket the gap distribution to understand the dominant overhead type (graph dispatch, Python interpreter, host-device sync)
- If piecewise graph is in use, run Eager vs Graph Classification to measure graph coverage and identify unnecessary eager kernels
- For per-layer overhead, use Repeating-Pattern Mapping to isolate the highest-overhead functional group within a single layer
- For multi-rank setups, run Straggler Detection if per-step wall time varies across ranks
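The first step of this workflow, bucketing inter-kernel gaps, can be sketched as follows. It assumes kernel `(start, end)` timestamps in nanoseconds (the unit nsys uses) for a single device, already restricted to the iteration window of interest. The bucket edges and function names are hypothetical; references/kernel-level-analysis.md defines the actual technique.

```python
from collections import Counter

# Hypothetical bucket edges in microseconds: (low, high, label).
BUCKETS = [
    (0, 5, "<5us"),
    (5, 50, "5-50us"),
    (50, 500, "50-500us"),
    (500, float("inf"), ">=500us"),
]


def inter_kernel_gaps(kernels):
    """GPU-idle gaps in microseconds between kernels given as (start_ns, end_ns).

    Overlapping kernels (e.g. on concurrent streams) produce no gap.
    """
    kernels = sorted(kernels)
    gaps = []
    busy_until = kernels[0][1] if kernels else 0
    for start, end in kernels[1:]:
        if start > busy_until:
            gaps.append((start - busy_until) / 1000.0)
        busy_until = max(busy_until, end)
    return gaps


def bucket_gaps(gaps_us):
    """Count gaps per bucket label."""
    counts = Counter()
    for gap in gaps_us:
        for low, high, label in BUCKETS:
            if low <= gap < high:
                counts[label] += 1
                break
    return dict(counts)


# Hypothetical kernels: a 2 us gap, an overlap, a 100 us gap, and a 760 us gap.
kernels = [(0, 10_000), (12_000, 20_000), (15_000, 30_000),
           (130_000, 140_000), (900_000, 910_000)]
gaps = inter_kernel_gaps(kernels)
print(gaps, bucket_gaps(gaps))
```

Tracking `busy_until` instead of the previous kernel's end avoids reporting a gap while another kernel is still running.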
Kernel-Level Findings to Optimization Patterns
| Finding | Optimization Pattern |
|---|---|
| Large gaps from Python tensor view chains | CUSTOM_OP — replace with C++ custom op |
| Graph-capturable kernels running eagerly | GRAPH_EXPAND — fix partition poisoning |
| Monolithic custom op blocking graph capture | GRAPH_SPLIT — split into capturable + eager parts |
| Host-device sync () in per-layer code | SYNC (Pattern 1: pre-compute on CPU) + HOIST (Variant B: pass from step level) |
| Per-layer buffer allocation | ALLOC — pre-allocate at init |
| Straggler rank with extra host work | Apply targeted optimization to coordinator-only code paths |
Common Patterns and Root Causes
Pattern 1: Request Management Refactor
Symptom: The request-fetching operation regressed 5-10x, and a new broadcast operation appears.
Cause: Request fetching refactored for multi-rank broadcasting in TP.
Mitigation: Optimize broadcast path; batch request state updates.
Pattern 2: Increased Kernel Launch Count
Symptom: 3-5x more kernel launch calls per step, similar GPU time.
Cause: Operations that were fused or graph-captured are now individual launches.
Mitigation: Re-fuse kernels; extend CUDA graph capture scope.
Pattern 3: New Bookkeeping Operations
Symptom: New NVTX ranges like handle_additional_outputs.
Cause: New features added to the inference loop without overhead budgeting.
Mitigation: Defer non-critical bookkeeping to async paths; batch updates.
Pattern 4: Flashinfer JIT Warmup Masquerading as Inference
Symptom: Massive elementwise/reduce kernel counts in "steady state" analysis.
Cause: Analysis window includes flashinfer JIT compilation phase.
Fix: Use allreduce-based iteration isolation, not kernel density or time windows.
Pattern 5: Context-Only Bottleneck (Masked by Aggregate)
Symptom: Aggregate metrics below threshold, but context iterations have 50% GPU idle.
Cause: Generation iterations dilute the context-phase bottleneck.
Fix: Per-phase analysis in Detection phase catches this.
Pitfalls
1. shortName is an Integer ID
In CUPTI_ACTIVITY_KIND_KERNEL, shortName is an integer referencing StringIds. Always join. See references/nsys-schema.md.
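A minimal sketch of the join, run against an in-memory stand-in for an exported trace so it is self-contained. The column subset and sample rows are made up for the demo; `StringIds(id, value)` and the `start`/`end` columns follow the nsys SQLite export. The `end` column is quoted defensively because `END` is an SQL keyword.

```python
import sqlite3

KERNEL_NAME_QUERY = """
SELECT s.value AS kernel_name,
       COUNT(*) AS launches,
       SUM(k."end" - k.start) AS total_ns
FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
JOIN StringIds AS s ON s.id = k.shortName
GROUP BY s.value
ORDER BY total_ns DESC
"""


def kernel_summary(conn):
    """Per-kernel launch count and total GPU time, with names resolved."""
    return conn.execute(KERNEL_NAME_QUERY).fetchall()


# In-memory stand-in for sqlite3.connect("/path/to/trace.sqlite").
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (start INTEGER, "end" INTEGER, shortName INTEGER);
INSERT INTO StringIds VALUES (1, 'gemm_kernel'), (2, 'allreduce_kernel');
INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (0, 100, 1), (200, 250, 2), (300, 420, 1);
""")
summary = kernel_summary(conn)
print(summary)
```

Filtering on `shortName` with a string literal silently matches nothing, which is why the join has to come first.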
2. NVTX textId vs text
Most NVTX events have textId (integer) but NULL text. Join with StringIds. See references/nsys-schema.md.
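One way to handle both cases is a LEFT JOIN with COALESCE, so events that carry inline text keep it and the rest are resolved through StringIds. The sketch below uses an in-memory stand-in with a reduced, made-up set of rows; the `NVTX_EVENTS` table name and its `text`/`textId` columns are assumed from the nsys SQLite export.

```python
import sqlite3

NVTX_TEXT_QUERY = """
SELECT n.start, n."end", COALESCE(n.text, s.value) AS range_name
FROM NVTX_EVENTS AS n
LEFT JOIN StringIds AS s ON s.id = n.textId
WHERE n."end" IS NOT NULL
ORDER BY n.start
"""


def nvtx_ranges(conn):
    """Closed NVTX ranges with their names resolved to text."""
    return conn.execute(NVTX_TEXT_QUERY).fetchall()


# In-memory stand-in for an exported trace.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE NVTX_EVENTS (start INTEGER, "end" INTEGER, text TEXT, textId INTEGER);
INSERT INTO StringIds VALUES (7, '_forward_step');
INSERT INTO NVTX_EVENTS VALUES
    (0, 50, NULL, 7),               -- name stored via textId
    (60, 90, 'inline_range', NULL), -- name stored inline
    (100, NULL, NULL, 7);           -- no end timestamp, skipped
""")
ranges = nvtx_ranges(conn)
print(ranges)
```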
3. Duplicate NVTX Ranges from TP Ranks
In TP configurations, each rank reports NVTX ranges independently. De-duplicate by grouping entries within 100us of each other.
4. Negative Inter-Step Gaps
When TP ranks report overlapping NVTX ranges, gap = next_start - prev_end can be negative. Use the maximum end time when de-duplicating.
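Pitfalls 3 and 4 can be handled together, as in this sketch: ranges whose start times fall within 100 us of each other are merged into one entry that keeps the maximum end time, and gaps are computed on the merged list. Function names and the two-rank example are hypothetical; timestamps are in nanoseconds, so the 100 us tolerance is 100_000.

```python
def dedupe_ranges(ranges, tolerance_ns=100_000):
    """Merge per-rank duplicates of the same NVTX range.

    ranges: (start_ns, end_ns) tuples for one range name across all ranks.
    A range joins the current group when its start is within tolerance_ns of
    the group's first start; the group keeps the earliest start and max end.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][0] <= tolerance_ns:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def inter_step_gaps(ranges):
    """gap = next_start - prev_end for consecutive ranges."""
    return [nxt[0] - prev[1] for prev, nxt in zip(ranges, ranges[1:])]


# Hypothetical trace: two ranks report the same two forward steps, 40 us apart.
raw = [(0, 3_000_000), (3_500_000, 6_400_000),        # rank 0
       (40_000, 3_050_000), (3_540_000, 6_450_000)]   # rank 1
naive_gaps = inter_step_gaps(sorted(raw))
steps = dedupe_ranges(raw)
gaps = inter_step_gaps(steps)
print(naive_gaps, steps, gaps)
```

Without de-duplication the interleaved per-rank entries yield large negative gaps; after merging, the single remaining gap is 450 us.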
5. Benchmark Window Selection
The allreduce-based window captures context+generation phases; steady-state NVTX filtering captures generation-only. Both are valid; use the appropriate one for your comparison goal.
Handoff to Optimization
When analysis is complete and the verdict is YES, hand off to the optimization skill with:
- Detection verdict and evidence: Which metrics crossed thresholds (M1-M5), whether host prep was confirmed (M3b+M3c), and per-phase breakdown.
- NVTX-based triage (from Root Cause): Top regressing operations by absolute delta (us/step). Map NVTX range names to source functions -- see references/trtllm-nvtx-ranges.md.
- Handoff data block: Include structured data from references/output-format.md (see "Handoff to Optimization" section).
- Kernel-level findings (from drill-down, if performed): Inter-kernel gap distribution, graph coverage ratio, per-group overhead map, and straggler rank identification. Map findings to optimization patterns using the table in the Root Cause kernel-level drill-down section above.
Reference
| File | Contents |
|---|---|
| references/metrics.md | Full metric definitions, formulas, SQL queries, M3 sub-metric analysis |
| references/thresholds.md | Aggregate and per-phase threshold tables |
| references/phase-classification.md | NVTX marker parsing, iteration classification, per-phase aggregation |
| references/output-format.md | Report template and integration JSON schema |
| references/examples.md | Worked scenarios (aggregate, phase-specific, and case study) |
| references/iteration-isolation-techniques.md | Allreduce, NVTX, and kernel-density iteration isolation techniques |
| references/trtllm-nvtx-ranges.md | TRT-LLM NVTX range reference with per-operation timings |
| references/kernel-level-analysis.md | Kernel-level drill-down techniques: gap analysis, graph classification, pattern mapping, straggler detection |
| references/nsys-schema.md | nsys SQLite schema reference and useful queries |
| scripts/analyze_host_overhead.py | Python script for Phase 2 root cause analysis |
| scripts/detect_host_overhead.py | Python script for Phase 1 detection verdict |