AI & Machine Learning · nvidia/skills
perf-host-analysis
Analyze host/CPU overhead in TensorRT-LLM inference from nsys traces. Detect whether host overhead is the bottleneck using GPU idle ratio, host prep exposed ratio, and per-phase evidence. For regressions, isolate forward steps via allreduce/NVTX patterns, compare host operation breakdowns across versions, and identify scheduling or request-management overhead. Supports optional drill-downs: inter-kernel gap, eager-vs-graph, pattern mapping, and multi-rank straggler. Use standalone or within perf-analysis.
Triggers: host overhead, inter-step gap, scheduling overhead, forward step isolation, nsys iteration analysis, NVTX breakdown, request management overhead, GPU idle, host bottleneck, host prep exposed, inter-kernel gap, bubble analysis, graph coverage, eager kernel, rank imbalance, straggler detection.
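As a rough illustration of the GPU idle ratio and inter-kernel gap signals named above, here is a minimal Python sketch. It assumes kernel intervals have already been extracted from an nsys trace as `(start_ns, end_ns)` pairs, and it assumes one plausible definition of the idle ratio: the fraction of the observed window in which no kernel is running. The skill's actual definition and extraction method are not stated here, and the function names `merge_intervals`, `gpu_idle_ratio`, and `inter_kernel_gaps` are hypothetical.

```python
from typing import List, Tuple

# Assumption: each kernel is a (start_ns, end_ns) pair taken from an nsys export.
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into disjoint busy spans."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous span (e.g. concurrent streams): extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def gpu_idle_ratio(kernels: List[Interval]) -> float:
    """Assumed definition: 1 - (busy time / window from first start to last end)."""
    if not kernels:
        raise ValueError("no kernel intervals given")
    busy = merge_intervals(kernels)
    window_ns = busy[-1][1] - busy[0][0]
    if window_ns <= 0:
        raise ValueError("kernel intervals span zero time")
    busy_ns = sum(end - start for start, end in busy)
    return 1.0 - busy_ns / window_ns


def inter_kernel_gaps(kernels: List[Interval]) -> List[int]:
    """Gaps (ns) between consecutive busy spans, in timeline order."""
    busy = merge_intervals(kernels)
    return [nxt[0] - cur[1] for cur, nxt in zip(busy, busy[1:])]


if __name__ == "__main__":
    # Toy intervals, not real trace data.
    toy = [(0, 10), (5, 20), (30, 40), (70, 100)]
    print(gpu_idle_ratio(toy), inter_kernel_gaps(toy))
```

In the toy input the GPU is busy for 60 of 100 ns, so the ratio comes out at 0.4 and the two gaps are 10 ns and 30 ns. A high ratio by itself only shows that the GPU waited; attributing the wait to host work still needs the per-phase and NVTX evidence the description refers to.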