perf-host-optimization
Host Performance Optimization Skill
Automates detection and optimization of host-side (CPU) overhead in TensorRT-LLM's PyTorch backend.
When to Use
- GPU utilization is low during inference (CPU bottleneck suspected)
- User asks to reduce host overhead or CPU latency
- Optimizing PyExecutor throughput (requests/sec)
- Need line-by-line profiling of specific Python functions
Confirming the Bottleneck
line_profiler measures where CPU time is spent but not whether CPU is the bottleneck.
If you need to confirm CPU is the limiting factor, run the `perf-host-analysis` skill first; it provides a YES/NO verdict with metric evidence.
As a rough heuristic without nsys: if doubling the batch size does not proportionally increase GPU utilization or throughput, CPU overhead is likely the bottleneck.
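The batch-doubling heuristic can be written down as a small check. This is a minimal sketch and not part of the skill: the function name `looks_cpu_bound` and the 1.5x cutoff are assumptions chosen for illustration, and the throughput numbers are made up.

```python
def looks_cpu_bound(throughput_base: float, throughput_doubled: float,
                    min_scaling: float = 1.5) -> bool:
    """Return True when doubling the batch size scaled throughput poorly.

    min_scaling is a hypothetical cutoff: proportional scaling would give 2.0,
    and a ratio well below that hints at host-side overhead.
    """
    if throughput_base <= 0:
        raise ValueError("throughput_base must be positive")
    scaling = throughput_doubled / throughput_base
    return scaling < min_scaling


# 100 -> 110 req/s after doubling the batch size (1.1x scaling)
print(looks_cpu_bound(100.0, 110.0))  # True
# 100 -> 190 req/s (1.9x scaling), close to proportional
print(looks_cpu_bound(100.0, 190.0))  # False
```

A positive result is only a hint; the analysis skill's verdict remains the stronger evidence.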
Using Analysis Skill Results
If the `perf-host-analysis` skill has already been run, use its output to skip the confirmation step and prioritize targets:
- Detection verdict: If YES with host_prep_confirmed, start with `_prepare_tp_inputs`.
- NVTX triage (from Root Cause): The `top_regressing_ops` in the handoff data block maps NVTX range names to source functions. Profile the function with the largest absolute delta first.
- Cross-function triage: When the top NVTX regression is NOT in `_prepare_tp_inputs` (e.g., `_fetch_new_requests`, `broadcast_requests`, `_update_requests`), target that function's source file directly instead of defaulting to `_prepare_tp_inputs`. See references/trtllm-nvtx-ranges.md for the NVTX-to-source mapping.
Profiling Setup
line_profiler (Primary Method)
Environment Variables:
- `TLLM_LINE_PROFILER_ENABLED=True`: Enable the profiler
- `TLLM_LINE_PROFILER_PATH`: Output file path
- `TLLM_LINE_PROFILER_FUNCTIONS`: Additional functions to profile (comma-separated)

Function specification format:
```bash
# Class methods: module.path.ClassName.method_name
TLLM_LINE_PROFILER_FUNCTIONS="tensorrt_llm._torch.pyexecutor.model_engine.PyTorchModelEngine._prepare_tp_inputs"

# Standalone functions: module.path::function_name
TLLM_LINE_PROFILER_FUNCTIONS="tensorrt_llm._torch.pyexecutor.sampler::_group_requests_by_strategy_key"

# Multiple functions (comma-separated)
TLLM_LINE_PROFILER_FUNCTIONS="module.Class.method1,module.Class.method2"
```
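To make the two specification formats concrete, here is a minimal sketch of how such a string could be resolved to Python callables. It is an illustration only, not TensorRT-LLM's actual parser: `resolve_profile_target` and `resolve_profile_targets` are hypothetical names, and standard-library targets stand in for the real modules so the example runs anywhere.

```python
import importlib


def resolve_profile_target(spec: str):
    """Resolve one entry of the comma-separated list to a callable.

    Assumed formats (taken from the text above):
      module.path::function_name          -> standalone function
      module.path.ClassName.method_name   -> class method
    """
    if "::" in spec:
        module_path, func_name = spec.split("::", 1)
        return getattr(importlib.import_module(module_path), func_name)
    # Class method: the last two dotted parts are ClassName and method_name.
    module_path, class_name, method_name = spec.rsplit(".", 2)
    cls = getattr(importlib.import_module(module_path), class_name)
    return getattr(cls, method_name)


def resolve_profile_targets(value: str):
    """Split a comma-separated value and resolve each non-empty entry."""
    return [resolve_profile_target(s.strip()) for s in value.split(",") if s.strip()]


# Standard-library stand-ins, so no TensorRT-LLM install is needed.
targets = resolve_profile_targets("json::dumps,collections.OrderedDict.move_to_end")
print([t.__name__ for t in targets])  # ['dumps', 'move_to_end']
```

The `::` separator matters because a dotted path alone cannot distinguish `module.function` from `Class.method`.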
CPU Affinity (Environment Factor)
CPU core affinity can significantly affect host overhead measurements,
especially on multi-socket systems (e.g., B300). Pinning processes to cores
near the GPU's NUMA node reduces cross-socket memory access latency.
- Check current affinity: `taskset -p <pid>` or `numactl --show`
- Pin to local NUMA node: `numactl --cpunodebind=<node> --membind=<node>`
- Impact: Up to 2x difference in host overhead on B300 systems
When comparing profiling results across runs, ensure CPU affinity is consistent.
Do not modify the affinity externally unless the user asks for it in order to examine its effect.
Document the affinity setting in each round's report if it varies.
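For recording the affinity in a round report, the current setting can also be read from inside Python. The sketch below is an assumption about how one might do that, not something the skill ships: `format_cpu_ranges` and `current_affinity` are hypothetical helpers, and `os.sched_getaffinity` is only available on Linux.

```python
import os


def format_cpu_ranges(cpus) -> str:
    """Collapse CPU ids into a compact range string, e.g. {0, 1, 2, 8} -> '0-2,8'."""
    ordered = sorted(cpus)
    if not ordered:
        return ""
    ranges = []
    start = prev = ordered[0]
    for cpu in ordered[1:]:
        if cpu != prev + 1:
            ranges.append((start, prev))  # close the current run
            start = cpu
        prev = cpu
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def current_affinity() -> str:
    """Return the CPUs this process may run on, or 'unavailable' off Linux."""
    if not hasattr(os, "sched_getaffinity"):
        return "unavailable"
    return format_cpu_ranges(os.sched_getaffinity(0))


print(format_cpu_ranges({0, 1, 2, 3, 8, 9, 10, 11}))  # 0-3,8-11
print("affinity:", current_affinity())
```

Logging this string next to each round's suffix makes it easy to spot runs that were measured under different pinning.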
Workspace & Suffix Management
Each profiling run should have a unique suffix to track progress across rounds:
```bash
EXTRA_SUFFIX=round0_baseline bash profile.sh
EXTRA_SUFFIX=round1_eliminate_redundant_iter bash profile.sh
```
Autonomous Optimization Loop
Before starting the loop, review references/optimization-strategy.md for strategic guidance on ordering (zero-risk-first), measurement traps, and overhead scoping.
Key insight: Optimizations are NOT independent. Fixing a 50ms bottleneck may reveal a 30ms bottleneck that was previously masked (hidden behind the larger one). Always re-profile after each significant change — the bottleneck landscape shifts.
Ordering principle: Within each round, prefer zero-risk optimizations (caching, pre-allocation, hoisting invariants) over medium/high-risk ones (graph partition changes, algorithm fusion). Zero-risk changes provide free gains and make subsequent profiling cleaner.
Run N rounds (default 3) of the following cycle:
```
FOR round = 1 to MAX_ROUNDS:
    1. PROFILE   (with Drill-Down)
    2. ANALYZE   (Multi-Option)
    3. OPTIMIZE  (Apply Change; prefer zero-risk first)
    4. TEST      (Unit Test Validation)
    5. VALIDATE  (Re-Profile; expect bottleneck landscape to shift)
    6. REPORT
END FOR → FINAL SUMMARY
```
Phase 1: PROFILE (with Drill-Down)
- Run workload with profiler enabled
- Parse output: identify functions with highest Total time and lines with highest % Time
- CRITICAL: Drill down into sub-functions that are not yet profiled (see below)
Drill-Down Profiling
The default profiler covers top-level executor functions but not all sub-functions. When a profiled function shows most time in a single sub-call, you must drill down.
When: A single line consumes >80% of a function's time calling an unprofiled sub-function:
```
Line #      Hits           Time      Per Hit   % Time  Line Contents
==============================================================
  2848      4100  59200000000.0   14439024.4     98.7  output = self.model_engine.forward(...)
```
How:
- Identify the sub-function's fully qualified path (e.g., `tensorrt_llm._torch.pyexecutor.model_engine.PyTorchModelEngine._prepare_tp_inputs`)
- Add it to `TLLM_LINE_PROFILER_FUNCTIONS`
- Re-profile to get line-level data inside it
- Now analyze the inner hotspots
For common drill-down targets, see references/hot-path-files.md.
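The 80% trigger can be checked mechanically against the profiler's text output. The sketch below is one possible way to do it under stated assumptions: the regular expression assumes the plain column layout shown above (numeric Time values, no scientific notation), `drill_down_candidates` is a hypothetical helper, and the second row of the sample is invented for contrast.

```python
import re

# Matches a timed row: line number, hits, time, per-hit time, percent, source.
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(.*)$")


def drill_down_candidates(report: str, threshold: float = 80.0):
    """Return (line_no, percent, source) for rows above the threshold percent."""
    found = []
    for line in report.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, separator, or a line that was never timed
        line_no, _hits, _time, _per_hit, percent, source = match.groups()
        if float(percent) > threshold:
            found.append((int(line_no), float(percent), source.strip()))
    return found


sample = """\
Line #      Hits           Time      Per Hit   % Time  Line Contents
==============================================================
  2848      4100  59200000000.0   14439024.4     98.7  output = self.model_engine.forward(...)
  2849      4100    800000000.0     195122.0      1.3  self._post(output)
"""
print(drill_down_candidates(sample))
```

Each returned row names a call whose target should be added to `TLLM_LINE_PROFILER_FUNCTIONS` if it is not profiled yet.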
Phase 2: ANALYZE (Multi-Option)
For the chosen hotspot:
- Identify the top hotspots by absolute time (not just %) within the target function
- Classify each hotspot by type. Summary table:
| Type | Indicators | Severity |
|---|---|---|
| HOST_SYNC | … in the per-layer forward path | Critical |
| SYNC | … in step-level code | Critical |
| CUSTOM_OP | Chain of Python tensor ops (view/slice/cast) before kernel launch | Critical |
| GRAPH_BREAK | Op that prevents CUDA graph capture of surrounding code (fix via GRAPH_EXPAND / GRAPH_SPLIT) | High |
| ALLOC | … inside loops | High |
| HOIST | Per-layer recomputation of step-invariant values | High |
| PYLOOP | … with high iteration counts | High |
| REDUNDANT_ITER | Multiple passes over the same collection | High |
| DEAD_WORK | Object construction whose results are always discarded | High |
| CONTAINER | Dict/set lookups in hot loops | Medium |
| FUNCALL | Repeated method/property calls | Medium |
| COMM | … in TP/PP paths | Medium |
| GIL | Lock/queue contention | Medium |
| SERIALIZE | … in request handling | Medium |
| GC | Periodic latency spikes, non-deterministic pauses (tail latency) | Low |
| COMPUTE | Actual computation (may not be optimizable) | Low |
For detailed classification with code examples, see references/hotspot-classification.md.
- Propose 2-4 optimization options in a table:
| Option | Description | Estimated Savings | Risk | Complexity |
|---|---|---|---|---|
| A | ... | ... | Low/Med/High | ... |
| B | ... | ... | ... | ... |
- Select the best option and explain reasoning (prefer high-savings + low-risk; follow zero-risk-first ordering from references/optimization-strategy.md)
For optimization patterns by type, see references/optimization-patterns.md (index), which links to the relevant sub-file for each hotspot type. For GPU-specific patterns (CUSTOM_OP, GRAPH_SPLIT, GRAPH_EXPAND), see references/patterns/gpu-graph.md.
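As an illustration of one hotspot type from the table, the sketch below shows a REDUNDANT_ITER pattern and a single-pass rewrite. It is a toy example under stated assumptions: the `Request` class and both `summarize_*` functions are hypothetical stand-ins and are not taken from the TensorRT-LLM source.

```python
from dataclasses import dataclass


@dataclass
class Request:  # hypothetical stand-in, not a TensorRT-LLM type
    request_id: int
    num_tokens: int
    is_context: bool


def summarize_multi_pass(requests):
    """Before: three separate passes over the same list."""
    ids = [r.request_id for r in requests]
    total_tokens = sum(r.num_tokens for r in requests)
    num_context = sum(1 for r in requests if r.is_context)
    return ids, total_tokens, num_context


def summarize_single_pass(requests):
    """After: one pass that gathers the same three results."""
    ids, total_tokens, num_context = [], 0, 0
    for r in requests:
        ids.append(r.request_id)
        total_tokens += r.num_tokens
        if r.is_context:
            num_context += 1
    return ids, total_tokens, num_context


batch = [Request(1, 128, True), Request(2, 1, False), Request(3, 1, False)]
assert summarize_multi_pass(batch) == summarize_single_pass(batch)
print(summarize_single_pass(batch))  # ([1, 2, 3], 130, 1)
```

Whether the single-pass version is actually faster in CPython depends on the loop body, so the rewrite still has to pass the re-profiling step before it counts as a win.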
Phase 3: OPTIMIZE (Apply Change)
- Apply the selected code change with Edit tool
- One optimization per round — keep changes minimal and targeted
- Record the exact change (file, line range, before/after) for potential rollback
Phase 4: TEST (Unit Test Validation)
Mandatory after each optimization. Find and run related UTs to verify correctness.

**Finding related tests:**
```bash
# Search by modified file name (-E enables | alternation)
grep -rlE "model_engine|PyTorchModelEngine" tests/unittest/_torch/executor/

# Search by modified function name
grep -rlE "_prepare_tp_inputs|prepare_inputs" tests/
```

**Running tests:**
```bash
# Run specific test file with stop-on-first-failure
pytest tests/unittest/_torch/executor/test_pytorch_model_engine.py -v -x --timeout=120

# Run specific test method
pytest tests/unittest/_torch/executor/test_pytorch_model_engine.py::PyTorchModelEngineTestCase::test_position_id_preparation -v -x
```

For the full UT-to-file mapping, see [references/hot-path-files.md](references/hot-path-files.md).

**If tests fail:**
1. Read the failure message
2. Rollback immediately (`git checkout -- <file>`)
3. Analyze why the optimization broke correctness
4. Try the next-best option from Phase 2

Phase 5: VALIDATE (Re-Profile)
- Re-run profiler with identical workload, using suffix `round<N>_<description>`
- Compare three things:
  - Did the target hotspot time decrease?
  - Did the overall function Total time decrease?
  - Did benchmark metrics (TPOT, throughput) improve?
If regression detected (function time increased or metrics worsened):
- The "optimization" may have triggered a CPython pitfall; see references/patterns/compound-pitfalls.md (CPython Pitfalls section)
- Rollback and try the next-best option from Phase 2
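The three-way comparison and the regression rule can be expressed as a small helper. This is a minimal sketch with assumed inputs: `validate_round` and the dictionary keys are hypothetical, the numbers are invented, and only throughput is used as the benchmark metric (TPOT would need the opposite comparison, since lower is better).

```python
def validate_round(before: dict, after: dict) -> dict:
    """Compare hotspot time, function total time, and throughput.

    Both dicts use hypothetical keys: 'hotspot_s' and 'function_total_s'
    (seconds, lower is better) and 'throughput' (higher is better).
    """
    checks = {
        "hotspot_improved": after["hotspot_s"] < before["hotspot_s"],
        "function_improved": after["function_total_s"] < before["function_total_s"],
        "throughput_improved": after["throughput"] > before["throughput"],
    }
    # Regression per the text: function time increased or metrics worsened.
    checks["regression"] = (
        after["function_total_s"] > before["function_total_s"]
        or after["throughput"] < before["throughput"]
    )
    return checks


before = {"hotspot_s": 0.89, "function_total_s": 1.234, "throughput": 100.0}
after = {"hotspot_s": 0.40, "function_total_s": 1.300, "throughput": 99.0}
print(validate_round(before, after))
```

In this invented case the hotspot got faster while the function and the benchmark got worse, which is exactly the situation that calls for a rollback.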
Phase 6: REPORT
Log for this round:
- Round number
- Hotspot location (file:line) and classification
- Optimization applied (with before/after code summary)
- Time delta: function Total time before → after
- Benchmark delta: TPOT, throughput before → after
Reading Profile Output
```
Timer unit: 1e-06 s
Total time: 1.234 s
File: /path/to/file.py
Function: my_function at line 100

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   100                                           def my_function(self):
   101       500     890000.0   1780.0     72.1      result = tensor.item()
   102       500     234567.0    469.1     19.0      return result
```
How to read effectively:
- Start with Total time for each function; this is the overall budget
- Sort lines mentally by absolute Time, not just % Time (3% of a 60s function = 1.8s)
- Check Hits count to understand iteration patterns:
  - Hits = 2 × expected count → `for x in range(1):` loop overhead (2 hits = enter + exit check)
  - Hits ≫ expected → the line is inside a nested loop
- Look for repeated patterns: if 10 lines each take 3% in a loop body, the loop itself costs 30%
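The arithmetic behind reading by absolute time is simple enough to sketch. The helpers below are hypothetical names introduced for illustration; the inputs are the figures from the sample report and the 3%-of-60s rule of thumb above.

```python
def line_seconds(time_units: float, timer_unit: float = 1e-6) -> float:
    """Convert a Time column value to seconds using the report's timer unit."""
    return time_units * timer_unit


def share_to_seconds(percent: float, total_seconds: float) -> float:
    """Convert a % Time value into absolute seconds for that function."""
    return percent / 100.0 * total_seconds


# Line 101 in the sample report: 890000.0 timer units at 1e-06 s each.
print(round(line_seconds(890000.0), 3))        # 0.89
# The rule of thumb above: 3% of a 60 s function.
print(round(share_to_seconds(3.0, 60.0), 3))   # 1.8
```

Converting every candidate line to seconds first makes hotspots comparable across functions with very different Total time.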
Stopping Criteria
Stop the optimization loop when:
- Iteration limit reached: Completed N rounds (default 3)
- No actionable hotspots: Top hotspots are pure GPU compute (COMPUTE type)
- Diminishing returns: < 5% improvement in last 2 rounds
- Risk threshold: Further optimizations require architectural changes (e.g., Cython, struct-of-arrays)
- Test failures: Cannot find an optimization that passes UTs
Primary success metric: Benchmark throughput (requests/sec or tokens/sec) as measured by the profiling script. line_profiler time reductions are leading indicators, but throughput is the ground truth — a function-level speedup that doesn't improve throughput is not a real win.
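Two of the stopping criteria (iteration limit and diminishing returns) are numeric and can be sketched as a check on throughput history. This is one possible reading, not the skill's implementation: `should_stop` is a hypothetical helper, and it assumes "< 5% improvement in last 2 rounds" means each of the last two rounds gained less than 5% over the round before it.

```python
def should_stop(throughputs, max_rounds: int = 3, min_gain: float = 0.05) -> bool:
    """Decide whether to stop, given throughput after each round.

    throughputs[0] is the baseline; throughputs[i] is the value after round i.
    Stops when the round limit is reached, or when each of the last two
    rounds improved throughput by less than min_gain (5% by default).
    """
    rounds_done = len(throughputs) - 1
    if rounds_done >= max_rounds:
        return True
    if rounds_done < 2:
        return False
    gains = [
        (throughputs[i] - throughputs[i - 1]) / throughputs[i - 1]
        for i in (rounds_done - 1, rounds_done)
    ]
    return all(g < min_gain for g in gains)


print(should_stop([100.0, 120.0]))                 # False: only one round done
print(should_stop([100.0, 102.0, 103.0]))          # True: two rounds under 5%
print(should_stop([100.0, 120.0, 140.0]))          # False: still improving
print(should_stop([100.0, 120.0, 140.0, 160.0]))   # True: round limit reached
```

The remaining criteria (no actionable hotspots, risk threshold, test failures) are judgment calls and are not modeled here.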
Final Summary Output
The final report should include:
- Rounds executed: Number of profile-optimize cycles completed
- Cumulative improvement: Total host time reduction (percentage and absolute)
- Benchmark metrics: Before/after comparison table (TPOT, throughput, ITL, E2EL)
- Optimizations applied: List of changes with file:line locations and classification
- Failed attempts: Any optimizations tried and reverted (with why)
- Remaining hotspots: Top bottlenecks that couldn't be optimized (with classification)
- Recommendations: Suggested follow-up for architectural changes if needed
For a concrete multi-round example, see references/examples.md.
Reference Files
| File | Contents |
|---|---|
| references/optimization-patterns.md | Pattern index — links to 6 sub-files: sync-alloc, loop-iteration, python-overhead, gpu-graph, system, compound-pitfalls |
| references/optimization-strategy.md | Zero-risk-first ordering, metric traps, three scopes of host overhead, pattern selection guide |
| references/hotspot-classification.md | Extended per-type indicators and code examples (including CUSTOM_OP, GRAPH_BREAK, HOST_SYNC) |
| references/communication-patterns.md | Communication overhead patterns (NCCL batching, barrier removal, async overlap, reduce_scatter) |
| references/hot-path-files.md | Key file tables, drill-down targets, UT mapping |
| references/examples.md | Usage examples and multi-round walkthrough |
| trtllm-nvtx-ranges.md | TRT-LLM NVTX range reference (from analysis skill) — maps range names to source functions |