perf-host-optimization

Host Performance Optimization Skill

Automates detection and optimization of host-side (CPU) overhead in TensorRT-LLM's PyTorch backend.

When to Use

- GPU utilization is low during inference (CPU bottleneck suspected)
- User asks to reduce host overhead or CPU latency
- Optimizing PyExecutor throughput (requests/sec)
- Need line-by-line profiling of specific Python functions

Confirming the Bottleneck

`line_profiler` measures where CPU time is spent, but not whether the CPU is the bottleneck. If you need to confirm that the CPU is the limiting factor, run the `perf-host-analysis` skill first; it provides a YES/NO verdict with metric evidence.

As a rough heuristic without nsys: if doubling the batch size does not proportionally increase GPU utilization or throughput, CPU overhead is likely the bottleneck.
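
The heuristic above can be turned into a small calculation. This is a minimal sketch under stated assumptions: the function names, the 1.5x cutoff, and the sample throughput numbers are hypothetical and not part of the skill.

```python
def scaling_ratio(throughput_small: float, throughput_large: float) -> float:
    """Return throughput at the doubled batch size divided by the baseline."""
    if throughput_small <= 0:
        raise ValueError("baseline throughput must be positive")
    return throughput_large / throughput_small


def cpu_bound_suspected(throughput_small: float, throughput_large: float,
                        min_ratio: float = 1.5) -> bool:
    """Flag a likely host bottleneck when doubling the batch scales poorly.

    min_ratio is a hypothetical cutoff; ideal scaling would approach 2.0.
    """
    return scaling_ratio(throughput_small, throughput_large) < min_ratio


# Hypothetical measurements in tokens/sec at batch size B and 2B.
print(cpu_bound_suspected(1000.0, 1150.0))  # ratio 1.15 -> True
print(cpu_bound_suspected(1000.0, 1900.0))  # ratio 1.90 -> False
```

A poor ratio is only a hint; the `perf-host-analysis` verdict remains the stronger evidence.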

Using Analysis Skill Results

If the `perf-host-analysis` skill has already been run, use its output to skip the confirmation step and prioritize targets:

- Detection verdict: If YES with `host_prep_confirmed`, start with `_prepare_tp_inputs`.
- NVTX triage (from Root Cause): The `top_regressing_ops` in the handoff data block maps NVTX range names to source functions. Profile the function with the largest absolute delta first.
- Cross-function triage: When the top NVTX regression is NOT in `_prepare_tp_inputs` (e.g., `_fetch_new_requests`, `broadcast_requests`, `_update_requests`), target that function's source file directly instead of defaulting to `_prepare_tp_inputs`. See references/trtllm-nvtx-ranges.md for the NVTX-to-source mapping.

Profiling Setup

line_profiler (Primary Method)

Environment Variables:

- `TLLM_LINE_PROFILER_ENABLED=True`: Enable the profiler
- `TLLM_LINE_PROFILER_PATH`: Output file path
- `TLLM_LINE_PROFILER_FUNCTIONS`: Additional functions to profile (comma-separated)

Function specification format:

```bash
# Class methods: module.path.ClassName.method_name
TLLM_LINE_PROFILER_FUNCTIONS="tensorrt_llm._torch.pyexecutor.model_engine.PyTorchModelEngine._prepare_tp_inputs"

# Standalone functions: module.path::function_name
TLLM_LINE_PROFILER_FUNCTIONS="tensorrt_llm._torch.pyexecutor.sampler::_group_requests_by_strategy_key"

# Multiple functions (comma-separated)
TLLM_LINE_PROFILER_FUNCTIONS="module.Class.method1,module.Class.method2"
```
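
To make the two spec shapes concrete, the sketch below splits a `TLLM_LINE_PROFILER_FUNCTIONS` value into its parts. It is illustrative only: `ProfileTarget` and `parse_targets` are hypothetical names, and this is not how TensorRT-LLM itself parses the variable.

```python
from typing import List, NamedTuple, Optional


class ProfileTarget(NamedTuple):
    module: str
    class_name: Optional[str]
    function: str


def parse_targets(spec: str) -> List[ProfileTarget]:
    """Split a comma-separated spec into (module, class, function) triples."""
    targets = []
    for entry in (part.strip() for part in spec.split(",")):
        if not entry:
            continue
        if "::" in entry:
            # Standalone function: module.path::function_name
            module, function = entry.split("::", 1)
            targets.append(ProfileTarget(module, None, function))
        else:
            # Class method: module.path.ClassName.method_name
            pieces = entry.rsplit(".", 2)
            if len(pieces) != 3:
                raise ValueError(f"expected module.Class.method, got {entry!r}")
            targets.append(ProfileTarget(*pieces))
    return targets


spec = (
    "tensorrt_llm._torch.pyexecutor.model_engine.PyTorchModelEngine._prepare_tp_inputs,"
    "tensorrt_llm._torch.pyexecutor.sampler::_group_requests_by_strategy_key"
)
for target in parse_targets(spec):
    print(target)
```

A check like this can catch a malformed entry before a long profiling run is started.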

CPU Affinity (Environment Factor)

CPU core affinity can significantly affect host overhead measurements, especially on multi-socket systems (e.g., B300). Pinning processes to cores near the GPU's NUMA node reduces cross-socket memory access latency.

- Check current affinity: `taskset -p <pid>` or `numactl --show`
- Pin to local NUMA node: `numactl --cpunodebind=<node> --membind=<node>`
- Impact: Up to 2x difference in host overhead on B300 systems

When comparing profiling results across runs, ensure CPU affinity is consistent. Do not modify the affinity externally unless the user asks for it in order to examine its effect. Document the affinity setting in each round's report if it varies.
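
One way to record the affinity for a round's report is to read it from inside the process. This is a minimal sketch, assuming a Linux host where `os.sched_getaffinity` is available; the helper names are hypothetical.

```python
import os


def current_affinity() -> list:
    """Return the sorted CPU ids this process may run on (Linux only)."""
    if not hasattr(os, "sched_getaffinity"):
        raise NotImplementedError("os.sched_getaffinity is unavailable on this platform")
    return sorted(os.sched_getaffinity(0))


def affinity_label(cpus: list) -> str:
    """Compress a sorted CPU list into ranges, e.g. [0, 1, 2, 8] -> '0-2,8'."""
    ranges = []
    start = prev = None
    for cpu in cpus:
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu
        else:
            ranges.append((start, prev))
            start = prev = cpu
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


if hasattr(os, "sched_getaffinity"):
    print("affinity:", affinity_label(current_affinity()))
```

Logging the same label in every round makes an accidental affinity change visible when results are compared.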

Workspace & Suffix Management

Each profiling run should have a unique suffix to track progress across rounds:

```bash
EXTRA_SUFFIX=round0_baseline bash profile.sh
EXTRA_SUFFIX=round1_eliminate_redundant_iter bash profile.sh
```

Autonomous Optimization Loop

Before starting the loop, review references/optimization-strategy.md for strategic guidance on ordering (zero-risk-first), measurement traps, and overhead scoping.

Key insight: Optimizations are NOT independent. Fixing a 50ms bottleneck may reveal a 30ms bottleneck that was previously masked (hidden behind the larger one). Always re-profile after each significant change — the bottleneck landscape shifts.

Ordering principle: Within each round, prefer zero-risk optimizations (caching, pre-allocation, hoisting invariants) over medium/high-risk ones (graph partition changes, algorithm fusion). Zero-risk changes provide free gains and make subsequent profiling cleaner.

Run N rounds (default 3) of the following cycle:

FOR round = 1 to MAX_ROUNDS:

  1. PROFILE (with Drill-Down)
  2. ANALYZE (Multi-Option)
  3. OPTIMIZE (Apply Change — prefer zero-risk first)
  4. TEST (Unit Test Validation)
  5. VALIDATE (Re-Profile — expect bottleneck landscape to shift)
  6. REPORT

END FOR → FINAL SUMMARY

Phase 1: PROFILE (with Drill-Down)

1. Run the workload with the profiler enabled
2. Parse the output: identify functions with the highest Total time and lines with the highest % Time
3. CRITICAL: Drill down into sub-functions that are not yet profiled (see below)

Drill-Down Profiling

The default profiler covers top-level executor functions but not all sub-functions. When a profiled function shows most time in a single sub-call, you must drill down.

When: A single line consumes >80% of a function's time calling an unprofiled sub-function:

Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
  2848      4100  59200000000.0  14439024.4   98.7      output = self.model_engine.forward(...)

How:

1. Identify the sub-function's fully qualified path (e.g., `tensorrt_llm._torch.pyexecutor.model_engine.PyTorchModelEngine._prepare_tp_inputs`)
2. Add it to `TLLM_LINE_PROFILER_FUNCTIONS`
3. Re-profile to get line-level data inside it
4. Analyze the inner hotspots

For common drill-down targets, see references/hot-path-files.md.
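
The >80% trigger can be checked mechanically. The sketch below scans line_profiler's text output for rows above a threshold whose source contains a call. The regex and the function name are assumptions based on the column layout shown above, and the sketch cannot tell whether the callee is already profiled.

```python
import re

# Matches a line_profiler data row: line no, hits, time, per-hit, % time, source.
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(.*)$"
)


def drill_down_candidates(report: str, threshold: float = 80.0):
    """Return (line_no, pct_time, source) for rows above the threshold
    whose source text contains a call."""
    found = []
    for raw in report.splitlines():
        match = ROW.match(raw)
        if not match:
            continue  # header, separator, or a line that was never executed
        line_no, _hits, _time, _per_hit, pct, source = match.groups()
        if float(pct) > threshold and "(" in source:
            found.append((int(line_no), float(pct), source.strip()))
    return found


sample = """\
Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
  2848      4100  59200000000.0  14439024.4   98.7      output = self.model_engine.forward(...)
"""
print(drill_down_candidates(sample))
```

Each candidate still needs a manual look to resolve the callee's qualified path before it is added to `TLLM_LINE_PROFILER_FUNCTIONS`.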

Phase 2: ANALYZE (Multi-Option)

For the chosen hotspot:

Identify the top hotspots by absolute time (not just %) within the target function
Classify each hotspot by type. Summary table:

| Type | Indicators | Severity |
| --- | --- | --- |
| HOST_SYNC | `.item()`, `.cpu()` in per-layer forward path | Critical |
| SYNC | `.item()`, `.cpu()`, `synchronize()` in step-level code | Critical |
| CUSTOM_OP | Chain of Python tensor ops (view/slice/cast) before kernel launch | Critical |
| GRAPH_BREAK | Op that prevents CUDA graph capture of surrounding code (fix via GRAPH_EXPAND / GRAPH_SPLIT) | High |
| ALLOC | `torch.zeros/empty/tensor()` in loops, `.clone()` | High |
| HOIST | Per-layer recomputation of step-invariant values | High |
| PYLOOP | `for x in collection:` with many iterations | High |
| REDUNDANT_ITER | Multiple passes over the same collection | High |
| DEAD_WORK | Object construction whose results are always discarded | High |
| CONTAINER | Dict/set lookups in hot loops | Medium |
| FUNCALL | Repeated method/property calls | Medium |
| COMM | `dist.all_reduce`, `dist.barrier`, NCCL overhead in TP/PP paths | Medium |
| GIL | Lock/queue contention | Medium |
| SERIALIZE | `pickle.dumps/loads`, `json.dumps/loads` in request processing | Medium |
| GC | Periodic latency spikes, non-deterministic pauses (tail latency) | Low |
| COMPUTE | Actual computation (may not be optimizable) | Low |

For detailed classification with code examples, see references/hotspot-classification.md.
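
As one illustration of a row in the table, the sketch below shows a REDUNDANT_ITER hotspot and a single-pass rewrite. The `Request` class and both functions are hypothetical stand-ins and are not taken from TensorRT-LLM.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:  # hypothetical stand-in for a scheduled request
    request_id: int
    num_tokens: int
    is_context: bool


def summarize_multi_pass(requests: List[Request]) -> Tuple[int, int, List[int]]:
    """REDUNDANT_ITER: three separate passes over the same collection."""
    context_tokens = sum(r.num_tokens for r in requests if r.is_context)
    generation_tokens = sum(r.num_tokens for r in requests if not r.is_context)
    ids = [r.request_id for r in requests]
    return context_tokens, generation_tokens, ids


def summarize_single_pass(requests: List[Request]) -> Tuple[int, int, List[int]]:
    """Same result from one pass, so each request is visited once."""
    context_tokens = 0
    generation_tokens = 0
    ids = []
    for r in requests:
        if r.is_context:
            context_tokens += r.num_tokens
        else:
            generation_tokens += r.num_tokens
        ids.append(r.request_id)
    return context_tokens, generation_tokens, ids


batch = [Request(0, 128, True), Request(1, 1, False), Request(2, 1, False)]
assert summarize_multi_pass(batch) == summarize_single_pass(batch)
print(summarize_single_pass(batch))  # (128, 2, [0, 1, 2])
```

Whether the single-pass form is actually faster in CPython depends on the collection size and the loop body, so it should be confirmed by re-profiling, not assumed.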

Propose 2-4 optimization options in a table:

| Option | Description | Estimated Savings | Risk | Complexity |
| --- | --- | --- | --- | --- |
| A | ... | ... | Low/Med/High | ... |
| B | ... | ... | ... | ... |

Select the best option and explain reasoning (prefer high-savings + low-risk; follow zero-risk-first ordering from references/optimization-strategy.md)

For optimization patterns by type, see references/optimization-patterns.md (index) — it links to the relevant sub-file for each hotspot type. For GPU-specific patterns (CUSTOM_OP, GRAPH_SPLIT, GRAPH_EXPAND), see references/patterns/gpu-graph.md.

Phase 3: OPTIMIZE (Apply Change)

- Apply the selected code change with the Edit tool
- One optimization per round: keep changes minimal and targeted
- Record the exact change (file, line range, before/after) for potential rollback

Phase 4: TEST (Unit Test Validation)

Mandatory after each optimization. Find and run the related unit tests (UTs) to verify correctness.

**Finding related tests:**
```bash
# Search by modified file name (-E enables the | alternation)
grep -rlE "model_engine|PyTorchModelEngine" tests/unittest/_torch/executor/

# Search by modified function name
grep -rlE "_prepare_tp_inputs|prepare_inputs" tests/
```

**Running tests:**
```bash
# Run specific test file with stop-on-first-failure
# (--timeout requires the pytest-timeout plugin)
pytest tests/unittest/_torch/executor/test_pytorch_model_engine.py -v -x --timeout=120

# Run specific test method
pytest tests/unittest/_torch/executor/test_pytorch_model_engine.py::PyTorchModelEngineTestCase::test_position_id_preparation -v -x
```

For the full UT-to-file mapping, see [references/hot-path-files.md](references/hot-path-files.md).

**If tests fail:**
1. Read the failure message
2. Roll back immediately (`git checkout -- <file>`)
3. Analyze why the optimization broke correctness
4. Try the next-best option from Phase 2

Phase 5: VALIDATE (Re-Profile)

Re-run the profiler with an identical workload, using the suffix `round<N>_<description>`. Compare three things:
1. Did the target hotspot time decrease?
2. Did the overall function Total time decrease?
3. Did benchmark metrics (TPOT, throughput) improve?

If a regression is detected (function time increased or metrics worsened):

- The "optimization" may have triggered a CPython pitfall; see references/patterns/compound-pitfalls.md (CPython Pitfalls section)
- Roll back and try the next-best option from Phase 2
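
The three comparisons and the regression rule can be sketched as a small check. `RoundMetrics`, its field names, and the sample numbers are hypothetical; the skill does not prescribe a data format.

```python
from dataclasses import dataclass


@dataclass
class RoundMetrics:  # hypothetical container; field names are illustrative
    hotspot_time_s: float    # time of the targeted line(s)
    function_time_s: float   # line_profiler "Total time" for the function
    throughput: float        # benchmark throughput, higher is better
    tpot_ms: float           # time per output token, lower is better


def validate(before: RoundMetrics, after: RoundMetrics) -> dict:
    """Answer the three comparison questions and flag a regression."""
    checks = {
        "hotspot_improved": after.hotspot_time_s < before.hotspot_time_s,
        "function_improved": after.function_time_s < before.function_time_s,
        "benchmark_improved": (after.throughput > before.throughput
                               and after.tpot_ms <= before.tpot_ms),
    }
    # Regression: function time increased or a benchmark metric worsened.
    checks["regression"] = (after.function_time_s > before.function_time_s
                            or after.throughput < before.throughput
                            or after.tpot_ms > before.tpot_ms)
    return checks


# Made-up numbers for illustration only.
before = RoundMetrics(0.89, 1.234, 100.0, 12.0)
after = RoundMetrics(0.40, 0.80, 104.0, 11.5)
print(validate(before, after))
```

A round where the hotspot improved but `regression` is true is the case the rollback rule is meant to catch.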

Phase 6: REPORT

Log for this round:

- Round number
- Hotspot location (file:line) and classification
- Optimization applied (with before/after code summary)
- Time delta: function Total time before → after
- Benchmark delta: TPOT, throughput before → after

Reading Profile Output

Timer unit: 1e-06 s
Total time: 1.234 s
File: /path/to/file.py
Function: my_function at line 100

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   100                                           def my_function(self):
   101       500    890000.0   1780.0     72.1       result = tensor.item()
   102       500    234567.0    469.1     19.0       return result

How to read effectively:

1. Start with Total time for each function; this is the overall budget
2. Sort lines mentally by absolute Time, not just % Time (3% of a 60s function = 1.8s)
3. Check the Hits count to understand iteration patterns:
   - Hits = 2 × expected count → `for x in range(1):` loop overhead (2 hits = enter + exit check)
   - Hits ≫ expected → the line is inside a nested loop
4. Look for repeated patterns: if 10 lines each take 3% in a loop body, the loop itself costs 30%
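
The point about absolute time versus % Time is easy to verify with arithmetic. In the sketch below the first row comes from the sample output above and the second row is a hypothetical 3% line in a 60 s function. The function name is an assumption.

```python
def absolute_seconds(total_time_s: float, pct_time: float) -> float:
    """Convert a '% Time' value into seconds for a function's Total time."""
    return total_time_s * pct_time / 100.0


# (label, function Total time in seconds, % Time of the line)
lines = [
    ("my_function:101", 1.234, 72.1),   # from the sample output above
    ("long_function:57", 60.0, 3.0),    # hypothetical 3% line in a 60 s function
]
ranked = sorted(lines, key=lambda row: absolute_seconds(row[1], row[2]), reverse=True)
for label, total, pct in ranked:
    print(f"{label}: {absolute_seconds(total, pct):.2f} s")
```

The 3% line ranks first at 1.80 s, ahead of the 72.1% line at 0.89 s, which is why ranking by percentage alone can mislead.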

Stopping Criteria

Stop the optimization loop when:

- Iteration limit reached: Completed N rounds (default 3)
- No actionable hotspots: Top hotspots are pure GPU compute (COMPUTE type)
- Diminishing returns: < 5% improvement in last 2 rounds
- Risk threshold: Further optimizations require architectural changes (e.g., Cython, struct-of-arrays)
- Test failures: Cannot find an optimization that passes UTs

Primary success metric: Benchmark throughput (requests/sec or tokens/sec) as measured by the profiling script. line_profiler time reductions are leading indicators, but throughput is the ground truth — a function-level speedup that doesn't improve throughput is not a real win.
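
The stopping criteria can be sketched as a single check that runs after each round. The function name and parameters are hypothetical. Reading "diminishing returns" as both of the last two rounds falling below 5% is an assumption of this sketch.

```python
from typing import List, Optional


def stop_reason(round_gains_pct: List[float],
                max_rounds: int = 3,
                top_hotspot_type: Optional[str] = None,
                needs_architectural_change: bool = False,
                no_passing_option: bool = False,
                min_gain_pct: float = 5.0) -> Optional[str]:
    """Return the first stopping criterion that applies, or None to continue.

    round_gains_pct holds the throughput improvement of each completed round.
    """
    if len(round_gains_pct) >= max_rounds:
        return "iteration limit reached"
    if top_hotspot_type == "COMPUTE":
        return "no actionable hotspots"
    if len(round_gains_pct) >= 2 and all(g < min_gain_pct for g in round_gains_pct[-2:]):
        return "diminishing returns"
    if needs_architectural_change:
        return "risk threshold"
    if no_passing_option:
        return "test failures"
    return None


print(stop_reason([12.0]))                          # None
print(stop_reason([12.0, 3.0, 2.0], max_rounds=5))  # diminishing returns
print(stop_reason([12.0, 8.0, 6.0]))                # iteration limit reached
```

Gains are expressed in throughput terms here because throughput is the primary success metric.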

Final Summary Output

The final report should include:

- Rounds executed: Number of profile-optimize cycles completed
- Cumulative improvement: Total host time reduction (percentage and absolute)
- Benchmark metrics: Before/after comparison table (TPOT, throughput, ITL, E2EL)
- Optimizations applied: List of changes with file:line locations and classification
- Failed attempts: Any optimizations tried and reverted (with why)
- Remaining hotspots: Top bottlenecks that couldn't be optimized (with classification)
- Recommendations: Suggested follow-up for architectural changes if needed

For a concrete multi-round example, see references/examples.md.

Reference Files

| File | Contents |
| --- | --- |
| references/optimization-patterns.md | Pattern index; links to 6 sub-files: sync-alloc, loop-iteration, python-overhead, gpu-graph, system, compound-pitfalls |
| references/optimization-strategy.md | Zero-risk-first ordering, metric traps, three scopes of host overhead, pattern selection guide |
| references/hotspot-classification.md | Extended per-type indicators and code examples (including CUSTOM_OP, GRAPH_BREAK, HOST_SYNC) |
| references/communication-patterns.md | Communication overhead patterns (NCCL batching, barrier removal, async overlap, reduce_scatter) |
| references/hot-path-files.md | Key file tables, drill-down targets, UT mapping |
| references/examples.md | Usage examples and multi-round walkthrough |
| trtllm-nvtx-ranges.md | TRT-LLM NVTX range reference (from analysis skill); maps range names to source functions |
