perf-host-optimization

Host Performance Optimization Skill

Automates detection and optimization of host-side (CPU) overhead in TensorRT-LLM's PyTorch backend.

When to Use

  • GPU utilization is low during inference (CPU bottleneck suspected)
  • User asks to reduce host overhead or CPU latency
  • Optimizing PyExecutor throughput (requests/sec)
  • Need line-by-line profiling of specific Python functions

Confirming the Bottleneck

line_profiler measures where CPU time is spent but not whether CPU is the bottleneck. If you need to confirm CPU is the limiting factor, run the perf-host-analysis skill first; it provides a YES/NO verdict with metric evidence.
As a rough heuristic without nsys: if doubling the batch size does not proportionally increase GPU utilization or throughput, CPU overhead is likely the bottleneck.
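The batch-doubling heuristic can be written down as a small check. This is a minimal sketch, not part of the skill: the function names and the 0.75 cutoff are assumptions chosen for illustration, and the throughput numbers come from whatever benchmark you already run.

```python
def scaling_efficiency(throughput_base: float, throughput_doubled: float) -> float:
    """Fraction of the ideal 2x gain that was realized (1.0 = perfect scaling)."""
    if throughput_base <= 0:
        raise ValueError("throughput_base must be positive")
    return (throughput_doubled / throughput_base) / 2.0


def cpu_bottleneck_suspected(throughput_base: float,
                             throughput_doubled: float,
                             threshold: float = 0.75) -> bool:
    # threshold is a hypothetical cutoff, not a value defined by the skill
    return scaling_efficiency(throughput_base, throughput_doubled) < threshold
```

A result near 0.5 means throughput barely moved when the batch doubled. Treat it as a hint only; the perf-host-analysis verdict remains the authoritative check.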

Using Analysis Skill Results

If the perf-host-analysis skill has already been run, use its output to skip the confirmation step and prioritize targets:
  1. Detection verdict: If YES with host_prep_confirmed, start with _prepare_tp_inputs.
  2. NVTX triage (from Root Cause): The top_regressing_ops field in the handoff data block maps NVTX range names to source functions. Profile the function with the largest absolute delta first.
  3. Cross-function triage: When the top NVTX regression is NOT in _prepare_tp_inputs (e.g., _fetch_new_requests, broadcast_requests, _update_requests), target that function's source file directly instead of defaulting to _prepare_tp_inputs. See references/trtllm-nvtx-ranges.md for the NVTX-to-source mapping.
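The triage rule above (largest absolute delta first, fall back to _prepare_tp_inputs) might be sketched as follows. The layout of the handoff data is not specified here, so the dictionary keys delta_ms and source_function are hypothetical placeholders.

```python
def pick_profile_target(top_regressing_ops, default="_prepare_tp_inputs"):
    """Return the source function to profile first.

    top_regressing_ops: list of dicts; the keys used below are assumed
    for illustration and may differ from the real handoff format.
    """
    if not top_regressing_ops:
        return default
    worst = max(top_regressing_ops, key=lambda op: abs(op["delta_ms"]))
    return worst["source_function"]
```

Using the absolute value means a large change in either direction is surfaced, which matches "largest absolute delta" in step 2.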

Profiling Setup

line_profiler (Primary Method)

Environment Variables:
  • TLLM_LINE_PROFILER_ENABLED=True: enable the profiler
  • TLLM_LINE_PROFILER_PATH: output file path
  • TLLM_LINE_PROFILER_FUNCTIONS: additional functions to profile (comma-separated)
Function specification format:
bash
# Class methods: module.path.ClassName.method_name
TLLM_LINE_PROFILER_FUNCTIONS="tensorrt_llm._torch.pyexecutor.model_engine.PyTorchModelEngine._prepare_tp_inputs"
# Standalone functions: module.path::function_name
TLLM_LINE_PROFILER_FUNCTIONS="tensorrt_llm._torch.pyexecutor.sampler::_group_requests_by_strategy_key"
# Multiple functions (comma-separated)
TLLM_LINE_PROFILER_FUNCTIONS="module.Class.method1,module.Class.method2"
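When the workload is launched from Python rather than a shell, the same three variables can be assembled programmatically. A minimal sketch: build_profiler_env is a hypothetical helper, and only the variable names and the comma-separated format come from the text above.

```python
import os


def build_profiler_env(functions, output_path, base_env=None):
    """Return a copy of the environment with the line profiler variables set."""
    for spec in functions:
        if not spec.strip() or "," in spec:
            raise ValueError(f"invalid function spec: {spec!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["TLLM_LINE_PROFILER_ENABLED"] = "True"
    env["TLLM_LINE_PROFILER_PATH"] = output_path
    # Class methods use dots throughout; standalone functions use '::'.
    env["TLLM_LINE_PROFILER_FUNCTIONS"] = ",".join(functions)
    return env
```

The returned dictionary can be passed as env= to subprocess.run when starting the profiling script.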

CPU Affinity (Environment Factor)

CPU core affinity can significantly affect host overhead measurements, especially on multi-socket systems (e.g., B300). Pinning processes to cores near the GPU's NUMA node reduces cross-socket memory access latency.
  • Check current affinity: taskset -p <pid> or numactl --show
  • Pin to local NUMA node: numactl --cpunodebind=<node> --membind=<node>
  • Impact: Up to 2x difference in host overhead on B300 systems
When comparing profiling results across runs, ensure CPU affinity is consistent. Do not modify the affinity externally unless the user asks for it in order to examine its effect. If the affinity setting varies between rounds, document it in each round's report.
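To record the affinity alongside each round's results, the process can read its own CPU set. This sketch relies on os.sched_getaffinity, which exists only on some platforms (notably Linux), so it falls back to None elsewhere; the helper names and the label format are assumptions.

```python
import os


def current_affinity():
    """Return the sorted CPU ids this process may run on, or None if unsupported."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return None


def affinity_label(cpus):
    """Format the CPU set as a short string suitable for a round report."""
    if cpus is None:
        return "affinity=unknown"
    return "affinity=" + ",".join(str(c) for c in cpus)
```

This only reads the setting. It does not change it, in line with the rule against modifying affinity unless asked.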

Workspace & Suffix Management

Each profiling run should have a unique suffix to track progress across rounds:
bash
EXTRA_SUFFIX=round0_baseline bash profile.sh
EXTRA_SUFFIX=round1_eliminate_redundant_iter bash profile.sh

Autonomous Optimization Loop

Before starting the loop, review references/optimization-strategy.md for strategic guidance on ordering (zero-risk-first), measurement traps, and overhead scoping.
Key insight: Optimizations are NOT independent. Fixing a 50ms bottleneck may reveal a 30ms bottleneck that the larger one previously masked. Always re-profile after each significant change, because the bottleneck landscape shifts.
Ordering principle: Within each round, prefer zero-risk optimizations (caching, pre-allocation, hoisting invariants) over medium/high-risk ones (graph partition changes, algorithm fusion). Zero-risk changes provide free gains and make subsequent profiling cleaner.
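As an illustration of a zero-risk change, hoisting an invariant out of a loop keeps the result identical while removing repeated work. The functions below are a toy example in plain Python, not code from TensorRT-LLM.

```python
import math


def scale_before(values, layers):
    out = []
    for _ in range(layers):
        for v in values:
            # math.sqrt(len(values)) never changes, yet it is recomputed each time
            out.append(v * math.sqrt(len(values)))
    return out


def scale_after(values, layers):
    factor = math.sqrt(len(values))  # hoisted: computed once per call
    out = []
    for _ in range(layers):
        out.extend(v * factor for v in values)
    return out
```

Both versions produce the same list, which is what makes the change low risk: correctness is easy to verify by direct comparison.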
Run N rounds (default 3) of the following cycle:
FOR round = 1 to MAX_ROUNDS:

  1. PROFILE (with Drill-Down)
  2. ANALYZE (Multi-Option)
  3. OPTIMIZE (Apply Change — prefer zero-risk first)
  4. TEST (Unit Test Validation)
  5. VALIDATE (Re-Profile — expect bottleneck landscape to shift)
  6. REPORT

END FOR → FINAL SUMMARY
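The cycle above could be driven by a loop like the following. It is a sketch under stated assumptions: the five callables are hypothetical hooks supplied by the caller, and profile() is assumed to return the target function's Total time in seconds.

```python
def run_optimization_loop(profile, analyze, optimize, run_tests, rollback,
                          max_rounds=3):
    """Drive PROFILE -> ANALYZE -> OPTIMIZE -> TEST -> VALIDATE -> REPORT."""
    reports = []
    for round_no in range(1, max_rounds + 1):
        before = profile()                      # 1. PROFILE
        option = analyze(before)                # 2. ANALYZE
        if option is None:                      # nothing actionable left
            break
        optimize(option)                        # 3. OPTIMIZE (one change per round)
        passed = run_tests()                    # 4. TEST
        after = profile() if passed else None   # 5. VALIDATE (re-profile)
        kept = passed and after < before
        if not kept:
            rollback(option)                    # failed tests or no speedup
        reports.append({"round": round_no, "option": option,
                        "before_s": before, "after_s": after,
                        "kept": kept})          # 6. REPORT
    return reports
```

Each round re-profiles before analyzing, so a bottleneck exposed by the previous round's change is picked up automatically.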

Phase 1: PROFILE (with Drill-Down)

  • Run workload with profiler enabled
  • Parse output: identify functions with highest Total time and lines with highest % Time
  • CRITICAL: Drill down into sub-functions that are not yet profiled (see below)

Drill-Down Profiling

The default profiler covers top-level executor functions but not all sub-functions. When a profiled function spends most of its time in a single sub-call, you must drill down.
When: A single line that calls an unprofiled sub-function consumes >80% of a function's time:
Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
  2848      4100  59200000000.0  14439024.4   98.7      output = self.model_engine.forward(...)
How:
  1. Identify the sub-function's fully qualified path (e.g., tensorrt_llm._torch.pyexecutor.model_engine.PyTorchModelEngine._prepare_tp_inputs)
  2. Add it to TLLM_LINE_PROFILER_FUNCTIONS
  3. Re-profile to get line-level data inside it
  4. Analyze the inner hotspots
For common drill-down targets, see references/hot-path-files.md.
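The >80% trigger can be checked mechanically against line_profiler's text output. This is a rough sketch: the regular expression assumes the six-column row layout shown in the sample above, and drill_down_candidates is a hypothetical helper name.

```python
import re

# One data row: line number, hits, time, per-hit, % time, source text.
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+(.*)$"
)


def drill_down_candidates(profile_text, threshold=80.0):
    """Return (line_no, percent, source) for rows above the % Time threshold."""
    candidates = []
    for line in profile_text.splitlines():
        match = ROW.match(line)
        if match and float(match.group(5)) > threshold:
            candidates.append(
                (int(match.group(1)), float(match.group(5)), match.group(6).strip())
            )
    return candidates
```

Rows without timing data, such as the def line or the header, do not match the pattern and are skipped. Whether a flagged line actually calls an unprofiled sub-function still has to be judged by reading the source.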

Phase 2: ANALYZE (Multi-Option)

For the chosen hotspot:
  1. Identify the top hotspots by absolute time (not just %) within the target function
  2. Classify each hotspot by type. Summary table:

| Type | Indicators | Severity |
| --- | --- | --- |
| HOST_SYNC | .item(), .cpu() in per-layer forward path | Critical |
| SYNC | .item(), .cpu(), synchronize() in step-level code | Critical |
| CUSTOM_OP | Chain of Python tensor ops (view/slice/cast) before kernel launch | Critical |
| GRAPH_BREAK | Op that prevents CUDA graph capture of surrounding code (fix via GRAPH_EXPAND / GRAPH_SPLIT) | High |
| ALLOC | torch.zeros/empty/tensor() in loops, .clone() | High |
| HOIST | Per-layer recomputation of step-invariant values | High |
| PYLOOP | for x in collection: with many iterations | High |
| REDUNDANT_ITER | Multiple passes over the same collection | High |
| DEAD_WORK | Object construction whose results are always discarded | High |
| CONTAINER | Dict/set lookups in hot loops | Medium |
| FUNCALL | Repeated method/property calls | Medium |
| COMM | dist.all_reduce, dist.barrier, NCCL overhead in TP/PP paths | Medium |
| GIL | Lock/queue contention | Medium |
| SERIALIZE | pickle.dumps/loads, json.dumps/loads in request processing | Medium |
| GC | Periodic latency spikes, non-deterministic pauses (tail latency) | Low |
| COMPUTE | Actual computation (may not be optimizable) | Low |

For detailed classification with code examples, see references/hotspot-classification.md.
  3. Propose 2-4 optimization options in a table:

| Option | Description | Estimated Savings | Risk | Complexity |
| --- | --- | --- | --- | --- |
| A | ... | ... | Low/Med/High | ... |
| B | ... | ... | ... | ... |

  4. Select the best option and explain the reasoning (prefer high-savings + low-risk; follow zero-risk-first ordering from references/optimization-strategy.md)
For optimization patterns by type, see references/optimization-patterns.md (index), which links to the relevant sub-file for each hotspot type. For GPU-specific patterns (CUSTOM_OP, GRAPH_SPLIT, GRAPH_EXPAND), see references/patterns/gpu-graph.md.
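One possible reading of the selection rule is a lexicographic ordering: lowest risk first, then highest estimated savings. The sketch below assumes each option is a dictionary with hypothetical keys name, risk, and savings_ms; real selection should also weigh complexity and the reasoning recorded in the report.

```python
RISK_ORDER = {"Low": 0, "Med": 1, "High": 2}


def select_option(options):
    """Pick the lowest-risk option, breaking ties by larger estimated savings."""
    return min(options, key=lambda o: (RISK_ORDER[o["risk"]], -o["savings_ms"]))
```

With this ordering a high-risk option is only chosen when no lower-risk option is on the table, which is stricter than a weighted score would be.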

Phase 3: OPTIMIZE (Apply Change)

  • Apply the selected code change with the Edit tool
  • One optimization per round; keep changes minimal and targeted
  • Record the exact change (file, line range, before/after) for potential rollback

Phase 4: TEST (Unit Test Validation)

Mandatory after each optimization. Find and run related unit tests (UTs) to verify correctness.
Finding related tests:
```bash
# Search by modified file name (-E enables | alternation)
grep -rlE "model_engine|PyTorchModelEngine" tests/unittest/_torch/executor/

# Search by modified function name
grep -rlE "_prepare_tp_inputs|prepare_inputs" tests/
```
Running tests:
```bash
# Run specific test file with stop-on-first-failure (--timeout needs the pytest-timeout plugin)
pytest tests/unittest/_torch/executor/test_pytorch_model_engine.py -v -x --timeout=120

# Run specific test method
pytest tests/unittest/_torch/executor/test_pytorch_model_engine.py::PyTorchModelEngineTestCase::test_position_id_preparation -v -x
```
For the full UT-to-file mapping, see references/hot-path-files.md.
If tests fail:
  1. Read the failure message
  2. Roll back immediately (git checkout -- <file>)
  3. Analyze why the optimization broke correctness
  4. Try the next-best option from Phase 2
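Where grep is unavailable, the same search for related tests can be approximated in Python. A minimal sketch with a hypothetical helper name; unlike grep -rl it only looks at .py files, which is an assumption that test files are Python sources.

```python
import re
from pathlib import Path


def find_related_tests(root, names):
    """Return .py files under root that mention any of the given names."""
    pattern = re.compile("|".join(re.escape(name) for name in names))
    matches = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip rather than abort the search
        if pattern.search(text):
            matches.append(path)
    return matches
```

For example, find_related_tests("tests", ["_prepare_tp_inputs", "prepare_inputs"]) returns the candidate files to pass to pytest.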

Phase 5: VALIDATE (Re-Profile)

  • Re-run the profiler with an identical workload, using suffix round<N>_<description>
  • Compare three things:
    1. Did the target hotspot time decrease?
    2. Did the overall function Total time decrease?
    3. Did benchmark metrics (TPOT, throughput) improve?
If a regression is detected (function time increased or metrics worsened):
  • The "optimization" may have triggered a CPython pitfall; see references/patterns/compound-pitfalls.md (CPython Pitfalls section)
  • Roll back and try the next-best option from Phase 2
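The three comparisons and the regression rule can be captured in one function. This is a sketch: the dictionary keys (hotspot_s, function_total_s, tpot_ms, throughput) are hypothetical names for the measurements you collect before and after the change.

```python
def validate_round(before, after):
    """Return (checks, regression) for one optimization round."""
    checks = {
        "hotspot_time_decreased": after["hotspot_s"] < before["hotspot_s"],
        "function_total_decreased": after["function_total_s"] < before["function_total_s"],
        # Lower TPOT and higher throughput both count as improvements.
        "benchmark_improved": (after["tpot_ms"] < before["tpot_ms"]
                               and after["throughput"] > before["throughput"]),
    }
    regression = (after["function_total_s"] > before["function_total_s"]
                  or after["tpot_ms"] > before["tpot_ms"]
                  or after["throughput"] < before["throughput"])
    return checks, regression
```

A round can pass the hotspot check and still be a regression, for instance when the saved time reappears elsewhere in the function.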

Phase 6: REPORT

Log for this round:
  • Round number
  • Hotspot location (file:line) and classification
  • Optimization applied (with before/after code summary)
  • Time delta: function Total time before → after
  • Benchmark delta: TPOT, throughput before → after

Reading Profile Output

Timer unit: 1e-06 s
Total time: 1.234 s
File: /path/to/file.py
Function: my_function at line 100

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   100                                           def my_function(self):
   101       500    890000.0   1780.0     72.1       result = tensor.item()
   102       500    234567.0    469.1     19.0       return result
How to read effectively:
  1. Start with Total time for each function; this is the overall budget
  2. Sort lines mentally by absolute Time, not just % Time (3% of a 60s function = 1.8s)
  3. Check the Hits count to understand iteration patterns:
    • Hits = 2 × expected count → a loop header such as for x in range(1): which is hit once on entry and once on the exit check
    • Hits ≫ expected → the line is inside a nested loop
  4. Look for repeated patterns: if 10 lines each take 3% in a loop body, the loop as a whole costs 30%
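The point about absolute time versus % Time is easy to verify with a little arithmetic. The helper names and the tuple layout below are assumptions for illustration.

```python
def absolute_seconds(function_total_s, percent_time):
    """Convert a line's % Time into seconds using the function's Total time."""
    return function_total_s * percent_time / 100.0


def rank_by_absolute_time(rows):
    """rows: (label, function_total_s, percent_time) tuples, largest cost first."""
    return sorted(rows, key=lambda r: absolute_seconds(r[1], r[2]), reverse=True)
```

Applied to the figures above, a 3% line in a 60 s function (1.8 s) outranks the 72.1% line in the 1.234 s sample function (about 0.89 s).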

Stopping Criteria

Stop the optimization loop when any of the following holds:
  1. Iteration limit reached: Completed N rounds (default 3)
  2. No actionable hotspots: Top hotspots are pure GPU compute (COMPUTE type)
  3. Diminishing returns: < 5% improvement in last 2 rounds
  4. Risk threshold: Further optimizations require architectural changes (e.g., Cython, struct-of-arrays)
  5. Test failures: Cannot find an optimization that passes UTs
Primary success metric: Benchmark throughput (requests/sec or tokens/sec) as measured by the profiling script. line_profiler time reductions are leading indicators, but throughput is the ground truth — a function-level speedup that doesn't improve throughput is not a real win.
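The first three criteria are mechanical and could be checked as below; criteria 4 and 5 need judgment and are left out. In this sketch "< 5% improvement in last 2 rounds" is read as each of the last two rounds gaining less than 5% throughput, which is one interpretation, and should_stop is a hypothetical helper.

```python
def should_stop(round_no, gains_pct, top_hotspot_types,
                max_rounds=3, min_gain_pct=5.0):
    """Return a reason string if the loop should stop, else None.

    gains_pct: per-round throughput improvement in percent, oldest first.
    top_hotspot_types: classification labels of the current top hotspots.
    """
    if round_no >= max_rounds:
        return "iteration limit reached"
    if top_hotspot_types and all(t == "COMPUTE" for t in top_hotspot_types):
        return "no actionable hotspots"
    if len(gains_pct) >= 2 and all(g < min_gain_pct for g in gains_pct[-2:]):
        return "diminishing returns"
    return None
```

Gains are measured on benchmark throughput rather than line_profiler time, consistent with throughput being the ground truth.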

Final Summary Output

The final report should include:
  • Rounds executed: Number of profile-optimize cycles completed
  • Cumulative improvement: Total host time reduction (percentage and absolute)
  • Benchmark metrics: Before/after comparison table (TPOT, throughput, ITL, E2EL)
  • Optimizations applied: List of changes with file:line locations and classification
  • Failed attempts: Any optimizations tried and reverted (with why)
  • Remaining hotspots: Top bottlenecks that couldn't be optimized (with classification)
  • Recommendations: Suggested follow-up for architectural changes if needed
For a concrete multi-round example, see references/examples.md.
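The cumulative improvement line of the report needs both the absolute and the percentage reduction. A small sketch, with a hypothetical function name, comparing the round 0 baseline against the final round:

```python
def cumulative_improvement(baseline_s, final_s):
    """Return (seconds saved, percent saved) relative to the baseline."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    saved = baseline_s - final_s
    return saved, 100.0 * saved / baseline_s
```

Computing the percentage against the original baseline, not the previous round, avoids compounding errors when several rounds are summarized together.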

Reference Files

| File | Contents |
| --- | --- |
| references/optimization-patterns.md | Pattern index; links to 6 sub-files: sync-alloc, loop-iteration, python-overhead, gpu-graph, system, compound-pitfalls |
| references/optimization-strategy.md | Zero-risk-first ordering, metric traps, three scopes of host overhead, pattern selection guide |
| references/hotspot-classification.md | Extended per-type indicators and code examples (including CUSTOM_OP, GRAPH_BREAK, HOST_SYNC) |
| references/communication-patterns.md | Communication overhead patterns (NCCL batching, barrier removal, async overlap, reduce_scatter) |
| references/hot-path-files.md | Key file tables, drill-down targets, UT mapping |
| references/examples.md | Usage examples and multi-round walkthrough |
| trtllm-nvtx-ranges.md | TRT-LLM NVTX range reference (from analysis skill); maps range names to source functions |