trtllm-code-contribution

TensorRT-LLM Code Contribution Best Practices

Contribution Process

1. Developer Workflow

Commit the changes. Never commit using an NVIDIA internal email address (`<user>@nvidia.com`)!

Push changes to a branch on the personal fork:

bash

git push -u <user> <local-branch>:<remote-branch>

Create a PR from the fork branch into upstream (typically targeting `main`).

2. Coding Guidelines

TRT-LLM coding style is defined in `CODING_GUIDELINES.md`. Key highlights:

C++: Allman brace style, 4-space indent, 120-character line limit, camelCase for variables/methods, PascalCase for types, `m` prefix for member variables, `k` prefix for constants, Doxygen for API docs, smart pointers over raw pointers, `static_cast` over `reinterpret_cast`, no C-style casts.

Python: snake_case for files/functions/variables, PascalCase for classes, UPPER_SNAKE_CASE for constants, 4-space indent, Google-style docstrings, narrow `except` clauses, Pydantic `StrictBaseModel` for user-facing config classes (no custom `__init__`).

3. Pre-commit Setup

bash

pip install pre-commit
pre-commit install

Pre-commit runs automatically on every `git commit`. Hooks include isort, yapf, autoflake, clang-format, cmake-format, codespell, ruff, ruff-format, mdformat, and others. If hooks modify files, stage the changes and commit again.

4. DCO Sign-off (Required)

All commits must be signed off to certify the contribution under the Developer Certificate of Origin:

bash

git commit -s -m "Add cool feature."

This appends `Signed-off-by: Your Name <your@email.com>` to the commit message. PRs containing unsigned commits will not be accepted.

IMPORTANT: Never sign off commits using an NVIDIA internal email address (`<user>@nvidia.com`)!
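
The sign-off rules can be checked mechanically before pushing. The following is a minimal sketch and not part of TRT-LLM tooling: the helper name `check_signoff` and the approach of scanning the raw commit message (for example the output of `git log -1 --format=%B`) are assumptions made for illustration.

```python
import re

# Hypothetical helper (not part of TRT-LLM): validates the DCO trailer of a
# single commit message.
_SIGNOFF_RE = re.compile(
    r"^Signed-off-by: (?P<name>.+) <(?P<email>[^<>\s]+)>$", re.MULTILINE
)


def check_signoff(commit_message: str) -> list[str]:
    """Return a list of problems; an empty list means the message passes."""
    matches = list(_SIGNOFF_RE.finditer(commit_message))
    if not matches:
        return ["missing Signed-off-by trailer (use `git commit -s`)"]
    problems = []
    for match in matches:
        email = match.group("email").lower()
        # The guide forbids signing off with an internal NVIDIA address.
        if email.endswith("@nvidia.com"):
            problems.append(f"sign-off uses NVIDIA internal email: {email}")
    return problems
```

A wrapper script could run this over every commit in the branch and refuse to push if any list is non-empty.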

Pre-Implementation Checklist

Before writing any code, complete these steps:

1. Survey Existing Infrastructure

Search before building. TRT-LLM is a large codebase with many reusable components. Before implementing something from scratch, search for existing utilities:

bash

# Before writing a new attention computation
grep -rE "TrtllmAttention|create_attention|scaled_dot_product" tensorrt_llm/_torch/

# Before writing a new compiled helper
grep -rE "maybe_compile|maybe_compiled_" tensorrt_llm/_torch/utils.py

# Before writing a custom RoPE
grep -rE "RotaryEmbedding|rotary_emb|rope" tensorrt_llm/_torch/modules/

# Before writing a new cache management pattern
grep -rE "mla_rope_append_paged_kv|append_paged_kv" tensorrt_llm/_torch/

**Trace existing forward methods.** Before writing a new `forward_*` method, read all existing forward methods in the class and understand what each one does. Often an existing method already implements the computation you need, and you just need to set up the right state (e.g., create an attribute, adjust a guard) to dispatch to it.

bash

# Find all forward methods in a class
grep -n "def forward" tensorrt_llm/_torch/modules/attention.py
# Then READ each one to understand what it does


**Lesson learned:** On the short-seq MHA branch (30 commits, ~250 lines written then deleted), the attention computation went through **4 rewrites**: per-sequence SDPA loop → batched SDPA with pad_sequence → custom TrtllmAttention backend → deletion in favor of the *already-existing* `forward_context_default()`. The final approach was +10 lines: a guard check + dispatch to an existing method. Similarly, `maybe_compiled_cat` was discovered only after a standalone `@maybe_compile` wrapper was written and then removed.

**Anti-pattern: Parallel reimplementation.** Before writing a new `forward_*` method, trace what existing forward methods do. The computation you need may already be implemented. In the MLA case, `forward_context_short_mha` reimplemented `forward_context_default` nearly line-for-line before being deleted.


2. Check Parallelism Dimensions

When adding a new code path, verify correctness under ALL parallelism modes:

Dimension	Guard	Why
Tensor Parallelism (TP)	`mapping.tp_size`	Head counts are sharded
Pipeline Parallelism (PP)	`mapping.pp_size`	Layers may be on different ranks
Context Parallelism (CP)	`mapping.cp_size`	Sequence is split across ranks — tokens are not all local
Expert Parallelism (EP)	`mapping.ep_size`	MoE experts distributed

Lesson learned: The short-seq MHA path assumed all tokens were local, which breaks under Context Parallelism. The `cp_size == 1` guard was added as a fix in a later commit instead of being part of the initial design.
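
One way to make the parallelism check part of the initial design is to collect the path's assumptions in a single predicate. The sketch below uses a stand-in `Mapping` dataclass holding only the four size fields from the table; the real TRT-LLM mapping object has more state, and `can_use_local_token_path` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Stand-in for illustration only; mirrors the size fields in the table."""

    tp_size: int = 1
    pp_size: int = 1
    cp_size: int = 1
    ep_size: int = 1


def can_use_local_token_path(mapping: Mapping) -> bool:
    """Hypothetical guard for a code path that assumes every token is local.

    Context Parallelism splits the sequence across ranks, so the path is only
    valid when cp_size == 1. This sketch assumes the other dimensions do not
    change which tokens a rank holds; a real path must verify that per mode.
    """
    return mapping.cp_size == 1
```

Writing the guard first forces the question "which modes does this path actually support?" before any kernel code exists.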

3. Think About Threshold/Guard Semantics

When gating a code path with a threshold:

What does the threshold measure? (per-sequence metric? total batch metric?)
What does the cost of the path scale with? (per-sequence? total tokens? quadratic in something?)
Do these match? If cost scales with total tokens, the threshold should check total tokens, not per-sequence max.

Lesson learned: The initial implementation checked `max_ctx_seq_len` (the longest single sequence) against the threshold, but the cost of the short-seq path scales with total packed tokens. A batch of 100 short sequences could incorrectly trigger the path: each sequence passes the per-sequence check even though the packed total is far larger.
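
The mismatch is easy to see with concrete numbers. This is a minimal sketch under assumed values: the function name `should_use_short_seq_path`, the sequence length of 512, and the threshold of 10240 are illustrative and not the actual TRT-LLM gate.

```python
def should_use_short_seq_path(seq_lens: list[int], threshold: int) -> bool:
    """Hypothetical gate: compare the threshold with the quantity that drives cost."""
    if threshold <= 0:  # a threshold of 0 disables the path
        return False
    total_packed_tokens = sum(seq_lens)
    return total_packed_tokens <= threshold


seq_lens = [512] * 100  # 100 short sequences, 51200 packed tokens in total
threshold = 10240

# Per-sequence check: every sequence is short, so the path would be taken.
per_sequence_check = max(seq_lens) <= threshold
# Total-token check: the packed batch exceeds the threshold, so it is not.
total_token_check = should_use_short_seq_path(seq_lens, threshold)
```

Here `per_sequence_check` is true while `total_token_check` is false, which is exactly the case the per-sequence gate got wrong.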

4. Check RoPE State

When adding attention code paths:

Is `apply_rotary_emb` True (caller handles RoPE) or False (rope_fusion, backend handles RoPE)?
Does your path apply RoPE? Will that cause double-application?
Do you need to handle both RoPE states or can you gate to one?

5. Trace Method Limitations

Understand what a method does NOT handle. When reusing an existing method, fully trace the dispatch chain above it. A method may be correct for one scenario but miss edge cases handled by a higher-level dispatcher.

Example: `forward_context_default()` handles fresh prefill with no cached KV tokens. But when there are cached KV tokens (chunked context), it silently ignores them, causing a correctness bug. The fix was to call `forward_context()` instead, which dispatches to:

`forward_context_with_chunked_prefill` (SM100+, chunked context)
`forward_context_with_cached_kv` (SM90 fallback, or cached context)
`forward_context_default` (fresh prefill, no cached tokens)

Checklist for reusing a method:

What does this method handle?
What does it NOT handle? (cached tokens? chunked prefill? specific hardware?)
Is there a higher-level dispatcher that routes to this method for the right cases?
Should I call the dispatcher instead of the method directly?
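
The difference between calling the dispatcher and calling a handler directly can be sketched with a toy class. Only the method names come from the text above; `ToyAttention`, its constructor, and the exact branching conditions are simplifying assumptions, not the real TRT-LLM dispatch logic.

```python
class ToyAttention:
    """Toy stand-in for illustration; not the TRT-LLM attention module."""

    def __init__(self, sm_version: int):
        self.sm_version = sm_version

    def forward_context(self, num_cached_tokens: int) -> str:
        """Top-level dispatcher: routes each situation to its handler."""
        if num_cached_tokens > 0:
            if self.sm_version >= 100:
                return self.forward_context_with_chunked_prefill()
            return self.forward_context_with_cached_kv()
        return self.forward_context_default()

    def forward_context_with_chunked_prefill(self) -> str:
        return "chunked_prefill"

    def forward_context_with_cached_kv(self) -> str:
        return "cached_kv"

    def forward_context_default(self) -> str:
        # Knows nothing about cached tokens: calling it directly with a
        # non-empty cache would silently ignore them.
        return "default"
```

A new code path that calls `forward_context(...)` inherits all three cases, while one that calls `forward_context_default()` only ever gets the last.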

6. Check Hardware-Specific Behavior

The same algorithm can have different numerical properties across SM versions. FMHA kernels may use different internal implementations (e.g., online softmax merge on SM90 vs single-pass on SM100+) that produce different accuracy characteristics.

Lesson learned: The SM90 (Hopper) FMHA kernel's online softmax merge for chunked prefill diverged from the single-pass reference by ~0.45 max diff, which is unacceptable for a correctness-critical path. The fix was to gate chunked prefill behind `get_sm_version() >= 100` (Blackwell+) and fall back to `forward_context_with_cached_kv` on SM90.

When to check:

Any new attention code path that uses fused kernels
Any path that changes how attention is split/chunked (chunked prefill, context parallelism)
When accuracy tolerances are tight and the path crosses hardware generations
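
A "max diff" figure like the one quoted above is the largest element-wise absolute difference between a kernel's output and a reference. The sketch below computes it on plain Python lists so it runs anywhere; the helper names and the sample values are made up for illustration, and real tests would compare tensors with a tolerance chosen for the dtype.

```python
def max_abs_diff(actual: list[float], reference: list[float]) -> float:
    """Largest element-wise absolute difference between two outputs."""
    if len(actual) != len(reference):
        raise ValueError("outputs must have the same length")
    return max((abs(a - r) for a, r in zip(actual, reference)), default=0.0)


def within_tolerance(actual: list[float], reference: list[float], atol: float) -> bool:
    """True when no element deviates from the reference by more than atol."""
    return max_abs_diff(actual, reference) <= atol


# Illustrative values only: one element is far from the reference.
reference = [0.10, 0.20, 0.30]
diverged = [0.10, 0.65, 0.30]
passes = within_tolerance(diverged, reference, atol=1e-2)
```

Running the same comparison on each SM version the path supports is what exposes a divergence that a single-hardware test would miss.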

Implementation Workflow

Use the Right Abstraction Level

Choose backends from this priority list:

Existing forward method (e.g., `forward_context_default`): may already implement what you need; just set up state and dispatch
Existing fused backend (e.g., `TrtllmAttention`, `FlashInferAttention`): handles packed sequences, variable lengths, KV cache natively
PyTorch fused ops (e.g., `F.scaled_dot_product_attention`): good for prototyping but requires manual batching/padding
Manual implementation: last resort, only when no existing backend fits

Use the Right Dispatch Abstraction Level

When dispatching to an existing method, use the highest-level dispatch point that provides the right abstraction. Don't bypass dispatch layers — you'll miss edge cases.

Abstraction Level	Example	Handles
Top-level dispatcher	`forward_context()`	Chunked prefill, cached KV, fresh prefill, SM-version gating
Specific handler	`forward_context_default()`	Fresh prefill only
Backend directly	`self.mha.forward(...)`	Nothing beyond raw attention

Lesson learned: The initial short-seq MHA implementation called `forward_context_default()` directly. This worked for fresh prefill but silently dropped cached KV tokens during chunked context. Switching to `forward_context()` (which dispatches to `forward_context_with_cached_kv` or `forward_context_with_chunked_prefill` as appropriate) fixed the bug with a 1-line change.

Prefer Reusing Existing Attributes Over Creating New Ones

When adding a new code path, check if an existing attribute can serve double duty:

python

# BAD: parallel attribute alongside existing one
self._short_seq_mha = create_attention(...)  # separate from self.mha
# Then need special handling everywhere self.mha is referenced

# GOOD: reuse existing attribute with conditional initialization
if should_use_dense_mha:
    self.mha = create_attention(...)  # replaces None for DSA models
# Existing code paths that check self.mha just work


**Lesson learned:** The short-seq MHA initially used `self._short_seq_mha` as a separate attribute to "preserve the assertion that `self.mha is None`". Later, it was realized the assertion itself should change (`self.mqa is not None`) and `self.mha` could be reused.


Run Pre-Commit Before Every Commit

Always run `pre-commit run --all-files` before committing. The short-seq MHA branch had a 377-line formatting-only commit (commit 15/19) that existed solely because pre-commit wasn't run on earlier commits. This wastes reviewer attention and pollutes `git blame`.

bash

# Before every commit:
pre-commit run --all-files
git add -u  # stage any auto-formatted files
git commit -s -m "..."

Apply torch.compile Judiciously

Pattern	Use `@maybe_compile` ?	Why
Fused math (RoPE rotation, GELU)	Yes	Fuses multiple element-wise ops into one kernel
`torch.cat` of computed tensors	Use `maybe_compiled_cat`	Already exists as a utility
Pure metadata ops (split, view, expand, reshape)	No	These are zero-cost; compile adds overhead
Mixed metadata + compute	Extract the compute part	Compile only what benefits from fusion

Extract Shared Logic Immediately

When a condition appears in more than one place, extract it into a helper method in the same commit. Don't wait for a later refactoring commit.

python

# BAD: same 5-condition check in two places
if (threshold > 0 and not apply_rotary and cp_size == 1 and ...):  # site 1
    ...
if (threshold > 0 and not apply_rotary and cp_size == 1 and ...):  # site 2
    ...

# GOOD: extract immediately
def _should_use_short_mha(self, ...):
    return (threshold > 0 and not apply_rotary and cp_size == 1 and ...)

Feature Flags for Complex Optimizations

Complex optimizations with multiple guards, edge cases, and hardware-specific behavior should ship disabled by default. Let users opt in via an environment variable after testing.

python

import os

# Pattern: disabled by default (threshold=0), opt-in via env var
_threshold_str = os.environ.get('TRTLLM_MLA_SHORT_SEQ_MHA_THRESHOLD', '0')
self.short_seq_mha_threshold = int(_threshold_str)


**Lesson learned:** The short-seq MHA optimization was initially enabled by default (threshold=10240) at commit 8 but had 18 more correctness fixes over the next 22 commits before being disabled by default at commit 26. Complex optimizations accumulate edge cases (chunked context, SM90 accuracy, threshold semantics) that may not be discovered until broad testing.

**When to disable by default:**
- The optimization has 3+ guard conditions
- It touches attention/correctness-critical paths
- It has hardware-specific behavior (different SM versions)
- It hasn't been tested in full CI across all configurations
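
When an optimization is opt-in through an environment variable, it helps to fail loudly on a malformed value rather than let `int()` raise an unexplained error deep in model setup. The sketch below assumes the same variable name as the pattern above; the helper `read_threshold` and its validation rules are hypothetical.

```python
import os

ENV_VAR = "TRTLLM_MLA_SHORT_SEQ_MHA_THRESHOLD"


def read_threshold(environ=os.environ) -> int:
    """Parse the opt-in threshold; 0 (the default) keeps the optimization off."""
    raw = environ.get(ENV_VAR, "0")
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{ENV_VAR} must be an integer, got {raw!r}") from None
    if value < 0:
        raise ValueError(f"{ENV_VAR} must be >= 0, got {value}")
    return value
```

Passing the environment as a parameter keeps the helper testable without mutating `os.environ`.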

Update All References When Changing Semantics

When changing what a variable/threshold means, grep for ALL references:

bash

# After changing threshold from max_seq_len to total_packed_tokens:
grep -rnE "max_seq_len|max_ctx_seq_len|short.*seq.*threshold" tests/ tensorrt_llm/

Update comments, docstrings, test descriptions, and variable names in the **same commit**.

Testing Strategy

When to Write Tests

Phase	What to test	Why
After implementation stabilizes	Full correctness suite	Avoid rewriting tests with each iteration
During prototyping	Minimal smoke test only	Validates basic plumbing without coupling to implementation details
After optimization changes	Add regression tests for the specific optimization	Catches if the optimization breaks something

Lesson learned: Tests were written before the attention backend was settled, then required 5 separate fix/update commits as the implementation evolved through 4 rewrites. The 770-line test file needed immediate fixing (device placement, weight layout bugs) because it was never run before committing.

Common Test Gotchas in TRT-LLM

Non-Module children aren't moved by `.to(device)`: If a module has attributes that aren't `nn.Module` subclasses (e.g., `DSATrtllmAttention.indexer`), `model.to(device)` won't move their parameters. Move them explicitly.
Weight layout differs from HuggingFace: Model loading transforms weights. Initialize test weights in the loaded layout (check `modeling_*.py` for load functions), not the HuggingFace checkpoint layout.
Background threads from cache managers: `DSACacheManager` and similar create `ThreadPoolExecutor` threads that outlive tests. Add `pytestmark = pytest.mark.threadleak(enabled=False)` at the module level.
`named_parameters()` misses non-Module attributes: When copying weights for A/B comparison tests, explicitly copy parameters from non-Module children (like indexer weights).
Attention metadata construction: Use the test fixtures/helpers already in the codebase (check `tests/unittest/_torch/attention/` for patterns) rather than building `AttentionMetadata` from scratch.

Test Consolidation

After implementation stabilizes, aggressively prune tests to a minimal set where each parametrized case exercises a distinct code path.

Pattern:

During development, write comprehensive tests (many parametrized cases covering all combinations)
After implementation stabilizes, identify which code paths each test case exercises
Merge cases that exercise the same code path; remove redundant cases
Extract shared test helpers (`_make_inputs`, `_make_metadata`, `_run_forward`) to reduce duplication

Lesson learned: The short-seq MHA test file peaked at 1394 lines with 21 parametrized cases, then was consolidated to 665 lines with 10 cases covering the same 6 code paths. Three separate cleanup commits were needed because consolidation wasn't done in one pass. Do consolidation as a single deliberate pass.
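
The consolidation step can be approached as a grouping problem: label each parametrized case with the code path it exercises, then keep one case per label. The sketch below is a hypothetical illustration; the case fields, the `THRESHOLD` value, and the path labels are invented and do not describe the actual short-seq MHA tests.

```python
from typing import Callable, Hashable

THRESHOLD = 10240  # illustrative value only


def path_of(case: dict) -> str:
    """Hypothetical classifier: which code path does this case exercise?"""
    if case["seq_len"] > THRESHOLD:
        return "long_seq"
    if case["cached_tokens"] > 0:
        return "short_seq_cached"
    return "short_seq_fresh"


def consolidate_cases(cases: list[dict], classify: Callable[[dict], Hashable]) -> list[dict]:
    """Keep the first case seen for each distinct code path."""
    kept: dict[Hashable, dict] = {}
    for case in cases:
        kept.setdefault(classify(case), case)
    return list(kept.values())


cases = [
    {"seq_len": 64, "cached_tokens": 0},
    {"seq_len": 128, "cached_tokens": 0},   # same path as the first case
    {"seq_len": 64, "cached_tokens": 32},
    {"seq_len": 20000, "cached_tokens": 0},
]
minimal = consolidate_cases(cases, path_of)
```

In this example four cases collapse to three, one per path. Doing the classification for the whole file at once is what makes a single consolidation pass feasible.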

Test on Multiple Hardware Targets

When testing attention kernels or fused operations, verify on multiple SM versions. The same kernel can have different numerical properties across hardware generations.

SM90 (Hopper): Online softmax merge in FMHA — can diverge from reference
SM100+ (Blackwell): Single-pass FMHA — tighter numerical accuracy
Use `get_sm_version()` guards to skip or adjust tests per hardware

Commit Hygiene

During Development

Commit freely — small, frequent commits help track progress and enable bisection.

Before PR Submission

Squash fix-on-fix chains using interactive rebase:

bash

# Fold fix commits into the commits they fix
git rebase -i $(git merge-base HEAD main)


Target commit structure for a PR:
1. **Core implementation** — the new feature with all guards and edge cases
2. **Additional optimizations** — one commit per distinct optimization
3. **Tests** — comprehensive test suite
4. **Refactoring** (optional) — cleanup that's separate from the feature

Anti-patterns to Avoid

Anti-pattern	What happens	Prevention
Fix-on-fix chains (A → fix A → fix fix A)	Noisy history, hard to review	Squash before PR
Add-then-revert (add X → revert X)	Wasted reviewer attention	Survey existing utilities first
Modify shared utility then revert (edit rotary_embedding.py → revert)	Pollutes unrelated files	Check if existing code paths handle it
Create compiled helper then inline it (add @maybe_compile → remove)	Churn	Profile first; only compile proven bottlenecks
Semantic change + behavior change in one commit	Hard to bisect regressions	Separate bug fixes from feature changes
Stale comment fix as separate commit	Shows the comment wasn't updated with the code change	Update comments in the same commit as the code

PR Title Format (Conventional Commits)

PR titles follow Conventional Commits: `type: description`

Types: `feat`, `fix`, `perf`, `refactor`, `test`, `docs`, `chore`, `None`

For breaking API changes, use `BREAKING CHANGE:` as the type to alert reviewers.

For NVIDIA developers, prefix with JIRA number or NVBUG ID:

[TRTLLM-5516] perf: description
[nvbug/5334370] fix: description

Examples:

feat: Add support for starcoder-v2 FP8 base + FP16/BF16 LoRA
BREAKING CHANGE: Set default max batch size to 2048
chore: Remove version from plugins .so
None: Stringized enums for better error msgs
fix https://github.com/NVIDIA/TensorRT-LLM/issues/700: a Memory leak issue in C++ runtime
[TRTLLM-5516] perf: Replicate dummy request for cuda graph padding
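
The title format can be checked with a regular expression before opening a PR. This is a minimal sketch inferred from the types and examples listed here, not the validator the project's CI uses; the function name `is_valid_pr_title` and the exact pattern (including the optional issue URL after the type) are assumptions.

```python
import re

PR_TYPES = ("feat", "fix", "perf", "refactor", "test", "docs", "chore",
            "None", "BREAKING CHANGE")

# Hypothetical pattern: optional ticket prefix, a known type, an optional
# issue URL, then ": " and a non-empty description.
_TITLE_RE = re.compile(
    r"^(?:\[(?:TRTLLM-\d+|nvbug/\d+)\] )?"
    r"(?P<type>" + "|".join(re.escape(t) for t in PR_TYPES) + r")"
    r"(?: https?://\S+)?"
    r": (?P<description>\S.*)$"
)


def is_valid_pr_title(title: str) -> bool:
    """Return True when the title matches the sketched format."""
    return _TITLE_RE.match(title) is not None
```

For instance, `is_valid_pr_title("Add cool feature")` is false because the title has no type, while each of the examples listed above matches.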

PR Description

Address these points in the PR description:

Background/motivation: Why is the change necessary?
Summary: Summarize the changes in one paragraph.
Size justification: If the PR is large, explain why it cannot be broken into multiple PRs.
Impact assessment: Potential performance or functional impacts. Flag risks for reviewers.
Related PRs: Link to any related PRs.

PR Conciseness

Avoid committing commented-out code.
Each PR should address a single concern. If there are several unrelated fixes, open separate PRs and indicate dependencies in the descriptions.

API Stability Tests

Some APIs are protected by the API stability test suite. If your PR breaks a protected API, the stability tests will fail with `API stability validation failed`. In this case, request review from the API code owners.

Quantified Impact of Common Mistakes

From the short-seq MHA branch (30 commits → net 2 files changed):

Mistake	Commits wasted	Lines written & deleted	Root cause
Reimplementing existing forward method	4 (commits 1,5,6,17)	~150 lines	Didn't read `forward_context_default`
Custom RoPE handling	5 (commits 1,13,16,17,18)	~100 lines	Didn't trace how fused kernel handles RoPE
Tests before stable implementation	5 (commits 3,4,8,11,15)	~200 lines of churn	Tests coupled to implementation details
Compiled helpers created then removed	4 (commits 10,12,13,18)	~60 lines	Premature optimization without profiling
Style-only commit	1 (commit 15)	377 lines reformatted	Pre-commit not run on earlier commits
Stale comment fixes	2 (commits 11,18)	~15 lines	Comments not updated with code changes
Calling method directly instead of dispatcher	3 (commits 21,23,30)	~20 lines	Didn't trace `forward_context()` dispatch chain
Not testing on SM90	1 (commit 30)	~10 lines	Assumed uniform numerical behavior across SM versions
Enabled by default too early	2 (commits 8,26)	~5 lines	Shipped threshold=10240 before edge cases were found
Threshold semantics drift in chunked context	1 (commit 28)	~10 lines	`num_ctx_tokens` doesn't account for cached tokens
Redundant test parametrizations	3 (commits 24,25,27)	~730 lines pruned	Tests written incrementally without path-coverage analysis

Total waste: ~24 of 30 commits were fixes/reverts/cleanups of earlier work on the same branch. The final net change is ~200 lines in attention.py and ~665 lines in tests — achievable in ~4-5 clean commits.

Review Readiness Checklist

Before marking a PR ready for review:

GitHub issue created and approved
All parallelism modes checked (TP, PP, CP, EP)
RoPE state handled correctly (no double-application)
Threshold/guard semantics match the cost model
Existing infrastructure surveyed and used where possible
Shared logic extracted (no duplicated conditions)
Comments/docstrings updated with any semantic changes
Tests pass and cover key scenarios (including API stability tests if applicable)
Commits squashed (no fix-on-fix chains)
Pre-commit hooks pass (`pre-commit run --all-files`)
DCO sign-off on all commits (`git commit -s`)
Dispatch calls use the right abstraction level (dispatcher, not specific handler)
Method limitations understood (what the reused method does NOT handle)
Hardware-specific behavior tested (SM90, SM100+) or gated appropriately
Complex optimizations disabled by default with env var opt-in
Test cases exercise distinct code paths (no redundant parametrizations)
PR title follows Conventional Commits format
PR description addresses background, summary, and impact
