trtllm-moe-develop

TensorRT-LLM MoE Code Quality

Use this skill to keep MoE changes aligned with the current TensorRT-LLM MoE architecture. Favor module roles, API boundaries, and testability over local style cleanup.

Required Context

Before proposing or editing MoE code, read:

- `CODING_GUIDELINES.md`
- `tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md`
- The target files being changed
- The relevant tests under `tests/unittest/_torch/modules/moe/`

Also inspect these files when the area is relevant:

- Forward execution/chunking: inspect `moe_scheduler.py`, `configurable_moe.py`, `interface.py`, backend `run_moe`/`quantize_input` paths, and communication code.
- MegaMoE/fused communication: inspect `moe_scheduler.py`, `mega_moe/`, `configurable_moe.py`, `quantization.py`, and communication code.
- Communication: `tensorrt_llm/_torch/modules/fused_moe/communication/base.py` and `communication_factory.py`.
- Quantization and weights: `tensorrt_llm/_torch/modules/fused_moe/quantization.py`.
- EPLB/load balancing: `interface.py`, `moe_load_balancer.py`, `quantization.py`, `moe_scheduler.py`, current forward-execution/chunking code, and `test_moe_module.py`.
- Test matrix/helpers: `tests/unittest/_torch/modules/moe/moe_test_utils.py` and `quantize_utils.py` when adding backend, quantization, skip, or parameter coverage.

For module-specific work, read `references/moe-canonical-code-examples.md` after the guide and load only the relevant section. Each design gate or review should cite at least one concrete code example with file:line evidence.

Working With MOE_DEVELOPER_GUIDE.md

Treat `MOE_DEVELOPER_GUIDE.md` as the in-repo source of truth for MoE architecture. Treat this skill as the agent workflow layer that tells Codex how to apply that source of truth while designing, editing, or reviewing code.

Use the guide this way:

- Start from the guide sections that match the requested change: Architecture, File Map, Backend Capability Matrix, execution-flow/EPLB constraints, Canonical Examples, and Anti-Patterns.
- Use guide content to fill the design gate: owner boundary, main API, reference pattern, and test plan.
- Do not duplicate fast-changing matrices or backend support tables in this skill; prefer the guide as the current reference.
- If a code change adds a backend, quantization method, communication strategy, fused-communication behavior, EPLB behavior, or test convention, check whether the guide also needs an update.
- If guide and code disagree, inspect code and tests, mention the mismatch, and either update the guide as part of the change or report it as follow-up.

Guide-update checklist:

- File map changed: update `File Map`.
- Backend or quant support changed: update `Backend Capability Matrix`.
- New backend/communication/forward-execution pattern: update `Canonical Examples`.
- New forbidden pattern or ownership rule: update `Anti-Patterns`.
- Test convention changed: update `Tests`.

Core Principle

Preserve these owner boundaries:

- `ConfigurableMoE` is the assembler/orchestrator. It wires backend, communication, EPLB, weight lifecycle delegation, and shared wrapper bookkeeping.
- Backends declare capabilities, run MoE computation, and own the MoE module's weight lifecycle boundary. They expose and implement `create_weights`, `load_weights`, `post_load_weights`, `process_weights_after_loading`, and `pre_reload_weights` as needed, select any `FusedMoEMethod`, and make those hooks compatible with ConfigurableMoE deferred weight creation and reload flows. Backends may delegate quantization-specific tensor layout, loading, post-load transforms, and scale setup to a quantization method, but backend lifecycle hooks remain the public owner of weight handling. New ConfigurableMoE-compatible backends should expose `quantize_input` and `run_moe`, not `forward` or `forward_impl`, unless the user explicitly asks for legacy standalone behavior. For an active ConfigurableMoE-compatible backend, `run_moe` must be a concrete implementation, not an empty stub, and ConfigurableMoE or its scheduler should call `backend.run_moe(...)` as the compute entrypoint. Backend-specific alternatives such as `run_with_prequant` are acceptable only as private helpers called from `run_moe`, not as public wrapper/scheduler targets that bypass the common contract. The current `MoE` interface still covers both legacy standalone MoE modules and newer backends; treat legacy `forward` methods as transitional until a dedicated `MoEBackend` interface exists. Backends should not become orchestration or external-communication state machines.
- Quantization methods are backend-selected implementation helpers for quantization-specific weight tensor layout, loading details, post-load transforms, scale setup, and EPLB fix-up registration. They do not replace backend ownership of the weight lifecycle API.
- Communication strategies own external cross-rank dispatch/combine.
- `MoEScheduler` owns forward-time policy: padding/truncation, chunking, dispatch/quantize ordering, EPLB hook ordering, zero-token chunk behavior, external-vs-fused communication workflow, and backend `run_moe` invocation. `ConfigurableMoE` constructs the scheduler from `backend.scheduler_kind` and delegates to it; schedulers may read wrapper state and call wrapper helpers but must not own lifecycle, weight loading, DWDP record, `repeat_idx` advancement, or communication lifetime. The only sanctioned scheduler mutation of `moe.comm` is through `determine_communication_method` fallback.
- Shared test helpers own backend/quantization matrices and skip logic. Updating one test file while leaving `moe_test_utils.py` or `quantize_utils.py` stale is usually incomplete.
- Tests should exercise the boundary that changed: backend, module, communication, routing, EPLB, or multi-GPU behavior.

A refactor is good only if it keeps these roles clearer than before.
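
The backend/scheduler boundary described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `ToyBackend`, `ToyScheduler`, and the integer "quantization" are hypothetical stand-ins, and only the method names `quantize_input` and `run_moe` and the idea of a private prequant helper come from the text.

```python
from typing import List


class ToyBackend:
    """Hypothetical backend; not the real TensorRT-LLM class."""

    def quantize_input(self, x: List[float]) -> List[int]:
        # Stand-in for real quantization: round each value to an integer.
        return [int(round(v)) for v in x]

    def run_moe(self, x: List[int]) -> List[int]:
        # Public compute entrypoint: the only compute method callers use.
        return self._run_with_prequant(x)

    def _run_with_prequant(self, x: List[int]) -> List[int]:
        # Private helper, reachable only through run_moe.
        return [2 * v for v in x]


class ToyScheduler:
    """Hypothetical scheduler: owns call ordering, not weights or lifecycle."""

    def __init__(self, backend: ToyBackend) -> None:
        self.backend = backend

    def forward(self, x: List[float]) -> List[int]:
        quantized = self.backend.quantize_input(x)
        return self.backend.run_moe(quantized)


scheduler = ToyScheduler(ToyBackend())
result = scheduler.forward([0.9, 2.2])
```

The scheduler never touches `_run_with_prequant` directly, so swapping the helper does not change what wrapper or scheduler code calls.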

Module Blocks

ConfigurableMoE: Assembler

Role:

- Compose backend, communication strategy, EPLB, and wrapper-level lifecycle.
- Keep `forward_impl` focused on wrapper-level work: resolve output dtype, delegate execution, record DWDP, advance `repeat_idx` once.
- Own backend construction/sync and validation, not backend-specific forward policy.

Main APIs / references:

- `configurable_moe.py`: `ConfigurableMoE.__init__`, backend construction, communication strategy creation/bypass, scheduler construction, `forward_impl`, `validate_backend`.
- `MOE_DEVELOPER_GUIDE.md`: ConfigurableMoE orchestrator and file map.

Checklist:

- New behavior still leaves `ConfigurableMoE` as an assembler.
- No new backend-specific fast path in `forward_impl` unless it is a temporary compatibility bridge with a clear follow-up.
- `forward_impl` or an extracted scheduler should invoke backend computation through `backend.run_moe(...)`; direct calls to backend-specific compute entrypoints such as `run_with_prequant` are red flags unless the change is explicitly a short-lived adapter and `run_moe` remains the real implementation.
- Shared wrapper state such as `repeat_idx`, DWDP record, backend attr sync, and communication lifetime stays in one place.
- Scheduler creation happens after backend, communication, chunking streams, validation, and optional DWDP setup are initialized, because schedulers read that wrapper state.
- `forward_impl` should not accumulate chunking, routing, communication, EPLB, or fused-kernel branches; that policy belongs in `MoEScheduler`.
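
As a rough sketch of what a thin `forward_impl` could look like, the toy wrapper below resolves the output dtype, delegates to a scheduler callable, records one DWDP event, and advances `repeat_idx` once per call. `ToyConfigurableMoE`, `chunked_scheduler`, and the `dwdp_records` counter are hypothetical names invented for this illustration; the real wrapper works on tensors and has a different signature.

```python
from typing import Callable, List


class ToyConfigurableMoE:
    """Hypothetical wrapper; only the division of work follows the text."""

    def __init__(self, scheduler_forward: Callable[[List[int]], List[int]]) -> None:
        self.scheduler_forward = scheduler_forward
        self.repeat_idx = 0
        self.dwdp_records = 0

    def forward_impl(self, x: List[int], output_dtype: type = float) -> list:
        # 1. Resolve the output dtype at wrapper level.
        convert = output_dtype
        # 2. Delegate chunking, dispatch, and run_moe to the scheduler.
        out = [convert(v) for v in self.scheduler_forward(x)]
        # 3. Wrapper bookkeeping happens once per call, not once per chunk.
        self.dwdp_records += 1
        self.repeat_idx += 1
        return out


def chunked_scheduler(x: List[int], chunk: int = 2) -> List[int]:
    # Hypothetical scheduler policy: process the input in fixed-size chunks.
    out: List[int] = []
    for start in range(0, len(x), chunk):
        out.extend(v + 1 for v in x[start:start + chunk])
    return out


moe = ToyConfigurableMoE(chunked_scheduler)
y = moe.forward_impl([1, 2, 3, 4, 5])
```

Five tokens are processed in three chunks here, yet `repeat_idx` advances only once, which is the property the checklist asks reviewers to protect.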

MoE Scheduler: Forward Execution Strategy

Role:

- Own per-forward execution policy for ConfigurableMoE: padding/truncation, chunking, dispatch ordering, adaptive pre/post quant dispatch, EPLB wait/stat update/route/CPU-stage hook ordering, zero-token chunk behavior, and backend `run_moe` invocation.
- Select external-vs-fused communication behavior through `MoESchedulerKind`, not through wrapper `isinstance` checks.
- Read wrapper state and call wrapper helpers, but do not own module lifecycle, backend construction, weight lifecycle, communication object lifetime, DWDP record, or `repeat_idx` advancement.

Main APIs / references:

- `moe_scheduler.py`: `MoEScheduler`, `ExternalCommMoEScheduler`, `FusedCommMoEScheduler`, `create_moe_scheduler`.
- `interface.py`: `MoESchedulerKind` and backend `scheduler_kind`.
- `configurable_moe.py`: scheduler construction and thin `forward_impl` delegation.
- `communication/base.py`: `supports_post_quant_dispatch`, `prepare_dispatch`, `dispatch`, and `combine` contracts used by `ExternalCommMoEScheduler`.

Checklist:

- New forward policy goes in `moe_scheduler.py`, not in ConfigurableMoE or a backend, unless it is truly backend-local compute inside `run_moe`.
- `ExternalCommMoEScheduler` owns host-side dispatch/combine, communication fallback, optional multi-stream chunk overlap, padding/truncation, and external communication EPLB statistic paths.
- `FusedCommMoEScheduler` owns fused-kernel lockstep: ADP stripping, per-rank-consistent chunk count, zero-token launches, no external dispatch/combine, and `ignore_allreduce=False` EPLB statistic update.
- Schedulers call `backend.quantize_input(...)` and `backend.run_moe(...)`; they must not call backend-specific alternate compute helpers that bypass `run_moe`.
- Schedulers must not advance `repeat_idx`, run DWDP record/prefetch, create or destroy communication strategies, or call weight lifecycle hooks.
- If backend-specific kwargs are needed, keep them centralized and narrow inside scheduler helper code, with comments explaining why the common `run_moe` contract is insufficient for that backend.
- Add/update module-level tests for changed scheduler behavior, especially chunking, zero-token chunks, DP padding/truncation, EPLB hook order, and fused-communication lockstep.
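
To make the kind-based selection and the lockstep idea concrete, here is a small sketch. The enum member names `EXTERNAL_COMM` and `FUSED_COMM` come from the text; everything else (`ToyExternalCommScheduler`, `ToyFusedCommScheduler`, `make_toy_scheduler`, `plan_chunks`, and the specific chunk-count rule) is an assumption made for illustration and may not match how the real schedulers plan chunks.

```python
from enum import Enum, auto
from typing import List


class MoESchedulerKind(Enum):
    # Member names follow the text; this enum body is only a stand-in.
    EXTERNAL_COMM = auto()
    FUSED_COMM = auto()


class ToyExternalCommScheduler:
    def plan_chunks(self, local_tokens: int, chunk_size: int,
                    max_tokens_any_rank: int) -> List[int]:
        # Host-side path: chunk only the local tokens (none if there are none).
        return [min(chunk_size, local_tokens - start)
                for start in range(0, local_tokens, chunk_size)]


class ToyFusedCommScheduler:
    def plan_chunks(self, local_tokens: int, chunk_size: int,
                    max_tokens_any_rank: int) -> List[int]:
        # Lockstep path: every rank launches the same number of chunks, so a
        # rank with fewer tokens pads its plan with zero-token launches.
        num_chunks = max(1, -(-max_tokens_any_rank // chunk_size))
        return [max(0, min(chunk_size, local_tokens - i * chunk_size))
                for i in range(num_chunks)]


_SCHEDULERS = {
    MoESchedulerKind.EXTERNAL_COMM: ToyExternalCommScheduler,
    MoESchedulerKind.FUSED_COMM: ToyFusedCommScheduler,
}


def make_toy_scheduler(kind: MoESchedulerKind):
    # Selection is driven by the declared kind, never by isinstance checks.
    return _SCHEDULERS[kind]()


external = make_toy_scheduler(MoESchedulerKind.EXTERNAL_COMM)
fused = make_toy_scheduler(MoESchedulerKind.FUSED_COMM)
external_plan = external.plan_chunks(3, 2, 5)
fused_plan = fused.plan_chunks(3, 2, 5)
```

With 3 local tokens, a chunk size of 2, and a busiest rank holding 5 tokens, the fused plan contains a trailing zero-token launch while the external plan does not.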

MoE Backend

Role:

- Pure MoE computation and backend-specific capability/config validation.
- Own module-level weight handling and lifecycle delegation through `create_weights`, `load_weights`, `post_load_weights`, `process_weights_after_loading`, and `pre_reload_weights`.
- Own `quantize_input` and `run_moe` shape/kernel contracts. `run_moe` must launch the backend compute path for every active ConfigurableMoE-compatible backend. Do not leave it as `NotImplementedError` while the wrapper calls an alternate method such as `run_with_prequant`.
- Do not implement `forward` or `forward_impl` for new ConfigurableMoE-compatible backends unless the user explicitly requests legacy standalone behavior; if required, document why the normal backend contract is insufficient.
- Declare whether the backend's cross-rank exchange is external to the kernel or fused inside the kernel.

Main APIs / references:

- `interface.py`: `MoE`, `scheduler_kind`, `can_implement`, `_supports_load_balancer`, `validate_configurable_moe` when present, and weight lifecycle hooks (`create_weights`, `load_weights`, `post_load_weights`, `process_weights_after_loading`, `pre_reload_weights`).
- `fused_moe_cutlass.py`: reference backend using external communication.
- `mega_moe/`: reference area for a fused-communication backend.
- `create_moe.py`: backend selection and fallback path.

Checklist:

- `can_implement()` returns clear `(False, reason)` for unsupported quant, dtype, shape, or hardware.
- Backend weight lifecycle hooks are implemented or explicitly rejected with a narrow error; `create_weights()` is safe under ConfigurableMoE deferred weight creation, `load_weights()` honors or rejects `allow_partial_loading`, and `post_load_weights()`/`process_weights_after_loading()`/`pre_reload_weights()` keep transformed weights and reload metadata coherent.
- The backend selects and stores the quantization method before delegating layout-specific weight registration/loading/transforms; callers should not need to reach into `quantization.py` directly.
- `run_moe` is implemented and is the method reached by ConfigurableMoE or the scheduler. If a helper like `run_with_prequant` exists for performance or naming compatibility, it is called from `run_moe`, not directly from wrapper policy code.
- Cross-rank exchange ownership is explicit via `scheduler_kind` and not hidden behind wrapper `isinstance` checks. Backends with kernel-fused exchange declare `MoESchedulerKind.FUSED_COMM`; normal backends use `EXTERNAL_COMM`.
- Backend-specific wrapper constraints go in a validation hook or an equivalent narrow contract, not in scattered forward branches.
- Weight handling remains backend API scope even when the actual tensor layout is implemented by a `FusedMoEMethod`.
- Do not add external host communication logic to a backend, except for a true fused-communication backend whose kernel owns the exchange.
- New backend tests belong in `test_moe_backend.py`.
- Existing legacy `forward` methods can be read for compatibility context, but they are not the default pattern for new backend work.
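
The `(False, reason)` convention for `can_implement()` might look like the sketch below. The `ToyMoEConfig` fields, the supported sets, and the minimum SM version are invented for illustration, and the real `can_implement` signature in `interface.py` may take different arguments.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyMoEConfig:
    # Hypothetical config fields, chosen only for this example.
    quant_mode: str
    dtype: str
    hidden_size: int
    sm_version: int


class ToyBackend:
    SUPPORTED_QUANT = {"none", "fp8"}
    SUPPORTED_DTYPES = {"bfloat16", "float16"}
    MIN_SM = 90
    HIDDEN_ALIGNMENT = 128

    @classmethod
    def can_implement(cls, cfg: ToyMoEConfig) -> Tuple[bool, Optional[str]]:
        # Each rejection names the offending value so callers can log it.
        if cfg.quant_mode not in cls.SUPPORTED_QUANT:
            return False, f"unsupported quant mode: {cfg.quant_mode}"
        if cfg.dtype not in cls.SUPPORTED_DTYPES:
            return False, f"unsupported dtype: {cfg.dtype}"
        if cfg.hidden_size % cls.HIDDEN_ALIGNMENT != 0:
            return False, f"hidden_size {cfg.hidden_size} is not a multiple of {cls.HIDDEN_ALIGNMENT}"
        if cfg.sm_version < cls.MIN_SM:
            return False, f"requires SM >= {cls.MIN_SM}, got {cfg.sm_version}"
        return True, None


ok = ToyBackend.can_implement(ToyMoEConfig("fp8", "bfloat16", 4096, 90))
bad = ToyBackend.can_implement(ToyMoEConfig("int4", "bfloat16", 4096, 90))
```

Returning the reason alongside the boolean lets backend selection report why a fallback happened instead of failing silently.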

Quantization And Weights

Role:

- Weight handling is backend scope at the module/API boundary: the backend exposes the lifecycle hooks, owns when they are called, and is accountable for reload/EPLB consistency.
- Quantization-specific tensor creation, loading details, post-load transforms, quant scales, and EPLB weight fix-ups should live in `quantization.py` as a backend-selected `FusedMoEMethod` implementation when they are specific to a quantization layout.
- When adding new weight handling, first look for a reusable existing quant method or base class before creating a new one, then make the backend select and invoke it through the lifecycle hooks.

Main APIs / references:

- `quantization.py`: `FusedMoEMethodBase`, `create_weights`, `load_weights`, `post_load_weights`, `setup_quant_scales`, `eplb_support_status`, `supports_online_eplb`, `need_load_shared_weights`.
- Existing quant methods in `quantization.py` are the reference patterns.

Checklist:

- New backend weight handling is surfaced through backend lifecycle hooks; new quantization-specific tensor layouts are represented by a backend-selected quantization method, not ad hoc caller or wrapper code.
- Existing quant method/layout is reused when the tensor layout and scale semantics match.
- `create_weights()` registers module parameters with the correct slot, expert, hidden, intermediate, and scale layout.
- `load_weights()` handles supported loading modes and rejects unsupported ones clearly. Preserve the EPLB split: common MoE FC weights/biases (`w3_w1_weight`, `w2_weight`, and bias tensors when present) use the shared `FusedMoEMethodBase.load_weights()`/`post_load_weights()` path, where `need_load_shared_weights(module)` gates CPU shared staging and registration.
- Quantization methods add only their quantization-specific EPLB registrations for scales, alphas, transformed weights, or layout-specific views that are not covered by the base FC weight path. Those extra tensors must also be gated by `need_load_shared_weights(module)` before loading, transforming, or registering shared copies. If a specialized method cannot reuse the base FC path because its raw parameter layout is incompatible, the design must call out that exception and preserve equivalent base semantics explicitly.
- `post_load_weights()` performs transforms, shared-weight setup, and scale setup in the quantization method only for tensors outside the base FC path; base FC weight registration should still flow through the base class whenever possible.
- `setup_quant_scales()` is updated when a quant mode exposes scales consumed by backend, communication, or forward-execution paths.
- EPLB support status is explicit: `SUPPORTED`, `NOT_SUPPORTED`, or `NOT_VERIFIED`.
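
A minimal sketch of two checklist items, explicit EPLB status and clear rejection of unsupported loading modes, is shown below. `ToyEplbSupportStatus` and `ToyQuantMethod` are hypothetical, the three status names come from the text, and the `load_weights` signature here is an assumption that does not mirror the real `FusedMoEMethodBase` API.

```python
from enum import Enum


class ToyEplbSupportStatus(Enum):
    SUPPORTED = "supported"
    NOT_SUPPORTED = "not_supported"
    NOT_VERIFIED = "not_verified"


class ToyQuantMethod:
    # Declared explicitly rather than inferred from a missing attribute.
    eplb_support_status = ToyEplbSupportStatus.NOT_VERIFIED

    @classmethod
    def supports_online_eplb(cls) -> bool:
        # Only an explicit SUPPORTED status enables online EPLB.
        return cls.eplb_support_status is ToyEplbSupportStatus.SUPPORTED

    def load_weights(self, weights: dict, allow_partial_loading: bool = False) -> dict:
        # Reject the unsupported mode with a narrow, descriptive error.
        if allow_partial_loading:
            raise NotImplementedError("ToyQuantMethod does not support partial loading")
        missing = {"w3_w1_weight", "w2_weight"} - weights.keys()
        if missing:
            raise KeyError(f"missing weights: {sorted(missing)}")
        return {name: weights[name] for name in ("w3_w1_weight", "w2_weight")}


method = ToyQuantMethod()
loaded = method.load_weights({"w3_w1_weight": [1], "w2_weight": [2], "extra": [3]})
```

Treating `NOT_VERIFIED` as "not enabled" keeps an untested quantization layout from silently opting in to online EPLB.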

EPLB

Role:

- EPLB is cross-cutting. A correct change may need updates in interface, quantization, forward execution, communication, and tests.
- Do not treat EPLB as only a backend flag.

Main APIs / references:

- `interface.py`: `_supports_load_balancer`, `_add_raw_shared_weights_for_unmap`, `_using_load_balancer`, `_using_dynamic_load_balancer`, validation hooks.
- `quantization.py`: `eplb_support_status`, `need_load_shared_weights`, `register_all_parameter_slot_and_to_fix_weight_fns`, `setup_quant_scales`, `post_load_weights`.
- Current forward-execution code: statistic update, route, `ignore_allreduce`, per-chunk first/last hook ordering.
- `test_moe_module.py`: EPLB params and `generate_*_eplb_test_params`.

Checklist:

- Backend reports whether load balancing is supported.
- Quantization method declares online EPLB status.
- EPLB weight registration is split into two layers:
  1. Common MoE FC weights/biases are handled by `FusedMoEMethodBase` using `need_load_shared_weights(module)` in its shared-load/register flow.
  2. Quantization-specific scales, alphas, transformed weights, or layout views are handled by the concrete quantization method and must add their own `need_load_shared_weights(module)`-gated shared-load/register logic.
- Shared quant-specific tensors needed by EPLB are registered in the quantization method, including any fix-up functions for transformed weights.
- Forward execution collects routing statistics and chooses `ignore_allreduce` correctly for the communication path.
- EPLB hook order is preserved around routing, `run_moe`, and CPU weight migration.
- `num_slots`, `num_experts`, `ep_size`, and slot-vs-expert IDs are not mixed.
- Add or update concrete EPLB tests in `test_moe_module.py`, including the backend/comm/quant combination that changed.
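
The two-layer registration split can be pictured with the toy classes below. Here the "module" is a plain dict and `need_load_shared_weights` reads a hypothetical `dynamic_eplb` flag; the real helper and the real `FusedMoEMethodBase` flow operate on module attributes and tensors, so treat this purely as a structural sketch.

```python
from typing import Dict, List


def need_load_shared_weights(module: Dict) -> bool:
    # Hypothetical gate: the toy "module" is a dict with a dynamic-EPLB flag.
    return bool(module.get("dynamic_eplb", False))


class ToyMethodBase:
    def shared_registrations(self, module: Dict) -> List[str]:
        # Layer 1: common FC weights are registered by the base class only.
        if not need_load_shared_weights(module):
            return []
        return ["w3_w1_weight", "w2_weight"]


class ToyBlockScaleMethod(ToyMethodBase):
    def shared_registrations(self, module: Dict) -> List[str]:
        names = super().shared_registrations(module)
        # Layer 2: quantization-specific tensors, behind the same gate.
        if need_load_shared_weights(module):
            names += ["w3_w1_weight_scale", "w2_weight_scale"]
        return names


with_eplb = ToyBlockScaleMethod().shared_registrations({"dynamic_eplb": True})
without_eplb = ToyBlockScaleMethod().shared_registrations({"dynamic_eplb": False})
```

The subclass never re-registers the FC weights itself, and both layers consult the same gate, so disabling dynamic EPLB turns off all shared staging at once.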

CPU shared-staging buffer family (EPLB migration)

Dynamic EPLB needs host-resident copies of per-expert tensors so that `MoeLoadBalancer` can migrate experts between ranks via host shared memory. Each per-expert `nn.Parameter` on the module has a parallel CPU staging buffer; all of them are passed to `register_all_parameter_slot_and_to_fix_weight_fns` once loading finishes. Any new per-expert Parameter MUST add its own staging buffer and migration hook, or the shared-load path will either write out of bounds or silently corrupt routed slots (NVBug 6130334 / PR #13856).

Full family in the NVFP4 path (`quantization.py`):

| GPU `nn.Parameter` on module | CPU shared staging buffer | Sized by |
| --- | --- | --- |
| `w3_w1_weight` (packed FP4) | `module.local_shared_w3_w1_tensors` | `len(local_shared_load_expert_ids)` |
| `w2_weight` (packed FP4) | `module.local_shared_w2_tensors` | same |
| `w3_w1_bias` / `w2_bias` (if `bias=True`) | `module.local_shared_w3_w1_bias_tensors` / `module.local_shared_w2_bias_tensors` | same |
| `w3_w1_weight_scale` / `w2_weight_scale` (block scales) | `module.local_shared_w3_w1_scale_tensors` / `module.local_shared_w2_scale_tensors` | same |
| `fc31_alpha` / `fc2_alpha` (per-expert fp32 scalar) | `shared_fc31_alpha` / `shared_fc2_alpha` (local variables in `process_weights_after_loading`) | `num_shared = len(tmp_shared_weight_scale_2)` |
| `fc31_weight_scale_2` / `fc2_weight_scale_2` (per-expert fp32 scalar, gated by `force_dynamic_quantization`) | `shared_fc31_weight_scale_2` / `shared_fc2_weight_scale_2` (local variables) | same |

Key index-space distinction:

- `expert_size_per_partition = num_slots / ep_size` is the routed-slot count on this rank; sizes the on-GPU module Parameters.
- `num_shared = len(local_shared_load_expert_ids) = num_experts / shared_size`, where `shared_size = shared_mpi_comm.Get_size()` is the same-node MPI rank count (from `MPI_COMM_TYPE_SHARED` split); sizes the CPU staging buffers.
- On multi-node setups `shared_size < ep_size` is legal and makes `num_shared > expert_size_per_partition`. Any code that writes into a routed-sized Parameter using a staging-space index will go out of bounds.
- On single-node setups `shared_size == ep_size` is enforced by the `assert shared_size == local_size` in `MoeLoadBalancer._setup_mpi_comm`, so single-node unit tests cannot exercise the `num_shared > expert_size_per_partition` failure mode through parameter tuning alone. A regression test for staging-index correctness must either (a) invoke the reconcile/migration function directly with a crafted staging dict, or (b) run on a real multi-node Slurm environment.
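
A quick worked example of the two index spaces, using the formulas above with made-up sizes (256 experts, 256 slots, two nodes of 8 ranks each). The helper name `index_spaces` and the numbers are assumptions for illustration only.

```python
from typing import Tuple


def index_spaces(num_experts: int, num_slots: int,
                 ep_size: int, shared_size: int) -> Tuple[int, int]:
    # Routed space: slots owned by this rank, sizes the on-GPU Parameters.
    expert_size_per_partition = num_slots // ep_size
    # Staging space: experts this rank loads into same-node shared memory.
    num_shared = num_experts // shared_size
    return expert_size_per_partition, num_shared


# Multi-node: 16 EP ranks spread over two nodes with 8 ranks each.
routed, staged = index_spaces(num_experts=256, num_slots=256, ep_size=16, shared_size=8)

# A routed-sized buffer cannot be indexed with every staging-space index.
routed_param = [0.0] * routed
overflowing = [i for i in range(staged) if i >= len(routed_param)]

# Single-node: shared_size == ep_size, so the two sizes coincide.
single_routed, single_staged = index_spaces(256, 256, ep_size=8, shared_size=8)
```

In the multi-node case the staging space has 32 entries against 16 routed slots, so half of the staging indices would overflow a routed-sized Parameter; in the single-node case both sizes are 32 and the bug stays hidden.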

Naming convention quirk: bulk weights and block-scales use `module.local_shared_*_tensors` (attribute on module, deleted after register); per-expert scalars (alphas, `weight_scale_2`) use `shared_*` (function-local). Both are equally valid migration sources -- the distinction is historical.

Checklist for adding a new per-expert Parameter to an EPLB-supporting quantization method:

- Register the `nn.Parameter`, sized `expert_size_per_partition`, in `create_weights()`.
- In whichever loader fills it, also fill a `tmp_shared_*_weight_scale_X` dict keyed by `enumerate(local_shared_load_expert_ids)` during the `need_load_shared_weights(module)` branch.
- In `process_weights_after_loading()` (or the equivalent finalize step), allocate a CPU `shared_*` buffer sized `num_shared` and fill it from the temp dict. Pass it as an explicit destination to reconcile/compute helpers -- do NOT write into the on-module `.data[expert_idx]` from the shared path, since `expert_idx` is in staging space and the on-module Parameter is in routed space.
- Add the staging buffer to the `weight_fns` dict handed to `register_all_parameter_slot_and_to_fix_weight_fns({...})` so migration can find it.
- If the reconcile/compute helper is shared between routed and staging paths, its signature must take the destination tensor as a parameter (not read `module.<param>.data` directly), so the same body serves both index spaces.
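
The "destination as a parameter" rule can be sketched with plain lists standing in for tensors. `fill_alpha`, the scale values, and the `input_scale` factor are hypothetical; the point is only that the helper writes into whatever buffer it is handed and never reaches for the on-module Parameter.

```python
from typing import Dict, List


def fill_alpha(dest: List[float], weight_scale_2: Dict[int, float],
               input_scale: float) -> None:
    """Hypothetical reconcile helper: writes only into `dest`."""
    for idx, scale in weight_scale_2.items():
        # idx is interpreted in the index space of `dest`, whichever it is.
        dest[idx] = scale * input_scale


expert_size_per_partition = 2   # routed space (on-module Parameter size)
num_shared = 4                  # staging space (CPU shared buffer size)

# Routed path: destination sized by the routed-slot count.
routed_alpha = [0.0] * expert_size_per_partition
fill_alpha(routed_alpha, {0: 2.0, 1: 4.0}, input_scale=0.5)

# Shared path: a separate destination sized by num_shared.
shared_alpha = [0.0] * num_shared
fill_alpha(shared_alpha, {0: 2.0, 1: 4.0, 2: 6.0, 3: 8.0}, input_scale=0.5)
```

Because the shared path has its own buffer, staging indices 2 and 3 are valid there and the routed values are left untouched.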

Red flags:

A new per-expert Parameter registered in `create_weights()` but never added to any `weight_fns` migration dict -- it will be stale after the first EPLB migration.
A reconcile/compute function that both reads `tmp_shared_*` and writes `module.<per_expert_param>.data[expert_idx]` -- the staging-space index can exceed the routed-space bound (multi-node) or silently overwrite routed slots (single-node).
Asymmetric gating: one of a `fc31_*` / `fc2_*` pair is registered but its twin is not (or one is added to `weight_fns` but not the other) -- migration will leave half the state stale.
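The asymmetric-gating red flag lends itself to a mechanical check. The helper below is a hypothetical review aid, not part of TensorRT-LLM; it assumes the twin of `fc31_<suffix>` is always named `fc2_<suffix>`.

```python
def unpaired_fc_entries(names):
    """Return fc31_*/fc2_* names whose twin is missing from `names`."""
    names = set(names)
    missing = []
    for name in sorted(names):
        if name.startswith("fc31_"):
            twin = "fc2_" + name[len("fc31_"):]
        elif name.startswith("fc2_"):
            twin = "fc31_" + name[len("fc2_"):]
        else:
            continue  # not part of an fc31/fc2 pair
        if twin not in names:
            missing.append(name)
    return missing


# Example: the weight_scale_2 twin for fc2 was forgotten.
registered = ["fc31_alpha", "fc2_alpha", "fc31_weight_scale_2"]
unpaired = unpaired_fc_entries(registered)
```

Running the same check over the keys added to the migration dict covers the second half of the red flag.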

动态EPLB需要每个专家张量的主机驻留副本，以便

MoeLoadBalancer

通过主机共享内存在rank之间迁移专家。模块上的每个专家

nn.Parameter

都有一个并行的CPU暂存缓冲；所有这些缓冲在加载完成后都会传递给

register_all_parameter_slot_and_to_fix_weight_fns

。任何新增的专家Parameter必须添加自己的暂存缓冲及迁移钩子，否则共享加载路径要么越界写入，要么静默损坏路由槽（NVBug 6130334 / PR #13856）。

NVFP4路径中的完整家族（

quantization.py

）：

模块上的GPU `nn.Parameter`	CPU共享暂存缓冲	大小由以下决定
`w3_w1_weight` （打包FP4）	`module.local_shared_w3_w1_tensors`	`len(local_shared_load_expert_ids)`
`w2_weight` （打包FP4）	`module.local_shared_w2_tensors`	同上
`w3_w1_bias` / `w2_bias` （若 `bias=True` ）	`module.local_shared_w3_w1_bias_tensors` / `module.local_shared_w2_bias_tensors`	同上
`w3_w1_weight_scale` / `w2_weight_scale` （块缩放）	`module.local_shared_w3_w1_scale_tensors` / `module.local_shared_w2_scale_tensors`	同上
`fc31_alpha` / `fc2_alpha` （每个专家的fp32标量）	`shared_fc31_alpha` / `shared_fc2_alpha` （ `process_weights_after_loading` 中的局部变量）	`num_shared = len(tmp_shared_weight_scale_2)`
`fc31_weight_scale_2` / `fc2_weight_scale_2` （每个专家的fp32标量，由 `force_dynamic_quantization` 控制）	`shared_fc31_weight_scale_2` / `shared_fc2_weight_scale_2` （局部变量）	同上

关键索引空间区别：

```
expert_size_per_partition = num_slots / ep_size
```
是当前rank上的路由槽数量；决定GPU模块Parameters的大小。

num_shared = len(local_shared_load_expert_ids) = num_experts / shared_size

，其中

shared_size = shared_mpi_comm.Get_size()

是同节点MPI rank数量（来自

MPI_COMM_TYPE_SHARED

拆分）；决定CPU暂存缓冲的大小。

在多节点设置中，
```
shared_size < ep_size
```
是合法的，这会导致
```
num_shared > expert_size_per_partition
```
。任何使用暂存空间索引写入路由大小Parameter的代码都会越界。
在单节点设置中，
```
MoeLoadBalancer._setup_mpi_comm
```
中的
```
assert shared_size == local_size
```
强制
```
shared_size == ep_size
```
，因此单节点单元测试无法仅通过参数调用来测试
```
num_shared > expert_size_per_partition
```
的失败模式。暂存索引正确性的回归测试必须要么(a)使用精心构造的暂存字典直接调用调和/迁移函数，要么(b)在真实的多节点Slurm环境中运行。

命名约定特点：批量权重和块缩放使用

module.local_shared_*_tensors

（模块上的属性，注册后删除）；每个专家的标量（alpha、

weight_scale_2

）使用

shared_*

（函数局部变量）。两者都是有效的迁移源——区别是历史原因造成的。

为支持EPLB的量化方法新增专家Parameter的检查清单：

在

create_weights()

中注册大小为

expert_size_per_partition

的模块上的

nn.Parameter

。

在填充该Parameter的加载器中，在

need_load_shared_weights(module)

分支期间，填充以

enumerate(local_shared_load_expert_ids)

为键的

tmp_shared_*_weight_scale_X

字典。

在
```
process_weights_after_loading()
```
（或等效的最终步骤）中，分配大小为
```
num_shared
```
的CPU
```
shared_*
```
缓冲，并从临时字典填充它。将其作为显式目标传递给调和/计算辅助函数——不要从共享路径写入模块上的
```
.data[expert_idx]
```
，因为
```
expert_idx
```
属于暂存空间，而模块上的Parameter属于路由空间。
将暂存缓冲添加到传递给
```
register_all_parameter_slot_and_to_fix_weight_fns({...})
```
的
```
weight_fns
```
字典中，以便迁移能找到它。
如果调和/计算辅助函数在路由和暂存路径之间共享，其签名必须接受目标张量作为参数（而非直接读取
```
module.<param>.data
```
），以便同一代码体可服务于两个索引空间。

危险信号：

在
```
create_weights()
```
中注册了新的专家Parameter，但从未添加到任何
```
weight_fns
```
迁移字典中——第一次EPLB迁移后它会过期。
调和/计算函数既读取
```
tmp_shared_*
```
又写入
```
module.<per_expert_param>.data[expert_idx]
```
——暂存空间索引可能超过路由空间边界（多节点）或静默覆盖路由槽（单节点）。
不对称控制：
```
fc31_*
```
/
```
fc2_*
```
对中的一个已注册但另一个未注册（或一个添加到
```
weight_fns
```
但另一个未添加）——迁移会导致一半状态过期。

Communication

通信

Role:

External communication strategies implement dispatch/combine and expose what ordering they support relative to quantization.
Backends whose kernel owns cross-rank exchange should bypass external communication strategies rather than being forced through the factory.

Main APIs / references:

`communication/base.py`: `Communication`, `is_platform_supported`, `is_workload_feasible`, `supports_post_quant_dispatch`, `prepare_dispatch`, `dispatch`, `combine`.
`communication/communication_factory.py`: strategy selection.
Existing strategies: `nvlink_one_sided.py`, `nvlink_two_sided.py`, `deep_ep.py`, `allgather_reducescatter.py`.

Checklist:

Strategy selection and forced method behavior are handled through the factory.
`supports_post_quant_dispatch()` is correct for the payload layout.
`prepare_dispatch()` is used only for metadata/statistics that must happen before dispatch.
`dispatch()` and `combine()` maintain enough internal state for the pair to be correct.
EPLB statistics gathered by the communication strategy are fed back to the load balancer through the forward-execution path.
Add/update `test_moe_comm.py` or module-level tests when changing strategy behavior.
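To make the dispatch/combine state requirement concrete, here is a single-process toy. The method names follow the list above, but every signature is an assumption and the classes are not the real `Communication` base class; the "exchange" is just a reordering of a Python list.

```python
from abc import ABC, abstractmethod


class ToyCommunication(ABC):
    """Hypothetical stand-in for a strategy interface (signatures assumed)."""

    @staticmethod
    def is_platform_supported():
        return True

    def is_workload_feasible(self, num_tokens):
        return num_tokens >= 0

    def prepare_dispatch(self, expert_ids):
        return None  # nothing to gather before dispatch in this toy

    @abstractmethod
    def supports_post_quant_dispatch(self): ...

    @abstractmethod
    def dispatch(self, tokens, expert_ids): ...

    @abstractmethod
    def combine(self, expert_outputs): ...


class SortByExpert(ToyCommunication):
    """Dispatch groups tokens by expert id; combine restores the original order."""

    def __init__(self):
        self._order = None  # state shared by one dispatch/combine pair

    def supports_post_quant_dispatch(self):
        return True  # the toy payload has no layout constraints

    def dispatch(self, tokens, expert_ids):
        self._order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
        return [tokens[i] for i in self._order]

    def combine(self, expert_outputs):
        if self._order is None:
            raise RuntimeError("combine() called without a matching dispatch()")
        restored = [None] * len(expert_outputs)
        for pos, src in enumerate(self._order):
            restored[src] = expert_outputs[pos]
        self._order = None  # the pair is complete
        return restored


strategy = SortByExpert()
sent = strategy.dispatch(["a", "b", "c", "d"], [2, 0, 1, 0])
result = strategy.combine([t.upper() for t in sent])
```

A focused test for a real strategy can follow the same round-trip shape: dispatch, apply an identity-like expert function, combine, and compare with the input order.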

职责：

外部通信策略实现分发/合并，并暴露它们支持的与量化相关的顺序。
内核负责跨rank交换的后端应绕过外部通信策略，而非强制通过工厂类。

主API / 参考：

communication/base.py

：

Communication

、

is_platform_supported

、

is_workload_feasible

、

supports_post_quant_dispatch

、

prepare_dispatch

、

dispatch

、

combine

。

```
communication/communication_factory.py
```
：策略选择。

现有策略：

nvlink_one_sided.py

、

nvlink_two_sided.py

、

deep_ep.py

、

allgather_reducescatter.py

。

检查清单：

策略选择和强制方法行为通过工厂类处理。
```
supports_post_quant_dispatch()
```
针对有效负载布局是正确的。
```
prepare_dispatch()
```
仅用于必须在分发前处理的元数据/统计信息。
```
dispatch()
```
和
```
combine()
```
保持足够的内部状态以确保配对正确。
通信策略收集的EPLB统计信息通过前向执行路径反馈给负载均衡器。
变更策略行为时，添加/更新
```
test_moe_comm.py
```
或模块级测试。

Forward Execution And Chunking

前向执行与分块

Role:

Treat `moe_scheduler.py` as the current owner of forward-time policy. Use this section as the detailed checklist for scheduler changes and for reviews that suspect policy has leaked back into the wrapper or backend.
Keep lifecycle outside this policy: backend construction, weight loading, communication strategy lifetime, DWDP record, and `repeat_idx` advancement remain wrapper-level concerns.

Main APIs / references:

`moe_scheduler.py`: scheduler ABC, external/fused scheduler implementations, chunk helpers, EPLB hook order, and backend kwargs construction.
`configurable_moe.py`: scheduler construction and wrapper lifecycle after scheduler return.
Current communication interfaces and backend `run_moe` / `quantize_input` contracts.
Existing tests that exercise module forward, multi-GPU EP, EPLB, and communication behavior.

Checklist:

The wrapper advances `repeat_idx` once per `forward_impl`; schedulers must not mutate it independently.
External-communication scheduler respects padding, chunking, communication fallback, quantize/dispatch order, EPLB hooks, and output truncation.
Fused-communication path does not call external `Communication.dispatch` or `combine`.
Per-chunk EPLB first/last-call behavior is preserved.
Multi-stream overlap is used only on paths that support it.
Add module or focused forward-path tests for new policy, especially chunking and zero-token behavior.
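A focused chunking test needs a clear statement of what each chunk sees. The planner below is a hypothetical sketch, not the helper in `moe_scheduler.py`: it returns `(start, end, is_first, is_last)` tuples, and its zero-token behavior (one empty chunk flagged both first and last) is an assumption made for this example only.

```python
def plan_chunks(num_tokens, max_chunk_tokens):
    """Hypothetical chunk planner returning (start, end, is_first, is_last)."""
    if max_chunk_tokens <= 0:
        raise ValueError("max_chunk_tokens must be positive")
    if num_tokens == 0:
        # Assumption of this sketch: emit one empty chunk so that
        # first-call and last-call hooks still fire exactly once.
        return [(0, 0, True, True)]
    starts = range(0, num_tokens, max_chunk_tokens)
    last = len(starts) - 1
    return [
        (start, min(start + max_chunk_tokens, num_tokens), i == 0, i == last)
        for i, start in enumerate(starts)
    ]


chunks = plan_chunks(10, 4)
first_flags = [c[2] for c in chunks]
last_flags = [c[3] for c in chunks]
```

Whatever the real policy is, a test can assert the same invariants: chunks cover every token once, exactly one chunk is first, and exactly one is last.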

职责：

将
```
moe_scheduler.py
```
视为当前前向时间策略的负责人。使用本节作为调度器变更及怀疑策略泄露回包装器或后端的审查的详细检查清单。
将生命周期排除在该策略之外：后端构造、权重加载、通信策略生命周期、DWDP记录及
```
repeat_idx
```
推进仍是包装器级别的关注点。

主API / 参考：

```
moe_scheduler.py
```
：调度器抽象基类、外部/融合调度器实现、分块辅助函数、EPLB钩子顺序及后端参数构造。
```
configurable_moe.py
```
：调度器构造及调度器返回后的包装器生命周期。
当前通信接口及后端
```
run_moe
```
/
```
quantize_input
```
契约。
现有测试，用于测试模块前向、多GPU EP、EPLB及通信行为。

检查清单：

包装器在每个
```
forward_impl
```
中推进一次
```
repeat_idx
```
；调度器不得独立修改它。
外部通信调度器遵守填充、分块、通信回退、量化/分发顺序、EPLB钩子及输出截断。
融合通信路径不调用外部
```
Communication.dispatch
```
或
```
combine
```
。
保留每分块EPLB首次/末次调用行为。
仅在支持多流重叠的路径上使用多流重叠。
为新策略添加模块或聚焦前向路径的测试，尤其是分块和零token行为。

Routing And Factory

路由与工厂类

Role:

Routing methods map router logits to expert or slot selections.
Factory/config code selects a backend based on requested backend, quantization, hardware capability, and model config.

Main APIs / references:

`routing.py`: routing method implementations.
`create_moe.py`: `get_moe_cls`, `create_moe_backend`, `create_moe`.
`moe_test_utils.py`: backend enum, backend class map, skip logic.

Checklist:

Routing output dtype/shape matches backend and forward-execution expectations.
Unsupported backend/quant/model combinations fall back or skip with clear reasons.
Test skip logic mirrors backend `can_implement()` instead of hiding bugs with broad skips.
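As a reminder of what "map router logits to expert selections" means in practice, here is a generic top-k softmax routing sketch in plain Python. It is not claimed to match any method in `routing.py`; the function name and the choice to renormalize the selected weights are assumptions of this example.

```python
import math


def topk_softmax_routing(logits, top_k):
    """Generic sketch: softmax over experts, keep top_k, renormalize weights."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the top_k largest probabilities, highest first.
    chosen = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    norm = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / norm for i in chosen]


expert_ids, weights = topk_softmax_routing([0.1, 2.0, -1.0, 1.5], top_k=2)
```

The checklist item on dtype/shape applies to exactly this pair of outputs: one id list and one weight list per token, each of length `top_k`.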

职责：

路由方法将路由器logits映射到专家或槽选择。
工厂类/配置代码根据请求的后端、量化、硬件能力及模型配置选择后端。

主API / 参考：

```
routing.py
```
：路由方法实现。

create_moe.py

：

get_moe_cls

、

create_moe_backend

、

create_moe

。

```
moe_test_utils.py
```
：后端枚举、后端类映射、跳过逻辑。

检查清单：

路由输出的dtype/形状符合后端和前向执行的预期。
不支持的后端/量化/模型组合会回退或给出明确原因后跳过。
测试跳过逻辑与后端
```
can_implement()
```
一致，而非用宽泛的跳过隐藏bug。

Test Matrix And Helpers

测试矩阵与工具类

Role:

Keep backend, quantization, model-shape, routing, communication, and CI/local test matrices centralized and consistent across backend-level and module-level tests.
Keep skip reasons aligned with production capability checks such as `can_implement()` instead of hiding failures with broad local skips.

Main APIs / references:

`tests/unittest/_torch/modules/moe/moe_test_utils.py`: `MoeBackendType`, `get_backend_class`, `get_quick_skip_reason`, backend-specific `should_skip_*`, `iter_base_test_configs`, CI acceleration logic.
`tests/unittest/_torch/modules/moe/quantize_utils.py`: quantized test weight generation and quant-parameter setup.
`test_moe_backend.py`: backend interface tests for `quantize_input` and `run_moe`.
`test_moe_module.py`: ConfigurableMoE integration matrix, multi-GPU, and EPLB coverage.
`test_moe_comm.py`: communication dispatch/combine coverage.

Checklist:

New backend is added to `MoeBackendType`, `get_backend_class`, backend/module matrices, and skip logic.
New quantization method is added to test quant parameters and EPLB support checks when applicable.
New unsupported combination returns a precise skip reason tied to production capability checks.
CI subset and local exhaustive matrix stay intentionally different and are documented in the test helpers.
Legacy tests such as `test_fused_moe.py` are used only for compatibility; new ConfigurableMoE behavior belongs in `test_moe_backend.py`, `test_moe_module.py`, or focused comm/routing/load-balancer tests.
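One way to keep skip reasons tied to production checks is to derive them from the backend's own capability method. The sketch below is hypothetical throughout: `ToyBackend`, its supported-quantization set, the `(ok, reason)` return shape, and `skip_reason` are assumptions, and the real `can_implement()` signature may differ.

```python
class ToyBackend:
    """Hypothetical backend with a production-style capability check."""

    SUPPORTED_QUANT = {"none", "fp8"}  # made-up capability set

    @classmethod
    def can_implement(cls, quant):
        if quant not in cls.SUPPORTED_QUANT:
            return False, f"{cls.__name__} does not support quant={quant!r}"
        return True, None


def skip_reason(backend_cls, quant):
    """Return None when the combination should run, else the production reason."""
    ok, reason = backend_cls.can_implement(quant)
    return None if ok else reason


runs = skip_reason(ToyBackend, "fp8")
skipped = skip_reason(ToyBackend, "nvfp4")
```

Because the test helper holds no capability table of its own, a backend that gains support stops being skipped without any test edit, and a broad local skip has nowhere to hide.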

职责：

保持后端、量化、模型形状、路由、通信及CI/本地测试矩阵集中化，并在后端级和模块级测试之间保持一致。
保持跳过原因与生产能力检查（如
```
can_implement()
```
）一致，而非用宽泛的本地跳过隐藏失败。

主API / 参考：

tests/unittest/_torch/modules/moe/moe_test_utils.py

：

MoeBackendType

、

get_backend_class

、

get_quick_skip_reason

、后端特定的

should_skip_*

、

iter_base_test_configs

、CI加速逻辑。

tests/unittest/_torch/modules/moe/quantize_utils.py

：量化测试权重生成及量化参数设置。

test_moe_backend.py

：后端接口测试，针对

quantize_input

和

run_moe

。

```
test_moe_module.py
```
：ConfigurableMoE集成矩阵、多GPU及EPLB覆盖。
```
test_moe_comm.py
```
：通信分发/合并覆盖。

检查清单：

新后端已添加到
```
MoeBackendType
```
、
```
get_backend_class
```
、后端/模块矩阵及跳过逻辑中。
新增量化方法已添加到测试量化参数及适用时的EPLB支持检查中。
新的不支持组合返回与生产能力检查相关的精确跳过原因。
CI子集和本地详尽矩阵保持有意不同，并在测试工具类中记录。
遗留测试（如
```
test_fused_moe.py
```
）仅用于兼容性；新的ConfigurableMoE行为应放在
```
test_moe_backend.py
```
、
```
test_moe_module.py
```
或聚焦通信/路由/负载均衡器的测试中。

Design Gate

设计关卡

Before editing, write a short gate:


编辑前，请编写简短的关卡说明：


MoE Design Gate

MoE设计关卡

Change area: <ConfigurableMoE / MoEScheduler-forward-execution / backend / quantization-weights / EPLB / communication / routing-factory / test-matrix / tests>
Owner boundary: <where the behavior belongs and why>
Main API touched: <method/class names>
Reference pattern: <existing file/class/function from references/moe-canonical-code-examples.md, with file:line evidence>
Guide sections used: <MOE_DEVELOPER_GUIDE.md sections>
Guide update needed: <yes/no; which section if yes>
Refactor needed: <yes/no; one reason tied to architecture, not style>
Test plan: <backend/module/comm/routing/EPLB/multi-GPU tests>


If the owner boundary is unclear, inspect more code before editing.

变更领域：<ConfigurableMoE / MoEScheduler前向执行 / 后端 / 量化-权重 / EPLB / 通信 / 路由-工厂类 / 测试矩阵 / 测试>
职责边界：<行为归属及原因>
涉及的主API：<方法/类名称>
参考模式：<来自references/moe-canonical-code-examples.md的现有文件/类/函数，提供文件:行号依据>
使用的指南章节：<MOE_DEVELOPER_GUIDE.md章节>
是否需要更新指南：<是/否；若是，需更新哪个章节>
是否需要重构：<是/否；与架构相关的一个原因，非风格原因>
测试计划：<后端/模块/通信/路由/EPLB/多GPU测试>


如果职责边界不明确，请在编辑前检查更多代码。

Refactor Rubric

重构准则

Recommend a refactor when it:

Moves behavior to the correct owner boundary.
Simplifies `ConfigurableMoE` while preserving its assembler role.
Clarifies backend ownership of the weight lifecycle and quantization-method delegation for weights/scales.
Makes backend capabilities and unsupported combinations explicit.
Separates external-communication and fused-communication policies cleanly in `MoEScheduler` rather than wrapper/backend branches.
Makes EPLB support testable across interface, quantization, forward execution, and module tests.
Updates shared test matrices/helpers when backend, quantization, or skip semantics change.
Reduces duplicate dispatch/chunking/EPLB ordering logic by centralizing forward-time policy in `moe_scheduler.py` without changing performance-critical semantics.

Reject or question a refactor when it:

Adds backend-specific forward branches to `ConfigurableMoE` instead of selecting behavior through `MoESchedulerKind` / `MoEScheduler`.
Moves weight layout logic out of quantization methods without a strong reason.
Hides hardware or quantization constraints behind vague abstractions.
Changes communication/EPLB ordering without tests.
Adds one-off skips in individual tests instead of shared capability/skip helpers.
Touches legacy MoE paths for new features when the ConfigurableMoE path should be used.

当重构满足以下条件时，推荐进行：

将行为移至正确的职责边界。
在保留ConfigurableMoE组装器角色的同时简化它。
明确后端对权重生命周期的所有权及对权重/缩放的量化方法委托。
明确后端能力及不支持的组合。
在
```
MoEScheduler
```
中清晰分离外部通信与融合通信策略，而非通过包装器/后端分支。
使EPLB支持可在接口、量化、前向执行及模块测试中测试。
当后端、量化或跳过语义变更时，更新共享测试矩阵/工具类。
通过将前向时间策略集中在
```
moe_scheduler.py
```
中减少重复的分发/分块/EPLB顺序逻辑，且不改变性能关键语义。

当重构满足以下条件时，拒绝或提出质疑：

在
```
ConfigurableMoE
```
中添加后端特定前向分支，而非通过
```
MoESchedulerKind
```
/
```
MoEScheduler
```
选择行为。
无充分理由将权重布局逻辑移出量化方法。
将硬件或量化约束隐藏在模糊的抽象背后。
无测试情况下变更通信/EPLB顺序。
在单个测试中添加一次性跳过，而非共享能力/跳过工具类。
当应使用ConfigurableMoE路径时，为新功能修改遗留MoE路径。

Review Output

审查输出

For reviews, lead with findings and concrete references:


审查时，先列出发现结果及具体参考：


Findings

发现结果

[High] file:line <architecture, correctness, or testability issue>
[Medium] file:line <maintainability or boundary issue>
[Low] file:line <local cleanup>

[高] <文件:行号> <架构、正确性或可测试性问题>
[中] <文件:行号> <可维护性或边界问题>
[低] <文件:行号> <局部优化>

Architecture Fit

架构适配性

ConfigurableMoE remains assembler: <yes/no>
Owner boundaries respected: <yes/no>
Scheduler boundary respected: <yes/no; forward policy in `moe_scheduler.py`, lifecycle in wrapper, compute in backend>
Refactor recommended: <yes/no + reason>

ConfigurableMoE仍为组装器：<是/否>
职责边界已遵守：<是/否>
调度器边界已遵守：<是/否；前向策略在
```
moe_scheduler.py
```
中，生命周期在包装器中，计算在后端中>
是否推荐重构：<是/否 + 原因>

Guide Alignment

指南对齐性

Sections checked: <MOE_DEVELOPER_GUIDE.md sections>
Guide update needed: <yes/no + section>

检查的章节：<MOE_DEVELOPER_GUIDE.md章节>
是否需要更新指南：<是/否 + 章节>

Checklist Coverage

检查清单覆盖情况

Weights/quantization: <covered/gap>
EPLB: <covered/gap>
Communication: <covered/gap>
MoEScheduler/forward execution: <covered/gap>
Backend: <covered/gap>
Forward execution/chunking details: <covered/gap>
Test matrix/helpers: <covered/gap>
Tests: <covered/gap>


If there are no findings, say so and list remaining test or performance risk.

权重/量化：<已覆盖/存在缺口>
EPLB：<已覆盖/存在缺口>
通信：<已覆盖/存在缺口>
MoEScheduler/前向执行：<已覆盖/存在缺口>
后端：<已覆盖/存在缺口>
前向执行/分块细节：<已覆盖/存在缺口>
测试矩阵/工具类：<已覆盖/存在缺口>
测试：<已覆盖/存在缺口>


如果没有发现结果，请说明并列出剩余的测试或性能风险。

Test Selection

测试选择

Prefer the unified MoE tests:

Shared test matrix/helper changes: inspect `tests/unittest/_torch/modules/moe/moe_test_utils.py` and `quantize_utils.py`, then run the affected backend/module tests below.

Backend interface changes:

pytest tests/unittest/_torch/modules/moe/test_moe_backend.py -k '<backend or quant>'

Module/create/forward changes:

pytest tests/unittest/_torch/modules/moe/test_moe_module.py -k '<backend or feature>'

Communication changes:

pytest tests/unittest/_torch/modules/moe/test_moe_comm.py -k '<strategy>'

Routing changes:

pytest tests/unittest/_torch/modules/test_moe_routing.py -k '<routing>'

Load balancer changes:

pytest tests/unittest/_torch/modules/test_moe_load_balancer.py -k '<case>'

Multi-GPU EP/all-to-all behavior:

pytest tests/unittest/_torch/multi_gpu/test_moe_a2a.py -k '<case>'

When GPU resources are required, use the TRT-LLM GPU allocation/test-runner skills first and record skipped tests with reasons.
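The mapping from change area to test command can also be kept as data, which makes it easy to print or script. The paths are copied from the list above; the dictionary keys and the `pytest_command` helper are hypothetical names chosen for this sketch.

```python
import shlex

# Change area -> base pytest command. Keys are labels invented for this sketch.
TEST_COMMANDS = {
    "backend": "pytest tests/unittest/_torch/modules/moe/test_moe_backend.py",
    "module": "pytest tests/unittest/_torch/modules/moe/test_moe_module.py",
    "communication": "pytest tests/unittest/_torch/modules/moe/test_moe_comm.py",
    "routing": "pytest tests/unittest/_torch/modules/test_moe_routing.py",
    "load_balancer": "pytest tests/unittest/_torch/modules/test_moe_load_balancer.py",
    "multi_gpu_a2a": "pytest tests/unittest/_torch/multi_gpu/test_moe_a2a.py",
}


def pytest_command(area, keyword=None):
    """Build the command line for one change area, optionally with a -k filter."""
    base = TEST_COMMANDS[area]
    if keyword is None:
        return base
    # shlex.quote keeps multi-word -k expressions as a single shell argument.
    return f"{base} -k {shlex.quote(keyword)}"


example = pytest_command("communication", "deep_ep")
```

The helper only builds strings; running them, and allocating GPUs first where needed, stays with the existing test-runner workflow.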

优先选择统一的MoE测试：

共享测试矩阵/工具类变更：检查
```
tests/unittest/_torch/modules/moe/moe_test_utils.py
```
和
```
quantize_utils.py
```
，然后运行以下受影响的后端/模块测试。

后端接口变更：

pytest tests/unittest/_torch/modules/moe/test_moe_backend.py -k '<backend或quant>'

。

模块/创建/前向变更：

pytest tests/unittest/_torch/modules/moe/test_moe_module.py -k '<backend或feature>'

。

通信变更：

pytest tests/unittest/_torch/modules/moe/test_moe_comm.py -k '<strategy>'

。

路由变更：

pytest tests/unittest/_torch/modules/test_moe_routing.py -k '<routing>'

。

负载均衡器变更：

pytest tests/unittest/_torch/modules/test_moe_load_balancer.py -k '<case>'

。

多GPU EP/全对全行为：

pytest tests/unittest/_torch/multi_gpu/test_moe_a2a.py -k '<case>'

。

当需要GPU资源时，先使用TRT-LLM GPU分配/测试运行器规范，并记录跳过的测试及原因。