trtllm-moe-develop
TensorRT-LLM MoE Code Quality
Use this skill to keep MoE changes aligned with the current TensorRT-LLM MoE
architecture. Favor module roles, API boundaries, and testability over local
style cleanup.
Required Context
Before proposing or editing MoE code, read:
- `CODING_GUIDELINES.md`
- `tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md`
- The target files being changed
- The relevant tests under `tests/unittest/_torch/modules/moe/`
Also inspect these files when the area is relevant:
- Forward execution/chunking: inspect `moe_scheduler.py`, `configurable_moe.py`, `interface.py`, backend `run_moe`/`quantize_input` paths, and communication code.
- MegaMoE/fused communication: inspect `moe_scheduler.py`, `mega_moe/`, `configurable_moe.py`, `quantization.py`, and communication code.
- Communication: `tensorrt_llm/_torch/modules/fused_moe/communication/base.py` and `communication_factory.py`.
- Quantization and weights: `tensorrt_llm/_torch/modules/fused_moe/quantization.py`.
- EPLB/load balancing: `interface.py`, `moe_load_balancer.py`, `quantization.py`, `moe_scheduler.py`, current forward-execution/chunking code, and `test_moe_module.py`.
- Test matrix/helpers: `tests/unittest/_torch/modules/moe/moe_test_utils.py` and `quantize_utils.py` when adding backend, quantization, skip, or parameter coverage.
For module-specific work, read `references/moe-canonical-code-examples.md` after the guide and load only the relevant section. Each design gate or review should cite at least one concrete code example with file:line evidence.
Working With MOE_DEVELOPER_GUIDE.md
Treat `MOE_DEVELOPER_GUIDE.md` as the in-repo source of truth for MoE architecture. Treat this skill as the agent workflow layer that tells Codex how to apply that source of truth while designing, editing, or reviewing code.
Use the guide this way:
- Start from the guide sections that match the requested change: Architecture, File Map, Backend Capability Matrix, execution-flow/EPLB constraints, Canonical Examples, and Anti-Patterns.
- Use guide content to fill the design gate: owner boundary, main API, reference pattern, and test plan.
- Do not duplicate fast-changing matrices or backend support tables in this skill; prefer the guide as the current reference.
- If a code change adds a backend, quantization method, communication strategy, fused-communication behavior, EPLB behavior, or test convention, check whether the guide also needs an update.
- If guide and code disagree, inspect code and tests, mention the mismatch, and either update the guide as part of the change or report it as follow-up.
Guide-update checklist:
- File map changed: update `File Map`.
- Backend or quant support changed: update `Backend Capability Matrix`.
- New backend/communication/forward-execution pattern: update `Canonical Examples`.
- New forbidden pattern or ownership rule: update `Anti-Patterns`.
- Test convention changed: update `Tests`.
Core Principle
Preserve these owner boundaries:
- `ConfigurableMoE` is the assembler/orchestrator. It wires backend, communication, EPLB, weight lifecycle delegation, and shared wrapper bookkeeping.
- Backends declare capabilities, run MoE computation, and own the MoE module's weight lifecycle boundary. They expose and implement `create_weights`, `load_weights`, `post_load_weights`, `process_weights_after_loading`, and `pre_reload_weights` as needed, select any `FusedMoEMethod`, and make those hooks compatible with ConfigurableMoE deferred weight creation and reload flows. Backends may delegate quantization-specific tensor layout, loading, post-load transforms, and scale setup to a quantization method, but backend lifecycle hooks remain the public owner of weight handling. New ConfigurableMoE-compatible backends should expose `quantize_input` and `run_moe`, not `forward` or `forward_impl`, unless the user explicitly asks for legacy standalone behavior. For an active ConfigurableMoE-compatible backend, `run_moe` must be a concrete implementation, not an empty stub, and ConfigurableMoE or its scheduler should call `backend.run_moe(...)` as the compute entrypoint. Backend-specific alternatives such as `run_with_prequant` are acceptable only as private helpers called from `run_moe`, not as public wrapper/scheduler targets that bypass the common contract. The current `MoE` interface still covers both legacy standalone MoE modules and newer backends; treat legacy `forward` methods as transitional until a dedicated `MoEBackend` interface exists. Backends should not become orchestration or external-communication state machines.
- Quantization methods are backend-selected implementation helpers for quantization-specific weight tensor layout, loading details, post-load transforms, scale setup, and EPLB fix-up registration. They do not replace backend ownership of the weight lifecycle API.
- Communication strategies own external cross-rank dispatch/combine.
- `MoEScheduler` owns forward-time policy: padding/truncation, chunking, dispatch/quantize ordering, EPLB hook ordering, zero-token chunk behavior, external-vs-fused communication workflow, and backend `run_moe` invocation. `ConfigurableMoE` constructs the scheduler from `backend.scheduler_kind` and delegates to it; schedulers may read wrapper state and call wrapper helpers but must not own lifecycle, weight loading, DWDP record, `repeat_idx` advancement, or communication lifetime. The only sanctioned scheduler mutation of `moe.comm` is through `determine_communication_method` fallback.
- Shared test helpers own backend/quantization matrices and skip logic. Updating one test file while leaving `moe_test_utils.py` or `quantize_utils.py` stale is usually incomplete.
- Tests should exercise the boundary that changed: backend, module, communication, routing, EPLB, or multi-GPU behavior.
A refactor is good only if it keeps these roles clearer than before.
Module Blocks
ConfigurableMoE: Assembler
Role:
- Compose backend, communication strategy, EPLB, and wrapper-level lifecycle.
- Keep `forward_impl` focused on wrapper-level work: resolve output dtype, delegate execution, record DWDP, advance `repeat_idx` once.
- Own backend construction/sync and validation, not backend-specific forward policy.
Main APIs / references:
- `configurable_moe.py`: `ConfigurableMoE.__init__`, backend construction, communication strategy creation/bypass, scheduler construction, `forward_impl`, `validate_backend`.
- `MOE_DEVELOPER_GUIDE.md`: ConfigurableMoE orchestrator and file map.
Checklist:
- New behavior still leaves `ConfigurableMoE` as an assembler.
- No new backend-specific fast path in `forward_impl` unless it is a temporary compatibility bridge with a clear follow-up.
- `forward_impl` or an extracted scheduler should invoke backend computation through `backend.run_moe(...)`; direct calls to backend-specific compute entrypoints such as `run_with_prequant` are red flags unless the change is explicitly a short-lived adapter and `run_moe` remains the real implementation.
- Shared wrapper state such as `repeat_idx`, DWDP record, backend attr sync, and communication lifetime stays in one place.
- Scheduler creation happens after backend, communication, chunking streams, validation, and optional DWDP setup are initialized, because schedulers read that wrapper state.
- `forward_impl` should not accumulate chunking, routing, communication, EPLB, or fused-kernel branches; that policy belongs in `MoEScheduler`.
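The assembler role described above can be illustrated with a toy sketch. It is not the TensorRT-LLM implementation: `ToyBackend`, `ToyScheduler`, and `ToyConfigurableMoE` are hypothetical stand-ins operating on plain lists, and only the method names `quantize_input`, `run_moe`, `forward_impl`, and the `repeat_idx` counter are borrowed from the text.

```python
class ToyBackend:
    """Hypothetical backend exposing only the common compute contract."""

    def quantize_input(self, x):
        return x  # no-op quantization for the sketch

    def run_moe(self, x):
        return [2 * v for v in x]  # placeholder for expert compute


class ToyScheduler:
    """Hypothetical scheduler: owns forward policy, never touches repeat_idx."""

    def __init__(self, moe):
        self.moe = moe  # reads wrapper state only

    def forward(self, x):
        backend = self.moe.backend
        return backend.run_moe(backend.quantize_input(x))


class ToyConfigurableMoE:
    """Hypothetical assembler: wires parts together and keeps wrapper bookkeeping."""

    def __init__(self, backend):
        self.backend = backend
        self.repeat_idx = 0
        # Created last, because the scheduler reads wrapper state.
        self.scheduler = ToyScheduler(self)

    def forward_impl(self, x):
        out = self.scheduler.forward(x)  # delegate execution
        self.repeat_idx += 1             # advanced exactly once per call
        return out
```

The point of the shape is that `forward_impl` contains no chunking, routing, or communication branches; anything of that kind would live in the scheduler.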
MoE Scheduler: Forward Execution Strategy
Role:
- Own per-forward execution policy for ConfigurableMoE: padding/truncation, chunking, dispatch ordering, adaptive pre/post quant dispatch, EPLB wait/stat update/route/CPU-stage hook ordering, zero-token chunk behavior, and backend `run_moe` invocation.
- Select external-vs-fused communication behavior through `MoESchedulerKind`, not through wrapper `isinstance` checks.
- Read wrapper state and call wrapper helpers, but do not own module lifecycle, backend construction, weight lifecycle, communication object lifetime, DWDP record, or `repeat_idx` advancement.
Main APIs / references:
- `moe_scheduler.py`: `MoEScheduler`, `ExternalCommMoEScheduler`, `FusedCommMoEScheduler`, `create_moe_scheduler`.
- `interface.py`: `MoESchedulerKind` and backend `scheduler_kind`.
- `configurable_moe.py`: scheduler construction and thin `forward_impl` delegation.
- `communication/base.py`: `supports_post_quant_dispatch`, `prepare_dispatch`, `dispatch`, and `combine` contracts used by `ExternalCommMoEScheduler`.
Checklist:
- New forward policy goes in `moe_scheduler.py`, not in ConfigurableMoE or a backend, unless it is truly backend-local compute inside `run_moe`.
- `ExternalCommMoEScheduler` owns host-side dispatch/combine, communication fallback, optional multi-stream chunk overlap, padding/truncation, and external communication EPLB statistic paths.
- `FusedCommMoEScheduler` owns fused-kernel lockstep: ADP stripping, per-rank-consistent chunk count, zero-token launches, no external dispatch/combine, and `ignore_allreduce=False` EPLB statistic update.
- Schedulers call `backend.quantize_input(...)` and `backend.run_moe(...)`; they must not call backend-specific alternate compute helpers that bypass `run_moe`.
- Schedulers must not advance `repeat_idx`, run DWDP record/prefetch, create or destroy communication strategies, or call weight lifecycle hooks.
- If backend-specific kwargs are needed, keep them centralized and narrow inside scheduler helper code, with comments explaining why the common `run_moe` contract is insufficient for that backend.
- Add/update module-level tests for changed scheduler behavior, especially chunking, zero-token chunks, DP padding/truncation, EPLB hook order, and fused-communication lockstep.
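Selecting a scheduler from a backend-declared kind, rather than from `isinstance` checks, might look like the sketch below. All class and function names here (`SchedulerKind`, `ToyComm`, `ToyBackend`, `create_scheduler`, and the two scheduler classes) are hypothetical illustrations, not the contents of `moe_scheduler.py`, and the external path shows only one possible dispatch/quantize ordering.

```python
from enum import Enum, auto


class SchedulerKind(Enum):
    """Hypothetical stand-in for the backend-declared scheduler kind."""
    EXTERNAL_COMM = auto()
    FUSED_COMM = auto()


class ToyComm:
    """Records calls so the example shows which path touched it."""

    def __init__(self):
        self.calls = []

    def dispatch(self, tokens):
        self.calls.append("dispatch")
        return tokens

    def combine(self, tokens):
        self.calls.append("combine")
        return tokens


class ToyBackend:
    def __init__(self, scheduler_kind):
        self.scheduler_kind = scheduler_kind

    def quantize_input(self, tokens):
        return tokens

    def run_moe(self, tokens):
        return [t * 10 for t in tokens]


class ExternalCommScheduler:
    def forward(self, backend, comm, tokens):
        x = comm.dispatch(tokens)  # host-side exchange before compute
        y = backend.run_moe(backend.quantize_input(x))
        return comm.combine(y)


class FusedCommScheduler:
    def forward(self, backend, comm, tokens):
        # The kernel owns the exchange, so the external strategy is never called.
        return backend.run_moe(backend.quantize_input(tokens))


_SCHEDULERS = {
    SchedulerKind.EXTERNAL_COMM: ExternalCommScheduler,
    SchedulerKind.FUSED_COMM: FusedCommScheduler,
}


def create_scheduler(backend):
    # Selection keys off the declared kind, not isinstance checks on the backend.
    return _SCHEDULERS[backend.scheduler_kind]()
```

Both schedulers reach compute only through `quantize_input` and `run_moe`, which keeps the common contract as the single entrypoint.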
MoE Backend
Role:
- Pure MoE computation and backend-specific capability/config validation.
- Own module-level weight handling and lifecycle delegation through `create_weights`, `load_weights`, `post_load_weights`, `process_weights_after_loading`, and `pre_reload_weights`.
- Own `quantize_input` and `run_moe` shape/kernel contracts. `run_moe` must launch the backend compute path for every active ConfigurableMoE-compatible backend. Do not leave it as `NotImplementedError` while the wrapper calls an alternate method such as `run_with_prequant`.
- Do not implement `forward` or `forward_impl` for new ConfigurableMoE-compatible backends unless the user explicitly requests legacy standalone behavior; if required, document why the normal backend contract is insufficient.
- Declare whether the backend's cross-rank exchange is external to the kernel or fused inside the kernel.
Main APIs / references:
- `interface.py`: `MoE`, `scheduler_kind`, `can_implement`, `_supports_load_balancer`, `validate_configurable_moe` when present, and weight lifecycle hooks (`create_weights`, `load_weights`, `post_load_weights`, `process_weights_after_loading`, `pre_reload_weights`).
- `fused_moe_cutlass.py`: reference backend using external communication.
- `mega_moe/`: reference area for a fused-communication backend.
- `create_moe.py`: backend selection and fallback path.
Checklist:
- `can_implement()` returns clear `(False, reason)` for unsupported quant, dtype, shape, or hardware.
- Backend weight lifecycle hooks are implemented or explicitly rejected with a narrow error; `create_weights()` is safe under ConfigurableMoE deferred weight creation, `load_weights()` honors or rejects `allow_partial_loading`, and `post_load_weights()`/`process_weights_after_loading()`/`pre_reload_weights()` keep transformed weights and reload metadata coherent.
- The backend selects and stores the quantization method before delegating layout-specific weight registration/loading/transforms; callers should not need to reach into `quantization.py` directly.
- `run_moe` is implemented and is the method reached by ConfigurableMoE or the scheduler. If a helper like `run_with_prequant` exists for performance or naming compatibility, it is called from `run_moe`, not directly from wrapper policy code.
- Cross-rank exchange ownership is explicit via `scheduler_kind` and not hidden behind wrapper `isinstance` checks. Backends with kernel-fused exchange declare `MoESchedulerKind.FUSED_COMM`; normal backends use `EXTERNAL_COMM`.
- Backend-specific wrapper constraints go in a validation hook or an equivalent narrow contract, not in scattered forward branches.
- Weight handling remains backend API scope even when the actual tensor layout is implemented by a `FusedMoEMethod`.
- Do not add external host communication logic to a backend, except for a true fused-communication backend whose kernel owns the exchange.
- New backend tests belong in `test_moe_backend.py`.
- Existing legacy `forward` methods can be read for compatibility context, but they are not the default pattern for new backend work.
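Two of the checklist items above, a `can_implement()` that returns `(False, reason)` and a `run_moe` that stays the real entrypoint while a helper remains private, can be sketched as follows. `ToyBackend`, its supported-quantization set, and the multiple-of-128 shape rule are invented for illustration and do not describe any real backend's constraints.

```python
class ToyBackend:
    """Hypothetical backend; the capability rules below are illustrative only."""

    SUPPORTED_QUANT = {"none", "fp8"}

    @classmethod
    def can_implement(cls, quant, hidden_size):
        # Return (ok, reason) so callers can report why a backend was rejected.
        if quant not in cls.SUPPORTED_QUANT:
            return False, f"unsupported quantization: {quant!r}"
        if hidden_size % 128 != 0:
            return False, f"hidden_size {hidden_size} must be a multiple of 128"
        return True, None

    def run_moe(self, x):
        # The common contract is the public entrypoint; the helper stays private.
        return self._run_with_prequant(x)

    def _run_with_prequant(self, x):
        return [v + 1 for v in x]  # placeholder compute
```

A wrapper or scheduler would call only `run_moe`; the private helper exists for the backend's own convenience and can be renamed or removed without touching policy code.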
Quantization And Weights
Role:
- Weight handling is backend scope at the module/API boundary: the backend exposes the lifecycle hooks, owns when they are called, and is accountable for reload/EPLB consistency.
- Quantization-specific tensor creation, loading details, post-load transforms, quant scales, and EPLB weight fix-ups should live in `quantization.py` as a backend-selected `FusedMoEMethod` implementation when they are specific to a quantization layout.
- When adding new weight handling, first look for a reusable existing quant method or base class before creating a new one, then make the backend select and invoke it through the lifecycle hooks.
Main APIs / references:
- `quantization.py`: `FusedMoEMethodBase`, `create_weights`, `load_weights`, `post_load_weights`, `setup_quant_scales`, `eplb_support_status`, `supports_online_eplb`, `need_load_shared_weights`.
- Existing quant methods in `quantization.py` are the reference patterns.
Checklist:
- New backend weight handling is surfaced through backend lifecycle hooks; new quantization-specific tensor layouts are represented by a backend-selected quantization method, not ad hoc caller or wrapper code.
- Existing quant method/layout is reused when the tensor layout and scale semantics match.
- `create_weights()` registers module parameters with the correct slot, expert, hidden, intermediate, and scale layout.
- `load_weights()` handles supported loading modes and rejects unsupported ones clearly. Preserve the EPLB split: common MoE FC weights/biases (`w3_w1_weight`, `w2_weight`, and bias tensors when present) use the shared `FusedMoEMethodBase.load_weights()`/`post_load_weights()` path, where `need_load_shared_weights(module)` gates CPU shared staging and registration.
- Quantization methods add only their quantization-specific EPLB registrations for scales, alphas, transformed weights, or layout-specific views that are not covered by the base FC weight path. Those extra tensors must also be gated by `need_load_shared_weights(module)` before loading, transforming, or registering shared copies. If a specialized method cannot reuse the base FC path because its raw parameter layout is incompatible, the design must call out that exception and preserve equivalent base semantics explicitly.
- `post_load_weights()` performs transforms, shared-weight setup, and scale setup in the quantization method only for tensors outside the base FC path; base FC weight registration should still flow through the base class whenever possible.
- `setup_quant_scales()` is updated when a quant mode exposes scales consumed by backend, communication, or forward-execution paths.
- EPLB support status is explicit: `SUPPORTED`, `NOT_SUPPORTED`, or `NOT_VERIFIED`.
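The two-layer split described above (base class handles the common FC weights, the concrete method adds only its own tensors, and both layers use the same gate) can be sketched with plain dictionaries. The classes `ToyMethodBase` and `ToyScaledMethod`, the `dynamic_eplb` flag, and the dict-based "module" are assumptions made for this example; only the method and tensor names echo the text.

```python
from enum import Enum


class EplbSupportStatus(Enum):
    SUPPORTED = "supported"
    NOT_SUPPORTED = "not_supported"
    NOT_VERIFIED = "not_verified"


class ToyMethodBase:
    """Hypothetical base path: common FC weights plus their shared staging."""

    eplb_support_status = EplbSupportStatus.NOT_VERIFIED

    def need_load_shared_weights(self, module):
        # Assumed gate: shared staging is only needed with dynamic EPLB.
        return module.get("dynamic_eplb", False)

    def load_weights(self, module, weights):
        for name in ("w3_w1_weight", "w2_weight"):
            module[name] = weights[name]
            if self.need_load_shared_weights(module):
                module.setdefault("shared", {})[name] = weights[name]


class ToyScaledMethod(ToyMethodBase):
    """Hypothetical quant method: adds only its own scale tensor."""

    eplb_support_status = EplbSupportStatus.SUPPORTED

    def load_weights(self, module, weights):
        super().load_weights(module, weights)  # FC weights flow through the base
        module["weight_scale"] = weights["weight_scale"]
        if self.need_load_shared_weights(module):  # same gate as the base path
            module["shared"]["weight_scale"] = weights["weight_scale"]
```

Because the subclass never re-registers `w3_w1_weight` or `w2_weight`, a change to the base staging logic applies to every method that reuses it.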
EPLB
Role:
- EPLB is cross-cutting. A correct change may need updates in interface, quantization, forward execution, communication, and tests.
- Do not treat EPLB as only a backend flag.
Main APIs / references:
- `interface.py`: `_supports_load_balancer`, `_add_raw_shared_weights_for_unmap`, `_using_load_balancer`, `_using_dynamic_load_balancer`, validation hooks.
- `quantization.py`: `eplb_support_status`, `need_load_shared_weights`, `register_all_parameter_slot_and_to_fix_weight_fns`, `setup_quant_scales`, `post_load_weights`.
- Current forward-execution code: statistic update, route, `ignore_allreduce`, per-chunk first/last hook ordering.
- `test_moe_module.py`: EPLB params and `generate_*_eplb_test_params`.
Checklist:
- Backend reports whether load balancing is supported.
- Quantization method declares online EPLB status.
- EPLB weight registration is split into two layers:
  - Common MoE FC weights/biases are handled by `FusedMoEMethodBase` using `need_load_shared_weights(module)` in its shared-load/register flow.
  - Quantization-specific scales, alphas, transformed weights, or layout views are handled by the concrete quantization method and must add their own `need_load_shared_weights(module)`-gated shared-load/register logic.
- Shared quant-specific tensors needed by EPLB are registered in the quantization method, including any fix-up functions for transformed weights.
- Forward execution collects routing statistics and chooses `ignore_allreduce` correctly for the communication path.
- EPLB hook order is preserved around routing, `run_moe`, and CPU weight migration.
- `num_slots`, `num_experts`, `ep_size`, and slot-vs-expert IDs are not mixed.
- Add or update concrete EPLB tests in `test_moe_module.py`, including the backend/comm/quant combination that changed.
CPU shared-staging buffer family (EPLB migration)
Dynamic EPLB needs host-resident copies of per-expert tensors so that `MoeLoadBalancer` can migrate experts between ranks via host shared memory. Each per-expert `nn.Parameter` on the module has a parallel CPU staging buffer; all of them are passed to `register_all_parameter_slot_and_to_fix_weight_fns` once loading finishes. Any new per-expert Parameter MUST add its own staging buffer and migration hook, or the shared-load path will either write out of bounds or silently corrupt routed slots (NVBug 6130334 / PR #13856).
Full family in the NVFP4 path (`quantization.py`):
| GPU | CPU shared staging buffer | Sized by |
|---|---|---|
| | | |
| | | same |
| | | same |
| | | same |
| | | |
| | | same |
Key index-space distinction:
- `expert_size_per_partition = num_slots / ep_size` is the routed-slot count on this rank; sizes the on-GPU module Parameters.
- `num_shared = len(local_shared_load_expert_ids) = num_experts / shared_size`, where `shared_size = shared_mpi_comm.Get_size()` is the same-node MPI rank count (from `MPI_COMM_TYPE_SHARED` split); sizes the CPU staging buffers.
- On multi-node setups `shared_size < ep_size` is legal and makes `num_shared > expert_size_per_partition`. Any code that writes into a routed-sized Parameter using a staging-space index will go out of bounds.
- On single-node setups `shared_size == ep_size` is enforced by the `assert shared_size == local_size` in `MoeLoadBalancer._setup_mpi_comm`, so single-node unit tests cannot exercise the `num_shared > expert_size_per_partition` failure mode through parameter tuning alone. A regression test for staging-index correctness must either (a) invoke the reconcile/migration function directly with a crafted staging dict, or (b) run on a real multi-node Slurm environment.
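The two sizes above are simple arithmetic, and a worked instance makes the multi-node hazard concrete. The helper name `index_spaces` and the example numbers are illustrative assumptions; the formulas are the two given in the list.

```python
def index_spaces(num_slots, num_experts, ep_size, shared_size):
    """Return (routed size, staging size) using the formulas from the text."""
    expert_size_per_partition = num_slots // ep_size  # routed space, per rank
    num_shared = num_experts // shared_size           # staging space, per rank
    return expert_size_per_partition, num_shared


# Single node: 8 ranks, all on one host, so shared_size == ep_size.
single_node = index_spaces(num_slots=16, num_experts=16, ep_size=8, shared_size=8)

# Two nodes of 8 ranks each: ep_size is 16 but only 8 ranks share a host.
multi_node = index_spaces(num_slots=32, num_experts=32, ep_size=16, shared_size=8)
```

In the two-node case the routed Parameter has 2 entries per rank while the staging buffer has 4, so staging indices 2 and 3 have no valid position in the routed Parameter.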
Naming convention quirk: bulk weights and block-scales use `module.local_shared_*_tensors` (attribute on module, deleted after register); per-expert scalars (alphas, `weight_scale_2`) use `shared_*` (function-local). Both are equally valid migration sources -- the distinction is historical.
Checklist for adding a new per-expert Parameter to an EPLB-supporting quantization method:
- Register the on-module `nn.Parameter` sized `expert_size_per_partition` in `create_weights()`.
- In whichever loader fills it, also fill a `tmp_shared_*_weight_scale_X` dict keyed by `enumerate(local_shared_load_expert_ids)` during the `need_load_shared_weights(module)` branch.
- In `process_weights_after_loading()` (or the equivalent finalize step), allocate a CPU `shared_*` buffer sized `num_shared` and fill it from the temp dict. Pass it as an explicit destination to reconcile/compute helpers -- do NOT write into the on-module `.data[expert_idx]` from the shared path, since `expert_idx` is in staging space and the on-module Parameter is in routed space.
- Add the staging buffer to the `weight_fns` dict handed to `register_all_parameter_slot_and_to_fix_weight_fns({...})` so migration can find it.
- If the reconcile/compute helper is shared between routed and staging paths, its signature must take the destination tensor as a parameter (not read `module.<param>.data` directly), so the same body serves both index spaces.
Red flags:
- A new per-expert Parameter registered in `create_weights()` but never added to any `weight_fns` migration dict -- it will be stale after the first EPLB migration.
- A reconcile/compute function that both reads `tmp_shared_*` and writes `module.<per_expert_param>.data[expert_idx]` -- the staging-space index can exceed the routed-space bound (multi-node) or silently overwrite routed slots (single-node).
- Asymmetric gating: one of `fc31_*`/`fc2_*` pair registered but its twin not (or one added to `weight_fns` but not the other) -- migration will leave half the state stale.
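The "explicit destination" rule from the checklist can be shown with plain lists standing in for tensors. The helper `reconcile_scale`, its use of `max` as the reconcile step, and the buffer contents are all hypothetical; the sketch only demonstrates that one helper body can serve both index spaces when the destination is passed in.

```python
def reconcile_scale(dst, index, candidates):
    """Write the reconciled value into the caller-supplied destination.

    Taking dst as a parameter keeps the helper independent of index space.
    Using max() here is a placeholder for the real reconcile computation.
    """
    dst[index] = max(candidates)


expert_size_per_partition = 2  # routed space: on-module parameter length
num_shared = 4                 # staging space: CPU buffer length

routed_param = [0.0] * expert_size_per_partition
shared_staging = [0.0] * num_shared

# Temp dict filled during the shared-load branch, keyed by staging index.
tmp_shared_scales = {0: [0.5, 0.7], 1: [0.2], 2: [0.9, 0.1], 3: [0.4]}

# Shared path: staging indices only ever address the staging buffer.
for staging_idx, candidates in tmp_shared_scales.items():
    reconcile_scale(shared_staging, staging_idx, candidates)

# Routed path: same helper, its own destination and routed-space index.
reconcile_scale(routed_param, 0, [0.3, 0.6])
```

Had the shared loop written into `routed_param` instead, staging indices 2 and 3 would raise `IndexError` here; with real tensors on a single node the equivalent mistake overwrites routed slots without an error.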
Communication
Role:
- External communication strategies implement dispatch/combine and expose what ordering they support relative to quantization.
- Backends whose kernel owns cross-rank exchange should bypass external communication strategies rather than being forced through the factory.
Main APIs / references:
- `communication/base.py`: `Communication`, `is_platform_supported`, `is_workload_feasible`, `supports_post_quant_dispatch`, `prepare_dispatch`, `dispatch`, `combine`.
- `communication/communication_factory.py`: strategy selection.
- Existing strategies: `nvlink_one_sided.py`, `nvlink_two_sided.py`, `deep_ep.py`, `allgather_reducescatter.py`.
Checklist:
- Strategy selection and forced method behavior are handled through the factory.
- `supports_post_quant_dispatch()` is correct for the payload layout.
- `prepare_dispatch()` is used only for metadata/statistics that must happen before dispatch.
- `dispatch()` and `combine()` maintain enough internal state for the pair to be correct.
- EPLB statistics gathered by the communication strategy are fed back to the load balancer through the forward-execution path.
- Add/update `test_moe_comm.py` or module-level tests when changing strategy behavior.
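The requirement that `dispatch()` and `combine()` keep enough state for the pair to be correct can be illustrated with a single-process toy. `ToyComm` is a hypothetical class that reorders a list by destination rank instead of performing any real cross-rank exchange; its signatures are assumptions and do not mirror `communication/base.py`.

```python
class ToyComm:
    """Hypothetical strategy: dispatch reorders tokens, combine restores them."""

    def __init__(self):
        self._order = None  # permutation remembered between dispatch and combine

    def dispatch(self, tokens, dest_ranks):
        # Group tokens by destination rank; sorted() is stable, so ties keep order.
        self._order = sorted(range(len(tokens)), key=lambda i: dest_ranks[i])
        return [tokens[i] for i in self._order]

    def combine(self, outputs):
        if self._order is None:
            raise RuntimeError("combine() called without a matching dispatch()")
        restored = [None] * len(outputs)
        for pos, src in enumerate(self._order):
            restored[src] = outputs[pos]  # undo the dispatch permutation
        self._order = None  # state is consumed; the next pair starts clean
        return restored


comm = ToyComm()
sent = comm.dispatch(["a", "b", "c"], dest_ranks=[1, 0, 1])
received = comm.combine([t.upper() for t in sent])
```

Clearing the stored permutation in `combine()` makes an unmatched second `combine()` fail loudly rather than reuse stale state.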
Forward Execution And Chunking
Role:
- Treat `moe_scheduler.py` as the current owner of forward-time policy. Use this section as the detailed checklist for scheduler changes and for reviews that suspect policy has leaked back into the wrapper or backend.
- Keep lifecycle outside this policy: backend construction, weight loading, communication strategy lifetime, DWDP record, and `repeat_idx` advancement remain wrapper-level concerns.
Main APIs / references:
- `moe_scheduler.py`: scheduler ABC, external/fused scheduler implementations, chunk helpers, EPLB hook order, and backend kwargs construction.
- `configurable_moe.py`: scheduler construction and wrapper lifecycle after scheduler return.
- Current communication interfaces and backend `run_moe`/`quantize_input` contracts.
- Existing tests that exercise module forward, multi-GPU EP, EPLB, and communication behavior.
Checklist:
- The wrapper advances `repeat_idx` once per `forward_impl`; schedulers must not mutate it independently.
- External-communication scheduler respects padding, chunking, communication fallback, quantize/dispatch order, EPLB hooks, and output truncation.
- Fused-communication path does not call external `Communication.dispatch` or `combine`.
- Per-chunk EPLB first/last-call behavior is preserved.
- Multi-stream overlap is used only on paths that support it.
- Add module or focused forward-path tests for new policy, especially chunking and zero-token behavior.
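As an illustration of the ownership split in this checklist, the following is a minimal sketch in plain Python. `ToyScheduler`, `ToyWrapper`, `split_into_chunks`, the event strings, and the modulo wrap of `repeat_idx` are all hypothetical; they are not the classes in `moe_scheduler.py` or `configurable_moe.py`. The toy does no work for zero tokens, which is an assumption made for brevity, not a statement about the real zero-token policy.

```python
from typing import Callable, List


def split_into_chunks(num_tokens: int, max_chunk_tokens: int) -> List[range]:
    """Return ordered index ranges covering num_tokens; zero tokens yields no chunks."""
    if max_chunk_tokens <= 0:
        raise ValueError("max_chunk_tokens must be positive")
    return [
        range(start, min(start + max_chunk_tokens, num_tokens))
        for start in range(0, num_tokens, max_chunk_tokens)
    ]


class ToyScheduler:
    """Owns forward-time policy only: chunking and hook ordering."""

    def __init__(self, max_chunk_tokens: int) -> None:
        self.max_chunk_tokens = max_chunk_tokens

    def run(
        self,
        tokens: List[int],
        backend_fn: Callable[[List[int]], List[int]],
        events: List[str],
    ) -> List[int]:
        chunks = split_into_chunks(len(tokens), self.max_chunk_tokens)
        outputs: List[int] = []
        for i, chunk in enumerate(chunks):
            if i == 0:
                events.append("eplb_first_call")  # hook fires on the first chunk only
            outputs.extend(backend_fn([tokens[j] for j in chunk]))
            if i == len(chunks) - 1:
                events.append("eplb_last_call")  # hook fires on the last chunk only
        return outputs


class ToyWrapper:
    """Owns lifecycle: it alone advances repeat_idx, once per forward_impl."""

    def __init__(self, scheduler: ToyScheduler, num_repeats: int) -> None:
        self.scheduler = scheduler
        self.num_repeats = num_repeats
        self.repeat_idx = 0

    def forward_impl(self, tokens, backend_fn, events):
        out = self.scheduler.run(tokens, backend_fn, events)
        self.repeat_idx = (self.repeat_idx + 1) % self.num_repeats  # assumed wrap rule
        return out


wrapper = ToyWrapper(ToyScheduler(max_chunk_tokens=2), num_repeats=3)
events: List[str] = []
result = wrapper.forward_impl([1, 2, 3, 4, 5], lambda c: [x * 2 for x in c], events)
```

A focused test for a new policy would assert on exactly these observables: output order across chunks, hook order, and a single `repeat_idx` step per call.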
Routing And Factory
Role:
- Routing methods map router logits to expert or slot selections.
- Factory/config code selects a backend based on requested backend, quantization, hardware capability, and model config.
Main APIs / references:
- `routing.py`: routing method implementations.
- `create_moe.py`: `get_moe_cls`, `create_moe_backend`, `create_moe`.
- `moe_test_utils.py`: backend enum, backend class map, skip logic.
Checklist:
- Routing output dtype/shape matches backend and forward-execution expectations.
- Unsupported backend/quant/model combinations fall back or skip with clear reasons.
- Test skip logic mirrors backend `can_implement()` instead of hiding bugs with broad skips.
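To show what "routing output dtype/shape" means in practice, here is a generic softmax-then-top-k router written in plain Python. It is an assumption-laden sketch, not any specific class from `routing.py`: the function name `topk_softmax_routing`, the renormalization step, and the tie-breaking rule are all chosen for illustration.

```python
import math
from typing import List, Tuple


def topk_softmax_routing(
    router_logits: List[List[float]], top_k: int
) -> Tuple[List[List[int]], List[List[float]]]:
    """Map [num_tokens, num_experts] logits to per-token expert ids and weights."""
    selected_experts: List[List[int]] = []
    routing_weights: List[List[float]] = []
    for row in router_logits:
        if not 0 < top_k <= len(row):
            raise ValueError("top_k must be in [1, num_experts]")
        # Numerically stable softmax over the expert dimension.
        row_max = max(row)
        exps = [math.exp(x - row_max) for x in row]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Highest probability first; ties broken by lower expert id.
        order = sorted(range(len(row)), key=lambda i: (-probs[i], i))[:top_k]
        norm = sum(probs[i] for i in order)
        selected_experts.append(order)
        routing_weights.append([probs[i] / norm for i in order])
    return selected_experts, routing_weights


expert_ids, weights = topk_softmax_routing([[0.0, 2.0, 1.0, -1.0]], top_k=2)
```

Both outputs have shape `[num_tokens, top_k]`, with integer ids and floating-point weights; a backend or scheduler that expects a different layout should fail a shape check in tests rather than silently reinterpret the data.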
Test Matrix And Helpers
Role:
- Keep backend, quantization, model-shape, routing, communication, and CI/local test matrices centralized and consistent across backend-level and module-level tests.
- Keep skip reasons aligned with production capability checks such as `can_implement()` instead of hiding failures with broad local skips.
Main APIs / references:
- `tests/unittest/_torch/modules/moe/moe_test_utils.py`: `MoeBackendType`, `get_backend_class`, `get_quick_skip_reason`, backend-specific `should_skip_*`, `iter_base_test_configs`, CI acceleration logic.
- `tests/unittest/_torch/modules/moe/quantize_utils.py`: quantized test weight generation and quant-parameter setup.
- `test_moe_backend.py`: backend interface tests for `quantize_input` and `run_moe`.
- `test_moe_module.py`: ConfigurableMoE integration matrix, multi-GPU, and EPLB coverage.
- `test_moe_comm.py`: communication dispatch/combine coverage.
Checklist:
- New backend is added to `MoeBackendType`, `get_backend_class`, backend/module matrices, and skip logic.
- New quantization method is added to test quant parameters and EPLB support checks when applicable.
- New unsupported combination returns a precise skip reason tied to production capability checks.
- CI subset and local exhaustive matrix stay intentionally different and are documented in the test helpers.
- Legacy tests such as `test_fused_moe.py` are used only for compatibility; new ConfigurableMoE behavior belongs in `test_moe_backend.py`, `test_moe_module.py`, or focused comm/routing/load-balancer tests.
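One way to keep skip reasons tied to production capability checks is to derive them from the backend's own check instead of restating conditions in each test. The sketch below assumes a `can_implement(quant, sm)` classmethod that returns `(ok, reason)`; that signature, the `ToyBackend` classes, the quantization labels, and the helper names `get_skip_reason` and `iter_test_configs` are hypothetical and are not the helpers in `moe_test_utils.py`.

```python
from itertools import product
from typing import Iterator, List, Optional, Tuple, Type


class ToyBackend:
    """Hypothetical backend; the real can_implement() signature may differ."""

    supported_quant = frozenset({"none"})
    min_sm = 80

    @classmethod
    def can_implement(cls, quant: str, sm: int) -> Tuple[bool, Optional[str]]:
        if quant not in cls.supported_quant:
            return False, f"{cls.__name__} does not support quant={quant}"
        if sm < cls.min_sm:
            return False, f"{cls.__name__} requires SM >= {cls.min_sm}, got {sm}"
        return True, None


class ToyFp8Backend(ToyBackend):
    supported_quant = frozenset({"none", "fp8"})
    min_sm = 90


def get_skip_reason(backend_cls: Type[ToyBackend], quant: str, sm: int) -> Optional[str]:
    """Single source of truth: the skip reason is whatever production code reports."""
    ok, reason = backend_cls.can_implement(quant, sm)
    return None if ok else reason


def iter_test_configs(
    backends: List[Type[ToyBackend]], quants: List[str], sm: int
) -> Iterator[Tuple[Type[ToyBackend], str, Optional[str]]]:
    """Yield the full matrix; each entry carries its own precise skip reason."""
    for backend_cls, quant in product(backends, quants):
        yield backend_cls, quant, get_skip_reason(backend_cls, quant, sm)


configs = list(iter_test_configs([ToyBackend, ToyFp8Backend], ["none", "fp8"], sm=80))
```

With this shape, a test body can call `pytest.skip(reason)` whenever the third element is not `None`, so widening a backend's capability in production code automatically widens test coverage.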
Design Gate
Before editing, write a short gate:
```markdown
## MoE Design Gate

- Change area: <ConfigurableMoE / MoEScheduler-forward-execution / backend / quantization-weights / EPLB / communication / routing-factory / test-matrix / tests>
- Owner boundary: <where the behavior belongs and why>
- Main API touched: <method/class names>
- Reference pattern: <existing file/class/function from references/moe-canonical-code-examples.md, with file:line evidence>
- Guide sections used: <MOE_DEVELOPER_GUIDE.md sections>
- Guide update needed: <yes/no; which section if yes>
- Refactor needed: <yes/no; one reason tied to architecture, not style>
- Test plan: <backend/module/comm/routing/EPLB/multi-GPU tests>
```
If the owner boundary is unclear, inspect more code before editing.
Refactor Rubric
Recommend a refactor when it:
- Moves behavior to the correct owner boundary.
- Simplifies `ConfigurableMoE` while preserving its assembler role.
- Clarifies backend ownership of the weight lifecycle and quantization-method delegation for weights/scales.
- Makes backend capabilities and unsupported combinations explicit.
- Separates external-communication and fused-communication policies cleanly in `MoEScheduler` rather than wrapper/backend branches.
- Makes EPLB support testable across interface, quantization, forward execution, and module tests.
- Updates shared test matrices/helpers when backend, quantization, or skip semantics change.
- Reduces duplicate dispatch/chunking/EPLB ordering logic by centralizing forward-time policy in `moe_scheduler.py` without changing performance-critical semantics.
Reject or question a refactor when it:
- Adds backend-specific forward branches to `ConfigurableMoE` instead of selecting behavior through `MoESchedulerKind`/`MoEScheduler`.
- Moves weight layout logic out of quantization methods without a strong reason.
- Hides hardware or quantization constraints behind vague abstractions.
- Changes communication/EPLB ordering without tests.
- Adds one-off skips in individual tests instead of shared capability/skip helpers.
- Touches legacy MoE paths for new features when the ConfigurableMoE path should be used.
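The first rejection criterion can be illustrated with a small sketch of selecting forward policy through a scheduler kind rather than branching on backend type inside the wrapper. Everything here is hypothetical: `ToySchedulerKind` and its members, the two scheduler classes, `ToyKernelBackend`, `ToyComm`, and `build_scheduler` are illustrative names, and the one-argument `run_moe` is a simplification, not the real contract.

```python
from enum import Enum, auto
from typing import List


class ToySchedulerKind(Enum):
    EXTERNAL_COMM = auto()  # scheduler drives dispatch/combine around the backend
    FUSED_COMM = auto()     # backend kernel owns cross-rank exchange


class ToyComm:
    """Records calls so a test can verify which path touched communication."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def dispatch(self, tokens: List[int]) -> List[int]:
        self.calls.append("dispatch")
        return tokens

    def combine(self, output: List[int]) -> List[int]:
        self.calls.append("combine")
        return output


class ToyKernelBackend:
    def __init__(self, scheduler_kind: ToySchedulerKind) -> None:
        self.scheduler_kind = scheduler_kind  # the backend declares its needs

    def run_moe(self, tokens: List[int]) -> List[int]:
        return [t * 2 for t in tokens]


class ExternalCommScheduler:
    def forward(self, tokens, backend, comm):
        return comm.combine(backend.run_moe(comm.dispatch(tokens)))


class FusedCommScheduler:
    def forward(self, tokens, backend, comm):
        return backend.run_moe(tokens)  # never calls external dispatch/combine


_SCHEDULERS = {
    ToySchedulerKind.EXTERNAL_COMM: ExternalCommScheduler,
    ToySchedulerKind.FUSED_COMM: FusedCommScheduler,
}


def build_scheduler(backend: ToyKernelBackend):
    """The wrapper looks up a scheduler by kind; it has no isinstance branches."""
    return _SCHEDULERS[backend.scheduler_kind]()


fused_comm, external_comm = ToyComm(), ToyComm()
fused_backend = ToyKernelBackend(ToySchedulerKind.FUSED_COMM)
external_backend = ToyKernelBackend(ToySchedulerKind.EXTERNAL_COMM)
fused_out = build_scheduler(fused_backend).forward([1, 2], fused_backend, fused_comm)
external_out = build_scheduler(external_backend).forward([1, 2], external_backend, external_comm)
```

Adding a third policy in this design means adding one enum member and one scheduler class; the wrapper's forward path does not change.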
Review Output
For reviews, lead with findings and concrete references:
```markdown
## Findings

- [High] file:line <architecture, correctness, or testability issue>
- [Medium] file:line <maintainability or boundary issue>
- [Low] file:line <local cleanup>

## Architecture Fit

- ConfigurableMoE remains assembler: <yes/no>
- Owner boundaries respected: <yes/no>
- Scheduler boundary respected: <yes/no; forward policy in `moe_scheduler.py`, lifecycle in wrapper, compute in backend>
- Refactor recommended: <yes/no + reason>

## Guide Alignment

- Sections checked: <MOE_DEVELOPER_GUIDE.md sections>
- Guide update needed: <yes/no + section>

## Checklist Coverage

- Weights/quantization: <covered/gap>
- EPLB: <covered/gap>
- Communication: <covered/gap>
- MoEScheduler/forward execution: <covered/gap>
- Backend: <covered/gap>
- Forward execution/chunking details: <covered/gap>
- Test matrix/helpers: <covered/gap>
- Tests: <covered/gap>
```
If there are no findings, say so and list remaining test or performance risk.
Test Selection
Prefer the unified MoE tests:
- Shared test matrix/helper changes: inspect `tests/unittest/_torch/modules/moe/moe_test_utils.py` and `quantize_utils.py`, then run the affected backend/module tests below.
- Backend interface changes: `pytest tests/unittest/_torch/modules/moe/test_moe_backend.py -k '<backend or quant>'`.
- Module/create/forward changes: `pytest tests/unittest/_torch/modules/moe/test_moe_module.py -k '<backend or feature>'`.
- Communication changes: `pytest tests/unittest/_torch/modules/moe/test_moe_comm.py -k '<strategy>'`.
- Routing changes: `pytest tests/unittest/_torch/modules/test_moe_routing.py -k '<routing>'`.
- Load balancer changes: `pytest tests/unittest/_torch/modules/test_moe_load_balancer.py -k '<case>'`.
- Multi-GPU EP/all-to-all behavior: `pytest tests/unittest/_torch/multi_gpu/test_moe_a2a.py -k '<case>'`.
When GPU resources are required, use the TRT-LLM GPU allocation/test-runner
skills first and record skipped tests with reasons.
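The mapping from change area to test command above can also be kept as data, so that a review note or script builds the command the same way every time. This is a convenience sketch only: the area labels, `TEST_TARGETS`, and `pytest_command` are hypothetical names, the paths are copied from the list above, and the `-k` value stays a placeholder that the caller must fill in.

```python
import shlex
from typing import Dict, List

# Paths copied from the list above; the dictionary keys are illustrative labels.
TEST_TARGETS: Dict[str, str] = {
    "backend": "tests/unittest/_torch/modules/moe/test_moe_backend.py",
    "module": "tests/unittest/_torch/modules/moe/test_moe_module.py",
    "communication": "tests/unittest/_torch/modules/moe/test_moe_comm.py",
    "routing": "tests/unittest/_torch/modules/test_moe_routing.py",
    "load_balancer": "tests/unittest/_torch/modules/test_moe_load_balancer.py",
    "multi_gpu_a2a": "tests/unittest/_torch/multi_gpu/test_moe_a2a.py",
}


def pytest_command(area: str, keyword: str) -> List[str]:
    """Build an argv list for the given change area and -k expression."""
    if area not in TEST_TARGETS:
        raise KeyError(f"unknown change area: {area!r}")
    return ["pytest", TEST_TARGETS[area], "-k", keyword]


# shlex.join quotes the -k expression so the line can be pasted into a shell.
command_line = shlex.join(pytest_command("communication", "<strategy>"))
print(command_line)
```

Returning an argv list rather than a string keeps the helper safe to pass to `subprocess.run` without shell quoting concerns.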