import-model
Import a model into MAX
Input: a Hugging Face model ID (`$ARGUMENTS`).

Copy references/template.md to track this port as you work through the phases.

Porting a model to MAX means writing a MAX graph that performs the same computation as the model's `modeling_<type>.py` in Hugging Face `transformers`, then loading the released weights into that graph and verifying the outputs match.

The workflow has three phases: decide & plan, implement, verify. Phase 1 is reading and planning. Phase 2 is the port: implement every divergent sublayer in the graph. Phase 3 is verification, only after implementation is complete. Guards (preconditions that stop the line) gate the transitions between activities; they are not steps of their own.

Anti-pattern: running `scaffold.py`, tweaking `arch.py`, and serving while `<slug>.py` still implements the donor (`llama3`, `qwen3`, …). That is not a bring-up: logit verification will fail because the wrong architecture is running. Do not run verification scripts until the implement-graph.md completion criteria pass.

Each phase links to references with the details. Read the reference for the activity you're on, not all of them upfront.

Environment: run every command through the pixi env that has MAX installed (`pixi run python …`, `pixi run max serve …`), from the skill root where `pixi.toml` lives (do not use bare `python` or `max` on the shell PATH):

```bash
cd <path-to-skill>
pixi install
pixi run python scripts/inspect_hf.py <HF_MODEL_ID>
```
Or smoke-test all scripts (no GPU) with `pixi run test-scripts`.
Helper scripts live in this skill's `scripts/` directory (copy or vendor them into your repo). All helpers are also reachable through a unified dispatcher with the same argument names and exit codes:

```bash
pixi run python scripts/import_model.py inspect <HF_MODEL_ID>
pixi run python scripts/import_model.py scaffold <HF_MODEL_ID> --start-from llama3 --output-dir ./
pixi run python scripts/import_model.py list-archs --match LlamaForCausalLM
pixi run python scripts/import_model.py check-walls <HF_MODEL_ID>
pixi run python scripts/import_model.py list-keys <HF_MODEL_ID> --summary
pixi run python scripts/import_model.py gates <HF_MODEL_ID> --port-dir <port_dir>/
pixi run python scripts/import_model.py compare <HF_MODEL_ID> --slug <slug> --port 8000
```

Port layout:
- `<port_dir>`: slug folder containing `arch.py` and `ARCHITECTURES` in `__init__.py` (usually `<output_dir>/<slug>/`). Pass this path to both `--custom-architectures` and `run_oss_gates.py --port-dir`.

MAX resolves `--custom-architectures <port_dir>` by adding `dirname(<port_dir>)` to `sys.path` and importing `basename(<port_dir>)` as the module. Passing the parent directory imports the wrong module name (e.g. `custom-arch` instead of your slug).

Import/API errors while editing: copy the donor arch under `modular/max/python/max/pipelines/architectures/<donor>/`; see pitfalls-config.md § Import and config API traps.
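The resolution rule for `--custom-architectures` can be sketched in a few lines of Python. This illustrates the behavior described above and is not MAX's actual loader: `resolve_custom_arch` is a hypothetical helper, and dropping a trailing slash before splitting is an assumption about how the path is normalized.

```python
import os


def resolve_custom_arch(port_dir: str) -> tuple[str, str]:
    """Return (sys_path_entry, module_name) for a port directory.

    Hypothetical helper mirroring the rule described above: the parent
    directory goes on sys.path and the last path component is the module name.
    """
    # Assumption: a trailing slash is dropped before splitting. Without this,
    # os.path.basename("ports/mymodel/") would return an empty string.
    normalized = os.path.normpath(port_dir)
    return os.path.dirname(normalized), os.path.basename(normalized)


# Passing the slug folder yields the slug as the module name.
print(resolve_custom_arch("/work/custom-arch/mymodel/"))
# Passing the parent directory yields the parent's name instead of the slug.
print(resolve_custom_arch("/work/custom-arch"))
```

The second call shows why passing the parent directory goes wrong: the module name becomes the parent folder's name rather than your slug.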
Phase 1: Decide & plan
Guard: is the architecture already registered in MAX? Before writing any code, check whether MAX already registers the architecture class in your model's `config.json::architectures[0]`. If `pixi run python list_native_archs.py --match <Class>` returns a slug, run `pixi run max serve --model <HF_MODEL_ID>` and stop; no port is needed. Full procedure: native-arch-check.md.
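The class name that the guard looks up is the first entry of the `architectures` list in `config.json`. A minimal sketch of extracting it from the raw JSON text follows; `first_architecture` is a hypothetical helper (not one of the skill's scripts), and the sample config is illustrative rather than taken from a real checkpoint.

```python
import json


def first_architecture(config_text: str) -> str:
    """Return architectures[0] from the text of a Hugging Face config.json."""
    config = json.loads(config_text)
    architectures = config.get("architectures") or []
    if not architectures:
        raise ValueError("config.json has no 'architectures' entry")
    return architectures[0]


# Illustrative config fragment, not taken from a real checkpoint.
sample = '{"architectures": ["LlamaForCausalLM"], "model_type": "llama"}'
print(first_architecture(sample))
```

The returned string is what you would pass as `<Class>` to `--match`.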
Read `config.json`

Pull the config and read every field:

```bash
pixi run python -c "from transformers import AutoConfig; \
print(AutoConfig.from_pretrained('<HF_MODEL_ID>', trust_remote_code=True))"
```

Or use the helper, which fetches raw `config.json` from the Hub, runs the native-arch check, and prints every key mapped to the MAX API:

```bash
pixi run python inspect_hf.py <HF_MODEL_ID>
```

Then list safetensors metadata (keys, shapes, dtypes; no weight download):

```bash
pixi run python list_checkpoint_keys.py <HF_MODEL_ID> --summary
```

Each row is one `config.json` key → `pipeline_config.model.huggingface_config` (or `SupportedArchitecture` in `arch.py` for `architectures`, `torch_dtype`). Keys you cannot wire through `MyConfig.initialize()` are the deltas you implement in the graph. Field meanings and common deltas: read-config-json.md.

Scan for hard blockers before you commit to a port:

```bash
pixi run python check_walls.py <HF_MODEL_ID>
```

Exit 0 → continue. Exit 1 → review recognize-walls.md. Exit 2 → stop until the wall is resolved or scoped out.
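If you drive the wall scan from a script, the exit-code contract above maps naturally onto a small lookup. This is a sketch under assumptions: `interpret_exit_code` and `run_check_walls` are hypothetical wrappers, and treating any other exit code as a tooling failure is a guess, not documented behavior of `check_walls.py`.

```python
import subprocess

# Exit-code meanings as described above for check_walls.py.
ACTIONS = {
    0: "continue",
    1: "review recognize-walls.md",
    2: "stop until the wall is resolved or scoped out",
}


def interpret_exit_code(code: int) -> str:
    """Map a check_walls.py exit code to the next action."""
    # Assumption: any other code is treated as a tooling failure.
    return ACTIONS.get(code, f"unexpected exit code {code}; inspect the script output")


def run_check_walls(model_id: str) -> str:
    """Run the wall scan through pixi and return the recommended action."""
    result = subprocess.run(
        ["pixi", "run", "python", "check_walls.py", model_id], check=False
    )
    return interpret_exit_code(result.returncode)
```

`check=False` is deliberate: exit codes 1 and 2 are expected signals here, not errors that should raise.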
Read the model card
Open `https://huggingface.co/<HF_MODEL_ID>` and read the model card for:
- The paper or blog post. Skim its architecture section. Authors call out the interesting modifications (QK-norm, MLA, sliding-window attention, MoE routing) because those are what they want credit for.
- "Tricks" mentioned in the card. Phrases like "we introduce", "unlike prior models", "this is the first model to" mark deltas that will bite you during implementation if you miss them now.

If the card says the model is from a known family (Llama, Mistral, Qwen, Gemma), note that; the donor-comparison activity below will start from the closest already-ported variant of that family.

If the card mentions custom CUDA kernels, custom attention with no public reference, FP8/FP4-only released weights, ALiBi, or recurrence or state-space layers, see recognize-walls.md before going further. Some models can't be ported with the public MAX surface alone.
Propose a plan; accept a veto
Before any code, write a short paragraph stating what you'd do by default, then wait for the user to confirm or veto. Cover four axes (distribution shape, quantization variants, validation depth, hardware target), all derived from what you've already read. Don't ask blank questions; state a default and let them push back.

Full guidance and an example paragraph: plan-and-veto.md.

If estimated weight bytes do not fit one GPU, read distributed-transformer.md before choosing `--start-from`: distribution shape matters more than attention family alone.
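A back-of-the-envelope check for the "does it fit one GPU" question multiplies the parameter count by the bytes per parameter of the released dtype. The sketch below is illustrative only: the helper names, the 0.8 headroom factor, and the example sizes are assumptions, not values from the skill's references.

```python
# Bytes per parameter for common weight dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "float8": 1}


def estimate_weight_bytes(num_params: int, dtype: str) -> int:
    """Estimate the size of the weights alone, ignoring KV cache and activations."""
    return num_params * BYTES_PER_PARAM[dtype]


def fits_one_gpu(num_params: int, dtype: str, gpu_bytes: int, headroom: float = 0.8) -> bool:
    """Return True if the weights fit within a fraction of one GPU's memory.

    Assumption: `headroom` reserves the remaining memory for the KV cache and
    activations; pick a value that suits your workload.
    """
    return estimate_weight_bytes(num_params, dtype) <= gpu_bytes * headroom


# Illustrative numbers: 8e9 and 70e9 parameters in bfloat16 on an 80 GB GPU.
gpu = 80 * 10**9
print(fits_one_gpu(8 * 10**9, "bfloat16", gpu))
print(fits_one_gpu(70 * 10**9, "bfloat16", gpu))
```

When the second kind of result comes back, that is the cue to read the distributed reference before picking a donor.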
Compare with other MAX architectures
You're picking the closest already-ported MAX architecture to copy from. "Closest" means: same attention shape (dense vs. GQA vs. MLA vs. MoE), same MLP shape (gated vs. non-gated, dense vs. routed), same head layout (tied vs. untied, single Linear vs. multi-step).

List what your installed MAX registers (do not hard-code a slug list):

```bash
pixi run python list_native_archs.py
```

Heuristic HF-signal → donor slug hints are in map-to-max.md. Quick version:
| Your model | Start from |
|---|---|
| Llama 3-ish (GQA, RoPE, SwiGLU MLP) | |
| Gemma-ish (RMSNorm scale, logit softcap, dual norm) | |
| Qwen-ish (GQA, RoPE, may have QK-norm) | |
| Mistral-ish (sliding window) | |
| Phi-ish (partial RoPE) | |
| MoE (sparse experts, top-k routing) | |
| MLA (latent KV) | |
Open the chosen MAX arch's directory and read its top-level model file (usually `<slug>.py`). You're answering: which functions/classes need to change vs. stay the same when I port my model?

Now read the corresponding Hugging Face modeling file:

```bash
pixi run python -c "from transformers.models.<model_type> import modeling_<model_type>; \
print(modeling_<model_type>.__file__)"
```

Read the `__init__`, the attention `forward`, the MLP `forward`, the block class, and the final head. Compare each to the MAX equivalent. The reference read-modeling-code.md covers what to look for in each.

Output of this activity: a delta list, with one row per real difference between HF and the donor MAX arch (attention, MLP/MoE, block wiring, head, RoPE, masks). You implement every row in Phase 2. Three or fewer structural deltas → good donor choice. Many deltas → pick a closer donor or plan to rewrite whole classes. Do not proceed to verification with an empty or "looks Llama-ish" delta list.
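One way to keep the delta list honest is to hold it as structured rows rather than free text. The sketch below is an assumption about how you might record it; the `Delta` fields, the `assess_donor` helper, and the example rows are hypothetical and only encode the rule of thumb stated above.

```python
from dataclasses import dataclass


@dataclass
class Delta:
    """One real difference between the HF model and the donor MAX arch."""

    component: str  # e.g. "attention", "mlp", "block", "head", "rope", "mask"
    hf_behavior: str
    donor_behavior: str


def assess_donor(deltas: list[Delta], max_structural: int = 3) -> str:
    """Apply the rule of thumb above to a delta list."""
    if not deltas:
        return "empty delta list: re-read the modeling code before proceeding"
    if len(deltas) <= max_structural:
        return "good donor choice"
    return "pick a closer donor or plan to rewrite whole classes"


# Hypothetical rows for illustration only.
deltas = [
    Delta("attention", "applies QK-norm before RoPE", "no QK-norm"),
    Delta("head", "tied embeddings", "untied LM head"),
]
print(assess_donor(deltas))
```

Each row then becomes a concrete work item for Phase 2.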
Phase 2: Implement
Scaffold the file layout
Run `scaffold.py`:

```bash
pixi run python scaffold.py <HF_MODEL_ID> --start-from <max_arch_slug> --output-dir <output_dir>
```

This reads `architectures[0]` from the Hub `config.json` for `arch.py::name`, then copies the chosen native MAX architecture into `<output_dir>/<slug>/` as five files:
- `arch.py`: registration shell (verify `name=` and encoding)
- `model_config.py`: donor config (must be rewired during implementation)
- `model.py`: pipeline model shell
- `weight_adapters.py`: donor renames (must be rewritten for your checkpoint)
- `<slug>.py`: donor graph (must be edited to match HF during implementation)

After scaffold, you have a directory layout and a wrong graph. Stop here until the graph is implemented; do not serve.

Scaffold also leaves the donor's docstrings and code comments in place. Sed-renaming class names doesn't touch text that records what the file claims to do. After scaffold, `<slug>.py` opens with a docstring describing the donor; the new class claims behaviors (single-GPU support, QK-norm, post-attention norm, etc.) the new file may not have. Rewriting those docstrings is a required part of the graph implementation, not optional polish. See honest-docstrings.md for the three-sentence pattern every module docstring should follow and a mandatory audit checklist before declaring the implementation done.
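A quick way to confirm the scaffold produced the layout described above is to check for the five files by name. This is a minimal sketch, assuming the slug equals the port folder's name; `expected_files` and `missing_scaffold_files` are hypothetical helpers, not part of the skill's scripts.

```python
from pathlib import Path


def expected_files(slug: str) -> list[str]:
    """The five files the scaffold step is described as producing."""
    return ["arch.py", "model_config.py", "model.py", "weight_adapters.py", f"{slug}.py"]


def missing_scaffold_files(port_dir: str) -> list[str]:
    """Return the expected files that are absent from the port directory."""
    port = Path(port_dir)
    # Assumption: the slug is the name of the port folder itself.
    return [name for name in expected_files(port.name) if not (port / name).is_file()]
```

An empty result only means the files exist; the graph inside `<slug>.py` is still the donor's until you implement the deltas.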
Implement the graph
This is the bring-up. Phase 1 produced the config map and delta list; the implementation activity executes them in code.

Full checklist, work order, anti-patterns, and completion criteria: implement-graph.md.

In order:
- `model_config.py`: wire every `config.json` key from Phase 1 / `inspect_hf.py`. Set `get_kv_params()` head counts and head_dim to match HF.
- `weight_adapters.py`: map your checkpoint's safetensor keys to the MAX module names you will use. Run `list_checkpoint_keys.py` first; see rename-weights.md. After load, wire the coverage audit in state-dict-audit.md (especially MoE and `strict=False` tied embeddings).
- `<slug>.py`: for each row in the delta list, edit or replace the donor class so MAX `forward()` mirrors HF `forward()`:
  - Attention (Q/K/V, RoPE, mask, GQA, softcap, …)
  - MLP or MoE (activation, routing, shared experts, …)
  - Decoder block (norm order and residual wiring, which are not interchangeable with Llama)
  - Final norm and LM head (tie, logit scale, softcap)
- `arch.py`: confirm `name=` matches `architectures[0]`; `default_encoding` matches Hub `torch_dtype`.
- `model.py`: only if HF wraps the backbone differently (VL, multi-modal).

Read HF `modeling_<type>.py` while editing, not after verification fails. Subclass the donor only where HF and donor match; rewrite the class where the delta list said they differ.

The implementation is done when every item in implement-graph.md is checked, especially: every delta has a corresponding code change, weights load without orphan keys, and the scaffold-comment audit in honest-docstrings.md has been run with each match classified as OK / Lie / Stale. A passing audit is mandatory; declaring the implementation done without it leaves donor lies in the codebase that nothing downstream will catch.
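The "no orphan keys" criterion can be checked on key names alone, before any weights are loaded. The following is a sketch of the idea, not the MAX weight-adapter API: `rename_keys`, `uncovered_graph_keys`, the prefix map, and every key name shown are hypothetical.

```python
def rename_keys(checkpoint_keys, prefix_map):
    """Rename checkpoint keys by prefix; return (renamed, orphans).

    `renamed` maps each original key to its new name. `orphans` lists keys
    that no prefix rule matched.
    """
    renamed, orphans = {}, []
    for key in checkpoint_keys:
        for old, new in prefix_map.items():
            if key.startswith(old):
                renamed[key] = new + key[len(old):]
                break
        else:
            orphans.append(key)
    return renamed, orphans


def uncovered_graph_keys(graph_keys, renamed):
    """Return graph parameter names that no checkpoint key maps onto."""
    return sorted(set(graph_keys) - set(renamed.values()))


# Hypothetical key names for illustration only.
checkpoint_keys = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
    "vision_tower.patch_embed.weight",
]
prefix_map = {"model.": "", "lm_head.": "lm_head."}
renamed, orphans = rename_keys(checkpoint_keys, prefix_map)
print(orphans)
```

Checking both directions matters: orphans are checkpoint tensors the graph never receives, while uncovered graph keys are parameters that would be left uninitialized.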
Quick grep recipe (full classification rules in honest-docstrings.md):

```bash
pixi run rg -i -n 'qwen|llama|mistral|cohere|gemma|phi|deepseek|exaone|olmo|granite|qwen3|mixtral|single-GPU|single GPU|RMSNorm|QK-norm' <port_dir>/
```

Your implementation-complete message must explicitly attest to the audit (e.g. `"docstrings rewritten to the three-sentence pattern; rg returns N hits, all legitimate lineage references"`). A claim without the attestation isn't a completion.

Preflight (Hub config + arch registration; run before first serve):

```bash
pixi run python run_oss_gates.py <HF_MODEL_ID> --port-dir <port_dir>/
```

Guard: local smoke gate (mandatory before Phase 3). `pixi run max serve` cold-compiles for 5–25 minutes. Before serving, run the four local checks in serve-and-iterate.md (import smoke, graph dry-build, adapter⇄graph key diff, weights-format preflight). `run_oss_gates.py` covers walls, checkpoint metadata, and `arch.py` name/encoding; it is not a substitute for those four.
Phase 3: Verify
Check if it generates coherent text
Prerequisite: graph implementation complete. Do not serve to "see what happens" during implementation; fix config, adapters, and graph first.

Sanity-check the HF reference FIRST. Run HF alone on the model card's intended prompt template, before involving MAX. If HF itself produces gibberish, your oracle is broken, and fixing your port against a broken oracle wastes days.

Then serve with `pixi run max serve --model-path <HF_MODEL_ID> --custom-architectures <port_dir>` and probe with the model card's intended template (not just "The capital of France is", which is the wrong prompt for PrefixLMs and instruction-tuned models). Three possible outcomes: server crashes during load → fix config/adapters; server starts but returns garbage → divergence hunt; server returns plausible text → run at `max_tokens=64+` before celebrating.

Full HF-reference sanity check, encoder/embedding slug serve flow, and fix-test loop discipline: serve-and-iterate.md.
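A probe against the running server can be as small as one POST to the OpenAI-style completions endpoint. The sketch below assumes the server listens on localhost port 8000 and returns the standard `choices[0].text` response shape; `build_probe` and `send_probe` are hypothetical helpers, and the prompt you pass should already be formatted with the model card's template.

```python
import json
import urllib.request


def build_probe(model_id: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style completions payload for a greedy probe."""
    return {
        "model": model_id,
        "prompt": prompt,  # already formatted with the model card's template
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy decoding, so runs are comparable with HF
    }


def send_probe(payload: dict, port: int = 8000) -> str:
    """POST the payload to a local server and return the generated text.

    Assumption: the response follows the OpenAI completions shape.
    """
    request = urllib.request.Request(
        f"http://localhost:{port}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Keeping the payload builder separate from the network call makes it easy to reuse the same request for the HF-side comparison later.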
Layer-by-layer divergence hunt
This is the main loop. You're going to:
- Read the HF reference implementation to understand what should happen at each layer.
- Dump intermediate tensors from both implementations and find the first layer where they diverge.
- Fix that layer.
- Re-run the layer check.
- Repeat until all layers match.
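The "find the first layer where they diverge" step reduces to a scan over paired per-layer dumps. The sketch below works on flattened lists of floats so it has no dependencies; the function names and the `atol=1e-2` tolerance are assumptions, and how you obtain the dumps from each side is outside its scope.

```python
def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two flat sequences."""
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def first_divergent_layer(hf_layers, max_layers, atol=1e-2):
    """Return the index of the first layer whose outputs differ, or None.

    Each entry is one layer's output flattened to a list of floats.
    Assumption: atol is a loose tolerance suited to low-precision weights;
    tighten it when comparing in float32.
    """
    for index, (hf, mx) in enumerate(zip(hf_layers, max_layers)):
        if len(hf) != len(mx) or max_abs_diff(hf, mx) > atol:
            return index
    return None


# Illustrative dumps: layers 0 and 1 agree, layer 2 does not.
hf_dump = [[0.10, 0.20], [0.30, 0.40], [0.50, 0.60]]
max_dump = [[0.10, 0.20], [0.30, 0.401], [0.90, 0.60]]
print(first_divergent_layer(hf_dump, max_dump))
```

Only the first divergent index is worth fixing; everything after it is contaminated by the earlier error.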
Read the reference implementation
Before diving into the HF source, consult the symptom table at the top of divergences.md. Match what you're observing (gibberish at token 0, divergence growing with length, output plausible but text drifts, etc.) to its candidate causes, and read every candidate listed, not just the first plausible one. Several causes produce the same symptom; the bug is the one you haven't checked yet.

Then open the HF `modeling_<type>.py` as a debugger, not a reviewer. You're looking for the specific detail you missed. Common ones:
- A norm whose variant or position differs from the template
- A scale factor applied somewhere (`hidden_states * scale`, `attn_weights / sqrt(d)`, MuP multipliers)
- A different activation function in the MLP
- A different RoPE style (split-half vs. interleaved, partial vs. full)
- A boundary condition that only fires at certain layers (sliding-window vs. global attention, sink-token handling)

Easter-egg warning: HF modeling code inherits aggressively. A class named `MyModelDecoderLayer(GraniteDecoderLayer)` may inherit critical behavior from a different family entirely. Always chase inheritance up at least one level before concluding "this is just Llama with renamed fields."

The catalog of "this differs from Llama and here is how" is in divergences.md, indexed by symptom.
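The RoPE-style item in the list above is easy to get wrong because both conventions use the same angles and differ only in which dimensions are paired. The dependency-free sketch below shows the two pairings on a single head vector; the function names and the default base of 10000 are assumptions for illustration, and real implementations operate on batched tensors.

```python
import math


def rope_angles(position: int, dim: int, base: float = 10000.0) -> list[float]:
    """One rotation angle per pair of dimensions."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]


def rope_interleaved(x, position):
    """Rotate adjacent pairs (x[2i], x[2i+1])."""
    out = list(x)
    for i, theta in enumerate(rope_angles(position, len(x))):
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out


def rope_split_half(x, position):
    """Rotate pairs (x[i], x[i + dim/2])."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(position, len(x))):
        a, b = x[i], x[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


def interleaved_to_split(x):
    """Reorder [a0, b0, a1, b1, ...] into [a0, a1, ..., b0, b1, ...]."""
    return list(x[0::2]) + list(x[1::2])
```

Applying the wrong style to the same weights gives different outputs, yet the two agree once the input and output are permuted with `interleaved_to_split`. That is why a mismatch here typically shows up as a weight-layout problem rather than a formula problem.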
Compare logits (and HF layer stats)
```bash
pixi run python compare_layers.py <HF_MODEL_ID> \
  --slug <your_slug> --port 8000 \
  --prompt "The capital of France is"
```

Requires `pixi run max serve` with `--custom-architectures <port_dir>` on the same port.

MAX does not expose per-layer hidden states via the OpenAI API. This script:
- Prints HF-only per-layer stats (embedding + each block output) as a diagnostic snapshot while you read the modeling code.
- Compares top-1 logprob at the prompt between HF and MAX via `/v1/completions?logprobs=…`.

If logprobs diverge, use divergences.md and add `ops.output(...)` taps in your `<slug>.py` for true tensor diffs inside each block; see layer-by-layer-debugging.md.
Fix the layer, then re-run
Edit `<slug>.py` to fix the identified layer, restart `pixi run max serve`, re-run `compare_layers.py`. When top-1 logprob matches (rel_diff < 5%), logits are aligned at that prompt. For block-local confirmation, use manual `ops.output()` taps.

If you fix a layer and the divergence point doesn't move, you fixed the wrong thing. Revert and re-read the HF source for that layer.
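The 5% threshold needs a definition of relative difference to be meaningful. The sketch below uses one common choice, the absolute difference divided by the larger magnitude; this is an assumption, since the exact formula inside `compare_layers.py` is not shown here, and both helper names are hypothetical.

```python
def rel_diff(hf_logprob: float, max_logprob: float, eps: float = 1e-9) -> float:
    """Relative difference between two log-probabilities.

    Assumption: normalized by the larger magnitude, with eps guarding
    against division by zero when both values are near 0.
    """
    denom = max(abs(hf_logprob), abs(max_logprob), eps)
    return abs(hf_logprob - max_logprob) / denom


def logprobs_aligned(hf_token, hf_logprob, max_token, max_logprob, threshold=0.05):
    """True when both sides pick the same top-1 token with close logprobs."""
    return hf_token == max_token and rel_diff(hf_logprob, max_logprob) < threshold


# Illustrative values only.
print(logprobs_aligned(" Paris", -0.50, " Paris", -0.51))
print(logprobs_aligned(" Paris", -0.50, " Lyon", -0.50))
```

Checking the token identity as well as the value avoids a false pass when two different tokens happen to have similar logprobs.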
Check against Hugging Face
Run the model end-to-end with pretrained weights, then run HF on the same
prompt with greedy sampling. On the MAX side, use the dtype that matches the
weight encoding the model supports (most models ship bfloat16). Outputs
should be identical or nearly identical; small BF16/FP16 rounding can cause
divergence past a dozen tokens. Persistent divergence in the first tokens
after the divergence hunt passed usually means tokenizer/chat-template
mismatch, dtype mismatch with the released weights, or nonzero MAX sampling.
When matching text comes out, the port is done for greedy text. Real
"done" depends on the validation depth picked during planning — pick a tier
from 1 (smoke) to 6 (logit parity).
Full HF-comparison recipe, divergence triage, and the 6-tier validation
table: validation-tiers.md.
Common pitfalls
Use pitfalls.md as an index: find your symptom, then load the one category file (config, weights, graph, or serving). For the docstring audit specifically, load honest-docstrings.md. The two big ones:
- Scaffold ≠ port. Do not serve or verify until the graph implements every delta in `<slug>.py`.
- Sed-rename leaves donor docstrings intact. Class names get renamed, but docstrings and comments still describe the donor. Rewrite them and run the audit grep before declaring the implementation done.
Tests and CI
When you add `pytest` tests for the ported model, minimize the number of MAX graph compilations per file. Compile once via a module-scoped fixture and reuse it across `@pytest.mark.parametrize` cases. For files that must compile different graphs, parallelize them with Bazel `shard_count` instead of splitting the file. Full patterns and examples: tests-and-ci.md.