import-model

Import a model into MAX

Input: a Hugging Face model ID (`$ARGUMENTS`).
Copy references/template.md to track this port as you work through the phases.
Porting a model to MAX means writing a MAX graph that performs the same computation as the model's `modeling_<type>.py` in Hugging Face `transformers`, then loading the released weights into that graph and verifying the outputs match.
The workflow has three phases: decide & plan, implement, verify. Phase 1 is reading and planning. Phase 2 is the port: implement every divergent sublayer in the graph. Phase 3 is verification, only after implementation is complete. Guards (preconditions that stop the line) gate the transitions between activities; they are not steps of their own.
Anti-pattern: running `scaffold.py`, tweaking `arch.py`, and serving while `<slug>.py` still implements the donor (`llama3`, `qwen3`, …). That is not a bring-up: logit verification will fail because the wrong architecture is running. Do not run verification scripts until implement-graph.md completion criteria pass.
Each phase links to references with the details. Read the reference for the activity you're on, not all of them upfront.
Environment: run every command through the pixi env that has MAX installed (`pixi run python …`, `pixi run max serve …`), from the skill root where `pixi.toml` lives (do not use bare `python` or `max` on the shell PATH):

```bash
cd <path-to-skill>
pixi install
pixi run python scripts/inspect_hf.py <HF_MODEL_ID>
```

Or smoke-test all scripts (no GPU needed) with `pixi run test-scripts`.


Helper scripts live in this skill's `scripts/` directory (copy or vendor
them into your repo). All helpers are also reachable through a unified
dispatcher with the same argument names and exit codes:

```bash
pixi run python scripts/import_model.py inspect <HF_MODEL_ID>
pixi run python scripts/import_model.py scaffold <HF_MODEL_ID> --start-from llama3 --output-dir ./
pixi run python scripts/import_model.py list-archs --match LlamaForCausalLM
pixi run python scripts/import_model.py check-walls <HF_MODEL_ID>
pixi run python scripts/import_model.py list-keys <HF_MODEL_ID> --summary
pixi run python scripts/import_model.py gates <HF_MODEL_ID> --port-dir <port_dir>/
pixi run python scripts/import_model.py compare <HF_MODEL_ID> --slug <slug> --port 8000
```
Port layout:
  • `<port_dir>`: the slug folder containing `arch.py` and `ARCHITECTURES` in `__init__.py` (usually `<output_dir>/<slug>/`). Pass this path to both `--custom-architectures` and `run_oss_gates.py --port-dir`.
MAX resolves `--custom-architectures <port_dir>` by adding `dirname(<port_dir>)` to `sys.path` and importing `basename(<port_dir>)` as the module. Passing the parent directory imports the wrong module name (e.g. `custom-arch` instead of your slug).
Import/API errors while editing: copy the donor arch under `modular/max/python/max/pipelines/architectures/<donor>/`; see pitfalls-config.md § Import and config API traps.
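
The path rule above can be sketched in a few lines of Python. This only illustrates the dirname/basename split described in the text; it is not MAX's actual loader, and `resolve_custom_arch` and the example paths are hypothetical.

```python
import os.path


def resolve_custom_arch(port_dir: str) -> tuple[str, str]:
    """Return (sys.path entry, module name) for a --custom-architectures path.

    Hypothetical helper: mirrors the dirname/basename rule described above,
    not MAX's real implementation.
    """
    # Strip a trailing slash so basename() does not return an empty string.
    normalized = port_dir.rstrip("/")
    return os.path.dirname(normalized), os.path.basename(normalized)


# Passing the slug folder imports the slug as the module.
print(resolve_custom_arch("/work/custom-arch/my_slug"))
# Passing the parent directory imports the parent's name instead.
print(resolve_custom_arch("/work/custom-arch/"))
```

The second call shows the failure mode: the module name becomes `custom-arch` rather than the slug.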


Phase 1 — Decide & plan

Guard: is the architecture already registered in MAX? Before writing any code, check whether MAX already registers the architecture class in your model's `config.json::architectures[0]`. If `pixi run python list_native_archs.py --match <Class>` returns a slug, run `pixi run max serve --model <HF_MODEL_ID>` and stop: no port is needed. Full procedure: native-arch-check.md.
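
To find the class name to pass as `<Class>`, the first entry of `architectures` can be read directly from the config text. A minimal sketch, assuming the standard `config.json` layout; `architecture_class` is a hypothetical helper and the example config is made up for illustration.

```python
import json


def architecture_class(config_text: str) -> str:
    """Return architectures[0] from the text of a Hugging Face config.json."""
    config = json.loads(config_text)
    architectures = config.get("architectures") or []
    if not architectures:
        raise ValueError("config.json has no 'architectures' entry")
    return architectures[0]


# Example config text; a real one comes from the Hub.
example = '{"architectures": ["LlamaForCausalLM"], "model_type": "llama"}'
print(architecture_class(example))
```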

Read `config.json`

Pull the config and read every field:

```bash
pixi run python -c "from transformers import AutoConfig; \
  print(AutoConfig.from_pretrained('<HF_MODEL_ID>', trust_remote_code=True))"
```

Or use the helper, which fetches raw `config.json` from the Hub, runs the native-arch check, and prints every key mapped to the MAX API:

```bash
pixi run python inspect_hf.py <HF_MODEL_ID>
```

Then list safetensors metadata (keys, shapes, dtypes; no weight download):

```bash
pixi run python list_checkpoint_keys.py <HF_MODEL_ID> --summary
```

Each row of the `inspect_hf.py` output is one `config.json` key → `pipeline_config.model.huggingface_config` (or `SupportedArchitecture` in `arch.py` for `architectures`, `torch_dtype`). Keys you cannot wire through `MyConfig.initialize()` are the deltas you implement in the graph. Field meanings and common deltas: read-config-json.md.
Scan for hard blockers before you commit to a port:

```bash
pixi run python check_walls.py <HF_MODEL_ID>
```

Exit 0 → continue. Exit 1 → review recognize-walls.md. Exit 2 → stop until the wall is resolved or scoped out.
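
If the wall check is wrapped in automation, the three exit codes above map to actions as in this sketch. `next_step` is a hypothetical helper; in practice the code would come from something like `subprocess.run(...).returncode`.

```python
def next_step(returncode: int) -> str:
    """Map a check_walls.py exit code to the action listed above."""
    actions = {
        0: "continue",
        1: "review recognize-walls.md",
        2: "stop until the wall is resolved or scoped out",
    }
    # Any other code is not documented above, so treat it as an error.
    if returncode not in actions:
        raise ValueError(f"undocumented exit code: {returncode}")
    return actions[returncode]


for code in (0, 1, 2):
    print(code, "->", next_step(code))
```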

Read the model card

Open `https://huggingface.co/<HF_MODEL_ID>` and read the model card for:
  • The paper or blog post. Skim its architecture section — authors call out the interesting modifications (QK-norm, MLA, sliding-window attention, MoE routing) because those are what they want credit for.
  • "Tricks" mentioned in the card. Phrases like "we introduce", "unlike prior models", "this is the first model to" mark deltas that will bite you during implementation if you miss them now.
If the card says the model is from a known family (Llama, Mistral, Qwen, Gemma), note that; the donor-comparison activity below will start from the closest already-ported variant of that family.
If the card mentions custom CUDA kernels, custom attention with no public reference, FP8/FP4-only released weights, ALiBi, recurrence, or state-space layers, see recognize-walls.md before going further. Some models can't be ported with the public MAX surface alone.

Propose a plan; accept a veto

Before any code, write a short paragraph stating what you'd do by default, then wait for the user to confirm or veto. Cover four axes (distribution shape, quantization variants, validation depth, hardware target) — all derived from what you've already read. Don't ask blank questions; state a default and let them push back.
Full guidance and an example paragraph: plan-and-veto.md.
If estimated weight bytes do not fit one GPU, read distributed-transformer.md before choosing `--start-from`: distribution shape matters more than attention family alone.
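
A rough way to estimate weight bytes is parameter count times bytes per parameter. The sketch below assumes 2 bytes per parameter (bfloat16/float16) and ignores KV cache and activations; the parameter counts and the 80 GiB device size are illustrative numbers, not taken from any specific model or GPU.

```python
def estimated_weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Estimate weight storage: parameter count times bytes per parameter.

    bytes_per_param=2 corresponds to bfloat16/float16 weights.
    """
    return num_params * bytes_per_param


def fits_one_gpu(num_params: int, gpu_memory_bytes: int, bytes_per_param: int = 2) -> bool:
    """True if the weights alone fit; ignores KV cache and activations."""
    return estimated_weight_bytes(num_params, bytes_per_param) <= gpu_memory_bytes


GIB = 1024**3
# Illustrative numbers only: 8e9 and 70e9 parameters in bfloat16,
# against a hypothetical 80 GiB device.
print(estimated_weight_bytes(8_000_000_000))   # 16000000000
print(fits_one_gpu(8_000_000_000, 80 * GIB))   # True
print(fits_one_gpu(70_000_000_000, 80 * GIB))  # False
```

Because the estimate leaves out runtime memory, a model that only barely fits by this measure should still be treated as a distribution candidate.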

Compare with other MAX architectures

You're picking the closest already-ported MAX architecture to copy from. "Closest" means: same attention shape (dense vs. GQA vs. MLA vs. MoE), same MLP shape (gated vs. non-gated, dense vs. routed), same head layout (tied vs. untied, single Linear vs. multi-step).
List what your installed MAX registers (do not hard-code a slug list):

```bash
pixi run python list_native_archs.py
```

Heuristic HF-signal → donor slug hints are in map-to-max.md. Quick version:

| Your model | Start from |
| --- | --- |
| Llama 3-ish (GQA, RoPE, SwiGLU MLP) | `llama3` |
| Gemma-ish (RMSNorm scale, logit softcap, dual norm) | `gemma2` or `gemma3` |
| Qwen-ish (GQA, RoPE, may have QK-norm) | `qwen2` / `qwen3` |
| Mistral-ish (sliding window) | `mistral` |
| Phi-ish (partial RoPE) | `phi3` |
| MoE (sparse experts, top-k routing) | `mixtral` or `qwen3_moe` |
| MLA (latent KV) | `deepseekV3` |

Open the chosen MAX arch's directory and read its top-level model file (usually `<slug>.py`). You're answering: which functions/classes need to change vs. stay the same when I port my model?
Now read the corresponding Hugging Face modeling file:

```bash
pixi run python -c "from transformers.models.<model_type> import modeling_<model_type>; \
  print(modeling_<model_type>.__file__)"
```

Read the `__init__`, the attention `forward`, the MLP `forward`, the block class, and the final head. Compare each to the MAX equivalent. The reference read-modeling-code.md covers what to look for in each.
Output of this activity: a delta list — one row per real difference between HF and the donor MAX arch (attention, MLP/MoE, block wiring, head, RoPE, masks). You implement every row in Phase 2. Three or fewer structural deltas → good donor choice. Many deltas → pick a closer donor or plan to rewrite whole classes. Do not proceed to verification with an empty or "looks Llama-ish" delta list.
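
One possible shape for the delta list is a small record per difference, with the rule of thumb above applied as a check. This is a sketch only: the `Delta` fields, `donor_is_good_choice`, and the example rows are hypothetical and not part of the skill's scripts or template.

```python
from dataclasses import dataclass


@dataclass
class Delta:
    """One real difference between the HF model and the donor MAX arch."""

    component: str  # e.g. "attention", "MLP/MoE", "block wiring", "head", "RoPE", "masks"
    hf_behavior: str
    donor_behavior: str


def donor_is_good_choice(deltas: list[Delta], max_structural: int = 3) -> bool:
    """Apply the rule of thumb above: three or fewer deltas is a good donor."""
    if not deltas:
        raise ValueError("empty delta list: compare HF and the donor first")
    return len(deltas) <= max_structural


# Hypothetical rows for illustration only.
deltas = [
    Delta("attention", "applies QK-norm before RoPE", "no QK-norm"),
    Delta("head", "ties LM head to embeddings", "untied LM head"),
]
print(donor_is_good_choice(deltas))
```

Raising on an empty list reflects the instruction not to proceed with an empty delta list.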


Phase 2 — Implement

Scaffold the file layout

`scaffold.py` only copies files. It does not implement your model.

```bash
pixi run python scaffold.py <HF_MODEL_ID> --start-from <max_arch_slug> --output-dir <output_dir>
```

This reads `architectures[0]` from the Hub `config.json` for `arch.py::name`, then copies the chosen native MAX architecture into `<output_dir>/<slug>/` as five files:
  • `arch.py`: registration shell (verify `name=` and encoding)
  • `model_config.py`: donor config (must be rewired during implementation)
  • `model.py`: pipeline model shell
  • `weight_adapters.py`: donor renames (must be rewritten for your checkpoint)
  • `<slug>.py`: donor graph (must be edited to match HF during implementation)
After scaffold, you have a directory layout and a wrong graph. Stop here until the graph is implemented; do not serve.
Scaffold also leaves the donor's docstrings and code comments in place. Sed-renaming class names doesn't touch text that records what the file claims to do. After scaffold, `<slug>.py` opens with a docstring describing the donor; the new class claims behaviors (single-GPU support, QK-norm, post-attention norm, etc.) the new file may not have. Rewriting those docstrings is a required part of the graph implementation, not optional polish. See honest-docstrings.md for the three-sentence pattern every module docstring should follow and a mandatory audit checklist before declaring the implementation done.

Implement the graph

This is the bring-up. Phase 1 produced the config map and delta list; the implementation activity executes them in code.
Full checklist, work order, anti-patterns, and completion criteria: implement-graph.md.
In order:
  1. `model_config.py`: wire every `config.json` key from Phase 1 / `inspect_hf.py`. Set `get_kv_params()` head counts and head_dim to match HF.
  2. `weight_adapters.py`: map your checkpoint's safetensor keys to the MAX module names you will use. Run `list_checkpoint_keys.py` first; see rename-weights.md. After load, wire the coverage audit in state-dict-audit.md (especially MoE and `strict=False` tied embeddings).
  3. `<slug>.py`: for each row in the delta list, edit or replace the donor class so MAX `forward()` mirrors HF `forward()`:
    • Attention (Q/K/V, RoPE, mask, GQA, softcap, …)
    • MLP or MoE (activation, routing, shared experts, …)
    • Decoder block (norm order and residual wiring; not interchangeable with Llama)
    • Final norm and LM head (tie, logit scale, softcap)
  4. `arch.py`: confirm `name=` matches `architectures[0]`; `default_encoding` matches Hub `torch_dtype`.
  5. `model.py`: only if HF wraps the backbone differently (VL, multi-modal).
Read HF `modeling_<type>.py` while editing, not after verification fails. Subclass the donor only where HF and donor match; rewrite the class where the delta list said they differ.
The implementation is done when every item in implement-graph.md is checked — especially: every delta has a corresponding code change, weights load without orphan keys, and the scaffold-comment audit in honest-docstrings.md has been run with each match classified as OK / Lie / Stale. A passing audit is mandatory; declaring the implementation done without it leaves donor lies in the codebase that nothing downstream will catch.
Quick grep recipe (full classification rules in honest-docstrings.md):

```bash
pixi run rg -i -n 'qwen|llama|mistral|cohere|gemma|phi|deepseek|exaone|olmo|granite|qwen3|mixtral|single-GPU|single GPU|RMSNorm|QK-norm' <port_dir>/
```

Your implementation-complete message must explicitly attest to the audit (e.g. "docstrings rewritten to the three-sentence pattern; rg returns N hits, all legitimate lineage references"). A claim without the attestation isn't a completion.
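
Where `rg` is not available, the same scan can be approximated in Python. This is a rough sketch with a shortened pattern list, not a replacement for the audit rules in honest-docstrings.md; `audit_hits` and the sample file contents are hypothetical.

```python
import re

# A shortened version of the pattern in the rg command above; extend as needed.
DONOR_PATTERN = re.compile(
    r"qwen|llama|mistral|gemma|phi|single[- ]GPU|QK-norm", re.IGNORECASE
)


def audit_hits(source: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that mention a donor name or claim."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if DONOR_PATTERN.search(line):
            hits.append((lineno, line.strip()))
    return hits


# Hypothetical file contents after a sed-rename: the class name changed,
# but the docstring still describes the donor.
sample = '''"""Llama 3 decoder with single-GPU support."""
class MyModelDecoder:
    pass
'''
for lineno, line in audit_hits(sample):
    print(f"{lineno}: {line}")
```

Each hit still has to be classified by hand as OK / Lie / Stale; the scan only finds candidates.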
Preflight (Hub config + arch registration; run before first serve):

```bash
pixi run python run_oss_gates.py <HF_MODEL_ID> --port-dir <port_dir>/
```

Guard: local smoke gate (mandatory before Phase 3). `pixi run max serve` cold-compiles for 5–25 minutes. Before serving, run the four local checks in serve-and-iterate.md (import smoke, graph dry-build, adapter⇄graph key diff, weights-format preflight). `run_oss_gates.py` covers walls, checkpoint metadata, and `arch.py` name/encoding; it is not a substitute for those four.
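
The adapter⇄graph key diff can be read as a two-way set difference. The sketch below is one interpretation of that check, assuming you can collect the key names the weight adapter emits and the key names the graph expects; the actual procedure is in serve-and-iterate.md, and `key_diff` and the key names here are hypothetical.

```python
def key_diff(adapter_keys: set[str], graph_keys: set[str]) -> dict[str, list[str]]:
    """Compare the keys the weight adapter emits with the keys the graph expects."""
    return {
        # Emitted by the adapter but unknown to the graph (orphan weights).
        "orphans": sorted(adapter_keys - graph_keys),
        # Expected by the graph but never provided by the adapter.
        "missing": sorted(graph_keys - adapter_keys),
    }


# Hypothetical key names for illustration.
adapter_keys = {"layers.0.self_attn.q_proj.weight", "layers.0.self_attn.q_norm.weight"}
graph_keys = {"layers.0.self_attn.q_proj.weight", "layers.0.mlp.gate_proj.weight"}
print(key_diff(adapter_keys, graph_keys))
```

Both lists should be empty before a cold compile is worth the wait.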


Phase 3 — Verify

Check if it generates coherent text

Prerequisite: graph implementation complete. Do not serve to "see what happens" during implementation — fix config, adapters, and graph first.
Sanity-check the HF reference FIRST. Run HF alone on the model card's intended prompt template, before involving MAX. If HF itself produces gibberish, your oracle is broken — fixing your port against a broken oracle wastes days.
Then serve with `pixi run max serve --model-path <HF_MODEL_ID> --custom-architectures <port_dir>` and probe with the model card's intended template (not just "The capital of France is"; that prompt is wrong for PrefixLMs and instruction-tuned models). Three possible outcomes: server crashes during load → fix config/adapters; server starts but returns garbage → divergence hunt; server returns plausible text → run at `max_tokens=64+` before celebrating.
Full HF-reference sanity check, encoder/embedding slug serve flow, and fix-test loop discipline: serve-and-iterate.md.

Layer-by-layer divergence hunt

This is the main loop. You're going to:
  1. Read the HF reference implementation to understand what should happen at each layer.
  2. Dump intermediate tensors from both implementations and find the first layer where they diverge.
  3. Fix that layer.
  4. Re-run the layer check.
  5. Repeat until all layers match.

Read the reference implementation

Before diving into the HF source, consult the symptom table at the top of divergences.md. Match what you're observing (gibberish at token 0, divergence growing with length, output plausible but text drifts, etc.) to its candidate causes, and read every candidate listed — not just the first plausible one. Several causes produce the same symptom; the bug is the one you haven't checked yet.
Then open the HF `modeling_<type>.py` as a debugger, not a reviewer. You're looking for the specific detail you missed. Common ones:
  • A norm whose variant or position differs from the template
  • A scale factor applied somewhere (`hidden_states * scale`, `attn_weights / sqrt(d)`, MuP multipliers)
  • A different activation function in the MLP
  • A different RoPE style (split-half vs. interleaved, partial vs. full)
  • A boundary condition that only fires at certain layers (sliding-window vs. global attention, sink-token handling)
Easter-egg warning: HF modeling code inherits aggressively. A class named `MyModelDecoderLayer(GraniteDecoderLayer)` may inherit critical behavior from a different family entirely. Always chase inheritance up at least one level before concluding "this is just Llama with renamed fields."
The catalog of "this differs from Llama and here is how" is in divergences.md. Indexed by symptom.
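
One way to chase inheritance is to print a class's method-resolution order and check where a method is actually defined. The classes below are stand-ins that only reuse the names from the warning above; in practice you would import the real classes from the model's `modeling_<type>.py`, and `inheritance_chain` is a hypothetical helper.

```python
import inspect


def inheritance_chain(cls: type) -> list[str]:
    """List a class's ancestors in method-resolution order, excluding object."""
    return [base.__name__ for base in inspect.getmro(cls) if base is not object]


# Stand-in classes for illustration; not the real transformers classes.
class GraniteDecoderLayer:
    def forward(self, hidden_states):
        return hidden_states


class MyModelDecoderLayer(GraniteDecoderLayer):
    pass


print(inheritance_chain(MyModelDecoderLayer))
# forward is defined on the parent, not on MyModelDecoderLayer itself.
print("forward" in vars(MyModelDecoderLayer))
```

When a method is missing from `vars(cls)`, the behavior lives in a parent class, which is where to read next.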

Compare logits (and HF layer stats)

```bash
pixi run python compare_layers.py <HF_MODEL_ID> \
  --slug <your_slug> --port 8000 \
  --prompt "The capital of France is"
```

Requires `pixi run max serve` with `--custom-architectures <port_dir>` on the same port.
MAX does not expose per-layer hidden states via the OpenAI API. This script:
  • Prints HF-only per-layer stats (embedding + each block output) as a diagnostic snapshot while you read the modeling code.
  • Compares top-1 logprob at the prompt between HF and MAX via `/v1/completions?logprobs=…`.
If logprobs diverge, use divergences.md and add `ops.output(...)` taps in your `<slug>.py` for true tensor diffs inside each block; see layer-by-layer-debugging.md.

Fix the layer, then re-run

Edit `<slug>.py` to fix the identified layer, restart `pixi run max serve`, and re-run `compare_layers.py`. When top-1 logprob matches (rel_diff < 5%), logits are aligned at that prompt. For block-local confirmation, use manual `ops.output()` taps.
If you fix a layer and the divergence point doesn't move, you fixed the wrong thing. Revert and re-read the HF source for that layer.
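
The 5% threshold can be checked by hand with a relative difference. The text does not define `rel_diff`, so the formula below (absolute difference divided by the larger magnitude) is an assumption and `compare_layers.py` may compute it differently; the log-probability values are made up.

```python
def rel_diff(hf_logprob: float, max_logprob: float) -> float:
    """Relative difference between two log-probabilities.

    Assumed definition: |a - b| / max(|a|, |b|); compare_layers.py may differ.
    """
    denom = max(abs(hf_logprob), abs(max_logprob))
    if denom == 0.0:
        return 0.0  # both are exactly 0, i.e. probability 1 on both sides
    return abs(hf_logprob - max_logprob) / denom


def logits_aligned(hf_logprob: float, max_logprob: float, tol: float = 0.05) -> bool:
    """True when the relative difference is under the tolerance (5% by default)."""
    return rel_diff(hf_logprob, max_logprob) < tol


print(logits_aligned(-0.50, -0.51))  # close values
print(logits_aligned(-0.50, -2.00))  # clearly diverged
```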

Check against Hugging Face

Run the model end-to-end with pretrained weights, then run HF on the same prompt with greedy sampling. On the MAX side, use the dtype that matches the weight encoding the model supports (most models ship bfloat16). Outputs should be identical or nearly identical; small BF16/FP16 rounding can cause divergence past a dozen tokens. Persistent divergence in the first tokens after the divergence hunt passed usually means tokenizer/chat-template mismatch, dtype mismatch with the released weights, or non-greedy sampling on the MAX side.
When matching text comes out, the port is done for greedy text. Real "done" depends on the validation depth picked during planning — pick a tier from 1 (smoke) to 6 (logit parity).
Full HF-comparison recipe, divergence triage, and the 6-tier validation table: validation-tiers.md.
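
To see where two greedy outputs part ways, compare them token by token and report the first mismatch. A minimal sketch, assuming both outputs are available as lists of token IDs; `first_divergence` and the example IDs are hypothetical.

```python
from typing import Optional, Sequence


def first_divergence(hf_tokens: Sequence[int], max_tokens: Sequence[int]) -> Optional[int]:
    """Index of the first position where the two greedy outputs differ.

    Returns None when one sequence is a prefix of (or equal to) the other.
    """
    for index, (hf_tok, max_tok) in enumerate(zip(hf_tokens, max_tokens)):
        if hf_tok != max_tok:
            return index
    return None


# Hypothetical token IDs for illustration.
print(first_divergence([1, 2, 3, 4], [1, 2, 9, 4]))  # 2
print(first_divergence([1, 2, 3], [1, 2, 3]))        # None
```

Following the guidance above, a mismatch in the first few tokens points toward tokenizer, template, dtype, or sampling issues, while one past a dozen tokens is more likely rounding.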


Common pitfalls

Use pitfalls.md as an index: find your symptom, then load the one category file (config, weights, graph, or serving). For the docstring audit specifically, use honest-docstrings.md. The two big ones:
  • Scaffold ≠ port. Do not serve or verify until the graph implements every delta in `<slug>.py`.
  • Sed-rename leaves donor docstrings intact. Class names get renamed, but docstrings and comments still describe the donor. Rewrite them and run the audit grep before declaring the implementation done.

Tests and CI

When you add `pytest` tests for the ported model, minimize the number of MAX graph compilations per file. Compile once via a module-scoped fixture and reuse it across `@pytest.mark.parametrize` cases. For files that must compile different graphs, parallelize them with Bazel `shard_count` instead of splitting the file. Full patterns and examples: tests-and-ci.md.