# Flash-MoE Inference Engine

Skill by ara.so — Daily 2026 Skills collection.
Flash-MoE is a pure C/Objective-C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397B-parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. It streams 209GB of expert weights from NVMe SSD on demand — no Python at runtime, no ML frameworks, just C, Objective-C, and hand-tuned Metal shaders.
## Requirements
- Hardware: Apple Silicon Mac (M3 Max or similar), 48GB+ unified memory, 1TB+ SSD with ~210GB free
- OS: macOS 26+ (Darwin 25+)
- Tools: Xcode Command Line Tools, Python 3.x (for weight extraction only)
- Model: Qwen3.5-397B-A17B safetensors weights (download separately from HuggingFace)
## Installation & Build
```bash
# Clone the repo
git clone https://github.com/danveloper/flash-moe
cd flash-moe/metal_infer

# Build everything
make

# Verify build artifacts
ls infer chat main
```

The Makefile compiles `infer.m`, `chat.m`, and `main.m`, and compiles the Metal shaders in `shaders.metal`.

## Weight Preparation
### Step 1: Extract non-expert weights
```bash
# From the metal_infer/ directory
# Point to your downloaded Qwen3.5-397B safetensors directory
python3 extract_weights.py /path/to/Qwen3.5-397B-A17B-Instruct/

# Produces:
#   model_weights.bin  (~5.5GB, mmap'd at runtime)
#   model_weights.json (tensor manifest)
#   vocab.bin          (vocabulary)
#   tokenizer.bin      (BPE tokenizer data)
```

### Step 2: Pack expert weights (4-bit, production)
```bash
# From the repo root
python3 repack_experts.py /path/to/Qwen3.5-397B-A17B-Instruct/ metal_infer/packed_experts/

# Produces the packed_experts/ directory (~209GB)
# Each expert is a separate file: layer_XX_expert_YYYY.bin
```

### Step 3: Optional 2-bit requantization (faster but breaks JSON/tool calling)
```bash
# Convert 4-bit experts to 2-bit (saves ~89GB; 120GB total)
python3 metal_infer/repack_experts_2bit.py \
    metal_infer/packed_experts/ \
    metal_infer/packed_experts_2bit/
```

## Key Commands
### Basic inference
```bash
cd metal_infer

# 4-bit inference (production quality, tool calling works)
./infer --prompt "Explain quantum computing" --tokens 100

# 2-bit inference (faster, breaks JSON/tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit

# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing
```

### Interactive chat with tool calling
```bash
./chat
# Opens a TUI with full tool-calling support
# Uses 4-bit experts by default
```

### MoE-only benchmark (measures expert throughput)
```bash
./main
# Runs a pure expert forward-pass benchmark
# Reports tokens/sec without attention overhead
```

## Project Structure
```
flash-moe/
├── paper/
│   └── flash_moe.pdf            # Full technical paper
├── metal_infer/
│   ├── infer.m                  # Complete inference engine (~7000 lines)
│   ├── shaders.metal            # Metal compute kernels (~1200 lines)
│   ├── chat.m                   # Interactive chat TUI
│   ├── tokenizer.h              # Single-header C BPE tokenizer (449 lines)
│   ├── main.m                   # MoE-only benchmark
│   ├── Makefile
│   ├── extract_weights.py       # Safetensors → model_weights.bin
│   ├── repack_experts_2bit.py   # 4-bit → 2-bit requantization
│   ├── train_predictor.py       # Expert routing prediction analysis
│   ├── model_weights.bin        # Non-expert weights (mmap'd)
│   ├── model_weights.json       # Tensor manifest
│   ├── vocab.bin
│   ├── tokenizer.bin
│   ├── packed_experts/          # 4-bit expert files (209GB)
│   └── packed_experts_2bit/     # 2-bit expert files (120GB, optional)
├── repack_experts.py            # 4-bit expert packing from safetensors
├── progress.py                  # Results visualization
└── results.tsv                  # Experiment log
```

## Architecture Overview
The model has 60 transformer layers:
- 45 GatedDeltaNet (linear attention) layers
- 15 standard full attention layers
- Each layer: 512 experts, K=4 activated per token + 1 shared expert
- Hidden dimension: 4096
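These figures are internally consistent. A quick back-of-envelope check, as a sketch in C using only numbers quoted in this document (none of this is repo code):

```c
#include <stdio.h>

// Sanity-check the architecture numbers quoted above (illustrative only).
int main(void) {
    const int layers = 60, experts_per_layer = 512, k_active = 4;
    const double total_expert_bytes = 209e9;  // packed_experts/ at 4-bit

    int total_experts = layers * experts_per_layer;  // 30,720 expert files
    double bytes_per_expert = total_expert_bytes / total_experts;
    double bytes_per_token = (double)layers * k_active * bytes_per_expert;

    printf("experts: %d\n", total_experts);               // 30720
    printf("MB/expert: %.2f\n", bytes_per_expert / 1e6);  // ~6.8, matching the
                                                          // ~6.75MB quoted below
    printf("GB read per token (before page-cache hits): %.2f\n",
           bytes_per_token / 1e9);                        // ~1.6
    return 0;
}
```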
### Per-layer pipeline (4.28ms average at 4-bit)
```
CMD3(prev) → CMD1: attention projections + delta-net  [1.22ms GPU]
           → CPU:  flush results                      [0.01ms CPU]
           → CMD2: o_proj + norm + routing + shared   [0.55ms GPU]
           → CPU:  softmax + top-K routing            [0.003ms]
           → I/O:  parallel pread of K=4 experts      [2.41ms SSD]
           → CMD3: expert forward + combine + norm    [0.04ms encode, DEFERRED]
```
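Summing the stages ties out against the headline speed. A sketch using only the stage timings above:

```c
#include <stdio.h>

// Cross-check the pipeline timings against the reported 4.4+ tok/s.
int main(void) {
    double stage_ms[] = {1.22, 0.01, 0.55, 0.003, 2.41, 0.04};
    double per_layer_ms = 0.0;
    for (int i = 0; i < 6; i++) per_layer_ms += stage_ms[i];

    double per_token_ms = per_layer_ms * 60;  // 60 layers
    printf("per-layer: %.2fms\n", per_layer_ms);   // ~4.23 (vs 4.28 quoted)
    printf("per-token: %.0fms -> %.1f tok/s\n",
           per_token_ms, 1000.0 / per_token_ms);   // ~254ms -> ~3.9 tok/s
    // The measured 4.36 tok/s is slightly higher, presumably because the
    // DEFERRED CMD3 encode overlaps with work on the next layer.
    return 0;
}
```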
## Metal Shader Kernels

The `shaders.metal` file contains hand-written kernels. Key kernels:

```metal
// 4-bit dequantized matrix-vector multiply (FMA-optimized)
// Key insight: fma(nibble, scale*x, bias*x) instead of (nibble*scale + bias)*x
// Pre-compute scale*x and bias*x to fuse dequant+multiply in one FMA instruction
kernel void matvec_4bit_fma(
    device const uint8_t* weights [[buffer(0)]],
    device const float*   scales  [[buffer(1)]],
    device const float*   biases  [[buffer(2)]],
    device const float*   x       [[buffer(3)]],
    device float*         out     [[buffer(4)]],
    uint tid [[thread_position_in_threadgroup]],
    uint gid [[threadgroup_position_in_grid]])
{
    // ... tiled SIMD-reduced FMA kernel
    // 12% faster than the naive (nibble * scale + bias) * x
}

// Fused SwiGLU activation
kernel void swiglu(device float*       gate [[buffer(0)]],
                   device const float* up   [[buffer(1)]],
                   uint gid [[thread_position_in_grid]])
{
    float g = gate[gid];
    gate[gid] = (g / (1.0f + exp(-g))) * up[gid];
}

// RMS normalization (two-pass)
kernel void rms_norm_pass1(...)  // sum-of-squares reduction
kernel void rms_norm_pass2(...)  // apply normalization

// GPU RoPE (fused with Q deinterleave and K normalization)
kernel void rope_qk(...)

// MoE combine + residual + sigmoid gate (fused)
kernel void moe_combine_residual(...)
```
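The FMA identity is easy to verify in scalar C. This is a toy sketch of the algebra only (the values are hypothetical, and the real kernel works on packed nibbles with per-group scales): since w = nibble·scale + bias, the product w·x equals fma(nibble, scale·x, bias·x), so pre-multiplying scale and bias by x turns dequantize-then-multiply into a single FMA per weight.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

// Scalar illustration of the identity behind matvec_4bit_fma.
// One quantization group: w[i] = nibbles[i] * scale + bias.
int main(void) {
    uint8_t nibbles[8] = {0, 3, 7, 15, 1, 9, 12, 4};
    float scale = 0.25f, bias = -2.0f;
    float x[8] = {0.5f, -1.0f, 2.0f, 0.1f, -0.3f, 1.5f, 0.0f, -2.2f};

    float naive = 0.0f, fused = 0.0f;
    for (int i = 0; i < 8; i++) {
        // Naive: dequantize, then multiply (two dependent ops per weight)
        naive += (nibbles[i] * scale + bias) * x[i];
        // Fused: pre-scale the activation, then one FMA per weight
        fused = fmaf((float)nibbles[i], scale * x[i], fmaf(bias, x[i], fused));
    }
    printf("naive=%f fused=%f\n", naive, fused);  // identical up to rounding
    return 0;
}
```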
## SSD Expert Streaming Pattern
The core innovation — loading only K=4 active experts per layer from SSD:
```objc
// Parallel expert loading using GCD dispatch groups
// From infer.m (conceptual pattern)
dispatch_group_t group = dispatch_group_create();
dispatch_queue_t ioQueue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);

for (int k = 0; k < K_EXPERTS; k++) {
    int expert_id = top_k_indices[k];
    dispatch_group_async(group, ioQueue, ^{
        // Each expert: ~6.75MB at 4-bit
        char path[256];
        snprintf(path, sizeof(path),
                 "packed_experts/layer_%02d_expert_%04d.bin",
                 layer, expert_id);
        int fd = open(path, O_RDONLY);
        // pread() is synchronous; the concurrency comes from the dispatch
        // group, and the OS page cache handles LRU eviction
        pread(fd, expert_buffer[k], expert_size, 0);
        close(fd);
    });
}
dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
// GPU compute follows — the serial pipeline is hardware-optimal on Apple Silicon
```

Why `pread()` and not `mmap()`: mmap incurs per-page fault overhead on cold data (~5x slower). Plain `pread()` backed by the OS page cache reaches a ~71% hit rate naturally.
## GatedDeltaNet Linear Attention (BLAS)
The recurrence update uses Accelerate BLAS — 64% faster than scalar:
```objc
// GatedDeltaNet state update per head (conceptual pattern)
// state: 128×128 float matrix, 64 heads
// From infer.m
#import <Accelerate/Accelerate.h>

for (int h = 0; h < 64; h++) {
    float* S = state + h * 128 * 128;  // 128×128 state matrix
    float* q = Q + h * 128;
    float* k = K + h * 128;
    float* v = V + h * 128;

    // Decay the previous state first, so alpha scales only S_{t-1}: S = alpha * S
    cblas_sscal(128 * 128, alpha[h], S, 1);

    // β·(k⊗v) rank-1 update — cblas_sger: S += beta * (k ⊗ v)
    cblas_sger(CblasRowMajor, 128, 128,
               beta[h], k, 1, v, 1, S, 128);

    // Output: o = S @ q
    cblas_sgemv(CblasRowMajor, CblasNoTrans,
                128, 128, 1.0f, S, 128, q, 1, 0.0f,
                output + h * 128, 1);
}
```
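Written out, the loop above implements the simplified per-head recurrence S_t = α_t·S_{t−1} + β_t·(k_t ⊗ v_t), followed by the readout o_t = S_t·q_t. (The full GatedDeltaNet delta-rule update has additional terms; this is the conceptual core.)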
## Performance Configuration
### 4-bit (production default)

- Quality: Excellent — full tool calling, correct JSON
- Speed: 4.36 tok/s
- Disk: 209GB

### 2-bit (speed testing only)

- Quality: Good — but breaks JSON/tool calling (emits `\name\` instead of `"name"`)
- Speed: 5.74 tok/s (7.05 peak single-token with a warm cache)
- Disk: 120GB
- Uses the `F_NOCACHE` flag to avoid page-cache thrashing
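For reference, `F_NOCACHE` is a real macOS `fcntl(2)` command set per file descriptor. A minimal sketch of the pattern (the helper name and arguments are illustrative, not code from the repo):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

// Read a file while bypassing the unified buffer cache (macOS F_NOCACHE).
// Useful when the data will not be re-read soon, so caching it would only
// evict hotter pages; the 2-bit expert path uses this idea.
int read_uncached(const char* path, void* buf, size_t size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    fcntl(fd, F_NOCACHE, 1);              // don't populate the page cache
    ssize_t n = pread(fd, buf, size, 0);  // positional read, thread-safe
    close(fd);
    return n == (ssize_t)size ? 0 : -1;
}
```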
## What NOT to Try (Learned from 58 Experiments)
| Approach | Why it fails |
|---|---|
| `mmap()` expert weights | Per-page fault overhead: 5x slower than `pread()` |
| Overlapping SSD DMA with GPU compute | SSD DMA + GPU share the memory controller — concurrent access: -73% GPU speed |
| Custom Metal LRU cache | GPU memory pressure: -38% vs OS page cache |
| LZ4 expert compression | Decompress overhead > warm-cache savings: -13% |
| Temporal expert prediction | 25% hit rate, wastes SSD bandwidth: -18% |
| Speculative early routing | Cache pollution: -38% |
| MTP speculative decoding | MoE I/O scales per token (unlike dense models): break-even |
| Spin-poll GPU wait | CPU thermal throttling competes with GPU: -23% |
| Parallel SSD + GPU overlap | Unified memory controller arbitration: net negative |

Key principle: On Apple Silicon, GPU DMA and SSD DMA share the same memory controller. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.
## Troubleshooting
### Build fails
```bash
# Ensure Xcode CLI tools are installed
xcode-select --install

# Check that the Metal compiler is available
xcrun -sdk macosx metal --version
```

### Out of memory
The engine is designed to use ~6GB of active memory:

- 5.5GB: `model_weights.bin` (mmap'd, read-only)
- ~200MB: Metal scratch buffers
- Remaining ~42GB: OS page cache for expert data

If you see OOM, check for other processes consuming unified memory:

```bash
sudo memory_pressure
vm_stat
```

### Slow performance
```bash
# Check SSD speed — needs ~17GB/s for target performance

# Run with timing to identify the bottleneck
./infer --prompt "Hello" --tokens 5 --timing

# Verify packed_experts/ is on the internal SSD, not an external drive
diskutil info /
```

### Wrong expert directory
```bash
# Default paths expected by infer.m:
#   metal_infer/packed_experts/      (4-bit)
#   metal_infer/packed_experts_2bit/ (2-bit)

# Ensure you're running from the metal_infer/ directory
cd metal_infer
./infer --prompt "test"
```

### Tool calling broken
Use 4-bit, not 2-bit. The 2-bit quantization corrupts quote characters in JSON output, making tool calling unreliable. Always use the default 4-bit configuration for agentic workloads.
## Memory Safety
The engine explicitly manages all allocations:

- No unbounded caches
- Expert data never accumulates in GPU memory
- `model_weights.bin` is mmap'd read-only — kernel manages pages
- Expert files are opened, read, and closed on each inference step