# dflash-mlx Speculative Decoding
Skill by ara.so — Daily 2026 Skills collection.
DFlash implements lossless speculative decoding for MLX on Apple Silicon. A small draft model (~1B params) generates 16 tokens in parallel using block diffusion; the target model verifies all 16 in a single forward pass. Tokens are only emitted after target verification — output is lossless (every token is the target model's greedy argmax).
Typical speedups: 1.7x–4.1x over baseline depending on model size and context length. Acceptance rates hover around 87–90% for Qwen3.5 models.
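The core loop, as a minimal sketch (the `propose` and `greedy_argmax` calls are illustrative stand-ins, not the `dflash_mlx` API; the real entry points are under "Python API Usage" below):

```python
# Conceptual draft-then-verify step. Every emitted token is the target's
# greedy argmax: either a drafted token the target confirmed, or the
# target's own token at the first mismatch.
BLOCK_SIZE = 16

def speculative_step(target, draft, prefix):
    block = draft.propose(prefix, n=BLOCK_SIZE)     # 16 tokens via block diffusion
    verified = target.greedy_argmax(prefix, block)  # one target forward pass
    accepted = []
    for drafted, correct in zip(block, verified):
        accepted.append(correct)   # always the target-verified token
        if drafted != correct:     # first mismatch ends the accepted run
            break
    return prefix + accepted
```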
## Installation
```bash
pip install dflash-mlx

# or isolated install
pipx install dflash-mlx
```

Requires Python 3.10+, MLX 0.31.1+, and an Apple Silicon Mac.
## Key CLI Commands
### Generate text
```bash
# Auto-resolve draft model from registry
dflash --model Qwen/Qwen3.5-9B --prompt "Explain backpropagation"

# Explicit draft model
dflash --model Qwen/Qwen3.5-9B \
  --draft z-lab/Qwen3.5-9B-DFlash \
  --prompt "Explain backpropagation"

# Disable EOS (useful for benchmarking fixed token counts)
dflash --model Qwen/Qwen3.5-9B --prompt "..." --max-tokens 1024 --no-eos
```
### OpenAI-compatible server
```bash
# Basic server
dflash-serve --model Qwen/Qwen3.5-9B --port 8000

# With explicit draft
dflash-serve --model Qwen/Qwen3.5-9B \
  --draft z-lab/Qwen3.5-9B-DFlash \
  --port 8000

# Disable thinking/reasoning tokens (Qwen3.5 thinking models)
dflash-serve --model Qwen/Qwen3.5-9B --port 8000 \
  --chat-template-args '{"enable_thinking": false}'

# Raise fallback threshold for longer prompts (large models)
dflash-serve --model mlx-community/Qwen3.5-35B-A3B-4bit --port 8000 \
  --chat-template-args '{"enable_thinking": false}' \
  --dflash-max-ctx 16384
```
### Benchmark
```bash
dflash-benchmark \
  --model Qwen/Qwen3.5-9B \
  --draft z-lab/Qwen3.5-9B-DFlash \
  --prompt "The function f satisfies..." \
  --max-tokens 1024 \
  --repeat 3 \
  --no-eos
```

Outputs per-run JSON reports with tok/s, acceptance rate, and speedup vs baseline.
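A quick way to aggregate the reports (a sketch: the `benchmark/results/` directory appears under Troubleshooting below, but the field names `tokens_per_second` and `acceptance_rate` are assumptions about the schema, so adjust the keys to what the files actually contain):

```python
# Average tok/s and acceptance rate across repeat runs.
import json
from pathlib import Path
from statistics import mean

reports = [json.loads(p.read_text()) for p in Path("benchmark/results").glob("*.json")]
if reports:
    print(f"runs: {len(reports)}")
    print(f"mean tok/s: {mean(r['tokens_per_second'] for r in reports):.1f}")
    print(f"mean acceptance: {mean(r['acceptance_rate'] for r in reports):.2%}")
```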
## Supported Model Pairs
| Target Model | Draft Model |
|---|---|
| | |
| | |
| | |
| | |

Draft models are auto-resolved from a registry — no `--draft` flag needed for listed pairs. Models without a matching draft are rejected at startup.
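Conceptually, resolution is a lookup keyed on the target model id, with an explicit `--draft` taking precedence. The sketch below is illustrative only (the real registry ships inside dflash-mlx; the single entry shown is just the pair used throughout this page):

```python
# Hypothetical mirror of registry behavior, not dflash-mlx internals.
DRAFT_REGISTRY = {
    "Qwen/Qwen3.5-9B": "z-lab/Qwen3.5-9B-DFlash",
}

def resolve_draft(model: str, draft: str | None = None) -> str:
    if draft is not None:
        return draft  # explicit --draft bypasses the registry check
    try:
        return DRAFT_REGISTRY[model]
    except KeyError:
        raise SystemExit(f"Error: No DFlash draft found for model '{model}'")
```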
## Python API Usage
### Streaming generation
```python
from dflash_mlx import DFlashRuntime

runtime = DFlashRuntime.from_pretrained(
    model="Qwen/Qwen3.5-9B",
    draft="z-lab/Qwen3.5-9B-DFlash",  # optional, auto-resolved
)

prompt = "Explain the Pythagorean theorem step by step."

for token_text in runtime.stream_generate(
    prompt=prompt,
    max_tokens=512,
    use_chat_template=True,
):
    print(token_text, end="", flush=True)
print()
```
### Full generation with stats
```python
from dflash_mlx import DFlashRuntime

runtime = DFlashRuntime.from_pretrained(model="Qwen/Qwen3.5-9B")

result = runtime.generate(
    prompt="What is speculative decoding?",
    max_tokens=256,
    use_chat_template=True,
)

print(result.text)
print(f"Tokens/sec: {result.tokens_per_second:.2f}")
print(f"Acceptance rate: {result.acceptance_rate:.2%}")
print(f"Total tokens: {result.total_tokens}")
```
print(f"Total tokens: {result.total_tokens}")Custom draft block size and context
自定义草稿块大小和上下文
```python
from dflash_mlx import DFlashRuntime, DFlashConfig

config = DFlashConfig(
    draft_block_size=16,      # tokens drafted per speculative step
    max_ctx=8192,             # max context length before fallback
    enable_tape_replay=True,  # GatedDeltaNet recurrent rollback
    jit_sdpa=True,            # custom Metal SDPA for long contexts
)

runtime = DFlashRuntime.from_pretrained(
    model="mlx-community/Qwen3.5-27B-4bit",
    config=config,
)
```
### OpenAI client against dflash-serve
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # dflash-serve does not require auth by default
)
```
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed", # dflash-serve默认不需要认证
**Non-streaming**
```python
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[
        {"role": "user", "content": "Explain gradient descent."}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
**Streaming**
```python
stream = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[{"role": "user", "content": "Write a haiku about silicon."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```
### Tool calling (via dflash-serve)
```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

tool_call = response.choices[0].message.tool_calls[0]
print(f"Function: {tool_call.function.name}")
print(f"Args: {json.loads(tool_call.function.arguments)}")
```
print(f"Args: {json.loads(tool_call.function.arguments)}")Common Patterns
常见使用场景
### Side-by-side demo (baseline vs DFlash)
```bash
PYTHONPATH=. python3 -m examples.demo --mode dflash \
  --target-model Qwen/Qwen3.5-9B \
  --draft-model z-lab/Qwen3.5-9B-DFlash \
  --prompt "Solve: f(x) + f(y) = f(x+y) - xy - 1" \
  --max-tokens 2048 \
  --no-eos
```
### Integrating with Open WebUI
- Start `dflash-serve --model Qwen/Qwen3.5-9B --port 8000`
- In Open WebUI settings → Connections → add an OpenAI API connection with URL `http://localhost:8000/v1`
- Select the model `Qwen/Qwen3.5-9B` in the chat UI

Works the same for Continue, aider, OpenCode, and any OpenAI-compatible client. To confirm the server is reachable at all, see the check below.
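If the client can't see the model, a quick sanity check is the standard model-listing endpoint (this assumes dflash-serve exposes `/v1/models`, as OpenAI-compatible servers generally do):

```python
# List models exposed by a local dflash-serve instance.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    payload = json.load(resp)

for entry in payload.get("data", []):
    print(entry["id"])
```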
### Override draft for unsupported models
```bash
# Force a custom draft — bypasses registry check
dflash --model my-org/MyCustomModel \
  --draft my-org/MyCustomModel-DFlash \
  --prompt "Hello"
```
### Disable thinking tokens for Qwen3.5
```bash
# CLI
dflash --model Qwen/Qwen3.5-9B \
  --chat-template-args '{"enable_thinking": false}' \
  --prompt "What is 2+2?"

# Server
dflash-serve --model Qwen/Qwen3.5-9B \
  --chat-template-args '{"enable_thinking": false}' \
  --port 8000
```
## Architecture Notes
- **Tape-replay rollback:** For hybrid GatedDeltaNet + attention models (Qwen3.5), dflash records an innovation tape during verify and replays only the accepted steps via a custom Metal kernel, avoiding full state snapshots.
- **JIT SDPA 2-pass:** For contexts ≥ 1024 tokens, a custom Metal attention kernel maintains numerical alignment with stock MLX attention.
- **Greedy acceptance:** Keeps the longest correct prefix of the 16 drafted tokens and rejects the rest. No temperature or sampling on verification, so output is strictly lossless (a back-of-envelope yield estimate follows this list).
- **Qwen3 (pure attention):** These models work but don't benefit from tape-replay rollback, which is GatedDeltaNet-specific.
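If each drafted token were accepted independently with probability p (a simplification, since real acceptance is correlated), the expected yield per target forward pass is the expected accepted prefix plus the target's correction or bonus token:

```python
# Expected tokens emitted per target pass under i.i.d. per-token acceptance.
# Illustrative math only, not dflash's actual acceptance model.
def expected_tokens_per_pass(p: float, block: int = 16) -> float:
    expected_prefix = sum(p**k for k in range(1, block + 1))  # E[prefix length]
    return expected_prefix + 1  # plus the target's correction/bonus token

print(f"{expected_tokens_per_pass(0.88):.1f} tokens per pass")  # ~7.4 at p=0.88
```

Seven-plus tokens per target pass, minus draft overhead, is consistent with the 1.7x–4.1x speedups quoted above.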
## Troubleshooting
**Model rejected at startup**

`Error: No DFlash draft found for model 'org/ModelName'` → pass `--draft org/ModelName-DFlash` explicitly, or use a model from the supported pairs table.

**Low acceptance rate (< 80%)**

- Usually caused by very long context (4096+). Try `--dflash-max-ctx 8192` to extend the fallback threshold.
- Qwen3 (non-3.5) models have lower acceptance than Qwen3.5 hybrid models.

**Numerical divergence / output differs from pure AR**

- Expected behavior: "Output can still differ from pure AR because of MLX dispatch divergence, but no unverified token is ever emitted."
- If outputs seem wrong (not just different), ensure MLX 0.31.1+ is installed: `python -c "import mlx.core as mx; print(mx.__version__)"`
**Server not accepting connections**

```bash
# Check port is not in use
lsof -i :8000

# Bind to all interfaces for network access
dflash-serve --model Qwen/Qwen3.5-9B --port 8000 --host 0.0.0.0
```
**Out of memory with large models**
- Use 4-bit quantized variants: `mlx-community/Qwen3.5-27B-4bit` instead of the full model.
- The draft model loads alongside the target — budget ~1–2GB extra for the draft.
**Benchmark results JSON location**

```bash
ls benchmark/results/
```

Per-run JSON with tok/s, acceptance rate, and repeat measurements.