dflash-mlx Speculative Decoding

Skill by ara.so — Daily 2026 Skills collection.
DFlash implements lossless speculative decoding for MLX on Apple Silicon. A small draft model (~1B params) generates 16 tokens in parallel using block diffusion; the target model verifies all 16 in a single forward pass. Tokens are only emitted after target verification — output is lossless (every token is the target model's greedy argmax).
Typical speedups: 1.7x–4.1x over baseline `mlx_lm`, depending on model size and context length. Acceptance rates hover around 87–90% for Qwen3.5 models.
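
For intuition, here is a minimal sketch of that greedy accept/reject rule (illustrative only, not the dflash-mlx internals): the drafted block is kept up to the longest prefix that matches the target model's per-position argmax.

python
# Minimal sketch of lossless greedy acceptance (illustrative, not dflash-mlx source).
# draft_tokens:  block of tokens proposed by the draft model (e.g. 16 of them)
# target_argmax: the target model's greedy token at each position, from one verify pass
def accept_prefix(draft_tokens: list[int], target_argmax: list[int]) -> list[int]:
    accepted = []
    for drafted, verified in zip(draft_tokens, target_argmax):
        if drafted != verified:
            break                    # reject this token and everything after it
        accepted.append(drafted)     # identical to the target's argmax, safe to emit
    return accepted

Because every accepted token equals the target's own argmax, the emitted stream never contains an unverified token (see the note on MLX dispatch divergence under Troubleshooting).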

Installation

bash
pip install dflash-mlx

or isolated install

pipx install dflash-mlx

Requires Python 3.10+, MLX 0.31.1+, Apple Silicon Mac.
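
To confirm the environment meets these requirements, a quick sanity check (a small sketch; it only verifies the MLX version and that the package imports):

python
# Post-install sanity check.
import mlx.core as mx
import dflash_mlx  # succeeds only if the package installed correctly

print("MLX version:", mx.__version__)  # should be 0.31.1 or newer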

Key CLI Commands

Generate text

bash

Auto-resolve draft model from registry

dflash --model Qwen/Qwen3.5-9B --prompt "Explain backpropagation"

Explicit draft model

dflash --model Qwen/Qwen3.5-9B \
  --draft z-lab/Qwen3.5-9B-DFlash \
  --prompt "Explain backpropagation"

Disable EOS (useful for benchmarking fixed token counts)

dflash --model Qwen/Qwen3.5-9B --prompt "..." --max-tokens 1024 --no-eos

OpenAI-compatible server

bash

Basic server

dflash-serve --model Qwen/Qwen3.5-9B --port 8000

With explicit draft

dflash-serve --model Qwen/Qwen3.5-9B \
  --draft z-lab/Qwen3.5-9B-DFlash \
  --port 8000

Disable thinking/reasoning tokens (Qwen3.5 thinking models)

dflash-serve --model Qwen/Qwen3.5-9B --port 8000 \
  --chat-template-args '{"enable_thinking": false}'

Raise fallback threshold for longer prompts (large models)

dflash-serve --model mlx-community/Qwen3.5-35B-A3B-4bit --port 8000 \
  --chat-template-args '{"enable_thinking": false}' \
  --dflash-max-ctx 16384

Benchmark

bash
dflash-benchmark \
  --model Qwen/Qwen3.5-9B \
  --draft z-lab/Qwen3.5-9B-DFlash \
  --prompt "The function f satisfies..." \
  --max-tokens 1024 \
  --repeat 3 \
  --no-eos
Outputs per-run JSON reports with tok/s, acceptance rate, and speedup vs baseline.

Supported Model Pairs

| Target Model | Draft Model |
| --- | --- |
| Qwen/Qwen3.5-4B | z-lab/Qwen3.5-4B-DFlash |
| Qwen/Qwen3.5-9B | z-lab/Qwen3.5-9B-DFlash |
| mlx-community/Qwen3.5-27B-4bit | z-lab/Qwen3.5-27B-DFlash |
| mlx-community/Qwen3.5-35B-A3B-4bit | z-lab/Qwen3.5-35B-A3B-DFlash |

Draft models are auto-resolved from a registry — no `--draft` flag needed for listed pairs. Models without a matching draft are rejected at startup.
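
Conceptually the registry is just a target-to-draft mapping; a hypothetical sketch of the resolution step (names and structure are illustrative, not the actual dflash-mlx code):

python
# Hypothetical sketch of draft auto-resolution (illustrative only).
DRAFT_REGISTRY = {
    "Qwen/Qwen3.5-4B": "z-lab/Qwen3.5-4B-DFlash",
    "Qwen/Qwen3.5-9B": "z-lab/Qwen3.5-9B-DFlash",
    "mlx-community/Qwen3.5-27B-4bit": "z-lab/Qwen3.5-27B-DFlash",
    "mlx-community/Qwen3.5-35B-A3B-4bit": "z-lab/Qwen3.5-35B-A3B-DFlash",
}

def resolve_draft(model: str, draft: str | None = None) -> str:
    if draft is not None:
        return draft  # an explicit --draft always wins (bypasses the registry)
    if model not in DRAFT_REGISTRY:
        raise ValueError(f"No DFlash draft found for model '{model}'")
    return DRAFT_REGISTRY[model]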

Python API Usage

Streaming generation

python
from dflash_mlx import DFlashRuntime

runtime = DFlashRuntime.from_pretrained(
    model="Qwen/Qwen3.5-9B",
    draft="z-lab/Qwen3.5-9B-DFlash",  # optional, auto-resolved
)

prompt = "Explain the Pythagorean theorem step by step."

for token_text in runtime.stream_generate(
    prompt=prompt,
    max_tokens=512,
    use_chat_template=True,
):
    print(token_text, end="", flush=True)
print()

Full generation with stats

python
from dflash_mlx import DFlashRuntime

runtime = DFlashRuntime.from_pretrained(model="Qwen/Qwen3.5-9B")

result = runtime.generate(
    prompt="What is speculative decoding?",
    max_tokens=256,
    use_chat_template=True,
)

print(result.text)
print(f"Tokens/sec: {result.tokens_per_second:.2f}")
print(f"Acceptance rate: {result.acceptance_rate:.2%}")
print(f"Total tokens: {result.total_tokens}")

Custom draft block size and context

python
from dflash_mlx import DFlashRuntime, DFlashConfig

config = DFlashConfig(
    draft_block_size=16,      # tokens drafted per speculative step
    max_ctx=8192,             # max context length before fallback
    enable_tape_replay=True,  # GatedDeltaNet recurrent rollback
    jit_sdpa=True,            # custom Metal SDPA for long contexts
)

runtime = DFlashRuntime.from_pretrained(
    model="mlx-community/Qwen3.5-27B-4bit",
    config=config,
)

OpenAI client against dflash-serve

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # dflash-serve does not require auth by default
)

Non-streaming

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[
        {"role": "user", "content": "Explain gradient descent."}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[{"role": "user", "content": "Write a haiku about silicon."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

Tool calling (via dflash-serve)

python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

tool_call = response.choices[0].message.tool_calls[0]
print(f"Function: {tool_call.function.name}")
print(f"Args: {json.loads(tool_call.function.arguments)}")

Common Patterns

Side-by-side demo (baseline vs DFlash)

bash
PYTHONPATH=. python3 -m examples.demo --mode dflash \
  --target-model Qwen/Qwen3.5-9B \
  --draft-model z-lab/Qwen3.5-9B-DFlash \
  --prompt "Solve: f(x) + f(y) = f(x+y) - xy - 1" \
  --max-tokens 2048 \
  --no-eos

Integrating with Open WebUI

  1. Start `dflash-serve --model Qwen/Qwen3.5-9B --port 8000`
  2. In Open WebUI settings → Connections → add OpenAI API with URL `http://localhost:8000/v1`
  3. Select model `Qwen/Qwen3.5-9B` in the chat UI

Works the same for Continue, aider, OpenCode, and any OpenAI-compatible client.
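
Before wiring up a UI, it can help to confirm the server answers OpenAI-style requests. A minimal sketch, assuming dflash-serve exposes the standard /v1/models listing like other OpenAI-compatible servers:

python
# Quick connectivity check against a running dflash-serve instance.
# Assumes the standard OpenAI-compatible /v1/models endpoint is available.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

for model in client.models.list():
    print(model.id)  # should list Qwen/Qwen3.5-9B if the server is up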

Override draft for unsupported models

bash

Force a custom draft — bypasses registry check

dflash --model my-org/MyCustomModel \
  --draft my-org/MyCustomModel-DFlash \
  --prompt "Hello"

Disable thinking tokens for Qwen3.5

bash

CLI

dflash --model Qwen/Qwen3.5-9B \
  --chat-template-args '{"enable_thinking": false}' \
  --prompt "What is 2+2?"

Server

dflash-serve --model Qwen/Qwen3.5-9B \
  --chat-template-args '{"enable_thinking": false}' \
  --port 8000

Architecture Notes

  • **Tape-replay rollback:** For hybrid GatedDeltaNet + attention models (Qwen3.5), dflash records an innovation tape during verification and replays only the accepted steps via a custom Metal kernel — avoids full state snapshots.
  • **JIT SDPA 2-pass:** For contexts ≥ 1024 tokens, a custom Metal attention kernel maintains numerical alignment with stock MLX attention.
  • **Greedy acceptance:** Keeps the longest correct prefix from the 16 drafted tokens and rejects the rest. No temperature or sampling during verification — strictly lossless.
  • **Qwen3 (pure attention)** models work but don't benefit from tape-replay rollback (that's GatedDeltaNet-specific).

Troubleshooting

**Model rejected at startup**
Error: No DFlash draft found for model 'org/ModelName'
→ Pass `--draft org/ModelName-DFlash` explicitly, or use a model from the supported pairs table.

**Low acceptance rate (< 80%)**
  • Usually caused by very long context (4096+). Try `--dflash-max-ctx 8192` to extend the fallback threshold.
  • Qwen3 (non-3.5) models have lower acceptance than Qwen3.5 hybrid models.

**Numerical divergence / output differs from pure AR**
  • Expected behavior: "Output can still differ from pure AR because of MLX dispatch divergence, but no unverified token is ever emitted."
  • If outputs seem wrong (not just different), ensure MLX 0.31.1+ is installed: `python -c "import mlx.core; print(mlx.core.__version__)"`

**Server not accepting connections**
bash

Check port is not in use

lsof -i :8000

Bind to all interfaces for network access

dflash-serve --model Qwen/Qwen3.5-9B --port 8000 --host 0.0.0.0

**Out of memory with large models**
- Use 4-bit quantized variants: `mlx-community/Qwen3.5-27B-4bit` instead of the full model.
- The draft model loads alongside the target — budget ~1–2GB extra for the draft.

**Benchmark results JSON location**
```bash
ls benchmark/results/

# Per-run JSON with tok/s, acceptance rate, repeat measurements
```
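
To post-process the reports, something like the following works as a starting point (a sketch; the field names are assumptions for illustration, so check one report for the real schema):

python
# Summarize the per-run benchmark reports in benchmark/results/.
# NOTE: "tokens_per_second" and "acceptance_rate" are assumed key names,
# used here only for illustration; inspect a report to confirm the schema.
import json
from pathlib import Path

for report_path in sorted(Path("benchmark/results").glob("*.json")):
    report = json.loads(report_path.read_text())
    print(report_path.name,
          report.get("tokens_per_second"),
          report.get("acceptance_rate"))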
