dflash-mlx Speculative Decoding

Skill by ara.so — Daily 2026 Skills collection.
DFlash implements lossless speculative decoding for MLX on Apple Silicon. A small draft model (~1B params) generates 16 tokens in parallel using block diffusion; the target model verifies all 16 in a single forward pass. Tokens are only emitted after target verification — output is lossless (every token is the target model's greedy argmax).
Typical speedups: 1.7x–4.1x over baseline `mlx_lm`, depending on model size and context length. Acceptance rates hover around 87–90% for Qwen3.5 models.
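
For intuition, here is a minimal sketch of that greedy accept/reject rule (illustrative only, not the dflash-mlx internals): the drafted block is kept up to the longest prefix that matches the target model's per-position argmax.

python
# Minimal sketch of lossless greedy acceptance (illustrative, not dflash-mlx source).
# draft_tokens:  block of tokens proposed by the draft model (e.g. 16 of them)
# target_argmax: the target model's greedy token at each position, from one verify pass
def accept_prefix(draft_tokens: list[int], target_argmax: list[int]) -> list[int]:
    accepted = []
    for drafted, verified in zip(draft_tokens, target_argmax):
        if drafted != verified:
            break                    # reject this token and everything after it
        accepted.append(drafted)     # identical to the target's argmax, safe to emit
    return accepted

Because every accepted token equals the target's own argmax, the emitted stream never contains an unverified token (see the note on MLX dispatch divergence under Troubleshooting).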

Installation

bash
pip install dflash-mlx

or isolated install

pipx install dflash-mlx

Requires Python 3.10+, MLX 0.31.1+, Apple Silicon Mac.
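
To confirm the environment meets these requirements, a quick sanity check (a small sketch; it only verifies the MLX version and that the package imports):

python
# Post-install sanity check.
import mlx.core as mx
import dflash_mlx  # succeeds only if the package installed correctly

print("MLX version:", mx.__version__)  # should be 0.31.1 or newer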

Key CLI Commands

Generate text

bash

Auto-resolve draft model from registry

dflash --model Qwen/Qwen3.5-9B --prompt "Explain backpropagation"

Explicit draft model

dflash --model Qwen/Qwen3.5-9B \
  --draft z-lab/Qwen3.5-9B-DFlash \
  --prompt "Explain backpropagation"

Disable EOS (useful for benchmarking fixed token counts)

dflash --model Qwen/Qwen3.5-9B --prompt "..." --max-tokens 1024 --no-eos

OpenAI-compatible server

bash

Basic server

dflash-serve --model Qwen/Qwen3.5-9B --port 8000

With explicit draft

dflash-serve --model Qwen/Qwen3.5-9B \
  --draft z-lab/Qwen3.5-9B-DFlash \
  --port 8000

Disable thinking/reasoning tokens (Qwen3.5 thinking models)

dflash-serve --model Qwen/Qwen3.5-9B --port 8000 \
  --chat-template-args '{"enable_thinking": false}'

Raise fallback threshold for longer prompts (large models)

dflash-serve --model mlx-community/Qwen3.5-35B-A3B-4bit --port 8000 \
  --chat-template-args '{"enable_thinking": false}' \
  --dflash-max-ctx 16384

Benchmark

bash
dflash-benchmark \
  --model Qwen/Qwen3.5-9B \
  --draft z-lab/Qwen3.5-9B-DFlash \
  --prompt "The function f satisfies..." \
  --max-tokens 1024 \
  --repeat 3 \
  --no-eos
Outputs per-run JSON reports with tok/s, acceptance rate, and speedup vs baseline.

Supported Model Pairs

| Target Model | Draft Model |
| --- | --- |
| Qwen/Qwen3.5-4B | z-lab/Qwen3.5-4B-DFlash |
| Qwen/Qwen3.5-9B | z-lab/Qwen3.5-9B-DFlash |
| mlx-community/Qwen3.5-27B-4bit | z-lab/Qwen3.5-27B-DFlash |
| mlx-community/Qwen3.5-35B-A3B-4bit | z-lab/Qwen3.5-35B-A3B-DFlash |

Draft models are auto-resolved from a registry — no `--draft` flag needed for listed pairs. Models without a matching draft are rejected at startup.
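
Conceptually the registry is just a target-to-draft mapping; a hypothetical sketch of the resolution step (names and structure are illustrative, not the actual dflash-mlx code):

python
# Hypothetical sketch of draft auto-resolution (illustrative only).
DRAFT_REGISTRY = {
    "Qwen/Qwen3.5-4B": "z-lab/Qwen3.5-4B-DFlash",
    "Qwen/Qwen3.5-9B": "z-lab/Qwen3.5-9B-DFlash",
    "mlx-community/Qwen3.5-27B-4bit": "z-lab/Qwen3.5-27B-DFlash",
    "mlx-community/Qwen3.5-35B-A3B-4bit": "z-lab/Qwen3.5-35B-A3B-DFlash",
}

def resolve_draft(model: str, draft: str | None = None) -> str:
    if draft is not None:
        return draft  # an explicit --draft always wins (bypasses the registry)
    if model not in DRAFT_REGISTRY:
        raise ValueError(f"No DFlash draft found for model '{model}'")
    return DRAFT_REGISTRY[model]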

Python API Usage

Streaming generation

python
from dflash_mlx import DFlashRuntime

runtime = DFlashRuntime.from_pretrained(
    model="Qwen/Qwen3.5-9B",
    draft="z-lab/Qwen3.5-9B-DFlash",  # optional, auto-resolved
)

prompt = "Explain the Pythagorean theorem step by step."

for token_text in runtime.stream_generate(
    prompt=prompt,
    max_tokens=512,
    use_chat_template=True,
):
    print(token_text, end="", flush=True)
print()

Full generation with stats

python
from dflash_mlx import DFlashRuntime

runtime = DFlashRuntime.from_pretrained(model="Qwen/Qwen3.5-9B")

result = runtime.generate(
    prompt="What is speculative decoding?",
    max_tokens=256,
    use_chat_template=True,
)

print(result.text)
print(f"Tokens/sec: {result.tokens_per_second:.2f}")
print(f"Acceptance rate: {result.acceptance_rate:.2%}")
print(f"Total tokens: {result.total_tokens}")

Custom draft block size and context

python
from dflash_mlx import DFlashRuntime, DFlashConfig

config = DFlashConfig(
    draft_block_size=16,      # tokens drafted per speculative step
    max_ctx=8192,             # max context length before fallback
    enable_tape_replay=True,  # GatedDeltaNet recurrent rollback
    jit_sdpa=True,            # custom Metal SDPA for long contexts
)

runtime = DFlashRuntime.from_pretrained(
    model="mlx-community/Qwen3.5-27B-4bit",
    config=config,
)

OpenAI client against dflash-serve

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # dflash-serve does not require auth by default
)

Non-streaming

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[
        {"role": "user", "content": "Explain gradient descent."}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[{"role": "user", "content": "Write a haiku about silicon."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

Tool calling (via dflash-serve)

python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

tool_call = response.choices[0].message.tool_calls[0]
print(f"Function: {tool_call.function.name}")
print(f"Args: {json.loads(tool_call.function.arguments)}")

Common Patterns

Side-by-side demo (baseline vs DFlash)

bash
PYTHONPATH=. python3 -m examples.demo --mode dflash \
  --target-model Qwen/Qwen3.5-9B \
  --draft-model z-lab/Qwen3.5-9B-DFlash \
  --prompt "Solve: f(x) + f(y) = f(x+y) - xy - 1" \
  --max-tokens 2048 \
  --no-eos

Integrating with Open WebUI

  1. Start `dflash-serve --model Qwen/Qwen3.5-9B --port 8000`
  2. In Open WebUI settings → Connections → add OpenAI API with URL `http://localhost:8000/v1`
  3. Select model `Qwen/Qwen3.5-9B` in the chat UI

Works the same for Continue, aider, OpenCode, and any OpenAI-compatible client.
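
Before wiring up a UI, it can help to confirm the server answers OpenAI-style requests. A minimal sketch, assuming dflash-serve exposes the standard /v1/models listing like other OpenAI-compatible servers:

python
# Quick connectivity check against a running dflash-serve instance.
# Assumes the standard OpenAI-compatible /v1/models endpoint is available.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

for model in client.models.list():
    print(model.id)  # should list Qwen/Qwen3.5-9B if the server is up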

Override draft for unsupported models

bash

Force a custom draft — bypasses registry check

dflash --model my-org/MyCustomModel \
  --draft my-org/MyCustomModel-DFlash \
  --prompt "Hello"

Disable thinking tokens for Qwen3.5

bash

CLI

dflash --model Qwen/Qwen3.5-9B \
  --chat-template-args '{"enable_thinking": false}' \
  --prompt "What is 2+2?"

Server

dflash-serve --model Qwen/Qwen3.5-9B \
  --chat-template-args '{"enable_thinking": false}' \
  --port 8000

Architecture Notes

  • **Tape-replay rollback:** For hybrid GatedDeltaNet + attention models (Qwen3.5), dflash records an innovation tape during verification and replays only the accepted steps via a custom Metal kernel — avoids full state snapshots.
  • **JIT SDPA 2-pass:** For contexts ≥ 1024 tokens, a custom Metal attention kernel maintains numerical alignment with stock MLX attention.
  • **Greedy acceptance:** Keeps the longest correct prefix from the 16 drafted tokens and rejects the rest. No temperature or sampling during verification — strictly lossless.
  • **Qwen3 (pure attention)** models work but don't benefit from tape-replay rollback (that's GatedDeltaNet-specific).

Troubleshooting

**Model rejected at startup**
Error: No DFlash draft found for model 'org/ModelName'
→ Pass `--draft org/ModelName-DFlash` explicitly, or use a model from the supported pairs table.

**Low acceptance rate (< 80%)**
  • Usually caused by very long context (4096+). Try `--dflash-max-ctx 8192` to extend the fallback threshold.
  • Qwen3 (non-3.5) models have lower acceptance than Qwen3.5 hybrid models.

**Numerical divergence / output differs from pure AR**
  • Expected behavior: "Output can still differ from pure AR because of MLX dispatch divergence, but no unverified token is ever emitted."
  • If outputs seem wrong (not just different), ensure MLX 0.31.1+ is installed: `python -c "import mlx.core; print(mlx.core.__version__)"`

**Server not accepting connections**
bash

Check port is not in use

lsof -i :8000

Bind to all interfaces for network access

dflash-serve --model Qwen/Qwen3.5-9B --port 8000 --host 0.0.0.0

**Out of memory with large models**
- Use 4-bit quantized variants: `mlx-community/Qwen3.5-27B-4bit` instead of the full model.
- The draft model loads alongside the target — budget ~1–2GB extra for the draft.

**Benchmark results JSON location**
```bash
ls benchmark/results/

# Per-run JSON with tok/s, acceptance rate, repeat measurements
```
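
To post-process the reports, something like the following works as a starting point (a sketch; the field names are assumptions for illustration, so check one report for the real schema):

python
# Summarize the per-run benchmark reports in benchmark/results/.
# NOTE: "tokens_per_second" and "acceptance_rate" are assumed key names,
# used here only for illustration; inspect a report to confirm the schema.
import json
from pathlib import Path

for report_path in sorted(Path("benchmark/results").glob("*.json")):
    report = json.loads(report_path.read_text())
    print(report_path.name,
          report.get("tokens_per_second"),
          report.get("acceptance_rate"))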
