openduck-distributed-duckdb


OpenDuck Distributed DuckDB


Skill by ara.so — Daily 2026 Skills collection

OpenDuck is an open-source implementation of distributed DuckDB featuring differential storage (append-only immutable layers via Postgres + an object store), hybrid dual execution (single queries split across local and remote workers), and transparent remote database attach via `ATTACH 'openduck:mydb'`. It is architecturally inspired by MotherDuck but uses a fully open protocol (gRPC + Arrow IPC).

Architecture Overview


```text
DuckDB client (openduck extension)
  └─ ATTACH 'openduck:mydb?endpoint=...' AS cloud
       └─ gRPC + Arrow IPC
            └─ Gateway (Rust)
                 ├─ auth / routing / plan splitting
                 ├─ Worker 1 (embedded DuckDB)
                 └─ Worker N (embedded DuckDB)
                      ├─ Postgres (metadata)
                      └─ Object store (sealed layers)
```

Key concepts:
  • `OpenDuckCatalog` / `OpenDuckTableEntry` — remote tables appear as first-class DuckDB catalog entries
  • Hybrid execution — the gateway labels operators `LOCAL` or `REMOTE` and inserts `Bridge` operators at the boundaries
  • Differential storage — immutable sealed layers, snapshot isolation, one write path, many readers
  • Protocol — only 2 gRPC RPCs, defined in `proto/openduck/v1/execution.proto`
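The labeling pass described in the hybrid-execution bullet can be sketched in a few lines of Python. This is an illustrative model only: `PlanNode`, `label_plan`, and the catalog check are hypothetical names, not the gateway's actual Rust internals.

```python
from dataclasses import dataclass, field

REMOTE_CATALOGS = {"cloud"}  # catalogs attached via openduck: (assumption)

@dataclass
class PlanNode:
    op: str                      # e.g. "HashJoin", "Scan(products)"
    catalog: str = "local"       # which catalog a Scan reads from
    children: list = field(default_factory=list)
    site: str = "LOCAL"          # filled in by label_plan

def label_plan(node: PlanNode) -> PlanNode:
    """Label each operator LOCAL or REMOTE, inserting Bridge nodes
    wherever a REMOTE subtree feeds a LOCAL parent."""
    node.children = [label_plan(c) for c in node.children]
    if node.op.startswith("Scan"):
        node.site = "REMOTE" if node.catalog in REMOTE_CATALOGS else "LOCAL"
        return node
    # Non-leaf operators run locally in this sketch; wrap remote children
    node.site = "LOCAL"
    node.children = [
        PlanNode("Bridge(REMOTE->LOCAL)", children=[c]) if c.site == "REMOTE" else c
        for c in node.children
    ]
    return node

plan = PlanNode("HashJoin", children=[
    PlanNode("Scan(products)", catalog="local"),
    PlanNode("Scan(sales)", catalog="cloud"),
])
labeled = label_plan(plan)
```

Running this on the products/sales join reproduces the shape shown in the Hybrid Execution Pattern section: the remote scan ends up under a `Bridge(REMOTE->LOCAL)` node.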


Repository Layout


```text
crates/
  exec-gateway/     # auth, routing, hybrid plan splitting
  exec-worker/      # embedded DuckDB, Arrow IPC streaming
  exec-proto/       # protobuf/tonic codegen
  openduck-cli/     # unified CLI (serve|gateway|worker)
  diff-*/           # differential storage (layers, metadata, FUSE)

extensions/
  openduck/         # DuckDB C++ extension (StorageExtension + Catalog)

clients/
  python/           # pip-installable openduck wrapper

proto/
  openduck/v1/      # execution.proto
```


Installation & Build


Prerequisites


  • Rust toolchain (stable)
  • C++ build tools, `vcpkg`, `bison` (macOS: `brew install bison`)
  • DuckDB development headers (handled by the extension Makefile)

1. Build the Rust backend


```bash
git clone https://github.com/CITGuru/openduck
cd openduck
cargo build --workspace
```

2. Build the DuckDB C++ extension


```bash
cd extensions/openduck
make
```

Output:

```text
extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
```

3. Install the Python client (optional)


```bash
pip install -e clients/python
```

Running the Server



Required env vars


```bash
export OPENDUCK_TOKEN=your-secret-token
```

Start all-in-one (gateway + worker)


```bash
cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
```

Or run gateway and worker separately


```bash
cargo run -p openduck-cli -- gateway --port 7878
cargo run -p openduck-cli -- worker --gateway http://localhost:7878
```

---

Connecting from Python


Via openduck wrapper (recommended)


```python
import openduck  # auto-detects extension from build tree or OPENDUCK_EXTENSION_PATH

con = openduck.connect("mydb")  # uses OPENDUCK_TOKEN env var
con.sql("SELECT 1 AS x").show()
con.sql("SELECT * FROM cloud.users LIMIT 10").show()
```

Via raw DuckDB SDK


```python
import duckdb
import os

# e.g. extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]

con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.execute(f"LOAD '{ext_path}';")
con.execute(
    "ATTACH 'openduck:mydb"
    "?endpoint=http://localhost:7878"
    f"&token={os.environ['OPENDUCK_TOKEN']}' AS cloud;"
)
```

Query remote table


```python
con.sql("SELECT * FROM cloud.users LIMIT 10").show()
```

Hybrid query — local table joined with remote table


```python
con.sql("""
    SELECT l.product_id, l.name, r.total_sales
    FROM local.products l
    JOIN cloud.sales r ON l.product_id = r.product_id
    WHERE r.total_sales > 1000
""").show()
```

Environment variables


```bash
export OPENDUCK_TOKEN=your-secret-token
export OPENDUCK_EXTENSION_PATH=extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
export OPENDUCK_ENDPOINT=http://localhost:7878   # default
```

Connecting from the CLI


```bash
duckdb -unsigned -c "
  LOAD 'extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension';
  ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token=${OPENDUCK_TOKEN}' AS cloud;
  SHOW ALL TABLES;
  SELECT * FROM cloud.users LIMIT 5;
"
```

Connecting from Rust


```rust
use duckdb::{Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;
    let ext_path = std::env::var("OPENDUCK_EXTENSION_PATH").unwrap();
    let token = std::env::var("OPENDUCK_TOKEN").unwrap();

    conn.execute_batch(&format!(r#"
        SET allow_unsigned_extensions = true;
        LOAD '{ext_path}';
        ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token={token}' AS cloud;
    "#))?;

    let mut stmt = conn.prepare("SELECT * FROM cloud.users LIMIT 10")?;
    let rows = stmt.query_map([], |row| {
        Ok(row.get::<_, String>(0)?)
    })?;

    for row in rows {
        println!("{}", row?);
    }
    Ok(())
}
```

Hybrid Execution Pattern


Hybrid execution happens automatically — the gateway splits the logical plan:

```text
[LOCAL]  HashJoin(l.id = r.id)
  [LOCAL]  Scan(products)       ← runs on your machine
  [LOCAL]  Bridge(REMOTE→LOCAL)
    [REMOTE] Scan(sales)        ← runs on remote worker
```

Write queries naturally — the extension handles routing:

```python
# This single query runs across two engines transparently
con.sql("""
    SELECT p.category, SUM(s.amount) AS revenue
    FROM local.products p   -- local table
    JOIN cloud.sales s      -- remote table
      ON p.id = s.product_id
    GROUP BY p.category
    ORDER BY revenue DESC
""").show()
```

---

Differential Storage


Differential storage is managed server-side. Key properties:
  • Append-only sealed layers stored in object storage (S3-compatible)
  • Postgres stores layer metadata and snapshot pointers
  • Snapshot isolation — readers always see a consistent view
  • One serialized write path — many concurrent readers
From a client perspective it is fully transparent. DuckDB sees normal table semantics.
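The properties above can be illustrated with a toy in-memory model. Everything here is hypothetical: the real layer format lives in object storage and the real metadata in Postgres, so `layers`, `snapshots`, `seal_layer`, and `read` are stand-in names, not OpenDuck's actual schema.

```python
# Toy model of differential storage: append-only sealed layers plus
# snapshot pointers. Illustrative only; not OpenDuck's server code.
layers = []       # sealed, immutable layers (oldest first)
snapshots = {}    # snapshot id -> number of visible layers

def seal_layer(rows):
    """Single write path: append a new immutable layer, advance snapshot."""
    layers.append(tuple(rows))            # sealed = never mutated again
    snap_id = len(snapshots) + 1
    snapshots[snap_id] = len(layers)      # pointer to the layer frontier
    return snap_id

def read(snap_id):
    """Readers resolve a snapshot to a fixed prefix of layers (snapshot
    isolation): later writes never change what this snapshot sees."""
    visible = layers[: snapshots[snap_id]]
    return [row for layer in visible for row in layer]

s1 = seal_layer([("alice", 1)])
s2 = seal_layer([("bob", 2)])
# s1 still sees only the first layer even after the second write
assert read(s1) == [("alice", 1)]
assert read(s2) == [("alice", 1), ("bob", 2)]
```

The key design point the sketch captures: because sealed layers are never mutated, any number of readers can resolve their snapshot pointer concurrently while the single writer appends.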


ATTACH URL Reference


```text
openduck:<database_name>?endpoint=<url>&token=<token>
```

| Parameter  | Default                        | Description                       |
|------------|--------------------------------|-----------------------------------|
| `endpoint` | `http://localhost:7878`        | Gateway URL                       |
| `token`    | `$OPENDUCK_TOKEN` env var      | Auth token matching server config |

Examples:

```sql
-- Local dev
ATTACH 'openduck:mydb?token=dev-token' AS cloud;

-- Remote server, explicit endpoint
ATTACH 'openduck:mydb?endpoint=https://my-server.example.com&token=prod-token' AS cloud;

-- Alias: od: also works
ATTACH 'od:mydb?endpoint=http://localhost:7878&token=dev-token' AS cloud;
```
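As a sanity check, the URL grammar above can be parsed with the standard library. The defaults mirror the table; `parse_attach_url` is illustrative client-side logic, not the extension's actual C++ parser.

```python
import os
from urllib.parse import parse_qs

def parse_attach_url(url: str) -> dict:
    """Split 'openduck:<db>?endpoint=<url>&token=<token>' into parts,
    applying the documented defaults for missing parameters."""
    scheme, rest = url.split(":", 1)
    assert scheme in ("openduck", "od"), "unknown scheme"
    database, _, query = rest.partition("?")
    params = parse_qs(query)
    return {
        "database": database,
        "endpoint": params.get("endpoint", ["http://localhost:7878"])[0],
        "token": params.get("token", [os.environ.get("OPENDUCK_TOKEN", "")])[0],
    }

cfg = parse_attach_url(
    "openduck:mydb?endpoint=https://my-server.example.com&token=prod-token"
)
```

Note that `split(":", 1)` only strips the scheme, so the `https://` colon inside the endpoint value survives intact.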


DuckLake Integration


OpenDuck and DuckLake operate at different layers and complement each other:

```python
import duckdb, os

ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]
token = os.environ["OPENDUCK_TOKEN"]

con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.execute(f"LOAD '{ext_path}';")

# Attach a DuckLake catalog (managed by a remote worker backed by DuckLake)
con.execute(f"ATTACH 'openduck:lakehouse?endpoint=http://localhost:7878&token={token}' AS lh;")

# Query DuckLake tables transparently via the OpenDuck transport
con.sql("SELECT * FROM lh.events WHERE event_date = today()").show()

# Hybrid: local scratch data joined with a remote DuckLake table
con.sql("""
    SELECT l.session_id, r.user_email
    FROM memory.sessions l
    JOIN lh.users r ON l.user_id = r.id
""").show()
```

---
Protocol Reference

The wire protocol is intentionally minimal. See `proto/openduck/v1/execution.proto`:
  • ExecuteQuery — send SQL, receive a query handle
  • StreamResults — stream Arrow IPC record batches back to the client

Any gRPC service implementing these two RPCs is a valid OpenDuck backend. You can replace the Rust gateway with a custom implementation in any language.
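The two-RPC handshake can be mimicked without gRPC to show the flow a backend must support. The RPC names come from the section above; everything else (the in-memory class, the fake row batches) is a stand-in for the real Arrow IPC stream, whose actual message shapes live in `execution.proto`.

```python
import itertools

class MockBackend:
    """In-memory stand-in for an OpenDuck backend: ExecuteQuery returns
    a handle, StreamResults yields result batches for that handle.
    A real backend streams Arrow IPC record batches over gRPC instead."""

    def __init__(self):
        self._results = {}
        self._next = itertools.count(1)

    def ExecuteQuery(self, sql: str) -> int:
        handle = next(self._next)
        # Pretend we executed the SQL; stash fake row batches under the handle.
        self._results[handle] = [[("x", 1)], [("x", 2)]]
        return handle

    def StreamResults(self, handle: int):
        # One-shot stream: yield each batch, then release the handle.
        yield from self._results.pop(handle)

backend = MockBackend()
h = backend.ExecuteQuery("SELECT 1 AS x")
batches = list(backend.StreamResults(h))
```

The handle-then-stream split is what lets results arrive incrementally: the client can start consuming batches before the full result set is materialized.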

Common Patterns

Check which tables are available remotely

```sql
-- After ATTACH ... AS cloud
SHOW ALL TABLES;
SELECT table_name FROM information_schema.tables WHERE table_schema = 'cloud';
```

Write to a remote table

```sql
INSERT INTO cloud.events SELECT * FROM read_parquet('local_dump.parquet');
```

Create a remote table from local data

```sql
CREATE TABLE cloud.new_table AS SELECT * FROM local_csv LIMIT 0;
INSERT INTO cloud.new_table SELECT * FROM local_csv;
```

Export remote query result to local Parquet

```python
con.sql("SELECT * FROM cloud.large_table WHERE region = 'us-east'") \
   .write_parquet("output/us_east.parquet")
```

Troubleshooting

Extension is not trusted / signature error

```python
# Always set allow_unsigned_extensions before loading
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
```

Or in the CLI:

```bash
duckdb -unsigned
```

LOAD fails — extension not found

```bash
# Set the env var to the exact built path
export OPENDUCK_EXTENSION_PATH=$(pwd)/extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
ls -la $OPENDUCK_EXTENSION_PATH   # confirm it exists
```

Connection refused to gateway

```bash
# Verify the server is running
cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
# Default port is 7878 — check firewall / port binding
curl http://localhost:7878/health
```

Token mismatch / auth failure

```bash
# Server token and client token must match exactly
export OPENDUCK_TOKEN=same-value-on-both-sides
# Server: cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
# Client: ATTACH '...&token=same-value-on-both-sides' AS cloud;
```

Build fails on macOS — bison version

```bash
brew install bison
export PATH="$(brew --prefix bison)/bin:$PATH"
cd extensions/openduck && make
```

Extension version mismatch with DuckDB

The extension must be built against the same DuckDB version as the Python package:

```bash
python -c "import duckdb; print(duckdb.__version__)"
# Ensure the extension Makefile targets the same version:
# check extensions/openduck/Makefile for DUCKDB_VERSION
```
---

OpenDuck vs Alternatives

| Feature                    | OpenDuck               | Arrow Flight SQL | DuckLake                |
|----------------------------|------------------------|------------------|-------------------------|
| Remote attach UX           | `ATTACH 'openduck:db'` | Separate driver  | `ATTACH 'ducklake:...'` |
| Hybrid execution           | ✅ split plan          | ❌ full remote   |                         |
| DuckDB catalog integration | ✅ native              |                  |                         |
| Protocol RPCs              | 2                      | ~15              | N/A                     |
| Differential storage       | ✅                     |                  | via Parquet layers      |
| Self-hosted                | ✅                     |                  |                         |