OpenDuck Distributed DuckDB
OpenDuck 分布式DuckDB
Skill by
ara.so — Daily 2026 Skills collection
OpenDuck is an open-source implementation of distributed DuckDB featuring differential storage (append-only immutable layers via Postgres + object store), hybrid dual execution (single queries split across local and remote workers), and transparent remote database attach via
. It is architecturally inspired by MotherDuck but fully open protocol (gRPC + Arrow IPC).
OpenDuck是分布式DuckDB的开源实现,具备差分存储(通过Postgres + 对象存储实现仅追加的不可变分层)、混合双执行(单查询拆分至本地和远程工作节点执行)以及通过
实现的透明远程数据库挂载功能。其架构灵感源自MotherDuck,但采用完全开放的协议(gRPC + Arrow IPC)。
Architecture Overview
架构概述
DuckDB client (openduck extension)
└─ ATTACH 'openduck:mydb?endpoint=...' AS cloud
└─ gRPC + Arrow IPC
└─ Gateway (Rust)
├─ auth / routing / plan splitting
├─ Worker 1 (embedded DuckDB)
└─ Worker N (embedded DuckDB)
├─ Postgres (metadata)
└─ Object store (sealed layers)
Key concepts:
- / — remote tables appear as first-class DuckDB catalog entries
- Hybrid execution — gateway labels operators or , inserts operators at boundaries
- Differential storage — immutable sealed layers, snapshot isolation, one write path, many readers
- Protocol — only 2 gRPC RPCs defined in
proto/openduck/v1/execution.proto
DuckDB client (openduck extension)
└─ ATTACH 'openduck:mydb?endpoint=...' AS cloud
└─ gRPC + Arrow IPC
└─ Gateway (Rust)
├─ auth / routing / plan splitting
├─ Worker 1 (embedded DuckDB)
└─ Worker N (embedded DuckDB)
├─ Postgres (metadata)
└─ Object store (sealed layers)
核心概念:
- / — 远程表作为一等公民的DuckDB目录条目呈现
- 混合执行 — 网关将算子标记为或,在边界处插入算子
- 差分存储 — 不可变的密封分层、快照隔离、单一写入路径、多读取者
- 协议 — 仅在
proto/openduck/v1/execution.proto
中定义了2个gRPC RPC接口
crates/
exec-gateway/ # auth, routing, hybrid plan splitting
exec-worker/ # embedded DuckDB, Arrow IPC streaming
exec-proto/ # protobuf/tonic codegen
openduck-cli/ # unified CLI (serve|gateway|worker)
diff-*/ # differential storage (layers, metadata, FUSE)
extensions/
openduck/ # DuckDB C++ extension (StorageExtension + Catalog)
clients/
python/ # pip-installable openduck wrapper
proto/
openduck/v1/ # execution.proto
crates/
exec-gateway/ # 认证、路由、混合执行计划拆分
exec-worker/ # 嵌入式DuckDB、Arrow IPC流处理
exec-proto/ # protobuf/tonic代码生成
openduck-cli/ # 统一CLI(serve|gateway|worker)
diff-*/ # 差分存储(分层、元数据、FUSE)
extensions/
openduck/ # DuckDB C++扩展(StorageExtension + Catalog)
clients/
python/ # 可通过pip安装的openduck封装库
proto/
openduck/v1/ # execution.proto
Installation & Build
安装与构建
- Rust toolchain (stable)
- C++ build tools, , (macOS: )
- DuckDB development headers (handled by the extension Makefile)
- Rust工具链(稳定版)
- C++构建工具、、(macOS:)
- DuckDB开发头文件(由扩展Makefile自动处理)
1. Build the Rust backend
1. 构建Rust后端
bash
git clone https://github.com/CITGuru/openduck
cd openduck
cargo build --workspace
bash
git clone https://github.com/CITGuru/openduck
cd openduck
cargo build --workspace
2. Build the DuckDB C++ extension
2. 构建DuckDB C++扩展
bash
cd extensions/openduck
make
bash
cd extensions/openduck
make
extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
3. Install the Python client (optional)
3. 安装Python客户端(可选)
bash
pip install -e clients/python
bash
pip install -e clients/python
export OPENDUCK_TOKEN=your-secret-token
export OPENDUCK_TOKEN=your-secret-token
Start all-in-one (gateway + worker)
启动一体化服务(网关 + 工作节点)
cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
Or run gateway and worker separately
或者分别运行网关和工作节点
cargo run -p openduck-cli -- gateway --port 7878
cargo run -p openduck-cli -- worker --gateway
http://localhost:7878
cargo run -p openduck-cli -- gateway --port 7878
cargo run -p openduck-cli -- worker --gateway
http://localhost:7878
Connecting from Python
从Python连接
Via openduck wrapper (recommended)
通过openduck封装库(推荐)
python
import openduck # auto-detects extension from build tree or OPENDUCK_EXTENSION_PATH
con = openduck.connect("mydb") # uses OPENDUCK_TOKEN env var
con.sql("SELECT 1 AS x").show()
con.sql("SELECT * FROM cloud.users LIMIT 10").show()
python
import openduck # 自动从构建目录或OPENDUCK_EXTENSION_PATH检测扩展
con = openduck.connect("mydb") # 使用OPENDUCK_TOKEN环境变量
con.sql("SELECT 1 AS x").show()
con.sql("SELECT * FROM cloud.users LIMIT 10").show()
Via raw DuckDB SDK
通过原生DuckDB SDK
python
import duckdb
import os
ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]
python
import duckdb
import os
ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]
e.g. extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
示例:extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.execute(f"LOAD '{ext_path}';")
con.execute(
"ATTACH 'openduck:mydb"
"?endpoint=
http://localhost:7878"
f"&token={os.environ["OPENDUCK_TOKEN"]}' AS cloud;"
)
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.execute(f"LOAD '{ext_path}';")
con.execute(
"ATTACH 'openduck:mydb"
"?endpoint=
http://localhost:7878"
f"&token={os.environ["OPENDUCK_TOKEN"]}' AS cloud;"
)
con.sql("SELECT * FROM cloud.users LIMIT 10").show()
con.sql("SELECT * FROM cloud.users LIMIT 10").show()
Hybrid query — local table joined with remote table
混合查询 — 本地表与远程表关联
con.sql("""
SELECT l.product_id, l.name, r.total_sales
FROM local.products l
JOIN cloud.sales r ON l.product_id = r.product_id
WHERE r.total_sales > 1000
""").show()
con.sql("""
SELECT l.product_id, l.name, r.total_sales
FROM local.products l
JOIN cloud.sales r ON l.product_id = r.product_id
WHERE r.total_sales > 1000
""").show()
Environment variables
环境变量
bash
export OPENDUCK_TOKEN=your-secret-token
export OPENDUCK_EXTENSION_PATH=extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
export OPENDUCK_ENDPOINT=http://localhost:7878 # default
bash
export OPENDUCK_TOKEN=your-secret-token
export OPENDUCK_EXTENSION_PATH=extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
export OPENDUCK_ENDPOINT=http://localhost:7878 # 默认值
Connecting from the CLI
从CLI连接
bash
duckdb -unsigned -c "
LOAD 'extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension';
ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token=${OPENDUCK_TOKEN}' AS cloud;
SHOW ALL TABLES;
SELECT * FROM cloud.users LIMIT 5;
"
bash
duckdb -unsigned -c "
LOAD 'extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension';
ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token=${OPENDUCK_TOKEN}' AS cloud;
SHOW ALL TABLES;
SELECT * FROM cloud.users LIMIT 5;
"
Connecting from Rust
从Rust连接
rust
use duckdb::{Connection, Result};
fn main() -> Result<()> {
let conn = Connection::open_in_memory()?;
let ext_path = std::env::var("OPENDUCK_EXTENSION_PATH").unwrap();
let token = std::env::var("OPENDUCK_TOKEN").unwrap();
conn.execute_batch(&format!(r#"
SET allow_unsigned_extensions = true;
LOAD '{ext_path}';
ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token={token}' AS cloud;
"#))?;
let mut stmt = conn.prepare("SELECT * FROM cloud.users LIMIT 10")?;
let rows = stmt.query_map([], |row| {
Ok(row.get::<_, String>(0)?)
})?;
for row in rows {
println!("{}", row?);
}
Ok(())
}
rust
use duckdb::{Connection, Result};
fn main() -> Result<()> {
let conn = Connection::open_in_memory()?;
let ext_path = std::env::var("OPENDUCK_EXTENSION_PATH").unwrap();
let token = std::env::var("OPENDUCK_TOKEN").unwrap();
conn.execute_batch(&format!(r#"
SET allow_unsigned_extensions = true;
LOAD '{ext_path}';
ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token={token}' AS cloud;
"#))?;
let mut stmt = conn.prepare("SELECT * FROM cloud.users LIMIT 10")?;
let rows = stmt.query_map([], |row| {
Ok(row.get::<_, String>(0)?)
})?;
for row in rows {
println!("{}", row?);
}
Ok(())
}
Hybrid Execution Pattern
混合执行模式
Hybrid execution happens automatically — the gateway splits the logical plan:
[LOCAL] HashJoin(l.id = r.id)
[LOCAL] Scan(products) ← runs on your machine
[LOCAL] Bridge(REMOTE→LOCAL)
[REMOTE] Scan(sales) ← runs on remote worker
Write queries naturally — the extension handles routing:
混合执行自动触发 — 网关会拆分逻辑执行计划:
[LOCAL] HashJoin(l.id = r.id)
[LOCAL] Scan(products) ← 在本地机器运行
[LOCAL] Bridge(REMOTE→LOCAL)
[REMOTE] Scan(sales) ← 在远程工作节点运行
自然编写查询即可 — 扩展会自动处理路由:
This single query runs across two engines transparently
该单查询会在两个引擎间透明执行
con.sql("""
SELECT
p.category,
SUM(s.amount) AS revenue
FROM local.products p -- local table
JOIN cloud.sales s -- remote table
ON p.id = s.product_id
GROUP BY p.category
ORDER BY revenue DESC
""").show()
con.sql("""
SELECT
p.category,
SUM(s.amount) AS revenue
FROM local.products p -- 本地表
JOIN cloud.sales s -- 远程表
ON p.id = s.product_id
GROUP BY p.category
ORDER BY revenue DESC
""").show()
Differential storage is managed server-side. Key properties:
- Append-only sealed layers stored in object storage (S3-compatible)
- Postgres stores layer metadata and snapshot pointers
- Snapshot isolation — readers always see a consistent view
- One serialized write path — many concurrent readers
From a client perspective it is fully transparent. DuckDB sees normal table semantics.
差分存储由服务器端管理。核心特性:
- 仅追加的密封分层存储在对象存储(兼容S3)中
- Postgres存储分层元数据和快照指针
- 快照隔离 — 读取者始终看到一致视图
- 单一序列化写入路径 — 支持多并发读取者
从客户端视角来看完全透明。DuckDB会将其视为普通表进行操作。
ATTACH URL Reference
ATTACH URL 参考
openduck:<database_name>?endpoint=<url>&token=<token>
| Parameter | Default | Description |
|---|
| | Gateway URL |
| env var | Auth token matching server config |
Examples:
sql
-- Local dev
ATTACH 'openduck:mydb?token=dev-token' AS cloud;
-- Remote server, explicit endpoint
ATTACH 'openduck:mydb?endpoint=https://my-server.example.com&token=prod-token' AS cloud;
-- Alias: od: also works
ATTACH 'od:mydb?endpoint=http://localhost:7878&token=dev-token' AS cloud;
openduck:<database_name>?endpoint=<url>&token=<token>
| 参数 | 默认值 | 描述 |
|---|
| | 网关URL |
| 环境变量 | 与服务器配置匹配的认证令牌 |
示例:
sql
-- 本地开发环境
ATTACH 'openduck:mydb?token=dev-token' AS cloud;
-- 远程服务器,显式指定endpoint
ATTACH 'openduck:mydb?endpoint=https://my-server.example.com&token=prod-token' AS cloud;
-- 别名:od:同样生效
ATTACH 'od:mydb?endpoint=http://localhost:7878&token=dev-token' AS cloud;
DuckLake Integration
DuckLake 集成
OpenDuck and DuckLake operate at different layers and complement each other:
python
import duckdb, os
ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]
token = os.environ["OPENDUCK_TOKEN"]
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.execute(f"LOAD '{ext_path}';")
OpenDuck与DuckLake在不同层运行,可相互补充:
python
import duckdb, os
ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]
token = os.environ["OPENDUCK_TOKEN"]
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.execute(f"LOAD '{ext_path}';")
-- 挂载由远程工作节点托管、基于DuckLake的目录
con.execute(f"ATTACH 'openduck:lakehouse?endpoint=http://localhost:7878&token={token}' AS lh;")
-- 通过OpenDuck传输层透明查询DuckLake表
con.sql("SELECT * FROM lh.events WHERE event_date = today()").show()
-- 混合查询:本地临时数据与远程DuckLake表关联
con.sql("""
SELECT l.session_id, r.user_email
FROM memory.sessions l
JOIN lh.users r ON l.user_id = r.id
""").show()
Attach DuckLake catalog (managed by remote worker backed by DuckLake)
协议参考
有线协议设计得尽可能简洁。请查看
proto/openduck/v1/execution.proto
:
- — 发送SQL,获取查询句柄
- — 将Arrow IPC记录批次流式返回给客户端
任何实现这两个RPC接口的gRPC服务都是有效的OpenDuck后端。你可以用任意语言替换Rust网关实现自定义版本。
Query DuckLake tables transparently via OpenDuck transport
常见使用模式
con.sql("SELECT * FROM lh.events WHERE event_date = today()").show()
sql
-- 执行ATTACH ... AS cloud后
SHOW ALL TABLES;
SELECT table_name FROM information_schema.tables WHERE table_schema = 'cloud';
Hybrid: local scratch data joined with remote DuckLake table
写入远程表
con.sql("""
SELECT l.session_id, r.user_email
FROM memory.sessions l
JOIN lh.users r ON l.user_id = r.id
""").show()
sql
INSERT INTO cloud.events SELECT * FROM read_parquet('local_dump.parquet');
Protocol Reference
从本地数据创建远程表
The wire protocol is intentionally minimal. See
proto/openduck/v1/execution.proto
:
- — send SQL, receive a query handle
- — stream Arrow IPC record batches back to client
Any gRPC service implementing these two RPCs is a valid OpenDuck backend. You can replace the Rust gateway with a custom implementation in any language.
sql
CREATE TABLE cloud.new_table AS SELECT * FROM local_csv LIMIT 0;
INSERT INTO cloud.new_table SELECT * FROM local_csv;
Common Patterns
将远程查询结果导出为本地Parquet文件
Check which tables are available remotely
sql
-- After ATTACH ... AS cloud
SHOW ALL TABLES;
SELECT table_name FROM information_schema.tables WHERE table_schema = 'cloud';
python
con.sql("SELECT * FROM cloud.large_table WHERE region = 'us-east'") \
.write_parquet("output/us_east.parquet")
Write to a remote table
故障排除
sql
INSERT INTO cloud.events SELECT * FROM read_parquet('local_dump.parquet');
python
-- 加载扩展前务必设置allow_unsigned_extensions
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
或在CLI中:
Create a remote table from local data
sql
CREATE TABLE cloud.new_table AS SELECT * FROM local_csv LIMIT 0;
INSERT INTO cloud.new_table SELECT * FROM local_csv;
bash
-- 设置环境变量指向构建好的扩展路径
export OPENDUCK_EXTENSION_PATH=$(pwd)/extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
ls -la $OPENDUCK_EXTENSION_PATH # 确认文件存在
Export remote query result to local Parquet
连接网关被拒绝
python
con.sql("SELECT * FROM cloud.large_table WHERE region = 'us-east'") \
.write_parquet("output/us_east.parquet")
bash
-- 验证服务器是否运行
cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
-- 默认端口为7878 — 检查防火墙/端口绑定
curl http://localhost:7878/health
Troubleshooting
令牌不匹配 / 认证失败
bash
-- 服务器令牌与客户端令牌必须完全一致
export OPENDUCK_TOKEN=same-value-on-both-sides
-- 服务器端:cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
-- 客户端:ATTACH '...&token=same-value-on-both-sides' AS cloud;
Always set allow_unsigned_extensions before loading
macOS上构建失败 — bison版本问题
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
Or in CLI:
```bash
duckdb -unsigned
bash
brew install bison
export PATH="$(brew --prefix bison)/bin:$PATH"
cd extensions/openduck && make
fails — extension not found
扩展版本与DuckDB不匹配
扩展必须与Python包使用的DuckDB版本一致:
bash
python -c "import duckdb; print(duckdb.__version__)"
-- 确保扩展Makefile使用相同版本
-- 检查extensions/openduck/Makefile中的DUCKDB_VERSION
Set the env var to the exact built path
OpenDuck vs 替代方案
export OPENDUCK_EXTENSION_PATH=$(pwd)/extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
ls -la $OPENDUCK_EXTENSION_PATH # confirm it exists
| 特性 | OpenDuck | Arrow Flight SQL | DuckLake |
|---|
| 远程挂载体验 | | 独立驱动 | |
| 混合执行 | ✅ 拆分执行计划 | ❌ 全远程执行 | ❌ |
| DuckDB目录集成 | ✅ 原生支持 | ❌ | ✅ |
| 协议RPC数量 | 2 | ~15 | N/A |
| 差分存储 | ✅ | ❌ | 通过Parquet分层实现 |
| 自托管 | ✅ | ✅ | ✅ |
Connection refused to gateway
cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
Default port is 7878 — check firewall / port binding
Token mismatch / auth failure
Server token and client token must match exactly
export OPENDUCK_TOKEN=same-value-on-both-sides
Server: cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
Client: ATTACH '...&token=same-value-on-both-sides' AS cloud;
Build fails on macOS — bison version
bash
brew install bison
export PATH="$(brew --prefix bison)/bin:$PATH"
cd extensions/openduck && make
Extension version mismatch with DuckDB
The extension must be built against the same DuckDB version as the Python package:
bash
python -c "import duckdb; print(duckdb.__version__)"
Ensure the extension Makefile targets the same version
Check extensions/openduck/Makefile for DUCKDB_VERSION
| Feature | OpenDuck | Arrow Flight SQL | DuckLake |
|---|
| Remote attach UX | | Separate driver | |
| Hybrid execution | ✅ split plan | ❌ full remote | ❌ |
| DuckDB catalog integration | ✅ native | ❌ | ✅ |
| Protocol RPCs | 2 | ~15 | N/A |
| Differential storage | ✅ | ❌ | via Parquet layers |
| Self-hosted | ✅ | ✅ | ✅ |