OpenDuck Distributed DuckDB

Skill by ara.so — Daily 2026 Skills collection
OpenDuck is an open-source implementation of distributed DuckDB featuring differential storage (append-only immutable layers via Postgres + object store), hybrid dual execution (single queries split across local and remote workers), and transparent remote database attach via `ATTACH 'openduck:mydb'`. It is architecturally inspired by MotherDuck but uses a fully open protocol (gRPC + Arrow IPC).

Architecture Overview
```
DuckDB client (openduck extension)
 └─ ATTACH 'openduck:mydb?endpoint=...' AS cloud
     └─ gRPC + Arrow IPC
         └─ Gateway (Rust)
             ├─ auth / routing / plan splitting
             ├─ Worker 1 (embedded DuckDB)
             ├─ Worker N (embedded DuckDB)
             ├─ Postgres (metadata)
             └─ Object store (sealed layers)
```

Key concepts:

- `OpenDuckCatalog` / `OpenDuckTableEntry` — remote tables appear as first-class DuckDB catalog entries
- Hybrid execution — the gateway labels operators `LOCAL` or `REMOTE` and inserts `Bridge` operators at boundaries
- Differential storage — immutable sealed layers, snapshot isolation, one write path, many readers
- Protocol — only 2 gRPC RPCs, defined in `proto/openduck/v1/execution.proto`
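The labeling scheme above can be illustrated with a toy tree transformation: label scans by where their table lives, run everything else locally, and wrap any `REMOTE` child of a `LOCAL` parent in a `Bridge`. This is a sketch only; the real gateway operates on DuckDB logical plans, and all names here are hypothetical:

```python
# Toy sketch of hybrid plan splitting: label operators LOCAL/REMOTE and
# insert Bridge nodes at label boundaries. Illustrative only -- the real
# gateway works on DuckDB logical plans, not this toy tree.

class Op:
    def __init__(self, name, children=(), table=None):
        self.name, self.children, self.table = name, list(children), table
        self.label = None

REMOTE_TABLES = {"sales", "users"}  # hypothetical catalog knowledge

def label(op):
    """Scans are labeled by table location; inner nodes run LOCAL."""
    for c in op.children:
        label(c)
    if op.name == "Scan":
        op.label = "REMOTE" if op.table in REMOTE_TABLES else "LOCAL"
    else:
        op.label = "LOCAL"

def insert_bridges(op):
    """Wrap any REMOTE child of a LOCAL parent in a Bridge(REMOTE->LOCAL)."""
    for i, c in enumerate(op.children):
        insert_bridges(c)
        if op.label == "LOCAL" and c.label == "REMOTE":
            bridge = Op("Bridge", [c])
            bridge.label = "LOCAL"
            op.children[i] = bridge

def render(op, depth=0):
    tbl = f"({op.table})" if op.table else ""
    lines = [f"[{op.label}] " + "  " * depth + op.name + tbl]
    for c in op.children:
        lines += render(c, depth + 1)
    return lines

plan = Op("HashJoin", [Op("Scan", table="products"), Op("Scan", table="sales")])
label(plan)
insert_bridges(plan)
print("\n".join(render(plan)))
```

The output mirrors the plan shape shown later under Hybrid Execution Pattern: the `sales` scan ends up below a `Bridge` with a `REMOTE` label.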
Repository Layout
```
crates/
  exec-gateway/   # auth, routing, hybrid plan splitting
  exec-worker/    # embedded DuckDB, Arrow IPC streaming
  exec-proto/     # protobuf/tonic codegen
  openduck-cli/   # unified CLI (serve|gateway|worker)
  diff-*/         # differential storage (layers, metadata, FUSE)
extensions/
  openduck/       # DuckDB C++ extension (StorageExtension + Catalog)
clients/
  python/         # pip-installable openduck wrapper
proto/
  openduck/v1/    # execution.proto
```
Installation & Build
Prerequisites

- Rust toolchain (stable)
- C++ build tools, `vcpkg` (macOS: `bison` via `brew install bison`)
- DuckDB development headers (handled by the extension Makefile)
1. Build the Rust backend
```bash
git clone https://github.com/CITGuru/openduck
cd openduck
cargo build --workspace
```

2. Build the DuckDB C++ extension
```bash
cd extensions/openduck
make
```

Output:

```
extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
```

3. Install the Python client (optional)
```bash
pip install -e clients/python
```

Running the Server
```bash
# Required env vars
export OPENDUCK_TOKEN=your-secret-token

# Start all-in-one (gateway + worker)
cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN

# Or run gateway and worker separately
cargo run -p openduck-cli -- gateway --port 7878
cargo run -p openduck-cli -- worker --gateway http://localhost:7878
```
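Before issuing queries you may want to block until the gateway answers. A small readiness helper using only the standard library; it assumes the gateway's `/health` endpoint mentioned later under Troubleshooting, so treat the URL as an assumption:

```python
# Poll the gateway until it responds, or give up after `timeout` seconds.
# Assumes the /health endpoint referenced in the Troubleshooting section.
import time
import urllib.error
import urllib.request

def wait_for_gateway(base_url="http://localhost:7878", timeout=10.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=1) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(0.25)  # not up yet; retry shortly
    return False

# Example: call wait_for_gateway() before openduck.connect("mydb")
```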
---

Connecting from Python
Via openduck wrapper (recommended)
```python
import openduck  # auto-detects extension from build tree or OPENDUCK_EXTENSION_PATH

con = openduck.connect("mydb")  # uses OPENDUCK_TOKEN env var
con.sql("SELECT 1 AS x").show()
con.sql("SELECT * FROM cloud.users LIMIT 10").show()
```

Via raw DuckDB SDK
```python
import duckdb
import os

ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]
# e.g. extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension

con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.execute(f"LOAD '{ext_path}';")
con.execute(
    "ATTACH 'openduck:mydb"
    "?endpoint=http://localhost:7878"
    f"&token={os.environ['OPENDUCK_TOKEN']}' AS cloud;"
)

# Query remote table
con.sql("SELECT * FROM cloud.users LIMIT 10").show()

# Hybrid query — local table joined with remote table
con.sql("""
    SELECT l.product_id, l.name, r.total_sales
    FROM local.products l
    JOIN cloud.sales r ON l.product_id = r.product_id
    WHERE r.total_sales > 1000
""").show()
```
Environment variables
```bash
export OPENDUCK_TOKEN=your-secret-token
export OPENDUCK_EXTENSION_PATH=extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
export OPENDUCK_ENDPOINT=http://localhost:7878  # default
```

Connecting from the CLI
```bash
duckdb -unsigned -c "
LOAD 'extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension';
ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token=${OPENDUCK_TOKEN}' AS cloud;
SHOW ALL TABLES;
SELECT * FROM cloud.users LIMIT 5;
"
```

Connecting from Rust
```rust
use duckdb::{Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;
    let ext_path = std::env::var("OPENDUCK_EXTENSION_PATH").unwrap();
    let token = std::env::var("OPENDUCK_TOKEN").unwrap();
    conn.execute_batch(&format!(
        r#"
        SET allow_unsigned_extensions = true;
        LOAD '{ext_path}';
        ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token={token}' AS cloud;
        "#
    ))?;

    let mut stmt = conn.prepare("SELECT * FROM cloud.users LIMIT 10")?;
    let rows = stmt.query_map([], |row| Ok(row.get::<_, String>(0)?))?;
    for row in rows {
        println!("{}", row?);
    }
    Ok(())
}
```

Hybrid Execution Pattern
Hybrid execution happens automatically — the gateway splits the logical plan:

```
[LOCAL]  HashJoin(l.id = r.id)
[LOCAL]    Scan(products)          ← runs on your machine
[LOCAL]    Bridge(REMOTE→LOCAL)
[REMOTE]     Scan(sales)           ← runs on remote worker
```

Write queries naturally — the extension handles routing:
```python
# This single query runs across two engines transparently
con.sql("""
    SELECT
        p.category,
        SUM(s.amount) AS revenue
    FROM local.products p      -- local table
    JOIN cloud.sales s         -- remote table
      ON p.id = s.product_id
    GROUP BY p.category
    ORDER BY revenue DESC
""").show()
```
---

Differential Storage
Differential storage is managed server-side. Key properties:
- Append-only sealed layers stored in object storage (S3-compatible)
- Postgres stores layer metadata and snapshot pointers
- Snapshot isolation — readers always see a consistent view
- One serialized write path — many concurrent readers
From a client perspective it is fully transparent. DuckDB sees normal table semantics.
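The write/read flow above can be made concrete with a toy in-memory model: each write seals an immutable layer and bumps a published snapshot pointer, and readers fold exactly the layers visible at their pinned snapshot. Purely illustrative; real OpenDuck keeps sealed layers in object storage and pointers in Postgres:

```python
# Toy differential storage: append-only sealed layers + snapshot isolation.
# Real OpenDuck stores layers in an object store and metadata in Postgres;
# this sketch keeps everything in memory to show the invariants only.

class DiffStore:
    def __init__(self):
        self.layers = []   # append-only list of sealed (immutable) layers
        self.snapshot = 0  # published snapshot pointer: count of visible layers

    def write(self, rows):
        """Single write path: seal a new immutable layer, then publish it."""
        self.layers.append(tuple(rows))   # sealed: tuple is immutable
        self.snapshot = len(self.layers)  # pointer bump publishes the layer

    def read(self, snapshot=None):
        """Readers fold all layers visible at their snapshot."""
        visible = self.layers[: self.snapshot if snapshot is None else snapshot]
        return [row for layer in visible for row in layer]

store = DiffStore()
store.write([("alice",), ("bob",)])
snap = store.snapshot          # a reader pins this snapshot
store.write([("carol",)])      # later write: a new sealed layer

print(store.read(snap))        # pinned view stays consistent: 2 rows
print(store.read())            # latest snapshot sees 3 rows
```

Because layers are never mutated in place, a pinned reader can never observe a half-applied write, which is the point of the one-writer/many-readers design.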
ATTACH URL Reference
`openduck:<database_name>?endpoint=<url>&token=<token>`

| Parameter | Default | Description |
|---|---|---|
| `endpoint` | `http://localhost:7878` | Gateway URL |
| `token` | (required) | Auth token matching server config |

Examples:

```sql
-- Local dev
ATTACH 'openduck:mydb?token=dev-token' AS cloud;

-- Remote server, explicit endpoint
ATTACH 'openduck:mydb?endpoint=https://my-server.example.com&token=prod-token' AS cloud;

-- Alias: od: also works
ATTACH 'od:mydb?endpoint=http://localhost:7878&token=dev-token' AS cloud;
```
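Because the attach string is an ordinary scheme-plus-query URL, it can be pulled apart with the standard library. A convenience sketch only (the extension does its own parsing; the default below mirrors the table above):

```python
# Parse an OpenDuck attach string into (database, endpoint, token).
# Illustrative helper, not part of the extension; the endpoint default
# mirrors the reference table above.
from urllib.parse import parse_qs

def parse_attach(url):
    scheme, _, rest = url.partition(":")
    if scheme not in ("openduck", "od"):
        raise ValueError(f"unsupported scheme: {scheme!r}")
    db, _, query = rest.partition("?")
    params = {k: v[0] for k, v in parse_qs(query).items()}
    return {
        "database": db,
        "endpoint": params.get("endpoint", "http://localhost:7878"),
        "token": params.get("token"),
    }

print(parse_attach("openduck:mydb?token=dev-token"))
```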
DuckLake Integration
OpenDuck and DuckLake operate at different layers and complement each other:

```python
import duckdb, os

ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]
token = os.environ["OPENDUCK_TOKEN"]

con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.execute(f"LOAD '{ext_path}';")

# Attach DuckLake catalog (managed by remote worker backed by DuckLake)
con.execute(f"ATTACH 'openduck:lakehouse?endpoint=http://localhost:7878&token={token}' AS lh;")

# Query DuckLake tables transparently via OpenDuck transport
con.sql("SELECT * FROM lh.events WHERE event_date = today()").show()

# Hybrid: local scratch data joined with remote DuckLake table
con.sql("""
    SELECT l.session_id, r.user_email
    FROM memory.sessions l
    JOIN lh.users r ON l.user_id = r.id
""").show()
```
---

Protocol Reference

The wire protocol is intentionally minimal. See `proto/openduck/v1/execution.proto`:

- `ExecuteQuery` — send SQL, receive a query handle
- `StreamResults` — stream Arrow IPC record batches back to the client

Any gRPC service implementing these two RPCs is a valid OpenDuck backend. You can replace the Rust gateway with a custom implementation in any language.
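To make the two-RPC contract concrete, here is its shape in plain Python, with `sqlite3` standing in for an embedded engine and plain lists standing in for Arrow IPC batches. Purely illustrative; a real backend speaks gRPC and streams Arrow, and these class and method names are hypothetical:

```python
# The two-RPC OpenDuck contract, sketched without gRPC/Arrow plumbing:
# ExecuteQuery(sql) -> handle, StreamResults(handle) -> record batches.
# sqlite3 stands in for an embedded engine; lists stand in for Arrow IPC.
import sqlite3
import uuid

class ToyBackend:
    def __init__(self):
        self.engine = sqlite3.connect(":memory:")
        self.queries = {}  # handle -> open cursor

    def execute_query(self, sql):
        """RPC 1: send SQL, receive a query handle."""
        cur = self.engine.execute(sql)
        handle = str(uuid.uuid4())
        self.queries[handle] = cur
        return handle

    def stream_results(self, handle, batch_size=2):
        """RPC 2: stream result batches back to the client."""
        cur = self.queries.pop(handle)
        while True:
            batch = cur.fetchmany(batch_size)
            if not batch:
                break
            yield batch

backend = ToyBackend()
h = backend.execute_query("SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3")
for batch in backend.stream_results(h):
    print(batch)  # prints [(1,), (2,)] then [(3,)]
```

The split into a cheap handle-returning call plus a separate streaming call is what lets the gateway start a query on a worker and pull batches lazily.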
Common Patterns

Check which tables are available remotely

```sql
-- After ATTACH ... AS cloud
SHOW ALL TABLES;
SELECT table_name FROM information_schema.tables WHERE table_schema = 'cloud';
```

Write to a remote table

```sql
INSERT INTO cloud.events SELECT * FROM read_parquet('local_dump.parquet');
```

Create a remote table from local data

```sql
CREATE TABLE cloud.new_table AS SELECT * FROM local_csv LIMIT 0;
INSERT INTO cloud.new_table SELECT * FROM local_csv;
```

Export remote query result to local Parquet

```python
con.sql("SELECT * FROM cloud.large_table WHERE region = 'us-east'") \
    .write_parquet("output/us_east.parquet")
```
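The create-then-insert pattern above is mechanical enough to wrap in a helper that emits both statements for any table pair. The helper is hypothetical (plain string templating; identifiers are assumed trusted, so don't feed it user input):

```python
# Emit the two-statement "create remote table from local data" pattern
# shown above. Hypothetical convenience helper; pure string templating,
# identifiers assumed trusted.
def create_remote_from_local(remote_table, local_relation):
    return [
        f"CREATE TABLE {remote_table} AS SELECT * FROM {local_relation} LIMIT 0;",
        f"INSERT INTO {remote_table} SELECT * FROM {local_relation};",
    ]

for stmt in create_remote_from_local("cloud.new_table", "local_csv"):
    print(stmt)
```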
Troubleshooting

Extension is not trusted / signature error

```python
# Always set allow_unsigned_extensions before loading
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
```

Or in the CLI:

```bash
duckdb -unsigned
```

LOAD fails — extension not found

```bash
# Set the env var to the exact built path
export OPENDUCK_EXTENSION_PATH=$(pwd)/extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
ls -la $OPENDUCK_EXTENSION_PATH  # confirm it exists
```

Connection refused to gateway

```bash
# Verify the server is running
cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN

# Default port is 7878 — check firewall / port binding
curl http://localhost:7878/health
```

Token mismatch / auth failure

```bash
# Server token and client token must match exactly
export OPENDUCK_TOKEN=same-value-on-both-sides
# Server: cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
# Client: ATTACH '...&token=same-value-on-both-sides' AS cloud;
```

Build fails on macOS — bison version

```bash
brew install bison
export PATH="$(brew --prefix bison)/bin:$PATH"
cd extensions/openduck && make
```

Extension version mismatch with DuckDB

The extension must be built against the same DuckDB version as the Python package:

```bash
python -c "import duckdb; print(duckdb.__version__)"
# Ensure the extension Makefile targets the same version:
# check extensions/openduck/Makefile for DUCKDB_VERSION
```
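The checks above can be bundled into a single client-side diagnostic. A standard-library sketch; the `/health` endpoint and the environment variable names are the ones used in this section:

```python
# Quick client-side diagnostic for the failure modes listed above:
# missing token, missing/invalid extension path, unreachable gateway.
import os
import urllib.error
import urllib.request

def diagnose(endpoint="http://localhost:7878"):
    problems = []
    if not os.environ.get("OPENDUCK_TOKEN"):
        problems.append("OPENDUCK_TOKEN is not set")
    ext = os.environ.get("OPENDUCK_EXTENSION_PATH")
    if not ext:
        problems.append("OPENDUCK_EXTENSION_PATH is not set")
    elif not os.path.exists(ext):
        problems.append(f"extension not found at {ext}")
    try:
        urllib.request.urlopen(f"{endpoint}/health", timeout=2)
    except (urllib.error.URLError, OSError):
        problems.append(f"gateway not reachable at {endpoint}")
    return problems

for p in diagnose():
    print("PROBLEM:", p)
```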
---

OpenDuck vs Alternatives

| Feature | OpenDuck | Arrow Flight SQL | DuckLake |
|---|---|---|---|
| Remote attach UX | `ATTACH 'openduck:...'` | Separate driver | `ATTACH 'ducklake:...'` |
| Hybrid execution | ✅ split plan | ❌ full remote | ❌ |
| DuckDB catalog integration | ✅ native | ❌ | ✅ |
| Protocol RPCs | 2 | ~15 | N/A |
| Differential storage | ✅ | ❌ | via Parquet layers |
| Self-hosted | ✅ | ✅ | ✅ |