openduck-distributed-duckdb

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

OpenDuck Distributed DuckDB

OpenDuck 分布式DuckDB

Skill by ara.so — Daily 2026 Skills collection

OpenDuck is an open-source implementation of distributed DuckDB featuring differential storage (append-only immutable layers via Postgres + object store), hybrid dual execution (single queries split across local and remote workers), and transparent remote database attach via

ATTACH 'openduck:mydb'

. It is architecturally inspired by MotherDuck but fully open protocol (gRPC + Arrow IPC).

来自ara.so的技能 — 2026每日技能合集

OpenDuck是分布式DuckDB的开源实现，具备差分存储（通过Postgres + 对象存储实现仅追加的不可变分层）、混合双执行（单查询拆分至本地和远程工作节点执行）以及通过

ATTACH 'openduck:mydb'

实现的透明远程数据库挂载功能。其架构灵感源自MotherDuck，但采用完全开放的协议（gRPC + Arrow IPC）。

Architecture Overview

架构概述

DuckDB client (openduck extension)
  └─ ATTACH 'openduck:mydb?endpoint=...' AS cloud
       └─ gRPC + Arrow IPC
            └─ Gateway (Rust)
                 ├─ auth / routing / plan splitting
                 ├─ Worker 1 (embedded DuckDB)
                 └─ Worker N (embedded DuckDB)
                      ├─ Postgres (metadata)
                      └─ Object store (sealed layers)

Key concepts:

```
OpenDuckCatalog
```
/
```
OpenDuckTableEntry
```
— remote tables appear as first-class DuckDB catalog entries
Hybrid execution — gateway labels operators
```
LOCAL
```
or
```
REMOTE
```
, inserts
```
Bridge
```
operators at boundaries
Differential storage — immutable sealed layers, snapshot isolation, one write path, many readers
Protocol — only 2 gRPC RPCs defined in
```
proto/openduck/v1/execution.proto
```

DuckDB client (openduck extension)
  └─ ATTACH 'openduck:mydb?endpoint=...' AS cloud
       └─ gRPC + Arrow IPC
            └─ Gateway (Rust)
                 ├─ auth / routing / plan splitting
                 ├─ Worker 1 (embedded DuckDB)
                 └─ Worker N (embedded DuckDB)
                      ├─ Postgres (metadata)
                      └─ Object store (sealed layers)

核心概念：

```
OpenDuckCatalog
```
/
```
OpenDuckTableEntry
```
— 远程表作为一等公民的DuckDB目录条目呈现
混合执行 — 网关将算子标记为
```
LOCAL
```
或
```
REMOTE
```
，在边界处插入
```
Bridge
```
算子
差分存储 — 不可变的密封分层、快照隔离、单一写入路径、多读取者
协议 — 仅在
```
proto/openduck/v1/execution.proto
```
中定义了2个gRPC RPC接口

Repository Layout

仓库结构

crates/
  exec-gateway/     # auth, routing, hybrid plan splitting
  exec-worker/      # embedded DuckDB, Arrow IPC streaming
  exec-proto/       # protobuf/tonic codegen
  openduck-cli/     # unified CLI (serve|gateway|worker)
  diff-*/           # differential storage (layers, metadata, FUSE)

extensions/
  openduck/         # DuckDB C++ extension (StorageExtension + Catalog)

clients/
  python/           # pip-installable openduck wrapper

proto/
  openduck/v1/      # execution.proto

crates/
  exec-gateway/     # 认证、路由、混合执行计划拆分
  exec-worker/      # 嵌入式DuckDB、Arrow IPC流处理
  exec-proto/       # protobuf/tonic代码生成
  openduck-cli/     # 统一CLI（serve|gateway|worker）
  diff-*/           # 差分存储（分层、元数据、FUSE）

extensions/
  openduck/         # DuckDB C++扩展（StorageExtension + Catalog）

clients/
  python/           # 可通过pip安装的openduck封装库

proto/
  openduck/v1/      # execution.proto

Installation & Build

安装与构建

Prerequisites

前置依赖

Rust toolchain (stable)
C++ build tools,
```
vcpkg
```
,
```
bison
```
(macOS:
```
brew install bison
```
)
DuckDB development headers (handled by the extension Makefile)

Rust工具链（稳定版）
C++构建工具、
```
vcpkg
```
、
```
bison
```
（macOS：
```
brew install bison
```
）
DuckDB开发头文件（由扩展Makefile自动处理）

1. Build the Rust backend

1. 构建Rust后端

bash

git clone https://github.com/CITGuru/openduck
cd openduck
cargo build --workspace

bash

git clone https://github.com/CITGuru/openduck
cd openduck
cargo build --workspace

2. Build the DuckDB C++ extension

2. 构建DuckDB C++扩展

bash

cd extensions/openduck
make

bash

cd extensions/openduck
make

Output:

输出：

extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension

undefined

undefined

3. Install the Python client (optional)

3. 安装Python客户端（可选）

bash

pip install -e clients/python

bash

pip install -e clients/python

Running the Server

运行服务器

bash

undefined

bash

undefined

Required env vars

必填环境变量

export OPENDUCK_TOKEN=your-secret-token

Start all-in-one (gateway + worker)

启动一体化服务（网关 + 工作节点）

cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN

Or run gateway and worker separately

或者分别运行网关和工作节点

cargo run -p openduck-cli -- gateway --port 7878 cargo run -p openduck-cli -- worker --gateway http://localhost:7878

---

cargo run -p openduck-cli -- gateway --port 7878 cargo run -p openduck-cli -- worker --gateway http://localhost:7878

---

Connecting from Python

从Python连接

Via openduck wrapper (recommended)

通过openduck封装库（推荐）

python

import openduck  # auto-detects extension from build tree or OPENDUCK_EXTENSION_PATH

con = openduck.connect("mydb")  # uses OPENDUCK_TOKEN env var
con.sql("SELECT 1 AS x").show()
con.sql("SELECT * FROM cloud.users LIMIT 10").show()

python

import openduck  # 自动从构建目录或OPENDUCK_EXTENSION_PATH检测扩展

con = openduck.connect("mydb")  # 使用OPENDUCK_TOKEN环境变量
con.sql("SELECT 1 AS x").show()
con.sql("SELECT * FROM cloud.users LIMIT 10").show()

Via raw DuckDB SDK

通过原生DuckDB SDK

python

import duckdb
import os

ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]

python

import duckdb
import os

ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]

e.g. extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension

示例：extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension

con = duckdb.connect(config={"allow_unsigned_extensions": "true"}) con.execute(f"LOAD '{ext_path}';") con.execute( "ATTACH 'openduck:mydb" "?endpoint=http://localhost:7878" f"&token={os.environ["OPENDUCK_TOKEN"]}' AS cloud;" )

Query remote table

查询远程表

con.sql("SELECT * FROM cloud.users LIMIT 10").show()

Hybrid query — local table joined with remote table

混合查询 — 本地表与远程表关联

con.sql(""" SELECT l.product_id, l.name, r.total_sales FROM local.products l JOIN cloud.sales r ON l.product_id = r.product_id WHERE r.total_sales > 1000 """).show()

undefined

con.sql(""" SELECT l.product_id, l.name, r.total_sales FROM local.products l JOIN cloud.sales r ON l.product_id = r.product_id WHERE r.total_sales > 1000 """).show()

undefined

Environment variables

环境变量

bash

export OPENDUCK_TOKEN=your-secret-token
export OPENDUCK_EXTENSION_PATH=extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
export OPENDUCK_ENDPOINT=http://localhost:7878   # default

bash

export OPENDUCK_TOKEN=your-secret-token
export OPENDUCK_EXTENSION_PATH=extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
export OPENDUCK_ENDPOINT=http://localhost:7878   # 默认值

Connecting from the CLI

从CLI连接

bash

duckdb -unsigned -c "
  LOAD 'extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension';
  ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token=${OPENDUCK_TOKEN}' AS cloud;
  SHOW ALL TABLES;
  SELECT * FROM cloud.users LIMIT 5;
"

bash

duckdb -unsigned -c "
  LOAD 'extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension';
  ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token=${OPENDUCK_TOKEN}' AS cloud;
  SHOW ALL TABLES;
  SELECT * FROM cloud.users LIMIT 5;
"

Connecting from Rust

从Rust连接

rust

use duckdb::{Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;
    let ext_path = std::env::var("OPENDUCK_EXTENSION_PATH").unwrap();
    let token = std::env::var("OPENDUCK_TOKEN").unwrap();

    conn.execute_batch(&format!(r#"
        SET allow_unsigned_extensions = true;
        LOAD '{ext_path}';
        ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token={token}' AS cloud;
    "#))?;

    let mut stmt = conn.prepare("SELECT * FROM cloud.users LIMIT 10")?;
    let rows = stmt.query_map([], |row| {
        Ok(row.get::<_, String>(0)?)
    })?;

    for row in rows {
        println!("{}", row?);
    }
    Ok(())
}

rust

use duckdb::{Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;
    let ext_path = std::env::var("OPENDUCK_EXTENSION_PATH").unwrap();
    let token = std::env::var("OPENDUCK_TOKEN").unwrap();

    conn.execute_batch(&format!(r#"
        SET allow_unsigned_extensions = true;
        LOAD '{ext_path}';
        ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token={token}' AS cloud;
    "#))?;

    let mut stmt = conn.prepare("SELECT * FROM cloud.users LIMIT 10")?;
    let rows = stmt.query_map([], |row| {
        Ok(row.get::<_, String>(0)?)
    })?;

    for row in rows {
        println!("{}", row?);
    }
    Ok(())
}

Hybrid Execution Pattern

混合执行模式

Hybrid execution happens automatically — the gateway splits the logical plan:

[LOCAL]  HashJoin(l.id = r.id)
  [LOCAL]  Scan(products)       ← runs on your machine
  [LOCAL]  Bridge(REMOTE→LOCAL)
    [REMOTE] Scan(sales)        ← runs on remote worker

Write queries naturally — the extension handles routing:

python

undefined

混合执行自动触发 — 网关会拆分逻辑执行计划：

[LOCAL]  HashJoin(l.id = r.id)
  [LOCAL]  Scan(products)       ← 在本地机器运行
  [LOCAL]  Bridge(REMOTE→LOCAL)
    [REMOTE] Scan(sales)        ← 在远程工作节点运行

自然编写查询即可 — 扩展会自动处理路由：

python

undefined

This single query runs across two engines transparently

该单查询会在两个引擎间透明执行

con.sql(""" SELECT p.category, SUM(s.amount) AS revenue FROM local.products p -- local table JOIN cloud.sales s -- remote table ON p.id = s.product_id GROUP BY p.category ORDER BY revenue DESC """).show()

---

con.sql(""" SELECT p.category, SUM(s.amount) AS revenue FROM local.products p -- 本地表 JOIN cloud.sales s -- 远程表 ON p.id = s.product_id GROUP BY p.category ORDER BY revenue DESC """).show()

---

Differential Storage

差分存储

Differential storage is managed server-side. Key properties:

Append-only sealed layers stored in object storage (S3-compatible)
Postgres stores layer metadata and snapshot pointers
Snapshot isolation — readers always see a consistent view
One serialized write path — many concurrent readers

From a client perspective it is fully transparent. DuckDB sees normal table semantics.

差分存储由服务器端管理。核心特性：

仅追加的密封分层存储在对象存储（兼容S3）中
Postgres存储分层元数据和快照指针
快照隔离 — 读取者始终看到一致视图
单一序列化写入路径 — 支持多并发读取者

从客户端视角来看完全透明。DuckDB会将其视为普通表进行操作。

ATTACH URL Reference

ATTACH URL 参考

openduck:<database_name>?endpoint=<url>&token=<token>

Parameter	Default	Description
`endpoint`	`http://localhost:7878`	Gateway URL
`token`	`$OPENDUCK_TOKEN` env var	Auth token matching server config

Examples:

sql

-- Local dev
ATTACH 'openduck:mydb?token=dev-token' AS cloud;

-- Remote server, explicit endpoint
ATTACH 'openduck:mydb?endpoint=https://my-server.example.com&token=prod-token' AS cloud;

-- Alias: od: also works
ATTACH 'od:mydb?endpoint=http://localhost:7878&token=dev-token' AS cloud;

openduck:<database_name>?endpoint=<url>&token=<token>

参数	默认值	描述
`endpoint`	`http://localhost:7878`	网关URL
`token`	`$OPENDUCK_TOKEN` 环境变量	与服务器配置匹配的认证令牌

示例：

sql

-- 本地开发环境
ATTACH 'openduck:mydb?token=dev-token' AS cloud;

-- 远程服务器，显式指定endpoint
ATTACH 'openduck:mydb?endpoint=https://my-server.example.com&token=prod-token' AS cloud;

-- 别名：od:同样生效
ATTACH 'od:mydb?endpoint=http://localhost:7878&token=dev-token' AS cloud;

DuckLake Integration

DuckLake 集成

OpenDuck and DuckLake operate at different layers and complement each other:

python

import duckdb, os

ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]
token = os.environ["OPENDUCK_TOKEN"]

con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.execute(f"LOAD '{ext_path}';")

OpenDuck与DuckLake在不同层运行，可相互补充：

python

import duckdb, os

ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]
token = os.environ["OPENDUCK_TOKEN"]

con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.execute(f"LOAD '{ext_path}';")

-- 挂载由远程工作节点托管、基于DuckLake的目录
con.execute(f"ATTACH 'openduck:lakehouse?endpoint=http://localhost:7878&token={token}' AS lh;")

-- 通过OpenDuck传输层透明查询DuckLake表
con.sql("SELECT * FROM lh.events WHERE event_date = today()").show()

-- 混合查询：本地临时数据与远程DuckLake表关联
con.sql("""
    SELECT l.session_id, r.user_email
    FROM memory.sessions l
    JOIN lh.users r ON l.user_id = r.id
""").show()

Attach DuckLake catalog (managed by remote worker backed by DuckLake)

协议参考

con.execute(f"ATTACH 'openduck:lakehouse?endpoint=http://localhost:7878&token={token}' AS lh;")

有线协议设计得尽可能简洁。请查看

proto/openduck/v1/execution.proto

：

ExecuteQuery
— 发送SQL，获取查询句柄
StreamResults
— 将Arrow IPC记录批次流式返回给客户端

任何实现这两个RPC接口的gRPC服务都是有效的OpenDuck后端。你可以用任意语言替换Rust网关实现自定义版本。

Query DuckLake tables transparently via OpenDuck transport

常见使用模式

—

查看可用的远程表

con.sql("SELECT * FROM lh.events WHERE event_date = today()").show()

sql

-- 执行ATTACH ... AS cloud后
SHOW ALL TABLES;
SELECT table_name FROM information_schema.tables WHERE table_schema = 'cloud';

Hybrid: local scratch data joined with remote DuckLake table

写入远程表

con.sql(""" SELECT l.session_id, r.user_email FROM memory.sessions l JOIN lh.users r ON l.user_id = r.id """).show()

---

sql

INSERT INTO cloud.events SELECT * FROM read_parquet('local_dump.parquet');

Protocol Reference

从本地数据创建远程表

The wire protocol is intentionally minimal. See

proto/openduck/v1/execution.proto

ExecuteQuery
— send SQL, receive a query handle
StreamResults
— stream Arrow IPC record batches back to client

Any gRPC service implementing these two RPCs is a valid OpenDuck backend. You can replace the Rust gateway with a custom implementation in any language.

sql

CREATE TABLE cloud.new_table AS SELECT * FROM local_csv LIMIT 0;
INSERT INTO cloud.new_table SELECT * FROM local_csv;

Common Patterns

将远程查询结果导出为本地Parquet文件

Check which tables are available remotely

—

sql

-- After ATTACH ... AS cloud
SHOW ALL TABLES;
SELECT table_name FROM information_schema.tables WHERE table_schema = 'cloud';

python

con.sql("SELECT * FROM cloud.large_table WHERE region = 'us-east'") \
   .write_parquet("output/us_east.parquet")

Write to a remote table

故障排除

—

Extension is not trusted

/ 签名错误

sql

INSERT INTO cloud.events SELECT * FROM read_parquet('local_dump.parquet');

python

-- 加载扩展前务必设置allow_unsigned_extensions
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})

或在CLI中：

bash

duckdb -unsigned

Create a remote table from local data

LOAD

失败 — 扩展未找到

sql

CREATE TABLE cloud.new_table AS SELECT * FROM local_csv LIMIT 0;
INSERT INTO cloud.new_table SELECT * FROM local_csv;

bash

-- 设置环境变量指向构建好的扩展路径
export OPENDUCK_EXTENSION_PATH=$(pwd)/extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
ls -la $OPENDUCK_EXTENSION_PATH   # 确认文件存在

Export remote query result to local Parquet

连接网关被拒绝

python

con.sql("SELECT * FROM cloud.large_table WHERE region = 'us-east'") \
   .write_parquet("output/us_east.parquet")

bash

-- 验证服务器是否运行
cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
-- 默认端口为7878 — 检查防火墙/端口绑定
curl http://localhost:7878/health

Troubleshooting

令牌不匹配 / 认证失败

Extension is not trusted

/ signature error

—

python

undefined

bash

-- 服务器令牌与客户端令牌必须完全一致
export OPENDUCK_TOKEN=same-value-on-both-sides
-- 服务器端：cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
-- 客户端：ATTACH '...&token=same-value-on-both-sides' AS cloud;

Always set allow_unsigned_extensions before loading

macOS上构建失败 — bison版本问题

con = duckdb.connect(config={"allow_unsigned_extensions": "true"})


Or in CLI:
```bash
duckdb -unsigned

bash

brew install bison
export PATH="$(brew --prefix bison)/bin:$PATH"
cd extensions/openduck && make

LOAD

fails — extension not found

扩展版本与DuckDB不匹配

bash

undefined

扩展必须与Python包使用的DuckDB版本一致：

bash

python -c "import duckdb; print(duckdb.__version__)"
-- 确保扩展Makefile使用相同版本
-- 检查extensions/openduck/Makefile中的DUCKDB_VERSION

Set the env var to the exact built path

OpenDuck vs 替代方案

export OPENDUCK_EXTENSION_PATH=$(pwd)/extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension ls -la $OPENDUCK_EXTENSION_PATH # confirm it exists

undefined

特性	OpenDuck	Arrow Flight SQL	DuckLake
远程挂载体验	`ATTACH 'openduck:db'`	独立驱动	`ATTACH 'ducklake:...'`
混合执行	✅ 拆分执行计划	❌ 全远程执行	❌
DuckDB目录集成	✅ 原生支持	❌	✅
协议RPC数量	2	~15	N/A
差分存储	✅	❌	通过Parquet分层实现
自托管	✅	✅	✅

Connection refused to gateway

—

bash

undefined

—

Verify server is running

—

cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN

—

Default port is 7878 — check firewall / port binding

—

curl http://localhost:7878/health

undefined

—

Token mismatch / auth failure

—

bash

undefined

—

Server token and client token must match exactly

—

export OPENDUCK_TOKEN=same-value-on-both-sides

—

Server: cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN

—

Client: ATTACH '...&token=same-value-on-both-sides' AS cloud;

—

undefined

—

Build fails on macOS — bison version

—

bash

brew install bison
export PATH="$(brew --prefix bison)/bin:$PATH"
cd extensions/openduck && make

—

Extension version mismatch with DuckDB

—

The extension must be built against the same DuckDB version as the Python package:

bash

python -c "import duckdb; print(duckdb.__version__)"

—

Ensure the extension Makefile targets the same version

—

Check extensions/openduck/Makefile for DUCKDB_VERSION

—

---

—

OpenDuck vs Alternatives

—

Feature	OpenDuck	Arrow Flight SQL	DuckLake
Remote attach UX	`ATTACH 'openduck:db'`	Separate driver	`ATTACH 'ducklake:...'`
Hybrid execution	✅ split plan	❌ full remote	❌
DuckDB catalog integration	✅ native	❌	✅
Protocol RPCs	2	~15	N/A
Differential storage	✅	❌	via Parquet layers
Self-hosted	✅	✅	✅

—

openduck-distributed-duckdb

Original

Translation

OpenDuck Distributed DuckDB

OpenDuck 分布式DuckDB

Architecture Overview

架构概述

Repository Layout

仓库结构

Installation & Build

安装与构建

Prerequisites

前置依赖

1. Build the Rust backend

1. 构建Rust后端

2. Build the DuckDB C++ extension

2. 构建DuckDB C++扩展

Output:

输出：

extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension

extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension

3. Install the Python client (optional)

3. 安装Python客户端（可选）

Running the Server

运行服务器

Required env vars

必填环境变量

Start all-in-one (gateway + worker)

启动一体化服务（网关 + 工作节点）

Or run gateway and worker separately

或者分别运行网关和工作节点

Connecting from Python

从Python连接

Via openduck wrapper (recommended)

通过openduck封装库（推荐）

Via raw DuckDB SDK

通过原生DuckDB SDK

e.g. extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension

示例：extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension

Query remote table

查询远程表

Hybrid query — local table joined with remote table

混合查询 — 本地表与远程表关联

Environment variables

环境变量

Connecting from the CLI

从CLI连接

Connecting from Rust

从Rust连接

Hybrid Execution Pattern

混合执行模式

This single query runs across two engines transparently

该单查询会在两个引擎间透明执行

Differential Storage

差分存储

ATTACH URL Reference

ATTACH URL 参考

DuckLake Integration

DuckLake 集成

Attach DuckLake catalog (managed by remote worker backed by DuckLake)

协议参考

Query DuckLake tables transparently via OpenDuck transport

常见使用模式

查看可用的远程表

Hybrid: local scratch data joined with remote DuckLake table

写入远程表

Protocol Reference

从本地数据创建远程表

Common Patterns

将远程查询结果导出为本地Parquet文件

Check which tables are available remotely

Write to a remote table

故障排除

Extension is not trusted / 签名错误

Create a remote table from local data

LOAD失败 — 扩展未找到

Export remote query result to local Parquet

连接网关被拒绝

Troubleshooting

令牌不匹配 / 认证失败

`Extension is not trusted`
/ 签名错误

`LOAD`
失败 — 扩展未找到

`Extension is not trusted`
/ signature error

`LOAD`
fails — extension not found