motherduck-build-data-pipeline


Build a Data Pipeline with MotherDuck


Use this skill when the user needs an ingestion-to-serving workflow, not just a single load step.
This is a use-case skill. It orchestrates `motherduck-connect`, `motherduck-load-data`, `motherduck-model-data`, `motherduck-query`, `motherduck-share-data`, and `motherduck-ducklake`.

Start Here: Is a MotherDuck Server Active?


Always determine this first.
  • If a remote MotherDuck MCP server or local MotherDuck server is active, use it.
  • If the user already knows the destination database, confirm it before designing stages.
  • Explore the live environment:
    • current databases and schemas
    • raw, staging, and analytics boundaries if they already exist
    • source tables, target tables, and table grain
    • key columns, date fields, and join keys
Use that discovery to decide whether the pipeline is:
  • landing into an empty workspace
  • extending an existing warehouse layout
  • publishing into an existing analytics model
If no server is active, ask for source shape and target shape before drafting the pipeline.

Use This Skill When


  • The user needs ingestion plus transformation plus serving output.
  • The work spans raw landing, curation, and publication.
  • The user needs a stage-by-stage pipeline pattern rather than one command.
  • The problem is bigger than a single import step or one ad hoc transformation.

Pipeline Defaults


  • batch over streaming
  • raw landing before curation
  • explicit raw -> staging -> analytics boundaries
  • bulk ingest paths over row-by-row writes
  • idempotent stage rebuilds or append contracts before scheduled automation
  • verify the MotherDuck-supported DuckDB client version before recommending upstream-only write, checkpoint, or lakehouse features
  • native MotherDuck storage unless DuckLake is explicitly required

Workflow


  1. Confirm whether live MotherDuck discovery is available.
  2. Inspect the current workspace and target data model.
  3. Define raw, staging, and analytics boundaries.
  4. Ingest raw data.
  5. Deduplicate, type, and promote into staging.
  6. Materialize analytics-ready outputs.
  7. Validate counts, freshness, uniqueness, and business metrics before publishing downstream assets.
When this skill produces a native DuckDB (`md:`) connection, watermark it with `custom_user_agent=agent-skills/2.2.0(harness-<harness>;llm-<llm>)`. If metadata is missing, fall back to `harness-unknown` and `llm-unknown`.
Output


The output of this skill should be:
  • the stage layout
  • the ingestion method
  • the transformation sequence
  • the serving tables or views
  • the validation checks
If the caller explicitly asks for structured JSON, return raw JSON only with no Markdown fences or prose before/after it. This is mainly for automated tests, regression checks, or downstream tooling that needs a stable machine-readable shape. Normal human-facing use of the skill can stay in prose unless JSON is explicitly requested.
Use this exact top-level shape when JSON is requested:
```json
{
  "summary": {},
  "assumptions": [],
  "implementation_plan": [],
  "validation_plan": [],
  "risks": []
}
```
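When the JSON contract applies, downstream tooling can gate on that top-level shape with a small check. This is a sketch; the function name is illustrative:

```python
import json

# The five required top-level keys from the contract above.
EXPECTED_KEYS = {"summary", "assumptions", "implementation_plan",
                 "validation_plan", "risks"}

def is_valid_pipeline_output(raw: str) -> bool:
    """Return True if raw is a JSON object with exactly the expected keys."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and set(payload) == EXPECTED_KEYS
```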

References


  • references/dlt-dbt-motherduck-project/ -- fully runnable MotherDuck reference project using `dlt`, `dbt-duckdb`, and validation queries
  • references/PIPELINE_IMPLEMENTATION_GUIDE.md -- preserved detailed pipeline guidance that used to live in this skill
  • ../motherduck-load-data/references/INGESTION_PATTERNS.md -- lower-level ingestion patterns

Runnable Artifact


  • artifacts/pipeline_stage_example.py -- MotherDuck-backed Python example that stages a Parquet extract, lands it into raw, deduplicates it, and publishes analytics output across raw/staging/analytics databases
  • artifacts/pipeline_stage_example.ts -- TypeScript companion artifact with the same stage layout and output contract
  • references/dlt-dbt-motherduck-project/ -- end-to-end MotherDuck example that bootstraps the target database, lands raw data with `dlt`, builds staging and analytics models with `dbt`, and validates the final mart
Run it with:

```bash
uv run --with duckdb python skills/motherduck-build-data-pipeline/artifacts/pipeline_stage_example.py
```

Run the same stage pattern against temporary MotherDuck databases:

```bash
MOTHERDUCK_ARTIFACT_USE_MOTHERDUCK=1 \
uv run --with duckdb python skills/motherduck-build-data-pipeline/artifacts/pipeline_stage_example.py
```

Validate the TypeScript companion artifact:

```bash
uv run scripts/test_typescript_artifacts.py
```

For the full MotherDuck project:

```bash
cd skills/motherduck-build-data-pipeline/references/dlt-dbt-motherduck-project
export MOTHERDUCK_TOKEN=...
export MOTHERDUCK_PIPELINE_DB=md_skills_pipeline_demo
uv sync --python 3.12
uv run python pipeline/run_all.py
uv run python pipeline/cleanup.py
```

Verified Notes


  • Bootstrap the target MotherDuck database before running `dlt`. The `motherduck` destination does not create the database for you.
  • Keep this stack on Python 3.11 or 3.12 for now. The tested `dbt-duckdb` path here was not reliable on Python 3.14.
  • If you want exact schema names like `raw`, `staging`, and `analytics` in dbt, override `generate_schema_name`.
  • When a long-lived Python process loads data and a separate `dbt` subprocess builds models, run post-build validation in a fresh process or refresh database state before reading new relations.
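The first note can be sketched as a guarded bootstrap step. This is a dry run unless a token is present; the database name mirrors the project's `MOTHERDUCK_PIPELINE_DB` default:

```python
import os

# Bootstrap the target database before running dlt; the motherduck
# destination will not create it for you.
db_name = os.environ.get("MOTHERDUCK_PIPELINE_DB", "md_skills_pipeline_demo")
bootstrap_sql = f"CREATE DATABASE IF NOT EXISTS {db_name}"

if os.environ.get("MOTHERDUCK_TOKEN"):
    import duckdb  # only needed for the live run against MotherDuck
    duckdb.connect("md:").execute(bootstrap_sql)
else:
    print(f"dry run; would execute: {bootstrap_sql}")
```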

Related Skills


  • `motherduck-connect` -- choose the right connection path
  • `motherduck-load-data` -- ingestion mechanics
  • `motherduck-model-data` -- shape the analytics layer
  • `motherduck-query` -- write transformations and validations
  • `motherduck-share-data` -- publish curated outputs
  • `motherduck-ducklake` -- only when open-table-format storage is a real requirement