motherduck-build-data-pipeline


Build a Data Pipeline with MotherDuck


Use this skill when the user needs an ingestion-to-serving workflow, not just a single load step.
This is a use-case skill. It orchestrates `motherduck-connect`, `motherduck-load-data`, `motherduck-model-data`, `motherduck-query`, `motherduck-share-data`, and `motherduck-ducklake`.

Start Here: Is a MotherDuck Server Active?


Always determine this first.
  • If a remote MotherDuck MCP server or local MotherDuck server is active, use it.
  • If the user already knows the destination database, confirm it before designing stages.
  • Explore the live environment:
    • current databases and schemas
    • raw, staging, and analytics boundaries if they already exist
    • source tables, target tables, and table grain
    • key columns, date fields, and join keys
Use that discovery to decide whether the pipeline is:
  • landing into an empty workspace
  • extending an existing warehouse layout
  • publishing into an existing analytics model
If no server is active, ask for source shape and target shape before drafting the pipeline.

Use This Skill When


  • The user needs ingestion plus transformation plus serving output.
  • The work spans raw landing, curation, and publication.
  • The user needs a stage-by-stage pipeline pattern rather than one command.
  • The problem is bigger than a single import step or one ad hoc transformation.

Pipeline Defaults


  • batch over streaming
  • raw landing before curation
  • explicit raw -> staging -> analytics boundaries
  • bulk ingest paths over row-by-row writes
  • idempotent stage rebuilds or append contracts before scheduled automation
  • verify the MotherDuck-supported DuckDB client version before recommending upstream-only write, checkpoint, or lakehouse features
  • native MotherDuck storage unless DuckLake is explicitly required

Workflow


  1. Confirm whether live MotherDuck discovery is available.
  2. Inspect the current workspace and target data model.
  3. Define raw, staging, and analytics boundaries.
  4. Ingest raw data.
  5. Deduplicate, type, and promote into staging.
  6. Materialize analytics-ready outputs.
  7. Validate counts, freshness, uniqueness, and business metrics before publishing downstream assets.
When this skill produces a native DuckDB (`md:`) connection, watermark it with `custom_user_agent=agent-skills/2.2.0(harness-<harness>;llm-<llm>)`. If metadata is missing, fall back to `harness-unknown` and `llm-unknown`.
Output


The output of this skill should be:
  • the stage layout
  • the ingestion method
  • the transformation sequence
  • the serving tables or views
  • the validation checks
If the caller explicitly asks for structured JSON, return raw JSON only with no Markdown fences or prose before/after it. This is mainly for automated tests, regression checks, or downstream tooling that needs a stable machine-readable shape. Normal human-facing use of the skill can stay in prose unless JSON is explicitly requested.
Use this exact top-level shape when JSON is requested:
```json
{
  "summary": {},
  "assumptions": [],
  "implementation_plan": [],
  "validation_plan": [],
  "risks": []
}
```
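When the JSON contract applies, downstream tooling can gate on that top-level shape with a small check. This is a sketch; the function name is illustrative:

```python
import json

# The five required top-level keys from the contract above.
EXPECTED_KEYS = {"summary", "assumptions", "implementation_plan",
                 "validation_plan", "risks"}

def is_valid_pipeline_output(raw: str) -> bool:
    """Return True if raw is a JSON object with exactly the expected keys."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and set(payload) == EXPECTED_KEYS
```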

References


  • references/dlt-dbt-motherduck-project/ -- fully runnable MotherDuck reference project using `dlt`, `dbt-duckdb`, and validation queries
  • references/PIPELINE_IMPLEMENTATION_GUIDE.md -- preserved detailed pipeline guidance that used to live in this skill
  • ../motherduck-load-data/references/INGESTION_PATTERNS.md -- lower-level ingestion patterns

Runnable Artifact


  • artifacts/pipeline_stage_example.py -- MotherDuck-backed Python example that stages a Parquet extract, lands it into raw, deduplicates it, and publishes analytics output across raw/staging/analytics databases
  • artifacts/pipeline_stage_example.ts -- TypeScript companion artifact with the same stage layout and output contract
  • references/dlt-dbt-motherduck-project/ -- end-to-end MotherDuck example that bootstraps the target database, lands raw data with `dlt`, builds staging and analytics models with `dbt`, and validates the final mart
Run it with:

```bash
uv run --with duckdb python skills/motherduck-build-data-pipeline/artifacts/pipeline_stage_example.py
```

Run the same stage pattern against temporary MotherDuck databases:

```bash
MOTHERDUCK_ARTIFACT_USE_MOTHERDUCK=1 \
uv run --with duckdb python skills/motherduck-build-data-pipeline/artifacts/pipeline_stage_example.py
```

Validate the TypeScript companion artifact:

```bash
uv run scripts/test_typescript_artifacts.py
```

For the full MotherDuck project:

```bash
cd skills/motherduck-build-data-pipeline/references/dlt-dbt-motherduck-project
export MOTHERDUCK_TOKEN=...
export MOTHERDUCK_PIPELINE_DB=md_skills_pipeline_demo
uv sync --python 3.12
uv run python pipeline/run_all.py
uv run python pipeline/cleanup.py
```

Verified Notes


  • Bootstrap the target MotherDuck database before running `dlt`. The `motherduck` destination does not create the database for you.
  • Keep this stack on Python 3.11 or 3.12 for now. The tested `dbt-duckdb` path here was not reliable on Python 3.14.
  • If you want exact schema names like `raw`, `staging`, and `analytics` in dbt, override `generate_schema_name`.
  • When a long-lived Python process loads data and a separate `dbt` subprocess builds models, run post-build validation in a fresh process or refresh database state before reading new relations.
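The first note can be sketched as a guarded bootstrap step. This is a dry run unless a token is present; the database name mirrors the project's `MOTHERDUCK_PIPELINE_DB` default:

```python
import os

# Bootstrap the target database before running dlt; the motherduck
# destination will not create it for you.
db_name = os.environ.get("MOTHERDUCK_PIPELINE_DB", "md_skills_pipeline_demo")
bootstrap_sql = f"CREATE DATABASE IF NOT EXISTS {db_name}"

if os.environ.get("MOTHERDUCK_TOKEN"):
    import duckdb  # only needed for the live run against MotherDuck
    duckdb.connect("md:").execute(bootstrap_sql)
else:
    print(f"dry run; would execute: {bootstrap_sql}")
```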

Related Skills


  • `motherduck-connect` -- choose the right connection path
  • `motherduck-load-data` -- ingestion mechanics
  • `motherduck-model-data` -- shape the analytics layer
  • `motherduck-query` -- write transformations and validations
  • `motherduck-share-data` -- publish curated outputs
  • `motherduck-ducklake` -- only when open-table-format storage is a real requirement