Build a Data Pipeline with MotherDuck
Use this skill when the user needs an ingestion-to-serving workflow, not just a single load step.
This is a use-case skill. It orchestrates `motherduck-connect`, `motherduck-load-data`, `motherduck-model-data`, `motherduck-query`, `motherduck-share-data`, and `motherduck-ducklake`.
Start Here: Is a MotherDuck Server Active?
Always determine this first.
- If a remote MotherDuck MCP server or local MotherDuck server is active, use it.
- If the user already knows the destination database, confirm it before designing stages.
- Explore the live environment:
- current databases and schemas
- raw, staging, and analytics boundaries if they already exist
- source tables, target tables, and table grain
- key columns, date fields, and join keys
Use that discovery to decide whether the pipeline is:
- landing into an empty workspace
- extending an existing warehouse layout
- publishing into an existing analytics model
If no server is active, ask for source shape and target shape before drafting the pipeline.
Use This Skill When
- The user needs ingestion plus transformation plus serving output.
- The work spans raw landing, curation, and publication.
- The user needs a stage-by-stage pipeline pattern rather than one command.
- The problem is bigger than a single import step or one ad hoc transformation.
Pipeline Defaults
- batch over streaming
- raw landing before curation
- explicit raw -> staging -> analytics boundaries
- bulk ingest paths over row-by-row writes
- idempotent stage rebuilds or append contracts before scheduled automation
- verify the MotherDuck-supported DuckDB client version before recommending upstream-only write, checkpoint, or lakehouse features
- native MotherDuck storage unless DuckLake is explicitly required
Workflow
- Confirm whether live MotherDuck discovery is available.
- Inspect the current workspace and target data model.
- Define raw, staging, and analytics boundaries.
- Ingest raw data.
- Deduplicate, type, and promote into staging.
- Materialize analytics-ready outputs.
- Validate counts, freshness, uniqueness, and business metrics before publishing downstream assets.
When this skill produces a native DuckDB (`md:`) connection, watermark it with `custom_user_agent=agent-skills/2.2.0(harness-<harness>;llm-<llm>)`. If metadata is missing, fall back to `harness-unknown` and `llm-unknown`.
Output
The output of this skill should be:
- the stage layout
- the ingestion method
- the transformation sequence
- the serving tables or views
- the validation checks
If the caller explicitly asks for structured JSON, return raw JSON only with no Markdown fences or prose before/after it.
This is mainly for automated tests, regression checks, or downstream tooling that needs a stable machine-readable shape. Normal human-facing use of the skill can stay in prose unless JSON is explicitly requested.
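For automated callers, honoring the raw-JSON contract can be sketched as serializing the required top-level keys and emitting nothing else (illustrative):

```python
import json

# The required top-level keys of the structured-output contract
payload = {
    "summary": {},
    "assumptions": [],
    "implementation_plan": [],
    "validation_plan": [],
    "risks": [],
}

# Emit raw JSON only: no Markdown fences, no prose before or after it
out = json.dumps(payload)
```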
Use this exact top-level shape when JSON is requested:
```json
{
  "summary": {},
  "assumptions": [],
  "implementation_plan": [],
  "validation_plan": [],
  "risks": []
}
```
References
- `references/dlt-dbt-motherduck-project/` -- fully runnable MotherDuck reference project using `dlt`, `dbt-duckdb`, and validation queries
- `references/PIPELINE_IMPLEMENTATION_GUIDE.md` -- preserved detailed pipeline guidance that used to live in this skill
- `../motherduck-load-data/references/INGESTION_PATTERNS.md` -- lower-level ingestion patterns
Runnable Artifact
- `artifacts/pipeline_stage_example.py` -- MotherDuck-backed Python example that stages a Parquet extract, lands it into raw, deduplicates it, and publishes analytics output across raw/staging/analytics databases
- `artifacts/pipeline_stage_example.ts` -- TypeScript companion artifact with the same stage layout and output contract
- `references/dlt-dbt-motherduck-project/` -- end-to-end MotherDuck example that bootstraps the target database, lands raw data with `dlt`, builds staging and analytics models with `dbt`, and validates the final mart

Run it with:
```bash
uv run --with duckdb python skills/motherduck-build-data-pipeline/artifacts/pipeline_stage_example.py
```
Run the same stage pattern against temporary MotherDuck databases:
```bash
MOTHERDUCK_ARTIFACT_USE_MOTHERDUCK=1 \
uv run --with duckdb python skills/motherduck-build-data-pipeline/artifacts/pipeline_stage_example.py
```
Validate the TypeScript companion artifact:
```bash
uv run scripts/test_typescript_artifacts.py
```
For the full MotherDuck project:
```bash
cd skills/motherduck-build-data-pipeline/references/dlt-dbt-motherduck-project
export MOTHERDUCK_TOKEN=...
export MOTHERDUCK_PIPELINE_DB=md_skills_pipeline_demo
uv sync --python 3.12
uv run python pipeline/run_all.py
uv run python pipeline/cleanup.py
```
Verified Notes
- Bootstrap the target MotherDuck database before running `dlt`. The `motherduck` destination does not create the database for you.
- Keep this stack on Python 3.11 or 3.12 for now. The tested `dbt-duckdb` path here was not reliable on Python 3.14.
- If you want exact schema names like `raw`, `staging`, and `analytics` in dbt, override `generate_schema_name`.
- When a long-lived Python process loads data and a separate `dbt` subprocess builds models, run post-build validation in a fresh process or refresh database state before reading new relations.
Related Skills
- `motherduck-connect` -- choose the right connection path
- `motherduck-load-data` -- ingestion mechanics
- `motherduck-model-data` -- shape the analytics layer
- `motherduck-query` -- write transformations and validations
- `motherduck-share-data` -- publish curated outputs
- `motherduck-ducklake` -- only when open-table-format storage is a real requirement