Build a Data Pipeline with MotherDuck
Use this skill when the user needs an ingestion-to-serving workflow, not just a single load step.
This is a use-case skill. It orchestrates the component skills listed under Related Skills below.
Start Here: Is a MotherDuck Server Active?
Always determine this first.
- If a remote MotherDuck MCP server or local MotherDuck server is active, use it.
- If the user already knows the destination database, confirm it before designing stages.
- Explore the live environment (a discovery sketch follows this list):
  - current databases and schemas
  - raw, staging, and analytics boundaries if they already exist
  - source tables, target tables, and table grain
  - key columns, date fields, and join keys
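A minimal discovery sketch using the Python duckdb client, assuming MOTHERDUCK_TOKEN is set in the environment; the table name in the last query is a hypothetical placeholder:

```python
import duckdb

# Connect to MotherDuck; assumes MOTHERDUCK_TOKEN is set in the environment.
con = duckdb.connect("md:")

# List attached databases, then walk schemas and tables to find raw/staging/analytics
# boundaries, source tables, and target tables.
print(con.sql("SHOW DATABASES").fetchall())
print(con.sql("""
    SELECT table_catalog, table_schema, table_name
    FROM information_schema.tables
    ORDER BY 1, 2, 3
""").fetchall())

# Inspect a candidate source table for key columns and date fields.
# 'my_db.main.orders' is a hypothetical name -- substitute a real table.
print(con.sql("DESCRIBE my_db.main.orders").fetchall())
```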
Use that discovery to decide whether the pipeline is:
- landing into an empty workspace
- extending an existing warehouse layout
- publishing into an existing analytics model
If no server is active, ask for source shape and target shape before drafting the pipeline.
Use This Skill When
- The user needs ingestion plus transformation plus serving output.
- The work spans raw landing, curation, and publication.
- The user needs a stage-by-stage pipeline pattern rather than one command.
- The problem is bigger than a single import step or one ad hoc transformation.
Pipeline Defaults
- batch over streaming
- raw landing before curation
- explicit raw -> staging -> analytics boundaries
- bulk ingest paths over row-by-row writes
- idempotent stage rebuilds or append contracts before scheduled automation (see the sketch after this list)
- verify the MotherDuck-supported DuckDB client version before recommending upstream-only write, checkpoint, or lakehouse features
- native MotherDuck storage unless DuckLake is explicitly required
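A short sketch of the bulk-ingest and idempotent-rebuild defaults, assuming placeholder database and file names (raw_db, data/orders/*.parquet) and a MOTHERDUCK_TOKEN in the environment:

```python
import duckdb

con = duckdb.connect("md:")  # assumes MOTHERDUCK_TOKEN is set

# Bulk ingest: land whole files with a single table scan instead of row-by-row inserts.
# CREATE OR REPLACE makes the rebuild idempotent -- rerunning the stage converges to
# the same state instead of accumulating duplicates. An append contract would instead
# INSERT only rows newer than a recorded watermark.
con.sql("""
    CREATE OR REPLACE TABLE raw_db.main.orders_raw AS
    SELECT *, current_timestamp AS loaded_at
    FROM read_parquet('data/orders/*.parquet')
""")
```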
Workflow
- Confirm whether live MotherDuck discovery is available.
- Inspect the current workspace and target data model.
- Define raw, staging, and analytics boundaries.
- Ingest raw data.
- Deduplicate, type, and promote into staging.
- Materialize analytics-ready outputs.
- Validate counts, freshness, uniqueness, and business metrics before publishing downstream assets (a staging-to-validation sketch follows this list).
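A compressed sketch of the staging, analytics, and validation steps, assuming the raw table from the previous sketch and placeholder column names (order_id, order_date, amount, loaded_at):

```python
import duckdb

con = duckdb.connect("md:")  # assumes MOTHERDUCK_TOKEN is set

# Deduplicate on the table grain, cast types, and promote into staging.
con.sql("""
    CREATE OR REPLACE TABLE staging_db.main.orders AS
    SELECT order_id,
           CAST(order_date AS DATE)       AS order_date,
           CAST(amount AS DECIMAL(18, 2)) AS amount
    FROM raw_db.main.orders_raw
    QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY loaded_at DESC) = 1
""")

# Materialize an analytics-ready output.
con.sql("""
    CREATE OR REPLACE TABLE analytics_db.main.daily_revenue AS
    SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS revenue
    FROM staging_db.main.orders
    GROUP BY 1
""")

# Validate counts, uniqueness, and freshness before publishing downstream assets.
checks = {
    "staging_not_empty": "SELECT COUNT(*) > 0 FROM staging_db.main.orders",
    "order_id_unique": "SELECT COUNT(*) = COUNT(DISTINCT order_id) FROM staging_db.main.orders",
    "fresh_within_7_days": "SELECT MAX(order_date) >= CURRENT_DATE - INTERVAL 7 DAY FROM staging_db.main.orders",
}
for name, sql in checks.items():
    passed = con.sql(sql).fetchone()[0]
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```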
When this skill produces a native DuckDB connection, watermark it with custom_user_agent=agent-skills/2.2.0(harness-<harness>;llm-<llm>). If harness or LLM metadata is missing, fall back to generic placeholder values for those fields.
Output
The output of this skill should be:
- the stage layout
- the ingestion method
- the transformation sequence
- the serving tables or views
- the validation checks
If the caller explicitly asks for structured JSON, return raw JSON only with no Markdown fences or prose before/after it.
This is mainly for automated tests, regression checks, or downstream tooling that needs a stable machine-readable shape. Normal human-facing use of the skill can stay in prose unless JSON is explicitly requested.
Use this exact top-level shape when JSON is requested:
```json
{
  "summary": {},
  "assumptions": [],
  "implementation_plan": [],
  "validation_plan": [],
  "risks": []
}
```
References
references/dlt-dbt-motherduck-project/
-- fully runnable MotherDuck reference project using dlt, dbt, and validation queries
references/PIPELINE_IMPLEMENTATION_GUIDE.md
-- preserved detailed pipeline guidance that used to live in this skill
../motherduck-load-data/references/INGESTION_PATTERNS.md
-- lower-level ingestion patterns
Runnable Artifact
artifacts/pipeline_stage_example.py
-- MotherDuck-backed Python example that stages a Parquet extract, lands it into raw, deduplicates it, and publishes analytics output across raw/staging/analytics databases
artifacts/pipeline_stage_example.ts
-- TypeScript companion artifact with the same stage layout and output contract
references/dlt-dbt-motherduck-project/
-- end-to-end MotherDuck example that bootstraps the target database, lands raw data with dlt, builds staging and analytics models with dbt, and validates the final mart
Run it with:
```bash
uv run --with duckdb python skills/motherduck-build-data-pipeline/artifacts/pipeline_stage_example.py
```
Run the same stage pattern against temporary MotherDuck databases:
```bash
MOTHERDUCK_ARTIFACT_USE_MOTHERDUCK=1 \
uv run --with duckdb python skills/motherduck-build-data-pipeline/artifacts/pipeline_stage_example.py
```
Validate the TypeScript companion artifact:
```bash
uv run scripts/test_typescript_artifacts.py
```
For the full MotherDuck project:
```bash
cd skills/motherduck-build-data-pipeline/references/dlt-dbt-motherduck-project
export MOTHERDUCK_TOKEN=...
export MOTHERDUCK_PIPELINE_DB=md_skills_pipeline_demo
uv sync --python 3.12
uv run python pipeline/run_all.py
uv run python pipeline/cleanup.py
```
Verified Notes
- Bootstrap the target MotherDuck database before running the dlt pipeline; the destination does not create the database for you (see the bootstrap sketch after this list).
- Keep this stack on Python 3.11 or 3.12 for now. The tested path here was not reliable on Python 3.14.
- If you want exact schema names like raw, staging, and analytics in dbt, override the generate_schema_name macro.
- When a long-lived Python process loads data and a separate subprocess builds models, run post-build validation in a fresh process or refresh database state before reading new relations.
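A minimal bootstrap sketch for the first note, assuming the duckdb Python client and the demo database name from the run commands above:

```python
import duckdb

# Create the target database up front; the pipeline's destination expects it to exist.
con = duckdb.connect("md:")  # assumes MOTHERDUCK_TOKEN is set
con.sql("CREATE DATABASE IF NOT EXISTS md_skills_pipeline_demo")
```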
Related Skills
- connection skill -- choose the right connection path
- motherduck-load-data -- ingestion mechanics
- data-modeling skill -- shape the analytics layer
- SQL transformation skill -- write transformations and validations
- sharing skill -- publish curated outputs
- DuckLake skill -- only when open-table-format storage is a real requirement