motherduck-load-data

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Load Data into MotherDuck

将数据加载到MotherDuck中

Use this skill when the job is getting data into MotherDuck correctly and efficiently, not just writing one ad hoc import query.
当需要正确且高效地将数据导入MotherDuck,而非仅编写临时导入查询时,使用本技能。

Source Of Truth

权威参考

  • Prefer current MotherDuck loading, cloud-storage, and Postgres-endpoint loading docs first.
  • Use
    CREATE SECRET
    and cloud-storage docs for protected-object-store workflows.
  • Use the DuckDB database upload docs when the source is an existing local
    .duckdb
    ,
    .ddb
    , or attached DuckDB database.
  • Keep the loading advice aligned with MotherDuck's documented posture:
    • batch over streaming
    • Parquet over CSV when you control the format
    • dataframe,
      COPY
      , CTAS, or
      INSERT ... SELECT
      over row-by-row inserts
    • native MotherDuck storage first unless DuckLake is explicitly required
  • 优先参考最新的MotherDuck加载、云存储及Postgres端点加载文档。
  • 针对受保护的对象存储工作流,使用
    CREATE SECRET
    及云存储文档。
  • 当数据源为本地现有
    .duckdb
    .ddb
    文件或已关联的DuckDB数据库时,参考DuckDB数据库上传文档。
  • 加载建议需与MotherDuck的文档规范保持一致:
    • 优先批量处理而非流式处理
    • 若您能控制格式,优先选择Parquet而非CSV
    • 优先使用数据帧、
      COPY
      、CTAS或
      INSERT ... SELECT
      而非逐行插入
    • 除非明确需要DuckLake,否则优先使用MotherDuck原生存储

Default Posture

默认原则

  • Start by classifying the source: object storage or HTTPS, local file or local DuckDB, in-memory rows, or an external database.
  • Prefer
    CREATE TABLE AS SELECT
    for first loads and
    INSERT INTO ... SELECT
    for appends.
  • For whole local DuckDB databases, use
    CREATE OR REPLACE DATABASE remote_name FROM CURRENT_DATABASE()
    , an attached local database, or a file path from a native DuckDB client after attaching
    md:
    .
  • Use Parquet for durable bulk movement whenever you control the source format.
  • Treat the Postgres endpoint as a thin-client path for server-side remote reads, not for local-file or extension-driven ingestion.
  • Bootstrap the target MotherDuck database first when the ingestion tool does not create it automatically.
  • Keep raw landing minimally transformed; do typing, deduplication, and business logic in staging or modeling steps.
  • Keep source storage close to the MotherDuck region when you control placement.
  • 首先对数据源进行分类:对象存储或HTTPS、本地文件或本地DuckDB、内存行数据,或外部数据库。
  • 首次加载优先使用
    CREATE TABLE AS SELECT
    ,追加数据优先使用
    INSERT INTO ... SELECT
  • 对于完整的本地DuckDB数据库,可使用
    CREATE OR REPLACE DATABASE remote_name FROM CURRENT_DATABASE()
    、关联本地数据库,或在关联
    md:
    后通过原生DuckDB客户端使用文件路径。
  • 只要您能控制源格式,就使用Parquet进行持久化批量数据迁移。
  • 将Postgres端点视为服务器端远程读取的轻量客户端路径,而非用于本地文件或扩展驱动的摄入。
  • 当摄入工具无法自动创建目标MotherDuck数据库时,先初始化该数据库。
  • 原始落地数据尽量少做转换;类型定义、去重及业务逻辑处理放在staging或建模步骤中。
  • 若您能控制存储位置,将源存储放在靠近MotherDuck区域的位置。

Workflow

工作流程

  1. Identify where the source data actually lives.
  2. Choose the loading path:
    • object storage or HTTPS: remote read into MotherDuck
    • local file or local DuckDB: use a DuckDB client path
    • in-memory rows: Arrow or dataframe bulk load first, batched inserts only as a fallback
    • external database: use the appropriate scan or replication path from a DuckDB-capable environment
  3. Land the data into a raw or staging table with minimal transformation.
  4. Validate row counts, types, and a few business aggregates immediately after the load.
  5. Promote into modeled tables only after the landing step is correct.
  1. 确定数据源的实际存储位置。
  2. 选择加载路径:
    • 对象存储或HTTPS:直接远程读取到MotherDuck
    • 本地文件或本地DuckDB:使用DuckDB客户端路径
    • 内存行数据:优先使用Arrow或数据帧批量加载,仅在万不得已时使用批量插入
    • 外部数据库:从支持DuckDB的环境中使用合适的扫描或复制路径
  3. 将数据落地到原始表或staging表,尽量少做转换。
  4. 加载完成后立即验证行数、数据类型及部分业务聚合指标。
  5. 仅在落地步骤验证正确后,再将数据迁移到建模表中。

Open Next

后续参考

  • references/INGESTION_PATTERNS.md
    for format-specific options, cloud-storage secrets, Postgres-endpoint loading tradeoffs, Python dataframe paths, and advanced ingestion patterns
  • references/INGESTION_PATTERNS.md
    :包含格式专属选项、云存储密钥、Postgres端点加载权衡、Python数据帧路径及高级摄入模式的内容

Related Skills

相关技能

  • motherduck-connect
    for choosing between the Postgres endpoint and a DuckDB client path
  • motherduck-explore
    for inspecting destination databases and validating landed tables
  • motherduck-query
    for writing CTAS, append, and validation SQL
  • motherduck-model-data
    for promoting landed data into staging and analytics tables
  • motherduck-ducklake
    only when object-storage-backed lakehouse storage is an explicit requirement
  • motherduck-connect
    :用于选择Postgres端点与DuckDB客户端路径
  • motherduck-explore
    :用于检查目标数据库并验证落地表
  • motherduck-query
    :用于编写CTAS、追加及验证SQL
  • motherduck-model-data
    :用于将落地数据迁移到staging表及分析表
  • motherduck-ducklake
    :仅当明确需要基于对象存储的数据湖存储时使用