e2e-medallion-architecture

Update Check — ONCE PER SESSION (mandatory). The first time this skill is used in a session, run the check-updates skill before proceeding.
  • GitHub Copilot CLI / VS Code: invoke the `check-updates` skill.
  • Claude Code / Cowork / Cursor / Windsurf / Codex: compare local vs remote package.json version.
  • Skip if the check was already performed earlier in this session.
CRITICAL NOTES
  1. To find workspace details (including the workspace ID) from a workspace name: list all workspaces, then filter with JMESPath (see the sketch below)
  2. To find item details (including the item ID) from a workspace ID, item type, and item name: list all items of that type in the workspace, then filter with JMESPath
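A minimal discovery sketch using `az rest` with a JMESPath `--query` (the workspace name is a placeholder):

```bash
# List all workspaces, then filter by display name to extract the ID
workspace_id=$(az rest --method get --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces" \
  --query "value[?displayName=='sales-bronze-dev'].id | [0]" --output tsv)
```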

End-to-End Medallion Architecture

Prerequisite Knowledge

Read these companion documents — they contain the foundational context this skill depends on:
  • COMMON-CORE.md — Fabric REST API patterns, authentication, token audiences, item discovery
  • COMMON-CLI.md — `az rest`, `az login`, token acquisition, Fabric REST via CLI
  • SPARK-AUTHORING-CORE.md — Notebook deployment, lakehouse creation, job execution
  • notebook-api-operations.md — Required for notebook creation: `.ipynb` structure requirements, cell format, `getDefinition`/`updateDefinition` workflow
For Spark-specific optimization details, see data-engineering-patterns.md.

Architecture Overview

Medallion Architecture is a data lakehouse pattern with three progressive layers:

| Layer | Purpose | Optimization Profile | Use Case |
| --- | --- | --- | --- |
| Bronze (Raw) | Land raw data exactly as received | Write-optimized, append-only, partitioned by ingestion date | Audit trail, reprocessing, lineage |
| Silver (Cleaned) | Deduplicated, validated, conformed data | Balanced read/write, partitioned by business date | Feature engineering, operational reporting |
| Gold (Aggregated) | Pre-calculated metrics for analytics | Read-optimized (ZORDER, compaction), partitioned by month/year | Power BI reports, dashboards, ad-hoc analytics via SQL endpoint |
  • Bronze: Schema-on-read — flexible schema; Delta time travel supports audit and rollback
  • Silver: Schema enforcement — reject non-conforming writes; handle schema evolution with `mergeSchema` when sources change (see the sketch below)
  • Gold: Strict schema governance — curated, business-approved datasets only
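A minimal sketch of the `mergeSchema` option on a Silver write (the table name is illustrative):

```python
# Allow the Silver table to absorb new source columns;
# without mergeSchema, Delta rejects the non-conforming write
(df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("silver.events_cleaned"))
```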

Must/Prefer/Avoid

MUST DO

  • Create a separate lakehouse for each medallion layer (Bronze, Silver, Gold)
  • Add metadata columns in Bronze: ingestion timestamp, source file, batch ID
  • Apply data quality rules in the Bronze-to-Silver transformation (deduplication, null handling, range validation)
  • Use Delta Lake format for all medallion layer tables
  • Use partition-aware overwrite in Silver/Gold writes to avoid reprocessing unchanged data
  • Include validation steps after each layer (row counts, schema checks, anomaly detection)
  • Follow the `.ipynb` validation + Fabric nuances in notebook-api-operations.md when creating notebooks via REST API — every code cell must include `"outputs": []` and `"execution_count": null` (see the cell sketch after this list)
  • Default to separate workspaces per layer for governance and access control: one workspace each for Bronze, Silver, and Gold
  • Complete the full end-to-end flow — do not stop after creating notebooks; always bind lakehouses, execute notebooks sequentially (Bronze → Silver → Gold), verify results, and connect Power BI to the Gold layer unless the user explicitly requests a partial setup
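A minimal sketch of a valid code cell from the rule above, shown as the Python dict later serialized into the `.ipynb` JSON (the source line is illustrative):

```python
# "execution_count" must serialize to null and "outputs" must be an empty list;
# omitting either causes silent failures or "Job instance failed without detail error"
cell = {
    "cell_type": "code",
    "source": ["df = spark.read.table('bronze.events_raw')"],
    "metadata": {},
    "outputs": [],
    "execution_count": None,  # becomes null in the serialized .ipynb
}
```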

PREFER

  • Incremental processing (watermark pattern) over full refresh (see the sketch after this list)
  • Separate notebooks per layer for independent testing and debugging
  • ZORDER on frequently filtered columns in Gold tables
  • Running OPTIMIZE after writes in Silver and Gold layers
  • Environment-specific Spark configs (write-heavy for Bronze, balanced for Silver, read-heavy for Gold)
  • OneLake shortcuts to expose Gold data to consumer workspaces without duplication
  • Clear layer ownership: engineers own Bronze/Silver, analysts own Gold
  • Fabric Variable Libraries to centralize paths and configuration across layers
  • Multi-workspace deployment patterns for medium/high governance requirements (Bronze/Silver/Gold in separate workspaces)
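A minimal sketch of the watermark pattern from the first item above (table and column names are illustrative; the first run must tolerate an empty Silver table):

```python
from pyspark.sql.functions import col, max as spark_max

# Highest ingestion timestamp already processed into Silver
last_watermark = (spark.read.table("silver.events_cleaned")
                  .agg(spark_max("ingestion_timestamp"))
                  .first()[0])

bronze = spark.read.table("bronze.events_raw")
# Fall back to a full read on the first run, when no watermark exists yet
incremental = bronze if last_watermark is None else bronze.filter(
    col("ingestion_timestamp") > last_watermark)
```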

AVOID

  • Storing all layers in a single lakehouse — this defeats isolation and independent optimization
  • Skipping the Silver layer and going directly from Bronze to Gold
  • Hardcoded workspace IDs, lakehouse IDs, or FQDNs — discover via REST API
  • SELECT * without LIMIT on Bronze tables (they grow unboundedly)
  • Running VACUUM without checking downstream dependencies
  • Chaining OneLake shortcuts between medallion layers (Bronze→Silver→Gold) — each layer must be physically materialized for lineage and governance
  • Copying complete implementation code into skills — guide the LLM to generate instead
  • Reading from external HTTP/HTTPS URLs directly in Spark — Fabric Spark cannot access arbitrary external URLs; land data in lakehouse `Files/` first (via `curl`, the OneLake API, or a Fabric pipeline Copy activity — see the upload sketch after this list), then read from the lakehouse path
  • Creating notebooks via REST API without validating `.ipynb` structure — missing `execution_count: null` or `outputs: []` on code cells causes silent failures or "Job instance failed without detail error"
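A hedged sketch of landing a local file into `Files/` over the OneLake DFS endpoint, using the ADLS Gen2-style create/append/flush sequence (path segments are placeholders; see COMMON-CLI.md § OneLake Data Access):

```bash
token=$(az account get-access-token --resource "https://storage.azure.com" \
  --query accessToken --output tsv)
base="https://onelake.dfs.fabric.microsoft.com/$workspace_id/$lakehouse_id/Files/landing/daily/data.csv"

# Create the file, append the bytes, then flush at the final byte offset
curl -s -X PUT "$base?resource=file" -H "Authorization: Bearer $token"
curl -s -X PATCH "$base?action=append&position=0" -H "Authorization: Bearer $token" \
  --data-binary @data.csv
curl -s -X PATCH "$base?action=flush&position=$(wc -c < data.csv)" \
  -H "Authorization: Bearer $token"
```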

Workspace Setup Guidance

When setting up a medallion workspace, guide the LLM to generate commands for:
  1. Default architecture: create three workspaces (recommended):
    • {project}-bronze-{env}
    • {project}-silver-{env}
    • {project}-gold-{env}
  2. Create one lakehouse per workspace:
    • Bronze workspace → {project}_bronze lakehouse
    • Silver workspace → {project}_silver lakehouse
    • Gold workspace → {project}_gold lakehouse
  3. Assign RBAC per layer workspace:
    • Bronze: ingestion/engineering write permissions
    • Silver: engineering/data quality permissions
    • Gold: analytics/BI consumer access with stricter curation controls
  4. Create notebooks for each layer (one per transformation stage) — follow the `.ipynb` validation + Fabric nuances in notebook-api-operations.md
  5. Bind each notebook to its lakehouse — set `metadata.dependencies.lakehouse` with the correct lakehouse ID (see notebook-api-operations.md § Default Lakehouse Binding):
    • Bronze notebook → Bronze workspace/lakehouse
    • Silver notebook → Silver workspace/lakehouse (reads Bronze via cross-workspace OneLake access / fully qualified references)
    • Gold notebook → Gold workspace/lakehouse (reads Silver via cross-workspace access)
  6. Confirm notebook deployment — check that `updateDefinition` returned `Succeeded`; this is sufficient confirmation that content and lakehouse binding persisted. Do NOT call `getDefinition` to re-verify — it is an async LRO and adds unnecessary latency.
  7. Execute notebooks sequentially — Bronze first, then Silver, then Gold — using `POST .../jobs/instances?jobType=RunNotebook` with the correct `defaultLakehouse` in the execution config (both `id` and `name` required; see the sketch after this list)
  8. Connect Power BI to the Gold layer — discover the Gold lakehouse SQL endpoint, create a Direct Lake semantic model, and create a report with visuals on the Gold summary table (see Gold Layer → Power BI Consumption)
  9. Create a pipeline to orchestrate the Bronze → Silver → Gold flow for recurring execution
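A hedged sketch of the job-submission body from step 7 (GUIDs and names are placeholders; polling and duplicate checks are covered in SPARK-AUTHORING-CORE.md):

```bash
cat > /tmp/run.json << 'EOF'
{
  "executionData": {
    "configuration": {
      "defaultLakehouse": {
        "name": "sales_bronze",
        "id": "<bronze-lakehouse-guid>"
      }
    }
  }
}
EOF
az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/items/$notebook_id/jobs/instances?jobType=RunNotebook" \
  --body @/tmp/run.json
```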

Explicit Override: Single Workspace

If the user explicitly asks for a single-workspace deployment (for example, a POC, small team, or monolithic pattern), use this approach:
  • One workspace with separate Bronze/Silver/Gold lakehouses
  • Preserve layer separation logically even when workspace is shared
  • Call out governance trade-offs versus multi-workspace design
Parameterize by environment: workspace name suffix (`-dev`, `-prod`), data volume (sample vs full), capacity SKU, and Bronze retention period.

Bronze Layer — Ingestion Patterns

When a user requests data ingestion into the Bronze layer, guide the LLM to:
  1. Land data in the lakehouse first: external data must be staged into the lakehouse `Files/` folder before Spark can read it — use one of:
    • Fabric Pipeline Copy activity (preferred for recurring loads) — connects to external sources (HTTP, FTP, databases, cloud storage) and writes to OneLake
    • OneLake API / `curl` — upload files via REST API using a `storage.azure.com` token (see COMMON-CLI.md § OneLake Data Access)
    • OneLake shortcut — for data already in Azure ADLS Gen2, S3, or another OneLake location
    • `notebookutils.fs` — copy from mounted storage paths within a notebook (see the sketch after this list)
    • ⚠️ Fabric Spark cannot read from arbitrary HTTP/HTTPS URLs — `spark.read.format("csv").load("https://...")` will fail
  2. Read from the lakehouse path: once data is in `Files/`, read using lakehouse-relative paths (e.g., `spark.read.format("csv").load("Files/landing/daily/")`)
  3. Add metadata and write: tracking columns (ingestion timestamp, source file, batch ID), a Delta table with a descriptive name, partition by ingestion date, append mode
  4. Validate: log row counts, validate schema structure, flag anomalies vs historical patterns
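A minimal sketch of the `notebookutils.fs` copy option above (the source mount path is illustrative):

```python
import notebookutils

# Copy staged files from a mounted storage path into the lakehouse Files/ area;
# the final argument enables recursive copy
notebookutils.fs.cp("file:/mnt/staging/daily/", "Files/landing/daily/", True)
```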

Silver Layer — Transformation Patterns

When a user requests Bronze-to-Silver transformation, guide the LLM to apply (see the sketch after this list):
  • Quality rules: deduplicate on a natural/composite key, filter invalid ranges, handle nulls (drop required, fill optional), validate logical constraints
  • Schema conformance: snake_case column names, standardized data types, derived columns (durations, percentages, categories)
  • Schema evolution: use the `mergeSchema` option when source schemas change; coordinate downstream updates to Gold tables and Power BI datasets
  • Write strategy: partition by business date, partition-aware overwrite, run OPTIMIZE after the write, log before/after metrics
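A minimal sketch combining these rules (key, range, and table names are illustrative):

```python
from pyspark.sql.functions import col

silver = (spark.read.table("bronze.events_raw")
          .dropDuplicates(["event_id"])                     # dedupe on natural key
          .filter(col("fare_amount").between(0, 1000))      # drop invalid ranges
          .na.drop(subset=["event_id", "business_date"]))   # drop rows missing required fields

# Partition-aware overwrite: only the touched business_date partitions are rewritten
(silver.write.format("delta")
       .mode("overwrite")
       .option("partitionOverwriteMode", "dynamic")
       .partitionBy("business_date")
       .saveAsTable("silver.events_cleaned"))

spark.sql("OPTIMIZE silver.events_cleaned")
```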

Gold Layer — Aggregation Patterns

When a user requests Gold analytics tables, guide the LLM to generate:
  • Common aggregates: daily/weekly/monthly summaries, dimensional analysis (by location, category, type), trend breakdowns over time, demand patterns (hour-of-day, day-of-week)
  • Spark session config — set these properties in the Gold notebook before any write operations:

```python
spark.conf.set("spark.sql.parquet.vorder.default", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.binSize", "1g")
```

    • V-Order (`vorder.default`) — applies Fabric's columnar sort optimization to all Parquet files, dramatically improving Direct Lake and SQL endpoint read performance
    • Optimize Write (`optimizeWrite.enabled`) — coalesces small partitions into optimally sized files (target ~1 GB per `binSize`), reducing file count and improving scan efficiency
  • Optimization: ZORDER on filter columns, run OPTIMIZE after writes, pre-aggregate metrics to avoid runtime computation (see the sketch after this list)
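A minimal sketch of a pre-aggregated Gold table with post-write optimization (table, column, and metric names are illustrative):

```python
# Pre-aggregate daily metrics so Direct Lake / SQL endpoint queries avoid runtime computation
daily = (spark.read.table("silver.events_cleaned")
         .groupBy("business_date", "category")
         .agg({"fare_amount": "sum", "event_id": "count"})
         .withColumnRenamed("sum(fare_amount)", "total_fare")
         .withColumnRenamed("count(event_id)", "trip_count"))

(daily.write.format("delta").mode("overwrite")
      .partitionBy("business_date")
      .saveAsTable("gold.daily_summary"))

# Co-locate frequently filtered values, then compact small files
spark.sql("OPTIMIZE gold.daily_summary ZORDER BY (category)")
```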

End-to-End Execution Flow

When setting up medallion architecture end-to-end, the LLM must not stop after creating notebooks and deploying code. The complete lifecycle is:
Create Resources → Deploy Content → Bind Lakehouses → Execute → Verify Results

Step-by-Step

  1. Create layer workspaces and lakehouses (default) — one workspace and one lakehouse per layer (Bronze, Silver, Gold); capture workspace IDs and lakehouse IDs
  2. Create notebooks — one per layer, with valid `.ipynb` structure (see notebook-api-operations.md)
  3. Bind a lakehouse to each notebook — include `metadata.dependencies.lakehouse` in the `.ipynb` payload (see the sketch after this list) with:
    • `default_lakehouse`: the target lakehouse GUID
    • `default_lakehouse_name`: the lakehouse display name
    • `default_lakehouse_workspace_id`: the workspace GUID
  4. Deploy notebook content — call `updateDefinition` with the Base64-encoded `.ipynb` payload (content + lakehouse binding together)
  5. Confirm deployment — check that each `updateDefinition` LRO returned `Succeeded`; that is sufficient. Do NOT call `getDefinition` to re-verify — it is an async LRO and adds significant latency per notebook.
  6. Execute notebooks sequentially — use `POST .../jobs/instances?jobType=RunNotebook`:
    • Pass `defaultLakehouse` with both `id` and `name` in `executionData.configuration`
    • Run Bronze first → poll until `Completed` → run Silver → poll → run Gold → poll
    • Check for recent jobs before submitting (prevent duplicates — see SPARK-AUTHORING-CORE.md)
  7. Verify results — after each notebook completes, confirm expected tables exist and row counts are reasonable
  8. Connect Power BI to Gold — create a semantic model + report on Gold summary tables (see Gold Layer → Power BI Consumption)
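A minimal sketch of the binding block from step 3, as the dict merged into the notebook metadata before Base64 encoding (GUIDs are placeholders):

```python
notebook_metadata = {
    "dependencies": {
        "lakehouse": {
            "default_lakehouse": "<lakehouse-guid>",           # target lakehouse GUID
            "default_lakehouse_name": "sales_bronze",          # display name
            "default_lakehouse_workspace_id": "<workspace-guid>",
        }
    }
}
```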

Common Failure: Stopping After Notebook Creation

If the flow stops after deploying notebook code without binding or executing:
  • Notebooks will have no lakehouse context → `spark.sql()` and relative paths (`Tables/`, `Files/`) fail at runtime
  • The user sees no output or results — the architecture is set up but never tested
  • Always complete through step 7 unless the user explicitly asks to stop at a specific step

Gold Layer → Power BI Consumption

After Gold tables are populated, connect Power BI to surface the analytics: build a semantic model on top of the Gold lakehouse using Direct Lake mode.

Step-by-Step

  1. Discover the Gold lakehouse SQL endpoint — call `GET /v1/workspaces/{workspaceId}/lakehouses/{goldLakehouseId}` and extract `properties.sqlEndpointProperties.connectionString` and `provisioningStatus`; wait until the status is `Success` (see the discovery sketch after this list)
  2. Verify Gold tables via SQL — connect to the SQL endpoint using `sqlcmd` (see COMMON-CLI.md § SQL / TDS Data-Plane Access) and confirm the target table exists:

```sql
SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'nyc_taxi_daily_summary'
```

  3. Create a semantic model — use the powerbi-authoring-cli skill for semantic model creation and TMDL deployment. Create via `POST /v1/workspaces/{workspaceId}/items` with `type: "SemanticModel"`, then deploy the definition via `updateDefinition` using TMDL format (see ITEM-DEFINITIONS-CORE.md § SemanticModel):
    • The model must reference the Gold lakehouse SQL endpoint as its data source
    • Define a table mapping to the Gold summary table (e.g., `nyc_taxi_daily_summary`)
    • Use Direct Lake mode — this connects directly to Delta tables in OneLake without data import
    • Include measures for key aggregations (e.g., Total Trips, Avg Fare, Total Revenue, Month over Month Growth)
  4. Create a Power BI report — create via `POST /v1/workspaces/{workspaceId}/items` with `type: "Report"`, then deploy the definition via `updateDefinition` using PBIR format (see ITEM-DEFINITIONS-CORE.md § Report):
    • Reference the semantic model created in step 3 via `definition.pbir`
    • Define at least one page with visuals on the Gold summary table
    • Suggested visuals: line chart (daily trend), card (KPI totals), bar chart (by category), table (detail view)
  5. Verify end-to-end — use the powerbi-consumption-cli skill to run DAX queries against the semantic model and confirm data flows from Gold tables through to the report
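A minimal discovery sketch for step 1 (IDs come from earlier item discovery; the `--query` shape assumes the response fields named above):

```bash
az rest --method get --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/lakehouses/$gold_lakehouse_id" \
  --query "properties.sqlEndpointProperties.{connectionString: connectionString, status: provisioningStatus}"
```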

Principles

  • Discover the SQL endpoint dynamically — the connection string is in `properties.sqlEndpointProperties.connectionString` on the lakehouse response; never hardcode it
  • Wait for SQL endpoint provisioning — the status must be `Success` before connecting; newly created lakehouses may take minutes to provision
  • Prefer Direct Lake mode — avoids data duplication; the semantic model reads directly from OneLake Delta tables
  • Match table/column names exactly — the semantic model table definition must use the exact Delta table and column names from the Gold lakehouse
  • For semantic model authoring (TMDL, refresh, permissions), cross-reference the powerbi-authoring-cli skill
  • For DAX query validation, cross-reference the powerbi-consumption-cli skill

Pipeline Orchestration

When a user requests a pipeline for the medallion flow, guide the LLM to design with:
  • Structure: sequential activities (Bronze → Silver → Gold), each waiting for the previous success; independent Gold aggregations can run in parallel; include validation and notification activities
  • Parameterization: a pipeline-level processing date (defaulting to yesterday) passed to all notebooks; dynamic date expressions (see the expression sketch below)
  • Scheduling: daily, aligned with source refresh; watermark-based incremental processing; periodic full refresh for corrections
  • Error handling: retry with backoff for transient failures, alerting for persistent failures, graceful degradation (downstream uses previous data if upstream fails)
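For the dynamic date expression, the usual Data Factory-style shape is the following (hedged — confirm against the Fabric pipeline expression reference):

```
@formatDateTime(adddays(utcnow(), -1), 'yyyy-MM-dd')
```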

Environment Optimization

For detailed Spark configurations and optimization strategies, see data-engineering-patterns.md.

| Layer | Profile | Key Settings |
| --- | --- | --- |
| Bronze | Write-heavy | Disable V-Order, enable autoCompact, large file targets, partition by ingestion_date |
| Silver | Balanced | Enable V-Order, adaptive query execution, partition by business date, ZORDER on filtered columns |
| Gold | Read-heavy | V-Order (`spark.sql.parquet.vorder.default=true`), Optimize Write (`optimizeWrite.enabled=true`, `binSize=1g`), vectorized readers, adaptive execution, ZORDER on all filter columns, pre-aggregate metrics |
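A minimal sketch of the write-heavy Bronze profile from the table above (property names follow the Gold session config earlier; treat the values as illustrative):

```python
# Write-heavy Bronze: skip V-Order (a read-side optimization) and let Delta
# compact and right-size files as they land
spark.conf.set("spark.sql.parquet.vorder.default", "false")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
```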

Examples

Example 1: Set Up Medallion Workspaces (Default)

Prompt: "Set up medallion architecture with separate Bronze, Silver, and Gold workspaces for sales analytics"
What the LLM should generate: REST API calls to:
  1. Create workspaces: `sales-bronze-dev`, `sales-silver-dev`, `sales-gold-dev`
  2. Create one lakehouse in each workspace: `sales_bronze`, `sales_silver`, `sales_gold`
  3. Assign RBAC roles per workspace/layer

```bash
# Workspace creation (see COMMON-CLI.md for full patterns)
cat > /tmp/body.json << 'EOF'
{"displayName": "sales-analytics-dev"}
EOF
workspace_id=$(az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces" \
  --body @/tmp/body.json --query "id" --output tsv)

# Create Bronze lakehouse
cat > /tmp/body.json << 'EOF'
{"displayName": "sales_bronze", "type": "Lakehouse"}
EOF
az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/items" \
  --body @/tmp/body.json
```

Example 2: Design Bronze Ingestion

Prompt: "Ingest daily CSV files into bronze lakehouse with metadata columns"
What the LLM should generate: PySpark notebook that:
  1. Reads source files with schema inference or an explicit schema
  2. Adds `ingestion_timestamp`, `source_file`, `batch_id` columns
  3. Writes to a Delta table partitioned by ingestion date
  4. Logs row count and validation metrics

```python
# Bronze ingestion pattern (guide LLM to generate full implementation)
from pyspark.sql.functions import current_timestamp, input_file_name, lit, to_date
import uuid

batch_id = str(uuid.uuid4())
df = (spark.read.format("csv").option("header", True).load("Files/landing/daily/")
      .withColumn("ingestion_timestamp", current_timestamp())
      .withColumn("ingestion_date", to_date(current_timestamp()))  # partition column
      .withColumn("source_file", input_file_name())
      .withColumn("batch_id", lit(batch_id)))
df.write.mode("append").partitionBy("ingestion_date").format("delta").saveAsTable("bronze.events_raw")
```

Example 3: Bronze-to-Silver Transformation

Prompt: "Clean bronze data: remove duplicates, filter invalid records, add derived columns, write to silver"
What the LLM should generate: PySpark notebook applying quality rules, schema conformance, and partitioned write with optimization.

Example 4: End-to-End Pipeline

Prompt: "Create a pipeline that runs bronze ingestion, then silver transformation, then gold aggregation daily at 2 AM"
What the LLM should generate: Pipeline JSON definition with sequential notebook activities, date parameter, retry logic, and schedule trigger.