e2e-medallion-architecture

Update Check — ONCE PER SESSION (mandatory). The first time this skill is used in a session, run the check-updates skill before proceeding.
  • GitHub Copilot CLI / VS Code: invoke the `check-updates` skill.
  • Claude Code / Cowork / Cursor / Windsurf / Codex: compare local vs remote package.json version.
  • Skip if the check was already performed earlier in this session.
CRITICAL NOTES
  1. To find workspace details (including the workspace ID) from a workspace name: list all workspaces, then filter with JMESPath (see the sketch below)
  2. To find item details (including the item ID) from a workspace ID, item type, and item name: list all items of that type in the workspace, then filter with JMESPath
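A minimal discovery sketch using `az rest` with a JMESPath `--query` (the workspace name is a placeholder):

```bash
# List all workspaces, then filter by display name to extract the ID
workspace_id=$(az rest --method get --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces" \
  --query "value[?displayName=='sales-bronze-dev'].id | [0]" --output tsv)
```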

End-to-End Medallion Architecture

Prerequisite Knowledge

Read these companion documents — they contain the foundational context this skill depends on:
  • COMMON-CORE.md — Fabric REST API patterns, authentication, token audiences, item discovery
  • COMMON-CLI.md — `az rest`, `az login`, token acquisition, Fabric REST via CLI
  • SPARK-AUTHORING-CORE.md — Notebook deployment, lakehouse creation, job execution
  • notebook-api-operations.md — Required for notebook creation: `.ipynb` structure requirements, cell format, `getDefinition`/`updateDefinition` workflow
For Spark-specific optimization details, see data-engineering-patterns.md.

Architecture Overview

Medallion Architecture is a data lakehouse pattern with three progressive layers:

| Layer | Purpose | Optimization Profile | Use Case |
| --- | --- | --- | --- |
| Bronze (Raw) | Land raw data exactly as received | Write-optimized, append-only, partitioned by ingestion date | Audit trail, reprocessing, lineage |
| Silver (Cleaned) | Deduplicated, validated, conformed data | Balanced read/write, partitioned by business date | Feature engineering, operational reporting |
| Gold (Aggregated) | Pre-calculated metrics for analytics | Read-optimized (ZORDER, compaction), partitioned by month/year | Power BI reports, dashboards, ad-hoc analytics via SQL endpoint |
  • Bronze: Schema-on-read — flexible schema; Delta time travel supports audit and rollback
  • Silver: Schema enforcement — reject non-conforming writes; handle schema evolution with `mergeSchema` when sources change (see the sketch below)
  • Gold: Strict schema governance — curated, business-approved datasets only
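A minimal sketch of the `mergeSchema` option on a Silver write (the table name is illustrative):

```python
# Allow the Silver table to absorb new source columns;
# without mergeSchema, Delta rejects the non-conforming write
(df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("silver.events_cleaned"))
```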

Must/Prefer/Avoid

MUST DO

  • Create a separate lakehouse for each medallion layer (Bronze, Silver, Gold)
  • Add metadata columns in Bronze: ingestion timestamp, source file, batch ID
  • Apply data quality rules in the Bronze-to-Silver transformation (deduplication, null handling, range validation)
  • Use Delta Lake format for all medallion layer tables
  • Use partition-aware overwrite in Silver/Gold writes to avoid reprocessing unchanged data
  • Include validation steps after each layer (row counts, schema checks, anomaly detection)
  • Follow the `.ipynb` validation + Fabric nuances in notebook-api-operations.md when creating notebooks via REST API — every code cell must include `"outputs": []` and `"execution_count": null` (see the cell sketch after this list)
  • Default to separate workspaces per layer for governance and access control: one workspace each for Bronze, Silver, and Gold
  • Complete the full end-to-end flow — do not stop after creating notebooks; always bind lakehouses, execute notebooks sequentially (Bronze → Silver → Gold), verify results, and connect Power BI to the Gold layer unless the user explicitly requests a partial setup
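A minimal sketch of a valid code cell from the rule above, shown as the Python dict later serialized into the `.ipynb` JSON (the source line is illustrative):

```python
# "execution_count" must serialize to null and "outputs" must be an empty list;
# omitting either causes silent failures or "Job instance failed without detail error"
cell = {
    "cell_type": "code",
    "source": ["df = spark.read.table('bronze.events_raw')"],
    "metadata": {},
    "outputs": [],
    "execution_count": None,  # becomes null in the serialized .ipynb
}
```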

PREFER

  • Incremental processing (watermark pattern) over full refresh (see the sketch after this list)
  • Separate notebooks per layer for independent testing and debugging
  • ZORDER on frequently filtered columns in Gold tables
  • Running OPTIMIZE after writes in Silver and Gold layers
  • Environment-specific Spark configs (write-heavy for Bronze, balanced for Silver, read-heavy for Gold)
  • OneLake shortcuts to expose Gold data to consumer workspaces without duplication
  • Clear layer ownership: engineers own Bronze/Silver, analysts own Gold
  • Fabric Variable Libraries to centralize paths and configuration across layers
  • Multi-workspace deployment patterns for medium/high governance requirements (Bronze/Silver/Gold in separate workspaces)
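A minimal sketch of the watermark pattern from the first item above (table and column names are illustrative; the first run must tolerate an empty Silver table):

```python
from pyspark.sql.functions import col, max as spark_max

# Highest ingestion timestamp already processed into Silver
last_watermark = (spark.read.table("silver.events_cleaned")
                  .agg(spark_max("ingestion_timestamp"))
                  .first()[0])

bronze = spark.read.table("bronze.events_raw")
# Fall back to a full read on the first run, when no watermark exists yet
incremental = bronze if last_watermark is None else bronze.filter(
    col("ingestion_timestamp") > last_watermark)
```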

AVOID

  • Storing all layers in a single lakehouse — this defeats isolation and independent optimization
  • Skipping the Silver layer and going directly from Bronze to Gold
  • Hardcoded workspace IDs, lakehouse IDs, or FQDNs — discover via REST API
  • SELECT * without LIMIT on Bronze tables (they grow unboundedly)
  • Running VACUUM without checking downstream dependencies
  • Chaining OneLake shortcuts between medallion layers (Bronze→Silver→Gold) — each layer must be physically materialized for lineage and governance
  • Copying complete implementation code into skills — guide the LLM to generate instead
  • Reading from external HTTP/HTTPS URLs directly in Spark — Fabric Spark cannot access arbitrary external URLs; land data in lakehouse `Files/` first (via `curl`, the OneLake API, or a Fabric pipeline Copy activity — see the upload sketch after this list), then read from the lakehouse path
  • Creating notebooks via REST API without validating `.ipynb` structure — missing `execution_count: null` or `outputs: []` on code cells causes silent failures or "Job instance failed without detail error"
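A hedged sketch of landing a local file into `Files/` over the OneLake DFS endpoint, using the ADLS Gen2-style create/append/flush sequence (path segments are placeholders; see COMMON-CLI.md § OneLake Data Access):

```bash
token=$(az account get-access-token --resource "https://storage.azure.com" \
  --query accessToken --output tsv)
base="https://onelake.dfs.fabric.microsoft.com/$workspace_id/$lakehouse_id/Files/landing/daily/data.csv"

# Create the file, append the bytes, then flush at the final byte offset
curl -s -X PUT "$base?resource=file" -H "Authorization: Bearer $token"
curl -s -X PATCH "$base?action=append&position=0" -H "Authorization: Bearer $token" \
  --data-binary @data.csv
curl -s -X PATCH "$base?action=flush&position=$(wc -c < data.csv)" \
  -H "Authorization: Bearer $token"
```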

Workspace Setup Guidance

When setting up a medallion workspace, guide the LLM to generate commands for:
  1. Default architecture: create three workspaces (recommended):
    • {project}-bronze-{env}
    • {project}-silver-{env}
    • {project}-gold-{env}
  2. Create one lakehouse per workspace:
    • Bronze workspace → {project}_bronze lakehouse
    • Silver workspace → {project}_silver lakehouse
    • Gold workspace → {project}_gold lakehouse
  3. Assign RBAC per layer workspace:
    • Bronze: ingestion/engineering write permissions
    • Silver: engineering/data quality permissions
    • Gold: analytics/BI consumer access with stricter curation controls
  4. Create notebooks for each layer (one per transformation stage) — follow the `.ipynb` validation + Fabric nuances in notebook-api-operations.md
  5. Bind each notebook to its lakehouse — set `metadata.dependencies.lakehouse` with the correct lakehouse ID (see notebook-api-operations.md § Default Lakehouse Binding):
    • Bronze notebook → Bronze workspace/lakehouse
    • Silver notebook → Silver workspace/lakehouse (reads Bronze via cross-workspace OneLake access / fully qualified references)
    • Gold notebook → Gold workspace/lakehouse (reads Silver via cross-workspace access)
  6. Confirm notebook deployment — check that `updateDefinition` returned `Succeeded`; this is sufficient confirmation that content and lakehouse binding persisted. Do NOT call `getDefinition` to re-verify — it is an async LRO and adds unnecessary latency.
  7. Execute notebooks sequentially — Bronze first, then Silver, then Gold — using `POST .../jobs/instances?jobType=RunNotebook` with the correct `defaultLakehouse` in the execution config (both `id` and `name` required; see the sketch after this list)
  8. Connect Power BI to the Gold layer — discover the Gold lakehouse SQL endpoint, create a Direct Lake semantic model, and create a report with visuals on the Gold summary table (see Gold Layer → Power BI Consumption)
  9. Create a pipeline to orchestrate the Bronze → Silver → Gold flow for recurring execution
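A hedged sketch of the job-submission body from step 7 (GUIDs and names are placeholders; polling and duplicate checks are covered in SPARK-AUTHORING-CORE.md):

```bash
cat > /tmp/run.json << 'EOF'
{
  "executionData": {
    "configuration": {
      "defaultLakehouse": {
        "name": "sales_bronze",
        "id": "<bronze-lakehouse-guid>"
      }
    }
  }
}
EOF
az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/items/$notebook_id/jobs/instances?jobType=RunNotebook" \
  --body @/tmp/run.json
```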

Explicit Override: Single Workspace

If the user explicitly asks for a single-workspace deployment (for example, a POC, small team, or monolithic pattern), use this approach:
  • One workspace with separate Bronze/Silver/Gold lakehouses
  • Preserve layer separation logically even when workspace is shared
  • Call out governance trade-offs versus multi-workspace design
Parameterize by environment: workspace name suffix (`-dev`, `-prod`), data volume (sample vs full), capacity SKU, and Bronze retention period.

Bronze Layer — Ingestion Patterns

When a user requests data ingestion into the Bronze layer, guide the LLM to:
  1. Land data in the lakehouse first: external data must be staged into the lakehouse `Files/` folder before Spark can read it — use one of:
    • Fabric Pipeline Copy activity (preferred for recurring loads) — connects to external sources (HTTP, FTP, databases, cloud storage) and writes to OneLake
    • OneLake API / `curl` — upload files via REST API using a `storage.azure.com` token (see COMMON-CLI.md § OneLake Data Access)
    • OneLake shortcut — for data already in Azure ADLS Gen2, S3, or another OneLake location
    • `notebookutils.fs` — copy from mounted storage paths within a notebook (see the sketch after this list)
    • ⚠️ Fabric Spark cannot read from arbitrary HTTP/HTTPS URLs — `spark.read.format("csv").load("https://...")` will fail
  2. Read from the lakehouse path: once data is in `Files/`, read using lakehouse-relative paths (e.g., `spark.read.format("csv").load("Files/landing/daily/")`)
  3. Add metadata and write: tracking columns (ingestion timestamp, source file, batch ID), a Delta table with a descriptive name, partition by ingestion date, append mode
  4. Validate: log row counts, validate schema structure, flag anomalies vs historical patterns
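A minimal sketch of the `notebookutils.fs` copy option above (the source mount path is illustrative):

```python
import notebookutils

# Copy staged files from a mounted storage path into the lakehouse Files/ area;
# the final argument enables recursive copy
notebookutils.fs.cp("file:/mnt/staging/daily/", "Files/landing/daily/", True)
```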

Silver Layer — Transformation Patterns

When a user requests Bronze-to-Silver transformation, guide the LLM to apply (see the sketch after this list):
  • Quality rules: deduplicate on a natural/composite key, filter invalid ranges, handle nulls (drop required, fill optional), validate logical constraints
  • Schema conformance: snake_case column names, standardized data types, derived columns (durations, percentages, categories)
  • Schema evolution: use the `mergeSchema` option when source schemas change; coordinate downstream updates to Gold tables and Power BI datasets
  • Write strategy: partition by business date, partition-aware overwrite, run OPTIMIZE after the write, log before/after metrics
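A minimal sketch combining these rules (key, range, and table names are illustrative):

```python
from pyspark.sql.functions import col

silver = (spark.read.table("bronze.events_raw")
          .dropDuplicates(["event_id"])                     # dedupe on natural key
          .filter(col("fare_amount").between(0, 1000))      # drop invalid ranges
          .na.drop(subset=["event_id", "business_date"]))   # drop rows missing required fields

# Partition-aware overwrite: only the touched business_date partitions are rewritten
(silver.write.format("delta")
       .mode("overwrite")
       .option("partitionOverwriteMode", "dynamic")
       .partitionBy("business_date")
       .saveAsTable("silver.events_cleaned"))

spark.sql("OPTIMIZE silver.events_cleaned")
```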

Gold Layer — Aggregation Patterns

When a user requests Gold analytics tables, guide the LLM to generate:
  • Common aggregates: daily/weekly/monthly summaries, dimensional analysis (by location, category, type), trend breakdowns over time, demand patterns (hour-of-day, day-of-week)
  • Spark session config — set these properties in the Gold notebook before any write operations:

```python
spark.conf.set("spark.sql.parquet.vorder.default", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.binSize", "1g")
```

    • V-Order (`vorder.default`) — applies Fabric's columnar sort optimization to all Parquet files, dramatically improving Direct Lake and SQL endpoint read performance
    • Optimize Write (`optimizeWrite.enabled`) — coalesces small partitions into optimally sized files (target ~1 GB per `binSize`), reducing file count and improving scan efficiency
  • Optimization: ZORDER on filter columns, run OPTIMIZE after writes, pre-aggregate metrics to avoid runtime computation (see the sketch after this list)
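A minimal sketch of a pre-aggregated Gold table with post-write optimization (table, column, and metric names are illustrative):

```python
# Pre-aggregate daily metrics so Direct Lake / SQL endpoint queries avoid runtime computation
daily = (spark.read.table("silver.events_cleaned")
         .groupBy("business_date", "category")
         .agg({"fare_amount": "sum", "event_id": "count"})
         .withColumnRenamed("sum(fare_amount)", "total_fare")
         .withColumnRenamed("count(event_id)", "trip_count"))

(daily.write.format("delta").mode("overwrite")
      .partitionBy("business_date")
      .saveAsTable("gold.daily_summary"))

# Co-locate frequently filtered values, then compact small files
spark.sql("OPTIMIZE gold.daily_summary ZORDER BY (category)")
```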

End-to-End Execution Flow

When setting up medallion architecture end-to-end, the LLM must not stop after creating notebooks and deploying code. The complete lifecycle is:
Create Resources → Deploy Content → Bind Lakehouses → Execute → Verify Results

Step-by-Step

  1. Create layer workspaces and lakehouses (default) — one workspace and one lakehouse per layer (Bronze, Silver, Gold); capture workspace IDs and lakehouse IDs
  2. Create notebooks — one per layer, with valid `.ipynb` structure (see notebook-api-operations.md)
  3. Bind a lakehouse to each notebook — include `metadata.dependencies.lakehouse` in the `.ipynb` payload (see the sketch after this list) with:
    • `default_lakehouse`: the target lakehouse GUID
    • `default_lakehouse_name`: the lakehouse display name
    • `default_lakehouse_workspace_id`: the workspace GUID
  4. Deploy notebook content — call `updateDefinition` with the Base64-encoded `.ipynb` payload (content + lakehouse binding together)
  5. Confirm deployment — check that each `updateDefinition` LRO returned `Succeeded`; that is sufficient. Do NOT call `getDefinition` to re-verify — it is an async LRO and adds significant latency per notebook.
  6. Execute notebooks sequentially — use `POST .../jobs/instances?jobType=RunNotebook`:
    • Pass `defaultLakehouse` with both `id` and `name` in `executionData.configuration`
    • Run Bronze first → poll until `Completed` → run Silver → poll → run Gold → poll
    • Check for recent jobs before submitting (prevent duplicates — see SPARK-AUTHORING-CORE.md)
  7. Verify results — after each notebook completes, confirm expected tables exist and row counts are reasonable
  8. Connect Power BI to Gold — create a semantic model + report on Gold summary tables (see Gold Layer → Power BI Consumption)
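A minimal sketch of the binding block from step 3, as the dict merged into the notebook metadata before Base64 encoding (GUIDs are placeholders):

```python
notebook_metadata = {
    "dependencies": {
        "lakehouse": {
            "default_lakehouse": "<lakehouse-guid>",           # target lakehouse GUID
            "default_lakehouse_name": "sales_bronze",          # display name
            "default_lakehouse_workspace_id": "<workspace-guid>",
        }
    }
}
```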

Common Failure: Stopping After Notebook Creation

If the flow stops after deploying notebook code without binding or executing:
  • Notebooks will have no lakehouse context → `spark.sql()` and relative paths (`Tables/`, `Files/`) fail at runtime
  • The user sees no output or results — the architecture is set up but never tested
  • Always complete through step 7 unless the user explicitly asks to stop at a specific step

Gold Layer → Power BI Consumption

After Gold tables are populated, connect Power BI to surface the analytics: build a semantic model on top of the Gold lakehouse using Direct Lake mode.

Step-by-Step

  1. Discover the Gold lakehouse SQL endpoint — call `GET /v1/workspaces/{workspaceId}/lakehouses/{goldLakehouseId}` and extract `properties.sqlEndpointProperties.connectionString` and `provisioningStatus`; wait until the status is `Success` (see the discovery sketch after this list)
  2. Verify Gold tables via SQL — connect to the SQL endpoint using `sqlcmd` (see COMMON-CLI.md § SQL / TDS Data-Plane Access) and confirm the target table exists:

```sql
SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'nyc_taxi_daily_summary'
```

  3. Create a semantic model — use the powerbi-authoring-cli skill for semantic model creation and TMDL deployment. Create via `POST /v1/workspaces/{workspaceId}/items` with `type: "SemanticModel"`, then deploy the definition via `updateDefinition` using TMDL format (see ITEM-DEFINITIONS-CORE.md § SemanticModel):
    • The model must reference the Gold lakehouse SQL endpoint as its data source
    • Define a table mapping to the Gold summary table (e.g., `nyc_taxi_daily_summary`)
    • Use Direct Lake mode — this connects directly to Delta tables in OneLake without data import
    • Include measures for key aggregations (e.g., Total Trips, Avg Fare, Total Revenue, Month over Month Growth)
  4. Create a Power BI report — create via `POST /v1/workspaces/{workspaceId}/items` with `type: "Report"`, then deploy the definition via `updateDefinition` using PBIR format (see ITEM-DEFINITIONS-CORE.md § Report):
    • Reference the semantic model created in step 3 via `definition.pbir`
    • Define at least one page with visuals on the Gold summary table
    • Suggested visuals: line chart (daily trend), card (KPI totals), bar chart (by category), table (detail view)
  5. Verify end-to-end — use the powerbi-consumption-cli skill to run DAX queries against the semantic model and confirm data flows from Gold tables through to the report
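A minimal discovery sketch for step 1 (IDs come from earlier item discovery; the `--query` shape assumes the response fields named above):

```bash
az rest --method get --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/lakehouses/$gold_lakehouse_id" \
  --query "properties.sqlEndpointProperties.{connectionString: connectionString, status: provisioningStatus}"
```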

Principles

  • Discover the SQL endpoint dynamically — the connection string is in `properties.sqlEndpointProperties.connectionString` on the lakehouse response; never hardcode it
  • Wait for SQL endpoint provisioning — the status must be `Success` before connecting; newly created lakehouses may take minutes to provision
  • Prefer Direct Lake mode — avoids data duplication; the semantic model reads directly from OneLake Delta tables
  • Match table/column names exactly — the semantic model table definition must use the exact Delta table and column names from the Gold lakehouse
  • For semantic model authoring (TMDL, refresh, permissions), cross-reference the powerbi-authoring-cli skill
  • For DAX query validation, cross-reference the powerbi-consumption-cli skill

Pipeline Orchestration

When a user requests a pipeline for the medallion flow, guide the LLM to design with:
  • Structure: sequential activities (Bronze → Silver → Gold), each waiting for the previous success; independent Gold aggregations can run in parallel; include validation and notification activities
  • Parameterization: a pipeline-level processing date (defaulting to yesterday) passed to all notebooks; dynamic date expressions (see the expression sketch below)
  • Scheduling: daily, aligned with source refresh; watermark-based incremental processing; periodic full refresh for corrections
  • Error handling: retry with backoff for transient failures, alerting for persistent failures, graceful degradation (downstream uses previous data if upstream fails)
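For the dynamic date expression, the usual Data Factory-style shape is the following (hedged — confirm against the Fabric pipeline expression reference):

```
@formatDateTime(adddays(utcnow(), -1), 'yyyy-MM-dd')
```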

Environment Optimization

For detailed Spark configurations and optimization strategies, see data-engineering-patterns.md.

| Layer | Profile | Key Settings |
| --- | --- | --- |
| Bronze | Write-heavy | Disable V-Order, enable autoCompact, large file targets, partition by ingestion_date |
| Silver | Balanced | Enable V-Order, adaptive query execution, partition by business date, ZORDER on filtered columns |
| Gold | Read-heavy | V-Order (`spark.sql.parquet.vorder.default=true`), Optimize Write (`optimizeWrite.enabled=true`, `binSize=1g`), vectorized readers, adaptive execution, ZORDER on all filter columns, pre-aggregate metrics |
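A minimal sketch of the write-heavy Bronze profile from the table above (property names follow the Gold session config earlier; treat the values as illustrative):

```python
# Write-heavy Bronze: skip V-Order (a read-side optimization) and let Delta
# compact and right-size files as they land
spark.conf.set("spark.sql.parquet.vorder.default", "false")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
```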

Examples

Example 1: Set Up Medallion Workspaces (Default)

Prompt: "Set up medallion architecture with separate Bronze, Silver, and Gold workspaces for sales analytics"
What the LLM should generate: REST API calls to:
  1. Create workspaces: `sales-bronze-dev`, `sales-silver-dev`, `sales-gold-dev`
  2. Create one lakehouse in each workspace: `sales_bronze`, `sales_silver`, `sales_gold`
  3. Assign RBAC roles per workspace/layer

```bash
# Workspace creation (see COMMON-CLI.md for full patterns)
cat > /tmp/body.json << 'EOF'
{"displayName": "sales-analytics-dev"}
EOF
workspace_id=$(az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces" \
  --body @/tmp/body.json --query "id" --output tsv)

# Create Bronze lakehouse
cat > /tmp/body.json << 'EOF'
{"displayName": "sales_bronze", "type": "Lakehouse"}
EOF
az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/items" \
  --body @/tmp/body.json
```

Example 2: Design Bronze Ingestion

Prompt: "Ingest daily CSV files into bronze lakehouse with metadata columns"
What the LLM should generate: PySpark notebook that:
  1. Reads source files with schema inference or an explicit schema
  2. Adds `ingestion_timestamp`, `source_file`, `batch_id` columns
  3. Writes to a Delta table partitioned by ingestion date
  4. Logs row count and validation metrics

```python
# Bronze ingestion pattern (guide LLM to generate full implementation)
from pyspark.sql.functions import current_timestamp, input_file_name, lit, to_date
import uuid

batch_id = str(uuid.uuid4())
df = (spark.read.format("csv").option("header", True).load("Files/landing/daily/")
      .withColumn("ingestion_timestamp", current_timestamp())
      .withColumn("ingestion_date", to_date(current_timestamp()))  # partition column
      .withColumn("source_file", input_file_name())
      .withColumn("batch_id", lit(batch_id)))
df.write.mode("append").partitionBy("ingestion_date").format("delta").saveAsTable("bronze.events_raw")
```

Example 3: Bronze-to-Silver Transformation

Prompt: "Clean bronze data: remove duplicates, filter invalid records, add derived columns, write to silver"
What the LLM should generate: PySpark notebook applying quality rules, schema conformance, and partitioned write with optimization.

Example 4: End-to-End Pipeline

Prompt: "Create a pipeline that runs bronze ingestion, then silver transformation, then gold aggregation daily at 2 AM"
What the LLM should generate: Pipeline JSON definition with sequential notebook activities, date parameter, retry logic, and schedule trigger.