e2e-medallion-architecture
Update Check — ONCE PER SESSION (mandatory)
The first time this skill is used in a session, run the check-updates skill before proceeding.
- GitHub Copilot CLI / VS Code: invoke the check-updates skill.
- Claude Code / Cowork / Cursor / Windsurf / Codex: compare local vs remote package.json versions.
- Skip if the check was already performed earlier in this session.
CRITICAL NOTES
- To find workspace details (including its ID) from a workspace name: list all workspaces, then filter with JMESPath
- To find item details (including its ID) from a workspace ID, item type, and item name: list all items of that type in that workspace, then filter with JMESPath (see the sketch below)
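A minimal Python sketch of both lookups — the workspace/item names are hypothetical, and the `azure-identity`, `requests`, and `jmespath` packages are assumptions (COMMON-CLI.md shows the equivalent `az rest --query` flow, since `--query` is itself JMESPath):

```python
# Sketch: name -> ID discovery via list + JMESPath filtering
import requests, jmespath
from azure.identity import AzureCliCredential

token = AzureCliCredential().get_token("https://api.fabric.microsoft.com/.default").token
headers = {"Authorization": f"Bearer {token}"}
base = "https://api.fabric.microsoft.com/v1"

# Workspace ID from workspace name
workspaces = requests.get(f"{base}/workspaces", headers=headers).json()
ws_id = jmespath.search("value[?displayName=='sales-bronze-dev'].id | [0]", workspaces)

# Item ID from workspace ID + item type + item name
items = requests.get(f"{base}/workspaces/{ws_id}/items?type=Lakehouse", headers=headers).json()
lh_id = jmespath.search("value[?displayName=='sales_bronze'].id | [0]", items)
```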
End-to-End Medallion Architecture
Prerequisite Knowledge
Read these companion documents — they contain the foundational context this skill depends on:
- COMMON-CORE.md — Fabric REST API patterns, authentication, token audiences, item discovery
- COMMON-CLI.md — `az login`, `az rest`, token acquisition, Fabric REST via CLI
- SPARK-AUTHORING-CORE.md — Notebook deployment, lakehouse creation, job execution
- notebook-api-operations.md — Required for notebook creation — `.ipynb` structure requirements, cell format, `getDefinition`/`updateDefinition` workflow
For Spark-specific optimization details, see data-engineering-patterns.md.
Architecture Overview
Medallion Architecture is a data lakehouse pattern with three progressive layers:
| Layer | Purpose | Optimization Profile | Use Case |
|---|---|---|---|
| Bronze (Raw) | Land raw data exactly as received | Write-optimized, append-only, partitioned by ingestion date | Audit trail, reprocessing, lineage |
| Silver (Cleaned) | Deduplicated, validated, conformed data | Balanced read/write, partitioned by business date | Feature engineering, operational reporting |
| Gold (Aggregated) | Pre-calculated metrics for analytics | Read-optimized (ZORDER, compaction), partitioned by month/year | Power BI reports, dashboards, ad-hoc analytics via SQL endpoint |
- Bronze: Schema-on-read — flexible schema; Delta time travel supports audit and rollback
- Silver: Schema enforcement — reject non-conforming writes; handle schema evolution with `mergeSchema` when sources change
- Gold: Strict schema governance — curated, business-approved datasets only
Must/Prefer/Avoid
MUST DO
- Create a separate lakehouse for each medallion layer (Bronze, Silver, Gold)
- Add metadata columns in Bronze: ingestion timestamp, source file, batch ID
- Apply data quality rules in the Bronze-to-Silver transformation (deduplication, null handling, range validation)
- Use Delta Lake format for all medallion layer tables
- Use partition-aware overwrite in Silver/Gold writes to avoid reprocessing unchanged data
- Include validation steps after each layer (row counts, schema checks, anomaly detection)
- Follow the validation + Fabric nuances in notebook-api-operations.md when creating notebooks via REST API — every `.ipynb` code cell must include `"outputs": []` and `"execution_count": null`
- Default to separate workspaces per layer for governance and access control: one workspace each for Bronze, Silver, and Gold
- Complete the full end-to-end flow — do not stop after creating notebooks; always bind lakehouses, execute notebooks sequentially (Bronze → Silver → Gold), verify results, and connect Power BI to the Gold layer unless the user explicitly requests a partial setup
PREFER
- Incremental processing (watermark pattern) over full refresh — see the sketch after this list
- Separate notebooks per layer for independent testing and debugging
- ZORDER on frequently filtered columns in Gold tables
- Running OPTIMIZE after writes in Silver and Gold layers
- Environment-specific Spark configs (write-heavy for Bronze, balanced for Silver, read-heavy for Gold)
- OneLake shortcuts to expose Gold data to consumer workspaces without duplication
- Clear layer ownership: engineers own Bronze/Silver, analysts own Gold
- Fabric Variable Libraries to centralize paths and configuration across layers
- Multi-workspace deployment patterns for medium/high governance requirements (Bronze/Silver/Gold in separate workspaces)
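A minimal sketch of the watermark pattern named in the first item, with hypothetical table and column names:

```python
# Watermark-based incremental load: process only rows newer than what Silver already has
from pyspark.sql.functions import col, max as spark_max

# Highest business timestamp already landed in Silver
watermark = (spark.table("silver.orders")
             .agg(spark_max("order_ts").alias("wm"))
             .collect()[0]["wm"])

incoming = spark.table("bronze.orders_raw")
if watermark is not None:  # None on the very first run -> full load
    incoming = incoming.filter(col("order_ts") > watermark)

# ...quality rules / conformance transforms run here before the write...
incoming.write.format("delta").mode("append").saveAsTable("silver.orders")
```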
AVOID
- Storing all layers in a single lakehouse — this defeats isolation and independent optimization
- Skipping the Silver layer and going directly from Bronze to Gold
- Hardcoded workspace IDs, lakehouse IDs, or FQDNs — discover via REST API
- SELECT * without LIMIT on Bronze tables (they grow without bound)
- Running VACUUM without checking downstream dependencies
- Chaining OneLake shortcuts between medallion layers (Bronze→Silver→Gold) — each layer must be physically materialized for lineage and governance
- Copying complete implementation code into skills — guide the LLM to generate instead
- Reading from external HTTP/HTTPS URLs directly in Spark — Fabric Spark cannot access arbitrary external URLs; land data in the lakehouse `Files/` area first (via `curl`, the OneLake API, or a Fabric pipeline Copy activity), then read from the lakehouse path
- Creating notebooks via REST API without validating `.ipynb` structure — missing `outputs: []` or `execution_count: null` on code cells causes silent failures or "Job instance failed without detail error"
Workspace Setup Guidance
When setting up a medallion workspace, guide LLM to generate commands for:
- Default architecture (recommended): create three workspaces — `{project}-bronze-{env}`, `{project}-silver-{env}`, `{project}-gold-{env}`
- Create one lakehouse per workspace:
  - Bronze workspace → lakehouse `{project}_bronze`
  - Silver workspace → lakehouse `{project}_silver`
  - Gold workspace → lakehouse `{project}_gold`
- Assign RBAC per layer workspace:
  - Bronze: ingestion/engineering write permissions
  - Silver: engineering/data quality permissions
  - Gold: analytics/BI consumer access with stricter curation controls
- Create notebooks for each layer (one per transformation stage) — follow the `.ipynb` validation + Fabric nuances
- Bind each notebook to its lakehouse — set `metadata.dependencies.lakehouse` with the correct lakehouse ID (see notebook-api-operations.md § Default Lakehouse Binding):
  - Bronze notebook → Bronze workspace/lakehouse
  - Silver notebook → Silver workspace/lakehouse (reads Bronze via cross-workspace OneLake access / fully qualified references)
  - Gold notebook → Gold workspace/lakehouse (reads Silver via cross-workspace access)
- Confirm notebook deployment — check that `updateDefinition` returned `Succeeded`; this is sufficient confirmation that content and lakehouse binding persisted. Do NOT call `getDefinition` to re-verify — it is an async LRO and adds unnecessary latency.
- Execute notebooks sequentially — Bronze first, then Silver, then Gold — using `POST .../jobs/instances?jobType=RunNotebook` with the correct `defaultLakehouse` in the execution config (both `id` and `name` required)
- Connect Power BI to the Gold layer — discover the Gold lakehouse SQL endpoint, create a Direct Lake semantic model, and create a report with visuals on the Gold summary table (see Gold Layer → Power BI Consumption)
- Create pipeline to orchestrate the Bronze → Silver → Gold flow for recurring execution
Explicit Override: Single Workspace
If the user explicitly asks for a single workspace deployment (for example, POC/small team/monolithic pattern), keep the current approach:
- One workspace with separate Bronze/Silver/Gold lakehouses
- Preserve layer separation logically even when workspace is shared
- Call out governance trade-offs versus multi-workspace design
Parameterize by environment: workspace name suffix (`-dev`, `-prod`), data volume (sample vs full), capacity SKU, and Bronze retention period.
Bronze Layer — Ingestion Patterns
When a user requests data ingestion into the Bronze layer, guide LLM to:
- Land data in the lakehouse first: External data must be staged into the lakehouse `Files/` folder before Spark can read it — use one of the following (see the sketch after this list):
  - Fabric Pipeline Copy activity (preferred for recurring loads) — connects to external sources (HTTP, FTP, databases, cloud storage) and writes to OneLake
  - OneLake API / `curl` — upload files via REST API using a `storage.azure.com` token (see COMMON-CLI.md § OneLake Data Access)
  - OneLake Shortcut — for data already in Azure ADLS Gen2, S3, or another OneLake location
  - `notebookutils.fs` — copy from mounted storage paths within a notebook
  - ⚠️ Fabric Spark cannot read from arbitrary HTTP/HTTPS URLs — `spark.read.format("csv").load("https://...")` will fail
- Read from the lakehouse path: Once data is in `Files/`, read using lakehouse-relative paths (e.g., `spark.read.format("csv").load("Files/landing/daily/")`)
- Add metadata and write: Tracking columns (ingestion timestamp, source file, batch ID), a Delta table with a descriptive name, partition by ingestion date, append mode
- Validate: Log row counts, validate schema structure, flag anomalies vs historical patterns
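A minimal sketch of the land-then-read sequence using `notebookutils.fs` — the source account/path are hypothetical, and the `cp` signature is assumed to match the `mssparkutils` file-system API:

```python
# Stage external data into the lakehouse, then read from the lakehouse path;
# Fabric Spark cannot read arbitrary HTTP/HTTPS URLs directly.
notebookutils.fs.cp(
    "abfss://staging@exampleaccount.dfs.core.windows.net/exports/daily/",  # hypothetical source
    "Files/landing/daily/",
    True,  # recurse
)

df = spark.read.format("csv").option("header", True).load("Files/landing/daily/")
```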
Silver Layer — Transformation Patterns
When a user requests Bronze-to-Silver transformation, guide LLM to:
- Quality rules: Deduplicate on a natural/composite key, filter invalid ranges, handle nulls (drop required, fill optional), validate logical constraints
- Schema conformance: snake_case column names, standardized data types, derived columns (durations, percentages, categories)
- Schema evolution: Use the `mergeSchema` option when source schemas change; coordinate downstream updates to Gold tables and Power BI datasets
- Write strategy: Partition by business date, partition-aware overwrite, run OPTIMIZE after the write, log before/after metrics (see the sketch after this list)
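A condensed sketch of all four bullets — table names, column names, and the `processing_date` parameter are hypothetical:

```python
# Bronze -> Silver: quality rules, conformance, partition-aware overwrite
from pyspark.sql.functions import col, to_date

processing_date = "2024-01-15"  # normally passed in as a pipeline parameter

silver = (spark.table("bronze.events_raw")
          .dropDuplicates(["event_id"])                       # dedupe on natural key
          .filter(col("amount").between(0, 10000))            # range validation
          .na.drop(subset=["event_id", "event_ts"])           # required fields must be present
          .withColumn("business_date", to_date("event_ts")))  # conformed partition column

(silver.filter(col("business_date") == processing_date)
       .write.format("delta").mode("overwrite")
       .option("replaceWhere", f"business_date = '{processing_date}'")  # partition-aware overwrite
       .option("mergeSchema", "true")   # tolerate additive source schema changes
       .saveAsTable("silver.events"))

spark.sql("OPTIMIZE silver.events")  # compact small files after the write
```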
Gold Layer — Aggregation Patterns
When a user requests Gold analytics tables, guide LLM to generate:
- Common aggregates: Daily/weekly/monthly summaries, dimensional analysis (by location, category, type), trend breakdowns over time, demand patterns (hour-of-day, day-of-week)
- Spark session config — set these properties in the Gold notebook before any write operations:
```python
spark.conf.set("spark.sql.parquet.vorder.default", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.binSize", "1g")
```
- V-Order (`vorder.default`) — applies Fabric's columnar sort optimization to all Parquet files, dramatically improving Direct Lake and SQL endpoint read performance
- Optimize Write (`optimizeWrite.enabled`) — coalesces small partitions into optimally sized files (target ~1 GB per `binSize`), reducing file count and improving scan efficiency
- Optimization: ZORDER on filter columns, run OPTIMIZE after writes, pre-aggregate metrics to avoid runtime computation (see the sketch below)
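A minimal sketch of a pre-aggregated Gold build with post-write optimization, assuming the session config above is already set (table/column names hypothetical):

```python
# Silver -> Gold: pre-aggregate, then ZORDER + compact for read performance
from pyspark.sql.functions import avg, count, sum as spark_sum

daily = (spark.table("silver.events")
         .groupBy("business_date", "category")
         .agg(count("*").alias("total_events"),
              spark_sum("amount").alias("total_amount"),
              avg("amount").alias("avg_amount")))

daily.write.format("delta").mode("overwrite").saveAsTable("gold.daily_summary")

# ZORDER on the columns dashboards filter by most; compaction happens as part of OPTIMIZE
spark.sql("OPTIMIZE gold.daily_summary ZORDER BY (category)")
```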
End-to-End Execution Flow
When setting up medallion architecture end-to-end, the LLM must not stop after creating notebooks and deploying code. The complete lifecycle is:
Create Resources → Deploy Content → Bind Lakehouses → Execute → Verify Results
Step-by-Step
1. Create layer workspaces and lakehouses (default) — one workspace and one lakehouse per layer (Bronze, Silver, Gold); capture workspace IDs and lakehouse IDs
2. Create notebooks — one per layer, with valid `.ipynb` structure (see notebook-api-operations.md)
3. Bind a lakehouse to each notebook — include `metadata.dependencies.lakehouse` in the `.ipynb` payload with:
   - `default_lakehouse`: the target lakehouse GUID
   - `default_lakehouse_name`: the lakehouse display name
   - `default_lakehouse_workspace_id`: the workspace GUID
4. Deploy notebook content — `updateDefinition` with the Base64-encoded `.ipynb` payload (content + lakehouse binding together)
5. Confirm deployment — check that each `updateDefinition` LRO returned `Succeeded`; that is sufficient. Do NOT call `getDefinition` to re-verify — it is an async LRO and adds significant latency per notebook.
6. Execute notebooks sequentially — use `POST .../jobs/instances?jobType=RunNotebook` (a submission/polling sketch follows this list):
   - Pass `defaultLakehouse` with both `id` and `name` in `executionData.configuration`
   - Run Bronze first → poll until `Completed` → run Silver → poll → run Gold → poll
   - Check for recent jobs before submitting (prevent duplicates — see SPARK-AUTHORING-CORE.md)
7. Verify results — after each notebook completes, confirm expected tables exist and row counts are reasonable
8. Connect Power BI to Gold — create a semantic model + report on Gold summary tables (see Gold Layer → Power BI Consumption)
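A sketch of step 6 in Python — IDs are placeholders, and the `azure-identity`/`requests` approach plus the `Location` polling header are assumptions consistent with the LRO pattern described in COMMON-CORE.md:

```python
# Submit a notebook run with lakehouse context, then poll until a terminal state
import time
import requests
from azure.identity import AzureCliCredential

workspace_id = "<bronze-workspace-guid>"   # placeholder
notebook_id = "<bronze-notebook-guid>"     # placeholder
lakehouse_id = "<bronze-lakehouse-guid>"   # placeholder

token = AzureCliCredential().get_token("https://api.fabric.microsoft.com/.default").token
headers = {"Authorization": f"Bearer {token}"}

url = (f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}"
       f"/items/{notebook_id}/jobs/instances?jobType=RunNotebook")
body = {"executionData": {"configuration": {
    "defaultLakehouse": {"name": "sales_bronze", "id": lakehouse_id}}}}

resp = requests.post(url, json=body, headers=headers)
resp.raise_for_status()               # expect 202 Accepted
poll_url = resp.headers["Location"]   # job-instance status endpoint

status = "NotStarted"
while status not in ("Completed", "Failed", "Cancelled"):
    time.sleep(30)
    status = requests.get(poll_url, headers=headers).json()["status"]
print(f"Notebook job finished: {status}")
```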
Common Failure: Stopping After Notebook Creation
If the flow stops after deploying notebook code without binding or executing:
- Notebooks will have no lakehouse context → `spark.sql()` and relative paths (`Tables/`, `Files/`) fail at runtime
- The user sees no output or results — the architecture is set up but never tested
- Always complete through step 7 unless the user explicitly asks to stop at a specific step
Gold Layer → Power BI Consumption
After Gold tables are populated, connect Power BI to surface the analytics.
Build a semantic model on top of the Gold lakehouse using Direct Lake mode.
Step-by-Step
1. Discover the Gold lakehouse SQL endpoint — call `GET /v1/workspaces/{workspaceId}/lakehouses/{goldLakehouseId}` and extract `properties.sqlEndpointProperties.connectionString`; wait until `provisioningStatus` is `Success`
2. Verify Gold tables via SQL — connect to the SQL endpoint using `sqlcmd` (see COMMON-CLI.md § SQL / TDS Data-Plane Access) and confirm the target table exists: `SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'nyc_taxi_daily_summary'`
3. Create a semantic model — use the powerbi-authoring-cli skill for semantic model creation and TMDL deployment. Create via `POST /v1/workspaces/{workspaceId}/items` with `type: "SemanticModel"`, then deploy the definition via `updateDefinition` using TMDL format (see ITEM-DEFINITIONS-CORE.md § SemanticModel):
   - The model must reference the Gold lakehouse SQL endpoint as its data source
   - Define a table mapping to the Gold summary table (e.g., `nyc_taxi_daily_summary`)
   - Use Direct Lake mode — this connects directly to Delta tables in OneLake without data import
   - Include measures for key aggregations you find interesting (e.g., `Total Trips`, `Avg Fare`, `Total Revenue`, `Month over Month Growth`)
4. Create a Power BI report — `POST /v1/workspaces/{workspaceId}/items` with `type: "Report"`, then deploy the definition via `updateDefinition` using PBIR format (see ITEM-DEFINITIONS-CORE.md § Report):
   - Reference the semantic model created in step 3 via `definition.pbir`
   - Define at least one page with visuals on the Gold summary table
   - Suggested visuals: line chart (daily trend), card (KPI totals), bar chart (by category), table (detail view)
5. Verify end-to-end — use the powerbi-consumption-cli skill to run DAX queries against the semantic model and confirm data flows from Gold tables through to the report (one approach is sketched below)
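For step 5, one way to issue a DAX probe outside the powerbi-consumption-cli skill is the Power BI `executeQueries` REST endpoint — a sketch with a placeholder dataset ID:

```python
# Run a DAX query against the semantic model to confirm Gold data reaches the report layer
import requests
from azure.identity import AzureCliCredential

dataset_id = "<semantic-model-guid>"  # placeholder
token = AzureCliCredential().get_token(
    "https://analysis.windows.net/powerbi/api/.default").token

resp = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/executeQueries",
    json={"queries": [{"query": "EVALUATE TOPN(5, 'nyc_taxi_daily_summary')"}]},
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
print(resp.json()["results"][0]["tables"][0]["rows"])
```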
Principles
- Discover the SQL endpoint dynamically — the connection string is in `properties.sqlEndpointProperties.connectionString` on the lakehouse response; never hardcode it
- Wait for SQL endpoint provisioning — status must be `Success` before connecting; newly created lakehouses may take minutes to provision
- Prefer Direct Lake mode — avoids data duplication; the semantic model reads directly from OneLake Delta tables
- Match table/column names exactly — the semantic model table definition must use the exact Delta table and column names from the Gold lakehouse
- For semantic model authoring (TMDL, refresh, permissions), cross-reference the powerbi-authoring-cli skill
- For DAX query validation, cross-reference the powerbi-consumption-cli skill
Pipeline Orchestration
When a user requests a pipeline for the medallion flow, guide LLM to design with:
- Structure: Sequential activities (Bronze → Silver → Gold), each waiting for previous success; independent Gold aggregations can run in parallel; include validation and notification activities
- Parameterization: Pipeline-level processing date (defaults to yesterday), passed to all notebooks; dynamic date expressions
- Scheduling: Daily aligned with source refresh, watermark-based incremental processing, periodic full refresh for corrections
- Error handling: Retry with backoff for transient failures, alerting for persistent failures, graceful degradation (downstream uses previous data if upstream fails)
Environment Optimization
For detailed Spark configurations and optimization strategies, see data-engineering-patterns.md.
| Layer | Profile | Key Settings |
|---|---|---|
| Bronze | Write-heavy | Disable V-Order, enable autoCompact, large file targets, partition by ingestion_date |
| Silver | Balanced | Enable V-Order, adaptive query execution, partition by business date, ZORDER on filtered columns |
| Gold | Read-heavy | V-Order (`vorder.default`) plus optimize write, ZORDER on filter columns, OPTIMIZE after writes, partition by month/year |
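As a counterpart to the Gold session config shown earlier, a Bronze write-heavy config might look like this sketch — the `autoCompact` property name is an assumption mirroring the `optimizeWrite` naming; verify against data-engineering-patterns.md:

```python
# Bronze: favor ingest throughput over read performance
spark.conf.set("spark.sql.parquet.vorder.default", "false")           # skip V-Order sorting on write
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")  # fold small files automatically
```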
Examples
Example 1: Set Up Medallion Workspaces (Default)
Prompt: "Set up medallion architecture with separate Bronze, Silver, and Gold workspaces for sales analytics"
What the LLM should generate: REST API calls to:
- Create workspaces: `sales-bronze-dev`, `sales-silver-dev`, `sales-gold-dev`
- Create one lakehouse in each workspace: `sales_bronze`, `sales_silver`, `sales_gold`
- Assign RBAC roles per workspace/layer
```bash
# Workspace creation (see COMMON-CLI.md for full patterns)
cat > /tmp/body.json << 'EOF'
{"displayName": "sales-bronze-dev"}
EOF
workspace_id=$(az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces" \
  --body @/tmp/body.json --query "id" --output tsv)

# Create Bronze lakehouse
cat > /tmp/body.json << 'EOF'
{"displayName": "sales_bronze", "type": "Lakehouse"}
EOF
az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/items" \
  --body @/tmp/body.json
```
Example 2: Design Bronze Ingestion
Prompt: "Ingest daily CSV files into bronze lakehouse with metadata columns"
What the LLM should generate: PySpark notebook that:
- Reads source files with schema inference or explicit schema
- Adds `ingestion_timestamp`, `source_file`, and `batch_id` columns
- Logs row count and validation metrics
```python
# Bronze ingestion pattern (guide LLM to generate full implementation)
from pyspark.sql.functions import current_timestamp, input_file_name, lit, to_date
import uuid

batch_id = str(uuid.uuid4())

df = (spark.read.format("csv").option("header", True).load("Files/landing/daily/")
      .withColumn("ingestion_timestamp", current_timestamp())
      .withColumn("ingestion_date", to_date(current_timestamp()))  # partition column
      .withColumn("source_file", input_file_name())
      .withColumn("batch_id", lit(batch_id)))

df.write.mode("append").partitionBy("ingestion_date").format("delta").saveAsTable("bronze.events_raw")
```
Example 3: Bronze-to-Silver Transformation
Prompt: "Clean bronze data: remove duplicates, filter invalid records, add derived columns, write to silver"
What the LLM should generate: PySpark notebook applying quality rules, schema conformance, and partitioned write with optimization.
Example 4: End-to-End Pipeline
Prompt: "Create a pipeline that runs bronze ingestion, then silver transformation, then gold aggregation daily at 2 AM"
What the LLM should generate: Pipeline JSON definition with sequential notebook activities, date parameter, retry logic, and schedule trigger.