databricks-pipelines
Lakeflow Spark Declarative Pipelines Development
FIRST: Use the parent `databricks-core` skill for CLI basics, authentication, profile selection, and data discovery commands.
Decision Tree
Use this tree to determine which dataset type and features to use. Multiple features can apply to the same dataset — e.g., a Streaming Table can use Auto Loader for ingestion, Append Flows for fan-in, and Expectations for data quality. Choose the dataset type first, then layer on applicable features.
```
User request → What kind of output?
├── Intermediate/reusable logic (not persisted) → Temporary View
│ ├── Preprocessing/filtering before Auto CDC → Temporary View feeding CDC flow
│ ├── Shared intermediate streaming logic reused by multiple downstream tables
│ ├── Pipeline-private helper logic (not published to catalog)
│ └── Published to UC for external queries → Persistent View (SQL only)
├── Persisted dataset
│ ├── Source is streaming/incremental/continuously growing → Streaming Table
│ │ ├── File ingestion (cloud storage, Volumes) → Auto Loader
│ │ ├── Message bus (Kafka, Kinesis, Pub/Sub, Pulsar, Event Hubs) → streaming source read
│ │ ├── Existing streaming/Delta table → streaming read from table
│ │ ├── CDC / upserts / track changes / keep latest per key / SCD Type 1 or 2 → Auto CDC
│ │ ├── Multiple sources into one table → Append Flows (NOT union)
│ │ ├── Historical backfill + live stream → one-time Append Flow + regular flow
│ │ └── Windowed aggregation with watermark → stateful streaming
│ └── Source is batch/historical/full scan → Materialized View
│ ├── Aggregation/join across full dataset (GROUP BY, SUM, COUNT, etc.)
│ ├── Gold layer aggregation from streaming table → MV with batch read (spark.read / no STREAM)
│ ├── JDBC/Federation/external batch sources
│ └── Small static file load (reference data, no streaming read)
├── Output to external system (Python only) → Sink
│ ├── Existing external table not managed by this pipeline → Sink with format="delta"
│ │ (prefer fully-qualified dataset names if the pipeline should own the table — see Publishing Modes)
│ ├── Kafka / Event Hubs → Sink with format="kafka" + @dp.append_flow(target="sink_name")
│ ├── Custom destination not natively supported → Sink with custom format
│ ├── Custom merge/upsert logic per batch → ForEachBatch Sink (Public Preview)
│ └── Multiple destinations per batch → ForEachBatch Sink (Public Preview)
└── Data quality constraints → Expectations (on any dataset type)
```
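To make the tree concrete, here is a minimal sketch of the common three-layer shape (Streaming Table via Auto Loader, a second Streaming Table for cleanup, then a Materialized View for Gold aggregation). It assumes the `from pyspark import pipelines as dp` import style and that `spark` is provided by the pipeline runtime; table names and the Volume path are hypothetical, so verify the exact APIs in the skills linked below.

```python
from pyspark import pipelines as dp
from pyspark.sql import functions as F

@dp.table()  # Streaming Table: incremental file source via Auto Loader
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/raw/orders")  # hypothetical Volume path
    )

@dp.table()  # Streaming Table: incremental cleanup of the bronze stream
def orders_silver():
    return (
        spark.readStream.table("orders_bronze")
        .where(F.col("order_id").isNotNull())
    )

@dp.materialized_view()  # Gold aggregation: batch read, no STREAM
def daily_revenue():
    return (
        spark.read.table("orders_silver")
        .groupBy("order_date")
        .agg(F.sum("amount").alias("revenue"))
    )
```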
Common Traps
- "Create a table" without specifying type → ask whether the source is streaming or batch
- Materialized View from streaming source is an error → use a Streaming Table instead, or switch to a batch read
- Streaming Table from batch source is an error → use a Materialized View instead, or switch to a streaming read
- Aggregation over streaming table → use a Materialized View with batch read (`spark.read.table` / `SELECT FROM` without `STREAM`), NOT a Streaming Table. This is the correct pattern for Gold layer aggregation.
- Aggregation over batch/historical data → use a Materialized View, not a Streaming Table. MVs recompute or incrementally refresh aggregates to stay correct; STs are append-only and don't recompute when source data changes.
- Preprocessing before Auto CDC → use a Temporary View to filter/transform the source before feeding into the CDC flow. SQL: the CDC flow reads from the view via `STREAM(view_name)`. Python: use `spark.readStream.table("view_name")`.
- Intermediate logic → default to Temporary View → Use a Temporary View for intermediate/preprocessing logic, even when reused by multiple downstream tables. Only consider a Private MV/ST (`private=True` / `CREATE PRIVATE ...`) when the computation is expensive and materializing once would save significant reprocessing.
- View vs Temporary View → Persistent Views publish to Unity Catalog (SQL only), Temporary Views are pipeline-private
- Union of streams → use multiple Append Flows. Do NOT present UNION as an alternative — it is an anti-pattern for streaming sources.
- Changing dataset type → cannot change ST→MV or MV→ST without manually dropping the existing table first. Full refresh does NOT help. Rename the new dataset instead.
- SQL → Prefer `CREATE OR REFRESH` over bare `CREATE` for SQL dataset definitions. Both work identically, but `CREATE OR REFRESH` is the idiomatic convention. For PRIVATE datasets: `CREATE OR REFRESH PRIVATE STREAMING TABLE` / `CREATE OR REFRESH PRIVATE MATERIALIZED VIEW`.
- Kafka/Event Hubs sink serialization → The `value` column is mandatory. Use `to_json(struct(*)) AS value` to serialize the entire row as JSON. Read the sink skill for details.
- Multi-column sequencing in Auto CDC → SQL: `SEQUENCE BY STRUCT(col1, col2)`. Python: `sequence_by=struct("col1", "col2")`. Read the auto-cdc skill for details.
- Auto CDC supports TRUNCATE (SCD Type 1 only) → SQL: `APPLY AS TRUNCATE WHEN condition`. Python: `apply_as_truncates=expr("condition")`. Do NOT say truncate is unsupported.
- Python-only features → Sinks, ForEachBatch Sinks, CDC from snapshots, and custom data sources are Python-only. When the user is working in SQL, explicitly clarify this and suggest switching to Python.
- MV incremental refresh → Materialized Views on serverless pipelines support automatic incremental refresh for aggregations. Mention the serverless requirement when discussing incremental refresh.
- Recommend ONE clear approach → Present a single recommended approach. Do NOT present anti-patterns or significantly inferior alternatives — it confuses users. Only mention alternatives if they are genuinely viable for different trade-offs.
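One trap above deserves a sketch: fan-in with Append Flows instead of a streaming UNION. The shape below assumes `dp.create_streaming_table` and `@dp.append_flow` as documented in the streaming-table skill; the broker address, topic, and path are hypothetical, and both flows must produce the same schema.

```python
from pyspark import pipelines as dp

dp.create_streaming_table("events_all")  # target table; it has no query of its own

@dp.append_flow(target="events_all")
def events_from_kafka():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
        .option("subscribe", "events")
        .load()
        .selectExpr("CAST(value AS STRING) AS payload")
    )

@dp.append_flow(target="events_all")
def events_from_files():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/raw/events")  # hypothetical path
        .selectExpr("to_json(struct(*)) AS payload")  # match the Kafka flow's schema
    )
```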
Publishing Modes
Pipelines use a default catalog and schema configured in the pipeline settings. All datasets are published there unless overridden.
- Fully-qualified names: Use `catalog.schema.table` in the dataset name to write to a different catalog/schema than the pipeline default. The pipeline creates the dataset there directly — no Sink needed.
- `USE CATALOG` / `USE SCHEMA`: SQL commands that change the current catalog/schema for all subsequent definitions in the same file.
- `LIVE` prefix: Deprecated. Ignored in the default publishing mode.
- When reading or defining datasets within the pipeline, use the dataset name only — do NOT use fully-qualified names unless the pipeline already does so or the user explicitly requests a different target catalog/schema.
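A short sketch of the fully-qualified-name case, assuming the decorator accepts a `name` argument as the dataset skills describe; the catalog, schema, and table names are hypothetical.

```python
from pyspark import pipelines as dp

# Published to reporting.finance instead of the pipeline's default target.
@dp.materialized_view(name="reporting.finance.daily_revenue")
def daily_revenue():
    # Other datasets in this pipeline still read it by its defined name.
    return spark.read.table("orders_silver").groupBy("order_date").count()
```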
Comprehensive API Reference
MANDATORY: Before implementing, editing, or suggesting any code for a feature, you MUST read the linked reference file for that feature. NO exceptions — always look up the reference before writing code.
Some features require reading multiple skills together:
- Auto Loader → also read the streaming-table skill (Auto Loader produces a streaming DataFrame, so the target is a streaming table) and look up format-specific options for the file format being loaded
- Auto CDC → also read the streaming-table skill (Auto CDC always targets a streaming table)
- Sinks → also read the streaming-table skill (sinks use streaming append flows)
- Expectations → also read the corresponding dataset definition skill to ensure constraints are correctly placed
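As an example of reading skills together, Auto Loader plus its target streaming table, with format-specific options for CSV. This is a sketch: the `cloudFiles` option names follow the documented Auto Loader API, and the path is hypothetical; confirm details in the auto-loader and streaming-table skills.

```python
from pyspark import pipelines as dp

@dp.table(comment="Raw CSV ingested incrementally with Auto Loader")
def customers_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")           # file format being loaded
        .option("header", "true")                     # CSV-specific option
        .option("cloudFiles.inferColumnTypes", "true")
        .load("/Volumes/main/raw/customers")          # hypothetical path
    )
```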
Dataset Definition APIs
| Feature | Python (current) | Python (deprecated) | SQL (current) | SQL (deprecated) | Skill (Py) | Skill (SQL) |
|---|---|---|---|---|---|---|
| Streaming Table | | | | | streaming-table-python | streaming-table-sql |
| Materialized View | | | | | materialized-view-python | materialized-view-sql |
| Temporary View | | | | | temporary-view-python | temporary-view-sql |
| Persistent View (UC) | N/A — SQL only | — | | — | — | view-sql |
| Streaming Table (explicit) | | | | — | streaming-table-python | streaming-table-sql |
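A sketch of the current-style definitions side by side, with the deprecated DLT equivalents noted in comments. Decorator names assume the `pyspark.pipelines` module; verify each against the skill files referenced in the table.

```python
from pyspark import pipelines as dp

@dp.table()                      # deprecated: @dlt.table on a streaming query
def clicks():                    # Streaming Table
    return spark.readStream.table("clicks_raw")

@dp.materialized_view()          # deprecated: @dlt.table on a batch query
def clicks_per_user():           # Materialized View
    return spark.read.table("clicks").groupBy("user_id").count()

@dp.temporary_view()             # deprecated: @dlt.view
def recent_clicks():             # pipeline-private helper
    return spark.read.table("clicks").where("ts > current_date() - 7")
```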
Flow and Sink APIs
| Feature | Python (current) | Python (deprecated) | SQL (current) | SQL (deprecated) | Skill (Py) | Skill (SQL) |
|---|---|---|---|---|---|---|
| Append Flow | | | | — | streaming-table-python | streaming-table-sql |
| Backfill Flow | | | | — | streaming-table-python | streaming-table-sql |
| Sink (Delta/Kafka/EH/custom) | | | N/A — Python only | — | sink-python | — |
| ForEachBatch Sink | | — | N/A — Python only | — | foreach-batch-sink-python | — |
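A sketch of a Kafka sink plus the append flow that feeds it. The `dp.create_sink` signature and option keys are assumptions to verify in the sink-python skill; the broker, topic, and source table are hypothetical. Note the mandatory `value` column from Common Traps above.

```python
from pyspark import pipelines as dp
from pyspark.sql import functions as F

dp.create_sink(
    name="events_kafka_sink",
    format="kafka",
    options={
        "kafka.bootstrap.servers": "broker:9092",  # hypothetical broker
        "topic": "events_out",
    },
)

@dp.append_flow(target="events_kafka_sink")
def publish_events():
    df = spark.readStream.table("events_all")
    # Kafka requires a `value` column; serialize the whole row as JSON.
    return df.select(F.to_json(F.struct("*")).alias("value"))
```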
CDC APIs
| Feature | Python (current) | Python (deprecated) | SQL (current) | SQL (deprecated) | Skill (Py) | Skill (SQL) |
|---|---|---|---|---|---|---|
| Auto CDC (streaming source) | | | | | auto-cdc-python | auto-cdc-sql |
| Auto CDC (periodic snapshot) | | | N/A — Python only | — | auto-cdc-python | — |
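A sketch of Auto CDC from a streaming source into a streaming table, SCD Type 2 with multi-column sequencing. Argument names mirror the documented CDC API but should be confirmed in the auto-cdc-python skill; the feed and column names are hypothetical.

```python
from pyspark import pipelines as dp
from pyspark.sql.functions import struct, expr

dp.create_streaming_table("customers")  # Auto CDC always targets a streaming table

dp.create_auto_cdc_flow(
    target="customers",
    source="customers_cdc_feed",                    # hypothetical CDC feed
    keys=["customer_id"],
    sequence_by=struct("commit_ts", "commit_seq"),  # multi-column ordering
    apply_as_deletes=expr("op = 'DELETE'"),
    stored_as_scd_type=2,                           # SCD Type 2 history
)
```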
Data Quality APIs
| Feature | Python (current) | Python (deprecated) | SQL (current) | Skill (Py) | Skill (SQL) |
|---|---|---|---|---|---|
| Expect (warn) | | | | expectations-python | expectations-sql |
| Expect or drop | | | | expectations-python | expectations-sql |
| Expect or fail | | | | expectations-python | expectations-sql |
| Expect all (warn) | | | Multiple | expectations-python | expectations-sql |
| Expect all or drop | | | Multiple constraints with | expectations-python | expectations-sql |
| Expect all or fail | | | Multiple constraints with | expectations-python | expectations-sql |
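A sketch of the expectation decorators stacked on one dataset; names follow the documented expect / expect_or_drop / expect_all_or_fail family, but check placement rules in the expectations skills. Dataset and column names are hypothetical.

```python
from pyspark import pipelines as dp

@dp.table()
@dp.expect("valid_amount", "amount >= 0")               # warn: record metric only
@dp.expect_or_drop("valid_id", "order_id IS NOT NULL")  # drop violating rows
@dp.expect_all_or_fail({                                # fail the update on violation
    "known_status": "status IN ('open', 'closed')",
    "valid_date": "order_date IS NOT NULL",
})
def orders_validated():
    return spark.readStream.table("orders_bronze")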
Reading Data APIs
| Feature | Python (current) | Python (deprecated) | SQL (current) | SQL (deprecated) | Skill (Py) | Skill (SQL) |
|---|---|---|---|---|---|---|
| Batch read (pipeline dataset) | | | | | — | — |
| Streaming read (pipeline dataset) | | | | | — | — |
| Auto Loader (cloud files) | | — | | — | auto-loader-python | auto-loader-sql |
| Kafka source | | — | | — | — | — |
| Kinesis source | | — | | — | — | — |
| Pub/Sub source | | — | | — | — | — |
| Pulsar source | | — | | — | — | — |
| Event Hubs source | | — | | — | — | — |
| JDBC / Lakehouse Federation | | — | Direct table ref via federation catalog | — | — | — |
| Custom data source | | — | N/A — Python only | — | — | — |
| Static file read (batch) | | — | | — | — | — |
| Skip upstream change commits | | — | | — | streaming-table-python | streaming-table-sql |
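A sketch of the three most common read patterns inside a pipeline, including `skipChangeCommits` for ignoring upstream updates/deletes; dataset names are hypothetical.

```python
from pyspark import pipelines as dp

@dp.materialized_view()
def batch_copy():
    return spark.read.table("orders_silver")        # batch read of a pipeline dataset

@dp.table()
def stream_copy():
    return spark.readStream.table("orders_silver")  # streaming read

@dp.table()
def stream_appends_only():
    return (
        spark.readStream
        .option("skipChangeCommits", "true")        # ignore upstream change commits
        .table("orders_silver")
    )
```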
Table/Schema Feature APIs
| Feature | Python (current) | SQL (current) | Skill (Py) | Skill (SQL) |
|---|---|---|---|---|
| Liquid clustering | | | materialized-view-python | materialized-view-sql |
| Auto liquid clustering | | | materialized-view-python | materialized-view-sql |
| Partition columns | | | materialized-view-python | materialized-view-sql |
| Table properties | | | materialized-view-python | materialized-view-sql |
| Explicit schema | | | materialized-view-python | materialized-view-sql |
| Generated columns | | | materialized-view-python | materialized-view-sql |
| Row filter (Public Preview) | | | materialized-view-python | materialized-view-sql |
| Column mask (Public Preview) | | | materialized-view-python | materialized-view-sql |
| Private dataset | | | materialized-view-python | materialized-view-sql |
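A sketch combining several table-level features on one definition. The keyword names (`cluster_by`, `table_properties`, `schema`, `private`) are assumptions drawn from the documented decorator options; verify each in the dataset skills before use.

```python
from pyspark import pipelines as dp

@dp.table(
    cluster_by=["order_date"],                             # liquid clustering
    table_properties={"pipelines.reset.allowed": "false"},
    schema="order_id BIGINT, order_date DATE, amount DOUBLE",
    private=True,                                          # not published to the catalog
)
def orders_private():
    return spark.readStream.table("orders_bronze")
```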
Import / Module APIs
| Current | Deprecated | Notes |
|---|---|---|
| | Both work. Prefer |
| | Deprecated reads still work. Prefer |
| — | | Fully deprecated. NEVER use. Causes errors in newer pipelines. |
| — | | Fully deprecated. Use |
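A sketch of the migration direction for imports and reads, with the deprecated DLT forms noted in comments; confirm against the table above and the skill files.

```python
from pyspark import pipelines as dp   # current; deprecated: import dlt

@dp.materialized_view()
def users_enriched():
    # current: spark.read.table / spark.readStream.table
    # deprecated: dlt.read("users") / dlt.read_stream("users")
    return spark.read.table("users")
```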
Language-specific guides
Lakeflow Spark Declarative Pipelines (formerly Delta Live Tables / DLT) is a framework for building batch and streaming data pipelines.
Scaffolding a New Pipeline Project
Use `databricks bundle init` with a config file to scaffold non-interactively. This creates a project in the `<project_name>/` directory:

```bash
databricks bundle init lakeflow-pipelines --config-file <(echo '{"project_name": "my_pipeline", "language": "python", "serverless": "yes"}') --profile <PROFILE> < /dev/null
```

- `project_name`: letters, numbers, underscores only
- `language`: `python` or `sql`. Ask the user which they prefer:
  - SQL: Recommended for straightforward transformations (filters, joins, aggregations)
  - Python: Recommended for complex logic (custom UDFs, ML, advanced processing)

After scaffolding, create `CLAUDE.md` and `AGENTS.md` in the project directory. These files are essential to provide agents with guidance on how to work with the project. Use this content:
```markdown
# Declarative Automation Bundles Project

This project uses Declarative Automation Bundles (formerly Databricks Asset Bundles) for deployment.

## Prerequisites

Install the Databricks CLI (>= v0.288.0) if not already installed:

- macOS: `brew tap databricks/tap && brew install databricks`
- Linux: `curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh`
- Windows: `winget install Databricks.DatabricksCLI`

Verify: `databricks -v`

## For AI Agents

Read the `databricks-core` skill for CLI basics, authentication, and deployment workflow.
Read the `databricks-pipelines` skill for pipeline-specific guidance.

If skills are not available, install them: `databricks experimental aitools skills install`
```

Pipeline Structure
- Follow the medallion architecture pattern (Bronze → Silver → Gold) unless the user specifies otherwise
- Use the convention of 1 dataset per file, named after the dataset
- Place transformation files in a `src/` or `transformations/` folder
```
my-pipeline-project/
├── databricks.yml                   # Bundle configuration
├── resources/
│   ├── my_pipeline.pipeline.yml     # Pipeline definition
│   └── my_pipeline_job.job.yml      # Scheduling job (optional)
└── src/
    ├── my_table.py (or .sql)        # One dataset per file
    ├── another_table.py (or .sql)
    └── ...
```
Scheduling Pipelines
To schedule a pipeline, add a job that triggers it in `resources/<name>.job.yml`:

```yaml
resources:
  jobs:
    my_pipeline_job:
      trigger:
        periodic:
          interval: 1
          unit: DAYS
      tasks:
        - task_key: refresh_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.my_pipeline.id}
```
Running Pipelines
You must deploy before running. In local development, code changes only take effect after `databricks bundle deploy`. Always deploy before any run, dry run, or selective refresh.
- Selective refresh is preferred when you only need to run one table. For selective refresh it is important that dependencies are already materialized.
- Full refresh is the most expensive and dangerous option, and can lead to data loss, so it should be used only when really necessary. Always suggest this as a follow-up that the user explicitly needs to select.
Development Workflow
- Validate: `databricks bundle validate --profile <profile>`
- Deploy: `databricks bundle deploy -t dev --profile <profile>`
- Run pipeline: `databricks bundle run <pipeline_name> -t dev --profile <profile>`
- Check status: `databricks pipelines get --pipeline-id <id> --profile <profile>`
Pipeline API Reference
Detailed reference guides for each pipeline API. Read the relevant guide before writing pipeline code.
- Write Spark Declarative Pipelines — Core syntax and rules (Python, SQL)
- Streaming Tables — Continuous data stream processing (Python, SQL)
- Materialized Views — Physically stored query results with incremental refresh (Python, SQL)
- Views — Reusable query logic published to Unity Catalog (SQL)
- Temporary Views — Pipeline-private views (Python, SQL)
- Auto Loader — Incrementally ingest files from cloud storage (Python, SQL)
- Auto CDC — Process Change Data Capture feeds, SCD Type 1 & 2 (Python, SQL)
- Expectations — Define and enforce data quality constraints (Python, SQL)
- Sinks — Write to Kafka, Event Hubs, external Delta tables (Python)
- ForEachBatch Sinks — Custom streaming sink with per-batch Python logic (Python)