Databricks Deploy Integration
Overview

Deploy Databricks workloads using Asset Bundles for environment management.
Prerequisites

- Databricks CLI v0.200+
- Asset Bundle project structure
- Workspace access for target environments
Instructions
Step 1: Project Structure
my-databricks-project/
├── databricks.yml          # Main bundle configuration
├── resources/
│   ├── jobs.yml            # Job definitions
│   ├── pipelines.yml       # DLT pipeline definitions
│   └── clusters.yml        # Cluster policies
├── src/
│   ├── notebooks/          # Databricks notebooks
│   │   ├── bronze/
│   │   ├── silver/
│   │   └── gold/
│   └── python/             # Python modules
│       └── etl/
├── tests/
│   ├── unit/
│   └── integration/
├── fixtures/               # Test data
└── conf/
    ├── dev.yml             # Dev overrides
    ├── staging.yml         # Staging overrides
    └── prod.yml            # Production overrides
Step 2: Main Bundle Configuration
databricks.yml:

```yaml
bundle:
  name: data-platform

variables:
  catalog:
    description: Unity Catalog name
    default: dev_catalog
  warehouse_id:
    description: SQL Warehouse ID
    default: ""

include:
  - resources/*.yml

workspace:
  host: ${DATABRICKS_HOST}

artifacts:
  etl_wheel:
    type: whl
    path: ./src/python
    build: poetry build

targets:
  dev:
    default: true
    mode: development
    variables:
      catalog: dev_catalog
    workspace:
      root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/dev

  staging:
    mode: development
    variables:
      catalog: staging_catalog
    workspace:
      root_path: /Shared/.bundle/${bundle.name}/staging
    run_as:
      service_principal_name: staging-sp

  prod:
    mode: production
    variables:
      catalog: prod_catalog
      warehouse_id: "abc123def456"
    workspace:
      root_path: /Shared/.bundle/${bundle.name}/prod
    run_as:
      service_principal_name: prod-sp
    permissions:
      - level: CAN_VIEW
        group_name: data-consumers
      - level: CAN_MANAGE_RUN
        group_name: data-engineers
      - level: CAN_MANAGE
        service_principal_name: prod-sp
```
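Target-level variables override the bundle-level defaults, and `${var.name}` references resolve against the merged result. Here is a minimal, hypothetical Python sketch of that merge-and-substitute behavior, for intuition only (it is not the CLI's actual implementation):

```python
import re


def effective_variables(defaults: dict, target_overrides: dict) -> dict:
    """Target-level variables win over bundle-level defaults."""
    return {**defaults, **target_overrides}


def resolve_variables(template: str, variables: dict) -> str:
    """Substitute ${var.name} placeholders, a simplified model of bundle interpolation."""
    return re.sub(r"\$\{var\.(\w+)\}", lambda m: str(variables[m.group(1)]), template)


defaults = {"catalog": "dev_catalog", "warehouse_id": ""}
prod = effective_variables(defaults, {"catalog": "prod_catalog", "warehouse_id": "abc123def456"})
print(resolve_variables("Main ETL pipeline for ${var.catalog}", prod))
# prints: Main ETL pipeline for prod_catalog
```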
Step 3: Job Definitions
resources/jobs.yml:

```yaml
resources:
  jobs:
    etl_pipeline:
      name: "${bundle.name}-etl-${bundle.target}"
      description: "Main ETL pipeline for ${var.catalog}"
      tags:
        environment: ${bundle.target}
        team: data-engineering
        managed_by: asset_bundles
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "America/New_York"
        pause_status: ${bundle.target == "dev" ? "PAUSED" : "UNPAUSED"}
      email_notifications:
        on_failure:
          - oncall@company.com
        no_alert_for_skipped_runs: true
      parameters:
        - name: catalog
          default: ${var.catalog}
        - name: run_date
          default: ""
      tasks:
        - task_key: bronze_ingest
          job_cluster_key: etl_cluster
          notebook_task:
            notebook_path: ../src/notebooks/bronze/ingest.py
            base_parameters:
              catalog: "{{job.parameters.catalog}}"
              run_date: "{{job.parameters.run_date}}"
        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingest
          job_cluster_key: etl_cluster
          notebook_task:
            notebook_path: ../src/notebooks/silver/transform.py
        - task_key: gold_aggregate
          depends_on:
            - task_key: silver_transform
          job_cluster_key: etl_cluster
          python_wheel_task:
            package_name: etl
            entry_point: gold_aggregate
          libraries:
            - whl: ../artifacts/etl_wheel/*.whl
        - task_key: data_quality
          depends_on:
            - task_key: gold_aggregate
          job_cluster_key: etl_cluster
          notebook_task:
            notebook_path: ../src/notebooks/quality/validate.py
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: ${bundle.target == "prod" ? "Standard_DS4_v2" : "Standard_DS3_v2"}
            # autoscale and num_workers are mutually exclusive; use autoscale here
            autoscale:
              min_workers: ${bundle.target == "prod" ? 2 : 1}
              max_workers: ${bundle.target == "prod" ? 10 : 2}
            spark_conf:
              spark.databricks.delta.optimizeWrite.enabled: "true"
              spark.databricks.delta.autoCompact.enabled: "true"
            custom_tags:
              ResourceClass: ${bundle.target == "prod" ? "production" : "development"}
```
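The depends_on entries above define a linear DAG, so the four tasks always run one after another. A small illustrative sketch of how that execution order falls out of the declared dependencies:

```python
from graphlib import TopologicalSorter

# Task dependencies as declared in resources/jobs.yml (task -> upstream tasks)
deps = {
    "bronze_ingest": set(),
    "silver_transform": {"bronze_ingest"},
    "gold_aggregate": {"silver_transform"},
    "data_quality": {"gold_aggregate"},
}

# A valid execution order: every task appears after its dependencies
order = list(TopologicalSorter(deps).static_order())
print(order)
# prints: ['bronze_ingest', 'silver_transform', 'gold_aggregate', 'data_quality']
```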
Step 4: Deployment Commands
```bash
# Validate the bundle
databricks bundle validate
databricks bundle validate -t staging
databricks bundle validate -t prod

# Deploy to development
databricks bundle deploy -t dev

# Deploy to staging
databricks bundle deploy -t staging

# Deploy to production (with confirmation)
databricks bundle deploy -t prod

# Deploy specific resources
databricks bundle deploy -t staging --resource etl_pipeline

# Destroy resources (cleanup)
databricks bundle destroy -t dev --auto-approve
```
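These commands slot naturally into CI. A hypothetical GitHub Actions sketch is below; the workflow layout, secret names, and the `databricks/setup-cli` setup step are assumptions for illustration, not part of the original project:

```yaml
# .github/workflows/deploy.yml (hypothetical)
name: deploy-bundle
on:
  push:
    branches: [main]
jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle validate -t staging
      - run: databricks bundle deploy -t staging
```

Gating the prod target behind a manual approval environment is a common extension of this pattern.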
Step 5: Run Management
```bash
# Run a job manually
databricks bundle run -t staging etl_pipeline

# Run with parameters
databricks bundle run -t staging etl_pipeline \
  --params '{"catalog": "test_catalog", "run_date": "2024-01-15"}'

# Check deployment status
databricks bundle summary -t prod

# View deployed resources
databricks bundle summary -t prod --output json | jq '.resources.jobs'
```
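The JSON summary can also be post-processed without jq. A short Python sketch; the payload below is a hand-made stand-in for real `databricks bundle summary --output json` output, whose exact shape may differ:

```python
import json

# Hypothetical summary payload, shaped like the jq query above expects
summary_json = """
{
  "resources": {
    "jobs": {
      "etl_pipeline": {"name": "data-platform-etl-prod", "id": "123"}
    }
  }
}
"""

summary = json.loads(summary_json)
for key, job in summary["resources"]["jobs"].items():
    # Print each deployed job's resource key and rendered name
    print(f"{key}: {job['name']}")
# prints: etl_pipeline: data-platform-etl-prod
```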
Step 6: Blue-Green Deployment
scripts/blue_green_deploy.py:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import JobSettings, RunResultState


def blue_green_deploy(
    w: WorkspaceClient,
    job_name: str,
    new_config: dict,
    rollback_on_failure: bool = True,
) -> dict:
    """
    Deploy a job using a blue-green strategy.

    1. Create the new job version
    2. Run validation
    3. Switch traffic by swapping names
    4. Remove the old version (or roll back)
    """
    # Find the existing ("blue") job, if any
    jobs = [j for j in w.jobs.list() if j.settings.name == job_name]
    old_job = jobs[0] if jobs else None

    # Create the new ("green") job under a temporary name
    new_config["name"] = f"{job_name}-new"
    new_job = w.jobs.create(**new_config)

    try:
        # Trigger a validation run and block until it reaches a terminal state
        result = w.jobs.run_now(job_id=new_job.job_id).result()
        if result.state.result_state != RunResultState.SUCCESS:
            raise RuntimeError(f"Validation failed: {result.state.state_message}")

        # Success: swap names so the new job takes over
        if old_job:
            w.jobs.update(
                job_id=old_job.job_id,
                new_settings=JobSettings(name=f"{job_name}-old"),
            )
        w.jobs.update(
            job_id=new_job.job_id,
            new_settings=JobSettings(name=job_name),
        )

        # Clean up the old job
        if old_job:
            w.jobs.delete(job_id=old_job.job_id)

        return {"status": "SUCCESS", "job_id": new_job.job_id}
    except Exception:
        if rollback_on_failure:
            # Remove the failed new job; the old one keeps serving
            w.jobs.delete(job_id=new_job.job_id)
        raise
```
Output

- Deployed Asset Bundle
- Jobs created in the target workspace
- Environment-specific configurations applied
Error Handling

| Issue | Cause | Solution |
|---|---|---|
| Permission denied | Missing run_as permissions | Configure a service principal |
| Resource conflict | Name collision | Use unique names with a target suffix |
| Artifact not found | Build failed | Run the artifact build (e.g. `poetry build`) before deploying |
| Validation error | Invalid YAML | Check the bundle syntax with `databricks bundle validate` |
Examples

Environment Comparison

```bash
# Compare configurations across environments
databricks bundle summary -t dev --output json > dev.json
databricks bundle summary -t prod --output json > prod.json
diff <(jq -S . dev.json) <(jq -S . prod.json)
```
Rollback Procedure
```bash
# Quick rollback using git
git checkout HEAD~1 -- databricks.yml resources/
databricks bundle deploy -t prod --force
```
Resources

Next Steps

For webhooks and events, see databricks-webhooks-events.