databricks-jobs


Databricks Lakeflow Jobs

Overview

Databricks Jobs orchestrate data workflows with multi-task DAGs, flexible triggers, and comprehensive monitoring. Jobs support diverse task types and can be managed via Python SDK, CLI, or Asset Bundles.

Reference Files

| Use Case | Reference File |
| --- | --- |
| Configure task types (notebook, Python, SQL, dbt, etc.) | task-types.md |
| Set up triggers and schedules | triggers-schedules.md |
| Configure notifications and health monitoring | notifications-monitoring.md |
| Complete working examples | examples.md |

Quick Start

Python SDK

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import Task, NotebookTask, Source

w = WorkspaceClient()

job = w.jobs.create(
    name="my-etl-job",
    tasks=[
        Task(
            task_key="extract",
            notebook_task=NotebookTask(
                notebook_path="/Workspace/Users/user@example.com/extract",
                source=Source.WORKSPACE
            )
        )
    ]
)
print(f"Created job: {job.job_id}")
```

CLI

```bash
databricks jobs create --json '{
  "name": "my-etl-job",
  "tasks": [{
    "task_key": "extract",
    "notebook_task": {
      "notebook_path": "/Workspace/Users/user@example.com/extract",
      "source": "WORKSPACE"
    }
  }]
}'
```

Asset Bundles (DABs)

```yaml
# resources/jobs.yml
resources:
  jobs:
    my_etl_job:
      name: "[${bundle.target}] My ETL Job"
      tasks:
        - task_key: extract
          notebook_task:
            notebook_path: ../src/notebooks/extract.py
```

Core Concepts

Multi-Task Workflows

Jobs support DAG-based task dependencies:

```yaml
tasks:
  - task_key: extract
    notebook_task:
      notebook_path: ../src/extract.py

  - task_key: transform
    depends_on:
      - task_key: extract
    notebook_task:
      notebook_path: ../src/transform.py

  - task_key: load
    depends_on:
      - task_key: transform
    run_if: ALL_SUCCESS  # Only run if all dependencies succeed
    notebook_task:
      notebook_path: ../src/load.py
```

`run_if` conditions:
  • ALL_SUCCESS (default) - Run when all dependencies succeed
  • ALL_DONE - Run when all dependencies complete (success or failure)
  • AT_LEAST_ONE_SUCCESS - Run when at least one dependency succeeds
  • NONE_FAILED - Run when no dependencies failed
  • ALL_FAILED - Run when all dependencies failed
  • AT_LEAST_ONE_FAILED - Run when at least one dependency failed
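The `depends_on` edges form a directed acyclic graph, so every valid run order is a topological sort of the task keys. As an illustration only (this is not Databricks internals), a minimal sketch of deriving an execution order from a job's task list using the standard library:

```python
from graphlib import TopologicalSorter

def execution_order(tasks):
    """Return a valid run order for a list of job-task dicts,
    using each task's depends_on edges (raises CycleError on cycles)."""
    graph = {
        t["task_key"]: {d["task_key"] for d in t.get("depends_on", [])}
        for t in tasks
    }
    return list(TopologicalSorter(graph).static_order())

# Task list order doesn't matter; only the depends_on edges do.
tasks = [
    {"task_key": "load", "depends_on": [{"task_key": "transform"}]},
    {"task_key": "extract"},
    {"task_key": "transform", "depends_on": [{"task_key": "extract"}]},
]
print(execution_order(tasks))  # → ['extract', 'transform', 'load']
```

This is also a quick way to sanity-check a bundle's task graph for cycles before deploying: `TopologicalSorter` raises `CycleError` if two tasks depend on each other.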

Task Types Summary

| Task Type | Use Case | Reference |
| --- | --- | --- |
| notebook_task | Run notebooks | task-types.md#notebook-task |
| spark_python_task | Run Python scripts | task-types.md#spark-python-task |
| python_wheel_task | Run Python wheels | task-types.md#python-wheel-task |
| sql_task | Run SQL queries/files | task-types.md#sql-task |
| dbt_task | Run dbt projects | task-types.md#dbt-task |
| pipeline_task | Trigger DLT/SDP pipelines | task-types.md#pipeline-task |
| spark_jar_task | Run Spark JARs | task-types.md#spark-jar-task |
| run_job_task | Trigger other jobs | task-types.md#run-job-task |
| for_each_task | Loop over inputs | task-types.md#for-each-task |

Trigger Types Summary

| Trigger Type | Use Case | Reference |
| --- | --- | --- |
| schedule | Cron-based scheduling | triggers-schedules.md#cron-schedule |
| trigger.periodic | Interval-based | triggers-schedules.md#periodic-trigger |
| trigger.file_arrival | File arrival events | triggers-schedules.md#file-arrival-trigger |
| trigger.table_update | Table change events | triggers-schedules.md#table-update-trigger |
| continuous | Always-running jobs | triggers-schedules.md#continuous-jobs |
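For orientation, a hedged sketch of what two of these trigger configurations look like inside a job's JSON settings. The cron expression and bucket path are made-up examples; see the linked reference files for the authoritative field lists:

```python
# Cron schedule: run daily at 06:00 UTC.
# Databricks uses Quartz cron syntax (six fields: sec min hour day month dow).
schedule = {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",  # a paused schedule never fires
}

# File arrival trigger: fire when new files land in a storage location.
# The bucket URL below is a hypothetical placeholder.
file_trigger = {
    "file_arrival": {"url": "s3://my-bucket/landing/"},
}
```

A job takes one trigger configuration at a time (`schedule`, `trigger`, or `continuous`), so these are alternatives, not a combination.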

Compute Configuration

Job Clusters (Recommended)

Define reusable cluster configurations:

```yaml
job_clusters:
  - job_cluster_key: shared_cluster
    new_cluster:
      spark_version: "15.4.x-scala2.12"
      node_type_id: "i3.xlarge"
      num_workers: 2
      spark_conf:
        spark.speculation: "true"

tasks:
  - task_key: my_task
    job_cluster_key: shared_cluster
    notebook_task:
      notebook_path: ../src/notebook.py
```

Autoscaling Clusters

```yaml
new_cluster:
  spark_version: "15.4.x-scala2.12"
  node_type_id: "i3.xlarge"
  autoscale:
    min_workers: 2
    max_workers: 8
```

Existing Cluster

```yaml
tasks:
  - task_key: my_task
    existing_cluster_id: "0123-456789-abcdef12"
    notebook_task:
      notebook_path: ../src/notebook.py
```

Serverless Compute

For notebook and Python tasks, omit cluster configuration to use serverless:

```yaml
tasks:
  - task_key: serverless_task
    notebook_task:
      notebook_path: ../src/notebook.py
    # No cluster config = serverless
```

Job Parameters

Define Parameters

```yaml
parameters:
  - name: env
    default: "dev"
  - name: date
    default: "{{start_date}}"  # Dynamic value reference
```

Access in Notebook

```python
# In notebook
env = dbutils.widgets.get("env")
date = dbutils.widgets.get("date")
```

Pass to Tasks

```yaml
tasks:
  - task_key: my_task
    notebook_task:
      notebook_path: ../src/notebook.py
      base_parameters:
        env: "{{job.parameters.env}}"
        custom_param: "value"
```
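To illustrate the reference syntax, here is a toy resolver for `{{job.parameters.*}}` placeholders. This is purely illustrative; the real substitution happens server-side when Databricks starts the run:

```python
import re

def resolve_params(base_parameters, job_parameters):
    """Substitute {{job.parameters.<name>}} references with job-level values.
    Unknown references are left untouched, mimicking a literal pass-through."""
    def sub(value):
        return re.sub(
            r"\{\{job\.parameters\.(\w+)\}\}",
            lambda m: str(job_parameters.get(m.group(1), m.group(0))),
            value,
        )
    return {k: sub(v) for k, v in base_parameters.items()}

base = {"env": "{{job.parameters.env}}", "custom_param": "value"}
print(resolve_params(base, {"env": "prod"}))
# → {'env': 'prod', 'custom_param': 'value'}
```

The point to take away: `base_parameters` values are plain strings until run time, so a typo in a `{{...}}` reference fails silently rather than erroring at deploy time.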

Common Operations

Python SDK Operations

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List jobs
jobs = w.jobs.list()

# Get job details
job = w.jobs.get(job_id=12345)

# Run job now
run = w.jobs.run_now(job_id=12345)

# Run with parameters
run = w.jobs.run_now(
    job_id=12345,
    job_parameters={"env": "prod", "date": "2024-01-15"}
)

# Cancel run
w.jobs.cancel_run(run_id=run.run_id)

# Delete job
w.jobs.delete(job_id=12345)
```

CLI Operations

```bash
# List jobs
databricks jobs list

# Get job details
databricks jobs get 12345

# Run job
databricks jobs run-now 12345

# Run with parameters
databricks jobs run-now 12345 --job-params '{"env": "prod"}'

# Cancel run
databricks jobs cancel-run 67890

# Delete job
databricks jobs delete 12345
```

Asset Bundle Operations

```bash
# Validate configuration
databricks bundle validate

# Deploy job
databricks bundle deploy

# Run job
databricks bundle run my_job_resource_key

# Deploy to specific target
databricks bundle deploy -t prod

# Destroy resources
databricks bundle destroy
```

Permissions (DABs)

```yaml
resources:
  jobs:
    my_job:
      name: "My Job"
      permissions:
        - level: CAN_VIEW
          group_name: "data-analysts"
        - level: CAN_MANAGE_RUN
          group_name: "data-engineers"
        - level: CAN_MANAGE
          user_name: "admin@example.com"
```

Permission levels:
  • CAN_VIEW - View job and run history
  • CAN_MANAGE_RUN - View, trigger, and cancel runs
  • CAN_MANAGE - Full control including edit and delete
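The same grants can be applied outside of bundles through the Databricks permissions REST API. A hedged sketch of the request payload is below; the endpoint shape follows the general workspace permissions API, and the group and user names are illustrative placeholders, so verify against your workspace's API documentation before relying on it:

```python
# Sketch of the body for: PUT /api/2.0/permissions/jobs/{job_id}
# (PUT replaces the job's access control list; PATCH merges into it.)
acl_payload = {
    "access_control_list": [
        {"group_name": "data-analysts", "permission_level": "CAN_VIEW"},
        {"group_name": "data-engineers", "permission_level": "CAN_MANAGE_RUN"},
        {"user_name": "admin@example.com", "permission_level": "CAN_MANAGE"},
    ]
}
```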

Common Issues

| Issue | Solution |
| --- | --- |
| Job cluster startup slow | Use job clusters with `job_cluster_key` for reuse across tasks |
| Task dependencies not working | Verify `task_key` references match exactly in `depends_on` |
| Schedule not triggering | Check `pause_status: UNPAUSED` and valid timezone |
| File arrival not detecting | Ensure the path has proper permissions and uses a cloud storage URL |
| Table update trigger missing events | Verify Unity Catalog table and proper grants |
| Parameter not accessible | Use `dbutils.widgets.get()` in notebooks |
| "admins" group error | Cannot modify admins group permissions on jobs |
| Serverless task fails | Ensure the task type supports serverless (notebook, Python) |
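The "task dependencies not working" row usually comes down to a `task_key` typo. A small client-side check, sketched here as a standalone helper (not part of the Databricks CLI or SDK), can catch this before deploying:

```python
def undefined_dependencies(tasks):
    """Return (task_key, missing_dep) pairs where a depends_on entry
    references a task_key that is not defined anywhere in the job."""
    defined = {t["task_key"] for t in tasks}
    return [
        (t["task_key"], d["task_key"])
        for t in tasks
        for d in t.get("depends_on", [])
        if d["task_key"] not in defined
    ]

tasks = [
    {"task_key": "extract"},
    # Typo: "extrct" does not match any defined task_key
    {"task_key": "transform", "depends_on": [{"task_key": "extrct"}]},
]
print(undefined_dependencies(tasks))  # → [('transform', 'extrct')]
```

Running a check like this over the `tasks` list parsed from a bundle's YAML (or a job JSON payload) turns a confusing runtime failure into an immediate, named error.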

Related Skills

  • asset-bundles - Deploy jobs via Databricks Asset Bundles
  • spark-declarative-pipelines - Configure pipelines triggered by jobs

Resources