databricks-jobs


Databricks Lakeflow Jobs

Overview

Databricks Jobs orchestrate data workflows with multi-task DAGs, flexible triggers, and comprehensive monitoring. Jobs support diverse task types and can be managed via Python SDK, CLI, or Asset Bundles.

Reference Files

| Use Case | Reference File |
| --- | --- |
| Configure task types (notebook, Python, SQL, dbt, etc.) | task-types.md |
| Set up triggers and schedules | triggers-schedules.md |
| Configure notifications and health monitoring | notifications-monitoring.md |
| Complete working examples | examples.md |

Quick Start

Python SDK

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import Task, NotebookTask, Source

w = WorkspaceClient()

job = w.jobs.create(
    name="my-etl-job",
    tasks=[
        Task(
            task_key="extract",
            notebook_task=NotebookTask(
                notebook_path="/Workspace/Users/user@example.com/extract",
                source=Source.WORKSPACE
            )
        )
    ]
)
print(f"Created job: {job.job_id}")
```

CLI

```bash
databricks jobs create --json '{
  "name": "my-etl-job",
  "tasks": [{
    "task_key": "extract",
    "notebook_task": {
      "notebook_path": "/Workspace/Users/user@example.com/extract",
      "source": "WORKSPACE"
    }
  }]
}'
```

Asset Bundles (DABs)

```yaml
# resources/jobs.yml
resources:
  jobs:
    my_etl_job:
      name: "[${bundle.target}] My ETL Job"
      tasks:
        - task_key: extract
          notebook_task:
            notebook_path: ../src/notebooks/extract.py
```

Core Concepts

Multi-Task Workflows

Jobs support DAG-based task dependencies:

```yaml
tasks:
  - task_key: extract
    notebook_task:
      notebook_path: ../src/extract.py

  - task_key: transform
    depends_on:
      - task_key: extract
    notebook_task:
      notebook_path: ../src/transform.py

  - task_key: load
    depends_on:
      - task_key: transform
    run_if: ALL_SUCCESS  # Only run if all dependencies succeed
    notebook_task:
      notebook_path: ../src/load.py
```

`run_if` conditions:
  • ALL_SUCCESS (default) - Run when all dependencies succeed
  • ALL_DONE - Run when all dependencies complete (success or failure)
  • AT_LEAST_ONE_SUCCESS - Run when at least one dependency succeeds
  • NONE_FAILED - Run when no dependencies failed
  • ALL_FAILED - Run when all dependencies failed
  • AT_LEAST_ONE_FAILED - Run when at least one dependency failed
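The `depends_on` edges form a directed acyclic graph, so every valid run order is a topological sort of the task keys. As an illustration only (this is not Databricks internals), a minimal sketch of deriving an execution order from a job's task list using the standard library:

```python
from graphlib import TopologicalSorter

def execution_order(tasks):
    """Return a valid run order for a list of job-task dicts,
    using each task's depends_on edges (raises CycleError on cycles)."""
    graph = {
        t["task_key"]: {d["task_key"] for d in t.get("depends_on", [])}
        for t in tasks
    }
    return list(TopologicalSorter(graph).static_order())

# Task list order doesn't matter; only the depends_on edges do.
tasks = [
    {"task_key": "load", "depends_on": [{"task_key": "transform"}]},
    {"task_key": "extract"},
    {"task_key": "transform", "depends_on": [{"task_key": "extract"}]},
]
print(execution_order(tasks))  # → ['extract', 'transform', 'load']
```

This is also a quick way to sanity-check a bundle's task graph for cycles before deploying: `TopologicalSorter` raises `CycleError` if two tasks depend on each other.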

Task Types Summary

| Task Type | Use Case | Reference |
| --- | --- | --- |
| notebook_task | Run notebooks | task-types.md#notebook-task |
| spark_python_task | Run Python scripts | task-types.md#spark-python-task |
| python_wheel_task | Run Python wheels | task-types.md#python-wheel-task |
| sql_task | Run SQL queries/files | task-types.md#sql-task |
| dbt_task | Run dbt projects | task-types.md#dbt-task |
| pipeline_task | Trigger DLT/SDP pipelines | task-types.md#pipeline-task |
| spark_jar_task | Run Spark JARs | task-types.md#spark-jar-task |
| run_job_task | Trigger other jobs | task-types.md#run-job-task |
| for_each_task | Loop over inputs | task-types.md#for-each-task |

Trigger Types Summary

| Trigger Type | Use Case | Reference |
| --- | --- | --- |
| schedule | Cron-based scheduling | triggers-schedules.md#cron-schedule |
| trigger.periodic | Interval-based | triggers-schedules.md#periodic-trigger |
| trigger.file_arrival | File arrival events | triggers-schedules.md#file-arrival-trigger |
| trigger.table_update | Table change events | triggers-schedules.md#table-update-trigger |
| continuous | Always-running jobs | triggers-schedules.md#continuous-jobs |
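For orientation, a hedged sketch of what two of these trigger configurations look like inside a job's JSON settings. The cron expression and bucket path are made-up examples; see the linked reference files for the authoritative field lists:

```python
# Cron schedule: run daily at 06:00 UTC.
# Databricks uses Quartz cron syntax (six fields: sec min hour day month dow).
schedule = {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",  # a paused schedule never fires
}

# File arrival trigger: fire when new files land in a storage location.
# The bucket URL below is a hypothetical placeholder.
file_trigger = {
    "file_arrival": {"url": "s3://my-bucket/landing/"},
}
```

A job takes one trigger configuration at a time (`schedule`, `trigger`, or `continuous`), so these are alternatives, not a combination.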

Compute Configuration

Job Clusters (Recommended)

Define reusable cluster configurations:

```yaml
job_clusters:
  - job_cluster_key: shared_cluster
    new_cluster:
      spark_version: "15.4.x-scala2.12"
      node_type_id: "i3.xlarge"
      num_workers: 2
      spark_conf:
        spark.speculation: "true"

tasks:
  - task_key: my_task
    job_cluster_key: shared_cluster
    notebook_task:
      notebook_path: ../src/notebook.py
```

Autoscaling Clusters

```yaml
new_cluster:
  spark_version: "15.4.x-scala2.12"
  node_type_id: "i3.xlarge"
  autoscale:
    min_workers: 2
    max_workers: 8
```

Existing Cluster

```yaml
tasks:
  - task_key: my_task
    existing_cluster_id: "0123-456789-abcdef12"
    notebook_task:
      notebook_path: ../src/notebook.py
```

Serverless Compute

For notebook and Python tasks, omit cluster configuration to use serverless:

```yaml
tasks:
  - task_key: serverless_task
    notebook_task:
      notebook_path: ../src/notebook.py
    # No cluster config = serverless
```

Job Parameters

Define Parameters

```yaml
parameters:
  - name: env
    default: "dev"
  - name: date
    default: "{{start_date}}"  # Dynamic value reference
```

Access in Notebook

```python
# In notebook
env = dbutils.widgets.get("env")
date = dbutils.widgets.get("date")
```

Pass to Tasks

```yaml
tasks:
  - task_key: my_task
    notebook_task:
      notebook_path: ../src/notebook.py
      base_parameters:
        env: "{{job.parameters.env}}"
        custom_param: "value"
```
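To illustrate the reference syntax, here is a toy resolver for `{{job.parameters.*}}` placeholders. This is purely illustrative; the real substitution happens server-side when Databricks starts the run:

```python
import re

def resolve_params(base_parameters, job_parameters):
    """Substitute {{job.parameters.<name>}} references with job-level values.
    Unknown references are left untouched, mimicking a literal pass-through."""
    def sub(value):
        return re.sub(
            r"\{\{job\.parameters\.(\w+)\}\}",
            lambda m: str(job_parameters.get(m.group(1), m.group(0))),
            value,
        )
    return {k: sub(v) for k, v in base_parameters.items()}

base = {"env": "{{job.parameters.env}}", "custom_param": "value"}
print(resolve_params(base, {"env": "prod"}))
# → {'env': 'prod', 'custom_param': 'value'}
```

The point to take away: `base_parameters` values are plain strings until run time, so a typo in a `{{...}}` reference fails silently rather than erroring at deploy time.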

Common Operations

Python SDK Operations

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List jobs
jobs = w.jobs.list()

# Get job details
job = w.jobs.get(job_id=12345)

# Run job now
run = w.jobs.run_now(job_id=12345)

# Run with parameters
run = w.jobs.run_now(
    job_id=12345,
    job_parameters={"env": "prod", "date": "2024-01-15"}
)

# Cancel run
w.jobs.cancel_run(run_id=run.run_id)

# Delete job
w.jobs.delete(job_id=12345)
```

CLI Operations

```bash
# List jobs
databricks jobs list

# Get job details
databricks jobs get 12345

# Run job
databricks jobs run-now 12345

# Run with parameters
databricks jobs run-now 12345 --job-params '{"env": "prod"}'

# Cancel run
databricks jobs cancel-run 67890

# Delete job
databricks jobs delete 12345
```

Asset Bundle Operations

```bash
# Validate configuration
databricks bundle validate

# Deploy job
databricks bundle deploy

# Run job
databricks bundle run my_job_resource_key

# Deploy to specific target
databricks bundle deploy -t prod

# Destroy resources
databricks bundle destroy
```

Permissions (DABs)

```yaml
resources:
  jobs:
    my_job:
      name: "My Job"
      permissions:
        - level: CAN_VIEW
          group_name: "data-analysts"
        - level: CAN_MANAGE_RUN
          group_name: "data-engineers"
        - level: CAN_MANAGE
          user_name: "admin@example.com"
```

Permission levels:
  • CAN_VIEW - View job and run history
  • CAN_MANAGE_RUN - View, trigger, and cancel runs
  • CAN_MANAGE - Full control including edit and delete
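The same grants can be applied outside of bundles through the Databricks permissions REST API. A hedged sketch of the request payload is below; the endpoint shape follows the general workspace permissions API, and the group and user names are illustrative placeholders, so verify against your workspace's API documentation before relying on it:

```python
# Sketch of the body for: PUT /api/2.0/permissions/jobs/{job_id}
# (PUT replaces the job's access control list; PATCH merges into it.)
acl_payload = {
    "access_control_list": [
        {"group_name": "data-analysts", "permission_level": "CAN_VIEW"},
        {"group_name": "data-engineers", "permission_level": "CAN_MANAGE_RUN"},
        {"user_name": "admin@example.com", "permission_level": "CAN_MANAGE"},
    ]
}
```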

Common Issues

| Issue | Solution |
| --- | --- |
| Job cluster startup slow | Use job clusters with `job_cluster_key` for reuse across tasks |
| Task dependencies not working | Verify `task_key` references match exactly in `depends_on` |
| Schedule not triggering | Check `pause_status: UNPAUSED` and valid timezone |
| File arrival not detecting | Ensure the path has proper permissions and uses a cloud storage URL |
| Table update trigger missing events | Verify Unity Catalog table and proper grants |
| Parameter not accessible | Use `dbutils.widgets.get()` in notebooks |
| "admins" group error | Cannot modify admins group permissions on jobs |
| Serverless task fails | Ensure the task type supports serverless (notebook, Python) |
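The "task dependencies not working" row usually comes down to a `task_key` typo. A small client-side check, sketched here as a standalone helper (not part of the Databricks CLI or SDK), can catch this before deploying:

```python
def undefined_dependencies(tasks):
    """Return (task_key, missing_dep) pairs where a depends_on entry
    references a task_key that is not defined anywhere in the job."""
    defined = {t["task_key"] for t in tasks}
    return [
        (t["task_key"], d["task_key"])
        for t in tasks
        for d in t.get("depends_on", [])
        if d["task_key"] not in defined
    ]

tasks = [
    {"task_key": "extract"},
    # Typo: "extrct" does not match any defined task_key
    {"task_key": "transform", "depends_on": [{"task_key": "extrct"}]},
]
print(undefined_dependencies(tasks))  # → [('transform', 'extrct')]
```

Running a check like this over the `tasks` list parsed from a bundle's YAML (or a job JSON payload) turns a confusing runtime failure into an immediate, named error.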

Related Skills

  • asset-bundles - Deploy jobs via Databricks Asset Bundles
  • spark-declarative-pipelines - Configure pipelines triggered by jobs

Resources