databricks-serverless-migration


Serverless Compute Migration


FIRST: Use the parent `databricks-core` skill for CLI basics, authentication, and profile selection.

Analyze existing Databricks code for serverless compute compatibility and guide migration from classic clusters. The skill follows a 4-step migration lifecycle: Ingest the workload → Analyze for compatibility → Test via A/B comparison → Validate and iterate.

When to Use This Skill


  • Migrating notebooks, jobs, or pipelines from classic compute to serverless
  • Checking if existing code is serverless-compatible
  • Writing new code that targets serverless compute
  • Troubleshooting serverless-specific errors after migration
  • Choosing between Performance-Optimized and Standard mode

Understanding Migration Blockers


Migration blockers fall into three categories. Focus your effort on category 2 — that's where this skill helps most.
| Category | Description | Action |
| --- | --- | --- |
| 1. Feature expanding | Databricks is actively expanding support (e.g., SparkML, custom JDBC) | Use the workaround now and revisit later |
| 2. Code/config change needed | Your code uses patterns that need updating for serverless (e.g., RDDs, DBFS, streaming triggers) | This skill helps here — it detects these patterns and provides fixes |
| 3. Classic-only | Workload requires capabilities not available on serverless (e.g., root OS access, R language) | Keep on classic compute |

Decision Tree: Is My Workload Ready?


Workload → Check language
├── R code → Category 3: keep on classic
├── Scala notebook cells → Category 2: port to PySpark/SQL or compile as JAR
├── Python / SQL → Continue
    ├── Uses RDD APIs? → Category 2: rewrite to DataFrame API (see fixes below)
    ├── Uses DBFS paths? → Category 2: migrate to UC Volumes
    ├── Uses Hive Metastore? → Category 2: migrate to Unity Catalog (or use HMS Federation)
    ├── Uses df.cache/persist? → Category 1: remove and materialize to Delta (native support coming soon)
    ├── Uses streaming?
    │   ├── ProcessingTime trigger → Category 2: use AvailableNow or migrate to SDP
    │   ├── Continuous trigger → Category 2: use SDP continuous mode
    │   ├── No trigger specified → Category 2: add explicit .trigger(availableNow=True)
    │   └── AvailableNow / Once → Ready ✓
    ├── Uses init scripts? → Category 2: use Environments
    ├── Uses VPC peering? → Category 2: use NCCs / Private Link
    ├── Uses unsupported Spark configs? → Category 2: remove (serverless auto-tunes)
    ├── Uses custom JDBC drivers? → Category 2: use Lakehouse Federation or built-in JDBC
    ├── Uses Docker containers? → Category 3: use Environments for libs, or keep on classic
    └── All clear → Ready for serverless ✓

Migration Workflow


Step 1: Ingest — Gather Workload Context


Confirm the migration target is serverless compute. This skill is purpose-built for classic → serverless migrations. The checks, fixes, and workflow all target the serverless compute architecture (Spark Connect, Environments, NCCs). If the user wants to upgrade between classic DBR versions instead, this skill does not apply — classic DBR upgrades have a different compatibility surface and should follow the standard DBR upgrade guide.
Collect the full picture of what needs to migrate to serverless:
  • Read the user's notebook/script files
  • Identify the classic cluster configuration (instance type, DBR version, Spark configs, init scripts, libraries)
  • Note the networking setup (VPC peering, instance profiles, mounts)
  • Understand the workload type: batch job, streaming, interactive notebook, pipeline
  • Determine the target: the output is always a serverless compute configuration, not a classic cluster with a newer DBR
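A minimal sketch of this context gathering with the Databricks Python SDK (the job ID is a placeholder, and the field access assumes a notebook-task job defined with job clusters):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # profile selection per the databricks-core skill
job = w.jobs.get(job_id=123)  # placeholder job ID

# Capture the classic cluster configuration that has no serverless equivalent.
for cluster in job.settings.job_clusters or []:
    spec = cluster.new_cluster
    print(cluster.job_cluster_key, spec.spark_version, spec.node_type_id,
          spec.num_workers, spec.spark_conf, spec.init_scripts)

# Collect the notebooks that Step 2 will scan.
for task in job.settings.tasks or []:
    if task.notebook_task:
        print("notebook to scan:", task.notebook_task.notebook_path)
```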

Step 2: Analyze — Scan for Serverless Readiness


Read notebooks before running them — do not rely on failed job runs to discover issues. A pre-run scan surfaces incompatibilities faster than iterating on error traces, and many serverless failures (hardcoded catalog references, init scripts, missing dependencies) are easy to spot statically but expensive to debug after a failed run.
Before creating or running any test job:
  1. Read every notebook and source file referenced by the job
  2. Scan for all hardcoded catalog/schema references (e.g., `spark.table("main.schema.table")`, `spark.sql("... FROM main...")`, `catalog = "main"`)
  3. Check for dependency patterns: init scripts, local wheel files, custom install functions, `%pip install` lines
  4. Locate any `requirements.txt` or equivalent and resolve the full dependency set
  5. Flag OS-level installs (`apt install`, `yum install`) for conversion or escalation
Scan the code for patterns that are incompatible with the serverless compute architecture. These checks are serverless-specific — most of these patterns work fine on classic compute regardless of DBR version. For each issue found, report:
  • Category: Which of the 3 blocker categories it falls into
  • Severity: Blocker (must fix for serverless) / Warning (should fix) / Info (awareness)
  • Pattern: What was detected and where
  • Fix: Specific remediation targeting serverless compute
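A minimal sketch of such a static scan, assuming one regex per pattern (the checks shown are a small illustrative subset of the tables below):

```python
import re

# Illustrative subset of the full pattern catalog below.
CHECKS = [
    ("rdd_parallelize", r"sc\.parallelize\(", "Blocker", "Rewrite with spark.createDataFrame(...)"),
    ("dbfs_path", r"dbfs:/|/dbfs/", "Blocker", "Migrate to /Volumes/... (UC Volumes)"),
    ("processing_time_trigger", r"trigger\(processingTime", "Blocker", "Use .trigger(availableNow=True)"),
    ("df_cache", r"\.cache\(\)|\.persist\(\)", "Warning", "Remove; materialize to Delta if expensive"),
]

def scan_source(path: str) -> list[dict]:
    """Return one finding per (line, pattern) hit, with location and suggested fix."""
    findings = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            for pattern_id, regex, severity, fix in CHECKS:
                if re.search(regex, line):
                    findings.append({"pattern": pattern_id, "severity": severity,
                                     "fix": fix, "location": f"{path}:{lineno}"})
    return findings
```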
Category A: Unsupported APIs

| Pattern | Severity | Fix |
| --- | --- | --- |
| `sc.parallelize(data)` | Blocker | `spark.createDataFrame([(x,) for x in data], ["value"])` |
| `rdd.map(fn)` | Blocker | `df.select(F.col("value") * 2)` or `df.withColumn(...)` |
| `rdd.filter(fn)` | Blocker | `df.filter(F.col("value") > 3)` |
| `rdd.reduce(fn)` | Blocker | `df.agg(F.sum("col")).collect()[0][0]` |
| `rdd.flatMap(fn)` | Blocker | `df.select(F.explode(F.split(col, " ")))` |
| `rdd.groupByKey()` | Blocker | `df.groupBy("key").agg(F.collect_list("value"))` |
| `rdd.mapPartitions(fn)` | Blocker | `df.groupBy(F.spark_partition_id()).applyInPandas(fn, schema)` |
| `sc.textFile(path)` | Blocker | `spark.read.text(path)` |
| `sc.wholeTextFiles(path)` | Blocker | `spark.read.format("binaryFile").load(path)` |
| `sc.broadcast(data)` | Blocker | `from pyspark.sql.functions import broadcast; df.join(broadcast(lookup_df), key)` |
| `sc.accumulator(init)` | Blocker | `df.agg(F.sum("col"))` or `df.count()` |
| `spark.sparkContext` | Blocker | Use `spark` (SparkSession) directly |
| `SparkContext.getOrCreate()` | Blocker | Not supported — raises `RuntimeError: Only remote Spark sessions using Databricks Connect are supported`. Replace with `spark.createDataFrame()` or `spark.range()` for data setup. |
| `sqlContext.sql(query)` | Blocker | `spark.sql(query)` |
| `sc.hadoopConfiguration.set(...)` | Blocker | Use UC external locations — no credential configs needed |
| `df.cache()` / `df.persist()` | Warning | Remove caching calls. For expensive intermediate results, materialize to a Delta table. Native support coming soon. |
| `df.checkpoint()` | Warning | Write to a Delta table instead |
| `spark.catalog.cacheTable(t)` / `CACHE TABLE` | Warning | Remove — not needed on serverless |
| `%scala` cells in notebook | Blocker | Port to PySpark/SQL or compile as a JAR for job tasks |
| `%r` cells in notebook | Blocker | No serverless equivalent — keep on classic or port to PySpark |
| Hive variable syntax `${var}` | Warning | Use `DECLARE VARIABLE` / `SET VARIABLE` (SQL) or Python f-strings |
| `CREATE GLOBAL TEMPORARY VIEW` | Blocker | Use `CREATE OR REPLACE TEMPORARY VIEW` — the `global_temp` database doesn't exist on serverless |
| `global_temp.` prefix in queries | Warning | Remove the prefix — session-scoped temp views are accessible without a qualifier |
Category B: Data Access

| Pattern | Severity | Fix |
| --- | --- | --- |
| `dbfs:/` or `/dbfs/` paths (persistent data) | Blocker | Replace with `/Volumes/<your_catalog>/schema/volume/path` |
| `dbfs:/tmp/`, `/dbfs/tmp/`, paths with `cache` / `scratch` / `temp` | Warning | Use `/tmp/` or `/local_disk0/tmp/` (local driver disk) — do not use Volumes for temp files due to performance |
| `file:///dbfs/` FUSE mount paths | Warning | Replace persistent paths with `/Volumes/...`; replace temp paths with `/local_disk0/tmp/` |
| `dbutils.fs.mount(...)` | Blocker | Create a UC external location + external volume |
| `hive_metastore.db.table` | Warning | Migrate to UC or use HMS Federation: `CREATE FOREIGN CATALOG ... USING CONNECTION hms_connection` |
| `CREATE DATABASE` / `CREATE SCHEMA` without `USE CATALOG` or a 3-level name | Blocker | Prepend `spark.sql("USE CATALOG <your_catalog>")` at notebook start before any CREATE statements. Detect the target catalog from existing table references, or ask the user. |
| IAM instance profile references | Warning | Use UC external locations + storage credentials |
| Hive SerDe tables | Blocker | Migrate to Delta tables in UC |
Category C: Streaming

| Pattern | Severity | Fix |
| --- | --- | --- |
| `.trigger(processingTime=...)` | Blocker | `.trigger(availableNow=True)` + set `maxFilesPerTrigger` or `maxBytesPerTrigger` to prevent OOM |
| `.trigger(continuous=...)` | Blocker | Migrate to SDP continuous mode |
| No `.trigger()` call on writeStream | Blocker | Must add `.trigger(availableNow=True)` — Spark defaults to `ProcessingTime("0 seconds")`, which is not supported |
| Kafka source | Info | Works with AvailableNow; use `maxOffsetsPerTrigger` to control batch size |
| Auto Loader | Info | Works; use `cloudFiles.maxFilesPerTrigger` (note the `cloudFiles.` prefix) |
Category D: Configuration

| Pattern | Severity | Fix |
| --- | --- | --- |
| Unsupported `spark.conf.set(...)` | Warning | Remove — only 6 configs are supported: `spark.sql.shuffle.partitions`, `spark.sql.session.timeZone`, `spark.sql.ansi.enabled`, `spark.sql.files.maxPartitionBytes`, `spark.sql.legacy.timeParserPolicy`, `spark.databricks.execution.timeout`. Serverless auto-tunes everything else. |
| Init scripts | Blocker | Use Environments: add dependencies via the notebook Environment panel or `requirements.txt`. Pin specific versions. |
| Cluster policies | Info | Use budget policies for cost attribution |
| Docker containers | Blocker | Use Environments for library management. Keep on classic only if Docker is needed for OS-level customization. |
| `%run ./relative/path` or `%run ../path` | Warning | Relative `%run` paths may not resolve correctly in serverless job tasks. Fix: (1) inline the referenced notebook's code if <500 lines (preferred), or (2) convert to `dbutils.notebook.run("<absolute_workspace_path>", timeout)` with an absolute path. Found in ~19% of repos. |
| `os.environ["VAR"]` (system/custom env vars) | Warning | Use `os.environ.get()` with a fallback, `spark.version` for Spark info, or `dbutils.widgets` for custom vars |
| `SET hivevar:` / `${hivevar:...}` (Hive variable substitution) | Blocker | Use SQL session variables: `DECLARE OR REPLACE VARIABLE name = value` (DBR 14.1+) |
| Environment variables (in init scripts) | Warning | Use `dbutils.widgets` or job parameters |
| Explicit executor count/memory configs | Info | Remove — serverless auto-scales and auto-tunes |
Category E: Libraries

| Pattern | Severity | Fix |
| --- | --- | --- |
| JAR libraries in notebooks | Blocker | Compile as a JAR job task (Scala 2.13, JDK 17, environment version 4+) |
| Maven coordinates | Blocker | Replace with PyPI packages in Environments |
| `%pip install` without version pins | Warning | Pin versions: `%pip install numpy==2.2.2 pandas==2.2.3` |
| Custom Spark data sources (v1/v2 JARs) | Blocker | Use Lakehouse Federation, Lakeflow Connect, or PySpark custom data sources |
| LZO format files | Blocker | Convert to Parquet or Delta |
Category F: Networking

| Pattern | Severity | Fix |
| --- | --- | --- |
| VPC peering configuration | Blocker | Create NCCs, get stable IPs, and allowlist them on resource firewalls. Same-region S3 access works without changes. |
| Direct S3/ADLS access without UC | Warning | Use UC external locations |
Category G: Sizing & Debugging
PatternSeverityFix
Large driver memory configsInfoServerless REPL default is 8GB (high-memory option for 16GB+ via Environments)
Spark UI referencesInfoUse Query Profile instead: click "See performance" under cell output

Required Output: Serverless Environment Specification


The migration output MUST include a Serverless Environment specification alongside the migrated code. Generate it by:
  1. Scanning all `import` statements and `%pip install` lines to detect required packages
  2. Extracting init-script `pip install` commands from the job configuration
  3. Producing a JSON block suitable for the Jobs API `environments` field:
```json
{
  "environment_key": "Default",
  "spec": {
    "client": "2",
    "dependencies": ["mlflow==2.12.1", "scikit-learn==1.3.0", "xgboost==2.0.3"]
  }
}
```
Important: ML runtime libraries (mlflow, scikit-learn, hyperopt, xgboost, tensorflow, torch, etc.) are NOT pre-installed on serverless compute. They MUST be listed explicitly in the environment spec `dependencies`. The ML runtime is NOT available on serverless — always use Serverless Environments with explicit package dependencies instead.
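A rough sketch of steps 1–3, assuming dependencies can be recovered from `%pip install` lines alone (mapping bare `import` names to PyPI packages is workload-specific and omitted here):

```python
import json
import re

def build_environment_spec(notebook_source: str) -> dict:
    # Step 1: collect pinned packages from %pip install lines
    # (exported notebooks prefix magics with "# MAGIC").
    deps: list[str] = []
    for line in notebook_source.splitlines():
        m = re.search(r"%pip install\s+(.+)", line)
        if m:
            deps.extend(m.group(1).split())
    # Step 2 (init-script pip installs) would extend `deps` the same way.
    # Step 3: emit the Jobs API environments entry.
    return {
        "environment_key": "Default",
        "spec": {"client": "2", "dependencies": sorted(set(deps))},
    }

print(json.dumps(build_environment_spec("# MAGIC %pip install mlflow==2.12.1"), indent=2))
```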

Step 3: Test — Two-Branch Strategy


Use separate branches for testing and production to keep test-only workarounds out of the code that ships. The test branch is a safe sandbox for experimentation; the production branch contains only changes that production actually needs.
| Aspect | Test branch | Production branch |
| --- | --- | --- |
| Name pattern | `serverless-test-{job_name}-{timestamp}` | `serverless-prod-{job_name}` |
| Base branch | Any working branch | Must be master |
| Purpose | Verify serverless compatibility | Deploy to production |
| Test-only workarounds | Yes (catalog overrides, sampled data, date limits) | No |
| Compatibility fixes | Yes (discover them here) | Yes (apply the validated ones) |
| Job config changes | Yes (for the test job) | Yes (for the prod job) |
| Catalog | Test catalog | Production catalog |
| PR required | No | Yes |
| Merged to master | No | Yes |
Test branch (`serverless-test-{job_name}-{timestamp}`): temporary, no PR needed.
  1. Create a branch from your current working branch
  2. Set up test data: create sampled copies of upstream tables in a test catalog using job lineage (see test data setup below)
  3. Parameterize the catalog so the notebook works with both test and production data (see catalog parameterization pattern below)
  4. Apply all compatibility fixes discovered in Step 2
  5. Create a serverless test job and run it
  6. If it fails, get the error output, debug, fix, and retry
  7. Document which changes are test workarounds vs. real compatibility fixes
Production branch (`serverless-prod-{job_name}`): PR required, created from master.
  1. Create a new branch from master (NOT from the test branch)
  2. Apply ONLY the real compatibility fixes — no test workarounds
  3. Apply job config changes (see job config transformation below)
  4. Commit and create a PR

Test Data Setup


When the job reads from production tables, do not point the test job at production data. Instead, create sampled copies of upstream tables in a dedicated test catalog and run the test job against those.
The recommended pattern:
  1. Resolve the job's upstream tables from its lineage (or from a static scan of the notebook)
  2. For each upstream table, run `CREATE TABLE IF NOT EXISTS <test_catalog>.<schema>.<table> AS SELECT * FROM <prod_catalog>.<schema>.<table> LIMIT N` (typical N: 10–1000 rows)
  3. Keep the schema names identical to production — only the catalog changes
  4. Make the operation idempotent: skip tables that already exist, so the setup step is safe to re-run
  5. Require a running SQL warehouse and `CREATE TABLE` permission on the test catalog

With schema names preserved, the same notebook code runs in both environments — only the `catalog` widget value changes.
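A minimal sketch of the sampling step as it might run in a notebook (`spark` is the ambient session; the table list, catalog names, and N are placeholders):

```python
# Upstream tables resolved from lineage or a static scan (placeholders).
upstream_tables = ["sales.orders", "sales.customers"]
prod_catalog, test_catalog, n_rows = "main", "test_catalog", 1000

for table in upstream_tables:
    schema = table.split(".")[0]
    # Keep schema names identical to production; only the catalog differs.
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {test_catalog}.{schema}")
    # IF NOT EXISTS skips existing tables, so the setup is safe to re-run.
    spark.sql(
        f"CREATE TABLE IF NOT EXISTS {test_catalog}.{table} "
        f"AS SELECT * FROM {prod_catalog}.{table} LIMIT {n_rows}"
    )
```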

Decision Tree: Should This Change Go to Production?


| Change type | Production? | Reason |
| --- | --- | --- |
| Remove incompatible Spark configs | Yes | Serverless compatibility fix |
| Update library versions | Yes | Serverless compatibility fix |
| Replace DBFS paths with UC Volumes | Yes | Serverless compatibility fix |
| Remove init scripts, add Environments | Yes | Serverless compatibility fix |
| Fix hardcoded cluster settings | Yes | Serverless compatibility fix |
| Catalog override to test catalog | No | Test workaround only |
| Empty DataFrame handling for missing test data | No | Test workaround only |
| Date range limiting for faster tests | No | Test workaround only |
Simple test: Would production fail without this change on serverless? If yes → include. If no → test branch only.

A/B Comparison


After both branches are ready, compare outputs:
```python
# Compare outputs between classic and serverless runs
classic_df = spark.read.table("main.output.classic_results")
serverless_df = spark.read.table("main.output.serverless_results")

assert classic_df.count() == serverless_df.count(), "Row count mismatch"
assert classic_df.schema == serverless_df.schema, "Schema mismatch"
diff = classic_df.exceptAll(serverless_df)
assert diff.count() == 0, f"Found {diff.count()} differing rows"
```

**Temporary bridge configs**: If the serverless run fails, you may temporarily set supported Spark configs (like `spark.sql.shuffle.partitions`) to bridge gaps. Mark these as temporary — remove them once the workload stabilizes.

Step 4: Validate — Confirm and Monitor


Once the A/B comparison passes:
  1. Merge the production branch PR
  2. Switch the production job to serverless compute
  3. Monitor cost via system tables (`system.billing.usage`) and budget policies
  4. Remove any temporary bridge configurations
  5. Set up budget alerts for cost visibility
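A sketch of the cost check in step 3 (the job ID is a placeholder; column names follow the documented `system.billing.usage` schema):

```python
# Daily DBU consumption for the migrated job over the last week.
usage = spark.sql("""
    SELECT usage_date, sku_name, SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_metadata.job_id = '123'          -- placeholder job ID
      AND usage_date >= current_date() - INTERVAL 7 DAYS
    GROUP BY usage_date, sku_name
    ORDER BY usage_date
""")
usage.show()
```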

Migration Deliverables


At the end of a successful migration run, surface these artifacts so the user can verify the work and inspect the results:
| Deliverable | What it is | Why it matters |
| --- | --- | --- |
| Test branch name/URL | The `serverless-test-{job_name}-{timestamp}` branch with all compatibility fixes and test workarounds | Lets the user see what changed during experimentation, including test-only adjustments |
| Production branch name/URL | The `serverless-prod-{job_name}` branch containing only the validated compatibility fixes | This is what ships — the user reviews and merges the PR from here |
| Test job ID and run URL | The serverless test job that validated the migration | Proves the notebook runs successfully on serverless against sampled data |
| Classic vs serverless comparison | A/B result summary (row counts, schema check, row-level diff) | Confidence that serverless output matches classic output |
| Serverless environment spec | The `environments` JSON block (client version + pinned dependencies) | Ready to paste into the production job config |
| Change summary | List of what went to production vs. test-only (with reasons) | Audit trail for the PR reviewer |
If any deliverable is missing, the migration is incomplete — do not mark it as done.

Stopping Conditions


Do not attempt workarounds for these — surface them to the user and stop:
  • Permission failures on source tables, the test catalog, or the workspace
  • Category 3 blockers (R code, custom Spark data source JARs, features that require classic compute)
  • SQL warehouse or test catalog not available
  • Repeated failures (typically 5+) with no new information in the error trace — generate a failure report instead (see Failure Reporting Protocol)
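One way the last condition can be wired into the test loop (a sketch; `run_test_job`, `apply_fix`, and `write_failure_report` are hypothetical helpers):

```python
MAX_RETRIES = 5  # typical retry budget before stopping

def migrate_with_retries(job) -> None:
    last_error = None
    for attempt in range(1, MAX_RETRIES + 1):
        result = run_test_job(job)       # hypothetical: run the serverless test job
        if result.success:
            return
        if result.error == last_error:
            break                        # no new information in the error — stop early
        last_error = result.error
        apply_fix(job, result.error)     # hypothetical: apply a known fix for the error
    write_failure_report(job, last_error)  # see Failure Reporting Protocol
```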

Quick Fixes Reference


Replace DBFS paths with UC Volumes


```python
# BEFORE (classic)
df = spark.read.csv("dbfs:/mnt/datalake/sales/data.csv", header=True)
df.write.parquet("dbfs:/mnt/output/results")

# AFTER (serverless)
df = spark.read.csv("/Volumes/main/sales/raw_data/data.csv", header=True)
df.write.parquet("/Volumes/main/analytics/output/results")

# Replace mounts with external volumes (SQL):
#   CREATE EXTERNAL VOLUME main.data.raw_files LOCATION 's3://my-bucket/data/';
# Then use: /Volumes/main/data/raw_files/

# Pandas paths too:
# BEFORE: pd.read_csv("/dbfs/mnt/data/file.csv")
# AFTER:  pd.read_csv("/Volumes/main/data/volume/file.csv")
```

Replace RDD operations with DataFrames


```python
from pyspark.sql import functions as F

# parallelize + map
# BEFORE:
rdd = sc.parallelize([1, 2, 3])
result = rdd.map(lambda x: x * 2).collect()
# AFTER:
df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])
result = df.select((F.col("value") * 2).alias("value")).collect()

# flatMap (word splitting)
# BEFORE:
words = sc.parallelize(["hello world"]).flatMap(lambda l: l.split(" ")).collect()
# AFTER:
df = spark.createDataFrame([("hello world",)], ["line"])
words = df.select(F.explode(F.split("line", " ")).alias("word")).collect()

# groupByKey
# BEFORE:
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
grouped = rdd.groupByKey().mapValues(list).collect()
# AFTER:
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
grouped = df.groupBy("key").agg(F.collect_list("value").alias("values")).collect()

# mapPartitions → applyInPandas
# BEFORE:
def process_partition(iterator):
    yield sum(iterator)

result = sc.parallelize(range(100), 4).mapPartitions(process_partition).collect()
# AFTER:
import pandas as pd

def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"total": [pdf["id"].sum()]})

result = (spark.range(100).repartition(4)
          .groupBy(F.spark_partition_id())
          .applyInPandas(process_group, schema="total long")
          .collect())

# textFile
# BEFORE: rdd = sc.textFile("/mnt/data/file.txt")
# AFTER:
df = spark.read.text("/Volumes/catalog/schema/volume/file.txt")

# wholeTextFiles
# BEFORE: rdd = sc.wholeTextFiles("/mnt/data/dir/")
# AFTER:
df = spark.read.format("binaryFile").load("/Volumes/catalog/schema/volume/dir/")
```

Fix streaming triggers


```python
# CRITICAL: Omitting .trigger() defaults to ProcessingTime(0) — not supported on serverless

# BEFORE (fails on serverless — no trigger = ProcessingTime default):
query = df.writeStream.format("delta").outputMode("append").start(path)

# BEFORE (fails — explicit ProcessingTime):
query = df.writeStream.trigger(processingTime="10 seconds").start(path)

# AFTER (serverless compatible):
query = (df.writeStream
         .format("delta")
         .outputMode("append")
         .trigger(availableNow=True)
         .option("checkpointLocation", "/Volumes/main/data/checkpoints/stream1")
         .start("/Volumes/main/data/output/stream1"))
query.awaitTermination()

# With OOM prevention (recommended for large sources):
query = (spark.readStream.format("delta")
         .option("maxFilesPerTrigger", 100)    # Delta/Parquet sources
         .option("maxBytesPerTrigger", "10g")  # Limit data per micro-batch
         .load(input_path)
         .writeStream
         .trigger(availableNow=True)
         .option("checkpointLocation", checkpoint_path)
         .start(output_path))

# Kafka: use maxOffsetsPerTrigger
query = (spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "topic1")
         .option("maxOffsetsPerTrigger", 100000)  # Kafka-specific
         .load()
         .writeStream.trigger(availableNow=True).start(output_path))

# Auto Loader: use cloudFiles.maxFilesPerTrigger (note the prefix)
query = (spark.readStream.format("cloudFiles")
         .option("cloudFiles.format", "json")
         .option("cloudFiles.maxFilesPerTrigger", 1000)  # cloudFiles. prefix
         .load(landing_path)
         .writeStream.trigger(availableNow=True).start(output_path))
```

Remove caching


```python
# BEFORE (classic):
df = spark.read.parquet(path)
df.cache()
df.count()  # materialize cache
result1 = df.filter("status = 'active'")
result2 = df.groupBy("region").agg(F.sum("revenue"))

# AFTER (serverless — remove .cache(); native support coming soon):
df = spark.read.parquet(path)
result1 = df.filter("status = 'active'")
result2 = df.groupBy("region").agg(F.sum("revenue"))

# For truly expensive intermediate results, materialize to Delta:
expensive_df.write.format("delta").mode("overwrite").saveAsTable("main.scratch.intermediate")
result = spark.table("main.scratch.intermediate")

# SQL equivalent:
# BEFORE: CACHE TABLE my_table
# AFTER:  (just remove the CACHE TABLE statement)
```

Other quick fixes


| Pattern | Fix | Full example |
| --- | --- | --- |
| `sc.broadcast` / `sc.accumulator` / `sqlContext.sql` | Use SparkSession equivalents: `broadcast()` join, `df.agg()`, `spark.sql()` | Code Patterns |
| Init scripts | Move to the Environment panel or `requirements.txt`. Do NOT install PySpark. Pin versions. | Code Patterns |
| Hive Metastore tables | Use HMS Federation as a bridge (`CREATE FOREIGN CATALOG`) or migrate directly (`CREATE TABLE ... AS SELECT`) | Code Patterns |
| Custom JDBC JARs | Use Lakehouse Federation (`CREATE CONNECTION ... TYPE POSTGRESQL`) or built-in JDBC (works on serverless) | Code Patterns |
| Spark UI debugging | Use Query Profile: click "See performance" under the cell output, or `df.explain(True)` | Code Patterns |

Detect serverless at runtime


```python
import os

is_serverless = os.getenv("IS_SERVERLESS", "").lower() == "true"
```

Transform job config from classic to serverless


Remove `job_clusters`/`new_cluster`, add `environments` with the serverless spec, replace `job_cluster_key` with `environment_key`, and remove `init_scripts`. See the Configuration Guide for the full before/after JSON and environment version mapping.
Environment version mapping (match to the DBR version the workload was on):

| Classic DBR | Serverless `spec.client` | Python |
| --- | --- | --- |
| 13.x, 14.x | "1" | 3.10 |
| 15.x | "2" | 3.11 |
| 16.x+ | "3" | 3.12 |
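The same mapping as a helper function (a sketch; version strings outside the table above fall through to the final case):

```python
def serverless_client_for_dbr(dbr_version: str) -> str:
    """Pick the serverless environment client version for a classic DBR version."""
    major = int(dbr_version.split(".")[0])
    if major <= 14:
        return "1"  # Python 3.10
    if major == 15:
        return "2"  # Python 3.11
    return "3"      # Python 3.12 (16.x+)
```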

Job Definition Migration


When migrating a job, the job configuration JSON must be transformed alongside the notebook code. The agent should perform all of the following:

Init scripts to Serverless Environments: Detect `init_scripts` in the job JSON. Extract all `pip install` commands and convert them to Environment `dependencies`. For OS-level packages (`apt install` / `yum install`) that have pip equivalents (e.g., `apt install python3-opencv` becomes `opencv-python`), convert them. Flag OS-level packages without pip equivalents as serverless-incompatible (Category 3).

Cluster libraries (Maven/JAR) to Environment or Volumes: Maven coordinates for Python-wrapping JARs should be replaced with their PyPI equivalent in the Environment spec. Custom JARs on DBFS need to be moved to `/Volumes/<your_catalog>/schema/volume/` and referenced there. Custom Spark data source JARs (v1/v2) are a Category 3 blocker — flag them for classic retention.

job_clusters to serverless compute: Remove `job_clusters`/`new_cluster` blocks entirely. Add an `environments` array with the serverless spec. Replace `job_cluster_key` in each task with `environment_key`. Remove `init_scripts`, `num_workers`, `node_type_id`, `spark_version`. See the Configuration Guide for a complete before/after example; a sketch of this transformation follows this list.

spark_conf migration: Scan all `spark.conf.set(...)` calls in the notebook and `spark_conf` entries in the job JSON. For each:
  • Supported (keep): `spark.sql.shuffle.partitions`, `spark.sql.session.timeZone`, `spark.sql.ansi.enabled`, `spark.sql.files.maxPartitionBytes`, `spark.sql.legacy.timeParserPolicy`, `spark.databricks.execution.timeout`
  • Auto-tuned (remove with a comment): AQE configs, Delta auto-compact, executor/driver sizing, parallelism configs
  • Credential configs (remove): `fs.s3a.*`, `fs.azure.*` — replaced by UC external locations
  • Add a code comment at each removal explaining why: `# Removed: auto-tuned on serverless` or `# Removed: use UC external locations instead`
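A sketch of the `job_clusters` → serverless transformation on the job JSON (the dependency list comes from the environment spec step; `SUPPORTED_CONFS` mirrors the keep-list above):

```python
SUPPORTED_CONFS = {
    "spark.sql.shuffle.partitions", "spark.sql.session.timeZone",
    "spark.sql.ansi.enabled", "spark.sql.files.maxPartitionBytes",
    "spark.sql.legacy.timeParserPolicy", "spark.databricks.execution.timeout",
}

def keep_conf(key: str) -> bool:
    """Whether a spark.conf.set(...) call survives the migration."""
    return key in SUPPORTED_CONFS

def to_serverless(job_json: dict, dependencies: list[str]) -> dict:
    job = dict(job_json)
    job.pop("job_clusters", None)  # classic cluster definitions go away entirely
    job["environments"] = [{
        "environment_key": "Default",
        "spec": {"client": "2", "dependencies": dependencies},
    }]
    for task in job.get("tasks", []):
        task.pop("new_cluster", None)
        task.pop("job_cluster_key", None)
        task["environment_key"] = "Default"
    return job
```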

Parameterize catalogs for testing


```python
dbutils.widgets.text("catalog", "main")  # Default to production
catalog = dbutils.widgets.get("catalog")
df = spark.table(f"{catalog}.sales.orders")
```

Pass `catalog="test_catalog"` as a job parameter during testing.


See [Configuration Guide](references/configuration-guide.md) for mock table catalog mapping and test job creation patterns.


Debug failed serverless runs


Always get the actual error with `w.jobs.get_run_output(run_id=...)` before guessing. Common errors:

| Error | Fix |
| --- | --- |
| `INFINITE_STREAMING_TRIGGER_NOT_SUPPORTED` | Add `.trigger(availableNow=True)` |
| `UNRESOLVED_COLUMN` | Temp view name collision — use unique names |
| `TABLE_OR_VIEW_NOT_FOUND` | DBFS/HMS table not accessible — migrate to UC |
| `Py4JError: ... is not available` | SparkContext/RDD used — rewrite to the DataFrame API |
| Package installation timeout | Pin versions; do NOT install PySpark as a dependency |
| `ModuleNotFoundError: No module named 'mlflow'` | Add it to the environment spec `dependencies` — the ML runtime is NOT available on serverless |
| `SparkContext.getOrCreate() is NOT supported` / `RuntimeError: Only remote Spark sessions` | Replace with `spark.createDataFrame()` or `spark.range()` |
| `UC_FILE_SCHEME_FOR_TABLE_CREATION_NOT_SUPPORTED` | Use managed tables or `/Volumes/...` paths |
| `PERMISSION_DENIED: CREATE SCHEMA on Catalog 'main'` | Add `spark.sql("USE CATALOG <your_catalog>")` before CREATE statements |
| `DATA_SOURCE_NOT_FOUND: Failed to find data source` | Category 3 blocker — a custom JAR data source needs classic compute |
| `SyntaxError` after migration | Ensure comments are inside MAGIC blocks, not straddling cell delimiters |

See the Configuration Guide for the full error reference and SDK code examples.
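A sketch of pulling the real error with the Databricks SDK (the run ID is a placeholder; for multi-task jobs the output is fetched per task run):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

run = w.jobs.get_run(run_id=456)  # placeholder: the failed test run
for task in run.tasks or []:
    output = w.jobs.get_run_output(run_id=task.run_id)
    if output.error:
        print(task.task_key, "->", output.error)
        print(output.error_trace)
```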

Performance Mode Selection


| Criteria | Performance-Optimized | Standard |
| --- | --- | --- |
| Startup time | <50 seconds | 4–6 minutes |
| Cost | Higher | Significantly lower |
| Available for | Notebooks, Jobs, SDP | Jobs and SDP only |
| Best for | Interactive work, dev, time-sensitive runs | Batch ETL, scheduled pipelines |
| Default | Yes (UI and API) | Must be explicitly selected |
Standard mode is NOT available for notebooks. Notebooks always use Performance-Optimized.

Serverless Defaults to Know


| Setting | Value |
| --- | --- |
| REPL VM memory | 8GB default (high-memory option available) |
| Max executors | 32 (Premium), 64 (Enterprise) — raise via support |
| Supported Spark configs | 6 only (see Category D above) |
| Debugging | Query Profile (no Spark UI) |
| ANSI SQL | Enabled by default (configurable) |

Failure Reporting Protocol


When migration fails irrecoverably, generate a structured failure report to help improve the skill. This applies when:
  • All retry attempts are exhausted (typically 5)
  • An unknown pattern is encountered that isn't in the compatibility checks
  • A fix was applied but didn't resolve the underlying issue
  • The workload hits a Category 3 blocker the user wasn't aware of

When to generate a report


Generate a report at the end of a migration attempt if any of:
  • `retry_count >= max_retries` and the final status is FAILED
  • A pattern was detected but no fix is available in the skill
  • The user explicitly requests a failure report (`/migration-report`)

How to generate


Write a JSON file to `~/.databricks-migration-skill/reports/failure-<ISO-timestamp>.json`. Create the directory if it doesn't exist.

Schema (strictly follow — no free-text code or identifiers):
```json
{
  "report_version": "1.0",
  "report_id": "<uuid-v4>",
  "skill_version": "<from SKILL.md frontmatter metadata.version>",
  "timestamp": "<ISO 8601 UTC>",
  "failure_phase": "analyze | migrate | test | validate",
  "detected_patterns": [
    {"category": "A", "pattern_id": "rdd_parallelize", "count": 3}
  ],
  "attempted_fixes": [
    {"pattern_id": "rdd_parallelize", "fix_applied": "<fix_id>", "attempt_number": 1, "outcome": "failed"}
  ],
  "final_error_category": "unknown_api | missing_library | data_access | permission | custom_data_source | other",
  "final_error_signature": "<SHA256 of top 3 stack frames, NOT the frames themselves>",
  "retry_count": 5,
  "total_duration_seconds": 245,
  "notebook_characteristics": {
    "lines_of_code": 180,
    "language": "python | sql | scala | r",
    "uses_streaming": false,
    "uses_ml_libraries": true,
    "databricks_runtime_source": "<DBR version only, no cluster identifiers>"
  }
}
```
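A sketch of the report writer under these constraints (phase, category, skill version, and duration are hardcoded placeholders; the caller supplies pattern IDs, fix attempts, raw stack frames, and the characteristics dict):

```python
import hashlib
import json
import os
import uuid
from datetime import datetime, timezone

def write_failure_report(detected_patterns, attempted_fixes, stack_frames,
                         notebook_characteristics):
    report = {
        "report_version": "1.0",
        "report_id": str(uuid.uuid4()),
        "skill_version": "0.0.0",             # placeholder: read from SKILL.md frontmatter
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "failure_phase": "test",              # placeholder
        "detected_patterns": detected_patterns,
        "attempted_fixes": attempted_fixes,
        "final_error_category": "other",      # placeholder
        # Hash the top 3 stack frames so no error text leaves the machine.
        "final_error_signature": hashlib.sha256(
            "\n".join(stack_frames[:3]).encode()
        ).hexdigest(),
        "retry_count": len(attempted_fixes),
        "total_duration_seconds": 0,          # placeholder
        "notebook_characteristics": notebook_characteristics,
    }
    out_dir = os.path.expanduser("~/.databricks-migration-skill/reports")
    os.makedirs(out_dir, exist_ok=True)
    stamp = report["timestamp"].replace(":", "-")  # filesystem-safe timestamp
    path = os.path.join(out_dir, f"failure-{stamp}.json")
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
    return path
```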

What the report MUST NOT contain


This is a hard requirement — the report must be safe to share publicly on GitHub Issues:
  • No code content — only pattern IDs from this skill's catalog (e.g., `rdd_parallelize`), never actual code snippets
  • No file paths — no notebook names, directory paths, or workspace URLs
  • No error message text — only the error category enum and a hashed signature
  • No identifiers — no table names, column names, catalog names, schema names, user emails, workspace IDs, or customer names
  • No credentials — no secret scope names, API keys, or connection strings
  • No data descriptions — no column value samples, row counts tied to specific tables, or data shape details beyond the `notebook_characteristics` fields

After generating the report


Tell the user:
```
Migration failed after <N> attempts. A failure report has been generated at:

  ~/.databricks-migration-skill/reports/failure-<timestamp>.json

This report contains anonymized diagnostic data (detected patterns, error categories, retry count) and no code content or PII. You can:

1. Review the JSON to confirm no sensitive information is present
2. Share it via GitHub Issue to help improve the skill:
   https://github.com/databricks/databricks-agent-skills/issues/new?template=migration-feedback.md

Submission is optional and opt-in. We use reports to prioritize new patterns and fix detection gaps.
```
Never transmit the report automatically. The user owns their data and must review before sharing.

Reference Guides


For detailed workarounds and code examples beyond the quick fixes above:
  • Compatibility Checks — Full pattern detection table with all 40+ checks
  • Streaming Migration — Trigger migration, SDP continuous mode, continuous jobs
  • Networking and Security — VPC peering to NCCs, Private Link, firewall setup
  • Code Patterns — Complete before/after code examples for every migration pattern
  • Configuration Guide — Supported Spark configs, Environments setup, budget policies

Documentation
