gcp-spark

Spark on Dataproc

[!IMPORTANT] You MUST ALWAYS follow the Task Execution Workflow when writing Spark code.

Task Execution Workflow

Understand schemas: ALWAYS use the `@skill:discovering-gcp-data-assets` skill or `resources/schema_direct_inspection.md` to understand input and output schemas. Include the schema in your thought process BEFORE generating any code. Do NOT guess column names.
Generate Spark code:
- Output Format: ALWAYS generate code in Python Notebook (.ipynb) format. Generate scripts (.py) only if explicitly requested.
- Read and Write data: ALWAYS refer to `resources/read_write_data.md` when reading or writing data.
- ML Tasks: Refer to the `@skill:ml-best-practices` skill and `resources/ml_tasks.md` when generating ML code.
- Spark Optimizations: ALWAYS refer to `resources/spark_optimizations.md` when generating Spark code and apply optimizations wherever applicable.
Verify schema before write: ALWAYS verify that the DataFrame schema and the destination schema match. Use `df.printSchema()` for the DataFrame schema, and refer to the `@skill:discovering-gcp-data-assets` skill or `resources/schema_direct_inspection.md` to verify the destination schema.
Compile code before executing: For notebooks, first convert them to a Python script using `jupyter nbconvert --to script your-notebook.ipynb`, then compile the code using `python3 -m py_compile your-notebook.py`.
Execute script: ONLY when generating a `.py` script, refer to `resources/gcloud_dataproc.md` for how to write the command that executes the generated code on Dataproc. This DOES NOT apply when generating notebooks.
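
The schema verification and compile steps of the workflow could be scripted roughly as follows. This is a minimal sketch, not part of the skill's resources: the helper names `schema_mismatches` and `compiles_cleanly` and the example columns are hypothetical. It assumes schemas are expressed as `(column, type)` pairs, the form that `df.dtypes` returns, and that the destination schema has been collected in the same form.

```python
import py_compile


def schema_mismatches(df_dtypes, target_dtypes):
    """Compare (column, type) pairs such as those returned by `df.dtypes`.

    Returns a list of human-readable problems; an empty list means the
    column names and type strings line up.
    """
    source = dict(df_dtypes)
    target = dict(target_dtypes)
    problems = []
    for name in target:
        if name not in source:
            problems.append(f"missing column: {name}")
        elif source[name] != target[name]:
            problems.append(
                f"type mismatch for {name}: {source[name]} != {target[name]}"
            )
    for name in source:
        if name not in target:
            problems.append(f"unexpected column: {name}")
    return problems


def compiles_cleanly(script_path):
    """Return True if the script byte-compiles, like `python3 -m py_compile`."""
    try:
        py_compile.compile(script_path, doraise=True)
        return True
    except py_compile.PyCompileError:
        return False


# Hypothetical schemas, written in the (name, type) form that `df.dtypes` uses.
df_dtypes = [("order_id", "bigint"), ("amount", "double"), ("note", "string")]
target_dtypes = [("order_id", "bigint"), ("amount", "decimal(10,2)")]

for problem in schema_mismatches(df_dtypes, target_dtypes):
    print(problem)
```

In a real job, `df_dtypes` would come from `df.dtypes`, and any reported problem would be resolved (for example with an explicit cast or a dropped column) before the write.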

Common Mistakes Checklist

[!CAUTION] Go through this checklist to avoid common mistakes.

Before submitting a job, verify:

- All imports present (`col`, `when`, `lit`, etc. from `pyspark.sql.functions`).
- `vector_to_array` imported from the correct module: use `from pyspark.ml.functions import vector_to_array` (NOT `pyspark.sql.functions`).
- DataFrame schema matches the target Iceberg table: verify with `df.printSchema()` before writing.
- CSV files read with `header` and `inferSchema`: without these, the header row becomes data and all columns are strings.
- Avoid `toPandas()`: converting a PySpark DataFrame to pandas by calling `toPandas()` can lead to out-of-memory errors. Only acceptable for building visualizations in Spark 3.5.
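
As an illustration of the CSV item in the checklist, a read with both options set might look like the sketch below. The helper name `read_csv_with_header` and the bucket path in the usage comment are hypothetical; `spark` is assumed to be an existing SparkSession.

```python
def read_csv_with_header(spark, path):
    """Read a CSV so the first row supplies column names and types are inferred.

    `spark` is an existing SparkSession; `path` is a placeholder location.
    """
    return (
        spark.read
        .option("header", True)       # treat the first row as column names
        .option("inferSchema", True)  # extra pass over the data to infer types
        .csv(path)
    )


# Usage in a notebook or script that already has a SparkSession:
# df = read_csv_with_header(spark, "gs://your-bucket/path/data.csv")
# df.printSchema()
```

Without the two options, Spark names the columns `_c0`, `_c1`, and so on, keeps the header row as a data row, and types every column as string.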

IAM Requirements

The Dataproc service account needs:

- `roles/dataproc.worker`: Job execution
- `roles/biglake.admin`: Iceberg table management
- `roles/bigquery.jobUser`: Query materialization
- `roles/storage.objectUser`: Read/write GCS
- `roles/spanner.databaseUser`: Spanner writes
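
One way to review these grants before applying them is to generate the corresponding `gcloud projects add-iam-policy-binding` commands. The sketch below only builds command strings and runs nothing. It assumes every role is granted at the project level, which may be broader than needed (the Spanner role, for instance, can also be granted on an instance or database). The helper name `grant_commands`, the project ID, and the service account email are placeholders.

```python
# Roles listed above for the Dataproc service account.
ROLES = [
    "roles/dataproc.worker",
    "roles/biglake.admin",
    "roles/bigquery.jobUser",
    "roles/storage.objectUser",
    "roles/spanner.databaseUser",
]


def grant_commands(project_id, service_account_email):
    """Build one project-level IAM binding command per required role."""
    member = f"serviceAccount:{service_account_email}"
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f"--member={member} --role={role}"
        for role in ROLES
    ]


# Placeholder identifiers; substitute real values before running the output.
for command in grant_commands(
    "your-project-id",
    "your-sa@your-project-id.iam.gserviceaccount.com",
):
    print(command)
```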

Spark resource management

Refer to `resources/gcloud_dataproc.md` for detailed guidelines on managing Spark clusters, jobs, batches, and interactive sessions.
