gcp-spark
Spark on Dataproc
[!IMPORTANT] You MUST ALWAYS follow the Task Execution Workflow when writing Spark code.
Task Execution Workflow
- Understand schemas: ALWAYS use skill `@skill:discovering-gcp-data-assets` or `resources/schema_direct_inspection.md` to understand input and output schemas. Include the schema in your thought process BEFORE generating any code. Do NOT guess column names.
- Generate Spark code:
  - Output Format: ALWAYS generate code in **Python Notebooks (.ipynb)** format. Generate scripts (.py) only if explicitly requested.
  - Read and Write data: ALWAYS refer to `resources/read_write_data.md` when reading or writing data.
  - ML Tasks: Refer to skill `@skill:ml-best-practices` and `resources/ml_tasks.md` when generating ML code.
  - Spark Optimizations: ALWAYS refer to `resources/spark_optimizations.md` when generating Spark code and apply optimizations whenever applicable.
- Verify schema before write: ALWAYS verify that the DataFrame and destination schemas match. Use `df.printSchema()` for the DataFrame schema and refer to skill `@skill:discovering-gcp-data-assets` or `resources/schema_direct_inspection.md` to verify the destination schema.
- Compile code before executing: For notebooks, first convert them to a Python script using `jupyter nbconvert --to script your-notebook.ipynb`, then compile the code using `python3 -m py_compile your-notebook.py`.
- Execute script: ONLY when generating a `.py` script, refer to `resources/gcloud_dataproc.md` on writing the command to execute the generated code on Dataproc. This DOES NOT apply when generating notebooks.
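The compile step above can be wrapped in a small helper. The following is a minimal sketch, not part of the skill itself: the function name `compile_check` is hypothetical, the notebook branch assumes the `jupyter` CLI is on the PATH and that the notebook uses a Python kernel (so the converted file ends in `.py`), and the compile itself uses the standard-library `py_compile` module, the same module that `python3 -m py_compile` invokes.

```python
import py_compile
import subprocess
from pathlib import Path


def compile_check(path: str) -> bool:
    """Return True if the script (or converted notebook) byte-compiles cleanly."""
    source = Path(path)
    if source.suffix == ".ipynb":
        # Assumption: `jupyter` is installed and the notebook has a Python kernel,
        # so nbconvert writes <name>.py next to the notebook.
        subprocess.run(
            ["jupyter", "nbconvert", "--to", "script", str(source)],
            check=True,
        )
        source = source.with_suffix(".py")
    try:
        # doraise=True raises PyCompileError on a syntax error instead of printing it.
        py_compile.compile(str(source), doraise=True)
    except py_compile.PyCompileError as err:
        print(f"Compilation failed: {err.msg}")
        return False
    return True
```

A call such as `compile_check("your-notebook.ipynb")` would then return `False` on a syntax error. Byte-compilation only catches syntax problems; a wrong column name or a missing import still surfaces at run time.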
Common Mistakes Checklist
[!CAUTION] Verify every item on this checklist to avoid mistakes.
Before submitting a job, verify:
- All imports present (`col`, `when`, `lit`, etc. from `pyspark.sql.functions`)
- `vector_to_array` from correct module: use `from pyspark.ml.functions import vector_to_array` (NOT `pyspark.sql.functions`)
- DataFrame schema matches target Iceberg table: verify with `df.printSchema()` before writing
- CSV files read with `header` and `inferSchema`: without these, the header row becomes data and all columns are strings
- Avoid `toPandas()`: Converting a PySpark DataFrame to pandas by calling `toPandas()` can lead to out-of-memory errors. Only acceptable for building visualizations in Spark 3.5.
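The schema item in the checklist above can be checked programmatically as well as by reading `df.printSchema()` output. The sketch below is one possible approach under stated assumptions: `schema_mismatches` is a hypothetical helper, the column names are made up for illustration, and both inputs are `(name, type)` pairs in the shape that PySpark's `df.dtypes` returns, with the target pairs written in the same type spelling (for example `bigint`, not `long`).

```python
from typing import Dict, List, Tuple

Columns = List[Tuple[str, str]]


def schema_mismatches(df_columns: Columns, target_columns: Columns) -> List[str]:
    """Describe every difference between DataFrame columns and target columns.

    Both arguments are (name, type) pairs, the shape PySpark's `df.dtypes` returns.
    """
    actual: Dict[str, str] = dict(df_columns)
    expected: Dict[str, str] = dict(target_columns)
    problems = []
    # Columns the target expects: report missing ones and type differences.
    for name, dtype in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name} ({dtype})")
        elif actual[name] != dtype:
            problems.append(
                f"type mismatch: {name} is {actual[name]}, target expects {dtype}"
            )
    # Columns the DataFrame carries that the target does not define.
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems


# Hypothetical column names, for illustration only.
df_columns = [("order_id", "bigint"), ("amount", "string")]
target_columns = [("order_id", "bigint"), ("amount", "double"), ("ts", "timestamp")]

for problem in schema_mismatches(df_columns, target_columns):
    print(problem)
```

With a live DataFrame, the first argument would be `df.dtypes`; an empty result means names and types line up. Column order and nullability are not compared in this sketch.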
IAM Requirements
The Dataproc service account needs:
- `roles/dataproc.worker`: Job execution
- `roles/biglake.admin`: Iceberg table management
- `roles/bigquery.jobUser`: Query materialization
- `roles/storage.objectUser`: Read/write GCS
- `roles/spanner.databaseUser`: Spanner writes
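One way to grant the roles listed above is to generate a `gcloud projects add-iam-policy-binding` command per role. This is a rough sketch under assumptions the source does not state: it binds every role at the project level (a narrower scope may be preferable in practice), and the project ID, service account email, and helper name `binding_commands` are placeholders.

```python
from typing import List

# Roles listed in this section.
REQUIRED_ROLES = [
    "roles/dataproc.worker",
    "roles/biglake.admin",
    "roles/bigquery.jobUser",
    "roles/storage.objectUser",
    "roles/spanner.databaseUser",
]


def binding_commands(project_id: str, service_account: str) -> List[str]:
    """Build one project-level `gcloud` binding command per required role."""
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f"--member=serviceAccount:{service_account} --role={role}"
        for role in REQUIRED_ROLES
    ]


# Placeholder project and service account; substitute real values before running.
for command in binding_commands(
    "my-project", "spark-sa@my-project.iam.gserviceaccount.com"
):
    print(command)
```

The script only prints the commands, so they can be reviewed before anyone runs them.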
Spark resource management
Refer to `resources/gcloud_dataproc.md` for detailed guidelines on managing Spark clusters, jobs, batches, and interactive sessions.