gcp-spark

Spark on Dataproc

[!IMPORTANT] You MUST ALWAYS follow the Task Execution Workflow when writing Spark code.

Task Execution Workflow

  1. Understand schemas: ALWAYS use the `@skill:discovering-gcp-data-assets` skill or `resources/schema_direct_inspection.md` to understand input and output schemas. Include the schema in your thought process BEFORE generating any code. Do NOT guess column names.
  2. Generate Spark code:
    • Output Format: ALWAYS generate code in Python notebook (`.ipynb`) format. Generate scripts (`.py`) only if explicitly requested.
    • Read and Write data: ALWAYS refer to `resources/read_write_data.md` when reading or writing data.
    • ML Tasks: Refer to the `@skill:ml-best-practices` skill and `resources/ml_tasks.md` when generating ML code.
    • Spark Optimizations: ALWAYS refer to `resources/spark_optimizations.md` when generating Spark code, and apply optimizations wherever applicable.
  3. Verify schema before write: ALWAYS verify that the DataFrame schema and the destination schema match. Use `df.printSchema()` for the DataFrame schema, and refer to the `@skill:discovering-gcp-data-assets` skill or `resources/schema_direct_inspection.md` to verify the destination schema.
  4. Compile code before executing: For notebooks, first convert the notebook to a Python script using `jupyter nbconvert --to script your-notebook.ipynb`, then compile the resulting script using `python3 -m py_compile your-notebook.py`.
  5. Execute script: ONLY when generating a `.py` script, refer to `resources/gcloud_dataproc.md` for how to write the command that executes the generated code on Dataproc. This does NOT apply when generating notebooks.
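
Step 3 can be partly automated. The following is a minimal sketch, not part of the skill itself: it assumes the destination schema has already been looked up (for example with the schema inspection resources named above) and written as `(name, type)` pairs, which is the same form PySpark's `DataFrame.dtypes` returns. The function name, column names, and types are hypothetical, and both sides are assumed to use the same type vocabulary.

```python
def schema_mismatches(df_dtypes, target_dtypes):
    """Return readable differences between two lists of (column, type) pairs."""
    problems = []
    source = dict(df_dtypes)
    target = dict(target_dtypes)
    # Columns the DataFrame has: must exist in the target with the same type.
    for name, dtype in df_dtypes:
        if name not in target:
            problems.append(f"unexpected column: {name}")
        elif target[name] != dtype:
            problems.append(f"type mismatch for {name}: {dtype} != {target[name]}")
    # Columns the target expects but the DataFrame lacks.
    for name, _ in target_dtypes:
        if name not in source:
            problems.append(f"missing column: {name}")
    return problems


# Hypothetical schemas. In a real job, df_dtypes would come from df.dtypes.
df_dtypes = [("order_id", "bigint"), ("amount", "string")]
target_dtypes = [
    ("order_id", "bigint"),
    ("amount", "double"),
    ("created_at", "timestamp"),
]

for problem in schema_mismatches(df_dtypes, target_dtypes):
    print(problem)
```

An empty result means the names and types line up. Anything else should be resolved before the write, not discovered as a failed job.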

Common Mistakes Checklist

[!CAUTION] Verify this checklist to avoid common mistakes.
Before submitting a job, verify:
  • All imports present (`col`, `when`, `lit`, etc. from `pyspark.sql.functions`).
  • `vector_to_array` from correct module: use `from pyspark.ml.functions import vector_to_array` (NOT `pyspark.sql.functions`).
  • DataFrame schema matches target Iceberg table: verify with `df.printSchema()` before writing.
  • CSV files read with `header` and `inferSchema`: without these, the header row becomes data and all columns are strings.
  • Avoid `toPandas()`: converting a PySpark DataFrame to pandas by calling `toPandas()` can lead to out-of-memory errors. Only acceptable for building visualizations in Spark 3.5.
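
Two of the checklist items can be checked mechanically before a job is submitted. The sketch below is an illustration under stated assumptions, not a tool the skill provides: `checklist_warnings` and the sample script are hypothetical, and the check uses only Python's standard `ast` module, so the PySpark code is parsed but never executed. It flags `vector_to_array` imported from `pyspark.sql.functions` and any call to `toPandas()`.

```python
import ast


def checklist_warnings(source):
    """Return warnings for a PySpark script, based on the checklist above."""
    warnings = []
    tree = ast.parse(source)  # raises SyntaxError if the script does not parse
    for node in ast.walk(tree):
        # vector_to_array lives in pyspark.ml.functions, not pyspark.sql.functions.
        if isinstance(node, ast.ImportFrom) and node.module == "pyspark.sql.functions":
            if any(alias.name == "vector_to_array" for alias in node.names):
                warnings.append(
                    f"line {node.lineno}: import vector_to_array from pyspark.ml.functions"
                )
        # Any method call named toPandas is flagged, whatever the receiver is.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "toPandas"
        ):
            warnings.append(
                f"line {node.lineno}: toPandas() can cause out-of-memory errors"
            )
    return warnings


# Hypothetical script under review. The bucket path is a placeholder.
SAMPLE = '''
from pyspark.sql.functions import col, lit, when, vector_to_array
df = spark.read.csv("gs://example-bucket/input.csv", header=True, inferSchema=True)
pdf = df.toPandas()
'''

for warning in checklist_warnings(SAMPLE):
    print(warning)
```

The sample also shows the CSV item done correctly: `header=True` and `inferSchema=True` are passed to `spark.read.csv`. A check like this only catches the patterns it looks for, so it complements the manual checklist and does not replace it.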

IAM Requirements

The Dataproc service account needs:
  • `roles/dataproc.worker`: Job execution
  • `roles/biglake.admin`: Iceberg table management
  • `roles/bigquery.jobUser`: Query materialization
  • `roles/storage.objectUser`: Read/write GCS
  • `roles/spanner.databaseUser`: Spanner writes
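
As a rough way to spot a missing grant, the role list above can be compared against an IAM policy document. This sketch assumes the policy has already been exported as JSON in the standard `bindings` form (a list of objects with `role` and `members`). The service account email and the `missing_roles` helper are hypothetical. It only sees bindings present in that one policy, so roles inherited from a folder or organization, or granted on individual resources, will be reported as missing.

```python
REQUIRED_ROLES = {
    "roles/dataproc.worker",
    "roles/biglake.admin",
    "roles/bigquery.jobUser",
    "roles/storage.objectUser",
    "roles/spanner.databaseUser",
}


def missing_roles(policy, member):
    """Return the required roles that `member` is not bound to in `policy`."""
    granted = {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }
    return sorted(REQUIRED_ROLES - granted)


# Hypothetical service account and policy excerpt.
member = "serviceAccount:spark-sa@example-project.iam.gserviceaccount.com"
policy = {
    "bindings": [
        {"role": "roles/dataproc.worker", "members": [member]},
        {"role": "roles/storage.objectUser", "members": [member]},
    ]
}

print(missing_roles(policy, member))
```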

Spark Resource Management

Refer to `resources/gcloud_dataproc.md` for detailed guidelines on managing Spark clusters, jobs, batches, and interactive sessions.