gcp-dataflow


Apache Beam Pipelines on Cloud Dataflow

Expert guidance for writing and packaging Apache Beam pipelines to run on Google Cloud Dataflow.

Creating a new project


Use this section when creating a new project for a Dataflow pipeline.
  • If the user doesn't explicitly say which language (Java, Python, Go) the pipeline should be written in, you MUST confirm the language.
  • Determine which version of the Beam SDK to use by searching for the most recently released version of Apache Beam, unless the user already uses a particular version.
    • Action: Run a web search for the latest Apache Beam SDK release.
  • YOU MUST use the same version of Apache Beam consistently throughout the project in Dockerfiles, `requirements.txt`, and other files where versions are specified.
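The version-consistency rule above can be checked mechanically before a build. A minimal sketch, assuming pins of the form `apache-beam[gcp]==X.Y.Z` in `requirements.txt` and `apache/beam_python3.11_sdk:X.Y.Z` in Dockerfiles (the regex and helper are illustrative, not part of any Beam tooling):

```python
import re

# Matches pins like "apache-beam[gcp]==2.60.0" (requirements.txt)
# or "apache/beam_python3.11_sdk:2.60.0" (Dockerfile base image).
BEAM_VERSION_RE = re.compile(r"apache[-/_]beam\S*?[=:](\d+\.\d+\.\d+)")

def beam_versions(text: str) -> set[str]:
    """Collect every Apache Beam version pinned in a file's contents."""
    return set(BEAM_VERSION_RE.findall(text))

def check_consistency(files: dict[str, str]) -> set[str]:
    """Given {filename: contents}, return the set of Beam versions found.

    More than one element means the project pins inconsistent
    Beam versions and should be fixed before building.
    """
    versions: set[str] = set()
    for contents in files.values():
        versions |= beam_versions(contents)
    return versions
```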

Java projects using Gradle


Use this section when configuring a Dataflow Java pipeline project using Gradle.
  • Shadow Jars (Fat Jars): Do NOT propose using the Shadow plugin (`com.github.johnrengelman.shadow`) unless the user explicitly requests a Fat Jar.
  • Passing command-line parameters: Use the `application` plugin for passing command-line parameters.
  • SLF4J Logging Dependency Alignment:
    • Verify the `slf4j-api` version pulled in transitively by Apache Beam.
    • You MUST configure the application's logging backend (`slf4j-simple`, `logback-classic`, etc.) to exactly match the major/minor version of the resolved `slf4j-api`.
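One way to verify the alignment rule above is to parse the resolved-dependency report printed by `./gradlew dependencies --configuration runtimeClasspath`. A sketch, assuming coordinates appear as `group:artifact:version` (a simplification of Gradle's actual report format; the helper names are hypothetical):

```python
import re

# Matches resolved Maven coordinates like "org.slf4j:slf4j-api:1.7.30"
# anywhere in a Gradle dependency report.
DEP_RE = re.compile(r"([\w.\-]+):([\w.\-]+):(\d[\w.\-]*)")

def resolved_versions(report: str) -> dict[str, str]:
    """Map artifact name -> resolved version from a dependency report."""
    versions: dict[str, str] = {}
    for _group, artifact, version in DEP_RE.findall(report):
        versions[artifact] = version
    return versions

def slf4j_aligned(report: str, backend: str = "slf4j-simple") -> bool:
    """True if the logging backend's major.minor matches slf4j-api's."""
    versions = resolved_versions(report)
    api = versions.get("slf4j-api")
    impl = versions.get(backend)
    if api is None or impl is None:
        return False
    major_minor = lambda v: tuple(v.split(".")[:2])
    return major_minor(api) == major_minor(impl)
```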

Structure the pipeline as a Dataflow Flex Template


When creating new Dataflow pipeline projects, configure them as a Flex Template. Flex Templates offer a hermetic, reproducible launch environment and are easy to launch with `gcloud` or with orchestrators like Cloud Composer. Follow the Flex Templates section below.

Flex Templates


  • Provide Instructions: In the walkthrough, provide the user with instructions for rebuilding and running Flex Templates.
  • Use a Single Docker Image for Python pipelines: For Python Flex Templates, it is better to use a single image for both the template launcher and the worker runtime environment (`--sdk_container_image`). Whenever configuring or suggesting a Dataflow Flex Template for a Python pipeline that requires extra dependencies (e.g., using `--requirements_file`, `--setup_file`, or `--extra_package`), YOU MUST recommend the Single Docker Image Configuration as detailed in python_flex_template_reference.md.
  • Prefer Cloud Build over Local Docker:
    • Do NOT assume local Docker is available on the workspace machine.
    • Action: Suggest and provide a `cloudbuild.yaml` out of the box for building and pushing images unless a local setup is explicitly requested.
    • When building images with Cloud Build in the background, you MUST provide the link where the user can monitor the long-running operation.
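A minimal `cloudbuild.yaml` in the spirit described above might look like the following sketch. The Artifact Registry path and image name are placeholders to confirm with the user; `$PROJECT_ID` is Cloud Build's built-in substitution:

```yaml
steps:
  # Build the single image used both as the Flex Template launcher
  # and as the worker SDK container (--sdk_container_image).
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t',
           'us-central1-docker.pkg.dev/$PROJECT_ID/dataflow/my-pipeline:latest', '.']
  # Push it so Dataflow workers can pull it at runtime.
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push',
           'us-central1-docker.pkg.dev/$PROJECT_ID/dataflow/my-pipeline:latest']
images:
  - 'us-central1-docker.pkg.dev/$PROJECT_ID/dataflow/my-pipeline:latest'
```

Submitting with `gcloud builds submit` prints a console link for monitoring the long-running build, which should be surfaced to the user.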

Launching Apache Beam Pipelines with Dataflow Runner


  • When launching Python pipelines without a Flex Template using `DataflowRunner`, you MUST scan the pipeline project directory for the following files:
    • `requirements.txt`:
      • If found, you MUST include the `--requirements_file` pipeline option.
    • `setup.py`:
      • If found, you MUST include the `--setup_file` pipeline option. This is critical if the pipeline uses local modules or packages.
  • When launching Python pipelines with a Flex Template, if the Flex Template image is also the SDK container image (Single Docker Image Configuration), then you MUST supply the image in the `sdk_container_image` parameter.
  • Confirm the launch command with the user.
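The scanning step above can be sketched as a small helper. The option names are real Beam pipeline options; the helper itself is hypothetical:

```python
from pathlib import Path

def dependency_options(project_dir: str) -> list[str]:
    """Build dependency-related pipeline options for DataflowRunner
    by scanning the project directory for requirements.txt / setup.py."""
    root = Path(project_dir)
    options: list[str] = []
    req = root / "requirements.txt"
    if req.is_file():
        options += ["--requirements_file", str(req)]
    setup = root / "setup.py"
    if setup.is_file():
        # Required when the pipeline imports local modules or packages.
        options += ["--setup_file", str(setup)]
    return options
```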

Lookup environment resources instead of using placeholder values


  • Avoid using generic placeholders (e.g., `your-gcp-project-id`) for GCP resources when drafting run scripts or configs. Action: If values are unknown, proactively run commands like `gcloud config get-value project` to find active resources and pre-fill scripts for the user. Confirm the values with the user before proceeding.
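The lookup above can be wrapped so scripts degrade gracefully when `gcloud` is missing or the property is unset. A sketch (the function is hypothetical; the injectable `run` parameter exists only to keep it testable):

```python
import subprocess
from typing import Optional

def lookup(property_name: str, run=subprocess.run) -> Optional[str]:
    """Return an active gcloud config value (e.g. 'project'),
    or None if gcloud is unavailable or the property is unset."""
    try:
        result = run(
            ["gcloud", "config", "get-value", property_name],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:  # gcloud not installed on this machine
        return None
    value = result.stdout.strip()
    # gcloud prints "(unset)" when the property has no value.
    return value if value and value != "(unset)" else None
```

Any value found this way should still be confirmed with the user before it is written into a run script.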