gcp-dataflow
Apache Beam Pipelines on Cloud Dataflow
Expert guidance for writing and packaging Apache Beam pipelines to run on Google
Cloud Dataflow.
Creating a new project
Use this section when creating a new project for a Dataflow pipeline.
- If the user doesn't explicitly say which language (Java, Python, Go) should be used to write the pipeline, you MUST confirm the language.
- Determine which version of the Beam SDK should be used by searching for the most recently released version of Apache Beam, unless the user already uses a particular version.
  - Action: Run a web search for the latest Apache Beam SDK release.
- YOU MUST use the same version of Apache Beam consistently throughout the project in Dockerfiles, `requirements.txt`, and other similar files where versions are specified (a consistency-check sketch follows this list).
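For illustration, a minimal sketch of such a consistency check. The file names `requirements.txt` and `Dockerfile`, the pin format, and the script itself are assumptions for this example, not part of the Beam SDK or this guide:

```python
# check_beam_version.py -- hypothetical helper, not part of Apache Beam or Dataflow.
# Assumes the Beam version is pinned as "apache-beam[gcp]==X.Y.Z" in requirements.txt
# and referenced the same way in the Dockerfile (e.g. in a "pip install" line).
import re
import sys
from pathlib import Path

PIN = re.compile(r"apache-beam(?:\[[^\]]+\])?==(\d+\.\d+\.\d+)")

def pinned_versions(path: Path) -> set:
    """Return every apache-beam version pinned in the given file (empty if absent)."""
    return set(PIN.findall(path.read_text())) if path.exists() else set()

versions = pinned_versions(Path("requirements.txt")) | pinned_versions(Path("Dockerfile"))
if len(versions) > 1:
    sys.exit(f"Inconsistent Apache Beam versions found: {sorted(versions)}")
print(f"Apache Beam pins found: {sorted(versions) or 'none'}")
```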
Java projects using Gradle
Use this section when configuring a Dataflow Java pipeline project using Gradle.
- Shadow Jars (Fat Jars): Do NOT propose to use the Shadow plugin (`com.github.johnrengelman.shadow`) unless the user explicitly requests a Fat Jar.
- Passing command-line parameters: Use the `application` plugin for passing command-line parameters.
- SLF4J Logging Dependency Alignment:
  - Verify the `slf4j-api` version pulled transitively by Apache Beam.
  - You MUST configure the application logging backend (`logback-classic`, `slf4j-simple`, etc.) to exactly match the major/minor version of the resolved `slf4j-api`.
Structure the pipeline as a Dataflow Flex Template
When creating new Dataflow pipeline projects, configure them as a Flex Template. Flex Templates offer a hermetic and reproducible launch environment, and are easy to launch with `gcloud` or with orchestrators like Cloud Composer. Follow the Flex Templates section below.
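For example, once a template has been built (see the Flex Templates section below), launching it with `gcloud` might look like the following sketch; the job name, bucket, region, and parameters are placeholders, and the template spec must already exist at the given GCS path:

```python
# Hypothetical launch of a pre-built Flex Template via gcloud; all values are placeholders.
import subprocess

subprocess.run(
    [
        "gcloud", "dataflow", "flex-template", "run", "wordcount-example",
        "--template-file-gcs-location=gs://MY_BUCKET/templates/wordcount.json",
        "--region=us-central1",
        "--parameters=output=gs://MY_BUCKET/results/out",
    ],
    check=True,  # raise if the launch command fails
)
```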
Flex Templates
- Provide Instructions: Provide instructions on rebuilding and running Flex Templates to the user in a walkthrough.
- Use Single Docker Image for Python pipelines: For Python Flex Templates, it is better to use a single image for the template launcher image and for the worker runtime environment (`--sdk_container_image`). Whenever configuring or suggesting a Dataflow Flex Template for a Python pipeline that requires extra dependencies (e.g., using `--requirements_file`, `--setup_file`, or `--extra_package`), YOU MUST recommend the Single Docker Image Configuration as detailed in python_flex_template_reference.md.
- Prefer Cloud Build over Local Docker:
  - Do NOT assume local Docker availability on the workspace machine.
  - Action: Suggest and provide an out-of-the-box `cloudbuild.yaml` for building and pushing images unless local setup is explicitly requested.
  - When building images with Cloud Build in the background, you MUST provide the link where the user can monitor the long-running operation (see the sketch after this list).
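A minimal sketch of that Cloud Build flow, assuming a `cloudbuild.yaml` exists in the current directory and the active `gcloud` project should be used; reading the monitoring link from the `logUrl` field of the build resource is an assumption to verify against the Cloud Build documentation:

```python
# Hypothetical: submit an image build to Cloud Build asynchronously and surface a
# link the user can follow while it runs. The "logUrl" field name is assumed from
# the Cloud Build Build resource and should be double-checked for your gcloud version.
import subprocess

result = subprocess.run(
    ["gcloud", "builds", "submit", "--config=cloudbuild.yaml", "--async",
     "--format=value(logUrl)", "."],
    check=True, capture_output=True, text=True,
)
print(f"Cloud Build started; monitor progress at: {result.stdout.strip()}")
```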
Launching Apache Beam Pipelines with Dataflow Runner
- When launching Python Pipelines without a Flex Template with `DataflowRunner`, you MUST scan the pipeline project directory for the following files (see the launch sketch after this list):
  - `requirements.txt`: If found, you MUST include the `--requirements_file` pipeline option.
  - `setup.py`: If found, you MUST include the `--setup_file` pipeline option. This is critical if the pipeline uses local modules or packages.
- When launching Python Pipelines with a Flex Template, if the Flex Template image is also the SDK Container image (Single Docker Image Configuration), then you MUST supply the image in the `sdk_container_image` parameter.
- Confirm the launch command with the user.
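For the non-template case above, a minimal sketch of the resulting launch from Python; the project, region, bucket, and job name are placeholders, and the pipeline body is a trivial stand-in. Keeping all flags in one list makes it easy to echo the full command back to the user for confirmation:

```python
# Hypothetical DataflowRunner launch; every GCP value below is a placeholder.
# --requirements_file and --setup_file are included because this example project
# is assumed to contain both requirements.txt and setup.py.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=MY_PROJECT",
    "--region=us-central1",
    "--temp_location=gs://MY_BUCKET/tmp",
    "--job_name=example-launch",
    "--requirements_file=requirements.txt",
    "--setup_file=./setup.py",
])

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "Create" >> beam.Create(["hello", "dataflow"])
     | "Print" >> beam.Map(print))
```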
Lookup environment resources instead of using placeholder values
- Avoid using generic placeholders (e.g., `your-gcp-project-id`) for GCP resources when drafting run scripts or configs. Action: If values are unknown, proactively run commands like `gcloud config get-value project` to find active resources and pre-fill scripts for the user, as sketched below. Confirm the values with the user before proceeding.
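A minimal sketch of that lookup; only `gcloud config get-value project` comes from this guide, while the fallback prompt and the handling of an unset value are assumptions:

```python
# Hypothetical helper: pre-fill the GCP project id instead of leaving a placeholder,
# then let the user confirm it before any launch command is run.
import subprocess

def active_gcp_project() -> str:
    """Return the active gcloud project, or an empty string if none is configured."""
    result = subprocess.run(
        ["gcloud", "config", "get-value", "project"],
        capture_output=True, text=True, check=False,
    )
    value = result.stdout.strip()
    return "" if value in ("", "(unset)") else value

project = active_gcp_project() or input("GCP project id to use: ")
print(f"Using project: {project} -- confirm before proceeding.")
```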