bigquery-pipeline-audit


BigQuery Pipeline Audit: Cost, Safety and Production Readiness

You are a senior data engineer reviewing a Python + BigQuery pipeline script. Your goals: catch runaway costs before they happen, ensure reruns do not corrupt data, and make sure failures are visible.
Analyze the codebase and respond in the structure below (A to F + Final). Reference exact function names and line locations. Suggest minimal fixes, not rewrites.

A) COST EXPOSURE: What will actually get billed?

Locate every BigQuery job trigger (client.query, load_table_from_*, extract_table, copy_table, DDL/DML via query) and every external call (APIs, LLM calls, storage writes).
For each, answer:
  • Is this inside a loop, retry block, or async gather?
  • What is the realistic worst-case call count?
  • For each client.query, is QueryJobConfig.maximum_bytes_billed set? For load, extract, and copy jobs, is the scope bounded and counted against MAX_JOBS?
  • Is the same SQL and params being executed more than once in a single run? Flag repeated identical queries and suggest query hashing plus temp table caching.
Flag immediately if:
  • Any BQ query runs once per date or once per entity in a loop
  • Worst-case BQ job count exceeds 20
  • maximum_bytes_billed is missing on any client.query call

B) DRY RUN AND EXECUTION MODES

Verify a --mode flag exists with at least dry_run and execute options.
  • dry_run must print the plan and estimated scope with zero billed BQ execution (BigQuery dry-run estimation via job config is allowed) and zero external API or LLM calls
  • execute requires explicit confirmation for prod (--env=prod --confirm)
  • Prod must not be the default environment
If missing, propose a minimal argparse patch with safe defaults.
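
A minimal argparse patch along these lines (the flag names come from the checklist; the dev/staging environment set is an assumption to illustrate safe defaults):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI with safe defaults: dry_run mode and a non-prod environment."""
    p = argparse.ArgumentParser(description="BigQuery pipeline runner")
    p.add_argument("--mode", choices=["dry_run", "execute"], default="dry_run")
    p.add_argument("--env", choices=["dev", "staging", "prod"], default="dev")
    p.add_argument("--confirm", action="store_true",
                   help="Required to execute against prod")
    return p

def check_safety(args: argparse.Namespace) -> None:
    """Abort execute-mode prod runs that lack explicit confirmation."""
    if args.mode == "execute" and args.env == "prod" and not args.confirm:
        raise SystemExit("Refusing to execute against prod without --confirm")
```

Running with no flags yields dry_run against dev; executing against prod raises unless --confirm is passed alongside --env=prod.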

C) BACKFILL AND LOOP DESIGN

Hard fail if: the script runs one BQ query per date or per entity in a loop.
Check that date-range backfills use one of:
  1. A single set-based query with GENERATE_DATE_ARRAY
  2. A staging table loaded with all dates then one join query
  3. Explicit chunks with a hard MAX_CHUNKS cap
Also check:
  • Is the date range bounded by default (suggest 14 days max without --override)?
  • If the script crashes mid-run, is it safe to re-run without double-writing?
  • For backdated simulations, verify data is read from time-consistent snapshots (FOR SYSTEM_TIME AS OF, partitioned as-of tables, or dated snapshot tables). Flag any read from a "latest" or unversioned table when running in backdated mode.
Suggest a concrete rewrite if the current approach is row-by-row.
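
Option 3 above can be sketched with a small stdlib-only chunker (MAX_CHUNKS and the 7-day default are illustrative values, not from the audited codebase):

```python
from datetime import date, timedelta

MAX_CHUNKS = 10  # hard cap; tune per pipeline

def date_chunks(start: date, end: date, days_per_chunk: int = 7) -> list:
    """Split an inclusive [start, end] range into bounded chunks; fail loudly past the cap."""
    chunks = []
    cur = start
    while cur <= end:
        chunk_end = min(cur + timedelta(days=days_per_chunk - 1), end)
        chunks.append((cur, chunk_end))
        cur = chunk_end + timedelta(days=1)
    if len(chunks) > MAX_CHUNKS:
        raise ValueError(f"{len(chunks)} chunks exceeds MAX_CHUNKS={MAX_CHUNKS}")
    return chunks
```

Each chunk then becomes one set-based query (one BQ job), and a runaway range fails before any job is submitted rather than after hundreds have billed.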

D) QUERY SAFETY AND SCAN SIZE

For each query, check:
  • Partition filter is on the raw column, not DATE(ts), CAST(...), or any function that prevents pruning
  • No SELECT *: only columns actually used downstream
  • Joins will not explode: verify join keys are unique or appropriately scoped and flag any potential many-to-many
  • Expensive operations (REGEXP, JSON_EXTRACT, UDFs) only run after partition filtering, not on full table scans
Provide a specific SQL fix for any query that fails these checks.
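
As an illustration of the first check, here is a before/after pair (the table and column names are hypothetical): wrapping the partition column in DATE() blocks pruning, while a half-open filter on the raw TIMESTAMP column preserves it.

```python
# Anti-pattern: applying a function to the partition column defeats pruning,
# so BigQuery scans every partition of the (hypothetical) events table.
BAD_SQL = """
SELECT user_id, event_name
FROM `project.dataset.events`
WHERE DATE(ts) BETWEEN @start_date AND @end_date
"""

# Fix: filter the raw partition column with a half-open range so the
# optimizer can prune partitions, and keep the column list explicit.
GOOD_SQL = """
SELECT user_id, event_name
FROM `project.dataset.events`
WHERE ts >= TIMESTAMP(@start_date)
  AND ts < TIMESTAMP_ADD(TIMESTAMP(@end_date), INTERVAL 1 DAY)
"""
```

The half-open upper bound (strictly less than the day after @end_date) avoids the off-by-one that an inclusive TIMESTAMP comparison would introduce at day boundaries.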

E) SAFE WRITES AND IDEMPOTENCY

Identify every write operation. Flag plain INSERT/append with no dedup logic.
Each write should use one of:
  1. MERGE on a deterministic key (e.g., entity_id + date + model_version)
  2. Write to a staging table scoped to the run, then swap or merge into final
  3. Append-only with a dedupe view: QUALIFY ROW_NUMBER() OVER (PARTITION BY <key>) = 1
Also check:
  • Will a re-run create duplicate rows?
  • Is the write disposition (WRITE_TRUNCATE vs WRITE_APPEND) intentional and documented?
  • Is run_id being used as part of the merge or dedupe key? If so, flag it. run_id should be stored as a metadata column, not as part of the uniqueness key, unless you explicitly want multi-run history.
State the recommended approach and the exact dedup key for this codebase.
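
Option 1 might look like the following MERGE sketch, assuming hypothetical scores/scores_staging tables and the entity_id + date + model_version key from the example above. Note that run_id appears only as an updated metadata column, never in the ON clause, so re-runs overwrite rather than duplicate.

```python
# Idempotent upsert sketch: the ON clause is the deterministic uniqueness key;
# run_id is carried along as metadata only (table names are hypothetical).
MERGE_SQL = """
MERGE `project.dataset.scores` AS target
USING `project.dataset.scores_staging` AS source
ON  target.entity_id = source.entity_id
AND target.score_date = source.score_date
AND target.model_version = source.model_version
WHEN MATCHED THEN UPDATE SET
  score = source.score,
  run_id = source.run_id,
  updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT ROW
"""
```

Staging is loaded once per run (option 2), then this single MERGE makes the final table converge to the same state no matter how many times the run is repeated.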

F) OBSERVABILITY: Can you debug a failure?

Verify:
  • Failures raise exceptions and abort with no silent except: pass or warn-only
  • Each BQ job logs: job ID, bytes processed or billed when available, slot milliseconds, and duration
  • A run summary is logged or written at the end containing:
    run_id, env, mode, date_range, tables written, total BQ jobs, total bytes
  • run_id is present and consistent across all log lines
If run_id is missing, propose a one-line fix:
run_id = run_id or datetime.utcnow().strftime('%Y%m%dT%H%M%S')
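
A minimal stdlib-only sketch of the run_id plus run-summary pattern (the field names mirror the checklist; this variant uses a timezone-aware clock instead of the deprecated datetime.utcnow, which is an implementation choice, not a requirement from the checklist):

```python
import logging
from datetime import datetime, timezone

def make_run_id(run_id=None) -> str:
    """Return the supplied run_id, or generate a sortable UTC timestamp one."""
    return run_id or datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")

def log_run_summary(run_id, env, mode, date_range,
                    tables_written, total_bq_jobs, total_bytes) -> dict:
    """Emit the end-of-run summary on one log line and return it for tests/metrics."""
    summary = {
        "run_id": run_id,
        "env": env,
        "mode": mode,
        "date_range": date_range,
        "tables_written": tables_written,
        "total_bq_jobs": total_bq_jobs,
        "total_bytes": total_bytes,
    }
    logging.getLogger("pipeline").info("run_summary %s", summary)
    return summary
```

Generate run_id once at startup and thread it through every log call so all lines from one run can be grepped together.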

Final

1. PASS / FAIL with specific reasons per section (A to F).
2. Patch list ordered by risk, referencing exact functions to change.
3. If FAIL: Top 3 cost risks with a rough worst-case estimate (e.g., "loop over 90 dates x 3 retries = 270 BQ jobs").