data-throughput-accelerator

Data Throughput Accelerator

Use this skill when the bottleneck is moving, transforming, or saving large volumes of data. The goal is not just speed: it is correct data landing in the right place faster, with proof.

First Distinction

Separate these before optimizing:
  • source extraction speed;
  • network transfer speed;
  • warehouse/load speed;
  • transform speed;
  • serving-table freshness;
  • live tail growth while the job runs.
A pipeline can be "fast" and still appear behind if new data arrives faster than the final catch-up pass can absorb it.
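
The point about live tail growth can be made concrete with a little arithmetic. The sketch below is illustrative only: `StageRates`, `effective_rate`, and `minutes_to_catch_up` are hypothetical names, and it assumes a strictly sequential pipeline whose rates are all measured in files per minute.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StageRates:
    """Hypothetical per-stage throughput, all in files per minute."""
    extract: float
    transfer: float
    load: float
    transform: float


def effective_rate(stages: StageRates) -> float:
    # Assumption: stages run in sequence, so the slowest one sets the pace.
    return min(stages.extract, stages.transfer, stages.load, stages.transform)


def minutes_to_catch_up(backlog_files: float, stages: StageRates,
                        arrival_rate: float) -> Optional[float]:
    """Minutes until the backlog reaches zero, or None if it never does."""
    net = effective_rate(stages) - arrival_rate
    if net <= 0:
        # The live tail grows at least as fast as the pipeline drains it.
        return None
    return backlog_files / net
```

With a slowest stage of 50 files per minute and 20 new files arriving per minute, a 300-file backlog clears in 10 minutes. If arrivals reach 50 files per minute, the same pipeline never catches up, however fast the other stages are.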

Fast Path Heuristics

  • Move compute to where the data already is.
  • Prefer warehouse-native scans, joins, and appends for large landed files.
  • Use manifests or checkpoints so completed files/partitions are skipped.
  • Use partitioning and clustering that match the read and append pattern.
  • Batch small files, requests, and writes.
  • Make writes idempotent through unique keys, manifests, or replaceable staging.
  • Keep raw, derived, and serving tables separately accountable.
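
Three of the heuristics above (skip completed files through a manifest, batch writes, keep writes idempotent through unique keys) can be sketched together. This is a minimal illustration, not a prescribed implementation: it uses an in-memory SQLite database as a stand-in for a warehouse, and the table names `manifest` and `raw_events`, plus the helpers `open_store` and `load_files`, are assumptions made for the example.

```python
import sqlite3
from typing import Dict, List, Tuple


def open_store() -> sqlite3.Connection:
    """Create a throwaway store with a manifest and one raw table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE manifest (file_name TEXT PRIMARY KEY, row_count INTEGER)")
    conn.execute(
        "CREATE TABLE raw_events (event_id TEXT PRIMARY KEY, payload TEXT)")
    return conn


def load_files(conn: sqlite3.Connection,
               files: Dict[str, List[Tuple[str, str]]]) -> int:
    """Load each file at most once; return the files processed this run."""
    done = {name for (name,) in conn.execute("SELECT file_name FROM manifest")}
    processed = 0
    for name, rows in files.items():
        if name in done:
            continue  # the manifest says this file already landed
        with conn:  # one transaction: rows and manifest entry commit together
            # Batched write; the unique key makes a replayed row a no-op.
            conn.executemany(
                "INSERT OR IGNORE INTO raw_events VALUES (?, ?)", rows)
            # row_count records rows in the file, not rows newly inserted.
            conn.execute(
                "INSERT INTO manifest VALUES (?, ?)", (name, len(rows)))
        processed += 1
    return processed
```

Running `load_files` twice over the same input processes every file on the first call and none on the second, which is the property that makes a catch-up job safe to rerun.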

Workflow

  1. Read the current source, target, and manifest contracts.
  2. Measure backlog: external files, manifest rows, raw rows, derived rows, min/max timestamps, and unprocessed counts.
  3. Run a safe catch-up or sample benchmark.
  4. Compare variants: batch size, worker count, warehouse SQL, file grouping, staging shape, and manifest update method.
  5. Promote only the fastest path that keeps counts and timestamps coherent.
  6. Codify the path as a CLI, scheduled job, workflow, or runbook.
  7. Rerun final accounting after the codified path executes.
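
Steps 4 and 5 can be sketched as a small harness that times each variant on a sample and discards any variant whose result disagrees with the expected accounting. The `Variant` type and the `pick_fastest_coherent` helper are hypothetical, and a single timed run per variant is a simplification; a real benchmark would likely repeat runs.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[str, int]  # (event_id, timestamp), an assumed sample shape
# A variant consumes the sample and reports (row_count, max_timestamp).
Variant = Callable[[List[Row]], Tuple[int, int]]


def pick_fastest_coherent(variants: Dict[str, Variant], sample: List[Row],
                          expected: Tuple[int, int]) -> Optional[str]:
    """Return the fastest variant whose accounting matches `expected`."""
    best_name: Optional[str] = None
    best_time = float("inf")
    for name, run in variants.items():
        start = time.perf_counter()
        result = run(sample)
        elapsed = time.perf_counter() - start
        if result != expected:
            continue  # fast but incoherent is never promoted
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name
```

The correctness check runs before the speed comparison on purpose: a variant that drops rows is excluded even if it finishes first, and `None` is returned when no variant is coherent.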

Accounting Output

Use a hard accounting block:
```text
Data throughput result:
- Source files discovered: 294
- Files processed this run: 294
- Raw rows added: 9,683,598
- Derived rows added: 8,917,585
- Remaining tail: 24 files at readback time
- Runtime: 38.7s
- Correctness gate: manifest counts and table max timestamps match
```
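
One way to keep this block consistent from run to run is to render it from a structured record instead of typing it by hand. The sketch below is an assumption about how that might look: `RunAccounting` and `render` are hypothetical names, and the wording of the failed-gate line is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunAccounting:
    """Hypothetical record of one run, measured at readback time."""
    files_discovered: int
    files_processed: int
    raw_rows_added: int
    derived_rows_added: int
    remaining_tail_files: int
    runtime_seconds: float
    gate_passed: bool


def render(acct: RunAccounting) -> str:
    """Format the record in the layout of the accounting block above."""
    gate = ("manifest counts and table max timestamps match"
            if acct.gate_passed
            else "FAILED, do not report this run as complete")  # assumed text
    return "\n".join([
        "Data throughput result:",
        f"- Source files discovered: {acct.files_discovered}",
        f"- Files processed this run: {acct.files_processed}",
        f"- Raw rows added: {acct.raw_rows_added:,}",
        f"- Derived rows added: {acct.derived_rows_added:,}",
        f"- Remaining tail: {acct.remaining_tail_files} files at readback time",
        f"- Runtime: {acct.runtime_seconds:.1f}s",
        f"- Correctness gate: {gate}",
    ])
```

Because the gate line is derived from `gate_passed`, a run cannot print a passing correctness gate unless the check was actually evaluated.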

Guardrails

  • Do not delete raw data to make a metric look better.
  • Do not skip failed files silently.
  • Do not mix historical backfill status with live-tail freshness.
  • Do not call a pipeline complete until the target tables and manifest agree.
  • For finance, healthcare, regulated, or customer-impacting data, preserve replay evidence and approval gates.
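
Two of these guardrails (no silently skipped failures, no "complete" until target tables and manifest agree) lend themselves to an explicit check. The sketch below assumes a manifest shaped as a dictionary of per-file entries with `status`, `rows`, and `max_ts` fields; that shape and the `completion_problems` helper are hypothetical.

```python
from typing import Dict, List, Optional


def completion_problems(manifest: Dict[str, dict],
                        rows_in_target: Dict[str, int],
                        target_max_ts: Optional[int]) -> List[str]:
    """List every reason the run is not complete; empty means complete."""
    problems: List[str] = []
    done_max_ts: Optional[int] = None
    for name, entry in sorted(manifest.items()):
        if entry["status"] != "done":
            # Failed or pending files are reported, never skipped silently.
            problems.append(f"{name}: status is {entry['status']}")
            continue
        landed = rows_in_target.get(name, 0)
        if landed != entry["rows"]:
            problems.append(
                f"{name}: manifest says {entry['rows']} rows, "
                f"target has {landed}")
        if done_max_ts is None or entry["max_ts"] > done_max_ts:
            done_max_ts = entry["max_ts"]
    if done_max_ts != target_max_ts:
        problems.append(
            f"max timestamp: manifest {done_max_ts}, target {target_max_ts}")
    return problems
```

Returning a list of problems instead of a boolean keeps the failure evidence available for the accounting block and for any approval gate that has to review it.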