data-pipeline-engineer


Data Pipeline Engineer

Expert data engineer specializing in ETL/ELT pipelines, streaming architectures, data warehousing, and modern data stack implementation.

Quick Start

  1. Identify sources - data formats, volumes, freshness requirements
  2. Choose architecture - Medallion (Bronze/Silver/Gold), Lambda, or Kappa
  3. Design layers - staging → intermediate → marts (dbt pattern)
  4. Add quality gates - Great Expectations or dbt tests at each layer
  5. Orchestrate - Airflow DAGs with sensors and retries
  6. Monitor - lineage, freshness, anomaly detection
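
The steps above can be sketched end-to-end in miniature. This is a hypothetical pure-Python illustration (all names and values are made up), compressing staging, a blocking quality gate, and a mart-style aggregate into one function:

```python
def run_pipeline(raw):
    """Toy pipeline: stage -> quality gate -> mart aggregate."""
    # Staging: select only the contracted columns from the raw rows.
    staged = [{"id": r["id"], "value": r["value"]} for r in raw]
    # Quality gate: block the run instead of letting bad data through.
    assert all(r["id"] is not None for r in staged), "quality gate failed"
    # Mart: a small business-facing aggregate.
    return {"row_count": len(staged), "total": sum(r["value"] for r in staged)}

print(run_pipeline([{"id": 1, "value": 2.0}, {"id": 2, "value": 3.0}]))
# {'row_count': 2, 'total': 5.0}
```

In a real stack each stage would be a dbt model or Airflow task; the point is the ordering: land, validate, then aggregate.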

Core Capabilities

| Capability | Technologies | Key Patterns |
|---|---|---|
| Batch Processing | Spark, dbt, Databricks | Incremental, partitioning, Delta/Iceberg |
| Stream Processing | Kafka, Flink, Spark Streaming | Watermarks, exactly-once, windowing |
| Orchestration | Airflow, Dagster, Prefect | DAG design, sensors, task groups |
| Data Modeling | dbt, SQL | Kimball, Data Vault, SCD |
| Data Quality | Great Expectations, dbt tests | Validation suites, freshness |

Architecture Patterns

Medallion Architecture (Recommended)

BRONZE (Raw)     → Exact source copy, schema-on-read, partitioned by ingestion
      ↓ Cleaning, Deduplication
SILVER (Cleansed) → Validated, standardized, business logic applied
      ↓ Aggregation, Enrichment
GOLD (Business)   → Dimensional models, aggregates, ready for BI/ML
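
The three layers can be illustrated with a minimal pure-Python sketch on toy records (the record shapes and values are hypothetical, not from the reference implementations):

```python
# Bronze: exact source copy, including a duplicate and an invalid row.
raw_events = [
    {"user_id": "1", "amount": "10.5", "ingested": "2024-01-01"},
    {"user_id": "1", "amount": "10.5", "ingested": "2024-01-01"},  # duplicate
    {"user_id": "2", "amount": "bad", "ingested": "2024-01-01"},   # invalid
]

def to_silver(bronze):
    """Bronze -> Silver: deduplicate, validate, and standardize types."""
    seen, silver = set(), []
    for row in bronze:
        key = (row["user_id"], row["amount"], row["ingested"])
        if key in seen:
            continue
        seen.add(key)
        try:
            silver.append({**row, "amount": float(row["amount"])})
        except ValueError:
            pass  # in practice, route rejects to a quarantine table
    return silver

def to_gold(silver):
    """Silver -> Gold: aggregate into a business-facing shape."""
    totals = {}
    for row in silver:
        totals[row["user_id"]] = totals.get(row["user_id"], 0.0) + row["amount"]
    return totals

silver = to_silver(raw_events)
gold = to_gold(silver)
print(gold)  # {'1': 10.5}
```

Because Bronze is preserved untouched, both downstream layers can be rebuilt from it at any time.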

Lambda vs Kappa

  • Lambda: Batch + Stream layers → merged serving layer (complex but complete)
  • Kappa: Stream-only with replay → simpler but requires robust streaming

Reference Examples

Full implementation examples in `./references/`:
| File | Description |
|---|---|
| dbt-project-structure.md | Complete dbt layout with staging, intermediate, marts |
| airflow-dag.py | Production DAG with sensors, task groups, quality checks |
| spark-streaming.py | Kafka-to-Delta processor with windowing |
| great-expectations-suite.json | Comprehensive data quality expectation suite |

Anti-Patterns (10 Critical Mistakes)

1. Full Table Refreshes

Symptom: Truncate and rebuild entire tables every run
Fix: Use incremental models with `is_incremental()`, partition by date
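
The high-watermark filter behind dbt's `is_incremental()` pattern can be sketched in plain Python. The in-memory "tables" here are hypothetical stand-ins for a warehouse source and target:

```python
# Target already holds yesterday's partition; source has one new row.
target = [{"order_id": 1, "order_date": "2024-01-01"}]
source = [
    {"order_id": 1, "order_date": "2024-01-01"},
    {"order_id": 2, "order_date": "2024-01-02"},
]

def incremental_load(source, target):
    """Append only rows newer than the target's high-watermark,
    mirroring dbt's is_incremental() + max(date) filter."""
    watermark = max((r["order_date"] for r in target), default="")
    new_rows = [r for r in source if r["order_date"] > watermark]
    target.extend(new_rows)
    return len(new_rows)

loaded = incremental_load(source, target)
print(loaded)  # 1
```

Re-running the load processes zero rows instead of rebuilding the whole table, which is the entire point of the fix.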

2. Tight Coupling to Source Schemas

Symptom: Pipeline breaks when upstream adds/removes columns
Fix: Explicit source contracts; select only needed columns in staging
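
A minimal staging-layer sketch of an explicit source contract (the column names are illustrative assumptions):

```python
# The contract: the only columns downstream models are allowed to depend on.
CONTRACT = ("order_id", "customer_id", "amount")

def stage(raw_row):
    """Project a raw source row onto the contracted columns, failing
    loudly if the source drops one rather than silently passing nulls."""
    missing = [c for c in CONTRACT if c not in raw_row]
    if missing:
        raise ValueError(f"source contract violated, missing: {missing}")
    return {c: raw_row[c] for c in CONTRACT}

row = {"order_id": 7, "customer_id": 3, "amount": 9.99, "new_upstream_col": "x"}
print(stage(row))  # {'order_id': 7, 'customer_id': 3, 'amount': 9.99}
```

An upstream team adding `new_upstream_col` changes nothing downstream; removing `amount` fails at staging, not in a mart.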

3. Monolithic DAGs

Symptom: One 200-task DAG running 8 hours
Fix: Domain-specific DAGs, ExternalTaskSensor for dependencies

4. No Data Quality Gates

Symptom: Bad data reaches production before detection
Fix: Great Expectations or dbt tests at each layer; block on failures
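
A blocking quality gate reduces to "run every check, raise on the first failure so nothing downstream executes". A hypothetical pure-Python sketch (the check names and row shape are made up):

```python
def quality_gate(rows, checks):
    """Run named row-level checks; raise (blocking the pipeline) on failure."""
    for name, check in checks.items():
        bad = [r for r in rows if not check(r)]
        if bad:
            raise ValueError(f"quality gate '{name}' failed for {len(bad)} rows")
    return rows

rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.0}]
checks = {
    "id_not_null": lambda r: r["id"] is not None,
    "amount_positive": lambda r: r["amount"] > 0,
}
quality_gate(rows, checks)  # passes; a negative amount would raise
```

Great Expectations and dbt tests play the same role: declarative checks whose failure halts the DAG instead of being logged and ignored.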

5. Processing Before Archiving

Symptom: Raw data transformed without preserving the original
Fix: Always land raw data in Bronze first; make transformations reproducible

6. Hardcoded Dates in Queries

Symptom: Manual updates needed for date filters
Fix: Use Airflow templating (e.g., the `ds` variable) or dynamic date functions
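
The fix amounts to deriving the filter from the run's logical date rather than a literal. A small sketch, assuming the column name `event_date` (Airflow would supply the date as the templated `ds` value):

```python
from datetime import date

def partition_filter(logical_date: date) -> str:
    """Build the date predicate from the run's logical date instead of
    a hardcoded literal, so backfills and daily runs need no edits."""
    ds = logical_date.isoformat()  # the same YYYY-MM-DD shape as Airflow's ds
    return f"event_date = '{ds}'"

print(partition_filter(date(2024, 1, 2)))  # event_date = '2024-01-02'
```

Because the predicate is a pure function of the logical date, re-running any historical date produces the correct filter automatically.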

7. Missing Watermarks in Streaming

Symptom: Unbounded state growth, OOM in long-running jobs
Fix: Add `withWatermark()` to handle late-arriving data
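
What the watermark buys you can be shown without Spark: a sketch of windowed counting where state for windows older than the watermark is finalized and dropped, and events arriving past it are discarded (window and watermark sizes here are arbitrary illustrative values):

```python
def windowed_counts(events, window=10, watermark=5):
    """Count events per tumbling window, keeping state only for windows
    the watermark has not yet passed -- the role of withWatermark()."""
    state, closed, max_ts = {}, {}, 0
    for ts in events:
        max_ts = max(max_ts, ts)
        win = ts - ts % window
        if win < max_ts - watermark:
            continue  # too late: arrived past the watermark, dropped
        state[win] = state.get(win, 0) + 1
        # Finalize windows the watermark has passed, freeing their state.
        for w in [w for w in state if w + window <= max_ts - watermark]:
            closed[w] = state.pop(w)
    return closed, state

closed, open_state = windowed_counts([1, 3, 12, 25, 2])
print(closed, open_state)  # {0: 2, 10: 1} {20: 1}
```

Without the watermark check, `state` grows with every window ever seen, which is exactly the unbounded-state/OOM symptom above.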

8. No Retry/Backoff Strategy

Symptom: Transient failures cause DAG failures
Fix: Set `retries=3`, `retry_exponential_backoff=True`, `max_retry_delay`
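
The semantics of those three Airflow task arguments can be mirrored in a few lines of plain Python (the delays here are shortened illustrative values):

```python
import time

def with_retries(task, retries=3, base_delay=0.01, max_retry_delay=1.0):
    """Retry a callable with capped exponential backoff, mirroring
    Airflow's retries / retry_exponential_backoff / max_retry_delay."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the real failure
            time.sleep(min(base_delay * 2 ** attempt, max_retry_delay))

calls = {"n": 0}
def flaky():
    """Hypothetical task that fails twice with a transient error."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = with_retries(flaky)
print(result)  # ok
```

The cap (`max_retry_delay`) matters as much as the exponent: without it, late attempts can wait far longer than the outage they are riding out.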

9. Undocumented Data Lineage

Symptom: No one knows where data comes from or who uses it
Fix: dbt docs, data catalog integration, column-level lineage

10. Testing Only in Production

Symptom: Bugs discovered by stakeholders, not engineers
Fix: Use dbt `--target dev`, sample datasets, and CI/CD for models

Quality Checklist

Pipeline Design:
  • Incremental processing where possible
  • Idempotent transformations (re-runnable safely)
  • Partitioning strategy defined and documented
  • Backfill procedures documented
Data Quality:
  • Tests at Bronze layer (schema, nulls, ranges)
  • Tests at Silver layer (business rules, referential integrity)
  • Tests at Gold layer (aggregation checks, trend monitoring)
  • Anomaly detection for volumes and distributions
Orchestration:
  • Retry and alerting configured
  • SLAs defined and monitored
  • Cross-DAG dependencies use sensors
  • max_active_runs prevents parallel conflicts
Operations:
  • Data lineage documented
  • Runbooks for common failures
  • Monitoring dashboards for pipeline health
  • On-call procedures defined
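
The "idempotent transformations" item deserves one concrete picture: a partition-replace write, sketched here in pure Python (the `dt` partition column is an illustrative assumption; in a warehouse this is delete-insert or INSERT OVERWRITE):

```python
def idempotent_write(target, partition, rows):
    """Replace the partition wholesale so re-running the same load
    never duplicates rows -- the delete-insert / overwrite pattern."""
    target[:] = [r for r in target if r["dt"] != partition]
    target.extend(rows)

target = []
batch = [{"dt": "2024-01-01", "id": 1}]
idempotent_write(target, "2024-01-01", batch)
idempotent_write(target, "2024-01-01", batch)  # safe re-run after a failure
print(len(target))  # 1
```

A plain append would leave two copies after the re-run; replace-by-partition makes "just rerun the task" a safe operational answer.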

Validation Script

Run `./scripts/validate-pipeline.sh` to check:
  • dbt project structure and conventions
  • Airflow DAG best practices
  • Spark job configurations
  • Data quality setup

External Resources
