data-pipeline-engineer
Data Pipeline Engineer
Expert data engineer specializing in ETL/ELT pipelines, streaming architectures, data warehousing, and modern data stack implementation.
Quick Start
- Identify sources - data formats, volumes, freshness requirements
- Choose architecture - Medallion (Bronze/Silver/Gold), Lambda, or Kappa
- Design layers - staging → intermediate → marts (dbt pattern)
- Add quality gates - Great Expectations or dbt tests at each layer
- Orchestrate - Airflow DAGs with sensors and retries
- Monitor - lineage, freshness, anomaly detection
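The layering in steps 3 and 5 amounts to a dependency graph that the orchestrator executes in order. A minimal sketch with the stdlib `graphlib` (the model names are hypothetical; Airflow or Dagster provide this scheduling for real DAGs):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps wired staging → intermediate → marts.
dag = {
    "stg_orders": set(),
    "stg_customers": set(),
    "int_orders_enriched": {"stg_orders", "stg_customers"},
    "mart_revenue": {"int_orders_enriched"},
}

# static_order() yields each step only after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)  # staging models first, mart last
```

An orchestrator adds retries, sensors, and scheduling on top of exactly this ordering.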
Core Capabilities
| Capability | Technologies | Key Patterns |
|---|---|---|
| Batch Processing | Spark, dbt, Databricks | Incremental, partitioning, Delta/Iceberg |
| Stream Processing | Kafka, Flink, Spark Streaming | Watermarks, exactly-once, windowing |
| Orchestration | Airflow, Dagster, Prefect | DAG design, sensors, task groups |
| Data Modeling | dbt, SQL | Kimball, Data Vault, SCD |
| Data Quality | Great Expectations, dbt tests | Validation suites, freshness |
Architecture Patterns
Medallion Architecture (Recommended)
BRONZE (Raw)      → Exact source copy, schema-on-read, partitioned by ingestion
      ↓ Cleaning, Deduplication
SILVER (Cleansed) → Validated, standardized, business logic applied
      ↓ Aggregation, Enrichment
GOLD (Business)   → Dimensional models, aggregates, ready for BI/ML
Lambda vs Kappa
- Lambda: Batch + Stream layers → merged serving layer (complex but complete)
- Kappa: Stream-only with replay → simpler but requires robust streaming
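The Bronze → Silver → Gold flow can be sketched in plain Python over in-memory records (a minimal illustration of the layer responsibilities, not a Spark/dbt implementation; the record fields are hypothetical):

```python
from collections import defaultdict

def to_silver(bronze_rows):
    """Clean and deduplicate raw Bronze rows into Silver."""
    seen, silver = set(), []
    for row in bronze_rows:
        key = row["order_id"]
        if key in seen or row["amount"] is None:  # dedupe + basic validation
            continue
        seen.add(key)
        silver.append({**row, "amount": round(float(row["amount"]), 2)})
    return silver

def to_gold(silver_rows):
    """Aggregate Silver rows into a Gold revenue-by-customer mart."""
    revenue = defaultdict(float)
    for row in silver_rows:
        revenue[row["customer"]] += row["amount"]
    return dict(revenue)

bronze = [  # raw landing zone: duplicates and bad records preserved as-is
    {"order_id": 1, "customer": "a", "amount": "10.5"},
    {"order_id": 1, "customer": "a", "amount": "10.5"},  # duplicate
    {"order_id": 2, "customer": "b", "amount": None},    # invalid
    {"order_id": 3, "customer": "a", "amount": "4.5"},
]
print(to_gold(to_silver(bronze)))  # → {'a': 15.0}
```

Note that Bronze keeps the duplicate and the invalid row untouched; cleansing happens only on the way to Silver, so every downstream result stays reproducible from the raw copy.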
Reference Examples
Full implementation examples in ./references/:
| File | Description |
|---|---|
| | Complete dbt layout with staging, intermediate, marts |
| | Production DAG with sensors, task groups, quality checks |
| | Kafka-to-Delta processor with windowing |
| | Comprehensive data quality expectation suite |
Anti-Patterns (10 Critical Mistakes)
1. Full Table Refreshes
Symptom: Truncate and rebuild entire tables every run
Fix: Use incremental models with is_incremental(), partition by date
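The same idea outside dbt: load only rows newer than the table's high-water mark instead of rebuilding everything (a pure-Python sketch; in dbt this is the is_incremental() filter on a date column):

```python
def incremental_load(target, source, date_col="updated_at"):
    """Append only source rows newer than the target's high-water mark."""
    watermark = max((r[date_col] for r in target), default="")
    new_rows = [r for r in source if r[date_col] > watermark]
    target.extend(new_rows)
    return len(new_rows)

target = [{"id": 1, "updated_at": "2024-01-01"}]
source = [
    {"id": 1, "updated_at": "2024-01-01"},  # already loaded, skipped
    {"id": 2, "updated_at": "2024-01-02"},  # new, appended
]
print(incremental_load(target, source))  # → 1
```

Re-running with the same source loads zero rows, which is the idempotence the Quality Checklist below asks for.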
2. Tight Coupling to Source Schemas
Symptom: Pipeline breaks when upstream adds/removes columns
Fix: Explicit source contracts, select only needed columns in staging
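A staging layer that projects onto explicitly named columns keeps upstream schema changes from leaking downstream (pure-Python sketch; the column names are hypothetical):

```python
STAGING_COLUMNS = ("id", "email", "created_at")  # explicit source contract

def stage(raw_rows, columns=STAGING_COLUMNS):
    """Project each source row onto the contracted columns only."""
    return [{c: row.get(c) for c in columns} for row in raw_rows]

# Upstream added an unexpected column; staging output is unaffected.
raw = [{"id": 1, "email": "x@y.z", "created_at": "2024-01-01", "new_col": 9}]
print(stage(raw))  # → [{'id': 1, 'email': 'x@y.z', 'created_at': '2024-01-01'}]
```

In SQL staging models this is the same discipline: select named columns, never `SELECT *`.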
3. Monolithic DAGs
Symptom: One 200-task DAG running 8 hours
Fix: Domain-specific DAGs, ExternalTaskSensor for dependencies
4. No Data Quality Gates
Symptom: Bad data reaches production before detection
Fix: Great Expectations or dbt tests at each layer, block on failures
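The blocking semantics matter more than the tool: a failing check must stop the layer, not log and continue. A minimal hand-rolled gate illustrating this (not the Great Expectations API; the checks shown are hypothetical examples):

```python
def quality_gate(rows, checks):
    """Run each named check; raise to block the pipeline on any failure."""
    failures = [name for name, check in checks.items() if not check(rows)]
    if failures:
        raise ValueError(f"quality gate failed: {failures}")
    return True

checks = {
    "not_empty": lambda rows: len(rows) > 0,
    "no_null_ids": lambda rows: all(r["id"] is not None for r in rows),
    "amount_in_range": lambda rows: all(0 <= r["amount"] < 1_000_000 for r in rows),
}
rows = [{"id": 1, "amount": 42}]
print(quality_gate(rows, checks))  # → True
```

Wired between layers, the raised error is what prevents bad data from ever reaching Gold.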
5. Processing Before Archiving
Symptom: Raw data transformed without preserving original
Fix: Always land raw in Bronze first, make transformations reproducible
6. Hardcoded Dates in Queries
Symptom: Manual updates needed for date filters
Fix: Use Airflow templating (e.g., the ds variable) or dynamic date functions
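The fix is to take the run date as a parameter rather than a literal; in Airflow that parameter is the `{{ ds }}` template, sketched here with stdlib string templating (the table and column names are hypothetical):

```python
from string import Template

# Hardcoded (bad): WHERE event_date = '2024-01-15' needs a manual edit per run.
QUERY = Template("SELECT * FROM events WHERE event_date = '$ds'")

def render(logical_date: str) -> str:
    """Render the query for a given run's logical date (Airflow's ds)."""
    return QUERY.substitute(ds=logical_date)

print(render("2024-01-15"))
# → SELECT * FROM events WHERE event_date = '2024-01-15'
```

Because the date comes from the scheduler, backfills just re-render the same query with older logical dates.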
7. Missing Watermarks in Streaming
Symptom: Unbounded state growth, OOM in long-running jobs
Fix: Add withWatermark() to handle late-arriving data
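The idea behind withWatermark(), sketched without Spark: track the maximum event time seen, drop events older than max_event_time − delay, and evict window state that can no longer receive on-time events, so state stays bounded (a hedged illustration of the mechanism, not Spark's implementation):

```python
def run_with_watermark(events, delay=10):
    """Count events per 10-unit window, evicting windows past the watermark."""
    state, max_event_time, closed = {}, 0, {}
    for t in events:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - delay
        if t <= watermark:
            continue  # too late: dropped instead of reopening evicted state
        window = t // 10 * 10
        state[window] = state.get(window, 0) + 1
        # Evict windows whose end falls at or before the watermark.
        for w in [w for w in state if w + 10 <= watermark]:
            closed[w] = state.pop(w)
    return closed, state

closed, open_state = run_with_watermark([1, 5, 12, 3, 25, 2])
# closed == {0: 3}; the event at t=2 arrived after the watermark and was dropped
```

Without the watermark, every window stays in `state` forever, which is exactly the unbounded growth that causes OOM in long-running jobs.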
8. No Retry/Backoff Strategy
Symptom: Transient failures cause DAG failures
Fix: retries=3, retry_exponential_backoff=True, max_retry_delay
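Those Airflow task settings map to retry logic like the following (a pure-Python sketch of exponential backoff with a capped delay; the flaky task is hypothetical):

```python
import time

def run_with_retries(task, retries=3, base_delay=1.0, max_retry_delay=60.0):
    """Retry a task with exponential backoff, capping each wait at max_retry_delay."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the failure
            delay = min(base_delay * 2 ** attempt, max_retry_delay)
            time.sleep(delay)

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(run_with_retries(flaky, base_delay=0.01))  # → ok
```

Transient failures are absorbed by the retries, while a persistent failure still raises after the final attempt so alerting can fire.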
9. Undocumented Data Lineage
Symptom: No one knows where data comes from or who uses it
Fix: dbt docs, data catalog integration, column-level lineage
10. Testing Only in Production
Symptom: Bugs discovered by stakeholders, not engineers
Fix: dbt --target dev, sample datasets, CI/CD for models
Quality Checklist
Pipeline Design:
- Incremental processing where possible
- Idempotent transformations (re-runnable safely)
- Partitioning strategy defined and documented
- Backfill procedures documented
Data Quality:
- Tests at Bronze layer (schema, nulls, ranges)
- Tests at Silver layer (business rules, referential integrity)
- Tests at Gold layer (aggregation checks, trend monitoring)
- Anomaly detection for volumes and distributions
Orchestration:
- Retry and alerting configured
- SLAs defined and monitored
- Cross-DAG dependencies use sensors
- max_active_runs prevents parallel conflicts
Operations:
- Data lineage documented
- Runbooks for common failures
- Monitoring dashboards for pipeline health
- On-call procedures defined
Validation Script
Run ./scripts/validate-pipeline.sh to check:
- dbt project structure and conventions
- Airflow DAG best practices
- Spark job configurations
- Data quality setup