data-engineering
Data Engineering & Analytics Skill
Quick Start - SQL Data Pipeline
```sql
-- Create staging table
CREATE TABLE staging_events AS
SELECT
    event_id,
    user_id,
    event_type,
    event_time,
    properties
FROM raw_events
WHERE event_time >= CURRENT_DATE - INTERVAL '1 day'
  AND event_type IN ('click', 'purchase', 'view');

-- Aggregate metrics
SELECT
    DATE(event_time) AS date,
    user_id,
    COUNT(*) AS event_count,
    COUNT(DISTINCT event_type) AS unique_events
FROM staging_events
GROUP BY 1, 2
ORDER BY date DESC, event_count DESC;
```
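To try the two steps above without a warehouse, the following is a minimal sketch that runs them against an in-memory SQLite database from Python. It makes several assumptions: the function names `build_staging` and `daily_metrics` are hypothetical, timestamps are stored as ISO-8601 text, the cutoff date is passed in as a parameter instead of using `CURRENT_DATE - INTERVAL '1 day'` (which SQLite does not support), the staging table is filled with `INSERT ... SELECT` rather than `CREATE TABLE ... AS`, and the date column is aliased `event_date`.

```python
import sqlite3
from datetime import date, timedelta

# Event types kept in staging (same list as the SQL quick start).
TRACKED_TYPES = ("click", "purchase", "view")


def build_staging(conn, today):
    """Rebuild staging_events from raw_events, keeping events since yesterday."""
    # ISO date strings sort chronologically, so a text comparison is enough here.
    cutoff = (today - timedelta(days=1)).isoformat()
    conn.execute("DROP TABLE IF EXISTS staging_events")
    conn.execute(
        "CREATE TABLE staging_events ("
        "event_id INTEGER, user_id TEXT, event_type TEXT, "
        "event_time TEXT, properties TEXT)"
    )
    conn.execute(
        """
        INSERT INTO staging_events
        SELECT event_id, user_id, event_type, event_time, properties
        FROM raw_events
        WHERE event_time >= ?
          AND event_type IN (?, ?, ?)
        """,
        (cutoff, *TRACKED_TYPES),
    )


def daily_metrics(conn):
    """Return (event_date, user_id, event_count, unique_events) rows."""
    return conn.execute(
        """
        SELECT DATE(event_time) AS event_date,
               user_id,
               COUNT(*) AS event_count,
               COUNT(DISTINCT event_type) AS unique_events
        FROM staging_events
        GROUP BY DATE(event_time), user_id
        ORDER BY event_date DESC, event_count DESC
        """
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_events ("
        "event_id INTEGER, user_id TEXT, event_type TEXT, "
        "event_time TEXT, properties TEXT)"
    )
    # Illustrative sample rows only; not real data.
    conn.executemany(
        "INSERT INTO raw_events VALUES (?, ?, ?, ?, ?)",
        [
            (1, "u1", "click", "2024-01-10 08:00:00", "{}"),
            (2, "u1", "purchase", "2024-01-10 09:00:00", "{}"),
            (3, "u2", "scroll", "2024-01-10 10:00:00", "{}"),
        ],
    )
    build_staging(conn, date(2024, 1, 10))
    for row in daily_metrics(conn):
        print(row)
```

Passing `today` explicitly keeps the sketch deterministic, which makes the same functions easy to reuse in a pipeline test.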
Core Technologies
Data Processing
- Apache Spark
- Apache Flink
- Pandas / Polars
- dbt (data transformation)
Data Warehousing
- Snowflake
- BigQuery (GCP)
- Redshift (AWS)
- Azure Synapse
ETL/ELT Tools
- dbt
- Airflow
- Talend
- Informatica
Streaming
- Apache Kafka
- AWS Kinesis
- Apache Pulsar
ML & Analytics
- scikit-learn
- TensorFlow
- Tableau / Power BI
Best Practices
- Data Quality - Validation and testing
- Documentation - Clear metadata
- Performance - Query optimization
- Governance - Data security
- Monitoring - Pipeline alerts
- Scalability - Design for growth
- Version Control - Git for code and configs
- Testing - Data and pipeline testing
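As one way to make the data quality and testing items concrete, here is a minimal sketch of a row-level validation step for the quick start's event data. Everything in it is an assumption for illustration: the function name `validate_events`, the choice of required columns, and the rule that `event_id` must be unique are not stated in the original.

```python
from collections import Counter

# Assumed rules, mirroring the columns and event types used in the quick start.
ALLOWED_EVENT_TYPES = {"click", "purchase", "view"}
REQUIRED_COLUMNS = ("event_id", "user_id", "event_time")


def validate_events(rows):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []
    for index, row in enumerate(rows):
        # Required columns must be present and non-null.
        for column in REQUIRED_COLUMNS:
            if row.get(column) is None:
                problems.append(f"row {index}: {column} is null")
        # Event type must be one of the tracked values.
        if row.get("event_type") not in ALLOWED_EVENT_TYPES:
            problems.append(
                f"row {index}: unexpected event_type {row.get('event_type')!r}"
            )
    # event_id is assumed to be a unique key.
    id_counts = Counter(
        row.get("event_id") for row in rows if row.get("event_id") is not None
    )
    for event_id, count in id_counts.items():
        if count > 1:
            problems.append(f"event_id {event_id!r} appears {count} times")
    return problems


if __name__ == "__main__":
    sample = [
        {"event_id": 1, "user_id": "u1", "event_type": "click",
         "event_time": "2024-01-10 08:00:00"},
        {"event_id": 1, "user_id": None, "event_type": "scroll",
         "event_time": "2024-01-10 09:00:00"},
    ]
    for problem in validate_events(sample):
        print(problem)
```

A check like this could run before loading a staging table and fail the pipeline, or raise an alert, when the returned list is non-empty.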