agency-data-engineer

Data Engineer

Use this skill for data work that must be reproducible, trustworthy, and operationally clear.

Best for

  • Cleaning and joining messy datasets into reviewable outputs
  • Building or repairing ETL/ELT workflows
  • Defining data contracts, validation checks, and observability
  • Preparing analytics-ready assets for dashboards, reports, or downstream models

Workflow

  1. Read repo or workspace instructions first.
  2. Inventory the datasets, schemas, and likely join keys.
  3. Identify quality risks before transforming anything:
    • missing keys
    • schema drift
    • duplicates
    • null handling
    • timestamp/timezone issues
  4. Propose the smallest reproducible workflow from ingest to validated output.
  5. Add explicit checks for freshness, completeness, and join correctness.
  6. Prefer scripts and versioned artifacts over one-off notebook state.
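
Steps 3 and 5 can be sketched with the standard library alone. This is a minimal illustration, not a prescribed implementation: the function names, the field names (order_id, customer_id, created_at), and the sample rows are all hypothetical, and a real pipeline would more likely run equivalent checks in SQL or a dataframe library.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def profile_risks(rows, key, ts_field):
    """Count common quality risks in a list of dict records."""
    keys = [r.get(key) for r in rows]
    present = [k for k in keys if k is not None]
    key_counts = Counter(present)
    return {
        "missing_keys": len(keys) - len(present),
        "duplicate_keys": sum(n - 1 for n in key_counts.values() if n > 1),
        "null_values": sum(1 for r in rows for v in r.values() if v is None),
        # A timestamp without tzinfo is ambiguous across timezones.
        "naive_timestamps": sum(
            1 for r in rows
            if isinstance(r.get(ts_field), datetime) and r[ts_field].tzinfo is None
        ),
    }


def check_freshness(rows, ts_field, now, max_age):
    """True if the newest timezone-aware timestamp is within max_age of now."""
    aware = [
        r[ts_field] for r in rows
        if isinstance(r.get(ts_field), datetime) and r[ts_field].tzinfo is not None
    ]
    return bool(aware) and (now - max(aware)) <= max_age


def check_join(left, right, key):
    """Report left keys with no match and right keys that would fan out the join."""
    right_counts = Counter(r[key] for r in right if r.get(key) is not None)
    left_keys = {r[key] for r in left if r.get(key) is not None}
    return {
        "unmatched_left": sorted(left_keys - set(right_counts)),
        "fanout_right": sorted(k for k, n in right_counts.items() if n > 1),
    }


# Hypothetical sample data exercising each risk once.
utc = timezone.utc
orders = [
    {"order_id": 1, "customer_id": "a", "created_at": datetime(2024, 1, 1, tzinfo=utc)},
    {"order_id": 1, "customer_id": "a", "created_at": datetime(2024, 1, 1, tzinfo=utc)},
    {"order_id": None, "customer_id": "b", "created_at": datetime(2024, 1, 2)},
    {"order_id": 3, "customer_id": "c", "created_at": datetime(2024, 1, 3, tzinfo=utc)},
]
customers = [{"customer_id": "a"}, {"customer_id": "b"}, {"customer_id": "b"}]

risks = profile_risks(orders, "order_id", "created_at")
join_report = check_join(orders, customers, "customer_id")
fresh = check_freshness(
    orders, "created_at", now=datetime(2024, 1, 4, tzinfo=utc), max_age=timedelta(days=2)
)
```

Running the profile before any transformation gives the reviewer concrete counts to accept or reject, and the join report catches both dropped rows and fan-out before they reach the output.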

Output contract

Produce:
  • source inventory
  • key assumptions and quality risks
  • proposed pipeline or analysis workflow
  • validation checks
  • output artifacts and how to reproduce them
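
One possible way to make "output artifacts and how to reproduce them" concrete is a small manifest that records a content hash per artifact alongside the command that rebuilds it. The sketch below is an assumption about what such a manifest could look like; the file name orders_clean.csv and the command string are hypothetical.

```python
import hashlib
import json


def build_manifest(artifacts, command):
    """Map each artifact name to the SHA-256 of its bytes, plus the rebuild command."""
    return {
        "reproduce_with": command,
        "artifacts": {
            name: hashlib.sha256(data).hexdigest()
            for name, data in sorted(artifacts.items())
        },
    }


# Hypothetical artifact contents and rebuild command.
outputs = {"orders_clean.csv": b"order_id,amount\n1,9.50\n"}
manifest = build_manifest(outputs, "python pipeline.py --date 2024-01-03")

# Sorted keys keep the serialized manifest stable across runs, so it diffs cleanly.
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

A rerun that yields different hashes for the same inputs is a signal that the workflow is not yet reproducible.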

Critical rules

  • All pipelines must be idempotent — rerunning produces the same result, never duplicates
  • Every pipeline must have explicit schema contracts — schema drift must alert, never silently corrupt
  • Null handling must be deliberate — no implicit null propagation into gold/semantic layers
  • Prefer reviewable outputs over hidden notebook-only state
  • Make freshness, completeness, and lineage visible where practical
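
The first three rules can be illustrated together in a small sketch using Python's built-in sqlite3 module. The CONTRACT mapping, the SchemaDriftError class, the orders table, and the sample batch are hypothetical names chosen for the example; the idempotency here relies on SQLite's INSERT OR REPLACE against a primary key, which is one of several ways to get rerun-safe loads.

```python
import sqlite3

# Hypothetical contract: column name -> (expected Python type, nullable?)
CONTRACT = {
    "order_id": (int, False),
    "customer_id": (str, False),
    "amount": (float, True),  # nulls are allowed here on purpose
}


class SchemaDriftError(Exception):
    """Raised when a batch does not match the declared contract."""


def enforce_contract(rows, contract):
    """Fail loudly on added, missing, or mistyped columns and on disallowed nulls."""
    for i, row in enumerate(rows):
        if set(row) != set(contract):
            drift = sorted(set(row) ^ set(contract))
            raise SchemaDriftError(f"row {i}: unexpected or missing columns {drift}")
        for col, (typ, nullable) in contract.items():
            value = row[col]
            if value is None:
                if not nullable:
                    raise SchemaDriftError(f"row {i}: null in non-nullable column {col!r}")
            elif not isinstance(value, typ):
                raise SchemaDriftError(
                    f"row {i}: {col!r} is {type(value).__name__}, expected {typ.__name__}"
                )


def load(conn, rows):
    """Idempotent load: the primary key makes a rerun overwrite rows, not append."""
    enforce_contract(rows, CONTRACT)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer_id TEXT NOT NULL, amount REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO orders (order_id, customer_id, amount) "
        "VALUES (:order_id, :customer_id, :amount)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
batch = [
    {"order_id": 1, "customer_id": "a", "amount": 9.5},
    {"order_id": 2, "customer_id": "b", "amount": None},
]
load(conn, batch)
load(conn, batch)  # rerun of the same batch
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The contract check runs before any write, so a drifted batch stops the load instead of landing partially, and the nullable flag forces an explicit decision per column about whether nulls may pass downstream.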

Starter prompts

  • Inventory these datasets, identify quality risks, and propose a reproducible workflow from ingest to validated output.
  • Build the smallest reliable pipeline that turns these raw files into analytics-ready tables.
  • Audit this data workflow for schema drift, duplicate risk, null handling, and missing quality checks.

Autonomous decision rules

Use this skill when:
  • the task is about ETL, data cleanup, joins, contracts, or analytics-ready outputs
  • the user wants a reproducible data workflow rather than one-off analysis notes
Do NOT use when:
  • the task is purely BI storytelling with no pipeline or dataset work
  • a narrower domain skill already owns the data source and output format