agency-data-engineer
Data Engineer
Use this skill for data work that must be reproducible, trustworthy, and operationally clear.
Best for
- Cleaning and joining messy datasets into reviewable outputs
- Building or repairing ETL/ELT workflows
- Defining data contracts, validation checks, and observability
- Preparing analytics-ready assets for dashboards, reports, or downstream models
Workflow
- Read repo or workspace instructions first.
- Inventory the datasets, schemas, and likely join keys.
- Identify quality risks before transforming anything:
  - missing keys
  - schema drift
  - duplicates
  - null handling
  - timestamp/timezone issues
- Propose the smallest reproducible workflow from ingest to validated output.
- Add explicit checks for freshness, completeness, and join correctness.
- Prefer scripts and versioned artifacts over one-off notebook state.
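
The quality risks listed in the workflow can be profiled before any transformation runs. The following is a minimal sketch using only the Python standard library; the function name `profile_rows`, the sample records, and the field names (`order_id`, `amount`, `created_at`, `coupon`) are hypothetical and chosen for illustration only. It assumes records arrive as dictionaries and timestamps as ISO 8601 strings.

```python
from collections import Counter
from datetime import datetime


def profile_rows(rows, key, expected_fields, ts_field):
    """Summarise common quality risks in a list of dict records."""
    blank = (None, "")
    key_counts = Counter(r.get(key) for r in rows if r.get(key) not in blank)
    # Union of every field name seen in any record.
    seen_fields = set().union(*(r.keys() for r in rows))
    naive = 0
    for r in rows:
        ts = r.get(ts_field)
        # A timestamp without a UTC offset is ambiguous across timezones.
        if ts not in blank and datetime.fromisoformat(ts).tzinfo is None:
            naive += 1
    return {
        "row_count": len(rows),
        "missing_keys": sum(1 for r in rows if r.get(key) in blank),
        "duplicate_keys": sorted(k for k, n in key_counts.items() if n > 1),
        # Schema drift: fields that appeared unexpectedly, or never appeared.
        "unexpected_fields": sorted(seen_fields - set(expected_fields)),
        "missing_fields": sorted(set(expected_fields) - seen_fields),
        "null_counts": {
            f: sum(1 for r in rows if r.get(f) in blank)
            for f in sorted(expected_fields)
        },
        "naive_timestamps": naive,
    }


# Hypothetical sample data covering each risk once.
rows = [
    {"order_id": "A1", "amount": "10.0", "created_at": "2024-01-01T00:00:00+00:00"},
    {"order_id": "A1", "amount": "10.0", "created_at": "2024-01-01T00:00:00+00:00"},
    {"order_id": "", "amount": None, "created_at": "2024-01-02T09:30:00"},
    {"order_id": "A3", "amount": "7.5", "created_at": "2024-01-03T12:00:00+00:00",
     "coupon": "X"},
]
report = profile_rows(
    rows,
    key="order_id",
    expected_fields={"order_id", "amount", "created_at"},
    ts_field="created_at",
)
```

A report like this is meant to be reviewed before the pipeline is designed; which counts are acceptable is a project decision that the sketch does not make.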
Output contract
Produce:
- source inventory
- key assumptions and quality risks
- proposed pipeline or analysis workflow
- validation checks
- output artifacts and how to reproduce them
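
As one possible shape for the validation checks in this contract, the sketch below covers freshness, completeness, and join correctness with the Python standard library. All function names, thresholds, and sample values are assumptions made for illustration; a real pipeline would take them from its own data contract.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def check_freshness(latest_ts, now, max_age):
    """True if the newest record is no older than max_age."""
    return now - latest_ts <= max_age


def check_completeness(source_count, output_count, tolerance=0.0):
    """True if the output row count is within a relative tolerance of the source."""
    if source_count == 0:
        return output_count == 0
    return abs(source_count - output_count) / source_count <= tolerance


def check_join(left_rows, right_rows, key):
    """Report left keys with no match and right keys that would fan out."""
    right_counts = Counter(r[key] for r in right_rows)
    orphans = sorted({r[key] for r in left_rows if r[key] not in right_counts})
    fanout = sorted(k for k, n in right_counts.items() if n > 1)
    return {"orphan_keys": orphans, "fanout_keys": fanout}


# Hypothetical inputs.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
latest = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
fresh = check_freshness(latest, now, max_age=timedelta(hours=24))

complete = check_completeness(source_count=1000, output_count=990, tolerance=0.02)

orders = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 4}]
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 2},
             {"customer_id": 3}]
join_report = check_join(orders, customers, key="customer_id")
```

Here customer 4 has no matching dimension row and customer 2 appears twice on the right side, so a join would drop one order or duplicate another unless those keys are handled first.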
Critical rules
- All pipelines must be idempotent: rerunning on the same inputs produces the same result and never creates duplicates
- Every pipeline must have an explicit schema contract: schema drift must raise an alert and never silently corrupt data
- Null handling must be deliberate: no implicit null propagation into gold/semantic layers
- Prefer reviewable outputs over hidden notebook-only state
- Make freshness, completeness, and lineage visible where practical
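
To make the idempotency and schema-contract rules concrete, here is a small sketch using Python's built-in `sqlite3` module. The table name `orders`, the `CONTRACT` mapping, and the `load` and `validate_schema` helpers are hypothetical. Keying the write on a primary key with `INSERT OR REPLACE` is only one way to get idempotent loads; other engines offer their own merge or upsert statements.

```python
import sqlite3

# Hypothetical contract: exact field set and Python types, nulls not allowed.
CONTRACT = {"order_id": str, "amount": float}


def validate_schema(rows, contract):
    """Raise instead of loading when a record departs from the contract."""
    for i, row in enumerate(rows):
        if set(row) != set(contract):
            raise ValueError(f"row {i}: schema drift, got fields {sorted(row)}")
        for field, expected_type in contract.items():
            if row[field] is None:
                raise ValueError(f"row {i}: {field} is null")
            if not isinstance(row[field], expected_type):
                raise ValueError(f"row {i}: {field} is not {expected_type.__name__}")


def load(conn, rows):
    """Validate, then write keyed on order_id so reruns do not duplicate rows."""
    validate_schema(rows, CONTRACT)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id TEXT PRIMARY KEY, amount REAL NOT NULL)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO orders (order_id, amount) "
            "VALUES (:order_id, :amount)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


conn = sqlite3.connect(":memory:")
batch = [{"order_id": "A1", "amount": 10.0}, {"order_id": "A2", "amount": 7.5}]
first_count = load(conn, batch)
second_count = load(conn, batch)  # rerun of the same batch

try:
    load(conn, [{"order_id": "A3", "amount": 1.0, "coupon": "X"}])
    drift_detected = False
except ValueError:
    drift_detected = True
```

Running the same batch twice leaves the row count unchanged, and the record with an extra field is rejected before anything is written. In a production setting the `ValueError` would typically be routed to whatever alerting the team already uses.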
Starter prompts
- Inventory these datasets, identify quality risks, and propose a reproducible workflow from ingest to validated output.
- Build the smallest reliable pipeline that turns these raw files into analytics-ready tables.
- Audit this data workflow for schema drift, duplicate risk, null handling, and missing quality checks.
Autonomous decision rules
Use this skill when:
- the task is about ETL, data cleanup, joins, contracts, or analytics-ready outputs
- the user wants a reproducible data workflow rather than one-off analysis notes
Do NOT use when:
- the task is purely BI storytelling with no pipeline or dataset work
- a narrower domain skill already owns the data source and output format