mlops-engineer

MLOps Engineer

Purpose

Provides expertise in Machine Learning Operations (MLOps), bridging data science and DevOps practices. Covers the end-to-end ML lifecycle: training pipelines, model versioning, production serving, and monitoring.

When to Use

Building ML training and serving pipelines
Implementing model versioning and registry
Setting up feature stores
Deploying models to production
Monitoring model performance and drift
Automating ML workflows (CI/CD for ML)
Implementing A/B testing for models
Managing experiment tracking

Quick Start

Invoke this skill when:

Building ML pipelines and workflows
Deploying models to production
Setting up model versioning and registry
Implementing feature stores
Monitoring production ML systems

Do NOT invoke when:

Model development and training → use `/ml-engineer`
Data pipeline ETL → use `/data-engineer`
Kubernetes infrastructure → use `/kubernetes-specialist`
General CI/CD without ML → use `/devops-engineer`

Decision Framework

ML Lifecycle Stage?
├── Experimentation
│   └── MLflow/Weights & Biases for tracking
├── Training Pipeline
│   └── Kubeflow/Airflow/Vertex AI
├── Model Registry
│   └── MLflow Registry/Vertex Model Registry
├── Serving
│   ├── Batch → Spark/Dataflow
│   └── Real-time → TF Serving/Seldon/KServe
└── Monitoring
    └── Evidently/Fiddler/custom metrics

Core Workflows

1. ML Pipeline Setup

Define pipeline stages (data prep, training, eval)
Choose orchestrator (Kubeflow, Airflow, Vertex)
Containerize each pipeline step
Implement artifact storage
Add experiment tracking
Configure automated retraining triggers
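
The steps above can be sketched without committing to an orchestrator. The following is a minimal, illustrative sketch in plain Python (standard library only): each stage is a function, and artifacts are written to a directory as JSON. All names here (`prepare_data`, `train`, `evaluate`, `run_pipeline`, and the artifact file names) are hypothetical and are not part of Kubeflow, Airflow, or Vertex AI. A real pipeline would likely run each stage in its own container under the chosen orchestrator.

```python
import json
import tempfile
from pathlib import Path


def prepare_data():
    """Data prep stage: toy (x, y) pairs that follow y = 2x + 1."""
    return [(float(x), 2.0 * x + 1.0) for x in range(10)]


def train(rows):
    """Training stage: least-squares fit of y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    var = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = cov / var
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def evaluate(model, rows):
    """Eval stage: mean absolute error.

    Evaluated on the training rows for brevity; a real pipeline
    would use a held-out split.
    """
    errors = [abs(model["slope"] * x + model["intercept"] - y) for x, y in rows]
    return {"mae": sum(errors) / len(errors)}


def run_pipeline(artifact_dir):
    """Run the stages in order and persist model and metrics as artifacts."""
    artifact_dir = Path(artifact_dir)
    artifact_dir.mkdir(parents=True, exist_ok=True)
    rows = prepare_data()
    model = train(rows)
    metrics = evaluate(model, rows)
    (artifact_dir / "model.json").write_text(json.dumps(model))
    (artifact_dir / "metrics.json").write_text(json.dumps(metrics))
    return model, metrics


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        model, metrics = run_pipeline(tmp)
        print(model, metrics)
```

Keeping each stage a pure function with explicit inputs and outputs is one way to make the later containerization step mechanical, since every stage already has a clear contract.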

2. Model Deployment

Register model in model registry
Build serving container
Deploy to serving infrastructure
Configure autoscaling
Implement canary/shadow deployment
Set up monitoring and alerts
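
To make the canary step above concrete, here is a small sketch of deterministic traffic splitting, assuming each request carries a stable identifier. The function names (`canary_bucket`, `route`) and the `"canary"`/`"stable"` labels are hypothetical. Serving platforms such as Seldon or KServe provide their own traffic-splitting configuration, which this does not attempt to reproduce.

```python
import hashlib


def canary_bucket(request_id: str) -> float:
    """Map a request ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Keep 53 bits so the division is exact and the result is always below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def route(request_id: str, canary_fraction: float) -> str:
    """Send roughly `canary_fraction` of request IDs to the canary model."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be between 0 and 1")
    return "canary" if canary_bucket(request_id) < canary_fraction else "stable"


if __name__ == "__main__":
    ids = [f"user-{i}" for i in range(10_000)]
    share = sum(route(i, 0.10) == "canary" for i in ids) / len(ids)
    print(f"canary share: {share:.3f}")
```

Hashing the identifier instead of drawing a random number means the same caller keeps hitting the same model version, which tends to make canary metrics easier to compare.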

3. Model Monitoring

Define key metrics (latency, throughput, accuracy)
Implement data drift detection
Set up prediction monitoring
Create alerting thresholds
Build dashboards for visibility
Automate retraining triggers
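
As one possible illustration of the drift detection and alerting steps above, the sketch below computes a Population Stability Index (PSI) for a single numeric feature using only the standard library. PSI is swapped in here as a simple, well-known drift score. It is not necessarily what Evidently or Fiddler compute by default. The function name `psi`, the equal-width binning, the `eps` smoothing, and the `0.2` alert threshold are all assumptions to tune for a real system.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index of `current` against `reference`.

    Bins are equal-width over the reference range; values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb value, not a standard


def should_alert(reference, current):
    """Return True when the PSI exceeds the assumed alert threshold."""
    return psi(reference, current) > DRIFT_THRESHOLD


if __name__ == "__main__":
    training = [i / 100 for i in range(1000)]
    shifted = [x + 5 for x in training]
    print(psi(training, training), should_alert(training, shifted))
```

The same boolean could feed a retraining trigger, though in practice a single feature's score is usually combined with model performance metrics before retraining is kicked off.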

Best Practices

Version everything: code, data, models, configs
Use feature stores for consistency between training and serving
Implement CI/CD specifically designed for ML workflows
Monitor data drift and model performance continuously
Use canary deployments for model rollouts
Keep training and serving environments consistent
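
One lightweight way to approach "version everything" is to derive a single run identifier from the code version, the data, and the config, so that any change to one of them produces a different ID. The sketch below assumes the code version is available as a string (for example a Git commit hash) and the data fits in memory as bytes. The `fingerprint` function is hypothetical and is not an MLflow or registry API.

```python
import hashlib
import json


def fingerprint(code_version: str, data: bytes, config: dict) -> str:
    """Return a short, deterministic ID for a (code, data, config) combination."""
    record = {
        "code": code_version,
        "data": hashlib.sha256(data).hexdigest(),
        "config": config,
    }
    # sort_keys makes the ID independent of dictionary key order.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


if __name__ == "__main__":
    run_id = fingerprint("3f2a9c1", b"x,y\n1,3\n2,5\n", {"lr": 0.01, "epochs": 5})
    print(run_id)
```

Attaching such an ID to the registered model gives a quick check that a retrained model really came from the inputs it claims.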

Anti-Patterns

Anti-Pattern	Problem	Correct Approach
Manual deployments	Error-prone, slow	Automated ML CI/CD
Training-serving skew	Inconsistent features cause prediction errors	Feature stores shared by training and serving
No model versioning	Can't reproduce or rollback	Model registry
Ignoring data drift	Silent degradation	Continuous monitoring
Notebook-to-production	Unmaintainable	Proper pipeline code
