mlops-engineer
MLOps Engineer
Purpose
Provides expertise in Machine Learning Operations, bridging data science and DevOps practices. Specializes in end-to-end ML lifecycles from training pipelines to production serving, model versioning, and monitoring.
When to Use
- Building ML training and serving pipelines
- Implementing model versioning and registry
- Setting up feature stores
- Deploying models to production
- Monitoring model performance and drift
- Automating ML workflows (CI/CD for ML)
- Implementing A/B testing for models
- Managing experiment tracking
Quick Start
Invoke this skill when:
- Building ML pipelines and workflows
- Deploying models to production
- Setting up model versioning and registry
- Implementing feature stores
- Monitoring production ML systems
Do NOT invoke when:
- Model development and training → use /ml-engineer
- Data pipeline ETL → use /data-engineer
- Kubernetes infrastructure → use /kubernetes-specialist
- General CI/CD without ML → use /devops-engineer
Decision Framework
ML Lifecycle Stage?
├── Experimentation
│   └── MLflow/Weights & Biases for tracking
├── Training Pipeline
│   └── Kubeflow/Airflow/Vertex AI
├── Model Registry
│   └── MLflow Registry/Vertex Model Registry
├── Serving
│   ├── Batch → Spark/Dataflow
│   └── Real-time → TF Serving/Seldon/KServe
└── Monitoring
    └── Evidently/Fiddler/custom metrics
Core Workflows
1. ML Pipeline Setup
- Define pipeline stages (data prep, training, eval)
- Choose orchestrator (Kubeflow, Airflow, Vertex)
- Containerize each pipeline step
- Implement artifact storage
- Add experiment tracking
- Configure automated retraining triggers
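The steps above can be sketched as a small in-process pipeline. This is only an illustration under stated assumptions: `ArtifactStore`, the three stage functions, `run_pipeline`, and the `max_mae` threshold in `should_retrain` are hypothetical names, the data is made up, and a real setup would hand each stage to an orchestrator such as Kubeflow or Airflow and run it in its own container.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ArtifactStore:
    """Hypothetical in-memory stand-in for real artifact storage."""
    artifacts: Dict[str, Any] = field(default_factory=dict)

    def save(self, name: str, value: Any) -> None:
        self.artifacts[name] = value

    def load(self, name: str) -> Any:
        return self.artifacts[name]


def data_prep(store: ArtifactStore) -> None:
    # Toy (x, y) pairs; the last pair is held out for evaluation.
    raw = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]
    store.save("train", raw[:3])
    store.save("eval", raw[3:])


def train(store: ArtifactStore) -> None:
    # Fit y = w * x by least squares through the origin.
    data = store.load("train")
    w = sum(x * y for x, y in data) / sum(x * x for x, _ in data)
    store.save("model", {"w": w})


def evaluate(store: ArtifactStore) -> None:
    w = store.load("model")["w"]
    data = store.load("eval")
    mae = sum(abs(y - w * x) for x, y in data) / len(data)
    store.save("metrics", {"mae": mae})


def run_pipeline(stages: List[Callable[[ArtifactStore], None]],
                 store: ArtifactStore) -> List[str]:
    """Run stages in order; each stage talks to the others only via the store."""
    executed = []
    for stage in stages:
        stage(store)
        executed.append(stage.__name__)
    return executed


def should_retrain(metrics: Dict[str, float], max_mae: float = 0.5) -> bool:
    """Assumed trigger: retrain once evaluation error exceeds a threshold."""
    return metrics["mae"] > max_mae
```

Passing state only through the store mirrors how containerized steps exchange artifacts rather than in-memory objects, so a stage can be rerun or swapped without touching its neighbours.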
2. Model Deployment
- Register model in model registry
- Build serving container
- Deploy to serving infrastructure
- Configure autoscaling
- Implement canary/shadow deployment
- Set up monitoring and alerts
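To make the canary step concrete, here is a minimal sketch of deterministic traffic splitting. It assumes routing is decided in application code by hashing a request ID; `route_model`, `canary_share`, and the `"stable"`/`"canary"` labels are hypothetical, and serving platforms such as KServe or Seldon usually handle the split at the infrastructure level instead.

```python
import hashlib
from typing import Iterable


def route_model(request_id: str, canary_percent: int) -> str:
    """Return "canary" or "stable" for a request, deterministically."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


def canary_share(request_ids: Iterable[str], canary_percent: int) -> float:
    """Fraction of the given requests that would hit the canary model."""
    ids = list(request_ids)
    hits = sum(route_model(r, canary_percent) == "canary" for r in ids)
    return hits / len(ids)
```

Hashing instead of random sampling keeps a given request ID on the same model version across retries, which makes canary metrics easier to compare. A shadow deployment would differ in that every request goes to both versions and only the stable response is returned.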
3. Model Monitoring
- Define key metrics (latency, throughput, accuracy)
- Implement data drift detection
- Set up prediction monitoring
- Create alerting thresholds
- Build dashboards for visibility
- Automate retraining triggers
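One way to implement the drift detection and alerting steps above is the Population Stability Index (PSI), sketched below for a single numeric feature. PSI is chosen here for illustration and is not prescribed by this skill; the function names, the ten-bin default, and the 0.2 alert threshold (a commonly cited rule of thumb) are assumptions, and tools such as Evidently provide maintained implementations of this and other tests.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of `actual` against the `expected` baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant baselines

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        eps = 1e-6  # floor so empty bins do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def drift_alert(score: float, threshold: float = 0.2) -> bool:
    """Assumed alerting rule: flag drift when PSI exceeds the threshold."""
    return score > threshold
```

In use, `expected` would be a sample of the feature from training data and `actual` a recent window of production inputs; a `True` result from `drift_alert` could feed the automated retraining trigger.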
Best Practices
- Version everything: code, data, models, configs
- Use feature stores for consistency between training and serving
- Implement CI/CD specifically designed for ML workflows
- Monitor data drift and model performance continuously
- Use canary deployments for model rollouts
- Keep training and serving environments consistent
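As a sketch of what "version everything" can mean in practice, the snippet below derives one identifier from the code revision, data, config, and model bytes, so that a change to any of them yields a new version. `version_id` and its arguments are hypothetical; registries such as MLflow assign their own version numbers, and a fingerprint like this would at most be stored alongside them as metadata.

```python
import hashlib
import json
from typing import Any, Dict


def version_id(code_rev: str, data_bytes: bytes,
               config: Dict[str, Any], model_bytes: bytes) -> str:
    """Short, reproducible fingerprint over code, data, config, and model."""
    manifest = {
        "code": code_rev,  # e.g. a git commit hash (assumed input)
        "data": hashlib.sha256(data_bytes).hexdigest(),
        "config": config,
        "model": hashlib.sha256(model_bytes).hexdigest(),
    }
    # Canonical JSON so that key order in `config` does not change the ID.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Two runs with identical inputs produce the same ID, which gives a quick check that a deployed model matches the training run it claims to come from.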
Anti-Patterns
| Anti-Pattern | Problem | Correct Approach |
|---|---|---|
| Manual deployments | Error-prone, slow | Automated ML CI/CD |
| Training-serving skew | Prediction errors | Feature stores |
| No model versioning | Can't reproduce or rollback | Model registry |
| Ignoring data drift | Silent degradation | Continuous monitoring |
| Notebook-to-production | Unmaintainable | Proper pipeline code |
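The training-serving skew row can be illustrated without a full feature store: the underlying idea is that training and serving call the same feature code. The sketch below assumes a single shared `build_features` function; the field names (`amount`, `country`), the transformations, and the linear scorer are made up for the example, and a feature store applies the same principle at scale with storage and point-in-time lookups on top.

```python
import math
from typing import Dict, List, Sequence


def build_features(raw: Dict[str, object]) -> List[float]:
    """Single source of truth for feature logic, used offline and online."""
    amount = float(raw["amount"])
    return [
        math.log1p(amount),                       # compress large amounts
        1.0 if raw["country"] == "US" else 0.0,   # simple indicator feature
    ]


def training_rows(records: Sequence[Dict[str, object]]) -> List[List[float]]:
    """Offline path: build the training matrix from historical records."""
    return [build_features(r) for r in records]


def predict(raw: Dict[str, object], weights: Sequence[float]) -> float:
    """Online path: the same transformation, then a linear score."""
    features = build_features(raw)
    return sum(w * f for w, f in zip(weights, features))
```

Skew typically appears when the online path reimplements a transformation, for example `log` instead of `log1p`, in a different codebase; routing both paths through one function removes that class of error.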