mlops-engineer

MLOps Engineer

Purpose

Provides expertise in Machine Learning Operations (MLOps), bridging data science and DevOps practices. Covers the end-to-end ML lifecycle: training pipelines, model versioning, production serving, and monitoring.

When to Use

  • Building ML training and serving pipelines
  • Implementing model versioning and registry
  • Setting up feature stores
  • Deploying models to production
  • Monitoring model performance and drift
  • Automating ML workflows (CI/CD for ML)
  • Implementing A/B testing for models
  • Managing experiment tracking

Quick Start

Invoke this skill when:
  • Building ML pipelines and workflows
  • Deploying models to production
  • Setting up model versioning and registry
  • Implementing feature stores
  • Monitoring production ML systems
Do NOT invoke when:
  • Model development and training → use /ml-engineer
  • Data pipeline ETL → use /data-engineer
  • Kubernetes infrastructure → use /kubernetes-specialist
  • General CI/CD without ML → use /devops-engineer

Decision Framework

ML Lifecycle Stage?
├── Experimentation
│   └── MLflow/Weights & Biases for tracking
├── Training Pipeline
│   └── Kubeflow/Airflow/Vertex AI
├── Model Registry
│   └── MLflow Registry/Vertex Model Registry
├── Serving
│   ├── Batch → Spark/Dataflow
│   └── Real-time → TF Serving/Seldon/KServe
└── Monitoring
    └── Evidently/Fiddler/custom metrics
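
The tree above can also be kept as a small lookup table so tooling choices are queryable from scripts. This is only a sketch: the key names (`serving_batch`, `serving_realtime`, and so on) and the `candidate_tools` helper are hypothetical, and the tool lists are copied from the tree without ranking them.

```python
from __future__ import annotations

# Candidate tools per lifecycle stage, copied from the decision tree.
# The dictionary keys are illustrative names, not an established convention.
STAGE_TOOLS = {
    "experimentation": ["MLflow", "Weights & Biases"],
    "training_pipeline": ["Kubeflow", "Airflow", "Vertex AI"],
    "model_registry": ["MLflow Registry", "Vertex Model Registry"],
    "serving_batch": ["Spark", "Dataflow"],
    "serving_realtime": ["TF Serving", "Seldon", "KServe"],
    "monitoring": ["Evidently", "Fiddler", "custom metrics"],
}


def candidate_tools(stage: str) -> list[str]:
    """Return the candidate tools for a stage name such as 'Model Registry'."""
    key = stage.strip().lower().replace(" ", "_")
    if key not in STAGE_TOOLS:
        raise KeyError(f"unknown stage {stage!r}; expected one of {sorted(STAGE_TOOLS)}")
    # Return a copy so callers cannot mutate the shared table.
    return list(STAGE_TOOLS[key])
```

For example, `candidate_tools("Model Registry")` returns the two registry options listed in the tree.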

Core Workflows

1. ML Pipeline Setup

  1. Define pipeline stages (data prep, training, eval)
  2. Choose orchestrator (Kubeflow, Airflow, Vertex AI)
  3. Containerize each pipeline step
  4. Implement artifact storage
  5. Add experiment tracking
  6. Configure automated retraining triggers
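
As a rough illustration of steps 1 and 4, the sketch below chains three stages (data prep, training, eval) and writes each stage's outputs to an artifact directory. It uses only the standard library in place of a real orchestrator; the stage functions, the toy `y = 2x` dataset, and the JSON-file artifact layout are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable, List, Tuple

# A stage reads the shared context dict and returns the values it adds.
Stage = Callable[[dict], dict]


def prepare_data(ctx: dict) -> dict:
    # Hypothetical toy data: y = 2x, split into train and eval parts.
    points = [(float(x), 2.0 * x) for x in range(10)]
    return {"train": points[:7], "eval": points[7:]}


def train(ctx: dict) -> dict:
    # Least-squares slope through the origin: sum(x*y) / sum(x*x).
    num = sum(x * y for x, y in ctx["train"])
    den = sum(x * x for x, _ in ctx["train"])
    return {"model": {"slope": num / den}}


def evaluate(ctx: dict) -> dict:
    slope = ctx["model"]["slope"]
    errors = [abs(y - slope * x) for x, y in ctx["eval"]]
    return {"metrics": {"mae": sum(errors) / len(errors)}}


def run_pipeline(stages: List[Tuple[str, Stage]], artifact_dir: Path) -> dict:
    """Run stages in order, persisting each stage's outputs as JSON."""
    ctx: dict = {}
    for name, stage in stages:
        outputs = stage(ctx)
        ctx.update(outputs)
        # One artifact file per stage so a run can be inspected afterwards.
        (artifact_dir / f"{name}.json").write_text(json.dumps(outputs))
    return ctx


STAGES = [("data_prep", prepare_data), ("training", train), ("eval", evaluate)]
```

Calling `run_pipeline(STAGES, Path("artifacts"))` on an existing directory returns the final context and leaves `data_prep.json`, `training.json`, and `eval.json` behind. In a real setup each stage would typically run in its own container under the chosen orchestrator, with artifacts in object storage rather than local files.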

2. Model Deployment

  1. Register model in model registry
  2. Build serving container
  3. Deploy to serving infrastructure
  4. Configure autoscaling
  5. Implement canary/shadow deployment
  6. Set up monitoring and alerts
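
One common way to implement the canary part of step 5 is a deterministic traffic split keyed on a request or user ID, so the same caller always hits the same model version. The sketch below assumes that approach; the `pick_version` function and the version labels `v1` and `v2` are hypothetical, and serving platforms usually provide their own traffic-splitting controls. Shadow deployment (mirroring traffic without returning the shadow model's answer) is not covered here.

```python
import hashlib


def pick_version(request_id: str, canary_percent: int,
                 stable: str = "v1", canary: str = "v2") -> str:
    """Route a request to the canary for roughly canary_percent of IDs."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the ID so the assignment is stable across processes and restarts
    # (the built-in hash() is salted per process, so it is not used here).
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return canary if bucket < canary_percent else stable
```

Raising `canary_percent` step by step (for example 5, 25, 50, 100) keeps every ID that was already on the canary on the canary, because its bucket does not change.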

3. Model Monitoring

  1. Define key metrics (latency, throughput, accuracy)
  2. Implement data drift detection
  3. Set up prediction monitoring
  4. Create alerting thresholds
  5. Build dashboards for visibility
  6. Automate retraining triggers
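
For steps 2 and 4, one widely used drift score for a single numeric feature is the Population Stability Index (PSI), which compares the binned distribution of production values against a training baseline. The sketch below is a plain-Python version under stated assumptions: equal-width bins taken from the baseline range, a small floor to avoid `log(0)`, and an alert threshold of 0.2, which is a rule of thumb rather than a standard. Tools such as Evidently offer their own drift tests; this is not their API.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of `actual` against the `expected` baseline."""
    if len(expected) == 0 or len(actual) == 0:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range fall into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at a tiny proportion so the log is defined.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def drift_alert(score: float, threshold: float = 0.2) -> bool:
    # Assumed threshold: tune it per feature from historical scores.
    return score >= threshold
```

A score of zero means the binned distributions match; an alert from `drift_alert` is the kind of signal that could feed the retraining trigger in step 6.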

Best Practices

  • Version everything: code, data, models, configs
  • Use feature stores for consistency between training and serving
  • Implement CI/CD specifically designed for ML workflows
  • Monitor data drift and model performance continuously
  • Use canary deployments for model rollouts
  • Keep training and serving environments consistent
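
To make "version everything" concrete, a training run can be tagged with one fingerprint derived from the code revision, the data, and the config, so that any change to one of them yields a new version. The sketch below shows one possible scheme; the `fingerprint` helper, its inputs, and the 12-character truncation are assumptions for illustration, and a model registry or data-versioning tool would normally handle this.

```python
import hashlib
import json


def fingerprint(code_rev: str, data: bytes, config: dict) -> str:
    """Derive a short, reproducible version ID from code, data, and config."""
    h = hashlib.sha256()
    h.update(code_rev.encode("utf-8"))   # e.g. a Git commit hash
    h.update(b"\0")                      # separator between fields
    h.update(hashlib.sha256(data).digest())
    # Sort keys so logically equal configs hash to the same value.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    h.update(canonical.encode("utf-8"))
    return h.hexdigest()[:12]
```

Storing this ID alongside the registered model makes it possible to check later whether a retraining run used the same inputs.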

Anti-Patterns

Anti-Pattern            | Problem                     | Correct Approach
Manual deployments      | Error-prone, slow           | Automated ML CI/CD
Training-serving skew   | Prediction errors           | Feature stores
No model versioning     | Can't reproduce or rollback | Model registry
Ignoring data drift     | Silent degradation          | Continuous monitoring
Notebook-to-production  | Unmaintainable              | Proper pipeline code