ml-pipeline-workflow


ML Pipeline Workflow


Complete end-to-end MLOps pipeline orchestration from data preparation through model deployment.

Overview


This skill provides comprehensive guidance for building production ML pipelines that handle the full lifecycle: data ingestion → preparation → training → validation → deployment → monitoring.

When to Use This Skill


  • Building new ML pipelines from scratch
  • Designing workflow orchestration for ML systems
  • Implementing data → model → deployment automation
  • Setting up reproducible training workflows
  • Creating DAG-based ML orchestration
  • Integrating ML components into production systems

What This Skill Provides


Core Capabilities


  1. Pipeline Architecture
    • End-to-end workflow design
    • DAG orchestration patterns (Airflow, Dagster, Kubeflow)
    • Component dependencies and data flow
    • Error handling and retry strategies
  2. Data Preparation
    • Data validation and quality checks
    • Feature engineering pipelines
    • Data versioning and lineage
    • Train/validation/test splitting strategies
  3. Model Training
    • Training job orchestration
    • Hyperparameter management
    • Experiment tracking integration
    • Distributed training patterns
  4. Model Validation
    • Validation frameworks and metrics
    • A/B testing infrastructure
    • Performance regression detection
    • Model comparison workflows
  5. Deployment Automation
    • Model serving patterns
    • Canary deployments
    • Blue-green deployment strategies
    • Rollback mechanisms
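The retry strategy listed under Pipeline Architecture can be sketched as a small wrapper. This is a minimal illustration, not any orchestrator's API; `run_with_retry` and `flaky_stage` are hypothetical names:

```python
import time

def run_with_retry(stage_fn, max_retries=3, backoff_seconds=0.0):
    """Run a pipeline stage, retrying on failure with linear backoff."""
    for attempt in range(1, max_retries + 1):
        try:
            return stage_fn()
        except Exception:
            if attempt == max_retries:
                raise  # surface the failure to the orchestrator / alerting
            time.sleep(backoff_seconds * attempt)

# Example: a flaky stage that succeeds on its second attempt
calls = {"n": 0}

def flaky_stage():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient failure")
    return "ok"

result = run_with_retry(flaky_stage, max_retries=3)
```

Real orchestrators (Airflow's `retries`, Dagster's retry policies) provide this declaratively; the sketch only shows the control flow.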

Reference Documentation


See the references/ directory for detailed guides:
  • data-preparation.md - Data cleaning, validation, and feature engineering
  • model-training.md - Training workflows and best practices
  • model-validation.md - Validation strategies and metrics
  • model-deployment.md - Deployment patterns and serving architectures

Assets and Templates


The assets/ directory contains:
  • pipeline-dag.yaml.template - DAG template for workflow orchestration
  • training-config.yaml - Training configuration template
  • validation-checklist.md - Pre-deployment validation checklist

Usage Patterns


Basic Pipeline Setup


```python
# 1. Define pipeline stages
stages = [
    "data_ingestion",
    "data_validation",
    "feature_engineering",
    "model_training",
    "model_validation",
    "model_deployment",
]

# 2. Configure dependencies
# See assets/pipeline-dag.yaml.template for the full example
```
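The linear stage list above generalizes to a DAG once stages depend on more than one predecessor. A minimal sketch of dependency-ordered execution using the standard library (the `dependencies` map is illustrative, mirroring the stage names above):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map for the six stages listed above
dependencies = {
    "data_ingestion": [],
    "data_validation": ["data_ingestion"],
    "feature_engineering": ["data_validation"],
    "model_training": ["feature_engineering"],
    "model_validation": ["model_training"],
    "model_deployment": ["model_validation"],
}

# TopologicalSorter yields stages in a dependency-respecting order,
# which is the order an orchestrator would schedule them in
execution_order = list(TopologicalSorter(dependencies).static_order())
```

In a real deployment this ordering is handled by the orchestrator (Airflow, Dagster, Kubeflow); the sketch only shows the underlying idea.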

Production Workflow


  1. Data Preparation Phase
    • Ingest raw data from sources
    • Run data quality checks
    • Apply feature transformations
    • Version processed datasets
  2. Training Phase
    • Load versioned training data
    • Execute training jobs
    • Track experiments and metrics
    • Save trained models
  3. Validation Phase
    • Run validation test suite
    • Compare against baseline
    • Generate performance reports
    • Approve for deployment
  4. Deployment Phase
    • Package model artifacts
    • Deploy to serving infrastructure
    • Configure monitoring
    • Validate production traffic
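The four phases above can be sketched as plain functions chained together. Everything here is a toy stand-in (the "model" is just a mean), intended only to show how each phase's output feeds the next:

```python
def data_preparation_phase(raw_rows):
    """Ingest, quality-check, and transform raw records (illustrative only)."""
    cleaned = [r for r in raw_rows if r.get("value") is not None]  # quality check
    return [{**r, "value_scaled": r["value"] / 100.0} for r in cleaned]  # feature transform

def training_phase(dataset):
    """Stand-in for a real training job: returns a trivial 'model'."""
    mean = sum(r["value_scaled"] for r in dataset) / len(dataset)
    return {"type": "mean_predictor", "mean": mean}

def validation_phase(model, baseline_mean=0.0):
    """Compare against a baseline before approving deployment."""
    return model["mean"] >= baseline_mean

raw = [{"value": 10}, {"value": None}, {"value": 30}]
dataset = data_preparation_phase(raw)   # drops the null row, scales the rest
model = training_phase(dataset)
approved = validation_phase(model)
```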

Best Practices


Pipeline Design


  • Modularity: Each stage should be independently testable
  • Idempotency: Re-running stages should be safe
  • Observability: Log metrics at every stage
  • Versioning: Track data, code, and model versions
  • Failure Handling: Implement retry logic and alerting
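Idempotency in practice often means keying a stage's output by its inputs so a re-run reuses prior results instead of recomputing. A minimal sketch under that assumption; `run_stage_idempotently` is a hypothetical helper, not a library API:

```python
import hashlib
import json
import os
import tempfile

def run_stage_idempotently(stage_name, inputs, stage_fn, cache_dir):
    """Skip a stage whose (name, inputs) pair has already produced output."""
    key = hashlib.sha256(
        json.dumps([stage_name, inputs], sort_keys=True).encode()
    ).hexdigest()
    out_path = os.path.join(cache_dir, f"{key}.json")
    if os.path.exists(out_path):            # safe re-run: reuse prior output
        with open(out_path) as f:
            return json.load(f), False
    result = stage_fn(inputs)
    with open(out_path, "w") as f:          # record output for future runs
        json.dump(result, f)
    return result, True

cache = tempfile.mkdtemp()
double = lambda xs: [x * 2 for x in xs]
first, ran_first = run_stage_idempotently("double", [1, 2], double, cache)
second, ran_second = run_stage_idempotently("double", [1, 2], double, cache)
```

The second call returns the cached result without executing the stage, which is what makes re-running a failed pipeline safe.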

Data Management


  • Use data validation libraries (Great Expectations, TFX)
  • Version datasets with DVC or similar tools
  • Document feature engineering transformations
  • Maintain data lineage tracking
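Libraries like Great Expectations express checks declaratively; the underlying idea can be shown with plain Python. This sketch (the `validate_rows` name and rule set are made up for illustration) flags rows that miss required columns or fall outside an expected range:

```python
def validate_rows(rows, required_columns, value_range):
    """Minimal data-quality check: required columns present, values in range."""
    lo, hi = value_range
    failures = []
    for i, row in enumerate(rows):
        missing = [c for c in required_columns if c not in row]
        if missing:
            failures.append((i, f"missing columns: {missing}"))
        elif not (lo <= row["value"] <= hi):
            failures.append((i, f"value {row['value']} outside [{lo}, {hi}]"))
    return failures

rows = [{"value": 5}, {"value": 500}, {"other": 1}]
issues = validate_rows(rows, required_columns=["value"], value_range=(0, 100))
```

A pipeline would typically fail (or quarantine the data) when `issues` is non-empty, before any training stage runs.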

Model Operations


  • Separate training and serving infrastructure
  • Use model registries (MLflow, Weights & Biases)
  • Implement gradual rollouts for new models
  • Monitor model performance drift
  • Maintain rollback capabilities
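Rollback capability follows from keeping a version history in the registry. A toy sketch of that idea (real registries like MLflow track versions, stages, and metadata; `ModelRegistry` here is purely illustrative):

```python
class ModelRegistry:
    """Toy registry keeping a version history so rollback stays possible."""

    def __init__(self):
        self.versions = []

    def register(self, model):
        self.versions.append(model)
        return len(self.versions)          # 1-based version number

    def current(self):
        return self.versions[-1]

    def rollback(self):
        """Drop the latest version and restore the previous one."""
        if len(self.versions) > 1:
            self.versions.pop()
        return self.current()

registry = ModelRegistry()
registry.register({"name": "classifier", "accuracy": 0.90})
registry.register({"name": "classifier", "accuracy": 0.85})  # regression slipped through
restored = registry.rollback()
```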

Deployment Strategies


  • Start with shadow deployments
  • Use canary releases for validation
  • Implement A/B testing infrastructure
  • Set up automated rollback triggers
  • Monitor latency and throughput
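A canary release needs a deterministic traffic split so the same request (or user) is always routed to the same model. One common approach, sketched here with hypothetical names, is hashing the request ID into buckets:

```python
import hashlib

def route_request(request_id, canary_fraction=0.05):
    """Deterministically route a fixed fraction of traffic to the canary model."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return "canary" if bucket < canary_fraction * 10_000 else "stable"

# Over many requests, roughly canary_fraction of traffic hits the canary
routes = [route_request(f"req-{i}", canary_fraction=0.05) for i in range(10_000)]
canary_share = routes.count("canary") / len(routes)
```

Hash-based routing (rather than random sampling) also makes results reproducible, which simplifies debugging a misbehaving canary.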

Integration Points


Orchestration Tools


  • Apache Airflow: DAG-based workflow orchestration
  • Dagster: Asset-based pipeline orchestration
  • Kubeflow Pipelines: Kubernetes-native ML workflows
  • Prefect: Modern dataflow automation

Experiment Tracking


  • MLflow for experiment tracking and model registry
  • Weights & Biases for visualization and collaboration
  • TensorBoard for training metrics

Deployment Platforms


  • AWS SageMaker for managed ML infrastructure
  • Google Vertex AI for GCP deployments
  • Azure ML for Azure cloud
  • Kubernetes + KServe for cloud-agnostic serving

Progressive Disclosure


Start with the basics and gradually add complexity:
  1. Level 1: Simple linear pipeline (data → train → deploy)
  2. Level 2: Add validation and monitoring stages
  3. Level 3: Implement hyperparameter tuning
  4. Level 4: Add A/B testing and gradual rollouts
  5. Level 5: Multi-model pipelines with ensemble strategies

Common Patterns


Batch Training Pipeline


```yaml
# See assets/pipeline-dag.yaml.template
stages:
  - name: data_preparation
    dependencies: []
  - name: model_training
    dependencies: [data_preparation]
  - name: model_evaluation
    dependencies: [model_training]
  - name: model_deployment
    dependencies: [model_evaluation]
```

Real-time Feature Pipeline


```python
# Stream processing for real-time features
# Combined with batch training
# See references/data-preparation.md
```
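The core of a real-time feature pipeline is incremental state: each event updates a feature value without reprocessing history. A minimal sketch of a rolling-window mean (class and window size are illustrative; production systems use a stream processor and a feature store):

```python
from collections import deque

class RollingFeature:
    """Maintain a rolling-window mean over a stream of events (illustrative)."""

    def __init__(self, window=3):
        self.values = deque(maxlen=window)  # oldest value drops out automatically

    def update(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)

feature = RollingFeature(window=3)
stream = [10, 20, 30, 40]
means = [feature.update(v) for v in stream]  # window mean after each event
```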

Continuous Training


```python
# Automated retraining on schedule
# Triggered by data drift detection
# See references/model-training.md
```
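Drift-triggered retraining compares live statistics against a reference snapshot and fires when the gap exceeds a threshold. The sketch below uses a simple relative-mean check; real systems use richer tests (e.g. population stability index, KS test), and `should_retrain` is a hypothetical helper:

```python
def should_retrain(reference_mean, live_values, drift_threshold=0.2):
    """Trigger retraining when the live feature mean drifts from the reference."""
    live_mean = sum(live_values) / len(live_values)
    drift = abs(live_mean - reference_mean) / abs(reference_mean)
    return drift > drift_threshold, drift

# Stable traffic: live mean stays near the reference, no retrain
stable, _ = should_retrain(reference_mean=10.0, live_values=[9.5, 10.2, 10.1])

# Drifted traffic: mean shifted by 50%, retraining fires
drifted, drift = should_retrain(reference_mean=10.0, live_values=[14.0, 15.0, 16.0])
```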

Troubleshooting


Common Issues


  • Pipeline failures: Check dependencies and data availability
  • Training instability: Review hyperparameters and data quality
  • Deployment issues: Validate model artifacts and serving config
  • Performance degradation: Monitor data drift and model metrics

Debugging Steps


  1. Check pipeline logs for each stage
  2. Validate input/output data at boundaries
  3. Test components in isolation
  4. Review experiment tracking metrics
  5. Inspect model artifacts and metadata

Next Steps


After setting up your pipeline:
  1. Explore hyperparameter-tuning skill for optimization
  2. Learn experiment-tracking-setup for MLflow/W&B
  3. Review model-deployment-patterns for serving strategies
  4. Implement monitoring with observability tools

Related Skills


  • experiment-tracking-setup: MLflow and Weights & Biases integration
  • hyperparameter-tuning: Automated hyperparameter optimization
  • model-deployment-patterns: Advanced deployment strategies