machine-learning-ops-ml-pipeline

Machine Learning Pipeline - Multi-Agent MLOps Orchestration

Design and implement a complete ML pipeline for: $ARGUMENTS

Use this skill when

  • Working on ML pipeline or multi-agent MLOps orchestration tasks and workflows
  • Needing guidance, best practices, or checklists for ML pipelines and multi-agent MLOps orchestration

Do not use this skill when

  • The task is unrelated to ML pipelines or multi-agent MLOps orchestration
  • You need a different domain or tool outside this scope

Instructions

  • Clarify goals, constraints, and required inputs.
  • Apply relevant best practices and validate outcomes.
  • Provide actionable steps and verification.
  • If detailed examples are required, open resources/implementation-playbook.md.

Thinking

This workflow orchestrates multiple specialized agents to build a production-ready ML pipeline following modern MLOps best practices. The approach emphasizes:
  • Phase-based coordination: Each phase builds upon previous outputs, with clear handoffs between agents
  • Modern tooling integration: MLflow/W&B for experiments, Feast/Tecton for features, KServe/Seldon for serving
  • Production-first mindset: Every component designed for scale, monitoring, and reliability
  • Reproducibility: Version control for data, models, and infrastructure
  • Continuous improvement: Automated retraining, A/B testing, and drift detection
The multi-agent approach ensures each aspect is handled by domain experts:
  • Data engineers handle ingestion and quality
  • Data scientists design features and experiments
  • ML engineers implement training pipelines
  • MLOps engineers handle production deployment
  • Observability engineers ensure monitoring

Phase 1: Data & Requirements Analysis

<Task>
subagent_type: data-engineer
prompt: |
Analyze and design data pipeline for ML system with requirements: $ARGUMENTS
Deliverables:
  1. Data source audit and ingestion strategy:
    • Source systems and connection patterns
    • Schema validation using Pydantic/Great Expectations
    • Data versioning with DVC or lakeFS
    • Incremental loading and CDC strategies
  2. Data quality framework:
    • Profiling and statistics generation
    • Anomaly detection rules
    • Data lineage tracking
    • Quality gates and SLAs
  3. Storage architecture:
    • Raw/processed/feature layers
    • Partitioning strategy
    • Retention policies
    • Cost optimization
Provide implementation code for critical components and integration patterns.
</Task>
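The schema-validation gate named in the deliverables above can be sketched with Pydantic. The `SensorReading` fields here are hypothetical placeholders; in practice the model mirrors the real source schema, and invalid rows are quarantined rather than dropped silently:

```python
from datetime import datetime
from pydantic import BaseModel, Field, ValidationError


class SensorReading(BaseModel):
    """Hypothetical record schema; adapt fields to the actual source system."""
    event_id: str
    value: float = Field(ge=0.0)  # reject physically impossible negative readings
    recorded_at: datetime


def validate_batch(rows: list[dict]) -> tuple[list[SensorReading], list[dict]]:
    """Split a raw batch into validated records and quarantined rows."""
    valid, quarantined = [], []
    for row in rows:
        try:
            valid.append(SensorReading(**row))
        except ValidationError:
            quarantined.append(row)
    return valid, quarantined
```

The same split-and-quarantine pattern scales up to Great Expectations suites, where the quarantine rate feeds the quality gates and SLAs listed above.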
<Task>
subagent_type: data-scientist
prompt: |
Design feature engineering and model requirements for: $ARGUMENTS
Using data architecture from: {phase1.data-engineer.output}
Deliverables:
  1. Feature engineering pipeline:
    • Transformation specifications
    • Feature store schema (Feast/Tecton)
    • Statistical validation rules
    • Handling strategies for missing data/outliers
  2. Model requirements:
    • Algorithm selection rationale
    • Performance metrics and baselines
    • Training data requirements
    • Evaluation criteria and thresholds
  3. Experiment design:
    • Hypothesis and success metrics
    • A/B testing methodology
    • Sample size calculations
    • Bias detection approach
Include feature transformation code and statistical validation logic.
</Task>
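A minimal pandas sketch of the missing-data and outlier handling this task asks for: median imputation followed by percentile clipping. The 1st/99th-percentile thresholds are illustrative defaults, not prescribed values:

```python
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Median-impute missing values, then clip outliers to the 1st-99th percentile."""
    out = df.copy()
    for col in out.select_dtypes("number").columns:
        out[col] = out[col].fillna(out[col].median())
        lo, hi = out[col].quantile([0.01, 0.99])
        out[col] = out[col].clip(lo, hi)
    return out
```

In a feature-store setup the same transformations would be registered as Feast/Tecton feature views so that training and serving share one definition.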

Phase 2: Model Development & Training

<Task>
subagent_type: ml-engineer
prompt: |
Implement training pipeline based on requirements: {phase1.data-scientist.output}
Using data pipeline: {phase1.data-engineer.output}
Build comprehensive training system:
  1. Training pipeline implementation:
    • Modular training code with clear interfaces
    • Hyperparameter optimization (Optuna/Ray Tune)
    • Distributed training support (Horovod/PyTorch DDP)
    • Cross-validation and ensemble strategies
  2. Experiment tracking setup:
    • MLflow/Weights & Biases integration
    • Metric logging and visualization
    • Artifact management (models, plots, data samples)
    • Experiment comparison and analysis tools
  3. Model registry integration:
    • Version control and tagging strategy
    • Model metadata and lineage
    • Promotion workflows (dev -> staging -> prod)
    • Rollback procedures
Provide complete training code with configuration management.
</Task>
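In spirit, the hyperparameter-optimization step reduces to a search loop like the stdlib sketch below; Optuna and Ray Tune replace it in production with smarter samplers, pruning, and parallelism. The quadratic objective is a placeholder standing in for a validation-loss evaluation:

```python
import random


def random_search(objective, space, n_trials=50, seed=0):
    """Minimal random-search loop over a dict of (low, high) parameter ranges."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Placeholder objective: pretend validation loss is minimized at lr = 0.01
best, loss = random_search(
    objective=lambda p: (p["lr"] - 0.01) ** 2,
    space={"lr": (1e-4, 0.1)},
    n_trials=200,
)
```

Each trial's params and score would be logged to MLflow or Weights & Biases so the experiment-comparison tooling above has something to compare.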
<Task>
subagent_type: python-pro
prompt: |
Optimize and productionize ML code from: {phase2.ml-engineer.output}
Focus areas:
  1. Code quality and structure:
    • Refactor for production standards
    • Add comprehensive error handling
    • Implement proper logging with structured formats
    • Create reusable components and utilities
  2. Performance optimization:
    • Profile and optimize bottlenecks
    • Implement caching strategies
    • Optimize data loading and preprocessing
    • Memory management for large-scale training
  3. Testing framework:
    • Unit tests for data transformations
    • Integration tests for pipeline components
    • Model quality tests (invariance, directional)
    • Performance regression tests
Deliver production-ready, maintainable code with full test coverage.
</Task>
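The model quality tests mentioned above (invariance, directional expectation) follow a simple pattern: perturb an input in a way that should or should not change the prediction, and assert accordingly. The toy `predict` below is a hypothetical stand-in for the trained model:

```python
def predict(features: dict) -> float:
    # Toy stand-in model: price rises with area and ignores a cosmetic id field.
    return 50.0 + 2.0 * features["area_sqm"]


def test_invariance():
    """Changing a metadata-only field must not move the prediction."""
    base = {"area_sqm": 80.0, "listing_id": 1}
    renamed = {**base, "listing_id": 999}
    assert predict(base) == predict(renamed)


def test_directional():
    """Increasing area should increase the predicted price."""
    small = {"area_sqm": 60.0, "listing_id": 1}
    large = {"area_sqm": 90.0, "listing_id": 1}
    assert predict(large) > predict(small)
```

Wired into CI, these run alongside the performance-regression tests as a gate before any model promotion.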

Phase 3: Production Deployment & Serving

<Task>
subagent_type: mlops-engineer
prompt: |
Design production deployment for models from: {phase2.ml-engineer.output}
With optimized code from: {phase2.python-pro.output}
Implementation requirements:
  1. Model serving infrastructure:
    • REST/gRPC APIs with FastAPI/TorchServe
    • Batch prediction pipelines (Airflow/Kubeflow)
    • Stream processing (Kafka/Kinesis integration)
    • Model serving platforms (KServe/Seldon Core)
  2. Deployment strategies:
    • Blue-green deployments for zero downtime
    • Canary releases with traffic splitting
    • Shadow deployments for validation
    • A/B testing infrastructure
  3. CI/CD pipeline:
    • GitHub Actions/GitLab CI workflows
    • Automated testing gates
    • Model validation before deployment
    • ArgoCD for GitOps deployment
  4. Infrastructure as Code:
    • Terraform modules for cloud resources
    • Helm charts for Kubernetes deployments
    • Docker multi-stage builds for optimization
    • Secret management with Vault/Secrets Manager
Provide complete deployment configuration and automation scripts.
</Task>
<Task>
subagent_type: kubernetes-architect
prompt: |
Design Kubernetes infrastructure for ML workloads from: {phase3.mlops-engineer.output}
Kubernetes-specific requirements:
  1. Workload orchestration:
    • Training job scheduling with Kubeflow
    • GPU resource allocation and sharing
    • Spot/preemptible instance integration
    • Priority classes and resource quotas
  2. Serving infrastructure:
    • HPA/VPA for autoscaling
    • KEDA for event-driven scaling
    • Istio service mesh for traffic management
    • Model caching and warm-up strategies
  3. Storage and data access:
    • PVC strategies for training data
    • Model artifact storage with CSI drivers
    • Distributed storage for feature stores
    • Cache layers for inference optimization
Provide Kubernetes manifests and Helm charts for the entire ML platform.
</Task>

Phase 4: Monitoring & Continuous Improvement

<Task>
subagent_type: observability-engineer
prompt: |
Implement comprehensive monitoring for ML system deployed in: {phase3.mlops-engineer.output}
Using Kubernetes infrastructure: {phase3.kubernetes-architect.output}
Monitoring framework:
  1. Model performance monitoring:
    • Prediction accuracy tracking
    • Latency and throughput metrics
    • Feature importance shifts
    • Business KPI correlation
  2. Data and model drift detection:
    • Statistical drift detection (KS test, PSI)
    • Concept drift monitoring
    • Feature distribution tracking
    • Automated drift alerts and reports
  3. System observability:
    • Prometheus metrics for all components
    • Grafana dashboards for visualization
    • Distributed tracing with Jaeger/Zipkin
    • Log aggregation with ELK/Loki
  4. Alerting and automation:
    • PagerDuty/Opsgenie integration
    • Automated retraining triggers
    • Performance degradation workflows
    • Incident response runbooks
  5. Cost tracking:
    • Resource utilization metrics
    • Cost allocation by model/experiment
    • Optimization recommendations
    • Budget alerts and controls
Deliver monitoring configuration, dashboards, and alert rules.
</Task>
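The PSI drift check referenced above can be sketched with NumPy. The 0.1/0.25 thresholds used to interpret the score are conventional rules of thumb, not standards:

```python
import numpy as np


def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training) and a live feature distribution.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
    """
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty buckets to avoid log(0) / division by zero
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

In the alerting setup above, this value would be exported as a Prometheus gauge per feature, with the automated retraining trigger firing when it crosses the major-shift threshold.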

Configuration Options

  • experiment_tracking: mlflow | wandb | neptune | clearml
  • feature_store: feast | tecton | databricks | custom
  • serving_platform: kserve | seldon | torchserve | triton
  • orchestration: kubeflow | airflow | prefect | dagster
  • cloud_provider: aws | azure | gcp | multi-cloud
  • deployment_mode: realtime | batch | streaming | hybrid
  • monitoring_stack: prometheus | datadog | newrelic | custom

Success Criteria

  1. Data Pipeline Success:
    • < 0.1% data quality issues in production
    • Automated data validation passing 99.9% of time
    • Complete data lineage tracking
    • Sub-second feature serving latency
  2. Model Performance:
    • Meeting or exceeding baseline metrics
    • < 5% performance degradation before retraining
    • Successful A/B tests with statistical significance
    • No undetected model drift > 24 hours
  3. Operational Excellence:
    • 99.9% uptime for model serving
    • < 200ms p99 inference latency
    • Automated rollback within 5 minutes
    • Complete observability with < 1 minute alert time
  4. Development Velocity:
    • < 1 hour from commit to production
    • Parallel experiment execution
    • Reproducible training runs
    • Self-service model deployment
  5. Cost Efficiency:
    • < 20% infrastructure waste
    • Optimized resource allocation
    • Automatic scaling based on load
    • Spot instance utilization > 60%

Final Deliverables

Upon completion, the orchestrated pipeline will provide:
  • End-to-end ML pipeline with full automation
  • Comprehensive documentation and runbooks
  • Production-ready infrastructure as code
  • Complete monitoring and alerting system
  • CI/CD pipelines for continuous improvement
  • Cost optimization and scaling strategies
  • Disaster recovery and rollback procedures