machine-learning-ops-ml-pipeline
Machine Learning Pipeline - Multi-Agent MLOps Orchestration
Design and implement a complete ML pipeline for: $ARGUMENTS
Use this skill when
- Working on ML pipeline or multi-agent MLOps orchestration tasks and workflows
- Needing guidance, best practices, or checklists for ML pipelines and MLOps orchestration
Do not use this skill when
- The task is unrelated to ML pipelines or multi-agent MLOps orchestration
- You need a different domain or tool outside this scope
Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open resources/implementation-playbook.md.
Thinking
This workflow orchestrates multiple specialized agents to build a production-ready ML pipeline following modern MLOps best practices. The approach emphasizes:
- Phase-based coordination: Each phase builds upon previous outputs, with clear handoffs between agents
- Modern tooling integration: MLflow/W&B for experiments, Feast/Tecton for features, KServe/Seldon for serving
- Production-first mindset: Every component designed for scale, monitoring, and reliability
- Reproducibility: Version control for data, models, and infrastructure
- Continuous improvement: Automated retraining, A/B testing, and drift detection
The multi-agent approach ensures each aspect is handled by domain experts:
- Data engineers handle ingestion and quality
- Data scientists design features and experiments
- ML engineers implement training pipelines
- MLOps engineers handle production deployment
- Observability engineers ensure monitoring
Phase 1: Data & Requirements Analysis
<Task>
subagent_type: data-engineer
prompt: |
Analyze and design data pipeline for ML system with requirements: $ARGUMENTS
Deliverables:
- Data source audit and ingestion strategy:
  - Source systems and connection patterns
  - Schema validation using Pydantic/Great Expectations
  - Data versioning with DVC or lakeFS
  - Incremental loading and CDC strategies
- Data quality framework:
  - Profiling and statistics generation
  - Anomaly detection rules
  - Data lineage tracking
  - Quality gates and SLAs
- Storage architecture:
  - Raw/processed/feature layers
  - Partitioning strategy
  - Retention policies
  - Cost optimization
Provide implementation code for critical components and integration patterns.
</Task>
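The quality-gate deliverable above can be sketched as a small batch check. This is a framework-free illustration of the idea (declarative rules evaluated per batch against a failure-rate SLA); a real pipeline would express the same rules as Pydantic models or Great Expectations suites, and the column names here are hypothetical:

```python
def quality_gate(rows, rules, max_failure_rate=0.001):
    """Evaluate a batch against per-column rules.

    `rules` maps column name -> predicate on the column value.
    Returns (passed, failure_rate); `passed` enforces the SLA threshold.
    """
    failures = 0
    for row in rows:
        if any(not pred(row.get(col)) for col, pred in rules.items()):
            failures += 1
    rate = failures / len(rows) if rows else 0.0
    return rate <= max_failure_rate, rate


# Illustrative rules -- the columns are hypothetical, not from the spec.
rules = {
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
    "currency": lambda v: isinstance(v, str) and len(v) == 3,
}
```

Rejected batches would be routed to quarantine storage and surfaced through the data lineage and alerting layers described later in the pipeline.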
<Task>
subagent_type: data-scientist
prompt: |
Design feature engineering and model requirements for: $ARGUMENTS
Using data architecture from: {phase1.data-engineer.output}
Deliverables:
- Feature engineering pipeline:
  - Transformation specifications
  - Feature store schema (Feast/Tecton)
  - Statistical validation rules
  - Handling strategies for missing data/outliers
- Model requirements:
  - Algorithm selection rationale
  - Performance metrics and baselines
  - Training data requirements
  - Evaluation criteria and thresholds
- Experiment design:
  - Hypothesis and success metrics
  - A/B testing methodology
  - Sample size calculations
  - Bias detection approach
Include feature transformation code and statistical validation logic.
</Task>
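A minimal sketch of the train/serve consistency idea behind the feature pipeline deliverables: transformation statistics are fit once on training data and reused verbatim at serving time, never refit online, so offline and online features stay consistent. The function names and tolerance are illustrative, not part of the original spec:

```python
import statistics


def fit_scaler(values):
    """Fit standardization statistics on training data only."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values) or 1.0,  # guard constant features
    }


def transform(values, stats):
    """Apply the stored training-time statistics (no refitting at serving)."""
    return [(v - stats["mean"]) / stats["std"] for v in values]


def validate_transformed(values, tol=1e-6):
    """Statistical validation rule: standardized training data is centered at ~0."""
    return abs(statistics.fmean(values)) < tol
```

In a feature store setup, the fitted statistics would be versioned alongside the feature definition so that serving reads exactly the artifact used in training.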
Phase 2: Model Development & Training
<Task>
subagent_type: ml-engineer
prompt: |
Implement training pipeline based on requirements: {phase1.data-scientist.output}
Using data pipeline: {phase1.data-engineer.output}
Build comprehensive training system:
- Training pipeline implementation:
  - Modular training code with clear interfaces
  - Hyperparameter optimization (Optuna/Ray Tune)
  - Distributed training support (Horovod/PyTorch DDP)
  - Cross-validation and ensemble strategies
- Experiment tracking setup:
  - MLflow/Weights & Biases integration
  - Metric logging and visualization
  - Artifact management (models, plots, data samples)
  - Experiment comparison and analysis tools
- Model registry integration:
  - Version control and tagging strategy
  - Model metadata and lineage
  - Promotion workflows (dev -> staging -> prod)
  - Rollback procedures
Provide complete training code with configuration management.
</Task>
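The hyperparameter-optimization step can be sketched with a seeded random search standing in for Optuna/Ray Tune, whose APIs wrap the same core loop: sample a configuration, evaluate it, keep the best. The objective below is a toy quadratic, not a real training run, and the search-space format is illustrative:

```python
import random


def objective(params):
    # Hypothetical objective: a real pipeline would train a model and return
    # a validation metric here. Minimum is at lr = 0.1.
    return (params["lr"] - 0.1) ** 2


def random_search(space, n_trials=100, seed=0):
    """Minimal stand-in for an Optuna-style optimize loop.

    `space` maps parameter name -> (low, high) uniform range.
    Seeding makes the run reproducible, as the training pipeline requires.
    """
    rng = random.Random(seed)
    best_params, best_value = None, float("inf")
    for _ in range(n_trials):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        value = objective(params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value
```

With a real tracker, each trial's params and value would additionally be logged to MLflow/W&B so experiments remain comparable across runs.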
<Task>
subagent_type: python-pro
prompt: |
Optimize and productionize ML code from: {phase2.ml-engineer.output}
Focus areas:
- Code quality and structure:
  - Refactor for production standards
  - Add comprehensive error handling
  - Implement proper logging with structured formats
  - Create reusable components and utilities
- Performance optimization:
  - Profile and optimize bottlenecks
  - Implement caching strategies
  - Optimize data loading and preprocessing
  - Memory management for large-scale training
- Testing framework:
  - Unit tests for data transformations
  - Integration tests for pipeline components
  - Model quality tests (invariance, directional)
  - Performance regression tests
Deliver production-ready, maintainable code with full test coverage.
</Task>
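The model quality tests named above (invariance, directional) can be sketched as plain assertions. In a real suite these would be pytest cases run against the model loaded from the registry; the linear scorer and feature names below are hypothetical stand-ins:

```python
def predict(features):
    # Hypothetical stand-in for a trained model; note it ignores
    # "account_suffix" entirely, which the invariance test relies on.
    return 2.0 * features["income"] - 0.5 * features["debt"]


def test_directional():
    """Directional expectation: raising income must not lower the score."""
    base = {"income": 50.0, "debt": 10.0, "account_suffix": 7}
    assert predict({**base, "income": 60.0}) >= predict(base)


def test_invariance():
    """Invariance: perturbing a feature the model must ignore cannot change the score."""
    base = {"income": 50.0, "debt": 10.0, "account_suffix": 7}
    assert predict({**base, "account_suffix": 3}) == predict(base)
```

Wired into the CI gates described in Phase 3, such tests block promotion of a model version that violates domain expectations even when aggregate metrics look fine.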
Phase 3: Production Deployment & Serving
<Task>
subagent_type: mlops-engineer
prompt: |
Design production deployment for models from: {phase2.ml-engineer.output}
With optimized code from: {phase2.python-pro.output}
Implementation requirements:
- Model serving infrastructure:
  - REST/gRPC APIs with FastAPI/TorchServe
  - Batch prediction pipelines (Airflow/Kubeflow)
  - Stream processing (Kafka/Kinesis integration)
  - Model serving platforms (KServe/Seldon Core)
- Deployment strategies:
  - Blue-green deployments for zero downtime
  - Canary releases with traffic splitting
  - Shadow deployments for validation
  - A/B testing infrastructure
- CI/CD pipeline:
  - GitHub Actions/GitLab CI workflows
  - Automated testing gates
  - Model validation before deployment
  - ArgoCD for GitOps deployment
- Infrastructure as Code:
  - Terraform modules for cloud resources
  - Helm charts for Kubernetes deployments
  - Docker multi-stage builds for optimization
  - Secret management with Vault/Secrets Manager
Provide complete deployment configuration and automation scripts.
</Task>
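One of the deployment strategies above, canary releases with traffic splitting, can be sketched as a deterministic hash-based router: hashing the request id gives a sticky assignment, so a given caller stays on the same model version across retries. The function name and bucket count are illustrative; in KServe/Istio this split would be configured declaratively rather than in application code:

```python
import hashlib


def route_model(request_id: str, canary_fraction: float = 0.05) -> str:
    """Sticky canary routing: the same request id always gets the same version.

    `canary_fraction` is the share of traffic (0.0-1.0) sent to the canary.
    """
    # Hash into 10,000 buckets for 0.01% routing granularity.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return "canary" if bucket < canary_fraction * 10_000 else "stable"
```

Ramping the canary is then just raising `canary_fraction` in steps while the Phase 4 monitors compare error rates and latency between the two versions.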
<Task>
subagent_type: kubernetes-architect
prompt: |
Design Kubernetes infrastructure for ML workloads from: {phase3.mlops-engineer.output}
Kubernetes-specific requirements:
- Workload orchestration:
  - Training job scheduling with Kubeflow
  - GPU resource allocation and sharing
  - Spot/preemptible instance integration
  - Priority classes and resource quotas
- Serving infrastructure:
  - HPA/VPA for autoscaling
  - KEDA for event-driven scaling
  - Istio service mesh for traffic management
  - Model caching and warm-up strategies
- Storage and data access:
  - PVC strategies for training data
  - Model artifact storage with CSI drivers
  - Distributed storage for feature stores
  - Cache layers for inference optimization
Provide Kubernetes manifests and Helm charts for entire ML platform.
</Task>
Phase 4: Monitoring & Continuous Improvement
<Task>
subagent_type: observability-engineer
prompt: |
Implement comprehensive monitoring for ML system deployed in: {phase3.mlops-engineer.output}
Using Kubernetes infrastructure: {phase3.kubernetes-architect.output}
Monitoring framework:
- Model performance monitoring:
  - Prediction accuracy tracking
  - Latency and throughput metrics
  - Feature importance shifts
  - Business KPI correlation
- Data and model drift detection:
  - Statistical drift detection (KS test, PSI)
  - Concept drift monitoring
  - Feature distribution tracking
  - Automated drift alerts and reports
- System observability:
  - Prometheus metrics for all components
  - Grafana dashboards for visualization
  - Distributed tracing with Jaeger/Zipkin
  - Log aggregation with ELK/Loki
- Alerting and automation:
  - PagerDuty/Opsgenie integration
  - Automated retraining triggers
  - Performance degradation workflows
  - Incident response runbooks
- Cost tracking:
  - Resource utilization metrics
  - Cost allocation by model/experiment
  - Optimization recommendations
  - Budget alerts and controls
Deliver monitoring configuration, dashboards, and alert rules.
</Task>
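The PSI drift check mentioned above can be sketched in a few lines. This is a simplified equal-width-bin implementation for illustration only; production systems typically use quantile bins and a library implementation. A common rule of thumb reads PSI < 0.1 as stable, 0.1-0.25 as moderate shift, and > 0.25 as drift worth alerting on:

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a live sample."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard degenerate single-value data

    def hist(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # Smooth empty bins to avoid log(0) and division by zero.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

In the monitoring framework, per-feature PSI computed on a schedule would feed the automated drift alerts and, past a threshold, the retraining triggers.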
Configuration Options
- experiment_tracking: mlflow | wandb | neptune | clearml
- feature_store: feast | tecton | databricks | custom
- serving_platform: kserve | seldon | torchserve | triton
- orchestration: kubeflow | airflow | prefect | dagster
- cloud_provider: aws | azure | gcp | multi-cloud
- deployment_mode: realtime | batch | streaming | hybrid
- monitoring_stack: prometheus | datadog | newrelic | custom
Success Criteria
- Data Pipeline Success:
  - < 0.1% data quality issues in production
  - Automated data validation passing 99.9% of the time
  - Complete data lineage tracking
  - Sub-second feature serving latency
- Model Performance:
  - Meeting or exceeding baseline metrics
  - < 5% performance degradation before retraining
  - Successful A/B tests with statistical significance
  - No model drift going undetected for > 24 hours
- Operational Excellence:
  - 99.9% uptime for model serving
  - < 200ms p99 inference latency
  - Automated rollback within 5 minutes
  - Complete observability with < 1 minute alert time
- Development Velocity:
  - < 1 hour from commit to production
  - Parallel experiment execution
  - Reproducible training runs
  - Self-service model deployment
- Cost Efficiency:
  - < 20% infrastructure waste
  - Optimized resource allocation
  - Automatic scaling based on load
  - Spot instance utilization > 60%
Final Deliverables
Upon completion, the orchestrated pipeline will provide:
- End-to-end ML pipeline with full automation
- Comprehensive documentation and runbooks
- Production-ready infrastructure as code
- Complete monitoring and alerting system
- CI/CD pipelines for continuous improvement
- Cost optimization and scaling strategies
- Disaster recovery and rollback procedures