MLOps Patterns
Operationalize machine learning models from experimentation to production deployment and monitoring.
Purpose
Provide strategic guidance for ML engineers and platform teams to build production-grade ML infrastructure. Cover the complete lifecycle: experiment tracking, model registry, feature stores, deployment patterns, pipeline orchestration, and monitoring.
When to Use This Skill
Use this skill when:
- Designing MLOps infrastructure for production ML systems
- Selecting experiment tracking platforms (MLflow, Weights & Biases, Neptune)
- Implementing feature stores for online/offline feature serving
- Choosing model serving solutions (Seldon Core, KServe, BentoML, TorchServe)
- Building ML pipelines for training, evaluation, and deployment
- Setting up model monitoring and drift detection
- Establishing model governance and compliance frameworks
- Optimizing ML inference costs and performance
- Migrating from notebooks to production ML systems
- Implementing continuous training and automated retraining
Core Concepts
核心概念
1. Experiment Tracking
Track experiments systematically to ensure reproducibility and collaboration.
Key Components:
- Parameters: Hyperparameters logged for each training run
- Metrics: Performance measures tracked over time (accuracy, loss, F1)
- Artifacts: Model weights, plots, datasets, configuration files
- Metadata: Tags, descriptions, Git commit SHA, environment details
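The four components above can be pictured with a minimal, platform-neutral sketch. This is a hypothetical stdlib-only logger for illustration, not MLflow's or W&B's actual API:

```python
import json
import time
from pathlib import Path

def log_run(run_dir: str, params: dict, metrics: dict,
            artifacts: list[str], metadata: dict) -> Path:
    """Persist one training run's record: parameters, metrics,
    artifact paths, and metadata (tags, Git SHA, environment)."""
    record = {
        "timestamp": time.time(),
        "params": params,        # e.g. {"lr": 3e-4, "epochs": 10}
        "metrics": metrics,      # e.g. {"accuracy": 0.94, "f1": 0.91}
        "artifacts": artifacts,  # paths to weights, plots, configs
        "metadata": metadata,    # tags, git_sha, environment details
    }
    out = Path(run_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "run.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

Whatever platform you choose, a run record with exactly these four fields is the unit of reproducibility.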
Platform Comparison:
MLflow (Open-source standard):
- Framework-agnostic (PyTorch, TensorFlow, scikit-learn, XGBoost)
- Self-hosted or cloud-agnostic deployment
- Integrated model registry
- Basic UI, adequate for most use cases
- Free, requires infrastructure management
Weights & Biases (SaaS, collaboration-focused):
- Advanced visualization and dashboards
- Integrated hyperparameter optimization (Sweeps)
- Excellent team collaboration features
- SaaS pricing scales with usage
- Best-in-class UI
Neptune.ai (Enterprise-grade):
- Enterprise features (RBAC, audit logs, compliance)
- Integrated production monitoring
- Higher cost than W&B
- Good for regulated industries
Selection Criteria:
- Open-source requirement → MLflow
- Team collaboration critical → Weights & Biases
- Enterprise compliance (RBAC, audits) → Neptune.ai
- Hyperparameter optimization primary → Weights & Biases (Sweeps)
For detailed comparison and decision framework, see references/experiment-tracking.md.
2. Model Registry and Versioning
Centralize model artifacts with version control and stage management.
Model Registry Components:
- Model artifacts (weights, serialized models)
- Training metrics (accuracy, F1, AUC)
- Hyperparameters used during training
- Training dataset version
- Feature schema (input/output signatures)
- Model cards (documentation, use cases, limitations)
Stage Management:
- None: Newly registered model
- Staging: Testing in pre-production environment
- Production: Serving live traffic
- Archived: Deprecated, retained for compliance
Versioning Strategies:
Semantic Versioning for Models:
- Major version (v2.0.0): Breaking change in input/output schema
- Minor version (v1.1.0): New feature, backward-compatible
- Patch version (v1.0.1): Bug fix, model retrained on new data
Git-Based Versioning:
- Model code in Git (training scripts, configuration)
- Model weights in DVC (Data Version Control) or Git-LFS
- Reproducibility via commit SHA + data version hash
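The semantic-versioning rules above can be expressed as a small helper. The change-type names here are made up for this sketch:

```python
def bump_model_version(version: str, change: str) -> str:
    """Bump a model's semantic version.

    change: "schema"  -> major (breaking input/output schema change)
            "feature" -> minor (backward-compatible addition)
            "retrain" -> patch (bug fix, or retrained on new data)
    """
    major, minor, patch = (int(p) for p in version.split("."))
    if change == "schema":
        return f"{major + 1}.0.0"
    if change == "feature":
        return f"{major}.{minor + 1}.0"
    if change == "retrain":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")
```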
For model lineage tracking and registry patterns, see references/model-registry.md.
3. Feature Stores
Centralize feature engineering to ensure consistency between training and inference.
Problem Addressed: Training/serving skew
- Training: features computed offline in batch pipelines, sometimes with accidental future knowledge (data leakage)
- Inference: features computed in a separate online path, with only past data available
- Result: the two paths compute different feature values, so the model performs well in training but fails in production
Feature Store Solution:
Online Feature Store:
- Purpose: Low-latency feature retrieval for real-time inference
- Storage: Redis, DynamoDB, Cassandra (key-value stores)
- Latency: Sub-10ms for feature lookup
- Use Case: Real-time predictions (fraud detection, recommendations)
Offline Feature Store:
- Purpose: Historical feature data for training and batch inference
- Storage: Parquet files (S3/GCS), data warehouses (Snowflake, BigQuery)
- Latency: Seconds to minutes (batch retrieval)
- Use Case: Model training, backtesting, batch predictions
Point-in-Time Correctness:
- Ensures no future data leakage during training
- Feature values at time T only use data available before time T
- Critical for avoiding overly optimistic training metrics
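Point-in-time correctness can be illustrated with a tiny pure-Python join. This is a sketch of the idea; real feature stores such as Feast implement it over Parquet or warehouse tables:

```python
def point_in_time_join(labels, feature_log):
    """For each (entity_id, label_ts) pair, attach the most recent
    feature value observed strictly before label_ts, never a future one.

    labels:      list of (entity_id, label_ts)
    feature_log: list of (entity_id, feature_ts, value), in any order
    """
    rows = []
    for entity_id, label_ts in labels:
        candidates = [
            (ts, value) for e, ts, value in feature_log
            if e == entity_id and ts < label_ts   # exclude future data
        ]
        # max over (ts, value) tuples picks the latest observation
        value = max(candidates)[1] if candidates else None
        rows.append((entity_id, label_ts, value))
    return rows
```

Skipping the `ts < label_ts` filter is exactly the leakage that produces overly optimistic training metrics.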
Platform Comparison:
Feast (Open-source, cloud-agnostic):
- Most popular open-source feature store
- Supports Redis, DynamoDB, Datastore (online) and Parquet, BigQuery, Snowflake (offline)
- Cloud-agnostic, no vendor lock-in
- Active community, growing adoption
Tecton (Managed, production-grade):
- Feast-compatible API
- Fully managed service
- Integrated monitoring and governance
- Higher cost, enterprise-focused
SageMaker Feature Store (AWS):
- Integrated with AWS ecosystem
- Managed online/offline stores
- AWS lock-in
Databricks Feature Store (Databricks):
- Unity Catalog integration
- Delta Lake for offline storage
- Databricks ecosystem lock-in
Selection Criteria:
- Open-source, cloud-agnostic → Feast
- Managed solution, production-grade → Tecton
- AWS ecosystem → SageMaker Feature Store
- Databricks users → Databricks Feature Store
For feature engineering patterns and implementation, see references/feature-stores.md.
4. Model Serving Patterns
Deploy models for synchronous, asynchronous, batch, or streaming inference.
Serving Patterns:
REST API Deployment:
- Pattern: HTTP endpoint for synchronous predictions
- Latency: <100ms acceptable
- Use Case: Request-response applications
- Tools: Flask, FastAPI, BentoML, Seldon Core
gRPC Deployment:
- Pattern: High-performance RPC for low-latency inference
- Latency: <10ms target
- Use Case: Microservices, latency-critical applications
- Tools: TensorFlow Serving, TorchServe, Seldon Core
Batch Inference:
- Pattern: Process large datasets offline
- Latency: Minutes to hours acceptable
- Use Case: Daily/hourly predictions for millions of records
- Tools: Spark, Dask, Ray
Streaming Inference:
- Pattern: Real-time predictions on streaming data
- Latency: Milliseconds
- Use Case: Fraud detection, anomaly detection, real-time recommendations
- Tools: Kafka + Flink/Spark Streaming
Platform Comparison:
Seldon Core (Kubernetes-native, advanced):
- Advanced deployment strategies (canary, A/B testing, multi-armed bandits)
- Multi-framework support
- Integrated explainability (Alibi)
- High complexity, steep learning curve
KServe (CNCF standard):
- Standardized InferenceService API
- Serverless scaling (scale-to-zero with Knative)
- Kubernetes-native
- Growing adoption, CNCF backing
BentoML (Python-first, simplicity):
- Easiest to get started
- Excellent developer experience
- Local testing → cloud deployment
- Lower complexity than Seldon/KServe
TorchServe (PyTorch official):
- PyTorch-specific serving
- Production-grade, optimized for PyTorch models
- Less flexible for multi-framework use
TensorFlow Serving (TensorFlow official):
- TensorFlow-specific serving
- Production-grade, optimized for TensorFlow models
- Less flexible for multi-framework use
Selection Criteria:
- Kubernetes, advanced deployments → Seldon Core or KServe
- Python-first, simplicity → BentoML
- PyTorch-specific → TorchServe
- TensorFlow-specific → TensorFlow Serving
- Managed solution → SageMaker/Vertex AI/Azure ML
For model optimization and serving infrastructure, see references/model-serving.md.
5. Deployment Strategies
Deploy models safely with rollback capabilities.
Blue-Green Deployment:
- Two identical environments (Blue: current, Green: new)
- Deploy to Green, test, switch 100% traffic instantly
- Instant rollback (switch back to Blue)
- Trade-off: Requires 2x infrastructure, all-or-nothing switch
Canary Deployment:
- Gradual rollout to subset of traffic
- Route 5% → 10% → 25% → 50% → 100% over time
- Monitor metrics at each stage, rollback if degradation
- Trade-off: Complex routing logic, longer deployment time
Shadow Deployment:
- New model receives traffic but predictions not used
- Compare new model vs old model offline
- Zero risk to production
- Trade-off: Requires 2x compute, delayed feedback
A/B Testing:
- Split traffic between model versions
- Measure business metrics (conversion rate, revenue)
- Statistical significance testing
- Use Case: Optimize for business outcomes, not just ML metrics
Multi-Armed Bandit (MAB):
- Epsilon-greedy: Explore (try new models) vs Exploit (use best model)
- Thompson Sampling: Bayesian approach to exploration
- Use Case: Continuous optimization, faster convergence than A/B
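A minimal epsilon-greedy router over model versions might look like this. It is an illustrative sketch, not any serving platform's built-in bandit:

```python
import random

def choose_model(rewards: dict, epsilon: float = 0.1) -> str:
    """Epsilon-greedy routing between model versions.

    rewards maps model name -> list of observed per-request rewards
    (e.g. clicks, conversions). With probability epsilon, explore a
    random model; otherwise exploit the best average reward so far.
    """
    if random.random() < epsilon:
        return random.choice(list(rewards))   # explore
    # exploit: highest mean reward observed so far
    return max(rewards, key=lambda m: sum(rewards[m]) / len(rewards[m]))
```

Compared with a fixed A/B split, the router shifts traffic toward the better model as evidence accumulates, which is why MAB typically converges faster.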
Selection Criteria:
- Low-risk model → Blue-green (instant cutover)
- Medium-risk model → Canary (gradual rollout)
- High-risk model → Shadow (test in production, no impact)
- Business optimization → A/B testing or MAB
For deployment architecture and examples, see references/deployment-strategies.md.
6. ML Pipeline Orchestration
Automate training, evaluation, and deployment workflows.
Training Pipeline Stages:
- Data Validation (Great Expectations, schema checks)
- Feature Engineering (transform raw data)
- Data Splitting (train/validation/test)
- Model Training (hyperparameter tuning)
- Model Evaluation (accuracy, fairness, explainability)
- Model Registration (push to registry if metrics pass thresholds)
- Deployment (promote to staging/production)
Continuous Training Pattern:
- Monitor production data for drift
- Detect data distribution changes (KS test, PSI)
- Trigger automated retraining when drift detected
- Validate new model before deployment
- Deploy via canary or shadow strategy
Platform Comparison:
Kubeflow Pipelines (ML-native, Kubernetes):
- ML-specific pipeline orchestration
- Kubernetes-native (scales with K8s)
- Component-based (reusable pipeline steps)
- Integrated with Katib (hyperparameter tuning)
Apache Airflow (Mature, general-purpose):
- Most mature orchestration platform
- Large ecosystem, extensive integrations
- Python-based DAGs
- Not ML-specific but widely used for ML workflows
Metaflow (Netflix, data science-friendly):
- Human-centric design, easy for data scientists
- Excellent local development experience
- Versioning built-in
- Simpler than Kubeflow/Airflow
Prefect (Modern, Python-native):
- Dynamic workflows, not static DAGs
- Better error handling than Airflow
- Modern UI and developer experience
- Growing community
Dagster (Asset-based, testing-focused):
- Asset-based thinking (not just task dependencies)
- Strong testing and data quality features
- Modern approach, good for data teams
- Smaller community than Airflow
Selection Criteria:
- ML-specific, Kubernetes → Kubeflow Pipelines
- Mature, battle-tested → Apache Airflow
- Data scientists, ease of use → Metaflow
- Software engineers, testing → Dagster
- Modern, simpler than Airflow → Prefect
For pipeline architecture and examples, see references/ml-pipelines.md.
7. Model Monitoring and Observability
Monitor production models for drift, performance, and quality.
Data Drift Detection:
- Definition: Input feature distributions change over time
- Impact: Model trained on old distribution, predictions degrade
- Detection Methods:
- Kolmogorov-Smirnov (KS) Test: Compare distributions
- Population Stability Index (PSI): Measure distribution shift
- Chi-Square Test: For categorical features
- Action: Trigger automated retraining when drift detected
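PSI can be computed in a few lines. This is an illustrative stdlib implementation; production systems typically use a library such as Evidently:

```python
import math

def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population Stability Index between a reference (training-time)
    sample and a production sample of one numeric feature.

    Rule of thumb: PSI < 0.1 stable, 0.1 to 0.25 moderate shift,
    > 0.25 significant shift (consider retraining).
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[max(i, 0)] += 1
        # small epsilon avoids log(0) for empty bins
        return [(c + 1e-6) / (len(sample) + 1e-6 * bins) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```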
Model Drift Detection:
- Definition: Model prediction quality degrades over time
- Impact: Accuracy, precision, recall decrease
- Detection Methods:
- Ground truth accuracy (delayed labels)
- Prediction distribution changes
- Calibration drift (predicted probabilities vs actual outcomes)
- Action: Alert team, trigger retraining
Performance Monitoring:
- Metrics:
- Latency: P50, P95, P99 inference time
- Throughput: Predictions per second
- Error Rate: Failed predictions / total predictions
- Resource Utilization: CPU, memory, GPU usage
- Alerting Thresholds:
- P95 latency > 100ms → Alert
- Error rate > 1% → Alert
- Accuracy drop > 5% → Trigger retraining
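The latency and error-rate thresholds above can be checked with stdlib percentiles. A sketch; in practice Prometheus computes these server-side from histograms:

```python
import statistics

def latency_alerts(latencies_ms: list, error_rate: float) -> list:
    """Evaluate the alerting thresholds above against one window of
    per-request latencies (ms) plus the window's error rate."""
    # quantiles(n=100) returns 99 cut points; index 94 approximates P95
    q = statistics.quantiles(latencies_ms, n=100)
    p95 = q[94]
    alerts = []
    if p95 > 100:
        alerts.append(f"P95 latency {p95:.0f}ms > 100ms")
    if error_rate > 0.01:
        alerts.append(f"error rate {error_rate:.1%} > 1%")
    return alerts
```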
Business Metrics Monitoring:
- Downstream impact: Conversion rate, revenue, user satisfaction
- Model predictions → business outcomes correlation
- Use Case: Optimize models for business value, not just ML metrics
Tools:
- Evidently AI: Data drift, model drift, data quality reports
- Prometheus + Grafana: Performance metrics, custom dashboards
- Arize AI: ML observability platform
- Fiddler: Model monitoring and explainability
For monitoring architecture and implementation, see references/model-monitoring.md.
8. Model Optimization Techniques
Reduce model size and inference latency.
Quantization:
- Convert model weights from float32 to int8
- Model size reduction: 4x smaller
- Inference speed: 2-3x faster
- Accuracy impact: Minimal (<1% degradation typically)
- Tools: PyTorch quantization, TensorFlow Lite, ONNX Runtime
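Symmetric int8 quantization reduces to a single scale factor. A toy illustration of the arithmetic only; real frameworks quantize per-channel using calibration data:

```python
def quantize_int8(weights: list):
    """Symmetric int8 quantization: map float weights to [-127, 127]
    with one scale factor; dequantize with w ~= q * scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]   # int8 codes
    return q, scale

def dequantize(q: list, scale: float) -> list:
    """Recover approximate float weights from int8 codes."""
    return [qi * scale for qi in q]
```

The reconstruction error is bounded by half a quantization step (scale / 2), which is why accuracy loss is usually small.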
Model Distillation:
- Train small student model to mimic large teacher model
- Transfer knowledge from teacher (BERT-large) to student (DistilBERT)
- Size reduction: 2-10x smaller
- Speed improvement: 2-10x faster
- Use Case: Deploy small model on edge devices, reduce inference cost
ONNX Conversion:
- Convert models to Open Neural Network Exchange (ONNX) format
- Cross-framework compatibility (PyTorch → ONNX → TensorFlow)
- Optimized inference with ONNX Runtime
- Speed improvement: 1.5-3x faster than native framework
Model Pruning:
- Remove less important weights from neural networks
- Sparsity: 30-90% of weights set to zero
- Size reduction: 2-10x smaller
- Accuracy impact: Minimal with structured pruning
For optimization techniques and examples, see references/model-serving.md.
9. LLMOps Patterns
Operationalize Large Language Models with specialized patterns.
LLM Fine-Tuning Pipelines:
- LoRA (Low-Rank Adaptation): Parameter-efficient fine-tuning
- QLoRA: Quantized LoRA (4-bit quantization)
- Pipeline: Base model → Fine-tuning dataset → LoRA adapters → Merged model
- Tools: Hugging Face PEFT, Axolotl
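The LoRA update can be written out explicitly as y = Wx + (alpha/r)·B(Ax). A toy dense-matrix illustration of the math only; PEFT applies this per attention projection with frozen W:

```python
def lora_forward(x, W, A, B, alpha=16):
    """Frozen base weight W plus a rank-r update B @ A, the only part
    trained during fine-tuning. Matrices are lists of row lists."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    r = len(A)                       # rank = number of rows of A
    base = matvec(W, x)              # frozen path
    delta = matvec(B, matvec(A, x))  # low-rank path: (d_out x r)(r x d_in)
    return [b + (alpha / r) * d for b, d in zip(base, delta)]
```

Because B is initialized to zero, the adapted model starts out exactly equal to the base model.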
Prompt Versioning:
- Version control for prompts (Git, prompt management platforms)
- A/B testing prompts for quality and cost optimization
- Monitoring prompt effectiveness over time
RAG System Monitoring:
- Retrieval quality: Relevance of retrieved documents
- Generation quality: Answer accuracy, hallucination detection
- End-to-end latency: Retrieval + generation time
- Tools: LangSmith, Arize Phoenix
LLM Inference Optimization:
- vLLM: High-throughput LLM serving
- TensorRT-LLM: NVIDIA-optimized LLM inference
- Text Generation Inference (TGI): Hugging Face serving
- Batching: Dynamic batching for throughput
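The dynamic-batching idea can be sketched framework-free. A hypothetical worker loop, not vLLM's or TGI's actual scheduler:

```python
import queue

def batching_worker(requests: "queue.Queue", model_fn, max_batch: int = 8,
                    timeout_s: float = 0.01):
    """Drain up to max_batch queued (input, reply_queue) items, blocking
    for the first and waiting briefly for stragglers, then run them
    through the model as one batch and deliver each result."""
    while True:
        item = requests.get()
        if item is None:                  # shutdown sentinel
            return
        batch = [item]
        while len(batch) < max_batch:
            try:
                nxt = requests.get(timeout=timeout_s)
            except queue.Empty:
                break
            if nxt is None:
                requests.put(None)        # re-queue sentinel, exit later
                break
            batch.append(nxt)
        inputs = [inp for inp, _ in batch]
        outputs = model_fn(inputs)        # one forward pass per batch
        for (_, reply_q), out in zip(batch, outputs):
            reply_q.put(out)
```

Larger batches amortize per-call overhead, trading a small queueing delay (timeout_s) for much higher throughput.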
Embedding Model Management:
- Version embeddings alongside models
- Monitor embedding drift (distribution changes)
- Update embeddings when underlying model changes
For LLMOps patterns and implementation, see references/llmops-patterns.md.
10. Model Governance and Compliance
Establish governance for model risk management and regulatory compliance.
Model Cards:
- Documentation: Model purpose, training data, performance metrics
- Limitations: Known biases, failure modes, out-of-scope use cases
- Ethical considerations: Fairness, privacy, societal impact
- Template: Model Card Toolkit (Google)
Bias and Fairness Detection:
- Measure disparate impact across demographic groups
- Tools: Fairlearn, AI Fairness 360 (IBM)
- Metrics: Demographic parity, equalized odds, calibration
- Mitigation: Reweighting, adversarial debiasing, threshold optimization
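Demographic parity is straightforward to measure. An illustrative helper; Fairlearn provides this and the other metrics listed above:

```python
def demographic_parity_gap(predictions: list, groups: list) -> float:
    """Largest difference in positive-prediction rate between any two
    demographic groups; 0.0 means perfect demographic parity.

    predictions: list of 0/1 model outputs
    groups:      parallel list of group labels (e.g. "A", "B")
    """
    rates = {}
    for pred, g in zip(predictions, groups):
        pos, total = rates.get(g, (0, 0))
        rates[g] = (pos + pred, total + 1)
    per_group = [pos / total for pos, total in rates.values()]
    return max(per_group) - min(per_group)
```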
Regulatory Compliance:
- EU AI Act: High-risk AI systems require documentation, monitoring
- Model Risk Management (SR 11-7): Banking industry requirements
- GDPR: Right to explanation for automated decisions
- HIPAA: Healthcare data privacy
Audit Trails:
- Log all model versions, training runs, deployments
- Track who approved model transitions (staging → production)
- Retain historical predictions for compliance audits
- Tools: MLflow, Neptune.ai (audit logs)
For governance frameworks and compliance, see references/governance.md.
Decision Frameworks
Framework 1: Experiment Tracking Platform Selection
Decision Tree:
Start with primary requirement:
- Open-source, self-hosted requirement → MLflow
- Team collaboration, advanced visualization (budget available) → Weights & Biases
- Team collaboration, advanced visualization (no budget) → MLflow
- Enterprise compliance (audit logs, RBAC) → Neptune.ai
- Hyperparameter optimization primary use case → Weights & Biases (Sweeps)
Detailed Criteria:
| Criteria | MLflow | Weights & Biases | Neptune.ai |
|---|---|---|---|
| Cost | Free | $200/user/month | $300/user/month |
| Collaboration | Basic | Excellent | Good |
| Visualization | Basic | Excellent | Good |
| Hyperparameter Tuning | External (Optuna) | Integrated (Sweeps) | Basic |
| Model Registry | Included | Add-on | Included |
| Self-Hosted | Yes | No (paid only) | Limited |
| Enterprise Features | No | Limited | Excellent |
Recommendation by Organization:
- Startup (<50 people): MLflow (free, adequate) or W&B (if budget)
- Growth (50-500 people): Weights & Biases (team collaboration)
- Enterprise (>500 people): Neptune.ai (compliance) or MLflow (cost)
For detailed decision framework, see references/decision-frameworks.md.
Framework 2: Feature Store Selection
Decision Matrix:
Primary requirement:
- Open-source, cloud-agnostic → Feast
- Managed solution, production-grade, multi-cloud → Tecton
- AWS ecosystem → SageMaker Feature Store
- GCP ecosystem → Vertex AI Feature Store
- Azure ecosystem → Azure ML Feature Store
- Databricks users → Databricks Feature Store
- Self-hosted with UI → Hopsworks
Criteria Comparison:
| Factor | Feast | Tecton | Hopsworks | SageMaker FS |
|---|---|---|---|---|
| Cost | Free | $$$$ | Free (self-host) | $$$ |
| Online Serving | Redis, DynamoDB | Managed | RonDB | Managed |
| Offline Store | Parquet, BigQuery, Snowflake | Managed | Hive, S3 | S3 |
| Point-in-Time | Yes | Yes | Yes | Yes |
| Monitoring | External | Integrated | Basic | External |
| Cloud Lock-in | No | No | No | AWS |
Recommendation:
- Open-source, self-managed → Feast
- Managed, production-grade → Tecton
- AWS ecosystem → SageMaker Feature Store
- Databricks users → Databricks Feature Store
For detailed decision framework, see references/decision-frameworks.md.
Framework 3: Model Serving Platform Selection
Decision Tree:
Infrastructure:
- Kubernetes-based → Advanced deployment patterns needed?
- Yes → Seldon Core (most features) or KServe (CNCF standard)
- No → BentoML (simpler, Python-first)
- Cloud-native (managed) → Cloud provider?
- AWS → SageMaker Endpoints
- GCP → Vertex AI Endpoints
- Azure → Azure ML Endpoints
- Framework-specific → Framework?
- PyTorch → TorchServe
- TensorFlow → TensorFlow Serving
- Serverless / minimal infrastructure → BentoML or Cloud Functions
Detailed Criteria:
| Feature | Seldon Core | KServe | BentoML | TorchServe |
|---|---|---|---|---|
| Kubernetes-Native | Yes | Yes | Optional | No |
| Multi-Framework | Yes | Yes | Yes | PyTorch-only |
| Deployment Strategies | Excellent | Good | Basic | Basic |
| Explainability | Integrated | Integrated | External | No |
| Complexity | High | Medium | Low | Low |
| Learning Curve | Steep | Medium | Gentle | Gentle |
Recommendation:
- Kubernetes, advanced deployments → Seldon Core or KServe
- Python-first, simplicity → BentoML
- PyTorch-specific → TorchServe
- TensorFlow-specific → TensorFlow Serving
- Managed solution → SageMaker/Vertex AI/Azure ML
For detailed decision framework, see references/decision-frameworks.md.
Framework 4: ML Pipeline Orchestration Selection
Decision Matrix:
Primary use case:
- ML-specific pipelines, Kubernetes-native → Kubeflow Pipelines
- General-purpose orchestration, mature ecosystem → Apache Airflow
- Data science workflows, ease of use → Metaflow
- Modern approach, asset-based thinking → Dagster
- Dynamic workflows, Python-native → Prefect
Criteria Comparison:
| Factor | Kubeflow | Airflow | Metaflow | Dagster | Prefect |
|---|---|---|---|---|---|
| ML-Specific | Excellent | Good | Excellent | Good | Good |
| Kubernetes | Native | Compatible | Optional | Compatible | Compatible |
| Learning Curve | Steep | Steep | Gentle | Medium | Medium |
| Maturity | High | Very High | Medium | Medium | Medium |
| Community | Large | Very Large | Growing | Growing | Growing |
Recommendation:
- ML-specific, Kubernetes → Kubeflow Pipelines
- Mature, battle-tested → Apache Airflow
- Data scientists → Metaflow
- Software engineers → Dagster
- Modern, simpler than Airflow → Prefect
For detailed decision framework, see references/decision-frameworks.md.
Implementation Patterns
Pattern 1: End-to-End ML Pipeline
Automate the complete ML workflow from data to deployment.
Pipeline Stages:
- Data Validation (Great Expectations)
- Feature Engineering (transform raw data)
- Data Splitting (train/validation/test)
- Model Training (with hyperparameter tuning)
- Model Evaluation (accuracy, fairness, explainability)
- Model Registration (push to MLflow registry)
- Deployment (promote to staging/production)
Architecture:
Data Lake → Data Validation → Feature Engineering → Training → Evaluation
↓
Model Registry (staging) → Testing → Production Deployment
For implementation details and code examples, see references/ml-pipelines.md.
Pattern 2: Continuous Training
Automate model retraining based on drift detection.
Workflow:
- Monitor production data for distribution changes
- Detect data drift (KS test, PSI)
- Trigger automated retraining pipeline
- Validate new model (accuracy, fairness)
- Deploy via canary strategy (5% → 100%)
- Monitor new model performance
- Rollback if metrics degrade
Trigger Conditions:
- Scheduled: Daily/weekly retraining
- Data drift: KS test p-value < 0.05
- Model drift: Accuracy drop > 5%
- Data volume: New training data exceeds threshold (10K samples)
For implementation details, see references/ml-pipelines.md.
基于漂移检测自动触发模型重训练。
工作流:
- 监控生产数据的分布变化
- 检测数据漂移(KS检验、PSI)
- 触发自动重训练流水线
- 验证新模型(准确率、公平性)
- 通过金丝雀策略部署(5% → 100%)
- 监控新模型性能
- 若指标下降则回滚
触发条件:
- 定时触发:每日/每周重训练
- 数据漂移:KS检验p值<0.05
- 模型漂移:准确率下降>5%
- 数据量:新训练数据超过阈值(10K样本)
如需实现细节,请查看references/ml-pipelines.md。
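The data-drift trigger in the workflow above can be sketched with a pure-Python Population Stability Index (PSI), one of the two drift statistics the pattern names. The equal-width binning and the 0.1/0.2 thresholds are common rules of thumb, not values from this document:

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a live sample.
    PSI = sum((a% - e%) * ln(a% / e%)) over shared bins."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bucket_fractions(values):
        counts = [0] * bins
        for v in values:
            i = sum(v > e for e in edges)  # index of the bin v falls into
            counts[i] += 1
        # Small epsilon avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e_frac = bucket_fractions(expected)
    a_frac = bucket_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_frac, a_frac))

rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # training distribution
stable = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]    # mean shifted by 1 sigma

# Rule of thumb: PSI < 0.1 stable, 0.1-0.2 moderate, > 0.2 trigger retraining
assert psi(reference, stable) < 0.1
assert psi(reference, shifted) > 0.2
```

A scheduled monitor would compute this against each day's production inputs and kick off the retraining pipeline when the threshold is crossed.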
Pattern 3: Feature Store Integration
模式3:特征存储集成
Ensure consistent features between training and inference.
Architecture:
Offline Store (Training):
Parquet/BigQuery → Point-in-Time Join → Training Dataset
Online Store (Inference):
Redis/DynamoDB → Low-Latency Lookup → Real-Time Prediction
Point-in-Time Correctness:
- Training: Fetch features as of specific timestamps (no future data)
- Inference: Fetch latest features (only past data)
- Guarantee: Same feature logic in training and inference
For implementation details and code examples, see references/feature-stores.md.
确保训练与推理阶段的特征一致性。
架构:
离线存储(训练):
Parquet/BigQuery → 时间点关联 → 训练数据集
在线存储(推理):
Redis/DynamoDB → 低延迟查询 → 实时预测
时间点正确性:
- 训练:获取特定时间点的特征(无未来数据)
- 推理:获取最新特征(仅历史数据)
- 保障:训练与推理使用相同的特征逻辑
如需实现细节与代码示例,请查看references/feature-stores.md。
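What point-in-time correctness means mechanically can be shown with a small pure-Python sketch of the join a feature store such as Feast performs under the hood. The entity name and feature values here are invented for illustration:

```python
from datetime import datetime

# Feature values for one entity, as (event_timestamp, features) pairs
feature_log = {
    "user_42": [
        (datetime(2024, 1, 1), {"txn_count_7d": 3}),
        (datetime(2024, 1, 8), {"txn_count_7d": 9}),
        (datetime(2024, 1, 15), {"txn_count_7d": 4}),
    ]
}

def point_in_time_lookup(entity_id, as_of):
    """Training: latest feature row at or before `as_of` — never future data."""
    rows = [(ts, f) for ts, f in feature_log[entity_id] if ts <= as_of]
    if not rows:
        return None
    return max(rows, key=lambda r: r[0])[1]

def latest_lookup(entity_id):
    """Inference: the most recent row (only past data by construction)."""
    return max(feature_log[entity_id], key=lambda r: r[0])[1]

# A label observed on Jan 10 must only see features known by Jan 10
training_features = point_in_time_lookup("user_42", datetime(2024, 1, 10))
# Online serving returns the latest known value
online_features = latest_lookup("user_42")
```

Because both paths read the same `feature_log` with the same lookup semantics, training rows never leak future values and serving uses identical feature logic — the two guarantees the pattern calls out.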
Pattern 4: Shadow Deployment Testing
模式4:影子部署测试
Test new models in production without risk.
Workflow:
- Deploy new model (v2) in shadow mode
- v2 receives copy of production traffic
- v1 predictions used for responses (no user impact)
- Compare v1 and v2 predictions offline
- Analyze differences, measure v2 accuracy
- Promote v2 to production if performance acceptable
Use Cases:
- High-risk models (financial, healthcare, safety-critical)
- Need extensive testing before cutover
- Compare model behavior on real production data
For deployment architecture, see references/deployment-strategies.md.
在生产环境测试新模型且无风险。
工作流:
- 以影子模式部署新模型(v2)
- v2接收生产流量副本
- v1的预测结果用于响应用户(无用户影响)
- 离线对比v1与v2的预测结果
- 分析差异,衡量v2的准确率
- 若性能达标则将v2升级至生产环境
适用场景:
- 高风险模型(金融、医疗、安全关键型)
- 需在切换前进行大量测试
- 对比模型在真实生产数据上的表现
如需部署架构,请查看references/deployment-strategies.md。
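The mirror-and-compare workflow above can be sketched in a few lines of pure Python. The two toy models and the agreement tolerance are placeholder assumptions; in production the mirroring would happen at the service mesh or gateway layer:

```python
import statistics

def model_v1(x):
    return x * 2.0            # current production model (toy stand-in)

def model_v2(x):
    return x * 2.0 + 0.1      # shadow candidate (toy stand-in)

shadow_log = []

def handle_request(x):
    """Serve v1 to the user; mirror the request to v2 and log both predictions."""
    live = model_v1(x)
    shadow = model_v2(x)      # computed and logged, never returned
    shadow_log.append({"input": x, "v1": live, "v2": shadow})
    return live               # only v1 affects the response

def offline_comparison(log, tolerance=0.5):
    """Offline analysis: how often do v1 and v2 agree within a tolerance?"""
    diffs = [abs(r["v1"] - r["v2"]) for r in log]
    agreement = sum(d <= tolerance for d in diffs) / len(diffs)
    return {"agreement": agreement, "mean_abs_diff": statistics.mean(diffs)}

responses = [handle_request(x) for x in range(100)]
report = offline_comparison(shadow_log)
```

The key property is visible in `handle_request`: v2's output never reaches the caller, so the candidate is exercised on real traffic with zero user impact, and promotion is decided from the offline report.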
Tool Recommendations
工具推荐
Production-Ready Tools (High Adoption)
生产级工具(高采用率)
MLflow - Experiment Tracking & Model Registry
- GitHub Stars: 20,000+
- Trust Score: 95/100
- Use Cases: Experiment tracking, model registry, model serving
- Strengths: Open-source, framework-agnostic, self-hosted option
- Getting Started:
pip install mlflow && mlflow server
Feast - Feature Store
- GitHub Stars: 5,000+
- Trust Score: 85/100
- Use Cases: Online/offline feature serving, point-in-time correctness
- Strengths: Cloud-agnostic, most popular open-source feature store
- Getting Started:
pip install feast && feast init
Seldon Core - Model Serving (Advanced)
- GitHub Stars: 4,000+
- Trust Score: 85/100
- Use Cases: Kubernetes-native serving, advanced deployment patterns
- Strengths: Canary, A/B testing, MAB, explainability
- Limitation: High complexity, steep learning curve
KServe - Model Serving (CNCF Standard)
- GitHub Stars: 3,500+
- Trust Score: 85/100
- Use Cases: Standardized serving API, serverless scaling
- Strengths: CNCF project, Knative integration, growing adoption
- Limitation: Kubernetes required
BentoML - Model Serving (Simplicity)
- GitHub Stars: 6,000+
- Trust Score: 80/100
- Use Cases: Easy packaging, Python-first deployment
- Strengths: Lowest learning curve, excellent developer experience
- Limitation: Fewer advanced features than Seldon/KServe
Kubeflow Pipelines - ML Orchestration
- GitHub Stars: 14,000+ (Kubeflow project)
- Trust Score: 90/100
- Use Cases: ML-specific pipelines, Kubernetes-native workflows
- Strengths: ML-native, component reusability, Katib integration
- Limitation: Kubernetes required, steep learning curve
Weights & Biases - Experiment Tracking (SaaS)
- Trust Score: 90/100
- Use Cases: Team collaboration, advanced visualization, hyperparameter tuning
- Strengths: Best-in-class UI, integrated Sweeps, strong community
- Limitation: SaaS pricing, no self-hosted free tier
For detailed tool comparisons, see references/tool-recommendations.md.
MLflow - 实验追踪与模型注册
- GitHub星标:20,000+
- 信任评分:95/100
- 适用场景:实验追踪、模型注册、模型部署
- 优势:开源、框架无关、支持自托管
- 快速开始:
pip install mlflow && mlflow server
Feast - 特征存储
- GitHub星标:5,000+
- 信任评分:85/100
- 适用场景:在线/离线特征服务、时间点正确性
- 优势:云无关、最受欢迎的开源特征存储
- 快速开始:
pip install feast && feast init
Seldon Core - 模型部署(高级功能)
- GitHub星标:4,000+
- 信任评分:85/100
- 适用场景:Kubernetes原生部署、高级部署策略
- 优势:支持金丝雀、A/B测试、MAB、可解释性
- 局限性:复杂度高,学习曲线陡峭
KServe - 模型部署(CNCF标准)
- GitHub星标:3,500+
- 信任评分:85/100
- 适用场景:标准化部署API、无服务器扩缩容
- 优势:CNCF项目、与Knative集成、采用率增长中
- 局限性:需Kubernetes环境
BentoML - 模型部署(简洁易用)
- GitHub星标:6,000+
- 信任评分:80/100
- 适用场景:简易打包、Python优先部署
- 优势:学习曲线最低、出色的开发者体验
- 局限性:高级功能少于Seldon/KServe
Kubeflow Pipelines - ML流水线编排
- GitHub星标:14,000+(Kubeflow项目)
- 信任评分:90/100
- 适用场景:ML专用流水线、Kubernetes原生工作流
- 优势:ML原生、组件可复用、与Katib集成
- 局限性:需Kubernetes环境,学习曲线陡峭
Weights & Biases - 实验追踪(SaaS)
- 信任评分:90/100
- 适用场景:团队协作、高级可视化、超参数调优
- 优势:业界领先的UI、集成Sweeps、活跃社区
- 局限性:SaaS定价,无免费自托管版本
如需详细工具对比,请查看references/tool-recommendations.md。
Tool Stack Recommendations by Organization
按组织规模推荐工具栈
Startup (Cost-Optimized, Simple):
- Experiment Tracking: MLflow (free, self-hosted)
- Feature Store: None initially → Feast when needed
- Model Serving: BentoML (easy) or cloud functions
- Orchestration: Prefect or cron jobs
- Monitoring: Basic logging + Prometheus
Growth Company (Balanced):
- Experiment Tracking: Weights & Biases or MLflow
- Feature Store: Feast (open-source, production-ready)
- Model Serving: BentoML or KServe (Kubernetes-based)
- Orchestration: Kubeflow Pipelines or Airflow
- Monitoring: Evidently + Prometheus + Grafana
Enterprise (Full Stack):
- Experiment Tracking: MLflow (self-hosted) or Neptune.ai (compliance)
- Feature Store: Tecton (managed) or Feast (self-hosted)
- Model Serving: Seldon Core (advanced) or KServe
- Orchestration: Kubeflow Pipelines or Airflow
- Monitoring: Evidently + Prometheus + Grafana + PagerDuty
Cloud-Native (Managed Services):
- AWS: SageMaker (end-to-end platform)
- GCP: Vertex AI (end-to-end platform)
- Azure: Azure ML (end-to-end platform)
For scenario-specific recommendations, see references/scenarios.md.
初创公司(成本优化、简洁):
- 实验追踪:MLflow(免费,自托管)
- 特征存储:初期无需 → 有需求时使用Feast
- 模型部署:BentoML(简易)或云函数
- 编排:Prefect或定时任务
- 监控:基础日志 + Prometheus
成长型公司(平衡方案):
- 实验追踪:Weights & Biases或MLflow
- 特征存储:Feast(开源,生产级)
- 模型部署:BentoML或KServe(Kubernetes环境)
- 编排:Kubeflow Pipelines或Airflow
- 监控:Evidently + Prometheus + Grafana
企业(全栈方案):
- 实验追踪:MLflow(自托管)或Neptune.ai(合规)
- 特征存储:Tecton(托管式)或Feast(自托管)
- 模型部署:Seldon Core(高级功能)或KServe
- 编排:Kubeflow Pipelines或Airflow
- 监控:Evidently + Prometheus + Grafana + PagerDuty
云原生(托管服务):
- AWS:SageMaker(端到端平台)
- GCP:Vertex AI(端到端平台)
- Azure:Azure ML(端到端平台)
如需场景化推荐,请查看references/scenarios.md。
Common Scenarios
常见场景
Scenario 1: Startup MLOps Stack
场景1:初创公司MLOps栈
Context: 20-person startup, 5 data scientists, 3 models (fraud detection, recommendation, churn), limited budget.
Recommendation:
- Experiment Tracking: MLflow (free, self-hosted)
- Model Serving: BentoML (simple, fast iteration)
- Orchestration: Prefect (simpler than Airflow)
- Monitoring: Prometheus + basic drift detection
- Feature Store: Skip initially, use database tables
Rationale:
- Minimize cost (all open-source, self-hosted)
- Fast iteration (BentoML easy to deploy)
- Don't over-engineer (no Kubeflow for 3 models)
- Add feature store (Feast) when scaling to 10+ models
For detailed scenario, see references/scenarios.md.
背景:20人初创公司,5名数据科学家,3个模型(欺诈检测、推荐系统、客户流失预测),预算有限。
推荐:
- 实验追踪:MLflow(免费,自托管)
- 模型部署:BentoML(简易,快速迭代)
- 编排:Prefect(比Airflow简洁)
- 监控:Prometheus + 基础漂移检测
- 特征存储:初期无需,使用数据库表
理由:
- 成本最小化(全开源、自托管)
- 快速迭代(BentoML部署便捷)
- 避免过度设计(3个模型无需Kubeflow)
- 模型数量达到10+时再引入Feast
如需详细场景,请查看references/scenarios.md。
Scenario 2: Enterprise ML Platform
场景2:企业级ML平台
Context: 500-person company, 50 data scientists, 100+ models, regulatory compliance, multi-cloud.
Recommendation:
- Experiment Tracking: Neptune.ai (compliance, audit logs) or MLflow (cost)
- Feature Store: Feast (self-hosted, cloud-agnostic)
- Model Serving: Seldon Core (advanced deployment patterns)
- Orchestration: Kubeflow Pipelines (ML-native, Kubernetes)
- Monitoring: Evidently + Prometheus + Grafana + PagerDuty
Rationale:
- Compliance required (Neptune audit logs, RBAC)
- Multi-cloud (Feast cloud-agnostic)
- Advanced deployments (Seldon canary, A/B testing)
- Scale (Kubernetes for 100+ models)
For detailed scenario, see references/scenarios.md.
背景:500人公司,50名数据科学家,100+模型,需监管合规,多云环境。
推荐:
- 实验追踪:Neptune.ai(合规、审计日志)或MLflow(成本优化)
- 特征存储:Feast(自托管,云无关)
- 模型部署:Seldon Core(高级部署策略)
- 编排:Kubeflow Pipelines(ML原生,Kubernetes)
- 监控:Evidently + Prometheus + Grafana + PagerDuty
理由:
- 需合规(Neptune的审计日志、RBAC)
- 多云环境(Feast云无关)
- 高级部署需求(Seldon的金丝雀、A/B测试)
- 规模化支持(Kubernetes支撑100+模型)
如需详细场景,请查看references/scenarios.md。
Scenario 3: LLM Fine-Tuning Pipeline
场景3:LLM微调流水线
Context: Fine-tune LLM for domain-specific use case, deploy for production serving.
Recommendation:
- Experiment Tracking: MLflow (track fine-tuning runs)
- Pipeline Orchestration: Kubeflow Pipelines (GPU scheduling)
- Model Serving: vLLM (high-throughput LLM serving)
- Prompt Versioning: Git + LangSmith
- Monitoring: Arize Phoenix (RAG monitoring)
Rationale:
- Track fine-tuning experiments (LoRA adapters, hyperparameters)
- GPU orchestration (Kubeflow on Kubernetes)
- Efficient LLM serving (vLLM optimized for throughput)
- Monitor RAG systems (retrieval + generation quality)
For detailed scenario, see references/scenarios.md.
背景:为特定领域微调LLM,部署至生产环境提供服务。
推荐:
- 实验追踪:MLflow(追踪微调运行)
- 流水线编排:Kubeflow Pipelines(GPU调度)
- 模型部署:vLLM(高吞吐量LLM部署)
- Prompt版本管理:Git + LangSmith
- 监控:Arize Phoenix(RAG监控)
理由:
- 追踪微调实验(LoRA适配器、超参数)
- GPU编排(Kubernetes上的Kubeflow)
- 高效LLM部署(vLLM针对吞吐量优化)
- 监控RAG系统(检索+生成质量)
如需详细场景,请查看references/scenarios.md。
Integration with Other Skills
与其他技能的集成
Direct Dependencies:
- ai-data-engineering: Feature engineering, ML algorithms, data preparation
- kubernetes-operations: K8s cluster management, GPU scheduling for ML workloads
- observability: Monitoring, alerting, distributed tracing for ML systems
Complementary Skills:
- data-architecture: Data pipelines, data lakes feeding ML models
- data-transformation: dbt for feature transformation pipelines
- streaming-data: Kafka, Flink for real-time ML inference
- designing-distributed-systems: Scalability patterns for ML workloads
- api-design-principles: ML model APIs, REST/gRPC serving patterns
Downstream Skills:
- building-ai-chat: LLM-powered applications consuming ML models
- visualizing-data: Dashboards for ML metrics and monitoring
直接依赖:
- ai-data-engineering:特征工程、ML算法、数据准备
- kubernetes-operations:K8s集群管理、ML工作负载的GPU调度
- observability:ML系统的监控、告警、分布式追踪
互补技能:
- data-architecture:为ML模型提供数据的流水线、数据湖
- data-transformation:使用dbt构建特征转换流水线
- streaming-data:Kafka、Flink用于实时ML推理
- designing-distributed-systems:ML工作负载的可扩展性模式
- api-design-principles:ML模型API、REST/gRPC部署模式
下游技能:
- building-ai-chat:调用ML模型的LLM驱动应用
- visualizing-data:ML指标与监控仪表盘
Best Practices
最佳实践
Version Everything:
- Code: Git commit SHA for reproducibility
- Data: DVC or data version hash
- Models: Semantic versioning (v1.2.3)
- Features: Feature store versioning
Automate Testing:
- Unit tests: Model loads, accepts input, produces output
- Integration tests: End-to-end pipeline execution
- Model validation: Accuracy thresholds, fairness checks
Monitor Continuously:
- Data drift: Distribution changes over time
- Model drift: Accuracy degradation
- Performance: Latency, throughput, error rates
Start Simple:
- Begin with MLflow + basic serving (BentoML)
- Add complexity as needed (feature store, Kubeflow)
- Avoid over-engineering (don't build Kubeflow for 2 models)
Point-in-Time Correctness:
- Use feature stores to avoid training/serving skew
- Ensure no future data leakage in training
- Consistent feature logic in training and inference
Deployment Strategies:
- Use canary for medium-risk models (gradual rollout)
- Use shadow for high-risk models (zero production impact)
- Always have a rollback plan (instant switch to previous version)
Governance:
- Model cards: Document model purpose, limitations, biases
- Audit trails: Track all model versions, deployments, approvals
- Compliance: EU AI Act, model risk management (SR 11-7)
Cost Optimization:
- Quantization: Reduce model size ~4x, inference speed 2-3x
- Spot instances: Train on preemptible VMs (60-90% cost reduction)
- Autoscaling: Scale inference endpoints based on load
版本化所有内容:
- 代码:使用Git提交SHA确保可复现性
- 数据:使用DVC或数据版本哈希
- 模型:语义化版本(v1.2.3)
- 特征:特征存储版本管理
自动化测试:
- 单元测试:模型可加载、接受输入、生成输出
- 集成测试:端到端流水线执行
- 模型验证:准确率阈值、公平性检查
持续监控:
- 数据漂移:分布随时间的变化
- 模型漂移:准确率下降
- 性能:延迟、吞吐量、错误率
从简开始:
- 先使用MLflow + 基础部署(BentoML)
- 按需增加复杂度(特征存储、Kubeflow)
- 避免过度设计(2个模型无需Kubeflow)
保证时间点正确性:
- 使用特征存储避免训练/服务偏差
- 确保训练中无未来数据泄露
- 训练与推理使用相同的特征逻辑
选择合适的部署策略:
- 中风险模型使用金丝雀部署(逐步推出)
- 高风险模型使用影子部署(生产测试无影响)
- 始终具备回滚方案(即时切换到上一版本)
建立治理体系:
- 模型卡片:记录模型用途、局限性、偏差
- 审计追踪:记录所有模型版本、部署、审批
- 合规:遵循EU AI法案、模型风险管理(SR 11-7)
成本优化:
- 量化:模型体积减小约4倍,推理速度提升2-3倍
- 抢占式实例:使用可抢占VM训练(成本降低60-90%)
- 自动扩缩容:根据负载扩缩推理端点
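The unit-level checks listed under "Automate Testing" — model loads, accepts input, produces output, meets an accuracy gate — can be sketched as a small validation suite. The toy model, the 10% closeness criterion, and the 0.9 gate are placeholder assumptions:

```python
import math

class ToyModel:
    """Placeholder model: predicts a fixed coefficient times the input."""
    def __init__(self, coef=2.0):
        self.coef = coef

    def predict(self, xs):
        return [self.coef * x for x in xs]

def load_model():
    # In practice this would pull a versioned artifact from the model registry
    return ToyModel()

def validate_model(model, test_inputs, test_labels, min_accuracy=0.9):
    report = {}
    # 1. Model accepts input and produces output of the right shape
    preds = model.predict(test_inputs)
    report["io_ok"] = len(preds) == len(test_inputs)
    # 2. Outputs are finite numbers (no NaN/inf leaking to production)
    report["finite_ok"] = all(math.isfinite(p) for p in preds)
    # 3. Accuracy gate: fraction of predictions within 10% of the label
    close = sum(abs(p - y) <= 0.1 * abs(y)
                for p, y in zip(preds, test_labels) if y)
    report["accuracy"] = close / len(test_labels)
    report["passed"] = (report["io_ok"] and report["finite_ok"]
                        and report["accuracy"] >= min_accuracy)
    return report

model = load_model()
report = validate_model(model, [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Wired into CI, a failing `report["passed"]` would block the registry stage transition, which is what turns these best practices from advice into an enforced gate.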
Anti-Patterns
反模式
❌ Notebooks in Production:
- Never deploy Jupyter notebooks to production
- Use notebooks for experimentation only
- Production: Use scripts, Docker containers, CI/CD pipelines
❌ Manual Model Deployment:
- Automate deployment with CI/CD pipelines
- Use model registry stage transitions (staging → production)
- Eliminate human error, ensure reproducibility
❌ No Monitoring:
- Production models without monitoring will degrade silently
- Implement drift detection (data drift, model drift)
- Set up alerting for accuracy drops, latency spikes
❌ Training/Serving Skew:
- Different feature logic in training vs inference
- Use feature stores to ensure consistency
- Test feature parity before production deployment
❌ Ignoring Data Quality:
- Garbage in, garbage out (GIGO)
- Validate data schema, ranges, distributions
- Use Great Expectations for data validation
❌ Over-Engineering:
- Don't build Kubeflow for 2 models
- Start simple (MLflow + BentoML)
- Add complexity only when necessary (10+ models)
❌ No Rollback Plan:
- Always have ability to rollback to previous model version
- Blue-green, canary, shadow deployments enable instant rollback
- Test rollback procedure before production deployment
❌ Notebook用于生产:
- 绝不要将Jupyter Notebook部署到生产环境
- Notebook仅用于实验
- 生产环境使用脚本、Docker容器、CI/CD流水线
❌ 手动部署模型:
- 使用CI/CD流水线自动化部署
- 使用模型注册中心的阶段转换(预生产→生产)
- 消除人为错误,确保可复现性
❌ 无监控:
- 无监控的生产模型会静默失效
- 实现漂移检测(数据漂移、模型漂移)
- 为准确率下降、延迟峰值设置告警
❌ 训练/服务偏差:
- 训练与推理使用不同的特征逻辑
- 使用特征存储确保一致性
- 生产部署前测试特征一致性
❌ 忽略数据质量:
- 垃圾进,垃圾出(GIGO)
- 验证数据schema、范围、分布
- 使用Great Expectations进行数据验证
❌ 过度设计:
- 2个模型无需搭建Kubeflow
- 从简开始(MLflow + BentoML)
- 仅在必要时增加复杂度(10+模型时)
❌ 无回滚方案:
- 始终具备回滚到上一版本的能力
- 蓝绿、金丝雀、影子部署支持即时回滚
- 生产部署前测试回滚流程
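The "no rollback plan" anti-pattern is avoided by making rollback an explicit branch in the rollout logic. A canary progression like the 5% → 100% one described earlier can be sketched as a simple control loop; the stage fractions, error-rate metric, and degradation threshold are illustrative assumptions:

```python
def canary_rollout(get_canary_error_rate, baseline_error_rate,
                   stages=(0.05, 0.25, 0.50, 1.00), max_degradation=0.05):
    """Advance canary traffic stage by stage; roll back on metric degradation.
    Returns ("promoted", 1.0) or ("rolled_back", last_safe_fraction)."""
    current = 0.0
    for fraction in stages:
        error_rate = get_canary_error_rate(fraction)
        if error_rate > baseline_error_rate + max_degradation:
            # Instant rollback: route all traffic back to the previous version
            return ("rolled_back", current)
        current = fraction
    return ("promoted", current)

# Healthy candidate: error rate matches the baseline at every stage
assert canary_rollout(lambda f: 0.02, baseline_error_rate=0.02) == ("promoted", 1.0)

# Degrading candidate: errors spike once it sees 50% of traffic
def degrading(fraction):
    return 0.02 if fraction < 0.5 else 0.15

assert canary_rollout(degrading, baseline_error_rate=0.02) == ("rolled_back", 0.25)
```

Because the rollback branch is exercised in tests like the second assertion, the procedure is verified before production deployment rather than discovered during an incident.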
Further Reading
延伸阅读
Reference Files:
- Experiment Tracking - MLflow, W&B, Neptune deep dive
- Model Registry - Versioning, lineage, stage transitions
- Feature Stores - Feast, Tecton, online/offline patterns
- Model Serving - Seldon, KServe, BentoML, optimization
- Deployment Strategies - Blue-green, canary, shadow, A/B
- ML Pipelines - Kubeflow, Airflow, training pipelines
- Model Monitoring - Drift detection, observability
- LLMOps Patterns - LLM fine-tuning, RAG, prompts
- Decision Frameworks - Tool selection frameworks
- Tool Recommendations - Detailed comparisons
- Scenarios - Startup, enterprise, LLMOps use cases
- Governance - Model cards, compliance, fairness
Example Projects:
- examples/mlflow-experiment/ - Complete MLflow setup
- examples/feast-feature-store/ - Feast online/offline
- examples/seldon-deployment/ - Canary, A/B testing
- examples/kubeflow-pipeline/ - End-to-end pipeline
- examples/monitoring-dashboard/ - Evidently + Prometheus
Scripts:
- scripts/setup_mlflow_server.sh - MLflow with PostgreSQL + S3
- scripts/feast_feature_definition_generator.py - Generate Feast features
- scripts/model_validation_suite.py - Automated model tests
- scripts/drift_detection_monitor.py - Scheduled drift detection
- scripts/kubernetes_model_deploy.py - Deploy to Seldon/KServe
参考文件:
- Experiment Tracking - MLflow、W&B、Neptune深度解析
- Model Registry - 版本管理、lineage、阶段转换
- Feature Stores - Feast、Tecton、在线/离线模式
- Model Serving - Seldon、KServe、BentoML、优化技术
- Deployment Strategies - 蓝绿、金丝雀、影子、A/B测试
- ML Pipelines - Kubeflow、Airflow、训练流水线
- Model Monitoring - 漂移检测、可观测性
- LLMOps Patterns - LLM微调、RAG、Prompt管理
- Decision Frameworks - 工具选型框架
- Tool Recommendations - 详细对比
- Scenarios - 初创公司、企业、LLMOps场景
- Governance - 模型卡片、合规、公平性
示例项目:
- examples/mlflow-experiment/ - 完整MLflow配置
- examples/feast-feature-store/ - Feast在线/离线存储
- examples/seldon-deployment/ - 金丝雀、A/B测试
- examples/kubeflow-pipeline/ - 端到端流水线
- examples/monitoring-dashboard/ - Evidently + Prometheus
脚本:
- scripts/setup_mlflow_server.sh - 带PostgreSQL + S3的MLflow配置
- scripts/feast_feature_definition_generator.py - 生成Feast特征
- scripts/model_validation_suite.py - 自动化模型测试
- scripts/drift_detection_monitor.py - 定时漂移检测
- scripts/kubernetes_model_deploy.py - 部署到Seldon/KServe