MLOps Patterns
Operationalize machine learning models from experimentation to production deployment and monitoring.
Purpose
Provide strategic guidance for ML engineers and platform teams to build production-grade ML infrastructure. Cover the complete lifecycle: experiment tracking, model registry, feature stores, deployment patterns, pipeline orchestration, and monitoring.
When to Use This Skill
Use this skill when:
- Designing MLOps infrastructure for production ML systems
- Selecting experiment tracking platforms (MLflow, Weights & Biases, Neptune)
- Implementing feature stores for online/offline feature serving
- Choosing model serving solutions (Seldon Core, KServe, BentoML, TorchServe)
- Building ML pipelines for training, evaluation, and deployment
- Setting up model monitoring and drift detection
- Establishing model governance and compliance frameworks
- Optimizing ML inference costs and performance
- Migrating from notebooks to production ML systems
- Implementing continuous training and automated retraining
Core Concepts
核心概念
1. Experiment Tracking
Track experiments systematically to ensure reproducibility and collaboration.
Key Components:
- Parameters: Hyperparameters logged for each training run
- Metrics: Performance measures tracked over time (accuracy, loss, F1)
- Artifacts: Model weights, plots, datasets, configuration files
- Metadata: Tags, descriptions, Git commit SHA, environment details
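The four components above can be pictured with a minimal, platform-neutral sketch. This is a hypothetical stdlib-only logger for illustration, not MLflow's or W&B's actual API:

```python
import json
import time
from pathlib import Path

def log_run(run_dir: str, params: dict, metrics: dict,
            artifacts: list[str], metadata: dict) -> Path:
    """Persist one training run's record: parameters, metrics,
    artifact paths, and metadata (tags, Git SHA, environment)."""
    record = {
        "timestamp": time.time(),
        "params": params,        # e.g. {"lr": 3e-4, "epochs": 10}
        "metrics": metrics,      # e.g. {"accuracy": 0.94, "f1": 0.91}
        "artifacts": artifacts,  # paths to weights, plots, configs
        "metadata": metadata,    # tags, git_sha, environment details
    }
    out = Path(run_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "run.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

Whatever platform you choose, a run record with exactly these four fields is the unit of reproducibility.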
Platform Comparison:
MLflow (Open-source standard):
- Framework-agnostic (PyTorch, TensorFlow, scikit-learn, XGBoost)
- Self-hosted or cloud-agnostic deployment
- Integrated model registry
- Basic UI, adequate for most use cases
- Free, requires infrastructure management
Weights & Biases (SaaS, collaboration-focused):
- Advanced visualization and dashboards
- Integrated hyperparameter optimization (Sweeps)
- Excellent team collaboration features
- SaaS pricing scales with usage
- Best-in-class UI
Neptune.ai (Enterprise-grade):
- Enterprise features (RBAC, audit logs, compliance)
- Integrated production monitoring
- Higher cost than W&B
- Good for regulated industries
Selection Criteria:
- Open-source requirement → MLflow
- Team collaboration critical → Weights & Biases
- Enterprise compliance (RBAC, audits) → Neptune.ai
- Hyperparameter optimization primary → Weights & Biases (Sweeps)
For detailed comparison and decision framework, see references/experiment-tracking.md.
2. Model Registry and Versioning
Centralize model artifacts with version control and stage management.
Model Registry Components:
- Model artifacts (weights, serialized models)
- Training metrics (accuracy, F1, AUC)
- Hyperparameters used during training
- Training dataset version
- Feature schema (input/output signatures)
- Model cards (documentation, use cases, limitations)
Stage Management:
- None: Newly registered model
- Staging: Testing in pre-production environment
- Production: Serving live traffic
- Archived: Deprecated, retained for compliance
Versioning Strategies:
Semantic Versioning for Models:
- Major version (v2.0.0): Breaking change in input/output schema
- Minor version (v1.1.0): New feature, backward-compatible
- Patch version (v1.0.1): Bug fix, model retrained on new data
Git-Based Versioning:
- Model code in Git (training scripts, configuration)
- Model weights in DVC (Data Version Control) or Git-LFS
- Reproducibility via commit SHA + data version hash
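The semantic-versioning rules above can be expressed as a small helper. The change-type names here are made up for this sketch:

```python
def bump_model_version(version: str, change: str) -> str:
    """Bump a model's semantic version.

    change: "schema"  -> major (breaking input/output schema change)
            "feature" -> minor (backward-compatible addition)
            "retrain" -> patch (bug fix, or retrained on new data)
    """
    major, minor, patch = (int(p) for p in version.split("."))
    if change == "schema":
        return f"{major + 1}.0.0"
    if change == "feature":
        return f"{major}.{minor + 1}.0"
    if change == "retrain":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")
```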
For model lineage tracking and registry patterns, see references/model-registry.md.
3. Feature Stores
Centralize feature engineering to ensure consistency between training and inference.
Problem Addressed: Training/serving skew
- Training: features computed offline in batch pipelines, sometimes with accidental future knowledge (data leakage)
- Inference: features computed in a separate online path, with only past data available
- Result: the two paths compute different feature values, so the model performs well in training but fails in production
Feature Store Solution:
Online Feature Store:
- Purpose: Low-latency feature retrieval for real-time inference
- Storage: Redis, DynamoDB, Cassandra (key-value stores)
- Latency: Sub-10ms for feature lookup
- Use Case: Real-time predictions (fraud detection, recommendations)
Offline Feature Store:
- Purpose: Historical feature data for training and batch inference
- Storage: Parquet files (S3/GCS), data warehouses (Snowflake, BigQuery)
- Latency: Seconds to minutes (batch retrieval)
- Use Case: Model training, backtesting, batch predictions
Point-in-Time Correctness:
- Ensures no future data leakage during training
- Feature values at time T only use data available before time T
- Critical for avoiding overly optimistic training metrics
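Point-in-time correctness can be illustrated with a tiny pure-Python join. This is a sketch of the idea; real feature stores such as Feast implement it over Parquet or warehouse tables:

```python
def point_in_time_join(labels, feature_log):
    """For each (entity_id, label_ts) pair, attach the most recent
    feature value observed strictly before label_ts, never a future one.

    labels:      list of (entity_id, label_ts)
    feature_log: list of (entity_id, feature_ts, value), in any order
    """
    rows = []
    for entity_id, label_ts in labels:
        candidates = [
            (ts, value) for e, ts, value in feature_log
            if e == entity_id and ts < label_ts   # exclude future data
        ]
        # max over (ts, value) tuples picks the latest observation
        value = max(candidates)[1] if candidates else None
        rows.append((entity_id, label_ts, value))
    return rows
```

Skipping the `ts < label_ts` filter is exactly the leakage that produces overly optimistic training metrics.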
Platform Comparison:
Feast (Open-source, cloud-agnostic):
- Most popular open-source feature store
- Supports Redis, DynamoDB, Datastore (online) and Parquet, BigQuery, Snowflake (offline)
- Cloud-agnostic, no vendor lock-in
- Active community, growing adoption
Tecton (Managed, production-grade):
- Feast-compatible API
- Fully managed service
- Integrated monitoring and governance
- Higher cost, enterprise-focused
SageMaker Feature Store (AWS):
- Integrated with AWS ecosystem
- Managed online/offline stores
- AWS lock-in
Databricks Feature Store (Databricks):
- Unity Catalog integration
- Delta Lake for offline storage
- Databricks ecosystem lock-in
Selection Criteria:
- Open-source, cloud-agnostic → Feast
- Managed solution, production-grade → Tecton
- AWS ecosystem → SageMaker Feature Store
- Databricks users → Databricks Feature Store
For feature engineering patterns and implementation, see references/feature-stores.md.
4. Model Serving Patterns
Deploy models for synchronous, asynchronous, batch, or streaming inference.
Serving Patterns:
REST API Deployment:
- Pattern: HTTP endpoint for synchronous predictions
- Latency: <100ms acceptable
- Use Case: Request-response applications
- Tools: Flask, FastAPI, BentoML, Seldon Core
gRPC Deployment:
- Pattern: High-performance RPC for low-latency inference
- Latency: <10ms target
- Use Case: Microservices, latency-critical applications
- Tools: TensorFlow Serving, TorchServe, Seldon Core
Batch Inference:
- Pattern: Process large datasets offline
- Latency: Minutes to hours acceptable
- Use Case: Daily/hourly predictions for millions of records
- Tools: Spark, Dask, Ray
Streaming Inference:
- Pattern: Real-time predictions on streaming data
- Latency: Milliseconds
- Use Case: Fraud detection, anomaly detection, real-time recommendations
- Tools: Kafka + Flink/Spark Streaming
Platform Comparison:
Seldon Core (Kubernetes-native, advanced):
- Advanced deployment strategies (canary, A/B testing, multi-armed bandits)
- Multi-framework support
- Integrated explainability (Alibi)
- High complexity, steep learning curve
KServe (CNCF standard):
- Standardized InferenceService API
- Serverless scaling (scale-to-zero with Knative)
- Kubernetes-native
- Growing adoption, CNCF backing
BentoML (Python-first, simplicity):
- Easiest to get started
- Excellent developer experience
- Local testing → cloud deployment
- Lower complexity than Seldon/KServe
TorchServe (PyTorch official):
- PyTorch-specific serving
- Production-grade, optimized for PyTorch models
- Less flexible for multi-framework use
TensorFlow Serving (TensorFlow official):
- TensorFlow-specific serving
- Production-grade, optimized for TensorFlow models
- Less flexible for multi-framework use
Selection Criteria:
- Kubernetes, advanced deployments → Seldon Core or KServe
- Python-first, simplicity → BentoML
- PyTorch-specific → TorchServe
- TensorFlow-specific → TensorFlow Serving
- Managed solution → SageMaker/Vertex AI/Azure ML
For model optimization and serving infrastructure, see references/model-serving.md.
5. Deployment Strategies
Deploy models safely with rollback capabilities.
Blue-Green Deployment:
- Two identical environments (Blue: current, Green: new)
- Deploy to Green, test, switch 100% traffic instantly
- Instant rollback (switch back to Blue)
- Trade-off: Requires 2x infrastructure, all-or-nothing switch
Canary Deployment:
- Gradual rollout to subset of traffic
- Route 5% → 10% → 25% → 50% → 100% over time
- Monitor metrics at each stage, rollback if degradation
- Trade-off: Complex routing logic, longer deployment time
Shadow Deployment:
- New model receives traffic but predictions not used
- Compare new model vs old model offline
- Zero risk to production
- Trade-off: Requires 2x compute, delayed feedback
A/B Testing:
- Split traffic between model versions
- Measure business metrics (conversion rate, revenue)
- Statistical significance testing
- Use Case: Optimize for business outcomes, not just ML metrics
Multi-Armed Bandit (MAB):
- Epsilon-greedy: Explore (try new models) vs Exploit (use best model)
- Thompson Sampling: Bayesian approach to exploration
- Use Case: Continuous optimization, faster convergence than A/B
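A minimal epsilon-greedy router over model versions might look like this. It is an illustrative sketch, not any serving platform's built-in bandit:

```python
import random

def choose_model(rewards: dict, epsilon: float = 0.1) -> str:
    """Epsilon-greedy routing between model versions.

    rewards maps model name -> list of observed per-request rewards
    (e.g. clicks, conversions). With probability epsilon, explore a
    random model; otherwise exploit the best average reward so far.
    """
    if random.random() < epsilon:
        return random.choice(list(rewards))   # explore
    # exploit: highest mean reward observed so far
    return max(rewards, key=lambda m: sum(rewards[m]) / len(rewards[m]))
```

Compared with a fixed A/B split, the router shifts traffic toward the better model as evidence accumulates, which is why MAB typically converges faster.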
Selection Criteria:
- Low-risk model → Blue-green (instant cutover)
- Medium-risk model → Canary (gradual rollout)
- High-risk model → Shadow (test in production, no impact)
- Business optimization → A/B testing or MAB
For deployment architecture and examples, see references/deployment-strategies.md.
6. ML Pipeline Orchestration
Automate training, evaluation, and deployment workflows.
Training Pipeline Stages:
- Data Validation (Great Expectations, schema checks)
- Feature Engineering (transform raw data)
- Data Splitting (train/validation/test)
- Model Training (hyperparameter tuning)
- Model Evaluation (accuracy, fairness, explainability)
- Model Registration (push to registry if metrics pass thresholds)
- Deployment (promote to staging/production)
Continuous Training Pattern:
- Monitor production data for drift
- Detect data distribution changes (KS test, PSI)
- Trigger automated retraining when drift detected
- Validate new model before deployment
- Deploy via canary or shadow strategy
Platform Comparison:
Kubeflow Pipelines (ML-native, Kubernetes):
- ML-specific pipeline orchestration
- Kubernetes-native (scales with K8s)
- Component-based (reusable pipeline steps)
- Integrated with Katib (hyperparameter tuning)
Apache Airflow (Mature, general-purpose):
- Most mature orchestration platform
- Large ecosystem, extensive integrations
- Python-based DAGs
- Not ML-specific but widely used for ML workflows
Metaflow (Netflix, data science-friendly):
- Human-centric design, easy for data scientists
- Excellent local development experience
- Versioning built-in
- Simpler than Kubeflow/Airflow
Prefect (Modern, Python-native):
- Dynamic workflows, not static DAGs
- Better error handling than Airflow
- Modern UI and developer experience
- Growing community
Dagster (Asset-based, testing-focused):
- Asset-based thinking (not just task dependencies)
- Strong testing and data quality features
- Modern approach, good for data teams
- Smaller community than Airflow
Selection Criteria:
- ML-specific, Kubernetes → Kubeflow Pipelines
- Mature, battle-tested → Apache Airflow
- Data scientists, ease of use → Metaflow
- Software engineers, testing → Dagster
- Modern, simpler than Airflow → Prefect
For pipeline architecture and examples, see references/ml-pipelines.md.
7. Model Monitoring and Observability
Monitor production models for drift, performance, and quality.
Data Drift Detection:
- Definition: Input feature distributions change over time
- Impact: Model trained on old distribution, predictions degrade
- Detection Methods:
- Kolmogorov-Smirnov (KS) Test: Compare distributions
- Population Stability Index (PSI): Measure distribution shift
- Chi-Square Test: For categorical features
- Action: Trigger automated retraining when drift detected
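PSI can be computed in a few lines. This is an illustrative stdlib implementation; production systems typically use a library such as Evidently:

```python
import math

def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population Stability Index between a reference (training-time)
    sample and a production sample of one numeric feature.

    Rule of thumb: PSI < 0.1 stable, 0.1 to 0.25 moderate shift,
    > 0.25 significant shift (consider retraining).
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[max(i, 0)] += 1
        # small epsilon avoids log(0) for empty bins
        return [(c + 1e-6) / (len(sample) + 1e-6 * bins) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```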
Model Drift Detection:
- Definition: Model prediction quality degrades over time
- Impact: Accuracy, precision, recall decrease
- Detection Methods:
- Ground truth accuracy (delayed labels)
- Prediction distribution changes
- Calibration drift (predicted probabilities vs actual outcomes)
- Action: Alert team, trigger retraining
Performance Monitoring:
- Metrics:
- Latency: P50, P95, P99 inference time
- Throughput: Predictions per second
- Error Rate: Failed predictions / total predictions
- Resource Utilization: CPU, memory, GPU usage
- Alerting Thresholds:
- P95 latency > 100ms → Alert
- Error rate > 1% → Alert
- Accuracy drop > 5% → Trigger retraining
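The latency and error-rate thresholds above can be checked with stdlib percentiles. A sketch; in practice Prometheus computes these server-side from histograms:

```python
import statistics

def latency_alerts(latencies_ms: list, error_rate: float) -> list:
    """Evaluate the alerting thresholds above against one window of
    per-request latencies (ms) plus the window's error rate."""
    # quantiles(n=100) returns 99 cut points; index 94 approximates P95
    q = statistics.quantiles(latencies_ms, n=100)
    p95 = q[94]
    alerts = []
    if p95 > 100:
        alerts.append(f"P95 latency {p95:.0f}ms > 100ms")
    if error_rate > 0.01:
        alerts.append(f"error rate {error_rate:.1%} > 1%")
    return alerts
```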
Business Metrics Monitoring:
- Downstream impact: Conversion rate, revenue, user satisfaction
- Model predictions → business outcomes correlation
- Use Case: Optimize models for business value, not just ML metrics
Tools:
- Evidently AI: Data drift, model drift, data quality reports
- Prometheus + Grafana: Performance metrics, custom dashboards
- Arize AI: ML observability platform
- Fiddler: Model monitoring and explainability
For monitoring architecture and implementation, see references/model-monitoring.md.
8. Model Optimization Techniques
Reduce model size and inference latency.
Quantization:
- Convert model weights from float32 to int8
- Model size reduction: 4x smaller
- Inference speed: 2-3x faster
- Accuracy impact: Minimal (<1% degradation typically)
- Tools: PyTorch quantization, TensorFlow Lite, ONNX Runtime
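Symmetric int8 quantization reduces to a single scale factor. A toy illustration of the arithmetic only; real frameworks quantize per-channel using calibration data:

```python
def quantize_int8(weights: list):
    """Symmetric int8 quantization: map float weights to [-127, 127]
    with one scale factor; dequantize with w ~= q * scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]   # int8 codes
    return q, scale

def dequantize(q: list, scale: float) -> list:
    """Recover approximate float weights from int8 codes."""
    return [qi * scale for qi in q]
```

The reconstruction error is bounded by half a quantization step (scale / 2), which is why accuracy loss is usually small.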
Model Distillation:
- Train small student model to mimic large teacher model
- Transfer knowledge from teacher (BERT-large) to student (DistilBERT)
- Size reduction: 2-10x smaller
- Speed improvement: 2-10x faster
- Use Case: Deploy small model on edge devices, reduce inference cost
ONNX Conversion:
- Convert models to Open Neural Network Exchange (ONNX) format
- Cross-framework compatibility (PyTorch → ONNX → TensorFlow)
- Optimized inference with ONNX Runtime
- Speed improvement: 1.5-3x faster than native framework
Model Pruning:
- Remove less important weights from neural networks
- Sparsity: 30-90% of weights set to zero
- Size reduction: 2-10x smaller
- Accuracy impact: Minimal with structured pruning
For optimization techniques and examples, see references/model-serving.md.
9. LLMOps Patterns
Operationalize Large Language Models with specialized patterns.
LLM Fine-Tuning Pipelines:
- LoRA (Low-Rank Adaptation): Parameter-efficient fine-tuning
- QLoRA: Quantized LoRA (4-bit quantization)
- Pipeline: Base model → Fine-tuning dataset → LoRA adapters → Merged model
- Tools: Hugging Face PEFT, Axolotl
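The LoRA update can be written out explicitly as y = Wx + (alpha/r)·B(Ax). A toy dense-matrix illustration of the math only; PEFT applies this per attention projection with frozen W:

```python
def lora_forward(x, W, A, B, alpha=16):
    """Frozen base weight W plus a rank-r update B @ A, the only part
    trained during fine-tuning. Matrices are lists of row lists."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    r = len(A)                       # rank = number of rows of A
    base = matvec(W, x)              # frozen path
    delta = matvec(B, matvec(A, x))  # low-rank path: (d_out x r)(r x d_in)
    return [b + (alpha / r) * d for b, d in zip(base, delta)]
```

Because B is initialized to zero, the adapted model starts out exactly equal to the base model.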
Prompt Versioning:
- Version control for prompts (Git, prompt management platforms)
- A/B testing prompts for quality and cost optimization
- Monitoring prompt effectiveness over time
RAG System Monitoring:
- Retrieval quality: Relevance of retrieved documents
- Generation quality: Answer accuracy, hallucination detection
- End-to-end latency: Retrieval + generation time
- Tools: LangSmith, Arize Phoenix
LLM Inference Optimization:
- vLLM: High-throughput LLM serving
- TensorRT-LLM: NVIDIA-optimized LLM inference
- Text Generation Inference (TGI): Hugging Face serving
- Batching: Dynamic batching for throughput
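The dynamic-batching idea can be sketched framework-free. A hypothetical worker loop, not vLLM's or TGI's actual scheduler:

```python
import queue

def batching_worker(requests: "queue.Queue", model_fn, max_batch: int = 8,
                    timeout_s: float = 0.01):
    """Drain up to max_batch queued (input, reply_queue) items, blocking
    for the first and waiting briefly for stragglers, then run them
    through the model as one batch and deliver each result."""
    while True:
        item = requests.get()
        if item is None:                  # shutdown sentinel
            return
        batch = [item]
        while len(batch) < max_batch:
            try:
                nxt = requests.get(timeout=timeout_s)
            except queue.Empty:
                break
            if nxt is None:
                requests.put(None)        # re-queue sentinel, exit later
                break
            batch.append(nxt)
        inputs = [inp for inp, _ in batch]
        outputs = model_fn(inputs)        # one forward pass per batch
        for (_, reply_q), out in zip(batch, outputs):
            reply_q.put(out)
```

Larger batches amortize per-call overhead, trading a small queueing delay (timeout_s) for much higher throughput.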
Embedding Model Management:
- Version embeddings alongside models
- Monitor embedding drift (distribution changes)
- Update embeddings when underlying model changes
For LLMOps patterns and implementation, see references/llmops-patterns.md.
10. Model Governance and Compliance
Establish governance for model risk management and regulatory compliance.
Model Cards:
- Documentation: Model purpose, training data, performance metrics
- Limitations: Known biases, failure modes, out-of-scope use cases
- Ethical considerations: Fairness, privacy, societal impact
- Template: Model Card Toolkit (Google)
Bias and Fairness Detection:
- Measure disparate impact across demographic groups
- Tools: Fairlearn, AI Fairness 360 (IBM)
- Metrics: Demographic parity, equalized odds, calibration
- Mitigation: Reweighting, adversarial debiasing, threshold optimization
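Demographic parity is straightforward to measure. An illustrative helper; Fairlearn provides this and the other metrics listed above:

```python
def demographic_parity_gap(predictions: list, groups: list) -> float:
    """Largest difference in positive-prediction rate between any two
    demographic groups; 0.0 means perfect demographic parity.

    predictions: list of 0/1 model outputs
    groups:      parallel list of group labels (e.g. "A", "B")
    """
    rates = {}
    for pred, g in zip(predictions, groups):
        pos, total = rates.get(g, (0, 0))
        rates[g] = (pos + pred, total + 1)
    per_group = [pos / total for pos, total in rates.values()]
    return max(per_group) - min(per_group)
```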
Regulatory Compliance:
- EU AI Act: High-risk AI systems require documentation, monitoring
- Model Risk Management (SR 11-7): Banking industry requirements
- GDPR: Right to explanation for automated decisions
- HIPAA: Healthcare data privacy
Audit Trails:
- Log all model versions, training runs, deployments
- Track who approved model transitions (staging → production)
- Retain historical predictions for compliance audits
- Tools: MLflow, Neptune.ai (audit logs)
For governance frameworks and compliance, see references/governance.md.
Decision Frameworks
Framework 1: Experiment Tracking Platform Selection
Decision Tree:
Start with primary requirement:
- Open-source, self-hosted requirement → MLflow
- Team collaboration, advanced visualization (budget available) → Weights & Biases
- Team collaboration, advanced visualization (no budget) → MLflow
- Enterprise compliance (audit logs, RBAC) → Neptune.ai
- Hyperparameter optimization primary use case → Weights & Biases (Sweeps)
Detailed Criteria:
| Criteria | MLflow | Weights & Biases | Neptune.ai |
|---|---|---|---|
| Cost | Free | $200/user/month | $300/user/month |
| Collaboration | Basic | Excellent | Good |
| Visualization | Basic | Excellent | Good |
| Hyperparameter Tuning | External (Optuna) | Integrated (Sweeps) | Basic |
| Model Registry | Included | Add-on | Included |
| Self-Hosted | Yes | No (paid only) | Limited |
| Enterprise Features | No | Limited | Excellent |
Recommendation by Organization:
- Startup (<50 people): MLflow (free, adequate) or W&B (if budget)
- Growth (50-500 people): Weights & Biases (team collaboration)
- Enterprise (>500 people): Neptune.ai (compliance) or MLflow (cost)
For detailed decision framework, see references/decision-frameworks.md.
Framework 2: Feature Store Selection
Decision Matrix:
Primary requirement:
- Open-source, cloud-agnostic → Feast
- Managed solution, production-grade, multi-cloud → Tecton
- AWS ecosystem → SageMaker Feature Store
- GCP ecosystem → Vertex AI Feature Store
- Azure ecosystem → Azure ML Feature Store
- Databricks users → Databricks Feature Store
- Self-hosted with UI → Hopsworks
Criteria Comparison:
| Factor | Feast | Tecton | Hopsworks | SageMaker FS |
|---|---|---|---|---|
| Cost | Free | $$$$ | Free (self-host) | $$$ |
| Online Serving | Redis, DynamoDB | Managed | RonDB | Managed |
| Offline Store | Parquet, BigQuery, Snowflake | Managed | Hive, S3 | S3 |
| Point-in-Time | Yes | Yes | Yes | Yes |
| Monitoring | External | Integrated | Basic | External |
| Cloud Lock-in | No | No | No | AWS |
Recommendation:
- Open-source, self-managed → Feast
- Managed, production-grade → Tecton
- AWS ecosystem → SageMaker Feature Store
- Databricks users → Databricks Feature Store
For detailed decision framework, see references/decision-frameworks.md.
Framework 3: Model Serving Platform Selection
Decision Tree:
Infrastructure:
- Kubernetes-based → Advanced deployment patterns needed?
- Yes → Seldon Core (most features) or KServe (CNCF standard)
- No → BentoML (simpler, Python-first)
- Cloud-native (managed) → Cloud provider?
- AWS → SageMaker Endpoints
- GCP → Vertex AI Endpoints
- Azure → Azure ML Endpoints
- Framework-specific → Framework?
- PyTorch → TorchServe
- TensorFlow → TensorFlow Serving
- Serverless / minimal infrastructure → BentoML or Cloud Functions
Detailed Criteria:
| Feature | Seldon Core | KServe | BentoML | TorchServe |
|---|---|---|---|---|
| Kubernetes-Native | Yes | Yes | Optional | No |
| Multi-Framework | Yes | Yes | Yes | PyTorch-only |
| Deployment Strategies | Excellent | Good | Basic | Basic |
| Explainability | Integrated | Integrated | External | No |
| Complexity | High | Medium | Low | Low |
| Learning Curve | Steep | Medium | Gentle | Gentle |
Recommendation:
- Kubernetes, advanced deployments → Seldon Core or KServe
- Python-first, simplicity → BentoML
- PyTorch-specific → TorchServe
- TensorFlow-specific → TensorFlow Serving
- Managed solution → SageMaker/Vertex AI/Azure ML
For detailed decision framework, see references/decision-frameworks.md.
Framework 4: ML Pipeline Orchestration Selection
Decision Matrix:
Primary use case:
- ML-specific pipelines, Kubernetes-native → Kubeflow Pipelines
- General-purpose orchestration, mature ecosystem → Apache Airflow
- Data science workflows, ease of use → Metaflow
- Modern approach, asset-based thinking → Dagster
- Dynamic workflows, Python-native → Prefect
Criteria Comparison:
| Factor | Kubeflow | Airflow | Metaflow | Dagster | Prefect |
|---|---|---|---|---|---|
| ML-Specific | Excellent | Good | Excellent | Good | Good |
| Kubernetes | Native | Compatible | Optional | Compatible | Compatible |
| Learning Curve | Steep | Steep | Gentle | Medium | Medium |
| Maturity | High | Very High | Medium | Medium | Medium |
| Community | Large | Very Large | Growing | Growing | Growing |
Recommendation:
- ML-specific, Kubernetes → Kubeflow Pipelines
- Mature, battle-tested → Apache Airflow
- Data scientists → Metaflow
- Software engineers → Dagster
- Modern, simpler than Airflow → Prefect
For detailed decision framework, see references/decision-frameworks.md.
Implementation Patterns
Pattern 1: End-to-End ML Pipeline
Automate the complete ML workflow from data to deployment.
Pipeline Stages:
- Data Validation (Great Expectations)
- Feature Engineering (transform raw data)
- Data Splitting (train/validation/test)
- Model Training (with hyperparameter tuning)
- Model Evaluation (accuracy, fairness, explainability)
- Model Registration (push to MLflow registry)
- Deployment (promote to staging/production)
Architecture:
Data Lake → Data Validation → Feature Engineering → Training → Evaluation
↓
Model Registry (staging) → Testing → Production Deployment
For implementation details and code examples, see references/ml-pipelines.md.
Pattern 2: Continuous Training
Automate model retraining based on drift detection.
Workflow:
- Monitor production data for distribution changes
- Detect data drift (KS test, PSI)
- Trigger automated retraining pipeline
- Validate new model (accuracy, fairness)
- Deploy via canary strategy (5% → 100%)
- Monitor new model performance
- Rollback if metrics degrade
Trigger Conditions:
- Scheduled: Daily/weekly retraining
- Data drift: KS test p-value < 0.05
- Model drift: Accuracy drop > 5%
- Data volume: New training data exceeds threshold (10K samples)
For implementation details, see references/ml-pipelines.md.
基于漂移检测自动触发模型重训练。
工作流:
- 监控生产数据的分布变化
- 检测数据漂移(KS检验、PSI)
- 触发自动重训练流水线
- 验证新模型(准确率、公平性)
- 通过金丝雀策略部署(5% → 100%)
- 监控新模型性能
- 若指标下降则回滚
触发条件:
- 定时触发:每日/每周重训练
- 数据漂移:KS检验p值<0.05
- 模型漂移:准确率下降>5%
- 数据量:新训练数据超过阈值(10K样本)
如需实现细节,请查看references/ml-pipelines.md。
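The data-drift trigger in the workflow above can be sketched with a pure-Python Population Stability Index (PSI), one of the two drift statistics the pattern names. The equal-width binning and the 0.1/0.2 thresholds are common rules of thumb, not values from this document:

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a live sample.
    PSI = sum((a% - e%) * ln(a% / e%)) over shared bins."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bucket_fractions(values):
        counts = [0] * bins
        for v in values:
            i = sum(v > e for e in edges)  # index of the bin v falls into
            counts[i] += 1
        # Small epsilon avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e_frac = bucket_fractions(expected)
    a_frac = bucket_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_frac, a_frac))

rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # training distribution
stable = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]    # mean shifted by 1 sigma

# Rule of thumb: PSI < 0.1 stable, 0.1-0.2 moderate, > 0.2 trigger retraining
assert psi(reference, stable) < 0.1
assert psi(reference, shifted) > 0.2
```

A scheduled monitor would compute this against each day's production inputs and kick off the retraining pipeline when the threshold is crossed.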
Pattern 3: Feature Store Integration
模式3:特征存储集成
Ensure consistent features between training and inference.
Architecture:
Offline Store (Training):
Parquet/BigQuery → Point-in-Time Join → Training Dataset
Online Store (Inference):
Redis/DynamoDB → Low-Latency Lookup → Real-Time Prediction
Point-in-Time Correctness:
- Training: Fetch features as of specific timestamps (no future data)
- Inference: Fetch latest features (only past data)
- Guarantee: Same feature logic in training and inference
For implementation details and code examples, see references/feature-stores.md.
确保训练与推理阶段的特征一致性。
架构:
离线存储(训练):
Parquet/BigQuery → 时间点关联 → 训练数据集
在线存储(推理):
Redis/DynamoDB → 低延迟查询 → 实时预测
时间点正确性:
- 训练:获取特定时间点的特征(无未来数据)
- 推理:获取最新特征(仅历史数据)
- 保障:训练与推理使用相同的特征逻辑
如需实现细节与代码示例,请查看references/feature-stores.md。
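What point-in-time correctness means mechanically can be shown with a small pure-Python sketch of the join a feature store such as Feast performs under the hood. The entity name and feature values here are invented for illustration:

```python
from datetime import datetime

# Feature values for one entity, as (event_timestamp, features) pairs
feature_log = {
    "user_42": [
        (datetime(2024, 1, 1), {"txn_count_7d": 3}),
        (datetime(2024, 1, 8), {"txn_count_7d": 9}),
        (datetime(2024, 1, 15), {"txn_count_7d": 4}),
    ]
}

def point_in_time_lookup(entity_id, as_of):
    """Training: latest feature row at or before `as_of` — never future data."""
    rows = [(ts, f) for ts, f in feature_log[entity_id] if ts <= as_of]
    if not rows:
        return None
    return max(rows, key=lambda r: r[0])[1]

def latest_lookup(entity_id):
    """Inference: the most recent row (only past data by construction)."""
    return max(feature_log[entity_id], key=lambda r: r[0])[1]

# A label observed on Jan 10 must only see features known by Jan 10
training_features = point_in_time_lookup("user_42", datetime(2024, 1, 10))
# Online serving returns the latest known value
online_features = latest_lookup("user_42")
```

Because both paths read the same `feature_log` with the same lookup semantics, training rows never leak future values and serving uses identical feature logic — the two guarantees the pattern calls out.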
Pattern 4: Shadow Deployment Testing
模式4:影子部署测试
Test new models in production without risk.
Workflow:
- Deploy new model (v2) in shadow mode
- v2 receives copy of production traffic
- v1 predictions used for responses (no user impact)
- Compare v1 and v2 predictions offline
- Analyze differences, measure v2 accuracy
- Promote v2 to production if performance acceptable
Use Cases:
- High-risk models (financial, healthcare, safety-critical)
- Need extensive testing before cutover
- Compare model behavior on real production data
For deployment architecture, see references/deployment-strategies.md.
在生产环境测试新模型且无风险。
工作流:
- 以影子模式部署新模型(v2)
- v2接收生产流量副本
- v1的预测结果用于响应用户(无用户影响)
- 离线对比v1与v2的预测结果
- 分析差异,衡量v2的准确率
- 若性能达标则将v2升级至生产环境
适用场景:
- 高风险模型(金融、医疗、安全关键型)
- 需在切换前进行大量测试
- 对比模型在真实生产数据上的表现
如需部署架构,请查看references/deployment-strategies.md。
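The mirror-and-compare workflow above can be sketched in a few lines of pure Python. The two toy models and the agreement tolerance are placeholder assumptions; in production the mirroring would happen at the service mesh or gateway layer:

```python
import statistics

def model_v1(x):
    return x * 2.0            # current production model (toy stand-in)

def model_v2(x):
    return x * 2.0 + 0.1      # shadow candidate (toy stand-in)

shadow_log = []

def handle_request(x):
    """Serve v1 to the user; mirror the request to v2 and log both predictions."""
    live = model_v1(x)
    shadow = model_v2(x)      # computed and logged, never returned
    shadow_log.append({"input": x, "v1": live, "v2": shadow})
    return live               # only v1 affects the response

def offline_comparison(log, tolerance=0.5):
    """Offline analysis: how often do v1 and v2 agree within a tolerance?"""
    diffs = [abs(r["v1"] - r["v2"]) for r in log]
    agreement = sum(d <= tolerance for d in diffs) / len(diffs)
    return {"agreement": agreement, "mean_abs_diff": statistics.mean(diffs)}

responses = [handle_request(x) for x in range(100)]
report = offline_comparison(shadow_log)
```

The key property is visible in `handle_request`: v2's output never reaches the caller, so the candidate is exercised on real traffic with zero user impact, and promotion is decided from the offline report.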
Tool Recommendations
工具推荐
Production-Ready Tools (High Adoption)
生产级工具(高采用率)
MLflow - Experiment Tracking & Model Registry
- GitHub Stars: 20,000+
- Trust Score: 95/100
- Use Cases: Experiment tracking, model registry, model serving
- Strengths: Open-source, framework-agnostic, self-hosted option
- Getting Started:
pip install mlflow && mlflow server
Feast - Feature Store
- GitHub Stars: 5,000+
- Trust Score: 85/100
- Use Cases: Online/offline feature serving, point-in-time correctness
- Strengths: Cloud-agnostic, most popular open-source feature store
- Getting Started:
pip install feast && feast init
Seldon Core - Model Serving (Advanced)
- GitHub Stars: 4,000+
- Trust Score: 85/100
- Use Cases: Kubernetes-native serving, advanced deployment patterns
- Strengths: Canary, A/B testing, MAB, explainability
- Limitation: High complexity, steep learning curve
KServe - Model Serving (CNCF Standard)
- GitHub Stars: 3,500+
- Trust Score: 85/100
- Use Cases: Standardized serving API, serverless scaling
- Strengths: CNCF project, Knative integration, growing adoption
- Limitation: Kubernetes required
BentoML - Model Serving (Simplicity)
- GitHub Stars: 6,000+
- Trust Score: 80/100
- Use Cases: Easy packaging, Python-first deployment
- Strengths: Lowest learning curve, excellent developer experience
- Limitation: Fewer advanced features than Seldon/KServe
Kubeflow Pipelines - ML Orchestration
- GitHub Stars: 14,000+ (Kubeflow project)
- Trust Score: 90/100
- Use Cases: ML-specific pipelines, Kubernetes-native workflows
- Strengths: ML-native, component reusability, Katib integration
- Limitation: Kubernetes required, steep learning curve
Weights & Biases - Experiment Tracking (SaaS)
- Trust Score: 90/100
- Use Cases: Team collaboration, advanced visualization, hyperparameter tuning
- Strengths: Best-in-class UI, integrated Sweeps, strong community
- Limitation: SaaS pricing, no self-hosted free tier
For detailed tool comparisons, see references/tool-recommendations.md.
MLflow - 实验追踪与模型注册
- GitHub星标:20,000+
- 信任评分:95/100
- 适用场景:实验追踪、模型注册、模型部署
- 优势:开源、框架无关、支持自托管
- 快速开始:
pip install mlflow && mlflow server
Feast - 特征存储
- GitHub星标:5,000+
- 信任评分:85/100
- 适用场景:在线/离线特征服务、时间点正确性
- 优势:云无关、最受欢迎的开源特征存储
- 快速开始:
pip install feast && feast init
Seldon Core - 模型部署(高级功能)
- GitHub星标:4,000+
- 信任评分:85/100
- 适用场景:Kubernetes原生部署、高级部署策略
- 优势:支持金丝雀、A/B测试、MAB、可解释性
- 局限性:复杂度高,学习曲线陡峭
KServe - 模型部署(CNCF标准)
- GitHub星标:3,500+
- 信任评分:85/100
- 适用场景:标准化部署API、无服务器扩缩容
- 优势:CNCF项目、与Knative集成、采用率增长中
- 局限性:需Kubernetes环境
BentoML - 模型部署(简洁易用)
- GitHub星标:6,000+
- 信任评分:80/100
- 适用场景:简易打包、Python优先部署
- 优势:学习曲线最低、出色的开发者体验
- 局限性:高级功能少于Seldon/KServe
Kubeflow Pipelines - ML流水线编排
- GitHub星标:14,000+(Kubeflow项目)
- 信任评分:90/100
- 适用场景:ML专用流水线、Kubernetes原生工作流
- 优势:ML原生、组件可复用、与Katib集成
- 局限性:需Kubernetes环境,学习曲线陡峭
Weights & Biases - 实验追踪(SaaS)
- 信任评分:90/100
- 适用场景:团队协作、高级可视化、超参数调优
- 优势:业界领先的UI、集成Sweeps、活跃社区
- 局限性:SaaS定价,无免费自托管版本
如需详细工具对比,请查看references/tool-recommendations.md。
Tool Stack Recommendations by Organization
按组织规模推荐工具栈
Startup (Cost-Optimized, Simple):
- Experiment Tracking: MLflow (free, self-hosted)
- Feature Store: None initially → Feast when needed
- Model Serving: BentoML (easy) or cloud functions
- Orchestration: Prefect or cron jobs
- Monitoring: Basic logging + Prometheus
Growth Company (Balanced):
- Experiment Tracking: Weights & Biases or MLflow
- Feature Store: Feast (open-source, production-ready)
- Model Serving: BentoML or KServe (Kubernetes-based)
- Orchestration: Kubeflow Pipelines or Airflow
- Monitoring: Evidently + Prometheus + Grafana
Enterprise (Full Stack):
- Experiment Tracking: MLflow (self-hosted) or Neptune.ai (compliance)
- Feature Store: Tecton (managed) or Feast (self-hosted)
- Model Serving: Seldon Core (advanced) or KServe
- Orchestration: Kubeflow Pipelines or Airflow
- Monitoring: Evidently + Prometheus + Grafana + PagerDuty
Cloud-Native (Managed Services):
- AWS: SageMaker (end-to-end platform)
- GCP: Vertex AI (end-to-end platform)
- Azure: Azure ML (end-to-end platform)
For scenario-specific recommendations, see references/scenarios.md.
初创公司(成本优化、简洁):
- 实验追踪:MLflow(免费,自托管)
- 特征存储:初期无需 → 有需求时使用Feast
- 模型部署:BentoML(简易)或云函数
- 编排:Prefect或定时任务
- 监控:基础日志 + Prometheus
成长型公司(平衡方案):
- 实验追踪:Weights & Biases或MLflow
- 特征存储:Feast(开源,生产级)
- 模型部署:BentoML或KServe(Kubernetes环境)
- 编排:Kubeflow Pipelines或Airflow
- 监控:Evidently + Prometheus + Grafana
企业(全栈方案):
- 实验追踪:MLflow(自托管)或Neptune.ai(合规)
- 特征存储:Tecton(托管式)或Feast(自托管)
- 模型部署:Seldon Core(高级功能)或KServe
- 编排:Kubeflow Pipelines或Airflow
- 监控:Evidently + Prometheus + Grafana + PagerDuty
云原生(托管服务):
- AWS:SageMaker(端到端平台)
- GCP:Vertex AI(端到端平台)
- Azure:Azure ML(端到端平台)
如需场景化推荐,请查看references/scenarios.md。
Common Scenarios
常见场景
Scenario 1: Startup MLOps Stack
场景1:初创公司MLOps栈
Context: 20-person startup, 5 data scientists, 3 models (fraud detection, recommendation, churn), limited budget.
Recommendation:
- Experiment Tracking: MLflow (free, self-hosted)
- Model Serving: BentoML (simple, fast iteration)
- Orchestration: Prefect (simpler than Airflow)
- Monitoring: Prometheus + basic drift detection
- Feature Store: Skip initially, use database tables
Rationale:
- Minimize cost (all open-source, self-hosted)
- Fast iteration (BentoML easy to deploy)
- Don't over-engineer (no Kubeflow for 3 models)
- Add feature store (Feast) when scaling to 10+ models
For detailed scenario, see references/scenarios.md.
背景:20人初创公司,5名数据科学家,3个模型(欺诈检测、推荐系统、客户流失预测),预算有限。
推荐:
- 实验追踪:MLflow(免费,自托管)
- 模型部署:BentoML(简易,快速迭代)
- 编排:Prefect(比Airflow简洁)
- 监控:Prometheus + 基础漂移检测
- 特征存储:初期无需,使用数据库表
理由:
- 成本最小化(全开源、自托管)
- 快速迭代(BentoML部署便捷)
- 避免过度设计(3个模型无需Kubeflow)
- 模型数量达到10+时再引入Feast
如需详细场景,请查看references/scenarios.md。
Scenario 2: Enterprise ML Platform
场景2:企业级ML平台
Context: 500-person company, 50 data scientists, 100+ models, regulatory compliance, multi-cloud.
Recommendation:
- Experiment Tracking: Neptune.ai (compliance, audit logs) or MLflow (cost)
- Feature Store: Feast (self-hosted, cloud-agnostic)
- Model Serving: Seldon Core (advanced deployment patterns)
- Orchestration: Kubeflow Pipelines (ML-native, Kubernetes)
- Monitoring: Evidently + Prometheus + Grafana + PagerDuty
Rationale:
- Compliance required (Neptune audit logs, RBAC)
- Multi-cloud (Feast cloud-agnostic)
- Advanced deployments (Seldon canary, A/B testing)
- Scale (Kubernetes for 100+ models)
For detailed scenario, see references/scenarios.md.
背景:500人公司,50名数据科学家,100+模型,需监管合规,多云环境。
推荐:
- 实验追踪:Neptune.ai(合规、审计日志)或MLflow(成本优化)
- 特征存储:Feast(自托管,云无关)
- 模型部署:Seldon Core(高级部署策略)
- 编排:Kubeflow Pipelines(ML原生,Kubernetes)
- 监控:Evidently + Prometheus + Grafana + PagerDuty
理由:
- 需合规(Neptune的审计日志、RBAC)
- 多云环境(Feast云无关)
- 高级部署需求(Seldon的金丝雀、A/B测试)
- 规模化支持(Kubernetes支撑100+模型)
如需详细场景,请查看references/scenarios.md。
Scenario 3: LLM Fine-Tuning Pipeline
场景3:LLM微调流水线
Context: Fine-tune LLM for domain-specific use case, deploy for production serving.
Recommendation:
- Experiment Tracking: MLflow (track fine-tuning runs)
- Pipeline Orchestration: Kubeflow Pipelines (GPU scheduling)
- Model Serving: vLLM (high-throughput LLM serving)
- Prompt Versioning: Git + LangSmith
- Monitoring: Arize Phoenix (RAG monitoring)
Rationale:
- Track fine-tuning experiments (LoRA adapters, hyperparameters)
- GPU orchestration (Kubeflow on Kubernetes)
- Efficient LLM serving (vLLM optimized for throughput)
- Monitor RAG systems (retrieval + generation quality)
For detailed scenario, see references/scenarios.md.
背景:为特定领域微调LLM,部署至生产环境提供服务。
推荐:
- 实验追踪:MLflow(追踪微调运行)
- 流水线编排:Kubeflow Pipelines(GPU调度)
- 模型部署:vLLM(高吞吐量LLM部署)
- Prompt版本管理:Git + LangSmith
- 监控:Arize Phoenix(RAG监控)
理由:
- 追踪微调实验(LoRA适配器、超参数)
- GPU编排(Kubernetes上的Kubeflow)
- 高效LLM部署(vLLM针对吞吐量优化)
- 监控RAG系统(检索+生成质量)
如需详细场景,请查看references/scenarios.md。
Integration with Other Skills
与其他技能的集成
Direct Dependencies:
- ai-data-engineering: Feature engineering, ML algorithms, data preparation
- kubernetes-operations: K8s cluster management, GPU scheduling for ML workloads
- observability: Monitoring, alerting, distributed tracing for ML systems
Complementary Skills:
- data-architecture: Data pipelines, data lakes feeding ML models
- data-transformation: dbt for feature transformation pipelines
- streaming-data: Kafka, Flink for real-time ML inference
- designing-distributed-systems: Scalability patterns for ML workloads
- api-design-principles: ML model APIs, REST/gRPC serving patterns
Downstream Skills:
- building-ai-chat: LLM-powered applications consuming ML models
- visualizing-data: Dashboards for ML metrics and monitoring
直接依赖:
- ai-data-engineering:特征工程、ML算法、数据准备
- kubernetes-operations:K8s集群管理、ML工作负载的GPU调度
- observability:ML系统的监控、告警、分布式追踪
互补技能:
- data-architecture:为ML模型提供数据的流水线、数据湖
- data-transformation:使用dbt构建特征转换流水线
- streaming-data:Kafka、Flink用于实时ML推理
- designing-distributed-systems:ML工作负载的可扩展性模式
- api-design-principles:ML模型API、REST/gRPC部署模式
下游技能:
- building-ai-chat:调用ML模型的LLM驱动应用
- visualizing-data:ML指标与监控仪表盘
Best Practices
最佳实践
Version Everything:
- Code: Git commit SHA for reproducibility
- Data: DVC or data version hash
- Models: Semantic versioning (v1.2.3)
- Features: Feature store versioning
Automate Testing:
- Unit tests: Model loads, accepts input, produces output
- Integration tests: End-to-end pipeline execution
- Model validation: Accuracy thresholds, fairness checks
Monitor Continuously:
- Data drift: Distribution changes over time
- Model drift: Accuracy degradation
- Performance: Latency, throughput, error rates
Start Simple:
- Begin with MLflow + basic serving (BentoML)
- Add complexity as needed (feature store, Kubeflow)
- Avoid over-engineering (don't build Kubeflow for 2 models)
Point-in-Time Correctness:
- Use feature stores to avoid training/serving skew
- Ensure no future data leakage in training
- Consistent feature logic in training and inference
Deployment Strategies:
- Use canary for medium-risk models (gradual rollout)
- Use shadow for high-risk models (zero production impact)
- Always have a rollback plan (instant switch to previous version)
Governance:
- Model cards: Document model purpose, limitations, biases
- Audit trails: Track all model versions, deployments, approvals
- Compliance: EU AI Act, model risk management (SR 11-7)
Cost Optimization:
- Quantization: Reduce model size ~4x, inference speed 2-3x
- Spot instances: Train on preemptible VMs (60-90% cost reduction)
- Autoscaling: Scale inference endpoints based on load
版本化所有内容:
- 代码:使用Git提交SHA确保可复现性
- 数据:使用DVC或数据版本哈希
- 模型:语义化版本(v1.2.3)
- 特征:特征存储版本管理
自动化测试:
- 单元测试:模型可加载、接受输入、生成输出
- 集成测试:端到端流水线执行
- 模型验证:准确率阈值、公平性检查
持续监控:
- 数据漂移:分布随时间的变化
- 模型漂移:准确率下降
- 性能:延迟、吞吐量、错误率
从简开始:
- 先使用MLflow + 基础部署(BentoML)
- 按需增加复杂度(特征存储、Kubeflow)
- 避免过度设计(2个模型无需Kubeflow)
保证时间点正确性:
- 使用特征存储避免训练/服务偏差
- 确保训练中无未来数据泄露
- 训练与推理使用相同的特征逻辑
选择合适的部署策略:
- 中风险模型使用金丝雀部署(逐步推出)
- 高风险模型使用影子部署(生产测试无影响)
- 始终具备回滚方案(即时切换到上一版本)
建立治理体系:
- 模型卡片:记录模型用途、局限性、偏差
- 审计追踪:记录所有模型版本、部署、审批
- 合规:遵循EU AI法案、模型风险管理(SR 11-7)
成本优化:
- 量化:模型体积减小约4倍,推理速度提升2-3倍
- 抢占式实例:使用可抢占VM训练(成本降低60-90%)
- 自动扩缩容:根据负载扩缩推理端点
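The unit-level checks listed under "Automate Testing" — model loads, accepts input, produces output, meets an accuracy gate — can be sketched as a small validation suite. The toy model, the 10% closeness criterion, and the 0.9 gate are placeholder assumptions:

```python
import math

class ToyModel:
    """Placeholder model: predicts a fixed coefficient times the input."""
    def __init__(self, coef=2.0):
        self.coef = coef

    def predict(self, xs):
        return [self.coef * x for x in xs]

def load_model():
    # In practice this would pull a versioned artifact from the model registry
    return ToyModel()

def validate_model(model, test_inputs, test_labels, min_accuracy=0.9):
    report = {}
    # 1. Model accepts input and produces output of the right shape
    preds = model.predict(test_inputs)
    report["io_ok"] = len(preds) == len(test_inputs)
    # 2. Outputs are finite numbers (no NaN/inf leaking to production)
    report["finite_ok"] = all(math.isfinite(p) for p in preds)
    # 3. Accuracy gate: fraction of predictions within 10% of the label
    close = sum(abs(p - y) <= 0.1 * abs(y)
                for p, y in zip(preds, test_labels) if y)
    report["accuracy"] = close / len(test_labels)
    report["passed"] = (report["io_ok"] and report["finite_ok"]
                        and report["accuracy"] >= min_accuracy)
    return report

model = load_model()
report = validate_model(model, [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Wired into CI, a failing `report["passed"]` would block the registry stage transition, which is what turns these best practices from advice into an enforced gate.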
Anti-Patterns
反模式
❌ Notebooks in Production:
- Never deploy Jupyter notebooks to production
- Use notebooks for experimentation only
- Production: Use scripts, Docker containers, CI/CD pipelines
❌ Manual Model Deployment:
- Automate deployment with CI/CD pipelines
- Use model registry stage transitions (staging → production)
- Eliminate human error, ensure reproducibility
❌ No Monitoring:
- Production models without monitoring will degrade silently
- Implement drift detection (data drift, model drift)
- Set up alerting for accuracy drops, latency spikes
❌ Training/Serving Skew:
- Different feature logic in training vs inference
- Use feature stores to ensure consistency
- Test feature parity before production deployment
❌ Ignoring Data Quality:
- Garbage in, garbage out (GIGO)
- Validate data schema, ranges, distributions
- Use Great Expectations for data validation
❌ Over-Engineering:
- Don't build Kubeflow for 2 models
- Start simple (MLflow + BentoML)
- Add complexity only when necessary (10+ models)
❌ No Rollback Plan:
- Always have ability to rollback to previous model version
- Blue-green, canary, shadow deployments enable instant rollback
- Test rollback procedure before production deployment
❌ Notebook用于生产:
- 绝不要将Jupyter Notebook部署到生产环境
- Notebook仅用于实验
- 生产环境使用脚本、Docker容器、CI/CD流水线
❌ 手动部署模型:
- 使用CI/CD流水线自动化部署
- 使用模型注册中心的阶段转换(预生产→生产)
- 消除人为错误,确保可复现性
❌ 无监控:
- 无监控的生产模型会静默失效
- 实现漂移检测(数据漂移、模型漂移)
- 为准确率下降、延迟峰值设置告警
❌ 训练/服务偏差:
- 训练与推理使用不同的特征逻辑
- 使用特征存储确保一致性
- 生产部署前测试特征一致性
❌ 忽略数据质量:
- 垃圾进,垃圾出(GIGO)
- 验证数据schema、范围、分布
- 使用Great Expectations进行数据验证
❌ 过度设计:
- 2个模型无需搭建Kubeflow
- 从简开始(MLflow + BentoML)
- 仅在必要时增加复杂度(10+模型时)
❌ 无回滚方案:
- 始终具备回滚到上一版本的能力
- 蓝绿、金丝雀、影子部署支持即时回滚
- 生产部署前测试回滚流程
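The "no rollback plan" anti-pattern is avoided by making rollback an explicit branch in the rollout logic. A canary progression like the 5% → 100% one described earlier can be sketched as a simple control loop; the stage fractions, error-rate metric, and degradation threshold are illustrative assumptions:

```python
def canary_rollout(get_canary_error_rate, baseline_error_rate,
                   stages=(0.05, 0.25, 0.50, 1.00), max_degradation=0.05):
    """Advance canary traffic stage by stage; roll back on metric degradation.
    Returns ("promoted", 1.0) or ("rolled_back", last_safe_fraction)."""
    current = 0.0
    for fraction in stages:
        error_rate = get_canary_error_rate(fraction)
        if error_rate > baseline_error_rate + max_degradation:
            # Instant rollback: route all traffic back to the previous version
            return ("rolled_back", current)
        current = fraction
    return ("promoted", current)

# Healthy candidate: error rate matches the baseline at every stage
assert canary_rollout(lambda f: 0.02, baseline_error_rate=0.02) == ("promoted", 1.0)

# Degrading candidate: errors spike once it sees 50% of traffic
def degrading(fraction):
    return 0.02 if fraction < 0.5 else 0.15

assert canary_rollout(degrading, baseline_error_rate=0.02) == ("rolled_back", 0.25)
```

Because the rollback branch is exercised in tests like the second assertion, the procedure is verified before production deployment rather than discovered during an incident.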
Further Reading
延伸阅读
Reference Files:
- Experiment Tracking - MLflow, W&B, Neptune deep dive
- Model Registry - Versioning, lineage, stage transitions
- Feature Stores - Feast, Tecton, online/offline patterns
- Model Serving - Seldon, KServe, BentoML, optimization
- Deployment Strategies - Blue-green, canary, shadow, A/B
- ML Pipelines - Kubeflow, Airflow, training pipelines
- Model Monitoring - Drift detection, observability
- LLMOps Patterns - LLM fine-tuning, RAG, prompts
- Decision Frameworks - Tool selection frameworks
- Tool Recommendations - Detailed comparisons
- Scenarios - Startup, enterprise, LLMOps use cases
- Governance - Model cards, compliance, fairness
Example Projects:
- examples/mlflow-experiment/ - Complete MLflow setup
- examples/feast-feature-store/ - Feast online/offline
- examples/seldon-deployment/ - Canary, A/B testing
- examples/kubeflow-pipeline/ - End-to-end pipeline
- examples/monitoring-dashboard/ - Evidently + Prometheus
Scripts:
- scripts/setup_mlflow_server.sh - MLflow with PostgreSQL + S3
- scripts/feast_feature_definition_generator.py - Generate Feast features
- scripts/model_validation_suite.py - Automated model tests
- scripts/drift_detection_monitor.py - Scheduled drift detection
- scripts/kubernetes_model_deploy.py - Deploy to Seldon/KServe
参考文件:
- Experiment Tracking - MLflow、W&B、Neptune深度解析
- Model Registry - 版本管理、lineage、阶段转换
- Feature Stores - Feast、Tecton、在线/离线模式
- Model Serving - Seldon、KServe、BentoML、优化技术
- Deployment Strategies - 蓝绿、金丝雀、影子、A/B测试
- ML Pipelines - Kubeflow、Airflow、训练流水线
- Model Monitoring - 漂移检测、可观测性
- LLMOps Patterns - LLM微调、RAG、Prompt管理
- Decision Frameworks - 工具选型框架
- Tool Recommendations - 详细对比
- Scenarios - 初创公司、企业、LLMOps场景
- Governance - 模型卡片、合规、公平性
示例项目:
- examples/mlflow-experiment/ - 完整MLflow配置
- examples/feast-feature-store/ - Feast在线/离线存储
- examples/seldon-deployment/ - 金丝雀、A/B测试
- examples/kubeflow-pipeline/ - 端到端流水线
- examples/monitoring-dashboard/ - Evidently + Prometheus
脚本:
- scripts/setup_mlflow_server.sh - 带PostgreSQL + S3的MLflow配置
- scripts/feast_feature_definition_generator.py - 生成Feast特征
- scripts/model_validation_suite.py - 自动化模型测试
- scripts/drift_detection_monitor.py - 定时漂移检测
- scripts/kubernetes_model_deploy.py - 部署到Seldon/KServe