mle-workflow
Machine Learning Engineering Workflow
Use this skill to turn model work into a production ML system with clear data contracts, repeatable training, measurable quality gates, deployable artifacts, and operational monitoring.
When to Activate
- Planning or reviewing a production ML feature, model refresh, ranking system, recommender, classifier, embedding workflow, or forecasting pipeline
- Converting notebook code into a reusable training, evaluation, batch inference, or online inference pipeline
- Designing model promotion criteria, offline/online evals, experiment tracking, or rollback paths
- Debugging failures caused by data drift, label leakage, stale features, artifact mismatch, or inconsistent training and serving logic
- Adding model monitoring, canary rollout, shadow traffic, or post-deploy quality checks
Scope Calibration
Use only the lanes that fit the system in front of you. This skill is useful for ranking, search, recommendations, classifiers, forecasting, embeddings, LLM workflows, anomaly detection, and batch analytics, but it should not force one architecture onto all of them.
- Do not assume every model has supervised labels, online serving, a feature store, PyTorch, GPUs, human review, A/B tests, or real-time feedback.
- Do not add heavyweight MLOps machinery when a data contract, baseline, eval script, and rollback note would make the change reviewable.
- Do make assumptions explicit when the project lacks labels, delayed outcomes, slice definitions, production traffic, or monitoring ownership.
- Treat examples as interchangeable scaffolds. Replace metrics, serving mode, data stores, and rollout mechanics with the project-native equivalents.
Related Skills
- `python-patterns` and `python-testing` for Python implementation and pytest coverage
- `pytorch-patterns` for deep learning models, data loaders, device handling, and training loops
- `eval-harness` and `ai-regression-testing` for promotion gates and agent-assisted regression checks
- `database-migrations`, `postgres-patterns`, and `clickhouse-io` for data storage and analytics surfaces
- `deployment-patterns`, `docker-patterns`, and `security-review` for serving, secrets, containers, and production hardening
Reuse the SWE Surface
Do not treat MLE as separate from software engineering. Most ECC SWE workflows apply directly to ML systems, often with stricter failure modes:
The recommended install (`minimal --with capability:machine-learning`) keeps the core agent surface available alongside this skill. For skill-only or agent-limited harnesses, pair `skill:mle-workflow` with `agent:mle-reviewer` where the target supports agents.
| SWE surface | MLE use |
|---|---|
|  | Turn model work into explicit product contracts and record irreversible data, model, and rollout choices |
|  | Find existing training, feature, serving, eval, and monitoring paths before introducing a parallel ML stack |
|  | Scope model changes as product capabilities with data, eval, serving, and rollback phases |
|  | Test feature transforms, split logic, metric calculations, artifact loading, and inference schemas before implementation |
|  | Review code quality plus ML-specific leakage, reproducibility, promotion, and monitoring risks |
|  | Diagnose broken CI, flaky evals, missing fixtures, and environment-specific model or dependency failures |
|  | Require automated evidence for transforms, metrics, inference contracts, promotion gates, and rollback behavior |
|  | Turn offline metrics, slice checks, latency budgets, and rollback drills into repeatable gates |
|  | Preserve every production bug as a regression: missing feature, stale label, bad artifact, schema drift, or serving mismatch |
|  | Design prediction APIs, batch jobs, idempotent retraining endpoints, and response envelopes |
|  | Version labels, feature snapshots, prediction logs, experiment metrics, and drift analytics |
|  | Package reproducible training and serving images with health checks, resource limits, and rollback |
|  | Make rollout health visible with model-version, slice, drift, latency, cost, and delayed-label dashboards |
|  | Check model artifacts, notebooks, prompts, datasets, and logs for secrets, PII, unsafe deserialization, and supply-chain risk |
|  | Test critical product flows that consume predictions, including explainability and fallback UI states |
|  | Measure throughput, p95 latency, memory, GPU utilization, and cost per prediction or retrain |
|  | Route LLM/embedding workloads by quality, latency, and budget instead of defaulting to the largest model |
|  | Verify current library behavior for model serving, feature stores, vector DBs, and eval tooling before coding |
|  | Package MLE changes for review with crisp scope, generated artifacts excluded, and reproducible test evidence |
|  | Split long ML work into parallel tracks: data contract, eval harness, serving path, monitoring, and docs |
Ten MLE Task Simulations
Use these simulations as coverage checks when planning or reviewing MLE work. A strong MLE workflow should reduce each task to explicit contracts, reusable SWE surfaces, automated evidence, and a reviewable artifact.
| ID | Common MLE task | Streamlined ECC path | Required output | Pipeline lanes covered |
|---|---|---|---|---|
| MLE-01 | Frame an ambiguous prediction, ranking, recommender, classifier, embedding, or forecast capability | | Iteration Compact naming who cares, decision owner, success metric, unacceptable mistakes, assumptions, constraints, and first experiment | product contract, stakeholder loss, risk, rollout |
| MLE-02 | Define metric goals, labels, data sources, and the mistake budget | | Data and metric contract with entity grain, label timing, label confidence, feature timing, point-in-time joins, split policy, and dataset snapshot | data contract, metric design, leakage, reproducibility |
| MLE-03 | Build a baseline model and scoring path before adding complexity | | Baseline scorer with confusion matrix, calibration notes, latency/cost estimate, known weaknesses, and tests for score shape and determinism | baseline, scoring, testing, serving parity |
| MLE-04 | Generate features from hypotheses about what separates outcomes | | Feature plan and transform module covering signal source, missing values, outliers, correlations, leakage checks, and train/serve equivalence | feature pipeline, leakage, training, artifacts |
| MLE-05 | Tune thresholds, configs, and model complexity under tradeoffs | | Threshold/config report comparing precision, recall, F1, AUC, calibration, group slices, latency, cost, complexity, and acceptable error classes | evaluation, threshold, promotion, regression |
| MLE-06 | Run error analysis and turn mistakes into the next experiment | | Error cluster report for false positives, false negatives, ambiguous labels, stale features, missing signals, and bug traces with lessons captured | error analysis, bug trace, iteration, regression |
| MLE-07 | Package a model artifact for batch or online inference | | Versioned artifact bundle with preprocessing, config, dependency constraints, schema validation, safe loading, and PII-safe logs | artifact, security, inference contract |
| MLE-08 | Ship online serving or batch scoring with feedback capture | | Prediction endpoint or batch job with response envelope, timeout, batching, fallback, model version, confidence, feedback logging, and product-flow tests | serving, batch inference, fallback, user workflow |
| MLE-09 | Roll out a model with shadow traffic, canary, A/B test, or rollback | | Rollout plan naming traffic split, dashboards, p95 latency, cost, quality guardrails, rollback artifact, and rollback trigger | deployment, canary, rollback |
| MLE-10 | Operate, debug, and refresh a production model after launch | | Observation ledger and refresh plan with drift checks, delayed-label health, alert owners, runbook updates, retrain criteria, and PR evidence | monitoring, incident response, retraining |
Iteration Compact
Before touching model code, compress the work into one reviewable artifact. This should be short enough to fit in a PR description and precise enough that another engineer can challenge the tradeoffs.
```text
Goal:
Who cares:
Decision owner:
User or system action changed by the model:
Success metric:
Guardrail metrics:
Mistake budget:
Unacceptable mistakes:
Acceptable mistakes:
Assumptions:
Constraints:
Labels and data snapshot:
Baseline:
Candidate signals:
Threshold or config plan:
Eval slices:
Known risks:
Next experiment:
Rollback or fallback:
```

This compact is the MLE equivalent of a strong SWE design note. It keeps the team from optimizing a metric no one trusts, adding features that do not address the real error mode, or shipping complexity without a rollback.
Decision Brain
Use this loop whenever the task is ambiguous, high-impact, or metric-heavy:
- Start from the decision, not the model. Name the action that changes downstream behavior.
- Name who cares and why. Different stakeholders pay different costs for false positives, false negatives, latency, compute spend, opacity, or missed opportunities.
- Convert ambiguity into hypotheses. Ask what signal would separate outcomes, what evidence would disprove it, and what simple baseline should be hard to beat.
- Research prior art or a nearby known problem before inventing a bespoke system.
- Score choices with (probability, confidence) x (cost, severity, importance, impact).
- Consider adversarial behavior, incentives, selective disclosure, distribution shift, and feedback loops.
- Prefer the simplest change that reduces the most important mistake. Simplicity is not laziness; it is a way to minimize blunders while preserving iteration speed.
- Capture the decision, evidence, counterargument, and next reversible step.
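The scoring step above can be sketched as a tiny ranking helper. The field names and multiplicative weighting here are illustrative assumptions, not a prescribed formula; swap in whatever cost model the team actually uses:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateChange:
    name: str
    probability: float  # chance the change works as hypothesized (0-1)
    confidence: float   # how much evidence backs that probability (0-1)
    impact: float       # cost, severity, or importance avoided if it works


def expected_value(change: CandidateChange) -> float:
    # Discount optimistic probabilities by how much evidence supports them.
    return change.probability * change.confidence * change.impact


def rank_changes(changes: list[CandidateChange]) -> list[CandidateChange]:
    # Highest expected mistake reduction first.
    return sorted(changes, key=expected_value, reverse=True)
```

A well-evidenced moderate bet can outrank a speculative high-probability one, which is the point of scoring rather than gut-ranking.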
Metric and Mistake Economics
Choose metrics from failure costs, not habit:
- Use a confusion matrix early so the team can discuss concrete false positives and false negatives instead of abstract accuracy.
- Favor precision when the cost of an incorrect positive decision dominates.
- Favor recall when the cost of a missed positive dominates.
- Use F1 only when the precision/recall tradeoff is genuinely balanced and explainable.
- Use AUC or ranking metrics when ordering quality matters more than a single threshold.
- Track latency, throughput, memory, and cost as first-class metrics because they shape feasible model complexity.
- Compare against a baseline and the current production model before celebrating an offline gain.
- Treat real-world feedback signals as delayed labels with bias, lag, and coverage gaps; do not treat them as ground truth without analysis.
Every metric choice should state which mistake it makes cheaper, which mistake it makes more likely, and who absorbs that cost.
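One way to make that statement concrete is to price errors directly from confusion counts instead of reporting abstract accuracy. This is a minimal sketch; the per-error costs are illustrative placeholders for whatever the stakeholders actually pay:

```python
def confusion_counts(y_true: list[int], y_pred: list[int]) -> dict[str, int]:
    """Count the four confusion-matrix cells for a binary task."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def mistake_cost(counts: dict[str, int], fp_cost: float, fn_cost: float) -> float:
    # Price each error class explicitly so the tradeoff is a number, not a vibe.
    return counts["fp"] * fp_cost + counts["fn"] * fn_cost
```

Comparing two models by `mistake_cost` with the team's agreed costs often reverses a ranking produced by raw accuracy.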
Data and Feature Hypotheses
Features should come from a theory of separation:
- Text, categorical fields, numeric histories, graph relationships, recency, frequency, and aggregates are candidate signal families, not automatic features.
- For every feature family, state why it should separate outcomes and how it could leak future information.
- For noisy labels, consider adjudication, label confidence, soft targets, or confidence weighting.
- For class imbalance, compare weighted loss, resampling, threshold movement, and calibrated decision rules.
- For missing values, decide whether absence is informative, imputable, or a reason to abstain.
- For outliers, decide whether to clip, bucket, investigate, or preserve them as rare but important signal.
- For correlated features, check whether they are redundant, unstable, or proxies for unavailable future state.
Do not add model complexity until error analysis shows that the baseline is failing for a reason additional signal or capacity can plausibly fix.
Error Analysis Loop
After each baseline, training run, threshold change, or config change:
- Split mistakes into false positives, false negatives, abstentions, low-confidence cases, and system failures.
- Cluster errors by shared traits: language, entity type, source, time, geography, device, sparsity, recency, feature freshness, label source, or model version.
- Separate model mistakes from data bugs, label ambiguity, product ambiguity, instrumentation gaps, and serving mismatches.
- Trace each major cluster to one of four moves: better labels, better features, better threshold/config, or better product fallback.
- Preserve every important mistake as a regression test, eval slice, dashboard panel, or runbook entry.
- Write the next iteration as a falsifiable experiment, not a vague "improve model" task.
The strongest MLE loop is not train -> metric -> ship. It is mistake -> cluster -> hypothesis -> experiment -> evidence -> simpler system.
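The cluster step can start as simple trait counting before any tooling is involved. The error-record shape below is a hypothetical example; use whatever fields your prediction logs actually carry:

```python
from collections import Counter


def cluster_errors(errors: list[dict], trait: str) -> list[tuple[str, int]]:
    """Group mistakes by a shared trait and surface the largest clusters first."""
    counts = Counter(error.get(trait, "unknown") for error in errors)
    return counts.most_common()
```

Running this over false negatives by `language`, `source`, or `feature_freshness` usually points at one dominant cluster worth a falsifiable experiment.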
Observation Ledger
Keep a compact decision and evidence trail beside the code, PR, experiment report, or runbook:
```text
Iteration:
Change:
Why this mattered:
Metric movement:
Slice movement:
False positives:
False negatives:
Unexpected errors:
Decision:
Tradeoff accepted:
Lesson captured:
Regression added:
Debt created:
Next iteration:
```

Use the ledger to make model work cumulative. The goal is for each iteration to make the next decision easier, not merely to produce another artifact.
Core Workflow
1. Define the Prediction Contract
Capture the product-level contract before writing model code:
- Prediction target and decision owner
- Input entity, output schema, confidence/calibration fields, and allowed latency
- Batch, online, streaming, or hybrid serving mode
- Fallback behavior when the model, feature store, or dependency is unavailable
- Human review or override path for high-impact decisions
- Privacy, retention, and audit requirements for inputs, predictions, and labels
Do not accept "improve the model" as a requirement. Tie the model to an observable product behavior and a measurable acceptance gate.
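A minimal sketch of the output schema and fallback behavior described above, assuming a hypothetical envelope shape; the field names are illustrative, not a required contract:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictionEnvelope:
    entity_id: str
    score: float
    model_version: str
    is_fallback: bool  # True when the model or feature store was unavailable


def fallback_prediction(entity_id: str, prior_rate: float) -> PredictionEnvelope:
    # Serve a documented default (e.g. the population prior) instead of
    # failing the product flow when a dependency is down.
    return PredictionEnvelope(entity_id, prior_rate, "fallback-prior", True)
```

Carrying `model_version` and `is_fallback` in every response makes post-hoc debugging and label joins possible without guessing which code path fired.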
2. Lock the Data Contract
Every ML task needs an explicit data contract:
- Entity grain and primary key
- Label definition, label timestamp, and label availability delay
- Feature timestamp, freshness SLA, and point-in-time join rules
- Train, validation, test, and backtest split policy
- Required columns, allowed nulls, ranges, categories, and units
- PII or sensitive fields that must not enter training artifacts or logs
- Dataset version or snapshot ID for reproducibility
Guard against leakage first. If a feature is not available at prediction time, or is joined using future information, remove it or move it to an analysis-only path.
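A point-in-time guard can be as small as a timestamp filter applied before any feature join. The row shape here is a hypothetical example of the idea, not a full join implementation:

```python
from datetime import datetime


def point_in_time_features(
    feature_rows: list[dict],
    prediction_time: datetime,
) -> list[dict]:
    """Keep only feature rows observable at prediction time; anything stamped
    later would leak future information into training."""
    return [row for row in feature_rows if row["feature_time"] <= prediction_time]
```

Applying the same filter in the training-set builder and the serving path is one of the cheapest leakage defenses available.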
3. Build a Reproducible Pipeline
Training code should be runnable by another engineer without hidden notebook state:
- Use typed config files or dataclasses for all hyperparameters and paths
- Pin package and model dependencies
- Set random seeds and document any nondeterministic GPU behavior
- Record dataset version, code SHA, config hash, metrics, and artifact URI
- Save preprocessing logic with the model artifact, not separately in a notebook
- Keep train, eval, and inference transformations shared or generated from one source
- Make every step idempotent so retries do not corrupt artifacts or metrics
Prefer immutable values and pure transformation functions. Avoid mutating shared data frames or global config during feature generation.
```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TrainingConfig:
    dataset_uri: str
    model_dir: Path
    seed: int
    learning_rate: float
    batch_size: int


def artifact_name(config: TrainingConfig, code_sha: str) -> str:
    config_key = f"{config.dataset_uri}:{config.seed}:{config.learning_rate}:{config.batch_size}"
    config_hash = hashlib.sha256(config_key.encode("utf-8")).hexdigest()[:12]
    return f"{code_sha[:12]}-{config_hash}"
```

4. Evaluate Before Promotion
Promotion criteria should be declared before training finishes:
- Baseline model and current production model comparison
- Primary metric aligned to product behavior
- Guardrail metrics for latency, calibration, fairness slices, cost, and error concentration
- Slice metrics for important cohorts, geographies, devices, languages, or data sources
- Confidence intervals or repeated-run variance when metrics are noisy
- Failure examples reviewed by a human for high-impact models
- Explicit "do not ship" thresholds
```python
PROMOTION_GATES = {
    "auc": ("min", 0.82),
    "calibration_error": ("max", 0.04),
    "p95_latency_ms": ("max", 80),
}


def assert_promotion_ready(metrics: dict[str, float]) -> None:
    missing = sorted(name for name in PROMOTION_GATES if name not in metrics)
    if missing:
        raise ValueError(f"Model promotion metrics missing required gates: {missing}")
    failures = {
        name: value
        for name, (direction, threshold) in PROMOTION_GATES.items()
        for value in [metrics[name]]
        if (direction == "min" and value < threshold)
        or (direction == "max" and value > threshold)
    }
    if failures:
        raise ValueError(f"Model failed promotion gates: {failures}")
```

Use offline metrics as gates, not guarantees. When the model changes product behavior, plan shadow evaluation, canary rollout, or A/B testing before full rollout.
5. Package for Serving
An ML artifact is production-ready only when the serving contract is testable:
- Model artifact includes version, training data reference, config, and preprocessing
- Input schema rejects invalid, stale, or out-of-range features
- Output schema includes model version and confidence or explanation fields when useful
- Serving path has timeout, batching, resource limits, and fallback behavior
- CPU/GPU requirements are explicit and tested
- Prediction logs avoid PII and include enough identifiers for debugging and label joins
- Integration tests cover missing features, stale features, bad types, empty batches, and fallback path
Never let training-only feature code diverge from serving feature code without a test that proves equivalence.
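An equivalence test can be as simple as routing both paths through one shared transform and asserting parity over a fixture. The transform below is a hypothetical example; the pattern is what matters, namely that the offline and online paths import the same function:

```python
def normalize_amount(raw: float, mean: float, std: float) -> float:
    """Single transform imported by both the training pipeline and the server."""
    return (raw - mean) / std if std else 0.0


def test_train_serve_equivalence() -> None:
    samples = [0.0, 10.0, 55.5, -3.2]
    mean, std = 12.0, 4.0
    # Offline path: batch feature generation for the training set.
    train_values = [normalize_amount(s, mean, std) for s in samples]
    # Online path: per-request featurization at serving time.
    serve_values = [normalize_amount(s, mean, std) for s in samples]
    assert train_values == serve_values
```

When the two paths cannot share code (e.g. SQL offline, Python online), the same test shape still applies: run both implementations over one fixture and assert the outputs match.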
6. Operate the Model
Model monitoring needs both system and quality signals:
- Availability, error rate, timeout rate, queue depth, and p50/p95/p99 latency
- Feature null rate, range drift, categorical drift, and freshness drift
- Prediction distribution drift and confidence distribution drift
- Label arrival health and delayed quality metrics
- Business KPI guardrails and rollback triggers
- Per-version dashboards for canaries and rollbacks
Every deployment should have a rollback plan that names the previous artifact, config, data dependency, and traffic-switch mechanism.
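Prediction-distribution drift can be checked with a population stability index over binned score proportions. This is a minimal sketch; the 0.2 alert level mentioned in the comment is a common rule of thumb, not a universal constant, and bins are assumed to be pre-normalized proportions:

```python
import math


def population_stability_index(expected: list[float], actual: list[float]) -> float:
    """PSI over aligned histogram bins of prediction scores.
    Values above roughly 0.2 usually warrant investigation."""
    eps = 1e-6  # guard against empty bins in the log ratio
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )
```

Computed per model version and per slice, this gives drift dashboards a single scalar to alert on while the raw histograms stay available for debugging.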
Review Checklist
- Prediction contract is explicit and testable
- Data contract defines entity grain, label timing, feature timing, and snapshot/version
- Leakage risks were checked against prediction-time availability
- Training is reproducible from code, config, data version, and seed
- Metrics compare against baseline and current production model
- Slice metrics and guardrails are included for high-risk cohorts
- Promotion gates are automated and fail closed
- Training and serving transformations are shared or equivalence-tested
- Model artifact carries version, config, dataset reference, and preprocessing
- Serving path validates inputs and has timeout, fallback, and rollback behavior
- Monitoring covers system health, feature drift, prediction drift, and delayed labels
- Sensitive data is excluded from artifacts, logs, prompts, and examples
Anti-Patterns
- Notebook state is required to reproduce the model
- Random split leaks future data into validation or test sets
- Feature joins ignore event time and label availability
- Offline metric improves while important slices regress
- Thresholds are tuned on the test set repeatedly
- Training preprocessing is copied manually into serving code
- Model version is missing from prediction logs
- Monitoring only checks service uptime, not data or prediction quality
- Rollback requires retraining instead of switching to a known-good artifact
Output Expectations
When using this skill, return concrete artifacts: data contract, promotion gates, pipeline steps, test plan, deployment plan, or review findings. Call out unknowns that block production readiness instead of filling them with assumptions.