mle-workflow
Machine Learning Engineering Workflow
Use this skill to turn model work into a production ML system with clear data contracts, repeatable training, measurable quality gates, deployable artifacts, and operational monitoring.
When to Activate
- Planning or reviewing a production ML feature, model refresh, ranking system, recommender, classifier, embedding workflow, or forecasting pipeline
- Converting notebook code into a reusable training, evaluation, batch inference, or online inference pipeline
- Designing model promotion criteria, offline/online evals, experiment tracking, or rollback paths
- Debugging failures caused by data drift, label leakage, stale features, artifact mismatch, or inconsistent training and serving logic
- Adding model monitoring, canary rollout, shadow traffic, or post-deploy quality checks
Scope Calibration
Use only the lanes that fit the system in front of you. This skill is useful for ranking, search, recommendations, classifiers, forecasting, embeddings, LLM workflows, anomaly detection, and batch analytics, but it should not force one architecture onto all of them.
- Do not assume every model has supervised labels, online serving, a feature store, PyTorch, GPUs, human review, A/B tests, or real-time feedback.
- Do not add heavyweight MLOps machinery when a data contract, baseline, eval script, and rollback note would make the change reviewable.
- Do make assumptions explicit when the project lacks labels, delayed outcomes, slice definitions, production traffic, or monitoring ownership.
- Treat examples as interchangeable scaffolds. Replace metrics, serving mode, data stores, and rollout mechanics with the project-native equivalents.
Related Skills
- `python-patterns` and `python-testing` for Python implementation and pytest coverage
- `pytorch-patterns` for deep learning models, data loaders, device handling, and training loops
- `eval-harness` and `ai-regression-testing` for promotion gates and agent-assisted regression checks
- `database-migrations`, `postgres-patterns`, and `clickhouse-io` for data storage and analytics surfaces
- `deployment-patterns`, `docker-patterns`, and `security-review` for serving, secrets, containers, and production hardening
Reuse the SWE Surface
Do not treat MLE as separate from software engineering. Most ECC SWE workflows apply directly to ML systems, often with stricter failure modes:
The recommended install (`minimal --with capability:machine-learning`) keeps the core agent surface available alongside this skill. For skill-only or agent-limited harnesses, pair `skill:mle-workflow` with `agent:mle-reviewer` where the target supports agents.
| SWE surface | MLE use |
|---|---|
|  | Turn model work into explicit product contracts and record irreversible data, model, and rollout choices |
|  | Find existing training, feature, serving, eval, and monitoring paths before introducing a parallel ML stack |
|  | Scope model changes as product capabilities with data, eval, serving, and rollback phases |
|  | Test feature transforms, split logic, metric calculations, artifact loading, and inference schemas before implementation |
|  | Review code quality plus ML-specific leakage, reproducibility, promotion, and monitoring risks |
|  | Diagnose broken CI, flaky evals, missing fixtures, and environment-specific model or dependency failures |
|  | Require automated evidence for transforms, metrics, inference contracts, promotion gates, and rollback behavior |
|  | Turn offline metrics, slice checks, latency budgets, and rollback drills into repeatable gates |
|  | Preserve every production bug as a regression: missing feature, stale label, bad artifact, schema drift, or serving mismatch |
|  | Design prediction APIs, batch jobs, idempotent retraining endpoints, and response envelopes |
|  | Version labels, feature snapshots, prediction logs, experiment metrics, and drift analytics |
|  | Package reproducible training and serving images with health checks, resource limits, and rollback |
|  | Make rollout health visible with model-version, slice, drift, latency, cost, and delayed-label dashboards |
|  | Check model artifacts, notebooks, prompts, datasets, and logs for secrets, PII, unsafe deserialization, and supply-chain risk |
|  | Test critical product flows that consume predictions, including explainability and fallback UI states |
|  | Measure throughput, p95 latency, memory, GPU utilization, and cost per prediction or retrain |
|  | Route LLM/embedding workloads by quality, latency, and budget instead of defaulting to the largest model |
|  | Verify current library behavior for model serving, feature stores, vector DBs, and eval tooling before coding |
|  | Package MLE changes for review with crisp scope, generated artifacts excluded, and reproducible test evidence |
|  | Split long ML work into parallel tracks: data contract, eval harness, serving path, monitoring, and docs |
Ten MLE Task Simulations
Use these simulations as coverage checks when planning or reviewing MLE work. A strong MLE workflow should reduce each task to explicit contracts, reusable SWE surfaces, automated evidence, and a reviewable artifact.
| ID | Common MLE task | Streamlined ECC path | Required output | Pipeline lanes covered |
|---|---|---|---|---|
| MLE-01 | Frame an ambiguous prediction, ranking, recommender, classifier, embedding, or forecast capability | | Iteration Compact naming who cares, decision owner, success metric, unacceptable mistakes, assumptions, constraints, and first experiment | product contract, stakeholder loss, risk, rollout |
| MLE-02 | Define metric goals, labels, data sources, and the mistake budget | | Data and metric contract with entity grain, label timing, label confidence, feature timing, point-in-time joins, split policy, and dataset snapshot | data contract, metric design, leakage, reproducibility |
| MLE-03 | Build a baseline model and scoring path before adding complexity | | Baseline scorer with confusion matrix, calibration notes, latency/cost estimate, known weaknesses, and tests for score shape and determinism | baseline, scoring, testing, serving parity |
| MLE-04 | Generate features from hypotheses about what separates outcomes | | Feature plan and transform module covering signal source, missing values, outliers, correlations, leakage checks, and train/serve equivalence | feature pipeline, leakage, training, artifacts |
| MLE-05 | Tune thresholds, configs, and model complexity under tradeoffs | | Threshold/config report comparing precision, recall, F1, AUC, calibration, group slices, latency, cost, complexity, and acceptable error classes | evaluation, threshold, promotion, regression |
| MLE-06 | Run error analysis and turn mistakes into the next experiment | | Error cluster report for false positives, false negatives, ambiguous labels, stale features, missing signals, and bug traces with lessons captured | error analysis, bug trace, iteration, regression |
| MLE-07 | Package a model artifact for batch or online inference | | Versioned artifact bundle with preprocessing, config, dependency constraints, schema validation, safe loading, and PII-safe logs | artifact, security, inference contract |
| MLE-08 | Ship online serving or batch scoring with feedback capture | | Prediction endpoint or batch job with response envelope, timeout, batching, fallback, model version, confidence, feedback logging, and product-flow tests | serving, batch inference, fallback, user workflow |
| MLE-09 | Roll out a model with shadow traffic, canary, A/B test, or rollback | | Rollout plan naming traffic split, dashboards, p95 latency, cost, quality guardrails, rollback artifact, and rollback trigger | deployment, canary, rollback |
| MLE-10 | Operate, debug, and refresh a production model after launch | | Observation ledger and refresh plan with drift checks, delayed-label health, alert owners, runbook updates, retrain criteria, and PR evidence | monitoring, incident response, retraining |
Iteration Compact
Before touching model code, compress the work into one reviewable artifact. This should be short enough to fit in a PR description and precise enough that another engineer can challenge the tradeoffs.
```text
Goal:
Who cares:
Decision owner:
User or system action changed by the model:
Success metric:
Guardrail metrics:
Mistake budget:
Unacceptable mistakes:
Acceptable mistakes:
Assumptions:
Constraints:
Labels and data snapshot:
Baseline:
Candidate signals:
Threshold or config plan:
Eval slices:
Known risks:
Next experiment:
Rollback or fallback:
```

This compact is the MLE equivalent of a strong SWE design note. It keeps the team from optimizing a metric no one trusts, adding features that do not address the real error mode, or shipping complexity without a rollback.
Decision Brain
Use this loop whenever the task is ambiguous, high-impact, or metric-heavy:
- Start from the decision, not the model. Name the action that changes downstream behavior.
- Name who cares and why. Different stakeholders pay different costs for false positives, false negatives, latency, compute spend, opacity, or missed opportunities.
- Convert ambiguity into hypotheses. Ask what signal would separate outcomes, what evidence would disprove it, and what simple baseline should be hard to beat.
- Research prior art or a nearby known problem before inventing a bespoke system.
- Score choices with (probability, confidence) x (cost, severity, importance, impact).
- Consider adversarial behavior, incentives, selective disclosure, distribution shift, and feedback loops.
- Prefer the simplest change that reduces the most important mistake. Simplicity is not laziness; it is a way to minimize blunders while preserving iteration speed.
- Capture the decision, evidence, counterargument, and next reversible step.
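The scoring step above can be sketched as a tiny ranking helper. The field names and multiplicative weighting here are illustrative assumptions, not a prescribed formula; swap in whatever cost model the team actually uses:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateChange:
    name: str
    probability: float  # chance the change works as hypothesized (0-1)
    confidence: float   # how much evidence backs that probability (0-1)
    impact: float       # cost, severity, or importance avoided if it works


def expected_value(change: CandidateChange) -> float:
    # Discount optimistic probabilities by how much evidence supports them.
    return change.probability * change.confidence * change.impact


def rank_changes(changes: list[CandidateChange]) -> list[CandidateChange]:
    # Highest expected mistake reduction first.
    return sorted(changes, key=expected_value, reverse=True)
```

A well-evidenced moderate bet can outrank a speculative high-probability one, which is the point of scoring rather than gut-ranking.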
Metric and Mistake Economics
Choose metrics from failure costs, not habit:
- Use a confusion matrix early so the team can discuss concrete false positives and false negatives instead of abstract accuracy.
- Favor precision when the cost of an incorrect positive decision dominates.
- Favor recall when the cost of a missed positive dominates.
- Use F1 only when the precision/recall tradeoff is genuinely balanced and explainable.
- Use AUC or ranking metrics when ordering quality matters more than a single threshold.
- Track latency, throughput, memory, and cost as first-class metrics because they shape feasible model complexity.
- Compare against a baseline and the current production model before celebrating an offline gain.
- Treat real-world feedback signals as delayed labels with bias, lag, and coverage gaps; do not treat them as ground truth without analysis.
Every metric choice should state which mistake it makes cheaper, which mistake it makes more likely, and who absorbs that cost.
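One way to make that statement concrete is to price errors directly from confusion counts instead of reporting abstract accuracy. This is a minimal sketch; the per-error costs are illustrative placeholders for whatever the stakeholders actually pay:

```python
def confusion_counts(y_true: list[int], y_pred: list[int]) -> dict[str, int]:
    """Count the four confusion-matrix cells for a binary task."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def mistake_cost(counts: dict[str, int], fp_cost: float, fn_cost: float) -> float:
    # Price each error class explicitly so the tradeoff is a number, not a vibe.
    return counts["fp"] * fp_cost + counts["fn"] * fn_cost
```

Comparing two models by `mistake_cost` with the team's agreed costs often reverses a ranking produced by raw accuracy.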
Data and Feature Hypotheses
Features should come from a theory of separation:
- Text, categorical fields, numeric histories, graph relationships, recency, frequency, and aggregates are candidate signal families, not automatic features.
- For every feature family, state why it should separate outcomes and how it could leak future information.
- For noisy labels, consider adjudication, label confidence, soft targets, or confidence weighting.
- For class imbalance, compare weighted loss, resampling, threshold movement, and calibrated decision rules.
- For missing values, decide whether absence is informative, imputable, or a reason to abstain.
- For outliers, decide whether to clip, bucket, investigate, or preserve them as rare but important signal.
- For correlated features, check whether they are redundant, unstable, or proxies for unavailable future state.
Do not add model complexity until error analysis shows that the baseline is failing for a reason additional signal or capacity can plausibly fix.
Error Analysis Loop
After each baseline, training run, threshold change, or config change:
- Split mistakes into false positives, false negatives, abstentions, low-confidence cases, and system failures.
- Cluster errors by shared traits: language, entity type, source, time, geography, device, sparsity, recency, feature freshness, label source, or model version.
- Separate model mistakes from data bugs, label ambiguity, product ambiguity, instrumentation gaps, and serving mismatches.
- Trace each major cluster to one of four moves: better labels, better features, better threshold/config, or better product fallback.
- Preserve every important mistake as a regression test, eval slice, dashboard panel, or runbook entry.
- Write the next iteration as a falsifiable experiment, not a vague "improve model" task.
The strongest MLE loop is not train -> metric -> ship. It is mistake -> cluster -> hypothesis -> experiment -> evidence -> simpler system.
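The cluster step can start as simple trait counting before any tooling is involved. The error-record shape below is a hypothetical example; use whatever fields your prediction logs actually carry:

```python
from collections import Counter


def cluster_errors(errors: list[dict], trait: str) -> list[tuple[str, int]]:
    """Group mistakes by a shared trait and surface the largest clusters first."""
    counts = Counter(error.get(trait, "unknown") for error in errors)
    return counts.most_common()
```

Running this over false negatives by `language`, `source`, or `feature_freshness` usually points at one dominant cluster worth a falsifiable experiment.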
Observation Ledger
Keep a compact decision and evidence trail beside the code, PR, experiment report, or runbook:
```text
Iteration:
Change:
Why this mattered:
Metric movement:
Slice movement:
False positives:
False negatives:
Unexpected errors:
Decision:
Tradeoff accepted:
Lesson captured:
Regression added:
Debt created:
Next iteration:
```

Use the ledger to make model work cumulative. The goal is for each iteration to make the next decision easier, not merely to produce another artifact.
Core Workflow
1. Define the Prediction Contract
Capture the product-level contract before writing model code:
- Prediction target and decision owner
- Input entity, output schema, confidence/calibration fields, and allowed latency
- Batch, online, streaming, or hybrid serving mode
- Fallback behavior when the model, feature store, or dependency is unavailable
- Human review or override path for high-impact decisions
- Privacy, retention, and audit requirements for inputs, predictions, and labels
Do not accept "improve the model" as a requirement. Tie the model to an observable product behavior and a measurable acceptance gate.
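A minimal sketch of the output schema and fallback behavior described above, assuming a hypothetical envelope shape; the field names are illustrative, not a required contract:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictionEnvelope:
    entity_id: str
    score: float
    model_version: str
    is_fallback: bool  # True when the model or feature store was unavailable


def fallback_prediction(entity_id: str, prior_rate: float) -> PredictionEnvelope:
    # Serve a documented default (e.g. the population prior) instead of
    # failing the product flow when a dependency is down.
    return PredictionEnvelope(entity_id, prior_rate, "fallback-prior", True)
```

Carrying `model_version` and `is_fallback` in every response makes post-hoc debugging and label joins possible without guessing which code path fired.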
2. Lock the Data Contract
Every ML task needs an explicit data contract:
- Entity grain and primary key
- Label definition, label timestamp, and label availability delay
- Feature timestamp, freshness SLA, and point-in-time join rules
- Train, validation, test, and backtest split policy
- Required columns, allowed nulls, ranges, categories, and units
- PII or sensitive fields that must not enter training artifacts or logs
- Dataset version or snapshot ID for reproducibility
Guard against leakage first. If a feature is not available at prediction time, or is joined using future information, remove it or move it to an analysis-only path.
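A point-in-time guard can be as small as a timestamp filter applied before any feature join. The row shape here is a hypothetical example of the idea, not a full join implementation:

```python
from datetime import datetime


def point_in_time_features(
    feature_rows: list[dict],
    prediction_time: datetime,
) -> list[dict]:
    """Keep only feature rows observable at prediction time; anything stamped
    later would leak future information into training."""
    return [row for row in feature_rows if row["feature_time"] <= prediction_time]
```

Applying the same filter in the training-set builder and the serving path is one of the cheapest leakage defenses available.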
3. Build a Reproducible Pipeline
Training code should be runnable by another engineer without hidden notebook state:
- Use typed config files or dataclasses for all hyperparameters and paths
- Pin package and model dependencies
- Set random seeds and document any nondeterministic GPU behavior
- Record dataset version, code SHA, config hash, metrics, and artifact URI
- Save preprocessing logic with the model artifact, not separately in a notebook
- Keep train, eval, and inference transformations shared or generated from one source
- Make every step idempotent so retries do not corrupt artifacts or metrics
Prefer immutable values and pure transformation functions. Avoid mutating shared data frames or global config during feature generation.
```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TrainingConfig:
    dataset_uri: str
    model_dir: Path
    seed: int
    learning_rate: float
    batch_size: int


def artifact_name(config: TrainingConfig, code_sha: str) -> str:
    config_key = f"{config.dataset_uri}:{config.seed}:{config.learning_rate}:{config.batch_size}"
    config_hash = hashlib.sha256(config_key.encode("utf-8")).hexdigest()[:12]
    return f"{code_sha[:12]}-{config_hash}"
```

4. Evaluate Before Promotion
Promotion criteria should be declared before training finishes:
- Baseline model and current production model comparison
- Primary metric aligned to product behavior
- Guardrail metrics for latency, calibration, fairness slices, cost, and error concentration
- Slice metrics for important cohorts, geographies, devices, languages, or data sources
- Confidence intervals or repeated-run variance when metrics are noisy
- Failure examples reviewed by a human for high-impact models
- Explicit "do not ship" thresholds
```python
PROMOTION_GATES = {
    "auc": ("min", 0.82),
    "calibration_error": ("max", 0.04),
    "p95_latency_ms": ("max", 80),
}


def assert_promotion_ready(metrics: dict[str, float]) -> None:
    missing = sorted(name for name in PROMOTION_GATES if name not in metrics)
    if missing:
        raise ValueError(f"Model promotion metrics missing required gates: {missing}")
    failures = {
        name: value
        for name, (direction, threshold) in PROMOTION_GATES.items()
        for value in [metrics[name]]
        if (direction == "min" and value < threshold)
        or (direction == "max" and value > threshold)
    }
    if failures:
        raise ValueError(f"Model failed promotion gates: {failures}")
```

Use offline metrics as gates, not guarantees. When the model changes product behavior, plan shadow evaluation, canary rollout, or A/B testing before full rollout.
5. Package for Serving
An ML artifact is production-ready only when the serving contract is testable:
- Model artifact includes version, training data reference, config, and preprocessing
- Input schema rejects invalid, stale, or out-of-range features
- Output schema includes model version and confidence or explanation fields when useful
- Serving path has timeout, batching, resource limits, and fallback behavior
- CPU/GPU requirements are explicit and tested
- Prediction logs avoid PII and include enough identifiers for debugging and label joins
- Integration tests cover missing features, stale features, bad types, empty batches, and fallback path
Never let training-only feature code diverge from serving feature code without a test that proves equivalence.
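An equivalence test can be as simple as routing both paths through one shared transform and asserting parity over a fixture. The transform below is a hypothetical example; the pattern is what matters, namely that the offline and online paths import the same function:

```python
def normalize_amount(raw: float, mean: float, std: float) -> float:
    """Single transform imported by both the training pipeline and the server."""
    return (raw - mean) / std if std else 0.0


def test_train_serve_equivalence() -> None:
    samples = [0.0, 10.0, 55.5, -3.2]
    mean, std = 12.0, 4.0
    # Offline path: batch feature generation for the training set.
    train_values = [normalize_amount(s, mean, std) for s in samples]
    # Online path: per-request featurization at serving time.
    serve_values = [normalize_amount(s, mean, std) for s in samples]
    assert train_values == serve_values
```

When the two paths cannot share code (e.g. SQL offline, Python online), the same test shape still applies: run both implementations over one fixture and assert the outputs match.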
6. Operate the Model
Model monitoring needs both system and quality signals:
- Availability, error rate, timeout rate, queue depth, and p50/p95/p99 latency
- Feature null rate, range drift, categorical drift, and freshness drift
- Prediction distribution drift and confidence distribution drift
- Label arrival health and delayed quality metrics
- Business KPI guardrails and rollback triggers
- Per-version dashboards for canaries and rollbacks
Every deployment should have a rollback plan that names the previous artifact, config, data dependency, and traffic-switch mechanism.
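Prediction-distribution drift can be checked with a population stability index over binned score proportions. This is a minimal sketch; the 0.2 alert level mentioned in the comment is a common rule of thumb, not a universal constant, and bins are assumed to be pre-normalized proportions:

```python
import math


def population_stability_index(expected: list[float], actual: list[float]) -> float:
    """PSI over aligned histogram bins of prediction scores.
    Values above roughly 0.2 usually warrant investigation."""
    eps = 1e-6  # guard against empty bins in the log ratio
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )
```

Computed per model version and per slice, this gives drift dashboards a single scalar to alert on while the raw histograms stay available for debugging.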
Review Checklist
- Prediction contract is explicit and testable
- Data contract defines entity grain, label timing, feature timing, and snapshot/version
- Leakage risks were checked against prediction-time availability
- Training is reproducible from code, config, data version, and seed
- Metrics compare against baseline and current production model
- Slice metrics and guardrails are included for high-risk cohorts
- Promotion gates are automated and fail closed
- Training and serving transformations are shared or equivalence-tested
- Model artifact carries version, config, dataset reference, and preprocessing
- Serving path validates inputs and has timeout, fallback, and rollback behavior
- Monitoring covers system health, feature drift, prediction drift, and delayed labels
- Sensitive data is excluded from artifacts, logs, prompts, and examples
Anti-Patterns
- Notebook state is required to reproduce the model
- Random split leaks future data into validation or test sets
- Feature joins ignore event time and label availability
- Offline metric improves while important slices regress
- Thresholds are tuned on the test set repeatedly
- Training preprocessing is copied manually into serving code
- Model version is missing from prediction logs
- Monitoring only checks service uptime, not data or prediction quality
- Rollback requires retraining instead of switching to a known-good artifact
Output Expectations
When using this skill, return concrete artifacts: data contract, promotion gates, pipeline steps, test plan, deployment plan, or review findings. Call out unknowns that block production readiness instead of filling them with assumptions.