Production machine-learning engineering workflow for data contracts, reproducible training, model evaluation, deployment, monitoring, and rollback. Use when building, reviewing, or hardening ML systems beyond one-off notebooks.
Use this skill to turn model work into a production ML system with clear data contracts, repeatable training, measurable quality gates, deployable artifacts, and operational monitoring.
When to Activate
Planning or reviewing a production ML feature, model refresh, ranking system, recommender, classifier, embedding workflow, or forecasting pipeline
Converting notebook code into a reusable training, evaluation, batch inference, or online inference pipeline
Designing model promotion criteria, offline/online evals, experiment tracking, or rollback paths
Debugging failures caused by data drift, label leakage, stale features, artifact mismatch, or inconsistent training and serving logic
Adding model monitoring, canary rollout, shadow traffic, or post-deploy quality checks
Scope Calibration
Use only the lanes that fit the system in front of you. This skill is useful for ranking, search, recommendations, classifiers, forecasting, embeddings, LLM workflows, anomaly detection, and batch analytics, but it should not force one architecture onto all of them.
Do not assume every model has supervised labels, online serving, a feature store, PyTorch, GPUs, human review, A/B tests, or real-time feedback.
Do not add heavyweight MLOps machinery when a data contract, baseline, eval script, and rollback note would make the change reviewable.
Do make assumptions explicit when the project lacks labels, slice definitions, production traffic, or monitoring ownership, or when outcome labels arrive only after a delay
Treat examples as interchangeable scaffolds. Replace metrics, serving mode, data stores, and rollout mechanics with the project-native equivalents.
Related Skills
- `python-patterns` and `python-testing` for Python implementation and pytest coverage
- `pytorch-patterns` for deep learning models, data loaders, device handling, and training loops
- `eval-harness` and `ai-regression-testing` for promotion gates and agent-assisted regression checks
- `database-migrations`, `postgres-patterns`, and `clickhouse-io` for data storage and analytics surfaces
- `deployment-patterns`, `docker-patterns`, and `security-review` for serving, secrets, containers, and production hardening
Reuse the SWE Surface
Do not treat MLE as separate from software engineering. Most ECC SWE workflows apply directly to ML systems, often with stricter failure modes:
The recommended `minimal --with capability:machine-learning` install keeps the core agent surface available alongside this skill. For skill-only or agent-limited harnesses, pair `skill:mle-workflow` with `agent:mle-reviewer` where the target supports agents.
| SWE surface | MLE use |
|---|---|
| `product-capability` / `architecture-decision-records` | Turn model work into explicit product contracts and record irreversible data, model, and rollout choices |
| `repo-scan` / `codebase-onboarding` / `code-tour` | Find existing training, feature, serving, eval, and monitoring paths before introducing a parallel ML stack |
| `plan` / `feature-dev` | Scope model changes as product capabilities with data, eval, serving, and rollback phases |
| `tdd-workflow` / `python-testing` | Test feature transforms, split logic, metric calculations, artifact loading, and inference schemas before implementation |
| `code-reviewer` / `mle-reviewer` | Review code quality plus ML-specific leakage, reproducibility, promotion, and monitoring risks |
| `build-fix` / `pr-test-analyzer` | Diagnose broken CI, flaky evals, missing fixtures, and environment-specific model or dependency failures |
| `quality-gate` / `test-coverage` | Require automated evidence for transforms, metrics, inference contracts, promotion gates, and rollback behavior |
| `eval-harness` / `verification-loop` | Turn offline metrics, slice checks, latency budgets, and rollback drills into repeatable gates |
| `ai-regression-testing` | Preserve every production bug as a regression: missing feature, stale label, bad artifact, schema drift, or serving mismatch |
| `database-migrations` / `postgres-patterns` / `clickhouse-io` | Version labels, feature snapshots, prediction logs, experiment metrics, and drift analytics |
| `deployment-patterns` / `docker-patterns` | Package reproducible training and serving images with health checks, resource limits, and rollback |
| `canary-watch` / `dashboard-builder` | Make rollout health visible with model-version, slice, drift, latency, cost, and delayed-label dashboards |
| `security-review` / `security-scan` | Check model artifacts, notebooks, prompts, datasets, and logs for secrets, PII, unsafe deserialization, and supply-chain risk |
| `e2e-testing` / `browser-qa` / `accessibility` | Test critical product flows that consume predictions, including explainability and fallback UI states |
| `benchmark` / `performance-optimizer` | Measure throughput, p95 latency, memory, GPU utilization, and cost per prediction or retrain |
| `cost-aware-llm-pipeline` / `token-budget-advisor` | Route LLM/embedding workloads by quality, latency, and budget instead of defaulting to the largest model |
| `documentation-lookup` / `search-first` | Verify current library behavior for model serving, feature stores, vector DBs, and eval tooling before coding |
| `git-workflow` / `github-ops` / `opensource-pipeline` | Package MLE changes for review with crisp scope, generated artifacts excluded, and reproducible test evidence |
| `strategic-compact` / `dmux-workflows` | Split long ML work into parallel tracks: data contract, eval harness, serving path, monitoring, and docs |
Ten MLE Task Simulations
Use these simulations as coverage checks when planning or reviewing MLE work. A strong MLE workflow should reduce each task to explicit contracts, reusable SWE surfaces, automated evidence, and a reviewable artifact.
| ID | Common MLE task | Streamlined ECC path | Required output | Pipeline lanes covered |
|---|---|---|---|---|
| MLE-01 | Frame an ambiguous prediction, ranking, recommender, classifier, embedding, or forecast capability | `product-capability`, `plan`, `architecture-decision-records`, `mle-workflow` | Iteration Compact naming who cares, decision owner, success metric, unacceptable mistakes, assumptions, constraints, and first experiment | product contract, stakeholder loss, risk, rollout |
| MLE-02 | Define metric goals, labels, data sources, and the mistake budget | `repo-scan`, `database-reviewer`, `database-migrations`, `postgres-patterns`, `clickhouse-io` | Data and metric contract with entity grain, label timing, label confidence, feature timing, point-in-time joins, split policy, and dataset snapshot | data contract, metric design, leakage, reproducibility |
| MLE-03 | Build a baseline model and scoring path before adding complexity | `tdd-workflow`, `python-testing`, `python-patterns`, `code-reviewer` | Baseline scorer with confusion matrix, calibration notes, latency/cost estimate, known weaknesses, and tests for score shape and determinism | baseline, scoring, testing, serving parity |
| MLE-04 | Generate features from hypotheses about what separates outcomes | `python-patterns`, `pytorch-patterns`, `docker-patterns`, `deployment-patterns` | Feature plan and transform module covering signal source, missing values, outliers, correlations, leakage checks, and train/serve equivalence | feature pipeline, leakage, training, artifacts |
| MLE-05 | Tune thresholds, configs, and model complexity under tradeoffs | `eval-harness`, `ai-regression-testing`, `quality-gate`, `test-coverage` | Threshold/config report comparing precision, recall, F1, AUC, calibration, group slices, latency, cost, complexity, and acceptable error classes | evaluation, threshold, promotion, regression |
| MLE-06 | Run error analysis and turn mistakes into the next experiment | `eval-harness`, `ai-regression-testing`, `mle-reviewer`, `silent-failure-hunter` | Error cluster report for false positives, false negatives, ambiguous labels, stale features, missing signals, and bug traces with lessons captured | error analysis, bug trace, iteration, regression |
| MLE-07 | Package a model artifact for batch or online inference | `api-design`, `backend-patterns`, `security-review`, `security-scan` | Versioned artifact bundle with preprocessing, config, dependency constraints, schema validation, safe loading, and PII-safe logs | artifact, security, inference contract |
| MLE-08 | Ship online serving or batch scoring with feedback capture | `api-design`, `backend-patterns`, `e2e-testing`, `browser-qa`, `accessibility` | Prediction endpoint or batch job with response envelope, timeout, batching, fallback, model version, confidence, feedback logging, and product-flow tests | serving, batch inference, fallback, user workflow |
| MLE-09 | Roll out a model with shadow traffic, canary, A/B test, or rollback | `canary-watch`, `dashboard-builder`, `verification-loop`, `performance-optimizer` | Rollout plan naming traffic split, dashboards, p95 latency, cost, quality guardrails, rollback artifact, and rollback trigger | deployment, canary, rollback |
| MLE-10 | Operate, debug, and refresh a production model after launch | `silent-failure-hunter`, `dashboard-builder`, `mle-reviewer`, `doc-updater`, `github-ops` | Observation ledger and refresh plan with drift checks, delayed-label health, alert owners, runbook updates, retrain criteria, and PR evidence | monitoring, incident response, retraining |
Iteration Compact
Before touching model code, compress the work into one reviewable artifact. This should be short enough to fit in a PR description and precise enough that another engineer can challenge the tradeoffs.
```text
Goal:
Who cares:
Decision owner:
User or system action changed by the model:
Success metric:
Guardrail metrics:
Mistake budget:
Unacceptable mistakes:
Acceptable mistakes:
Assumptions:
Constraints:
Labels and data snapshot:
Baseline:
Candidate signals:
Threshold or config plan:
Eval slices:
Known risks:
Next experiment:
Rollback or fallback:
```
This compact is the MLE equivalent of a strong SWE design note. It keeps the team from optimizing a metric no one trusts, adding features that do not address the real error mode, or shipping complexity without a rollback.
Decision Brain
Use this loop whenever the task is ambiguous, high-impact, or metric-heavy:
Start from the decision, not the model. Name the action that changes downstream behavior.
Name who cares and why. Different stakeholders pay different costs for false positives, false negatives, latency, compute spend, opacity, or missed opportunities.
Convert ambiguity into hypotheses. Ask what signal would separate outcomes, what evidence would disprove it, and what simple baseline should be hard to beat.
Research prior art or a nearby known problem before inventing a bespoke system.
Score choices with `(probability, confidence) x (cost, severity, importance, impact)`.
Consider adversarial behavior, incentives, selective disclosure, distribution shift, and feedback loops.
Prefer the simplest change that reduces the most important mistake. Simplicity is not laziness; it is a way to minimize blunders while preserving iteration speed.
Capture the decision, evidence, counterargument, and next reversible step.
Metric and Mistake Economics
Choose metrics from failure costs, not habit:
Use a confusion matrix early so the team can discuss concrete false positives and false negatives instead of abstract accuracy.
Favor precision when the cost of an incorrect positive decision dominates.
Favor recall when the cost of a missed positive dominates.
Use F1 only when the precision/recall tradeoff is genuinely balanced and explainable.
Use AUC or ranking metrics when ordering quality matters more than a single threshold.
Track latency, throughput, memory, and cost as first-class metrics because they shape feasible model complexity.
Compare against a baseline and the current production model before celebrating an offline gain.
Treat real-world feedback signals as delayed labels with bias, lag, and coverage gaps; do not treat them as ground truth without analysis.
Every metric choice should state which mistake it makes cheaper, which mistake it makes more likely, and who absorbs that cost.
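To make the mistake economics concrete, the sketch below picks an operating threshold by minimizing expected mistake cost on a validation set. The cost constants and threshold grid are placeholder assumptions, not values this skill prescribes:

```python
import numpy as np

# Placeholder failure costs: what one false positive / false negative costs the
# stakeholder who absorbs it. Both numbers are assumptions for illustration.
COST_FALSE_POSITIVE = 1.0
COST_FALSE_NEGATIVE = 8.0


def cheapest_threshold(y_true: np.ndarray, y_score: np.ndarray) -> tuple[float, float]:
    """Return (threshold, expected cost) that minimizes total mistake cost."""
    best_threshold, best_cost = 0.5, float("inf")
    for threshold in np.linspace(0.05, 0.95, 19):
        y_pred = y_score >= threshold
        false_positives = np.sum(y_pred & (y_true == 0))
        false_negatives = np.sum(~y_pred & (y_true == 1))
        cost = false_positives * COST_FALSE_POSITIVE + false_negatives * COST_FALSE_NEGATIVE
        if cost < best_cost:
            best_threshold, best_cost = float(threshold), float(cost)
    return best_threshold, best_cost
```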
Data and Feature Hypotheses
Features should come from a theory of separation:
Text, categorical fields, numeric histories, graph relationships, recency, frequency, and aggregates are candidate signal families, not automatic features.
For every feature family, state why it should separate outcomes and how it could leak future information.
For noisy labels, consider adjudication, label confidence, soft targets, or confidence weighting.
For class imbalance, compare weighted loss, resampling, threshold movement, and calibrated decision rules.
For missing values, decide whether absence is informative, imputable, or a reason to abstain.
For outliers, decide whether to clip, bucket, investigate, or preserve them as rare but important signal.
For correlated features, check whether they are redundant, unstable, or proxies for unavailable future state.
Do not add model complexity until error analysis shows that the baseline is failing for a reason additional signal or capacity can plausibly fix.
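For the missing-value decision in particular, one lightweight pattern is to encode absence explicitly and abstain when coverage is too thin. The column names and abstention cutoff below are illustrative assumptions, not project requirements:

```python
import pandas as pd

FEATURE_COLUMNS = ["days_since_last_order", "avg_ticket_size", "support_contacts_90d"]  # illustrative
MAX_MISSING_FRACTION = 0.5  # assumption: abstain when over half the features are absent


def add_missingness_signals(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with explicit is-missing indicators and an abstain flag."""
    enriched = frame.copy()
    for column in FEATURE_COLUMNS:
        enriched[f"{column}_is_missing"] = enriched[column].isna().astype(int)
    missing_fraction = enriched[FEATURE_COLUMNS].isna().mean(axis=1)
    enriched["abstain"] = missing_fraction > MAX_MISSING_FRACTION
    return enriched
```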
Error Analysis Loop
After each baseline, training run, threshold change, or config change:
Split mistakes into false positives, false negatives, abstentions, low-confidence cases, and system failures.
Cluster errors by shared traits: language, entity type, source, time, geography, device, sparsity, recency, feature freshness, label source, or model version.
Separate model mistakes from data bugs, label ambiguity, product ambiguity, instrumentation gaps, and serving mismatches.
Trace each major cluster to one of four moves: better labels, better features, better threshold/config, or better product fallback.
Preserve every important mistake as a regression test, eval slice, dashboard panel, or runbook entry.
Write the next iteration as a falsifiable experiment, not a vague "improve model" task.
The strongest MLE loop is not train -> metric -> ship. It is mistake -> cluster -> hypothesis -> experiment -> evidence -> simpler system.
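A minimal sketch of the clustering step, assuming scored examples live in a pandas DataFrame; the `label`, `predicted`, and slice column names are illustrative, not a required schema:

```python
import pandas as pd

SLICE_COLUMNS = ["language", "source", "model_version"]  # illustrative trait columns


def error_clusters(scored: pd.DataFrame) -> pd.DataFrame:
    """Group mistakes by shared traits and rank clusters by error volume and rate."""
    scored = scored.assign(is_error=scored["label"] != scored["predicted"])
    summary = (
        scored.groupby(SLICE_COLUMNS)
        .agg(examples=("is_error", "size"), errors=("is_error", "sum"))
        .reset_index()
    )
    summary["error_rate"] = summary["errors"] / summary["examples"]
    return summary.sort_values(["errors", "error_rate"], ascending=False)
```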
Observation Ledger
Keep a compact decision and evidence trail beside the code, PR, experiment report, or runbook.
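One minimal shape for that ledger, with illustrative field names rather than a required format:

```text
Date / run ID:
Change under test:
Hypothesis:
Evidence (metrics, slices, examples):
Decision:
Counterargument considered:
Next reversible step:
Owner:
```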
Use the ledger to make model work cumulative. The goal is for each iteration to make the next decision easier, not merely to produce another artifact.
Core Workflow
1. Define the Prediction Contract
Capture the product-level contract before writing model code:
Prediction target and decision owner
Input entity, output schema, confidence/calibration fields, and allowed latency
Batch, online, streaming, or hybrid serving mode
Fallback behavior when the model, feature store, or dependency is unavailable
Human review or override path for high-impact decisions
Privacy, retention, and audit requirements for inputs, predictions, and labels
Do not accept "improve the model" as a requirement. Tie the model to an observable product behavior and a measurable acceptance gate.
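One way to keep this contract reviewable is to carry it beside the code as a typed object. The sketch below is illustrative: the field names and example values are assumptions to replace with project-native terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictionContract:
    """Product-level contract for one model-backed decision."""

    target: str                  # what is predicted
    decision_owner: str          # who acts on the prediction
    serving_mode: str            # "batch", "online", "streaming", or "hybrid"
    p95_latency_budget_ms: int   # allowed latency for online modes
    fallback: str                # behavior when the model or features are unavailable
    human_review: bool           # whether high-impact decisions get an override path
    retention_days: int          # how long inputs, predictions, and labels are kept


# Illustrative instance; every value here is an assumption.
CHURN_CONTRACT = PredictionContract(
    target="30-day churn probability",
    decision_owner="lifecycle-marketing",
    serving_mode="online",
    p95_latency_budget_ms=150,
    fallback="serve a rules-based default score",
    human_review=False,
    retention_days=90,
)
```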
2. Lock the Data Contract
Every ML task needs an explicit data contract:
Entity grain and primary key
Label definition, label timestamp, and label availability delay
Feature timestamp, freshness SLA, and point-in-time join rules
Train, validation, test, and backtest split policy
Required columns, allowed nulls, ranges, categories, and units
PII or sensitive fields that must not enter training artifacts or logs
Dataset version or snapshot ID for reproducibility
Guard against leakage first. If a feature is not available at prediction time, or is joined using future information, remove it or move it to an analysis-only path.
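Leakage guards can be mechanical. A minimal sketch, assuming each training example carries both a feature timestamp and a prediction timestamp; the column names are illustrative:

```python
import pandas as pd


def assert_point_in_time_safe(
    examples: pd.DataFrame,
    feature_time_col: str = "feature_ts",
    prediction_time_col: str = "prediction_ts",
) -> None:
    """Fail fast when any joined feature is newer than the moment of prediction."""
    leaked = examples[examples[feature_time_col] > examples[prediction_time_col]]
    if not leaked.empty:
        raise ValueError(
            f"{len(leaked)} rows use features computed after prediction time; "
            "move them to an analysis-only path or fix the join."
        )
```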
3. Build a Reproducible Pipeline
Training code should be runnable by another engineer without hidden notebook state:
Use typed config files or dataclasses for all hyperparameters and paths
Pin package and model dependencies
Set random seeds and document any nondeterministic GPU behavior
Record dataset version, code SHA, config hash, metrics, and artifact URI
Save preprocessing logic with the model artifact, not separately in a notebook
Keep train, eval, and inference transformations shared or generated from one source
Make every step idempotent so retries do not corrupt artifacts or metrics
Prefer immutable values and pure transformation functions. Avoid mutating shared data frames or global config during feature generation.
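One way to make the run record concrete is a small manifest written beside each artifact. This is a sketch under the assumption that no experiment tracker is already in place; the file name and helper are illustrative:

```python
import hashlib
import json
from pathlib import Path


def write_run_manifest(
    run_dir: Path,
    dataset_version: str,
    code_sha: str,
    config: dict,
    metrics: dict[str, float],
    artifact_uri: str,
    seed: int,
) -> Path:
    """Record everything needed to reproduce or audit this training run."""
    config_hash = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()
    manifest = {
        "dataset_version": dataset_version,
        "code_sha": code_sha,
        "config": config,
        "config_hash": config_hash,
        "metrics": metrics,
        "artifact_uri": artifact_uri,
        "seed": seed,
    }
    path = run_dir / "run_manifest.json"
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return path
```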
4. Evaluate and Gate Promotion
Promotion criteria should be declared before training finishes:
Baseline model and current production model comparison
Primary metric aligned to product behavior
Guardrail metrics for latency, calibration, fairness slices, cost, and error concentration
Slice metrics for important cohorts, geographies, devices, languages, or data sources
Confidence intervals or repeated-run variance when metrics are noisy
Failure examples reviewed by a human for high-impact models
Explicit "do not ship" thresholds
```python
PROMOTION_GATES = {
    "auc": ("min", 0.82),
    "calibration_error": ("max", 0.04),
    "p95_latency_ms": ("max", 80),
}


def assert_promotion_ready(metrics: dict[str, float]) -> None:
    missing = sorted(name for name in PROMOTION_GATES if name not in metrics)
    if missing:
        raise ValueError(f"Model promotion metrics missing required gates: {missing}")
    failures = {
        name: value
        for name, (direction, threshold) in PROMOTION_GATES.items()
        for value in [metrics[name]]
        if (direction == "min" and value < threshold)
        or (direction == "max" and value > threshold)
    }
    if failures:
        raise ValueError(f"Model failed promotion gates: {failures}")
```
Use offline metrics as gates, not guarantees. When the model changes product behavior, plan shadow evaluation, canary rollout, or A/B testing before full rollout.
5. Package for Serving
An ML artifact is production-ready only when the serving contract is testable:
Model artifact includes version, training data reference, config, and preprocessing
Input schema rejects invalid, stale, or out-of-range features
Output schema includes model version and confidence or explanation fields when useful
Serving path has timeout, batching, resource limits, and fallback behavior
CPU/GPU requirements are explicit and tested
Prediction logs avoid PII and include enough identifiers for debugging and label joins
Integration tests cover missing features, stale features, bad types, empty batches, and fallback path
Never let training-only feature code diverge from serving feature code without a test that proves equivalence.
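The equivalence requirement is directly testable. A sketch assuming the training and serving transforms are importable functions; `training_transform`, `serving_transform`, the module paths, and the sample columns are hypothetical names to replace with the project's real entry points:

```python
import pandas as pd
import pandas.testing as pdt

# Hypothetical imports; point these at the project's real transform entry points.
from features.training import training_transform
from features.serving import serving_transform


def test_training_and_serving_transforms_match() -> None:
    """The same raw rows must produce identical feature vectors in both paths."""
    raw = pd.DataFrame(
        {
            "user_id": [1, 2, 3],
            "days_since_last_order": [3, None, 41],
            "avg_ticket_size": [12.5, 80.0, 7.25],
        }
    )
    pdt.assert_frame_equal(
        training_transform(raw).sort_index(axis=1),
        serving_transform(raw).sort_index(axis=1),
        check_dtype=True,
    )
```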
6. Operate the Model
Model monitoring needs both system and quality signals:
Availability, error rate, timeout rate, queue depth, and p50/p95/p99 latency
Feature null rate, range drift, categorical drift, and freshness drift
Prediction distribution drift and confidence distribution drift
Label arrival health and delayed quality metrics
Business KPI guardrails and rollback triggers
Per-version dashboards for canaries and rollbacks
Every deployment should have a rollback plan that names the previous artifact, config, data dependency, and traffic-switch mechanism.
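Feature and prediction drift checks can start small. A minimal sketch using the population stability index; the bin count, epsilon, and alert threshold are illustrative defaults, not values this skill prescribes:

```python
import numpy as np

PSI_ALERT_THRESHOLD = 0.2  # illustrative cutoff, commonly treated as "investigate"


def population_stability_index(
    baseline: np.ndarray, current: np.ndarray, bins: int = 10, eps: float = 1e-6
) -> float:
    """Compare a feature's current distribution against its training-time baseline."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    expected, _ = np.histogram(baseline, bins=edges)
    actual, _ = np.histogram(current, bins=edges)
    expected_pct = expected / max(expected.sum(), 1) + eps
    actual_pct = actual / max(actual.sum(), 1) + eps
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


def drifted(baseline: np.ndarray, current: np.ndarray) -> bool:
    """Flag a feature whose drift score exceeds the alert threshold."""
    return population_stability_index(baseline, current) > PSI_ALERT_THRESHOLD
```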
Review Checklist
Prediction contract is explicit and testable
Data contract defines entity grain, label timing, feature timing, and snapshot/version
Leakage risks were checked against prediction-time availability
Training is reproducible from code, config, data version, and seed
Metrics compare against baseline and current production model
Slice metrics and guardrails are included for high-risk cohorts
Promotion gates are automated and fail closed
Training and serving transformations are shared or equivalence-tested
Model artifact carries version, config, dataset reference, and preprocessing
Serving path validates inputs and has timeout, fallback, and rollback behavior
Monitoring covers system health, feature drift, prediction drift, and delayed labels
Sensitive data is excluded from artifacts, logs, prompts, and examples
Anti-Patterns
Notebook state is required to reproduce the model
Random split leaks future data into validation or test sets
Feature joins ignore event time and label availability
Offline metric improves while important slices regress
Thresholds are tuned on the test set repeatedly
Training preprocessing is copied manually into serving code
Model version is missing from prediction logs
Monitoring only checks service uptime, not data or prediction quality
Rollback requires retraining instead of switching to a known-good artifact
Output Expectations
When using this skill, return concrete artifacts: data contract, promotion gates, pipeline steps, test plan, deployment plan, or review findings. Call out unknowns that block production readiness instead of filling them with assumptions.