Owns the smoke test contract for an ML experiment: a small, diagnostic-by-construction pytest test that fits the experiment's learner on a portion of the real `data/` source and predicts on a *disjoint* portion that deliberately carries **no pre-history buffer**. The assertion is structural: the number of predictions must equal the number of rows in the predict grid. A pipeline that loads, then builds features, then splits will silently drop the cold-start rows of the predict slice, and the test will fail with a row-count mismatch; a pipeline that marks X early and references upstream history nodes from feature steps will pass trivially. The smoke test is the executable proof of the X-marker placement rule from `build-ml-pipeline`.

TRIGGER when: `test-ml-pipeline` has dispatched here to write the smoke test for an approved experiment; `pytest tests/smoke/` is failing on row count; the user asks "why is the smoke test failing?"; a pipeline edit in `build-ml-pipeline` needs an executable proof; an experiment script changes the pipeline shape and the matching smoke test needs revisiting.

SKIP when: the design note does not exist or is not yet approved (route to `iterate-ml-experiment`); the user is asking about a regression test or schema invariant (route to `regression-test-ml-pipeline` / `distribution-test-ml-pipeline` once those exist); the question is the *interpretation* of CV metrics, not predict-time correctness (route to `evaluate-ml-pipeline`).

HOW TO USE: read the matching experiment's `journal/NN_*.md` and `experiments/NN_*.py` first to understand the pipeline's source binding (which env-dict keys does `build_learner` expect?). Then construct two env-dicts from the **real `data/` source**, a train env and a predict env, such that the predict env carries *only the rows we want predictions for* and *no pre-history buffer*. The hard assertion is that the prediction count matches the predict-env row count exactly. The soft assertion is that the smoke set's MAE is within `3 × CV_mean` (or the task-appropriate analogue). **Do not write the design note or run CV; those are other skills' jobs.**
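To make the failure mode concrete, here is a minimal, library-free sketch in plain Python. It is not the skrub pipeline itself: the helper `lag1_rows` and the toy series are hypothetical, and the "drop rows without a lag" step merely stands in for whatever null-dropping a real feature step performs. It computes a lag-1 feature on the predict slice in two ways, once per slice and once against upstream history.

```python
# Toy illustration only: hypothetical helper, no skrub/sklearn involved.

def lag1_rows(values, history=()):
    """Return (lag, value) pairs, dropping rows whose lag is unavailable.

    `history` stands in for an upstream history node: values that precede
    `values` in time and can supply the lag for the first row.
    """
    previous = history[-1] if history else None
    rows = []
    for value in values:
        if previous is not None:  # rows without a lag are dropped
            rows.append((previous, value))
        previous = value
    return rows


series = [10, 12, 11, 13, 14, 15]
train, predict = series[:4], series[4:]  # disjoint slices, no pre-history buffer

# Feature computed per slice: the first predict row has no lag and is lost.
per_slice = lag1_rows(predict)

# Feature computed against upstream history: every predict row survives.
with_history = lag1_rows(predict, history=train)

print(len(predict), len(per_slice), len(with_history))  # 2 1 2
```

The per-slice variant returns one row fewer than the predict grid, which is exactly the mismatch the hard assertion is designed to catch.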
npx skill4agent add probabl-ai/skills smoke-test-ml-pipeline

Pre-flight (smoke-test-ml-pipeline):
- [ ] Tier 1 mandatory libs importable: pytest + sklearn + skrub
      (per `data-science-python-stack` § "Tier 1"). **Not skore**:
      see the Stop conditions; the smoke test is intentionally
      portable to any skrub-capable environment
- [ ] Skill(python-api) consulted for skrub / sklearn symbols used in
      the test: <symbols, or "none">
      Evidence: Read scratch/api/<lib>/<version>/<topic>.md (this turn)
              | Write scratch/api/<lib>/<version>/<topic>.md (this turn)
              | "n/a: test only uses symbols already present in
                 src/<pkg>/ (build_learner / load_training_table / etc.)"
      "Read python-api SKILL.md" alone is NOT evidence.
- [ ] `journal/NN_<short_name>.md` read this turn (frozen sections:
      Question, Method) so the test asserts what the experiment claims
- [ ] `experiments/NN_<short_name>.py` skimmed this turn for the env-dict
      keys `build_learner` consumes (`data_dir` / `start` + `end` /
      `raw_frame` / etc.)
- [ ] `src/<pkg>/data.py` skimmed this turn for the loader signature
      (so the predict-env construction matches the loader's expectations)
- [ ] Test category & stem decided: `tests/smoke/test_NN_<short_name>.py`
- [ ] Predict-grid size decided: smallest window that still triggers
      the failure mode (default: a single horizon-length slice; for
      time series, the most recent N steps such that the target is
      *just* observable for assertion)
- [ ] Hard assertion wired: `len(predictions) == n_predict_grid_rows`
- [ ] Soft assertion wired (or explicitly skipped): smoke MAE within
      `3 × CV_MEAN_HARDCODED_FROM_PLAN` (or task-appropriate
      analogue). Value is a literal pulled from the matching
      `journal/NN_<short_name>.md` § Status.headline; the test does
      not import `skore` / read the project store at runtime.

assert len(predictions) == n_predict_grid_rows

smoke_mae = mean_absolute_error(y_true, predictions)
assert smoke_mae < 3 * cv_mae_mean, (
f"smoke MAE {smoke_mae:.0f} is more than 3× the CV mean "
f"({cv_mae_mean:.0f}); predictions may be NaN-poisoned even "
f"though the count matches."
)

| Binding shape | Predict env construction | Predict-grid row count |
|---|---|---|
| Directory of raw files | Write a tiny temp dir with the time-sliced raw files inside the test (use the | The row count of the supervised representation of the predict env (e.g. |
| Predict-grid + raw-history sources: the early-mark shape from | Build the in-memory | |
| Materialized | Hold out a small subset of rows from the materialized | |
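As an illustration of the time-sliced construction described in the table, the following sketch splits a toy set of dated rows into a train env and a predict env with no overlap and no history padding. Everything here is an assumption for demonstration: the `split_envs` helper, the toy rows, the dates, and the `"raw_frame"` key (borrowed from the pre-flight list as one possible env-dict key). A real fixture would read from `data/` and use whatever keys `build_learner` actually consumes.

```python
from datetime import date, timedelta

# Hypothetical raw rows: (day, target). A real test would read these from data/.
rows = [(date(2024, 1, 1) + timedelta(days=i), float(i)) for i in range(10)]


def split_envs(rows, predict_start, predict_end):
    """Split rows into disjoint train / predict envs with no history buffer."""
    train_rows = [r for r in rows if r[0] < predict_start]
    predict_rows = [r for r in rows if predict_start <= r[0] < predict_end]
    # "raw_frame" is a placeholder key; use whatever build_learner expects.
    train_env = {"raw_frame": train_rows}
    # Targets are masked in the predict env and kept aside as y_true.
    predict_env = {"raw_frame": [(day, None) for day, _ in predict_rows]}
    y_true = [target for _, target in predict_rows]
    return train_env, predict_env, len(predict_rows), y_true


train_env, predict_env, n_predict_grid_rows, y_true = split_envs(
    rows, predict_start=date(2024, 1, 8), predict_end=date(2024, 1, 11)
)
print(len(train_env["raw_frame"]), n_predict_grid_rows, y_true)
```

The predict env deliberately starts at `predict_start` rather than earlier, so any feature that needs history must obtain it from the fitted pipeline and not from padding rows in the predict slice.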
"""Smoke test for `experiments/NN_<short_name>.py`."""

# stdlib + numpy first

import pytest
from sklearn.metrics import mean_absolute_error

from <pkg> import PROJECT_ROOT
from <pkg>.pipeline import build_learner
# additional imports per the experiment's binding shape

DATA_DIR = PROJECT_ROOT / "data"

# Literal copied from `journal/NN_<short_name>.md` § Status.headline.
CV_MAE_MEAN = <literal>


@pytest.fixture
def train_predict_envs(tmp_path):
    """Build a (train_env, predict_env, n_predict_grid_rows, y_true) tuple.

    Diagnostic by construction: predict_env carries only the
    rows we want predictions for, with no pre-history padding.
    """
    # ... per-experiment fixture construction ...
    return train_env, predict_env, n_predict_grid_rows, y_true


def test_NN_<short_name>(train_predict_envs):
    """Predict-time replay must produce one prediction per predict-grid row."""
    train_env, predict_env, n_predict_grid_rows, y_true = train_predict_envs
    learner = build_learner()
    learner.fit(train_env)
    predictions = learner.predict(predict_env)

    # HARD: structural correctness.
    assert len(predictions) == n_predict_grid_rows, (
        f"got {len(predictions)} predictions for "
        f"{n_predict_grid_rows} predict-grid rows: pipeline is "
        f"dropping cold-start rows; check `mark_as_X` placement "
        f"and that history-dependent features reference an "
        f"upstream history node, not a per-slice computation."
    )

    # SOFT: predictions are not NaN-poisoned.
    smoke_mae = mean_absolute_error(y_true, predictions)
    # CV_MAE_MEAN is hardcoded at the top of the file from
    # `journal/NN_<short_name>.md` § Status.headline. The smoke test
    # uses only the predicting package's API (skrub/sklearn),
    # with no skore import, so it runs anywhere skrub does.
    assert smoke_mae < 3 * CV_MAE_MEAN, (
        f"smoke MAE {smoke_mae:.0f} > 3 × CV mean "
        f"({CV_MAE_MEAN:.0f}); predictions may be NaN-poisoned."
    )
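The soft assertion in the template earns its place because a matching row count says nothing about the values themselves. The sketch below is a stdlib-only illustration with made-up numbers: `plain_mae` is a hypothetical stand-in for the sklearn metric, and `CV_MAE_MEAN = 2.0` is an invented literal. It shows three prediction vectors that all pass the count check while only one passes the `3 × CV mean` bound.

```python
import math


def plain_mae(y_true, y_pred):
    """Plain-Python mean absolute error (stand-in for the sklearn metric)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


CV_MAE_MEAN = 2.0  # hypothetical literal; the real value comes from the journal
y_true = [100.0, 102.0, 98.0, 101.0]

candidates = {
    "healthy": [101.0, 100.0, 99.0, 103.0],         # plausible values
    "defaulted": [0.0, 0.0, 99.0, 103.0],           # cold-start rows filled with 0
    "poisoned": [math.nan, math.nan, 99.0, 103.0],  # NaN leaked into the output
}

results = {}
for name, predictions in candidates.items():
    hard_ok = len(predictions) == len(y_true)  # passes for all three
    # Any comparison against NaN is False, so a NaN MAE fails the bound.
    soft_ok = plain_mae(y_true, predictions) < 3 * CV_MAE_MEAN
    results[name] = (hard_ok, soft_ok)

print(results)
```

With a real metric implementation, NaN inputs may raise an error instead of returning NaN; either way the test does not pass silently.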