smoke-test-ml-pipeline

Smoke Test ML Pipeline

The minimal pytest that catches the "load → featurize → split" anti-pattern at iteration time, before it reaches production.

Stop conditions — read before anything else

- **No smoke test without an approved design note + script.** The pairing rule from `test-ml-pipeline` is hard: `tests/smoke/test_NN_<short_name>.py` exists only when `journal/NN_<short_name>.md` is at least `approved` and `experiments/NN_<short_name>.py` exists with the matching stem.
- **Symbol from memory is forbidden.** Any skrub / scikit-learn name you write in the smoke test must come from a `Skill(python-api)` call in this turn. The smoke test is a small file, but it imports the predicting-package API surface; the same memory-forbidden rule applies.
- **Don't shrink the assertion.** The hard assertion is exact row-count equality. Not "approximately equal", not "at least 80% of expected rows". A row-count mismatch is the failure mode the smoke test exists to catch. Loosening the assertion silently reintroduces the bug.
- **Don't synthesize the fixture.** The smoke test reads the real `data/` source. Synthetic fixtures look fine but skip the loaders that actually break in production.
- **No wrappers, no NaN-handling, no `eval_mode` hacks.** If the smoke test only passes after wrapping the predictor or conditioning on `eval_mode`, the pipeline is wrong. Route back to `build-ml-pipeline` and fix the X-marker placement. Wrappers paper over the failure mode; they don't solve it.
- **The smoke test uses only the predicting package's API.** For a `SkrubLearner` produced by `build-ml-pipeline`, that means skrub's `fit` / `predict` / (optionally `score`) plus `sklearn.metrics` for any metric the soft assertion uses. Do not import `skore` (or any other tracking / reporting library) in the test file. The smoke test must be runnable in any environment that can `import skrub` + `import sklearn`; the skore Project is a side artifact, not a test dependency. Soft-assertion baselines (CV-mean MAE, etc.) are hardcoded from the design note's Status.headline with a comment pointing to the design note; update by hand when the experiment's headline number changes.
- **Don't filter warnings.** No `@pytest.mark.filterwarnings(...)`, no `warnings.filterwarnings(...)` in the test body, no `filterwarnings = [...]` in `pytest.ini` / `pyproject.toml`, unless the user explicitly asks. See `python-code-style` § Stop conditions.
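
The pairing rule in the first stop condition can be checked mechanically before any test file is created. The sketch below is only an illustration: the helper names `paired_files_exist` and `smoke_test_path` are hypothetical, and the approval status is deliberately not checked because this document does not say how the status is recorded inside the design note.

```python
from pathlib import Path


def paired_files_exist(root: Path, stem: str) -> bool:
    """Return True when the design note and experiment script for `stem` both exist.

    `stem` is the shared `NN_<short_name>` part, for example "01_baseline".
    The note's approval status is NOT checked here (assumption: it has to be
    confirmed by reading the note, since its storage format is not specified).
    """
    note = root / "journal" / f"{stem}.md"
    script = root / "experiments" / f"{stem}.py"
    return note.is_file() and script.is_file()


def smoke_test_path(root: Path, stem: str) -> Path:
    """Return the path the smoke test must use so all three files share one stem."""
    return root / "tests" / "smoke" / f"test_{stem}.py"
```

A caller would only create `smoke_test_path(root, stem)` once `paired_files_exist(root, stem)` is true and the note has been read and confirmed as at least `approved`.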

Pre-flight — emit this checklist as visible text before any test code

Pre-flight (smoke-test-ml-pipeline):
- [ ] Tier 1 mandatory libs importable: pytest + sklearn + skrub
      (per `data-science-python-stack` § "Tier 1"). **Not skore** —
      see the Stop conditions; the smoke test is intentionally
      portable to any skrub-capable environment
- [ ] Skill(python-api) consulted for skrub / sklearn symbols used in
      the test: <symbols, or "none">
      Evidence: Read scratch/api/<lib>/<version>/<topic>.md (this turn)
                | Write scratch/api/<lib>/<version>/<topic>.md (this turn)
                | "n/a — test only uses symbols already present in
                  src/<pkg>/ (build_learner / load_training_table / etc.)"
      "Read python-api SKILL.md" alone is NOT evidence.
- [ ] `journal/NN_<short_name>.md` read this turn (frozen sections:
      Question, Method) so the test asserts what the experiment claims
- [ ] `experiments/NN_<short_name>.py` skimmed this turn for the env-dict
      keys `build_learner` consumes (`data_dir` / `start` + `end` /
      `raw_frame` / etc.)
- [ ] `src/<pkg>/data.py` skimmed this turn for the loader signature
      (so the predict-env construction matches the loader's expectations)
- [ ] Test category & stem decided: `tests/smoke/test_NN_<short_name>.py`
- [ ] Predict-grid size decided: smallest window that still triggers
      the failure mode (default: a single horizon-length slice; for
      time series, the most recent N steps such that the target is
      *just* observable for assertion)
- [ ] Hard assertion wired: `len(predictions) == n_predict_grid_rows`
- [ ] Soft assertion wired (or explicitly skipped): smoke MAE within
      `3 × CV_MEAN_HARDCODED_FROM_PLAN` (or task-appropriate
      analogue). Value is a literal pulled from the matching
      `journal/NN_<short_name>.md` § Status.headline; the test does
      not import `skore` / read the project store at runtime.

What the smoke test asserts

Two assertions, two severities:

Hard — the row-count check

```python
assert len(predictions) == n_predict_grid_rows
```

This is the structural-correctness assertion. It is a binary pass/fail, and it is the whole point of the smoke test. A correctly built pipeline (per `build-ml-pipeline`'s X-marker rule) satisfies this trivially. A pipeline that loads, then featurizes, then splits will fail it, because predict-time featurization on the predict env runs with no pre-history buffer and silently drops cold-start rows.

`n_predict_grid_rows` is the count of rows the predict env claims to want predictions for, typically the number of target-time rows in the predict-time grid. If the pipeline's source binding is a directory of raw files, it's the row count of the supervised frame derived from the predict env at predict time (obtainable via `build_supervised_frame(predict_dir)`).

Soft — the metric-vs-CV gap

```python
from sklearn.metrics import mean_absolute_error

smoke_mae = mean_absolute_error(y_true, predictions)
assert smoke_mae < 3 * cv_mae_mean, (
    f"smoke MAE {smoke_mae:.0f} is more than 3× the CV mean "
    f"({cv_mae_mean:.0f}); predictions may be NaN-poisoned even "
    f"though the count matches."
)
```

The metric gap catches the second-order failure mode: the prediction count is right, but the values are garbage because some features are NaN at predict time (e.g. an encoder hasn't seen a new category, or a lag is null because the upstream history reference wasn't wired correctly). The `3×` bound is a starting heuristic; adjust per task. The smoke window is a single seasonal slice, so the bound has to be loose enough that a legitimate hard-season window doesn't trip it.

The soft assertion is opt-out, not opt-in: skip it only if the task has no obvious metric-vs-CV comparator (e.g. the smoke fixture deliberately has no ground truth). If you skip it, leave a comment on why in the test file.
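
To make the "count is right, values are garbage" case concrete, here is a toy sketch. It uses a plain-Python MAE as a stand-in for `sklearn.metrics.mean_absolute_error` so that it runs without dependencies; the numbers, the `within_soft_bound` helper, and the `CV_MAE_MEAN` value are all invented for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Plain-Python MAE; a stand-in for sklearn.metrics.mean_absolute_error."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_soft_bound(y_true, y_pred, cv_mae_mean, factor=3.0):
    """Return True when the smoke MAE stays under factor * cv_mae_mean."""
    return mean_absolute_error(y_true, y_pred) < factor * cv_mae_mean


CV_MAE_MEAN = 50.0  # made-up baseline, for illustration only

y_true = [1000.0, 1100.0, 1250.0, 1200.0]
healthy = [980.0, 1130.0, 1210.0, 1240.0]   # absolute errors 20, 30, 40, 40
degenerate = [600.0, 600.0, 600.0, 600.0]   # same count, flat fallback values

print(within_soft_bound(y_true, healthy, CV_MAE_MEAN))     # True  (MAE 32.5 < 150)
print(within_soft_bound(y_true, degenerate, CV_MAE_MEAN))  # False (MAE 537.5 >= 150)
```

Both prediction lists pass the hard row-count check; only the soft bound separates them.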

The diagnostic-by-construction property

The fixture is built specifically to fail on the buggy shape and pass on the correct one. This is the single most important property of the smoke test; if you take the fixture construction shortcut and it doesn't have this property, the test is worthless.

Concretely, the predict-time env-dict carries only the rows we want predictions for, with no pre-history buffer beyond what predict-time-known features absolutely require. Two consequences:

- **Late-`mark_as_X` pipeline:** features are computed inside the graph from the predict env's data alone. Backward lags / rolling windows / target shifts have NaN at the cold-start rows. The pre-marker `drop_nulls` (or the model's NaN intolerance) drops those rows, so `len(predictions) < n_predict_grid_rows`. Test fails.
- **Early-`mark_as_X` pipeline:** the marker lands on the predict-grid node (Layer 2 of `build-ml-pipeline`'s rule 2); history-dependent features take the upstream history DataOp as an additional `apply_func` argument. At predict time, the history node resolves to the full available history (bound from the same source the train env uses), and the join in each feature step produces real values for every row in the predict grid, so `len(predictions) == n_predict_grid_rows`. Test passes.
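
The two outcomes above can be reproduced with a toy model in plain Python. This is not skrub code: dictionaries stand in for the data, `LAG` is a hypothetical lag length, and `late_mark_rows` / `early_mark_rows` are invented names that only mimic the two pipeline shapes.

```python
LAG = 3  # hypothetical lag, in steps

# Toy history: step index -> observed value. Steps 0..19 are all available data.
history = {step: float(step * 10) for step in range(20)}
predict_grid = [16, 17, 18, 19]  # the rows we want predictions for


def late_mark_rows(grid, source):
    """Slice first, then featurize: lags can only see rows inside the slice."""
    sliced = {step: source[step] for step in grid}
    rows = [(step, sliced.get(step - LAG)) for step in grid]
    # Mimics a drop_nulls step: rows whose lag is missing are discarded.
    return [(step, lag) for step, lag in rows if lag is not None]


def early_mark_rows(grid, source):
    """Keep the grid intact and look lags up in the full upstream history."""
    return [(step, source[step - LAG]) for step in grid]


late = late_mark_rows(predict_grid, history)
early = early_mark_rows(predict_grid, history)
print(len(late), len(early), len(predict_grid))  # 1 4 4
```

With no pre-history padding, the late shape keeps only the one row whose lag falls inside the slice, while the early shape returns one row per grid entry.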

The two outcomes are deterministic. The smoke test cannot be "flaky" — if the row count is off by one, the pipeline is wrong.

For the predict-grid size: smallest is best. Use the smallest predict window that is still an honest predict-time grid. A single horizon-length slice (e.g. one day for a t+24 model) is enough to expose the failure; anything larger only hides it behind volume.

Fixture construction — `data/` is the source


The fixture reads from the real `data/` source, not from a synthetic generator and not from a checked-in fixture file. The loaders the experiment uses are the loaders the smoke test must exercise. Synthetic fixtures defeat the purpose.

Construction depends on the experiment's source binding (read `experiments/NN_*.py` to find out which env-dict keys `build_learner` consumes), but the shape is always the same:

1. Identify the predict-grid time bounds (`predict_start`, `predict_end`). For time series, use the most recent horizon-equivalent window of the data.
2. Identify the train env. The cleanest choice is **all data strictly before `predict_start - HORIZON`** (embargo equal to the forecast horizon). For tabular IID, just exclude the rows in the predict grid.
3. Build two env-dicts:
   - `train_env`: whatever shape the experiment uses for its fit binding, restricted to data before the embargo.
   - `predict_env`: the predict-grid description, with no additional history padding (this is the diagnostic property; if you pad, the test passes spuriously).
4. Compute `n_predict_grid_rows` independently of the prediction: the count comes from the supervised representation of the predict env (not from the prediction itself).
5. Compute `y_true` from the supervised representation of the predict env (the soft assertion's ground truth).
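
The steps above can be sketched for a time-series case using only the standard library. Everything concrete here is an assumption: a 24-hour `HORIZON`, hourly sampling, rows given as `(timestamp, target)` tuples, and the helper names `split_bounds` and `split_rows`. A real fixture would go through the experiment's own loader instead.

```python
from datetime import datetime, timedelta

HORIZON = timedelta(hours=24)  # hypothetical t+24 forecast horizon
STEP = timedelta(hours=1)      # hypothetical sampling interval


def split_bounds(last_timestamp, horizon=HORIZON, step=STEP):
    """Return (train_end_exclusive, predict_start, predict_end) for one horizon-length slice."""
    predict_end = last_timestamp
    predict_start = predict_end - horizon + step  # one horizon of rows, both ends inclusive
    train_end = predict_start - horizon           # embargo equal to the forecast horizon
    return train_end, predict_start, predict_end


def split_rows(rows, last_timestamp):
    """Split (timestamp, target) rows into train rows, the predict grid, and y_true."""
    train_end, predict_start, predict_end = split_bounds(last_timestamp)
    train_rows = [r for r in rows if r[0] < train_end]
    predict_rows = [r for r in rows if predict_start <= r[0] <= predict_end]
    predict_grid = [ts for ts, _ in predict_rows]
    y_true = [target for _, target in predict_rows]
    return train_rows, predict_grid, y_true


# Ten days of hourly toy data.
start = datetime(2024, 1, 1)
rows = [(start + i * STEP, float(i)) for i in range(240)]
train_rows, predict_grid, y_true = split_rows(rows, rows[-1][0])
n_predict_grid_rows = len(predict_grid)  # taken from the grid, never from predictions
print(len(train_rows), n_predict_grid_rows)  # 192 24
```

Note that `n_predict_grid_rows` and `y_true` are both derived before any model is fitted, which is what keeps the hard assertion independent of the prediction.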

The fixture must not write derived files to `data/holdout/`, `data/train/`, etc. Those are workspace-level artifacts owned by the project's setup script(s); the smoke test fixture is ephemeral. Use `tmp_path` (the pytest built-in temporary directory fixture) when the experiment's source binding requires on-disk inputs.
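
As a rough sketch of keeping the fixture ephemeral, the helper below copies a time slice of one raw file into a scratch directory and leaves the source untouched. It assumes the raw file is a CSV with an ISO-formatted `timestamp` column; the function name `write_time_slice` and that column name are hypothetical. Inside a pytest fixture, `out_dir` would be something like `tmp_path / "predict"`.

```python
import csv
from datetime import datetime
from pathlib import Path


def write_time_slice(source_csv: Path, out_dir: Path, start: datetime, end: datetime,
                     time_column: str = "timestamp") -> int:
    """Copy rows with start <= timestamp <= end into out_dir and return how many were written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / source_csv.name
    written = 0
    with source_csv.open(newline="") as src, target.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Only rows inside the predict window are copied; the source is read-only.
            if start <= datetime.fromisoformat(row[time_column]) <= end:
                writer.writerow(row)
                written += 1
    return written
```

The returned count is a convenient cross-check against the row count of the supervised representation, but it does not replace it when the loader reshapes rows.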

Three common source-binding shapes — the smoke fixture has to match whichever the experiment uses:

| Binding shape | Predict env construction | `n_predict_grid_rows` |
|---|---|---|
| **Directory of raw files**: `build_learner` binds a `data_dir`-style var; the loader globs / reads files from it. | Write a tiny temp dir with the time-sliced raw files inside the test (use the `tmp_path` pytest built-in). Bind it as `data_dir`. | The row count of the supervised representation of the predict env (e.g. `len(build_supervised_frame(predict_dir))` in load-forecasting), known a priori from the slice. |
| **Predict-grid + raw-history sources**: the early-mark shape from `build-ml-pipeline` rule 2: `predict_grid` plus `history_source` / `weather_source` / etc. as separate vars. | Build the in-memory `predict_grid` value (a list of timestamps, a panel-key grid, …) and the source identifiers. No file write needed. | `len(predict_grid)`. |
| **Materialized `(X, y)` IID**: `build_learner` binds `X` and `y` directly (or a single `data` env-dict mapping to `{"X": ..., "y": ...}`). | Hold out a small subset of rows from the materialized `X` (and the matching `y`) before fit; `train_env` gets the rest, `predict_env` gets the held-out subset. | `len(predict_subset)`. |

For the second shape (predict-grid + raw-history sources), the three layers (sources → predict-grid + alignment + `mark_as_X` → features after, with history as an upstream reference) are described in `build-ml-pipeline` § "Common patterns" rule 2, with a full worked example (drawn from this workspace's 01_baseline pipeline) in `python-api/references/pre_mark_alignment.md`. Read that reference before constructing the predict env for an early-mark pipeline.

IID flat-table problems — what the smoke test still buys you

For pipelines with no cross-row dependencies (per-row math, stateful encoders that learn at fit and apply per-row at predict, no lags / rolling / joins-with-history), the smoke test reduces to "fit on the train subset, predict on the held-out subset, assert `len(predictions) == len(predict_subset)`".

The diagnostic-by-construction property does not apply: there are no cross-row reaches for the test to break, so the hard assertion will pass on a correctly built pipeline and on a buggy one. What the smoke test still catches in the IID case:

- Loader bugs that drop or duplicate rows on a smaller input than CV used.
- Shape mismatches between `learner.predict(env)`'s output and the predict-env row count (e.g. an estimator that returns `(N, 2)` predictions when the test only checks `len(...)`).
- Accidental NaN-poisoning when an encoder has never seen a category present in the predict subset (the soft assertion on smoke-MAE-vs-CV-mean catches this; keep it on).

Treat the IID smoke test as a sanity check, not a CV replacement. The CV-replacement role is what the test plays for cross-row pipelines, where the diagnostic-by-construction property is the load-bearing guarantee.
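
For the materialized `(X, y)` case, the fixture reduces to a deterministic hold-out. The sketch below uses plain lists and assumes env-dict keys named `"X"` and `"y"`; the real keys are whatever `build_learner` consumes in the experiment script, and `holdout_split` is a hypothetical helper name. Holding out the last rows (instead of a random sample) is just a choice made here to keep the fixture deterministic.

```python
def holdout_split(X, y, n_holdout):
    """Hold out the last n_holdout rows.

    Returns (train_env, predict_env, n_predict_grid_rows, y_true).
    """
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    if not 0 < n_holdout < len(X):
        raise ValueError("n_holdout must leave at least one row on each side")
    cut = len(X) - n_holdout
    train_env = {"X": X[:cut], "y": y[:cut]}
    predict_env = {"X": X[cut:]}  # no target in the predict env
    return train_env, predict_env, len(X[cut:]), y[cut:]


X = [[i, i % 3] for i in range(10)]
y = [float(i) for i in range(10)]
train_env, predict_env, n_predict_grid_rows, y_true = holdout_split(X, y, n_holdout=2)
print(len(train_env["X"]), n_predict_grid_rows, y_true)  # 8 2 [8.0, 9.0]
```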

The standard pytest shape

One test function per smoke test file. The function name mirrors the experiment stem so pytest output is self-explanatory.

```python
"""Smoke test for `experiments/NN_<short_name>.py`."""

# stdlib + numpy first

import pytest

from <pkg> import PROJECT_ROOT
from <pkg>.pipeline import build_learner

# additional imports per the experiment's binding shape

DATA_DIR = PROJECT_ROOT / "data"


@pytest.fixture
def train_predict_envs(tmp_path):
    """Build a (train_env, predict_env, n_predict_grid_rows, y_true) tuple.

    Diagnostic by construction: predict_env carries only the
    rows we want predictions for, with no pre-history padding.
    """
    # ... per-experiment fixture construction ...
    return train_env, predict_env, n_predict_grid_rows, y_true


def test_NN_<short_name>(train_predict_envs):
    """Predict-time replay must produce one prediction per predict-grid row."""
    train_env, predict_env, n_predict_grid_rows, y_true = train_predict_envs

    learner = build_learner()
    learner.fit(train_env)
    predictions = learner.predict(predict_env)

    # HARD: structural correctness.
    assert len(predictions) == n_predict_grid_rows, (
        f"got {len(predictions)} predictions for "
        f"{n_predict_grid_rows} predict-grid rows: pipeline is "
        f"dropping cold-start rows; check `mark_as_X` placement "
        f"and that history-dependent features reference an "
        f"upstream history node, not a per-slice computation."
    )

    # SOFT: predictions are not NaN-poisoned.
    from sklearn.metrics import mean_absolute_error
    smoke_mae = mean_absolute_error(y_true, predictions)
    # CV_MAE_MEAN is hardcoded at the top of the file from
    # `journal/NN_<short_name>.md` § Status.headline. The smoke test
    # uses only the predicting package's API (skrub/sklearn):
    # no skore import, so it runs anywhere skrub does.
    assert smoke_mae < 3 * CV_MAE_MEAN, (
        f"smoke MAE {smoke_mae:.0f} > 3 × CV mean "
        f"({CV_MAE_MEAN:.0f}); predictions may be NaN-poisoned."
    )
```

`tmp_path` is the pytest built-in for a per-test temporary directory; use it whenever the experiment's source binding requires on-disk inputs.

Failure semantics

A failing smoke test is a pipeline-shape problem, not a metric problem.

- **Hard-assertion failure (row count)** → the pipeline is broken. Re-enter `build-ml-pipeline`, audit the X-marker placement and the history-dependent feature steps. Don't tune the model; don't loosen the assertion; don't add a wrapper. Fix the shape.
- **Soft-assertion failure (metric way off)** → the predictions exist but are garbage on the smoke window. Most common cause: an upstream history node isn't being correctly resolved at predict time, so a lag column is silently NaN. Inspect `learner.skb.full_report()` and look for nodes whose value at predict time doesn't match what fit time saw.
- **Failure blocks `done` status.** `iterate-ml-experiment` § 4 refuses to flip an experiment to `done` until the matching smoke test passes. The CV report can land in the skore Project before the smoke test passes (CV is independent of predict-time binding), but the experiment row in `JOURNAL.md` stays `approved` until smoke passes.

What this skill does NOT do

- **Run pytest.** Test execution is the user's call (or CI's).
- **Write the design note or the experiment script.** Those are `iterate-ml-experiment` and `organize-ml-workspace` / `build-ml-pipeline`.
- **Touch the skore Project.** The smoke test does not call `project.put`; it's a pre-flight check, not a metric artifact. CV metrics come from `evaluate-ml-pipeline`.
- **Define what "good metrics" mean.** The hard assertion is structural; the soft assertion is a sanity bound, not a performance target. Performance judgment is the user's, per `iterate-ml-experiment`'s rule that the user judges results.

Companion skills

- **`test-ml-pipeline`**: the router that dispatched here. Owns layout and pairing.
- **`build-ml-pipeline`**: owns the X-marker placement rule the smoke test asserts. Smoke-test failure typically routes back here for a pipeline-shape fix.
- **`iterate-ml-experiment`**: owns the iteration loop. Requires the smoke test to pass before an experiment can flip to `done`.
- **`evaluate-ml-pipeline`**: owns CV. The smoke test fills the predict-time-binding gap CV doesn't cover. The soft assertion's CV-mean baseline is hardcoded in the smoke test from the matching design note's Status.headline (which `evaluate-ml-pipeline` ultimately fills in after the run); the test does not import skore at runtime.
- **`python-api`**: symbol references for the predicting-package APIs the smoke test uses. Consult before naming any imported function in the test body. `skore` is not a smoke-test dependency; see the "no skore import" Stop condition above. Cache hits first: check `scratch/api/<lib>/<version>/` before WebSearching; cache new findings back there (per `python-api` Shape 0/3).
- **`data-science-python-stack`**: declares pytest as a Tier 1 mandatory dependency for any workspace using this skill.
- **`python-code-style`**: must be invoked after writing or editing `tests/smoke/test_NN_*.py`. Running `pixi run ruff check` directly without invoking this skill silently drops the NumPyDoc docstring convention the stack expects: ruff's `D`-rules pass on a one-line summary, but only the skill body teaches the parameter-shape-in-type-slot and the section layout (`Parameters` / `Returns` / `Notes`) the test fixture + test function should use.
