smoke-test-ml-pipeline
Smoke Test ML Pipeline
The minimal pytest that catches the "load → featurize → split"
anti-pattern at iteration time, before it reaches production.
Stop conditions — read before anything else
- No smoke test without an approved design note + script. The pairing rule from `test-ml-pipeline` is hard: `tests/smoke/test_NN_<short_name>.py` exists only when `journal/NN_<short_name>.md` is at least `approved` and `experiments/NN_<short_name>.py` exists with the matching stem.
- Symbol from memory is forbidden. Any skrub / scikit-learn name you write in the smoke test must come from a `Skill(python-api)` call in this turn. The smoke test is a small file but it imports the predicting-package API surface; the same memory-forbidden rule applies.
- Don't shrink the assertion. The hard assertion is exact row-count equality. Not "approximately equal", not "at least 80% of expected rows". A row-count mismatch is the failure mode the smoke test exists to catch. Loosening the assertion silently reintroduces the bug.
- Don't synthesize the fixture. The smoke test reads the real `data/` source. Synthetic fixtures look fine but skip the loaders that actually break in production.
- No wrappers, no NaN-handling, no hacks. If the smoke test only passes after wrapping the predictor or conditioning on `eval_mode`, the pipeline is wrong. Route back to `build-ml-pipeline` and fix the X-marker placement. Wrappers paper over the failure mode; they don't solve it.
- The smoke test uses only the predicting package's API. For a `SkrubLearner` produced by `build-ml-pipeline` that means skrub's `fit` / `predict` / (optionally `score`) plus `sklearn.metrics` for any metric the soft assertion uses. Do not import `skore` (or any other tracking / reporting library) in the test file. The smoke test must be runnable in any environment that can `import skrub` + `import sklearn`; the skore Project is a side artifact, not a test dependency. Soft-assertion baselines (CV-mean MAE, etc.) are hardcoded from the design note's Status.headline with a comment pointing to the design note; update by hand when the experiment's headline number changes.
- Don't filter warnings. No `@pytest.mark.filterwarnings(...)`, no `warnings.filterwarnings(...)` in the test body, no `filterwarnings = [...]` in `pytest.ini` / `pyproject.toml`, unless the user explicitly asks. See `python-code-style` § Stop conditions.
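The pairing rule in the first stop condition can be checked mechanically before any test file is written. The sketch below is a minimal illustration, assuming the `journal/`, `experiments/` and `tests/smoke/` directories sit under one workspace root. `smoke_test_allowed` and `smoke_test_path` are hypothetical helper names, and the `approved` status check is deliberately left out because the journal's status format is not specified here.

```python
from pathlib import Path


def smoke_test_allowed(root: Path, stem: str) -> bool:
    """Return True when both paired files for `stem` exist under `root`.

    Only file existence is checked. Whether the design note has
    reached `approved` depends on the journal format, which this
    sketch does not assume, so that check is left to the caller.
    """
    design_note = root / "journal" / f"{stem}.md"
    experiment = root / "experiments" / f"{stem}.py"
    return design_note.is_file() and experiment.is_file()


def smoke_test_path(root: Path, stem: str) -> Path:
    """Path the smoke test must use so its stem matches the pairing."""
    return root / "tests" / "smoke" / f"test_{stem}.py"
```

A caller would still have to read the design note and confirm its status by hand before creating the file that `smoke_test_path` points to.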
Pre-flight — emit this checklist as visible text before any test code
Pre-flight (smoke-test-ml-pipeline):
- [ ] Tier 1 mandatory libs importable: pytest + sklearn + skrub
(per `data-science-python-stack` § "Tier 1"). **Not skore** —
see the Stop conditions; the smoke test is intentionally
portable to any skrub-capable environment
- [ ] Skill(python-api) consulted for skrub / sklearn symbols used in
the test: <symbols, or "none">
Evidence: Read scratch/api/<lib>/<version>/<topic>.md (this turn)
| Write scratch/api/<lib>/<version>/<topic>.md (this turn)
| "n/a — test only uses symbols already present in
src/<pkg>/ (build_learner / load_training_table / etc.)"
"Read python-api SKILL.md" alone is NOT evidence.
- [ ] `journal/NN_<short_name>.md` read this turn (frozen sections:
Question, Method) so the test asserts what the experiment claims
- [ ] `experiments/NN_<short_name>.py` skimmed this turn for the env-dict
keys `build_learner` consumes (`data_dir` / `start` + `end` /
`raw_frame` / etc.)
- [ ] `src/<pkg>/data.py` skimmed this turn for the loader signature
(so the predict-env construction matches the loader's expectations)
- [ ] Test category & stem decided: `tests/smoke/test_NN_<short_name>.py`
- [ ] Predict-grid size decided: smallest window that still triggers
the failure mode (default: a single horizon-length slice; for
time series, the most recent N steps such that the target is
*just* observable for assertion)
- [ ] Hard assertion wired: `len(predictions) == n_predict_grid_rows`
- [ ] Soft assertion wired (or explicitly skipped): smoke MAE within
`3 × CV_MEAN_HARDCODED_FROM_PLAN` (or task-appropriate
analogue). Value is a literal pulled from the matching
`journal/NN_<short_name>.md` § Status.headline; the test does
not import `skore` / read the project store at runtime.
What the smoke test asserts
Two assertions, two severities:
Hard: the row-count check

```python
assert len(predictions) == n_predict_grid_rows
```

This is the structural-correctness assertion. It is a binary pass/fail and it is the whole point of the smoke test. A correctly built pipeline (per `build-ml-pipeline`'s X-marker rule) satisfies this trivially. A pipeline that loads-then-features-then-splits will fail it because predict-time featurization on the predict env runs with no pre-history buffer and silently drops cold-start rows. `n_predict_grid_rows` comes from the supervised representation of the predict env, e.g. the row count of `build_supervised_frame(predict_dir)`.
Soft: the metric-vs-CV gap

```python
from sklearn.metrics import mean_absolute_error

smoke_mae = mean_absolute_error(y_true, predictions)
assert smoke_mae < 3 * cv_mae_mean, (
    f"smoke MAE {smoke_mae:.0f} is more than 3× the CV mean "
    f"({cv_mae_mean:.0f}); predictions may be NaN-poisoned even "
    f"though the count matches."
)
```

The metric gap catches the second-order failure mode: the prediction count is right, but the values are garbage because some features are NaN at predict time (e.g. an encoder hasn't seen a new category, or a lag is null because the upstream history reference wasn't wired correctly). The `3×` bound is a starting heuristic; adjust per task. The smoke window is a single seasonal slice, so the bound has to be loose enough that a legitimate hard-season window doesn't trip it.

The soft assertion is opt-out, not opt-in: skip it only if the task has no obvious metric-vs-CV comparator (e.g. the smoke fixture deliberately has no ground truth). If you skip it, leave a comment on why in the test file.
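To see why the count check alone is not enough, here is a small dependency-free illustration of the soft bound. It is a sketch under stated assumptions: `mae` and `within_soft_bound` are hypothetical helpers (the real test uses `sklearn.metrics`), and the numbers, including the baseline, are made up for the example.

```python
import math


def mae(y_true, y_pred):
    """Plain-Python mean absolute error, used only to keep the sketch stdlib-only."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_soft_bound(y_true, y_pred, cv_mae_mean, factor=3.0):
    """True when the smoke MAE is finite and below `factor` times the CV mean."""
    smoke_mae = mae(y_true, y_pred)
    return math.isfinite(smoke_mae) and smoke_mae < factor * cv_mae_mean


y_true = [100.0, 110.0, 120.0, 130.0]
healthy = [98.0, 113.0, 118.0, 133.0]   # MAE = 2.5
degenerate = [0.0, 0.0, 0.0, 0.0]       # MAE = 115.0: right count, garbage values
CV_MAE_MEAN = 3.0                        # hypothetical baseline, not a real result

# Both prediction lists satisfy the hard row-count check...
assert len(healthy) == len(degenerate) == len(y_true)
# ...but only the healthy one stays inside the 3x bound.
healthy_ok = within_soft_bound(y_true, healthy, CV_MAE_MEAN)
degenerate_ok = within_soft_bound(y_true, degenerate, CV_MAE_MEAN)
```

Here `healthy_ok` is true and `degenerate_ok` is false, which is the second-order failure the soft assertion exists to flag.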
The diagnostic-by-construction property
The fixture is built specifically to fail on the buggy shape and
pass on the correct one. This is the single most important
property of the smoke test; if you take the fixture construction
shortcut and it doesn't have this property, the test is worthless.
Concretely, the predict-time env-dict carries only the rows we
want predictions for, with no pre-history buffer beyond what
predict-time-known features absolutely require. Two consequences:
- Late-pipeline `mark_as_X`: features are computed inside the graph from the predict env's data alone. Backward lags / rolling windows / target shifts have NaN at the cold-start rows. The pre-marker `drop_nulls` (or the model's NaN intolerance) drops those rows. `len(predictions) < n_predict_grid_rows`. Test fails.
- Early-pipeline `mark_as_X`: the marker lands on the predict-grid node (Layer 2 of `build-ml-pipeline`'s rule 2); history-dependent features take the upstream history DataOp as an additional `apply_func` argument. At predict time, the history node resolves to the full available history (bound from the same source the train env uses), and the join in each feature step produces real values for every row in the predict grid. `len(predictions) == n_predict_grid_rows`. Test passes.
The two outcomes are deterministic. The smoke test cannot be
"flaky" — if the row count is off by one, the pipeline is wrong.
For the predict-grid size: smallest is best. Use the smallest
predict window that is still an honest predict-time grid. A
single horizon-length slice (e.g. one day for a t+24 model) is
enough to expose the failure; anything larger only hides it
behind volume.
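The two outcomes described above can be reproduced with a toy lag feature. This is a minimal sketch in plain Python, not the skrub graph itself: `lag_feature`, `late_mark_rows` and `early_mark_rows` are hypothetical names, and the series values are invented purely to show the row counts.

```python
def lag_feature(values, lag):
    """Backward lag computed from `values` alone; None where no history exists."""
    return [values[i - lag] if i >= lag else None for i in range(len(values))]


def late_mark_rows(predict_values, lag):
    """Late-mark shape: lag from the predict slice only, then drop nulls."""
    lagged = lag_feature(predict_values, lag)
    return [(v, prev) for v, prev in zip(predict_values, lagged) if prev is not None]


def early_mark_rows(history_values, predict_values, lag):
    """Early-mark shape: lag resolved against history plus the predict grid."""
    full = history_values + predict_values
    lagged = lag_feature(full, lag)[len(history_values):]
    return [(v, prev) for v, prev in zip(predict_values, lagged) if prev is not None]


history = [10, 11, 12, 13, 14, 15]   # hypothetical history available at predict time
predict_grid = [16, 17, 18, 19]      # the rows we want predictions for
LAG = 3

n_predict_grid_rows = len(predict_grid)
late = late_mark_rows(predict_grid, LAG)             # cold-start rows are dropped
early = early_mark_rows(history, predict_grid, LAG)  # every grid row survives
```

With a predict grid of four rows and a lag of three, the late-mark variant keeps a single row while the early-mark variant keeps all four, so only the latter would satisfy the hard assertion.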
Fixture construction: `data/` is the source
The fixture reads from the real `data/` source, not from a synthetic generator and not from a checked-in fixture file. The loaders the experiment uses are the loaders the smoke test must exercise. Synthetic fixtures defeat the purpose.
Construction depends on the experiment's source binding (read `experiments/NN_*.py` to find out which env-dict keys `build_learner` consumes), but the shape is always the same:
- Identify the predict-grid time bounds (`predict_start`, `predict_end`). For time series, the most recent horizon-equivalent window of the data.
- Identify the train env. The cleanest choice is all data strictly before `predict_start - HORIZON` (embargo equal to the forecast horizon). For tabular IID, just exclude the rows in the predict grid.
- Build two env-dicts:
  - `train_env`: whatever shape the experiment uses for its fit binding, restricted to data before the embargo.
  - `predict_env`: the predict-grid description, with no additional history padding (this is the diagnostic property; if you pad, the test passes spuriously).
- Compute `n_predict_grid_rows` independently of the prediction: the count comes from the supervised representation of the predict env (not from the prediction itself).
- Compute `y_true` from the supervised representation of the predict env (the soft assertion's ground truth).
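The steps above can be sketched on a toy in-memory table. This is an illustration under assumptions, not the experiment's real binding: the row layout (dicts with a `ts` key), the half-open predict window, the 24-hour `HORIZON` and the helper name `split_envs` are all hypothetical.

```python
from datetime import datetime, timedelta

HORIZON = timedelta(hours=24)  # hypothetical forecast horizon


def split_envs(rows, predict_start, predict_end, horizon=HORIZON):
    """Split timestamped rows into train rows and predict-grid rows.

    Train keeps rows strictly before `predict_start - horizon` (the
    embargo). The predict grid keeps rows in [predict_start, predict_end)
    and nothing else, so no history padding sneaks in.
    """
    embargo_cutoff = predict_start - horizon
    train_rows = [r for r in rows if r["ts"] < embargo_cutoff]
    predict_rows = [r for r in rows if predict_start <= r["ts"] < predict_end]
    return train_rows, predict_rows


# Ten days of hourly rows; the last day is the predict grid.
t0 = datetime(2024, 1, 1)
rows = [{"ts": t0 + timedelta(hours=h), "y": float(h)} for h in range(240)]
predict_end = t0 + timedelta(hours=240)
predict_start = predict_end - HORIZON

train_rows, predict_rows = split_envs(rows, predict_start, predict_end)
n_predict_grid_rows = len(predict_rows)   # counted before any prediction exists
y_true = [r["y"] for r in predict_rows]   # ground truth for the soft assertion
```

Counting `n_predict_grid_rows` from the predict rows, rather than from the model output, is what keeps the hard assertion from comparing the prediction with itself.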
The fixture must not write derived files to `data/holdout/`, `data/train/`, etc. Those are workspace-level artifacts owned by the project's setup script(s); the smoke test fixture is ephemeral. Use `tmp_path` (the pytest built-in temporary directory fixture) when the experiment's source binding requires on-disk inputs.
Three common source-binding shapes; the smoke fixture has to match whichever the experiment uses:
| Binding shape | Predict env construction | `n_predict_grid_rows` |
|---|---|---|
| Directory of raw files … | Write a tiny temp dir with the time-sliced raw files inside the test (use the …) | The row count of the supervised representation of the predict env (e.g. …) |
| Predict-grid + raw-history sources: the early-mark shape from … | Build the in-memory … | |
| Materialized … | Hold out a small subset of rows from the materialized … | |
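For the directory-of-raw-files shape, the temp-dir step might look like the sketch below. It assumes raw CSV files with a `ts` column of ISO-8601 strings, which is an assumption about the data, not something this document specifies; `write_sliced_raw_file` is a hypothetical helper. The point is only that the slice is written to a throwaway directory and never to `data/`.

```python
import csv
from pathlib import Path


def write_sliced_raw_file(source_csv: Path, dest_dir: Path, start: str, end: str) -> Path:
    """Copy rows with start <= ts < end from `source_csv` into `dest_dir`.

    Assumes a "ts" column holding ISO-8601 strings of one fixed format,
    which compare correctly as plain strings. Returns the written path.
    """
    dest = dest_dir / source_csv.name
    with source_csv.open(newline="") as src, dest.open("w", newline="") as out:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if start <= row["ts"] < end:
                writer.writerow(row)
    return dest
```

Inside a pytest fixture, `dest_dir` would be the `tmp_path` argument, so the sliced files disappear with the test and the source file under `data/` is only ever read.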
For the second shape (predict-grid + raw-history sources), the three layers (sources → predict-grid + alignment + `mark_as_X` → features after the mark, with history as an upstream reference) are described in `build-ml-pipeline` § "Common patterns" rule 2, with a full worked example (drawn from this workspace's 01_baseline pipeline) in `python-api/references/pre_mark_alignment.md`. Read that reference before constructing the predict env for an early-mark pipeline.
IID flat-table problems: what the smoke test still buys you
For pipelines with no cross-row dependencies (per-row math,
stateful encoders that learn at fit and apply per-row at
predict, no lags / rolling / joins-with-history), the smoke
test reduces to "fit on the train subset, predict on the
held-out subset, assert `len(predictions) == len(predict_subset)`".
The diagnostic-by-construction property does not apply: there are no cross-row reaches for the test to break, so the hard assertion will pass on a correctly-built pipeline and on a buggy one. What the smoke test still catches in the IID case:
- Loader bugs that drop or duplicate rows on a smaller input than CV used.
- Shape mismatches between `learner.predict(env)`'s output and the predict-env row count (e.g. an estimator that returns `(N, 2)` predictions when the test only checks `len(...)`).
- Accidental NaN-poisoning when an encoder has never seen a category present in the predict subset (the soft assertion on smoke-MAE-vs-CV-mean catches this; keep it on).
Treat the IID smoke test as a sanity check, not a
CV-replacement. The CV-replacement role is what the test
plays for cross-row pipelines, where the diagnostic-by-
construction property is the load-bearing guarantee.
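The `(N, 2)` case mentioned above is easy to miss because `len(...)` only looks at the outer dimension. The following sketch shows the gap with plain lists; `is_one_prediction_per_row` is a hypothetical helper, and with an array library the equivalent would be a comparison on the output's shape.

```python
from numbers import Real


def is_one_prediction_per_row(predictions, n_rows):
    """True when `predictions` has `n_rows` entries and each is a scalar.

    `len(...)` alone also accepts an (N, 2)-shaped output, because the
    outer length is still N; the scalar check closes that gap.
    """
    if len(predictions) != n_rows:
        return False
    return all(isinstance(p, Real) for p in predictions)


flat = [0.2, 0.7, 0.4]                              # one value per row
two_column = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]   # (N, 2): same len, wrong shape

flat_ok = is_one_prediction_per_row(flat, 3)
two_column_ok = is_one_prediction_per_row(two_column, 3)
```

Both inputs have length three, yet only `flat` passes, so a check of this kind tightens the row-count assertion instead of loosening it.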
The standard pytest shape
One test function per smoke test file. The function name mirrors the experiment stem so pytest output is self-explanatory.

```python
"""Smoke test for `experiments/NN_<short_name>.py`."""

# stdlib + numpy first

import pytest
from sklearn.metrics import mean_absolute_error

from <pkg> import PROJECT_ROOT
from <pkg>.pipeline import build_learner

# additional imports per the experiment's binding shape

DATA_DIR = PROJECT_ROOT / "data"

# Hardcoded from `journal/NN_<short_name>.md` § Status.headline;
# update by hand when the experiment's headline number changes.
CV_MAE_MEAN = <cv_mae_mean>


@pytest.fixture
def train_predict_envs(tmp_path):
    """Build a (train_env, predict_env, n_predict_grid_rows, y_true) tuple.

    Diagnostic by construction: predict_env carries only the
    rows we want predictions for, with no pre-history padding.
    """
    # ... per-experiment fixture construction ...
    return train_env, predict_env, n_predict_grid_rows, y_true


def test_NN_<short_name>(train_predict_envs):
    """Predict-time replay must produce one prediction per predict-grid row."""
    train_env, predict_env, n_predict_grid_rows, y_true = train_predict_envs

    learner = build_learner()
    learner.fit(train_env)
    predictions = learner.predict(predict_env)

    # HARD: structural correctness.
    assert len(predictions) == n_predict_grid_rows, (
        f"got {len(predictions)} predictions for "
        f"{n_predict_grid_rows} predict-grid rows; pipeline is "
        f"dropping cold-start rows; check `mark_as_X` placement "
        f"and that history-dependent features reference an "
        f"upstream history node, not a per-slice computation."
    )

    # SOFT: predictions are not NaN-poisoned.
    # The test uses only the predicting package's API (skrub/sklearn):
    # no skore import, so it runs anywhere skrub does.
    smoke_mae = mean_absolute_error(y_true, predictions)
    assert smoke_mae < 3 * CV_MAE_MEAN, (
        f"smoke MAE {smoke_mae:.0f} > 3 × CV mean "
        f"({CV_MAE_MEAN:.0f}); predictions may be NaN-poisoned."
    )
```

`tmp_path` is the pytest built-in for a per-test temporary directory; use it whenever the experiment's source binding requires on-disk inputs.
Failure semantics
A failing smoke test is a pipeline-shape problem, not a
metric problem.
- Hard-assertion failure (row count) → the pipeline is broken. Re-enter `build-ml-pipeline`, audit the X-marker placement and the history-dependent feature steps. Don't tune the model; don't loosen the assertion; don't add a wrapper. Fix the shape.
- Soft-assertion failure (metric way off) → the predictions exist but are garbage on the smoke window. Most common cause: an upstream history node isn't being correctly resolved at predict time, so a lag column is silently NaN. Inspect `learner.skb.full_report()` and look for nodes whose value at predict time doesn't match what fit time saw.
- Failure blocks `done` status. `iterate-ml-experiment` § 4 refuses to flip an experiment to `done` until the matching smoke test passes. The CV report can land in the skore Project before the smoke test passes (CV is independent of predict-time binding), but the experiment row in `JOURNAL.md` stays `approved` until smoke passes.
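When chasing a soft-assertion failure, it can help to measure which predict-time feature columns are silently empty. The sketch below is a generic helper, not part of skrub: it assumes you have already materialized the predict-time feature table as a list of dicts, and `null_fraction_by_column` and the column names are hypothetical. The report from `learner.skb.full_report()` remains the primary place to look.

```python
import math


def null_fraction_by_column(rows):
    """Fraction of missing values (None or NaN) per column.

    `rows` is a list of dicts sharing the same keys, an assumed layout
    for a materialized predict-time feature table.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    columns = rows[0].keys()
    return {
        col: sum(is_missing(r[col]) for r in rows) / len(rows)
        for col in columns
    }


predict_features = [
    {"hour": 0, "load_lag_24": float("nan")},
    {"hour": 1, "load_lag_24": float("nan")},
    {"hour": 2, "load_lag_24": 412.0},
    {"hour": 3, "load_lag_24": 405.5},
]
fractions = null_fraction_by_column(predict_features)
suspects = {col: frac for col, frac in fractions.items() if frac > 0}
```

A lag column that shows up in `suspects` at predict time but is fully populated at fit time points at an upstream history node that was not resolved.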
What this skill does NOT do
- Run pytest. Test execution is the user's call (or CI's).
- Write the design note or the experiment script. Those are `iterate-ml-experiment` and `organize-ml-workspace` / `build-ml-pipeline`.
- Touch the skore Project. The smoke test does not call `project.put`; it's a pre-flight check, not a metric artifact. CV metrics come from `evaluate-ml-pipeline`.
- Define what "good metrics" mean. The hard assertion is structural; the soft assertion is a sanity bound, not a performance target. Performance judgment is the user's, per `iterate-ml-experiment`'s rule that the user judges results.
Companion skills
- `test-ml-pipeline`: the router that dispatched here. Owns layout and pairing.
- `build-ml-pipeline`: owns the X-marker placement rule the smoke test asserts. Smoke-test failure typically routes back here for a pipeline-shape fix.
- `iterate-ml-experiment`: owns the iteration loop. Requires the smoke test to pass before an experiment can flip to `done`.
- `evaluate-ml-pipeline`: owns CV. The smoke test fills the predict-time-binding gap CV doesn't cover. The soft assertion's CV-mean baseline is hardcoded in the smoke test from the matching design note's Status.headline (which `evaluate-ml-pipeline` ultimately fills in after the run); the test does not import skore at runtime.
- `python-api`: symbol references for the predicting-package APIs the smoke test uses. Consult before naming any imported function in the test body. skore is not a smoke-test dependency; see the "no skore import" Stop condition above. Cache hits first: check `scratch/api/<lib>/<version>/` before WebSearching; cache new findings back there (per `python-api` Shape 0/3).
- `data-science-python-stack`: declares pytest as a Tier 1 mandatory dependency for any workspace using this skill.
- `python-code-style`: must be invoked after writing or editing `tests/smoke/test_NN_*.py`. Running `pixi run ruff check` directly without invoking this skill silently drops the NumPyDoc docstring convention the stack expects: ruff's `D`-rules pass on a one-line summary, but only the skill body teaches the parameter-shape-in-type-slot and the section layout (`Parameters` / `Returns` / `Notes`) the test fixture + test function should use.