AutoML + DEFT Pipeline
A workflow-bridge skill that runs
three phases in sequence by delegating to two existing skills —
for HPO and a DEFT application skill (default
for AOI; other
skills/applications/deft-*
skills for non-AOI cases) for the iterative data-improvement loop.
This skill does not re-implement AutoML or DEFT. It owns only the connective tissue: HPO spec inputs, the spec-handoff between AutoML and DEFT, and the post-DEFT AutoML re-run on the augmented dataset.
When this skill applies
- User asks to "run the AOI workflow" or "improve my AOI ChangeNet model" — default to this skill, not directly. The bare DEFT loop is the inner stage of this pipeline.
- User wants AutoML and DEFT chained on the same model/dataset
- User says "AutoML at both ends", "tune HPs then DEFT", "warm-start DEFT", "AutoML before and after DEFT"
- User has an AutoML-tuned spec and asks how to feed it into DEFT
When this skill does NOT apply
- User explicitly asks for the DEFT loop only ("run JUST the DEFT loop", "skip AutoML") → use directly
- User wants only AutoML with no follow-on DEFT → use directly
- User is doing zero-shot eval, RAG, or non-training workflows
The mental model
Phase 1 (AutoML baseline) Phase 2 (DEFT loop, plain train) Phase 3 (AutoML refinement)
───────────────────────── ──────────────────────────────── ───────────────────────────
specs/baseline_spec.yaml (Phase 1 winner pre-seeds baseline ${RESULTS_DIR}/iter${N}/dataset/
train/base/training_set.csv — DEFT skips its baseline train) train_combined_iter${N}.csv
│ │ │
▼ ▼ ▼
[ AutoML HPO sweep ] [ DEFT: baseline-inference → RCA [ AutoML HPO sweep ]
N recommendations → iter 1..N (plain retrain) ] re-tunes HPs against the
pick best by val_loss / FAR RCA / route / SDG / mining DEFT-augmented dataset
│ │ │
▼ ▼ ▼
best HPs spec + ckpt ─────► DEFT-augmented CSV ───────────► final best checkpoint
+ iter winner checkpoint (the deliverable; no
(Phase 3 warm-starts from it) further retrain)
The handoffs are:
- Phase 1 → Phase 2: a spec file AND the winning checkpoint. Retraining the same HPs in DEFT's baseline step is wasted compute, so the bridge deep-merges Phase 1's winning HPs onto , copies the winning checkpoint into
${RESULTS_DIR}/baseline/train/
under the filename DEFT expects, and pre-populates + so DEFT resumes at baseline inference → evaluate → RCA → iter 1. DEFT itself stays plain-train ( preserved). Verbatim 4-step procedure in .
- Phase 2 → Phase 3: a training CSV AND the iter winner's checkpoint. The CSV (
train_combined_iter${N_final}.csv
) is AutoML's training data; the checkpoint (iterations.<best>.best_ckpt_path
from ) is wired into each rec's train.pretrained_model_path
so Phase 3 fine-tunes from Phase 2's winner rather than from scratch. Without this warm-start Phase 3 routinely regresses vs the iter winner. Phase 3's winning checkpoint is the deliverable — no separate retrain after Phase 3. See .
Why three phases instead of two
- Phase 1 alone finds good HPs on the original training distribution, but the model still has the distributional gaps DEFT is designed to fill.
- Phase 2 alone (just DEFT) fills the gaps but uses whatever HPs was hand-authored with — usually not optimal.
- Phase 3 alone would run AutoML against the augmented dataset, but without a tuned baseline the DEFT loop's iteration cost is higher (slower convergence, more iterations to hit the KPI).
Running all three: AutoML cheap-tunes once on the original data, DEFT does the heavy data work with reasonable HPs, then AutoML tunes again on the now-richer dataset. Phase 3 is the most important of the three for the final deployed FAR/recall.
Cost up-front
The pipeline is sequential. Total wall-clock ≈ Phase 1 (N_automl × per-rec train) + Phase 2 (M iterations × per-iter cost) + Phase 3 (N_automl × per-rec train).
Note that
Phase 2 has no separate baseline train — Phase 1's winning checkpoint is reused as DEFT's baseline, so the baseline cost lands inside Phase 1's N_automl trainings rather than as an extra retrain. Surface this to the user before kickoff. Typically Phase 2's iterations still dominate (each includes SDG + retrain), but Phase 1 and Phase 3 each add several hours on a single-GPU box. Use the per-job estimate from the user's setup (if they have one) rather than guessing minutes. See
for the per-phase cost breakdown.
Consolidated Pre-Flight — one gate, all three phases
The pipeline has exactly one user gate. Before any side-effecting action (docker pull, docker login, any job-launch call delegated to a downstream skill, file mutations under
), the agent must produce a single consolidated Pre-Flight Summary that subsumes every downstream skill's preflight. Once the user approves, the run is autonomous through all three phases — no further interactive pauses.
The user explicitly does not want to be paged between phases. The DEFT loop's own inline
gate becomes a
zero-question display step (every value pre-supplied), as does
's shared launch preflight in Phases 1 and 3.
Before printing the gate the agent must read every downstream preflight section in full and run
every read-only check those sections prescribe, surfacing each
outcome in the summary. Running every step of the DEFT skill's
is mandatory — if any step is skipped the consolidated gate is invalid and the pipeline must not advance. The summary must include, in order: (1) workspace/host/platform/network, (2) credentials SET/UNSET status, (3) resolved container image URIs with PRESENT/MISSING, (4) dataset table with leakage check, (5) Phase 1 config, (6) Phase 2 config incl. pre-seeded baseline source, (7) Phase 3 config, (8) compute estimate, (9) the confirmation line. After the gate, pass every collected value through to each downstream skill so it has nothing to ask. The only allowed post-gate pauses are mid-run hard-stop safety gates (e.g. DEFT's KPI regression gate); call them out in the summary.
See
for the full build procedure, the exact mandatory contents of each summary section (with the GPU memory rule of thumb, DEFT loop defaults, and required inputs verbatim), the downstream gate-suppression inputs, and the fallback when an older skill-bank version hard-codes its own STOP gate.
Phase 1 — AutoML baseline
Invoke
tao-skill-bank:tao-run-automl
with:
| Input | AOI default | Notes |
|---|
| | Same model the DEFT loop expects |
| <workspace>/train/base/training_set.csv
| Same training set DEFT will start from |
| <workspace>/train/base/validation_set.csv
| Held-out — must NOT be the KPI test set (<workspace>/kpi/testing_set.csv
), since that set is reserved for DEFT's final reporting |
| FAR @ 100% recall (preferred) or | See — ChangeNet AOI is class-imbalanced, val_loss alone can mode-collapse |
| | LLM-brain or if compute is tight |
automl_max_recommendations
| 5–10 for AOI | More recs = better HPs but linear in compute |
| Pin epochs / batch_size; sweep optimizer-related HPs only | Otherwise AutoML wanders into long-train regimes that blow Phase 2's budget |
After the sweep finishes, AutoML's
is the winning hyperparameter dict.
Handoff to Phase 2
Phase 1 hands over
two artifacts: the winning
spec and the winning
checkpoint. Instead of retraining the same HPs in DEFT's baseline step, pre-seed DEFT's baseline state from Phase 1's outputs so DEFT starts at baseline inference → evaluate → RCA → iter 1. The four steps — write the merged
baseline_spec_automl.yaml
, copy the winning checkpoint into
${RESULTS_DIR}/baseline/train/
, initialise
with
iterations.baseline.stage_completed == "train"
(and append the matching
entry), then invoke DEFT — are given verbatim with the exact code in
.
inside the loop is preserved.
Quality check before handing off
Run a quick eval of the winning checkpoint against the held-out set: per-class prediction counts (if it collapsed to one class, evaluate the 2nd or 3rd best instead) and a comparison to a zero-shot ChangeNet baseline (if AutoML did not improve over zero-shot, surface that and pause). See
.
Phase 2 — DEFT loop (plain training, baseline pre-seeded from Phase 1)
Invoke
tao-skill-bank:tao-run-deft-aoi
(read its
for the full interface). For non-AOI applications, invoke the matching DEFT skill; the handoff shape is the same.
The DEFT loop's baseline-train sub-step is skipped. Phase 1 already produced a checkpoint trained at the winning HPs, and Phase 1's handoff (see above) pre-populated
${RESULTS_DIR}/baseline/train/
and
${RESULTS_DIR}/deft_state.json
so DEFT resumes at baseline inference → evaluate → RCA → iter 1. The rest of the DEFT loop runs unchanged.
Do not modify its invariant.
The DEFT loop owns:
- The Pre-Flight Summary display step — not a fresh user gate. The Consolidated Pre-Flight (above) is the single gate; the DEFT summary still prints as an audit-trail display of the pre-seeded source but must not re-prompt, since every input was collected in the consolidated gate.
- Baseline inference → evaluate → RCA on the pre-seeded checkpoint, and the full per-iteration RCA → routing → SDG → mining → assemble → train cycle.
- KPI gating and stop conditions; layout, , , .
After the loop exits (KPI met or
reached), capture two values from
:
iterations.<best>.best_ckpt_path
— the loop's best plain-train checkpoint
- The final iteration label — used to locate the augmented training CSV
If the DEFT loop hard-stops on an unrecoverable gate, skip Phase 3. There is no validated augmented CSV to feed AutoML.
Phase 3 — AutoML refinement on the DEFT-augmented dataset
Re-invoke
tao-skill-bank:tao-run-automl
with the augmented training CSV as the train dataset, the same held-out validation CSV as before, and
Phase 2's iter winner checkpoint as the warm-start:
| Input | AOI value |
|---|
| |
| ${RESULTS_DIR}/iter${N_final}/dataset/train_combined_iter${N_final}.csv
|
| Same as Phase 1 (<workspace>/train/base/validation_set.csv
) — keep the comparison apples-to-apples |
| Same metric as Phase 1 |
| Same as Phase 1 |
automl_max_recommendations
| 5–10 |
| Initial spec | Start from <workspace>/specs/baseline_spec_automl.yaml
(Phase 1's winner) — gives the sweep a strong centroid to refine around |
| Warm-start checkpoint | iterations.<best>.best_ckpt_path
from ${RESULTS_DIR}/deft_state.json
— set spec_overrides["train"]["pretrained_model_path"]
to this path. Each Phase 3 rec then fine-tunes from Phase 2's winner instead of training from scratch. |
The warm-start is
mandatory: with no warm-start, every rec starts from random init with only 10-20 epochs to reconverge, Phase 3's
regresses 0.03-0.05 vs iter1, and the
safety net silently rolls back to the iter winner — wasting Phase 3's compute. The concrete
code (selecting the lowest-
iteration, excluding any prior
), the broad-exploration tradeoff, output to
${RESULTS_DIR}/final_automl/
, and wiring Phase 3's checkpoint back into the DEFT report via
+ re-running
prepare_inference_spec.py
(with the
regression safety net) are all in
.
Pitfalls and quality checks
These apply to both AutoML phases — bake them into agent behavior, don't just paste once. The full detail is in
:
- Metric pitfalls (AOI is class-imbalanced). ChangeNet AOI is PASS-dominant; can mode-collapse to a zero-recall PASS-everything model. Prefer FAR @ 100%-recall directly, or gate val_loss with a sanity check, or decide top-K by FAR @ 100%-recall. For balanced / regression tasks, val_loss is fine.
- Run-to-run noise. AutoML can show 2–3× metric variance for the same config. If the winner looks suspiciously better than the runner-up, re-run with a fresh seed before committing the spec to Phase 2.
- Cleanliness (data leakage). Both AutoML phases use a validation set distinct from the KPI test set (), which stays untouched until DEFT's evaluate stage. Phase 3 trains on the augmented CSV but keeps the same val set so Phase 1 and Phase 3 numbers stay comparable.
- Compute budget. Surface the per-phase structure up front and only give a wall-clock range after the user supplies their per-job time.
Quick Start (AOI worked example)
When starting fresh from "run the AOI workflow", the agent delivers a three-phase worded message to the user (Phase 1 AutoML baseline → Phase 2 DEFT loop → Phase 3 AutoML refinement, with the cost framing and "OK to proceed?" close), then after confirmation invokes
(Phase 1), writes the merged spec, pre-seeds
, invokes
(Phase 2) with every input pre-supplied, and invokes
again (Phase 3) — with no further pauses unless a downstream skill hits an unrecoverable hard-stop gate — then summarizes the trajectory (baseline AutoML best → DEFT iter 1 → ... → DEFT iter N_final → Phase 3 best).
See
references/quick-start.md
for the verbatim customer-facing message and the exact post-confirmation invoke sequence.
Non-AOI DEFT applications
The same three-phase pattern applies to other DEFT skills — swap
, the Phase 2 DEFT skill, the spec/checkpoint path conventions, and the Phase 3 augmented-CSV path. The handoff shape (Phase 1 emits spec + checkpoint that pre-seeds the DEFT baseline, Phase 2 emits an augmented dataset, Phase 3 emits the final checkpoint) is identical, and the baseline-skip mechanism is generic to any DEFT-style loop with a resumable baseline state. See
references/quick-start.md
.
See also
tao-skill-bank:tao-run-automl
— AutoML interface, algorithms, HP ranges
tao-skill-bank:tao-run-deft-aoi
— full DEFT AOI loop (Phase 2 default)
tao-skill-bank:tao-train-visual-changenet
— underlying ChangeNet train/eval/infer skill (used by both AutoML and DEFT)
- Other
skills/applications/deft-*
skills — non-AOI Phase 2 targets
- — building the consolidated pre-flight gate
- — Phase 1→2 pre-seed, Phase 2 quality check, Phase 3 warm-start + report wiring
- — metric, noise, leakage, and compute-budget guidance
references/quick-start.md
— verbatim worked-example message and non-AOI variant