DEFT Mining and Embedding Skill
You are the operator of the DEFT embed-then-mine workflow for VCN AOI. Your job is to take a parquet of weak target images (the gap-analysis or routing output) and a source pool, then produce a deduplicated parquet of mined source images that look similar to the targets — ready to feed into the next training round.
The workflow is fixed and deterministic:
embed the targets, embed the source pool, then mine nearest neighbours. Each step's output parquet is the next step's input. There is no iterative search, no clustering pass, no human-in-the-loop selection — depth comes from picking the right encoder and the right
, not from a multi-phase investigation.
The whole skill is a thin wrapper around three direct `docker run` invocations against the `tao_toolkit.data_services` image declared in `versions.yaml` (resolved at runtime; see Setup). The container's entrypoint takes `<category> <action> -e <spec.yaml> [hydra overrides...]`: `embedding image_embeddings -e <embedding_spec.yaml> …` for embedding and `tmm nearest_neighbors -e <mining_spec.yaml> …` for mining. The `-e` flag points at a YAML of schema defaults; anything afterward is a bare Hydra override (`key=value`) applied per run. There is no
keyword inside the container; that's the TAO launcher's pillar prefix and is dropped here. Schema keys can rename between data-services releases, so when in doubt introspect once per image with `docker run --rm "$DS_IMAGE" embedding image_embeddings --cfg=job` and `... tmm nearest_neighbors --cfg=job`. See
for the full entrypoint contract, `--cfg=job` introspection, and the paste-and-edit end-to-end recipe.
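The entrypoint shape described above can be sketched as a small argument builder. This is an illustrative helper, not part of the container or the skill: the function name `build_container_args` and the parquet filenames are hypothetical, and only the `<category> <action> -e <spec.yaml> key=value ...` ordering comes from the text.

```python
import shlex


def build_container_args(category, action, spec_path, overrides=None):
    """Return the argv handed to the container entrypoint.

    Shape: <category> <action> -e <spec.yaml> [key=value ...]
    """
    args = [category, action, "-e", spec_path]
    # Each Hydra override is a bare key=value token appended after the spec.
    for key, value in (overrides or {}).items():
        args.append(f"{key}={value}")
    return args


# Hypothetical filenames, used only to show the resulting command line.
embed_args = build_container_args(
    "embedding",
    "image_embeddings",
    "embedding_spec.yaml",
    {"input_parquet": "targets.parquet", "output_parquet": "target_embeddings.parquet"},
)
print(shlex.join(embed_args))
```

Building the arguments as a list and joining them with `shlex.join` only for display keeps paths with spaces intact when the list is passed to `subprocess.run`.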
Inputs
- Target parquet: the gap-analysis output, typically from `tao-route-visual-changenet-samples` (or from `tao-analyze-gaps-visual-changenet`
if routing was skipped). Required column: . If is also present, label-aware filtering during mining is available; otherwise the mining task silently no-ops the filter.
- Source pool — a parquet of candidate images to mine against, with a column. If the user only has a CSV, convert it to a parquet with the same columns before Step 2. For label-aware filtering, the pool must also carry a column.
- Embedding spec file — a YAML containing , , , and (only when is a TAO /) . Reused across Steps 1 and 2; / are supplied per run as Hydra overrides. The same spec MUST drive both embedding steps — embeddings from different encoders are not comparable, and mismatched encoders are the most common cause of "the mined images look unrelated" reports.
- Mining spec file — a YAML containing , , , and (rarely changed) /. // are Hydra overrides at run time. SigLIP and CLIP embeddings should use . When but either embedding parquet lacks a column, the container logs a warning and proceeds without filtering.
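A pre-flight column check along these lines can catch a silent filter no-op before any GPU time is spent. It is a sketch under stated assumptions: the column names `filepath` and `label` are assumed defaults (pass the real names if your parquets differ), and the function name is hypothetical.

```python
def check_input_columns(columns, path_column="filepath", label_column="label"):
    """Validate a parquet's column list before embedding or mining.

    Raises ValueError if the path column is missing.
    Returns True when the label column is present (label-aware
    filtering is possible), False otherwise.
    """
    present = set(columns)
    if path_column not in present:
        raise ValueError(f"missing required column: {path_column!r}")
    return label_column in present


# Column names here are assumptions for illustration only.
print(check_input_columns(["filepath", "label", "split"]))  # True
print(check_input_columns(["filepath"]))                    # False
```

Run it on both the target parquet and the source pool; label-aware filtering only makes sense when both calls return `True`.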
Setup
The mining and embedding tasks live inside the `tao_toolkit.data_services` image declared in `versions.yaml`. Resolve the concrete URI once at the top of the run, then confirm Docker, the NVIDIA container toolkit, and a GPU are present before anything else:
```bash
# Resolve tao_toolkit.data_services → concrete nvcr.io/... URI from versions.yaml
DS_IMAGE=$(python3 -c "import yaml,os; print(yaml.safe_load(open(os.environ['TAO_SKILL_BANK_PATH']+'/versions.yaml'))['images']['tao_toolkit']['data_services'])")
echo "DS_IMAGE=$DS_IMAGE"
# Confirm the Docker daemon and a GPU are reachable
docker info > /dev/null && echo "OK: docker"
nvidia-smi > /dev/null && echo "OK: GPU"
# Pull the image only if it is not already present locally
docker image inspect "$DS_IMAGE" > /dev/null \
  || docker pull "$DS_IMAGE"
```
`TAO_SKILL_BANK_PATH` is exported by the plugin's
hook. If it is unset (e.g. running outside the Claude Code plugin), point it at the skill-bank repo root before resolving. A GPU is required for both the encoder forward pass and the cuML/cuDF k-NN search; both steps will fail without CUDA.
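If the one-liner above fails with a bare `KeyError`, a slightly more verbose lookup makes the missing key obvious. This is a minimal sketch that assumes the YAML has already been parsed into a dict; the helper name `resolve_image` and the example URI are hypothetical.

```python
def resolve_image(versions, dotted_key="images.tao_toolkit.data_services"):
    """Walk a parsed versions.yaml dict along a dotted key and return the image URI."""
    node = versions
    walked = []
    for part in dotted_key.split("."):
        walked.append(part)
        if not isinstance(node, dict) or part not in node:
            raise KeyError(f"versions.yaml has no key {'.'.join(walked)!r}")
        node = node[part]
    return node


# Placeholder URI for illustration; the real value comes from versions.yaml.
example = {"images": {"tao_toolkit": {"data_services": "example.invalid/data-services:tag"}}}
print(resolve_image(example))
```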
Path mounting. Every host path the container reads or writes — input parquets, output dirs, and the source-pool image root — must be bind-mounted. The simplest, most predictable approach mounts the workspace root with identical paths inside and outside the container so absolute paths in the parquet args resolve the same way on both sides:
```bash
WORKSPACE=<absolute path that contains all parquets, outputs, and the source-pool images>
DOCKER="docker run --gpus all --rm --ipc=host --user $(id -u):$(id -g) -v $WORKSPACE:$WORKSPACE -w $WORKSPACE $DS_IMAGE"
```
Reuse `$DOCKER` for the three invocations below.
CSV source pool. If the source pool is provided only as a CSV, convert it to a parquet up front with `pd.read_csv(...).to_parquet(..., index=False)`, preserving the
column verbatim (and
if present). Do not add a path prefix — the container reads input parquets as-is and the
mount keeps host and container paths identical.
Author the two spec files once per iteration. Both files live under
so the
argument resolves on both sides of the mount. Per-run values stay out of the spec and are passed as Hydra overrides at invocation time. The defaults are
,
model_path: google/siglip-base-patch16-224
,
for embedding, and
,
,
(quoted — the schema reads it as a string) for mining. Use
for SigLIP/CLIP,
/
otherwise; add
only when
is a TAO checkpoint. Any field can still be overridden inline at the CLI (e.g.
) — Hydra applies CLI overrides on top of the spec.
See
for the verbatim spec-file templates, the CSV conversion snippet, and the full mounting and image-resolution detail.
Method
Three commands, in order. Each command's output parquet is the next command's input. Run them as plain Bash; the `$DOCKER` alias from Setup handles the container, GPU, and mounts. Every invocation follows the same shape: `-e <spec.yaml>` for the baked-in defaults, then a handful of Hydra overrides for run-specific paths.
Step 1 — Embed the target images
```bash
$DOCKER embedding image_embeddings -e <embedding_spec.yaml> \
  input_parquet=<target_parquet> output_parquet=<target_embeddings_parquet>
```
Reads the gap-analysis / routing output and writes a parquet with `filepath`, `embedding`, and any extra metadata columns (e.g.
,
,
) carried forward verbatim. Print the output schema (`pd.read_parquet(...).columns`) to stdout so the script-check hook can confirm the embedding column exists. To override
/
/
for one run without editing the spec, append them as Hydra overrides.
Step 2 — Embed the source pool
```bash
$DOCKER embedding image_embeddings -e <embedding_spec.yaml> \
  input_parquet=<source_pool_parquet> output_parquet=<source_embeddings_parquet>
```
Same command shape as Step 1, applied to the source pool. Use the identical `embedding_spec.yaml` as Step 1, and do not override
/
/
differently here — mismatched encoder configs across the two steps produce non-comparable embeddings.
Step 3 — Mine nearest neighbours
```bash
$DOCKER tmm nearest_neighbors -e <mining_spec.yaml> \
  source_parquet=<source_embeddings_parquet> \
  target_parquet=<target_embeddings_parquet> output_parquet=<mined_parquet>
```
For each target embedding, finds the
closest source embeddings under the chosen metric, deduplicates across targets, and writes a single-column (`filepath`) parquet of unique mined source paths. The container also drops a `mining_summary.txt` next to the output parquet with: query count, neighbour count, duplicates removed, and (when label filtering is on) kept-vs-dropped pair counts. Tweak
,
, or
via inline Hydra override when sweeping — no need to rewrite the spec. When
but one embedding parquet is missing the
column, the container logs a warning and proceeds without filtering; if the mined output looks too large or contains cross-label pairs, scan the docker log for that warning first.
See
for the complete paste-and-edit recipe that runs all three steps as one streamed Bash block with row-count sanity prints.
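To make the "nearest neighbours, then deduplicate across targets" behaviour concrete, here is a toy pure-Python version using cosine similarity. It is only an illustration of the idea: the container's own search runs on GPU and its exact implementation is not shown in this document. The function names, the `k` parameter name, and the sample paths are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_nearest(targets, sources, k):
    """Return (unique mined paths in first-seen order, duplicates removed).

    targets: list of embedding vectors
    sources: list of (filepath, embedding vector) pairs
    """
    mined, seen, duplicates = [], set(), 0
    for target in targets:
        # Rank every source by similarity to this target, best first.
        ranked = sorted(sources, key=lambda s: cosine_similarity(target, s[1]), reverse=True)
        for path, _ in ranked[:k]:
            if path in seen:
                duplicates += 1  # already mined for an earlier target
            else:
                seen.add(path)
                mined.append(path)
    return mined, duplicates


sources = [("a.png", [1.0, 0.0]), ("b.png", [0.9, 0.1]), ("c.png", [0.0, 1.0])]
targets = [[1.0, 0.0], [0.8, 0.2]]
mined, duplicates = mine_nearest(targets, sources, k=2)
print(mined, duplicates)  # ['a.png', 'b.png'] 2
```

Both targets pick the same two sources, so four neighbour pairs collapse to two unique paths. This is why the mined count is normally smaller than the number of neighbour pairs.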
Outputs
Write everything into a timestamped folder under the experiment / iteration directory. The packaging hook will add `mining_config/` and `claude_session.jsonl` automatically when `Mining_Report.md` is written.

```
<output_dir>/mining_results/YYYY-MM-DD_HHMMSS/
├── Mining_Report.md              # Full mining report
├── embedding_spec.yaml           # The -e spec used for Steps 1 and 2
├── mining_spec.yaml              # The -e spec used for Step 3
├── target_embeddings.parquet     # Step 1 output (filepath, embedding, + carried metadata)
├── source_embeddings.parquet     # Step 2 output (filepath, embedding, + carried metadata)
├── mined.parquet                 # Step 3 output: unique mined source filepaths
├── mining_summary.txt            # Auto-emitted next to mined.parquet by the container
├── mining_config/                # Auto-copied by hook
└── claude_session.jsonl          # Auto-copied by hook
```
At the start of the run, get the real timestamp by running `date +%Y-%m-%d_%H%M%S` in Bash. Do NOT hardcode or guess. If the user specifies a custom output path, use it directly but maintain the same internal layout.
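When the run folder is created from a Python helper rather than directly in Bash, the same `YYYY-MM-DD_HHMMSS` layout can be produced as sketched below. The function name `make_run_dir` is hypothetical, and the fixed `now` value exists only to make the demo reproducible; real runs should leave it unset so the clock is read at call time.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def make_run_dir(output_dir, now=None):
    """Create <output_dir>/mining_results/YYYY-MM-DD_HHMMSS/ and return its path."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H%M%S")
    run_dir = Path(output_dir) / "mining_results" / stamp
    # exist_ok=False: refuse to silently reuse a folder from an earlier run.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


# Demo in a throwaway directory with a fixed clock value.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = make_run_dir(tmp, now=datetime(2025, 1, 31, 9, 5, 7))
    created = run_dir.is_dir()
    print(run_dir.name)  # 2025-01-31_090507
```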
The mined parquet is the artifact downstream training consumes. The two embedding parquets are intermediate but worth retaining: they are reusable across multiple mining runs against the same source pool, and they are the only place to look when a "looks unrelated" report needs encoder-level debugging.
Common pitfalls
The single most common cause of garbage output is
mismatched encoders — both embedding steps must consume the same
, and any
/
/
override must apply to both steps or neither. Other frequent issues: skipping an embedding step, a missing
column under
(silent no-op), spec files outside
, unresolved
sentinels, TAO checkpoints without
, CSV pools not converted to parquet, host/container path mismatches, no GPU, the wrong image tag, and
× N_targets exceeding the source size (expected, not a bug — report the actual mined count).
See `references/troubleshooting.md` for the full diagnosis and fix for each of these.
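The last pitfall is simple arithmetic: the mined set can never be larger than the source pool, nor larger than the number of neighbour pairs requested. A hedged sketch, where `k` stands in for the per-target neighbour count and the function name is hypothetical:

```python
def max_mined_count(k, n_targets, n_source):
    """Upper bound on unique mined paths: at most k per target, capped by pool size."""
    return min(k * n_targets, n_source)


print(max_mined_count(k=10, n_targets=50, n_source=200))  # 200 (pool is the limit)
print(max_mined_count(k=5, n_targets=3, n_source=1000))   # 15 (requests are the limit)
```

The actual mined count is usually below this bound because of cross-target duplicates and, when enabled, label filtering, so always report the real row count of the mined parquet.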
Report Structure
Keep the report tight (600–1200 words). Mining is a deterministic pipeline; the value is making the encoder choice, the row counts, and any silent filter no-ops auditable — not narrative. The report has seven sections: Verdict, Inputs, Encoder Consistency, Mining Run, Per-Label Breakdown (skipped if the target parquet has no
column), Output Sanity, and Recommended Actions.
See `references/reporting_spec.md` for the complete fill-in report template with every section and field.
Execution Order
- Resolve `DS_IMAGE` from `versions.yaml` (`images.tao_toolkit.data_services`), then run `docker info`, `nvidia-smi`, and `docker image inspect "$DS_IMAGE"` (pulling if missing) once to confirm the environment. Abort with a clear message if any fail.
- Run `date +%Y-%m-%d_%H%M%S` to get the timestamp; create `<output_dir>/mining_results/<timestamp>/`.
- Write `embedding_spec.yaml` and `mining_spec.yaml` into the timestamped dir, filling in the encoder choice and mining knobs. Keep these under `$WORKSPACE` so the path resolves inside the container.
- If the source pool is a CSV, convert to parquet first (preserve and ).
- Run Step 1 (embed targets) via `docker run … embedding image_embeddings -e embedding_spec.yaml input_parquet=… output_parquet=…`. Print the output parquet's row count and columns to stdout.
- Run Step 2 (embed source pool) with the identical `embedding_spec.yaml` as Step 1. Print output row count and columns.
- Run Step 3 (mine nearest neighbours) via `docker run … tmm nearest_neighbors -e mining_spec.yaml source_parquet=… target_parquet=… output_parquet=…`. Confirm `mining_summary.txt` was written next to `mined.parquet`.
- Compute the per-label breakdown (Section 5) by joining the target embeddings parquet with the mined output on filepath, if both carry .
- Write `Mining_Report.md` last; writing it triggers the packaging hook, which copies session logs and skill config alongside.