tao-finetune-huggingface-model

Local NVIDIA GPU fine-tuning for HuggingFace models, grounded in live-fetched documentation with curated references as a fallback safety net. One NGC container, a small set of focused scripts, one push to HF Hub. Behavior is governed by the rules in this file — follow them, do not improvise.

Order of authority (highest first): (1) user input → (2) live research (model card, HF repo example, author script, task docs, paper; always fetched, Step 3) → (3) curated `references/*.md` (fallback when live research is silent) → (4) training-data memory (last resort, suspect). On conflict, live research wins for the specific model + current API. See `references/core-rules.md` for the full order and conflict-resolution rules.
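The order of authority above can be read as a first-match lookup. The following is a minimal sketch, not the skill's implementation; the source keys (`user_input`, `live_research`, `curated_reference`, `training_memory`) and the `resolve` helper are hypothetical names chosen for illustration.

```python
from typing import Optional

# Hypothetical source names, listed highest authority first.
AUTHORITY_ORDER = ("user_input", "live_research", "curated_reference", "training_memory")


def resolve(findings: dict[str, Optional[str]]) -> tuple[str, str]:
    """Return (source, value) from the highest-authority source that has a value."""
    for source in AUTHORITY_ORDER:
        value = findings.get(source)
        if value is not None:
            return source, value
    raise LookupError("no source supplied a value")


# Live research has a finding, so the curated reference is not consulted.
source, value = resolve({
    "user_input": None,
    "live_research": "lr=5e-5",
    "curated_reference": "lr=2e-5",
})
print(source, value)
```

The full conflict-resolution rules live in `references/core-rules.md`; this sketch only covers the simple case where the first non-empty source wins.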

Inputs

Required:

`model_id`: HuggingFace model ID, e.g. `google/vit-base-patch16-224`

Conditional credentials (loaded by the SessionStart hook from `~/.config/tao/.env`):

`HF_TOKEN`: only when the model/dataset is gated (read) or `push_to_hub` is on (write); public + `push_to_hub: false` runs don't need it. The agent never reads the value, it only checks presence with `[ -n "$HF_TOKEN" ]`.
`WANDB_API_KEY`, `WANDB_PROJECT`: only when WandB is enabled; set `WANDB_MODE=disabled` to opt out.
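A rough sketch of the conditional-credentials rule, assuming the caller already knows whether the model/dataset is gated and whether WandB is enabled. The function names and the boolean flags are hypothetical; only the environment variable names come from the text above.

```python
import os


def required_credentials(gated: bool, push_to_hub: bool, wandb_enabled: bool) -> list[str]:
    """List the env var names this run needs, per the rules above."""
    names = []
    if gated or push_to_hub:          # read access for gated assets, write access for the push
        names.append("HF_TOKEN")
    if wandb_enabled:
        names += ["WANDB_API_KEY", "WANDB_PROJECT"]
    return names


def missing_credentials(names, env=None):
    """Return the names that are unset or empty. Values are never returned or printed."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


needed = required_credentials(gated=False, push_to_hub=True, wandb_enabled=False)
print("missing:", missing_credentials(needed, env={}))
```

A public model with `push_to_hub: false` and WandB disabled yields an empty list, so no credential check runs at all.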

Dataset (exactly one):

`dataset_id`: HuggingFace dataset ID (source: `hf`)
`local_dataset_path`: local folder or file (source: `local`); optional `local_dataset_format` ∈ {auto, imagefolder, coco, voc, jsonl, arrow, parquet, csv} (default auto-detect).
(omit): agent recommends popular datasets (source: `recommend`)
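The "exactly one" rule maps onto three sources. A minimal sketch, assuming that supplying both inputs is an error (the text does not say how that case is handled); `resolve_dataset_source` is a hypothetical helper.

```python
from typing import Optional


def resolve_dataset_source(dataset_id: Optional[str] = None,
                           local_dataset_path: Optional[str] = None) -> str:
    """Map the dataset inputs to one of the three sources: hf, local, recommend."""
    if dataset_id and local_dataset_path:
        # Assumption: two dataset inputs at once is rejected rather than silently resolved.
        raise ValueError("give dataset_id or local_dataset_path, not both")
    if dataset_id:
        return "hf"
    if local_dataset_path:
        return "local"
    return "recommend"  # nothing given: the agent proposes popular datasets


print(resolve_dataset_source(dataset_id="cifar10"))
print(resolve_dataset_source())
```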

Optional (have defaults): `task_type` (auto-detected); `n_train=10000`, `n_eval=1000`, `n_epochs=3`, `lora_r=16`; `output_dir=./output/<model_short_name>`; `hf_model_repo` (push target; if unset and HF_TOKEN has write access, auto-derived as `<whoami>/<model_short_name>-finetuned`); `push_to_hub=True` (set `False` to skip); `skip_baseline=False` (skip zero-shot baseline eval).
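The two derived defaults can be sketched as below. This assumes `<model_short_name>` is the part of `model_id` after the last `/`, which the text does not define; the helper names are hypothetical.

```python
def model_short_name(model_id: str) -> str:
    # Assumption: the short name is the last path segment of the model ID.
    return model_id.rsplit("/", 1)[-1]


def default_output_dir(model_id: str) -> str:
    return f"./output/{model_short_name(model_id)}"


def default_hub_repo(whoami: str, model_id: str) -> str:
    # Mirrors the <whoami>/<model_short_name>-finetuned pattern from the defaults above.
    return f"{whoami}/{model_short_name(model_id)}-finetuned"


print(default_output_dir("google/vit-base-patch16-224"))
print(default_hub_repo("alice", "google/vit-base-patch16-224"))
```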

Optional deliverables (off by default): `emit_progress_log` → `output_dir/PROGRESS.md` (per-step ✅/⚠️/❌ journal); `emit_report` → `reports/report.{pdf,html}` with curves & samples; `emit_unit_tests` → `tests/` with fake-data heterogeneous-batch tests.

All values live in `output_dir/config.yaml`. Never hardcode in Python.

Execution platform

This skill orchestrates what to run; the platform skills own how (read them first, do not redraft their conventions here): `tao-setup-nvidia-gpu-host` (GPU host runtime: driver 580, CUDA Toolkit 13.0, NVIDIA Container Toolkit 1.19.0), `tao-run-on-docker` (`docker run` flags, NGC auth, `--gpus`, mounts, env passthrough, `--ipc=host` / `--shm-size`, error modes), and `tao-run-on-local-docker` (local Docker job preflight: daemon reachable, GPU smoke).

Default platform: `local-docker`. Build a one-off image (`run-<short>:latest`) and run it on the local Docker daemon. Ask only if the user needs a different backend (Brev, Lepton/SLURM/Kubernetes). See `references/execution-platform.md` for that path plus the alternate-backend routing, the GPU-runtime preflight, the credentials policy, and the `docker run` conventions.

References — fallback safety net

Curated `references/*.md` are consulted only when live research is silent, ambiguous, or unavailable; live docs always win for the specific model + current API. The workflow steps below link the file each step needs directly. Before falling back, log the live source you tried and why it was insufficient (in `config.yaml` `notes:`, and PROGRESS.md if enabled). `[FETCH LIVE]` markers in `cv-scripts.md` / `vlm-scripts.md` are a research checklist, not code to inline; if a block has no Step 3 finding, refetch the listed URL.

See `references/reference-index.md` for the complete index: every always-on reference plus the three opt-in ones gated by a flag (`progress-tracking.md` ← `emit_progress_log`, `testing.md` ← `emit_unit_tests`, `reporting.md` ← `emit_report`), each with its per-step role.

Core rules

The non-negotiable behaviors. Full text in `references/core-rules.md`. Short version:

Your HF-library knowledge is outdated. Fetch live docs before writing any ML code; never generate trainer args / collator / transforms from memory (Step 3).
Smoke-test on real data with `--max_steps 1` before any full run.
Never silently substitute model_id, dataset_id, or training_method; stop and ask.
Error recovery is minimal-change. OOM → halve batch, double grad_accum, enable gradient checkpointing (don't switch to LoRA without approval); NaN → reduce LR 10×; flat loss → inspect collator; same error 3× → stop and ask.
Dataset columns verified BEFORE the collator. Rename → `prepare_data.py`; restructuring → stop and ask.
Hardware sizing (bf16): ≤3B → 24 GB, 7–13B → 80 GB, 30B+ → multi-GPU or LoRA on 1× 80 GB, 70B+ → 8× 80 GB or LoRA. Won't fit + no LoRA request → ask.

`references/core-rules.md` has the full enumeration (hallucinated imports, never-without-approval list, full error-recovery + hardware-sizing tables).
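The minimal-change recovery rule can be sketched as a pure function over a config dict. This is an illustration only: the key names (`batch_size`, `grad_accum`, `gradient_checkpointing`, `learning_rate`) and the `occurrences` counter are assumptions, and the flat-loss case is left out because it calls for inspection rather than a config change.

```python
def recover(error: str, cfg: dict, occurrences: int = 1) -> dict:
    """Return a copy of cfg with the smallest change that addresses `error`."""
    if occurrences >= 3:
        # Same error three times: stop and ask instead of changing more knobs.
        raise RuntimeError(f"{error} seen {occurrences} times: stop and ask the user")
    new = dict(cfg)
    if error == "oom":
        # Halving the batch while doubling accumulation keeps the effective
        # batch size unchanged for even batch sizes.
        new["batch_size"] = max(1, cfg["batch_size"] // 2)
        new["grad_accum"] = cfg["grad_accum"] * 2
        new["gradient_checkpointing"] = True
    elif error == "nan":
        new["learning_rate"] = cfg["learning_rate"] / 10
    else:
        raise ValueError(f"no automatic recovery for {error!r}")
    return new


cfg = {"batch_size": 8, "grad_accum": 2, "learning_rate": 1e-4}
print(recover("oom", cfg))
```

Note that the function never switches the training method: moving to LoRA stays a user decision, as the rule above requires.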

Workflow — 6 steps

Single pass, sequential. Each step has a clear gate before the next begins.

Step 1 — Inspect & qualify

Decide whether to proceed at all. 1a. Probe model and 1b. Probe dataset via two CPU-only `python:3.12-slim` containerized probes (no host Python prereqs): the model probe reports `model_type`, `architectures`, `tags`, head counts; the dataset probe verifies loadability + column schema. Detect `task` from `architectures` / `tags` + card body (card silent on `AutoModelFor...` → `references/model-discovery.md`, log under `notes:`). For `source = recommend`, present 3–5 picks from `references/dataset-recommendations.md`; for `source = local`, use `references/dataset-sources.md` loaders. 1c. Accept/reject, 1d. walk `references/compat-workarounds.md` recording matches in `config.yaml` `applicable_workarounds:`, then 1e. write the `config.yaml` skeleton.

See `references/step1-probes.md` for the full probe scripts + `docker run` invocations, the Docker-daemon preflight, prerequisites (`MODEL_ID`, optional `DATASET_ID` / `HF_TOKEN`, `OUTPUT_DIR` default `./output/<model_short_name>` bind-mounted by Steps 4–5), dataset-column verification + rename rule, the full reject criteria, compat-walk detail, the exact skeleton, and `.probe` cleanup.

Gate: `config.yaml` exists with model, dataset, task, applicable_workarounds. Do not proceed if any field is missing.
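A minimal sketch of the Step 1 gate over an already-parsed config. It assumes the four fields are top-level keys and that an empty `applicable_workarounds` list is valid (no workaround matched), while a missing or `None` entry is not; the exact skeleton is in `references/step1-probes.md`.

```python
REQUIRED_STEP1_FIELDS = ("model", "dataset", "task", "applicable_workarounds")


def step1_missing_fields(config: dict) -> list[str]:
    """Return the gate fields that are absent or None in the parsed config.yaml."""
    return [key for key in REQUIRED_STEP1_FIELDS if config.get(key) is None]


config = {
    "model": "google/vit-base-patch16-224",
    "dataset": "cifar10",
    "task": "image-classification",
    "applicable_workarounds": [],   # empty is fine: nothing matched in Step 1d
}
missing = step1_missing_fields(config)
print("gate passed" if not missing else f"gate blocked, missing: {missing}")
```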

Step 2 — Hardware audit & NGC image

Verify Docker + GPU + disk, pick the NGC PyTorch image live, finalize hardware-dependent compat rules. 2a. Audit (hard gate) via `tao-setup-nvidia-gpu-host --check-only` (driver branch 580, CUDA Toolkit 13.0, NVIDIA Container Toolkit 1.19.0); on failure ask to authorize the install, then re-run; soft-warn on `< 100 GB` free disk; check only the credentials this run needs; do not proceed to Step 4 on a hard-fail; record `gpu_count`, `gpu_name`, `driver_major`, `vram_gb_per_gpu`. 2b. Pick NGC image (live): highest-versioned PyTorch NGC image with `Min driver ≤ driver_major` and container CUDA ≤ host CUDA Toolkit (never reject for an `aN` / `bN` / `rcN` suffix); WebFetch fail → `references/hardware-container.md` fallback. 2c. Re-evaluate `hw`-dependent compat rules. 2d. Model-fit check: bf16 `param_bytes ≈ 2×param_count`; if > 60% of `vram_gb_per_gpu × 1e9`, recommend LoRA.
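The 2d model-fit check is plain arithmetic. A minimal sketch, with `recommend_lora` as a hypothetical helper name:

```python
def recommend_lora(param_count: int, vram_gb_per_gpu: float) -> bool:
    """True when bf16 weights alone exceed 60% of one GPU's memory."""
    param_bytes = 2 * param_count              # bf16: 2 bytes per parameter
    budget_bytes = 0.60 * vram_gb_per_gpu * 1e9
    return param_bytes > budget_bytes


# 3B params -> 6e9 bytes, under the 14.4e9-byte budget of a 24 GB GPU.
print(recommend_lora(3_000_000_000, 24))
# 13B params -> 26e9 bytes, over that same budget.
print(recommend_lora(13_000_000_000, 24))
```

The check counts weights only; gradients, optimizer state, and activations come on top, which is why the threshold sits well below 100%.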

See `references/hardware-audit-ngc.md` for the full audit script, the soft-warn `MIN_DISK_GB` override, live-selection rules, the support-matrix WebFetch URL, the `24.09-py3` / SDPA+GQA `attn_implementation: "eager"` fallback, and the `could not select device driver` failure note.
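The 2b selection rule can be sketched as filter-then-max over candidate images. Everything concrete here is an assumption: the `Image` record, the `YY.MM-py3` tag parsing, and above all the candidate values, which are invented for illustration and are not taken from the NVIDIA support matrix. The pre-release suffix rule is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Image:
    tag: str                 # assumed format "YY.MM-py3"
    min_driver: int          # minimum host driver major version
    cuda: tuple[int, int]    # container CUDA as (major, minor)


def version_key(tag: str) -> tuple[int, int]:
    year, month = tag.split("-")[0].split(".")[:2]
    return int(year), int(month)


def pick_image(candidates, driver_major: int, host_cuda: tuple[int, int]) -> Optional[Image]:
    """Highest-versioned image whose driver and CUDA requirements the host meets."""
    usable = [c for c in candidates
              if c.min_driver <= driver_major and c.cuda <= host_cuda]
    return max(usable, key=lambda c: version_key(c.tag)) if usable else None


# Illustrative values only, NOT real support-matrix data.
candidates = [
    Image("24.09-py3", 560, (12, 6)),
    Image("25.01-py3", 570, (12, 8)),
    Image("99.01-py3", 999, (99, 0)),   # deliberately unusable on the host below
]
chosen = pick_image(candidates, driver_major=580, host_cuda=(13, 0))
print(chosen.tag if chosen else "no compatible image")
```

In the real workflow the candidate list comes from the live support-matrix fetch, with `references/hardware-container.md` as the fallback.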

Gate: `config.yaml` has `ngc_image`, `gpu_count`, `gpu_name`, `driver_major`, `vram_gb_per_gpu`. Hardware-dependent compat fixes are recorded.

Step 3 — Research the recipe

Fetch the live recipe: the agent's `transformers` / `trl` / `peft` memory is suspect, so Step 3 is non-negotiable. Walk `references/research-priorities.md` in priority order (Priority 1 → 6). Stop once you have, for the detected task: the `AutoModel` / processor class, train + eval transforms, collator, `compute_metrics`, and hyperparameter hints (LR, batch size, epochs, scheduler). Record findings in `meta/recipe.md` and append source URLs to `config.yaml: research_sources:`. If a slot has no live finding, fall back to the matching scaffold (`cv-scripts.md` / `vlm-scripts.md`) and log "fallback to scaffold — no live source for <slot>" under `notes:`. Conflict-resolution rules: `references/research-priorities.md`.

Gate: every required slot above is filled, with a source URL or an explicit scaffold-fallback note.
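One way to picture the Step 3 gate is a per-slot check for either a source URL or a scaffold-fallback note. The slot names and the `source_url` / `scaffold_fallback_note` keys below are hypothetical; the actual record format is whatever `meta/recipe.md` and `config.yaml` use.

```python
# Hypothetical slot names derived from the list in Step 3.
REQUIRED_SLOTS = (
    "model_class", "processor_class", "train_transforms",
    "eval_transforms", "collator", "compute_metrics", "hyperparameters",
)


def unfilled_slots(recipe: dict) -> list[str]:
    """Return slots that have neither a source URL nor a scaffold-fallback note."""
    missing = []
    for slot in REQUIRED_SLOTS:
        entry = recipe.get(slot) or {}
        if not (entry.get("source_url") or entry.get("scaffold_fallback_note")):
            missing.append(slot)
    return missing


recipe = {slot: {"source_url": "https://example.invalid/recipe"} for slot in REQUIRED_SLOTS}
recipe["collator"] = {"scaffold_fallback_note": "no live source for collator"}
print(unfilled_slots(recipe))
```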

Step 4 — Generate project & smoke-test

Write all scripts, build the image, prepare data, run a 1-step smoke on real data (one `docker build`, two `docker run`s).

4a. Generate project files in `output_dir/`: `config.yaml`, `Dockerfile`, `requirements.txt`, `prepare_data.py`, `train.py`, `run_eval.py` (eval script MUST be `run_eval.py`, never `evaluate.py`, which collides with HF `evaluate`), `infer.py`, `merge_lora.py` for VLM-LoRA, `.gitignore`. Authority order: Step 3 live research → scaffold reference (`cv-scripts.md` / `vlm-scripts.md`) for structure only, never their `[FETCH LIVE]` blocks. Apply each `applicable_workarounds` entry as a Dockerfile block, requirements pin, config override, or runtime env var. Every generated `.py` begins with the NVIDIA Apache-2.0 `#`-comment copyright header (emitter must fail otherwise). If `emit_unit_tests: true`, also generate `tests/` per `references/testing.md`. See `references/project-scaffold.md` for the full file table, the exact copyright header, and the Dockerfile template (deps → compat → code layer order).

4b. Build, prepare, smoke: `docker build -t run-<short>:latest .`, then run `references/docker-runs.md` §1 (build), §2 (prepare_data), §3 (smoke, `--smoke --max_steps 1`); §3 lists the smoke pass criteria (no exception, loss finite, `grad_norm > 0` at step 1). If `emit_unit_tests: true`, also run `pytest tests/` inside the container. Any failure → STOP.
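The three smoke pass criteria named above (no exception, finite loss, `grad_norm > 0` at step 1) reduce to a short predicate. A minimal sketch; how `loss` and `grad_norm` are read from the training logs is left to `references/docker-runs.md`, and `smoke_passed` is a hypothetical name.

```python
import math
from typing import Optional


def smoke_passed(loss: float, grad_norm: float,
                 exception: Optional[BaseException] = None) -> bool:
    """Apply the smoke criteria to the step-1 values."""
    return (
        exception is None
        and math.isfinite(loss)     # rejects NaN and +/-inf
        and grad_norm > 0           # a NaN grad_norm also fails this comparison
    )


print(smoke_passed(loss=2.31, grad_norm=0.8))
print(smoke_passed(loss=float("nan"), grad_norm=0.8))
```

A zero `grad_norm` usually means no parameter received a gradient, for example when everything is frozen or the labels never reach the loss, so it is treated as a failure even when the loss looks sane.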

4c. Preflight summary: print the boxed `─ PREFLIGHT ─` summary (reference URL, dataset columns, push_to_hub repo, wandb monitoring, ngc_image, hardware, smoke result) and verify every field is filled before launching full training. Exact format: `references/project-scaffold.md`.

Gate: project files written, image built, smoke PASSED, preflight has no blank fields.

Step 5 — Train, evaluate, infer

Run in order, all commands in `references/docker-runs.md`: 5a baseline eval (§4, skip if `skip_baseline: true`), 5b full training detached (§5), 5c LoRA merge (§6, only VLM-with-LoRA), 5d post-train eval (§7), 5e inference 5 samples (§8). Multi-GPU: launch `train.py` with `torchrun --nproc_per_node=$gpu_count` instead of `python train.py`. Watch `docker logs -f hft_train`: loss should drop within 10-20 steps (flat → stop; NaN → reduce LR; OOM → halve batch; full recovery in `references/core-rules.md` / `references/error-playbook.md`). If `emit_report: true`, run `report.py` after Step 5e per `references/reporting.md`.

Gate (all must hold): `checkpoints/final/` (or `checkpoints/merged/` for LoRA) exists; `reports/eval_results.json` has a numeric primary metric; `reports/baseline_results.json` exists (unless skipped); `reports/inference_samples/` has 5 samples; wandb URL shows descending loss.
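The file-based parts of this gate can be checked mechanically. The sketch below makes several assumptions: the primary metric is a top-level key in `eval_results.json` whose name the caller passes in, "5 samples" means at least five files, and the wandb loss curve is checked separately since it cannot be verified from disk.

```python
import json
import tempfile
from numbers import Real
from pathlib import Path


def step5_gate(output_dir, primary_metric, merged_lora=False, skip_baseline=False):
    """Return a list of unmet gate conditions (an empty list means the gate passes)."""
    out = Path(output_dir)
    problems = []

    ckpt = out / "checkpoints" / ("merged" if merged_lora else "final")
    if not ckpt.is_dir():
        problems.append(f"missing checkpoints/{ckpt.name}/")

    try:
        results = json.loads((out / "reports" / "eval_results.json").read_text())
    except (OSError, ValueError):   # missing file or invalid JSON
        results = {}
    value = results.get(primary_metric) if isinstance(results, dict) else None
    # bool counts as a Real in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, Real):
        problems.append(f"eval_results.json has no numeric {primary_metric!r}")

    if not skip_baseline and not (out / "reports" / "baseline_results.json").is_file():
        problems.append("missing reports/baseline_results.json")

    samples_dir = out / "reports" / "inference_samples"
    n_samples = sum(1 for p in samples_dir.iterdir() if p.is_file()) if samples_dir.is_dir() else 0
    if n_samples < 5:
        problems.append(f"inference_samples/ has {n_samples} files, expected 5")
    return problems


# Demo against an empty temporary directory: every file-based condition fails.
with tempfile.TemporaryDirectory() as tmp:
    for problem in step5_gate(tmp, primary_metric="accuracy"):
        print("unmet:", problem)
```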

Step 6 — Push & emit rerun skill

Publish the run and make it reproducible without re-research.

6a. Push to HF Hub: use `references/hub-push.md` (pushes weights merged or final, a generated model card `README.md`, `results/{eval,baseline}_results.json`, `config.yaml`, `Dockerfile`, `requirements.txt`, `inference_samples/*.jpg`, and `report.{pdf,html}` if `emit_report: true`). Skip iff `push_to_hub: false` is explicit in `config.yaml`.

6b. Emit rerun skill at `<output_dir>/skills/run-<short>/SKILL.md` per `references/pipeline-skill-template.md`. Every `<placeholder>` must be a real value (literal placeholders are a bug); include the full YAML (`license`, `compatibility`, `metadata`, `allowed-tools`) and the NVIDIA copyright notice in an HTML comment immediately after the closing `---`, as in that template; an emitter must fail unless the emitted `SKILL.md` contains those fields and the copyright comment.
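The fail-unless check in 6b could look roughly like the validator below. It is a sketch under assumptions: front matter is split on the first two `---` markers, any `<word>` outside HTML comments is treated as a leftover placeholder, and the copyright comment is recognized by the word "copyright". The sample field values are invented; the authoritative layout is `references/pipeline-skill-template.md`.

```python
import re

REQUIRED_KEYS = ("license", "compatibility", "metadata", "allowed-tools")
# Assumption: a leftover placeholder looks like <some_name>.
PLACEHOLDER = re.compile(r"<[A-Za-z_][A-Za-z0-9_ -]*>")


def skill_md_problems(text: str) -> list[str]:
    """Return reasons the emitted SKILL.md should be rejected (empty list = accept)."""
    parts = text.split("---", 2)          # ["", front matter, body]
    if not text.startswith("---") or len(parts) < 3:
        return ["missing YAML front matter"]
    front, body = parts[1], parts[2]

    problems = []
    for key in REQUIRED_KEYS:
        if not re.search(rf"^{re.escape(key)}\s*:", front, re.MULTILINE):
            problems.append(f"front matter lacks {key}")

    comment = re.match(r"\s*<!--(.*?)-->", body, re.DOTALL)
    if not comment or "copyright" not in comment.group(1).lower():
        problems.append("no copyright HTML comment right after the closing ---")

    without_comments = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    if PLACEHOLDER.search(without_comments):
        problems.append("literal <placeholder> left")
    return problems


# Invented sample content, for illustration only.
sample = (
    "---\nname: run-example\nlicense: Apache-2.0\ncompatibility: docker\n"
    "metadata:\n  source: example\nallowed-tools: Bash\n---\n"
    "<!-- Copyright (c) NVIDIA -->\n# run-example\n"
)
print(skill_md_problems(sample))
```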

Gate (Done criteria), all must hold: Step 5 gate met; HF Hub repo exists at the resolved URL with weights + card + `results/` (unless `push_to_hub: false`); `<output_dir>/skills/run-<short>/SKILL.md` exists with no `<placeholder>` left, with metadata + copyright HTML comment per `pipeline-skill-template.md`.

Final message to user (terse, with direct URLs): wandb URL; HF Hub URL; primary metric baseline → fine-tuned (Δ); path to `reports/inference_samples/`; path to `<output_dir>/skills/run-<short>/SKILL.md`.

Error playbook

On a known runtime error, consult `references/error-playbook.md` before redesigning anything. Its symptom → minimal-fix table covers NGC ENTRYPOINT, SDPA+GQA, `transformers>=4.51` regression, numpy 2.x ABI, Albumentations bbox, PEFT + gradient_checkpointing, SmolVLM SDPA, LoRA target-regex, missing CV augmentation, OOM at step 0, and more. When a row fires twice across runs, lift it into `references/compat-workarounds.md` with a `detect` rule, auto-applied in Step 1d before the error can fire.
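The "fires twice across runs" promotion rule can be sketched as a count over per-run sets of playbook rows. The helper name and the representation of a run as a set of row labels are assumptions; writing the actual `detect` rule remains a manual step.

```python
from collections import Counter


def rows_to_promote(run_errors: list[set[str]], threshold: int = 2) -> list[str]:
    """Return playbook rows seen in at least `threshold` distinct runs."""
    # Each run is a set, so a row firing repeatedly within one run counts once.
    counts = Counter(row for run in run_errors for row in run)
    return sorted(row for row, n in counts.items() if n >= threshold)


history = [
    {"numpy 2.x ABI"},
    {"numpy 2.x ABI", "OOM at step 0"},
]
print(rows_to_promote(history))
```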

Communication style

Terse: no filler, no restating the request; always include direct Hub + wandb URLs; on error state what went wrong, why, what you changed (no menus, no "Option A/B/C" when the answer is clear; act). Full text: `references/core-rules.md`.

Example pipelines

tao-rerun-convnext-cifar10 — facebook/convnext-tiny-224 on cifar10 (image-classification, 10 classes, subset 5000/1000).
tao-rerun-detr-cppe5 — facebook/detr-resnet-50 on cppe-5 (object-detection, 5 classes, subset 800/200).
tao-rerun-segformer-foodseg103 — nvidia/mit-b0 on EduardoPacheco/FoodSeg103 (semantic segmentation, 103 classes + background, subset 1000/200).
tao-rerun-smolvlm-vqav2 — HuggingFaceTB/SmolVLM-256M-Instruct on merve/vqav2-small (image-text-to-text VLM LoRA, subset 500/100, 5 epochs).
