<!--
Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
tao-finetune-huggingface-model
Local NVIDIA GPU fine-tuning for HuggingFace models, grounded in live-fetched
documentation with curated references as a fallback safety net. One NGC
container, a small set of focused scripts, one push to HF Hub. Behavior is
governed by the rules in this file — follow them, do not improvise.
Order of authority (highest first): (1) user input → (2) live research
(model card, HF repo example, author script, task docs, paper; always fetched,
Step 3) → (3) curated references (fallback when live research is silent) →
(4) training-data memory (last resort, suspect). On conflict, live research wins
for the specific model + current API. See
for the
full order and conflict-resolution rules.
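The order of authority above can be sketched as a first-match lookup. This is a minimal illustration, not the skill's actual resolver: the source labels in `AUTHORITY_ORDER` and the function name `resolve` are hypothetical names chosen for the example.

```python
from typing import Optional

# Highest authority first, mirroring the order stated above.
# The label strings are hypothetical; only their ordering comes from the text.
AUTHORITY_ORDER = [
    "user_input",
    "live_research",
    "curated_reference",
    "training_memory",
]


def resolve(candidates: dict[str, Optional[str]]) -> tuple[str, str]:
    """Return (source, value) from the highest-authority source that has a value."""
    for source in AUTHORITY_ORDER:
        value = candidates.get(source)
        if value is not None:
            return source, value
    # Nothing to go on: the rules say to stop and ask rather than improvise.
    raise LookupError("no source supplied a value; stop and ask the user")
```

For example, `resolve({"live_research": "A", "training_memory": "B"})` returns the live-research value, because memory is consulted only when every higher source is silent.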
Inputs
Required:
- — HuggingFace model ID, e.g.
google/vit-base-patch16-224
Conditional credentials (loaded by the SessionStart hook from ):
- — only when the model/dataset is gated (read) or is on (write); public + runs don't need it. The agent never reads the value — only checks presence with .
- , — only when WandB is enabled; set to opt out.
Dataset — exactly one:
- — HuggingFace dataset ID (source: )
- — local folder or file (source: ); optional ∈ {auto, imagefolder, coco, voc, jsonl, arrow, parquet, csv} (default auto-detect).
- (omit) — agent recommends popular datasets (source: )
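A minimal sketch of the "exactly one" dataset rule, assuming the two inputs arrive as optional strings. The parameter names `dataset_id` and `dataset_path` and the returned labels are assumptions made for illustration; the skill's real field names and source values may differ.

```python
from typing import Optional


def pick_dataset_source(dataset_id: Optional[str] = None,
                        dataset_path: Optional[str] = None) -> str:
    """Classify the dataset input; reject the ambiguous case where both are set."""
    if dataset_id and dataset_path:
        raise ValueError("pass a Hub dataset ID or a local path, not both")
    if dataset_id:
        return "hub"        # hypothetical label for a HuggingFace dataset ID
    if dataset_path:
        return "local"      # hypothetical label for a local folder or file
    return "recommend"      # neither given: the agent recommends datasets
```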
Optional (have defaults): (auto-detected);
,
,
,
;
output_dir=./output/<model_short_name>
;
(push target; if unset and HF_TOKEN has write access,
auto-derived as
<whoami>/<model_short_name>-finetuned
);
(set
to skip);
(skip zero-shot baseline eval).
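The two derived defaults above follow a simple string pattern, sketched here. How `<model_short_name>` is actually computed is not stated in this file; the sketch assumes it is the part of the model ID after the last slash. The write-access check on the token is out of scope for the example.

```python
def model_short_name(model_id: str) -> str:
    """Assumption: the short name is the model ID without its org prefix."""
    return model_id.rsplit("/", 1)[-1]


def default_output_dir(model_id: str) -> str:
    """Default output directory: ./output/<model_short_name>."""
    return f"./output/{model_short_name(model_id)}"


def default_hub_repo(whoami: str, model_id: str) -> str:
    """Auto-derived push target: <whoami>/<model_short_name>-finetuned."""
    return f"{whoami}/{model_short_name(model_id)}-finetuned"
```

Under that assumption, `google/vit-base-patch16-224` with a hypothetical Hub user `alice` gives `./output/vit-base-patch16-224` and `alice/vit-base-patch16-224-finetuned`.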
Optional deliverables (off by default): →
(per-step ✅/⚠️/❌ journal);
→
reports/report.{pdf,html}
with curves & samples;
→
with fake-data heterogeneous-batch tests.
All values live in
. Never hardcode in Python.
Execution platform
This skill orchestrates what to run; the platform skills own how (read them
first, do not redraft their conventions here):
tao-setup-nvidia-gpu-host
(GPU host runtime — driver 580, CUDA Toolkit 13.0, NVIDIA Container Toolkit
1.19.0),
(
flags, NGC auth,
, mounts, env passthrough,
/
, error modes), and
(local Docker job preflight — daemon reachable, GPU smoke).
Default platform: — build a one-off image
(
) and run it on the local Docker daemon. Ask only if the
user needs a different backend (Brev, Lepton/SLURM/Kubernetes). See
references/execution-platform.md
for that path plus the alternate-backend
routing, the GPU-runtime preflight, the credentials policy, and the
conventions.
References — fallback safety net
Curated references are consulted only when live research is silent,
ambiguous, or unavailable; live docs always win for the specific model + current
API. The workflow steps below link the file each step needs directly. Before
falling back, log the live source you tried and why it was insufficient (in
, and PROGRESS.md if enabled).
markers in
/
are a research checklist, not code to inline —
if a block has no Step 3 finding, refetch the listed URL.
See references/reference-index.md
for the complete index — every always-on
reference plus the three opt-in ones gated by a flag (
←
,
←
,
←
), each with its per-step role.
Core rules
The non-negotiable behaviors. Full text in
.
Short version:
- Your HF-library knowledge is outdated. Fetch live docs before writing any
ML code; never generate trainer args / collator / transforms from memory (Step 3).
- Smoke-test on real data with before any full run.
- Never silently substitute model_id, dataset_id, or training_method — stop and ask.
- Error recovery is minimal-change. OOM → halve batch, double grad_accum,
enable gradient checkpointing (don't switch to LoRA without approval); NaN →
reduce LR 10×; flat loss → inspect collator; same error 3× → stop and ask.
- Dataset columns verified BEFORE the collator. Rename → ;
restructuring → stop and ask.
- Hardware sizing (bf16): ≤3B → 24 GB, 7–13B → 80 GB, 30B+ → multi-GPU or
LoRA on 1× 80 GB, 70B+ → 8× 80 GB or LoRA. Won't fit + no LoRA request → ask.
has the full enumeration (hallucinated imports,
never-without-approval list, full error-recovery + hardware-sizing tables).
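The minimal-change recovery rule in the short version can be sketched as a pure function over the training settings. The `TrainSettings` fields and the `recover` signature are hypothetical names for this example; only the adjustments themselves (halve batch, double grad_accum, enable gradient checkpointing, LR ÷ 10, stop on the third repeat) come from the rules above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainSettings:
    batch_size: int
    grad_accum: int
    learning_rate: float
    gradient_checkpointing: bool = False


def recover(settings: TrainSettings, error_kind: str, repeat_count: int) -> TrainSettings:
    """Return minimally adjusted settings; repeat_count includes the current failure."""
    if repeat_count >= 3:
        raise RuntimeError("same error 3 times; stop and ask the user")
    if error_kind == "oom":
        # Halving the batch while doubling accumulation keeps the effective
        # batch size unchanged (as long as batch_size was at least 2).
        return replace(
            settings,
            batch_size=max(1, settings.batch_size // 2),
            grad_accum=settings.grad_accum * 2,
            gradient_checkpointing=True,
        )
    if error_kind == "nan":
        return replace(settings, learning_rate=settings.learning_rate / 10)
    raise ValueError(f"no automatic recovery for {error_kind!r}; consult the playbook")
```

Note that the sketch never switches the training method: moving to LoRA stays a decision for the user.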
Workflow — 6 steps
Single pass, sequential. Each step has a clear gate before the next begins.
Step 1 — Inspect & qualify
Decide whether to proceed at all.
1a. Probe model and 1b. Probe dataset via two CPU-only containerized probes
(no host Python prereqs): the model probe reports
,
,
, head
counts; the dataset probe verifies loadability + column schema. Detect
from
+
+ card body (card silent on
→
references/model-discovery.md
, log under
). For
, present 3–5 picks from
references/dataset-recommendations.md
; for
, use
references/dataset-sources.md
loaders.
1c. Accept/reject,
1d. walk
references/compat-workarounds.md
recording matches in
, then
1e. write the skeleton.
See references/step1-probes.md
for the full probe scripts +
invocations, the Docker-daemon preflight, prerequisites (
, optional
/
,
default
./output/<model_short_name>
bind-mounted by Steps 4–5), dataset-column verification + rename rule, the full
reject criteria, compat-walk detail, the exact skeleton, and
cleanup.
Gate: exists with model, dataset, task, applicable_workarounds.
Do not proceed if any field is missing.
Step 2 — Hardware audit & NGC image
Verify Docker + GPU + disk, pick the NGC PyTorch image live, finalize
hardware-dependent compat rules.
2a. Audit (hard gate) via
tao-setup-nvidia-gpu-host --check-only
(driver branch 580, CUDA Toolkit 13.0,
NVIDIA Container Toolkit 1.19.0); on failure ask to authorize the install, then
re-run; soft-warn on
free disk; check only the credentials this run
needs;
do not proceed to Step 4 on a hard-fail; record
,
,
,
.
2b. Pick NGC image (live) —
highest-versioned PyTorch NGC image with
Min driver ≤ driver_major
and
container CUDA
host CUDA Toolkit (never reject for an
/
/
suffix); WebFetch fail →
references/hardware-container.md
fallback.
2c.
Re-evaluate -dependent compat rules.
2d. Model-fit check — bf16
param_bytes ≈ 2×param_count
; if > 60% of
, recommend
LoRA.
See references/hardware-audit-ngc.md
for the full audit script, the soft-warn
- override, live-selection rules, the support-matrix WebFetch URL,
the / SDPA+GQA
attn_implementation: "eager"
fallback, and the
could not select device driver
failure note.
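The model-fit check in 2d is a one-line estimate, sketched below. The quantity the 60% threshold applies to is not legible in this file; the sketch assumes it is the memory of a single GPU, and the function and parameter names are hypothetical.

```python
def recommend_lora(param_count: int, gpu_mem_bytes: int, threshold: float = 0.60) -> bool:
    """True when bf16 weights alone exceed the threshold share of GPU memory.

    bf16 stores each parameter in 2 bytes, so param_bytes ≈ 2 × param_count.
    Assumption: gpu_mem_bytes is the memory of one GPU.
    """
    param_bytes = 2 * param_count
    return param_bytes > threshold * gpu_mem_bytes
```

With a 24 GiB GPU (`24 * 1024**3` bytes), a hypothetical 13B-parameter model needs about 26 GB for weights and trips the check, while an 86M-parameter model does not. The estimate covers weights only; gradients, optimizer state, and activations add to it, which is why the threshold sits well below 100%.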
Gate: has
,
,
,
,
. Hardware-dependent compat fixes are recorded.
Step 3 — Research the recipe
Fetch the live recipe — the agent's
/
/
memory is
suspect, so Step 3 is non-negotiable. Walk references/research-priorities.md
in priority order (Priority 1 → 6).
Stop once you have, for the detected task: the
/ processor class,
train + eval transforms, collator,
, and hyperparameter hints
(LR, batch size, epochs, scheduler). Record findings in
and
append source URLs to
config.yaml: research_sources:
. If a slot has no live
finding, fall back to the matching scaffold (
/
) and log "fallback to scaffold — no live source for <slot>"
under
. Conflict-resolution rules:
references/research-priorities.md.
Gate: every required slot above is filled, with a source URL or an explicit
scaffold-fallback note.
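The Step 3 gate can be checked mechanically once findings are recorded per slot. This is a sketch under assumptions: the slot names, the `source_url` and `note` keys, and the shape of the findings mapping are all hypothetical, since the real record format is defined elsewhere.

```python
FALLBACK_PREFIX = "fallback to scaffold"  # start of the fallback log message


def unfilled_slots(findings: dict[str, dict], required: list[str]) -> list[str]:
    """Return required slots that have neither a source URL nor a fallback note."""
    missing = []
    for slot in required:
        entry = findings.get(slot) or {}
        has_url = bool(entry.get("source_url"))
        has_fallback = str(entry.get("note", "")).startswith(FALLBACK_PREFIX)
        if not (has_url or has_fallback):
            missing.append(slot)
    return missing
```

The gate passes only when the returned list is empty; any slot left in it means Step 3 is not finished.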
Step 4 — Generate project & smoke-test
Write all scripts, build the image, prepare data, run a 1-step smoke on real
data (one
, two
s).
4a. Generate project files in
—
,
,
,
,
,
(eval script
MUST be
, never
— collides with HF
),
,
for VLM-LoRA,
. Authority order: Step 3
live research → scaffold reference (
/
) for
structure only, never their
blocks. Apply each
entry as a Dockerfile block, requirements pin, config
override, or runtime env var. Every generated
begins with the NVIDIA
Apache-2.0
-comment copyright header (emitter must fail otherwise). If
, also generate
per
. See
references/project-scaffold.md
for the full file table, the exact copyright
header, and the Dockerfile template (deps → compat → code layer order).
4b. Build, prepare, smoke —
docker build -t run-<short>:latest .
, then run
references/docker-runs.md
§1 (build), §2 (prepare_data), §3 (smoke,
); §3 lists the smoke pass criteria (no exception, loss
finite,
at step 1). If
, also run
inside the container. Any failure → STOP.
4c. Preflight summary — print the boxed
summary (reference
URL, dataset columns, push_to_hub repo, wandb monitoring, ngc_image, hardware,
smoke result) and verify every field is filled before launching full training.
Exact format: references/project-scaffold.md.
Gate: project files written, image built, smoke PASSED, preflight has no
blank fields.
Step 5 — Train, evaluate, infer
Run in order, all commands in references/docker-runs.md:
5a baseline eval
(§4, skip if
),
5b full training detached (§5),
5c
LoRA merge (§6, only VLM-with-LoRA),
5d post-train eval (§7),
5e
inference 5 samples (§8). Multi-GPU: prepend
torchrun --nproc_per_node=$gpu_count
to
. Watch
: loss should drop within
10–20 steps (flat → stop; NaN → reduce LR; OOM → halve batch; full recovery in
+
references/error-playbook.md
). If
, run
after Step 5e per
.
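The multi-GPU rule above amounts to swapping the launcher, as in this sketch. The script name `train.py` in the example call is a placeholder assumption; the actual command comes from references/docker-runs.md.

```python
def launch_command(script_and_args: list[str], gpu_count: int) -> list[str]:
    """Build the training command, using torchrun only when more than one GPU is present."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    if gpu_count == 1:
        return ["python", *script_and_args]
    # torchrun starts one worker process per GPU on this node.
    return ["torchrun", f"--nproc_per_node={gpu_count}", *script_and_args]
```

For instance, `launch_command(["train.py"], 4)` yields `["torchrun", "--nproc_per_node=4", "train.py"]`.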
Gate: all of —
(or
for LoRA)
exists;
reports/eval_results.json
has a numeric primary metric;
reports/baseline_results.json
exists (unless skipped);
reports/inference_samples/
has 5 samples; wandb URL shows descending loss.
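The file-based parts of this gate can be verified with a short script. The sketch below covers only the three `reports/` checks; the checkpoint and wandb conditions are left out because their exact paths are not given here. The `primary_metric` key name is supplied by the caller and is an assumption about how the eval JSON is laid out.

```python
import json
from pathlib import Path


def step5_gate_failures(reports_dir: str, primary_metric: str,
                        skip_baseline: bool = False,
                        expected_samples: int = 5) -> list[str]:
    """Return human-readable reasons the reports/ part of the gate fails."""
    reports = Path(reports_dir)
    failures = []

    eval_path = reports / "eval_results.json"
    if not eval_path.is_file():
        failures.append("eval_results.json is missing")
    else:
        data = json.loads(eval_path.read_text())
        value = data.get(primary_metric) if isinstance(data, dict) else None
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            failures.append(f"primary metric {primary_metric!r} is not numeric")

    if not skip_baseline and not (reports / "baseline_results.json").is_file():
        failures.append("baseline_results.json is missing")

    samples_dir = reports / "inference_samples"
    count = sum(1 for p in samples_dir.iterdir() if p.is_file()) if samples_dir.is_dir() else 0
    if count < expected_samples:
        failures.append(f"inference_samples has {count} of {expected_samples} samples")

    return failures
```

An empty list means these three conditions hold; anything else should block Step 6.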
Step 6 — Push & emit rerun skill
Publish the run and make it reproducible without re-research.
6a. Push to HF Hub — use
(pushes weights merged or
final, a generated model card
,
results/{eval,baseline}_results.json
,
,
,
,
, and
if
). Skip iff
is
explicit in
.
6b. Emit rerun skill at <output_dir>/skills/run-<short>/SKILL.md per
references/pipeline-skill-template.md. Every
must be a real
value (literal placeholders are a bug); include the full YAML (
,
,
,
) and the NVIDIA copyright notice in
an HTML comment immediately after the closing
, as in that template; an
emitter must fail unless the emitted
contains those fields and the
copyright comment.
Gate (Done criteria): all of — Step 5 gate met; HF Hub repo exists at the
resolved URL with weights + card +
(unless
);
<output_dir>/skills/run-<short>/SKILL.md
exists with no
left,
with metadata + copyright HTML comment per
pipeline-skill-template.md
.
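A fail-closed check on the emitted SKILL.md might look like the sketch below. Several details are assumptions: the placeholder syntax is taken to be `{{name}}`, the required metadata fields are passed in by the caller, and a copyright comment is recognized as any HTML comment mentioning both "Copyright" and "NVIDIA". The authoritative format is the one in pipeline-skill-template.md.

```python
import re

# Assumed placeholder syntax; adjust to whatever the template really uses.
PLACEHOLDER = re.compile(r"\{\{\s*[\w.-]+\s*\}\}")
HTML_COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def validate_emitted_skill(text: str, required_fields: list[str]) -> list[str]:
    """Return problems found in an emitted SKILL.md; an empty list means it passes."""
    problems = []

    leftovers = PLACEHOLDER.findall(text)
    if leftovers:
        problems.append(f"unfilled placeholders: {sorted(set(leftovers))}")

    for field in required_fields:
        # A field counts as present when some line starts with "<field>:".
        if not re.search(rf"^{re.escape(field)}\s*:", text, re.MULTILINE):
            problems.append(f"missing metadata field: {field}")

    comments = HTML_COMMENT.findall(text)
    if not any("Copyright" in c and "NVIDIA" in c for c in comments):
        problems.append("missing NVIDIA copyright HTML comment")

    return problems
```

An emitter following the rule above would raise or exit non-zero whenever this list is non-empty, rather than writing a partially filled file.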
Final message to user — terse, with direct URLs: wandb URL; HF Hub URL;
primary metric baseline → fine-tuned (Δ); path to reports/inference_samples/;
path to <output_dir>/skills/run-<short>/SKILL.md.
Error playbook
On a known runtime error, consult references/error-playbook.md before
redesigning anything — its symptom → minimal-fix table covers NGC ENTRYPOINT,
SDPA+GQA,
regression, numpy 2.x ABI, Albumentations bbox,
PEFT + gradient_checkpointing, SmolVLM SDPA, LoRA target-regex, missing CV
augmentation, OOM at step 0, and more. When a row fires twice across runs, lift
it into
references/compat-workarounds.md
with a
rule, auto-applied in
Step 1d before the error can fire.
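The promotion rule above can be sketched as a count over run history. The sketch reads "fires twice across runs" as "fired in at least two distinct runs", which is an interpretation; the event format and row identifiers are hypothetical.

```python
from collections import defaultdict


def rows_to_promote(events: list[tuple[str, str]], min_runs: int = 2) -> list[str]:
    """Given (run_id, playbook_row) events, return rows seen in at least min_runs runs."""
    runs_by_row: dict[str, set[str]] = defaultdict(set)
    for run_id, row in events:
        runs_by_row[row].add(run_id)
    return sorted(row for row, runs in runs_by_row.items() if len(runs) >= min_runs)
```

Counting distinct runs rather than raw occurrences keeps a single noisy run from promoting a row on its own.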
Communication style
Terse: no filler, no restating the request; always include direct Hub + wandb
URLs; on error state what went wrong, why, what you changed (no menus, no
"Option A/B/C" when the answer is clear — act). Full text:
.
Example pipelines
- tao-rerun-convnext-cifar10 — facebook/convnext-tiny-224 on cifar10 (image-classification, 10 classes, subset 5000/1000).
- tao-rerun-detr-cppe5 — facebook/detr-resnet-50 on cppe-5 (object-detection, 5 classes, subset 800/200).
- tao-rerun-segformer-foodseg103 — nvidia/mit-b0 on EduardoPacheco/FoodSeg103 (semantic-segmentation, 103 classes + background, subset 1000/200).
- tao-rerun-smolvlm-vqav2 — HuggingFaceTB/SmolVLM-256M-Instruct on merve/vqav2-small (image-text-to-text VLM LoRA, subset 500/100, 5 epochs).