<!--
Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
TAO-HF Integration Skill
Integrate a HuggingFace (HF) Computer Vision model into the NVIDIA TAO Toolkit ecosystem. Work the phases iteratively — not purely linearly — following a build → test → debug → fix → retest loop at every step.
This SKILL.md is the workflow coordinator. Each phase has a dedicated reference file under
with the full step-by-step content, code blocks, docker invocations, and gates. Read the matching reference at the start of each phase — the summaries below are not sufficient on their own.
Local-Only Rule
All work is strictly local. You may only read/clone from remotes; all file edits, Docker builds, and test runs stay on the local machine. Do NOT
/
/create remote branches (GitLab, GitHub, HuggingFace), create merge requests / pull requests / issues, or upload/publish/push Docker images to any registry or artifact store. This follows from the bind-mounted local-clone layout in
references/execution-and-debugging.md.
Submodule Override & Execution Platform
is the default platform. The user clones the four TAO repos (
,
,
,
) independently into one working directory; each repo also carries nested
(and
)
submodules pinned at the original, unmodified commit, which makes them stale; modifications live only in the top-level
.
Always install from the top-level , never from (the nested submodule silently drops all modifications). The override of the CI
is three rules: mount the whole working directory (
);
pip install /workspace/tao-core
FIRST so modified schemas win; put top-level tao-core first on
(
-e PYTHONPATH=/workspace/tao-core:/workspace/tao-pytorch
).
Every test, smoke run, and end-to-end validation runs inside a locally prepared TAO Toolkit container (
,
, optionally
tao-dataservices-base:latest
, all from Phase 0), with local clones bind-mounted at
and installed via
pip install /workspace/tao-core
+
. All Python work runs in containers — no host venvs, no host
s. The platform skills own the
how of running containers — host GPU runtime via
tao-setup-nvidia-gpu-host
;
flags / NGC auth / mounts / env passthrough /
/
/ inspection / error modes via
and
. This workflow specifies only
what to run inside them and never forks those conventions. The annotated working-directory tree, canonical
flag set with the workflow-specific
/
/install-shell additions, three isolation contexts, four isolation rules, the
Development Loop, and the
Debugging Playbook table:
references/execution-and-debugging.md.
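The three override rules in this section can be sketched as a small Python helper that assembles a `docker run` argument list. This is a minimal sketch, not the canonical flag set (that lives in references/execution-and-debugging.md and the platform skills). The function name, the `image` and `inner_cmd` parameters, the use of `bash -c`, and the `/workspace` mount target (inferred from the `/workspace/tao-core` install path) are assumptions made for illustration.

```python
from pathlib import Path


def build_docker_cmd(workdir: str, image: str, inner_cmd: str) -> list[str]:
    """Assemble a docker run argv that applies the three override rules.

    Hypothetical helper: GPU, NGC auth, and other flags are owned by the
    platform skills and are deliberately omitted here.
    """
    # Rule 2: install the top-level tao-core FIRST so modified schemas win.
    install_then_run = f"pip install /workspace/tao-core && {inner_cmd}"
    return [
        "docker", "run", "--rm",
        # Rule 1: mount the whole working directory (assumed target: /workspace).
        "-v", f"{Path(workdir).resolve()}:/workspace",
        # Rule 3: top-level tao-core comes first on PYTHONPATH.
        "-e", "PYTHONPATH=/workspace/tao-core:/workspace/tao-pytorch",
        image,
        "bash", "-c", install_then_run,
    ]
```

Passing the result to `subprocess.run(cmd, check=True)` would launch the container. Keeping the command as a list avoids shell-quoting problems on the host side.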
Phase Map
The eight phases, Phase 0 through Phase 7 (full goals + gates below; references per phase):
- Phase 0 — Prerequisites + TAO Toolkit images + local image tags: phase-0-prereqs.md
- Phase 1 — HF-inspection environment, validate HF model + dataset: phase-1-inspection.md, hf-inspection.md
- Phase 2 — Closest existing TAO reference model: phase-2-codebase.md, task-type-guide.md
- Phase 3 — tao-core config + tao-pytorch trainer / native eval / inference: phase-3-implementation.md, tao-patterns.md, repo-structure.md
- Phase 4 — ONNX export + tao-deploy TRT engine, inference, evaluation: phase-4-deploy.md
- Phase 5 — Packaging (console_scripts) + L0 tests: phase-5-packaging.md
- Phase 6 — Container-based testing + end-to-end pipeline validation: phase-6-container-tests.md, docker-patterns.md
- Phase 7 — (conditional) Accuracy / latency / size tuning: phase-7-optimization.md
IMPORTANT — Continuous Execution Through Phase 6: Do NOT stop after implementation (Phases 3–5) to wait for the user to run tests; immediately proceed to the mandatory Phase 6. The implementation is not complete until tests pass inside the TAO Toolkit containers and the end-to-end pipeline is validated. Apply the build-test-debug loop at every step — write, test immediately, fix on failure, never accumulate untested code.
Phase 0 — Prerequisites Check
Goal: verify Python 3.10+ and
; delegate the NVIDIA driver / CUDA / Docker / NVIDIA Container Toolkit host check to
tao-setup-nvidia-gpu-host
; verify NGC
for
. Then
ask the user for the TAO Toolkit image references (tao-pytorch, tao-deploy, optionally tao-dataservices), pull them, and prepare local image tags
,
,
tao-dataservices-base:latest
for Phases 3–6. Preparation strips the released TAO packages already in those images so the user's local clones (mounted at
) install and get picked up at run time.
Hard stop if any check fails. Full commands, user-prompt wording, and per-image preparation snippets: phase-0-prereqs.md.
Gate: all prerequisite checks pass; the user has supplied the required image references;
and
exist locally;
tao-dataservices-base:latest
exists if dataservices work is expected.
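The host-side part of the Phase 0 check can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the default tool list are assumptions, and the authoritative commands are in phase-0-prereqs.md. The driver / CUDA / NVIDIA Container Toolkit checks stay delegated to `tao-setup-nvidia-gpu-host` and are not reproduced here.

```python
import shutil
import sys


def python_ok(minimum=(3, 10)):
    """True if the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


def missing_tools(tools):
    """Return the subset of command names that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def prereq_failures(tools=("docker",)):
    """Collect human-readable failures; an empty list means the checks passed.

    The default tool list is a placeholder, not the workflow's full list.
    """
    failures = []
    if not python_ok():
        found = ".".join(str(part) for part in sys.version_info[:3])
        failures.append(f"Python 3.10+ required, found {found}")
    failures.extend(f"{tool} not found on PATH" for tool in missing_tools(tools))
    return failures
```

A caller would treat any non-empty result as the hard stop described above, for example by printing the failures and exiting with a non-zero status.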
Phase 1 — Information Gathering & Validation
Goal: decide whether to proceed. Gather credentials, locate (or clone) the four TAO repos and create a consistent local working branch across them, launch the long-lived
container (isolation Context A), validate that the HF model is a CV model with a supported
, extract config + state-dict schema, sanity-check ONNX export, and clean up. Full step-by-step (1.1–1.7):
phase-1-inspection.md; generic patterns: hf-inspection.md.
Reject if is NLP / audio / LLM (out of CV scope),
raises, or ONNX export fundamentally cannot work and has no rewrite path.
Gate: all 4 TAO repos located/cloned with a consistent working branch;
confirmed CV;
,
,
,
extracted; state-dict keys documented and the HF→TAO remapping plan drafted; ONNX sanity check passed (or failure mode understood); user confirmed
and task type. Present findings and confirm before proceeding.
Phase 2 — Codebase Exploration
Goal: find the closest existing TAO reference model for the detected
(classification →
, detection →
/
, segmentation →
, instance →
, panoptic →
, zero-shot →
, depth →
), read its full implementation across
,
, and
, and decide whether the backbone already exists in
. The chosen reference drives everything downstream — config structure, architecture, loss, ONNX export shape, TRT builder, deploy inferencer/loader, metrics, dataset format. The full reference list (12 files per model), the
coverage check (it already provides
,
,
,
, and others), and the
coverage check:
phase-2-codebase.md; per-task details: task-type-guide.md.
If a new backbone is needed, decide the strategy (timm wrap > re-implement from scratch > HF black-box wrap) before Phase 3 — it changes weight loading, ONNX export, and the deploy pipeline.
Never dual-inherit from transformers.PreTrainedModel
and (metaclass conflict).
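The dual-inheritance warning can be illustrated with plain Python stand-ins. The classes below are hypothetical and do not reproduce the real framework classes or their metaclasses. The sketch only shows the general mechanism: Python refuses to create a class whose bases have unrelated metaclasses. It also shows composition (wrapping one object inside the other) as one way around it.

```python
class MetaA(type):
    """Stand-in metaclass for the first framework."""


class MetaB(type):
    """Stand-in metaclass for the second framework."""


class FrameworkBaseA(metaclass=MetaA):
    """Hypothetical base class from framework A."""

    def forward(self, x):
        return x


class FrameworkBaseB(metaclass=MetaB):
    """Hypothetical base class from framework B."""


def try_dual_inherit():
    """Attempt the dual inheritance and return the error text, or None."""
    try:
        class Both(FrameworkBaseA, FrameworkBaseB):
            pass
    except TypeError as exc:
        return str(exc)
    return None


class Wrapper(FrameworkBaseB):
    """Composition instead of dual inheritance: hold the A-side object."""

    def __init__(self, inner):
        self.inner = inner

    def forward(self, x):
        # Delegate to the wrapped object rather than inheriting from it.
        return self.inner.forward(x)
```

With composition, the wrapper inherits from only one framework base, so class creation never has to reconcile two metaclasses.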
Gate: reference TAO model identified and all 12 locations read; task-type implications understood (architecture, loss, ONNX outputs, deploy classes, metrics, dataset); backbone coverage decided (reuse / wrap timm / new); dataservices coverage checked.
Phase 3 — TAO Core Configuration & Native Implementation
Goal: write the tao-core config schema and the tao-pytorch trainer + native inference + native evaluation, smoke-testing in between. Use
(
from Phase 1) and
(
). Seven steps: (1)
config under
—
ExperimentConfig(CommonExperimentConfig)
MUST contain
,
,
,
,
,
,
,
; (2)
trainer under
(
,
<ModelName>PlModel(TAOLightningModule)
,
, entrypoint,
; new backbone → add+register
cv/backbone_v2/<backbone_name>.py
); (3) multi-GPU/multi-node via the entrypoint's
; (4) native inference →
; (5) native evaluation →
; (6–7) MLOps wiring (
→
). Consistency rules (including
vs
and
= required
) are enforced by the Cross-Phase checklist below.
Full per-step code and the canonical
:
phase-3-implementation.md (with snippets tao-patterns.md, layout repo-structure.md, per-task task-type-guide.md).
Gates: Step 1 —
imports cleanly in the container; Step 2 —
runs and the PLModel instantiates; overall — all 7 steps complete, smoke tests pass, no missing
.
Phase 4 — Export, Deployment & TensorRT Integration
Goal: ship ONNX export from tao-pytorch, then a TRT engine builder + TRT inference + TRT evaluation in tao-deploy that reuse the tao-core
. Four steps (8–11): ONNX export (
, per-task input/output names,
⇒ dynamic batch); TRT engine builder (
, subclasses
or reuses
ClassificationEngineBuilder
, writes
specs/{gen_trt_engine,inference,evaluate}.yaml
); TRT inference (NumPy-only
→
); TRT evaluation (sklearn/pycocotools →
). Full code and the Phase 3+4 gate: phase-4-deploy.md.
Module pitfall: tao-pytorch and tao-deploy have
separate and
implementations — use the deploy versions in deploy scripts;
is imported from
in both repos (same schema, same field paths).
Phase 3+4 gate: all three in-container checks pass —
imports + model + ONNX export, and
imports.
Phase 5 — Packaging & L0 Testing
Goal: register the model as a
'<model_name>=...entrypoint.<model_name>:main'
console_script in both
and
(deploy entrypoint uses
nvidia_tao_deploy.cv.common.entrypoint.entrypoint_hydra
), and add L0 tests — deploy tests (
tao-deploy/tests/<model_name>/
, subprocess +
) and trainer tests (
tao-pytorch/tests/cv_unit_test/<model_name>/
,
Trainer(..., fast_dev_run=True)
, markers
@pytest.mark.cv_unit @pytest.mark.<model_name>
). Full code and test layout: phase-5-packaging.md.
Gate: entrypoints registered; pytest files exist and follow the marker convention. Do NOT stop here — proceed directly to Phase 6.
Cross-Phase Data Flow & Consistency Verification
Before Docker testing, verify the artifact chain —
produces
<results_dir>/train/<model_name>_model_latest.pth
→
→
<results_dir>/export/<model_name>.onnx
→
→
<results_dir>/trt/<model_name>.engine
→
/
. Then confirm the consistency checklist: the
name;
/
matching across the training spec,
,
, and builder
; ONNX
/
;
/
vs
;
vs
; shared
; and an
in every package dir (including
for
discovery). Full interpolation paths, itemized checklist, and config field paths: workflow-consistency.md.
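The artifact chain above can be checked mechanically. The sketch below uses only the three file paths named in this section. The function names and the stage labels (`checkpoint`, `onnx`, `engine`) are assumptions for illustration, and the sketch does not cover the final inference/evaluation outputs.

```python
from pathlib import Path


def expected_artifacts(results_dir, model_name):
    """Map each pipeline stage to the artifact path it should produce."""
    root = Path(results_dir)
    return {
        "checkpoint": root / "train" / f"{model_name}_model_latest.pth",
        "onnx": root / "export" / f"{model_name}.onnx",
        "engine": root / "trt" / f"{model_name}.engine",
    }


def missing_artifacts(results_dir, model_name):
    """Return the stages whose artifact is absent or empty, in chain order."""
    return [
        stage
        for stage, path in expected_artifacts(results_dir, model_name).items()
        if not path.is_file() or path.stat().st_size == 0
    ]
```

Because the stages are listed in chain order, the first entry of the result points at the earliest step that needs to be rerun.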
Phase 6 — Container Testing & End-to-End Validation
Mandatory — start immediately after Phase 5. All TAO models ship as Docker images; code that only works outside a container is incomplete. Testing runs
directly inside the TAO Toolkit container (no Docker image build in the test loop): mount the local source into the Phase-0 image tags, install via
, and invoke
/
/
/
directly — use vanilla
+ lint binaries, NOT any
ci/run_functional_tests.py
/
wrappers (those exist only in NVIDIA's internal mirrors; the public
mirrors have no
directory).
Steps 16–25, in order: verify the local image tags (16); container
for tao-core (17), tao-pytorch (18,
,
), tao-deploy (19); static/lint tests (20,
+ optional
/
); wheel builds (21); the end-to-end pipeline (22 — train dry-run + export in
one tao-pytorch session, then gen_trt_engine + inference + evaluate in
one tao-deploy session, since
discards installed packages); native-vs-TRT cross-check (23 — FP32 ≈ exact, FP16 ≈ small delta, divergence ⇒ ONNX/TRT issue); interactive debug shells (24); optional release Docker image build (25, distribution-only). Full per-step commands and the fix-and-retest loop:
phase-6-container-tests.md; build scripts, runner patterns, requirements, CI conventions: docker-patterns.md.
Phase 6 gate (Done criteria): tao-core / tao-pytorch / tao-deploy unit tests pass in their TAO Toolkit containers; static tests pass (or only legacy lint warnings); wheels build; end-to-end
<model_name>_model_latest.pth
→
→
→ non-empty
and
; native vs TRT predictions agree within tolerance.
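The native-vs-TRT cross-check (step 23) amounts to comparing two sets of predictions under a precision-dependent tolerance. The sketch below uses flat lists of floats and the standard library only. The tolerance values are placeholders chosen for illustration, not values from this workflow, and the function names are hypothetical.

```python
# Placeholder tolerances (assumptions): FP32 should be near-exact,
# FP16 is allowed a small delta.
TOLERANCES = {"fp32": 1e-4, "fp16": 1e-2}


def max_abs_diff(native, trt):
    """Largest element-wise absolute difference between two prediction lists."""
    if len(native) != len(trt):
        raise ValueError("prediction lists differ in length")
    return max((abs(a - b) for a, b in zip(native, trt)), default=0.0)


def predictions_agree(native, trt, precision="fp32"):
    """True if native and TRT outputs agree within the chosen tolerance."""
    return max_abs_diff(native, trt) <= TOLERANCES[precision]
```

A failure at FP32 points at the ONNX export or the TRT build rather than at numeric precision, which matches the "divergence ⇒ ONNX/TRT issue" rule in step 23.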
Phase 7 — Optimization & Tuning (conditional)
Enter only if Phase 6 passes but accuracy / latency / model size needs improvement. Ask the user for target metrics first. Diagnose (Step 26) across four categories — accuracy too low, TRT-vs-native gap, training too slow, inference too slow — then apply the relevant technique: hyperparameter tuning (27), INT8 quantization (28), channel pruning + retrain (29), knowledge distillation (30), or resolution tuning (31). Full diagnostics, config blocks, YAML overrides, and decision tree: phase-7-optimization.md.
Argument
If provided, interpret
as the HuggingFace model ID or URL to use as the starting point for Phase 1. If credentials or model short-name are not included, ask the user for them before proceeding.