TAO-HF Integration Skill

Integrate a HuggingFace (HF) Computer Vision model into the NVIDIA TAO Toolkit ecosystem. Work the phases iteratively — not purely linearly — following a build → test → debug → fix → retest loop at every step.

This SKILL.md is the workflow coordinator. Each phase has a dedicated reference file under

references/

with the full step-by-step content, code blocks, docker invocations, and gates. Read the matching reference at the start of each phase — the summaries below are not sufficient on their own.

Local-Only Rule

All work is strictly local. You may only read/clone from remotes; all file edits, Docker builds, and test runs stay on the local machine. Do NOT

git commit

git push

/create remote branches (GitLab, GitHub, HuggingFace), create merge requests / pull requests / issues, or upload/publish/push Docker images to any registry or artifact store. This follows from the bind-mounted local-clone layout in

references/execution-and-debugging.md

Submodule Override & Execution Platform

local-docker

is the default platform. The user clones the four TAO repos (

tao-core

tao-pytorch

tao-deploy

tao-dataservices

) independently into one working directory; each repo also carries nested

tao-core/

(and

tao-pytorch/

) submodules pinned at the original unmodified commit that are stale — modifications live only in the top-level

tao-core/

. Always install from the top-level
tao-core/
, never from
<repo>/tao-core/
(the nested submodule silently drops all modifications). The override of the CI

pip install tao-core/

is three rules: mount the whole working directory (

-v $(pwd):/workspace

);

pip install /workspace/tao-core

FIRST so modified schemas win; put top-level tao-core first on

PYTHONPATH

(

-e PYTHONPATH=/workspace/tao-core:/workspace/tao-pytorch

Every test, smoke run, and end-to-end validation runs inside a locally prepared TAO Toolkit container (

tao-pytorch-base:latest

tao-deploy-base:latest

, optionally

tao-dataservices-base:latest

, all from Phase 0), with local clones bind-mounted at

/workspace

and installed via

pip install /workspace/tao-core

setup.py develop

. All Python work runs in containers — no host venvs, no host

pip install

s. The platform skills own the how of running containers — host GPU runtime via

tao-setup-nvidia-gpu-host

;

docker run

flags / NGC auth / mounts / env passthrough /

--ipc=host

--shm-size

/ inspection / error modes via

tao-run-on-docker

and

tao-run-on-local-docker

. This workflow specifies only what to run inside them and never forks those conventions. The annotated working-directory tree, canonical

docker run

flag set with the workflow-specific

-w

PYTHONPATH

/install-shell additions, three isolation contexts, four isolation rules, the Development Loop, and the Debugging Playbook table:

references/execution-and-debugging.md

Phase Map

The seven phases (full goals + gates below; references per phase):

Phase 0 — Prerequisites + TAO Toolkit images + local image tags: phase-0-prereqs.md
Phase 1 — HF-inspection environment, validate HF model + dataset: phase-1-inspection.md, hf-inspection.md
Phase 2 — Closest existing TAO reference model: phase-2-codebase.md, task-type-guide.md
Phase 3 — tao-core config + tao-pytorch trainer / native eval / inference: phase-3-implementation.md, tao-patterns.md, repo-structure.md
Phase 4 — ONNX export + tao-deploy TRT engine, inference, evaluation: phase-4-deploy.md
Phase 5 — Packaging (
```
setup.py
```
console_scripts) + L0 tests: phase-5-packaging.md
Phase 6 — Container-based testing + end-to-end pipeline validation: phase-6-container-tests.md, docker-patterns.md
Phase 7 — (conditional) Accuracy / latency / size tuning: phase-7-optimization.md

IMPORTANT — Continuous Execution Through Phase 6: Do NOT stop after implementation (Phases 3–5) to wait for the user to run tests; immediately proceed to the mandatory Phase 6. The implementation is not complete until tests pass inside the TAO Toolkit containers and the end-to-end pipeline is validated. Apply the build-test-debug loop at every step — write, test immediately, fix on failure, never accumulate untested code.

Phase 0 — Prerequisites Check

Goal: verify Python 3.10+ and

git

; delegate the NVIDIA driver / CUDA / Docker / NVIDIA Container Toolkit host check to

tao-setup-nvidia-gpu-host

; verify NGC

docker login

for

nvcr.io

. Then ask the user for the TAO Toolkit image references (tao-pytorch, tao-deploy, optionally tao-dataservices), pull them, and prepare local image tags

tao-pytorch-base:latest

tao-deploy-base:latest

tao-dataservices-base:latest

for Phases 3–6. Preparation strips the released TAO packages already in those images so the user's local clones (mounted at

/workspace/...

) install and get picked up at run time. Hard stop if any check fails. Full commands, user-prompt wording, and per-image preparation

Dockerfile

snippets: phase-0-prereqs.md.

Gate: all prerequisite checks pass; the user has supplied the required image references;

tao-pytorch-base:latest

and

tao-deploy-base:latest

exist locally;

tao-dataservices-base:latest

exists if dataservices work is expected.

Phase 1 — Information Gathering & Validation

Goal: decide whether to proceed. Gather credentials, locate (or clone) the four TAO repos and create a consistent local working branch across them, launch the long-lived

tao-hf-inspect

container (isolation Context A), validate that the HF model is a CV model with a supported

pipeline_tag

, extract config + state-dict schema, sanity-check ONNX export, and clean up. Full step-by-step (1.1–1.7): phase-1-inspection.md; generic patterns: hf-inspection.md.

Reject if

pipeline_tag

is NLP / audio / LLM (out of CV scope),

AutoConfig

raises, or ONNX export fundamentally cannot work and has no rewrite path.

Gate: all 4 TAO repos located/cloned with a consistent working branch;

pipeline_tag

confirmed CV;

model_type

image_size

hidden_size

num_labels

extracted; state-dict keys documented and the HF→TAO remapping plan drafted; ONNX sanity check passed (or failure mode understood); user confirmed

model_short_name

and task type. Present findings and confirm before proceeding.

Phase 2 — Codebase Exploration

Goal: find the closest existing TAO reference model for the detected

pipeline_tag

(classification →

classification_pyt

, detection →

dino

rtdetr

, segmentation →

segformer

, instance →

mask2former

, panoptic →

oneformer

, zero-shot →

grounding_dino

, depth →

mono_depth

), read its full implementation across

tao-core

tao-pytorch

, and

tao-deploy

, and decide whether the backbone already exists in

backbone_v2/

. The chosen reference drives everything downstream — config structure, architecture, loss, ONNX export shape, TRT builder, deploy inferencer/loader, metrics, dataset format. The full reference list (12 files per model), the

backbone_v2/

coverage check (it already provides

vit

swin

resnet

dino_v2

, and others), and the

tao-dataservices

coverage check: phase-2-codebase.md; per-task details: task-type-guide.md.

If a new backbone is needed, decide the strategy (timm wrap > re-implement from scratch > HF black-box wrap) before Phase 3 — it changes weight loading, ONNX export, and the deploy pipeline. Never dual-inherit from
transformers.PreTrainedModel
and
BackboneBase
(metaclass conflict).

Gate: reference TAO model identified and all 12 locations read; task-type implications understood (architecture, loss, ONNX outputs, deploy classes, metrics, dataset); backbone coverage decided (reuse / wrap timm / new); dataservices coverage checked.

Phase 3 — TAO Core Configuration & Native Implementation

Goal: write the tao-core config schema and the tao-pytorch trainer + native inference + native evaluation, smoke-testing in between. Use

<model_name>

(

snake_case

from Phase 1) and

<ModelName>

(

PascalCase

). Seven steps: (1)

tao-core

config under

config/<model_name>/

—

ExperimentConfig(CommonExperimentConfig)

MUST contain

model

dataset

train

evaluate

inference

export

gen_trt_engine

quantize

; (2)

tao-pytorch

trainer under

cv/<model_name>/

(

build_model()

<ModelName>PlModel(TAOLightningModule)

train.py

, entrypoint,

experiment_spec.yaml

; new backbone → add+register

cv/backbone_v2/<backbone_name>.py

); (3) multi-GPU/multi-node via the entrypoint's

launch()

; (4) native inference →

result.csv

; (5) native evaluation →

results.json

; (6–7) MLOps wiring (

@monitor_status

→

status.json

). Consistency rules (including

export.onnx_file

gen_trt_engine.onnx_file

and

???

= required

MISSING

) are enforced by the Cross-Phase checklist below.

Full per-step code and the canonical

experiment_spec.yaml

: phase-3-implementation.md (with snippets tao-patterns.md, layout repo-structure.md, per-task task-type-guide.md).

Gates: Step 1 —

ExperimentConfig

imports cleanly in the container; Step 2 —

build_model(cfg)

runs and the PLModel instantiates; overall — all 7 steps complete, smoke tests pass, no missing

__init__.py

Phase 4 — Export, Deployment & TensorRT Integration

Goal: ship ONNX export from tao-pytorch, then a TRT engine builder + TRT inference + TRT evaluation in tao-deploy that reuse the tao-core

ExperimentConfig

. Four steps (8–11): ONNX export (

scripts/export.py

, per-task input/output names,

batch_size=-1

⇒ dynamic batch); TRT engine builder (

gen_trt_engine.py

, subclasses

EngineBuilder

or reuses

ClassificationEngineBuilder

, writes

specs/{gen_trt_engine,inference,evaluate}.yaml

); TRT inference (NumPy-only

ClassificationLoader

→

result.csv

); TRT evaluation (sklearn/pycocotools →

results.json

). Full code and the Phase 3+4 gate: phase-4-deploy.md.

Module pitfall: tao-pytorch and tao-deploy have separate

hydra_runner

and

monitor_status

implementations — use the deploy versions in deploy scripts;

ExperimentConfig

is imported from

nvidia_tao_core

in both repos (same schema, same field paths).

Phase 3+4 gate: all three in-container checks pass —

tao-pytorch

imports + model + ONNX export, and

tao-deploy

imports.

Phase 5 — Packaging & L0 Testing

Goal: register the model as a

'<model_name>=...entrypoint.<model_name>:main'

console_script in both

tao-pytorch/setup.py

and

tao-deploy/setup.py

(deploy entrypoint uses

nvidia_tao_deploy.cv.common.entrypoint.entrypoint_hydra

), and add L0 tests — deploy tests (

tao-deploy/tests/<model_name>/

, subprocess +

--buildOnly

trtexec

) and trainer tests (

tao-pytorch/tests/cv_unit_test/<model_name>/

Trainer(..., fast_dev_run=True)

, markers

@pytest.mark.cv_unit @pytest.mark.<model_name>

). Full code and test layout: phase-5-packaging.md.

Gate: entrypoints registered; pytest files exist and follow the marker convention. Do NOT stop here — proceed directly to Phase 6.

Cross-Phase Data Flow & Consistency Verification

Before Docker testing, verify the artifact chain —

train

produces

<results_dir>/train/<model_name>_model_latest.pth

→

export.checkpoint

→

<results_dir>/export/<model_name>.onnx

→

gen_trt_engine

→

<results_dir>/trt/<model_name>.engine

→

inference.trt_engine

evaluate.trt_engine

. Then confirm the consistency checklist: the

*_latest.pth

name;

augmentation.mean

std

matching across the training spec,

inference.yaml

evaluate.yaml

, and builder

preprocess_mode

; ONNX

input_names

output_names

;

export.input_width

input_height

dataset.img_size

;

model.head.in_channels

model_params_mapping.py

; shared

classes.txt

; and an

__init__.py

in every package dir (including

scripts/__init__.py

for

get_subtasks()

pkgutil

discovery). Full interpolation paths, itemized checklist, and config field paths: workflow-consistency.md.

Phase 6 — Container Testing & End-to-End Validation

Mandatory — start immediately after Phase 5. All TAO models ship as Docker images; code that only works outside a container is incomplete. Testing runs directly inside the TAO Toolkit container (no Docker image build in the test loop): mount the local source into the Phase-0 image tags, install via

setup.py develop

, and invoke

pytest

pylint

pydocstyle

flake8

directly — use vanilla

pytest

+ lint binaries, NOT any

ci/run_functional_tests.py

ci/run_static_tests.py

wrappers (those exist only in NVIDIA's internal mirrors; the public

github.com/NVIDIA-TAO/

mirrors have no

ci/

directory).

Steps 16–25, in order: verify the local image tags (16); container

pytest

for tao-core (17), tao-pytorch (18,

-m cv_unit

--shm-size=16G

), tao-deploy (19); static/lint tests (20,

pylint --errors-only

+ optional

pydocstyle

flake8

); wheel builds (21); the end-to-end pipeline (22 — train dry-run + export in one tao-pytorch session, then gen_trt_engine + inference + evaluate in one tao-deploy session, since

--rm

discards installed packages); native-vs-TRT cross-check (23 — FP32 ≈ exact, FP16 ≈ small delta, divergence ⇒ ONNX/TRT issue); interactive debug shells (24); optional release Docker image build (25, distribution-only). Full per-step commands and the fix-and-retest loop: phase-6-container-tests.md; build scripts, runner patterns, requirements, CI conventions: docker-patterns.md.

Phase 6 gate (Done criteria): tao-core / tao-pytorch / tao-deploy unit tests pass in their TAO Toolkit containers; static tests pass (or only legacy lint warnings); wheels build; end-to-end

<model_name>_model_latest.pth

→

model.onnx

→

model.engine

→ non-empty

result.csv

and

results.json

; native vs TRT predictions agree within tolerance.

Phase 7 — Optimization & Tuning (conditional)

Enter only if Phase 6 passes but accuracy / latency / model size needs improvement. Ask the user for target metrics first. Diagnose (Step 26) across four categories — accuracy too low, TRT-vs-native gap, training too slow, inference too slow — then apply the relevant technique: hyperparameter tuning (27), INT8 quantization (28), channel pruning + retrain (29), knowledge distillation (30), or resolution tuning (31). Full diagnostics, config blocks, YAML overrides, and decision tree: phase-7-optimization.md.

Argument

$ARGUMENTS

If provided, interpret

$ARGUMENTS

as the HuggingFace model ID or URL to use as the starting point for Phase 1. If credentials or model short-name are not included, ask the user for them before proceeding.

tao-port-huggingface-model

NPX Install

Tags

SKILL.md Content