tao-finetune-huggingface-model

<!-- Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

Local NVIDIA GPU fine-tuning for HuggingFace models, grounded in live-fetched documentation with curated references as a fallback safety net. One NGC container, a small set of focused scripts, one push to HF Hub. Behavior is governed by the rules in this file: follow them, do not improvise.
Order of authority (highest first): (1) user input → (2) live research (model card, HF repo example, author script, task docs, paper; always fetched, Step 3) → (3) curated references/*.md (fallback when live research is silent) → (4) training-data memory (last resort, suspect). On conflict, live research wins for the specific model + current API. See references/core-rules.md for the full order and conflict-resolution rules.
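The order of authority reads naturally as a first-match lookup. The following is a minimal sketch under that reading; the function name `resolve_setting` and the source labels are hypothetical and are not taken from references/core-rules.md.

```python
from typing import Optional

# Source labels in descending authority, mirroring the order stated above.
# The label strings are illustrative assumptions, not names from the skill.
AUTHORITY_ORDER = ("user_input", "live_research", "curated_reference", "training_memory")


def resolve_setting(candidates: dict[str, Optional[str]]) -> tuple[str, str]:
    """Return (source, value) from the highest-authority source that has a value."""
    for source in AUTHORITY_ORDER:
        value = candidates.get(source)
        if value is not None:
            return source, value
    # Nothing to go on: the skill's general rule is to stop and ask.
    raise LookupError("no source provided a value; stop and ask the user")
```

With this shape, a curated reference is only reached when both the user and live research are silent on a given setting.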

Inputs

Required:
  • model_id: HuggingFace model ID, e.g. google/vit-base-patch16-224
Conditional credentials (loaded by the SessionStart hook from ~/.config/tao/.env):
  • HF_TOKEN: needed only when the model/dataset is gated (read) or push_to_hub is on (write); public + push_to_hub: false runs don't need it. The agent never reads the value; it only checks presence with [ -n "$HF_TOKEN" ].
  • WANDB_API_KEY, WANDB_PROJECT: needed only when WandB is enabled; set WANDB_MODE=disabled to opt out.
Dataset (exactly one):
  • dataset_id: HuggingFace dataset ID (source: hf)
  • local_dataset_path: local folder or file (source: local); optional local_dataset_format ∈ {auto, imagefolder, coco, voc, jsonl, arrow, parquet, csv} (default auto-detect).
  • (omit): agent recommends popular datasets (source: recommend)
Optional (have defaults): task_type (auto-detected); n_train=10000, n_eval=1000, n_epochs=3, lora_r=16; output_dir=./output/<model_short_name>; hf_model_repo (push target; if unset and HF_TOKEN has write access, auto-derived as <whoami>/<model_short_name>-finetuned); push_to_hub=True (set False to skip); skip_baseline=False (set True to skip the zero-shot baseline eval).
Optional deliverables (off by default): emit_progress_log → output_dir/PROGRESS.md (per-step ✅/⚠️/❌ journal); emit_report → reports/report.{pdf,html} with curves & samples; emit_unit_tests → tests/ with fake-data heterogeneous-batch tests.
All values live in output_dir/config.yaml. Never hardcode in Python.
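The dataset choice and the HF_TOKEN condition above are both small decision rules. Here is a sketch of how they might be checked before any container starts; the helper names `resolve_dataset_source` and `needs_hf_token` are hypothetical, and the `gated` flag is assumed to come from the Step 1 probes.

```python
from typing import Optional


def resolve_dataset_source(dataset_id: Optional[str] = None,
                           local_dataset_path: Optional[str] = None) -> str:
    """Map the mutually exclusive dataset inputs to hf / local / recommend."""
    if dataset_id and local_dataset_path:
        raise ValueError("give dataset_id or local_dataset_path, not both")
    if dataset_id:
        return "hf"
    if local_dataset_path:
        return "local"
    # Neither given: the agent recommends popular datasets.
    return "recommend"


def needs_hf_token(gated: bool, push_to_hub: bool) -> bool:
    """HF_TOKEN is required for gated reads or for Hub pushes (write)."""
    return gated or push_to_hub
```

A public model with push_to_hub: false gives `needs_hf_token(False, False) == False`, matching the rule that such runs need no token.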

Execution platform

This skill orchestrates what to run; the platform skills own how (read them first, do not redraft their conventions here): tao-setup-nvidia-gpu-host (GPU host runtime: driver 580, CUDA Toolkit 13.0, NVIDIA Container Toolkit 1.19.0), tao-run-on-docker (docker run flags, NGC auth, --gpus, mounts, env passthrough, --ipc=host / --shm-size, error modes), and tao-run-on-local-docker (local Docker job preflight: daemon reachable, GPU smoke).
Default platform: local-docker. Build a one-off image (run-<short>:latest) and run it on the local Docker daemon. Ask only if the user needs a different backend (Brev, Lepton/SLURM/Kubernetes). See references/execution-platform.md for that path plus the alternate-backend routing, the GPU-runtime preflight, the credentials policy, and the docker run conventions.

References — fallback safety net

Curated references/*.md are consulted only when live research is silent, ambiguous, or unavailable; live docs always win for the specific model + current API. The workflow steps below link the file each step needs directly. Before falling back, log the live source you tried and why it was insufficient (in config.yaml notes:, and PROGRESS.md if enabled). [FETCH LIVE] markers in cv-scripts.md / vlm-scripts.md are a research checklist, not code to inline; if a block has no Step 3 finding, refetch the listed URL.
See references/reference-index.md for the complete index: every always-on reference plus the three opt-in ones gated by a flag (progress-tracking.md by emit_progress_log, testing.md by emit_unit_tests, reporting.md by emit_report), each with its per-step role.

Core rules

The non-negotiable behaviors. Full text in references/core-rules.md. Short version:
  • Your HF-library knowledge is outdated. Fetch live docs before writing any ML code; never generate trainer args / collator / transforms from memory (Step 3).
  • Smoke-test on real data with --max_steps 1 before any full run.
  • Never silently substitute model_id, dataset_id, or training_method; stop and ask.
  • Error recovery is minimal-change. OOM → halve batch, double grad_accum, enable gradient checkpointing (don't switch to LoRA without approval); NaN → reduce LR 10×; flat loss → inspect collator; same error 3× → stop and ask.
  • Dataset columns verified BEFORE the collator. Rename → prepare_data.py; restructuring → stop and ask.
  • Hardware sizing (bf16): ≤3B → 24 GB, 7–13B → 80 GB, 30B+ → multi-GPU or LoRA on 1× 80 GB, 70B+ → 8× 80 GB or LoRA. Won't fit + no LoRA request → ask.
references/core-rules.md has the full enumeration (hallucinated imports, never-without-approval list, full error-recovery + hardware-sizing tables).
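The minimal-change recovery rule for OOM and NaN can be written down as two tiny transformations. This is only a sketch: the function names are hypothetical, and stopping at a batch size of 1 is an assumption of this example, since the short version above does not say what happens when the batch cannot be halved.

```python
def recover_from_oom(batch_size: int, grad_accum: int) -> dict:
    """Halve the per-device batch, double accumulation, enable checkpointing.

    For even batch sizes the effective batch (batch_size * grad_accum) is unchanged.
    """
    if batch_size <= 1:
        # Assumption of this sketch: nothing left to halve, so escalate to the user
        # instead of silently switching training method (e.g. to LoRA).
        raise RuntimeError("batch size already 1; stop and ask")
    return {
        "batch_size": batch_size // 2,
        "grad_accum": grad_accum * 2,
        "gradient_checkpointing": True,
    }


def reduce_lr_after_nan(learning_rate: float) -> float:
    """NaN loss: reduce the learning rate by a factor of 10."""
    return learning_rate / 10
```

For example, a run at batch 16 with grad_accum 2 retries at batch 8 with grad_accum 4, so the optimizer still sees 32 samples per update.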

Workflow — 6 steps

Single pass, sequential. Each step has a clear gate before the next begins.

Step 1 — Inspect & qualify

Decide whether to proceed at all. 1a. Probe model and 1b. Probe dataset via two CPU-only python:3.12-slim containerized probes (no host Python prereqs): the model probe reports model_type, architectures, tags, head counts; the dataset probe verifies loadability + column schema. Detect task from architectures + tags + card body (card silent on AutoModelFor... → references/model-discovery.md, log under notes:). For source = recommend, present 3–5 picks from references/dataset-recommendations.md; for source = local, use references/dataset-sources.md loaders. 1c. Accept/reject, 1d. walk references/compat-workarounds.md recording matches in config.yaml applicable_workarounds:, then 1e. write the config.yaml skeleton.
See references/step1-probes.md for the full probe scripts + docker run invocations, the Docker-daemon preflight, prerequisites (MODEL_ID, optional DATASET_ID / HF_TOKEN, OUTPUT_DIR default ./output/<model_short_name> bind-mounted by Steps 4–5), dataset-column verification + rename rule, the full reject criteria, compat-walk detail, the exact skeleton, and .probe cleanup.
Gate: config.yaml exists with model, dataset, task, applicable_workarounds. Do not proceed if any field is missing.
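The Step 1 gate amounts to a presence check on four fields. The sketch below assumes config.yaml has already been parsed into a dict and that the four names are top-level keys; the exact skeleton lives in references/step1-probes.md and may nest them differently. The function name is hypothetical.

```python
REQUIRED_STEP1_FIELDS = ("model", "dataset", "task", "applicable_workarounds")


def missing_step1_fields(config: dict) -> list[str]:
    """List the gate fields that are absent or null in the parsed config."""
    # `is None` rather than truthiness: an empty applicable_workarounds list
    # is a valid result of the compat walk (no rule matched).
    return [key for key in REQUIRED_STEP1_FIELDS if config.get(key) is None]
```

An empty return value means the gate passes; anything else names the fields to fill before Step 2.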

Step 2 — Hardware audit & NGC image

Verify Docker + GPU + disk, pick the NGC PyTorch image live, finalize hardware-dependent compat rules. 2a. Audit (hard gate) via tao-setup-nvidia-gpu-host --check-only (driver branch 580, CUDA Toolkit 13.0, NVIDIA Container Toolkit 1.19.0); on failure ask to authorize the install, then re-run; soft-warn on < 100 GB free disk; check only the credentials this run needs; do not proceed to Step 4 on a hard-fail; record gpu_count, gpu_name, driver_major, vram_gb_per_gpu. 2b. Pick NGC image (live): the highest-versioned PyTorch NGC image with Min driver ≤ driver_major and a container CUDA version checked against the host CUDA Toolkit per the live-selection rules (never reject for an aN / bN / rcN suffix); WebFetch fail → references/hardware-container.md fallback. 2c. Re-evaluate hw-dependent compat rules. 2d. Model-fit check: bf16 param_bytes ≈ 2×param_count; if > 60% of vram_gb_per_gpu × 1e9, recommend LoRA.
See references/hardware-audit-ngc.md for the full audit script, the soft-warn + MIN_DISK_GB override, live-selection rules, the support-matrix WebFetch URL, the 24.09-py3 / SDPA+GQA attn_implementation: "eager" fallback, and the could not select device driver failure note.
Gate: config.yaml has ngc_image, gpu_count, gpu_name, driver_major, vram_gb_per_gpu. Hardware-dependent compat fixes are recorded.
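The 2d model-fit check is plain arithmetic, sketched here directly from the rule above. The function name `recommend_lora` is hypothetical; the 2-bytes-per-parameter factor and the 60% threshold are the ones stated in the text.

```python
def recommend_lora(param_count: int, vram_gb_per_gpu: float,
                   threshold: float = 0.60) -> bool:
    """Return True when bf16 weights alone exceed the VRAM threshold."""
    param_bytes = 2 * param_count          # bf16 stores 2 bytes per parameter
    vram_bytes = vram_gb_per_gpu * 1e9     # decimal GB, as in the rule above
    return param_bytes > threshold * vram_bytes
```

For a 3B-parameter model on a 24 GB GPU the weights take about 6e9 bytes against a 14.4e9 budget, so LoRA is not recommended; for a 30B model on 80 GB the weights take 60e9 bytes against 48e9, so it is.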

Step 3 — Research the recipe

Fetch the live recipe. The agent's transformers / trl / peft memory is suspect, so Step 3 is non-negotiable. Walk references/research-priorities.md in priority order (Priority 1 → 6). Stop once you have, for the detected task: the AutoModel / processor class, train + eval transforms, collator, compute_metrics, and hyperparameter hints (LR, batch size, epochs, scheduler). Record findings in meta/recipe.md and append source URLs to config.yaml: research_sources:. If a slot has no live finding, fall back to the matching scaffold (cv-scripts.md / vlm-scripts.md) and log "fallback to scaffold — no live source for <slot>" under notes:. Conflict-resolution rules: references/research-priorities.md.
Gate: every required slot above is filled, with a source URL or an explicit scaffold-fallback note.
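The Step 3 gate can be pictured as a per-slot check that accepts either a source URL or a fallback note. In the sketch below the slot keys, the `source_url` / `fallback_note` field names, and the function name are all illustrative assumptions; the skill itself records findings in meta/recipe.md and config.yaml rather than in this structure.

```python
# Slots named in the step text; the key spellings are hypothetical.
REQUIRED_SLOTS = (
    "model_class", "processor_class", "train_transforms", "eval_transforms",
    "collator", "compute_metrics", "hyperparameters",
)


def unfilled_slots(recipe: dict[str, dict]) -> list[str]:
    """Return slots that have neither a live source URL nor a fallback note."""
    missing = []
    for slot in REQUIRED_SLOTS:
        entry = recipe.get(slot) or {}
        if not (entry.get("source_url") or entry.get("fallback_note")):
            missing.append(slot)
    return missing
```

A slot filled from a scaffold still passes, provided the fallback is written down explicitly.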

Step 4 — Generate project & smoke-test

Write all scripts, build the image, prepare data, run a 1-step smoke on real data (one docker build, two docker runs).
4a. Generate project files in output_dir/: config.yaml, Dockerfile, requirements.txt, prepare_data.py, train.py, run_eval.py (eval script MUST be run_eval.py, never evaluate.py, which collides with HF evaluate), infer.py, merge_lora.py for VLM-LoRA, .gitignore. Authority order: Step 3 live research → scaffold reference (cv-scripts.md / vlm-scripts.md) for structure only, never their [FETCH LIVE] blocks. Apply each applicable_workarounds entry as a Dockerfile block, requirements pin, config override, or runtime env var. Every generated .py begins with the NVIDIA Apache-2.0 #-comment copyright header (emitter must fail otherwise). If emit_unit_tests: true, also generate tests/ per references/testing.md. See references/project-scaffold.md for the full file table, the exact copyright header, and the Dockerfile template (deps → compat → code layer order).
4b. Build, prepare, smoke: docker build -t run-<short>:latest ., then run references/docker-runs.md §1 (build), §2 (prepare_data), §3 (smoke, --smoke --max_steps 1); §3 lists the smoke pass criteria (no exception, loss finite, grad_norm > 0 at step 1). If emit_unit_tests: true, also run pytest tests/ inside the container. Any failure → STOP.
4c. Preflight summary: print the boxed ─ PREFLIGHT ─ summary (reference URL, dataset columns, push_to_hub repo, wandb monitoring, ngc_image, hardware, smoke result) and verify every field is filled before launching full training. Exact format: references/project-scaffold.md.
Gate: project files written, image built, smoke PASSED, preflight has no blank fields.
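The numeric half of the smoke pass criteria (finite loss, positive grad_norm at step 1) fits in one predicate. This sketch assumes the two values have already been parsed from the smoke run's step-1 log; the function name is hypothetical, and the "no exception" criterion is left to the container's exit status.

```python
import math


def smoke_passed(loss: float, grad_norm: float) -> bool:
    """True when the step-1 loss is finite and the gradient norm is positive."""
    # NaN compares False against everything, so a NaN grad_norm also fails here.
    return math.isfinite(loss) and grad_norm > 0
```

A grad_norm of exactly 0 fails on purpose: it usually means no parameter received a gradient, for instance because everything is frozen.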

Step 5 — Train, evaluate, infer

Run in order, all commands in references/docker-runs.md: 5a baseline eval (§4, skip if skip_baseline: true), 5b full training detached (§5), 5c LoRA merge (§6, only VLM-with-LoRA), 5d post-train eval (§7), 5e inference 5 samples (§8). Multi-GPU: launch with torchrun --nproc_per_node=$gpu_count train.py in place of python train.py (torchrun takes the script path directly). Watch docker logs -f hft_train: loss should drop within 10-20 steps (flat → stop; NaN → reduce LR; OOM → halve batch; full recovery in references/core-rules.md + references/error-playbook.md). If emit_report: true, run report.py after Step 5e per references/reporting.md.
Gate, all of: checkpoints/final/ (or checkpoints/merged/ for LoRA) exists; reports/eval_results.json has a numeric primary metric; reports/baseline_results.json exists (unless skipped); reports/inference_samples/ has 5 samples; wandb URL shows descending loss.
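The file-based parts of the Step 5 gate can be checked mechanically. The sketch below makes several assumptions: the listed paths are relative to output_dir, eval_results.json is a flat JSON object keyed by metric name, and "has 5 samples" is read as at least five files. The function name and the default metric key `accuracy` are hypothetical, and the wandb loss-curve condition is not covered.

```python
import json
from pathlib import Path


def step5_gate_failures(output_dir, lora=False, skip_baseline=False,
                        primary_metric="accuracy"):
    """Return a list of unmet file-based Step 5 gate conditions (empty = pass)."""
    root = Path(output_dir)
    failures = []

    # LoRA runs are gated on the merged checkpoint, others on the final one.
    ckpt = root / "checkpoints" / ("merged" if lora else "final")
    if not ckpt.is_dir():
        failures.append(f"missing {ckpt.relative_to(root).as_posix()}/")

    # The primary metric must be present and numeric (bool is excluded on purpose).
    try:
        results = json.loads((root / "reports" / "eval_results.json").read_text())
        metric = results[primary_metric]
        if isinstance(metric, bool) or not isinstance(metric, (int, float)):
            failures.append("primary metric is not numeric")
    except (OSError, ValueError, KeyError, TypeError):
        failures.append("eval_results.json unreadable or lacks the primary metric")

    if not skip_baseline and not (root / "reports" / "baseline_results.json").is_file():
        failures.append("missing reports/baseline_results.json")

    samples = root / "reports" / "inference_samples"
    count = sum(1 for p in samples.iterdir() if p.is_file()) if samples.is_dir() else 0
    if count < 5:
        failures.append(f"only {count} inference samples, need 5")
    return failures
```

Returning the list of failures rather than a bare boolean keeps the terse "what went wrong" reporting style when the gate does not pass.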

Step 6 — Push & emit rerun skill

Publish the run and make it reproducible without re-research.
6a. Push to HF Hub: use references/hub-push.md (pushes the weights (merged or final), a generated model card README.md, results/{eval,baseline}_results.json, config.yaml, Dockerfile, requirements.txt, inference_samples/*.jpg, and report.{pdf,html} if emit_report: true). Skip iff push_to_hub: false is explicit in config.yaml.
6b. Emit rerun skill at <output_dir>/skills/run-<short>/SKILL.md per references/pipeline-skill-template.md. Every <placeholder> must be a real value (literal placeholders are a bug); include the full YAML (license, compatibility, metadata, allowed-tools) and the NVIDIA copyright notice in an HTML comment immediately after the closing ---, as in that template; an emitter must fail unless the emitted SKILL.md contains those fields and the copyright comment.
Gate (Done criteria), all of: Step 5 gate met; HF Hub repo exists at the resolved URL with weights + card + results/ (unless push_to_hub: false); <output_dir>/skills/run-<short>/SKILL.md exists with no <placeholder> left, with metadata + copyright HTML comment per pipeline-skill-template.md.
Final message to user (terse, with direct URLs): wandb URL; HF Hub URL; primary metric baseline → fine-tuned (Δ); path to reports/inference_samples/; path to <output_dir>/skills/run-<short>/SKILL.md.
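The "no <placeholder> left" condition on the emitted SKILL.md is easy to test with a regular expression. The sketch below assumes placeholders always look like a bare identifier in angle brackets (such as <short> or <output_dir>); the pattern and the function name are hypothetical, and any legitimate HTML tag in the file would also be reported.

```python
import re

# An identifier in angle brackets. The copyright comment starts with "<!--",
# which does not match because "!" is not a valid first character here.
PLACEHOLDER_RE = re.compile(r"<[A-Za-z_][A-Za-z0-9_-]*>")


def leftover_placeholders(skill_md: str) -> list[str]:
    """Return the distinct unresolved placeholders found in the text, sorted."""
    return sorted(set(PLACEHOLDER_RE.findall(skill_md)))
```

An emitter could raise whenever this list is non-empty, which is one way to meet the requirement that it fails on literal placeholders.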

Error playbook

On a known runtime error, consult references/error-playbook.md before redesigning anything. Its symptom → minimal-fix table covers NGC ENTRYPOINT, SDPA+GQA, transformers>=4.51 regression, numpy 2.x ABI, Albumentations bbox, PEFT + gradient_checkpointing, SmolVLM SDPA, LoRA target-regex, missing CV augmentation, OOM at step 0, and more. When a row fires twice across runs, lift it into references/compat-workarounds.md with a detect rule, auto-applied in Step 1d before the error can fire.
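The promotion rule (a row that fires twice across runs becomes a compat workaround) can be sketched as a count over run histories. How fired rows are actually recorded is not specified here, so the input shape (one list of row names per run), the row names in the usage note, and the function name are all assumptions.

```python
from collections import Counter


def rows_to_promote(fired_rows_per_run: list[list[str]], threshold: int = 2) -> list[str]:
    """Return playbook rows that fired in at least `threshold` distinct runs."""
    # set(run) ensures a row that repeats inside one run is counted once for that run.
    counts = Counter(row for run in fired_rows_per_run for row in set(run))
    return sorted(row for row, n in counts.items() if n >= threshold)
```

Given the histories `[["numpy-abi"], ["numpy-abi", "oom-step0"], ["oom-step0", "oom-step0"]]`, both rows are returned, since each fired in two separate runs.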

Communication style

Terse: no filler, no restating the request; always include direct Hub + wandb URLs; on error state what went wrong, why, what you changed (no menus, no "Option A/B/C" when the answer is clear; act). Full text: references/core-rules.md.
Example pipelines

  • tao-rerun-convnext-cifar10 — facebook/convnext-tiny-224 on cifar10 (image-classification, 10 classes, subset 5000/1000).
  • tao-rerun-detr-cppe5 — facebook/detr-resnet-50 on cppe-5 (object-detection, 5 classes, subset 800/200).
  • tao-rerun-segformer-foodseg103 — nvidia/mit-b0 on EduardoPacheco/FoodSeg103 (semantic segmentation, 103 classes + background, subset 1000/200).
  • tao-rerun-smolvlm-vqav2 — HuggingFaceTB/SmolVLM-256M-Instruct on merve/vqav2-small (image-text-to-text VLM LoRA, subset 500/100, 5 epochs).