tao-finetune-huggingface-model


Local NVIDIA GPU fine-tuning for HuggingFace models, grounded in live-fetched documentation with curated references as a fallback safety net. One NGC container, a small set of focused scripts, one push to HF Hub. Behavior is governed by the rules in this file — follow them, do not improvise.

Order of authority (highest first): (1) user input → (2) live research (model card, HF repo example, author script, task docs, paper; always fetched, Step 3) → (3) curated `references/*.md` (fallback when live research is silent) → (4) training-data memory (last resort, suspect). On conflict, live research wins for the specific model + current API. See `references/core-rules.md` for the full order and conflict-resolution rules.


Inputs


Required:

`model_id`: HuggingFace model ID, e.g. `google/vit-base-patch16-224`.
Conditional credentials (loaded by the SessionStart hook from `~/.config/tao/.env`):

`HF_TOKEN`: only when the model/dataset is gated (read) or `push_to_hub` is on (write); public + `push_to_hub: false` runs don't need it. The agent never reads the value; it only checks presence with `[ -n "$HF_TOKEN" ]`.
`WANDB_API_KEY`, `WANDB_PROJECT`: only when WandB is enabled; set `WANDB_MODE=disabled` to opt out.
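
The credential rules above can be sketched as a small presence check. This is a minimal illustration, not the skill's actual hook: the function names and the `gated` / `wandb_enabled` flags are assumptions made for the example.

```python
import os


def required_credentials(gated: bool, push_to_hub: bool, wandb_enabled: bool) -> list:
    """Return the env var names this run needs (hypothetical helper)."""
    needed = []
    if gated or push_to_hub:  # gated -> read access, push -> write access
        needed.append("HF_TOKEN")
    if wandb_enabled:
        needed += ["WANDB_API_KEY", "WANDB_PROJECT"]
    return needed


def missing_credentials(needed, env=None) -> list:
    """Presence check only, mirroring `[ -n "$VAR" ]`.

    The values are tested for non-emptiness and never returned or logged.
    """
    env = os.environ if env is None else env
    return [name for name in needed if not env.get(name)]
```

A public model with `push_to_hub: false` and WandB disabled yields an empty requirement list, so no credential check runs at all.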

Dataset (exactly one):

`dataset_id`: HuggingFace dataset ID (source: `hf`).
`local_dataset_path`: local folder or file (source: `local`); optional `local_dataset_format` ∈ {auto, imagefolder, coco, voc, jsonl, arrow, parquet, csv} (default auto-detect).
(omit): agent recommends popular datasets (source: `recommend`).
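
A minimal sketch of the "exactly one" rule, assuming the three source labels map to the two inputs as listed above. The function name is hypothetical; raising on a double specification is one reasonable reading of "exactly one".

```python
def resolve_dataset_source(dataset_id=None, local_dataset_path=None) -> str:
    """Map the dataset inputs to a source label: hf, local, or recommend."""
    if dataset_id and local_dataset_path:
        # Two sources at once is ambiguous; stop instead of guessing.
        raise ValueError("set dataset_id or local_dataset_path, not both")
    if dataset_id:
        return "hf"
    if local_dataset_path:
        return "local"
    # Nothing given: the agent recommends popular datasets.
    return "recommend"
```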

Optional (have defaults): `task_type` (auto-detected); `n_train=10000`, `n_eval=1000`, `n_epochs=3`, `lora_r=16`; `output_dir=./output/<model_short_name>`; `hf_model_repo` (push target; if unset and HF_TOKEN has write access, auto-derived as `<whoami>/<model_short_name>-finetuned`); `push_to_hub=True` (set `False` to skip); `skip_baseline=False` (skip zero-shot baseline eval).
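
The two derived defaults can be illustrated as follows. This is a sketch under one assumption that the text does not state: `<model_short_name>` is taken to be the part of `model_id` after the last `/`. Both helper names are hypothetical.

```python
def model_short_name(model_id: str) -> str:
    # Assumption: the short name is the last path component of the model ID.
    return model_id.rstrip("/").split("/")[-1]


def derive_defaults(model_id: str, whoami: str, hf_model_repo=None, output_dir=None) -> dict:
    """Fill output_dir and hf_model_repo only when the user left them unset."""
    short = model_short_name(model_id)
    return {
        "output_dir": output_dir or f"./output/{short}",
        "hf_model_repo": hf_model_repo or f"{whoami}/{short}-finetuned",
    }
```

User-supplied values always pass through unchanged, which matches the order of authority (user input first).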

Optional deliverables (off by default): `emit_progress_log` → `output_dir/PROGRESS.md` (per-step ✅/⚠️/❌ journal); `emit_report` → `reports/report.{pdf,html}` with curves & samples; `emit_unit_tests` → `tests/` with fake-data heterogeneous-batch tests.

All values live in `output_dir/config.yaml`. Never hardcode in Python.


Execution platform


This skill orchestrates what to run; the platform skills own how (read them first, do not redraft their conventions here): `tao-setup-nvidia-gpu-host` (GPU host runtime: driver 580, CUDA Toolkit 13.0, NVIDIA Container Toolkit 1.19.0), `tao-run-on-docker` (`docker run` flags, NGC auth, `--gpus`, mounts, env passthrough, `--ipc=host` / `--shm-size`, error modes), and `tao-run-on-local-docker` (local Docker job preflight: daemon reachable, GPU smoke).

Default platform: `local-docker`. Build a one-off image (`run-<short>:latest`) and run it on the local Docker daemon. Ask only if the user needs a different backend (Brev, Lepton/SLURM/Kubernetes). See `references/execution-platform.md` for that path plus the alternate-backend routing, the GPU-runtime preflight, the credentials policy, and the `docker run` conventions.


References — fallback safety net


Curated `references/*.md` are consulted only when live research is silent, ambiguous, or unavailable; live docs always win for the specific model + current API. The workflow steps below link the file each step needs directly. Before falling back, log the live source you tried and why it was insufficient (in `config.yaml` `notes:`, and PROGRESS.md if enabled). `[FETCH LIVE]` markers in `cv-scripts.md` / `vlm-scripts.md` are a research checklist, not code to inline; if a block has no Step 3 finding, refetch the listed URL.

See `references/reference-index.md` for the complete index: every always-on reference plus the three opt-in ones gated by a flag (`progress-tracking.md` ← `emit_progress_log`, `testing.md` ← `emit_unit_tests`, `reporting.md` ← `emit_report`), each with its per-step role.


Core rules


The non-negotiable behaviors. Full text in `references/core-rules.md`. Short version:

Your HF-library knowledge is outdated. Fetch live docs before writing any ML code; never generate trainer args / collator / transforms from memory (Step 3).
Smoke-test on real data with `--max_steps 1` before any full run.
Never silently substitute model_id, dataset_id, or training_method: stop and ask.
Error recovery is minimal-change. OOM → halve batch, double grad_accum, enable gradient checkpointing (don't switch to LoRA without approval); NaN → reduce LR 10×; flat loss → inspect collator; same error 3× → stop and ask.
Dataset columns verified BEFORE the collator. Rename → `prepare_data.py`; restructuring → stop and ask.
Hardware sizing (bf16): ≤3B → 24 GB, 7–13B → 80 GB, 30B+ → multi-GPU or LoRA on 1× 80 GB, 70B+ → 8× 80 GB or LoRA. Won't fit + no LoRA request → ask.

`references/core-rules.md` has the full enumeration (hallucinated imports, never-without-approval list, full error-recovery + hardware-sizing tables).
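
The minimal-change recovery rule can be sketched as a pure function over the training knobs. The `TrainKnobs` type and the error labels are assumptions made for illustration; the full table lives in `references/core-rules.md`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainKnobs:  # hypothetical container for the values kept in config.yaml
    batch_size: int
    grad_accum: int
    learning_rate: float
    gradient_checkpointing: bool = False


def recover(knobs: TrainKnobs, error: str):
    """Return adjusted knobs, or None when the rule is 'inspect' or 'stop and ask'."""
    if error == "oom":
        # Halve batch and double accumulation so the effective batch is kept
        # (it grows only once batch_size is already 1).
        return replace(
            knobs,
            batch_size=max(1, knobs.batch_size // 2),
            grad_accum=knobs.grad_accum * 2,
            gradient_checkpointing=True,
        )
    if error == "nan":
        return replace(knobs, learning_rate=knobs.learning_rate / 10)
    # Flat loss or a repeated error is not fixed by knob changes.
    return None
```

Note that the OOM branch never touches the training method: switching to LoRA stays an explicit user decision.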


Workflow — 6 steps


Single pass, sequential. Each step has a clear gate before the next begins.


Step 1 — Inspect & qualify


Decide whether to proceed at all. 1a. Probe model and 1b. Probe dataset via two CPU-only `python:3.12-slim` containerized probes (no host Python prereqs): the model probe reports `model_type`, `architectures`, `tags`, head counts; the dataset probe verifies loadability + column schema. Detect `task` from `architectures` / `tags` + card body (card silent on `AutoModelFor...` → `references/model-discovery.md`, log under `notes:`). For `source = recommend`, present 3–5 picks from `references/dataset-recommendations.md`; for `source = local`, use `references/dataset-sources.md` loaders. 1c. Accept/reject, 1d. walk `references/compat-workarounds.md` recording matches in `config.yaml` `applicable_workarounds:`, then 1e. write the `config.yaml` skeleton.

See `references/step1-probes.md` for the full probe scripts + `docker run` invocations, the Docker-daemon preflight, prerequisites (`MODEL_ID`, optional `DATASET_ID` / `HF_TOKEN`, `OUTPUT_DIR` default `./output/<model_short_name>` bind-mounted by Steps 4–5), dataset-column verification + rename rule, the full reject criteria, compat-walk detail, the exact skeleton, and `.probe` cleanup.

Gate: `config.yaml` exists with model, dataset, task, applicable_workarounds. Do not proceed if any field is missing.
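
The Step 1 gate amounts to a field-presence check on the parsed config. A minimal sketch, assuming the config has already been loaded into a dict and that the top-level keys are spelled exactly as in the gate text; an empty `applicable_workarounds` list is treated as valid here, which is also an assumption.

```python
REQUIRED_STEP1_FIELDS = ("model", "dataset", "task", "applicable_workarounds")


def missing_step1_fields(config: dict) -> list:
    """List the gate fields that are absent or null in the parsed config."""
    return [
        key
        for key in REQUIRED_STEP1_FIELDS
        # An empty list counts as present: "no workaround matched" is a result.
        if key not in config or config[key] is None
    ]
```

An empty return value means the gate passes; anything else names what Step 1 still has to fill in.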


Step 2 — Hardware audit & NGC image


Verify Docker + GPU + disk, pick the NGC PyTorch image live, finalize hardware-dependent compat rules. 2a. Audit (hard gate) via `tao-setup-nvidia-gpu-host --check-only` (driver branch 580, CUDA Toolkit 13.0, NVIDIA Container Toolkit 1.19.0); on failure ask to authorize the install, then re-run; soft-warn on `< 100 GB` free disk; check only the credentials this run needs; do not proceed to Step 4 on a hard-fail; record `gpu_count`, `gpu_name`, `driver_major`, `vram_gb_per_gpu`. 2b. Pick NGC image (live): highest-versioned PyTorch NGC image with `Min driver ≤ driver_major` and container CUDA `≤` host CUDA Toolkit (never reject for an `aN` / `bN` / `rcN` suffix); WebFetch fail → `references/hardware-container.md` fallback. 2c. Re-evaluate `hw`-dependent compat rules. 2d. Model-fit check: bf16 `param_bytes ≈ 2×param_count`; if > 60% of `vram_gb_per_gpu × 1e9`, recommend LoRA.
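
The 2d model-fit check is plain arithmetic and can be sketched directly. The function name is hypothetical; the 2-bytes-per-parameter estimate and the 60% threshold come from the rule above, and the estimate covers weights only.

```python
def recommend_lora(param_count: float, vram_gb_per_gpu: float, threshold: float = 0.60) -> bool:
    """True when bf16 weights alone exceed the given share of one GPU's VRAM."""
    param_bytes = 2 * param_count              # bf16: 2 bytes per parameter
    budget_bytes = threshold * vram_gb_per_gpu * 1e9
    return param_bytes > budget_bytes
```

For example, a 7B-parameter model needs about 14 GB for weights; 60% of a 24 GB card is 14.4 GB, so the check does not trigger, while the same model on a 16 GB card (9.6 GB budget) does.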

See `references/hardware-audit-ngc.md` for the full audit script, the soft-warn `MIN_DISK_GB` override, live-selection rules, the support-matrix WebFetch URL, the `24.09-py3` / SDPA+GQA `attn_implementation: "eager"` fallback, and the `could not select device driver` failure note.
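
The 2b selection rule can be sketched as a filter followed by a max. This is an illustration only: the `NgcImage` record, its field names, and the assumption that tags follow the `YY.MM-py3` shape of `24.09-py3` are not taken from the reference files, and real candidate rows must come from the live support matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NgcImage:              # hypothetical row parsed from the support matrix
    tag: str                 # assumed shape: "YY.MM-py3"
    min_driver_major: int
    cuda: tuple              # container CUDA as (major, minor)


def release_key(tag: str) -> tuple:
    """Order tags by (year, month), e.g. '24.09-py3' -> (24, 9)."""
    year, month = tag.split("-")[0].split(".")[:2]
    return (int(year), int(month))


def pick_ngc_image(candidates, driver_major: int, host_cuda: tuple):
    """Newest image whose driver and CUDA requirements the host satisfies."""
    eligible = [
        c for c in candidates
        if c.min_driver_major <= driver_major and c.cuda <= host_cuda
    ]
    if not eligible:
        return None          # caller decides what to do when nothing fits
    return max(eligible, key=lambda c: release_key(c.tag))
```

Pre-release suffixes such as `aN`, `bN`, or `rcN` play no part in the filter, which is consistent with the rule that they are never a reason to reject an image.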

Gate: `config.yaml` has `ngc_image`, `gpu_count`, `gpu_name`, `driver_major`, `vram_gb_per_gpu`. Hardware-dependent compat fixes are recorded.


Step 3 — Research the recipe


Fetch the live recipe: the agent's `transformers` / `trl` / `peft` memory is suspect, so Step 3 is non-negotiable. Walk `references/research-priorities.md` in priority order (Priority 1 → 6). Stop once you have, for the detected task: the `AutoModel` / processor class, train + eval transforms, collator, `compute_metrics`, and hyperparameter hints (LR, batch size, epochs, scheduler). Record findings in `meta/recipe.md` and append source URLs to `config.yaml: research_sources:`. If a slot has no live finding, fall back to the matching scaffold (`cv-scripts.md` / `vlm-scripts.md`) and log "fallback to scaffold — no live source for <slot>" under `notes:`. Conflict-resolution rules: `references/research-priorities.md`.

Gate: every required slot above is filled, with a source URL or an explicit scaffold-fallback note.
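
A minimal sketch of the Step 3 gate, assuming findings are held as a dict keyed by slot and that fallback notes contain the logged phrase "no live source for" followed by the slot name. The slot names and the `source_url` key are hypothetical; they only mirror the list of required findings.

```python
REQUIRED_SLOTS = (
    "model_class",        # AutoModel / processor class
    "transforms",         # train + eval transforms
    "collator",
    "compute_metrics",
    "hyperparameters",    # LR, batch size, epochs, scheduler
)


def unfilled_slots(recipe: dict, notes: list) -> list:
    """Slots that have neither a live source URL nor a scaffold-fallback note."""
    missing = []
    for slot in REQUIRED_SLOTS:
        has_source = bool(recipe.get(slot, {}).get("source_url"))
        has_fallback = any(f"no live source for {slot}" in note for note in notes)
        if not (has_source or has_fallback):
            missing.append(slot)
    return missing
```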


Step 4 — Generate project & smoke-test


Write all scripts, build the image, prepare data, run a 1-step smoke on real data (one `docker build`, two `docker run`s).

4a. Generate project files in `output_dir/`: `config.yaml`, `Dockerfile`, `requirements.txt`, `prepare_data.py`, `train.py`, `run_eval.py` (the eval script MUST be `run_eval.py`, never `evaluate.py`, which collides with HF `evaluate`), `infer.py`, `merge_lora.py` for VLM-LoRA, `.gitignore`. Authority order: Step 3 live research → scaffold reference (`cv-scripts.md` / `vlm-scripts.md`) for structure only, never their `[FETCH LIVE]` blocks. Apply each `applicable_workarounds` entry as a Dockerfile block, requirements pin, config override, or runtime env var. Every generated `.py` begins with the NVIDIA Apache-2.0 `#`-comment copyright header (emitter must fail otherwise). If `emit_unit_tests: true`, also generate `tests/` per `references/testing.md`. See `references/project-scaffold.md` for the full file table, the exact copyright header, and the Dockerfile template (deps → compat → code layer order).

4b. Build, prepare, smoke: `docker build -t run-<short>:latest .`, then run `references/docker-runs.md` §1 (build), §2 (prepare_data), §3 (smoke, `--smoke --max_steps 1`); §3 lists the smoke pass criteria (no exception, loss finite, `grad_norm > 0` at step 1). If `emit_unit_tests: true`, also run `pytest tests/` inside the container. Any failure → STOP.
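
The numeric part of the smoke pass criteria can be sketched as below. The function name is hypothetical, and how `loss` and `grad_norm` are read out of the step-1 log is left to the caller; "no exception" is covered by the smoke run exiting cleanly before this check is reached.

```python
import math


def smoke_passed(loss: float, grad_norm: float) -> bool:
    """Step-1 criteria: loss is finite and gradients actually flow."""
    # NaN fails both tests: isfinite(nan) is False and nan > 0 is False.
    return math.isfinite(loss) and grad_norm > 0
```

A zero `grad_norm` with a finite loss usually means nothing trainable received gradients, which is why it fails the smoke even though no exception was raised.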

4c. Preflight summary: print the boxed `─ PREFLIGHT ─` summary (reference URL, dataset columns, push_to_hub repo, wandb monitoring, ngc_image, hardware, smoke result) and verify every field is filled before launching full training. Exact format: `references/project-scaffold.md`.

Gate: project files written, image built, smoke PASSED, preflight has no blank fields.


Step 5 — Train, evaluate, infer


Run in order, all commands in `references/docker-runs.md`: 5a baseline eval (§4, skip if `skip_baseline: true`), 5b full training detached (§5), 5c LoRA merge (§6, only VLM-with-LoRA), 5d post-train eval (§7), 5e inference 5 samples (§8). Multi-GPU: prepend `torchrun --nproc_per_node=$gpu_count` to `python train.py`. Watch `docker logs -f hft_train`: loss should drop within 10-20 steps (flat → stop; NaN → reduce LR; OOM → halve batch; full recovery in `references/core-rules.md` / `references/error-playbook.md`). If `emit_report: true`, run `report.py` after Step 5e per `references/reporting.md`.
Gate, all of: `checkpoints/final/` (or `checkpoints/merged/` for LoRA) exists; `reports/eval_results.json` has a numeric primary metric; `reports/baseline_results.json` exists (unless skipped); `reports/inference_samples/` has 5 samples; wandb URL shows descending loss.
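
The file-based parts of this gate can be checked mechanically. A sketch under stated assumptions: paths are resolved relative to `output_dir`, the primary metric is a top-level key in `eval_results.json` whose name the caller supplies, and each inference sample is at least one file. The wandb loss trend is not covered.

```python
import json
from pathlib import Path


def step5_gate_failures(output_dir, primary_metric, lora=False,
                        skip_baseline=False, n_samples=5) -> list:
    """Return human-readable reasons the Step 5 gate is not met (empty = pass)."""
    root = Path(output_dir)
    reports = root / "reports"
    failures = []

    ckpt = root / "checkpoints" / ("merged" if lora else "final")
    if not ckpt.is_dir():
        failures.append(f"missing checkpoints/{ckpt.name}/")

    try:
        value = json.loads((reports / "eval_results.json").read_text())[primary_metric]
        # bool is a subclass of int, so exclude it explicitly.
        numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
    except (OSError, ValueError, KeyError, TypeError):
        numeric = False
    if not numeric:
        failures.append(f"eval_results.json has no numeric '{primary_metric}'")

    if not skip_baseline and not (reports / "baseline_results.json").is_file():
        failures.append("missing reports/baseline_results.json")

    samples = reports / "inference_samples"
    count = sum(1 for p in samples.iterdir() if p.is_file()) if samples.is_dir() else 0
    if count < n_samples:
        failures.append(f"only {count} inference sample file(s), need {n_samples}")

    return failures
```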


Step 6 — Push & emit rerun skill


Publish the run and make it reproducible without re-research.

6a. Push to HF Hub: use `references/hub-push.md` (pushes weights merged or final, a generated model card `README.md`, `results/{eval,baseline}_results.json`, `config.yaml`, `Dockerfile`, `requirements.txt`, `inference_samples/*.jpg`, and `report.{pdf,html}` if `emit_report: true`). Skip iff `push_to_hub: false` is explicit in `config.yaml`.

6b. Emit rerun skill at `<output_dir>/skills/run-<short>/SKILL.md` per `references/pipeline-skill-template.md`. Every `<placeholder>` must be a real value (literal placeholders are a bug); include the full YAML (`license`, `compatibility`, `metadata`, `allowed-tools`) and the NVIDIA copyright notice in an HTML comment immediately after the closing `---`, as in that template; an emitter must fail unless the emitted `SKILL.md` contains those fields and the copyright comment.

Gate (Done criteria), all of: Step 5 gate met; HF Hub repo exists at the resolved URL with weights + card + `results/` (unless `push_to_hub: false`); `<output_dir>/skills/run-<short>/SKILL.md` exists with no `<placeholder>` left, with metadata + copyright HTML comment per `pipeline-skill-template.md`.
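
The emitter-side check on the generated `SKILL.md` could look like the sketch below. It is an approximation, not the template's validator: it assumes the four keys appear at the top level of the YAML block, that the copyright comment contains the word "copyright", and that any remaining `<word>` outside HTML comments is an unreplaced placeholder (so a legitimate HTML tag would be flagged too).

```python
import re

REQUIRED_KEYS = ("license", "compatibility", "metadata", "allowed-tools")
PLACEHOLDER = re.compile(r"<[A-Za-z_][A-Za-z0-9_\- ]*>")


def skill_md_problems(text: str) -> list:
    """Return the reasons an emitted SKILL.md should be rejected (empty = ok)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ["missing YAML block at top of file"]
    try:
        close = lines.index("---", 1)      # the closing delimiter line
    except ValueError:
        return ["unterminated YAML block"]

    problems = []
    front = lines[1:close]
    for key in REQUIRED_KEYS:
        if not any(line.startswith(f"{key}:") for line in front):
            problems.append(f"missing key: {key}")

    # The copyright notice must be the first thing after the closing '---'.
    after = "\n".join(lines[close + 1:]).lstrip()
    first_comment = after.split("-->", 1)[0]
    if not (after.startswith("<!--") and "copyright" in first_comment.lower()):
        problems.append("missing copyright HTML comment after the YAML block")

    # Ignore HTML comments, then look for anything shaped like <placeholder>.
    visible = re.sub(r"<!--.*?-->", "", text, flags=re.S)
    if PLACEHOLDER.search(visible):
        problems.append("unreplaced <placeholder>")
    return problems
```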

Final message to user (terse, with direct URLs): wandb URL; HF Hub URL; primary metric baseline → fine-tuned (Δ); path to `reports/inference_samples/`; path to `<output_dir>/skills/run-<short>/SKILL.md`.


Error playbook


On a known runtime error, consult `references/error-playbook.md` before redesigning anything; its symptom → minimal-fix table covers NGC ENTRYPOINT, SDPA+GQA, `transformers>=4.51` regression, numpy 2.x ABI, Albumentations bbox, PEFT + gradient_checkpointing, SmolVLM SDPA, LoRA target-regex, missing CV augmentation, OOM at step 0, and more. When a row fires twice across runs, lift it into `references/compat-workarounds.md` with a `detect` rule, auto-applied in Step 1d before the error can fire.
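
The promotion rule can be sketched with a counter over per-run logs. This assumes each run records the identifiers of the playbook rows that fired, and reads "twice across runs" as "in at least two distinct runs"; the row identifiers in any real log are whatever the playbook uses, not the ones implied here.

```python
from collections import Counter


def rows_to_promote(fired_rows_per_run, threshold: int = 2) -> list:
    """Playbook rows that fired in at least `threshold` separate runs."""
    counts = Counter()
    for run_rows in fired_rows_per_run:
        # set() so a row that fires repeatedly inside one run counts once.
        counts.update(set(run_rows))
    return sorted(row for row, n in counts.items() if n >= threshold)
```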


Communication style


Terse: no filler, no restating the request; always include direct Hub + wandb URLs; on error state what went wrong, why, what you changed (no menus, no "Option A/B/C" when the answer is clear; act). Full text: `references/core-rules.md`.


Example pipelines


tao-rerun-convnext-cifar10 — facebook/convnext-tiny-224 on cifar10 (image-classification, 10 classes, subset 5000/1000).
tao-rerun-detr-cppe5 — facebook/detr-resnet-50 on cppe-5 (object-detection, 5 classes, subset 800/200).
tao-rerun-segformer-foodseg103 — nvidia/mit-b0 on EduardoPacheco/FoodSeg103 (semantic segmentation, 103 classes + background, subset 1000/200).
tao-rerun-smolvlm-vqav2 — HuggingFaceTB/SmolVLM-256M-Instruct on merve/vqav2-small (image-text-to-text VLM LoRA, subset 500/100, 5 epochs).
