tao-generate-image-grounding

Image Grounding Pipeline

Turn `(image, caption)` pairs into per-image grounded annotations: cleaned captions, referring expressions with character spans, and pixel-space bounding boxes for each expression. A single VLM (Gemini or any OpenAI-compatible endpoint) handles both steps.

Purpose

Generate phrase-grounded training data for referring-expression and grounding models. The VLM acts as a "teacher" annotator: Step 0 extracts referring expressions from the caption while looking at the image; Step 1 returns one bbox set per expression for each image.

Pipeline Architecture

Step 0: Expression extraction  → VLM cleans caption, extracts referring expressions + char spans
Step 1: Phrase grounding       → VLM returns pixel bboxes + scores per expression

Steps are individually selectable via `workflow.steps`. Each step writes a per-sample checkpoint to `step_<N>_*/.ckpt/<sample_id>.json` and skips already-processed records on re-run. Set `workflow.force_reprocess: true` to ignore checkpoints and reprocess from scratch.
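
The resume behaviour can be pictured with a minimal Python sketch. This is not the pipeline's actual code: `process_with_checkpoints`, the `annotate` callback, the `sample_id` key, and the checkpoint file contents are all hypothetical. Only the `.ckpt/<sample_id>.json` layout and the `force_reprocess` flag come from this document.

```python
import json
from pathlib import Path


def process_with_checkpoints(samples, step_dir, annotate, force_reprocess=False):
    """Run `annotate` on each sample, reusing per-sample checkpoints.

    Hypothetical helper that mirrors the documented layout
    step_<N>_*/.ckpt/<sample_id>.json; it is not the real implementation.
    """
    ckpt_dir = Path(step_dir) / ".ckpt"
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for sample in samples:
        ckpt_path = ckpt_dir / f"{sample['sample_id']}.json"
        if ckpt_path.exists() and not force_reprocess:
            # Processed on an earlier run: load the result, skip the VLM call.
            results.append(json.loads(ckpt_path.read_text(encoding="utf-8")))
            continue
        record = annotate(sample)
        ckpt_path.write_text(json.dumps(record), encoding="utf-8")
        results.append(record)
    return results
```

With this shape, an interrupted run only pays for the samples that never reached the checkpoint write, which is presumably why the real pipeline checkpoints per sample rather than per step.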

Instructions

Initial setup

When a user wants to run this pipeline, walk through these steps:

Input JSONL: Ask for the JSONL path. Each line must be one object like `{"image_path": "...", "caption": "..."}`. `image_path` can be absolute or relative.
Image root: If any `image_path` values are relative, set `data.image_root` to the directory they should resolve from.
API access: Ask the user which VLM endpoint they want to use. Present these five options and act on the choice:
1. Gemini: set `vlm.backend: "gemini"`; require `GOOGLE_API_KEY` (env var or `vlm.gemini.api_key`).
2. NIM (e.g. `https://inference-api.nvidia.com/v1`): set `vlm.backend: "openai"`; collect `base_url`, `model_name`, and `api_key`.
3. TAO inference microservice (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
  - Running: collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
  - Not running: guide the user through the `skills/applications/tao-run-inference-service` skill, which stands up a local TAO inference microservice with an OpenAI-compatible API. Before promising a specific model, check `skills/applications/tao-run-inference-service/references/service.yaml` for `valid_network_arch_config_basenames`. Once the server is up, collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
4. vLLM (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
  - Running: collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
  - Not running: follow references/vllm_server.md to install and launch a vLLM server, then collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
5. Custom (any other OpenAI-compatible endpoint): set `vlm.backend: "openai"`; collect `base_url`, `model_name`, and (optionally) `api_key`.
If the user has no endpoint and does not want to set one up, stop and help resolve API access first.
Workflow steps: Choose one of:
- Full pipeline: `["0", "1"]`
- Expression extraction only: `["0"]`
- Grounding only: `["1"]`, which requires existing step-0 output at `results_dir/step_0_expression_extraction/annotations.jsonl`
Resume vs fresh run: By default, the workflow reuses checkpoints and skips completed records. To reprocess everything, set `image_grounding.workflow.force_reprocess=true`.
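
The step-selection rule (grounding alone needs step-0 output already on disk) can be checked before launching a run. The following is a minimal sketch; `check_steps` and `VALID_STEPS` are hypothetical names, not part of the `auto_label` CLI, and only the step ids and the step-0 output path are taken from this document.

```python
from pathlib import Path

VALID_STEPS = {"0", "1"}  # "0" = expression extraction, "1" = grounding


def check_steps(steps, results_dir):
    """Return a list of problems with the requested steps (empty if OK)."""
    problems = []
    unknown = [s for s in steps if s not in VALID_STEPS]
    if unknown:
        problems.append(f"unknown steps: {unknown}")
    if "1" in steps and "0" not in steps:
        # Grounding-only runs read the expressions produced by step 0.
        step0 = Path(results_dir) / "step_0_expression_extraction" / "annotations.jsonl"
        if not step0.is_file():
            problems.append(f"grounding-only run needs existing step-0 output at {step0}")
    return problems
```

Running a check like this up front avoids starting a grounding-only job that has no expressions to ground.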

Running the pipeline

The pipeline runs inside the TAO Toolkit container via the `auto_label` CLI:

auto_label generate -e /path/to/spec.yaml \
    results_dir=/results \
    image_grounding.data.input_jsonl=/data/captions.jsonl \
    image_grounding.data.image_root=/data/images \
    image_grounding.vlm.gemini.api_key=$GOOGLE_API_KEY

Generate a default spec with `auto_label default_specs results_dir=/results module_name=auto_label`, then set `autolabel_type: "image_grounding"`. All fields support Hydra dot-notation overrides on the command line.

See references/configuration.md for the full YAML structure, all parameters, model/endpoint setup, and error patterns.
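
When the command is assembled from a script, the dot-notation overrides can be built programmatically. The sketch below uses a hypothetical helper, `build_generate_command`. It uses only override keys that appear in this document, and it assumes every override lives under the `image_grounding.` prefix and that booleans are written as lowercase `true`/`false`.

```python
import shlex


def build_generate_command(spec_path, results_dir, overrides):
    """Assemble an `auto_label generate` argv list with dot-notation overrides."""
    cmd = ["auto_label", "generate", "-e", spec_path, f"results_dir={results_dir}"]
    for key, value in overrides.items():
        if isinstance(value, bool):
            # Assumption: booleans are passed as lowercase true/false.
            value = "true" if value else "false"
        cmd.append(f"image_grounding.{key}={value}")
    return cmd


cmd = build_generate_command(
    "/path/to/spec.yaml",
    "/results",
    {
        "data.input_jsonl": "/data/captions.jsonl",
        "data.image_root": "/data/images",
        "workflow.force_reprocess": True,
    },
)
print(shlex.join(cmd))  # shell-quoted form, suitable for logging
```

The list form can be handed to `subprocess.run` inside the container. Reading the API key from the `GOOGLE_API_KEY` environment variable, rather than adding it as an override, keeps it out of logged command lines.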

Recommended pilot workflow

Configuration

Key configuration fields (full reference in references/configuration.md):

| Field | Default | Description |
| --- | --- | --- |
| `workflow.steps` | `["0","1"]` | Which pipeline steps to execute (`"0"` = expressions, `"1"` = grounding) |
| `workflow.max_workers` | `4` | Parallel threads per step (watch API rate limits) |
| `workflow.force_reprocess` | `false` | Ignore per-sample checkpoints and reprocess from scratch |
| `vlm.backend` | `"gemini"` | `"gemini"` or `"openai"` (OpenAI-compatible endpoint) |
| `data.input_jsonl` | required | Path to input JSONL with `image_path` + `caption` per line |
| `data.image_root` | `""` | Optional prefix for resolving relative `image_path` entries |
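
To illustrate what `workflow.max_workers` controls, here is a minimal sketch of a bounded thread pool applied to one step. It assumes the step is I/O-bound on VLM calls. `run_step` and `annotate` are hypothetical names, and the real pipeline may schedule work differently.

```python
from concurrent.futures import ThreadPoolExecutor


def run_step(samples, annotate, max_workers=4):
    """Apply `annotate` to every sample using at most `max_workers` threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order even though calls overlap.
        return list(pool.map(annotate, samples))
```

Raising `max_workers` increases the number of concurrent requests to the endpoint, so a lower value is the safer choice when the API enforces rate limits.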

Inputs

A single JSONL file at `data.input_jsonl`. One JSON object per line:

| Field | Required | Description |
| --- | --- | --- |
| `image_path` | yes | Absolute path, or relative path resolved against `data.image_root` |
| `caption` | yes | Free-text caption for the image |
| `image_id` | no | Stable identifier; auto-derived from the filename if missing |
| `width`, `height` | no | Image dimensions in pixels; default to `1920×1080` for bbox clamping if missing |
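
The field rules above can be sketched as a small normalizer for one input line. `normalize_record` is a hypothetical helper, not the pipeline's own loader. In particular, taking the filename without its extension as the `image_id` is an assumption about what "derived from the filename" means.

```python
import json
from pathlib import Path

DEFAULT_WIDTH, DEFAULT_HEIGHT = 1920, 1080  # documented fallback for bbox clamping


def normalize_record(line, image_root=""):
    """Parse one JSONL line, check required fields, fill in optional ones."""
    record = json.loads(line)
    for field in ("image_path", "caption"):
        if not record.get(field):
            raise ValueError(f"missing required field: {field}")
    path = Path(record["image_path"])
    if not path.is_absolute() and image_root:
        # Relative paths resolve against data.image_root.
        path = Path(image_root) / path
    record["image_path"] = str(path)
    # Assumption: the derived id is the filename without its extension.
    record.setdefault("image_id", path.stem)
    record.setdefault("width", DEFAULT_WIDTH)
    record.setdefault("height", DEFAULT_HEIGHT)
    return record
```

Supplying real `width` and `height` values where they are known seems preferable, since the `1920×1080` fallback only gives correct clamping for images of that size.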

Outputs

All outputs go to `results_dir/`:

- `step_0_expression_extraction/annotations.jsonl`: per-record output enriched with `cleaned_caption` and `expressions[]` (each with `text`, `expression_id`, `char_span`, `noun_chunk`, and an empty `instances[]`).
- `step_1_grounding/annotations.jsonl`: same records with `expressions[].instances[]` filled in (each instance has `bbox: [x1,y1,x2,y2]` in pixel space, a `score` in `[0.0, 1.0]`, and a `bbox_id`).
- `results_dir/annotations.jsonl`: copy of the last step's output for convenience.
- `step_<N>_*/.ckpt/<sample_id>.json`: per-sample checkpoints used for resume.
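
A quick sanity check over step-1 records might look like the sketch below. `check_record` is a hypothetical helper. The field names come from the output description above, but treating `char_span` as a `[start, end)` slice into `cleaned_caption` is an assumption and should be confirmed against real output before relying on it.

```python
def check_record(record):
    """Return human-readable problems found in one step-1 record."""
    problems = []
    caption = record["cleaned_caption"]
    for expr in record["expressions"]:
        start, end = expr["char_span"]
        # Assumption: char_span is a [start, end) slice of cleaned_caption.
        if caption[start:end] != expr["text"]:
            problems.append(f"{expr['expression_id']}: span does not match text")
        for inst in expr["instances"]:
            if not 0.0 <= inst["score"] <= 1.0:
                problems.append(f"{inst['bbox_id']}: score out of range")
            x1, y1, x2, y2 = inst["bbox"]
            if x1 >= x2 or y1 >= y2:
                problems.append(f"{inst['bbox_id']}: degenerate bbox")
    return problems
```

Applied to each parsed line of `results_dir/annotations.jsonl`, an empty list for every record suggests the annotations are structurally consistent. It says nothing about whether the boxes are visually correct.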

Prerequisites

Container: `nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt`

API access: At least one VLM endpoint (Gemini API key or OpenAI-compatible endpoint capable of image input)
