tao-generate-image-grounding

Image Grounding Pipeline

Turn `(image, caption)` pairs into per-image grounded annotations: cleaned captions, referring expressions with character spans, and pixel-space bounding boxes for each expression. A single VLM (Gemini or any OpenAI-compatible endpoint) handles both steps.

Purpose

Generate phrase-grounded training data for referring-expression and grounding models. The VLM acts as a "teacher" annotator: Step 0 extracts referring expressions from the caption while looking at the image; Step 1 returns one bbox set per expression for each image.

Pipeline Architecture

Step 0: Expression extraction  → VLM cleans caption, extracts referring expressions + char spans
Step 1: Phrase grounding       → VLM returns pixel bboxes + scores per expression
Steps are individually selectable via `workflow.steps`. Each step writes a per-sample checkpoint to `step_<N>_*/.ckpt/<sample_id>.json` and skips already-processed records on re-run. Set `workflow.force_reprocess: true` to ignore checkpoints and reprocess from scratch.
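
The resume behavior can be pictured with a small sketch. This is not the TAO implementation: `process_with_checkpoints`, `annotate`, and the `sample_id` record key are hypothetical names chosen for illustration. Only the `.ckpt/<sample_id>.json` layout and the `force_reprocess` flag come from the description above.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def process_with_checkpoints(
    records: Iterable[dict],
    step_dir: Path,
    annotate: Callable[[dict], dict],
    force_reprocess: bool = False,
) -> list[dict]:
    """Run `annotate` once per record, reusing per-sample checkpoints.

    Hypothetical helper that mirrors the documented behavior, not the real code.
    """
    ckpt_dir = step_dir / ".ckpt"
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for record in records:
        ckpt_path = ckpt_dir / f"{record['sample_id']}.json"
        if ckpt_path.exists() and not force_reprocess:
            # Processed on an earlier run: load the result instead of calling the VLM.
            results.append(json.loads(ckpt_path.read_text(encoding="utf-8")))
            continue
        result = annotate(record)
        ckpt_path.write_text(json.dumps(result), encoding="utf-8")
        results.append(result)
    return results
```

Writing one file per sample means an interrupted run loses at most the records that were in flight, which matters when every record costs an API call.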

Instructions

Initial setup

When a user wants to run this pipeline, walk through these steps:
  1. Input JSONL: Ask for the JSONL path. Each line must be one object like `{"image_path": "...", "caption": "..."}`. `image_path` can be absolute or relative.
  2. Image root: If any `image_path` values are relative, set `data.image_root` to the directory they should resolve from.
  3. API access: Ask the user which VLM endpoint they want to use. Present these five options and act on the choice:
    1. Gemini: set `vlm.backend: "gemini"`; require `GOOGLE_API_KEY` (env var or `vlm.gemini.api_key`).
    2. NIM (e.g. `https://inference-api.nvidia.com/v1`): set `vlm.backend: "openai"`; collect `base_url`, `model_name`, and `api_key`.
    3. TAO inference microservice (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
      • Running: collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
      • Not running: guide the user through the `skills/applications/tao-run-inference-service` skill, which stands up a local TAO inference microservice with an OpenAI-compatible API. Before promising a specific model, check `skills/applications/tao-run-inference-service/references/service.yaml` for `valid_network_arch_config_basenames`. Once the server is up, collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
    4. vLLM (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
      • Running: collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
      • Not running: follow references/vllm_server.md to install and launch a vLLM server, then collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
    5. Custom (any other OpenAI-compatible endpoint): set `vlm.backend: "openai"`; collect `base_url`, `model_name`, and (optionally) `api_key`.
    If the user has no endpoint and does not want to set one up, stop and help resolve API access first.
  4. Workflow steps: Choose one of:
    • Full pipeline: `["0", "1"]`
    • Expression extraction only: `["0"]`
    • Grounding only: `["1"]`, which requires existing step-0 output at `results_dir/step_0_expression_extraction/annotations.jsonl`
  5. Resume vs fresh run: By default, the workflow reuses checkpoints and skips completed records. To reprocess everything, set `image_grounding.workflow.force_reprocess=true`.
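
Before spending API calls, it can help to sanity-check the input file against the rules in steps 1 and 2. The sketch below is one way to do that. `validate_input_jsonl` is a hypothetical helper, not part of `auto_label`, and skipping blank lines is an assumption of this sketch rather than documented pipeline behavior.

```python
import json
from pathlib import Path


def validate_input_jsonl(jsonl_path: str, image_root: str = "") -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    with open(jsonl_path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # blank lines are tolerated in this sketch
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: expected a JSON object")
                continue
            for key in ("image_path", "caption"):
                if not isinstance(record.get(key), str) or not record[key]:
                    problems.append(f"line {line_no}: missing or empty '{key}'")
            image_path = record.get("image_path")
            if isinstance(image_path, str) and image_path:
                path = Path(image_path)
                if not path.is_absolute():
                    if not image_root:
                        problems.append(
                            f"line {line_no}: relative image_path but no image_root"
                        )
                        continue
                    # Relative paths resolve against the image root.
                    path = Path(image_root) / path
                if not path.is_file():
                    problems.append(f"line {line_no}: image not found: {path}")
    return problems
```

An empty return value suggests the file is at least structurally ready; it says nothing about caption quality.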

Running the pipeline

The pipeline runs inside the TAO Toolkit container via the `auto_label` CLI:

```bash
auto_label generate -e /path/to/spec.yaml \
    results_dir=/results \
    image_grounding.data.input_jsonl=/data/captions.jsonl \
    image_grounding.data.image_root=/data/images \
    image_grounding.vlm.gemini.api_key=$GOOGLE_API_KEY
```

Generate a default spec with `auto_label default_specs results_dir=/results module_name=auto_label`, then set `autolabel_type: "image_grounding"`. All fields support Hydra dot-notation overrides on the command line.

See references/configuration.md for the full YAML structure, all parameters, model/endpoint setup, and error patterns.
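
When the command is launched from a script, the overrides can be assembled programmatically. The following is a minimal sketch: `build_generate_command` is a hypothetical helper, and it uses only override keys that appear in this document (`results_dir`, `image_grounding.data.input_jsonl`, `image_grounding.data.image_root`, `image_grounding.workflow.force_reprocess`). The API key is left out on purpose, on the assumption that it is supplied through the `GOOGLE_API_KEY` environment variable.

```python
import shlex


def build_generate_command(
    spec_path: str,
    results_dir: str,
    input_jsonl: str,
    image_root: str = "",
    force_reprocess: bool = False,
) -> list[str]:
    """Assemble the argv for `auto_label generate` with Hydra-style overrides."""
    argv = [
        "auto_label", "generate", "-e", spec_path,
        f"results_dir={results_dir}",
        f"image_grounding.data.input_jsonl={input_jsonl}",
    ]
    if image_root:
        argv.append(f"image_grounding.data.image_root={image_root}")
    if force_reprocess:
        argv.append("image_grounding.workflow.force_reprocess=true")
    return argv


cmd = build_generate_command(
    "/path/to/spec.yaml", "/results", "/data/captions.jsonl", "/data/images"
)
print(shlex.join(cmd))  # shell-safe rendering for logging or copy/paste
```

Returning a list rather than a string lets the caller pass it straight to `subprocess.run` without shell quoting concerns.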

Recommended pilot workflow

  1. Run on 5-10 images with both steps.
  2. Inspect `step_0_expression_extraction/annotations.jsonl`: are `cleaned_caption` and `expressions[]` accurate? Are the right noun phrases captured?
  3. Inspect `step_1_grounding/annotations.jsonl`: do the bboxes in `expressions[].instances[]` look right? Are confidence scores reasonable?
  4. If quality is insufficient, switch the VLM to a stronger model (e.g. `gemini-2.5-pro`) or raise `media_resolution` / `max_output_tokens`, then re-run with `force_reprocess=true`.
  5. Scale to the full dataset once satisfied.
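
One way to prepare the pilot input for step 1 of this workflow is to sample a handful of lines from the full JSONL. This is a sketch only: `write_pilot_subset` is a hypothetical helper, and the fixed seed is an assumption made so that the same pilot set is drawn on every run.

```python
import random


def write_pilot_subset(
    src_jsonl: str, dst_jsonl: str, n: int = 10, seed: int = 0
) -> int:
    """Copy up to `n` randomly chosen non-empty lines from src to dst."""
    with open(src_jsonl, encoding="utf-8") as fh:
        lines = [line for line in fh if line.strip()]
    rng = random.Random(seed)  # seeded so the pilot set is reproducible
    picked = rng.sample(lines, k=min(n, len(lines)))
    with open(dst_jsonl, "w", encoding="utf-8") as fh:
        for line in picked:
            # The last line of the source may lack a trailing newline.
            fh.write(line if line.endswith("\n") else line + "\n")
    return len(picked)
```

Point `data.input_jsonl` at the subset file for the pilot, then switch back to the full file when scaling up.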

Configuration

Key configuration fields (full reference in references/configuration.md):

| Field | Default | Description |
|---|---|---|
| `workflow.steps` | `["0","1"]` | Which pipeline steps to execute (`"0"` = expressions, `"1"` = grounding) |
| `workflow.max_workers` | `4` | Parallel threads per step (watch API rate limits) |
| `workflow.force_reprocess` | `false` | Ignore per-sample checkpoints and reprocess from scratch |
| `vlm.backend` | `"gemini"` | `"gemini"` or `"openai"` (OpenAI-compatible endpoint) |
| `data.input_jsonl` | required | Path to input JSONL with `image_path` + `caption` per line |
| `data.image_root` | `""` | Optional prefix for resolving relative `image_path` entries |

Inputs

A single JSONL file at `data.input_jsonl`. One JSON object per line:

| Field | Required | Description |
|---|---|---|
| `image_path` | yes | Absolute path, or relative path resolved against `data.image_root` |
| `caption` | yes | Free-text caption for the image |
| `image_id` | no | Stable identifier; auto-derived from the filename if missing |
| `width`, `height` | no | Image dimensions in pixels; default to `1920×1080` for bbox clamping if missing |
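
The optional fields matter mostly for their fallbacks. The sketch below illustrates both. It assumes that "derived from the filename" means the filename without its extension, and that clamping restricts each coordinate to the image bounds; neither detail is confirmed by this document, and `derive_image_id` and `clamp_bbox` are hypothetical names.

```python
from pathlib import Path

DEFAULT_WIDTH, DEFAULT_HEIGHT = 1920, 1080  # fallback stated in the Inputs section


def derive_image_id(record: dict) -> str:
    """Use `image_id` if present, else the filename stem (assumed rule)."""
    return record.get("image_id") or Path(record["image_path"]).stem


def clamp_bbox(bbox: list[float], record: dict) -> list[float]:
    """Clamp [x1, y1, x2, y2] to the image bounds, using defaults if needed."""
    width = record.get("width") or DEFAULT_WIDTH
    height = record.get("height") or DEFAULT_HEIGHT
    x1, y1, x2, y2 = bbox
    return [
        min(max(x1, 0), width),
        min(max(y1, 0), height),
        min(max(x2, 0), width),
        min(max(y2, 0), height),
    ]
```

If the real images are not 1920×1080, supplying `width` and `height` avoids boxes being clamped against the wrong bounds.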

Outputs

All outputs go to `results_dir/`:
  • `step_0_expression_extraction/annotations.jsonl`: per-record output enriched with `cleaned_caption` and `expressions[]` (each with `text`, `expression_id`, `char_span`, `noun_chunk`, and an empty `instances[]`).
  • `step_1_grounding/annotations.jsonl`: same records with `expressions[].instances[]` filled in (each instance has `bbox: [x1,y1,x2,y2]` in pixel space, `score` in `[0.0, 1.0]`, and `bbox_id`).
  • `results_dir/annotations.jsonl`: copy of the last step's output for convenience.
  • `step_<N>_*/.ckpt/<sample_id>.json`: per-sample checkpoints used for resume.
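
For downstream use, the grounded annotations can be read with a few lines of Python. This sketch relies only on the field names listed above (`expressions`, `text`, `instances`, `bbox`, `score`). The helper name `iter_confident_boxes` and the `0.5` threshold are illustrative assumptions, not pipeline defaults.

```python
import json


def iter_confident_boxes(annotations_jsonl: str, min_score: float = 0.5):
    """Yield (expression text, bbox, score) for instances at or above min_score."""
    with open(annotations_jsonl, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            record = json.loads(line)
            for expression in record.get("expressions", []):
                # Step-0-only output has empty instances[], so nothing is yielded.
                for instance in expression.get("instances", []):
                    if instance["score"] >= min_score:
                        yield expression["text"], instance["bbox"], instance["score"]
```

Iterating over `results_dir/annotations.jsonl` with this helper gives a quick view of which expressions were grounded with usable confidence during the pilot.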

Prerequisites

  • Container: `nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt`
  • API access: At least one VLM endpoint (Gemini API key or OpenAI-compatible endpoint capable of image input)