tao-generate-image-grounding
Image Grounding Pipeline
Turn `(image, caption)` pairs into per-image grounded annotations: cleaned captions, referring expressions with character spans, and pixel-space bounding boxes for each expression. A single VLM (Gemini or any OpenAI-compatible endpoint) handles both steps.
Purpose
Generate phrase-grounded training data for referring-expression and grounding models. The VLM acts as a "teacher" annotator: Step 0 extracts referring expressions from the caption while looking at the image; Step 1 returns one bbox set per expression for each image.
Pipeline Architecture
Step 0: Expression extraction → VLM cleans caption, extracts referring expressions + char spans
Step 1: Phrase grounding → VLM returns pixel bboxes + scores per expression
Steps are individually selectable via `workflow.steps`. Each step writes a per-sample checkpoint to `step_<N>_*/.ckpt/<sample_id>.json` and skips already-processed records on re-run. Set `workflow.force_reprocess: true` to ignore checkpoints and reprocess from scratch.
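The checkpoint-and-skip behaviour can be pictured with a small sketch. This is not the toolkit's implementation: `run_step`, the `annotate` callback, and the checkpoint payload are hypothetical names chosen for illustration, and only the `.ckpt/<sample_id>.json` layout and the `force_reprocess` flag come from the description.

```python
import json
from pathlib import Path


def run_step(records, annotate, step_dir, force_reprocess=False):
    """Apply `annotate` to each record, reusing per-sample checkpoints.

    Hypothetical helper: the real pipeline's internals may differ.
    """
    ckpt_dir = Path(step_dir) / ".ckpt"
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for record in records:
        ckpt_path = ckpt_dir / f"{record['sample_id']}.json"
        if ckpt_path.exists() and not force_reprocess:
            # Processed on an earlier run: load the result, skip the VLM call.
            results.append(json.loads(ckpt_path.read_text(encoding="utf-8")))
            continue
        result = annotate(record)
        ckpt_path.write_text(json.dumps(result), encoding="utf-8")
        results.append(result)
    return results
```

Because each sample has its own checkpoint file, an interrupted run loses at most the samples that were in flight, which is presumably why re-runs are cheap by default.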
Instructions
Initial setup
When a user wants to run this pipeline, walk through these steps:
- Input JSONL: Ask for the JSONL path. Each line must be one object like `{"image_path": "...", "caption": "..."}`. `image_path` can be absolute or relative.
- Image root: If any `image_path` values are relative, set `data.image_root` to the directory they should resolve from.
- API access: Ask the user which VLM endpoint they want to use. Present these five options and act on the choice:
  - Gemini: set `vlm.backend: "gemini"`; require `GOOGLE_API_KEY` (env var or `vlm.gemini.api_key`).
  - NIM (e.g. `https://inference-api.nvidia.com/v1`): set `vlm.backend: "openai"`; collect `base_url`, `model_name`, and `api_key`.
  - TAO inference microservice (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
    - Running: collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
    - Not running: guide the user through the `skills/applications/tao-run-inference-service` skill, which stands up a local TAO inference microservice with an OpenAI-compatible API. Before promising a specific model, check `skills/applications/tao-run-inference-service/references/service.yaml` for `valid_network_arch_config_basenames`. Once the server is up, collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
  - vLLM (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
    - Running: collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
    - Not running: follow references/vllm_server.md to install and launch a vLLM server, then collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
  - Custom (any other OpenAI-compatible endpoint): set `vlm.backend: "openai"`; collect `base_url`, `model_name`, and (optionally) `api_key`.
  If the user has no endpoint and does not want to set one up, stop and help resolve API access first.
- Workflow steps: Choose one of:
  - Full pipeline: `["0", "1"]`
  - Expression extraction only: `["0"]`
  - Grounding only: `["1"]`, which requires existing step-0 output at `results_dir/step_0_expression_extraction/annotations.jsonl`
- Resume vs fresh run: By default, the workflow reuses checkpoints and skips completed records. To reprocess everything, set `image_grounding.workflow.force_reprocess=true`.
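Before launching, the step selection can be sanity-checked with a few lines of Python. The sketch below is an illustration only: `check_steps` is a hypothetical helper, not part of the toolkit, and it encodes just the two rules stated above (steps are `"0"` and `"1"`, and a grounding-only run needs the step-0 output file).

```python
from pathlib import Path

VALID_STEPS = ("0", "1")


def check_steps(steps, results_dir):
    """Return a list of problems with the chosen steps (empty if none)."""
    problems = []
    unknown = [s for s in steps if s not in VALID_STEPS]
    if unknown:
        problems.append(f"unknown steps: {unknown}")
    if list(steps) == ["1"]:
        # Grounding-only runs read the expressions produced by step 0.
        step0 = Path(results_dir) / "step_0_expression_extraction" / "annotations.jsonl"
        if not step0.is_file():
            problems.append(f"grounding-only run needs step-0 output at {step0}")
    return problems
```

Calling `check_steps(["1"], "/results")` before a grounding-only run surfaces a missing step-0 file up front rather than partway through the job.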
Running the pipeline
The pipeline runs inside the TAO Toolkit container via the `auto_label` CLI:
```bash
auto_label generate -e /path/to/spec.yaml \
  results_dir=/results \
  image_grounding.data.input_jsonl=/data/captions.jsonl \
  image_grounding.data.image_root=/data/images \
  image_grounding.vlm.gemini.api_key=$GOOGLE_API_KEY
```
Generate a default spec: `auto_label default_specs results_dir=/results module_name=auto_label`, then set `autolabel_type: "image_grounding"`. All fields support Hydra dot-notation overrides on the command line.
See references/configuration.md for the full YAML structure, all parameters, model/endpoint setup, and error patterns.
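When the run is launched from a script rather than typed by hand, the same command can be assembled programmatically. The following is a minimal sketch: `build_command` is a hypothetical helper, the paths are placeholders, and only override keys that appear in this document are used. It assumes the Gemini key is supplied through the `GOOGLE_API_KEY` environment variable instead of as an override.

```python
import shlex


def build_command(spec_path, overrides):
    """Assemble an `auto_label generate` argv from Hydra-style overrides."""
    argv = ["auto_label", "generate", "-e", spec_path]
    argv += [f"{key}={value}" for key, value in overrides.items()]
    return argv


argv = build_command(
    "/path/to/spec.yaml",
    {
        "results_dir": "/results",
        "image_grounding.data.input_jsonl": "/data/captions.jsonl",
        "image_grounding.data.image_root": "/data/images",
        "image_grounding.workflow.force_reprocess": "true",
    },
)
# Print the shell-quoted command; inside the container the list could be
# passed to subprocess.run(argv, check=True) instead.
print(shlex.join(argv))
```

Keeping the arguments as a list avoids shell-quoting problems if a path contains spaces.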
Recommended pilot workflow
- Run on 5-10 images with both steps.
- Inspect `step_0_expression_extraction/annotations.jsonl`: are `cleaned_caption` and `expressions[]` accurate? Are the right noun phrases captured?
- Inspect `step_1_grounding/annotations.jsonl`: do the bboxes in `expressions[].instances[]` look right? Are confidence scores reasonable?
- If quality is insufficient, switch the VLM to a stronger model (e.g. `gemini-2.5-pro`) or raise `media_resolution` / `max_output_tokens`, then re-run with `force_reprocess=true`.
- Scale to the full dataset once satisfied.
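One way to prepare the pilot input is to sample a handful of records from the full JSONL into a separate file and point `data.input_jsonl` at it. The sketch below assumes nothing about the toolkit; `write_pilot_subset` and the file names are hypothetical.

```python
import json
import random


def write_pilot_subset(src_jsonl, dst_jsonl, n=10, seed=0):
    """Copy up to `n` randomly chosen records from src_jsonl to dst_jsonl."""
    with open(src_jsonl, encoding="utf-8") as fh:
        lines = [line for line in fh if line.strip()]
    rng = random.Random(seed)  # fixed seed keeps the pilot set reproducible
    picked = rng.sample(lines, min(n, len(lines)))
    with open(dst_jsonl, "w", encoding="utf-8") as fh:
        for line in picked:
            json.loads(line)  # fail early on malformed JSON
            fh.write(line if line.endswith("\n") else line + "\n")
    return len(picked)
```

A random sample tends to be more representative than the first few lines of the file, which often come from a single source or folder.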
Configuration
Key configuration fields (full reference in references/configuration.md):
| Field | Default | Description |
|---|---|---|
| `workflow.steps` | | Which pipeline steps to execute |
| | | Parallel threads per step (watch API rate limits) |
| `workflow.force_reprocess` | | Ignore per-sample checkpoints and reprocess from scratch |
| | | |
| `data.input_jsonl` | required | Path to input JSONL with one `(image_path, caption)` object per line |
| `data.image_root` | | Optional prefix for resolving relative `image_path` values |
Inputs
A single JSONL file at `data.input_jsonl`. One JSON object per line:
| Field | Required | Description |
|---|---|---|
| `image_path` | yes | Absolute path, or relative path resolved against `data.image_root` |
| `caption` | yes | Free-text caption for the image |
| | no | Stable identifier; auto-derived from the filename if missing |
| | no | Image dimensions in pixels; if missing, a default is used for bbox clamping |
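A quick pre-check of the input file can catch malformed lines and missing images before any API calls are made. The sketch below covers only the two required fields and the path-resolution rule described above; `load_records` is a hypothetical helper and does not reflect how the toolkit itself parses the file.

```python
import json
from pathlib import Path


def load_records(input_jsonl, image_root=None):
    """Parse the input JSONL and return (records, errors)."""
    records, errors = [], []
    with open(input_jsonl, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(obj, dict):
                errors.append(f"line {lineno}: expected a JSON object")
                continue
            missing = [k for k in ("image_path", "caption") if not obj.get(k)]
            if missing:
                errors.append(f"line {lineno}: missing {missing}")
                continue
            path = Path(obj["image_path"])
            if not path.is_absolute():
                if image_root is None:
                    errors.append(f"line {lineno}: relative path but no image_root")
                    continue
                # Relative paths resolve against the image root.
                path = Path(image_root) / path
            if not path.is_file():
                errors.append(f"line {lineno}: image not found: {path}")
                continue
            records.append({**obj, "image_path": str(path)})
    return records, errors
```

Collecting errors instead of raising on the first one makes it possible to fix a large caption file in a single pass.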
Outputs
All outputs go to `results_dir/`:
- `step_0_expression_extraction/annotations.jsonl`: per-record output enriched with `cleaned_caption` and `expressions[]` (each with `text`, `expression_id`, `char_span`, `noun_chunk`, empty `instances[]`).
- `step_1_grounding/annotations.jsonl`: same records with `expressions[].instances[]` filled in (each instance has `bbox: [x1,y1,x2,y2]` in pixel space, `score` in `[0.0, 1.0]`, and `bbox_id`).
- `results_dir/annotations.jsonl`: copy of the last step's output for convenience.
- `step_<N>_*/.ckpt/<sample_id>.json`: per-sample checkpoints used for resume.
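The output fields lend themselves to a few automated sanity checks during review. The sketch below is a hypothetical helper, not part of the toolkit. It assumes `char_span` is a two-element `[start, end)` offset pair into `cleaned_caption`, which the description does not confirm, and it takes image dimensions as optional arguments rather than reading them from the record.

```python
def check_record(record, width=None, height=None):
    """Return a list of problems found in one step-1 output record."""
    problems = []
    caption = record.get("cleaned_caption", "")
    for expr in record.get("expressions", []):
        eid = expr.get("expression_id")
        # Assumption: char_span is [start, end) into cleaned_caption.
        start, end = expr.get("char_span", (0, 0))
        if caption[start:end] != expr.get("text"):
            problems.append(f"{eid}: char_span does not match text")
        for inst in expr.get("instances", []):
            x1, y1, x2, y2 = inst["bbox"]
            if not (x1 < x2 and y1 < y2) or min(x1, y1) < 0:
                problems.append(f"{eid}: degenerate or negative bbox {inst['bbox']}")
            if width is not None and height is not None and (x2 > width or y2 > height):
                problems.append(f"{eid}: bbox outside {width}x{height} image")
            if not 0.0 <= inst["score"] <= 1.0:
                problems.append(f"{eid}: score {inst['score']} outside [0.0, 1.0]")
    return problems


# Illustrative record; the identifier values are made up.
example = {
    "cleaned_caption": "a red car",
    "expressions": [
        {
            "expression_id": "e0",
            "text": "red car",
            "char_span": [2, 9],
            "instances": [{"bbox": [10, 20, 110, 80], "score": 0.9, "bbox_id": "b0"}],
        }
    ],
}
print(check_record(example, width=640, height=480))
```

Running such a check over the pilot output gives a quick count of structurally suspicious records, though it says nothing about whether a box actually covers the right object; that still needs visual inspection.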
Prerequisites
- Container: `nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt`
- API access: At least one VLM endpoint (Gemini API key or OpenAI-compatible endpoint capable of image input)