Two-step image grounding pipeline: extracts referring expressions from (image, caption) pairs and grounds them to pixel-space bounding boxes via a VLM. Use when the user wants to ground captions to bboxes, generate phrase-grounded annotations, auto-label images for grounding, or run the image_grounding pipeline. Triggers include 'image grounding', 'phrase grounding', 'ground captions', 'auto-label image grounding', 'image_grounding'.

Install:

    npx skill4agent add nvidia/skills tao-generate-image-grounding

Pipeline:

    (image, caption)
    Step 0: Expression extraction → VLM cleans caption, extracts referring expressions + char spans
    Step 1: Phrase grounding → VLM returns pixel bboxes + scores per expression

Configuration keys and values:

- Workflow: `workflow.steps` (`["0", "1"]`, `["0"]`, or `["1"]`); per-sample checkpoints at `step_<N>_*/.ckpt/<sample_id>.json`; `workflow.force_reprocess: true` (CLI override: `image_grounding.workflow.force_reprocess=true`).
- Input records: `{"image_path": "...", "caption": "..."}`; a relative `image_path` is resolved against `data.image_root`.
- Gemini backend: `vlm.backend: "gemini"`, with the key from `GOOGLE_API_KEY` passed as `vlm.gemini.api_key`.
- OpenAI-compatible backend: `vlm.backend: "openai"` with `base_url`, `model_name`, and `api_key`; endpoint `https://inference-api.nvidia.com/v1`.
- Related skill: `skills/applications/tao-run-inference-service` (`skills/applications/tao-run-inference-service/references/service.yaml`, `valid_network_arch_config_basenames`).
- Step 0 writes `results_dir/step_0_expression_extraction/annotations.jsonl`.

Run with `auto_label`:

    auto_label generate -e /path/to/spec.yaml \
      results_dir=/results \
      image_grounding.data.input_jsonl=/data/captions.jsonl \
      image_grounding.data.image_root=/data/images \
      image_grounding.vlm.gemini.api_key=$GOOGLE_API_KEY

Generate default specs:

    auto_label default_specs results_dir=/results module_name=auto_label

Spec setting: `autolabel_type: "image_grounding"`

- `step_0_expression_extraction/annotations.jsonl`: `cleaned_caption`, `expressions[]`
- `step_1_grounding/annotations.jsonl`: `expressions[].instances[]`
- VLM settings: `gemini-2.5-pro`, `media_resolution`, `max_output_tokens`
- Reprocessing: `force_reprocess=true`

| Field | Default | Description |
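|---|---|---|
| `workflow.steps` | | Which pipeline steps to execute (`["0"]`, `["1"]`, or `["0", "1"]`) |
| | | Parallel threads per step (watch API rate limits) |
| `workflow.force_reprocess` | | Ignore per-sample checkpoints and reprocess from scratch |
| | | |
| `data.input_jsonl` | required | Path to input JSONL with `image_path` and `caption` fields |
| `data.image_root` | | Optional prefix for resolving relative `image_path` |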
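
The checkpoint-and-reprocess behavior described in the table can be pictured with a small sketch. It is not the pipeline's actual code: `process_samples`, `run_step`, and the `sample_id` key are hypothetical names, chosen only to mirror the `step_<N>_*/.ckpt/<sample_id>.json` pattern and the `force_reprocess` flag.

```python
import json
from pathlib import Path


def process_samples(samples, step_dir, run_step, force_reprocess=False):
    """Run `run_step` once per sample, reusing a JSON checkpoint when one exists.

    Hypothetical helper: assumes each sample is a dict with a "sample_id" key
    and that `run_step` returns a JSON-serializable dict.
    """
    ckpt_dir = Path(step_dir) / ".ckpt"
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for sample in samples:
        ckpt = ckpt_dir / f"{sample['sample_id']}.json"
        if ckpt.exists() and not force_reprocess:
            # A finished sample is loaded from disk instead of calling the VLM again.
            results.append(json.loads(ckpt.read_text(encoding="utf-8")))
            continue
        result = run_step(sample)
        ckpt.write_text(json.dumps(result), encoding="utf-8")
        results.append(result)
    return results
```

Under these assumptions, a second run over the same samples makes no new calls to `run_step`, while `force_reprocess=True` recomputes every sample and overwrites its checkpoint.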

Fields of each `data.input_jsonl` record:

| Field | Required | Description |
|---|---|---|
| `image_path` | yes | Absolute path, or relative path resolved against `data.image_root` |
| `caption` | yes | Free-text caption for the image |
| | no | Stable identifier; auto-derived from the filename if missing |
| | no | Image dimensions in pixels; default to |
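
As an illustration of the input format, the sketch below serializes records to JSONL and resolves a relative `image_path` against an image root, as the table describes. The helper names `resolve_image_path` and `to_jsonl` are hypothetical, and the resolution rule (absolute paths kept, relative paths joined to the root) is an assumption based on the field description, not the pipeline's own implementation.

```python
import json
from pathlib import PurePosixPath


def resolve_image_path(image_path, image_root=None):
    """Return image_path unchanged if absolute, else joined to image_root (assumed rule)."""
    path = PurePosixPath(image_path)
    if path.is_absolute() or image_root is None:
        return str(path)
    return str(PurePosixPath(image_root) / path)


def to_jsonl(records):
    """Serialize records to JSONL text, checking the two required fields."""
    lines = []
    for rec in records:
        missing = {"image_path", "caption"} - rec.keys()
        if missing:
            raise ValueError(f"record is missing required fields: {sorted(missing)}")
        lines.append(json.dumps(rec, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

For example, `to_jsonl([{"image_path": "cats/a.jpg", "caption": "a cat on a sofa"}])` yields one line suitable for `data.input_jsonl`, and `resolve_image_path("cats/a.jpg", "/data/images")` gives the path that would be read under the assumed rule.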

Outputs under `results_dir/`:

- `step_0_expression_extraction/annotations.jsonl`: `cleaned_caption` and `expressions[]`, each with `text`, `expression_id`, `char_span`, `noun_chunk`, and `instances[]`.
- `step_1_grounding/annotations.jsonl`: `expressions[].instances[]`, each with `bbox: [x1,y1,x2,y2]`, a `score` in `[0.0, 1.0]`, and `bbox_id`.
- `results_dir/annotations.jsonl`
- `step_<N>_*/.ckpt/<sample_id>.json`: per-sample checkpoints.

Container image: `nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt`
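
A minimal sketch of reading the grounding output, assuming each JSONL line is an object whose `expressions[]` entries carry a `text` and a list of `instances[]` with `bbox` and `score`. That nesting is inferred from the field names above and may not match the real schema exactly. `iter_boxes` and the `min_score` threshold are hypothetical.

```python
import json


def iter_boxes(jsonl_text, min_score=0.5):
    """Yield (expression text, bbox, score) for instances scoring at least min_score.

    Assumed record shape:
    {"expressions": [{"text": ..., "instances": [{"bbox": [...], "score": ...}]}]}
    """
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for expr in record.get("expressions", []):
            for inst in expr.get("instances", []):
                if inst.get("score", 0.0) >= min_score:
                    yield expr.get("text"), inst["bbox"], inst["score"]
```

Passing the contents of `step_1_grounding/annotations.jsonl` as `jsonl_text` would, under these assumptions, keep only the boxes the VLM scored at or above the threshold. This can help when reviewing auto-labels before training.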