Multi-step video annotation pipeline that turns raw videos into Chain-of-Thought training data — multi-level captions, structured descriptions, and QA pairs (MCQ, binary, open-ended) with reasoning traces, via VLM/LLM distillation. Use when the user wants to "create video training data", "generate video QA datasets", "build CoT reasoning traces from videos", "auto-label videos", or run the video_reasoning_annotation pipeline. Triggers include "video annotation", "video CoT", "video QA", "chain-of-thought", "video captioning pipeline", "video distillation".
npx skill4agent add nvidia/skills tao-generate-video-reasoning-annotations

Step 0: [Optional] Filter & classify videos → Keep domain-relevant, classify anomaly vs normal
Step 1a: Global + dense captions → VLM: narrative summary + timestamped events
Step 1b: Chunk captions → VLM: fixed-duration segment micro-captions
Step 1c: [Optional, anomaly only] Highlight → LLM extracts anomaly timestamp, VLM captions clip
Step 2: Description synthesis → LLM: synthesize captions into structured narrative
Step 3: QA generation → LLM: MCQ, binary, open-ended with reasoning
Step 4: Parse outputs → Per-task `tao-vl-reason-v1.0` JSON files

`workflow.steps`, `{"video_path": "..."}`, `.mp4`, `.avi`, `.mov`, `.mkv`

| Domain | What to do |
|---|---|
| general | Use the default prompts. Set |
| traffic (CCTV intersections, highways; dashcam excluded) | Use the reference module. Set |
| warehouse (industrial site CCTV — safety, operations, security) | Same pattern. Set |
| custom (any other domain) | Run the workshop in references/domain_adaptation.md. It walks through: Phase 1 — question types the user wants the model to answer; Phase 2 — caption-requirements checklist; Phase 3 — fill the |

`workflow.mode: "auto"`, `workflow.mode: "anomaly"`, `workflow.mode: "normal"`

`vlm.backend`, `llm.backend`, `GOOGLE_API_KEY`, `base_url`, `model_name`, `api_key`, `llm.backend`, `vlm.backend`

`skills/applications/tao-run-inference-service`, `skills/applications/tao-run-inference-service/references/service.yaml`, `valid_network_arch_config_basenames`

`custom`, `general`, `traffic`, `warehouse`, `auto_label`

auto_label generate -e /path/to/spec.yaml \
  results_dir=/results \
  video_reasoning_annotation.data.video_root=/videos \
  video_reasoning_annotation.vlm.gemini.api_key=$GOOGLE_API_KEY \
  video_reasoning_annotation.workflow.mode=auto

auto_label default_specs results_dir=/results module_name=auto_label
# then set: autolabel_type: "video_reasoning_annotation"

`prompts_module`, `workflow.mode`, `results_dir/step_1a_caption/captions.jsonl`, `results_dir/step_3_qa/qa_output.jsonl`, `prompts_module`, `general`, `data.video_root`, `data.input_jsonl_files`, `results_dir`

| Field | Default | Description |
|---|---|---|
| | Which pipeline steps to execute |
| | |
| | |
| | Same options; text-only, cheaper model works |
| | Parallel threads per step (watch API rate limits) |
| | Optional: written to |
| | Optional: extra text appended to per-task descriptions in step 4 metadata |
| | Dotted import path to custom prompts module |
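
The pipeline accepts `data.input_jsonl_files` whose lines have the shape `{"video_path": "..."}`. As a minimal sketch of producing such a file from a directory of videos, the helper below scans for the extensions named in this document. `build_manifest` is a hypothetical helper, not part of the toolkit, and storing paths relative to the video root is an assumption: the source does not say whether relative or absolute paths are expected.

```python
import json
from pathlib import Path

# Video extensions named in this document.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}


def build_manifest(video_root: str, manifest_path: str) -> int:
    """Write one {"video_path": ...} object per line and return the count.

    Assumption: paths are written relative to video_root. Adjust if the
    pipeline expects absolute paths in your setup.
    """
    root = Path(video_root)
    # Collect matching files first so the manifest itself is never scanned.
    videos = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS
    )
    with open(manifest_path, "w", encoding="utf-8") as fh:
        for path in videos:
            record = {"video_path": path.relative_to(root).as_posix()}
            fh.write(json.dumps(record) + "\n")
    return len(videos)
```

Calling `build_manifest("/videos", "/results/videos.jsonl")` would write one line per discovered video; the resulting file could then be listed under `data.input_jsonl_files`.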

`nvidia_tao_ds.auto_label.video_reasoning_annotation.prompts`, `nvidia_tao_ds.auto_label.video_reasoning_annotation.prompt_template`, `[PLACEHOLDER]`, `traffic`, `warehouse`

`video_root`, `.mp4`, `.avi`, `.mov`, `.mkv`, `input_jsonl_files`, `{"video_path": "..."}`, `video`, `filter_field`, `video_root`, `input_jsonl_files`

`results_dir/`, `step_0_filter/`, `step_1a_caption/`, `step_4_output/<task>.json`, `tao-vl-reason-v1.0`

`mcq.json`, `mcq_openended.json`, `bcq.json`, `bcq_openended.json`, `open_qa.json`, `causal_linkage.json`, `temporal_localization.json`, `temporal_description.json`, `scene_description.json`, `video_summarization.json`

{
"format": "tao-vl-reason-v1.0",
"metadata": {"type": "annotation", "task": "<task>", "date": "YYYY-MM-DD",
"description": "<per-task + description_extra>", "license": "<from config>"},
"media_root": "<data.video_root>" | null,
"items": [{"video_id": "...", "question": "...", "answer": "...", "reasoning": "..."}, ...]
}

`media_root`, `data.video_root`, `null`, `video_id`, `video_root`, `license`, `description_extra`

`tao_toolkit.pyt`, `nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt`, `versions.yaml`
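
Before feeding a step 4 output file into training, it can help to check it against the `tao-vl-reason-v1.0` layout shown above. The following is a minimal sketch, not the toolkit's own validator: `check_annotations` and `load_annotations` are hypothetical names, and the check assumes every task file uses the four item keys from the schema (`video_id`, `question`, `answer`, `reasoning`), which may not hold for every task.

```python
import json

EXPECTED_FORMAT = "tao-vl-reason-v1.0"
# Item keys taken from the schema in this document; assumed for all tasks.
ITEM_KEYS = {"video_id", "question", "answer", "reasoning"}


def check_annotations(doc: dict) -> list:
    """Return the items of a parsed annotation document, or raise ValueError."""
    if doc.get("format") != EXPECTED_FORMAT:
        raise ValueError(f"unexpected format: {doc.get('format')!r}")
    media_root = doc.get("media_root")
    # The schema allows media_root to be a string or null.
    if media_root is not None and not isinstance(media_root, str):
        raise ValueError("media_root must be a string or null")
    items = doc.get("items", [])
    for index, item in enumerate(items):
        missing = ITEM_KEYS - item.keys()
        if missing:
            raise ValueError(f"item {index} is missing keys: {sorted(missing)}")
    return items


def load_annotations(path: str) -> list:
    """Read one per-task JSON file (for example mcq.json) and validate it."""
    with open(path, encoding="utf-8") as fh:
        return check_annotations(json.load(fh))
```

A caller might run `load_annotations` on each file under `step_4_output/` and skip or report any file that raises.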