Instructions
Follow the routing tables and step-by-step workflows below. Each section that ends in workflow, quick start, or flow is intended to be executed top-to-bottom. Detailed reference material lives in
and helper scripts live in
— call them via
when the skill points to a script by name.
Examples
Worked end-to-end examples are kept under
(each
manifest contains a runnable scenario) and inline in the per-workflow
blocks below. Run a Tier-3 evaluation with nv-base validate <this-skill-dir> --agent-eval to replay them.
You are a video summarization assistant. You call the VLM NIM or the video summarization
microservice directly. Always run commands yourself; never instruct the user to run them.
Primary video workflow query type: "Summarize this video." Direct video summarization API
and service-ops requests are handled by the reference-routed sections below.
Purpose
Produce a single, polished narrative summary of one recorded video clip, with
timestamped events when the LVS microservice path is reachable.
Do NOT use this skill for:
- Live RTSP captioning: use vss-deploy-dense-captioning.
- Incident-range or alert-window reports: use vss-generate-video-report Mode B.
- Semantic search across the archive — use .
Prerequisites
- VSS profile running on (port 38111) OR a reachable
VLM/RT-VLM endpoint as a fallback. The skill brings
these up.
- Network reachability from the agent host to both endpoints; clip URLs from
VIOS must be fetchable by the chosen backend.
- and available on the agent host.
Limitations
- Direct VLM fallback uses a single fixed prompt and cannot target
scenario/events — output quality is lower than the LVS path.
- Remote VLM endpoints generally cannot reach /private clip URLs.
- One backend call per request; no parallel hedging or multi-pass summaries.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| /v1/ready returns 503 repeatedly | LVS service still warming up | Retry up to ~30 s as shown in Setup; if it never returns 200 the service may not be deployed |
| Empty video_summary and events | Clip does not contain the requested events | Re-run with a broader or different scenario/events |
| VLM returns a <think> block | Cosmos Reason 2 reasoning mode | Strip everything up to </think> before rendering |
| Empty stdout from the /v1/ready probe | Service legitimately returns 200 with empty body | Always check HTTP status with -o /dev/null -w '%{http_code}', never inspect the body |
See references/video-summarization-debugging.md for deeper diagnostics.
Reference Map
Use these references only when the user asks for the relevant detail, or when
the core workflow below needs deeper video summarization information:
- video summarization API details: references/video-summarization-api.md for
endpoints, health probes, request fields, response shapes, and API gotchas.
- video summarization service configuration and ops:
references/video-summarization-deployment.md for the VSS profile, ports,
required env vars, logs, status, dry-runs, teardown, model/backend swaps,
Elasticsearch/Neo4j/ArangoDB backend selection, and service-level troubleshooting.
- Extended video summarization ops references:
references/video-summarization-environment-variables.md,
references/video-summarization-debugging.md, and
assets/video-summarization.env.example.
Load video-summarization-api.md only when you need a request field, response shape, or
endpoint that is not already covered by the Step 2 LVS or fallback VLM
example below, or when handling a direct video summarization API
request. Load video-summarization-deployment.md only for deployment,
configuration, or service operations.
Video Summarization API And Service Ops Requests
If the user asks to call or debug video summarization endpoints directly, answer from
references/video-summarization-api.md instead of running the end-to-end video
summarization workflow. Examples: list video summarization models, check
readiness, get recommended chunking config, inspect metrics, explain a 422
response, or build a request body.
If the user asks to configure, deploy, restart, tear down, or troubleshoot the
video summarization service, prefer the
skill for full VSS profile
deployment and use references/video-summarization-deployment.md for video
summarization-specific service details.
Routing
Decide purely from video summarization service availability (probed in
Setup → Availability checks below). Duration does not drive routing.
| LVS readiness probe | Backend | Endpoint |
|---|---|---|
| HTTP 200 | LVS microservice with HITL | POST ${LVS_BACKEND_URL}/v1/summarize |
| Anything else | VLM / RT-VLM with the default prompt + fallback note | POST ${VLM_BASE_URL}/v1/chat/completions |
Fallback message when the LVS service is unreachable — copy verbatim above the summary:
⚠
Note: Input video
is
s long.
The video summarization service is not deployed, so this summary was
produced by the VLM alone with a generic default prompt. Deploy the
profile for higher-quality summaries with scenario/events
targeting.
Deployment prerequisite
The VSS
lvs profile on
is the primary backend. If the
probe (see
Setup → Availability checks) returns anything
other than 200 after the warmup retries, ask the user:
"The VSS profile isn't running on . Shall I deploy it now using the skill with ? Reply to summarize with the VLM-only fallback instead (lower quality, no scenario/events targeting)."
- Yes → hand off to , then re-probe and continue with Step 2 (LVS + HITL).
- No → go straight to Step 2 fallback (VLM with default prompt) and prepend the Routing fallback note. Do not ask again, and do not run scenario/events HITL.
- Pre-authorized to deploy autonomously (caller said so explicitly) → skip the confirmation and invoke directly.
- Pre-authorized to use VLM fallback ("skip lvs, just use the VLM") → go straight to Step 2 fallback without prompting.
Setup
Endpoints (defaults for a local VSS deployment):
- VLM / RT-VLM: VLM_BASE_URL, default ${RTVI_VLM_BASE_URL:-http://${HOST_IP:-localhost}:8018}
- LVS service: LVS_BACKEND_URL, default http://${HOST_IP:-localhost}:38111
- VIOS: owned by vss-manage-video-io-storage; refer there.
Use env vars when set (strip a trailing /v1 from the VLM base, because the skill appends it). Otherwise use the defaults. If neither works, ask the user; do not scan ports or read config files to guess.
Model name: read VLM_NAME (default nim_nvidia_cosmos-reason2-8b_0303-fp8-dynamic-kv8).
It must match the id that RT-VLM advertises; do not substitute a friendly name for it.
For endpoint schemas, optional fields, response envelopes, and error handling, see
references/video-summarization-api.md.
Availability checks (run both before routing).
Readiness is determined by the HTTP status code only: the LVS /v1/ready endpoint
may legitimately return 200 with an empty body, so do not inspect the body.
bash
VLM="${VLM_BASE_URL:-${RTVI_VLM_BASE_URL:-http://${HOST_IP:-localhost}:8018}}"
VLM="${VLM%/v1}"
# VLM / RT-VLM: 200 on /v1/models
vlm_code=$(curl -s -o /dev/null -w '%{http_code}' --connect-timeout 3 \
"$VLM/v1/models")
[ "$vlm_code" = "200" ] && echo "VLM OK" || echo "VLM not reachable (HTTP $vlm_code)"
# Video summarization service: 200 on /v1/ready, with retry on 503 (warmup) for up to ~30s
VIDEO_SUMMARIZATION_URL=${LVS_BACKEND_URL:-http://${HOST_IP:-localhost}:38111}
video_sum_code=000
for i in $(seq 1 10); do
  video_sum_code=$(curl -s -o /dev/null -w '%{http_code}' --connect-timeout 3 "$VIDEO_SUMMARIZATION_URL/v1/ready")
  case "$video_sum_code" in
    200) echo "video summarization OK"; break ;;
    503) sleep 3 ;;   # warming up; keep polling
    *)   break ;;     # any other code = not reachable, stop retrying
  esac
done
[ "$video_sum_code" = "200" ] || echo "video summarization service not reachable (HTTP $video_sum_code)"
How to interpret the results:
- LVS probe returned 200 → Step 2 (LVS + HITL) for every video.
- LVS probe did not return 200, VLM probe returned 200 → Step 2 fallback (VLM); prepend the Routing fallback note.
- Neither probe returned 200 → fail; at least one backend must be reachable.
- A non-200 LVS code after the retry loop is the ONLY signal of unavailability. Empty stdout or missing JSON fields are NOT "unavailable."
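The interpretation rules above can be sketched as a small helper. This is an illustrative sketch only: the function name choose_route and its output labels are hypothetical, and it assumes the two status codes come from the availability checks (video_sum_code and vlm_code).

```bash
# Decide the route from the two probe status codes.
# Prints one of: lvs, vlm-fallback, none (labels are hypothetical).
choose_route() {
  local lvs_code=$1 vlm_code=$2
  if [ "$lvs_code" = "200" ]; then
    echo "lvs"            # Step 2 (LVS + HITL)
  elif [ "$vlm_code" = "200" ]; then
    echo "vlm-fallback"   # Step 2 fallback; prepend the Routing fallback note
  else
    echo "none"           # no backend reachable
    return 1
  fi
}

# Usage with the variables set by the availability checks:
# route=$(choose_route "$video_sum_code" "$vlm_code") || echo "no backend reachable" >&2
```

Keeping the decision in one place makes it harder to route on anything other than the status codes, such as clip duration or body content.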
Step 1 - Get the clip URL via vss-manage-video-io-storage (sub-task, NOT the final answer)
Use the vss-manage-video-io-storage skill for all VIOS interactions; it
owns the canonical curl recipes, parameter defaults, and delete/upload flows.
Do not fabricate URLs or hand-roll VIOS calls; they will drift.
This step is a sub-task — do NOT end your turn here; do NOT return the clip
URL as the final answer. From VIOS collect three values:
- (via → , or directly from an upload response).
- Timeline - (ISO 8601 UTC). is the duration; needed only for the user-facing header (routing is driven solely by ).
- Temporary MP4 clip URL — the
/storage/file/<streamId>/url
variant with . Response field: . Both backends need an HTTP(S) URL they can .
Everything else (auth, upload,
, expiry, etc.) lives in the
vss-manage-video-io-storage
skill — refer users there if VIOS fails.
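For the user-facing header, the duration can be derived from the two timeline timestamps. A minimal sketch, assuming both values are ISO 8601 UTC strings and that GNU date is available (BSD/macOS date uses different flags); clip_duration_seconds is a hypothetical helper name.

```bash
# Duration in whole seconds between two ISO 8601 UTC timestamps.
# Assumes GNU date (-d); not portable to BSD/macOS date as written.
clip_duration_seconds() {
  local start_ts=$1 end_ts=$2
  local start_s end_s
  start_s=$(date -u -d "$start_ts" +%s) || return 1
  end_s=$(date -u -d "$end_ts" +%s) || return 1
  echo $(( end_s - start_s ))
}

# Usage:
# clip_duration_seconds "2025-01-01T00:00:00Z" "2025-01-01T00:01:30Z"
```

The value is only used for display; routing still depends solely on the readiness probe.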
Step 2 — Primary: video summarization microservice with HITL
Use this path whenever the LVS readiness probe returned 200 in Setup. Duration is irrelevant.
For advanced fields (structured output, stream captioning, metrics, recommended config) see
references/video-summarization-api.md.
HITL: collect scenario and events first (REQUIRED — do not skip)
Full walk-through is in references/hitl-prompts.md. Always run HITL before calling the LVS service.
Autonomous-mode defaults. When the caller has bypassed HITL ("run
autonomously without prompting") AND the original query asks for
/
(or gives none), use scenario="activity monitoring" and
events=["notable activity"] verbatim; do not infer them from the filename or
sensor name. Note the defaults in the final reply and offer a re-run with
more specific parameters. This is the ONLY supported HITL bypass; "the video is
short" or "the user seems in a hurry" are not valid reasons.
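The autonomous-mode defaults can be applied with a guard like the one below. This is a sketch under assumptions: AUTONOMOUS is a hypothetical flag standing in for "the caller bypassed HITL", and SCENARIO / EVENTS_JSON are the variables used by the LVS request in this step.

```bash
# Fill in the documented defaults only when HITL was explicitly bypassed
# and the caller supplied no scenario/events of their own.
# AUTONOMOUS is a hypothetical flag name used for this sketch.
apply_autonomous_defaults() {
  [ "${AUTONOMOUS:-0}" = "1" ] || return 0   # HITL not bypassed: leave values alone
  if [ -z "${SCENARIO:-}" ]; then
    SCENARIO='activity monitoring'
  fi
  if [ -z "${EVENTS_JSON:-}" ]; then
    EVENTS_JSON='["notable activity"]'
  fi
}

# Usage:
# AUTONOMOUS=1 SCENARIO='' EVENTS_JSON=''
# apply_autonomous_defaults
```

When the defaults are used, the final reply should still say so and offer a re-run with more specific parameters.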
Prefer
(3.2 GA route);
is a compatibility alias.
bash
VIDEO_SUMMARIZATION_URL=${LVS_BACKEND_URL:-http://${HOST_IP:-localhost}:38111}
# From HITL reply:
SCENARIO='warehouse monitoring'
EVENTS_JSON='["notable activity"]'
OBJECTS_JSON='' # '' to omit, else '["forklifts","pallets","workers"]'
curl -s -X POST "$VIDEO_SUMMARIZATION_URL/v1/summarize" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg url "<clip_url_from_vss_manage_video_io_storage>" \
        --arg model "${VLM_NAME:-nim_nvidia_cosmos-reason2-8b_0303-fp8-dynamic-kv8}" \
        --arg scenario "$SCENARIO" \
        --argjson events "$EVENTS_JSON" \
        --argjson objects "${OBJECTS_JSON:-null}" '{
          url: $url,
          model: $model,
          scenario: $scenario,
          events: $events,
          chunk_duration: 10,
          num_frames_per_second_or_fixed_frames_chunk: 20,
          use_fps_for_chunking: false,
          seed: 1
        } + (if $objects == null then {} else {objects_of_interest: $objects} end)')" \
  | jq -r '.choices[0].message.content' \
  | jq '{video_summary, events}'
If both video_summary and events are empty, the clip probably doesn't contain the requested events; re-run with a broader scenario/events rather than reporting "no content".
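The empty-result check can be automated before rendering. A minimal sketch, assuming video_summary is a string and events is an array in the unwrapped LVS JSON (the exact response shape is documented in references/video-summarization-api.md); lvs_result_is_empty is a hypothetical helper name.

```bash
# Reads the unwrapped LVS JSON ({video_summary, events}) on stdin.
# Exit status 0 when both fields are empty or missing, non-zero otherwise.
lvs_result_is_empty() {
  jq -e '((.video_summary // "") | length) == 0
         and ((.events // []) | length) == 0' >/dev/null
}

# Usage:
# if printf '%s' "$lvs_json" | lvs_result_is_empty; then
#   echo "empty result: re-run with a broader scenario/events" >&2
# fi
```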
Tuning: (default
s;
= single chunk),
num_frames_per_second_or_fixed_frames_chunk
(default
; meaning depends
on
),
(default
).
is
deprecated.
Step 2 fallback — VLM direct with default prompt
Use this path only when the LVS readiness probe did not return 200 after warmup. Do NOT run HITL: the user did not opt in; you fell back because the service was missing. Prepend the Routing fallback note to the response.
bash
VLM="${VLM_BASE_URL:-${RTVI_VLM_BASE_URL:-http://${HOST_IP:-localhost}:8018}}"
VLM="${VLM%/v1}"
PROMPT='Describe in detail what is happening in this video,
including all visible people, vehicles, equipment, objects,
actions, and environmental conditions.
OUTPUT REQUIREMENTS:
[timestamp-timestamp] Description of what is happening.
EXAMPLE:
[0.0s-4.0s] <description of the first event>
[4.0s-12.0s] <description of the second event>'
curl -s -X POST "$VLM/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "$(jq -n \
        --arg model "${VLM_NAME:-nim_nvidia_cosmos-reason2-8b_0303-fp8-dynamic-kv8}" \
        --arg text "$PROMPT" \
        --arg url "<clip_url_from_vss_manage_video_io_storage>" \
        '{
          model: $model,
          temperature: 0.0,
          max_tokens: 1024,
          messages: [{
            role: "user",
            content: [
              {type: "text", text: $text},
              {type: "video_url", video_url: {url: $url}}
            ]
          }]
        }')" | jq -r '.choices[0].message.content'
Response: standard OpenAI chat-completion envelope. The summary is in choices[0].message.content.
Cosmos-model notes: Cosmos Reason 2 supports reasoning via
<think>...</think><answer>...</answer> blocks. Omit the reasoning
instructions if you want a plain summary. Frame sampling and pixel limits
are applied server-side; no client-side prep is required when you pass a
video_url.
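If the model does emit reasoning blocks, they can be removed before rendering. The sketch below uses only shell parameter expansion; strip_reasoning is a hypothetical helper name, and it assumes at most one <think> block followed by an optional <answer> wrapper.

```bash
# Reads model output on stdin and prints it without the reasoning block.
# Text that contains no tags passes through unchanged.
strip_reasoning() {
  local text
  text=$(cat)
  text=${text#*"</think>"}     # drop everything up to and including </think>
  text=${text#*"<answer>"}     # drop the opening <answer> tag if present
  text=${text%"</answer>"*}    # drop the closing </answer> tag if present
  # Trim leading and trailing whitespace left behind by the tags.
  text="${text#"${text%%[![:space:]]*}"}"
  text="${text%"${text##*[![:space:]]}"}"
  printf '%s\n' "$text"
}

# Usage:
# ... | jq -r '.choices[0].message.content' | strip_reasoning
```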
End-to-end example
See references/end-to-end-example.md for the full LVS-or-VLM-fallback script
that probes LVS readiness and runs the appropriate path.
Responses
- VLM returns an OpenAI chat-completion envelope; summary is choices[0].message.content.
- LVS service returns the same envelope, but choices[0].message.content is a JSON string;
run jq -r '.choices[0].message.content' | jq to reach video_summary and events.
- Errors surface as HTTP non-2xx plus a JSON body. An LVS 503 usually
means warmup; retry the readiness probe.
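For summarize and chat-completion calls, where the body is needed, one way to see both the status and the body from a single request is to have curl append the status code on its own last line. This is a sketch, not part of the skill's scripts: parse_http_response, is_success, HTTP_STATUS, and HTTP_BODY are hypothetical names.

```bash
# Split "BODY<newline>STATUS" (as produced by: curl -s -w '\n%{http_code}' ...)
# into HTTP_BODY and HTTP_STATUS.
parse_http_response() {
  local raw=$1 nl='
'
  HTTP_STATUS=${raw##*"$nl"}   # text after the last newline
  HTTP_BODY=${raw%"$nl"*}      # everything before the last newline
}

# True for any 2xx status code.
is_success() {
  case "$1" in
    2??) return 0 ;;
    *)   return 1 ;;
  esac
}

# Usage:
# raw=$(curl -s -w '\n%{http_code}' -X POST "$VIDEO_SUMMARIZATION_URL/v1/summarize" ...)
# parse_http_response "$raw"
# is_success "$HTTP_STATUS" || echo "request failed (HTTP $HTTP_STATUS)" >&2
```

The readiness probe is different: there the status code alone decides, and the body is never inspected.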
Presenting the output to the user
Surface backend output with minimal transformation — do not paraphrase,
re-voice, add emojis, or reformat. One backend call → one rendering: no
parallel hedging, no duplicate headers, never call both LVS and VLM for the
same video.
Header line. Start with exactly one:
Summary of <video_name> (<duration>)
LVS output: render
verbatim (polished, tone-controlled
report — rewriting loses fidelity). Render each
entry with its
,
,
, and full
verbatim (table when
the client renders one cleanly, otherwise a per-event list). You MAY add a
one-line header and a closing offer to re-run with different parameters.
VLM output: render choices[0].message.content verbatim. If the model
produced <think>…</think><answer>…</answer> blocks, drop the <think>
block and show the answer.
Fallback warning (when applicable) goes above the summary, never
mixed into it.
Tips
- Route by service availability, not by duration. Probe once
in Setup; HTTP 200 → LVS+HITL for every clip; anything else → VLM fallback.
- HITL is mandatory on the LVS path. The opt-in is the only
sanctioned bypass. The VLM fallback path is silent (no HITL).
- Readiness = HTTP 200 on /v1/ready. Nothing else. Body may be empty.
Always use curl -s -o /dev/null -w '%{http_code}'; never pipe the probe
through tools that parse the body.
- Delegate VIOS to vss-manage-video-io-storage; it is a sub-task, and the
final answer is the Step 2 summary, not the clip URL.
- Run jq twice for LVS output. The first pass unwraps the OpenAI envelope, the second
parses the JSON string inside choices[0].message.content.
- Prefer for 3.2 GA; is a compatibility alias.
- Use the exact VLM model id advertised by the endpoint (default
nim_nvidia_cosmos-reason2-8b_0303-fp8-dynamic-kv8).
- Render output verbatim: no paraphrasing, no reformatting, no rewriting
the video_summary or choices[0].message.content.
- One call, one render. No parallel hedging, no double renderings.
Cross-reference
- vss-deploy-profile — bring up the (VLM only) or (VLM + video summarization service) profile
- vss-manage-video-io-storage (VIOS API) — upload videos, list streams, get clip URLs
- vss-search-archive — semantic search across the archive (different profile)
- vss-query-analytics — query incidents/events from Elasticsearch
- video summarization API reference: references/video-summarization-api.md
- video summarization service ops reference: references/video-summarization-deployment.md