Purpose
Stand up the RT-VLM dense-captioning microservice on its own and exercise every endpoint it exposes (file upload, generate_captions, stream add/delete, chat-completions, Kafka topics).
Prerequisites
For standalone RT-VLM deployment:
- Docker, Docker Compose, NVIDIA Container Toolkit, and a visible GPU.
- NGC registry credentials in for ,
image pulls, and local NGC model/artifact downloads.
- , , and any writable working directory for the standalone compose copy.
For API calls against an existing service:
- Running RT-VLM service reachable at .
- Bearer token in or , depending on how the
service was configured.
For full VSS profile deployment:
- Use
../vss-deploy-profile/SKILL.md
; this skill does not deploy full VSS profiles.
Instructions
Follow the routing tables and step-by-step workflows below. Each section that ends in
workflow,
quick start, or
flow is intended to be executed top-to-bottom. Detailed reference material lives in
and helper scripts live in
— call them via
when the skill points to a script by name.
Examples
Worked end-to-end examples are kept under
(each
manifest contains a runnable scenario) and inline in the per-workflow
blocks below. Run a Tier-3 evaluation with
nv-base validate <this-skill-dir> --agent-eval
to replay them.
Limitations
- Requires either a standalone RT-VLM service deployed via this skill or an
existing RT-VLM service reachable from the caller.
- NGC-hosted models and NIMs may be subject to rate-limits, GPU memory requirements, and license restrictions.
- Concurrency, GPU memory, and storage limits depend on the host hardware and the profile's compose file.
Troubleshooting
- Error: REST call returns connection refused. Cause: target microservice not running. Solution: probe or ; redeploy via or the matching skill.
- Error: HTTP 401/403 from NGC pulls. Cause: missing/expired . Solution: and re-export the key before retrying.
- Error: container OOM or model fails to load. Cause: insufficient GPU memory for the selected profile. Solution: switch to a smaller variant or free GPUs via .
Deploy and Use RT-VLM Dense Captioning (VSS 3.2)
RT-VLM is NVIDIA's real-time vision-language microservice: decode video (file or
RTSP), segment it into chunks, run a VLM (
,
, or any
OpenAI-compatible model), stream dense captions back over SSE/HTTP, and publish
captions, incident alerts, and errors to Kafka. Use this skill to deploy the
standalone RT-VLM service when a full VSS profile is not already running, then call
its
API for caption generation, file upload, live-stream management, health
checks, NIM-compatible chat completions, or Prometheus metrics. API reference:
https://docs.nvidia.com/vss/latest/real-time-vlm-api.html.
Deployment Routing
If the user asks to deploy a full VSS profile, use
../vss-deploy-profile/SKILL.md
. That skill
owns profile routing,
,
, multi-service sizing, and
full-stack deploy/teardown.
If the user asks for standalone RT-VLM dense captioning, or no VSS profile is
already running, use the standalone RT-VLM flow in
references/deploy-rt-vlm-service.md
before calling the API. This follows the same compose-centric pattern as
: gather context, run preflights, work from a local copy,
dry-run with
, review, deploy, then wait for health.
Standalone Deployment Flow
Always follow this sequence. Never skip the dry-run.
bash
# 1. Copy deploy/docker/services/rtvi/rtvi-vlm/rtvi-vlm-docker-compose.yml
# into any writable standalone working directory.
# 2. Derive RTVI_VLM_IMAGE_TAG from that compose copy.
# 3. Strip the standalone-only dangling depends_on block from the copy.
# 4. Create a gitignored .env with the required RT-VLM values.
# 5. Prepare host bind paths such as $VSS_DATA_DIR/data_log/vst/clip_storage.
# 6. docker compose --env-file .env -f rtvi-vlm-docker-compose.yml config --quiet
# 7. docker pull the exact RT-VLM image tag.
# 8. docker compose ... up -d rtvi-vlm, wait for ready, then smoke test.
Run preflights before any pull or
; stop and fix failures here before
debugging RT-VLM itself:
bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
nvidia-container-cli info
docker compose version
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
For standalone single-file deployments, do not run the raw
deploy/docker/services/rtvi/rtvi-vlm/rtvi-vlm-docker-compose.yml
directly: it
contains
references to sibling VLM/NIM services that are only
defined in the full VSS/met-blueprints compose project. The standalone reference
shows how to copy the compose file, derive the current image tag from it, strip
the
block, and validate the result before
.
If
fails with a containerd snapshotter/unpack error on Docker 28+,
apply the
containerd-snapshotter=false
fix in the
standalone reference before retrying.
Minimum standalone
values:
| Host env var | Required when | Purpose |
|---|
| Standalone deploy path | NGC registry image pull and NGC model/artifact download |
| or | Authenticated API calls | RT-VLM bearer auth after the service is running |
| Always | Host API port mapped to container |
| Always | Kafka bootstrap host () |
| Always | Required clip-storage bind mount |
| Always for standalone | Backend selector; use for the default local model or for a remote/sibling endpoint |
| Local self-hosted model | Source-backed Cosmos Reason 2 path: ngc:nim/nvidia/cosmos-reason2-8b:0303-fp8-dynamic-kv8
|
| RTVI_VLM_MODEL_TO_USE=openai-compat
| Remote/sibling OpenAI-compatible VLM endpoint |
| RTVI_VLM_MODEL_TO_USE=openai-compat
| Model/deployment name exposed by that endpoint |
Setup
bash
export BASE_URL="http://localhost:${RTVI_VLM_PORT:-8018}" # host-side RT-VLM port
export API_KEY="${NGC_CLI_API_KEY:-${RTVI_VLM_API_KEY:-}}" # bearer token used by host-side curl commands
: "${API_KEY:?Set NGC_CLI_API_KEY or RTVI_VLM_API_KEY before calling authenticated endpoints}"
Every request below uses
Authorization: Bearer $API_KEY
. Health endpoints
(
,
,
,
) typically work without auth.
Smoke test before use:
bash
curl -fsS "$BASE_URL/v1/health/ready"
MODEL_ID="$(curl -fsS "$BASE_URL/v1/models" -H "Authorization: Bearer $API_KEY" | jq -r '.data[0].id // .id')"
curl -fsS "$BASE_URL/openapi.json" | jq -r '.paths | keys[]' | sort
Quick Start — dense captions from a local video
bash
# 1. Upload the video, capture its file id
FILE_ID=$(curl -fsS -X POST "$BASE_URL/v1/files" \
-H "Authorization: Bearer $API_KEY" \
-F "file=@/path/to/warehouse.mp4" \
-F "purpose=vision" \
-F "media_type=video" | jq -r '.id')
# 2. Generate captions + alerts (SSE stream of chunked responses)
curl -N -X POST "$BASE_URL/v1/generate_captions" \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d "{
\"id\": \"$FILE_ID\",
\"prompt\": \"Write a concise dense caption for each 10-second segment of this warehouse video.\",
\"model\": \"$MODEL_ID\",
\"chunk_duration\": 10,
\"stream\": true
}"
Endpoints
Captions
Generate VLM captions and alerts for videos and live streams.
POST /v1/generate_captions
— Generate VLM captions (and alerts) for video/stream
Required:
| Field | Type | Description |
|---|
| string | array | UUID of a previously-uploaded file, or id of an active live stream. Accepts a list of ids for batch |
| string | User prompt to the VLM (e.g. dense-caption instruction) |
| string | Exact model id returned by , for example nim_nvidia_cosmos-reason2-8b_0303-fp8-dynamic-kv8
; backend selector aliases such as are not request model ids |
Key optional fields:
| Field | Type | Default | Description |
|---|
| string | — | System prompt; use <think></think><answer></answer>
tags to enable reasoning on Cosmos Reason |
| boolean | false | Turn on reasoning for Cosmos Reason models |
| boolean | false | Transcribe audio (via Riva) and fold into captions |
| integer | — | Segment video into N-second chunks ( = no chunking) |
| integer | 0 | Overlap between consecutive chunks |
num_frames_per_second_or_fixed_frames_chunk
| number | — | FPS (if use_fps_for_chunking=true
) or fixed frames per chunk |
| boolean | false | Interpret above as FPS vs. fixed-frame count |
| / | int | — | Resize frames before inference (0 = native) |
| object | — | {"type":"offset","start_offset":0,"end_offset":10}
to process a slice of a file (not live streams) |
| boolean | false | SSE: emit per-chunk caption deltas as events (recommended for long videos) |
| / / / / / | | | Standard sampling controls |
| object | — | Query response format object |
| object | — | Extra kwargs for the multimodal processor (e.g. size, shortest/longest edge) |
bash
curl -N -X POST "$BASE_URL/v1/generate_captions" \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"id": "123e4567-e89b-12d3-a456-426614174000",
"prompt": "Dense-caption this warehouse video, one sentence per 10s chunk.",
"model": "nim_nvidia_cosmos-reason2-8b_0303-fp8-dynamic-kv8",
"chunk_duration": 10,
"stream": true
}'
Response shape: live 26.05 responses use
with
/
; SSE streams terminate with
. See
references/api-surface-26.05.md
.
DELETE /v1/generate_captions/{stream_id}
— Stop caption generation for a live stream, if exposed
Some deployments expose this companion stop endpoint. Check the live OpenAPI
(
curl -fsS "$BASE_URL/openapi.json" | jq '.paths | keys[]'
) before using it.
Always pair live-stream cleanup with
DELETE /v1/streams/delete/{stream_id}
to
un-register the RTSP source.
bash
curl -X DELETE "$BASE_URL/v1/generate_captions/$STREAM_ID" -H "Authorization: Bearer $API_KEY"
Files
Upload and manage media files consumed by
.
— Upload a media file (multipart)
bash
curl -X POST "$BASE_URL/v1/files" -H "Authorization: Bearer $API_KEY" \
-F "file=@./video.mp4" -F "purpose=vision" -F "media_type=video"
Response: { "id", "object": "file", "bytes", "created_at", "filename", "purpose" }
.
Optional metadata such as
may be accepted by newer builds; check
the live OpenAPI before sending it.
GET /v1/files?purpose=vision
— List uploaded files
— File metadata
GET /v1/files/{file_id}/content
— Download original file content
DELETE /v1/files/{file_id}
— Delete file (releases asset storage)
Live Stream
RTSP stream lifecycle.
— Register one or more RTSP streams
Required per stream: (must start with
),
.
Optional:
,
,
, and placement metadata
(
,
,
,
,
,
,
).
Precheck public or external RTSP sources before registering them. A probe exit
code alone is not enough;
can exit
while reporting an
unknown media type. Treat the stream as usable only when a probe output
identifies at least one video stream/caps entry. If one probe is inconclusive,
cross-check with another tool such as
before failing or registering:
bash
ffprobe -v error -select_streams v:0 \
-show_entries stream=codec_type -of csv=p=0 "$RTSP_URL" | grep -qx video
bash
STREAM_ID=$(curl -fsS -X POST "$BASE_URL/v1/streams/add" \
-H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
-d '{"streams":[{"liveStreamUrl":"rtsp://cam:8554/live","description":"warehouse cam 1"}]}' \
| jq -r '.results[0].id')
GET /v1/streams/get-stream-info
— List active streams
DELETE /v1/streams/delete/{stream_id}
— Remove a single stream
DELETE /v1/streams/delete-batch
— Remove many ()
CV-style singular stream endpoints
26.05 deployments also expose CV-style stream control paths:
,
GET /v1/stream/get-stream-info
, and
. Use these when a workflow or release note explicitly uses
the key/value envelope; otherwise prefer the plural RT-VLM stream endpoints
above. See
references/api-surface-26.05.md
for examples and the
compatibility caveat.
NIM Compatible
OpenAI-compatible endpoints for interop with OpenAI/NVIDIA-API clients.
POST /v1/chat/completions
— OpenAI-compatible chat (text + multimodal)
Required: ,
. Text-only requests work and omit
,
, and
. For uploaded-video, direct
,
direct
, streaming, and RTSP-backed chat examples, see
references/api-surface-26.05.md
.
bash
curl -X POST "$BASE_URL/v1/chat/completions" -H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d "{\"model\":\"$MODEL_ID\",\"messages\":[{\"role\":\"user\",\"content\":\"Summarize this scene.\"}]}"
— OpenAI-compatible legacy completions
This endpoint exists for compatibility, but on current 26.05 builds text-only legacy
completion requests return HTTP 400 by design. Use
for
text-only and multimodal requests.
— { "version": "3.2.0-..." }
— NIM manifest
· — NIM-style probes
Do not assume
exists. The current 26.05 live OpenAPI does not expose it
and the endpoint returns 404; only call it after checking
.
Models · Metadata · Metrics · Health Check
— List loaded VLMs: { "data": [{ "id", "object": "model", "owned_by" }] }
— Service metadata (build, release, image tag)
— Asset storage counts, TTL, and oldest-asset age
— Prometheus metrics (plain text)
· · — Kubernetes-style probes
Common Workflows
The four standard dense-captioning scenarios.
1. Dense captions from a stored video file
bash
# Upload → capture file id → generate captions (SSE stream)
FILE_ID=$(curl -fsS -X POST "$BASE_URL/v1/files" \
-H "Authorization: Bearer $API_KEY" \
-F "file=@warehouse.mp4" -F "purpose=vision" -F "media_type=video" | jq -r '.id')
curl -N -X POST "$BASE_URL/v1/generate_captions" \
-H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
-d "{
\"id\": \"$FILE_ID\",
\"prompt\": \"Describe warehouse events in 1 sentence per 10s chunk.\",
\"model\": \"$MODEL_ID\",
\"chunk_duration\": 10,
\"stream\": true
}"
# When done, free storage:
curl -X DELETE "$BASE_URL/v1/files/$FILE_ID" -H "Authorization: Bearer $API_KEY"
2. Dense captions from an RTSP live stream
bash
# Register the stream
STREAM_ID=$(curl -fsS -X POST "$BASE_URL/v1/streams/add" \
-H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
-d '{"streams":[{"liveStreamUrl":"rtsp://10.0.0.5:8554/warehouse","description":"warehouse cam"}]}' \
| jq -r '.results[0].id')
# Start continuous caption generation
curl -N -X POST "$BASE_URL/v1/generate_captions" \
-H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
-d "{
\"id\": \"$STREAM_ID\",
\"prompt\": \"Describe each event; start each sentence with a timestamp.\",
\"model\": \"$MODEL_ID\",
\"chunk_duration\": 10,
\"num_frames_per_second_or_fixed_frames_chunk\": 2,
\"use_fps_for_chunking\": true,
\"stream\": true
}" &
# Tear down when finished. If the live OpenAPI exposes
# DELETE /v1/generate_captions/{stream_id}, call it before unregistering.
curl -X DELETE "$BASE_URL/v1/streams/delete/$STREAM_ID" -H "Authorization: Bearer $API_KEY"
3. Dense captions with alerts from an RTSP stream
bash
# Pre-req: Kafka is enabled and topics match the deployment source.
# The checked-in rtvi-vlm/.env and VSS alerts profiles use:
# RTVI_VLM_KAFKA_ENABLED=true
# RTVI_VLM_KAFKA_TOPIC=mdx-vlm
# RTVI_VLM_KAFKA_INCIDENT_TOPIC=mdx-vlm-incidents
# RTVI_VLM_ERROR_MESSAGE_TOPIC=vision-llm-errors
# HOST_IP=<kafka-host>
# A copied compose without those env overrides falls back to vision-llm-* topics.
# Confirm the live container before consuming:
# docker exec vss-rtvi-vlm printenv KAFKA_TOPIC KAFKA_INCIDENT_TOPIC ERROR_MESSAGE_TOPIC
STREAM_ID=$(curl -fsS -X POST "$BASE_URL/v1/streams/add" \
-H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
-d '{"streams":[{"liveStreamUrl":"rtsp://10.0.0.5:8554/warehouse","description":"warehouse cam"}]}' \
| jq -r '.results[0].id')
curl -N -X POST "$BASE_URL/v1/generate_captions" \
-H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
-d "{
\"id\": \"$STREAM_ID\",
\"prompt\": \"You are a warehouse monitoring system. Describe the scene in one sentence, then on a new line output exactly:\\nAnomaly Detected: Yes/No\\nReason: <one sentence>\\nFlag an anomaly if any worker is missing a hard hat or high-vis vest.\",
\"system_prompt\": \"Answer the user's question correctly in yes or no.\",
\"model\": \"$MODEL_ID\",
\"chunk_duration\": 60,
\"chunk_overlap_duration\": 10,
\"stream\": true
}"
Consume alerts from Kafka. Kafka values are NvSchema protobuf payloads, so
use
for a clean validation pass that shows timestamp, key,
and headers without dumping binary payload bytes. The VSS alerts/profile source
uses
; a bare copied compose may fall back to
vision-llm-events-incidents
if no
RTVI_VLM_KAFKA_INCIDENT_TOPIC
override is
loaded. Prefer the live container environment over hard-coded topic names.
bash
INCIDENT_TOPIC="${INCIDENT_TOPIC:-$(docker exec vss-rtvi-vlm printenv KAFKA_INCIDENT_TOPIC 2>/dev/null || true)}"
INCIDENT_TOPIC="${INCIDENT_TOPIC:-mdx-vlm-incidents}"
docker exec mdx-kafka kafka-console-consumer \
--bootstrap-server 127.0.0.1:9092 \
--topic "$INCIDENT_TOPIC" \
--from-beginning \
--timeout-ms 5000 \
--max-messages 10 \
--property print.timestamp=true \
--property print.key=true \
--property print.headers=true \
--property print.value=false
If Kafka is not running in the VSS
container, use the Kafka CLI from
the host or container running the broker:
bash
INCIDENT_TOPIC="${INCIDENT_TOPIC:-mdx-vlm-incidents}"
kafka-console-consumer \
--bootstrap-server "$HOST_IP:9092" \
--topic "$INCIDENT_TOPIC" \
--from-beginning \
--timeout-ms 5000 \
--max-messages 10 \
--property print.timestamp=true \
--property print.key=true \
--property print.headers=true \
--property print.value=false
For standalone validation, remember that the RT-VLM compose maps Kafka through
KAFKA_BOOTSTRAP_SERVERS=${HOST_IP}:9092
; setting
directly in
is ignored unless the compose is changed. The broker must
advertise a listener reachable from the
container.
inside the broker and service containers is not the host, and a broker alias
such as
only works when both containers share that Docker network.
For RT-VLM-only validation, prefer the self-contained broker in
references/kafka-workflows.md
over the full repo infra compose; the latter
expects full-profile SDRC env/config. If Kafka is already running, ask the user
whether to reuse it or launch a dedicated broker before stopping or replacing
anything. Run CLI checks inside the actual broker container, but still configure
the advertised listener so RT-VLM can connect from its container network.
Incident protobuf (
) key fields:
,
,
,
,
,
,
,
,
(
for
alerts),
(nested VisionLLM),
map including
,
,
,
,
,
(if the deployment supports the
query field — post-3.1).
4. Kafka workflows (alerts + message bus)
Dense captioning with alerts on an RTSP stream and the HTTP-vs-Kafka response model are documented in
references/kafka-workflows.md
.
Error Reference
| Code | Meaning | Common Cause |
|---|
| 400 | Bad Request | Missing required field (, , ); unsupported ; unknown name |
| 401 | Unauthorized | Missing/invalid Authorization: Bearer $API_KEY
— or wrong key format (expect ) |
| 404 | Not Found | deleted / stream_id not registered / wrong endpoint path (note: is required on DELETE /v1/streams/delete/{stream_id}
) |
| 413 | Payload Too Large | Uploaded file exceeds server ; increase or pre-chunk the video |
| 422 | Unprocessable Entity | Pydantic schema violation — e.g. use_fps_for_chunking=true
without num_frames_per_second_or_fixed_frames_chunk
; stream ids supplied to a file-only field like |
| 429 | Rate Limited | Too many concurrent streams — raise or spread across instances |
| 500 | Server Error | VLM inference exception (OOM, model unavailable) — check |
| 503 | Service Busy | Startup not complete (model still downloading) or upstream NIM dependency unhealthy |
Gotchas
- Use the live OpenAPI as the source of truth. For VSS 3.2, the caption-generation endpoint is . Some older references and images used
/v1/generate_captions_alerts
; do not assume that path exists unless shows it.
- URL-based input support depends on the deployed service version. If the live schema does not expose //, upload via first and pass the returned .
- Alert trigger = the tokens or in the VLM response (case-insensitive). There is no per-request alert flag. Design prompts with an explicit line and set to constrain the model to Yes/No answers (per the VSS docs). Every chunk is published to ; matched chunks additionally go to with , set to the matched tokens, and
info["verdict"]="confirmed"
.
- support depends on the deployed service version. If the live OpenAPI schema does not expose it, Kafka incidents default
incident.category = "vlm-alert"
.
- Kafka topics are server-side config, not per-request. The env vars (via compose rewrites) are fixed at container start — clients can't override topics on a per-request basis. Kafka publish is additive to the HTTP response, never a replacement.
- Topic names differ by deployment source. The checked-in RT-VLM and VSS alerts/profile sources use and ; a bare copied compose with no overrides falls back to and
vision-llm-events-incidents
. Always trust the live environment before consuming.
- Standalone Kafka must advertise . The RT-VLM compose uses
KAFKA_BOOTSTRAP_SERVERS=${HOST_IP}:9092
; a broker that advertises or may pass producer/consumer tests inside the broker container while RT-VLM publish fails.
- Start Kafka before RT-VLM when Kafka is enabled. For deterministic standalone validation, make the broker reachable at first. If you start Kafka later or change its advertised listener, restart/recreate before expecting Kafka offsets to move.
- returns Server-Sent Events, not chunked JSON. Use (no buffering). Each event is with per-chunk fields such as , , and , terminated by . Without the server buffers until the full video is processed — fine for short clips (<1 min), avoid for live streams.
- Trust live OpenAPI for optional NIM-compatible endpoints. is not exposed by current 26.05 builds and returns 404, even though older generic NIM docs may mention it.
- Prefer over . Text-only legacy completions return HTTP 400 by design on current 26.05 builds; text-only chat completions work.
- disables chunking — the entire video is sent to the VLM as one shot. Only meaningful for short clips; long videos will OOM or exceed .
- Default frame budget caps at
VLLM_MM_PROCESSOR_VIDEO_NUM_FRAMES
(256). Requesting FPS that implies >256 frames per chunk is silently capped; drop FPS or shorten to stay within budget.
- requires a Cosmos Reason model. Passing it with Qwen3-VL or other non-reasoning models is a no-op.
- is unauthenticated on current 26.05 standalone builds. A Bearer token is harmless if a deployment has stricter auth, but do not fail validation when returns HTTP 200 without auth.
- File upload is multipart, not JSON. Use
-F file=@path -F purpose=vision -F media_type=video
; a body returns 422.
- Live-stream lifecycle cleanup must unregister the stream:
DELETE /v1/streams/delete/{stream_id}
removes the RTSP source. If the live schema also exposes DELETE /v1/generate_captions/{stream_id}
, call it first to stop inference explicitly.