video-understand
Understand video content locally using ffmpeg frame extraction and Whisper transcription. No API keys needed. Use when: (1) Understanding what a video contains, (2) Transcribing video audio locally, (3) Extracting key frames for visual analysis, (4) Getting video content without API keys.
Source: heygen-com/skills
NPX Install
npx skill4agent add heygen-com/skills video-understand
SKILL.md Content
video-understand
Understand video content locally using ffmpeg for frame extraction and Whisper for transcription. Fully offline, no API keys required.
Prerequisites
- `ffmpeg` + `ffprobe` (required): `brew install ffmpeg`
- `openai-whisper` (optional, for transcription): `pip install openai-whisper`
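Before running the script, it can help to confirm that the tools listed above are actually available. The following is a minimal sketch, not part of the skill itself: `check_prerequisites` is a hypothetical helper name, and it assumes the `openai-whisper` package is importable as the `whisper` module.

```python
import importlib.util
import shutil


def check_prerequisites() -> dict:
    """Report which prerequisites are visible from this Python environment."""
    return {
        # ffmpeg and ffprobe are located via the PATH, like a shell would
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "ffprobe": shutil.which("ffprobe") is not None,
        # openai-whisper installs a top-level module named "whisper"
        "whisper": importlib.util.find_spec("whisper") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A missing `whisper` entry is not fatal, since transcription is optional and can be skipped with `--no-transcribe`.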
Commands
```bash
# Scene detection + transcribe (default)
python3 skills/video-understand/scripts/understand_video.py video.mp4
# Keyframe extraction
python3 skills/video-understand/scripts/understand_video.py video.mp4 -m keyframe
# Regular interval extraction
python3 skills/video-understand/scripts/understand_video.py video.mp4 -m interval
# Limit frames extracted
python3 skills/video-understand/scripts/understand_video.py video.mp4 --max-frames 10
# Use a larger Whisper model
python3 skills/video-understand/scripts/understand_video.py video.mp4 --whisper-model small
# Frames only, skip transcription
python3 skills/video-understand/scripts/understand_video.py video.mp4 --no-transcribe
# Quiet mode (JSON only, no progress)
python3 skills/video-understand/scripts/understand_video.py video.mp4 -q
# Output to file
python3 skills/video-understand/scripts/understand_video.py video.mp4 -o result.json
```
CLI Options
| Flag | Description |
|---|---|
| `<video>` | Input video file (positional, required) |
| `-m` | Extraction mode: `scene` (default), `keyframe`, or `interval` |
| `--max-frames` | Maximum frames to keep (default: 20) |
| `--whisper-model` | Whisper model size: tiny, base, small, medium, large (default: base) |
| `--no-transcribe` | Skip audio transcription, extract frames only |
| `-o` | Write result JSON to file instead of stdout |
| `-q` | Suppress progress messages, output only JSON |
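The flags above can also be combined when calling the script from another Python program. This is a minimal sketch under stated assumptions: `build_command` and `understand` are hypothetical helper names, the script path is the one used in the Commands section, and `-q` is assumed to leave only the JSON document on stdout, as its description says.

```python
import json
import subprocess
import sys
from typing import List, Optional

# Path taken from the Commands section; adjust to your checkout.
SCRIPT = "skills/video-understand/scripts/understand_video.py"


def build_command(
    video: str,
    mode: Optional[str] = None,
    max_frames: Optional[int] = None,
    whisper_model: Optional[str] = None,
    transcribe: bool = True,
) -> List[str]:
    """Assemble the argument list using only the documented flags."""
    cmd = [sys.executable, SCRIPT, video, "-q"]  # -q: JSON only on stdout
    if mode is not None:
        cmd += ["-m", mode]
    if max_frames is not None:
        cmd += ["--max-frames", str(max_frames)]
    if whisper_model is not None:
        cmd += ["--whisper-model", whisper_model]
    if not transcribe:
        cmd.append("--no-transcribe")
    return cmd


def understand(video: str, **options) -> dict:
    """Run the script and parse the JSON it prints to stdout."""
    proc = subprocess.run(
        build_command(video, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(proc.stdout)
```

For example, `understand("video.mp4", mode="keyframe", max_frames=10)` would run the keyframe extraction shown in the Commands section with a lower frame limit.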
Extraction Modes
| Mode | How it works | Best for |
|---|---|---|
| `scene` | Detects scene changes via ffmpeg | Most videos, varied content |
| `keyframe` | Extracts I-frames (codec keyframes) | Encoded video with natural keyframe placement |
| `interval` | Evenly spaced frames based on duration and max-frames | Fixed sampling, predictable output |
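The three modes in the table can be illustrated roughly as follows. This is a sketch, not the script's actual implementation: `select_filter` and `interval_timestamps` are hypothetical names, the scene threshold of 0.3 is an assumed value, and starting the interval samples at 0 is one possible choice. The filter strings use ffmpeg's `select` filter, whose `scene` score and `pict_type` variables are standard ffmpeg expressions.

```python
from typing import List


def select_filter(mode: str, scene_threshold: float = 0.3) -> str:
    """Return an ffmpeg -vf filter string for the frame-selecting modes."""
    if mode == "scene":
        # Keep frames whose scene-change score exceeds the threshold (assumed 0.3)
        return f"select='gt(scene,{scene_threshold})'"
    if mode == "keyframe":
        # Keep only I-frames, i.e. the codec's own keyframes
        return "select='eq(pict_type,I)'"
    raise ValueError(f"mode {mode!r} does not use a select filter")


def interval_timestamps(duration: float, max_frames: int) -> List[float]:
    """Evenly spaced sample times in seconds, derived from duration and max-frames."""
    if duration <= 0 or max_frames <= 0:
        return []
    step = duration / max_frames
    return [round(i * step, 3) for i in range(max_frames)]
```

A filter string like these would typically be passed to ffmpeg as `-vf <filter>` together with a variable-frame-rate output option, so that only the selected frames are written.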
If `scene` mode detects no scene changes, it automatically falls back to `interval` mode.
Output
The script outputs JSON to stdout (or to a file with `-o`). See `references/output-format.md` for the full schema.
```json
{
  "video": "video.mp4",
  "duration": 18.076,
  "resolution": {"width": 1224, "height": 1080},
  "mode": "scene",
  "frames": [
    {"path": "/abs/path/frame_0001.jpg", "timestamp": 0.0, "timestamp_formatted": "00:00"}
  ],
  "frame_count": 12,
  "transcript": [
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome..."}
  ],
  "text": "Full transcript...",
  "note": "Use the Read tool to view frame images for visual understanding."
}
```
Use the Read tool on frame image paths to visually inspect extracted frames.
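Because frames and transcript segments both carry timestamps, the two can be lined up after the fact. The sketch below assumes only the fields shown in the sample output; `frames_with_speech` is a hypothetical helper, and it treats `transcript` as possibly absent, since the behavior under `--no-transcribe` is not documented here.

```python
from typing import List, Tuple


def frames_with_speech(result: dict) -> List[Tuple[str, str, str]]:
    """Pair each frame with the transcript text spoken at its timestamp.

    Returns (timestamp_formatted, path, text) tuples; text is "" when no
    transcript segment covers the frame.
    """
    segments = result.get("transcript") or []
    paired = []
    for frame in result.get("frames", []):
        t = frame["timestamp"]
        # Collect every segment whose [start, end) range contains the frame time
        spoken = " ".join(
            seg["text"].strip()
            for seg in segments
            if seg["start"] <= t < seg["end"]
        )
        paired.append((frame["timestamp_formatted"], frame["path"], spoken))
    return paired
```

Applied to a result saved with `-o result.json`, this could be called as `frames_with_speech(json.load(open("result.json")))` after importing `json`, giving a per-frame list to read alongside the images.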
References
- `references/output-format.md` -- Full JSON output schema documentation