# Video Frames
Extract frames from video files using ffmpeg, producing JPEG images optimized for LLM vision analysis. Supports multiple frame-selection strategies (fixed FPS, scene detection, target frame count), quality presets, model-aware dimension optimization, and OCR enhancements.
## Prerequisites

ffmpeg and ffprobe must be installed and on PATH:

```bash
brew install ffmpeg  # macOS
```
## Workflow

- Receive a video file path from the user
- Run `scripts/extract_frames.py` to extract JPEG frames
- Parse the JSON output for frame paths, resolution, and token estimates
- Read the extracted frames as image attachments for analysis
- Answer the user's question about the video content
- Clean up temp directories when done
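The loop above can be sketched in Python. The helper names (`build_command`, `extract_frames`) are illustrative, not part of the script; the CLI flags and the JSON-on-stdout behavior follow this document:

```python
import json
import subprocess

def build_command(video_path, max_frames=30, preset="balanced"):
    """Assemble the documented CLI invocation."""
    return [
        "python3", "scripts/extract_frames.py", video_path,
        "--max-frames", str(max_frames),
        "--preset", preset,
    ]

def extract_frames(video_path, **kwargs):
    """Run the script and return its parsed JSON result (printed on stdout)."""
    proc = subprocess.run(build_command(video_path, **kwargs),
                          capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)
```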
## Quick Start

The simplest invocation extracts 1 frame/second at balanced quality:

```bash
python3 scripts/extract_frames.py video.mp4
```

For most use cases, use `--max-frames` to let the script auto-calculate FPS:

```bash
python3 scripts/extract_frames.py video.mp4 --max-frames 30
```

This is the preferred approach over manually setting `--fps`, since it adapts to any video length and keeps the frame count predictable.
## Presets

Four quality presets control resolution, JPEG quality, and image processing:

| Preset | Max dim | Quality | Extras | Best for |
|---|---|---|---|---|
| `efficient` | 768px | 5 | -- | Bulk frames, long videos |
| `balanced` | 1024px | 3 | -- | General analysis (default) |
| `detailed` | 1568px | 2 | -- | Fine detail, small objects |
| `ocr` | 1568px | 1 | grayscale + high contrast + sharpen | Text/document extraction |
```bash
# Long video, keep costs low
python3 scripts/extract_frames.py long_video.mp4 --max-frames 20 --preset efficient

# Need to read text in a screencast
python3 scripts/extract_frames.py screencast.mp4 --max-frames 40 --preset ocr
```
Quality (1=best, 31=worst) and max dimension can be overridden independently:

```bash
python3 scripts/extract_frames.py video.mp4 --preset balanced --quality 1 --max-dimension 1568
```
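The downscale arithmetic is straightforward: cap the longest edge at the max dimension and preserve aspect ratio. A minimal sketch (the script's exact rounding may differ, e.g. ffmpeg often snaps to even dimensions):

```python
def fit_to_max_dimension(width, height, max_dim):
    """Scale (width, height) so the longest edge is at most max_dim,
    preserving aspect ratio. Never upscales."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height
    scale = max_dim / longest
    return round(width * scale), round(height * scale)

# A 1920x1080 source capped at 1024px:
print(fit_to_max_dimension(1920, 1080, 1024))  # (1024, 576)
```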
## Scene-Change Detection
Instead of extracting at a fixed rate, detect visual scene changes and extract one frame per scene. This is ideal for videos with distinct segments (presentations, edited footage, tutorials).
```bash
python3 scripts/extract_frames.py video.mp4 --scene-threshold 0.3
```
- `--scene-threshold` (float, 0.0-1.0): Sensitivity. Lower = more sensitive, detects smaller changes. Start with 0.3 (the default when the flag is used).
- `--min-scene-interval` (float, default: 1.0): Minimum seconds between detected scenes. Prevents burst detections during rapid cuts.
Note: `--fps` and `--scene-threshold` are mutually exclusive. `--max-frames` can only be used with fixed-rate (`--fps`) mode, not scene detection.
```bash
# Presentation with clear slide transitions
python3 scripts/extract_frames.py presentation.mp4 --scene-threshold 0.2

# Action footage -- less sensitive, min 2s apart
python3 scripts/extract_frames.py action.mp4 --scene-threshold 0.5 --min-scene-interval 2.0
```
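Scene extraction of this kind is typically built on ffmpeg's `select` filter and its `scene` score. A hypothetical sketch of such a filter expression (not necessarily the script's exact implementation):

```python
def scene_filter(threshold=0.3):
    """Build an ffmpeg -vf expression that keeps only frames whose
    scene-change score exceeds the threshold."""
    return f"select='gt(scene,{threshold})'"

# -vsync vfr drops the timestamps of discarded frames so output
# frames are numbered consecutively.
cmd = ["ffmpeg", "-i", "video.mp4", "-vf", scene_filter(0.3),
       "-vsync", "vfr", "frames/frame_%05d.jpg"]
```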
## Model-Aware Optimization

Use `--target-model` to resize frames to dimensions that align with a specific model's tile boundaries, minimizing wasted tokens:

| Model | Max dim | Rationale |
|---|---|---|
| `claude` | 1568px | Max native resolution before auto-resize |
| `openai` | 768px | Aligned to 512px tile grid (shortest side 768) |
| `gemini` | 768px | Aligned to 768px tile boundaries |
| (no flag) | 768px | Sweet spot across all models (default) |
```bash
# Optimized for Claude -- maximum detail
python3 scripts/extract_frames.py video.mp4 --max-frames 30 --target-model claude

# Optimized for GPT-4o -- efficient tile packing
python3 scripts/extract_frames.py video.mp4 --max-frames 30 --target-model openai
```
`--target-model` sets the max dimension unless `--max-dimension` is explicitly provided (the CLI override takes priority).
See `references/llm-image-specs.md` for detailed token formulas, tile calculations, and optimal dimension tables for each model.
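As a sanity check, the published formulas reproduce the per-frame numbers in the sample output later in this document: Claude estimates roughly width x height / 750 tokens, and GPT-4o high-detail costs 85 base tokens plus 170 per 512px tile (the pre-tiling resize step is a no-op at 1024x576, so it is omitted here):

```python
import math

def claude_tokens(w, h):
    # Anthropic's guidance: tokens ~ (width * height) / 750
    return math.ceil(w * h / 750)

def openai_high_tokens(w, h):
    # GPT-4o high detail: 85 base tokens + 170 per 512px tile
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(claude_tokens(1024, 576))       # 787
print(openai_high_tokens(1024, 576))  # 765
```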
## OCR and Grayscale Mode
For videos containing text (screencasts, presentations, documents, terminal recordings):
```bash
# Full OCR pipeline via preset
python3 scripts/extract_frames.py screencast.mp4 --preset ocr --max-frames 40

# Manual OCR flags (can combine with any preset)
python3 scripts/extract_frames.py video.mp4 --grayscale --high-contrast
```
- `--grayscale`: Converts frames to grayscale. Reduces file size ~60% with no OCR accuracy loss.
- `--high-contrast`: Applies `contrast=1.3, brightness=0.05` to improve text/background separation.
- The `ocr` preset enables both flags plus unsharp-mask sharpening at 1568px, quality 1 (best JPEG).
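These options map naturally onto standard ffmpeg filters (`hue` for desaturation, `eq` with the documented values, `unsharp` for sharpening). A hypothetical sketch of such a chain, not necessarily the script's exact one:

```python
def ocr_filter_chain(grayscale=True, high_contrast=True, sharpen=True):
    """Compose an ffmpeg -vf chain approximating the ocr preset."""
    filters = []
    if grayscale:
        filters.append("hue=s=0")                          # drop saturation
    if high_contrast:
        filters.append("eq=contrast=1.3:brightness=0.05")  # documented values
    if sharpen:
        filters.append("unsharp")                          # default unsharp mask
    return ",".join(filters)

print(ocr_filter_chain())  # hue=s=0,eq=contrast=1.3:brightness=0.05,unsharp
```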
## Advanced Options

### FPS Selection Guide
When using `--fps` directly instead of `--max-frames`:

| Video length | Recommended fps | Rationale |
|---|---|---|
| < 30s | 2-5 | Short clip, capture detail |
| 30s - 5min | 1 | Good balance of coverage vs frame count |
| 5min - 30min | 0.5 | Avoid excessive frames |
| > 30min | 0.1 - 0.2 | Sample key moments only |
Keep total frame count under ~50 for optimal LLM context usage. Formula: `duration_seconds * fps = frame_count`.

Prefer `--max-frames` over manual FPS -- it auto-calculates the right rate and clamps to 0.05-30.0 FPS.
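A minimal sketch of the auto-FPS calculation as described (the exact behavior belongs to the script):

```python
def auto_fps(duration_seconds, max_frames):
    """FPS that yields ~max_frames over the video, clamped to the
    documented 0.05-30.0 range."""
    fps = max_frames / duration_seconds
    return min(30.0, max(0.05, fps))

# A 120.5s video targeting ~30 frames:
fps = auto_fps(120.5, 30)
print(round(fps, 3))        # 0.249
print(round(120.5 * fps))   # 30
```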
### Timestamp Overlay
```bash
python3 scripts/extract_frames.py video.mp4 --timestamps --max-frames 30
```
Overlays the source filename and timestamp in the bottom-right corner of each frame (white text on a semi-transparent black box). Use when the user needs to reference specific moments in the video.
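ffmpeg's `drawtext` filter is the standard way to render such an overlay; a hypothetical sketch (the exact styling and placement parameters are assumptions):

```python
def timestamp_overlay(filename):
    """ffmpeg drawtext filter: filename + running timestamp, bottom-right,
    white on a semi-transparent black box (a sketch; exact styling may differ)."""
    return (
        "drawtext="
        f"text='{filename} %{{pts\\:hms}}':"   # %{pts\:hms} expands to HH:MM:SS.mmm
        "fontcolor=white:box=1:boxcolor=black@0.5:"
        "x=w-tw-10:y=h-th-10"                  # 10px from the bottom-right corner
    )
```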
## All CLI Options Reference

| Option | Type | Default | Description |
|---|---|---|---|
| `video` | pos. | (required) | Path to the video file |
| `--fps` | float | 1.0 | Frames per second (mutually exclusive with `--scene-threshold`) |
| `--scene-threshold` | float | -- | Scene-change sensitivity 0.0-1.0 (mutually exclusive with `--fps`) |
| `--min-scene-interval` | float | 1.0 | Min seconds between scene-change frames |
| `--max-frames` | int | -- | Auto-calculate FPS to produce ~N frames |
| `--preset` | choice | balanced | `efficient` / `balanced` / `detailed` / `ocr` |
| `--max-dimension` | int | -- | Override max pixel dimension (longest edge) |
| `--quality` | int | -- | JPEG quality 1-31 (1=best, 31=worst) |
| `--target-model` | choice | -- | `claude` / `openai` / `gemini` |
| `--grayscale` | flag | off | Convert to grayscale |
| `--high-contrast` | flag | off | Boost contrast for text readability |
| `--timestamps` | flag | off | Overlay filename + timestamp on frames |
| `--output-dir` | string | temp dir | Output directory for extracted frames |
## Output JSON Structure

The script prints JSON to stdout with the following structure:

```json
{
  "output_dir": "/tmp/video_frames_abc123/",
  "frames": ["/tmp/video_frames_abc123/frame_00001.jpg", "..."],
  "preset": "balanced",
  "resolution": { "width": 1024, "height": 576 },
  "token_estimate": {
    "frame_count": 30,
    "per_frame": {
      "claude": 787,
      "openai_high": 765,
      "openai_low": 85,
      "openai_patch": 934,
      "gemini": 258
    },
    "total": {
      "claude": 23610,
      "openai_high": 22950,
      "openai_low": 2550,
      "openai_patch": 28020,
      "gemini": 7740
    }
  },
  "summary": {
    "video_duration_seconds": 120.5,
    "extraction_method": "max_frames",
    "scene_changes_detected": null,
    "frames_extracted": 30,
    "estimated_total_tokens": {
      "claude": 23610,
      "openai_high": 22950,
      "openai_low": 2550,
      "openai_patch": 28020,
      "gemini": 7740
    }
  }
}
```
Use `token_estimate.total` to verify the frame set fits within model context limits before attaching frames to a prompt.
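For example, a budget check against the parsed output (the budget value here is illustrative, not a real context limit):

```python
import json

def fits_context(result_json, model="claude", budget=150_000):
    """Compare the extraction's total token estimate to a token budget."""
    total = result_json["token_estimate"]["total"][model]
    return total <= budget

sample = json.loads('{"token_estimate": {"total": {"claude": 23610}}}')
print(fits_context(sample))  # True
```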
Note: `openai_high` and `openai_low` are for legacy models (GPT-4o, GPT-4.1). `openai_patch` is for newer models (gpt-5.4+, gpt-5-mini, o4-mini). See `references/llm-image-specs.md` for details.
On error, JSON with an `error` key is printed to stderr and the script exits with code 1.
## After Extraction

- Parse the JSON output to get the list of frame paths from `frames`
- Check the token estimates to ensure the frames fit within context limits
- Read each frame image using the Read tool (they are JPEG files)
- Analyze the frames to answer the user's question
- Clean up: delete the output directory when done if it was a temp dir