# Video Frames

Extract frames from video files using ffmpeg, producing JPEG images optimized for LLM vision analysis. Supports multiple frame-selection strategies (fixed FPS, scene detection, target frame count), quality presets, model-aware dimension optimization, and OCR enhancements.

## Prerequisites

ffmpeg and ffprobe must be installed and on PATH:

```bash
brew install ffmpeg  # macOS
```

## Workflow

1. Receive a video file path from the user
2. Run `scripts/extract_frames.py` to extract JPEG frames
3. Parse the JSON output for frame paths, resolution, and token estimates
4. Read the extracted frames as image attachments for analysis
5. Answer the user's question about the video content
6. Clean up temp directories when done
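Steps 2-3 of this workflow can be sketched in Python. The script path and `--max-frames` flag come from this document; the helper names (`build_command`, `extract_frames`) are illustrative, not part of the script:

```python
import json
import subprocess

def build_command(video_path: str, max_frames: int = 30) -> list[str]:
    """Build the extraction command line (step 2 of the workflow)."""
    return ["python3", "scripts/extract_frames.py", video_path,
            "--max-frames", str(max_frames)]

def extract_frames(video_path: str, max_frames: int = 30) -> dict:
    """Run the script and parse its JSON output (steps 2-3)."""
    result = subprocess.run(build_command(video_path, max_frames),
                            capture_output=True, text=True)
    if result.returncode != 0:
        # On failure the script prints {"error": ...} JSON to stderr
        raise RuntimeError(result.stderr.strip())
    return json.loads(result.stdout)
```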

## Quick Start

The simplest invocation extracts 1 frame/second at balanced quality:

```bash
python3 scripts/extract_frames.py video.mp4
```

For most use cases, use `--max-frames` to let the script auto-calculate the FPS:

```bash
python3 scripts/extract_frames.py video.mp4 --max-frames 30
```

Prefer this over manually setting `--fps`: it adapts to any video length and keeps the frame count predictable.

## Presets

Four quality presets control resolution, JPEG quality, and image processing:

| Preset      | Max dim | Quality | Extras                              | Best for                    |
|-------------|---------|---------|-------------------------------------|-----------------------------|
| `efficient` | 768px   | 5       | --                                  | Bulk frames, long videos    |
| `balanced`  | 1024px  | 3       | --                                  | General analysis (default)  |
| `detailed`  | 1568px  | 2       | --                                  | Fine detail, small objects  |
| `ocr`       | 1568px  | 1       | grayscale + high contrast + sharpen | Text/document extraction    |

```bash
# Long video, keep costs low
python3 scripts/extract_frames.py long_video.mp4 --max-frames 20 --preset efficient

# Need to read text in a screencast
python3 scripts/extract_frames.py screencast.mp4 --max-frames 40 --preset ocr
```

Quality (1=best, 31=worst) and max dimension can be overridden independently:

```bash
python3 scripts/extract_frames.py video.mp4 --preset balanced --quality 1 --max-dimension 1568
```

## Scene-Change Detection

Instead of extracting at a fixed rate, detect visual scene changes and extract one frame per scene. This is ideal for videos with distinct segments (presentations, edited footage, tutorials).

```bash
python3 scripts/extract_frames.py video.mp4 --scene-threshold 0.3
```

- `--scene-threshold` (float, 0.0-1.0): Sensitivity. Lower values are more sensitive and detect smaller changes. Start with `0.3` (the default when the flag is used).
- `--min-scene-interval` (float, default: 1.0): Minimum seconds between detected scenes. Prevents burst detections during rapid cuts.

Note: `--fps` and `--scene-threshold` are mutually exclusive, and `--max-frames` can only be used in `--fps` mode, not with scene detection.

```bash
# Presentation with clear slide transitions
python3 scripts/extract_frames.py presentation.mp4 --scene-threshold 0.2

# Action footage: less sensitive, minimum 2s between frames
python3 scripts/extract_frames.py action.mp4 --scene-threshold 0.5 --min-scene-interval 2.0
```
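The effect of `--min-scene-interval` can be illustrated with a small sketch. The function name and list-of-timestamps representation are assumptions for illustration, not the script's actual internals: a detection is kept only if at least the minimum interval has passed since the last kept one.

```python
def filter_scene_times(timestamps: list[float], min_interval: float = 1.0) -> list[float]:
    """Drop scene-change detections that follow the previous kept one too closely."""
    kept: list[float] = []
    for t in sorted(timestamps):
        if not kept or t - kept[-1] >= min_interval:
            kept.append(t)
    return kept

# Rapid cuts at 0.3s and 1.9s are suppressed by the 1.0s minimum interval
print(filter_scene_times([0.0, 0.3, 1.5, 1.9, 3.0]))  # [0.0, 1.5, 3.0]
```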

## Model-Aware Optimization

Use `--target-model` to resize frames to dimensions that align with a specific model's tile boundaries, minimizing wasted tokens:

| Model       | Max dim | Rationale                                      |
|-------------|---------|------------------------------------------------|
| `claude`    | 1568px  | Max native resolution before auto-resize       |
| `openai`    | 768px   | Aligned to 512px tile grid (shortest side 768) |
| `gemini`    | 768px   | Aligned to 768px tile boundaries               |
| `universal` | 768px   | Sweet spot across all models (default)         |

```bash
# Optimized for Claude: maximum detail
python3 scripts/extract_frames.py video.mp4 --max-frames 30 --target-model claude

# Optimized for GPT-4o: efficient tile packing
python3 scripts/extract_frames.py video.mp4 --max-frames 30 --target-model openai
```

`--target-model` sets the max dimension unless `--max-dimension` is explicitly provided (the CLI override takes priority).

See `references/llm-image-specs.md` for detailed token formulas, tile calculations, and optimal dimension tables for each model.
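The longest-edge cap behind these max dimensions can be sketched as an aspect-preserving resize. This is a sketch of the documented behavior, not the script's code, and it assumes frames are never upscaled:

```python
def target_size(width: int, height: int, max_dim: int) -> tuple[int, int]:
    """Scale so the longest edge equals max_dim, preserving aspect ratio."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height  # assumed: never upscale
    scale = max_dim / longest
    return round(width * scale), round(height * scale)

print(target_size(1920, 1080, 768))  # (768, 432)
```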

## OCR and Grayscale Mode

For videos containing text (screencasts, presentations, documents, terminal recordings):

```bash
# Full OCR pipeline via preset
python3 scripts/extract_frames.py screencast.mp4 --preset ocr --max-frames 40

# Manual OCR flags (can combine with any preset)
python3 scripts/extract_frames.py video.mp4 --grayscale --high-contrast
```

- `--grayscale`: Converts frames to grayscale. Reduces file size by roughly 60% with no loss of OCR accuracy.
- `--high-contrast`: Applies `contrast=1.3, brightness=0.05` to improve text/background separation.
- The `ocr` preset enables both flags **plus** unsharp-mask sharpening at 1568px, quality 1 (best JPEG).

## Advanced Options

### FPS Selection Guide

When using `--fps` directly instead of `--max-frames`:

| Video length | Recommended fps | Rationale                                 |
|--------------|-----------------|-------------------------------------------|
| < 30s        | 2-5             | Short clip, capture detail                |
| 30s-5min     | 1               | Good balance of coverage vs. frame count  |
| 5min-30min   | 0.5             | Avoid excessive frames                    |
| > 30min      | 0.1-0.2         | Sample key moments only                   |

Keep the total frame count under ~50 for optimal LLM context usage. Formula: `duration_seconds * fps = frame_count`.

Prefer `--max-frames` over manual FPS: it auto-calculates the right rate and clamps it to the 0.05-30.0 FPS range.
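The `--max-frames` auto-calculation follows directly from the formula and the 0.05-30.0 clamp above; a minimal sketch (the function name is illustrative):

```python
def fps_for_max_frames(duration_seconds: float, max_frames: int) -> float:
    """Pick an FPS that yields about max_frames frames, clamped to 0.05-30.0."""
    fps = max_frames / duration_seconds
    return min(30.0, max(0.05, fps))

print(fps_for_max_frames(120.0, 30))  # 0.25 (one frame every 4 seconds)
```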

### Timestamp Overlay

```bash
python3 scripts/extract_frames.py video.mp4 --timestamps --max-frames 30
```

Overlays the source filename and an `hh:mm:ss` timestamp in the bottom-right corner of each frame (white text on a semi-transparent black box). Use this when the user needs to reference specific moments in the video.
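Converting a frame's time offset into the `hh:mm:ss` stamp shown on each frame is straightforward (an illustrative helper, not part of the script's CLI):

```python
def hhmmss(seconds: float) -> str:
    """Format a time offset in seconds as hh:mm:ss."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"

print(hhmmss(3725))  # 01:02:05
```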

## All CLI Options Reference

| Option                 | Type       | Default    | Description                                                         |
|------------------------|------------|------------|---------------------------------------------------------------------|
| `video_path`           | positional | (required) | Path to the video file                                              |
| `--fps`                | float      | 1.0        | Frames per second (mutually exclusive with `--scene-threshold`)     |
| `--scene-threshold`    | float      | --         | Scene-change sensitivity 0.0-1.0 (mutually exclusive with `--fps`)  |
| `--min-scene-interval` | float      | 1.0        | Min seconds between scene-change frames                             |
| `--max-frames`         | int        | --         | Auto-calculate FPS to produce ~N frames                             |
| `--preset`             | choice     | `balanced` | `efficient` / `balanced` / `detailed` / `ocr`                       |
| `--max-dimension`      | int        | --         | Override max pixel dimension (longest edge)                         |
| `--quality`            | int        | --         | JPEG quality 1-31 (1=best, 31=worst)                                |
| `--target-model`       | choice     | --         | `claude` / `openai` / `gemini` / `universal`                        |
| `--grayscale`          | flag       | off        | Convert to grayscale                                                |
| `--high-contrast`      | flag       | off        | Boost contrast for text readability                                 |
| `--timestamps`         | flag       | off        | Overlay filename + timestamp on frames                              |
| `--output-dir`         | string     | temp dir   | Output directory for extracted frames                               |
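The mutual exclusion of `--fps` and `--scene-threshold` maps naturally onto an argparse mutually exclusive group. This is a sketch of how such a parser could look, covering a subset of the table above; it is not the script's actual source:

```python
import argparse

def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Extract video frames")
    parser.add_argument("video_path", help="Path to the video file")
    # --fps and --scene-threshold cannot be combined
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--fps", type=float, default=1.0)
    mode.add_argument("--scene-threshold", type=float)
    parser.add_argument("--min-scene-interval", type=float, default=1.0)
    parser.add_argument("--max-frames", type=int)
    parser.add_argument("--preset", default="balanced",
                        choices=["efficient", "balanced", "detailed", "ocr"])
    parser.add_argument("--grayscale", action="store_true")
    return parser

args = make_parser().parse_args(["video.mp4", "--scene-threshold", "0.3"])
print(args.scene_threshold)  # 0.3
```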

## Output JSON Structure

The script prints JSON to stdout with the following structure:

```json
{
  "output_dir": "/tmp/video_frames_abc123/",
  "frames": ["/tmp/video_frames_abc123/frame_00001.jpg", "..."],
  "preset": "balanced",
  "resolution": { "width": 1024, "height": 576 },
  "token_estimate": {
    "frame_count": 30,
    "per_frame": {
      "claude": 787,
      "openai_high": 765,
      "openai_low": 85,
      "openai_patch": 934,
      "gemini": 258
    },
    "total": {
      "claude": 23610,
      "openai_high": 22950,
      "openai_low": 2550,
      "openai_patch": 28020,
      "gemini": 7740
    }
  },
  "summary": {
    "video_duration_seconds": 120.5,
    "extraction_method": "max_frames",
    "scene_changes_detected": null,
    "frames_extracted": 30,
    "estimated_total_tokens": {
      "claude": 23610,
      "openai_high": 22950,
      "openai_low": 2550,
      "openai_patch": 28020,
      "gemini": 7740
    }
  }
}
```

Use `token_estimate.total` to verify the frame set fits within model context limits before attaching frames to a prompt.

Note: `openai_high` and `openai_low` apply to legacy models (GPT-4o, GPT-4.1); `openai_patch` applies to newer models (gpt-5.4+, gpt-5-mini, o4-mini). See `references/llm-image-specs.md` for details.

On error, JSON with an `"error"` key is printed to stderr and the script exits with code 1.
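Checking `token_estimate.total` against a context budget before attaching frames can look like this. The sample numbers are abbreviated from the JSON above; the budget figure and function name are illustrative assumptions:

```python
result = {  # abbreviated sample of the script's output
    "frames": ["/tmp/video_frames_abc123/frame_00001.jpg"],
    "token_estimate": {
        "frame_count": 30,
        "total": {"claude": 23610, "openai_high": 22950,
                  "openai_low": 2550, "openai_patch": 28020, "gemini": 7740},
    },
}

def fits_budget(result: dict, model: str = "claude", budget: int = 100_000) -> bool:
    """True if the whole frame set fits within the given token budget."""
    return result["token_estimate"]["total"][model] <= budget

print(fits_budget(result, "claude"))                       # True
print(fits_budget(result, "openai_patch", budget=20_000))  # False
```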

## After Extraction

1. Parse the JSON output to get the list of frame paths from `frames`
2. Check `token_estimate.total` to ensure the frames fit within context limits
3. Read each frame image using the Read tool (they are JPEG files)
4. Analyze the frames to answer the user's question
5. Clean up: delete the output directory when done if it was a temp dir
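Step 5's cleanup can be guarded so that only directories under the system temp root are ever deleted (the helper name is illustrative; the check assumes the default `--output-dir` lands under the system temp directory):

```python
import os
import shutil
import tempfile

def cleanup_output_dir(output_dir: str) -> bool:
    """Delete output_dir only if it lives under the system temp directory."""
    real = os.path.realpath(output_dir)
    temp_root = os.path.realpath(tempfile.gettempdir())
    if os.path.commonpath([real, temp_root]) != temp_root:
        return False  # user-supplied directory; leave it alone
    shutil.rmtree(real, ignore_errors=True)
    return True
```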