# Video Frames

Extract frames from video files using ffmpeg, producing JPEG images optimized for LLM vision analysis. Supports multiple frame-selection strategies (fixed FPS, scene detection, target frame count), quality presets, model-aware dimension optimization, and OCR enhancements.

## Prerequisites

ffmpeg and ffprobe must be installed and on PATH:

```bash
brew install ffmpeg  # macOS
```

## Workflow

1. Receive a video file path from the user
2. Run `scripts/extract_frames.py` to extract JPEG frames
3. Parse the JSON output for frame paths, resolution, and token estimates
4. Read the extracted frames as image attachments for analysis
5. Answer the user's question about the video content
6. Clean up temp directories when done
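Steps 2-3 of this workflow can be sketched in Python. The script path and `--max-frames` flag come from this document; the helper names (`build_command`, `extract_frames`) are illustrative, not part of the script:

```python
import json
import subprocess

def build_command(video_path: str, max_frames: int = 30) -> list[str]:
    """Build the extraction command line (step 2 of the workflow)."""
    return ["python3", "scripts/extract_frames.py", video_path,
            "--max-frames", str(max_frames)]

def extract_frames(video_path: str, max_frames: int = 30) -> dict:
    """Run the script and parse its JSON output (steps 2-3)."""
    result = subprocess.run(build_command(video_path, max_frames),
                            capture_output=True, text=True)
    if result.returncode != 0:
        # On failure the script prints {"error": ...} JSON to stderr
        raise RuntimeError(result.stderr.strip())
    return json.loads(result.stdout)
```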

## Quick Start

The simplest invocation extracts 1 frame/second at balanced quality:

```bash
python3 scripts/extract_frames.py video.mp4
```

For most use cases, use `--max-frames` to let the script auto-calculate the FPS:

```bash
python3 scripts/extract_frames.py video.mp4 --max-frames 30
```

Prefer this over manually setting `--fps`: it adapts to any video length and keeps the frame count predictable.

## Presets

Four quality presets control resolution, JPEG quality, and image processing:

| Preset      | Max dim | Quality | Extras                              | Best for                    |
|-------------|---------|---------|-------------------------------------|-----------------------------|
| `efficient` | 768px   | 5       | --                                  | Bulk frames, long videos    |
| `balanced`  | 1024px  | 3       | --                                  | General analysis (default)  |
| `detailed`  | 1568px  | 2       | --                                  | Fine detail, small objects  |
| `ocr`       | 1568px  | 1       | grayscale + high contrast + sharpen | Text/document extraction    |

```bash
# Long video, keep costs low
python3 scripts/extract_frames.py long_video.mp4 --max-frames 20 --preset efficient

# Need to read text in a screencast
python3 scripts/extract_frames.py screencast.mp4 --max-frames 40 --preset ocr
```

Quality (1=best, 31=worst) and max dimension can be overridden independently:

```bash
python3 scripts/extract_frames.py video.mp4 --preset balanced --quality 1 --max-dimension 1568
```

## Scene-Change Detection

Instead of extracting at a fixed rate, detect visual scene changes and extract one frame per scene. This is ideal for videos with distinct segments (presentations, edited footage, tutorials).

```bash
python3 scripts/extract_frames.py video.mp4 --scene-threshold 0.3
```

- `--scene-threshold` (float, 0.0-1.0): Sensitivity. Lower values are more sensitive and detect smaller changes. Start with `0.3` (the default when the flag is used).
- `--min-scene-interval` (float, default: 1.0): Minimum seconds between detected scenes. Prevents burst detections during rapid cuts.

Note: `--fps` and `--scene-threshold` are mutually exclusive, and `--max-frames` can only be used in `--fps` mode, not with scene detection.

```bash
# Presentation with clear slide transitions
python3 scripts/extract_frames.py presentation.mp4 --scene-threshold 0.2

# Action footage: less sensitive, minimum 2s between frames
python3 scripts/extract_frames.py action.mp4 --scene-threshold 0.5 --min-scene-interval 2.0
```
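The effect of `--min-scene-interval` can be illustrated with a small sketch. The function name and list-of-timestamps representation are assumptions for illustration, not the script's actual internals: a detection is kept only if at least the minimum interval has passed since the last kept one.

```python
def filter_scene_times(timestamps: list[float], min_interval: float = 1.0) -> list[float]:
    """Drop scene-change detections that follow the previous kept one too closely."""
    kept: list[float] = []
    for t in sorted(timestamps):
        if not kept or t - kept[-1] >= min_interval:
            kept.append(t)
    return kept

# Rapid cuts at 0.3s and 1.9s are suppressed by the 1.0s minimum interval
print(filter_scene_times([0.0, 0.3, 1.5, 1.9, 3.0]))  # [0.0, 1.5, 3.0]
```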

## Model-Aware Optimization

Use `--target-model` to resize frames to dimensions that align with a specific model's tile boundaries, minimizing wasted tokens:

| Model       | Max dim | Rationale                                      |
|-------------|---------|------------------------------------------------|
| `claude`    | 1568px  | Max native resolution before auto-resize       |
| `openai`    | 768px   | Aligned to 512px tile grid (shortest side 768) |
| `gemini`    | 768px   | Aligned to 768px tile boundaries               |
| `universal` | 768px   | Sweet spot across all models (default)         |

```bash
# Optimized for Claude: maximum detail
python3 scripts/extract_frames.py video.mp4 --max-frames 30 --target-model claude

# Optimized for GPT-4o: efficient tile packing
python3 scripts/extract_frames.py video.mp4 --max-frames 30 --target-model openai
```

`--target-model` sets the max dimension unless `--max-dimension` is explicitly provided (the CLI override takes priority).

See `references/llm-image-specs.md` for detailed token formulas, tile calculations, and optimal dimension tables for each model.
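The longest-edge cap behind these max dimensions can be sketched as an aspect-preserving resize. This is a sketch of the documented behavior, not the script's code, and it assumes frames are never upscaled:

```python
def target_size(width: int, height: int, max_dim: int) -> tuple[int, int]:
    """Scale so the longest edge equals max_dim, preserving aspect ratio."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height  # assumed: never upscale
    scale = max_dim / longest
    return round(width * scale), round(height * scale)

print(target_size(1920, 1080, 768))  # (768, 432)
```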

## OCR and Grayscale Mode

For videos containing text (screencasts, presentations, documents, terminal recordings):

```bash
# Full OCR pipeline via preset
python3 scripts/extract_frames.py screencast.mp4 --preset ocr --max-frames 40

# Manual OCR flags (can combine with any preset)
python3 scripts/extract_frames.py video.mp4 --grayscale --high-contrast
```

- `--grayscale`: Converts frames to grayscale. Reduces file size by roughly 60% with no loss of OCR accuracy.
- `--high-contrast`: Applies `contrast=1.3, brightness=0.05` to improve text/background separation.
- The `ocr` preset enables both flags **plus** unsharp-mask sharpening at 1568px, quality 1 (best JPEG).

## Advanced Options

### FPS Selection Guide

When using `--fps` directly instead of `--max-frames`:

| Video length | Recommended fps | Rationale                                 |
|--------------|-----------------|-------------------------------------------|
| < 30s        | 2-5             | Short clip, capture detail                |
| 30s-5min     | 1               | Good balance of coverage vs. frame count  |
| 5min-30min   | 0.5             | Avoid excessive frames                    |
| > 30min      | 0.1-0.2         | Sample key moments only                   |

Keep the total frame count under ~50 for optimal LLM context usage. Formula: `duration_seconds * fps = frame_count`.

Prefer `--max-frames` over manual FPS: it auto-calculates the right rate and clamps it to the 0.05-30.0 FPS range.
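The `--max-frames` auto-calculation follows directly from the formula and the 0.05-30.0 clamp above; a minimal sketch (the function name is illustrative):

```python
def fps_for_max_frames(duration_seconds: float, max_frames: int) -> float:
    """Pick an FPS that yields about max_frames frames, clamped to 0.05-30.0."""
    fps = max_frames / duration_seconds
    return min(30.0, max(0.05, fps))

print(fps_for_max_frames(120.0, 30))  # 0.25 (one frame every 4 seconds)
```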

### Timestamp Overlay

```bash
python3 scripts/extract_frames.py video.mp4 --timestamps --max-frames 30
```

Overlays the source filename and an `hh:mm:ss` timestamp in the bottom-right corner of each frame (white text on a semi-transparent black box). Use this when the user needs to reference specific moments in the video.
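Converting a frame's time offset into the `hh:mm:ss` stamp shown on each frame is straightforward (an illustrative helper, not part of the script's CLI):

```python
def hhmmss(seconds: float) -> str:
    """Format a time offset in seconds as hh:mm:ss."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"

print(hhmmss(3725))  # 01:02:05
```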

## All CLI Options Reference

| Option                 | Type       | Default    | Description                                                         |
|------------------------|------------|------------|---------------------------------------------------------------------|
| `video_path`           | positional | (required) | Path to the video file                                              |
| `--fps`                | float      | 1.0        | Frames per second (mutually exclusive with `--scene-threshold`)     |
| `--scene-threshold`    | float      | --         | Scene-change sensitivity 0.0-1.0 (mutually exclusive with `--fps`)  |
| `--min-scene-interval` | float      | 1.0        | Min seconds between scene-change frames                             |
| `--max-frames`         | int        | --         | Auto-calculate FPS to produce ~N frames                             |
| `--preset`             | choice     | `balanced` | `efficient` / `balanced` / `detailed` / `ocr`                       |
| `--max-dimension`      | int        | --         | Override max pixel dimension (longest edge)                         |
| `--quality`            | int        | --         | JPEG quality 1-31 (1=best, 31=worst)                                |
| `--target-model`       | choice     | --         | `claude` / `openai` / `gemini` / `universal`                        |
| `--grayscale`          | flag       | off        | Convert to grayscale                                                |
| `--high-contrast`      | flag       | off        | Boost contrast for text readability                                 |
| `--timestamps`         | flag       | off        | Overlay filename + timestamp on frames                              |
| `--output-dir`         | string     | temp dir   | Output directory for extracted frames                               |
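The mutual exclusion of `--fps` and `--scene-threshold` maps naturally onto an argparse mutually exclusive group. This is a sketch of how such a parser could look, covering a subset of the table above; it is not the script's actual source:

```python
import argparse

def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Extract video frames")
    parser.add_argument("video_path", help="Path to the video file")
    # --fps and --scene-threshold cannot be combined
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--fps", type=float, default=1.0)
    mode.add_argument("--scene-threshold", type=float)
    parser.add_argument("--min-scene-interval", type=float, default=1.0)
    parser.add_argument("--max-frames", type=int)
    parser.add_argument("--preset", default="balanced",
                        choices=["efficient", "balanced", "detailed", "ocr"])
    parser.add_argument("--grayscale", action="store_true")
    return parser

args = make_parser().parse_args(["video.mp4", "--scene-threshold", "0.3"])
print(args.scene_threshold)  # 0.3
```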

## Output JSON Structure

The script prints JSON to stdout with the following structure:

```json
{
  "output_dir": "/tmp/video_frames_abc123/",
  "frames": ["/tmp/video_frames_abc123/frame_00001.jpg", "..."],
  "preset": "balanced",
  "resolution": { "width": 1024, "height": 576 },
  "token_estimate": {
    "frame_count": 30,
    "per_frame": {
      "claude": 787,
      "openai_high": 765,
      "openai_low": 85,
      "openai_patch": 934,
      "gemini": 258
    },
    "total": {
      "claude": 23610,
      "openai_high": 22950,
      "openai_low": 2550,
      "openai_patch": 28020,
      "gemini": 7740
    }
  },
  "summary": {
    "video_duration_seconds": 120.5,
    "extraction_method": "max_frames",
    "scene_changes_detected": null,
    "frames_extracted": 30,
    "estimated_total_tokens": {
      "claude": 23610,
      "openai_high": 22950,
      "openai_low": 2550,
      "openai_patch": 28020,
      "gemini": 7740
    }
  }
}
```

Use `token_estimate.total` to verify the frame set fits within model context limits before attaching frames to a prompt.

Note: `openai_high` and `openai_low` apply to legacy models (GPT-4o, GPT-4.1); `openai_patch` applies to newer models (gpt-5.4+, gpt-5-mini, o4-mini). See `references/llm-image-specs.md` for details.

On error, JSON with an `"error"` key is printed to stderr and the script exits with code 1.
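Checking `token_estimate.total` against a context budget before attaching frames can look like this. The sample numbers are abbreviated from the JSON above; the budget figure and function name are illustrative assumptions:

```python
result = {  # abbreviated sample of the script's output
    "frames": ["/tmp/video_frames_abc123/frame_00001.jpg"],
    "token_estimate": {
        "frame_count": 30,
        "total": {"claude": 23610, "openai_high": 22950,
                  "openai_low": 2550, "openai_patch": 28020, "gemini": 7740},
    },
}

def fits_budget(result: dict, model: str = "claude", budget: int = 100_000) -> bool:
    """True if the whole frame set fits within the given token budget."""
    return result["token_estimate"]["total"][model] <= budget

print(fits_budget(result, "claude"))                       # True
print(fits_budget(result, "openai_patch", budget=20_000))  # False
```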

## After Extraction

1. Parse the JSON output to get the list of frame paths from `frames`
2. Check `token_estimate.total` to ensure the frames fit within context limits
3. Read each frame image using the Read tool (they are JPEG files)
4. Analyze the frames to answer the user's question
5. Clean up: delete the output directory when done if it was a temp dir
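Step 5's cleanup can be guarded so that only directories under the system temp root are ever deleted (the helper name is illustrative; the check assumes the default `--output-dir` lands under the system temp directory):

```python
import os
import shutil
import tempfile

def cleanup_output_dir(output_dir: str) -> bool:
    """Delete output_dir only if it lives under the system temp directory."""
    real = os.path.realpath(output_dir)
    temp_root = os.path.realpath(tempfile.gettempdir())
    if os.path.commonpath([real, temp_root]) != temp_root:
        return False  # user-supplied directory; leave it alone
    shutil.rmtree(real, ignore_errors=True)
    return True
```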