video-perception

Video Perception

You have access to video understanding tools via the claude-video-vision MCP server.

Available Tools

- `video_analyze`: Analyze video structure with ffmpeg filters (scene changes, silence, motion, etc.). Use this BEFORE extracting frames to plan your strategy.
- `video_watch`: Extract frames and process audio from a video. Supports variable FPS/resolution per segment.
- `video_detail`: Drill into specific segments. Separates extraction from viewing, so you can extract many frames and view a few at a time.
- `video_info`: Get video metadata without processing.
- `video_configure`: Change settings (backend, resolution, enable_index, etc.).
- `video_setup`: Check/install dependencies.

Workflow

IMPORTANT: You MUST follow these steps in order. Do NOT skip step 2.

1. Always start with `video_info` to get duration, resolution, and audio presence. If the user gives a YouTube URL, pass the URL directly as `path`. The MCP server downloads it with `yt-dlp`, prefers YouTube subtitles/auto-captions for transcription, and falls back to the configured audio backend only when captions are missing, empty, or suspiciously incomplete.

2. REQUIRED for videos > 30s: Call `video_analyze` BEFORE extracting any frames. This is NOT optional: it gives you the structural data you need to make smart extraction decisions. Select filters relevant to the user's question:

   | User intent | Filters to select |
   | --- | --- |
   | "What happens in this video?" | scene_changes, silence, transcription |
   | "Find the scene transitions" | scene_changes, black_intervals |
   | "Are there frozen/stuck parts?" | freeze, blur |
   | "Is this a talking head or action?" | motion |
   | "When does the music start?" | silence, loudness |
   | "Analyze the lighting" | exposure |
   | "Summarize this lecture" | transcription, scene_changes, silence |
   | General / unclear intent | scene_changes, silence, transcription |

   Always include `transcription: true` when the video has audio: the transcription tells you WHERE to look visually.

3. Use the analysis results and transcription to plan your frame extraction strategy:
   - Low FPS (0.1-0.5) for static or predictable segments
   - Higher FPS (1-3) only around scene changes, motion peaks, or moments referenced in speech ("look at this", "as you can see", "let me show you")
   - Never exceed the minimum FPS needed for the task
   - Prefer fewer segments at lower FPS; you can always drill deeper

4. Call `video_watch` to extract frames:
   - For short videos (< 2 minutes): Use `fps: "auto"` without `view_sample`. Short videos need full coverage to avoid missing brief moments, and the auto FPS already adapts to duration.
   - For long videos (> 2 minutes): Use `segments` based on analysis data with variable FPS, and `view_sample` to limit the initial frame count. You can always drill deeper with `video_detail`.

5. Use `video_detail` to drill into specific moments:
   - Start with 3-5 second windows around points of interest
   - Use `view_sample: 3` to preview (first, middle, last frame)
   - Then request specific timestamps with `view` if you need more detail
   - Expand the window only if the initial view is insufficient
   - Treat frame viewing like a binary search: narrow down to what matters
   - Never view all extracted frames at once

6. When the user asks follow-up questions about the same video, consult the manifest already in your context. Do not re-extract frames you already have at the same resolution. Do not re-request frames you already have in context.
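
The ordering rules above can be sketched as a small planner. This is only an illustration: the tool and filter names come from the text, while `plan_calls`, the intent keys, and the returned dictionary layout are hypothetical and not part of the MCP server.

```python
# Hypothetical planner. Tool and filter names are taken from the workflow text;
# the function, the intent keys and the return shape are illustrative only.
FILTERS_BY_INTENT = {
    "overview": ["scene_changes", "silence", "transcription"],
    "transitions": ["scene_changes", "black_intervals"],
    "frozen": ["freeze", "blur"],
    "motion": ["motion"],
    "music": ["silence", "loudness"],
    "lighting": ["exposure"],
    "lecture": ["transcription", "scene_changes", "silence"],
}
DEFAULT_FILTERS = ["scene_changes", "silence", "transcription"]


def plan_calls(duration_s, has_audio, intent="general"):
    """Return the ordered tool calls, the analysis filters and the watch mode."""
    calls = ["video_info"]  # always the first call
    filters = []
    if duration_s > 30:  # analysis is required before extraction for videos > 30s
        filters = list(FILTERS_BY_INTENT.get(intent, DEFAULT_FILTERS))
        if has_audio and "transcription" not in filters:
            filters.append("transcription")
        calls.append("video_analyze")
    calls.append("video_watch")
    # Short videos get full coverage; long ones use segments plus view_sample.
    watch_mode = "auto_fps" if duration_s < 120 else "segments_with_view_sample"
    return {"calls": calls, "filters": filters, "watch_mode": watch_mode}
```

For a five-minute video with audio and a motion question, this sketch returns the three calls in order, the filters `motion` and `transcription`, and the segment-based watch mode.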

Parameter Guide

fps: `"auto"` for general overview. Use the video's original fps (from `video_info`) for frame-by-frame detail. Use 5-10 for analyzing specific short moments. Use 0.1-0.5 for long videos.
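
To see why low rates matter on long videos, a rough frame-count estimate helps. The sketch below assumes the number of extracted frames is simply duration times fps; the server's actual count may differ, and `estimate_frames` is a hypothetical helper.

```python
def estimate_frames(duration_s, fps):
    """Rough frame count for a span sampled at a fixed rate (assumed: duration * fps)."""
    return round(duration_s * fps)


ten_minutes = 600  # seconds
for fps in (0.1, 0.5, 5, 30):
    print(fps, estimate_frames(ten_minutes, fps))
```

Under that assumption, a 10-minute video yields about 60 frames at 0.1 fps and 300 at 0.5 fps, versus 18000 at a 30 fps source rate.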

resolution: 256-512 for quick scans. 512-768 for normal analysis. 1024+ when reading on-screen text or fine details.

segments: Use when you have analysis data. Each segment can have its own fps and resolution. Overrides global fps/start_time/end_time.
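
One way to turn analysis data into `segments` is to cover the whole video at a low rate and raise the rate in short windows around scene changes. This is a sketch under assumptions: the segment keys (`start_time`, `end_time`, `fps`, `resolution`) are guessed from the parameter names above, and `build_segments` with its default values is hypothetical.

```python
def build_segments(duration_s, scene_changes, window_s=2.0, low_fps=0.2, high_fps=2.0):
    """Cover [0, duration_s] at low FPS, switching to high FPS near scene changes."""
    # Build a window around each scene change and merge windows that overlap.
    windows = []
    for t in sorted(scene_changes):
        start = max(0.0, t - window_s)
        end = min(float(duration_s), t + window_s)
        if windows and start <= windows[-1][1]:
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])

    def segment(start, end, fps, resolution):
        # Key names are an assumption, not a confirmed schema.
        return {"start_time": start, "end_time": end, "fps": fps, "resolution": resolution}

    segments = []
    cursor = 0.0
    for start, end in windows:
        if start > cursor:  # quiet stretch before this window
            segments.append(segment(cursor, start, low_fps, 512))
        segments.append(segment(start, end, high_fps, 768))
        cursor = end
    if cursor < duration_s:  # quiet tail after the last window
        segments.append(segment(cursor, float(duration_s), low_fps, 512))
    return segments
```

For a 60-second video with scene changes at 10, 11 and 40 seconds, the two nearby changes merge into one high-FPS window (8 to 13 seconds), giving five segments in total.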

view_sample: Returns N evenly spaced frames from the extracted set. Use this to avoid flooding context with too many images.
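
As an illustration of what "evenly spaced" can mean, the sketch below picks N indices that include the first and last frame. It is not the server's implementation; `sample_indices` and its handling of N = 1 are assumptions.

```python
def sample_indices(total, n):
    """Pick n evenly spaced indices from range(total), including both endpoints."""
    if total <= 0 or n <= 0:
        return []
    if n >= total:
        return list(range(total))  # nothing to thin out
    if n == 1:
        return [total // 2]  # arbitrary choice for this sketch: the middle frame
    step = (total - 1) / (n - 1)
    return [round(i * step) for i in range(n)]
```

With 9 extracted frames and N = 3 this selects indices 0, 4 and 8, that is, the first, middle and last frame.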

skip_audio: Set to true when you only need visual analysis.

YouTube URLs: Pass supported YouTube URLs directly as `path`. Treat `transcription_source: "youtube_subtitles"` as more reliable than `youtube_auto_captions`; auto-captions can still have recognition errors.

Working with Results

You receive:

- Manifest (when enable_index is on): index of all cached frames by resolution and timestamp. Use this to avoid redundant requests.
- Frames as images: look at them to understand what's happening visually
- Audio transcription with timestamps: read the speech content
- Audio tags: non-speech events (music, sounds, etc.)
- Analysis data: scene changes, silence intervals, motion levels, etc.

Combine all sources to form a complete understanding. Use analysis + transcription to guide where you look visually. The analysis tells you WHEN things happen; the frames tell you WHAT happens.
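
Combining sources might look like the sketch below, which merges speech cues and scene changes into one list of timestamps worth viewing. The transcript shape (a list of dicts with `start` and `text`) and the helper `points_of_interest` are assumptions for illustration; the real result format may differ.

```python
# Cue phrases that suggest the speaker is pointing at something on screen.
CUE_PHRASES = ("look at this", "as you can see", "let me show you")


def points_of_interest(transcript, scene_changes, min_gap_s=3.0):
    """Merge speech cues and scene changes into sorted, de-duplicated timestamps."""
    times = [
        seg["start"]
        for seg in transcript
        if any(cue in seg["text"].lower() for cue in CUE_PHRASES)
    ]
    times.extend(scene_changes)
    merged = []
    for t in sorted(times):
        # Keep a timestamp only if it is far enough from the last one kept.
        if not merged or t - merged[-1] >= min_gap_s:
            merged.append(t)
    return merged
```

Each returned timestamp is a candidate center for a 3-5 second `video_detail` window.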
