video-perception
Video Perception
You have access to video understanding tools via the claude-video-vision MCP server.
Available Tools
- video_analyze: Analyze video structure with ffmpeg filters (scene changes, silence, motion, etc.). Use this BEFORE extracting frames to plan your strategy.
- video_watch: Extract frames and process audio from a video. Supports variable FPS/resolution per segment.
- video_detail: Drill into specific segments. Separates extraction from viewing: extract many frames, view few at a time.
- video_info: Get video metadata without processing.
- video_configure: Change settings (backend, resolution, enable_index, etc.).
- video_setup: Check/install dependencies.
Workflow
IMPORTANT: You MUST follow these steps in order. Do NOT skip step 2.
1. Always start with video_info to get duration, resolution, and audio presence. If the user gives a YouTube URL, pass the URL directly as path. The MCP server downloads it with yt-dlp, prefers YouTube subtitles/auto-captions for transcription, and falls back to the configured audio backend only when captions are missing, empty, or suspiciously incomplete.
2. REQUIRED for videos > 30s: Call video_analyze BEFORE extracting any frames. This is NOT optional: it gives you structural data to make smart extraction decisions. Select filters relevant to the user's question:

   | User intent | Filters to select |
   |---|---|
   | "What happens in this video?" | scene_changes, silence, transcription |
   | "Find the scene transitions" | scene_changes, black_intervals |
   | "Are there frozen/stuck parts?" | freeze, blur |
   | "Is this a talking head or action?" | motion |
   | "When does the music start?" | silence, loudness |
   | "Analyze the lighting" | exposure |
   | "Summarize this lecture" | transcription, scene_changes, silence |
   | General / unclear intent | scene_changes, silence, transcription |

   Always include transcription: true when the video has audio; the transcription tells you WHERE to look visually.
3. Use the analysis results and transcription to plan your frame extraction strategy:
   - Low FPS (0.1-0.5) for static or predictable segments
   - Higher FPS (1-3) only around scene changes, motion peaks, or moments referenced in speech ("look at this", "as you can see", "let me show you")
   - Never exceed the minimum FPS needed for the task
   - Prefer fewer segments at lower FPS; you can always drill deeper
4. Call video_watch to extract frames:
   - For short videos (< 2 minutes): Use fps: "auto" without view_sample. Short videos need full coverage to avoid missing brief moments, and the auto FPS already adapts to duration.
   - For long videos (> 2 minutes): Use segments based on analysis data with variable FPS, and view_sample to limit initial frame count. You can always drill deeper with video_detail.
5. Use video_detail to drill into specific moments:
   - Start with 3-5 second windows around points of interest
   - Use view_sample: 3 to preview (first, middle, last frame)
   - Then request specific timestamps with view if you need more detail
   - Expand the window only if the initial view is insufficient
   - Treat frame viewing like a binary search: narrow down to what matters
   - Never view all extracted frames at once
6. When the user asks follow-up questions about the same video, consult the manifest already in your context. Do not re-extract frames you already have at the same resolution. Do not re-request frames you already have in context.
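
The planning step above (low FPS by default, higher FPS only around scene changes) can be sketched roughly as follows. This is an illustrative helper, not part of the MCP server: the function name plan_segments, the 2-second window, and the segment keys start_time, end_time, and fps are assumptions based on the parameter names mentioned in this document, and the actual segments schema may differ.

```python
from typing import Dict, List


def plan_segments(
    duration: float,
    scene_changes: List[float],
    window: float = 2.0,
    low_fps: float = 0.25,
    high_fps: float = 2.0,
) -> List[Dict[str, float]]:
    """Build a variable-FPS plan: high FPS near scene changes, low FPS elsewhere.

    The dict keys are hypothetical; adapt them to the real segments schema.
    """
    # Clamp a window around each scene change to the video bounds.
    hot = sorted(
        (max(0.0, t - window), min(duration, t + window)) for t in scene_changes
    )

    # Merge overlapping windows so no time range is extracted twice.
    merged: List[List[float]] = []
    for start, end in hot:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    # Fill the gaps between high-FPS windows with low-FPS segments.
    segments: List[Dict[str, float]] = []
    cursor = 0.0
    for start, end in merged:
        if start > cursor:
            segments.append({"start_time": cursor, "end_time": start, "fps": low_fps})
        segments.append({"start_time": start, "end_time": end, "fps": high_fps})
        cursor = end
    if cursor < duration:
        segments.append({"start_time": cursor, "end_time": duration, "fps": low_fps})
    return segments


# Example: a 60-second video with scene changes at 10s, 11s, and 40s.
plan = plan_segments(duration=60.0, scene_changes=[10.0, 11.0, 40.0])
```

Merging the windows first keeps the segment count small, which matches the advice to prefer fewer segments and drill deeper later with video_detail.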
Parameter Guide
fps: "auto" for general overview. Use the video's original fps (from video_info) for frame-by-frame detail. Use 5-10 for analyzing specific short moments. Use 0.1-0.5 for long videos.
resolution: 256-512 for quick scans. 512-768 for normal analysis. 1024+ when reading on-screen text or fine details.
segments: Use when you have analysis data. Each segment can have its own fps and resolution. Overrides global fps/start_time/end_time.
view_sample: Returns N evenly spaced frames from the extracted set. Use this to avoid flooding context with too many images.
skip_audio: Set to true when you only need visual analysis.
YouTube URLs: Pass supported YouTube URLs directly as path. Treat transcription_source: "youtube_subtitles" as stronger than youtube_auto_captions; auto-captions can still have recognition errors.
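
To make the view_sample behavior concrete, here is a minimal sketch of picking N evenly spaced items from a list of extracted frame timestamps. The function sample_evenly is hypothetical and only illustrates the idea; how the server actually selects frames is not specified in this document.

```python
from typing import List


def sample_evenly(timestamps: List[float], n: int) -> List[float]:
    """Pick n evenly spaced items; for n >= 2 the first and last are included."""
    if n <= 0 or not timestamps:
        return []
    if n >= len(timestamps):
        return list(timestamps)
    if n == 1:
        # With a single sample, take the middle item.
        return [timestamps[len(timestamps) // 2]]
    last = len(timestamps) - 1
    # Integer arithmetic avoids float rounding surprises.
    indices = [i * last // (n - 1) for i in range(n)]
    return [timestamps[i] for i in indices]


# Ten frames extracted at 1 fps; preview three of them.
frames = [float(t) for t in range(10)]
preview = sample_evenly(frames, 3)
```

With three samples this yields a first, middle, and last frame, which is the preview pattern recommended for video_detail.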
Working with Results
You receive:
- Manifest (when enable_index is on) — index of all cached frames by resolution and timestamp. Use this to avoid redundant requests.
- Frames as images — look at them to understand what's happening visually
- Audio transcription with timestamps — read the speech content
- Audio tags — non-speech events (music, sounds, etc.)
- Analysis data — scene changes, silence intervals, motion levels, etc.
Combine all sources to form a complete understanding. Use analysis + transcription to guide where you look visually. The analysis tells you WHEN things happen; the frames tell you WHAT happens.
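
As one way to let the transcription guide where to look, the sketch below scans a timestamped transcript for spoken cues and turns each hit into a short window to inspect. The transcript shape (a list of (start_seconds, text) pairs), the function name cue_windows, and the 2-second half width are assumptions for illustration; the real transcription output format may differ.

```python
from typing import List, Tuple

# Cue phrases taken from the workflow guidance in this document.
CUE_PHRASES = ("look at this", "as you can see", "let me show you")


def cue_windows(
    transcript: List[Tuple[float, str]],
    duration: float,
    half_width: float = 2.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) windows around transcript entries that contain a cue."""
    windows: List[Tuple[float, float]] = []
    for start, text in transcript:
        lowered = text.lower()
        if any(phrase in lowered for phrase in CUE_PHRASES):
            # Clamp the window to the video bounds.
            windows.append(
                (max(0.0, start - half_width), min(duration, start + half_width))
            )
    return windows


transcript = [
    (1.0, "As you can see, the chart rises"),
    (30.0, "Welcome back"),
    (59.0, "Let me show you the result"),
]
windows = cue_windows(transcript, duration=60.0)
```

Each window is at most 4 seconds wide, in line with the suggestion to start with 3-5 second windows around points of interest; windows clamped at the start or end of the video come out shorter.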