video-perception

Video Perception

You have access to video understanding tools via the claude-video-vision MCP server.

Available Tools

  • video_analyze: Analyze video structure with ffmpeg filters (scene changes, silence, motion, etc.). Use this BEFORE extracting frames to plan your strategy.
  • video_watch: Extract frames and process audio from a video. Supports variable FPS/resolution per segment.
  • video_detail: Drill into specific segments. Separates extraction from viewing: extract many frames, view a few at a time.
  • video_info: Get video metadata without processing.
  • video_configure: Change settings (backend, resolution, enable_index, etc.).
  • video_setup: Check/install dependencies.

Workflow

IMPORTANT: You MUST follow these steps in order. Do NOT skip step 2.
  1. Always start with video_info to get duration, resolution, and audio presence. If the user gives a YouTube URL, pass the URL directly as path. The MCP server downloads it with yt-dlp, prefers YouTube subtitles/auto-captions for transcription, and falls back to the configured audio backend only when captions are missing, empty, or suspiciously incomplete.
  2. REQUIRED for videos > 30s: Call video_analyze BEFORE extracting any frames. This is NOT optional: it gives you structural data to make smart extraction decisions. Select filters relevant to the user's question:

    User intent                            Filters to select
    "What happens in this video?"          scene_changes, silence, transcription
    "Find the scene transitions"           scene_changes, black_intervals
    "Are there frozen/stuck parts?"        freeze, blur
    "Is this a talking head or action?"    motion
    "When does the music start?"           silence, loudness
    "Analyze the lighting"                 exposure
    "Summarize this lecture"               transcription, scene_changes, silence
    General / unclear intent               scene_changes, silence, transcription

    Always include transcription: true when the video has audio; the transcription tells you WHERE to look visually.
  3. Use the analysis results and transcription to plan your frame extraction strategy:
    • Low FPS (0.1-0.5) for static or predictable segments
    • Higher FPS (1-3) only around scene changes, motion peaks, or moments referenced in speech ("look at this", "as you can see", "let me show you")
    • Never exceed the minimum FPS needed for the task
    • Prefer fewer segments at lower FPS — you can always drill deeper
  4. Call video_watch to extract frames:
    • For short videos (< 2 minutes): Use fps: "auto" without view_sample. Short videos need full coverage to avoid missing brief moments, and the auto FPS already adapts to duration.
    • For long videos (> 2 minutes): Use segments based on analysis data with variable FPS, and view_sample to limit the initial frame count. You can always drill deeper with video_detail.
  5. Use video_detail to drill into specific moments:
    • Start with 3-5 second windows around points of interest
    • Use view_sample: 3 to preview (first, middle, last frame)
    • Then request specific timestamps with view if you need more detail
    • Expand the window only if the initial view is insufficient
    • Treat frame viewing like a binary search: narrow down to what matters
    • Never view all extracted frames at once
  6. When the user asks follow-up questions about the same video, consult the manifest already in your context. Do not re-extract frames you already have at the same resolution. Do not re-request frames you already have in context.
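
The branching in steps 3 and 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name plan_watch_args, the segment keys (start_time, end_time, fps), the 2-second window around scene changes, and the view_sample value of 8 are all hypothetical and not taken from the claude-video-vision server's actual schema.

```python
from typing import Any

SHORT_VIDEO_SECONDS = 120  # step 4: under 2 minutes counts as a short video


def plan_watch_args(
    path: str,
    duration: float,
    scene_changes: list[float],
    window: float = 2.0,    # assumed padding (seconds) around each scene change
    low_fps: float = 0.2,   # within the 0.1-0.5 range for static segments
    high_fps: float = 2.0,  # within the 1-3 range for points of interest
) -> dict[str, Any]:
    """Build hypothetical video_watch arguments from video_info and video_analyze data."""
    if duration < SHORT_VIDEO_SECONDS:
        # Short video: full coverage with auto FPS and no view_sample.
        return {"path": path, "fps": "auto"}

    # Merge overlapping high-FPS windows around scene changes.
    windows: list[list[float]] = []
    for t in sorted(scene_changes):
        start, end = max(0.0, t - window), min(duration, t + window)
        if windows and start <= windows[-1][1]:
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])

    # Cover the gaps between windows with low-FPS segments.
    segments: list[dict[str, float]] = []
    cursor = 0.0
    for start, end in windows:
        if start > cursor:
            segments.append({"start_time": cursor, "end_time": start, "fps": low_fps})
        segments.append({"start_time": start, "end_time": end, "fps": high_fps})
        cursor = end
    if cursor < duration:
        segments.append({"start_time": cursor, "end_time": duration, "fps": low_fps})

    # view_sample (assumed value) limits how many frames are viewed initially.
    return {"path": path, "segments": segments, "view_sample": 8}
```

For a 300-second video with scene changes at 100, 101, and 299 seconds, this yields four segments: low FPS for 0-98, high FPS for 98-103, low FPS for 103-297, and high FPS for 297-300. Moments that still look unclear would then go to video_detail, as in step 5.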

Parameter Guide

fps: "auto" for general overview. Use the video's original fps (from video_info) for frame-by-frame detail. Use 5-10 for analyzing specific short moments. Use 0.1-0.5 for long videos.
resolution: 256-512 for quick scans. 512-768 for normal analysis. 1024+ when reading on-screen text or fine details.
segments: Use when you have analysis data. Each segment can have its own fps and resolution. Overrides global fps/start_time/end_time.
view_sample: Returns N evenly spaced frames from the extracted set. Use this to avoid flooding context with too many images.
skip_audio: Set to true when you only need visual analysis.
YouTube URLs: Pass supported YouTube URLs directly as path. Treat transcription_source: "youtube_subtitles" as more reliable than transcription_source: "youtube_auto_captions"; auto-captions can still have recognition errors.
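
To make the view_sample behaviour described above concrete, here is a minimal sketch of one way to pick N evenly spaced frames from an extracted set. The server's actual selection rule is not documented here, so the function evenly_spaced_indices and its rounding behaviour are assumptions for illustration only.

```python
def evenly_spaced_indices(total_frames: int, n: int) -> list[int]:
    """Pick n evenly spaced frame indices, including the first and last frame when n >= 2."""
    if total_frames <= 0 or n <= 0:
        return []
    if n >= total_frames:
        # Fewer frames than requested: return every frame.
        return list(range(total_frames))
    if n == 1:
        # A single sample: this sketch simply takes the first frame.
        return [0]
    # Integer arithmetic keeps the first index at 0 and the last at total_frames - 1.
    return [i * (total_frames - 1) // (n - 1) for i in range(n)]
```

With 11 extracted frames and n set to 3, this selects indices 0, 5, and 10, which matches the "first, middle, last" preview that view_sample: 3 is described as giving.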

Working with Results

You receive:
  • Manifest (when enable_index is on) — index of all cached frames by resolution and timestamp. Use this to avoid redundant requests.
  • Frames as images — look at them to understand what's happening visually
  • Audio transcription with timestamps — read the speech content
  • Audio tags — non-speech events (music, sounds, etc.)
  • Analysis data — scene changes, silence intervals, motion levels, etc.
Combine all sources to form a complete understanding. Use analysis + transcription to guide where you look visually. The analysis tells you WHEN things happen; the frames tell you WHAT happens.
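
As a rough illustration of using transcription to guide where to look, the sketch below scans a timestamped transcript for spoken visual cues and turns each hit into a short window suitable for a video_detail call. The transcript shape (a list of dicts with "start" and "text" keys), the function name detail_windows, and the 4-second default width are assumptions, not the server's documented output format.

```python
CUE_PHRASES = ("look at this", "as you can see", "let me show you")


def detail_windows(
    transcript: list[dict],
    duration: float,
    width: float = 4.0,  # within the 3-5 second starting window
) -> list[tuple[float, float]]:
    """Return (start, end) windows centered on transcript lines that contain a visual cue."""
    half = width / 2
    windows = []
    for entry in transcript:
        text = entry["text"].lower()
        if any(phrase in text for phrase in CUE_PHRASES):
            # Clamp the window to the bounds of the video.
            start = max(0.0, entry["start"] - half)
            end = min(duration, entry["start"] + half)
            windows.append((start, end))
    return windows
```

Each returned window is a candidate for a preview with view_sample: 3; widen it only if that first look is not enough.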