video-reader

Video Reader Skill

Extract key frames to "see" videos, and extract + transcribe audio to "hear" them.

Quick Start

```bash
# Get video info (duration, resolution, codec)
ffprobe -v error -show_entries format=duration:stream=codec_name,width,height -of json "$VIDEO_PATH"

# Extract key frames (up to 12, spread evenly, at most 1 per second)
OUTDIR=/tmp/alma-frames-$(date +%s)
mkdir -p "$OUTDIR"
DURATION=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$VIDEO_PATH")
# Rate = 12 / duration, capped at 1fps for short videos; falls back to 1fps if the
# duration is missing or zero (truncating it with `cut` made sub-second clips divide by 0).
# awk keeps enough precision for long videos, where `bc` with scale=2 truncates the
# rate to 0 for anything longer than 1200 s (e.g. 12/1500 -> 0), which ffmpeg rejects.
FPS_RATE=$(awk -v d="$DURATION" 'BEGIN { d += 0; r = (d > 0) ? 12 / d : 1; if (r > 1) r = 1; printf "%.6f", r }')
ffmpeg -hide_banner -loglevel error -i "$VIDEO_PATH" -vf "fps=$FPS_RATE,scale=720:-1" -frames:v 12 "$OUTDIR/frame_%02d.jpg"
ls "$OUTDIR"
```
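The rate rule in the block above (spread the frame budget over the duration, capped at 1 fps) can also be packaged as a small helper so the budget is easy to change. This is a minimal sketch, assuming `awk` is available; `frame_rate_for` is a hypothetical name, not part of the original skill.

```bash
# Hypothetical helper: frames-per-second rate that spreads at most MAX_FRAMES
# frames across a video of DURATION seconds, never sampling faster than 1 fps.
# Falls back to 1 fps when DURATION is missing, zero, or "N/A".
frame_rate_for() {
  local duration="$1" max_frames="${2:-12}"
  awk -v d="$duration" -v m="$max_frames" 'BEGIN {
    d += 0                       # "N/A" or "" becomes 0
    r = (d > 0) ? m / d : 1      # spread the budget over the whole video
    if (r > 1) r = 1             # cap at 1 fps for short videos
    printf "%.6f\n", r
  }'
}
```

For example, `FPS_RATE=$(frame_rate_for "$DURATION" 8)` gives an 8-frame budget that can be dropped into the same `fps=$FPS_RATE,scale=720:-1` filter.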

How to Use

  1. Get video info first to know the duration
  2. For short videos (<15s): extract 1 frame per second
  3. For medium videos (15-60s): extract ~8-12 frames evenly spread
  4. For long videos (>60s): extract 12 frames at key intervals
  5. Look at the extracted frames (they're image files) to describe the video content
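The duration rules in the list can be expressed as a lookup from duration to frame count. This is a rough sketch, assuming `awk` is available. `frame_budget` is a hypothetical helper, and the value 10 for medium-length videos is an assumption picked from the suggested 8-12 range.

```bash
# Hypothetical helper: how many frames to pull for a video of DURATION seconds.
frame_budget() {
  awk -v d="${1:-0}" 'BEGIN {
    d += 0                              # "N/A" or "" becomes 0
    if (d < 15)       n = int(d)        # short: 1 frame per whole second
    else if (d <= 60) n = 10            # medium: within the 8-12 range (assumed 10)
    else              n = 12            # long: 12 frames
    if (n < 1) n = 1                    # sub-second or unknown: still take one frame
    print n
  }'
}
```

Usage: `N=$(frame_budget "$DURATION")`, then spread `N` frames across the video with one of the patterns that follow.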

Frame Extraction Patterns

```bash
# Even spread: N frames across the entire video (N=8 here).
# -count_frames decodes the whole stream to count frames, so this is slow on long videos.
# The step is clamped to at least 1 so clips with fewer than 8 frames don't hit mod(n,0).
STEP=$(ffprobe -v error -count_frames -select_streams v:0 -show_entries stream=nb_read_frames -of csv=p=0 "$VIDEO_PATH" \
  | awk -v n=8 '{ s = int($1 / n); printf "%d", (s < 1) ? 1 : s }')
ffmpeg -hide_banner -loglevel error -i "$VIDEO_PATH" \
  -vf "select='not(mod(n,$STEP))'" \
  -vsync vfr -frames:v 8 "$OUTDIR/frame_%02d.jpg"

# Specific timestamp
ffmpeg -hide_banner -loglevel error -ss 5.0 -i "$VIDEO_PATH" -frames:v 1 "$OUTDIR/at_5s.jpg"

# Thumbnail grid (single image overview): 9 frames tiled 3x3
ffmpeg -hide_banner -loglevel error -i "$VIDEO_PATH" \
  -vf "fps=1/$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$VIDEO_PATH" | awk '{printf "%.3f", $1/9}'),scale=320:-1,tile=3x3" \
  -frames:v 1 "$OUTDIR/grid.jpg"
```
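The even-spread pattern above has to decode the entire video just to count frames. A swapped-in alternative, not from the original skill, is to compute N evenly spaced timestamps from the duration and seek to each one with `-ss`. This is a sketch: the function names `even_timestamps` and `extract_even_by_seek` and the `seek_NN.jpg` file names are hypothetical.

```bash
# Print N timestamps (seconds), one per line: the midpoint of each of N equal segments.
even_timestamps() {
  local n="$1" duration="$2"
  awk -v n="$n" -v d="$duration" 'BEGIN {
    d += 0                                   # "N/A" becomes 0 (all timestamps 0)
    for (i = 0; i < n; i++) printf "%.3f\n", (i + 0.5) * d / n
  }'
}

# Extract one 720px-wide frame at each timestamp into OUTDIR.
extract_even_by_seek() {
  local video="$1" outdir="$2" n="${3:-8}" duration t i=1
  duration=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$video")
  mkdir -p "$outdir"
  for t in $(even_timestamps "$n" "$duration"); do
    # -ss before -i seeks in the input instead of decoding everything up to t
    ffmpeg -hide_banner -loglevel error -ss "$t" -i "$video" \
      -frames:v 1 -vf "scale=720:-1" "$(printf '%s/seek_%02d.jpg' "$outdir" "$i")"
    i=$((i + 1))
  done
}
```

Usage: `extract_even_by_seek "$VIDEO_PATH" "$OUTDIR" 8`. Seeking starts one ffmpeg process per frame but skips the full decode, which tends to pay off on long videos; for short clips the `select` pattern is simpler.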

Tips

  • Always use `-hide_banner -loglevel error` to keep output clean
  • Scale frames down to 720px wide (`scale=720:-1`) to save tokens when they are sent to the model
  • Clean up frames after analysis: `rm -rf "$OUTDIR"`
  • The extracted frames are regular image files; include their paths in your reply and they'll be auto-sent to Telegram

Audio: "Hearing" Videos

```bash
# Extract audio from video (16 kHz mono, 16-bit PCM WAV)
AUDIO="/tmp/alma-audio-$(date +%s).wav"
ffmpeg -hide_banner -loglevel error -i "$VIDEO_PATH" -vn -acodec pcm_s16le -ar 16000 -ac 1 "$AUDIO"

# Transcribe with Whisper (auto-detect language)
whisper "$AUDIO" --model turbo --output_format txt --output_dir /tmp/alma-whisper

# Transcribe with a specific language (here Chinese)
whisper "$AUDIO" --model turbo --language zh --output_format txt --output_dir /tmp/alma-whisper

# Read transcription
cat /tmp/alma-whisper/*.txt
```
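Not every video has an audio track, and with `-vn` ffmpeg exits with an error when there is no audio stream left to write. A minimal guard, assuming the same ffprobe and Whisper setup as above; `has_audio` and `transcribe_if_audio` are hypothetical helper names, not part of the original skill.

```bash
# Succeeds only if VIDEO has at least one audio stream. ffprobe prints the index
# of the first audio stream, or nothing if there is none (or the file is unreadable).
has_audio() {
  [ -n "$(ffprobe -v error -select_streams a:0 -show_entries stream=index -of csv=p=0 "$1" 2>/dev/null)" ]
}

# Extract + transcribe only when there is something to hear.
transcribe_if_audio() {
  local video="$1" audio
  if ! has_audio "$video"; then
    echo "No audio stream in $video; describe the frames only." >&2
    return 1
  fi
  audio="/tmp/alma-audio-$(date +%s).wav"
  ffmpeg -hide_banner -loglevel error -i "$video" -vn -acodec pcm_s16le -ar 16000 -ac 1 "$audio" &&
    whisper "$audio" --model turbo --output_format txt --output_dir /tmp/alma-whisper &&
    cat /tmp/alma-whisper/*.txt
}
```

Usage: `transcribe_if_audio "$VIDEO_PATH" || echo "frames only"`.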

When to See vs Hear

  • "What's in this video?" (这个视频里有啥) → Extract frames (see) + transcribe audio (hear) for the full picture
  • "What did he say?" (他说了什么) → Transcribe audio only
  • "Does this video look good?" (这个视频好看吗) → Extract frames to see the visuals
  • "Sounds nice" (好听) → The user is commenting on the audio; transcribe it to understand what they heard
  • Music/street performance → Describe what you see in the frames and note the audio content
  • When in doubt, do BOTH: extract a few frames AND transcribe the audio
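The routing rules above are judgment calls for the agent, but their shape can be sketched as a keyword lookup. This is purely illustrative: `choose_mode` and its patterns are hypothetical, cover only the example phrases, and fall back to doing both, matching the "when in doubt" rule.

```bash
# Hypothetical router: prints "see", "hear", or "both" for a user request.
choose_mode() {
  case "$1" in
    *说了什么*|*[Ww]"hat did"*[Ss]ay*) echo hear ;;   # "What did he say?"
    *好听*)                              echo hear ;;   # commenting on the audio
    *好看*|*"look good"*)                echo see  ;;   # asking about the visuals
    *)                                   echo both ;;   # "What's in this video?" or unclear
  esac
}
```

Usage: `MODE=$(choose_mode "$USER_MESSAGE")`, where `USER_MESSAGE` is a hypothetical variable holding the request text.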

Cleanup

Always clean up temp files after analysis:

```bash
rm -rf "$OUTDIR" /tmp/alma-whisper /tmp/alma-audio*.wav
```
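One way to make cleanup happen even when a step fails is an `EXIT` trap. This is a sketch, not the original skill's approach. `run_video_analysis` is a hypothetical wrapper that creates the frame directory with `mktemp -d` instead of the timestamped name used above, exports it as `OUTDIR`, runs the given command, and removes the temp files when it returns.

```bash
# Hypothetical wrapper: runs "$@" with a fresh $OUTDIR and always cleans up.
# The body uses ( ... ) instead of { ... }, so it runs in a subshell and the
# EXIT trap fires when that subshell finishes, even if the command fails.
run_video_analysis() (
  OUTDIR=$(mktemp -d /tmp/alma-frames-XXXXXX) || exit 1
  export OUTDIR
  trap 'rm -rf "$OUTDIR" /tmp/alma-whisper /tmp/alma-audio*.wav' EXIT
  "$@"
)
```

Because everything is deleted when the wrapper returns, the command you pass must finish its analysis, or copy out what it needs, inside the wrapper. For example, `run_video_analysis ./analyze.sh "$VIDEO_PATH"`, where `analyze.sh` is a hypothetical script that writes its frames into `$OUTDIR`.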