watch-cli-video-agent


watch-cli Video Agent Skill


Skill by ara.so — Devtools Skills collection.

What It Does

`watch-cli` gives AI agents the ability to "watch" social videos by decomposing them into frames (JPGs) plus an audio transcript, avoiding expensive multimodal video APIs. It works with YouTube, X/Twitter, LinkedIn, TikTok, Reddit, Vimeo, and Facebook, and is roughly 50× cheaper and 5× faster than calling a multimodal LLM on the full video.

The insight: videos are just frames + audio. Extract both, read them separately with vision and text LLMs, and you get full video understanding without burning a video model on every frame.

Installation


Quick install (symlinks to ~/.local/bin)



**Dependencies** (installer checks, but install manually if needed):
- `yt-dlp` — video downloader
- `ffmpeg` — frame/audio extraction
- `jq`, `curl`, `python3` — utilities

macOS:
```bash
brew install yt-dlp ffmpeg jq
```

Debian/Ubuntu:
```bash
sudo apt install yt-dlp ffmpeg jq python3 curl
```


Configuration

Set your Kyma API key (or bring your own Groq/Google keys):
```bash
export KYMA_API_KEY=kyma-xxxxxxxx
```
Get a free key at https://kymaapi.com (includes ~1 hour of free transcription credit).

Alternative: to bring your own keys, create a `.env` file:
```bash
GROQ_API_KEY=gsk_...
GOOGLE_AI_KEY=AIza...
```

Core Commands

watch: Full video analysis (orchestrator)

```bash
watch <url> [frame-count] [--cookies <file>]
```
Downloads the video, extracts frames, and transcribes the audio, then returns everything in one structured block.

Example:
```bash
watch https://www.youtube.com/watch?v=dQw4w9WgXcQ
```
Output structure:
```text
VIDEO: /tmp/dl-video/abc123.mp4
DURATION: 218
FRAMES:
  /tmp/frames_abc123/frame_01.jpg
  /tmp/frames_abc123/frame_02.jpg
  /tmp/frames_abc123/frame_03.jpg
  ...
TRANSCRIPT:
  Today I want to talk about how decomposition unlocks 10× cost reduction...
```
With a custom frame count (default is 8):
```bash
watch https://twitter.com/user/status/12345 16
```
For login-walled content (LinkedIn, private X, Facebook):
```bash
watch https://www.linkedin.com/posts/someone_activity-12345 --cookies ~/cookies.txt
```
(Auto-detects browser cookies if you are signed in; see Troubleshooting for manual export.)
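When a manually exported file is passed with `--cookies`, a malformed export is a common cause of failed downloads. The sketch below checks the file header before calling `watch`. It assumes the file is handed on to `yt-dlp`, which expects a Netscape-format cookie file whose first line is `# Netscape HTTP Cookie File` or `# HTTP Cookie File`. The helper name `check_cookies_file` is hypothetical and not part of watch-cli.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of watch-cli): sanity-check a cookies file
# before passing it to `watch <url> --cookies <file>`.
check_cookies_file() {
  local file="$1" first_line=""
  if [ ! -r "$file" ]; then
    echo "missing"
    return 1
  fi
  # Read the first line and drop a trailing carriage return (Windows exports).
  IFS= read -r first_line < "$file" || true
  first_line="${first_line%$'\r'}"
  case "$first_line" in
    "# Netscape HTTP Cookie File"|"# HTTP Cookie File")
      echo "ok" ;;
    *)
      echo "bad-header"
      return 1 ;;
  esac
}
```

A possible use is `check_cookies_file ~/cookies.txt && watch <url> --cookies ~/cookies.txt`, so that a JSON export or an empty file is caught before any download is attempted.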

dl-video: Download only

```bash
dl-video <url> [out-dir] [--cookies <file>]
```
Downloads the video only and returns the local mp4 path.
```bash
VIDEO_PATH=$(dl-video https://www.youtube.com/watch?v=dQw4w9WgXcQ /tmp/videos)
echo "Downloaded to: $VIDEO_PATH"
```

extract-frames: Frame extraction

```bash
extract-frames <video> [count] [out-dir]
```
Extracts N evenly spaced JPG frames from a video.
```bash
extract-frames video.mp4 12 /tmp/my-frames
# Returns:
# /tmp/my-frames/frame_01.jpg
# /tmp/my-frames/frame_02.jpg
# ...
```
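"Evenly spaced" can be read in more than one way. The sketch below shows one plausible interpretation: sampling at the midpoint of each of N equal segments, which avoids the often-black first and last frames. How `extract-frames` actually picks its timestamps is not stated here, so treat the formula and the function name `frame_timestamps` as assumptions for illustration only.

```bash
#!/usr/bin/env bash
# Hypothetical illustration (not watch-cli's actual logic): print N sample
# timestamps, in seconds, at the midpoints of N equal segments of a video.
frame_timestamps() {
  local duration="$1" count="$2"
  awk -v d="$duration" -v n="$count" 'BEGIN {
    for (i = 0; i < n; i++) printf "%.2f\n", d * (i + 0.5) / n
  }'
}

# Example: 8 sample points for a 218-second video.
frame_timestamps 218 8
```

With this scheme, doubling the frame count halves the gap between samples, which is why fast-cut videos benefit from higher counts.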

transcribe: Speech-to-text

```bash
transcribe <audio-or-video> [language]
```
Auto-extracts audio from video if needed, then transcribes it using Whisper Large v3 Turbo via Kyma.
```bash
transcribe video.mp4
# or
transcribe audio.mp3 en
```

**Output**: Plain text transcript.

audio-q: Audio scene Q&A

```bash
audio-q <audio-or-video> "<question>"
```
Goes beyond transcription: answers questions about tone, music, sound effects, language, and emotion using a multimodal audio model (Gemini 3 Flash audio).
```bash
audio-q video.mp4 "What is the speaker's emotional tone? Is there background music?"
```

models: List available models

```bash
models           # Audio models only
models --all     # All Kyma models (text, image, video, audio)
```
Shows the live model list from the Kyma API. The `transcribe` and `audio-understand` aliases auto-update when Kyma swaps the underlying models.

Frame Count Guidance

| Video Type | Recommended Frames |
| --- | --- |
| Short tweet/clip (<2 min) | 4–8 (default) |
| Tutorial/talk (5–20 min) | 8–16 |
| Lecture (20–60 min) | 16–24 |
| Conference talk (>1 hr) | 24–32 |
| Fast-cut UI demo | Double the recommendation |
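The guidance above can be turned into a small lookup keyed on the `DURATION` value that `watch` reports. This is a sketch under assumptions: it picks the upper end of each recommended range, treats the 2–5 minute gap in the guidance as a short clip, and uses a hypothetical function name, `recommend_frames`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a duration in whole seconds to a frame count.
# Assumptions: upper end of each range; the 2-5 min gap counts as "short".
recommend_frames() {
  local duration="$1" fast_cut="${2:-no}" frames
  if   [ "$duration" -lt 300 ];  then frames=8    # short clip
  elif [ "$duration" -lt 1200 ]; then frames=16   # tutorial/talk
  elif [ "$duration" -le 3600 ]; then frames=24   # lecture
  else                                frames=32   # conference talk
  fi
  # Fast-cut content: double the recommendation.
  if [ "$fast_cut" = "yes" ]; then
    frames=$((frames * 2))
  fi
  echo "$frames"
}
```

For example, `recommend_frames 1847` could supply the second argument to `watch` for a roughly 30-minute lecture, and `recommend_frames 600 yes` the count for a fast-cut 10-minute demo.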

Patterns for AI Agents


Pattern 1: Full video understanding

User: "Analyze this video and tell me what it's about"

Parse the output:
- `VIDEO:` line → path to the mp4
- `FRAMES:` block → list of jpg paths (read each as an image)
- `TRANSCRIPT:` block → full text (read as text)

Now you have frames + transcript to reason about.
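Parsing the output can be sketched with a small `awk` filter. It assumes the layout shown in the output structure earlier: section headers start at column 0, and the `FRAMES` and `TRANSCRIPT` bodies are indented. The function name `watch_section` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one section of `watch` output read from stdin.
# Usage: watch_section VIDEO|DURATION|FRAMES|TRANSCRIPT < output.txt
watch_section() {
  awk -v want="$1" '
    /^[A-Z]+:/ {                      # header such as "FRAMES:" or "VIDEO: /path"
      name = $0;  sub(/:.*/, "", name)
      value = $0; sub(/^[A-Z]+:[ \t]*/, "", value)
      current = name
      if (current == want && value != "") print value
      next
    }
    current == want {                 # indented body line of the wanted section
      sub(/^[ \t]+/, "")
      if ($0 != "") print
    }
  '
}

# Demo on a shortened sample: prints the two frame paths, one per line.
watch_section FRAMES <<'EOF'
VIDEO: /tmp/dl-video/abc123.mp4
DURATION: 218
FRAMES:
  /tmp/frames_abc123/frame_01.jpg
  /tmp/frames_abc123/frame_02.jpg
TRANSCRIPT:
  Today I want to talk about decomposition.
EOF
```

Each frame path can then be read as an image and the transcript as plain text.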

Pattern 2: Code tutorial → working implementation

User: "Watch this coding tutorial and implement the project"

Read frames to see:
- File structure screenshots
- Code snippets on screen
- Terminal commands

Read transcript for:
- Verbal explanations
- Step-by-step instructions

Combine both to reconstruct the full implementation.

Pattern 3: Architecture extraction

User: "Extract the system architecture from this talk"

Look for frames with:
- Diagrams (boxes, arrows)
- Service names
- Data flow illustrations

Use the transcript to identify component relationships, then generate a Mermaid/PlantUML diagram.

Pattern 4: UI/UX cloning

User: "Clone the interface shown in this demo"

Frames show UI states:
- Layout structure
- Color scheme
- Interactive elements

Transcript reveals:
- Interaction patterns
- Animation timing

Generate a React/HTML+CSS implementation.

Pattern 5: Audio-only analysis

```bash
# For podcasts, music, or unclear audio:
TRANSCRIPT=$(transcribe podcast.mp3)
SCENE=$(audio-q podcast.mp3 "Describe the audio scene: tone, music, number of speakers, emotion")
# Combine both for full audio understanding
```

Prompt Templates (Copy-Paste for Agents)

Located in the `prompts/` directory; paste one above the `watch` output:
- `implement-from-video.md`: Coding walkthrough → working project
- `extract-architecture.md`: System talk → architecture diagram
- `clone-ux.md`: UI demo → React component
- `paper-to-code.md`: Research talk → runnable notebook
- `tutorial-walkthrough.md`: Long tutorial → cheat sheet

Usage:
```bash
# 1. Watch the video
watch <url> > output.txt

# 2. Prepend the prompt template
cat prompts/implement-from-video.md output.txt > full-context.txt

# 3. Feed to agent (already done if you're the agent!)
```
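The prepend step can be wrapped in a small function that fails early when a file is missing, rather than silently producing a context file with no template. This is a sketch: `build_context` is a hypothetical name, and the blank separator line between template and output is a design choice, not something watch-cli requires.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write "<template>, blank line, <watch output>" to dest.
build_context() {
  local template="$1" watch_output="$2" dest="$3"
  if [ ! -r "$template" ] || [ ! -r "$watch_output" ]; then
    echo "build_context: missing input file" >&2
    return 1
  fi
  # Template first, then a blank separator line, then the watch output.
  { cat "$template"; echo; cat "$watch_output"; } > "$dest"
}
```

A call such as `build_context prompts/implement-from-video.md output.txt full-context.txt` produces the same result as the plain `cat`, plus the separator, and returns a non-zero status if either input cannot be read.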

Cost Estimates

| Video Length | Transcribe Cost |
| --- | --- |
| 5 minutes | ~$0.003 |
| 1 hour | ~$0.04 |
| 2 hours | ~$0.08 |

Frame extraction is local (ffmpeg) and free. Free Kyma credit covers ~25 hours of audio.

Comparison: a multimodal video API on a 1-hour video ≈ $5; `watch-cli` ≈ $0.10 (~50× cheaper).
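The transcription figures scale linearly with duration, so a rough estimate can be computed from the `DURATION` value. The sketch below hard-codes the approximate rate quoted in this section (~$0.04 per hour of audio). Actual pricing may differ, and `estimate_transcribe_cost` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimate transcription cost in USD from seconds of audio.
# Assumes the approximate rate quoted in this document: ~$0.04 per hour.
estimate_transcribe_cost() {
  local duration_s="$1"
  awk -v s="$duration_s" 'BEGIN { printf "%.4f\n", s / 3600 * 0.04 }'
}

# Example: the 1847-second tutorial from the agent usage example.
estimate_transcribe_cost 1847
```

This covers transcription only. Frame extraction is local, and any vision-model cost for reading the frames is not included.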

Troubleshooting


Login-walled videos (LinkedIn, private X, Facebook)

Auto-detection (usually works):
1. Sign into the platform in your browser (Chrome, Firefox, Safari, Edge, Brave)
2. Run `watch <url>`; it finds the cookies automatically

Manual cookie export (for servers/CI):
1. Install the browser extension "Get cookies.txt LOCALLY" (Chrome/Firefox)
2. Visit the platform while signed in
3. Click the extension → Export cookies.txt
4. Save to `~/cookies.txt`
5. Run: `watch <url> --cookies ~/cookies.txt`

Full guide: `docs/cookies.md` in the repo.

"Region-locked video" error

`yt-dlp` can't download region-restricted content. Workarounds:
- Use a VPN to reach the target region
- Pass `--cookies` exported from a browser with the VPN active

"Audio file too large" error

The transcription provider has a 25 MB audio limit. For 2+ hour videos:
```bash
# Split video first
ffmpeg -i long-video.mp4 -ss 00:00:00 -to 01:00:00 -c copy part1.mp4
ffmpeg -i long-video.mp4 -ss 01:00:00 -c copy part2.mp4

# Transcribe separately
transcribe part1.mp4 > transcript1.txt
transcribe part2.mp4 > transcript2.txt
cat transcript1.txt transcript2.txt > full-transcript.txt
```
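For videos longer than two parts, the split points can be generated rather than typed by hand. The sketch below is a dry run: it only prints `ffmpeg` commands in the same `-ss`/`-to`/`-c copy` form as above, using plain seconds, and executes nothing. The one-hour default chunk mirrors the manual example. Whether a chunk stays under the 25 MB limit depends on the extracted audio's bitrate, so that default is an assumption. `print_split_commands` is a hypothetical name, and file names are assumed to contain no spaces.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper: print ffmpeg commands that split a video into
# chunks of at most chunk_s seconds. Nothing is executed.
print_split_commands() {
  local input="$1" duration_s="$2" chunk_s="${3:-3600}"
  local start=0 part=1 end
  while [ "$start" -lt "$duration_s" ]; do
    end=$((start + chunk_s))
    # Clamp the final chunk to the end of the video.
    if [ "$end" -gt "$duration_s" ]; then
      end="$duration_s"
    fi
    echo "ffmpeg -i $input -ss $start -to $end -c copy part${part}.mp4"
    start="$end"
    part=$((part + 1))
  done
}

# Example: a 2.5-hour (9000-second) video yields three commands.
print_split_commands long-video.mp4 9000
```

After reviewing the printed commands, they can be piped to `bash`, and each resulting part transcribed separately as shown above.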

Empty transcript (silent video)

For screencasts with no speech:
1. Increase the frame count: `watch <url> 24`
2. Use `audio-q` to describe any sound design: `audio-q video.mp4 "Are there any UI sounds, clicks, or ambient audio?"`
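An agent can detect this case before deciding to rerun with more frames. The sketch below assumes that a silent video yields a `TRANSCRIPT:` header with no text after it and that `TRANSCRIPT` is the last block in the output, as in the sample output structure. How `watch` actually reports silence is not documented here, and `has_transcript` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: exit 0 if the TRANSCRIPT block read from stdin contains
# any non-blank text, exit 1 otherwise.
# Assumes TRANSCRIPT is the last block of the `watch` output.
has_transcript() {
  awk '
    /^TRANSCRIPT:/ { in_t = 1; sub(/^TRANSCRIPT:/, "") }
    in_t && /[^ \t]/ { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}
```

A possible use is `has_transcript < output.txt || watch <url> 24`, which falls back to a higher frame count only when no speech was found.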

Fast-cut content missing key moments

The default of 8 frames won't catch rapid edits. Solution:
```bash
watch <url> 32  # 4× the default
```

Models not updating

`transcribe` and `audio-q` use Kyma aliases that auto-update. To see the current models:
```bash
models
```
If you want to pin a specific model version, edit the script and replace the alias with a model ID from `models --all`.

Advanced: Using as Claude Code Skill

Copy the pre-built skill into Claude's skill directory:
```bash
mkdir -p ~/.claude/skills
cp -r skills/watch-cli ~/.claude/skills/
```
Now `/watch <url>` becomes a first-class command in Claude Code, with the prompt library auto-attached.

Environment Variables Reference

```bash
# Primary (Kyma unified API)
export KYMA_API_KEY=kyma-xxxxxxxx

# Alternative (bring-your-own-keys)
export GROQ_API_KEY=gsk_...     # Whisper transcription
export GOOGLE_AI_KEY=AIza...    # Gemini audio understanding
```
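Before running a long job it can help to confirm which credentials are visible in the current shell. The sketch below only inspects the three variables listed above. The precedence (Kyma first, then bring-your-own keys) is an assumption, not documented watch-cli behavior, and `detect_provider` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical check (not part of watch-cli): report which API credentials
# are set. The Kyma-first precedence is an assumption.
detect_provider() {
  if [ -n "${KYMA_API_KEY:-}" ]; then
    echo "kyma"
  elif [ -n "${GROQ_API_KEY:-}" ] && [ -n "${GOOGLE_AI_KEY:-}" ]; then
    echo "byo-keys"
  elif [ -n "${GROQ_API_KEY:-}" ] || [ -n "${GOOGLE_AI_KEY:-}" ]; then
    echo "partial"   # only one of the two bring-your-own keys is set
  else
    echo "none"
  fi
}
```

A result of `partial` suggests that either transcription or audio Q&A would lack a key, given the comments on the two variables above.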

Real Agent Usage Example

User asks: "Watch this video and build the project shown"

Agent receives:
```text
VIDEO: /tmp/dl-video/abc123.mp4
DURATION: 1847
FRAMES:
  /tmp/frames_abc123/frame_01.jpg  # Shows project folder structure
  /tmp/frames_abc123/frame_02.jpg  # Shows package.json
  /tmp/frames_abc123/frame_03.jpg  # Shows main App.tsx code
  /tmp/frames_abc123/frame_04.jpg  # Shows terminal: npm install
  ...
TRANSCRIPT:
  Okay so first we're going to set up a new React project. Create a folder called my-app, then run npm init. Now let's install these dependencies...
```

Agent reads:
- frame_01.jpg → sees folder structure → creates directories
- frame_02.jpg → reads package.json → writes dependencies
- frame_03.jpg → reads code on screen → implements App.tsx
- transcript → fills in verbal instructions for parts not shown

Result: a working project matching the tutorial.

Key Takeaways for Agents

1. Always use `watch` for the full pipeline (unless you only need one piece)
2. Read frames as images: they contain visual info the transcript can't capture
3. Read the transcript as text: it contains verbal explanations the frames don't show
4. Combine both: frames + transcript = full video understanding
5. Adjust the frame count based on video type (fast-cut → more frames)
6. For login-walled content, `watch` auto-detects browser cookies; fall back to `--cookies` if needed
7. Cost is negligible: transcription is ~$0.04/hour, so don't hesitate to watch multiple videos
8. Use the prompt templates in `prompts/` for specific output types (architecture, code, UI, etc.)