watch-cli-video-agent

watch-cli Video Agent Skill

Skill by ara.so — Devtools Skills collection.

What It Does

`watch-cli` gives AI agents the ability to "watch" social videos by decomposing them into frames (JPGs) plus an audio transcript, avoiding expensive multimodal video APIs. It works with YouTube, X/Twitter, LinkedIn, TikTok, Reddit, Vimeo, and Facebook, and is ~50× cheaper and ~5× faster than calling a multimodal LLM on the full video.

The insight: Videos are just frames + audio. Extract both, read them separately with vision + text LLMs, and you have full video understanding without burning a video model on every frame.

Installation

bash

# Quick install (symlinks to ~/.local/bin)
curl -fsSL https://raw.githubusercontent.com/sonpiaz/watch-cli/main/install.sh | bash

**Dependencies** (installer checks, but install manually if needed):
- `yt-dlp` — video downloader
- `ffmpeg` — frame/audio extraction
- `jq`, `curl`, `python3` — utilities

macOS:
```bash
brew install yt-dlp ffmpeg jq
```

Debian/Ubuntu:

bash

sudo apt install yt-dlp ffmpeg jq python3 curl
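
Before running the installer, it can help to confirm which of the tools listed above are already on `PATH`. The following is a minimal sketch and not part of watch-cli: `missing_deps` is a hypothetical helper name, and it only checks that each command exists, not which version is installed.

```bash
# Print each required tool that is not found on PATH, one per line.
# missing_deps is a hypothetical helper, not something watch-cli ships.
missing_deps() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Dependencies named in this section; prints nothing when all are present.
missing_deps yt-dlp ffmpeg jq curl python3
```

An empty result means every listed tool was found.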

Configuration

Set your Kyma API key (or bring-your-own Groq/Google keys):

bash

export KYMA_API_KEY=kyma-xxxxxxxx

Get a free key at https://kymaapi.com (includes ~1 hour of free transcription credit).

Alternative: to bring your own keys, create a `.env` file:

bash

GROQ_API_KEY=gsk_...
GOOGLE_AI_KEY=AIza...
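
A script that wraps these commands may want to know which credentials are present before it starts. This sketch is illustrative only: `detect_provider` is a hypothetical name, and both the precedence (Kyma first) and the requirement that both bring-your-own keys be set are assumptions, not documented watch-cli behavior.

```bash
# Report which credentials are available: prints "kyma", "byok", or "none".
# ASSUMPTION: the Kyma key wins when several keys are set, and the
# bring-your-own path needs both the Groq and the Google key.
detect_provider() {
  if [ -n "${KYMA_API_KEY:-}" ]; then
    echo "kyma"
  elif [ -n "${GROQ_API_KEY:-}" ] && [ -n "${GOOGLE_AI_KEY:-}" ]; then
    echo "byok"
  else
    echo "none"
  fi
}

detect_provider
```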

Core Commands

`watch`: full video analysis (orchestrator)

bash

watch <url> [frame-count] [--cookies <file>]

Downloads video, extracts frames, transcribes audio — returns everything in one structured block.

Example:

bash

watch "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

Output structure:

text

VIDEO: /tmp/dl-video/abc123.mp4
DURATION: 218
FRAMES:
  /tmp/frames_abc123/frame_01.jpg
  /tmp/frames_abc123/frame_02.jpg
  /tmp/frames_abc123/frame_03.jpg
  ...
TRANSCRIPT:
  Today I want to talk about how decomposition unlocks 10× cost reduction...
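
The block layout above can be split into its parts with standard tools. The sketch below assumes the output looks exactly like the sample (single-line `KEY: value` fields, and indented lines under `FRAMES:` and `TRANSCRIPT:`); `watch_field` and `watch_block` are hypothetical helper names, and the sample text stands in for real `watch` output.

```bash
# Sample text in the documented layout (placeholder paths from the example).
sample_output='VIDEO: /tmp/dl-video/abc123.mp4
DURATION: 218
FRAMES:
  /tmp/frames_abc123/frame_01.jpg
  /tmp/frames_abc123/frame_02.jpg
TRANSCRIPT:
  Today I want to talk about decomposition.'

# Print the value of a single-line field such as VIDEO or DURATION.
watch_field() {
  awk -v key="$1:" '$1 == key { sub(/^[^:]*:[ \t]*/, ""); print; exit }'
}

# Print the indented lines of a block such as FRAMES or TRANSCRIPT.
watch_block() {
  awk -v key="$1:" '
    $0 == key            { inside = 1; next }  # block header found
    inside && /^[A-Z]+:/ { exit }              # next header ends the block
    inside               { sub(/^[ \t]+/, ""); print }
  '
}

printf '%s\n' "$sample_output" | watch_field VIDEO
printf '%s\n' "$sample_output" | watch_block FRAMES
```

With real output, the same helpers would read from `watch <url>` instead of the sample string.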

With custom frame count (default is 8):

bash

watch https://twitter.com/user/status/12345 16

For login-walled content (LinkedIn, private X, Facebook):

bash

watch https://www.linkedin.com/posts/someone_activity-12345 --cookies ~/cookies.txt

(Auto-detects browser cookies if signed in; see troubleshooting for manual export)

`dl-video`: download only

bash

dl-video <url> [out-dir] [--cookies <file>]

Just downloads the video, returns the local mp4 path.

bash

VIDEO_PATH=$(dl-video "https://www.youtube.com/watch?v=dQw4w9WgXcQ" /tmp/videos)
echo "Downloaded to: $VIDEO_PATH"

`extract-frames`: frame extraction

bash

extract-frames <video> [count] [out-dir]

Extracts N evenly-spaced JPG frames from a video.

bash

extract-frames video.mp4 12 /tmp/my-frames
# Returns:
# /tmp/my-frames/frame_01.jpg
# /tmp/my-frames/frame_02.jpg
# ...
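
To make "evenly spaced" concrete, here is one plausible way to pick the sample times: take the midpoint of each of N equal slices of the video. This is a sketch under that assumption; the text here does not say how `extract-frames` actually chooses its timestamps, and `frame_timestamps` is a hypothetical name.

```bash
# Print N timestamps (in seconds) at the midpoints of N equal slices.
# Usage: frame_timestamps <duration-seconds> <frame-count>
# ASSUMPTION: midpoint sampling; the real tool may sample differently.
frame_timestamps() {
  awk -v d="$1" -v n="$2" 'BEGIN {
    for (i = 0; i < n; i++) printf "%.1f\n", d * (i + 0.5) / n
  }'
}

# A 240-second clip sampled 8 times: 15.0, 45.0, ..., 225.0
frame_timestamps 240 8

# Each timestamp t could then be grabbed with ffmpeg, for example:
#   ffmpeg -ss "$t" -i video.mp4 -frames:v 1 frame.jpg
```

Midpoints avoid the very first and last frames, which are often black or title cards.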

`transcribe`: speech-to-text

bash

transcribe <audio-or-video> [language]

Auto-extracts audio from video if needed, then transcribes using Whisper Large v3 Turbo via Kyma.

bash

transcribe video.mp4
# or
transcribe audio.mp3 en

**Output**: Plain text transcript.

`audio-q`: audio scene Q&A

bash

audio-q <audio-or-video> "<question>"

Beyond transcription: asks about tone, music, sound effects, language, emotion using a multimodal audio model (Gemini 3 Flash audio).

bash

audio-q video.mp4 "What is the speaker's emotional tone? Is there background music?"

`models`: list available models

bash

models           # Audio models only
models --all     # All Kyma models (text, image, video, audio)

Shows the live model list from the Kyma API. The `transcribe` and `audio-understand` aliases auto-update when Kyma swaps the underlying models.

Frame Count Guidance

| Video Type | Recommended Frames |
| --- | --- |
| Short tweet/clip (<2 min) | 4–8 (default) |
| Tutorial/talk (5–20 min) | 8–16 |
| Lecture (20–60 min) | 16–24 |
| Conference talk (>1 hr) | 24–32 |
| Fast-cut UI demo | Double the recommendation |
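
The guidance above can be turned into a small helper that maps a duration to a frame count. This is a sketch with stated assumptions: `suggest_frames` is a hypothetical name, it always picks the upper end of each range, and it treats the 2–5 minute gap in the table like a short clip.

```bash
# Suggest a frame count from a duration in seconds.
# Usage: suggest_frames <seconds> [fast]
# ASSUMPTIONS: upper end of each range; durations under 5 minutes get 8.
suggest_frames() {
  local secs="$1" fast="${2:-no}" n
  if   [ "$secs" -lt 300 ];  then n=8    # short clips (and the 2-5 min gap)
  elif [ "$secs" -lt 1200 ]; then n=16   # tutorial/talk
  elif [ "$secs" -le 3600 ]; then n=24   # lecture
  else                            n=32   # conference talk
  fi
  # Fast-cut content: double the recommendation.
  [ "$fast" = "fast" ] && n=$((n * 2))
  echo "$n"
}

suggest_frames 218
suggest_frames 90 fast
```

The result can be passed as the frame-count argument to `watch`.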

Patterns for AI Agents

Pattern 1: Full video understanding

bash

# User: "Analyze this video and tell me what it's about"
OUTPUT=$(watch "https://www.youtube.com/watch?v=VIDEO_ID")

# Parse the output:
# - VIDEO: line → path to mp4
# - FRAMES: block → list of jpg paths (read each as image)
# - TRANSCRIPT: block → full text (read as text)

# Now you have frames + transcript to reason about

Pattern 2: Code tutorial → working implementation

bash

# User: "Watch this coding tutorial and implement the project"
watch "https://www.youtube.com/watch?v=CODING_TUTORIAL" 16

# Read frames to see:
# - File structure screenshots
# - Code snippets on screen
# - Terminal commands

# Read transcript for:
# - Verbal explanations
# - Step-by-step instructions

# Combine both to reconstruct the full implementation

Pattern 3: Architecture extraction

bash

# User: "Extract the system architecture from this talk"
watch "https://www.youtube.com/watch?v=SYSTEM_DESIGN_TALK" 24

# Look for frames with:
# - Diagrams (boxes, arrows)
# - Service names
# - Data flow illustrations

# Use transcript to identify component relationships
# Generate Mermaid/PlantUML diagram

Pattern 4: UI/UX cloning

bash

# User: "Clone the interface shown in this demo"
watch https://twitter.com/designer/status/12345 12

# Frames show UI states:
# - Layout structure
# - Color scheme
# - Interactive elements

# Transcript reveals:
# - Interaction patterns
# - Animation timing

# Generate React/HTML+CSS implementation

Pattern 5: Audio-only analysis

bash

# For podcasts, music, or unclear audio:
TRANSCRIPT=$(transcribe podcast.mp3)
SCENE=$(audio-q podcast.mp3 "Describe the audio scene: tone, music, number of speakers, emotion")

# Combine both for full audio understanding

Prompt Templates (Copy-Paste for Agents)

Located in the `prompts/` directory; paste one above the `watch` output:

- `implement-from-video.md`: coding walkthrough → working project
- `extract-architecture.md`: system talk → architecture diagram
- `clone-ux.md`: UI demo → React component
- `paper-to-code.md`: research talk → runnable notebook
- `tutorial-walkthrough.md`: long tutorial → cheat sheet

Usage:

bash

# 1. Watch the video
watch "https://www.youtube.com/watch?v=VIDEO_ID" > output.txt

# 2. Prepend the prompt template
cat prompts/implement-from-video.md output.txt > full-context.txt

# 3. Feed to agent (already done if you're the agent!)
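
The prepend step can be wrapped so that a mistyped template name fails loudly instead of producing a context file with no prompt. This is a minimal sketch; `build_context` is a hypothetical helper, not a watch-cli command.

```bash
# Print the prompt template followed by the saved watch output.
# Fails with a message on stderr if the template file does not exist.
# build_context is a hypothetical helper, not part of watch-cli.
build_context() {
  local template="$1" watch_output="$2"
  [ -f "$template" ] || { echo "missing template: $template" >&2; return 1; }
  cat "$template" "$watch_output"
}

# Example (assumes both files exist):
#   build_context prompts/implement-from-video.md output.txt > full-context.txt
```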

Cost Estimates

| Video Length | Transcribe Cost |
| --- | --- |
| 5 minutes | ~$0.003 |
| 1 hour | ~$0.04 |
| 2 hours | ~$0.08 |

Frame extraction is local (ffmpeg), free. Free Kyma credit covers ~25 hours of audio.

Comparison: a multimodal video API on a 1-hour video costs ≈ $5; `watch-cli` costs ≈ $0.10 (~50× cheaper).
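
The transcription figures in the table are consistent with a flat rate of about $0.04 per hour of audio. The sketch below applies that rate to an arbitrary duration; the flat rate is an assumption read off the table rather than published pricing, and `transcribe_cost` is a hypothetical name.

```bash
# Estimate transcription cost in USD for a duration given in seconds.
# ASSUMPTION: a flat $0.04 per hour of audio, inferred from the table.
transcribe_cost() {
  awk -v secs="$1" 'BEGIN { printf "%.3f\n", secs / 3600 * 0.04 }'
}

transcribe_cost 300    # 5 minutes
transcribe_cost 3600   # 1 hour
transcribe_cost 7200   # 2 hours
```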

Troubleshooting

Login-walled videos (LinkedIn, private X, Facebook)

Auto-detection (usually works):

1. Sign into the platform in your browser (Chrome, Firefox, Safari, Edge, Brave)
2. Run `watch <url>`; it finds the cookies automatically

Manual cookie export (for servers/CI):

1. Install the browser extension "Get cookies.txt LOCALLY" (Chrome/Firefox)
2. Visit the platform while signed in
3. Click the extension → Export cookies.txt
4. Save to `~/cookies.txt`
5. Run `watch <url> --cookies ~/cookies.txt`

Full guide: `docs/cookies.md` in the repo.
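
A quick sanity check on an exported file can save a failed download. The sketch below assumes the export uses the Netscape cookie format that yt-dlp reads, whose first line is `# Netscape HTTP Cookie File` or `# HTTP Cookie File`; `looks_like_cookies_txt` is a hypothetical helper, and it checks only the header, not whether the cookies are still valid.

```bash
# Return success if the file starts with a Netscape cookie-file header.
# ASSUMPTION: the export is meant to be in Netscape format for yt-dlp.
looks_like_cookies_txt() {
  head -n 1 "$1" | grep -Eq '^# (Netscape )?HTTP Cookie File'
}

# Example:
#   looks_like_cookies_txt ~/cookies.txt || echo "re-export cookies.txt" >&2
```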

"Region-locked video" error

`yt-dlp` can't download region-restricted content. Workarounds:

- Use a VPN to reach the target region
- Pass `--cookies` exported from a browser session with the VPN active

"Audio file too large" error

The transcription provider has a 25 MB audio limit. For videos of 2+ hours:

bash

# Split video first
ffmpeg -i long-video.mp4 -ss 00:00:00 -to 01:00:00 -c copy part1.mp4
ffmpeg -i long-video.mp4 -ss 01:00:00 -c copy part2.mp4

# Transcribe separately
transcribe part1.mp4 > transcript1.txt
transcribe part2.mp4 > transcript2.txt
cat transcript1.txt transcript2.txt > full-transcript.txt
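
For videos that need more than two parts, the split commands can be generated rather than typed by hand. This sketch only prints the `ffmpeg` commands (a dry run) so they can be reviewed first; `split_plan` is a hypothetical helper, and the one-hour default chunk mirrors the example above rather than any guarantee about staying under the size limit.

```bash
# Print one ffmpeg command per chunk of at most CHUNK seconds (dry run).
# Usage: split_plan <file> <total-seconds> [chunk-seconds]
# ASSUMPTION: one-hour chunks by default, as in the manual example.
split_plan() {
  local file="$1" total="$2" chunk="${3:-3600}" start=0 part=1 len
  while [ "$start" -lt "$total" ]; do
    len=$((total - start))
    [ "$len" -gt "$chunk" ] && len="$chunk"   # cap all but the last part
    printf 'ffmpeg -i %s -ss %d -t %d -c copy part%d.mp4\n' \
      "$file" "$start" "$len" "$part"
    start=$((start + chunk))
    part=$((part + 1))
  done
}

# A 2.5-hour video becomes three parts.
split_plan long-video.mp4 9000
```

Piping the output to `sh` would run the commands once they look right.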

Empty transcript (silent video)

For screencasts with no speech:

- Increase the frame count: `watch <url> 24`
- Use `audio-q` to describe any sound design: `audio-q video.mp4 "Are there any UI sounds, clicks, or ambient audio?"`

Fast-cut content missing key moments

Default 8 frames won't catch rapid edits. Solution:

bash

watch <url> 32  # 4× the default of 8 frames

Models not updating

`transcribe` and `audio-q` use Kyma aliases that auto-update. To see the current models:

bash

models

To pin a specific model version, edit the script and replace the alias with a model ID from `models --all`.

Advanced: Using as Claude Code Skill

Copy the pre-built skill into Claude's skill directory:

bash

mkdir -p ~/.claude/skills
cp -r skills/watch-cli ~/.claude/skills/

Now `/watch <url>` becomes a first-class command in Claude Code, with the prompt library auto-attached.

Environment Variables Reference

bash

# Primary (Kyma unified API)
export KYMA_API_KEY=kyma-xxxxxxxx

# Alternative (bring-your-own-keys)
export GROQ_API_KEY=gsk_...     # Whisper transcription
export GOOGLE_AI_KEY=AIza...    # Gemini audio understanding

Real Agent Usage Example

bash

# User asks: "Watch this video and build the project shown"

$ watch "https://www.youtube.com/watch?v=TUTORIAL_VIDEO" 16

# Agent receives:
VIDEO: /tmp/dl-video/abc123.mp4
DURATION: 1847
FRAMES:
  /tmp/frames_abc123/frame_01.jpg  # Shows project folder structure
  /tmp/frames_abc123/frame_02.jpg  # Shows package.json
  /tmp/frames_abc123/frame_03.jpg  # Shows main App.tsx code
  /tmp/frames_abc123/frame_04.jpg  # Shows terminal: npm install
  ...
TRANSCRIPT:
  Okay so first we're going to set up a new React project. Create a folder called my-app, then run npm init. Now let's install these dependencies...

# Agent reads:
# - frame_01.jpg → sees folder structure → creates directories
# - frame_02.jpg → reads package.json → writes dependencies
# - frame_03.jpg → reads code on screen → implements App.tsx
# - transcript → fills in verbal instructions for parts not shown

# Result: working project matching the tutorial

Key Takeaways for Agents

- Always use `watch` for the full pipeline (unless you only need one piece)
- Read frames as images: they contain visual info the transcript can't capture
- Read the transcript as text: it contains verbal explanations the frames don't show
- Combine both: frames + transcript = full video understanding
- Adjust frame count based on video type (fast-cut → more frames)
- For login-walled content, browser cookies are auto-detected; fall back to `--cookies` if needed
- Cost is negligible: transcription is ~$0.04/hour, so don't hesitate to watch multiple videos
- Use the prompt templates in `prompts/` for specific output types (architecture, code, UI, etc.)

License: MIT
Homepage: https://kymaapi.com
GitHub: https://github.com/sonpiaz/watch-cli
