watch-cli-video-agent
watch-cli Video Agent Skill
What It Does
The insight behind `watch-cli`: videos are just frames + audio. Extract both, read them separately with vision + text LLMs, and you have full video understanding without burning a video model on every frame.
Installation
```bash
# Quick install (symlinks to ~/.local/bin)
curl -fsSL https://raw.githubusercontent.com/sonpiaz/watch-cli/main/install.sh | bash
```
**Dependencies** (installer checks, but install manually if needed):
- `yt-dlp` — video downloader
- `ffmpeg` — frame/audio extraction
- `jq`, `curl`, `python3` — utilities
macOS:
```bash
brew install yt-dlp ffmpeg jq
```
Debian/Ubuntu:
```bash
sudo apt install yt-dlp ffmpeg jq python3 curl
```
Configuration
Set your Kyma API key (or bring your own Groq/Google keys):
```bash
export KYMA_API_KEY=kyma-xxxxxxxx
```
Get a free key at https://kymaapi.com (includes ~1 hour of free transcription credit).
Alternative: to bring your own keys, create a `.env` file:
```bash
GROQ_API_KEY=gsk_...
GOOGLE_AI_KEY=AIza...
```
Core Commands
`watch`: Full video analysis (orchestrator)
```bash
watch <url> [frame-count] [--cookies <file>]
```
Downloads the video, extracts frames, and transcribes the audio, then returns everything in one structured block.
Example:
```bash
watch https://www.youtube.com/watch?v=dQw4w9WgXcQ
```
Output structure:
```text
VIDEO: /tmp/dl-video/abc123.mp4
DURATION: 218
FRAMES:
/tmp/frames_abc123/frame_01.jpg
/tmp/frames_abc123/frame_02.jpg
/tmp/frames_abc123/frame_03.jpg
...
TRANSCRIPT:
Today I want to talk about how decomposition unlocks 10× cost reduction...
```
With a custom frame count (default is 8):
```bash
watch https://twitter.com/user/status/12345 16
```
For login-walled content (LinkedIn, private X, Facebook):
```bash
watch https://www.linkedin.com/posts/someone_activity-12345 --cookies ~/cookies.txt
```
(Auto-detects browser cookies if you are signed in; see Troubleshooting for manual export.)
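
As a minimal sketch, the structured block that `watch` returns can be split into its parts with plain awk. `watch_field` and `watch_block` are hypothetical helper names, not part of watch-cli, and the parsing assumes exactly the layout shown above: single-line `VIDEO:` and `DURATION:` fields followed by `FRAMES:` and `TRANSCRIPT:` blocks.

```bash
# Print the value of a single-line field such as VIDEO or DURATION.
# $1 = field name, stdin = watch-style output
watch_field() {
  awk -v key="$1:" '$1 == key { sub(/^[^:]*:[ \t]*/, ""); print; exit }'
}

# Print the lines of a multi-line block such as FRAMES or TRANSCRIPT.
# $1 = block name, stdin = watch-style output
watch_block() {
  awk -v key="$1:" '
    $0 == key             { inside = 1; next }  # block header found
    /^[A-Z]+:$/ && inside { exit }              # next block header ends it
    inside                { print }
  '
}

# Sample input copied from the output structure above (shortened).
sample='VIDEO: /tmp/dl-video/abc123.mp4
DURATION: 218
FRAMES:
/tmp/frames_abc123/frame_01.jpg
/tmp/frames_abc123/frame_02.jpg
TRANSCRIPT:
Today I want to talk about decomposition...'

printf '%s\n' "$sample" | watch_field VIDEO
printf '%s\n' "$sample" | watch_block FRAMES
```

With the real tool, the same helpers would presumably read from the command directly, for example `watch <url> | watch_block FRAMES`. A transcript line that consists only of an uppercase word and a colon would end the block early, which is a limitation of this sketch.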
`dl-video`: Download only
```bash
dl-video <url> [out-dir] [--cookies <file>]
```
Just downloads the video and returns the local mp4 path.
```bash
VIDEO_PATH=$(dl-video https://www.youtube.com/watch?v=dQw4w9WgXcQ /tmp/videos)
echo "Downloaded to: $VIDEO_PATH"
```
`extract-frames`: Frame extraction
```bash
extract-frames <video> [count] [out-dir]
```
Extracts N evenly-spaced JPG frames from a video.
```bash
extract-frames video.mp4 12 /tmp/my-frames
# Returns:
# /tmp/my-frames/frame_01.jpg
# /tmp/my-frames/frame_02.jpg
# ...
```
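
To make "evenly spaced" concrete, here is a small sketch that prints the timestamps at which N frames could be sampled. It assumes each frame is taken at the midpoint of one of N equal segments; `extract-frames` may space its frames differently, and `frame_timestamps` is a hypothetical helper, not part of watch-cli.

```bash
# Print N evenly spaced timestamps (seconds) for a video of a given length.
# $1 = duration in seconds, $2 = frame count
# Assumption: frame i sits at the midpoint of the i-th of N equal segments.
frame_timestamps() {
  LC_ALL=C awk -v dur="$1" -v n="$2" 'BEGIN {
    for (i = 0; i < n; i++) printf "%.1f\n", (i + 0.5) * dur / n
  }'
}

# A 100-second clip sampled 4 times.
frame_timestamps 100 4
```

Midpoint sampling avoids the first and last instants of the video, which are often black or title frames. Each timestamp could then be passed to `ffmpeg -ss <t> -i video.mp4 -frames:v 1 frame.jpg` to grab a single frame.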
`transcribe`: Speech-to-text
```bash
transcribe <audio-or-video> [language]
```
Auto-extracts audio from video if needed, then transcribes using Whisper Large v3 Turbo via Kyma.
```bash
transcribe video.mp4
# or
transcribe audio.mp3 en
```
**Output**: Plain text transcript.
`audio-q`: Audio scene Q&A
```bash
audio-q <audio-or-video> "<question>"
```
Goes beyond transcription: ask about tone, music, sound effects, language, or emotion, answered by a multimodal audio model (Gemini 3 Flash audio).
```bash
audio-q video.mp4 "What is the speaker's emotional tone? Is there background music?"
```
`models`: List available models
```bash
models        # Audio models only
models --all  # All Kyma models (text, image, video, audio)
```
Shows the live model list from the Kyma API. The `transcribe` and `audio-understand` aliases auto-update when Kyma swaps underlying models.
Frame Count Guidance
| Video Type | Recommended Frames |
|---|---|
| Short tweet/clip (<2 min) | 4–8 (default) |
| Tutorial/talk (5–20 min) | 8–16 |
| Lecture (20–60 min) | 16–24 |
| Conference talk (>1 hr) | 24–32 |
| Fast-cut UI demo | Double the recommendation |
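
The table can be turned into a small lookup, sketched below. `recommend_frames` is a hypothetical helper, not part of watch-cli. It returns the upper end of each range, and because the table does not cover videos between 2 and 5 minutes, treating them like the 5–20 minute row is an assumption made here.

```bash
# Map a duration to a frame count based on the guidance table.
# $1 = duration in seconds, $2 = "yes" for fast-cut content (optional)
recommend_frames() {
  local seconds="$1" fast_cut="${2:-no}" frames
  if   [ "$seconds" -lt 120 ];  then frames=8    # short clip (<2 min)
  elif [ "$seconds" -le 1200 ]; then frames=16   # up to 20 min (2-5 min is an assumption)
  elif [ "$seconds" -le 3600 ]; then frames=24   # lecture (20-60 min)
  else                               frames=32   # conference talk (>1 hr)
  fi
  # Fast-cut UI demos: double the recommendation.
  if [ "$fast_cut" = "yes" ]; then
    frames=$((frames * 2))
  fi
  echo "$frames"
}

recommend_frames 1847       # a roughly 31-minute lecture
recommend_frames 60 yes     # a short fast-cut demo
```

The result can be passed straight through as the second argument, for example `watch <url> "$(recommend_frames 1847)"`.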
Patterns for AI Agents
Pattern 1: Full video understanding
```bash
# User: "Analyze this video and tell me what it's about"
OUTPUT=$(watch https://www.youtube.com/watch?v=VIDEO_ID)
# Parse the output:
# - VIDEO: line → path to mp4
# - FRAMES: block → list of jpg paths (read each as image)
# - TRANSCRIPT: block → full text (read as text)
# Now you have frames + transcript to reason about
```
Pattern 2: Code tutorial → working implementation
```bash
# User: "Watch this coding tutorial and implement the project"
# Read frames to see:
# - File structure screenshots
# - Code snippets on screen
# - Terminal commands
# Read transcript for:
# - Verbal explanations
# - Step-by-step instructions
# Combine both to reconstruct the full implementation
```
Pattern 3: Architecture extraction
```bash
# User: "Extract the system architecture from this talk"
# Look for frames with:
# - Diagrams (boxes, arrows)
# - Service names
# - Data flow illustrations
# Use transcript to identify component relationships
# Generate Mermaid/PlantUML diagram
```
Pattern 4: UI/UX cloning
```bash
# User: "Clone the interface shown in this demo"
# Frames show UI states:
# - Layout structure
# - Color scheme
# - Interactive elements
# Transcript reveals:
# - Interaction patterns
# - Animation timing
# Generate React/HTML+CSS implementation
```
Pattern 5: Audio-only analysis
```bash
# For podcasts, music, or unclear audio:
TRANSCRIPT=$(transcribe podcast.mp3)
SCENE=$(audio-q podcast.mp3 "Describe the audio scene: tone, music, number of speakers, emotion")
# Combine both for full audio understanding
```
Prompt Templates (Copy-Paste for Agents)
Located in the `prompts/` directory; paste above `watch` output:
- `implement-from-video.md`: Coding walkthrough → working project
- `extract-architecture.md`: System talk → architecture diagram
- `clone-ux.md`: UI demo → React component
- `paper-to-code.md`: Research talk → runnable notebook
- `tutorial-walkthrough.md`: Long tutorial → cheat sheet
Usage:
```bash
# 1. Watch the video
watch https://www.youtube.com/watch?v=VIDEO_ID > output.txt
# 2. Prepend the prompt template
cat prompts/implement-from-video.md output.txt > full-context.txt
# 3. Feed to agent (already done if you're the agent!)
```
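
The prepend step can be wrapped so that it fails early when an input is missing instead of writing a partial context file. This is only a sketch: `build_context` is a hypothetical helper, not something shipped with watch-cli.

```bash
# Build an agent context file from a prompt template plus saved watch output.
# $1 = prompt template, $2 = saved watch output, $3 = destination file
build_context() {
  local template="$1" output="$2" dest="$3" f
  # Refuse to continue if either input is missing or empty.
  for f in "$template" "$output"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      return 1
    fi
  done
  # Template first, then the watch output, as in the usage steps.
  cat "$template" "$output" > "$dest"
}
```

A typical call would be `build_context prompts/implement-from-video.md output.txt full-context.txt`.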
Cost Estimates
| Video Length | Transcribe Cost |
|---|---|
| 5 minutes | ~$0.003 |
| 1 hour | ~$0.04 |
| 2 hours | ~$0.08 |
Frame extraction is local (ffmpeg) and free. Free Kyma credit covers ~25 hours of audio.
Comparison: a multimodal video API on a 1-hour video ≈ $5; `watch-cli` ≈ $0.10 (~50× cheaper).
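
For a quick estimate on an arbitrary video, the per-hour figure from the table can be applied to a duration in seconds. This sketch assumes the ~$0.04 per hour transcription rate shown above as its default; actual pricing may differ, and `transcribe_cost` is a hypothetical helper.

```bash
# Estimate transcription cost in USD from a duration in seconds.
# $1 = duration in seconds, $2 = optional rate in USD per hour (default 0.04, from the table)
transcribe_cost() {
  LC_ALL=C awk -v s="$1" -v rate="${2:-0.04}" \
    'BEGIN { printf "%.4f\n", s / 3600 * rate }'
}

transcribe_cost 300     # 5 minutes
transcribe_cost 3600    # 1 hour
```

The duration can be taken from the `DURATION:` line of `watch` output, assuming that value is in seconds.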
Troubleshooting
Login-walled videos (LinkedIn, private X, Facebook)
Auto-detection (usually works):
- Sign into the platform in your browser (Chrome, Firefox, Safari, Edge, Brave)
- Run `watch <url>`; it finds the cookies automatically
Manual cookie export (for servers/CI):
- Install the browser extension "Get cookies.txt LOCALLY" (Chrome/Firefox)
- Visit the platform while signed in
- Click the extension → Export cookies.txt
- Save to `~/cookies.txt`
- Run: `watch <url> --cookies ~/cookies.txt`
Full guide: `docs/cookies.md` in the repo.
"Region-locked video" error (`yt-dlp`)
- Use a VPN set to the target region
- Pass `--cookies` exported from a browser with the VPN active
"Audio file too large" error
The transcription provider has a 25MB audio limit. For 2+ hour videos:
```bash
# Split video first
ffmpeg -i long-video.mp4 -ss 00:00:00 -to 01:00:00 -c copy part1.mp4
ffmpeg -i long-video.mp4 -ss 01:00:00 -c copy part2.mp4
# Transcribe separately
transcribe part1.mp4 > transcript1.txt
transcribe part2.mp4 > transcript2.txt
cat transcript1.txt transcript2.txt > full-transcript.txt
```
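
The manual split above generalizes to videos of any length. The sketch below is a dry run: `split_plan` is a hypothetical helper that only prints the ffmpeg commands for fixed-length parts and executes nothing. It assumes the duration in seconds is already known, and it uses `-t` (part length) rather than `-to` (end time). Whether a one-hour part stays under the 25MB limit depends on the audio bitrate, so the default part length is an assumption.

```bash
# Print the ffmpeg commands that split a video into fixed-length parts.
# $1 = input file, $2 = duration in seconds, $3 = optional part length in seconds
split_plan() {
  local input="$1" duration="$2" chunk="${3:-3600}" start=0 part=1
  while [ "$start" -lt "$duration" ]; do
    # -ss sets the start offset, -t the part length, -c copy avoids re-encoding.
    echo "ffmpeg -i $input -ss $start -t $chunk -c copy part$part.mp4"
    start=$((start + chunk))
    part=$((part + 1))
  done
}

# A video of 2 hours 5 minutes yields three parts.
split_plan long-video.mp4 7500
```

After reviewing the printed commands, they can be run by piping the output to `sh`. File names containing spaces would need quoting that this sketch does not add.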
Empty transcript (silent video)
For screencasts with no speech:
- Increase frame count: `watch <url> 24`
- Use `audio-q` to describe any sound design: `audio-q video.mp4 "Are there any UI sounds, clicks, or ambient audio?"`
Fast-cut content missing key moments
The default of 8 frames won't catch rapid edits. Solution:
```bash
watch <url> 32  # 4× more frames
```
Models not updating
Applies to `transcribe` and `audio-q`. Check the live model list:
```bash
models
```
If you want to pin a specific model version, edit the script and replace the alias with a model ID from `models --all`.
Advanced: Using as Claude Code Skill
Copy the pre-built skill into Claude's skill directory:
```bash
mkdir -p ~/.claude/skills
cp -r skills/watch-cli ~/.claude/skills/
```
Now `/watch <url>` becomes a first-class command in Claude Code, with the prompt library auto-attached.
Environment Variables Reference
```bash
# Primary (Kyma unified API)
export KYMA_API_KEY=kyma-xxxxxxxx
# Alternative (bring-your-own-keys)
export GROQ_API_KEY=gsk_...    # Whisper transcription
export GOOGLE_AI_KEY=AIza...   # Gemini audio understanding
```
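
Before running a command it can help to check which key it would rely on. The sketch below follows the pairing in the comments above (Groq for transcription, Google for audio understanding). Giving `KYMA_API_KEY` precedence when several keys are set is an assumption, and `key_for` is a hypothetical helper; the actual scripts may resolve keys differently.

```bash
# Report which environment variable a command would rely on.
# $1 = "transcribe" or "audio-q"
key_for() {
  # Assumption: the unified Kyma key wins when it is set.
  if [ -n "${KYMA_API_KEY:-}" ]; then
    echo "KYMA_API_KEY"
    return 0
  fi
  case "$1" in
    transcribe)
      if [ -n "${GROQ_API_KEY:-}" ]; then echo "GROQ_API_KEY"; return 0; fi ;;
    audio-q)
      if [ -n "${GOOGLE_AI_KEY:-}" ]; then echo "GOOGLE_AI_KEY"; return 0; fi ;;
  esac
  echo "no key configured for $1" >&2
  return 1
}
```

For example, `key_for transcribe || exit 1` at the top of a script stops early with a readable message instead of failing partway through a download.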
Real Agent Usage Example
```bash
# User asks: "Watch this video and build the project shown"
# Agent receives:
# VIDEO: /tmp/dl-video/abc123.mp4
# DURATION: 1847
# FRAMES:
# /tmp/frames_abc123/frame_01.jpg   # Shows project folder structure
# /tmp/frames_abc123/frame_02.jpg   # Shows package.json
# /tmp/frames_abc123/frame_03.jpg   # Shows main App.tsx code
# /tmp/frames_abc123/frame_04.jpg   # Shows terminal: npm install
# ...
# TRANSCRIPT:
# Okay so first we're going to set up a new React project. Create a folder
# called my-app, then run npm init. Now let's install these dependencies...
# Agent reads:
# - frame_01.jpg → sees folder structure → creates directories
# - frame_02.jpg → reads package.json → writes dependencies
# - frame_03.jpg → reads code on screen → implements App.tsx
# - transcript → fills in verbal instructions for parts not shown
# Result: working project matching the tutorial
```
Key Takeaways for Agents
- Always use `watch` for the full pipeline (unless you only need one piece)
- Read frames as images: they contain visual info the transcript can't capture
- Read the transcript as text: it contains verbal explanations the frames don't show
- Combine both: frames + transcript = full video understanding
- Adjust frame count based on video type (fast-cut → more frames)
- For login-walled content, `watch` auto-detects browser cookies; fall back to `--cookies` if needed
- Cost is negligible: transcription is ~$0.04/hour, so don't hesitate to watch multiple videos
- Use the prompt templates in `prompts/` for specific output types (architecture, code, UI, etc.)