Download and analyze social videos using frames + transcript for AI agent understanding at 50× lower cost than multimodal APIs
npx skill4agent add aradotso/devtools-skills watch-cli-video-agent

Skill by ara.so, from the Devtools Skills collection.
watch-cli

# Quick install (symlinks to ~/.local/bin)
curl -fsSL https://raw.githubusercontent.com/sonpiaz/watch-cli/main/install.sh | bash

Dependencies: yt-dlp, ffmpeg, jq, curl, python3

# macOS
brew install yt-dlp ffmpeg jq
# Debian/Ubuntu
sudo apt install yt-dlp ffmpeg jq python3 curl

export KYMA_API_KEY=kyma-xxxxxxxx

Alternatively, in .env:
GROQ_API_KEY=gsk_...
GOOGLE_AI_KEY=AIza...

watch

watch <url> [frame-count] [--cookies <file>]

watch https://www.youtube.com/watch?v=dQw4w9WgXcQ

VIDEO: /tmp/dl-video/abc123.mp4
DURATION: 218
FRAMES:
/tmp/frames_abc123/frame_01.jpg
/tmp/frames_abc123/frame_02.jpg
/tmp/frames_abc123/frame_03.jpg
...
TRANSCRIPT:
Today I want to talk about how decomposition unlocks 10× cost reduction...

watch https://twitter.com/user/status/12345 16

watch https://www.linkedin.com/posts/someone_activity-12345 --cookies ~/cookies.txt

dl-video

dl-video <url> [out-dir] [--cookies <file>]

VIDEO_PATH=$(dl-video https://www.youtube.com/watch?v=dQw4w9WgXcQ /tmp/videos)
echo "Downloaded to: $VIDEO_PATH"

extract-frames

extract-frames <video> [count] [out-dir]

extract-frames video.mp4 12 /tmp/my-frames
# Returns:
# /tmp/my-frames/frame_01.jpg
# /tmp/my-frames/frame_02.jpg
# ...

transcribe

transcribe <audio-or-video> [language]

transcribe video.mp4
# or
transcribe audio.mp3 en

audio-q

audio-q <audio-or-video> "<question>"

audio-q video.mp4 "What is the speaker's emotional tone? Is there background music?"

models

models          # Audio models only
models --all    # All Kyma models (text, image, video, audio)

transcribe, audio-understand

| Video Type | Recommended Frames |
|---|---|
| Short tweet/clip (<2 min) | 4–8 (default) |
| Tutorial/talk (5–20 min) | 8–16 |
| Lecture (20–60 min) | 16–24 |
| Conference talk (>1 hr) | 24–32 |
| Fast-cut UI demo | Double the recommendation |
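
The table above can be turned into a small helper that picks a frame count from the DURATION value `watch` prints. This is only a sketch: `recommend_frames` is a hypothetical function, not part of watch-cli. It returns the upper end of each range, and it assigns clips between 2 and 5 minutes (which the table does not cover) to the tutorial tier as an assumption.

```shell
# Hypothetical helper (not part of watch-cli): pick a frame count from a
# duration in seconds, using the upper end of each range in the table.
recommend_frames() (
  duration=$1
  fast_cut=${2:-}
  if [ "$duration" -lt 120 ]; then
    frames=8     # short clip (<2 min)
  elif [ "$duration" -le 1200 ]; then
    frames=16    # tutorial/talk; 2-5 min clips are placed here by assumption
  elif [ "$duration" -le 3600 ]; then
    frames=24    # lecture (20-60 min)
  else
    frames=32    # conference talk (>1 hr)
  fi
  # Fast-cut UI demos: double the recommendation.
  if [ "$fast_cut" = "fast" ]; then
    frames=$((frames * 2))
  fi
  echo "$frames"
)

recommend_frames 218         # prints 16
recommend_frames 1847 fast   # prints 48
```

The result can be passed as the optional frame-count argument, for example `watch <url> "$(recommend_frames 1847)"`.
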
# User: "Analyze this video and tell me what it's about"
OUTPUT=$(watch https://www.youtube.com/watch?v=VIDEO_ID)
# Parse the output:
# - VIDEO: line → path to mp4
# - FRAMES: block → list of jpg paths (read each as image)
# - TRANSCRIPT: block → full text (read as text)
# Now you have frames + transcript to reason about

# User: "Watch this coding tutorial and implement the project"
watch https://www.youtube.com/watch?v=CODING_TUTORIAL 16
# Read frames to see:
# - File structure screenshots
# - Code snippets on screen
# - Terminal commands
# Read transcript for:
# - Verbal explanations
# - Step-by-step instructions
# Combine both to reconstruct the full implementation

# User: "Extract the system architecture from this talk"
watch https://www.youtube.com/watch?v=SYSTEM_DESIGN_TALK 24
# Look for frames with:
# - Diagrams (boxes, arrows)
# - Service names
# - Data flow illustrations
# Use transcript to identify component relationships
# Generate Mermaid/PlantUML diagram

# User: "Clone the interface shown in this demo"
watch https://twitter.com/designer/status/12345 12
# Frames show UI states:
# - Layout structure
# - Color scheme
# - Interactive elements
# Transcript reveals:
# - Interaction patterns
# - Animation timing
# Generate React/HTML+CSS implementation

# For podcasts, music, or unclear audio:
TRANSCRIPT=$(transcribe podcast.mp3)
SCENE=$(audio-q podcast.mp3 "Describe the audio scene: tone, music, number of speakers, emotion")
# Combine both for full audio understanding

Prompt templates in prompts/, for use with watch output:
implement-from-video.md
extract-architecture.md
clone-ux.md
paper-to-code.md
tutorial-walkthrough.md

# 1. Watch the video
watch https://www.youtube.com/watch?v=VIDEO_ID > output.txt
# 2. Prepend the prompt template
cat prompts/implement-from-video.md output.txt > full-context.txt
# 3. Feed to agent (already done if you're the agent!)

| Video Length | Transcribe Cost |
|---|---|
| 5 minutes | ~$0.003 |
| 1 hour | ~$0.04 |
| 2 hours | ~$0.08 |
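
The three rows above are consistent with a flat rate of about $0.04 per hour of audio. As a rough sketch, the arithmetic can be scripted as follows. `estimate_transcribe_cost` and `RATE_PER_HOUR` are hypothetical names, and the rate is inferred from the table rather than taken from any published price list.

```shell
# Hypothetical helper: estimate transcription cost in USD from a duration in
# seconds. RATE_PER_HOUR is inferred from the table, not an official price.
RATE_PER_HOUR=0.04

estimate_transcribe_cost() {
  awk -v secs="$1" -v rate="$RATE_PER_HOUR" \
    'BEGIN { printf "%.4f\n", secs / 3600 * rate }'
}

estimate_transcribe_cost 300    # 5 minutes
estimate_transcribe_cost 3600   # 1 hour
estimate_transcribe_cost 1847   # a DURATION value as printed by watch
```

Treat the result as an order-of-magnitude estimate only, since actual billing may round or tier differently.
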
watch-cli

watch <url>

Cookies file: ~/cookies.txt
watch <url> --cookies ~/cookies.txt
See docs/cookies.md (yt-dlp --cookies).

# Split video first
ffmpeg -i long-video.mp4 -ss 00:00:00 -to 01:00:00 -c copy part1.mp4
ffmpeg -i long-video.mp4 -ss 01:00:00 -c copy part2.mp4
# Transcribe separately
transcribe part1.mp4 > transcript1.txt
transcribe part2.mp4 > transcript2.txt
cat transcript1.txt transcript2.txt > full-transcript.txt

watch <url> 24

audio-q
audio-q video.mp4 "Are there any UI sounds, clicks, or ambient audio?"

watch <url> 32    # 4× more frames

transcribe, audio-q, models
models --all

mkdir -p ~/.claude/skills
cp -r skills/watch-cli ~/.claude/skills/

/watch <url>

# Primary (Kyma unified API)
export KYMA_API_KEY=kyma-xxxxxxxx
# Alternative (bring-your-own-keys)
export GROQ_API_KEY=gsk_... # Whisper transcription
export GOOGLE_AI_KEY=AIza...    # Gemini audio understanding

# User asks: "Watch this video and build the project shown"
$ watch https://www.youtube.com/watch?v=TUTORIAL_VIDEO 16
# Agent receives:
VIDEO: /tmp/dl-video/abc123.mp4
DURATION: 1847
FRAMES:
/tmp/frames_abc123/frame_01.jpg # Shows project folder structure
/tmp/frames_abc123/frame_02.jpg # Shows package.json
/tmp/frames_abc123/frame_03.jpg # Shows main App.tsx code
/tmp/frames_abc123/frame_04.jpg # Shows terminal: npm install
...
TRANSCRIPT:
Okay so first we're going to set up a new React project. Create a folder
called my-app, then run npm init. Now let's install these dependencies...
# Agent reads:
# - frame_01.jpg → sees folder structure → creates directories
# - frame_02.jpg → reads package.json → writes dependencies
# - frame_03.jpg → reads code on screen → implements App.tsx
# - transcript → fills in verbal instructions for parts not shown
# Result: working project matching the tutorial

watch, --cookies, prompts/
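
The VIDEO / DURATION / FRAMES / TRANSCRIPT layout shown in the examples above can be split into its parts with standard tools. The sketch below assumes the output has been saved to a file, that each label starts its own line exactly as in the examples, and that FRAMES: comes before TRANSCRIPT:. The function names are hypothetical and not part of watch-cli.

```shell
# Hypothetical helpers: pull individual sections out of saved `watch` output.
# Each function takes the path of a file holding that output.

# Path of the downloaded video (the "VIDEO: " line).
watch_video() {
  sed -n 's/^VIDEO: //p' "$1"
}

# Duration in seconds (the "DURATION: " line).
watch_duration() {
  sed -n 's/^DURATION: //p' "$1"
}

# Frame paths: every line between "FRAMES:" and "TRANSCRIPT:".
watch_frames() {
  awk '$0 == "TRANSCRIPT:" { f = 0 } f; $0 == "FRAMES:" { f = 1 }' "$1"
}

# Transcript text: every line after "TRANSCRIPT:".
watch_transcript() {
  awk 't; $0 == "TRANSCRIPT:" { t = 1 }' "$1"
}
```

After `watch <url> > output.txt`, an agent could loop over `watch_frames output.txt` to read each image and call `watch_transcript output.txt` for the text.
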