YouTube Transcript
Downloads transcripts (subtitles/captions) from YouTube videos. Works with both manually created and auto-generated transcripts. No API key or browser required — uses YouTube's InnerTube API directly.
Fetches video metadata and cover image on first run, caches raw data for fast re-formatting.
Script Directory
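Scripts are in the `scripts/` subdirectory. `{baseDir}` = this SKILL.md's directory path. Resolve the `${BUN_X}` runtime: use `bun` if it is installed; otherwise use an available fallback launcher; if neither is available, suggest installing bun. Replace `{baseDir}` and `${BUN_X}` with actual values.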
| Script | Purpose |
|---|---|
| `scripts/main.ts` | Transcript download CLI |
Usage
```bash
# Default: markdown with timestamps (English)
${BUN_X} {baseDir}/scripts/main.ts <youtube-url-or-id>
# Specify languages (priority order)
${BUN_X} {baseDir}/scripts/main.ts <url> --languages zh,en,ja
# Without timestamps
${BUN_X} {baseDir}/scripts/main.ts <url> --no-timestamps
# With chapter segmentation
${BUN_X} {baseDir}/scripts/main.ts <url> --chapters
# With speaker identification (requires AI post-processing)
${BUN_X} {baseDir}/scripts/main.ts <url> --speakers
# SRT subtitle file
${BUN_X} {baseDir}/scripts/main.ts <url> --format srt
# Translate transcript
${BUN_X} {baseDir}/scripts/main.ts <url> --translate zh-Hans
# List available transcripts
${BUN_X} {baseDir}/scripts/main.ts <url> --list
# Force re-fetch (ignore cache)
${BUN_X} {baseDir}/scripts/main.ts <url> --refresh
```
Options
| Option | Description | Default |
|---|---|---|
| `<youtube-url-or-id>` | YouTube URL or video ID (multiple allowed) | Required |
| `--languages` | Language codes, comma-separated, in priority order | `en` |
| `--format` | Output format: `text`, `srt` | `text` |
| `--translate` | Translate to specified language code | |
| `--list` | List available transcripts instead of fetching | |
| `--timestamps` | Include timestamps per paragraph | on |
| `--no-timestamps` | Disable timestamps | |
| `--chapters` | Chapter segmentation from video description | |
| `--speakers` | Raw transcript with metadata for speaker identification | |
| | Skip auto-generated transcripts | |
| `--exclude-manually-created` | Skip manually created transcripts | |
| `--refresh` | Force re-fetch, ignore cached data | |
| | Save to specific file path | auto-generated |
| | Base output directory | `youtube-transcript` |
Input Formats
Accepts any of these as video input:
- Full URL: `https://www.youtube.com/watch?v=dQw4w9WgXcQ`
- Short URL: `https://youtu.be/dQw4w9WgXcQ`
- Embed URL: `https://www.youtube.com/embed/dQw4w9WgXcQ`
- Shorts URL: `https://www.youtube.com/shorts/dQw4w9WgXcQ`
- Video ID: `dQw4w9WgXcQ`
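The input forms listed above all carry the same video ID, so they can be normalized with one small function. The following is a minimal sketch, not the code in `scripts/main.ts`: `extractVideoId` is a hypothetical name, and the sketch assumes IDs are 11 characters drawn from letters, digits, `_`, and `-`.

```typescript
// Hypothetical helper for illustration; not taken from scripts/main.ts.
// Assumption: a video ID is exactly 11 characters from [A-Za-z0-9_-].
const ID_PATTERN = /^[A-Za-z0-9_-]{11}$/;

function extractVideoId(input: string): string | null {
  const trimmed = input.trim();

  // A bare video ID is accepted as-is.
  if (ID_PATTERN.test(trimmed)) return trimmed;

  // Covers watch?v=ID, youtu.be/ID, /embed/ID, and /shorts/ID.
  const match = trimmed.match(
    /(?:[?&]v=|youtu\.be\/|\/embed\/|\/shorts\/)([A-Za-z0-9_-]{11})(?![A-Za-z0-9_-])/
  );
  return match?.[1] ?? null;
}
```

Returning `null` for unrecognized input lets a caller report a clear error before any network request is made.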
Output Formats
| Format | Extension | Description |
|---|---|---|
| `text` | `.md` | Markdown with frontmatter, title heading, summary, optional TOC/cover/timestamps/chapters/speakers |
| `srt` | `.srt` | SubRip subtitle format for video players |
Output Directory
youtube-transcript/
├── .index.json                      # Video ID → directory path mapping (for cache lookup)
└── {channel-slug}/{title-full-slug}/
    ├── meta.json                    # Video metadata (title, channel, description, duration, chapters, etc.)
    ├── transcript-raw.json          # Raw transcript snippets from YouTube API (cached)
    ├── transcript-sentences.json    # Sentence-segmented transcript (split by punctuation, merged across snippets)
    ├── imgs/
    │   └── cover.jpg                # Video thumbnail
    ├── transcript.md                # Markdown transcript (generated from sentences)
    └── transcript.srt               # SRT subtitle (generated from raw snippets, if --format srt)
- `{channel-slug}`: Channel name in kebab-case
- `{title-full-slug}`: Full video title in kebab-case

The `--list` mode outputs to stdout only (no file saved).
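To make the `{channel-slug}/{title-full-slug}` layout concrete, here is a rough sketch of how such a path could be derived. The names `toKebabCase` and `transcriptDir` are hypothetical, and the slug rule is an assumption: this version keeps only ASCII letters and digits, while the actual script may handle non-Latin titles differently.

```typescript
// Hypothetical slug rule, for illustration only:
// lowercase, then collapse every run of non-alphanumeric ASCII into "-".
function toKebabCase(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, ""); // drop leading and trailing dashes
}

// Builds "<base>/<channel-slug>/<title-full-slug>".
function transcriptDir(base: string, channel: string, title: string): string {
  return `${base}/${toKebabCase(channel)}/${toKebabCase(title)}`;
}
```

For example, `transcriptDir("youtube-transcript", "Rick Astley", "Never Gonna Give You Up (Official Video)")` yields `youtube-transcript/rick-astley/never-gonna-give-you-up-official-video`.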
Caching
On first fetch, the script saves:
- `meta.json`: video metadata, chapters, cover image path, language info
- `transcript-raw.json`: raw transcript snippets from YouTube API (`{ text, start, duration }[]`)
- `transcript-sentences.json`: sentence-segmented transcript (`{ text, start: "HH:mm:ss", end: "HH:mm:ss" }[]`), split at sentence-ending punctuation, with timestamps allocated proportionally by character length and CJK-aware text merging
- `imgs/cover.jpg`: video thumbnail
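The proportional timestamp allocation used for `transcript-sentences.json` can be illustrated with a small sketch. It is not the script's implementation: `splitSentences` and `formatTime` are hypothetical names, the punctuation set (`. ! ?` plus their CJK forms) is an assumption, and the input is a single merged span whose start and end are given in seconds.

```typescript
interface Sentence {
  text: string;
  start: string; // "HH:mm:ss"
  end: string;   // "HH:mm:ss"
}

function formatTime(seconds: number): string {
  const total = Math.floor(seconds);
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  const h = Math.floor(total / 3600);
  const m = Math.floor((total % 3600) / 60);
  return `${pad(h)}:${pad(m)}:${pad(total % 60)}`;
}

// Assumed punctuation set; the real script may use a different one.
function splitSentences(text: string, start: number, end: number): Sentence[] {
  const parts = text.match(/[^.!?。！？]+[.!?。！？]*/g) ?? [];
  const sentences = parts.map((p) => p.trim()).filter((p) => p.length > 0);
  const totalChars = sentences.reduce((sum, s) => sum + s.length, 0);

  let cursor = start;
  return sentences.map((s) => {
    // Each sentence receives a share of the span proportional to its length.
    const share = ((end - start) * s.length) / totalChars;
    const sentence = { text: s, start: formatTime(cursor), end: formatTime(cursor + share) };
    cursor += share;
    return sentence;
  });
}
```

Character length is only a proxy for speaking time, so the resulting boundaries are estimates rather than exact cue times.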
Subsequent runs for the same video use cached data (no network calls). Use `--refresh` to force re-fetch. If a different language is requested, the cache is automatically refreshed.
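The cache rules above amount to a small decision function. The sketch below is one possible reading, not the script's logic: `CachedMeta` and `shouldFetch` are hypothetical names, the real `meta.json` fields are not documented here, and treating the cache as valid when its language appears anywhere in the requested list is an assumption.

```typescript
// Hypothetical shape; the actual meta.json layout may differ.
interface CachedMeta {
  videoId: string;
  language: string;
}

function shouldFetch(
  cached: CachedMeta | null,
  requestedLanguages: string[],
  refresh: boolean
): boolean {
  // No cache yet, or the user forced a re-fetch.
  if (refresh || cached === null) return true;

  // Assumption: a cached transcript is reusable if its language
  // is one of the requested languages.
  return requestedLanguages.indexOf(cached.language) === -1;
}
```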
SRT output (`--format srt`) is generated from `transcript-raw.json`. Text/markdown output uses `transcript-sentences.json` for natural sentence boundaries.
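Since SRT output comes from the raw `{ text, start, duration }` snippets, the conversion is mostly timestamp formatting. A minimal sketch follows, assuming `start` and `duration` are in seconds; `srtTime` and `toSrt` are hypothetical names rather than functions from the script.

```typescript
interface Snippet {
  text: string;
  start: number;    // assumed to be seconds
  duration: number; // assumed to be seconds
}

// Formats seconds as the SubRip timestamp "HH:MM:SS,mmm".
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const pad = (n: number, width: number): string => {
    let out = String(n);
    while (out.length < width) out = "0" + out;
    return out;
  };
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}

// One numbered cue per snippet, separated by a blank line.
function toSrt(snippets: Snippet[]): string {
  return snippets
    .map((sn, i) => `${i + 1}\n${srtTime(sn.start)} --> ${srtTime(sn.start + sn.duration)}\n${sn.text}\n`)
    .join("\n");
}
```

Working from raw snippets keeps each cue aligned with the original caption timing, whereas sentence-level timestamps are interpolated.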
Workflow
When user provides a YouTube URL and wants the transcript:
- Run with `--list` first if the user hasn't specified a language, to show available options
- Always single-quote the URL when running the script. zsh treats `?` as a glob wildcard, so an unquoted YouTube URL causes "no matches found". Use `'https://www.youtube.com/watch?v=ID'`
- Default: run with `--chapters --speakers` for the richest output (chapters + speaker identification)
- The script auto-saves cached data + output file and prints the file path
- For `--speakers` mode: after the script saves the raw file, follow the speaker identification workflow below to post-process with speaker labels
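When the command line is assembled programmatically, the single-quoting rule can be applied with a small helper. This is a sketch under the assumption of a POSIX-style shell (sh, bash, zsh); `shellQuote` is a hypothetical name.

```typescript
// Wraps a value in single quotes so the shell passes it through literally.
// An embedded single quote is written as '\'' (close quote, escaped quote, reopen).
function shellQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}
```

Inside single quotes the `?` in a watch URL is no longer treated as a glob character, which avoids the zsh "no matches found" failure.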
When the user only wants a cover image or metadata, running the script with any option will also cache `meta.json` and `imgs/cover.jpg`.
When re-formatting the same video (e.g., first text then SRT), the cached data is reused — no re-fetch needed.
Chapter & Speaker Workflow
Chapters (`--chapters`)
The script parses chapter timestamps from the video description, segments the transcript by chapter boundaries, groups snippets into readable paragraphs, and saves the result as Markdown with a Table of Contents. No further processing needed.
If no chapter timestamps exist in the description, the transcript is output as grouped paragraphs without chapter headings.
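Chapter parsing of this kind can be sketched as follows. The sketch is not the script's parser: `parseChapters` and `chapterIndexAt` are hypothetical names, and it assumes one chapter per description line in the form `M:SS Title` or `H:MM:SS Title`.

```typescript
interface Chapter {
  title: string;
  start: number; // seconds from the beginning of the video
}

// Assumption: chapter lines begin with a timestamp followed by the title.
function parseChapters(description: string): Chapter[] {
  const chapters: Chapter[] = [];
  for (const line of description.split("\n")) {
    const m = line.trim().match(/^(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\s+(.+)$/);
    if (!m) continue; // ordinary description text
    const hours = m[1] ? Number(m[1]) : 0;
    chapters.push({
      title: (m[4] ?? "").trim(),
      start: hours * 3600 + Number(m[2]) * 60 + Number(m[3]),
    });
  }
  return chapters;
}

// Index of the chapter that contains time t, or -1 if t precedes the first chapter.
function chapterIndexAt(chapters: Chapter[], t: number): number {
  let index = -1;
  chapters.forEach((chapter, i) => {
    if (chapter.start <= t) index = i;
  });
  return index;
}
```

An empty result from `parseChapters` corresponds to the fallback described above, where the transcript is emitted as grouped paragraphs without chapter headings.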
Speaker Identification (`--speakers`)
Speaker identification requires AI processing. The script outputs a raw transcript file containing:
- YAML frontmatter with video metadata (title, channel, date, cover, description, language)
- Video description (for speaker name extraction)
- Chapter list from description (if available)
- Raw transcript in SRT format (pre-computed start/end timestamps, token-efficient)
After the script saves the raw file, spawn a sub-agent (use a cheaper model like Sonnet for cost efficiency) to process speaker identification:
- Read the saved file
- Read the prompt template at `{baseDir}/prompts/speaker-transcript.md`
- Process the raw transcript following the prompt:
  - Identify speakers using video metadata (title → guest, channel → host, description → names)
  - Detect speaker turns from conversation flow, question-answer patterns, and contextual cues
  - Segment into chapters (use description chapters if available, else create from topic shifts)
  - Format with speaker labels, paragraph grouping (2-4 sentences), and timestamps
- Overwrite the file with the processed transcript (keep the YAML frontmatter)
When `--speakers` is used, `--chapters` is implied: the processed output always includes chapter segmentation.
Error Cases
| Error | Meaning |
|---|---|
| Transcripts disabled | Video has no captions at all |
| No transcript found | Requested language not available |
| Video unavailable | Video deleted, private, or region-locked |
| IP blocked | Too many requests, try again later |
| Age restricted | Video requires login for age verification |