baoyu-youtube-transcript

YouTube Transcript

Downloads transcripts (subtitles/captions) from YouTube videos. Works with both manually created and auto-generated transcripts. No API key or browser required — uses YouTube's InnerTube API directly.

On first run it fetches video metadata and the cover image, then caches the raw data so later re-formatting is fast.

Script Directory

Scripts are in the `scripts/` subdirectory. `{baseDir}` is this SKILL.md's directory path. Resolve the `${BUN_X}` runtime as follows: if `bun` is installed, use `bun`; if `npx` is available, use `npx -y bun`; otherwise suggest installing bun. Replace `{baseDir}` and `${BUN_X}` with actual values.

| Script | Purpose |
|---|---|
| `scripts/main.ts` | Transcript download CLI |
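
The runtime resolution rule can be sketched as a small function. This is an illustration only: `resolveBunX`, `fillTemplate`, the `CommandProbe` callback, and the `/skills/yt` path are hypothetical names, not part of the skill.

```typescript
// Hypothetical probe supplied by the caller: reports whether a command
// exists on PATH (for example by wrapping `command -v`).
type CommandProbe = (command: string) => boolean;

// Returns the value to substitute for ${BUN_X}, or null when neither
// runtime is available and the user should be told to install bun.
function resolveBunX(isAvailable: CommandProbe): string | null {
  if (isAvailable("bun")) return "bun";        // bun installed: use it directly
  if (isAvailable("npx")) return "npx -y bun"; // otherwise run bun through npx
  return null;                                 // neither found: suggest installing bun
}

// Fills both placeholders in a command template taken from this document.
function fillTemplate(template: string, baseDir: string, bunX: string): string {
  return template.split("${BUN_X}").join(bunX).split("{baseDir}").join(baseDir);
}

// Example: a machine that has npx but not bun.
const onlyNpx: CommandProbe = (command) => command === "npx";
const bunX = resolveBunX(onlyNpx);
const filled = bunX === null
  ? null
  : fillTemplate("${BUN_X} {baseDir}/scripts/main.ts <url> --list", "/skills/yt", bunX);
```

Keeping the probe as a parameter makes the precedence (bun first, then npx) easy to check without touching the real environment.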

Usage

```bash
# Default: markdown with timestamps (English)
${BUN_X} {baseDir}/scripts/main.ts <youtube-url-or-id>

# Specify languages (priority order)
${BUN_X} {baseDir}/scripts/main.ts <url> --languages zh,en,ja

# Without timestamps
${BUN_X} {baseDir}/scripts/main.ts <url> --no-timestamps

# With chapter segmentation
${BUN_X} {baseDir}/scripts/main.ts <url> --chapters

# With speaker identification (requires AI post-processing)
${BUN_X} {baseDir}/scripts/main.ts <url> --speakers

# SRT subtitle file
${BUN_X} {baseDir}/scripts/main.ts <url> --format srt

# Translate transcript
${BUN_X} {baseDir}/scripts/main.ts <url> --translate zh-Hans

# List available transcripts
${BUN_X} {baseDir}/scripts/main.ts <url> --list

# Force re-fetch (ignore cache)
${BUN_X} {baseDir}/scripts/main.ts <url> --refresh
```

Options

| Option | Description | Default |
|---|---|---|
| `<url-or-id>` | YouTube URL or video ID (multiple allowed) | Required |
| `--languages <codes>` | Language codes, comma-separated, in priority order | `en` |
| `--format <fmt>` | Output format: `text`, `srt` | `text` |
| `--translate <code>` | Translate to specified language code | |
| `--list` | List available transcripts instead of fetching | |
| `--timestamps` | Include `[HH:MM:SS → HH:MM:SS]` timestamps per paragraph | on |
| `--no-timestamps` | Disable timestamps | |
| `--chapters` | Chapter segmentation from video description | |
| `--speakers` | Raw transcript with metadata for speaker identification | |
| `--exclude-generated` | Skip auto-generated transcripts | |
| `--exclude-manually-created` | Skip manually created transcripts | |
| `--refresh` | Force re-fetch, ignore cached data | |
| `-o, --output <path>` | Save to specific file path | auto-generated |
| `--output-dir <dir>` | Base output directory | `youtube-transcript` |

Input Formats

Accepts any of these as video input:

- Full URL: `https://www.youtube.com/watch?v=dQw4w9WgXcQ`
- Short URL: `https://youtu.be/dQw4w9WgXcQ`
- Embed URL: `https://www.youtube.com/embed/dQw4w9WgXcQ`
- Shorts URL: `https://www.youtube.com/shorts/dQw4w9WgXcQ`
- Video ID: `dQw4w9WgXcQ`
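
As a rough illustration of how these five input forms could be reduced to a video ID, here is a minimal sketch. It assumes IDs are 11 characters drawn from `[A-Za-z0-9_-]`, as in the examples above; `extractVideoId` and the patterns are hypothetical and may not match what `scripts/main.ts` actually does.

```typescript
// Assumption: a video ID is exactly 11 characters from [A-Za-z0-9_-].
const ID_PATTERN = /^[A-Za-z0-9_-]{11}$/;

// One pattern per URL form listed above; each captures the ID in group 1.
const URL_PATTERNS: RegExp[] = [
  /^https?:\/\/(?:www\.)?youtube\.com\/watch\?(?:.*&)?v=([A-Za-z0-9_-]{11})(?:[&#]|$)/,
  /^https?:\/\/youtu\.be\/([A-Za-z0-9_-]{11})(?:[?#/]|$)/,
  /^https?:\/\/(?:www\.)?youtube\.com\/(?:embed|shorts)\/([A-Za-z0-9_-]{11})(?:[?#/]|$)/,
];

// Returns the video ID, or null when the input matches none of the forms.
function extractVideoId(input: string): string | null {
  const trimmed = input.trim();
  if (ID_PATTERN.test(trimmed)) return trimmed; // bare video ID
  for (const pattern of URL_PATTERNS) {
    const match = pattern.exec(trimmed);
    if (match) return match[1];
  }
  return null;
}
```

For example, `extractVideoId("https://youtu.be/dQw4w9WgXcQ")` and `extractVideoId("dQw4w9WgXcQ")` both yield the same ID, while an unrelated URL yields `null`.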

Output Formats

| Format | Extension | Description |
|---|---|---|
| `text` | `.md` | Markdown with frontmatter (incl. `description`), title heading, summary, optional TOC/cover/timestamps/chapters/speakers |
| `srt` | `.srt` | SubRip subtitle format for video players |

Output Directory

```
youtube-transcript/
├── .index.json                          # Video ID → directory path mapping (for cache lookup)
└── {channel-slug}/{title-full-slug}/
    ├── meta.json                        # Video metadata (title, channel, description, duration, chapters, etc.)
    ├── transcript-raw.json              # Raw transcript snippets from YouTube API (cached)
    ├── transcript-sentences.json        # Sentence-segmented transcript (split by punctuation, merged across snippets)
    ├── imgs/
    │   └── cover.jpg                    # Video thumbnail
    ├── transcript.md                    # Markdown transcript (generated from sentences)
    └── transcript.srt                   # SRT subtitle (generated from raw snippets, if --format srt)
```

- `{channel-slug}`: Channel name in kebab-case
- `{title-full-slug}`: Full video title in kebab-case
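
To make the two slug placeholders concrete, here is a minimal sketch of a kebab-case conversion. It assumes "kebab-case" means lowercase ASCII words joined by single hyphens; how the real script handles non-Latin titles is not stated here, so `toKebabSlug` and `transcriptDir` are hypothetical helpers.

```typescript
// Assumption: only ASCII letters and digits are kept. A real implementation
// may well preserve CJK or other non-Latin characters.
function toKebabSlug(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // each run of other characters becomes one hyphen
    .replace(/^-+|-+$/g, "");    // drop hyphens left at either end
}

// Joins the output directory with the channel and title slugs.
function transcriptDir(outputDir: string, channel: string, title: string): string {
  return [outputDir, toKebabSlug(channel), toKebabSlug(title)].join("/");
}

const exampleDir = transcriptDir(
  "youtube-transcript",
  "Example Channel",
  "How It Works: Part 1 (Full Talk)"
);
// exampleDir === "youtube-transcript/example-channel/how-it-works-part-1-full-talk"
```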

The `--list` mode outputs to stdout only (no file saved).

Caching

On first fetch, the script saves:

- `meta.json`: video metadata, chapters, cover image path, language info
- `transcript-raw.json`: raw transcript snippets from YouTube API (`{ text, start, duration }[]`)
- `transcript-sentences.json`: sentence-segmented transcript (`{ text, start: "HH:mm:ss", end: "HH:mm:ss" }[]`), split by sentence-ending punctuation (`.?!…。？！` etc.), timestamps proportionally allocated by character length, CJK-aware text merging
- `imgs/cover.jpg`: video thumbnail
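
The sentence segmentation described for `transcript-sentences.json` can be sketched as follows. This is a simplified illustration, not the script's code: it always joins snippets with a single space (so it is not CJK-aware), it assumes `start` and `duration` are in seconds, and the names `segmentSentences` and `toClock` are hypothetical.

```typescript
interface Snippet { text: string; start: number; duration: number } // assumed seconds
interface Sentence { text: string; start: string; end: string }     // "HH:mm:ss"

const SENTENCE_ENDERS = ".?!…。？！";

function twoDigits(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

function toClock(seconds: number): string {
  const total = Math.floor(seconds);
  return [Math.floor(total / 3600), Math.floor((total % 3600) / 60), total % 60]
    .map(twoDigits)
    .join(":");
}

function segmentSentences(snippets: Snippet[]): Sentence[] {
  // Merge all snippets into one string and give every character a start and
  // end time, spreading each snippet's duration evenly over its characters.
  let merged = "";
  const startAt: number[] = [];
  const endAt: number[] = [];
  for (const snippet of snippets) {
    const piece = (merged.length > 0 ? " " : "") + snippet.text.trim();
    for (let i = 0; i < piece.length; i++) {
      startAt.push(snippet.start + (snippet.duration * i) / piece.length);
      endAt.push(snippet.start + (snippet.duration * (i + 1)) / piece.length);
    }
    merged += piece;
  }

  // Cut after the last character of each run of sentence-ending punctuation,
  // and once more at the very end of the text.
  const sentences: Sentence[] = [];
  let from = 0;
  for (let i = 0; i < merged.length; i++) {
    const isEnder = SENTENCE_ENDERS.indexOf(merged.charAt(i)) !== -1;
    const nextIsEnder =
      i + 1 < merged.length && SENTENCE_ENDERS.indexOf(merged.charAt(i + 1)) !== -1;
    const atEnd = i === merged.length - 1;
    if ((!isEnder || nextIsEnder) && !atEnd) continue;
    let begin = from;
    while (begin <= i && merged.charAt(begin) === " ") begin++; // skip leading spaces
    if (begin <= i) {
      sentences.push({
        text: merged.slice(begin, i + 1),
        start: toClock(startAt[begin]),
        end: toClock(endAt[i]),
      });
    }
    from = i + 1;
  }
  return sentences;
}

const demoSentences = segmentSentences([
  { text: "Hello there. How", start: 0, duration: 4 },
  { text: "are you? Fine.", start: 4, duration: 4 },
]);
```

In this example the question "How are you?" spans two raw snippets, and its start and end times are interpolated from the character positions inside each snippet.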

Subsequent runs for the same video use cached data (no network calls). Use `--refresh` to force re-fetch. If a different language is requested, the cache is automatically refreshed.
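
The cache rules above amount to a three-way check. The sketch below assumes a cached entry records the language it was fetched in; the `CachedEntry` and `FetchRequest` shapes and the `decideFetch` name are hypothetical.

```typescript
// Hypothetical shapes: the real meta.json and CLI options may differ.
interface CachedEntry { videoId: string; language: string }
interface FetchRequest { videoId: string; language: string; refresh: boolean }

type CacheDecision = "use-cache" | "fetch";

function decideFetch(request: FetchRequest, cached: CachedEntry | undefined): CacheDecision {
  if (request.refresh) return "fetch";                      // --refresh always re-fetches
  if (cached === undefined) return "fetch";                 // first run for this video
  if (cached.language !== request.language) return "fetch"; // different language requested
  return "use-cache";                                       // same video and language: no network calls
}
```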

SRT output (`--format srt`) is generated from `transcript-raw.json`. Text/markdown output uses `transcript-sentences.json` for natural sentence boundaries.
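
For illustration, converting raw `{ text, start, duration }` snippets into SubRip cues could look like the sketch below. It assumes `start` and `duration` are in seconds; `toSrt`, `srtTime`, and `padNumber` are hypothetical names rather than the script's own functions.

```typescript
interface RawSnippet { text: string; start: number; duration: number } // assumed seconds

function padNumber(value: number, width: number): string {
  let out = String(value);
  while (out.length < width) out = "0" + out;
  return out;
}

// SubRip timestamps have the form HH:MM:SS,mmm
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  return (
    padNumber(Math.floor(ms / 3600000), 2) + ":" +
    padNumber(Math.floor((ms % 3600000) / 60000), 2) + ":" +
    padNumber(Math.floor((ms % 60000) / 1000), 2) + "," +
    padNumber(ms % 1000, 3)
  );
}

// Each cue is: a 1-based index, a "start --> end" line, then the text.
function toSrt(snippets: RawSnippet[]): string {
  return snippets
    .map((s, i) =>
      [String(i + 1), srtTime(s.start) + " --> " + srtTime(s.start + s.duration), s.text].join("\n")
    )
    .join("\n\n") + "\n";
}
```

Working from the raw snippets keeps the cue timing that YouTube supplied, whereas the sentence file carries interpolated times.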

Workflow

When the user provides a YouTube URL and wants the transcript:

1. Run with `--list` first if the user hasn't specified a language, to show available options.
2. Always single-quote the URL when running the script. zsh treats `?` as a glob wildcard, so an unquoted YouTube URL causes "no matches found". Use `'https://www.youtube.com/watch?v=ID'`.
3. Default: run with `--chapters --speakers` for the richest output (chapters + speaker identification).
4. The script auto-saves cached data + output file and prints the file path.
5. For `--speakers` mode: after the script saves the raw file, follow the speaker identification workflow below to post-process with speaker labels.
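
When the command line is assembled programmatically, the quoting rule can be applied with a small helper. This is a sketch using standard POSIX single-quote escaping; `shellSingleQuote`, `buildCommand`, and the `/skills/yt` path are hypothetical.

```typescript
// Wraps a value in single quotes so the shell treats ?, &, and * literally.
// An embedded single quote is written as '\'' (close, escaped quote, reopen).
function shellSingleQuote(value: string): string {
  return "'" + value.split("'").join("'\\''") + "'";
}

// Builds the command string with the URL quoted and the flags appended.
function buildCommand(bunX: string, baseDir: string, url: string, flags: string[]): string {
  return [bunX, baseDir + "/scripts/main.ts", shellSingleQuote(url)].concat(flags).join(" ");
}

const defaultCommand = buildCommand(
  "bun",
  "/skills/yt",
  "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
  ["--chapters", "--speakers"]
);
// bun /skills/yt/scripts/main.ts 'https://www.youtube.com/watch?v=dQw4w9WgXcQ' --chapters --speakers
```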

When the user only wants a cover image or metadata, running the script with any option will also cache `meta.json` and `imgs/cover.jpg`.

When re-formatting the same video (e.g., first text then SRT), the cached data is reused — no re-fetch needed.

Chapter & Speaker Workflow

Chapters (`--chapters`)

The script parses chapter timestamps from the video description (e.g., `0:00 Introduction`), segments the transcript by chapter boundaries, groups snippets into readable paragraphs, and saves the result as `.md` with a Table of Contents. No further processing needed.

If no chapter timestamps exist in the description, the transcript is output as grouped paragraphs without chapter headings.
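
A minimal sketch of the chapter parsing step is shown below. It assumes each chapter line starts with `M:SS`, `MM:SS`, or `H:MM:SS` followed by a title, and that chapters appear in ascending time order; `parseChapters` and `chapterIndexAt` are hypothetical names, and the script's own rules may be stricter or looser.

```typescript
interface Chapter { title: string; startSeconds: number }

// Assumption: a timestamp at the start of the line, whitespace, then the title.
const CHAPTER_LINE = /^(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\s+(.+)$/;

function parseChapters(description: string): Chapter[] {
  const chapters: Chapter[] = [];
  for (const line of description.split("\n")) {
    const m = CHAPTER_LINE.exec(line.trim());
    if (!m) continue; // not a chapter line
    const hours = m[1] ? parseInt(m[1], 10) : 0;
    chapters.push({
      title: m[4].trim(),
      startSeconds: hours * 3600 + parseInt(m[2], 10) * 60 + parseInt(m[3], 10),
    });
  }
  return chapters;
}

// Returns the index of the chapter containing the given time, or -1 when the
// time falls before the first chapter (or there are no chapters at all).
function chapterIndexAt(chapters: Chapter[], seconds: number): number {
  let index = -1;
  for (let i = 0; i < chapters.length; i++) {
    if (chapters[i].startSeconds <= seconds) index = i;
  }
  return index;
}

const demoChapters = parseChapters(
  "Great talk!\n\n0:00 Introduction\n12:45 Q&A\n1:02:30 Closing remarks\nFollow us"
);
```

An empty result from `parseChapters` corresponds to the fallback case: grouped paragraphs without chapter headings.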

Speaker Identification (`--speakers`)

Speaker identification requires AI processing. The script outputs a raw `.md` file containing:

- YAML frontmatter with video metadata (title, channel, date, cover, description, language)
- Video description (for speaker name extraction)
- Chapter list from description (if available)
- Raw transcript in SRT format (pre-computed start/end timestamps, token-efficient)

After the script saves the raw file, spawn a sub-agent (use a cheaper model like Sonnet for cost efficiency) to process speaker identification:

1. Read the saved `.md` file.
2. Read the prompt template at `{baseDir}/prompts/speaker-transcript.md`.
3. Process the raw transcript following the prompt:
   - Identify speakers using video metadata (title → guest, channel → host, description → names)
   - Detect speaker turns from conversation flow, question-answer patterns, and contextual cues
   - Segment into chapters (use description chapters if available, else create from topic shifts)
   - Format with `**Speaker Name:**` labels, paragraph grouping (2-4 sentences), and `[HH:MM:SS → HH:MM:SS]` timestamps
4. Overwrite the `.md` file with the processed transcript (keep the YAML frontmatter).

When `--speakers` is used, `--chapters` is implied: the processed output always includes chapter segmentation.
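
The "overwrite but keep the YAML frontmatter" step can be sketched as a pure string operation. The sketch assumes the frontmatter is a leading block delimited by `---` lines with `\n` line endings; `splitFrontmatter` and `replaceBody` are hypothetical helpers.

```typescript
// Splits a Markdown document into its leading frontmatter block and the rest.
function splitFrontmatter(markdown: string): { frontmatter: string; body: string } {
  const match = /^---\n[\s\S]*?\n---\n/.exec(markdown);
  if (!match) return { frontmatter: "", body: markdown };
  return { frontmatter: match[0], body: markdown.slice(match[0].length) };
}

// Keeps the original frontmatter and swaps in the processed transcript body.
function replaceBody(original: string, processedBody: string): string {
  const parts = splitFrontmatter(original);
  const body = processedBody.replace(/^\n+/, ""); // avoid stacking blank lines
  return parts.frontmatter === "" ? body : parts.frontmatter + "\n" + body;
}

const rawFile =
  "---\ntitle: Demo\nchannel: Example\n---\n\n1\n00:00:00,000 --> 00:00:02,000\nHello\n";
const processedFile = replaceBody(rawFile, "**Host:** Hello [00:00:00 → 00:00:02]\n");
```

Only the body changes, so the title, channel, and other metadata written by the script survive the sub-agent's rewrite.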

Error Cases

| Error | Meaning |
|---|---|
| Transcripts disabled | Video has no captions at all |
| No transcript found | Requested language not available |
| Video unavailable | Video deleted, private, or region-locked |
| IP blocked | Too many requests, try again later |
| Age restricted | Video requires login for age verification |
