mk-youtube-audio-transcribe
YouTube Audio Transcribe
Transcribe audio files to text using local whisper.cpp (no cloud API required).
Quick Start
```
/mk-youtube-audio-transcribe <audio_file> [model] [language] [--force]
```

Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| audio_file | Yes | - | Path to audio file |
| model | No | auto | Model: auto, tiny, base, small, medium, large-v3, belle-zh, kotoba-ja |
| language | No | auto | Language code: en, ja, zh, auto (auto-detect) |
| --force | No | false | Force re-transcribe even if cached file exists |
Examples
```bash
# Transcribe with auto model selection
/mk-youtube-audio-transcribe /path/to/audio/video.m4a

# Auto-select best model for Chinese → belle-zh
/mk-youtube-audio-transcribe video.m4a auto zh

# Auto-select best model for Japanese → kotoba-ja
/mk-youtube-audio-transcribe video.m4a auto ja

# Use small model, force English
/mk-youtube-audio-transcribe audio.mp3 small en

# Use medium model (explicit), Japanese
/mk-youtube-audio-transcribe podcast.wav medium ja
```
How it Works
- Execute: `{baseDir}/scripts/transcribe.sh "<audio_file>" "<model>" "<language>"`
- Auto-download the model if not found (with progress)
- Convert the audio to 16kHz mono WAV using ffmpeg
- Run whisper-cli for transcription
- Save the full JSON to `{baseDir}/data/<filename>.json`
- Save the plain text to `{baseDir}/data/<filename>.txt`
- Return file paths and metadata
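The ffmpeg and whisper-cli steps above can be sketched as command construction. This is a sketch only: the whisper-cli flags shown (`-m`, `-f`, `-l`, `-oj`) follow common whisper.cpp builds and are assumptions, not a copy of the bundled transcribe.sh.

```python
def build_commands(audio_file: str, model_path: str, language: str, wav_path: str):
    """Build the two pipeline commands (conversion + transcription)."""
    ffmpeg_cmd = [
        "ffmpeg", "-y", "-i", audio_file,
        "-ar", "16000",       # 16 kHz sample rate
        "-ac", "1",           # mono
        "-c:a", "pcm_s16le",  # 16-bit PCM
        wav_path,
    ]
    whisper_cmd = [
        "whisper-cli",
        "-m", model_path,     # ggml model file
        "-f", wav_path,
        "-l", language,       # "auto" enables language detection
        "-oj",                # also write JSON output (assumed flag)
    ]
    return ffmpeg_cmd, whisper_cmd

ffmpeg_cmd, whisper_cmd = build_commands(
    "video.m4a", "models/ggml-medium.bin", "en", "video.wav"
)
```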
┌─────────────────────────────┐
│ transcribe.sh │
│ audio_file, [model], [lang]│
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ ffmpeg: convert to WAV │
│ 16kHz, mono, pcm_s16le │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ whisper-cli: transcribe │
│ with Metal acceleration │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ Save to files │
│ .json (full) + .txt │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ Return file paths │
│ {file_path, text_file_path}│
└─────────────────────────────┘

Output Format
Success:

```json
{
  "status": "success",
  "file_path": "{baseDir}/data/20091025__VIDEO_ID.json",
  "text_file_path": "{baseDir}/data/20091025__VIDEO_ID.txt",
  "language": "en",
  "duration": "3:32",
  "model": "medium",
  "char_count": 12345,
  "line_count": 100,
  "text_char_count": 10000,
  "text_line_count": 50,
  "cached": false,
  "video_id": "dQw4w9WgXcQ",
  "title": "Video Title",
  "channel": "Channel Name",
  "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
}
```

Cache hit (returns the existing transcription):

```json
{
  "status": "success",
  "file_path": "{baseDir}/data/20091025__VIDEO_ID.json",
  "cached": true,
  ...
}
```

Error (general):

```json
{
  "status": "error",
  "message": "Error description"
}
```

Error (unknown model):

```json
{
  "status": "error",
  "error_code": "UNKNOWN_MODEL",
  "message": "Unknown model: invalid-name",
  "available_models": ["tiny", "base", "small", "medium", "large-v3", "large-v3-turbo", "belle-zh", "kotoba-ja", "kotoba-ja-q5"]
}
```

When you receive an `UNKNOWN_MODEL` error: suggest a valid model from `available_models`.

Error (model not found):

```json
{
  "status": "error",
  "error_code": "MODEL_NOT_FOUND",
  "message": "Model 'medium' not found. Please download it first.",
  "model": "medium",
  "model_size": "1.4GB",
  "download_url": "https://huggingface.co/...",
  "download_command": "curl -L --progress-bar -o '/path/to/models/ggml-medium.bin' 'https://...' 2>&1"
}
```

When you receive a `MODEL_NOT_FOUND` error:

- Inform the user: "Downloading model '{model}' ({model_size})..."
- Execute `download_command` using the Bash tool with `timeout: 1800000` (30 minutes)
- After the download completes: re-run the original transcribe command

Error (model corrupted):

```json
{
  "status": "error",
  "error_code": "MODEL_CORRUPTED",
  "message": "Model 'medium' is corrupted or incomplete. Please re-download.",
  "model": "medium",
  "model_size": "1.4GB",
  "expected_sha256": "6c14d5adee5f86394037b4e4e8b59f1673b6cee10e3cf0b11bbdbee79c156208",
  "actual_sha256": "def456...",
  "model_path": "/path/to/models/ggml-medium.bin",
  "download_command": "rm '/path/to/models/ggml-medium.bin' && curl -L --progress-bar -o '/path/to/models/ggml-medium.bin' 'https://...' 2>&1"
}
```

When you receive a `MODEL_CORRUPTED` error:

- Inform the user: "Model '{model}' is corrupted. Re-downloading ({model_size})..."
- Execute `download_command` (removes the corrupted file and re-downloads) using the Bash tool with `timeout: 1800000` (30 minutes)
- After the download completes: re-run the original transcribe command
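The error branches described in this section amount to a small dispatch over the `error_code` field. The sketch below illustrates that flow; the function and action strings are hypothetical placeholders, not part of transcribe.sh.

```python
import json

def next_action(raw: str) -> str:
    """Map a transcribe.sh JSON result to the follow-up action described above."""
    result = json.loads(raw)
    if result["status"] == "success":
        return "read-output-files"
    code = result.get("error_code")
    if code == "UNKNOWN_MODEL":
        # Suggest a valid model from the error's available_models list
        return "suggest:" + result["available_models"][0]
    if code in ("MODEL_NOT_FOUND", "MODEL_CORRUPTED"):
        # Run download_command with timeout: 1800000, then retry transcription
        return "download-then-retry"
    return "report:" + result.get("message", "unknown error")
```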
Output Fields
| Field | Description |
|---|---|
| `file_path` | Absolute path to JSON file (with segments) |
| `text_file_path` | Absolute path to plain text file |
| `language` | Detected language code |
| `duration` | Audio duration |
| `model` | Model used for transcription |
| `char_count` | Character count of JSON file |
| `line_count` | Line count of JSON file |
| `text_char_count` | Character count of plain text file |
| `text_line_count` | Line count of plain text file |
| `video_id` | YouTube video ID (from centralized metadata store) |
| `title` | Video title (from centralized metadata store) |
| `channel` | Channel name (from centralized metadata store) |
| `url` | Full video URL (from centralized metadata store) |
Filename Format
Output files preserve the input audio filename's unified naming format with a date prefix:

`{YYYYMMDD}__{video_id}.{ext}`

Example: `20091025__dQw4w9WgXcQ.json`
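A minimal sketch of this naming scheme (the helper name is hypothetical):

```python
from datetime import date

def output_basename(upload_date: date, video_id: str, ext: str) -> str:
    """Build the unified {YYYYMMDD}__{video_id}.{ext} output name."""
    return f"{upload_date:%Y%m%d}__{video_id}.{ext}"

print(output_basename(date(2009, 10, 25), "dQw4w9WgXcQ", "json"))
# → 20091025__dQw4w9WgXcQ.json
```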
JSON File Format
The JSON file at `file_path` contains:

```json
{
  "text": "Full transcription text...",
  "language": "en",
  "duration": "3:32",
  "model": "medium",
  "segments": [
    {
      "start": "00:00:00.000",
      "end": "00:00:05.000",
      "text": "First segment..."
    }
  ]
}
```
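The segments can be read back and rendered as timestamped lines, for example (a sketch using the structure above):

```python
import json

raw = """
{
  "text": "Full transcription text...",
  "language": "en",
  "segments": [
    {"start": "00:00:00.000", "end": "00:00:05.000", "text": "First segment..."}
  ]
}
"""
transcript = json.loads(raw)

# One "[start - end] text" line per segment
lines = [f"[{s['start']} - {s['end']}] {s['text']}" for s in transcript["segments"]]
print(lines[0])
# → [00:00:00.000 - 00:00:05.000] First segment...
```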
Models
Standard Models
| Model | Size | RAM | Speed | Accuracy |
|---|---|---|---|---|
| auto | - | - | - | Auto-select based on language (default) |
| tiny | 74MB | ~273MB | Fastest | Low |
| base | 141MB | ~388MB | Fast | Medium |
| small | 465MB | ~852MB | Moderate | Good |
| medium | 1.4GB | ~2.1GB | Slow | High |
| large-v3 | 2.9GB | ~3.9GB | Slowest | Best |
| large-v3-turbo | 1.5GB | ~2.1GB | Moderate | High (optimized for speed) |
Language-Specialized Models
| Model | Language | Size | Description |
|---|---|---|---|
| belle-zh | Chinese | 1.5GB | BELLE-2 Chinese-specialized model |
| kotoba-ja | Japanese | 1.4GB | kotoba-tech Japanese-specialized model |
| kotoba-ja-q5 | Japanese | 513MB | Quantized version (faster, smaller) |
Auto-Selection (model=auto)
When model is `auto` (the default), the system automatically selects the best model based on language:

| Language | Auto-Selected Model |
|---|---|
| zh | belle-zh (Chinese-specialized) |
| ja | kotoba-ja (Japanese-specialized) |
| others | medium (general purpose) |

Example: `/mk-youtube-audio-transcribe video.m4a auto zh` → uses `belle-zh`
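The selection rule above amounts to a small lookup (a sketch; the function name is hypothetical):

```python
SPECIALIZED = {"zh": "belle-zh", "ja": "kotoba-ja"}

def select_model(model: str, language: str) -> str:
    """Resolve model=auto per the table above; explicit models pass through."""
    if model != "auto":
        return model
    return SPECIALIZED.get(language, "medium")

print(select_model("auto", "zh"))   # belle-zh
print(select_model("auto", "en"))   # medium
print(select_model("small", "zh"))  # small (explicit choice wins)
```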
Notes
- File caching: if a transcription already exists for this video, it is reused (returns `cached: true`)
- Force refresh: use the `--force` flag to re-transcribe even if a cached file exists
- Specify the language for best results; this enables auto-selection of specialized models (zh → belle-zh, ja → kotoba-ja)
- Use the Read tool to get file content from `file_path` or `text_file_path`
- Models must be downloaded before first use; otherwise a `MODEL_NOT_FOUND` error is returned with a download command
- Uses Metal acceleration on macOS for faster processing
- Supports auto language detection
- Audio is converted to 16kHz WAV for optimal results
- Requires ffmpeg and whisper-cli (pre-built in bin/)
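The caching and force-refresh behavior in the first two notes can be sketched as a simple existence check (hypothetical helper):

```python
from pathlib import Path

def needs_transcription(data_dir: Path, basename: str, force: bool = False) -> bool:
    """Skip work when {basename}.json already exists, unless --force is set."""
    cached = (data_dir / f"{basename}.json").exists()
    return force or not cached
```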
Model Download
Models must be downloaded before transcription. When you receive a `MODEL_NOT_FOUND` error, execute the `download_command` with `timeout: 1800000`.

```bash
# In terminal (to see progress bar)
./scripts/download-model.sh medium     # 1.4GB
./scripts/download-model.sh belle-zh   # 1.5GB (Chinese)
./scripts/download-model.sh kotoba-ja  # 1.4GB (Japanese)
./scripts/download-model.sh --list     # Show all available models
```
Next Step
After transcription completes, invoke `/mk-youtube-transcript-summarize` with the `text_file_path` from the output to generate a structured summary:

```
/mk-youtube-transcript-summarize <text_file_path>
```

IMPORTANT: Always use the Skill tool to invoke `/mk-youtube-transcript-summarize`. Do NOT generate summaries directly without loading the skill — it contains critical rules for compression ratio, section structure, data preservation, and language handling.