mk-youtube-audio-transcribe


YouTube Audio Transcribe

Transcribe audio files to text using local whisper.cpp (no cloud API required).

Quick Start

```
/mk-youtube-audio-transcribe <audio_file> [model] [language] [--force]
```

Parameters

| Parameter | Required | Default | Description |
|---|---|---|---|
| audio_file | Yes | - | Path to audio file |
| model | No | auto | Model: auto, tiny, base, small, medium, large-v3, belle-zh, kotoba-ja |
| language | No | auto | Language code: en, ja, zh, auto (auto-detect) |
| --force | No | false | Force re-transcription even if a cached file exists |

Examples

  • /mk-youtube-audio-transcribe /path/to/audio/video.m4a
    - Transcribe with auto model selection
  • /mk-youtube-audio-transcribe video.m4a auto zh
    - Auto-select best model for Chinese → belle-zh
  • /mk-youtube-audio-transcribe video.m4a auto ja
    - Auto-select best model for Japanese → kotoba-ja
  • /mk-youtube-audio-transcribe audio.mp3 small en
    - Use small model, force English
  • /mk-youtube-audio-transcribe podcast.wav medium ja
    - Use medium model (explicit), Japanese

How it Works

  1. Execute:
    {baseDir}/scripts/transcribe.sh "<audio_file>" "<model>" "<language>"
  2. Auto-download model if not found (with progress)
  3. Convert audio to 16kHz mono WAV using ffmpeg
  4. Run whisper-cli for transcription
  5. Save full JSON to
    {baseDir}/data/<filename>.json
  6. Save plain text to
    {baseDir}/data/<filename>.txt
  7. Return file paths and metadata
```
┌─────────────────────────────┐
│      transcribe.sh          │
│  audio_file, [model], [lang]│
└──────────────┬──────────────┘
┌─────────────────────────────┐
│   ffmpeg: convert to WAV    │
│   16kHz, mono, pcm_s16le    │
└──────────────┬──────────────┘
┌─────────────────────────────┐
│   whisper-cli: transcribe   │
│   with Metal acceleration   │
└──────────────┬──────────────┘
┌─────────────────────────────┐
│   Save to files             │
│   .json (full) + .txt       │
└──────────────┬──────────────┘
┌─────────────────────────────┐
│   Return file paths         │
│   {file_path, text_file_path}│
└─────────────────────────────┘
```
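Steps 3 and 4 above can be sketched as shell commands. This is a hedged illustration only: the flags and model path are assumptions, and the real logic lives in scripts/transcribe.sh.

```shell
# Illustrative sketch of steps 3-4 (assumed flags; see scripts/transcribe.sh
# for the actual implementation). The commands are built as strings here
# rather than executed.
in_file="video.m4a"
wav_file="${in_file%.*}.wav"   # video.wav

# Step 3: convert to the 16 kHz mono 16-bit PCM WAV that whisper.cpp expects
convert_cmd="ffmpeg -y -i $in_file -ar 16000 -ac 1 -c:a pcm_s16le $wav_file"

# Step 4: transcribe with whisper-cli (model path and flags are illustrative)
whisper_cmd="whisper-cli -m models/ggml-medium.bin -f $wav_file -l auto -oj -otxt"

echo "$convert_cmd"
echo "$whisper_cmd"
```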

Output Format

Success:

```json
{
  "status": "success",
  "file_path": "{baseDir}/data/20091025__VIDEO_ID.json",
  "text_file_path": "{baseDir}/data/20091025__VIDEO_ID.txt",
  "language": "en",
  "duration": "3:32",
  "model": "medium",
  "char_count": 12345,
  "line_count": 100,
  "text_char_count": 10000,
  "text_line_count": 50,
  "cached": false,
  "video_id": "dQw4w9WgXcQ",
  "title": "Video Title",
  "channel": "Channel Name",
  "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
}
```
Cache hit (returns existing transcription):

```json
{
  "status": "success",
  "file_path": "{baseDir}/data/20091025__VIDEO_ID.json",
  "cached": true,
  ...
}
```
Error (general):

```json
{
  "status": "error",
  "message": "Error description"
}
```
Error (unknown model):

```json
{
  "status": "error",
  "error_code": "UNKNOWN_MODEL",
  "message": "Unknown model: invalid-name",
  "available_models": ["tiny", "base", "small", "medium", "large-v3", "large-v3-turbo", "belle-zh", "kotoba-ja", "kotoba-ja-q5"]
}
```
When you receive an `UNKNOWN_MODEL` error, suggest a valid model from the `available_models` list.
Error (model not found):

```json
{
  "status": "error",
  "error_code": "MODEL_NOT_FOUND",
  "message": "Model 'medium' not found. Please download it first.",
  "model": "medium",
  "model_size": "1.4GB",
  "download_url": "https://huggingface.co/...",
  "download_command": "curl -L --progress-bar -o '/path/to/models/ggml-medium.bin' 'https://...' 2>&1"
}
```
When you receive a `MODEL_NOT_FOUND` error:
  1. Inform the user: "Downloading model '{model}' ({model_size})..."
  2. Execute `download_command` using the Bash tool with `timeout: 1800000` (30 minutes)
  3. After the download completes, re-run the original transcribe command
Error (model corrupted):

```json
{
  "status": "error",
  "error_code": "MODEL_CORRUPTED",
  "message": "Model 'medium' is corrupted or incomplete. Please re-download.",
  "model": "medium",
  "model_size": "1.4GB",
  "expected_sha256": "6c14d5adee5f86394037b4e4e8b59f1673b6cee10e3cf0b11bbdbee79c156208",
  "actual_sha256": "def456...",
  "model_path": "/path/to/models/ggml-medium.bin",
  "download_command": "rm '/path/to/models/ggml-medium.bin' && curl -L --progress-bar -o '/path/to/models/ggml-medium.bin' 'https://...' 2>&1"
}
```
When you receive a `MODEL_CORRUPTED` error:
  1. Inform the user: "Model '{model}' is corrupted. Re-downloading ({model_size})..."
  2. Execute `download_command` (removes the corrupted file and re-downloads) using the Bash tool with `timeout: 1800000` (30 minutes)
  3. After the download completes, re-run the original transcribe command
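The error-handling rules above amount to a dispatch on `error_code`. The helper below is a hypothetical sketch, not part of transcribe.sh; real responses are JSON, and here the code is passed in directly for illustration.

```shell
# Hypothetical sketch of the error-handling flow described above.
# $1 is the error_code field from the JSON response (empty for general errors).
handle_transcribe_error() {
  case "$1" in
    UNKNOWN_MODEL)   echo "suggest a model from available_models" ;;
    MODEL_NOT_FOUND) echo "run download_command (timeout 1800000), then retry" ;;
    MODEL_CORRUPTED) echo "run download_command to re-download, then retry" ;;
    *)               echo "report message to the user" ;;
  esac
}

handle_transcribe_error MODEL_NOT_FOUND
```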

Output Fields

| Field | Description |
|---|---|
| file_path | Absolute path to JSON file (with segments) |
| text_file_path | Absolute path to plain text file |
| language | Detected language code |
| duration | Audio duration |
| model | Model used for transcription |
| char_count | Character count of JSON file |
| line_count | Line count of JSON file |
| text_char_count | Character count of plain text file |
| text_line_count | Line count of plain text file |
| video_id | YouTube video ID (from centralized metadata store) |
| title | Video title (from centralized metadata store) |
| channel | Channel name (from centralized metadata store) |
| url | Full video URL (from centralized metadata store) |

Filename Format

Output files preserve the input audio filename's unified naming format, with a date prefix:

```
{YYYYMMDD}__{video_id}.{ext}
```

Example: `20091025__dQw4w9WgXcQ.json`
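As a sketch, the output names could be assembled like this (the variable names are illustrative, not taken from transcribe.sh):

```shell
# Illustrative only: assemble the {YYYYMMDD}__{video_id}.{ext} output names.
upload_date="20091025"
video_id="dQw4w9WgXcQ"
json_file="${upload_date}__${video_id}.json"
txt_file="${upload_date}__${video_id}.txt"
echo "$json_file"   # 20091025__dQw4w9WgXcQ.json
```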

JSON File Format

The JSON file at `file_path` contains:

```json
{
  "text": "Full transcription text...",
  "language": "en",
  "duration": "3:32",
  "model": "medium",
  "segments": [
    {
      "start": "00:00:00.000",
      "end": "00:00:05.000",
      "text": "First segment..."
    }
  ]
}
```
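One way to consume the `segments` array is with jq. This is a hedged example: it assumes jq is installed, and a tiny inline sample stands in for a real transcript file.

```shell
# Hedged example (assumes jq is installed; field names as documented above).
# An inline sample stands in for a real {YYYYMMDD}__{video_id}.json file.
sample='{"segments":[{"start":"00:00:00.000","end":"00:00:05.000","text":"First segment..."}]}'
echo "$sample" | jq -r '.segments[] | "[\(.start) -> \(.end)] \(.text)"'
```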

Models

Standard Models

| Model | Size | RAM | Speed | Accuracy |
|---|---|---|---|---|
| auto | - | - | - | Auto-select based on language (default) |
| tiny | 74MB | ~273MB | Fastest | Low |
| base | 141MB | ~388MB | Fast | Medium |
| small | 465MB | ~852MB | Moderate | Good |
| medium | 1.4GB | ~2.1GB | Slow | High |
| large-v3 | 2.9GB | ~3.9GB | Slowest | Best |
| large-v3-turbo | 1.5GB | ~2.1GB | Moderate | High (optimized for speed) |

Language-Specialized Models

| Model | Language | Size | Description |
|---|---|---|---|
| belle-zh | Chinese | 1.5GB | BELLE-2 Chinese-specialized model |
| kotoba-ja | Japanese | 1.4GB | kotoba-tech Japanese-specialized model |
| kotoba-ja-q5 | Japanese | 513MB | Quantized version (faster, smaller) |

Auto-Selection (model=auto)

When model is `auto` (the default), the system automatically selects the best model based on language:

| Language | Auto-Selected Model |
|---|---|
| zh | belle-zh (Chinese-specialized) |
| ja | kotoba-ja (Japanese-specialized) |
| others | medium (general purpose) |

Example: `/mk-youtube-audio-transcribe video.m4a auto zh` → uses `belle-zh`
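The mapping above can be sketched as a small shell function. This is the assumed selection rule as documented, not the actual transcribe.sh implementation.

```shell
# Sketch of the documented auto-selection rule (assumed logic, not the
# actual transcribe.sh code): map a language code to a model name.
select_model() {
  case "$1" in
    zh) echo "belle-zh" ;;    # Chinese-specialized
    ja) echo "kotoba-ja" ;;   # Japanese-specialized
    *)  echo "medium" ;;      # general purpose
  esac
}

select_model zh   # belle-zh
```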

Notes

  • File caching: if a transcription already exists for this video, it is reused (returns `cached: true`)
  • Force refresh: use the `--force` flag to re-transcribe even if a cached file exists
  • Specify the language for best results; this enables auto-selection of specialized models (zh → belle-zh, ja → kotoba-ja)
  • Use the Read tool to get file content from `file_path` or `text_file_path`
  • Models must be downloaded before first use; a missing model returns a `MODEL_NOT_FOUND` error with a download command
  • Uses Metal acceleration on macOS for faster processing
  • Supports automatic language detection
  • Audio is converted to 16kHz WAV for optimal results
  • Requires ffmpeg and whisper-cli (pre-built in bin/)
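The caching and `--force` notes above can be sketched as a decision helper (assumed logic; transcribe.sh performs the real check):

```shell
# Sketch of the documented caching rule: reuse an existing transcript
# unless --force is given. Assumed logic, not the actual script.
should_transcribe() {  # usage: should_transcribe <output_file> [--force]
  if [ -e "$1" ] && [ "$2" != "--force" ]; then
    echo "cached"        # reuse existing transcription (cached: true)
  else
    echo "transcribe"    # run whisper-cli
  fi
}

touch /tmp/demo_transcript.json
should_transcribe /tmp/demo_transcript.json            # cached
should_transcribe /tmp/demo_transcript.json --force    # transcribe
```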

Model Download

Models must be downloaded before transcription. When you receive a `MODEL_NOT_FOUND` error, execute the `download_command` with `timeout: 1800000`.

In terminal (to see progress bar)

```bash
./scripts/download-model.sh medium      # 1.4GB
./scripts/download-model.sh belle-zh    # 1.5GB (Chinese)
./scripts/download-model.sh kotoba-ja   # 1.4GB (Japanese)
./scripts/download-model.sh --list      # Show all available models
```

Next Step

After transcription completes, invoke `/mk-youtube-transcript-summarize` with the `text_file_path` from the output to generate a structured summary:

```
/mk-youtube-transcript-summarize <text_file_path>
```

IMPORTANT: Always use the Skill tool to invoke `/mk-youtube-transcript-summarize`. Do NOT generate summaries directly without loading the skill; it contains critical rules for compression ratio, section structure, data preservation, and language handling.