whisper-transcribe

Audio Transcription with Whisper

You can transcribe audio and video files into text using OpenAI's Whisper API or compatible speech-to-text services. Support a wide range of audio formats (mp3, mp4, wav, m4a, webm, flac, ogg) with automatic language detection and optional translation to English.
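As a rough illustration of the API call described above, the sketch below sends one file to either the transcription or the translation endpoint. It assumes the official `openai` Python package (version 1.x) and the `whisper-1` model name. The helper name `transcribe` and its parameters are hypothetical, and a compatible service may expose a different interface.

```python
from pathlib import Path


def transcribe(client, path, translate_to_english=False):
    """Send one audio file to a Whisper-style endpoint and return the response.

    Assumption: `client` is an `openai.OpenAI()` instance (openai 1.x).
    """
    # The translations endpoint returns English text; the transcriptions
    # endpoint keeps the language that was spoken.
    if translate_to_english:
        endpoint = client.audio.translations
    else:
        endpoint = client.audio.transcriptions

    with Path(path).open("rb") as audio_file:
        return endpoint.create(
            model="whisper-1",               # assumed model name
            file=audio_file,
            response_format="verbose_json",  # includes language and segments
        )
```

With a real client this might be called as `result = transcribe(OpenAI(), "meeting.mp3")`, after which `result.text` and `result.language` would be expected to hold the transcript and the detected language.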

When transcribing, first check the file format and size. If the file exceeds the API's size limit (25MB for OpenAI Whisper), use ffmpeg to split it into smaller segments or compress it. For video files, extract the audio track with ffmpeg before sending to the transcription API. Always inform the user of the detected language and confidence level.
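The size check and the ffmpeg steps above could be sketched as follows. The helpers only build ffmpeg argument lists, so nothing runs until they are passed to `subprocess.run`. All function names are hypothetical, reading 25MB as 25,000,000 bytes is a deliberately conservative assumption, and the bitrate and sample-rate values are illustrative choices rather than requirements.

```python
import os

# Assumption: treat the 25MB limit as 25,000,000 bytes to stay on the safe side.
MAX_UPLOAD_BYTES = 25 * 1000 * 1000


def needs_reduction(path, limit=MAX_UPLOAD_BYTES):
    """Return True when the file is too large to upload in one request."""
    return os.path.getsize(path) > limit


def extract_audio_cmd(video_path, audio_path):
    """Build an ffmpeg command that drops the video stream and compresses audio."""
    # -vn: no video, -ac 1: mono, -ar 16000: 16 kHz, -b:a 64k: 64 kbit/s audio
    return ["ffmpeg", "-i", video_path, "-vn", "-ac", "1",
            "-ar", "16000", "-b:a", "64k", audio_path]


def split_cmd(audio_path, segment_seconds, out_pattern):
    """Build an ffmpeg command that cuts the file into fixed-length pieces."""
    # The segment muxer writes numbered files; -c copy avoids re-encoding.
    return ["ffmpeg", "-i", audio_path, "-f", "segment",
            "-segment_time", str(segment_seconds), "-c", "copy", out_pattern]


def segment_seconds_for(size_bytes, duration_seconds,
                        limit=MAX_UPLOAD_BYTES, safety=0.9):
    """Estimate a segment length that keeps each piece under the limit.

    Assumes a roughly constant bitrate; `safety` leaves some headroom.
    """
    return int(duration_seconds * limit * safety / size_bytes)
```

A caller might run `subprocess.run(split_cmd("talk.mp3", 600, "part_%03d.mp3"), check=True)` to produce ten-minute pieces. Encoding to mp3 also depends on the local ffmpeg build including an mp3 encoder.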

Present transcription results in a clean, readable format. For long recordings, add timestamps at regular intervals or at natural paragraph breaks. Support different output formats: plain text, SRT subtitles, VTT captions, and timestamped segments. When the user requests a summary alongside the transcription, provide both the full transcript and a concise summary.
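To make the subtitle output concrete, here is a minimal sketch that turns timestamped segments into SRT text. It assumes each segment is a dict with `start` and `end` in seconds plus a `text` field, which mirrors the shape of Whisper's verbose output but is an assumption here. The function names are hypothetical.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm (the SRT convention)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

VTT output would look similar, with a `WEBVTT` header line and a period instead of a comma before the milliseconds. Where the API can return `srt` or `vtt` directly as a response format, that is likely simpler than formatting by hand.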

For multi-speaker recordings, attempt speaker diarization when the API supports it, or offer to label speakers manually based on context. Handle background noise and poor audio quality gracefully -- flag low-confidence segments rather than silently producing incorrect text. Support batch transcription of multiple files in a directory.
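One possible way to flag low-confidence segments is sketched below. It assumes segments are dicts carrying `avg_logprob` and `no_speech_prob`, as in Whisper's verbose output. The thresholds of -1.0 and 0.6 are borrowed from the open-source Whisper decoding defaults and are assumptions that would need tuning. The function name and the `[unclear ...]` marker are hypothetical.

```python
def flag_low_confidence(segments, min_avg_logprob=-1.0, max_no_speech_prob=0.6):
    """Return one line per segment, marking those that look unreliable.

    A segment is treated as uncertain when its average token log-probability
    is below `min_avg_logprob` or when it is likely to contain no speech.
    """
    lines = []
    for seg in segments:
        uncertain = (
            seg["avg_logprob"] < min_avg_logprob
            or seg["no_speech_prob"] > max_no_speech_prob
        )
        text = seg["text"].strip()
        if uncertain:
            # Keep the text but tell the reader where to double-check.
            lines.append(f"[unclear {seg['start']:.1f}s-{seg['end']:.1f}s] {text}")
        else:
            lines.append(text)
    return lines
```

Marking segments in place, rather than dropping them, keeps the transcript complete while making the doubtful passages easy to find and review.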

Examples

"Transcribe this meeting recording: /path/to/meeting.mp3"
"Create SRT subtitles for this video file"
"Transcribe and summarize this 2-hour podcast episode"
"Transcribe all .wav files in the interviews/ directory"
"Translate this Spanish audio recording to English text"

Constraints

Maximum file size for OpenAI Whisper API: 25MB per request. Use ffmpeg to split larger files.
Supported audio formats: mp3, mp4, mpeg, mpga, m4a, wav, webm, flac, ogg.
Transcription accuracy depends on audio quality, background noise, and speaker clarity.
Speaker diarization (who said what) is not natively supported by all APIs and may require post-processing.
Real-time/streaming transcription is not supported; only file-based transcription is available.
API costs apply per minute of audio transcribed.
ffmpeg is required for audio conversion, splitting, and video audio extraction.
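For batch jobs, the supported-format constraint above can be applied before any file is uploaded. The sketch below collects candidate files from a directory using the extension list given in these constraints. The function name is hypothetical, and checking only the file extension is a simplifying assumption, since it does not verify the actual container or codec.

```python
from pathlib import Path

# Extensions taken from the supported-format list above.
SUPPORTED_EXTENSIONS = {
    ".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm", ".flac", ".ogg",
}


def find_audio_files(directory):
    """Return supported files in `directory` (non-recursive), sorted by name."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Each returned path can then go through the size check and the transcription step in turn, so one unsupported or oversized file does not stop the rest of the batch.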
