streaming-stt-whisper
Whisper Chunked Streaming STT
Use this skill when OpenAI Whisper is the preferred STT provider, especially when a single API key should cover both LLM and speech. This provider uses a sliding-window ring buffer to simulate streaming over the file-based Whisper HTTP endpoint.
Prefer this over the Deepgram adapter when real-time WebSocket streaming is not required, when the user wants to route through a Whisper-compatible server (e.g. a local Faster-Whisper instance, or Groq), or when OpenAI is the only configured provider.
Setup
Set `OPENAI_API_KEY` in the environment or agent secrets store. For local endpoints, override `baseUrl` in `providerOptions`.
Configuration
```json
{
  "voice": {
    "stt": "whisper"
  }
}
```

For a local Faster-Whisper endpoint:

```json
{
  "voice": {
    "stt": "whisper",
    "providerOptions": {
      "model": "whisper-1",
      "language": "en",
      "baseUrl": "http://localhost:8000"
    }
  }
}
```

Provider Rules
- Audio is accumulated in a 1 s sliding window with 200 ms overlap to avoid word boundary clipping.
- The previous chunk transcript is forwarded as `prompt` to the next request for cross-chunk continuity.
- On fetch failure the provider emits `error` and continues; the session does not crash.
- Use `language` to force a specific language code (BCP-47); omit for automatic detection.
- Compatible with any OpenAI `/v1/audio/transcriptions`-compatible server.
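
The windowing and prompt-forwarding rules above can be sketched roughly as follows. This is a minimal illustration, not the provider's actual code: the 16 kHz sample rate, the helper names `slidingWindows` and `requestFields`, and the `whisper-1` default are assumptions. Only `model`, `language`, and `prompt` are taken from the transcription endpoint's form fields.

```typescript
// Assumption: 16 kHz mono PCM. The source does not state the audio format.
const SAMPLE_RATE = 16_000;
const WINDOW_SAMPLES = SAMPLE_RATE; // 1 s window
const OVERLAP_SAMPLES = (SAMPLE_RATE * 200) / 1000; // 200 ms overlap

/**
 * Hypothetical helper: cut a sample buffer into full windows where each
 * window starts `overlap` samples before the previous window ended.
 * A trailing partial window is not emitted here; it would be left for flush().
 */
function slidingWindows(
  samples: Float32Array,
  windowSize: number,
  overlap: number,
): Float32Array[] {
  if (overlap >= windowSize) {
    throw new Error("overlap must be smaller than the window");
  }
  const step = windowSize - overlap;
  const windows: Float32Array[] = [];
  for (let start = 0; start + windowSize <= samples.length; start += step) {
    // subarray() returns a view, so no audio is copied.
    windows.push(samples.subarray(start, start + windowSize));
  }
  return windows;
}

interface TranscribeOptions {
  model?: string;
  language?: string;
}

/**
 * Hypothetical helper: build the text form fields for one chunk request.
 * The previous transcript is sent as `prompt`; `language` is omitted when
 * unset so the server can detect it automatically.
 */
function requestFields(
  options: TranscribeOptions,
  previousTranscript: string,
): Record<string, string> {
  const fields: Record<string, string> = {
    model: options.model ?? "whisper-1", // assumed default
  };
  if (options.language) fields["language"] = options.language;
  if (previousTranscript) fields["prompt"] = previousTranscript;
  return fields;
}
```

With these assumed numbers, consecutive windows start 800 ms apart, so a word cut off at the end of one window is heard whole at the start of the next.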
Events
| Event | Description |
|---|---|
| | Emitted after each chunk is transcribed |
| | Emitted after `flush()` completes |
| | RMS energy crossed threshold |
| | RMS energy dropped below threshold |
| `error` | Fetch failure (session continues) |
| | Session fully terminated |
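
The two RMS rows describe a simple energy gate. A rough sketch of how such a gate might work is shown below. The class name `EnergyGate`, the event names `speech_start` and `speech_end`, and the threshold value in the usage note are all hypothetical, since the source does not give them.

```typescript
/** Root-mean-square energy of one frame of PCM samples in the range [-1, 1]. */
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    const s = frame[i] ?? 0;
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / frame.length);
}

// Hypothetical event names; the real provider may use different ones.
type EnergyEvent = "speech_start" | "speech_end";

/** Reports an event only when the RMS energy crosses the threshold. */
class EnergyGate {
  private speaking = false;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  /** Returns an event on a threshold crossing, or null when nothing changed. */
  push(frame: Float32Array): EnergyEvent | null {
    const loud = rms(frame) > this.threshold;
    if (loud === this.speaking) return null;
    this.speaking = loud;
    return loud ? "speech_start" : "speech_end";
  }
}
```

A caller would create one gate per session, for example `new EnergyGate(0.02)`, and feed it each incoming frame. A production gate would likely add hysteresis or a hold time so that brief dips do not toggle the state.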
Examples
- "Use Whisper for live speech transcription during our voice session."
- "Transcribe my speech through a local Faster-Whisper server."
- "Use OpenAI for both the LLM and the STT provider."
Constraints
- Requires `OPENAI_API_KEY` or a compatible local endpoint via `providerOptions.baseUrl`.
- Latency is higher than native WebSocket providers (Deepgram) due to HTTP chunking overhead.
- Speaker diarization is not natively supported; use the `diarization` extension for post-processing.
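
The first constraint can be illustrated with a small resolver. This is a sketch under assumptions: `resolveEndpoint`, `EndpointOptions`, and the default base URL are illustrative names and values, not the provider's documented API. Only the `/v1/audio/transcriptions` path, `OPENAI_API_KEY`, and `providerOptions.baseUrl` come from this document.

```typescript
// Assumed default when no baseUrl override is configured.
const DEFAULT_BASE_URL = "https://api.openai.com";

interface EndpointOptions {
  baseUrl?: string;
}

interface ResolvedEndpoint {
  url: string;
  headers: Record<string, string>;
}

/**
 * Hypothetical helper: pick the transcription URL and auth header.
 * Fails early when neither an API key nor a local baseUrl is available.
 */
function resolveEndpoint(
  apiKey: string | undefined,
  options: EndpointOptions = {},
): ResolvedEndpoint {
  if (!apiKey && !options.baseUrl) {
    throw new Error("Set OPENAI_API_KEY or providerOptions.baseUrl");
  }
  // Strip trailing slashes so the joined path has exactly one separator.
  const base = (options.baseUrl ?? DEFAULT_BASE_URL).replace(/\/+$/, "");
  const headers: Record<string, string> = {};
  if (apiKey) headers["Authorization"] = `Bearer ${apiKey}`;
  return { url: `${base}/v1/audio/transcriptions`, headers };
}
```

In a Node process the first argument would typically be `process.env.OPENAI_API_KEY`. A local server that needs no key can be reached by passing `undefined` together with a `baseUrl`.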