streaming-stt-whisper

Whisper Chunked Streaming STT

Use this skill when OpenAI Whisper is the preferred STT provider, especially when a single API key should cover both LLM and speech. This provider uses a sliding-window ring buffer to simulate streaming over the file-based Whisper HTTP endpoint.

Prefer this over the Deepgram adapter when real-time WebSocket streaming is not required, when the user wants to route through another Whisper-compatible server (e.g. a local Faster-Whisper instance or Groq's hosted API), or when OpenAI is the only configured provider.

Setup

Set `OPENAI_API_KEY` in the environment or agent secrets store. For local endpoints, override `baseUrl` in `providerOptions`.

Configuration

```json
{
  "voice": {
    "stt": "whisper"
  }
}
```

For a local Faster-Whisper endpoint:

```json
{
  "voice": {
    "stt": "whisper",
    "providerOptions": {
      "model": "whisper-1",
      "language": "en",
      "baseUrl": "http://localhost:8000"
    }
  }
}
```
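
As a rough illustration of how these options might map onto a transcription request, the sketch below builds the target URL and the form fields. `WhisperOptions`, `TranscriptionRequest`, and `buildRequest` are hypothetical names, and both the fallback host and the way `baseUrl` is joined with the `/v1/audio/transcriptions` path are assumptions rather than documented behavior of this provider.

```typescript
// Hypothetical option shape mirroring the providerOptions keys shown above.
interface WhisperOptions {
  model?: string;
  language?: string;
  baseUrl?: string;
}

// Hypothetical description of one request: where it goes and its text fields.
interface TranscriptionRequest {
  url: string;
  fields: Record<string, string>;
}

// Assumption: the standard path is appended to baseUrl, and the public
// OpenAI host is used when no override is configured.
function buildRequest(opts: WhisperOptions, prompt?: string): TranscriptionRequest {
  const base = (opts.baseUrl ?? "https://api.openai.com").replace(/\/+$/, "");
  const fields: Record<string, string> = { model: opts.model ?? "whisper-1" };
  if (opts.language) {
    fields["language"] = opts.language; // left out entirely for auto-detection
  }
  if (prompt) {
    fields["prompt"] = prompt; // transcript of the previous chunk
  }
  return { url: `${base}/v1/audio/transcriptions`, fields };
}

// Example: the local Faster-Whisper configuration from above.
const localRequest = buildRequest(
  { model: "whisper-1", language: "en", baseUrl: "http://localhost:8000/" },
  "previous chunk text",
);
```

In a real request these fields would be sent as multipart form data together with the audio chunk as the `file` part, with the API key passed as a bearer token.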

Provider Rules

Audio is accumulated in a 1 s sliding window with a 200 ms overlap to avoid clipping words at chunk boundaries.
The previous chunk's transcript is forwarded as `prompt` to the next request for cross-chunk continuity.
On fetch failure the provider emits `error` and continues; the session does not crash.
Use `language` to force a specific language code (BCP-47); omit it for automatic detection.
Compatible with any server that implements the OpenAI `/v1/audio/transcriptions` endpoint.
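
The windowing and continuity rules above can be sketched roughly as follows. This is a minimal illustration, not the provider's actual code: `windowRanges`, `transcribeWindows`, `Transcribe`, and `Emit` are hypothetical names, the 16 kHz sample rate in the example is an assumption, and the sketch does not attempt to merge text that repeats inside the overlap.

```typescript
// Compute [start, end) sample ranges for overlapping windows.
// Each window starts (windowMs - overlapMs) after the previous one,
// so consecutive windows share overlapMs of audio.
function windowRanges(
  totalSamples: number,
  sampleRate: number,
  windowMs: number = 1000,
  overlapMs: number = 200,
): [number, number][] {
  const size = Math.round((sampleRate * windowMs) / 1000);
  const step = size - Math.round((sampleRate * overlapMs) / 1000);
  if (size <= 0 || step <= 0) {
    throw new RangeError("window must be longer than overlap");
  }
  const ranges: [number, number][] = [];
  for (let start = 0; start < totalSamples; start += step) {
    const end = Math.min(start + size, totalSamples);
    ranges.push([start, end]);
    if (end === totalSamples) break; // the final (possibly short) window
  }
  return ranges;
}

// Hypothetical callback types: `Transcribe` stands in for the HTTP call.
type Transcribe = (range: [number, number], prompt: string) => Promise<string>;
type Emit = (event: "interim_transcript" | "error", payload: string) => void;

// Transcribe windows in order, passing the last successful transcript
// as the prompt for the next request. A failed request emits `error`
// and the loop moves on to the next window.
async function transcribeWindows(
  ranges: [number, number][],
  transcribe: Transcribe,
  emit: Emit,
): Promise<string> {
  let prompt = "";
  for (const range of ranges) {
    try {
      const text = await transcribe(range, prompt);
      emit("interim_transcript", text);
      prompt = text;
    } catch (err) {
      emit("error", err instanceof Error ? err.message : String(err));
    }
  }
  return prompt; // last successful transcript
}

// Example: 2.5 s of audio at an assumed 16 kHz sample rate.
const demoRanges = windowRanges(40000, 16000);
```

With these defaults each window advances by 800 ms, so `demoRanges` holds three ranges, the last one shorter than a full second.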

Events

| Event | Description |
| --- | --- |
| `interim_transcript` | Emitted after each chunk is transcribed |
| `final_transcript` | Emitted after `flush()` completes |
| `speech_start` | RMS energy crossed threshold |
| `speech_end` | RMS energy dropped below threshold |
| `error` | Fetch failure (session continues) |
| `close` | Session fully terminated |
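
To make the `speech_start` and `speech_end` rows more concrete, here is a minimal sketch of an RMS energy gate. The names `rms` and `createEnergyGate`, the frame size, and the threshold value of 0.02 are all assumptions for illustration; the provider's actual threshold and any smoothing it applies are not documented here.

```typescript
type SpeechEvent = "speech_start" | "speech_end";

// Root-mean-square energy of one frame of PCM samples in [-1, 1].
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  const sumOfSquares = frame.reduce((acc, sample) => acc + sample * sample, 0);
  return Math.sqrt(sumOfSquares / frame.length);
}

// Returns a function that reports a state change when the energy rises
// above the threshold or drops back to it or below, and null otherwise.
function createEnergyGate(threshold: number): (frame: Float32Array) => SpeechEvent | null {
  let speaking = false;
  return (frame: Float32Array): SpeechEvent | null => {
    const loud = rms(frame) > threshold;
    if (loud === speaking) return null; // no transition
    speaking = loud;
    return loud ? "speech_start" : "speech_end";
  };
}

// Example: silence, two loud frames, then silence again.
const gate = createEnergyGate(0.02); // hypothetical threshold
const silentFrame = new Float32Array(160);
const loudFrame = new Float32Array(160).fill(0.5);
const gateEvents = [silentFrame, loudFrame, loudFrame, silentFrame].map((f) => gate(f));
```

A bare threshold like this flips on every brief dip in energy; a production detector would likely add hysteresis or a hangover period, which this sketch leaves out.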

Examples

"Use Whisper for live speech transcription during our voice session."
"Transcribe my speech through a local Faster-Whisper server."
"Use OpenAI for both the LLM and the STT provider."

Constraints

Requires `OPENAI_API_KEY` or a compatible local endpoint via `providerOptions.baseUrl`.
Latency is higher than native WebSocket providers (Deepgram) due to HTTP chunking overhead.
Speaker diarization is not natively supported; use the `diarization` extension for post-processing.
