streaming-stt-whisper

Whisper Chunked Streaming STT

Use this skill when OpenAI Whisper is the preferred STT provider, especially when a single API key should cover both LLM and speech. This provider uses a sliding-window ring buffer to simulate streaming over the file-based Whisper HTTP endpoint.
Prefer this over the Deepgram adapter when real-time WebSocket streaming is not required, when the user wants to route through a Whisper-compatible server (for example a local Faster-Whisper instance, or Groq), or when OpenAI is the only configured provider.

Setup

Set `OPENAI_API_KEY` in the environment or agent secrets store. For local endpoints, override `baseUrl` in `providerOptions`.

Configuration

```json
{
  "voice": {
    "stt": "whisper"
  }
}
```

For a local Faster-Whisper endpoint:

```json
{
  "voice": {
    "stt": "whisper",
    "providerOptions": {
      "model": "whisper-1",
      "language": "en",
      "baseUrl": "http://localhost:8000"
    }
  }
}
```
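
To make the effect of `providerOptions` concrete, here is a minimal sketch of how the options might be turned into a request target. The `WhisperOptions` interface, the `resolveRequest` helper, and the defaults (the public OpenAI host and `whisper-1`) are assumptions made for illustration; the provider's actual resolution logic is not shown in this document.

```typescript
// Hypothetical option shape; field names mirror the JSON config above.
interface WhisperOptions {
  model?: string;
  language?: string;
  baseUrl?: string;
}

interface ResolvedRequest {
  url: string;
  fields: Record<string, string>;
}

// Assumed defaults, not confirmed by the provider's source.
const DEFAULT_BASE_URL = "https://api.openai.com";
const DEFAULT_MODEL = "whisper-1";

function resolveRequest(options: WhisperOptions = {}): ResolvedRequest {
  // Strip trailing slashes so the join never produces "//v1/...".
  const base = (options.baseUrl ?? DEFAULT_BASE_URL).replace(/\/+$/, "");
  const fields: Record<string, string> = {
    model: options.model ?? DEFAULT_MODEL,
  };
  // Leaving `language` out lets the server detect the language itself.
  if (options.language) {
    fields["language"] = options.language;
  }
  return { url: `${base}/v1/audio/transcriptions`, fields };
}
```

With the local configuration above, `resolveRequest` would target `http://localhost:8000/v1/audio/transcriptions`, assuming the server exposes the path under that prefix.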

Provider Rules

  • Audio is accumulated in a 1 s sliding window with 200 ms overlap to avoid clipping words at chunk boundaries.
  • The previous chunk's transcript is forwarded as `prompt` to the next request for cross-chunk continuity.
  • On fetch failure the provider emits `error` and continues; the session does not crash.
  • Use `language` to force a specific language code (BCP-47); omit it for automatic detection.
  • Compatible with any server that implements the OpenAI `/v1/audio/transcriptions` endpoint.
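
The windowing and prompt-forwarding rules above can be sketched as follows. This is an illustration under stated assumptions, not the provider's code: `sliceWindows`, `transcribeAll`, and the injected `Transcribe` callback are hypothetical names, the sketch works on an already-captured sample array rather than a live ring buffer, and it does not de-duplicate text from the overlapping 200 ms.

```typescript
// Hypothetical callback that sends one chunk plus a prompt to the endpoint.
type Transcribe = (chunk: Float32Array, prompt: string) => Promise<string>;

// Split PCM samples into fixed windows that overlap by `overlapMs`.
function sliceWindows(
  samples: Float32Array,
  sampleRate: number,
  windowMs: number = 1000,
  overlapMs: number = 200,
): Float32Array[] {
  const windowLen = Math.round((sampleRate * windowMs) / 1000);
  const hop = windowLen - Math.round((sampleRate * overlapMs) / 1000);
  if (windowLen <= 0 || hop <= 0) {
    throw new RangeError("overlap must be shorter than the window");
  }
  const windows: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += hop) {
    const end = Math.min(start + windowLen, samples.length);
    windows.push(samples.subarray(start, end));
    if (end === samples.length) break; // this window reached the end of the audio
  }
  return windows;
}

// Transcribe windows in order, forwarding the last successful transcript
// as the prompt for the next request. Failures are reported, not thrown.
async function transcribeAll(
  windows: Float32Array[],
  transcribe: Transcribe,
  onError: (err: unknown) => void,
): Promise<string[]> {
  const transcripts: string[] = [];
  let prompt = "";
  for (const chunk of windows) {
    try {
      const text = await transcribe(chunk, prompt);
      transcripts.push(text);
      prompt = text;
    } catch (err) {
      onError(err); // corresponds to the `error` event; the loop keeps going
    }
  }
  return transcripts;
}
```

Injecting `transcribe` keeps the sketch independent of any HTTP client. With the default settings, each window starts 800 ms after the previous one, so consecutive windows share 200 ms of audio.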

Events

| Event | Description |
| --- | --- |
| `interim_transcript` | Emitted after each chunk is transcribed |
| `final_transcript` | Emitted after `flush()` completes |
| `speech_start` | RMS energy crossed threshold |
| `speech_end` | RMS energy dropped below threshold |
| `error` | Fetch failure (session continues) |
| `close` | Session fully terminated |
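
The `speech_start` and `speech_end` events are described in terms of an RMS energy threshold. A minimal sketch of that kind of gate is shown below. The `EnergyGate` class, the `rms` helper, and the default threshold of 0.02 are assumptions for illustration; the provider's actual threshold and any smoothing or hangover logic are not documented here.

```typescript
type SpeechEvent = "speech_start" | "speech_end";

// Root-mean-square energy of one frame of samples in the range [-1, 1].
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    const s = frame[i] ?? 0;
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / frame.length);
}

// Hypothetical gate: reports an event only when energy crosses the threshold.
class EnergyGate {
  private speaking = false;
  private readonly threshold: number;

  constructor(threshold: number = 0.02) {
    this.threshold = threshold; // assumed default, tune per microphone
  }

  push(frame: Float32Array): SpeechEvent | null {
    const loud = rms(frame) >= this.threshold;
    if (loud === this.speaking) return null; // no state change, no event
    this.speaking = loud;
    return loud ? "speech_start" : "speech_end";
  }
}
```

A bare threshold like this flips state on every brief dip in energy; a production detector would usually require several consecutive quiet frames before reporting `speech_end`.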

Examples

  • "Use Whisper for live speech transcription during our voice session."
  • "Transcribe my speech through a local Faster-Whisper server."
  • "Use OpenAI for both the LLM and the STT provider."

Constraints

  • Requires `OPENAI_API_KEY` or a compatible local endpoint via `providerOptions.baseUrl`.
  • Latency is higher than native WebSocket providers (Deepgram) due to HTTP chunking overhead.
  • Speaker diarization is not natively supported; use the `diarization` extension for post-processing.