voice-conversation

Live Voice Conversations

Use this skill when the user wants an agent to listen, transcribe, respond with speech, or switch between speech providers without changing the rest of the workflow.

Prefer the unified AgentOS speech runtime over provider-specific one-offs. Treat STT, TTS, VAD, and wake-word detection as separate capabilities that can be swapped independently.
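
One way to picture "separate capabilities that can be swapped independently" is a small set of interfaces plus a container that holds one implementation of each. The sketch below is illustrative only: `SpeechStack`, the protocol names, and the stand-in providers are hypothetical and are not the AgentOS speech runtime API.

```python
from dataclasses import dataclass, replace
from typing import Optional, Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class VoiceActivityDetector(Protocol):
    def is_speech(self, frame: bytes) -> bool: ...


class WakeWordDetector(Protocol):
    def detected(self, frame: bytes) -> bool: ...


@dataclass(frozen=True)
class SpeechStack:
    """Hypothetical bundle: each capability is a separate, replaceable slot."""
    stt: SpeechToText
    tts: TextToSpeech
    vad: Optional[VoiceActivityDetector] = None
    wake_word: Optional[WakeWordDetector] = None


# Stand-in providers so the sketch runs without audio hardware or network access.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UpperTTS:
    def synthesize(self, text: str) -> bytes:
        return text.upper().encode("utf-8")


class ReverseTTS:
    def synthesize(self, text: str) -> bytes:
        return text[::-1].encode("utf-8")


def respond(stack: SpeechStack, audio: bytes) -> bytes:
    """Higher-level code talks to the stack, never to a concrete provider."""
    text = stack.stt.transcribe(audio)
    return stack.tts.synthesize(text)


stack = SpeechStack(stt=EchoSTT(), tts=UpperTTS())
swapped = replace(stack, tts=ReverseTTS())  # swap TTS only; STT is untouched
```

Because `respond` depends only on the interfaces, replacing the TTS slot changes the audio that comes back without touching the transcription path or the calling code.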

Workflow

- Pick providers for:
  - STT
  - TTS
  - optional wake word
  - optional telephony
- Start a speech session in one of three modes:
  - `manual` for push-to-talk or file transcription
  - `vad` for continuous listen-until-silence loops
  - `wake-word` when the user wants hands-free activation
- Let VAD and silence detection decide utterance boundaries unless the user explicitly wants manual capture.
- Transcribe the utterance, generate the response, then synthesize speech with the selected TTS provider.
- Support interruption and provider switching without changing the higher-level agent behavior.
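
The steps above can be sketched as a single conversational turn. This is a minimal illustration under stated assumptions: `SessionMode`, `capture_utterance`, and `run_turn` are hypothetical names, audio is modeled as a sequence of byte frames, and wake-word mode is assumed to hand over to the same VAD capture once activation has happened elsewhere.

```python
from enum import Enum
from typing import Callable, Iterable, List


class SessionMode(Enum):
    MANUAL = "manual"
    VAD = "vad"
    WAKE_WORD = "wake-word"


def capture_utterance(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    max_silent_frames: int = 3,
) -> bytes:
    """Collect speech frames until enough consecutive silent frames follow speech.

    Silent frames are dropped; leading silence never ends the utterance.
    """
    collected: List[bytes] = []
    silent_run = 0
    for frame in frames:
        if is_speech(frame):
            collected.append(frame)
            silent_run = 0
        elif collected:
            silent_run += 1
            if silent_run >= max_silent_frames:
                break
    return b"".join(collected)


def run_turn(
    mode: SessionMode,
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    transcribe: Callable[[bytes], str],
    generate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """One turn: capture, transcribe, generate a reply, synthesize speech."""
    if mode is SessionMode.MANUAL:
        # The caller already chose the boundaries (push-to-talk or a file).
        audio = b"".join(frames)
    else:
        # VAD mode, or wake-word mode after activation (assumption of this sketch).
        audio = capture_utterance(frames, is_speech)
    return synthesize(generate(transcribe(audio)))
```

The provider-specific work lives entirely in the `transcribe` and `synthesize` callables, so switching providers means passing different callables while `run_turn` stays the same.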

Provider Rules

- Prefer `openai-whisper` for simple hosted transcription.
- Prefer `openai-tts` for a default hosted voice path when one API key should cover both LLM and speech.
- Prefer `elevenlabs` when voice quality or cloning matters more than simplicity.
- Prefer local providers such as `whisper-local` or `piper` when the user wants offline or lower-cost operation.
- Treat wake-word detection as optional. Default to VAD + silence detection unless the user asked for hands-free wake-up.
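
These preferences could be encoded as a small selection function. The provider identifiers come from the rules above; everything else (`VoiceRequirements`, `choose_providers`, and in particular the choice to let offline or low-cost requirements win over voice quality when both are set) is an assumption of this sketch rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class VoiceRequirements:
    """Hypothetical summary of what the user asked for."""
    offline: bool = False
    low_cost: bool = False
    premium_voice: bool = False  # voice quality or cloning matters
    hands_free: bool = False     # user explicitly asked for wake-up by voice


def choose_providers(req: VoiceRequirements) -> Dict[str, str]:
    if req.offline or req.low_cost:
        # Local providers; assumed to take precedence over voice quality.
        stt, tts = "whisper-local", "piper"
    elif req.premium_voice:
        stt, tts = "openai-whisper", "elevenlabs"
    else:
        # Default hosted path: one API key can cover LLM and speech.
        stt, tts = "openai-whisper", "openai-tts"
    # Wake word is opt-in; VAD plus silence detection is the default.
    activation = "wake-word" if req.hands_free else "vad"
    return {"stt": stt, "tts": tts, "activation": activation}
```

For example, `choose_providers(VoiceRequirements(premium_voice=True))` keeps hosted transcription but routes speech output to `elevenlabs`, and leaves activation on `vad`.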

Voice UX Rules

- Do not keep listening forever without a boundary policy.
- Use significant pauses as a hint; use utterance-end silence as the final cutoff.
- When speech playback is active, be ready for barge-in and interruption if the user starts speaking again.
- Surface which provider combination is active so the user knows what is handling STT and TTS.
- When provider credentials are missing, degrade to whichever speech providers are configured instead of failing the whole interaction.
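
The last two rules (surface the active combination, degrade when credentials are missing) might look like the following. The `REQUIRED_KEY` table, the environment variable names, and the helper functions are assumptions made for illustration; the real runtime may resolve credentials differently.

```python
from typing import Dict, Mapping, Optional, Sequence

# Hypothetical mapping from provider id to the credential it needs.
# None marks a local provider that needs no API key.
REQUIRED_KEY: Dict[str, Optional[str]] = {
    "openai-whisper": "OPENAI_API_KEY",
    "openai-tts": "OPENAI_API_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
    "whisper-local": None,
    "piper": None,
}


def first_configured(
    preferences: Sequence[str], env: Mapping[str, str]
) -> Optional[str]:
    """Return the first preferred provider whose credential is present."""
    for provider in preferences:
        if provider not in REQUIRED_KEY:
            continue  # unknown provider: skip rather than guess
        key = REQUIRED_KEY[provider]
        if key is None or env.get(key):
            return provider
    return None


def resolve_stack(
    stt_prefs: Sequence[str], tts_prefs: Sequence[str], env: Mapping[str, str]
) -> Dict[str, Optional[str]]:
    return {
        "stt": first_configured(stt_prefs, env),
        "tts": first_configured(tts_prefs, env),
    }


def describe(stack: Mapping[str, Optional[str]]) -> str:
    """Short status line to show the user what handles STT and TTS."""
    return (
        f"STT: {stack['stt'] or 'unavailable'} | "
        f"TTS: {stack['tts'] or 'unavailable'}"
    )
```

In practice `env` would usually be `os.environ`. A missing ElevenLabs key then falls through to the next TTS preference instead of aborting the session, and `describe` tells the user which pair ended up active.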

Examples

"Start a live voice session with Whisper for STT and ElevenLabs for TTS."
"Use OpenAI for both speech recognition and speech output."
"Run voice chat locally with VAD and no wake word."
"Switch the TTS provider but keep the same voice conversation flow."

Constraints

- Hosted speech providers need API keys.
- Wake-word support is optional and may depend on a plugin or local model.
- VAD and silence thresholds should be tuned for the environment; do not hardcode one value for every context.
- Telephony call control is separate from local microphone capture, but both should share the same speech provider abstractions.
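
To keep thresholds out of the code path, one option is a set of named profiles with per-session overrides. The profile names, fields, and every number below are placeholders chosen for illustration, not recommended values; `VadConfig`, `vad_config_for`, and `classify_silence` are hypothetical helpers.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class VadConfig:
    energy_threshold: float  # placeholder scale, 0.0 to 1.0
    pause_hint_ms: int       # a significant pause: a hint only
    utterance_end_ms: int    # silence long enough to end the utterance


# Placeholder numbers; tune these for the actual environment.
PROFILES: Dict[str, VadConfig] = {
    "quiet-room": VadConfig(0.02, 400, 900),
    "open-office": VadConfig(0.08, 500, 1200),
    "telephony": VadConfig(0.05, 600, 1500),
}


def vad_config_for(
    environment: str, overrides: Optional[Dict[str, Any]] = None
) -> VadConfig:
    """Look up a profile and apply per-session overrides on top of it."""
    base = PROFILES.get(environment, PROFILES["quiet-room"])
    return replace(base, **(overrides or {}))


def classify_silence(silence_ms: int, cfg: VadConfig) -> str:
    """Map a silence duration to 'continue', 'pause' (hint), or 'end' (cutoff)."""
    if silence_ms >= cfg.utterance_end_ms:
        return "end"
    if silence_ms >= cfg.pause_hint_ms:
        return "pause"
    return "continue"
```

Microphone capture and telephony can share the same `VadConfig` type while using different profiles, which keeps the provider abstraction common and the tuning local to each context.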
