azure-ai-voicelive-py
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseAzure AI Voice Live SDK
Azure AI Voice Live SDK
Build real-time voice AI applications with bidirectional WebSocket communication.
使用双向WebSocket通信构建实时语音AI应用。
Installation
安装
bash
pip install azure-ai-voicelive aiohttp azure-identitybash
pip install azure-ai-voicelive aiohttp azure-identityEnvironment Variables
环境变量
bash
AZURE_COGNITIVE_SERVICES_ENDPOINT=https://<region>.api.cognitive.microsoft.combash
AZURE_COGNITIVE_SERVICES_ENDPOINT=https://<region>.api.cognitive.microsoft.comFor API key auth (not recommended for production)
For API key auth (not recommended for production)
AZURE_COGNITIVE_SERVICES_KEY=<api-key>
undefinedAZURE_COGNITIVE_SERVICES_KEY=<api-key>
undefinedAuthentication
身份验证
DefaultAzureCredential (preferred):
python
from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential
async with connect(
endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
credential=DefaultAzureCredential(),
model="gpt-4o-realtime-preview",
credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
...API Key:
python
from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential
async with connect(
endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
credential=AzureKeyCredential(os.environ["AZURE_COGNITIVE_SERVICES_KEY"]),
model="gpt-4o-realtime-preview"
) as conn:
...DefaultAzureCredential(推荐使用):
python
from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential
async with connect(
endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
credential=DefaultAzureCredential(),
model="gpt-4o-realtime-preview",
credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
...API密钥:
python
from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential
async with connect(
endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
credential=AzureKeyCredential(os.environ["AZURE_COGNITIVE_SERVICES_KEY"]),
model="gpt-4o-realtime-preview"
) as conn:
...Quick Start
快速开始
python
import asyncio
import os
from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential
async def main():
async with connect(
endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
credential=DefaultAzureCredential(),
model="gpt-4o-realtime-preview",
credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
# Update session with instructions
await conn.session.update(session={
"instructions": "You are a helpful assistant.",
"modalities": ["text", "audio"],
"voice": "alloy"
})
# Listen for events
async for event in conn:
print(f"Event: {event.type}")
if event.type == "response.audio_transcript.done":
print(f"Transcript: {event.transcript}")
elif event.type == "response.done":
break
asyncio.run(main())python
import asyncio
import os
from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential
async def main():
async with connect(
endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
credential=DefaultAzureCredential(),
model="gpt-4o-realtime-preview",
credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
# Update session with instructions
await conn.session.update(session={
"instructions": "You are a helpful assistant.",
"modalities": ["text", "audio"],
"voice": "alloy"
})
# Listen for events
async for event in conn:
print(f"Event: {event.type}")
if event.type == "response.audio_transcript.done":
print(f"Transcript: {event.transcript}")
elif event.type == "response.done":
break
asyncio.run(main())Core Architecture
核心架构
Connection Resources
连接资源
The exposes these resources:
VoiceLiveConnection| Resource | Purpose | Key Methods |
|---|---|---|
| Session configuration | |
| Model responses | |
| Audio input | |
| Audio output | |
| Conversation state | |
| Transcription config | |
VoiceLiveConnection| 资源 | 用途 | 核心方法 |
|---|---|---|
| 会话配置 | |
| 模型响应 | |
| 音频输入 | |
| 音频输出 | |
| 对话状态 | |
| 转录配置 | |
Session Configuration
会话配置
python
from azure.ai.voicelive.models import RequestSession, FunctionTool
await conn.session.update(session=RequestSession(
instructions="You are a helpful voice assistant.",
modalities=["text", "audio"],
voice="alloy", # or "echo", "shimmer", "sage", etc.
input_audio_format="pcm16",
output_audio_format="pcm16",
turn_detection={
"type": "server_vad",
"threshold": 0.5,
"prefix_padding_ms": 300,
"silence_duration_ms": 500
},
tools=[
FunctionTool(
type="function",
name="get_weather",
description="Get current weather",
parameters={
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
)
]
))python
from azure.ai.voicelive.models import RequestSession, FunctionTool
await conn.session.update(session=RequestSession(
instructions="You are a helpful voice assistant.",
modalities=["text", "audio"],
voice="alloy", # or "echo", "shimmer", "sage", etc.
input_audio_format="pcm16",
output_audio_format="pcm16",
turn_detection={
"type": "server_vad",
"threshold": 0.5,
"prefix_padding_ms": 300,
"silence_duration_ms": 500
},
tools=[
FunctionTool(
type="function",
name="get_weather",
description="Get current weather",
parameters={
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
)
]
))Audio Streaming
音频流
Send Audio (Base64 PCM16)
发送音频(Base64 PCM16)
python
import base64python
import base64Read audio chunk (16-bit PCM, 24kHz mono)
Read audio chunk (16-bit PCM, 24kHz mono)
audio_chunk = await read_audio_from_microphone()
b64_audio = base64.b64encode(audio_chunk).decode()
await conn.input_audio_buffer.append(audio=b64_audio)
undefinedaudio_chunk = await read_audio_from_microphone()
b64_audio = base64.b64encode(audio_chunk).decode()
await conn.input_audio_buffer.append(audio=b64_audio)
undefinedReceive Audio
接收音频
python
async for event in conn:
if event.type == "response.audio.delta":
audio_bytes = base64.b64decode(event.delta)
await play_audio(audio_bytes)
elif event.type == "response.audio.done":
print("Audio complete")python
async for event in conn:
if event.type == "response.audio.delta":
audio_bytes = base64.b64decode(event.delta)
await play_audio(audio_bytes)
elif event.type == "response.audio.done":
print("Audio complete")Event Handling
事件处理
python
async for event in conn:
match event.type:
# Session events
case "session.created":
print(f"Session: {event.session}")
case "session.updated":
print("Session updated")
# Audio input events
case "input_audio_buffer.speech_started":
print(f"Speech started at {event.audio_start_ms}ms")
case "input_audio_buffer.speech_stopped":
print(f"Speech stopped at {event.audio_end_ms}ms")
# Transcription events
case "conversation.item.input_audio_transcription.completed":
print(f"User said: {event.transcript}")
case "conversation.item.input_audio_transcription.delta":
print(f"Partial: {event.delta}")
# Response events
case "response.created":
print(f"Response started: {event.response.id}")
case "response.audio_transcript.delta":
print(event.delta, end="", flush=True)
case "response.audio.delta":
audio = base64.b64decode(event.delta)
case "response.done":
print(f"Response complete: {event.response.status}")
# Function calls
case "response.function_call_arguments.done":
result = handle_function(event.name, event.arguments)
await conn.conversation.item.create(item={
"type": "function_call_output",
"call_id": event.call_id,
"output": json.dumps(result)
})
await conn.response.create()
# Errors
case "error":
print(f"Error: {event.error.message}")python
async for event in conn:
match event.type:
# Session events
case "session.created":
print(f"Session: {event.session}")
case "session.updated":
print("Session updated")
# Audio input events
case "input_audio_buffer.speech_started":
print(f"Speech started at {event.audio_start_ms}ms")
case "input_audio_buffer.speech_stopped":
print(f"Speech stopped at {event.audio_end_ms}ms")
# Transcription events
case "conversation.item.input_audio_transcription.completed":
print(f"User said: {event.transcript}")
case "conversation.item.input_audio_transcription.delta":
print(f"Partial: {event.delta}")
# Response events
case "response.created":
print(f"Response started: {event.response.id}")
case "response.audio_transcript.delta":
print(event.delta, end="", flush=True)
case "response.audio.delta":
audio = base64.b64decode(event.delta)
case "response.done":
print(f"Response complete: {event.response.status}")
# Function calls
case "response.function_call_arguments.done":
result = handle_function(event.name, event.arguments)
await conn.conversation.item.create(item={
"type": "function_call_output",
"call_id": event.call_id,
"output": json.dumps(result)
})
await conn.response.create()
# Errors
case "error":
print(f"Error: {event.error.message}")Common Patterns
常见模式
Manual Turn Mode (No VAD)
手动轮次模式(无VAD)
python
await conn.session.update(session={"turn_detection": None})python
await conn.session.update(session={"turn_detection": None})Manually control turns
Manually control turns
await conn.input_audio_buffer.append(audio=b64_audio)
await conn.input_audio_buffer.commit() # End of user turn
await conn.response.create() # Trigger response
undefinedawait conn.input_audio_buffer.append(audio=b64_audio)
await conn.input_audio_buffer.commit() # End of user turn
await conn.response.create() # Trigger response
undefinedInterrupt Handling
中断处理
python
async for event in conn:
if event.type == "input_audio_buffer.speech_started":
# User interrupted - cancel current response
await conn.response.cancel()
await conn.output_audio_buffer.clear()python
async for event in conn:
if event.type == "input_audio_buffer.speech_started":
# User interrupted - cancel current response
await conn.response.cancel()
await conn.output_audio_buffer.clear()Conversation History
对话历史
python
undefinedpython
undefinedAdd system message
Add system message
await conn.conversation.item.create(item={
"type": "message",
"role": "system",
"content": [{"type": "input_text", "text": "Be concise."}]
})
await conn.conversation.item.create(item={
"type": "message",
"role": "system",
"content": [{"type": "input_text", "text": "Be concise."}]
})
Add user message
Add user message
await conn.conversation.item.create(item={
"type": "message",
"role": "user",
"content": [{"type": "input_text", "text": "Hello!"}]
})
await conn.response.create()
undefinedawait conn.conversation.item.create(item={
"type": "message",
"role": "user",
"content": [{"type": "input_text", "text": "Hello!"}]
})
await conn.response.create()
undefinedVoice Options
语音选项
| Voice | Description |
|---|---|
| Neutral, balanced |
| Warm, conversational |
| Clear, professional |
| Calm, authoritative |
| Friendly, upbeat |
| Deep, measured |
| Expressive |
| Storytelling |
Azure voices: Use , , or models.
AzureStandardVoiceAzureCustomVoiceAzurePersonalVoice| 语音 | 描述 |
|---|---|
| 中性、均衡 |
| 温暖、口语化 |
| 清晰、专业 |
| 沉稳、权威 |
| 友好、乐观 |
| 低沉、从容 |
| 富有表现力 |
| 叙事风格 |
Azure语音:使用、或模型。
AzureStandardVoiceAzureCustomVoiceAzurePersonalVoiceAudio Formats
音频格式
| Format | Sample Rate | Use Case |
|---|---|---|
| 24kHz | Default, high quality |
| 8kHz | Telephony |
| 16kHz | Voice assistants |
| 8kHz | Telephony (US) |
| 8kHz | Telephony (EU) |
| 格式 | 采样率 | 使用场景 |
|---|---|---|
| 24kHz | 默认,高质量 |
| 8kHz | 电话通信 |
| 16kHz | 语音助手 |
| 8kHz | 电话通信(美国) |
| 8kHz | 电话通信(欧洲) |
Turn Detection Options
轮次检测选项
python
undefinedpython
undefinedServer VAD (default)
Server VAD (default)
{"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500}
{"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500}
Azure Semantic VAD (smarter detection)
Azure Semantic VAD (smarter detection)
{"type": "azure_semantic_vad"}
{"type": "azure_semantic_vad_en"} # English optimized
{"type": "azure_semantic_vad_multilingual"}
undefined{"type": "azure_semantic_vad"}
{"type": "azure_semantic_vad_en"} # English optimized
{"type": "azure_semantic_vad_multilingual"}
undefinedError Handling
错误处理
python
from azure.ai.voicelive.aio import ConnectionError, ConnectionClosed
try:
async with connect(...) as conn:
async for event in conn:
if event.type == "error":
print(f"API Error: {event.error.code} - {event.error.message}")
except ConnectionClosed as e:
print(f"Connection closed: {e.code} - {e.reason}")
except ConnectionError as e:
print(f"Connection error: {e}")python
from azure.ai.voicelive.aio import ConnectionError, ConnectionClosed
try:
async with connect(...) as conn:
async for event in conn:
if event.type == "error":
print(f"API Error: {event.error.code} - {event.error.message}")
except ConnectionClosed as e:
print(f"Connection closed: {e.code} - {e.reason}")
except ConnectionError as e:
print(f"Connection error: {e}")References
参考资料
- Detailed API Reference: See references/api-reference.md
- Complete Examples: See references/examples.md
- All Models & Types: See references/models.md
- 详细API参考:查看references/api-reference.md
- 完整示例:查看references/examples.md
- 所有模型与类型:查看references/models.md
When to Use
使用场景
This skill is applicable to execute the workflow or actions described in the overview.
本技能适用于执行概述中描述的工作流或操作。