# azure-ai-voicelive-py


## Azure AI Voice Live SDK

Build real-time voice AI applications with bidirectional WebSocket communication.

## Installation

```bash
pip install azure-ai-voicelive aiohttp azure-identity
```

## Environment Variables

```bash
AZURE_COGNITIVE_SERVICES_ENDPOINT=https://<region>.api.cognitive.microsoft.com

# For API key auth (not recommended for production)
AZURE_COGNITIVE_SERVICES_KEY=<api-key>
```

## Authentication

DefaultAzureCredential (preferred):

```python
import os

from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential

async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=DefaultAzureCredential(),
    model="gpt-4o-realtime-preview",
    credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
    ...
```

API Key:

```python
import os

from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential

async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_COGNITIVE_SERVICES_KEY"]),
    model="gpt-4o-realtime-preview"
) as conn:
    ...
```

## Quick Start

```python
import asyncio
import os

from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential

async def main():
    async with connect(
        endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
        credential=DefaultAzureCredential(),
        model="gpt-4o-realtime-preview",
        credential_scopes=["https://cognitiveservices.azure.com/.default"]
    ) as conn:
        # Update session with instructions
        await conn.session.update(session={
            "instructions": "You are a helpful assistant.",
            "modalities": ["text", "audio"],
            "voice": "alloy"
        })

        # Listen for events
        async for event in conn:
            print(f"Event: {event.type}")
            if event.type == "response.audio_transcript.done":
                print(f"Transcript: {event.transcript}")
            elif event.type == "response.done":
                break

asyncio.run(main())
```

## Core Architecture

### Connection Resources

The `VoiceLiveConnection` exposes these resources:

| Resource | Purpose | Key Methods |
| --- | --- | --- |
| `conn.session` | Session configuration | `update(session=...)` |
| `conn.response` | Model responses | `create()`, `cancel()` |
| `conn.input_audio_buffer` | Audio input | `append()`, `commit()`, `clear()` |
| `conn.output_audio_buffer` | Audio output | `clear()` |
| `conn.conversation` | Conversation state | `item.create()`, `item.delete()`, `item.truncate()` |
| `conn.transcription_session` | Transcription config | `update(session=...)` |

### Session Configuration

```python
from azure.ai.voicelive.models import RequestSession, FunctionTool

await conn.session.update(session=RequestSession(
    instructions="You are a helpful voice assistant.",
    modalities=["text", "audio"],
    voice="alloy",  # or "echo", "shimmer", "sage", etc.
    input_audio_format="pcm16",
    output_audio_format="pcm16",
    turn_detection={
        "type": "server_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 500
    },
    tools=[
        FunctionTool(
            type="function",
            name="get_weather",
            description="Get current weather",
            parameters={
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        )
    ]
))
```

## Audio Streaming

### Send Audio (Base64 PCM16)

```python
import base64

# Read audio chunk (16-bit PCM, 24kHz mono)
audio_chunk = await read_audio_from_microphone()
b64_audio = base64.b64encode(audio_chunk).decode()
await conn.input_audio_buffer.append(audio=b64_audio)
```
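For testing without a live microphone, the same append/commit flow can be driven from a local WAV file. A minimal sketch; `stream_wav` and `pcm_chunks` are illustrative helpers, not part of the SDK, and assume 24 kHz mono 16-bit audio:

```python
import base64
import wave

CHUNK_MS = 100  # send roughly 100 ms of audio per append

def pcm_chunks(data: bytes, sample_rate: int, chunk_ms: int = CHUNK_MS):
    """Split 16-bit mono PCM into fixed-duration chunks (2 bytes per sample)."""
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000
    for i in range(0, len(data), bytes_per_chunk):
        yield data[i:i + bytes_per_chunk]

async def stream_wav(conn, path: str) -> None:
    """Send a mono 16-bit WAV file to the input audio buffer."""
    with wave.open(path, "rb") as wav:
        pcm = wav.readframes(wav.getnframes())
        rate = wav.getframerate()
    for chunk in pcm_chunks(pcm, rate):
        await conn.input_audio_buffer.append(
            audio=base64.b64encode(chunk).decode()
        )
    await conn.input_audio_buffer.commit()
```

With server VAD enabled, the final `commit()` is unnecessary; the service segments turns automatically.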

### Receive Audio

```python
async for event in conn:
    if event.type == "response.audio.delta":
        audio_bytes = base64.b64decode(event.delta)
        await play_audio(audio_bytes)
    elif event.type == "response.audio.done":
        print("Audio complete")
```
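For debugging, the decoded `response.audio.delta` bytes can be accumulated and written to disk instead of played. `save_pcm16` below is an illustrative helper, not SDK API; it wraps the raw PCM16 output in a WAV header so the reply can be inspected in any audio player:

```python
import wave

def save_pcm16(pcm: bytes, path: str, sample_rate: int = 24000) -> None:
    """Write raw 16-bit mono PCM bytes to a playable WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
```

Collect each decoded `audio_bytes` chunk into a `bytearray` during the event loop, then call `save_pcm16(bytes(buffer), "reply.wav")` once `response.audio.done` arrives.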

## Event Handling

```python
import base64
import json

async for event in conn:
    match event.type:
        # Session events
        case "session.created":
            print(f"Session: {event.session}")
        case "session.updated":
            print("Session updated")

        # Audio input events
        case "input_audio_buffer.speech_started":
            print(f"Speech started at {event.audio_start_ms}ms")
        case "input_audio_buffer.speech_stopped":
            print(f"Speech stopped at {event.audio_end_ms}ms")

        # Transcription events
        case "conversation.item.input_audio_transcription.completed":
            print(f"User said: {event.transcript}")
        case "conversation.item.input_audio_transcription.delta":
            print(f"Partial: {event.delta}")

        # Response events
        case "response.created":
            print(f"Response started: {event.response.id}")
        case "response.audio_transcript.delta":
            print(event.delta, end="", flush=True)
        case "response.audio.delta":
            audio = base64.b64decode(event.delta)
        case "response.done":
            print(f"Response complete: {event.response.status}")

        # Function calls
        case "response.function_call_arguments.done":
            result = handle_function(event.name, event.arguments)
            await conn.conversation.item.create(item={
                "type": "function_call_output",
                "call_id": event.call_id,
                "output": json.dumps(result)
            })
            await conn.response.create()

        # Errors
        case "error":
            print(f"Error: {event.error.message}")
```
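The `handle_function` call in the function-call case is left undefined above. One possible shape for it, dispatching on the tool name declared in the session configuration; the weather payload here is fabricated for illustration:

```python
import json

def handle_function(name: str, arguments: str) -> dict:
    """Dispatch a model function call; arguments arrive as a JSON string."""
    args = json.loads(arguments)
    if name == "get_weather":
        # Replace with a real weather lookup for your application.
        return {"location": args["location"], "forecast": "sunny", "temp_c": 22}
    return {"error": f"unknown function: {name}"}
```

The returned dict is serialized with `json.dumps` and sent back as a `function_call_output` item, after which `conn.response.create()` asks the model to continue with the tool result.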

## Common Patterns

### Manual Turn Mode (No VAD)

```python
await conn.session.update(session={"turn_detection": None})

# Manually control turns
await conn.input_audio_buffer.append(audio=b64_audio)
await conn.input_audio_buffer.commit()  # End of user turn
await conn.response.create()  # Trigger response
```

### Interrupt Handling

```python
async for event in conn:
    if event.type == "input_audio_buffer.speech_started":
        # User interrupted - cancel current response
        await conn.response.cancel()
        await conn.output_audio_buffer.clear()
```

### Conversation History

```python
# Add system message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "system",
    "content": [{"type": "input_text", "text": "Be concise."}]
})

# Add user message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "user",
    "content": [{"type": "input_text", "text": "Hello!"}]
})
await conn.response.create()
```

## Voice Options

| Voice | Description |
| --- | --- |
| `alloy` | Neutral, balanced |
| `echo` | Warm, conversational |
| `shimmer` | Clear, professional |
| `sage` | Calm, authoritative |
| `coral` | Friendly, upbeat |
| `ash` | Deep, measured |
| `ballad` | Expressive |
| `verse` | Storytelling |

Azure voices: use the `AzureStandardVoice`, `AzureCustomVoice`, or `AzurePersonalVoice` models.
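Selecting an Azure neural voice is a matter of passing one of those models as the session `voice`. A minimal sketch, assuming `AzureStandardVoice` accepts an Azure voice `name`; the voice name used here is illustrative:

```python
from azure.ai.voicelive.models import AzureStandardVoice

# Illustrative voice name; substitute any Azure neural voice.
await conn.session.update(session={
    "voice": AzureStandardVoice(name="en-US-AvaNeural"),
})
```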

## Audio Formats

| Format | Sample Rate | Use Case |
| --- | --- | --- |
| `pcm16` | 24kHz | Default, high quality |
| `pcm16-8000hz` | 8kHz | Telephony |
| `pcm16-16000hz` | 16kHz | Voice assistants |
| `g711_ulaw` | 8kHz | Telephony (US) |
| `g711_alaw` | 8kHz | Telephony (EU) |
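For a telephony integration, the table above suggests a G.711 profile. A hypothetical session fragment, using the field names from the session-configuration example earlier:

```python
# Hypothetical telephony profile: G.711 mu-law in and out (8 kHz).
TELEPHONY_SESSION = {
    "input_audio_format": "g711_ulaw",
    "output_audio_format": "g711_ulaw",
    "turn_detection": {"type": "server_vad", "silence_duration_ms": 500},
}

# Apply it inside an active connection:
# await conn.session.update(session=TELEPHONY_SESSION)
```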

## Turn Detection Options

```python
# Server VAD (default)
{"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500}

# Azure Semantic VAD (smarter detection)
{"type": "azure_semantic_vad"}
{"type": "azure_semantic_vad_en"}  # English optimized
{"type": "azure_semantic_vad_multilingual"}
```

## Error Handling

```python
from azure.ai.voicelive.aio import ConnectionError, ConnectionClosed

try:
    async with connect(...) as conn:
        async for event in conn:
            if event.type == "error":
                print(f"API Error: {event.error.code} - {event.error.message}")
except ConnectionClosed as e:
    print(f"Connection closed: {e.code} - {e.reason}")
except ConnectionError as e:
    print(f"Connection error: {e}")
```
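Long-lived voice sessions eventually lose their WebSocket, so the exception handling above usually pairs with a retry loop. A sketch with capped exponential backoff; it catches the builtin `ConnectionError` and `OSError` so it runs standalone, but in practice you would catch the SDK's `ConnectionClosed` / `ConnectionError` shown in the block above:

```python
import asyncio
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with jitter, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt)) * (0.5 + random.random() / 2)

async def run_with_reconnect(run_session, max_attempts: int = 5) -> None:
    """Re-run a zero-argument session coroutine after connection loss."""
    for attempt in range(max_attempts):
        try:
            await run_session()
            return  # session ended cleanly
        except (ConnectionError, OSError):
            delay = backoff_delay(attempt)
            print(f"Connection lost; retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
    raise RuntimeError("giving up after repeated connection failures")
```

Here `run_session` would be a coroutine function that opens `connect(...)`, configures the session, and consumes events until done; re-apply your session configuration after every reconnect, since server-side state does not survive the old connection.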

## References

- Detailed API Reference: see references/api-reference.md
- Complete Examples: see references/examples.md
- All Models & Types: see references/models.md