# azure-ai-voicelive-py


## Azure AI Voice Live SDK

Build real-time voice AI applications with bidirectional WebSocket communication.

## Installation

```bash
pip install azure-ai-voicelive aiohttp azure-identity
```

## Environment Variables

```bash
AZURE_COGNITIVE_SERVICES_ENDPOINT=https://<region>.api.cognitive.microsoft.com

# For API key auth (not recommended for production)
AZURE_COGNITIVE_SERVICES_KEY=<api-key>
```

## Authentication

DefaultAzureCredential (preferred):

```python
import os

from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential

async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=DefaultAzureCredential(),
    model="gpt-4o-realtime-preview",
    credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
    ...
```

API Key:

```python
import os

from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential

async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_COGNITIVE_SERVICES_KEY"]),
    model="gpt-4o-realtime-preview"
) as conn:
    ...
```

## Quick Start

```python
import asyncio
import os

from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential

async def main():
    async with connect(
        endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
        credential=DefaultAzureCredential(),
        model="gpt-4o-realtime-preview",
        credential_scopes=["https://cognitiveservices.azure.com/.default"]
    ) as conn:
        # Update session with instructions
        await conn.session.update(session={
            "instructions": "You are a helpful assistant.",
            "modalities": ["text", "audio"],
            "voice": "alloy"
        })

        # Listen for events
        async for event in conn:
            print(f"Event: {event.type}")
            if event.type == "response.audio_transcript.done":
                print(f"Transcript: {event.transcript}")
            elif event.type == "response.done":
                break

asyncio.run(main())
```

## Core Architecture

### Connection Resources

The `VoiceLiveConnection` exposes these resources:

| Resource | Purpose | Key Methods |
| --- | --- | --- |
| `conn.session` | Session configuration | `update(session=...)` |
| `conn.response` | Model responses | `create()`, `cancel()` |
| `conn.input_audio_buffer` | Audio input | `append()`, `commit()`, `clear()` |
| `conn.output_audio_buffer` | Audio output | `clear()` |
| `conn.conversation` | Conversation state | `item.create()`, `item.delete()`, `item.truncate()` |
| `conn.transcription_session` | Transcription config | `update(session=...)` |

### Session Configuration

```python
from azure.ai.voicelive.models import RequestSession, FunctionTool

await conn.session.update(session=RequestSession(
    instructions="You are a helpful voice assistant.",
    modalities=["text", "audio"],
    voice="alloy",  # or "echo", "shimmer", "sage", etc.
    input_audio_format="pcm16",
    output_audio_format="pcm16",
    turn_detection={
        "type": "server_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 500
    },
    tools=[
        FunctionTool(
            type="function",
            name="get_weather",
            description="Get current weather",
            parameters={
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        )
    ]
))
```

## Audio Streaming

### Send Audio (Base64 PCM16)

```python
import base64

# Read audio chunk (16-bit PCM, 24kHz mono)
audio_chunk = await read_audio_from_microphone()
b64_audio = base64.b64encode(audio_chunk).decode()
await conn.input_audio_buffer.append(audio=b64_audio)
```
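For testing without a live microphone, the same append/commit flow can be driven from a local WAV file. A minimal sketch; `stream_wav` and `pcm_chunks` are illustrative helpers, not part of the SDK, and assume 24 kHz mono 16-bit audio:

```python
import base64
import wave

CHUNK_MS = 100  # send roughly 100 ms of audio per append

def pcm_chunks(data: bytes, sample_rate: int, chunk_ms: int = CHUNK_MS):
    """Split 16-bit mono PCM into fixed-duration chunks (2 bytes per sample)."""
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000
    for i in range(0, len(data), bytes_per_chunk):
        yield data[i:i + bytes_per_chunk]

async def stream_wav(conn, path: str) -> None:
    """Send a mono 16-bit WAV file to the input audio buffer."""
    with wave.open(path, "rb") as wav:
        pcm = wav.readframes(wav.getnframes())
        rate = wav.getframerate()
    for chunk in pcm_chunks(pcm, rate):
        await conn.input_audio_buffer.append(
            audio=base64.b64encode(chunk).decode()
        )
    await conn.input_audio_buffer.commit()
```

With server VAD enabled, the final `commit()` is unnecessary; the service segments turns automatically.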

### Receive Audio

```python
async for event in conn:
    if event.type == "response.audio.delta":
        audio_bytes = base64.b64decode(event.delta)
        await play_audio(audio_bytes)
    elif event.type == "response.audio.done":
        print("Audio complete")
```
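For debugging, the decoded `response.audio.delta` bytes can be accumulated and written to disk instead of played. `save_pcm16` below is an illustrative helper, not SDK API; it wraps the raw PCM16 output in a WAV header so the reply can be inspected in any audio player:

```python
import wave

def save_pcm16(pcm: bytes, path: str, sample_rate: int = 24000) -> None:
    """Write raw 16-bit mono PCM bytes to a playable WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
```

Collect each decoded `audio_bytes` chunk into a `bytearray` during the event loop, then call `save_pcm16(bytes(buffer), "reply.wav")` once `response.audio.done` arrives.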

## Event Handling

```python
import base64
import json

async for event in conn:
    match event.type:
        # Session events
        case "session.created":
            print(f"Session: {event.session}")
        case "session.updated":
            print("Session updated")

        # Audio input events
        case "input_audio_buffer.speech_started":
            print(f"Speech started at {event.audio_start_ms}ms")
        case "input_audio_buffer.speech_stopped":
            print(f"Speech stopped at {event.audio_end_ms}ms")

        # Transcription events
        case "conversation.item.input_audio_transcription.completed":
            print(f"User said: {event.transcript}")
        case "conversation.item.input_audio_transcription.delta":
            print(f"Partial: {event.delta}")

        # Response events
        case "response.created":
            print(f"Response started: {event.response.id}")
        case "response.audio_transcript.delta":
            print(event.delta, end="", flush=True)
        case "response.audio.delta":
            audio = base64.b64decode(event.delta)
        case "response.done":
            print(f"Response complete: {event.response.status}")

        # Function calls
        case "response.function_call_arguments.done":
            result = handle_function(event.name, event.arguments)
            await conn.conversation.item.create(item={
                "type": "function_call_output",
                "call_id": event.call_id,
                "output": json.dumps(result)
            })
            await conn.response.create()

        # Errors
        case "error":
            print(f"Error: {event.error.message}")
```
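The `handle_function` call in the function-call case is left undefined above. One possible shape for it, dispatching on the tool name declared in the session configuration; the weather payload here is fabricated for illustration:

```python
import json

def handle_function(name: str, arguments: str) -> dict:
    """Dispatch a model function call; arguments arrive as a JSON string."""
    args = json.loads(arguments)
    if name == "get_weather":
        # Replace with a real weather lookup for your application.
        return {"location": args["location"], "forecast": "sunny", "temp_c": 22}
    return {"error": f"unknown function: {name}"}
```

The returned dict is serialized with `json.dumps` and sent back as a `function_call_output` item, after which `conn.response.create()` asks the model to continue with the tool result.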

## Common Patterns

### Manual Turn Mode (No VAD)

```python
await conn.session.update(session={"turn_detection": None})

# Manually control turns
await conn.input_audio_buffer.append(audio=b64_audio)
await conn.input_audio_buffer.commit()  # End of user turn
await conn.response.create()  # Trigger response
```

### Interrupt Handling

```python
async for event in conn:
    if event.type == "input_audio_buffer.speech_started":
        # User interrupted - cancel current response
        await conn.response.cancel()
        await conn.output_audio_buffer.clear()
```

### Conversation History

```python
# Add system message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "system",
    "content": [{"type": "input_text", "text": "Be concise."}]
})

# Add user message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "user",
    "content": [{"type": "input_text", "text": "Hello!"}]
})
await conn.response.create()
```

## Voice Options

| Voice | Description |
| --- | --- |
| `alloy` | Neutral, balanced |
| `echo` | Warm, conversational |
| `shimmer` | Clear, professional |
| `sage` | Calm, authoritative |
| `coral` | Friendly, upbeat |
| `ash` | Deep, measured |
| `ballad` | Expressive |
| `verse` | Storytelling |

Azure voices: use the `AzureStandardVoice`, `AzureCustomVoice`, or `AzurePersonalVoice` models.
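Selecting an Azure neural voice is a matter of passing one of those models as the session `voice`. A minimal sketch, assuming `AzureStandardVoice` accepts an Azure voice `name`; the voice name used here is illustrative:

```python
from azure.ai.voicelive.models import AzureStandardVoice

# Illustrative voice name; substitute any Azure neural voice.
await conn.session.update(session={
    "voice": AzureStandardVoice(name="en-US-AvaNeural"),
})
```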

## Audio Formats

| Format | Sample Rate | Use Case |
| --- | --- | --- |
| `pcm16` | 24kHz | Default, high quality |
| `pcm16-8000hz` | 8kHz | Telephony |
| `pcm16-16000hz` | 16kHz | Voice assistants |
| `g711_ulaw` | 8kHz | Telephony (US) |
| `g711_alaw` | 8kHz | Telephony (EU) |
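For a telephony integration, the table above suggests a G.711 profile. A hypothetical session fragment, using the field names from the session-configuration example earlier:

```python
# Hypothetical telephony profile: G.711 mu-law in and out (8 kHz).
TELEPHONY_SESSION = {
    "input_audio_format": "g711_ulaw",
    "output_audio_format": "g711_ulaw",
    "turn_detection": {"type": "server_vad", "silence_duration_ms": 500},
}

# Apply it inside an active connection:
# await conn.session.update(session=TELEPHONY_SESSION)
```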

## Turn Detection Options

```python
# Server VAD (default)
{"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500}

# Azure Semantic VAD (smarter detection)
{"type": "azure_semantic_vad"}
{"type": "azure_semantic_vad_en"}  # English optimized
{"type": "azure_semantic_vad_multilingual"}
```

## Error Handling

```python
from azure.ai.voicelive.aio import ConnectionError, ConnectionClosed

try:
    async with connect(...) as conn:
        async for event in conn:
            if event.type == "error":
                print(f"API Error: {event.error.code} - {event.error.message}")
except ConnectionClosed as e:
    print(f"Connection closed: {e.code} - {e.reason}")
except ConnectionError as e:
    print(f"Connection error: {e}")
```
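Long-lived voice sessions eventually lose their WebSocket, so the exception handling above usually pairs with a retry loop. A sketch with capped exponential backoff; it catches the builtin `ConnectionError` and `OSError` so it runs standalone, but in practice you would catch the SDK's `ConnectionClosed` / `ConnectionError` shown in the block above:

```python
import asyncio
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with jitter, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt)) * (0.5 + random.random() / 2)

async def run_with_reconnect(run_session, max_attempts: int = 5) -> None:
    """Re-run a zero-argument session coroutine after connection loss."""
    for attempt in range(max_attempts):
        try:
            await run_session()
            return  # session ended cleanly
        except (ConnectionError, OSError):
            delay = backoff_delay(attempt)
            print(f"Connection lost; retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
    raise RuntimeError("giving up after repeated connection failures")
```

Here `run_session` would be a coroutine function that opens `connect(...)`, configures the session, and consumes events until done; re-apply your session configuration after every reconnect, since server-side state does not survive the old connection.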

## References

- Detailed API Reference: see references/api-reference.md
- Complete Examples: see references/examples.md
- All Models & Types: see references/models.md