gemini-live-api-dev
Gemini Live API Development Skill
Overview
The Live API enables low-latency, real-time voice and video interactions with Gemini over WebSockets. It processes continuous streams of audio, video, or text to deliver immediate, human-like spoken responses.
Key capabilities:
- Bidirectional audio streaming — real-time mic-to-speaker conversations
- Video streaming — send camera/screen frames alongside audio
- Text input/output — send and receive text within a live session
- Audio transcriptions — get text transcripts of both input and output audio
- Voice Activity Detection (VAD) — automatic interruption handling
- Native audio — affective dialog, proactive audio, thinking
- Function calling — synchronous and asynchronous tool use
- Google Search grounding — ground responses in real-time search results
- Session management — context compression, session resumption, GoAway signals
- Ephemeral tokens — secure client-side authentication
[!NOTE] The Live API currently only supports WebSockets. For WebRTC support or simplified integration, use a partner integration.
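Since the Live API speaks plain WebSockets, the handshake can be sketched without any SDK. Below is a minimal sketch of the first `setup` frame the raw protocol expects; the endpoint URL and field names follow the WebSockets API reference as we understand it, so verify them against the current docs before relying on this.

```python
import json

# Raw WebSocket endpoint (v1beta); the API key is passed as a query parameter.
LIVE_WS_URL = (
    "wss://generativelanguage.googleapis.com/ws/"
    "google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent"
    "?key=YOUR_API_KEY"
)

def build_setup_message(model: str) -> str:
    """First frame sent after the socket opens: session configuration."""
    return json.dumps({
        "setup": {
            "model": f"models/{model}",
            "generationConfig": {"responseModalities": ["AUDIO"]},
        }
    })

msg = build_setup_message("gemini-2.5-flash-native-audio-preview-12-2025")
```

After sending this frame, stream `realtimeInput` messages and read server frames until the socket closes; the SDKs below wrap exactly this exchange.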
Models
- `gemini-2.5-flash-native-audio-preview-12-2025` — Native audio output, affective dialog, proactive audio, thinking. 128k context window. This is the recommended model for all Live API use cases.

[!WARNING] The following Live API models are deprecated and will be shut down. Migrate to `gemini-2.5-flash-native-audio-preview-12-2025`.
- `gemini-live-2.5-flash-preview` — Released June 17, 2025. Shutdown: December 9, 2025.
- `gemini-2.0-flash-live-001` — Released April 9, 2025. Shutdown: December 9, 2025.
SDKs
- Python: `google-genai` — `pip install google-genai`
- JavaScript/TypeScript: `@google/genai` — `npm install @google/genai`

[!WARNING] Legacy SDKs `google-generativeai` (Python) and `@google/generative-ai` (JS) are deprecated. Use the new SDKs above.
Partner Integrations
To streamline real-time audio/video app development, use a third-party integration supporting the Gemini Live API over WebRTC or WebSockets:
- LiveKit — Use the Gemini Live API with LiveKit Agents.
- Pipecat by Daily — Create a real-time AI chatbot using Gemini Live and Pipecat.
- Fishjam by Software Mansion — Create live video and audio streaming applications with Fishjam.
- Vision Agents by Stream — Build real-time voice and video AI applications with Vision Agents.
- Voximplant — Connect inbound and outbound calls to Live API with Voximplant.
- Firebase AI SDK — Get started with the Gemini Live API using Firebase AI Logic.
Audio Formats
- Input: Raw PCM, little-endian, 16-bit, mono. 16kHz native (other sample rates are resampled). MIME type: `audio/pcm;rate=16000`
- Output: Raw PCM, little-endian, 16-bit, mono. 24kHz sample rate.

[!IMPORTANT] Use `sendRealtimeInput`/`send_realtime_input` for all real-time user input (audio, video, and text). Use `sendClientContent`/`send_client_content` only for incremental conversation history updates (appending prior turns to context), not for sending new user messages.

[!WARNING] Do not use `media` in `sendRealtimeInput`. Use the specific keys: `audio` for audio data, `video` for images/video frames, and `text` for text input.
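Because the model's audio output is headerless PCM, most players need it wrapped in a container before playback. A minimal stdlib-only sketch (the `pcm_to_wav` helper is our own name, not part of any SDK):

```python
import math
import struct
import wave

def pcm_to_wav(pcm_bytes: bytes, path: str, rate: int = 24000) -> None:
    """Wrap raw little-endian 16-bit mono PCM in a WAV container for playback."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)      # mono
        wf.setsampwidth(2)      # 16-bit samples
        wf.setframerate(rate)   # 24kHz matches the model's output
        wf.writeframes(pcm_bytes)

# Stand-in for model output: 0.1s of a 440 Hz tone as 16-bit little-endian PCM
rate = 24000
samples = [int(32767 * 0.3 * math.sin(2 * math.pi * 440 * n / rate))
           for n in range(rate // 10)]
pcm = struct.pack("<%dh" % len(samples), *samples)
pcm_to_wav(pcm, "output.wav")
```

In a real client you would accumulate the `inline_data` chunks from the session and pass the concatenated bytes to the same helper.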
Quick Start
Authentication

Python

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
```

JavaScript

```js
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: 'YOUR_API_KEY' });
```

Connecting to the Live API
Python

```python
from google.genai import types

config = types.LiveConnectConfig(
    response_modalities=[types.Modality.AUDIO],
    system_instruction=types.Content(
        parts=[types.Part(text="You are a helpful assistant.")]
    )
)

async with client.aio.live.connect(
    model="gemini-2.5-flash-native-audio-preview-12-2025",
    config=config,
) as session:
    pass  # Session is now active
```

JavaScript

```js
const session = await ai.live.connect({
  model: 'gemini-2.5-flash-native-audio-preview-12-2025',
  config: {
    responseModalities: ['audio'],
    systemInstruction: { parts: [{ text: 'You are a helpful assistant.' }] }
  },
  callbacks: {
    onopen: () => console.log('Connected'),
    onmessage: (response) => console.log('Message:', response),
    onerror: (error) => console.error('Error:', error),
    onclose: () => console.log('Closed')
  }
});
```

Sending Text
Python

```python
await session.send_realtime_input(text="Hello, how are you?")
```

JavaScript

```js
session.sendRealtimeInput({ text: 'Hello, how are you?' });
```

Sending Audio
Python

```python
await session.send_realtime_input(
    audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
)
```

JavaScript

```js
session.sendRealtimeInput({
  audio: { data: chunk.toString('base64'), mimeType: 'audio/pcm;rate=16000' }
});
```

Sending Video
Python

```python
# frame: raw JPEG-encoded bytes
await session.send_realtime_input(
    video=types.Blob(data=frame, mime_type="image/jpeg")
)
```

JavaScript

```js
session.sendRealtimeInput({
  video: { data: frame.toString('base64'), mimeType: 'image/jpeg' }
});
```

Receiving Audio and Text
Python

```python
async for response in session.receive():
    content = response.server_content
    if content:
        # Audio
        if content.model_turn:
            for part in content.model_turn.parts:
                if part.inline_data:
                    audio_data = part.inline_data.data
        # Transcription
        if content.input_transcription:
            print(f"User: {content.input_transcription.text}")
        if content.output_transcription:
            print(f"Gemini: {content.output_transcription.text}")
        # Interruption
        if content.interrupted is True:
            pass  # Stop playback, clear audio queue
```

JavaScript

```js
// Inside the onmessage callback
const content = response.serverContent;
if (content?.modelTurn?.parts) {
  for (const part of content.modelTurn.parts) {
    if (part.inlineData) {
      const audioData = part.inlineData.data; // Base64 encoded
    }
  }
}
if (content?.inputTranscription) console.log('User:', content.inputTranscription.text);
if (content?.outputTranscription) console.log('Gemini:', content.outputTranscription.text);
if (content?.interrupted) { /* Stop playback, clear audio queue */ }
```

Limitations
- Response modality — Only `TEXT` or `AUDIO` per session, not both
- Audio-only session — 15 min without compression
- Audio+video session — 2 min without compression
- Connection lifetime — ~10 min (use session resumption)
- Context window — 128k tokens (native audio) / 32k tokens (standard)
- Code execution — Not supported
- URL context — Not supported
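The 15-minute session cap and ~10-minute connection lifetime above are exactly what compression and resumption exist to lift. A sketch of a session config enabling both with the Python SDK; the type names and token thresholds follow the session-management docs as we read them, so treat the exact shapes as assumptions to verify.

```python
from google.genai import types

config = types.LiveConnectConfig(
    response_modalities=[types.Modality.AUDIO],
    # Sliding-window compression lifts the 15 min audio-only session cap.
    context_window_compression=types.ContextWindowCompressionConfig(
        trigger_tokens=25600,
        sliding_window=types.SlidingWindow(target_tokens=12800),
    ),
    # Pass a handle received from a previous session to resume after a reset.
    session_resumption=types.SessionResumptionConfig(handle=None),
)
```

With resumption enabled, the server periodically sends updated handles during the session; store the latest one and supply it on reconnect.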
Best Practices
- Use headphones when testing mic audio to prevent echo/self-interruption
- Enable context window compression for sessions longer than 15 minutes
- Implement session resumption to handle connection resets gracefully
- Use ephemeral tokens for client-side deployments — never expose API keys in browsers
- Use `send_realtime_input` for all real-time user input (audio, video, text). Reserve `send_client_content` for injecting conversation history only
- Send `audioStreamEnd` when the mic is paused to flush cached audio
- Clear audio playback queues on interruption signals
How to use the Gemini API
For detailed API documentation, fetch from the official docs index:

llms.txt URL: https://ai.google.dev/gemini-api/docs/llms.txt

This index contains links to all documentation pages in `.md.txt` format. Use web fetch tools to:
- Fetch `llms.txt` to discover available documentation pages
- Fetch specific pages (e.g., https://ai.google.dev/gemini-api/docs/live-session.md.txt)
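Discovering pages from the index is just link extraction. A stdlib sketch that pulls the `.md.txt` URLs out of fetched `llms.txt` text (the sample string below is illustrative, not real index content):

```python
import re

def extract_doc_urls(llms_txt: str) -> list[str]:
    """Return every absolute .md.txt link found in the index text."""
    return re.findall(r"https://\S+?\.md\.txt", llms_txt)

# Illustrative sample only; fetch the real index from the URL above.
sample = (
    "- [Live session](https://ai.google.dev/gemini-api/docs/live-session.md.txt)\n"
    "- [Live guide](https://ai.google.dev/gemini-api/docs/live-guide.md.txt)\n"
)
urls = extract_doc_urls(sample)
```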
Key Documentation Pages
[!IMPORTANT] These are not all the documentation pages. Use the `llms.txt` index to discover available documentation pages.
- Live API Overview — getting started, raw WebSocket usage
- Live API Capabilities Guide — voice config, transcription config, native audio (affective dialog, proactive audio, thinking), VAD configuration, media resolution
- Live API Tool Use — function calling (sync and async), Google Search grounding
- Session Management — context window compression, session resumption, GoAway signals
- Ephemeral Tokens — secure client-side authentication for browser/mobile
- WebSockets API Reference — raw WebSocket protocol details
Supported Languages
The Live API supports 70 languages including: English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Hindi, Arabic, Russian, and many more. Native audio models automatically detect and switch languages.