google-gemini-media


Gemini Multimodal Media (Image/Video/Speech) Skill

1. Goals and scope

This Skill consolidates six Gemini API capabilities into reusable workflows and implementation templates:
  • Image generation (Nano Banana: text-to-image, image editing, multi-turn iteration)
  • Image understanding (caption/VQA/classification/comparison, multi-image prompts; supports inline and Files API)
  • Video generation (Veo 3.1: text-to-video, aspect ratio/resolution control, reference-image guidance, first/last frames, video extension, native audio)
  • Video understanding (upload/inline/YouTube URL; summaries, Q&A, timestamped evidence)
  • Speech generation (Gemini native TTS: single-speaker and multi-speaker; controllable style/accent/pace/tone)
  • Audio understanding (upload/inline; description, transcription, time-range transcription, token counting)
Convention: This Skill follows the official Google Gen AI SDK (Node.js/REST) as the main line; currently only Node.js/REST examples are provided. If your project already wraps other languages or frameworks, map this Skill's request structure, model selection, and I/O spec to your wrapper layer.

2. Quick routing (decide which capability to use)

  1. Do you need to produce images?
    • Need to generate images from scratch or edit based on an image -> use Nano Banana image generation (see Section 5)
  2. Do you need to understand images?
    • Need recognition, description, Q&A, comparison, or info extraction -> use Image understanding (see Section 6)
  3. Do you need to produce video?
    • Need to generate an 8-second video (optionally with native audio) -> use Veo 3.1 video generation (see Section 7)
  4. Do you need to understand video?
    • Need summaries/Q&A/segment extraction with timestamps -> use Video understanding (see Section 8)
  5. Do you need to read text aloud?
    • Need controllable narration, podcast/audiobook style, etc. -> use Speech generation (TTS) (see Section 9)
  6. Do you need to understand audio?
    • Need audio descriptions, transcription, time-range transcription, or token counting -> use Audio understanding (see Section 10)

3. Unified engineering constraints and I/O spec (must read)

3.0 Prerequisites (dependencies and tools)

  • Node.js 18+ (match your project version)
  • Install the SDK (example):

```bash
npm install @google/genai
```

  • REST examples only need `curl`; if you need to parse image Base64, optionally install `jq`.

3.1 Authentication and environment variables

  • Put your API key in the `GEMINI_API_KEY` environment variable.
  • REST requests send the `x-goog-api-key: $GEMINI_API_KEY` header.

3.2 Two file input modes: Inline vs Files API

Inline (embedded bytes/Base64)
  • Pros: shorter call chain; good for small files.
  • Key constraint: total request size (text prompt + system instructions + embedded bytes) typically has a ~20MB ceiling.
Files API (upload then reference)
  • Pros: good for large files, reusing the same file, or multi-turn conversations.
  • Typical flow:
    1. Upload via `files.upload(...)` (SDK) or `POST /upload/v1beta/files` (REST, resumable)
    2. Reference the file via `file_data`/`file_uri` in `generateContent`
Engineering suggestion: implement an `ensure_file_uri()` helper so that files above a threshold (for example, warn at 10-15MB) or files that will be reused are automatically routed through the Files API.
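A minimal sketch of that helper, assuming a `GoogleGenAI` client named `ai`; the function names and the 15MB threshold are our own conventions, not an official API:

```javascript
// Route between inline bytes and the Files API by size/reuse.
// The threshold is illustrative; the hard limit is the ~20MB total request cap.
const INLINE_LIMIT_BYTES = 15 * 1024 * 1024;

function shouldInline(sizeBytes, willReuse = false) {
  return !willReuse && sizeBytes <= INLINE_LIMIT_BYTES;
}

// Returns null when the caller should embed bytes inline; otherwise uploads
// once and returns a { fileUri, mimeType } reference for generateContent.
async function ensureFileUri(ai, filePath, sizeBytes, willReuse = false) {
  if (shouldInline(sizeBytes, willReuse)) return null;
  const uploaded = await ai.files.upload({ file: filePath });
  return { fileUri: uploaded.uri, mimeType: uploaded.mimeType };
}
```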

3.3 Unified handling of binary media outputs

  • Images: usually returned as `inline_data` (Base64) in response parts; decode the Base64 and save as PNG/JPG (some SDKs also expose a helper such as `part.as_image()`).
  • Speech (TTS): usually returns PCM bytes (Base64); save as `.pcm` or wrap into `.wav` (commonly 24kHz, 16-bit, mono).
  • Video (Veo): a long-running async task; poll the operation, then download the file (or use the returned URI).
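A minimal sketch for the `.wav` wrapping step, assuming the PCM defaults above (24kHz, 16-bit, mono); `pcmToWav` is our own helper name:

```javascript
// Wrap raw 16-bit PCM bytes in a 44-byte RIFF/WAV header so players can open it.
function pcmToWav(pcm, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
  const byteRate = (sampleRate * channels * bitsPerSample) / 8;
  const blockAlign = (channels * bitsPerSample) / 8;
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4); // total RIFF chunk size
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);             // fmt chunk size for PCM
  header.writeUInt16LE(1, 20);              // audio format: 1 = PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, pcm]);
}
// Usage: fs.writeFileSync("out.wav", pcmToWav(Buffer.from(base64Audio, "base64")));
```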


4. Model selection matrix (choose by scenario)

Important: model names, versions, limits, and quotas can change over time. Verify against official docs before use. Last updated: 2026-01-22.

4.1 Image generation (Nano Banana)

  • gemini-2.5-flash-image: optimized for speed/throughput; good for frequent, low-latency generation/editing.
  • gemini-3-pro-image-preview: stronger instruction following and high-fidelity text rendering; better for professional assets and complex edits.

4.2 General image/video/audio understanding

  • Docs use `gemini-3-flash-preview` for image, video, and audio understanding (choose stronger models as needed for quality/cost).

4.3 Video generation (Veo)

  • Example model: `veo-3.1-generate-preview` (generates 8-second videos and can natively generate audio).

4.4 Speech generation (TTS)

  • Example model: `gemini-2.5-flash-preview-tts` (native TTS, currently in preview).


5. Image generation (Nano Banana)

5.1 Text-to-Image

SDK (Node.js) minimal template

```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents:
    "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme",
});

const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.text) console.log(part.text);
  if (part.inlineData?.data) {
    fs.writeFileSync("out.png", Buffer.from(part.inlineData.data, "base64"));
  }
}
```
REST (with imageConfig) minimal template

```bash
curl -s -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents":[{"parts":[{"text":"Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme"}]}],
    "generationConfig": {"imageConfig": {"aspectRatio":"16:9"}}
  }'
```
REST image parsing (Base64 decode)

```bash
curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents":[{"parts":[{"text":"A minimal studio product shot of a nano banana"}]}]}' \
  | jq -r '.candidates[0].content.parts[] | select(.inline_data) | .inline_data.data' \
  | base64 --decode > out.png
```

On macOS, replace `base64 --decode` with `base64 -D` (i.e., `base64 -D > out.png`).



5.2 Text-and-Image-to-Image

Use case: given an image, add/remove/modify elements, change style, color grading, etc.
SDK (Node.js) minimal template

```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const prompt =
  "Add a nano banana on the table, keep lighting consistent, cinematic tone.";
const imageBase64 = fs.readFileSync("input.png").toString("base64");

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents: [
    { text: prompt },
    { inlineData: { mimeType: "image/png", data: imageBase64 } },
  ],
});

const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.inlineData?.data) {
    fs.writeFileSync("edited.png", Buffer.from(part.inlineData.data, "base64"));
  }
}
```

5.3 Multi-turn image iteration (Multi-turn editing)

Best practice: use chat for continuous iteration (for example: generate first, then "only edit a specific region/element", then "make variants in the same style").
To output mixed "text + image" results, set `response_modalities` to `["TEXT", "IMAGE"]`.
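A sketch of that flow using the SDK's chat surface (`ai.chats.create` / `sendMessage`; verify the exact method names against your SDK version — in the Node.js SDK the config key is camelCase `responseModalities`):

```javascript
// Multi-turn image iteration: the chat keeps prior turns, so follow-up edits
// can reference "the previous image" implicitly. `ai` is assumed to be a
// configured GoogleGenAI client.
const chatParams = {
  model: "gemini-2.5-flash-image",
  config: { responseModalities: ["TEXT", "IMAGE"] },
};

async function iterateOnImage(ai) {
  const chat = ai.chats.create(chatParams);
  await chat.sendMessage({ message: "Generate a studio shot of a nano banana." });
  // Second turn: a targeted edit against the image from the first turn.
  return chat.sendMessage({
    message: "Keep the composition; only change the background to matte black.",
  });
}
```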

5.4 ImageConfig

You can set these in `generationConfig.imageConfig` (REST) or the SDK config:
  • `aspectRatio`: e.g. `16:9`, `1:1`.
  • `imageSize`: e.g. `2K`, `4K` (higher resolution is usually slower/more expensive, and model support can vary).
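An illustrative request shape (the prompt text is ours, and `imageSize` support varies by model — verify before relying on `4K`):

```javascript
// generationConfig.imageConfig maps to `config.imageConfig` in the Node.js SDK.
const imageRequest = {
  model: "gemini-2.5-flash-image",
  contents: "A wide banner of a nano banana orchard at sunrise",
  config: { imageConfig: { aspectRatio: "16:9", imageSize: "2K" } },
};
// await ai.models.generateContent(imageRequest); // needs a configured client
```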


6. Image understanding

6.1 Two ways to provide input images

  • Inline image data: suitable for small files (total request size < 20MB).
  • Files API upload: better for large files or reuse across multiple requests.

6.2 Inline images (Node.js) minimal template

```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const imageBase64 = fs.readFileSync("image.jpg").toString("base64");

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: [
    { inlineData: { mimeType: "image/jpeg", data: imageBase64 } },
    { text: "Caption this image, and list any visible brands." },
  ],
});

console.log(response.text);
```

6.3 Upload and reference with Files API (Node.js) minimal template

```js
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "image.jpg" });

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    createPartFromUri(uploaded.uri, uploaded.mimeType),
    "Caption this image.",
  ]),
});

console.log(response.text);
```

6.4 Multi-image prompts

Append multiple images as multiple `Part` entries in the same `contents`; you can mix uploaded references and inline bytes.
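For example, a `contents` parts array mixing a Files API reference and inline bytes (the `fileUri` and Base64 payload below are placeholders):

```javascript
// Each image is one Part; ordering matters for prompts like "the first image".
const parts = [
  // Uploaded reference (placeholder URI from a prior files.upload call):
  { fileData: { fileUri: "https://generativelanguage.googleapis.com/v1beta/files/abc-123", mimeType: "image/jpeg" } },
  // Inline bytes (truncated Base64 placeholder):
  { inlineData: { mimeType: "image/png", data: "iVBORw0KGgo=" } },
  { text: "Compare the two images and list the differences." },
];
```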


7. Video generation (Veo 3.1)

7.1 Core features (must know)

  • Generates 8-second high-fidelity video at 720p, 1080p, or 4K, and supports native audio generation (dialogue, ambience, SFX).
  • Supports:
    • Aspect ratio (16:9 / 9:16)
    • Video extension (extend a generated video; typically limited to 720p)
    • First/last frame control (frame-specific)
    • Up to 3 reference images (image-based direction)

7.2 SDK (Node.js) minimal template: async polling + download

```js
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const prompt =
  "A cinematic shot of a cat astronaut walking on the moon. Include subtle wind ambience.";
let operation = await ai.models.generateVideos({
  model: "veo-3.1-generate-preview",
  prompt,
  config: { resolution: "1080p" },
});

while (!operation.done) {
  await new Promise((resolve) => setTimeout(resolve, 10_000));
  operation = await ai.operations.getVideosOperation({ operation });
}

const video = operation.response?.generatedVideos?.[0]?.video;
if (!video) throw new Error("No video returned");
await ai.files.download({ file: video, downloadPath: "out.mp4" });
```

7.3 REST minimal template: predictLongRunning + poll + download

Key point: Veo REST uses `:predictLongRunning` to return an operation name; poll `GET /v1beta/{operation_name}` until done, then download from the video URI in the response.
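A hedged shell sketch of that flow (field names such as `instances[0].prompt` and `.name` follow the documented request/response shapes; verify them, and the exact video-URI path in the final JSON, before relying on this):

```shell
# Start a Veo long-running job, then poll the returned operation until done.
BASE="https://generativelanguage.googleapis.com/v1beta"

start_video() {  # prints the operation name
  curl -s -X POST "$BASE/models/veo-3.1-generate-preview:predictLongRunning" \
    -H "x-goog-api-key: $GEMINI_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"instances":[{"prompt":"A cinematic shot of a cat astronaut on the moon"}]}' \
    | jq -r '.name'
}

poll_operation() {  # $1 = operation name; prints the final operation JSON
  while :; do
    body=$(curl -s "$BASE/$1" -H "x-goog-api-key: $GEMINI_API_KEY")
    [ "$(printf '%s' "$body" | jq -r '.done')" = "true" ] && { printf '%s\n' "$body"; return 0; }
    sleep 10
  done
}

# op=$(start_video); poll_operation "$op"   # then download the video URI from the JSON
```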

7.4 Common controls (recommend a unified wrapper)

  • `aspectRatio`: `"16:9"` or `"9:16"`
  • `resolution`: `"720p" | "1080p" | "4k"` (higher resolutions are usually slower/more expensive)
  • When writing prompts: put dialogue in quotes; explicitly call out SFX and ambience; use cinematography language (camera position, movement, composition, lens effects, mood).
  • Negative constraints: if the API supports a negative prompt field, use it; otherwise list elements you do not want to see.

7.5 Important limits (engineering fallback needed)

  • Latency can vary from seconds to minutes; implement timeouts and retries.
  • Generated videos are only retained on the server for a limited time (download promptly).
  • Outputs include a SynthID watermark.
Polling fallback (with timeout/backoff) pseudocode

```js
const deadline = Date.now() + 300_000; // 5 min
let sleepMs = 2000;
while (!operation.done && Date.now() < deadline) {
  await new Promise((resolve) => setTimeout(resolve, sleepMs));
  sleepMs = Math.min(Math.floor(sleepMs * 1.5), 15_000);
  operation = await ai.operations.getVideosOperation({ operation });
}
if (!operation.done) throw new Error("video generation timed out");
```


8. Video understanding

8.1 Video input options

  • Files API upload: recommended when file > 100MB, video length > ~1 minute, or you need reuse.
  • Inline video data: for smaller files.
  • Direct YouTube URL: can analyze public videos.

8.2 Files API (Node.js) minimal template

```js
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "sample.mp4" });

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    createPartFromUri(uploaded.uri, uploaded.mimeType),
    "Summarize this video. Provide timestamps for key events.",
  ]),
});

console.log(response.text);
```

8.3 Timestamp prompting strategy

  • Ask for segmented bullet points with "(mm:ss)" timestamps.
  • Require evidence with specific time ranges; if you also need downstream structured extraction (e.g., JSON), request it in the same prompt.
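A sample prompt combining both requirements in one request (the JSON schema is our own convention, not something the API enforces):

```javascript
// Ask for human-readable bullets plus a machine-parseable JSON block.
const timestampPrompt = [
  "Summarize this video as bullet points.",
  "For each point, cite a time range like (01:05-01:20) as evidence.",
  'Finally, output a JSON array: [{"start":"mm:ss","end":"mm:ss","event":"..."}].',
].join(" ");
```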


9. Speech generation (Text-to-Speech, TTS)

9.1 Positioning

  • Native TTS: for "precise reading + controllable style" (podcasts, audiobooks, ad voiceover, etc.).
  • Distinguish it from the Live API: the Live API targets interactive, unstructured audio/multimodal conversation, while TTS focuses on controlled narration.

9.2 Single-speaker TTS (Node.js) minimal template

```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-tts",
  contents: [{ parts: [{ text: "Say cheerfully: Have a wonderful day!" }] }],
  config: {
    responseModalities: ["AUDIO"],
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: { voiceName: "Kore" },
      },
    },
  },
});

const data =
  response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data ?? "";
if (!data) throw new Error("No audio returned");
fs.writeFileSync("out.pcm", Buffer.from(data, "base64"));
```

9.3 Multi-speaker TTS (max 2 speakers)

Requirements:
  • Use `multiSpeakerVoiceConfig`.
  • Each speaker name must match the dialogue labels in the prompt (e.g., Joe/Jane).
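A hedged sketch of a two-speaker request (Joe/Jane and the voice pairing are examples; verify the `multiSpeakerVoiceConfig` field names against the TTS docs):

```javascript
// The `speaker` values must match the "Joe:" / "Jane:" labels in the text.
const ttsRequest = {
  model: "gemini-2.5-flash-preview-tts",
  contents: [{
    parts: [{
      text: "TTS the following conversation:\nJoe: How's it going today, Jane?\nJane: Not too bad, how about you?",
    }],
  }],
  config: {
    responseModalities: ["AUDIO"],
    speechConfig: {
      multiSpeakerVoiceConfig: {
        speakerVoiceConfigs: [
          { speaker: "Joe", voiceConfig: { prebuiltVoiceConfig: { voiceName: "Kore" } } },
          { speaker: "Jane", voiceConfig: { prebuiltVoiceConfig: { voiceName: "Puck" } } },
        ],
      },
    },
  },
};
// await ai.models.generateContent(ttsRequest); // needs a configured client
```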

9.4 Voice options and language

  • `voice_name` supports 30 prebuilt voices (for example Zephyr, Puck, Charon, Kore).
  • The model can auto-detect the input language and supports 24 languages (see the docs for the list).

9.5 "Director notes" (strongly recommended for high-quality voice)

Provide controllable directions for style, pace, accent, etc., but avoid over-constraining.
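For example (the wording below is ours): prepend a short directorial instruction to the text itself rather than trying to encode style as separate parameters:

```javascript
// Style directions live in the prompt text for native TTS.
const directedText =
  "Read in a warm, unhurried podcast tone with a neutral accent, " +
  "pausing briefly between sentences: Welcome back to the show.";
```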


10. Audio understanding

10.1 Typical tasks

  • Describe audio content (including non-speech like birds, alarms, etc.)
  • Generate transcripts
  • Transcribe specific time ranges
  • Count tokens (for cost estimates/segmentation)

10.2 Files API (Node.js) minimal template

```js
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "sample.mp3" });

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    "Describe this audio clip.",
    createPartFromUri(uploaded.uri, uploaded.mimeType),
  ]),
});

console.log(response.text);
```

10.3 Key limits and engineering tips

  • Supports common formats: WAV/MP3/AIFF/AAC/OGG/FLAC.
  • Audio tokenization: about 32 tokens/second (about 1920 tokens per minute; values may change).
  • Total audio length per prompt is capped at 9.5 hours; multi-channel audio is downmixed; audio is resampled (see docs for exact parameters).
  • If total request size exceeds 20MB, you must use the Files API.
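A rough pre-flight estimate based on the ~32 tokens/second figure above (illustrative helper; for the authoritative number, use the SDK's `countTokens`):

```javascript
// ~32 tokens per second of audio (figure may change; see the docs).
function estimateAudioTokens(durationSeconds, tokensPerSecond = 32) {
  return Math.ceil(durationSeconds * tokensPerSecond);
}
// Authoritative count (requires a configured client):
// const { totalTokens } = await ai.models.countTokens({ model: "gemini-3-flash-preview", contents });
```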


11. End-to-end examples (composition)

Example A: Image generation -> validation via understanding

  1. Generate product images with Nano Banana (require negative space, consistent lighting).
  2. Use image understanding for self-check: verify text clarity, brand spelling, and unsafe elements.
  3. If not satisfied, feed the generated image into text+image editing and iterate.

Example B: Video generation -> video understanding -> narration script

  1. Generate an 8-second shot with Veo (include dialogue or SFX).
  2. Download and save (respect retention window).
  3. Upload video to video understanding to produce a storyboard + timestamps + narration copy (then feed to TTS).

Example C: Audio understanding -> time-range transcription -> TTS redub

  1. Upload meeting audio and transcribe full content.
  2. Transcribe or summarize specific time ranges.
  3. Use TTS to generate a "broadcast" version of the summary.


12. Compliance and risk (must follow)

  • Ensure you have the necessary rights to upload images/video/audio; do not generate infringing, deceptive, harassing, or harmful content.
  • Generated images and videos include SynthID watermarking; videos may also have regional/person-based generation constraints.
  • Production systems must implement timeouts, retries, failure fallbacks, and human review/post-processing for generated content.


13. Quick reference (Checklist)

  • Pick the right model: image generation (Flash Image / Pro Image Preview), video generation (Veo 3.1), TTS (Gemini 2.5 TTS), understanding (Gemini Flash/Pro).
  • Pick the right input mode: inline for small files; Files API for large files and reuse.
  • Parse binary outputs correctly: image/audio via `inline_data` decode; video via operation polling + download.
  • For video generation: set `aspectRatio` / `resolution`, and download promptly (avoid expiration).
  • For TTS: set `response_modalities=["AUDIO"]`; max 2 speakers; speaker names must match the prompt.
  • For audio understanding: `countTokens` when needed; segment long audio or use the Files API.