text-to-speech

Text-to-Speech (HeyGen Starfish)

Generate speech audio files from text using HeyGen's in-house Starfish TTS model. This skill is for standalone audio generation — separate from video creation.

Authentication

All requests require the `X-Api-Key` header. Set the `HEYGEN_API_KEY` environment variable.

```bash
curl -X GET "https://api.heygen.com/v1/audio/voices" \
  -H "X-Api-Key: $HEYGEN_API_KEY"
```

Tool Selection

If HeyGen MCP tools are available (`mcp__heygen__*`), prefer them over direct HTTP API calls.

| Task | MCP Tool | Fallback (Direct API) |
|------|----------|-----------------------|
| List TTS voices | `mcp__heygen__list_audio_voices` | `GET /v1/audio/voices` |
| Generate speech audio | `mcp__heygen__text_to_speech` | `POST /v1/audio/text_to_speech` |

Default Workflow

1. List voices with `mcp__heygen__list_audio_voices` (or `GET /v1/audio/voices`)
2. Pick a voice matching the desired language, gender, and features
3. Call `mcp__heygen__text_to_speech` (or `POST /v1/audio/text_to_speech`) with `text` and `voice_id`
4. Use the returned `audio_url` to download or play the audio
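
Step 2 of the workflow is the only one that is not a single API call, so a small sketch may help. The helper below filters an already-fetched voice list by language, gender, and the `support_pause` / `support_locale` flags described in the List TTS Voices section. The function name `pick_voice`, its parameters, and the second sample voice are illustrative assumptions, not part of the HeyGen API.

```python
from typing import Optional


def pick_voice(
    voices: list[dict],
    language: str,
    gender: Optional[str] = None,
    need_pause: bool = False,
    need_locale: bool = False,
) -> Optional[dict]:
    """Return the first voice matching the filters, or None if nothing matches."""
    wanted = language.lower()
    for voice in voices:
        # Case-insensitive substring match on the language field.
        if wanted not in voice.get("language", "").lower():
            continue
        if gender is not None and voice.get("gender") != gender:
            continue
        if need_pause and not voice.get("support_pause", False):
            continue
        if need_locale and not voice.get("support_locale", False):
            continue
        return voice
    return None


# Sample entries shaped like the voice objects in the voices response.
# The second entry is made up for illustration.
sample_voices = [
    {
        "voice_id": "f38a635bee7a4d1f9b0a654a31d050d2",
        "name": "Chill Brian",
        "language": "English",
        "gender": "male",
        "support_pause": True,
        "support_locale": False,
    },
    {
        "voice_id": "example-voice-id",
        "name": "Example Voice",
        "language": "Portuguese",
        "gender": "female",
        "support_pause": False,
        "support_locale": True,
    },
]

voice = pick_voice(sample_voices, "english", gender="male")
```

Returning `None` instead of raising leaves it to the caller to decide whether to relax the filters or fail.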

List TTS Voices

Retrieve voices compatible with the Starfish TTS model.

Note: This uses `GET /v1/audio/voices`, a different endpoint from the video voices API (`GET /v2/voices`). Not all video voices support Starfish TTS.

curl

```bash
curl -X GET "https://api.heygen.com/v1/audio/voices" \
  -H "X-Api-Key: $HEYGEN_API_KEY"
```

TypeScript

```typescript
// One entry in the voices list.
interface TTSVoice {
  voice_id: string;
  language: string;
  gender: "female" | "male" | "unknown";
  name: string;
  preview_audio_url: string | null;
  support_pause: boolean;
  support_locale: boolean;
  type: string;
}

interface TTSVoicesResponse {
  error: null | string;
  data: {
    voices: TTSVoice[];
  };
}

// Fetch all voices compatible with Starfish TTS.
async function listTTSVoices(): Promise<TTSVoice[]> {
  const response = await fetch("https://api.heygen.com/v1/audio/voices", {
    headers: { "X-Api-Key": process.env.HEYGEN_API_KEY! },
  });

  const json: TTSVoicesResponse = await response.json();

  // Failures are reported in the `error` field of the JSON body.
  if (json.error) {
    throw new Error(json.error);
  }

  return json.data.voices;
}
```

Python

```python
import requests
import os

def list_tts_voices() -> list:
    """Fetch all voices compatible with Starfish TTS."""
    response = requests.get(
        "https://api.heygen.com/v1/audio/voices",
        headers={"X-Api-Key": os.environ["HEYGEN_API_KEY"]}
    )

    data = response.json()
    # Failures are reported in the "error" field of the JSON body.
    if data.get("error"):
        raise Exception(data["error"])

    return data["data"]["voices"]
```

Response Format

```json
{
  "error": null,
  "data": {
    "voices": [
      {
        "voice_id": "f38a635bee7a4d1f9b0a654a31d050d2",
        "name": "Chill Brian",
        "language": "English",
        "gender": "male",
        "preview_audio_url": "https://resource.heygen.ai/text_to_speech/WpSDQvmLGXEqXZVZQiVeg6.mp3",
        "support_pause": true,
        "support_locale": false,
        "type": "public"
      }
    ]
  }
}
```

Generate Speech Audio

Convert text to speech audio using a specified voice.

Endpoint

POST https://api.heygen.com/v1/audio/text_to_speech

Request Fields

| Field | Type | Req | Description |
|-------|------|-----|-------------|
| `text` | string | Y | Text content to convert to speech |
| `voice_id` | string | Y | Voice ID from `GET /v1/audio/voices` |
| `speed` | number | | Speech speed, 0.5-1.5 (default: 1) |
| `pitch` | integer | | Voice pitch, -50 to 50 (default: 0) |
| `locale` | string | | Accent/locale for multilingual voices (e.g., `en-US`, `pt-BR`) |
| `elevenlabs_settings` | object | | Advanced settings for ElevenLabs voices |
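
The documented ranges can be checked on the client before a request is sent. The following is a minimal sketch, assuming you want to fail early with a `ValueError` rather than rely on the server's response to out-of-range values (which this document does not describe). The name `build_tts_payload` is hypothetical.

```python
from typing import Optional


def build_tts_payload(
    text: str,
    voice_id: str,
    speed: float = 1.0,
    pitch: int = 0,
    locale: Optional[str] = None,
) -> dict:
    """Validate fields against the documented ranges and return the request body."""
    if not text:
        raise ValueError("text is required")
    if not voice_id:
        raise ValueError("voice_id is required")
    if not 0.5 <= speed <= 1.5:
        raise ValueError(f"speed must be between 0.5 and 1.5, got {speed}")
    if not isinstance(pitch, int):
        raise ValueError(f"pitch must be an integer, got {pitch!r}")
    if not -50 <= pitch <= 50:
        raise ValueError(f"pitch must be between -50 and 50, got {pitch}")

    payload = {"text": text, "voice_id": voice_id, "speed": speed, "pitch": pitch}
    # locale is optional; include it only when the caller supplies one.
    if locale:
        payload["locale"] = locale
    return payload


payload = build_tts_payload("Hello!", "YOUR_VOICE_ID", speed=1.1, locale="pt-BR")
```

The returned dictionary can be passed as the JSON body of `POST /v1/audio/text_to_speech`.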

ElevenLabs Settings (optional)

| Field | Type | Description |
|-------|------|-------------|
| `model` | string | Model selection (`eleven_v3`, `eleven_turbo_v2_5`, etc.) |
| `similarity_boost` | number | Voice similarity, 0.0-1.0 |
| `stability` | number | Output consistency, 0.0-1.0 |
| `style` | number | Style intensity, 0.0-1.0 |
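
None of the examples in this document pass `elevenlabs_settings`, so here is a hedged sketch of how the object might be assembled. It assumes every field is optional and that unset fields can simply be left out; the helper name `build_elevenlabs_settings` is hypothetical.

```python
from typing import Optional


def build_elevenlabs_settings(
    model: Optional[str] = None,
    similarity_boost: Optional[float] = None,
    stability: Optional[float] = None,
    style: Optional[float] = None,
) -> dict:
    """Build the elevenlabs_settings object, omitting fields that are not set."""
    settings: dict = {}
    if model is not None:
        settings["model"] = model

    numeric_fields = {
        "similarity_boost": similarity_boost,
        "stability": stability,
        "style": style,
    }
    for name, value in numeric_fields.items():
        if value is None:
            continue
        # The table above documents a 0.0-1.0 range for these fields.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
        settings[name] = value
    return settings


request_body = {
    "text": "Hello! Welcome to our product demo.",
    "voice_id": "YOUR_VOICE_ID",
    "elevenlabs_settings": build_elevenlabs_settings(model="eleven_v3", stability=0.5),
}
```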

curl

```bash
curl -X POST "https://api.heygen.com/v1/audio/text_to_speech" \
  -H "X-Api-Key: $HEYGEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello! Welcome to our product demo.",
    "voice_id": "YOUR_VOICE_ID",
    "speed": 1.0
  }'
```

TypeScript

```typescript
// Request body for POST /v1/audio/text_to_speech.
interface TTSRequest {
  text: string;
  voice_id: string;
  speed?: number;
  pitch?: number;
  locale?: string;
  elevenlabs_settings?: {
    model?: string;
    similarity_boost?: number;
    stability?: number;
    style?: number;
  };
}

// Timing for one word of the generated audio, in seconds.
interface WordTimestamp {
  word: string;
  start: number;
  end: number;
}

interface TTSResponse {
  error: null | string;
  data: {
    audio_url: string;
    duration: number;
    request_id: string;
    word_timestamps: WordTimestamp[];
  };
}

// Generate speech audio and return the `data` object of the response.
async function textToSpeech(request: TTSRequest): Promise<TTSResponse["data"]> {
  const response = await fetch(
    "https://api.heygen.com/v1/audio/text_to_speech",
    {
      method: "POST",
      headers: {
        "X-Api-Key": process.env.HEYGEN_API_KEY!,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(request),
    }
  );

  const json: TTSResponse = await response.json();

  // Failures are reported in the `error` field of the JSON body.
  if (json.error) {
    throw new Error(json.error);
  }

  return json.data;
}
```

Python

```python
import requests
import os

def text_to_speech(
    text: str,
    voice_id: str,
    speed: float = 1.0,
    pitch: int = 0,
    locale: str | None = None,
) -> dict:
    """Generate speech audio and return the "data" object of the response."""
    payload = {
        "text": text,
        "voice_id": voice_id,
        "speed": speed,
        "pitch": pitch,
    }

    # locale is optional; send it only when provided.
    if locale:
        payload["locale"] = locale

    response = requests.post(
        "https://api.heygen.com/v1/audio/text_to_speech",
        headers={
            "X-Api-Key": os.environ["HEYGEN_API_KEY"],
            "Content-Type": "application/json",
        },
        json=payload,
    )

    data = response.json()
    # Failures are reported in the "error" field of the JSON body.
    if data.get("error"):
        raise Exception(data["error"])

    return data["data"]
```

Response Format

```json
{
  "error": null,
  "data": {
    "audio_url": "https://resource2.heygen.ai/text_to_speech/.../id=365d46bb.wav",
    "duration": 5.526,
    "request_id": "p38QJ52hfgNlsYKZZmd9",
    "word_timestamps": [
      { "word": "<start>", "start": 0.0, "end": 0.0 },
      { "word": "Hey", "start": 0.079, "end": 0.219 },
      { "word": "there,", "start": 0.239, "end": 0.459 },
      { "word": "<end>", "start": 5.526, "end": 5.526 }
    ]
  }
}
```
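
The `word_timestamps` array can drive captions. As a rough sketch, the code below groups words into SubRip (SRT) cues. It assumes timestamps are in seconds and that `<start>` and `<end>` are sentinel entries to skip, as the sample response suggests. The function names and the `words_per_cue` parameter are illustrative.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(word_timestamps: list[dict], words_per_cue: int = 5) -> str:
    """Group word timestamps into numbered SRT cues."""
    # Drop the <start> and <end> sentinel entries seen in the sample response.
    words = [w for w in word_timestamps if w["word"] not in ("<start>", "<end>")]
    cues = []
    for i in range(0, len(words), words_per_cue):
        chunk = words[i : i + words_per_cue]
        text = " ".join(w["word"] for w in chunk)
        start = format_srt_time(chunk[0]["start"])
        end = format_srt_time(chunk[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)


sample_timestamps = [
    {"word": "<start>", "start": 0.0, "end": 0.0},
    {"word": "Hey", "start": 0.079, "end": 0.219},
    {"word": "there,", "start": 0.239, "end": 0.459},
    {"word": "<end>", "start": 5.526, "end": 5.526},
]
srt_text = words_to_srt(sample_timestamps)
```

A fixed word count per cue keeps the sketch short; splitting on punctuation or on gaps between words would likely read better in real captions.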

Usage Examples

Basic TTS

```typescript
const result = await textToSpeech({
  text: "Welcome to our quarterly earnings call.",
  voice_id: "YOUR_VOICE_ID",
});

console.log(`Audio URL: ${result.audio_url}`);
console.log(`Duration: ${result.duration}s`);
```

With Speed Adjustment

```typescript
const result = await textToSpeech({
  text: "We're thrilled to announce our newest feature!",
  voice_id: "YOUR_VOICE_ID",
  speed: 1.1,
});
```

With Locale for Multilingual Voices

```typescript
const result = await textToSpeech({
  text: "Bem-vindo ao nosso produto.",
  voice_id: "MULTILINGUAL_VOICE_ID",
  locale: "pt-BR",
});
```

Find a Voice and Generate Audio

```typescript
async function generateSpeech(text: string, language: string): Promise<string> {
  const voices = await listTTSVoices();
  // Take the first voice whose language contains the requested name (case-insensitive).
  const voice = voices.find(
    (v) => v.language.toLowerCase().includes(language.toLowerCase())
  );

  if (!voice) {
    throw new Error(`No TTS voice found for language: ${language}`);
  }

  const result = await textToSpeech({
    text,
    voice_id: voice.voice_id,
  });

  return result.audio_url;
}

const audioUrl = await generateSpeech("Hello and welcome!", "english");
```

Pauses with Break Tags

Use SSML-style break tags in your text for pauses:

`word <break time="1s"/> word`

Rules:
  • Use seconds with an `s` suffix: `<break time="1.5s"/>`
  • Put a space before and after the tag
  • Use the self-closing tag format
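
These rules are easy to get wrong when text is assembled by hand, so a small helper can enforce them. This is a sketch only; `break_tag` and `join_with_pauses` are hypothetical names. Voice objects also carry a `support_pause` flag, which presumably indicates whether a voice honors these tags.

```python
def break_tag(seconds: float) -> str:
    """Return a self-closing break tag with the duration in seconds."""
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    # The :g format drops trailing zeros, so 1.0 becomes "1" and 1.5 stays "1.5".
    return f'<break time="{seconds:g}s"/>'


def join_with_pauses(segments: list[str], seconds: float) -> str:
    """Join text segments with a break tag, keeping one space on each side."""
    tag = break_tag(seconds)
    cleaned = [s.strip() for s in segments if s.strip()]
    return f" {tag} ".join(cleaned)


text = join_with_pauses(["Hello there.", "Welcome back."], 1.5)
```

The resulting string can be passed as the `text` field of a text-to-speech request.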

Best Practices

1. Use `GET /v1/audio/voices` to find compatible voices; not all voices from `GET /v2/voices` support Starfish TTS
2. Check `support_locale` before setting a `locale`; only multilingual voices support locale selection
3. Keep speed between 0.8 and 1.2 for natural-sounding output
4. Preview voices using the `preview_audio_url` before generating (may be null for some voices)
5. Use `word_timestamps` in the response for caption syncing or timed text overlays
6. Use SSML break tags in your text for pauses: `word <break time="1s"/> word`
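
Practices 2 and 3 can be applied mechanically just before a request is sent. The sketch below is one possible approach, not something the API requires. It drops `locale` when the chosen voice does not report `support_locale`, and emits a warning when `speed` falls outside the 0.8-1.2 range. The function name and the choice to warn instead of clamping are assumptions.

```python
import warnings

NATURAL_SPEED_RANGE = (0.8, 1.2)


def apply_voice_constraints(payload: dict, voice: dict) -> dict:
    """Return a copy of the payload adjusted to what the chosen voice supports."""
    adjusted = dict(payload)  # leave the caller's dictionary untouched

    # Practice 2: only voices with support_locale accept a locale.
    if "locale" in adjusted and not voice.get("support_locale", False):
        adjusted.pop("locale")

    # Practice 3: flag speeds outside the natural-sounding range.
    low, high = NATURAL_SPEED_RANGE
    speed = adjusted.get("speed", 1.0)
    if not low <= speed <= high:
        warnings.warn(f"speed {speed} is outside {low}-{high}; output may sound unnatural")

    return adjusted


voice = {"voice_id": "YOUR_VOICE_ID", "support_locale": False}
payload = {"text": "Bem-vindo.", "voice_id": "YOUR_VOICE_ID", "locale": "pt-BR"}
safe_payload = apply_voice_constraints(payload, voice)
```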