ListenHub TTS: Text-to-Speech

Convert text to speech using the ListenHub OpenAPI. Three synthesis modes are supported: quick single-voice synthesis for short text, multi-role script synthesis, and long-text synthesis.

API Information

Base URL: `https://api.marswave.ai/openapi`

Authentication: `Authorization: Bearer $LISTENHUB_API_KEY` (read from environment variable)

Prerequisite Check: Before calling any API, confirm that the `LISTENHUB_API_KEY` environment variable is set. If not, prompt the user to configure it.
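
The prerequisite check can be sketched as a small shell function. The function name `check_api_key` and the wording of the message are illustrative and not part of the ListenHub API.

```bash
# Return 0 when LISTENHUB_API_KEY is set and non-empty, 1 otherwise.
# check_api_key is a hypothetical helper name used only in this sketch.
check_api_key() {
  if [ -z "${LISTENHUB_API_KEY:-}" ]; then
    echo "LISTENHUB_API_KEY is not set. Export it before calling the API." >&2
    return 1
  fi
  return 0
}
```

Calling `check_api_key || exit 1` at the top of a script stops it before any request is sent.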

Voice Selection Process

User Has Explicitly Specified a Voice

Use the speakerId specified by the user and skip the selection process.

User Has Not Specified a Voice

1. Call `GET /v1/speakers/list?language=zh` to retrieve the available voice list.
2. Display the voice list for user selection using AskUserQuestion in the following format:
   - `chat-girl-105-cn` (Xiaoman dxqqq) is selected by default
   - List display: `{name} ({gender}, {speakerId})`
   - Attach the demoAudioUrl of each voice for reference
3. Use the selected speakerId after user confirmation.

Default Voice

| Field | Value |
| --- | --- |
| speakerId | `chat-girl-105-cn` |
| Name | Xiaoman dxqqq |

Three Synthesis Modes

Mode 1: Quick Synthesis (Short Text, Single Voice)

Applicable Scenarios: Short text (≤ 1000 words), single voice, low latency required

Endpoint: `POST /v1/tts`

Request Body:

```json
{
  "text": "Text to synthesize",
  "speakerId": "chat-girl-105-cn",
  "format": "mp3",
  "sampleRate": 24000,
  "speed": 1.0
}
```

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | Text to synthesize |
| speakerId | string | Yes | Voice ID |
| format | string | No | Output format, default is `mp3` |
| sampleRate | int | No | Sample rate, default is `24000` |
| speed | float | No | Speech speed, default is `1.0`, range `0.5 ~ 2.0` |

Response: Directly returns the MP3 binary stream (`Content-Type: audio/mpeg`)

Call Example:

```bash
curl -X POST "https://api.marswave.ai/openapi/v1/tts" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "你好世界", "speakerId": "chat-girl-105-cn"}' \
  -o output.mp3
```
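
Because the call writes the response body straight to a file, a failed request could leave an error body saved as `output.mp3`. The sketch below guards against that by downloading to a temporary file and checking the HTTP status first. It assumes that success is signalled by HTTP 200, which the source does not state, and `tts_quick` is a hypothetical helper name.

```bash
# Sketch: POST to /v1/tts and keep the file only when the call succeeds.
# Assumption: success is HTTP 200 (not confirmed by the source).
tts_quick() {
  local payload="$1"               # JSON request body
  local out="${2:-./output.mp3}"   # destination path
  local tmp status

  tmp=$(mktemp) || return 1
  # -o writes the body to the temp file; -w prints only the status code.
  status=$(curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
    -H "Authorization: Bearer $LISTENHUB_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$payload" -o "$tmp" -w '%{http_code}') || status="000"

  if [ "$status" = "200" ]; then
    mv "$tmp" "$out"
  else
    echo "TTS request failed (HTTP $status); no file written to $out" >&2
    rm -f "$tmp"
    return 1
  fi
}
```

A call with an optional parameter might look like `tts_quick '{"text": "你好世界", "speakerId": "chat-girl-105-cn", "speed": 1.2}' ./hello.mp3`.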

Mode 2: Multi-role Script Synthesis

Applicable Scenarios: Multi-role dialogues, podcasts, and audiobook clips in which different voices alternate

Endpoint: `POST /v1/speech`

Request Body:

```json
{
  "script": [
    {
      "text": "Hello, welcome to this episode.",
      "speakerId": "chat-girl-105-cn"
    },
    {
      "text": "Thank you, today we'll talk about AI.",
      "speakerId": "chat-boy-101-cn"
    }
  ],
  "format": "mp3",
  "sampleRate": 24000
}
```

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| script | array | Yes | Script array, each item contains text and speakerId |
| script[].text | string | Yes | Text segment |
| script[].speakerId | string | Yes | Voice ID for this segment |
| format | string | No | Output format, default is `mp3` |
| sampleRate | int | No | Sample rate, default is `24000` |

Response: JSON

```json
{
  "audioUrl": "https://cdn.example.com/output.mp3",
  "duration": 12.5
}
```

Call Example:

```bash
curl -X POST "https://api.marswave.ai/openapi/v1/speech" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "script": [
      {"text": "你好，欢迎收听。", "speakerId": "chat-girl-105-cn"},
      {"text": "谢谢，我们开始吧。", "speakerId": "chat-boy-101-cn"}
    ]
  }'
```
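
Mode 2 returns JSON rather than audio bytes, so saving the file takes a second request for `audioUrl`. The following is a minimal sketch: `json_string_field` and `save_speech_audio` are hypothetical helper names, the `sed` match is a simplification that does not handle escaped quotes inside values, and it assumes the audio URL can be fetched without an Authorization header. A dedicated JSON parser would be more robust.

```bash
# Print the value of a string field from a JSON document on stdin.
# Limitation: plain sed matching; escaped quotes in values are not handled.
json_string_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Read a /v1/speech response on stdin and download its audioUrl.
# Assumption: the audio URL needs no authentication.
save_speech_audio() {
  local out="${1:-./output.mp3}" url
  url=$(json_string_field audioUrl)
  [ -n "$url" ] || { echo "No audioUrl in response" >&2; return 1; }
  curl -sS --fail -o "$out" "$url"
}
```

Piping the response of the call above into `save_speech_audio ./dialogue.mp3` would then write the audio to disk.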

Mode 3: Long Text Streaming Synthesis

Applicable Scenarios: Long text (> 1000 words), article reading, requiring AI polishing or segment processing

Endpoint: `POST /v1/flow-speech/episodes`

Request Body:

```json
{
  "title": "Article Title",
  "content": "Long text content...",
  "speakerId": "chat-girl-105-cn",
  "mode": "direct",
  "format": "mp3"
}
```

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| title | string | Yes | Audio title |
| content | string | No | Text content (choose either content or contentUrl) |
| contentUrl | string | No | Content URL (choose either content or contentUrl) |
| speakerId | string | Yes | Voice ID |
| mode | string | No | `direct` (direct synthesis) or `aiPolish` (AI polishing), default is `direct` |
| format | string | No | Output format, default is `mp3` |

Response: JSON

```json
{
  "episodeId": "ep_abc123",
  "status": "processing"
}
```

Poll for Results: `GET /v1/flow-speech/episodes/{episodeId}`

Polling Strategy:

1. Wait 30 seconds after submission.
2. Poll every 10 seconds afterwards.
3. Continue until status becomes `completed` or `failed`.

Polling Response:

```json
{
  "episodeId": "ep_abc123",
  "status": "completed",
  "audioUrl": "https://cdn.example.com/output.mp3",
  "duration": 180.5
}
```

| Status Value | Description |
| --- | --- |
| processing | Synthesis in progress, continue polling |
| completed | Synthesis completed, audioUrl is available |
| failed | Synthesis failed, check errorMessage |

Call Example:

```bash
# Submit task
curl -X POST "https://api.marswave.ai/openapi/v1/flow-speech/episodes" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "AI Technology Trends",
    "content": "Long text content...",
    "speakerId": "chat-girl-105-cn",
    "mode": "direct"
  }'

# Poll for results
curl "https://api.marswave.ai/openapi/v1/flow-speech/episodes/ep_abc123" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY"
```
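
The polling strategy for Mode 3 (wait 30 seconds, then poll every 10 seconds until `completed` or `failed`) could be automated roughly as follows. The helper names `fetch_episode` and `poll_episode`, the `POLL_*` override variables, and the cap of 60 attempts are assumptions added for this sketch; the source does not specify a maximum wait.

```bash
# Fetch the current episode JSON (one GET request).
fetch_episode() {
  curl -sS "https://api.marswave.ai/openapi/v1/flow-speech/episodes/$1" \
    -H "Authorization: Bearer $LISTENHUB_API_KEY"
}

# Poll until status is completed (prints the JSON, returns 0) or failed (returns 1).
poll_episode() {
  local episode_id="$1"
  local initial_wait="${POLL_INITIAL_WAIT:-30}"   # seconds before the first poll
  local interval="${POLL_INTERVAL:-10}"           # seconds between polls
  local max_polls="${POLL_MAX_ATTEMPTS:-60}"      # assumption: cap not in the source
  local body status attempt=0

  sleep "$initial_wait"
  while [ "$attempt" -lt "$max_polls" ]; do
    body=$(fetch_episode "$episode_id") || return 1
    # Simplified extraction of the "status" string field.
    status=$(printf '%s\n' "$body" |
      sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1)
    case "$status" in
      completed) printf '%s\n' "$body"; return 0 ;;
      failed)    printf '%s\n' "$body" >&2; return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "Timed out waiting for episode $episode_id" >&2
  return 1
}
```

Running `poll_episode ep_abc123` blocks until the episode finishes, after which `audioUrl` can be read from the printed JSON.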

Voice List Query

Endpoint: `GET /v1/speakers/list`

Query Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| language | string | No | Filter by language, e.g., `zh` (Chinese), `en` (English) |

Response:

```json
{
  "speakers": [
    {
      "name": "Xiaoman dxqqq",
      "speakerId": "chat-girl-105-cn",
      "demoAudioUrl": "https://cdn.example.com/demo.mp3",
      "gender": "female",
      "language": "zh"
    }
  ]
}
```

Mode Selection Logic

Automatically select the most appropriate mode based on user input:

| Condition | Mode |
| --- | --- |
| Text ≤ 1000 words, single voice | Mode 1: `/v1/tts` |
| Multi-role script, requiring different voices | Mode 2: `/v1/speech` |
| Text > 1000 words, or AI polishing required | Mode 3: `/v1/flow-speech/episodes` |
| User provides URL as content source | Mode 3: `/v1/flow-speech/episodes` |

If the user explicitly specifies a mode, prioritize the user-specified mode.
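
The selection rules above can be expressed as one function. This is a sketch under stated assumptions: `select_endpoint` is a hypothetical name, the `yes`/`no` flags are an invented calling convention, and checking the multi-role condition first is a choice made here because the source does not say which rule wins when several apply. Counting with `wc -w` relies on whitespace-separated words, so it undercounts Chinese text; a character count may be more appropriate there.

```bash
# Print the endpoint path for the given input.
# $1 = text, $2 = multi-role (yes/no), $3 = AI polishing (yes/no), $4 = content is a URL (yes/no)
select_endpoint() {
  local text="$1" multi_role="${2:-no}" ai_polish="${3:-no}" from_url="${4:-no}"
  local words
  # wc -w counts whitespace-separated words; strip padding some platforms add.
  words=$(printf '%s' "$text" | wc -w | tr -d '[:space:]')

  if [ "$multi_role" = "yes" ]; then
    echo "/v1/speech"                 # Mode 2 (precedence is an assumption)
  elif [ "$from_url" = "yes" ] || [ "$ai_polish" = "yes" ] || [ "$words" -gt 1000 ]; then
    echo "/v1/flow-speech/episodes"   # Mode 3
  else
    echo "/v1/tts"                    # Mode 1
  fi
}
```

An explicit user choice of mode should bypass this function entirely.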

User Interaction

Voice Selection

When the user does not specify a voice, use AskUserQuestion to display the voice list:

Please select a voice (default: Xiaoman dxqqq):
A. Xiaoman dxqqq (Female, chat-girl-105-cn) [Default]
B. [Other Voice Name] ([Gender], [speakerId])
C. ...
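
As an illustration, the prompt above could be generated from the voice list once it has been reduced to tab-separated `name`, `gender`, `speakerId` lines. The helper name `render_voice_menu` and the tab-separated input format are assumptions of this sketch, and the lettering only works for up to 26 voices.

```bash
# Read "name<TAB>gender<TAB>speakerId" lines on stdin and print the selection prompt.
# $1 = default speakerId, $2 = default voice name (both optional).
render_voice_menu() {
  local default_id="${1:-chat-girl-105-cn}"
  local default_name="${2:-Xiaoman dxqqq}"
  printf 'Please select a voice (default: %s):\n' "$default_name"
  awk -F '\t' -v default_id="$default_id" '{
    tag = ($3 == default_id) ? " [Default]" : ""
    # 64 + NR maps line 1 to "A", line 2 to "B", and so on.
    printf "%c. %s (%s, %s)%s\n", 64 + NR, $1, $2, $3, tag
  }'
}
```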

Synthesis Parameters

Optionally ask the user for:

- Speech speed (speed, default 1.0)
- Output format (format, default mp3)
- Long text mode: direct or aiPolish (default direct)
- Output file path (default `./output.mp3`)

Output

1. Save the audio to the specified path (default `./output.mp3`).
2. Output synthesis summary:
   - Mode used
   - Voice name and ID
   - Audio duration
   - File size
   - File path
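
A summary with these five fields might be printed as shown below. `print_summary` is a hypothetical helper and the line layout is only one possible format. Mode 1 returns raw audio without a `duration` field, so in that case the caller would have to pass a placeholder or measure the duration separately.

```bash
# Print the synthesis summary for a saved audio file.
# $1 = mode, $2 = voice name, $3 = speakerId, $4 = duration in seconds, $5 = file path
print_summary() {
  local mode="$1" voice_name="$2" speaker_id="$3" duration="$4" path="$5"
  local size
  [ -f "$path" ] || { echo "File not found: $path" >&2; return 1; }
  # wc -c reports the size in bytes; strip padding some platforms add.
  size=$(wc -c < "$path" | tr -d '[:space:]')
  printf 'Mode: %s\n' "$mode"
  printf 'Voice: %s (%s)\n' "$voice_name" "$speaker_id"
  printf 'Duration: %s s\n' "$duration"
  printf 'File size: %s bytes\n' "$size"
  printf 'File path: %s\n' "$path"
}
```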

Error Handling

- 401 Unauthorized: Prompt the user to check the `LISTENHUB_API_KEY` environment variable
- 400 Bad Request: Check request parameters and report specific errors to the user
- flow-speech failed: Report errorMessage, suggest the user retry or switch modes
- Network Error: Prompt to check network connection, suggest retrying
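
The HTTP cases above can be turned into user-facing messages with a simple lookup on the status code. The helper name `explain_http_status` and the message wording are illustrative. The `000` case relies on curl printing `000` for `%{http_code}` when no HTTP response was received.

```bash
# Map an HTTP status code (as printed by curl -w '%{http_code}') to a message.
explain_http_status() {
  case "$1" in
    200) echo "OK" ;;
    400) echo "Bad request: check the request parameters and report the specific error." ;;
    401) echo "Unauthorized: check the LISTENHUB_API_KEY environment variable." ;;
    000) echo "Network error: check the network connection and retry." ;;
    *)   echo "Unexpected HTTP status $1." ;;
  esac
}
```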

Complete Example

User Input: "Convert this text to speech: The weather is nice today, perfect for a walk outside."

Execution Process:

1. Check `LISTENHUB_API_KEY` ✓
2. Text length < 1000 words, single voice → Select Mode 1 `/v1/tts`
3. User did not specify a voice → Default to `chat-girl-105-cn` (Xiaoman)
4. Call API for synthesis
5. Save to `./output.mp3`
6. Output summary

User Input: "Read this article with Xiaoman's voice: article.md"

Execution Process:

1. Read content of `article.md`
2. Check text length > 1000 words → Select Mode 3 `/v1/flow-speech/episodes`
3. Voice specified: `chat-girl-105-cn` (Xiaoman)
4. Submit synthesis task
5. Poll until completion
6. Download audio and save to `./article.mp3`
7. Output summary
