ListenHub TTS: Text-to-Speech
Convert text to speech using the ListenHub OpenAPI. Three synthesis modes are supported: quick synthesis for short single-voice text, multi-role script synthesis, and long text synthesis.
API Information
- Base URL: https://api.marswave.ai/openapi
- Authentication: send the header Authorization: Bearer $LISTENHUB_API_KEY (key read from environment variable)
- Prerequisite Check: Before calling any API, confirm that the LISTENHUB_API_KEY environment variable is set. If it is not, prompt the user to configure it.
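The prerequisite check could be sketched as a small shell function. The function name require_api_key and the wording of the message are illustrative assumptions; only the variable name LISTENHUB_API_KEY comes from this document.

```bash
# Hypothetical helper: succeeds only when LISTENHUB_API_KEY is set and non-empty.
require_api_key() {
  if [ -z "${LISTENHUB_API_KEY:-}" ]; then
    # Diagnostics go to stderr so they never mix with command output.
    echo "LISTENHUB_API_KEY is not set. Export it before calling the API." >&2
    return 1
  fi
  return 0
}
```

A caller would typically run `require_api_key || exit 1` before the first request.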
Voice Selection Process
User Has Explicitly Specified a Voice
Directly use the speakerId specified by the user and skip the selection process.
User Has Not Specified a Voice
- Call GET /v1/speakers/list?language=zh to retrieve the available voice list
- Display the voice list for user selection using AskUserQuestion in the following format:
  - chat-girl-105-cn (Xiaoman dxqqq) is selected by default
  - List display: {name} ({gender}, {speakerId})
  - Attach the demoAudioUrl of each voice for reference
- Use the selected speakerId after user confirmation
Default Voice
| Field | Value |
|---|---|
| speakerId | chat-girl-105-cn |
| Name | Xiaoman dxqqq |
Three Synthesis Modes
Mode 1: Quick Synthesis (Short Text, Single Voice)
Applicable Scenarios: Short text (≤ 1000 words), single voice, low latency required
Endpoint: POST /v1/tts
Request Body:
json
{
"text": "Text to synthesize",
"speakerId": "chat-girl-105-cn",
"format": "mp3",
"sampleRate": 24000,
"speed": 1.0
}
| Parameter | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to synthesize |
| speakerId | string | Yes | Voice ID |
| format | string | No | Output format, default is mp3 |
| sampleRate | int | No | Sample rate in Hz, e.g. 24000 |
| speed | float | No | Speech speed, default is 1.0 |
Response: Returns the MP3 audio directly as a binary stream
Call Example:
bash
curl -X POST "https://api.marswave.ai/openapi/v1/tts" \
-H "Authorization: Bearer $LISTENHUB_API_KEY" \
-H "Content-Type: application/json" \
-d '{"text": "你好世界", "speakerId": "chat-girl-105-cn"}' \
-o output.mp3
Mode 2: Multi-role Script Synthesis
Applicable Scenarios: Multi-role dialogues, podcasts, and audiobook clips that alternate between different voices
Endpoint: POST /v1/speech
Request Body:
json
{
"script": [
{
"text": "Hello, welcome to this episode.",
"speakerId": "chat-girl-105-cn"
},
{
"text": "Thank you, today we'll talk about AI.",
"speakerId": "chat-boy-101-cn"
}
],
"format": "mp3",
"sampleRate": 24000
}
| Parameter | Type | Required | Description |
|---|---|---|---|
| script | array | Yes | Script array, each item contains text and speakerId |
| script[].text | string | Yes | Text segment |
| script[].speakerId | string | Yes | Voice ID for this segment |
| format | string | No | Output format, default is mp3 |
| sampleRate | int | No | Sample rate in Hz, e.g. 24000 |
Response: JSON
json
{
"audioUrl": "https://cdn.example.com/output.mp3",
"duration": 12.5
}
Call Example:
bash
curl -X POST "https://api.marswave.ai/openapi/v1/speech" \
-H "Authorization: Bearer $LISTENHUB_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"script": [
{"text": "你好,欢迎收听。", "speakerId": "chat-girl-105-cn"},
{"text": "谢谢,我们开始吧。", "speakerId": "chat-boy-101-cn"}
]
}'
Mode 3: Long Text Streaming Synthesis
Applicable Scenarios: Long text (> 1000 words), article reading, requiring AI polishing or segment processing
Endpoint: POST /v1/flow-speech/episodes
Request Body:
json
{
"title": "Article Title",
"content": "Long text content...",
"speakerId": "chat-girl-105-cn",
"mode": "direct",
"format": "mp3"
}
| Parameter | Type | Required | Description |
|---|---|---|---|
| title | string | Yes | Audio title |
| content | string | No | Text content (provide either content or contentUrl) |
| contentUrl | string | No | Content URL (provide either content or contentUrl) |
| speakerId | string | Yes | Voice ID |
| mode | string | No | direct (direct synthesis) or aiPolish (AI polishing), default is direct |
| format | string | No | Output format, default is mp3 |
Response: JSON
json
{
"episodeId": "ep_abc123",
"status": "processing"
}
Poll for Results:
bash
GET /v1/flow-speech/episodes/{episodeId}
Polling Strategy:
- Wait 30 seconds after submission
- Poll every 10 seconds afterwards
- Stop when status becomes completed or failed
Polling Response:
json
{
"episodeId": "ep_abc123",
"status": "completed",
"audioUrl": "https://cdn.example.com/output.mp3",
"duration": 180.5
}
| Status Value | Description |
|---|---|
| processing | Synthesis in progress, continue polling |
| completed | Synthesis completed, audioUrl is available |
| failed | Synthesis failed, check errorMessage |
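The polling schedule and status values above could be sketched as the loop below. This is an illustration under stated assumptions, not the reference client: the names fetch_status and poll_episode, the INITIAL_WAIT and POLL_INTERVAL overrides, the attempt cap of 60, and the sed-based extraction of the status field are all hypothetical choices made for the example.

```bash
# Hypothetical hook: prints the current status of an episode.
# Assumes the response carries a single "status" string field.
fetch_status() {
  curl -sS "https://api.marswave.ai/openapi/v1/flow-speech/episodes/$1" \
    -H "Authorization: Bearer $LISTENHUB_API_KEY" |
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# poll_episode EPISODE_ID [MAX_ATTEMPTS]
# Returns 0 on completed, 1 on failed, 2 when the attempts run out.
poll_episode() {
  _id=$1
  _max=${2:-60}                 # assumed cap; the document sets no upper limit
  sleep "${INITIAL_WAIT:-30}"   # wait 30 seconds after submission
  _n=0
  while [ "$_n" -lt "$_max" ]; do
    _status=$(fetch_status "$_id")
    case $_status in
      completed) echo completed; return 0 ;;
      failed)    echo failed;    return 1 ;;
    esac
    # processing (or an unreadable response): keep polling
    _n=$((_n + 1))
    sleep "${POLL_INTERVAL:-10}"  # poll every 10 seconds afterwards
  done
  echo timeout
  return 2
}
```

Keeping the HTTP call in a separate hook means the schedule can be exercised without network access, and the two environment overrides only exist to shorten the waits during such a dry run.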
Call Example:
bash
# Submit task
curl -X POST "https://api.marswave.ai/openapi/v1/flow-speech/episodes" \
-H "Authorization: Bearer $LISTENHUB_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"title": "AI Technology Trends",
"content": "Long text content...",
"speakerId": "chat-girl-105-cn",
"mode": "direct"
}'
# Poll for results
curl "https://api.marswave.ai/openapi/v1/flow-speech/episodes/ep_abc123" \
-H "Authorization: Bearer $LISTENHUB_API_KEY"
Voice List Query
Endpoint: GET /v1/speakers/list
Query Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| language | string | No | Filter by language code, e.g., zh (Chinese); English is also supported |
Response:
json
{
"speakers": [
{
"name": "Xiaoman dxqqq",
"speakerId": "chat-girl-105-cn",
"demoAudioUrl": "https://cdn.example.com/demo.mp3",
"gender": "female",
"language": "zh"
}
]
}
Mode Selection Logic
Automatically select the most appropriate mode based on user input:
| Condition | Mode |
|---|---|
| Text ≤ 1000 words, single voice | Mode 1: /v1/tts |
| Multi-role script, requiring different voices | Mode 2: /v1/speech |
| Text > 1000 words, or AI polishing required | Mode 3: /v1/flow-speech/episodes |
| User provides URL as content source | Mode 3: /v1/flow-speech/episodes |
If the user explicitly specifies a mode, prioritize the user-specified mode.
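The selection rules could be expressed as one small function. This is a sketch: the name select_mode, its yes/no flag arguments, and the choice to test the multi-role condition before the long-text conditions are assumptions, since the table does not say which rule wins when several apply. The word count is passed in by the caller rather than computed here.

```bash
# select_mode WORD_COUNT MULTI_ROLE AI_POLISH URL_SOURCE [USER_MODE]
# Flags are "yes" or "no". Prints the mode number: 1, 2 or 3.
select_mode() {
  _words=$1 _multi=$2 _polish=$3 _url=$4 _user=${5:-}
  # An explicit user choice takes priority over the automatic rules.
  if [ -n "$_user" ]; then echo "$_user"; return 0; fi
  # Assumed precedence: a multi-role script is routed to Mode 2 first.
  if [ "$_multi" = yes ]; then echo 2; return 0; fi
  # URL source, AI polishing, or more than 1000 words all lead to Mode 3.
  if [ "$_url" = yes ] || [ "$_polish" = yes ] || [ "$_words" -gt 1000 ]; then
    echo 3; return 0
  fi
  echo 1
}
```

For example, `select_mode 12 no no no` selects Mode 1, while `select_mode 12 no no yes` selects Mode 3 because the content comes from a URL.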
User Interaction
Voice Selection
When the user does not specify a voice, use AskUserQuestion to display the voice list:
Please select a voice (default: Xiaoman dxqqq):
A. Xiaoman dxqqq (Female, chat-girl-105-cn) [Default]
B. [Other Voice Name] ([Gender], [speakerId])
C. ...
Synthesis Parameters
Optional inquiries:
- Speech speed (speed, default 1.0)
- Output format (format, default mp3)
- Long text mode: direct or aiPolish (default direct)
- Output file path (default )
Output
- Save the audio to the specified path (default )
- Output synthesis summary:
- Mode used
- Voice name and ID
- Audio duration
- File size
- File path
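A summary with the fields listed above might be printed as follows. The function name print_summary, the argument order, and the label wording are illustrative assumptions; the duration is passed in by the caller because only the JSON responses in this document report it.

```bash
# print_summary MODE VOICE_NAME SPEAKER_ID DURATION_SECONDS FILE_PATH
print_summary() {
  # wc -c counts bytes; tr strips the padding some wc implementations add.
  _size=$(wc -c < "$5" | tr -d ' ')
  printf 'Mode: %s\n' "$1"
  printf 'Voice: %s (%s)\n' "$2" "$3"
  printf 'Duration: %s s\n' "$4"
  printf 'File size: %s bytes\n' "$_size"
  printf 'File path: %s\n' "$5"
}
```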
Error Handling
- 401 Unauthorized: Prompt the user to check the LISTENHUB_API_KEY environment variable
- 400 Bad Request: Check request parameters and report specific errors to the user
- flow-speech failed: Report errorMessage, suggest the user retry or switch modes
- Network Error: Prompt to check network connection, suggest retrying
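The HTTP-level cases above could be mapped to user-facing messages with a helper like the one below. It is a sketch only: the name explain_error, the message wording, and the treatment of any nonzero curl exit code as a network error are assumptions. A failed flow-speech task is not covered, because it is reported through the status and errorMessage fields rather than through the HTTP status.

```bash
# explain_error HTTP_STATUS [CURL_EXIT_CODE]
explain_error() {
  # A nonzero curl exit code means the request itself did not complete.
  if [ "${2:-0}" -ne 0 ]; then
    echo "Network error: check the connection and retry."
    return 0
  fi
  case $1 in
    401) echo "Unauthorized: check the LISTENHUB_API_KEY environment variable." ;;
    400) echo "Bad request: check the request parameters." ;;
    2??) echo "OK" ;;
    *)   echo "Unexpected HTTP status $1." ;;
  esac
}
```

The status code can be captured with curl's `-w '%{http_code}'` write-out option, and curl's own exit code is available in `$?` right after the call.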
Complete Example
User Input: "Convert this text to speech: The weather is nice today, perfect for a walk outside."
Execution Process:
- Check LISTENHUB_API_KEY ✓
- Text length < 1000 words, single voice → Select Mode 1
- User did not specify a voice → Default to chat-girl-105-cn (Xiaoman)
- Call API for synthesis
- Save to the default output path
- Output summary
User Input: "Read this article with Xiaoman's voice: article.md"
Execution Process:
- Read content of article.md
- Check text length > 1000 words → Select Mode 3
- Voice specified: chat-girl-105-cn (Xiaoman)
- Submit synthesis task
- Poll until completion
- Download audio and save to the default output path
- Output summary