listenhub-tts
Convert text to speech (TTS) using the ListenHub API. Three modes are supported: Quick Synthesis (/v1/tts), Multi-role Script (/v1/speech), and Long Text Streaming Synthesis (/v1/flow-speech/episodes). If no voice is specified, retrieve the voice list automatically and let the user choose, with chat-girl-105-cn (Xiaoman) as the default. Use when the user says: "tts", "text to speech", "语音合成", "文字转语音", "朗读", "生成语音", "生成音频", "转音频", "text to audio".
Source: smallnest/goal-workflow
NPX Install
npx skill4agent add smallnest/goal-workflow listenhub-tts
ListenHub TTS: Text-to-Speech
Convert text to speech using the ListenHub OpenAPI. Three synthesis modes are supported, covering all scenarios from short text to long text.
API Information
- Base URL: https://api.marswave.ai/openapi
- Authentication: Authorization: Bearer $LISTENHUB_API_KEY (read from environment variable)
- Prerequisite Check: Before calling any API, confirm that the environment variable LISTENHUB_API_KEY is set. If not, prompt the user to configure it.
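The prerequisite check described above can be sketched as a small shell function. This is a minimal sketch, not part of the ListenHub API: the function name require_api_key and the wording of the error message are assumptions made for illustration; only the LISTENHUB_API_KEY variable comes from the text above.

```bash
# Hypothetical helper: fail early when the API key is missing.
require_api_key() {
  # Succeeds only if LISTENHUB_API_KEY is set and non-empty.
  if [ -z "${LISTENHUB_API_KEY:-}" ]; then
    echo "LISTENHUB_API_KEY is not set. Export it before calling the API." >&2
    return 1
  fi
  return 0
}
```

A caller would typically run require_api_key before any curl command and stop when it returns a non-zero status.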
Voice Selection Process
User Has Explicitly Specified a Voice
Directly use the speakerId specified by the user and skip the selection process.
User Has Not Specified a Voice
- Call GET /v1/speakers/list?language=zh to retrieve the available voice list
- Display the voice list for user selection using AskUserQuestion in the following format:
  - chat-girl-105-cn (Xiaoman dxqqq) is selected by default
  - List display: {name} ({gender}, {speakerId})
  - Attach the demoAudioUrl of each voice for reference
- Use the selected speakerId after user confirmation
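The list display format above can be produced with a small helper. This is a minimal sketch, assuming the name, gender, and speakerId values have already been extracted from the /v1/speakers/list response; format_voice_option and DEFAULT_SPEAKER_ID are hypothetical names, and the [Default] marker is an assumption modeled on the selection prompt shown under User Interaction.

```bash
# Default voice named in this document.
DEFAULT_SPEAKER_ID="chat-girl-105-cn"

# Hypothetical helper: print one voice as "{name} ({gender}, {speakerId})".
format_voice_option() {
  # $1 = name, $2 = gender, $3 = speakerId
  if [ "$3" = "$DEFAULT_SPEAKER_ID" ]; then
    printf '%s (%s, %s) [Default]\n' "$1" "$2" "$3"
  else
    printf '%s (%s, %s)\n' "$1" "$2" "$3"
  fi
}
```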
Default Voice
| Field | Value |
|---|---|
| speakerId | chat-girl-105-cn |
| Name | Xiaoman dxqqq |
Three Synthesis Modes
Mode 1: Quick Synthesis (Short Text, Single Voice)
Applicable Scenarios: Short text (≤ 1000 words), single voice, low latency required
Endpoint: POST /v1/tts
Request Body:
json
{
  "text": "Text to synthesize",
  "speakerId": "chat-girl-105-cn",
  "format": "mp3",
  "sampleRate": 24000,
  "speed": 1.0
}
| Parameter | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to synthesize |
| speakerId | string | Yes | Voice ID |
| format | string | No | Output format, default is mp3 |
| sampleRate | int | No | Sample rate (24000 in the example request) |
| speed | float | No | Speech speed, default is 1.0 |
Response: Directly returns an MP3 binary stream (Content-Type: audio/mpeg)
Call Example:
bash
curl -X POST "https://api.marswave.ai/openapi/v1/tts" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "你好世界", "speakerId": "chat-girl-105-cn"}' \
  -o output.mp3
Mode 2: Multi-role Script Synthesis
Applicable Scenarios: Multi-role dialogues, podcasts, audiobook clips, requiring alternating reading with different voices
Endpoint: POST /v1/speech
Request Body:
json
{
  "script": [
    {
      "text": "Hello, welcome to this episode.",
      "speakerId": "chat-girl-105-cn"
    },
    {
      "text": "Thank you, today we'll talk about AI.",
      "speakerId": "chat-boy-101-cn"
    }
  ],
  "format": "mp3",
  "sampleRate": 24000
}
| Parameter | Type | Required | Description |
|---|---|---|---|
| script | array | Yes | Script array, each item contains text and speakerId |
| script[].text | string | Yes | Text segment |
| script[].speakerId | string | Yes | Voice ID for this segment |
| format | string | No | Output format, default is mp3 |
| sampleRate | int | No | Sample rate (24000 in the example request) |
Response: JSON
json
{
  "audioUrl": "https://cdn.example.com/output.mp3",
  "duration": 12.5
}
Call Example:
bash
curl -X POST "https://api.marswave.ai/openapi/v1/speech" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "script": [
      {"text": "你好,欢迎收听。", "speakerId": "chat-girl-105-cn"},
      {"text": "谢谢,我们开始吧。", "speakerId": "chat-boy-101-cn"}
    ]
  }'
Mode 3: Long Text Streaming Synthesis
Applicable Scenarios: Long text (> 1000 words), article reading, requiring AI polishing or segment processing
Endpoint: POST /v1/flow-speech/episodes
Request Body:
json
{
  "title": "Article Title",
  "content": "Long text content...",
  "speakerId": "chat-girl-105-cn",
  "mode": "direct",
  "format": "mp3"
}
| Parameter | Type | Required | Description |
|---|---|---|---|
| title | string | Yes | Audio title |
| content | string | No | Text content (choose either content or contentUrl) |
| contentUrl | string | No | Content URL (choose either content or contentUrl) |
| speakerId | string | Yes | Voice ID |
| mode | string | No | Long text mode: direct or aiPolish, default is direct |
| format | string | No | Output format, default is mp3 |
Response: JSON
json
{
  "episodeId": "ep_abc123",
  "status": "processing"
}
Poll for Results:
bash
GET /v1/flow-speech/episodes/{episodeId}
Polling Strategy:
- Wait 30 seconds after submission
- Poll every 10 seconds afterwards
- Stop when status becomes completed or failed
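The polling strategy above can be sketched as a shell loop. This is a sketch under stated assumptions, not an official client: poll_episode and fetch_episode_status are hypothetical names, jq is assumed to be installed for reading the status field, the INITIAL_WAIT and POLL_INTERVAL overrides exist only to make the delays adjustable, and the cap of 60 polls is an assumed safety limit that the source does not specify.

```bash
# Hypothetical helper: print the status field of one episode (assumes jq).
fetch_episode_status() {
  curl -sS "https://api.marswave.ai/openapi/v1/flow-speech/episodes/$1" \
    -H "Authorization: Bearer $LISTENHUB_API_KEY" | jq -r '.status'
}

# Hypothetical helper: wait, then poll until completed or failed.
poll_episode() {
  # $1 = episodeId, $2 = maximum number of polls (assumed default: 60)
  poll_episode_id=$1
  poll_max=${2:-60}
  sleep "${INITIAL_WAIT:-30}"        # wait 30 seconds after submission
  poll_count=0
  while [ "$poll_count" -lt "$poll_max" ]; do
    poll_status=$(fetch_episode_status "$poll_episode_id")
    case "$poll_status" in
      completed) echo "completed"; return 0 ;;
      failed)    echo "failed";    return 1 ;;
    esac
    poll_count=$((poll_count + 1))
    sleep "${POLL_INTERVAL:-10}"     # then poll every 10 seconds
  done
  echo "timeout"                     # assumed outcome when the cap is reached
  return 2
}
```

Calling poll_episode ep_abc123 after submitting a task would block until the episode reaches a final status; the caller can then fetch the episode once more to read audioUrl or errorMessage.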
Polling Response:
json
{
  "episodeId": "ep_abc123",
  "status": "completed",
  "audioUrl": "https://cdn.example.com/output.mp3",
  "duration": 180.5
}
| Status Value | Description |
|---|---|
| processing | Synthesis in progress, continue polling |
| completed | Synthesis completed, audioUrl is available |
| failed | Synthesis failed, check errorMessage |
Call Example:
bash
# Submit task
curl -X POST "https://api.marswave.ai/openapi/v1/flow-speech/episodes" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "AI Technology Trends",
    "content": "Long text content...",
    "speakerId": "chat-girl-105-cn",
    "mode": "direct"
  }'
# Poll for results
curl "https://api.marswave.ai/openapi/v1/flow-speech/episodes/ep_abc123" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY"
Voice List Query
Endpoint: GET /v1/speakers/list
Query Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| language | string | No | Filter by language, e.g., zh |
Response:
json
{
  "speakers": [
    {
      "name": "Xiaoman dxqqq",
      "speakerId": "chat-girl-105-cn",
      "demoAudioUrl": "https://cdn.example.com/demo.mp3",
      "gender": "female",
      "language": "zh"
    }
  ]
}
Mode Selection Logic
Automatically select the most appropriate mode based on user input:
| Condition | Mode |
|---|---|
| Text ≤ 1000 words, single voice | Mode 1: /v1/tts |
| Multi-role script, requiring different voices | Mode 2: /v1/speech |
| Text > 1000 words, or AI polishing required | Mode 3: /v1/flow-speech/episodes |
| User provides URL as content source | Mode 3: /v1/flow-speech/episodes |
If the user explicitly specifies a mode, prioritize the user-specified mode.
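The selection table above can be expressed as a small decision function. This is a minimal sketch: select_mode and its arguments are hypothetical, and the order in which the conditions are checked (explicit choice first, then URL source or AI polishing, then multiple voices, then length) is an assumption, since the table does not say which rule wins when several apply.

```bash
# Hypothetical helper: print the endpoint for the chosen mode.
select_mode() {
  # $1 = word count, $2 = number of distinct voices,
  # $3 = "yes" if the content is a URL or AI polishing is required,
  # $4 = endpoint explicitly requested by the user (optional)
  if [ -n "${4:-}" ]; then
    echo "$4"                          # user-specified mode takes priority
  elif [ "${3:-no}" = "yes" ]; then
    echo "/v1/flow-speech/episodes"    # Mode 3
  elif [ "$2" -gt 1 ]; then
    echo "/v1/speech"                  # Mode 2
  elif [ "$1" -gt 1000 ]; then
    echo "/v1/flow-speech/episodes"    # Mode 3
  else
    echo "/v1/tts"                     # Mode 1
  fi
}
```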
User Interaction
Voice Selection
When the user does not specify a voice, use AskUserQuestion to display the voice list:
Please select a voice (default: Xiaoman dxqqq):
A. Xiaoman dxqqq (Female, chat-girl-105-cn) [Default]
B. [Other Voice Name] ([Gender], [speakerId])
C. ...
Synthesis Parameters
Optional inquiries:
- Speech speed (speed, default 1.0)
- Output format (format, default mp3)
- Long text mode: direct or aiPolish (default direct)
- Output file path (default ./output.mp3)
Output
- Save the audio to the specified path (default ./output.mp3)
- Output synthesis summary:
  - Mode used
  - Voice name and ID
  - Audio duration
  - File size
  - File path
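The summary fields listed above could be printed with a helper along these lines. This is only a sketch: print_summary and its argument order are hypothetical, the duration is passed in by the caller because only the JSON responses carry a duration field, and the file size is measured locally with wc.

```bash
# Hypothetical helper: print the synthesis summary for a saved audio file.
print_summary() {
  # $1 = mode, $2 = voice name, $3 = speakerId,
  # $4 = duration in seconds, $5 = path of the saved file
  summary_size=$(wc -c < "$5" | tr -d ' ')   # size in bytes, padding removed
  printf 'Mode: %s\n' "$1"
  printf 'Voice: %s (%s)\n' "$2" "$3"
  printf 'Duration: %s s\n' "$4"
  printf 'File size: %s bytes\n' "$summary_size"
  printf 'File path: %s\n' "$5"
}
```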
Error Handling
- 401 Unauthorized: Prompt the user to check the environment variable LISTENHUB_API_KEY
- 400 Bad Request: Check request parameters and report specific errors to the user
- flow-speech failed: Report errorMessage, suggest the user retry or switch modes
- Network Error: Prompt to check network connection, suggest retrying
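The HTTP-level cases above can be mapped to user-facing messages with a small helper. This is a minimal sketch: describe_http_error and the message wording are assumptions, and it relies on curl's --write-out variable http_code, which prints 000 when no HTTP response was received. A failed flow-speech episode is reported through the status field rather than an HTTP code, so it is not covered here.

```bash
# Hypothetical helper: turn an HTTP status code into a hint for the user.
describe_http_error() {
  # $1 = HTTP status code as printed by curl -w '%{http_code}'
  case "$1" in
    401) echo "Unauthorized: check the LISTENHUB_API_KEY environment variable." ;;
    400) echo "Bad request: check the request parameters." ;;
    000) echo "Network error: check the connection and retry." ;;
    *)   echo "Unexpected HTTP status $1." ;;
  esac
}

# Example use (not executed here):
#   status=$(curl -sS -o output.mp3 -w '%{http_code}' -X POST \
#     "https://api.marswave.ai/openapi/v1/tts" \
#     -H "Authorization: Bearer $LISTENHUB_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d '{"text": "你好世界", "speakerId": "chat-girl-105-cn"}')
#   [ "$status" = "200" ] || describe_http_error "$status" >&2
```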
Complete Example
User Input: "Convert this text to speech: The weather is nice today, perfect for a walk outside."
Execution Process:
- Check LISTENHUB_API_KEY ✓
- Text length < 1000 words, single voice → Select Mode 1 /v1/tts
- User did not specify a voice → Default to chat-girl-105-cn (Xiaoman)
- Call API for synthesis
- Save to ./output.mp3
- Output summary
User Input: "Read this article with Xiaoman's voice: article.md"
Execution Process:
- Read content of article.md
- Check text length > 1000 words → Select Mode 3 /v1/flow-speech/episodes
- Voice specified: chat-girl-105-cn (Xiaoman)
- Submit synthesis task
- Poll until completion
- Download audio and save to ./article.mp3
- Output summary