listenhub-tts


Convert text to speech (TTS) with the ListenHub API. Three modes are supported: Quick Synthesis (/v1/tts), Multi-role Script (/v1/speech), and Long Text Streaming Synthesis (/v1/flow-speech/episodes). If no voice is specified, retrieve the voice list automatically and let the user choose, with chat-girl-105-cn (Xiaoman) as the default. Use when the user says: "tts", "text to speech", "语音合成", "文字转语音", "朗读", "生成语音", "生成音频", "转音频", "text to audio"


NPX Install

npx skill4agent add smallnest/goal-workflow listenhub-tts


SKILL.md Content (Chinese)


ListenHub TTS: Text-to-Speech

Convert text to speech using the ListenHub OpenAPI. Three synthesis modes are supported, covering all scenarios from short text to long text.

API Information

  • Base URL: https://api.marswave.ai/openapi
  • Authentication: Authorization: Bearer $LISTENHUB_API_KEY (read from environment variable)
  • Prerequisite Check: Before calling any API, confirm that the LISTENHUB_API_KEY environment variable is set. If not, prompt the user to configure it.
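
The prerequisite check and the authentication header could be sketched in Python as follows. The helper name `build_headers` is hypothetical and not part of the ListenHub API; only the base URL, the header format, and the environment variable name come from the list above.

```python
import os

BASE_URL = "https://api.marswave.ai/openapi"  # taken from the API Information list


def build_headers(env=None):
    """Return request headers, or raise if LISTENHUB_API_KEY is missing.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    api_key = env.get("LISTENHUB_API_KEY", "").strip()
    if not api_key:
        # The skill prompts the user to configure the key instead of calling the API.
        raise RuntimeError("LISTENHUB_API_KEY is not set; please configure it first.")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Running the check once before any request keeps a missing key from surfacing later as a 401 response.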

Voice Selection Process

User Has Explicitly Specified a Voice

Directly use the speakerId specified by the user and skip the selection process.

User Has Not Specified a Voice

  1. Call GET /v1/speakers/list?language=zh to retrieve the available voice list
  2. Display the voice list for user selection using AskUserQuestion in the following format:
    • chat-girl-105-cn (Xiaoman dxqqq) is selected by default
    • List display: {name} ({gender}, {speakerId})
    • Attach the demoAudioUrl of each voice for reference
  3. Use the selected speakerId after user confirmation
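
As a rough illustration of step 2, the sketch below turns a list of speaker records into display options in the `{name} ({gender}, {speakerId})` format, with the default voice listed first. The function name and the returned dictionary shape are assumptions made for this example; how the options are handed to AskUserQuestion is not shown.

```python
DEFAULT_SPEAKER_ID = "chat-girl-105-cn"


def format_voice_options(speakers, default_id=DEFAULT_SPEAKER_ID):
    """Return one option per voice, default voice first (hypothetical helper)."""
    # sorted() is stable, so non-default voices keep their original order.
    ordered = sorted(speakers, key=lambda s: s["speakerId"] != default_id)
    options = []
    for s in ordered:
        label = f'{s["name"]} ({s["gender"]}, {s["speakerId"]})'
        if s["speakerId"] == default_id:
            label += " [Default]"
        # demoAudioUrl is attached so the user can preview each voice.
        options.append({"label": label, "demoAudioUrl": s.get("demoAudioUrl")})
    return options
```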

Default Voice

| Field | Value |
| --- | --- |
| speakerId | chat-girl-105-cn |
| Name | Xiaoman dxqqq |

Three Synthesis Modes

Mode 1: Quick Synthesis (Short Text, Single Voice)

Applicable Scenarios: Short text (≤ 1000 words), single voice, low latency required
Endpoint: POST /v1/tts
Request Body:
json
{
  "text": "Text to synthesize",
  "speakerId": "chat-girl-105-cn",
  "format": "mp3",
  "sampleRate": 24000,
  "speed": 1.0
}
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | Text to synthesize |
| speakerId | string | Yes | Voice ID |
| format | string | No | Output format, default is mp3 |
| sampleRate | int | No | Sample rate, default is 24000 |
| speed | float | No | Speech speed, default is 1.0, range 0.5 ~ 2.0 |

Response: Directly returns MP3 binary stream (Content-Type: audio/mpeg)
Call Example:
bash
curl -X POST "https://api.marswave.ai/openapi/v1/tts" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "你好世界", "speakerId": "chat-girl-105-cn"}' \
  -o output.mp3
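
For callers who prefer Python over curl, the same request might be assembled as below. This is a minimal sketch using only the standard library: `build_tts_request` is a hypothetical helper, and the speed check simply mirrors the documented 0.5 ~ 2.0 range rather than any validation the server is known to perform. The request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.marswave.ai/openapi"


def build_tts_request(text, speaker_id, api_key, speed=1.0, fmt="mp3", sample_rate=24000):
    """Build (but do not send) a POST /v1/tts request."""
    if not text:
        raise ValueError("text is required")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be within 0.5 ~ 2.0")
    payload = {
        "text": text,
        "speakerId": speaker_id,
        "format": fmt,
        "sampleRate": sample_rate,
        "speed": speed,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending is left to the caller, for example:
# with urllib.request.urlopen(build_tts_request("你好世界", "chat-girl-105-cn", key)) as resp:
#     with open("output.mp3", "wb") as f:
#         f.write(resp.read())  # the response body is the audio stream itself
```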

Mode 2: Multi-role Script Synthesis

Applicable Scenarios: Multi-role dialogues, podcasts, audiobook clips, requiring alternating reading with different voices
Endpoint: POST /v1/speech
Request Body:
json
{
  "script": [
    {
      "text": "Hello, welcome to this episode.",
      "speakerId": "chat-girl-105-cn"
    },
    {
      "text": "Thank you, today we'll talk about AI.",
      "speakerId": "chat-boy-101-cn"
    }
  ],
  "format": "mp3",
  "sampleRate": 24000
}
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| script | array | Yes | Script array, each item contains text and speakerId |
| script[].text | string | Yes | Text segment |
| script[].speakerId | string | Yes | Voice ID for this segment |
| format | string | No | Output format, default is mp3 |
| sampleRate | int | No | Sample rate, default is 24000 |
Response: JSON
json
{
  "audioUrl": "https://cdn.example.com/output.mp3",
  "duration": 12.5
}
Call Example:
bash
curl -X POST "https://api.marswave.ai/openapi/v1/speech" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "script": [
      {"text": "你好,欢迎收听。", "speakerId": "chat-girl-105-cn"},
      {"text": "谢谢,我们开始吧。", "speakerId": "chat-boy-101-cn"}
    ]
  }'
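
When the dialogue arrives as role-labelled turns, the `script` array can be generated instead of written by hand. The sketch below assumes a hypothetical role-to-voice mapping supplied by the caller; role names such as "host" and "guest" are purely illustrative and are not sent to the API.

```python
def build_script_payload(turns, voices, fmt="mp3", sample_rate=24000):
    """Build the /v1/speech request body.

    turns:  list of (role, text) pairs in speaking order
    voices: dict mapping each role to a speakerId (caller-defined, hypothetical)
    """
    script = []
    for role, text in turns:
        if role not in voices:
            raise KeyError(f"no voice assigned to role {role!r}")
        script.append({"text": text, "speakerId": voices[role]})
    if not script:
        raise ValueError("script must contain at least one segment")
    return {"script": script, "format": fmt, "sampleRate": sample_rate}
```

For example, `build_script_payload([("host", "你好,欢迎收听。"), ("guest", "谢谢,我们开始吧。")], {"host": "chat-girl-105-cn", "guest": "chat-boy-101-cn"})` produces the same `script` array as the curl call above.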

Mode 3: Long Text Streaming Synthesis

Applicable Scenarios: Long text (> 1000 words), article reading, requiring AI polishing or segment processing
Endpoint: POST /v1/flow-speech/episodes
Request Body:
json
{
  "title": "Article Title",
  "content": "Long text content...",
  "speakerId": "chat-girl-105-cn",
  "mode": "direct",
  "format": "mp3"
}
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| title | string | Yes | Audio title |
| content | string | No | Text content (choose either content or contentUrl) |
| contentUrl | string | No | Content URL (choose either content or contentUrl) |
| speakerId | string | Yes | Voice ID |
| mode | string | No | direct (direct synthesis) or aiPolish (AI polishing), default is direct |
| format | string | No | Output format, default is mp3 |
Response: JSON
json
{
  "episodeId": "ep_abc123",
  "status": "processing"
}
Poll for Results: GET /v1/flow-speech/episodes/{episodeId}
Polling Strategy:
  1. Wait 30 seconds after submission
  2. Poll every 10 seconds afterwards
  3. Stop when status becomes completed or failed
Polling Response:
json
{
  "episodeId": "ep_abc123",
  "status": "completed",
  "audioUrl": "https://cdn.example.com/output.mp3",
  "duration": 180.5
}
| Status Value | Description |
| --- | --- |
| processing | Synthesis in progress, continue polling |
| completed | Synthesis completed, audioUrl is available |
| failed | Synthesis failed, check errorMessage |
Call Example:
bash
# Submit task
curl -X POST "https://api.marswave.ai/openapi/v1/flow-speech/episodes" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "AI Technology Trends",
    "content": "Long text content...",
    "speakerId": "chat-girl-105-cn",
    "mode": "direct"
  }'

# Poll for results
curl "https://api.marswave.ai/openapi/v1/flow-speech/episodes/ep_abc123" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY"
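
The polling strategy described above (wait 30 seconds, then poll every 10 seconds until `completed` or `failed`) could be sketched as a small loop. `fetch_status` is a caller-supplied function that performs the GET and returns the parsed JSON; the `max_polls` cap is an assumption added here so the loop cannot run forever, since the source does not state an upper limit.

```python
import time


def poll_episode(fetch_status, sleep=time.sleep, initial_wait=30, interval=10, max_polls=60):
    """Poll until the episode is completed or failed, then return the final JSON."""
    sleep(initial_wait)  # step 1: wait 30 seconds after submission
    for _ in range(max_polls):
        result = fetch_status()
        status = result.get("status")
        if status == "completed":
            return result  # audioUrl is available at this point
        if status == "failed":
            raise RuntimeError(result.get("errorMessage", "synthesis failed"))
        sleep(interval)  # step 2: poll every 10 seconds afterwards
    raise TimeoutError("episode did not finish within the polling budget")
```

Passing `sleep` as a parameter keeps the loop testable without real delays.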

Voice List Query

Endpoint: GET /v1/speakers/list
Query Parameters:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| language | string | No | Filter by language, e.g., zh (Chinese), en (English) |
Response:
json
{
  "speakers": [
    {
      "name": "Xiaoman dxqqq",
      "speakerId": "chat-girl-105-cn",
      "demoAudioUrl": "https://cdn.example.com/demo.mp3",
      "gender": "female",
      "language": "zh"
    }
  ]
}
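
A small sketch of building the query URL and indexing the response by `speakerId` is shown below. Both helper names are hypothetical, and the response shape is assumed to match the JSON example above.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.marswave.ai/openapi"


def speakers_url(language=None):
    """Return the voice list URL, adding ?language=... only when a filter is given."""
    url = f"{BASE_URL}/v1/speakers/list"
    if language:
        url += "?" + urlencode({"language": language})
    return url


def index_speakers(body):
    """Parse the response body and return a dict keyed by speakerId."""
    data = json.loads(body)
    return {s["speakerId"]: s for s in data.get("speakers", [])}
```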

Mode Selection Logic

Automatically select the most appropriate mode based on user input:
| Condition | Mode |
| --- | --- |
| Text ≤ 1000 words, single voice | Mode 1: /v1/tts |
| Multi-role script, requiring different voices | Mode 2: /v1/speech |
| Text > 1000 words, or AI polishing required | Mode 3: /v1/flow-speech/episodes |
| User provides URL as content source | Mode 3: /v1/flow-speech/episodes |
If the user explicitly specifies a mode, prioritize the user-specified mode.
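
The selection rules can be expressed as a short function. This is only a sketch: the source does not say which rule wins when several conditions hold at once (for example a multi-role script longer than 1000 words), so the ordering below is an assumption. The text length is passed in as a number so that the caller decides how to count it.

```python
MODE_TTS = "/v1/tts"
MODE_SPEECH = "/v1/speech"
MODE_FLOW = "/v1/flow-speech/episodes"


def select_mode(length, voice_count=1, ai_polish=False, content_url=None, forced_mode=None):
    """Pick an endpoint from the conditions in the Mode Selection Logic section."""
    if forced_mode:
        return forced_mode  # a mode the user names explicitly takes priority
    if content_url or ai_polish:
        return MODE_FLOW  # only Mode 3 accepts contentUrl or aiPolish
    if voice_count > 1:
        return MODE_SPEECH  # assumed to outrank the length rule
    if length > 1000:
        return MODE_FLOW
    return MODE_TTS
```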

User Interaction

Voice Selection

When the user does not specify a voice, use AskUserQuestion to display the voice list:
Please select a voice (default: Xiaoman dxqqq):
A. Xiaoman dxqqq (Female, chat-girl-105-cn) [Default]
B. [Other Voice Name] ([Gender], [speakerId])
C. ...

Synthesis Parameters

Optionally ask the user about:
  • Speech speed (speed, default 1.0)
  • Output format (format, default mp3)
  • Long text mode: direct or aiPolish (default direct)
  • Output file path (default ./output.mp3)

Output

  1. Save the audio to the specified path (default ./output.mp3)
  2. Output synthesis summary:
    • Mode used
    • Voice name and ID
    • Audio duration
    • File size
    • File path
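
One possible shape for the synthesis summary is sketched below. The field labels and the KiB unit are choices made for this example, not a format the skill prescribes. Duration is optional because the Mode 1 response shown earlier is a binary stream with no duration field.

```python
def format_summary(mode, voice_name, speaker_id, path, size_bytes, duration_s=None):
    """Return a five-line summary covering the items listed above."""
    duration = f"{duration_s:.1f} s" if duration_s is not None else "unknown"
    return "\n".join([
        f"Mode: {mode}",
        f"Voice: {voice_name} ({speaker_id})",
        f"Duration: {duration}",
        f"File size: {size_bytes / 1024:.1f} KiB",
        f"File path: {path}",
    ])
```

The file size can be obtained with `os.path.getsize(path)` once the audio has been written.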

Error Handling

  • 401 Unauthorized: Prompt the user to check the LISTENHUB_API_KEY environment variable
  • 400 Bad Request: Check request parameters and report specific errors to the user
  • flow-speech failed: Report errorMessage, suggest the user retry or switch modes
  • Network Error: Prompt to check network connection, suggest retrying
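
The four cases above might map to user-facing hints roughly as follows. Both function names are hypothetical, and `status_code=None` is a convention chosen here to represent a network failure where no HTTP response was received.

```python
def explain_failure(status_code=None, detail=""):
    """Map an HTTP status (or None for a network error) to a hint for the user."""
    if status_code is None:
        return "Network error: check the network connection and retry."
    if status_code == 401:
        return "401 Unauthorized: check the LISTENHUB_API_KEY environment variable."
    if status_code == 400:
        return f"400 Bad Request: check the request parameters. {detail}".strip()
    return f"Unexpected HTTP {status_code}. {detail}".strip()


def explain_episode_failure(episode):
    """Build the hint for a flow-speech episode whose status is 'failed'."""
    message = episode.get("errorMessage", "no errorMessage provided")
    return f"Synthesis failed: {message}. Retry, or switch to another mode."
```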

Complete Example

User Input: "Convert this text to speech: The weather is nice today, perfect for a walk outside."
Execution Process:
  1. Check LISTENHUB_API_KEY
  2. Text length < 1000 words, single voice → Select Mode 1 /v1/tts
  3. User did not specify a voice → Retrieve the voice list for selection, with chat-girl-105-cn (Xiaoman) as the default
  4. Call API for synthesis
  5. Save to ./output.mp3
  6. Output summary
User Input: "Read this article with Xiaoman's voice: article.md"
Execution Process:
  1. Read content of article.md
  2. Check text length > 1000 words → Select Mode 3 /v1/flow-speech/episodes
  3. Voice specified: chat-girl-105-cn (Xiaoman)
  4. Submit synthesis task
  5. Poll until completion
  6. Download audio and save to ./article.mp3
  7. Output summary