MiMo V2.5 TTS

Generate speech using Xiaomi MiMo V2.5 TTS series models. Supports Chinese and English, preset voices, voice design, voice cloning, emotional styles, dialects, and singing.

Script directory: `$SKILLS_PATH/mimo-v2-5-tts/scripts/`

`$SKILLS_PATH` is the path to the skills directory, which varies depending on the deployment environment.

Model Selection

The V2.5 series offers three models. Choose one according to your usage scenario:

Model ID	Use Case	Voice Source	Special Ability
`mimo-v2.5-tts`	Preset Voice Speech Synthesis	Built-in Premium Voices	Supports Singing
`mimo-v2.5-tts-voicedesign`	Custom Voice via Text Description	Generated from Text Description	—
`mimo-v2.5-tts-voiceclone`	Voice Replication via Audio Sample	Audio Sample	—

Selection Recommendations:

- Need to generate speech quickly, or need singing → `mimo-v2.5-tts` (preset voice)
- Need a unique voice → `mimo-v2.5-tts-voicedesign` (generated from a text description)
- Need to imitate a specific voice → `mimo-v2.5-tts-voiceclone` (replicated from an audio sample)
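The recommendations above amount to a small lookup. The sketch below is only an illustration: the model IDs come from the table, while the need labels (`preset`, `singing`, `design`, `clone`) and the function name are hypothetical.

```python
# Model IDs are copied from the model table; the "need" labels are
# hypothetical names chosen for this sketch.
MODEL_BY_NEED = {
    "preset": "mimo-v2.5-tts",
    "singing": "mimo-v2.5-tts",
    "design": "mimo-v2.5-tts-voicedesign",
    "clone": "mimo-v2.5-tts-voiceclone",
}


def choose_model(need: str) -> str:
    """Return the model ID for a need label, or raise ValueError if unknown."""
    try:
        return MODEL_BY_NEED[need]
    except KeyError:
        known = ", ".join(sorted(MODEL_BY_NEED))
        raise ValueError(f"unknown need {need!r}; expected one of: {known}") from None
```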

Note: TTS output is nondeterministic, so the same input may sound different from run to run. If needed, generate several takes and pick the best one.

Environment Dependencies

Environment Variable	Description	Required
`MIMO_API_KEY`	MiMo API Key (Obtained from MiMo Open Platform)	Yes

Dependency	Description	Required
`python3`	Run scripts	Yes
`openai`	`pip install openai`	Yes
`ffmpeg`	Format conversion, long text segmentation and splicing	Only required for conversion and splicing scenarios
`curl`	Call Feishu API	Only required for Feishu sending scenarios
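A preflight check along these lines can catch a missing key or tool before any script runs. This is a sketch, not part of the skill: the function name and its `need_ffmpeg` / `need_curl` switches are assumptions, and only the requirement names come from the tables above.

```python
import importlib.util
import os
import shutil


def missing_requirements(env=None, need_ffmpeg=False, need_curl=False):
    """Return the names of requirements that appear to be missing.

    env defaults to os.environ; pass a dict to check a different environment.
    """
    env = os.environ if env is None else env
    missing = []
    if not env.get("MIMO_API_KEY"):
        missing.append("MIMO_API_KEY")
    # python3 is always needed; ffmpeg and curl only in the listed scenarios.
    tools = ["python3"]
    if need_ffmpeg:
        tools.append("ffmpeg")
    if need_curl:
        tools.append("curl")
    missing.extend(tool for tool in tools if shutil.which(tool) is None)
    # find_spec checks that the package is importable without importing it.
    if importlib.util.find_spec("openai") is None:
        missing.append("openai (pip install openai)")
    return missing
```

Calling `missing_requirements(need_ffmpeg=True)` before a splicing job returns an empty list when everything needed is in place.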

Preset Voices

When using the `mimo-v2.5-tts` model, you must specify a voice explicitly. You can remind users of the available voices or select an appropriate voice based on the content's language and style.

Voice Name	Voice ID	Language	Gender	Style
Bing Tang	`Bing Tang`	Chinese	Female	Lively girl
Mo Li	`Mo Li`	Chinese	Female	Intellectual female voice
Su Da	`Su Da`	Chinese	Male	Sunny young man
Bai Hua	`Bai Hua`	Chinese	Male	Mature male voice
Mia	`Mia`	English	Female	Lively girl
Chloe	`Chloe`	English	Female	Sweet Dreamy
Milo	`Milo`	English	Male	Sunny boy
Dean	`Dean`	English	Male	Steady Gentle
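Selecting a voice by language and gender can be sketched as a filter over the table above. The tuple layout and the function name are illustrative assumptions; the voice data is copied from the table.

```python
# (voice ID, language, gender, style), copied from the preset voice table.
PRESET_VOICES = [
    ("Bing Tang", "Chinese", "Female", "Lively girl"),
    ("Mo Li", "Chinese", "Female", "Intellectual female voice"),
    ("Su Da", "Chinese", "Male", "Sunny young man"),
    ("Bai Hua", "Chinese", "Male", "Mature male voice"),
    ("Mia", "English", "Female", "Lively girl"),
    ("Chloe", "English", "Female", "Sweet Dreamy"),
    ("Milo", "English", "Male", "Sunny boy"),
    ("Dean", "English", "Male", "Steady Gentle"),
]


def voices_for(language, gender=None):
    """Return voice IDs matching a language and, optionally, a gender."""
    return [
        voice_id
        for voice_id, lang, sex, _style in PRESET_VOICES
        if lang.lower() == language.lower()
        and (gender is None or sex.lower() == gender.lower())
    ]
```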

Natural Language Control

All models support natural language control.

Natural language descriptions let the model understand the intended style and generate speech to match. All models accept natural language control instructions via the `--context` parameter:

- `mimo-v2.5-tts` and `mimo-v2.5-tts-voiceclone` use it to adjust styles such as tone and emotion for the specified voice;
- `mimo-v2.5-tts-voicedesign` uses it to control both voice and style at once.

Capability Features:

Multi-style Switching: Complete style transitions from broadcast → whisper → roar within the same speech segment
Multi-emotion Mixing: Supports complex emotions such as "suppressed anger", "smile with sobs", etc.
Multi-granularity Control: Can be specified at paragraph-level → sentence-level → word-level → character-level

Examples:

Report good news to the leader in a brisk and rising tone, speak a little faster, with the excitement and small pride that can't be suppressed after checking the results, and the voice is bright and energetic.

Looking at the results of the just-solved problem, I can't help but exclaim in triumph, with a high-pitched and bright voice, a slightly fast speaking speed, and a tone full of confidence and disbelief.

Director Mode

A special usage of natural language control is the "Director Mode", which comprehensively portrays characters and vocal lines from three dimensions: role, scene, and guidance:

[Role] Character identity, personality background, appearance temperament, and speaking habits
[Scene] What is happening now, who to talk to, emotional state
[Guidance] Speaking speed, breath, pauses, stress, resonance position, voice texture, emotional ups and downs

Director Mode Example:

Role: The current head of the century-old CEN family. Adopted by the gatekeeper of the ancestral temple since birth, she was shaped into a flawless, emotionless family totem. She stays behind closed doors all year round, with a strong sense of class alienation towards others.

Scene: In the shadow of the ancestral hall, watching the man who broke through the security line at all costs to find her and attempt to elope with her. She must use the coldest class barriers to crush the other party, and also crush her own newly sprouted feelings that could potentially set the prairie ablaze.

Guidance:
A cold, lazy yet highly imposing low-pitched mature female voice. The vocal channel is very relaxed, without any tension, but with an oppressive feeling that makes one's bones cold.

- Speaking Speed and Pauses: Extremely slow, every word seems to roll off the tip of the tongue before being spoken, with the casual arrogance of a superior. Leave extremely long, unsettling gaps between sentences.
- Breath and Voiced Sounds: Most of the time, her voice has no obvious tone fluctuations, the voiced sounds are heavy and hard, like a gentle yet cold dark river. But at certain tail sounds (such as "sincerity"), must add extremely slight breathy closure, revealing a trace of exhaustion and longing that even she herself didn't notice.
- Articulation Texture: The mixed classical and modern words carry traces of the old era, the lip and dental sounds are extremely light but extremely clear (such as "rush", "cheap"), appearing both elegant and sharp, hitting the mark every time.

Director Mode is suitable for scenarios with high requirements for voice performance, such as character dubbing, film-level content generation, etc.
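A Director Mode instruction can be assembled from its three dimensions before being passed via `--context`. The sketch below mirrors the `Role:` / `Scene:` / `Guidance:` layout of the example above. The function name is hypothetical, and the document does not say whether the model requires these exact labels.

```python
def director_context(role: str, scene: str, guidance: str) -> str:
    """Join the three Director Mode dimensions into one context string."""
    parts = {"Role": role, "Scene": scene, "Guidance": guidance}
    for name, value in parts.items():
        if not value.strip():
            raise ValueError(f"{name} must not be empty")
    # One blank line between dimensions, as in the example above.
    return "\n\n".join(f"{name}: {value.strip()}" for name, value in parts.items())
```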

Audio Tag Control

`mimo-v2.5-tts` and `mimo-v2.5-tts-voiceclone` support audio tag control. `mimo-v2.5-tts-voicedesign` does not support it yet; to control the style with that model, write a more detailed description via `--context`.

Describe tone, emotion, or sound-producing actions in parentheses anywhere in the text to achieve in-sentence switching. The content in the parentheses must be sound-related (such as tone, emotion, sigh, yawn, cough), not body movements (such as turn around, sit down, wave).

Chinese supports three types of brackets: full-width `（）`, half-width `()`, and square brackets `[]`. English supports two types: half-width `()` and square brackets `[]`.

(nervous, deep breath) Phew... Calm down, calm down. It's just an interview... (speaking faster, muttering) I've memorized the self-introduction fifty times, it should be fine. Come on, you can do it... (whispering) Oh no, is my tie crooked?
(extremely tired, weak) Master... Wake me up when we arrive... (sighs deeply) I'll take a nap first, this overtime has drained me.
If I had just... (pauses for a moment) held on for one more second, would the result be different? (bitter laugh) Heh, there's no 'if' anymore.
(rapid breathing due to cold) Hoo—Hoo—This snow in the Greater Khingan Range... (coughs) It can freeze a person to the bone... Don't stop, keep going, hurry.
(raises voice to shout) Sister! This fish is fresh! Just caught this morning! Hey! You there, don't rummage around, you have to pay if you crush it?!

Achoo! Ahem. I—I really (cough) think I am coming down with a terrible (cough) terrible cold.
(heavy breathing) Just... give me... a second. I ran... all the way... from the station.
I just feel... long sigh... like I'm constantly treading water, you know?
It's just so stupid! (sobbing) We spent all that money on the cake and the dog just... (sudden laugh) he just ate the whole thing in one bite!

Parentheses can be placed anywhere in the text (beginning, middle, end), and can describe: tone, emotion, breathing, laughter, crying, gasping, coughing, yawning, sighing, and other sound-producing content. Note that body movements (such as turn around, sit down, wave) cannot be written.
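To review which audio tags a text contains before synthesis, the bracket rules above can be expressed as a regular expression. This is an illustrative sketch: the function name and the `language` argument are assumptions, and it only extracts tags. It cannot judge whether a tag describes a sound or a body movement.

```python
import re

# Full-width （）, half-width (), and square brackets [], as listed above.
TAG_PATTERN = re.compile(r"（([^（）]*)）|\(([^()]*)\)|\[([^\[\]]*)\]")


def extract_audio_tags(text: str, language: str = "zh") -> list:
    """Return the tag contents found in text, in order of appearance."""
    tags = []
    for match in TAG_PATTERN.finditer(text):
        full_width, half_width, square = match.groups()
        if full_width is not None:
            # Full-width parentheses are listed for Chinese text only.
            if language == "zh":
                tags.append(full_width.strip())
        elif half_width is not None:
            tags.append(half_width.strip())
        else:
            tags.append(square.strip())
    return tags
```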

Overall Style Tags

Overall style tags are a type of audio tag control. Add a `(style)` tag at the beginning of the target text to specify the pronunciation style of the entire speech segment.

Format example:

(style1 style2)Content to be synthesized

Singing: You must add the `(singing)` tag at the very beginning, in the format `(singing)Lyrics`. Supported identifiers in the tag: `唱歌` (sing), `sing`, `singing`.

Category	Common Styles
Basic Emotions	`happy` `sad` `angry` `fearful` `surprised` `excited` `wronged` `calm` `indifferent`
Complex Emotions	`disconsolate` `relieved` `helpless` `guilty` `reconciled` `jealous` `tired` `anxious` `emotional`
Overall Tone	`gentle` `cold` `lively` `serious` `lazy` `playful` `deep` `capable` `sharp`
Voice Texture	`magnetic` `mellow` `clear` `ethereal` `childish` `old` `sweet` `hoarse` `elegant`
Character Tone	`clipped voice` `mature female voice` `young boy voice` `middle-aged male voice` `Taiwan accent`
Dialects	`Northeastern dialect` `Sichuan dialect` `Henan dialect` `Cantonese`
Role-playing	`Sun Wukong` `Lin Daiyu`
Singing	`singing`

Classic Combinations:

(disconsolate) So many years have passed...

(lazy) Let me sleep for five more minutes...

(magnetic) The night has fallen...

(Northeastern dialect) Oh my goodness...

(Cantonese) This is really great...
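Prefixing a text with an overall style tag is plain string formatting. The sketch below follows the `(style1 style2)Content` format shown above; the helper names are hypothetical.

```python
def with_style(text: str, *styles: str) -> str:
    """Prefix text with an overall style tag such as "(lazy)"."""
    cleaned = [style.strip() for style in styles if style.strip()]
    if not cleaned:
        return text
    # Multiple styles are space-separated inside one pair of parentheses.
    return f"({' '.join(cleaned)}){text}"


def as_lyrics(lyrics: str) -> str:
    """Mark text as lyrics by placing the singing tag at the very beginning."""
    return with_style(lyrics, "singing")
```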

Voice Description Writing

When using `mimo-v2.5-tts-voicedesign` to customize a voice via text description, you need to write a short voice description to guide the model to generate a voice with the expected characteristics. It shares the `--context` parameter with natural language control. The writing requirements for voice descriptions are as follows:

A voice description is an ID card for the voice. Only describe what this voice itself sounds like—do not write scenes, actions, or what to say this time. The more concise it is, the easier it is to reuse across scenarios.

Mandatory Items:

Identity Anchor: Age group + gender. Determines the fundamental frequency, which is the basis of all characteristics.
Voice Texture: Breath direction, resonance position, articulation and voice background. Use perceptible verbs or metaphors, do not pile up adjectives.
Speaking Speed and Rhythm: Steady / fast / slow / erratic / rapid with pauses.
Emotional Background: Default state of the voice (high-pitched / relaxed / soft / restrained).

Optional Items (Highly Recommended):

Style/Identity Tag: A brief mention of occupation or style anchor, which the model understands instantly. Examples: Auctioneer style / Food critic style / Broadcaster style / Court statement style.
Distinctive Quirk: A habit that makes the voice memorable at first listen. Examples: Occasional closed-eye inhalation / Tremor at the end of words / Chest chuckle when laughing.

Hard Constraints:

One to two sentences, descriptive, no paragraphs or bullet points.
Do not write scenes ("at a press conference" "on night shift").
Do not write actions ("she walks onto the stage").
Do not use real actors or IP character names (copyright + poor generalization).
Default to Mandarin or English; only add dialect when explicitly needed.

Voice Description Examples:

Middle-aged male, extremely fast rhythm, high-pitched emotion, auctioneer style. Articulation is rapid, with cadence and a sense of urgency.

Young male, esports commentator style, extremely fast and coherent speaking speed, with obvious breath breaks and explosive emphasis, voice rises sharply when excited.

Middle-aged male, court statement style, steady and formal voice, neat articulation with clear pauses per word, restrained emotion but each word carries weight.
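Two of the hard constraints (one to two sentences, no paragraphs or bullet points) can be checked mechanically. The sketch below is a rough heuristic with a hypothetical function name: it counts sentences by terminal punctuation and does not try to detect scenes, actions, or character names.

```python
import re


def lint_voice_description(description: str) -> list:
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    if "\n" in description.strip():
        problems.append("use a single paragraph without line breaks or bullet points")
    # Count sentences by splitting on English and Chinese terminal punctuation.
    sentences = [s for s in re.split(r"[.!?。！？]+", description) if s.strip()]
    if not 1 <= len(sentences) <= 2:
        problems.append(f"expected one to two sentences, found {len(sentences)}")
    return problems
```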

Content and Tag Enhancement

When the user does not directly provide the text to be read aloud, you should write the text to be synthesized yourself according to specific situations such as role-playing, creation, etc.

When the user only provides the text to be read aloud without details of tone and emotion, you should adjust the text appropriately and insert suitable tags according to the actual situation.

Hard Rules:

Text emotion must match the voice background. A soft grandma does not roar, an auctioneer does not monologue late at night.
Length: 2–5 sentences, one whole paragraph. Too short and the model cannot establish rhythm, and subtle voice characteristics cannot be expressed.
Tags are seasoning, not the main dish. Pause when needed, laugh when needed, do not pile up tags. At most one tag per sentence.
Punctuation has performative meaning: Ellipsis = pause / Dash = interrupted or drawn-out sound / ALL CAPS = emphasis.
Tag language follows the main text. Chinese main text matches Chinese tags, English main text matches English tags, mixing is prohibited.
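The last rule (tag language follows the main text) lends itself to a simple check. This sketch assumes square-bracket tags and treats any CJK character in the main text as a sign of a Chinese text; the function name is hypothetical, and genuinely mixed-language texts would need a more careful detector.

```python
import re

TAG = re.compile(r"\[([^\[\]]+)\]")
CJK = re.compile(r"[\u4e00-\u9fff]")


def mismatched_tags(text: str) -> list:
    """Return tags whose language differs from that of the main text."""
    body = TAG.sub("", text)  # main text with all tags removed
    body_is_chinese = bool(CJK.search(body))
    return [
        tag
        for tag in TAG.findall(text)
        if bool(CJK.search(tag)) != body_is_chinese
    ]
```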

Recommended Tags (Chinese use `[tag]`):

Category	Common Tags
Rhythm	`[pause]` `[long pause]` `[rapid]` `[drawn-out]` `[speed up]` `[slow down]`
Emotion	`[whisper]` `[mutter]` `[sigh]` `[inhale]` `[choke up]` `[emphasize]` `[laugh]` `[hearty laugh]`
Others	`[hesitate to speak]` `[mutter to oneself]` `[silence for a moment]`

Recommended Tags (English use `[tag]`):

Category	Common Tags
Pacing	`[pause]` `[long pause]` `[fast]` `[slow]` `[drawn out]`
Emotion	`[whispering]` `[sighs]` `[inhale]` `[choked up]` `[emphasis]` `[laughs]` `[hearty laugh]`

Matching Examples:

Voice: Middle-aged female, food critic style, soft and infectious tone, with exaggerated aftertaste and intoxication when describing, occasional closed-eye inhalation.

Text:
[inhale] Mmm—When I take this bite…[pause] My entire tongue is wrapped up. [drawn-out] First, the sweetness of caramel, then [emphasize] a little bit [pause] just right saltiness, slowly spreading in the aftertaste. [sigh] Honestly, I haven't [whisper] eaten such a thoughtful dessert in a long [pause] time.

Voice: Elderly female, soft and slow voice, with a slight tremor and kindness, like a grandma telling a bedtime story to a grandchild.

Text:
[whisper] Come on, lie down. [pause] Grandma will tell you a story. [long pause] Once upon a time, there was a small wooden house at the foot of the mountain…[drawn-out] where a little rabbit lived. [sigh] This rabbit was very playful, running out as soon as the sun came up every day. [pause] Then one day—[emphasize] it rained. [whisper] It got lost and couldn't find its way home…

Python Script Usage

Each of the three models has its own script. Choose the one that fits your needs:

Script	Model	Use Case
`mimo_tts.py`	`mimo-v2.5-tts`	Preset Voice Speech Synthesis
`mimo_tts_voicedesign.py`	`mimo-v2.5-tts-voicedesign`	Custom Voice via Text Description
`mimo_tts_voiceclone.py`	`mimo-v2.5-tts-voiceclone`	Voice Replication via Audio Sample

Preset Voice Speech Synthesis (mimo_tts.py)

```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "Hello, the weather is really nice today." \
  --voice "Bing Tang"
```

Preset Voice + Natural Language Style Control

```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --context "Speak in a gentle tone, slightly slower speed" \
  --text "It's okay, take your time, I'll wait for you." \
  --voice "Bing Tang" \
  --output tmp/mimo-v2.5-tts/comfort.wav
```

Preset Voice + Audio Tag Control

```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "(nervous, deep breath) Phew... Calm down, calm down. It's just an interview... (whispering) Oh no, is my tie crooked?" \
  --voice "Bing Tang" \
  --output tmp/mimo-v2.5-tts/interview.wav
```
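When the script is driven from Python rather than a shell, building the argument list explicitly avoids quoting problems with parentheses and apostrophes in the text. The sketch below uses only the flags shown in the examples above (`--text`, `--voice`, `--context`, `--output`). The wrapper names are hypothetical, and it assumes `SKILLS_PATH` is set in the environment unless `skills_path` is passed.

```python
import os
import subprocess


def build_tts_command(text, voice, context=None, output=None, skills_path=None):
    """Build the argv list for mimo_tts.py from the flags shown above."""
    skills_path = skills_path or os.environ.get("SKILLS_PATH", "")
    script = os.path.join(skills_path, "mimo-v2-5-tts", "scripts", "mimo_tts.py")
    cmd = ["python3", script, "--text", text, "--voice", voice]
    if context:
        cmd += ["--context", context]
    if output:
        cmd += ["--output", output]
    return cmd


def synthesize(text, voice, **options):
    """Run the script and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_tts_command(text, voice, **options), check=True)
```

For example, `synthesize("It's okay.", "Bing Tang", context="Speak in a gentle tone", output="tmp/mimo-v2.5-tts/comfort.wav")` corresponds to the style-control command above.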

Voice Design (mimo_tts_voicedesign.py)

```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts_voicedesign.py \
  --context "Give me a young male tone." \
  --text "Yes, I had a sandwich." \
  --output tmp/mimo-v2.5-tts/voicedesign.wav
```

Tip: You can save the audio generated by Voice Design and use it later as the `--voice-file` input for `mimo_tts_voiceclone.py`, to implement the "Design → Clone" workflow: first generate a satisfactory voice using a text description (`--context`), then use this audio as a sample to clone to other texts. You can also add director mode instructions via `--context` during cloning.

Voice Cloning (mimo_tts_voiceclone.py)

Note: The Base64-encoded voice sample must not exceed 10 MB. Only mp3 and wav formats are supported.

```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts_voiceclone.py \
  --voice-file voice.mp3 \
  --text "Yes, I had a sandwich." \
  --output tmp/mimo-v2.5-tts/voiceclone.wav
```
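Because the size limit applies to the Base64-encoded sample, a raw file can be checked in advance: padded Base64 turns every 3 input bytes into 4 output characters. The sketch below is an assumption-laden illustration. It reads "10 MB" as 10 × 1024 × 1024 bytes, judges the format by file extension only, and uses hypothetical function names.

```python
import os

MAX_BASE64_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB
ALLOWED_EXTENSIONS = {".mp3", ".wav"}


def base64_length(raw_size: int) -> int:
    """Length of padded Base64 output for raw_size input bytes."""
    return 4 * ((raw_size + 2) // 3)


def check_voice_sample(path: str) -> None:
    """Raise ValueError if the sample has the wrong format or is too large."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format {extension!r}; use mp3 or wav")
    encoded = base64_length(os.path.getsize(path))
    if encoded > MAX_BASE64_BYTES:
        raise ValueError(f"sample is {encoded} bytes after Base64 encoding; limit is {MAX_BASE64_BYTES}")
```

Under this reading, the largest raw file that still fits is about 7.5 MB.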

Voice Cloning + Director Mode

```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts_voiceclone.py \
  --voice-file voice.mp3 \
  --context "Speak in a gentle tone, slightly slower speed" \
  --text "It's okay, take your time, I'll wait for you." \
  --output tmp/mimo-v2.5-tts/voiceclone_director.wav
```

Singing

Note: The lyrics must be complete; incomplete lyrics cause off-key singing and poor results.

```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "(singing)Forgive me for being unruly and indulging in freedom all my life, I'm also afraid that one day I will fall, Oh no. Betraying ideals, anyone can do it, I won't be afraid if one day only you and I are left." \
  --voice "Bing Tang" \
  --output tmp/mimo-v2.5-tts/singing.wav
```

English + Audio Tags

```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "I just... (sighs deeply) I don't know anymore. (suddenly firm) But I won't give up!" \
  --voice "Mia" \
  --output tmp/mimo-v2.5-tts/english.wav
```

Long Text Processing

V2.5 handles long text well. In almost all scenarios, generate the audio in one pass without manual segmentation. Only when the text exceeds 2500 characters should you consider segmenting it, synthesizing each segment separately, and splicing the results.

Segmentation and synthesis method: Split naturally by periods/paragraphs, generate wav files for each segment independently, then splice with ffmpeg:

```bash
# Create file list
echo "file '/tmp/mimo-v2.5-tts/part1.wav'" > /tmp/mimo-v2.5-tts/list.txt
echo "file '/tmp/mimo-v2.5-tts/part2.wav'" >> /tmp/mimo-v2.5-tts/list.txt

# Splice
ffmpeg -y -f concat -safe 0 -i /tmp/mimo-v2.5-tts/list.txt -c copy /tmp/mimo-v2.5-tts/combined.wav
```
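The splitting step can be sketched as greedy packing of sentences into chunks of at most 2500 characters, plus a helper that produces the file list in the format shown above. The function names are hypothetical. The splitter treats `.`, `!`, `?`, their Chinese counterparts, and line breaks as boundaries, and it leaves a single sentence longer than the limit as one oversized chunk.

```python
import re

MAX_CHARS = 2500  # threshold mentioned above

# A sentence: any run of text up to terminal punctuation, a line break,
# or the end of the text, plus any trailing whitespace.
SENTENCE = re.compile(r".+?(?:[.!?。！？]+|\n+|$)\s*", re.S)


def split_long_text(text: str, max_chars: int = MAX_CHARS) -> list:
    """Split text at sentence boundaries into chunks of at most max_chars."""
    if len(text) <= max_chars:
        return [text]
    chunks, current = [], ""
    for sentence in SENTENCE.findall(text):
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks


def concat_list(paths) -> str:
    """Return ffmpeg concat list content; paths must not contain single quotes."""
    return "".join(f"file '{path}'\n" for path in paths)
```

Each chunk is then synthesized to its own wav file, and the string from `concat_list` is written to the list file passed to `ffmpeg -f concat`.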

Feishu Voice Message Sending

Note: This function is only needed when users need to send TTS-generated voice to Feishu.

Send the TTS-generated WAV to Feishu, complete in one command: Generate → Transcode → Upload → Send.

Why not use the message tool: Feishu voice messages require calling the `/im/v1/messages` interface (`msg_type: audio`), and you need to upload the audio first to obtain the `file_key`. In many tools, the message tool (`asVoice: true`) does not implement this logic on Feishu channels and sends the audio as a regular attachment (users see a file instead of a voice bar). `feishu_send_audio.sh` completes the full upload + send process and cannot be replaced by a message tool.

Environment Dependencies

Environment Variable	Source	Description
`FEISHU_APP_ID`	Feishu Open Platform	App ID
`FEISHU_APP_SECRET`	Feishu Open Platform	App Secret

Dependency	Description	Required
`ffmpeg`	Convert WAV to Opus, get audio duration	Yes
`curl`	Call Feishu API	Yes

Usage

Send to Private Chat (open_id)

```bash
# 1. Generate voice
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "Okay, I'll be right there!" \
  --voice "Bing Tang" \
  --output /tmp/mimo-v2.5-tts/voice.wav

# 2. Send to Feishu private chat (receive_id_type is open_id, receive_id is user ID)
bash $SKILLS_PATH/mimo-v2-5-tts/scripts/feishu_send_audio.sh /tmp/mimo-v2.5-tts/voice.wav open_id ou_xxxxxx
```

Send to Group Chat (chat_id)

```bash
# 1. Generate voice
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "Hello everyone, the weather is really nice today!" \
  --voice "Bing Tang" \
  --output /tmp/mimo-v2.5-tts/voice.wav

# 2. Send to Feishu group chat (receive_id_type is chat_id, receive_id is group ID)
bash $SKILLS_PATH/mimo-v2-5-tts/scripts/feishu_send_audio.sh /tmp/mimo-v2.5-tts/voice.wav chat_id oc_xxxxxx
```

The internal process of `feishu_send_audio.sh`:

wav → opus (ffmpeg) → Obtain tenant_access_token → Upload audio file → Send audio message
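The final step of the flow above, sending the audio message, can be illustrated by the request it builds. The sketch below is not taken from `feishu_send_audio.sh`. It assumes the `/im/v1/messages` interface takes `receive_id_type` as a query parameter and a body whose `content` field is a JSON-encoded string holding the `file_key`. The function name is hypothetical.

```python
import json

# The two receive_id types used in this document.
VALID_RECEIVE_ID_TYPES = {"open_id", "chat_id"}


def build_audio_message(receive_id_type: str, receive_id: str, file_key: str):
    """Return (query_params, json_body) for an audio message request."""
    if receive_id_type not in VALID_RECEIVE_ID_TYPES:
        raise ValueError(f"unsupported receive_id_type: {receive_id_type!r}")
    params = {"receive_id_type": receive_id_type}
    body = {
        "receive_id": receive_id,
        "msg_type": "audio",
        # Assumption: content is itself a JSON string, not a nested object.
        "content": json.dumps({"file_key": file_key}),
    }
    return params, body
```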
