MiMo V2.5 TTS
Generate speech using Xiaomi MiMo V2.5 TTS series models. Supports Chinese and English, preset voices, voice design, voice cloning, emotional styles, dialects, and singing.
Script directory:
$SKILLS_PATH/mimo-v2-5-tts/scripts/
Here $SKILLS_PATH is the path to the skills directory, which varies with the deployment environment.
Model Selection
The V2.5 series offers three models; choose according to your usage scenario:

| Model ID | Use Case | Voice Source | Special Ability |
|---|---|---|---|
| | Preset Voice Speech Synthesis | Built-in Premium Voices | Supports Singing |
| mimo-v2.5-tts-voicedesign | Custom Voice via Text Description | Generated from Text Description | — |
| | Voice Replication via Audio Sample | Audio Sample | — |
Selection Recommendations:
- Need to generate speech quickly or need singing → the preset voice model (Preset Voice)
- Need a unique voice → mimo-v2.5-tts-voicedesign (Generated from Text Description)
- Need to imitate a specific voice → the voice cloning model (Replicated from Audio Sample)
Note: TTS output is nondeterministic, so the same input may sound different from run to run. Generate several takes and pick the best one if needed.
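The selection recommendations above can be sketched as a small helper. This is a minimal illustration, not part of the skill: the function name and its boolean parameters are hypothetical, and only the script file names are taken from the Python Script Usage section.

```python
# Script names come from the "Python Script Usage" section of this document.
# The function and its parameters are illustrative assumptions.
PRESET_SCRIPT = "mimo_tts.py"
DESIGN_SCRIPT = "mimo_tts_voicedesign.py"
CLONE_SCRIPT = "mimo_tts_voiceclone.py"


def choose_script(needs_singing=False, has_audio_sample=False,
                  has_voice_description=False):
    """Map a request onto one of the three scripts."""
    if needs_singing:
        # Only the preset voice model is listed as supporting singing.
        return PRESET_SCRIPT
    if has_audio_sample:
        # Imitating a specific voice requires an audio sample.
        return CLONE_SCRIPT
    if has_voice_description:
        # A unique voice is generated from a text description.
        return DESIGN_SCRIPT
    # Default: quick generation with a built-in voice.
    return PRESET_SCRIPT
```

Singing is checked first because the model table lists it only for the preset voice model; the order of the other checks is a design choice, not something the source prescribes.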
Environment Dependencies
| Environment Variable | Description | Required |
|---|---|---|
| | MiMo API Key (obtained from the MiMo Open Platform) | Yes |

| Dependency | Description | Required |
|---|---|---|
| python3 | Run scripts | Yes |
| | | Yes |
| ffmpeg | Format conversion, long text segmentation and splicing | Only required for conversion and splicing scenarios |
| | Call Feishu API | Only required for Feishu sending scenarios |
Preset Voices
When using the preset voice model, you must specify a voice explicitly. You can remind users of the available voices or select an appropriate voice based on the content's language and style.
| Voice Name | Voice ID | Language | Gender | Style |
|---|---|---|---|---|
| Bing Tang | | Chinese | Female | Lively girl |
| Mo Li | | Chinese | Female | Intellectual female voice |
| Su Da | | Chinese | Male | Sunny young man |
| Bai Hua | | Chinese | Male | Mature male voice |
| Mia | | English | Female | Lively girl |
| Chloe | | English | Female | Sweet Dreamy |
| Milo | | English | Male | Sunny boy |
| Dean | | English | Male | Steady Gentle |
Natural Language Control
All models support natural language control: describe the desired style in natural language and the model generates speech to match. Instructions are passed through the --context parameter. The preset voice and voice cloning models use it to adjust tone, emotion, and other style traits of the specified voice; mimo-v2.5-tts-voicedesign uses it to control both voice and style at once.
Capability Features:
- Multi-style Switching: Complete style transitions from broadcast → whisper → roar within the same speech segment
- Multi-emotion Mixing: Supports complex emotions such as "suppressed anger", "smile with sobs", etc.
- Multi-granularity Control: Can be specified at paragraph-level → sentence-level → word-level → character-level
Examples:
Report good news to the leader in a brisk and rising tone, speak a little faster, with the excitement and small pride that can't be suppressed after checking the results, and the voice is bright and energetic.
Looking at the results of the just-solved problem, I can't help but exclaim in triumph, with a high-pitched and bright voice, a slightly fast speaking speed, and a tone full of confidence and disbelief.
Director Mode
A special usage of natural language control is the "Director Mode", which comprehensively portrays characters and vocal lines from three dimensions: role, scene, and guidance:
- [Role] Character identity, personality background, appearance temperament, and speaking habits
- [Scene] What is happening now, who to talk to, emotional state
- [Guidance] Speaking speed, breath, pauses, stress, resonance position, voice texture, emotional ups and downs
Director Mode Example:
Role: The current head of the century-old CEN family. Adopted by the gatekeeper of the ancestral temple since birth, she was shaped into a flawless, emotionless family totem. She stays behind closed doors all year round, with a strong sense of class alienation towards others.
Scene: In the shadow of the ancestral hall, watching the man who broke through the security line at all costs to find her and attempt to elope with her. She must use the coldest class barriers to crush the other party, and also crush her own newly sprouted feelings that could potentially set the prairie ablaze.
Guidance:
A cold, lazy yet highly imposing low-pitched mature female voice. The vocal channel is very relaxed, without any tension, but with an oppressive feeling that makes one's bones cold.
- Speaking Speed and Pauses: Extremely slow, every word seems to roll off the tip of the tongue before being spoken, with the casual arrogance of a superior. Leave extremely long, unsettling gaps between sentences.
- Breath and Voiced Sounds: Most of the time, her voice has no obvious tone fluctuations, the voiced sounds are heavy and hard, like a gentle yet cold dark river. But at certain tail sounds (such as "sincerity"), must add extremely slight breathy closure, revealing a trace of exhaustion and longing that even she herself didn't notice.
- Articulation Texture: The mixed classical and modern words carry traces of the old era, the lip and dental sounds are extremely light but extremely clear (such as "rush", "cheap"), appearing both elegant and sharp, hitting the mark every time.
Director Mode is suitable for scenarios with high requirements for voice performance, such as character dubbing, film-level content generation, etc.
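A Director Mode instruction can be assembled from its three parts before it is passed as the natural language control text. The sketch below is illustrative only: the function name is hypothetical, and the "Role / Scene / Guidance" layout simply mirrors the example above rather than a format the model is known to require.

```python
def build_director_prompt(role: str, scene: str, guidance: str) -> str:
    """Join the three Director Mode dimensions into one instruction string."""
    sections = {"Role": role, "Scene": scene, "Guidance": guidance}
    # Refuse to build a prompt with an empty dimension, since Director Mode
    # is described as portraying the character from all three.
    missing = [name for name, value in sections.items() if not value.strip()]
    if missing:
        raise ValueError("missing Director Mode sections: " + ", ".join(missing))
    return "\n".join(f"{name}: {value.strip()}" for name, value in sections.items())
```

The returned string can then be supplied wherever a natural language control instruction is expected.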
Audio Tag Control
The preset voice and voice cloning models support audio tag control. mimo-v2.5-tts-voicedesign does not support it for now; to control style with that model, write a more detailed description via --context.
Describe tone, emotion, or sound-producing actions in parentheses anywhere in the text to achieve in-sentence switching. The content in the parentheses must be sound-related (such as tone, emotion, sigh, yawn, cough), not body movements (such as turn around, sit down, wave).
Chinese text supports three bracket types: full-width parentheses （）, half-width parentheses (), and square brackets []. English text supports two: half-width parentheses () and square brackets [].
(nervous, deep breath) Phew... Calm down, calm down. It's just an interview... (speaking faster, muttering) I've memorized the self-introduction fifty times, it should be fine. Come on, you can do it... (whispering) Oh no, is my tie crooked?
(extremely tired, weak) Master... Wake me up when we arrive... (sighs deeply) I'll take a nap first, this overtime has drained me.
If I had just... (pauses for a moment) held on for one more second, would the result be different? (bitter laugh) Heh, there's no 'if' anymore.
(rapid breathing due to cold) Hoo—Hoo—This snow in the Greater Khingan Range... (coughs) It can freeze a person to the bone... Don't stop, keep going, hurry.
(raises voice to shout) Sister! This fish is fresh! Just caught this morning! Hey! You there, don't rummage around, you have to pay if you crush it?!
Achoo! Ahem. I—I really (cough) think I am coming down with a terrible (cough) terrible cold.
(heavy breathing) Just... give me... a second. I ran... all the way... from the station.
I just feel... (long sigh) like I'm constantly treading water, you know?
It's just so stupid! (sobbing) We spent all that money on the cake and the dog just... (sudden laugh) he just ate the whole thing in one bite!
Tags can be placed anywhere in the text (beginning, middle, or end) and can describe tone, emotion, breathing, laughter, crying, gasping, coughing, yawning, sighing, and other sound-producing content. Body movements (such as turn around, sit down, wave) are not allowed.
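To review the tags in a piece of text before synthesis, they can be pulled out with a regular expression. This is a rough sketch under stated assumptions: both function names are hypothetical, the bracket types are the ones listed above, and the body-movement list contains only the three examples given in this section, so it is a heuristic rather than a complete check.

```python
import re

# Matches half-width (), full-width （）, and square [] tags without nesting.
TAG_PATTERN = re.compile(r"\(([^()]*)\)|（([^（）]*)）|\[([^\[\]]*)\]")

# Only the examples named in the text; extend as needed.
BODY_MOVEMENTS = ("turn around", "sit down", "wave")


def extract_audio_tags(text):
    """Return the content of every bracketed tag, in order of appearance."""
    tags = []
    for match in TAG_PATTERN.finditer(text):
        # Exactly one of the three groups is set for each match.
        content = next(g for g in match.groups() if g is not None)
        tags.append(content.strip())
    return tags


def find_body_movement_tags(text):
    """Return tags that mention a known body movement (whole words only)."""
    flagged = []
    for tag in extract_audio_tags(text):
        lowered = tag.lower()
        if any(re.search(rf"\b{re.escape(word)}\b", lowered) for word in BODY_MOVEMENTS):
            flagged.append(tag)
    return flagged
```

Whole-word matching keeps a tag such as "(wavering voice)" from being flagged because it contains "wave".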
Overall Style Tags
Overall style tags are a type of audio tag control. Add a parenthesized style tag at the beginning of the target text to specify the delivery style of the entire speech segment.
Format example:
(style1 style2)Content to be synthesized
Singing: A singing tag must be placed at the very beginning of the text, as in (singing)Lyrics to be sung. Supported identifiers in the tag include (sing) and (singing).
| Category | Common Styles |
|---|---|
| Basic Emotions | |
| Complex Emotions | |
| Overall Tone | |
| Voice Texture | |
| Character Tone | |
| Dialects | |
| Role-playing | |
| Singing | |
Classic Combinations:
- (disconsolate) So many years have passed...
- (lazy) Let me sleep for five more minutes...
- (magnetic) The night has fallen...
- (Northeastern dialect) Oh my goodness...
- (Cantonese) This is really great...
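The overall style tag format, (style1 style2)Content, is easy to produce with a small helper. The sketch below is illustrative: the function name is hypothetical, and it assumes styles are joined with single spaces inside half-width parentheses, as in the format example above.

```python
def with_style_tag(text, styles):
    """Prefix text with one overall style tag built from the given styles."""
    cleaned = [style.strip() for style in styles if style.strip()]
    if not cleaned:
        # No styles requested: leave the text untouched.
        return text
    # The tag sits at the very beginning, with no space before the content.
    return f"({' '.join(cleaned)}){text}"
```

For example, `with_style_tag("Let me sleep for five more minutes...", ["lazy"])` reproduces the second classic combination.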
Voice Description Writing
When using mimo-v2.5-tts-voicedesign to customize a voice via text description, you need to write a short voice description that guides the model toward a voice with the expected characteristics. It shares the --context parameter with natural language control. The writing requirements for voice descriptions are as follows:
A voice description is an ID card for the voice. Only describe what this voice itself sounds like—do not write scenes, actions, or what to say this time. The more concise it is, the easier it is to reuse across scenarios.
Mandatory Items:
- Identity Anchor: Age group + gender. Determines the fundamental frequency, which is the basis of all characteristics.
- Voice Texture: Breath direction, resonance position, articulation and voice background. Use perceptible verbs or metaphors, do not pile up adjectives.
- Speaking Speed and Rhythm: Steady / fast / slow / erratic / rapid with pauses.
- Emotional Background: Default state of the voice (high-pitched / relaxed / soft / restrained).
Optional Items (Highly Recommended):
- Style/Identity Tag: A brief mention of occupation or style anchor, which the model understands instantly. Examples: Auctioneer style / Food critic style / Broadcaster style / Court statement style.
- Distinctive Quirk: A habit that makes the voice memorable at first listen. Examples: Occasional closed-eye inhalation / Tremor at the end of words / Chest chuckle when laughing.
Hard Constraints:
- One to two sentences, descriptive, no paragraphs or bullet points.
- Do not write scenes ("at a press conference" "on night shift").
- Do not write actions ("she walks onto the stage").
- Do not use real actors or IP character names (copyright + poor generalization).
- Default to Mandarin or English; only add dialect when explicitly needed.
Voice Description Examples:
Middle-aged male, extremely fast rhythm, high-pitched emotion, auctioneer style. Articulation is rapid, with cadence and a sense of urgency.
Young male, esports commentator style, extremely fast and coherent speaking speed, with obvious breath breaks and explosive emphasis, voice rises sharply when excited.
Middle-aged male, court statement style, steady and formal voice, neat articulation with clear pauses per word, restrained emotion but each word carries weight.
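Two of the hard constraints (one to two sentences, no paragraphs or bullet points) can be checked mechanically. The following is a heuristic sketch: the function names are hypothetical, and sentences are counted by splitting on common Latin and Chinese end punctuation, which is only an approximation.

```python
import re

# Sentence-ending punctuation followed by whitespace or the end of the string.
SENTENCE_END = re.compile(r"[.!?。！？]+(?:\s+|$)")
# A line that starts with a list marker.
BULLET_LINE = re.compile(r"^\s*[-*•]\s", re.MULTILINE)


def count_sentences(text):
    """Approximate sentence count based on end punctuation."""
    return len([part for part in SENTENCE_END.split(text.strip()) if part.strip()])


def lint_voice_description(description):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    stripped = description.strip()
    if "\n" in stripped or BULLET_LINE.search(stripped):
        problems.append("use one run of prose, without line breaks or bullet points")
    sentences = count_sentences(stripped)
    if not 1 <= sentences <= 2:
        problems.append(f"expected one to two sentences, found {sentences}")
    return problems
```

The remaining constraints (no scenes, no actions, no real actor or IP names) need human or model judgment and are left out of this check.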
Content and Tag Enhancement
When the user does not directly provide the text to be read aloud, write the text to be synthesized yourself to fit the situation, such as role-playing or creative writing.
When the user provides only the text to be read aloud, without details of tone and emotion, adjust the text as appropriate and insert suitable tags.
Hard Rules:
- Text emotion must match the voice background. A soft grandma does not roar, an auctioneer does not monologue late at night.
- Length: 2–5 sentences, one whole paragraph. Too short and the model cannot establish rhythm, and subtle voice characteristics cannot be expressed.
- Tags are seasoning, not the main dish. Pause when needed, laugh when needed, do not pile up tags. At most one tag per sentence.
- Punctuation has performative meaning: Ellipsis = pause / Dash = interrupted or drawn-out sound / ALL CAPS = emphasis.
- Tag language follows the main text. Chinese main text matches Chinese tags, English main text matches English tags, mixing is prohibited.
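The last rule, that tag language follows the main text, can be approximated by comparing scripts. This is a rough sketch with a hypothetical function name: it treats any text containing CJK ideographs as Chinese and everything else as English, which is an assumption that will misjudge mixed-language content.

```python
import re

# Common CJK Unified Ideographs range.
CJK = re.compile(r"[\u4e00-\u9fff]")
# Half-width (), full-width （）, and square [] tags without nesting.
TAG = re.compile(r"\(([^()]*)\)|（([^（）]*)）|\[([^\[\]]*)\]")


def mismatched_tags(text):
    """Return tags whose script differs from the script of the main text."""
    body = TAG.sub("", text)  # main text with all tags removed
    body_is_chinese = bool(CJK.search(body))
    mismatches = []
    for match in TAG.finditer(text):
        tag = next(g for g in match.groups() if g is not None)
        if bool(CJK.search(tag)) != body_is_chinese:
            mismatches.append(tag)
    return mismatches
```

An empty result means no script mismatch was detected, not that the tags are otherwise well chosen.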
Recommended Tags (Chinese use ):
| Category | Common Tags |
|---|---|
| Rhythm | |
| Emotion | |
| Others | |
Recommended Tags (English use ):
| Category | Common Tags |
|---|---|
| Pacing | |
| Emotion | |
Matching Examples:
Voice: Middle-aged female, food critic style, soft and infectious tone, with exaggerated aftertaste and intoxication when describing, occasional closed-eye inhalation.
Text:
[inhale] Mmm—When I take this bite…[pause] My entire tongue is wrapped up. [drawn-out] First, the sweetness of caramel, then [emphasize] a little bit [pause] just right saltiness, slowly spreading in the aftertaste. [sigh] Honestly, I haven't [whisper] eaten such a thoughtful dessert in a long [pause] time.
Voice: Elderly female, soft and slow voice, with a slight tremor and kindness, like a grandma telling a bedtime story to a grandchild.
Text:
[whisper] Come on, lie down. [pause] Grandma will tell you a story. [long pause] Once upon a time, there was a small wooden house at the foot of the mountain…[drawn-out] where a little rabbit lived. [sigh] This rabbit was very playful, running out as soon as the sun came up every day. [pause] Then one day—[emphasize] it rained. [whisper] It got lost and couldn't find its way home…
Python Script Usage
Each of the three models has its own script; choose according to your needs:

| Script | Model | Use Case |
|---|---|---|
| mimo_tts.py | | Preset Voice Speech Synthesis |
| mimo_tts_voicedesign.py | mimo-v2.5-tts-voicedesign | Custom Voice via Text Description |
| mimo_tts_voiceclone.py | | Voice Replication via Audio Sample |
Preset Voice Speech Synthesis (mimo_tts.py)
```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "Hello, the weather is really nice today." \
  --voice "Bing Tang"
```
Preset Voice + Natural Language Style Control
```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --context "Speak in a gentle tone, slightly slower speed" \
  --text "It's okay, take your time, I'll wait for you." \
  --voice "Bing Tang" \
  --output tmp/mimo-v2.5-tts/comfort.wav
```
Preset Voice + Audio Tag Control
```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "(nervous, deep breath) Phew... Calm down, calm down. It's just an interview... (whispering) Oh no, is my tie crooked?" \
  --voice "Bing Tang" \
  --output tmp/mimo-v2.5-tts/interview.wav
```
Voice Design (mimo_tts_voicedesign.py)
```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts_voicedesign.py \
  --context "Give me a young male tone." \
  --text "Yes, I had a sandwich." \
  --output tmp/mimo-v2.5-tts/voicedesign.wav
```
Tip: You can save the audio generated by Voice Design and later use it as the --voice-file input for mimo_tts_voiceclone.py, which gives a "Design → Clone" workflow: first generate a satisfactory voice from a text description with mimo_tts_voicedesign.py, then use that audio as the sample to clone onto other texts. You can also add Director Mode instructions via --context during cloning.
Voice Cloning (mimo_tts_voiceclone.py)
Note: The Base64-encoded voice sample must not exceed 10 MB, and only mp3 and wav formats are supported.
```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts_voiceclone.py \
  --voice-file voice.mp3 \
  --text "Yes, I had a sandwich." \
  --output tmp/mimo-v2.5-tts/voiceclone.wav
```
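Because the 10 MB limit applies to the Base64-encoded sample, the raw file must be noticeably smaller than that. The sketch below checks a sample before it is sent; the function names are hypothetical, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption, since the source does not say which definition applies.

```python
MAX_BASE64_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB
ALLOWED_EXTENSIONS = (".mp3", ".wav")


def base64_length(raw_size):
    """Length of padded Base64 output for raw_size input bytes."""
    # Base64 emits 4 characters for every 3 input bytes, rounded up.
    return 4 * ((raw_size + 2) // 3)


def check_voice_sample(filename, raw_size):
    """Return a list of problems for a sample of raw_size bytes."""
    problems = []
    if not filename.lower().endswith(ALLOWED_EXTENSIONS):
        problems.append("only mp3 and wav samples are supported")
    if base64_length(raw_size) > MAX_BASE64_BYTES:
        problems.append("Base64-encoded sample exceeds 10 MB")
    return problems
```

In practice `raw_size` would come from something like `os.path.getsize("voice.mp3")`. Under the assumption above, the largest raw file that fits is about 7.5 MiB.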
Voice Cloning + Director Mode
```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts_voiceclone.py \
  --voice-file voice.mp3 \
  --context "Speak in a gentle tone, slightly slower speed" \
  --text "It's okay, take your time, I'll wait for you." \
  --output tmp/mimo-v2.5-tts/voiceclone_director.wav
```
Singing
Note: The lyrics must be complete; incomplete lyrics lead to off-key, poor-quality output.
```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "(singing)Forgive me for being unruly and indulging in freedom all my life, I'm also afraid that one day I will fall, Oh no. Betraying ideals, anyone can do it, I won't be afraid if one day only you and I are left." \
  --voice "Bing Tang" \
  --output tmp/mimo-v2.5-tts/singing.wav
```
English + Audio Tags
```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "I just... (sighs deeply) I don't know anymore. (suddenly firm) But I won't give up!" \
  --voice "Mia" \
  --output tmp/mimo-v2.5-tts/english.wav
```
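When the scripts are called from Python rather than a shell, building the argument list explicitly avoids quoting problems with text that contains quotes or parentheses. This is a minimal sketch: the function name and its keyword parameters are hypothetical, while the flags (--text, --voice, --context, --output, --voice-file) and the script path layout are the ones used in the commands above.

```python
import os
from pathlib import PurePosixPath


def build_tts_command(script, text, voice=None, context=None,
                      output=None, voice_file=None, skills_path=None):
    """Return an argv list for one of the TTS scripts."""
    # Fall back to the SKILLS_PATH environment variable when no path is given.
    root = skills_path if skills_path is not None else os.environ.get("SKILLS_PATH", "")
    script_path = PurePosixPath(root) / "mimo-v2-5-tts" / "scripts" / script
    command = ["python3", str(script_path), "--text", text]
    # Append only the options that were actually provided.
    optional = (("--voice", voice), ("--context", context),
                ("--output", output), ("--voice-file", voice_file))
    for flag, value in optional:
        if value is not None:
            command += [flag, value]
    return command
```

The list can be passed to `subprocess.run(command, check=True)`; since no shell is involved, the text needs no extra escaping.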
Long Text Processing
V2.5 handles long text well. In almost all scenarios, generate the audio in one pass without manual segmentation; consider segmenting, synthesizing separately, and splicing only when the text exceeds 2500 characters.
Segmentation and synthesis method: Split at natural boundaries (periods or paragraphs), generate a WAV file for each segment independently, then splice the files with ffmpeg:
```bash
# Create file list
echo "file '/tmp/mimo-v2.5-tts/part1.wav'" > /tmp/mimo-v2.5-tts/list.txt
echo "file '/tmp/mimo-v2.5-tts/part2.wav'" >> /tmp/mimo-v2.5-tts/list.txt
# Splice
ffmpeg -y -f concat -safe 0 -i /tmp/mimo-v2.5-tts/list.txt -c copy /tmp/mimo-v2.5-tts/combined.wav
```
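The splitting step can be sketched in Python. This is one possible approach under stated assumptions, not the skill's own implementation: the function names are hypothetical, sentences are detected by end punctuation only, and a single sentence longer than the limit is kept whole rather than cut mid-sentence.

```python
import re

MAX_CHARS = 2500  # threshold named in the text above

# A sentence: as few characters as possible up to end punctuation
# (plus trailing whitespace), or up to the end of the text.
SENTENCE = re.compile(r".+?(?:[.!?。！？]+\s*|$)", re.DOTALL)


def split_text(text, max_chars=MAX_CHARS):
    """Split text into chunks of whole sentences, each within max_chars."""
    if len(text) <= max_chars:
        return [text]
    chunks, current = [], ""
    for sentence in SENTENCE.findall(text):
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks


def concat_list(paths):
    """Build the file list content in the format shown above."""
    for path in paths:
        if "'" in path:
            raise ValueError("paths containing single quotes are not handled here")
    return "".join(f"file '{path}'\n" for path in paths)
```

Each chunk would be synthesized to its own WAV file, and the output of `concat_list` written to list.txt before running the ffmpeg command above.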
Feishu Voice Message Sending
Note: This function is only needed when users need to send TTS-generated voice to Feishu.
Send the TTS-generated WAV to Feishu, complete in one command: Generate → Transcode → Upload → Send.
Why not use a message tool: Feishu voice messages must be sent through Feishu's message interface as an audio message, and the audio has to be uploaded first to obtain its file key. The message tool in many tools has not implemented this logic for Feishu channels and sends the audio as a regular attachment instead (users see a file rather than a voice bar). feishu_send_audio.sh completes the full upload + send process and cannot be replaced by a message tool.
Environment Dependencies
| Environment Variable | Source | Description |
|---|---|---|
| | Feishu Open Platform | App ID |
| | Feishu Open Platform | App Secret |

| Dependency | Description | Required |
|---|---|---|
| | Convert WAV to Opus, get audio duration | Yes |
| | Call Feishu API | Yes |
Usage
Send to Private Chat (open_id)
```bash
# 1. Generate voice
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "Okay, I'll be right there!" \
  --voice "Bing Tang" \
  --output /tmp/mimo-v2.5-tts/voice.wav
# 2. Send to Feishu private chat (receive_id_type is open_id, receive_id is user ID)
bash $SKILLS_PATH/mimo-v2-5-tts/scripts/feishu_send_audio.sh /tmp/mimo-v2.5-tts/voice.wav open_id ou_xxxxxx
```
Send to Group Chat (chat_id)
```bash
# 1. Generate voice
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "Hello everyone, the weather is really nice today!" \
  --voice "Bing Tang" \
  --output /tmp/mimo-v2.5-tts/voice.wav
# 2. Send to Feishu group chat (receive_id_type is chat_id, receive_id is group ID)
bash $SKILLS_PATH/mimo-v2-5-tts/scripts/feishu_send_audio.sh /tmp/mimo-v2.5-tts/voice.wav chat_id oc_xxxxxx
```
The internal process of feishu_send_audio.sh: transcode the WAV to Opus → obtain tenant_access_token → upload the audio → send the voice message.