mimo-v2-5-tts

MiMo V2.5 TTS

Generate speech with the Xiaomi MiMo V2.5 TTS model family. It supports Chinese and English, preset voices, voice design, voice cloning, emotional styles, dialects, and singing.

Script directory: `$SKILLS_PATH/mimo-v2-5-tts/scripts/`

`$SKILLS_PATH` is the path to the skills directory; it varies with the deployment environment.

Model Selection

The V2.5 series offers three models; choose according to your use case:

Model ID	Use Case	Voice Source	Special Ability
`mimo-v2.5-tts`	Speech synthesis with preset voices	Built-in premium voices	Supports singing
`mimo-v2.5-tts-voicedesign`	Custom voice from a text description	Generated from a text description	—
`mimo-v2.5-tts-voiceclone`	Voice replication from an audio sample	Audio sample	—

Selection recommendations:

Need speech quickly, or need singing → `mimo-v2.5-tts` (preset voices)
Need a unique voice → `mimo-v2.5-tts-voicedesign` (generated from a text description)
Need to imitate a specific voice → `mimo-v2.5-tts-voiceclone` (replicated from an audio sample)

Note: TTS output is stochastic, so the same input can sound different from run to run. When it matters, generate several takes and let the user choose.
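The recommendations above can be sketched as a small helper. This is only an illustration: the function name `pick_model` and the priority order (singing first, then an audio sample, then a text description) are assumptions, not part of the skill's scripts.

```python
def pick_model(has_voice_sample: bool = False,
               has_voice_description: bool = False,
               needs_singing: bool = False) -> str:
    """Map the selection recommendations to a model ID (hypothetical helper)."""
    if needs_singing:
        # Only the preset-voice model is listed as supporting singing.
        return "mimo-v2.5-tts"
    if has_voice_sample:
        # An audio sample is available: replicate that voice.
        return "mimo-v2.5-tts-voiceclone"
    if has_voice_description:
        # Only a text description is available: design a new voice.
        return "mimo-v2.5-tts-voicedesign"
    # Default: the quickest path, a built-in preset voice.
    return "mimo-v2.5-tts"
```

For example, `pick_model(has_voice_sample=True)` selects the cloning model, while a request that needs singing always falls back to the preset-voice model.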

Environment Dependencies

Environment Variable	Description	Required
`MIMO_API_KEY`	MiMo API key (obtained from the MiMo Open Platform)	Yes

Dependency	Description	Required
`python3`	Runs the scripts	Yes
`openai`	`pip install openai`	Yes
`ffmpeg`	Format conversion; splitting and splicing long text	Only for conversion and splicing
`curl`	Calls the Feishu API	Only for sending to Feishu
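A preflight check along the following lines could confirm these requirements before a script is called. It is a sketch only: `missing_requirements` is a hypothetical helper, and the `which` and `has_module` parameters exist purely so the lookups can be swapped out.

```python
import importlib.util
import os
import shutil


def missing_requirements(env=None, need_ffmpeg=False, need_curl=False,
                         which=shutil.which,
                         has_module=lambda name: importlib.util.find_spec(name) is not None):
    """Return the names of required items that are not available."""
    env = os.environ if env is None else env
    missing = []
    if not env.get("MIMO_API_KEY"):
        missing.append("MIMO_API_KEY")
    if not has_module("openai"):
        missing.append("openai")
    tools = ["python3"]
    if need_ffmpeg:   # only for conversion and splicing
        tools.append("ffmpeg")
    if need_curl:     # only for sending to Feishu
        tools.append("curl")
    missing.extend(tool for tool in tools if which(tool) is None)
    return missing
```

An empty list means the basic requirements are met; pass `need_ffmpeg=True` or `need_curl=True` only for the scenarios that need those tools.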

Preset Voices

When using the `mimo-v2.5-tts` model, you must specify a voice explicitly. You can tell the user which voices are available, or pick a suitable one based on the language and style of the content.

Voice Name	Voice ID	Language	Gender	Style
Bing Tang (冰糖)	`冰糖`	Chinese	Female	Lively girl
Mo Li (茉莉)	`茉莉`	Chinese	Female	Intellectual female voice
Su Da (苏打)	`苏打`	Chinese	Male	Sunny young man
Bai Hua (白桦)	`白桦`	Chinese	Male	Mature male voice
Mia	`Mia`	English	Female	Lively girl
Chloe	`Chloe`	English	Female	Sweet Dreamy
Milo	`Milo`	English	Male	Sunny boy
Dean	`Dean`	English	Male	Steady Gentle
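To pick a voice from the language and gender of the content, the table can be held as a lookup. The sketch below assumes the Voice IDs are passed exactly as listed in the original table; `PRESET_VOICES` and `voices_for` are illustrative names, not part of the skill.

```python
# Voice ID -> (language, gender), copied from the preset voice table.
PRESET_VOICES = {
    "冰糖": ("Chinese", "Female"),
    "茉莉": ("Chinese", "Female"),
    "苏打": ("Chinese", "Male"),
    "白桦": ("Chinese", "Male"),
    "Mia": ("English", "Female"),
    "Chloe": ("English", "Female"),
    "Milo": ("English", "Male"),
    "Dean": ("English", "Male"),
}


def voices_for(language, gender=None):
    """List the Voice IDs matching a language and, optionally, a gender."""
    return [
        voice_id
        for voice_id, (lang, sex) in PRESET_VOICES.items()
        if lang == language and (gender is None or sex == gender)
    ]
```

`voices_for("English", "Male")` yields `Milo` and `Dean`, which can then be offered to the user or passed to `--voice`.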

Natural Language Control

All models support natural language control: you describe the desired delivery in natural language, and the model generates speech in that style. Every model accepts these instructions through the `--context` parameter:

`mimo-v2.5-tts` and `mimo-v2.5-tts-voiceclone` use it to adjust style, such as tone and emotion, within the specified voice;
`mimo-v2.5-tts-voicedesign` uses it to control both the voice and the style.

Capabilities:

Multi-style switching: move from news-reading → whisper → roar within a single utterance
Mixed emotions: supports compound emotions such as "suppressed anger" or "a smile through a choked voice"
Multi-granularity control: can be specified at paragraph, sentence, word, or character level

Examples:

Report good news to your boss in a light, rising tone, slightly fast, with the excitement and touch of pride you cannot hold back after seeing your results; the voice is bright and energetic.

Unable to help crowing with delight over the hard problem you have just solved: a high, bright voice, fairly fast, full of confidence and disbelief.

Director Mode

A special use of natural language control is "Director Mode", which portrays the character and the voice along three dimensions: role, scene, and guidance.

[Role] The character's identity, core personality, appearance and bearing, and speaking habits
[Scene] What is happening right now, who is being addressed, and where the emotion stands
[Guidance] Pace, breath, pauses, stress, resonance placement, voice texture, and emotional rise and fall

Director Mode example:

Role: The current matriarch of the Cen family, a clan a century old. Handed over at birth to be raised by the old gatekeeper of the ancestral temple, she was molded into a flawless family totem, stripped of feeling and desire. She lives in seclusion and keeps a strong, class-conscious distance from others.

Scene: In the shadows of the ancestral hall, she watches the man who broke through the security line at any cost to find her and run away with her. She must use the coldest, hardest class barrier to crush him, and with him her own feelings, newly sprouted yet strong enough to set a prairie ablaze.

Guidance:
A cold, languid, yet commanding low female voice. The vocal tract is fully relaxed, with no overt aggression, yet it carries a pressure that chills to the bone.

- Pace and pauses: Extremely slow. Every word seems to roll across the tip of the tongue before it is released, with the offhand arrogance of someone used to command. Leave very long, unsettling silences between sentences.
- Breath and full voice: Most of the time her voice shows no obvious pitch movement; the full voice is heavy and hard, like a smooth but icy underground river. On certain final syllables (such as "sincerity"), however, close with a very slight breathy release, betraying a trace of weariness and longing that she herself has not noticed.
- Articulation: The mix of classical and colloquial wording carries traces of an older era. Labiodental sounds are very light but very clear (as in "冲撞" and "廉价"), sounding both refined and sharp, every cut drawing blood.

Director Mode suits scenarios that demand a high level of vocal performance, such as character dubbing and film-grade content generation.
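A Director Mode prompt can be assembled from its three parts before it is passed to `--context`. The sketch below is one possible layout: the `Role:` / `Scene:` / `Guidance:` labels follow the example above, while the function name and the blank-line separator are assumptions.

```python
def director_context(role: str, scene: str, guidance: str) -> str:
    """Join role, scene, and guidance into one Director Mode prompt."""
    parts = [("Role", role), ("Scene", scene), ("Guidance", guidance)]
    for label, value in parts:
        if not value.strip():
            # All three dimensions are needed for a full portrait.
            raise ValueError(f"{label} must not be empty")
    return "\n\n".join(f"{label}: {value.strip()}" for label, value in parts)
```

The returned string is a single argument, so it can be passed to `--context` as is.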

Audio Tag Control

`mimo-v2.5-tts` and `mimo-v2.5-tts-voiceclone` support audio tag control. `mimo-v2.5-tts-voicedesign` does not support it yet; to control style there, write a more detailed description via `--context`.

Insert bracketed descriptions of tone, emotion, or sound-producing actions anywhere in the text to switch delivery mid-sentence. The bracketed content must be sound-related (tone, emotion, sighing, yawning, coughing), not a body movement (turning around, sitting down, waving).

Chinese text supports three bracket types: full-width `（）`, half-width `()`, and square brackets `[]`. English text supports two: half-width `()` and square brackets `[]`.

Examples translated from the Chinese originals:

(nervous, deep breath) Phew... Calm down, calm down. It's just an interview... (speeding up, muttering) I've rehearsed my self-introduction fifty times, it should be fine. Come on, you can do it... (quietly) Oh no, is my tie crooked?
(utterly exhausted, barely audible) Driver... wake me up when we get there... (long sigh) I'm going to doze off for a bit, this overtime has drained the life out of me.
If only I had... (brief silence) held on for even one more second, would things have turned out differently? (bitter laugh) Heh, there's no "if" anymore.
(short, rapid breaths from the cold) Hoo—hoo—th-this snow in the Greater Khingan Range... (coughs) it freezes you right to the bone... D-don't stop, go, keep going.
(raising voice to shout) Ma'am! This fish is fresh! Caught just this morning! Hey! You there, stop rummaging, you'll pay for anything you crush, you hear?!

English examples:

Achoo! Ahem. I—I really (cough) think I am coming down with a terrible (cough) terrible cold.
(heavy breathing) Just... give me... a second. I ran... all the way... from the station.
I just feel... (long sigh) like I'm constantly treading water, you know?
It's just so stupid! (sobbing) We spent all that money on the cake and the dog just... (sudden laugh) he just ate the whole thing in one bite!

Brackets can appear at the beginning, middle, or end of the text and can describe tone, emotion, breathing, laughter, crying, panting, coughing, yawning, sighing, and any other audible content, but never body movements.
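When preparing text programmatically, it can help to list the audio tags a string contains, for instance to review them before synthesis. The following is a minimal sketch that recognizes the three bracket forms named above; the helper name is hypothetical, and nested brackets are not handled.

```python
import re

# One alternative per bracket type: half-width (), full-width （）, square [].
_TAG_PATTERN = re.compile(r"\(([^()]*)\)|（([^（）]*)）|\[([^\[\]]*)\]")


def extract_audio_tags(text: str) -> list:
    """Return the contents of every bracketed tag, in order of appearance."""
    tags = []
    for match in _TAG_PATTERN.finditer(text):
        # Exactly one of the three groups is set for each match.
        content = next(g for g in match.groups() if g is not None)
        tags.append(content.strip())
    return tags
```

The result only shows what is inside the brackets; judging whether a tag is sound-related rather than a body movement still needs a human or model review.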

Overall Style Tags

An overall style tag is a special case of audio tag control: put a `(style)` tag at the very beginning of the target text to set the delivery style for the whole utterance.

Format example: `(style1 style2)text to synthesize`

Singing: the `(唱歌)` tag must come at the very beginning, in the form `(唱歌)lyrics`. Supported identifiers inside the tag: `唱歌`, `sing`, `singing`.

Category	Common Styles
Basic emotions	`开心` (happy) `悲伤` (sad) `愤怒` (angry) `恐惧` (fearful) `惊讶` (surprised) `兴奋` (excited) `委屈` (aggrieved) `平静` (calm) `冷漠` (indifferent)
Compound emotions	`怅然` (wistful) `欣慰` (gratified) `无奈` (resigned) `愧疚` (guilty) `释然` (at peace) `嫉妒` (jealous) `厌倦` (weary) `忐忑` (uneasy) `动情` (moved)
Overall tone	`温柔` (gentle) `高冷` (aloof) `活泼` (lively) `严肃` (serious) `慵懒` (languid) `俏皮` (playful) `深沉` (deep) `干练` (crisp) `凌厉` (sharp)
Timbre	`磁性` (magnetic) `醇厚` (mellow) `清亮` (clear) `空灵` (ethereal) `稚嫩` (childlike) `苍老` (aged) `甜美` (sweet) `沙哑` (hoarse) `醇雅` (refined)
Persona voice	`夹子音` (cutesy pinched voice) `御姐音` (mature female voice) `正太音` (young boy voice) `大叔音` (middle-aged male voice) `台湾腔` (Taiwanese accent)
Dialect	`东北话` (Northeastern Mandarin) `四川话` (Sichuanese) `河南话` (Henan dialect) `粤语` (Cantonese)
Role-play	`孙悟空` (Sun Wukong) `林黛玉` (Lin Daiyu)
Singing	`唱歌` (singing)

Classic combinations:

`(怅然) 这么多年过去了...` (wistful: "So many years have passed...")
`(慵懒) 再让我睡五分钟...` (languid: "Let me sleep five more minutes...")
`(磁性) 夜已经深了...` (magnetic: "The night is already deep...")
`(东北话) 哎呀妈呀...` (Northeastern Mandarin: "Oh my goodness...")
`(粤语) 呢个真係好正啊...` (Cantonese: "This one is really great...")
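Prefixing a style tag is simple string assembly. The sketch below follows the `(style1 style2)text` format and the three singing identifiers listed above; the function names are hypothetical.

```python
SINGING_TAGS = ("唱歌", "sing", "singing")  # identifiers listed in the source


def with_style(text: str, *styles: str) -> str:
    """Prefix text with an overall style tag, e.g. (style1 style2)text."""
    if not styles:
        return text
    return f"({' '.join(styles)}){text}"


def as_song(lyrics: str, tag: str = "唱歌") -> str:
    """Prefix lyrics with a singing tag placed at the very beginning."""
    if tag not in SINGING_TAGS:
        raise ValueError(f"unsupported singing tag: {tag!r}")
    return f"({tag}){lyrics}"
```

For instance, `with_style("再让我睡五分钟...", "慵懒")` produces `(慵懒)再让我睡五分钟...`, ready to pass as `--text`.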

Voice Description Writing

When you use `mimo-v2.5-tts-voicedesign` to create a custom voice from text, you write a short voice description that guides the model toward the intended vocal traits. It is passed through the same `--context` parameter as natural language control. The requirements are as follows.

A voice description is the voice's ID card. Describe only what the voice itself sounds like: no scene, no actions, no mention of what will be said this time. The more condensed it is, the easier it is to reuse across scenarios.

Required:

Identity anchor: age group + gender. This sets the fundamental frequency and underlies every other trait.
Voice texture: breath flow, resonance placement, articulation, and base timbre. Use tangible verbs or metaphors instead of stacking adjectives.
Pace and rhythm: steady / fast / slow / alternately fast and slow / rapid-fire with marked pauses.
Emotional baseline: the voice's default state (high-energy / relaxed / warm and soft / restrained).

Optional (strongly recommended):

Style/identity tag: a profession or style anchor mentioned in passing, which the model grasps immediately. Examples: auctioneer style / food critic style / news announcer style / courtroom statement style.
Signature quirk: one habit that makes the voice memorable on first hearing. Examples: occasionally inhales with eyes closed / a tremolo on word endings / laughs with a muffled chest chuckle.

Hard constraints:

One to two sentences of plain description; no paragraphs, no bullet lists.
No scenes ("at a launch event", "on the night shift").
No actions ("she walks onto the stage").
No real actors or IP character names (copyright concerns, and they generalize poorly).
Default to Mandarin or English; add a dialect only when explicitly required.

Voice description examples:

Middle-aged male, extremely fast rhythm, high-energy, auctioneer style. Rapid-fire articulation with strong cadence and a sense of urgency.

Young male, esports commentator style, extremely fast and continuous, with audible breaths and explosive emphasis; the voice rises and sharpens when excited.

Middle-aged male, courtroom statement style, steady and fairly formal, neatly articulated with a beat on every word; restrained emotion, but every word carries weight.
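Two of the hard constraints (one to two sentences, no paragraphs) can be checked mechanically. The sketch below is a rough heuristic under stated assumptions: sentences are counted by splitting on Chinese and English end punctuation, so abbreviations or decimals would be miscounted, and the function name is hypothetical.

```python
import re

# Sentence-ending punctuation in Chinese and English.
_SENTENCE_END = re.compile(r"[。！？.!?]+")


def description_problems(description: str) -> list:
    """Return a list of violated constraints (empty if none were found)."""
    problems = []
    stripped = description.strip()
    if "\n" in stripped:
        problems.append("contains line breaks (must be one unbroken passage)")
    sentences = [s for s in _SENTENCE_END.split(stripped) if s.strip()]
    if not 1 <= len(sentences) <= 2:
        problems.append(f"has {len(sentences)} sentences (expected 1 to 2)")
    return problems
```

The remaining constraints (no scenes, no actions, no real names) concern meaning and are not covered by this check.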

Content and Tag Enhancement

When the user does not directly provide the text to be spoken, write the text to synthesize yourself, based on the role-play, creative brief, or other context.

When the user provides only the text, with no detail about tone or emotion, lightly adjust the text and insert suitable tags as the situation requires.

Hard rules:

The emotion of the text must fit the voice's baseline. A soft-voiced grandma does not roar; an auctioneer does not deliver a late-night monologue.
Length: 2–5 sentences in a single paragraph. If the text is too short, the model cannot establish a rhythm and subtle vocal traits do not come through.
Tags are seasoning, not the main dish. Pause where a pause belongs, laugh where a laugh belongs, and do not pile tags up. Use at most one tag per sentence.
Punctuation is part of the performance: ellipsis = pause / dash = interrupted or drawn-out sound / ALL CAPS = emphasis.
Tag language follows the body text. Chinese text takes Chinese tags and English text takes English tags; mixing is not allowed.

Recommended tags for Chinese text (use `[标签]`):

Category	Common Tags
Pacing	`[停顿]` (pause) `[长停顿]` (long pause) `[急促]` (hurried) `[拖音]` (drawn out) `[语速加快]` (speed up) `[语速放缓]` (slow down)
Emotion	`[轻声]` (softly) `[低语]` (whisper) `[叹气]` (sigh) `[吸气]` (inhale) `[哽咽]` (choked up) `[强调]` (emphasis) `[笑]` (laugh) `[爽朗大笑]` (hearty laugh)
Other	`[欲言又止]` (hesitates to speak) `[碎碎念]` (muttering) `[沉默片刻]` (brief silence)

Recommended tags for English text (use `[tag]`):

Category	Common Tags
Pacing	`[pause]` `[long pause]` `[fast]` `[slow]` `[drawn out]`
Emotion	`[whispering]` `[sighs]` `[inhale]` `[choked up]` `[emphasis]` `[laughs]` `[hearty laugh]`

Paired examples (translated from the Chinese originals):

Voice: Middle-aged female, food critic style, a soft and infectious tone, savoring and relishing with some exaggeration when describing, occasionally inhaling with eyes closed.

Text:
[inhale] Mmm—one bite of this… [pause] and it wraps around the whole tip of my tongue. [drawn out] First the sweetness of caramel, then [emphasis] just a touch [pause] of perfectly judged salt, slowly unfolding in the aftertaste. [sighs] Honestly, it has been [whispering] such a long time [pause] since I had a dessert made with this much care.

Voice: Elderly female, a warm, soft, slow voice with a slight tremor and a kindly quality, like a grandma telling a grandchild a bedtime story.

Text:
[whispering] Come, lie down. [pause] Grandma will tell you a story. [long pause] Once upon a time, at the foot of a mountain, there was a little wooden hut… [drawn out] and in it lived a little rabbit. [sighs] This rabbit loved to play, and ran outside every day the moment the sun came up. [pause] And then one day—[emphasis] it rained. [whispering] It got lost, and no matter what, it could not find its way home…
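The rule that tag language follows the body text lends itself to an automated check. The sketch below treats a string as Chinese if it contains any character in the common CJK range, which is a simplifying assumption; the function names are hypothetical.

```python
import re

_TAG = re.compile(r"\[([^\[\]]+)\]")
_CJK = re.compile(r"[\u4e00-\u9fff]")  # common CJK ideographs only


def is_chinese(s: str) -> bool:
    """True if the string contains at least one CJK ideograph."""
    return bool(_CJK.search(s))


def mismatched_tags(text: str) -> list:
    """Return square-bracket tags whose language differs from the body text."""
    body = _TAG.sub("", text)          # body text with all tags removed
    body_is_chinese = is_chinese(body)
    return [tag for tag in _TAG.findall(text)
            if is_chinese(tag) != body_is_chinese]
```

An empty result means every tag matches the body language; any returned tag should be rewritten in the language of the surrounding text.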

Python Script Usage

Each of the three models has its own script; choose the one that fits your needs:

Script	Model	Use Case
`mimo_tts.py`	`mimo-v2.5-tts`	Speech synthesis with preset voices
`mimo_tts_voicedesign.py`	`mimo-v2.5-tts-voicedesign`	Custom voice from a text description
`mimo_tts_voiceclone.py`	`mimo-v2.5-tts-voiceclone`	Voice replication from an audio sample

Preset Voice Speech Synthesis (mimo_tts.py)

```bash
# Text: "Hello, the weather is really nice today."
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "你好，今天天气真不错。" \
  --voice "冰糖"
```

Preset Voice + Natural Language Style Control

```bash
# Context: "Speak in a gentle tone, slightly slower."
# Text: "It's okay, take your time, I'll wait for you."
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --context "用温柔的语气，语速稍慢" \
  --text "没关系，慢慢来，我等你。" \
  --voice "冰糖" \
  --output tmp/mimo-v2.5-tts/comfort.wav
```

Preset Voice + Audio Tag Control

```bash
# Text: "(nervous, deep breath) Phew... Calm down, calm down. It's just an
# interview... (quietly) Oh no, is my tie crooked?"
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "（紧张，深呼吸）呼……冷静，冷静。不就是一个面试吗……（小声）哎呀，领带歪没歪？" \
  --voice "冰糖" \
  --output tmp/mimo-v2.5-tts/interview.wav
```

Voice Design (mimo_tts_voicedesign.py)

```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts_voicedesign.py \
  --context "Give me a young male tone." \
  --text "Yes, I had a sandwich." \
  --output tmp/mimo-v2.5-tts/voicedesign.wav
```

Tip: You can save audio generated by Voice Design and later pass it to `mimo_tts_voiceclone.py` as the `--voice-file` input, giving a "design → clone" workflow: first generate a voice you like from a text description (`--context`), then use that audio as the sample to clone the voice onto other text. When cloning, you can also add Director Mode instructions via `--context`.

Voice Cloning (mimo_tts_voiceclone.py)

Note: The voice sample must not exceed 10 MB after Base64 encoding, and only mp3 and wav formats are supported.

```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts_voiceclone.py \
  --voice-file voice.mp3 \
  --text "Yes, I had a sandwich." \
  --output tmp/mimo-v2.5-tts/voiceclone.wav
```
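Because the 10 MB limit applies to the Base64-encoded sample, the raw file has to be somewhat smaller: Base64 turns every 3 bytes into 4 characters. A sketch of a pre-check follows; it assumes "10 MB" means 10 × 1024 × 1024 bytes, which the source does not specify, and the helper names are hypothetical.

```python
from pathlib import Path

MAX_BASE64_BYTES = 10 * 1024 * 1024   # assumption: 10 MB read as 10 MiB
ALLOWED_SUFFIXES = {".mp3", ".wav"}


def base64_length(raw_size: int) -> int:
    """Length of the padded Base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


def sample_problems(filename: str, raw_size: int) -> list:
    """Check a voice sample's format and encoded size before cloning."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append("unsupported format (only mp3 and wav)")
    if base64_length(raw_size) > MAX_BASE64_BYTES:
        problems.append("Base64-encoded sample exceeds 10 MB")
    return problems
```

Under this assumption the raw file can be at most 7,864,320 bytes (7.5 MiB); the size of an actual file can be read with `Path(filename).stat().st_size`.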

Voice Cloning + Director Mode

```bash
# Context: "Speak in a gentle tone, slightly slower."
# Text: "It's okay, take your time, I'll wait for you."
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts_voiceclone.py \
  --voice-file voice.mp3 \
  --context "用温柔的语气，语速稍慢" \
  --text "没关系，慢慢来，我等你。" \
  --output tmp/mimo-v2.5-tts/voiceclone_director.wav
```

Singing

Note: Provide complete lyrics; fragmentary lyrics lead to off-key singing and poor results.

```bash
# The (唱歌) tag must come first; the lyrics stay in their original language.
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "(唱歌)原谅我这一生不羁放纵爱自由，也会怕有一天会跌倒，Oh no。背弃了理想，谁人都可以，哪会怕有一天只你共我。" \
  --voice "冰糖" \
  --output tmp/mimo-v2.5-tts/singing.wav
```

English + Audio Tags

```bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "I just... (sighs deeply) I don't know anymore. (suddenly firm) But I won't give up!" \
  --voice "Mia" \
  --output tmp/mimo-v2.5-tts/english.wav
```
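When the scripts are driven from Python rather than a shell, building the argument list explicitly avoids quoting problems with brackets and non-ASCII text. The sketch below uses only the flags that appear in the commands in this section (`--text`, `--voice`, `--context`, `--voice-file`, `--output`); the helper `tts_command` itself is hypothetical.

```python
def tts_command(script, text, skills_path, voice=None, context=None,
                voice_file=None, output=None):
    """Build the argv list for one of the three TTS scripts."""
    argv = ["python3", f"{skills_path}/mimo-v2-5-tts/scripts/{script}"]
    flags = [
        ("--voice-file", voice_file),  # mimo_tts_voiceclone.py only
        ("--context", context),
        ("--text", text),
        ("--voice", voice),            # mimo_tts.py only
        ("--output", output),
    ]
    for flag, value in flags:
        if value is not None:
            argv += [flag, value]
    return argv
```

The list can be handed to `subprocess.run(argv, check=True)`; since no shell is involved, the text needs no extra escaping.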

Long Text Processing

V2.5 handles long text well. In almost all cases, generate the audio in one pass without manual segmentation. Only when the text exceeds 2,500 characters should you consider synthesizing it in segments and splicing them.

To synthesize in segments, split the text at natural boundaries (sentence ends or paragraphs), generate a separate WAV for each segment, then splice them with ffmpeg:

```bash
# Create the file list
echo "file '/tmp/mimo-v2.5-tts/part1.wav'" > /tmp/mimo-v2.5-tts/list.txt
echo "file '/tmp/mimo-v2.5-tts/part2.wav'" >> /tmp/mimo-v2.5-tts/list.txt

# Splice
ffmpeg -y -f concat -safe 0 -i /tmp/mimo-v2.5-tts/list.txt -c copy /tmp/mimo-v2.5-tts/combined.wav
```
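The segmentation step can be sketched in Python as follows. The 2,500-character threshold comes from the text above; the greedy packing strategy and the function names are assumptions, and a single sentence longer than the limit is left as one oversized chunk.

```python
import re

MAX_CHARS = 2500  # threshold stated in the text above

# A sentence: any run of text up to and including its end punctuation or newline.
_SENTENCE = re.compile(r".+?(?:[。！？.!?\n]+|\Z)", re.S)


def split_long_text(text: str, limit: int = MAX_CHARS) -> list:
    """Split text at sentence boundaries into chunks of at most limit chars."""
    if len(text) <= limit:
        return [text]
    chunks, current = [], ""
    for sentence in _SENTENCE.findall(text):
        if current and len(current) + len(sentence) > limit:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


def concat_list(paths) -> str:
    """Render the list file read by ffmpeg's concat demuxer."""
    # Paths containing single quotes would need escaping; not handled here.
    return "".join(f"file '{path}'\n" for path in paths)
```

Each chunk is synthesized to its own WAV, the output of `concat_list` is written to `list.txt`, and the ffmpeg command above joins the parts.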

Feishu Voice Message Sending

Note: Use this feature only when the user needs TTS-generated speech sent to Feishu.

It sends the TTS-generated WAV to Feishu in a single command: generate → transcode → upload → send.

Why not use the message tool: a Feishu voice message requires calling the `/im/v1/messages` endpoint (`msg_type: audio`), and the audio must be uploaded first to obtain a `file_key`. The message tool in many tools (`asVoice: true`) does not implement this logic on the Feishu channel and sends the audio as an ordinary attachment, so the user sees a file rather than a voice message. `feishu_send_audio.sh` performs the complete upload + send flow and must not be replaced with a message tool.

Environment Dependencies

Environment Variable	Source	Description
`FEISHU_APP_ID`	Feishu Open Platform	App ID of the application
`FEISHU_APP_SECRET`	Feishu Open Platform	App Secret of the application

Dependency	Description	Required
`ffmpeg`	Converts WAV to Opus and reads the audio duration	Yes
`curl`	Calls the Feishu API	Yes

Usage

Send to a Private Chat (open_id)

```bash
# 1. Generate the voice (text: "Okay, it'll be ready in a moment!")
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "好的，马上就好！" \
  --voice "冰糖" \
  --output /tmp/mimo-v2.5-tts/voice.wav

# 2. Send to a Feishu private chat (receive_id_type is open_id, receive_id is the user ID)
bash $SKILLS_PATH/mimo-v2-5-tts/scripts/feishu_send_audio.sh /tmp/mimo-v2.5-tts/voice.wav open_id ou_xxxxxx
```

Send to a Group Chat (chat_id)

```bash
# 1. Generate the voice (text: "Hello everyone, the weather is really nice today!")
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "大家好，今天天气真不错！" \
  --voice "冰糖" \
  --output /tmp/mimo-v2.5-tts/voice.wav

# 2. Send to a Feishu group chat (receive_id_type is chat_id, receive_id is the group ID)
bash $SKILLS_PATH/mimo-v2-5-tts/scripts/feishu_send_audio.sh /tmp/mimo-v2.5-tts/voice.wav chat_id oc_xxxxxx
```

Internally, `feishu_send_audio.sh` runs: WAV → Opus (ffmpeg) → obtain `tenant_access_token` → upload the audio file → send the `audio` message.
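The two invocations differ only in the ID type and the receiver ID, so the call can be wrapped in a small helper. In the sketch below, inferring the type from the `ou_` and `oc_` prefixes is an assumption drawn from the example IDs above, and `feishu_send_command` is a hypothetical name; the argument order matches the commands shown.

```python
def feishu_send_command(wav_path: str, receive_id: str, skills_path: str) -> list:
    """Build the argv list for feishu_send_audio.sh."""
    if receive_id.startswith("ou_"):
        id_type = "open_id"    # private chat, user ID
    elif receive_id.startswith("oc_"):
        id_type = "chat_id"    # group chat, group ID
    else:
        raise ValueError(f"cannot infer receive_id_type from {receive_id!r}")
    script = f"{skills_path}/mimo-v2-5-tts/scripts/feishu_send_audio.sh"
    return ["bash", script, wav_path, id_type, receive_id]
```

The returned list can be run with `subprocess.run(argv, check=True)` once `FEISHU_APP_ID` and `FEISHU_APP_SECRET` are set in the environment.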