mimo-v2-5-tts

MiMo V2.5 TTS

Generate speech with the Xiaomi MiMo V2.5 TTS model family. It supports Chinese and English, preset voices, voice design, voice cloning, emotional styles, dialects, and singing.

Script directory: $SKILLS_PATH/mimo-v2-5-tts/scripts/

Note: $SKILLS_PATH is the path to the skills directory; it varies by deployment environment.

Model Selection

The V2.5 series offers three models. Choose one based on your use case:

| Model ID | Use case | Voice source | Special ability |
| --- | --- | --- | --- |
| mimo-v2.5-tts | Speech synthesis with preset voices | Built-in premium voices | Supports singing |
| mimo-v2.5-tts-voicedesign | Custom voice from a text description | Generated from a text description | |
| mimo-v2.5-tts-voiceclone | Voice replicated from an audio sample | Audio sample | |

Recommendations:
  • Quick speech generation, or singing → mimo-v2.5-tts (preset voice)
  • A unique voice → mimo-v2.5-tts-voicedesign (generated from a text description)
  • Imitating a specific voice → mimo-v2.5-tts-voiceclone (replicated from an audio sample)

Note: TTS output is nondeterministic, so the same input can sound different from run to run. If needed, generate several takes and let the user pick the best one.
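The recommendations above can be sketched as a small helper. This is only an illustration: the function name, its parameters, and the order in which the conditions are checked are assumptions, not part of the skill's scripts.

```python
def choose_model(needs_singing: bool = False,
                 has_audio_sample: bool = False,
                 has_voice_description: bool = False) -> str:
    """Return a MiMo V2.5 TTS model ID for the given needs (hypothetical helper)."""
    if needs_singing:
        # Only the preset-voice model is listed as supporting singing.
        return "mimo-v2.5-tts"
    if has_audio_sample:
        # An audio sample means the goal is to imitate a specific voice.
        return "mimo-v2.5-tts-voiceclone"
    if has_voice_description:
        # A text description means the goal is a unique, designed voice.
        return "mimo-v2.5-tts-voicedesign"
    # Default: quick generation with a preset voice.
    return "mimo-v2.5-tts"
```

Singing is checked first because it is the one capability tied to a single model in the table.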

Environment Dependencies

| Environment variable | Description | Required |
| --- | --- | --- |
| MIMO_API_KEY | MiMo API key (obtained from the MiMo Open Platform) | Yes |

| Dependency | Description | Required |
| --- | --- | --- |
| python3 | Runs the scripts | Yes |
| openai | pip install openai | Yes |
| ffmpeg | Format conversion; splitting and concatenating long text | Only for conversion and concatenation |
| curl | Calls the Feishu API | Only for sending to Feishu |
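A pre-flight check along the lines of the tables above might look like the sketch below. The function name and its injectable `which` / `find_spec` parameters are assumptions made for illustration and testability; the skill's scripts may check their environment differently.

```python
import importlib.util
import shutil


def missing_requirements(env, need_ffmpeg=False, need_curl=False,
                         which=shutil.which,
                         find_spec=importlib.util.find_spec):
    """List the requirements that appear to be missing (hypothetical helper)."""
    missing = []
    # The API key is always required.
    if not env.get("MIMO_API_KEY"):
        missing.append("MIMO_API_KEY")
    # python3 is always required; ffmpeg and curl only in some scenarios.
    tools = ["python3"]
    if need_ffmpeg:
        tools.append("ffmpeg")
    if need_curl:
        tools.append("curl")
    for tool in tools:
        if which(tool) is None:
            missing.append(tool)
    # The openai Python package is always required.
    if find_spec("openai") is None:
        missing.append("openai")
    return missing
```

In real use it would be called as `missing_requirements(os.environ, need_ffmpeg=True)` before invoking a script.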

Preset Voices

When using the mimo-v2.5-tts model, you must specify a voice explicitly. You can tell the user which voices are available, or pick a suitable one based on the language and style of the content. Pass the Voice ID exactly as listed; the romanized names of the Chinese voices are given only for readability.

| Voice name | Voice ID | Language | Gender | Style |
| --- | --- | --- | --- | --- |
| Bing Tang | 冰糖 | Chinese | Female | Lively girl |
| Mo Li | 茉莉 | Chinese | Female | Intellectual female voice |
| Su Da | 苏打 | Chinese | Male | Sunny young man |
| Bai Hua | 白桦 | Chinese | Male | Mature male voice |
| Mia | Mia | English | Female | Lively girl |
| Chloe | Chloe | English | Female | Sweet, dreamy |
| Milo | Milo | English | Male | Sunny boy |
| Dean | Dean | English | Male | Steady, gentle |
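Picking a voice by content language can be sketched with a simple lookup. The dictionary mirrors the table above; the `voices_for` helper is hypothetical and not part of the skill.

```python
# Voice ID -> (language, gender), copied from the preset voice table.
PRESET_VOICES = {
    "冰糖": ("Chinese", "Female"),
    "茉莉": ("Chinese", "Female"),
    "苏打": ("Chinese", "Male"),
    "白桦": ("Chinese", "Male"),
    "Mia": ("English", "Female"),
    "Chloe": ("English", "Female"),
    "Milo": ("English", "Male"),
    "Dean": ("English", "Male"),
}


def voices_for(language, gender=None):
    """Return the Voice IDs matching a language and, optionally, a gender."""
    return [
        voice_id
        for voice_id, (lang, sex) in PRESET_VOICES.items()
        if lang == language and (gender is None or sex == gender)
    ]
```

The returned IDs are the values to pass to `--voice`.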

Natural Language Control

All models support natural language control: describe the delivery you want in plain language and the model generates speech in that style. Pass the instruction through the --context parameter. With mimo-v2.5-tts and mimo-v2.5-tts-voiceclone, --context adjusts the tone, emotion, and other style aspects of the chosen voice; with mimo-v2.5-tts-voicedesign, the same option controls both the voice itself and its style.

Capabilities:
  • Multi-style switching: move from broadcast delivery → whisper → roar within a single utterance
  • Mixed emotions: compound emotions such as "suppressed anger" or "laughing through a lump in the throat"
  • Multi-granularity control: instructions can target the paragraph, sentence, word, or individual character

Examples:
Report good news to your manager in a light, rising tone, slightly fast, with the excitement and touch of pride you cannot hold back after seeing your results; the voice is bright and energetic.

Looking at the solution to a problem you have just cracked, you cannot help crowing with delight: high, bright voice, fairly fast pace, a tone full of confidence and disbelief.

Director Mode

A special use of natural language control is "Director Mode", which portrays the character and the voice along three dimensions: role, scene, and direction.
  • [Role] Who the character is, core personality, appearance and bearing, and speaking habits
  • [Scene] What is happening right now, who is being addressed, and where the emotion stands
  • [Direction] Pace, breath, pauses, stress, resonance placement, timbre, and emotional arc

Director Mode example:
Role: The current head of the Cen family, a clan a century old. Given away at birth to be raised by the old gatekeeper of the ancestral temple, she was shaped into a flawless family totem, stripped of feeling and desire. She rarely leaves the house and keeps a strong class distance from everyone.

Scene: In the shadows of the ancestral hall, she looks at the man who fought his way past the security line to find her and take her away with him. She means to use the coldest, hardest class barrier to strangle his feelings, and with them her own, which have only just sprouted yet could already spread like wildfire.

Direction:
A cold, languid, yet commanding low-pitched mature female voice. The vocal tract is fully relaxed, with no hint of confrontation, yet the pressure chills to the bone.

- Pace and pauses: Extremely slow. Every word seems to roll across the tip of the tongue before it is released, with the offhand arrogance of someone used to power. Leave very long, unsettling silences between sentences.
- Breathy and full voice: Most of the time her pitch barely moves and the tone is full, heavy, and hard, like a slow, cold underground river. On certain final syllables (such as "sincerity"), add a very slight breathy release that betrays a weariness and longing she has not noticed herself.
- Articulation: The mix of classical and colloquial wording carries traces of an older era. Labiodental sounds are very light but very clear (as in "offend" and "cheap"), refined and cutting at once, drawing blood with every stroke.

Director Mode suits scenarios that demand a high level of vocal performance, such as character dubbing and film-grade content generation.
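A Director Mode instruction can be assembled from its three parts before being passed to `--context`. The sketch below assumes the `Role:` / `Scene:` / `Direction:` labels and blank-line separators used in the example; the function name is hypothetical, and the model may accept other layouts.

```python
def build_director_context(role: str, scene: str, direction: str) -> str:
    """Join the three Director Mode dimensions into one --context string."""
    parts = [("Role", role), ("Scene", scene), ("Direction", direction)]
    for label, value in parts:
        # All three dimensions are needed to portray the character fully.
        if not value.strip():
            raise ValueError(f"{label} must not be empty")
    return "\n\n".join(f"{label}: {value.strip()}" for label, value in parts)
```

For Chinese content, the labels would presumably be written in Chinese, as in the original example (角色, 场景, 指导).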

Audio Tag Control

mimo-v2.5-tts and mimo-v2.5-tts-voiceclone support audio tag control. mimo-v2.5-tts-voicedesign does not support it yet; to control style with that model, write a more detailed description in --context.

Anywhere in the text, put a description of tone, emotion, or a sound-producing action in brackets to switch delivery mid-sentence. The bracket content must relate to sound (tone, emotion, a sigh, a yawn, a cough), not to body movement (turning around, sitting down, waving).

Chinese text supports three bracket types: full-width (), half-width (), and square brackets []. English text supports two: half-width () and square brackets [].

Examples translated from Chinese:
(nervous, deep breath) Phew... Calm down, calm down. It's just an interview... (speeding up, muttering) I've rehearsed my self-introduction fifty times, it should be fine. Come on, you can do this... (quietly) Oh no, is my tie crooked?
(utterly exhausted, barely audible) Driver... give me a shout when we get there... (long sigh) I'm going to shut my eyes for a bit. This overtime has just about finished me off.
If only I had... (brief silence) held on for even one more second, would things have turned out differently? (bitter laugh) Heh. There are no ifs now.
(short, rapid breaths from the cold) Hoo—hoo—th-this snow in the Greater Khingan Range... (coughs) it freezes you right through to the bone... D-don't stop. Go, keep going.
(raising voice to shout) Ma'am! These fish are fresh! Caught just this morning! Hey! You there, stop pawing through them! If you crush them, are you going to pay?!

English examples:
Achoo! Ahem. I—I really (cough) think I am coming down with a terrible (cough) terrible cold.
(heavy breathing) Just... give me... a second. I ran... all the way... from the station.
I just feel... (long sigh) like I'm constantly treading water, you know?
It's just so stupid! (sobbing) We spent all that money on the cake and the dog just... (sudden laugh) he just ate the whole thing in one bite!

Brackets can appear at the beginning, middle, or end of the text and can describe tone, emotion, breathing, laughter, crying, panting, coughing, yawning, sighing, and anything else that produces sound. Do not describe body movements (turning around, sitting down, waving).

Overall Style Tags

An overall style tag is a special case of audio tag control: add a (style) tag at the very beginning of the target text to set the delivery style for the whole utterance.

Format example: (style1 style2)Content to be synthesized

Singing: the (唱歌) tag must come first, in the form (唱歌)lyrics. Accepted identifiers inside the tag are 唱歌, sing, and singing.

The table lists each Chinese tag with an English gloss.

| Category | Common styles |
| --- | --- |
| Basic emotions | 开心 (happy), 悲伤 (sad), 愤怒 (angry), 恐惧 (fearful), 惊讶 (surprised), 兴奋 (excited), 委屈 (aggrieved), 平静 (calm), 冷漠 (indifferent) |
| Compound emotions | 怅然 (wistful), 欣慰 (gratified), 无奈 (resigned), 愧疚 (guilty), 释然 (relieved), 嫉妒 (jealous), 厌倦 (weary), 忐忑 (uneasy), 动情 (moved) |
| Overall tone | 温柔 (gentle), 高冷 (aloof), 活泼 (lively), 严肃 (serious), 慵懒 (languid), 俏皮 (playful), 深沉 (deep), 干练 (crisp), 凌厉 (sharp) |
| Voice texture | 磁性 (magnetic), 醇厚 (mellow), 清亮 (clear), 空灵 (ethereal), 稚嫩 (childlike), 苍老 (aged), 甜美 (sweet), 沙哑 (hoarse), 醇雅 (refined) |
| Persona voice | 夹子音 (cutesy high-pitched voice), 御姐音 (mature female voice), 正太音 (young boy voice), 大叔音 (middle-aged male voice), 台湾腔 (Taiwanese accent) |
| Dialects | 东北话 (Northeastern Mandarin), 四川话 (Sichuanese), 河南话 (Henan dialect), 粤语 (Cantonese) |
| Role-playing | 孙悟空 (Sun Wukong), 林黛玉 (Lin Daiyu) |
| Singing | 唱歌 (singing) |

Classic combinations:
  • (怅然) 这么多年过去了... : (wistful) So many years have passed...
  • (慵懒) 再让我睡五分钟... : (languid) Let me sleep five more minutes...
  • (磁性) 夜已经深了... : (magnetic) The night is already deep...
  • (东北话) 哎呀妈呀... : (Northeastern Mandarin) Oh my goodness...
  • (粤语) 呢个真係好正啊... : (Cantonese) This one is really great...
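The tag format above is easy to get subtly wrong (for example, placing the tag after the text), so a small formatter can help. The helper names below are hypothetical; only the `(style1 style2)text` layout and the three singing identifiers come from the section above.

```python
SINGING_TAGS = ("唱歌", "sing", "singing")


def with_style(text, *styles):
    """Prefix text with an overall style tag such as (style1 style2)."""
    if not styles:
        return text
    return f"({' '.join(styles)}){text}"


def as_song(lyrics, tag="singing"):
    """Prefix complete lyrics with one of the accepted singing tags."""
    if tag not in SINGING_TAGS:
        raise ValueError(f"unsupported singing tag: {tag}")
    return f"({tag}){lyrics}"
```

For instance, `with_style("再让我睡五分钟...", "慵懒")` yields the text to pass to `--text` for a languid delivery.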

Voice Description Writing

When using mimo-v2.5-tts-voicedesign to create a voice from a text description, write a short voice description that guides the model toward the qualities you expect. It is passed through the same --context parameter as natural language control.

A voice description is the voice's ID card. Describe only what the voice itself sounds like: no scene, no actions, and nothing about what will be said this time. The more concise it is, the easier it is to reuse across scenarios.

Required items:
  1. Identity anchor: age group + gender. This determines the fundamental frequency and is the basis for every other trait.
  2. Voice texture: breath flow, resonance placement, articulation, and underlying timbre. Use concrete verbs or metaphors rather than stacking adjectives.
  3. Pace and rhythm: steady / fast / slow / erratic / rapid-fire with marked pauses.
  4. Emotional baseline: the voice's default state (high-spirited / relaxed / soft and warm / restrained).

Optional items (strongly recommended):
  1. Style or identity tag: a passing mention of a profession or style that the model grasps immediately. Examples: auctioneer style / food critic style / news anchor style / courtroom statement style.
  2. Distinctive quirk: a habit that makes the voice memorable on first hearing. Examples: occasionally closes the eyes and inhales / a tremor at the end of words / laughs with a muffled chest chuckle.

Hard constraints:
  • One or two sentences of plain description, with no paragraphs and no bullet points.
  • No scenes ("at a press conference", "on the night shift").
  • No actions ("she walks onto the stage").
  • No real actors or IP character names (copyright issues and poor generalization).
  • Default to Mandarin or English; add a dialect only when explicitly needed.

Voice description examples:
Middle-aged male, extremely fast rhythm, high-spirited, auctioneer style. Rapid-fire articulation with strong cadence and a sense of urgency.

Young male, esports commentator style, extremely fast and fluent, with audible breaths and explosive emphasis; the voice rises and sharpens when excited.

Middle-aged male, courtroom statement style, steady and fairly formal voice, precise articulation with a beat on every word; restrained emotion, but each word carries weight.
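The structural constraints (one or two sentences, no paragraphs, no list items) can be checked mechanically. The sketch below is a rough lint under simple assumptions: sentences are split on end punctuation only, and the function name and messages are made up for illustration. It cannot judge the content rules, such as avoiding scenes or actions.

```python
import re


def lint_voice_description(description):
    """Return a list of structural problems found in a voice description."""
    problems = []
    # Count sentences by splitting on Western and Chinese end punctuation.
    sentences = [s for s in re.split(r"[.。!?!?]+", description) if s.strip()]
    if len(sentences) > 2:
        problems.append("more than two sentences")
    if "\n" in description.strip():
        problems.append("contains line breaks")
    # Lines starting with a bullet or a number look like list items.
    if re.search(r"^\s*([-*•]|\d+[.)])\s", description, flags=re.MULTILINE):
        problems.append("contains list items")
    return problems
```

An empty list means the description passes these structural checks only.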

Content and Tag Enhancement

When the user does not supply the text to be spoken, write it yourself to fit the situation (role-play, creative writing, and so on).
When the user supplies only the text, with no detail about tone or emotion, adjust the text lightly and insert suitable tags as the situation requires.

Hard rules:
  1. The emotion of the text must fit the voice's baseline. A soft-spoken grandmother does not roar; an auctioneer does not deliver a late-night monologue.
  2. Length: 2–5 sentences in a single paragraph. With less, the model cannot settle into a rhythm and subtle voice traits do not come through.
  3. Tags are seasoning, not the main dish. Pause where a pause belongs, laugh where a laugh belongs, and do not pile tags up. Use at most one tag per sentence.
  4. Punctuation is part of the performance: ellipsis = pause / dash = interruption or a drawn-out sound / ALL CAPS = emphasis.
  5. Tag language follows the text. Chinese text takes Chinese tags and English text takes English tags; do not mix them.

Recommended tags for Chinese text (written as [标签], shown here with English glosses):

| Category | Common tags |
| --- | --- |
| Pacing | [停顿] (pause), [长停顿] (long pause), [急促] (hurried), [拖音] (drawn out), [语速加快] (speed up), [语速放缓] (slow down) |
| Emotion | [轻声] (softly), [低语] (whisper), [叹气] (sigh), [吸气] (inhale), [哽咽] (choked up), [强调] (emphasis), [笑] (laugh), [爽朗大笑] (hearty laugh) |
| Other | [欲言又止] (starting to speak, then stopping), [碎碎念] (muttering), [沉默片刻] (brief silence) |

Recommended tags for English text (written as [tag]):

| Category | Common tags |
| --- | --- |
| Pacing | [pause], [long pause], [fast], [slow], [drawn out] |
| Emotion | [whispering], [sighs], [inhale], [choked up], [emphasis], [laughs], [hearty laugh] |

Paired examples (translated from Chinese, using the English tag set):
Voice: Middle-aged female, food critic style, soft and infectious tone, describing food with exaggerated relish and rapture, occasionally closing her eyes to inhale.

Text:
[inhale] Mmm—one bite of this… [pause] and it wraps around the whole tip of my tongue. [drawn out] First the sweetness of caramel, then [emphasis] just a touch [pause] of perfectly judged salt, slowly unfolding in the aftertaste. [sighs] Honestly, it has been [whispering] so long [pause] since I last had a dessert made with this much care.

Voice: Elderly female, soft and slow voice, with a slight tremor and a kindly warmth, like a grandmother telling a bedtime story to a small grandchild.

Text:
[whispering] Come, lie down. [pause] Grandma is going to tell you a story. [long pause] Once upon a time, at the foot of a mountain, there was a little wooden cabin… [drawn out] and in it lived a little rabbit. [sighs] Now this rabbit loved to play, and every day it ran outside the moment the sun came up. [pause] Then one day—[emphasis] it rained. [whispering] It got lost, and no matter what it did it could not find the way home…
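Hard rules 3 and 5 lend themselves to an automated check. The sketch below reads "at most one tag per sentence" strictly and detects Chinese by the presence of CJK characters; both are simplifying assumptions, the function name is hypothetical, and a deliberately expressive passage may well break the one-tag limit on purpose.

```python
import re

TAG = re.compile(r"\[([^\[\]]+)\]")
CJK = re.compile(r"[\u4e00-\u9fff]")


def lint_tags(text):
    """Report tag-language mismatches and sentences carrying several tags."""
    problems = []
    # Rule 5: the tag language should match the language of the body text.
    body_is_chinese = bool(CJK.search(TAG.sub("", text)))
    for tag in TAG.findall(text):
        if bool(CJK.search(tag)) != body_is_chinese:
            problems.append(f"tag language mismatch: [{tag}]")
    # Rule 3: split after sentence-ending punctuation and count tags.
    for sentence in re.split(r"(?<=[.!?。!?…])", text):
        if len(TAG.findall(sentence)) > 1:
            problems.append(
                f"more than one tag in a sentence: {sentence.strip()}")
    return problems
```

An empty result means neither rule was violated under these assumptions.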

Python Script Usage

Each of the three models has its own script. Choose the one that matches your need:

| Script | Model | Use case |
| --- | --- | --- |
| mimo_tts.py | mimo-v2.5-tts | Speech synthesis with preset voices |
| mimo_tts_voicedesign.py | mimo-v2.5-tts-voicedesign | Custom voice from a text description |
| mimo_tts_voiceclone.py | mimo-v2.5-tts-voiceclone | Voice replicated from an audio sample |

Preset Voice Speech Synthesis (mimo_tts.py)

bash
# --text: "Hello, the weather is really nice today."
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "你好,今天天气真不错。" \
  --voice "冰糖"
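When the script is driven from Python rather than a shell, building the argument list explicitly avoids quoting problems with Chinese text and brackets. This is a sketch: `build_tts_command` is a hypothetical wrapper, and it uses only the `--text`, `--voice`, `--context`, and `--output` flags that appear in this document's examples.

```python
import os


def build_tts_command(text, voice, context=None, output=None,
                      skills_path=None):
    """Build the argv list for mimo_tts.py without running it."""
    if skills_path is None:
        # Falls back to the SKILLS_PATH environment variable.
        skills_path = os.environ.get("SKILLS_PATH", "")
    script = f"{skills_path}/mimo-v2-5-tts/scripts/mimo_tts.py"
    command = ["python3", script, "--text", text, "--voice", voice]
    if context:
        command += ["--context", context]
    if output:
        command += ["--output", output]
    return command
```

The list can then be handed to `subprocess.run(command, check=True)`; because no shell is involved, the text needs no escaping.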

Preset Voice + Natural Language Style Control

bash
# --context: "Speak in a gentle tone, slightly slower"
# --text: "It's okay, take your time, I'll wait for you."
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --context "用温柔的语气,语速稍慢" \
  --text "没关系,慢慢来,我等你。" \
  --voice "冰糖" \
  --output tmp/mimo-v2.5-tts/comfort.wav

Preset Voice + Audio Tag Control

bash
# --text: "(nervous, deep breath) Phew... Calm down, calm down. It's just an interview... (quietly) Oh no, is my tie crooked?"
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "(紧张,深呼吸)呼……冷静,冷静。不就是一个面试吗……(小声)哎呀,领带歪没歪?" \
  --voice "冰糖" \
  --output tmp/mimo-v2.5-tts/interview.wav

Voice Design (mimo_tts_voicedesign.py)

bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts_voicedesign.py \
  --context "Give me a young male tone." \
  --text "Yes, I had a sandwich." \
  --output tmp/mimo-v2.5-tts/voicedesign.wav

Tip: Save the audio produced by Voice Design and pass it later as the --voice-file input of mimo_tts_voiceclone.py to get a "design → clone" workflow: first generate a voice you are happy with from a text description (--context), then use that audio as the sample to clone the voice onto other text. When cloning, you can still add Director Mode instructions through --context.

Voice Cloning (mimo_tts_voiceclone.py)

Note: The voice sample must not exceed 10 MB after Base64 encoding. Only mp3 and wav formats are supported.

bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts_voiceclone.py \
  --voice-file voice.mp3 \
  --text "Yes, I had a sandwich." \
  --output tmp/mimo-v2.5-tts/voiceclone.wav
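Because the limit applies to the Base64-encoded size, a file well under 10 MB on disk can still be rejected: Base64 expands data by about one third. The check below is a sketch that assumes "10 MB" means 10 × 1024 × 1024 bytes and standard padded Base64; the helper name and messages are hypothetical.

```python
from pathlib import Path

MAX_BASE64_BYTES = 10 * 1024 * 1024  # assumption: 10 MB read as 10 MiB
SUPPORTED_SUFFIXES = {".mp3", ".wav"}


def check_voice_sample(filename, data):
    """Return problems that would make a voice sample unusable for cloning."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported format: {suffix}")
    # Padded Base64 emits 4 output bytes for every 3 input bytes.
    encoded_length = 4 * ((len(data) + 2) // 3)
    if encoded_length > MAX_BASE64_BYTES:
        problems.append("Base64 size exceeds 10 MB")
    return problems
```

Under these assumptions the raw file has to stay below roughly 7.5 MB.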

Voice Cloning + Director Mode

bash
# --context: "Speak in a gentle tone, slightly slower"
# --text: "It's okay, take your time, I'll wait for you."
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts_voiceclone.py \
  --voice-file voice.mp3 \
  --context "用温柔的语气,语速稍慢" \
  --text "没关系,慢慢来,我等你。" \
  --output tmp/mimo-v2.5-tts/voiceclone_director.wav

Singing

Note: Provide complete lyrics. Fragmentary lyrics lead to off-key singing and poor results.

bash
# --text: Chinese lyrics prefixed with the (唱歌) singing tag
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "(唱歌)原谅我这一生不羁放纵爱自由,也会怕有一天会跌倒,Oh no。背弃了理想,谁人都可以,哪会怕有一天只你共我。" \
  --voice "冰糖" \
  --output tmp/mimo-v2.5-tts/singing.wav

English + Audio Tags

bash
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "I just... (sighs deeply) I don't know anymore. (suddenly firm) But I won't give up!" \
  --voice "Mia" \
  --output tmp/mimo-v2.5-tts/english.wav

Long Text Processing

V2.5 handles long text well. In almost every scenario, generate the audio in a single pass without splitting the text by hand. Only when the text exceeds 2500 characters should you consider synthesizing it in segments and concatenating the results.

To synthesize in segments, split the text at natural boundaries (sentence ends or paragraphs), generate a separate wav for each segment, then concatenate them with ffmpeg:

bash
# Create the file list
echo "file '/tmp/mimo-v2.5-tts/part1.wav'" > /tmp/mimo-v2.5-tts/list.txt
echo "file '/tmp/mimo-v2.5-tts/part2.wav'" >> /tmp/mimo-v2.5-tts/list.txt

# Concatenate
ffmpeg -y -f concat -safe 0 -i /tmp/mimo-v2.5-tts/list.txt -c copy /tmp/mimo-v2.5-tts/combined.wav
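The splitting step and the file list can be sketched in Python. This is an illustration under stated assumptions: the function names are hypothetical, sentences are detected by end punctuation or newlines, and a single sentence longer than the limit is left whole rather than cut mid-sentence.

```python
import re


def split_text(text, limit=2500):
    """Group whole sentences into chunks of at most `limit` characters."""
    # Split after Chinese or Western sentence-ending punctuation or newlines.
    sentences = [s for s in re.split(r"(?<=[。!?.!?\n])", text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > limit:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


def concat_list(paths):
    """Render the list file read by ffmpeg's concat demuxer."""
    return "".join(f"file '{path}'\n" for path in paths)
```

Each chunk would be synthesized to its own wav, and the output of `concat_list` written to the list.txt that the ffmpeg command reads.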

---

Feishu Voice Message Sending

Note: Use this feature only when the user needs the TTS-generated audio sent to Feishu.

It sends a TTS-generated WAV file to Feishu in a single command: generate → transcode → upload → send.

Why not use the message tool: a Feishu voice message must be sent through the /im/v1/messages interface with msg_type: audio, and the audio has to be uploaded first to obtain a file_key. In many tools, the message tool (asVoice: true) does not implement this logic for the Feishu channel and sends the audio as an ordinary attachment, so the user sees a file rather than a voice message. feishu_send_audio.sh performs the complete upload + send flow and cannot be replaced by the message tool.

Environment Dependencies

| Environment variable | Source | Description |
| --- | --- | --- |
| FEISHU_APP_ID | Feishu Open Platform | App ID of the application |
| FEISHU_APP_SECRET | Feishu Open Platform | App Secret of the application |

| Dependency | Description | Required |
| --- | --- | --- |
| ffmpeg | Converts WAV to Opus and reads the audio duration | Yes |
| curl | Calls the Feishu API | Yes |

Usage

Send to Private Chat (open_id)

bash
# 1. Generate the voice (--text: "Okay, coming right up!")
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "好的,马上就好!" \
  --voice "冰糖" \
  --output /tmp/mimo-v2.5-tts/voice.wav

# 2. Send to a Feishu private chat (receive_id_type is open_id, receive_id is the user ID)
bash $SKILLS_PATH/mimo-v2-5-tts/scripts/feishu_send_audio.sh /tmp/mimo-v2.5-tts/voice.wav open_id ou_xxxxxx

Send to Group Chat (chat_id)

bash
# 1. Generate the voice (--text: "Hello everyone, the weather is really nice today!")
python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py \
  --text "大家好,今天天气真不错!" \
  --voice "冰糖" \
  --output /tmp/mimo-v2.5-tts/voice.wav

# 2. Send to a Feishu group chat (receive_id_type is chat_id, receive_id is the group ID)
bash $SKILLS_PATH/mimo-v2-5-tts/scripts/feishu_send_audio.sh /tmp/mimo-v2.5-tts/voice.wav chat_id oc_xxxxxx

Internal flow of `feishu_send_audio.sh`: `wav → opus (ffmpeg)` → `obtain tenant_access_token` → `upload audio file` → `send audio message`.
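The last step of that flow, the audio message itself, can be sketched as a request body. This reflects the Feishu message API as commonly documented (the `content` field is a JSON string carrying the `file_key`), but it is not taken from `feishu_send_audio.sh`, whose internals are not shown here. The helper names are hypothetical, and inferring the ID type from the `ou_` / `oc_` prefixes is an assumption based on the placeholder IDs above.

```python
import json


def receive_id_type_for(receive_id):
    """Guess the receive_id_type from the ID prefix (assumed convention)."""
    if receive_id.startswith("ou_"):
        return "open_id"
    if receive_id.startswith("oc_"):
        return "chat_id"
    raise ValueError(f"cannot infer receive_id_type for {receive_id!r}")


def build_audio_message(receive_id, file_key):
    """Build the JSON body for an audio message sent via /im/v1/messages."""
    return {
        "receive_id": receive_id,
        "msg_type": "audio",
        # content is a JSON-encoded string, not a nested object.
        "content": json.dumps({"file_key": file_key}),
    }
```

The `file_key` comes from the upload step, and the `receive_id_type` is sent as a query parameter alongside this body.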