Qwen3-TTS
Build text-to-speech applications using Qwen3-TTS from Alibaba Qwen. Reference the local repository at D:\code\qwen3-tts for source code and examples.

Quick Reference

| Task | Model | Method |
|------|-------|--------|
| Custom voice with preset speakers | CustomVoice | `generate_custom_voice()` |
| Design new voice via description | VoiceDesign | `generate_voice_design()` |
| Clone voice from audio sample | Base | `generate_voice_clone()` |
| Encode/decode audio | Tokenizer | `encode()` / `decode()` |

Environment Setup

```bash
# Create fresh environment
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

# Install package
pip install -U qwen-tts

# Optional: FlashAttention 2 for reduced GPU memory
pip install -U flash-attn --no-build-isolation
```

Available Models

| Model | Features |
|-------|----------|
| Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice | 9 preset speakers, instruction control |
| Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign | Create voices from natural language descriptions |
| Qwen/Qwen3-TTS-12Hz-1.7B-Base | Voice cloning, fine-tuning base |
| Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice | Smaller custom voice model |
| Qwen/Qwen3-TTS-12Hz-0.6B-Base | Smaller base model for cloning/fine-tuning |
| Qwen/Qwen3-TTS-Tokenizer-12Hz | Audio encoder/decoder |

Task Workflows


1. Custom Voice Generation

Use preset speakers with optional style instructions.

```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```

Single generation

```python
wavs, sr = model.generate_custom_voice(
    text="Hello, how are you today?",
    language="English",  # Or "Auto" for auto-detection
    speaker="Ryan",
    instruct="Speak with enthusiasm",  # Optional style control
)
sf.write("output.wav", wavs[0], sr)
```

Batch generation

```python
wavs, sr = model.generate_custom_voice(
    text=["First sentence.", "Second sentence."],
    language=["English", "English"],
    speaker=["Ryan", "Aiden"],
    instruct=["Happy tone", "Calm tone"],
)
```
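The batch form passes parallel lists, which must all have the same length. A small pre-flight guard can catch mismatches before the model call; this is a sketch, and `check_batch` is our own helper, not part of the qwen-tts API:

```python
def check_batch(text, language, speaker, instruct=None):
    """Raise ValueError if the parallel batch lists have mismatched lengths."""
    fields = {"text": text, "language": language, "speaker": speaker}
    if instruct is not None:
        fields["instruct"] = instruct
    lengths = {name: len(values) for name, values in fields.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Batch length mismatch: {lengths}")
    return lengths["text"]

# Three utterances with consistent list lengths
n = check_batch(
    text=["A.", "B.", "C."],
    language=["English"] * 3,
    speaker=["Ryan", "Aiden", "Ryan"],
)
```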

**Available Speakers:**

| Speaker | Description | Native Language |
|---------|-------------|-----------------|
| Vivian | Bright, edgy young female | Chinese |
| Serena | Warm, gentle young female | Chinese |
| Uncle_Fu | Low, mellow mature male | Chinese |
| Dylan | Youthful Beijing male | Chinese (Beijing) |
| Eric | Lively Chengdu male | Chinese (Sichuan) |
| Ryan | Dynamic male with rhythmic drive | English |
| Aiden | Sunny American male | English |
| Ono_Anna | Playful Japanese female | Japanese |
| Sohee | Warm Korean female | Korean |

2. Voice Design

Create new voices from natural language descriptions.

```python
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_voice_design(
    text="Welcome to our presentation today.",
    language="English",
    instruct="Professional male voice, warm baritone, confident and clear",
)
sf.write("designed_voice.wav", wavs[0], sr)
```

3. Voice Cloning

Clone a voice from a reference audio sample (3+ seconds recommended).

```python
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```

Direct cloning

```python
wavs, sr = model.generate_voice_clone(
    text="This is the cloned voice speaking.",
    language="English",
    ref_audio="path/to/reference.wav",  # Or URL or (numpy_array, sr) tuple
    ref_text="Transcript of the reference audio.",
)
sf.write("cloned.wav", wavs[0], sr)
```
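Because the reference sample should be at least ~3 seconds, it can be worth checking its duration before cloning. A minimal sketch using numpy; the helper is ours, and in practice the samples and sample rate would come from `soundfile.read`:

```python
import numpy as np

MIN_REF_SECONDS = 3.0  # recommended minimum reference length

def ref_duration_ok(samples: np.ndarray, sr: int,
                    min_seconds: float = MIN_REF_SECONDS) -> bool:
    """Return True if the reference audio is long enough for cloning."""
    duration = samples.shape[-1] / sr
    return duration >= min_seconds

# 4 seconds at 16 kHz passes; 1 second does not
assert ref_duration_ok(np.zeros(4 * 16000), 16000)
assert not ref_duration_ok(np.zeros(16000), 16000)
```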

Reusable clone prompt (for multiple generations)

```python
prompt = model.create_voice_clone_prompt(
    ref_audio="path/to/reference.wav",
    ref_text="Transcript of the reference audio.",
)
wavs, sr = model.generate_voice_clone(
    text="Another sentence with the same voice.",
    language="English",
    voice_clone_prompt=prompt,
)
```

4. Voice Design + Clone Workflow

Design a voice, then reuse it across multiple generations.

Step 1: Design the voice

```python
design_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_text = "Sample text for the reference audio."
ref_wavs, sr = design_model.generate_voice_design(
    text=ref_text,
    language="English",
    instruct="Young energetic male, tenor range",
)
```

Step 2: Create reusable clone prompt

```python
clone_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

prompt = clone_model.create_voice_clone_prompt(
    ref_audio=(ref_wavs[0], sr),
    ref_text=ref_text,
)
```

Step 3: Generate multiple outputs with consistent voice

```python
for sentence in ["First line.", "Second line.", "Third line."]:
    wavs, sr = clone_model.generate_voice_clone(
        text=sentence,
        language="English",
        voice_clone_prompt=prompt,
    )
```

5. Audio Tokenization

Encode and decode audio for transport or processing.

```python
from qwen_tts import Qwen3TTSTokenizer
import soundfile as sf

tokenizer = Qwen3TTSTokenizer.from_pretrained(
    "Qwen/Qwen3-TTS-Tokenizer-12Hz",
    device_map="cuda:0",
)
```

Encode audio (accepts path, URL, numpy array, or base64)

```python
enc = tokenizer.encode("path/to/audio.wav")
```

Decode back to waveform

```python
wavs, sr = tokenizer.decode(enc)
sf.write("reconstructed.wav", wavs[0], sr)
```
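The "12Hz" in the tokenizer name suggests roughly 12 codec frames per second of audio, so the size of an encoded clip can be estimated from its duration. This is an inference from the model name, not a documented guarantee:

```python
import math

FRAME_RATE_HZ = 12  # assumed from "Tokenizer-12Hz" in the model name

def estimated_frames(duration_seconds: float) -> int:
    """Rough codec-frame count for a clip, assuming ~12 frames/second."""
    return math.ceil(duration_seconds * FRAME_RATE_HZ)

# A 10-second clip would come out to about 120 codec frames
assert estimated_frames(10.0) == 120
```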

Generation Parameters

Common parameters for all `generate_*` methods:

```python
wavs, sr = model.generate_custom_voice(
    text="...",
    language="Auto",
    speaker="Ryan",
    max_new_tokens=2048,
    do_sample=True,
    top_k=50,
    top_p=1.0,
    temperature=0.9,
    repetition_penalty=1.05,
)
```

Web UI Demo

Launch a local Gradio demo:

CustomVoice demo

```bash
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 8000
```

VoiceDesign demo

```bash
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --ip 0.0.0.0 --port 8000
```

Base (voice clone) demo - requires HTTPS for microphone

```bash
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000 \
  --ssl-certfile cert.pem --ssl-keyfile key.pem --no-ssl-verify
```

Supported Languages

Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

Pass `language="Auto"` for automatic detection, or specify the language explicitly for best quality.
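Since an explicit language gives the best quality, user input can be normalized against the supported set before calling the model, falling back to auto-detection otherwise. A sketch; the helper and its fallback behavior are ours:

```python
SUPPORTED_LANGUAGES = {
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
}

def resolve_language(name: str) -> str:
    """Map user input to a supported language name, else fall back to 'Auto'."""
    candidate = name.strip().capitalize()
    return candidate if candidate in SUPPORTED_LANGUAGES else "Auto"

assert resolve_language("english") == "English"
assert resolve_language("Klingon") == "Auto"  # unsupported -> auto-detect
```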

References

  • Fine-tuning guide: see references/finetuning.md for training custom speakers
  • API details: see references/api-reference.md for complete method signatures
  • Local repo: D:\code\qwen3-tts contains source code and examples