Qwen3-TTS
Build text-to-speech applications using Qwen3-TTS from Alibaba Qwen. Reference the local repository at D:\code\qwen3-tts for source code and examples.

Quick Reference
| Task | Model | Method |
|---|---|---|
| Custom voice with preset speakers | CustomVoice | generate_custom_voice() |
| Design new voice via description | VoiceDesign | generate_voice_design() |
| Clone voice from audio sample | Base | generate_voice_clone() |
| Encode/decode audio | Tokenizer | encode() / decode() |
Environment Setup

```bash
# Create fresh environment
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

# Install package
pip install -U qwen-tts

# Optional: FlashAttention 2 for reduced GPU memory
pip install -U flash-attn --no-build-isolation
```
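Since FlashAttention 2 is optional, loading code can fall back gracefully when it is absent. A minimal stdlib check; the `sdpa` fallback name follows the Transformers convention and is an assumption here:

```python
from importlib.util import find_spec

def pick_attn_implementation() -> str:
    # Use FlashAttention 2 only if the flash-attn package is importable;
    # otherwise fall back to PyTorch scaled-dot-product attention ("sdpa").
    return "flash_attention_2" if find_spec("flash_attn") is not None else "sdpa"
```

Pass the result as `attn_implementation=pick_attn_implementation()` when loading a model.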
Available Models

| Model | Features |
|---|---|
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 9 preset speakers, instruction control |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | Create voices from natural language descriptions |
| Qwen3-TTS-12Hz-1.7B-Base | Voice cloning, fine-tuning base |
| | Smaller custom voice model |
| | Smaller base model for cloning/fine-tuning |
| Qwen3-TTS-Tokenizer-12Hz | Audio encoder/decoder |
Task Workflows

1. Custom Voice Generation
Use preset speakers with optional style instructions.

```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```

Single generation
```python
wavs, sr = model.generate_custom_voice(
    text="Hello, how are you today?",
    language="English",  # Or "Auto" for auto-detection
    speaker="Ryan",
    instruct="Speak with enthusiasm",  # Optional style control
)
sf.write("output.wav", wavs[0], sr)
```
Batch generation
```python
wavs, sr = model.generate_custom_voice(
    text=["First sentence.", "Second sentence."],
    language=["English", "English"],
    speaker=["Ryan", "Aiden"],
    instruct=["Happy tone", "Calm tone"],
)
```

**Available Speakers:**

| Speaker | Description | Native Language |
|---------|-------------|-----------------|
| Vivian | Bright, edgy young female | Chinese |
| Serena | Warm, gentle young female | Chinese |
| Uncle_Fu | Low, mellow mature male | Chinese |
| Dylan | Youthful Beijing male | Chinese (Beijing) |
| Eric | Lively Chengdu male | Chinese (Sichuan) |
| Ryan | Dynamic male with rhythmic drive | English |
| Aiden | Sunny American male | English |
| Ono_Anna | Playful Japanese female | Japanese |
| Sohee | Warm Korean female | Korean |
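In batch calls, `text`, `language`, `speaker`, and `instruct` are parallel lists. A small hypothetical helper (not part of qwen-tts) that expands scalar settings into parallel lists and validates lengths before a batch call:

```python
def batch_args(texts, language="Auto", speaker="Ryan", instruct=None):
    """Expand scalar settings to parallel lists matching len(texts)."""
    n = len(texts)
    expand = lambda v: v if isinstance(v, list) else [v] * n
    args = {
        "text": texts,
        "language": expand(language),
        "speaker": expand(speaker),
    }
    if instruct is not None:
        args["instruct"] = expand(instruct)
    for name, values in args.items():
        if len(values) != n:
            raise ValueError(f"{name} has {len(values)} entries, expected {n}")
    return args
```

Usage: `model.generate_custom_voice(**batch_args(["First.", "Second."], speaker=["Ryan", "Aiden"]))`.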
2. Voice Design
Create new voices from natural language descriptions.

```python
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
wavs, sr = model.generate_voice_design(
    text="Welcome to our presentation today.",
    language="English",
    instruct="Professional male voice, warm baritone, confident and clear",
)
sf.write("designed_voice.wav", wavs[0], sr)
```

3. Voice Cloning
Clone a voice from a reference audio sample (3+ seconds recommended).

```python
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```

Direct cloning
```python
wavs, sr = model.generate_voice_clone(
    text="This is the cloned voice speaking.",
    language="English",
    ref_audio="path/to/reference.wav",  # Or URL or (numpy_array, sr) tuple
    ref_text="Transcript of the reference audio.",
)
sf.write("cloned.wav", wavs[0], sr)
```
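Per the comment above, `ref_audio` can be a file path, a URL, or a `(waveform, sample_rate)` tuple. A tiny illustrative classifier (hypothetical helper, not part of qwen-tts) that mirrors this contract, e.g. for validating user input before calling the model:

```python
def classify_ref_audio(ref):
    # Distinguish the three accepted ref_audio forms.
    if isinstance(ref, tuple) and len(ref) == 2:
        return "array"  # (waveform, sample_rate)
    if isinstance(ref, str):
        return "url" if ref.startswith(("http://", "https://")) else "path"
    raise TypeError(f"unsupported ref_audio type: {type(ref).__name__}")
```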
Reusable clone prompt (for multiple generations)
```python
prompt = model.create_voice_clone_prompt(
    ref_audio="path/to/reference.wav",
    ref_text="Transcript of the reference audio.",
)
wavs, sr = model.generate_voice_clone(
    text="Another sentence with the same voice.",
    language="English",
    voice_clone_prompt=prompt,
)
```
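If the same reference voice is reused across many calls, the prompt can be memoized rather than rebuilt each time. A minimal sketch, where `make_prompt` stands in for `model.create_voice_clone_prompt`:

```python
def prompt_cache(make_prompt):
    """Memoize clone prompts keyed on (ref_audio, ref_text)."""
    cache = {}
    def get(ref_audio, ref_text):
        key = (ref_audio, ref_text)
        if key not in cache:
            cache[key] = make_prompt(ref_audio=ref_audio, ref_text=ref_text)
        return cache[key]
    return get
```

Usage: `cached = prompt_cache(model.create_voice_clone_prompt)`, then `cached("ref.wav", "Transcript.")` returns the same prompt object on repeat calls.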
4. Voice Design + Clone Workflow
Design a voice, then reuse it across multiple generations.

Step 1: Design the voice
```python
design_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
ref_text = "Sample text for the reference audio."
ref_wavs, sr = design_model.generate_voice_design(
    text=ref_text,
    language="English",
    instruct="Young energetic male, tenor range",
)
```
Step 2: Create reusable clone prompt
```python
clone_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
prompt = clone_model.create_voice_clone_prompt(
    ref_audio=(ref_wavs[0], sr),
    ref_text=ref_text,
)
```
Step 3: Generate multiple outputs with consistent voice
```python
for i, sentence in enumerate(["First line.", "Second line.", "Third line."]):
    wavs, sr = clone_model.generate_voice_clone(
        text=sentence,
        language="English",
        voice_clone_prompt=prompt,
    )
    sf.write(f"line_{i}.wav", wavs[0], sr)  # Save each output
```
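When a script is generated line by line as above, the per-sentence waveforms often need joining into one track. A plain-Python sketch that inserts a short silence between chunks (real outputs are NumPy arrays; plain lists are used here to keep the example dependency-free):

```python
def concat_with_silence(chunks, sr, gap_s=0.3):
    """Join waveform chunks (sequences of samples) with gap_s seconds of silence."""
    gap = [0.0] * int(sr * gap_s)
    out = []
    for i, chunk in enumerate(chunks):
        if i:
            out.extend(gap)
        out.extend(chunk)
    return out
```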
5. Audio Tokenization
Encode and decode audio for transport or processing.

```python
from qwen_tts import Qwen3TTSTokenizer
import soundfile as sf

tokenizer = Qwen3TTSTokenizer.from_pretrained(
    "Qwen/Qwen3-TTS-Tokenizer-12Hz",
    device_map="cuda:0",
)
```

Encode audio (accepts path, URL, numpy array, or base64)
```python
enc = tokenizer.encode("path/to/audio.wav")
```
Decode back to waveform
```python
wavs, sr = tokenizer.decode(enc)
sf.write("reconstructed.wav", wavs[0], sr)
```
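Since the encoder also accepts base64 input, audio can be shipped over JSON or similar text transports before tokenization. A stdlib sketch of the base64 leg (the `tokenizer.encode` call itself is not exercised here):

```python
import base64

def wav_to_base64(path: str) -> str:
    # Read raw WAV bytes and encode as an ASCII-safe base64 string.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def base64_to_wav_bytes(b64: str) -> bytes:
    # Recover the original bytes on the receiving side.
    return base64.b64decode(b64)
```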
Generation Parameters
Common parameters for all `generate_*` methods:

```python
wavs, sr = model.generate_custom_voice(
    text="...",
    language="Auto",
    speaker="Ryan",
    max_new_tokens=2048,
    do_sample=True,
    top_k=50,
    top_p=1.0,
    temperature=0.9,
    repetition_penalty=1.05,
)
```
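Because `max_new_tokens` caps output length, very long inputs are often split into sentences and generated in batches. A rough stdlib splitter; the character budget is an arbitrary illustration, not a qwen-tts limit:

```python
import re

def split_text(text, max_chars=200):
    """Greedily pack sentences into chunks of at most max_chars characters."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, cur = [], ""
    for s in sentences:
        if cur and len(cur) + 1 + len(s) > max_chars:
            chunks.append(cur)
            cur = s
        else:
            cur = f"{cur} {s}".strip()
    if cur:
        chunks.append(cur)
    return chunks
```

Each chunk can then be passed to a batch `generate_*` call as shown above.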
Web UI Demo
Launch local Gradio demo:

```bash
# CustomVoice demo
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 8000

# VoiceDesign demo
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --ip 0.0.0.0 --port 8000

# Base (voice clone) demo - requires HTTPS for microphone
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000 \
  --ssl-certfile cert.pem --ssl-keyfile key.pem --no-ssl-verify
```
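For the HTTPS requirement above, a self-signed certificate is enough for local testing (standard OpenSSL invocation; browsers will still warn about the untrusted certificate):

```shell
# Generate a self-signed cert/key pair valid for one year, no passphrase
openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem \
  -days 365 -nodes -subj "/CN=localhost"
```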
Supported Languages
Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

Pass `language="Auto"` for automatic detection, or specify the language explicitly for best quality.

References
- Fine-tuning guide: See references/finetuning.md for training custom speakers
- API details: See references/api-reference.md for complete method signatures
- Local repo: D:\code\qwen3-tts contains source code and examples