comfyui-voice-pipeline
ComfyUI Voice Pipeline
Creates character voices through TTS/voice cloning and synchronizes them with generated video.
Voice Generation Decision Tree
VOICE REQUEST
|
|-- Have reference audio of target voice?
| |-- Yes (5+ seconds) → Chatterbox (MIT, paralinguistic tags)
| |-- Yes (10-15 seconds) → F5-TTS (fastest zero-shot)
| |-- Yes (10+ minutes) → RVC training (highest fidelity)
| |-- Yes (any length, budget) → ElevenLabs (production quality)
|
|-- No reference audio?
| |-- Need emotion control → IndexTTS-2 (8-emotion vectors)
| |-- Need multi-language → TTS Audio Suite (23 languages)
| |-- Need voice design → ElevenLabs Voice Design (describe voice)
| |-- Quick prototype → Any TTS with default voice
|
|-- Need multi-speaker dialog?
| |-- Chatterbox (4 voices) or TTS Audio Suite (character switching)
|
|-- Need lip-sync?
| |-- Best accuracy → Wav2Lip + CodeFormer
| |-- Need head movement → SadTalker
| |-- Full expression control → LivePortrait
| |-- Unlimited length → InfiniteTalk
Tool Reference
Chatterbox (Recommended Open-Source)
Strengths: MIT license, preferred over ElevenLabs in 63.8% of blind tests, clones from a 5-second sample, emotion control, sub-200ms latency.
Paralinguistic tags:
[laugh] [chuckle] [sigh] [gasp] [cough] [clear throat]
[whisper] [excited] [sad] [angry] [surprised]

Key parameter: exaggeration (0.25-2.0) controls expressiveness.
Limit: 40-second generation cap. Split longer content.
F5-TTS
Strengths: Fastest zero-shot cloning, <15 second samples, MIT license, multi-language.
Requirements: Reference audio (.wav) must be paired with a matching transcription (.txt).
Languages: English, German, Spanish, French, Japanese, Hindi, Thai, Portuguese.
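The .wav/.txt pairing requirement is easy to verify before queueing jobs. A minimal sketch (the directory layout and helper name are assumptions for illustration, not part of F5-TTS itself):

```python
from pathlib import Path

def find_unpaired_references(ref_dir: str) -> list[str]:
    """Return reference .wav files that lack a matching .txt transcription."""
    refs = Path(ref_dir)
    return sorted(
        str(wav) for wav in refs.glob("*.wav")
        if not wav.with_suffix(".txt").exists()
    )
```

Run this over your reference folder and fix any reported files before generation; F5-TTS will otherwise fail or mispronounce the clone.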
TTS Audio Suite
Strengths: Unified multi-engine platform, 23 languages, character switching.
Special features:
- Character switching: [CharacterName] tags
- Language switching: [de:Alice], [fr:Bob]
- Pause control: [pause:1s]
- SRT timing sync
Integrates: F5-TTS, Chatterbox, Higgs Audio 2, VibeVoice, IndexTTS-2, RVC.
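To show how this markup composes, here is a sketch of a tokenizer for the tag syntax above. The event-tuple structure is an assumption for illustration; TTS Audio Suite parses these tags internally:

```python
import re

# [pause:1s] / [pause:0.5s] lines
PAUSE = re.compile(r"^\[pause:(?P<sec>[\d.]+)s\]$")
# [Name] or [de:Name] at line start; the rest of the line is the spoken text
LINE = re.compile(r"^\[(?:(?P<lang>[a-z]{2}):)?(?P<name>[A-Za-z]\w*)\]\s*(?P<text>.*)$")

def parse_script(script: str):
    """Turn a multi-speaker script into a list of events:
    ('say', speaker, lang, text) or ('pause', seconds)."""
    events = []
    for line in script.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        m = PAUSE.match(line)
        if m:
            events.append(("pause", float(m.group("sec"))))
            continue
        m = LINE.match(line)
        if m:
            # Inline tags like [excited] stay in the text for the engine
            events.append(("say", m.group("name"), m.group("lang") or "en",
                           m.group("text")))
    return events
```

Note that paralinguistic tags ([excited], [whisper]) are left inside the text so the engine can interpret them; only speaker, language, and pause tags are structural.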
IndexTTS-2
Strengths: 8-emotion vector control with per-segment parameters.
Emotions: happy, angry, sad, surprised, afraid, disgusted, calm, melancholic.
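The exact node interface varies by release; assuming the engine accepts a normalized 8-float vector in the emotion order listed above, a small helper could build it:

```python
# IndexTTS-2's documented emotion order; the vector/normalization convention
# here is an assumption for illustration, not the node's confirmed API.
EMOTIONS = ("happy", "angry", "sad", "surprised",
            "afraid", "disgusted", "calm", "melancholic")

def emotion_vector(**weights: float) -> list[float]:
    """Build a normalized 8-emotion vector.
    Unknown emotion names raise; all-zero input falls back to 'calm'."""
    unknown = set(weights) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    vec = [max(0.0, float(weights.get(e, 0.0))) for e in EMOTIONS]
    total = sum(vec)
    if total == 0:
        vec[EMOTIONS.index("calm")] = 1.0
        return vec
    return [v / total for v in vec]
```

Per-segment control then amounts to calling this with different weights for each script segment, e.g. emotion_vector(happy=0.7, surprised=0.3).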
RVC (Voice Conversion)
Use case: Train a model on target voice (10+ min audio), then convert any TTS output.
Pipeline:
Text → Any TTS → Base Audio → RVC Model → Character Voice

Training: 300-500 epochs, RMVPE feature extraction.
ElevenLabs (Commercial)
Tiers:
- Instant Clone: 1-minute sample, good quality
- Professional Clone: 30+ minutes (3h ideal), near-indistinguishable
- Voice Design: Describe voice in text (no sample needed)
Voice Profile Setup
For each character, establish a voice profile in projects/{project}/characters/{name}/profile.yaml:

```yaml
voice:
  cloned: true
  model: "chatterbox"
  sample_file: "references/voice_sample.wav"
  settings:
    exaggeration: 1.2
    default_emotion: "neutral"
  notes: "Warm, confident tone. Slight Italian-American undertones."
```

Script Preparation
Text Formatting for TTS
- Punctuation matters: Commas create pauses, periods create stops
- Phonetic hints: Spell unusual words phonetically if mispronounced
- Emotion cues: Use Chatterbox tags or split by emotion for IndexTTS-2
- Length: Split into 30-40 second segments for Chatterbox limit
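Splitting for the Chatterbox cap can be automated by estimating duration from word count and breaking only at sentence boundaries. A sketch; the ~2.5 words-per-second rate is an assumed average, tune it per voice:

```python
import re

def split_for_tts(text: str, max_seconds: float = 35.0,
                  words_per_second: float = 2.5) -> list[str]:
    """Split text into segments under max_seconds of estimated speech,
    breaking only at sentence boundaries (after . ! ?)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    max_words = max_seconds * words_per_second
    segments, current, count = [], [], 0
    for s in sentences:
        n = len(s.split())
        if current and count + n > max_words:
            # Flush the current segment before it exceeds the budget
            segments.append(" ".join(current))
            current, count = [], 0
        current.append(s)
        count += n
    if current:
        segments.append(" ".join(current))
    return segments
```

Targeting ~35 seconds rather than the full 40 leaves headroom for slow deliveries and paralinguistic tags, which lengthen output unpredictably.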
Multi-Speaker Script
[Sage] Hello! *laughs* I've been looking forward to this.
[pause:0.5s]
[Alex] [excited] Same here! Let's dive right in.
[Sage] [whisper] But first, I need to tell you something...

Audio Post-Processing
Requirements for Lip-Sync Input
- Sample rate: 16-24kHz (model dependent)
- Format: WAV (uncompressed)
- Mono channel
- Trim leading silence
- Add 0.2s trailing silence
- Normalize to -3dB peak
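The whole checklist can be collapsed into one ffmpeg invocation. A sketch that only builds the argument list (filenames are placeholders; loudnorm's TP=-3 target approximates the -3 dB peak requirement):

```python
def lipsync_prep_cmd(src: str, dst: str, rate: int = 24000) -> list[str]:
    """Build an ffmpeg command that converts to mono WAV at `rate` Hz,
    trims leading silence, appends 0.2 s of silence, and normalizes."""
    filters = ",".join([
        "silenceremove=start_periods=1:start_threshold=-50dB",  # trim lead-in
        "apad=pad_dur=0.2",                                     # trailing 0.2 s
        "loudnorm=I=-16:TP=-3",                                 # -3 dB true peak
    ])
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(rate),
            "-af", filters, dst]
```

Run it with subprocess.run(lipsync_prep_cmd("raw.wav", "prepped.wav"), check=True); use rate=16000 for models trained at 16 kHz.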
FFmpeg Processing
```bash
# Convert to mono 24kHz WAV, normalized
ffmpeg -i input.wav -ac 1 -ar 24000 -af "loudnorm=I=-16:TP=-3" output.wav

# Trim silence from start/end
ffmpeg -i input.wav -af "silenceremove=start_periods=1:start_threshold=-50dB,areverse,silenceremove=start_periods=1:start_threshold=-50dB,areverse" trimmed.wav

# Concatenate segments
ffmpeg -f concat -safe 0 -i filelist.txt -c copy combined.wav
```
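The concat command expects filelist.txt in ffmpeg's concat-demuxer format (one `file '<path>'` line per segment). A small generator, with paths as placeholders:

```python
from pathlib import Path

def write_concat_list(segments: list[str], out_path: str = "filelist.txt") -> str:
    """Write an ffmpeg concat-demuxer file list.
    Single quotes in paths are escaped with the demuxer's '\\'' idiom."""
    lines = []
    for seg in segments:
        p = str(Path(seg))
        lines.append("file '%s'" % p.replace("'", "'\\''"))
    Path(out_path).write_text("\n".join(lines) + "\n")
    return out_path
```

Keep segment order identical to the script order; the demuxer concatenates in list order with no re-encoding when -c copy is used.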
Lip-Sync Methods
Wav2Lip (Best Accuracy)
Settings:
wav2lip_model: "wav2lip_gan.pth" # Better than wav2lip.pth
face_detect_batch: 16
nosmooth: false
pad_bottom: 10

MUST post-process: CodeFormer (fidelity 0.7) after Wav2Lip output.
SadTalker (Head Movement)
Settings:
preprocess: "full" # Better for novel faces
enhancer: "gfpgan"
pose_style: 10-20 # Natural conversation range

LivePortrait (Expression Control)
Settings:
lip_zero: 0.03 # Reduces unnatural lip movement
stitching: true # Seamless face blending

Best for: Premium avatar creation, expression transfer from driving video.
LatentSync 1.6 (Newest, Highest Quality)
ByteDance model trained at 512x512 with TREPA modules for temporal consistency.
InfiniteTalk (Unlimited Length)
For videos longer than standard lip-sync limits. Integrates with Wan for joint generation.
Complete Talking Head Workflow
Pipeline A: Quick (Image → Talk)
1. [Text] → Chatterbox/F5-TTS → audio.wav
2. [Character Image] + audio.wav → SadTalker → video.mp4
3. video.mp4 → GFPGAN/CodeFormer → final.mp4

Time: ~2 minutes. Quality: Good.
Pipeline B: Quality (Image → Video → Lip-Sync)
1. [Text] → Chatterbox → audio.wav
2. [Character Image] → Wan I2V → base_video.mp4
Prompt: "person talking, slight head movement, indoor"
3. base_video.mp4 + audio.wav → Wav2Lip → lipsync.mp4
4. lipsync.mp4 → FaceDetailer batch → enhanced.mp4
5. enhanced.mp4 → Color correct + Deflicker → final.mp4

Time: ~10 minutes. Quality: Production.
Pipeline C: Premium (Expression Transfer)
1. Record driving video (actor performing lines)
2. [Text] → Voice Clone TTS → audio.wav
3. [Character Image] + driving.mp4 → LivePortrait → expression_video.mp4
4. expression_video.mp4 + audio.wav → Wav2Lip → lipsync.mp4
5. lipsync.mp4 → CodeFormer → final.mp4

Time: ~15 minutes. Quality: Premium.
Troubleshooting
| Issue | Solution |
|---|---|
| Audio out of sync | Offset with ffmpeg: |
| Subtle mouth movements | Use wav2lip_gan.pth, increase audio volume |
| Face artifacts | Post-process with CodeFormer (fidelity 0.6-0.8) |
| Robotic voice clone | Use longer/cleaner reference, increase exaggeration |
| Unnatural head movement | Lower SadTalker pose_style to 0-10 |
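For the "audio out of sync" row, one common fix is remuxing with ffmpeg's -itsoffset, which shifts the timestamps of the input that follows it. A sketch that builds the command (filenames and the offset value are placeholders to tune by eye):

```python
def offset_audio_cmd(video: str, audio: str, out: str,
                     offset_s: float) -> list[str]:
    """Build an ffmpeg command that delays (positive offset) or advances
    (negative offset) the audio track relative to the video."""
    return ["ffmpeg", "-y", "-i", video,
            "-itsoffset", f"{offset_s:.3f}", "-i", audio,  # offset applies to audio
            "-map", "0:v", "-map", "1:a",
            "-c:v", "copy", "-c:a", "aac", out]
```

Start with small offsets (±0.1-0.3 s); lip-sync drift beyond that usually means the audio was trimmed differently from the video and should be regenerated instead.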
Reference
- references/voice-synthesis.md - Full voice tool documentation
- references/models.md - Voice model download links
- Character voice profiles in projects/{project}/characters/