comfyui-voice-pipeline


ComfyUI Voice Pipeline


Creates character voices through TTS/voice cloning and synchronizes them with generated video.

Voice Generation Decision Tree


VOICE REQUEST
    |
    |-- Have reference audio of target voice?
    |   |-- Yes (5+ seconds) → Chatterbox (MIT, paralinguistic tags)
    |   |-- Yes (10-15 seconds) → F5-TTS (fastest zero-shot)
    |   |-- Yes (10+ minutes) → RVC training (highest fidelity)
    |   |-- Yes (any length, budget) → ElevenLabs (production quality)
    |
    |-- No reference audio?
    |   |-- Need emotion control → IndexTTS-2 (8-emotion vectors)
    |   |-- Need multi-language → TTS Audio Suite (23 languages)
    |   |-- Need voice design → ElevenLabs Voice Design (describe voice)
    |   |-- Quick prototype → Any TTS with default voice
    |
    |-- Need multi-speaker dialog?
    |   |-- Chatterbox (4 voices) or TTS Audio Suite (character switching)
    |
    |-- Need lip-sync?
    |   |-- Best accuracy → Wav2Lip + CodeFormer
    |   |-- Need head movement → SadTalker
    |   |-- Full expression control → LivePortrait
    |   |-- Unlimited length → InfiniteTalk
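The branches above can be sketched as a small lookup helper. This function and its thresholds are illustrative only — the tool names come from the tree, but `pick_tts_tool` is a hypothetical helper, not part of any ComfyUI node pack:

```python
from typing import Optional

# Illustrative mapping of the decision tree above; NOT a real API.
def pick_tts_tool(ref_seconds: Optional[float], need: str = "general") -> str:
    if ref_seconds is not None:
        if ref_seconds >= 600:   # 10+ minutes of audio: train RVC
            return "RVC"
        if ref_seconds >= 10:    # 10-15 s: fastest zero-shot
            return "F5-TTS"
        if ref_seconds >= 5:     # 5+ s: Chatterbox
            return "Chatterbox"
        return "ElevenLabs"      # too short for local cloning; paid fallback
    return {                     # no reference audio
        "emotion": "IndexTTS-2",
        "multi-language": "TTS Audio Suite",
        "voice-design": "ElevenLabs Voice Design",
    }.get(need, "any TTS with default voice")

print(pick_tts_tool(12))               # F5-TTS
print(pick_tts_tool(None, "emotion"))  # IndexTTS-2
```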

Tool Reference


Chatterbox (Recommended Open-Source)


Strengths: MIT license, preferred over ElevenLabs in 63.8% of blind tests, needs only a 5-second sample, emotion control, sub-200ms latency.
Paralinguistic tags:
[laugh] [chuckle] [sigh] [gasp] [cough] [clear throat]
[whisper] [excited] [sad] [angry] [surprised]
Key parameter: `exaggeration` (0.25-2.0) controls expressiveness.
Limit: 40-second generation cap. Split longer content.
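As a quick sanity check before generation, a script can be scanned for bracketed tags outside the set listed above. The checker below is an illustrative sketch, not part of Chatterbox:

```python
import re

# Tag set copied from the list above; the helper itself is hypothetical.
ALLOWED_TAGS = {
    "laugh", "chuckle", "sigh", "gasp", "cough", "clear throat",
    "whisper", "excited", "sad", "angry", "surprised",
}

def unknown_tags(script: str) -> list:
    """Return bracketed tags that are not in the documented set."""
    return [t for t in re.findall(r"\[([^\]]+)\]", script)
            if t not in ALLOWED_TAGS]

print(unknown_tags("[whisper] Hello [giggle] there"))  # ['giggle']
```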

F5-TTS


Strengths: Fastest zero-shot cloning, works from samples under 15 seconds, MIT license, multi-language.
Requirements: Reference audio must be paired as `.wav` + `.txt` (matching transcription).
Languages: English, German, Spanish, French, Japanese, Hindi, Thai, Portuguese.

TTS Audio Suite


Strengths: Unified multi-engine platform, 23 languages, character switching.
Special features:
  • Character switching: `[CharacterName]` tags
  • Language switching: `[de:Alice]`, `[fr:Bob]`
  • Pause control: `[pause:1s]`
  • SRT timing sync
Integrates: F5-TTS, Chatterbox, Higgs Audio 2, VibeVoice, IndexTTS-2, RVC.

IndexTTS-2


Strengths: 8-emotion vector control with per-segment parameters.
Emotions: happy, angry, sad, surprised, afraid, disgusted, calm, melancholic.
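One way to think about the per-segment parameters is as an 8-slot weight vector over the emotions listed above. The key order and sum-to-one normalization below are assumptions made for illustration, not IndexTTS-2's actual API:

```python
# Illustrative 8-emotion vector; ordering and normalization are assumptions.
EMOTIONS = ["happy", "angry", "sad", "surprised",
            "afraid", "disgusted", "calm", "melancholic"]

def emotion_vector(**weights) -> list:
    """Build a normalized weight vector; unspecified emotions default to 0."""
    vec = [max(0.0, weights.get(e, 0.0)) for e in EMOTIONS]
    total = sum(vec) or 1.0
    return [round(v / total, 3) for v in vec]

print(emotion_vector(happy=2, calm=1))
# [0.667, 0.0, 0.0, 0.0, 0.0, 0.0, 0.333, 0.0]
```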

RVC (Voice Conversion)


Use case: Train a model on target voice (10+ min audio), then convert any TTS output.
Pipeline:
Text → Any TTS → Base Audio → RVC Model → Character Voice
Training: 300-500 epochs, RMVPE feature extraction.

ElevenLabs (Commercial)


Tiers:
  • Instant Clone: 1-minute sample, good quality
  • Professional Clone: 30+ minutes (3h ideal), near-indistinguishable
  • Voice Design: Describe voice in text (no sample needed)

Voice Profile Setup


For each character, establish a voice profile in `projects/{project}/characters/{name}/profile.yaml`:

```yaml
voice:
  cloned: true
  model: "chatterbox"
  sample_file: "references/voice_sample.wav"
  settings:
    exaggeration: 1.2
    default_emotion: "neutral"
  notes: "Warm, confident tone. Slight Italian-American undertones."
```

Script Preparation


Text Formatting for TTS


  1. Punctuation matters: Commas create pauses, periods create stops
  2. Phonetic hints: Spell unusual words phonetically if mispronounced
  3. Emotion cues: Use Chatterbox tags or split by emotion for IndexTTS-2
  4. Length: Split into 30-40 second segments for Chatterbox limit
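Point 4 can be automated with a sentence-level splitter. The 2.5 words-per-second speech rate is an assumed estimate (tune it per voice), and the function is an illustrative sketch:

```python
WORDS_PER_SECOND = 2.5   # assumed average speech rate; tune per voice
MAX_SECONDS = 35         # headroom below Chatterbox's 40 s cap

def split_script(text: str, max_seconds: int = MAX_SECONDS) -> list:
    """Greedily pack whole sentences into segments under the time budget."""
    max_words = int(max_seconds * WORDS_PER_SECOND)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    segments, current, word_count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and word_count + n > max_words:
            segments.append(". ".join(current) + ".")
            current, word_count = [], 0
        current.append(sentence)
        word_count += n
    if current:
        segments.append(". ".join(current) + ".")
    return segments
```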

Multi-Speaker Script


[Sage] Hello! *laughs* I've been looking forward to this.
[pause:0.5s]
[Alex] [excited] Same here! Let's dive right in.
[Sage] [whisper] But first, I need to tell you something...
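A script in this form can be turned into a list of speak/pause events before dispatching lines to per-character TTS. The tag grammar below is inferred from the example and is an illustrative sketch, not TTS Audio Suite's actual parser:

```python
import re

# Hypothetical parser for [Speaker] and [pause:Ns] lines; inline [emotion]
# tags are left inside the text for the TTS engine to handle.
LINE_RE = re.compile(r"^\[(?P<speaker>[A-Za-z]+)\]\s*(?P<rest>.*)$")
PAUSE_RE = re.compile(r"^\[pause:(?P<sec>[\d.]+)s\]$")

def parse_script(script: str) -> list:
    events = []
    for line in script.strip().splitlines():
        line = line.strip()
        pause = PAUSE_RE.match(line)
        if pause:
            events.append(("pause", float(pause.group("sec"))))
            continue
        m = LINE_RE.match(line)
        if m:
            events.append(("speak", m.group("speaker"), m.group("rest")))
    return events

for event in parse_script("[Sage] Hello!\n[pause:0.5s]\n[Alex] [excited] Hi."):
    print(event)
```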

Audio Post-Processing


Requirements for Lip-Sync Input


  • Sample rate: 16-24kHz (model dependent)
  • Format: WAV (uncompressed)
  • Mono channel
  • Trim leading silence
  • Add 0.2s trailing silence
  • Normalize to -3dB peak
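The channel and sample-rate items in this checklist can be verified programmatically before handing audio to a lip-sync node. This pre-flight check uses only the stdlib `wave` module and is an illustrative sketch (it does not measure silence or peak level):

```python
import os
import tempfile
import wave

def check_lipsync_audio(path: str) -> list:
    """Return a list of problems per the checklist above (empty = OK)."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            problems.append("not mono")
        if not 16000 <= w.getframerate() <= 24000:
            problems.append(f"sample rate {w.getframerate()} outside 16-24kHz")
    return problems

# Demo: a deliberately wrong file (stereo, 44.1 kHz) gets flagged.
demo = os.path.join(tempfile.gettempdir(), "lipsync_demo.wav")
with wave.open(demo, "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 32)
print(check_lipsync_audio(demo))
```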

FFmpeg Processing


```bash
# Convert to mono 24kHz WAV, normalized
ffmpeg -i input.wav -ac 1 -ar 24000 -af "loudnorm=I=-16:TP=-3" output.wav

# Trim silence from start/end
ffmpeg -i input.wav -af "silenceremove=start_periods=1:start_threshold=-50dB,areverse,silenceremove=start_periods=1:start_threshold=-50dB,areverse" trimmed.wav

# Concatenate segments
ffmpeg -f concat -safe 0 -i filelist.txt -c copy combined.wav
```
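The concat demuxer in the last command reads a `filelist.txt` of `file '...'` lines. A helper like the one below can generate it (shell-style single-quote escaping is assumed to match ffmpeg's concat format; verify for exotic filenames):

```python
# write_concat_list is a hypothetical helper for building the demuxer input.
def write_concat_list(paths, out="filelist.txt"):
    lines = ["file '{}'".format(p.replace("'", "'\\''")) for p in paths]
    with open(out, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines

print(write_concat_list(["seg_001.wav", "seg_002.wav"]))
```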

Lip-Sync Methods

唇形同步方法

Wav2Lip (Best Accuracy)


Settings:
```yaml
wav2lip_model: "wav2lip_gan.pth"  # Better than wav2lip.pth
face_detect_batch: 16
nosmooth: false
pad_bottom: 10
```
MUST post-process: run CodeFormer (fidelity 0.7) on the Wav2Lip output.

SadTalker (Head Movement)


Settings:
```yaml
preprocess: "full"     # Better for novel faces
enhancer: "gfpgan"
pose_style: 10-20      # Natural conversation range
```

LivePortrait (Expression Control)


Settings:
```yaml
lip_zero: 0.03         # Reduces unnatural lip movement
stitching: true        # Seamless face blending
```
Best for: Premium avatar creation, expression transfer from a driving video.

LatentSync 1.6 (Newest, Highest Quality)


ByteDance model trained at 512x512 with TREPA modules for temporal consistency.

InfiniteTalk (Unlimited Length)


For videos longer than standard lip-sync limits. Integrates with Wan for joint generation.

Complete Talking Head Workflow


Pipeline A: Quick (Image → Talk)


1. [Text] → Chatterbox/F5-TTS → audio.wav
2. [Character Image] + audio.wav → SadTalker → video.mp4
3. video.mp4 → GFPGAN/CodeFormer → final.mp4
Time: ~2 minutes. Quality: Good.

Pipeline B: Quality (Image → Video → Lip-Sync)


1. [Text] → Chatterbox → audio.wav
2. [Character Image] → Wan I2V → base_video.mp4
   Prompt: "person talking, slight head movement, indoor"
3. base_video.mp4 + audio.wav → Wav2Lip → lipsync.mp4
4. lipsync.mp4 → FaceDetailer batch → enhanced.mp4
5. enhanced.mp4 → Color correct + Deflicker → final.mp4
Time: ~10 minutes. Quality: Production.

Pipeline C: Premium (Expression Transfer)


1. Record driving video (actor performing lines)
2. [Text] → Voice Clone TTS → audio.wav
3. [Character Image] + driving.mp4 → LivePortrait → expression_video.mp4
4. expression_video.mp4 + audio.wav → Wav2Lip → lipsync.mp4
5. lipsync.mp4 → CodeFormer → final.mp4
Time: ~15 minutes. Quality: Premium.

Troubleshooting


| Issue | Solution |
|---|---|
| Audio out of sync | Offset with ffmpeg: `ffmpeg -itsoffset 0.1 -i audio.wav ...` |
| Subtle mouth movements | Use wav2lip_gan.pth, increase audio volume |
| Face artifacts | Post-process with CodeFormer (fidelity 0.6-0.8) |
| Robotic voice clone | Use a longer/cleaner reference, increase exaggeration |
| Unnatural head movement | Lower SadTalker pose_style to 0-10 |

Reference


  • references/voice-synthesis.md - Full voice tool documentation
  • references/models.md - Voice model download links
  • Character voice profiles in projects/{project}/characters/