comfyui-voice-pipeline


ComfyUI Voice Pipeline


Creates character voices through TTS/voice cloning and synchronizes them with generated video.

Voice Generation Decision Tree


VOICE REQUEST
    |
    |-- Have reference audio of target voice?
    |   |-- Yes (5+ seconds) → Chatterbox (MIT, paralinguistic tags)
    |   |-- Yes (10-15 seconds) → F5-TTS (fastest zero-shot)
    |   |-- Yes (10+ minutes) → RVC training (highest fidelity)
    |   |-- Yes (any length, budget) → ElevenLabs (production quality)
    |
    |-- No reference audio?
    |   |-- Need emotion control → IndexTTS-2 (8-emotion vectors)
    |   |-- Need multi-language → TTS Audio Suite (23 languages)
    |   |-- Need voice design → ElevenLabs Voice Design (describe voice)
    |   |-- Quick prototype → Any TTS with default voice
    |
    |-- Need multi-speaker dialog?
    |   |-- Chatterbox (4 voices) or TTS Audio Suite (character switching)
    |
    |-- Need lip-sync?
    |   |-- Best accuracy → Wav2Lip + CodeFormer
    |   |-- Need head movement → SadTalker
    |   |-- Full expression control → LivePortrait
    |   |-- Unlimited length → InfiniteTalk
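The branches above can be sketched as a small lookup helper. This function and its thresholds are illustrative only — the tool names come from the tree, but `pick_tts_tool` is a hypothetical helper, not part of any ComfyUI node pack:

```python
from typing import Optional

# Illustrative mapping of the decision tree above; NOT a real API.
def pick_tts_tool(ref_seconds: Optional[float], need: str = "general") -> str:
    if ref_seconds is not None:
        if ref_seconds >= 600:   # 10+ minutes of audio: train RVC
            return "RVC"
        if ref_seconds >= 10:    # 10-15 s: fastest zero-shot
            return "F5-TTS"
        if ref_seconds >= 5:     # 5+ s: Chatterbox
            return "Chatterbox"
        return "ElevenLabs"      # too short for local cloning; paid fallback
    return {                     # no reference audio
        "emotion": "IndexTTS-2",
        "multi-language": "TTS Audio Suite",
        "voice-design": "ElevenLabs Voice Design",
    }.get(need, "any TTS with default voice")

print(pick_tts_tool(12))               # F5-TTS
print(pick_tts_tool(None, "emotion"))  # IndexTTS-2
```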

Tool Reference


Chatterbox (Recommended Open-Source)


Strengths: MIT license, preferred over ElevenLabs in 63.8% of blind tests, needs only a 5-second sample, emotion control, sub-200ms latency.
Paralinguistic tags:
[laugh] [chuckle] [sigh] [gasp] [cough] [clear throat]
[whisper] [excited] [sad] [angry] [surprised]
Key parameter: `exaggeration` (0.25-2.0) controls expressiveness.
Limit: 40-second generation cap. Split longer content.
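As a quick sanity check before generation, a script can be scanned for bracketed tags outside the set listed above. The checker below is an illustrative sketch, not part of Chatterbox:

```python
import re

# Tag set copied from the list above; the helper itself is hypothetical.
ALLOWED_TAGS = {
    "laugh", "chuckle", "sigh", "gasp", "cough", "clear throat",
    "whisper", "excited", "sad", "angry", "surprised",
}

def unknown_tags(script: str) -> list:
    """Return bracketed tags that are not in the documented set."""
    return [t for t in re.findall(r"\[([^\]]+)\]", script)
            if t not in ALLOWED_TAGS]

print(unknown_tags("[whisper] Hello [giggle] there"))  # ['giggle']
```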

F5-TTS


Strengths: Fastest zero-shot cloning, works from samples under 15 seconds, MIT license, multi-language.
Requirements: Reference audio must be paired as `.wav` + `.txt` (matching transcription).
Languages: English, German, Spanish, French, Japanese, Hindi, Thai, Portuguese.

TTS Audio Suite


Strengths: Unified multi-engine platform, 23 languages, character switching.
Special features:
  • Character switching: `[CharacterName]` tags
  • Language switching: `[de:Alice]`, `[fr:Bob]`
  • Pause control: `[pause:1s]`
  • SRT timing sync
Integrates: F5-TTS, Chatterbox, Higgs Audio 2, VibeVoice, IndexTTS-2, RVC.

IndexTTS-2


Strengths: 8-emotion vector control with per-segment parameters.
Emotions: happy, angry, sad, surprised, afraid, disgusted, calm, melancholic.
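One way to think about the per-segment parameters is as an 8-slot weight vector over the emotions listed above. The key order and sum-to-one normalization below are assumptions made for illustration, not IndexTTS-2's actual API:

```python
# Illustrative 8-emotion vector; ordering and normalization are assumptions.
EMOTIONS = ["happy", "angry", "sad", "surprised",
            "afraid", "disgusted", "calm", "melancholic"]

def emotion_vector(**weights) -> list:
    """Build a normalized weight vector; unspecified emotions default to 0."""
    vec = [max(0.0, weights.get(e, 0.0)) for e in EMOTIONS]
    total = sum(vec) or 1.0
    return [round(v / total, 3) for v in vec]

print(emotion_vector(happy=2, calm=1))
# [0.667, 0.0, 0.0, 0.0, 0.0, 0.0, 0.333, 0.0]
```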

RVC (Voice Conversion)


Use case: Train a model on target voice (10+ min audio), then convert any TTS output.
Pipeline:
Text → Any TTS → Base Audio → RVC Model → Character Voice
Training: 300-500 epochs, RMVPE feature extraction.

ElevenLabs (Commercial)


Tiers:
  • Instant Clone: 1-minute sample, good quality
  • Professional Clone: 30+ minutes (3h ideal), near-indistinguishable
  • Voice Design: Describe voice in text (no sample needed)

Voice Profile Setup


For each character, establish a voice profile in `projects/{project}/characters/{name}/profile.yaml`:

```yaml
voice:
  cloned: true
  model: "chatterbox"
  sample_file: "references/voice_sample.wav"
  settings:
    exaggeration: 1.2
    default_emotion: "neutral"
  notes: "Warm, confident tone. Slight Italian-American undertones."
```

Script Preparation


Text Formatting for TTS


  1. Punctuation matters: Commas create pauses, periods create stops
  2. Phonetic hints: Spell unusual words phonetically if mispronounced
  3. Emotion cues: Use Chatterbox tags or split by emotion for IndexTTS-2
  4. Length: Split into 30-40 second segments for Chatterbox limit
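Point 4 can be automated with a sentence-level splitter. The 2.5 words-per-second speech rate is an assumed estimate (tune it per voice), and the function is an illustrative sketch:

```python
WORDS_PER_SECOND = 2.5   # assumed average speech rate; tune per voice
MAX_SECONDS = 35         # headroom below Chatterbox's 40 s cap

def split_script(text: str, max_seconds: int = MAX_SECONDS) -> list:
    """Greedily pack whole sentences into segments under the time budget."""
    max_words = int(max_seconds * WORDS_PER_SECOND)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    segments, current, word_count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and word_count + n > max_words:
            segments.append(". ".join(current) + ".")
            current, word_count = [], 0
        current.append(sentence)
        word_count += n
    if current:
        segments.append(". ".join(current) + ".")
    return segments
```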

Multi-Speaker Script


[Sage] Hello! *laughs* I've been looking forward to this.
[pause:0.5s]
[Alex] [excited] Same here! Let's dive right in.
[Sage] [whisper] But first, I need to tell you something...
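A script in this form can be turned into a list of speak/pause events before dispatching lines to per-character TTS. The tag grammar below is inferred from the example and is an illustrative sketch, not TTS Audio Suite's actual parser:

```python
import re

# Hypothetical parser for [Speaker] and [pause:Ns] lines; inline [emotion]
# tags are left inside the text for the TTS engine to handle.
LINE_RE = re.compile(r"^\[(?P<speaker>[A-Za-z]+)\]\s*(?P<rest>.*)$")
PAUSE_RE = re.compile(r"^\[pause:(?P<sec>[\d.]+)s\]$")

def parse_script(script: str) -> list:
    events = []
    for line in script.strip().splitlines():
        line = line.strip()
        pause = PAUSE_RE.match(line)
        if pause:
            events.append(("pause", float(pause.group("sec"))))
            continue
        m = LINE_RE.match(line)
        if m:
            events.append(("speak", m.group("speaker"), m.group("rest")))
    return events

for event in parse_script("[Sage] Hello!\n[pause:0.5s]\n[Alex] [excited] Hi."):
    print(event)
```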

Audio Post-Processing


Requirements for Lip-Sync Input


  • Sample rate: 16-24kHz (model dependent)
  • Format: WAV (uncompressed)
  • Mono channel
  • Trim leading silence
  • Add 0.2s trailing silence
  • Normalize to -3dB peak
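The channel and sample-rate items in this checklist can be verified programmatically before handing audio to a lip-sync node. This pre-flight check uses only the stdlib `wave` module and is an illustrative sketch (it does not measure silence or peak level):

```python
import os
import tempfile
import wave

def check_lipsync_audio(path: str) -> list:
    """Return a list of problems per the checklist above (empty = OK)."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            problems.append("not mono")
        if not 16000 <= w.getframerate() <= 24000:
            problems.append(f"sample rate {w.getframerate()} outside 16-24kHz")
    return problems

# Demo: a deliberately wrong file (stereo, 44.1 kHz) gets flagged.
demo = os.path.join(tempfile.gettempdir(), "lipsync_demo.wav")
with wave.open(demo, "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 32)
print(check_lipsync_audio(demo))
```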

FFmpeg Processing


```bash
# Convert to mono 24kHz WAV, normalized
ffmpeg -i input.wav -ac 1 -ar 24000 -af "loudnorm=I=-16:TP=-3" output.wav

# Trim silence from start/end
ffmpeg -i input.wav -af "silenceremove=start_periods=1:start_threshold=-50dB,areverse,silenceremove=start_periods=1:start_threshold=-50dB,areverse" trimmed.wav

# Concatenate segments
ffmpeg -f concat -safe 0 -i filelist.txt -c copy combined.wav
```
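The concat demuxer in the last command reads a `filelist.txt` of `file '...'` lines. A helper like the one below can generate it (shell-style single-quote escaping is assumed to match ffmpeg's concat format; verify for exotic filenames):

```python
# write_concat_list is a hypothetical helper for building the demuxer input.
def write_concat_list(paths, out="filelist.txt"):
    lines = ["file '{}'".format(p.replace("'", "'\\''")) for p in paths]
    with open(out, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines

print(write_concat_list(["seg_001.wav", "seg_002.wav"]))
```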

Lip-Sync Methods

唇形同步方法

Wav2Lip (Best Accuracy)


Settings:
```yaml
wav2lip_model: "wav2lip_gan.pth"  # Better than wav2lip.pth
face_detect_batch: 16
nosmooth: false
pad_bottom: 10
```
MUST post-process: run CodeFormer (fidelity 0.7) on the Wav2Lip output.

SadTalker (Head Movement)


Settings:
```yaml
preprocess: "full"     # Better for novel faces
enhancer: "gfpgan"
pose_style: 10-20      # Natural conversation range
```

LivePortrait (Expression Control)


Settings:
```yaml
lip_zero: 0.03         # Reduces unnatural lip movement
stitching: true        # Seamless face blending
```
Best for: Premium avatar creation, expression transfer from a driving video.

LatentSync 1.6 (Newest, Highest Quality)


ByteDance model trained at 512x512 with TREPA modules for temporal consistency.

InfiniteTalk (Unlimited Length)


For videos longer than standard lip-sync limits. Integrates with Wan for joint generation.

Complete Talking Head Workflow


Pipeline A: Quick (Image → Talk)


1. [Text] → Chatterbox/F5-TTS → audio.wav
2. [Character Image] + audio.wav → SadTalker → video.mp4
3. video.mp4 → GFPGAN/CodeFormer → final.mp4
Time: ~2 minutes. Quality: Good.

Pipeline B: Quality (Image → Video → Lip-Sync)


1. [Text] → Chatterbox → audio.wav
2. [Character Image] → Wan I2V → base_video.mp4
   Prompt: "person talking, slight head movement, indoor"
3. base_video.mp4 + audio.wav → Wav2Lip → lipsync.mp4
4. lipsync.mp4 → FaceDetailer batch → enhanced.mp4
5. enhanced.mp4 → Color correct + Deflicker → final.mp4
Time: ~10 minutes. Quality: Production.

Pipeline C: Premium (Expression Transfer)


1. Record driving video (actor performing lines)
2. [Text] → Voice Clone TTS → audio.wav
3. [Character Image] + driving.mp4 → LivePortrait → expression_video.mp4
4. expression_video.mp4 + audio.wav → Wav2Lip → lipsync.mp4
5. lipsync.mp4 → CodeFormer → final.mp4
Time: ~15 minutes. Quality: Premium.

Troubleshooting


| Issue | Solution |
|---|---|
| Audio out of sync | Offset with ffmpeg: `ffmpeg -itsoffset 0.1 -i audio.wav ...` |
| Subtle mouth movements | Use wav2lip_gan.pth, increase audio volume |
| Face artifacts | Post-process with CodeFormer (fidelity 0.6-0.8) |
| Robotic voice clone | Use a longer/cleaner reference, increase exaggeration |
| Unnatural head movement | Lower SadTalker pose_style to 0-10 |

Reference


  • references/voice-synthesis.md - Full voice tool documentation
  • references/models.md - Voice model download links
  • Character voice profiles in projects/{project}/characters/