Generate character voices using TTS, voice cloning, and lip-sync tools. Supports Chatterbox, F5-TTS, TTS Audio Suite, RVC, and ElevenLabs. Use when creating speech audio for characters or syncing audio to video.
```
npx skill4agent add mckruz/comfyui-expert comfyui-voice-pipeline
```

## Tool Selection

```
VOICE REQUEST
|
|-- Have reference audio of target voice?
|   |-- Yes (5+ seconds) → Chatterbox (MIT license, paralinguistic tags)
|   |-- Yes (10-15 seconds) → F5-TTS (fastest zero-shot)
|   |-- Yes (10+ minutes) → RVC training (highest fidelity)
|   |-- Yes (any length, budget allows) → ElevenLabs (production quality)
|
|-- No reference audio?
|   |-- Need emotion control → IndexTTS-2 (8-emotion vectors)
|   |-- Need multi-language → TTS Audio Suite (23 languages)
|   |-- Need voice design → ElevenLabs Voice Design (describe the voice)
|   |-- Quick prototype → any TTS with a default voice
|
|-- Need multi-speaker dialog?
|   |-- Chatterbox (4 voices) or TTS Audio Suite (character switching)
|
|-- Need lip-sync?
|   |-- Best accuracy → Wav2Lip + CodeFormer
|   |-- Need head movement → SadTalker
|   |-- Full expression control → LivePortrait
|   |-- Unlimited length → InfiniteTalk
```

## Chatterbox Paralinguistic Tags

Insert tags inline in the input text to control delivery:

```
[laugh] [chuckle] [sigh] [gasp] [cough] [clear throat]
[whisper] [excited] [sad] [angry] [surprised]
```

Intensity is tuned with the `exaggeration` parameter. Reference audio is a `.wav` file with a matching `.txt` transcript.

## TTS Audio Suite Dialog Tags

`[CharacterName]` switches the active voice, `[de:Alice]` / `[fr:Bob]` switch voice and language together, and `[pause:1s]` inserts silence.

## RVC Voice Conversion

```
Text → Any TTS → Base Audio → RVC Model → Character Voice
```

## Character Voice Profiles

Store per-character voice settings in `projects/{project}/characters/{name}/profile.yaml`:
```yaml
voice:
  cloned: true
  model: "chatterbox"
  sample_file: "references/voice_sample.wav"
  settings:
    exaggeration: 1.2
    default_emotion: "neutral"
  notes: "Warm, confident tone. Slight Italian-American undertones."
```

## Multi-Speaker Dialog Example

[Sage] Hello! *laughs* I've been looking forward to this.
[pause:0.5s]
[Alex] [excited] Same here! Let's dive right in.
[Sage] [whisper] But first, I need to tell you something...

## Audio Preparation (ffmpeg)

```shell
# Convert to mono 24 kHz WAV, normalized
ffmpeg -i input.wav -ac 1 -ar 24000 -af "loudnorm=I=-16:TP=-3" output.wav

# Trim silence from start and end
ffmpeg -i input.wav -af "silenceremove=start_periods=1:start_threshold=-50dB,areverse,silenceremove=start_periods=1:start_threshold=-50dB,areverse" trimmed.wav
```
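The concat command expects `filelist.txt` in ffmpeg's concat-demuxer format: one `file '<path>'` line per segment, in playback order. A minimal sketch for generating it (the segment names here are placeholders, not files this skill produces):

```shell
# Build filelist.txt for ffmpeg's concat demuxer.
# One "file '<path>'" line per audio segment, in playback order.
{
  printf "file '%s'\n" "seg_001.wav"
  printf "file '%s'\n" "seg_002.wav"
  printf "file '%s'\n" "seg_003.wav"
} > filelist.txt
cat filelist.txt
```

With `-safe 0`, ffmpeg also accepts absolute paths in the list.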
```shell
# Concatenate the segments listed in filelist.txt
ffmpeg -f concat -safe 0 -i filelist.txt -c copy combined.wav
```

## Wav2Lip Settings

```yaml
wav2lip_model: "wav2lip_gan.pth"  # better results than wav2lip.pth
face_detect_batch: 16
nosmooth: false
pad_bottom: 10
```

## SadTalker Settings

```yaml
preprocess: "full"   # better for novel faces
enhancer: "gfpgan"
pose_style: 10-20    # natural conversation range
```

## LivePortrait Settings

```yaml
lip_zero: 0.03       # reduces unnatural lip movement
stitching: true      # seamless face blending
```

## Pipeline 1: Talking Head from a Still Image (SadTalker)

1. [Text] → Chatterbox/F5-TTS → audio.wav
2. [Character Image] + audio.wav → SadTalker → video.mp4
3. video.mp4 → GFPGAN/CodeFormer → final.mp4

## Pipeline 2: Image-to-Video + Lip-Sync (Wan I2V + Wav2Lip)

1. [Text] → Chatterbox → audio.wav
2. [Character Image] → Wan I2V → base_video.mp4
Prompt: "person talking, slight head movement, indoor"
3. base_video.mp4 + audio.wav → Wav2Lip → lipsync.mp4
4. lipsync.mp4 → FaceDetailer batch → enhanced.mp4
5. enhanced.mp4 → Color correct + Deflicker → final.mp4

## Pipeline 3: Performance Transfer (LivePortrait)

1. Record a driving video (actor performing the lines)
2. [Text] → Voice Clone TTS → audio.wav
3. [Character Image] + driving.mp4 → LivePortrait → expression_video.mp4
4. expression_video.mp4 + audio.wav → Wav2Lip → lipsync.mp4
5. lipsync.mp4 → CodeFormer → final.mp4

## Troubleshooting

| Issue | Solution |
|---|---|
| Audio out of sync | Offset the audio track with ffmpeg's `-itsoffset` |
| Subtle mouth movements | Use wav2lip_gan.pth, increase audio volume |
| Face artifacts | Post-process with CodeFormer (fidelity 0.6-0.8) |
| Robotic voice clone | Use longer/cleaner reference, increase exaggeration |
| Unnatural head movement | Lower SadTalker pose_style to 0-10 |
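For the out-of-sync case in the table, one common fix is shifting the audio input with ffmpeg's `-itsoffset`. A sketch with placeholder filenames and offset (a negative offset shifts audio earlier instead):

```shell
# Delay the audio track by 0.3 s relative to the video, then remux:
# take video from input 0, shifted audio from input 1, copy the
# video stream unchanged. Echoed first so it can be inspected.
OFFSET=0.3
CMD="ffmpeg -y -i lipsync.mp4 -itsoffset $OFFSET -i audio.wav -map 0:v:0 -map 1:a:0 -c:v copy -shortest synced.mp4"
echo "$CMD"
# Run it with: eval "$CMD"
```

Adjust `OFFSET` in small steps (0.05-0.1 s) until lips and audio line up.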
## Reference Files

- `references/voice-synthesis.md`
- `references/models.md`
- `projects/{project}/characters/` (per-character voice profiles)