wjs-dubbing-video
Use when the user has a video + a target-language SRT and wants the video to actually speak that language — generates a time-aligned TTS voice dub. Routes by voice ID — Volcano (豆包) TTS for Chinese, edge-tts neural for any language. Defaults to one voice (single-speaker); opt-in multi-speaker via visual diarization. Outputs `*_<lang>_dub.mp4` with the dub audio in place of the original. Final mixing (audio bed + burn-in) is handed off to `/wjs-burning-subtitles`. Triggers — "配音", "中文配音", "Chinese dub", "voice over this", "dub the video", "TTS this SRT", "different voice for each speaker".
Source: jianshuo/claude-skills
NPX install: `npx skill4agent add jianshuo/claude-skills wjs-dubbing-video`
Video + target-language SRT → `*_<lang>_dub.mp4` with a time-aligned TTS voice. This skill stops at the dub track. Burn-in + audio bed mixing is the next skill (`/wjs-burning-subtitles/render.py` composites everything in one final encode).
When to use
- User has a target-language SRT (e.g., `entrevista.zh-CN.srt`) and wants the video to speak that language.
- User says "中文配音 / 配音 / 帮我做配音 / dub it / voice over".
- User has multiple speakers on camera and wants different voices per speaker.
When NOT to use
- No SRT yet → run `/wjs-transcribing-audio` then `/wjs-translating-subtitles` first.
- Source-language only TTS (rare; usually you translate first) → still use this skill, but pass the source SRT.
- Burn-in only, no audio change → skip to `/wjs-burning-subtitles`.
Number of speakers — default to one
Default: assume one speaker. Use a single voice for the entire dub. This is the right answer for monologues, vlogs, recorded talks, narrator-only clips, and the overwhelming majority of videos people ask about. Don't run diarization, don't tag the SRT with `[A]`/`[B]`, don't bring up multi-speaker complexity.
Switch to multi-speaker only when the user explicitly says so: phrasings like "two people", "interview", "dialogue", "conversation between", "separate the speakers", "different voice for each", or a direct request to do diarization. When triggered, follow the "Multi-speaker dubbing" section below.
If you're unsure whether a video is one speaker or many, ship the single-voice version first. Adding speaker separation later is cheap (just regenerate the dub); shipping confused multi-speaker output by default wastes the user's time.
Engine routing — by voice ID
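`scripts/dub.py` picks the engine from the voice ID:
| Voice ID pattern | Engine | Auth |
|---|---|---|
| Volcano speaker IDs, e.g. `zh_female_gaolengyujie_moon_bigtts` | Volcano (字节跳动豆包) TTS | `VOLC_TTS_APPID` + `VOLC_TTS_ACCESS_TOKEN` |
| edge-tts voice names, e.g. `zh-CN-XiaoxiaoNeural` | edge-tts (Microsoft Edge neural) | none (free) |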
For Mandarin, Volcano is markedly more natural than edge-tts, especially for emotional/contemplative content. Use edge-tts when Volcano credentials aren't available or as a debugging fallback.
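The routing rule can be sketched in a few lines. This is a minimal illustration, not the code in `scripts/dub.py`: the helper name `pick_engine` is hypothetical, and the test it uses (a `_bigtts` suffix means Volcano, anything else goes to edge-tts) is an assumption inferred from the voice IDs that appear in this document.

```python
def pick_engine(voice_id: str) -> str:
    """Return "volcano" or "edge-tts" for a voice ID (hypothetical helper)."""
    # Assumption: every Volcano speaker ID in this document ends in "_bigtts".
    if voice_id.endswith("_bigtts"):
        return "volcano"
    # Everything else (e.g. "zh-CN-XiaoxiaoNeural") is treated as an edge-tts voice.
    return "edge-tts"


if __name__ == "__main__":
    for vid in ("zh_female_gaolengyujie_moon_bigtts", "en-US-AvaMultilingualNeural"):
        print(vid, "->", pick_engine(vid))
```

Routing on the ID keeps the command line identical for both engines: only the voice argument changes.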
Volcano TTS (Chinese only)
Endpoint: `https://openspeech.bytedance.com/api/v3/tts/unidirectional` (used for both TTS 1.0 and 2.0; the Resource-Id header picks the backend).
Headers:
```
X-Api-App-Id: (env: VOLC_TTS_APPID) # 10-digit speech App ID
X-Api-Access-Key: (env: VOLC_TTS_ACCESS_TOKEN) # 32-char token from speech console
X-Api-Resource-Id: volc.service_type.10029 # see resource ID note below
Content-Type: application/json
```
Loading credentials: most users keep them in `~/code/.env`. Read them at the top of any session via:
```bash
set -a; source ~/code/.env; set +a
```
Resource ID: important quirk
The doc lists `seed-tts-2.0` as the "TTS 2.0 (recommended)" resource, but a typical TTS-SeedTTS-2.0 console instance does not include the popular `*_bigtts` speaker catalog (爽快斯斯, 高冷御姐, 开朗姐姐, etc.). Trying those speakers against `seed-tts-2.0` returns `200 code=55000000 "resource ID is mismatched with speaker related resource"`. The fix is to use `volc.service_type.10029` (the TTS 1.0 V3 endpoint): the audio quality of the bigtts speakers is identical, and they all work against this resource. The bundled `dub.py` defaults to `volc.service_type.10029`; override with the `VOLC_TTS_RESOURCE` env var if you have a different instance.
Other 401/403 errors:
- `401 code=45000010 "load grant: requested grant not found in SaaS storage"`: the App ID + key combo is valid against the gateway, but the user has not activated this resource. They must go to 火山引擎 → 语音技术 → 语音合成大模型 → 实例管理 and 开通 (activate) the service. No workaround.
- `403 code=45000030`: the speaker isn't included in the user's instance bundle.
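The headers, the resource override, and the three error codes above can be collected in one place. The following is a sketch under stated assumptions rather than the bundled implementation: `build_headers`, `explain_error`, and `KNOWN_ERRORS` are hypothetical names, and only the environment variables, header names, and codes come from this section.

```python
import os

ENDPOINT = "https://openspeech.bytedance.com/api/v3/tts/unidirectional"
DEFAULT_RESOURCE = "volc.service_type.10029"

# Error codes described in this section, each with a short explanation.
KNOWN_ERRORS = {
    55000000: "resource ID does not match the speaker; try volc.service_type.10029",
    45000010: "resource not activated for this App ID; activate it in the console",
    45000030: "speaker is not included in this instance bundle",
}


def build_headers(env=None):
    """Build the Volcano request headers from environment variables."""
    env = os.environ if env is None else env
    return {
        "X-Api-App-Id": env["VOLC_TTS_APPID"],
        "X-Api-Access-Key": env["VOLC_TTS_ACCESS_TOKEN"],
        # VOLC_TTS_RESOURCE overrides the default resource ID when it is set.
        "X-Api-Resource-Id": env.get("VOLC_TTS_RESOURCE", DEFAULT_RESOURCE),
        "Content-Type": "application/json",
    }


def explain_error(code):
    """Map an API error code to the explanation given above (hypothetical helper)."""
    return KNOWN_ERRORS.get(code, f"unrecognized code {code}")
```

Passing a plain dict as `env` makes the function easy to check without touching the real environment.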
Response format
Despite the doc's casual language, the response is streaming NDJSON, not a single JSON object and not raw audio bytes. Each line is a separate JSON event with a base64-encoded MP3 chunk in `data`. The terminal event has `code: 20000000` (OK in this API's success codes, which differs from `code: 0`). Concatenate the decoded chunks for the full MP3.
```python
import base64, json, requests

# url, h (the headers) and payload are assumed to be defined as described above.
audio = b""
r = requests.post(url, headers=h, json=payload, timeout=60, stream=True)
for line in r.iter_lines():
    if not line:
        continue  # skip keep-alive blank lines
    evt = json.loads(line)
    if evt.get("code") not in (0, None, 20000000):
        raise RuntimeError(f"code={evt.get('code')} {evt.get('message')}")
    if evt.get("data"):
        audio += base64.b64decode(evt["data"])
```
Speaker catalog (verified working under `volc.service_type.10029`)
Full list at volcengine.com/docs/6561/1257544, but availability depends on your instance bundle. Confirmed-working female voices for the typical SeedTTS-2.0 starter instance:
| Speaker ID | Chinese name | Feel |
|---|---|---|
| `zh_female_gaolengyujie_moon_bigtts` | 高冷御姐 | Best for contemplative/spiritual content. Mature, restrained, calm. |
| `zh_female_kailangjiejie_moon_bigtts` | 开朗姐姐 | Warm older-sister storytelling. |
| | 爽快斯斯 | Versatile, conversational baseline. |
| | 邻家女孩 | Casual, lifestyle-vlog. |
| | 元气女友 | Lively, upbeat. |
| | 美丽女友 | Soft, intimate. |
| | 斯斯情感版 | Full emotional range; pair with explicit emotion + scale. |
These voices return 55000000 against the typical instance even though the doc lists them: `vv_uranus_bigtts`, `wenroushunv_moon_bigtts`, `qingxin_moon_bigtts`, `yingmaoxiaoyuan_moon_bigtts`, `tianxinxiaoling_moon_bigtts`, `shaoergushi_moon_bigtts`. Don't promise them without testing.
Audio params
`speech_rate`: an integer such as `-8`; the `--rate -8%` argument maps to `-8`.
Useful emotion presets:
- `emotion="calm"`, `emotion_scale=4`: contemplative, default for this skill's spiritual-content niche.
- `emotion="gentle"`: softer / more intimate.
- `emotion="neutral"`: flat / informational.
- `emotion="sad"`: melancholic. Use sparingly.
Override `dub.py` defaults with the `VOLC_TTS_EMOTION` and `VOLC_TTS_EMOTION_SCALE` env vars without editing code.
No English Volcano voices are wired up in this skill; for English use edge-tts (next section). Volcano does have English speakers (`en_male_*_bigtts`, `en_female_*_bigtts`) but they aren't typically included in TTS-SeedTTS-2.0 starter instances. Add them by extending the voice routing in `dub.py` once verified.
edge-tts (Microsoft Edge neural TTS)
Free, no API key, high-quality but less expressive than Volcano. Install into a project venv; do not call it via `uvx` once per segment. Each `uvx` invocation spawns a fresh Python process and the Bing endpoint will rate-limit or RST the connection after a handful of rapid hits, breaking mid-render.
```bash
uv venv .venv
uv pip install --python .venv/bin/python edge-tts
```
Then drive it from a single long-lived Python process using `edge_tts.Communicate(...)` directly, with retry-on-failure logic. The bundled `scripts/dub.py` does this.
Voice selection: match the original speaker
There is no perfect cross-language match — choose gender, age feel, and tone deliberately, then bend with rate/pitch.
Chinese voices (Volcano preferred, edge-tts fallback)
Volcano's `zh_female_gaolengyujie_moon_bigtts` (高冷御姐, calm, `speech_rate=-8`) is the validated baseline for mature contemplative female speakers, equivalent to or better than any edge-tts option for that profile. See the Volcano speaker table above for the rest.
edge-tts catalog (Chinese):
| Voice | Gender | Default feel |
|---|---|---|
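| `zh-CN-XiaoxiaoNeural` | F | Warm, news/novel |
| | F | Lively, young |
| `zh-CN-YunjianNeural` | M | Passionate, sports |
| `zh-CN-YunxiNeural` | M | Sunshine, lively |
| `zh-CN-YunyangNeural` | M | Professional newsreader |
| | F | Friendly, slightly mature |
| | F | Friendly |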
English voices (edge-tts neural, all multilingual)
All voices below speak fluent American/British/Australian English; the `*Multilingual*` ones also handle Spanish names, French/Italian loanwords, etc. without mispronunciation.
| Voice | Gender | Default feel |
|---|---|---|
| `en-US-AvaMultilingualNeural` | F | Best for warm/mature/caring; natural for spiritual or coaching content |
| `en-US-EmmaMultilingualNeural` | F | Cheerful, conversational, younger |
| `en-US-AndrewMultilingualNeural` | M | Warm, confident, sincere |
| `en-US-BrianMultilingualNeural` | M | Approachable, casual |
| `en-US-AriaNeural` | F | Crisp newsreader |
| `en-US-GuyNeural` | M | Steady male newsreader |
| | F | British female (RP) |
| | M | British male (RP) |
| | M | Australian male |
| | F | Mature European female who also reads English |
For matching a mature contemplative Spanish female (this skill's canonical use case), start with `en-US-AvaMultilingualNeural` at `--rate -5% --pitch -3Hz`. Do not use the news-style `Aria` or `Guy` for spiritual content; they sound clinical.
Picking heuristics
- Mature contemplative female speaker (yoga/spirituality/coaching): `zh-CN-XiaoxiaoNeural` with `--rate=-8% --pitch=-10Hz` (or Volcano `gaolengyujie`).
- Mature professional male: `zh-CN-YunyangNeural` with `--rate=-5%`. Avoid Yunjian/Yunxi (too energetic).
- Young casual speaker: defaults; no pitch shift.
- Western-mouth feel: one of the `*MultilingualNeural` voices.
Always sample before committing
🛑 Checkpoint: sample before full dub. A full-video dub is the most expensive step (TTS API calls + atempo + ffmpeg mux). Before running `dub.py` over the whole SRT:
- Pick the longest-text cue (worst stretch case) and one short/casual cue (timbre check).
- Synthesize 3–4 voice/rate/pitch combos at 3–8s each.
- Show the user the audio panel and ask: "选哪个 voice?rate/pitch 要调吗?确认后我再跑全片。" ("Which voice? Any rate/pitch changes? I'll run the full video once you confirm.") Wait for an explicit pick.
Skip the checkpoint only if the user named a specific voice up front AND has already heard a sample of that voice on this video.
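Choosing the two sample cues can be automated. The sketch below is illustrative only: `parse_srt` and `pick_sample_cues` are hypothetical helpers (not part of `scripts/sample_voices.py`), and "short/casual" is approximated as simply the cue with the shortest text.

```python
import re

# index line, timestamp line, then the cue text up to a blank line or end of input
CUE_RE = re.compile(
    r"(\d+)\s*\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n"
    r"(.*?)(?=\n\s*\n|\Z)",
    re.S,
)


def parse_srt(text):
    """Return a list of (index, start, end, text) tuples from SRT text."""
    cues = []
    for m in CUE_RE.finditer(text.strip()):
        body = " ".join(line.strip() for line in m.group(4).splitlines())
        cues.append((int(m.group(1)), m.group(2), m.group(3), body))
    return cues


def pick_sample_cues(cues):
    """Pick the longest-text cue (stretch check) and the shortest one (timbre check)."""
    longest = max(cues, key=lambda c: len(c[3]))
    shortest = min(cues, key=lambda c: len(c[3]))
    return longest, shortest
```

Synthesizing just these two cues with each candidate voice keeps the checkpoint to a handful of TTS calls.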
The skill's `scripts/sample_voices.py` (if present) is a thin wrapper for exactly this; otherwise drive the same Python loop the dub script uses.
Mandatory smoke test before promising any Volcano voice on a new account: synth one ~5-word cue with that speaker ID first; only quote it to the user if the smoke test returns a non-empty MP3. If the smoke test 401s with `code=45000010` ("grant not found"), tell the user they need to 开通 (activate) the resource in the 火山引擎 console; do not pretend it'll work after a retry.
Running dub.py
```bash
.venv/bin/python ~/.claude/skills/wjs-dubbing-video/scripts/dub.py [voice] [rate] [pitch]
# Mature Chinese contemplative female (Volcano):
.venv/bin/python ~/.claude/skills/wjs-dubbing-video/scripts/dub.py \
  zh_female_gaolengyujie_moon_bigtts -8% +0Hz
# Warm English caring female (edge-tts, multilingual):
.venv/bin/python ~/.claude/skills/wjs-dubbing-video/scripts/dub.py \
  en-US-AvaMultilingualNeural -5% -3Hz
# Default Chinese fallback (no Volcano creds needed):
.venv/bin/python ~/.claude/skills/wjs-dubbing-video/scripts/dub.py \
  zh-CN-XiaoxiaoNeural -8% -10Hz
```
The script:
- Reads the SRT (auto-detects `*.zh-CN.srt`, `*.en.srt`, etc., or pass `--srt`).
- Synthesizes one MP3 per cue under `dub_work/seg_NN.mp3`.
- Probes each clip's actual duration with `ffprobe`.
- For each cue: if TTS is longer than the SRT slot, chains `atempo` filters to speed it up; if shorter, pads with silence after.
- Inserts silence segments for SRT gaps and any trailing tail so the output audio length exactly matches the source video.
- Muxes the new audio into `*_zh_dub.mp4` / `*_en_dub.mp4`, keeping the original video stream by `-c:v copy`.
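Two of these steps are easy to get subtly wrong: the chained speed-up and the silence layout. The sketch below shows one way to do both; it is not the code in `dub.py`. `atempo_chain` and `plan_timeline` are hypothetical names, and keeping each `atempo` step between 0.5 and 2.0 is a conservative assumption about the filter. For a cue whose TTS overruns its slot, the factor to pass is `tts_dur / slot_dur`.

```python
def atempo_chain(factor):
    """Build an ffmpeg audio filter string that changes tempo by `factor`.

    Each atempo step is kept within [0.5, 2.0]; larger or smaller factors
    are split across several chained steps.
    """
    if factor <= 0:
        raise ValueError("tempo factor must be positive")
    steps, remaining = [], float(factor)
    while remaining > 2.0:
        steps.append(2.0)
        remaining /= 2.0
    while remaining < 0.5:
        steps.append(0.5)
        remaining /= 0.5
    steps.append(remaining)
    return ",".join(f"atempo={s:.4f}" for s in steps)


def plan_timeline(cues, video_dur):
    """Lay out silence and cue slots so the durations sum to `video_dur`.

    `cues` is a list of (start_sec, end_sec) pairs, sorted and non-overlapping.
    Returns (kind, cue_index, duration) tuples; kind is "silence" or "cue",
    and cue_index is None for silence.
    """
    plan, cursor = [], 0.0
    for i, (start, end) in enumerate(cues):
        if start > cursor:
            plan.append(("silence", None, start - cursor))  # gap before this cue
        plan.append(("cue", i, end - start))
        cursor = end
    if video_dur > cursor:
        plan.append(("silence", None, video_dur - cursor))  # trailing tail
    return plan
```

Because every second of the video is assigned to either a cue slot or a silence segment, concatenating the pieces in order gives an audio track of the same length as the source.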
Output: `<source-stem>_<lang>_dub.mp4` (e.g., `entrevista_zh_dub.mp4`). This is the input for the next step, `/wjs-burning-subtitles/render.py`, which composites the final video.
Filling awkward silences
Mandarin takes 60–80% of the time Spanish does to say the same thing. With strict cue-by-cue timing, that leaves awkward 2–4s silences at the end of most cues. English is closer to ~85% of Spanish. Three levers, in increasing impact:
- Slow the native TTS rate. Changing `--rate` from `+0%` to somewhere between `-12%` and `-15%` produces clean, natural-sounding slower speech (much better than time-stretching afterward). Try `-12%` first; `-15%` / `-20%` for very contemplative content.
- Mild slow-stretch per cue. When a cue's TTS is still shorter than its slot, run `atempo` between 0.82× and 0.95×. `dub.py` does this automatically: when slack > 0.5s, it sets `atempo = max(0.82, tts_dur / target_dur)` and pads the remainder. Below 0.82× the voice starts sounding drugged; above 0.92× the stretch is essentially imperceptible.
- Expand the target-language text in the worst cues. When the slot is so long that even 0.82× stretch leaves >2s of silence, the cleanest fix is to lengthen the translation. Add natural Mandarin particles ("嗯,", "其实", "也就是说", "你知道") or unpack a compressed phrase into its full meaning. This changes the on-screen subtitle, so confirm with the user before doing it. Edit the SRT, then regenerate just those segments by deleting their `dub_work/seg_NN.mp3` and re-running `dub.py`.
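The second lever is plain arithmetic, and working it through shows when the third lever is needed. This is a small sketch of the rule as stated above, not the bundled code: `fit_short_cue` and the constant names are hypothetical, while the 0.82 floor, the 0.5 s slack threshold, and the 2 s silence limit come from the text.

```python
MIN_TEMPO = 0.82        # below this the voice starts sounding drugged
SLACK_THRESHOLD = 0.5   # seconds of slack tolerated before stretching
MAX_SILENCE = 2.0       # more leftover silence than this suggests text expansion


def fit_short_cue(tts_dur, target_dur):
    """Return (atempo, pad_seconds, needs_text_expansion) for a short cue."""
    slack = target_dur - tts_dur
    if slack <= SLACK_THRESHOLD:
        # Small slack: leave the clip untouched and pad with silence only.
        return 1.0, max(0.0, slack), False
    tempo = max(MIN_TEMPO, tts_dur / target_dur)
    stretched = tts_dur / tempo          # clip duration after slowing down
    pad = max(0.0, target_dur - stretched)
    return tempo, pad, pad > MAX_SILENCE


if __name__ == "__main__":
    # A 2 s clip in a 6 s slot hits the 0.82 floor and still leaves a long gap.
    print(fit_short_cue(2.0, 6.0))
```

A cue that comes back with the last flag set is a candidate for lengthening the translation rather than stretching further.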
Combine the levers: native `-12%` rate + stretch-to-fit handles ~80% of cases. Reserve text expansion for the 2–3 worst outliers.
Multi-speaker dubbing (opt-in)
Only invoke this section when the user explicitly says the source has multiple speakers ("interview", "two people", "dialogue", "separate the speakers", "different voice for each", or a direct request to do diarization).
When triggered, generate the dub with a different voice per speaker so the listener can follow who's speaking. Two paths:
Path 1 (recommended for on-camera speakers): visual diarization
Script: `scripts/visual_diarize.py`
```bash
uv pip install --python .venv/bin/python mediapipe opencv-python
.venv/bin/python ~/.claude/skills/wjs-dubbing-video/scripts/visual_diarize.py \
  --video input.mp4 --srt input.en.srt \
  --out input.en.diarized.srt \
  --report diarization_report.json \
  --sample-fps 5 --num-speakers 2
```
How it works:
- Samples N frames per second (default 5).
- Runs MediaPipe FaceLandmarker (Tasks API) for up to `--num-speakers` faces per frame, 478 landmarks each.
- Measures mouth aperture per face as the vertical distance between inner upper lip (idx 13) and inner lower lip (idx 14).
- Bins faces by horizontal screen position (x-quantiles) → speakers `A`, `B`, ... left-to-right.
- For every cue's [start, end] window, integrates per-speaker frame-to-frame mouth-aperture change. Highest mover wins the tag.
- Writes a `[A]`/`[B]`-prefixed SRT plus a JSON report with per-cue scores and a confidence ratio (winner / runner-up).
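The scoring step at the end of that list can be shown without MediaPipe. The sketch below assumes the per-face mouth apertures have already been measured and binned into speakers; `mouth_motion` and `tag_cue` are hypothetical names, not functions from `visual_diarize.py`, and the data layout is chosen for illustration.

```python
def mouth_motion(samples, start, end):
    """Sum of absolute frame-to-frame aperture change inside [start, end].

    `samples` is a time-sorted list of (timestamp_sec, aperture) pairs for one face.
    """
    inside = [a for t, a in samples if start <= t <= end]
    return sum(abs(b - a) for a, b in zip(inside, inside[1:]))


def tag_cue(tracks, start, end):
    """Return (winning_speaker, confidence_ratio) for one cue window.

    `tracks` maps a speaker label ("A", "B", ...) to its aperture samples.
    The ratio is winner score / runner-up score (infinite if the runner-up is 0).
    """
    scores = {spk: mouth_motion(s, start, end) for spk, s in tracks.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    winner, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    ratio = best / runner_up if runner_up > 0 else float("inf")
    return winner, ratio


if __name__ == "__main__":
    tracks = {
        "A": [(0.0, 0.01), (0.2, 0.05), (0.4, 0.01), (0.6, 0.05)],  # mouth opening and closing
        "B": [(0.0, 0.02), (0.2, 0.02), (0.4, 0.03), (0.6, 0.02)],  # nearly still
    }
    print(tag_cue(tracks, 0.0, 0.6))
```

Summing absolute change rather than raw aperture matters: a mouth held open scores zero, while a talking mouth accumulates motion across the window.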
On first run, downloads the FaceLandmarker model (~3.6 MB) to `/tmp/mp_models/face_landmarker.task`.
Visual is materially better than guessing from text. In one validation, manual text-based labels split 6/50 between speakers; visual diarization showed the actual split was 29/27. Text-based guessing was wildly wrong because both people take similar-shaped turns. Always prefer visual when the speakers are on camera.
Spot-check low-confidence cues. Any cue in the JSON report with `confidence_ratio < 1.5` is borderline, usually overlapping speech or one speaker briefly off-frame. Hand-correct before dubbing.
Path 2 (fallback): manual tagging
For very short clips (1–2 minutes), or when speakers are off-camera, or when visual diarization fails:
```text
1
00:00:00,000 --> 00:00:03,400
[A] So what about that AI rewrite thing?

2
00:00:03,400 --> 00:00:08,200
[B] Right — let me explain the workflow.
```
Save as `*.tagged.srt`. Keep the clean SRT (without tags) for downstream burn-in via `/wjs-burning-subtitles`.
Routing voices in dub.py
Pass `--voice-map` with `speaker=voice` pairs. The positional voice arg is the default for cues with no tag.
```bash
.venv/bin/python ~/.claude/skills/wjs-dubbing-video/scripts/dub.py \
  en-US-AndrewMultilingualNeural -3% +0Hz \
  --srt input.en.tagged.srt \
  --voice-map "A=en-US-BrianMultilingualNeural,B=en-US-AndrewMultilingualNeural"
```
Voice-pairing tips:
- Two of the same gender: pick voices with audibly different timbre. Brian (casual) + Andrew (warm) works for two American males. Ava (warm female) + Emma (cheerful female) for two females.
- Mixed gender: Ava + Andrew is a clean default.
- Accent contrast: pair `en-US-` and `en-GB-` for distinctness.
- Chinese: mix Volcano voices like `zh_female_gaolengyujie_moon_bigtts` (mature) + `zh_female_kailangjiejie_moon_bigtts` (warm sister).
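How a `--voice-map` string and the `[A]`/`[B]` tags combine can be sketched as follows. This is an illustration under assumptions, not the parser in `dub.py`: `parse_voice_map` and `route_cue` are hypothetical names, and the sketch assumes tags are a single capital letter at the very start of the cue text.

```python
import re

TAG_RE = re.compile(r"^\[([A-Z])\]\s*")


def parse_voice_map(spec):
    """Parse "A=voice1,B=voice2" into {"A": "voice1", "B": "voice2"}."""
    mapping = {}
    for pair in spec.split(","):
        if not pair.strip():
            continue
        speaker, _, voice = pair.partition("=")
        if not speaker.strip() or not voice.strip():
            raise ValueError(f"bad voice-map entry: {pair!r}")
        mapping[speaker.strip()] = voice.strip()
    return mapping


def route_cue(text, voice_map, default_voice):
    """Return (voice, clean_text): strip a leading [X] tag and look up its voice."""
    m = TAG_RE.match(text)
    if not m:
        return default_voice, text  # untagged cue: use the positional voice
    return voice_map.get(m.group(1), default_voice), text[m.end():]
```

Stripping the tag before synthesis keeps the TTS engine from reading "A" or "B" aloud.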
Limits
Visual diarization fails when:
- A speaker is consistently off-camera while talking.
- Camera cuts or zooms make face position unstable across cues.
- Three or more speakers sit at similar horizontal positions (x-quantile binning is too coarse — switch to k-means on (x, y) or use audio-based diarization instead).
For audio-only material (podcasts, voice-overs), fall back to `pyannote.audio` or `whisperx --diarize`. This skill does not yet bundle audio-based diarization.
Output
- `<source-stem>_<lang>_dub.mp4`: video stream-copied from source, audio replaced with the time-aligned dub track. Drop-in input for `/wjs-burning-subtitles/render.py`.
- `dub_work/seg_NN.mp3`: per-cue TTS clips (kept for resume / per-cue regen).
Downstream
- `/wjs-burning-subtitles`: to mix the original audio as a low-volume bed, burn the SRT, or both. The final encode happens there in one ffmpeg pass (no cascade). Pass `--video <source.mp4> --dub <source_lang_dub.mp4> [--srt <srt>]` to its `render.py`.
- The dub-only file (`*_<lang>_dub.mp4`) is technically a finished video and can ship as-is, but it sounds dubbed (because it is). Mixing the original underneath gives the "professional translation" feel; do that in `/wjs-burning-subtitles`.
Anti-patterns
- ❌ Calling `uvx edge-tts` once per cue. Spawns a Python process each time; the Bing endpoint rate-limits or RSTs mid-render. Use the persistent library path in `dub.py`.
- ❌ Trusting `audio_source` without listening. Always sample a 30 s clip before committing.
- ❌ Stretching below 0.82× atempo. Voice starts sounding drugged. Add silence padding or expand text instead.
- ❌ Tagging single-speaker SRTs with `[A]`. Wastes time and the dub sounds the same. Default to one voice.
- ❌ Promising a Volcano voice without smoke-testing it on the user's instance. The doc lists many voices that error with `code=55000000` against typical SeedTTS-2.0 starter bundles. Always synth a 5-word smoke test before quoting.
- ❌ Parsing Volcano response as one JSON document. It's streaming NDJSON; the success terminator is `code=20000000`, not `code=0`. Concatenate base64-decoded `data` chunks for the full MP3.
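The first anti-pattern is avoided by keeping one Python process alive and retrying transient failures inside it. The following is a minimal sketch of that pattern, not the loop in `dub.py`: `synth_with_retry` and `synth_cue` are hypothetical names, and the attempt count and exponential backoff are assumptions. It relies only on `edge_tts.Communicate(text, voice, rate=..., pitch=...)` and its async `save()` method.

```python
import asyncio


async def synth_with_retry(synth, attempts=4, base_delay=1.0):
    """Call the zero-argument coroutine factory `synth` until it succeeds.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last error if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return await synth()
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)


async def synth_cue(text, voice, out_path, rate="+0%", pitch="+0Hz"):
    """Synthesize one cue to an MP3 file with edge-tts, retrying on failure."""
    import edge_tts  # imported lazily so the retry helper works without it

    async def once():
        await edge_tts.Communicate(text, voice, rate=rate, pitch=pitch).save(out_path)
        return out_path

    return await synth_with_retry(once)
```

A whole SRT can then be processed with a single `asyncio.run(...)` call that awaits `synth_cue` for each cue in turn, so only one process ever talks to the endpoint.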