wjs-transcribing-audio
Spoken audio in → timestamped SRT in the same language out. This skill stops at the source-language SRT. Translation to another language is the next skill (`/wjs-translating-subtitles`).
When to use
- User provides a video or audio file and wants a transcript / SRT in the source language.
- User already has a translated SRT and the source SRT is missing.
- User asks "做 SRT" / "make subtitles" / "出逐字稿" with no translation step requested yet.
When NOT to use
- Source-language SRT already exists → skip straight to `/wjs-translating-subtitles`.
- User wants the transcript in a different language than spoken → run this skill first, then `/wjs-translating-subtitles`.
- User wants only the dub or burn-in → if SRT exists, skip; otherwise run this first.
Routing: which engine
| Source language | Default engine | Why |
|---|---|---|
| Chinese (zh-CN, zh-HK, zh-TW) | Volcano (豆包) ASR | Materially better accuracy than Whisper for Chinese — user's standing preference |
| Any other (es, en, pt, fr, it, ja, ko, …) | OpenAI Whisper API with word-level granularity | Whisper's multilingual is strong; word timestamps let us assemble cues ourselves |
| Offline / no API access | Local Whisper | Quality floor; same loop/blob failure modes apply |
For Chinese, do not default to Whisper unless the user explicitly asks for it or Volcano is unavailable. This is a deliberate routing decision — see user's memory on Chinese ASR priority.
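The routing table can be read as a small decision function. The sketch below is illustrative only: `pick_engine`, its parameters, and the returned labels are hypothetical names, not part of the skill.

```python
def pick_engine(lang: str, *, online: bool = True,
                volcano_available: bool = True,
                user_wants_whisper: bool = False) -> str:
    """Return a label for the ASR engine to use (labels are hypothetical)."""
    if not online:
        return "local-whisper"      # offline / no API access
    is_chinese = lang.lower().startswith("zh")
    if is_chinese and volcano_available and not user_wants_whisper:
        return "volcano"            # default for zh-CN, zh-HK, zh-TW
    return "whisper-api"            # every other language, and the Chinese fallback
```

Chinese only reaches Whisper when the user asks for it or Volcano is unavailable, which mirrors the rule stated above.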
OpenAI Whisper API path (non-Chinese, and Chinese fallback)
The key principle: do not request `response_format=srt`. Whisper's cue segmentation fails on long monologues (30-second blob cues) and quiet stretches (loop hallucinations). Request word-level timestamps and assemble cues yourself; the post-processing is deterministic and free.
Why not response_format=srt
Two failure modes that wreck `whisper-1` SRT output on long content:
- 30-second blob cues. In long monologues, `whisper-1` with `response_format=srt` emits one cue covering the full 30s `condition_on_previous_text` window. Transcript is fine; timing is unusable for on-screen reading.
- Loop hallucination on quiet tails. Greedy `temperature=0` on low-energy audio produces "你如果不把拥抱浪费写在这上面,你很难的" repeated 50 times.
Both stem from letting Whisper decide cue boundaries. Fix: word-level timestamps + your own punctuation-aware assembler.
Calling the API
```bash
# 1. Compress for upload: 64kbps mono MP3 is plenty for speech.
#    OpenAI limit is 25MB per request; chunk into 10-min pieces
#    (≈4.5MB at 64kbps) for resilience under flaky proxies.
ffmpeg -hide_banner -loglevel error -y \
  -ss <start> -t 600 -i input.mp4 \
  -vn -ac 1 -ar 16000 -c:a libmp3lame -b:a 64k chunk.mp3
```

```python
# 2. Request word-level timestamps. Do NOT request response_format=srt.
import os

import httpx

data = {
    "model": "whisper-1",
    "language": "es",                      # pin source language; never auto-detect
    "response_format": "verbose_json",
    # "word" is the critical flag; "segment" is requested as well so that
    # segments[] is also present in the response.
    "timestamp_granularities[]": ["word", "segment"],
    "temperature": "0.2",                  # enable fallback chain (anti-loop)
}
with open("chunk.mp3", "rb") as f:
    r = httpx.post(
        "https://api.openai.com/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        data=data,
        files={"file": ("chunk.mp3", f, "audio/mpeg")},
        timeout=600.0,
    )
r.raise_for_status()
j = r.json()
words = j["words"]        # [{"word": "hola", "start": 0.12, "end": 0.34}, ...]
segments = j["segments"]  # see surprise below
```
Surprise: words[] has no punctuation, segments[] is inconsistent
Whisper's `words[]` array typically has no punctuation in `word["word"]`; each entry is a bare token like `"做"`, `"个"`, `"测"`, `"试"`. Punctuation, when present, lives only in the `segments[]` `text` field.
Worse, `segments[]` text is inconsistently punctuated across chunks of the same file: chunk 0 of a 79-min podcast might emit 285 bare segments ("做个测试" "你在" "呵呵") at 1-2s each with no punctuation; chunk 7 might emit 34 segments at 14-30s each with punctuation. Both behaviors come from the same API on the same source file.
So the right recipe combines both: use `segments[]` for natural pause boundaries (already aligned to breath), but treat them as raw input to your own cue assembler, which uses word timestamps to split anywhere the segments are too long.
Cue assembly recipe
```python
TARGET_DUR  = 3.0   # try to make cues this long
MAX_CUE_DUR = 5.0   # never exceed
MAX_CHARS   = 18    # ~one line at Fontsize 14 on 1080-wide vertical
MAX_GAP     = 1.0   # silence threshold → force cue boundary
MIN_PIECE   = 0.3   # below this, merge with neighbor
SPLIT_PUNCT = set(",。!?;,.;!?")

# Step A: merge short segments[] toward TARGET_DUR (use segments,
#         not words: Whisper's segment boundaries are already
#         pause-aligned).
def assemble(segments, offset):
    cues, buf = [], []

    def flush():
        if buf:
            # "".join suits CJK text; for space-delimited languages
            # join with " " instead.
            cues.append((buf[0]["start"] + offset, buf[-1]["end"] + offset,
                         "".join(s["text"].strip() for s in buf)))
            buf.clear()

    for s in segments:
        dur = s["end"] - s["start"]
        # Long single segment WITH internal punct → split standalone
        # (split_long_segment is the Step B helper, not shown here).
        if dur > MAX_CUE_DUR and any(c in s["text"] for c in SPLIT_PUNCT):
            flush()
            cues.extend(split_long_segment(s, offset))
            continue
        if not buf:
            buf.append(s)
            continue
        if ((s["start"] - buf[-1]["end"]) >= MAX_GAP
                or (buf[-1]["end"] - buf[0]["start"]) >= TARGET_DUR
                or (s["end"] - buf[0]["start"]) > MAX_CUE_DUR):
            flush()
        buf.append(s)
    flush()
    return cues

# Step B: final pass: split every internal comma/period to its own cue
#         (proportional timestamps by char position). Coalesce pieces
#         shorter than MIN_PIECE forward.
# Step C: any cue still > MAX_CHARS gets split at the largest inter-word
#         gap using words[] timestamps. Recursive until under cap.
```
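Step B is only described in comments, and `split_long_segment` is called without being shown. A minimal sketch of one way to write it follows, assuming the segment is a dict with `start`, `end`, and `text`, and that timing each piece by its share of characters is acceptable. The body is an illustration, not the author's implementation, and it will also cut at a decimal point such as "3.5".

```python
MIN_PIECE = 0.3
SPLIT_PUNCT = set(",。!?;,.;!?")

def split_long_segment(s, offset=0.0):
    """Split one segment at punctuation; time each piece by its share of characters."""
    text = s["text"].strip()
    if not text:
        return []
    # Cut after every punctuation mark, keeping the mark with the text before it.
    pieces, cur = [], ""
    for ch in text:
        cur += ch
        if ch in SPLIT_PUNCT:
            pieces.append(cur)
            cur = ""
    if cur:
        pieces.append(cur)

    dur = s["end"] - s["start"]
    cues, pos, carry_text, carry_start = [], 0, "", None
    for i, piece in enumerate(pieces):
        start = s["start"] + dur * pos / len(text)
        pos += len(piece)
        end = s["start"] + dur * pos / len(text)
        if carry_start is None:
            carry_start = start
        carry_text += piece
        is_last = i == len(pieces) - 1
        # A piece shorter than MIN_PIECE is carried forward into the next one;
        # the last piece is always emitted.
        if end - carry_start >= MIN_PIECE or is_last:
            cues.append((carry_start + offset, end + offset, carry_text.strip()))
            carry_text, carry_start = "", None
    return cues
```

Character-proportional timing is an approximation: it assumes a roughly constant speaking rate inside the segment, which is why Step C falls back to real `words[]` timestamps.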
Tweak `TARGET_DUR` and `MAX_CHARS` to platform reading rhythm. The 18-char cap matters for burn-in on vertical 1080×1920 at `Fontsize=14` — longer wraps to multiple unreadable lines.
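Enforcing the `MAX_CHARS` cap (Step C) can be sketched as a recursive split at the widest silence between words. This assumes each cue keeps its list of `words[]` entries with absolute `start`/`end` times; `split_by_word_gap` and its `joiner` parameter are hypothetical names introduced for illustration.

```python
MAX_CHARS = 18

def split_by_word_gap(words, joiner=""):
    """Recursively split at the largest inter-word gap until each piece fits MAX_CHARS."""
    if not words:
        return []
    text = joiner.join(w["word"].strip() for w in words)
    if len(text) <= MAX_CHARS or len(words) < 2:
        return [(words[0]["start"], words[-1]["end"], text)]
    # Index k splits between words[k-1] and words[k]; pick the widest silence.
    k = max(range(1, len(words)),
            key=lambda i: words[i]["start"] - words[i - 1]["end"])
    return split_by_word_gap(words[:k], joiner) + split_by_word_gap(words[k:], joiner)
```

Use `joiner=""` for Chinese and `joiner=" "` for space-delimited languages. The widest gap can sit near an edge, so pieces may come out uneven; that trade-off favors cutting where the speaker actually paused.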
Operational details
- Auth: credentials live in `~/code/.env`. Load with `set -a; source ~/code/.env; set +a` before invoking.
- SOCKS proxy on this machine: `httpx` needs the `socksio` extra; use `uvx --with httpx --with socksio python ...` (without it you get `ImportError: Using SOCKS proxy, but the 'socksio' package is not installed`).
- Chunking: 10-min pieces at 64kbps mono MP3 (~4.5MB each) are the reliability sweet spot. 20-min chunks (~9MB) sometimes RST under flaky proxies. Concurrency `max_workers=2` is more reliable than `4`.
- Retry: every API call wrapped in 5× exponential backoff (`time.sleep(min(2**n, 30))`); `RemoteProtocolError: Server disconnected` is common and transient.
- Offset stitching: each chunk's words come back with timestamps relative to that chunk. When merging, add the chunk's absolute start offset to every word's `start`/`end` before assembling cues.
- Loop guard (belt + suspenders): even with `temperature=0.2`, occasionally a sub-chunk still loops. After assembly, run a loop-detector on each cue's text; if any phrase of length 8–40 chars repeats 3+ times consecutively, drop the cue.
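The retry rule above can be sketched as a small wrapper. `with_retry` and its parameters are hypothetical names; the injectable `sleep` argument exists only so the behavior can be exercised without waiting.

```python
import time

def with_retry(call, attempts=5, sleep=time.sleep, retry_on=(Exception,)):
    """Run call(); on failure wait min(2**n, 30) seconds and retry, up to `attempts` tries."""
    for n in range(attempts):
        try:
            return call()
        except retry_on:
            if n == attempts - 1:
                raise                   # out of attempts: surface the last error
            sleep(min(2 ** n, 30))      # 1s, 2s, 4s, 8s, capped at 30s
```

With `httpx`, passing `retry_on=(httpx.TransportError,)` should cover `RemoteProtocolError` while letting programming errors fail immediately.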
Anti-patterns (do not do)
- ❌ Do not request `response_format=srt` for content longer than ~2 minutes.
- ❌ Do not "fix" bad cues with a second API call. If you got blob cues or loop hallucinations from your first call, redo with word-level granularity once; don't re-transcribe just the broken sub-range.
- ❌ Do not use `temperature=0` on potentially-quiet audio (yoga, spiritual content, podcast outros). Greedy decoding loops. `0.2` enables the fallback chain.
- ❌ Do not skip `language=...`. Auto-detect occasionally swaps Chinese→Japanese or Spanish→Portuguese on the first 30 seconds and the whole transcript is then wrong.
Volcano (豆包) ASR path — preferred for Chinese
Volcano ASR routinely beats Whisper on Mandarin accuracy (recognition rate, punctuation, named entities). Use this as the default for `zh-*` source.
- Endpoint and auth: see the user's `lark-minutes` and 豆包 ASR docs; credentials in `~/code/.env` (`VOLC_ASR_APPID`, `VOLC_ASR_ACCESS_TOKEN`).
- For long-form Chinese audio (>10 min): use the user's bundled scripts under `scripts/` if present; otherwise route through 飞书妙记 (`/lark-minutes`), which is the user's standing fallback for `.m4a`/`.mp3`/`.mp4` → SRT.
If Volcano isn't available on this machine, fall back to OpenAI Whisper API with the same word-level recipe above: pin `language=zh`, and follow the cue assembly steps.
Local Whisper as last resort
Only when offline, the API quota is exhausted, or for ultra-cheap rough drafts. Quality is materially lower for Chinese, and the same blob/loop failure modes apply. The SRT output used here carries no word-level timestamps, so the principled fix isn't available on this path (recent openai-whisper releases do have an experimental `--word_timestamps` option, but this recipe does not use it).

```bash
ffmpeg -i input.mp4 -vn -ac 1 -ar 16000 -c:a pcm_s16le _audio.wav -y
uvx --from openai-whisper whisper _audio.wav \
  --language zh --task transcribe \
  --model medium --output_format srt --output_dir .
rm _audio.wav
```

Output
- File name: `<source-stem>.srt` (no language suffix; this is the source language SRT, the master).
- Format: standard SRT, `HH:MM:SS,mmm` (comma ms), 1-indexed.
- Cue rules: punctuation-bounded; 3-8s typical duration; ≤18 Chinese chars or ≤42 Latin chars per visible line.
- Unclear audio: mark `[inaudible]` only when necessary; do not guess.
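The format rules above can be sketched as a small renderer. It assumes cues are `(start_seconds, end_seconds, text)` tuples like those produced by the assembler; `srt_timestamp` and `to_srt` are hypothetical helper names.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (comma before the milliseconds)."""
    ms = max(0, round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(cues) -> str:
    """Render (start, end, text) tuples as 1-indexed SRT blocks separated by blank lines."""
    blocks = [f"{i}\n{srt_timestamp(a)} --> {srt_timestamp(b)}\n{text}\n"
              for i, (a, b, text) in enumerate(cues, start=1)]
    return "\n".join(blocks)
```

Working in integer milliseconds before splitting into fields avoids a rounded value such as 59.9996 s being printed with a four-digit millisecond field.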
Quality gate before handoff
- Subtitle numbers are sequential
- Timestamps don't overlap
- Milliseconds use commas
- No cue ends mid-word
- No cue exceeds MAX_CHARS without an internal split
- No phrase repeats 3+ times consecutively (loop residue)
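Part of this checklist can be automated. The sketch below covers overlap, line length, and loop residue for `(start, end, text)` cues; sequential numbering and comma milliseconds depend on how the file is rendered, and the mid-word check is left out. `quality_issues` and `LOOP_RE` are hypothetical names.

```python
import re

MAX_CHARS = 18
# A phrase of 8-40 characters that appears 3 or more times back to back.
LOOP_RE = re.compile(r"(.{8,40}?)\1{2,}", re.DOTALL)

def quality_issues(cues, max_chars=MAX_CHARS):
    """Return human-readable problems found in (start, end, text) cues."""
    issues = []
    prev_end = 0.0
    for n, (start, end, text) in enumerate(cues, start=1):
        if end <= start:
            issues.append(f"cue {n}: non-positive duration")
        if start < prev_end:
            issues.append(f"cue {n}: overlaps previous cue")
        if any(len(line) > max_chars for line in text.splitlines()):
            issues.append(f"cue {n}: line longer than {max_chars} chars")
        if LOOP_RE.search(text):
            issues.append(f"cue {n}: repeated phrase (loop residue)")
        prev_end = end
    return issues
```

An empty list means the cues passed these checks. The loop check only looks inside a single cue, matching the per-cue loop guard described earlier.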
Downstream
- `/wjs-translating-subtitles`: translate the source SRT to a target language with punctuation-bounded re-segmentation.
- `/wjs-dubbing-video`: only if the user wants voice dub in the source language (rare); usually you translate first.
- `/wjs-burning-subtitles`: only if the user wants the source-language SRT burned onto the source video (e.g., Spanish video with Spanish subs for hearing-impaired).
Common pitfalls
- Sending the whole 60-minute file in one API call. OpenAI's hard limit is 25 MB and the call gets choppy at >15 min anyway. Chunk first.
- Treating `segments[]` text as authoritative. It's inconsistently punctuated across chunks of the same file; never trust it without the assembler.
- Letting Whisper auto-detect language. Pin the source language every time.
- Forgetting to add chunk offsets. Each API response has timestamps relative to the chunk's t=0; merging without adding the chunk's absolute start makes every cue past the first chunk wrong by minutes.
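The offset pitfall is easy to avoid if chunk start times are computed once and reused for both cutting and merging. A minimal sketch, assuming 10-minute chunks and word dicts shaped like the API's `words[]` entries; `chunk_starts`, `shift`, and `stitch` are hypothetical helper names.

```python
CHUNK_SECONDS = 600  # 10-minute chunks

def chunk_starts(total_seconds, chunk_seconds=CHUNK_SECONDS):
    """Absolute start time of each chunk (the values to pass to ffmpeg -ss)."""
    starts, t = [], 0
    while t < total_seconds:
        starts.append(t)
        t += chunk_seconds
    return starts

def shift(items, offset):
    """Copy words or segments with start/end moved onto the full-file timeline."""
    return [{**it, "start": it["start"] + offset, "end": it["end"] + offset}
            for it in items]

def stitch(chunks):
    """chunks: ordered list of (offset_seconds, words) → one flat, absolute word list."""
    merged = []
    for offset, words in chunks:
        merged.extend(shift(words, offset))
    return merged
```

The same `shift` works for `segments[]`, so both inputs to the assembler end up on one timeline.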