wjs-transcribing-audio


Spoken audio in → timestamped SRT in the same language out. This skill stops at the source-language SRT. Translation to another language is the next skill (`/wjs-translating-subtitles`).

When to use

User provides a video or audio file and wants a transcript / SRT in the source language.
User already has a translated SRT and the source SRT is missing.
User asks "做 SRT" / "make subtitles" / "出逐字稿" with no translation step requested yet.

When NOT to use

Source-language SRT already exists → skip straight to `/wjs-translating-subtitles`.
User wants the transcript in a different language than spoken → run this skill first, then `/wjs-translating-subtitles`.
User wants only the dub or burn-in → if SRT exists, skip; otherwise run this first.

Routing: which engine

| Source language | Default engine | Why |
| --- | --- | --- |
| Chinese (zh-CN, zh-HK, zh-TW) | Volcano (豆包) ASR | Materially better accuracy than Whisper for Chinese; the user's standing preference |
| Any other (es, en, pt, fr, it, ja, ko, …) | OpenAI Whisper API with word-level granularity | Whisper's multilingual support is strong; word timestamps let us assemble cues ourselves |
| Offline / no API access | Local `openai-whisper` (medium) | Quality floor; same loop/blob failure modes apply |

For Chinese, do not default to Whisper unless the user explicitly asks for it or Volcano is unavailable. This is a deliberate routing decision — see user's memory on Chinese ASR priority.
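
The routing rules above can be written down as a small function. This is a sketch, not part of the skill: the function name `pick_engine`, its parameters, and the engine labels it returns are all made up for illustration.

```python
def pick_engine(language, online=True, volcano_available=True,
                user_wants_whisper=False):
    """Pick an ASR engine label from the routing rules (labels are hypothetical)."""
    if not online:
        # Offline / no API access: local openai-whisper is the quality floor.
        return "local-whisper"
    is_chinese = language.lower().startswith("zh")
    if is_chinese and volcano_available and not user_wants_whisper:
        # Chinese defaults to Volcano unless the user asks for Whisper
        # or Volcano is unavailable.
        return "volcano"
    # Everything else, plus the Chinese fallback, goes to the Whisper API.
    return "whisper-api"
```

For example, `pick_engine("zh-TW", volcano_available=False)` falls through to the Whisper API path, which matches the fallback described for Chinese.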

OpenAI Whisper API path (non-Chinese, and Chinese fallback)

The key principle: do not request `response_format=srt`. Whisper's cue segmentation fails on long monologues (30-second blob cues) and quiet stretches (loop hallucinations). Request word-level timestamps and assemble cues yourself; the post-processing is deterministic and free.

Why not response_format=srt

Two failure modes wreck `whisper-1` SRT output on long content:

30-second blob cues. In long monologues, `whisper-1` with `response_format=srt` can emit one cue covering a full 30-second decoding window. The transcript is fine; the timing is unusable for on-screen reading.
Loop hallucination on quiet tails. Greedy decoding (`temperature=0`) on low-energy audio can produce a phrase such as "你如果不把拥抱浪费写在这上面,你很难的" repeated 50 times.

Both stem from letting Whisper decide cue boundaries. Fix: word-level timestamps + your own punctuation-aware assembler.

Calling the API

```bash
# 1. Compress for upload: 64kbps mono MP3 is plenty for speech.
#    OpenAI's limit is 25MB per request; chunk into 10-min pieces
#    (≈4.5MB at 64kbps) for resilience under flaky proxies.
ffmpeg -hide_banner -loglevel error -y \
  -ss <start> -t 600 -i input.mp4 \
  -vn -ac 1 -ar 16000 -c:a libmp3lame -b:a 64k chunk.mp3
```

```python
# 2. Request word-level timestamps. Do NOT request response_format=srt.
import os

import httpx

data = {
    "model": "whisper-1",
    "language": "es",                  # pin source language; never auto-detect
    "response_format": "verbose_json",
    # ← the critical flag; "segment" is requested alongside "word" so that
    # both words[] and segments[] come back in the response
    "timestamp_granularities[]": ["word", "segment"],
    "temperature": "0.2",              # enable fallback chain (anti-loop)
}
with open("chunk.mp3", "rb") as f:
    r = httpx.post(
        "https://api.openai.com/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        data=data,
        files={"file": ("chunk.mp3", f, "audio/mpeg")},
        timeout=600.0,
    )
r.raise_for_status()
j = r.json()
words = j["words"]        # [{"word": "hola", "start": 0.12, "end": 0.34}, ...]
segments = j["segments"]  # see surprise below
```

Surprise: words[] has no punctuation, segments[] is inconsistent

Whisper's `words[]` array typically has no punctuation in `word["word"]`: each entry is a bare token like `"做"`, `"个"`, `"测"`, `"试"`. Punctuation, when present, lives only in the `text` field of `segments[]`.

Worse, `segments[]` text is inconsistently punctuated across chunks of the same file: chunk 0 of a 79-min podcast might emit 285 bare segments ("做个测试" "你在" "呵呵") at 1-2s each with no punctuation; chunk 7 might emit 34 segments at 14-30s each with punctuation. Both behaviors show up within the same file, from the same API.

So the right recipe combines both: use `segments[]` for natural pause boundaries (already aligned to breath), but treat them as raw input to your own cue assembler, which uses word timestamps to split anywhere the segments are too long.
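
Splitting an overlong stretch by word timestamps can be sketched as follows. This is an illustration under assumptions, not the skill's own code: the name `split_at_gaps` is hypothetical, the input is assumed to be a list of dicts shaped like the API's `words[]` entries (`word`, `start`, `end`), and the cue text is rebuilt from the bare tokens, so any punctuation that lived in `segments[]` is not carried over.

```python
def split_at_gaps(words, max_chars=18, joiner=""):
    """Recursively split a run of words at its largest inter-word gap.

    Returns (start, end, text) tuples whose text is at most max_chars long,
    except for single words that are longer than the cap on their own.
    Use joiner=" " for space-delimited languages.
    """
    if not words:
        return []
    text = joiner.join(w["word"].strip() for w in words)
    if len(text) <= max_chars or len(words) < 2:
        return [(words[0]["start"], words[-1]["end"], text)]
    # gaps[i] is the silence between words[i] and words[i + 1].
    gaps = [words[i + 1]["start"] - words[i]["end"]
            for i in range(len(words) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__) + 1
    # Each half has strictly fewer words, so the recursion terminates.
    return (split_at_gaps(words[:cut], max_chars, joiner)
            + split_at_gaps(words[cut:], max_chars, joiner))
```

Cutting at the largest gap first tends to put boundaries where the speaker actually paused, which is why the recipe prefers it over cutting at a fixed character count.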

Cue assembly recipe

```python
TARGET_DUR = 3.0   # try to make cues this long
MAX_CUE_DUR = 5.0  # never exceed
MAX_CHARS = 18     # ~one line at Fontsize 14 on 1080-wide vertical
MAX_GAP = 1.0      # silence threshold → force cue boundary
MIN_PIECE = 0.3    # below this, merge with neighbor
SPLIT_PUNCT = set("，。！？；,.;!?")

# Step A: merge short segments[] toward TARGET_DUR (use segments,
# not words: Whisper's segment boundaries are already pause-aligned).
# joiner="" suits Chinese; pass joiner=" " for space-delimited languages,
# otherwise merged segments run together without spaces.
def assemble(segments, offset, joiner=""):
    cues, buf = [], []

    def flush():
        if buf:
            cues.append((buf[0]["start"] + offset, buf[-1]["end"] + offset,
                         joiner.join(s["text"].strip() for s in buf)))
            buf.clear()

    for s in segments:
        dur = s["end"] - s["start"]
        # Long single segment WITH internal punct → split standalone.
        # split_long_segment is a helper that is not shown here.
        if dur > MAX_CUE_DUR and any(c in s["text"] for c in SPLIT_PUNCT):
            flush()
            cues.extend(split_long_segment(s, offset))
            continue
        if not buf:
            buf.append(s)
            continue
        if ((s["start"] - buf[-1]["end"]) >= MAX_GAP
                or (buf[-1]["end"] - buf[0]["start"]) >= TARGET_DUR
                or (s["end"] - buf[0]["start"]) > MAX_CUE_DUR):
            flush()
        buf.append(s)
    flush()
    return cues

# Step B: final pass. Split every internal comma/period to its own cue
# (proportional timestamps by char position). Coalesce pieces
# shorter than MIN_PIECE forward.

# Step C: any cue still > MAX_CHARS gets split at the largest inter-word
# gap using words[] timestamps. Recursive until under cap.
```


Tweak `TARGET_DUR` and `MAX_CHARS` to platform reading rhythm. The 18-char cap matters for burn-in on vertical 1080×1920 at `Fontsize=14` — longer wraps to multiple unreadable lines.
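
Step B is only described in comments, and `split_long_segment` is called by the assembler without being defined. A minimal sketch of one way to fill both in, assuming cues are `(start, end, text)` tuples as in `assemble`. The name `split_on_punct` and the behavior given to `split_long_segment` here are guesses for illustration, not the author's implementation.

```python
SPLIT_PUNCT = set("，。！？；,.;!?")
MIN_PIECE = 0.3  # seconds; shorter pieces are merged into a neighbor


def split_on_punct(start, end, text):
    """Split one cue after each punctuation mark, timing pieces by char share."""
    text = text.strip()
    if not text or end <= start:
        return [(start, end, text)]
    # 1. Cut after every punctuation mark, keeping the mark with its clause.
    pieces, buf = [], ""
    for ch in text:
        buf += ch
        if ch in SPLIT_PUNCT:
            pieces.append(buf)
            buf = ""
    if buf:
        pieces.append(buf)
    # 2. Give each piece a time slice proportional to its share of characters.
    total, dur, pos = len(text), end - start, 0
    timed = []
    for piece in pieces:
        p_start = start + dur * pos / total
        pos += len(piece)
        timed.append([p_start, start + dur * pos / total, piece])
    # 3. Coalesce pieces shorter than MIN_PIECE forward into the next piece;
    #    a short final piece has no successor, so it folds into the previous one.
    merged = []
    for i, cur in enumerate(timed):
        too_short = cur[1] - cur[0] < MIN_PIECE
        if too_short and i < len(timed) - 1:
            timed[i + 1][0] = cur[0]
            timed[i + 1][2] = cur[2] + timed[i + 1][2]
        elif too_short and merged:
            merged[-1][1] = cur[1]
            merged[-1][2] += cur[2]
        else:
            merged.append(cur)
    return [(s, e, t.strip()) for s, e, t in merged]


def split_long_segment(segment, offset):
    """Hypothetical helper: apply split_on_punct to one segment dict."""
    return split_on_punct(segment["start"] + offset,
                          segment["end"] + offset,
                          segment["text"])
```

Proportional timing is only an approximation of where each clause was spoken, and the splitter also cuts at a period inside a number such as "3.5"; both are limits of this sketch.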


Operational details

Auth: credentials live in `~/code/.env`. Load with `set -a; source ~/code/.env; set +a` before invoking.
SOCKS proxy on this machine: `httpx` needs the `socksio` extra. Use `uvx --with httpx --with socksio python ...` (without it you get `ImportError: Using SOCKS proxy, but the 'socksio' package is not installed`).
Chunking: 10-min pieces at 64kbps mono MP3 (~4.5MB each) are the reliability sweet spot. 20-min chunks (~9MB) sometimes RST under flaky proxies. Concurrency `max_workers=2` is more reliable than `4`.
Retry: every API call is wrapped in 5× exponential backoff (`time.sleep(min(2**n, 30))`), because `RemoteProtocolError: Server disconnected` is common and transient.
Offset stitching: each chunk's words come back with timestamps relative to that chunk. When merging, add the chunk's absolute start offset to every word's `start`/`end` before assembling cues.
Loop guard (belt + suspenders): even with `temperature=0.2`, occasionally a sub-chunk still loops. After assembly, run a loop detector on each cue's text: if any phrase of length 8–40 chars repeats 3+ times consecutively, drop the cue.
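
The retry and loop-guard items can be sketched in a few lines. The names `with_retry`, `has_loop`, and `drop_looping_cues` are hypothetical, and two choices here are assumptions rather than something the text specifies: whitespace is removed before matching so that space-separated repeats are caught, and the exception types to retry on are left as a parameter.

```python
import re
import time


def with_retry(call, attempts=5, retry_on=(Exception,), sleep=time.sleep):
    """Run call(); on failure wait min(2**n, 30) seconds, up to `attempts` tries."""
    for n in range(attempts):
        try:
            return call()
        except retry_on:  # in real use, narrow this, e.g. to httpx.TransportError
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(min(2 ** n, 30))


# A phrase of 8-40 chars immediately followed by at least two more copies.
LOOP_RE = re.compile(r"(.{8,40}?)\1{2,}", re.DOTALL)


def has_loop(text):
    """True if some 8-40 char phrase repeats 3+ times back to back."""
    compact = re.sub(r"\s+", "", text)  # assumption: ignore whitespace
    return LOOP_RE.search(compact) is not None


def drop_looping_cues(cues):
    """Keep only (start, end, text) cues whose text shows no loop."""
    return [cue for cue in cues if not has_loop(cue[2])]
```

A typical call would wrap the transcription request in a zero-argument function and pass it to `with_retry`, then run `drop_looping_cues` on the assembled cue list before writing the SRT.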

Anti-patterns (do not do)

❌ Do not request `response_format=srt` for content longer than ~2 minutes.
❌ Do not "fix" bad cues with a second API call. If you got blob cues or loop hallucinations from your first call, redo with word-level granularity once; don't re-transcribe just the broken sub-range.
❌ Do not use `temperature=0` on potentially-quiet audio (yoga, spiritual content, podcast outros). Greedy decoding loops. `0.2` enables the fallback chain.
❌ Do not skip `language=...`. Auto-detect occasionally swaps Chinese→Japanese or Spanish→Portuguese on the first 30 seconds and the whole transcript is then wrong.

Volcano (豆包) ASR path — preferred for Chinese

Volcano ASR routinely beats Whisper on Mandarin accuracy (recognition rate, punctuation, named entities). Use this as the default for `zh-*` source.

Endpoint and auth: see the user's `lark-minutes` and 豆包 ASR docs. Credentials are in `~/code/.env` (`VOLC_ASR_APPID`, `VOLC_ASR_ACCESS_TOKEN`).
For long-form Chinese audio (>10 min): use the user's bundled scripts under `scripts/` if present; otherwise route through 飞书妙记 (`/lark-minutes`), which is the user's standing fallback for `.m4a` / `.mp3` / `.mp4` → SRT.

If Volcano isn't available on this machine, fall back to the OpenAI Whisper API with the same word-level recipe above: pin `language=zh` and follow the cue assembly steps.

Local Whisper as last resort

Use only when offline, when the API quota is exhausted, or for ultra-cheap rough drafts. Quality is materially lower for Chinese, and the same blob/loop failure modes apply. The SRT output produced by the command below carries no word-level timestamps, so the principled fix (word timestamps plus your own assembler) isn't applied on this path.

```bash
ffmpeg -i input.mp4 -vn -ac 1 -ar 16000 -c:a pcm_s16le _audio.wav -y
uvx --from openai-whisper whisper _audio.wav \
    --language zh --task transcribe \
    --model medium --output_format srt --output_dir .
rm _audio.wav
```

`medium` is the practical floor for Chinese accuracy; `small` is OK only for clean studio English. Whisper writes milliseconds; the file is still valid SRT. If you regenerate the SRT, always emit ms.

Output

File name: `<source-stem>.srt` (no language suffix; this is the source-language SRT, the master).
Format: standard SRT, `HH:MM:SS,mmm` (comma ms), 1-indexed.
Cue rules: punctuation-bounded; 3-8s typical duration; ≤18 Chinese chars or ≤42 Latin chars per visible line.
Unclear audio: mark `[inaudible]` only when necessary; do not guess.
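
Writing cues out in this format takes only a few lines. The sketch below assumes cues are `(start_seconds, end_seconds, text)` tuples; the function names `srt_timestamp` and `to_srt` are illustrative and not part of the skill.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (comma before the milliseconds)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(cues):
    """Render (start, end, text) cues as 1-indexed SRT blocks."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Blocks are separated by one blank line.
    return "\n".join(blocks)
```

Rounding to whole milliseconds before splitting into hours, minutes, and seconds avoids outputs like `00:00:59,1000` that can appear when each field is rounded separately.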

Quality gate before handoff

Subtitle numbers are sequential
Timestamps don't overlap
Milliseconds use commas
No cue ends mid-word
No cue exceeds MAX_CHARS without an internal split
No phrase repeats 3+ times consecutively (loop residue)
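
The mechanical parts of this checklist can be automated. The sketch below covers sequential numbering, comma milliseconds, overlap, and line length only; "no cue ends mid-word" still needs a human or a language-aware check. The function name `check_srt` and its return shape (a list of problem strings) are assumptions for illustration.

```python
import re

TIME_RE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$"
)


def _to_ms(h, m, s, ms):
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def check_srt(srt_text, max_chars=18):
    """Return a list of problems found; an empty list means the gate passed."""
    problems = []
    prev_end = 0
    blocks = [b for b in re.split(r"\n\s*\n", srt_text.strip()) if b.strip()]
    for expected, block in enumerate(blocks, start=1):
        lines = block.strip().split("\n")
        if len(lines) < 3:
            problems.append(f"cue {expected}: incomplete block")
            continue
        if lines[0].strip() != str(expected):
            problems.append(f"cue {expected}: index is {lines[0].strip()!r}")
        match = TIME_RE.match(lines[1].strip())
        if match is None:
            # Also catches dot-separated milliseconds.
            problems.append(f"cue {expected}: timestamp line is not HH:MM:SS,mmm")
            continue
        parts = [int(g) for g in match.groups()]
        start, end = _to_ms(*parts[:4]), _to_ms(*parts[4:])
        if end <= start:
            problems.append(f"cue {expected}: end is not after start")
        if start < prev_end:
            problems.append(f"cue {expected}: overlaps the previous cue")
        prev_end = max(prev_end, end)
        for line in lines[2:]:
            if len(line) > max_chars:
                problems.append(
                    f"cue {expected}: line has {len(line)} chars (cap {max_chars})"
                )
    return problems
```

For Latin-script subtitles, pass `max_chars=42` to match the per-line limit given in the output rules.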

Downstream

`/wjs-translating-subtitles`: translate the source SRT to a target language with punctuation-bounded re-segmentation.
`/wjs-dubbing-video`: only if the user wants voice dub in the source language (rare); usually you translate first.
`/wjs-burning-subtitles`: only if the user wants the source-language SRT burned onto the source video (e.g., Spanish video with Spanish subs for hearing-impaired).

Common pitfalls

Sending the whole 60-minute file in one API call. OpenAI's hard limit is 25 MB and the call gets choppy at >15 min anyway. Chunk first.
Treating `segments[]` text as authoritative. It's inconsistently punctuated across chunks of the same file; never trust it without the assembler.
Letting Whisper auto-detect language. Pin every time.
Forgetting to add chunk offsets. Each API response has timestamps relative to the chunk's t=0; merging without adding the chunk's absolute start makes every cue past the first chunk wrong by minutes.
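
The offset pitfall is easy to avoid with a small stitching step. A sketch, assuming each chunk is kept as a `(chunk_start_seconds, items)` pair where `items` are dicts with `start` and `end` keys (as in `words[]` or `segments[]`); the function name `stitch` is hypothetical.

```python
def stitch(chunks):
    """Shift chunk-relative timestamps to absolute time and merge in order.

    chunks: iterable of (chunk_start_seconds, items); each item is a dict
    with at least "start" and "end", relative to the chunk's own t=0.
    """
    merged = []
    for chunk_start, items in sorted(chunks, key=lambda chunk: chunk[0]):
        for item in items:
            # Copy the item so the raw API response is left untouched.
            merged.append({
                **item,
                "start": item["start"] + chunk_start,
                "end": item["end"] + chunk_start,
            })
    return merged
```

With 10-minute chunks, the second chunk's `chunk_start_seconds` is `600.0`, the third's is `1200.0`, and so on; sorting by start time means chunks can be transcribed concurrently and still merge in order.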
