subtitle-refine

Subtitle Refinement

Task Objectives

Clean the user-provided raw `srt` into a complete, deliverable `clean.srt`.

Do not overwrite the original subtitle file.
Perform subtitle-level correction only; do not polish, summarize, or expand.
Final verification: run `scripts/check_clean_srt.py` directly on the original `srt` and the complete `clean.srt`.

Original Requirements

Text Cleaning Requirements

Correct recognition errors, including correct use of the structural particles 的, 地, 得 and a sensible choice among the pronouns 他, 她, 它.
Delete meaningless filler words such as 嗯, 啊, 呃, 哈; cleaned subtitles must not contain punctuation.

Insert the necessary pause spaces between clauses. Example:

`来那你就先开始` -> `来 那你就先开始`

Example:

`你是第二个是吧` -> `你是第二个 是吧`

Delete obviously repeated words. Do not rewrite into written style, summarize, or expand; perform subtitle-level cleaning and correction only.
Each subtitle entry must not exceed 14 characters (an English word counts as one character by default), and a single entry must not contain line breaks.

If the original subtitle exceeds 14 characters, split it at accurate sentence boundaries. Example:

`还得谢谢各位母亲对家里的付出` -> `还得谢谢各位母亲` + `对家里的付出`

Text correction, filler deletion, repeated-word deletion, and splitting must not cause the subsequent subtitles to shift earlier or later as a whole. All adjustments must stay within the current entry and its local scope.
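
The length, punctuation, and line-break rules above lend themselves to a mechanical check. The following is a minimal sketch only: the function names, the `MAX_CHARS` constant, and the regular expression for what counts as an "English word" are all assumptions made for illustration, not the actual logic of `scripts/check_clean_srt.py`.

```python
import re
import unicodedata

MAX_CHARS = 14  # limit stated in the requirements above

# Assumption: an "English word" is a run of ASCII letters/digits,
# optionally joined by apostrophes or hyphens (e.g. "don't", "wi-fi").
_LATIN_WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def count_chars(text: str, latin_word_as_one_char: bool = True) -> int:
    """Count characters in an entry, ignoring pause spaces."""
    if latin_word_as_one_char:
        # Collapse every Latin word to a single placeholder character.
        text = _LATIN_WORD.sub("W", text)
    return sum(1 for ch in text if not ch.isspace())


def has_punctuation(text: str) -> bool:
    """True if any character is in a Unicode punctuation category (P*)."""
    return any(unicodedata.category(ch).startswith("P") for ch in text)


def check_entry(text: str) -> list[str]:
    """Return a list of rule violations for one cleaned entry."""
    problems = []
    if "\n" in text:
        problems.append("line break inside entry")
    if has_punctuation(text):
        problems.append("punctuation present")
    if count_chars(text) > MAX_CHARS:
        problems.append("longer than 14 characters")
    return problems
```

Pause spaces are excluded from the count here because they mark rhythm rather than displayed characters; if the user's counting standard differs, the helper would need to be adjusted accordingly.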

Timeline Requirements

Processed subtitles must stay strictly synchronized with the original audio, with no global timeline drift.
If only subtitle-level cleaning and correction is performed and the semantic boundary of the original sentence is unchanged, keep the entry's original start and end times by default.
If an entry is split, either for sentence segmentation or because it exceeds 14 characters, every sub-entry must fall entirely within the original entry's time range. Sub-entries must not overlap, should preferably run end to end, and their total covered duration must equal that of the original entry.
Time allocation among split sub-entries should follow semantic pauses and speaking rhythm first. Only when these cannot be judged precisely, allocate in proportion to the number of effective characters in each clause; do not mechanically split the time evenly.
When deleting filler words or obviously repeated words inside a sentence, normally leave the entry's start and end times unchanged. If an entry consists solely of a standalone, meaningless filler word, it may be deleted outright, but do not stretch the neighboring subtitles to cover the gap without justification.
Each subtitle should fit its speech segment as closely as possible, without appearing noticeably early or lingering. If fine-tuning is needed, keep the deviation of each start and end time from the speech boundary within about ±100 to 150 milliseconds.
After splitting and editing, readability must still be ensured: avoid overly short flashes, obviously missing subtitles, meaningless gaps, and splits that break natural semantic boundaries.
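
The fallback rule for split entries (allocate time in proportion to effective characters, end to end, inside the original range) can be sketched as follows. This is an illustrative helper under stated assumptions: the name `allocate_split_times` is hypothetical, "effective characters" is taken to mean non-space characters, and times are in milliseconds. It covers only the fallback case; semantic pauses and speaking rhythm still take priority when they can be judged.

```python
def allocate_split_times(
    start_ms: int, end_ms: int, parts: list[str]
) -> list[tuple[int, int]]:
    """Split [start_ms, end_ms] across parts in proportion to their length.

    The returned spans are contiguous, non-overlapping, and together
    cover exactly the original range.
    """
    if not parts or end_ms <= start_ms:
        raise ValueError("need at least one part and a positive duration")

    # Assumption: effective characters = non-space characters (minimum 1).
    weights = [max(1, sum(1 for ch in p if not ch.isspace())) for p in parts]
    total = sum(weights)
    duration = end_ms - start_ms

    spans = []
    cursor = start_ms
    accumulated = 0
    for i, weight in enumerate(weights):
        accumulated += weight
        if i == len(weights) - 1:
            boundary = end_ms  # last part always ends at the original end
        else:
            boundary = start_ms + round(duration * accumulated / total)
        spans.append((cursor, boundary))
        cursor = boundary
    return spans
```

Computing each boundary from the cumulative weight, rather than adding rounded durations one by one, keeps rounding errors from accumulating and pushing the last sub-entry past the original end time.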

Additional Requirements

Do not accidentally delete meaningful subtitle entries.
The timeline must be double-checked.
If the subtitle file is too long, parallelize only when the user explicitly requests parallel processing or permits parallel sub-agents. Even then, treat parallel processing only as an editing aid; the main agent must still directly assemble a complete `clean.srt` at the end.
The script should not assume that subtitles have been split; it must handle both the split and non-split cases.

Working Principles

Perform subtitle-level cleaning and correction only; do not rewrite into written style, summarize, or expand. Do not cut colloquial expressions unless they are genuine errors, filler words, obvious repeated slips of the tongue, or their removal is mandated by the rules.
Be conservative first. Unless something can be clearly identified as an ASR recognition error, a meaningless filler word, an obvious repeated slip of the tongue, a missing pause space, or a split required because the entry exceeds 14 characters, do not change the original word order, syntactic structure, or collocations, and do not substitute a smoother phrasing.
Do not rewrite a sentence merely because the result would be "more fluent", "more concise", or "more like written language". Changes such as "给我支持" to "支持我", "跟我说" to "对我说", or "就是也是" to "也是" are not allowed unless they fix a clear recognition error or are required by the rules.
Subtitles are allowed to be informal and unpolished; fix errors only, do not polish. If a sentence already satisfies the rules in its current form, keep it as is.
Deletion rules are strict. Standalone entries consisting purely of filler words may be deleted directly. Beyond that, deleting an entire entry is allowed only when it can be clearly identified as a broken fragment, a repeated fragment, or a pure noise misrecognition. All non-filler deletions must be collected into an allowlist and listed one by one in the delivery.
If the user specifies a separate character-counting standard, such as "an English word counts as one character", follow the user's requirement; you can add `--latin-word-as-one-char` when running the verification script. If the script's default standard is inconsistent with the user's requirement, explain the script's limitation and the review method actually used in the delivery.
The complete `clean.srt` produced by default prioritizes verifiability and ease of re-listening. Unless the user explicitly requests renumbering, prefer keeping the original `block_id`; when splitting a single entry, use temporary numbers such as `123a` and `123b`.
Do not borrow words from other entries, add words, or rewrite in order to "smooth out" the current entry. If the current entry is inaudible or looks like garbled ASR output but cannot be clearly corrected from this entry alone, keep the original sentence and flag it for manual re-listening; do not subjectively stitch together new text from the preceding or following subtitles.

Output Specifications

The default output file name takes the form `xxx.clean.srt`.
Do not overwrite the original file.
The final deliverable is a single complete `clean.srt`; do not treat "chunk files awaiting merging" as the formal product.
If the original numbers need to be retained after splitting for ease of verification or editing, use `123a`, `123b` directly in the complete `clean.srt`; do not rely on an additional merge script to assemble them a second time.
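
Temporary numbers such as `123a` and `123b` can be mapped back to their original entry with a small helper. The sketch below is illustrative only: `parse_block_id` and `group_by_original` are hypothetical names, and the assumption that a suffix is a single lowercase letter is inferred from the examples above rather than from the verification script.

```python
import re

# Assumption: an ID is digits plus at most one lowercase suffix letter.
_BLOCK_ID = re.compile(r"(\d+)([a-z]?)")


def parse_block_id(block_id: str) -> tuple[int, str]:
    """Split '123a' into (123, 'a'); an unsplit '124' gives (124, '')."""
    match = _BLOCK_ID.fullmatch(block_id.strip())
    if match is None:
        raise ValueError(f"unexpected block id: {block_id!r}")
    return int(match.group(1)), match.group(2)


def group_by_original(block_ids: list[str]) -> dict[int, list[str]]:
    """Group cleaned IDs under the original entry number they came from."""
    groups: dict[int, list[str]] = {}
    for block_id in block_ids:
        number, _suffix = parse_block_id(block_id)
        groups.setdefault(number, []).append(block_id)
    return groups
```

Because `parse_block_id` returns a `(number, suffix)` tuple, it can also serve as a sort key, so that `123a` and `123b` sort between `122` and `124`.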

Recommended Process

Read the original `srt` and confirm the total number of entries, whether any entries are overlong, and whether the file needs to be processed in chunks.
Settle the character-counting standard for this task's length limit; if the user gives no further instructions, check against the script's default standard.
Clean, correct, split where necessary, and control the timeline entry by entry, assembling the result directly into a complete `clean.srt`.
If the file is very long and the user allows parallel processing, assign work to sub-agents by subtitle number or time range, but the main agent must manually integrate the results into a complete `clean.srt` at the end; do not rely on a merge script.
First compile an allowlist of the non-filler entries that were deleted.
Run `scripts/check_clean_srt.py` for rule checks and timeline review.
Manually review the warnings, focusing on leftover filler words, leftover repeated slips of the tongue, possibly missing pause spaces, overly short flashes, text shifted onto the wrong entry, and the allowlist.
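
The first step (reading the raw file, counting entries, spotting overlong ones) might look like the sketch below. It is a simplified reader written for illustration, not the parser used by `scripts/check_clean_srt.py`; the names `parse_srt` and `overlong_ids` are hypothetical, and the length check counts every non-space character without collapsing English words.

```python
import re

_TIME_LINE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def _to_ms(h: str, m: str, s: str, ms: str) -> int:
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(content: str) -> list[dict]:
    """Parse SRT text into dicts with id, start_ms, end_ms, and text."""
    content = content.lstrip("\ufeff").replace("\r\n", "\n").strip()
    entries = []
    for block in re.split(r"\n\s*\n", content):
        lines = block.split("\n")
        if len(lines) < 2:
            continue  # not a full entry
        match = _TIME_LINE.match(lines[1].strip())
        if match is None:
            continue  # second line is not a timing line
        parts = match.groups()
        entries.append({
            "id": lines[0].strip(),
            "start_ms": _to_ms(*parts[:4]),
            "end_ms": _to_ms(*parts[4:]),
            "text": "\n".join(lines[2:]),
        })
    return entries


def overlong_ids(entries: list[dict], limit: int = 14) -> list[str]:
    """IDs of entries whose non-space character count exceeds the limit."""
    return [
        e["id"] for e in entries
        if sum(1 for ch in e["text"] if not ch.isspace()) > limit
    ]
```

Calling `len(parse_srt(raw_text))` gives the total entry count, and `overlong_ids(...)` gives a first list of entries that will need splitting.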

Script Resources

scripts/check_clean_srt.py

Checks whether the complete `clean.srt` meets the requirements for length, punctuation, timeline, deleted entries, local split boundaries, and so on. The script should be runnable on its own.

The script must be compatible with two kinds of input:

A complete `clean.srt` with no split entries.
A complete `clean.srt` containing locally split entries, where temporary numbers such as `123a` and `123b` are allowed.

Common usage:

```bash
python3 scripts/check_clean_srt.py raw.srt clean.srt
python3 scripts/check_clean_srt.py raw.srt clean.srt --allowed-deletions '284,415,450'
python3 scripts/check_clean_srt.py raw.srt clean.srt --allowed-deletions '284,415,450' --fail-on-warnings
python3 scripts/check_clean_srt.py raw.srt clean.srt --latin-word-as-one-char
```

Delivery Requirements