hyperframes-media


# HyperFrames Media Preprocessing


Three CLI commands that produce assets for compositions: `tts` (speech), `transcribe` (timestamps), and `remove-background` (transparent video). Each downloads a model on first run and caches it under `~/.cache/hyperframes/`. Drop the output into the project, then reference it from the composition HTML — see the `hyperframes` skill for the audio/video element conventions.

## Text-to-Speech (`tts`)

Generate speech audio locally with Kokoro-82M. No API key.

```bash
npx hyperframes tts "Text here" --voice af_nova --output narration.wav
npx hyperframes tts script.txt --voice bf_emma --output narration.wav
npx hyperframes tts --list                       # all 54 voices
```

### Voice Selection


Match voice to content. Default is `af_heart`.

| Content type | Voice | Why |
| --- | --- | --- |
| Product demo | `af_heart` / `af_nova` | Warm, professional |
| Tutorial / how-to | `am_adam` / `bf_emma` | Neutral, easy to follow |
| Marketing / promo | `af_sky` / `am_michael` | Energetic or authoritative |
| Documentation | `bf_emma` / `bm_george` | Clear British English, formal |
| Casual / social | `af_heart` / `af_sky` | Approachable, natural |

### Multilingual


Voice IDs encode language in the first letter: `a` = American English, `b` = British English, `e` = Spanish, `f` = French, `h` = Hindi, `i` = Italian, `j` = Japanese, `p` = Brazilian Portuguese, `z` = Mandarin. The CLI auto-detects the phonemizer locale from the prefix — no `--lang` needed when the voice matches the text.

```bash
npx hyperframes tts "La reunión empieza a las nueve" --voice ef_dora --output es.wav
npx hyperframes tts "今日はいい天気ですね" --voice jf_alpha --output ja.wav
```

Use `--lang` only to override auto-detection (stylized accents). Valid codes: `en-us`, `en-gb`, `es`, `fr-fr`, `hi`, `it`, `pt-br`, `ja`, `zh`. Non-English phonemization requires `espeak-ng` system-wide (`brew install espeak-ng` / `apt-get install espeak-ng`).
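The prefix-to-locale mapping above can be sketched as a small helper. This is an illustration of the rule only, not the CLI's actual implementation; `voice_lang` is a hypothetical name:

```bash
# voice_lang VOICE_ID: print the locale implied by the voice's first letter.
voice_lang() {
  case "$(printf '%s' "$1" | cut -c1)" in
    a) echo en-us ;;  b) echo en-gb ;;  e) echo es ;;
    f) echo fr-fr ;;  h) echo hi ;;     i) echo it ;;
    j) echo ja ;;     p) echo pt-br ;;  z) echo zh ;;
    *) echo "unknown voice prefix" >&2; return 1 ;;
  esac
}

voice_lang af_nova    # en-us
voice_lang jf_alpha   # ja
```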

### Speed


- `0.7-0.8` — tutorial, complex content, accessibility
- `1.0` — natural pace (default)
- `1.1-1.2` — intros, transitions, upbeat content
- `1.5+` — rarely appropriate; test carefully

### Long Scripts


For more than a few paragraphs, write to a `.txt` file and pass the path. Inputs over ~5 minutes of speech may benefit from splitting into segments.
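One way to segment a long script is to split on blank lines (paragraph boundaries) and synthesize each piece separately. A sketch, assuming paragraphs are separated by blank lines and the per-segment WAVs are concatenated afterwards; `split_script` is a hypothetical helper:

```bash
# split_script FILE: break a script on blank lines into segment-01.txt, segment-02.txt, ...
split_script() {
  awk -v RS='' '{ out = sprintf("segment-%02d.txt", NR); print > out; close(out) }' "$1"
}

# Usage: split, synthesize each piece, then join the WAVs (e.g. with ffmpeg):
#   split_script script.txt
#   for f in segment-*.txt; do
#     npx hyperframes tts "$f" --voice af_heart --output "${f%.txt}.wav"
#   done
```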

### Requirements


Python 3.8+ with `kokoro-onnx` and `soundfile` (`pip install kokoro-onnx soundfile`). Model downloads on first use (~311 MB + ~27 MB voices, cached in `~/.cache/hyperframes/tts/`).

## Transcription (`transcribe`)


Produce a normalized `transcript.json` with word-level timestamps.

```bash
npx hyperframes transcribe audio.mp3
npx hyperframes transcribe video.mp4 --model small --language es
npx hyperframes transcribe subtitles.srt          # import existing
npx hyperframes transcribe subtitles.vtt
npx hyperframes transcribe openai-response.json
```

### Language Rule (Non-Negotiable)


Never use `.en` models unless the user explicitly states the audio is English. `.en` models (`small.en`, `medium.en`) translate non-English audio into English instead of transcribing it. This silently destroys the original language.

1. Language known and non-English → `--model small --language <code>` (no `.en` suffix)
2. Language known and English → `--model small.en`
3. Language unknown → `--model small` (no `.en`, no `--language`) — Whisper auto-detects

Default model is `small`, not `small.en`.
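The three rules above can be encoded as a small dispatch helper. A sketch only — `pick_model` is a hypothetical name, and the emitted flags are exactly the documented `--model`/`--language` options:

```bash
# pick_model [LANG]: choose whisper flags per the language rule.
# No argument (or "unknown") means let whisper auto-detect.
pick_model() {
  case "${1:-unknown}" in
    unknown|"") echo "--model small" ;;
    en)         echo "--model small.en" ;;
    *)          echo "--model small --language $1" ;;
  esac
}

# Usage:
#   npx hyperframes transcribe interview.mp3 $(pick_model fr)   # --model small --language fr
#   npx hyperframes transcribe podcast.mp3  $(pick_model en)    # --model small.en
#   npx hyperframes transcribe mystery.mp3  $(pick_model)       # --model small
```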

### Model Sizes


| Model | Size | Speed | When to use |
| --- | --- | --- | --- |
| `tiny` | 75 MB | Fastest | Quick previews, testing pipeline |
| `base` | 142 MB | Fast | Short clips, clear audio |
| `small` | 466 MB | Moderate | Default — most content |
| `medium` | 1.5 GB | Slow | Important content, noisy audio, music |
| `large-v3` | 3.1 GB | Slowest | Production quality |

Music with vocals: start at `medium` minimum; produced tracks often need manual SRT/VTT import. For caption-quality checks (mandatory after every transcription), the cleaning JS, retry rules, and the OpenAI/Groq API import path, see `hyperframes/references/transcript-guide.md`.

### Output Shape


Compositions consume a flat array of word objects. The `id` field (`w0`, `w1`, ...) is added during normalization for stable references in caption overrides; it's optional for backwards compatibility.

```json
[
  { "id": "w0", "text": "Hello", "start": 0.0, "end": 0.5 },
  { "id": "w1", "text": "world.", "start": 0.6, "end": 1.2 }
]
```
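Compositions typically look the active word up by playhead time. A throwaway shell sketch of that lookup — it leans on the pretty-printed one-object-per-line layout shown above, `word_at` is a hypothetical helper, and anything real should use jq or the composition's own JS:

```bash
# word_at FILE T: print the word object whose [start, end) interval covers T seconds.
word_at() {
  awk -v t="$2" '
    match($0, /"start": *[0-9.]+/) {
      s = substr($0, RSTART, RLENGTH); sub(/.*: */, "", s)
      match($0, /"end": *[0-9.]+/)
      e = substr($0, RSTART, RLENGTH); sub(/.*: */, "", e)
      if (s + 0 <= t + 0 && t + 0 < e + 0) print
    }' "$1"
}
```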

## Background Removal (`remove-background`)


Remove the background from a video or image so it can sit as a transparent overlay in a composition (e.g. an avatar floating on a background plate).

```bash
npx hyperframes remove-background avatar.mp4 -o transparent.webm  # default: VP9 alpha WebM
npx hyperframes remove-background avatar.mp4 -o transparent.mov   # ProRes 4444 (editing)
npx hyperframes remove-background portrait.jpg -o cutout.png      # single-image cutout
npx hyperframes remove-background avatar.mp4 -o transparent.webm --device cpu
npx hyperframes remove-background --info                          # detected providers
```

Uses `u2net_human_seg` (MIT). First run downloads ~168 MB of weights to `~/.cache/hyperframes/background-removal/models/`.

### Output Format


| Format | When |
| --- | --- |
| `.webm` (VP9 + alpha) | Default. Compositions play this directly via `<video>`. |
| `.mov` (ProRes 4444) | Editing in DaVinci/Premiere/FCP. Large files. |
| `.png` | Single-image cutout (still subject, layered over a backdrop). |

Chrome decodes VP9 alpha natively, so the `.webm` plugs into a composition like any other muted-autoplay video — see the `hyperframes` skill for the `<video>` track conventions.

## TTS → Transcribe → Captions


When there's no pre-recorded voiceover, generate one and transcribe it back to get word-level timestamps for captions:

```bash
npx hyperframes tts script.txt --voice af_heart --output narration.wav
npx hyperframes transcribe narration.wav   # → transcript.json
```

Whisper extracts precise word boundaries from the generated audio, so caption timing matches delivery without hand-tuning.