hyperframes-media


# HyperFrames Media Preprocessing


Three CLI commands that produce assets for compositions: `tts` (speech), `transcribe` (timestamps), and `remove-background` (transparent video). Each downloads a model on first run and caches it under `~/.cache/hyperframes/`. Drop the output into the project, then reference it from the composition HTML — see the `hyperframes` skill for the audio/video element conventions.

## Text-to-Speech (`tts`)

Generate speech audio locally with Kokoro-82M. No API key.

```bash
npx hyperframes tts "Text here" --voice af_nova --output narration.wav
npx hyperframes tts script.txt --voice bf_emma --output narration.wav
npx hyperframes tts --list                       # all 54 voices
```

### Voice Selection


Match voice to content. Default is `af_heart`.

| Content type | Voice | Why |
| --- | --- | --- |
| Product demo | `af_heart` / `af_nova` | Warm, professional |
| Tutorial / how-to | `am_adam` / `bf_emma` | Neutral, easy to follow |
| Marketing / promo | `af_sky` / `am_michael` | Energetic or authoritative |
| Documentation | `bf_emma` / `bm_george` | Clear British English, formal |
| Casual / social | `af_heart` / `af_sky` | Approachable, natural |

### Multilingual


Voice IDs encode language in the first letter: `a` = American English, `b` = British English, `e` = Spanish, `f` = French, `h` = Hindi, `i` = Italian, `j` = Japanese, `p` = Brazilian Portuguese, `z` = Mandarin. The CLI auto-detects the phonemizer locale from the prefix — no `--lang` needed when the voice matches the text.

```bash
npx hyperframes tts "La reunión empieza a las nueve" --voice ef_dora --output es.wav
npx hyperframes tts "今日はいい天気ですね" --voice jf_alpha --output ja.wav
```

Use `--lang` only to override auto-detection (stylized accents). Valid codes: `en-us`, `en-gb`, `es`, `fr-fr`, `hi`, `it`, `pt-br`, `ja`, `zh`. Non-English phonemization requires `espeak-ng` system-wide (`brew install espeak-ng` / `apt-get install espeak-ng`).
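The prefix-to-locale mapping above can be sketched as a small helper. This is an illustration of the rule only, not the CLI's actual implementation; `voice_lang` is a hypothetical name:

```bash
# voice_lang VOICE_ID: print the locale implied by the voice's first letter.
voice_lang() {
  case "$(printf '%s' "$1" | cut -c1)" in
    a) echo en-us ;;  b) echo en-gb ;;  e) echo es ;;
    f) echo fr-fr ;;  h) echo hi ;;     i) echo it ;;
    j) echo ja ;;     p) echo pt-br ;;  z) echo zh ;;
    *) echo "unknown voice prefix" >&2; return 1 ;;
  esac
}

voice_lang af_nova    # en-us
voice_lang jf_alpha   # ja
```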

### Speed


- `0.7-0.8` — tutorial, complex content, accessibility
- `1.0` — natural pace (default)
- `1.1-1.2` — intros, transitions, upbeat content
- `1.5+` — rarely appropriate; test carefully

### Long Scripts


For more than a few paragraphs, write to a `.txt` file and pass the path. Inputs over ~5 minutes of speech may benefit from splitting into segments.
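One way to segment a long script is to split on blank lines (paragraph boundaries) and synthesize each piece separately. A sketch, assuming paragraphs are separated by blank lines and the per-segment WAVs are concatenated afterwards; `split_script` is a hypothetical helper:

```bash
# split_script FILE: break a script on blank lines into segment-01.txt, segment-02.txt, ...
split_script() {
  awk -v RS='' '{ out = sprintf("segment-%02d.txt", NR); print > out; close(out) }' "$1"
}

# Usage: split, synthesize each piece, then join the WAVs (e.g. with ffmpeg):
#   split_script script.txt
#   for f in segment-*.txt; do
#     npx hyperframes tts "$f" --voice af_heart --output "${f%.txt}.wav"
#   done
```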

### Requirements


Python 3.8+ with `kokoro-onnx` and `soundfile` (`pip install kokoro-onnx soundfile`). Model downloads on first use (~311 MB + ~27 MB voices, cached in `~/.cache/hyperframes/tts/`).

## Transcription (`transcribe`)


Produce a normalized `transcript.json` with word-level timestamps.

```bash
npx hyperframes transcribe audio.mp3
npx hyperframes transcribe video.mp4 --model small --language es
npx hyperframes transcribe subtitles.srt          # import existing
npx hyperframes transcribe subtitles.vtt
npx hyperframes transcribe openai-response.json
```

### Language Rule (Non-Negotiable)


Never use `.en` models unless the user explicitly states the audio is English. `.en` models (`small.en`, `medium.en`) translate non-English audio into English instead of transcribing it. This silently destroys the original language.

1. Language known and non-English → `--model small --language <code>` (no `.en` suffix)
2. Language known and English → `--model small.en`
3. Language unknown → `--model small` (no `.en`, no `--language`) — Whisper auto-detects

Default model is `small`, not `small.en`.
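The three rules above can be encoded as a small dispatch helper. A sketch only — `pick_model` is a hypothetical name, and the emitted flags are exactly the documented `--model`/`--language` options:

```bash
# pick_model [LANG]: choose whisper flags per the language rule.
# No argument (or "unknown") means let whisper auto-detect.
pick_model() {
  case "${1:-unknown}" in
    unknown|"") echo "--model small" ;;
    en)         echo "--model small.en" ;;
    *)          echo "--model small --language $1" ;;
  esac
}

# Usage:
#   npx hyperframes transcribe interview.mp3 $(pick_model fr)   # --model small --language fr
#   npx hyperframes transcribe podcast.mp3  $(pick_model en)    # --model small.en
#   npx hyperframes transcribe mystery.mp3  $(pick_model)       # --model small
```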

### Model Sizes


| Model | Size | Speed | When to use |
| --- | --- | --- | --- |
| `tiny` | 75 MB | Fastest | Quick previews, testing pipeline |
| `base` | 142 MB | Fast | Short clips, clear audio |
| `small` | 466 MB | Moderate | Default — most content |
| `medium` | 1.5 GB | Slow | Important content, noisy audio, music |
| `large-v3` | 3.1 GB | Slowest | Production quality |

Music with vocals: start at `medium` minimum; produced tracks often need manual SRT/VTT import. For caption-quality checks (mandatory after every transcription), the cleaning JS, retry rules, and the OpenAI/Groq API import path, see `hyperframes/references/transcript-guide.md`.

### Output Shape


Compositions consume a flat array of word objects. The `id` field (`w0`, `w1`, ...) is added during normalization for stable references in caption overrides; it's optional for backwards compatibility.

```json
[
  { "id": "w0", "text": "Hello", "start": 0.0, "end": 0.5 },
  { "id": "w1", "text": "world.", "start": 0.6, "end": 1.2 }
]
```
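Compositions typically look the active word up by playhead time. A throwaway shell sketch of that lookup — it leans on the pretty-printed one-object-per-line layout shown above, `word_at` is a hypothetical helper, and anything real should use jq or the composition's own JS:

```bash
# word_at FILE T: print the word object whose [start, end) interval covers T seconds.
word_at() {
  awk -v t="$2" '
    match($0, /"start": *[0-9.]+/) {
      s = substr($0, RSTART, RLENGTH); sub(/.*: */, "", s)
      match($0, /"end": *[0-9.]+/)
      e = substr($0, RSTART, RLENGTH); sub(/.*: */, "", e)
      if (s + 0 <= t + 0 && t + 0 < e + 0) print
    }' "$1"
}
```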

## Background Removal (`remove-background`)


Remove the background from a video or image so it can sit as a transparent overlay in a composition (e.g. an avatar floating on a background plate).

```bash
npx hyperframes remove-background avatar.mp4 -o transparent.webm  # default: VP9 alpha WebM
npx hyperframes remove-background avatar.mp4 -o transparent.mov   # ProRes 4444 (editing)
npx hyperframes remove-background portrait.jpg -o cutout.png      # single-image cutout
npx hyperframes remove-background avatar.mp4 -o transparent.webm --device cpu
npx hyperframes remove-background --info                          # detected providers
```

Uses `u2net_human_seg` (MIT). First run downloads ~168 MB of weights to `~/.cache/hyperframes/background-removal/models/`.

### Output Format


| Format | When |
| --- | --- |
| `.webm` (VP9 + alpha) | Default. Compositions play this directly via `<video>`. |
| `.mov` (ProRes 4444) | Editing in DaVinci/Premiere/FCP. Large files. |
| `.png` | Single-image cutout (still subject, layered over a backdrop). |

Chrome decodes VP9 alpha natively, so the `.webm` plugs into a composition like any other muted-autoplay video — see the `hyperframes` skill for the `<video>` track conventions.

## TTS → Transcribe → Captions


When there's no pre-recorded voiceover, generate one and transcribe it back to get word-level timestamps for captions:

```bash
npx hyperframes tts script.txt --voice af_heart --output narration.wav
npx hyperframes transcribe narration.wav   # → transcript.json
```

Whisper extracts precise word boundaries from the generated audio, so caption timing matches delivery without hand-tuning.