wjs-reframing-video

Convert a video's orientation by cropping a narrow band from the source, not by physically rotating it. The crop window follows the active speaker (the face whose mouth is moving), not just the largest or most-confident face. A `.crop.json` sidecar records the crop plan, the per-segment speaker decisions, and the parameters used. The original input is never modified.

When to use

Repurposing a 16:9 podcast / interview / talk for vertical short-video platforms (WeChat Channels 视频号, Douyin 抖音, Xiaohongshu 小红书, YouTube Shorts, TikTok, Reels).
Repurposing a 9:16 phone recording for horizontal players (YouTube long-form, blog embeds).
Repurposing 4:3 archive footage for 3:4 mobile, or vice versa.

The output aspect is the source aspect with width and height swapped — 16:9 → 9:16, not "letterboxed 16:9 in a 9:16 frame".

When NOT to use

Multi-person Q&A where each face needs its own crop — this skill picks one crop track per video. For per-speaker split renders, use wjs-editing-multicam instead.
Animated content / B-roll with no faces — falls back to center crop, usually wrong for the intent.
Heavy camera motion in the source (handheld pan/zoom) — the face tracker amplifies camera shake. Stabilize first.
Source already at target aspect — no work to do.

What this skill IS — and IS NOT

| Is | Is not |
| --- | --- |
| Visual active-speaker detection via MAR (mouth-aspect-ratio) variance | Audio-visual fusion (audio energy + lip motion cross-correlated) |
| Stable face tracking across frames by center-distance matching | Re-identification across long gaps / occlusions |
| Speaker-aligned segments with hysteresis to prevent flicker | Frame-by-frame switching on every flicker |
| `--face-pick speaker` (default): pick whoever's mouth is moving | `--face-pick largest` (opt-in legacy): pick the largest face |
| Hard cuts between segments, fixed crop within each segment (`--motion cut`, default) | Smooth panning that drifts during a speaker's turn (opt-in `--motion smooth`) |
| Audio stream-copy (bit-exact) | Audio reprocessing / re-encoding |
| MediaPipe Tasks `FaceLandmarker` (478-pt mesh) at 5 fps sampled via ffmpeg | Per-frame neural inpainting / out-painting |
| One `ffmpeg crop + scale` pass | Frame-by-frame Python compositor |

Falls back to "largest face" automatically when no one is talking (silence, music-only stretches).
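
The speaker pick and its silence fallback can be sketched roughly as follows. This is an illustration only, not the code in `crop.py`: the function name `pick_face`, the input dictionaries, and the use of population variance are assumptions, and the default threshold is simply the value shown in the sidecar example.

```python
from statistics import pvariance


def pick_face(mar_windows, face_sizes, threshold=1.5e-4):
    """Return the face_id to frame for one sample, or None if no face is visible.

    mar_windows: {face_id: [MAR values inside the sliding window]}  (hypothetical shape)
    face_sizes:  {face_id: size proxy, e.g. bounding-box area}      (hypothetical shape)
    """
    # Variance of MAR per face track; a track needs at least two samples.
    variances = {
        fid: pvariance(vals) for fid, vals in mar_windows.items() if len(vals) >= 2
    }
    if variances:
        best = max(variances, key=variances.get)
        if variances[best] >= threshold:
            return best  # this mouth is moving the most, and enough to count as speech

    # Nobody clears the threshold: fall back to the largest face, if any.
    if face_sizes:
        return max(face_sizes, key=face_sizes.get)
    return None


if __name__ == "__main__":
    windows = {0: [0.10, 0.10, 0.11, 0.10, 0.10], 1: [0.05, 0.20, 0.08, 0.25, 0.06]}
    sizes = {0: 300.0, 1: 200.0}
    print(pick_face(windows, sizes))  # face 1 has the clearly moving mouth
```

Keeping the fallback inside the same function means every sample gets a pick, so the later segmenting step never has to special-case silence.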

Dependencies

```bash
pip install mediapipe opencv-python numpy
```

(MediaPipe lives outside the standard Python distribution; ffmpeg and ffprobe must be on `PATH`.)

First-run model download: MediaPipe 0.10+ uses the Tasks API, which needs a `face_landmarker.task` model file (~4 MB). On the first call, `crop.py` downloads it to `~/.claude/skills/wjs-reframing-video/models/` and caches it for subsequent runs. The first run therefore fails if the machine is offline.

Range limitation: The bundled landmarker is tuned for faces within ~2 m of the camera (selfie / podcast / interview distance). Wide event shots with small faces may not detect — sample a frame first to confirm.
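
A quick preflight check for the dependencies above might look like this. It is a sketch using only the standard library; the helper names `missing_tools` and `missing_modules` are made up for illustration and are not part of the skill.

```python
import importlib.util
import shutil


def missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the command-line tools that are not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]


def missing_modules(modules=("mediapipe", "cv2", "numpy")):
    """Return the Python modules that cannot be imported in this environment."""
    return [m for m in modules if importlib.util.find_spec(m) is None]


if __name__ == "__main__":
    print("missing tools:", missing_tools())
    print("missing modules:", missing_modules())
```

Note that `opencv-python` installs the `cv2` module, which is why the module list differs from the `pip install` line.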

Crop math

Source aspect = `W / H`. Target aspect = `H / W` (inverted). Compute the crop window:

| Source orientation | Crop window |
| --- | --- |
| Horizontal (W > H) → Portrait | `W_crop = H × H / W`, `H_crop = H` (narrow vertical band) |
| Portrait (W < H) → Horizontal | `W_crop = W`, `H_crop = W × W / H` (narrow horizontal band) |

For 1920×1080 → portrait, `W_crop = 608` (1080 × 1080 / 1920 = 607.5, rounded) and `H_crop = 1080`. Final scale to 1080×1920 (upscale ~1.78×). For 1080×1920 → landscape, `W_crop = 1080` and `H_crop = 608`. Final scale to 1920×1080.

Override the final size via `--output-size` (for example `--output-size 608x1080` for a 1920×1080 source) if you want native crop dimensions instead of upscaling.
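
The crop-window formulas can be checked with a few lines of Python. This is a minimal sketch: the function name `crop_window` is hypothetical, and snapping to the nearest even pixel count is an assumption made here to reproduce the 608 figure, not a documented behavior of `crop.py`.

```python
def crop_window(src_w, src_h):
    """Return (W_crop, H_crop) for swapping the source aspect ratio."""

    def even(x):
        # Assumption: snap to the nearest even integer, as video encoders usually prefer.
        return 2 * round(x / 2)

    if src_w > src_h:  # horizontal source -> narrow vertical band
        return even(src_h * src_h / src_w), src_h
    if src_w < src_h:  # portrait source -> narrow horizontal band
        return src_w, even(src_w * src_w / src_h)
    return src_w, src_h  # square source: aspect is already its own inverse


if __name__ == "__main__":
    print(crop_window(1920, 1080))  # (608, 1080)
    print(crop_window(1080, 1920))  # (1080, 608)
```

Scaling the 608-pixel-wide crop to 1080 pixels gives the 1080 / 608 ≈ 1.78× upscale quoted above.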

Pipeline

Probe input dimensions, fps, and duration via ffprobe.
Decide orientation: auto from aspect (pass `--target portrait|landscape` to override).
Sample frames at `--sample-fps` (default 5; high enough to catch mouth motion. The Nyquist rate for speech-driven mouth movement is about 10 Hz, so 4–5 fps is a practical minimum rather than a full-fidelity rate).
Detect face landmarks per sampled frame with MediaPipe Tasks `FaceLandmarker` (478 landmarks). For each detected face, record the center, a size proxy, and the MAR (mouth-aspect-ratio = inner-lip vertical distance / horizontal mouth-corner distance).
Track faces across frames by center-distance matching, so each face gets a stable `face_id`.
Per-sample active speaker: for each face track, compute the variance of MAR over a sliding window (`--mar-var-window-sec`, default 1 s). The face with the highest variance is "speaking". If every face is below `--mar-var-threshold`, no one is speaking and the pick falls back to the largest face.
Hysteresis: a candidate switch only commits if the new speaker is stable for `--min-segment-sec` (default 1.5 s). Shorter flickers are squashed, which prevents the crop from ping-ponging on a one-frame mis-detection.
Speaker-aligned segments: for each segment, the mean (cx, cy) of that speaker's face over the segment becomes the crop center, fixed for the full duration of the segment.
Build an ffmpeg step-function expression (`--motion cut`, default) that holds each segment's crop position constant and jumps instantly at each segment boundary, giving the visual feel of a real cut between camera angles. (`--motion smooth` switches to a piecewise-linear pan between segment midpoints; this is rarely the right call for talking-head content because the camera appears to drift mid-sentence.)
Render in one ffmpeg pass: `crop=W:H:x='expr':y='expr', scale=OUT_W:OUT_H`. The crop filter evaluates `x` and `y` per frame natively. Audio is stream-copied.
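
The hysteresis step can be approximated by run-length encoding the per-sample picks and folding short runs into the segment before them. This is a sketch of the idea only: `segments_with_hysteresis` is a hypothetical name, and `crop.py` may well implement the rule differently (for example, online rather than in a second pass).

```python
import math


def segments_with_hysteresis(picks, sample_fps=5.0, min_segment_sec=1.5):
    """Turn per-sample face picks into (t0, t1, face_id) segments.

    picks: one face_id (or None) per sampled frame, in time order.
    A change of speaker only starts a new segment if it lasts at least
    min_segment_sec; shorter runs are absorbed into the current segment.
    """
    min_run = max(1, math.ceil(min_segment_sec * sample_fps))

    # Run-length encode the picks as [face_id, start_index, length].
    runs = []
    for i, fid in enumerate(picks):
        if runs and runs[-1][0] == fid:
            runs[-1][2] += 1
        else:
            runs.append([fid, i, 1])

    # Commit long runs; fold flickers (and returns to the same speaker) into the last segment.
    segments = []  # [face_id, start_index, end_index_exclusive]
    for fid, start, length in runs:
        if segments and (length < min_run or segments[-1][0] == fid):
            segments[-1][2] = start + length
        else:
            segments.append([fid, start, start + length])

    return [(s / sample_fps, e / sample_fps, fid) for fid, s, e in segments]


if __name__ == "__main__":
    picks = [0] * 10 + [1] * 2 + [0] * 8 + [1] * 10
    print(segments_with_hysteresis(picks))  # [(0.0, 4.0, 0), (4.0, 6.0, 1)]
```

In the example, the two-sample blip to face 1 (0.4 s at 5 fps) never becomes a segment, while the later ten-sample run does.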

`scripts/crop.py` is the implementation. Output side effects:

`<input>.crop.json`: sidecar with the crop plan
`<input>_cropped.mp4`: final cropped + scaled video
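
For `--motion cut`, the step-function expression can be pictured as nested `if(lt(t,...))` terms, one per segment boundary. The sketch below builds such a string from sidecar-style chunks; `step_expr` and `crop_filter` are hypothetical helpers, and the exact expression `crop.py` emits may differ.

```python
def step_expr(chunks, key, crop_len, src_len):
    """Build a nested if(lt(t,...)) ffmpeg expression for the crop's x or y offset.

    chunks:   list of {"t0", "t1", "cx", "cy", ...} dicts in time order
    key:      "cx" for the x offset, "cy" for the y offset
    crop_len: crop width (for x) or height (for y)
    src_len:  source width (for x) or height (for y)
    """

    def offset(center):
        # Convert a crop centre to a top-left offset, clamped inside the frame.
        return int(min(max(center - crop_len / 2, 0), src_len - crop_len))

    expr = str(offset(chunks[-1][key]))  # the last segment is the final else-branch
    for chunk in reversed(chunks[:-1]):
        expr = f"if(lt(t,{chunk['t1']}),{offset(chunk[key])},{expr})"
    return expr


def crop_filter(chunks, crop_size, src_size, out_size):
    """Assemble the crop + scale filter string for a single ffmpeg pass."""
    x = step_expr(chunks, "cx", crop_size[0], src_size[0])
    y = step_expr(chunks, "cy", crop_size[1], src_size[1])
    return (
        f"crop={crop_size[0]}:{crop_size[1]}:x='{x}':y='{y}',"
        f"scale={out_size[0]}:{out_size[1]}"
    )


if __name__ == "__main__":
    chunks = [
        {"t0": 0.0, "t1": 4.2, "cx": 808, "cy": 540},
        {"t0": 4.2, "t1": 11.6, "cx": 1182, "cy": 540},
        {"t0": 11.6, "t1": 14.0, "cx": 808, "cy": 540},
    ]
    print(crop_filter(chunks, (608, 1080), (1920, 1080), (1080, 1920)))
```

The single quotes around each expression keep the commas inside `if(...)` from being read as filter separators. Deep nesting grows with the number of segments, which is presumably why long sources need the control-point cap described under Performance.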

Sidecar schema (`<input>.crop.json`)

```json
{
  "_about": "wjs-reframing-video crop plan for cam_a.MOV. Active-speaker detected via MAR variance.",
  "_help": {
    "source_size":     "[width, height] in pixels.",
    "target_size":     "[width, height] of the final rendered output.",
    "crop_window":     "[width, height] of the moving crop in source coords.",
    "chunks":          "Speaker-aligned segments: {t0, t1, cx, cy, speaker_id}.",
    "face_pick_mode":  "speaker = MAR-variance active-speaker; largest = old behavior.",
    "speaker_id":      "Stable face track id. null means no face / silence fallback."
  },
  "schema_version": 2,
  "source": "cam_a.MOV",
  "source_size": [1920, 1080],
  "target": "portrait",
  "target_size": [1080, 1920],
  "crop_window": [608, 1080],
  "face_pick_mode": "speaker",
  "sample_fps": 5.0,
  "mar_var_window_sec": 1.0,
  "mar_var_threshold": 1.5e-4,
  "min_segment_sec": 1.5,
  "chunks": [
    {"t0":  0.0, "t1":  4.2, "cx": 808, "cy": 540, "speaker_id": 0},
    {"t0":  4.2, "t1": 11.6, "cx": 1182, "cy": 540, "speaker_id": 1},
    {"t0": 11.6, "t1": 14.0, "cx": 808, "cy": 540, "speaker_id": 0}
  ],
  "face_sample_count": 1234,
  "track_count": 2
}
```
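
Because the sidecar is plain JSON, the speaker timeline is easy to inspect after a run. The sketch below totals screen time per `speaker_id` from the `chunks` array; `screen_time` is a hypothetical helper, and the printed format only imitates the `face#N: Xs on screen (Y%)` summary the script is said to print.

```python
import json
from collections import defaultdict


def screen_time(sidecar):
    """Return {speaker_id: (seconds, percent)} computed from the sidecar's chunks."""
    totals = defaultdict(float)
    for chunk in sidecar["chunks"]:
        totals[chunk["speaker_id"]] += chunk["t1"] - chunk["t0"]
    duration = sum(totals.values())
    if duration == 0:
        return {}
    return {sid: (secs, 100.0 * secs / duration) for sid, secs in totals.items()}


if __name__ == "__main__":
    # Inline copy of the example chunks; in practice, json.load() the .crop.json file.
    example = json.loads(
        '{"chunks": ['
        '{"t0": 0.0, "t1": 4.2, "cx": 808, "cy": 540, "speaker_id": 0},'
        '{"t0": 4.2, "t1": 11.6, "cx": 1182, "cy": 540, "speaker_id": 1},'
        '{"t0": 11.6, "t1": 14.0, "cx": 808, "cy": 540, "speaker_id": 0}]}'
    )
    for sid, (secs, pct) in sorted(screen_time(example).items()):
        print(f"face#{sid}: {secs:.1f}s on screen ({pct:.0f}%)")
```

A `speaker_id` of `null` loads as `None` in Python, so sort with a custom key if a sidecar mixes `None` and integer ids.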

Performance

Detection is the slow step. On Apple Silicon at 2 fps sampling, expect ~10–20× realtime (a 30-min source detects in ~1–2 min). Raising `--sample-fps` makes detection slower but tracking more responsive.
Render is fast: a single ffmpeg pass with hardware encode (`hevc_videotoolbox` on macOS). For a 1080p source it often finishes in less time than the source's duration.
For very long sources (>200 chunks), the ffmpeg expression gets cumbersome; the script auto-downsamples chunk midpoints to keep the expression under ~200 control points.

Common pitfalls

Mouth gestures aren't speech: a yawn, a laugh, eating, or sucking in air all raise MAR variance, and the detector can briefly mistake these for talking. For high-stakes content, eyeball the speaker timeline in the sidecar (the script prints a `face#N: Xs on screen (Y%)` summary) and re-run with a different `--mar-var-threshold` if needed.
Side-profile or down-tilted faces: when a face is rotated >60° from the camera, MediaPipe may fail to place mouth landmarks reliably, so MAR variance flatlines and the fallback to "largest face" kicks in. If you have a long stretch of profile shots, consider `--face-pick largest`.
Two faces with overlapping speech (interruption / talking over): both faces have MAR variance, but only one wins. The losing face is treated as a listener. For accurate per-speaker tracking under crosstalk, use wjs-editing-multicam with separate cams.
Long stretches of silence (B-roll, music): falls back to largest face. If the largest face is wrong (e.g. a listener stays still while the speaker's mic feeds music), you'll see drift. Pre-segment around music-only sections.
Source has burned-in lower-thirds / subtitles: for H→V the crop keeps the full height, so a lower-third stays in frame but is cut off at the sides; for V→H the crop keeps only a narrow horizontal band, so text near the bottom is usually cropped out entirely. Strip burn-ins before running.
Wide-angle / fish-eye lenses: landmarks miss faces near the edges. Pre-correct distortion with `ffmpeg lenscorrection` first.
Upscaling artifacts: `608×1080 → 1080×1920` is a 1.78× upscale and is visible on sharp text. If you have overlays you want to keep sharp, render at native crop dims (`--output-size 608x1080`) and let the platform upscale.
Output bitrate > platform limit: the default is `--bitrate 12M`. WeChat Channels (视频号) caps at 10 Mbps; pass `--bitrate 8M` for that target.
