wjs-reframing-video
Convert a video's orientation by cropping a narrow band from the source, not by physically rotating it. The crop window follows the active speaker (the face whose mouth is moving), not just the largest or most-confident face. A `.crop.json` sidecar records the crop plan, the per-segment speaker decisions, and the parameters used. The original input is never modified.
When to use
- Repurposing a 16:9 podcast / interview / talk for vertical short-video platforms (WeChat Channels 视频号, Douyin 抖音, Xiaohongshu 小红书, YouTube Shorts, TikTok, Reels).
- Repurposing a 9:16 phone recording for horizontal players (YouTube long-form, blog embeds).
- Repurposing 4:3 archive footage for 3:4 mobile, or vice versa.
The output aspect is the source aspect with width and height swapped — 16:9 → 9:16, not "letterboxed 16:9 in a 9:16 frame".
When NOT to use
- Multi-person Q&A where each face needs its own crop — this skill picks one crop track per video. For per-speaker split renders, use wjs-editing-multicam instead.
- Animated content / B-roll with no faces — falls back to center crop, usually wrong for the intent.
- Heavy camera motion in the source (handheld pan/zoom) — the face tracker amplifies camera shake. Stabilize first.
- Source already at target aspect — no work to do.
What this skill IS — and IS NOT
| Is | Is not |
|---|---|
| Visual active-speaker detection via MAR (mouth-aspect-ratio) variance | Audio-visual fusion (audio energy + lip motion cross-correlated) |
| Stable face tracking across frames by center-distance matching | Re-identification across long gaps / occlusions |
| Speaker-aligned segments with hysteresis to prevent flicker | Frame-by-frame switching on every flicker |
| Hard cuts between segments, fixed crop within each segment (`--motion cut`, default) | Smooth panning that drifts during a speaker's turn (opt-in `--motion smooth`) |
| Audio stream-copy (bit-exact) | Audio reprocessing / re-encoding |
| MediaPipe Tasks `FaceLandmarker` on frames sampled at 5 fps | Per-frame neural inpainting / out-painting |
| One ffmpeg render pass | Frame-by-frame Python compositor |
Falls back to "largest face" automatically when no one is talking (silence, music-only stretches).
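
The speaker choice and its "largest face" fallback can be illustrated with a minimal sketch. It is not the script's actual code: the function name `pick_speaker`, the input shapes, and the use of population variance are assumptions made for illustration; the default threshold is the value shown in the sample sidecar.

```python
from statistics import pvariance


def pick_speaker(mar_windows, face_sizes, var_threshold=1.5e-4):
    """Pick the face with the highest MAR variance, else the largest face.

    mar_windows: {face_id: [MAR samples inside the sliding window]}
    face_sizes:  {face_id: size proxy, e.g. face height in pixels}
    (Both shapes are hypothetical, chosen for this sketch.)
    """
    if not face_sizes:
        return None  # no face detected at all
    variances = {
        fid: pvariance(samples) if len(samples) >= 2 else 0.0
        for fid, samples in mar_windows.items()
    }
    if variances:
        best = max(variances, key=variances.get)
        if variances[best] >= var_threshold:
            return best  # this mouth is moving enough to count as speaking
    # Nobody clears the threshold (silence, music): fall back to the largest face.
    return max(face_sizes, key=face_sizes.get)
```

A nearly constant MAR series stays well below the threshold, so a face that merely holds its mouth open is not mistaken for the speaker.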
Dependencies

```bash
pip install mediapipe opencv-python numpy
```

(MediaPipe lives outside the standard Python distribution; `ffmpeg` and `ffprobe` must be on `PATH`.)
First-run model download: MediaPipe 0.10+ uses the Tasks API, which needs a `face_landmarker.task` model file (~4 MB). On the first call, `crop.py` downloads it to `~/.claude/skills/wjs-reframing-video/models/` and caches it for subsequent runs. The script fails if the machine is offline on that first run.
Range limitation: The bundled landmarker is tuned for faces within ~2 m of the camera (selfie / podcast / interview distance). Wide event shots with small faces may not be detected; sample a frame first to confirm.
Crop math
Source aspect = `W / H`. Target aspect = `H / W` (inverted). Compute crop window:
| Source orientation | Crop window |
|---|---|
| Horizontal (W > H) → Portrait | `W_crop = H × H / W`, `H_crop = H` |
| Portrait (W < H) → Horizontal | `W_crop = W`, `H_crop = W × W / H` |
For 1920×1080 → portrait, `W_crop = 608` (1080 × 1080 / 1920 = 607.5, rounded), `H_crop = 1080`. Final scale to 1080×1920 (upscale ~1.78×).
For 1080×1920 → landscape, `W_crop = 1080`, `H_crop = 608`. Final scale to 1920×1080.
Override the final size via `--output-size WIDTHxHEIGHT` (for example `--output-size 1080x1920`); pass the crop-window size instead, such as `--output-size 608x1080`, if you want native crop dimensions rather than upscaling.
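
As a worked check of the crop arithmetic, here is a small sketch. The function name `crop_window` is hypothetical, and rounding up to an even integer is an assumption: it reproduces the 608 figure quoted here, but the script's real rounding rule is not documented.

```python
def crop_window(src_w, src_h):
    """Largest window with the inverted aspect (H / W) that fits in the source."""

    def even_ceil(numerator, denominator):
        n = -(-numerator // denominator)  # integer ceiling division
        return n + (n % 2)                # assumption: round up to an even size

    if src_w > src_h:
        # Horizontal source -> portrait crop: keep full height, narrow the width.
        return even_ceil(src_h * src_h, src_w), src_h
    # Portrait (or square) source -> horizontal crop: keep full width.
    return src_w, even_ceil(src_w * src_w, src_h)


print(crop_window(1920, 1080))  # (608, 1080)
print(crop_window(1080, 1920))  # (1080, 608)
```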
Pipeline
- Probe input dimensions, fps, duration via ffprobe.
- Decide orientation: auto from aspect (`--target portrait|landscape` to override).
- Sample frames at `--sample-fps` (default 5; high enough to catch mouth motion: Nyquist for speech is ~10 Hz, we need at least 4–5 fps).
- Detect face landmarks per sampled frame with MediaPipe Tasks `FaceLandmarker` (478 landmarks). For each detected face record: center, size proxy, MAR (mouth-aspect-ratio = inner-lip vertical distance / horizontal mouth-corner distance).
- Track faces across frames by center-distance matching → each face gets a stable `face_id`.
- Per-sample active speaker: for each face track, variance of MAR over a sliding window (`--mar-var-window-sec`, default 1 s). The face with the highest variance is "speaking". Below `--mar-var-threshold`, no one is speaking → fall back to largest face.
- Hysteresis: a candidate switch only commits if the new speaker is stable for `--min-segment-sec` (default 1.5 s). Shorter flickers are squashed, which prevents the crop from ping-ponging on a one-frame mis-detection.
- Speaker-aligned segments → for each segment, mean (cx, cy) of that speaker's face over the segment becomes the crop center, fixed for the full duration of the segment.
- Build an ffmpeg step-function expression (`--motion cut`, default) that holds each segment's crop position constant and jumps instantly at each segment boundary, giving the visual feel of a real cut between camera angles. (`--motion smooth` switches to piecewise-linear pan between segment midpoints; rarely the right call for talking-head content because the camera appears to drift mid-sentence.)
- Render one ffmpeg pass: `crop=W:H:x='expr':y='expr', scale=OUT_W:OUT_H`. The crop filter evaluates `x` and `y` per frame natively. Audio stream-copied.
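
The center-distance matching step could look roughly like the sketch below. It is a greedy nearest-center matcher written for illustration only; the function name `track_faces`, the `max_dist` cutoff of 120 px, and the input shape are all assumptions, not the script's actual tracker.

```python
import math


def track_faces(frames, max_dist=120.0):
    """Assign a stable face_id to each detection by nearest-center matching.

    frames: list of per-frame detection lists; each detection is (cx, cy).
    Returns one list of face_ids per frame, parallel to the input.
    """
    last_center = {}  # face_id -> most recent (cx, cy)
    next_id = 0
    result = []
    for detections in frames:
        free = set(last_center)  # existing tracks not yet claimed in this frame
        ids = []
        for cx, cy in detections:
            best, best_d = None, max_dist
            for fid in free:
                px, py = last_center[fid]
                d = math.hypot(cx - px, cy - py)
                if d <= best_d:
                    best, best_d = fid, d
            if best is None:
                best = next_id  # nothing close enough: start a new track
                next_id += 1
            else:
                free.discard(best)
            last_center[best] = (cx, cy)
            ids.append(best)
        result.append(ids)
    return result
```

Because matching relies only on position, a face that leaves the frame and returns elsewhere gets a new id, which matches the stated limitation that there is no re-identification across long gaps.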
`scripts/crop.py` outputs:
- `<input>.crop.json`: sidecar with the crop plan
- `<input>_cropped.mp4`: final cropped + scaled video
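
The hysteresis and segment-averaging steps of the pipeline can be sketched as follows. This is an illustration under assumptions, not the script's code: `smooth_labels`, `build_segments`, and the input shapes are hypothetical, and a switch is committed retroactively to the start of the candidate run, which is one of several reasonable choices.

```python
import math
from statistics import mean

_UNSET = object()  # sentinel, so None stays usable as a "no face" label


def smooth_labels(labels, need):
    """Keep the current speaker until a new one persists for `need` samples."""
    out = []
    current = labels[0] if labels else None
    cand, run = _UNSET, 0
    for lab in labels:
        if lab == current:
            cand, run = _UNSET, 0  # the flicker ended, forget the candidate
        elif lab == cand:
            run += 1
        else:
            cand, run = lab, 1
        if cand is not _UNSET and run >= need:
            if run > 1:  # relabel the samples where the candidate run began
                out[-(run - 1):] = [cand] * (run - 1)
            current, cand, run = cand, _UNSET, 0
        out.append(current)
    return out


def build_segments(times, labels, centers, sample_fps=5.0, min_segment_sec=1.5):
    """times[i]: timestamp; labels[i]: raw speaker id;
    centers[i]: {face_id: (cx, cy)} for every face seen in sample i."""
    need = max(1, math.ceil(min_segment_sec * sample_fps))
    stable = smooth_labels(labels, need)
    chunks, start = [], 0
    for i in range(1, len(stable) + 1):
        if i < len(stable) and stable[i] == stable[start]:
            continue  # still inside the same segment
        sid = stable[start]
        pts = [centers[j][sid] for j in range(start, i) if sid in centers[j]]
        t1 = times[i] if i < len(times) else times[-1] + 1.0 / sample_fps
        chunks.append({
            "t0": times[start],
            "t1": t1,
            "cx": round(mean(p[0] for p in pts)) if pts else None,
            "cy": round(mean(p[1] for p in pts)) if pts else None,
            "speaker_id": sid,
        })
        start = i
    return chunks
```

With the defaults (5 fps, 1.5 s), a new speaker has to win 8 consecutive samples before the crop moves, so a two-sample flicker leaves the segment untouched.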
Sidecar schema (`<input>.crop.json`)

```json
{
  "_about": "wjs-reframing-video crop plan for cam_a.MOV. Active-speaker detected via MAR variance.",
  "_help": {
    "source_size": "[width, height] in pixels.",
    "target_size": "[width, height] of the final rendered output.",
    "crop_window": "[width, height] of the moving crop in source coords.",
    "chunks": "Speaker-aligned segments: {t0, t1, cx, cy, speaker_id}.",
    "face_pick_mode": "speaker = MAR-variance active-speaker; largest = old behavior.",
    "speaker_id": "Stable face track id. null means no face / silence fallback."
  },
  "schema_version": 2,
  "source": "cam_a.MOV",
  "source_size": [1920, 1080],
  "target": "portrait",
  "target_size": [1080, 1920],
  "crop_window": [608, 1080],
  "face_pick_mode": "speaker",
  "sample_fps": 5.0,
  "mar_var_window_sec": 1.0,
  "mar_var_threshold": 1.5e-4,
  "min_segment_sec": 1.5,
  "chunks": [
    {"t0": 0.0, "t1": 4.2, "cx": 808, "cy": 540, "speaker_id": 0},
    {"t0": 4.2, "t1": 11.6, "cx": 1182, "cy": 540, "speaker_id": 1},
    {"t0": 11.6, "t1": 14.0, "cx": 808, "cy": 540, "speaker_id": 0}
  ],
  "face_sample_count": 1234,
  "track_count": 2
}
```
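
To show how a `chunks` array like this one could become the step-function crop expression, here is a hedged sketch. The helper names `step_expr` and `crop_filter` are hypothetical, and the clamping of the crop offset to the frame is an assumption. Only `if()`, `lt()`, and the `t` timestamp variable are taken from ffmpeg's expression syntax for the crop filter.

```python
def step_expr(chunks, axis, crop_len, src_len):
    """Nested if(lt(t,t1),offset,...) expression with one constant per chunk.

    axis is "cx" or "cy"; crop_len and src_len are the crop and source
    extents along that axis, in pixels.
    """

    def offset(chunk):
        # Center -> top-left offset, clamped so the window stays in the frame.
        return int(min(max(chunk[axis] - crop_len / 2, 0), src_len - crop_len))

    expr = str(offset(chunks[-1]))  # the last chunk holds until the end
    for chunk in reversed(chunks[:-1]):
        expr = f"if(lt(t,{chunk['t1']}),{offset(chunk)},{expr})"
    return expr


def crop_filter(chunks, source_size, crop_window, target_size):
    """Assemble the crop + scale filter string for a single render pass."""
    (sw, sh), (cw, ch), (ow, oh) = source_size, crop_window, target_size
    x = step_expr(chunks, "cx", cw, sw)
    y = step_expr(chunks, "cy", ch, sh)
    return f"crop={cw}:{ch}:x='{x}':y='{y}',scale={ow}:{oh}"
```

For the three sample chunks, the x expression comes out as `if(lt(t,4.2),504,if(lt(t,11.6),878,504))`: the window sits at x = 504, jumps to 878 at 4.2 s, and returns at 11.6 s. The resulting string would be passed to ffmpeg as the `-vf` argument, with `-c:a copy` for the audio.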
Performance
- Detection is the slow step. On Apple Silicon at 2 fps sampling, expect ~10–20× realtime (a 30-min source detects in ~1.5–3 min). Raising `--sample-fps` makes detection slower but tracking more responsive.
- Render is fast: a single ffmpeg pass with hardware encode (`hevc_videotoolbox` on macOS). It often finishes in less than the source's running time for a 1080p source.
- For very long sources (>200 chunks), the ffmpeg expression gets cumbersome; the script auto-downsamples chunk midpoints to keep the expression under ~200 control points.
Common pitfalls
- Mouth gestures aren't speech: a yawn, laugh, eating, or sucking-in-air all raise MAR variance. The detector can briefly mistake these for talking. For high-stakes content, eyeball the speaker timeline in the sidecar (the script prints a `face#N: Xs on screen (Y%)` summary) and re-run with a different `--mar-var-threshold` if needed.
- Side-profile or down-tilted faces: when a face is rotated >60° from camera, MediaPipe may fail to land mouth landmarks reliably, so MAR variance flatlines. The fallback to "largest face" kicks in. If you have a long stretch of profile shots, consider `--face-pick largest`.
- Two faces with overlapping speech (interruption / talking over): both faces have MAR variance, only one wins. The losing face is treated as listener. For accurate per-speaker tracking under crosstalk, use wjs-editing-multicam with separate cams.
- Long stretches of silence (B-roll, music): falls back to largest face. If the largest face is wrong (e.g. a listener stays still while the speaker's mic feeds music), you'll see drift. Pre-segment around music-only sections.
- Source has burned-in lower-thirds / subtitles: for H→V the crop keeps the full height but only a narrow vertical band, so lower-thirds are cut off at the sides; for V→H the crop keeps the full width but only a short band around the face, so text near the bottom usually falls outside it. Strip burn-ins before running.
- Wide-angle / fish-eye lenses: landmarks miss faces near edges. Pre-correct distortion with `ffmpeg lenscorrection` first.
- Upscaling artifacts: `608×1080 → 1080×1920` is a 1.78× upscale and visible on sharp text. Render at native crop dims (`--output-size 608x1080`) and let the platform upscale, if you have overlays you want to keep sharp.
- Output bitrate > platform limit: default is `--bitrate 12M`. WeChat Channels (视频号) caps at 10 Mbps; pass `--bitrate 8M` for that target.
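
When reviewing a sidecar by hand, the speaker timeline can be recomputed from its `chunks` array. The sketch below is an assumption about how such a summary might be derived, not the script's own code; the function name `speaker_timeline` and the `no-face` label for a null `speaker_id` are hypothetical.

```python
from collections import defaultdict


def speaker_timeline(chunks):
    """Per-speaker on-screen time in the 'face#N: Xs on screen (Y%)' format."""
    totals = defaultdict(float)
    for chunk in chunks:
        totals[chunk["speaker_id"]] += chunk["t1"] - chunk["t0"]
    duration = sum(totals.values())
    lines = []
    # Longest on-screen time first.
    for sid, secs in sorted(totals.items(), key=lambda kv: -kv[1]):
        name = "no-face" if sid is None else f"face#{sid}"
        lines.append(f"{name}: {secs:.1f}s on screen ({100 * secs / duration:.0f}%)")
    return lines
```

A face that holds the frame for far longer than its share of the conversation is a hint that the threshold let non-speech mouth motion through.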