wjs-editing-multicam

Combine N synced camera angles into a single rendered MP4. Decisions are audio-energy-driven only — the cam with the loudest mic each second wins. Output is hard cuts (or hard cuts plus a corner PiP).

What this skill IS — and IS NOT

| Is | Is not |
| --- | --- |
| Audio-energy-driven cam switching | Face / framing detection (no face_recognition, no MediaPipe) |
| Single-source audio (one cam's mic) | Multi-mic mix / per-speaker gating |
| Hard cuts, with optional PiP inset | Crossfades / opacity transitions / sliding animations |
| `ffmpeg` `concat` + `overlay` filter renders | HyperFrames composition / `<hf-clip>` |
| Coverage-aware (won't pick a cam outside its sidecar window) | Frame-accurate beat alignment / VAD-edge cuts |

If you need face tracking, fade transitions, captions, or HyperFrames composition, use the hyperframes skill on top of this skill's MP4 output.

REQUIRED INPUT

Original camera files (untouched) plus their `.sync.json` sidecars next to them. If sources aren't synced yet, run wjs-syncing-multicam first to write the sidecars. Missing sidecar = cam assumed at delta=0, full coverage.

`autoedit.py` reads each sidecar for `delta_seconds` + `overlap_in_reference`, lifts the cam's audio envelope into the reference timeline, and only schedules a cam during its coverage window. `render_cuts.py` / `render_pip.py` apply `ffmpeg -itsoffset` per input using the EDL's `deltas[]` array.
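
The sidecar lookup described above can be sketched as follows. This is a minimal illustration, not the code in `autoedit.py`: the helper name `load_sidecar` is hypothetical, the sidecar is assumed to sit at `<media file>.sync.json` (as in the file layout), and `overlap_in_reference` is assumed to be a `[start, end]` pair.

```python
import json
import math
from pathlib import Path


def load_sidecar(media_path):
    """Return (delta_seconds, (start, end)) for one camera file.

    Hypothetical helper. Assumes the sidecar is `<media file>.sync.json`
    and that `overlap_in_reference` is a [start, end] pair.
    """
    sidecar = Path(str(media_path) + ".sync.json")
    if not sidecar.exists():
        # Missing sidecar: cam assumed at delta=0 with full coverage.
        return 0.0, (0.0, math.inf)
    data = json.loads(sidecar.read_text(encoding="utf-8"))
    delta = float(data.get("delta_seconds", 0.0))
    start, end = data.get("overlap_in_reference", (0.0, math.inf))
    return delta, (float(start), float(end))
```

Calling `load_sidecar("cam_b.MOV")` for each input, in cam-index order, would give the per-cam deltas and coverage windows that the rest of the pipeline works from.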

When NOT to use

  • One source — nothing to switch between; use video-segmentation.
  • Polished NLE timeline already exists — don't fight the editor.
  • Want fade transitions / overlay captions / brand title cards — run this skill first to get the cut-down MP4, then feed it into wjs-overlaying-video or hyperframes.

Pipeline

  1. Read each input's sidecar → list of `delta_seconds[k]` + `overlap_in_reference[k]`.
  2. Extract per-cam mono PCM @ 16 kHz from the original file.
  3. Compute a log-RMS envelope at a 1 Hz frame rate (one value per second).
  4. Lift each envelope into the reference timeline by indexing at `t_ref - delta_k`; uncovered seconds become `-inf` so they're never picked.
  5. Audio source = the cam with the largest envelope spread (90th − 10th percentile over its covered seconds), with a small bonus for coverage fraction.
  6. Score per second: `cam[k] - mean(other covered cams)`. Highest score = best active-speaker candidate.
  7. Editor decides the EDL in one of two modes:
    • `rotation` (default): random dwell in [`min_dwell=8`, `max_dwell=15`] s; pick the best-scoring covered cam (≠ current) at each cut.
    • `greedy`: hysteresis. Hold the current cam unless another cam's lookahead-window score beats it by `--switch-threshold`. Floor `min_dwell=4`, ceiling `max_dwell=18`.
    Both modes force a switch if the active cam exits its coverage window mid-shot.
  8. Emit EDL JSON.
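
Steps 3, 4 and 6 above can be sketched in plain Python. This is an illustration under stated assumptions rather than the `autoedit.py` implementation: the function names are hypothetical, the delta is rounded to whole seconds, coverage is taken from the envelope length instead of `overlap_in_reference`, and a cam that is the only covered one at a given second is given a score of 0.

```python
import math


def log_rms_envelope(samples, sample_rate=16000):
    """One log-RMS value (in dB) per whole second of mono PCM samples."""
    env = []
    for i in range(0, len(samples) - sample_rate + 1, sample_rate):
        window = samples[i:i + sample_rate]
        mean_sq = sum(x * x for x in window) / sample_rate
        env.append(10.0 * math.log10(mean_sq + 1e-12))  # epsilon avoids log(0)
    return env


def lift(env, delta, duration):
    """Index a cam-local envelope at t_ref - delta; uncovered seconds are -inf."""
    lifted = []
    for t_ref in range(duration):
        t_cam = int(round(t_ref - delta))  # assumption: whole-second alignment
        lifted.append(env[t_cam] if 0 <= t_cam < len(env) else -math.inf)
    return lifted


def scores_at(lifted, t):
    """Per-cam score at second t: own level minus the mean of other covered cams."""
    levels = [cam[t] for cam in lifted]
    out = []
    for k, level in enumerate(levels):
        others = [v for j, v in enumerate(levels) if j != k and v != -math.inf]
        if level == -math.inf:
            out.append(-math.inf)  # uncovered cams can never win
        elif not others:
            out.append(0.0)  # assumption: a lone covered cam scores 0
        else:
            out.append(level - sum(others) / len(others))
    return out
```

Because uncovered seconds carry `-inf` through both the lift and the score, the editor stage can pick the maximum score without a separate coverage check.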

EDL schema (`edl.json`)

```json
{
  "_about": "EDL produced by wjs-editing-multicam/autoedit.py. Times in reference timeline. Render scripts apply ffmpeg -itsoffset deltas[k] per input.",
  "_help": {
    "inputs":        "Original media paths, in cam-index order (cam 0, cam 1, ...).",
    "deltas":        "Per-cam delta_seconds from each sidecar. Render uses ffmpeg -itsoffset deltas[k].",
    "duration_sec":  "Output duration in reference timeline.",
    "audio_source":  "Cam index whose audio track becomes the master. Single source — not a mix.",
    "coverage":      "[start, end] per cam in reference timeline.",
    "edl":           "List of {cam, start, end} segments. Times are reference-timeline seconds."
  },
  "inputs":       ["cam_a.MOV", "cam_b.MOV"],
  "deltas":       [0.0, 12.345],
  "duration_sec": 4512,
  "audio_source": 0,
  "coverage":     [[0.0, 4512.0], [12.345, 4499.835]],
  "edl":          [{"cam": 0, "start": 0, "end": 13}, {"cam": 1, "start": 13, "end": 28}, ...]
}
```

`autoedit.py` writes `_about` + `_help` directly into the file so opening the JSON in any editor explains itself.
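
Before rendering, it can help to sanity-check a hand-edited EDL against the schema above. The checker below is a hypothetical sketch, not part of the skill's scripts. It assumes rows are meant to be contiguous (as in the example, although the schema does not state it) and that every row must fall inside its cam's `coverage` window.

```python
def check_edl(doc):
    """Return a list of problems found in a parsed edl.json dict (empty if none)."""
    problems = []
    n_cams = len(doc["inputs"])
    if len(doc["deltas"]) != n_cams or len(doc["coverage"]) != n_cams:
        problems.append("inputs, deltas and coverage must have one entry per cam")
        return problems
    if not 0 <= doc["audio_source"] < n_cams:
        problems.append("audio_source is not a valid cam index")

    prev_end = None
    for i, seg in enumerate(doc["edl"]):
        cam, start, end = seg["cam"], seg["start"], seg["end"]
        if not 0 <= cam < n_cams:
            problems.append(f"row {i}: unknown cam {cam}")
            continue
        if end <= start:
            problems.append(f"row {i}: end must be after start")
        cov_start, cov_end = doc["coverage"][cam]
        if start < cov_start or end > cov_end:
            problems.append(f"row {i}: cam {cam} used outside its coverage")
        # Assumption: consecutive rows should butt up against each other.
        if prev_end is not None and start != prev_end:
            problems.append(f"row {i}: gap or overlap with the previous row")
        prev_end = end
    return problems
```

An empty return value means the EDL passed these checks; anything else is a human-readable list of rows to fix.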

Render

| Script | What it does |
| --- | --- |
| `scripts/render_cuts.py` | Hard cuts only. `concat` filter graph over per-segment `trim+scale+pad`. Audio = `audio_source` cam, trimmed to first EDL row's start. |
| `scripts/render_pip.py` | Hard cuts + corner picture-in-picture overlay. Main cam = EDL row's `cam`; PiP cam picked round-robin (or via per-row `pip` field). PiP is scaled to `--pip-width` (default 480 px), placed in a configurable corner with optional white border. No fade or opacity: the PiP is a solid block, on or off. |

Both apply `-itsoffset deltas[k]` per input.
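
To make the hard-cut render concrete, here is a rough sketch of how such an ffmpeg command line could be assembled from an EDL. It is not the code in `render_cuts.py`: the function name, the 1920x1080 target size, and trimming the audio end to the last row are assumptions. It relies on `-itsoffset` shifting each input's timestamps so that `trim` times can be given in reference-timeline seconds.

```python
def build_cuts_command(doc, out_path, width=1920, height=1080):
    """Build an ffmpeg argv list for hard cuts from a parsed edl.json dict."""
    argv = ["ffmpeg", "-y"]
    for path, delta in zip(doc["inputs"], doc["deltas"]):
        # -itsoffset is an input option, so it must precede its -i.
        argv += ["-itsoffset", str(delta), "-i", path]

    chains, labels = [], []
    for i, seg in enumerate(doc["edl"]):
        # One trim+scale+pad chain per EDL row, normalised to a common size.
        chains.append(
            f"[{seg['cam']}:v]trim=start={seg['start']}:end={seg['end']},"
            f"setpts=PTS-STARTPTS,"
            f"scale={width}:{height}:force_original_aspect_ratio=decrease,"
            f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2[v{i}]"
        )
        labels.append(f"[v{i}]")
    joined = "".join(labels)
    chains.append(f"{joined}concat=n={len(labels)}:v=1:a=0[outv]")

    # Single-source audio, trimmed to the first row's start.
    # Assumption: the end is trimmed to the last row's end as well.
    first, last = doc["edl"][0]["start"], doc["edl"][-1]["end"]
    chains.append(
        f"[{doc['audio_source']}:a]atrim=start={first}:end={last},"
        f"asetpts=PTS-STARTPTS[outa]"
    )
    argv += ["-filter_complex", ";".join(chains),
             "-map", "[outv]", "-map", "[outa]", out_path]
    return argv
```

The function only builds the argument list, so it can be printed and inspected before being handed to `subprocess.run`.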

Brainstorm before running

Three real knobs to confirm with the user:
  • Pacing: `--mode rotation` (varied dwell, easier on the ear) vs `--mode greedy` (energy-following, snappier).
  • PiP: yes / no. If yes, which corner + width?
  • Min cut length: the `--min-dwell` floor. The 8 s default for rotation is conservative; talking-heads can go to 4.

`audio_source` is auto-picked; override with `--audio-source <cam-index>` if the auto-pick sounds wrong on a 30 s listen.
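
To get a feel for what the rotation pacing and the dwell floor do to shot lengths, the scheduling idea can be sketched as below. This is a simplified illustration, not the scheduler in `autoedit.py`: the function name and `seed` parameter are hypothetical, and it leaves out the forced switch when the active cam exits its coverage window mid-shot.

```python
import random


def rotation_edl(scores, duration, min_dwell=8, max_dwell=15, seed=0):
    """Rotation-mode sketch: random dwell, then cut to the best other covered cam.

    scores[k][t] is cam k's score at reference second t (-inf when uncovered).
    """
    rng = random.Random(seed)
    edl, t = [], 0
    # Start on whichever cam scores best at t=0.
    current = max(range(len(scores)), key=lambda k: scores[k][0])
    while t < duration:
        end = min(t + rng.randint(min_dwell, max_dwell), duration)
        edl.append({"cam": current, "start": t, "end": end})
        t = end
        if t < duration:
            candidates = [k for k in range(len(scores))
                          if k != current and scores[k][t] != float("-inf")]
            if candidates:  # otherwise stay on the current cam
                current = max(candidates, key=lambda k: scores[k][t])
    return edl
```

Lowering `min_dwell` to 4 in this sketch produces visibly shorter shots, which is the trade-off to talk through with the user before a full-length run.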

File layout

```
working_dir/
  cam_a.MOV                 # ORIGINAL, untouched
  cam_a.MOV.sync.json       # from wjs-syncing-multicam
  cam_b.MOV                 # ORIGINAL, untouched
  cam_b.MOV.sync.json
  edl.json                  # from autoedit.py
  multicam_render.mp4       # from render_cuts.py OR render_pip.py
```

Common pitfalls

  • Trusting `audio_source` without listening. Spread + coverage is a proxy. Always sample a 30 s clip before committing; a high-spread track can still be clipped or distorted.
  • Running `autoedit.py` on the full 75 min before tuning. Run on a 2-min slice first (extract one per cam with `ffmpeg -ss A -t 120`), listen, adjust `--min-dwell` / `--mode`, then commit to full length.
  • Expecting face-driven framing. This skill doesn't see the video, only the audio. If one cam is well-framed but quiet, the editor won't favor it. Use `--audio-source` + per-segment `pip` overrides as the manual escape hatch.
  • Re-rendering when sync was wrong. The EDL bakes in `deltas[]` at autoedit time. If you fix the sidecars later, re-run `autoedit.py` to regenerate the EDL before re-rendering.
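
The last pitfall can be caught mechanically by comparing the deltas baked into the EDL with the sidecars currently on disk. The helper below is a hypothetical sketch, not one of the skill's scripts. It assumes `inputs` are paths relative to the EDL's directory, that sidecars are named `<media file>.sync.json`, and that a difference of more than 1 ms counts as stale.

```python
import json
from pathlib import Path


def stale_deltas(edl_path, tolerance=1e-3):
    """List (input, edl_delta, sidecar_delta) where the EDL no longer matches."""
    edl_path = Path(edl_path)
    doc = json.loads(edl_path.read_text(encoding="utf-8"))
    stale = []
    for media, edl_delta in zip(doc["inputs"], doc["deltas"]):
        sidecar = edl_path.parent / (media + ".sync.json")
        if sidecar.exists():
            data = json.loads(sidecar.read_text(encoding="utf-8"))
            current = float(data["delta_seconds"])
        else:
            current = 0.0  # missing sidecar is treated as delta=0
        if abs(current - edl_delta) > tolerance:
            stale.append((media, edl_delta, current))
    return stale
```

A non-empty result is the cue to re-run `autoedit.py` before rendering again.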