video-creator

Video Creator

A tool that combines images and audio into a video.

⛔⛔⛔ Top Rule: Audio-Visual Synchronization (Violations Mean a Full Redo!)

Every time you generate a video, each image's duration must be calculated precisely from the timestamps in narration.json. Filling in durations by feel is strictly forbidden!

Mandatory Workflow (No Step May Be Skipped)

1. Generate the narration via TTS → get narration.mp3 + narration.json (timestamps)
2. Read narration.json and analyze the meaning of each sentence
3. Decide which sentences each image covers (match by meaning, not by even division!)
4. Duration of each image = end of its last sentence - start of its first sentence
5. Check: sum of all durations ≈ total audio duration (error < 1s)
6. ⛔ Run verify_alignment.py (it must pass before synthesis!)
7. Write video_config.yaml only after validation passes; otherwise, stop!

Example (Teaching a Classical Poem)

Timestamps in narration.json:
  Sentence 0 [0.1-2.6]   "Quiet Night Thoughts, Tang Dynasty, Li Bai."
  Sentence 1 [2.6-5.7]   "Before my bed a pool of light, I wonder if it's frost on the ground."
  Sentence 2 [5.7-8.6]   "I lift my eyes and look at the bright moon, I lower my head and think of my hometown."
  Sentence 3 [8.6-16.4]  "This poem was written by Li Bai, a great poet of the Tang Dynasty..."
  Sentence 4 [16.4-20.9] "In just twenty characters..."
  Sentence 5 [20.9-22.6] "Before my bed a pool of light."
  ...

Image allocation:
  01_title.png     → Sentences 0-2 (full poem recitation)             → duration = 8.6 - 0.1 = 8.5s
  02_poet.png      → Sentences 3-4 (introduction to the poet)         → duration = 20.9 - 8.6 = 12.3s
  03_moonlight.png → Sentences 5-8 (interpretation of the first line) → duration = 38.6 - 20.9 = 17.7s
  ...

Validation: 8.5 + 12.3 + 17.7 + ... = 114.4s ≈ audio duration 114.5s ✅
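
The allocation above can be sketched in a few lines of Python. This is only an illustration: it assumes narration.json is a list of objects with `start` and `end` keys in seconds, and the helper name `image_durations` and the `(image, first_sentence, last_sentence)` grouping format are hypothetical.

```python
def image_durations(sentences, groups):
    """Duration of each image = end of its last sentence - start of its first.

    sentences: list of {"start": float, "end": float} (assumed layout).
    groups: list of (image_file, first_sentence_index, last_sentence_index).
    """
    plan = []
    for image, first, last in groups:
        duration = sentences[last]["end"] - sentences[first]["start"]
        plan.append((image, round(duration, 2)))
    return plan


# The first five sentences from the poem example above.
sentences = [
    {"start": 0.1, "end": 2.6},
    {"start": 2.6, "end": 5.7},
    {"start": 5.7, "end": 8.6},
    {"start": 8.6, "end": 16.4},
    {"start": 16.4, "end": 20.9},
]
plan = image_durations(sentences, [("01_title.png", 0, 2), ("02_poet.png", 3, 4)])
for image, duration in plan:
    print(f"{image}: {duration}s")
```

Because consecutive groups share a boundary (one ends where the next starts), the durations add up to the span of the narration they cover.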

Forbidden Actions

  • ❌ Assigning 10s, 15s, or 20s to each image by feel
  • ❌ Dividing evenly (total duration / number of images)
  • ❌ Writing durations without reading narration.json
  • ❌ Forcing synthesis when the total image duration differs from the audio by more than 5 seconds
  • ❌ Letting video_maker.py auto-stretch by more than 1 second
  • ❌ Using duration: auto (it has been completely removed from the code!)

Image Generation Rules (Violations Mean a Redo)

0. Default Dimensions (Most Important!)

Default aspect ratio: 16:9 (1920×1080, landscape)
Always use 16:9 unless:
  • The user explicitly specifies another ratio
  • Another workflow in the pipeline has an explicit requirement (e.g., 3:4 for Xiaohongshu, 9:16 for Douyin)

    # ✅ Default: omit the -r parameter, or write -r 16:9 explicitly
    python text_to_image.py "prompt" -o output.png
    python text_to_image.py "prompt" -r 16:9 -o output.png

    # ❌ Forbidden: switching ratios at will, portrait one moment and landscape the next
    # All images in the same video project must use the same ratio!

1. Image Density Requirements

| Video Duration | Minimum Number of Images | Duration per Image |
|----------------|--------------------------|--------------------|
| 30s | 8 | 3-4s |
| 60s | 15 | 3-5s |
| 90s | 22 | 3-5s |
| 120s | 30 | 3-5s |

Rule: each image may stay on screen for at most 7 seconds; anything longer must be split!

Formula:

    number_of_images = ceil(audio_duration / 4)
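
As a rough sketch of how the formula and the 7-second cap might be applied together, the snippet below computes the minimum image count and greedily groups consecutive sentences so that no group exceeds the cap. The function names and the `start`/`end` sentence layout are assumptions for illustration; in practice the cut points should still follow the meaning of the narration.

```python
import math

MAX_SECONDS_PER_IMAGE = 7.0  # the cap stated in the rule above


def minimum_image_count(audio_duration):
    """number_of_images = ceil(audio_duration / 4)."""
    return math.ceil(audio_duration / 4)


def split_by_sentences(sentences, max_len=MAX_SECONDS_PER_IMAGE):
    """Group consecutive sentences into (start, end) spans of at most max_len.

    A single sentence longer than max_len is kept whole, because this
    sketch only cuts on sentence boundaries.
    """
    groups, current = [], []
    for s in sentences:
        if current and s["end"] - current[0]["start"] > max_len:
            groups.append((current[0]["start"], current[-1]["end"]))
            current = []
        current.append(s)
    if current:
        groups.append((current[0]["start"], current[-1]["end"]))
    return groups


sentences = [
    {"start": 0.1, "end": 2.6},
    {"start": 2.6, "end": 5.7},
    {"start": 5.7, "end": 8.6},
]
print(minimum_image_count(120))
print(split_by_sentences(sentences))
```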

2. Prompt Language Requirements for Image Generation

All image-generation prompts must be written in Chinese. English prompts are forbidden!

    # ❌ Wrong: prompt written in English
    python text_to_image.py "modern tech illustration, AI robot, blue gradient background"

    # ❌ Wrong: Chinese mixed into an English prompt
    python text_to_image.py "tech style, Chinese text '对决', blue theme"

    # ✅ Correct: Chinese-only prompt
    python text_to_image.py "现代科技插画风格,可爱AI机器人坐在电脑前,蓝紫色渐变背景,霓虹灯光效,多个悬浮的全息UI面板,信息密度高,专业信息图风格"

The correct prompt reads in English: "Modern tech illustration style, cute AI robot sitting in front of a computer, blue-purple gradient background, neon lighting effects, multiple floating holographic UI panels, high information density, professional infographic style".

**Rules**:
- Prompts must be Chinese-only
- Any text in generated images must also be Chinese
- No English may appear anywhere

3. Information Density Requirements

Information density = plenty of text points + rich visual elements

**Text content must be rich**:

❌ Wrong: too little text

The image only says "AI Comparison"

✅ Correct: plenty of text points

The image includes:
  • Title: QoderWork vs OpenClaw
  • Subtitle: Desktop AI Assistant Comparison
  • Point 1: Out-of-the-box vs Freely Customizable
  • Point 2: $19/month vs Free & Open Source
  • Point 3: General Users vs Tech Geeks

**Visual elements must be rich**:

❌ Wrong: too empty

Only text, with no icons, charts, or decorations

✅ Correct: rich visualization

  • Pair the text with icons (check marks, crosses, arrows, stars)
  • Pair the text with charts (bar charts, pie charts, comparison bars)
  • Pair the text with illustrations (robots, computers, user figures)
  • Pair the text with decorations (lighting effects, gradients, borders)

**Information Density Principles**:
- Each image must have a clear text title and key points
- The text must correspond to the narration
- Even more important is **visualization**: icons, charts, illustrations, decorations
- Text-only images are forbidden, and so are decoration-only images

4. Image Descriptions Must Be Detailed and Specific

Every image must have rich, specific visual elements. Vague, empty descriptions are forbidden!

    # ❌ Wrong: too vague
    "一个机器人" "科技风格的图" "对比图"
    # In English: "A robot" "Tech-style image" "Comparison image"

    # ✅ Correct: detailed and specific
    "可爱的蓝色AI机器人吉祥物,圆润金属质感,坐在现代简约办公桌前,桌上有27寸曲面显示器显示代码,旁边放着咖啡杯和多肉植物,机器人头顶悬浮三个全息面板分别显示折线图、饼图、进度条,深蓝色科技感背景,地面有蓝色光带,整体赛博朋克风格,柔和的体积光"

The detailed prompt reads in English: "Cute blue AI robot mascot, rounded metallic texture, sitting at a modern minimalist desk, with a 27-inch curved monitor displaying code, a coffee cup and a succulent plant beside it, three holographic panels floating above the robot showing a line chart, a pie chart, and a progress bar, dark blue tech background, blue light strips on the floor, overall cyberpunk style, soft volumetric lighting".

**6 Essential Elements for Prompts** (None can be missing):

| Element | Description | Example |
|---------|-------------|---------|
| Subject | Who/what, specific | "Blue rounded metallic AI robot" instead of "robot" |
| Action | What it's doing, posture | "Hands on keyboard typing, slightly tilted head" |
| Environment | Where it is, background | "Modern minimalist office, city night view outside floor-to-ceiling windows" |
| Details | Surrounding items/elements | "Coffee cup, succulent plant, sticky notes on desk" |
| Style | Art style/lighting effects | "Cyberpunk style, neon lighting, volumetric lighting effect" |
| Color | Main color tone | "Blue-purple gradient main tone, orange accents" |
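
One way to make sure none of the six elements is forgotten is to assemble the prompt from named fields. The sketch below is hypothetical (the `PromptElements` class and `build_prompt` helper are not part of the skill); it joins the fields with a full-width comma and refuses to build a prompt when any element is empty.

```python
from dataclasses import dataclass, fields


@dataclass
class PromptElements:
    """The six required prompt elements from the table above."""
    subject: str
    action: str
    environment: str
    details: str
    style: str
    color: str


def build_prompt(elements):
    """Join all six elements; raise if any one is missing or blank."""
    parts = []
    for field in fields(elements):
        value = getattr(elements, field.name).strip()
        if not value:
            raise ValueError(f"missing prompt element: {field.name}")
        parts.append(value)
    return ",".join(parts)


elements = PromptElements(
    subject="蓝色圆润金属质感的AI机器人",
    action="双手放在键盘上打字,微微侧头",
    environment="现代简约办公室,落地窗外是城市夜景",
    details="桌上有咖啡杯、多肉植物、便签纸",
    style="赛博朋克风格,霓虹灯光,体积光效果",
    color="蓝紫渐变主色调,橙色点缀",
)
prompt = build_prompt(elements)
print(prompt)
```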

5. Visual Style Consistency

All images in the same video must share a consistent style:
  • Use the same style prefix
  • Use the same color palette
  • Use the same aspect ratio (never mix portrait and landscape!)

    # Define a shared style prefix (in Chinese!)
    # In English: "Modern tech illustration style, clean vector design, blue-purple gradient
    # color scheme, professional infographic aesthetics, high information density,
    # neon glow effect, dark background"
    STYLE="现代科技插画风格,干净矢量设计,蓝紫渐变配色,专业信息图美感,高信息密度,霓虹发光效果,深色背景"

    # Every image uses this prefix plus the same aspect ratio
    # ([具体场景内容] stands for the specific scene content)
    python text_to_image.py "$STYLE,[具体场景内容]" -r 16:9 -o scene01.png
    python text_to_image.py "$STYLE,[具体场景内容]" -r 16:9 -o scene02.png

6. Aspect Ratio Guide

| Scenario | Aspect Ratio | Notes |
|----------|--------------|-------|
| Default / General | 16:9 | Bilibili, YouTube, WeChat Official Account videos, PPT illustrations |
| Douyin / WeChat Channels / Kuaishou | 9:16 | Vertical short-video platforms; only when the user or a workflow requires it |
| Xiaohongshu | 3:4 | Images for Xiaohongshu notes; only when the user or a workflow requires it |
| WeChat Moments | 1:1 | Square; only when the user specifies it |

Rule: when unsure which ratio to use, always use 16:9

Core Workflow (Rules)

Story Video Generation Workflow (Nested Workflow)

When the user provides a story, plot, or script, follow this nested workflow strictly:

Layer 1: Story → split into scenes → generate the scene main images in parallel (text-to-image)

    Havoc in Heaven → Scene 1: Sun Wukong is humiliated as Keeper of the Heavenly Horses
                      Scene 2: He rides the somersault cloud back to Flower-Fruit Mountain
                      Scene 3: The Jade Emperor sends troops
                      ...
    → Call text_to_image.py in parallel to generate the main image of each scene

Layer 2: Each scene main image → derive close-up shots via image-to-image (keeps characters consistent)

    Scene 1 main image → Close-up 1: Wukong looks at the official seal, puzzled
                         Close-up 2: Wukong kicks over the horse trough
    Scene 2 main image → Close-up 1: Soaring on the somersault cloud
                         Close-up 2: Proclaiming himself Great Sage at Flower-Fruit Mountain
    → Call image_to_image.py in parallel, using the main image as the reference

Layer 3: Generate narration + subtitles + validation + video synthesis

    1. tts_generator.py generates the narration and timestamps
    2. [Rule] Calculate each image's duration precisely from the timestamps
    3. Generate the SRT subtitles
    4. Generate video_config.yaml
    5. ⛔ Validate with verify_alignment.py (it must pass!)
    6. video_maker.py synthesizes the video:
       → Compose the images (with transitions)
       → Merge the audio
       → Burn in the subtitles (ASS format, fixed at bottom center)
       → Automatically append the outro (QR code + "Follow so you don't get lost")
       → Add BGM

**Rule: every video must automatically append the outro!**
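
Step 3 of Layer 3 (generating SRT subtitles from the timestamps) could look roughly like the sketch below. It is not the skill's actual implementation: it assumes each narration.json entry has `start`, `end` (seconds) and `text` keys, and the helper names are made up for illustration.

```python
def srt_timestamp(seconds):
    """Format seconds as the SRT time code HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(sentences):
    """Turn a list of {"start", "end", "text"} entries into SRT text."""
    blocks = []
    for index, s in enumerate(sentences, start=1):
        time_line = f"{srt_timestamp(s['start'])} --> {srt_timestamp(s['end'])}"
        blocks.append(f"{index}\n{time_line}\n{s['text']}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


sentences = [
    {"start": 0.1, "end": 2.6, "text": "静夜思,唐,李白。"},
    {"start": 2.6, "end": 5.7, "text": "床前明月光,疑是地上霜。"},
]
srt_text = to_srt(sentences)
print(srt_text)
```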

Directory Structure Specification

assets/generated/{project_name}/
├── scene1/
│   ├── main.png         # Scene 1 main image (text-to-image)
│   ├── shot_01.png      # Close-up 1 (image-to-image)
│   └── shot_02.png      # Close-up 2 (image-to-image)
├── scene2/
│   ├── main.png
│   ├── shot_01.png
│   └── shot_02.png
├── ...
├── narration.mp3        # Dubbing
├── narration.json       # Timestamps
├── subtitles.srt        # Subtitles
├── video_config.yaml    # Video configuration
└── {project_name}.mp4   # Final video

Example Execution Commands

    # Layer 1: generate the scene main images in parallel (default 16:9)
    # (prompt placeholders: "风格描述,场景1内容" = "style description, scene 1 content")
    python .opencode/skills/image-service/scripts/text_to_image.py "风格描述,场景1内容" -r 16:9 -o scene1/main.png &
    python .opencode/skills/image-service/scripts/text_to_image.py "风格描述,场景2内容" -r 16:9 -o scene2/main.png &
    wait

    # Layer 2: generate the close-up shots in parallel via image-to-image (same aspect ratio)
    # ("保持角色风格,细镜头描述" = "keep the character style, close-up description")
    python .opencode/skills/image-service/scripts/image_to_image.py scene1/main.png "保持角色风格,细镜头描述" -r 16:9 -o scene1/shot_01.png &
    python .opencode/skills/image-service/scripts/image_to_image.py scene1/main.png "保持角色风格,细镜头描述" -r 16:9 -o scene1/shot_02.png &
    wait

    # Layer 3: generate the narration, validate, and synthesize the video
    # ("完整旁白" = the full narration text)
    python .opencode/skills/video-creator/scripts/tts_generator.py --text "完整旁白" --output narration.mp3 --timestamps

    # ⛔ Mandatory validation (it must pass before synthesis!)
    python .opencode/skills/video-creator/scripts/verify_alignment.py video_config.yaml

    # Synthesize only after validation passes
    python .opencode/skills/video-creator/scripts/video_maker.py video_config.yaml --srt subtitles.srt --bgm epic

---

Video Configuration File Format

    # video_config.yaml
    ratio: "16:9"        # Landscape by default! Must be quoted to avoid YAML parsing errors
    bgm_volume: 0.12
    outro: true
    scenes:
      - audio: narration.mp3
        images:
          # List all close-up shots in scene order
          - file: scene1/shot_01.png
            duration: 4.34
          - file: scene1/shot_02.png
            duration: 4.88
          - file: scene2/shot_01.png
            duration: 2.15
          # ...

**Note**: `ratio` must be enclosed in quotes, e.g., `"16:9"`. Otherwise a YAML 1.1 parser such as PyYAML may read the bare value as a time-style (base-60) number instead of a string.
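
For illustration, a config in this shape could be written out with plain string formatting, which also guarantees that `ratio` is always quoted. The helper `render_video_config` is hypothetical, and the field layout simply mirrors the example above rather than a confirmed schema.

```python
def render_video_config(audio, plan, ratio="16:9", bgm_volume=0.12, outro=True):
    """Render video_config.yaml text from (file, duration) pairs.

    The ratio is always emitted inside double quotes so that a YAML
    parser keeps it as a string.
    """
    lines = [
        f'ratio: "{ratio}"',
        f"bgm_volume: {bgm_volume}",
        f"outro: {'true' if outro else 'false'}",
        "scenes:",
        f"  - audio: {audio}",
        "    images:",
    ]
    for file, duration in plan:
        lines.append(f"      - file: {file}")
        lines.append(f"        duration: {duration:.2f}")
    return "\n".join(lines) + "\n"


config_text = render_video_config(
    "narration.mp3",
    [("scene1/shot_01.png", 4.34), ("scene1/shot_02.png", 4.88), ("scene2/shot_01.png", 2.15)],
)
print(config_text)
```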

---

Duration Allocation Specification (Rule!)

Before generating video_config.yaml, compute every duration strictly as follows:

Step 1: Read the Timestamp File

    import json

    with open("narration.json", "r", encoding="utf-8") as f:
        timestamps = json.load(f)
    audio_duration = timestamps[-1]["end"]
    print(f"Total audio duration: {audio_duration:.1f}s")

Step 2: Divide Scenes by Meaning

Based on the narration, decide which time span each image covers:

    # Example: split according to the narration content.
    # Find the timestamp of every topic change.
    scenes = [
        ("cover.png", 0, 12.5),      # Opening up to the first topic change
        ("scene01.png", 12.5, 26),   # Second segment
        # ... split precisely on the sentence boundaries in narration.json
    ]

Step 3: Calculate the Duration of Each Image

    for file, start, end in scenes:
        duration = end - start
        print(f"{file}: {duration:.1f}s")

Step 4: Validate the Total Duration

    total_duration = sum(end - start for _, start, end in scenes)
    assert abs(total_duration - audio_duration) < 1.0, \
        f"Duration mismatch! Total image duration {total_duration}s vs audio duration {audio_duration}s"

Rules

  1. Read the timestamps from narration.json first; never estimate by feel
  2. Split on the semantic boundaries of sentences; never divide evenly
  3. Validate before generating the configuration, making sure total image duration = total audio duration (error < 0.5s)
  4. Never let the script auto-stretch; a video whose audio and visuals are out of sync is rejected
  5. duration=0 is forbidden; every image gets at least 1 second
  6. After generating the YAML, validate it with verify_alignment.py before synthesis

Validation Script (Mandatory! Must Pass Before Synthesis!)

    # ⛔ Run the validation script before synthesizing the video; if it fails, synthesis is forbidden!
    python .opencode/skills/video-creator/scripts/verify_alignment.py video_config.yaml

What it checks:

1. Whether every image file exists
2. Whether each duration is an exact number (non-numeric values are rejected outright!)
3. Total image duration vs total audio duration (error < 1s)
4. Whether each image's duration is within a reasonable range (1-7s)
5. A semantic cross-check of image filename keywords against narration keywords
6. It prints a full comparison table (image + duration + semantic ✅/❌ + matching narration text)

Exit code 0 = pass, 1 = fail

On failure, do not synthesize; fix the problem and validate again!

**Note**: video_maker.py also has a built-in hard check: duration must be an exact positive float, and a missing or non-numeric value makes it refuse to run and exit(1)! duration: auto has been completely removed!
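
The duration-related checks (items 2 to 4) can be approximated as follows. This is a simplified sketch, not the code of verify_alignment.py: the function name, the return format, and the choice to report out-of-range durations as warnings (as the parameter table later describes) are assumptions.

```python
def check_durations(durations, audio_duration, tolerance=1.0, min_len=1.0, max_len=7.0):
    """Return (errors, warnings) for a list of per-image durations."""
    errors, warnings = [], []
    for index, d in enumerate(durations):
        # Reject "auto", None, strings, booleans and non-positive numbers.
        if isinstance(d, bool) or not isinstance(d, (int, float)) or d <= 0:
            errors.append(f"image {index}: duration must be a positive number, got {d!r}")
        elif not (min_len <= d <= max_len):
            warnings.append(f"image {index}: {d}s is outside {min_len}-{max_len}s")
    if not errors:
        total = sum(durations)
        if abs(total - audio_duration) >= tolerance:
            errors.append(f"total {total:.2f}s vs audio {audio_duration:.2f}s")
    return errors, warnings


errors, warnings = check_durations([4.34, 4.88, 2.15], audio_duration=11.4)
print("errors:", errors)
print("warnings:", warnings)
```

A caller would typically exit with status 1 when `errors` is non-empty, matching the exit codes described above.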

Duration Allocation Table Template

Print the allocation table for confirmation before generating the configuration:

| Scene Image | Corresponding Content | Start | End | Duration |
|-------------|-----------------------|-------|-----|----------|
| cover.png | Opening Introduction | 0s | 12.5s | 12.5s |
| scene01.png | AI Agent Era | 12.5s | 26s | 13.5s |
| ... | ... | ... | ... | ... |
| **Total** | | | | **{total}s** |

Total Audio Duration: {audio_duration}s
Difference: {diff}s ✅/❌
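
A small helper can fill in this template from the scene list. The sketch below is illustrative only; the name `allocation_table`, the 4-tuple scene format, and the 1-second tolerance default are assumptions.

```python
def allocation_table(scenes, audio_duration, tolerance=1.0):
    """Render the allocation table from (file, content, start, end) tuples."""
    rows = [
        "| Scene Image | Corresponding Content | Start | End | Duration |",
        "|-------------|-----------------------|-------|-----|----------|",
    ]
    total = 0.0
    for file, content, start, end in scenes:
        total += end - start
        rows.append(f"| {file} | {content} | {start:g}s | {end:g}s | {end - start:g}s |")
    rows.append(f"| **Total** | | | | **{total:g}s** |")
    diff = abs(total - audio_duration)
    mark = "✅" if diff < tolerance else "❌"
    rows += ["", f"Total Audio Duration: {audio_duration:g}s", f"Difference: {diff:g}s {mark}"]
    return "\n".join(rows)


table = allocation_table(
    [("cover.png", "Opening Introduction", 0, 12.5), ("scene01.png", "AI Agent Era", 12.5, 26)],
    audio_duration=26.2,
)
print(table)
```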

Subtitle Specification

Subtitles use the ASS format and are always fixed at the bottom center:
  • Position: bottom center (Alignment=2)
  • Font: PingFang SC
  • Size: screen height / 40
  • Outline: 2px black outline + 1px shadow
  • Bottom margin: screen height / 20
Forbidden: subtitles that jump around, vary in size, or change position
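
To make the two height-based values concrete, the sketch below derives the style numbers for a given video height. The dictionary keys follow the usual ASS style field names; rounding to whole pixels and the helper name `subtitle_style` are assumptions made here.

```python
def subtitle_style(video_height):
    """Compute ASS style values from the video height (in pixels)."""
    return {
        "Fontname": "PingFang SC",
        "Fontsize": round(video_height / 40),   # size = screen height / 40
        "Outline": 2,                           # 2px black outline
        "Shadow": 1,                            # 1px shadow
        "Alignment": 2,                         # bottom center
        "MarginV": round(video_height / 20),    # bottom margin = screen height / 20
    }


style_landscape = subtitle_style(1080)  # 16:9 at 1920×1080
style_portrait = subtitle_style(1920)   # 9:16 at 1080×1920
print(style_landscape)
print(style_portrait)
```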

Script Parameter Description

video_maker.py

    python video_maker.py config.yaml [options]

| Parameter | Description | Default Value |
|-----------|-------------|---------------|
| --no-outro | Do not add the outro | Outro is added |
| --no-bgm | Do not add BGM | BGM is added |
| --fade | Transition duration (seconds) | 0.5 |
| --bgm-volume | BGM volume | 0.08 |
| --bgm | Custom BGM (e.g., epic) | Default tech-style track |
| --ratio | Video aspect ratio | 16:9 (overridden by the configuration file) |
| --srt | Subtitle file path | None |

verify_alignment.py

    python verify_alignment.py video_config.yaml

| Validation Item | Description |
|-----------------|-------------|
| Image existence | All image files must exist |
| Duration precision | Must be a positive float; auto, empty, and non-numeric values are forbidden |
| Total duration match | Total image duration vs total audio duration, error < 1s |
| Single-image duration range | 1-7s per image; a warning is issued outside this range |
| Semantic cross-check | Image filename keywords are matched against narration keywords |

tts_generator.py

    python tts_generator.py --text "Text" --output audio.mp3 [options]

| Parameter | Description | Default Value |
|-----------|-------------|---------------|
| --voice | Voice ID | zh-CN-YunxiNeural |
| --rate | Speech rate | +0% |
| --timestamps | Output the timestamp JSON | No |

Supported Video Aspect Ratios

Kept consistent with the image-service image generation service; 10 aspect ratios are supported:

| Aspect Ratio | Resolution | Applicable Scenario |
|--------------|------------|---------------------|
| 1:1 | 1024×1024 | Square, WeChat Moments |
| 2:3 | 832×1248 | Portrait poster |
| 3:2 | 1248×832 | Landscape poster |
| 3:4 | 1080×1440 | Xiaohongshu, WeChat Moments |
| 4:3 | 1440×1080 | Traditional monitor |
| 4:5 | 864×1080 | Instagram |
| 5:4 | 1080×864 | Landscape photo |
| 9:16 | 1080×1920 | Douyin, WeChat Channels, vertical screen |
| 16:9 | 1920×1080 | Bilibili, YouTube, landscape screen |
| 21:9 | 1536×672 | Ultra-wide cinematic |

Outro Specification

Rule: every video must automatically append the outro that matches its dimensions!
Outro matching order:
  1. Exact match: outro_{ratio}.mp4
  2. Orientation match: portrait → outro_9x16.mp4, landscape → outro_16x9.mp4
  3. Fallback: outro.mp4
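
The matching order could be implemented along these lines. This is a hedged sketch rather than the logic in video_maker.py: it assumes `{ratio}` is written as `WxH` (as in outro_9x16.mp4), treats a square ratio as landscape, and uses a hypothetical `pick_outro` helper.

```python
from pathlib import Path


def pick_outro(ratio, assets_dir):
    """Return the first existing outro: exact match, orientation match, fallback."""
    width, height = (int(part) for part in ratio.split(":"))
    candidates = [
        f"outro_{width}x{height}.mp4",                             # 1. exact match
        "outro_9x16.mp4" if height > width else "outro_16x9.mp4",  # 2. orientation match
        "outro.mp4",                                               # 3. fallback
    ]
    for name in candidates:
        path = Path(assets_dir) / name
        if path.exists():
            return path
    return None  # no outro file found at all


outro = pick_outro("3:4", "assets")
print(outro)
```

With only outro.mp4 and outro_9x16.mp4 present, a 3:4 video would get the portrait outro and a 16:9 video would fall back to outro.mp4.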

BGM Resources

Classification by Style

| Style | File | Shortcut Parameter | Applicable Scenario |
|-------|------|--------------------|---------------------|
| Ancient / Chinese Style | bgm_ancient_tale.mp3 | ancient | Water Margin, Three Kingdoms, historical stories |
| | bgm_asian_drums.mp3 | asian | Martial arts, action, battle |
| | bgm_meditation.mp3 | meditation | Zen, meditation, traditional Chinese studies |
| | bgm_garden.mp3 | garden | Pastoral, leisurely, scenery |
| Healing / Relaxing | bgm_carefree.mp3 | carefree | Vlog, daily life, lifestyle |
| | bgm_dreamy.mp3 | dreamy | Dreamy, memories, warmth |
| | bgm_happybee.mp3 | happybee | Cheerful, bright, sunny |
| Passionate / Epic | bgm_epic.mp3 | epic | Inspirational, battle, high-energy |
| | bgm_heroic.mp3 | heroic | Heroes, victory, glory |
| | bgm_crusade.mp3 | crusade | War, epic, grand scale |
| | bgm_allthis.mp3 | allthis | Cinematic, narrative, emotional |
| Tech / Futuristic | bgm_technology.mp3 | tech | AI, products, tutorials |
| | bgm_digital.mp3 | digital | Digital, networks, the internet |
| | bgm_chasm.mp3 | chasm | Sci-fi, space, mystery |
| Suspense / Tense | bgm_anxiety.mp3 | anxiety | Detective stories, tension, crisis |
| | bgm_darkfog.mp3 | darkfog | Horror, dark, suspense |
| Cheerful / Lively | bgm_funky.mp3 | funky | Comedy, lighthearted, rhythmic |
| | bgm_happyboy.mp3 | happyboy | Cute, children, animation |
| | bgm_doh.mp3 | doh | Playful, fun, short videos |
| Electronic / Rhythmic | bgm_edm.mp3 | edm | Dynamic, editing, beat-matching |
| | bgm_electro.mp3 | electro | Electronic, modern, trendy |
| | bgm_bitshift.mp3 | bitshift | Games, 8-bit, retro electronic |
| | bgm_hiphop.mp3 | hiphop | Rap, street, trendy |

Usage

    # Method 1: shortcut parameter (recommended)
    python video_maker.py config.yaml --bgm epic
    python video_maker.py config.yaml --bgm ancient
    python video_maker.py config.yaml --bgm edm

    # Method 2: full filename
    python video_maker.py config.yaml --bgm bgm_ancient_tale.mp3

    # Method 3: custom path
    python video_maker.py config.yaml --bgm /path/to/custom.mp3
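
The three ways of passing `--bgm` suggest a resolution step like the one sketched below. It is an illustration under assumptions, not the script's real code: the `resolve_bgm` helper is hypothetical, and the rule "a value containing a path separator is a custom path" is a guess. The three special cases come from the BGM table, where the shortcut differs from the file name.

```python
from pathlib import Path

# Shortcuts whose file name is not simply bgm_<shortcut>.mp3 (from the BGM table).
SPECIAL_SHORTCUTS = {
    "ancient": "bgm_ancient_tale.mp3",
    "asian": "bgm_asian_drums.mp3",
    "tech": "bgm_technology.mp3",
}


def resolve_bgm(value, assets_dir="assets"):
    """Map a --bgm value to a file path."""
    if "/" in value or "\\" in value:
        return Path(value)                       # Method 3: custom path
    if value.endswith(".mp3"):
        return Path(assets_dir) / value          # Method 2: full filename
    name = SPECIAL_SHORTCUTS.get(value, f"bgm_{value}.mp3")
    return Path(assets_dir) / name               # Method 1: shortcut parameter


for value in ("epic", "ancient", "bgm_ancient_tale.mp3", "/path/to/custom.mp3"):
    print(value, "->", resolve_bgm(value))
```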

---

Common Voice IDs

| Voice ID | Style |
|----------|-------|
| zh-CN-YunyangNeural | Male, news broadcast |
| zh-CN-YunxiNeural | Male, sunny and lively |
| zh-CN-XiaoxiaoNeural | Female, warm and natural |
| zh-CN-XiaoyiNeural | Female, lively and cute |

Directory Structure

video-creator/
├── SKILL.md
├── scripts/
│   ├── video_maker.py        # Main script: Images + Audio → Video (built-in duration hard check)
│   ├── verify_alignment.py   # Mandatory pre-synthesis validation (Duration + Semantic cross-check)
│   ├── tts_generator.py      # TTS Voice Generation
│   └── scene_splitter.py     # Scene Splitter (Optional)
├── assets/
│   ├── outro.mp4             # General outro (16:9)
│   ├── outro_9x16.mp4        # Portrait outro
│   ├── outro_3x4.mp4         # 3:4 outro
│   └── bgm_*.mp3             # 22 BGM tracks (see BGM Resource Table)
└── references/
    └── edge_tts_voices.md

Dependencies

    # System dependencies
    brew install ffmpeg  # macOS

    # Python dependencies
    pip install edge-tts pyyaml