video-creator
Video Creator
A tool that combines images and audio into videos.
⛔⛔⛔ Top Rule: Audio-Visual Synchronization (any violation means the video is scrapped and redone!)
Every time a video is generated, each duration must be computed exactly from the timestamps in narration.json. Filling in durations by feel is strictly forbidden!
Mandatory Workflow (no step may be skipped)
1. Generate the dubbing with TTS → narration.mp3 + narration.json (timestamps)
2. Read narration.json and analyze the meaning of each sentence
3. Decide which sentences each image covers (match by meaning, not by even division!)
4. Duration of each image = end of its last sentence - start of its first sentence
5. Check: sum of all durations ≈ total audio duration (error < 1s)
6. ⛔ Run verify_alignment.py (it must pass before synthesis!)
7. Write video_config.yaml only after the check passes; otherwise stop!
Example (Ancient Poetry Teaching)
Timestamps in narration.json:
Sentence 0 [0.1-2.6] "Quiet Night Thoughts, Tang Dynasty, Li Bai."
Sentence 1 [2.6-5.7] "Before my bed a pool of light, I wonder if it's frost on the ground."
Sentence 2 [5.7-8.6] "I lift my eyes and look at the bright moon, I lower my head and think of my hometown."
Sentence 3 [8.6-16.4] "This poem was written by Li Bai, a great poet of the Tang Dynasty..."
Sentence 4 [16.4-20.9] "In just twenty characters..."
Sentence 5 [20.9-22.6] "Before my bed a pool of light."
...
Image allocation:
01_title.png → Sentences 0-2 (full poem recitation) → duration = 8.6 - 0.1 = 8.5s
02_poet.png → Sentences 3-4 (poet introduction) → duration = 20.9 - 8.6 = 12.3s
03_moonlight.png → Sentences 5-8 (interpretation of the first line) → duration = 38.6 - 20.9 = 17.7s
...
Validation: 8.5 + 12.3 + 17.7 + ... = 114.4s ≈ audio duration 114.5s ✅
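The allocation above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own code: the function name `image_durations` and the `allocation` mapping are hypothetical, and the sentence entries are assumed to carry `start` and `end` keys as the example timestamps suggest. Only the two images whose sentence ranges are fully listed above are used.

```python
def image_durations(sentences, allocation):
    """Duration per image = end of its last sentence - start of its first sentence.

    allocation maps an image file to an inclusive (first_index, last_index) pair.
    """
    result = {}
    for image, (first, last) in allocation.items():
        duration = sentences[last]["end"] - sentences[first]["start"]
        result[image] = round(duration, 2)  # round away float noise such as 8.499999
    return result


# Sentences 0-4 from the example above (text omitted, only timestamps matter here).
sentences = [
    {"start": 0.1, "end": 2.6},
    {"start": 2.6, "end": 5.7},
    {"start": 5.7, "end": 8.6},
    {"start": 8.6, "end": 16.4},
    {"start": 16.4, "end": 20.9},
]
allocation = {"01_title.png": (0, 2), "02_poet.png": (3, 4)}

durations = image_durations(sentences, allocation)
print(durations)
```

Because each image ends exactly where the next one starts, the durations add up to the span from the first sentence's start to the last sentence's end, which is why the total lands within 1s of the audio length.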
Forbidden Actions
- ❌ Assigning 10s, 15s, or 20s to each image by feel
- ❌ Even distribution (total duration / number of images)
- ❌ Writing a duration without reading narration.json
- ❌ Forcing synthesis when the total image duration differs from the audio by more than 5 seconds
- ❌ Letting video_maker.py auto-stretch by more than 1 second
- ❌ Using duration: auto (completely removed from the code!)
Image Generation Rules (Violation Requires Redo)
0. Default Dimensions (Most Important!)
Default aspect ratio: 16:9 (1920×1080, landscape)
Always use 16:9 unless:
- The user explicitly specifies another ratio
- Another workflow in the pipeline has an explicit requirement (e.g. 3:4 for Xiaohongshu, 9:16 for Douyin)
```bash
# ✅ Default: omit the -r parameter, or pass -r 16:9 explicitly
# ("提示词" is a placeholder for the Chinese prompt)
python text_to_image.py "提示词" -o output.png
python text_to_image.py "提示词" -r 16:9 -o output.png

# ❌ Forbidden: switching ratios arbitrarily, portrait one moment and landscape the next
# All images in the same video project must use the same ratio!
```
1. Image Density Requirements
| Video Duration | Minimum Number of Images | Duration per Image |
|---|---|---|
| 30s | 8 | 3-4s |
| 60s | 15 | 3-5s |
| 90s | 22 | 3-5s |
| 120s | 30 | 3-5s |
Rule: each image is shown for at most 7 seconds; anything longer must be split!
Formula:
number_of_images = ceil(audio_duration / 4)
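As a quick check of the formula, here is a small Python sketch. The helper names `min_image_count` and `pieces_needed` are made up for illustration; only the divisor 4 and the 7-second cap come from the text above. Note that for 90 seconds the formula yields 23, one more than the 22 listed in the table.

```python
import math

TARGET_SECONDS_PER_IMAGE = 4  # divisor used by the formula above
MAX_SECONDS_PER_IMAGE = 7     # hard cap from the rule above


def min_image_count(audio_duration: float) -> int:
    """number_of_images = ceil(audio_duration / 4)."""
    return math.ceil(audio_duration / TARGET_SECONDS_PER_IMAGE)


def pieces_needed(segment_duration: float) -> int:
    """How many images one segment must be split into so that none exceeds the cap."""
    return max(1, math.ceil(segment_duration / MAX_SECONDS_PER_IMAGE))


for seconds in (30, 60, 90, 120):
    print(seconds, min_image_count(seconds))

print(pieces_needed(12.3))  # a 12.3s segment needs at least 2 images
```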
2. Prompt Language Requirements for Image Generation
All image-generation prompts must be written in Chinese. English prompts are forbidden!
```bash
# ❌ Wrong: prompt written in English
python text_to_image.py "modern tech illustration, AI robot, blue gradient background"

# ❌ Wrong: Chinese mixed into an English prompt
python text_to_image.py "tech style, Chinese text '对决', blue theme"

# ✅ Correct: Chinese-only prompt
# (meaning: modern tech illustration style, cute AI robot sitting at a computer,
#  blue-purple gradient background, neon lighting, several floating holographic UI panels,
#  high information density, professional infographic style)
python text_to_image.py "现代科技插画风格,可爱AI机器人坐在电脑前,蓝紫色渐变背景,霓虹灯光效,多个悬浮的全息UI面板,信息密度高,专业信息图风格"
```
**Rules**:
- Prompts must be Chinese-only
- Any text rendered in the generated image must also be Chinese
- No English may appear anywhere
3. Information Density Requirements
Information density = many text points + rich visual elements
**Text content must be rich**:
❌ Wrong: too little text
The image contains only "AI Comparison"
✅ Correct: many text points
The image contains:
- Title: QoderWork vs OpenClaw
- Subtitle: Desktop AI Assistant Comparison
- Point 1: Works out of the box vs freely customizable
- Point 2: $19/month vs free and open source
- Point 3: Everyday users vs tech enthusiasts
**Visual elements must be rich**:
❌ Wrong: too empty
Only text, with no icons, charts, or decorations
✅ Correct: rich visualization
- Icons (check marks, crosses, arrows, stars)
- Charts (bar charts, pie charts, comparison bars)
- Illustrations (robots, computers, user avatars)
- Decorations (lighting effects, gradients, borders)
**Information Density Principles**:
- Each image must have a clear text title and key points
- The text must correspond to the dubbing content
- **Visualization** matters even more: icons, charts, illustrations, decorations
- Text-only images are forbidden, and so are decoration-only images
4. Detailed and Specific Image Generation Descriptions
Each image must have rich, specific visual elements. Vague, empty descriptions are forbidden!
```bash
# ❌ Wrong: too vague
"一个机器人"       # "a robot"
"科技风格的图"     # "a tech-style image"
"对比图"           # "a comparison image"

# ✅ Correct: detailed and specific
# (meaning: cute blue AI robot mascot with a rounded metallic texture, sitting at a modern
#  minimalist desk; a 27-inch curved monitor shows code, with a coffee cup and a succulent
#  beside it; three holographic panels float above the robot showing a line chart, a pie
#  chart, and a progress bar; dark blue tech background, blue light strips on the floor,
#  cyberpunk style, soft volumetric lighting)
"可爱的蓝色AI机器人吉祥物,圆润金属质感,坐在现代简约办公桌前,
桌上有27寸曲面显示器显示代码,旁边放着咖啡杯和多肉植物,
机器人头顶悬浮三个全息面板分别显示折线图、饼图、进度条,
深蓝色科技感背景,地面有蓝色光带,整体赛博朋克风格,柔和的体积光"
```
**6 Essential Prompt Elements** (none may be missing; examples shown in translation):
| Element | Description | Example |
|---------|-------------|---------|
| Subject | Who or what, stated specifically | "Blue AI robot with a rounded metallic texture" rather than "robot" |
| Action | What it is doing, and its posture | "Both hands typing on the keyboard, head slightly tilted" |
| Environment | Where it is, and the background | "Modern minimalist office, city night view outside floor-to-ceiling windows" |
| Details | Surrounding items and elements | "Coffee cup, succulent plant, and sticky notes on the desk" |
| Style | Art style and lighting | "Cyberpunk style, neon lighting, volumetric light" |
| Color | Main color tone | "Blue-purple gradient as the main tone, orange accents" |
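One way to make sure no element is forgotten is to assemble the prompt from six named fields. The sketch below is only an illustration: the `PromptSpec` class and its `render` method are hypothetical and not part of the skill's scripts. The sample values are the Chinese originals of the table examples, joined with the full-width comma used in the prompts above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PromptSpec:
    """The six required prompt elements, in the order of the table above."""
    subject: str
    action: str
    environment: str
    details: str
    style: str
    color: str

    def render(self) -> str:
        # Refuse to build a prompt if any element is empty.
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError(f"missing prompt elements: {', '.join(missing)}")
        return ",".join(getattr(self, f.name).strip() for f in fields(self))


spec = PromptSpec(
    subject="蓝色圆润金属质感的AI机器人",
    action="双手放在键盘上打字,微微侧头",
    environment="现代简约办公室,落地窗外是城市夜景",
    details="桌上有咖啡杯、多肉植物、便签纸",
    style="赛博朋克风格,霓虹灯光,体积光效果",
    color="蓝紫渐变主色调,橙色点缀",
)
prompt = spec.render()
print(prompt)
```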
5. Visual Style Consistency
All images in the same video must keep a consistent style:
- Use the same style prefix
- Use the same color palette
- Use the same aspect ratio (never mix portrait and landscape!)
```bash
# Define one shared style prefix (in Chinese!)
# (meaning: modern tech illustration style, clean vector design, blue-purple gradient palette,
#  professional infographic look, high information density, neon glow, dark background)
STYLE="现代科技插画风格,干净矢量设计,蓝紫渐变配色,专业信息图美感,高信息密度,霓虹发光效果,深色背景"

# Every image uses this prefix and the same aspect ratio
# ([具体场景内容] = [specific scene content])
python text_to_image.py "$STYLE,[具体场景内容]" -r 16:9 -o scene01.png
python text_to_image.py "$STYLE,[具体场景内容]" -r 16:9 -o scene02.png
```
6. Aspect Ratio Guide
| Scenario | Aspect Ratio | Description |
|---|---|---|
| Default/General | 16:9 | Bilibili, YouTube, WeChat Official Account videos, PPT illustrations |
| Douyin/WeChat Channels/Kuaishou | 9:16 | Vertical short-video platforms; only when the user specifies it or a workflow requires it |
| Xiaohongshu | 3:4 | Xiaohongshu note images; only when the user specifies it or a workflow requires it |
| WeChat Moments | 1:1 | Square; only when the user specifies it |
Rule: when unsure which ratio to use, always use 16:9
Core Workflow (Rules)
Story Video Generation Workflow (Nested Workflow)
When the user provides a story, plot, or script, strictly follow this nested workflow:
┌─────────────────────────────────────────────────────────────┐
│ Layer 1: Story → Split into scenes → Parallel generation of scene main images (text-to-image) │
│ │
│ Havoc in Heaven → Scene 1: Sun Wukong humiliated as Horse Keeper │
│ Scene 2: Returns to Flower Fruit Mountain on Cloud Somersault │
│ Scene 3: Jade Emperor sends troops │
│ ... │
│ → Parallel call text_to_image.py to generate main image for each scene │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Layer 2: Each scene main image → Generate close-up shots via image-to-image (maintain character consistency) │
│ │
│ Scene 1 main image → Close-up 1: Sun Wukong doubts the official seal │
│ Close-up 2: Sun Wukong kicks over the horse trough │
│ Scene 2 main image → Close-up 1: Soars on Cloud Somersault │
│ Close-up 2: Proclaims himself Great Sage Equal to Heaven at Flower Fruit Mountain │
│ → Parallel call image_to_image.py with main image as reference │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Layer 3: Generate dubbing + subtitles + validation + video synthesis │
│ │
│ 1. tts_generator.py generates dubbing + timestamps │
│ 2. [Rule] Precisely calculate duration of each image based on timestamps │
│ 3. Generate SRT subtitles │
│ 4. Generate video_config.yaml │
│ 5. ⛔ Run verify_alignment.py for validation (must pass!) │
│ 6. video_maker.py synthesis: │
│ → Image synthesis (with transitions) │
│ → Audio merging │
│ → Burn subtitles (ASS format, fixed at bottom center) │
│ → Auto-append outro (QR code + "Follow for more content") │
│ → Add BGM │
└─────────────────────────────────────────────────────────────┘
**Rule: every video must automatically append the outro!**
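Layer 3 includes generating SRT subtitles from the TTS timestamps. The sketch below shows one plausible way to do that. It is not the skill's own implementation: the helper names are hypothetical, and each timestamp entry is assumed to carry `start`, `end`, and `text` keys (`text` in particular is an assumption). The SRT block layout itself is standard: index, time range, text, blank line.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(sentences) -> str:
    """Build SRT text: numbered blocks separated by a blank line."""
    blocks = []
    for index, sentence in enumerate(sentences, start=1):
        time_range = f"{srt_time(sentence['start'])} --> {srt_time(sentence['end'])}"
        blocks.append(f"{index}\n{time_range}\n{sentence['text']}\n")
    return "\n".join(blocks)


# Two sentences from the poetry example, with the assumed "text" key.
sample = [
    {"start": 0.1, "end": 2.6, "text": "静夜思,唐,李白。"},
    {"start": 2.6, "end": 5.7, "text": "床前明月光,疑是地上霜。"},
]
srt_text = to_srt(sample)
print(srt_text)
```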
Directory Structure Specification
assets/generated/{project_name}/
├── scene1/
│   ├── main.png              # Scene 1 main image (text-to-image)
│   ├── shot_01.png           # Close-up 1 (image-to-image)
│   └── shot_02.png           # Close-up 2 (image-to-image)
├── scene2/
│   ├── main.png
│   ├── shot_01.png
│   └── shot_02.png
├── ...
├── narration.mp3             # Dubbing
├── narration.json            # Timestamps
├── subtitles.srt             # Subtitles
├── video_config.yaml         # Video configuration
└── {project_name}.mp4        # Final video
Example Execution Commands
```bash
# Layer 1: generate the scene main images in parallel (default 16:9)
# ("风格描述,场景1内容" = "style description, scene 1 content")
python .opencode/skills/image-service/scripts/text_to_image.py "风格描述,场景1内容" -r 16:9 -o scene1/main.png &
python .opencode/skills/image-service/scripts/text_to_image.py "风格描述,场景2内容" -r 16:9 -o scene2/main.png &
wait

# Layer 2: generate close-up shots in parallel via image-to-image (same aspect ratio)
# ("保持角色风格,细镜头描述" = "keep the character style, close-up description")
python .opencode/skills/image-service/scripts/image_to_image.py scene1/main.png "保持角色风格,细镜头描述" -r 16:9 -o scene1/shot_01.png &
python .opencode/skills/image-service/scripts/image_to_image.py scene1/main.png "保持角色风格,细镜头描述" -r 16:9 -o scene1/shot_02.png &
wait

# Layer 3: generate dubbing + validate + synthesize the video
# ("完整旁白" = "the full narration text")
python .opencode/skills/video-creator/scripts/tts_generator.py --text "完整旁白" --output narration.mp3 --timestamps

# ⛔ Mandatory validation (must pass before synthesis!)
python .opencode/skills/video-creator/scripts/verify_alignment.py video_config.yaml

# Synthesize only after validation passes
python .opencode/skills/video-creator/scripts/video_maker.py video_config.yaml --srt subtitles.srt --bgm epic
```
---
Video Configuration File Format
```yaml
# video_config.yaml
ratio: "16:9"   # Default landscape! Must be quoted to avoid YAML parsing errors
bgm_volume: 0.12
outro: true
scenes:
  - audio: narration.mp3
    images:
      # Arrange all close-up shots in scene order
      - file: scene1/shot_01.png
        duration: 4.34
      - file: scene1/shot_02.png
        duration: 4.88
      - file: scene2/shot_01.png
        duration: 2.15
      # ...
```
**Note**: `ratio` must be wrapped in quotes, e.g. `"16:9"`; otherwise a YAML 1.1 parser such as PyYAML reads the unquoted value as a time-like (base-60) number instead of a string.
---
Duration Allocation Specification (Rule!)
Before generating video_config.yaml, compute every duration strictly by the following steps:
Step 1: Read Timestamp File
```python
import json

# narration.json holds the per-sentence timestamps produced by the TTS step.
with open("narration.json", "r", encoding="utf-8") as f:
    timestamps = json.load(f)

# The end of the last sentence is the total audio duration.
audio_duration = timestamps[-1]["end"]
print(f"Total audio duration: {audio_duration:.1f}s")
```
Step 2: Divide Scenes by Content Semantics
Based on the narration content, determine the time range each image covers:
```python
# Example: divide according to the narration content.
# Find the timestamp of each topic change.
scenes = [
    ("cover.png", 0, 12.5),      # Opening up to the first topic change
    ("scene01.png", 12.5, 26),   # Second segment
    # ... divide precisely along the sentence boundaries in narration.json
]
```
Step 3: Calculate Duration for Each Image
```python
for file, start, end in scenes:
    duration = end - start
    print(f"{file}: {duration:.1f}s")
```
Step 4: Validate Total Duration
```python
# Each entry in scenes is (file, start, end), so its duration is end - start.
total_duration = sum(end - start for _, start, end in scenes)
assert abs(total_duration - audio_duration) < 1.0, \
    f"Duration mismatch! Total image duration {total_duration}s vs audio duration {audio_duration}s"
```
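Once the total matches, the scene list can be turned into the configuration text. The following is a rough sketch under stated assumptions: the function `render_video_config` is hypothetical, and the nesting of `scenes`, `audio`, `images`, `file`, and `duration` is inferred from the configuration example in this document rather than confirmed against video_maker.py. It writes plain text so that no YAML library is needed.

```python
def render_video_config(audio, scenes, ratio="16:9", bgm_volume=0.12, outro=True):
    """Render config text from (file, start, end) tuples; the layout is an assumption."""
    lines = [
        f'ratio: "{ratio}"',  # quoted so the ratio stays a string
        f"bgm_volume: {bgm_volume}",
        f"outro: {'true' if outro else 'false'}",
        "scenes:",
        f"  - audio: {audio}",
        "    images:",
    ]
    for file, start, end in scenes:
        duration = end - start
        if duration <= 0:
            raise ValueError(f"{file}: duration must be positive, got {duration}")
        lines.append(f"      - file: {file}")
        lines.append(f"        duration: {duration:.2f}")
    return "\n".join(lines) + "\n"


config_text = render_video_config(
    "narration.mp3",
    [("cover.png", 0, 12.5), ("scene01.png", 12.5, 26)],
)
print(config_text)
```

Every duration is derived from the timestamps rather than typed by hand, which is the point of the workflow above.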
Rules
- Read the timestamps from narration.json first; never estimate by feel
- Divide along the semantic boundaries of sentences; never distribute evenly
- Validate before generating the configuration, making sure total image duration = total audio duration (error < 0.5s)
- Never let the script auto-stretch; a video whose audio and visuals are out of sync is rejected
- duration=0 is forbidden; each image is shown for at least 1 second
- After generating the yaml, validate it with verify_alignment.py before synthesis
Validation Script (Mandatory! Must Pass Before Synthesis!)
```bash
# ⛔ Run the validation script before synthesizing the video; synthesis is forbidden if it fails!
python .opencode/skills/video-creator/scripts/verify_alignment.py video_config.yaml

# What it checks:
# 1. Whether all image files exist
# 2. Whether each duration is an exact number (non-numeric values are rejected outright!)
# 3. Total image duration vs total audio duration (error < 1s)
# 4. Whether each image duration is within a reasonable range (1-7s)
# 5. Semantic cross-check of image filename keywords against spoken-content keywords
# 6. Prints a full comparison table (image + duration + semantic ✅/❌ + matching spoken text)
#
# Exit code 0 = pass, 1 = fail
# On failure, synthesis is forbidden; fix the problem and validate again!
```
**Note**: video_maker.py also has a built-in hard check: duration must be an exact positive float, and a missing or non-numeric value makes it refuse to run and exit(1)! duration: auto has been completely removed!
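To make checks 2 to 4 concrete, here is a minimal Python sketch of the same idea. It is an approximation written for illustration, not the code of verify_alignment.py: the function name `check_durations` and its return shape are assumptions, and the out-of-range case is treated as a warning to match the parameter table in this document.

```python
def check_durations(durations, audio_duration, tolerance=1.0, min_s=1.0, max_s=7.0):
    """Return (errors, warnings) for a list of per-image durations."""
    errors, warnings = [], []
    total = 0.0
    for index, value in enumerate(durations):
        # bool is a subclass of int, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or value <= 0:
            errors.append(f"image {index}: duration {value!r} is not a positive number")
            continue
        total += value
        if not (min_s <= value <= max_s):
            warnings.append(f"image {index}: {value:.2f}s is outside {min_s:g}-{max_s:g}s")
    if abs(total - audio_duration) >= tolerance:
        errors.append(f"total {total:.2f}s vs audio {audio_duration:.2f}s differs by 1s or more")
    return errors, warnings


errors, warnings = check_durations([4.34, 4.88, 2.15], audio_duration=11.4)
print(errors, warnings)
```

A caller would stop before synthesis whenever `errors` is non-empty, mirroring the exit code 1 behaviour described above.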
Duration Allocation Table Template
Before generating the configuration, output the allocation table for confirmation:
```markdown
| Scene Image | Corresponding Content | Start | End | Duration |
|-------------|-----------------------|-------|-----|----------|
| cover.png | Opening introduction | 0s | 12.5s | 12.5s |
| scene01.png | AI Agent era | 12.5s | 26s | 13.5s |
| ... | ... | ... | ... | ... |
| **Total** | | | | **{total}s** |
Total audio duration: {audio_duration}s
Difference: {diff}s ✅/❌
```
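The template can be filled in programmatically. The sketch below is illustrative only: `allocation_table` is a hypothetical helper, each scene is assumed to be a `(file, content, start, end)` tuple, and the ✅/❌ decision uses the 1-second tolerance stated in the workflow.

```python
def allocation_table(scenes, audio_duration, tolerance=1.0):
    """Render the allocation table from (file, content, start, end) tuples."""
    rows = [
        "| Scene Image | Corresponding Content | Start | End | Duration |",
        "|-------------|-----------------------|-------|-----|----------|",
    ]
    total = 0.0
    for file, content, start, end in scenes:
        duration = end - start
        total += duration
        rows.append(f"| {file} | {content} | {start:.1f}s | {end:.1f}s | {duration:.1f}s |")
    rows.append(f"| **Total** | | | | **{total:.1f}s** |")
    diff = abs(total - audio_duration)
    mark = "✅" if diff < tolerance else "❌"
    rows.append(f"Total audio duration: {audio_duration:.1f}s")
    rows.append(f"Difference: {diff:.1f}s {mark}")
    return "\n".join(rows)


table = allocation_table(
    [("cover.png", "Opening introduction", 0, 12.5),
     ("scene01.png", "AI Agent era", 12.5, 26)],
    audio_duration=26.0,
)
print(table)
```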
Subtitle Specification
Subtitles use the ASS format and are pinned to a fixed bottom-center position:
- Position: bottom center (Alignment=2)
- Font: PingFang SC
- Size: screen height / 40
- Outline: 2px black outline + 1px shadow
- Bottom margin: screen height / 20
Forbidden: subtitles that jump around, vary in size, or change position
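For a 1080-pixel-high video these rules give a font size of 27 and a bottom margin of 54. The sketch below builds a matching ASS `Style:` line using the standard V4+ field order. It is an illustration, not what video_maker.py is known to emit: the function name, the white text colour, the left and right margins of 10, and the assumption that the script's PlayResY equals the video height are all mine.

```python
def ass_style(video_height: int, name: str = "Default") -> str:
    """Build an ASS V4+ Style line from the subtitle rules above."""
    font_size = round(video_height / 40)  # size: screen height / 40
    margin_v = round(video_height / 20)   # bottom margin: screen height / 20
    fields = [
        name, "PingFang SC", font_size,
        "&H00FFFFFF",  # PrimaryColour: white text (assumed)
        "&H000000FF",  # SecondaryColour (unused for plain subtitles)
        "&H00000000",  # OutlineColour: black outline
        "&H00000000",  # BackColour: black shadow
        0, 0, 0, 0,    # Bold, Italic, Underline, StrikeOut
        100, 100, 0, 0,  # ScaleX, ScaleY, Spacing, Angle
        1,             # BorderStyle: outline + drop shadow
        2,             # Outline width: 2px
        1,             # Shadow depth: 1px
        2,             # Alignment: bottom center
        10, 10, margin_v,  # MarginL, MarginR (assumed), MarginV
        1,             # Encoding
    ]
    return "Style: " + ",".join(str(field) for field in fields)


style_line = ass_style(1080)
print(style_line)
```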
Script Parameter Description
video_maker.py
```bash
python video_maker.py config.yaml [options]
```
| Parameter | Description | Default Value |
|---|---|---|
| | Do not add the outro | Outro added |
| | Do not add BGM | BGM added |
| | Transition duration (seconds) | 0.5 |
| | BGM volume | 0.08 |
| `--bgm` | Custom BGM (optional, e.g. epic) | Default tech style |
| | Video aspect ratio | 16:9 (overridden by the configuration file) |
| `--srt` | Subtitle file path | None |
verify_alignment.py
```bash
python verify_alignment.py video_config.yaml
```
| Validation Item | Description |
|---|---|
| Image existence | All image files must exist |
| Duration precision | Must be a positive float; auto, empty, and non-numeric values are forbidden |
| Total duration match | Total image duration vs total audio duration, error < 1s |
| Single-image duration range | 1-7 seconds per image; a warning is raised outside this range |
| Semantic cross-check | Keywords in the image filenames of the summary segment are matched against keywords in the spoken content |
tts_generator.py
```bash
python tts_generator.py --text "文本" --output audio.mp3 [options]   # "文本" = the text to speak
```
| Parameter | Description | Default Value |
|---|---|---|
| | Voice ID | zh-CN-YunxiNeural |
| | Speech rate | +0% |
| `--timestamps` | Output the timestamp JSON | No |
Supported Video Aspect Ratios
Consistent with image-service, 10 aspect ratios are supported:
| Aspect Ratio | Resolution | Applicable Scenario |
|---|---|---|
| 1:1 | 1024×1024 | Square, WeChat Moments |
| 2:3 | 832×1248 | Portrait poster |
| 3:2 | 1248×832 | Landscape poster |
| 3:4 | 1080×1440 | Xiaohongshu, WeChat Moments |
| 4:3 | 1440×1080 | Traditional monitor |
| 4:5 | 864×1080 | |
| 5:4 | 1080×864 | Landscape photo |
| 9:16 | 1080×1920 | Douyin, WeChat Channels, vertical screen |
| 16:9 | 1920×1080 | Bilibili, YouTube, landscape screen |
| 21:9 | 1536×672 | Ultra-wide cinematic |
Outro Specification
Rule: every video must automatically append the outro that matches its dimensions!
Outro matching order:
- Exact match: outro_{ratio}.mp4
- Orientation match: portrait → outro_9x16.mp4, landscape → outro_16x9.mp4
- Fallback: outro.mp4
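The matching order can be expressed as a short lookup. This is a sketch of the described order, not the actual logic in video_maker.py: `pick_outro` is a hypothetical name, the `{ratio}` placeholder is assumed to take the form `9x16` (as the listed filenames suggest), and treating a square ratio as landscape is my own assumption because the text does not cover it.

```python
from typing import Iterable, Optional


def pick_outro(ratio: str, available: Iterable[str]) -> Optional[str]:
    """Pick an outro file: exact ratio, then orientation, then the generic fallback."""
    width, height = (int(part) for part in ratio.split(":"))
    orientation = "outro_9x16.mp4" if height > width else "outro_16x9.mp4"
    candidates = [f"outro_{width}x{height}.mp4", orientation, "outro.mp4"]
    files = set(available)
    for name in candidates:
        if name in files:
            return name
    return None  # no outro file at all


# The outro files listed in this skill's assets directory.
assets = ["outro.mp4", "outro_9x16.mp4", "outro_3x4.mp4"]
print(pick_outro("3:4", assets))   # exact match
print(pick_outro("4:5", assets))   # portrait, falls back to the 9:16 outro
print(pick_outro("21:9", assets))  # landscape, falls back to the generic outro
```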
BGM Resources
Classification by Style
| Style | File | Shortcut Parameter | Applicable Scenario |
|---|---|---|---|
| Ancient Chinese Style | | | Water Margin, Three Kingdoms, historical stories |
| | | | Martial arts, action, combat |
| | | | Zen, meditation, Chinese classics |
| | | | Pastoral, leisurely, scenery |
| Healing/Relaxing | | | Vlog, daily life, lifestyle |
| | | | Dreamy, memories, warmth |
| | | | Cheerful, bright, sunny |
| Passionate/Epic | | | Inspirational, battle, high-energy |
| | | | Heroes, victory, glory |
| | | | War, epic, grand |
| | | | Cinematic, narrative, emotional |
| Tech/Futuristic | | | AI, products, tutorials |
| | | | Digital, network, internet |
| | | | Sci-fi, space, mystery |
| Suspense/Tense | | | Detective, tense, crisis |
| | | | Horror, dark, suspense |
| Cheerful/Lively | | | Comedy, lighthearted, rhythmic |
| | | | Cute, children, animation |
| | | | Playful, fun, short videos |
| Electronic/Rhythmic | | | Dynamic, editing, beat-sync |
| | | | Electronic, modern, trendy |
| | | | Games, 8-bit, retro electronic |
| | | | Rap, street, cool |
Usage
```bash
# Method 1: shortcut parameter (recommended)
python video_maker.py config.yaml --bgm epic
python video_maker.py config.yaml --bgm ancient
python video_maker.py config.yaml --bgm edm

# Method 2: full filename
python video_maker.py config.yaml --bgm bgm_ancient_tale.mp3

# Method 3: custom path
python video_maker.py config.yaml --bgm /path/to/custom.mp3
```
---
Common Voice IDs
| Voice ID | Style |
|---|---|
| zh-CN-YunyangNeural | Male, news broadcast |
| zh-CN-YunxiNeural | Male, sunny and lively |
| zh-CN-XiaoxiaoNeural | Female, warm and natural |
| zh-CN-XiaoyiNeural | Female, lively and cute |
Directory Structure
video-creator/
├── SKILL.md
├── scripts/
│   ├── video_maker.py        # Main script: images + audio → video (built-in hard check on duration)
│   ├── verify_alignment.py   # Mandatory pre-synthesis validation (duration + semantic cross-check)
│   ├── tts_generator.py      # TTS voice generation
│   └── scene_splitter.py     # Scene splitter (optional)
├── assets/
│   ├── outro.mp4             # General outro (16:9)
│   ├── outro_9x16.mp4        # Portrait outro
│   ├── outro_3x4.mp4         # 3:4 outro
│   └── bgm_*.mp3             # 22 BGM tracks (see the BGM resources table)
└── references/
    └── edge_tts_voices.md
Dependencies
```bash
# System dependencies
brew install ffmpeg  # Mac

# Python dependencies
pip install edge-tts pyyaml
```