# Video Generation

Guide to video generation in MassGen. Use when creating videos from text prompts or images across Grok, Google Veo, and OpenAI Sora backends.
Generate videos using `generate_media` with `mode="video"`. The system auto-selects the best backend based on available API keys.

## Quick Start
```python
# Simple text-to-video (auto-selects backend)
generate_media(prompt="A robot walking through a city", mode="video")

# Specify backend and duration
generate_media(prompt="Ocean waves crashing on rocks", mode="video",
               backend_type="google", duration=8)

# With aspect ratio
generate_media(prompt="A timelapse of clouds", mode="video",
               backend_type="grok", aspect_ratio="16:9", duration=10)
```

## Backend Comparison
| Backend | Default Model | Duration Range | Default Duration | Resolutions | API Key |
|---|---|---|---|---|---|
| Grok (priority 1) | | 1-15s | 5s | 480p, 720p | |
| Google Veo (priority 2) | | 4-8s | 8s | 720p, 1080p, 4K | |
| OpenAI Sora (priority 3) | | 4, 8, or 12s (discrete) | 4s | Standard | |
## Key Parameters

| Parameter | Description | Example |
|---|---|---|
| `prompt` | Text description of the video | `"A robot walking through a city"` |
| `backend_type` | Force a specific backend | `"google"` |
| `model` | Override default model | |
| `duration` | Video length in seconds | `8` |
| `aspect_ratio` | Video aspect ratio | `"16:9"` |
| `resolution` | Resolution (Grok: 480p/720p; Veo: 720p/1080p/4k) | |
| `input_images` | Source image for image-to-video | `["scene.jpg"]` |
| | Style/content guide images (Veo, up to 3) | |
| | What to exclude (Veo) | |
## Duration Handling

Each backend has different duration constraints. `generate_media` automatically clamps the requested duration:

- Grok: continuous range 1-15s (clamped to bounds)
- Google Veo: continuous range 4-8s (clamped to bounds); defaults to 16:9 aspect ratio
- OpenAI Sora: discrete values only (4, 8, or 12s); snaps to the nearest valid value

A warning is logged if the duration is adjusted.
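The clamping behavior above can be sketched as follows. This is an illustrative sketch, not MassGen's actual implementation; the backend names mirror the `backend_type` values used elsewhere in this guide.

```python
# Illustrative sketch of per-backend duration handling (not MassGen's real code).
GROK_RANGE = (1, 15)       # continuous seconds
VEO_RANGE = (4, 8)         # continuous seconds
SORA_STEPS = (4, 8, 12)    # discrete seconds only

def clamp_duration(backend: str, requested: float) -> float:
    """Clamp or snap a requested duration to what the backend accepts."""
    if backend == "grok":
        lo, hi = GROK_RANGE
        return min(max(requested, lo), hi)   # clamp into 1-15s
    if backend == "google":
        lo, hi = VEO_RANGE
        return min(max(requested, lo), hi)   # clamp into 4-8s
    if backend == "openai":
        # Sora accepts only discrete values: snap to the nearest one
        return min(SORA_STEPS, key=lambda step: abs(step - requested))
    raise ValueError(f"unknown backend: {backend}")

clamp_duration("grok", 30)   # → 15
clamp_duration("openai", 5)  # → 4
```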
## Image-to-Video

All three video backends support starting a video from an existing image via `input_images`:

```python
generate_media(
    prompt="Animate this scene with gentle movement",
    mode="video",
    input_images=["scene.jpg"],
    duration=5
)
```

The first image in `input_images` is used; additional images are ignored.

## Generation Time
Video generation is significantly slower than image generation. All backends use polling:
- Grok: SDK handles polling internally (up to 10 min timeout)
- Google Veo: Custom polling every 20s (up to 10 min)
- OpenAI Sora: Custom polling every 2s
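All three backends follow the same generic pattern, sketched below; `poll_until_done`, the status-dict shape, and its fields are illustrative assumptions, not MassGen's API.

```python
# Generic polling sketch; real clients wrap this internally with the
# intervals listed above (e.g. 20s for Veo, 2s for Sora).
import time

def poll_until_done(fetch_status, interval_s=20, timeout_s=600):
    """Poll fetch_status() until it reports completion or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()            # assumed to return {"state": ...}
        if status["state"] == "completed":
            return status
        if status["state"] == "failed":
            raise RuntimeError(f"generation failed: {status}")
        time.sleep(interval_s)
    raise TimeoutError(f"video not ready after {timeout_s}s")
```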
## Veo 3.1: Native Audio

Veo 3.1 generates audio (dialogue, SFX, ambient) automatically from prompt content. No extra parameter is needed; just describe the sounds:

- Dialogue: use quotation marks in the prompt (`"Hello," she said.`)
- Sound effects: describe sounds (`tires screeching, engine roaring`)
- Ambient: describe atmosphere (`eerie hum resonates through the hallway`)
## Veo 3.1: Extension Constraints

When extending videos via `continue_from` with a `veo_vid_*` ID:

- Resolution is forced to 720p (API requirement for extensions)
- Only 16:9 and 9:16 aspect ratios are supported
- Each extension adds up to 7 seconds (API limit: 20 extensions, ~141s total)
- Generated videos are retained for 2 days before expiring
## Producing Longer Videos

Current APIs cap at 15 seconds per clip (Grok), with most backends at 4-8s. There is no way to generate a continuous 30+ second video in one call. The proven approach:

1. Plan a shot list: break your video into 6-8s segments with specific camera language per shot
2. Generate clips in parallel: launch all segments concurrently using `background=True`
3. Composite in Remotion (see below): layer programmatic animation on top of generated footage
4. Bridge with audio: a unified narration or music track smooths over visual cuts between clips

For visual continuity, use the same style anchor in every prompt (e.g., "BBC Earth documentary cinematography") and maintain consistent lighting/color descriptions.

Full production guide with examples, transition types, and duration strategy: see references/production.md
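The shot-list approach can be sketched as follows. `generate_media` is stubbed here so the parallel-launch pattern runs standalone, and the prompts are invented examples; in MassGen the real tool would enqueue each generation job.

```python
# Illustrative only: generate_media is stubbed so this runs standalone.
def generate_media(prompt, mode, duration, background=False):
    # A real call would submit a generation job; here we fake a task handle.
    return {"prompt": prompt, "status": "queued" if background else "done"}

STYLE = "BBC Earth documentary cinematography"  # shared style anchor for continuity
shots = [
    f"Aerial dawn flyover of a rocky coastline, {STYLE}",
    f"Close-up of waves crashing on rocks, {STYLE}",
    f"Seabirds circling sea cliffs at golden hour, {STYLE}",
]

# Launch every 6-8s segment concurrently; collect handles to poll later
tasks = [generate_media(prompt=s, mode="video", duration=8, background=True)
         for s in shots]
```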
## Hybrid Workflow: AI Footage + Remotion Animation

The best results come from combining AI-generated footage with Remotion's programmatic animation, not from choosing one or the other.

AI video generation produces photorealistic, cinematic footage that pure programmatic rendering cannot match. Remotion produces precise typography, motion graphics, overlays, and transitions that AI generation cannot reliably control. Use both together.

### The Rule: Generate First, Composite Second

1. Generate AI clips for cinematic/photorealistic shots (environments, product demos, atmospheric footage)
2. Use those clips as visual foundations in Remotion: import them as `<Video>` or `<OffthreadVideo>` background layers
3. Composite programmatic elements on top: typography, motion graphics, logos, data overlays, transitions, captions
4. Fill gaps with pure Remotion animation: title cards, intro sequences, motion-graphics-only segments where AI footage isn't needed
### Do NOT Discard Generated Clips
Every AI-generated clip costs real money and time. Do not abandon generated footage and fall back to purely programmatic rendering. This is a common failure mode — agents generate clips, notice minor artifacts (e.g., repeated patterns, slight distortion), then pivot entirely to OpenCV/PIL/moviepy rendering, wasting all the generation budget.
Instead:
| Situation | Wrong Approach | Right Approach |
|---|---|---|
| Minor artifacts in generated clip | Discard clip, render from scratch with OpenCV | Use clip as background, mask artifacts with overlays/motion graphics |
| Generated clip doesn't match vision exactly | Regenerate or abandon | Composite typography/effects on top to guide the viewer's attention |
| Need precise text/logo placement | Skip AI generation, use pure programmatic | Generate atmospheric footage, overlay text in Remotion |
| Some shots need AI footage, others don't | Use one approach for everything | Mix: AI-backed shots + pure Remotion animation shots |
### Cost Awareness

Each `generate_media(mode="video")` call is expensive. Plan before generating:

- Decide which shots need AI footage before generating anything; not every shot needs it
- Generate only what you'll use; don't speculatively generate 8 clips hoping some work out
- Review and use what you generate: analyze each clip with `read_media`, then plan your Remotion composition around the actual footage
- One good clip composited well beats five unused clips; invest in composition quality over generation quantity
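The review step can be sketched as below; `read_media` is stubbed and its return shape is an assumption for illustration, not the tool's actual output.

```python
# Illustrative: read_media is stubbed; the real tool's return shape may differ.
def read_media(path):
    return {"path": path, "duration": 8,
            "notes": "slight texture repetition in sky"}

clips = ["shot_01.mp4", "shot_02.mp4"]

# Inspect every generated clip, then plan the composition around what exists,
# masking minor artifacts with overlays instead of discarding footage
reviews = {clip: read_media(clip) for clip in clips}
usable = [c for c, r in reviews.items() if r["duration"] >= 4]
```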
## Post-Production: Always Use Remotion

Remotion is the default post-production tool for any video that needs editing beyond simple concatenation. This includes captions, titles, transitions, overlays, and motion graphics: essentially any video intended to look professional. Do not use raw ffmpeg `drawtext` or manual filter chains for these tasks; the results look amateur compared to what Remotion produces.

When you have video clips to assemble, load the Remotion skill and use it. This is not optional for professional output.
## Loading the Remotion Skill

Load the skill to get detailed rules and code examples:

- Local path (if installed via quickstart): `.agent/skills/remotion/SKILL.md`
- Remote repo (if not installed): https://github.com/remotion-dev/skills
## What Remotion Gives You

| Capability | Remotion | Raw ffmpeg |
|---|---|---|
| Styled animated captions | CSS-styled, word-level highlighting, animations | `drawtext` only |
| Title cards / lower thirds | React components, any font/layout | Manual positioning, limited fonts |
| Scene transitions | Timing curves, spring animations, custom effects | Basic xfade (fade, wipe) |
| Motion graphics | Full React/CSS/Three.js/Lottie ecosystem | Not possible |
| Light leak / overlay effects | Built-in | Complex filter chains |
| Text animations | Typography effects, per-character animation | Not feasible |
| AI footage + overlays | Import clips as `<Video>` | Not feasible at quality |
## When ffmpeg Alone Is Sufficient

Only use ffmpeg without Remotion for:

- Concatenating clips with no captions, titles, or transitions (just hard cuts)
- Audio mixing / ducking (ffmpeg or Pydub)
- Color grading via LUT files (`lut3d` filter)
- Quick format conversion or rescaling
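For the hard-cut concatenation case, a minimal sketch using ffmpeg's concat demuxer from Python follows; the file names are illustrative, and ffmpeg must be on PATH when the command is actually run.

```python
# Sketch: hard-cut concatenation via ffmpeg's concat demuxer (no re-encode).
import os
import subprocess
import tempfile

def concat_clips(clips, output="final.mp4", run=False):
    """Write a concat list file and return the ffmpeg command (optionally run it)."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.writelines(f"file '{os.path.abspath(c)}'\n" for c in clips)
        list_path = f.name
    cmd = ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path,
           "-c", "copy", output]   # stream copy: clips must share codec/resolution
    if run:
        subprocess.run(cmd, check=True)
    return cmd

cmd = concat_clips(["shot_01.mp4", "shot_02.mp4"])
```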
## Workflow

1. Generate AI clips with `generate_media` (parallel, background mode) for shots that need cinematic/photorealistic quality
2. Review clips with `read_media`: assess what you have, plan composition around the actual footage
3. Generate audio (narration, music) with `generate_media(mode="audio")`
4. Load the Remotion skill and set up a Remotion project
5. Composite in Remotion: import AI clips as `<Video>` background layers, overlay typography/motion graphics/captions, add pure-animation segments for title cards and transitions
6. Render via Remotion's headless renderer
## Key Remotion Rule Files to Load

When working on a specific task, load the relevant rule files from the Remotion skill:

- Captions/subtitles: `rules/subtitles.md`, `rules/display-captions.md`, `rules/transcribe-captions.md`
- Transitions: `rules/transitions.md`
- Text animations: `rules/text-animations.md`
- Light leaks: `rules/light-leaks.md`
- Audio: `rules/audio.md`, `rules/audio-visualization.md`
- Sequencing/timeline: `rules/sequencing.md`, `rules/trimming.md`
- 3D motion graphics: `rules/3d.md`
- Animations/timing: `rules/animations.md`, `rules/timing.md`
## Need More Control?
- Per-backend resolution, duration details, and quirks: See references/backends.md
- Video continuation, remix, and image-to-video: See references/editing.md
- Multi-shot production, transitions, and cinematic workflow: See references/production.md