cinematic-cutscene — Locked-Look Story Beat in 5-10 Seconds
A cutscene is a non-interactive video clip the game plays back at a fixed moment — opening, story beat, character intro, ending. Quality is dominated by two things: continuity (the character on screen looks like the character in the game) and shot length (every clip past 10s drifts in identity, hands, and physics). This skill enforces both — generate a reference image first to lock the look, then image-to-video each shot, then chain shots in sequence rather than asking the model for one long take.
If the user wants 5 seconds of marketing footage, that's
. If they want a 3-second seamless background loop, that's
. This skill is for narrative beats with a defined start, middle, and end — usually with dialogue or VO.
When to use
- "Generate the opening cinematic for the game."
- "I need a cutscene where the witch turns to the player and warns them about the curse."
- "Make a 10-second character intro for the boss."
- "Add an ending sequence after the player defeats the dragon."
- "Cinematic where the village burns and the protagonist runs."
- The user has a defined moment with story content — not a vibe loop, not a marketing splash.
When NOT to use
- "5-second slow-mo combat shot for the trailer" — .
- "Looping splash screen background" — .
- "Animated logo backdrop on the title screen" — .
- Real-time in-engine cinematic with the actual gameplay characters and camera — that's a Godot AnimationPlayer / Timeline job, not generated video. Route to .
- Voice-only narration over a static image — generate the dialogue with and use a static , no video needed.
Steps
1. Read the soul file and any prior cutscenes
Read .summer/GameSoul.md
summer_search_assets(query="cutscene", filter={ kind: "video" })
summer_search_assets(query="<character name> reference", filter={ kind: "image" })
If a reference image of the character already exists from a prior
or
run,
reuse it as the
— that single decision is the difference between the character looking like themselves and looking like a stranger.
2. Plan the shot list
Cutscenes longer than 10 seconds are multiple shots, not one take. Ask the user to break it down:
Want this as one 10s shot, or three shots (e.g. wide → close on face → reaction)? Each shot is 5-10s. I'll generate a reference image once, then image-to-video each shot off of it.
Default if the user is vague: 3 shots, 5s each, ~$1.50 total on kling. Confirm before spending.
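The default above can be sanity-checked with a little arithmetic before asking for confirmation. The sketch below is only an illustration: the function names are hypothetical, and it assumes the per-shot price of a 5s kling clip (~$0.50) and the ~$0.05 reference still quoted elsewhere in this skill. Prices are kept in cents to avoid float rounding.

```python
def plan_shots(total_seconds: int, shot_seconds: int = 5) -> list[int]:
    """Split a cutscene into equal-length shots, rounding the count up."""
    if total_seconds <= 0 or shot_seconds <= 0:
        raise ValueError("durations must be positive")
    count = -(-total_seconds // shot_seconds)  # ceiling division
    return [shot_seconds] * count


def estimate_cost_cents(shot_count: int,
                        cents_per_shot: int = 50,  # assumed: one 5s kling shot
                        still_cents: int = 5) -> int:  # assumed: one reference still
    """Cost of all video shots plus a single reference still, in cents."""
    if shot_count < 0:
        raise ValueError("shot_count must not be negative")
    return shot_count * cents_per_shot + still_cents


shots = plan_shots(15)
total = estimate_cost_cents(len(shots))
print(f"{len(shots)} shots of {shots[0]}s, roughly ${total / 100:.2f} including the still")
```

For the default 15-second plan this comes to 155 cents, which lines up with the ~$1.50 figure once the still is included.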
3. Lock the look — generate a reference image first
For any character or hero scene, generate a reference still with
before generating video. This still drives
on every subsequent shot, so the character is consistent across cuts.
summer_generate_image(
prompt="<subject>, <setting>, cinematic lighting, film still, <art style from GameSoul.md>",
model="nano-banana-2",
options={ image_size: "landscape_16_9" }
)
Show the user the still and confirm it's the right look before generating video from it. Regenerating a $0.05 still beats regenerating a $0.50 video.
4. Pick the model
| Model | Cost | Speed | When |
|---|---|---|---|
| | ~$0.10 | ~30s | Iteration, blocking shots, B-roll, throwaway tests |
| | ~$0.50 | 2-4 min | Hero shots, character cutscenes, anything the player will sit and watch |
| | ~$0.30 | 1-2 min | Same as kling when iteration speed matters more than the last 10% of quality |
| | ~$1.00 | 3-5 min | Pitch decks, premium dialogue scenes with synced lip motion, short-form ad |
| | ~$0.40 | 2-3 min | Stylized / anime-leaning content; better at non-photoreal looks than kling |
Default policy:
first to validate the prompt and shot framing. If it lands the composition but quality is rough, escalate to
. Only reach for
if dialogue lip-sync matters and the user has approved the cost.
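The escalation policy can be written down as a small decision function. This is a sketch only: the tier labels `draft`, `hero`, and `lip_sync` are hypothetical stand-ins for the ~$0.10, ~$0.50, and ~$1.00 rows of the table, not real model identifiers.

```python
def pick_tier(composition_validated: bool,
              needs_lip_sync: bool,
              cost_approved: bool) -> str:
    """Return a hypothetical tier label following the default policy."""
    if not composition_validated:
        # Validate prompt and framing on the cheapest tier first.
        return "draft"
    if needs_lip_sync and cost_approved:
        # Premium tier only when on-camera dialogue matters and the user agreed to the cost.
        return "lip_sync"
    # Composition has landed: escalate to the hero-quality tier.
    return "hero"


print(pick_tier(composition_validated=False, needs_lip_sync=True, cost_approved=True))
print(pick_tier(composition_validated=True, needs_lip_sync=True, cost_approved=False))
```

Note that lip-sync without cost approval falls back to the hero tier rather than spending without consent.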
5. Generate each shot
summer_generate_video(
prompt="<subject> does <action>, <camera move>, <lighting>, cinematic, 16mm film grain",
model="kling",
imageUrl="<reference image fileUrl from step 3>",
duration=5,
aspectRatio="16:9"
)
Returns
. Show the user the URL and ask:
Shot 1 of 3 done. Land or regenerate? If land, I'll move to shot 2.
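For a multi-shot cutscene, it may help to prepare the arguments for every shot up front so each call reuses the same reference image. The helper below is a hypothetical sketch: it only builds argument dictionaries using the parameter names from the call above, and does not invoke the tool. Each shot should still be generated and confirmed one at a time.

```python
def build_shot_requests(prompts: list[str],
                        reference_url: str,
                        model: str = "kling",
                        duration: int = 5,
                        aspect_ratio: str = "16:9") -> list[dict]:
    """Build one argument dict per shot, all sharing the same reference image."""
    if not reference_url:
        raise ValueError("generate and confirm the reference still first")
    return [
        {
            "prompt": prompt,
            "model": model,
            "imageUrl": reference_url,  # same still on every shot keeps identity stable
            "duration": duration,
            "aspectRatio": aspect_ratio,
        }
        for prompt in prompts
    ]


requests = build_shot_requests(
    [
        "wide shot of a candlelit hut, slow dolly-in, cinematic",
        "close-up of the witch turning toward camera, candlelight, cinematic",
        "reaction shot of the listener stepping back, firelight, cinematic",
    ],
    reference_url="<reference image fileUrl from step 3>",
)
for index, request in enumerate(requests, start=1):
    print(f"Shot {index} of {len(requests)}: {request['prompt']}")
```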
6. Generate dialogue audio (if the scene has dialogue)
Cutscene dialogue is TTS, not in the video model. The video model can render mouth motion that
looks like talking, but the audio comes from
. Generate it separately and the editor (Godot's AnimationPlayer or your
controller scene) syncs them.
summer_generate_audio(
capability="text_to_speech",
text="They'll come for you at dawn. Run while you can.",
voiceId="<from audio bible — see audio/voice-line>"
)
Lip-sync caveat: if the video shows the character clearly mouthing words and the audio is a different cadence, viewers notice. Either (a) keep the camera off the character's face during dialogue, (b) accept the asynchrony for a stylized look, or (c) use
and prompt the dialogue text directly into the video prompt for synced motion.
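Before generating anything, a rough estimate of how long a line takes to speak can show whether the audio will overrun its shot. The sketch below assumes a speaking rate of 150 words per minute, which is a ballpark figure chosen for illustration and not something the TTS tool guarantees; the function names are hypothetical.

```python
def estimated_speech_seconds(line: str, words_per_minute: float = 150.0) -> float:
    """Rough spoken duration of a line at an assumed speaking rate."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return len(line.split()) * 60.0 / words_per_minute


def fits_shot(line: str, shot_seconds: float = 5.0,
              words_per_minute: float = 150.0) -> bool:
    """True when the estimated speech fits inside the shot."""
    return estimated_speech_seconds(line, words_per_minute) <= shot_seconds


line = "They'll come for you at dawn. Run while you can."
print(f"about {estimated_speech_seconds(line):.1f}s, fits a 5s shot: {fits_shot(line)}")
```

The real clip length should be read from the generated audio; this estimate only flags lines that are obviously too long before any money is spent.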
7. Import and wire as a VideoStreamPlayer
summer_import_from_url(url="<fileUrl>", path="assets/video/cinematics/intro_shot_01.mp4")
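With several shots, consistent file names make it easier to chain them later. The helper below is a hypothetical sketch that follows the naming pattern of the import call above and returns both the import path and the matching `res://` path; the base directory is taken from that example.

```python
def shot_paths(cutscene: str, shot_index: int,
               base_dir: str = "assets/video/cinematics") -> tuple[str, str]:
    """Return (import path, res:// path) for a numbered shot."""
    if shot_index < 1:
        raise ValueError("shot_index starts at 1")
    import_path = f"{base_dir}/{cutscene}_shot_{shot_index:02d}.mp4"
    return import_path, f"res://{import_path}"


for index in range(1, 4):
    import_path, resource_path = shot_paths("intro", index)
    print(import_path, resource_path)
```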
Build a controller scene at
:
summer_add_node(parent=".", type="Control", name="Intro")
summer_add_node(parent="./Intro", type="VideoStreamPlayer", name="Video")
summer_add_node(parent="./Intro", type="ColorRect", name="Fade") # black, alpha 1.0 → 0.0; added after Video so it draws on top
summer_set_prop(path="./Intro/Video", property="stream", value="res://assets/video/cinematics/intro_shot_01.mp4")
summer_set_prop(path="./Intro/Video", property="autoplay", value=true)
summer_set_prop(path="./Intro/Video", property="expand", value=true)
summer_set_prop(path="./Intro/Fade", property="anchors_preset", value=15)
Attach a
script that fades the
from black to clear over 0.5s on
, then back to black when
fires, then
s the scene. For multi-shot cutscenes, queue the next
in the
signal handler.
For dialogue, add an
sibling with the TTS clip and call
in
after a small delay matching where the line lands in the video.
Reference card — prompts that work
Pattern:
<subject> + <action> + <camera move> + <lighting> + <stylistic anchor>
. Keep prompts under 50 words; over-prompting confuses the model. Always pair with
to lock the character.
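The pattern can be assembled mechanically, which also makes the 50-word limit easy to enforce. The following is a minimal sketch with a hypothetical helper name; it joins the five slots and rejects prompts that run long.

```python
MAX_WORDS = 50  # limit suggested in this reference card


def build_prompt(subject: str, action: str, camera_move: str,
                 lighting: str, style_anchor: str) -> str:
    """Join the five slots of the pattern into one comma-separated prompt."""
    parts = [f"{subject.strip()} {action.strip()}".strip(),
             camera_move, lighting, style_anchor]
    prompt = ", ".join(part.strip() for part in parts if part.strip())
    word_count = len(prompt.split())
    if word_count > MAX_WORDS:
        raise ValueError(f"prompt has {word_count} words; keep it under {MAX_WORDS}")
    return prompt


print(build_prompt(
    subject="young witch with raven on shoulder",
    action="turns slowly toward camera",
    camera_move="slow dolly-in",
    lighting="candlelit interior",
    style_anchor="cinematic film still",
))
```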
| Goal | Model | Prompt | Cost | Duration |
|---|---|---|---|---|
| Opening establishing shot | | wide shot of a fog-shrouded medieval village at dawn, slow dolly-in toward the church spire, warm low sun, cinematic, anamorphic lens flare | $0.50 | 5s |
| Character intro (hero turns to camera) | | young witch with raven on shoulder turns slowly toward camera, candlelit interior, shallow depth of field, cinematic film still | $0.50 | 5s |
| Dialogue close-up (no synced lips) | | close-up of a grizzled knight, eyes downcast then looking up, firelight on his face, dust motes, 16mm film grain | $0.50 | 5s |
| Dialogue close-up (synced lips) | | close-up of the witch saying "they will come for you at dawn", candlelight, cinematic, shallow DOF | $1.00 | 5s |
| Action beat (village burns) | | medium wide of a thatched village engulfed in flames at night, embers rising, silhouettes running through smoke, hand-held camera, cinematic | $0.50 | 5s |
| Ending shot (hero walks away) | | low wide shot of a cloaked figure walking away across a blasted plain at sunset, slow truck-back, golden hour, dust on the wind | $0.50 | 5s |
| Throwaway iteration / blocking | | same prompt as above | $0.10 | 5s |
| Anime / stylized cutscene | | anime-style young swordsman draws blade and steps forward, sakura petals, dramatic wind, Studio Ghibli soft painterly | $0.40 | 5s |
Bad prompts and why
| Bad | Why it fails |
|---|---|
| epic cutscene of the hero defeating the boss | No subject, no shot, no camera. Returns a generic action montage. |
| the hero walks into the throne room, kneels before the king, draws his sword and says I refuse to serve, then walks out | Five events in one prompt. Model picks one (badly) or tries all and renders mush. Split into three shots. |
| cinematic 4k masterpiece trending on artstation | Adjective slop. The model already knows "cinematic"; the rest is dead weight. |
| make the character look exactly like in the game | Words can't anchor identity. Use . |
| the camera does a complex handheld weaving move through the crowd | Video models render simple camera moves (pan, tilt, dolly, truck) reliably and complex moves badly. Pick one verb. |
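Two of the failure modes in this table are easy to catch with a string check before spending on a generation. The sketch below is a hypothetical linter, not part of any tool: the phrase lists are assumptions drawn from the examples above and would need extending in practice.

```python
# Assumed phrase lists, taken from the bad-prompt examples in this section.
FILLER_PHRASES = ("4k", "masterpiece", "trending on artstation")
IDENTITY_PHRASES = ("exactly like", "look like in the game")


def lint_prompt(prompt: str) -> list[str]:
    """Return human-readable warnings; an empty list means no known issue was found."""
    text = prompt.lower()
    warnings = []
    filler = [phrase for phrase in FILLER_PHRASES if phrase in text]
    if filler:
        warnings.append("filler adjectives: " + ", ".join(filler))
    if any(phrase in text for phrase in IDENTITY_PHRASES):
        warnings.append("identity described in words: use a reference image instead")
    return warnings


for candidate in (
    "cinematic 4k masterpiece trending on artstation",
    "make the character look exactly like in the game",
    "close-up of a grizzled knight, firelight on his face, 16mm film grain",
):
    print(candidate, "->", lint_prompt(candidate) or "ok")
```

Multi-event prompts and complex camera moves are harder to detect mechanically and are better caught by reading the shot list.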
Anti-patterns
- Generating a 10s shot when you should chain two 5s shots. Identity, hand consistency, and physics drift further with every additional second past ~6s. Two 5s shots cut together look better and cost the same.
- Skipping the reference image. Generating four character cutscenes from text prompts alone produces four different-looking people. Always lock the look with first, then every subsequent video call.
- Asking the video model to render dialogue without . and will animate mouths but the motion does not match any audio. Use if dialogue is on-camera, or keep the camera off the speaker.
- Using for blocking iterations. Burn $0.10 on to validate framing and prompt; only spend $0.50 once the composition lands.
- Putting the cutscene on a without a fade. Cuts straight from gameplay to video are jarring. Always wrap in a fade in/out.
- Forgetting to import the file. returns a on Cloudinary; until you call , it isn't in and the scene can't reference it.
Edge cases
- Vertical cutscene for a mobile target. Set . Reference image must also be 9:16: generate it with image_size: "portrait_16_9", or the framing will crop wrong on the video.
- Cutscene must match an in-engine character precisely. No video model will hit pixel-perfect identity. Either (a) accept the artistic license, (b) render the cutscene in-engine with instead of generating it, or (c) generate a reference image from a screenshot of the in-engine character (image-to-image first, then image-to-video).
- Dialogue is too long for a 5s shot. Either trim the line, or split: 5s of speaker, 5s of listener reaction (cheaper because the listener doesn't need synced lips).
- Cutscene needs to play and the player isn't looking. Pause the gameplay timer, push a overlay with the , restore on finish.
- Multi-shot continuity. Use the same reference image as across all shots. If lighting differs by shot, generate one reference per lighting setup, not one per shot.
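For the vertical case, keeping the video aspect ratio and the reference image size in one lookup avoids mismatched framing. This is a small sketch under assumptions: the `image_size` values are the two that appear in this skill, the `"9:16"` key is assumed to be the matching video aspect ratio, and the function name is hypothetical.

```python
# Video aspect ratio -> reference image size. Only the two pairs used in this skill.
IMAGE_SIZE_FOR_ASPECT = {
    "16:9": "landscape_16_9",
    "9:16": "portrait_16_9",
}


def reference_image_size(aspect_ratio: str) -> str:
    """Look up the image_size that matches a video aspect ratio."""
    try:
        return IMAGE_SIZE_FOR_ASPECT[aspect_ratio]
    except KeyError:
        known = ", ".join(sorted(IMAGE_SIZE_FOR_ASPECT))
        raise ValueError(f"unsupported aspect ratio {aspect_ratio!r}; known: {known}") from None


print(reference_image_size("9:16"))
```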
Fallback (no MCP)
If the Studio MCP server isn't running, the user can do all of this through the Studio web dashboard at the Summer Engine cloud console:
- Generate a reference image in the Image tab.
- Open the Video tab, paste the reference URL into the image-to-video field.
- Set model, duration, aspect ratio, and prompt as listed above.
- Download the and drop it into in the project, then re-import via the Godot editor.
Print the exact prompt + model + duration + aspect ratio so the user can paste it into the dashboard verbatim.
Handoff
Once the cutscene is generated and wired:
Cutscene
wired at
with fade in/out and
chained to scene change. Next:
- Add the dialogue track with if you haven't yet.
- Score the moment with — a cutscene without music feels like a placeholder.
- For the boss reveal cutscene, run this skill again with the same reference image to keep the witch's identity stable.
- If you want a 5s "money shot" version for marketing, run against the same reference.
See also
- — marketing footage, 5-10s, maximum visual punch.
- — seamless looping background clips.
- — TTS dialogue used inside cutscenes.
- — score the cutscene.
- — generate the reference image axis if no character reference exists yet.
- 2d-assets/character-portrait — produce a high-fidelity locked character portrait for use as .
- _shared/mcp-tools-reference.md — parameter schema and error codes.