Director — Story-to-Video Production Pipeline
Overview
The Director skill orchestrates a complete short film production from a text story. It breaks the story into scenes, generates start/end frames for each, creates video clips from frame pairs, and concatenates everything into a final video.
Pipeline: Story Planning → Z-Image Hero + Character Refs → Qwen Edit Chain (all frames) → WAN 2.2 FLF Video Clips → ffmpeg Concatenation
Key architectural decisions:
- 1 hero frame + edit chain for character consistency (NEVER independent Z-Image per scene)
- Inter-scene frame continuity: Scene N's end frame IS Scene N+1's start frame (same image file, no edit gap)
- Character reference images fed into Qwen Edit's extra image slots
- State file persists to disk for context compaction survival
- Each scene is independently retryable without affecting others
- clear_vram between every model family switch
CRITICAL: Character Consistency
Independent Z-Image generations per scene produce different-looking characters. This was the #1 problem discovered during testing. The solution:
- Generate ONE hero frame with Z-Image — establishes the main character, setting, and lighting
- Generate character reference images — close-up portraits of each character, key props, and the background
- ALL other scene frames are created via Qwen Edit chain from the hero, with character refs in extra image slots
- This ensures the same face, clothing, and environment across every frame
8-Phase Pipeline
Phase 1: Story Planning → Break story into scenes (Claude reasoning, no ComfyUI)
Phase 2: Hero + Refs → Z-Image: 1 hero frame + character ref portraits + background ref
Phase 3: Hero Review → Visual verify hero and refs, user approves
Phase 4: Edit Chain → Qwen Edit: chain ALL scene frames from hero (with char refs in slots 2-3)
Phase 5: Frame Review → Visual verify all frames, approve/reject/retry
Phase 6: Video Clips → WAN 2.2 FLF dual Hi-Lo (one clip per scene)
Phase 7: Video Review → Preview each clip
Phase 8: Final Assembly → ffmpeg concat all clips into one MP4
State File Format
Saved at ~/code/comfyui-mcp/workflows/director_state_{project_id}.json. Updated after every edit or phase completion.
```json
{
"project_id": "story_20260216_143022",
"created": "2026-02-16T14:30:22Z",
"story": "Original user story text",
"current_phase": 4,
"orientation": "portrait",
"hero_frame": { "file": "director_hero_00001_.png", "seed": 428571, "approved": true },
"character_refs": {
"man": "director_ref_man.png",
"cat": "director_ref_cat.png",
"woman": "director_ref_woman.png",
"background": "director_ref_bedroom.png"
},
"scenes": [
{
"id": 1,
"description": "Brief scene description",
"edit_prompt_start": "Qwen Edit instruction to create start frame from source",
"edit_prompt_end": "Qwen Edit instruction to create end frame from source",
"edit_source_start": "hero",
"edit_source_end": "hero",
"video_prompt": "WAN motion description",
"start_frame": { "file": "director_s1_start_00001_.png", "seed": 12345, "approved": true },
"end_frame": { "file": "director_hero_00001_.png", "seed": null, "approved": true },
"video_clip": { "file": "director_s1_00001.mp4", "seed": 11111, "approved": false },
"status": "video_pending"
}
],
"final_video": null,
"settings": {
"start_frame_resolution": [832, 1472],
"video_resolution": [480, 720],
"video_frames": 81,
"video_fps": 16
}
}
```
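A minimal save helper makes the "updated after every edit" rule concrete. This is a sketch assuming only the path and schema above; the write-then-rename pattern keeps the file intact if the process dies mid-write.

```python
import json
import os
import tempfile

def save_state(state: dict, project_id: str) -> None:
    """Persist pipeline state after every edit or phase completion."""
    path = os.path.expanduser(
        f"~/code/comfyui-mcp/workflows/director_state_{project_id}.json")
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)  # atomic rename: readers never see a half-written file
```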
Models Used Per Phase
| Phase | Model Family | Key Models | VRAM |
|---|---|---|---|
| 2: Hero + Refs | Z-Image | redcraftRedzimageUpdatedJAN30_redzibDX1.safetensors | ~17GB |
| 4: Edit Chain | Qwen Edit | qwen_image_edit_2511_bf16.safetensors + Lightning LoRA | ~17-18GB |
| 6: Video Clips | WAN 2.2 I2V | Remix NSFW Hi+Lo (built-in lightning) | ~22-24GB |
CRITICAL: Run clear_vram between every model family switch.
Phase 1: Story Planning
Break the story into 2-6 scenes. For each scene, identify:
- description: What happens (1-2 sentences)
- start frame: What the opening frame looks like
- end frame: What the closing frame looks like
- video_prompt: Motion description for FLF transition
Identify a hero frame — the single most representative scene image that establishes the main character and setting. This hero will anchor all other frames via Qwen Edit.
Also identify which character reference images are needed (portraits of each character, key props, background).
CRITICAL: Inter-Scene Frame Continuity
The end frame of Scene N must be the EXACT same image file as the start frame of Scene N+1. Do NOT create separate Qwen-edited start frames for subsequent scenes — this causes visible jumps at scene boundaries when the videos are concatenated.
The frame chain for video generation:
Scene 1: S1_start (unique) → hero (end)
Scene 2: hero (= S1 end) → S2_end
Scene 3: S2_end (= S2 end) → S3_end
Scene 4: S3_end (= S3 end) → S4_end
Scene 5: S4_end (= S4 end) → S5_end
Only Scene 1 needs a unique start frame. All other scenes inherit their start from the previous scene's end.
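The inheritance rule is mechanical enough to enforce in code. A minimal sketch, assuming the scene dicts from the state file format above:

```python
def link_scene_frames(scenes: list[dict]) -> None:
    """Give every scene after the first its start frame from the previous end."""
    for prev, cur in zip(scenes, scenes[1:]):
        # Same file entry, not a re-edit: this is what prevents visible
        # jumps at scene boundaries once the clips are concatenated.
        cur["start_frame"] = dict(prev["end_frame"])

def assert_continuity(scenes: list[dict]) -> None:
    """Fail loudly before video generation if any boundary would jump."""
    for prev, cur in zip(scenes, scenes[1:]):
        assert cur["start_frame"]["file"] == prev["end_frame"]["file"], (
            f"scene {cur['id']} must start on scene {prev['id']}'s end frame")
```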
Edit Chain Planning
The edit chain produces only end frames (plus Scene 1's unique start frame). Map which end frame derives from which source:
- Some end frames edit directly from the hero
- Later end frames may chain from earlier end frames
- Keep chains shallow (max 4-5 deep) to minimize drift
Example chain:
Hero (man+cat on bed)
├─ S1 Start: edit hero → remove cat, man alone
└─ S2 End: edit hero → replace cat with woman
   └─ S3 End: edit S2End → both sit up, man startled
      └─ S4 End: edit S3End → sitting close, warm smiles
         └─ S5 End: edit S4End → warm embrace
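The depth budget can be checked directly from the source mapping. A sketch; the frame ids and the `sources` dict are illustrative, mirroring the example chain above:

```python
def max_chain_depth(sources: dict[str, str]) -> int:
    """sources maps each frame id to the frame it is edited from."""
    def depth(frame: str) -> int:
        return 0 if frame == "hero" else 1 + depth(sources[frame])
    return max(depth(f) for f in sources)

sources = {"s1_start": "hero", "s2_end": "hero", "s3_end": "s2_end",
           "s4_end": "s3_end", "s5_end": "s4_end"}
assert max_chain_depth(sources) == 4  # inside the 4-5 edit drift budget
```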
Phase 2: Hero + Character References — Z-Image
Generate with Z-Image RedCraft DX1 (10 steps, CFG 1, euler/simple):
- Hero frame: The establishing shot with main character + key elements
- Character ref portraits: Close-up of each character (man, woman, animal, etc.)
- Background ref: The setting without characters
Extend each character ref's negative prompt to exclude wrong subjects (e.g., add "woman, female" when generating the man's portrait).
Hero Frame Workflow Template
```json
{
"1": { "class_type": "CheckpointLoaderSimple", "inputs": { "ckpt_name": "redcraftRedzimageUpdatedJAN30_redzibDX1.safetensors" }},
"2": { "class_type": "CLIPTextEncode", "inputs": { "clip": ["1", 1], "text": "<hero_prompt>" }, "_meta": { "title": "Positive" }},
"3": { "class_type": "CLIPTextEncode", "inputs": { "clip": ["1", 1], "text": "3D, ai generated, semi realistic, illustrated, drawing, comic, digital painting, 3D model, blender, video game screenshot, render, smooth textures, CGI, text, writing, subtitle, watermark, logo, blurry, low quality, jpeg artifacts, grainy" }, "_meta": { "title": "Negative" }},
"4": { "class_type": "EmptyLatentImage", "inputs": { "width": 832, "height": 1472, "batch_size": 1 }},
"5": { "class_type": "KSampler", "inputs": {
"model": ["1", 0], "positive": ["2", 0], "negative": ["3", 0], "latent_image": ["4", 0],
"seed": 42, "steps": 10, "cfg": 1, "sampler_name": "euler", "scheduler": "simple", "denoise": 1
}},
"6": { "class_type": "VAEDecode", "inputs": { "samples": ["5", 0], "vae": ["1", 2] }},
"7": { "class_type": "SaveImage", "inputs": { "images": ["6", 0], "filename_prefix": "director_hero" }}
}
```
Queue hero + all refs while Z-Image checkpoint is loaded (same checkpoint, different prompts).
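Batching amounts to swapping three fields per job in the template above (positive text in node 2, seed in node 5, filename prefix in node 7). A sketch; `queue_workflow` is a hypothetical stand-in for whatever client call submits a workflow to ComfyUI, not a real API name.

```python
import copy

def queue_workflow(workflow: dict) -> str: ...  # hypothetical: submit to ComfyUI, return prompt id

def queue_zimage_batch(template: dict, jobs: list[tuple[str, str, int]]) -> list[str]:
    """jobs: one (positive_prompt, filename_prefix, seed) tuple per image."""
    ids = []
    for prompt, prefix, seed in jobs:
        wf = copy.deepcopy(template)
        wf["2"]["inputs"]["text"] = prompt             # node 2: Positive
        wf["5"]["inputs"]["seed"] = seed               # node 5: KSampler
        wf["7"]["inputs"]["filename_prefix"] = prefix  # node 7: SaveImage
        ids.append(queue_workflow(wf))
    return ids
```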
Phase 3: Hero Review
Show hero frame and all character refs. User approves or requests regeneration with new seed.
Phase 4: Edit Chain — Qwen Image Edit
CRITICAL: Consistency Rules for Edit Prompts
- Always explicitly anchor clothing: "The man wears his grey t-shirt" in EVERY prompt
- Always state what doesn't change: "Same bedroom, same warm lighting, same clothing"
- Use strong emotion words: "extremely shocked and startled" >> "surprised"
- Include proportionality: "Her head and body should be proportional and natural looking"
- Prevent head enlargement: In embrace/close-up poses, Qwen Edit tends to enlarge heads. Add explicit: "do not enlarge her head, keep the same small natural size as in the original image"
- Describe the transformation, not just the end state
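One way to keep these anchors from being forgotten is to assemble every edit prompt from required parts. A sketch; the function and its argument names are illustrative, not part of the skill:

```python
def build_edit_prompt(change: str, clothing_anchor: str, setting_anchor: str,
                      head_anchor: bool = False) -> str:
    """Compose an edit instruction that always restates what must not change."""
    parts = [change,           # the transformation itself
             clothing_anchor,  # e.g. "The man wears his grey t-shirt"
             f"Same {setting_anchor}, same lighting, same clothing"]
    if head_anchor:  # for embrace/close-up poses, where Qwen Edit enlarges heads
        parts.append("Her head and body should be proportional and natural "
                     "looking; do not enlarge her head, keep the same small "
                     "natural size as in the original image")
    return ". ".join(parts) + "."
```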
Workflow Template (with Character Reference Slots)
```json
{
"1": { "class_type": "UNETLoader", "inputs": { "unet_name": "qwen_image_edit_2511_bf16.safetensors", "weight_dtype": "default" }},
"2": { "class_type": "LoraLoaderModelOnly", "inputs": { "model": ["1", 0], "lora_name": "Qwen-Image-Edit-2511-Lightning-4steps-V1.0-bf16.safetensors", "strength_model": 1 }},
"3": { "class_type": "CLIPLoader", "inputs": { "clip_name": "qwen_2.5_vl_7b_fp8_scaled.safetensors", "type": "qwen_image" }},
"4": { "class_type": "VAELoader", "inputs": { "vae_name": "qwen_image_vae.safetensors" }},
"5": { "class_type": "LoadImage", "inputs": { "image": "<source_scene.png>" }, "_meta": { "title": "Source Scene" }},
"5b": { "class_type": "LoadImage", "inputs": { "image": "<character_ref.png>" }, "_meta": { "title": "Character Ref" }},
"5c": { "class_type": "LoadImage", "inputs": { "image": "<background_ref.png>" }, "_meta": { "title": "Background Ref" }},
"6": { "class_type": "TextEncodeQwenImageEditPlusAdvance_lrzjason", "inputs": {
"clip": ["3", 0], "prompt": "<edit_prompt>", "vae": ["4", 0],
"vl_resize_image1": ["5", 0],
"vl_resize_image2": ["5b", 0],
"vl_resize_image3": ["5c", 0],
"target_size": 1024, "target_vl_size": 384,
"upscale_method": "lanczos", "crop_method": "pad"
}},
"7": { "class_type": "ConditioningZeroOut", "inputs": { "conditioning": ["6", 0] }},
"8": { "class_type": "KSampler", "inputs": {
"model": ["2", 0], "positive": ["6", 0], "negative": ["7", 0], "latent_image": ["6", 1],
"seed": 42, "steps": 4, "cfg": 1, "sampler_name": "euler", "scheduler": "simple", "denoise": 1
}},
"9": { "class_type": "VAEDecode", "inputs": { "samples": ["8", 0], "vae": ["4", 0] }},
"10": { "class_type": "SaveImage", "inputs": { "images": ["9", 0], "filename_prefix": "director_s1_start" }}
}
```
Key: slots 5b and 5c feed the character reference and background reference into vl_resize_image2 and vl_resize_image3. This helps the vision encoder maintain character appearance across edits.
Chain Execution
Edits are sequential; each depends on the previous output:
- Run the edit and wait for completion
- Upload the output
- Use the uploaded output as the source for the next edit
- Update the state file after each edit
Independent edits (both from hero) can run in parallel.
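The driver below sketches that loop against the node ids of the template above. The three client calls are hypothetical stand-ins for comfyui-mcp tools; their names and signatures are illustrative only.

```python
import copy

def queue_workflow(wf: dict) -> str: ...        # hypothetical: submit workflow
def wait_for_image(prompt_id: str) -> str: ...  # hypothetical: block until saved
def upload_image(filename: str) -> None: ...    # hypothetical: output -> input dir

def run_edit_chain(plan: list[dict], template: dict) -> None:
    """plan: ordered steps, each naming its source file, prompt, and prefix."""
    for step in plan:
        wf = copy.deepcopy(template)
        wf["5"]["inputs"]["image"] = step["source_file"]        # node 5: Source Scene
        wf["6"]["inputs"]["prompt"] = step["edit_prompt"]       # node 6: instruction
        wf["10"]["inputs"]["filename_prefix"] = step["prefix"]  # node 10: SaveImage
        out = wait_for_image(queue_workflow(wf))
        upload_image(out)             # the next step loads this as its source
        step["output_file"] = out     # then update and persist the state file
```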
Timing
- First edit: ~87s (model loading)
- Subsequent edits: ~30-40s each (models cached)
Phase 5: Frame Review
Show each frame for visual inspection. The user approves or provides feedback. Individual edits can be re-run without redoing the whole chain.
Phase 6: Video Clip Generation — WAN 2.2 FLF Dual Hi-Lo
Workflow Template
(Same as wan-flf-video skill — Remix NSFW Hi+Lo, 4-stack LoRA, ImageResizeKJv2, dual KSamplerAdvanced)
Key settings:
- Portrait: width=480, height=720
- 81 frames, 16fps = ~5 seconds per clip
- uni_pc/beta sampler, CFG 1, 4 total steps (Hi: 0→2, Lo: 2→4)
- ModelSamplingSD3 shift=5 on both UNETs
Morph LoRA
For transformation scenes (e.g., cat→woman), add morph LoRA to Hi/Lo Common stacks:
- wan2.2_i2v_magical_morph_highnoise.safetensors → Hi Common slot 1 (strength 1.0)
- wan2.2_i2v_magical_morph_lownoise.safetensors → Lo Common slot 1 (strength 1.0)
Use 1.0 strength — tested without sparkle issues. Lower values (0.7-0.85) produce weaker morph effects that may look like a dissolve rather than a true morph.
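For reference, a hedged sketch of wiring the morph LoRAs in programmatically. It reuses the LoraLoaderModelOnly node type from the Qwen template above rather than the wan-flf-video template's own LoRA stacks, and the UNET node ids are placeholders:

```python
def insert_morph_lora(wf: dict, unet_id: str, lora_name: str, node_id: str) -> None:
    """Splice a LoRA between a UNET loader and everything that consumed it."""
    wf[node_id] = {"class_type": "LoraLoaderModelOnly",
                   "inputs": {"model": [unet_id, 0],
                              "lora_name": lora_name,
                              "strength_model": 1.0}}  # 1.0: tested without sparkle issues
    for nid, node in wf.items():
        if nid == node_id:
            continue
        for key, link in node["inputs"].items():
            if link == [unet_id, 0]:  # repoint former UNET consumers at the LoRA
                node["inputs"][key] = [node_id, 0]

# e.g. insert_morph_lora(wf, "<hi_unet_id>",
#                        "wan2.2_i2v_magical_morph_highnoise.safetensors", "hi_morph")
```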
Per-Scene Changes
Swap per scene: start/end image filenames, positive prompt text, noise_seed, filename_prefix.
All 5 clips can be queued at once — they run sequentially in ComfyUI, sharing loaded models.
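A sketch of the queue-all pass. `build_flf_workflow` is an assumed filler for the wan-flf-video template and `queue_workflow` the same hypothetical client call as above:

```python
import random

def queue_workflow(wf: dict) -> str: ...  # hypothetical: submit workflow
def build_flf_workflow(start_image: str, end_image: str, prompt: str,
                       noise_seed: int, filename_prefix: str) -> dict: ...

def queue_all_clips(scenes: list[dict]) -> None:
    for scene in scenes:
        wf = build_flf_workflow(
            start_image=scene["start_frame"]["file"],
            end_image=scene["end_frame"]["file"],   # also the next scene's start
            prompt=scene["video_prompt"],
            noise_seed=random.randint(0, 2**32 - 1),
            filename_prefix=f"director_s{scene['id']}")
        queue_workflow(wf)  # ComfyUI serializes the queue on the shared models
```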
Phase 7: Video Review
Report each clip's filename. User previews externally.
Phase 8: Final Assembly — ffmpeg Concat
```bash
cd "<ComfyUI_output_dir>"
printf "file 'director_s1_00001.mp4'\nfile 'director_s2_00001.mp4'\n..." > concat_list.txt
ffmpeg -f concat -safe 0 -i concat_list.txt -c copy director_final_{project_id}.mp4
```
All clips share resolution/codec/framerate — copy-concat works without re-encoding.
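Generating the list from the state file keeps clip order locked to scene order. A sketch, assuming the filenames and output layout above:

```python
import os
import subprocess

def assemble_final(state: dict, output_dir: str) -> str:
    lines = [f"file '{s['video_clip']['file']}'" for s in state["scenes"]]
    with open(os.path.join(output_dir, "concat_list.txt"), "w") as f:
        f.write("\n".join(lines) + "\n")
    out = f"director_final_{state['project_id']}.mp4"
    subprocess.run(["ffmpeg", "-f", "concat", "-safe", "0",
                    "-i", "concat_list.txt", "-c", "copy", out],
                   cwd=output_dir, check=True)  # -c copy: no re-encode
    return os.path.join(output_dir, out)
```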
Resumption Protocol
After context compaction:
- Read the state file AND director_session_notes.md if it exists
- Check current_phase and per-scene status
- Skip approved assets; continue from the first incomplete point
- Run clear_vram before loading the model family for the current phase
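A resumption sketch against the schema above: walk the scenes in order and stop at the first asset that is missing or not yet approved (current_phase narrows which model family to load first).

```python
def find_resume_point(state: dict) -> str:
    for scene in state["scenes"]:
        for asset in ("start_frame", "end_frame", "video_clip"):
            entry = scene.get(asset)
            if entry is None or not entry.get("approved"):
                return f"scene {scene['id']}: {asset} pending"
    return "all scenes approved: run final assembly"
```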
Timing Estimates (RTX 4090)
| Phase | Per Scene | 5 Scenes |
|---|---|---|
| Hero + Refs (Z-Image) | ~10s each | ~50s (one-time) |
| Edit Chain (Qwen 4-step) | ~35s each | ~280s (8 edits) |
| Video Clip (WAN FLF 81 frames) | ~140s | ~700s |
| VRAM swaps (3x clear_vram) | ~30s each | ~90s |
| Total generation | | ~19 min |
Storytelling Props for Continuity
Use distinctive visual elements that transfer between characters/forms to create narrative connections:
- A colored collar on an animal → becomes a choker/necklace on the human form
- Eye color matching between animal and human
- Distinctive clothing or accessories that persist across scenes
- These "continuity props" reinforce the story visually
Prompt Engineering for Edit Chains
DO
- "The man wears his grey t-shirt" (anchor clothing every time)
- "Same bedroom, same warm amber lamplight, same white sheets"
- "Extremely shocked, jaw dropped, eyes wide in total disbelief"
- "Her head and body proportional and natural looking"
DON'T
- Assume clothing/setting will be preserved automatically
- Use mild emotion words ("surprised" → use "extremely shocked" instead)
- Chain more than 5-6 edits deep without branching back to hero
- Assume head proportions stay correct in embrace/hug poses — always add explicit size anchoring
WAN Video Prompts
- Use motion verbs: "walks", "turns", "reaches", "sits up", "leans in"
- AVOID: "magical", "enchanted", "mystical" (causes sparkle effects)
- USE: "smoothly transforms", "seamlessly reshapes", "gradually"
- Include scale cues: "grows into", "expands upward"