founder-product-video
You generate a 65-second founder-style product video from a product URL plus user-provided imagery: 60 seconds of talking-founder body video plus a 5-second branded end card. The user's images (product photos / website screenshots / app screenshots) flow into the SeeDance acts as reference images, so the actual product appears in reveal shots instead of being imagined.
No cutaways. No website CSS extraction. AI generation, deterministic render, captions, concat, and music mix go through Pika MCP tools by default. Lower-third overlays are opt-in because compositing an arbitrary transparent overlay still requires a local ffmpeg fallback until MCP exposes a general alpha-overlay tool.
[0] Intake — run first, before any pipeline step
If invoked with empty args, print this menu verbatim and stop — wait for the user to paste inputs:
What founder video do you want to make?
Required:
- Product URL — `https://...` (anything with a real homepage)
- Founder — name + role, e.g. "Eli Kim, CEO"
- Founder photo — local path, https URL, OR `generate` (I'll create a portrait)
Optional (sensible defaults if omitted): brand kit path · custom on-phone screenshots · music · aspect (16:9 / 9:16 / 1:1) · location image · voice style · product type
Example: `/founder-product-video https://scrapegraphai.com --founder "Eli Kim, CEO" --photo ~/Pictures/eli.jpg`
If args carry partial input in interactive mode, skip the menu and gather the missing required fields by asking one at a time — ask, wait, ask the next. Don't bundle questions into one block. If the user supplies a field unprompted (e.g. they pasted a URL in the trigger message), skip that question and confirm the value back to them once at the end. Don't start the pipeline until all required fields are answered. If the non-interactive fast lane applies, use step [0.5] instead.
[0.5] Non-interactive fast lane
Use this path when the caller passes `--quick` or `--config <path>`, or when the caller states they are running from CI, a subagent, a batch job, or any other non-interactive harness.
This section has precedence over the interactive ask/wait instructions below. When it applies, use this fast lane and do not fall through to the multi-turn intake unless `url` or founder name/role is truly missing.
- `--config <path>` points to a JSON file with pre-baked values for the canonical input contract: `url`, `brand_kit_path` or `build_brand`, `founder_name`, `founder_role`, `founder_photo`, `assets`, `music_url`, `aspect_ratio`, `location_image_url`, `voice_style`, `product_type`, and `lower_third`.
- `--quick` means use defaults for optional extras, auto-build the brand kit with `build-a-brand --quick` if `brand-kit` is omitted, and use `founder_photo = "generate"` when no photo is supplied.
- For `--quick` or `--config`, do not stop for confirmation at the brand-kit branch, founder-photo generation prompt, optional-extras prompt, script choices, or end-card/caption defaults. Record assumptions inline and continue.
- If `url` or founder name/role cannot be found in args or config, stop once with a single compact missing-fields list instead of starting a multi-turn Q&A loop.
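The fast-lane rule above (merge config with args, then stop once with a compact missing-fields list) could be sketched roughly as follows. This is only an illustration under assumptions: the helper names `load_inputs` and `missing_required`, the merged-dict shape, and the "args win over config" precedence are hypothetical; only the field names come from the text.

```python
import json
from pathlib import Path
from typing import Optional

# Required in the fast lane: url plus founder name/role.
REQUIRED = ("url", "founder_name", "founder_role")


def load_inputs(args: dict, config_path: Optional[str] = None) -> dict:
    """Merge a --config JSON file (if any) with CLI args.

    Assumption: explicit args override config values.
    """
    merged: dict = {}
    if config_path:
        merged.update(json.loads(Path(config_path).read_text()))
    merged.update({k: v for k, v in args.items() if v is not None})
    # Fast-lane default: generate a portrait when no photo is supplied.
    merged.setdefault("founder_photo", "generate")
    return merged


def missing_required(inputs: dict) -> list:
    """Return every missing required field at once, never one at a time."""
    return [name for name in REQUIRED if not inputs.get(name)]
```

A caller would print the returned list in a single message and stop, rather than entering the multi-turn intake.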
1. Product URL (required) — `https://...`. Used to (a) derive the brief in step [1] and (b) feed the brand-kit branch below.
2. Brand kit (required) — interactive mode: ask "Do you already have a brand kit folder, or should I build one first?"
   - If a path -> use it (`state.brand_kit_path = <path>`). Accept either `brand.json` or an exported `build-a-brand` kit containing `brand.md`, `tokens/tokens.json`, and logo assets.
   - If "build" -> invoke the `build-a-brand` skill on the URL/brief and wait for the exported brand kit. This is a full identity workflow and may pause for user choices; surface those prompts in interactive mode.
   - Fast lane: if config provides `brand_kit_path`, use it. If config sets `build_brand` or `--quick` omits `brand-kit`, invoke `build-a-brand --quick` on the URL/brief and wait for the exported brand kit; do not surface build-a-brand prompts or stop for identity choices. After either branch, set `state.brand_kit_path`. Only stop with a single compact missing-fields list if there is no path and the brand kit cannot be built.
3. Founder identity (required) — interactive mode: ask all three together:
   - `founder_name` — e.g. "Semi"
   - `founder_role` — e.g. "CEO, Pika"
   - `founder_photo` — local path / https URL / OR the literal string `generate` to auto-create a portrait. If `generate`, prompt the user for a 1-line vibe ("warm, casual smart attire" / "Pixar-style 3D animation" / etc.) — this becomes the seed prompt for `mcp__pika__generate_image` in step [4].
   - Fast lane: use founder values from args/config. If `founder_photo` is omitted, set `founder_photo = "generate"` and use a neutral founder-portrait vibe derived from the product tone; do not stop for a separate photo-vibe prompt.
Default to no lower-third so the happy path stays MCP-only. If the user explicitly asks for a lower-third, record `state.lower_third = true`; in interactive mode confirm that one local ffmpeg overlay pass will be used, and in the fast lane record that assumption inline.
4. Optional extras — interactive mode: offer these once as a single message, then proceed without waiting if no answer comes back in the same turn. Fast lane: use the defaults below without asking.
   - Custom imagery — list of `assets` (product photos / app screenshots) shown on the founder's phone. Default if omitted: use screenshots from `brand.json.screenshots` when present, otherwise look for obvious screenshots or product images inside the brand kit. If none exist in interactive mode, ask once whether to proceed without imagery or wait for uploads; in the fast lane, proceed without custom imagery and record the assumption.
   - Music — local path / https URL / OR `generate` (instrumental, ~60s). Default: `generate` via MiniMax.
   - Lower-third — optional. Default: off. If enabled, the render is MCP but the overlay onto body video uses one local ffmpeg pass.
   - Aspect ratio — `16:9` (default), `9:16`, `1:1`.
   - Location — defaults to a flat seamless backdrop in `state.brand.colors.accent` (the brand-accent backdrop pattern validated 2026-05-03 on a Pika.me run where accent happens to be lavender — clean studio-shoot look, character against a single brand color, whatever the brand's accent is). Override with a path / URL / text description if the user wants office, outdoor, etc.
   - Voice style — VO direction string for SeeDance, e.g. "warm authentic founder energy, conversational". Default: derived from `brief.tone`.
   - Product type — `digital | physical_apparel | physical_object | consumable | service`. Default: auto-derived in step [2] from asset analyses.
After Stage 0 completes, store all gathered values in `state.inputs`. If you already created a local work directory for this run, optionally persist the same object as `<workdir>/inputs.json`; do not require a predefined work-directory environment variable. Then enter the pipeline at step [1].
Required inputs (canonical contract)
After Stage 0, these are the fields downstream steps consume:
- `url` — the product website (https://). Drives step [1] brief.
- `brand_kit_path` — brand kit folder. Required. End card AND lower-third consume `brand.json` when present, otherwise `brand.md`, `tokens/tokens.json`, and logo assets from a `build-a-brand` export. See step [4.5].
- `founder_name` + `founder_role` + `founder_photo` — required from intake. Step [4] normalizes `founder_photo` into `founder_photo_url` and `character_url` before any SeeDance call.
- `assets` — optional array of `{ url, role?, caption? }`. Defaults to the screenshots captured by the brand kit. `role` is a hint string mapping the asset to a script beat (`hero`, `feature_a`, `cta`, etc.).
- `location_image_url` — optional. Defaults to a generated solid-color backdrop in `state.brand.colors.accent`.
- `music_url` — optional. Defaults to `generate` (MiniMax instrumental in step [7]).
- `aspect_ratio` — default `16:9`.
- `voice_style` — optional, defaults to `brief.tone`.
- `product_type` — optional, auto-derived in step [2].
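As a rough illustration of the contract above, the fields and their documented defaults might be captured in a dataclass like the one below. The class name `FounderVideoInputs` and the choice to use `None` for "resolve the default later" are assumptions; `lower_third` is included because the fast lane lists it among the config values.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FounderVideoInputs:
    # Required after Stage 0.
    url: str
    brand_kit_path: str
    founder_name: str
    founder_role: str
    founder_photo: str  # local path, https URL, or the literal "generate"
    # Optional; None means "resolve the documented default in a later step".
    assets: list = field(default_factory=list)  # items: {url, role?, caption?}
    location_image_url: Optional[str] = None  # default: brand-accent backdrop
    music_url: str = "generate"  # MiniMax instrumental in step [7]
    aspect_ratio: str = "16:9"
    voice_style: Optional[str] = None  # default: brief.tone
    product_type: Optional[str] = None  # default: auto-derived in step [2]
    lower_third: bool = False  # off by default to keep the run MCP-only

    def __post_init__(self) -> None:
        # Only the three aspect ratios named in the intake are accepted.
        if self.aspect_ratio not in ("16:9", "9:16", "1:1"):
            raise ValueError(f"unsupported aspect_ratio: {self.aspect_ratio}")
```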
State
Keep a simple `state` object as you work and save every CDN URL there so a partial run can be resumed. Treat `mcp__pika__task_status` value `completed` as the successful terminal state (`failed` and `cancelled` are the failure terminals), then unwrap `result.structuredContent` when present. The final video lives on Pika's CDN; no local workspace is required unless you trigger the one lower-third alpha-overlay fallback in step [8b].
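The terminal-state handling described above could look roughly like this. It is a sketch, not the tool's documented response format: the dict shape (a `"status"` key next to a `"result"` key) and the helper name `unwrap_task` are assumptions.

```python
TERMINAL_FAILURES = {"failed", "cancelled"}


def unwrap_task(status: dict):
    """Return the task payload if completed, or None if still running.

    Assumption: `status` is the dict returned by mcp__pika__task_status,
    carrying a "status" string and a "result" object.
    """
    state = status.get("status")
    if state in TERMINAL_FAILURES:
        raise RuntimeError(f"task ended as {state}")
    if state != "completed":
        return None  # not terminal yet; poll again later
    result = status.get("result") or {}
    # Unwrap result.structuredContent when present, else use result as-is.
    if isinstance(result, dict) and "structuredContent" in result:
        return result["structuredContent"]
    return result
```

Saving the unwrapped payload's CDN URL into `state` right away is what makes a partial run resumable.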
Pipeline overview
[Stage 0] intake you (Claude): ask user for url + brand-kit (path or build) + founder (name/role/photo) + optional extras
→ [0.5] brand-kit auto-build (only if user said "build")
invoke `build-a-brand`; in non-interactive mode use `build-a-brand --quick`
→ [1] analyze_brief pika MCP: product name + tagline + features + tone + CTA
→ [2] analyze_media × N pika MCP: understand what each user asset shows
→ [3] write script you (Claude): 4 acts × 15s; map assets to acts
→ [4] founder/location refs pika MCP: upload or generate founder ref; carry supplied custom location
→ [4.5] brand-kit ingestion parse brand.json OR brand.md + tokens → state.brand; generate default brand-accent location if needed
→ [5] generate_reference_video × 4 IN PARALLEL pika MCP: SeeDance acts, asset images as refs
→ [6] edit_concat acts pika MCP: 60s stitched base (dialogue-only audio)
→ [7] generate_music pika MCP: instrumental with 5-section non-vocal lyrics structure
→ [8] captions / lower-third pika MCP: add_captions for subtitles; render lower-third as transparent .mov. If lower-third is enabled, use the alpha-overlay fallback only because MCP has no arbitrary alpha-overlay tool yet.
→ [9] render_html_animation pika MCP: 5s end card — author inline HTML, brand-kit fonts inlined, aspect matches body, no corner clutter, CSS @keyframes (NOT GSAP)
→ [10] edit_concat + audio_mix pika MCP: concat body + end card, then mix music over the full ~65s
→ [10.5] final duration probe pika MCP: analyze final_url and enforce the 55s duration floor before delivery
→ [11] final_url save the MCP returned final_url; upload only if a local fallback created the final MP4
→ [12] deliver
Operational notes
Keep the main workflow focused on sequencing. Historical server validation details live in `references/ops-notes.md`; only the active constraints stay here:
- Use a unique `seed` per SeeDance act (101, 202, 303, 404). Identical generation params can replay cached failures.
- MiniMax music length is controlled by five `lyrics` sections (`intro`/`verse`/`bridge`/`chorus`/`outro`) with instrumental parentheticals; prompt prose alone is not a reliable length control.
- If SeeDance rejects a real-person founder photo, re-roll the founder ref with stronger stylization rather than retrying the same rejected reference.
- For local brand-kit logos, upload only logo-appropriate raster assets (`image/png`, `image/jpeg`, or `image/webp`). Do not send SVGs to `mcp__pika__upload_asset`; choose the PNG export from `build-a-brand` or rasterize first.
- Use CSS `background-image: url(...)` for CDN-hosted logo/photo assets in end-card HTML; `<img crossorigin>` is blocked by CDN CORS.
- Use server-side deterministic tools for captions, concat, and mix. Local ffmpeg is only the fallback for arbitrary transparent lower-third overlay composition, and that fallback is used only when `state.lower_third = true`.
- Decompose every 15s act into 3 time-coded sub-shots. Single-shot acts look static.
- Open each act with the style-match location framing and repeat the same `WARDROBE LOCK:` sentence across all 4 act prompts.
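Two of the constraints above (one unique seed per act, three time-coded sub-shots per 15s act) can be sketched as below. The equal three-way split and the `[0-5s]` label format are assumptions for illustration; the notes only require that each act has three time-coded sub-shots.

```python
# One unique seed per SeeDance act, as listed in the notes above.
ACT_SEEDS = {1: 101, 2: 202, 3: 303, 4: 404}


def sub_shot_windows(act_seconds: int = 15, shots: int = 3) -> list:
    """Split one act into time-coded windows such as "[0-5s]".

    Assumption: windows are equal length; the last one absorbs any remainder.
    """
    step = act_seconds // shots
    windows = []
    for i in range(shots):
        start = i * step
        end = act_seconds if i == shots - 1 else start + step
        windows.append(f"[{start}-{end}s]")
    return windows
```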
[1] Analyze brief
analyze_brief(
sources=[{ type: "url", url: <product_url> }],
context: "Founder-style 60-second product video. Need: product name, one-line tagline, 3-5 key features, target audience, brand tone, and a call-to-action."
)
Save the result as `brief`. You'll reference `brief.product_name`, `brief.tagline`, `brief.key_features`, `brief.tone`, `brief.call_to_action` throughout.
[2] Analyze each user-provided asset + derive `product_type`
For each entry in `assets`, run `mcp__pika__analyze_media` to extract content + visual style + asset type. Run all in parallel in one tool batch:
analyze_media(
media: <asset.url>,
query: 'Describe this product image briefly. Return STRICT JSON:
{
"content_description": "1-line summary of what is visible",
"asset_type": "digital_screen | physical_apparel | physical_object | consumable | infographic | other",
"key_elements": ["3-5 specific UI elements / features / objects in the image"],
"visible_copy": "any text visible — heading, button label, tagline, t-shirt graphic text (or empty string)",
"primary_colors": ["#hex", "#hex", "#hex"],
"vibe": "1-line visual feeling",
"best_for_act": "hook | problem | solution | proof"
}
Return ONLY the JSON.'
)
`asset_type` values:
- `digital_screen` — app UI / website screenshot / SaaS dashboard / mobile app capture
- `physical_apparel` — t-shirts, hoodies, hats, anything wearable (model + garment)
- `physical_object` — gadgets, accessories, packaged goods, anything held in hand
- `consumable` — food, beverages, supplements (something used/eaten/drunk)
- `infographic` — chart, diagram, data viz, illustration
- `other` — anything else; describe and pick best fit
Save as `asset_analyses[i]`.
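Because the query asks for strict JSON, it can help to parse each reply defensively before saving it. The sketch below is one possible approach; the function name `parse_asset_analysis` and the choice to coerce unknown `asset_type` values to `other` are assumptions, not documented behavior of `mcp__pika__analyze_media`.

```python
import json

ASSET_TYPES = {
    "digital_screen", "physical_apparel", "physical_object",
    "consumable", "infographic", "other",
}
ACTS = {"hook", "problem", "solution", "proof"}


def parse_asset_analysis(raw: str) -> dict:
    """Extract the JSON object from a reply and normalize its enum fields."""
    # Tolerate stray prose around the object by slicing brace to brace.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in analyze_media reply")
    data = json.loads(raw[start:end + 1])
    if data.get("asset_type") not in ASSET_TYPES:
        data["asset_type"] = "other"  # assumption: unknown values fall back
    if data.get("best_for_act") not in ACTS:
        data["best_for_act"] = None  # leave act mapping to the script step
    return data
```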
Derive `product_type`
Look at the dominant `asset_type` across all assets:
product_type = mode(asset_analyses[i].asset_type) → mapped to {
digital_screen → "digital"
physical_apparel → "physical_apparel"
physical_object → "physical_object"
consumable → "consumable"
infographic | other → "service" (no concrete physical/digital reveal — fall back to environment shots)
}
If user passed `product_type` explicitly, use that and skip auto-derivation. The `product_type` value drives which shots are picked in step [3] and how the founder reveals the product in step [5]. Get this right or the video shows the wrong thing on screen.
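The mode-then-map rule above translates fairly directly into Python. This is a minimal sketch: the function name, the tie-break (first type seen wins), and the `service` fallback for an empty asset list are assumptions.

```python
from collections import Counter
from typing import Optional

TYPE_MAP = {
    "digital_screen": "digital",
    "physical_apparel": "physical_apparel",
    "physical_object": "physical_object",
    "consumable": "consumable",
    "infographic": "service",
    "other": "service",
}


def derive_product_type(asset_analyses: list, explicit: Optional[str] = None) -> str:
    if explicit:
        return explicit  # a user-supplied value skips auto-derivation
    if not asset_analyses:
        return "service"  # assumption: no assets means no concrete reveal
    counts = Counter(a["asset_type"] for a in asset_analyses)
    # most_common orders equal counts by first appearance.
    dominant = counts.most_common(1)[0][0]
    return TYPE_MAP.get(dominant, "service")
```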
Product type → reveal pattern (the most important table in this skill)
| `product_type` | Reveal shots | Reveal beat (used in SeeDance prompt) | Which acts get assets |
|---|---|---|---|
| `digital` | C-phone, E-phone | "founder lifts her phone toward camera; the screen clearly shows @ImageN — match exactly, do not invent UI" | shots C and E only |
| `physical_apparel` | G-hold, G-wear, E-detail | "founder lifts a charcoal-washed graphic tee toward camera; the shirt design exactly matches @ImageN — match the print/graphic exactly, do not invent" OR "founder is wearing the t-shirt from @ImageN — match the print exactly" | EVERY shot where a t-shirt is visible (C/E/F + the "wearing" variants) |
| `physical_object` | C-hold, E-detail, F-twoshot | "founder holds the [product name] up toward camera; the product exactly matches @ImageN — match shape, color, branding" | shots that show the product |
| `consumable` | C-hold, E-detail, H-using | "founder holds/uses the [product]; the packaging/product matches @ImageN exactly" | shots that show the product |
| `service` | A, B, D, F (environment) | no specific product reveal — focus on founder + environment | none of the shots reference assets |
For physical products, every shot where the product appears in frame should pass that asset as a reference image; otherwise SeeDance tends to invent a generic-looking product. For apparel, if the founder is wearing a t-shirt and the script says "we make t-shirts", the founder's t-shirt needs to reference one of the assets even in shots that are not reveal moments. Pass the asset URLs in `reference_images` for those acts and write prompt language like "the founder is wearing the t-shirt from @Image3 — print matches exactly".
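To keep the `@ImageN` tag in the prompt aligned with the list passed as `reference_images`, a small helper can derive the tag from the list position. This sketch assumes `@ImageN` is a 1-based index into `reference_images` in the order given; the text does not confirm that numbering, and both helper names are hypothetical.

```python
def image_tag(reference_images: list, asset_url: str) -> str:
    """Return the @ImageN tag for an asset (assumes 1-based list order)."""
    return f"@Image{reference_images.index(asset_url) + 1}"


def apparel_line(reference_images: list, asset_url: str) -> str:
    """Build the 'wearing' prompt line for a non-reveal apparel shot."""
    tag = image_tag(reference_images, asset_url)
    return f"the founder is wearing the t-shirt from {tag}, print matches exactly"
```

Deriving the tag from the list avoids a prompt that names `@Image3` when only two references were passed.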
[3] Write script + character voice + per-shot asset + per-line beats (you do this — no model call)
Three sub-products, all written by you (Claude) in one inline JSON:
- `character_voice_profile` — 3-4 lines describing the character's DEFAULT delivery (carries through every act for consistency)
- Per-shot `asset_index` + `reveal_beat` — what asset is visible in this shot and how it's revealed
- Per-shot `beats[]` — line-by-line acting direction with `emotion` + `physical` + silence beats between sentences
This is what separates a generic AI talking head from a character that actually feels intentional. Read all six sub-sections below ([3.0] founder voice, [3a] character voice profile, [3b] beats, [3c] transitions, [3c.1] acting energy, [3d] full JSON) before writing.
[3.0] Founder voice — write a PITCH, not a feature list
The single most common failure mode in this skill is dialogue that reads like a marketing-page bullet list ("It can reason. Code. Even write your emails. No proxies. No selectors. No maintenance. Plug it into LangChain. LlamaIndex. MCP. Twenty-four thousand stars on GitHub. MIT licensed. Production-grade.") — clean copy, but it's not how a founder talks. User feedback 2026-05-08: "the script sounds like a list of features, not like a founder would sell their product on camera."
Real founders pitching their own product on camera use:
- First-person ownership — "I built", "we shipped", "we use it ourselves", "honestly we just want this everywhere"
- A personal stake or origin moment — Act 1 should reference a frustration the founder lived through, NOT the product abstractly. "Every time I tried building X, I hit the same wall" beats "X is hard."
- Conversational connectives — "look", "honestly", "the thing is", "so", "actually", trailing "..." for thinking. These are throwaway words in writing but the breath of natural speech.
- A "bet" framing for the product — "what if X just worked?", "we asked ourselves", "the whole idea was". Founders frame their product as an answer to a question they asked themselves, not as a list of capabilities.
- One concrete anchor — a specific number, a specific time, a specific scenario. "24 thousand devs starred it last year" beats "it's popular." "At 3 AM the layout breaks" beats "scrapers are unreliable."
- Invitation-energy CTA — "come try us", "go play with it", "we just want it everywhere". NOT "stop scraping. start extracting." (that's a Don Draper tagline, not a founder).
Banned patterns (the user called out each one; do not repeat them):
| ❌ Banned pattern | Example | Why |
|---|---|---|
| Triple-negation chant | "No proxies. No selectors. No maintenance." | Feels like a marketing chant, not human speech |
| Capability staccato | "It can reason. Code. Even write your emails." | Reads as a feature checklist |
| Integration-list-as-pitch | "Plug it into LangChain. LlamaIndex. MCP." | Listing integrations is fine ONCE in passing — never as a 3-beat hook |
| Tagline closer | "Stop scraping. Start extracting." | Pure ad-copy. Founders close with invitation, not a slogan |
| Specs-as-pitch | "MIT licensed. Production-grade." | Specs go in the README, not the founder's mouth on camera |
| "Just" as filler in a list | "Just one API call. Just any URL. Just structured JSON." | "Just" repeated reads as marketing emphasis, not natural speech |
Allowed patterns (use these instead):
| ✅ Pattern | Example |
|---|---|
| Personal-stake hook | "Honestly — every time I tried building X, same thing happened. ..." |
| "What if" framing | "So we built Y. The whole idea was: what if Z just worked?" |
| One concrete claim | "Last year we hit 24 thousand stars. People are plugging us in everywhere." |
| Casual aside on pain | "Hand it a URL. Get clean structured data. The layout changes? Doesn't matter." |
| Invitation closer | "If your agent needs to actually see the live web — come try us." |
Structure (4 acts, ~30-40 words per act = 120-160 words total, ~50-60s spoken):
- Act 1: Personal stake / pain. First person. Reference a specific frustration the founder lived through. Land on the problem named cleanly.
- Act 2: The bet. "So we built X. The idea was — what if [pain] just worked?" One sentence on what it actually does (URL → data, prompt → image, etc).
- Act 3: Proof + community. ONE specific number (stars, customers, ARR). One casual mention of integrations or where it's used. Tone: quiet confidence, not bragging.
- Act 4: Invitation. "If [reader's situation] — come try us. [URL]. [One inviting line]." End on warmth, not a tagline.
Self-test before approving the script. Read each act's dialogue out loud. If you'd be embarrassed to say it on camera as the founder, rewrite it. If it sounds like a 30-second commercial voiceover, rewrite it. If a paragraph strings more than two short, period-terminated fragments together in a row, rewrite it.
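Part of the self-test can be automated as a rough first pass before reading the script aloud. The sketch below checks the 30-40 word budget per act, two of the banned patterns, and runs of short fragments. The regexes and the thresholds (a fragment is "short" at three words or fewer; three in a row is flagged) are assumptions chosen for illustration, and the check does not replace the read-aloud test.

```python
import re

BANNED = {
    "triple-negation chant": re.compile(r"(?:\bNo [^.]+\.\s*){3,}"),
    "repeated 'just'": re.compile(r"(?:\bJust [^.]+\.\s*){3,}", re.IGNORECASE),
}


def lint_act(dialogue: str) -> list:
    """Return a list of problems found in one act's dialogue."""
    problems = []
    words = len(dialogue.split())
    if not 30 <= words <= 40:
        problems.append(f"word count {words} outside 30-40")
    for name, pattern in BANNED.items():
        if pattern.search(dialogue):
            problems.append(name)
    # Flag three or more consecutive fragments of at most three words each.
    fragments = [s.strip() for s in re.split(r"[.!?]+", dialogue) if s.strip()]
    run = 0
    for frag in fragments:
        run = run + 1 if len(frag.split()) <= 3 else 0
        if run >= 3:
            problems.append("staccato fragments")
            break
    return problems
```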
[3a] Derive `character_voice_profile` + `wardrobe_lock`
Two separate fields, both required: `character_voice_profile` and `wardrobe_lock`. They draw on `brief.tone`, `character_image_url`, and `product_type`.
Tonal-template starters (orchestrator picks/customizes from the brief tone):
| Default cadence | Face | Hands | Pauses |
|---|---|---|---|
| conversational, like explaining to a friend at coffee | slight smirk default, eyebrow flicks on reveals | open out flat on big claims, hand to chin when thinking | held eye contact instead of filling silence |
| light staccato, expressive | mischief lives near the eyes, frequent eyebrow flicks, smiles arrive a beat after the punchline | light shoulder bounces, animated count-on-fingers | brief pauses with knowing looks |
| measured pace, deliberate | soft direct eye contact, restrained smile | hand positions deliberate not constant, single open palm gesture | confident silences, doesn't fill |
| analytical, slightly slower | analytical default, eyes cycle to think then return on landing | hand-to-chin thinking gesture, points to imaginary diagrams | thinking-pauses, eyes go up-left |
| staccato, clipped sentences with sudden pauses | dry deadpan default, mischievous grin breaks through then disappears | body stays still, the FACE does the work | sharp pauses, slight head tilts |
Worked example — Pika MCP / Semi (casual tone, 3D Pixar 20s woman):
"Casual confidence, like explaining the connector to a friend at coffee. Slight smirk default. Eyebrow flicks on key reveals. Hand goes to chin when thinking, opens out flat on the big 'meet Pika M C P' claim. Pauses with held eye contact rather than filling silence. Lands punchlines deadpan and lets a small smile arrive a beat after."
Worked example — Cat Stole My T-Shirt (playful/edgy tone, streetwear founder):
"Sharp dry wit. Talks fast in clipped sentences with sudden pauses. Default slight smirk with one raised eyebrow. Eye-rolls on the pain points ('boring', 'generic'). Mischievous grin breaks through on punchlines but disappears immediately. Hands stay mostly still — the FACE does the work."
[3b] Per-line `beats[]` — line-level direction, not act-level
Each shot's dialogue is broken into `beats`. Each beat is one short sentence (or a deliberate silence) with its own `emotion` + `physical` direction. Silence between beats is part of the performance — fill it with held looks, micro-expressions, gesture transitions.
A beat with `text: "(beat)"` is silent (no spoken text) — it just describes what happens visually during the natural pause between sentences. Use these between dialogue beats that need a held moment for emphasis.
When the SeeDance prompt is built in step [5], beats become the per-shot acting direction (the dialogue text without `(beat)` markers becomes the `<<<voice_1>>>` payload).
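A rough Python sketch of how a shot's beats might collapse into the `<<<voice_1>>>` payload: spoken texts are kept in order and silent `(beat)` markers are skipped. The function name `voice_payload` and the single-space join are assumptions for illustration, not part of the skill itself.

```python
BEAT_MARKER = "(beat)"  # silent-beat marker used in the script JSON
VOICE_TAG = "<<<voice_1>>>"


def voice_payload(beats):
    """Join spoken beat texts (assumed: single space) and wrap them in voice tags."""
    spoken = [beat["text"] for beat in beats if beat["text"] != BEAT_MARKER]
    return VOICE_TAG + " ".join(spoken) + VOICE_TAG


if __name__ == "__main__":
    shot_beats = [
        {"text": "Claude is brilliant.", "physical": "soft direct eye contact, slight nod"},
        {"text": "(beat)", "physical": "subtle smirk arrives, eyes hold camera"},
        {"text": "shapeless.", "physical": "single dismissive shrug"},
    ]
    print(voice_payload(shot_beats))
```

The silent beats are not lost: they would still feed the acting direction, only the spoken payload leaves them out.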
[3c] `transition_from_prev` — choreograph continuous camera motion between shots in the same clip
The fundamental SeeDance limitation: each 15s SeeDance generation renders ONE virtual environment with ONE virtual camera. When a multi-shot prompt declares "Shot C, then Shot A" without specifying a continuous camera move between them, SeeDance defaults to re-framing the same camera position (zoom or crop). The result reads as a jump zoom, not a real cut — same background, character at different sizes.
The fix: every shot beyond the first in an act must declare a `transition_from_prev` field — a one-line description of the continuous camera motion that takes us from the previous shot's framing to this one. SeeDance then has to render an actual move-through-space, which means different parts of the room appear behind the character across the clip.
Pattern: name the camera's start position, name where it ends up, name the move that connects them. Movement verbs that work: dolly, pull back, push in, orbit, arc, glide, crane up, crane down, tilt up, tilt down, drift left/right.
Examples:
| `transition_from_prev` | Effect |
|---|---|
| "Camera pulls back and arcs left, revealing the brick wall and standing desk behind her now in frame" | Real spatial change — different background portion |
| "Push past the phone screen into a closer framing of her face — the room blurs behind her" | Continuous motion using rack focus + dolly |
| "Camera glides clockwise around her at a steady distance, picking up the whiteboard and plants on the new side" | Orbit reveals new background |
| "Pull back from her hands holding the phone to a medium shot, then drift right toward the window light" | Two-step continuous move |
| ❌ "Cut to medium shot" / ❌ "Now we see her in a medium shot" | These don't describe motion — SeeDance falls back to same-position re-frame |
The first shot in an act has no `transition_from_prev` — it establishes the framing. Every subsequent shot in that act gets one.
SeeDance can do hard cuts within a single 15s clip when prompted explicitly. Write `Hard cut:` between time-coded sub-shots (instead of `Transition:`) for distinct framing changes — SeeDance honors this and renders a real cut, not a re-frame. Reserve `Transition:` for continuous-motion handoffs where you want the camera to glide between framings. Pattern: hard cuts feel like a real edited piece (different framings, different camera angles, different acting energy); transitions feel like a single moving long take.
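The `transition_from_prev` rules can be checked mechanically before any generation call. The Python sketch below is only a heuristic under stated assumptions: `check_transitions` and `MOVE_STEMS` are hypothetical names, and the substring match against the movement verbs is deliberately crude (it does not model shots that are joined by an explicit hard cut instead).

```python
# Verb stems drawn from the movement-verb list; matched as plain substrings.
MOVE_STEMS = ("dolly", "dollies", "pull", "push", "orbit", "arc",
              "glide", "crane", "tilt", "drift")


def check_transitions(act):
    """Return human-readable problems with transition_from_prev for one act dict."""
    problems = []
    for i, shot in enumerate(act["shots"]):
        transition = shot.get("transition_from_prev")
        if i == 0:
            # The first shot establishes framing and should not carry a transition.
            if transition:
                problems.append(f"shot {i}: first shot should not declare a transition")
            continue
        if not transition:
            problems.append(f"shot {i}: missing transition_from_prev")
        elif not any(stem in transition.lower() for stem in MOVE_STEMS):
            problems.append(f"shot {i}: no camera-movement verb found")
    return problems


if __name__ == "__main__":
    act = {"shots": [
        {"type": "C"},
        {"type": "A", "transition_from_prev": "Cut to medium shot"},
    ]}
    for problem in check_transitions(act):
        print(problem)
```

A phrase like "Cut to medium shot" is flagged because it names a framing but no motion, which is the failure the examples table warns about.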
[3c.1] Acting energy floor — every beat needs explicit body movement
A frequent failure mode: beats are written with only facial micro-expressions ("slight nod", "eyebrow flick", "eyes hold camera"). SeeDance renders this as a near-frozen founder — eyes barely move, no presence. Result reads as "static, frozen, no excitement."
Rule: every beat's `physical` field needs at least one of:
- A hand or arm gesture (open palm, count on fingers, dismissive flick, point at self/camera, hand to chest, wider arm sweep, hand-to-temple thinking)
- A torso shift (lean forward, lean back, slight body turn, shoulder shift)
- A head action LARGER than a micro-flick (turn left/right and back, tilt 8°+, slow head shake, head bob on rhythm)
- A directional eye flick combined with eyebrow movement (look down then snap up to camera, etc.)
Facial-only beats are acceptable only for:
- Silent `(beat)` markers between spoken sentences (those are meant to be still — the held look is the point)
- Final landing beat at end of an act when the camera is already moving (camera does the work)
When you write the SeeDance prompt, make sure the assembled "Acting beats" block reads physically dense — if you scan it and see five beats in a row that all say "slight nod" or "small smirk" with no other movement, the founder will look frozen. Rewrite with bigger movement.
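One way to scan for frozen-founder beats before building the prompt is a keyword check over each beat's `physical` text. This Python sketch is a loose approximation, assuming a hand-picked stem list (`BODY_MOVE`) and a hypothetical helper `facial_only_beats`; it skips silent `(beat)` markers but does not know about the end-of-act exception, so treat its output as hints to review rather than a verdict.

```python
import re

# Hypothetical stems for hand/arm gestures, torso shifts, and larger head moves.
BODY_MOVE = re.compile(
    r"\b(hand|palm|finger|arm|lean|shrug|shoulder|torso|turn|shake|point|walk|sweep)",
    re.IGNORECASE,
)


def facial_only_beats(beats):
    """Return indices of spoken beats whose 'physical' text names no body movement."""
    flagged = []
    for i, beat in enumerate(beats):
        if beat["text"] == "(beat)":
            continue  # silent markers are allowed to stay still
        if not BODY_MOVE.search(beat.get("physical", "")):
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    beats = [
        {"text": "Claude is brilliant.", "physical": "soft direct eye contact, slight nod"},
        {"text": "shapeless.", "physical": "eyes return to camera, single dismissive shrug"},
    ]
    print(facial_only_beats(beats))
```

Flagged beats are candidates for a rewrite with a larger gesture or torso shift.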
[3d] Script JSON
```json
{
  "product_type": "<from step [2]>",
  "character_voice_profile": "Casual confidence, like explaining to a friend at coffee. Slight smirk default. Eyebrow flicks on key reveals. Hand goes to chin when thinking, opens out flat on big claims. Pauses with held eye contact rather than filling silence. Lands punchlines deadpan, lets a small smile arrive a beat after.",
  "segments": [
    {
      "act": 1,
      "shots": [
        {
          "type": "A",
          "asset_index": null,
          "beats": [
            { "text": "Claude is brilliant.", "emotion": "declarative respect — say it like she means it", "physical": "soft direct eye contact, slight nod" },
            { "text": "(beat)", "physical": "subtle smirk arrives, eyes hold camera" },
            { "text": "But it's also kind of...", "emotion": "playful pivot, ellipsis hangs", "physical": "slight head tilt right, eyes drift up briefly on the ellipsis" },
            { "text": "shapeless.", "emotion": "deadpan landing", "physical": "eyes return to camera, single dismissive shrug" }
          ]
        },
        {
          "type": "B",
          "asset_index": null,
          "transition_from_prev": "Camera pushes in slowly from the medium framing into a tighter close-up, drifting slightly off-axis to her right so a different slice of the brick wall and window light is visible behind her",
          "beats": [
            { "text": "No face. No voice. No personality of its own.", "emotion": "staccato dismissal", "physical": "small head shake on each, eyebrow flick on 'personality'" },
            { "text": "(beat)", "physical": "held look, eyes lock camera, soft smile starts to arrive" },
            { "text": "Just an empty assistant waiting for orders.", "emotion": "flat, slightly resigned deadpan", "physical": "neutral face" },
            { "text": "(beat)", "physical": "soft confident smile arrives, lean forward begins" },
            { "text": "That's about to change.", "emotion": "grounded conviction, the turn", "physical": "lock eyes, single confident nod on 'change'" }
          ]
        }
      ]
    },
    {
      "act": 2,
      "shots": [
        {
          "type": "C",
          "asset_index": 0,
          "reveal_beat": "The character holds her phone up toward camera at chest height, screen facing the viewer. Phone screen exactly matches @Image3 — match the screen content exactly, do not invent UI.",
          "beats": [
            { "text": "Meet Pika M C P.", "emotion": "introduction with quiet pride", "physical": "phone lifts to camera, eyes flick from screen to lens" },
            { "text": "One connector that gives your Claude a face, a name, a personality.", "emotion": "warm steady build", "physical": "free hand counts the three on fingers — face, name, personality" }
          ]
        },
        {
          "type": "A",
          "asset_index": null,
          "transition_from_prev": "Camera pulls back from the phone and arcs slightly left, the phone lowers out of frame as we end on a medium shot of the character with the desk and whiteboard now visible behind her",
          "beats": [
            { "text": "And the ability to make videos, images, audio.", "emotion": "expanding the promise", "physical": "open-palm gesture sweeps wider on each item" },
            { "text": "Right inside the chat.", "emotion": "the grounding kicker", "physical": "hand lands flat, eyebrows up, slight smile arrives" }
          ]
        }
      ]
    },
    {
      "act": 3,
      "shots": [
        {
          "type": "D",
          "asset_index": null,
          "beats": [
            { "text": "Setup takes thirty seconds.", "emotion": "matter-of-fact reassurance", "physical": "walks past a desk, glances at a laptop briefly" },
            { "text": "Open Claude settings, paste the URL, sign in.", "emotion": "quick rhythmic checklist", "physical": "counts three on fingers as she walks" }
          ]
        },
        {
          "type": "E",
          "asset_index": 1,
          "reveal_beat": "Close-up of hands holding phone. Screen first matches @Image3 ('What if your Claude could be CAMI'), then transitions mid-shot to @Image4 (setup guide). Match each screen exactly, do not invent UI.",
          "transition_from_prev": "Camera dollies in fast past her shoulder to land on a tight close-up of her hands and the phone, the loft background drops fully out of focus",
          "beats": [
            { "text": "Your Claude becomes Cammy.", "emotion": "the soft surprise reveal", "physical": "small smile at the phone, then up to camera, eyes warm" }
          ]
        },
        {
          "type": "A",
          "asset_index": null,
          "transition_from_prev": "Camera pulls back and tilts up from the phone to find her face in a medium shot, the brick wall and afternoon light now visible behind her on a different side of the loft than Shot D",
          "beats": [
            { "text": "Or whoever you build.", "emotion": "casual aside", "physical": "small shrug, slight smirk" },
            { "text": "(beat)", "physical": "held look, smirk fades into warm sincerity" },
            { "text": "Now talk to her like a person.", "emotion": "the real point — quiet conviction", "physical": "single nod on 'person', eyes hold" }
          ]
        }
      ]
    },
    {
      "act": 4,
      "shots": [
        {
          "type": "F",
          "asset_index": null,
          "beats": [
            { "text": "Skills bundled in.", "emotion": "casual confidence, intro to a list", "physical": "slight tilt of the chin, knowing look" },
            { "text": "Podcasts. Explainer videos. U G C ads.", "emotion": "rhythmic three-beat list", "physical": "small nod on each, eyebrow flick on 'U G C'" },
            { "text": "All from chat.", "emotion": "the grounding tag", "physical": "open-palm gesture lands flat, slight smile arrives" }
          ]
        },
        {
          "type": "B",
          "asset_index": null,
          "transition_from_prev": "Camera pushes in slowly from the wider brand-context framing into an intimate medium close-up; the environment recedes into soft bokeh, the character fills more of the frame",
          "beats": [
            { "text": "So stop wrestling with generic A I.", "emotion": "direct address, low-key challenge", "physical": "raised eyebrow, slight head tilt" },
            { "text": "(beat)", "physical": "held look, smirk grows" },
            { "text": "Pikafy your Claude.", "emotion": "the brand line, said with certainty", "physical": "lean slightly into camera, lock eyes" },
            { "text": "Start at pika dot me slash M C P.", "emotion": "warm CTA, the invitation", "physical": "soft confident smile, single closing nod on 'M C P'" }
          ]
        }
      ]
    }
  ]
}
```
Per-shot asset assignment rules:
- For digital products: only shots `C` and `E` get an `asset_index` (phone-reveal moments). Other shots show the founder without a specific UI reference.
- For physical products: any shot where the product is visible in frame gets an `asset_index`. The asset is the source of truth for what that product looks like. Acts can reuse the same asset across multiple shots, OR show a different asset per shot to demo product variety.
- For service products: no `asset_index` anywhere — the script relies on dialogue + environment.

Reference image array per act = the union of asset URLs across that act's shots. Within the SeeDance prompt, refer to assets by their position in the array as `@Image3`, `@Image4` (positions 3+ — positions 1 and 2 are always character + location refs). The orchestrator computes this mapping when building the prompt in step [5].
Dialogue rules — TOTAL across all 4 acts must read aloud in 55–60s (~150 wpm = ~150 words total, ~37 per act). Short, punchy, speakable. Avoid em-dashes (founders don't speak them). Use natural contractions. Reference the user's actual product features (drawn from `asset_analyses[i].key_elements` and `asset_analyses[i].visible_copy`), not invented ones.
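The dialogue budget is simple arithmetic: at roughly 150 words per minute, N words take about N / 150 × 60 seconds. The Python sketch below counts spoken words across a script shaped like the JSON above and estimates read-aloud time. The helper names `dialogue_stats` and `fits_budget` are hypothetical, and whitespace word-splitting is an assumption (spelled-out acronyms such as "M C P" count as three words).

```python
def dialogue_stats(script, wpm=150):
    """Return (word_count, estimated_seconds) for all spoken beats in the script."""
    words = 0
    for segment in script["segments"]:
        for shot in segment["shots"]:
            for beat in shot["beats"]:
                if beat["text"] != "(beat)":  # silent markers carry no words
                    words += len(beat["text"].split())
    seconds = words / wpm * 60
    return words, seconds


def fits_budget(seconds, low=55, high=60):
    """True when the estimated read time lands inside the 55-60s target window."""
    return low <= seconds <= high


if __name__ == "__main__":
    demo = {"segments": [{"shots": [{"beats": [
        {"text": "Setup takes thirty seconds."},
        {"text": "(beat)"},
        {"text": "Right inside the chat."},
    ]}]}]}
    count, secs = dialogue_stats(demo)
    print(count, round(secs, 1), fits_budget(secs))
```

The estimate ignores held pauses, so a script that lands near the top of the window on word count alone may run long once silent beats are performed.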
TTS pronunciation rewrites
SeeDance's native lip-sync TTS reads `<<<voice_1>>>` text literally — it has no semantic awareness that "Cami" is a name, "UGC" is an acronym to spell, or "pika.me" is a URL. Rewrite the dialogue text the way you want it pronounced, then submit. Apply these substitutions:
| Pattern | Wrong (literal) | Correct (rewrite for TTS) |
|---|---|---|
| Names ending in | "Cami" → "kah-MAH" | "Cammy" (or "Tess" / "Mae" / "Sam" — phonetic) |
| Names with unfamiliar spellings | "Aoife" → confused | "Eefa" (phonetic) |
| Acronyms meant to be spelled out | "UGC" → "uhg" / "ugg"; "MCP" → "mehp" / silent | "U G C" / "M C P" (single spaces between letters) |
| Acronyms spoken as words | "NASA" → "nasa" ✅ (already correct); "IKEA" → "ikea" ✅ | leave alone |
| Domain dots | "pika.me" → "pikamay" / "pikah-may" | "pika dot me" |
| URL slashes / paths | "pika.me/MCP" → "pikamay-em-cee-pee" | "pika dot me slash M C P" |
| Symbols | "$50" → silent; "@user" → "at user" or silent | "fifty bucks" / "at-sign user" |
| Numbers in weird formats | "2026" → ambiguous | "twenty twenty-six" for years; "two thousand" for round numbers |
When in doubt about how a brand pronounces an acronym (NASA vs N.A.S.A.), check the brand's own website / videos. Default to spelled-out letters when unclear.
Worked example — original script vs TTS-safe rewrite:
Original: "UGC ads. All from chat. Pikafy your Claude. Start at pika.me/MCP."
TTS-safe: "U G C ads. All from chat. Pikafy your Claude. Start at pika dot me slash M C P."
Original: "Your Claude becomes Cami."
TTS-safe: "Your Claude becomes Cammy."
Keep the rewrites in the dialogue text only — your script JSON's `dialogue` field is what flows into the SeeDance prompt verbatim. Slide-card text (end card etc.) and the brief stay in original spelling.
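Several of these substitutions are regular enough to automate as a first pass. The Python sketch below handles domain dots, URL slashes, spelled-out acronyms, and a name table. `tts_safe`, `SPELLED_ACRONYMS`, and `NAME_REWRITES` are hypothetical names, and the acronym and name lists are per-project assumptions, so the output still needs a human read (abbreviations like "e.g." would be mangled, and symbols or numbers are not handled).

```python
import re

SPELLED_ACRONYMS = {"UGC", "MCP"}  # assumed list; word-like acronyms such as NASA stay out
NAME_REWRITES = {"Cami": "Cammy", "Aoife": "Eefa"}  # phonetic respellings per project

ACRONYM_RE = re.compile(r"\b(" + "|".join(sorted(SPELLED_ACRONYMS)) + r")\b")


def tts_safe(text):
    """Rewrite dialogue so a literal TTS reader pronounces it as intended."""
    # A dot between two word characters is treated as a domain dot, not sentence punctuation.
    text = re.sub(r"(?<=[A-Za-z0-9])\.(?=[A-Za-z])", " dot ", text)
    # A slash between word characters is read out as "slash".
    text = re.sub(r"(?<=\w)/(?=\w)", " slash ", text)
    # Spell listed acronyms letter by letter with single spaces.
    text = ACRONYM_RE.sub(lambda m: " ".join(m.group(0)), text)
    # Swap names for their phonetic spellings.
    for name, spoken in NAME_REWRITES.items():
        text = re.sub(r"\b" + re.escape(name) + r"\b", spoken, text)
    return text


if __name__ == "__main__":
    print(tts_safe("Start at pika.me/MCP."))
    print(tts_safe("Your Claude becomes Cami."))
```

Applying the function only to beat text, and never to end-card copy, keeps the original spelling where viewers read rather than hear it.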
Shot type reference
Each shot has variants depending on `product_type`. Pick the variant that matches.
⚠️ Avoid film-industry shot terminology that SeeDance interprets literally. SeeDance reads named shot types as a literal recipe — including any implied subjects the term carries. Specifically:
- ❌ "Two-shot" → adds a SECOND PERSON to the frame (term means "shot with two subjects" in film, but SeeDance just sees "two" + "person").
- ❌ "Three-shot" — same trap.
- ❌ "Over-shoulder" / "OTS" → adds a phantom shoulder/back-of-head in the foreground for the character to interact with. The character then performs toward that phantom person, not toward the camera. (Verified 2026-05-02 — caused Semi to "show her phone to someone in OTS" in an early Pika MCP test.)
- ❌ "Master shot" — can be misread as "the master / their boss".
- ✅ "Medium shot", "Close-up", "Wide" are safe; they're commonly used colloquially.
The fix in every case is plain language about what the camera sees, not film vocabulary that implies extra subjects. Examples:
- "Over-shoulder reveal" → "the character holds her phone up toward camera, screen facing the viewer at chest height"
- "Two-shot" → "wide framing of the character with [product/logo/environment] in frame"
- "POV" → "low camera angle from the character's eyeline"
| Shot | All variants | Camera |
|---|---|---|
| A | Medium shot, waist up, the character centered. (No product in frame, OR if | slow subtle push-in dolly |
| B | Medium close-up, chest up, intimate. (No product, OR if | gentle handheld breathing motion |
| C | Phone/product reveal — direct to camera. | slow push-in toward the held object |
| D | Wide + environment, full body in expansive space. (No specific product moment.) | slow tracking shot, parallax depth |
| E | Close-up reveal. | subtle rack focus, slow tilt up |
| F | Brand context shot. Wider framing of the character with product/logo/environment context surrounding them. | slow pull-back, gentle zoom-out |
[4] Founder + custom location reference images
Prepare `character_url` before any SeeDance call:
- If `founder_photo` is an HTTPS URL, set `founder_photo_url = character_url = founder_photo`.
- If `founder_photo` is a local path, upload it with `mcp__pika__upload_asset`, then set `founder_photo_url = character_url = public_url`.
- If `founder_photo` is `generate`, call `mcp__pika__generate_image` and use the returned URL:

```
generate_image(
  prompt: "<founder vibe>. Professional founder portrait, clean studio lighting, sharp focus on face, confident expression, suitable as character reference for video generation.",
  aspect_ratio: "3:4",
  resolution: "2K"
)
```

Handle location only when the user supplied a custom location:
- If `location_image_url` is an HTTPS URL, set `location_url = location_image_url`.
- If it is a local path, upload it with `mcp__pika__upload_asset` and set `location_url = public_url`.
- If it is a text description, generate a custom location reference with `mcp__pika__generate_image`.
- If no custom location was supplied, do nothing here. Step [4.5] generates the default brand-accent backdrop after `state.brand` exists.

Save the resulting URLs into `state`. If SeeDance later rejects the founder ref on content policy, see "Known infra quirks" — re-roll with stronger stylization.
[4.5] Brand-kit ingestion (always — Stage 0 guarantees `brand_kit_path`)
Stage 0 requires `brand_kit_path` to exist: either the user supplied it or it was created first via `build-a-brand`. Parse it once and reuse it for the end card and lower-third overlays. If the `brand_kit_path` folder is missing in interactive mode, ask the user to supply or recreate the brand kit before continuing. In the non-interactive fast lane, try the `build-a-brand --quick` branch first; if no kit can be generated, stop once and list the missing fields.
Preferred source is `brand.json` when present. Otherwise extract from a `build-a-brand` export:
- `brand.md` for name, tagline, voice, typography names, and logo descriptions.
- `tokens/tokens.json` for colors and font tokens.
- `logo/` assets for wordmark, symbol/icon, and lockups.

Extract into `state.brand`:
| Source | Notes |
|---|---|
| | brand display name |
| | upload local raster asset via `mcp__pika__upload_asset` |
| | upload local raster asset via `mcp__pika__upload_asset` |
| palette role | text and border color |
| palette role | page/background color |
| CTA/primary brand color from palette or | end-card CTA pill bg + lower-third accent |
| secondary light/highlight color from palette or tokens | lower-third border / highlight |
| typography display token or | use as-is if available; fall back to Space Grotesk |
| typography body/text token or | fall back to system sans |
| typography mono token if present | fall back to Space Mono |

Upload step is required when the brand-kit assets are local files. Without public wordmark/icon URLs, the HTML rendered by `mcp__pika__render_html_animation` can't reach them. Use the MCP `mcp__pika__upload_asset` flow with raster logo files only and save the returned `public_url` values on `state.brand`. `mcp__pika__upload_asset` rejects `image/svg+xml`; if the best logo is an SVG, pick the sibling PNG export from the brand kit or rasterize the SVG to PNG via `mcp__pika__html_to_png` by inlining the SVG inside an HTML `<svg>` block and using the returned PNG `public_url`.
Brand-accent backdrop default location — when `location_url` was not set by Step [4], render a solid-color PNG via `mcp__pika__html_to_png` using `state.brand.colors.accent`. Match the requested video aspect so the reference is not cropped later:
| Aspect | Backdrop size |
|---|---|
| 16:9 | 1920×1080 |
| 9:16 | 1080×1920 |
| 1:1 | 1080×1080 |

Save the returned `file_url` as `location_url`. This produces the clean studio-shoot aesthetic — character against a flat seamless wall in the brand's own accent color, no furniture or background detail. Validated 2026-05-03 on the pika.me re-run; far cleaner than AI-generated office sets, and SeeDance renders it reliably.

```
html_to_png(
  html: "<body style='margin:0;background:{accent}'></body>",
  format: "png",
  mode: "sync",
  raster_options: { viewport_px: { width: W, height: H }, device_scale: 1 }
)
# Save result.file_url as location_url
```
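To make the `W` and `H` placeholders concrete, here is a small Python sketch that builds the backdrop request arguments for a given aspect. It does not call any MCP tool. `backdrop_request` and `BACKDROP_SIZES` are hypothetical names, and pairing the three sizes with 16:9, 9:16, and 1:1 is inferred from their proportions.

```python
# Assumed aspect-to-size pairing, inferred from the pixel dimensions.
BACKDROP_SIZES = {
    "16:9": (1920, 1080),
    "9:16": (1080, 1920),
    "1:1": (1080, 1080),
}


def backdrop_request(aspect, accent):
    """Build the argument dict for a solid-color backdrop render (no network call)."""
    width, height = BACKDROP_SIZES[aspect]
    return {
        "html": f"<body style='margin:0;background:{accent}'></body>",
        "format": "png",
        "mode": "sync",
        "raster_options": {
            "viewport_px": {"width": width, "height": height},
            "device_scale": 1,
        },
    }


if __name__ == "__main__":
    print(backdrop_request("9:16", "#FF5500"))
```

An unknown aspect raises `KeyError`, which seems preferable to silently rendering a backdrop that gets cropped later.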
[5] Generate 4 SeeDance acts in parallel
For each act, collect the union of `asset_index` values across all shots in that act to build the `reference_images` array:

```
act_asset_indices = unique(shot.asset_index for shot in act.shots if shot.asset_index !== null)
act_asset_urls = [assets[i].url for i in act_asset_indices]
reference_images = [character_url, location_url, ...act_asset_urls]
```

The asset's position in `reference_images` determines its `@ImageN` token (positions 1, 2 are character + location; assets start at position 3). When writing the prompt, map each shot's `asset_index` to its `@ImageN` token using this position.
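The pseudocode above can be sketched as runnable Python. The function name `build_reference_images` is hypothetical, and keeping assets in first-appearance order is an assumption (the source only says the indices are unique). It returns both the array and the `asset_index` to `@ImageN` mapping.

```python
def build_reference_images(act, assets, character_url, location_url):
    """Return (reference_images, token_by_asset_index) for one act."""
    indices = []
    for shot in act["shots"]:
        idx = shot.get("asset_index")
        # Keep each asset once, in the order shots first use it (assumed ordering).
        if idx is not None and idx not in indices:
            indices.append(idx)
    reference_images = [character_url, location_url] + [assets[i]["url"] for i in indices]
    # Positions 1 and 2 are character + location, so assets start at @Image3.
    tokens = {idx: f"@Image{pos + 3}" for pos, idx in enumerate(indices)}
    return reference_images, tokens


if __name__ == "__main__":
    assets = [{"url": "https://example.com/a0.png"}, {"url": "https://example.com/a1.png"}]
    act = {"shots": [{"asset_index": None}, {"asset_index": 1}]}
    refs, tokens = build_reference_images(act, assets, "char.png", "loc.png")
    print(refs, tokens)
```

Because the mapping is per act, the same asset can be `@Image3` in one act and `@Image4` in another, so reveal text should be written from the returned tokens rather than from the global asset order.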
Prompt template
Refer to the character through the `@Image1` reference, not descriptive prompt prose. When the prompt says both "young creative streetwear founder" and `@Image1` is the character ref, the two descriptions can fight: SeeDance may try to satisfy both by inventing a second figure. Same rule applies to the location and `@Image2`. Let the reference images carry the visual identity.

Each act's prompt has three layers, top to bottom:
- Character + location identity (always the same opening line).
- Character voice: the act prompt repeats `script.character_voice_profile` verbatim. This carries the actor's personality through every shot.
- Per-shot blocks: each shot gets its composition (or `reveal_beat` if defined), then a per-line "Acting beats" list built from the shot's `beats[]`, then the dialogue inside `<<<voice_1>>>`.

⚠️ The opener line, `WARDROBE LOCK:`, `Background context:`, `<<<voice_1>>>`, `Transition:` / `Hard cut:`, and the closing `Native lip-synced dialogue audio, no music overlay.` line are all load-bearing; each is documented in `## Load-bearing phrases` near the bottom with the specific failure mode it prevents. Paraphrasing any of them silently breaks the recipe (literal backdrop, sung lyrics, jump zooms, wardrobe drift, etc.). Leave them verbatim; the connective prose around them is yours to compose.

Template:
The character (matching @Image1) in a setting whose visual style, palette, lighting and materials match @Image2.
CHARACTER VOICE: {script.character_voice_profile}
WARDROBE LOCK (verbatim across all 4 acts): {script.wardrobe_lock}
Shot {first}: {if shot.reveal_beat exists: shot.reveal_beat ELSE: composition + camera from Shot table}. Background context: {distinct physical position in the space — which wall / window / feature is behind the character}.
Acting beats:
• "{beat[0].text}" — {beat[0].emotion}; {beat[0].physical}.
• (silence) — {silence beat physical}.
• "{beat[N].text}" — {beat[N].emotion}; {beat[N].physical}.
<<<voice_1>>>{joined beat texts excluding (beat) markers, with periods and commas as written}<<<voice_1>>>
[If 2+ shots in this act:] Transition: {shot[1].transition_from_prev}.
Shot {second}: {composition or reveal_beat}. Background context: {a DIFFERENT physical position from Shot {first} — different wall / different angle / different background feature}.
Acting beats: ...
<<<voice_1>>>...<<<voice_1>>>
[If 3 shots:] Transition: {shot[2].transition_from_prev}.
Shot {third}: ... Background context: {a THIRD distinct physical position — must be visually different from Shots {first} and {second}}.
Native lip-synced dialogue audio, no music overlay.

Notes on the template:
- Open with "The character (matching @Image1) in a setting whose visual style ... match @Image2", never "inside the location matching @Image2". SeeDance reads the LITERAL backdrop from @Image2 if you say "inside the location"; see Known infra quirks.
- Don't describe the character in prose; SeeDance reads visual identity from `@Image1`. EXCEPTION: lock wardrobe explicitly via the `WARDROBE LOCK` line (e.g. "wearing the same charcoal hoodie over a dark band tee throughout"). @Image1 alone doesn't lock wardrobe across separate 15s generations.
- Don't add aesthetic adjectives that contradict the references. If `@Image1` is a 3D Pixar-style character, don't add "photorealistic" anywhere in the prompt.
- The CHARACTER VOICE + WARDROBE LOCK lines appear once per prompt, before the shots. They prime SeeDance for the actor's whole vibe + clothing continuity.
- Each shot block includes a `Background context:` line describing a distinct physical position in the space: different wall, different window, different background feature than the other shots in this act and ideally distinct from the other acts. This forces SeeDance to render scene variety while keeping the brand aesthetic anchored to @Image2.
- `(beat)` markers stay OUT of the `<<<voice_1>>>` payload. They're acting direction only; the natural pause between sentences in the dialogue text is where they happen.
- Every shot beyond the first in an act gets a `Transition: …` line built from `shot.transition_from_prev`. This narrates the camera move between framings, forcing SeeDance to render real spatial motion (different parts of the room behind the character) instead of an unmotivated re-frame that reads as a jump zoom.
Worked example — physical_apparel, Act 2 (with voice + beats)

Act 2 has shots `[C, A]`. Shot C has `asset_index: 0` and a reveal_beat about the search-history t-shirt. Shot A has no asset_index. Asset 0 = `@Image3`. Character voice profile is the streetwear-founder example from step [3a].

The character (matching @Image1) in a setting whose visual style, palette, lighting and materials match @Image2.
CHARACTER VOICE: Sharp dry wit. Talks fast in clipped sentences with sudden pauses. Default slight smirk with one raised eyebrow. Eye-rolls on the pain points. Mischievous grin breaks through on punchlines but disappears immediately. Hands stay mostly still — the FACE does the work.
WARDROBE LOCK: wearing the same charcoal-washed brand graphic tee under an unbuttoned indigo-denim chore jacket throughout all 4 acts.
Shot C: The character lifts a charcoal-washed graphic tee toward camera, holding it flat at chest height. The shirt design exactly matches @Image3 — 'I SAW YOUR SEARCH HISTORY' with shocked cat illustration in bold white type. Match the print exactly, do not invent. Camera: slow lateral arc pan around toward eyeline. Background context: standing near the apparel rack against the brick wall on the right side of the loft, soft midday window light from camera-left.
Acting beats:
• "You know that thing you almost typed?" — knowing low-key accusation; raised eyebrow on 'almost typed', slight head tilt.
• "That weird search?" — held look, eyebrow stays up, hint of smirk arrives.
• (silence) — beat lands, smirk grows, eyes flick to lens.
• "Your cat saw it." — mischievous deadpan; lifts shirt slightly higher, single confident nod.
<<<voice_1>>>You know that thing you almost typed? That weird search? Your cat saw it.<<<voice_1>>>
Transition: Camera pulls back from the held shirt and arcs slightly left, the shirt lowers out of frame as we end on a medium shot with the apparel rack and brick wall now visible behind her.
Shot A: Medium shot, waist up, the character centered, looking at camera. Camera: slow subtle push-in dolly.
Acting beats:
• "We turned it into a t-shirt." — matter-of-fact reveal; eyebrows up, slight smile arrives.
• " — graphic tees with attitude." — the brand-line punctuation; smirk lands on 'attitude', eyes hold.
<<<voice_1>>>We turned it into a t-shirt — graphic tees with attitude.<<<voice_1>>>
Native lip-synced dialogue audio, no music overlay.

Physical product consistency across shots
For `physical_apparel`, if the character is wearing the brand's product across multiple acts (e.g. acts where no specific product reveal happens but the character is still in a brand t-shirt), pick one hero shirt asset and pass it to those acts too with prompt language like "the character is wearing the t-shirt from @Image3 — print matches exactly". Otherwise SeeDance invents a generic-looking shirt, defeating the asset coverage goal.

Fire all 4 acts in parallel — 3 sub-shots per act
Fire all 4 acts in parallel in a single tool batch. Every act prompt should contain 3 time-coded sub-shots; single-shot 15s clips render as static, frozen-looking founder regardless of how detailed the prompt is. This was empirically validated 2026-05-08: a ScrapeGraphAI run shipped 4 single-shot acts and the user described it as "very static, no body language, camera work boring" — the fix was decomposing each act into 3 time-coded sub-shots within the prompt.
Default decomposition: 3 sub-shots per act, sized by dialogue density (e.g. `(0-4s)`, `(4-9s)`, `(9-15s)`).

Each sub-shot needs:
- A different camera framing: never two consecutive sub-shots with the same shot type. Mix medium / close-up / wide / lower-angle. The user reads variety as production value.
- A different camera motion: push-in, pull-back, lateral arc, orbit, handheld, static-held, rack-focus. Not all "slow push-in".
- A different physical position in the space: see Location-reference rule in "Known infra quirks". Each shot must describe a different wall / window / feature visible behind the character so SeeDance moves through the space instead of reproducing one literal backdrop.
- An ENERGETIC physical action per beat: see [3c.1] above. Hand gestures, leans, head turns, shoulder shifts. No facial-only beats.
- The dialogue sub-portion for that window, wrapped in `<<<voice_1>>>...<<<voice_1>>>` tokens inside the shot's block.
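One way to size the three windows by dialogue density is to split the 15 seconds in proportion to each sub-shot's word count. This is only a sketch under that assumption: the source gives example windows but no sizing formula, and `sub_shot_windows` is a hypothetical helper.

```python
def sub_shot_windows(word_counts, act_seconds=15):
    """Split one act into consecutive (start, end) windows.

    Assumption: window length is proportional to the spoken word count
    of each sub-shot; boundaries are rounded to whole seconds.
    """
    total = sum(word_counts)
    if total <= 0:
        raise ValueError("need at least one spoken word")

    windows, start, running = [], 0, 0
    for i, count in enumerate(word_counts):
        running += count
        is_last = i == len(word_counts) - 1
        # The last window always ends exactly at the act boundary.
        end = act_seconds if is_last else round(act_seconds * running / total)
        windows.append((start, end))
        start = end
    return windows


labels = [f"({s}-{e}s)" for s, e in sub_shot_windows([8, 10, 12])]
```

Word counts of 8, 10 and 12 reproduce the example windows `(0-4s)`, `(4-9s)`, `(9-15s)`. A sub-shot with very few words would get a very short window, so a real implementation would likely want a minimum length.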
SeeDance firing pattern:
generate_reference_video(
provider: "seedance",
resolution: "1080p",
aspect_ratio: "16:9",
duration: 15,
seed: 101, # unique per act (101, 202, 303, 404) — busts the idempotency cache
sound: true, # native lip-sync from each <<<voice_1>>> block
reference_images: [character_url, location_url, ...assets_for_this_act],
prompt: <full prompt — see template in section above, 3 time-coded sub-shots inline>
)
... × 4 acts, all in the same message ...
Notes:
- Each act runs ~3-8 min on SeeDance. If generation completes asynchronously, follow the MCP tool's returned status handle until the act reaches a terminal state.
- The full prompt (opening line + CHARACTER VOICE + 3 sub-shots with Acting beats + `<<<voice_1>>>` per shot + Transition lines) goes in the single `prompt` parameter. SeeDance has no `shots:[]` array — the multi-shot structure is encoded in prose.
- Use unique seeds (101, 202, 303, 404) so identical-looking calls don't hash to the same cached task ID. The `seed` parameter is seedance-only.
Save the 4 returned URLs in submission order as `act_urls = [act1, act2, act3, act4]`.
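A small sketch of assembling the four per-act payloads before firing them in one batch. It only builds the argument dicts from the firing pattern above and does not call any MCP tool; `build_act_requests` and `ACT_SEEDS` are hypothetical names.

```python
ACT_SEEDS = [101, 202, 303, 404]  # one unique seed per act, as in the notes above


def build_act_requests(prompts, reference_images_per_act, aspect_ratio="16:9"):
    """Return four generate_reference_video argument dicts in act order."""
    if len(prompts) != 4 or len(reference_images_per_act) != 4:
        raise ValueError("expected exactly 4 acts")
    return [
        {
            "provider": "seedance",
            "resolution": "1080p",
            "aspect_ratio": aspect_ratio,
            "duration": 15,
            "seed": seed,
            "sound": True,
            "reference_images": refs,
            "prompt": prompt,
        }
        for seed, prompt, refs in zip(ACT_SEEDS, prompts, reference_images_per_act)
    ]
```

Keeping the payloads in a list makes it easy to store the returned URLs in the same submission order.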
Duration floor and partial-act recovery
All 4 act_urls are required before step [6]. Do not concat a partial act list. Three completed acts plus the end card produce a ~50s asset, which misses the 55s duration floor and must not be reported as a successful founder video.

If one SeeDance act times out, stalls past the run's wall budget, or reaches a failure terminal while other acts completed:
- Retry the missing act once with the same prompt, `reference_images`, `duration`, `sound`, `resolution`, and `aspect_ratio`, but a new seed (`original_seed + 1000`). Do not rerun successful acts.
- If the retry completes, insert that URL into the original act slot and continue with `act_urls = [act1, act2, act3, act4]`.
- If the retry cannot complete, stop and surface the upstream SeeDance timeout. You may return completed act URLs as a diagnostic preview, but do not deliver a partial concat as `final_url`, do not call it production-ready, and do not proceed to step [6].
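The recovery rule can be sketched as follows. This is an illustration only: `recover_missing_act` is a hypothetical helper, and `generate` stands in for whatever wrapper submits one SeeDance request and returns a URL, or `None` on timeout or failure.

```python
def recover_missing_act(act_urls, requests, generate):
    """Retry each missing act once with seed + 1000; never rerun successes.

    act_urls: list of 4 URLs, with None in the slot of a failed act.
    requests: the original per-act request dicts, in the same order.
    generate: hypothetical callable(request) -> URL or None.
    """
    recovered = list(act_urls)
    for slot, url in enumerate(recovered):
        if url is not None:
            continue  # successful acts are left untouched
        # Same prompt and parameters, only the seed changes.
        retry_request = dict(requests[slot], seed=requests[slot]["seed"] + 1000)
        recovered[slot] = generate(retry_request)  # single retry only

    if any(url is None for url in recovered):
        # Stop here: a partial list must not reach the concat step.
        raise RuntimeError("SeeDance act still missing after one retry")
    return recovered
```

Raising instead of returning a partial list keeps the caller from proceeding to step [6] by accident.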
[6] Stitch acts into 60s base
edit_concat(video_urls=act_urls)

Save as `base_url` (60 seconds, 16:9, native dialogue audio).

[7] Generate background music — INSTRUMENTAL, target 60–80s
The pattern is fixed. The sound is a per-brand creative decision. MiniMax wants a structured `lyrics` block with 5 `[section]` tags and parenthetical `(instrumental — …)` cues. That's the load-bearing recipe. WHAT genre / instrumentation / mood goes inside those cues is your job: pick something that fits the brand's tone, the founder's voice, and the product. Do NOT copy the piano example below verbatim; that's one possible sound, not a template.

Fixed pattern (don't change):
- `provider: "minimax-music"`
- `lyrics` field: exactly 5 sections (`[intro]` / `[verse]` / `[bridge]` / `[chorus]` / `[outro]`), separated by `\n\n`
- Each section contains a parenthetical `(instrumental — …)`
- No real English words inside any section (MiniMax sings them)
- No `length` / `duration` extras (the wrapper drops them; section count is the actual length lever)
Creative decision (per video — pick from `brief.tone` + script vibe + product context):

Examples of what the music register might be for different brands. Don't use these literally; match the vibe of YOUR brand:
| Brand register | Genre direction | Sound palette |
|---|---|---|
| Technical / dev-tool / B2B SaaS (e.g. ScrapeGraphAI) | corporate cinematic underscore, Apple-keynote calm | warm felt piano, soft synth pad, sparse low-end pulse, 80–90 BPM |
| Playful / streetwear / consumer (e.g. Cat Stole My T-Shirt) | lo-fi hip-hop, chill beat | dusty drums, jazzy chord stabs, vinyl crackle, 70–85 BPM |
| Disruptive / edgy / fintech / crypto | minimal electronic, dark synthwave | analog synth bass, side-chained pad, half-time kick, 90–100 BPM |
| Fashion / luxury / lifestyle | minimal house, modern fashion-film score | filtered house pad, soft 4-on-floor kick, French-touch chord, 100–110 BPM |
| Fitness / energy / sports | driving electronic pulse, workout register | pulsing synth bass, building arp, hi-hat eighth-notes, 110–125 BPM |
| Food / hospitality / café | acoustic warmth, indie folk | fingerpicked acoustic guitar, brushed snare, light upright bass, 80–95 BPM |
| Cinematic / brand-story / docu | orchestral score, hopeful crescendo | strings layer, soft piano lead, swelling brass, tempo build |
| Gaming / dev-tools / creator-tools | chiptune-modern hybrid, retro-pixel | square-wave lead, modern synth pad, snappy snare, 105–120 BPM |
When in doubt: read `brief.tone` (technical / casual / playful / professional / disruptive) and pick a register that wouldn't feel weird next to the founder's voice profile + the script's emotional arc. Match the energy, don't fight it.

Canonical call (substitute YOUR creative direction into prompt + parentheticals):
generate_music(
provider: "minimax-music",
prompt: "<one-line overall description: register + headline instrument + mood + BPM>",
lyrics: "[intro]\n(instrumental — <opening texture: 2-3 instruments, sparse>)\n\n[verse]\n(instrumental — <add a layer: percussion or bass or counter-melody>)\n\n[bridge]\n(instrumental — <peak tension: layer swell or chord shift>)\n\n[chorus]\n(instrumental — <full arrangement, the emotional payoff>)\n\n[outro]\n(instrumental — <gentle resolve, instruments fall away>)"
)

How it works (load-bearing):
- `[section]` tags scale duration: 5 sections → ~60–80s output. Use 5 for a 60s body + 5s end card.
- The `(instrumental — …)` parenthetical inside each section is read as a production directive ("no vocals, play this arrangement"). Without it, MiniMax may sing.
- The `prompt` field carries the OVERALL style snapshot. Section parentheticals do the per-beat work.

Save as `music_url`. Read `result.duration_seconds`:
- If `>= 50s` → mix it. Expected path with the 5-section structure.
- If `< 50s` → re-roll with the same call. After 2 attempts, accept whatever returned; `mcp__pika__edit_audio_mix` plays the music for its duration then leaves silence, and dialogue carries the rest.
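The accept-or-re-roll rule can be sketched like this. `choose_music` is a hypothetical helper, and `generate` stands in for a wrapper around the music call; the `url` key on the result is an assumption, while `duration_seconds` comes from the text above.

```python
def choose_music(generate, max_attempts=2, min_seconds=50):
    """Call generate() up to max_attempts times; keep the first long-enough track.

    generate: hypothetical callable() -> dict with "url" and "duration_seconds".
    After the final attempt the last result is accepted even if it is short.
    """
    result = None
    for _ in range(max_attempts):
        result = generate()
        if result["duration_seconds"] >= min_seconds:
            break
    return result["url"], result["duration_seconds"]
```

A short track is still returned after the second attempt, which matches the rule that the mix then simply leaves silence after the music ends.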
Banned anti-patterns (each empirically caused a failure):
- ❌ Bare `lyrics: "[instrumental]"`: MiniMax sings the literal word "instrumental" for ~10s.
- ❌ Omitting `lyrics` entirely: output drops to ~17–30s.
- ❌ Putting the duration in the `prompt` text ("90 second instrumental"): has no effect; section count drives length.
- ❌ Real lyrics with English words inside any section: MiniMax sings them.
- ❌ Copying the same piano-and-pads example for every brand: the recipe is the pattern, not the sound. Pick a register per brand.
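To avoid hand-assembling the `lyrics` string, the fixed pattern can be generated from five creative cues. This is a minimal sketch; `build_lyrics` is a hypothetical helper, and the cue text remains the per-brand creative decision.

```python
SECTIONS = ["intro", "verse", "bridge", "chorus", "outro"]


def build_lyrics(cues):
    """Build the 5-section lyrics string, one (instrumental — ...) cue per section."""
    if len(cues) != len(SECTIONS):
        raise ValueError("exactly 5 cues are required, one per section")
    blocks = [
        f"[{name}]\n(instrumental — {cue})" for name, cue in zip(SECTIONS, cues)
    ]
    # Sections are separated by a blank line, as the fixed pattern requires.
    return "\n\n".join(blocks)
```

Rejecting anything other than five cues guards against the short-output failure that comes from dropping sections.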
[8] Composition layer — lower-third + subtitles
Pipeline ordering note — music mix happens in step [10], AFTER end-card concat. Mixing music into the body before the end card is concatenated leaves the end card silent (the music track ends at the cut). Always: overlays on body → end card → concat → THEN mix music over the full assembled clip.
Use MCP tools first. `mcp__pika__add_captions` handles subtitle timing and burn-in server-side; `mcp__pika__render_html_animation` handles authored HTML motion. The only remaining local fallback is arbitrary transparent lower-third overlay, because the current MCP surface has no general alpha-overlay/compose tool and `mcp__pika__edit_pip` is not sized for a full-width 800×220 lower-third.

Default paths
| Requested layer | Default action |
|---|---|
| No lower-third, no subtitles | |
| Subtitles only | call |
| Lower-third only | render lower-third |
| Lower-third + subtitles | render and overlay the lower-third first, upload that body checkpoint, then call |
If `state.lower_third` is false or unset, skip [8a] and [8b]. This keeps the default path fully MCP-native.

Do not call local Whisper/caption scripts or chained `mcp__pika__edit_text_overlay` for captions. If exact original-script spelling matters, pass manual `subtitles[]` only when you already have exact timed segments from a trusted source; otherwise prefer the `mcp__pika__add_captions` auto waterfall.

[8a] Render the lower-third (only if `state.lower_third = true`)
Skip this sub-step unless `state.lower_third = true`. Render via `mcp__pika__render_html_animation` with `format: "mov"` (ProRes 4444 with yuva420p, which preserves alpha). Do NOT use `format: "webm"`: HyperFrames currently emits webm as VP9 `pix_fmt=yuv420p` with no alpha channel, so "transparent" areas come out as literal black pixels and the composited LT shows a black box outside the pill. Discovered 2026-05-18 on the ScrapeGraphAI v6 run; ffprobe on the returned webm confirmed `pix_fmt=yuv420p` (no alpha). The ProRes `.mov` path is the only reliable alpha path right now.
- Native dimensions: 800×220 (matches the placement size on a 1280×720 frame, so no scaling artifacts)
- Pill: `state.brand.colors.primary` bg (default `#0d0d0d`), `state.brand.colors.highlight` border (default `#fefbcf`), `state.brand.colors.accent` drop shadow (default `#cfc3ff`), 18px border-radius
- Two-line text: `founder_name` (Space Grotesk 800, 80px, white) + `founder_role` (Space Grotesk 500, 28px, butter)
- NO logo inside the pill: brand mark lives in the end card; lower-third is about the person
- CSS `@keyframes` only: slide in 0–0.6s from `translateX(-900px)`, subtle box-shadow pulse around 3s, slide out 4–5s. Do not use GSAP for lower-third animation; the same per-frame seek concerns as the end card apply.
- Save URL as `lower_third_url` (file extension `.mov`)

If you need to verify the `.mov`, inspect the video stream pixel format and confirm it contains alpha (`yuva...`). If it is plain `yuv...`, the alpha was dropped; switch render format or re-render.
[8b] Lower-third overlay fallback
Use this only when the lower-third is enabled. It is the current MCP gap, not the default caption path: MCP does not yet expose a general arbitrary alpha-overlay / video-compose primitive.
Fallback contract:
- Download `base_url` and `lower_third_url` only for this local composition step.
- Overlay the 800×220 lower-third at `x=50`, `y=video_height - 220 - 100`, enabled for `t=0..5s`.
- Preserve the original body audio without re-encoding so lip-sync stays exact.
- Use visually lossless H.264 settings for the local checkpoint.
- Upload the checkpoint with `mcp__pika__upload_asset` and save the returned `public_url` as `body_with_lower_third_url`.

If subtitles are requested too, call `mcp__pika__add_captions(video_url: body_with_lower_third_url, ...)` and save its returned `url` as `body_with_overlays_url`. If not, `body_with_overlays_url = body_with_lower_third_url`.
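The overlay step of the fallback contract could be run with a local ffmpeg command along these lines. This is a sketch, not the skill's actual script: `lower_third_overlay_cmd` is a hypothetical helper, the file paths are placeholders, and CRF 17 with the `slow` preset is an assumed reading of "visually lossless H.264".

```python
def lower_third_overlay_cmd(body_path, lower_third_path, out_path,
                            lt_height=220, margin=100, x=50, seconds=5):
    """Build an ffmpeg argument list that overlays the lower-third on the body."""
    overlay = (
        # main_h is the body video height inside ffmpeg's overlay filter.
        f"[0:v][1:v]overlay=x={x}:y=main_h-{lt_height}-{margin}"
        f":enable='between(t,0,{seconds})'[v]"
    )
    return [
        "ffmpeg", "-y",
        "-i", body_path,
        "-i", lower_third_path,
        "-filter_complex", overlay,
        "-map", "[v]",
        "-map", "0:a?",       # keep the body's audio track if present
        "-c:v", "libx264", "-crf", "17", "-preset", "slow",  # assumed quality settings
        "-pix_fmt", "yuv420p",
        "-c:a", "copy",       # stream-copy audio so lip-sync timing is untouched
        out_path,
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)`; building it as a list avoids shell quoting issues around the filter expression.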
[8c] Captions via MCP
Default call:
add_captions(
video_url: <base_url or body_with_lower_third_url>,
caption_mode: "auto",
style: "classic",
position: "bottom",
font: "inter",
font_color: "#ffffff",
highlight_color: state.brand.colors.accent or "#cfc3ff",
outline_color: state.brand.colors.primary or "#111111",
font_size: 42
)

Save returned `url` as `body_with_overlays_url`. The returned `transcript` is useful for QA, but the video URL is the pipeline artifact.

[9] Animated end card (5s) — author inline HTML, render via HyperFrames
We do NOT use `mcp__pika__generate_slide_animation` here. That tool delegates HTML authoring to a slide-card LLM, which routinely adds corner clutter (top-left wordmarks, bottom-right URLs), picks wrong aspects, and produces animations that don't reliably play through HyperFrames' per-frame seek. Instead, the orchestrator authors the end-card HTML directly and renders it via `mcp__pika__render_html_animation`. Same engine the lower-third uses.

Hard rules — empirically verified, do not deviate
These are NOT stylistic preferences. Each was discovered by rendering, extracting frames, comparing to expectations, and observing a specific failure. Reverting any of them will reproduce a known bug.
-
Aspect ratio matches the body video. Read(default
aspect_ratio). Compute16:9×data-widthfor thedata-height:#stage,16:9 → 1920×1080,9:16 → 1080×1920. Hardcoding the wrong orientation produces a side-bar concat where the body and end card play side-by-side instead of in sequence.1:1 → 1080×1080 -
No corner clutter. No top-left wordmark. No bottom URL. No icon squircle next to the title. The end card is a single centered message + CTA. The brand mark is implied by the typography and palette; an explicit logo competes with the title and reads as cluttered. (If the user explicitly asks for a logo, place it integrated into the centered stack — never in a corner.) (User feedback v1→v3.)
-
Use CSSfor the entrance animation, not GSAP
`tl.from()`. HyperFrames Chrome seeks per-frame via BeginFrame; GSAP's `tl.from()` records its initial state lazily on first play and never fires under seek-only playback, so every frame renders the static FINAL state with no entrance motion. CSS `@keyframes` are tied to Chrome's compositor clock and animate deterministically per frame. Declare duration through the `#stage`/`#card` `data-duration` attributes; do not add a GSAP script just to establish timing. (Verified v1, v3, v4: frames identical at t=0 and t=2s; fixed v5 by switching to CSS `@keyframes`.)
- All entrance animations must complete by `t = duration - 0.5s`. Half-second hold so the final state sits readable before the cut. With `duration: 5s` that's all `animation-delay + animation-duration <= 4.5s`.
- Use absolute positioning for animated elements, not flex. Flex layout doesn't fully settle by frame 0 in HyperFrames Chrome, which compounds the GSAP-from() bug above: elements pop in mid-animation as flex finishes its second pass. Pure absolute positioning gives stable, deterministic layout from frame 0. (Discovered while debugging v3/v4.)
- Give each `@font-face` a unique `font-family` name; don't rely on weight matching. Declaring two `@font-face { font-family: "telka"; ... font-weight: 700/500; }` blocks is supposed to let CSS `font-weight: 500` pick the 500 face, but in HyperFrames Chrome the matching is unreliable for some weights. Use distinct families: `"telka-ext-900"`, `"telka-700"`, `"telka-500"`. (Verified v3: Telka 500 face never loaded despite valid woff2; tagline rendered serif. Fixed v6 by switching tagline to `telka-700`.)
- `telkaextended-900-normal.woff2` and `telka-700-normal.woff2` are KNOWN-WORKING in HyperFrames Chrome. `telka-500-normal.woff2` is KNOWN-BROKEN: it parses successfully via fontTools (correct OS/2.usWeightClass=500, valid cmap, correct family name) but Chrome silently rejects the @font-face declaration and falls back to system serif. Workaround: use `telka-700` for the tagline too (same family, slightly heavier, visually still on-brand). If a future end card needs medium weight, test the candidate woff2 by rendering+extracting frame 30 BEFORE shipping. Don't trust that "Telka 400" or "Telka 300" will work just because Telka 700 does.
- Inline brand-kit fonts as base64. Pika CDN doesn't accept font uploads (mime allowlist) and doesn't send CORS headers, so `@font-face` URL references fail in HyperFrames Chrome and the font silently falls back to system serif. Subset each woff2 with `pyftsubset` to just the glyphs in title+tagline+CTA, base64-encode, embed as `data:font/woff2;base64,...`. Subsetted files are typically 5–10 KB each.
- Don't write CSS `font-family` fallback chains for brand designs. If the brand font fails to load, a fallback chain hides the failure: you ship Helvetica thinking it's Telka. Use `font-family: "telka-700"` alone (no fallback). Then a font load failure renders Chrome's default serif, which is visually obvious and triggers a fix.
- Composition contract (the HyperFrames contract): `<div id="stage" data-composition-id="main" data-start="0" data-duration="5" data-width="W" data-height="H">` wraps a SINGLE direct child `<div id="card" class="clip" data-start="0" data-duration="5" data-track-index="0">` which contains everything else. Visible timed elements must include `class="clip"` because HyperFrames uses it for visibility control, and the clip must be nested inside the composition root, not a sibling. Multi-tracked direct children of `#stage` interact poorly with frame seeking. (Fixed by mirroring the working lower-third structure.)
- Runtime readiness hook: include a small compatibility hook before `</body>`: `window.__hf = { duration: 5, seek: (t) => { document.documentElement.style.setProperty("--hf-time", String(t)); } };`. CSS `@keyframes` still drive the visual animation, but the hook makes the prod frame-capture path ready when it probes for `window.__hf`. If the worker reports `window.__hf not ready after 45000ms`, treat the HTML as invalid for `render_html_animation`; fix the composition contract or hook and rerender. Do not fall back to a static PNG.
- Always extract frames at t=0, t=1s, t=2s after rendering and visually compare. If frames 0 and 2 look identical, the entrance animation isn't running. If the tagline looks like a serif, the brand font didn't load. Don't trust the URL alone. Don't ship without this check. (User caught these failures three renders in a row before frame extraction was added.)
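The t=0 versus t=2s comparison in the last rule can be partly automated. The following is a minimal sketch, assuming the frames have already been extracted and decoded into equal-length raw pixel buffers (for example 8-bit grayscale); `mean_abs_delta`, `frames_differ`, and the `min_mean_delta` threshold are illustrative names and values, not part of the pipeline.

```python
def mean_abs_delta(frame_a: bytes, frame_b: bytes) -> float:
    """Mean absolute per-byte difference between two equally sized raw frames."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have identical dimensions")
    if not frame_a:
        raise ValueError("frames must not be empty")
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def frames_differ(frame_a: bytes, frame_b: bytes, min_mean_delta: float = 1.0) -> bool:
    """True when two frames differ enough to suggest the entrance animation ran."""
    # The threshold is an assumed value; tune it against known-good renders.
    return mean_abs_delta(frame_a, frame_b) >= min_mean_delta
```

A check like this only catches the "identical frames" failure. The serif-fallback failure still needs a visual look at the tagline.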
Build steps
```python
# 1. Decide canvas
W, H = {"16:9": (1920, 1080), "9:16": (1080, 1920), "1:1": (1080, 1080)}[aspect_ratio]

# 2. Pick palette + fonts from state.brand (with fallbacks)
bg = state.brand.colors.surface if state.brand else "#ffffff"
ink = state.brand.colors.primary if state.brand else "#0d0d0d"
accent = state.brand.colors.accent if state.brand else (accent_color or "#0d0d0d")
# Guard the font lookups so a missing brand kit (or a kit without fonts)
# yields None instead of failing on attribute access.
fonts = state.brand.fonts if state.brand else None
display_font_path = fonts.display_path if fonts else None  # e.g. brand-kit/.../telkaextended-900-normal.woff2
body_font_path = fonts.body_path if fonts else None        # used for tagline + CTA
# If state.brand has no fonts, use the fallback branch below and flag this in the deliver step.

# 3. Subset fonts to only the chars used in title/tagline/CTA
glyphs = set(brief.product_name + brief.tagline + brief.call_to_action + " .,'-/")
if display_font_path and body_font_path:
    # Use a standard fontTools pyftsubset command or equivalent local font-subset utility.
    # Then base64-encode the subsetted woff2 bytes and inline them in @font-face data: URLs.
    display_b64 = "<base64 subsetted display woff2 bytes>"
    body_b64 = "<base64 subsetted body woff2 bytes>"
else:
    display_b64 = body_b64 = None

# 4. Author the HTML inline. Do not load a preset/template file.
#    Include #stage/#card contract, CSS @keyframes, absolute positioning,
#    @font-face data URLs when display_b64/body_b64 exist, title lines,
#    tagline, CTA, palette values, and dimensions W/H.
filled = "<inline end-card HTML authored per the rules above>"

# 5. Render
end_card_url = render_html_animation(html=filled, fps=30, quality="standard", format="mp4")
```
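Step 4 of the build leaves the HTML authoring to the orchestrator. As a rough illustration of what a filled composition might look like, here is a sketch that assembles the `#stage`/`#card` contract, absolutely positioned elements, CSS `@keyframes` entrances, optional base64 `@font-face` rules, and the `window.__hf` hook into one string. The function name, font sizes, vertical positions, easing, and the single-line title are assumptions made for brevity; the family names are the example Telka names from the rules, and the delays and durations follow the timing reference for title, tagline, and CTA.

```python
import html
from string import Template
from typing import Optional

_END_CARD = Template("""<!doctype html>
<html>
<head>
<style>
$font_faces
html, body { margin: 0; background: $bg; }
#stage { position: relative; width: ${width}px; height: ${height}px; overflow: hidden; }
#card { position: absolute; inset: 0; }
.line { position: absolute; left: 0; width: 100%; text-align: center; color: $ink; opacity: 0; }
.title { top: 30%; font-family: $title_family; font-size: 120px; animation: rise 0.75s ease-out 0.25s forwards; }
.tagline { top: 58%; font-family: $body_family; font-size: 48px; animation: rise 0.55s ease-out 1.2s forwards; }
.cta { top: 72%; font-family: $body_family; font-size: 40px; color: $accent; animation: rise 0.6s ease-out 1.55s forwards; }
@keyframes rise { from { opacity: 0; transform: translateY(40px); } to { opacity: 1; transform: none; } }
</style>
</head>
<body>
<div id="stage" data-composition-id="main" data-start="0" data-duration="$duration" data-width="$width" data-height="$height">
  <div id="card" class="clip" data-start="0" data-duration="$duration" data-track-index="0">
    <div class="line title">$title</div>
    <div class="line tagline">$tagline</div>
    <div class="line cta">$cta</div>
  </div>
</div>
<script>
window.__hf = { duration: $duration, seek: (t) => { document.documentElement.style.setProperty("--hf-time", String(t)); } };
</script>
</body>
</html>
""")


def _font_face(family: str, b64: str) -> str:
    """One @font-face rule with a unique family name and an inlined woff2 payload."""
    return ('@font-face { font-family: "%s"; '
            'src: url(data:font/woff2;base64,%s) format("woff2"); }' % (family, b64))


def build_end_card_html(width: int, height: int, title: str, tagline: str, cta: str,
                        bg: str, ink: str, accent: str,
                        display_b64: Optional[str] = None,
                        body_b64: Optional[str] = None,
                        duration: float = 5) -> str:
    """Fill the end-card template; falls back to system fonts when no brand fonts exist."""
    font_faces = ""
    title_family = body_family = "-apple-system, sans-serif"
    if display_b64 and body_b64:
        # Brand fonts: single family per element, no fallback chain.
        font_faces = "\n".join([_font_face("telka-ext-900", display_b64),
                                _font_face("telka-700", body_b64)])
        title_family, body_family = '"telka-ext-900"', '"telka-700"'
    return _END_CARD.substitute(
        font_faces=font_faces, bg=bg, ink=ink, accent=accent,
        width=width, height=height, duration=f"{duration:g}",
        title_family=title_family, body_family=body_family,
        title=html.escape(title), tagline=html.escape(tagline), cta=html.escape(cta))
```

The returned string would be what gets passed as `html=` to the render call; the accent bar, divider, and second title line from the layout are omitted here.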
Layout (centered stack, no corners)

```
┌─────────────────────────────────────────────┐
│  ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬  │  ← top accent bar (animated wipe L→R)
│                                             │
│              BIG TITLE LINE 1               │  ← display font 900, slides up
│              BIG TITLE LINE 2               │  ← display font 900, slides up (staggered)
│                                             │
│                    ────                     │  ← short accent divider (scaleX in)
│                                             │
│              tagline goes here              │  ← body font 500, fades+rises
│                                             │
│             ╭─ Start with X ─╮              │  ← CTA pill, accent bg, ink border, back-out pop + breath
│             ╰────────────────╯              │
│                                             │
└─────────────────────────────────────────────┘
```

Animation timing reference (5s end card, CSS @keyframes)
All implemented as CSS `animation: name duration easing delay forwards` on the corresponding element. The element's pre-animation CSS state IS the "from"; no JS needed.

| t (s) | Element | Animation |
|---|---|---|
| 0.00–0.85 | accent-top | |
| 0.25–1.00 | title line 1 ( | |
| 0.42–1.17 | title line 2 ( | same, staggered |
| 1.00–1.50 | divider | |
| 1.20–1.75 | tagline | |
| 1.55–2.15 | CTA pill | |
| 2.60–4.20 | CTA pill | |
| 2.80–4.80 | accent-top | |
| 4.80–5.00 | hold | (final readable state) |
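
The half-second hold rule can be checked against these timings before rendering. The sketch below assumes the first six rows are the entrance animations and that the later CTA-pill and accent-top rows are ambient motion that is allowed to run past the deadline; that split is an interpretation of the table, and `late_entrances` is a hypothetical helper name.

```python
# (element, start_s, end_s) for the rows treated here as entrance animations.
ENTRANCES = [
    ("accent-top", 0.00, 0.85),
    ("title line 1", 0.25, 1.00),
    ("title line 2", 0.42, 1.17),
    ("divider", 1.00, 1.50),
    ("tagline", 1.20, 1.75),
    ("CTA pill", 1.55, 2.15),
]


def late_entrances(entrances, duration: float = 5.0, hold: float = 0.5):
    """Return the names of entrances that finish after duration - hold."""
    deadline = duration - hold
    return [name for name, _start, end in entrances if end > deadline]
```

An empty result means every entrance settles before the 4.5s deadline for a 5s card.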
Reference implementation pattern: the v6 centered-stack HTML recipe verified on 2026-05-02. Author the filled HTML inline in the orchestrator and pass it directly to `mcp__pika__render_html_animation`; do not call a bundled helper script or rely on a separate `presets/` directory.

If the brand-kit lacks fonts (`state.brand.fonts` is null), fall back to system `-apple-system, sans-serif` for tagline/CTA but DROP the title down to a system-display weight. Don't render brand-typography end cards with fallback fonts; they always look wrong. Flag this in the deliver step so the user knows the brand-kit is incomplete.

Save the returned MP4 URL as `end_card_url`. `end_card_url` must be an MP4 video segment, not a static PNG, because step [10] concatenates it with the body video. Download it only if you need local visual QA frames or a local fallback assembly.

[10] Assemble body + end card + music via MCP
Use the server-side deterministic edit tools for final assembly. The current MCP server normalizes mismatched inputs before concat (`mcp__pika__edit_concat`) and preserves the original video audio while mixing the music track (`mcp__pika__edit_audio_mix`).

```python
assembled = edit_concat(video_urls=[body_with_overlays_url, end_card_url])
assembled_url = assembled.url
if music_url:
    mixed = edit_audio_mix(video_url=assembled_url, audio_url=music_url, audio_volume=0.16)
    final_url = mixed.url
else:
    final_url = assembled_url
```

Mix music after concat, never before, so the score continues through the end card. If `mcp__pika__edit_audio_mix` fails because the music file is too short or malformed, set `final_url = assembled_url`, surface the music issue, and still run step [10.5] before delivery; do not rerun expensive SeeDance acts.
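
The concat-then-mix order and the mix-failure fallback described for this step could be wrapped as follows. This is a sketch under stated assumptions: `concat` and `mix` are stand-in callables for the two MCP tools (their real signatures and error types are not shown in this document), and the returned `issue` string is an illustrative way to surface the music problem.

```python
from typing import Callable, List, Optional, Tuple


def assemble(body_url: str, end_card_url: str, music_url: Optional[str],
             concat: Callable[[List[str]], str],
             mix: Callable[[str, str, float], str],
             music_volume: float = 0.16) -> Tuple[str, Optional[str]]:
    """Concat first, then mix; on a mix failure keep the unmixed assembly."""
    assembled_url = concat([body_url, end_card_url])
    if not music_url:
        return assembled_url, None
    try:
        return mix(assembled_url, music_url, music_volume), None
    except Exception as exc:  # assumed: the real tool may raise something more specific
        # Fall back to the concat result and report the issue; never rerun the acts.
        return assembled_url, f"music mix failed: {exc}"
```

Either way the caller still has a `final_url` to feed into the duration probe in step [10.5].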
[10.5] Final duration floor
Before reporting `final_url` to the user, probe the assembled result:

```
mcp__pika__analyze_media(
  media: final_url,
  query: "Return JSON with duration_seconds for this video."
)
```

Save the result as `final_duration_seconds`. It must be `>= 55` and `<= 75` before reporting `final_url` as the completed deliverable. If `final_duration_seconds` is under 55s, treat the run as a failed partial assembly: do not deliver the URL as final, do not mark the skill complete, and return to the missing-act recovery above. If all 4 acts were present but the probe is still under 55s, stop and surface the concat/provider truncation for investigation instead of padding with unrelated footage.
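
The gate can be expressed as a small decision function. In this sketch the returned action labels are hypothetical names, and the handling of a result above 75s (stopping rather than delivering) is an assumption, since the text only states that the value must not exceed 75.

```python
def duration_gate(final_duration_seconds: float, acts_present: int,
                  expected_acts: int = 4) -> str:
    """Map the probed duration to the next action for the run."""
    if 55 <= final_duration_seconds <= 75:
        return "deliver"
    if final_duration_seconds < 55:
        if acts_present < expected_acts:
            # Partial assembly: go back to the missing-act recovery.
            return "recover_missing_act"
        # All acts exist, so the loss happened in concat or at the provider.
        return "stop_and_investigate_truncation"
    # Assumed handling for an over-long result.
    return "stop_over_ceiling"
```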
[11] Final URL
Save `final_url` into `state`. If a local fallback assembly produced the final MP4, upload that checkpoint through `mcp__pika__upload_asset` and replace `final_url` with the returned `public_url`.

[12] Asset bundle (optional)
Return the final URL first. If the user asks for editable source assets, provide a bundle containing:
- `final_url`
- `act_urls`
- `character_url`, `location_url`, and any product/screenshot asset URLs
- `music_url`
- `end_card_url`
- optional `lower_third_url` / `body_with_overlays_url`
- the script, brand quick-reference, and act-to-asset map

Why this matters:
- The user can swap a layer (founder photo, sub timing, music) and re-render WITHOUT re-fetching everything
- The brand kit is the single source of truth for any future video for the same brand; re-using it is one folder copy away
- The act-level `.mp4`s are valuable on their own (e.g. the user might want to clip just one act for a specific channel)
- Discovered 2026-05-09: user explicitly asked: "can you also save the refs you generated next to the video so i can have those assets?". Codified as default.

[13] Deliver
Report `final_url` to the user. Include:
- Asset bundle URL/path only if the user asked for one
- Total duration from `final_duration_seconds` (~65s = 60s body + 5s end card)
- Intermediate URLs or local files useful for reruns: `base_url`, optional `body_with_overlays_url`, `music_url`, `end_card_url`, `final_url`, and each `act_urls[i]`
- The brief (`brief.product_name` / `brief.tagline`) so the user can confirm the model picked up the right product
- A 1-line summary of which asset went into which act, so the user can confirm placement

Verification gates
| Step | Check |
|---|---|
| Brief | |
| Asset analyses | One JSON object per asset, each with non-empty |
| Script | 4 acts; total dialogue 100–180 words; each asset is referenced by at least one act, OR explicitly noted as unused |
| Refs | |
| Acts | All 4 act_urls returned (each with a unique seed); for acts containing shots with non-null |
| Stitch | |
| Music | URL returned; |
| Overlays | |
| End card | |
| Final assembly | |
| Final duration | |
| Local fallback upload | Only if local fallback assembly was used: |
Failure modes
Except for the one missing-act retry documented in "Duration floor and partial-act recovery", stop and surface on first verification failure. Don't auto-retry expensive calls (SeeDance acts run 3–8 min each — repeated failures burn credits).
| Symptom | Cause | Fix |
|---|---|---|
| | Idempotency cache replaying an old failed result for identical params | Pass a unique `seed` per call. |
| SeeDance returns 422 "may contain likenesses of real people" on founder ref | Content-policy filter tripped (intermittent; the same photo may pass on the next attempt) | Re-roll the founder portrait with stronger stylization ("Pixar / Disney 3D animation aesthetic"). Don't auto-retry the same ref; it burns credits. |
| Founder shirt changes between acts | @Image1 read fresh each act, no wardrobe lock | Add a `WARDROBE LOCK:` line to every act prompt. |
| All 4 acts have the same physical backdrop | Opening line says "inside the location matching @Image2", which is read literally | Open with "in a setting whose visual style, palette, lighting and materials match @Image2" + add per-shot |
| Founder looks frozen / no body language | Beats are facial-only; act is single-shot | Add 3 time-coded sub-shots per act with explicit camera-motion |
| Final video is under 55s | One SeeDance act timed out or final concat trimmed the body, producing a partial run | Do not deliver it as final. Retry the missing act once with a new seed while preserving successful acts; if that cannot complete, stop and surface the upstream SeeDance timeout. |
| Music returns < 50s | MiniMax non-determinism, or | Confirm |
| Music sings the word "instrumental" | Bare `[instrumental]` in `lyrics` | Use the 5-section structure with parenthetical cues (see step [7]). |
| Captions misspell product names | Auto transcription normalized the spoken audio | Use manual |
| Lower-third overlay shows a black box outside the pill | webm format encoded without alpha (yuv420p) | Re-render with `format: "mov"` (ProRes 4444 yuva). |
| Final video audio shorter than video | Local fallback concat used `ffmpeg concat -c copy` | Prefer MCP `mcp__pika__edit_concat` + `mcp__pika__edit_audio_mix`. |
| End card renders identical at t=0 and t=2s (no entrance animation) | GSAP `tl.from()` | Convert entrance to CSS `@keyframes`. |
| End-card tagline renders as serif fallback | woff2 font failed to load in HyperFrames Chrome | Use a unique `font-family` name per weight. |
| `window.__hf not ready after 45000ms` | End-card HTML did not expose the runtime readiness hook or valid nested `#stage`/`#card` structure | Add/fix the `window.__hf` hook or the composition contract, then rerender. |
Load-bearing phrases
These strings go into the SeeDance prompt (or HTML render) verbatim. Each was empirically validated — paraphrasing breaks the recipe silently. When editing prompts, search for these anchors and leave them intact.
| Phrase | Goes into | Why load-bearing |
|---|---|---|
| "The character (matching @Image1) in a setting whose visual style, palette, lighting and materials match @Image2." | Opening line of every act prompt | "in a setting whose style matches" frees SeeDance to vary backdrop per shot; "inside the location matching" reproduces the literal backdrop across all 4 acts. |
| `WARDROBE LOCK:` line | Header line under | Locks clothing across separate 15s generations. @Image1 alone drifts (Act 2 founder appeared in a white shirt on v6). |
| | Inside every shot block | Forces SeeDance to render a different physical position per shot: different wall, different angle, different background feature. Without it, all 3 sub-shots end up in the same corner. |
| `<<<voice_1>>>...<<<voice_1>>>` | Per-shot dialogue payload | SeeDance native lip-sync token. The tokens are how the engine knows what to mouth-sync; without them no lip-sync at all. |
| | Between sub-shots in the same prompt | |
| | Closing line of every act prompt | Prevents SeeDance from layering its own ambient music underneath the dialogue, which would fight the MiniMax score in step [10]. |
| | MiniMax `lyrics` | 5 sections drive song length (~60–80s); parentheticals are read as production notes ("no vocals"), not lyrics. Bare `[instrumental]` gets sung as a word. |
What NOT to do
- Don't open the SeeDance prompt with "inside the location matching @Image2": it reproduces the literal backdrop across all acts. Use "in a setting whose visual style ... match @Image2".
- Don't describe the character in prompt prose: @Image1 carries identity. Prose conflicts produce phantom figures or wrong outfits. Exception: the `WARDROBE LOCK:` line.
- Don't use film-industry shot terms: "Two-shot" / "Three-shot" / "Over-shoulder" / "OTS" / "Master shot" trigger SeeDance phantom-subject artifacts. Describe what the camera sees in plain language.
- Don't render the lower-third as webm: alpha is not preserved (HyperFrames emits yuv420p). Use `format: "mov"` (ProRes 4444 yuva). Verify with `ffprobe | grep pix_fmt`.
- Don't use `mcp__pika__generate_slide_animation` for the end card: that tool's slide-card LLM adds corner clutter and produces animations that don't seek deterministically. Author inline HTML and render via `mcp__pika__render_html_animation`.
- Don't chain pika MCP `mcp__pika__edit_text_overlay` / overlay calls for pixel composition: that cascades quality loss and can introduce lip-sync drift. Use `mcp__pika__add_captions` for captions and the single local ffmpeg lower-third fallback only when the lower-third is enabled.
- Don't use local `ffmpeg concat -c copy` for final assembly unless MCP is unavailable: the old audio-drop bug was in local concat behavior. Default to MCP `mcp__pika__edit_concat` + `mcp__pika__edit_audio_mix`.
- Don't fire SeeDance with identical params across acts: the MCP idempotency cache hashes to the same task ID and replays old results (sometimes failures). Pass a unique `seed` per act.
- Don't omit the `lyrics` field on MiniMax music: output drops to ~17–30s. Don't put bare `[instrumental]` either: the model sings the word. Use the 5-section structure with parenthetical cues.
- Don't copy the example music sound for every brand: the recipe is the pattern (5 sections + parentheticals), not the specific instrumentation. Pick a register that matches `brief.tone` (see step [7] table).
- Don't put real English words inside `[chorus]` / any music section: MiniMax sings them.
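
The lower-third alpha check mentioned in the list above can be scripted. This sketch assumes `ffprobe` is installed and on `PATH`; `probe_pix_fmt` and `has_alpha` are illustrative helper names, and the prefix list is a heuristic over common FFmpeg pixel-format names rather than an exhaustive table.

```python
import subprocess


def probe_pix_fmt(path: str) -> str:
    """Return the pixel format of the first video stream, as reported by ffprobe."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=pix_fmt", "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()


def has_alpha(pix_fmt: str) -> bool:
    """Heuristic: alpha-carrying formats are named yuva*, rgba*, bgra*, argb*, abgr*, gbrap*."""
    return pix_fmt.startswith(("yuva", "rgba", "bgra", "argb", "abgr", "gbrap"))
```

A lower-third render whose format fails `has_alpha` (for example `yuv420p`) would show the black box described in the failure modes.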
Engine choice: seedance-only (with caveats)
SeeDance (`fal-seedance-2-i2v` via `mcp__pika__generate_reference_video`, `provider: "seedance"`) is the sole video engine. Picked over alternatives after testing:
- vs Kling v3-omni: Kling has a true hard-cut `shots[]` array (cleaner multi-shot) but rejects the `seed` parameter (cache-busting harder), and 4 × pro 1080p outputs sum >50MB and exceed the `mcp__pika__edit_concat` upload cap (forces local concat). Kling does have a more permissive content policy for real-person photos, so it's worth keeping in mind as a fallback if SeeDance's intermittent 422 becomes a hard block.
- vs Happy Horse (Alibaba DashScope, `happyhorse-1.0-r2v`): produced clean 1080p with native lip-sync but the multi-shot prompt direction was weaker. Validated 2026-05-18 (v5) but visibly less cinematic than SeeDance v6/v7.
- SeeDance wins because: native `<<<voice_1>>>` lip-sync, accepts `seed` (cache-busting), permissive enough on real-person photos that 95%+ runs pass content filter, single 15s prompt with time-coded sub-shots gives enough variation for a talking-head register.

If SeeDance is down or its content filter starts rejecting your founder ref repeatedly, the documented fallback is to re-roll the founder portrait with stronger stylization (Pixar / Disney 3D aesthetic). Switching engines mid-pipeline changes too many assumptions in the prompt, concat, and asset-size flow.
Runtime expectations
Wall-clock budget per step. Total run is ~12–18 minutes, dominated by the parallel SeeDance batch.
| Step | Wall clock | Notes |
|---|---|---|
| [1] analyze_brief | 20–60s | |
| [2] analyze_media × N | 15–30s per asset, parallel | |
| [4] founder/custom location refs | 10–60s each | Upload local founder photo, use supplied URL, or generate a founder portrait; default location waits for [4.5] |
| [4.5] brand-kit ingestion + default location | 10–30s | Parse brand kit, upload logo assets, render aspect-matched brand-accent backdrop if needed |
| [5] SeeDance × 4 in parallel | 5–9 min wall (slowest act) | Each act 3–8 min and may complete asynchronously |
| [6] edit_concat (acts → 60s body) | 30–60s | |
| [7] generate_music | 20–60s, retry once if <50s | |
| [8a] render LT as .mov | 60–120s | ProRes 4444 slower than webm but only path with alpha |
| [8] add_captions | 30–90s | subtitles only, or after lower-third checkpoint |
| [8b] local lower-third overlay fallback | 20–60s | only when lower-third is enabled |
| [9] render end card | 60–120s | |
| [10] edit_concat + edit_audio_mix | 30–90s | server-side normalized concat and music mix |
| Total | 12–18 minutes | |
Defaults
默认值
- 4 × 15s SeeDance acts, parallel, unique seeds (101, 202, 303, 404)
- Character identity comes from the reference — avoid describing the character in prompt prose (no "Founder Semi", no "young creative streetwear founder"). Open every prompt with "The character (matching @Image1) in a setting whose visual style, palette, lighting and materials match @Image2." Using "inside the location matching @Image2" makes SeeDance reproduce the literal backdrop across all acts. The one exception is a
@Image1line in every act prompt to keep clothing consistent across the 4 separate 15s generations.WARDROBE LOCK: - Avoid film-industry shot terminology that SeeDance reads literally — never write "Two-shot", "Three-shot", or "Master shot". Shot F is "Brand context shot".
- Every script has a (3-4 lines describing default delivery — cadence, signature gestures, pause behavior). Repeated verbatim as
character_voice_profilein every act's SeeDance prompt.CHARACTER VOICE: … - Every shot has , not act-level
`acting` (the per-shot field is `beats[]`). Each beat has `text` + `emotion` + `physical`. Silent `(beat)` entries direct what happens BETWEEN spoken sentences (held looks, micro-expressions, gesture transitions). Beats are emitted as a per-shot "Acting beats" block in the SeeDance prompt.
- Every shot beyond the first in an act has `transition_from_prev`: a continuous-camera-move description that takes us from the previous framing to this one (dolly, arc, push past, pull back, orbit). Without this, multi-shot acts read as jump zooms because SeeDance reframes the same virtual camera position instead of moving through space.
- Apply the TTS pronunciation rewrites to dialogue text before joining it into the `<<<voice_1>>>` payload (Cami → Cammy, MCP → M C P, pika.me → pika dot me, etc.). See "TTS pronunciation rewrites" in step [3].
- 16:9, 1080p (SeeDance `resolution: "1080p"`).
- User-provided assets are revealed in a way that fits the product type:
  - `digital` → on-phone in shots C and E
  - `physical_apparel` → founder wears/holds the actual t-shirts; assets passed as refs to EVERY shot where the shirt appears (not just one reveal beat)
  - `physical_object` → founder holds the product up; assets passed to all shots where the product is visible
  - `consumable` → founder uses/eats/drinks; same pattern
  - `service` → no asset reveal; environment + dialogue only
- Music: target a ~60–80s instrumental. Pass `lyrics` with 5 `[section]` tags (`intro`/`verse`/`bridge`/`chorus`/`outro`), each containing a `(instrumental — …)` parenthetical production cue. Section count drives length; the parenthetical guarantees no vocals. See step [7] for the canonical call. Retry once if under 50s, then accept what you got: `mcp__pika__edit_audio_mix` plays the music for its duration and leaves silence beyond.
- 5s end card via `mcp__pika__render_html_animation`: author inline HTML per step [9], with brand-kit fonts inlined as base64. The card sources its brand from `state.brand` (set in step [4.5]) → real logo, real palette, real fonts.
- Captions via `mcp__pika__add_captions`. Use server-side word-level caption burn-in by default. Font choices are the tool-supported set (`inter`, `bebas-neue`, `noto-cjk`); use brand accent colors for highlight/outline instead of local custom font drawtext.
- Lower-third fallback. Off by default. If `state.lower_third = true`, render a 5s branded pill bottom-left via `mcp__pika__render_html_animation(format:"mov")`; the final overlay onto the body uses one local ffmpeg pass only until MCP exposes a general alpha-overlay/compose tool.
- Final assembly via MCP. Use `mcp__pika__edit_concat` for body + end card, then `mcp__pika__edit_audio_mix` for music. Local concat/mix is a fallback, not the canonical path.
- Provider: `seedance` only. Reference tokens are `@Image1`/`@Image2`/`@Image3`. Native lip-sync via `<<<voice_1>>>...<<<voice_1>>>` tokens per sub-shot. Real-person founder photos pass the content filter the vast majority of the time; intermittent 422 → re-roll with stronger stylization.
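The pronunciation-rewrite and voice-token rules above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: the function names `apply_tts_rewrites` and `wrap_voice` are hypothetical, the table holds only the three pairs named above (step [3] has the full list), and wrapping a line in a matching pair of `<<<voice_1>>>` tokens is assumed from the lip-sync bullet.

```python
import re

# Only the three rewrites named in this section; step [3] holds the full table.
TTS_REWRITES = {
    "Cami": "Cammy",
    "MCP": "M C P",
    "pika.me": "pika dot me",
}

# Longest keys first so a longer key wins over any shorter key it contains.
# The lookarounds keep us from rewriting inside a larger word (e.g. "MCPs").
_PATTERN = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(k) for k in sorted(TTS_REWRITES, key=len, reverse=True))
    + r")(?!\w)"
)

VOICE_TOKEN = "<<<voice_1>>>"  # token taken from the text; pairing is an assumption


def apply_tts_rewrites(text: str) -> str:
    """Replace each known term with its spoken form (case-sensitive)."""
    return _PATTERN.sub(lambda m: TTS_REWRITES[m.group(0)], text)


def wrap_voice(text: str) -> str:
    """Rewrite first, then wrap the dialogue in the voice tokens."""
    return f"{VOICE_TOKEN}{apply_tts_rewrites(text)}{VOICE_TOKEN}"


print(wrap_voice("Cami runs on MCP. Try it at pika.me."))
```

Rewriting before wrapping matters here: the tokens delimit exactly what the model will speak, so the spoken forms have to be in place before the payload is joined.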
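- 4 × 15-second SeeDance acts, generated in parallel, with unique seeds (101, 202, 303, 404).
- Character identity comes from the `@Image1` reference. Avoid describing the character in prompt prose (no "Founder Semi", no "young creative streetwear founder"). Every prompt opens with **"The character (matching @Image1) in a setting whose visual style, palette, lighting and materials match @Image2."** Writing "inside the location matching @Image2" makes SeeDance reproduce the literal background in every act. The only exception is the `WARDROBE LOCK:` line in each act prompt, which keeps the outfit consistent across the 4 separate 15-second generations.
- Avoid film-industry shot jargon that SeeDance interprets literally: never write "Two-shot", "Three-shot", or "Master shot". Shot F is the "Brand context shot".
- Each script has a `character_voice_profile` (3–4 lines describing the default delivery: pacing, signature gestures, pause behavior). Repeat it verbatim in every act's SeeDance prompt as `CHARACTER VOICE: …`.
- Each shot has `beats[]`, not act-level `acting`. Each beat has `text` + `emotion` + `physical`. Silent `(beat)` entries direct the action between spoken sentences (held looks, micro-expressions, gesture transitions). Beats are emitted as a per-shot "Acting beats" block in the SeeDance prompt.
- Every shot after the first in an act has `transition_from_prev`: a description of the continuous camera move from the previous framing to the current one (dolly, arc, push past, pull back, orbit). Without it, multi-shot acts look like jump zooms, because SeeDance reframes the same virtual camera position instead of moving through space.
- Apply the TTS pronunciation rewrites to dialogue text before adding it to the `<<<voice_1>>>` payload (Cami → Cammy, MCP → M C P, pika.me → pika dot me, etc.). See "TTS pronunciation rewrites" in step [3].
- 16:9, 1080p (SeeDance `resolution: "1080p"`).
- User-provided assets are shown according to product type:
  - `digital` → on the phone in shots C and E
  - `physical_apparel` → founder wears/holds the real t-shirts; assets are passed as references to every shot where the shirt appears (not just one reveal beat)
  - `physical_object` → founder holds the product up; assets are passed to every shot where the product is visible
  - `consumable` → founder uses/eats/drinks; same pattern
  - `service` → no asset reveal; environment + dialogue only
- Music: target a ~60–80s instrumental. Pass `lyrics` containing 5 `[section]` tags (`intro`/`verse`/`bridge`/`chorus`/`outro`), each with a `(instrumental — …)` parenthetical production cue. The number of sections drives the length; the parenthetical guarantees no vocals. See step [7] for the canonical call. Retry once if the result is under 50s, then accept it: `mcp__pika__edit_audio_mix` plays the music until it ends and leaves the remainder silent.
- 5s end card generated via `mcp__pika__render_html_animation`: write inline HTML per step [9] and inline the brand-kit fonts as base64. Brand information comes from `state.brand` (set in step [4.5]) → real logo, real palette, real fonts.
- Add captions via `mcp__pika__add_captions`. Use server-side word-level caption burn-in by default. Choose fonts from the tool-supported set (`inter`, `bebas-neue`, `noto-cjk`); use brand accent colors for highlight/outline instead of local custom-font drawtext.
- Lower-third fallback. Off by default. If `state.lower_third = true`, render a 5s branded pill at bottom-left via `mcp__pika__render_html_animation(format:"mov")`; until MCP exposes a general alpha-overlay/compose tool, the final overlay onto the body video uses a single local ffmpeg pass.
- Final assembly via MCP. Use `mcp__pika__edit_concat` to join body + end card, then `mcp__pika__edit_audio_mix` to mix the music. Local concat/mix is a fallback, not the canonical path.
- Provider: `seedance` only. Reference tokens are `@Image1`/`@Image2`/`@Image3`. Native lip-sync uses `<<<voice_1>>>...<<<voice_1>>>` tokens per sub-shot. Real-person founder photos pass the content filter the vast majority of the time; on an intermittent 422 → regenerate with stronger stylization.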
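The music bullet can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the canonical call from step [7]: the helper names are hypothetical, each tag is assumed to be a bracketed section name on its own line, and the production cue is supplied by the caller because its exact wording is not given in this section.

```python
# Section names taken from the music bullet; the order is assumed to be as listed.
SECTIONS = ["intro", "verse", "bridge", "chorus", "outro"]

MIN_SECONDS = 50.0  # "retry once if under 50s"


def build_instrumental_lyrics(cue: str) -> str:
    """Build a lyrics string: one [section] tag per section, each followed
    by the same parenthetical production cue."""
    if not (cue.startswith("(") and cue.endswith(")")):
        raise ValueError("cue must be wrapped in parentheses")
    return "\n\n".join(f"[{name}]\n{cue}" for name in SECTIONS)


def should_retry(duration_s: float, attempts_made: int) -> bool:
    """Retry only after the first attempt, and only if the track is short.
    After the second attempt the result is accepted whatever its length."""
    return duration_s < MIN_SECONDS and attempts_made < 2


# Placeholder cue copied from the bullet; substitute the real cue from step [7].
lyrics = build_instrumental_lyrics("(instrumental — …)")
print(lyrics)
```

Accepting a short track after one retry is safe only because of the mix behavior described above: the music plays for its own duration and the rest of the video stays silent.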