kiss-cam

A two-call pika pipeline: spectator-POV Jumbotron still (`gpt-image-2`) → 15-second in-arena kiss cam clip with PA-announcer commentary and crowd reaction (`kling-v3-omni`, first-frame-locked to the still). The trend look is calibrated, so pass both reference images straight through. There is no textual substitution into the prompts; the subjects are anchored only through `reference_images`. Don't reach for `${subject_a}` / `${subject_b}` placeholders. Step 1 and Step 2 prompts are verbatim, not scaffolds.

Prerequisites

pika MCP available in the host. Tool name prefix varies by mount point — use whatever the host exposes. Tools needed: asset upload, image generation, reference-video generation, and async status follow-up when a generation does not complete inline.

Stage 0 — Intake

If invoked with empty args and no usable prior context, print this menu and stop:

Which two subjects should be on the Kiss Cam? Required:

Subject A reference photo — local path or HTTPS URL

Subject B reference photo — local path or HTTPS URL

If one photo is already present, ask only for the missing photo. Running before both have arrived leaves Step 1 with a missing `reference_images` entry and produces an inconsistent still.

- Subject A reference photo (required): local path or https URL. Save as `state.subject_a_url`.
- Subject B reference photo (required): local path or https URL. Save as `state.subject_b_url`.

For each: if it is already an `https://…` URL, use it as-is. If it is a local path → `mcp__pika__upload_asset` → PUT bytes to `presigned_url` → use `public_url`. On Claude Desktop, pasted inline images don't reach MCP, so ask once for a URL or a `.zip` attachment instead (this is the one allowed clarifier; once both URLs are in, the "no further yes/no gates" rule below applies).
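
The intake rule above (https URL passes through, local path gets uploaded first) can be sketched roughly as follows. This is only an illustration: `upload_asset` and `put_bytes` are hypothetical callables standing in for the host's asset-upload tool and an HTTP PUT helper, and the assumption that the upload response is a dict carrying `presigned_url` and `public_url` keys is mine, not something the recipe specifies.

```python
from typing import Callable, Dict
from urllib.parse import urlparse


def is_https_url(ref: str) -> bool:
    """True when ref parses as an https URL with a host part."""
    parsed = urlparse(ref)
    return parsed.scheme == "https" and bool(parsed.netloc)


def resolve_reference(
    ref: str,
    upload_asset: Callable[[str], Dict[str, str]],
    put_bytes: Callable[[str, str], None],
) -> str:
    """Return a public URL for ref, uploading only when it is a local path.

    upload_asset and put_bytes are hypothetical stand-ins, not real pika APIs.
    """
    if is_https_url(ref):
        return ref  # already hosted: use as-is
    ticket = upload_asset(ref)               # assumed to return both URLs
    put_bytes(ticket["presigned_url"], ref)  # PUT the local file's bytes
    return ticket["public_url"]
```

Passing the two helpers in as arguments keeps the sketch independent of whatever tool-name prefix the host exposes.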

Either subject can be in any visual style — photoreal human, 3D rendered character, designer toy, illustrated avatar, sculpted figurine, etc. The recipe preserves whatever style the reference uses; do not redraw in a different style. No names are used anywhere — the Kiss Cam graphic does not have a chyron with names. Just two subjects caught on the Jumbotron.

Confirm back in one line ("Generating a Madison Square Garden Kiss Cam moment for these two…") and start. No further yes/no gates after this point — the pipeline runs end-to-end.

Step 1 — Spectator-POV Jumbotron still (`generate_image`, gpt-image-2)

The kiss cam graphic + scoreboard + retro frame get baked into the still at frame 0. This is load-bearing: Kling then treats the entire decorative UI as pixel-locked, burned-in UI in Step 2 instead of animating it mid-clip.

Why gpt-image-2 (and no fallback): sharper LED panel detail (scoreboard numerals, kiss cam typography, retro decorative edges) and stronger reference-likeness lock than alternative providers; the LED-sharpness + likeness combo is what sells the trend. On a `moderation_blocked` response, re-roll the same call instead of swapping providers; alternatives produced softer likeness and softer LED detail in earlier trials. We call gpt-image-2 at 1K 16:9 (1792×1024); the 2K/4K variants (AGNT-336) don't help here since Kling pro outputs 1080p downstream.

Why a Jumbotron-POV phone shot (and not a TV broadcast overlay): the first iteration produced a TV broadcast cutaway with a pink-heart kiss cam graphic on the feed — user feedback was "the kiss cam graphics is ugly, look how real kiss cam moments look in real videos." Real viral kiss-cam clips online are virtually all spectator phone shots OF the Jumbotron (Obama-era USA Basketball kiss cam, Sarah Hyland / Wells Adams kiss cam, etc.). The Jumbotron-shot framing hits the aesthetic users actually associate with "real kiss cam" — retro red border + sparkly hearts + cursive Kiss Cam script + adjacent LED scoreboard panels + arena darkness + fans filming with phones.

`prompt` (verbatim, sent as-is, no template substitutions):

A high-resolution spectator-POV phone screenshot from inside packed Madison Square Garden during a Knicks vs Bulls NBA game, filmed by a fan in the lower bowl looking up at the giant suspended Jumbotron displaying the in-arena "Kiss Cam" segment. Real fan-filmed phone footage aesthetic — TikTok / Instagram online kiss-cam clip style.

Foreground (bottom 20% of frame): silhouettes of the backs/heads of a few fans in the row in front, slightly out-of-focus, dark; a couple of phones held up filming the Jumbotron, bright phone screens visible.

Mid-ground / upper 80%: the Jumbotron dominates the upper-center, bright LED panels against the dark arena ceiling and cabling above. Slight phone-camera tilt, slightly off-center natural handheld framing. Dim arena around, hint of upper-deck silhouettes.

The Jumbotron's central LED panel displays the kiss cam segment in iconic retro arena kiss cam style:
- Thick deep crimson red glowing decorative border frame.
- Bright red sparkly cartoon heart icons glowing in the corners and along edges.
- Subtle white star pattern texture in the red border.
- At the bottom center, large flowing white cursive script "Kiss Cam" with red glow and drop shadow.

Inside the kiss cam frame on the Jumbotron, two subjects framed two-shot chest-up:
- LEFT (Subject A) is the character from the FIRST reference image — preserve their exact likeness AND exact visual style (medium, level of stylization, color palette, rendering attributes). Do not redraw in a different style. Subject A is smiling shyly with one hand near the face, looking caught-off-guard and giggling.
- RIGHT (Subject B) is the character from the SECOND reference image — preserve their exact likeness AND exact visual style. Do not redraw in a different style. Subject B is seated upright at human scale next to Subject A, head tilted slightly toward Subject A.
- Around them inside the frame: packed Knicks fans in blue-and-orange jerseys, laughing, pointing, smiling at the kiss cam.

Adjacent LED panels on the Jumbotron show: "KNICKS 57 — BULLS 61", Q4, clock "4:32", with team logos. Small "MADISON SQUARE GARDEN" / "MSG" / AT&T branding.

Phone-camera aesthetic: slight motion blur, mild handheld imperfection, slightly noisy in dark areas, bright LED slightly blown out, 16:9 aspect ratio, real spectator POV. Looks like a still pulled from a fan-filmed Instagram / TikTok kiss cam clip.

Call params:

- `provider`: `gpt-image-2` (load-bearing: sharpest LED detail and strongest reference-likeness lock; no fallback provider, re-roll the same call on moderation hits)
- `reference_images`: `[state.subject_a_url, state.subject_b_url]` (order matters: Subject A must be index 0, Subject B index 1; the prompt's "FIRST / SECOND reference image" refers to array index)
- `aspect_ratio`: `16:9`
- `quality`: `medium` (default for speed; `high` is now exposed but takes ~2 min/call, so use it only when fidelity matters)
- `output_format`: `png`
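
As a rough illustration, the Step 1 call params could be gathered into a plain dict before being handed to the host's image-generation tool. The function name, the idea of passing params as a dict, and the restriction to the two `quality` values named above are assumptions of this sketch; the keys and values themselves come from the list.

```python
from typing import Any, Dict


def build_still_params(
    prompt: str,
    subject_a_url: str,
    subject_b_url: str,
    quality: str = "medium",
) -> Dict[str, Any]:
    """Assemble the Step 1 arguments; Subject A must sit at index 0."""
    if quality not in ("medium", "high"):
        # Only the two values this recipe mentions are accepted here.
        raise ValueError(f"unexpected quality: {quality!r}")
    return {
        "provider": "gpt-image-2",
        # Order matters: the prompt's FIRST / SECOND map to index 0 / 1.
        "reference_images": [subject_a_url, subject_b_url],
        "aspect_ratio": "16:9",
        "quality": quality,
        "output_format": "png",
        "prompt": prompt,  # sent verbatim, no substitutions
    }
```

Keeping the reference order inside one function makes it harder to swap Subject A and Subject B by accident.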

Do not pass textual feature descriptions for either subject. The prompt above already refers to each subject only as "the character from the FIRST/SECOND reference image" — that's intentional. Describing hair / face / clothing / accessories in text fights the reference image and causes drift: verbal features override the visual reference, and the model homogenizes the subjects toward the description.

The reference images carry all identity + style information; the prompt only adds the style-preservation lock. This works for any reference style — photoreal, 3D rendered, sculpted, illustrated, etc. — without naming the style.

Save the returned URL as `state.kisscam_still_url`.

Agent-side self-check before Step 2: visually inspect — if the "Kiss Cam" text is misspelled, the scoreboard looks wrong, or either subject's likeness drifted, re-roll Step 1. Everything downstream pixel-locks to this image. This is the agent's own check — do not ask the user.

On failure (`moderation_blocked` from gpt-image-2, often female/female subject pairings + "kiss cam" wording): re-roll the same call. Do NOT swap providers; alternatives produce softer likeness and softer LED detail.
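
The "re-roll the same call" rule might look like the loop below. Everything about the response shape is assumed for illustration: `call_generate_image` is a hypothetical wrapper around the host's image tool, it is assumed to return a dict with either an `error` or a `url` key, and the `max_attempts` ceiling is my addition (the recipe does not set one).

```python
from typing import Any, Callable, Dict


def generate_still(
    call_generate_image: Callable[[Dict[str, Any]], Dict[str, Any]],
    params: Dict[str, Any],
    max_attempts: int = 3,
) -> str:
    """Re-issue the identical call on moderation_blocked; never swap provider."""
    for _ in range(max_attempts):
        result = call_generate_image(dict(params))  # same params every attempt
        if result.get("error") == "moderation_blocked":
            continue  # re-roll with the same provider and prompt
        return result["url"]
    raise RuntimeError(f"still blocked after {max_attempts} attempts")
```

The params are copied on each attempt so that nothing in the retry path can mutate the provider or the prompt.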

Step 2 — In-arena kiss cam clip (`generate_reference_video`, kling-v3-omni)

Kling-omni `image_types: ["first_frame"]` locks the still as literal frame 0, so the Jumbotron, scoreboard, kiss cam frame, hearts, cursive "Kiss Cam" text, and foreground fan silhouettes all stay pixel-locked across all 15s. Only the content inside the kiss cam panel animates.

Call params:

- `provider`: `kling`
- `kling_model`: `kling-v3-omni`
- `duration`: `15`
- `aspect_ratio`: `16:9`
- `quality_mode`: `pro` (1080p)
- `reference_images`: `[state.kisscam_still_url]` (use the latest value; Step 1 re-rolls overwrite it)
- `image_types`: `["first_frame"]`
- `sound`: `true`
- `prompt_adherence`: `strict`
- `negative_prompt` (verbatim):

scene cuts, scoreboard changing, Kiss Cam text changing, graphics morphing, character turning real, character becoming 2D, identity drift, gimbal stabilized, exaggerated acting, theatrical expressions, over-acting, mugging at camera, cartoon reactions, subjects lip-syncing PA announcer dialogue, characters mouthing the announcer lines, subjects' mouths moving with off-screen voice, subjects talking to camera, on-screen lip-sync

`prompt_adherence: strict` paired with the full `negative_prompt` anchor list is load-bearing: without both, Kling animates the scoreboard or "Kiss Cam" text mid-clip and subjects regress toward over-acting / lip-syncing the off-screen PA announcer.
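
For illustration, the Step 2 call params and the verbatim `negative_prompt` could be kept together in one place, so the two load-bearing pieces cannot drift apart. The function name and the Python value types (for example `duration` as an integer and `sound` as a boolean) are assumptions of this sketch; the keys, values, and negative-prompt text are copied from this section.

```python
from typing import Any, Dict

# Verbatim negative_prompt from this recipe, split only for line length.
NEGATIVE_PROMPT = (
    "scene cuts, scoreboard changing, Kiss Cam text changing, graphics morphing, "
    "character turning real, character becoming 2D, identity drift, gimbal stabilized, "
    "exaggerated acting, theatrical expressions, over-acting, mugging at camera, "
    "cartoon reactions, subjects lip-syncing PA announcer dialogue, "
    "characters mouthing the announcer lines, "
    "subjects' mouths moving with off-screen voice, subjects talking to camera, "
    "on-screen lip-sync"
)


def build_clip_params(prompt: str, still_url: str) -> Dict[str, Any]:
    """Assemble the Step 2 arguments as a plain dict."""
    return {
        "provider": "kling",
        "kling_model": "kling-v3-omni",
        "duration": 15,
        "aspect_ratio": "16:9",
        "quality_mode": "pro",
        "reference_images": [still_url],  # latest Step 1 still only
        "image_types": ["first_frame"],   # one entry per reference image
        "sound": True,
        "prompt_adherence": "strict",
        "negative_prompt": NEGATIVE_PROMPT,
        "prompt": prompt,
    }
```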

`prompt` (verbatim, ~2400 chars; keep it pre-trimmed because Kling caps prompts at 2500 chars):

First frame: spectator-POV phone shot at MSG of the Jumbotron Kiss Cam segment. Fan-filmed TikTok / Instagram kiss-cam clip style. Handheld phone, NOT pro broadcast.

Camera: continuous handheld shot. Subtle micro-wobble. Slow smooth push-in zoom across 15s toward the kiss cam panel.

Locked: Jumbotron, scoreboard (KNICKS 57 / Q4 4:32 / BULLS 61), red kiss cam border, sparkly hearts, cursive "Kiss Cam" script, AT&T branding, dark arena, foreground fan silhouettes — all consistent.

Inside the kiss cam frame:
Both subjects are the two characters already shown on the kiss cam panel of the first-frame reference still — preserve their exact likeness AND exact visual style across all 15 seconds. Do not redraw either subject in a different style. They animate with subtle, restrained, true-to-life motion within their original style — small natural blinks, gentle breathing, slight head turns, brief micro-smiles. Neither stiff and frozen, nor over-acting. The level of expression is what a real person shows when caught unexpectedly on a stadium camera. Keep all movement subtle, believable, human.
Surrounding Knicks fans react naturally in the background.

Timeline (all reactions stay subtle and human-scale):
0-3s: Subject A notices the camera, small embarrassed half-smile, glances at Subject B. Subject B notices a beat later, soft natural realization, slight smile beginning. Fans behind react gently.
3-7s: A and B exchange a brief warm look, share a quiet small laugh. Faces relax into natural smiles.
7-11s: A and B lean in and share a gentle kiss on the lips — brief, sweet, natural, not staged. Crowd cheers softly.
11-15s: They pull apart with light natural smiles. B leans head gently on A's shoulder. Both share a small quiet laugh.

Audio (non-diegetic / OFF-SCREEN only — Subject A and Subject B stay SILENT throughout, mouths closed, do not lip-sync the dialogue below):
- Packed-arena ambient throughout.
- OFF-SCREEN gender-neutral PA announcer (NOT from either visible subject — an unseen arena voice), amplified, slightly echoey, playful.
  0-3s: "Oh, kiss cam at the Garden — look at this!"
  3-7s: "Hahaha — let's see if they go for it!"
  7-11s: crowd erupts with "AWWWWS," cheers, claps swelling.
  11-15s: PA laughing: "There it is! Big love at the Garden tonight!"
- Phone mic picks up nearby fans louder than PA echo.

Aesthetic: real spectator phone-shot, noisy darks, slightly blown LED, subtle motion blur. 16:9, 1080p.

Why "subtle, restrained, true-to-life motion within the reference style" (and not "subtle motion only", a specific style label, or an expressive micro-expression list):

First iteration constrained stylized subjects with "subtle head/eye shifts only — remains a vinyl toy throughout" — looked frozen and pasted-in.
Second iteration named a specific style ("Pixar-quality 3D character"), which forced subjects toward that look even when the reference was a different style.
Third iteration said "animate naturally and expressively" with a loaded list of micro-expressions ("eyes widening, mouths opening for shock, cheeks lifting when laughing, hands coming up to the face") — subjects then over-acted, mugged at the camera, became theatrical.

The working framing keeps the style-preservation lock but specifies subtle, restrained, true-to-life motion at the level of someone actually caught on a stadium camera, paired with `exaggerated acting / theatrical expressions / over-acting / mugging at camera / cartoon reactions` in the `negative_prompt` to suppress regression toward the third-iteration failure mode.

Save the returned video URL as `state.kisscam_video_url`. If generation completes asynchronously, follow the MCP tool's returned status handle. Client-layer timeouts can orphan the upstream task with no recovery handle, so re-run Step 2 from scratch on timeout.
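
A minimal sketch of the async follow-up, under loudly stated assumptions: the response dicts, the `url`, `status_handle`, and `state` keys, the `check_status` callable, and the polling limits are all hypothetical, chosen only to show the control flow (inline result, poll on a handle, re-run Step 2 when no handle exists).

```python
import time
from typing import Any, Callable, Dict


def follow_generation(
    first_response: Dict[str, Any],
    check_status: Callable[[str], Dict[str, Any]],
    max_polls: int = 60,
    poll_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the video URL, polling the status handle when needed."""
    if first_response.get("url"):
        return first_response["url"]  # completed inline
    handle = first_response.get("status_handle")
    if not handle:
        # Orphaned task: nothing to recover, so Step 2 starts over.
        raise RuntimeError("no status handle: re-run Step 2 from scratch")
    for _ in range(max_polls):
        status = check_status(handle)
        if status.get("url"):
            return status["url"]
        if status.get("state") == "failed":
            raise RuntimeError("generation failed: re-run Step 2")
        sleep(poll_seconds)
    raise TimeoutError("gave up polling: re-run Step 2 from scratch")
```

The `sleep` parameter exists only so the loop can be exercised without real waiting.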

On failure: re-run kling; don't switch video engines. Seedance's two-stage likeness gate (same as the `baseball-trend` sibling) makes it unusable here. If the still itself is the issue (use the Step 1 self-check criteria of text spelling, scoreboard, and likeness to decide), re-run Step 1.

Step 3 — Deliver

Return both Pika CDN URLs: the still image URL and the final video URL. If the host client requires local media markers, create the local preview outside this skill after confirming both CDN URLs are reachable.

One-line summary: "Kiss Cam moment at MSG — 15s, 16:9, 1080p, Kling v3-omni, native PA-announcer commentary and crowd reaction."

Runtime expectations

Step 1 — gpt-image-2 medium, 16:9, 2 reference images: ~40–90s
Step 2 — kling v3-omni, pro 1080p, 15s, sound on: ~3–5 min
Step 3 — download + emit markers: ~5–10s
Total wall-clock per take: ~4–6 minutes

If a re-roll is needed at Step 1 the budget restarts there; at Step 2 only the video stage repeats.
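
The arithmetic behind these estimates can be sketched as below, using the per-stage ranges listed above. The function and its parameters are illustrative only; it assumes each re-roll simply repeats that stage's full range.

```python
from typing import Dict, Tuple

# (low, high) seconds per stage, taken from the estimates above.
STAGE_SECONDS: Dict[str, Tuple[int, int]] = {
    "still": (40, 90),     # Step 1
    "clip": (180, 300),    # Step 2, 3 to 5 minutes
    "deliver": (5, 10),    # Step 3
}


def take_budget(still_runs: int = 1, clip_runs: int = 1) -> Tuple[int, int]:
    """Return (low, high) wall-clock seconds for one delivered take."""
    if still_runs < 1 or clip_runs < 1:
        raise ValueError("each generation stage runs at least once")
    runs = {"still": still_runs, "clip": clip_runs, "deliver": 1}
    low = sum(STAGE_SECONDS[stage][0] * n for stage, n in runs.items())
    high = sum(STAGE_SECONDS[stage][1] * n for stage, n in runs.items())
    return low, high
```

A clean take comes to 225 to 400 seconds (roughly 3.75 to 6.7 minutes), in line with the ~4–6 minute figure; one extra Step 2 run pushes that to 405 to 700 seconds.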

Load-bearing phrases (keep verbatim)

Don't edit these without a re-validation pass — they're empirical behavior dependencies, not stylistic choices.

Image prompt (Step 1):

- `spectator-POV phone screenshot ... lower bowl looking up at the giant suspended Jumbotron`: without this, the model defaults to a broadcast-feed aesthetic
- `iconic retro arena kiss cam style` + the four decorative bullets (crimson border, sparkly hearts, star pattern, cursive Kiss Cam): together produce the recognizable retro look
- `the character from the FIRST / SECOND reference image`: image-grounding lock; never replace with verbal feature descriptions
- `preserve their exact likeness AND exact visual style` and `Do not redraw in a different style`: the style-preservation lock; both halves are load-bearing

Kling prompt (Step 2):

- `First frame: spectator-POV phone shot at MSG of the Jumbotron Kiss Cam segment`: anchor that matches the still
- `preserve their exact likeness AND exact visual style across all 15 seconds`: identity + style continuity across the clip
- `subtle, restrained, true-to-life motion within their original style`, `Neither stiff and frozen, nor over-acting`, and `The level of expression is what a real person shows when caught unexpectedly on a stadium camera`: calibrate the motion level (avoids both the "vinyl-toy frozen" and "theatrical over-acting" failure modes)
- `gentle kiss on the lips — brief, sweet, natural, not staged`: the kiss beat anchor (do not soften to "head" / "cheek"; the trend is a lip kiss)
- `Audio (non-diegetic / OFF-SCREEN only — Subject A and Subject B stay SILENT throughout, mouths closed, do not lip-sync the dialogue below)` and `OFF-SCREEN gender-neutral PA announcer (NOT from either visible subject — an unseen arena voice)`: without these, Kling defaults to attributing the quoted dialogue to a visible face and lip-syncs the announcer lines onto Subject A or B
- `Phone mic picks up nearby fans louder than PA echo`: sells the spectator-phone audio aesthetic

Kling negative_prompt (Step 2):

- `exaggerated acting, theatrical expressions, over-acting, mugging at camera, cartoon reactions`: without these Kling regresses toward the "expressive" failure mode where subjects mug at the lens
- `subjects lip-syncing PA announcer dialogue, characters mouthing the announcer lines, subjects' mouths moving with off-screen voice, subjects talking to camera, on-screen lip-sync`: paired with the off-screen audio anchor; suppresses Kling's default behavior of animating a visible mouth to match any speech on the audio track

Params:

- `provider: gpt-image-2` (Step 1): see Step 1 rationale
- `prompt_adherence: strict` (Step 2): without it, scoreboard and "Kiss Cam" text drift mid-clip
- `image_types: ["first_frame"]` (Step 2): pins the still as literal frame 0
- `quality_mode: pro` (Step 2): 1080p output
- `duration: 15` (Step 2): the timeline beats (0-3s / 3-7s / 7-11s / 11-15s) are written for 15s; changing duration breaks the kiss-beat timing
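
One way to guard these phrases against accidental edits is a plain substring check before each call. The sketch below is an assumption-laden illustration: the helper name is made up, and the anchor list is deliberately partial (a handful of the phrases listed in this section), not the full set.

```python
from typing import List

# Partial list for illustration; extend it with the remaining phrases above.
STEP2_ANCHORS: List[str] = [
    "preserve their exact likeness AND exact visual style across all 15 seconds",
    "subtle, restrained, true-to-life motion within their original style",
    "Neither stiff and frozen, nor over-acting",
    "Phone mic picks up nearby fans louder than PA echo",
]


def missing_anchors(prompt: str, anchors: List[str]) -> List[str]:
    """Return every anchor phrase that no longer appears verbatim in prompt."""
    return [anchor for anchor in anchors if anchor not in prompt]
```

An empty result means every listed phrase survived; anything else names exactly which phrase was paraphrased or trimmed, which is the cue for a re-validation pass.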

Engine choice: gpt-image-2 + Kling-only

Step 1: gpt-image-2, no fallback. Empirically sharpest LED panel detail (scoreboard numerals, kiss cam typography, retro decorative edges) and strongest reference-likeness lock; the combo is what sells the trend. On a `moderation_blocked` response, re-roll the same call rather than swapping providers; alternatives produced softer likeness and softer LED detail in earlier trials.

Step 2: Kling, no Seedance. Seedance has a two-stage `partner_validation_failed` 422 gate (same as the `baseball-trend` sibling skill): an input-side gate that rejects references with recognizable real people, and an output-side gate that rejects AFTER generation if the produced clip contains recognizable-looking faces. Every Kiss Cam shot has a packed-arena crowd full of faces, so the output-side gate is unavoidable here. Kling is the only engine that lands this recipe.

Kling trade-offs: 2500-char `prompt` cap (recipe above is pre-trimmed to ~2400 chars; re-inflating it can trigger prompt-length errors), no `seed` param (re-rolls are non-reproducible; to re-roll just call again).
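
Because the prompt sits close to the cap, a length check before each Step 2 call is cheap insurance. The sketch below assumes the cap counts Python string characters and that exactly 2500 is still accepted; neither detail is confirmed by this recipe, and the helper name is made up.

```python
KLING_PROMPT_CAP = 2500  # character cap stated in this recipe


def prompt_headroom(prompt: str, cap: int = KLING_PROMPT_CAP) -> int:
    """Return how many characters remain under the cap; raise if over it."""
    length = len(prompt)
    if length > cap:
        raise ValueError(
            f"prompt is {length} chars, {length - cap} over the {cap}-char cap"
        )
    return cap - length
```

A prompt of about 2400 characters leaves roughly 100 characters of headroom, which is why additions to the audio or aesthetic blocks tend to be what tips it over.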

Failure cheat sheet

| Symptom | Fix |
| --- | --- |
| `moderation_blocked` on Step 1 | gpt-image-2 safety gate (often female/female subject pairings + "kiss cam" wording). Re-roll the same call; do NOT swap providers, because alternatives produce softer likeness and softer LED detail |
| Kling prompt error: prompt > 2500 chars | Re-inflated audio or aesthetic section in the Kling prompt. Cut from the audio or aesthetic block; never from the animation timeline |
| Scoreboard, "Kiss Cam" text, or graphics animate mid-clip | `prompt_adherence` not set to `strict`, or `negative_prompt` missing entries like "scoreboard changing" / "Kiss Cam text changing". Verify both params; re-run Step 2 |
| Subject identity drifts after ~8s | Reference still face crop too small: not enough facial pixels for Kling to lock. Re-roll Step 1 with a tighter face crop on the subjects |
| Subject gets redrawn in a different style (photoreal → illustrated, or vice versa) | Style-preservation lock weakened, or a specific style label (Pixar / anime / etc.) crept into either prompt. Restore the "preserve exact likeness AND visual style" anchor; remove any style label |
| PA announcer mispronounces a word or misses the kiss beat | Native audio is one take per Kling generation. Re-run Step 2; there is no prompt-level fix |
| One of the on-screen subjects lip-syncs / mouths the PA announcer's dialogue | Kling defaults to attributing any quoted dialogue on the audio track to a visible face in frame. Without an explicit off-screen anchor, it picks a subject and animates their mouth to the words. Verify the audio block is framed as `Audio (non-diegetic / OFF-SCREEN only — Subject A and Subject B stay SILENT throughout, mouths closed...)` and the `negative_prompt` contains `subjects lip-syncing PA announcer dialogue, characters mouthing the announcer lines, on-screen lip-sync`; re-run Step 2 |
| Subjects mug at camera / over-act / theatrical expressions / cartoon reactions | Animation block prescribes a loaded list of simultaneous micro-expressions, or timeline beats use exaggerated descriptors ("huge smile", "shy excited wiggle", "eyes widen"). Restore the `subtle, restrained, true-to-life motion` framing in the animation block; strip emotion adjectives from the timeline; verify `exaggerated acting / theatrical expressions / over-acting / mugging at camera / cartoon reactions` are in the `negative_prompt`; re-run Step 2 |
| Kiss lands on cheek / forehead / head instead of lips | Timeline beat softened away from `gentle kiss on the lips`. Restore the verbatim lip-kiss line in the 7-11s beat; the trend is a lip kiss, not a peck on the head |
| Seedance attempted instead of Kling | Wrong engine chosen. Switch to `kling`; Seedance's two-stage likeness gate makes it unusable here (see "Engine choice") |
| Step 2 times out with no `task_id` returned | Client-layer timeout orphaned the upstream task, leaving no recovery handle. Re-run Step 2 from scratch |
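Several rows above come down to a parameter or phrase that silently went missing from the Step 2 call. As a rough sketch, a pre-flight check could catch these before the generation is submitted. The request layout (a plain dict with `prompt`, `negative_prompt`, and `prompt_adherence` keys) and the helper name `preflight_step2` are assumptions for illustration; only the strings being checked come from the cheat sheet.

```python
KLING_PROMPT_CAP = 2500  # prompt cap from the "Kling trade-offs" note

# Entries the cheat sheet says must be present in negative_prompt.
REQUIRED_NEGATIVE_TERMS = (
    "scoreboard changing",
    "Kiss Cam text changing",
    "subjects lip-syncing PA announcer dialogue",
    "characters mouthing the announcer lines",
    "on-screen lip-sync",
    "exaggerated acting",
    "theatrical expressions",
    "over-acting",
    "mugging at camera",
    "cartoon reactions",
)

# Verbatim fragments the cheat sheet says must survive in the Kling prompt.
REQUIRED_PROMPT_PHRASES = (
    "OFF-SCREEN only",
    "subtle, restrained, true-to-life motion",
    "gentle kiss on the lips",
)


def preflight_step2(request: dict) -> list:
    """Return a list of human-readable problems; an empty list means nothing was flagged."""
    problems = []
    if request.get("prompt_adherence") != "strict":
        problems.append("prompt_adherence is not 'strict'")

    # Case-insensitive match so capitalization differences do not cause false alarms.
    negative = (request.get("negative_prompt") or "").lower()
    for term in REQUIRED_NEGATIVE_TERMS:
        if term.lower() not in negative:
            problems.append(f"negative_prompt missing: {term}")

    # Prompt phrases are load-bearing and checked verbatim (case-sensitive).
    prompt = request.get("prompt") or ""
    for phrase in REQUIRED_PROMPT_PHRASES:
        if phrase not in prompt:
            problems.append(f"prompt missing verbatim phrase: {phrase}")

    if len(prompt) > KLING_PROMPT_CAP:
        problems.append(f"prompt is {len(prompt)} chars, over the {KLING_PROMPT_CAP} cap")
    return problems
```

An empty result does not guarantee a good clip. It only rules out the configuration slips listed in the table; audio mis-takes and timeouts still need a plain re-run.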

What not to do

Don't add name chyrons. This trend has NO names — it's two anonymous subjects caught on cam.
Don't render this as a TV broadcast cutaway with a heart overlay. It's a Jumbotron-shot.
Don't bare-phrase "subtle motion only" (subjects freeze into vinyl-toy stiffness) OR load the animation block with simultaneous micro-expressions like "eyes widening, mouths opening, cheeks lifting" (subjects over-act and mug at the camera). The working framing is `subtle, restrained, true-to-life motion at the level of someone caught unexpectedly on a stadium camera`. Both failure modes hide in the word "subtle"; the qualifier "true-to-life" is what calibrates it.
Don't write timeline beats with exaggerated emotion adjectives like "huge smile", "shy excited wiggle", "eyes widen", "giggles back". Use restrained verbs: "notices", "small smile", "exchange a look", "share a quiet laugh".
Don't name a specific target style (Pixar, anime, Disney, photoreal, etc.) in either prompt. The reference image defines the style; the prompt only says "preserve the reference style."
Don't soften the kiss to forehead / cheek / "kiss on top of the head". The trend is a lip kiss: keep the verbatim `gentle kiss on the lips — brief, sweet, natural, not staged` beat.
Don't write the audio block as bare quoted dialogue without an OFF-SCREEN anchor. Kling defaults to attributing any quoted speech to a visible face and will lip-sync the announcer lines onto Subject A or B. Always frame the audio block with `Audio (non-diegetic / OFF-SCREEN only — Subject A and Subject B stay SILENT throughout, mouths closed, never lip-sync any dialogue below)` and add the anti-lip-sync terms to `negative_prompt`.
Don't gender either subject in the narrative or PA dialogue. Use "Subject A / Subject B", "they/them", or descriptions without pronouns.
Don't swap providers on moderation hits; re-roll the same `gpt-image-2` call.
Don't try Seedance.
Don't generate music. Native PA announcer + crowd ambient IS the soundtrack.
Don't run a post-processing layer (`add_captions`, `generate_music`, `edit_concat`, `edit_text_overlay`, `edit_pip`, any `edit_*`). Kling burns the scorebug + chyron + native commentary directly; anything added afterward breaks the kiss-cam illusion.
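Some of the rules above are mechanical enough to lint for when a prompt has been hand-edited. The following is a minimal sketch, not part of the skill: `lint_prompt` and `is_allowed_tool` are hypothetical helpers, the pronoun list is an assumption about what counts as gendering a subject, and the tool check assumes mount-point prefixes are joined with a double underscore (as in `mcp__pika__upload_asset`).

```python
import re

# Style labels and exaggerated descriptors named in the list above.
STYLE_LABELS = ("Pixar", "anime", "Disney", "photoreal")
EXAGGERATED = ("huge smile", "shy excited wiggle", "eyes widen", "giggles back")
# Assumption: these pronouns are what "gendering a subject" looks like in practice.
GENDERED_PRONOUNS = ("he", "she", "him", "her", "his", "hers")

# Post-processing tools the list above rules out, plus the edit_* family.
BLOCKED_TOOLS = {"add_captions", "generate_music"}


def lint_prompt(text: str) -> list:
    """Return findings for style labels, exaggerated descriptors, and gendered pronouns."""
    findings = []
    for label in STYLE_LABELS:
        # Prefix match so "photorealistic" is caught along with "photoreal".
        if re.search(r"\b" + re.escape(label), text, re.IGNORECASE):
            findings.append(f"style label: {label}")
    for phrase in EXAGGERATED:
        # Prefix match so "eyes widening" is caught along with "eyes widen".
        if re.search(r"\b" + re.escape(phrase), text, re.IGNORECASE):
            findings.append(f"exaggerated descriptor: {phrase}")
    for pronoun in GENDERED_PRONOUNS:
        # Whole-word match so "the" or "other" does not trip the "he" / "her" checks.
        if re.search(r"\b" + pronoun + r"\b", text, re.IGNORECASE):
            findings.append(f"gendered pronoun: {pronoun}")
    return findings


def is_allowed_tool(tool_name: str) -> bool:
    """Reject post-processing tools, ignoring any mount-point prefix before the last '__'."""
    bare = tool_name.rsplit("__", 1)[-1]
    return bare not in BLOCKED_TOOLS and not bare.startswith("edit_")
```

A lint like this only flags wording. It cannot tell whether the verbatim anchors are intact, so it complements the load-bearing phrases rather than replacing them.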

