ai-avatar-video

AI Avatar & Talking Head Videos

Create AI avatars and talking head videos via inference.sh CLI.

Quick Start

Requires the inference.sh CLI (`belt`); see the install instructions.

bash

belt login

Recommended: P-Video-Avatar (fastest, cheapest, built-in TTS)

belt app run pruna/p-video-avatar --input '{ "image": "https://portrait.jpg", "voice_script": "Hello, welcome to our product demo!", "voice": "Zephyr (Female)" }'

Available Models

Start with P-Video-Avatar: it is up to 18x faster and roughly 6x cheaper than the alternatives compared below, with built-in TTS, dynamic backgrounds, and 1080p support.

| Model | App ID | Best For | Built-in TTS |
|---|---|---|---|
| P-Video-Avatar | `pruna/p-video-avatar` | Best overall: speed, cost, quality, control | Yes (30 voices, 10 languages) |
| OmniHuman 1.5 | `bytedance/omnihuman-1-5` | Multi-character, audio-driven | No |
| Fabric 1.0 | `falai/fabric-1-0` | Image talks with lipsync | Yes |
| PixVerse Lipsync | `falai/pixverse-lipsync` | Highly realistic lipsync | No |

Cost & Speed Comparison

| Model | Generation time (per second of output video) | Cost (per second of output video) |
|---|---|---|
| P-Video-Avatar | ~1.83s/s | $0.025 |
| OmniHuman 1.5 | ~28s/s (15x slower) | $0.16 (6.4x more) |
| Fabric 1.0 | ~34s/s (18x slower) | $0.14 (5.6x more) |
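
As a worked example of the table above, the sketch below multiplies the per-second figures by a clip length to estimate generation time and cost. The `estimate` helper and the 30-second clip length are illustrative assumptions, not part of the `belt` CLI, and because the table's speeds are approximate the results are rough estimates only.

```shell
# estimate VIDEO_SECONDS GEN_SECONDS_PER_VIDEO_SECOND USD_PER_VIDEO_SECOND
# Hypothetical helper: prints the estimated generation time and cost for one clip.
estimate() {
  awk -v d="$1" -v s="$2" -v c="$3" \
    'BEGIN { printf "%.1f s to generate, $%.2f\n", d * s, d * c }'
}

# A 30-second clip, using the figures from the table:
estimate 30 1.83 0.025   # P-Video-Avatar -> 54.9 s to generate, $0.75
estimate 30 28 0.16      # OmniHuman 1.5  -> 840.0 s to generate, $4.80
estimate 30 34 0.14      # Fabric 1.0     -> 1020.0 s to generate, $4.20
```

On these figures, a 30-second clip takes under a minute with P-Video-Avatar and 14 to 17 minutes with the other two models.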

Examples

P-Video-Avatar (Recommended)

Generate an avatar from a portrait and a text script using built-in TTS:

bash

belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "Welcome to our product walkthrough. Today I will show you three key features.",
  "voice": "Puck (Male)",
  "voice_language": "English (US)",
  "resolution": "720p"
}'

With custom style control:

bash

belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "This is exciting news!",
  "voice": "Aoede (Female)",
  "voice_prompt": "Enthusiastic and energetic tone",
  "video_prompt": "The person is presenting on stage with dramatic lighting",
  "resolution": "1080p"
}'

With audio file instead of TTS:

bash

belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "audio": "https://speech.mp3"
}'
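
A script that contains double quotes or backslashes would break the hand-written JSON in the examples above. One way to handle this, sketched here on the assumption that the script is a single line of text, is to escape it before building the `--input` payload. The helper names `json_escape` and `p_video_avatar_input` are hypothetical; only the `image`, `voice_script`, and `voice` fields come from the examples above.

```shell
# json_escape TEXT: escape backslashes and double quotes for use inside a JSON string.
# Assumption: TEXT is a single line; newlines and control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# p_video_avatar_input IMAGE_URL SCRIPT VOICE: build the --input JSON payload.
p_video_avatar_input() {
  printf '{ "image": "%s", "voice_script": "%s", "voice": "%s" }' \
    "$1" "$(json_escape "$2")" "$3"
}

# Usage (requires a logged-in belt CLI):
#   belt app run pruna/p-video-avatar \
#     --input "$(p_video_avatar_input "https://portrait.jpg" 'She said "hello" to everyone.' "Puck (Male)")"
```

Building the payload in a function also keeps apostrophes in the script from ending the single-quoted shell string early.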

Full Workflow: Generate Portrait + Avatar

Use Pruna P-Image to generate the portrait, then create the avatar:

bash

# 1. Generate a portrait image
belt app run pruna/p-image --input '{
  "prompt": "professional headshot portrait of a young woman, neutral background, looking at camera, studio lighting, photorealistic",
  "aspect_ratio": "9:16"
}'

# 2. Create avatar video with built-in TTS
belt app run pruna/p-video-avatar --input '{
  "image": "<image-url-from-step-1>",
  "voice_script": "Hi there! Let me walk you through our latest features.",
  "voice": "Zephyr (Female)"
}'

OmniHuman 1.5 (Multi-Character)

bash

belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://portrait.jpg",
  "audio_url": "https://speech.mp3"
}'

OmniHuman 1.5 lets you specify which character to drive in multi-person images.

Fabric 1.0 (Image Talks)

bash

belt app run falai/fabric-1-0 --input '{
  "image_url": "https://face.jpg",
  "audio_url": "https://audio.mp3"
}'

PixVerse Lipsync

bash

belt app run falai/pixverse-lipsync --input '{
  "image_url": "https://portrait.jpg",
  "audio_url": "https://speech.mp3"
}'

Full Workflow: TTS + Avatar (Non-TTS Models)

For models without built-in TTS, generate speech first:

bash

# 1. Generate speech from text
belt app run infsh/kokoro-tts --input '{
  "prompt": "Welcome to our product demo. Today I will show you..."
}' > speech.json

# 2. Create avatar video with the speech
belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://presenter-photo.jpg",
  "audio_url": "<audio-url-from-step-1>"
}'
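
The hand-off between steps 1 and 2 can be scripted rather than copied by hand. The sketch below is one possible approach under a loud assumption: the output schema of the TTS app is not documented here, so `first_url` simply takes the first `https://` URL it finds in `speech.json`. Inspect the file before relying on it. Both helper names are hypothetical.

```shell
# first_url FILE: print the first https URL found in FILE.
# Assumption: the saved output contains the audio URL as plain text and it is
# the first URL in the file; the real output schema may differ.
first_url() {
  grep -o 'https://[^"[:space:]]*' "$1" | head -n 1
}

# avatar_input IMAGE_URL AUDIO_URL: build the --input JSON for step 2.
avatar_input() {
  printf '{ "image_url": "%s", "audio_url": "%s" }' "$1" "$2"
}

# Usage (requires a logged-in belt CLI and speech.json from step 1):
#   audio_url="$(first_url speech.json)"
#   belt app run bytedance/omnihuman-1-5 \
#     --input "$(avatar_input "https://presenter-photo.jpg" "$audio_url")"
```

If the saved output turns out to contain several URLs, a JSON-aware tool that selects the specific field would be a safer choice than pattern matching.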

Full Workflow: Dub Video in Another Language

bash

# 1. Transcribe original video
belt app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://video.mp4"}' > transcript.json

# 2. Translate text (manually or with an LLM)

# 3. Generate speech in new language
belt app run infsh/kokoro-tts --input '{"text": "<translated-text>"}' > new_speech.json

# 4. Lipsync the original video with new audio
belt app run infsh/latentsync-1-6 --input '{
  "video_url": "https://original-video.mp4",
  "audio_url": "<new-audio-url>"
}'

Use Cases

Marketing: Product demos with AI presenter
Education: Course videos, explainers
Localization: Dub content in multiple languages
Social Media: Consistent virtual influencer
Corporate: Training videos, announcements
Gaming: Character avatars, NPC dialogue

Tips

Use high-quality portrait photos (front-facing, good lighting)
Audio should be clear with minimal background noise
P-Video-Avatar supports built-in TTS — no need for a separate speech generation step
P-Video-Avatar output aspect ratio matches the input image
Generate portraits with `pruna/p-image` using `9:16` aspect ratio for vertical videos
OmniHuman 1.5 supports multiple people in one image
LatentSync is best for syncing existing videos to new audio

Related Skills

bash

# Dedicated P-Video-Avatar skill
npx skills add inference-sh/skills@p-video-avatar

# Full platform skill (all 250+ apps)
npx skills add inference-sh/skills@infsh-cli

# Text-to-speech (generate audio for non-TTS avatar models)
npx skills add inference-sh/skills@text-to-speech

# Speech-to-text (transcribe for dubbing)
npx skills add inference-sh/skills@speech-to-text

# Video generation
npx skills add inference-sh/skills@ai-video-generation

# Image generation (create avatar images)
npx skills add inference-sh/skills@ai-image-generation

Browse all video apps: `belt app list --category video`

Documentation

Running Apps - How to run apps via CLI
Content Pipeline Example - Building media workflows
Streaming Results - Real-time progress updates
