Parlor On-Device AI


Skill by ara.so — Daily 2026 Skills collection.
Parlor is a real-time, on-device multimodal AI assistant. It combines Gemma 4 E2B (via LiteRT-LM) for speech and vision understanding with Kokoro TTS for voice output. Everything runs locally — no API keys, no cloud calls, no cost per request.

Architecture


Browser (mic + camera)
    │  WebSocket (audio PCM + JPEG frames)
FastAPI server
    ├── Gemma 4 E2B via LiteRT-LM (GPU)  →  understands speech + vision
    └── Kokoro TTS (MLX on Mac, ONNX on Linux)  →  speaks back
    │  WebSocket (streamed audio chunks)
Browser (playback + transcript)
Key features:
  • Silero VAD in browser — hands-free, no push-to-talk
  • Barge-in — interrupt AI mid-sentence by speaking
  • Sentence-level TTS streaming — audio starts before full response is ready
  • Platform-aware TTS — MLX backend on Apple Silicon, ONNX on Linux
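The barge-in feature can be sketched with asyncio task cancellation: when new speech is detected, the task streaming the current reply is cancelled before the new one starts. This is a minimal illustration, not Parlor's actual implementation — `Speaker`, `speak`, and `_play` are hypothetical names:

```python
import asyncio

class Speaker:
    """Tracks the in-flight TTS playback task so new speech can cancel it (barge-in)."""

    def __init__(self):
        self.current: asyncio.Task | None = None
        self.interrupted = False

    async def speak(self, sentences):
        # If the assistant is still talking, cancel it: the user barged in.
        if self.current and not self.current.done():
            self.current.cancel()
            self.interrupted = True
        self.current = asyncio.create_task(self._play(sentences))

    async def _play(self, sentences):
        for sentence in sentences:
            # Stand-in for synthesizing and streaming one sentence of audio.
            await asyncio.sleep(0.01)

async def demo():
    sp = Speaker()
    await sp.speak(["First reply, sentence one.", "Sentence two."])
    await asyncio.sleep(0.005)          # user starts speaking mid-reply...
    await sp.speak(["Second reply."])   # ...so the first playback task is cancelled
    await asyncio.sleep(0.05)
    return sp.interrupted

# asyncio.run(demo()) returns True: the first reply was interrupted
```

Cancelling the playback task rather than flushing a buffer keeps the interruption immediate even while a sentence is mid-synthesis.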

Requirements


  • Python 3.12+
  • macOS with Apple Silicon or Linux with a supported GPU
  • ~3 GB free RAM
  • uv package manager

Installation


```bash
git clone https://github.com/fikrikarim/parlor.git
cd parlor
```

Install uv if needed


```bash
cd src
uv sync
uv run server.py
```

Open [http://localhost:8000](http://localhost:8000), grant camera and microphone permissions, and start talking.

Models download automatically on first run (~2.6 GB for Gemma 4 E2B, plus TTS models).

Configuration


Set environment variables before running:
```bash
# Use a pre-downloaded model instead of auto-downloading
export MODEL_PATH=/path/to/gemma-4-E2B-it.litertlm

# Change server port (default: 8000)
export PORT=9000
uv run server.py
```

| Variable     | Default                        | Description                                    |
|--------------|-------------------------------|------------------------------------------------|
| `MODEL_PATH` | auto-download from HuggingFace | Path to local `.litertlm` model file           |
| `PORT`       | `8000`                         | Server port                                    |
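The table maps to a couple of environment lookups; a minimal sketch of how the server might resolve them (`load_config` is a hypothetical helper, not the literal server.py code):

```python
import os

def load_config(env=None):
    """Resolve Parlor's settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        # None means: auto-download the model from HuggingFace on first run
        "model_path": env.get("MODEL_PATH"),
        "port": int(env.get("PORT", "8000")),
    }
```

With no overrides, `load_config({})` yields `{"model_path": None, "port": 8000}`; setting `PORT="9000"` changes only the port.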

Project Structure


src/
├── server.py              # FastAPI WebSocket server + Gemma 4 inference
├── tts.py                 # Platform-aware TTS (MLX on Mac, ONNX on Linux)
├── index.html             # Frontend UI (VAD, camera, audio playback)
├── pyproject.toml         # Dependencies
└── benchmarks/
    ├── bench.py           # End-to-end WebSocket benchmark
    └── benchmark_tts.py   # TTS backend comparison

Key Components


server.py — FastAPI WebSocket Server


The server handles two WebSocket connections: one for receiving audio/video from the browser, one for streaming audio back.
```python
# Simplified pattern from server.py
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    async for data in websocket.iter_bytes():
        # data contains PCM audio + optional JPEG frame
        response_text = await run_gemma_inference(data)
        audio_chunks = await run_tts(response_text)
        for chunk in audio_chunks:
            await websocket.send_bytes(chunk)
```

tts.py — Platform-Aware TTS


Kokoro TTS selects backend based on platform:
```python
# tts.py uses platform detection
import platform

def get_tts_backend():
    if platform.system() == "Darwin":
        # Apple Silicon: use MLX backend for GPU acceleration
        from kokoro_mlx import KokoroMLX
        return KokoroMLX()
    else:
        # Linux: use ONNX backend
        from kokoro import KokoroPipeline
        return KokoroPipeline(lang_code='a')

tts = get_tts_backend()
```

```python
# Sentence-level streaming — yields audio as each sentence is ready
async def synthesize_streaming(text: str):
    for sentence in split_sentences(text):
        audio = tts.synthesize(sentence)
        yield audio
```
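`split_sentences` is referenced above but not shown; a minimal regex-based version could look like the following (an assumption — the real splitter may treat abbreviations, decimals, and CJK punctuation differently):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., !, or ? that is followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

# split_sentences("Hi. How are you? Fine!") → ["Hi.", "How are you?", "Fine!"]
```

Each yielded sentence can then be synthesized independently, which is what lets audio start before the full response is generated.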

Gemma 4 E2B Inference via LiteRT-LM


```python
# LiteRT-LM inference pattern
import os
from litert_lm import LiteRTLM

model_path = os.environ.get("MODEL_PATH", None)

# Auto-downloads if MODEL_PATH not set
model = LiteRTLM.from_pretrained(
    "google/gemma-4-E2B-it",
    local_path=model_path,
)

async def run_gemma_inference(audio_pcm: bytes, image_jpeg: bytes = None):
    inputs = {"audio": audio_pcm}
    if image_jpeg:
        inputs["image"] = image_jpeg
    response = ""
    async for token in model.generate_stream(**inputs):
        response += token
    return response
```

Running Benchmarks


```bash
cd src

# End-to-end WebSocket latency benchmark
uv run benchmarks/bench.py

# Compare TTS backends (MLX vs ONNX)
uv run benchmarks/benchmark_tts.py
```
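For quick ad-hoc measurements outside the bundled scripts, a small awaitable timer is enough (a generic sketch, not part of bench.py):

```python
import asyncio
import time

async def timed(awaitable):
    """Await something and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = await awaitable
    return result, time.perf_counter() - start

# Example: time any coroutine, e.g. one WebSocket round trip
result, elapsed = asyncio.run(timed(asyncio.sleep(0.01, result="ok")))
```

Wrap a single send/receive cycle in `timed(...)` to spot-check latency against the reference numbers below.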

Performance Reference (Apple M3 Pro)


| Stage                            | Time      |
|----------------------------------|-----------|
| Speech + vision understanding    | ~1.8–2.2s |
| Response generation (~25 tokens) | ~0.3s     |
| Text-to-speech (1–3 sentences)   | ~0.3–0.7s |
| Total end-to-end                 | ~2.5–3.0s |

Decode speed: ~83 tokens/sec on GPU.
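The generation row is consistent with the decode speed: ~25 tokens at ~83 tokens/sec works out to roughly 0.3 s.

```python
tokens = 25
decode_speed = 83.0  # tokens/sec on GPU (from the table)
generation_time = tokens / decode_speed
# ≈ 0.30 s, matching the "Response generation" row
```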

Common Patterns


Extending the System Prompt


Modify the prompt in `server.py` to change the AI's persona or task:

```python
SYSTEM_PROMPT = """You are a helpful language tutor.
Respond conversationally in 1-3 sentences.
If the user makes a grammar mistake, gently correct them.
You can see through the user's camera and discuss what you observe."""
```

Adding a New Language for TTS


Kokoro supports multiple language codes. Set `lang_code` in `tts.py`:

```python
# Language codes: 'a' = American English, 'b' = British English,
# 'e' = Spanish, 'f' = French, 'z' = Chinese, 'j' = Japanese
pipeline = KokoroPipeline(lang_code='e')  # Spanish
```

Customizing VAD Sensitivity (index.html)


The Silero VAD threshold can be tuned in the frontend:
```javascript
// In index.html — lower positiveSpeechThreshold = more sensitive
const vad = await MicVAD.new({
  positiveSpeechThreshold: 0.6,   // default ~0.8, lower = triggers more easily
  negativeSpeechThreshold: 0.35,  // how quickly it stops detecting speech
  minSpeechFrames: 3,
  onSpeechStart: () => { /* UI feedback */ },
  onSpeechEnd: (audio) => sendAudioToServer(audio),
});
```

Sending Frames Programmatically (WebSocket Client Example)


```python
import asyncio
import websockets
import json
import base64

async def send_audio_frame(audio_pcm_bytes: bytes, jpeg_bytes: bytes = None):
    uri = "ws://localhost:8000/ws"
    async with websockets.connect(uri) as ws:
        payload = {
            "audio": base64.b64encode(audio_pcm_bytes).decode(),
        }
        if jpeg_bytes:
            payload["image"] = base64.b64encode(jpeg_bytes).decode()

        await ws.send(json.dumps(payload))

        # Receive streamed audio response
        async for message in ws:
            audio_chunk = message  # raw PCM bytes
            # play or save audio_chunk
```
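The chunks arrive as raw PCM; to save them as a playable file, wrap them in a WAV container with the stdlib `wave` module. The 24 kHz mono 16-bit format below is an assumption about Parlor's output — adjust the parameters to match the server:

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM bytes in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)            # mono
        w.setsampwidth(2)            # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()
```

Collect the `audio_chunk` bytes from the loop above, concatenate them, and pass the result to `pcm_to_wav` before writing to disk.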

Troubleshooting


Model download fails


```bash
# Pre-download manually via huggingface_hub
uv run python -c "
from huggingface_hub import hf_hub_download
path = hf_hub_download('google/gemma-4-E2B-it', 'gemma-4-E2B-it.litertlm')
print(path)
"
export MODEL_PATH=/path/shown/above
uv run server.py
```

Microphone/camera not working in browser


  • Must access via http://localhost (not an IP address) — browsers block media APIs on non-localhost HTTP
  • Check browser permissions: address bar → lock icon → reset permissions

TTS not loading on Linux


```bash
# Ensure the ONNX runtime is installed
uv add onnxruntime

# Or for GPU:
uv add onnxruntime-gpu
```

High latency or slow inference


  • Verify GPU is being used: check for Metal (Mac) or CUDA (Linux) in startup logs
  • Close other GPU-heavy applications
  • On Linux, confirm CUDA drivers match the installed onnxruntime-gpu version

Port already in use


```bash
export PORT=8080
uv run server.py
```

Or kill the existing process:


```bash
lsof -ti:8000 | xargs kill
```

`uv sync` fails — Python version mismatch

```bash
# Parlor requires Python 3.12+
python3 --version

# Install 3.12 via pyenv or your system package manager, then:
uv python pin 3.12
uv sync
```

Dependencies (pyproject.toml)


Key packages installed by `uv sync`:
  • `litert-lm` — Google AI Edge inference runtime for Gemma
  • `fastapi` + `uvicorn` — async web/WebSocket server
  • `kokoro` — Kokoro TTS ONNX backend
  • `kokoro-mlx` — Kokoro TTS MLX backend (Mac only)
  • `silero-vad` — voice activity detection (browser-side via CDN)
  • `huggingface-hub` — model auto-download