Parlor On-Device AI
Skill by ara.so — Daily 2026 Skills collection.
Parlor is a real-time, on-device multimodal AI assistant. It combines Gemma 4 E2B (via LiteRT-LM) for speech and vision understanding with Kokoro TTS for voice output. Everything runs locally — no API keys, no cloud calls, no cost per request.
Architecture
```
Browser (mic + camera)
        │
        │ WebSocket (audio PCM + JPEG frames)
        ▼
FastAPI server
        ├── Gemma 4 E2B via LiteRT-LM (GPU) → understands speech + vision
        └── Kokoro TTS (MLX on Mac, ONNX on Linux) → speaks back
        │
        │ WebSocket (streamed audio chunks)
        ▼
Browser (playback + transcript)
```

Key features:
- Silero VAD in browser — hands-free, no push-to-talk
- Barge-in — interrupt AI mid-sentence by speaking
- Sentence-level TTS streaming — audio starts before full response is ready
- Platform-aware TTS — MLX backend on Apple Silicon, ONNX on Linux
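Barge-in boils down to cooperative cancellation of the TTS send loop. The sketch below is a self-contained illustration of that pattern, not Parlor's actual code; `stream_tts` and the demo harness are hypothetical names:

```python
import asyncio

async def stream_tts(sentences, send, barge_in: asyncio.Event):
    """Send synthesized sentences one at a time, stopping as soon as
    the user starts speaking (barge-in). Illustrative sketch only."""
    sent = []
    for sentence in sentences:
        if barge_in.is_set():  # user interrupted mid-response
            break
        await send(sentence)
        sent.append(sentence)
    return sent

async def demo():
    barge_in = asyncio.Event()
    played = []

    async def send(sentence):
        played.append(sentence)
        if len(played) == 1:
            barge_in.set()  # simulate the user speaking after one sentence

    return await stream_tts(["One.", "Two.", "Three."], send, barge_in)

print(asyncio.run(demo()))  # ['One.']
```

Checking the event between sentences (rather than killing the task mid-chunk) keeps the cutoff at a sentence boundary, which matches the sentence-level streaming design.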
Requirements
- Python 3.12+
- macOS with Apple Silicon or Linux with a supported GPU
- ~3 GB free RAM
- `uv` package manager
Installation
```bash
git clone https://github.com/fikrikarim/parlor.git
cd parlor

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

cd src
uv sync
uv run server.py
```

Open [http://localhost:8000](http://localhost:8000), grant camera and microphone permissions, and start talking.

Models download automatically on first run (~2.6 GB for Gemma 4 E2B, plus TTS models).

Configuration
Set environment variables before running:
```bash
# Use a pre-downloaded model instead of auto-downloading
export MODEL_PATH=/path/to/gemma-4-E2B-it.litertlm

# Change server port (default: 8000)
export PORT=9000
uv run server.py
```

| Variable     | Default                        | Description                          |
|--------------|--------------------------------|--------------------------------------|
| `MODEL_PATH` | auto-download from HuggingFace | Path to local `.litertlm` model file |
| `PORT`       | `8000`                         | Server port                          |
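Reading these variables with the documented fallbacks looks like the following in Python (a sketch of the presumed startup logic; `server.py` may structure this differently):

```python
import os

def load_config():
    # PORT falls back to 8000; a MODEL_PATH of None triggers auto-download
    port = int(os.environ.get("PORT", "8000"))
    model_path = os.environ.get("MODEL_PATH")  # None if unset
    return port, model_path

print(load_config())
```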
Project Structure
```
src/
├── server.py           # FastAPI WebSocket server + Gemma 4 inference
├── tts.py              # Platform-aware TTS (MLX on Mac, ONNX on Linux)
├── index.html          # Frontend UI (VAD, camera, audio playback)
├── pyproject.toml      # Dependencies
└── benchmarks/
    ├── bench.py            # End-to-end WebSocket benchmark
    └── benchmark_tts.py    # TTS backend comparison
```

Key Components
server.py — FastAPI WebSocket Server
The server handles two WebSocket connections: one for receiving audio/video from the browser, one for streaming audio back.
```python
# Simplified pattern from server.py
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    async for data in websocket.iter_bytes():
        # data contains PCM audio + optional JPEG frame
        response_text = await run_gemma_inference(data)
        audio_chunks = await run_tts(response_text)
        for chunk in audio_chunks:
            await websocket.send_bytes(chunk)
```

tts.py — Platform-Aware TTS
Kokoro TTS selects backend based on platform:
```python
# tts.py uses platform detection
import platform

def get_tts_backend():
    if platform.system() == "Darwin":
        # Apple Silicon: use MLX backend for GPU acceleration
        from kokoro_mlx import KokoroMLX
        return KokoroMLX()
    else:
        # Linux: use ONNX backend
        from kokoro import KokoroPipeline
        return KokoroPipeline(lang_code='a')

tts = get_tts_backend()

# Sentence-level streaming — yields audio as each sentence is ready
async def synthesize_streaming(text: str):
    for sentence in split_sentences(text):
        audio = tts.synthesize(sentence)
        yield audio
```
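The sentence-level pattern can be exercised end to end with stubs. Both `split_sentences` and the stub synthesizer below are stand-ins for illustration, not Parlor's implementations:

```python
import asyncio
import re

def split_sentences(text: str):
    # Naive splitter: break after ., !, or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def synthesize(sentence: str) -> bytes:
    # Stand-in for Kokoro: fake "audio" bytes per sentence
    return sentence.encode("utf-8")

async def synthesize_streaming(text: str):
    for sentence in split_sentences(text):
        # each chunk is available before later sentences are synthesized
        yield synthesize(sentence)

async def demo():
    return [chunk async for chunk in synthesize_streaming("Hello there. How can I help?")]

print(asyncio.run(demo()))  # [b'Hello there.', b'How can I help?']
```

This is why playback can start before the full response exists: the first sentence's audio is yielded as soon as it is ready.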
Gemma 4 E2B Inference via LiteRT-LM
```python
# LiteRT-LM inference pattern
from litert_lm import LiteRTLM
import os

model_path = os.environ.get("MODEL_PATH", None)

# Auto-downloads if MODEL_PATH is not set
model = LiteRTLM.from_pretrained(
    "google/gemma-4-E2B-it",
    local_path=model_path
)

async def run_gemma_inference(audio_pcm: bytes, image_jpeg: bytes = None):
    inputs = {"audio": audio_pcm}
    if image_jpeg:
        inputs["image"] = image_jpeg
    response = ""
    async for token in model.generate_stream(**inputs):
        response += token
    return response
```

Running Benchmarks
```bash
cd src

# End-to-end WebSocket latency benchmark
uv run benchmarks/bench.py

# Compare TTS backends (MLX vs ONNX)
uv run benchmarks/benchmark_tts.py
```

Performance Reference (Apple M3 Pro)
| Stage | Time |
|---|---|
| Speech + vision understanding | ~1.8–2.2s |
| Response generation (~25 tokens) | ~0.3s |
| Text-to-speech (1–3 sentences) | ~0.3–0.7s |
| Total end-to-end | ~2.5–3.0s |
Decode speed: ~83 tokens/sec on GPU.
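These figures are mutually consistent: at ~83 tokens/sec, a ~25-token response takes roughly 0.3 s, matching the response-generation row above.

```python
tokens = 25
tokens_per_sec = 83.0
print(f"{tokens / tokens_per_sec:.2f} s")  # 0.30 s
```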
Common Patterns
Extending the System Prompt
Modify the `SYSTEM_PROMPT` in `server.py` to change the AI's persona or task:

```python
SYSTEM_PROMPT = """You are a helpful language tutor.
Respond conversationally in 1-3 sentences.
If the user makes a grammar mistake, gently correct them.
You can see through the user's camera and discuss what you observe."""
```

Adding a New Language for TTS
Kokoro supports multiple language codes. Set `lang_code` in `tts.py`:

```python
# Language codes: 'a' = American English, 'b' = British English,
# 'e' = Spanish, 'f' = French, 'z' = Chinese, 'j' = Japanese
pipeline = KokoroPipeline(lang_code='e')  # Spanish
```

Customizing VAD Sensitivity (index.html)
The Silero VAD threshold can be tuned in the frontend:
```javascript
// In index.html — lower positiveSpeechThreshold = more sensitive
const vad = await MicVAD.new({
  positiveSpeechThreshold: 0.6,   // default ~0.8, lower = triggers more easily
  negativeSpeechThreshold: 0.35,  // how quickly it stops detecting speech
  minSpeechFrames: 3,
  onSpeechStart: () => { /* UI feedback */ },
  onSpeechEnd: (audio) => sendAudioToServer(audio),
});
```

Sending Frames Programmatically (WebSocket Client Example)
```python
import asyncio
import base64
import json

import websockets

async def send_audio_frame(audio_pcm_bytes: bytes, jpeg_bytes: bytes = None):
    uri = "ws://localhost:8000/ws"
    async with websockets.connect(uri) as ws:
        payload = {
            "audio": base64.b64encode(audio_pcm_bytes).decode(),
        }
        if jpeg_bytes:
            payload["image"] = base64.b64encode(jpeg_bytes).decode()
        await ws.send(json.dumps(payload))
        # Receive streamed audio response
        async for message in ws:
            audio_chunk = message  # raw PCM bytes
            # play or save audio_chunk
```
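To save the received chunks, raw PCM can be wrapped in a WAV container with the standard library. The 24 kHz mono 16-bit format here is an assumption about the server's output; adjust it to match the actual stream:

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    # Wrap raw 16-bit mono PCM in a WAV container (assumed format)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)       # mono
        w.setsampwidth(2)       # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()

wav_bytes = pcm_to_wav(b"\x00\x00" * 2400)  # 0.1 s of silence at 24 kHz
print(wav_bytes[:4])  # b'RIFF'
```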
Troubleshooting
Model download fails
```bash
# Pre-download manually via huggingface_hub
uv run python -c "
from huggingface_hub import hf_hub_download
path = hf_hub_download('google/gemma-4-E2B-it', 'gemma-4-E2B-it.litertlm')
print(path)
"
export MODEL_PATH=/path/shown/above
uv run server.py
```

Microphone/camera not working in browser
- Must access via `http://localhost` (not an IP address) — browsers block media APIs on non-localhost HTTP
- Check browser permissions: address bar → lock icon → reset permissions
TTS not loading on Linux
```bash
# Ensure the ONNX runtime is installed
uv add onnxruntime

# Or for GPU:
uv add onnxruntime-gpu
```

High latency or slow inference
- Verify GPU is being used: check for Metal (Mac) or CUDA (Linux) in startup logs
- Close other GPU-heavy applications
- On Linux, confirm CUDA drivers match the installed `onnxruntime-gpu` version
Port already in use
```bash
export PORT=8080
uv run server.py
```

Or kill the existing process:

```bash
lsof -ti:8000 | xargs kill
```
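A quick portable check from Python for whether something is already listening on a port (stdlib-only sketch):

```python
import socket

def port_in_use(port: int) -> bool:
    # Try to connect; a successful connect means something is listening
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex(("127.0.0.1", port)) == 0
```

For example, `port_in_use(8000)` returns `True` while the Parlor server is running.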
`uv sync` fails — Python version mismatch

```bash
# Parlor requires Python 3.12+
python3 --version

# Install 3.12 via pyenv or system package manager, then:
uv python pin 3.12
uv sync
```
Dependencies (pyproject.toml)

Key packages installed by `uv sync`:

- `litert-lm` — Google AI Edge inference runtime for Gemma
- `fastapi` + `uvicorn` — async web/WebSocket server
- `kokoro` — Kokoro TTS ONNX backend
- `kokoro-mlx` — Kokoro TTS MLX backend (Mac only)
- `silero-vad` — voice activity detection (browser-side via CDN)
- `huggingface-hub` — model auto-download