
# Multimodal LLM Patterns

Integrate vision and audio capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, and text-to-speech.

## Quick Reference

| Category | Rules | Impact | When to Use |
|----------|-------|--------|-------------|
| Vision: Image Analysis | 1 | HIGH | Image captioning, VQA, multi-image comparison, object detection |
| Vision: Document Understanding | 1 | HIGH | OCR, chart/diagram analysis, PDF processing, table extraction |
| Vision: Model Selection | 1 | MEDIUM | Choosing provider, cost optimization, image size limits |
| Audio: Speech-to-Text | 1 | HIGH | Transcription, speaker diarization, long-form audio |
| Audio: Text-to-Speech | 1 | MEDIUM | Voice synthesis, expressive TTS, multi-speaker dialogue |
| Audio: Model Selection | 1 | MEDIUM | Real-time voice agents, provider comparison, pricing |

Total: 6 rules across 2 categories (Vision, Audio)

## Vision: Image Analysis

Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set `max_tokens` and resize images before encoding.

| Rule | File | Key Pattern |
|------|------|-------------|
| Image Analysis | rules/vision-image-analysis.md | Base64 encoding, multi-image, bounding boxes |
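
The resize-before-encoding step can be a simple Pillow thumbnail pass. A minimal sketch, where `encode_image` is a hypothetical helper and the 2048 px cap mirrors the Common Mistakes guidance below:

```python
import base64
from io import BytesIO

from PIL import Image  # pip install pillow

MAX_DIM = 2048  # cap taken from the Common Mistakes guidance below

def encode_image(path: str) -> str:
    """Downscale the longest side to MAX_DIM, then return base64-encoded PNG."""
    img = Image.open(path)
    img.thumbnail((MAX_DIM, MAX_DIM))  # in-place; no-op if already small enough
    buf = BytesIO()
    img.save(buf, format="PNG")
    return base64.standard_b64encode(buf.getvalue()).decode("utf-8")
```

The result drops straight into the `data` field of the image block shown in the Example section.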

## Vision: Document Understanding

Extract structured data from documents, charts, and PDFs using vision models.

| Rule | File | Key Pattern |
|------|------|-------------|
| Document Vision | rules/vision-document.md | PDF page ranges, detail levels, OCR strategies |
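
As a sketch of the PDF path, Anthropic's Messages API accepts a base64 document block alongside text; the file name and prompt here are placeholders:

```python
import anthropic, base64

client = anthropic.Anthropic()
with open("report.pdf", "rb") as f:  # placeholder file name
    pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=2048,
    messages=[{"role": "user", "content": [
        {"type": "document",
         "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64}},
        {"type": "text", "text": "Extract every table as CSV."}
    ]}],
)
print(response.content[0].text)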

## Vision: Model Selection

Choose the right vision provider based on accuracy, cost, and context window needs.

| Rule | File | Key Pattern |
|------|------|-------------|
| Vision Models | rules/vision-models.md | Provider comparison, token costs, image limits |
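
Image inputs are billed as tokens, so provider choice is largely arithmetic. A rough estimator: the per-image token count is an illustrative assumption (check each provider's docs), and the $0.15/M rate comes from the Key Decisions table below:

```python
def vision_request_cost(image_tokens: int, text_tokens: int,
                        usd_per_million_tokens: float) -> float:
    """Estimate input cost for one request: (image + text tokens) * rate."""
    return (image_tokens + text_tokens) / 1_000_000 * usd_per_million_tokens

# ~258 tokens/image is an assumed figure for a small image; verify per provider.
cost = vision_request_cost(image_tokens=258, text_tokens=500,
                           usd_per_million_tokens=0.15)  # Gemini 2.5 Flash rate
print(f"${cost:.6f} per request")
```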

## Audio: Speech-to-Text

Convert audio to text with speaker diarization, timestamps, and sentiment analysis.

| Rule | File | Key Pattern |
|------|------|-------------|
| Speech-to-Text | rules/audio-speech-to-text.md | Gemini long-form, GPT-4o-Transcribe, AssemblyAI features |
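
A minimal transcription sketch with the OpenAI SDK and the GPT-4o-Transcribe model named above; the audio file name is a placeholder, and diarization needs AssemblyAI or Gemini instead:

```python
from openai import OpenAI

client = OpenAI()
with open("meeting.mp3", "rb") as f:  # placeholder file name
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=f,
    )
print(transcript.text)
```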

## Audio: Text-to-Speech

Generate natural speech from text with voice selection and expressive cues.

| Rule | File | Key Pattern |
|------|------|-------------|
| Text-to-Speech | rules/audio-text-to-speech.md | Gemini TTS, voice config, auditory cues |
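
A minimal Gemini TTS sketch with the google-genai SDK; the model ID and voice name are assumptions that current SDKs may have revised:

```python
import wave

from google import genai  # pip install google-genai
from google.genai import types

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",  # assumed model ID; check current docs
    contents="Say warmly: welcome back!",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)
pcm = response.candidates[0].content.parts[0].inline_data.data
with wave.open("out.wav", "wb") as wf:  # assumes raw 24 kHz 16-bit mono PCM
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    wf.writeframes(pcm)
```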

## Audio: Model Selection

Select the right audio/voice provider for real-time, transcription, or TTS use cases.

| Rule | File | Key Pattern |
|------|------|-------------|
| Audio Models | rules/audio-models.md | Real-time voice comparison, STT benchmarks, pricing |

## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| High-accuracy vision | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost-efficient vision | Gemini 2.5 Flash ($0.15/M tokens) |
| Video analysis | Gemini 2.5/3 Pro (native video) |
| Voice assistant | Grok Voice Agent (fastest, <1s) |
| Emotional voice AI | Gemini Live API |
| Long audio transcription | Gemini 2.5 Pro (9.5 hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Self-hosted STT | Whisper Large V3 |
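
If you route requests programmatically, the table collapses into a lookup. The use-case keys below are hypothetical labels; the values are copied from the table:

```python
# Keys are hypothetical use-case labels; values come from the table above.
MODEL_PICKS: dict[str, str] = {
    "high_accuracy_vision": "Claude Opus 4.6",
    "long_documents": "Gemini 2.5 Pro",
    "cost_efficient_vision": "Gemini 2.5 Flash",
    "video_analysis": "Gemini 3 Pro",
    "voice_assistant": "Grok Voice Agent",
    "emotional_voice": "Gemini Live API",
    "long_audio_stt": "Gemini 2.5 Pro",
    "speaker_diarization": "AssemblyAI",
    "self_hosted_stt": "Whisper Large V3",
}

def pick_model(use_case: str) -> str:
    """Return the recommended model, failing loudly on unknown use cases."""
    try:
        return MODEL_PICKS[use_case]
    except KeyError:
        raise ValueError(f"No recommendation for use case: {use_case!r}")
```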

## Example

```python
import anthropic, base64

client = anthropic.Anthropic()
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Describe this image"}
    ]}]
)
```

## Common Mistakes

1. Not setting `max_tokens` on vision requests (responses truncated)
2. Sending oversized images without resizing (>2048px)
3. Using `high` detail level for simple yes/no classification (see the sketch after this list)
4. Using an STT+LLM+TTS pipeline instead of native speech-to-speech
5. Not leveraging barge-in support for natural voice conversations
6. Using deprecated models (GPT-4V, Whisper-1)
7. Ignoring rate limits on vision and audio endpoints
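
For mistakes 1-3 together, a sketch of a cheap yes/no classification call via OpenAI's Chat Completions image input; the model name and image URL are placeholders:

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-5",  # named in Key Decisions above; substitute your vision model
    max_completion_tokens=16,  # mistake 1: always bound the response
    messages=[{"role": "user", "content": [
        {"type": "image_url",
         "image_url": {"url": "https://example.com/photo.png",  # placeholder
                       "detail": "low"}},  # mistake 3: cheap coarse pass
        {"type": "text", "text": "Is there a person in this image? Answer yes or no."},
    ]}],
)
print(response.choices[0].message.content)
```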

## Related Skills

- rag-retrieval - Multimodal RAG with image + text retrieval
- llm-integration - General LLM function calling patterns
- streaming-api-patterns - WebSocket patterns for real-time audio