vision-multimodal
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseVision & Multimodal Skill
视觉与多模态Skill
Leverage Claude's vision capabilities for image analysis, document processing, and multimodal understanding.
利用Claude的视觉能力进行图像分析、文档处理和多模态理解。
When to Use This Skill
何时使用该Skill
- Image analysis and description
- Document/PDF processing
- Screenshot analysis
- OCR-like text extraction
- Visual comparison
- Chart and diagram interpretation
- 图像分析与描述
- 文档/PDF处理
- 截图分析
- 类OCR文本提取
- 视觉对比
- 图表与示意图解读
Supported Formats
支持的格式
| Format | Status | Best For |
|---|---|---|
| JPEG | ✓ | Photos, natural scenes |
| PNG | ✓ | Screenshots, UI, text |
| GIF | ✓ | Animated (first frame) |
| WebP | ✓ | Modern, compressed |
| ✓ | Documents (via Files API) |
| 格式 | 状态 | 最佳适用场景 |
|---|---|---|
| JPEG | ✓ | 照片、自然场景 |
| PNG | ✓ | 截图、UI界面、文本 |
| GIF | ✓ | 动图(仅分析第一帧) |
| WebP | ✓ | 现代压缩格式 |
| ✓ | 文档(通过Files API) |
Image Size Guidelines
图像尺寸指南
- Minimum: 200 pixels (smaller = reduced accuracy)
- Optimal: 1000x1000 pixels
- Maximum: 8000x8000 pixels
- Token cost: ~(width × height) / 1000
- Tip: Resize to 1568px max dimension for 30-50% token savings
- 最小尺寸: 200像素(尺寸越小,准确率越低)
- 最佳尺寸: 1000x1000像素
- 最大尺寸: 8000x8000像素
- Token成本: ~(宽度 × 高度) / 1000
- 提示: 将最大维度调整为1568px可节省30-50%的Token
Core Patterns
核心使用模式
Pattern 1: Single Image Analysis
模式1:单图分析
python
import anthropic
import base64
client = anthropic.Anthropic()python
import anthropic
import base64
client = anthropic.Anthropic()Load and encode image
Load and encode image
with open("image.jpg", "rb") as f:
image_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": image_data
}
},
{
"type": "text",
"text": "Describe this image in detail."
}
]
}]
)
undefinedwith open("image.jpg", "rb") as f:
image_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": image_data
}
},
{
"type": "text",
"text": "Describe this image in detail."
}
]
}]
)
undefinedPattern 2: Image from URL
模式2:从URL获取图像
python
import httpxpython
import httpxFetch and encode from URL
Fetch and encode from URL
image_url = "https://example.com/image.jpg"
response = httpx.get(image_url)
image_data = base64.standard_b64encode(response.content).decode("utf-8")
image_url = "https://example.com/image.jpg"
response = httpx.get(image_url)
image_data = base64.standard_b64encode(response.content).decode("utf-8")
Then use same pattern as above
Then use same pattern as above
undefinedundefinedPattern 3: Multiple Images
模式3:多图处理
python
undefinedpython
undefinedCompare multiple images (up to 100 per request)
Compare multiple images (up to 100 per request)
messages = [{
"role": "user",
"content": [
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image1}},
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image2}},
{"type": "text", "text": "Compare these two images and list the differences."}
]
}]
undefinedmessages = [{
"role": "user",
"content": [
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image1}},
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image2}},
{"type": "text", "text": "Compare these two images and list the differences."}
]
}]
undefinedPattern 4: Few-Shot with Images
模式4:带图像的少样本示例
python
undefinedpython
undefinedTeach by example
Teach by example
messages = [
# Example 1
{"role": "user", "content": [
{"type": "image", "source": {...}},
{"type": "text", "text": "Classify this image."}
]},
{"role": "assistant", "content": "Category: Landscape\nElements: Mountains, lake, trees"},
# Example 2
{"role": "user", "content": [
{"type": "image", "source": {...}},
{"type": "text", "text": "Classify this image."}
]},
{"role": "assistant", "content": "Category: Portrait\nElements: Person, indoor, professional"},
# Target image
{"role": "user", "content": [
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": target_image}},
{"type": "text", "text": "Classify this image."}
]}]
undefinedmessages = [
# Example 1
{"role": "user", "content": [
{"type": "image", "source": {...}},
{"type": "text", "text": "Classify this image."}
]},
{"role": "assistant", "content": "Category: Landscape\nElements: Mountains, lake, trees"},
# Example 2
{"role": "user", "content": [
{"type": "image", "source": {...}},
{"type": "text", "text": "Classify this image."}
]},
{"role": "assistant", "content": "Category: Portrait\nElements: Person, indoor, professional"},
# Target image
{"role": "user", "content": [
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": target_image}},
{"type": "text", "text": "Classify this image."}
]}]
undefinedPattern 5: PDF Processing
模式5:PDF处理
python
undefinedpython
undefinedUsing Files API (beta)
Using Files API (beta)
with open("document.pdf", "rb") as f:
pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4096,
messages=[{
"role": "user",
"content": [
{
"type": "document",
"source": {
"type": "base64",
"media_type": "application/pdf",
"data": pdf_data
}
},
{"type": "text", "text": "Summarize this document."}
]
}]
)
undefinedwith open("document.pdf", "rb") as f:
pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4096,
messages=[{
"role": "user",
"content": [
{
"type": "document",
"source": {
"type": "base64",
"media_type": "application/pdf",
"data": pdf_data
}
},
{
"type": "text",
"text": "Summarize this document."}
}
]
}]
)
undefinedPrompt Engineering for Vision
视觉任务的提示词工程
Strategy 1: Role Assignment
策略1:角色分配
python
prompt = """You have perfect vision and exceptional attention to detail,
making you an expert at analyzing technical diagrams.
Analyze this architecture diagram and identify:
1. All components
2. Data flow between components
3. Potential bottlenecks"""python
prompt = """You have perfect vision and exceptional attention to detail,
making you an expert at analyzing technical diagrams.
Analyze this architecture diagram and identify:
1. All components
2. Data flow between components
3. Potential bottlenecks"""Strategy 2: Step-by-Step Thinking
策略2:分步思考
python
prompt = """Before answering, analyze the image systematically:
<thinking>
1. What is the overall subject?
2. What are the key elements?
3. How do elements relate to each other?
4. What details stand out?
</thinking>
Then provide your answer based on this analysis."""python
prompt = """Before answering, analyze the image systematically:
<thinking>
1. What is the overall subject?
2. What are the key elements?
3. How do elements relate to each other?
4. What details stand out?
</thinking>
Then provide your answer based on this analysis."""Strategy 3: Structured Output
策略3:结构化输出
python
prompt = """Extract information from this receipt and return as JSON:
{
"vendor": "",
"date": "",
"items": [{"name": "", "price": 0}],
"total": 0
}"""python
prompt = """Extract information from this receipt and return as JSON:
{
"vendor": "",
"date": "",
"items": [{"name": "", "price": 0}],
"total": 0
}"""Image Optimization
图像优化
python
from PIL import Image
import io
def optimize_for_claude(image_path, max_dimension=1568):
"""Resize image to reduce token usage by 30-50%"""
with Image.open(image_path) as img:
# Calculate new dimensions
ratio = min(max_dimension / img.width, max_dimension / img.height)
if ratio < 1:
new_size = (int(img.width * ratio), int(img.height * ratio))
img = img.resize(new_size, Image.LANCZOS)
# Convert to bytes
buffer = io.BytesIO()
img.save(buffer, format="JPEG", quality=85)
return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")python
from PIL import Image
import io
def optimize_for_claude(image_path, max_dimension=1568):
"""Resize image to reduce token usage by 30-50%"""
with Image.open(image_path) as img:
# Calculate new dimensions
ratio = min(max_dimension / img.width, max_dimension / img.height)
if ratio < 1:
new_size = (int(img.width * ratio), int(img.height * ratio))
img = img.resize(new_size, Image.LANCZOS)
# Convert to bytes
buffer = io.BytesIO()
img.save(buffer, format="JPEG", quality=85)
return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")Common Use Cases
常见使用场景
Text Extraction (OCR-like)
文本提取(类OCR)
python
prompt = """Extract all text from this image.
Preserve the original formatting and structure as much as possible.
If text is unclear, indicate with [unclear]."""python
prompt = """Extract all text from this image.
Preserve the original formatting and structure as much as possible.
If text is unclear, indicate with [unclear]."""Table Extraction
表格提取
python
prompt = """Extract the table data from this image.
Return as a markdown table with proper headers and alignment."""python
prompt = """Extract the table data from this image.
Return as a markdown table with proper headers and alignment."""Chart Analysis
图表分析
python
prompt = """Analyze this chart:
1. What type of chart is this?
2. What are the axes/labels?
3. What are the key data points?
4. What trends or patterns are visible?"""python
prompt = """Analyze this chart:
1. What type of chart is this?
2. What are the axes/labels?
3. What are the key data points?
4. What trends or patterns are visible?"""Best Practices
最佳实践
DO:
建议:
- Use high-quality images (≥1000px)
- Resize large images to save tokens
- Provide context about what to look for
- Use few-shot examples for consistent output
- 使用高质量图像(≥1000px)
- 调整大尺寸图像以节省Token
- 提供分析的上下文信息
- 使用少样本示例保证输出一致性
DON'T:
避免:
- Send images smaller than 200px
- Expect perfect OCR for handwriting
- Send very large images (>8000px)
- Ignore token costs for multiple images
- 发送小于200px的图像
- 期望手写文本的完美OCR识别
- 发送超过8000px的超大图像
- 忽略多图处理的Token成本
Limitations
局限性
- Cannot identify specific individuals
- May struggle with very small text
- Animated GIFs: only first frame analyzed
- Some specialized symbols may be misread
- 无法识别特定个人
- 可能难以识别极小文本
- 动图GIF:仅分析第一帧
- 部分特殊符号可能被误读
See Also
相关链接
- [[llm-integration]] - API basics
- [[extended-thinking]] - Complex reasoning
- [[citations-retrieval]] - Document citations
- [[llm-integration]] - API基础
- [[extended-thinking]] - 复杂推理
- [[citations-retrieval]] - 文档引用