BLIP-2: Vision-Language Pre-training

Comprehensive guide to using Salesforce's BLIP-2 for vision-language tasks with frozen image encoders and large language models.
When to use BLIP-2
Use BLIP-2 when:
- Need high-quality image captioning with natural descriptions
- Building visual question answering (VQA) systems
- Require zero-shot image-text understanding without task-specific training
- Want to leverage LLM reasoning for visual tasks
- Building multimodal conversational AI
- Need image-text retrieval or matching
Key features:
- Q-Former architecture: Lightweight query transformer bridges vision and language
- Frozen backbone efficiency: No need to fine-tune large vision/language models
- Multiple LLM backends: OPT (2.7B, 6.7B) and FlanT5 (XL, XXL)
- Zero-shot capabilities: Strong performance without task-specific training
- Efficient training: Only trains Q-Former (~188M parameters)
- State-of-the-art results: Beats larger models on VQA benchmarks
Consider alternatives instead:
- LLaVA: For instruction-following multimodal chat
- InstructBLIP: For improved instruction-following (BLIP-2 successor)
- GPT-4V/Claude 3: For production multimodal chat (proprietary)
- CLIP: For simple image-text similarity without generation
- Flamingo: For few-shot visual learning
Quick start
Installation
```bash
# HuggingFace Transformers (recommended)
pip install transformers accelerate torch Pillow

# Or LAVIS library (Salesforce official)
pip install salesforce-lavis
```
Basic image captioning
```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load model and processor
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load image
image = Image.open("photo.jpg").convert("RGB")

# Generate caption
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```
Visual question answering
```python
# Ask a question about the image
question = "What color is the car in this image?"
inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(answer)
```
Using LAVIS library
```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_opt",
    model_type="pretrain_opt2.7b",
    is_eval=True,
    device=device
)

# Process image
raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Caption
caption = model.generate({"image": image})
print(caption)

# VQA
question = txt_processors["eval"]("What is in this image?")
answer = model.generate({"image": image, "prompt": question})
print(answer)
```
Core concepts
Architecture overview
```
BLIP-2 Architecture:
┌─────────────────────────────────────────────────────────────┐
│                          Q-Former                           │
│  ┌─────────────────────────────────────────────────────┐    │
│  │        Learned Queries (32 queries × 768 dim)       │    │
│  └────────────────────────┬────────────────────────────┘    │
│                           │                                 │
│  ┌────────────────────────▼────────────────────────────┐    │
│  │         Cross-Attention with Image Features         │    │
│  └────────────────────────┬────────────────────────────┘    │
│                           │                                 │
│  ┌────────────────────────▼────────────────────────────┐    │
│  │         Self-Attention Layers (Transformer)         │    │
│  └────────────────────────┬────────────────────────────┘    │
└───────────────────────────┼─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│  Frozen Vision Encoder    │          Frozen LLM             │
│  (ViT-G/14 from EVA-CLIP) │        (OPT or FlanT5)          │
└─────────────────────────────────────────────────────────────┘
```

Model variants
| Model | LLM Backend | Size | Use Case |
|---|---|---|---|
| blip2-opt-2.7b | OPT-2.7B | ~4GB | General captioning, VQA |
| blip2-opt-6.7b | OPT-6.7B | ~8GB | Better reasoning |
| blip2-flan-t5-xl | FlanT5-XL | ~5GB | Instruction following |
| blip2-flan-t5-xxl | FlanT5-XXL | ~13GB | Best quality |
Q-Former components
| Component | Description | Parameters |
|---|---|---|
| Learned queries | Fixed set of learnable embeddings | 32 × 768 |
| Image transformer | Cross-attention to vision features | ~108M |
| Text transformer | Self-attention for text | ~108M |
| Linear projection | Maps to LLM dimension | Varies |
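The shape bookkeeping above can be made concrete with a small sketch. This is a toy single-head cross-attention in numpy, not the actual Q-Former (which uses multi-head attention, learned key/value projections, and interleaved self-attention layers); the 257 × 1408 patch-feature shape is an illustrative stand-in for frozen ViT-G output. The point is that the 32 queries compress any number of patch tokens into a fixed [32, 768] summary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes (single image, no batch dim):
queries = rng.standard_normal((32, 768))        # learned query embeddings
patches = rng.standard_normal((257, 1408))      # stand-in frozen ViT features
W_kv = rng.standard_normal((1408, 768)) * 0.01  # toy projection to query dim

keys = patches @ W_kv                           # [257, 768]
scores = queries @ keys.T / np.sqrt(768)        # [32, 257] attention logits
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)         # softmax over patches
out = attn @ keys                               # [32, 768] fixed-size output

print(out.shape)  # (32, 768): same regardless of patch count
```

Because the LLM only ever sees these 32 vectors (after a linear projection to its hidden size), the vision side stays frozen and cheap to bridge.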
Advanced usage
Batch processing
```python
import torch
from PIL import Image

# Reuses the processor and model loaded in the quick start

# Load multiple images
images = [Image.open(f"image_{i}.jpg").convert("RGB") for i in range(4)]
questions = [
    "What is shown in this image?",
    "Describe the scene.",
    "What colors are prominent?",
    "Is there a person in this image?"
]

# Process batch
inputs = processor(
    images=images,
    text=questions,
    return_tensors="pt",
    padding=True
).to("cuda", torch.float16)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=50)
answers = processor.batch_decode(generated_ids, skip_special_tokens=True)
for q, a in zip(questions, answers):
    print(f"Q: {q}\nA: {a}\n")
```
Controlling generation
```python
# Control generation parameters
generated_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    min_length=20,
    num_beams=5,             # Beam search
    no_repeat_ngram_size=2,  # Avoid repetition
    top_p=0.9,               # Nucleus sampling
    temperature=0.7,         # Creativity
    do_sample=True,          # Enable sampling
)

# For deterministic output
generated_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    do_sample=False,
)
```
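The sampling knobs have standalone definitions worth seeing in isolation. A minimal numpy sketch of temperature scaling and nucleus (top-p) filtering on a toy next-token distribution; this mirrors the standard definitions, not Transformers' exact internal implementation:

```python
import numpy as np

logits = np.array([3.0, 2.0, 1.0, -1.0])  # toy next-token logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Temperature < 1 sharpens the distribution; > 1 flattens it
p = softmax(logits)
p_sharp = softmax(logits / 0.7)
p_flat = softmax(logits / 1.5)

# Nucleus (top-p) sampling keeps the smallest set of highest-probability
# tokens whose cumulative mass exceeds top_p, then renormalizes
def top_p_filter(probs, top_p=0.9):
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

print(top_p_filter(p, 0.9))  # low-probability tail zeroed out
```

With `do_sample=False` and `num_beams=5`, none of this applies: beam search deterministically keeps the 5 highest-scoring partial sequences instead of sampling.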
Memory optimization
```python
import torch
from transformers import Blip2ForConditionalGeneration, BitsAndBytesConfig

# 8-bit quantization
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b",
    quantization_config=quantization_config,
    device_map="auto"
)

# 4-bit quantization (more aggressive)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl",
    quantization_config=quantization_config,
    device_map="auto"
)
```
Image-text matching
```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

# Using LAVIS for ITM (Image-Text Matching)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_image_text_matching",
    model_type="pretrain",
    is_eval=True,
    device=device
)
raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a dog sitting on grass")

# Get matching score
itm_output = model({"image": image, "text_input": text}, match_head="itm")
itm_scores = torch.nn.functional.softmax(itm_output, dim=1)
print(f"Match probability: {itm_scores[:, 1].item():.3f}")
```
Feature extraction
```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

# Extract image features with Q-Former
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_feature_extractor",
    model_type="pretrain",
    is_eval=True,
    device=device
)
raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Get features
features = model.extract_features({"image": image}, mode="image")
image_embeds = features.image_embeds         # Shape: [1, 32, 768]
image_features = features.image_embeds_proj  # Projected for matching
```
Common workflows
Workflow 1: Image captioning pipeline
```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

class ImageCaptioner:
    def __init__(self, model_name="Salesforce/blip2-opt-2.7b"):
        self.processor = Blip2Processor.from_pretrained(model_name)
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def caption(self, image_path: str, prompt: str = None) -> str:
        image = Image.open(image_path).convert("RGB")
        if prompt:
            inputs = self.processor(images=image, text=prompt, return_tensors="pt")
        else:
            inputs = self.processor(images=image, return_tensors="pt")
        inputs = inputs.to("cuda", torch.float16)
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=50,
            num_beams=5
        )
        return self.processor.decode(generated_ids[0], skip_special_tokens=True)

    def caption_batch(self, image_paths: list, prompt: str = None) -> list:
        images = [Image.open(p).convert("RGB") for p in image_paths]
        if prompt:
            inputs = self.processor(
                images=images,
                text=[prompt] * len(images),
                return_tensors="pt",
                padding=True
            )
        else:
            inputs = self.processor(images=images, return_tensors="pt", padding=True)
        inputs = inputs.to("cuda", torch.float16)
        generated_ids = self.model.generate(**inputs, max_new_tokens=50)
        return self.processor.batch_decode(generated_ids, skip_special_tokens=True)
```

Usage
```python
captioner = ImageCaptioner()

# Single image
caption = captioner.caption("photo.jpg")
print(f"Caption: {caption}")

# With prompt for style
caption = captioner.caption("photo.jpg", "a detailed description of")
print(f"Detailed: {caption}")

# Batch processing
captions = captioner.caption_batch(["img1.jpg", "img2.jpg", "img3.jpg"])
for i, cap in enumerate(captions):
    print(f"Image {i+1}: {cap}")
```
Workflow 2: Visual Q&A system
```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

class VisualQA:
    def __init__(self, model_name="Salesforce/blip2-flan-t5-xl"):
        self.processor = Blip2Processor.from_pretrained(model_name)
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.current_image = None

    def set_image(self, image_path: str):
        """Load image for multiple questions."""
        self.current_image = Image.open(image_path).convert("RGB")

    def ask(self, question: str) -> str:
        """Ask a question about the current image."""
        if self.current_image is None:
            raise ValueError("No image set. Call set_image() first.")
        # Format question for FlanT5
        prompt = f"Question: {question} Answer:"
        inputs = self.processor(
            images=self.current_image,
            text=prompt,
            return_tensors="pt"
        ).to("cuda", torch.float16)
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=50,
            num_beams=5
        )
        return self.processor.decode(generated_ids[0], skip_special_tokens=True)

    def ask_multiple(self, questions: list) -> dict:
        """Ask multiple questions about the current image."""
        return {q: self.ask(q) for q in questions}
```

Usage
```python
vqa = VisualQA()
vqa.set_image("scene.jpg")

# Ask questions
print(vqa.ask("What objects are in this image?"))
print(vqa.ask("What is the weather like?"))
print(vqa.ask("How many people are there?"))

# Batch questions
results = vqa.ask_multiple([
    "What is the main subject?",
    "What colors are dominant?",
    "Is this indoors or outdoors?"
])
```
Workflow 3: Image search/retrieval
```python
import torch
import numpy as np
from PIL import Image
from lavis.models import load_model_and_preprocess

class ImageSearchEngine:
    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model, self.vis_processors, self.txt_processors = load_model_and_preprocess(
            name="blip2_feature_extractor",
            model_type="pretrain",
            is_eval=True,
            device=self.device
        )
        self.image_features = []
        self.image_paths = []

    def index_images(self, image_paths: list):
        """Build index from images."""
        self.image_paths = image_paths
        for path in image_paths:
            image = Image.open(path).convert("RGB")
            image = self.vis_processors["eval"](image).unsqueeze(0).to(self.device)
            with torch.no_grad():
                features = self.model.extract_features({"image": image}, mode="image")
            # Use projected features for matching
            self.image_features.append(
                features.image_embeds_proj.mean(dim=1).cpu().numpy()
            )
        self.image_features = np.vstack(self.image_features)

    def search(self, query: str, top_k: int = 5) -> list:
        """Search images by text query."""
        # Get text features
        text = self.txt_processors["eval"](query)
        text_input = {"text_input": [text]}
        with torch.no_grad():
            text_features = self.model.extract_features(text_input, mode="text")
        text_embeds = text_features.text_embeds_proj[:, 0].cpu().numpy()
        # Compute similarities
        similarities = np.dot(self.image_features, text_embeds.T).squeeze()
        top_indices = np.argsort(similarities)[::-1][:top_k]
        return [(self.image_paths[i], similarities[i]) for i in top_indices]
```

Usage
```python
engine = ImageSearchEngine()
engine.index_images(["img1.jpg", "img2.jpg", "img3.jpg", ...])

# Search
results = engine.search("a sunset over the ocean", top_k=5)
for path, score in results:
    print(f"{path}: {score:.3f}")
```
Output format
Generation output
```python
# Direct generation returns token IDs
generated_ids = model.generate(**inputs, max_new_tokens=50)
# Shape: [batch_size, sequence_length]

# Decode to text
text = processor.batch_decode(generated_ids, skip_special_tokens=True)
# Returns: list of strings
```
Feature extraction output
```python
# Q-Former outputs
features = model.extract_features({"image": image}, mode="image")
features.image_embeds       # [B, 32, 768] - Q-Former outputs
features.image_embeds_proj  # [B, 32, 256] - Projected for matching
features.text_embeds        # [B, seq_len, 768] - Text features
features.text_embeds_proj   # [B, 256] - Projected text (CLS)
```
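These projected embeddings are what contrastive image-text scoring operates on: BLIP-2 scores a pair by taking, for each text, the maximum cosine similarity over the 32 projected query outputs. A minimal numpy sketch of that scoring rule, using random stand-in features with the shapes listed above:

```python
import numpy as np

rng = np.random.default_rng(1)

image_embeds_proj = rng.standard_normal((1, 32, 256))  # [B, 32, 256]
text_embeds_proj = rng.standard_normal((1, 256))       # [B, 256] (CLS)

# L2-normalize so dot products are cosine similarities
img = image_embeds_proj / np.linalg.norm(image_embeds_proj, axis=-1, keepdims=True)
txt = text_embeds_proj / np.linalg.norm(text_embeds_proj, axis=-1, keepdims=True)

# Similarity of each of the 32 query outputs to the text, then max
sims = np.einsum("bqd,bd->bq", img, txt)  # [B, 32]
itc_score = sims.max(axis=1)              # [B]
print(itc_score.shape)  # (1,)
```

The ImageSearchEngine workflow above approximates the same idea with a mean over the 32 query outputs instead of a max, which is simpler to precompute for an index.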
Performance optimization
GPU memory requirements
| Model | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|---|---|---|---|
| blip2-opt-2.7b | ~8GB | ~5GB | ~3GB |
| blip2-opt-6.7b | ~16GB | ~9GB | ~5GB |
| blip2-flan-t5-xl | ~10GB | ~6GB | ~4GB |
| blip2-flan-t5-xxl | ~26GB | ~14GB | ~8GB |
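The rows above roughly follow from parameter count times bytes per parameter; measured numbers run higher because of activations, the KV cache, and CUDA overhead. A back-of-the-envelope sketch (parameter counts are approximate):

```python
# Rough weight-memory estimate: params * bytes per param.
# Real usage adds activations, KV cache, and CUDA overhead.
def weight_gb(params_billions, bytes_per_param):
    return params_billions * 1e9 * bytes_per_param / 1024**3

# blip2-opt-6.7b also carries the ViT-G encoder (~1B) and Q-Former (~0.2B)
total_params = 6.7 + 1.0 + 0.2  # billions, approximate

for label, nbytes in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{label}: ~{weight_gb(total_params, nbytes):.1f} GB weights")
```

Quantization only shrinks the LLM weights in practice, so savings are somewhat smaller than the pure arithmetic suggests.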
Speed optimization
```python
# Use Flash Attention if available
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # Requires flash-attn
    device_map="auto"
)

# Compile model (PyTorch 2.0+)
model = torch.compile(model)

# Use smaller images (if quality allows)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
# Default is 224x224, which is optimal
```
Common issues
| Issue | Solution |
|---|---|
| CUDA OOM | Use INT8/INT4 quantization, smaller model |
| Slow generation | Use greedy decoding, reduce max_new_tokens |
| Poor captions | Try FlanT5 variant, use prompts |
| Hallucinations | Lower temperature, use beam search |
| Wrong answers | Rephrase question, provide context |
References
- Advanced Usage - Fine-tuning, integration, deployment
- Troubleshooting - Common issues and solutions
Resources
- Paper: https://arxiv.org/abs/2301.12597
- GitHub (LAVIS): https://github.com/salesforce/LAVIS
- HuggingFace: https://huggingface.co/Salesforce/blip2-opt-2.7b
- Demo: https://huggingface.co/spaces/Salesforce/BLIP2
- InstructBLIP: https://arxiv.org/abs/2305.06500 (successor)