BLIP-2: Vision-Language Pre-training

Comprehensive guide to using Salesforce's BLIP-2 for vision-language tasks with frozen image encoders and large language models.

When to use BLIP-2

Use BLIP-2 when:
  • Need high-quality image captioning with natural descriptions
  • Building visual question answering (VQA) systems
  • Require zero-shot image-text understanding without task-specific training
  • Want to leverage LLM reasoning for visual tasks
  • Building multimodal conversational AI
  • Need image-text retrieval or matching
Key features:
  • Q-Former architecture: Lightweight query transformer bridges vision and language
  • Frozen backbone efficiency: No need to fine-tune large vision/language models
  • Multiple LLM backends: OPT (2.7B, 6.7B) and FlanT5 (XL, XXL)
  • Zero-shot capabilities: Strong performance without task-specific training
  • Efficient training: Only trains Q-Former (~188M parameters)
  • State-of-the-art results: Beats larger models on VQA benchmarks
Use alternatives instead:
  • LLaVA: For instruction-following multimodal chat
  • InstructBLIP: For improved instruction-following (BLIP-2 successor)
  • GPT-4V/Claude 3: For production multimodal chat (proprietary)
  • CLIP: For simple image-text similarity without generation
  • Flamingo: For few-shot visual learning

Quick start

Installation

安装

```bash
# HuggingFace Transformers (recommended)
pip install transformers accelerate torch Pillow

# Or the LAVIS library (Salesforce official)
pip install salesforce-lavis
```

Basic image captioning

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load model and processor
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load image
image = Image.open("photo.jpg").convert("RGB")

# Generate caption
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```

Visual question answering

```python
# Ask a question about the image
question = "What color is the car in this image?"
inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(answer)
```

Using LAVIS library

```python
import torch
from lavis.models import load_model_and_preprocess
from PIL import Image

# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

# Process image
raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Caption
caption = model.generate({"image": image})
print(caption)

# VQA
question = txt_processors["eval"]("What is in this image?")
answer = model.generate({"image": image, "prompt": question})
print(answer)
```

Core concepts

Architecture overview

BLIP-2 Architecture:
┌─────────────────────────────────────────────────────────────┐
│                        Q-Former                              │
│  ┌─────────────────────────────────────────────────────┐    │
│  │     Learned Queries (32 queries × 768 dim)          │    │
│  └────────────────────────┬────────────────────────────┘    │
│                           │                                  │
│  ┌────────────────────────▼────────────────────────────┐    │
│  │    Cross-Attention with Image Features               │    │
│  └────────────────────────┬────────────────────────────┘    │
│                           │                                  │
│  ┌────────────────────────▼────────────────────────────┐    │
│  │    Self-Attention Layers (Transformer)               │    │
│  └────────────────────────┬────────────────────────────┘    │
└───────────────────────────┼─────────────────────────────────┘
┌───────────────────────────▼─────────────────────────────────┐
│  Frozen Vision Encoder    │      Frozen LLM                  │
│  (ViT-G/14 from EVA-CLIP) │      (OPT or FlanT5)            │
└─────────────────────────────────────────────────────────────┘

Model variants

| Model | LLM backend | Size | Use case |
|---|---|---|---|
| blip2-opt-2.7b | OPT-2.7B | ~4GB | General captioning, VQA |
| blip2-opt-6.7b | OPT-6.7B | ~8GB | Better reasoning |
| blip2-flan-t5-xl | FlanT5-XL | ~5GB | Instruction following |
| blip2-flan-t5-xxl | FlanT5-XXL | ~13GB | Best quality |
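The variants above trade size for quality. As a minimal illustrative sketch (the `VARIANTS` table and `pick_variant` helper are ours, not part of any BLIP-2 API; sizes are copied from the table above), checkpoint selection can be automated against a size budget:

```python
# Hypothetical lookup built from the variants table above.
VARIANTS = {
    "Salesforce/blip2-opt-2.7b":    {"size_gb": 4,  "best_for": "captioning"},
    "Salesforce/blip2-opt-6.7b":    {"size_gb": 8,  "best_for": "reasoning"},
    "Salesforce/blip2-flan-t5-xl":  {"size_gb": 5,  "best_for": "instructions"},
    "Salesforce/blip2-flan-t5-xxl": {"size_gb": 13, "best_for": "quality"},
}

def pick_variant(best_for, max_size_gb):
    """Return the checkpoint matching the task that fits the budget, or None."""
    for name, info in VARIANTS.items():
        if info["best_for"] == best_for and info["size_gb"] <= max_size_gb:
            return name
    return None

print(pick_variant("instructions", 6))  # Salesforce/blip2-flan-t5-xl
```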

Q-Former components

| Component | Description | Parameters |
|---|---|---|
| Learned queries | Fixed set of learnable embeddings | 32 × 768 |
| Image transformer | Cross-attention to vision features | ~108M |
| Text transformer | Self-attention for text | ~108M |
| Linear projection | Maps to LLM dimension | Varies |
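The key mechanism is that the 32 learned queries cross-attend to however many frozen image features the ViT produces, always yielding a fixed-size [32, 768] summary. A toy NumPy sketch of single-head cross-attention (illustrative only, not the real Q-Former; the patch count here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

num_queries, dim = 32, 768       # learned queries, as in the table above
num_patches = 257                # stand-in for the frozen ViT's patch features

queries = rng.standard_normal((num_queries, dim))      # learnable in training
image_feats = rng.standard_normal((num_patches, dim))  # frozen ViT output

# Cross-attention: each query attends over all image patches.
scores = queries @ image_feats.T / np.sqrt(dim)            # [32, 257]
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)             # softmax over patches
output = weights @ image_feats                             # [32, 768]

print(output.shape)  # fixed-size summary, regardless of num_patches
```

Whatever the input resolution, the LLM only ever sees these 32 vectors, which is why the Q-Former (~188M parameters) is the only part that needs training.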

Advanced usage

Batch processing

```python
from PIL import Image
import torch

# Load multiple images and one question per image
images = [Image.open(f"image_{i}.jpg").convert("RGB") for i in range(4)]
questions = [
    "What is shown in this image?",
    "Describe the scene.",
    "What colors are prominent?",
    "Is there a person in this image?",
]

# Process the batch
inputs = processor(
    images=images, text=questions, return_tensors="pt", padding=True
).to("cuda", torch.float16)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=50)
answers = processor.batch_decode(generated_ids, skip_special_tokens=True)
for q, a in zip(questions, answers):
    print(f"Q: {q}\nA: {a}\n")
```

Controlling generation

```python
# Control generation parameters
generated_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    min_length=20,
    num_beams=5,              # beam search
    no_repeat_ngram_size=2,   # avoid repetition
    top_p=0.9,                # nucleus sampling
    temperature=0.7,          # creativity
    do_sample=True,           # enable sampling
)

# For deterministic output
generated_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    do_sample=False,
)
```

Memory optimization

```python
from transformers import BitsAndBytesConfig

# 8-bit quantization
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b",
    quantization_config=quantization_config,
    device_map="auto"
)

# 4-bit quantization (more aggressive)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl",
    quantization_config=quantization_config,
    device_map="auto"
)
```

Image-text matching

```python
from lavis.models import load_model_and_preprocess

# Using LAVIS for ITM (image-text matching)
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_image_text_matching", model_type="pretrain", is_eval=True, device=device
)
raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a dog sitting on grass")

# Get matching score
itm_output = model({"image": image, "text_input": text}, match_head="itm")
itm_scores = torch.nn.functional.softmax(itm_output, dim=1)
print(f"Match probability: {itm_scores[:, 1].item():.3f}")
```

Feature extraction

```python
from lavis.models import load_model_and_preprocess

# Extract image features with the Q-Former
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)
raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Get features
features = model.extract_features({"image": image}, mode="image")
image_embeds = features.image_embeds          # shape: [1, 32, 768]
image_features = features.image_embeds_proj   # projected for matching
```

Common workflows

Workflow 1: Image captioning pipeline

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

class ImageCaptioner:
    def __init__(self, model_name="Salesforce/blip2-opt-2.7b"):
        self.processor = Blip2Processor.from_pretrained(model_name)
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def caption(self, image_path: str, prompt: str = None) -> str:
        image = Image.open(image_path).convert("RGB")

        if prompt:
            inputs = self.processor(images=image, text=prompt, return_tensors="pt")
        else:
            inputs = self.processor(images=image, return_tensors="pt")

        inputs = inputs.to("cuda", torch.float16)

        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=50,
            num_beams=5
        )

        return self.processor.decode(generated_ids[0], skip_special_tokens=True)

    def caption_batch(self, image_paths: list, prompt: str = None) -> list:
        images = [Image.open(p).convert("RGB") for p in image_paths]

        if prompt:
            inputs = self.processor(
                images=images,
                text=[prompt] * len(images),
                return_tensors="pt",
                padding=True
            )
        else:
            inputs = self.processor(images=images, return_tensors="pt", padding=True)

        inputs = inputs.to("cuda", torch.float16)

        generated_ids = self.model.generate(**inputs, max_new_tokens=50)
        return self.processor.batch_decode(generated_ids, skip_special_tokens=True)
```

Usage

```python
captioner = ImageCaptioner()

# Single image
caption = captioner.caption("photo.jpg")
print(f"Caption: {caption}")

# With a prompt for style
caption = captioner.caption("photo.jpg", "a detailed description of")
print(f"Detailed: {caption}")

# Batch processing
captions = captioner.caption_batch(["img1.jpg", "img2.jpg", "img3.jpg"])
for i, cap in enumerate(captions):
    print(f"Image {i+1}: {cap}")
```

Workflow 2: Visual Q&A system

```python
class VisualQA:
    def __init__(self, model_name="Salesforce/blip2-flan-t5-xl"):
        self.processor = Blip2Processor.from_pretrained(model_name)
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.current_image = None

    def set_image(self, image_path: str):
        """Load image for multiple questions."""
        self.current_image = Image.open(image_path).convert("RGB")

    def ask(self, question: str) -> str:
        """Ask a question about the current image."""
        if self.current_image is None:
            raise ValueError("No image set. Call set_image() first.")

        # Format question for FlanT5
        prompt = f"Question: {question} Answer:"

        inputs = self.processor(
            images=self.current_image,
            text=prompt,
            return_tensors="pt"
        ).to("cuda", torch.float16)

        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=50,
            num_beams=5
        )

        return self.processor.decode(generated_ids[0], skip_special_tokens=True)

    def ask_multiple(self, questions: list) -> dict:
        """Ask multiple questions about the current image."""
        return {q: self.ask(q) for q in questions}
```

Usage

```python
vqa = VisualQA()
vqa.set_image("scene.jpg")

# Ask questions
print(vqa.ask("What objects are in this image?"))
print(vqa.ask("What is the weather like?"))
print(vqa.ask("How many people are there?"))

# Batch questions
results = vqa.ask_multiple([
    "What is the main subject?",
    "What colors are dominant?",
    "Is this indoors or outdoors?",
])
```

Workflow 3: Image search/retrieval

```python
import torch
import numpy as np
from PIL import Image
from lavis.models import load_model_and_preprocess

class ImageSearchEngine:
    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model, self.vis_processors, self.txt_processors = load_model_and_preprocess(
            name="blip2_feature_extractor",
            model_type="pretrain",
            is_eval=True,
            device=self.device
        )
        self.image_features = []
        self.image_paths = []

    def index_images(self, image_paths: list):
        """Build index from images."""
        self.image_paths = image_paths

        for path in image_paths:
            image = Image.open(path).convert("RGB")
            image = self.vis_processors["eval"](image).unsqueeze(0).to(self.device)

            with torch.no_grad():
                features = self.model.extract_features({"image": image}, mode="image")
                # Use projected features for matching
                self.image_features.append(
                    features.image_embeds_proj.mean(dim=1).cpu().numpy()
                )

        self.image_features = np.vstack(self.image_features)

    def search(self, query: str, top_k: int = 5) -> list:
        """Search images by text query."""
        # Get text features
        text = self.txt_processors["eval"](query)
        text_input = {"text_input": [text]}

        with torch.no_grad():
            text_features = self.model.extract_features(text_input, mode="text")
            text_embeds = text_features.text_embeds_proj[:, 0].cpu().numpy()

        # Compute similarities
        similarities = np.dot(self.image_features, text_embeds.T).squeeze()
        top_indices = np.argsort(similarities)[::-1][:top_k]

        return [(self.image_paths[i], similarities[i]) for i in top_indices]
```

Usage

```python
engine = ImageSearchEngine()
engine.index_images(["img1.jpg", "img2.jpg", "img3.jpg", ...])

# Search
results = engine.search("a sunset over the ocean", top_k=5)
for path, score in results:
    print(f"{path}: {score:.3f}")
```
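Note that ranking by raw dot product only equals cosine similarity when the embeddings are unit-normalized. A small standalone helper (illustrative NumPy, independent of LAVIS; `rank_by_cosine` is our name) that normalizes both sides before ranking:

```python
import numpy as np

def rank_by_cosine(image_feats, text_feat, top_k=5):
    """Rank rows of image_feats [N, D] against text_feat [D] by cosine similarity."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feat / np.linalg.norm(text_feat)
    sims = img @ txt                          # [N] cosine similarities in [-1, 1]
    order = np.argsort(sims)[::-1][:top_k]    # best matches first
    return [(int(i), float(sims[i])) for i in order]
```

Normalizing makes scores comparable across queries, which matters if you later apply a fixed relevance threshold.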

Output format

Generation output

```python
# Direct generation returns token IDs
generated_ids = model.generate(**inputs, max_new_tokens=50)
# Shape: [batch_size, sequence_length]

# Decode to text
text = processor.batch_decode(generated_ids, skip_special_tokens=True)
# Returns: list of strings
```

Feature extraction output

```python
# Q-Former outputs
features = model.extract_features({"image": image}, mode="image")
features.image_embeds        # [B, 32, 768] - Q-Former outputs
features.image_embeds_proj   # [B, 32, 256] - projected for matching
features.text_embeds         # [B, seq_len, 768] - text features
features.text_embeds_proj    # [B, 256] - projected text (CLS)
```

Performance optimization

GPU memory requirements

| Model | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|---|---|---|---|
| blip2-opt-2.7b | ~8GB | ~5GB | ~3GB |
| blip2-opt-6.7b | ~16GB | ~9GB | ~5GB |
| blip2-flan-t5-xl | ~10GB | ~6GB | ~4GB |
| blip2-flan-t5-xxl | ~26GB | ~14GB | ~8GB |
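A quick way to use this table programmatically: given a VRAM budget, pick the highest precision that fits. This is an illustrative sketch (the `VRAM_GB` dict and `best_precision` helper are ours; values are copied from the table above, and real usage varies with batch size, image resolution, and generation length):

```python
# Hypothetical lookup built from the VRAM table above.
VRAM_GB = {
    "blip2-opt-2.7b":    {"fp16": 8,  "int8": 5,  "int4": 3},
    "blip2-opt-6.7b":    {"fp16": 16, "int8": 9,  "int4": 5},
    "blip2-flan-t5-xl":  {"fp16": 10, "int8": 6,  "int4": 4},
    "blip2-flan-t5-xxl": {"fp16": 26, "int8": 14, "int4": 8},
}

def best_precision(model_name, budget_gb):
    """Return the highest precision that fits the VRAM budget, or None."""
    for precision in ("fp16", "int8", "int4"):  # prefer higher precision
        if VRAM_GB[model_name][precision] <= budget_gb:
            return precision
    return None

print(best_precision("blip2-opt-6.7b", 12))  # int8
```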

Speed optimization

```python
# Use Flash Attention if available (requires the flash-attn package)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)

# Compile the model (PyTorch 2.0+)
model = torch.compile(model)

# Use smaller images (if quality allows)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
# Default input resolution is 224x224, which is optimal
```

Common issues

| Issue | Solution |
|---|---|
| CUDA OOM | Use INT8/INT4 quantization or a smaller model |
| Slow generation | Use greedy decoding, reduce max_new_tokens |
| Poor captions | Try a FlanT5 variant, use prompts |
| Hallucinations | Lower temperature, use beam search |
| Wrong answers | Rephrase the question, provide context |

References

  • Advanced Usage - Fine-tuning, integration, deployment
  • Troubleshooting - Common issues and solutions

Resources