BLIP-2: Vision-Language Pre-training

Comprehensive guide to using Salesforce's BLIP-2 for vision-language tasks with frozen image encoders and large language models.

When to use BLIP-2

Use BLIP-2 when:
  • Need high-quality image captioning with natural descriptions
  • Building visual question answering (VQA) systems
  • Require zero-shot image-text understanding without task-specific training
  • Want to leverage LLM reasoning for visual tasks
  • Building multimodal conversational AI
  • Need image-text retrieval or matching
Key features:
  • Q-Former architecture: Lightweight query transformer bridges vision and language
  • Frozen backbone efficiency: No need to fine-tune large vision/language models
  • Multiple LLM backends: OPT (2.7B, 6.7B) and FlanT5 (XL, XXL)
  • Zero-shot capabilities: Strong performance without task-specific training
  • Efficient training: Only trains Q-Former (~188M parameters)
  • State-of-the-art results: Beats larger models on VQA benchmarks
Use alternatives instead:
  • LLaVA: For instruction-following multimodal chat
  • InstructBLIP: For improved instruction-following (BLIP-2 successor)
  • GPT-4V/Claude 3: For production multimodal chat (proprietary)
  • CLIP: For simple image-text similarity without generation
  • Flamingo: For few-shot visual learning

Quick start

Installation

安装

```bash
# HuggingFace Transformers (recommended)
pip install transformers accelerate torch Pillow

# Or the LAVIS library (Salesforce official)
pip install salesforce-lavis
```

Basic image captioning

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load model and processor
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load image
image = Image.open("photo.jpg").convert("RGB")

# Generate caption
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```

Visual question answering

```python
# Ask a question about the image
question = "What color is the car in this image?"
inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(answer)
```

Using LAVIS library

```python
import torch
from lavis.models import load_model_and_preprocess
from PIL import Image

# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

# Process image
raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Caption
caption = model.generate({"image": image})
print(caption)

# VQA
question = txt_processors["eval"]("What is in this image?")
answer = model.generate({"image": image, "prompt": question})
print(answer)
```

Core concepts

Architecture overview

BLIP-2 Architecture:
┌─────────────────────────────────────────────────────────────┐
│                        Q-Former                              │
│  ┌─────────────────────────────────────────────────────┐    │
│  │     Learned Queries (32 queries × 768 dim)          │    │
│  └────────────────────────┬────────────────────────────┘    │
│                           │                                  │
│  ┌────────────────────────▼────────────────────────────┐    │
│  │    Cross-Attention with Image Features               │    │
│  └────────────────────────┬────────────────────────────┘    │
│                           │                                  │
│  ┌────────────────────────▼────────────────────────────┐    │
│  │    Self-Attention Layers (Transformer)               │    │
│  └────────────────────────┬────────────────────────────┘    │
└───────────────────────────┼─────────────────────────────────┘
┌───────────────────────────▼─────────────────────────────────┐
│  Frozen Vision Encoder    │      Frozen LLM                  │
│  (ViT-G/14 from EVA-CLIP) │      (OPT or FlanT5)            │
└─────────────────────────────────────────────────────────────┘

Model variants

| Model | LLM backend | Size | Use case |
|---|---|---|---|
| blip2-opt-2.7b | OPT-2.7B | ~4GB | General captioning, VQA |
| blip2-opt-6.7b | OPT-6.7B | ~8GB | Better reasoning |
| blip2-flan-t5-xl | FlanT5-XL | ~5GB | Instruction following |
| blip2-flan-t5-xxl | FlanT5-XXL | ~13GB | Best quality |
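The variants above trade size for quality. As a minimal illustrative sketch (the `VARIANTS` table and `pick_variant` helper are ours, not part of any BLIP-2 API; sizes are copied from the table above), checkpoint selection can be automated against a size budget:

```python
# Hypothetical lookup built from the variants table above.
VARIANTS = {
    "Salesforce/blip2-opt-2.7b":    {"size_gb": 4,  "best_for": "captioning"},
    "Salesforce/blip2-opt-6.7b":    {"size_gb": 8,  "best_for": "reasoning"},
    "Salesforce/blip2-flan-t5-xl":  {"size_gb": 5,  "best_for": "instructions"},
    "Salesforce/blip2-flan-t5-xxl": {"size_gb": 13, "best_for": "quality"},
}

def pick_variant(best_for, max_size_gb):
    """Return the checkpoint matching the task that fits the budget, or None."""
    for name, info in VARIANTS.items():
        if info["best_for"] == best_for and info["size_gb"] <= max_size_gb:
            return name
    return None

print(pick_variant("instructions", 6))  # Salesforce/blip2-flan-t5-xl
```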

Q-Former components

| Component | Description | Parameters |
|---|---|---|
| Learned queries | Fixed set of learnable embeddings | 32 × 768 |
| Image transformer | Cross-attention to vision features | ~108M |
| Text transformer | Self-attention for text | ~108M |
| Linear projection | Maps to LLM dimension | Varies |
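The key mechanism is that the 32 learned queries cross-attend to however many frozen image features the ViT produces, always yielding a fixed-size [32, 768] summary. A toy NumPy sketch of single-head cross-attention (illustrative only, not the real Q-Former; the patch count here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

num_queries, dim = 32, 768       # learned queries, as in the table above
num_patches = 257                # stand-in for the frozen ViT's patch features

queries = rng.standard_normal((num_queries, dim))      # learnable in training
image_feats = rng.standard_normal((num_patches, dim))  # frozen ViT output

# Cross-attention: each query attends over all image patches.
scores = queries @ image_feats.T / np.sqrt(dim)            # [32, 257]
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)             # softmax over patches
output = weights @ image_feats                             # [32, 768]

print(output.shape)  # fixed-size summary, regardless of num_patches
```

Whatever the input resolution, the LLM only ever sees these 32 vectors, which is why the Q-Former (~188M parameters) is the only part that needs training.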

Advanced usage

Batch processing

```python
from PIL import Image
import torch

# Load multiple images and one question per image
images = [Image.open(f"image_{i}.jpg").convert("RGB") for i in range(4)]
questions = [
    "What is shown in this image?",
    "Describe the scene.",
    "What colors are prominent?",
    "Is there a person in this image?",
]

# Process the batch
inputs = processor(
    images=images, text=questions, return_tensors="pt", padding=True
).to("cuda", torch.float16)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=50)
answers = processor.batch_decode(generated_ids, skip_special_tokens=True)
for q, a in zip(questions, answers):
    print(f"Q: {q}\nA: {a}\n")
```

Controlling generation

```python
# Control generation parameters
generated_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    min_length=20,
    num_beams=5,              # beam search
    no_repeat_ngram_size=2,   # avoid repetition
    top_p=0.9,                # nucleus sampling
    temperature=0.7,          # creativity
    do_sample=True,           # enable sampling
)

# For deterministic output
generated_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    do_sample=False,
)
```

Memory optimization

```python
from transformers import BitsAndBytesConfig

# 8-bit quantization
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b",
    quantization_config=quantization_config,
    device_map="auto"
)

# 4-bit quantization (more aggressive)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl",
    quantization_config=quantization_config,
    device_map="auto"
)
```

Image-text matching

```python
from lavis.models import load_model_and_preprocess

# Using LAVIS for ITM (image-text matching)
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_image_text_matching", model_type="pretrain", is_eval=True, device=device
)
raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a dog sitting on grass")

# Get matching score
itm_output = model({"image": image, "text_input": text}, match_head="itm")
itm_scores = torch.nn.functional.softmax(itm_output, dim=1)
print(f"Match probability: {itm_scores[:, 1].item():.3f}")
```

Feature extraction

```python
from lavis.models import load_model_and_preprocess

# Extract image features with the Q-Former
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)
raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Get features
features = model.extract_features({"image": image}, mode="image")
image_embeds = features.image_embeds          # shape: [1, 32, 768]
image_features = features.image_embeds_proj   # projected for matching
```

Common workflows

Workflow 1: Image captioning pipeline

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

class ImageCaptioner:
    def __init__(self, model_name="Salesforce/blip2-opt-2.7b"):
        self.processor = Blip2Processor.from_pretrained(model_name)
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def caption(self, image_path: str, prompt: str = None) -> str:
        image = Image.open(image_path).convert("RGB")

        if prompt:
            inputs = self.processor(images=image, text=prompt, return_tensors="pt")
        else:
            inputs = self.processor(images=image, return_tensors="pt")

        inputs = inputs.to("cuda", torch.float16)

        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=50,
            num_beams=5
        )

        return self.processor.decode(generated_ids[0], skip_special_tokens=True)

    def caption_batch(self, image_paths: list, prompt: str = None) -> list:
        images = [Image.open(p).convert("RGB") for p in image_paths]

        if prompt:
            inputs = self.processor(
                images=images,
                text=[prompt] * len(images),
                return_tensors="pt",
                padding=True
            )
        else:
            inputs = self.processor(images=images, return_tensors="pt", padding=True)

        inputs = inputs.to("cuda", torch.float16)

        generated_ids = self.model.generate(**inputs, max_new_tokens=50)
        return self.processor.batch_decode(generated_ids, skip_special_tokens=True)
```

Usage

```python
captioner = ImageCaptioner()

# Single image
caption = captioner.caption("photo.jpg")
print(f"Caption: {caption}")

# With a prompt for style
caption = captioner.caption("photo.jpg", "a detailed description of")
print(f"Detailed: {caption}")

# Batch processing
captions = captioner.caption_batch(["img1.jpg", "img2.jpg", "img3.jpg"])
for i, cap in enumerate(captions):
    print(f"Image {i+1}: {cap}")
```

Workflow 2: Visual Q&A system

```python
class VisualQA:
    def __init__(self, model_name="Salesforce/blip2-flan-t5-xl"):
        self.processor = Blip2Processor.from_pretrained(model_name)
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.current_image = None

    def set_image(self, image_path: str):
        """Load image for multiple questions."""
        self.current_image = Image.open(image_path).convert("RGB")

    def ask(self, question: str) -> str:
        """Ask a question about the current image."""
        if self.current_image is None:
            raise ValueError("No image set. Call set_image() first.")

        # Format question for FlanT5
        prompt = f"Question: {question} Answer:"

        inputs = self.processor(
            images=self.current_image,
            text=prompt,
            return_tensors="pt"
        ).to("cuda", torch.float16)

        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=50,
            num_beams=5
        )

        return self.processor.decode(generated_ids[0], skip_special_tokens=True)

    def ask_multiple(self, questions: list) -> dict:
        """Ask multiple questions about the current image."""
        return {q: self.ask(q) for q in questions}
```

Usage

```python
vqa = VisualQA()
vqa.set_image("scene.jpg")

# Ask questions
print(vqa.ask("What objects are in this image?"))
print(vqa.ask("What is the weather like?"))
print(vqa.ask("How many people are there?"))

# Batch questions
results = vqa.ask_multiple([
    "What is the main subject?",
    "What colors are dominant?",
    "Is this indoors or outdoors?",
])
```

Workflow 3: Image search/retrieval

```python
import torch
import numpy as np
from PIL import Image
from lavis.models import load_model_and_preprocess

class ImageSearchEngine:
    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model, self.vis_processors, self.txt_processors = load_model_and_preprocess(
            name="blip2_feature_extractor",
            model_type="pretrain",
            is_eval=True,
            device=self.device
        )
        self.image_features = []
        self.image_paths = []

    def index_images(self, image_paths: list):
        """Build index from images."""
        self.image_paths = image_paths

        for path in image_paths:
            image = Image.open(path).convert("RGB")
            image = self.vis_processors["eval"](image).unsqueeze(0).to(self.device)

            with torch.no_grad():
                features = self.model.extract_features({"image": image}, mode="image")
                # Use projected features for matching
                self.image_features.append(
                    features.image_embeds_proj.mean(dim=1).cpu().numpy()
                )

        self.image_features = np.vstack(self.image_features)

    def search(self, query: str, top_k: int = 5) -> list:
        """Search images by text query."""
        # Get text features
        text = self.txt_processors["eval"](query)
        text_input = {"text_input": [text]}

        with torch.no_grad():
            text_features = self.model.extract_features(text_input, mode="text")
            text_embeds = text_features.text_embeds_proj[:, 0].cpu().numpy()

        # Compute similarities
        similarities = np.dot(self.image_features, text_embeds.T).squeeze()
        top_indices = np.argsort(similarities)[::-1][:top_k]

        return [(self.image_paths[i], similarities[i]) for i in top_indices]
```

Usage

```python
engine = ImageSearchEngine()
engine.index_images(["img1.jpg", "img2.jpg", "img3.jpg", ...])

# Search
results = engine.search("a sunset over the ocean", top_k=5)
for path, score in results:
    print(f"{path}: {score:.3f}")
```
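Note that ranking by raw dot product only equals cosine similarity when the embeddings are unit-normalized. A small standalone helper (illustrative NumPy, independent of LAVIS; `rank_by_cosine` is our name) that normalizes both sides before ranking:

```python
import numpy as np

def rank_by_cosine(image_feats, text_feat, top_k=5):
    """Rank rows of image_feats [N, D] against text_feat [D] by cosine similarity."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feat / np.linalg.norm(text_feat)
    sims = img @ txt                          # [N] cosine similarities in [-1, 1]
    order = np.argsort(sims)[::-1][:top_k]    # best matches first
    return [(int(i), float(sims[i])) for i in order]
```

Normalizing makes scores comparable across queries, which matters if you later apply a fixed relevance threshold.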

Output format

Generation output

```python
# Direct generation returns token IDs
generated_ids = model.generate(**inputs, max_new_tokens=50)
# Shape: [batch_size, sequence_length]

# Decode to text
text = processor.batch_decode(generated_ids, skip_special_tokens=True)
# Returns: list of strings
```

Feature extraction output

```python
# Q-Former outputs
features = model.extract_features({"image": image}, mode="image")
features.image_embeds        # [B, 32, 768] - Q-Former outputs
features.image_embeds_proj   # [B, 32, 256] - projected for matching
features.text_embeds         # [B, seq_len, 768] - text features
features.text_embeds_proj    # [B, 256] - projected text (CLS)
```

Performance optimization

GPU memory requirements

| Model | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|---|---|---|---|
| blip2-opt-2.7b | ~8GB | ~5GB | ~3GB |
| blip2-opt-6.7b | ~16GB | ~9GB | ~5GB |
| blip2-flan-t5-xl | ~10GB | ~6GB | ~4GB |
| blip2-flan-t5-xxl | ~26GB | ~14GB | ~8GB |
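A quick way to use this table programmatically: given a VRAM budget, pick the highest precision that fits. This is an illustrative sketch (the `VRAM_GB` dict and `best_precision` helper are ours; values are copied from the table above, and real usage varies with batch size, image resolution, and generation length):

```python
# Hypothetical lookup built from the VRAM table above.
VRAM_GB = {
    "blip2-opt-2.7b":    {"fp16": 8,  "int8": 5,  "int4": 3},
    "blip2-opt-6.7b":    {"fp16": 16, "int8": 9,  "int4": 5},
    "blip2-flan-t5-xl":  {"fp16": 10, "int8": 6,  "int4": 4},
    "blip2-flan-t5-xxl": {"fp16": 26, "int8": 14, "int4": 8},
}

def best_precision(model_name, budget_gb):
    """Return the highest precision that fits the VRAM budget, or None."""
    for precision in ("fp16", "int8", "int4"):  # prefer higher precision
        if VRAM_GB[model_name][precision] <= budget_gb:
            return precision
    return None

print(best_precision("blip2-opt-6.7b", 12))  # int8
```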

Speed optimization

```python
# Use Flash Attention if available (requires the flash-attn package)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)

# Compile the model (PyTorch 2.0+)
model = torch.compile(model)

# Use smaller images (if quality allows)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
# Default input resolution is 224x224, which is optimal
```

Common issues

| Issue | Solution |
|---|---|
| CUDA OOM | Use INT8/INT4 quantization or a smaller model |
| Slow generation | Use greedy decoding, reduce max_new_tokens |
| Poor captions | Try a FlanT5 variant, use prompts |
| Hallucinations | Lower temperature, use beam search |
| Wrong answers | Rephrase the question, provide context |

References

  • Advanced Usage - Fine-tuning, integration, deployment
  • Troubleshooting - Common issues and solutions

Resources