DeepSeek-OCR
Skill by ara.so — Daily 2026 Skills collection.
DeepSeek-OCR is a vision-language model for Optical Character Recognition with "Contexts Optical Compression." It supports native and dynamic resolutions, multiple prompt modes (document-to-markdown, free OCR, figure parsing, grounding), and can be run via vLLM (high-throughput) or HuggingFace Transformers. It processes images and PDFs, outputting structured text or markdown.

Installation

Prerequisites

  • CUDA 11.8+, PyTorch 2.6.0
  • Python 3.12.9 (via conda recommended)

Setup

bash
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR

conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr

Install PyTorch with CUDA 11.8

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
  --index-url https://download.pytorch.org/whl/cu118
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation

Alternative: upstream vLLM (nightly)

bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

Model Download

模型下载

Model is available on HuggingFace:
deepseek-ai/DeepSeek-OCR
python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="deepseek-ai/DeepSeek-OCR")

Inference: vLLM (Recommended for Production)

Single Image — Streaming

python
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

image = Image.open("document.png").convert("RGB")
prompt = "<image>\nFree OCR."

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # <td>, </td> for table support
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling_params
)

print(outputs[0].outputs[0].text)

Batch Images

python
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

image_paths = ["page1.png", "page2.png", "page3.png"]
prompt = "<image>\n<|grounding|>Convert the document to markdown. "

model_input = [
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open(p).convert("RGB")}
    }
    for p in image_paths
]

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(model_input, sampling_params)

for path, output in zip(image_paths, outputs):
    print(f"=== {path} ===")
    print(output.outputs[0].text)

PDF Processing (via vLLM scripts)

bash
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm

Edit config.py: set INPUT_PATH, OUTPUT_PATH, model path, etc.

python run_dpsk_ocr_pdf.py  # ~2500 tokens/s on an A100-40G

Benchmark Evaluation

bash
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
python run_dpsk_ocr_eval_batch.py

Inference: HuggingFace Transformers

python
import os
import torch
from transformers import AutoModel, AutoTokenizer

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

model_name = "deepseek-ai/DeepSeek-OCR"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
)
model = model.eval().cuda().to(torch.bfloat16)

Document to markdown

python
res = model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown. ",
    image_file="document.jpg",
    output_path="./output/",
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
    test_compress=True,
)
print(res)

Transformers Script

bash
cd DeepSeek-OCR-master/DeepSeek-OCR-hf
python run_dpsk_ocr.py

Prompt Reference

  • Document → Markdown: <image>\n<|grounding|>Convert the document to markdown. 
  • General OCR: <image>\n<|grounding|>OCR this image. 
  • Free OCR (no layout): <image>\nFree OCR. 
  • Parse figure/chart: <image>\nParse the figure. 
  • General description: <image>\nDescribe this image in detail. 
  • Grounded REC: <image>\nLocate <|ref|>TARGET_TEXT<|/ref|> in the image. 
python
PROMPTS = {
    "document_markdown": "<image>\n<|grounding|>Convert the document to markdown. ",
    "ocr_image":         "<image>\n<|grounding|>OCR this image. ",
    "free_ocr":          "<image>\nFree OCR. ",
    "parse_figure":      "<image>\nParse the figure. ",
    "describe":          "<image>\nDescribe this image in detail. ",
    "rec":               "<image>\nLocate <|ref|>{target}<|/ref|> in the image. ",
}

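
The `rec` entry above is a template; filling in the target text with `str.format` produces the final prompt. A minimal sketch (only the `rec` entry is repeated here; the target string is an example of ours):

```python
# Grounding/REC prompt template, as in the PROMPTS dict above.
REC_PROMPT = "<image>\nLocate <|ref|>{target}<|/ref|> in the image. "

# Substitute the text to locate into the template.
prompt = REC_PROMPT.format(target="Total Amount")
assert "<|ref|>Total Amount<|/ref|>" in prompt
```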

Supported Resolutions

Mode              Resolution                Vision Tokens
Tiny              512×512                   64
Small             640×640                   100
Base              1024×1024                 256
Large             1280×1280                 400
Gundam (dynamic)  n×640×640 + 1×1024×1024   variable
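
For budgeting context length, the fixed-resolution rows above can be captured as a small lookup (token counts copied from the table; the dict and helper names are ours):

```python
# (resolution, vision tokens) per fixed-resolution mode, from the table above.
MODE_TOKENS = {
    "tiny":  (512, 64),
    "small": (640, 100),
    "base":  (1024, 256),
    "large": (1280, 400),
}

def vision_token_budget(mode: str, num_pages: int) -> int:
    """Total vision tokens consumed by num_pages pages at the given mode."""
    _, tokens_per_page = MODE_TOKENS[mode]
    return tokens_per_page * num_pages

print(vision_token_budget("base", 10))  # 2560
```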

Transformers: control resolution via infer() params

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="image.jpg",
    base_size=1024,   # 512, 640, 1024, or 1280
    image_size=640,   # patch size for dynamic mode
    crop_mode=True,   # True = Gundam dynamic resolution
)

---

Configuration (vLLM)

Edit DeepSeek-OCR-master/DeepSeek-OCR-vllm/config.py:

Key config fields (example)

python
MODEL_PATH = "deepseek-ai/DeepSeek-OCR"  # or local path
INPUT_PATH = "/data/input_images/"
OUTPUT_PATH = "/data/output/"
TENSOR_PARALLEL_SIZE = 1  # GPUs for tensor parallelism
MAX_TOKENS = 8192
TEMPERATURE = 0.0
NGRAM_SIZE = 30
WINDOW_SIZE = 90

---

Common Patterns

Process a Directory of Images

python
import os
from pathlib import Path
from PIL import Image
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor

def batch_ocr(image_dir: str, output_dir: str, prompt: str = "<image>\nFree OCR."):
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    llm = LLM(
        model="deepseek-ai/DeepSeek-OCR",
        enable_prefix_caching=False,
        mm_processor_cache_gb=0,
        logits_processors=[NGramPerReqLogitsProcessor],
    )
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=8192,
        extra_args=dict(ngram_size=30, window_size=90, whitelist_token_ids={128821, 128822}),
        skip_special_tokens=False,
    )
    
    image_files = list(Path(image_dir).glob("*.png")) + list(Path(image_dir).glob("*.jpg"))
    
    inputs = [
        {"prompt": prompt, "multi_modal_data": {"image": Image.open(f).convert("RGB")}}
        for f in image_files
    ]
    
    outputs = llm.generate(inputs, sampling_params)
    
    for img_path, output in zip(image_files, outputs):
        out_file = Path(output_dir) / (img_path.stem + ".txt")
        out_file.write_text(output.outputs[0].text)
        print(f"Saved: {out_file}")

batch_ocr("/data/scans/", "/data/results/")

Convert PDF Pages to Markdown

python
import fitz  # PyMuPDF
from PIL import Image
from io import BytesIO
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor

def pdf_to_markdown(pdf_path: str) -> list[str]:
    doc = fitz.open(pdf_path)
    llm = LLM(
        model="deepseek-ai/DeepSeek-OCR",
        enable_prefix_caching=False,
        mm_processor_cache_gb=0,
        logits_processors=[NGramPerReqLogitsProcessor],
    )
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=8192,
        extra_args=dict(ngram_size=30, window_size=90, whitelist_token_ids={128821, 128822}),
        skip_special_tokens=False,
    )
    
    prompt = "<image>\n<|grounding|>Convert the document to markdown. "
    inputs = []
    for page in doc:
        pix = page.get_pixmap(dpi=150)
        img = Image.open(BytesIO(pix.tobytes("png"))).convert("RGB")
        inputs.append({"prompt": prompt, "multi_modal_data": {"image": img}})
    
    outputs = llm.generate(inputs, sampling_params)
    return [o.outputs[0].text for o in outputs]

pages = pdf_to_markdown("report.pdf")
full_markdown = "\n\n---\n\n".join(pages)
print(full_markdown)

Grounded Text Location (REC)

python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
).eval().cuda().to(torch.bfloat16)

target = "Total Amount"
prompt = f"<image>\nLocate <|ref|>{target}<|/ref|> in the image. "

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="invoice.jpg",
    output_path="./output/",
    base_size=1024,
    image_size=640,
    crop_mode=False,
    save_results=True,
)
print(res)  # Returns bounding box / location info

Troubleshooting

transformers version conflict with vLLM

vLLM 0.8.5 requires transformers>=4.51.1. If you run both in the same environment, the project docs note that this dependency error is safe to ignore.

Flash Attention build errors


Ensure torch is installed before flash-attn

pip install flash-attn==2.7.3 --no-build-isolation

CUDA out of memory

  • Use a smaller resolution: base_size=512 or base_size=640
  • Set crop_mode=False to avoid multi-crop dynamic resolution
  • Reduce the batch size in vLLM inputs
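
One way to reduce the batch size is to submit the inputs in fixed-size chunks, which bounds peak memory per call. A minimal sketch (the helper and chunk size are ours; `llm`, `sampling_params`, and `model_input` are assumed to be set up as in the vLLM examples above):

```python
def chunked(items: list, size: int):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical usage with the vLLM objects from earlier sections:
# outputs = []
# for batch in chunked(model_input, 8):  # 8 images per generate() call
#     outputs.extend(llm.generate(batch, sampling_params))

print([len(b) for b in chunked(list(range(10)), 4)])  # [4, 4, 2]
```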

Model output is garbled / repetitive

Ensure NGramPerReqLogitsProcessor is passed to LLM; this is required for proper decoding:
python
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
llm = LLM(..., logits_processors=[NGramPerReqLogitsProcessor])

Tables not rendering correctly

Add table token IDs to the whitelist:
python
whitelist_token_ids={128821, 128822}  # <td> and </td>

Multi-GPU inference

python
llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    tensor_parallel_size=4,  # number of GPUs
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
)

Key Files

DeepSeek-OCR-master/
├── DeepSeek-OCR-vllm/
│   ├── config.py                  # vLLM configuration
│   ├── run_dpsk_ocr_image.py      # Single image inference
│   ├── run_dpsk_ocr_pdf.py        # PDF batch inference
│   └── run_dpsk_ocr_eval_batch.py # Benchmark evaluation
└── DeepSeek-OCR-hf/
    └── run_dpsk_ocr.py            # HuggingFace Transformers inference