DeepSeek-OCR
Skill by ara.so — Daily 2026 Skills collection.
DeepSeek-OCR is a vision-language model for Optical Character Recognition with "Contexts Optical Compression." It supports native and dynamic resolutions, multiple prompt modes (document-to-markdown, free OCR, figure parsing, grounding), and can be run via vLLM (high-throughput) or HuggingFace Transformers. It processes images and PDFs, outputting structured text or markdown.

Installation

Prerequisites

  • CUDA 11.8+, PyTorch 2.6.0
  • Python 3.12.9 (via conda recommended)

Setup

bash
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR

conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr

Install PyTorch with CUDA 11.8

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
  --index-url https://download.pytorch.org/whl/cu118
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation

Alternative: upstream vLLM (nightly)

bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

Model Download

模型下载

Model is available on HuggingFace:
deepseek-ai/DeepSeek-OCR
python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="deepseek-ai/DeepSeek-OCR")

Inference: vLLM (Recommended for Production)

Single Image — Streaming

python
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

image = Image.open("document.png").convert("RGB")
prompt = "<image>\nFree OCR."

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # <td>, </td> for table support
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling_params
)

print(outputs[0].outputs[0].text)

Batch Images

python
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

image_paths = ["page1.png", "page2.png", "page3.png"]
prompt = "<image>\n<|grounding|>Convert the document to markdown. "

model_input = [
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open(p).convert("RGB")}
    }
    for p in image_paths
]

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(model_input, sampling_params)

for path, output in zip(image_paths, outputs):
    print(f"=== {path} ===")
    print(output.outputs[0].text)

PDF Processing (via vLLM scripts)

bash
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm

Edit config.py: set INPUT_PATH, OUTPUT_PATH, model path, etc.

python run_dpsk_ocr_pdf.py  # ~2500 tokens/s on an A100-40G

Benchmark Evaluation

bash
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
python run_dpsk_ocr_eval_batch.py

Inference: HuggingFace Transformers

python
import os
import torch
from transformers import AutoModel, AutoTokenizer

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

model_name = "deepseek-ai/DeepSeek-OCR"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
)
model = model.eval().cuda().to(torch.bfloat16)

Document to markdown

python
res = model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown. ",
    image_file="document.jpg",
    output_path="./output/",
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
    test_compress=True,
)
print(res)

Transformers Script

bash
cd DeepSeek-OCR-master/DeepSeek-OCR-hf
python run_dpsk_ocr.py

Prompt Reference

  • Document → Markdown: <image>\n<|grounding|>Convert the document to markdown. 
  • General OCR: <image>\n<|grounding|>OCR this image. 
  • Free OCR (no layout): <image>\nFree OCR. 
  • Parse figure/chart: <image>\nParse the figure. 
  • General description: <image>\nDescribe this image in detail. 
  • Grounded REC: <image>\nLocate <|ref|>TARGET_TEXT<|/ref|> in the image. 
python
PROMPTS = {
    "document_markdown": "<image>\n<|grounding|>Convert the document to markdown. ",
    "ocr_image":         "<image>\n<|grounding|>OCR this image. ",
    "free_ocr":          "<image>\nFree OCR. ",
    "parse_figure":      "<image>\nParse the figure. ",
    "describe":          "<image>\nDescribe this image in detail. ",
    "rec":               "<image>\nLocate <|ref|>{target}<|/ref|> in the image. ",
}

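
The `rec` entry above is a template; filling in the target text with `str.format` produces the final prompt. A minimal sketch (only the `rec` entry is repeated here; the target string is an example of ours):

```python
# Grounding/REC prompt template, as in the PROMPTS dict above.
REC_PROMPT = "<image>\nLocate <|ref|>{target}<|/ref|> in the image. "

# Substitute the text to locate into the template.
prompt = REC_PROMPT.format(target="Total Amount")
assert "<|ref|>Total Amount<|/ref|>" in prompt
```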

Supported Resolutions

Mode              Resolution                Vision Tokens
Tiny              512×512                   64
Small             640×640                   100
Base              1024×1024                 256
Large             1280×1280                 400
Gundam (dynamic)  n×640×640 + 1×1024×1024   variable
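
For budgeting context length, the fixed-resolution rows above can be captured as a small lookup (token counts copied from the table; the dict and helper names are ours):

```python
# (resolution, vision tokens) per fixed-resolution mode, from the table above.
MODE_TOKENS = {
    "tiny":  (512, 64),
    "small": (640, 100),
    "base":  (1024, 256),
    "large": (1280, 400),
}

def vision_token_budget(mode: str, num_pages: int) -> int:
    """Total vision tokens consumed by num_pages pages at the given mode."""
    _, tokens_per_page = MODE_TOKENS[mode]
    return tokens_per_page * num_pages

print(vision_token_budget("base", 10))  # 2560
```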

Transformers: control resolution via infer() params

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="image.jpg",
    base_size=1024,   # 512, 640, 1024, or 1280
    image_size=640,   # patch size for dynamic mode
    crop_mode=True,   # True = Gundam dynamic resolution
)

---

Configuration (vLLM)

Edit DeepSeek-OCR-master/DeepSeek-OCR-vllm/config.py:

Key config fields (example)

python
MODEL_PATH = "deepseek-ai/DeepSeek-OCR"  # or local path
INPUT_PATH = "/data/input_images/"
OUTPUT_PATH = "/data/output/"
TENSOR_PARALLEL_SIZE = 1  # GPUs for tensor parallelism
MAX_TOKENS = 8192
TEMPERATURE = 0.0
NGRAM_SIZE = 30
WINDOW_SIZE = 90

---

Common Patterns

Process a Directory of Images

python
import os
from pathlib import Path
from PIL import Image
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor

def batch_ocr(image_dir: str, output_dir: str, prompt: str = "<image>\nFree OCR."):
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    llm = LLM(
        model="deepseek-ai/DeepSeek-OCR",
        enable_prefix_caching=False,
        mm_processor_cache_gb=0,
        logits_processors=[NGramPerReqLogitsProcessor],
    )
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=8192,
        extra_args=dict(ngram_size=30, window_size=90, whitelist_token_ids={128821, 128822}),
        skip_special_tokens=False,
    )
    
    image_files = list(Path(image_dir).glob("*.png")) + list(Path(image_dir).glob("*.jpg"))
    
    inputs = [
        {"prompt": prompt, "multi_modal_data": {"image": Image.open(f).convert("RGB")}}
        for f in image_files
    ]
    
    outputs = llm.generate(inputs, sampling_params)
    
    for img_path, output in zip(image_files, outputs):
        out_file = Path(output_dir) / (img_path.stem + ".txt")
        out_file.write_text(output.outputs[0].text)
        print(f"Saved: {out_file}")

batch_ocr("/data/scans/", "/data/results/")

Convert PDF Pages to Markdown

python
import fitz  # PyMuPDF
from PIL import Image
from io import BytesIO
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor

def pdf_to_markdown(pdf_path: str) -> list[str]:
    doc = fitz.open(pdf_path)
    llm = LLM(
        model="deepseek-ai/DeepSeek-OCR",
        enable_prefix_caching=False,
        mm_processor_cache_gb=0,
        logits_processors=[NGramPerReqLogitsProcessor],
    )
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=8192,
        extra_args=dict(ngram_size=30, window_size=90, whitelist_token_ids={128821, 128822}),
        skip_special_tokens=False,
    )
    
    prompt = "<image>\n<|grounding|>Convert the document to markdown. "
    inputs = []
    for page in doc:
        pix = page.get_pixmap(dpi=150)
        img = Image.open(BytesIO(pix.tobytes("png"))).convert("RGB")
        inputs.append({"prompt": prompt, "multi_modal_data": {"image": img}})
    
    outputs = llm.generate(inputs, sampling_params)
    return [o.outputs[0].text for o in outputs]

pages = pdf_to_markdown("report.pdf")
full_markdown = "\n\n---\n\n".join(pages)
print(full_markdown)

Grounded Text Location (REC)

python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
).eval().cuda().to(torch.bfloat16)

target = "Total Amount"
prompt = f"<image>\nLocate <|ref|>{target}<|/ref|> in the image. "

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="invoice.jpg",
    output_path="./output/",
    base_size=1024,
    image_size=640,
    crop_mode=False,
    save_results=True,
)
print(res)  # Returns bounding box / location info

Troubleshooting

transformers version conflict with vLLM

vLLM 0.8.5 requires transformers>=4.51.1. If you run both in the same environment, the project docs note that this dependency error is safe to ignore.

Flash Attention build errors


Ensure torch is installed before flash-attn

pip install flash-attn==2.7.3 --no-build-isolation

CUDA out of memory

  • Use a smaller resolution: base_size=512 or base_size=640
  • Set crop_mode=False to avoid multi-crop dynamic resolution
  • Reduce the batch size in vLLM inputs
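
One way to reduce the batch size is to submit the inputs in fixed-size chunks, which bounds peak memory per call. A minimal sketch (the helper and chunk size are ours; `llm`, `sampling_params`, and `model_input` are assumed to be set up as in the vLLM examples above):

```python
def chunked(items: list, size: int):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical usage with the vLLM objects from earlier sections:
# outputs = []
# for batch in chunked(model_input, 8):  # 8 images per generate() call
#     outputs.extend(llm.generate(batch, sampling_params))

print([len(b) for b in chunked(list(range(10)), 4)])  # [4, 4, 2]
```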

Model output is garbled / repetitive

Ensure NGramPerReqLogitsProcessor is passed to LLM; this is required for proper decoding:
python
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
llm = LLM(..., logits_processors=[NGramPerReqLogitsProcessor])

Tables not rendering correctly

Add table token IDs to the whitelist:
python
whitelist_token_ids={128821, 128822}  # <td> and </td>

Multi-GPU inference

python
llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    tensor_parallel_size=4,  # number of GPUs
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
)

Key Files

DeepSeek-OCR-master/
├── DeepSeek-OCR-vllm/
│   ├── config.py                  # vLLM configuration
│   ├── run_dpsk_ocr_image.py      # Single image inference
│   ├── run_dpsk_ocr_pdf.py        # PDF batch inference
│   └── run_dpsk_ocr_eval_batch.py # Benchmark evaluation
└── DeepSeek-OCR-hf/
    └── run_dpsk_ocr.py            # HuggingFace Transformers inference