opendataloader-pdf

OpenDataLoader PDF


PDF parser · ranked first in benchmarks · data extraction for RAG/LLM pipelines

Feature Overview

  • Core capability: extracts structured data (Markdown, JSON, HTML) from any PDF, with bounding box coordinates
  • Technical highlights: XY-Cut++ reading order, bounding box positioning, and a hybrid AI mode for complex pages
  • Benchmark results: overall 0.90 (ranked first), tables 0.93, reading order 0.94 (benchmarked against Docling, Marker, MinerU, and others)
  • License: Apache 2.0 (core features are free)

Use Cases

  • Batch-converting PDFs to Markdown / JSON / HTML for RAG or LLM training
  • Source attribution that needs bounding box coordinates (which page and position of the PDF a paragraph came from)
  • Complex tables, scanned documents, and academic PDFs containing formulas
  • PDF accessibility (Tagged PDF generation, to be made freely available in Q2 2026)

Installation

Prerequisites

  • Java 11+
  • Python 3.10+
bash
pip install -U opendataloader-pdf
For hybrid AI mode (complex tables / OCR / formulas):
bash
pip install "opendataloader-pdf[hybrid]"
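
Before installing, it can help to confirm that both prerequisites are met. The following is a minimal sketch, not part of opendataloader-pdf: the helper names `parse_java_major` and `check_prerequisites` are hypothetical, and the parser assumes the usual `version "..."` line that `java -version` prints.

```python
import re
import shutil
import subprocess
import sys
from typing import List, Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as "1.8", "1.7", and so on.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def check_prerequisites() -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    java = shutil.which("java")
    if java is None:
        problems.append("java was not found on PATH")
    else:
        # Most JDKs print the version banner to stderr, so read both streams.
        result = subprocess.run([java, "-version"], capture_output=True, text=True)
        major = parse_java_major(result.stderr + result.stdout)
        if major is None or major < 11:
            problems.append("Java 11+ is required")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

Running the script prints one line per unmet prerequisite and nothing when the environment looks fine.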

Quick Start

CLI (single file or batch)

bash
# Fast mode: outputs Markdown + JSON
opendataloader-pdf input.pdf output_dir/

# Specify output formats
opendataloader-pdf input.pdf output_dir/ --format markdown,json,html

# Hybrid AI mode (complex tables / scanned documents)
opendataloader-pdf --hybrid docling-fast input.pdf output_dir/

# Hybrid mode + OCR (scanned documents)
opendataloader-pdf --hybrid docling-fast --force-ocr input.pdf output_dir/

# Hybrid mode + formula recognition
opendataloader-pdf --hybrid docling-fast --hybrid-mode full input.pdf output_dir/

Python API


python
import opendataloader_pdf

# Batch processing: each call starts a JVM, so pass the whole batch in one call
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="markdown,json",
)

Mode Selection Guide

| Document Type | Mode | Command |
|---|---|---|
| Standard digital PDF | Fast (default) | opendataloader-pdf file.pdf out/ |
| Complex/borderless tables | Hybrid | opendataloader-pdf --hybrid docling-fast file.pdf out/ |
| Scanned documents | Hybrid + OCR | Same as above, plus --force-ocr |
| Non-English scanned documents | Hybrid + OCR | --force-ocr --ocr-lang "ko,en" |
| PDFs with mathematical formulas | Hybrid + formula recognition | --hybrid docling-fast --hybrid-mode full |
| Charts requiring description | Hybrid + image description | --enrich-picture-description --hybrid-mode full |
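
The guide above can be read as a small decision procedure. The sketch below assembles an argument list from document traits, using only the flags listed in the guide. The helper `build_command` and its parameters are hypothetical. Pairing `--ocr-lang` and `--enrich-picture-description` with `--hybrid docling-fast` is an assumption inferred from the "same as above" pattern, not something the guide states outright.

```python
from typing import List, Optional


def build_command(
    input_pdf: str,
    output_dir: str,
    *,
    complex_tables: bool = False,
    scanned: bool = False,
    ocr_lang: Optional[str] = None,
    formulas: bool = False,
    describe_pictures: bool = False,
) -> List[str]:
    """Assemble an opendataloader-pdf argument list from document traits."""
    args = ["opendataloader-pdf"]
    # Any trait beyond a standard digital PDF switches to hybrid mode
    # (assumption: every hybrid row uses the docling-fast backend).
    if complex_tables or scanned or formulas or describe_pictures:
        args += ["--hybrid", "docling-fast"]
    if scanned:
        args.append("--force-ocr")
        if ocr_lang:
            args += ["--ocr-lang", ocr_lang]
    if describe_pictures:
        args.append("--enrich-picture-description")
    # Formula recognition and picture description both need the full hybrid mode.
    if formulas or describe_pictures:
        args += ["--hybrid-mode", "full"]
    args += [input_pdf, output_dir]
    return args


if __name__ == "__main__":
    print(" ".join(build_command("file.pdf", "out/", scanned=True, ocr_lang="ko,en")))
```

Returning a list rather than a single string lets the result be passed straight to `subprocess.run` without shell quoting concerns.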

Output Formats

Markdown

Preserves heading hierarchy, table structure, and list nesting, so the output can be chunked directly.

JSON (with bounding boxes)

json
{
  "pages": [{
    "page_number": 1,
    "elements": [{
      "type": "heading",
      "text": "...",
      "bbox": [x0, y0, x1, y1],
      "level": 1
    }, {
      "type": "table",
      "bbox": [x0, y0, x1, y1],
      "html": "..."
    }]
  }]
}
Each element has bbox coordinates, which makes it easy to trace extracted content back to its source.
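
To illustrate source attribution, here is a minimal sketch that walks a document shaped like the schema above and reports the page and bounding box of each match. The `sample` data and the helpers `iter_sources` and `find_text` are hypothetical. Real output may contain additional fields or element types.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Hypothetical sample following the schema shown above.
sample: Dict[str, Any] = {
    "pages": [{
        "page_number": 1,
        "elements": [
            {"type": "heading", "text": "Introduction",
             "bbox": [72.0, 700.0, 300.0, 720.0], "level": 1},
            {"type": "table", "bbox": [72.0, 400.0, 540.0, 650.0],
             "html": "<table></table>"},
        ],
    }]
}


def iter_sources(doc: Dict[str, Any]) -> Iterator[Tuple[int, str, List[float]]]:
    """Yield (page_number, element_type, bbox) for every element."""
    for page in doc.get("pages", []):
        for element in page.get("elements", []):
            yield page["page_number"], element["type"], element["bbox"]


def find_text(doc: Dict[str, Any], needle: str) -> List[Dict[str, Any]]:
    """Return the page and bbox of each element whose text contains needle."""
    hits = []
    for page in doc.get("pages", []):
        for element in page.get("elements", []):
            # Elements such as tables may have no "text" field.
            if needle in element.get("text", ""):
                hits.append({"page": page["page_number"], "bbox": element["bbox"]})
    return hits


if __name__ == "__main__":
    for hit in find_text(sample, "Introduction"):
        print(hit)
```

In a RAG pipeline, the returned page and bbox could be stored alongside each chunk so that an answer can cite the exact region it came from.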

HTML

Preserves layout structure; suitable for rendering or further processing.

Gotchas


  • Each convert() call starts a new JVM process, so pass all files in a batch at once rather than calling it repeatedly in a loop
  • Hybrid mode requires a server running in the background: start it with opendataloader-pdf-hybrid --port 5002, then add --hybrid docling-fast to the client command
  • For --enrich-formula or --enrich-picture-description, add --hybrid-mode full to both the hybrid server and the client command; otherwise the enrichment is silently skipped
  • After changing Java options, run npm run sync, which regenerates options.json and all Python/Node.js bindings
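
The first gotcha can be made concrete with a small wrapper that funnels every input into a single convert() call. This is a sketch under stated assumptions: `convert_batch` is a hypothetical helper, and it relies only on the `input_path`, `output_dir`, and `format` arguments shown in the Python API example. The `convert` parameter exists so the helper can be tried with a stub when the package is not installed.

```python
from typing import Callable, Iterable, List, Optional


def convert_batch(
    paths: Iterable[str],
    output_dir: str,
    fmt: str = "markdown,json",
    convert: Optional[Callable[..., object]] = None,
) -> List[str]:
    """Pass every input to one convert() call so only one JVM is started."""
    if convert is None:
        # Imported lazily so the helper can be exercised without the package.
        import opendataloader_pdf
        convert = opendataloader_pdf.convert
    # Drop duplicate paths while keeping the caller's order.
    batch = list(dict.fromkeys(paths))
    if batch:
        convert(input_path=batch, output_dir=output_dir, format=fmt)
    return batch


if __name__ == "__main__":
    calls = []
    # A stub stands in for opendataloader_pdf.convert in this demonstration.
    convert_batch(["a.pdf", "b.pdf", "a.pdf"], "out/", convert=lambda **kw: calls.append(kw))
    print(len(calls), calls[0]["input_path"])
```

The contrast is with a loop that calls convert() once per file, which by the note above would start one JVM per file.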

Comparison with Other Tools

| Engine | Overall Score | Table Score | Speed (seconds per page) |
|---|---|---|---|
| opendataloader (hybrid) | 0.90 | 0.93 | 0.43 |
| docling | 0.86 | 0.89 | 0.73 |
| marker | 0.83 | 0.81 | 53.93 |
| mineru | 0.82 | 0.87 | 5.96 |
| pymupdf4llm | 0.57 | 0.40 | 0.09 |
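
The speed column translates directly into batch processing time. The sketch below does that arithmetic using only the figures in the table. The helper names are hypothetical, and real throughput will depend on hardware and on the documents themselves.

```python
# Seconds per page, copied from the comparison table above.
SECONDS_PER_PAGE = {
    "opendataloader (hybrid)": 0.43,
    "docling": 0.73,
    "marker": 53.93,
    "mineru": 5.96,
    "pymupdf4llm": 0.09,
}


def estimated_minutes(engine: str, pages: int) -> float:
    """Estimate wall-clock minutes to process `pages` pages with `engine`."""
    return SECONDS_PER_PAGE[engine] * pages / 60


def slowdown_vs(engine: str, baseline: str = "opendataloader (hybrid)") -> float:
    """Return how many times slower `engine` is than `baseline` per page."""
    return SECONDS_PER_PAGE[engine] / SECONDS_PER_PAGE[baseline]


if __name__ == "__main__":
    for name in SECONDS_PER_PAGE:
        print(f"{name}: {estimated_minutes(name, 1000):.1f} min per 1000 pages, "
              f"{slowdown_vs(name):.1f}x the baseline time")
```

Read together with the score columns, this shows the trade-off in the table: pymupdf4llm is the fastest per page but has the lowest scores, while marker is more than a hundred times slower per page than the hybrid mode.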

Reference Information