PDF data extraction tool. Use it when users mention "PDF extraction", "PDF to Markdown", "PDF parsing", "extract PDF content", "PDF to JSON", or "RAG PDF". OpenDataLoader PDF has the highest overall score (0.90, hybrid mode) among the engines in the benchmark table below. It supports a local mode (fast, deterministic) and a hybrid AI mode (for complex tables, scanned documents, and formulas), and it outputs Markdown, JSON (with bounding boxes), and HTML. It suits extracting structured data from PDFs for RAG/LLM pipelines and batch processing of PDF documents.
npx skill4agent add chujianyun/skills opendataloader-pdf

pip install -U opendataloader-pdf
pip install "opendataloader-pdf[hybrid]"

# Fast mode: outputs Markdown + JSON
opendataloader-pdf input.pdf output_dir/
# Specify output formats
opendataloader-pdf input.pdf output_dir/ --format markdown,json,html
# Hybrid AI mode (complex tables / scanned documents)
opendataloader-pdf --hybrid docling-fast input.pdf output_dir/
# Hybrid mode + OCR (scanned documents)
opendataloader-pdf --hybrid docling-fast --force-ocr input.pdf output_dir/
# Hybrid mode + formula recognition
opendataloader-pdf --hybrid docling-fast --hybrid-mode full input.pdf output_dir/

import opendataloader_pdf
# Batch processing (each call starts a JVM, so pass all inputs in a single call)
opendataloader_pdf.convert(
input_path=["file1.pdf", "file2.pdf", "folder/"],
output_dir="output/",
format="markdown,json"
)

| Document Type | Mode | Command |
|---|---|---|
| Standard digital PDF | Fast (default) | |
| Complex/borderless tables | Hybrid | |
| Scanned documents | Hybrid + OCR | Same as above + |
| Non-English scanned documents | Hybrid + OCR | |
| PDFs with mathematical formulas | Hybrid + formula recognition | |
| Charts requiring description | Hybrid + image description | |
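
The mode choices in the table can be sketched as a small command builder. This is a minimal sketch that reuses only the flags from the CLI commands shown earlier (`--hybrid docling-fast`, `--force-ocr`, `--hybrid-mode full`). The function name `build_command` and the document-kind labels are hypothetical, and the rows for non-English scans and chart descriptions are left out because their flags are not shown in those commands.

```python
import shlex

# Flags copied from the CLI examples earlier on this page. The dictionary keys
# are hypothetical labels chosen for this sketch, not part of the tool.
MODE_FLAGS = {
    "digital": [],
    "complex_tables": ["--hybrid", "docling-fast"],
    "scanned": ["--hybrid", "docling-fast", "--force-ocr"],
    "formulas": ["--hybrid", "docling-fast", "--hybrid-mode", "full"],
}


def build_command(pdf_path, output_dir, kind="digital"):
    """Return the argv list for one opendataloader-pdf run."""
    if kind not in MODE_FLAGS:
        raise ValueError(f"unknown document kind: {kind!r}")
    return ["opendataloader-pdf", *MODE_FLAGS[kind], pdf_path, output_dir]


if __name__ == "__main__":
    # Print one shell-quoted command per document kind.
    for kind in MODE_FLAGS:
        print(shlex.join(build_command("input.pdf", "output_dir/", kind)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building an argv list rather than a shell string avoids quoting problems with file names that contain spaces.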
{
  "pages": [{
    "page_number": 1,
    "elements": [{
      "type": "heading",
      "text": "...",
      "bbox": [x0, y0, x1, y1],
      "level": 1
    }, {
      "type": "table",
      "bbox": [x0, y0, x1, y1],
      "html": "..."
    }]
  }]
}

bbox
convert()
opendataloader-pdf-hybrid --port 5002
--hybrid docling-fast
--enrich-formula
--enrich-picture-description
--hybrid-mode full
npm run sync
options.json

| Engine | Overall Score | Table Processing Score | Speed (seconds per page) |
|---|---|---|---|
| opendataloader (hybrid) | 0.90 | 0.93 | 0.43 |
| docling | 0.86 | 0.89 | 0.73 |
| marker | 0.83 | 0.81 | 53.93 |
| mineru | 0.82 | 0.87 | 5.96 |
| pymupdf4llm | 0.57 | 0.40 | 0.09 |
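
The JSON layout shown before the benchmark table can be read with the standard library alone. The following is a minimal sketch, assuming the output follows that shape exactly: a `pages` list whose entries hold `elements` with `type`, `bbox`, and either `text` or `html`. The helper name `iter_elements` and the sample values are hypothetical.

```python
import json


def iter_elements(document, wanted_type=None):
    """Yield (page_number, element) pairs, optionally filtered by type."""
    for page in document.get("pages", []):
        for element in page.get("elements", []):
            if wanted_type is None or element.get("type") == wanted_type:
                yield page.get("page_number"), element


# Hypothetical sample in the shape shown earlier. Real data would come from
# json.load() on a JSON file written to the output directory.
SAMPLE = json.loads("""
{"pages": [{"page_number": 1, "elements": [
  {"type": "heading", "text": "Introduction",
   "bbox": [72, 700, 300, 720], "level": 1},
  {"type": "table", "bbox": [72, 400, 540, 650], "html": "<table></table>"}
]}]}
""")

if __name__ == "__main__":
    # List every heading with its page, level, text, and bounding box.
    for page_number, element in iter_elements(SAMPLE, "heading"):
        print(page_number, element["level"], element["text"], element["bbox"])
```

Keeping the bounding box next to each element makes it possible to cite the source region of a retrieved chunk in a RAG pipeline.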

pip install opendataloader-pdf
npm install @opendataloader/pdf
org.opendataloader:opendataloader-pdf-core