Loading...
Loading...
Extract text from PDFs/scans (pymupdf, marker-pdf).
npx skill4agent add nousresearch/hermes-agent ocr-and-documentspython-docxpowerpointpython-pptxweb_extractweb_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---|---|---|
| Text-based PDF | ✅ | ✅ |
| Scanned PDF (OCR) | ❌ | ✅ (90+ languages) |
| Tables | ✅ (basic) | ✅ (high accuracy) |
| Equations / LaTeX | ❌ | ✅ |
| Code blocks | ❌ | ✅ |
| Forms | ❌ | ✅ |
| Headers/footers removal | ❌ | ✅ |
| Reading order detection | ❌ | ✅ |
| Images extraction | ✅ (embedded) | ✅ (with context) |
| Images → text (OCR) | ❌ | ✅ |
| EPUB | ✅ | ✅ |
| Markdown output | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| Install size | ~25MB | ~3-5GB (PyTorch + models) |
| Speed | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |
"This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."
pip install pymupdf pymupdf4llmpython scripts/extract_pymupdf.py document.pdf # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown # Markdown
python scripts/extract_pymupdf.py document.pdf --tables # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4 # Specific pagespython3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
print(page.get_text())
"# Check disk space first
python scripts/extract_marker.py --check
pip install marker-pdfpython scripts/extract_marker.py document.pdf # Markdown
python scripts/extract_marker.py document.pdf --json # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/ # Save images
python scripts/extract_marker.py scanned.pdf # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm # LLM-boosted accuracymarker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4 # Batch# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])
# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
# Search
web_search(query="arxiv GRPO reinforcement learning 2026")execute_code# Split: extract pages 1-5 to a new PDF
import pymupdf
doc = pymupdf.open("report.pdf")
new = pymupdf.open()
for i in range(5):
new.insert_pdf(doc, from_page=i, to_page=i)
new.save("pages_1-5.pdf")# Merge multiple PDFs
import pymupdf
result = pymupdf.open()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
result.insert_pdf(pymupdf.open(path))
result.save("merged.pdf")# Search for text across all pages
import pymupdf
doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
results = page.search_for("revenue")
if results:
print(f"Page {i+1}: {len(results)} match(es)")
print(page.get_text("text"))web_extract--help~/.cache/huggingface/pip install python-docxpowerpoint