| PDF Type | Best Approach | Script |
|---|---|---|
| Simple text PDF | PyMuPDF | `scripts/extract_pymupdf.py` |
| PDF with tables | pdfplumber | `scripts/extract_pdfplumber.py` |
| Scanned/image PDF (local) | pytesseract | `scripts/extract_with_ocr.py` |
| Complex layout, highest accuracy | Mistral OCR API | `scripts/extract_mistral_ocr.py` |
| End-to-end RAG pipeline | marker-pdf | |
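
The choice in the table can be expressed as a small lookup. The following is a minimal sketch, not part of the original scripts: the script paths come from the `uv run` commands shown in this document, while the keys (`"text"`, `"tables"`, `"scanned"`, `"complex"`) and the helper `build_command` are hypothetical names introduced for illustration. No script is given for marker-pdf, so it is left out.

```python
import shlex

# Script paths are taken from the commands in this document; the dictionary
# keys ("text", "tables", ...) are hypothetical labels chosen for this sketch.
SCRIPTS = {
    "text": "scripts/extract_pymupdf.py",
    "tables": "scripts/extract_pdfplumber.py",
    "scanned": "scripts/extract_with_ocr.py",
    "complex": "scripts/extract_mistral_ocr.py",
}

# The OCR command in this document writes a .txt file; the others write .md.
OUTPUT_SUFFIX = {"scanned": ".txt"}


def build_command(pdf_type: str, input_pdf: str, output_stem: str = "output") -> list[str]:
    """Return the `uv run` argument list for the given PDF type."""
    if pdf_type not in SCRIPTS:
        raise ValueError(f"no script known for PDF type {pdf_type!r}")
    output = output_stem + OUTPUT_SUFFIX.get(pdf_type, ".md")
    return ["uv", "run", SCRIPTS[pdf_type], input_pdf, output]


if __name__ == "__main__":
    # Print the command as a shell-safe string instead of executing it.
    print(shlex.join(build_command("tables", "input.pdf")))
```

Returning an argument list rather than a string keeps the result safe to pass to `subprocess.run` without shell quoting issues.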
- PyMuPDF (uses `pymupdf4llm`): `uv run scripts/extract_pymupdf.py input.pdf output.md`
- pdfplumber: `uv run scripts/extract_pdfplumber.py input.pdf output.md`
- OCR (uses `pytesseract` and `pdf2image`; install Tesseract with `brew install tesseract`): `uv run scripts/extract_with_ocr.py input.pdf output.txt`
- Mistral OCR API: first `export MISTRAL_API_KEY="your-key"`, then `uv run scripts/extract_mistral_ocr.py input.pdf output.md`
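
The OCR and Mistral commands depend on setup outside Python: a `tesseract` binary and a `MISTRAL_API_KEY` environment variable. As a rough sketch, a pre-flight check could report what is missing before a script is run. The function name `missing_prerequisites` and the approach labels `"ocr"` and `"mistral"` are assumptions made for this example and are not part of the original scripts.

```python
import os
import shutil


def missing_prerequisites(approach, env=None, which=shutil.which):
    """Return a list of problems; an empty list means the approach is ready to run.

    `env` and `which` are injectable so the check can be exercised without
    touching the real environment.
    """
    env = os.environ if env is None else env
    problems = []
    # Every command in this document is launched through `uv run`.
    if which("uv") is None:
        problems.append("uv is not on PATH")
    # Local OCR needs the Tesseract binary in addition to the Python packages.
    if approach == "ocr" and which("tesseract") is None:
        problems.append("tesseract is not on PATH (e.g. brew install tesseract)")
    # The Mistral OCR command expects the API key to be exported beforehand.
    if approach == "mistral" and not env.get("MISTRAL_API_KEY"):
        problems.append("MISTRAL_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for name in ("ocr", "mistral"):
        print(name, missing_prerequisites(name))
```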