markdown-tools

Markdown Tools

Convert documents to high-quality markdown with intelligent multi-tool orchestration.

Dual Mode Architecture

| Mode | Speed | Quality | Use Case |
|------|-------|---------|----------|
| Quick (default) | Fast | Good | Drafts, simple documents |
| Heavy | Slower | Best | Final documents, complex layouts |

Quick Start

Installation

```bash
# Required: PDF/DOCX/PPTX support
uv tool install "markitdown[pdf]"
pip install pymupdf4llm
brew install pandoc
```

Basic Conversion

```bash
# Quick Mode (default) - fast, single best tool
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md

# Heavy Mode - multi-tool parallel execution with merge
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md --heavy

# Check available tools
uv run scripts/convert.py --list-tools
```

Tool Selection Matrix

| Format | Quick Mode Tool | Heavy Mode Tools |
|--------|-----------------|------------------|
| PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pandoc | pandoc + markitdown |
| PPTX | markitdown | markitdown + pandoc |
| XLSX | markitdown | markitdown |
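
The matrix above can be read as a simple lookup from file extension to an ordered tool list. The sketch below is only an illustration of that reading: `TOOLS_BY_FORMAT` and `select_tools` are hypothetical names, not taken from `scripts/convert.py`, and the real script may choose tools differently.

```python
from pathlib import Path

# Mirrors the Tool Selection Matrix; the first entry is the Quick Mode tool.
# (Hypothetical table for illustration, not read from convert.py.)
TOOLS_BY_FORMAT = {
    ".pdf": ["pymupdf4llm", "markitdown"],
    ".docx": ["pandoc", "markitdown"],
    ".pptx": ["markitdown", "pandoc"],
    ".xlsx": ["markitdown"],
}


def select_tools(path: str, heavy: bool = False) -> list[str]:
    """Return the tools to run for a file: one in Quick Mode, all in Heavy Mode."""
    suffix = Path(path).suffix.lower()
    if suffix not in TOOLS_BY_FORMAT:
        raise ValueError(f"Unsupported format: {suffix or path}")
    tools = TOOLS_BY_FORMAT[suffix]
    return list(tools) if heavy else tools[:1]


print(select_tools("report.pdf"))                # ['pymupdf4llm']
print(select_tools("slides.pptx", heavy=True))   # ['markitdown', 'pandoc']
```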

Tool Characteristics

  • pymupdf4llm: LLM-optimized PDF conversion with native table detection and image extraction
  • markitdown: Microsoft's universal converter, good for Office formats
  • pandoc: Excellent structure preservation for DOCX/PPTX

Heavy Mode Workflow

Heavy Mode runs multiple tools in parallel and selects the best segments:
  1. Parallel Execution: Run all applicable tools simultaneously
  2. Segment Analysis: Parse each output into segments (tables, headings, images, paragraphs)
  3. Quality Scoring: Score each segment based on completeness and structure
  4. Intelligent Merge: Select best version of each segment across tools
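
The four steps above can be sketched in a few lines of Python. This is a toy model under loud assumptions: segments are split on blank lines, aligned by position, and scored by length alone, whereas the actual Heavy Mode presumably classifies segments by type and scores them more carefully. All function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_segments(markdown: str) -> list[str]:
    """Step 2 (simplified): treat blank-line-separated blocks as segments."""
    return [block.strip() for block in markdown.split("\n\n") if block.strip()]


def score(segment: str) -> int:
    """Step 3 (toy heuristic): assume a longer segment is more complete."""
    return len(segment)


def merge(outputs: list[str]) -> str:
    """Step 4: at each position, keep the highest-scoring segment across tools."""
    segmented = [split_segments(output) for output in outputs]
    longest = max((len(segments) for segments in segmented), default=0)
    merged = []
    for i in range(longest):
        candidates = [segments[i] for segments in segmented if i < len(segments)]
        merged.append(max(candidates, key=score))
    return "\n\n".join(merged)


def run_heavy(converters, path: str) -> str:
    """Step 1: run every converter callable on the same file in parallel."""
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(lambda convert: convert(path), converters))
    return merge(outputs)
```

Here `converters` is any list of callables that take a path and return markdown text. Positional alignment breaks down when one tool emits extra segments, which is one reason a real merge would need to match segments by content or type.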

Merge Criteria

| Segment Type | Selection Criteria |
|--------------|--------------------|
| Tables | More rows/columns, proper header separator |
| Images | Alt text present, local paths preferred |
| Headings | Proper hierarchy, appropriate length |
| Lists | More items, nested structure preserved |
| Paragraphs | Content completeness |
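
As one concrete reading of the table criteria, the sketch below scores a pipe table by its cell count and adds a bonus when the second line is a valid header separator. The weighting (a flat bonus of 10) and the name `score_table` are assumptions made for illustration; `scripts/merge_outputs.py` may weigh these signals differently.

```python
import re

# A separator cell looks like ---, :---, ---: or :---:
SEPARATOR_CELL = re.compile(r"^:?-+:?$")


def split_row(line: str) -> list[str]:
    """Split one pipe-table line into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def score_table(table_md: str) -> int:
    """Score a markdown table: more data cells is better, a header separator adds a bonus."""
    lines = [line for line in table_md.strip().splitlines() if line.strip()]
    if not lines:
        return 0
    columns = len(split_row(lines[0]))
    has_separator = len(lines) > 1 and all(
        SEPARATOR_CELL.match(cell) for cell in split_row(lines[1])
    )
    # Exclude the header row, and the separator row when present.
    data_rows = len(lines) - (2 if has_separator else 1)
    return data_rows * columns + (10 if has_separator else 0)


complete = "| A | B |\n|---|---|\n| 1 | 2 |\n| 3 | 4 |"
broken = "| A | B |\n| 1 | 2 |"
print(score_table(complete) > score_table(broken))  # True
```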

Image Extraction

```bash
# Extract images with metadata
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./assets

# Generate markdown references file
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md
```

Output:
- Images: `assets/img_page1_1.png`, `assets/img_page2_1.jpg`
- Metadata: `assets/images_metadata.json` (page, position, dimensions)

Quality Validation

```bash
# Validate conversion quality
uv run --with pymupdf scripts/validate_output.py document.pdf output.md

# Generate HTML report
uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html
```

Quality Metrics

| Metric | Pass | Warn | Fail |
|--------|------|------|------|
| Text Retention | >95% | 85-95% | <85% |
| Table Retention | 100% | 90-99% | <90% |
| Image Retention | 100% | 80-99% | <80% |
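
The thresholds above translate directly into a small classifier. The sketch below is illustrative only: `classify` and the metric keys are hypothetical names, not the interface of `scripts/validate_output.py`, and values that fall between the listed bands (for example 99.5% table retention) are assumed to count as a warning.

```python
# (pass_floor, warn_floor) per metric, copied from the Quality Metrics table.
THRESHOLDS = {
    "text": (95.0, 85.0),
    "table": (100.0, 90.0),
    "image": (100.0, 80.0),
}


def classify(metric: str, retention_pct: float) -> str:
    """Map a retention percentage to 'pass', 'warn', or 'fail'."""
    pass_floor, warn_floor = THRESHOLDS[metric]
    if metric == "text":
        # Text passes only when strictly above 95%.
        passed = retention_pct > pass_floor
    else:
        # Tables and images pass only at full retention.
        passed = retention_pct >= pass_floor
    if passed:
        return "pass"
    if retention_pct >= warn_floor:
        return "warn"
    return "fail"


print(classify("text", 96.2))   # pass
print(classify("table", 95.0))  # warn
print(classify("image", 75.0))  # fail
```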

Merge Outputs Manually

```bash
# Merge multiple markdown files
python scripts/merge_outputs.py output1.md output2.md -o merged.md

# Show segment attribution
python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose
```

Path Conversion (Windows/WSL)

```bash
# Windows → WSL conversion
python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
# Output: /mnt/c/Users/name/Documents/file.pdf
```
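
The mapping itself is mechanical: the drive letter is lowercased and placed under `/mnt/`, and backslashes become forward slashes. A minimal sketch of that rule follows; `windows_to_wsl` is a hypothetical function name and is not taken from `scripts/convert_path.py`, which may handle more cases such as UNC paths.

```python
import re

# Matches a drive-letter path such as C:\Users\... or C:/Users/...
DRIVE_PATH = re.compile(r"^([A-Za-z]):[\\/](.*)$")


def windows_to_wsl(path: str) -> str:
    """Convert a Windows drive path to its /mnt/<drive>/ form; return other paths unchanged."""
    match = DRIVE_PATH.match(path)
    if not match:
        return path
    drive, rest = match.groups()
    return f"/mnt/{drive.lower()}/" + rest.replace("\\", "/")


print(windows_to_wsl(r"C:\Users\name\Documents\file.pdf"))
# /mnt/c/Users/name/Documents/file.pdf
```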

Common Issues

**"No conversion tools available"**

```bash
# Install all tools
pip install pymupdf4llm
uv tool install "markitdown[pdf]"
brew install pandoc
```

**FontBBox warnings during PDF conversion**
- Harmless font parsing warnings; the output is still correct

**Images missing from output**
- Use Heavy Mode for better image preservation
- Or extract separately with `scripts/extract_pdf_images.py`

**Tables broken in output**
- Use Heavy Mode, which selects the most complete table version
- Or validate with `scripts/validate_output.py`

Bundled Scripts

| Script | Purpose |
|--------|---------|
| `convert.py` | Main orchestrator with Quick/Heavy mode |
| `merge_outputs.py` | Merge multiple markdown outputs |
| `validate_output.py` | Quality validation with HTML report |
| `extract_pdf_images.py` | PDF image extraction with metadata |
| `convert_path.py` | Windows to WSL path converter |

References

  • `references/heavy-mode-guide.md` - Detailed Heavy Mode documentation
  • `references/tool-comparison.md` - Tool capabilities comparison
  • `references/conversion-examples.md` - Batch operation examples