markdown-tools

Markdown Tools

Convert documents to high-quality markdown with intelligent multi-tool orchestration.

Dual Mode Architecture

| Mode | Speed | Quality | Use Case |
|------|-------|---------|----------|
| Quick (default) | Fast | Good | Drafts, simple documents |
| Heavy | Slower | Best | Final documents, complex layouts |

Quick Start

Installation

```bash
# Required: PDF/DOCX/PPTX support
uv tool install "markitdown[pdf]"
pip install pymupdf4llm
brew install pandoc
```


Basic Conversion

```bash
# Quick Mode (default) - fast, single best tool
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md

# Heavy Mode - multi-tool parallel execution with merge
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md --heavy

# Check available tools
uv run scripts/convert.py --list-tools
```
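
To illustrate what a tool check such as `--list-tools` might involve, here is a minimal Python sketch. It is not the code in `scripts/convert.py`; the function name `available_tools` is hypothetical, and the check assumes the Python packages are importable (or, for markitdown, that its CLI is on `PATH`) and that pandoc is installed as a binary.

```python
import importlib.util
import shutil


def available_tools() -> dict[str, bool]:
    """Report which converters appear to be installed (hypothetical helper)."""
    return {
        # Python packages: locate the module without importing it.
        "pymupdf4llm": importlib.util.find_spec("pymupdf4llm") is not None,
        # markitdown may be importable or installed as a standalone CLI.
        "markitdown": importlib.util.find_spec("markitdown") is not None
        or shutil.which("markitdown") is not None,
        # pandoc is an external binary, so look it up on PATH.
        "pandoc": shutil.which("pandoc") is not None,
    }


if __name__ == "__main__":
    for name, present in available_tools().items():
        print(f"{name}: {'available' if present else 'missing'}")
```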

Tool Selection Matrix

| Format | Quick Mode Tool | Heavy Mode Tools |
|--------|-----------------|------------------|
| PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pandoc | pandoc + markitdown |
| PPTX | markitdown | markitdown + pandoc |
| XLSX | markitdown | markitdown |
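
The matrix above can be read as a lookup from file extension to an ordered tool list. The following is a sketch under that assumption, not the orchestrator's actual logic; `TOOL_MATRIX` and `select_tools` are hypothetical names.

```python
from pathlib import Path

# Copied from the matrix above; the first entry is the Quick Mode tool.
TOOL_MATRIX = {
    ".pdf": ["pymupdf4llm", "markitdown"],
    ".docx": ["pandoc", "markitdown"],
    ".pptx": ["markitdown", "pandoc"],
    ".xlsx": ["markitdown"],
}


def select_tools(path: str, heavy: bool = False) -> list[str]:
    """Return the tools to run for a file (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix not in TOOL_MATRIX:
        raise ValueError(f"Unsupported format: {suffix or path}")
    tools = TOOL_MATRIX[suffix]
    # Quick Mode uses only the first tool; Heavy Mode uses all of them.
    return list(tools) if heavy else tools[:1]


print(select_tools("report.pdf"))              # Quick Mode
print(select_tools("slides.pptx", heavy=True))  # Heavy Mode
```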

Tool Characteristics

- **pymupdf4llm**: LLM-optimized PDF conversion with native table detection and image extraction
- **markitdown**: Microsoft's universal converter, good for Office formats
- **pandoc**: Excellent structure preservation for DOCX/PPTX

Heavy Mode Workflow

Heavy Mode runs multiple tools in parallel and selects the best segments:

1. **Parallel Execution**: Run all applicable tools simultaneously
2. **Segment Analysis**: Parse each output into segments (tables, headings, images, paragraphs)
3. **Quality Scoring**: Score each segment based on completeness and structure
4. **Intelligent Merge**: Select best version of each segment across tools
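
The first two steps of this workflow can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the implementation in `scripts/convert.py`: each converter is modeled as a plain callable that returns markdown, segments are split on blank lines, and the names `run_parallel`, `classify`, and `split_segments` are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def run_parallel(converters: dict, path: str) -> dict[str, str]:
    """Run every converter callable on the same input concurrently."""
    with ThreadPoolExecutor(max_workers=len(converters) or 1) as pool:
        futures = {name: pool.submit(fn, path) for name, fn in converters.items()}
        return {name: fut.result() for name, fut in futures.items()}


def classify(block: str) -> str:
    """Guess the segment type from the first line of a markdown block."""
    first = block.lstrip().splitlines()[0]
    if first.startswith("#"):
        return "heading"
    if first.startswith("|"):
        return "table"
    if first.startswith("!["):
        return "image"
    if re.match(r"([-*+]|\d+\.)\s", first):
        return "list"
    return "paragraph"


def split_segments(markdown: str) -> list[tuple[str, str]]:
    """Split markdown on blank lines and tag each block with its type."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", markdown) if b.strip()]
    return [(classify(b), b) for b in blocks]


# Stub converters stand in for the real tools in this sketch.
stubs = {
    "tool_a": lambda p: "# Title\n\nSome text.",
    "tool_b": lambda p: "| x |\n|---|\n| 1 |",
}
outputs = run_parallel(stubs, "document.pdf")
for tool, text in outputs.items():
    print(tool, split_segments(text))
```

Splitting on blank lines is a deliberate simplification; a real parser would also need to handle fenced code blocks and tables that contain blank cells.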

Merge Criteria

| Segment Type | Selection Criteria |
|--------------|--------------------|
| Tables | More rows/columns, proper header separator |
| Images | Alt text present, local paths preferred |
| Headings | Proper hierarchy, appropriate length |
| Lists | More items, nested structure preserved |
| Paragraphs | Content completeness |
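
As an illustration of the table criterion, the sketch below scores a markdown table by its size and by whether a header separator row is present, then picks the best candidate. The scoring formula, the separator bonus of 10, and the names `score_table` and `pick_best` are assumptions made for this example; the weights used by `scripts/merge_outputs.py` are not documented here.

```python
import re

# Matches a markdown header separator row such as |---|:--:|
SEPARATOR_ROW = re.compile(r"^\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?$")


def score_table(table: str) -> int:
    """Score a table: rows x columns, plus a bonus for a separator row."""
    lines = [ln.strip() for ln in table.strip().splitlines() if ln.strip()]
    rows = [ln for ln in lines if not SEPARATOR_ROW.match(ln)]
    columns = max((len(ln.strip("|").split("|")) for ln in rows), default=0)
    has_separator = len(lines) > 1 and bool(SEPARATOR_ROW.match(lines[1]))
    # The bonus value is arbitrary and chosen only for illustration.
    return len(rows) * columns + (10 if has_separator else 0)


def pick_best(candidates: dict[str, str]) -> str:
    """Return the name of the tool whose table scores highest."""
    return max(candidates, key=lambda tool: score_table(candidates[tool]))


full = "| a | b |\n|---|---|\n| 1 | 2 |\n| 3 | 4 |"
partial = "| a | b |\n| 1 | 2 |"
print(pick_best({"pymupdf4llm": full, "markitdown": partial}))
```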

Image Extraction

```bash
# Extract images with metadata
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./assets

# Generate markdown references file
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md
```


Output:
- Images: `assets/img_page1_1.png`, `assets/img_page2_1.jpg`
- Metadata: `assets/images_metadata.json` (page, position, dimensions)

Quality Validation

```bash
# Validate conversion quality
uv run --with pymupdf scripts/validate_output.py document.pdf output.md

# Generate HTML report
uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html
```


Quality Metrics

| Metric | Pass | Warn | Fail |
|--------|------|------|------|
| Text Retention | >95% | 85-95% | <85% |
| Table Retention | 100% | 90-99% | <90% |
| Image Retention | 100% | 80-99% | <80% |
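
The thresholds above map naturally onto a small classifier. The sketch below is one possible reading, not the logic of `scripts/validate_output.py`: the name `retention_status` is hypothetical, and values the table leaves unspecified (for example 99.5% table retention) are assumed to fall into the warn band.

```python
# (pass threshold, warn threshold) in percent, taken from the table above.
THRESHOLDS = {
    "text": (95.0, 85.0),
    "table": (100.0, 90.0),
    "image": (100.0, 80.0),
}


def retention_status(metric: str, retained_pct: float) -> str:
    """Classify a retention percentage as pass, warn, or fail."""
    pass_at, warn_at = THRESHOLDS[metric]
    # Text passes strictly above 95%; tables and images must reach 100%.
    if metric == "text":
        passed = retained_pct > pass_at
    else:
        passed = retained_pct >= pass_at
    if passed:
        return "pass"
    return "warn" if retained_pct >= warn_at else "fail"


for metric, pct in [("text", 96.0), ("table", 92.0), ("image", 75.0)]:
    print(metric, pct, retention_status(metric, pct))
```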

Merge Outputs Manually

```bash
# Merge multiple markdown files
python scripts/merge_outputs.py output1.md output2.md -o merged.md

# Show segment attribution
python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose
```


Path Conversion (Windows/WSL)

```bash
# Windows → WSL conversion
python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
# Output: /mnt/c/Users/name/Documents/file.pdf
```
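
The mapping shown in the example (drive letter to `/mnt/<letter>`, backslashes to forward slashes) can be sketched in a few lines of Python. This is an illustration of the idea rather than the contents of `scripts/convert_path.py`; the function name `windows_to_wsl` is hypothetical, and only drive-letter paths are handled (no UNC paths).

```python
import re


def windows_to_wsl(path: str) -> str:
    """Convert a drive-letter Windows path to its /mnt/<drive>/ WSL form."""
    match = re.match(r"^([A-Za-z]):[\\/]*(.*)$", path)
    if not match:
        raise ValueError(f"Not a drive-letter Windows path: {path!r}")
    drive, rest = match.groups()
    # WSL mounts drives under /mnt using the lowercase drive letter.
    return f"/mnt/{drive.lower()}/" + rest.replace("\\", "/")


print(windows_to_wsl(r"C:\Users\name\Documents\file.pdf"))
# /mnt/c/Users/name/Documents/file.pdf
```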

Common Issues

"No conversion tools available"

```bash
# Install all tools
pip install pymupdf4llm
uv tool install "markitdown[pdf]"
brew install pandoc
```

**FontBBox warnings during PDF conversion**
- These are harmless font-parsing warnings; the output is still correct

**Images missing from output**
- Use Heavy Mode for better image preservation
- Or extract separately with `scripts/extract_pdf_images.py`

**Tables broken in output**
- Use Heavy Mode - it selects the most complete table version
- Or validate with `scripts/validate_output.py`

Bundled Scripts

| Script | Purpose |
|--------|---------|
| `convert.py` | Main orchestrator with Quick/Heavy mode |
| `merge_outputs.py` | Merge multiple markdown outputs |
| `validate_output.py` | Quality validation with HTML report |
| `extract_pdf_images.py` | PDF image extraction with metadata |
| `convert_path.py` | Windows to WSL path converter |

References

- `references/heavy-mode-guide.md` - Detailed Heavy Mode documentation
- `references/tool-comparison.md` - Tool capabilities comparison
- `references/conversion-examples.md` - Batch operation examples
