doc-to-markdown

Doc to Markdown

Convert documents to high-quality markdown with intelligent multi-tool orchestration and automatic DOCX post-processing.

Architecture: Pandoc (best-in-class extraction) + 8 post-processing fixes (our value-add).

Quick Start

```bash
# DOCX → Markdown (one command, zero manual fixes)
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.docx -o output.md --assets-dir ./media

# PDF → Markdown
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md

# Run tests
uv run --with pytest pytest scripts/test_convert.py -v
```

Dual Mode

| Mode | Speed | Quality | Use Case |
| --- | --- | --- | --- |
| Quick (default) | Fast | Good | Drafts, simple documents |
| Heavy | Slower | Best | Final documents, complex layouts |

Tool Selection

| Format | Quick Mode | Heavy Mode |
| --- | --- | --- |
| PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pandoc + post-processing | pandoc + markitdown |
| PPTX | markitdown | markitdown + pandoc |
| XLSX | markitdown | markitdown |
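
One way to picture the table is as a lookup keyed by file extension and mode. The sketch below is illustrative only: `TOOLS` and `select_tools` are hypothetical names, not taken from `convert.py`, and the DOCX post-processing step is left out because it is not a separate tool.

```python
from pathlib import Path

# Tool lists copied from the table; the dict layout is an assumption.
# DOCX post-processing in Quick mode is a follow-up step, not a tool.
TOOLS = {
    ".pdf": {"quick": ["pymupdf4llm"], "heavy": ["pymupdf4llm", "markitdown"]},
    ".docx": {"quick": ["pandoc"], "heavy": ["pandoc", "markitdown"]},
    ".pptx": {"quick": ["markitdown"], "heavy": ["markitdown", "pandoc"]},
    ".xlsx": {"quick": ["markitdown"], "heavy": ["markitdown"]},
}


def select_tools(path: str, mode: str = "quick") -> list[str]:
    """Return the tool names for a file, based on its extension and the mode."""
    suffix = Path(path).suffix.lower()
    try:
        return TOOLS[suffix][mode]
    except KeyError:
        raise ValueError(f"unsupported format or mode: {suffix!r}, {mode!r}") from None


print(select_tools("report.docx", "heavy"))
```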

格式	快速模式	深度模式
PDF	pymupdf4llm	pymupdf4llm + markitdown
DOCX	pandoc + 后处理	pandoc + markitdown
PPTX	markitdown	markitdown + pandoc
XLSX	markitdown	markitdown

DOCX Post-Processing (automatic)

When converting DOCX via pandoc, 8 cleanups are applied automatically:

| Problem | Fix | Test coverage |
| --- | --- | --- |
| Grid tables (`+:---+`) | Single-column → blockquote, multi-column → pipe table | `TestPostprocessPipeline` |
| Simple tables (`---- ----`) | Multi-column images → pipe table with captions | `TestSimpleTable` |
| Image path nesting (`media/media/`) | Flatten to `media/`, absolute → relative | `test_stats_tracking` |
| Pandoc attributes (`{width="..."}`) | Removed | `test_pandoc_attributes_removed` |
| CJK bold spacing (`粗体中文`) | Add space around `**` for CJK bold spans | `TestCjkBoldSpacing` (15 cases) |
| Indented dashed code blocks | → fenced code block with language detection | `test_code_block_with_language` |
| Escaped brackets (`\[...\]`) | → `[...]` | `test_escaped_brackets_fixed` |
| Double-bracket links (`[[text]](url)`) | → `[text](url)` | `test_double_bracket_links_fixed` |
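
To make a few of these rows concrete, here is a minimal sketch of four of the simpler fixes as regular-expression passes. It is not the code in `convert.py`: the function name `cleanup_pandoc_markdown` and the exact patterns are assumptions, and the table and code-block fixes are omitted because they need real parsing.

```python
import re


def cleanup_pandoc_markdown(text: str) -> str:
    """Apply four simple cleanups to Pandoc-generated Markdown (illustrative)."""
    # Flatten nested image directories: media/media/ -> media/
    text = text.replace("media/media/", "media/")
    # Drop attribute blocks that directly follow a link or image,
    # e.g. ![fig](media/a.png){width="3in"}
    text = re.sub(r'(?<=\))\{[^{}\n]*\}', "", text)
    # Collapse double-bracket links: [[text]](url) -> [text](url)
    text = re.sub(r"\[\[([^\[\]]+)\]\]\(([^()\s]+)\)", r"[\1](\2)", text)
    # Unescape brackets: \[...\] -> [...]
    # Caveat: this also unescapes brackets that were escaped on purpose.
    text = re.sub(r"\\([\[\]])", r"\1", text)
    return text


print(cleanup_pandoc_markdown('![fig](media/media/image1.png){width="3in"}'))
```

Each pass is independent, so the order only matters where one pattern could produce input for another.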

CJK Bold Spacing — why and how

DOCX applies styling at the run level, so CJK text has no spaces between bold and normal runs. Some Markdown renderers need whitespace around `**` to recognize bold boundaries.

Rule: if a `**content**` span contains any CJK character, ensure both sides have a space, unless the span is already spaced or sits at a line boundary. This handles CJK punctuation, emoji adjacency, and mixed content.

Before: 打开**飞书**，就可以    → some renderers fail to bold
After:  打开 **飞书** ，就可以  → universally renders correctly
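
The rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the tested implementation: `space_cjk_bold` is a hypothetical name, the CJK character ranges are a simplified subset, and bold spans are assumed not to nest or cross lines.

```python
import re

# Simplified CJK detection (assumption): CJK punctuation, Extension A,
# the main unified ideograph block, and fullwidth forms.
CJK_RE = re.compile(r"[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]")
BOLD_RE = re.compile(r"\*\*(.+?)\*\*")


def space_cjk_bold(line: str) -> str:
    """Add a space on each side of a bold span that contains CJK text."""
    result = ""
    pos = 0
    for match in BOLD_RE.finditer(line):
        result += line[pos:match.start()]
        if CJK_RE.search(match.group(1)):
            # Left side: skip at line start or when already spaced.
            if result and not result[-1].isspace():
                result += " "
            result += match.group(0)
            # Right side: skip at line end or when already spaced.
            if match.end() < len(line) and not line[match.end()].isspace():
                result += " "
        else:
            result += match.group(0)
        pos = match.end()
    return result + line[pos:]


print(space_cjk_bold("打开**飞书**，就可以"))
```

Checking the accumulated output rather than the input on the left side keeps two adjacent bold spans from receiving a double space.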

Heavy Mode Workflow

Heavy Mode runs multiple tools in parallel and selects the best segments:

1. **Parallel Execution**: Run all applicable tools simultaneously
2. **Segment Analysis**: Parse each output into segments (tables, headings, images, paragraphs)
3. **Quality Scoring**: Score each segment based on completeness and structure
4. **Intelligent Merge**: Select best version of each segment across tools
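
The four steps can be sketched end to end with stand-in converters. Everything here is hypothetical: `tool_a` and `tool_b` replace the real tools, segments are split on blank lines, the score is just segment length, and the outputs are assumed to contain the same segments in the same order. The real pipeline presumably does more careful alignment and scoring.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical converters standing in for the real tools.
def tool_a(path):
    return "# Title\n\nShort paragraph."


def tool_b(path):
    return "# Title\n\nA longer, more complete paragraph."


def split_segments(markdown):
    """Split on blank lines; a real parser would detect tables, lists, etc."""
    return [part.strip() for part in markdown.split("\n\n") if part.strip()]


def score(segment):
    """Placeholder quality score: longer segments count as more complete."""
    return len(segment)


def heavy_convert(path, tools):
    # 1. Parallel execution: run every tool on the same input.
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(lambda tool: tool(path), tools))
    # 2. Segment analysis.
    segmented = [split_segments(output) for output in outputs]
    # 3 and 4. Score aligned candidates and keep the best of each.
    # zip() assumes the outputs line up and stops at the shortest one.
    merged = [max(candidates, key=score) for candidates in zip(*segmented)]
    return "\n\n".join(merged)


print(heavy_convert("document.pdf", [tool_a, tool_b]))
```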

Merge Criteria

| Segment Type | Selection Criteria |
| --- | --- |
| Tables | More rows/columns, proper header separator |
| Images | Alt text present, local paths preferred |
| Headings | Proper hierarchy, appropriate length |
| Lists | More items, nested structure preserved |
| Paragraphs | Content completeness |
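
As an illustration of the Tables row, the sketch below scores a pipe-table segment by its size and by whether a header separator is present. The function name `score_table` and the weights are arbitrary assumptions for this example, not the values used by `merge_outputs.py`.

```python
import re

# Matches a pipe-table header separator such as |---|:---:|
SEPARATOR_RE = re.compile(r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?\s*$")


def score_table(segment: str) -> int:
    """Score a pipe table: more cells and a header separator score higher."""
    lines = [line for line in segment.strip().splitlines() if line.strip()]
    if not lines:
        return 0
    has_separator = len(lines) > 1 and bool(SEPARATOR_RE.match(lines[1]))
    columns = max(len(line.strip().strip("|").split("|")) for line in lines)
    rows = len(lines) - (1 if has_separator else 0)
    # Arbitrary weighting (assumption): cell count plus a separator bonus.
    return rows * columns + (10 if has_separator else 0)


full = "| A | B |\n|---|---|\n| 1 | 2 |\n| 3 | 4 |"
partial = "| A | B |\n| 1 | 2 |"
best = max([partial, full], key=score_table)
print(best)
```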

Image Extraction

```bash
# Extract images with metadata
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./assets

# Generate markdown references file
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md
```

Output:
- Images: `assets/img_page1_1.png`, `assets/img_page2_1.jpg`
- Metadata: `assets/images_metadata.json` (page, position, dimensions)

Quality Validation

```bash
# Validate conversion quality
uv run --with pymupdf scripts/validate_output.py document.pdf output.md

# Generate HTML report
uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html
```

Quality Metrics

| Metric | Pass | Warn | Fail |
| --- | --- | --- | --- |
| Text Retention | >95% | 85-95% | <85% |
| Table Retention | 100% | 90-99% | <90% |
| Image Retention | 100% | 80-99% | <80% |
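
Read as code, the thresholds might look like the sketch below. The `classify` function is hypothetical and not part of `validate_output.py`. Values that fall between the listed bands, such as 99.5% table retention, are treated as warnings here, which is an assumption.

```python
# Lower bound of the "warn" band per metric, taken from the table.
WARN_FLOOR = {"text": 85.0, "table": 90.0, "image": 80.0}


def classify(metric: str, retention_pct: float) -> str:
    """Map a retention percentage to 'pass', 'warn', or 'fail'."""
    if metric not in WARN_FLOOR:
        raise ValueError(f"unknown metric: {metric!r}")
    if metric == "text":
        passed = retention_pct > 95.0  # text passes strictly above 95%
    else:
        passed = retention_pct >= 100.0  # tables and images must be complete
    if passed:
        return "pass"
    if retention_pct >= WARN_FLOOR[metric]:
        return "warn"
    return "fail"


print(classify("text", 96.2), classify("table", 95.0), classify("image", 70.0))
```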

Merge Outputs Manually

```bash
# Merge multiple markdown files
python scripts/merge_outputs.py output1.md output2.md -o merged.md

# Show segment attribution
python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose
```

Path Conversion (Windows/WSL)

```bash
# Windows to WSL conversion
python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
# Output: /mnt/c/Users/name/Documents/file.pdf
```
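
For readers curious about the mapping itself, here is a minimal sketch of a drive-letter conversion using only the standard library. It is not the code in `convert_path.py`: the function name `windows_to_wsl` is hypothetical, and the sketch assumes the default WSL mount point `/mnt/<drive>` and rejects UNC or relative paths.

```python
from pathlib import PureWindowsPath


def windows_to_wsl(path: str) -> str:
    """Convert a drive-letter Windows path to its /mnt/<drive>/... form."""
    win_path = PureWindowsPath(path)
    drive = win_path.drive  # e.g. "C:"
    if len(drive) != 2 or drive[1] != ":":
        raise ValueError(f"expected a drive-letter path, got: {path!r}")
    # parts[0] is the drive anchor ("C:\\"); the rest are path components.
    return "/".join(["/mnt", drive[0].lower(), *win_path.parts[1:]])


print(windows_to_wsl(r"C:\Users\name\Documents\file.pdf"))
```

`PureWindowsPath` parses Windows separators on any host OS, so the sketch behaves the same inside WSL and on Windows.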

Common Issues

**"No conversion tools available"**

```bash
# Install all tools
pip install pymupdf4llm
uv tool install "markitdown[pdf]"
brew install pandoc
```


**FontBBox warnings during PDF conversion**
- Harmless font parsing warnings, output is still correct

**Images missing from output**
- Use Heavy Mode for better image preservation
- Or extract separately with `scripts/extract_pdf_images.py`

**Tables broken in output**
- Use Heavy Mode - it selects the most complete table version
- Or validate with `scripts/validate_output.py`

Bundled Scripts

| Script | Purpose |
| --- | --- |
| `convert.py` | Main orchestrator with Quick/Heavy mode + DOCX post-processing |
| `test_convert.py` | 31 tests covering all post-processing functions |
| `merge_outputs.py` | Merge multiple markdown outputs |
| `validate_output.py` | Quality validation with HTML report |
| `extract_pdf_images.py` | PDF image extraction with metadata |
| `convert_path.py` | Windows to WSL path converter |

References

- `references/benchmark-2026-03-22.md` - 5-tool benchmark (Docling/MarkItDown/Pandoc/Mammoth/ours)
- `references/heavy-mode-guide.md` - Detailed Heavy Mode documentation
- `references/tool-comparison.md` - Tool capabilities comparison
- `references/conversion-examples.md` - Batch operation examples
