doc-to-markdown

Doc to Markdown

Convert documents to high-quality markdown with intelligent multi-tool orchestration and automatic DOCX post-processing.
Architecture: Pandoc (best-in-class extraction) + 8 post-processing fixes (our value-add).

Quick Start

DOCX → Markdown (one command, zero manual fixes)

uv run --with pymupdf4llm --with markitdown scripts/convert.py document.docx -o output.md --assets-dir ./media

PDF → Markdown

uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md

Run tests

uv run --with pytest pytest scripts/test_convert.py -v

Dual Mode

| Mode | Speed | Quality | Use Case |
|------|-------|---------|----------|
| Quick (default) | Fast | Good | Drafts, simple documents |
| Heavy | Slower | Best | Final documents, complex layouts |

Tool Selection

| Format | Quick Mode | Heavy Mode |
|--------|------------|------------|
| PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pandoc + post-processing | pandoc + markitdown |
| PPTX | markitdown | markitdown + pandoc |
| XLSX | markitdown | markitdown |
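
The table above amounts to a lookup from file extension and mode to an ordered list of tools. The sketch below shows one way such a lookup could be written; the names `TOOLS` and `select_tools` are hypothetical and are not taken from `convert.py`.

```python
from pathlib import Path

# Hypothetical lookup table mirroring the Tool Selection table.
# The real convert.py may organise this differently.
TOOLS = {
    "pdf": {"quick": ["pymupdf4llm"], "heavy": ["pymupdf4llm", "markitdown"]},
    "docx": {"quick": ["pandoc"], "heavy": ["pandoc", "markitdown"]},
    "pptx": {"quick": ["markitdown"], "heavy": ["markitdown", "pandoc"]},
    "xlsx": {"quick": ["markitdown"], "heavy": ["markitdown"]},
}


def select_tools(path, mode="quick"):
    """Return the tools to run for a file, based on its extension and the mode."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext not in TOOLS:
        raise ValueError(f"unsupported format: {ext!r}")
    if mode not in ("quick", "heavy"):
        raise ValueError(f"unknown mode: {mode!r}")
    # Return a copy so callers cannot mutate the shared table.
    return list(TOOLS[ext][mode])
```

For example, `select_tools("slides.pptx", "heavy")` would return `["markitdown", "pandoc"]` under these assumptions.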

DOCX Post-Processing (automatic)

When converting DOCX via pandoc, 8 cleanups are applied automatically:
| Problem | Fix | Test coverage |
|---------|-----|---------------|
| Grid tables (`+:---+`) | Single-column → blockquote, multi-column → pipe table | `TestPostprocessPipeline` |
| Simple tables (`---- ----`) | Multi-column images → pipe table with captions | `TestSimpleTable` |
| Image path nesting (`media/media/`) | Flatten to `media/`, absolute → relative | `test_stats_tracking` |
| Pandoc attributes (`{width="..."}`) | Removed | `test_pandoc_attributes_removed` |
| CJK bold spacing (`**粗体**中文`) | Add space around `**` for CJK bold spans | `TestCjkBoldSpacing` (15 cases) |
| Indented dashed code blocks | Converted to fenced code blocks with language detection | `test_code_block_with_language` |
| Escaped brackets (`\[...\]`) | `[...]` | `test_escaped_brackets_fixed` |
| Double-bracket links (`[[text]](url)`) | `[text](url)` | `test_double_bracket_links_fixed` |
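
Several of these cleanups are plain text rewrites. As a rough illustration, the sketch below covers four of them (nested media paths, Pandoc attributes, escaped brackets, double-bracket links) with regular expressions. The function names and patterns are assumptions made for this example, not the code in `convert.py`, and the sketch does not skip code blocks the way a production version would need to.

```python
import re


def flatten_media_paths(text):
    """media/media/ -> media/ (illustrative; ignores absolute-path handling)."""
    return text.replace("media/media/", "media/")


def strip_pandoc_attributes(text):
    """Drop a {...} attribute block that directly follows a link or image."""
    return re.sub(r"(\]\([^)]*\))\{[^{}\n]*\}", r"\1", text)


def fix_escaped_brackets(text):
    r"""\[...\] -> [...]"""
    return re.sub(r"\\([\[\]])", r"\1", text)


def fix_double_bracket_links(text):
    """[[text]](url) -> [text](url)"""
    return re.sub(r"\[\[([^\[\]]+)\]\]\(([^)\s]+)\)", r"[\1](\2)", text)


def postprocess(markdown):
    """Apply the four illustrative fixes in a fixed order."""
    for fix in (
        flatten_media_paths,
        strip_pandoc_attributes,
        fix_escaped_brackets,
        fix_double_bracket_links,
    ):
        markdown = fix(markdown)
    return markdown
```

Keeping each fix as its own small function makes it straightforward to test one cleanup at a time, which matches the per-fix test names listed in the table.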

CJK Bold Spacing — why and how

DOCX uses run-level styling, so CJK text has no spaces between bold and normal runs. Markdown renderers need whitespace around `**` to recognize bold boundaries.

Rule: if a `**content**` span contains any CJK character, ensure both sides have a space, unless already spaced or at a line boundary. This handles CJK punctuation, emoji adjacency, and mixed content.

Before: 打开**飞书**,就可以    → some renderers fail to bold
After:  打开 **飞书** ,就可以  → universally renders correctly
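
The rule can be sketched in a few lines of Python. This is a minimal illustration, not the implementation in `convert.py`: the function name `space_cjk_bold` is hypothetical, and the `CJK` character class (kana plus the main CJK ideograph blocks) is a rough approximation of "any CJK character".

```python
import re

# Assumption: kana and the main CJK ideograph blocks count as "CJK".
CJK = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]")
# A bold span: ** + content that starts and ends with non-space + **
BOLD = re.compile(r"\*\*(?=\S)(.+?)(?<=\S)\*\*")


def space_cjk_bold(line):
    """Pad CJK bold spans on a single line with spaces where they are missing."""
    out = []
    pos = 0
    for m in BOLD.finditer(line):
        out.append(line[pos:m.start()])
        span = m.group(0)
        if CJK.search(m.group(1)):
            before = line[m.start() - 1] if m.start() > 0 else ""
            after = line[m.end()] if m.end() < len(line) else ""
            # Empty string means a line boundary: no padding needed there.
            if before and not before.isspace():
                span = " " + span
            if after and not after.isspace():
                span = span + " "
        out.append(span)
        pos = m.end()
    out.append(line[pos:])
    return "".join(out)
```

Applied to the "Before" line above, this sketch produces the "After" form; bold spans without CJK content are left untouched.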

Heavy Mode Workflow

Heavy Mode runs multiple tools in parallel and selects the best segments:
  1. Parallel Execution: Run all applicable tools simultaneously
  2. Segment Analysis: Parse each output into segments (tables, headings, images, paragraphs)
  3. Quality Scoring: Score each segment based on completeness and structure
  4. Intelligent Merge: Select best version of each segment across tools
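
The four steps can be sketched as follows. This is a simplified illustration under strong assumptions: converters are plain callables, segments are split on blank lines, every tool is assumed to yield segments in the same order, and segment length stands in for the real quality score. The names `split_segments`, `score`, and `heavy_merge` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_segments(markdown):
    """Step 2: split an output into blank-line-separated segments."""
    return [s.strip() for s in markdown.split("\n\n") if s.strip()]


def score(segment):
    """Step 3: placeholder score. Longer is treated as more complete."""
    return len(segment)


def heavy_merge(converters, source):
    """Run every converter on `source` and keep the best segment at each position."""
    # Step 1: run all converters concurrently.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, source) for name, fn in converters.items()}
        outputs = {name: fut.result() for name, fut in futures.items()}
    segmented = [split_segments(md) for md in outputs.values()]
    # Step 4: zip() pairs segments by position and stops at the shortest output.
    merged = [max(candidates, key=score) for candidates in zip(*segmented)]
    return "\n\n".join(merged)
```

Real outputs rarely align position by position, so an actual merge needs to match segments by type and content before scoring them.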

Merge Criteria

| Segment Type | Selection Criteria |
|--------------|--------------------|
| Tables | More rows/columns, proper header separator |
| Images | Alt text present, local paths preferred |
| Headings | Proper hierarchy, appropriate length |
| Lists | More items, nested structure preserved |
| Paragraphs | Content completeness |
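
As an illustration of the table criterion, the sketch below scores a pipe-table segment by its size and by whether its second line is a header separator. The function name `score_table` and the weights (cell count plus a fixed bonus of 10) are arbitrary choices for this example, not values from `merge_outputs.py`.

```python
import re

# A header separator row such as |---|:--:| or --- | ---
SEPARATOR = re.compile(r"^\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?$")


def score_table(segment):
    """Score a pipe table: more rows and columns is better, a separator adds a bonus."""
    lines = [ln.strip() for ln in segment.splitlines() if ln.strip()]
    if not lines:
        return 0
    has_separator = len(lines) > 1 and bool(SEPARATOR.match(lines[1]))
    # The separator row is not counted as a content row.
    rows = len(lines) - (1 if has_separator else 0)
    cols = max(len(ln.strip("|").split("|")) for ln in lines)
    return rows * cols + (10 if has_separator else 0)
```

With this scoring, two candidate tables of the same size are ranked by whether the header separator survived conversion.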

Image Extraction

Extract images with metadata

uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./assets

Generate markdown references file

uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md

Output:
- Images: `assets/img_page1_1.png`, `assets/img_page2_1.jpg`
- Metadata: `assets/images_metadata.json` (page, position, dimensions)

Quality Validation

Validate conversion quality

uv run --with pymupdf scripts/validate_output.py document.pdf output.md

Generate HTML report

uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html

Quality Metrics

| Metric | Pass | Warn | Fail |
|--------|------|------|------|
| Text Retention | >95% | 85-95% | <85% |
| Table Retention | 100% | 90-99% | <90% |
| Image Retention | 100% | 80-99% | <80% |
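
These thresholds translate directly into a small grading function. The sketch below is an assumption about how they might be applied, not the logic of `validate_output.py`: the names `THRESHOLDS` and `grade` are hypothetical, and values that fall between the listed bands (for example 99.5% table retention) are treated as a warning.

```python
# (pass threshold, warn threshold) per metric, taken from the table.
THRESHOLDS = {
    "text": (95.0, 85.0),
    "table": (100.0, 90.0),
    "image": (100.0, 80.0),
}


def grade(metric, pct):
    """Return 'pass', 'warn', or 'fail' for a retention percentage."""
    pass_at, warn_at = THRESHOLDS[metric]
    # Text passes strictly above 95%; tables and images must reach 100%.
    passed = pct > pass_at if metric == "text" else pct >= pass_at
    if passed:
        return "pass"
    return "warn" if pct >= warn_at else "fail"
```

For instance, `grade("text", 95)` gives `"warn"` here, because the table lists the pass band as strictly greater than 95%.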

Merge Outputs Manually

Merge multiple markdown files

python scripts/merge_outputs.py output1.md output2.md -o merged.md

Show segment attribution

python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose

Path Conversion (Windows/WSL)

Windows to WSL conversion

python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"

Output: /mnt/c/Users/name/Documents/file.pdf
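
The mapping itself is simple: the drive letter is lowercased and placed under `/mnt`, and backslashes become forward slashes. The sketch below reproduces that behaviour with the standard library; the function name `windows_to_wsl` is hypothetical and `convert_path.py` may handle more cases, such as UNC paths.

```python
from pathlib import PureWindowsPath


def windows_to_wsl(path):
    """Convert an absolute drive-letter Windows path to its /mnt/<drive>/ form."""
    p = PureWindowsPath(path)
    # Require a single drive letter plus a root, e.g. C:\ (this rejects UNC paths).
    if len(p.drive) != 2 or p.drive[1] != ":" or not p.root:
        raise ValueError(f"not an absolute drive-letter path: {path!r}")
    drive = p.drive[0].lower()
    # parts[0] is the drive plus root ("C:\\"); the rest are path components.
    return "/".join(["/mnt", drive, *p.parts[1:]])
```

`PureWindowsPath` parses Windows paths on any platform, so the sketch behaves the same inside WSL and on Windows.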

Common Issues

"No conversion tools available"

Install all tools:

- `pip install pymupdf4llm`
- `uv tool install "markitdown[pdf]"`
- `brew install pandoc`

**FontBBox warnings during PDF conversion**
- These are harmless font-parsing warnings; the output is still correct

**Images missing from output**
- Use Heavy Mode for better image preservation
- Or extract separately with `scripts/extract_pdf_images.py`

**Tables broken in output**
- Use Heavy Mode - it selects the most complete table version
- Or validate with `scripts/validate_output.py`

Bundled Scripts

| Script | Purpose |
|--------|---------|
| `convert.py` | Main orchestrator with Quick/Heavy mode + DOCX post-processing |
| `test_convert.py` | 31 tests covering all post-processing functions |
| `merge_outputs.py` | Merge multiple markdown outputs |
| `validate_output.py` | Quality validation with HTML report |
| `extract_pdf_images.py` | PDF image extraction with metadata |
| `convert_path.py` | Windows to WSL path converter |

References

  • references/benchmark-2026-03-22.md
    - 5-tool benchmark (Docling/MarkItDown/Pandoc/Mammoth/ours)
  • references/heavy-mode-guide.md
    - Detailed Heavy Mode documentation
  • references/tool-comparison.md
    - Tool capabilities comparison
  • references/conversion-examples.md
    - Batch operation examples