doc-to-markdown
Doc to Markdown
Convert documents to high-quality markdown with intelligent multi-tool orchestration and automatic DOCX post-processing.
Architecture: Pandoc (best-in-class extraction) + 8 post-processing fixes (our value-add).
Quick Start
```bash
# DOCX → Markdown (one command, zero manual fixes)
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.docx -o output.md --assets-dir ./media

# PDF → Markdown
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md

# Run tests
uv run --with pytest pytest scripts/test_convert.py -v
```
Dual Mode
| Mode | Speed | Quality | Use Case |
|---|---|---|---|
| Quick (default) | Fast | Good | Drafts, simple documents |
| Heavy | Slower | Best | Final documents, complex layouts |
Tool Selection
| Format | Quick Mode | Heavy Mode |
|---|---|---|
| PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pandoc + post-processing | pandoc + markitdown |
| PPTX | markitdown | markitdown + pandoc |
| XLSX | markitdown | markitdown |
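
The table above can be read as a lookup from file extension and mode to a list of tools. The sketch below is only an illustration of that lookup: the `TOOLS` dict and the `select_tools` helper are hypothetical names, not taken from `scripts/convert.py`, and the DOCX post-processing step is left out because it is a cleanup pass rather than a tool.

```python
from pathlib import Path

# Mirrors the Tool Selection table. The dict layout is an assumption made
# for illustration; the real orchestrator may store this differently.
TOOLS = {
    ".pdf": {"quick": ["pymupdf4llm"], "heavy": ["pymupdf4llm", "markitdown"]},
    ".docx": {"quick": ["pandoc"], "heavy": ["pandoc", "markitdown"]},
    ".pptx": {"quick": ["markitdown"], "heavy": ["markitdown", "pandoc"]},
    ".xlsx": {"quick": ["markitdown"], "heavy": ["markitdown"]},
}


def select_tools(path: str, mode: str = "quick") -> list[str]:
    """Return the tool names to run for a file in the given mode."""
    suffix = Path(path).suffix.lower()
    if suffix not in TOOLS:
        raise ValueError(f"unsupported format: {suffix or path}")
    if mode not in ("quick", "heavy"):
        raise ValueError(f"unknown mode: {mode}")
    return TOOLS[suffix][mode]
```

Keeping the mapping in data rather than in branching code means a new format only needs one new entry.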
DOCX Post-Processing (automatic)
When converting DOCX via pandoc, 8 cleanups are applied automatically:
| Problem | Fix | Test coverage |
|---|---|---|
| Grid tables | Single-column → blockquote, multi-column → pipe table | |
| Simple tables | Multi-column images → pipe table with captions | |
| Image path nesting | Flattened | |
| Pandoc attributes | Removed | |
| CJK bold spacing | Add a space on each side of the `**` markers | |
| Indented dashed code blocks | Converted to fenced code blocks with language detection | |
| Escaped brackets | Unescaped | |
| Double-bracket links | Rewritten as standard links | |
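
As one illustration of these cleanups, the sketch below strips the brace-delimited attribute block that pandoc can attach to an image or link, such as `{width="5in" height="3in"}`. It is a minimal sketch and not the code in `scripts/convert.py`: the function name `strip_pandoc_attributes` and the exact pattern are assumptions.

```python
import re

# An image or link, immediately followed by a {...} attribute block.
# Only attribute blocks glued to a link are touched; braces in ordinary
# prose are left alone. The pattern is an assumption for illustration.
_ATTR_AFTER_LINK = re.compile(r"(!?\[[^\]]*\]\([^)]*\))\{[^{}\n]*\}")


def strip_pandoc_attributes(markdown: str) -> str:
    """Remove attribute blocks that directly follow an image or link."""
    return _ATTR_AFTER_LINK.sub(r"\1", markdown)
```

For example, `![](media/image1.png){width="5in" height="3in"}` becomes `![](media/image1.png)`.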
CJK Bold Spacing — why and how
DOCX uses run-level styling, so CJK text has no spaces between bold and normal runs. Markdown renderers need whitespace around `**` to reliably recognize bold boundaries.

**Rule**: if a `**content**` span contains any CJK character, ensure both sides have a space, unless that side is already spaced or sits at a line boundary. This handles CJK punctuation, emoji adjacency, and mixed content.

Before: `打开**飞书**,就可以` → some renderers fail to bold

After: `打开 **飞书** ,就可以` → renders correctly everywhere
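
The rule can be sketched as a single-line transformation. This is a minimal sketch under stated assumptions, not the project's implementation: `space_cjk_bold` is a hypothetical name, and the CJK test only covers the common Han, kana, and Hangul blocks.

```python
import re

# Character ranges treated as CJK here (an assumption): Han Extension A,
# Han unified ideographs, Hiragana/Katakana, and Hangul syllables.
_CJK = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")

# A bold span whose content neither starts nor ends with whitespace.
_BOLD = re.compile(r"\*\*(?=\S)(.+?)(?<=\S)\*\*")


def space_cjk_bold(line: str) -> str:
    """Pad CJK-containing **bold** spans with spaces on one line of text."""
    out = []
    last = 0
    for m in _BOLD.finditer(line):
        out.append(line[last:m.start()])
        span = m.group(0)
        if _CJK.search(m.group(1)):
            # Add a leading space unless at line start or already spaced.
            if m.start() > 0 and not line[m.start() - 1].isspace():
                span = " " + span
            # Add a trailing space unless at line end or already spaced.
            if m.end() < len(line) and not line[m.end()].isspace():
                span = span + " "
        out.append(span)
        last = m.end()
    out.append(line[last:])
    return "".join(out)
```

Applied to the example above, `space_cjk_bold("打开**飞书**,就可以")` yields `打开 **飞书** ,就可以`, while bold spans without CJK characters are left unchanged. Multi-line input can be handled by mapping the function over `text.splitlines()`.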
Heavy Mode Workflow
Heavy Mode runs multiple tools in parallel and selects the best segments:
- Parallel Execution: Run all applicable tools simultaneously
- Segment Analysis: Parse each output into segments (tables, headings, images, paragraphs)
- Quality Scoring: Score each segment based on completeness and structure
- Intelligent Merge: Select best version of each segment across tools
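
The first two steps can be sketched as follows. This is a rough illustration, not the orchestrator's actual code: `run_tools` and `split_segments` are hypothetical helpers, each converter is assumed to be a callable that takes a file path and returns Markdown, and segments are assumed to be separated by blank lines.

```python
import re
from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor


def run_tools(converters: dict[str, Callable[[str], str]], path: str) -> dict[str, str]:
    """Run every converter on the same file concurrently; skip failures."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(converters) or 1) as pool:
        futures = {name: pool.submit(fn, path) for name, fn in converters.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception:
                # Assumption: one failing tool should not abort the whole run.
                continue
    return results


def split_segments(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (kind, text) blocks separated by blank lines."""
    segments = []
    for block in re.split(r"\n\s*\n", markdown.strip()):
        if not block.strip():
            continue
        first = block.lstrip().splitlines()[0]
        if first.startswith("#"):
            kind = "heading"
        elif first.startswith("|"):
            kind = "table"
        elif first.startswith("!["):
            kind = "image"
        elif re.match(r"([-*+]|\d+\.)\s", first):
            kind = "list"
        else:
            kind = "paragraph"
        segments.append((kind, block))
    return segments
```

Classifying each block by its first line is crude, but it is enough to route a segment to the scoring rule for its type.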
Merge Criteria
| Segment Type | Selection Criteria |
|---|---|
| Tables | More rows/columns, proper header separator |
| Images | Alt text present, local paths preferred |
| Headings | Proper hierarchy, appropriate length |
| Lists | More items, nested structure preserved |
| Paragraphs | Content completeness |
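
To make the table criterion concrete, the sketch below scores a pipe table by its size and by whether the second line is a proper header separator, then picks the highest-scoring candidate. The weights and the names `score_table` and `pick_best` are assumptions chosen for illustration; the real scoring in Heavy Mode may differ.

```python
import re

# A header separator row such as |---|---| or | :--- | ---: |.
# Requiring three or more dashes per cell is an assumption.
_SEPARATOR = re.compile(r"^\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?$")


def score_table(table: str) -> int:
    """Score a pipe table: more cells is better, a valid separator adds a bonus."""
    lines = [ln.strip() for ln in table.strip().splitlines() if ln.strip()]
    if not lines:
        return 0
    rows = len(lines)
    cols = max(len(ln.strip("|").split("|")) for ln in lines)
    has_separator = len(lines) > 1 and bool(_SEPARATOR.match(lines[1]))
    # The bonus of 10 is an arbitrary illustrative weight.
    return rows * cols + (10 if has_separator else 0)


def pick_best(candidates: dict[str, str]) -> str:
    """Return the name of the tool whose table version scores highest."""
    return max(candidates, key=lambda tool: score_table(candidates[tool]))
```

The other segment types would follow the same shape: one scoring function per type, and a `max` over the tools' versions of that segment.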
Image Extraction
```bash
# Extract images with metadata
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./assets

# Generate markdown references file
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md
```

Output:
- Images: `assets/img_page1_1.png`, `assets/img_page2_1.jpg`
- Metadata: `assets/images_metadata.json` (page, position, dimensions)
Quality Validation
```bash
# Validate conversion quality
uv run --with pymupdf scripts/validate_output.py document.pdf output.md

# Generate HTML report
uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html
```
Quality Metrics
| Metric | Pass | Warn | Fail |
|---|---|---|---|
| Text Retention | >95% | 85-95% | <85% |
| Table Retention | 100% | 90-99% | <90% |
| Image Retention | 100% | 80-99% | <80% |
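
The thresholds in the table can be expressed as a small classifier. This sketch only encodes the table: how `scripts/validate_output.py` measures retention is not described here, the `classify` helper and metric keys are hypothetical, and treating values between 99% and 100% as a warning is an assumption.

```python
# (pass threshold, warn threshold) in percent, taken from the table above.
THRESHOLDS = {
    "text": (95.0, 85.0),
    "table": (100.0, 90.0),
    "image": (100.0, 80.0),
}


def classify(metric: str, retention: float) -> str:
    """Map a retention percentage to 'pass', 'warn', or 'fail'."""
    pass_at, warn_at = THRESHOLDS[metric]
    # Text passes only above 95%; tables and images pass at exactly 100%.
    passed = retention > pass_at if metric == "text" else retention >= pass_at
    if passed:
        return "pass"
    if retention >= warn_at:
        return "warn"
    return "fail"
```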
Merge Outputs Manually
```bash
# Merge multiple markdown files
python scripts/merge_outputs.py output1.md output2.md -o merged.md

# Show segment attribution
python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose
```
Path Conversion (Windows/WSL)
```bash
# Windows to WSL conversion
python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
# Output: /mnt/c/Users/name/Documents/file.pdf
```
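
The mapping shown in the example can be sketched in a few lines. This is an illustrative version, not the contents of `scripts/convert_path.py`: the function name `windows_to_wsl` is hypothetical, and it assumes the default WSL mount point `/mnt/<drive>`.

```python
import re


def windows_to_wsl(path: str) -> str:
    """Convert a drive-letter Windows path to its /mnt/<drive>/ form."""
    match = re.match(r"^([A-Za-z]):[\\/](.*)$", path)
    if not match:
        # Not a drive-letter path; assume it is already POSIX-style.
        return path
    drive, rest = match.groups()
    return f"/mnt/{drive.lower()}/" + rest.replace("\\", "/")
```

Paths without a drive letter are returned unchanged, so the function can be applied to arguments that may already be WSL paths.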
Common Issues

**"No conversion tools available"**

```bash
# Install all tools
pip install pymupdf4llm
uv tool install "markitdown[pdf]"
brew install pandoc
```

**FontBBox warnings during PDF conversion**
- Harmless font parsing warnings; the output is still correct

**Images missing from output**
- Use Heavy Mode for better image preservation
- Or extract separately with `scripts/extract_pdf_images.py`

**Tables broken in output**
- Use Heavy Mode, which selects the most complete table version
- Or validate with `scripts/validate_output.py`
Bundled Scripts
| Script | Purpose |
|---|---|
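| `scripts/convert.py` | Main orchestrator with Quick/Heavy mode + DOCX post-processing |
| `scripts/test_convert.py` | 31 tests covering all post-processing functions |
| `scripts/merge_outputs.py` | Merge multiple markdown outputs |
| `scripts/validate_output.py` | Quality validation with HTML report |
| `scripts/extract_pdf_images.py` | PDF image extraction with metadata |
| `scripts/convert_path.py` | Windows to WSL path converter |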
References
- `references/benchmark-2026-03-22.md` - 5-tool benchmark (Docling/MarkItDown/Pandoc/Mammoth/ours)
- `references/heavy-mode-guide.md` - Detailed Heavy Mode documentation
- `references/tool-comparison.md` - Tool capabilities comparison
- `references/conversion-examples.md` - Batch operation examples