markitdown

Document to Markdown Conversion

Overview

Convert various document formats to clean Markdown using Microsoft's MarkItDown tool. Optimized for LLM processing, content extraction, and document analysis workflows.
Supported Formats: PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx/.xls), Images (with OCR/LLM), HTML, Audio (with transcription), CSV, JSON, XML, ZIP archives, EPubs

Quick Start

Basic Usage

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```

Command Line

```bash
# Convert single file
markitdown document.pdf > output.md
markitdown document.pdf -o output.md

# Pipe input
cat document.pdf | markitdown
```

🔒 Security Considerations

Before using in production:
  • ✅ Validate file types (MIME, not extension)
  • ✅ Limit file sizes (prevent DoS)
  • ✅ Sanitize file paths (prevent traversal)
  • ✅ Protect API keys (never hardcode)
  • ✅ Consider data privacy (external services)
See patterns.md for implementation details.
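
The first three checklist items can be combined into one pre-flight check. The following is a minimal sketch, not the contents of patterns.md: the names `sniff_mime` and `validate_upload`, the 50 MB default, and the short signature table are assumptions made for illustration. It identifies files by their leading bytes rather than by extension, and only recognizes PDF and ZIP-based containers (DOCX, PPTX, XLSX and EPUB all start with the ZIP signature).

```python
from pathlib import Path
from typing import Optional

# Leading "magic" bytes for a few formats. DOCX, PPTX, XLSX and EPUB are
# ZIP containers, so they share the ZIP signature.
MAGIC_SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
}

MAX_BYTES = 50 * 1024 * 1024  # assumed limit (50 MB); tune per deployment


def sniff_mime(path: Path) -> Optional[str]:
    """Return a MIME type based on file content, or None if unrecognized."""
    with path.open("rb") as handle:
        header = handle.read(8)
    for signature, mime in MAGIC_SIGNATURES.items():
        if header.startswith(signature):
            return mime
    return None


def validate_upload(user_path: str, allowed_dir: Path, max_bytes: int = MAX_BYTES) -> Path:
    """Resolve user_path inside allowed_dir and reject it unless every check passes."""
    base = allowed_dir.resolve()
    candidate = (base / user_path).resolve()
    # Path traversal: the resolved path must still live under the base directory.
    if not candidate.is_relative_to(base):
        raise ValueError("path escapes the allowed directory")
    if not candidate.is_file():
        raise ValueError("not a regular file")
    # Size limit: refuse oversized inputs before any parsing happens.
    if candidate.stat().st_size > max_bytes:
        raise ValueError("file too large")
    # Type check: trust the content, not the extension.
    if sniff_mime(candidate) is None:
        raise ValueError("unrecognized file type")
    return candidate
```

The returned path can then be passed to `md.convert(str(path))`. A production service would likely use a fuller content-type detector than this two-entry table.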

API Key Security

❌ NEVER:
  • Hardcode keys in code
  • Commit .env files to git
  • Log environment variables
✅ ALWAYS:
  • Use environment variables:
    export OPENAI_API_KEY="sk-..."
    # pragma: allowlist secret
  • Use secret management (AWS Secrets Manager, Azure Key Vault)
  • Rotate keys regularly
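
As an illustration of the rules above, here is a small sketch of reading a key from the environment and keeping it out of logs. The helper names `require_env` and `redact` are hypothetical and not part of MarkItDown or the OpenAI client.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, failing fast if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set")
    return value


def redact(secret: str, visible: int = 4) -> str:
    """Mask a secret for log output, keeping only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

With these helpers, `api_key = require_env("OPENAI_API_KEY")` replaces a hardcoded string, and any diagnostic message prints `redact(api_key)` rather than the key itself.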

Common Patterns

PDF Documents

```python
from markitdown import MarkItDown

# Basic PDF conversion
md = MarkItDown()
result = md.convert("report.pdf")

# With Azure Document Intelligence (better quality)
md = MarkItDown(docintel_endpoint="<your-endpoint>")
result = md.convert("report.pdf")
```

Office Documents

```python
from markitdown import MarkItDown

md = MarkItDown()

# Word documents - preserves structure
result = md.convert("document.docx")

# Excel - converts tables to markdown tables
result = md.convert("spreadsheet.xlsx")

# PowerPoint - extracts slide content
result = md.convert("presentation.pptx")
```

Images with Descriptions

```python
# ✅ SECURE: Using environment variables for API keys
import os

from markitdown import MarkItDown
from openai import OpenAI

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY not set")

client = OpenAI(api_key=api_key)
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.jpg")  # Gets AI-generated description
```

Batch Processing

```python
from pathlib import Path

from markitdown import MarkItDown

md = MarkItDown()
documents = Path(".").glob("*.pdf")

# Write each converted document next to its source as <name>.md
for doc in documents:
    result = md.convert(str(doc))
    output_path = doc.with_suffix(".md")
    output_path.write_text(result.text_content, encoding="utf-8")
```
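
In a larger batch, one unreadable file should not abort the whole run. The sketch below is one possible way to structure that, under stated assumptions: `convert_directory` is a hypothetical helper, and the conversion step is passed in as a callable so the loop does not depend on any particular converter.

```python
from pathlib import Path
from typing import Callable


def convert_directory(
    source_dir: Path,
    convert: Callable[[Path], str],
    pattern: str = "*.pdf",
) -> tuple[list[Path], dict[Path, str]]:
    """Convert every matching file; return (written outputs, failures by input path)."""
    written: list[Path] = []
    failures: dict[Path, str] = {}
    for doc in sorted(source_dir.glob(pattern)):
        try:
            markdown = convert(doc)
        except Exception as exc:  # keep going; record which file failed and why
            failures[doc] = f"{type(exc).__name__}: {exc}"
            continue
        output_path = doc.with_suffix(".md")
        output_path.write_text(markdown, encoding="utf-8")
        written.append(output_path)
    return written, failures
```

With MarkItDown, the callable would be something like `lambda p: md.convert(str(p)).text_content`, where `md = MarkItDown()`. The returned `failures` mapping can be logged or retried after the run.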

Installation

```bash
# Full installation (all features)
pip install 'markitdown[all]'

# Selective features
pip install 'markitdown[pdf, docx, pptx]'
```

**Requirements**: Python 3.10 or higher

Key Features

  • Structure Preservation: Maintains headings, lists, tables, links
  • Plugin System: Extend with custom converters
  • Docker Support: Containerized deployments
  • MCP Integration: Model Context Protocol server for LLM apps

When to Read Supporting Files

  • reference.md - Read when you need:
    • Complete API reference and all configuration options
    • Azure Document Intelligence integration details
    • Plugin development guide
    • Docker and MCP server setup
    • Troubleshooting and error handling
  • examples.md - Read when you need:
    • Working examples for specific file types
    • Batch processing workflows
    • Error handling patterns
    • Integration with existing pipelines
  • patterns.md - Read when you need:
    • Production deployment patterns
    • Performance optimization strategies
    • Security considerations
    • Anti-patterns to avoid

Quick Reference

| File Type | Use Case | Command |
|---|---|---|
| PDF | Reports, papers | `md.convert("file.pdf")` |
| Word | Documents | `md.convert("file.docx")` |
| Excel | Data tables | `md.convert("file.xlsx")` |
| PowerPoint | Presentations | `md.convert("file.pptx")` |
| Images | Diagrams with OCR | `md = MarkItDown(llm_client=client); md.convert("img.jpg")` |
| HTML | Web pages | `md.convert("page.html")` |
| ZIP | Archives | `md.convert("archive.zip")` - processes contents |

⚠️ Common Mistakes to Avoid

**Anti-Pattern 1: Hardcoded API Keys**

```python
import os

from markitdown import MarkItDown
from openai import OpenAI

# ❌ NEVER DO THIS
md = MarkItDown(llm_client=OpenAI(api_key="sk-hardcoded-key"))

# ✅ ALWAYS DO THIS
api_key = os.getenv("OPENAI_API_KEY")
md = MarkItDown(llm_client=OpenAI(api_key=api_key))
```

**Anti-Pattern 2: Unvalidated File Paths**

```python
from pathlib import Path

# ❌ Vulnerable to path traversal
user_input = "../../../etc/passwd"
md.convert(user_input)

# ✅ Validate and sanitize
allowed_dir = Path("uploads").resolve()  # hypothetical directory; adjust to your deployment
safe_path = Path(user_input).resolve()
if not safe_path.is_relative_to(allowed_dir):
    raise ValueError("Invalid path")
md.convert(str(safe_path))
```

**Anti-Pattern 3: Ignoring File Size Limits**

```python
from pathlib import Path

# ❌ Can cause DoS
md.convert("huge_file.pdf")  # No size check

# ✅ Check size first
max_size = 50 * 1024 * 1024  # 50MB
if Path("file.pdf").stat().st_size > max_size:
    raise ValueError("File too large")
```

Common Issues

  • Import Error: Ensure Python >= 3.10 and that markitdown is installed.
  • Missing Dependencies: Install with `pip install 'markitdown[all]'`.
  • Image Descriptions Not Working: Requires an LLM client (OpenAI or compatible).

For detailed troubleshooting, see reference.md.
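
The first two issues can be checked programmatically before any conversion is attempted. This is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper, and the default module list is an assumption (add `"openai"` if image descriptions are needed).

```python
import importlib.util
import sys


def check_environment(
    min_version: tuple[int, int] = (3, 10),
    required_modules: tuple[str, ...] = ("markitdown",),
) -> list[str]:
    """Return human-readable problems; an empty list means the basics look fine."""
    problems: list[str] = []
    if sys.version_info[:2] < min_version:
        wanted = ".".join(str(part) for part in min_version)
        problems.append(f"Python {wanted} or higher is required")
    for name in required_modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"module '{name}' is not installed")
    return problems
```

Calling `check_environment()` at startup and printing each returned message gives a quick first diagnosis before consulting reference.md.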