markitdown
Document to Markdown Conversion
Overview
Convert various document formats to clean Markdown using Microsoft's MarkItDown tool. Optimized for LLM processing, content extraction, and document analysis workflows.
Supported Formats: PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx/.xls), Images (with OCR/LLM), HTML, Audio (with transcription), CSV, JSON, XML, ZIP archives, EPubs
Quick Start
Basic Usage
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```
Command Line
```bash
# Convert single file
markitdown document.pdf > output.md
markitdown document.pdf -o output.md

# Pipe input
cat document.pdf | markitdown
```
🔒 Security Considerations
Before using in production:
- ✅ Validate file types (MIME, not extension)
- ✅ Limit file sizes (prevent DoS)
- ✅ Sanitize file paths (prevent traversal)
- ✅ Protect API keys (never hardcode)
- ✅ Consider data privacy (external services)
See patterns.md for implementation details.
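The checklist above can be sketched with the standard library alone. This is a minimal illustration, not the contents of patterns.md: the names `validate_upload`, `sniff_type`, `ALLOWED_SIGNATURES`, and `MAX_BYTES`, the 50 MB limit, and the use of leading-byte signatures in place of a full MIME library are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional

# Hypothetical limit and signature table, chosen for illustration only.
MAX_BYTES = 50 * 1024 * 1024  # 50 MB
ALLOWED_SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",  # also the container for .docx/.pptx/.xlsx
}


def sniff_type(path: Path) -> Optional[str]:
    """Guess a MIME type from the file's leading bytes, ignoring its extension."""
    with path.open("rb") as fh:
        head = fh.read(8)
    for signature, mime in ALLOWED_SIGNATURES.items():
        if head.startswith(signature):
            return mime
    return None


def validate_upload(user_path: str, allowed_dir: Path) -> Path:
    """Resolve user_path inside allowed_dir, then check its size and type."""
    base = allowed_dir.resolve()
    candidate = (base / user_path).resolve()
    # Reject anything that resolves outside the allowed directory.
    if not candidate.is_relative_to(base):
        raise ValueError("path escapes the allowed directory")
    if not candidate.is_file():
        raise ValueError("not a regular file")
    if candidate.stat().st_size > MAX_BYTES:
        raise ValueError("file too large")
    if sniff_type(candidate) is None:
        raise ValueError("unsupported file type")
    return candidate
```

The returned path could then be handed to the converter, for example `md.convert(str(validate_upload(name, upload_dir)))`, where `upload_dir` is whatever directory the application designates for incoming files.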
API Key Security
❌ NEVER:
- Hardcode keys in code
- Commit .env files to git
- Log environment variables
✅ ALWAYS:
- Use environment variables: `export OPENAI_API_KEY="sk-..."`
- Use secret management (AWS Secrets Manager, Azure Key Vault)
- Rotate keys regularly
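As a small illustration of these rules, a key can be read from the environment at startup and masked before any part of it reaches a log line. The helper names `load_api_key` and `mask_secret` are hypothetical, and this is a sketch rather than a complete secret-management approach.

```python
import os


def load_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Read a key from the environment and fail fast if it is missing."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set")
    return key


def mask_secret(secret: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters so the value is safe to log."""
    visible = max(visible, 0)
    if len(secret) <= visible:
        return "*" * len(secret)
    hidden = len(secret) - visible
    return "*" * hidden + secret[hidden:]
```

Logging `mask_secret(load_api_key())` confirms which key is in use without exposing it.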
Common Patterns
PDF Documents
```python
from markitdown import MarkItDown

# Basic PDF conversion
md = MarkItDown()
result = md.convert("report.pdf")

# With Azure Document Intelligence (better quality)
md = MarkItDown(docintel_endpoint="<your-endpoint>")
result = md.convert("report.pdf")
```
Office Documents
```python
from markitdown import MarkItDown

md = MarkItDown()

# Word documents - preserves structure
result = md.convert("document.docx")

# Excel - converts tables to markdown tables
result = md.convert("spreadsheet.xlsx")

# PowerPoint - extracts slide content
result = md.convert("presentation.pptx")
```
Images with Descriptions
```python
# ✅ SECURE: Using environment variables for API keys
import os

from markitdown import MarkItDown
from openai import OpenAI

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY not set")

client = OpenAI(api_key=api_key)
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.jpg")  # Gets AI-generated description
```
Batch Processing
```python
from pathlib import Path

from markitdown import MarkItDown

md = MarkItDown()
documents = Path(".").glob("*.pdf")
for doc in documents:
    result = md.convert(str(doc))
    # Write the Markdown next to the source file, e.g. report.pdf -> report.md
    output_path = doc.with_suffix(".md")
    output_path.write_text(result.text_content, encoding="utf-8")
```
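In a larger batch, one unreadable file should not abort the whole run. A minimal sketch of that idea follows. The function `convert_all` and its `convert` parameter are hypothetical: the converter is passed in as a callable (for example `md.convert`), and the sketch assumes only that the returned object exposes `text_content`, as in the examples in this document.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Tuple


def convert_all(
    paths: Iterable[Path],
    convert: Callable[[str], Any],
) -> Tuple[List[Path], Dict[Path, str]]:
    """Convert each path to a sibling .md file and collect failures instead of raising."""
    written: List[Path] = []
    failed: Dict[Path, str] = {}
    for doc in paths:
        try:
            result = convert(str(doc))
            output_path = doc.with_suffix(".md")
            output_path.write_text(result.text_content, encoding="utf-8")
            written.append(output_path)
        except Exception as exc:  # broad on purpose: keep the batch going
            failed[doc] = f"{type(exc).__name__}: {exc}"
    return written, failed
```

With the library installed, this might be called as `convert_all(Path(".").glob("*.pdf"), MarkItDown().convert)`, after which `failed` lists the files that need a second look.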
Installation
```bash
# Full installation (all features)
pip install 'markitdown[all]'

# Selective features
pip install 'markitdown[pdf, docx, pptx]'
```
**Requirements**: Python 3.10 or higher
Key Features
- Structure Preservation: Maintains headings, lists, tables, links
- Plugin System: Extend with custom converters
- Docker Support: Containerized deployments
- MCP Integration: Model Context Protocol server for LLM apps
When to Read Supporting Files
- reference.md - Read when you need:
  - Complete API reference and all configuration options
  - Azure Document Intelligence integration details
  - Plugin development guide
  - Docker and MCP server setup
  - Troubleshooting and error handling
- examples.md - Read when you need:
  - Working examples for specific file types
  - Batch processing workflows
  - Error handling patterns
  - Integration with existing pipelines
- patterns.md - Read when you need:
  - Production deployment patterns
  - Performance optimization strategies
  - Security considerations
  - Anti-patterns to avoid
Quick Reference
| File Type | Use Case | Command |
|---|---|---|
| PDF | Reports, papers | |
| Word | Documents | |
| Excel | Data tables | |
| PowerPoint | Presentations | |
| Images | Diagrams with OCR | |
| HTML | Web pages | |
| ZIP | Archives | |
⚠️ Common Mistakes to Avoid
**Anti-Pattern 1: Hardcoded API Keys**
```python
import os

from markitdown import MarkItDown
from openai import OpenAI

# ❌ NEVER DO THIS
md = MarkItDown(llm_client=OpenAI(api_key="sk-hardcoded-key"))

# ✅ ALWAYS DO THIS
api_key = os.getenv("OPENAI_API_KEY")
md = MarkItDown(llm_client=OpenAI(api_key=api_key))
```
**Anti-Pattern 2: Unvalidated File Paths**
```python
from pathlib import Path

# ❌ Vulnerable to path traversal
user_input = "../../../etc/passwd"
md.convert(user_input)

# ✅ Validate and sanitize
allowed_dir = Path("uploads").resolve()  # example base directory (an assumption)
safe_path = Path(user_input).resolve()
if not safe_path.is_relative_to(allowed_dir):
    raise ValueError("Invalid path")
md.convert(str(safe_path))
```
**Anti-Pattern 3: Ignoring File Size Limits**
```python
from pathlib import Path

# ❌ Can cause DoS
md.convert("huge_file.pdf")  # No size check

# ✅ Check size first
max_size = 50 * 1024 * 1024  # 50MB
if Path("file.pdf").stat().st_size > max_size:
    raise ValueError("File too large")
md.convert("file.pdf")
```
Common Issues
Import Error: Ensure Python >= 3.10 and that markitdown is installed
Missing Dependencies: Install with `pip install 'markitdown[all]'`
Image Descriptions Not Working: Requires an LLM client (OpenAI or compatible)
For detailed troubleshooting, see reference.md.
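The first two checks can be automated before any conversion is attempted. The sketch below uses only the standard library. The function name `check_environment` is hypothetical, and it only verifies that the interpreter is new enough and that a `markitdown` package can be located, not that every optional dependency is present.

```python
import importlib.util
import sys
from typing import List, Tuple


def check_environment(min_version: Tuple[int, int] = (3, 10)) -> List[str]:
    """Return human-readable problems; an empty list means the basics look fine."""
    problems: List[str] = []
    if sys.version_info[:2] < min_version:
        found = ".".join(str(part) for part in sys.version_info[:2])
        wanted = ".".join(str(part) for part in min_version)
        problems.append(f"Python {wanted} or higher is required, found {found}")
    # find_spec returns None when the top-level package cannot be located.
    if importlib.util.find_spec("markitdown") is None:
        problems.append("markitdown is not installed; try: pip install 'markitdown[all]'")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```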