MarkItDown
Overview
MarkItDown is a Python utility that converts various file formats into Markdown format, optimized for use with large language models and text analysis pipelines. It preserves document structure (headings, lists, tables, hyperlinks) while producing clean, token-efficient Markdown output.
When to Use This Skill
Use this skill when users request:
- Converting documents to Markdown format
- Extracting text from PDF, Word, PowerPoint, or Excel files
- Performing OCR on images to extract text
- Transcribing audio files to text
- Extracting YouTube video transcripts
- Processing HTML, EPUB, or web content to Markdown
- Converting structured data (CSV, JSON, XML) to readable Markdown
- Batch converting multiple files or ZIP archives
- Preparing documents for LLM analysis or RAG systems
Core Capabilities
1. Document Conversion
Convert Office documents and PDFs to Markdown while preserving structure.
Supported formats:
- PDF files (with optional Azure Document Intelligence integration)
- Word documents (DOCX)
- PowerPoint presentations (PPTX)
- Excel spreadsheets (XLSX, XLS)
Basic usage:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```
Command-line:
```bash
markitdown document.pdf -o output.md
```
See references/document_conversion.md for detailed documentation on document-specific features.
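The Python call and the `-o` flag shown for document conversion can be combined in a small helper. The following is a minimal sketch rather than part of MarkItDown: `save_markdown` is a hypothetical name, and the only assumption about the converter is that it exposes `convert(path)` and returns a result with a `text_content` attribute, as in the basic usage example.

```python
from pathlib import Path


def save_markdown(converter, source, out_dir):
    """Convert `source` and write the Markdown into `out_dir`.

    `converter` is any object with a `convert(path)` method whose result
    has a `text_content` attribute (for example, a MarkItDown instance).
    Returns the path of the written .md file.
    """
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    result = converter.convert(str(source))
    # Keep the original file name and swap the extension for .md.
    target = out_dir / (source.stem + ".md")
    target.write_text(result.text_content, encoding="utf-8")
    return target
```

With MarkItDown installed, `save_markdown(MarkItDown(), "document.pdf", "markdown")` would write `markdown/document.md`. Passing the converter in as an argument keeps one instance reusable across many files.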
2. Media Processing
Extract text from images using OCR and transcribe audio files to text.
Supported formats:
- Images (JPEG, PNG, GIF, etc.) with EXIF metadata extraction
- Audio files with speech transcription (requires speech_recognition)
Image with OCR:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("image.jpg")
print(result.text_content)  # Includes EXIF metadata and OCR text
```
Audio transcription:
```python
result = md.convert("audio.wav")
print(result.text_content)  # Transcribed speech
```
See references/media_processing.md for advanced media handling options.
3. Web Content Extraction
Convert web-based content and e-books to Markdown.
Supported formats:
- HTML files and web pages
- YouTube video transcripts (via URL)
- EPUB books
- RSS feeds
YouTube transcript:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```
See references/web_content.md for web extraction details.
4. Structured Data Handling
Convert structured data formats to readable Markdown tables.
Supported formats:
- CSV files
- JSON files
- XML files
CSV to Markdown table:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.csv")
print(result.text_content)  # Formatted as Markdown table
```
See references/structured_data.md for format-specific options.
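To make "formatted as a Markdown table" concrete, here is a standard-library sketch of the general shape such output takes. It is an illustration only, not MarkItDown's own converter, whose exact spacing and escaping may differ; `csv_to_markdown_table` is a hypothetical name.

```python
import csv
import io


def csv_to_markdown_table(csv_text):
    """Render CSV text as a Markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""

    width = max(len(row) for row in rows)

    def render(row):
        # Escape pipes so a cell cannot break the table, then pad short rows.
        cells = [cell.strip().replace("|", "\\|") for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([render(header), separator] + [render(row) for row in body])


print(csv_to_markdown_table("name,qty\nwidget,3\ngadget,12\n"))
# | name | qty |
# | --- | --- |
# | widget | 3 |
# | gadget | 12 |
```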
5. Advanced Integrations
Enhance conversion quality with AI-powered features.
Azure Document Intelligence:
For enhanced PDF processing with better table extraction and layout analysis:
```python
from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<endpoint>", docintel_key="<key>")
result = md.convert("complex.pdf")
```
LLM-Powered Image Descriptions:
Generate detailed image descriptions using GPT-4o:
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("presentation.pptx")  # Images described with LLM
```
See references/advanced_integrations.md for integration details.
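Endpoints and keys are usually better kept out of source code. The sketch below reads them from the environment and builds the keyword arguments used in the Azure Document Intelligence example. The variable names `DOCINTEL_ENDPOINT` and `DOCINTEL_KEY` and the helper `docintel_kwargs` are assumptions made for this illustration; the keyword names are copied from that example and may differ between MarkItDown versions.

```python
import os


def docintel_kwargs(env=os.environ):
    """Build MarkItDown keyword arguments from environment variables.

    DOCINTEL_ENDPOINT and DOCINTEL_KEY are hypothetical variable names
    chosen for this sketch. Returns {} when either one is missing, so the
    caller falls back to the default converter.
    """
    endpoint = env.get("DOCINTEL_ENDPOINT")
    key = env.get("DOCINTEL_KEY")
    if not endpoint or not key:
        return {}
    return {"docintel_endpoint": endpoint, "docintel_key": key}
```

A converter could then be created with `MarkItDown(**docintel_kwargs())`, which uses the enhanced PDF path only when both variables are set.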
6. Batch Processing
Process multiple files or entire ZIP archives at once.
ZIP file processing:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("archive.zip")
print(result.text_content)  # All files converted and concatenated
```
Batch script:
Use the provided batch processing script for directory conversion:
```bash
python scripts/batch_convert.py /path/to/documents /path/to/output
```
See scripts/batch_convert.py for implementation details.
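The contents of scripts/batch_convert.py are not reproduced here. As a rough idea of what directory conversion involves, the sketch below walks a source tree, converts matching files, and mirrors the folder layout in the output directory. It is not the provided script: `batch_convert`, its parameters, and the default extension list are all assumptions for illustration.

```python
from pathlib import Path


def batch_convert(converter, src_dir, out_dir,
                  extensions=(".pdf", ".docx", ".pptx", ".xlsx")):
    """Convert every matching file under `src_dir`; return (written, failed).

    `converter` is any object with `convert(path)` returning a result with
    a `text_content` attribute, such as a MarkItDown instance.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    written, failed = [], []

    for source in sorted(src_dir.rglob("*")):
        if not source.is_file() or source.suffix.lower() not in extensions:
            continue
        # Mirror the source tree so same-named files in different folders
        # do not overwrite each other.
        target = (out_dir / source.relative_to(src_dir)).with_suffix(".md")
        try:
            result = converter.convert(str(source))
        except Exception as exc:  # one bad file should not stop the batch
            failed.append((source, exc))
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)

    return written, failed
```

Collecting failures instead of raising lets a long run finish and report the problem files at the end.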
Installation
Full installation (all features):
```bash
uv pip install 'markitdown[all]'
```
Modular installation (specific features):
```bash
uv pip install 'markitdown[pdf]'      # PDF support
uv pip install 'markitdown[docx]'     # Word support
uv pip install 'markitdown[pptx]'     # PowerPoint support
uv pip install 'markitdown[xlsx]'     # Excel support
uv pip install 'markitdown[audio]'    # Audio transcription
uv pip install 'markitdown[youtube]'  # YouTube transcripts
```
Requirements:
- Python 3.10 or higher
Output Format
MarkItDown produces clean, token-efficient Markdown optimized for LLM consumption:
- Preserves headings, lists, and tables
- Maintains hyperlinks and formatting
- Includes metadata where relevant (EXIF, document properties)
- No temporary files created (streaming approach)
Common Workflows
Preparing documents for RAG:
```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert knowledge base documents
docs = ["manual.pdf", "guide.docx", "faq.html"]
markdown_content = []
for doc in docs:
    result = md.convert(doc)
    markdown_content.append(result.text_content)

# Now ready for embedding and indexing
```
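Before embedding, converted Markdown is typically split into smaller chunks. Because the converted text keeps its headings, one simple option is to start a new chunk at each heading. The sketch below shows that approach under stated assumptions: `split_by_heading` is a hypothetical helper, not part of MarkItDown, and heading-based splitting is only one of several reasonable chunking strategies.

```python
import re

# ATX headings: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s")


def split_by_heading(markdown_text):
    """Split Markdown into chunks, starting a new chunk at each heading."""
    chunks, current = [], []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        # A '#' line inside a fenced code block is a comment, not a heading.
        if HEADING.match(line) and not in_fence and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]
```

Applied to the loop above, `[c for text in markdown_content for c in split_by_heading(text)]` would give a flat list of sections to pass to an embedding step.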
**Document analysis pipeline:**
```bash
# Convert all PDFs in directory
for file in documents/*.pdf; do
  markitdown "$file" -o "markdown/$(basename "$file" .pdf).md"
done
```
Plugin System
MarkItDown supports extensible plugins for custom conversion logic. Plugins are disabled by default for security:
```python
from markitdown import MarkItDown

# Enable plugins if needed
md = MarkItDown(enable_plugins=True)
```
Resources
This skill includes comprehensive reference documentation for each capability:
- references/document_conversion.md - Detailed PDF, DOCX, PPTX, XLSX conversion options
- references/media_processing.md - Image OCR and audio transcription details
- references/web_content.md - HTML, YouTube, and EPUB extraction
- references/structured_data.md - CSV, JSON, XML conversion formats
- references/advanced_integrations.md - Azure Document Intelligence and LLM integration
- scripts/batch_convert.py - Batch processing utility for directories