markitdown

MarkItDown

Overview

MarkItDown is a Python utility that converts various file formats into Markdown format, optimized for use with large language models and text analysis pipelines. It preserves document structure (headings, lists, tables, hyperlinks) while producing clean, token-efficient Markdown output.

When to Use This Skill

Use this skill when users request:
  • Converting documents to Markdown format
  • Extracting text from PDF, Word, PowerPoint, or Excel files
  • Performing OCR on images to extract text
  • Transcribing audio files to text
  • Extracting YouTube video transcripts
  • Processing HTML, EPUB, or web content to Markdown
  • Converting structured data (CSV, JSON, XML) to readable Markdown
  • Batch converting multiple files or ZIP archives
  • Preparing documents for LLM analysis or RAG systems

Core Capabilities

1. Document Conversion

Convert Office documents and PDFs to Markdown while preserving structure.
Supported formats:
  • PDF files (with optional Azure Document Intelligence integration)
  • Word documents (DOCX)
  • PowerPoint presentations (PPTX)
  • Excel spreadsheets (XLSX, XLS)
Basic usage:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```
Command-line:
```bash
markitdown document.pdf -o output.md
```
See references/document_conversion.md for detailed documentation on document-specific features.

2. Media Processing

Extract text from images using OCR and transcribe audio files to text.
Supported formats:
  • Images (JPEG, PNG, GIF, etc.) with EXIF metadata extraction
  • Audio files with speech transcription (requires speech_recognition)
Image with OCR:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("image.jpg")
print(result.text_content)  # Includes EXIF metadata and OCR text
```
Audio transcription:
```python
result = md.convert("audio.wav")
print(result.text_content)  # Transcribed speech
```
See references/media_processing.md for advanced media handling options.

3. Web Content Extraction

Convert web-based content and e-books to Markdown.
Supported formats:
  • HTML files and web pages
  • YouTube video transcripts (via URL)
  • EPUB books
  • RSS feeds
YouTube transcript:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```
See references/web_content.md for web extraction details.

4. Structured Data Handling

Convert structured data formats to readable Markdown tables.
Supported formats:
  • CSV files
  • JSON files
  • XML files
CSV to Markdown table:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.csv")
print(result.text_content)  # Formatted as Markdown table
```
See references/structured_data.md for format-specific options.
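To make the target shape concrete, here is a minimal standard-library sketch of a CSV-to-Markdown-table transform. It only illustrates the kind of output to expect. It is not MarkItDown's implementation, and the function name `csv_to_markdown_table` is hypothetical.

```python
import csv
import io


def csv_to_markdown_table(csv_text: str) -> str:
    """Render CSV text as a Markdown pipe table (first row is the header)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes so a cell cannot break the table, then pad short rows.
        cells = [cell.strip().replace("|", "\\|") for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([fmt(header), separator, *map(fmt, body)])


print(csv_to_markdown_table("name,role\nAda,engineer\nLin,editor"))
```

The header row is followed by a `---` separator row, which is what most Markdown renderers require to recognize a table.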

5. Advanced Integrations

Enhance conversion quality with AI-powered features.
Azure Document Intelligence: For enhanced PDF processing with better table extraction and layout analysis:
```python
from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<endpoint>", docintel_key="<key>")
result = md.convert("complex.pdf")
```
LLM-Powered Image Descriptions: Generate detailed image descriptions using GPT-4o:
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("presentation.pptx")  # Images described with LLM
```
See references/advanced_integrations.md for integration details.

6. Batch Processing

Process multiple files or entire ZIP archives at once.
ZIP file processing:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("archive.zip")
print(result.text_content)  # All files converted and concatenated
```
Batch script: Use the provided batch processing script for directory conversion:
```bash
python scripts/batch_convert.py /path/to/documents /path/to/output
```
See scripts/batch_convert.py for implementation details.
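For readers who want the general shape of directory conversion without opening the script, the following is a minimal sketch. It is not the bundled scripts/batch_convert.py: the function name `batch_convert`, its parameters, and the default suffix list are all assumptions made for illustration. The converter is passed in as a callable so the loop itself has no dependency on the library.

```python
from pathlib import Path
from typing import Callable


def batch_convert(src_dir, out_dir, convert: Callable[[str], str],
                  suffixes=(".pdf", ".docx", ".pptx", ".xlsx", ".html")):
    """Convert every matching file in src_dir and write <stem>.md into out_dir.

    `convert` takes a file path and returns Markdown text.
    Returns the list of output paths that were written.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.iterdir()):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        try:
            markdown = convert(str(path))
        except Exception as exc:
            # One unreadable file should not abort the whole batch.
            print(f"skipped {path.name}: {exc}")
            continue
        target = out / f"{path.stem}.md"
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

With the library installed, `convert` could be `lambda p: md.convert(p).text_content` for a single shared `md = MarkItDown()` instance.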

Installation

Full installation (all features):
```bash
uv pip install 'markitdown[all]'
```
Modular installation (specific features):
```bash
uv pip install 'markitdown[pdf]'                    # PDF support
uv pip install 'markitdown[docx]'                   # Word support
uv pip install 'markitdown[pptx]'                   # PowerPoint support
uv pip install 'markitdown[xlsx]'                   # Excel support
uv pip install 'markitdown[audio-transcription]'    # Audio transcription
uv pip install 'markitdown[youtube-transcription]'  # YouTube transcripts
```
Requirements:
  • Python 3.10 or higher
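When only a few formats are needed, the extras can be combined into one install command. The sketch below picks extras from file extensions. The helper `install_command` and its mapping are hypothetical, cover only the document extras listed above, and assume formats without a listed extra (such as HTML) work with the base package.

```python
from pathlib import Path

# Extension -> extra name, taken from the modular install commands above.
EXTRA_BY_SUFFIX = {".pdf": "pdf", ".docx": "docx", ".pptx": "pptx", ".xlsx": "xlsx"}


def install_command(filenames):
    """Build a single `uv pip install` command covering the given files."""
    suffixes = {Path(name).suffix.lower() for name in filenames}
    extras = sorted(EXTRA_BY_SUFFIX[s] for s in suffixes if s in EXTRA_BY_SUFFIX)
    if not extras:
        return "uv pip install markitdown"
    # Several extras can share one bracket, separated by commas.
    return f"uv pip install 'markitdown[{','.join(extras)}]'"


print(install_command(["manual.pdf", "guide.docx", "faq.html"]))
# prints: uv pip install 'markitdown[docx,pdf]'
```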

Output Format

MarkItDown produces clean, token-efficient Markdown optimized for LLM consumption:
  • Preserves headings, lists, and tables
  • Maintains hyperlinks and formatting
  • Includes metadata where relevant (EXIF, document properties)
  • No temporary files created (streaming approach)

Common Workflows

Preparing documents for RAG:
```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert knowledge base documents
docs = ["manual.pdf", "guide.docx", "faq.html"]
markdown_content = []

for doc in docs:
    result = md.convert(doc)
    markdown_content.append(result.text_content)

# Now ready for embedding and indexing
```
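Before embedding, converted Markdown is commonly split into smaller chunks. One simple option, sketched here as an assumption rather than anything MarkItDown provides, is to split at Markdown headings. `split_by_heading` is a hypothetical helper and relies on the converted output containing `#`-style headings.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def split_by_heading(markdown: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at each heading.

    Lines inside fenced code blocks are never treated as headings.
    """
    chunks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        starts_section = not in_fence and HEADING.match(line)
        if starts_section and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]
```

Each string in `markdown_content` could be passed through this helper, so that every chunk carries its own heading as context for retrieval.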


**Document analysis pipeline:**
```bash
# Convert all PDFs in directory
mkdir -p markdown  # make sure the output directory exists
for file in documents/*.pdf; do
    markitdown "$file" -o "markdown/$(basename "$file" .pdf).md"
done
```

Plugin System

MarkItDown supports extensible plugins for custom conversion logic. Plugins are disabled by default for security:
```python
from markitdown import MarkItDown

# Enable plugins if needed
md = MarkItDown(enable_plugins=True)
```

Resources

This skill includes comprehensive reference documentation for each capability:
  • references/document_conversion.md - Detailed PDF, DOCX, PPTX, XLSX conversion options
  • references/media_processing.md - Image OCR and audio transcription details
  • references/web_content.md - HTML, YouTube, and EPUB extraction
  • references/structured_data.md - CSV, JSON, XML conversion formats
  • references/advanced_integrations.md - Azure Document Intelligence and LLM integration
  • scripts/batch_convert.py - Batch processing utility for directories