markitdown-skill

MarkItDown Skill

Microsoft's Python utility for converting various file formats to Markdown for LLM and text analysis pipelines.

Overview

MarkItDown converts documents while preserving structure (headings, lists, tables, links). It's optimized for LLM consumption rather than human-readable output.

Supported Formats

| Category | Formats |
| --- | --- |
| Documents | PDF, Word (DOCX), PowerPoint (PPTX), Excel (XLSX, XLS) |
| Media | Images (EXIF + OCR), Audio (WAV, MP3 transcription) |
| Web | HTML, YouTube URLs, Wikipedia, RSS/Atom feeds |
| Data | CSV, JSON, XML, Jupyter notebooks (.ipynb) |
| Archives | ZIP (iterates contents), EPub |
| Email | Outlook MSG files |

Quick Start

Installation

bash
# Full installation (recommended)
pip install 'markitdown[all]'

# Minimal with specific formats
pip install 'markitdown[pdf,docx,pptx]'

# Using uv
uv pip install 'markitdown[all]'

Optional Dependencies

| Extra | Description |
| --- | --- |
| `[all]` | All optional dependencies |
| `[pdf]` | PDF file support |
| `[docx]` | Word documents |
| `[pptx]` | PowerPoint presentations |
| `[xlsx]` | Excel spreadsheets |
| `[xls]` | Legacy Excel files |
| `[outlook]` | Outlook MSG files |
| `[az-doc-intel]` | Azure Document Intelligence |
| `[audio-transcription]` | WAV/MP3 transcription |
| `[youtube-transcription]` | YouTube video transcripts |
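
When only a few formats are needed, the extras in the table can be combined into a single install command. The sketch below picks extras from file extensions; the extension-to-extra mapping and the helper name `install_command` are assumptions made for illustration, and only the extra names themselves come from the table.

```python
from pathlib import Path

# Assumed mapping from file extension to the extras listed in the table.
EXTRA_FOR_SUFFIX = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".xls": "xls",
    ".msg": "outlook",
    ".wav": "audio-transcription",
    ".mp3": "audio-transcription",
}


def install_command(paths):
    """Return a pip command covering the extras these files appear to need."""
    extras = sorted(
        {
            EXTRA_FOR_SUFFIX[Path(p).suffix.lower()]
            for p in paths
            if Path(p).suffix.lower() in EXTRA_FOR_SUFFIX
        }
    )
    if not extras:
        # No known extension: fall back to the base package.
        return "pip install markitdown"
    return f"pip install 'markitdown[{','.join(extras)}]'"


print(install_command(["report.pdf", "slides.PPTX", "notes.txt"]))
```

Extensions are matched case-insensitively, and files with no matching extra are simply ignored.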

Command-Line Usage

bash
# Basic conversion
markitdown document.pdf > output.md

# Specify output file
markitdown document.pdf -o output.md

# Pipe input
cat document.pdf | markitdown > output.md

# With Azure Document Intelligence
markitdown document.pdf -o output.md -d -e "<endpoint>"
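
To drive the CLI from a script, the same flags can be assembled into an argument list. This is a minimal sketch that uses only the `-o`, `-d`, and `-e` flags from the commands above; the helper name `markitdown_argv` is hypothetical.

```python
import shlex


def markitdown_argv(source, output=None, docintel_endpoint=None):
    """Build an argument list for the markitdown CLI from the flags shown above."""
    argv = ["markitdown", source]
    if output is not None:
        argv += ["-o", output]
    if docintel_endpoint is not None:
        # -d switches on Document Intelligence, -e passes its endpoint.
        argv += ["-d", "-e", docintel_endpoint]
    return argv


argv = markitdown_argv("document.pdf", output="output.md")
print(shlex.join(argv))
# To run the command (requires markitdown to be installed):
#   import subprocess; subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.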

Python API

python
from markitdown import MarkItDown

# Basic conversion
md = MarkItDown()
result = md.convert("document.xlsx")
print(result.text_content)

# With LLM for image descriptions
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this image in detail",
)
result = md.convert("image.jpg")
print(result.text_content)

# With Azure Document Intelligence
md = MarkItDown(docintel_endpoint="<your-endpoint>")
result = md.convert("complex-document.pdf")
print(result.text_content)

Common Use Cases

Batch Convert Directory

python
from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()
input_dir = Path("./documents")
output_dir = Path("./markdown")
output_dir.mkdir(exist_ok=True)

for file in input_dir.glob("*"):
    if file.is_file():
        try:
            result = md.convert(str(file))
            output_file = output_dir / f"{file.stem}.md"
            output_file.write_text(result.text_content, encoding="utf-8")
            print(f"Converted: {file.name}")
        except Exception as e:
            print(f"Failed: {file.name} - {e}")
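
The loop above names each output after `file.stem`, so `report.pdf` and `report.docx` would both be written to `report.md`. One way around that, sketched here under the assumption that keeping the original extension in the output name is acceptable, is shown below. The helper names `output_path_for` and `convert_tree` are hypothetical, and the converter is passed in so that any object with a `convert()` method returning `.text_content` will do.

```python
from pathlib import Path


def output_path_for(source: Path, input_dir: Path, output_dir: Path) -> Path:
    """Mirror the source's relative path and keep its extension in the name."""
    relative = source.relative_to(input_dir)
    return output_dir / relative.parent / f"{relative.name}.md"


def convert_tree(converter, input_dir, output_dir):
    """Convert every file under input_dir; return (converted, failed) lists."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    converted, failed = [], []
    # sorted() materializes the listing before any output is written.
    for source in sorted(input_dir.rglob("*")):
        if not source.is_file():
            continue
        target = output_path_for(source, input_dir, output_dir)
        try:
            result = converter.convert(str(source))
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(result.text_content, encoding="utf-8")
            converted.append(target)
        except Exception as exc:  # keep going and report the failure
            failed.append((source, exc))
    return converted, failed
```

With MarkItDown installed, this could be called as `convert_tree(MarkItDown(), "./documents", "./markdown")`.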

Process for LLM Context

python
from markitdown import MarkItDown

def prepare_for_llm(file_path: str) -> str:
    """Convert document to LLM-ready markdown."""
    md = MarkItDown()
    result = md.convert(file_path)

    # Add source reference
    content = f"# Source: {file_path}\n\n{result.text_content}"
    return content

# Use with your LLM
context = prepare_for_llm("report.pdf")
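
Converted documents can be longer than a model's context window, so the markdown often needs to be split before use. The following is a minimal sketch that assumes a character budget is a good enough stand-in for tokens; `chunk_markdown` is a hypothetical helper and not part of MarkItDown.

```python
def chunk_markdown(text: str, max_chars: int = 4000) -> list[str]:
    """Split markdown into chunks of at most max_chars, breaking at blank lines."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # Hard-split any single paragraph that is itself over budget.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to the model separately, for example `for part in chunk_markdown(context): ...`.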

Extract YouTube Transcript

bash
# CLI
markitdown "https://www.youtube.com/watch?v=VIDEO_ID" > transcript.md

python
# Python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)

Image OCR with AI Description

python
from markitdown import MarkItDown
from openai import OpenAI

# Initialize with LLM support
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
)

# Convert image with AI description
result = md.convert("screenshot.png")
print(result.text_content)

Convert Jupyter Notebook

python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("analysis.ipynb")
print(result.text_content)  # Code cells, outputs, markdown
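
To give a sense of what flattening a notebook into markdown involves, here is a rough standard-library approximation. It is not MarkItDown's converter and its output will differ: it handles only markdown and code cells, ignores cell outputs, and assumes the code cells are Python. The function name is hypothetical.

```python
import json

FENCE = "`" * 3  # built at runtime so the example nests cleanly in markdown


def notebook_to_markdown(notebook_json: str) -> str:
    """Render markdown cells as-is and code cells as fenced Python blocks."""
    notebook = json.loads(notebook_json)
    parts = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # Notebook files store source as either a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if cell.get("cell_type") == "markdown":
            parts.append(source)
        elif cell.get("cell_type") == "code":
            parts.append(f"{FENCE}python\n{source}\n{FENCE}")
    return "\n\n".join(parts)


sample = json.dumps(
    {
        "cells": [
            {"cell_type": "markdown", "source": ["# Title"]},
            {"cell_type": "code", "source": ["print(1)"]},
        ]
    }
)
print(notebook_to_markdown(sample))
```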

Extract Wikipedia Content

python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://en.wikipedia.org/wiki/Python")
print(result.text_content)  # Main article content only

Parse RSS Feed

python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://example.com/feed.xml")
print(result.text_content)  # Feed entries as markdown

Plugin System

MarkItDown supports third-party plugins for extended functionality.

bash
# List installed plugins
markitdown --list-plugins

# Enable plugins during conversion
markitdown --use-plugins document.pdf

python
# Enable plugins in Python
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)
result = md.convert("document.pdf")

> Search GitHub for `#markitdown-plugin` to find available plugins.

MCP Server Integration

MarkItDown offers an MCP (Model Context Protocol) server for integration with LLM applications like Claude Desktop.

bash
# Install MCP server
pip install markitdown-mcp

# Or from source
git clone https://github.com/microsoft/markitdown.git
cd markitdown/packages/markitdown-mcp
pip install -e .

See [markitdown-mcp][mcp-repo] for configuration details.

[mcp-repo]: https://github.com/microsoft/markitdown/tree/main/packages/markitdown-mcp

Docker Usage

bash
# Build image
docker build -t markitdown:latest .

# Convert file
docker run --rm -i markitdown:latest < document.pdf > output.md

Troubleshooting

| Issue | Solution |
| --- | --- |
| Missing dependencies | Install with `pip install 'markitdown[all]'` |
| PDF extraction fails | Try Azure Document Intelligence for complex PDFs |
| Image text not extracted | Ensure OCR dependencies installed or use LLM mode |
| Large file timeout | Process in chunks or use streaming |
| Plugin not found | Run `markitdown --list-plugins` to verify installation |

Common Errors

bash
# ModuleNotFoundError for specific format
pip install 'markitdown[pdf]'  # Install missing dependency

# Azure authentication
export AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT="<endpoint>"
export AZURE_DOCUMENT_INTELLIGENCE_KEY="<key>"
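
Before calling Document Intelligence from a script, it can help to confirm that both variables are actually set. The sketch below only checks for their presence. Whether MarkItDown reads these variables on its own is not confirmed here, so the endpoint may still need to be passed explicitly through `docintel_endpoint`. The helper name is hypothetical.

```python
import os

REQUIRED_VARS = (
    "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_DOCUMENT_INTELLIGENCE_KEY",
)


def missing_azure_vars(environ=os.environ):
    """Return the names of the variables above that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_azure_vars()
if missing:
    print("Set before using Document Intelligence:", ", ".join(missing))
```

The `environ` parameter accepts any mapping, which makes the check easy to exercise without touching the real environment.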

Requirements

  • Python >= 3.10
  • Virtual environment recommended

bash
# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
.venv\Scripts\activate     # Windows

# Install
pip install 'markitdown[all]'
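
A script that depends on MarkItDown can check the interpreter version up front instead of failing later during import. This is a small sketch based on the `Python >= 3.10` requirement listed above; the helper name is hypothetical.

```python
import sys

MIN_PYTHON = (3, 10)  # taken from the requirement listed above


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required")
```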

References
