markitdown-skill

MarkItDown Skill

Microsoft's Python utility for converting various file formats to Markdown for LLM and text analysis pipelines.

Overview

MarkItDown converts documents while preserving structure (headings, lists, tables, links). It's optimized for LLM consumption rather than human-readable output.

Supported Formats

| Category | Formats |
| --- | --- |
| Documents | PDF, Word (DOCX), PowerPoint (PPTX), Excel (XLSX, XLS) |
| Media | Images (EXIF + OCR), Audio (WAV, MP3 transcription) |
| Web | HTML, YouTube URLs, Wikipedia, RSS/Atom feeds |
| Data | CSV, JSON, XML, Jupyter notebooks (.ipynb) |
| Archives | ZIP (iterates contents), EPUB |
| Email | Outlook MSG files |

Quick Start

Installation

```bash
# Full installation (recommended)
pip install 'markitdown[all]'

# Minimal install with specific formats
pip install 'markitdown[pdf,docx,pptx]'

# Using uv
uv pip install 'markitdown[all]'
```

Optional Dependencies

| Extra | Description |
| --- | --- |
| `[all]` | All optional dependencies |
| `[pdf]` | PDF file support |
| `[docx]` | Word documents |
| `[pptx]` | PowerPoint presentations |
| `[xlsx]` | Excel spreadsheets |
| `[xls]` | Legacy Excel files |
| `[outlook]` | Outlook MSG files |
| `[az-doc-intel]` | Azure Document Intelligence |
| `[audio-transcription]` | WAV/MP3 transcription |
| `[youtube-transcription]` | YouTube video transcripts |
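When only a few formats are needed, the extras listed above can be chosen from the files you plan to convert. The following is a minimal sketch, not part of MarkItDown: the helper name `pip_command_for` and the suffix-to-extra mapping are assumptions inferred from the format and extras tables in this document.

```python
from pathlib import Path

# Assumed mapping from file suffix to extra, inferred from the tables above.
EXTRA_FOR_SUFFIX = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".xls": "xls",
    ".msg": "outlook",
    ".wav": "audio-transcription",
    ".mp3": "audio-transcription",
}


def pip_command_for(paths):
    """Return a pip command that installs only the extras these files need."""
    extras = set()
    for path in paths:
        suffix = Path(path).suffix.lower()
        if suffix in EXTRA_FOR_SUFFIX:
            extras.add(EXTRA_FOR_SUFFIX[suffix])
    if not extras:
        # No known extra applies, so the base package is suggested.
        return "pip install markitdown"
    joined = ",".join(sorted(extras))
    return f"pip install 'markitdown[{joined}]'"


print(pip_command_for(["report.pdf", "slides.PPTX", "notes.csv"]))
```

Formats with no entry in the mapping, such as CSV, contribute no extra here; if in doubt, `[all]` remains the simplest choice.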

Command-Line Usage

```bash
# Basic conversion
markitdown document.pdf > output.md

# Specify output file
markitdown document.pdf -o output.md

# Pipe input
cat document.pdf | markitdown > output.md

# With Azure Document Intelligence
markitdown document.pdf -o output.md -d -e "<endpoint>"
```
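If the CLI needs to be driven from a script, the commands above can be assembled as an argument list and run with `subprocess`. This is a rough sketch: `build_command` and `convert_with_cli` are hypothetical helper names, and only the `-o`, `-d`, and `-e` flags shown above are used.

```python
import subprocess


def build_command(source, output, docintel_endpoint=None):
    """Assemble the markitdown CLI call as a list (no shell quoting needed)."""
    cmd = ["markitdown", str(source), "-o", str(output)]
    if docintel_endpoint:
        # Same flags as the Azure Document Intelligence example above.
        cmd += ["-d", "-e", docintel_endpoint]
    return cmd


def convert_with_cli(source, output, docintel_endpoint=None):
    """Run the CLI; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(source, output, docintel_endpoint), check=True)


print(build_command("document.pdf", "output.md"))
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.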

Python API

```python
from markitdown import MarkItDown

# Basic conversion
md = MarkItDown()
result = md.convert("document.xlsx")
print(result.text_content)

# With LLM for image descriptions
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this image in detail"
)
result = md.convert("image.jpg")
print(result.text_content)

# With Azure Document Intelligence
md = MarkItDown(docintel_endpoint="<your-endpoint>")
result = md.convert("complex-document.pdf")
print(result.text_content)
```

Common Use Cases

Batch Convert Directory

```python
from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()
input_dir = Path("./documents")
output_dir = Path("./markdown")
output_dir.mkdir(exist_ok=True)

# Convert every file in the top level of input_dir; failures are reported, not fatal.
for file in input_dir.glob("*"):
    if file.is_file():
        try:
            result = md.convert(str(file))
            output_file = output_dir / f"{file.stem}.md"
            # Write UTF-8 explicitly so the output does not depend on the platform default.
            output_file.write_text(result.text_content, encoding="utf-8")
            print(f"Converted: {file.name}")
        except Exception as e:
            print(f"Failed: {file.name} - {e}")
```
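The loop above only visits the top level of the directory, and two inputs that share a stem (for example `report.pdf` and `report.docx`) would write to the same `report.md`. One possible variation, sketched under those observations, walks subdirectories and keeps the original suffix in the output name. The names `output_path_for` and `convert_tree` are hypothetical, and the converter is passed in as a plain callable so the sketch does not depend on any particular MarkItDown version.

```python
from pathlib import Path


def output_path_for(file, input_dir, output_dir):
    """Mirror the input tree and append .md, keeping the original suffix."""
    relative = Path(file).relative_to(input_dir)
    return Path(output_dir) / relative.with_name(relative.name + ".md")


def convert_tree(input_dir, output_dir, convert):
    """Convert every file under input_dir.

    `convert` takes a path string and returns Markdown text.
    Returns a dict mapping failed input paths to their error messages.
    """
    failures = {}
    # The file list is materialized first so newly written outputs are not revisited.
    for file in sorted(Path(input_dir).rglob("*")):
        if not file.is_file():
            continue
        target = output_path_for(file, input_dir, output_dir)
        try:
            text = convert(str(file))
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(text, encoding="utf-8")
        except Exception as exc:
            failures[str(file)] = str(exc)
    return failures
```

With MarkItDown installed, a call might look like `convert_tree("./documents", "./markdown", lambda p: MarkItDown().convert(p).text_content)`.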

Process for LLM Context

```python
from markitdown import MarkItDown

def prepare_for_llm(file_path: str) -> str:
    """Convert document to LLM-ready markdown."""
    md = MarkItDown()
    result = md.convert(file_path)

    # Add source reference
    content = f"# Source: {file_path}\n\n{result.text_content}"
    return content

# Use with your LLM
context = prepare_for_llm("report.pdf")
```
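Converted documents can be longer than a model will accept in one request. As a rough sketch, the Markdown string can be split on blank lines into chunks under a size budget. `chunk_markdown` is a hypothetical helper, and the budget is counted in characters as a stand-in for tokens, which is an assumption and not something MarkItDown provides.

```python
def chunk_markdown(text, max_chars=4000):
    """Split Markdown into chunks of at most max_chars, breaking on blank lines."""
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single oversized paragraph is hard-split so no chunk exceeds the limit.
        while len(paragraph) > max_chars:
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks


print(chunk_markdown("aaa\n\nbbb\n\nccc", max_chars=8))
```

Splitting on blank lines keeps paragraphs and most tables intact; a heading-aware splitter would preserve more structure but is beyond this sketch.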

Extract YouTube Transcript

```bash
# CLI
markitdown "https://www.youtube.com/watch?v=VIDEO_ID" > transcript.md
```

```python
# Python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```

Image OCR with AI Description

```python
from markitdown import MarkItDown
from openai import OpenAI

# Initialize with LLM support
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"
)

# Convert image with AI description
result = md.convert("screenshot.png")
print(result.text_content)
```

Convert Jupyter Notebook

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("analysis.ipynb")
print(result.text_content)  # Code cells, outputs, markdown
```

Extract Wikipedia Content

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://en.wikipedia.org/wiki/Python")
print(result.text_content)  # Main article content only
```

Parse RSS Feed

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://example.com/feed.xml")
print(result.text_content)  # Feed entries as markdown
```

Plugin System

MarkItDown supports third-party plugins for extended functionality.

```bash
# List installed plugins
markitdown --list-plugins

# Enable plugins during conversion
markitdown --use-plugins document.pdf
```

```python
# Enable plugins in Python
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)
result = md.convert("document.pdf")
```

> Search GitHub for `#markitdown-plugin` to find available plugins.

MCP Server Integration

MarkItDown offers an MCP (Model Context Protocol) server for integration with LLM applications like Claude Desktop.

```bash
# Install MCP server
pip install markitdown-mcp

# Or from source
git clone https://github.com/microsoft/markitdown.git
cd markitdown/packages/markitdown-mcp
pip install -e .
```

See [markitdown-mcp][mcp-repo] for configuration details.

[mcp-repo]: https://github.com/microsoft/markitdown/tree/main/packages/markitdown-mcp

Docker Usage

```bash
# Build image
docker build -t markitdown:latest .

# Convert file
docker run --rm -i markitdown:latest < document.pdf > output.md
```

Troubleshooting

| Issue | Solution |
| --- | --- |
| Missing dependencies | Install with `pip install 'markitdown[all]'` |
| PDF extraction fails | Try Azure Document Intelligence for complex PDFs |
| Image text not extracted | Ensure OCR dependencies are installed, or use LLM mode |
| Large file timeout | Process in chunks or use streaming |
| Plugin not found | Run `markitdown --list-plugins` to verify installation |

Common Errors

```bash
# ModuleNotFoundError for specific format
pip install 'markitdown[pdf]'  # Install missing dependency

# Azure authentication
export AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT="<endpoint>"
export AZURE_DOCUMENT_INTELLIGENCE_KEY="<key>"
```
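Both errors above can be handled up front when the converter is created. The following sketch reads the endpoint variable shown above and passes it through the `docintel_endpoint` parameter used earlier in this document. The helper names `docintel_kwargs` and `make_converter` are hypothetical, and this document does not say how the key variable is consumed, so the sketch leaves it untouched.

```python
import os


def docintel_kwargs(env=os.environ):
    """Build MarkItDown keyword arguments from the environment, if an endpoint is set."""
    endpoint = env.get("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT", "").strip()
    return {"docintel_endpoint": endpoint} if endpoint else {}


def make_converter(env=os.environ):
    """Create a MarkItDown instance, with an actionable message if it is not installed."""
    try:
        # Imported lazily so a missing install fails here rather than at module import.
        from markitdown import MarkItDown
    except ModuleNotFoundError as exc:
        raise SystemExit(
            "markitdown is not installed; try: pip install 'markitdown[all]'"
        ) from exc
    return MarkItDown(**docintel_kwargs(env))
```

When the endpoint variable is unset or blank, `make_converter()` falls back to a plain `MarkItDown()`.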

Requirements

- Python >= 3.10
- Virtual environment recommended

```bash
# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
.venv\Scripts\activate     # Windows

# Install
pip install 'markitdown[all]'
```
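Both requirements can be checked from Python before installing. This is a small sketch using only the standard library; `meets_requirement` and `in_virtualenv` are hypothetical helper names.

```python
import sys

MIN_PYTHON = (3, 10)  # Minimum version stated in the requirements above.


def meets_requirement(version_info=sys.version_info):
    """True if the interpreter's (major, minor) version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def in_virtualenv():
    """True inside a venv, where sys.prefix differs from sys.base_prefix."""
    return sys.prefix != sys.base_prefix


if not meets_requirement():
    print("Python 3.10 or newer is required.")
if not in_virtualenv():
    print("Consider activating a virtual environment before installing.")
```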

References

- `references/cli-reference.md`: Complete CLI options
- `references/api-reference.md`: Python API details
- `references/examples.md`: Extended examples
- `references/advanced-features.md`: Custom converters, URI handling
- GitHub: https://github.com/microsoft/markitdown
- PyPI: https://pypi.org/project/markitdown/