PaddleOCR Document Parsing Skill

When to Use This Skill

Use Document Parsing for:
  • Documents with tables (invoices, financial reports, spreadsheets)
  • Documents with mathematical formulas (academic papers, scientific documents)
  • Documents with charts and diagrams
  • Multi-column layouts (newspapers, magazines, brochures)
  • Complex document structures requiring layout analysis
  • Any document requiring structured understanding
Use Text Recognition instead for:
  • Simple text-only extraction
  • Quick OCR tasks where speed is critical
  • Screenshots or simple images with clear text

How to Use This Skill

⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔
  1. ONLY use the PaddleOCR Document Parsing API - Execute the script
    `python scripts/vl_caller.py`
  2. NEVER use Claude's built-in vision - Do NOT parse documents yourself
  3. NEVER offer alternatives - Do NOT suggest "I can try to analyze it" or similar
  4. IF the API fails - Display the error message and STOP immediately
  5. NO fallback methods - Do NOT attempt document parsing any other way
If the script execution fails (API not configured, network error, etc.):
  • Show the error message to the user
  • Do NOT offer to help using your vision capabilities
  • Do NOT ask "Would you like me to try parsing it?"
  • Simply stop and wait for the user to fix the configuration

Basic Workflow

  1. Execute document parsing:

     ```bash
     python scripts/vl_caller.py --file-url "URL provided by user"
     ```

     Or for local files:

     ```bash
     python scripts/vl_caller.py --file-path "file path"
     ```

     Save the result to a file (recommended):

     ```bash
     python scripts/vl_caller.py --file-url "URL" --output result.json --pretty
     ```

     • The script will display:
       Result saved to: /absolute/path/to/result.json
     • This message appears on stderr; the JSON is saved to the file
     • Tell the user the file path shown in the message
  2. The script returns COMPLETE JSON with all document content:
     • Headers, footers, page numbers
     • Main text content
     • Tables with structure
     • Formulas (with LaTeX)
     • Figures and charts
     • Footnotes and references
     • Seals and stamps
     • Layout and reading order
  3. Extract what the user needs from the complete data based on their request.
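
The save-and-report flow above can be sketched in Python. This is a minimal helper (the name `load_result` is illustrative, not part of the scripts) that loads the JSON file reported by `--output` and fails fast when the envelope's `ok` flag is false, matching the "display the error and stop" rule:

```python
import json

def load_result(path):
    """Load the JSON envelope saved via --output; fail fast if parsing failed."""
    with open(path, encoding="utf-8") as f:
        envelope = json.load(f)
    if not envelope.get("ok"):
        # Per the restrictions above: surface the error and stop, no fallback.
        raise RuntimeError(f"Parsing failed: {envelope.get('error')}")
    return envelope
```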

IMPORTANT: Complete Content Display

CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.
  • The script returns ALL document content in a structured format
  • Display the full content requested by the user; do NOT truncate or summarize
  • If the user asks for "all text", show the entire `text` field
  • If the user asks for "tables", show ALL tables in the document
  • If the user asks for "main content", filter out headers/footers but show ALL body text
What this means:
  • DO: Display complete text, all tables, all formulas as requested
  • DO: Present content in reading order using the `reading_order` array
  • DON'T: Truncate with "..." unless content is excessively long (>10,000 chars)
  • DON'T: Summarize or provide excerpts when the user asks for full content
  • DON'T: Say "Here's a preview" when the user expects complete output
Example - Correct:
User: "Extract all the text from this document"
Claude: I've parsed the complete document. Here's all the extracted text:

[Display entire text field or concatenated regions in reading order]

Document Statistics:
- Total regions: 25
- Text blocks: 15
- Tables: 3
- Formulas: 2
Quality: Excellent (confidence: 0.92)
Example - Incorrect ❌:
User: "Extract all the text"
Claude: "I found a document with multiple sections. Here's the beginning:
'Introduction...' (content truncated for brevity)"

Understanding the JSON Response

The script returns a JSON envelope wrapping the raw API result:

```json
{
  "ok": true,
  "text": "Full markdown/HTML text extracted from all pages",
  "result": [
    {
      "prunedResult": {
        "parsing_res_list": [
          {"block_label": "text", "block_content": "Paragraph text content here...", "block_bbox": [100, 200, 500, 230], "block_id": 3},
          {"block_label": "table", "block_content": "<table>...</table>", "block_bbox": [50, 300, 900, 600], "block_id": 5},
          {"block_label": "seal", "block_content": "<img .../>", "block_bbox": [400, 50, 600, 180], "block_id": 2}
        ]
      },
      "markdown": {
        "text": "Full page content in markdown/HTML format",
        "images": {"imgs/filename.jpg": "https://..."}
      }
    }
  ],
  "error": null
}
```
Key fields:
  • `text` — extracted markdown text from all pages (use this for quick text display)
  • `result` — raw API result array (one object per page), for detailed block-level access
  • `result[n].prunedResult.parsing_res_list` — array of detected content blocks
  • `result[n].markdown.text` — full page content in markdown/HTML
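
A minimal sketch of walking this envelope, using only the field names shown in the sample above (the helper name `key_fields` is illustrative):

```python
def key_fields(envelope):
    """Return the full extracted text plus the per-page block lists."""
    full_text = envelope["text"]
    pages = [page["prunedResult"]["parsing_res_list"] for page in envelope["result"]]
    return full_text, pages
```

Use `full_text` for quick display and `pages` when the user needs block-level detail.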

Block Labels

The `block_label` field indicates the content type:

| Label | Description |
|---|---|
| `text` | Regular text content |
| `table` | Table (content is HTML `<table>`) |
| `image` | Embedded image |
| `seal` | Seal or stamp |
| `figure_title` | Figure/chart title or caption |
| `vision_footnote` | Footnote detected by vision model |
| `aside_text` | Side/margin text |
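
Tallying blocks by label is a quick way to produce the "Document Statistics" summary shown earlier. A sketch, assuming the envelope shape from "Understanding the JSON Response" (the helper name is illustrative):

```python
from collections import Counter

def count_block_labels(envelope):
    """Count blocks per block_label across all pages of the result."""
    counts = Counter()
    for page in envelope.get("result", []):
        for block in page["prunedResult"]["parsing_res_list"]:
            counts[block["block_label"]] += 1
    return dict(counts)
```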

Content Extraction Guidelines

Based on user intent, filter the blocks:

| User Says | What to Extract | How |
|---|---|---|
| "Extract all text" | Everything | Use `text` field directly |
| "Get all tables" | table blocks only | Filter `parsing_res_list` by `block_label == "table"` |
| "Show main content" | text + table blocks | Filter out `aside_text`, `image` |
| "Complete document" | Everything | Use `text` field or iterate all blocks |
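
The "Show main content" row above can be sketched as a small filter over the block lists (a sketch using the documented field names; the default exclusion set follows the table):

```python
def main_content(envelope, exclude=("aside_text", "image")):
    """Keep block contents whose label is not in the excluded set."""
    kept = []
    for page in envelope["result"]:
        for block in page["prunedResult"]["parsing_res_list"]:
            if block["block_label"] not in exclude:
                kept.append(block["block_content"])
    return kept
```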

Usage Examples

Example 1: Extract Main Content (default behavior)

```bash
python scripts/vl_caller.py \
  --file-url "https://example.com/paper.pdf" \
  --pretty
```

Then filter the JSON to extract core content:
  • Include: text, table, formula, figure, footnote
  • Exclude: header, footer, page_number
Example 2: Extract Tables Only

```bash
python scripts/vl_caller.py \
  --file-path "./financial_report.pdf" \
  --pretty
```

Then filter the JSON:
  • Only keep blocks where `block_label == "table"`
  • Present table content in markdown format
Example 3: Complete Document with Everything

```bash
python scripts/vl_caller.py \
  --file-url "URL" \
  --pretty
```

Then use the `text` field or present all regions in reading_order.

First-Time Configuration

When the API is not configured:
The error will show:

```
Configuration error: API not configured. Get your API at: https://paddleocr.com
```

Auto-configuration workflow:
  1. Show the exact error message to the user (including the URL)
  2. Tell the user to provide credentials:
    Please visit the URL above to get your PADDLEOCR_DOC_PARSING_API_URL and PADDLEOCR_ACCESS_TOKEN.
    Once you have them, send them to me and I'll configure it automatically.
  3. When the user provides credentials (accept any format):
    • PADDLEOCR_DOC_PARSING_API_URL=https://xxx.com/layout-parsing, PADDLEOCR_ACCESS_TOKEN=abc123...
    • Here's my API: https://xxx and token: abc123
    • Copy-pasted code format
    • Any other reasonable format
  4. Parse the credentials from the user's message:
    • Extract the PADDLEOCR_DOC_PARSING_API_URL value (look for URLs)
    • Extract the PADDLEOCR_ACCESS_TOKEN value (a long alphanumeric string, usually 40+ chars)
  5. Configure automatically:

     ```bash
     python scripts/configure.py --api-url "PARSED_URL" --token "PARSED_TOKEN"
     ```

  6. If configuration succeeds:
    • Inform the user: "Configuration complete! Parsing document now..."
    • Retry the original parsing task
  7. If configuration fails:
    • Show the error
    • Ask the user to verify the credentials
IMPORTANT: The error message format is STRICT and must be shown exactly as provided by the script. Do not modify or paraphrase it.
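
Step 4 above (parsing credentials from a free-form message) can be sketched as a regex heuristic. This is only an illustrative sketch, not part of the scripts: `parse_credentials` is a hypothetical helper, and the patterns merely encode the hints above (look for a URL; look for a long alphanumeric value near the token variable name):

```python
import re

def parse_credentials(message):
    """Heuristically pull an API URL and token out of a free-form user message."""
    url_m = re.search(r"https?://[^\s,]+", message)
    url = url_m.group(0) if url_m else None
    # Look for a value following the token variable name or the word "token".
    tok_m = re.search(
        r"(?:PADDLEOCR_ACCESS_TOKEN|token)\s*[:=]?\s*([A-Za-z0-9._\-]{8,})",
        message, re.IGNORECASE,
    )
    token = tok_m.group(1) if tok_m else None
    return url, token
```

If either value comes back `None`, ask the user to resend the missing credential rather than guessing.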

Handling Large Files

There is no file size limit for the API. For PDFs, the maximum is 100 pages per request.
Tips for large files:

Use URL for Large Local Files (Recommended)

For very large local files, prefer `--file-url` over `--file-path` to avoid base64 encoding overhead:

```bash
python scripts/vl_caller.py --file-url "https://your-server.com/large_file.pdf"
```

Process Specific Pages (PDF Only)

If you only need certain pages from a large PDF, extract them first:

```bash
# Using pypdfium2 (requires: pip install pypdfium2)
python -c "
import pypdfium2 as pdfium

doc = pdfium.PdfDocument('large.pdf')

# Extract pages 0-4 (first 5 pages)
new_doc = pdfium.PdfDocument.new()
for i in range(min(5, len(doc))):
    new_doc.import_pages(doc, [i])
new_doc.save('pages_1_5.pdf')
"

# Then process the smaller file
python scripts/vl_caller.py --file-path "pages_1_5.pdf"
```

Error Handling

Authentication failed (401/403):
  `error: Authentication failed`
  → Token is invalid; reconfigure with correct credentials
API quota exceeded (429):
  `error: API quota exceeded`
  → Daily API quota is exhausted; inform the user to wait or upgrade
Unsupported format:
  `error: Unsupported file format`
  → File format not supported; convert to PDF/PNG/JPG
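
The error table above can be sketched as a lookup over the envelope. This assumes the error strings match the messages listed here (the helper name and advice strings are illustrative):

```python
# Remediation advice keyed by substrings of the script's error messages.
ERROR_GUIDANCE = {
    "Authentication failed": "Token is invalid; reconfigure with correct credentials.",
    "API quota exceeded": "Daily API quota is exhausted; wait or upgrade.",
    "Unsupported file format": "Convert the file to PDF/PNG/JPG.",
}

def guidance_for(envelope):
    """Map a failed envelope to remediation advice; None when the call succeeded."""
    if envelope.get("ok"):
        return None
    error = envelope.get("error") or ""
    for key, advice in ERROR_GUIDANCE.items():
        if key in error:
            return advice
    return "Show the error message to the user and stop."
```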

Pseudo-Code: Content Extraction

Extract all text (most common):

```python
def extract_all_text(json_response):
    # Quickest: use the pre-extracted text field
    print(json_response['text'])
```

Extract tables only:

```python
def extract_tables(json_response):
    for page in json_response['result']:
        blocks = page['prunedResult']['parsing_res_list']
        tables = [b for b in blocks if b['block_label'] == 'table']
        for i, table in enumerate(tables):
            print(f"Table {i+1}:")
            print(table['block_content'])  # HTML table
```

Iterate all blocks:

```python
def extract_by_block(json_response):
    for page in json_response['result']:
        blocks = page['prunedResult']['parsing_res_list']
        for block in blocks:
            print(f"[{block['block_label']}] {block['block_content'][:100]}")
```

Important Notes

  • The script NEVER filters content - It always returns complete data
  • Claude decides what to present - Based on user's specific request
  • All data is always available - Can be re-interpreted for different needs
  • No information is lost - Complete document structure preserved

Reference Documentation

For an in-depth understanding of the PaddleOCR Document Parsing system, refer to:
  • `references/output_schema.md` - Output format specification
  • `references/provider_api.md` - Provider API contract
Note: Model version and capabilities are determined by your API endpoint (PADDLEOCR_DOC_PARSING_API_URL).
Load these reference documents into context when:
  • Debugging complex parsing issues
  • Understanding the output format
  • Working with provider API details

Testing the Skill

To verify the skill is working properly:

```bash
python scripts/smoke_test.py
```

This tests configuration and, optionally, API connectivity.