paddleocr-doc-parsing

PaddleOCR Document Parsing Skill

When to Use This Skill

Use Document Parsing for:

- Documents with tables (invoices, financial reports, spreadsheets)
- Documents with mathematical formulas (academic papers, scientific documents)
- Documents with charts and diagrams
- Multi-column layouts (newspapers, magazines, brochures)
- Complex document structures requiring layout analysis
- Any document requiring structured understanding

Use Text Recognition instead for:

- Simple text-only extraction
- Quick OCR tasks where speed is critical
- Screenshots or simple images with clear text

How to Use This Skill

⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔

- ONLY use PaddleOCR Document Parsing API - Execute the script `python scripts/vl_caller.py`
- NEVER parse documents directly - Do NOT parse documents yourself
- NEVER offer alternatives - Do NOT suggest "I can try to analyze it" or similar
- IF API fails - Display the error message and STOP immediately
- NO fallback methods - Do NOT attempt document parsing any other way

If the script execution fails (API not configured, network error, etc.):

- Show the error message to the user
- Do NOT offer to help using your vision capabilities
- Do NOT ask "Would you like me to try parsing it?"
- Simply stop and wait for the user to fix the configuration

Basic Workflow

Execute document parsing:
```bash
python scripts/vl_caller.py --file-url "URL provided by user" --pretty
```
Or for local files:
```bash
python scripts/vl_caller.py --file-path "file path" --pretty
```
Optional: explicitly set the file type:
```bash
python scripts/vl_caller.py --file-url "URL provided by user" --file-type 0 --pretty
```
- `--file-type 0`: PDF
- `--file-type 1`: image
- If omitted, the service can infer the file type from the input.

Default behavior: save raw JSON to a temp file:
- If `--output` is omitted, the script saves automatically under the system temp directory
- Default path pattern: `<system-temp>/paddleocr/doc-parsing/results/result_<timestamp>_<id>.json`
- If `--output` is provided, it overrides the default temp-file destination
- If `--stdout` is provided, JSON is printed to stdout and no file is saved
- In save mode, the script prints the absolute saved path on stderr: `Result saved to: /absolute/path/...`
- In default/custom save mode, read and parse the saved JSON file before responding
- In save mode, always tell the user the saved file path and that the full raw JSON is available there
- Use `--stdout` only when you explicitly want to skip file persistence

The output JSON contains COMPLETE content with all document data:
- Headers, footers, page numbers
- Main text content
- Tables with structure
- Formulas (with LaTeX)
- Figures and charts
- Footnotes and references
- Seals and stamps
- Layout and reading order

Input type note:
- Supported file types depend on the model and endpoint configuration.
- Always follow the file type constraints documented by your endpoint API.

Extract what the user needs from the output JSON using these fields:
- Top-level `text`
- `result[n].markdown`
- `result[n].prunedResult`
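
The save-mode steps above can be sketched in Python as follows. This is only an illustration, not part of the skill: the helper names `find_saved_path` and `parse_document` are hypothetical, and the sketch assumes the script exits with a non-zero status on failure, which the text above does not state.

```python
import json
import re
import subprocess
import sys
from pathlib import Path
from typing import Optional, Tuple

# The script reports the saved file on stderr as "Result saved to: <path>".
SAVED_LINE = re.compile(r"^Result saved to: (?P<path>.+)$", re.MULTILINE)


def find_saved_path(stderr_text: str) -> Optional[Path]:
    """Return the path from the 'Result saved to: ...' line, if present."""
    match = SAVED_LINE.search(stderr_text)
    return Path(match.group("path").strip()) if match else None


def parse_document(file_url: str) -> Tuple[dict, Path]:
    """Run vl_caller.py in default save mode and load the JSON it wrote."""
    proc = subprocess.run(
        [sys.executable, "scripts/vl_caller.py", "--file-url", file_url, "--pretty"],
        capture_output=True,
        text=True,
    )
    saved = find_saved_path(proc.stderr)
    # Assumption: a failed run exits non-zero or prints no saved path.
    if proc.returncode != 0 or saved is None:
        # Surface the error and stop; no fallback parsing is attempted.
        raise RuntimeError(proc.stderr.strip() or "vl_caller.py failed")
    return json.loads(saved.read_text(encoding="utf-8")), saved
```

The returned path is what should be reported back to the user alongside the extracted content.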

IMPORTANT: Complete Content Display

CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.

- The output JSON contains ALL document content in a structured format
- In save mode, the raw provider result can be inspected in the saved JSON file
- Display the full content requested by the user; do NOT truncate or summarize
- If the user asks for "all text", show the entire `text` field
- If the user asks for "tables", show ALL tables in the document
- If the user asks for "main content", filter out headers/footers but show ALL body text

What this means:

- DO: Display complete text, all tables, all formulas as requested
- DO: Present content using these fields: top-level `text`, `result[n].markdown`, and `result[n].prunedResult`
- DON'T: Truncate with "..." unless content is excessively long (>10,000 chars)
- DON'T: Summarize or provide excerpts when the user asks for full content
- DON'T: Say "Here's a preview" when the user expects complete output
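
As a rough sketch of the truncation rule, the function below returns content untouched up to the 10,000-character threshold. What to do beyond that threshold is not specified in this document, so the cut-and-annotate behavior here is an assumption, and `render_for_display` is a hypothetical name.

```python
from typing import Optional

DISPLAY_LIMIT = 10_000  # threshold named in the DON'T rule


def render_for_display(text: str, saved_path: Optional[str] = None,
                       limit: int = DISPLAY_LIMIT) -> str:
    """Return text unchanged unless it exceeds the limit."""
    if len(text) <= limit:
        return text
    # Assumption: for over-long content, keep the first `limit` characters
    # and say exactly how much was omitted and where the full JSON lives.
    omitted = len(text) - limit
    note = f"\n[... {omitted} more characters omitted"
    if saved_path:
        note += f"; full raw JSON saved at {saved_path}"
    return text[:limit] + note + "]"
```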

Example - Correct:

User: "Extract all the text from this document"
Agent: I've parsed the complete document. Here's all the extracted text:

[Display entire text field or concatenated regions in reading order]

Document Statistics:
- Total regions: 25
- Text blocks: 15
- Tables: 3
- Formulas: 2
Quality: Excellent (confidence: 0.92)

Example - Incorrect:

User: "Extract all the text"
Agent: "I found a document with multiple sections. Here's the beginning:
'Introduction...' (content truncated for brevity)"

Understanding the JSON Response

The output JSON uses an envelope wrapping the raw API result:

json

{
  "ok": true,
  "text": "Full markdown/HTML text extracted from all pages",
  "result": { ... },  // raw provider response
  "error": null
}

Key fields:

- `text`: extracted markdown text from all pages (use this for quick text display)
- `result`: raw provider response object
- `result[n].prunedResult`: structured parsing output for each page (layout/content/confidence and related metadata)
- `result[n].markdown`: full rendered page output in markdown/HTML

Raw result location (default): the temp-file path printed by the script on stderr
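
A minimal sketch of reading the envelope might look like this. The function names are hypothetical, and it assumes `result` can be iterated page by page, as the `result[n]` notation suggests; the exact shape should be checked against `references/output_schema.md`.

```python
from typing import Any, List


def full_text(envelope: dict) -> str:
    """Return the top-level text, or raise with the reported error."""
    if not envelope.get("ok"):
        raise RuntimeError(f"parsing failed: {envelope.get('error')}")
    return envelope.get("text") or ""


def page_markdown(envelope: dict) -> List[Any]:
    """Collect result[n].markdown for every page entry that has one."""
    pages = envelope.get("result") or []
    # Assumption: each page entry is a dict carrying a "markdown" key.
    return [page["markdown"] for page in pages
            if isinstance(page, dict) and "markdown" in page]
```

The value of `markdown` is passed through untouched, since its inner structure is not described here.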

Usage Examples

Example 1: Extract Full Document Text

bash

python scripts/vl_caller.py \
  --file-url "https://example.com/paper.pdf" \
  --pretty

Then use:

- Top-level `text` for quick full-text output
- `result[n].markdown` when page-level output is needed

Example 2: Extract Structured Page Data

bash

python scripts/vl_caller.py \
  --file-path "./financial_report.pdf" \
  --pretty

Then use:

- `result[n].prunedResult` for structured parsing data (layout/content/confidence)
- `result[n].markdown` for rendered page content

Example 3: Print JSON Without Saving

bash

python scripts/vl_caller.py \
  --file-url "URL" \
  --stdout \
  --pretty

Then return:

- Full `text` when the user asks for full document content
- `result[n].prunedResult` and `result[n].markdown` when the user needs complete structured page data

First-Time Configuration

You can generally assume that the required environment variables have already been configured. Only when a parsing task fails should you analyze the error message to determine whether it is caused by a configuration issue. If it is indeed a configuration problem, you should notify the user to fix it.

When API is not configured:

The error will show:

CONFIG_ERROR: PADDLEOCR_DOC_PARSING_API_URL not configured. Get your API at: https://paddleocr.com

Configuration workflow:

Show the exact error message to the user (including the URL).
Guide the user to configure securely:
- Recommend configuring through the host application's standard method (e.g., settings file, environment variable UI) rather than pasting credentials in chat.
- List the required environment variables:
```
- PADDLEOCR_DOC_PARSING_API_URL
- PADDLEOCR_ACCESS_TOKEN
- Optional: PADDLEOCR_DOC_PARSING_TIMEOUT
```
If the user provides credentials in chat anyway (accept any reasonable format), for example:
- `PADDLEOCR_DOC_PARSING_API_URL=https://xxx.paddleocr.com/layout-parsing, PADDLEOCR_ACCESS_TOKEN=abc123...`
- `Here's my API: https://xxx and token: abc123`
- Copy-pasted code format
- Any other reasonable format

Security note: Warn the user that credentials shared in chat may be stored in conversation history. Recommend setting them through the host application's configuration instead when possible.

Then parse and validate the values:
- Extract `PADDLEOCR_DOC_PARSING_API_URL` (look for URLs with `paddleocr.com` or similar)
- Confirm `PADDLEOCR_DOC_PARSING_API_URL` is a full endpoint ending with `/layout-parsing`
- Extract `PADDLEOCR_ACCESS_TOKEN` (long alphanumeric string, usually 40+ chars)

Ask the user to confirm the environment is configured.
Retry only after confirmation:
- Once the user confirms the environment variables are available, retry the original parsing task
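
The validation step could be sketched as below. `config_problems` is a hypothetical helper, not something shipped with the skill, and requiring an `http` or `https` scheme is an assumption about what "full endpoint" means. The token is only checked for presence and is never printed.

```python
import os
from typing import List, Mapping
from urllib.parse import urlparse

URL_VAR = "PADDLEOCR_DOC_PARSING_API_URL"
TOKEN_VAR = "PADDLEOCR_ACCESS_TOKEN"


def config_problems(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the checks pass."""
    problems = [f"{name} is not set" for name in (URL_VAR, TOKEN_VAR)
                if not env.get(name)]
    url = env.get(URL_VAR, "")
    if url:
        parsed = urlparse(url)
        # Assumption: a "full endpoint" has an http(s) scheme and a host.
        is_full = parsed.scheme in ("http", "https") and bool(parsed.netloc)
        if not (is_full and url.endswith("/layout-parsing")):
            problems.append(
                f"{URL_VAR} must be a full endpoint ending with /layout-parsing")
    return problems


if __name__ == "__main__":
    for problem in config_problems(os.environ):
        print(problem)
```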

Handling Large Files

There is no file size limit for the API. For PDFs, the maximum is 100 pages per request.

Tips for large files:

Use URL for Large Local Files (Recommended)

For very large local files, prefer `--file-url` over `--file-path` to avoid base64 encoding overhead:

bash

python scripts/vl_caller.py --file-url "https://your-server.com/large_file.pdf"
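
To put a number on the overhead: standard padded base64 turns every 3 input bytes into 4 output characters, so the payload grows by about one third. The sketch below computes the encoded size, assuming the script uses standard padded base64 for local files; `base64_size` is a hypothetical helper.

```python
import base64


def base64_size(raw_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


if __name__ == "__main__":
    sample = b"\0" * 1000
    # Cross-check the formula against the standard library encoder.
    print(base64_size(len(sample)), len(base64.b64encode(sample)))
```

By this formula a 300 MB file becomes a 400 MB request body, which is the cost that `--file-url` avoids.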

Process Specific Pages (PDF Only)

If you only need certain pages from a large PDF, extract them first:

bash

# Extract pages 1-5
python scripts/split_pdf.py large.pdf pages_1_5.pdf --pages "1-5"

# Mixed ranges are supported
python scripts/split_pdf.py large.pdf selected_pages.pdf --pages "1-5,8,10-12"

# Then process the smaller file
python scripts/vl_caller.py --file-path "pages_1_5.pdf"
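
For a PDF longer than the 100-page limit, the same splitting approach can be applied in batches. The sketch below generates `--pages` values covering a whole document in chunks of at most 100 pages. `page_specs` is a hypothetical helper, and it assumes pages are numbered from 1, as the examples above suggest.

```python
from typing import List

MAX_PAGES_PER_REQUEST = 100  # per-request PDF limit stated in this section


def page_specs(total_pages: int, chunk: int = MAX_PAGES_PER_REQUEST) -> List[str]:
    """Return --pages values such as "1-100", "101-200" covering every page."""
    if total_pages < 1 or chunk < 1:
        raise ValueError("total_pages and chunk must be positive")
    specs = []
    for start in range(1, total_pages + 1, chunk):
        end = min(start + chunk - 1, total_pages)
        # A single trailing page is written as "201", not "201-201".
        specs.append(str(start) if start == end else f"{start}-{end}")
    return specs
```

Each value can then be passed to `scripts/split_pdf.py` via `--pages`, and each resulting file parsed separately.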

Error Handling

错误处理

Authentication failed (403):

error: Authentication failed

→ Token is invalid, reconfigure with correct credentials

API quota exceeded (429):

error: API quota exceeded

→ Daily API quota exhausted, inform user to wait or upgrade

Unsupported format:

error: Unsupported file format

→ File format not supported, convert to PDF/PNG/JPG
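
The three cases above amount to a lookup from error text to the next step. A minimal sketch, assuming the error strings appear verbatim in the script's output; `next_step` and the default branch are illustrative only:

```python
# (marker in the error text, what to tell the user)
ACTIONS = (
    ("Authentication failed",
     "Token is invalid; ask the user to reconfigure with correct credentials."),
    ("API quota exceeded",
     "Daily API quota exhausted; ask the user to wait or upgrade."),
    ("Unsupported file format",
     "File format not supported; ask the user to convert to PDF/PNG/JPG."),
)


def next_step(error_message: str) -> str:
    """Map a reported error to the guidance listed in this section."""
    for marker, action in ACTIONS:
        if marker in error_message:
            return action
    # Anything unrecognized: show the message and stop, with no fallback.
    return "Show the error message to the user and stop."
```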

Important Notes

The script NEVER filters content - It always returns complete data
The AI agent decides what to present - Based on user's specific request
All data is always available - Can be re-interpreted for different needs
No information is lost - Complete document structure preserved

Reference Documentation

- `references/output_schema.md`: Output format specification

Note: Model version and capabilities are determined by your API endpoint (`PADDLEOCR_DOC_PARSING_API_URL`).

Load these reference documents into context when:

- Debugging complex parsing issues
- Understanding the output format
- Working with provider API details

Testing the Skill

To verify the skill is working properly:

bash

python scripts/smoke_test.py

This tests configuration and optionally API connectivity.
