paddleocr-doc-parsing
PaddleOCR Document Parsing Skill
When to Use This Skill
Use Document Parsing for:
- Documents with tables (invoices, financial reports, spreadsheets)
- Documents with mathematical formulas (academic papers, scientific documents)
- Documents with charts and diagrams
- Multi-column layouts (newspapers, magazines, brochures)
- Complex document structures requiring layout analysis
- Any document requiring structured understanding
Use Text Recognition instead for:
- Simple text-only extraction
- Quick OCR tasks where speed is critical
- Screenshots or simple images with clear text
How to Use This Skill
⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔
- ONLY use PaddleOCR Document Parsing API - Execute the script `python scripts/vl_caller.py`
- NEVER parse documents directly - Do NOT parse documents yourself
- NEVER offer alternatives - Do NOT suggest "I can try to analyze it" or similar
- IF API fails - Display the error message and STOP immediately
- NO fallback methods - Do NOT attempt document parsing any other way
If the script execution fails (API not configured, network error, etc.):
- Show the error message to the user
- Do NOT offer to help using your vision capabilities
- Do NOT ask "Would you like me to try parsing it?"
- Simply stop and wait for the user to fix the configuration
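The stop-on-failure rule can be sketched as a thin wrapper around the script call. This is a minimal sketch, not part of the skill: `run_parser` is a hypothetical helper, and it assumes the script signals failure with a non-zero exit code and writes its error message to stderr (falling back to stdout).

```python
import subprocess
import sys


def run_parser(cmd):
    """Run the parsing command once and return (ok, message).

    Hypothetical helper. Assumes a non-zero exit code means failure.
    On failure, message is the script's own error output, meant to be
    shown to the user verbatim. No retry and no fallback is attempted.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        return False, (proc.stderr or proc.stdout).strip()
    return True, proc.stdout


# Example call (the file path is a placeholder):
# ok, message = run_parser(
#     [sys.executable, "scripts/vl_caller.py", "--file-path", "file path", "--stdout"]
# )
# If ok is False: display message and stop.
```

The wrapper deliberately has a single failure path that returns the error text, so there is no place to hang an alternative parsing route.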
Basic Workflow
1. Execute document parsing:

       python scripts/vl_caller.py --file-url "URL provided by user" --pretty

   Or for local files:

       python scripts/vl_caller.py --file-path "file path" --pretty

   Optional: explicitly set file type:

       python scripts/vl_caller.py --file-url "URL provided by user" --file-type 0 --pretty

   - `--file-type 0`: PDF
   - `--file-type 1`: image
   - If omitted, the service can infer file type from input.

   Default behavior: save raw JSON to a temp file:
   - If `--output` is omitted, the script saves automatically under the system temp directory
   - Default path pattern: `<system-temp>/paddleocr/doc-parsing/results/result_<timestamp>_<id>.json`
   - If `--output` is provided, it overrides the default temp-file destination
   - If `--stdout` is provided, JSON is printed to stdout and no file is saved
   - In save mode, the script prints the absolute saved path on stderr: `Result saved to: /absolute/path/...`
   - In default/custom save mode, read and parse the saved JSON file before responding
   - In save mode, always tell the user the saved file path and that full raw JSON is available there
   - Use `--stdout` only when you explicitly want to skip file persistence

2. The output JSON contains COMPLETE content with all document data:
   - Headers, footers, page numbers
   - Main text content
   - Tables with structure
   - Formulas (with LaTeX)
   - Figures and charts
   - Footnotes and references
   - Seals and stamps
   - Layout and reading order

   Input type note:
   - Supported file types depend on the model and endpoint configuration.
   - Always follow the file type constraints documented by your endpoint API.

3. Extract what the user needs from the output JSON using these fields:
   - Top-level `text`
   - `result[n].markdown`
   - `result[n].prunedResult`
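In save mode, the step "read and parse the saved JSON file before responding" could look roughly like the sketch below. It relies only on the `Result saved to: ...` line described above; the helper names are hypothetical, and UTF-8 file encoding is an assumption.

```python
import json
import re
from pathlib import Path

# Matches the stderr line described in the workflow: "Result saved to: <path>"
SAVED_LINE = re.compile(r"^Result saved to:\s*(.+)$", re.MULTILINE)


def saved_path_from_stderr(stderr_text):
    """Return the path from a 'Result saved to: ...' line, or None if absent."""
    match = SAVED_LINE.search(stderr_text)
    return Path(match.group(1).strip()) if match else None


def load_saved_result(stderr_text):
    """Read and parse the saved JSON file named on stderr.

    Returns (path, data) so the caller can also report the path to the user.
    Assumes the file is UTF-8 encoded.
    """
    path = saved_path_from_stderr(stderr_text)
    if path is None:
        raise ValueError("no 'Result saved to:' line found in stderr")
    with path.open(encoding="utf-8") as handle:
        return path, json.load(handle)
```

Returning the path alongside the parsed data keeps both obligations together: the content comes from the file, and the user is told where the full raw JSON lives.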
IMPORTANT: Complete Content Display
CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.
- The output JSON contains ALL document content in a structured format
- In save mode, the raw provider result can be inspected in the saved JSON file
- Display the full content requested by the user, do NOT truncate or summarize
- If user asks for "all text", show the entire `text` field
- If user asks for "tables", show ALL tables in the document
- If user asks for "main content", filter out headers/footers but show ALL body text
What this means:
- DO: Display complete text, all tables, all formulas as requested
- DO: Present content using these fields: top-level `text`, `result[n].markdown`, and `result[n].prunedResult`
- DON'T: Truncate with "..." unless content is excessively long (>10,000 chars)
- DON'T: Summarize or provide excerpts when user asks for full content
- DON'T: Say "Here's a preview" when user expects complete output
Example - Correct:
User: "Extract all the text from this document"
Agent: I've parsed the complete document. Here's all the extracted text:
[Display entire text field or concatenated regions in reading order]
Document Statistics:
- Total regions: 25
- Text blocks: 15
- Tables: 3
- Formulas: 2
Quality: Excellent (confidence: 0.92)
Example - Incorrect:
User: "Extract all the text"
Agent: "I found a document with multiple sections. Here's the beginning:
'Introduction...' (content truncated for brevity)"
Understanding the JSON Response
The output JSON uses an envelope wrapping the raw API result:
    {
      "ok": true,
      "text": "Full markdown/HTML text extracted from all pages",
      "result": { ... },  // raw provider response
      "error": null
    }

Key fields:
- `text`: extracted markdown text from all pages (use this for quick text display)
- `result`: raw provider response object
- `result[n].prunedResult`: structured parsing output for each page (layout/content/confidence and related metadata)
- `result[n].markdown`: full rendered page output in markdown/HTML
Raw result location (default): the temp-file path printed by the script on stderr
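As a rough illustration of consuming the envelope, the sketch below checks `ok` before reading `text`. The field names come from the envelope shown above; the helper name, the exception type, and the idea that `error` holds something printable on failure are assumptions.

```python
def extract_text(envelope):
    """Return the top-level text from an envelope dict, or raise on failure.

    Hypothetical helper. Uses only the ok, text and error fields of the
    envelope; the shape of error on failure is not specified here, so it
    is simply formatted into the exception message.
    """
    if not envelope.get("ok"):
        raise RuntimeError(f"parsing failed: {envelope.get('error')}")
    return envelope["text"]


# Usage with a hand-written envelope (not real API output):
sample = {"ok": True, "text": "# Title\n\nBody", "result": {}, "error": None}
full_text = extract_text(sample)
```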
Usage Examples
Example 1: Extract Full Document Text
    python scripts/vl_caller.py \
      --file-url "https://example.com/paper.pdf" \
      --pretty

Then use:
- Top-level `text` for quick full-text output
- `result[n].markdown` when page-level output is needed
Example 2: Extract Structured Page Data
    python scripts/vl_caller.py \
      --file-path "./financial_report.pdf" \
      --pretty

Then use:
- `result[n].prunedResult` for structured parsing data (layout/content/confidence)
- `result[n].markdown` for rendered page content
Example 3: Print JSON Without Saving
    python scripts/vl_caller.py \
      --file-url "URL" \
      --stdout \
      --pretty

Then return:
- Full `text` when user asks for full document content
- `result[n].prunedResult` and `result[n].markdown` when user needs complete structured page data
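The three invocations above differ only in their flags, so a caller might assemble the argument list programmatically. The sketch below uses only flags named in this document; `build_command` is a hypothetical helper, and treating `--file-url`/`--file-path` and `--output`/`--stdout` as mutually exclusive is an assumption of this sketch, not documented script behavior.

```python
import sys


def build_command(file_url=None, file_path=None, file_type=None,
                  output=None, stdout=False, pretty=True):
    """Assemble a vl_caller.py argument list from the documented flags."""
    # Assumption: exactly one input source is given.
    if (file_url is None) == (file_path is None):
        raise ValueError("pass exactly one of file_url or file_path")
    # Assumption: saving to a custom path and printing to stdout do not mix.
    if output is not None and stdout:
        raise ValueError("--output and --stdout are treated as exclusive here")
    cmd = [sys.executable, "scripts/vl_caller.py"]
    if file_url is not None:
        cmd += ["--file-url", file_url]
    else:
        cmd += ["--file-path", file_path]
    if file_type is not None:
        cmd += ["--file-type", str(file_type)]
    if output is not None:
        cmd += ["--output", output]
    if stdout:
        cmd.append("--stdout")
    if pretty:
        cmd.append("--pretty")
    return cmd
```

Passing the result as a list to `subprocess.run` avoids shell quoting problems with user-supplied URLs and paths.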
First-Time Configuration
You can generally assume that the required environment variables have already been configured. Only when a parsing task fails should you analyze the error message to determine whether it is caused by a configuration issue. If it is indeed a configuration problem, you should notify the user to fix it.
When API is not configured:
The error will show:
    CONFIG_ERROR: PADDLEOCR_DOC_PARSING_API_URL not configured. Get your API at: https://paddleocr.com

Configuration workflow:
1. Show the exact error message to the user (including the URL).
2. Guide the user to configure securely:
   - Recommend configuring through the host application's standard method (e.g., settings file, environment variable UI) rather than pasting credentials in chat.
   - List the required environment variables:
     - PADDLEOCR_DOC_PARSING_API_URL
     - PADDLEOCR_ACCESS_TOKEN
     - Optional: PADDLEOCR_DOC_PARSING_TIMEOUT
3. If the user provides credentials in chat anyway (accept any reasonable format), for example:
   - PADDLEOCR_DOC_PARSING_API_URL=https://xxx.paddleocr.com/layout-parsing, PADDLEOCR_ACCESS_TOKEN=abc123...
   - Here's my API: https://xxx and token: abc123
   - Copy-pasted code format
   - Any other reasonable format
   - Security note: Warn the user that credentials shared in chat may be stored in conversation history. Recommend setting them through the host application's configuration instead when possible.

   Then parse and validate the values:
   - Extract `PADDLEOCR_DOC_PARSING_API_URL` (look for URLs with `paddleocr.com` or similar)
   - Confirm `PADDLEOCR_DOC_PARSING_API_URL` is a full endpoint ending with `/layout-parsing`
   - Extract `PADDLEOCR_ACCESS_TOKEN` (long alphanumeric string, usually 40+ chars)
4. Ask the user to confirm the environment is configured.
5. Retry only after confirmation:
   - Once the user confirms the environment variables are available, retry the original parsing task
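The parse-and-validate step can be approximated with two small functions. This is a sketch under stated assumptions: the regular expressions, the helper names, and the choice to report a short token as a problem are all illustrative; only the `/layout-parsing` suffix and the "usually 40+ chars" hint come from the text above.

```python
import re

# Assumption: the endpoint is the first http(s) URL in the message.
URL_RE = re.compile(r"https?://[^\s,;\"']+")
# Assumption: the token follows "PADDLEOCR_ACCESS_TOKEN=" or "token:".
TOKEN_RE = re.compile(
    r"(?:PADDLEOCR_ACCESS_TOKEN|token)\s*[:=]\s*([A-Za-z0-9]+)", re.IGNORECASE
)


def parse_credentials(message):
    """Pull an endpoint URL and access token out of free-form chat text."""
    url_match = URL_RE.search(message)
    token_match = TOKEN_RE.search(message)
    url = url_match.group(0) if url_match else None
    token = token_match.group(1) if token_match else None
    return url, token


def validate_credentials(url, token):
    """Return a list of problems; an empty list means the values look plausible."""
    problems = []
    if not url:
        problems.append("endpoint URL not found")
    elif not url.endswith("/layout-parsing"):
        problems.append("endpoint URL should end with /layout-parsing")
    if not token:
        problems.append("access token not found")
    elif len(token) < 40:
        problems.append("access token is shorter than the usual 40+ characters")
    return problems
```

A non-empty problem list is a reason to go back to the user, not to guess at corrected values.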
Handling Large Files
There is no file size limit for the API. For PDFs, the maximum is 100 pages per request.
Tips for large files:
Use URL for Large Local Files (Recommended)
For very large local files, prefer `--file-url` over `--file-path` to avoid base64 encoding overhead:

    python scripts/vl_caller.py --file-url "https://your-server.com/large_file.pdf"
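To put a number on that overhead: standard padded base64 turns every 3 input bytes into 4 output characters, so an encoded payload is about one third larger than the file. The sketch below only demonstrates this arithmetic with Python's `base64` module; how the script itself encodes local files is not shown in this document.

```python
import base64


def base64_size(raw_size):
    """Size in bytes of the padded base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


payload = bytes(300)  # stand-in for file contents
encoded = base64.b64encode(payload)
assert len(encoded) == base64_size(len(payload)) == 400
```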
Process Specific Pages (PDF Only)
If you only need certain pages from a large PDF, extract them first:
    # Extract pages 1-5
    python scripts/split_pdf.py large.pdf pages_1_5.pdf --pages "1-5"

    # Mixed ranges are supported
    python scripts/split_pdf.py large.pdf selected_pages.pdf --pages "1-5,8,10-12"

    # Then process the smaller file
    python scripts/vl_caller.py --file-path "pages_1_5.pdf"
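How `scripts/split_pdf.py` interprets `--pages` is not shown here. Purely as an illustration, a spec such as "1-5,8,10-12" could be expanded as in the sketch below; `parse_pages` is a hypothetical helper, and the 1-based numbering and de-duplication are assumptions.

```python
def parse_pages(spec):
    """Expand a spec like "1-5,8,10-12" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)
```

Since a request is limited to 100 pages, checking `len(parse_pages(spec)) <= 100` before splitting is a cheap way to catch oversized selections early.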
Error Handling
Authentication failed (403):
    error: Authentication failed
→ Token is invalid; reconfigure with correct credentials.
API quota exceeded (429):
    error: API quota exceeded
→ Daily API quota exhausted; inform the user to wait or upgrade.
Unsupported format:
    error: Unsupported file format
→ File format not supported; convert to PDF/PNG/JPG.
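The three cases above amount to a lookup from error text to next step. A minimal sketch, assuming the error strings appear verbatim in the script's output (`ERROR_ACTIONS` and `advise` are hypothetical names):

```python
# Error markers and actions are taken from the list above.
ERROR_ACTIONS = {
    "Authentication failed": "Token is invalid; reconfigure with correct credentials.",
    "API quota exceeded": "Daily API quota exhausted; wait or upgrade.",
    "Unsupported file format": "File format not supported; convert to PDF/PNG/JPG.",
}


def advise(error_message):
    """Return the documented next step for a known error, else None."""
    for marker, action in ERROR_ACTIONS.items():
        if marker in error_message:
            return action
    return None
```

An unknown error returns `None`, in which case the message is shown to the user as-is.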
Important Notes
- The script NEVER filters content - It always returns complete data
- The AI agent decides what to present - Based on user's specific request
- All data is always available - Can be re-interpreted for different needs
- No information is lost - Complete document structure preserved
Reference Documentation
- `references/output_schema.md` - Output format specification
Note: Model version and capabilities are determined by your API endpoint (`PADDLEOCR_DOC_PARSING_API_URL`).
Load these reference documents into context when:
- Debugging complex parsing issues
- Understanding the output format
- Working with provider API details
Testing the Skill
To verify the skill is working properly:
    python scripts/smoke_test.py

This tests configuration and optionally API connectivity.