paddleocr-text-recognition

PaddleOCR Text Recognition Skill

When to Use This Skill

Invoke this skill in the following situations:
  • Extract text from images (screenshots, photos, scans)
  • Extract text from PDFs or document images
  • Extract text and positions from structured documents (invoices, receipts, forms, tables)
  • Extract text from URLs or local files that point to images/PDFs
Do not use this skill in the following situations:
  • Plain text files that can be read directly with the Read tool
  • Code files or markdown documents
  • Tasks that do not involve image-to-text conversion

How to Use This Skill

⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔
  1. ONLY use the PaddleOCR Text Recognition API - Execute the script python scripts/ocr_caller.py
  2. NEVER read images directly - Do NOT read images yourself
  3. NEVER offer alternatives - Do NOT suggest "I can try to read it" or similar
  4. IF API fails - Display the error message and STOP immediately
  5. NO fallback methods - Do NOT attempt OCR any other way
If the script execution fails (API not configured, network error, etc.):
  • Show the error message to the user
  • Do NOT offer to help using your vision capabilities
  • Do NOT ask "Would you like me to try reading it?"
  • Simply stop and wait for the user to fix the configuration

Basic Workflow

  1. Identify the input source:
    • User provides a URL: use the --file-url parameter
    • User provides a local file path: use the --file-path parameter
    • User uploads an image: save it first, then use --file-path
    Input type note:
    • Supported file types depend on the model and endpoint configuration.
    • Follow the official endpoint/API documentation for the exact supported formats.
  2. Execute OCR:
    bash
    python scripts/ocr_caller.py --file-url "URL provided by user" --pretty
    Or for local files:
    bash
    python scripts/ocr_caller.py --file-path "file path" --pretty
    Default behavior: the raw JSON is saved to a temp file:
    • If --output is omitted, the script saves automatically under the system temp directory
    • Default path pattern: <system-temp>/paddleocr/text-recognition/results/result_<timestamp>_<id>.json
    • If --output is provided, it overrides the default temp-file destination
    • If --stdout is provided, JSON is printed to stdout and no file is saved
    • In save mode, the script prints the absolute saved path on stderr: Result saved to: /absolute/path/...
    • In default/custom save mode, read and parse the saved JSON file before responding
    • Use --stdout only when you explicitly want to skip file persistence
  3. Parse JSON response:
    • In default/custom save mode, load JSON from the saved file path shown by the script
    • Check the ok field: true means success, false means error
    • Extract text: the text field contains all recognized text
    • If --stdout is used, parse the stdout JSON directly
    • Handle errors: if ok is false, display error.message
  4. Present results to user:
    • Display extracted text in a readable format
    • If the text is empty, the image may contain no text
    • In save mode, always tell the user the saved file path and that full raw JSON is available there
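
The save-mode steps above can be sketched in Python as follows. This is a minimal illustration, not part of the skill: the helper names find_saved_path and run_ocr are hypothetical, and it assumes the script prints the "Result saved to:" line on stderr exactly as described. Whether a file is also saved when the API call fails is not stated here, so the sketch simply raises with the stderr text if no saved path is reported.

```python
import json
import re
import subprocess
import sys
from pathlib import Path
from typing import Optional

# Matches the stderr line described above: "Result saved to: /absolute/path/..."
SAVED_PATH_RE = re.compile(r"^Result saved to:\s*(.+)$", re.MULTILINE)


def find_saved_path(stderr_text: str) -> Optional[Path]:
    """Return the saved JSON path reported on stderr, or None if absent."""
    match = SAVED_PATH_RE.search(stderr_text)
    return Path(match.group(1).strip()) if match else None


def run_ocr(source_flag: str, source: str) -> dict:
    """Run ocr_caller.py in default save mode and load the saved JSON.

    source_flag is "--file-url" or "--file-path" (hypothetical wrapper).
    """
    proc = subprocess.run(
        [sys.executable, "scripts/ocr_caller.py", source_flag, source, "--pretty"],
        capture_output=True,
        text=True,
    )
    saved = find_saved_path(proc.stderr)
    if saved is None:
        # No saved file reported: surface the script's own error and stop.
        raise RuntimeError(proc.stderr.strip() or "no saved path reported")
    return json.loads(saved.read_text(encoding="utf-8"))
```

Reading the path from stderr rather than guessing the temp directory keeps the sketch independent of the default path pattern and of any --output override.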

IMPORTANT: Complete Output Display

CRITICAL: Always display the COMPLETE recognized text to the user. Do NOT truncate or summarize the OCR results.
  • The output JSON contains the complete output, including the full text in the text field
  • You MUST display the entire text content to the user, no matter how long it is
  • Do NOT use phrases like "Here's a summary" or "The text begins with..."
  • Do NOT truncate with "..." unless the text truly exceeds reasonable display limits
  • The user expects to see ALL the recognized text, not a preview or excerpt
Correct approach:
I've extracted the text from the image. Here's the complete content:

[Display the entire text here]
Incorrect approach:
I found some text in the image. Here's a preview:
"The quick brown fox..." (truncated)

Usage Examples

Example 1: URL OCR:
bash
python scripts/ocr_caller.py --file-url "https://example.com/invoice.jpg" --pretty
Example 2: Local File OCR:
bash
python scripts/ocr_caller.py --file-path "./document.pdf" --pretty
Example 3: OCR With Explicit File Type:
bash
python scripts/ocr_caller.py --file-url "https://example.com/input" --file-type 1 --pretty
Example 4: Print JSON Without Saving:
bash
python scripts/ocr_caller.py --file-url "https://example.com/input" --stdout --pretty
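
When the command is assembled programmatically, the four examples above reduce to one small builder. The sketch below is illustrative only: build_ocr_command is a hypothetical helper, and deciding between --file-url and --file-path by checking for an http:// or https:// prefix is an assumption, not something the script documents.

```python
import sys
from typing import List, Optional


def build_ocr_command(
    source: str,
    file_type: Optional[int] = None,
    to_stdout: bool = False,
) -> List[str]:
    """Build the argument list for scripts/ocr_caller.py (hypothetical helper)."""
    # Assumption: anything starting with an HTTP(S) scheme is treated as a URL.
    is_url = source.startswith(("http://", "https://"))
    cmd = [
        sys.executable,
        "scripts/ocr_caller.py",
        "--file-url" if is_url else "--file-path",
        source,
    ]
    if file_type is not None:
        cmd += ["--file-type", str(file_type)]  # as in Example 3
    if to_stdout:
        cmd.append("--stdout")  # as in Example 4: print JSON, save nothing
    cmd.append("--pretty")
    return cmd
```

Passing the result to subprocess.run as a list avoids shell quoting problems with URLs or paths that contain spaces or special characters.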

Understanding the Output

The output JSON structure is as follows:
json
{
  "ok": true,
  "text": "All recognized text here...",
  "result": { ... },
  "error": null
}
Key fields:
  • ok: true for success, false for error
  • text: Complete recognized text
  • result: Raw API response (for debugging)
  • error: Error details if ok is false
Raw result location (default): the temp-file path printed by the script on stderr
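
As a rough illustration of how this envelope might be consumed, the sketch below checks ok, returns text on success, and raises with error.message on failure. The function name extract_text is hypothetical, and the fallback string for a missing message is an assumption; references/output_schema.md remains the authoritative description of the format.

```python
import json


def extract_text(raw_json: str) -> str:
    """Return the recognized text from an OCR result, or raise on error."""
    payload = json.loads(raw_json)
    if not payload.get("ok"):
        error = payload.get("error") or {}
        # error.message is the field the workflow says to display.
        if isinstance(error, dict):
            message = error.get("message", "unknown error")
        else:
            message = str(error)
        raise RuntimeError(message)
    # An empty string is a valid result: the image may contain no text.
    return payload.get("text") or ""
```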

First-Time Configuration

You can generally assume that the required environment variables have already been configured. Only when an OCR task fails should you analyze the error message to determine whether it is caused by a configuration issue. If it is indeed a configuration problem, you should notify the user to fix it.
When the API is not configured, the error will show:
CONFIG_ERROR: PADDLEOCR_OCR_API_URL not configured. Get your API at: https://paddleocr.com
Configuration workflow:
  1. Show the exact error message to the user (including the URL).
  2. Guide the user to configure securely:
    • Recommend configuring through the host application's standard method (e.g., settings file, environment variable UI) rather than pasting credentials in chat.
    • List the required environment variables:
      - PADDLEOCR_OCR_API_URL
      - PADDLEOCR_ACCESS_TOKEN
      - Optional: PADDLEOCR_OCR_TIMEOUT
  3. If the user provides credentials in chat anyway (accept any reasonable format), for example:
    • PADDLEOCR_OCR_API_URL=https://xxx.paddleocr.com/ocr, PADDLEOCR_ACCESS_TOKEN=abc123...
    • Here's my API: https://xxx and token: abc123
    • Copy-pasted code format
    • Any other reasonable format
    • Security note: Warn the user that credentials shared in chat may be stored in conversation history. Recommend setting them through the host application's configuration instead when possible.
    Then parse and validate the values:
    • Extract PADDLEOCR_OCR_API_URL (look for URLs with paddleocr.com or similar)
    • Confirm PADDLEOCR_OCR_API_URL is a full endpoint ending with /ocr
    • Extract PADDLEOCR_ACCESS_TOKEN (long alphanumeric string, usually 40+ chars)
  4. Ask the user to confirm the environment is configured.
  5. Retry only after confirmation:
    • Once the user confirms the environment variables are available, retry the original OCR task
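
The validation rules in this workflow can be sketched as a small check over the environment. This is only an illustration: config_problems is a hypothetical helper, it covers just the two required variables and the /ocr suffix rule listed above, and it deliberately does not check token length, since "usually 40+ chars" is a hint rather than a rule.

```python
from typing import List, Mapping

REQUIRED_VARS = ("PADDLEOCR_OCR_API_URL", "PADDLEOCR_ACCESS_TOKEN")


def config_problems(env: Mapping[str, str]) -> List[str]:
    """List human-readable configuration problems; empty means none found."""
    problems = [
        f"{name} is not set"
        for name in REQUIRED_VARS
        if not env.get(name, "").strip()
    ]
    url = env.get("PADDLEOCR_OCR_API_URL", "").strip()
    # The workflow asks for a full endpoint ending with /ocr.
    if url and not url.endswith("/ocr"):
        problems.append(
            "PADDLEOCR_OCR_API_URL should be a full endpoint ending with /ocr"
        )
    return problems
```

Calling config_problems(os.environ) after importing os would check the live environment; taking a mapping as the argument keeps the check easy to exercise with sample values and avoids echoing the token itself.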

Error Handling

Authentication failed:
API_ERROR: Authentication failed (403). Check your token.
  • The token is invalid; reconfigure with correct credentials
Quota exceeded:
API_ERROR: API rate limit exceeded (429)
  • Daily API quota exhausted, inform user to wait or upgrade
No text detected:
  • The text field is empty
  • Image may be blank, corrupted, or contain no text
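
One way to turn these messages into user guidance is to key on the CONFIG_ERROR prefix and on the status code in parentheses. The sketch below is an assumption-laden illustration: advice_for_error is a hypothetical helper, and it relies on the message formats shown in this document staying as they are.

```python
import re

# Guidance for the status codes documented above.
ADVICE_BY_STATUS = {
    403: "Token is invalid: reconfigure with correct credentials.",
    429: "Daily API quota exhausted: wait or upgrade.",
}


def advice_for_error(message: str) -> str:
    """Map an error message from the script to a short piece of guidance."""
    if message.startswith("CONFIG_ERROR"):
        return "Configuration issue: set the required environment variables."
    # Look for a three-digit status code in parentheses, e.g. "(403)".
    match = re.search(r"\((\d{3})\)", message)
    status = int(match.group(1)) if match else None
    return ADVICE_BY_STATUS.get(
        status, "Unrecognized error: show the message to the user and stop."
    )
```

Any message that does not match falls through to the default, which mirrors the rule that the error is shown to the user and no fallback is attempted.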

Tips for Better Results

If recognition quality is poor, suggest:
  • Check if the image is clear and contains text
  • Provide a higher resolution image if possible

Reference Documentation

For in-depth understanding of the OCR system, refer to:
  • references/output_schema.md - Output format specification
Note: Model version, capabilities, and supported file formats are determined by your API endpoint (PADDLEOCR_OCR_API_URL) and its official API documentation.

Testing the Skill

To verify the skill is working properly:
bash
python scripts/smoke_test.py
This tests configuration and API connectivity.