silicon-paddle-ocr


OCR skill using PaddleOCR model via SiliconFlow API. This skill should be used when the user asks to "recognize text from an image", "extract text from a photo", "OCR this image", "read text from screenshot", or mentions "PaddleOCR", "image text recognition", "text extraction from images".


NPX Install

```bash
npx skill4agent add aotenjou/silicon-paddleocr silicon-paddle-ocr
```


OCR - Image Text Recognition

Use PaddleOCR to extract text content from images. Supports single image or batch processing.

Overview

This skill provides optical character recognition (OCR) capabilities using the PaddlePaddle/PaddleOCR-VL-1.5 model via the SiliconFlow API. Extract text from JPG, PNG, WebP, BMP, and GIF images.

When to Use

Invoke this skill when:
  • User wants to extract text from an image
  • User asks to OCR a screenshot or photo
  • User needs to read text from an image file
  • User mentions text recognition from images

How to Use

Prerequisites

Ensure the `SILICONFLOW_API_KEY` environment variable is set:

```bash
export SILICONFLOW_API_KEY="your_api_key"
```

Basic Usage

Execute the OCR script:

```bash
python3 scripts/ocr_skill.py [options] images [images ...]
```

Arguments

| Argument | Description |
| --- | --- |
| `images` | Image file path(s) or glob pattern (required) |
| `-k, --api-key` | API key (default: from the `SILICONFLOW_API_KEY` environment variable) |
| `-m, --model` | OCR model name (default: `PaddlePaddle/PaddleOCR-VL-1.5`) |
| `-p, --prompt` | Recognition prompt for custom behavior |
| `-j, --json` | Output results in JSON format |
| `-o, --output` | Save results to the specified file |
| `--max-tokens` | Maximum tokens in the response (default: 2000) |
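
The flags in the table above could be declared with `argparse` roughly as follows. This is a sketch for orientation only; the actual argument definitions live in `scripts/ocr_skill.py` and may differ:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the CLI described in the Arguments table (defaults assumed)."""
    parser = argparse.ArgumentParser(
        description="OCR images with PaddleOCR via the SiliconFlow API")
    parser.add_argument("images", nargs="+",
                        help="image file path(s) or glob pattern")
    parser.add_argument("-k", "--api-key", default=None,
                        help="API key (falls back to SILICONFLOW_API_KEY)")
    parser.add_argument("-m", "--model", default="PaddlePaddle/PaddleOCR-VL-1.5",
                        help="OCR model name")
    parser.add_argument("-p", "--prompt", default=None,
                        help="custom recognition prompt")
    parser.add_argument("-j", "--json", action="store_true",
                        help="output results in JSON format")
    parser.add_argument("-o", "--output", default=None,
                        help="save results to this file")
    parser.add_argument("--max-tokens", type=int, default=2000,
                        help="maximum tokens in the response")
    return parser
```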

Examples

Single image:

```bash
python3 scripts/ocr_skill.py /path/to/image.jpg
```

Multiple images with a glob:

```bash
python3 scripts/ocr_skill.py /path/to/images/*.png
```

JSON output format:

```bash
python3 scripts/ocr_skill.py --json /path/to/image.jpg
```

Custom prompt for table extraction:

```bash
python3 scripts/ocr_skill.py -p "Please identify and format table content as Markdown" /path/to/table.jpg
```

Save to file:

```bash
python3 scripts/ocr_skill.py --json --output results.json /path/to/images/*.jpg
```
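
Under the hood, the script presumably inlines each image as a base64 data URL in a chat-completions request to the SiliconFlow API. The sketch below shows one way to build such a payload; the helper name, field layout, and defaults are assumptions, not taken from `scripts/ocr_skill.py` (see `references/api-configuration.md` for the real configuration):

```python
import base64
import mimetypes

def build_ocr_payload(image_path: str,
                      prompt: str = "Recognize all text in this image.",
                      model: str = "PaddlePaddle/PaddleOCR-VL-1.5",
                      max_tokens: int = 2000) -> dict:
    """Build an OpenAI-style chat payload with the image as a base64 data URL.
    (Hypothetical helper; the real script's request shape may differ.)"""
    mime = mimetypes.guess_type(image_path)[0] or "image/jpeg"
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
```

The payload would then be POSTed with an `Authorization: Bearer $SILICONFLOW_API_KEY` header to SiliconFlow's chat-completions endpoint.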

Output Format

Text output (default):

```
--- image.jpg ---
(recognized text content)
Detected X text regions
```
JSON output:

```json
{
  "image.jpg": {
    "image_path": "/path/to/image.jpg",
    "image_size": [width, height],
    "texts": [
      {
        "text": "recognized text",
        "box": [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]
      }
    ],
    "full_text": "all recognized text combined"
  },
  "image2.png": { ... }
}
```
Coordinates Explanation:
  • LOC values are normalized coordinates that the script converts to pixel coordinates
  • Conversion: pixel = LOC × (image_size / LOC_max_value)
  • LOC_max_value is approximately 972 (may vary by model and image)
  • The `box` field provides the four corner coordinates of each text region in pixel format
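
The conversion above can be sketched in one line of Python. The function name is illustrative, and 972 is only the approximate ceiling quoted above:

```python
LOC_MAX = 972  # approximate LOC_max_value; may vary by model and image

def loc_to_pixel(loc: float, image_dim: int, loc_max: float = LOC_MAX) -> float:
    """Convert a normalized LOC coordinate to pixels:
    pixel = LOC * (image_size / LOC_max_value)."""
    return loc * (image_dim / loc_max)
```

For example, on an image 1944 pixels wide, a LOC x-value of 486 maps to `loc_to_pixel(486, 1944)` = 972.0 pixels.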

Supported Image Formats

  • JPG/JPEG
  • PNG
  • WebP
  • BMP
  • GIF

Error Handling

If processing fails:
  • Check that the image file exists
  • Verify the SILICONFLOW_API_KEY is valid
  • Ensure the API endpoint is reachable
If an image fails to process, an error message is shown for it and the remaining images continue to be processed.
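
This continue-on-failure behavior can be sketched as follows, with a hypothetical `run_ocr` callable standing in for the real per-image API call:

```python
from typing import Callable

def ocr_batch(paths: list[str], run_ocr: Callable[[str], dict]) -> dict:
    """Run OCR on each image; record an error for failures
    instead of aborting the whole batch."""
    results: dict = {}
    for path in paths:
        try:
            results[path] = run_ocr(path)
        except Exception as exc:  # one bad image must not stop the rest
            results[path] = {"error": str(exc)}
    return results
```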

Additional Resources

Reference Files

  • `references/api-configuration.md` - API configuration details

Example Files

  • `examples/sample-usage.sh` - Example usage script

Scripts

  • `scripts/ocr_skill.py` - The main OCR implementation