pypdf

PyPDF PDF Manipulation Skill

Overview

PyPDF is a pure-Python library for working with PDF files. This skill covers common patterns for PDF manipulation, including:
  • PDF merging - Combine multiple PDFs into one document
  • PDF splitting - Extract specific pages or split into multiple files
  • Page rotation - Rotate pages by 90, 180, or 270 degrees
  • Watermarking - Add text or image watermarks to pages
  • Text extraction - Extract text content from PDF pages
  • Metadata handling - Read and modify PDF metadata
  • Form filling - Fill PDF form fields programmatically
  • Encryption/Decryption - Secure PDFs with passwords

When to Use This Skill

USE when:

  • Merging multiple PDF files into a single document
  • Splitting large PDFs into smaller files
  • Extracting specific pages from PDFs
  • Adding watermarks or stamps to documents
  • Extracting text content for analysis
  • Reading or modifying PDF metadata
  • Filling PDF forms programmatically
  • Encrypting or decrypting PDF files
  • Adding page numbers or headers/footers
  • Rotating or reordering pages
  • Automating PDF workflows in pipelines

DON'T USE when:

  • Creating PDFs from scratch (use reportlab or weasyprint)
  • Controlling text layout in detail (use reportlab)
  • Converting other formats to PDF (use dedicated converters)
  • Running OCR on scanned documents (use pytesseract + pdf2image)
  • Building complex forms (use reportlab)
  • Editing existing text content (limited support)

Prerequisites

Installation

bash
# Basic installation
pip install pypdf

# Using uv (recommended)
uv pip install pypdf

# With crypto support for encryption
# (quoted so that shells such as zsh do not expand the brackets)
pip install "pypdf[crypto]"

# For creating PDFs (watermarks, overlays)
pip install pypdf reportlab

# Full installation
pip install "pypdf[crypto]" reportlab Pillow

Verify Installation

python
import pypdf  # Needed for pypdf.__version__ below
from pypdf import PdfReader, PdfWriter
from pypdf.errors import PdfReadError

# Note: PdfMerger is deprecated in pypdf 3.x/4.x and removed in 5.0;
# PdfWriter offers the same append()/write() workflow.
print("pypdf installed successfully!")
print(f"Version: {pypdf.__version__}")

Core Capabilities

1. PDF Merging

python
"""
Merge multiple PDF files into a single document.
"""
from pypdf import PdfReader, PdfWriter
from pathlib import Path
from typing import List, Optional

def merge_pdfs(
    pdf_paths: List[str],
    output_path: str,
    bookmarks: bool = True
) -> None:
    """Merge multiple PDFs into one file."""
    # PdfWriter replaces the deprecated PdfMerger (removed in pypdf 5.0)
    # and exposes the same append()/write()/close() methods.
    merger = PdfWriter()

    for pdf_path in pdf_paths:
        path = Path(pdf_path)
        if path.exists():
            # Add with bookmark (outline entry)
            merger.append(
                str(pdf_path),
                outline_item=path.stem if bookmarks else None
            )
            print(f"Added: {path.name}")
        else:
            print(f"Warning: File not found - {pdf_path}")

    merger.write(output_path)
    merger.close()

    print(f"Merged PDF saved to: {output_path}")


def merge_with_page_selection(
    pdf_configs: List[dict],
    output_path: str
) -> None:
    """Merge specific pages from multiple PDFs.

    Args:
        pdf_configs: List of dicts with 'path', 'pages' (optional) keys
                    pages can be tuple (start, end) or list of page numbers
                    (0-indexed; the end of a tuple range is exclusive)
        output_path: Output file path
    """
    # PdfWriter is used in place of the deprecated PdfMerger
    merger = PdfWriter()

    for config in pdf_configs:
        pdf_path = config['path']
        pages = config.get('pages')

        if pages is None:
            # Add all pages
            merger.append(pdf_path)
        elif isinstance(pages, tuple):
            # Add page range (start, end)
            merger.append(pdf_path, pages=pages)
        elif isinstance(pages, list):
            # Add specific pages
            reader = PdfReader(pdf_path)
            for page_num in pages:
                if 0 <= page_num < len(reader.pages):
                    merger.append(pdf_path, pages=(page_num, page_num + 1))

        print(f"Added: {pdf_path} - Pages: {pages or 'all'}")

    merger.write(output_path)
    merger.close()

    print(f"Merged PDF saved to: {output_path}")


def merge_directory(
    directory: str,
    output_path: str,
    pattern: str = "*.pdf",
    sort_key: Optional[str] = "name"
) -> int:
    """Merge all PDFs in a directory."""
    dir_path = Path(directory)
    pdf_files = list(dir_path.glob(pattern))

    if not pdf_files:
        print(f"No PDF files found in {directory}")
        return 0

    # Sort files
    if sort_key == "name":
        pdf_files.sort(key=lambda x: x.name.lower())
    elif sort_key == "date":
        pdf_files.sort(key=lambda x: x.stat().st_mtime)
    elif sort_key == "size":
        pdf_files.sort(key=lambda x: x.stat().st_size)

    merge_pdfs([str(f) for f in pdf_files], output_path)

    return len(pdf_files)

Example usage

python
merge_pdfs(['report1.pdf', 'report2.pdf', 'appendix.pdf'], 'complete_report.pdf')

merge_with_page_selection([
    {'path': 'doc1.pdf', 'pages': (0, 5)},     # First 5 pages
    {'path': 'doc2.pdf', 'pages': [0, 2, 4]},  # Pages 1, 3, 5
    {'path': 'doc3.pdf'}                       # All pages
], 'combined.pdf')
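`merge_with_page_selection` expects 0-indexed pages, with tuples read as half-open `(start, end)` ranges, whereas people usually describe pages in 1-based terms such as "1-5,7". A minimal sketch of a converter follows; `parse_page_spec` and its string format are hypothetical, introduced here only for illustration.

```python
from typing import List, Tuple


def parse_page_spec(spec: str) -> List[Tuple[int, int]]:
    """Convert a 1-based spec like "1-5,7" into 0-based half-open ranges."""
    ranges: List[Tuple[int, int]] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # Tolerate stray commas
        if "-" in part:
            first_text, last_text = part.split("-", 1)
            first, last = int(first_text), int(last_text)
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError(f"Invalid page range: {part!r}")
        # 1-based inclusive [first, last] -> 0-based half-open [first - 1, last)
        ranges.append((first - 1, last))
    return ranges


print(parse_page_spec("1-5,7"))  # [(0, 5), (6, 7)]
```

Each resulting tuple can be used as the `'pages'` value of one config dict, which keeps the off-by-one conversion in a single place.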

2. PDF Splitting

python
"""
Split PDF files into separate documents.
"""
from pypdf import PdfReader, PdfWriter
from pathlib import Path
from typing import List, Tuple, Optional

def split_pdf_by_pages(
    input_path: str,
    output_dir: str,
    pages_per_file: int = 1
) -> List[str]:
    """Split PDF into multiple files with specified pages per file."""
    reader = PdfReader(input_path)
    total_pages = len(reader.pages)

    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    input_name = Path(input_path).stem
    created_files = []

    for start in range(0, total_pages, pages_per_file):
        writer = PdfWriter()
        end = min(start + pages_per_file, total_pages)

        for page_num in range(start, end):
            writer.add_page(reader.pages[page_num])

        # Generate output filename
        if pages_per_file == 1:
            output_file = output_path / f"{input_name}_page_{start + 1}.pdf"
        else:
            output_file = output_path / f"{input_name}_pages_{start + 1}-{end}.pdf"

        writer.write(str(output_file))
        created_files.append(str(output_file))

        print(f"Created: {output_file.name}")

    print(f"Split into {len(created_files)} files")
    return created_files


def extract_pages(
    input_path: str,
    output_path: str,
    page_numbers: List[int]
) -> None:
    """Extract specific pages from a PDF.

    Args:
        input_path: Source PDF file
        output_path: Destination file
        page_numbers: List of page numbers (0-indexed)
    """
    reader = PdfReader(input_path)
    writer = PdfWriter()

    for page_num in page_numbers:
        if 0 <= page_num < len(reader.pages):
            writer.add_page(reader.pages[page_num])
            print(f"Extracted page {page_num + 1}")
        else:
            print(f"Warning: Page {page_num + 1} out of range")

    writer.write(output_path)
    print(f"Extracted pages saved to: {output_path}")


def split_by_ranges(
    input_path: str,
    output_dir: str,
    ranges: List[Tuple[int, int, str]]
) -> List[str]:
    """Split PDF by specified page ranges.

    Args:
        input_path: Source PDF file
        output_dir: Output directory
        ranges: List of (start, end, name) tuples
                start and end are 0-indexed
    """
    reader = PdfReader(input_path)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    created_files = []

    for start, end, name in ranges:
        writer = PdfWriter()

        for page_num in range(start, min(end, len(reader.pages))):
            writer.add_page(reader.pages[page_num])

        output_file = output_path / f"{name}.pdf"
        writer.write(str(output_file))
        created_files.append(str(output_file))

        print(f"Created: {output_file.name} (pages {start + 1}-{end})")

    return created_files


def split_by_bookmarks(
    input_path: str,
    output_dir: str
) -> List[str]:
    """Split PDF by bookmark (outline) entries."""
    reader = PdfReader(input_path)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    if not reader.outline:
        print("No bookmarks found in PDF")
        return []

    created_files = []

    # Get bookmark page numbers
    bookmarks = []
    for item in reader.outline:
        if isinstance(item, list):
            continue  # Skip nested bookmarks
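        try:
            page_num = reader.get_destination_page_number(item)
            title = item.title
        except Exception:
            # Skip bookmarks whose destination cannot be resolved
            continue
        if page_num is None:
            continue  # Destination does not point at a page in this file
        bookmarks.append((page_num, title))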

    # Sort by page number
    bookmarks.sort(key=lambda x: x[0])

    # Add end marker
    bookmarks.append((len(reader.pages), "END"))

    # Create PDFs for each section
    for i in range(len(bookmarks) - 1):
        start_page, title = bookmarks[i]
        end_page = bookmarks[i + 1][0]

        if start_page >= end_page:
            continue

        writer = PdfWriter()
        for page_num in range(start_page, end_page):
            writer.add_page(reader.pages[page_num])

        # Clean filename
        safe_title = "".join(c if c.isalnum() or c in ' -_' else '_' for c in title)
        output_file = output_path / f"{i + 1:02d}_{safe_title}.pdf"

        writer.write(str(output_file))
        created_files.append(str(output_file))

        print(f"Created: {output_file.name}")

    return created_files

Example usage

python
split_pdf_by_pages('large_document.pdf', 'split_output/', pages_per_file=10)

extract_pages('document.pdf', 'selected_pages.pdf', [0, 4, 9])  # Pages 1, 5, 10

split_by_ranges('manual.pdf', 'chapters/', [
    (0, 10, 'chapter_1'),
    (10, 25, 'chapter_2'),
    (25, 40, 'chapter_3')
])
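The chunking arithmetic in `split_pdf_by_pages` can be checked without touching any PDF. The sketch below uses a hypothetical helper, `plan_chunks`, that mirrors the loop's `range`/`min` logic and the filename scheme from the code above; it is meant as a dry-run aid, not as part of the original skill.

```python
from typing import List, Tuple


def plan_chunks(
    total_pages: int,
    pages_per_file: int,
    stem: str = "document"
) -> List[Tuple[int, int, str]]:
    """Return (start, end, filename) per output file; end is exclusive."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")

    plan = []
    for start in range(0, total_pages, pages_per_file):
        # The last chunk may be shorter than pages_per_file
        end = min(start + pages_per_file, total_pages)
        if pages_per_file == 1:
            name = f"{stem}_page_{start + 1}.pdf"
        else:
            # Filenames are 1-based, so the exclusive `end` equals the last page
            name = f"{stem}_pages_{start + 1}-{end}.pdf"
        plan.append((start, end, name))
    return plan


for start, end, name in plan_chunks(25, 10, stem="large_document"):
    print(start, end, name)
```

For a 25-page input split 10 pages at a time, the plan has three entries, the last covering only pages 21 to 25.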

3. Page Rotation and Transformation

python
"""
Rotate, crop, and transform PDF pages.
"""
from pypdf import PdfReader, PdfWriter, Transformation
from pypdf.generic import RectangleObject
from pathlib import Path
from typing import List, Optional

def rotate_pages(
    input_path: str,
    output_path: str,
    rotation: int,
    pages: Optional[List[int]] = None
) -> None:
    """Rotate PDF pages by specified degrees.

    Args:
        input_path: Source PDF file
        output_path: Destination file
        rotation: Rotation in degrees (90, 180, or 270)
        pages: List of page numbers to rotate (0-indexed), None for all
    """
    if rotation not in [90, 180, 270]:
        raise ValueError("Rotation must be 90, 180, or 270 degrees")

    reader = PdfReader(input_path)
    writer = PdfWriter()

    for i, page in enumerate(reader.pages):
        if pages is None or i in pages:
            page.rotate(rotation)
        writer.add_page(page)

    writer.write(output_path)
    print(f"Rotated PDF saved to: {output_path}")


def rotate_landscape_pages(
    input_path: str,
    output_path: str
) -> int:
    """Automatically rotate landscape pages to portrait."""
    reader = PdfReader(input_path)
    writer = PdfWriter()

    rotated_count = 0

    for page in reader.pages:
        # Get page dimensions
        width = float(page.mediabox.width)
        height = float(page.mediabox.height)

        # Rotate if landscape
        if width > height:
            page.rotate(90)
            rotated_count += 1

        writer.add_page(page)

    writer.write(output_path)
    print(f"Rotated {rotated_count} landscape pages")
    return rotated_count


def crop_pages(
    input_path: str,
    output_path: str,
    crop_box: tuple,
    pages: Optional[List[int]] = None
) -> None:
    """Crop PDF pages to specified dimensions.

    Args:
        input_path: Source PDF file
        output_path: Destination file
        crop_box: (left, bottom, right, top) in points (72 points = 1 inch)
        pages: List of page numbers to crop (0-indexed), None for all
    """
    reader = PdfReader(input_path)
    writer = PdfWriter()

    left, bottom, right, top = crop_box

    for i, page in enumerate(reader.pages):
        if pages is None or i in pages:
            page.mediabox = RectangleObject([left, bottom, right, top])
            page.cropbox = RectangleObject([left, bottom, right, top])

        writer.add_page(page)

    writer.write(output_path)
    print(f"Cropped PDF saved to: {output_path}")


def scale_pages(
    input_path: str,
    output_path: str,
    scale_x: float = 1.0,
    scale_y: float = 1.0
) -> None:
    """Scale PDF pages by specified factors."""
    reader = PdfReader(input_path)
    writer = PdfWriter()

    for page in reader.pages:
        # Apply transformation
        op = Transformation().scale(sx=scale_x, sy=scale_y)
        page.add_transformation(op)

        # Update media box
        page.mediabox.lower_left = (
            float(page.mediabox.lower_left[0]) * scale_x,
            float(page.mediabox.lower_left[1]) * scale_y
        )
        page.mediabox.upper_right = (
            float(page.mediabox.upper_right[0]) * scale_x,
            float(page.mediabox.upper_right[1]) * scale_y
        )

        writer.add_page(page)

    writer.write(output_path)
    print(f"Scaled PDF saved to: {output_path}")


def reorder_pages(
    input_path: str,
    output_path: str,
    new_order: List[int]
) -> None:
    """Reorder PDF pages according to specified order.

    Args:
        input_path: Source PDF file
        output_path: Destination file
        new_order: List of page indices in desired order (0-indexed)
    """
    reader = PdfReader(input_path)
    writer = PdfWriter()

    for page_num in new_order:
        if 0 <= page_num < len(reader.pages):
            writer.add_page(reader.pages[page_num])

    writer.write(output_path)
    print(f"Reordered PDF saved to: {output_path}")

Example usage

python
rotate_pages('document.pdf', 'rotated.pdf', 90)

rotate_pages('document.pdf', 'rotated.pdf', 90, pages=[0, 2, 4])

crop_pages('document.pdf', 'cropped.pdf', (72, 72, 540, 720))  # 1 inch margins

reorder_pages('document.pdf', 'reordered.pdf', [2, 0, 1, 4, 3])
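The crop box `(72, 72, 540, 720)` in the example corresponds to 1-inch margins on a US Letter page (612 x 792 points). A small sketch of that arithmetic follows; `crop_box_from_margins` is a hypothetical helper, and it assumes the page origin sits at `(0, 0)`, which is common but not guaranteed for every PDF.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF default user space: 72 points = 1 inch


def crop_box_from_margins(
    page_width: float,
    page_height: float,
    margin_inches: float
) -> Tuple[float, float, float, float]:
    """Return (left, bottom, right, top) in points for a uniform margin."""
    margin = margin_inches * POINTS_PER_INCH
    if margin < 0 or 2 * margin >= min(page_width, page_height):
        raise ValueError("Margin leaves no visible area")
    return (margin, margin, page_width - margin, page_height - margin)


# US Letter is 612 x 792 points (8.5 x 11 inches)
print(crop_box_from_margins(612, 792, 1.0))  # (72.0, 72.0, 540.0, 720.0)
```

The returned tuple has the same `(left, bottom, right, top)` layout that `crop_pages` takes as its `crop_box` argument.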

4. Watermarking and Stamping

python
"""
Add watermarks, stamps, and overlays to PDF pages.
"""
from pypdf import PdfReader, PdfWriter
from pathlib import Path
from typing import Optional, Tuple
from io import BytesIO

# For creating watermarks
try:
    from reportlab.pdfgen import canvas
    from reportlab.lib.pagesizes import letter
    from reportlab.lib.colors import Color
    REPORTLAB_AVAILABLE = True
except ImportError:
    REPORTLAB_AVAILABLE = False


def create_text_watermark(
    text: str,
    output_path: str,
    font_size: int = 60,
    opacity: float = 0.3,
    rotation: int = 45,
    color: Tuple[float, float, float] = (0.5, 0.5, 0.5)
) -> str:
    """Create a watermark PDF with specified text."""
    if not REPORTLAB_AVAILABLE:
        raise ImportError("reportlab is required for creating watermarks")

    packet = BytesIO()
    c = canvas.Canvas(packet, pagesize=letter)
    width, height = letter

    # Set transparency
    c.setFillColor(Color(*color, alpha=opacity))

    # Save state, rotate, draw text
    c.saveState()
    c.translate(width / 2, height / 2)
    c.rotate(rotation)
    c.setFont("Helvetica-Bold", font_size)

    # Draw text centered
    text_width = c.stringWidth(text, "Helvetica-Bold", font_size)
    c.drawString(-text_width / 2, 0, text)

    c.restoreState()
    c.save()

    # Write to file
    packet.seek(0)
    with open(output_path, 'wb') as f:
        f.write(packet.getvalue())

    return output_path


def add_watermark(
    input_path: str,
    watermark_path: str,
    output_path: str,
    pages: Optional[list] = None
) -> None:
    """Add watermark to PDF pages.

    Args:
        input_path: Source PDF file
        watermark_path: Watermark PDF file
        output_path: Destination file
        pages: List of page numbers to watermark (0-indexed), None for all
    """
    reader = PdfReader(input_path)
    watermark_reader = PdfReader(watermark_path)
    watermark_page = watermark_reader.pages[0]

    writer = PdfWriter()

    for i, page in enumerate(reader.pages):
        if pages is None or i in pages:
            page.merge_page(watermark_page)
        writer.add_page(page)

    writer.write(output_path)
    print(f"Watermarked PDF saved to: {output_path}")


def add_page_numbers(
    input_path: str,
    output_path: str,
    position: str = "bottom-center",
    start_number: int = 1,
    prefix: str = "Page ",
    font_size: int = 10
) -> None:
    """Add page numbers to PDF.

    Args:
        input_path: Source PDF file
        output_path: Destination file
        position: Position of page number (bottom-center, bottom-right, etc.)
        start_number: Starting page number
        prefix: Text before page number
        font_size: Font size for page numbers
    """
    if not REPORTLAB_AVAILABLE:
        raise ImportError("reportlab is required for adding page numbers")

    reader = PdfReader(input_path)
    writer = PdfWriter()

    for i, page in enumerate(reader.pages):
        # Get page dimensions
        width = float(page.mediabox.width)
        height = float(page.mediabox.height)

        # Create page number overlay
        packet = BytesIO()
        c = canvas.Canvas(packet, pagesize=(width, height))

        # Calculate position
        page_num_text = f"{prefix}{start_number + i}"

        if position == "bottom-center":
            x = width / 2
            y = 30
        elif position == "bottom-right":
            x = width - 50
            y = 30
        elif position == "top-center":
            x = width / 2
            y = height - 30
        elif position == "top-right":
            x = width - 50
            y = height - 30
        else:
            x = width / 2
            y = 30

        c.setFont("Helvetica", font_size)
        text_width = c.stringWidth(page_num_text, "Helvetica", font_size)

        if "center" in position:
            x -= text_width / 2

        c.drawString(x, y, page_num_text)
        c.save()

        # Merge with page
        packet.seek(0)
        overlay = PdfReader(packet)
        page.merge_page(overlay.pages[0])
        writer.add_page(page)

    writer.write(output_path)
    print(f"Page numbers added to: {output_path}")


def add_header_footer(
    input_path: str,
    output_path: str,
    header: Optional[str] = None,
    footer: Optional[str] = None,
    font_size: int = 10
) -> None:
    """Add header and/or footer to all pages."""
    if not REPORTLAB_AVAILABLE:
        raise ImportError("reportlab is required for adding headers/footers")

    reader = PdfReader(input_path)
    writer = PdfWriter()

    for page in reader.pages:
        width = float(page.mediabox.width)
        height = float(page.mediabox.height)

        # Create overlay
        packet = BytesIO()
        c = canvas.Canvas(packet, pagesize=(width, height))
        c.setFont("Helvetica", font_size)

        if header:
            text_width = c.stringWidth(header, "Helvetica", font_size)
            c.drawString((width - text_width) / 2, height - 30, header)

        if footer:
            text_width = c.stringWidth(footer, "Helvetica", font_size)
            c.drawString((width - text_width) / 2, 20, footer)

        c.save()

        # Merge
        packet.seek(0)
        overlay = PdfReader(packet)
        page.merge_page(overlay.pages[0])
        writer.add_page(page)

    writer.write(output_path)
    print(f"Header/footer added to: {output_path}")
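
The positioning branches in `add_page_numbers` can be isolated into a pure function, which makes them testable without reportlab or a real PDF. The following is a sketch under that assumption; `label_origin` is a hypothetical name, and the 30- and 50-point offsets are copied from the code above.

```python
from typing import Tuple


def label_origin(
    width: float,
    height: float,
    text_width: float,
    position: str = "bottom-center"
) -> Tuple[float, float]:
    """Return the (x, y) point where drawString should start the label."""
    anchors = {
        "bottom-center": (width / 2, 30),
        "bottom-right": (width - 50, 30),
        "top-center": (width / 2, height - 30),
        "top-right": (width - 50, height - 30),
    }
    # Unrecognized positions fall back to the bottom-center anchor
    x, y = anchors.get(position, (width / 2, 30))

    if "center" in position:
        # drawString starts at the left edge of the text, so shift by half its width
        x -= text_width / 2
    return (x, y)


print(label_origin(612, 792, 40, "bottom-center"))  # (286.0, 30)
print(label_origin(612, 792, 40, "top-right"))      # (562, 762)
```

Keeping the geometry separate from the drawing calls also makes it easier to add positions such as bottom-left later without touching the overlay code.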

Example usage

create_text_watermark("CONFIDENTIAL", "watermark.pdf")
add_watermark("document.pdf", "watermark.pdf", "watermarked_document.pdf")
add_page_numbers("document.pdf", "numbered.pdf", position="bottom-center")
add_header_footer("document.pdf", "with_header.pdf",
                  header="Company Name - Confidential",
                  footer="Do not distribute")

5. Text Extraction and Metadata

python
"""
Extract text and manage PDF metadata.
"""
from pypdf import PdfReader, PdfWriter
from pathlib import Path
from typing import Dict, Optional, List
from datetime import datetime

def extract_text(
    input_path: str,
    pages: Optional[List[int]] = None,
    preserve_layout: bool = False
) -> str:
    """Extract text from PDF.

    Args:
        input_path: Source PDF file
        pages: List of page numbers to extract (0-indexed), None for all
        preserve_layout: Try to preserve text layout

    Returns:
        Extracted text as string
    """
    reader = PdfReader(input_path)
    text_parts = []

    # Only fall back to all pages when no list was given; an empty list stays empty
    target_pages = pages if pages is not None else range(len(reader.pages))

    for page_num in target_pages:
        if 0 <= page_num < len(reader.pages):
            page = reader.pages[page_num]

            if preserve_layout:
                page_text = page.extract_text(extraction_mode="layout")
            else:
                page_text = page.extract_text()

            if page_text:
                text_parts.append(f"--- Page {page_num + 1} ---\n{page_text}")

    return "\n\n".join(text_parts)


def extract_text_to_file(
    input_path: str,
    output_path: str,
    pages: Optional[List[int]] = None
) -> int:
    """Extract text from PDF and save to file."""
    text = extract_text(input_path, pages)

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(text)

    word_count = len(text.split())
    print(f"Extracted {word_count} words to: {output_path}")
    return word_count


def get_pdf_info(input_path: str) -> Dict:
    """Get PDF document information and metadata."""
    reader = PdfReader(input_path)

    info = {
        'file_path': input_path,
        'num_pages': len(reader.pages),
        'is_encrypted': reader.is_encrypted,
        'metadata': {}
    }

    # Get metadata
    if reader.metadata:
        metadata = reader.metadata
        info['metadata'] = {
            'title': metadata.get('/Title', ''),
            'author': metadata.get('/Author', ''),
            'subject': metadata.get('/Subject', ''),
            'creator': metadata.get('/Creator', ''),
            'producer': metadata.get('/Producer', ''),
            'creation_date': str(metadata.get('/CreationDate', '')),
            'modification_date': str(metadata.get('/ModDate', ''))
        }

    # Get page dimensions of first page
    if reader.pages:
        first_page = reader.pages[0]
        info['page_width'] = float(first_page.mediabox.width)
        info['page_height'] = float(first_page.mediabox.height)
        info['page_size_inches'] = (
            info['page_width'] / 72,
            info['page_height'] / 72
        )

    return info


def set_pdf_metadata(
    input_path: str,
    output_path: str,
    metadata: Dict[str, str]
) -> None:
    """Set PDF metadata.

    Args:
        input_path: Source PDF file
        output_path: Destination file
        metadata: Dictionary with keys: title, author, subject, keywords, creator
    """
    reader = PdfReader(input_path)
    writer = PdfWriter()

    # Copy pages
    for page in reader.pages:
        writer.add_page(page)

    # Set metadata
    writer.add_metadata({
        '/Title': metadata.get('title', ''),
        '/Author': metadata.get('author', ''),
        '/Subject': metadata.get('subject', ''),
        '/Keywords': metadata.get('keywords', ''),
        '/Creator': metadata.get('creator', 'pypdf'),
        '/Producer': 'pypdf',
        '/ModDate': datetime.now().strftime("D:%Y%m%d%H%M%S")
    })

    writer.write(output_path)
    print(f"Metadata updated: {output_path}")


def search_pdf(
    input_path: str,
    search_term: str,
    case_sensitive: bool = False
) -> List[Dict]:
    """Search for text in PDF and return page numbers and context."""
    reader = PdfReader(input_path)
    results = []

    for i, page in enumerate(reader.pages):
        text = page.extract_text()
        if not text:
            continue

        search_text = text if case_sensitive else text.lower()
        term = search_term if case_sensitive else search_term.lower()

        if term in search_text:
            # Find context around match
            idx = search_text.find(term)
            start = max(0, idx - 50)
            end = min(len(text), idx + len(term) + 50)
            context = text[start:end].replace('\n', ' ')

            results.append({
                'page': i + 1,
                'context': f"...{context}..."
            })

    return results
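
search_pdf reports only the first occurrence on each page. If every occurrence matters, the per-page string search can be pulled out into a small helper. The following is a minimal sketch, not part of pypdf: the name `find_contexts`, the `window` parameter, and the `offset` key are assumptions made for illustration. It takes the text of one page (for example the result of `page.extract_text()`) and returns one entry per match.

```python
from typing import Dict, List


def find_contexts(
    text: str,
    term: str,
    case_sensitive: bool = False,
    window: int = 50,
) -> List[Dict[str, object]]:
    """Return one entry per occurrence of term in text (hypothetical helper)."""
    if not term:
        return []

    haystack = text if case_sensitive else text.lower()
    needle = term if case_sensitive else term.lower()

    matches: List[Dict[str, object]] = []
    idx = haystack.find(needle)
    while idx != -1:
        # Take up to `window` characters on each side of the match
        start = max(0, idx - window)
        end = min(len(text), idx + len(needle) + window)
        matches.append({
            "offset": idx,
            "context": text[start:end].replace("\n", " "),
        })
        # Resume after this match so overlapping hits are not reported twice
        idx = haystack.find(needle, idx + len(needle))
    return matches
```

Inside search_pdf, the body of the page loop could then call `find_contexts(text, search_term, case_sensitive)` and attach the page number to each returned entry.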

Example usage

text = extract_text('document.pdf')
info = get_pdf_info('document.pdf')
set_pdf_metadata('document.pdf', 'with_metadata.pdf', {
    'title': 'My Document',
    'author': 'John Doe',
    'subject': 'Report'
})
results = search_pdf('document.pdf', 'important')
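
set_pdf_metadata writes `/ModDate` from a naive `datetime.now()`, so the stored value carries no time zone. PDF date strings can include a UTC offset after the seconds. The sketch below shows one way to produce such a string with the standard library only; the helper name `to_pdf_date` is an assumption, and it uses the offset form with a trailing apostrophe (`+HH'mm'`) found in older PDF versions.

```python
from datetime import datetime, timedelta, timezone


def to_pdf_date(dt: datetime) -> str:
    """Format a timezone-aware datetime as a PDF date string (hypothetical helper)."""
    offset = dt.utcoffset()
    if offset is None:
        raise ValueError("expected a timezone-aware datetime")

    base = dt.strftime("D:%Y%m%d%H%M%S")

    # Split the UTC offset into sign, hours and minutes
    total_minutes = int(offset.total_seconds() // 60)
    sign = "+" if total_minutes >= 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{base}{sign}{hours:02d}'{minutes:02d}'"
```

With this in place, the metadata dictionary could use `'/ModDate': to_pdf_date(datetime.now(timezone.utc))` instead of the naive timestamp.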

6. Encryption and Form Filling


python
"""
PDF encryption, decryption, and form handling.
"""
from pypdf import PdfReader, PdfWriter
from pathlib import Path
from typing import Dict, Optional, List

def encrypt_pdf(
    input_path: str,
    output_path: str,
    user_password: str,
    owner_password: Optional[str] = None,
    permissions: Optional[Dict[str, bool]] = None
) -> None:
    """Encrypt PDF with password protection.

    Args:
        input_path: Source PDF file
        output_path: Destination file
        user_password: Password to open the document
        owner_password: Password for full access (defaults to user_password)
        permissions: Dict of permission flags (print, modify, copy, etc.)
    """
    reader = PdfReader(input_path)
    writer = PdfWriter()

    for page in reader.pages:
        writer.add_page(page)

    # Copy metadata if exists
    if reader.metadata:
        writer.add_metadata(reader.metadata)

    # Default permissions (restrictive)
    default_permissions = {
        'print': True,
        'modify': False,
        'copy': False,
        'annotations': True,
        'forms': True,
        'extract': False,
        'assemble': False,
        'print_high_quality': True
    }

    if permissions:
        default_permissions.update(permissions)

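    # Encrypt
    # NOTE: default_permissions is not passed to pypdf here. permissions_flag=-1
    # sets every permission bit, so the encrypted file allows all operations.
    owner_pwd = owner_password or user_password
    writer.encrypt(
        user_password=user_password,
        owner_password=owner_pwd,
        permissions_flag=-1
    )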

    writer.write(output_path)
    print(f"Encrypted PDF saved to: {output_path}")


def decrypt_pdf(
    input_path: str,
    output_path: str,
    password: str
) -> bool:
    """Decrypt a password-protected PDF.

    Args:
        input_path: Encrypted PDF file
        output_path: Destination file (unencrypted)
        password: Password to decrypt

    Returns:
        True if successful, False otherwise
    """
    try:
        reader = PdfReader(input_path)

        if reader.is_encrypted:
            if not reader.decrypt(password):
                print("Incorrect password")
                return False

        writer = PdfWriter()
        for page in reader.pages:
            writer.add_page(page)

        if reader.metadata:
            writer.add_metadata(reader.metadata)

        writer.write(output_path)
        print(f"Decrypted PDF saved to: {output_path}")
        return True

    except Exception as e:
        print(f"Decryption failed: {e}")
        return False


def get_form_fields(input_path: str) -> Dict[str, Dict]:
    """Get all form fields from a PDF."""
    reader = PdfReader(input_path)
    fields = {}

    if reader.get_fields():
        for name, field in reader.get_fields().items():
            field_type = field.get('/FT', '')
            value = field.get('/V', '')

            fields[name] = {
                'type': str(field_type),
                'value': str(value) if value else '',
                'field': field
            }

    return fields


def fill_pdf_form(
    input_path: str,
    output_path: str,
    field_values: Dict[str, str],
    flatten: bool = False
) -> None:
    """Fill PDF form fields with values.

    Args:
        input_path: Source PDF with form fields
        output_path: Destination file
        field_values: Dictionary of field names and values
        flatten: If True, make form fields uneditable
    """
    reader = PdfReader(input_path)
    writer = PdfWriter()

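    # Copy pages together with the document-level /AcroForm dictionary
    # (add_page alone copies the pages but not the form definition)
    writer.append(reader)

    # Field flag bit 1 marks a field read-only. This makes fields uneditable
    # while keeping them visible; deleting /Annots would remove the filled
    # values along with the widgets.
    flags = 1 if flatten else 0

    # Update form fields on every page that carries them
    for page in writer.pages:
        writer.update_page_form_field_values(page, field_values, flags=flags)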

    writer.write(output_path)
    print(f"Form filled and saved to: {output_path}")


def list_form_fields_report(input_path: str) -> str:
    """Generate a report of all form fields in a PDF."""
    fields = get_form_fields(input_path)

    if not fields:
        return "No form fields found in this PDF."

    report = ["PDF Form Fields Report", "=" * 40, ""]

    for name, info in fields.items():
        report.append(f"Field: {name}")
        report.append(f"  Type: {info['type']}")
        report.append(f"  Current Value: {info['value'] or '(empty)'}")
        report.append("")

    report.append(f"Total fields: {len(fields)}")

    return "\n".join(report)
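
encrypt_pdf merges a permissions dict but never converts it into a flag value. As a rough illustration of what that conversion involves, the sketch below maps the dict keys used above onto the user access permission bits defined by the PDF specification (bit 3 for printing through bit 12 for high-quality printing). The names `PERMISSION_BITS` and `permissions_to_flag` are hypothetical, pairing `copy` with bit 5 and `extract` with bit 10 is an assumption of this sketch, and reserved bits are left unset.

```python
from typing import Dict

# Bit values follow the user access permission table of the PDF specification.
# The key names come from the default_permissions dict in encrypt_pdf; the
# pairing of "copy" and "extract" with bits 5 and 10 is an assumption.
PERMISSION_BITS: Dict[str, int] = {
    "print": 1 << 2,                # bit 3
    "modify": 1 << 3,               # bit 4
    "copy": 1 << 4,                 # bit 5
    "annotations": 1 << 5,          # bit 6
    "forms": 1 << 8,                # bit 9
    "extract": 1 << 9,              # bit 10
    "assemble": 1 << 10,            # bit 11
    "print_high_quality": 1 << 11,  # bit 12
}


def permissions_to_flag(permissions: Dict[str, bool]) -> int:
    """Combine the allowed permissions into a single integer bitmask."""
    flag = 0
    for name, allowed in permissions.items():
        if name not in PERMISSION_BITS:
            raise KeyError(f"unknown permission: {name}")
        if allowed:
            flag |= PERMISSION_BITS[name]
    return flag
```

How the resulting integer should be handed to `writer.encrypt` depends on the installed pypdf version; some releases expect a dedicated permissions type rather than a plain integer, so check the library documentation before wiring this in.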

Example usage

encrypt_pdf('document.pdf', 'encrypted.pdf', 'mypassword')
decrypt_pdf('encrypted.pdf', 'decrypted.pdf', 'mypassword')
fields = get_form_fields('form.pdf')
fill_pdf_form('form.pdf', 'filled_form.pdf', {
    'name': 'John Doe',
    'date': '2026-01-17',
    'signature': 'John Doe'
})

Integration Examples

Batch PDF Processing Pipeline

python
"""
Batch process PDFs with configurable operations.
"""
from pypdf import PdfReader, PdfWriter  # PdfMerger was unused and is gone from recent pypdf releases
from pathlib import Path
from typing import List, Dict, Any, Callable
from concurrent.futures import ThreadPoolExecutor, as_completed
import logging

logger = logging.getLogger(__name__)

class PDFProcessor:
    """Batch PDF processing with configurable operations."""

    def __init__(self, output_dir: str):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)

    def process_batch(
        self,
        pdf_files: List[str],
        operations: List[Dict[str, Any]],
        parallel: bool = False
    ) -> List[Dict]:
        """Process multiple PDFs with specified operations.

        Args:
            pdf_files: List of PDF file paths
            operations: List of operation configs
            parallel: Run in parallel if True
        """
        results = []

        if parallel:
            with ThreadPoolExecutor(max_workers=4) as executor:
                futures = {
                    executor.submit(self._process_single, f, operations): f
                    for f in pdf_files
                }
                for future in as_completed(futures):
                    results.append(future.result())
        else:
            for pdf_file in pdf_files:
                results.append(self._process_single(pdf_file, operations))

        return results

    def _process_single(
        self,
        pdf_path: str,
        operations: List[Dict[str, Any]]
    ) -> Dict:
        """Process single PDF with operations."""
        result = {'file': pdf_path, 'success': True, 'operations': []}

        try:
            current_path = pdf_path

            for op in operations:
                op_name = op['name']
                op_params = op.get('params', {})

                output_path = str(
                    self.output_dir / f"{Path(current_path).stem}_{op_name}.pdf"
                )

                if op_name == 'rotate':
                    self._rotate(current_path, output_path, **op_params)
                elif op_name == 'watermark':
                    self._watermark(current_path, output_path, **op_params)
                elif op_name == 'extract_pages':
                    self._extract_pages(current_path, output_path, **op_params)
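                elif op_name == 'encrypt':
                    self._encrypt(current_path, output_path, **op_params)
                else:
                    # Fail fast instead of recording an output file that was never written
                    raise ValueError(f"Unknown operation: {op_name}")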

                result['operations'].append({
                    'name': op_name,
                    'output': output_path
                })
                current_path = output_path

            result['final_output'] = current_path

        except Exception as e:
            result['success'] = False
            result['error'] = str(e)
            logger.exception(f"Failed to process {pdf_path}")

        return result

    def _rotate(self, input_path, output_path, rotation=90, pages=None):
        reader = PdfReader(input_path)
        writer = PdfWriter()
        for i, page in enumerate(reader.pages):
            if pages is None or i in pages:
                page.rotate(rotation)
            writer.add_page(page)
        writer.write(output_path)

    def _watermark(self, input_path, output_path, watermark_path):
        reader = PdfReader(input_path)
        watermark = PdfReader(watermark_path).pages[0]
        writer = PdfWriter()
        for page in reader.pages:
            page.merge_page(watermark)
            writer.add_page(page)
        writer.write(output_path)

    def _extract_pages(self, input_path, output_path, pages):
        reader = PdfReader(input_path)
        writer = PdfWriter()
        for p in pages:
            if 0 <= p < len(reader.pages):
                writer.add_page(reader.pages[p])
        writer.write(output_path)

    def _encrypt(self, input_path, output_path, password):
        reader = PdfReader(input_path)
        writer = PdfWriter()
        for page in reader.pages:
            writer.add_page(page)
        writer.encrypt(password)
        writer.write(output_path)
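
With `parallel=True`, process_batch collects results through `as_completed`, so the returned list follows completion order rather than the order of `pdf_files`. If callers need results aligned with their inputs, `executor.map` is one option. The sketch below uses only the standard library; the helper name `run_in_input_order` is an assumption and is not part of PDFProcessor.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_in_input_order(
    worker: Callable[[T], R],
    items: Iterable[T],
    max_workers: int = 4,
) -> List[R]:
    """Run worker over items in a thread pool, keeping input order (hypothetical helper)."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in submission order, regardless of
        # which task finishes first
        return list(executor.map(worker, items))
```

Because `_process_single` catches its own exceptions and returns a result dict, a call such as `run_in_input_order(lambda f: processor._process_single(f, operations), pdf_files)` should return one entry per input file in the original order.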

Example usage

processor = PDFProcessor('processed_output/')
results = processor.process_batch(
    ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'],
    [
        {'name': 'rotate', 'params': {'rotation': 90}},
        {'name': 'watermark', 'params': {'watermark_path': 'watermark.pdf'}},
        {'name': 'encrypt', 'params': {'password': 'secure123'}}
    ],
    parallel=True
)

Best Practices

1. Memory Management

python
"""Best practices for handling large PDFs."""

# DO: Process pages one at a time for large files
def process_large_pdf(input_path, output_path):
    reader = PdfReader(input_path)
    writer = PdfWriter()

    for page in reader.pages:
        # Process page
        writer.add_page(page)
        # The writer holds every added page in memory until write() is called

    writer.write(output_path)

# DO: Use context managers when available
with open('document.pdf', 'rb') as f:
    reader = PdfReader(f)
    # Process...

2. Error Handling


python
"""Robust error handling for PDF operations."""
from pypdf import PdfReader
from pypdf.errors import PdfReadError, PdfReadWarning

def safe_read_pdf(pdf_path):
    """Safely read PDF with error handling."""
    try:
        reader = PdfReader(pdf_path)
        return reader, None
    except PdfReadError as e:
        return None, f"Invalid PDF: {e}"
    except FileNotFoundError:
        return None, f"File not found: {pdf_path}"
    except PermissionError:
        return None, f"Permission denied: {pdf_path}"
    except Exception as e:
        return None, f"Unexpected error: {e}"
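One way to consume the `(reader, error)` convention used by `safe_read_pdf` is to split a batch into readable and failed inputs. The sketch below is a minimal illustration, not part of pypdf: `partition_results` and `fake_read` are hypothetical names, and `fake_read` is a stand-in so the example runs without any PDF files. In practice `safe_read_pdf` would be passed as `read_func`.

```python
def partition_results(paths, read_func):
    """Split paths into readable and failed, given a (value, error) reader."""
    readable, failed = {}, {}
    for path in paths:
        value, error = read_func(path)
        if error is None:
            readable[path] = value
        else:
            failed[path] = error
    return readable, failed


# Hypothetical stand-in reader so the sketch runs without pypdf;
# pass safe_read_pdf here when working with real files.
def fake_read(path):
    if path.endswith(".pdf"):
        return f"<reader for {path}>", None
    return None, f"Invalid PDF: {path}"


ok, errors = partition_results(["a.pdf", "notes.txt"], fake_read)
print(sorted(ok))  # prints ['a.pdf']
print(errors)      # prints {'notes.txt': 'Invalid PDF: notes.txt'}
```

Keeping the failures in a dictionary makes it straightforward to log or retry them after the batch finishes.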

3. Validation

python
"""Validate PDF files before processing."""
from pathlib import Path

from pypdf import PdfReader

def validate_pdf(pdf_path):
    """Validate PDF file."""
    path = Path(pdf_path)

    if not path.exists():
        return False, "File does not exist"

    if path.suffix.lower() != '.pdf':
        return False, "Not a PDF file"

    try:
        reader = PdfReader(pdf_path)
        _ = len(reader.pages)  # Try to access pages
        return True, "Valid PDF"
    except Exception as e:
        return False, f"Invalid PDF: {e}"
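The suffix check in `validate_pdf` only inspects the file name. A cheap complementary check, sketched here on the assumption that a well-formed PDF begins with the `%PDF-` header, is to read the first few bytes before handing the file to `PdfReader`. The helper name `has_pdf_header` is hypothetical and not part of pypdf.

```python
from pathlib import Path


def has_pdf_header(pdf_path):
    """Return True if the file starts with the '%PDF-' magic bytes.

    Hypothetical helper (not part of pypdf). Some readers tolerate junk
    before the header, so treat a False result as a hint, not proof.
    """
    path = Path(pdf_path)
    if not path.is_file():
        return False
    with path.open('rb') as f:
        return f.read(5) == b'%PDF-'
```

Because it reads only five bytes, this check can run over a large directory before any full parse is attempted.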

Troubleshooting

Common Issues

1. Encrypted PDF Error

python
# Problem: Cannot read encrypted PDF
# Solution: Decrypt first
reader = PdfReader("encrypted.pdf")
if reader.is_encrypted:
    reader.decrypt("password")  # Provide password

2. Text Extraction Returns Empty

python
# Problem: extract_text() returns empty string
# Solution: PDF may be image-based (scanned)
# For scanned PDFs, use OCR:
#   pip install pdf2image pytesseract
# Then use pytesseract to OCR the images
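Before reaching for OCR, it can help to find out which pages actually came back empty. The following is a minimal sketch under stated assumptions: `pages_needing_ocr` and its `min_chars` threshold are hypothetical, and the function works on plain strings (as returned by `page.extract_text()`) so that it runs without pypdf installed.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return indices of pages whose extracted text is too short to trust.

    page_texts: one string per page, as returned by page.extract_text().
    min_chars: hypothetical threshold; tune it for your documents.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len((text or "").strip()) < min_chars
    ]


# With pypdf installed, the input could be built like this:
#   texts = [page.extract_text() for page in PdfReader("scan.pdf").pages]
sample = ["Quarterly report for 2024, revenue summary.", "", "   \n"]
print(pages_needing_ocr(sample))  # prints [1, 2]
```

Only the flagged pages then need to go through the slower pdf2image and pytesseract route.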

3. Memory Error with Large PDFs

python
# Problem: Memory error with large files
# Solution: Process incrementally
def split_large_pdf(input_path, output_dir, max_pages=100):
    reader = PdfReader(input_path)
    total = len(reader.pages)

    for start in range(0, total, max_pages):
        writer = PdfWriter()
        end = min(start + max_pages, total)

        for i in range(start, end):
            writer.add_page(reader.pages[i])

        writer.write(f"{output_dir}/part_{start//max_pages + 1}.pdf")
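The chunk arithmetic in `split_large_pdf` can be checked on its own, without opening any PDF. The sketch below isolates it in a hypothetical helper, `chunk_ranges` (not part of pypdf), which yields the part number and the half-open page range for each output file.

```python
def chunk_ranges(total_pages, max_pages=100):
    """Yield (part_number, start, end) tuples; end is exclusive."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    for start in range(0, total_pages, max_pages):
        end = min(start + max_pages, total_pages)
        yield start // max_pages + 1, start, end


for part, start, end in chunk_ranges(250, max_pages=100):
    print(f"part_{part}.pdf: pages {start}..{end - 1}")
# prints:
# part_1.pdf: pages 0..99
# part_2.pdf: pages 100..199
# part_3.pdf: pages 200..249
```

The last part is simply shorter when the page count is not a multiple of `max_pages`, which matches the `min(start + max_pages, total)` bound in the original function.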

Version History

1.0.0 (2026-01-17)

  • Initial skill creation
  • Core capabilities documentation
  • 6 complete code examples
  • Batch processing patterns
  • Encryption and form handling

Resources

Related Skills

  • reportlab - PDF creation from scratch
  • python-docx - Word document handling
  • pillow - Image processing for PDF images
  • pdf2image - Convert PDF pages to images

This skill provides comprehensive patterns for PDF manipulation refined from production document processing systems.