DOCX Processing Skill

Overview

This skill enables comprehensive Word document operations through multiple specialized workflows for reading, creating, and editing documents.

Quick Start

```python
from docx import Document

# Read existing document
doc = Document("document.docx")
for para in doc.paragraphs:
    print(para.text)

# Create new document
doc = Document()
doc.add_heading("My Title", level=0)
doc.add_paragraph("Hello, World!")
doc.save("output.docx")
```

When to Use

  • Extracting text and tables from Word documents
  • Creating professional documents programmatically
  • Generating reports from templates
  • Bulk document processing and modification
  • Legal document redlining with tracked changes
  • Converting Word documents to other formats
  • Adding headers, footers, and page numbers
  • Inserting images and tables into documents

Core Capabilities

  • Reading & Analysis: Extract text via pandoc or access raw XML for comments, formatting, and metadata
  • Document Creation: Use python-docx to build new documents from scratch
  • Document Editing: Employ OOXML manipulation for complex modifications
  • Tracked Changes: Implement redlining workflow for professional document editing
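
As one illustration of the raw XML route mentioned above, the sketch below reads reviewer comments straight from the package with the standard library. It assumes comments are stored in the conventional `word/comments.xml` part; the function name `extract_comments` is hypothetical and not part of python-docx or pandoc.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by the w: prefix
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_comments(docx_path):
    """Return (author, text) pairs from word/comments.xml, or [] if the part is absent."""
    with zipfile.ZipFile(docx_path) as package:
        if "word/comments.xml" not in package.namelist():
            return []
        root = ET.fromstring(package.read("word/comments.xml"))

    comments = []
    for comment in root.iter(f"{{{W_NS}}}comment"):
        author = comment.get(f"{{{W_NS}}}author", "")
        # A comment may hold several runs; join the text of every w:t element.
        text = "".join(t.text or "" for t in comment.iter(f"{{{W_NS}}}t"))
        comments.append((author, text))
    return comments
```

Calling `extract_comments("document.docx")` on a file without comments simply returns an empty list.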

Reading Documents

Extract Text with Pandoc

```bash
pandoc document.docx -t plain -o output.txt
pandoc document.docx -t markdown -o output.md
```

Python Text Extraction

```python
from docx import Document

doc = Document("document.docx")
for para in doc.paragraphs:
    print(para.text)
```

Extract Tables

```python
from docx import Document

doc = Document("document.docx")
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text, end="\t")
        print()
```
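
When extracted tables need to go somewhere other than the console, the same nested loop can feed the standard `csv` module. The helpers below are a minimal sketch; `table_to_rows` and `rows_to_csv` are hypothetical names, and they rely only on the `rows`, `cells`, and `text` attributes already used above.

```python
import csv
import io


def table_to_rows(table):
    """Collect a table's cell text as a list of rows (lists of strings)."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module quotes commas and quotes as needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

For example, `rows_to_csv(table_to_rows(doc.tables[0]))` would give CSV text for the first table of a loaded document.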

Creating Documents

Basic Document Creation

```python
from docx import Document
from docx.shared import Pt, Inches

doc = Document()

# Add heading
doc.add_heading("Document Title", level=0)

# Add paragraph with formatting
para = doc.add_paragraph()
run = para.add_run("Bold text")
run.bold = True
para.add_run(" and ")
run = para.add_run("italic text")
run.italic = True

# Add styled paragraph
doc.add_paragraph("Normal paragraph text.")

doc.save("output.docx")
```

Add Tables

```python
from docx import Document
from docx.shared import Inches

doc = Document()

table = doc.add_table(rows=3, cols=3)
table.style = 'Table Grid'

# Fill cells
for i, row in enumerate(table.rows):
    for j, cell in enumerate(row.cells):
        cell.text = f"Row {i+1}, Col {j+1}"

doc.save("output.docx")
```

Add Images

```python
from docx import Document
from docx.shared import Inches

doc = Document()
doc.add_heading("Document with Image", level=0)
doc.add_picture("image.png", width=Inches(4))
doc.add_paragraph("Caption for the image.")

doc.save("output.docx")
```

Advanced Formatting

```python
from docx import Document
from docx.shared import Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH

doc = Document()

# Custom heading
heading = doc.add_heading(level=1)
run = heading.add_run("Custom Styled Heading")
run.font.size = Pt(24)
run.font.color.rgb = RGBColor(0x2E, 0x74, 0xB5)

# Centered paragraph
para = doc.add_paragraph("Centered text")
para.alignment = WD_ALIGN_PARAGRAPH.CENTER

# Bulleted list
doc.add_paragraph("First item", style='List Bullet')
doc.add_paragraph("Second item", style='List Bullet')
doc.add_paragraph("Third item", style='List Bullet')

doc.save("output.docx")
```

Editing Documents

Modify Existing Document

```python
from docx import Document

doc = Document("existing.docx")

# Replace text in paragraphs.
# Note: the replacement happens run by run, so "old text" is only replaced
# when it sits entirely inside a single run.
for para in doc.paragraphs:
    if "old text" in para.text:
        for run in para.runs:
            run.text = run.text.replace("old text", "new text")

doc.save("modified.docx")
```
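
Word often splits a phrase across several runs, in which case a per-run `replace` finds nothing even though the paragraph text contains the phrase. The helper below is one possible sketch for that case: it writes the replacement into the run where the match starts and trims the matched characters out of the following runs, so each run keeps its own formatting. The name `replace_across_runs` is hypothetical; the function only relies on `paragraph.runs` and a writable `run.text`.

```python
def replace_across_runs(paragraph, old, new):
    """Replace each occurrence of `old` in a paragraph, even when it spans runs.

    Returns the number of replacements made.
    """
    if not old:
        return 0
    runs = list(paragraph.runs)
    count = 0
    search_from = 0
    while True:
        texts = [run.text for run in runs]
        start = "".join(texts).find(old, search_from)
        if start == -1:
            return count
        end = start + len(old)

        position = 0
        first = True
        for run, text in zip(runs, texts):
            run_start, run_end = position, position + len(text)
            position = run_end
            if run_end <= start or run_start >= end:
                continue  # this run does not overlap the match
            cut_from = max(start, run_start) - run_start
            cut_to = min(end, run_end) - run_start
            # Only the first overlapping run receives the replacement text.
            run.text = text[:cut_from] + (new if first else "") + text[cut_to:]
            first = False

        count += 1
        search_from = start + len(new)
```

A typical call would be `replace_across_runs(para, "old text", "new text")` inside the paragraph loop. The replacement takes on the formatting of the run where the match began.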

Add Content to Existing Document

```python
from docx import Document

doc = Document("existing.docx")

# Add new paragraph at end
doc.add_paragraph("New paragraph added.")

# Add new section
doc.add_page_break()
doc.add_heading("New Section", level=1)
doc.add_paragraph("Content for new section.")

doc.save("modified.docx")
```

Redlining Workflow

For legal, academic, or government documents that require tracked changes:

Step 1: Convert to Markdown

```bash
pandoc document.docx -t markdown -o document.md
```

Step 2: Plan Changes

Document the specific changes needed before implementation.

Step 3: Apply Changes in Batches

Apply 3-10 related modifications at a time, preserving formatting.

Step 4: Validate Changes

Ensure original formatting and unchanged content are preserved.

Key Principle

When changing text such as "30 days" to "60 days", mark only the changed portion ("30" to "60") as a tracked deletion and insertion, and leave the unchanged runs in place with their original RSID attributes.
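
To make the key principle concrete, the sketch below builds the WordprocessingML fragment for a minimal tracked change: the differing words go into `w:del` and `w:ins`, and the unchanged text stays in ordinary runs. It is a simplified illustration, not a full redlining implementation. The helper names are hypothetical, the author and date are supplied by the caller, and the run properties (`w:rPr`) and RSID attributes that a real edit would copy from the original run are left out.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def split_change(old, new):
    """Return (prefix, deleted, inserted, suffix), compared word by word."""
    a, b = re.split(r"(\s+)", old), re.split(r"(\s+)", new)
    p = 0
    while p < min(len(a), len(b)) and a[p] == b[p]:
        p += 1
    s = 0
    while s < min(len(a), len(b)) - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1
    return (
        "".join(a[:p]),
        "".join(a[p:len(a) - s]),
        "".join(b[p:len(b) - s]),
        "".join(a[len(a) - s:]),
    )


def _run_xml(text, tag="w:t"):
    # xml:space="preserve" keeps leading and trailing spaces in the run text.
    return f'<w:r><{tag} xml:space="preserve">{escape(text)}</{tag}></w:r>'


def tracked_change_xml(old, new, author, date, first_id=1):
    """Build runs where only the changed words are wrapped in w:del / w:ins."""
    prefix, deleted, inserted, suffix = split_change(old, new)
    attrs = f"w:author={quoteattr(author)} w:date={quoteattr(date)}"
    parts = []
    if prefix:
        parts.append(_run_xml(prefix))
    if deleted:
        parts.append(
            f'<w:del w:id="{first_id}" {attrs}>{_run_xml(deleted, "w:delText")}</w:del>'
        )
    if inserted:
        parts.append(f'<w:ins w:id="{first_id + 1}" {attrs}>{_run_xml(inserted)}</w:ins>')
    if suffix:
        parts.append(_run_xml(suffix))
    return "".join(parts)
```

For "30 days" to "60 days" this yields a deletion of "30", an insertion of "60", and an untouched run containing " days".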

Extract Metadata

```python
from docx import Document

doc = Document("document.docx")
props = doc.core_properties

print(f"Title: {props.title}")
print(f"Author: {props.author}")
print(f"Created: {props.created}")
print(f"Modified: {props.modified}")
print(f"Last Modified By: {props.last_modified_by}")
```

Working with Headers/Footers

```python
from docx import Document

doc = Document()

# Add header
section = doc.sections[0]
header = section.header
header_para = header.paragraphs[0]
header_para.text = "Document Header"

# Add footer
footer = section.footer
footer_para = footer.paragraphs[0]
footer_para.text = "Page Footer"

doc.save("with_header_footer.docx")
```

Execution Checklist

  • Verify input document exists and is valid .docx
  • Check if document is password-protected
  • Backup original before modifications
  • Preserve existing styles and formatting
  • Validate output document opens correctly
  • Check for broken hyperlinks or images
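
The first three checklist items can be partly automated with the standard library. The sketch below is one possible pre-flight check, with hypothetical helper names `preflight` and `backup`. It assumes the main document part sits at the conventional `word/document.xml` path, and it relies on the fact that both legacy `.doc` files and password-encrypted Office files are OLE compound files rather than ZIP packages, so the two cannot be told apart from the signature alone.

```python
import shutil
import zipfile
from pathlib import Path

# Signature of an OLE compound file (legacy .doc or an encrypted Office file)
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def preflight(path):
    """Classify an input file as 'ok', 'missing', 'ole-container', 'not-a-zip' or 'not-a-docx'."""
    path = Path(path)
    if not path.is_file():
        return "missing"
    with path.open("rb") as handle:
        if handle.read(8) == OLE_MAGIC:
            return "ole-container"
    if not zipfile.is_zipfile(path):
        return "not-a-zip"
    with zipfile.ZipFile(path) as package:
        if "word/document.xml" not in package.namelist():
            return "not-a-docx"
    return "ok"


def backup(path):
    """Copy the file next to itself with a .bak suffix and return the new path."""
    path = Path(path)
    target = path.with_name(path.name + ".bak")
    shutil.copy2(path, target)
    return target
```

A caller might proceed only when `preflight(path) == "ok"` and call `backup(path)` before saving any modification.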

Error Handling

Common Errors

Error: PackageNotFoundError
  • Cause: File is missing at the given path or is not a valid .docx (possibly a legacy .doc)
  • Solution: Convert to .docx using LibreOffice or save as .docx from Word
Error: KeyError on style
  • Cause: Requested style doesn't exist in document
  • Solution: Use built-in styles or check available styles first
Error: Permission denied
  • Cause: File is open in another application
  • Solution: Close the file in Word/LibreOffice
Error: Encoding issues
  • Cause: Special characters in content
  • Solution: Ensure UTF-8 encoding, handle special chars
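
For the style `KeyError`, checking the available styles first can be wrapped in a small helper. This is a sketch under the assumption that `document.styles` is iterable and that each style has a `name` attribute, as in python-docx; `resolve_style` itself is a hypothetical name.

```python
def resolve_style(document, preferred, fallback="Normal"):
    """Return `preferred` if the document defines it, otherwise `fallback`.

    Raises KeyError with the list of available names if neither style exists.
    """
    available = {style.name for style in document.styles}
    if preferred in available:
        return preferred
    if fallback in available:
        return fallback
    raise KeyError(
        f"neither {preferred!r} nor {fallback!r} found; available: {sorted(available)}"
    )
```

It could be used as `doc.add_paragraph("First item", style=resolve_style(doc, "List Bullet"))`, which degrades to the fallback style instead of failing on templates that lack the list style.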

Metrics

Typical values:
  • Document creation: ~100 docs/second
  • Text extraction: ~500 pages/second
  • Table extraction: ~50 tables/second
  • Memory usage: ~5MB per document
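
Throughput figures like these vary with hardware and document size, so it can be worth measuring them locally. The harness below is a minimal sketch; `measure_rate` is a hypothetical helper that times any zero-argument callable, for example a function that creates and saves one document.

```python
import time


def measure_rate(operation, repeats=100):
    """Run `operation` `repeats` times and return the number of operations per second."""
    start = time.perf_counter()
    for _ in range(repeats):
        operation()
    elapsed = time.perf_counter() - start
    return repeats / elapsed if elapsed > 0 else float("inf")
```

Passing a callable that builds a small document and saves it to an in-memory buffer would give a rough local figure for document creation.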

Dependencies

```bash
pip install python-docx
```

System tools:
  • Pandoc (for format conversion)
  • LibreOffice (for PDF conversion)
Version History

  • 1.1.0 (2026-01-02): Added Quick Start, When to Use, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
  • 1.0.0 (2024-10-15): Initial release with python-docx, pandoc integration, redlining workflow