DOCX Processing Skill

Overview

This skill covers reading, creating, and editing Word (.docx) documents: pandoc and python-docx handle routine extraction and authoring, while direct OOXML manipulation handles complex edits and tracked changes.

Quick Start

python

from docx import Document

# Read existing document
doc = Document("document.docx")
for para in doc.paragraphs:
    print(para.text)

# Create new document
doc = Document()
doc.add_heading("My Title", level=0)
doc.add_paragraph("Hello, World!")
doc.save("output.docx")

When to Use

Extracting text and tables from Word documents
Creating professional documents programmatically
Generating reports from templates
Bulk document processing and modification
Legal document redlining with tracked changes
Converting Word documents to other formats
Adding headers, footers, and page numbers
Inserting images and tables into documents

Core Capabilities

Reading & Analysis: Extract text via pandoc or access raw XML for comments, formatting, and metadata
Document Creation: Use python-docx to build new documents from scratch
Document Editing: Employ OOXML manipulation for complex modifications
Tracked Changes: Implement redlining workflow for professional document editing
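
As a rough illustration of the raw-XML route listed above, the sketch below reads comment text straight from the .docx package using only the standard library. It assumes the comments part sits at the conventional path word/comments.xml; `read_comments` and the in-memory demo package are hypothetical names introduced for this example, not part of any library.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_comments(docx_file):
    """Return (author, text) pairs from word/comments.xml, or [] if the part is absent."""
    with zipfile.ZipFile(docx_file) as zf:
        # Assumption: comments live at the conventional part name.
        if "word/comments.xml" not in zf.namelist():
            return []
        root = ET.fromstring(zf.read("word/comments.xml"))
    comments = []
    for comment in root.iter(f"{{{W_NS}}}comment"):
        # Join every text node inside the comment, across paragraphs and runs.
        text = "".join(t.text or "" for t in comment.iter(f"{{{W_NS}}}t"))
        comments.append((comment.get(f"{{{W_NS}}}author"), text))
    return comments


# Demo on a tiny in-memory package so the sketch runs without a real file.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr(
        "word/comments.xml",
        f'<w:comments xmlns:w="{W_NS}">'
        '<w:comment w:id="0" w:author="Ann">'
        "<w:p><w:r><w:t>Check this clause</w:t></w:r></w:p>"
        "</w:comment></w:comments>",
    )
demo.seek(0)
demo_comments = read_comments(demo)
```

With a real file, `read_comments("document.docx")` would be the equivalent call; a document without comments simply yields an empty list.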

Reading Documents

Extract Text with Pandoc

bash

pandoc document.docx -t plain -o output.txt
pandoc document.docx -t markdown -o output.md

Python Text Extraction

python

from docx import Document

doc = Document("document.docx")
for para in doc.paragraphs:
    print(para.text)

Extract Tables

python

from docx import Document

doc = Document("document.docx")
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text, end="\t")
        print()

Creating Documents

Basic Document Creation

python

from docx import Document
from docx.shared import Pt, Inches

doc = Document()

# Add heading
doc.add_heading("Document Title", level=0)

# Add paragraph with formatting
para = doc.add_paragraph()
run = para.add_run("Bold text")
run.bold = True

para.add_run(" and ")
run = para.add_run("italic text")
run.italic = True

# Add styled paragraph
doc.add_paragraph("Normal paragraph text.")

doc.save("output.docx")

Add Tables

python

from docx import Document
from docx.shared import Inches

doc = Document()

table = doc.add_table(rows=3, cols=3)
table.style = 'Table Grid'

# Fill cells
for i, row in enumerate(table.rows):
    for j, cell in enumerate(row.cells):
        cell.text = f"Row {i+1}, Col {j+1}"

doc.save("output.docx")

Add Images

python

from docx import Document
from docx.shared import Inches

doc = Document()
doc.add_heading("Document with Image", level=0)
doc.add_picture("image.png", width=Inches(4))
doc.add_paragraph("Caption for the image.")

doc.save("output.docx")

Advanced Formatting

python

from docx import Document
from docx.shared import Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH

doc = Document()

# Custom heading
heading = doc.add_heading(level=1)
run = heading.add_run("Custom Styled Heading")
run.font.size = Pt(24)
run.font.color.rgb = RGBColor(0x2E, 0x74, 0xB5)

# Centered paragraph
para = doc.add_paragraph("Centered text")
para.alignment = WD_ALIGN_PARAGRAPH.CENTER

# Bulleted list
doc.add_paragraph("First item", style='List Bullet')
doc.add_paragraph("Second item", style='List Bullet')
doc.add_paragraph("Third item", style='List Bullet')

doc.save("output.docx")

Editing Documents

Modify Existing Document

python

from docx import Document

doc = Document("existing.docx")

# Replace text in paragraphs.
# Note: this only replaces matches that fall entirely within a single run;
# text that Word has split across several runs is left unchanged.
for para in doc.paragraphs:
    if "old text" in para.text:
        for run in para.runs:
            run.text = run.text.replace("old text", "new text")

doc.save("modified.docx")
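
Word often splits a phrase across several runs, in which case a per-run `replace` finds nothing. The following is a minimal sketch of one way to handle that, written against plain lists of strings so it can be run on its own. `replace_across_runs` is a hypothetical helper, and putting the whole replacement into the run where the match starts is a design choice of this sketch, not something the skill prescribes.

```python
def replace_across_runs(run_texts, old, new):
    """Replace the first occurrence of `old` in the concatenated run texts.

    The replacement goes into the run where the match starts; matched
    characters are removed from any later runs. The returned list has the
    same length as the input, so it can be written back run by run.
    """
    if not old:
        return list(run_texts)
    full = "".join(run_texts)
    start = full.find(old)
    if start == -1:
        return list(run_texts)
    end = start + len(old)

    result = []
    pos = 0  # offset of the current run within the concatenated text
    for text in run_texts:
        run_start, run_end = pos, pos + len(text)
        pos = run_end
        if run_end <= start or run_start >= end:
            result.append(text)  # run lies entirely outside the match
            continue
        head = text[: max(start - run_start, 0)]  # part before the match
        tail = text[max(end - run_start, 0):]     # part after the match
        middle = new if run_start <= start < run_end else ""
        result.append(head + middle + tail)
    return result


runs = ["Pay within ", "30", " da", "ys."]
updated = replace_across_runs(runs, "30 days", "60 days")
```

With python-docx, the input would be `[run.text for run in para.runs]`, and each returned string would be assigned back to the matching `run.text`, so every run keeps its own formatting.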

Add Content to Existing Document

python

from docx import Document

doc = Document("existing.docx")

# Add new paragraph at end
doc.add_paragraph("New paragraph added.")

# Add new section (starts on a new page)
doc.add_page_break()
doc.add_heading("New Section", level=1)
doc.add_paragraph("Content for new section.")

doc.save("modified.docx")

Redlining Workflow

For legal, academic, or government documents requiring tracked changes:

Step 1: Convert to Markdown

bash

pandoc document.docx -t markdown -o document.md

Step 2: Plan Changes

Document the specific changes needed before implementation.

Step 3: Apply Changes in Batches

Apply 3-10 related modifications at a time, preserving formatting.

Step 4: Validate Changes

Ensure original formatting and unchanged content are preserved.
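
One possible way to check that unchanged content survived is to rebuild the "original" and "revised" text from the redlined XML, then compare the original view against the source document. The sketch below does this for a single paragraph element. `text_views` is a hypothetical helper; it only understands `w:ins` and `w:del` and ignores moves and other revision types, which is an assumption of this example.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(tag):
    """Qualify a tag name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def text_views(element):
    """Return (original_text, revised_text) for an element with tracked changes."""
    original, revised = [], []

    def walk(node, inside_ins):
        if node.tag == w("ins"):
            inside_ins = True
        if node.tag == w("t"):
            revised.append(node.text or "")       # visible after accepting changes
            if not inside_ins:
                original.append(node.text or "")  # also present before the edit
        elif node.tag == w("delText"):
            original.append(node.text or "")      # deleted text existed only before
        for child in node:
            walk(child, inside_ins)

    walk(element, False)
    return "".join(original), "".join(revised)


paragraph = ET.fromstring(
    f'<w:p xmlns:w="{W_NS}">'
    '<w:r><w:t xml:space="preserve">Pay within </w:t></w:r>'
    '<w:del w:id="1" w:author="Editor"><w:r><w:delText>30</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="Editor"><w:r><w:t>60</w:t></w:r></w:ins>'
    '<w:r><w:t xml:space="preserve"> days.</w:t></w:r>'
    "</w:p>"
)
original_text, revised_text = text_views(paragraph)
```

If `original_text` differs from the paragraph text in the untouched source file, something other than the intended change was altered.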

Key Principle

When modifying text like "30 days" to "60 days", only mark the changed portion while preserving unchanged runs with their original RSID attributes.
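
The principle can be sketched at the XML level as follows. This is an illustrative example, not the skill's actual implementation: `track_replace_in_run` is a hypothetical helper that assumes the match sits inside a single run with one `w:t` child, and that the caller supplies revision ids not already used in the document. Unchanged text keeps a copy of the original run, including its `w:rsidR` attribute; dropping the rsid attributes on the inserted run is a choice made for this sketch.

```python
import copy
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("w", W_NS)


def w(tag):
    """Qualify a tag name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def track_replace_in_run(run, old, new, author, date, first_id):
    """Split `run` around the first `old` and return the replacement elements."""
    text_node = run.find(w("t"))
    if not old or text_node is None or old not in (text_node.text or ""):
        return [run]  # nothing to change in this run
    text = text_node.text
    start = text.index(old)
    before, after = text[:start], text[start + len(old):]

    def clone(content, tag="t", keep_attrs=True):
        piece = copy.deepcopy(run)  # copies w:rPr formatting and rsid attributes
        if not keep_attrs:
            piece.attrib.clear()    # sketch assumption: new text gets no old rsid
        node = piece.find(w("t"))
        node.tag = w(tag)
        node.text = content
        node.set(XML_SPACE, "preserve")  # keep leading and trailing spaces
        return piece

    def revision(tag, rev_id, child):
        attrs = {w("id"): str(rev_id), w("author"): author, w("date"): date}
        mark = ET.Element(w(tag), attrs)
        mark.append(child)
        return mark

    pieces = []
    if before:
        pieces.append(clone(before))
    pieces.append(revision("del", first_id, clone(old, tag="delText")))
    pieces.append(revision("ins", first_id + 1, clone(new, keep_attrs=False)))
    if after:
        pieces.append(clone(after))
    return pieces


run = ET.fromstring(
    f'<w:r xmlns:w="{W_NS}" w:rsidR="00AB12CD">'
    "<w:t>within 30 days of receipt</w:t></w:r>"
)
pieces = track_replace_in_run(
    run, "30", "60", author="Editor", date="2024-01-01T00:00:00Z", first_id=1
)
```

Only "30" and "60" end up inside `w:del` and `w:ins`; "within " and " days of receipt" remain ordinary runs carrying the original rsid.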

Extract Metadata

python

from docx import Document

doc = Document("document.docx")
props = doc.core_properties

print(f"Title: {props.title}")
print(f"Author: {props.author}")
print(f"Created: {props.created}")
print(f"Modified: {props.modified}")
print(f"Last Modified By: {props.last_modified_by}")

Working with Headers/Footers

python

from docx import Document

doc = Document()

# Add header
section = doc.sections[0]
header = section.header
header_para = header.paragraphs[0]
header_para.text = "Document Header"

# Add footer
footer = section.footer
footer_para = footer.paragraphs[0]
footer_para.text = "Page Footer"

doc.save("with_header_footer.docx")

Execution Checklist

Verify input document exists and is valid .docx
Check if document is password-protected
Backup original before modifications
Preserve existing styles and formatting
Validate output document opens correctly
Check for broken hyperlinks or images
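
The first three checklist items could be automated along these lines, using only the standard library. This is a sketch under stated assumptions: `classify_docx` and `backup` are hypothetical helpers, the main part is assumed to sit at the conventional path word/document.xml, and an OLE container is treated as "legacy .doc or password-protected" because encrypted Office files are stored in that container format rather than as a plain ZIP.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

# Signature of OLE compound files (legacy .doc, and encrypted OOXML packages).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def classify_docx(path):
    """Return 'missing', 'ole', 'not_docx', or 'ok' for the file at `path`."""
    path = Path(path)
    if not path.is_file():
        return "missing"
    with path.open("rb") as fh:
        if fh.read(8) == OLE_MAGIC:
            return "ole"  # legacy .doc or password-protected file
    if not zipfile.is_zipfile(path):
        return "not_docx"
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
    # Assumption: a usable .docx has these two conventional parts.
    if "[Content_Types].xml" not in names or "word/document.xml" not in names:
        return "not_docx"
    return "ok"


def backup(path):
    """Copy `path` next to itself with a .bak suffix and return the copy's path."""
    path = Path(path)
    target = path.with_name(path.name + ".bak")
    shutil.copy2(path, target)
    return target


# Demo with throwaway files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.docx"
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document/>")
    legacy = Path(tmp) / "legacy.doc"
    legacy.write_bytes(OLE_MAGIC + b"\x00" * 16)
    results = {
        "good": classify_docx(good),
        "legacy": classify_docx(legacy),
        "missing": classify_docx(Path(tmp) / "nope.docx"),
    }
    backup_name = backup(good).name
    backup_exists = (Path(tmp) / backup_name).is_file()
```

Running the classification before opening a file with python-docx lets a batch job skip or report bad inputs instead of failing partway through.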

Error Handling

Common Errors

Error: PackageNotFoundError

Cause: File does not exist at the given path, or is not a valid .docx package (possibly a legacy .doc)
Solution: Convert to .docx using LibreOffice or save as .docx from Word

Error: KeyError on style

Cause: Requested style doesn't exist in document
Solution: Use built-in styles or check available styles first

Error: Permission denied

Cause: File is open in another application
Solution: Close the file in Word/LibreOffice

Error: Encoding issues

Cause: Special characters in content
Solution: Ensure UTF-8 encoding, handle special chars

Metrics

Metric	Typical Value
Document creation	~100 docs/second
Text extraction	~500 pages/second
Table extraction	~50 tables/second
Memory usage	~5MB per document

Dependencies

bash

pip install python-docx

System tools:

Pandoc (for format conversion)
LibreOffice (for PDF conversion)

Version History

1.1.0 (2026-01-02): Added Quick Start, When to Use, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
1.0.0 (2024-10-15): Initial release with python-docx, pandoc integration, redlining workflow
