docx
DOCX Processing Skill
Overview
This skill enables comprehensive Word document operations through multiple specialized workflows for reading, creating, and editing documents.
Quick Start
python
from docx import Document

# Read existing document
doc = Document("document.docx")
for para in doc.paragraphs:
    print(para.text)

# Create new document
doc = Document()
doc.add_heading("My Title", level=0)
doc.add_paragraph("Hello, World!")
doc.save("output.docx")

When to Use
- Extracting text and tables from Word documents
- Creating professional documents programmatically
- Generating reports from templates
- Bulk document processing and modification
- Legal document redlining with tracked changes
- Converting Word documents to other formats
- Adding headers, footers, and page numbers
- Inserting images and tables into documents
Core Capabilities
- Reading & Analysis: Extract text via pandoc or access raw XML for comments, formatting, and metadata
- Document Creation: Use python-docx to build new documents from scratch
- Document Editing: Employ OOXML manipulation for complex modifications
- Tracked Changes: Implement redlining workflow for professional document editing
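
The raw XML route listed above can be sketched with the standard library alone. A .docx file is a ZIP archive, and comments, when present, conventionally live in the `word/comments.xml` part. The helper name `read_comments` is hypothetical, and the sketch assumes the conventional part name and the standard WordprocessingML namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by document.xml and comments.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_comments(docx_file):
    """Return a list of (author, text) tuples for the comments in a .docx.

    `docx_file` may be a path or a binary file-like object.
    Returns an empty list when the package has no comments part.
    """
    with zipfile.ZipFile(docx_file) as package:
        if "word/comments.xml" not in package.namelist():
            return []
        root = ET.fromstring(package.read("word/comments.xml"))

    comments = []
    for comment in root.iter(f"{{{W_NS}}}comment"):
        author = comment.get(f"{{{W_NS}}}author", "")
        # A comment body is made of runs; join the text of every w:t element.
        text = "".join(t.text or "" for t in comment.iter(f"{{{W_NS}}}t"))
        comments.append((author, text))
    return comments
```

Calling `read_comments("document.docx")` on a document without comments simply returns an empty list, so the helper is safe to run in bulk processing.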
Reading Documents
Extract Text with Pandoc
bash
pandoc document.docx -t plain -o output.txt
pandoc document.docx -t markdown -o output.md
Python Text Extraction
python
from docx import Document

doc = Document("document.docx")
for para in doc.paragraphs:
    print(para.text)
Extract Tables
python
from docx import Document

doc = Document("document.docx")
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text, end="\t")
        print()  # end the line after each row
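
When the extracted tables feed another tool, it can help to collect the cells into rows rather than print them. The following is a minimal sketch under that assumption; `table_to_rows` and `rows_to_csv` are hypothetical helper names, and they work with any object that exposes `.rows`, `.cells`, and `.text` the way python-docx tables do.

```python
import csv
import io


def table_to_rows(table):
    """Convert a table object into a list of rows of stripped cell strings."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module quotes commas and quotes."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Usage with python-docx (requires a real document on disk):
# from docx import Document
# for table in Document("document.docx").tables:
#     print(rows_to_csv(table_to_rows(table)))
```

Merged cells may appear more than once in a row when read this way, so the output is worth checking on documents with complex table layouts.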
Creating Documents
Basic Document Creation
python
from docx import Document
from docx.shared import Pt, Inches

doc = Document()

# Add heading
doc.add_heading("Document Title", level=0)

# Add paragraph with formatting
para = doc.add_paragraph()
run = para.add_run("Bold text")
run.bold = True
para.add_run(" and ")
run = para.add_run("italic text")
run.italic = True

# Add styled paragraph
doc.add_paragraph("Normal paragraph text.")
doc.save("output.docx")
Add Tables
python
from docx import Document
from docx.shared import Inches

doc = Document()
table = doc.add_table(rows=3, cols=3)
table.style = 'Table Grid'

# Fill cells
for i, row in enumerate(table.rows):
    for j, cell in enumerate(row.cells):
        cell.text = f"Row {i+1}, Col {j+1}"
doc.save("output.docx")
Add Images
python
from docx import Document
from docx.shared import Inches

doc = Document()
doc.add_heading("Document with Image", level=0)
doc.add_picture("image.png", width=Inches(4))
doc.add_paragraph("Caption for the image.")
doc.save("output.docx")
Advanced Formatting
python
from docx import Document
from docx.shared import Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH

doc = Document()

# Custom heading
heading = doc.add_heading(level=1)
run = heading.add_run("Custom Styled Heading")
run.font.size = Pt(24)
run.font.color.rgb = RGBColor(0x2E, 0x74, 0xB5)

# Centered paragraph
para = doc.add_paragraph("Centered text")
para.alignment = WD_ALIGN_PARAGRAPH.CENTER

# Bulleted list
doc.add_paragraph("First item", style='List Bullet')
doc.add_paragraph("Second item", style='List Bullet')
doc.add_paragraph("Third item", style='List Bullet')
doc.save("output.docx")
Editing Documents
Modify Existing Document
python
from docx import Document

doc = Document("existing.docx")

# Replace text in paragraphs.
# Note: only matches that sit entirely inside a single run are replaced;
# a phrase split across several runs is left unchanged.
for para in doc.paragraphs:
    if "old text" in para.text:
        for run in para.runs:
            run.text = run.text.replace("old text", "new text")
doc.save("modified.docx")
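
Word often splits a visible phrase across several runs (for example after spell-checking or partial formatting), so a per-run `replace` can miss a match that `para.text` reports. One possible workaround, sketched here on plain strings, is to compute new run texts from the concatenated text. `replace_across_runs` is a hypothetical helper, not part of python-docx; it places the replacement in the run where the match begins so that it inherits that run's formatting.

```python
def replace_across_runs(run_texts, old, new):
    """Replace every occurrence of `old` in the concatenated run texts.

    Returns a new list with the same number of runs. The replacement is put
    in the run where the match starts; matched characters are removed from
    the following runs.
    """
    if not old:
        return list(run_texts)
    texts = list(run_texts)
    search_from = 0
    while True:
        start = "".join(texts).find(old, search_from)
        if start == -1:
            return texts
        end = start + len(old)
        updated, offset = [], 0
        for text in texts:
            run_start, run_end = offset, offset + len(text)
            offset = run_end
            if run_end <= start or run_start >= end:
                updated.append(text)  # run lies entirely outside the match
                continue
            before = text[: max(start - run_start, 0)]
            after = text[end - run_start:]
            # Only the run in which the match begins receives the replacement.
            middle = new if run_start <= start else ""
            updated.append(before + middle + after)
        texts = updated
        search_from = start + len(new)  # continue after the inserted text


# Applying the result to a python-docx paragraph (sketch):
# new_texts = replace_across_runs([r.text for r in para.runs], "old text", "new text")
# for run, text in zip(para.runs, new_texts):
#     run.text = text
```

Keeping the number of runs unchanged means the run-level formatting of untouched text stays exactly as it was.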
Add Content to Existing Document
python
from docx import Document

doc = Document("existing.docx")

# Add new paragraph at end
doc.add_paragraph("New paragraph added.")

# Add new section
doc.add_page_break()
doc.add_heading("New Section", level=1)
doc.add_paragraph("Content for new section.")
doc.save("modified.docx")
Redlining Workflow
For legal, academic, or government documents requiring tracked changes:
Step 1: Convert to Markdown
bash
pandoc document.docx -t markdown -o document.md

Step 2: Plan Changes
Document the specific changes needed before implementation.
Step 3: Apply Changes in Batches
Apply 3-10 related modifications at a time, preserving formatting.
Step 4: Validate Changes
Ensure original formatting and unchanged content are preserved.
Key Principle
When changing text such as "30 days" to "60 days", mark only the changed portion ("30" deleted, "60" inserted) as a tracked change, and leave the unchanged runs in place with their original RSID attributes.
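
To make the key principle concrete, the sketch below builds the two WordprocessingML elements that mark "30" as deleted and "60" as inserted, leaving the " days" run out of the change entirely. It uses only the standard library; `tracked_replacement` is a hypothetical helper, the author and date values are placeholders, and a real edit would also copy the original run properties (`w:rPr`) into the new runs, which is not shown here.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)  # serialize with the usual "w:" prefix


def w(name):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{name}"


def tracked_replacement(old, new, author, date, first_id):
    """Build <w:del> and <w:ins> elements that redline `old` into `new`."""
    meta = {w("author"): author, w("date"): date}

    deletion = ET.Element(w("del"), {w("id"): str(first_id), **meta})
    del_run = ET.SubElement(deletion, w("r"))
    # Deleted text is stored in w:delText rather than w:t.
    ET.SubElement(del_run, w("delText")).text = old

    insertion = ET.Element(w("ins"), {w("id"): str(first_id + 1), **meta})
    ins_run = ET.SubElement(insertion, w("r"))
    ET.SubElement(ins_run, w("t")).text = new
    return deletion, insertion


# Placeholder author and timestamp, chosen only for illustration.
deletion, insertion = tracked_replacement("30", "60", "Reviewer", "2024-01-01T00:00:00Z", 1)
print(ET.tostring(deletion, encoding="unicode"))
print(ET.tostring(insertion, encoding="unicode"))
```

The two elements would replace only the run containing "30" in the paragraph; the run holding " days" stays untouched, so its RSID attributes survive.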
Extract Metadata
python
from docx import Document

doc = Document("document.docx")
props = doc.core_properties
print(f"Title: {props.title}")
print(f"Author: {props.author}")
print(f"Created: {props.created}")
print(f"Modified: {props.modified}")
print(f"Last Modified By: {props.last_modified_by}")
Working with Headers/Footers
python
from docx import Document

doc = Document()

# Add header
section = doc.sections[0]
header = section.header
header_para = header.paragraphs[0]
header_para.text = "Document Header"

# Add footer
footer = section.footer
footer_para = footer.paragraphs[0]
footer_para.text = "Page Footer"
doc.save("with_header_footer.docx")
Execution Checklist
- Verify input document exists and is valid .docx
- Check if document is password-protected
- Backup original before modifications
- Preserve existing styles and formatting
- Validate output document opens correctly
- Check for broken hyperlinks or images
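
The first three checklist items can be automated before any edit. This is a minimal sketch using only the standard library; `preflight` and the `.bak` suffix are hypothetical choices, and the check assumes the main document part sits at the conventional `word/document.xml` location.

```python
import shutil
import zipfile
from pathlib import Path


def preflight(path, backup_suffix=".bak"):
    """Check that `path` looks like a .docx package and back it up.

    Returns the backup path. Raises FileNotFoundError if the file is
    missing and ValueError if it is not a usable .docx package.
    """
    source = Path(path)
    if not source.is_file():
        raise FileNotFoundError(f"No such file: {source}")
    if not zipfile.is_zipfile(source):
        # Legacy .doc files and password-encrypted documents are not plain ZIP archives.
        raise ValueError(f"Not a ZIP-based .docx package: {source}")
    with zipfile.ZipFile(source) as package:
        if "word/document.xml" not in package.namelist():
            raise ValueError(f"No word/document.xml part in: {source}")
    backup = source.with_name(source.name + backup_suffix)
    shutil.copy2(source, backup)  # copy2 also preserves file timestamps
    return backup
```

Running `preflight("existing.docx")` before opening the file with python-docx leaves an untouched copy next to the original.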
Error Handling
Common Errors
Error: PackageNotFoundError
- Cause: File is missing or is not a valid .docx package (possibly a legacy .doc)
- Solution: Convert to .docx using LibreOffice or save as .docx from Word
Error: KeyError on style
- Cause: Requested style doesn't exist in document
- Solution: Use built-in styles or check available styles first
Error: Permission denied
- Cause: File is open in another application
- Solution: Close the file in Word/LibreOffice
Error: Encoding issues
- Cause: Special characters in content
- Solution: Ensure UTF-8 encoding, handle special chars
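
Before or after a failed open, a quick look at the first bytes of the file can narrow down which of the errors above applies. The sketch below is standard-library only and independent of python-docx; `sniff_container` and `explain_open_failure` are hypothetical names, and the hints are heuristics rather than definitive diagnoses. It relies on two container signatures: ZIP for normal .docx files, and the OLE compound file format used by legacy .doc files and, typically, by password-encrypted documents.

```python
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header (normal .docx)
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE compound file (.doc or encrypted)

HINTS = {
    "zip": "ZIP package; the problem is probably inside the package",
    "ole": "OLE container; a legacy .doc or a password-protected file",
    "unknown": "unrecognized format; not a .docx",
}


def sniff_container(header):
    """Classify a file by its first bytes: 'zip', 'ole', or 'unknown'."""
    if header.startswith(ZIP_MAGIC):
        return "zip"
    if header.startswith(OLE_MAGIC):
        return "ole"
    return "unknown"


def explain_open_failure(path):
    """Return a short hint about why `path` may fail to open as a .docx."""
    try:
        with open(path, "rb") as handle:
            return HINTS[sniff_container(handle.read(8))]
    except FileNotFoundError:
        return "file does not exist"
    except PermissionError:
        return "permission denied (is the file open in another application?)"
```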
Metrics
| Metric | Typical Value |
|---|---|
| Document creation | ~100 docs/second |
| Text extraction | ~500 pages/second |
| Table extraction | ~50 tables/second |
| Memory usage | ~5MB per document |
Dependencies
bash
pip install python-docx

System tools:
- Pandoc (for format conversion)
- LibreOffice (for PDF conversion)
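
Because the system tools are optional installs, a script may want to check for them before converting. The following sketch builds the command lines and runs them only if the executable is found on PATH. The helper names are hypothetical, and `soffice` as the LibreOffice executable name is an assumption; some systems expose it as `libreoffice` instead.

```python
import shutil
import subprocess


def pandoc_command(source, target, fmt):
    """Build a pandoc invocation like the ones shown earlier in this document."""
    return ["pandoc", source, "-t", fmt, "-o", target]


def libreoffice_pdf_command(source, out_dir):
    """Build a headless LibreOffice invocation that converts `source` to PDF."""
    # Assumption: the LibreOffice executable is called "soffice" on this system.
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, source]


def run_if_available(command):
    """Run `command` if its executable is on PATH; return True on success."""
    if shutil.which(command[0]) is None:
        return False  # tool not installed; the caller decides how to report it
    return subprocess.run(command, check=False).returncode == 0


# Example (only runs the tools if they are installed):
# run_if_available(pandoc_command("document.docx", "output.md", "markdown"))
# run_if_available(libreoffice_pdf_command("document.docx", "."))
```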
Version History
- 1.1.0 (2026-01-02): Added Quick Start, When to Use, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
- 1.0.0 (2024-10-15): Initial release with python-docx, pandoc integration, redlining workflow