DOCX Processing
Overview
Generate, manipulate, and template Word documents programmatically. This skill covers python-docx for direct document creation, docxtpl for Jinja2-based template filling, formatting control (headings, tables, images, headers/footers), mail merge operations, style management, and conversion strategies.
Apply this skill whenever Word documents need to be created, populated, or transformed through code rather than manual editing.
Multi-Phase Process
Phase 1: Requirements
- Determine if creating from scratch or filling a template
- Identify document structure (sections, headers, tables, images)
- Define data sources (JSON, CSV, database, API)
- Plan styling requirements (fonts, colors, margins)
- Determine the output format (DOCX only, or DOCX plus PDF conversion)
STOP — Do NOT begin implementation until the approach (scratch vs template) is decided and data sources are confirmed.
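The data-source step above can be sketched as a small function that merges JSON metadata and CSV line items into one context dictionary before any document code runs. This is a minimal sketch under stated assumptions: `build_context` is a hypothetical helper, and the field names (`name`, `price`, `items`, `total`) are illustrative choices rather than a required schema.

```python
import csv
import io
import json


def build_context(json_text, csv_text):
    """Merge report metadata (JSON) and line items (CSV) into one context dict."""
    context = json.loads(json_text)
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Convert numeric strings once, here, so later steps only format values.
    items = [{"name": r["name"], "price": float(r["price"])} for r in rows]
    context["items"] = items
    context["total"] = round(sum(item["price"] for item in items), 2)
    return context


# Illustrative inputs; real data would come from files, a database, or an API.
meta = '{"company_name": "Acme Corp", "report_date": "2025-03-15"}'
lines = "name,price\nWidget A,29.99\nWidget B,49.99\n"
context = build_context(meta, lines)
print(context["total"])
```

Building the context in one place makes it easy to confirm the data sources before choosing between python-docx and docxtpl, since both can consume the same dictionary.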
Phase 2: Implementation
- Set up document template or create from scratch
- Implement data binding and content generation
- Apply formatting and styles
- Add headers, footers, and page numbers
- Handle images and embedded objects
STOP — Do NOT skip to validation until all document sections are implemented.
Phase 3: Validation
- Verify document renders correctly in Word/LibreOffice
- Check formatting consistency across pages
- Validate data accuracy in generated documents
- Test with edge cases (long text, missing data, special characters)
- Verify PDF conversion if required
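Part of the validation checklist above can be automated with a smoke test that opens the generated file as a ZIP package and inspects `word/document.xml`. The sketch below uses only the standard library; `docx_text` and `find_problems` are hypothetical helper names, and the check only covers body text (not headers, footers, or visual layout), so it complements rather than replaces opening the file in Word or LibreOffice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(docx_file):
    """Return the text of each paragraph in a .docx (path or binary file object)."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs


def find_problems(docx_file, expected_values):
    """List expected values that are missing and template tags left unrendered."""
    text = "\n".join(docx_text(docx_file))
    problems = [f"missing value: {v}" for v in expected_values if v not in text]
    if "{{" in text or "{%" in text:
        problems.append("unrendered template tag found")
    return problems
```

A typical use would be calling `find_problems('output.docx', ['Acme Corp', 'Alice Johnson'])` after generation and failing the pipeline if the returned list is not empty.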
Approach Decision Table
| Scenario | Approach | Library | Why |
|---|---|---|---|
| One-off report generation | From scratch | python-docx | Full programmatic control |
| Recurring reports with fixed layout | Template | docxtpl | Design layout in Word, fill with data |
| Bulk letter generation (mail merge) | Template | docxtpl | One template, many outputs |
| Complex formatting, custom styles | From scratch | python-docx | Direct access to document model |
| Non-technical users design template | Template | docxtpl | Users edit in Word, developers bind data |
| PDF output required | Either + conversion | libreoffice / docx2pdf | Post-processing step |
python-docx Patterns
Document Creation
```python
from docx import Document
from docx.shared import Inches, Pt, Cm, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.enum.table import WD_TABLE_ALIGNMENT

doc = Document()

# Set default font
style = doc.styles['Normal']
font = style.font
font.name = 'Calibri'
font.size = Pt(11)

# Add heading (level=0 uses the built-in Title style)
doc.add_heading('Monthly Report', level=0)

# Add paragraph with formatting
para = doc.add_paragraph()
run = para.add_run('Important: ')
run.bold = True
run.font.color.rgb = RGBColor(0xCC, 0x00, 0x00)
para.add_run('This section requires attention.')

# Add table
table = doc.add_table(rows=1, cols=3, style='Light Grid Accent 1')
hdr_cells = table.rows[0].cells
hdr_cells[0].text = 'Name'
hdr_cells[1].text = 'Department'
hdr_cells[2].text = 'Revenue'
# `data` is assumed to be an iterable of (name, department, revenue) tuples
for name, dept, rev in data:
    row_cells = table.add_row().cells
    row_cells[0].text = name
    row_cells[1].text = dept
    row_cells[2].text = f'${rev:,.2f}'

# Add image
doc.add_picture('chart.png', width=Inches(5.5))

# Save
doc.save('report.docx')
```
Headers and Footers
```python
from docx.shared import Cm, Pt, RGBColor
from docx.enum.section import WD_ORIENT
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import qn
from docx.oxml import OxmlElement

# `doc` is an existing Document, e.g. the one from the Document Creation example
section = doc.sections[0]

# Page setup (A4 portrait with 2.5 cm margins)
section.page_width = Cm(21)
section.page_height = Cm(29.7)
section.left_margin = Cm(2.5)
section.right_margin = Cm(2.5)
section.top_margin = Cm(2.5)
section.bottom_margin = Cm(2.5)

# Header
header = section.header
header_para = header.paragraphs[0]
header_para.text = 'Company Name — Confidential'
header_para.alignment = WD_ALIGN_PARAGRAPH.RIGHT
# These two lines edit the paragraph's style, so every paragraph
# that uses the same style picks up the change
header_para.style.font.size = Pt(9)
header_para.style.font.color.rgb = RGBColor(0x88, 0x88, 0x88)

# Footer with page numbers
footer = section.footer
footer_para = footer.paragraphs[0]
footer_para.alignment = WD_ALIGN_PARAGRAPH.CENTER

# Add page number field: a PAGE instruction wrapped in begin/end field markers
run = footer_para.add_run()
fldChar = OxmlElement('w:fldChar')
fldChar.set(qn('w:fldCharType'), 'begin')
run._r.append(fldChar)

run2 = footer_para.add_run()
instrText = OxmlElement('w:instrText')
instrText.set(qn('xml:space'), 'preserve')
instrText.text = ' PAGE '
run2._r.append(instrText)

run3 = footer_para.add_run()
fldChar2 = OxmlElement('w:fldChar')
fldChar2.set(qn('w:fldCharType'), 'end')
run3._r.append(fldChar2)
```
Table Formatting
```python
from docx.shared import Cm, Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import nsdecls
from docx.oxml import parse_xml

# `table` is an existing three-column table, e.g. the one from Document Creation

# Set column widths. Word tends to honor cell widths rather than column
# widths, so the width is also set on every cell in each column.
for column, width in zip(table.columns, (Cm(4), Cm(6), Cm(3))):
    column.width = width
    for cell in column.cells:
        cell.width = width

# Cell shading (header row): dark fill with bold white text
for cell in table.rows[0].cells:
    shading = parse_xml(f'<w:shd {nsdecls("w")} w:val="clear" w:fill="2F5496"/>')
    cell._tc.get_or_add_tcPr().append(shading)
    for paragraph in cell.paragraphs:
        for run in paragraph.runs:
            run.font.color.rgb = RGBColor(0xFF, 0xFF, 0xFF)
            run.font.bold = True

# Cell alignment
for row in table.rows:
    for cell in row.cells:
        cell.paragraphs[0].alignment = WD_ALIGN_PARAGRAPH.CENTER
```
docxtpl Template Patterns
Template Syntax (Jinja2)
Template file (template.docx) contains:

```
{{ company_name }}
Date: {{ report_date }}
Dear {{ recipient_name }},
{% for item in items %}
- {{ item.name }}: ${{ item.price }}
{% endfor %}
Total: ${{ total }}
{% if urgent %}
URGENT: This requires immediate attention.
{% endif %}
```
Template Rendering
```python
from docxtpl import DocxTemplate, InlineImage
from docx.shared import Mm

tpl = DocxTemplate('template.docx')

context = {
    'company_name': 'Acme Corp',
    'report_date': '2025-03-15',
    'recipient_name': 'Alice Johnson',
    'items': [
        {'name': 'Widget A', 'price': '29.99'},
        {'name': 'Widget B', 'price': '49.99'},
    ],
    'total': '79.98',
    'urgent': True,
    # Image placed wherever the template contains {{ chart }}
    'chart': InlineImage(tpl, 'chart.png', width=Mm(120)),
}

tpl.render(context)
tpl.save('output.docx')
```
Rich Text in Templates
```python
from docxtpl import DocxTemplate, RichText

# build_url_id() needs the template the link will be rendered into
tpl = DocxTemplate('template.docx')

rt = RichText()
rt.add('Normal text ')
rt.add('bold text', bold=True)
rt.add(' and ')
rt.add('red text', color='FF0000')
rt.add(' with ')
rt.add('a link', url_id=tpl.build_url_id('https://example.com'))

# Reference RichText values in the template as {{r formatted_text }}
context = {'formatted_text': rt}
```
Tables in Templates
Template table row with loop (docxtpl's row-level tag is written `{%tr ... %}`, with no space after `{%`, and the loop closes with `{%tr endfor %}`; each tag sits in its own table row):

```
{%tr for row in table_data %}
{{ row.name }} | {{ row.value }} | {{ row.status }}
{%tr endfor %}
```
Mail Merge
```python
import csv
from docxtpl import DocxTemplate

# The letters/ output directory is assumed to exist already
with open('recipients.csv', newline='') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        # Load a fresh template for each recipient; rendering the same
        # DocxTemplate instance twice can corrupt the output
        template = DocxTemplate('letter_template.docx')
        context = {
            'name': row['name'],
            'address': row['address'],
            'amount': row['amount'],
            'due_date': row['due_date'],
        }
        template.render(context)
        template.save(f'letters/letter_{i:04d}_{row["name"]}.docx')
```
Style Management
Custom Styles
```python
from docx.enum.style import WD_STYLE_TYPE
from docx.shared import Pt, RGBColor

# `doc` is an existing Document

# Create custom paragraph style
style = doc.styles.add_style('CustomHeading', WD_STYLE_TYPE.PARAGRAPH)
style.font.name = 'Arial'
style.font.size = Pt(16)
style.font.bold = True
style.font.color.rgb = RGBColor(0x2F, 0x54, 0x96)
style.paragraph_format.space_before = Pt(12)
style.paragraph_format.space_after = Pt(6)

# Apply custom style
doc.add_paragraph('Section Title', style='CustomHeading')
```
Style Inheritance
```
Normal → Heading 1 → Heading 2 → ...
Normal → Body Text → List Paragraph
Normal → Table Normal → Table Grid
```
Conversion Strategies
DOCX to PDF
```python
# Option 1: LibreOffice (most reliable, server-friendly)
import subprocess

# `input_file` and `output_dir` are paths supplied by the caller
subprocess.run([
    'libreoffice', '--headless', '--convert-to', 'pdf',
    '--outdir', output_dir, input_file
], check=True)  # raise CalledProcessError on a non-zero exit code

# Option 2: docx2pdf (Windows/macOS with Word installed)
from docx2pdf import convert
convert('input.docx', 'output.pdf')

# Option 3: Generate PDF directly with reportlab for full control
```
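For server use, the LibreOffice route in Option 1 is often wrapped so that a hung or silent failure is noticed. The following is a minimal sketch, not a definitive implementation: the helper names, the `binary` parameter (on some systems the executable is `soffice`), and the 120-second timeout are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def libreoffice_command(input_file, output_dir, binary="libreoffice"):
    """Build the headless conversion command used in Option 1."""
    return [binary, "--headless", "--convert-to", "pdf",
            "--outdir", str(output_dir), str(input_file)]


def expected_pdf_path(input_file, output_dir):
    """The PDF keeps the input's base name and is written into output_dir."""
    return Path(output_dir) / (Path(input_file).stem + ".pdf")


def convert_to_pdf(input_file, output_dir, timeout=120):
    """Run the conversion and fail loudly if no PDF was produced."""
    subprocess.run(libreoffice_command(input_file, output_dir),
                   check=True, timeout=timeout)
    pdf = expected_pdf_path(input_file, output_dir)
    if not pdf.exists():
        raise RuntimeError(f"conversion produced no file at {pdf}")
    return pdf
```

Checking that the output file exists is a cheap extra guard on top of the exit code, and the timeout keeps a stuck conversion from blocking a batch job.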
Error Handling
```python
import jinja2
from docx.opc.exceptions import PackageNotFoundError
from docxtpl import DocxTemplate


def safe_generate_document(template_path, context, output_path):
    """Render template_path with context and save it; return True on success."""
    try:
        tpl = DocxTemplate(template_path)
        tpl.render(context)
        tpl.save(output_path)
        return True
    except jinja2.UndefinedError as e:
        print(f"Missing template variable: {e}")
        return False
    except (FileNotFoundError, PackageNotFoundError) as e:
        # python-docx reports a missing or unreadable .docx as PackageNotFoundError
        print(f"Template not found: {e}")
        return False
    except Exception as e:
        print(f"Document generation failed: {e}")
        return False
```
Anti-Patterns / Common Mistakes
| Anti-Pattern | Why It Fails | What To Do Instead |
|---|---|---|
| Hardcoding font sizes instead of styles | Inconsistent formatting, hard to maintain | Define styles once, apply everywhere |
| Not handling missing template variables | Runtime crashes on incomplete data | Use defaults or conditional sections for empty or missing fields |
| Huge tables without pagination | Unreadable output, broken layouts | Break tables across pages or summarize |
| Absolute image paths | Breaks portability across environments | Use relative paths or embed images |
| Not testing with different Word versions | Formatting breaks silently | Test in Word, LibreOffice, and Google Docs |
| Modifying XML directly when API exists | Fragile, version-dependent code | Use python-docx API methods first |
| All direct formatting, no styles | Impossible to maintain consistency | Create and apply named styles |
| Ignoring Unicode characters | Mojibake in generated documents | Test with accented characters, CJK, symbols |
| Not re-loading template in mail merge | Corrupted output after first render | Re-instantiate DocxTemplate per iteration |
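One way to address the missing-variable row above is to normalize the context before rendering. The sketch below is illustrative only: `with_defaults` is a hypothetical helper, and the `"N/A"` placeholder is an assumed default that a real project would choose per field.

```python
def with_defaults(context, required, default="N/A"):
    """Return a copy of context where missing, None, or blank fields get a default."""
    filled = dict(context)
    missing = []
    for key in required:
        value = filled.get(key)
        if value is None or (isinstance(value, str) and not value.strip()):
            filled[key] = default
            missing.append(key)
    return filled, missing


context, missing = with_defaults(
    {"name": "Alice Johnson", "address": "", "amount": None},
    required=["name", "address", "amount", "due_date"],
)
print(missing)  # ['address', 'amount', 'due_date']
```

Returning the list of filled-in keys lets the caller log or reject incomplete records instead of silently producing letters full of placeholders.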
Anti-Rationalization Guards
- Do NOT skip the approach decision (scratch vs template) -- it determines your entire implementation.
- Do NOT generate documents without testing in at least Word and one alternative viewer.
- Do NOT ignore missing data -- handle empty/null fields with defaults or conditional sections.
- Do NOT skip error handling in production document generation pipelines.
- Do NOT hardcode formatting when styles can be used instead.
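The error-handling guard above matters most in batch jobs such as mail merge, where one bad record should not stop the run. The sketch below assumes a caller-supplied `generate_one(index, row)` callback (hypothetical) that renders and saves a single document; `safe_filename` is likewise a hypothetical helper for turning recipient names into safe file names.

```python
import re


def safe_filename(name, fallback="unnamed"):
    """Keep word characters and dashes; collapse everything else to '_'."""
    cleaned = re.sub(r"[^\w\-]+", "_", name).strip("_")
    return cleaned or fallback


def generate_batch(rows, generate_one):
    """Call generate_one(index, row) for each row and collect failures."""
    failures = []
    for index, row in enumerate(rows):
        try:
            generate_one(index, row)
        except Exception as exc:  # broad on purpose: one bad row must not end the batch
            failures.append((index, str(exc)))
    return failures
```

The returned list of `(index, message)` pairs can be written to a log or used to retry only the failed rows.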
Integration Points
| Skill | How It Connects |
|---|---|
| | DOCX-to-PDF conversion, or choosing PDF generation directly |
| | Data from Excel feeds into document generation contexts |
| | Generated documents attach to professional emails |
| | Research content formatted into whitepapers and reports |
| | Output file naming and directory structure conventions |
| | Document generation pipelines in CI/CD or server environments |
Skill Type
FLEXIBLE — Choose between python-docx (programmatic) and docxtpl (template-based) based on document complexity. Simple reports may not need templates; complex recurring documents benefit from templates.