DOCX Processing

Overview

Generate, manipulate, and template Word documents programmatically. This skill covers python-docx for direct document creation, docxtpl for Jinja2-based template filling, formatting control (headings, tables, images, headers/footers), mail merge operations, style management, and conversion strategies.

Apply this skill whenever Word documents need to be created, populated, or transformed through code rather than manual editing.

Multi-Phase Process

Phase 1: Requirements

Determine if creating from scratch or filling a template
Identify document structure (sections, headers, tables, images)
Define data sources (JSON, CSV, database, API)
Plan styling requirements (fonts, colors, margins)
Determine output format (DOCX, PDF conversion needed)

STOP — Do NOT begin implementation until the approach (scratch vs template) is decided and data sources are confirmed.

Phase 2: Implementation

Set up document template or create from scratch
Implement data binding and content generation
Apply formatting and styles
Add headers, footers, and page numbers
Handle images and embedded objects

STOP — Do NOT skip to validation until all document sections are implemented.

Phase 3: Validation

Verify document renders correctly in Word/LibreOffice
Check formatting consistency across pages
Validate data accuracy in generated documents
Test with edge cases (long text, missing data, special characters)
Verify PDF conversion if required
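
One of the checks above, catching template tags that survived rendering, can be automated. The following is a minimal sketch using only the standard library; the function name `find_leftover_placeholders` is hypothetical, and it inspects only the main document part (`word/document.xml`), so headers and footers would need the same treatment.

```python
import re
import zipfile

# Matches Jinja2-style delimiters that should not survive rendering.
PLACEHOLDER_PATTERN = re.compile(r"\{\{.*?\}\}|\{%.*?%\}")


def find_leftover_placeholders(docx_path):
    """Return unrendered template tags found in the main document part."""
    with zipfile.ZipFile(docx_path) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Drop XML tags so a placeholder split across runs is still detected.
    text = re.sub(r"<[^>]+>", "", xml)
    return PLACEHOLDER_PATTERN.findall(text)
```

An empty list suggests every tag was rendered; it does not replace opening the file in Word or LibreOffice.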

Approach Decision Table

Scenario	Approach	Library	Why
One-off report generation	From scratch	python-docx	Full programmatic control
Recurring reports with fixed layout	Template	docxtpl	Design layout in Word, fill with data
Bulk letter generation (mail merge)	Template	docxtpl	One template, many outputs
Complex formatting, custom styles	From scratch	python-docx	Direct access to document model
Non-technical users design template	Template	docxtpl	Users edit in Word, developers bind data
PDF output required	Either + conversion	libreoffice / docx2pdf	Post-processing step
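
The table can also be read as a small decision function. The sketch below is illustrative only: `DocumentJob` and `choose_approach` are hypothetical names, and the rule (any template-friendly trait selects docxtpl) is one reasonable reading of the table rather than a fixed policy.

```python
from dataclasses import dataclass


@dataclass
class DocumentJob:
    """Hypothetical description of a document generation task."""
    recurring: bool = False                   # same layout generated repeatedly
    bulk_outputs: bool = False                # one layout, many recipients
    designed_by_non_developers: bool = False  # layout is maintained in Word
    needs_pdf: bool = False                   # PDF conversion as a post-step


def choose_approach(job):
    """Map a job description to (approach, library, needs_conversion)."""
    use_template = (
        job.recurring or job.bulk_outputs or job.designed_by_non_developers
    )
    if use_template:
        return ("template", "docxtpl", job.needs_pdf)
    return ("from scratch", "python-docx", job.needs_pdf)
```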

python-docx Patterns

Document Creation

python

from docx import Document
from docx.shared import Inches, Pt, Cm, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.enum.table import WD_TABLE_ALIGNMENT

doc = Document()

# Set default font
style = doc.styles['Normal']
font = style.font
font.name = 'Calibri'
font.size = Pt(11)

# Add heading (level=0 produces the document title)
doc.add_heading('Monthly Report', level=0)

# Add paragraph with formatting
para = doc.add_paragraph()
run = para.add_run('Important: ')
run.bold = True
run.font.color.rgb = RGBColor(0xCC, 0x00, 0x00)
para.add_run('This section requires attention.')

# Add table with a header row
table = doc.add_table(rows=1, cols=3, style='Light Grid Accent 1')
hdr_cells = table.rows[0].cells
hdr_cells[0].text = 'Name'
hdr_cells[1].text = 'Department'
hdr_cells[2].text = 'Revenue'

# `data` is assumed to be an iterable of (name, department, revenue) tuples
for name, dept, rev in data:
    row_cells = table.add_row().cells
    row_cells[0].text = name
    row_cells[1].text = dept
    row_cells[2].text = f'${rev:,.2f}'

# Add image
doc.add_picture('chart.png', width=Inches(5.5))

# Save
doc.save('report.docx')

Headers and Footers

python

from docx.enum.section import WD_ORIENT
from docx.oxml.ns import qn
from docx.oxml import OxmlElement

# Continues from the document-creation example: doc, Cm, Pt, RGBColor,
# and WD_ALIGN_PARAGRAPH are already in scope.
section = doc.sections[0]

# Page setup (A4 with 2.5 cm margins)
section.page_width = Cm(21)
section.page_height = Cm(29.7)
section.left_margin = Cm(2.5)
section.right_margin = Cm(2.5)
section.top_margin = Cm(2.5)
section.bottom_margin = Cm(2.5)

# Header
header = section.header
header_para = header.paragraphs[0]
header_para.text = 'Company Name - Confidential'
header_para.alignment = WD_ALIGN_PARAGRAPH.RIGHT
# These two lines edit the paragraph's style, so they affect every
# paragraph that uses the same style
header_para.style.font.size = Pt(9)
header_para.style.font.color.rgb = RGBColor(0x88, 0x88, 0x88)

# Footer with page numbers
footer = section.footer
footer_para = footer.paragraphs[0]
footer_para.alignment = WD_ALIGN_PARAGRAPH.CENTER

# Add page number field: begin marker, PAGE instruction, end marker
run = footer_para.add_run()
fldChar = OxmlElement('w:fldChar')
fldChar.set(qn('w:fldCharType'), 'begin')
run._r.append(fldChar)

run2 = footer_para.add_run()
instrText = OxmlElement('w:instrText')
instrText.set(qn('xml:space'), 'preserve')
instrText.text = ' PAGE '
run2._r.append(instrText)

run3 = footer_para.add_run()
fldChar2 = OxmlElement('w:fldChar')
fldChar2.set(qn('w:fldCharType'), 'end')
run3._r.append(fldChar2)

Table Formatting

python

from docx.shared import Cm, Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import nsdecls
from docx.oxml import parse_xml

# `table` is the table built in the document-creation example

# Set column widths
table.columns[0].width = Cm(4)
table.columns[1].width = Cm(6)
table.columns[2].width = Cm(3)

# Cell shading: dark blue header row with bold white text
for cell in table.rows[0].cells:
    shading = parse_xml(f'<w:shd {nsdecls("w")} w:fill="2F5496"/>')
    cell._tc.get_or_add_tcPr().append(shading)
    for paragraph in cell.paragraphs:
        for run in paragraph.runs:
            run.font.color.rgb = RGBColor(0xFF, 0xFF, 0xFF)
            run.font.bold = True

# Cell alignment
for row in table.rows:
    for cell in row.cells:
        cell.paragraphs[0].alignment = WD_ALIGN_PARAGRAPH.CENTER

docxtpl Template Patterns

Template Syntax (Jinja2)

Template file (template.docx) contains:

{{ company_name }}
Date: {{ report_date }}

Dear {{ recipient_name }},

{% for item in items %}
- {{ item.name }}: ${{ item.price }}
{% endfor %}

Total: ${{ total }}

{% if urgent %}
URGENT: This requires immediate attention.
{% endif %}

Template Rendering

python

from docxtpl import DocxTemplate, InlineImage
from docx.shared import Mm

tpl = DocxTemplate('template.docx')

context = {
    'company_name': 'Acme Corp',
    'report_date': '2025-03-15',
    'recipient_name': 'Alice Johnson',
    'items': [
        {'name': 'Widget A', 'price': '29.99'},
        {'name': 'Widget B', 'price': '49.99'},
    ],
    'total': '79.98',
    'urgent': True,
    'chart': InlineImage(tpl, 'chart.png', width=Mm(120)),
}

tpl.render(context)
tpl.save('output.docx')

Rich Text in Templates

python

from docxtpl import RichText

rt = RichText()
rt.add('Normal text ')
rt.add('bold text', bold=True)
rt.add(' and ')
rt.add('red text', color='FF0000')
rt.add(' with ')
rt.add('a link', url_id=tpl.build_url_id('https://example.com'))

context = {'formatted_text': rt}

Tables in Templates

Template table row with loop (docxtpl row tags are written {%tr ... %}, with no space after {%):
{%tr for row in table_data %}
{{ row.name }} | {{ row.value }} | {{ row.status }}
{%tr endfor %}

Mail Merge

python

from docxtpl import DocxTemplate
import csv
import os

# Make sure the output directory exists before saving into it
os.makedirs('letters', exist_ok=True)

with open('recipients.csv', newline='') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        # Load a fresh template for every recipient; reusing an already
        # rendered DocxTemplate can corrupt later outputs
        template = DocxTemplate('letter_template.docx')
        context = {
            'name': row['name'],
            'address': row['address'],
            'amount': row['amount'],
            'due_date': row['due_date'],
        }
        template.render(context)
        template.save(f'letters/letter_{i:04d}_{row["name"]}.docx')
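
The mail-merge loop builds file names from the recipient's name, which may contain spaces, slashes, or other characters that are awkward or unsafe in paths. A possible guard is sketched below; `safe_filename_part`, its fallback value, and the length limit are assumptions for illustration, not part of docxtpl.

```python
import re
import unicodedata


def safe_filename_part(value, fallback="recipient", max_length=50):
    """Reduce arbitrary text to a conservative, filesystem-friendly token."""
    # Normalize Unicode, then replace every run of characters other than
    # word characters, dots, and hyphens with a single underscore.
    normalized = unicodedata.normalize("NFKC", value).strip()
    cleaned = re.sub(r"[^\w.-]+", "_", normalized).strip("._")
    return cleaned[:max_length] or fallback
```

In the loop, the save call would then use `safe_filename_part(row["name"])` in place of the raw value.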

Style Management

Custom Styles

python

from docx.enum.style import WD_STYLE_TYPE
from docx.shared import Pt, RGBColor

# Create custom paragraph style (`doc` is an existing Document)
style = doc.styles.add_style('CustomHeading', WD_STYLE_TYPE.PARAGRAPH)
style.font.name = 'Arial'
style.font.size = Pt(16)
style.font.bold = True
style.font.color.rgb = RGBColor(0x2F, 0x54, 0x96)
style.paragraph_format.space_before = Pt(12)
style.paragraph_format.space_after = Pt(6)

# Apply custom style
doc.add_paragraph('Section Title', style='CustomHeading')

Style Inheritance

Normal → Heading 1 → Heading 2 → ...
Normal → Body Text → List Paragraph
Normal → Table Normal → Table Grid
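
These chains are a simplification, and the actual base style of each style depends on the template in use. A short sketch for checking a real document follows; `style_chain` is a hypothetical helper that relies only on the `name` and `base_style` attributes exposed by python-docx style objects.

```python
def style_chain(style):
    """Return style names from `style` up to its root base style."""
    names = []
    current = style
    # Stop at the root (no base style) or if a name repeats, which would
    # indicate a malformed, cyclic style definition.
    while current is not None and current.name not in names:
        names.append(current.name)
        current = getattr(current, "base_style", None)
    return names

# Usage with python-docx (not executed here):
#   from docx import Document
#   doc = Document()
#   print(" -> ".join(style_chain(doc.styles["Heading 2"])))
```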

Conversion Strategies

DOCX to PDF

python

# Option 1: LibreOffice (most reliable, server-friendly)
# `output_dir` and `input_file` are assumed to be defined by the caller
import subprocess

subprocess.run([
    'libreoffice', '--headless', '--convert-to', 'pdf',
    '--outdir', output_dir, input_file
], check=True)  # check=True raises CalledProcessError on a non-zero exit

# Option 2: docx2pdf (Windows/macOS with Word installed)
from docx2pdf import convert

convert('input.docx', 'output.pdf')

# Option 3: Generate PDF directly with reportlab for full control
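
For unattended pipelines, Option 1 usually benefits from a timeout and a check that the PDF was actually written. The sketch below is one way to wrap it; `convert_to_pdf` and `expected_pdf_path` are hypothetical helpers, the 120-second timeout is an arbitrary assumption, and the executable lookup assumes the binary is on PATH as either `libreoffice` or `soffice`.

```python
import shutil
import subprocess
from pathlib import Path


def expected_pdf_path(input_file, output_dir):
    """LibreOffice writes <input stem>.pdf into the output directory."""
    return Path(output_dir) / (Path(input_file).stem + ".pdf")


def convert_to_pdf(input_file, output_dir, timeout=120):
    """Convert a DOCX with headless LibreOffice and return the PDF path."""
    # The executable name varies by platform; both common names are tried.
    binary = shutil.which("libreoffice") or shutil.which("soffice")
    if binary is None:
        raise RuntimeError("LibreOffice executable not found on PATH")
    subprocess.run(
        [binary, "--headless", "--convert-to", "pdf",
         "--outdir", str(output_dir), str(input_file)],
        check=True,       # raise CalledProcessError on a non-zero exit code
        timeout=timeout,  # avoid hanging the pipeline on a stuck process
    )
    pdf_path = expected_pdf_path(input_file, output_dir)
    if not pdf_path.exists():
        raise RuntimeError(f"Conversion produced no file at {pdf_path}")
    return pdf_path
```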

Error Handling

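python

import jinja2
from docx.opc.exceptions import PackageNotFoundError
from docxtpl import DocxTemplate

def safe_generate_document(template_path, context, output_path):
    """Render a template to output_path; return True on success."""
    try:
        tpl = DocxTemplate(template_path)
        tpl.render(context)
        tpl.save(output_path)
        return True
    except jinja2.UndefinedError as e:
        # Raised when an undefined value is used in an invalid way,
        # for example attribute access on a missing variable
        print(f"Missing template variable: {e}")
        return False
    except (FileNotFoundError, PackageNotFoundError) as e:
        # python-docx reports a missing template as PackageNotFoundError
        print(f"Template not found: {e}")
        return False
    except Exception as e:
        print(f"Document generation failed: {e}")
        return False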
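
With Jinja2's default undefined type, a missing top-level variable typically renders as empty text instead of raising, so the `UndefinedError` branch alone may not catch incomplete data. A pre-flight check is one way to close that gap. The sketch below assumes a hypothetical helper `find_missing_fields` and treats `None` and blank strings as missing.

```python
def find_missing_fields(required, context):
    """Return required keys that are absent, None, or blank in `context`."""
    missing = []
    for key in sorted(required):
        value = context.get(key)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(key)
    return missing

# With docxtpl, the required set could come from the template itself
# (not executed here):
#   required = tpl.get_undeclared_template_variables()
```

Calling this before `render` lets the pipeline report every missing field at once rather than producing a document with silent gaps.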

Anti-Patterns / Common Mistakes

Anti-Pattern	Why It Fails	What To Do Instead
Hardcoding font sizes instead of styles	Inconsistent formatting, hard to maintain	Define styles once, apply everywhere
Not handling missing template variables	Blank fields or runtime errors on incomplete data	Validate the context up front, use `jinja2.StrictUndefined` to fail fast, or apply the `default` filter
Huge tables without pagination	Unreadable output, broken layouts	Break tables across pages or summarize
Absolute image paths	Breaks portability across environments	Use relative paths or embed images
Not testing with different Word versions	Formatting breaks silently	Test in Word, LibreOffice, and Google Docs
Modifying XML directly when API exists	Fragile, version-dependent code	Use python-docx API methods first
All direct formatting, no styles	Impossible to maintain consistency	Create and apply named styles
Ignoring Unicode characters	Mojibake in generated documents	Test with accented characters, CJK, symbols
Not re-loading template in mail merge	Corrupted output after first render	Re-instantiate DocxTemplate per iteration
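
For the image-path row, one option is to resolve every asset against a single base directory so the same code runs unchanged across environments. The helper below is a sketch under that assumption; `resolve_asset` is a hypothetical name, and rejecting paths that escape the base directory is an extra safety choice rather than something the table requires.

```python
from pathlib import Path


def resolve_asset(relative_path, base_dir):
    """Resolve an asset path against a base directory and verify it exists."""
    base = Path(base_dir).resolve()
    candidate = (base / relative_path).resolve()
    # Reject paths that escape the base directory (for example via "..").
    if base != candidate and base not in candidate.parents:
        raise ValueError(f"{relative_path!r} is outside {base}")
    if not candidate.is_file():
        raise FileNotFoundError(f"Asset not found: {candidate}")
    return candidate
```

The returned path can be passed as a string to `doc.add_picture` or `InlineImage`.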

Anti-Rationalization Guards

Do NOT skip the approach decision (scratch vs template) -- it determines your entire implementation.
Do NOT generate documents without testing in at least Word and one alternative viewer.
Do NOT ignore missing data -- handle empty/null fields with defaults or conditional sections.
Do NOT skip error handling in production document generation pipelines.
Do NOT hardcode formatting when styles can be used instead.
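
The guard on missing data can be applied before rendering by merging defaults into the context. This is a minimal sketch: `with_defaults` is a hypothetical helper, and treating blank strings like `None` is an assumption that may not suit every field. Inside a template, the Jinja2 `default` filter is an alternative for individual values.

```python
def with_defaults(context, defaults):
    """Return a new context where missing, None, or blank values use defaults."""
    merged = dict(context)  # copy so the caller's context is not mutated
    for key, default in defaults.items():
        value = merged.get(key)
        if value is None or (isinstance(value, str) and not value.strip()):
            merged[key] = default
    return merged
```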

Integration Points

Skill	How It Connects
`pdf-processing`	DOCX-to-PDF conversion, or choosing PDF generation directly
`xlsx-processing`	Data from Excel feeds into document generation contexts
`email-composer`	Generated documents attach to professional emails
`content-research-writer`	Research content formatted into whitepapers and reports
`file-organizer`	Output file naming and directory structure conventions
`deployment`	Document generation pipelines in CI/CD or server environments

Skill Type

FLEXIBLE — Choose between python-docx (programmatic) and docxtpl (template-based) based on document complexity. Simple reports may not need templates; complex recurring documents benefit from templates.
