docx-processing


DOCX Processing

Overview

Generate, manipulate, and template Word documents programmatically. This skill covers python-docx for direct document creation, docxtpl for Jinja2-based template filling, formatting control (headings, tables, images, headers/footers), mail merge operations, style management, and conversion strategies.
Apply this skill whenever Word documents need to be created, populated, or transformed through code rather than manual editing.

Multi-Phase Process

Phase 1: Requirements

  1. Determine if creating from scratch or filling a template
  2. Identify document structure (sections, headers, tables, images)
  3. Define data sources (JSON, CSV, database, API)
  4. Plan styling requirements (fonts, colors, margins)
  5. Determine output format (DOCX, PDF conversion needed)
STOP — Do NOT begin implementation until the approach (scratch vs template) is decided and data sources are confirmed.
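
Step 3 lists JSON and CSV among the possible data sources. As a minimal sketch, assuming the data arrives as a local UTF-8 file, the hypothetical helper `load_records` below (not part of python-docx or docxtpl) normalizes both formats into a list of dictionaries that can later serve as render contexts.

```python
import csv
import json
from pathlib import Path


def load_records(path):
    """Return a list of dicts from a .json or .csv file (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8"))
        # Accept either a single object or a list of objects.
        return data if isinstance(data, list) else [data]
    if suffix == ".csv":
        # newline="" lets the csv module handle line endings itself.
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"Unsupported data source: {path.name}")
```

Database and API sources would need their own loaders. The point of the sketch is only that every source ends up in the same shape before implementation starts.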

Phase 2: Implementation

  1. Set up document template or create from scratch
  2. Implement data binding and content generation
  3. Apply formatting and styles
  4. Add headers, footers, and page numbers
  5. Handle images and embedded objects
STOP — Do NOT skip to validation until all document sections are implemented.

Phase 3: Validation

  1. Verify document renders correctly in Word/LibreOffice
  2. Check formatting consistency across pages
  3. Validate data accuracy in generated documents
  4. Test with edge cases (long text, missing data, special characters)
  5. Verify PDF conversion if required
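
Opening the file in Word or LibreOffice remains the real rendering check, but a cheap automated pre-check can catch broken output earlier. A .docx file is a ZIP package, so the sketch below only confirms that the package opens, that the two parts every Word document carries are present, and that the main document part is well-formed XML. The function name `check_docx_structure` is hypothetical, and passing this check does not prove the document renders correctly.

```python
import xml.etree.ElementTree as ET
import zipfile

REQUIRED_PARTS = ("[Content_Types].xml", "word/document.xml")


def check_docx_structure(path):
    """Return a list of problems; an empty list means the basic checks passed."""
    if not zipfile.is_zipfile(path):
        return ["not a ZIP package"]
    problems = []
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
        for part in REQUIRED_PARTS:
            if part not in names:
                problems.append(f"missing part: {part}")
        if "word/document.xml" in names:
            try:
                ET.fromstring(zf.read("word/document.xml"))
            except ET.ParseError as exc:
                problems.append(f"document.xml is not well-formed: {exc}")
    return problems
```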

Approach Decision Table

| Scenario | Approach | Library | Why |
| --- | --- | --- | --- |
| One-off report generation | From scratch | python-docx | Full programmatic control |
| Recurring reports with fixed layout | Template | docxtpl | Design layout in Word, fill with data |
| Bulk letter generation (mail merge) | Template | docxtpl | One template, many outputs |
| Complex formatting, custom styles | From scratch | python-docx | Direct access to document model |
| Non-technical users design template | Template | docxtpl | Users edit in Word, developers bind data |
| PDF output required | Either + conversion | libreoffice / docx2pdf | Post-processing step |
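
The decision table can also be kept in code so that a pipeline records which approach it chose. The following is a minimal sketch under that assumption. The scenario keys and the `choose_approach` function are hypothetical names introduced here for illustration.

```python
# Scenario keys are hypothetical labels for the rows of the decision table.
APPROACHES = {
    "one_off_report": ("from scratch", "python-docx"),
    "recurring_fixed_layout": ("template", "docxtpl"),
    "mail_merge": ("template", "docxtpl"),
    "complex_custom_styles": ("from scratch", "python-docx"),
    "non_technical_template_owner": ("template", "docxtpl"),
}


def choose_approach(scenario, pdf_required=False):
    """Return the ordered steps implied by the table for one scenario."""
    approach, library = APPROACHES[scenario]
    steps = [f"{approach} with {library}"]
    if pdf_required:
        # PDF output is a post-processing step on top of either approach.
        steps.append("convert to PDF (libreoffice or docx2pdf)")
    return steps
```

For example, `choose_approach("mail_merge", pdf_required=True)` yields the template step followed by the conversion step.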

python-docx Patterns

Document Creation

python
from docx import Document
from docx.shared import Inches, Pt, Cm, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.enum.table import WD_TABLE_ALIGNMENT

doc = Document()

# Set default font
style = doc.styles['Normal']
font = style.font
font.name = 'Calibri'
font.size = Pt(11)

# Add heading
doc.add_heading('Monthly Report', level=0)

# Add paragraph with formatting
para = doc.add_paragraph()
run = para.add_run('Important: ')
run.bold = True
run.font.color.rgb = RGBColor(0xCC, 0x00, 0x00)
para.add_run('This section requires attention.')

# Add table
table = doc.add_table(rows=1, cols=3, style='Light Grid Accent 1')
hdr_cells = table.rows[0].cells
hdr_cells[0].text = 'Name'
hdr_cells[1].text = 'Department'
hdr_cells[2].text = 'Revenue'

# data is assumed to be an iterable of (name, department, revenue) tuples
for name, dept, rev in data:
    row_cells = table.add_row().cells
    row_cells[0].text = name
    row_cells[1].text = dept
    row_cells[2].text = f'${rev:,.2f}'

# Add image
doc.add_picture('chart.png', width=Inches(5.5))

# Save
doc.save('report.docx')

Headers and Footers

python
from docx.enum.section import WD_ORIENT
from docx.oxml.ns import qn
from docx.oxml import OxmlElement

section = doc.sections[0]

# Page setup
section.page_width = Cm(21)
section.page_height = Cm(29.7)
section.left_margin = Cm(2.5)
section.right_margin = Cm(2.5)
section.top_margin = Cm(2.5)
section.bottom_margin = Cm(2.5)

# Header
header = section.header
header_para = header.paragraphs[0]
header_para.text = 'Company Name — Confidential'
header_para.alignment = WD_ALIGN_PARAGRAPH.RIGHT
header_para.style.font.size = Pt(9)
header_para.style.font.color.rgb = RGBColor(0x88, 0x88, 0x88)

# Footer with page numbers
footer = section.footer
footer_para = footer.paragraphs[0]
footer_para.alignment = WD_ALIGN_PARAGRAPH.CENTER

# Add page number field
run = footer_para.add_run()
fldChar = OxmlElement('w:fldChar')
fldChar.set(qn('w:fldCharType'), 'begin')
run._r.append(fldChar)

run2 = footer_para.add_run()
instrText = OxmlElement('w:instrText')
instrText.set(qn('xml:space'), 'preserve')
instrText.text = ' PAGE '
run2._r.append(instrText)

run3 = footer_para.add_run()
fldChar2 = OxmlElement('w:fldChar')
fldChar2.set(qn('w:fldCharType'), 'end')
run3._r.append(fldChar2)

Table Formatting

python
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.shared import Cm, Pt, RGBColor
from docx.oxml.ns import nsdecls
from docx.oxml import parse_xml

# Set column widths
table.columns[0].width = Cm(4)
table.columns[1].width = Cm(6)
table.columns[2].width = Cm(3)

# Cell shading
for cell in table.rows[0].cells:
    shading = parse_xml(f'<w:shd {nsdecls("w")} w:fill="2F5496"/>')
    cell._tc.get_or_add_tcPr().append(shading)
    for paragraph in cell.paragraphs:
        for run in paragraph.runs:
            run.font.color.rgb = RGBColor(0xFF, 0xFF, 0xFF)
            run.font.bold = True

# Cell alignment
for row in table.rows:
    for cell in row.cells:
        cell.paragraphs[0].alignment = WD_ALIGN_PARAGRAPH.CENTER
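
Column widths only make sense relative to the printable area. With the A4 page setup shown earlier (21 cm wide, 2.5 cm left and right margins) the usable width is 21 - 2.5 - 2.5 = 16 cm, so the 4 + 6 + 3 = 13 cm of columns above fits. The sketch below works through that arithmetic in plain Python. The helper names are hypothetical, and the only library fact assumed is that python-docx stores lengths as integer EMU at 360,000 per centimetre.

```python
EMU_PER_CM = 360_000  # English Metric Units per centimetre, as used by python-docx lengths


def cm_to_emu(cm):
    """Convert centimetres to integer EMU."""
    return int(round(cm * EMU_PER_CM))


def usable_width_cm(page_width_cm=21.0, left_margin_cm=2.5, right_margin_cm=2.5):
    """Width left for content once both side margins are subtracted."""
    return page_width_cm - left_margin_cm - right_margin_cm


def columns_fit(column_widths_cm, **page):
    """True when the summed column widths do not exceed the usable width."""
    return sum(column_widths_cm) <= usable_width_cm(**page)
```

Running a check like `columns_fit([4, 6, 3])` before building the table is one way to catch layouts that would spill past the right margin.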

docxtpl Template Patterns

Template Syntax (Jinja2)

Template file (template.docx) contains:

{{ company_name }}
Date: {{ report_date }}

Dear {{ recipient_name }},

{% for item in items %}
- {{ item.name }}: ${{ item.price }}
{% endfor %}

Total: ${{ total }}

{% if urgent %}
URGENT: This requires immediate attention.
{% endif %}
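
Before rendering, it can help to compare the placeholders a template uses against the keys the context supplies. The sketch below is a rough illustration that works on plain template text such as the sample above. It is not how docxtpl itself resolves variables, it only recognizes simple `{{ name }}` and `{{ name.attr }}` placeholders, and the function name `missing_variables` is hypothetical. Loop variables such as `item` have to be passed in explicitly because the regex does not parse `{% for %}` tags.

```python
import re

# Matches simple placeholders such as {{ name }} or {{ item.price }}.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def missing_variables(template_text, context, loop_vars=()):
    """Return top-level names used in {{ ... }} that the context does not supply."""
    used = {m.group(1).split(".")[0] for m in PLACEHOLDER.finditer(template_text)}
    return sorted(used - set(context) - set(loop_vars))
```

With the sample template, a context lacking `report_date` would be reported as `['report_date']` when `item` is listed in `loop_vars`.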

Template Rendering

python
from docxtpl import DocxTemplate, InlineImage
from docx.shared import Mm

tpl = DocxTemplate('template.docx')

context = {
    'company_name': 'Acme Corp',
    'report_date': '2025-03-15',
    'recipient_name': 'Alice Johnson',
    'items': [
        {'name': 'Widget A', 'price': '29.99'},
        {'name': 'Widget B', 'price': '49.99'},
    ],
    'total': '79.98',
    'urgent': True,
    'chart': InlineImage(tpl, 'chart.png', width=Mm(120)),
}

tpl.render(context)
tpl.save('output.docx')
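
In the context above, `total` is typed in by hand next to the item prices. One way to keep the two consistent is to derive the total from the items. The following is a minimal sketch assuming prices are decimal strings as in the example. `build_invoice_context` is a hypothetical helper, and `Decimal` is used so that currency sums avoid binary floating-point rounding.

```python
from decimal import Decimal


def build_invoice_context(items):
    """Return the items plus a total formatted to two decimals (hypothetical helper)."""
    total = sum((Decimal(item["price"]) for item in items), Decimal("0"))
    return {
        "items": items,
        "total": f"{total:.2f}",
    }
```

The returned dictionary can be merged into the larger render context with `context.update(...)`.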

Rich Text in Templates

python
from docxtpl import RichText

rt = RichText()
rt.add('Normal text ')
rt.add('bold text', bold=True)
rt.add(' and ')
rt.add('red text', color='FF0000')
rt.add(' with ')
rt.add('a link', url_id=tpl.build_url_id('https://example.com'))

context = {'formatted_text': rt}

Tables in Templates

Template table row with loop:
{%tr for row in table_data %}
{{ row.name }} | {{ row.value }} | {{ row.status }}
{%tr endfor %}

Mail Merge

python
from docxtpl import DocxTemplate
import csv

# The letters/ output directory must already exist
with open('recipients.csv', newline='') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        # Load a fresh template for every letter instead of reusing a rendered one
        template = DocxTemplate('letter_template.docx')
        context = {
            'name': row['name'],
            'address': row['address'],
            'amount': row['amount'],
            'due_date': row['due_date'],
        }
        template.render(context)
        template.save(f'letters/letter_{i:04d}_{row["name"]}.docx')
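
The output file name above embeds `row["name"]` directly, so a name containing a slash or other path characters could write outside `letters/` or fail on some file systems. The sketch below shows one conservative way to reduce a name to a file-name fragment. `safe_filename` and its `fallback` parameter are hypothetical, and the replacement rule is only one possible policy.

```python
import re
import unicodedata


def safe_filename(name, fallback="recipient"):
    """Reduce arbitrary text to a conservative file-name fragment (hypothetical helper)."""
    # Normalize compatibility forms, then replace every run of characters that
    # is not a word character or hyphen with one underscore. \w keeps
    # non-ASCII letters, so names in other scripts survive.
    text = unicodedata.normalize("NFKC", name).strip()
    text = re.sub(r"[^\w\-]+", "_", text).strip("_")
    return text or fallback
```

In the loop, the save call would then use `safe_filename(row["name"])` in place of the raw value.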

Style Management

Custom Styles

python
from docx.enum.style import WD_STYLE_TYPE

# Create custom paragraph style
style = doc.styles.add_style('CustomHeading', WD_STYLE_TYPE.PARAGRAPH)
style.font.name = 'Arial'
style.font.size = Pt(16)
style.font.bold = True
style.font.color.rgb = RGBColor(0x2F, 0x54, 0x96)
style.paragraph_format.space_before = Pt(12)
style.paragraph_format.space_after = Pt(6)

# Apply custom style
doc.add_paragraph('Section Title', style='CustomHeading')

Style Inheritance

Normal → Heading 1 → Heading 2 → ...
Normal → Body Text → List Paragraph
Normal → Table Normal → Table Grid

Conversion Strategies

DOCX to PDF

python
# Option 1: LibreOffice (most reliable, server-friendly)
import subprocess
subprocess.run([
    'libreoffice', '--headless', '--convert-to', 'pdf',
    '--outdir', output_dir, input_file
])

# Option 2: docx2pdf (Windows/macOS with Word installed)
from docx2pdf import convert
convert('input.docx', 'output.pdf')

# Option 3: Generate PDF directly with reportlab for full control
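
For Option 1 in a server pipeline, it is usually worth failing loudly when the conversion does not succeed. The sketch below wraps the same LibreOffice flags in two hypothetical helpers. The 120-second timeout is an arbitrary assumption, and on some installations the executable is named `soffice` rather than `libreoffice`, which is why the binary name is a parameter.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(input_file, output_dir, binary="libreoffice"):
    """Build the headless conversion command as an argument list."""
    return [
        binary, "--headless", "--convert-to", "pdf",
        "--outdir", str(output_dir), str(input_file),
    ]


def convert_to_pdf(input_file, output_dir, binary="libreoffice", timeout=120):
    """Run the conversion and return the expected PDF path, raising on failure."""
    if shutil.which(binary) is None:
        raise RuntimeError(f"{binary} executable not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(
        build_convert_command(input_file, output_dir, binary),
        check=True,
        timeout=timeout,
    )
    # LibreOffice writes <input stem>.pdf into the output directory.
    pdf_path = Path(output_dir) / (Path(input_file).stem + ".pdf")
    if not pdf_path.exists():
        raise RuntimeError(f"conversion produced no file at {pdf_path}")
    return pdf_path
```

Keeping the command construction separate from the subprocess call makes the argument list easy to log and to test without LibreOffice installed.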

Error Handling

python
import jinja2
from docx.opc.exceptions import PackageNotFoundError
from docxtpl import DocxTemplate

def safe_generate_document(template_path, context, output_path):
    try:
        tpl = DocxTemplate(template_path)
        tpl.render(context)
        tpl.save(output_path)
        return True
    except jinja2.UndefinedError as e:
        print(f"Missing template variable: {e}")
        return False
    except (FileNotFoundError, PackageNotFoundError) as e:
        # python-docx reports a missing .docx file as PackageNotFoundError
        print(f"Template not found: {e}")
        return False
    except Exception as e:
        print(f"Document generation failed: {e}")
        return False

Anti-Patterns / Common Mistakes

| Anti-Pattern | Why It Fails | What To Do Instead |
| --- | --- | --- |
| Hardcoding font sizes instead of styles | Inconsistent formatting, hard to maintain | Define styles once, apply everywhere |
| Not handling missing template variables | Runtime crashes on incomplete data | Use jinja2.Undefined or default filters |
| Huge tables without pagination | Unreadable output, broken layouts | Break tables across pages or summarize |
| Absolute image paths | Breaks portability across environments | Use relative paths or embed images |
| Not testing with different Word versions | Formatting breaks silently | Test in Word, LibreOffice, and Google Docs |
| Modifying XML directly when API exists | Fragile, version-dependent code | Use python-docx API methods first |
| All direct formatting, no styles | Impossible to maintain consistency | Create and apply named styles |
| Ignoring Unicode characters | Mojibake in generated documents | Test with accented characters, CJK, symbols |
| Not re-loading template in mail merge | Corrupted output after first render | Re-instantiate DocxTemplate per iteration |
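
One special-character case that tends to surface only with real data is control characters pasted in from other systems. The XML format underlying .docx cannot carry most of them, and lxml, which python-docx builds on, is generally understood to reject such strings with a ValueError. The sketch below strips them before the text reaches the document. `strip_xml_illegal` is a hypothetical helper, and it leaves accented, CJK, and symbol characters untouched.

```python
import re

# C0 control characters other than tab, newline and carriage return,
# plus the noncharacters U+FFFE and U+FFFF, are not allowed in XML 1.0 text.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def strip_xml_illegal(text, replacement=""):
    """Remove characters that XML 1.0 text cannot contain."""
    return _XML_ILLEGAL.sub(replacement, text)
```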

Anti-Rationalization Guards

  • Do NOT skip the approach decision (scratch vs template) -- it determines your entire implementation.
  • Do NOT generate documents without testing in at least Word and one alternative viewer.
  • Do NOT ignore missing data -- handle empty/null fields with defaults or conditional sections.
  • Do NOT skip error handling in production document generation pipelines.
  • Do NOT hardcode formatting when styles can be used instead.
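
The guard about missing data can be enforced before rendering rather than inside the template. The sketch below fills empty or null fields from a defaults mapping. `with_defaults` is a hypothetical helper, and treating whitespace-only strings as empty is an assumption that may not suit every data set.

```python
def with_defaults(context, defaults):
    """Return a copy of context where missing, None, or blank values take defaults."""
    merged = dict(context)
    for key, default in defaults.items():
        value = merged.get(key)
        if value is None or (isinstance(value, str) and not value.strip()):
            merged[key] = default
    return merged
```

Fields without an entry in `defaults` pass through unchanged, so conditional sections in the template can still test them.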

Integration Points

| Skill | How It Connects |
| --- | --- |
| pdf-processing | DOCX-to-PDF conversion, or choosing PDF generation directly |
| xlsx-processing | Data from Excel feeds into document generation contexts |
| email-composer | Generated documents attach to professional emails |
| content-research-writer | Research content formatted into whitepapers and reports |
| file-organizer | Output file naming and directory structure conventions |
| deployment | Document generation pipelines in CI/CD or server environments |

Skill Type

FLEXIBLE — Choose between python-docx (programmatic) and docxtpl (template-based) based on document complexity. Simple reports may not need templates; complex recurring documents benefit from templates.