hwpx
HWPX creation, editing, and analysis
Overview
A .hwpx file is a ZIP archive containing XML files, based on the OWPML (Open Word-Processor Markup Language) standard (KS X 6101).
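Since a .hwpx file is a ZIP container, a quick sanity check before unpacking is to look at its first bytes. The sketch below is only an illustration: the helper names `looksLikeZip` and `isHwpxContainer` are hypothetical and not part of hwpxjs, and the check confirms only the ZIP signature, not that the archive holds valid OWPML.

```javascript
const fs = require("fs");

// ZIP local-file-header signature: the bytes "PK\x03\x04".
const ZIP_MAGIC = Buffer.from([0x50, 0x4b, 0x03, 0x04]);

// Returns true when the buffer starts with the ZIP signature.
function looksLikeZip(buffer) {
  return buffer.length >= 4 && buffer.subarray(0, 4).equals(ZIP_MAGIC);
}

// Hypothetical convenience wrapper: read a file and test its first bytes.
function isHwpxContainer(path) {
  return looksLikeZip(fs.readFileSync(path));
}
```

A file that fails this check is probably a legacy .hwp document and needs conversion first.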
Quick Reference
| Task | Approach |
|---|---|
| Read/analyze content | Use hwpxjs - see Reading Content below |
| Create new document | Use hwpxjs - see Creating New Documents below |
| Edit existing document | Unpack → edit XML → repack - see Editing Existing Documents below |
Converting .hwp to .hwpx
Legacy .hwp files must be converted before editing:
```bash
# Using hwpxjs CLI (pure TypeScript, no external dependencies)
npx hwpxjs convert:hwp document.hwp output.hwpx

# Or using LibreOffice as fallback
python scripts/office/soffice.py --headless --convert-to hwpx document.hwp
```
Reading Content
```bash
# Text extraction via CLI
npx hwpxjs txt document.hwpx

# HTML conversion (includes images/styles)
npx hwpxjs html document.hwpx > output.html

# Raw XML access
python scripts/unpack.py document.hwpx unpacked/
```
Converting to Images
```bash
python scripts/office/soffice.py --headless --convert-to pdf document.hwpx
pdftoppm -jpeg -r 150 document.pdf page
```
Creating New Documents
Generate .hwpx files with JavaScript. Install: `npm install @ssabrojs/hwpxjs`
Setup
```javascript
const { HwpxWriter } = require("@ssabrojs/hwpxjs");
const fs = require("fs");

// await is not allowed at the top level of a CommonJS module, so wrap the calls.
async function main() {
  // Create document from plain text
  const writer = new HwpxWriter();
  const content = `문서 제목
첫 번째 문단입니다.
두 번째 문단입니다.`;
  const buffer = await writer.createFromPlainText(content);
  fs.writeFileSync("output.hwpx", buffer);
}

main();
```
Reading Documents
```javascript
const { HwpxReader } = require("@ssabrojs/hwpxjs");
const fs = require("fs");

// await is not allowed at the top level of a CommonJS module, so wrap the calls.
async function main() {
  const reader = new HwpxReader();
  const fileBuffer = fs.readFileSync("document.hwpx");
  await reader.loadFromArrayBuffer(fileBuffer.buffer);

  // Extract text
  const text = await reader.extractText();
  console.log(text);

  // Get document info
  const info = await reader.getDocumentInfo();
  console.log(info);

  // List images
  const images = await reader.listImages();
  console.log(images);
  // [{ binPath: "BinData/0.jpg", width: 200, height: 150, format: "jpg" }]
}

main();
```
HTML Conversion
```javascript
// Assumes `reader` has been loaded as in Reading Documents (inside an async function).

// Basic HTML conversion
const html = await reader.extractHtml();

// With all options
const fullHtml = await reader.extractHtml({
  paragraphTag: "p",
  tableClassName: "hwpx-table",
  renderImages: true,        // Include images
  renderTables: true,        // Include tables
  renderStyles: true,        // Apply styles (bold, italic, color)
  embedImages: true,         // Base64 embed images
  tableHeaderFirstRow: true  // First row as <th>
});
```
HWP to HWPX Conversion
```javascript
const { HwpConverter } = require("@ssabrojs/hwpxjs");

// await is not allowed at the top level of a CommonJS module, so wrap the calls.
async function main() {
  const converter = new HwpConverter({ verbose: true });

  // Check availability
  if (converter.isAvailable()) {
    // Convert HWP to HWPX
    const result = await converter.convertHwpToHwpx("input.hwp", "output.hwpx");
    if (result.success) {
      console.log(`Converted: ${result.processingTime}ms`);
    }

    // Or extract text only
    const text = await converter.convertHwpToText("input.hwp");
  }
}

main();
```
Template Processing
```javascript
// hwpxjs supports {{key}} template replacement
// templateBuffer: ArrayBuffer holding the template .hwpx file (run inside an async function)
const reader = new HwpxReader();
await reader.loadFromArrayBuffer(templateBuffer);

// Apply template replacements
const html = await reader.extractHtml();
const result = html
  .replace(/\{\{name\}\}/g, "홍길동")
  .replace(/\{\{date\}\}/g, "2025-01-01");
```
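When a template has more than a couple of placeholders, the chained `.replace` calls can be folded into one pass driven by a key/value map. This is a minimal sketch, not an hwpxjs API: `fillTemplate` is a hypothetical helper, and it assumes placeholder keys consist only of letters, digits and underscores.

```javascript
// Replace every {{key}} in `text` with values[key].
// Placeholders whose key is not in `values` are left untouched.
function fillTemplate(text, values) {
  return text.replace(/\{\{(\w+)\}\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(values, key) ? String(values[key]) : match
  );
}

const filled = fillTemplate("<p>{{name}} / {{date}} / {{unknown}}</p>", {
  name: "홍길동",
  date: "2025-01-01",
});
console.log(filled); // <p>홍길동 / 2025-01-01 / {{unknown}}</p>
```

Using a replacer function rather than a replacement string keeps characters such as `$` in the values from being interpreted as substitution patterns. Values inserted into HTML should still be HTML-escaped by the caller.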
Critical Rules for hwpxjs
- createFromPlainText returns Buffer - save with `fs.writeFileSync(path, buffer)`
- loadFromArrayBuffer for reading - pass `fileBuffer.buffer`, not `fileBuffer`
- Text-only creation - for tables/images, use XML editing approach below
- HwpConverter for HWP files - pure TypeScript, no LibreOffice needed
- extractHtml for rich content - includes styles, tables, images
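One caveat about passing `fileBuffer.buffer`: a Node.js Buffer is a view, and its underlying ArrayBuffer can be larger than the Buffer itself (small Buffers are often carved out of a shared pool). The sketch below shows a defensive way to obtain an ArrayBuffer holding exactly the Buffer's bytes. `toArrayBuffer` is a hypothetical helper, and whether it is needed for a given file depends on how Node allocated the Buffer.

```javascript
// Copy exactly the bytes that `buf` views into a standalone ArrayBuffer.
function toArrayBuffer(buf) {
  return buf.buffer.slice(buf.byteOffset, buf.byteOffset + buf.byteLength);
}

const small = Buffer.from("hwpx");   // 4 bytes of content
const exact = toArrayBuffer(small);
console.log(exact.byteLength);       // 4
```

The result can then be handed to `reader.loadFromArrayBuffer(...)` in place of `fileBuffer.buffer`.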
Editing Existing Documents
Follow all 3 steps in order.
Step 1: Unpack
```bash
python scripts/unpack.py document.hwpx unpacked/
```
Step 2: Edit XML
Edit files in `unpacked/Contents/`. See XML Reference below for patterns.
**Use the Edit tool directly for string replacement. Do not write Python scripts.** Scripts introduce unnecessary complexity. The Edit tool shows exactly what is being replaced.
**CRITICAL: Remove `<hp:linesegarray>` when modifying text.** This element contains cached layout data. Leaving a stale linesegarray causes character overlap:
```xml
<!-- BEFORE: paragraph with stale layout cache -->
<hp:p id="0" paraPrIDRef="0" styleIDRef="0">
  <hp:run charPrIDRef="19">
    <hp:t>Original text</hp:t>
  </hp:run>
  <hp:linesegarray>
    <hp:lineseg textpos="0" vertpos="0" vertsize="1000" horzsize="5000" .../>
  </hp:linesegarray>
</hp:p>

<!-- AFTER: remove linesegarray entirely -->
<hp:p id="0" paraPrIDRef="0" styleIDRef="0">
  <hp:run charPrIDRef="19">
    <hp:t>New longer text that exceeds original width</hp:t>
  </hp:run>
</hp:p>
```
Note: Multiple `<hp:run>` elements share one `<hp:linesegarray>`. Remove it when editing ANY run in the paragraph.
Step 3: Pack
```bash
python scripts/pack.py unpacked/ output.hwpx
```
Common Pitfalls
- Character overlap after edit: Remove `<hp:linesegarray>` from the edited `<hp:p>`. Multiple `<hp:run>` elements share one linesegarray, so remove it when editing ANY run.
- Wrong table cell modified: Include `<hp:cellAddr>` in the search pattern. CRITICAL: `<hp:cellAddr>` appears AFTER the cell content, not before. Use `grep -B20 'colAddr="2" rowAddr="0"' section0.xml`.
- Preserve `charPrIDRef`: Don't change charPrIDRef when editing text; it references font/size/style in header.xml.
- File corruption from string replacement: Use lxml for structural changes (inserting elements). String replacement of structural markup can break XML parent-child relationships.
- Page overflow from text replacement: Replacing blanks/spaces with text can cause content overflow and page breaks. Solutions: (1) Keep replacement text similar in length to the original spaces, (2) Preserve charPrIDRef for underlined fields to maintain underline style, (3) Reduce unnecessary whitespace proportionally, (4) Cell/margin adjustments may be needed.
- Image size too large (e.g., 635mm): HWP unit calculation error. 1 HWP unit = 1/7200 inch, so 1mm ≈ 283.5 HWP units.
  - ❌ Wrong: `width="180000"` → 635mm (too large!)
  - ✅ Correct: `width="3400"` → ~12mm (signature size)
  - Formula: `mm × (7200 ÷ 25.4) = HWP units`
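The unit formula from the last pitfall can be wrapped in two small helpers so sizes are computed rather than typed by hand. This is a sketch; the function names are hypothetical, and rounding to whole units is an assumption about what the format expects.

```javascript
const HWP_UNITS_PER_INCH = 7200;
const MM_PER_INCH = 25.4;

// Millimetres -> HWP units, rounded to the nearest whole unit.
function mmToHwpUnits(mm) {
  return Math.round((mm * HWP_UNITS_PER_INCH) / MM_PER_INCH);
}

// HWP units -> millimetres (not rounded).
function hwpUnitsToMm(units) {
  return (units * MM_PER_INCH) / HWP_UNITS_PER_INCH;
}

console.log(mmToHwpUnits(12));                // 3402
console.log(hwpUnitsToMm(180000).toFixed(1)); // 635.0
```

Running the reverse conversion on a suspicious `width` value is a quick way to spot the 635mm case before repacking.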
XML Reference
Key Elements
| Element | Purpose |
|---|---|
| `<hp:p>` | Paragraph |
| `<hp:run>` | Text run with formatting |
| `<hp:t>` | Text content |
| `<hp:tbl>` | Table |
| `<hp:tc>` | Table cell |
| `<hp:cellAddr>` | Cell position (AFTER content) |
| `<hp:pic>` | Image |
| `<hp:linesegarray>` | Layout cache (remove when editing) |
Paragraph Structure
```xml
<hp:p id="0" paraPrIDRef="0" styleIDRef="0" pageBreak="0">
  <hp:run charPrIDRef="0">
    <hp:t>Text content</hp:t>
  </hp:run>
  <hp:linesegarray>  <!-- Remove this when editing text -->
    <hp:lineseg textpos="0" vertpos="0" vertsize="1000" .../>
  </hp:linesegarray>
</hp:p>
```
Table Cell Structure
```xml
<hp:tc borderFillIDRef="5">
  <hp:subList textDirection="HORIZONTAL" vertAlign="CENTER">
    <hp:p paraPrIDRef="20">
      <hp:run charPrIDRef="19">
        <hp:t>Cell content</hp:t>
      </hp:run>
    </hp:p>
  </hp:subList>
  <hp:cellAddr colAddr="0" rowAddr="0"/>  <!-- Position identifier -->
  <hp:cellSpan colSpan="1" rowSpan="1"/>
  <hp:cellSz width="5136" height="4179"/>
</hp:tc>
```
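Because `<hp:cellAddr>` comes after the cell content, locating a cell means finding the address first and then searching backwards for the opening `<hp:tc>`. The sketch below does this on an XML string for inspection purposes only. `findCellXml` is a hypothetical helper; it assumes the attribute order `colAddr` then `rowAddr` shown in the cell structure, and that the target cell contains no nested table.

```javascript
// Return the <hp:tc>...</hp:tc> fragment for the given address, or null.
function findCellXml(sectionXml, col, row) {
  const marker = `<hp:cellAddr colAddr="${col}" rowAddr="${row}"`;
  const addrPos = sectionXml.indexOf(marker);
  if (addrPos === -1) return null;

  // The cell opens BEFORE its address: take the nearest preceding <hp:tc ...>.
  const start = Math.max(
    sectionXml.lastIndexOf("<hp:tc ", addrPos),
    sectionXml.lastIndexOf("<hp:tc>", addrPos)
  );
  const endTag = "</hp:tc>";
  const end = sectionXml.indexOf(endTag, addrPos);
  if (start === -1 || end === -1) return null;
  return sectionXml.slice(start, end + endTag.length);
}

const sample =
  '<hp:tc borderFillIDRef="5"><hp:t>A</hp:t><hp:cellAddr colAddr="0" rowAddr="0"/></hp:tc>' +
  '<hp:tc borderFillIDRef="5"><hp:t>B</hp:t><hp:cellAddr colAddr="1" rowAddr="0"/></hp:tc>';
console.log(findCellXml(sample, 1, 0));
```

This mirrors what `grep -B20` does on the command line: the match is the address, and the content of interest is what precedes it.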
Images
CRITICAL: `<hp:pic>` MUST be inside `<hp:run>`, followed by an empty `<hp:t/>`.
- Add the image file to `BinData/`
- Add it to the manifest `Contents/content.hpf`:
```xml
<opf:item id="image1" href="BinData/image1.png" media-type="image/png" isEmbeded="1"/>
```
- Reference it in section0.xml:
```xml
<hp:p id="0" paraPrIDRef="38" styleIDRef="41">
  <hp:run charPrIDRef="0">
    <hp:pic id="12345" zOrder="0" numberingType="PICTURE" textWrap="TOP_AND_BOTTOM">
      <hp:orgSz width="7200" height="7200"/>  <!-- 1 inch = 7200 HWP units -->
      <hp:curSz width="3600" height="3600"/>  <!-- Display: 0.5 inch -->
      <hc:img binaryItemIDRef="image1" bright="0" contrast="0" effect="REAL_PIC" alpha="0"/>
      <hp:sz width="3600" widthRelTo="ABSOLUTE" height="3600" heightRelTo="ABSOLUTE"/>
      <hp:pos treatAsChar="1" horzRelTo="COLUMN" horzAlign="CENTER" vertRelTo="PARA" vertAlign="TOP"/>
    </hp:pic>
    <hp:t/>  <!-- REQUIRED: empty text element after hp:pic -->
  </hp:run>
</hp:p>
```
Size units: HWP uses 1/7200 inch units. 1mm ≈ 283.5 units (7200 ÷ 25.4)
For safe image insertion using lxml, see references/image-insertion.md.
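The manifest entry can be generated from the image file name so that the `id`, `href` and `media-type` stay consistent with each other. This is a sketch under stated assumptions: `manifestItem` is a hypothetical helper, and only the `image/png` value is taken from the example above. The other MIME strings are the standard registered types and may not match what a given HWPX producer writes.

```javascript
const path = require("path");

// Extension -> media type. Only "image/png" appears in the manifest example;
// the remaining entries are standard MIME types (an assumption here).
const MEDIA_TYPES = {
  ".png": "image/png",
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".gif": "image/gif",
  ".bmp": "image/bmp",
};

// Build the <opf:item> line for an image stored under BinData/.
function manifestItem(id, fileName) {
  const mediaType = MEDIA_TYPES[path.extname(fileName).toLowerCase()];
  if (!mediaType) throw new Error(`Unsupported image type: ${fileName}`);
  // "isEmbeded" is spelled as in the manifest example.
  return `<opf:item id="${id}" href="BinData/${fileName}" media-type="${mediaType}" isEmbeded="1"/>`;
}

console.log(manifestItem("image1", "image1.png"));
```

The `id` passed here is the value that `binaryItemIDRef` on `<hc:img>` must reference.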
Page Break
```xml
<hp:p pageBreak="1" ...>  <!-- pageBreak="1" inserts break before paragraph -->
```
Differences from DOCX
| Aspect | HWPX | DOCX |
|---|---|---|
| Text element | `<hp:t>` | `<w:t>` |
| Paragraph | `<hp:p>` | `<w:p>` |
| Run | `<hp:run>` | `<w:r>` |
| Layout cache | `<hp:linesegarray>` | None |
| Content location | `Contents/section0.xml` | `word/document.xml` |
| Cell identifier | `<hp:cellAddr>` | implicit order |

Key difference: HWPX stores a layout cache in linesegarray; DOCX doesn't. This is why editing HWPX requires removing linesegarray.
For detailed XML structures (headers/footers, lists/numbering, paragraph formatting), see references/xml-reference.md.
Dependencies
- hwpxjs: `npm install @ssabrojs/hwpxjs` - reading, writing, HTML conversion, HWP→HWPX conversion
- pyhwp2md: Converting HWP/HWPX to Markdown (alternative)
- LibreOffice: PDF conversion (auto-configured via `scripts/office/soffice.py`)
- Poppler: `pdftoppm` for PDF to images