kordoc-korean-document-parser
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
Chinesekordoc Korean Document Parser
kordoc 韩文文档解析器
Skill by ara.so — Daily 2026 Skills collection.
kordoc is a TypeScript library and CLI for parsing Korean government documents (HWP 5.x, HWPX, PDF) into Markdown and structured data. It handles proprietary HWP binary formats, table extraction, form field recognition, document diffing, and reverse Markdown→HWPX generation.
IRBlock[]Skill 由 ara.so 开发 — 属于2026年度Skill合集。
kordoc是一款TypeScript库和CLI工具,可将韩国政府文档(HWP 5.x、HWPX、PDF)解析为Markdown和结构化的数据。它支持处理专有HWP二进制格式、表格提取、表单字段识别、文档对比,以及反向的Markdown→HWPX生成。
IRBlock[]Installation
安装
bash
undefinedbash
undefinedCore library
Core library
npm install kordoc
npm install kordoc
PDF support (optional peer dependency)
PDF support (optional peer dependency)
npm install pdfjs-dist
npm install pdfjs-dist
CLI (no install needed)
CLI (no install needed)
npx kordoc document.hwpx
---npx kordoc document.hwpx
---Core API
核心API
Auto-detect and Parse Any Document
自动检测并解析任意文档
typescript
import { parse } from "kordoc"
import { readFileSync } from "fs"
const buffer = readFileSync("document.hwpx")
const result = await parse(buffer.buffer) // ArrayBuffer required
if (result.success) {
console.log(result.markdown) // string: full Markdown
console.log(result.blocks) // IRBlock[]: structured data
console.log(result.metadata) // { title, author, createdAt, pageCount, ... }
console.log(result.outline) // OutlineItem[]: document structure
console.log(result.warnings) // ParseWarning[]: skipped elements
} else {
console.error(result.error) // string message
console.error(result.code) // ErrorCode: "ENCRYPTED" | "ZIP_BOMB" | "IMAGE_BASED_PDF" | ...
}typescript
import { parse } from "kordoc"
import { readFileSync } from "fs"
const buffer = readFileSync("document.hwpx")
const result = await parse(buffer.buffer) // ArrayBuffer required
if (result.success) {
console.log(result.markdown) // string: full Markdown
console.log(result.blocks) // IRBlock[]: structured data
console.log(result.metadata) // { title, author, createdAt, pageCount, ... }
console.log(result.outline) // OutlineItem[]: document structure
console.log(result.warnings) // ParseWarning[]: skipped elements
} else {
console.error(result.error) // string message
console.error(result.code) // ErrorCode: "ENCRYPTED" | "ZIP_BOMB" | "IMAGE_BASED_PDF" | ...
}Format-Specific Parsers
指定格式解析器
typescript
import { parseHwpx, parseHwp, parsePdf, detectFormat } from "kordoc"
// Detect format first
const fmt = detectFormat(buffer.buffer) // "hwpx" | "hwp" | "pdf" | "unknown"
// Parse by format
const hwpxResult = await parseHwpx(buffer.buffer)
const hwpResult = await parseHwp(buffer.buffer)
const pdfResult = await parsePdf(buffer.buffer)typescript
import { parseHwpx, parseHwp, parsePdf, detectFormat } from "kordoc"
// Detect format first
const fmt = detectFormat(buffer.buffer) // "hwpx" | "hwp" | "pdf" | "unknown"
// Parse by format
const hwpxResult = await parseHwpx(buffer.buffer)
const hwpResult = await parseHwp(buffer.buffer)
const pdfResult = await parsePdf(buffer.buffer)Parse Options
解析选项
typescript
import { parse, ParseOptions } from "kordoc"
const result = await parse(buffer.buffer, {
pages: "1-3", // page range string
// pages: [1, 5, 10], // or specific page numbers
ocr: async (pageImage, pageNumber, mimeType) => {
// Pluggable OCR for image-based PDFs
// pageImage: ArrayBuffer of the page image
return await myOcrService.recognize(pageImage)
}
})typescript
import { parse, ParseOptions } from "kordoc"
const result = await parse(buffer.buffer, {
pages: "1-3", // page range string
// pages: [1, 5, 10], // or specific page numbers
ocr: async (pageImage, pageNumber, mimeType) => {
// Pluggable OCR for image-based PDFs
// pageImage: ArrayBuffer of the page image
return await myOcrService.recognize(pageImage)
}
})Working with IRBlocks
使用IRBlocks
typescript
import type { IRBlock, IRBlockType, IRTable, IRCell } from "kordoc"
// IRBlock types: "heading" | "paragraph" | "table" | "list" | "image" | "separator"
for (const block of result.blocks) {
if (block.type === "heading") {
console.log(`H${block.level}: ${block.text}`)
console.log(block.bbox) // { x, y, width, height, page }
}
if (block.type === "table") {
const table = block as IRTable
for (const row of table.rows) {
for (const cell of row) {
console.log(cell.text, cell.colspan, cell.rowspan)
}
}
}
if (block.type === "paragraph") {
console.log(block.text)
console.log(block.style) // InlineStyle: { bold, italic, fontSize, ... }
console.log(block.pageNumber)
}
}typescript
import type { IRBlock, IRBlockType, IRTable, IRCell } from "kordoc"
// IRBlock types: "heading" | "paragraph" | "table" | "list" | "image" | "separator"
for (const block of result.blocks) {
if (block.type === "heading") {
console.log(`H${block.level}: ${block.text}`)
console.log(block.bbox) // { x, y, width, height, page }
}
if (block.type === "table") {
const table = block as IRTable
for (const row of table.rows) {
for (const cell of row) {
console.log(cell.text, cell.colspan, cell.rowspan)
}
}
}
if (block.type === "paragraph") {
console.log(block.text)
console.log(block.style) // InlineStyle: { bold, italic, fontSize, ... }
console.log(block.pageNumber)
}
}Convert Blocks Back to Markdown
将Blocks转换回Markdown
typescript
import { blocksToMarkdown } from "kordoc"
const markdown = blocksToMarkdown(result.blocks)typescript
import { blocksToMarkdown } from "kordoc"
const markdown = blocksToMarkdown(result.blocks)Document Comparison
文档对比
typescript
import { compare } from "kordoc"
const bufA = readFileSync("v1.hwp").buffer
const bufB = readFileSync("v2.hwpx").buffer // cross-format supported
const diff = await compare(bufA, bufB)
console.log(diff.stats)
// { added: 3, removed: 1, modified: 5, unchanged: 42 }
for (const d of diff.diffs) {
// d.type: "added" | "removed" | "modified" | "unchanged"
// d.blockA, d.blockB: IRBlock
// d.cellDiffs: CellDiff[] for table blocks
console.log(d.type, d.blockA?.text ?? d.blockB?.text)
}typescript
import { compare } from "kordoc"
const bufA = readFileSync("v1.hwp").buffer
const bufB = readFileSync("v2.hwpx").buffer // cross-format supported
const diff = await compare(bufA, bufB)
console.log(diff.stats)
// { added: 3, removed: 1, modified: 5, unchanged: 42 }
for (const d of diff.diffs) {
// d.type: "added" | "removed" | "modified" | "unchanged"
// d.blockA, d.blockB: IRBlock
// d.cellDiffs: CellDiff[] for table blocks
console.log(d.type, d.blockA?.text ?? d.blockB?.text)
}Form Field Extraction
表单字段提取
typescript
import { parse, extractFormFields } from "kordoc"
const result = await parse(buffer.buffer)
if (result.success) {
const form = extractFormFields(result.blocks)
console.log(form.confidence) // 0.0–1.0
for (const field of form.fields) {
// { label: "성명", value: "홍길동", row: 0, col: 0 }
console.log(`${field.label}: ${field.value}`)
}
}typescript
import { parse, extractFormFields } from "kordoc"
const result = await parse(buffer.buffer)
if (result.success) {
const form = extractFormFields(result.blocks)
console.log(form.confidence) // 0.0–1.0
for (const field of form.fields) {
// { label: "성명", value: "홍길동", row: 0, col: 0 }
console.log(`${field.label}: ${field.value}`)
}
}Markdown → HWPX Generation
Markdown → HWPX 生成
typescript
import { markdownToHwpx } from "kordoc"
import { writeFileSync } from "fs"
const markdown = `typescript
import { markdownToHwpx } from "kordoc"
import { writeFileSync } from "fs"
const markdown = `제목
제목
본문 내용입니다.
| 구분 | 내용 |
|---|---|
| 항목1 | 값1 |
| 항목2 | 값2 |
| ` |
const hwpxBuffer = await markdownToHwpx(markdown)
writeFileSync("output.hwpx", Buffer.from(hwpxBuffer))
---본문 내용입니다.
| 구분 | 내용 |
|---|---|
| 항목1 | 값1 |
| 항목2 | 값2 |
| ` |
const hwpxBuffer = await markdownToHwpx(markdown)
writeFileSync("output.hwpx", Buffer.from(hwpxBuffer))
---CLI Usage
CLI 使用方法
bash
undefinedbash
undefinedBasic conversion — output to stdout
Basic conversion — output to stdout
npx kordoc document.hwpx
npx kordoc document.hwpx
Save to file
Save to file
npx kordoc document.hwp -o output.md
npx kordoc document.hwp -o output.md
Batch convert all PDFs to a directory
Batch convert all PDFs to a directory
npx kordoc *.pdf -d ./converted/
npx kordoc *.pdf -d ./converted/
JSON output with blocks + metadata
JSON output with blocks + metadata
npx kordoc report.hwpx --format json
npx kordoc report.hwpx --format json
Parse specific pages only
Parse specific pages only
npx kordoc report.hwpx --pages 1-3
npx kordoc report.hwpx --pages 1-3
Watch mode — auto-convert new files
Watch mode — auto-convert new files
npx kordoc watch ./incoming -d ./output
npx kordoc watch ./incoming -d ./output
Watch with webhook notification on conversion
Watch with webhook notification on conversion
npx kordoc watch ./docs --webhook https://api.example.com/hook
---npx kordoc watch ./docs --webhook https://api.example.com/hook
---MCP Server Setup
MCP 服务器设置
Add to your MCP config (Claude Desktop, Cursor, Windsurf):
json
{
"mcpServers": {
"kordoc": {
"command": "npx",
"args": ["-y", "kordoc-mcp"]
}
}
}添加到你的MCP配置(Claude Desktop、Cursor、Windsurf):
json
{
"mcpServers": {
"kordoc": {
"command": "npx",
"args": ["-y", "kordoc-mcp"]
}
}
}Available MCP Tools
可用MCP工具
| Tool | Description |
|---|---|
| Parse HWP/HWPX/PDF → Markdown + metadata + outline + warnings |
| Detect file format via magic bytes |
| Extract only metadata (fast, no full parse) |
| Parse a specific page range |
| Extract the Nth table from a document |
| Diff two documents (cross-format supported) |
| Extract form fields as structured JSON |
| 工具 | 描述 |
|---|---|
| 将HWP/HWPX/PDF解析为Markdown + 元数据 + 大纲 + 警告信息 |
| 通过魔术字节检测文件格式 |
| 仅提取元数据(速度快,无需全量解析) |
| 解析指定页面范围 |
| 提取文档中的第N个表格 |
| 对比两份文档(支持跨格式) |
| 提取表单字段为结构化JSON |
TypeScript Types Reference
TypeScript 类型参考
typescript
import type {
// Results
ParseResult, ParseSuccess, ParseFailure,
ErrorCode, // "ENCRYPTED" | "ZIP_BOMB" | "IMAGE_BASED_PDF" | ...
// Blocks
IRBlock, IRBlockType, IRTable, IRCell, CellContext,
// Metadata & structure
DocumentMetadata, OutlineItem,
ParseWarning, WarningCode,
BoundingBox, // { x, y, width, height, page }
InlineStyle, // { bold, italic, fontSize, color, ... }
// Options
ParseOptions, FileType,
OcrProvider, // async (image, pageNum, mime) => string
WatchOptions,
// Diff
DiffResult, BlockDiff, CellDiff, DiffChangeType,
// Forms
FormField, FormResult,
} from "kordoc"typescript
import type {
// Results
ParseResult, ParseSuccess, ParseFailure,
ErrorCode, // "ENCRYPTED" | "ZIP_BOMB" | "IMAGE_BASED_PDF" | ...
// Blocks
IRBlock, IRBlockType, IRTable, IRCell, CellContext,
// Metadata & structure
DocumentMetadata, OutlineItem,
ParseWarning, WarningCode,
BoundingBox, // { x, y, width, height, page }
InlineStyle, // { bold, italic, fontSize, color, ... }
// Options
ParseOptions, FileType,
OcrProvider, // async (image, pageNum, mime) => string
WatchOptions,
// Diff
DiffResult, BlockDiff, CellDiff, DiffChangeType,
// Forms
FormField, FormResult,
} from "kordoc"Common Patterns
常见使用场景
Batch Process Files with Error Handling
带错误处理的批量文件处理
typescript
import { parse, detectFormat } from "kordoc"
import { readFileSync } from "fs"
import { glob } from "glob"
const files = await glob("./docs/**/*.{hwp,hwpx,pdf}")
for (const file of files) {
const buffer = readFileSync(file)
const fmt = detectFormat(buffer.buffer)
if (fmt === "unknown") {
console.warn(`Skipping unknown format: ${file}`)
continue
}
const result = await parse(buffer.buffer)
if (!result.success) {
if (result.code === "ENCRYPTED") {
console.warn(`Encrypted, skipping: ${file}`)
} else if (result.code === "IMAGE_BASED_PDF") {
console.warn(`Image-based PDF needs OCR: ${file}`)
} else {
console.error(`Failed: ${file} — ${result.error}`)
}
continue
}
console.log(`Parsed ${file}: ${result.blocks.length} blocks`)
}typescript
import { parse, detectFormat } from "kordoc"
import { readFileSync } from "fs"
import { glob } from "glob"
const files = await glob("./docs/**/*.{hwp,hwpx,pdf}")
for (const file of files) {
const buffer = readFileSync(file)
const fmt = detectFormat(buffer.buffer)
if (fmt === "unknown") {
console.warn(`Skipping unknown format: ${file}`)
continue
}
const result = await parse(buffer.buffer)
if (!result.success) {
if (result.code === "ENCRYPTED") {
console.warn(`Encrypted, skipping: ${file}`)
} else if (result.code === "IMAGE_BASED_PDF") {
console.warn(`Image-based PDF needs OCR: ${file}`)
} else {
console.error(`Failed: ${file} — ${result.error}`)
}
continue
}
console.log(`Parsed ${file}: ${result.blocks.length} blocks`)
}Extract All Tables from a Document
提取文档中的所有表格
typescript
import { parse } from "kordoc"
import type { IRTable } from "kordoc"
const result = await parse(buffer.buffer)
if (result.success) {
const tables = result.blocks.filter(b => b.type === "table") as IRTable[]
tables.forEach((table, i) => {
console.log(`\n--- Table ${i + 1} ---`)
for (const row of table.rows) {
const cells = row.map(cell => cell.text.trim()).join(" | ")
console.log(`| ${cells} |`)
}
})
}typescript
import { parse } from "kordoc"
import type { IRTable } from "kordoc"
const result = await parse(buffer.buffer)
if (result.success) {
const tables = result.blocks.filter(b => b.type === "table") as IRTable[]
tables.forEach((table, i) => {
console.log(`\n--- Table ${i + 1} ---`)
for (const row of table.rows) {
const cells = row.map(cell => cell.text.trim()).join(" | ")
console.log(`| ${cells} |`)
}
})
}OCR with Tesseract.js
结合Tesseract.js使用OCR
typescript
import { parse } from "kordoc"
import Tesseract from "tesseract.js"
const result = await parse(buffer.buffer, {
ocr: async (pageImage, pageNumber, mimeType) => {
const blob = new Blob([pageImage], { type: mimeType })
const url = URL.createObjectURL(blob)
const { data } = await Tesseract.recognize(url, "kor+eng")
URL.revokeObjectURL(url)
return data.text
}
})typescript
import { parse } from "kordoc"
import Tesseract from "tesseract.js"
const result = await parse(buffer.buffer, {
ocr: async (pageImage, pageNumber, mimeType) => {
const blob = new Blob([pageImage], { type: mimeType })
const url = URL.createObjectURL(blob)
const { data } = await Tesseract.recognize(url, "kor+eng")
URL.revokeObjectURL(url)
return data.text
}
})Watch Mode Programmatic API
编程式调用监听模式
typescript
import { watch } from "kordoc"
const watcher = watch("./incoming", {
output: "./converted",
webhook: process.env.WEBHOOK_URL,
onFile: async (file, result) => {
if (result.success) {
console.log(`Converted: ${file}`)
}
}
})
// Stop watching
watcher.stop()typescript
import { watch } from "kordoc"
const watcher = watch("./incoming", {
output: "./converted",
webhook: process.env.WEBHOOK_URL,
onFile: async (file, result) => {
if (result.success) {
console.log(`Converted: ${file}`)
}
}
})
// Stop watching
watcher.stop()Troubleshooting
故障排查
buffer.bufferBufferArrayBufferBufferreadFileSync("file").buffer.bufferUint8ArrayPDF tables not detected — Line-based detection requires pdfjs-dist installed. Install it: . For borderless tables, kordoc uses cluster-based heuristics automatically.
npm install pdfjs-dist"IMAGE_BASED_PDF"ocr"ENCRYPTED"Korean characters garbled in output — Ensure your terminal/file uses UTF-8 encoding. kordoc outputs UTF-8 Markdown by default.
Large files are slow — Use option to parse only needed pages: . Metadata-only extraction is faster: MCP tool or check directly.
pagesparse(buf, { pages: "1-5" })parse_metadataresult.metadataHWP table columns wrong — Update to v1.6.1+. Earlier versions had a 2-byte offset misalignment in LIST_HEADER parsing causing column explosion.
buffer.bufferBufferArrayBufferBufferreadFileSync("file").bufferUint8Array.bufferPDF表格检测失败 — 基于线条的检测需要安装pdfjs-dist,请执行:。对于无边框表格,kordoc会自动使用基于聚类的启发式算法处理。
npm install pdfjs-dist"IMAGE_BASED_PDF"ocr"ENCRYPTED"输出韩文字符乱码 — 请确保你的终端/文件使用UTF-8编码,kordoc默认输出UTF-8格式的Markdown。
大文件解析速度慢 — 使用选项仅解析需要的页面:。仅提取元数据速度更快,可使用 MCP工具或者直接读取。
pagesparse(buf, { pages: "1-5" })parse_metadataresult.metadataHWP表格列数错误 — 请升级到v1.6.1+版本,更早版本在LIST_HEADER解析时存在2字节偏移问题,会导致列数异常。