kordoc-korean-document-parser

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

kordoc Korean Document Parser

kordoc 韩文文档解析器

Skill by ara.so — Daily 2026 Skills collection.
kordoc is a TypeScript library and CLI for parsing Korean government documents (HWP 5.x, HWPX, PDF) into Markdown and structured
IRBlock[]
data. It handles proprietary HWP binary formats, table extraction, form field recognition, document diffing, and reverse Markdown→HWPX generation.

Skill 由 ara.so 开发 — 属于2026年度Skill合集。
kordoc是一款TypeScript库和CLI工具,可将韩国政府文档(HWP 5.x、HWPX、PDF)解析为Markdown和结构化的
IRBlock[]
数据。它支持处理专有HWP二进制格式、表格提取、表单字段识别、文档对比,以及反向的Markdown→HWPX生成。

Installation

安装

bash
undefined
bash
undefined

Core library

Core library

npm install kordoc
npm install kordoc

PDF support (optional peer dependency)

PDF support (optional peer dependency)

npm install pdfjs-dist
npm install pdfjs-dist

CLI (no install needed)

CLI (no install needed)

npx kordoc document.hwpx

---
npx kordoc document.hwpx

---

Core API

核心API

Auto-detect and Parse Any Document

自动检测并解析任意文档

typescript
import { parse } from "kordoc"
import { readFileSync } from "fs"

const buffer = readFileSync("document.hwpx")
const result = await parse(buffer.buffer) // ArrayBuffer required

if (result.success) {
  console.log(result.markdown)   // string: full Markdown
  console.log(result.blocks)     // IRBlock[]: structured data
  console.log(result.metadata)   // { title, author, createdAt, pageCount, ... }
  console.log(result.outline)    // OutlineItem[]: document structure
  console.log(result.warnings)   // ParseWarning[]: skipped elements
} else {
  console.error(result.error)    // string message
  console.error(result.code)     // ErrorCode: "ENCRYPTED" | "ZIP_BOMB" | "IMAGE_BASED_PDF" | ...
}
typescript
import { parse } from "kordoc"
import { readFileSync } from "fs"

const buffer = readFileSync("document.hwpx")
const result = await parse(buffer.buffer) // ArrayBuffer required

if (result.success) {
  console.log(result.markdown)   // string: full Markdown
  console.log(result.blocks)     // IRBlock[]: structured data
  console.log(result.metadata)   // { title, author, createdAt, pageCount, ... }
  console.log(result.outline)    // OutlineItem[]: document structure
  console.log(result.warnings)   // ParseWarning[]: skipped elements
} else {
  console.error(result.error)    // string message
  console.error(result.code)     // ErrorCode: "ENCRYPTED" | "ZIP_BOMB" | "IMAGE_BASED_PDF" | ...
}

Format-Specific Parsers

指定格式解析器

typescript
import { parseHwpx, parseHwp, parsePdf, detectFormat } from "kordoc"

// Detect format first
const fmt = detectFormat(buffer.buffer) // "hwpx" | "hwp" | "pdf" | "unknown"

// Parse by format
const hwpxResult = await parseHwpx(buffer.buffer)
const hwpResult  = await parseHwp(buffer.buffer)
const pdfResult  = await parsePdf(buffer.buffer)
typescript
import { parseHwpx, parseHwp, parsePdf, detectFormat } from "kordoc"

// Detect format first
const fmt = detectFormat(buffer.buffer) // "hwpx" | "hwp" | "pdf" | "unknown"

// Parse by format
const hwpxResult = await parseHwpx(buffer.buffer)
const hwpResult  = await parseHwp(buffer.buffer)
const pdfResult  = await parsePdf(buffer.buffer)

Parse Options

解析选项

typescript
import { parse, ParseOptions } from "kordoc"

const result = await parse(buffer.buffer, {
  pages: "1-3",          // page range string
  // pages: [1, 5, 10], // or specific page numbers
  ocr: async (pageImage, pageNumber, mimeType) => {
    // Pluggable OCR for image-based PDFs
    // pageImage: ArrayBuffer of the page image
    return await myOcrService.recognize(pageImage)
  }
})

typescript
import { parse, ParseOptions } from "kordoc"

const result = await parse(buffer.buffer, {
  pages: "1-3",          // page range string
  // pages: [1, 5, 10], // or specific page numbers
  ocr: async (pageImage, pageNumber, mimeType) => {
    // Pluggable OCR for image-based PDFs
    // pageImage: ArrayBuffer of the page image
    return await myOcrService.recognize(pageImage)
  }
})

Working with IRBlocks

使用IRBlocks

typescript
import type { IRBlock, IRBlockType, IRTable, IRCell } from "kordoc"

// IRBlock types: "heading" | "paragraph" | "table" | "list" | "image" | "separator"
for (const block of result.blocks) {
  if (block.type === "heading") {
    console.log(`H${block.level}: ${block.text}`)
    console.log(block.bbox)       // { x, y, width, height, page }
  }

  if (block.type === "table") {
    const table = block as IRTable
    for (const row of table.rows) {
      for (const cell of row) {
        console.log(cell.text, cell.colspan, cell.rowspan)
      }
    }
  }

  if (block.type === "paragraph") {
    console.log(block.text)
    console.log(block.style)      // InlineStyle: { bold, italic, fontSize, ... }
    console.log(block.pageNumber)
  }
}
typescript
import type { IRBlock, IRBlockType, IRTable, IRCell } from "kordoc"

// IRBlock types: "heading" | "paragraph" | "table" | "list" | "image" | "separator"
for (const block of result.blocks) {
  if (block.type === "heading") {
    console.log(`H${block.level}: ${block.text}`)
    console.log(block.bbox)       // { x, y, width, height, page }
  }

  if (block.type === "table") {
    const table = block as IRTable
    for (const row of table.rows) {
      for (const cell of row) {
        console.log(cell.text, cell.colspan, cell.rowspan)
      }
    }
  }

  if (block.type === "paragraph") {
    console.log(block.text)
    console.log(block.style)      // InlineStyle: { bold, italic, fontSize, ... }
    console.log(block.pageNumber)
  }
}

Convert Blocks Back to Markdown

将Blocks转换回Markdown

typescript
import { blocksToMarkdown } from "kordoc"

const markdown = blocksToMarkdown(result.blocks)

typescript
import { blocksToMarkdown } from "kordoc"

const markdown = blocksToMarkdown(result.blocks)

Document Comparison

文档对比

typescript
import { compare } from "kordoc"

const bufA = readFileSync("v1.hwp").buffer
const bufB = readFileSync("v2.hwpx").buffer  // cross-format supported

const diff = await compare(bufA, bufB)

console.log(diff.stats)
// { added: 3, removed: 1, modified: 5, unchanged: 42 }

for (const d of diff.diffs) {
  // d.type: "added" | "removed" | "modified" | "unchanged"
  // d.blockA, d.blockB: IRBlock
  // d.cellDiffs: CellDiff[] for table blocks
  console.log(d.type, d.blockA?.text ?? d.blockB?.text)
}

typescript
import { compare } from "kordoc"

const bufA = readFileSync("v1.hwp").buffer
const bufB = readFileSync("v2.hwpx").buffer  // cross-format supported

const diff = await compare(bufA, bufB)

console.log(diff.stats)
// { added: 3, removed: 1, modified: 5, unchanged: 42 }

for (const d of diff.diffs) {
  // d.type: "added" | "removed" | "modified" | "unchanged"
  // d.blockA, d.blockB: IRBlock
  // d.cellDiffs: CellDiff[] for table blocks
  console.log(d.type, d.blockA?.text ?? d.blockB?.text)
}

Form Field Extraction

表单字段提取

typescript
import { parse, extractFormFields } from "kordoc"

const result = await parse(buffer.buffer)
if (result.success) {
  const form = extractFormFields(result.blocks)

  console.log(form.confidence)  // 0.0–1.0
  for (const field of form.fields) {
    // { label: "성명", value: "홍길동", row: 0, col: 0 }
    console.log(`${field.label}: ${field.value}`)
  }
}

typescript
import { parse, extractFormFields } from "kordoc"

const result = await parse(buffer.buffer)
if (result.success) {
  const form = extractFormFields(result.blocks)

  console.log(form.confidence)  // 0.0–1.0
  for (const field of form.fields) {
    // { label: "성명", value: "홍길동", row: 0, col: 0 }
    console.log(`${field.label}: ${field.value}`)
  }
}

Markdown → HWPX Generation

Markdown → HWPX 生成

typescript
import { markdownToHwpx } from "kordoc"
import { writeFileSync } from "fs"

const markdown = `
typescript
import { markdownToHwpx } from "kordoc"
import { writeFileSync } from "fs"

const markdown = `

제목

제목

본문 내용입니다.
구분내용
항목1값1
항목2값2
`
const hwpxBuffer = await markdownToHwpx(markdown) writeFileSync("output.hwpx", Buffer.from(hwpxBuffer))

---
본문 내용입니다.
구분내용
항목1값1
항목2값2
`
const hwpxBuffer = await markdownToHwpx(markdown) writeFileSync("output.hwpx", Buffer.from(hwpxBuffer))

---

CLI Usage

CLI 使用方法

bash
undefined
bash
undefined

Basic conversion — output to stdout

Basic conversion — output to stdout

npx kordoc document.hwpx
npx kordoc document.hwpx

Save to file

Save to file

npx kordoc document.hwp -o output.md
npx kordoc document.hwp -o output.md

Batch convert all PDFs to a directory

Batch convert all PDFs to a directory

npx kordoc *.pdf -d ./converted/
npx kordoc *.pdf -d ./converted/

JSON output with blocks + metadata

JSON output with blocks + metadata

npx kordoc report.hwpx --format json
npx kordoc report.hwpx --format json

Parse specific pages only

Parse specific pages only

npx kordoc report.hwpx --pages 1-3
npx kordoc report.hwpx --pages 1-3

Watch mode — auto-convert new files

Watch mode — auto-convert new files

npx kordoc watch ./incoming -d ./output
npx kordoc watch ./incoming -d ./output

Watch with webhook notification on conversion

Watch with webhook notification on conversion

npx kordoc watch ./docs --webhook https://api.example.com/hook

---
npx kordoc watch ./docs --webhook https://api.example.com/hook

---

MCP Server Setup

MCP 服务器设置

Add to your MCP config (Claude Desktop, Cursor, Windsurf):
json
{
  "mcpServers": {
    "kordoc": {
      "command": "npx",
      "args": ["-y", "kordoc-mcp"]
    }
  }
}
添加到你的MCP配置(Claude Desktop、Cursor、Windsurf):
json
{
  "mcpServers": {
    "kordoc": {
      "command": "npx",
      "args": ["-y", "kordoc-mcp"]
    }
  }
}

Available MCP Tools

可用MCP工具

ToolDescription
parse_document
Parse HWP/HWPX/PDF → Markdown + metadata + outline + warnings
detect_format
Detect file format via magic bytes
parse_metadata
Extract only metadata (fast, no full parse)
parse_pages
Parse a specific page range
parse_table
Extract the Nth table from a document
compare_documents
Diff two documents (cross-format supported)
parse_form
Extract form fields as structured JSON

工具描述
parse_document
将HWP/HWPX/PDF解析为Markdown + 元数据 + 大纲 + 警告信息
detect_format
通过魔术字节检测文件格式
parse_metadata
仅提取元数据(速度快,无需全量解析)
parse_pages
解析指定页面范围
parse_table
提取文档中的第N个表格
compare_documents
对比两份文档(支持跨格式)
parse_form
提取表单字段为结构化JSON

TypeScript Types Reference

TypeScript 类型参考

typescript
import type {
  // Results
  ParseResult, ParseSuccess, ParseFailure,
  ErrorCode,        // "ENCRYPTED" | "ZIP_BOMB" | "IMAGE_BASED_PDF" | ...

  // Blocks
  IRBlock, IRBlockType, IRTable, IRCell, CellContext,

  // Metadata & structure
  DocumentMetadata, OutlineItem,
  ParseWarning, WarningCode,
  BoundingBox,      // { x, y, width, height, page }
  InlineStyle,      // { bold, italic, fontSize, color, ... }

  // Options
  ParseOptions, FileType,
  OcrProvider,      // async (image, pageNum, mime) => string
  WatchOptions,

  // Diff
  DiffResult, BlockDiff, CellDiff, DiffChangeType,

  // Forms
  FormField, FormResult,
} from "kordoc"

typescript
import type {
  // Results
  ParseResult, ParseSuccess, ParseFailure,
  ErrorCode,        // "ENCRYPTED" | "ZIP_BOMB" | "IMAGE_BASED_PDF" | ...

  // Blocks
  IRBlock, IRBlockType, IRTable, IRCell, CellContext,

  // Metadata & structure
  DocumentMetadata, OutlineItem,
  ParseWarning, WarningCode,
  BoundingBox,      // { x, y, width, height, page }
  InlineStyle,      // { bold, italic, fontSize, color, ... }

  // Options
  ParseOptions, FileType,
  OcrProvider,      // async (image, pageNum, mime) => string
  WatchOptions,

  // Diff
  DiffResult, BlockDiff, CellDiff, DiffChangeType,

  // Forms
  FormField, FormResult,
} from "kordoc"

Common Patterns

常见使用场景

Batch Process Files with Error Handling

带错误处理的批量文件处理

typescript
import { parse, detectFormat } from "kordoc"
import { readFileSync } from "fs"
import { glob } from "glob"

const files = await glob("./docs/**/*.{hwp,hwpx,pdf}")

for (const file of files) {
  const buffer = readFileSync(file)
  const fmt = detectFormat(buffer.buffer)

  if (fmt === "unknown") {
    console.warn(`Skipping unknown format: ${file}`)
    continue
  }

  const result = await parse(buffer.buffer)

  if (!result.success) {
    if (result.code === "ENCRYPTED") {
      console.warn(`Encrypted, skipping: ${file}`)
    } else if (result.code === "IMAGE_BASED_PDF") {
      console.warn(`Image-based PDF needs OCR: ${file}`)
    } else {
      console.error(`Failed: ${file}${result.error}`)
    }
    continue
  }

  console.log(`Parsed ${file}: ${result.blocks.length} blocks`)
}
typescript
import { parse, detectFormat } from "kordoc"
import { readFileSync } from "fs"
import { glob } from "glob"

const files = await glob("./docs/**/*.{hwp,hwpx,pdf}")

for (const file of files) {
  const buffer = readFileSync(file)
  const fmt = detectFormat(buffer.buffer)

  if (fmt === "unknown") {
    console.warn(`Skipping unknown format: ${file}`)
    continue
  }

  const result = await parse(buffer.buffer)

  if (!result.success) {
    if (result.code === "ENCRYPTED") {
      console.warn(`Encrypted, skipping: ${file}`)
    } else if (result.code === "IMAGE_BASED_PDF") {
      console.warn(`Image-based PDF needs OCR: ${file}`)
    } else {
      console.error(`Failed: ${file}${result.error}`)
    }
    continue
  }

  console.log(`Parsed ${file}: ${result.blocks.length} blocks`)
}

Extract All Tables from a Document

提取文档中的所有表格

typescript
import { parse } from "kordoc"
import type { IRTable } from "kordoc"

const result = await parse(buffer.buffer)
if (result.success) {
  const tables = result.blocks.filter(b => b.type === "table") as IRTable[]

  tables.forEach((table, i) => {
    console.log(`\n--- Table ${i + 1} ---`)
    for (const row of table.rows) {
      const cells = row.map(cell => cell.text.trim()).join(" | ")
      console.log(`| ${cells} |`)
    }
  })
}
typescript
import { parse } from "kordoc"
import type { IRTable } from "kordoc"

const result = await parse(buffer.buffer)
if (result.success) {
  const tables = result.blocks.filter(b => b.type === "table") as IRTable[]

  tables.forEach((table, i) => {
    console.log(`\n--- Table ${i + 1} ---`)
    for (const row of table.rows) {
      const cells = row.map(cell => cell.text.trim()).join(" | ")
      console.log(`| ${cells} |`)
    }
  })
}

OCR with Tesseract.js

结合Tesseract.js使用OCR

typescript
import { parse } from "kordoc"
import Tesseract from "tesseract.js"

const result = await parse(buffer.buffer, {
  ocr: async (pageImage, pageNumber, mimeType) => {
    const blob = new Blob([pageImage], { type: mimeType })
    const url = URL.createObjectURL(blob)
    const { data } = await Tesseract.recognize(url, "kor+eng")
    URL.revokeObjectURL(url)
    return data.text
  }
})
typescript
import { parse } from "kordoc"
import Tesseract from "tesseract.js"

const result = await parse(buffer.buffer, {
  ocr: async (pageImage, pageNumber, mimeType) => {
    const blob = new Blob([pageImage], { type: mimeType })
    const url = URL.createObjectURL(blob)
    const { data } = await Tesseract.recognize(url, "kor+eng")
    URL.revokeObjectURL(url)
    return data.text
  }
})

Watch Mode Programmatic API

编程式调用监听模式

typescript
import { watch } from "kordoc"

const watcher = watch("./incoming", {
  output: "./converted",
  webhook: process.env.WEBHOOK_URL,
  onFile: async (file, result) => {
    if (result.success) {
      console.log(`Converted: ${file}`)
    }
  }
})

// Stop watching
watcher.stop()

typescript
import { watch } from "kordoc"

const watcher = watch("./incoming", {
  output: "./converted",
  webhook: process.env.WEBHOOK_URL,
  onFile: async (file, result) => {
    if (result.success) {
      console.log(`Converted: ${file}`)
    }
  }
})

// Stop watching
watcher.stop()

Troubleshooting

故障排查

buffer.buffer
vs
Buffer
— kordoc requires
ArrayBuffer
, not Node.js
Buffer
. Always pass
readFileSync("file").buffer
or use
.buffer
on a
Uint8Array
.
PDF tables not detected — Line-based detection requires pdfjs-dist installed. Install it:
npm install pdfjs-dist
. For borderless tables, kordoc uses cluster-based heuristics automatically.
"IMAGE_BASED_PDF"
error
— The PDF contains scanned images with no text layer. Provide an
ocr
function in parse options.
"ENCRYPTED"
error
— HWP DRM/password-protected files cannot be parsed without the decryption key. No workaround.
Korean characters garbled in output — Ensure your terminal/file uses UTF-8 encoding. kordoc outputs UTF-8 Markdown by default.
Large files are slow — Use
pages
option to parse only needed pages:
parse(buf, { pages: "1-5" })
. Metadata-only extraction is faster:
parse_metadata
MCP tool or check
result.metadata
directly.
HWP table columns wrong — Update to v1.6.1+. Earlier versions had a 2-byte offset misalignment in LIST_HEADER parsing causing column explosion.
buffer.buffer
Buffer
的区别
— kordoc要求传入
ArrayBuffer
而非Node.js
Buffer
。请始终传递
readFileSync("file").buffer
或者对
Uint8Array
调用
.buffer
属性。
PDF表格检测失败 — 基于线条的检测需要安装pdfjs-dist,请执行:
npm install pdfjs-dist
。对于无边框表格,kordoc会自动使用基于聚类的启发式算法处理。
"IMAGE_BASED_PDF"
错误
— 该PDF是扫描件,无文本层,请在解析选项中传入
ocr
函数。
"ENCRYPTED"
错误
— 受DRM/密码保护的HWP文件没有解密密钥无法解析,暂无解决方案。
输出韩文字符乱码 — 请确保你的终端/文件使用UTF-8编码,kordoc默认输出UTF-8格式的Markdown。
大文件解析速度慢 — 使用
pages
选项仅解析需要的页面:
parse(buf, { pages: "1-5" })
。仅提取元数据速度更快,可使用
parse_metadata
MCP工具或者直接读取
result.metadata
HWP表格列数错误 — 请升级到v1.6.1+版本,更早版本在LIST_HEADER解析时存在2字节偏移问题,会导致列数异常。