addon-docling-legal-chunk-embed

Add-on: Docling Legal Chunk + Embed

Use this skill when a project needs to ingest legal PDFs into markdown and clause-level chunks suitable for retrieval and downstream clause reasoning.

Compatibility

  • Works with `architect-python-uv-batch`.
  • Works with `architect-python-uv-fastapi-sqlalchemy` (worker or async job path).
  • Commonly paired with `addon-rag-ingestion-pipeline`.

Inputs

Collect:
  • `LEGAL_SOURCE_DIR`: default `data/inbox/legal`.
  • `CLAUSE_MAX_CHARS`: default `1400`.
  • `CLAUSE_OVERLAP_CHARS`: default `120`.
  • `EMBED_PROVIDER`: `sentence-transformers` | `openai`.
  • `OUTPUT_MODE`: `markdown+json` (default) | `json-only`.

Integration Workflow

  1. Add dependencies:
```bash
uv add docling orjson
```
  • For local embeddings:
```bash
uv add sentence-transformers
```
  • For OpenAI embeddings:
```bash
uv add openai
```
  2. Add modules:
```text
src/{{MODULE_NAME}}/rag/legal/docling_extract.py
src/{{MODULE_NAME}}/rag/legal/clause_chunk.py
src/{{MODULE_NAME}}/rag/legal/embed_index.py
src/{{MODULE_NAME}}/rag/legal/types.py
```
  3. Add CLI commands:
```bash
uv run {{PROJECT_NAME}} legal-extract --source data/inbox/legal --out data/processed/legal
uv run {{PROJECT_NAME}} legal-index --source data/processed/legal --out data/index/legal-index.json
```
  4. Enforce clause-aware chunking:
  • Prefer section/heading boundaries first (`Article`, `Section`, numbered clauses).
  • Fall back to paragraph-level splitting.
  • Keep stable clause ids and citation metadata (`source_path`, `page`, `section`, `clause_id`).

Required Templates

src/{{MODULE_NAME}}/rag/legal/types.py

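```python
"""Shared data types for the legal ingestion pipeline."""

from pydantic import BaseModel, Field


class LegalClause(BaseModel):
    """One clause-level chunk plus the citation metadata needed for legal review."""

    clause_id: str  # deterministic, so re-ingestion stays idempotent
    source_path: str
    section: str | None = None
    page: int | None = None
    content: str
    metadata: dict[str, str] = Field(default_factory=dict)
```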

src/{{MODULE_NAME}}/rag/legal/clause_chunk.py

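```python
"""Clause-aware splitting of extracted legal markdown."""

import re

# Matches a block that opens a new clause, e.g. "Article 5", "Section 2.1",
# or "## Clause 7-A". The optional "#" prefix covers markdown headings.
SECTION_RE = re.compile(
    r"^(?:#{1,6}\s*)?(?:article|section|clause)\s+[\w.-]+", re.IGNORECASE
)


def split_legal_clauses(markdown_text: str, max_chars: int = 1400) -> list[str]:
    """Split markdown into ordered clause chunks.

    A block matching SECTION_RE always starts a new chunk; other blocks are
    appended to the current chunk until adding one would exceed max_chars.
    A single block longer than max_chars is kept whole rather than cut.
    """
    blocks = [b.strip() for b in markdown_text.split("\n\n") if b.strip()]
    clauses: list[str] = []
    buf = ""
    for block in blocks:
        if buf and SECTION_RE.match(block):
            # Heading boundary: flush so the heading leads its own chunk.
            clauses.append(buf)
            buf = block
        elif buf and len(buf) + len(block) + 2 > max_chars:
            # Size limit: flush, then start the next chunk with this block.
            clauses.append(buf)
            buf = block
        else:
            buf = f"{buf}\n\n{block}" if buf else block
    if buf:
        clauses.append(buf)
    return clauses
```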
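
The splitter above takes no overlap parameter, and how `CLAUSE_OVERLAP_CHARS` should be applied is not spelled out. One possible reading, sketched below under that assumption, is to prefix each chunk after the first with the tail of the previous chunk. The `add_overlap` helper is hypothetical and works on any ordered list of chunk strings.

```python
import re


def add_overlap(chunks: list[str], overlap_chars: int = 120) -> list[str]:
    """Prefix every chunk after the first with the tail of its predecessor.

    Order is preserved and the input list is not modified.
    """
    if overlap_chars <= 0:
        return list(chunks)
    out: list[str] = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            out.append(chunk)
            continue
        prev = chunks[i - 1]
        tail = prev[-overlap_chars:]
        start = len(prev) - overlap_chars
        # If the cut lands inside a word, drop the partial word so the
        # overlap begins at a word boundary.
        if start > 0 and not prev[start - 1].isspace() and not prev[start].isspace():
            parts = re.split(r"\s+", tail, maxsplit=1)
            if len(parts) == 2:
                tail = parts[1]
        tail = tail.strip()
        out.append(f"{tail}\n\n{chunk}" if tail else chunk)
    return out
```

If the overlapped text is what gets embedded, it is probably worth storing the original chunk as `content` so that citations quote only the clause itself; that split is a design choice, not something the templates prescribe.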

Guardrails

  • Documentation contract for generated code:
    • Python: write module docstrings and docstrings for public classes, methods, and functions.
    • Next.js/TypeScript: write JSDoc for exported components, hooks, utilities, and route handlers.
    • Add concise rationale comments only for non-obvious logic, invariants, or safety constraints.
    • Apply this contract even when using the template snippets above; expand templates as needed.
  • Preserve legal ordering and section labels; do not reorder clauses.
  • Keep extracted markdown for auditability before embedding.
  • Include deterministic clause ids to support re-ingestion idempotency.
  • Never drop citation metadata needed for legal review.
  • Keep PII handling configurable; redact only when explicitly required.
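
The deterministic-id guardrail can be met in several ways; the sketch below shows one of them, hashing the source path, the clause's position, and its whitespace-normalized content. The `make_clause_id` function, the `clause-` prefix, and the 16-character digest length are all assumptions made for illustration.

```python
import hashlib
import unicodedata


def _normalize(text: str) -> str:
    """Apply NFKC normalization and collapse whitespace runs."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def make_clause_id(source_path: str, ordinal: int, content: str) -> str:
    """Return the same id for the same (path, position, content) on every run.

    The ordinal keeps identical boilerplate clauses within one document distinct.
    """
    path = source_path.replace("\\", "/")  # same id on Windows and POSIX paths
    key = f"{path}\n{ordinal}\n{_normalize(content)}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"clause-{digest[:16]}"
```

The tradeoff in this variant is that inserting a clause near the top of a document shifts the ordinals, and therefore the ids, of everything after it. Hashing the section label instead of the ordinal would avoid that, at the cost of collisions when a section repeats identical text.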

Validation Checklist

  • Confirm generated code includes required docstrings/JSDoc and rationale comments for non-obvious logic.
```bash
uv run {{PROJECT_NAME}} legal-extract --source data/inbox/legal --out data/processed/legal
uv run {{PROJECT_NAME}} legal-index --source data/processed/legal --out data/index/legal-index.json
uv run pytest -q
```
Fallback (`offline-smoke`):
```bash
test -f src/{{MODULE_NAME}}/rag/legal/docling_extract.py
test -f src/{{MODULE_NAME}}/rag/legal/clause_chunk.py
test -f src/{{MODULE_NAME}}/rag/legal/embed_index.py
```

Decision Justification Rule

  • Every non-trivial decision must include a concrete justification.
  • Capture the alternatives considered and why they were rejected.
  • State tradeoffs and residual risks for the chosen option.
  • If justification is missing, treat the task as incomplete and surface it as a blocker.