addon-docling-legal-chunk-embed

Add-on: Docling Legal Chunk + Embed

Use this skill when a project needs legal-focused document ingestion from PDF into markdown/chunks suitable for retrieval and downstream clause reasoning.

Compatibility

- Works with `architect-python-uv-batch`.
- Works with `architect-python-uv-fastapi-sqlalchemy` (worker or async job path).
- Commonly paired with `addon-rag-ingestion-pipeline`.

Inputs

Collect:

- `LEGAL_SOURCE_DIR`: default `data/inbox/legal`.
- `CLAUSE_MAX_CHARS`: default `1400`.
- `CLAUSE_OVERLAP_CHARS`: default `120`.
- `EMBED_PROVIDER`: `sentence-transformers` | `openai`.
- `OUTPUT_MODE`: `markdown+json` (default) | `json-only`.

Integration Workflow

Add dependencies:

```bash
uv add docling orjson
```

For local embeddings:

```bash
uv add sentence-transformers
```

For OpenAI embeddings:

```bash
uv add openai
```
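The two optional dependencies map onto the two `EMBED_PROVIDER` values. As a rough sketch of how `embed_index.py` might pick between them, the code below imports each library lazily so that only the selected provider has to be installed. `make_embedder` and both model names are assumptions; the source does not name a model.

```python
"""Hypothetical provider dispatch for embed_index.py."""
from typing import Callable, Sequence

Embedder = Callable[[Sequence[str]], list[list[float]]]

# Assumed model names; the source does not specify any.
LOCAL_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
OPENAI_MODEL = "text-embedding-3-small"


def make_embedder(provider: str) -> Embedder:
    """Return a function that maps a batch of texts to embedding vectors."""
    if provider == "sentence-transformers":
        # Lazy import: the OpenAI path should not require the local dependency.
        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer(LOCAL_MODEL)

        def embed_local(texts: Sequence[str]) -> list[list[float]]:
            # encode() returns a NumPy array by default; convert to plain lists.
            return model.encode(list(texts)).tolist()

        return embed_local
    if provider == "openai":
        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        def embed_openai(texts: Sequence[str]) -> list[list[float]]:
            response = client.embeddings.create(model=OPENAI_MODEL, input=list(texts))
            return [item.embedding for item in response.data]

        return embed_openai
    raise ValueError(f"unsupported EMBED_PROVIDER: {provider!r}")
```

Returning a plain callable keeps the indexing code independent of either SDK, which also makes it easy to substitute a fake embedder in tests.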

Add modules:

```text
src/{{MODULE_NAME}}/rag/legal/docling_extract.py
src/{{MODULE_NAME}}/rag/legal/clause_chunk.py
src/{{MODULE_NAME}}/rag/legal/embed_index.py
src/{{MODULE_NAME}}/rag/legal/types.py
```
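No template is given for `docling_extract.py`. One possible shape, offered as a sketch only, converts each PDF with Docling's `DocumentConverter` and writes one markdown file per source document. `docling_pdf_to_markdown`, `extract_dir`, and the `convert` parameter are hypothetical names introduced here.

```python
"""Hypothetical sketch of docling_extract.py."""
from pathlib import Path
from typing import Callable


def docling_pdf_to_markdown(pdf_path: Path) -> str:
    """Convert one PDF to markdown with Docling's default converter."""
    # Lazy import so this module can be imported without Docling installed.
    from docling.document_converter import DocumentConverter

    # A real implementation would likely build the converter once and reuse it.
    result = DocumentConverter().convert(pdf_path)
    return result.document.export_to_markdown()


def extract_dir(
    source_dir: Path,
    out_dir: Path,
    convert: Callable[[Path], str] = docling_pdf_to_markdown,
) -> list[Path]:
    """Write one .md file per PDF in source_dir, processed in sorted order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    # Sorting keeps runs reproducible regardless of filesystem ordering.
    for pdf_path in sorted(source_dir.glob("*.pdf")):
        target = out_dir / f"{pdf_path.stem}.md"
        target.write_text(convert(pdf_path), encoding="utf-8")
        written.append(target)
    return written
```

Passing the converter in as a parameter lets an offline test exercise the directory walk with a stub while the real run uses Docling.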

Add CLI commands:

```bash
uv run {{PROJECT_NAME}} legal-extract --source data/inbox/legal --out data/processed/legal
uv run {{PROJECT_NAME}} legal-index --source data/processed/legal --out data/index/legal-index.json
```

Enforce clause-aware chunking:

- Prefer section/heading boundaries first (`Article`, `Section`, numbered clauses).
- Fall back to paragraph-level splitting.
- Keep stable clause ids and citation metadata (`source_path`, `page`, `section`, `clause_id`).

Required Templates

`src/{{MODULE_NAME}}/rag/legal/types.py`

```python
"""Shared types for the legal ingestion pipeline."""

from pydantic import BaseModel, Field


class LegalClause(BaseModel):
    """One clause-level chunk plus the citation metadata needed for review."""

    clause_id: str
    source_path: str
    section: str | None = None
    page: int | None = None
    content: str
    # default_factory makes the fresh-dict-per-instance default explicit.
    metadata: dict[str, str] = Field(default_factory=dict)
```

`src/{{MODULE_NAME}}/rag/legal/clause_chunk.py`

```python
"""Clause-aware splitting of extracted legal markdown."""

import re

# A block that starts with "Article", "Section", or "Clause" plus a label
# opens a new clause.
SECTION_RE = re.compile(r"^(article|section|clause)\s+[\w.-]+", re.IGNORECASE)


def split_legal_clauses(markdown_text: str, max_chars: int = 1400) -> list[str]:
    """Split markdown into ordered clause chunks.

    Blocks are blank-line-separated paragraphs. A block matching SECTION_RE
    always starts a new chunk; other blocks are appended to the current chunk
    until adding one would exceed max_chars. Source order is preserved, and a
    single block longer than max_chars is kept whole.
    """
    blocks = [b.strip() for b in markdown_text.split("\n\n") if b.strip()]
    clauses: list[str] = []
    buf = ""
    for block in blocks:
        is_boundary = bool(SECTION_RE.match(block))
        if is_boundary and buf:
            # Flush before a heading so a chunk never spans two sections.
            clauses.append(buf.strip())
            buf = block
            continue
        # The +2 accounts for the "\n\n" joiner between blocks.
        if len(buf) + len(block) + 2 > max_chars and buf:
            clauses.append(buf.strip())
            buf = block
        else:
            buf = f"{buf}\n\n{block}".strip() if buf else block
    if buf:
        clauses.append(buf.strip())
    return clauses
```
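The `clause_chunk.py` template keeps a single block whole even when it is longer than `max_chars`, and it does not use `CLAUSE_OVERLAP_CHARS`. One way to cover both, shown here purely as a sketch, is a character window applied to oversized chunks after splitting. `window_text` is a hypothetical helper, and cutting on raw character offsets is a simplification that can split mid-word.

```python
"""Hypothetical post-processing step for oversized chunks."""


def window_text(text: str, max_chars: int = 1400, overlap: int = 120) -> list[str]:
    """Cut text into windows of at most max_chars that share `overlap` chars."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    if len(text) <= max_chars:
        return [text]
    step = max_chars - overlap  # how far each window start advances
    windows: list[str] = []
    start = 0
    while start < len(text):
        windows.append(text[start : start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return windows
```

Applying the window only inside a single clause keeps overlap from leaking text across section boundaries, which matters when each chunk is cited by section.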

Guardrails

- Documentation contract for generated code:
  - Python: write module docstrings and docstrings for public classes, methods, and functions.
  - Next.js/TypeScript: write JSDoc for exported components, hooks, utilities, and route handlers.
  - Add concise rationale comments only for non-obvious logic, invariants, or safety constraints.
  - Apply this contract even when using the template snippets in this document; expand templates as needed.
- Preserve legal ordering and section labels; do not reorder clauses.
- Keep extracted markdown for auditability before embedding.
- Include deterministic clause ids to support re-ingestion idempotency.
- Never drop citation metadata needed for legal review.
- Keep PII handling configurable; redact only when explicitly required.

Validation Checklist

Confirm generated code includes required docstrings/JSDoc and rationale comments for non-obvious logic.

```bash
uv run {{PROJECT_NAME}} legal-extract --source data/inbox/legal --out data/processed/legal
uv run {{PROJECT_NAME}} legal-index --source data/processed/legal --out data/index/legal-index.json
uv run pytest -q
```

Fallback (`offline-smoke`):

```bash
test -f src/{{MODULE_NAME}}/rag/legal/docling_extract.py
test -f src/{{MODULE_NAME}}/rag/legal/clause_chunk.py
test -f src/{{MODULE_NAME}}/rag/legal/embed_index.py
```

Decision Justification Rule

- Every non-trivial decision must include a concrete justification.
- Capture the alternatives considered and why they were rejected.
- State tradeoffs and residual risks for the chosen option.
- If justification is missing, treat the task as incomplete and surface it as a blocker.
