addon-docling-legal-chunk-embed
Add-on: Docling Legal Chunk + Embed
Use this skill when a project needs legal-focused document ingestion from PDF into markdown/chunks suitable for retrieval and downstream clause reasoning.
Compatibility
- Works with `architect-python-uv-batch`.
- Works with `architect-python-uv-fastapi-sqlalchemy` (worker or async job path).
- Commonly paired with `addon-rag-ingestion-pipeline`.
Inputs
Collect:
- `LEGAL_SOURCE_DIR`: default `data/inbox/legal`.
- `CLAUSE_MAX_CHARS`: default `1400`.
- `CLAUSE_OVERLAP_CHARS`: default `120`.
- `EMBED_PROVIDER`: `sentence-transformers` | `openai`.
- `OUTPUT_MODE`: `markdown+json` (default) | `json-only`.
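The inputs above could be collected into one validated settings object. The following is a minimal sketch, not part of the skill's templates: `LegalIngestConfig` and `load_config` are hypothetical names, reading the values from environment variables is an assumption, and so is the `sentence-transformers` default for `EMBED_PROVIDER` (the list above names no default for it).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_EMBED_PROVIDERS = ("sentence-transformers", "openai")
_OUTPUT_MODES = ("markdown+json", "json-only")


@dataclass(frozen=True)
class LegalIngestConfig:
    """Hypothetical settings container; field names mirror the inputs listed above."""

    legal_source_dir: str = "data/inbox/legal"
    clause_max_chars: int = 1400
    clause_overlap_chars: int = 120
    embed_provider: str = "sentence-transformers"  # assumption: no default is documented
    output_mode: str = "markdown+json"


def load_config(env: Optional[Mapping[str, str]] = None) -> LegalIngestConfig:
    """Build a config from environment-style key/value pairs and validate the choices."""
    env = os.environ if env is None else env
    defaults = LegalIngestConfig()
    cfg = LegalIngestConfig(
        legal_source_dir=env.get("LEGAL_SOURCE_DIR", defaults.legal_source_dir),
        clause_max_chars=int(env.get("CLAUSE_MAX_CHARS", defaults.clause_max_chars)),
        clause_overlap_chars=int(env.get("CLAUSE_OVERLAP_CHARS", defaults.clause_overlap_chars)),
        embed_provider=env.get("EMBED_PROVIDER", defaults.embed_provider),
        output_mode=env.get("OUTPUT_MODE", defaults.output_mode),
    )
    if cfg.embed_provider not in _EMBED_PROVIDERS:
        raise ValueError(f"EMBED_PROVIDER must be one of {_EMBED_PROVIDERS}")
    if cfg.output_mode not in _OUTPUT_MODES:
        raise ValueError(f"OUTPUT_MODE must be one of {_OUTPUT_MODES}")
    # An overlap at least as large as the chunk size would never make progress.
    if not 0 <= cfg.clause_overlap_chars < cfg.clause_max_chars:
        raise ValueError("CLAUSE_OVERLAP_CHARS must be >= 0 and < CLAUSE_MAX_CHARS")
    return cfg
```

Validating at load time surfaces a mistyped provider or mode before any PDF is processed, rather than midway through a batch.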
Integration Workflow
- Add dependencies:
  ```bash
  uv add docling orjson
  ```
  - For local embeddings:
    ```bash
    uv add sentence-transformers
    ```
  - For OpenAI embeddings:
    ```bash
    uv add openai
    ```
- Add modules:
  ```text
  src/{{MODULE_NAME}}/rag/legal/docling_extract.py
  src/{{MODULE_NAME}}/rag/legal/clause_chunk.py
  src/{{MODULE_NAME}}/rag/legal/embed_index.py
  src/{{MODULE_NAME}}/rag/legal/types.py
  ```
- Add CLI commands:
  ```bash
  uv run {{PROJECT_NAME}} legal-extract --source data/inbox/legal --out data/processed/legal
  uv run {{PROJECT_NAME}} legal-index --source data/processed/legal --out data/index/legal-index.json
  ```
- Enforce clause-aware chunking:
  - Prefer section/heading boundaries first (`Article`, `Section`, numbered clauses).
  - Fall back to paragraph-level splitting.
  - Keep stable clause ids and citation metadata (`source_path`, `page`, `section`, `clause_id`).
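The chunking rules above can be illustrated with a small sketch that attaches citation metadata to already-split chunks. It is an illustration under assumptions, not the skill's implementation: `section_label` and `build_clause_records` are hypothetical helpers, the id scheme (a truncated SHA-256 over path, position, and content) is one possible choice, and a single `page` value per call is a simplification because page mapping depends on what the extractor exposes.

```python
import hashlib
import re
from typing import Optional

# Matches headings such as "Article 1", "Section 2.3", or "## Clause 4-a".
HEADING_RE = re.compile(r"^(?:#+\s*)?((?:article|section|clause)\s+[\w.-]+)", re.IGNORECASE)


def section_label(chunk: str) -> Optional[str]:
    """Return the heading label on the chunk's first line, or None for paragraph-level chunks."""
    first_line = chunk.lstrip().split("\n", 1)[0]
    match = HEADING_RE.match(first_line)
    return match.group(1) if match else None


def build_clause_records(chunks: list, source_path: str, page: Optional[int] = None) -> list:
    """Wrap chunks in dicts that carry source_path, page, section, and clause_id."""
    records = []
    for position, chunk in enumerate(chunks):
        # Same file, position, and text always hash to the same id.
        payload = f"{source_path}\n{position}\n{chunk}".encode("utf-8")
        records.append(
            {
                "clause_id": hashlib.sha256(payload).hexdigest()[:16],
                "source_path": source_path,
                "page": page,
                "section": section_label(chunk),
                "content": chunk,
            }
        )
    return records
```

Including the position in the hash keeps two identical boilerplate clauses distinct, at the cost that inserting a clause near the top of a document changes the ids of everything after it.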
Required Templates
src/{{MODULE_NAME}}/rag/legal/types.py

```python
"""Shared types for the legal ingestion pipeline."""

from pydantic import BaseModel, Field


class LegalClause(BaseModel):
    """One clause-level chunk plus the citation metadata needed for legal review."""

    clause_id: str
    source_path: str
    section: str | None = None
    page: int | None = None
    content: str
    metadata: dict[str, str] = Field(default_factory=dict)
```
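src/{{MODULE_NAME}}/rag/legal/clause_chunk.py

```python
"""Clause-aware splitting of extracted legal markdown."""

import re

# Optional leading "#" markers let markdown headings such as "## Article 1" count as boundaries.
SECTION_RE = re.compile(r"^(?:#+\s*)?(article|section|clause)\s+[\w.-]+", re.IGNORECASE)


def split_legal_clauses(markdown_text: str, max_chars: int = 1400) -> list[str]:
    """Split markdown into clauses, preferring heading boundaries over size limits.

    Blocks are kept in document order. A single block longer than max_chars is
    emitted as-is rather than cut.
    """
    blocks = [b.strip() for b in markdown_text.split("\n\n") if b.strip()]
    clauses: list[str] = []
    buf = ""
    for block in blocks:
        is_boundary = bool(SECTION_RE.match(block))
        if is_boundary and buf:
            # A new heading always closes the clause collected so far.
            clauses.append(buf.strip())
            buf = block
            continue
        # The +2 accounts for the blank line that joins two blocks.
        if len(buf) + len(block) + 2 > max_chars and buf:
            clauses.append(buf.strip())
            buf = block
        else:
            buf = f"{buf}\n\n{block}".strip() if buf else block
    if buf:
        clauses.append(buf.strip())
    return clauses
```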
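The template above does not use `CLAUSE_OVERLAP_CHARS`, and it leaves a single block longer than `max_chars` uncut. One possible way to handle such blocks is a sliding character window, sketched below. `split_with_overlap` is a hypothetical helper, and applying the overlap only to oversized blocks is an assumption; the source does not say how the overlap is meant to be used.

```python
def split_with_overlap(text: str, max_chars: int = 1400, overlap_chars: int = 120) -> list:
    """Cut text into windows of at most max_chars, each repeating the previous window's tail."""
    if max_chars <= 0 or not 0 <= overlap_chars < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap_chars < max_chars")
    if len(text) <= max_chars:
        return [text]
    step = max_chars - overlap_chars  # how far each window advances
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return windows
```

Cutting by character count can split a word or sentence in half, so a production version would probably snap window edges to sentence or whitespace boundaries.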
Guardrails
- Documentation contract for generated code:
  - Python: write module docstrings and docstrings for public classes, methods, and functions.
  - Next.js/TypeScript: write JSDoc for exported components, hooks, utilities, and route handlers.
  - Add concise rationale comments only for non-obvious logic, invariants, or safety constraints.
  - Apply this contract even when using the template snippets in this skill; expand templates as needed.
- Preserve legal ordering and section labels; do not reorder clauses.
- Keep extracted markdown for auditability before embedding.
- Include deterministic clause ids to support re-ingestion idempotency.
- Never drop citation metadata needed for legal review.
- Keep PII handling configurable; redact only when explicitly required.
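The idempotency, ordering, and citation guardrails above could be enforced at the point where records enter the index. The sketch below assumes records are plain dicts that already carry a deterministic `clause_id`; `upsert_clauses` and the choice of required fields are hypothetical, not part of the skill's templates.

```python
REQUIRED_FIELDS = ("clause_id", "source_path", "content")


def upsert_clauses(index: dict, records: list) -> dict:
    """Merge records into an index keyed by clause_id without duplicating or reordering."""
    for record in records:
        missing = [key for key in REQUIRED_FIELDS if not record.get(key)]
        if missing:
            # Refuse to index a clause that a reviewer could not trace back to its source.
            raise ValueError(f"record is missing citation fields: {missing}")
        # Assigning to an existing key keeps its original position in the dict,
        # so re-ingesting the same document leaves clause order unchanged.
        index[record["clause_id"]] = record
    return index
```

Running the same batch twice leaves the index with the same keys in the same order, which is the property re-ingestion depends on.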
Validation Checklist
- Confirm generated code includes required docstrings/JSDoc and rationale comments for non-obvious logic.

```bash
uv run {{PROJECT_NAME}} legal-extract --source data/inbox/legal --out data/processed/legal
uv run {{PROJECT_NAME}} legal-index --source data/processed/legal --out data/index/legal-index.json
uv run pytest -q
```

Fallback (`offline-smoke`):

```bash
test -f src/{{MODULE_NAME}}/rag/legal/docling_extract.py
test -f src/{{MODULE_NAME}}/rag/legal/clause_chunk.py
test -f src/{{MODULE_NAME}}/rag/legal/embed_index.py
```
Decision Justification Rule
- Every non-trivial decision must include a concrete justification.
- Capture the alternatives considered and why they were rejected.
- State tradeoffs and residual risks for the chosen option.
- If justification is missing, treat the task as incomplete and surface it as a blocker.