# Tabular Document Review Skill

> ⚠️ EXPERIMENTAL — This skill is provided for educational and informational purposes only. It does NOT constitute legal advice. All responsibility for usage rests with the user. Consult qualified legal professionals before acting on any output.
## Overview
Production-ready toolkit for extracting structured data from multiple legal documents into a comparison matrix with citations. Supports user-defined extraction columns, parallel processing with up to 10 agents, confidence scoring, and output in markdown table or structured JSON. Designed for legal teams performing bulk contract review, NDA comparison, employment agreement analysis, and lease review.
## Table of Contents

- [Tools](#tools)
- [Reference Guides](#reference-guides)
- [Workflows](#workflows)
- [Output Format](#output-format)
- [Extraction Scenarios](#extraction-scenarios)
- [Troubleshooting](#troubleshooting)
- [Success Criteria](#success-criteria)
- [Scope & Limitations](#scope--limitations)
- [Anti-Patterns](#anti-patterns)
- [Tool Reference](#tool-reference)

## Tools
### 1. Document Discovery (`scripts/document_discovery.py`)

Scan a directory for legal documents and generate an inventory manifest.

```bash
python scripts/document_discovery.py /path/to/contracts
python scripts/document_discovery.py /path/to/ndas --types pdf,docx --json
python scripts/document_discovery.py /path/to/leases --types pdf,docx,txt,md --min-size 1024
```

### 2. Extraction Aggregator (`scripts/extraction_aggregator.py`)

Aggregate multiple extraction result JSONs into a unified comparison matrix.
```bash
python scripts/extraction_aggregator.py \
    --results extraction_1.json extraction_2.json extraction_3.json
python scripts/extraction_aggregator.py \
    --results-dir ./extraction_results/ --json
python scripts/extraction_aggregator.py \
    --results-dir ./extraction_results/ \
    --format markdown \
    --output review_matrix.md
python scripts/extraction_aggregator.py \
    --results extraction_1.json extraction_2.json \
    --columns "Parties,Effective Date,Term,Governing Law"
```

## Reference Guides
| Reference | Purpose |
|---|---|
| | Document extraction best practices, JSON schema, agent prompts |
| | Pre-defined column sets for contracts, NDAs, employment, leases |
## Workflows

### 5-Step Document Review Pipeline
| Step | Action | Tool | Output |
|---|---|---|---|
| 1. Gather Requirements | Define document folder, output filename, columns to extract | Manual | Column list, file path |
| 2. Discover Documents | Scan directory for target documents | `scripts/document_discovery.py` | Document manifest JSON |
| 3. Process Documents | Extract values per column with citations (parallel agents) | AI agents (external) | Per-document extraction JSONs |
| 4. Collect Results | Aggregate extraction JSONs into unified matrix | `scripts/extraction_aggregator.py` | Consolidated matrix |
| 5. Generate Output | Export as markdown table or structured JSON | `scripts/extraction_aggregator.py` | Final deliverable |
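The five steps above can be strung together in a small driver script. This is a sketch, not part of the toolkit: the two command lines mirror the usage shown in this README, while the input folder, results directory, and output filename are example placeholders, and step 3 (agent dispatch) happens outside these scripts.

```python
import json
import subprocess

DOCS_DIR = "./contracts"  # example input folder

# Step 2: discover documents and capture the manifest as JSON
discover = ["python", "scripts/document_discovery.py", DOCS_DIR,
            "--types", "pdf,docx", "--json"]

# Steps 4-5: aggregate per-document extraction JSONs into a markdown matrix
aggregate = ["python", "scripts/extraction_aggregator.py",
             "--results-dir", "./extraction_results/",
             "--format", "markdown",
             "--output", "review_matrix.md"]

def run_pipeline():
    manifest = json.loads(
        subprocess.run(discover, capture_output=True, text=True,
                       check=True).stdout)
    # Step 3 (external): dispatch AI agents over the manifest entries here,
    # writing one extraction JSON per document into ./extraction_results/
    subprocess.run(aggregate, check=True)
    return manifest
```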
### Parallel Processing Strategy
| Agents | Documents per Agent | Use When |
|---|---|---|
| 1 | All | 1-5 documents |
| 2-3 | ceil(N/agents) | 6-15 documents |
| 4-6 | ceil(N/agents) | 16-40 documents |
| 7-10 | ceil(N/agents) | 41-100 documents |
| 10 (max) | ceil(N/10) | 100+ documents |
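The allocation rule in the table can be sketched in Python. Picking the top of each agent-count band is an assumption for illustration, since the table allows a range (e.g. 2-3 agents for 6-15 documents):

```python
import math

def plan_agents(n_docs: int, max_agents: int = 10) -> list:
    """Split n_docs across agents per the strategy table: choose an agent
    count by document volume, then give each agent ceil(N/agents) docs."""
    if n_docs <= 5:
        agents = 1
    elif n_docs <= 15:
        agents = 3
    elif n_docs <= 40:
        agents = 6
    else:
        agents = max_agents
    per_agent = math.ceil(n_docs / agents)
    # The final batch may be smaller (or empty) when n_docs doesn't divide evenly
    return [min(per_agent, n_docs - i * per_agent)
            for i in range(agents) if n_docs - i * per_agent > 0]
```

For example, 25 documents fall in the 16-40 band: with 6 agents the batch size is `ceil(25/6) = 5`, so five agents of 5 documents each cover the set.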
### Agent Prompt Template

Each agent receives a prompt structured as:

```
You are reviewing {count} legal documents. For each document, extract the
following columns:

{column_definitions}

For each value extracted:
1. Provide the exact value found
2. Include the page number (PDF) or section/paragraph (DOCX/MD)
3. Rate your confidence: HIGH (exact match), MEDIUM (inferred), LOW (uncertain)
4. If not found, record "NOT FOUND" with confidence LOW

Output as JSON per the extraction schema.
```

### Confidence Scoring
| Level | Color Code | Definition |
|---|---|---|
| HIGH | Green | Exact value found with clear citation |
| MEDIUM | Yellow | Value inferred from context; multiple possible interpretations |
| LOW | Red / Not Found | Value uncertain or not found in document |
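A per-document extraction result might look like the following. This is a hypothetical shape for illustration only; the authoritative JSON schema lives in the extraction reference guide:

```python
# Hypothetical per-document extraction result (illustrative shape only)
extraction = {
    "document": "contract_a.pdf",
    "values": {
        "Effective Date": {
            "value": "2026-01-15",
            "citation": "p.2",
            "confidence": "HIGH",
        },
        "Renewal": {
            "value": "NOT FOUND",
            "citation": None,
            "confidence": "LOW",
        },
    },
}

# Cells needing manual review: LOW confidence or missing citation
flagged = [col for col, cell in extraction["values"].items()
           if cell["confidence"] == "LOW" or not cell["citation"]]
```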
## Output Format

### Sheet 1: Document Review
| Document | Parties | Effective Date | Term | Governing Law | ... |
|---|---|---|---|---|---|
| contract_a.pdf | Acme / Beta [p.1] | 2026-01-15 [p.2] | 3 years [p.3] | Delaware [p.12] | ... |
| contract_b.pdf | Gamma / Delta [p.1] | NOT FOUND | 2 years [p.4] | New York [p.10] | ... |
### Sheet 2: Summary
| Metric | Value |
|---|---|
| Documents processed | 25 |
| Columns extracted | 8 |
| Average confidence | 87% |
| Not found rate | 12% |
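The Sheet 2 metrics can be derived from the aggregated cells. A minimal sketch, assuming HIGH/MEDIUM/LOW map to 100/50/0 for averaging (that mapping is an assumption, not specified by the toolkit):

```python
# Example aggregated cells (values mirror the Sheet 1 sample above)
cells = [
    {"value": "Acme / Beta", "confidence": "HIGH"},
    {"value": "NOT FOUND",   "confidence": "LOW"},
    {"value": "3 years",     "confidence": "HIGH"},
    {"value": "Delaware",    "confidence": "HIGH"},
]

# Map confidence levels to a numeric score for averaging (assumed scale)
score = {"HIGH": 100, "MEDIUM": 50, "LOW": 0}
avg_confidence = sum(score[c["confidence"]] for c in cells) / len(cells)
not_found_rate = sum(c["value"] == "NOT FOUND" for c in cells) / len(cells)
```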
## Extraction Scenarios

### Contract Review
| Column | What to Extract |
|---|---|
| Parties | All contracting parties with full legal names |
| Effective Date | Contract effective or execution date |
| Term | Duration of the agreement |
| Renewal | Auto-renewal terms and notice period |
| Governing Law | Jurisdiction governing the agreement |
| Liability Cap | Maximum liability amount or formula |
| Indemnification | Indemnification obligations and scope |
| IP Ownership | Intellectual property ownership provisions |
| Termination Rights | Termination triggers and notice requirements |
| Data Protection | Data protection or privacy obligations |
### NDA Review
| Column | What to Extract |
|---|---|
| Parties | Disclosing and receiving parties |
| Type | Mutual or one-way |
| Definition Scope | How "confidential information" is defined |
| Exceptions | Standard exceptions to confidentiality |
| Term | Duration of confidentiality obligations |
| Survival | Survival period after termination |
| Return/Destruction | Obligations on termination |
| Remedies | Available remedies for breach |
## Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Discovery finds 0 documents | Wrong path or file types | Verify the path exists; check that `--types` includes the right extensions |
| Extraction JSONs have wrong schema | Agent prompt incomplete | Use the extraction schema from the reference guide |
| Aggregator shows conflicts | Multiple values for same cell | Review source documents; aggregator marks conflicts for manual review |
| High "NOT FOUND" rate | Columns too specific for document type | Use column definitions from the pre-defined column sets |
| Confidence all LOW | Agent unable to locate values | Check column definitions are specific enough; verify document is readable |
| Aggregator crashes on large set | Too many result files loaded at once | Process in batches of 50 results; use `--results` to pass each batch explicitly |
| Markdown table misaligned | Long values or special characters | Use `--format json` and convert externally |
| Missing citations | Agent did not include page/section references | Reinforce citation requirement in agent prompt; check extraction schema |
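One way to guard against misaligned markdown tables is to sanitize cell content before emitting the row. This helper is an illustrative sketch, not part of the aggregator:

```python
def md_cell(value: str, max_len: int = 60) -> str:
    """Escape pipes and newlines and truncate long values so a markdown
    table cell cannot break row alignment."""
    cleaned = value.replace("|", "\\|").replace("\n", " ")
    if len(cleaned) <= max_len:
        return cleaned
    return cleaned[: max_len - 1] + "…"
```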
## Success Criteria
- Extraction Coverage: 90%+ of defined columns populated across all documents
- Confidence Distribution: 70%+ of extractions rated HIGH confidence
- Citation Accuracy: Every extracted value includes verifiable page/section citation
- Processing Speed: 50+ documents processed within 30 minutes using parallel agents
- Matrix Completeness: Final matrix includes all documents and all columns with no orphan rows
## Scope & Limitations
This skill covers:
- Document inventory and discovery across PDF, DOCX, TXT, and MD formats
- Aggregation of extraction results from parallel agent processing into a unified matrix
- Pre-defined column sets for contracts, NDAs, employment agreements, and leases
- Confidence scoring and conflict detection for extracted values
- Markdown and JSON output formats
This skill does NOT cover:
- Actual document parsing or text extraction (requires external libraries or AI agents)
- OCR processing for scanned documents
- Excel/XLSX output generation (use JSON output and convert externally)
- Automated legal analysis or risk assessment of extracted values
- Document comparison or redlining between versions
## Anti-Patterns
| Anti-Pattern | Why It Fails | Better Approach |
|---|---|---|
| Vague column definitions | "Date" could match dozens of dates in a contract | Use specific definitions: "Effective Date" with guidance on where to look |
| Skipping document discovery | Unknown document count leads to wrong agent allocation | Always run discovery first; use manifest for pipeline planning |
| Ignoring LOW confidence results | Missing or uncertain data treated as fact | Review all LOW confidence cells manually; flag in final report |
| Processing 100+ docs with 1 agent | Slow, context window overflow, quality degradation | Use parallel processing: ceil(N/10) documents per agent, max 10 agents |
| No citation requirement | Cannot verify extracted values against source | Require page/section citation for every extraction; reject uncited values |
## Tool Reference
### `scripts/document_discovery.py`

Scan a directory for legal documents and generate an inventory manifest.

```
usage: document_discovery.py [-h] [--json]
                             [--types TYPES]
                             [--min-size MIN_SIZE]
                             [--max-size MAX_SIZE]
                             directory

positional arguments:
  directory            Path to directory containing documents

options:
  -h, --help           Show help message and exit
  --json               Output in JSON format
  --types TYPES        Comma-separated file extensions to include
                       (default: pdf,docx,doc,txt,md,rtf)
  --min-size MIN_SIZE  Minimum file size in bytes (default: 0)
  --max-size MAX_SIZE  Maximum file size in bytes (default: no limit)
```
### `scripts/extraction_aggregator.py`

Aggregate extraction results into a unified comparison matrix.
```
usage: extraction_aggregator.py [-h] [--json]
                                [--results RESULTS [RESULTS ...]]
                                [--results-dir RESULTS_DIR]
                                [--format {markdown,json}]
                                [--columns COLUMNS]
                                [--output OUTPUT]

options:
  -h, --help     Show help message and exit
  --json         Output in JSON format (alias for --format json)
  --results      One or more extraction result JSON files
  --results-dir  Directory containing extraction result JSON files
  --format       Output format: markdown table or JSON (default: markdown)
  --columns      Comma-separated column names to include (default: all)
  --output       Write output to file instead of stdout
```
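The aggregation and conflict-marking behavior described above can be sketched as follows. This is an illustrative reimplementation under assumed result shapes (see the hypothetical extraction JSON earlier), not the script's actual code:

```python
def aggregate(results, columns):
    """Merge per-document extraction results into a {document: {column:
    cell}} matrix and render a markdown table with [citation] suffixes.
    Conflicting duplicate values are marked for manual review."""
    matrix = {}
    for res in results:
        row = matrix.setdefault(res["document"], {})
        for col, cell in res["values"].items():
            if col in row and row[col]["value"] != cell["value"]:
                row[col] = {"value": "CONFLICT", "citation": None}
            else:
                row.setdefault(col, cell)

    lines = ["| Document | " + " | ".join(columns) + " |",
             "|---" * (len(columns) + 1) + "|"]
    for doc in sorted(matrix):
        rendered = []
        for col in columns:
            cell = matrix[doc].get(col, {"value": "NOT FOUND", "citation": None})
            cite = f" [{cell['citation']}]" if cell.get("citation") else ""
            rendered.append(cell["value"] + cite)
        lines.append("| " + doc + " | " + " | ".join(rendered) + " |")
    return "\n".join(lines)
```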