tabular-document-review

⚠️ EXPERIMENTAL — This skill is provided for educational and informational purposes only. It does NOT constitute legal advice. All responsibility for usage rests with the user. Consult qualified legal professionals before acting on any output.

Tabular Document Review Skill

Overview

Production-ready toolkit for extracting structured data from multiple legal documents into a comparison matrix with citations. Supports user-defined extraction columns, parallel processing with up to 10 agents, confidence scoring, and output in markdown table or structured JSON. Designed for legal teams performing bulk contract review, NDA comparison, employment agreement analysis, and lease review.

Tools

1. Document Discovery (scripts/document_discovery.py)

Scan a directory for legal documents and generate an inventory manifest.

```bash
# Scan with the default file types
python scripts/document_discovery.py /path/to/contracts

# Restrict the scan to PDF and DOCX files and emit JSON
python scripts/document_discovery.py /path/to/ndas --types pdf,docx --json

# Include text formats and skip files smaller than 1024 bytes
python scripts/document_discovery.py /path/to/leases --types pdf,docx,txt,md --min-size 1024
```
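
To give a feel for what the discovery step produces, here is a minimal sketch of a directory scan with extension and size filters. It is not the code of `document_discovery.py`: the function name `discover_documents`, the recursive scan, and the manifest fields (`path`, `type`, `size_bytes`) are all assumptions made for illustration.

```python
from pathlib import Path


def discover_documents(directory, types=("pdf", "docx"), min_size=0):
    """Return a sorted manifest of files under `directory` that pass the filters.

    Hypothetical helper: the manifest fields are illustrative, not the
    script's real output schema.
    """
    wanted = {t.lower().lstrip(".") for t in types}
    manifest = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        ext = path.suffix.lower().lstrip(".")  # compare extensions case-insensitively
        size = path.stat().st_size
        if ext in wanted and size >= min_size:
            manifest.append({"path": str(path), "type": ext, "size_bytes": size})
    return manifest
```

The length of the returned list is the document count N that the parallel processing plan depends on.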

2. Extraction Aggregator (scripts/extraction_aggregator.py)

Aggregate multiple extraction result JSONs into a unified comparison matrix.

```bash
# Aggregate specific result files
python scripts/extraction_aggregator.py \
  --results extraction_1.json extraction_2.json extraction_3.json

# Aggregate every result file in a directory and emit JSON
python scripts/extraction_aggregator.py \
  --results-dir ./extraction_results/ --json

# Write a markdown matrix to a file
python scripts/extraction_aggregator.py \
  --results-dir ./extraction_results/ \
  --format markdown \
  --output review_matrix.md

# Keep only the listed columns
python scripts/extraction_aggregator.py \
  --results extraction_1.json extraction_2.json \
  --columns "Parties,Effective Date,Term,Governing Law"
```
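
The aggregation idea can be sketched in a few lines of Python. This is not the implementation of `extraction_aggregator.py`. The result shape used here (a `document` name plus an `extractions` mapping of column to `value`/`citation`/`confidence`) is an assumption; the actual schema lives in `references/extraction_methodology.md`. The function name `aggregate` and the `conflict` flag are likewise hypothetical.

```python
def aggregate(results, columns=None):
    """Merge per-document results into {document: {column: cell}}.

    Assumed input shape (illustrative only):
      {"document": "a.pdf",
       "extractions": {"Term": {"value": ..., "citation": ..., "confidence": ...}}}
    A cell gets "conflict": True when two results disagree on its value.
    """
    matrix = {}
    for result in results:
        row = matrix.setdefault(result["document"], {})
        for column, cell in result["extractions"].items():
            if columns is not None and column not in columns:
                continue  # mimic a column filter such as --columns
            if column not in row:
                row[column] = dict(cell, conflict=False)  # first value wins
            elif row[column]["value"] != cell["value"]:
                row[column]["conflict"] = True  # leave for manual review
    return matrix
```

Keeping the first value and only flagging the disagreement leaves the decision to a human reviewer, which matches the manual-review guidance in the troubleshooting table.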

Reference Guides

| Reference | Purpose |
|-----------|---------|
| `references/extraction_methodology.md` | Document extraction best practices, JSON schema, agent prompts |
| `references/common_extraction_columns.md` | Pre-defined column sets for contracts, NDAs, employment, leases |

Workflows

5-Step Document Review Pipeline

| Step | Action | Tool | Output |
|------|--------|------|--------|
| 1. Gather Requirements | Define document folder, output filename, columns to extract | Manual | Column list, file path |
| 2. Discover Documents | Scan directory for target documents | `document_discovery.py` | Document manifest JSON |
| 3. Process Documents | Extract values per column with citations (parallel agents) | AI agents (external) | Per-document extraction JSONs |
| 4. Collect Results | Aggregate extraction JSONs into unified matrix | `extraction_aggregator.py` | Consolidated matrix |
| 5. Generate Output | Export as markdown table or structured JSON | `extraction_aggregator.py` | Final deliverable |

Parallel Processing Strategy

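| Agents | Documents per Agent | Use When |
|--------|---------------------|----------|
| 1 | All | 1-5 documents |
| 2-3 | ceil(N/agents) | 6-15 documents |
| 4-6 | ceil(N/agents) | 16-40 documents |
| 7-10 | ceil(N/agents) | 41-100 documents |
| 10 (max) | ceil(N/10) | 100+ documents |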
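
The allocation table can be turned into a small planning helper. The sketch below is one possible reading of it: the table gives a range of agents per tier, and picking the upper bound of each range is an assumption made here. The names `agent_count` and `split_batches` are hypothetical.

```python
import math

# (largest document count in the tier, agents used). Using the upper bound
# of each agent range from the table is an assumption of this sketch.
TIERS = [(5, 1), (15, 3), (40, 6), (100, 10)]
MAX_AGENTS = 10


def agent_count(n_docs):
    """Pick an agent count for n_docs documents, capped at MAX_AGENTS."""
    for max_docs, agents in TIERS:
        if n_docs <= max_docs:
            return agents
    return MAX_AGENTS


def split_batches(documents, agents):
    """Split documents into batches of ceil(N/agents) items each."""
    if not documents:
        return []
    size = math.ceil(len(documents) / agents)
    return [documents[i:i + size] for i in range(0, len(documents), size)]
```

Because the batch size is rounded up, the number of batches can be smaller than the number of agents requested. For example, 41 documents with 10 agents gives batches of 5 and therefore only 9 batches.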

Agent Prompt Template

Each agent receives a prompt structured as:

```
You are reviewing {count} legal documents. For each document, extract the
following columns:

{column_definitions}

For each value extracted:
1. Provide the exact value found
2. Include the page number (PDF) or section/paragraph (DOCX/MD)
3. Rate your confidence: HIGH (exact match), MEDIUM (inferred), LOW (uncertain)
4. If not found, record "NOT FOUND" with confidence LOW

Output as JSON per the extraction schema.
```
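
Filling the two placeholders is straightforward with `str.format`. The sketch below assumes column definitions are passed as a name-to-description mapping and rendered one per line as `- Name: description`. That rendering and the helper name `build_prompt` are illustrative choices, not something the skill prescribes.

```python
PROMPT_TEMPLATE = """You are reviewing {count} legal documents. For each document, extract the
following columns:

{column_definitions}

For each value extracted:
1. Provide the exact value found
2. Include the page number (PDF) or section/paragraph (DOCX/MD)
3. Rate your confidence: HIGH (exact match), MEDIUM (inferred), LOW (uncertain)
4. If not found, record "NOT FOUND" with confidence LOW

Output as JSON per the extraction schema."""


def build_prompt(documents, columns):
    """Render the agent prompt for one batch of documents.

    `columns` maps column name to description (assumed input shape).
    """
    definitions = "\n".join(f"- {name}: {desc}" for name, desc in columns.items())
    return PROMPT_TEMPLATE.format(count=len(documents), column_definitions=definitions)
```

A call such as `build_prompt(["a.pdf", "b.pdf"], {"Term": "Duration of the agreement"})` yields a prompt that opens with "You are reviewing 2 legal documents".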

Confidence Scoring

| Level | Color Code | Definition |
|-------|------------|------------|
| HIGH | Green | Exact value found with clear citation |
| MEDIUM | Yellow | Value inferred from context; multiple possible interpretations |
| LOW | Red / Not Found | Value uncertain or not found in document |

Output Format

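Sheet 1: Document Review

| Document | Parties | Effective Date | Term | Governing Law | ... |
|----------|---------|----------------|------|---------------|-----|
| contract_a.pdf | Acme / Beta [p.1] | 2026-01-15 [p.2] | 3 years [p.3] | Delaware [p.12] | ... |
| contract_b.pdf | Gamma / Delta [p.1] | NOT FOUND | 2 years [p.4] | New York [p.10] | ... |

Sheet 2: Summary

| Metric | Value |
|--------|-------|
| Documents processed | 25 |
| Columns extracted | 8 |
| Average confidence | 87% |
| Not found rate | 12% |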
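
Some of the summary metrics can be derived directly from the matrix. The sketch below assumes the matrix is a mapping of document to column to a cell with a `value` key, which is an illustrative shape rather than the tool's documented schema. It leaves out "Average confidence" because the source does not say how the HIGH/MEDIUM/LOW levels are converted to a percentage. The function name `summarize` is hypothetical.

```python
def summarize(matrix):
    """Compute summary metrics from {document: {column: {"value": ...}}}.

    A missing cell is counted the same as an explicit "NOT FOUND".
    """
    columns = sorted({col for row in matrix.values() for col in row})
    total_cells = len(matrix) * len(columns)
    not_found = sum(
        1
        for row in matrix.values()
        for col in columns
        if row.get(col, {}).get("value", "NOT FOUND") == "NOT FOUND"
    )
    return {
        "documents_processed": len(matrix),
        "columns_extracted": len(columns),
        "not_found_rate": not_found / total_cells if total_cells else 0.0,
    }
```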

Extraction Scenarios

Contract Review

| Column | What to Extract |
|--------|-----------------|
| Parties | All contracting parties with full legal names |
| Effective Date | Contract effective or execution date |
| Term | Duration of the agreement |
| Renewal | Auto-renewal terms and notice period |
| Governing Law | Jurisdiction governing the agreement |
| Liability Cap | Maximum liability amount or formula |
| Indemnification | Indemnification obligations and scope |
| IP Ownership | Intellectual property ownership provisions |
| Termination Rights | Termination triggers and notice requirements |
| Data Protection | Data protection or privacy obligations |

NDA Review

| Column | What to Extract |
|--------|-----------------|
| Parties | Disclosing and receiving parties |
| Type | Mutual or one-way |
| Definition Scope | How "confidential information" is defined |
| Exceptions | Standard exceptions to confidentiality |
| Term | Duration of confidentiality obligations |
| Survival | Survival period after termination |
| Return/Destruction | Obligations on termination |
| Remedies | Available remedies for breach |

Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| Discovery finds 0 documents | Wrong path or file types | Verify path exists; check `--types` matches actual file extensions |
| Extraction JSONs have wrong schema | Agent prompt incomplete | Use the extraction schema from `extraction_methodology.md` |
| Aggregator shows conflicts | Multiple values for same cell | Review source documents; aggregator marks conflicts for manual review |
| High "NOT FOUND" rate | Columns too specific for document type | Use column definitions from `common_extraction_columns.md`; broaden definitions |
| Confidence all LOW | Agent unable to locate values | Check column definitions are specific enough; verify document is readable |
| Aggregator crashes on large set | Too many result files loaded at once | Process in batches of 50 results; use `--columns` to limit output width |
| Markdown table misaligned | Long values or special characters | Use `--format json` for machine processing; truncate long values |
| Missing citations | Agent did not include page/section references | Reinforce citation requirement in agent prompt; check extraction schema |

Success Criteria

  • Extraction Coverage: 90%+ of defined columns populated across all documents
  • Confidence Distribution: 70%+ of extractions rated HIGH confidence
  • Citation Accuracy: Every extracted value includes verifiable page/section citation
  • Processing Speed: 50+ documents processed within 30 minutes using parallel agents
  • Matrix Completeness: Final matrix includes all documents and all columns with no orphan rows
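
The first three criteria lend themselves to an automated check. The sketch below is one interpretation, not part of the skill: it assumes a flat list of cells with `value`, `confidence`, and `citation` keys, and it measures the HIGH share against all cells, including those not found. Processing speed and matrix completeness are not covered. The name `check_success` is hypothetical.

```python
def check_success(cells, coverage_target=0.90, high_target=0.70):
    """Check coverage, HIGH-confidence share, and citation presence.

    `cells` is a list of dicts with "value", "confidence", and "citation"
    keys (assumed shape, for illustration only).
    """
    found = [c for c in cells if c.get("value") not in (None, "", "NOT FOUND")]
    coverage = len(found) / len(cells) if cells else 0.0
    high = sum(1 for c in cells if c.get("confidence") == "HIGH")
    high_share = high / len(cells) if cells else 0.0
    all_cited = all(c.get("citation") for c in found)  # only found values need one
    return {
        "coverage": coverage,
        "high_share": high_share,
        "all_cited": all_cited,
        "passed": coverage >= coverage_target
        and high_share >= high_target
        and all_cited,
    }
```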

Scope & Limitations

This skill covers:
  • Document inventory and discovery across PDF, DOCX, TXT, and MD formats
  • Aggregation of extraction results from parallel agent processing into unified matrix
  • Pre-defined column sets for contracts, NDAs, employment agreements, and leases
  • Confidence scoring and conflict detection for extracted values
  • Markdown and JSON output formats
This skill does NOT cover:
  • Actual document parsing or text extraction (requires external libraries or AI agents)
  • OCR processing for scanned documents
  • Excel/XLSX output generation (use JSON output and convert externally)
  • Automated legal analysis or risk assessment of extracted values
  • Document comparison or redlining between versions
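
Since XLSX output is out of scope, one simple external conversion is to write the matrix as CSV, which spreadsheet applications can open. The sketch below uses only the standard library and assumes the JSON has been loaded into a mapping of document to column to a cell with `value` and optional `citation` keys. That shape and the name `matrix_to_csv` are assumptions, not the aggregator's documented output.

```python
import csv
import io


def matrix_to_csv(matrix, columns):
    """Render {document: {column: {"value", "citation"}}} as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Document", *columns])
    for document, row in matrix.items():
        cells = []
        for column in columns:
            cell = row.get(column, {})
            value = cell.get("value", "NOT FOUND")  # missing cells read as NOT FOUND
            citation = cell.get("citation")
            cells.append(f"{value} [{citation}]" if citation else value)
        writer.writerow([document, *cells])
    return buffer.getvalue()
```

The `csv` writer quotes values that contain commas or line breaks, so long clause excerpts stay in a single cell.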

Anti-Patterns

| Anti-Pattern | Why It Fails | Better Approach |
|--------------|--------------|-----------------|
| Vague column definitions | "Date" could match dozens of dates in a contract | Use specific definitions: "Effective Date" with guidance on where to look |
| Skipping document discovery | Unknown document count leads to wrong agent allocation | Always run discovery first; use manifest for pipeline planning |
| Ignoring LOW confidence results | Missing or uncertain data treated as fact | Review all LOW confidence cells manually; flag in final report |
| Processing 100+ docs with 1 agent | Slow, context window overflow, quality degradation | Use parallel processing: ceil(N/10) documents per agent, max 10 agents |
| No citation requirement | Cannot verify extracted values against source | Require page/section citation for every extraction; reject uncited values |

Tool Reference

scripts/document_discovery.py

Scan directory for legal documents and generate inventory manifest.

```
usage: document_discovery.py [-h] [--json]
                              [--types TYPES]
                              [--min-size MIN_SIZE]
                              [--max-size MAX_SIZE]
                              directory

positional arguments:
  directory             Path to directory containing documents

options:
  -h, --help            Show help message and exit
  --json                Output in JSON format
  --types TYPES         Comma-separated file extensions to include
                        (default: pdf,docx,doc,txt,md,rtf)
  --min-size MIN_SIZE   Minimum file size in bytes (default: 0)
  --max-size MAX_SIZE   Maximum file size in bytes (default: no limit)
```

scripts/extraction_aggregator.py

Aggregate extraction results into unified comparison matrix.

```
usage: extraction_aggregator.py [-h] [--json]
                                 [--results RESULTS [RESULTS ...]]
                                 [--results-dir RESULTS_DIR]
                                 [--format {markdown,json}]
                                 [--columns COLUMNS]
                                 [--output OUTPUT]

options:
  -h, --help            Show help message and exit
  --json                Output in JSON format (alias for --format json)
  --results             One or more extraction result JSON files
  --results-dir         Directory containing extraction result JSON files
  --format              Output format: markdown table or JSON (default: markdown)
  --columns             Comma-separated column names to include (default: all)
  --output              Write output to file instead of stdout
```