# Tabular Document Review Skill

> ⚠️ EXPERIMENTAL — This skill is provided for educational and informational purposes only. It does NOT constitute legal advice. All responsibility for usage rests with the user. Consult qualified legal professionals before acting on any output.
## Overview
Production-ready toolkit for extracting structured data from multiple legal documents into a comparison matrix with citations. Supports user-defined extraction columns, parallel processing with up to 10 agents, confidence scoring, and output in markdown table or structured JSON. Designed for legal teams performing bulk contract review, NDA comparison, employment agreement analysis, and lease review.
## Table of Contents

- [Tools](#tools)
- [Reference Guides](#reference-guides)
- [Workflows](#workflows)
- [Output Format](#output-format)
- [Extraction Scenarios](#extraction-scenarios)
- [Troubleshooting](#troubleshooting)
- [Success Criteria](#success-criteria)
- [Scope & Limitations](#scope--limitations)
- [Anti-Patterns](#anti-patterns)
- [Tool Reference](#tool-reference)

## Tools
### 1. Document Discovery (`scripts/document_discovery.py`)

Scan a directory for legal documents and generate an inventory manifest.

```bash
python scripts/document_discovery.py /path/to/contracts
python scripts/document_discovery.py /path/to/ndas --types pdf,docx --json
python scripts/document_discovery.py /path/to/leases --types pdf,docx,txt,md --min-size 1024
```

### 2. Extraction Aggregator (`scripts/extraction_aggregator.py`)

Aggregate multiple extraction result JSONs into a unified comparison matrix.
```bash
python scripts/extraction_aggregator.py \
    --results extraction_1.json extraction_2.json extraction_3.json
python scripts/extraction_aggregator.py \
    --results-dir ./extraction_results/ --json
python scripts/extraction_aggregator.py \
    --results-dir ./extraction_results/ \
    --format markdown \
    --output review_matrix.md
python scripts/extraction_aggregator.py \
    --results extraction_1.json extraction_2.json \
    --columns "Parties,Effective Date,Term,Governing Law"
```

## Reference Guides
| Reference | Purpose |
|---|---|
| | Document extraction best practices, JSON schema, agent prompts |
| | Pre-defined column sets for contracts, NDAs, employment, leases |
## Workflows

### 5-Step Document Review Pipeline
| Step | Action | Tool | Output |
|---|---|---|---|
| 1. Gather Requirements | Define document folder, output filename, columns to extract | Manual | Column list, file path |
| 2. Discover Documents | Scan directory for target documents | `scripts/document_discovery.py` | Document manifest JSON |
| 3. Process Documents | Extract values per column with citations (parallel agents) | AI agents (external) | Per-document extraction JSONs |
| 4. Collect Results | Aggregate extraction JSONs into unified matrix | `scripts/extraction_aggregator.py` | Consolidated matrix |
| 5. Generate Output | Export as markdown table or structured JSON | `scripts/extraction_aggregator.py` | Final deliverable |
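The five steps above can be strung together in a small driver script. This is a sketch, not part of the toolkit: the two command lines mirror the usage shown in this README, while the input folder, results directory, and output filename are example placeholders, and step 3 (agent dispatch) happens outside these scripts.

```python
import json
import subprocess

DOCS_DIR = "./contracts"  # example input folder

# Step 2: discover documents and capture the manifest as JSON
discover = ["python", "scripts/document_discovery.py", DOCS_DIR,
            "--types", "pdf,docx", "--json"]

# Steps 4-5: aggregate per-document extraction JSONs into a markdown matrix
aggregate = ["python", "scripts/extraction_aggregator.py",
             "--results-dir", "./extraction_results/",
             "--format", "markdown",
             "--output", "review_matrix.md"]

def run_pipeline():
    manifest = json.loads(
        subprocess.run(discover, capture_output=True, text=True,
                       check=True).stdout)
    # Step 3 (external): dispatch AI agents over the manifest entries here,
    # writing one extraction JSON per document into ./extraction_results/
    subprocess.run(aggregate, check=True)
    return manifest
```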
### Parallel Processing Strategy
| Agents | Documents per Agent | Use When |
|---|---|---|
| 1 | All | 1-5 documents |
| 2-3 | ceil(N/agents) | 6-15 documents |
| 4-6 | ceil(N/agents) | 16-40 documents |
| 7-10 | ceil(N/agents) | 41-100 documents |
| 10 (max) | ceil(N/10) | 100+ documents |
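The allocation rule in the table can be sketched in Python. Picking the top of each agent-count band is an assumption for illustration, since the table allows a range (e.g. 2-3 agents for 6-15 documents):

```python
import math

def plan_agents(n_docs: int, max_agents: int = 10) -> list:
    """Split n_docs across agents per the strategy table: choose an agent
    count by document volume, then give each agent ceil(N/agents) docs."""
    if n_docs <= 5:
        agents = 1
    elif n_docs <= 15:
        agents = 3
    elif n_docs <= 40:
        agents = 6
    else:
        agents = max_agents
    per_agent = math.ceil(n_docs / agents)
    # The final batch may be smaller (or empty) when n_docs doesn't divide evenly
    return [min(per_agent, n_docs - i * per_agent)
            for i in range(agents) if n_docs - i * per_agent > 0]
```

For example, 25 documents fall in the 16-40 band: with 6 agents the batch size is `ceil(25/6) = 5`, so five agents of 5 documents each cover the set.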
### Agent Prompt Template

Each agent receives a prompt structured as:

```
You are reviewing {count} legal documents. For each document, extract the
following columns:

{column_definitions}

For each value extracted:
1. Provide the exact value found
2. Include the page number (PDF) or section/paragraph (DOCX/MD)
3. Rate your confidence: HIGH (exact match), MEDIUM (inferred), LOW (uncertain)
4. If not found, record "NOT FOUND" with confidence LOW

Output as JSON per the extraction schema.
```

### Confidence Scoring
| Level | Color Code | Definition |
|---|---|---|
| HIGH | Green | Exact value found with clear citation |
| MEDIUM | Yellow | Value inferred from context; multiple possible interpretations |
| LOW | Red / Not Found | Value uncertain or not found in document |
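A per-document extraction result might look like the following. This is a hypothetical shape for illustration only; the authoritative JSON schema lives in the extraction reference guide:

```python
# Hypothetical per-document extraction result (illustrative shape only)
extraction = {
    "document": "contract_a.pdf",
    "values": {
        "Effective Date": {
            "value": "2026-01-15",
            "citation": "p.2",
            "confidence": "HIGH",
        },
        "Renewal": {
            "value": "NOT FOUND",
            "citation": None,
            "confidence": "LOW",
        },
    },
}

# Cells needing manual review: LOW confidence or missing citation
flagged = [col for col, cell in extraction["values"].items()
           if cell["confidence"] == "LOW" or not cell["citation"]]
```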
## Output Format

### Sheet 1: Document Review
| Document | Parties | Effective Date | Term | Governing Law | ... |
|---|---|---|---|---|---|
| contract_a.pdf | Acme / Beta [p.1] | 2026-01-15 [p.2] | 3 years [p.3] | Delaware [p.12] | ... |
| contract_b.pdf | Gamma / Delta [p.1] | NOT FOUND | 2 years [p.4] | New York [p.10] | ... |
### Sheet 2: Summary
| Metric | Value |
|---|---|
| Documents processed | 25 |
| Columns extracted | 8 |
| Average confidence | 87% |
| Not found rate | 12% |
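The Sheet 2 metrics can be derived from the aggregated cells. A minimal sketch, assuming HIGH/MEDIUM/LOW map to 100/50/0 for averaging (that mapping is an assumption, not specified by the toolkit):

```python
# Example aggregated cells (values mirror the Sheet 1 sample above)
cells = [
    {"value": "Acme / Beta", "confidence": "HIGH"},
    {"value": "NOT FOUND",   "confidence": "LOW"},
    {"value": "3 years",     "confidence": "HIGH"},
    {"value": "Delaware",    "confidence": "HIGH"},
]

# Map confidence levels to a numeric score for averaging (assumed scale)
score = {"HIGH": 100, "MEDIUM": 50, "LOW": 0}
avg_confidence = sum(score[c["confidence"]] for c in cells) / len(cells)
not_found_rate = sum(c["value"] == "NOT FOUND" for c in cells) / len(cells)
```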
## Extraction Scenarios

### Contract Review
| Column | What to Extract |
|---|---|
| Parties | All contracting parties with full legal names |
| Effective Date | Contract effective or execution date |
| Term | Duration of the agreement |
| Renewal | Auto-renewal terms and notice period |
| Governing Law | Jurisdiction governing the agreement |
| Liability Cap | Maximum liability amount or formula |
| Indemnification | Indemnification obligations and scope |
| IP Ownership | Intellectual property ownership provisions |
| Termination Rights | Termination triggers and notice requirements |
| Data Protection | Data protection or privacy obligations |
### NDA Review
| Column | What to Extract |
|---|---|
| Parties | Disclosing and receiving parties |
| Type | Mutual or one-way |
| Definition Scope | How "confidential information" is defined |
| Exceptions | Standard exceptions to confidentiality |
| Term | Duration of confidentiality obligations |
| Survival | Survival period after termination |
| Return/Destruction | Obligations on termination |
| Remedies | Available remedies for breach |
## Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Discovery finds 0 documents | Wrong path or file types | Verify the path exists; check that `--types` includes the right extensions |
| Extraction JSONs have wrong schema | Agent prompt incomplete | Use the extraction schema from the reference guide |
| Aggregator shows conflicts | Multiple values for same cell | Review source documents; aggregator marks conflicts for manual review |
| High "NOT FOUND" rate | Columns too specific for document type | Use column definitions from the pre-defined column sets |
| Confidence all LOW | Agent unable to locate values | Check column definitions are specific enough; verify document is readable |
| Aggregator crashes on large set | Too many result files loaded at once | Process in batches of 50 results; use `--results` to pass each batch explicitly |
| Markdown table misaligned | Long values or special characters | Use `--format json` and convert externally |
| Missing citations | Agent did not include page/section references | Reinforce citation requirement in agent prompt; check extraction schema |
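One way to guard against misaligned markdown tables is to sanitize cell content before emitting the row. This helper is an illustrative sketch, not part of the aggregator:

```python
def md_cell(value: str, max_len: int = 60) -> str:
    """Escape pipes and newlines and truncate long values so a markdown
    table cell cannot break row alignment."""
    cleaned = value.replace("|", "\\|").replace("\n", " ")
    if len(cleaned) <= max_len:
        return cleaned
    return cleaned[: max_len - 1] + "…"
```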
## Success Criteria
- Extraction Coverage: 90%+ of defined columns populated across all documents
- Confidence Distribution: 70%+ of extractions rated HIGH confidence
- Citation Accuracy: Every extracted value includes verifiable page/section citation
- Processing Speed: 50+ documents processed within 30 minutes using parallel agents
- Matrix Completeness: Final matrix includes all documents and all columns with no orphan rows
## Scope & Limitations
This skill covers:
- Document inventory and discovery across PDF, DOCX, TXT, and MD formats
- Aggregation of extraction results from parallel agent processing into a unified matrix
- Pre-defined column sets for contracts, NDAs, employment agreements, and leases
- Confidence scoring and conflict detection for extracted values
- Markdown and JSON output formats
This skill does NOT cover:
- Actual document parsing or text extraction (requires external libraries or AI agents)
- OCR processing for scanned documents
- Excel/XLSX output generation (use JSON output and convert externally)
- Automated legal analysis or risk assessment of extracted values
- Document comparison or redlining between versions
## Anti-Patterns
| Anti-Pattern | Why It Fails | Better Approach |
|---|---|---|
| Vague column definitions | "Date" could match dozens of dates in a contract | Use specific definitions: "Effective Date" with guidance on where to look |
| Skipping document discovery | Unknown document count leads to wrong agent allocation | Always run discovery first; use manifest for pipeline planning |
| Ignoring LOW confidence results | Missing or uncertain data treated as fact | Review all LOW confidence cells manually; flag in final report |
| Processing 100+ docs with 1 agent | Slow, context window overflow, quality degradation | Use parallel processing: ceil(N/10) documents per agent, max 10 agents |
| No citation requirement | Cannot verify extracted values against source | Require page/section citation for every extraction; reject uncited values |
## Tool Reference
### `scripts/document_discovery.py`

Scan a directory for legal documents and generate an inventory manifest.

```
usage: document_discovery.py [-h] [--json]
                             [--types TYPES]
                             [--min-size MIN_SIZE]
                             [--max-size MAX_SIZE]
                             directory

positional arguments:
  directory            Path to directory containing documents

options:
  -h, --help           Show help message and exit
  --json               Output in JSON format
  --types TYPES        Comma-separated file extensions to include
                       (default: pdf,docx,doc,txt,md,rtf)
  --min-size MIN_SIZE  Minimum file size in bytes (default: 0)
  --max-size MAX_SIZE  Maximum file size in bytes (default: no limit)
```
### `scripts/extraction_aggregator.py`

Aggregate extraction results into a unified comparison matrix.
```
usage: extraction_aggregator.py [-h] [--json]
                                [--results RESULTS [RESULTS ...]]
                                [--results-dir RESULTS_DIR]
                                [--format {markdown,json}]
                                [--columns COLUMNS]
                                [--output OUTPUT]

options:
  -h, --help     Show help message and exit
  --json         Output in JSON format (alias for --format json)
  --results      One or more extraction result JSON files
  --results-dir  Directory containing extraction result JSON files
  --format       Output format: markdown table or JSON (default: markdown)
  --columns      Comma-separated column names to include (default: all)
  --output       Write output to file instead of stdout
```
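The aggregation and conflict-marking behavior described above can be sketched as follows. This is an illustrative reimplementation under assumed result shapes (see the hypothetical extraction JSON earlier), not the script's actual code:

```python
def aggregate(results, columns):
    """Merge per-document extraction results into a {document: {column:
    cell}} matrix and render a markdown table with [citation] suffixes.
    Conflicting duplicate values are marked for manual review."""
    matrix = {}
    for res in results:
        row = matrix.setdefault(res["document"], {})
        for col, cell in res["values"].items():
            if col in row and row[col]["value"] != cell["value"]:
                row[col] = {"value": "CONFLICT", "citation": None}
            else:
                row.setdefault(col, cell)

    lines = ["| Document | " + " | ".join(columns) + " |",
             "|---" * (len(columns) + 1) + "|"]
    for doc in sorted(matrix):
        rendered = []
        for col in columns:
            cell = matrix[doc].get(col, {"value": "NOT FOUND", "citation": None})
            cite = f" [{cell['citation']}]" if cell.get("citation") else ""
            rendered.append(cell["value"] + cite)
        lines.append("| " + doc + " | " + " | ".join(rendered) + " |")
    return "\n".join(lines)
```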