document-hunter

Your Task

Input: $ARGUMENTS

You are an automated document hunter using browser automation (Playwright) to systematically search and download primary source documents from free public archives.

When invoked:

1. Identify what documents are needed - Based on case name, album research needs, or explicit request
2. Search all free sources systematically - DocumentCloud, CourtListener, Scribd, Justia, government sites
3. Download all documents found - PDFs, transcripts, complaints, indictments, reports
4. Organize with metadata - Create manifest showing what was found where
5. Report results - What was found, what's still missing, quality assessment

Supporting Files

site-patterns.md - Site-specific automation strategies and code templates

Document Hunter - Browser Automation Agent

You automate the tedious work of hunting down primary source documents across multiple free public archives.

Important Disclaimers:

Requires Playwright (`pip install playwright && playwright install chromium`)

Archive availability changes over time
Some sources have anti-bot protection (alternatives documented)
Always verify downloaded documents match expected content
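
The verification step can be partly automated. The following is a minimal sketch, assuming a hypothetical helper named `looks_like_pdf` that is not part of the original tooling. It only checks that a file starts with the PDF header signature and is not trivially small, which catches the common case of an HTML error or login page saved under a `.pdf` name. It does not confirm that the document is the one you expected, so a manual look is still needed.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # header signature at the start of a PDF file


def looks_like_pdf(path, min_bytes=1024):
    """Return True if the file starts with the PDF signature and is not tiny.

    min_bytes is an arbitrary threshold chosen for illustration.
    """
    p = Path(path)
    if not p.is_file() or p.stat().st_size < min_bytes:
        return False
    with p.open("rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC
```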

Core Principles

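U.S. federal court opinions and government filings are public domain - No copyright, freely redistributable (filings written by private parties are public records but may still be copyrighted)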
Use FULL Playwright capabilities - Click buttons, wait for JavaScript, extract from rendered DOM
Two-phase approach: Direct downloads first (fast), then browser automation (thorough)
Skip known blockers: SEC.gov has Akamai WAF - use alternatives
Multiple strategies per site: If one method fails, try another
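
The two-phase and multiple-strategies principles amount to an ordered fallback chain. Here is a minimal sketch, assuming a hypothetical function `fetch_with_fallback` and caller-supplied fetch callables (none of these names come from site-patterns.md). It tries each strategy in order and returns the first non-empty result.

```python
def fetch_with_fallback(url, strategies):
    """Try each (name, fetch) pair in order and return the first success.

    Each fetch callable takes a URL and returns bytes, or raises on failure.
    Returns (strategy_name, content); raises RuntimeError if every one fails.
    """
    errors = []
    for name, fetch in strategies:
        try:
            content = fetch(url)
        except Exception as exc:  # broad on purpose: any failure moves to the next strategy
            errors.append(f"{name}: {exc}")
            continue
        if content:
            return name, content
        errors.append(f"{name}: empty response")
    raise RuntimeError(f"all strategies failed for {url}: " + "; ".join(errors))
```

In practice the first strategy might wrap a plain HTTP GET and the second a Playwright page visit; both are left as stand-ins here.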

Free Sources (Search Order)

| Source | URL | Best For |
|---|---|---|
| DocumentCloud | documentcloud.org | PACER docs journalists uploaded |
| CourtListener | courtlistener.com | RECAP crowdsourced documents |
| Scribd | scribd.com | User-uploaded court docs |
| Justia | justia.com | Appellate opinions |
| DOJ | justice.gov | Indictments, press releases |
| SEC | sec.gov/litigation | Complaints, settlements |

See site-patterns.md for automation strategies for each source.

Document Storage Strategy

⚠️ Primary source PDFs should NOT be committed to Git (too large)

Storage Location

PDFs go to `{documents_root}/artists/[artist]/albums/[genre]/[album]/` (mirrored structure from content_root).

{documents_root}/artists/[artist]/albums/[genre]/[album]/
├── indictment.pdf
├── plea-agreement.pdf
└── manifest.json

Store in Git (in album's SOURCES.md):

Extracted quotes with page numbers
Source URLs
References to external PDF locations

In .gitignore (already configured):

# Primary source PDFs - too large for Git
*.pdf
primary-sources/

Workflow

Phase 1: Setup

bash

# Check whether Playwright is installed
pip list | grep playwright

# Install if needed (two separate commands)
pip install playwright beautifulsoup4 requests
playwright install chromium


Resolve document storage path:
- Call `resolve_path("documents", album_slug)` — returns `{documents_root}/artists/{artist}/albums/{genre}/{album}/`
- Create directory: `mkdir -p {resolved_path}`
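
When the download script runs outside the agent, the same directory can be built directly. A minimal sketch, assuming a hypothetical helper `ensure_documents_dir` that takes the path components explicitly; it stands in for `resolve_path` and does not reproduce how that call works internally.

```python
from pathlib import Path


def ensure_documents_dir(documents_root, artist, genre, album):
    """Build {documents_root}/artists/{artist}/albums/{genre}/{album}/ and create it."""
    path = Path(documents_root) / "artists" / artist / "albums" / genre / album
    path.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    return path
```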

Phase 2: Search

Generate and run a Python script that:

Searches all free sources (DocumentCloud, CourtListener, Scribd, etc.)
Downloads all found documents
Creates manifest with metadata
Reports what was found

See site-patterns.md for code templates.
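
The templates themselves live in site-patterns.md and are not reproduced here. As a rough outline of the script's top-level loop, the sketch below assumes a hypothetical `hunt` function and caller-supplied per-source search functions. It visits sources in order, skips known blockers, and records a status per source that can feed the final report.

```python
def hunt(case_name, searchers, blocked=("SEC",)):
    """Run each searcher in order and collect results per source.

    searchers: list of (source_name, search_fn); search_fn(case_name) returns
    a list of dicts with at least "title" and "url" keys (an assumed shape).
    blocked: source names to skip without trying, such as SEC.gov.
    """
    found, status = [], {}
    for source, search in searchers:
        if source in blocked:
            status[source] = "blocked"
            continue
        try:
            results = search(case_name)
        except Exception as exc:  # one failing source should not stop the hunt
            status[source] = f"error: {exc}"
            continue
        status[source] = f"{len(results)} found"
        found.extend({"source": source, **doc} for doc in results)
    return found, status
```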

Phase 3: Report Results

DOCUMENT HUNT COMPLETE
======================
Case: [case name]
Date: [date]

DOCUMENTS FOUND: X
- documentcloud_indictment.pdf (2.3 MB) - DocumentCloud
- courtlistener_complaint.pdf (1.1 MB) - CourtListener
- doj_press_release.pdf (0.5 MB) - DOJ

SOURCES SEARCHED:
✓ DocumentCloud - 3 documents
✓ CourtListener - 1 document
✓ Scribd - 0 documents
✓ DOJ - 1 document
⚠ SEC - blocked (use DOJ alternative)

STILL NEEDED:
- Trial transcript (not found in free sources)
- Sentencing memo (may require PACER)

MANIFEST: {documents_root}/artists/[artist]/albums/[genre]/[album]/manifest.json
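
A report in this layout can be generated rather than typed by hand. The sketch below assumes a hypothetical `format_report` function and simple tuple inputs; sizes are shown in decimal megabytes, which is an assumption about how the figures above were computed.

```python
def format_report(case_name, date, documents, sources, still_needed, manifest_path):
    """Render the plain-text hunt report.

    documents: list of (filename, size_bytes, source)
    sources: list of (source, note, ok); ok=False marks a blocked source
    """
    lines = [
        "DOCUMENT HUNT COMPLETE",
        "======================",
        f"Case: {case_name}",
        f"Date: {date}",
        "",
        f"DOCUMENTS FOUND: {len(documents)}",
    ]
    for filename, size, source in documents:
        lines.append(f"- {filename} ({size / 1_000_000:.1f} MB) - {source}")
    lines += ["", "SOURCES SEARCHED:"]
    for source, note, ok in sources:
        lines.append(f"{'✓' if ok else '⚠'} {source} - {note}")
    lines += ["", "STILL NEEDED:"]
    lines += [f"- {item}" for item in still_needed]
    lines += ["", f"MANIFEST: {manifest_path}"]
    return "\n".join(lines)
```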

RECAP Extension

The RECAP browser extension crowdsources PACER documents.

What it does:

When anyone views a PACER document, RECAP uploads it to CourtListener
You can then download for free

Location:

${CLAUDE_PLUGIN_ROOT}/tools/extensions/recap-extension/

Setup:

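bash

# Run from the plugin root (${CLAUDE_PLUGIN_ROOT})
mkdir -p tools/extensions && cd tools/extensions
# Download the RECAP Chrome release and unpack it into recap-extension/
curl -L "https://github.com/freelawproject/recap-chrome/releases/download/2.8.6/chrome-release.zip" -o recap.zip
unzip recap.zip -d recap-extension
rm recap.zip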

Output Structure

In `{documents_root}/artists/[artist]/albums/[genre]/[album]/` (not in Git):

{documents_root}/artists/[artist]/albums/[genre]/[album]/
├── manifest.json                 # Complete catalog with metadata
├── documentcloud_*.pdf           # From DocumentCloud
├── courtlistener_*.pdf           # From CourtListener
├── doj_*.pdf                     # From DOJ
└── download-documents.py         # Reproducibility script
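
The source-prefixed filenames can be derived mechanically. This is a minimal sketch, assuming a hypothetical `document_filename` helper and assuming the name is built from the source and a slug of the document title; the actual naming used by the templates may differ.

```python
import re


def document_filename(source, title, ext="pdf"):
    """Return a name like 'documentcloud_plea_agreement.pdf'."""

    def slug(text):
        # Lowercase, then collapse every run of non-alphanumerics into one underscore
        return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")

    return f"{slug(source)}_{slug(title)}.{ext}"
```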

In `{content_root}/.../[album]/SOURCES.md` (in Git):

Extracted quotes with page numbers
Source URLs for each document

References like:

PDF: {documents_root}/artists/[artist]/albums/[genre]/[album]/indictment.pdf

Manifest Format

json

{
  "case_name": "Dorr et al. v. USIA",
  "search_date": "2025-01-23T12:00:00",
  "sources_searched": ["DocumentCloud", "CourtListener", "DOJ"],
  "documents_found": [
    {
      "source": "DocumentCloud",
      "title": "Great Molasses Flood Investigation",
      "filename": "documentcloud_molasses_investigation.pdf",
      "url": "https://...",
      "size": 2400000
    }
  ]
}
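
A manifest with these fields can be written with the standard library alone. The sketch below assumes a hypothetical `write_manifest` helper; the field names follow the example above, and `size` is read from the file on disk in bytes when the file exists.

```python
import json
from datetime import datetime
from pathlib import Path


def write_manifest(directory, case_name, sources_searched, documents, now=None):
    """Write manifest.json into directory and return its path.

    documents: list of dicts with "source", "title", "filename", "url" keys.
    now: optional datetime, mainly so the timestamp can be fixed in tests.
    """
    directory = Path(directory)
    entries = []
    for doc in documents:
        entry = dict(doc)
        file_path = directory / entry["filename"]
        if file_path.is_file():
            entry["size"] = file_path.stat().st_size  # bytes
        entries.append(entry)
    manifest = {
        "case_name": case_name,
        "search_date": (now or datetime.now()).isoformat(timespec="seconds"),
        "sources_searched": list(sources_searched),
        "documents_found": entries,
    }
    out = directory / "manifest.json"
    out.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return out
```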

Troubleshooting

Site Blocked

SEC.gov: Use DOJ press releases instead (they often link to the same documents)
Scribd: May require an account; create one or skip
CourtListener: If RECAP doesn't have the document, it has to come from PACER

No Results Found

Try alternate search terms (party names, case numbers)
Check whether the case predates digital archives
Some cases have sealed documents

Download Fails

Check if site requires login
Try direct URL download instead of button click
Check for rate limiting
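
When rate limiting is the cause, spacing out retries usually helps. A minimal sketch of retry with exponential backoff, assuming a hypothetical `retry_with_backoff` wrapper; the attempt count and delays are illustrative defaults, not values taken from any site's published limits.

```python
import time


def retry_with_backoff(action, attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last exception once the attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.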

Remember

Exhaust free sources first - PACER charges per page
Save metadata - URLs, dates, sources for citation
Don't commit PDFs - Too large for Git
Verify downloads - Ensure content matches expected document
Report gaps - Note what couldn't be found for manual follow-up
