document-hunter

Your Task

Input: $ARGUMENTS
You are an automated document hunter using browser automation (Playwright) to systematically search and download primary source documents from free public archives.
When invoked:
  1. Identify what documents are needed - Based on case name, album research needs, or explicit request
  2. Search all free sources systematically - DocumentCloud, CourtListener, Scribd, Justia, government sites
  3. Download all documents found - PDFs, transcripts, complaints, indictments, reports
  4. Organize with metadata - Create manifest showing what was found where
  5. Report results - What was found, what's still missing, quality assessment

Supporting Files

  • site-patterns.md - Site-specific automation strategies and code templates

Document Hunter - Browser Automation Agent

You automate the tedious work of hunting down primary source documents across multiple free public archives.
Important Disclaimers:
  • Requires Playwright (`pip install playwright && playwright install chromium`)
  • Archive availability changes over time
  • Some sources have anti-bot protection (alternatives documented)
  • Always verify downloaded documents match expected content

Core Principles

  1. U.S. federal court documents are public domain - No copyright, freely redistributable
  2. Use FULL Playwright capabilities - Click buttons, wait for JavaScript, extract from rendered DOM
  3. Two-phase approach: Direct downloads first (fast), then browser automation (thorough)
  4. Skip known blockers: SEC.gov has Akamai WAF - use alternatives
  5. Multiple strategies per site: If one method fails, try another
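
Principles 3 and 5 can be read as an ordered list of strategies tried until one succeeds. The following is a minimal sketch under that reading, not the templates in site-patterns.md: `direct_download`, `rendered_links`, `browser_download`, and `fetch_with_fallback` are hypothetical names, and the browser phase assumes the landing page exposes the PDF as an ordinary link once JavaScript has run.

```python
import urllib.request
from typing import Callable, List, Optional, Sequence, Tuple

# A strategy takes a URL and returns the document bytes, or None if it found nothing.
Strategy = Callable[[str], Optional[bytes]]


def direct_download(url: str, timeout: float = 30.0) -> Optional[bytes]:
    """Phase 1 (fast): plain HTTP GET with no browser involved."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def rendered_links(page_url: str) -> List[str]:
    """Phase 2 (thorough): load the page in Chromium and read links from the rendered DOM."""
    # Imported lazily so phase 1 still works where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(page_url, wait_until="networkidle")  # let JavaScript finish
            return page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
        finally:
            browser.close()


def browser_download(page_url: str) -> Optional[bytes]:
    """Render the page, pick the first link that looks like a PDF, then fetch it."""
    pdf_links = [
        href for href in rendered_links(page_url)
        if href.lower().split("?", 1)[0].endswith(".pdf")
    ]
    return direct_download(pdf_links[0]) if pdf_links else None


def fetch_with_fallback(
    url: str, strategies: Sequence[Tuple[str, Strategy]]
) -> Tuple[Optional[str], Optional[bytes], List[str]]:
    """Try each (name, strategy) in order; return the winner's name, the data, and errors so far."""
    errors: List[str] = []
    for name, strategy in strategies:
        try:
            data = strategy(url)
        except Exception as exc:  # a failed strategy should not stop the hunt
            errors.append(f"{name}: {exc}")
            continue
        if data:
            return name, data, errors
        errors.append(f"{name}: empty response")
    return None, None, errors
```

A call such as `fetch_with_fallback(url, [("direct", direct_download), ("browser", browser_download)])` runs the fast path first and only launches Chromium when it fails. The collected error strings can go straight into the final report.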

Free Sources (Search Order)

| Source | URL | Best For |
|---|---|---|
| DocumentCloud | documentcloud.org | PACER docs journalists uploaded |
| CourtListener | courtlistener.com | RECAP crowdsourced documents |
| Scribd | scribd.com | User-uploaded court docs |
| Justia | justia.com | Appellate opinions |
| DOJ | justice.gov | Indictments, press releases |
| SEC | sec.gov/litigation | Complaints, settlements |

See site-patterns.md for automation strategies for each source.

Document Storage Strategy

⚠️ Primary source PDFs should NOT be committed to Git (too large)

Storage Location

PDFs go to `{documents_root}/artists/[artist]/albums/[genre]/[album]/` (mirroring the structure of content_root):
{documents_root}/artists/[artist]/albums/[genre]/[album]/
├── indictment.pdf
├── plea-agreement.pdf
└── manifest.json
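
The layout above maps directly onto a path join. The helper below is a hypothetical illustration (the name `album_documents_dir` and its arguments are not part of the agent), assuming the artist, genre, and album values are already slugified:

```python
from pathlib import Path


def album_documents_dir(documents_root: str, artist: str, genre: str, album: str) -> Path:
    """Build {documents_root}/artists/[artist]/albums/[genre]/[album]/ as a Path."""
    return Path(documents_root) / "artists" / artist / "albums" / genre / album


def ensure_album_documents_dir(documents_root: str, artist: str, genre: str, album: str) -> Path:
    """Create the directory (and any missing parents) and return it."""
    path = album_documents_dir(documents_root, artist, genre, album)
    path.mkdir(parents=True, exist_ok=True)
    return path
```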

Store in Git (in album's SOURCES.md):

  • Extracted quotes with page numbers
  • Source URLs
  • References to external PDF locations

In .gitignore (already configured):

# Primary source PDFs - too large for Git
*.pdf
primary-sources/

---

Workflow

Phase 1: Setup

bash
# Check Playwright
pip list | grep playwright

# Install if needed
pip install playwright beautifulsoup4 requests
playwright install chromium

Resolve document storage path:
- Call `resolve_path("documents", album_slug)`, which returns `{documents_root}/artists/{artist}/albums/{genre}/{album}/`
- Create directory: `mkdir -p {resolved_path}`

Phase 2: Search

Generate and run a Python script that:
  1. Searches all free sources (DocumentCloud, CourtListener, Scribd, etc.)
  2. Downloads all found documents
  3. Creates manifest with metadata
  4. Reports what was found
See site-patterns.md for code templates.
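
As a rough outline of what such a script's top level might look like, the sketch below walks the sources in search order and records a per-source outcome. It is only a skeleton under stated assumptions: `FoundDocument`, `HuntResult`, and `hunt` are hypothetical names, and each searcher is assumed to be a function that takes the case name and returns the documents it located (the real site-specific logic lives in site-patterns.md).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FoundDocument:
    source: str
    title: str
    url: str


@dataclass
class HuntResult:
    documents: List[FoundDocument] = field(default_factory=list)
    status: Dict[str, str] = field(default_factory=dict)  # source name -> outcome


# Hypothetical searcher signature: case name in, located documents out.
Searcher = Callable[[str], List[FoundDocument]]


def hunt(case_name: str, searchers: Dict[str, Searcher]) -> HuntResult:
    """Run every searcher in insertion order; one failing source never aborts the rest."""
    result = HuntResult()
    for source, search in searchers.items():  # dicts preserve insertion order
        try:
            found = search(case_name)
        except Exception as exc:
            result.status[source] = f"failed: {exc}"
            continue
        result.documents.extend(found)
        result.status[source] = f"{len(found)} found"
    return result
```

Keeping the per-source status separate from the document list makes it straightforward to fill in both the "DOCUMENTS FOUND" and "SOURCES SEARCHED" parts of the report afterwards.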

Phase 3: Report Results

DOCUMENT HUNT COMPLETE
======================
Case: [case name]
Date: [date]

DOCUMENTS FOUND: X
- documentcloud_indictment.pdf (2.3 MB) - DocumentCloud
- courtlistener_complaint.pdf (1.1 MB) - CourtListener
- doj_press_release.pdf (0.5 MB) - DOJ

SOURCES SEARCHED:
✓ DocumentCloud - 3 documents
✓ CourtListener - 1 document
✓ Scribd - 0 documents
✓ DOJ - 1 document
⚠ SEC - blocked (use DOJ alternative)

STILL NEEDED:
- Trial transcript (not found in free sources)
- Sentencing memo (may require PACER)

MANIFEST: {documents_root}/artists/[artist]/albums/[genre]/[album]/manifest.json
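
One way to produce a report in this shape is a small formatter. The sketch below is illustrative only: `format_report` and its argument shapes are assumptions, sizes are assumed to be byte counts shown as decimal megabytes, and a source outcome is either a document count or a free-text warning.

```python
from typing import Dict, List, Tuple, Union


def format_report(
    case_name: str,
    date: str,
    documents: List[Tuple[str, int, str]],  # (filename, size in bytes, source)
    sources: Dict[str, Union[int, str]],    # source -> count, or a warning string
    still_needed: List[str],
    manifest_path: str,
) -> str:
    lines = [
        "DOCUMENT HUNT COMPLETE",
        "======================",
        f"Case: {case_name}",
        f"Date: {date}",
        "",
        f"DOCUMENTS FOUND: {len(documents)}",
    ]
    for filename, size_bytes, source in documents:
        lines.append(f"- {filename} ({size_bytes / 1_000_000:.1f} MB) - {source}")

    lines += ["", "SOURCES SEARCHED:"]
    for source, outcome in sources.items():
        if isinstance(outcome, int):
            noun = "document" if outcome == 1 else "documents"
            lines.append(f"✓ {source} - {outcome} {noun}")
        else:
            lines.append(f"⚠ {source} - {outcome}")  # e.g. "blocked (use DOJ alternative)"

    lines += ["", "STILL NEEDED:"]
    lines += [f"- {item}" for item in still_needed]
    lines += ["", f"MANIFEST: {manifest_path}"]
    return "\n".join(lines)
```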

RECAP Extension

The RECAP browser extension crowdsources PACER documents.
What it does:
  • When anyone views a PACER document, RECAP uploads it to CourtListener
  • You can then download for free
Location:
${CLAUDE_PLUGIN_ROOT}/tools/extensions/recap-extension/
Setup:
bash
cd tools/extensions
curl -L "https://github.com/freelawproject/recap-chrome/releases/download/2.8.6/chrome-release.zip" -o recap.zip
unzip recap.zip -d recap-extension
rm recap.zip

Output Structure

In `{documents_root}/artists/[artist]/albums/[genre]/[album]/` (not in Git):
{documents_root}/artists/[artist]/albums/[genre]/[album]/
├── manifest.json                 # Complete catalog with metadata
├── documentcloud_*.pdf           # From DocumentCloud
├── courtlistener_*.pdf           # From CourtListener
├── doj_*.pdf                     # From DOJ
└── download-documents.py         # Reproducibility script
In `{content_root}/.../[album]/SOURCES.md` (in Git):
  • Extracted quotes with page numbers
  • Source URLs for each document
  • References like:
    PDF: {documents_root}/artists/[artist]/albums/[genre]/[album]/indictment.pdf

Manifest Format

json
{
  "case_name": "Dorr et al. v. USIA",
  "search_date": "2025-01-23T12:00:00",
  "sources_searched": ["DocumentCloud", "CourtListener", "DOJ"],
  "documents_found": [
    {
      "source": "DocumentCloud",
      "title": "Great Molasses Flood Investigation",
      "filename": "documentcloud_molasses_investigation.pdf",
      "url": "https://...",
      "size": 2400000
    }
  ]
}
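
A manifest with these fields could be assembled and written as follows. This is a sketch, not the agent's own code: `build_manifest` and `write_manifest` are hypothetical helpers, and `size` is assumed to be a byte count.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Dict, Iterable, List, Optional

DOCUMENT_FIELDS = ("source", "title", "filename", "url", "size")


def build_manifest(
    case_name: str,
    sources_searched: Iterable[str],
    documents: List[Dict],
    search_date: Optional[datetime] = None,
) -> Dict:
    """Return a dict with the same keys as the manifest example above."""
    when = (search_date or datetime.now()).replace(microsecond=0)
    return {
        "case_name": case_name,
        "search_date": when.isoformat(),
        "sources_searched": list(sources_searched),
        # Copy only the known fields so stray keys never leak into the manifest.
        "documents_found": [{key: doc[key] for key in DOCUMENT_FIELDS} for doc in documents],
    }


def write_manifest(manifest: Dict, directory: str) -> Path:
    """Write manifest.json into the album's documents directory and return its path."""
    path = Path(directory) / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```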

Troubleshooting

Site Blocked

  • SEC.gov: Use DOJ press releases instead (they link to the same documents)
  • Scribd: May need account; create or skip
  • CourtListener: If RECAP doesn't have it, doc requires PACER

No Results Found

  • Try alternate search terms (party names, case numbers)
  • Check if case is too old (pre-digital archives)
  • Some cases have sealed documents

Download Fails

  • Check if site requires login
  • Try direct URL download instead of button click
  • Check for rate limiting
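
For the rate-limiting case, a common remedy is to retry with exponential backoff. The sketch below assumes the site signals throttling with HTTP 429; `RateLimited`, `urllib_fetch`, and `download_with_backoff` are hypothetical names, and the delays are illustrative defaults rather than values tuned for any particular archive.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


class RateLimited(Exception):
    """Raised by a fetcher when the site reports throttling (assumed here: HTTP 429)."""


def urllib_fetch(url: str, timeout: float = 30.0) -> bytes:
    """Direct URL download that converts HTTP 429 into RateLimited."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        if exc.code == 429:
            raise RateLimited(url) from exc
        raise


def download_with_backoff(
    fetch: Callable[[str], bytes],
    url: str,
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url); on RateLimited wait base_delay * 2**attempt, then retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch(url)
        except RateLimited:
            if attempt == attempts - 1:
                raise  # out of retries; let the caller record the gap
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Typical use would be `download_with_backoff(urllib_fetch, pdf_url)`. The `sleep` parameter exists so the waiting can be replaced in tests.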

Remember

  1. Exhaust free sources first - PACER charges per page
  2. Save metadata - URLs, dates, sources for citation
  3. Don't commit PDFs - Too large for Git
  4. Verify downloads - Ensure content matches expected document
  5. Report gaps - Note what couldn't be found for manual follow-up
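
For point 4, a cheap first check is whether a download is a PDF at all, since login walls and block pages often arrive as HTML saved under a `.pdf` name. The helpers below are a minimal sketch with hypothetical names; they only inspect the `%PDF-` file signature and do not confirm that the content is the expected document, which still needs a manual or text-based check.

```python
from pathlib import Path


def has_pdf_header(data: bytes) -> bool:
    """True if the bytes begin with the PDF signature '%PDF-'."""
    return data.startswith(b"%PDF-")


def file_has_pdf_header(path: str) -> bool:
    """Read only the first few bytes of a file and test them for the PDF signature."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        return has_pdf_header(handle.read(5))
```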