web-fetch

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Web Fetch

网页抓取工具

Controls a headless Chrome instance via Chrome DevTools Protocol (CDP) to fetch web pages with full JavaScript rendering. Handles JS redirects, SPAs, and dynamic content. Extracts clean text from any URL.
通过**Chrome DevTools Protocol (CDP)**控制无头Chrome实例,获取支持完整JavaScript渲染的网页。可处理JS重定向、单页应用(SPA)和动态内容,从任意URL提取干净的文本内容。

Prerequisites

前置条件

  • Google Chrome or Chromium: Must be installed on the system.
    • macOS:
      brew install --cask google-chrome
    • Ubuntu/Debian:
      sudo apt install google-chrome-stable
    • Windows:
      winget install Google.Chrome
  • Python 3: Standard library only. Zero pip dependencies. Optionally
    trafilatura
    for better content extraction.
  • No API keys or tokens required.
  • Google Chrome 或 Chromium:必须已安装在系统中。
    • macOS:
      brew install --cask google-chrome
    • Ubuntu/Debian:
      sudo apt install google-chrome-stable
    • Windows:
      winget install Google.Chrome
  • Python 3:仅需标准库,无需通过pip安装任何依赖。可选安装
    trafilatura
    以获得更好的内容提取效果。
  • 无需API密钥或令牌。

Usage

使用方法

Single URL

单个URL

bash
python <skill-path>/scripts/web_fetch.py --url "https://example.com/article"
bash
python <skill-path>/scripts/web_fetch.py --url "https://example.com/article"

Multiple URLs (single Chrome session, efficient)

多个URL(复用单个Chrome会话,效率更高)

bash
python <skill-path>/scripts/web_fetch.py --url "https://example.com/a" --url "https://example.com/b"
bash
python <skill-path>/scripts/web_fetch.py --url "https://example.com/a" --url "https://example.com/b"

Piped input (one URL per line)

管道输入(每行一个URL)

bash
echo "https://example.com/article" | python <skill-path>/scripts/web_fetch.py
bash
echo "https://example.com/article" | python <skill-path>/scripts/web_fetch.py

JSON output to file (recommended for multiple URLs)

将JSON输出写入文件(推荐用于多URL场景)

bash
undefined
bash
undefined

Cross-platform temp dir

跨平台临时目录

TMPDIR=$(python -c "import tempfile; print(tempfile.gettempdir())") python <skill-path>/scripts/web_fetch.py --url "https://example.com" --format json -o "$TMPDIR/results.json"
undefined
TMPDIR=$(python -c "import tempfile; print(tempfile.gettempdir())") python <skill-path>/scripts/web_fetch.py --url "https://example.com" --format json -o "$TMPDIR/results.json"
undefined

Options

选项

FlagDefaultDescription
--url
URL to fetch (repeatable)
--format
text
Output format:
text
or
json
-o, --output
stdoutWrite to file instead of stdout (recommended for large results)
--wait-ms
1500
Wait time for JS rendering in milliseconds
--timeout
15
Per-URL timeout in seconds
--max-chars
15000
Max characters per page
--batch-size
3
URLs per batch (Chrome restarts between batches for stability)
--port
autoCDP debugging port (default: auto-detect free port)
标志默认值描述
--url
要获取的URL(可重复指定)
--format
text
输出格式:
text
json
-o, --output
stdout将输出写入文件而非标准输出(大结果场景推荐使用)
--wait-ms
1500
JS渲染等待时间(毫秒)
--timeout
15
每个URL的超时时间(秒)
--max-chars
15000
每页最大字符数
--batch-size
3
每批处理的URL数量(为保证稳定性,批处理之间会重启Chrome)
--port
autoCDP调试端口(默认:自动检测可用端口)

Output

输出结果

Text mode (default)

文本模式(默认)

Plain text content of each URL, separated by
---
.
每个URL的纯文本内容,以
---
分隔。

JSON mode

JSON模式

json
[
  {
    "url": "https://news.google.com/rss/articles/...",
    "final_url": "https://www.theguardian.com/actual-article",
    "success": true,
    "content": "Article text content..."
  }
]
On failure:
json
[
  {
    "url": "https://example.com/broken",
    "success": false,
    "error": "Page load timeout"
  }
]
json
[
  {
    "url": "https://news.google.com/rss/articles/...",
    "final_url": "https://www.theguardian.com/actual-article",
    "success": true,
    "content": "Article text content..."
  }
]
失败时的输出:
json
[
  {
    "url": "https://example.com/broken",
    "success": false,
    "error": "Page load timeout"
  }
]

How It Works

工作原理

  1. Finds Chrome — Checks platform-specific paths, then searches PATH
  2. Launches isolated Chrome
    --headless=new --remote-debugging-port=PORT --user-data-dir=TMPDIR
    (never conflicts with your running Chrome)
  3. Controls via CDP — Connects over WebSocket using a pure-Python CDP client (no dependencies)
  4. Navigates & waits — Uses
    Page.navigate
    , listens for
    loadEventFired
    , then waits for JS rendering
  5. Captures content — Gets final URL via
    Runtime.evaluate
    , extracts DOM via
    DOM.getOuterHTML
  6. Extracts text — Uses
    trafilatura
    (if installed) or falls back to HTML tag stripping
  7. Cleans up — Terminates Chrome and removes temp profile
  1. 查找Chrome — 检查平台特定路径,然后搜索系统PATH
  2. 启动独立Chrome实例 — 使用
    --headless=new --remote-debugging-port=PORT --user-data-dir=TMPDIR
    参数(绝不会与您正在运行的Chrome产生冲突)
  3. 通过CDP控制 — 使用纯Python编写的CDP客户端通过WebSocket连接(无依赖)
  4. 导航与等待 — 调用
    Page.navigate
    ,监听
    loadEventFired
    事件,然后等待JS渲染完成
  5. 捕获内容 — 通过
    Runtime.evaluate
    获取最终URL,通过
    DOM.getOuterHTML
    提取DOM内容
  6. 提取文本 — 使用
    trafilatura
    (若已安装),否则回退到HTML标签剥离方式
  7. 清理资源 — 终止Chrome进程并删除临时配置文件

Key Advantages

核心优势

  • No profile conflicts — Uses isolated
    --user-data-dir
    , works alongside your running Chrome
  • Handles JS redirects — Google News, URL shorteners, SPA routing all work
  • Zero dependencies — Pure Python 3 stdlib (WebSocket client included)
  • Single Chrome session — Multiple URLs reuse one instance (fast)
  • Cross-platform — Mac, Windows, Linux
  • 无配置文件冲突 — 使用独立的
    --user-data-dir
    ,可与您正在运行的Chrome同时使用
  • 支持JS重定向 — Google News、URL短链接、SPA路由均能正常处理
  • 零依赖 — 仅需Python 3标准库(内置WebSocket客户端)
  • 复用Chrome会话 — 多个URL复用同一个实例(速度更快)
  • 跨平台 — 支持Mac、Windows、Linux

Cross-Platform Support

跨平台支持

PlatformChrome Paths Checked
macOS
/Applications/Google Chrome.app/...
, Chromium.app
Linux
/usr/bin/google-chrome
, chromium, snap chromium
Windows
%ProgramFiles%\Google\Chrome\...
,
%LocalAppData%\...
Falls back to searching
PATH
on all platforms.
平台检查的Chrome路径
macOS
/Applications/Google Chrome.app/...
, Chromium.app
Linux
/usr/bin/google-chrome
, chromium, snap chromium
Windows
%ProgramFiles%\Google\Chrome\...
,
%LocalAppData%\...
所有平台均会回退到搜索系统PATH。

Notes

注意事项

  • No configuration needed — Just have Chrome installed. No need to enable "remote debugging" or change any Chrome settings. The script handles everything automatically.
  • macOS first run — The system may show a popup asking "Do you want the application Google Chrome to accept incoming network connections?" Click Allow. This is because CDP uses a local port (
    127.0.0.1
    ) — no external network traffic is involved.
  • 无需配置 — 只需安装Chrome即可。无需启用“远程调试”或修改任何Chrome设置,脚本会自动处理所有配置。
  • macOS首次运行 — 系统可能会弹出提示“是否允许应用Google Chrome接受传入网络连接?”请点击允许。这是因为CDP使用本地端口(
    127.0.0.1
    )——不会涉及外部网络流量。