web-fetch

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Web Fetch

网页抓取工具

Controls a headless Chrome instance via Chrome DevTools Protocol (CDP) to fetch web pages with full JavaScript rendering. Handles JS redirects, SPAs, and dynamic content. Extracts clean text from any URL.

通过**Chrome DevTools Protocol (CDP)**控制无头Chrome实例，获取支持完整JavaScript渲染的网页。可处理JS重定向、单页应用（SPA）和动态内容，从任意URL提取干净的文本内容。

Prerequisites

前置条件

Google Chrome or Chromium: Must be installed on the system.

macOS:
```
brew install --cask google-chrome
```
Ubuntu/Debian:
```
sudo apt install google-chrome-stable
```
Windows:
```
winget install Google.Chrome
```

Python 3: Standard library only. Zero pip dependencies. Optionally
```
trafilatura
```
for better content extraction.
No API keys or tokens required.

Google Chrome 或 Chromium：必须已安装在系统中。

macOS：
```
brew install --cask google-chrome
```
Ubuntu/Debian：
```
sudo apt install google-chrome-stable
```
Windows：
```
winget install Google.Chrome
```

Python 3：仅需标准库，无需通过pip安装任何依赖。可选安装
```
trafilatura
```
以获得更好的内容提取效果。
无需API密钥或令牌。

Usage

使用方法

Single URL

单个URL

bash

python <skill-path>/scripts/web_fetch.py --url "https://example.com/article"

bash

python <skill-path>/scripts/web_fetch.py --url "https://example.com/article"

Multiple URLs (single Chrome session, efficient)

多个URL（复用单个Chrome会话，效率更高）

bash

python <skill-path>/scripts/web_fetch.py --url "https://example.com/a" --url "https://example.com/b"

bash

python <skill-path>/scripts/web_fetch.py --url "https://example.com/a" --url "https://example.com/b"

Piped input (one URL per line)

管道输入（每行一个URL）

bash

echo "https://example.com/article" | python <skill-path>/scripts/web_fetch.py

bash

echo "https://example.com/article" | python <skill-path>/scripts/web_fetch.py

JSON output to file (recommended for multiple URLs)

将JSON输出写入文件（推荐用于多URL场景）

bash

undefined

bash

undefined

Cross-platform temp dir

跨平台临时目录

TMPDIR=$(python -c "import tempfile; print(tempfile.gettempdir())") python <skill-path>/scripts/web_fetch.py --url "https://example.com" --format json -o "$TMPDIR/results.json"

undefined

TMPDIR=$(python -c "import tempfile; print(tempfile.gettempdir())") python <skill-path>/scripts/web_fetch.py --url "https://example.com" --format json -o "$TMPDIR/results.json"

undefined

Options

选项

Flag	Default	Description
`--url`	—	URL to fetch (repeatable)
`--format`	`text`	Output format: `text` or `json`
`-o, --output`	stdout	Write to file instead of stdout (recommended for large results)
`--wait-ms`	`1500`	Wait time for JS rendering in milliseconds
`--timeout`	`15`	Per-URL timeout in seconds
`--max-chars`	`15000`	Max characters per page
`--batch-size`	`3`	URLs per batch (Chrome restarts between batches for stability)
`--port`	auto	CDP debugging port (default: auto-detect free port)

标志	默认值	描述
`--url`	—	要获取的URL（可重复指定）
`--format`	`text`	输出格式： `text` 或 `json`
`-o, --output`	stdout	将输出写入文件而非标准输出（大结果场景推荐使用）
`--wait-ms`	`1500`	JS渲染等待时间（毫秒）
`--timeout`	`15`	每个URL的超时时间（秒）
`--max-chars`	`15000`	每页最大字符数
`--batch-size`	`3`	每批处理的URL数量（为保证稳定性，批处理之间会重启Chrome）
`--port`	auto	CDP调试端口（默认：自动检测可用端口）

Output

输出结果

Text mode (default)

文本模式（默认）

Plain text content of each URL, separated by

---

每个URL的纯文本内容，以

---

分隔。

JSON mode

JSON模式

json

[
  {
    "url": "https://news.google.com/rss/articles/...",
    "final_url": "https://www.theguardian.com/actual-article",
    "success": true,
    "content": "Article text content..."
  }
]

On failure:

json

[
  {
    "url": "https://example.com/broken",
    "success": false,
    "error": "Page load timeout"
  }
]

json

[
  {
    "url": "https://news.google.com/rss/articles/...",
    "final_url": "https://www.theguardian.com/actual-article",
    "success": true,
    "content": "Article text content..."
  }
]

失败时的输出：

json

[
  {
    "url": "https://example.com/broken",
    "success": false,
    "error": "Page load timeout"
  }
]

How It Works

工作原理

Finds Chrome — Checks platform-specific paths, then searches PATH
Launches isolated Chrome —
```
--headless=new --remote-debugging-port=PORT --user-data-dir=TMPDIR
```
(never conflicts with your running Chrome)
Controls via CDP — Connects over WebSocket using a pure-Python CDP client (no dependencies)
Navigates & waits — Uses
```
Page.navigate
```
, listens for
```
loadEventFired
```
, then waits for JS rendering
Captures content — Gets final URL via
```
Runtime.evaluate
```
, extracts DOM via
```
DOM.getOuterHTML
```
Extracts text — Uses
```
trafilatura
```
(if installed) or falls back to HTML tag stripping
Cleans up — Terminates Chrome and removes temp profile

查找Chrome — 检查平台特定路径，然后搜索系统PATH
启动独立Chrome实例 — 使用
```
--headless=new --remote-debugging-port=PORT --user-data-dir=TMPDIR
```
参数（绝不会与您正在运行的Chrome产生冲突）
通过CDP控制 — 使用纯Python编写的CDP客户端通过WebSocket连接（无依赖）
导航与等待 — 调用
```
Page.navigate
```
，监听
```
loadEventFired
```
事件，然后等待JS渲染完成
捕获内容 — 通过
```
Runtime.evaluate
```
获取最终URL，通过
```
DOM.getOuterHTML
```
提取DOM内容
提取文本 — 使用
```
trafilatura
```
（若已安装），否则回退到HTML标签剥离方式
清理资源 — 终止Chrome进程并删除临时配置文件

Key Advantages

核心优势

No profile conflicts — Uses isolated
```
--user-data-dir
```
, works alongside your running Chrome
Handles JS redirects — Google News, URL shorteners, SPA routing all work
Zero dependencies — Pure Python 3 stdlib (WebSocket client included)
Single Chrome session — Multiple URLs reuse one instance (fast)
Cross-platform — Mac, Windows, Linux

无配置文件冲突 — 使用独立的
```
--user-data-dir
```
，可与您正在运行的Chrome同时使用
支持JS重定向 — Google News、URL短链接、SPA路由均能正常处理
零依赖 — 仅需Python 3标准库（内置WebSocket客户端）
复用Chrome会话 — 多个URL复用同一个实例（速度更快）
跨平台 — 支持Mac、Windows、Linux

Cross-Platform Support

跨平台支持

Platform	Chrome Paths Checked
macOS	`/Applications/Google Chrome.app/...` , Chromium.app
Linux	`/usr/bin/google-chrome` , chromium, snap chromium
Windows	`%ProgramFiles%\Google\Chrome\...` , `%LocalAppData%\...`

Falls back to searching

PATH

on all platforms.

平台	检查的Chrome路径
macOS	`/Applications/Google Chrome.app/...` , Chromium.app
Linux	`/usr/bin/google-chrome` , chromium, snap chromium
Windows	`%ProgramFiles%\Google\Chrome\...` , `%LocalAppData%\...`

所有平台均会回退到搜索系统PATH。

Notes

注意事项

No configuration needed — Just have Chrome installed. No need to enable "remote debugging" or change any Chrome settings. The script handles everything automatically.
macOS first run — The system may show a popup asking "Do you want the application Google Chrome to accept incoming network connections?" Click Allow. This is because CDP uses a local port (
```
127.0.0.1
```
) — no external network traffic is involved.

无需配置 — 只需安装Chrome即可。无需启用“远程调试”或修改任何Chrome设置，脚本会自动处理所有配置。
macOS首次运行 — 系统可能会弹出提示“是否允许应用Google Chrome接受传入网络连接？”请点击允许。这是因为CDP使用本地端口（
```
127.0.0.1
```
）——不会涉及外部网络流量。