web-fetch
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseWeb Fetch
网页抓取工具
Controls a headless Chrome instance via Chrome DevTools Protocol (CDP) to fetch web pages with full JavaScript rendering. Handles JS redirects, SPAs, and dynamic content. Extracts clean text from any URL.
通过**Chrome DevTools Protocol (CDP)**控制无头Chrome实例,获取支持完整JavaScript渲染的网页。可处理JS重定向、单页应用(SPA)和动态内容,从任意URL提取干净的文本内容。
Prerequisites
前置条件
- Google Chrome or Chromium: Must be installed on the system.
- macOS:
brew install --cask google-chrome - Ubuntu/Debian:
sudo apt install google-chrome-stable - Windows:
winget install Google.Chrome
- macOS:
- Python 3: Standard library only. Zero pip dependencies. Optionally for better content extraction.
trafilatura - No API keys or tokens required.
- Google Chrome 或 Chromium:必须已安装在系统中。
- macOS:
brew install --cask google-chrome - Ubuntu/Debian:
sudo apt install google-chrome-stable - Windows:
winget install Google.Chrome
- macOS:
- Python 3:仅需标准库,无需通过pip安装任何依赖。可选安装以获得更好的内容提取效果。
trafilatura - 无需API密钥或令牌。
Usage
使用方法
Single URL
单个URL
bash
python <skill-path>/scripts/web_fetch.py --url "https://example.com/article"bash
python <skill-path>/scripts/web_fetch.py --url "https://example.com/article"Multiple URLs (single Chrome session, efficient)
多个URL(复用单个Chrome会话,效率更高)
bash
python <skill-path>/scripts/web_fetch.py --url "https://example.com/a" --url "https://example.com/b"bash
python <skill-path>/scripts/web_fetch.py --url "https://example.com/a" --url "https://example.com/b"Piped input (one URL per line)
管道输入(每行一个URL)
bash
echo "https://example.com/article" | python <skill-path>/scripts/web_fetch.pybash
echo "https://example.com/article" | python <skill-path>/scripts/web_fetch.pyJSON output to file (recommended for multiple URLs)
将JSON输出写入文件(推荐用于多URL场景)
bash
undefinedbash
undefinedCross-platform temp dir
跨平台临时目录
TMPDIR=$(python -c "import tempfile; print(tempfile.gettempdir())")
python <skill-path>/scripts/web_fetch.py --url "https://example.com" --format json -o "$TMPDIR/results.json"
undefinedTMPDIR=$(python -c "import tempfile; print(tempfile.gettempdir())")
python <skill-path>/scripts/web_fetch.py --url "https://example.com" --format json -o "$TMPDIR/results.json"
undefinedOptions
选项
| Flag | Default | Description |
|---|---|---|
| — | URL to fetch (repeatable) |
| | Output format: |
| stdout | Write to file instead of stdout (recommended for large results) |
| | Wait time for JS rendering in milliseconds |
| | Per-URL timeout in seconds |
| | Max characters per page |
| | URLs per batch (Chrome restarts between batches for stability) |
| auto | CDP debugging port (default: auto-detect free port) |
| 标志 | 默认值 | 描述 |
|---|---|---|
| — | 要获取的URL(可重复指定) |
| | 输出格式: |
| stdout | 将输出写入文件而非标准输出(大结果场景推荐使用) |
| | JS渲染等待时间(毫秒) |
| | 每个URL的超时时间(秒) |
| | 每页最大字符数 |
| | 每批处理的URL数量(为保证稳定性,批处理之间会重启Chrome) |
| auto | CDP调试端口(默认:自动检测可用端口) |
Output
输出结果
Text mode (default)
文本模式(默认)
Plain text content of each URL, separated by .
---每个URL的纯文本内容,以分隔。
---JSON mode
JSON模式
json
[
{
"url": "https://news.google.com/rss/articles/...",
"final_url": "https://www.theguardian.com/actual-article",
"success": true,
"content": "Article text content..."
}
]On failure:
json
[
{
"url": "https://example.com/broken",
"success": false,
"error": "Page load timeout"
}
]json
[
{
"url": "https://news.google.com/rss/articles/...",
"final_url": "https://www.theguardian.com/actual-article",
"success": true,
"content": "Article text content..."
}
]失败时的输出:
json
[
{
"url": "https://example.com/broken",
"success": false,
"error": "Page load timeout"
}
]How It Works
工作原理
- Finds Chrome — Checks platform-specific paths, then searches PATH
- Launches isolated Chrome — (never conflicts with your running Chrome)
--headless=new --remote-debugging-port=PORT --user-data-dir=TMPDIR - Controls via CDP — Connects over WebSocket using a pure-Python CDP client (no dependencies)
- Navigates & waits — Uses , listens for
Page.navigate, then waits for JS renderingloadEventFired - Captures content — Gets final URL via , extracts DOM via
Runtime.evaluateDOM.getOuterHTML - Extracts text — Uses (if installed) or falls back to HTML tag stripping
trafilatura - Cleans up — Terminates Chrome and removes temp profile
- 查找Chrome — 检查平台特定路径,然后搜索系统PATH
- 启动独立Chrome实例 — 使用参数(绝不会与您正在运行的Chrome产生冲突)
--headless=new --remote-debugging-port=PORT --user-data-dir=TMPDIR - 通过CDP控制 — 使用纯Python编写的CDP客户端通过WebSocket连接(无依赖)
- 导航与等待 — 调用,监听
Page.navigate事件,然后等待JS渲染完成loadEventFired - 捕获内容 — 通过获取最终URL,通过
Runtime.evaluate提取DOM内容DOM.getOuterHTML - 提取文本 — 使用(若已安装),否则回退到HTML标签剥离方式
trafilatura - 清理资源 — 终止Chrome进程并删除临时配置文件
Key Advantages
核心优势
- No profile conflicts — Uses isolated , works alongside your running Chrome
--user-data-dir - Handles JS redirects — Google News, URL shorteners, SPA routing all work
- Zero dependencies — Pure Python 3 stdlib (WebSocket client included)
- Single Chrome session — Multiple URLs reuse one instance (fast)
- Cross-platform — Mac, Windows, Linux
- 无配置文件冲突 — 使用独立的,可与您正在运行的Chrome同时使用
--user-data-dir - 支持JS重定向 — Google News、URL短链接、SPA路由均能正常处理
- 零依赖 — 仅需Python 3标准库(内置WebSocket客户端)
- 复用Chrome会话 — 多个URL复用同一个实例(速度更快)
- 跨平台 — 支持Mac、Windows、Linux
Cross-Platform Support
跨平台支持
| Platform | Chrome Paths Checked |
|---|---|
| macOS | |
| Linux | |
| Windows | |
Falls back to searching on all platforms.
PATH| 平台 | 检查的Chrome路径 |
|---|---|
| macOS | |
| Linux | |
| Windows | |
所有平台均会回退到搜索系统PATH。
Notes
注意事项
- No configuration needed — Just have Chrome installed. No need to enable "remote debugging" or change any Chrome settings. The script handles everything automatically.
- macOS first run — The system may show a popup asking "Do you want the application Google Chrome to accept incoming network connections?" Click Allow. This is because CDP uses a local port () — no external network traffic is involved.
127.0.0.1
- 无需配置 — 只需安装Chrome即可。无需启用“远程调试”或修改任何Chrome设置,脚本会自动处理所有配置。
- macOS首次运行 — 系统可能会弹出提示“是否允许应用Google Chrome接受传入网络连接?”请点击允许。这是因为CDP使用本地端口()——不会涉及外部网络流量。
127.0.0.1