web-fetcher

Web Fetcher

Extract web page content as clean text/markdown from a given URL using a fallback chain of free services.

Usage

    python3 <skill-path>/scripts/fetch.py <url>

Save to file:

    python3 <skill-path>/scripts/fetch.py <url> -o output.md

Fallback Chain

The script tries these sources in order, falling back to the next one on failure:
  1. Jina Reader (r.jina.ai/{url}): best markdown quality, supports JS-rendered pages
  2. defuddle.md (defuddle.md/{url}): by Obsidian creator @kepano
  3. markdown.new (markdown.new/{url}): 3-layer strategy with browser rendering fallback
  4. OpenCLI: platform-specific commands with browser login state (zhihu, reddit, twitter, weibo)
  5. Raw HTML: direct fetch as last resort
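
The ordering above can be illustrated with a minimal Python sketch. This is not the actual fetch.py: the `https://` scheme on the service prefixes, the function names, and the "empty response counts as failure" rule are assumptions made for illustration, and the OpenCLI step is left out because it depends on a local browser setup.

```python
import urllib.request

# Prefix-style services from the list above, in fallback order.
# The https:// scheme is an assumption; the list only gives host/{url}.
PREFIX_SERVICES = [
    ("jina", "https://r.jina.ai/"),
    ("defuddle", "https://defuddle.md/"),
    ("markdown.new", "https://markdown.new/"),
]


def http_get(url, timeout=15.0):
    """Plain GET that returns the response body as text."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "web-fetcher-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def build_candidates(url):
    """Return (source name, request URL) pairs, ending with the raw page."""
    candidates = [(name, prefix + url) for name, prefix in PREFIX_SERVICES]
    candidates.append(("raw", url))
    return candidates


def fetch_with_fallback(url, fetch=http_get):
    """Try each source in order and return (source name, text) for the first success."""
    errors = []
    for name, candidate in build_candidates(url):
        try:
            text = fetch(candidate)
        except Exception as exc:  # any failure moves on to the next source
            errors.append((name, str(exc)))
            continue
        if text and text.strip():
            return name, text
        errors.append((name, "empty response"))
    raise RuntimeError(f"all sources failed: {errors}")
```

Passing `fetch` as a parameter keeps the fallback loop testable without network access; a real script would likely also check HTTP status codes and content quality before accepting a result.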

When to Use

  • JS-rendered pages that WebFetch can't handle (Twitter/X, SPAs)
  • Login-required pages on supported platforms (zhihu, reddit, twitter, weibo, xiaohongshu)
  • Bulk content extraction
  • When you need clean markdown instead of summarized content
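
For the bulk extraction case, one possible approach is to call the script once per URL with the documented `-o` flag. The sketch below is hypothetical: the helper names and the output file naming scheme are illustrative, and it assumes the script exits with a non-zero status when every source fails, which the documentation does not state.

```python
import re
import subprocess
from pathlib import Path


def output_name(url):
    """Derive a filesystem-friendly .md file name from a URL (illustrative scheme)."""
    slug = re.sub(r"^https?://", "", url)
    slug = re.sub(r"[^A-Za-z0-9]+", "-", slug).strip("-")
    return (slug[:80] or "page") + ".md"


def build_command(script, url, out_dir):
    """Build the argv for: python3 <script> <url> -o <out_dir>/<name>.md"""
    return ["python3", str(script), url, "-o", str(Path(out_dir) / output_name(url))]


def fetch_all(script, urls, out_dir):
    """Run the script for each URL in turn and return the URLs that failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for url in urls:
        result = subprocess.run(build_command(script, url, out_dir))
        # Assumption: a non-zero exit status means the fetch failed.
        if result.returncode != 0:
            failed.append(url)
    return failed
```

Running the URLs sequentially keeps the request rate low, which matters given the per-service rate limits.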

OpenCLI Supported Platforms

When the free services fail, OpenCLI auto-detects the platform from the URL and routes to the matching command:

| URL Pattern | OpenCLI Command |
| --- | --- |
| zhihu.com/question/xxx | opencli zhihu question |
| zhuanlan.zhihu.com/p/xxx | opencli zhihu download |
| reddit.com/r/.../comments/... | opencli reddit read |
| twitter.com/.../status/xxx or x.com/.../status/xxx | opencli twitter thread |
| weibo.com/... | opencli weibo search |

Requires: npm i -g @jackwener/opencli, plus the Browser Bridge extension in Chrome/Arc.
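
The URL-to-command routing could be sketched roughly as follows. The regular expressions, the accepted subdomains (`www.`, `old.`, `m.`, `mobile.`), and the function name are assumptions for illustration; only the command names come from the mapping above, and the arguments each command expects are not covered here.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# (host pattern, path pattern, OpenCLI subcommand); order matters for zhihu.
ROUTES = [
    (re.compile(r"^zhuanlan\.zhihu\.com$"), re.compile(r"^/p/"), "zhihu download"),
    (re.compile(r"^(www\.)?zhihu\.com$"), re.compile(r"^/question/"), "zhihu question"),
    (re.compile(r"^(www\.|old\.)?reddit\.com$"), re.compile(r"^/r/[^/]+/comments/"), "reddit read"),
    (re.compile(r"^(www\.|mobile\.)?(twitter|x)\.com$"), re.compile(r"^/[^/]+/status/"), "twitter thread"),
    (re.compile(r"^(www\.|m\.)?weibo\.com$"), re.compile(r"^/"), "weibo search"),
]


def route_opencli(url: str) -> Optional[str]:
    """Return the matching 'opencli ...' command prefix, or None if unsupported."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path or "/"
    for host_re, path_re, command in ROUTES:
        if host_re.match(host) and path_re.match(path):
            return "opencli " + command
    return None
```

URLs need an explicit scheme (`https://...`) for `urlparse` to recognise the host; anything that does not match returns `None`, so the caller can fall through to the raw HTML fetch.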

Limitations

  • WeChat articles (微信公众号) are not supported by any strategy
  • OpenCLI requires a one-time browser extension setup

Rate Limits

| Service | Limit |
| --- | --- |
| Jina Reader | 20 req/min (free), 10M token key available at jina.ai/reader |
| markdown.new | 500 req/day/IP |
| defuddle.md | Not documented |
| OpenCLI | No documented limits (uses browser session) |
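
When fetching many pages, a client-side throttle is one way to stay under a per-minute limit such as Jina Reader's 20 req/min. The class below is a minimal sketch rather than part of the script; its name and its injectable `clock` and `sleep` parameters are illustrative choices.

```python
import time


class MinIntervalLimiter:
    """Space calls at least 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return delay
```

Calling `limiter.wait()` before each request with `MinIntervalLimiter(20)` spaces requests three seconds apart. Even spacing is stricter than the limit requires, since it rules out short bursts, but it is simple to reason about.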