
# Crawler Skill


Converts any URL into clean markdown using a robust 3-tier fallback chain.

## Quick start


```bash
uv run scripts/crawl.py --url https://example.com --output reports/example.md
```

Markdown is saved to the file specified by `--output`. Progress and errors go to stderr. The exit code is `0` on success and `1` if all scrapers fail.

## How it works


The script tries each tier in order and returns the first success:

| Tier | Module | Requires |
| --- | --- | --- |
| 1 | Firecrawl (`firecrawl_scraper.py`) | `FIRECRAWL_API_KEY` env var (optional; falls back if missing) |
| 2 | Jina Reader (`jina_reader.py`) | Nothing — free, no key needed |
| 3 | Scrapling (`scrapling_scraper.py`) | Local headless browser (auto-installs via pip) |
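
The fallback behaviour amounts to a first-success loop over the tiers. A minimal sketch, assuming each tier is a callable that returns markdown or nothing; the stand-in scrapers below are hypothetical, not the real modules:

```python
def crawl_with_fallback(url, scrapers):
    """Try each scraper tier in order; return the first non-empty result."""
    for name, scrape in scrapers:
        try:
            markdown = scrape(url)
        except Exception:
            continue  # this tier errored out; fall through to the next
        if markdown:
            return name, markdown
    raise RuntimeError("all scrapers failed")  # caller maps this to exit code 1

# Hypothetical stand-ins for the three tiers:
tiers = [
    ("firecrawl", lambda url: None),             # e.g. no API key set, so no result
    ("jina", lambda url: "# Example Domain\n"),  # free tier succeeds
    ("scrapling", lambda url: "# Example Domain\n"),
]
```

The real script additionally validates each tier's output before accepting it, as described under Content validation.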

## File layout


```
crawler-skill/
├── SKILL.md            ← this file
├── scripts/
│   ├── crawl.py               ← main CLI entry point (PEP 723 inline deps)
│   └── src/
│       ├── domain_router.py       ← URL-to-tier routing rules
│       ├── firecrawl_scraper.py   ← Tier 1: Firecrawl API
│       ├── jina_reader.py         ← Tier 2: Jina r.jina.ai proxy
│       └── scrapling_scraper.py   ← Tier 3: local headless scraper
└── tests/
    └── test_crawl.py          ← 70 pytest tests (all passing)
```

## Usage examples


```bash
# Basic fetch — tries Firecrawl, falls back to Jina, then Scrapling.
# Always prefer using --output to avoid terminal encoding issues.
uv run scripts/crawl.py --url https://docs.python.org/3/ --output reports/python_docs.md

# If no --output is provided, markdown goes to stdout (not recommended on Windows).
uv run scripts/crawl.py --url https://example.com

# With a Firecrawl API key for best results.
FIRECRAWL_API_KEY=fc-... uv run scripts/crawl.py --url https://example.com --output reports/example.md
```

## URL requirements


Only `http://` and `https://` URLs are accepted. Passing any other scheme (`ftp://`, `file://`, `javascript:`, a bare path, etc.) exits with code `1` and prints a clear error — no scraping is attempted.
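
The scheme check can be sketched with the standard library; `is_supported_url` is an illustrative name, not the script's actual function:

```python
from urllib.parse import urlparse

def is_supported_url(url: str) -> bool:
    """Accept only http:// and https:// URLs; reject every other scheme."""
    scheme = urlparse(url).scheme.lower()
    return scheme in ("http", "https")
```

A bare path parses with an empty scheme, so it is rejected along with `ftp://`, `file://`, and `javascript:` URLs.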

## Saving Reports


When the user asks to save the crawled content or a summary to a file, ALWAYS use the `--output` argument and save the file into the `reports/` directory at the project root (for example, `{project_root}/reports`). If the directory does not exist, the script will create it.

Example: if asked to "save to result.md", you should run:

```bash
uv run scripts/crawl.py --url <URL> --output reports/result.md
```

## Point at a self-hosted Firecrawl instance


```bash
FIRECRAWL_API_URL=http://localhost:3002 uv run scripts/crawl.py --url https://example.com
```

## Content validation


Each scraper validates its output before returning success:

- Minimum 100 characters of content (rejects empty/error pages)
- Detection of CAPTCHA / bot-verification pages (Firecrawl)
- Detection of Cloudflare interstitial pages (Scrapling — escalates to StealthyFetcher)
- Detection of Jina error page indicators (`Error:`, `Access Denied`, etc.)
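
The checks above can be sketched as a single predicate. The 100-character threshold comes from the list; the indicator strings and the 500-character search window are illustrative assumptions:

```python
ERROR_INDICATORS = ("Error:", "Access Denied")

def looks_valid(markdown: str, min_chars: int = 100) -> bool:
    """Reject output that is too short or carries a known error-page marker."""
    if len(markdown.strip()) < min_chars:
        return False  # empty or near-empty page
    head = markdown[:500]  # error markers tend to appear near the top
    return not any(marker in head for marker in ERROR_INDICATORS)
```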

## Domain routing


Certain hostnames bypass one or more scraper tiers to avoid known compatibility issues. The logic lives in `scripts/src/domain_router.py`.

| Domain | Skipped tiers | Active chain |
| --- | --- | --- |
| `medium.com` (and subdomains) | firecrawl | jina → scrapling |
| `mp.weixin.qq.com` | firecrawl + jina | scrapling only |
| everything else | (none) | firecrawl → jina → scrapling |

Sub-domain matching follows a suffix rule: `blog.medium.com` matches the `medium.com` rule because its hostname ends with `.medium.com`. A sibling sub-domain such as `other.weixin.qq.com` does not match `mp.weixin.qq.com`.
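
The suffix rule can be sketched as follows; this is a simplified stand-in for the logic in `scripts/src/domain_router.py`, not its actual code:

```python
from urllib.parse import urlparse

def matches_domain(url: str, domain: str) -> bool:
    """True if the URL's hostname equals `domain` or is a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)
```

Requiring the literal `.` prefix is what keeps `notmedium.com` from accidentally matching the `medium.com` rule.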

## Running tests


```bash
uv run pytest tests/ -v
```

All 70 tests use mocking — no network calls, no API keys required.

## Dependencies (auto-installed by `uv run`)


- `firecrawl-py>=2.0` — Firecrawl Python SDK
- `httpx>=0.27` — HTTP client for Jina Reader
- `scrapling>=0.2` — Headless scraping with stealth support
- `html2text>=2024.2.26` — HTML-to-markdown conversion
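
Since `crawl.py` declares its dependencies inline per PEP 723, its header presumably resembles this sketch, reconstructed from the list above rather than copied from the file:

```python
# /// script
# dependencies = [
#     "firecrawl-py>=2.0",
#     "httpx>=0.27",
#     "scrapling>=0.2",
#     "html2text>=2024.2.26",
# ]
# ///
```

`uv run` reads this block and installs the listed packages into an ephemeral environment before executing the script.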

## When to invoke this skill


Invoke `crawl.py` whenever you need the text content of a web page:

```python
import subprocess

result = subprocess.run(
    ["uv", "run", "scripts/crawl.py", "--url", url],
    capture_output=True, text=True
)
if result.returncode == 0:
    markdown = result.stdout
```

Or simply run it directly from the terminal as shown in Quick start above.