
# Crawler Skill


Converts any URL into clean markdown using a robust 3-tier fallback chain.

## Quick start


```bash
uv run scripts/crawl.py --url https://example.com --output reports/example.md
```

Markdown is saved to the file specified by `--output`. Progress and errors go to stderr. The exit code is `0` on success and `1` if all scrapers fail.

## How it works


The script tries each tier in order and returns the first success:

| Tier | Module | Requires |
| --- | --- | --- |
| 1 | Firecrawl (`firecrawl_scraper.py`) | `FIRECRAWL_API_KEY` env var (optional; falls back if missing) |
| 2 | Jina Reader (`jina_reader.py`) | Nothing — free, no key needed |
| 3 | Scrapling (`scrapling_scraper.py`) | Local headless browser (auto-installs via pip) |
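
The fallback behaviour amounts to a first-success loop over the tiers. A minimal sketch, assuming each tier is a callable that returns markdown or nothing; the stand-in scrapers below are hypothetical, not the real modules:

```python
def crawl_with_fallback(url, scrapers):
    """Try each scraper tier in order; return the first non-empty result."""
    for name, scrape in scrapers:
        try:
            markdown = scrape(url)
        except Exception:
            continue  # this tier errored out; fall through to the next
        if markdown:
            return name, markdown
    raise RuntimeError("all scrapers failed")  # caller maps this to exit code 1

# Hypothetical stand-ins for the three tiers:
tiers = [
    ("firecrawl", lambda url: None),             # e.g. no API key set, so no result
    ("jina", lambda url: "# Example Domain\n"),  # free tier succeeds
    ("scrapling", lambda url: "# Example Domain\n"),
]
```

The real script additionally validates each tier's output before accepting it, as described under Content validation.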

## File layout


```
crawler-skill/
├── SKILL.md            ← this file
├── scripts/
│   ├── crawl.py               ← main CLI entry point (PEP 723 inline deps)
│   └── src/
│       ├── domain_router.py       ← URL-to-tier routing rules
│       ├── firecrawl_scraper.py   ← Tier 1: Firecrawl API
│       ├── jina_reader.py         ← Tier 2: Jina r.jina.ai proxy
│       └── scrapling_scraper.py   ← Tier 3: local headless scraper
└── tests/
    └── test_crawl.py          ← 70 pytest tests (all passing)
```

## Usage examples


```bash
# Basic fetch — tries Firecrawl, falls back to Jina, then Scrapling.
# Always prefer using --output to avoid terminal encoding issues.
uv run scripts/crawl.py --url https://docs.python.org/3/ --output reports/python_docs.md

# If no --output is provided, markdown goes to stdout (not recommended on Windows).
uv run scripts/crawl.py --url https://example.com

# With a Firecrawl API key for best results.
FIRECRAWL_API_KEY=fc-... uv run scripts/crawl.py --url https://example.com --output reports/example.md
```

## URL requirements


Only `http://` and `https://` URLs are accepted. Passing any other scheme (`ftp://`, `file://`, `javascript:`, a bare path, etc.) exits with code `1` and prints a clear error — no scraping is attempted.
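
The scheme check can be sketched with the standard library; `is_supported_url` is an illustrative name, not the script's actual function:

```python
from urllib.parse import urlparse

def is_supported_url(url: str) -> bool:
    """Accept only http:// and https:// URLs; reject every other scheme."""
    scheme = urlparse(url).scheme.lower()
    return scheme in ("http", "https")
```

A bare path parses with an empty scheme, so it is rejected along with `ftp://`, `file://`, and `javascript:` URLs.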

## Saving Reports


When the user asks to save the crawled content or a summary to a file, ALWAYS use the `--output` argument and save the file into the `reports/` directory at the project root (for example, `{project_root}/reports`). If the directory does not exist, the script will create it.

Example: if asked to "save to result.md", you should run:

```bash
uv run scripts/crawl.py --url <URL> --output reports/result.md
```

## Point at a self-hosted Firecrawl instance


```bash
FIRECRAWL_API_URL=http://localhost:3002 uv run scripts/crawl.py --url https://example.com
```

## Content validation


Each scraper validates its output before returning success:

- Minimum 100 characters of content (rejects empty/error pages)
- Detection of CAPTCHA / bot-verification pages (Firecrawl)
- Detection of Cloudflare interstitial pages (Scrapling — escalates to StealthyFetcher)
- Detection of Jina error page indicators (`Error:`, `Access Denied`, etc.)
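
The checks above can be sketched as a single predicate. The 100-character threshold comes from the list; the indicator strings and the 500-character search window are illustrative assumptions:

```python
ERROR_INDICATORS = ("Error:", "Access Denied")

def looks_valid(markdown: str, min_chars: int = 100) -> bool:
    """Reject output that is too short or carries a known error-page marker."""
    if len(markdown.strip()) < min_chars:
        return False  # empty or near-empty page
    head = markdown[:500]  # error markers tend to appear near the top
    return not any(marker in head for marker in ERROR_INDICATORS)
```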

## Domain routing


Certain hostnames bypass one or more scraper tiers to avoid known compatibility issues. The logic lives in `scripts/src/domain_router.py`.

| Domain | Skipped tiers | Active chain |
| --- | --- | --- |
| `medium.com` (and subdomains) | firecrawl | jina → scrapling |
| `mp.weixin.qq.com` | firecrawl + jina | scrapling only |
| everything else | (none) | firecrawl → jina → scrapling |

Sub-domain matching follows a suffix rule: `blog.medium.com` matches the `medium.com` rule because its hostname ends with `.medium.com`. A sibling sub-domain such as `other.weixin.qq.com` does not match `mp.weixin.qq.com`.
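
The suffix rule can be sketched as follows; this is a simplified stand-in for the logic in `scripts/src/domain_router.py`, not its actual code:

```python
from urllib.parse import urlparse

def matches_domain(url: str, domain: str) -> bool:
    """True if the URL's hostname equals `domain` or is a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)
```

Requiring the literal `.` prefix is what keeps `notmedium.com` from accidentally matching the `medium.com` rule.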

## Running tests


```bash
uv run pytest tests/ -v
```

All 70 tests use mocking — no network calls, no API keys required.

## Dependencies (auto-installed by `uv run`)


- `firecrawl-py>=2.0` — Firecrawl Python SDK
- `httpx>=0.27` — HTTP client for Jina Reader
- `scrapling>=0.2` — Headless scraping with stealth support
- `html2text>=2024.2.26` — HTML-to-markdown conversion
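
Since `crawl.py` declares its dependencies inline per PEP 723, its header presumably resembles this sketch, reconstructed from the list above rather than copied from the file:

```python
# /// script
# dependencies = [
#     "firecrawl-py>=2.0",
#     "httpx>=0.27",
#     "scrapling>=0.2",
#     "html2text>=2024.2.26",
# ]
# ///
```

`uv run` reads this block and installs the listed packages into an ephemeral environment before executing the script.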

## When to invoke this skill


Invoke `crawl.py` whenever you need the text content of a web page:

```python
import subprocess

result = subprocess.run(
    ["uv", "run", "scripts/crawl.py", "--url", url],
    capture_output=True, text=True
)
if result.returncode == 0:
    markdown = result.stdout
```

Or simply run it directly from the terminal as shown in Quick start above.