
scrapling - Adaptive Web Scraping Framework

Keywords: scrapling · adaptive scraping · stealthy fetch · scrapling spider

Respect each target site's terms, robots.txt rules, rate limits, and authorization boundaries.
Scrapling is a Python scraping framework for parser-first HTML extraction, browser-backed fetching, stealth anti-bot handling, CLI prototyping, and optional larger crawl workflows. Its distinctive feature is adaptive scraping: you can save element fingerprints and later relocate equivalent elements after a site redesign.

When to use this skill

  • Install Scrapling with the right extras for parser-only, fetchers, shell, AI, or full usage
  • Parse known HTML with `Selector` before escalating to browser-backed fetchers
  • Choose between `Fetcher`, `DynamicFetcher`, and `StealthyFetcher`
  • Reuse `FetcherSession`, `DynamicSession`, or `StealthySession` for multiple requests
  • Parse HTML with CSS, XPath, `::text`, `::attr(...)`, text matching, regex, and similar-element lookup
  • Enable adaptive scraping with `adaptive=True`, `auto_save=True`, `retrieve()`, and `relocate()`
  • Use the `scrapling` CLI for terminal-first extraction or shell work
  • Understand MCP and spiders as second-tier workflows once core scraping is working
  • Decide when Docker-only CLI usage is enough versus when Python code is required

Instructions

Step 1: Install and verify the environment

Use a virtual environment unless the user explicitly wants a system install.

```bash
bash scripts/install.sh
```

Supported install profiles:

  • `parser`: `pip install scrapling`
  • `fetchers`: `pip install "scrapling[fetchers]"`
  • `shell`: `pip install "scrapling[shell]"`
  • `ai`: `pip install "scrapling[ai]"`
  • `all`: `pip install "scrapling[all]"`

Examples:

```bash
bash scripts/install.sh --profile parser
bash scripts/install.sh --profile fetchers
bash scripts/install.sh --profile all --force
```

Browser-backed flows require `scrapling install`; parser-only workflows do not.
If the user only wants terminal extraction and prefers containers, Docker images are available. That path is CLI-oriented and does not replace Python coding workflows.

Step 2: Start parser-first, then choose the right fetcher

If the user already has HTML or only needs DOM parsing, start with `Selector`:

```python
from scrapling import Selector

page = Selector(html_doc, url="https://example.com")
titles = page.css("h1::text").getall()
links = page.css("a::attr(href)").getall()
```

Important parser notes:

  • Scrapling currently targets HTML, not XML feeds
  • `Selector` is the current user-facing parser API
  • Legacy `Adaptor` compatibility exists in code, but teach `Selector`

If the user needs live fetching, pick the narrowest fetcher that solves the job:

  • `Fetcher`: static HTML or plain HTTP targets
  • `DynamicFetcher`: JavaScript-rendered pages
  • `StealthyFetcher`: harder protected targets, including documented Cloudflare handling

Escalation rule:

  1. Start with `Selector` if you already have HTML
  2. Otherwise start with `Fetcher`
  3. If content is rendered client-side, switch to `DynamicFetcher`
  4. If protection blocks or empties the result, switch to `StealthyFetcher`

Do not present anti-bot bypass as guaranteed. Phrase it as a documented capability whose success depends on the target and environment.
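The escalation rule can be sketched as a single fallback helper. This is a minimal sketch, not part of this skill's scripts: it assumes only the `Fetcher.get`, `DynamicFetcher.fetch`, and `StealthyFetcher.fetch` entry points shown in this document, and the "looks usable" heuristic is a hypothetical stand-in for whatever check your target actually needs. Imports are deferred so the sketch reads without the fetcher extras installed.

```python
def fetch_with_escalation(url: str):
    """Try the lightest fetcher first, escalating only when the result looks empty."""
    # Deferred imports so this sketch can be read without scrapling installed.
    from scrapling.fetchers import Fetcher, DynamicFetcher, StealthyFetcher

    def looks_usable(page) -> bool:
        # Hypothetical heuristic: the page has a <title> and some body text.
        return bool(page.css("title::text").get()) and bool(page.css("body ::text").getall())

    page = Fetcher.get(url, impersonate="chrome")        # 1. plain HTTP first
    if looks_usable(page):
        return page

    page = DynamicFetcher.fetch(url, network_idle=True)  # 2. JavaScript rendering
    if looks_usable(page):
        return page

    # 3. Last resort: stealth mode for protected targets.
    return StealthyFetcher.fetch(url, headless=True)
```

Calling `fetch_with_escalation(url)` keeps cost low for static sites while still handling rendered or protected pages; tune `looks_usable` per target.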

Step 3: Parse the response and reuse sessions

All fetchers return a `Response` object that extends Scrapling's `Selector` engine.

Core parsing options:

  • CSS selectors: `page.css(".product")`
  • XPath selectors: `page.xpath("//article")`
  • Text and attributes: `::text`, `::attr(href)`
  • Text search: `find_by_text(...)`
  • Regex search: `find_by_regex(...)`
  • Similar elements: `element.find_similar(...)`
Use session classes for repeated requests, cookies, or state reuse:

```python
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate="chrome") as session:
    page1 = session.get("https://example.com")
    page2 = session.get("https://example.com/account")
```

Use adaptive scraping when the target DOM is brittle:

```python
from scrapling.fetchers import Fetcher

Fetcher.configure(adaptive=True)
page = Fetcher.get("https://example.com")

saved = page.css(".product", auto_save=True)
relocated = page.css(".product", adaptive=True)
```

Important adaptive notes:

  • `adaptive=True` is off by default
  • `auto_save=True` stores element fingerprints keyed by selector or identifier
  • `adaptive_domain` helps when the same site moved domains or archived copies are involved
  • Manual flows are available with `save()`, `retrieve()`, and `relocate()`

Deeper parser details live in references/parser-and-adaptive.md.

Step 4: Use the CLI for quick extraction or shell work

CLI overview:

  • `scrapling install`
  • `scrapling shell`
  • `scrapling extract get|post|put|delete|fetch|stealthy-fetch`

Wrapper scripts in this skill:

  • `bash scripts/run-extract.sh get "https://example.com" article.md`
  • `bash scripts/run-extract.sh fetch "https://app.example.com" content.md --network-idle`
  • `bash scripts/run-extract.sh stealth "https://protected.example.com" content.md --solve-cloudflare`

Use the CLI when:

  • The user needs quick output files in `.md`, `.html`, or `.txt`
  • CSS selectors are enough to trim output
  • A shell should be started without writing Python first

CLI and optional MCP details live in references/cli-and-mcp.md.

Step 5: Treat MCP and spiders as second-tier workflows

Use MCP when the user explicitly wants Scrapling exposed to an agent client:

```bash
bash scripts/run-mcp.sh
bash scripts/run-mcp.sh --http --host 127.0.0.1 --port 8000
```

Use spiders when the task is no longer a few page fetches and becomes a crawl with link following, concurrency, or checkpoint resume.
These are important capabilities, but they should not replace the core parser-plus-fetcher workflow in normal end-user guidance.

Examples

Example 1: Install Scrapling with all extras

```bash
bash scripts/install.sh
```

Example 2: Parser-only install

```bash
bash scripts/install.sh --profile parser
```

Example 3: Parse local HTML with `Selector`

```python
from scrapling import Selector

page = Selector(html_doc, url="https://example.com")
titles = page.css("h1::text").getall()
links = page.css("a::attr(href)").getall()
```

Example 4: Fast static scrape from the terminal

```bash
bash scripts/run-extract.sh get "https://example.com" content.md --css-selector "article"
```

Example 5: Python `Fetcher`

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get("https://example.com", impersonate="chrome")
title = page.css("title::text").get()
```

Example 6: Python `DynamicFetcher`

```python
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch(
    "https://example.com",
    network_idle=True,
    wait_selector=".content"
)
```

Example 7: Python `StealthyFetcher`

```python
from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    "https://example.com",
    headless=True,
    solve_cloudflare=True
)
```

Example 8: Async HTTP

```python
import asyncio
from scrapling.fetchers import AsyncFetcher

# await is only valid inside a coroutine, so wrap the call.
async def main():
    return await AsyncFetcher.get("https://example.com")

page = asyncio.run(main())
```

Example 9: Session reuse

```python
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate="chrome") as session:
    page1 = session.get("https://example.com")
    page2 = session.get("https://example.com/account")
```

Example 10: Adaptive selector recovery

```python
from scrapling.fetchers import Fetcher

Fetcher.configure(adaptive=True, adaptive_domain="example.com")
page = Fetcher.get("https://example.com")
saved = page.css(".product", auto_save=True)
relocated = page.css(".product", adaptive=True)
```

Example 11: Minimal spider reference

```python
from scrapling.spiders import Spider, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response: Response):
        for quote in response.css(".quote"):
            yield {"text": quote.css(".text::text").get()}

result = QuotesSpider().start()
print(result.items.to_json())
```

Example 12: Start the MCP server over stdio

```bash
bash scripts/run-mcp.sh
```

Best practices

  1. Start with `Selector` or the lightest fetcher that works, and escalate only when the site actually needs rendering or stealth.
  2. Reuse session classes for repeated requests so browser startup and connection overhead stay low.
  3. Prefer `.md` or `.txt` CLI output and CSS selectors over dumping full HTML into the model context.
  4. Enable adaptive scraping only where selector brittleness is a real maintenance problem.
  5. Use `page_action`, `wait_selector`, and `network_idle` deliberately instead of adding blind sleeps.
  6. Treat Cloudflare solving, proxies, and browser impersonation as opt-in tools for authorized, policy-compliant work, not guaranteed bypasses.
  7. Remember that XML feeds are not the current target surface; Scrapling is documented around HTML parsing.
  8. Move from CLI to Python or spiders when retry logic, structured output, or crawl control becomes important.
  9. For MCP usage, make the client/server transport explicit: stdio for local agent integration, `--http` for streamable HTTP deployments.
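Practice 5 can be made concrete with a hedged sketch: `wait_selector` and `network_idle` declare what to wait for, and `page_action` (a callable that receives the live browser page) replaces blind sleeps with explicit actions. The scrolling action below is illustrative, not prescribed by Scrapling; depending on the version, the callable may be expected to return the page object, and imports are deferred so the sketch reads without the browser extras installed.

```python
def scroll_to_bottom(page):
    # Illustrative page_action: scroll so lazy-loaded content renders,
    # instead of sleeping for a fixed time.
    page.mouse.wheel(0, 10000)
    return page

def fetch_rendered(url: str):
    # Deferred import so the sketch reads without the browser extras installed.
    from scrapling.fetchers import DynamicFetcher

    return DynamicFetcher.fetch(
        url,
        network_idle=True,          # wait for the network to go quiet
        wait_selector=".content",   # and for the content element to exist
        page_action=scroll_to_bottom,
    )
```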

References

  • references/fetchers-and-sessions.md
  • references/parser-and-adaptive.md
  • references/cli-and-mcp.md
  • references/spiders.md
  • scripts/install.sh
  • scripts/run-extract.sh
  • scripts/run-mcp.sh
  • Scrapling GitHub repository
  • Scrapling official documentation