
scrapling - Adaptive Web Scraping Framework

Keywords: scrapling · adaptive scraping · stealthy fetch · scrapling spider

Respect each target site's terms, robots.txt rules, rate limits, and authorization boundaries.
Scrapling is a Python scraping framework for parser-first HTML extraction, browser-backed fetching, stealth anti-bot handling, CLI prototyping, and optional larger crawl workflows. Its distinctive feature is adaptive scraping: you can save element fingerprints and later relocate equivalent elements after a site redesign.

When to use this skill

  • Install Scrapling with the right extras for parser-only, fetchers, shell, AI, or full usage
  • Parse known HTML with `Selector` before escalating to browser-backed fetchers
  • Choose between `Fetcher`, `DynamicFetcher`, and `StealthyFetcher`
  • Reuse `FetcherSession`, `DynamicSession`, or `StealthySession` for multiple requests
  • Parse HTML with CSS, XPath, `::text`, `::attr(...)`, text matching, regex, and similar-element lookup
  • Enable adaptive scraping with `adaptive=True`, `auto_save=True`, `retrieve()`, and `relocate()`
  • Use the `scrapling` CLI for terminal-first extraction or shell work
  • Understand MCP and spiders as second-tier workflows once core scraping is working
  • Decide when Docker-only CLI usage is enough versus when Python code is required

Instructions

Step 1: Install and verify the environment

Use a virtual environment unless the user explicitly wants a system install.

```bash
bash scripts/install.sh
```

Supported install profiles:

  • `parser`: `pip install scrapling`
  • `fetchers`: `pip install "scrapling[fetchers]"`
  • `shell`: `pip install "scrapling[shell]"`
  • `ai`: `pip install "scrapling[ai]"`
  • `all`: `pip install "scrapling[all]"`

Examples:

```bash
bash scripts/install.sh --profile parser
bash scripts/install.sh --profile fetchers
bash scripts/install.sh --profile all --force
```

Browser-backed flows require `scrapling install`; parser-only workflows do not.
If the user only wants terminal extraction and prefers containers, Docker images are available. That path is CLI-oriented and does not replace Python coding workflows.

Step 2: Start parser-first, then choose the right fetcher

If the user already has HTML or only needs DOM parsing, start with `Selector`:

```python
from scrapling import Selector

page = Selector(html_doc, url="https://example.com")
titles = page.css("h1::text").getall()
links = page.css("a::attr(href)").getall()
```

Important parser notes:

  • Scrapling currently targets HTML, not XML feeds
  • `Selector` is the current user-facing parser API
  • Legacy `Adaptor` compatibility exists in code, but teach `Selector`

If the user needs live fetching, pick the narrowest fetcher that solves the job:

  • `Fetcher`: static HTML or plain HTTP targets
  • `DynamicFetcher`: JavaScript-rendered pages
  • `StealthyFetcher`: harder protected targets, including documented Cloudflare handling

Escalation rule:

  1. Start with `Selector` if you already have HTML
  2. Otherwise start with `Fetcher`
  3. If content is rendered client-side, switch to `DynamicFetcher`
  4. If protection blocks or empties the result, switch to `StealthyFetcher`

Do not present anti-bot bypass as guaranteed. Phrase it as a documented capability whose success depends on the target and environment.
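The escalation rule can be sketched as a single fallback helper. This is a minimal sketch, not part of this skill's scripts: it assumes only the `Fetcher.get`, `DynamicFetcher.fetch`, and `StealthyFetcher.fetch` entry points shown in this document, and the "looks usable" heuristic is a hypothetical stand-in for whatever check your target actually needs. Imports are deferred so the sketch reads without the fetcher extras installed.

```python
def fetch_with_escalation(url: str):
    """Try the lightest fetcher first, escalating only when the result looks empty."""
    # Deferred imports so this sketch can be read without scrapling installed.
    from scrapling.fetchers import Fetcher, DynamicFetcher, StealthyFetcher

    def looks_usable(page) -> bool:
        # Hypothetical heuristic: the page has a <title> and some body text.
        return bool(page.css("title::text").get()) and bool(page.css("body ::text").getall())

    page = Fetcher.get(url, impersonate="chrome")        # 1. plain HTTP first
    if looks_usable(page):
        return page

    page = DynamicFetcher.fetch(url, network_idle=True)  # 2. JavaScript rendering
    if looks_usable(page):
        return page

    # 3. Last resort: stealth mode for protected targets.
    return StealthyFetcher.fetch(url, headless=True)
```

Calling `fetch_with_escalation(url)` keeps cost low for static sites while still handling rendered or protected pages; tune `looks_usable` per target.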

Step 3: Parse the response and reuse sessions

All fetchers return a `Response` object that extends Scrapling's `Selector` engine.

Core parsing options:

  • CSS selectors: `page.css(".product")`
  • XPath selectors: `page.xpath("//article")`
  • Text and attributes: `::text`, `::attr(href)`
  • Text search: `find_by_text(...)`
  • Regex search: `find_by_regex(...)`
  • Similar elements: `element.find_similar(...)`
Use session classes for repeated requests, cookies, or state reuse:

```python
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate="chrome") as session:
    page1 = session.get("https://example.com")
    page2 = session.get("https://example.com/account")
```

Use adaptive scraping when the target DOM is brittle:

```python
from scrapling.fetchers import Fetcher

Fetcher.configure(adaptive=True)
page = Fetcher.get("https://example.com")

saved = page.css(".product", auto_save=True)
relocated = page.css(".product", adaptive=True)
```

Important adaptive notes:

  • `adaptive=True` is off by default
  • `auto_save=True` stores element fingerprints keyed by selector or identifier
  • `adaptive_domain` helps when the same site moved domains or archived copies are involved
  • Manual flows are available with `save()`, `retrieve()`, and `relocate()`

Deeper parser details live in references/parser-and-adaptive.md.

Step 4: Use the CLI for quick extraction or shell work

CLI overview:

  • `scrapling install`
  • `scrapling shell`
  • `scrapling extract get|post|put|delete|fetch|stealthy-fetch`

Wrapper scripts in this skill:

  • `bash scripts/run-extract.sh get "https://example.com" article.md`
  • `bash scripts/run-extract.sh fetch "https://app.example.com" content.md --network-idle`
  • `bash scripts/run-extract.sh stealth "https://protected.example.com" content.md --solve-cloudflare`

Use the CLI when:

  • The user needs quick output files in `.md`, `.html`, or `.txt`
  • CSS selectors are enough to trim output
  • A shell should be started without writing Python first

CLI and optional MCP details live in references/cli-and-mcp.md.

Step 5: Treat MCP and spiders as second-tier workflows

Use MCP when the user explicitly wants Scrapling exposed to an agent client:

```bash
bash scripts/run-mcp.sh
bash scripts/run-mcp.sh --http --host 127.0.0.1 --port 8000
```

Use spiders when the task is no longer a few page fetches and becomes a crawl with link following, concurrency, or checkpoint resume.
These are important capabilities, but they should not replace the core parser-plus-fetcher workflow in normal end-user guidance.

Examples

Example 1: Install Scrapling with all extras

```bash
bash scripts/install.sh
```

Example 2: Parser-only install

```bash
bash scripts/install.sh --profile parser
```

Example 3: Parse local HTML with `Selector`

```python
from scrapling import Selector

page = Selector(html_doc, url="https://example.com")
titles = page.css("h1::text").getall()
links = page.css("a::attr(href)").getall()
```

Example 4: Fast static scrape from the terminal

```bash
bash scripts/run-extract.sh get "https://example.com" content.md --css-selector "article"
```

Example 5: Python `Fetcher`

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get("https://example.com", impersonate="chrome")
title = page.css("title::text").get()
```

Example 6: Python `DynamicFetcher`

```python
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch(
    "https://example.com",
    network_idle=True,
    wait_selector=".content"
)
```

Example 7: Python `StealthyFetcher`

```python
from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    "https://example.com",
    headless=True,
    solve_cloudflare=True
)
```

Example 8: Async HTTP

```python
import asyncio
from scrapling.fetchers import AsyncFetcher

# await is only valid inside a coroutine, so wrap the call.
async def main():
    return await AsyncFetcher.get("https://example.com")

page = asyncio.run(main())
```

Example 9: Session reuse

```python
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate="chrome") as session:
    page1 = session.get("https://example.com")
    page2 = session.get("https://example.com/account")
```

Example 10: Adaptive selector recovery

```python
from scrapling.fetchers import Fetcher

Fetcher.configure(adaptive=True, adaptive_domain="example.com")
page = Fetcher.get("https://example.com")
saved = page.css(".product", auto_save=True)
relocated = page.css(".product", adaptive=True)
```

Example 11: Minimal spider reference

```python
from scrapling.spiders import Spider, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response: Response):
        for quote in response.css(".quote"):
            yield {"text": quote.css(".text::text").get()}

result = QuotesSpider().start()
print(result.items.to_json())
```

Example 12: Start the MCP server over stdio

```bash
bash scripts/run-mcp.sh
```

Best practices

  1. Start with `Selector` or the lightest fetcher that works, and escalate only when the site actually needs rendering or stealth.
  2. Reuse session classes for repeated requests so browser startup and connection overhead stay low.
  3. Prefer `.md` or `.txt` CLI output and CSS selectors over dumping full HTML into the model context.
  4. Enable adaptive scraping only where selector brittleness is a real maintenance problem.
  5. Use `page_action`, `wait_selector`, and `network_idle` deliberately instead of adding blind sleeps.
  6. Treat Cloudflare solving, proxies, and browser impersonation as opt-in tools for authorized, policy-compliant work, not guaranteed bypasses.
  7. Remember that XML feeds are not the current target surface; Scrapling is documented around HTML parsing.
  8. Move from CLI to Python or spiders when retry logic, structured output, or crawl control becomes important.
  9. For MCP usage, make the client/server transport explicit: stdio for local agent integration, `--http` for streamable HTTP deployments.
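Practice 5 can be made concrete with a hedged sketch: `wait_selector` and `network_idle` declare what to wait for, and `page_action` (a callable that receives the live browser page) replaces blind sleeps with explicit actions. The scrolling action below is illustrative, not prescribed by Scrapling; depending on the version, the callable may be expected to return the page object, and imports are deferred so the sketch reads without the browser extras installed.

```python
def scroll_to_bottom(page):
    # Illustrative page_action: scroll so lazy-loaded content renders,
    # instead of sleeping for a fixed time.
    page.mouse.wheel(0, 10000)
    return page

def fetch_rendered(url: str):
    # Deferred import so the sketch reads without the browser extras installed.
    from scrapling.fetchers import DynamicFetcher

    return DynamicFetcher.fetch(
        url,
        network_idle=True,          # wait for the network to go quiet
        wait_selector=".content",   # and for the content element to exist
        page_action=scroll_to_bottom,
    )
```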

References

  • references/fetchers-and-sessions.md
  • references/parser-and-adaptive.md
  • references/cli-and-mcp.md
  • references/spiders.md
  • scripts/install.sh
  • scripts/run-extract.sh
  • scripts/run-mcp.sh
  • Scrapling GitHub repository
  • Scrapling official documentation