# scrapling - Adaptive Web Scraping Framework
Keywords: scrapling, adaptive scraping, stealthy fetch, scrapling spider

Respect each target site's terms, robots, rate limits, and authorization boundaries.
Scrapling is a Python scraping framework for parser-first HTML extraction, browser-backed fetching, stealth anti-bot handling, CLI prototyping, and optional larger crawl workflows. Its distinctive feature is adaptive scraping: you can save element fingerprints and later relocate equivalent elements after a site redesign.
## When to use this skill
- Install Scrapling with the right extras for parser-only, fetchers, shell, AI, or full usage
- Parse known HTML with `Selector` before escalating to browser-backed fetchers
- Choose between `Fetcher`, `DynamicFetcher`, and `StealthyFetcher`
- Reuse `FetcherSession`, `DynamicSession`, or `StealthySession` for multiple requests
- Parse HTML with CSS, XPath, `::text`, `::attr(...)`, text matching, regex, and similar-element lookup
- Enable adaptive scraping with `adaptive=True`, `auto_save=True`, `retrieve()`, and `relocate()`
- Use the `scrapling` CLI for terminal-first extraction or shell work
- Understand MCP and spiders as second-tier workflows once core scraping is working
- Decide when Docker-only CLI usage is enough versus when Python code is required
## Instructions
### Step 1: Install and verify the environment
Use a virtual environment unless the user explicitly wants a system install.
```bash
bash scripts/install.sh
```

Supported install profiles:

- `parser`: `pip install scrapling`
- `fetchers`: `pip install "scrapling[fetchers]"`
- `shell`: `pip install "scrapling[shell]"`
- `ai`: `pip install "scrapling[ai]"`
- `all`: `pip install "scrapling[all]"`

Examples:

```bash
bash scripts/install.sh --profile parser
bash scripts/install.sh --profile fetchers
bash scripts/install.sh --profile all --force
```

Browser-backed flows require `scrapling install`. Parser-only workflows do not.

If the user only wants terminal extraction and prefers containers, Docker images are available. That path is CLI-oriented and does not replace Python coding workflows.
### Step 2: Start parser-first, then choose the right fetcher
If the user already has HTML or only needs DOM parsing, start with `Selector`:

```python
from scrapling import Selector

page = Selector(html_doc, url="https://example.com")
titles = page.css("h1::text").getall()
links = page.css("a::attr(href)").getall()
```

Important parser notes:

- Scrapling currently targets HTML, not XML feeds
- `Selector` is the current user-facing parser API
- Legacy `Adaptor` compatibility exists in code, but teach `Selector`

If the user needs live fetching, pick the narrowest fetcher that solves the job:

- `Fetcher`: static HTML or plain HTTP targets
- `DynamicFetcher`: JavaScript-rendered pages
- `StealthyFetcher`: harder protected targets, including documented Cloudflare handling

Escalation rule:

- Start with `Selector` if you already have HTML
- Otherwise start with `Fetcher`
- If content is rendered client-side, switch to `DynamicFetcher`
- If protection blocks or empties the result, switch to `StealthyFetcher`
Do not present anti-bot bypass as guaranteed. Phrase it as a documented capability whose success depends on the target and environment.
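The escalation rule amounts to a fallback chain: try the cheapest strategy first and move on only when the result is blocked or empty. A minimal plain-Python sketch, with stand-in callables (`try_static` and `try_dynamic` are hypothetical placeholders; real code would call `Fetcher.get`, `DynamicFetcher.fetch`, and `StealthyFetcher.fetch`):

```python
def fetch_with_escalation(url, strategies):
    """Try each (name, fetch) strategy in order; return the first non-empty result."""
    for name, fetch in strategies:
        try:
            content = fetch(url)
        except Exception:
            continue  # request failed or was blocked: escalate
        if content:  # empty content also triggers escalation
            return name, content
    raise RuntimeError(f"All strategies failed for {url}")

# Stand-in strategies for illustration only.
def try_static(url):
    return ""  # pretend the page is client-side rendered

def try_dynamic(url):
    return "<html><body>rendered content</body></html>"

name, html = fetch_with_escalation(
    "https://example.com",
    [("static", try_static), ("dynamic", try_dynamic)],
)
print(name)  # dynamic
```

The point of keeping the chain ordered is cost: each escalation step adds browser startup and stealth overhead, so the cheap path should always run first.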
### Step 3: Parse the response and reuse sessions
All fetchers return a `Response` object that extends Scrapling's `Selector` engine.

Core parsing options:

- CSS selectors: `page.css(".product")`
- XPath selectors: `page.xpath("//article")`
- Text and attributes: `::text`, `::attr(href)`
- Text search: `find_by_text(...)`
- Regex search: `find_by_regex(...)`
- Similar elements: `element.find_similar(...)`
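For intuition about what `::text` and `::attr(...)` extract, here is a stdlib-only sketch of the same DOM walk (Python's `html.parser`, not Scrapling's engine; the collector class is invented for illustration):

```python
from html.parser import HTMLParser

class LinkAndTextCollector(HTMLParser):
    """Collects roughly what a::attr(href) and ::text selectors would return."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            if "href" in attrs:
                self.hrefs.append(attrs["href"])

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())

collector = LinkAndTextCollector()
collector.feed('<h1>Products</h1><a href="/a">A</a><a href="/b">B</a>')
print(collector.hrefs)  # ['/a', '/b']
print(collector.texts)  # ['Products', 'A', 'B']
```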
Use session classes for repeated requests, cookies, or state reuse:

```python
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate="chrome") as session:
    page1 = session.get("https://example.com")
    page2 = session.get("https://example.com/account")
```

Use adaptive scraping when the target DOM is brittle:

```python
from scrapling.fetchers import Fetcher

Fetcher.configure(adaptive=True)
page = Fetcher.get("https://example.com")
saved = page.css(".product", auto_save=True)
relocated = page.css(".product", adaptive=True)
```

Important adaptive notes:

- `adaptive=True` is off by default
- `auto_save=True` stores element fingerprints keyed by selector or identifier
- `adaptive_domain` helps when the same site moved domains or archived copies are involved
- Manual flows are available with `save()`, `retrieve()`, and `relocate()`

Move the deeper parser details into references/parser-and-adaptive.md.
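Conceptually, adaptive scraping saves a fingerprint of the matched element (tag, attributes, text) and later relocates the best-scoring candidate after a redesign. The sketch below illustrates the idea only; it is not Scrapling's actual fingerprinting algorithm, and the scoring weights are invented:

```python
def fingerprint(tag, attrs, text):
    """Save the traits we can match on later."""
    return {"tag": tag, "attrs": attrs, "text": text}

def similarity(fp, candidate):
    """Score a candidate element against a saved fingerprint (0.0 to 1.0)."""
    score = 0.0
    if candidate["tag"] == fp["tag"]:
        score += 0.3
    shared = set(fp["attrs"].items()) & set(candidate["attrs"].items())
    if fp["attrs"]:
        score += 0.4 * len(shared) / len(fp["attrs"])
    if candidate["text"] == fp["text"]:
        score += 0.3
    return score

saved = fingerprint("div", {"class": "product"}, "Widget")

# After a redesign the class name changed, but tag and text survived.
candidates = [
    {"tag": "span", "attrs": {"class": "nav"}, "text": "Home"},
    {"tag": "div", "attrs": {"class": "product-card"}, "text": "Widget"},
]
best = max(candidates, key=lambda c: similarity(saved, c))
print(best["attrs"])  # {'class': 'product-card'}
```

This is why adaptive relocation can survive a class rename: enough other traits of the element remain stable to outscore unrelated nodes.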
### Step 4: Use the CLI for quick extraction or shell work
CLI overview:

- `scrapling install`
- `scrapling shell`
- `scrapling extract get|post|put|delete|fetch|stealthy-fetch`

Wrapper scripts in this skill:

```bash
bash scripts/run-extract.sh get "https://example.com" article.md
bash scripts/run-extract.sh fetch "https://app.example.com" content.md --network-idle
bash scripts/run-extract.sh stealth "https://protected.example.com" content.md --solve-cloudflare
```

Use the CLI when:

- The user needs quick output files in `.md`, `.txt`, or `.html`
- CSS selectors are enough to trim output
- A shell should be started without writing Python first

CLI and optional MCP details live in references/cli-and-mcp.md.
### Step 5: Treat MCP and spiders as second-tier workflows
Use MCP when the user explicitly wants Scrapling exposed to an agent client:

```bash
bash scripts/run-mcp.sh
bash scripts/run-mcp.sh --http --host 127.0.0.1 --port 8000
```

Use spiders when the task is no longer a few page fetches and becomes a crawl with link following, concurrency, or checkpoint resume.
These are important capabilities, but they should not replace the core parser-plus-fetcher workflow in normal end-user guidance.
## Examples
### Example 1: Install Scrapling with all extras

```bash
bash scripts/install.sh
```

### Example 2: Parser-only install

```bash
bash scripts/install.sh --profile parser
```

### Example 3: Parse local HTML with `Selector`

```python
from scrapling import Selector

page = Selector(html_doc, url="https://example.com")
titles = page.css("h1::text").getall()
links = page.css("a::attr(href)").getall()
```

### Example 4: Fast static scrape from the terminal

```bash
bash scripts/run-extract.sh get "https://example.com" content.md --css-selector "article"
```

### Example 5: Python `Fetcher`

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get("https://example.com", impersonate="chrome")
title = page.css("title::text").get()
```

### Example 6: Python `DynamicFetcher`

```python
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch(
    "https://example.com",
    network_idle=True,
    wait_selector=".content"
)
```

### Example 7: Python `StealthyFetcher`

```python
from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    "https://example.com",
    headless=True,
    solve_cloudflare=True
)
```

### Example 8: Async HTTP

```python
from scrapling.fetchers import AsyncFetcher

page = await AsyncFetcher.get("https://example.com")
```

### Example 9: Session reuse

```python
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate="chrome") as session:
    page1 = session.get("https://example.com")
    page2 = session.get("https://example.com/account")
```

### Example 10: Adaptive selector recovery

```python
from scrapling.fetchers import Fetcher

Fetcher.configure(adaptive=True, adaptive_domain="example.com")
page = Fetcher.get("https://example.com")
saved = page.css(".product", auto_save=True)
relocated = page.css(".product", adaptive=True)
```

### Example 11: Minimal spider reference

```python
from scrapling.spiders import Spider, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response: Response):
        for quote in response.css(".quote"):
            yield {"text": quote.css(".text::text").get()}

result = QuotesSpider().start()
print(result.items.to_json())
```

### Example 12: Start the MCP server over stdio

```bash
bash scripts/run-mcp.sh
```

## Best practices
- Start with `Selector` or the lightest fetcher that works and escalate only when the site actually needs rendering or stealth.
- Reuse session classes for repeated requests so browser startup and connection overhead stay low.
- Prefer `.md` or `.txt` CLI output and CSS selectors over dumping full HTML into the model context.
- Enable adaptive scraping only where selector brittleness is a real maintenance problem.
- Use `page_action`, `wait_selector`, and `network_idle` deliberately instead of adding blind sleeps.
- Treat Cloudflare solving, proxies, and browser impersonation as opt-in tools for authorized, policy-compliant work, not guaranteed bypasses.
- Remember that XML feeds are not the current target surface; Scrapling is documented around HTML parsing.
- Move from CLI to Python or spiders when retry logic, structured output, or crawl control becomes important.
- For MCP usage, make the client/server transport explicit: stdio for local agent integration, `--http` for streamable HTTP deployments.
## References
- references/fetchers-and-sessions.md
- references/parser-and-adaptive.md
- references/cli-and-mcp.md
- references/spiders.md
- scripts/install.sh
- scripts/run-extract.sh
- scripts/run-mcp.sh
- Scrapling GitHub Repository
- Scrapling Documentation