scrapling-skill

Scrapling Skill

Overview

Use Scrapling through its CLI as the default path. Start with the smallest working command, validate the saved output, and only escalate to browser-backed fetching when the static fetch does not contain the real page content.
Do not assume the user's Scrapling install is healthy. Verify it first.

Default Workflow

Copy this checklist and keep it updated while working:
```text
Scrapling Progress:
- [ ] Step 1: Diagnose the local Scrapling install
- [ ] Step 2: Fix CLI extras or browser runtime if needed
- [ ] Step 3: Choose static or dynamic fetch
- [ ] Step 4: Save output to a file
- [ ] Step 5: Validate file size and extracted content
- [ ] Step 6: Escalate only if the previous path failed
```

Step 1: Diagnose the Install

Run the bundled diagnostic script first:
```bash
python3 scripts/diagnose_scrapling.py
```
Use the result as the source of truth for the next step.

Step 2: Fix the Install

If the CLI was installed without extras

If `scrapling --help` fails with a missing `click` error or a message about installing Scrapling with extras, reinstall it with the CLI extra:

```bash
uv tool uninstall scrapling
uv tool install 'scrapling[shell]'
```

Do not default to `scrapling[all]` unless the user explicitly needs the broader feature set.

If browser-backed fetchers are needed

Install the Playwright runtime:

```bash
scrapling install
```

If the install looks slow or opaque, read `references/troubleshooting.md` before guessing. Do not claim success until either:
  • `scrapling install` reports that dependencies are already installed, or
  • the diagnostic script confirms both Chromium and Chrome Headless Shell are present.
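
As a quick cross-check of the second condition, you can list the Playwright browser cache directly. This is a hedged sketch, not part of Scrapling: the default cache path below is the Linux convention (macOS uses `~/Library/Caches/ms-playwright`), and `PLAYWRIGHT_BROWSERS_PATH` is Playwright's standard override variable.

```shell
# List installed Playwright browsers relevant to browser-backed fetching.
# Assumes the Linux default cache path when PLAYWRIGHT_BROWSERS_PATH is unset.
list_playwright_browsers() {
  local dir="${PLAYWRIGHT_BROWSERS_PATH:-$HOME/.cache/ms-playwright}"
  # Each installed browser is a versioned subdirectory, e.g. chromium-1140.
  ls "$dir" 2>/dev/null | grep -Ei 'chromium|chrome_headless_shell'
}
```

If the listing shows both a `chromium-*` and a `chrome_headless_shell-*` entry, the runtime condition above is satisfied; when this check and the diagnostic script disagree, treat the script as authoritative.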

Step 3: Choose the Fetcher

Use this decision rule:
  • Start with `extract get` for normal pages, article pages, and most WeChat public articles.
  • Use `extract fetch` when the static HTML does not contain the real content or the page depends on JavaScript rendering.
  • Use `extract stealthy-fetch` only after `fetch` still fails because of anti-bot or challenge behavior. Do not make it the default.
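
The rule maps to an escalation ladder of concrete commands. This is a sketch using a placeholder URL; run each rung only after the previous one has been tried and its output validated:

```bash
# 1. Static fetch first: cheapest, no browser needed
scrapling extract get 'https://example.com' page.html
# 2. Browser-backed fetch when static HTML lacks the real content
scrapling extract fetch 'https://example.com' page.html
# 3. Stealth mode only after fetch fails on anti-bot or challenge behavior
scrapling extract stealthy-fetch 'https://example.com' page.html
```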

Step 4: Run the Smallest Useful Command

Always quote URLs in shell commands. This is mandatory in `zsh` when the URL contains `?`, `&`, or other special characters.

Full page to HTML

```bash
scrapling extract get 'https://example.com' page.html
```

Main content to Markdown

```bash
scrapling extract get 'https://example.com' article.md -s 'main'
```

JS-rendered page with browser automation

```bash
scrapling extract fetch 'https://example.com' page.html --timeout 20000
```

WeChat public article body

Use `#js_content` first. This is the default selector for article body extraction on `mp.weixin.qq.com` pages.

```bash
scrapling extract get 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' article.md -s '#js_content'
```

Step 5: Validate the Output

After every extraction, verify the file instead of assuming success:

```bash
wc -c article.md
sed -n '1,40p' article.md
```

For HTML output, check that the expected title, container, or selector target is actually present:

```bash
rg -n '<title>|js_content|rich_media_title|main' page.html
```

If the file is tiny, empty, or missing the expected container, the extraction did not succeed. Go back to Step 3 and switch fetchers or selectors.
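
The size and selector checks can be folded into one small helper. This is an illustrative sketch, not part of Scrapling; the function name, the 500-byte floor, and the pattern argument are all assumptions to tune per site:

```shell
# Hypothetical helper: fail fast when an extraction output is too small
# or is missing its expected container/selector target.
check_extract() {
  local file="$1" pattern="$2" min_bytes="${3:-500}"
  local size
  size=$(wc -c < "$file" | tr -d '[:space:]')
  if [ "$size" -lt "$min_bytes" ]; then
    echo "FAIL: $file is only $size bytes"
    return 1
  fi
  if ! grep -q "$pattern" "$file"; then
    echo "FAIL: $file does not match '$pattern'"
    return 1
  fi
  echo "OK: $file ($size bytes, matched '$pattern')"
}
```

For example, `check_extract article.md 'js_content'` after a WeChat fetch; a non-zero exit means go back to Step 3.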

Step 6: Handle Known Failure Modes

Local TLS trust store problem

If `extract get` fails with `curl: (60) SSL certificate problem`, treat it as a local trust-store problem first, not a Scrapling content failure. Retry the same command with the `--no-verify` flag added.

Only do this after confirming the failure matches the local certificate verification error pattern. Do not silently disable verification by default.
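
One way to enforce that discipline is a wrapper that retries only when the output actually matches the certificate error. A sketch under assumptions: `retry_no_verify` is a hypothetical helper name, and it appends `--no-verify` as a trailing flag as described above:

```shell
# Retry a command with --no-verify only when its output shows the local
# certificate verification error pattern; otherwise fail as-is.
retry_no_verify() {
  local out
  if out=$("$@" 2>&1); then
    printf '%s\n' "$out"
    return 0
  fi
  if printf '%s' "$out" | grep -q 'SSL certificate problem'; then
    echo 'certificate error detected, retrying with --no-verify' >&2
    "$@" --no-verify
  else
    printf '%s\n' "$out" >&2
    return 1
  fi
}
# usage: retry_no_verify scrapling extract get 'https://example.com' page.html
```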

WeChat article pages

For `mp.weixin.qq.com`:
  • Try `extract get` before `extract fetch`
  • Use `-s '#js_content'` for the article body
  • Validate the saved Markdown or HTML immediately

Browser-backed fetch failures

If `extract fetch` fails:
  1. Re-check the install with `python3 scripts/diagnose_scrapling.py`
  2. Confirm Chromium and Chrome Headless Shell are present
  3. Retry with a slightly longer timeout
  4. Escalate to `stealthy-fetch` only if the site behavior justifies it
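
The retry step can be mechanized with a small backoff loop. A sketch under assumptions: `fetch_with_backoff` is a hypothetical name, the timeout ladder is arbitrary, and it relies on the `--timeout` flag shown in the `extract fetch` example in Step 4:

```shell
# Retry a fetch command with progressively longer --timeout values
# (milliseconds), giving up after the last attempt fails.
fetch_with_backoff() {
  local timeout
  for timeout in 20000 40000 80000; do
    if "$@" --timeout "$timeout"; then
      return 0
    fi
    echo "fetch failed at ${timeout}ms, retrying" >&2
  done
  return 1
}
# usage: fetch_with_backoff scrapling extract fetch 'https://example.com' page.html
```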

Command Patterns

Diagnose and smoke test a URL

```bash
python3 scripts/diagnose_scrapling.py --url 'https://example.com'
```

Diagnose and smoke test a WeChat article body

```bash
python3 scripts/diagnose_scrapling.py \
  --url 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' \
  --selector '#js_content' \
  --no-verify
```

Diagnose and smoke test a browser-backed fetch

```bash
python3 scripts/diagnose_scrapling.py \
  --url 'https://example.com' \
  --dynamic
```

Guardrails

  • Do not tell the user to reinstall blindly. Verify first.
  • Do not default to the Python library API when the user is clearly asking about the CLI.
  • Do not jump to browser-backed fetching unless the static result is missing the real content.
  • Do not claim success from exit code alone. Inspect the saved file.
  • Do not hardcode user-specific absolute paths into outputs or docs.

Resources

  • Installation and smoke test helper: `scripts/diagnose_scrapling.py`
  • Verified failure modes and recovery paths: `references/troubleshooting.md`