crawler


Crawler Skill


Playwriter exploration -> CDP evidence capture -> Documentation -> Code generation
Use `crawler` when the user wants a reusable crawling flow, a site extraction plan, API reverse engineering for crawling, or analysis-backed crawler code.
For resumable or multi-step crawl work, treat `.hypercore/crawler/<ACTION>.json` as the durable context file that preserves intent, current state, evidence pointers, and the next step.
Do not use `crawler` for generic browser automation, one-off page clicking, or document rewriting with no crawl deliverable.
For quick one-off extraction with no reusable crawler, keep the work lightweight and avoid forcing the full artifact set unless the request expands into crawl design.
Templates: document-templates.md · code-templates.md
Checklists: pre-crawl-checklist.md · anti-bot-checklist.md
References: playwriter-commands.md · chrome-devtools-mcp.md · cdp-capture.md · crawling-patterns.md · selector-strategies.md · network-crawling.md · action-manifest.md

<trigger_examples>
Positive examples:
  • "Scrape product cards from this shop, inspect the API first, then generate a crawler."
  • "Figure out how this logged-in dashboard loads data and document the cookies and headers."
  • "Analyze this Cloudflare-protected site and recommend the safest crawl approach."
Negative examples:
  • "Open this site and click through the signup flow."
  • "Rewrite this crawl runbook for readability."
Boundary example:
  • "Grab three prices from this public page right now." Prefer lightweight extraction unless the user asks for a reusable crawler or site-wide strategy.
</trigger_examples>

<trigger_conditions>
| Trigger | Action |
|---|---|
| Reusable crawling, scraping, or site-wide extraction | Run immediately |
| Site investigation or API reverse engineering for crawling | Start discovery and API interception |
| One-off extraction from a single page | Treat as a boundary case and keep the workflow lightweight unless reusable crawl work is requested |
| Anti-bot bypass or Cloudflare-heavy target | Start with risk checks and Anti-Detect guidance |
</trigger_conditions>

<support_file_routing>
Read support files in this order:
  1. Start with pre-crawl-checklist.md before making crawl or code decisions.
  2. Use playwriter-commands.md when you need session control, page interaction, visual inspection, or selector validation (Playwright MCP = driving).
  3. Use chrome-devtools-mcp.md when you need first-party Chrome DevTools fidelity for live network requests, console errors, performance traces, Lighthouse audits, or memory snapshots (Chrome DevTools MCP = debugging).
  4. Use cdp-capture.md when you need structured network, cookie, token, storage, or rate-limit evidence with lower token cost than full Playwriter snapshots.
  5. Use network-crawling.md when turning Playwriter / chrome-devtools-mcp / CDP evidence into `API.md`, `NETWORK.md`, and raw evidence files.
  6. Use selector-strategies.md when DOM extraction is still on the table.
  7. Use crawling-patterns.md when pagination, authentication, lazy loading, or retries shape the approach.
  8. Use anti-bot-checklist.md when the target shows blocks, CAPTCHA, Cloudflare, or explicit anti-detect requirements.
  9. Use action-manifest.md when the run needs a durable state file under `.hypercore/crawler/<ACTION>.json`.
  10. Use document-templates.md when writing `.hypercore/crawler/[site]/` artifacts.
  11. Use code-templates.md only after the method is chosen and the discovery evidence is documented.
</support_file_routing>

<mandatory_reasoning>

Mandatory Sequential Thinking


  • Always use the `sequential-thinking` tool before starting crawl design, extraction strategy, or code generation decisions.
  • Run `sequential-thinking` for each major phase: discovery, method selection, and implementation planning.
  • If `sequential-thinking` is unavailable, stop and report the blocker instead of continuing without structured reasoning.
</mandatory_reasoning>

<execution_defaults>
  • Do discovery before code generation, selector lock-in, or auth assumptions.
  • Use Playwriter to reproduce the user-visible flow, then prefer CDP for structured network/auth evidence capture.
  • Prefer an API-backed crawler when CDP or fallback browser-network evidence shows a stable endpoint and manageable auth.
  • Keep large DOM or accessibility snapshots rare; use them for structure checks and selector validation, not as the default capture surface.
  • If CDP attach fails, document the limitation in `ANALYSIS.md` and use Playwriter interception only when the fallback evidence is still sufficient.
  • Stop and report blockers when legal constraints, repeated `403/429/503` responses, CAPTCHA, or strong anti-bot signals make automation unsafe.
  • Do not promise `CRAWLER.ts` until the method, auth material, and rate-limit posture are documented.
</execution_defaults>
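The "stop and report blockers" default can be sketched as a guard around request retries. This is an illustrative sketch only, not part of the skill's templates; the function name, retry budget, and result shape are hypothetical:

```typescript
// Hypothetical guard around request retries: back off on block statuses,
// then stop and surface a blocker instead of escalating.
const BLOCK_STATUSES = new Set([403, 429, 503]);

type FetchLike = (url: string) => Promise<{ status: number; body: string }>;
type GuardResult = { ok: true; body: string } | { ok: false; blocker: string };

async function fetchWithStopGuard(
  url: string,
  fetchFn: FetchLike,
  maxBlocked = 3,
  baseDelayMs = 500,
): Promise<GuardResult> {
  let blocked = 0;
  for (let attempt = 0; attempt < maxBlocked; attempt++) {
    const res = await fetchFn(url);
    if (!BLOCK_STATUSES.has(res.status)) return { ok: true, body: res.body };
    blocked++;
    if (blocked >= maxBlocked) break;
    // Exponential backoff before the next attempt.
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
  // Report the blocker (destined for ANALYSIS.md / ACTION.json); do not keep retrying.
  return { ok: false, blocker: `repeated block responses (${blocked}x) from ${url}` };
}
```

The point of the shape is that a blocked outcome is a first-class return value the caller must record, not an exception to swallow and retry forever.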

<workflow>
| Phase | Task | Command/Method |
|---|---|---|
| 1. Session | Create session + open page | `playwriter session new` |
| 2. Explore | Reproduce the page flow with Playwriter | `accessibilitySnapshot`, `screenshotWithAccessibilityLabels` |
| 3. Capture | Collect network/auth/perf evidence via `chrome-devtools-mcp` (preferred) or CDP fallback | `list_network_requests`, `list_console_messages`, `performance_start_trace`; CDP `Network.*`, `Storage.*`, `Runtime.evaluate` — see chrome-devtools-mcp.md and cdp-capture.md |
| 4. Analyze | Decide API-first vs DOM-first | network-crawling.md, selector-strategies.md |
| 5. Document | Save findings under `.hypercore/crawler/[site]/` | Write |
| 6. Code | Generate crawler implementation | code-templates.md |
</workflow>
<method_selection>
| Condition | Method | Notes |
|---|---|---|
| API found via CDP or fallback browser-network evidence + simple auth | `fetch` / `httpx` | Fastest |
| API + cookie/token required | `fetch` + Cookie | Requires expiry handling |
| API + Cloudflare / DataDome / JA3 fingerprinting | `curl_cffi` (impersonate Chrome) | Restores TLS/JA3; pair with residential proxy |
| Discovery / live network + perf evidence | `chrome-devtools-mcp` | First-party CDP fidelity (network, console, perf trace, Lighthouse) — see chrome-devtools-mcp.md |
| Page driving / login / lazy-load triggering | `playwriter` | "Make the page do the thing" |
| Strong anti-bot (Cloudflare, DataDome) | Patchright or rebrowser-patches | Patches Chromium `Runtime.Enable` leakage — see anti-bot-checklist.md |
| Chromium-specific fingerprinting | Camoufox | Firefox-based stealth fork |
| No API (SSR) and no anti-bot | Playwright DOM | Parse directly |
</method_selection>
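For the first two rows, the API-backed method usually reduces to a small paginated fetch loop. A minimal sketch under stated assumptions: the `/api/products?page=N` endpoint, the `Product` shape, and the injectable `fetchPage` function are all hypothetical, and in real use `fetchPage` would wrap `fetch` with the documented Cookie/header material:

```typescript
// Hypothetical paginated API crawler: fixed delay between pages
// (rate-limit posture), stop when a page returns no items.
type Product = { id: string; name: string };
type PageFetcher = (url: string) => Promise<Product[]>;

async function crawlProducts(
  baseUrl: string,          // e.g. "https://shop.example.com" (illustrative)
  fetchPage: PageFetcher,   // wraps fetch + Cookie header in real use
  delayMs = 1000,
  maxPages = 50,
): Promise<Product[]> {
  const all: Product[] = [];
  for (let page = 1; page <= maxPages; page++) {
    const items = await fetchPage(`${baseUrl}/api/products?page=${page}`);
    if (items.length === 0) break;                     // empty page => done
    all.push(...items);
    await new Promise((r) => setTimeout(r, delayMs));  // never ignore rate limits
  }
  return all;
}
```

Injecting the fetcher keeps the pagination logic testable without network access, which also makes the generated `CRAWLER.ts` easier to validate against recorded evidence.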

<output_structure>
`.hypercore/crawler/<ACTION>.json`
  • `ACTION.json` preserves intent, current status, capture mode, blockers, output pointers, and the next step.
  • `.hypercore/crawler/[site-name]/` preserves detailed evidence, analysis, and generated code for that site.
```text
.hypercore/crawler/
├── <ACTION>.json              # durable action context
└── [site-name]/
    ├── ANALYSIS.md
    ├── SELECTORS.md
    ├── API.md
    ├── NETWORK.md
    ├── raw/
    │   ├── network-summary.json
    │   ├── auth-signals.json
    │   └── endpoint-candidates.json
    └── CRAWLER.ts
```
Site artifact contract:
```text
.hypercore/crawler/[site-name]/
├── ANALYSIS.md      # Site structure
├── SELECTORS.md     # DOM selectors
├── API.md           # API endpoints
├── NETWORK.md       # Auth/network details
├── raw/
│   ├── network-summary.json      # normalized request/response evidence
│   ├── auth-signals.json         # cookies/storage/header evidence
│   └── endpoint-candidates.json  # deduped API candidates
└── CRAWLER.ts       # Generated crawler code
```
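One plausible way the deduped `endpoint-candidates.json` entries could be derived from captured request URLs is to strip query values and merge by method + path. The candidate shape below is illustrative, not a fixed schema:

```typescript
// Illustrative dedup: normalize captured URLs into endpoint candidates,
// keeping query-parameter names but dropping their values.
type Candidate = { method: string; endpoint: string; params: string[] };

function dedupeEndpoints(hits: { method: string; url: string }[]): Candidate[] {
  const seen = new Map<string, Candidate>();
  for (const { method, url } of hits) {
    const u = new URL(url);
    const key = `${method} ${u.origin}${u.pathname}`;
    const params = [...u.searchParams.keys()];
    const existing = seen.get(key);
    if (existing) {
      // Same endpoint seen again: union the observed parameter names.
      for (const p of params) if (!existing.params.includes(p)) existing.params.push(p);
    } else {
      seen.set(key, { method, endpoint: `${u.origin}${u.pathname}`, params });
    }
  }
  return [...seen.values()];
}
```

Collapsing `?page=1` and `?page=2&sort=asc` into one candidate with `params: ["page", "sort"]` is what makes the raw capture usable as evidence for `API.md`.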
Minimum artifact contract:
  • `.hypercore/crawler/<ACTION>.json` is required for reusable, blocked, or resumable crawl work.
  • `ANALYSIS.md` is always required for reusable crawl work.
  • `SELECTORS.md` is required when DOM extraction is used or kept as a fallback path.
  • `API.md` is required when API discovery was attempted; document discovered endpoints or the absence of a usable API.
  • `NETWORK.md` is required when cookies, tokens, headers, rate limits, or bot-detection signals affect the method.
  • `raw/network-summary.json`, `raw/auth-signals.json`, and `raw/endpoint-candidates.json` are recommended when CDP capture is available, and should back the human-readable docs instead of replacing them.
  • `CRAWLER.ts` is required only after discovery evidence is written and the chosen method is justified.
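Tying the contract together, a `<ACTION>.json` for a running capture might look like the following sketch. Only `status`, `capture_mode`, `next_step`, and `site_dir` are named in this document; the other field names and all values are illustrative:

```json
{
  "intent": "extract product cards from shop.example.com",
  "status": "running",
  "capture_mode": "cdp",
  "site_dir": ".hypercore/crawler/shop-example-com/",
  "blockers": [],
  "outputs": [
    ".hypercore/crawler/shop-example-com/ANALYSIS.md",
    ".hypercore/crawler/shop-example-com/raw/network-summary.json"
  ],
  "next_step": "document API pagination in API.md, then generate CRAWLER.ts"
}
```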
Starter interaction commands live in playwriter-commands.md. CDP evidence capture lives in cdp-capture.md. Durable action-state rules live in action-manifest.md. Keep the core focused on method choice, output gates, and stop conditions.
Templates: document-templates.md
</output_structure>

<blocked_outcomes>
For blocked or unsafe runs:
  • write `ANALYSIS.md` with the blocker, the evidence that triggered the stop, and the safest next step
  • write `NETWORK.md` when auth signals, block responses, or anti-bot findings affected the decision
  • write any available raw evidence files even when the run is blocked, so the stop is auditable
  • update `ACTION.json` so `status`, `capture_mode`, blockers, and output pointers match the blocked state
  • omit `CRAWLER.ts` until the blocker is resolved or the method becomes safe to automate
</blocked_outcomes>

<validation>
```text
✅ Playwriter session created
✅ ACTION.json created when the run is reusable, blocked, or resumable
✅ Structure analyzed with limited Playwriter snapshots
✅ CDP capture attempted for network/auth evidence
✅ Raw evidence files recorded when CDP capture is available, or the fallback limitation documented when it is not
✅ Selector extraction validated
✅ Findings documented under .hypercore/crawler/
✅ Crawler code generated
✅ sequential-thinking trace recorded for major phases
✅ Legal, rate-limit, and bot-detection blockers documented before scaling
✅ Blocked runs reported explicitly when crawler code is unsafe or premature
✅ ACTION.json status and site_dir match the actual run outputs
✅ Completed runs leave ACTION.json.next_step empty or terminal and point outputs at final files
```
<forbidden>
| Category | Forbidden |
|---|---|
| Analysis | Guess selectors without structure analysis |
| Approach | Use DOM-only flow without checking APIs |
| Documentation | Skip documenting analysis results |
| Network | Ignore rate limiting |
</forbidden>
<example>

User: /crawler crawl products from https://shop.example.com

```bash
# 1. Create durable action context: .hypercore/crawler/extract-products.json

# 2. Session
playwriter session new   # => 1
playwriter -s 1 -e "state.page = await context.newPage(); await state.page.goto('https://shop.example.com/products')"

# 3. Structure analysis
playwriter -s 1 -e "console.log(await accessibilitySnapshot({ page: state.page }))"
# => list "Products" [ref=e5]: listitem [ref=e6]: link "Product A" [ref=e7]

# 4. CDP capture
playwriter -s 1 -e $'
  const client = await state.page.context().newCDPSession(state.page);
  await client.send("Network.enable");
  state.cdpHits = [];
  client.on("Network.responseReceived", (event) => {
    if (event.response.url.includes("/api/")) state.cdpHits.push(event.response.url);
  });
'
playwriter -s 1 -e "await state.page.evaluate(() => window.scrollTo(0, 9999))"
playwriter -s 1 -e "console.log(state.cdpHits)"
# => ["/api/products?page=2"]

# 5. Update extract-products.json -> status=running, capture_mode=cdp

# 6. Documentation -> .hypercore/crawler/shop-example-com/ + raw/network-summary.json

# 7. Generate API-based crawler
```

</example>