Crawler Skill
Playwriter exploration -> CDP evidence capture -> Documentation -> Code generation
Use `crawler` when the user wants a reusable crawling flow, a site extraction plan, API reverse engineering for crawling, or analysis-backed crawler code.
For resumable or multi-step crawl work, treat `.hypercore/crawler/<ACTION>.json` as the durable context file that preserves intent, current state, evidence pointers, and the next step.
Do not use `crawler` for generic browser automation, one-off page clicking, or document rewriting with no crawl deliverable.
For quick one-off extraction with no reusable crawler, keep the work lightweight and avoid forcing the full artifact set unless the request expands into crawl design.
Templates: document-templates.md · code-templates.md
Checklists: pre-crawl-checklist.md · anti-bot-checklist.md
References: playwriter-commands.md · chrome-devtools-mcp.md · cdp-capture.md · crawling-patterns.md · selector-strategies.md · network-crawling.md · action-manifest.md
<trigger_examples>
Positive examples:
- "Scrape product cards from this shop, inspect the API first, then generate a crawler."
- "Figure out how this logged-in dashboard loads data and document the cookies and headers."
- "Analyze this Cloudflare-protected site and recommend the safest crawl approach."
Negative examples:
- "Open this site and click through the signup flow."
- "Rewrite this crawl runbook for readability."
Boundary example:
- "Grab three prices from this public page right now." Prefer lightweight extraction unless the user asks for a reusable crawler or site-wide strategy.
</trigger_examples>
<trigger_conditions>
| Trigger | Action |
|---|---|
| Reusable crawling, scraping, or site-wide extraction | Run immediately |
| Site investigation or API reverse engineering for crawling | Start discovery and API interception |
| One-off extraction from a single page | Treat as a boundary case and keep the workflow lightweight unless reusable crawl work is requested |
| Anti-bot bypass or Cloudflare-heavy target | Start with risk checks and Anti-Detect guidance |
</trigger_conditions>
<support_file_routing>
Read support files in this order:
- Start with pre-crawl-checklist.md before making crawl or code decisions.
- Use playwriter-commands.md when you need session control, page interaction, visual inspection, or selector validation (Playwright MCP = driving).
- Use chrome-devtools-mcp.md when you need first-party Chrome DevTools fidelity for live network requests, console errors, performance traces, Lighthouse audits, or memory snapshots (Chrome DevTools MCP = debugging).
- Use cdp-capture.md when you need structured network, cookie, token, storage, or rate-limit evidence with lower token cost than full Playwriter snapshots.
- Use network-crawling.md when turning Playwriter / chrome-devtools-mcp / CDP evidence into `NETWORK.md`, `API.md`, and raw evidence files.
- Use selector-strategies.md when DOM extraction is still on the table.
- Use crawling-patterns.md when pagination, authentication, lazy loading, or retries shape the approach.
- Use anti-bot-checklist.md when the target shows blocks, CAPTCHA, Cloudflare, or explicit anti-detect requirements.
- Use action-manifest.md when the run needs a durable state file under `.hypercore/crawler/<ACTION>.json`.
- Use document-templates.md when writing `.hypercore/crawler/[site]/` artifacts.
- Use code-templates.md only after the method is chosen and the discovery evidence is documented.
</support_file_routing>
<mandatory_reasoning>
Mandatory Sequential Thinking
- Always use the `sequential-thinking` tool before starting crawl design, extraction strategy, or code generation decisions.
- Run `sequential-thinking` for each major phase: discovery, method selection, and implementation planning.
- If `sequential-thinking` is unavailable, stop and report the blocker instead of continuing without structured reasoning.
</mandatory_reasoning>
<execution_defaults>
- Do discovery before code generation, selector lock-in, or auth assumptions.
- Use Playwriter to reproduce the user-visible flow, then prefer CDP for structured network/auth evidence capture.
- Prefer an API-backed crawler when CDP or fallback browser-network evidence shows a stable endpoint and manageable auth.
- Keep large DOM or accessibility snapshots rare; use them for structure checks and selector validation, not as the default capture surface.
- If CDP attach fails, document the limitation in `ANALYSIS.md` and use Playwriter interception only when the fallback evidence is still sufficient.
- Stop and report blockers when legal constraints, repeated `403/429/503` responses, CAPTCHA, or strong anti-bot signals make automation unsafe.
- Do not promise `CRAWLER.ts` until the method, auth material, and rate-limit posture are documented.
</execution_defaults>
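The 403/429/503 stop rule above can be sketched as a small guard. Only the status codes come from this section; the function name and the threshold are illustrative assumptions, not part of the skill:

```typescript
// Sketch: treat a run of block-like responses (403/429/503) as a stop
// signal rather than something to retry through. Threshold is arbitrary.
const BLOCK_STATUSES = new Set([403, 429, 503]);

function shouldStopCrawl(recentStatuses: number[], threshold = 3): boolean {
  // A cluster of block responses suggests rate limiting or bot
  // detection, not transient failures.
  const blocked = recentStatuses.filter((s) => BLOCK_STATUSES.has(s)).length;
  return blocked >= threshold;
}

console.log(shouldStopCrawl([200, 429, 429, 503])); // → true (three block signals)
console.log(shouldStopCrawl([200, 200, 404]));      // → false
```

When the guard fires, the defaults above call for reporting the blocker, not for silently retrying or rotating proxies.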
<workflow>
| Phase | Task | Command/Method |
|---|---|---|
| 1. Session | Create session + open page | playwriter session new |
| 2. Explore | Reproduce the page flow with Playwriter | playwriter-commands.md |
| 3. Capture | Collect network/auth/perf evidence via CDP | cdp-capture.md |
| 4. Analyze | Decide API-first vs DOM-first | network-crawling.md, selector-strategies.md |
| 5. Document | Save findings under `.hypercore/crawler/[site-name]/` | Write `ANALYSIS.md`, `API.md`, `NETWORK.md` |
| 6. Code | Generate crawler implementation | code-templates.md |
<method_selection>
| Condition | Method | Notes |
|---|---|---|
| API found via CDP or fallback browser-network evidence + simple auth | | Fastest |
| API + cookie/token required | | Requires expiry handling |
| API + Cloudflare / DataDome / JA3 fingerprinting | | Restores TLS/JA3; pair with residential proxy |
| Discovery / live network + perf evidence | chrome-devtools-mcp | First-party CDP fidelity (network, console, perf trace, Lighthouse) — see chrome-devtools-mcp.md |
| Page driving / login / lazy-load triggering | Playwriter | "Make the page do the thing" |
| Strong anti-bot (Cloudflare, DataDome) | Patchright or rebrowser-patches | Patches Chromium / patches Playwright |
| Chromium-specific fingerprinting | Camoufox | Firefox-based stealth fork |
| No API (SSR) and no anti-bot | Playwright DOM | Parse directly |
</method_selection>
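When the API-first rows win, the generated crawler usually reduces to a thin paginated client. A minimal sketch under stated assumptions: the `{ items, nextPage }` response shape, the fetcher, and the delay are hypothetical and must be replaced with the evidence documented in API.md and NETWORK.md:

```typescript
// Sketch of an API-backed crawler loop. Endpoint, auth, and response
// shape are assumptions; take the real ones from the CDP evidence.
interface ApiPage<T> {
  items: T[];
  nextPage: number | null; // null means the last page was reached
}

async function crawlAll<T>(
  fetchPage: (page: number) => Promise<ApiPage<T>>,
  delayMs = 500, // respect the documented rate-limit posture
): Promise<T[]> {
  const out: T[] = [];
  let page: number | null = 1;
  while (page !== null) {
    const res = await fetchPage(page);
    out.push(...res.items);
    page = res.nextPage;
    if (page !== null) await new Promise<void>((r) => setTimeout(r, delayMs));
  }
  return out;
}

// Stubbed fetcher standing in for the real endpoint, so the loop's
// termination logic can be exercised without network access:
const fake = async (p: number): Promise<ApiPage<string>> =>
  p < 3
    ? { items: [`item-${p}`], nextPage: p + 1 }
    : { items: [`item-${p}`], nextPage: null };

crawlAll(fake, 0).then((all) => console.log(all)); // logs item-1..item-3 in order
```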
<output_structure>
- `.hypercore/crawler/<ACTION>.json` preserves intent, current status, capture mode, blockers, output pointers, and the next step.
- `.hypercore/crawler/[site-name]/` preserves detailed evidence, analysis, and generated code for that site.
```text
.hypercore/crawler/
├── <ACTION>.json                  # durable action context
└── [site-name]/
    ├── ANALYSIS.md
    ├── SELECTORS.md
    ├── API.md
    ├── NETWORK.md
    ├── raw/
    │   ├── network-summary.json
    │   ├── auth-signals.json
    │   └── endpoint-candidates.json
    └── CRAWLER.ts
```
Site artifact contract:
```text
.hypercore/crawler/[site-name]/
├── ANALYSIS.md                    # Site structure
├── SELECTORS.md                   # DOM selectors
├── API.md                         # API endpoints
├── NETWORK.md                     # Auth/network details
├── raw/
│   ├── network-summary.json       # normalized request/response evidence
│   ├── auth-signals.json          # cookies/storage/header evidence
│   └── endpoint-candidates.json   # deduped API candidates
└── CRAWLER.ts                     # Generated crawler code
```
Minimum artifact contract:
- `.hypercore/crawler/<ACTION>.json` is required for reusable, blocked, or resumable crawl work.
- `ANALYSIS.md` is always required for reusable crawl work.
- `SELECTORS.md` is required when DOM extraction is used or kept as a fallback path.
- `API.md` is required when API discovery was attempted; document discovered endpoints or the absence of a usable API.
- `NETWORK.md` is required when cookies, tokens, headers, rate limits, or bot-detection signals affect the method.
- `raw/network-summary.json`, `raw/auth-signals.json`, and `raw/endpoint-candidates.json` are recommended when CDP capture is available, and should back the human-readable docs instead of replacing them.
- `CRAWLER.ts` is required only after discovery evidence is written and the chosen method is justified.
Starter interaction commands live in playwriter-commands.md. CDP evidence capture lives in cdp-capture.md. Durable action-state rules live in action-manifest.md. Keep the core focused on method choice, output gates, and stop conditions.
Templates: document-templates.md
</output_structure>
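The action-context fields named above map naturally onto a typed record. The field names below are inferred from this section and the validation checklist (`status`, `capture_mode`, `site_dir`, `next_step`); treat action-manifest.md as authoritative and this as a sketch:

```typescript
// Sketch of the durable action-context shape. Field names follow the
// terms used in this skill's checklists; action-manifest.md is the
// source of truth.
interface ActionContext {
  intent: string;
  status: "running" | "blocked" | "completed";
  capture_mode: "cdp" | "playwriter" | "none";
  blockers: string[];
  site_dir: string;          // e.g. ".hypercore/crawler/shop-example-com/"
  output_pointers: string[]; // paths to ANALYSIS.md, API.md, etc.
  next_step: string;         // empty or terminal on completed runs
}

const action: ActionContext = {
  intent: "extract products",
  status: "running",
  capture_mode: "cdp",
  blockers: [],
  site_dir: ".hypercore/crawler/shop-example-com/",
  output_pointers: [".hypercore/crawler/shop-example-com/API.md"],
  next_step: "generate CRAWLER.ts from API.md evidence",
};

console.log(JSON.stringify(action).includes('"status":"running"')); // → true
```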
<blocked_outcomes>
For blocked or unsafe runs:
- write `ANALYSIS.md` with the blocker, the evidence that triggered the stop, and the safest next step
- write `NETWORK.md` when auth signals, block responses, or anti-bot findings affected the decision
- write any available raw evidence files even when the run is blocked, so the stop is auditable
- update `ACTION.json` so `status`, `capture_mode`, blockers, and output pointers match the blocked state
- omit `CRAWLER.ts` until the blocker is resolved or the method becomes safe to automate
</blocked_outcomes>
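Blocked-run bookkeeping amounts to one state transition. A sketch reusing the same assumed field names; only the rules (mark the run blocked, keep evidence pointers, drop premature crawler code) come from this section:

```typescript
// Sketch: move an action context into the blocked state while keeping
// the stop auditable. Field names are assumptions, as above.
interface BlockableAction {
  status: string;
  blockers: string[];
  output_pointers: string[];
  next_step: string;
}

function markBlocked(action: BlockableAction, blocker: string): BlockableAction {
  return {
    ...action,
    status: "blocked",
    blockers: [...action.blockers, blocker],
    // Drop any premature crawler-code pointer; docs and raw evidence stay.
    output_pointers: action.output_pointers.filter((p) => !p.endsWith("CRAWLER.ts")),
    next_step: `resolve blocker: ${blocker}`,
  };
}

const blocked = markBlocked(
  {
    status: "running",
    blockers: [],
    output_pointers: ["site/NETWORK.md", "site/CRAWLER.ts"],
    next_step: "",
  },
  "Cloudflare challenge on every request",
);
console.log(blocked.status, blocked.output_pointers); // "blocked", only the docs pointer remains
```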
<validation>
```text
✅ Playwriter session created
✅ `ACTION.json` created when the run is reusable, blocked, or resumable
✅ Structure analyzed with limited Playwriter snapshots
✅ CDP capture attempted for network/auth evidence
✅ raw evidence files recorded when CDP capture is available, or the fallback limitation documented when it is not
✅ Selector extraction validated
✅ Findings documented under .hypercore/crawler/
✅ Crawler code generated
✅ sequential-thinking trace recorded for major phases
✅ legal, rate-limit, and bot-detection blockers documented before scaling
✅ blocked runs reported explicitly when crawler code is unsafe or premature
✅ `ACTION.json` status and `site_dir` match the actual run outputs
✅ completed runs leave `ACTION.json.next_step` empty or terminal and point outputs at final files
```
</validation>
<forbidden>
| Category | Forbidden |
|---|---|
| Analysis | Guess selectors without structure analysis |
| Approach | Use DOM-only flow without checking APIs |
| Documentation | Skip documenting analysis results |
| Network | Ignore rate limiting |
</forbidden>
<example>
User: /crawler crawl products from https://shop.example.com
1. Create durable action context
.hypercore/crawler/extract-products.json
2. Session
playwriter session new # => 1
playwriter -s 1 -e "state.page = await context.newPage(); await state.page.goto('https://shop.example.com/products')"
3. Structure analysis
playwriter -s 1 -e "console.log(await accessibilitySnapshot({ page: state.page }))"
=> list "Products" [ref=e5]: listitem [ref=e6]: link "Product A" [ref=e7]
4. CDP capture
playwriter -s 1 -e $'
const client = await state.page.context().newCDPSession(state.page);
await client.send("Network.enable");
state.cdpHits = [];
client.on("Network.responseReceived", (event) => {
if (event.response.url.includes("/api/")) state.cdpHits.push(event.response.url);
});
'
playwriter -s 1 -e "await state.page.evaluate(() => window.scrollTo(0, 9999))"
playwriter -s 1 -e "console.log(state.cdpHits)"
=> ["/api/products?page=2"]
5. Update extract-products.json -> status=running, capture_mode=cdp
6. Documentation -> .hypercore/crawler/shop-example-com/ + raw/network-summary.json
7. Generate API-based crawler
</example>