extract

When to Use This Skill

Activate when the user wants to obtain data from a website:

"Extract all product prices from this page"
"Scrape the table of results from ..."
"Pull the list of authors and titles from arXiv search results"
"Collect all job listings from this page"
"Get the data from this dashboard table"
"Harvest review scores from ..."
"Download all the links/images/cards from ..."

The deliverable is always two artifacts:

Executable Playwright script: a standalone `.cjs` file that reproduces the extraction without Actionbook at runtime.
Extracted data: JSON (default), CSV, or user-specified format written to disk.

Decision Strategy

Use Actionbook as a conditional accelerator, not a mandatory step. The goal is reliable selectors in the shortest path.

User request
  │
  ├─► actionbook search "<site> <intent>"
  │     ├─ Results with Health Score ≥ 70%  ──► actionbook get "<ID>" ──► use selectors
  │     └─ No results / low score  ──► Fallback
  │
  └─► Fallback: actionbook browser open <url>
        ├─ actionbook browser snapshot   (accessibility tree → find selectors)
        ├─ actionbook browser screenshot (visual confirmation)
        └─ manual selector discovery via DOM inspection

Priority order for selector sources:

Priority	Source	When
1	`actionbook get`	Site is indexed, health score ≥ 70%
2	`actionbook browser snapshot`	Not indexed or selectors outdated
3	DOM inspection via screenshot + snapshot	Complex SPA / dynamic content

Non-negotiable rule: if `search` + `get` already provides usable selectors for the required fields, start from the `get` selectors and do not jump to full fallback (`snapshot` / `screenshot`) by default. Exception: lightweight mechanism probes (for hydration/virtualization/pagination) are allowed when runtime behavior may affect script correctness. Escalate to `snapshot` / `screenshot` only when probes or sample validation indicate selector gaps or instability.

Mechanism-Aware Script Strategy

Websites use patterns that break naive scraping. The generated Playwright script must account for these:

Streaming / SSR / RSC hydration

Pages may render a shell first, then stream or hydrate content.

javascript

// Wait for hydration to complete — not just DOMContentLoaded
await page.waitForSelector('[data-item]', { state: 'attached' });
await page.waitForFunction(() => {
  const items = document.querySelectorAll('[data-item]');
  return items.length > 0 && !document.querySelector('[data-pending]');
});

Detection cues: React root with `data-reactroot`, Next.js `__NEXT_DATA__`, empty containers that fill after JS runs. If `actionbook browser text "<selector>"` returns empty but the screenshot shows content, hydration hasn't completed.

Virtualized lists / virtual scrolling

Only visible rows exist in the DOM. Scrolling renders new rows and destroys old ones.

javascript

// Scroll-and-collect loop for virtualized lists (scroll container aware)
const allItems = [];
const maxScrolls = 50;
let scrolls = 0;

const container = await page.$('<scroll-container-selector>');
if (!container) throw new Error('Scroll container not found');

let previousTop = await container.evaluate(el => el.scrollTop);
while (scrolls < maxScrolls) {
  const items = await page.$$eval('[data-row]', rows =>
    rows.map(r => ({ text: r.textContent.trim() }))
  );
  for (const item of items) {
    if (!allItems.find(i => i.text === item.text)) allItems.push(item);
  }

  await container.evaluate(el => el.scrollBy(0, 600));
  await page.waitForTimeout(300);

  const currentTop = await container.evaluate(el => el.scrollTop);
  if (currentTop === previousTop) break;

  previousTop = currentTop;
  scrolls += 1;
}
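The loop above deduplicates with `allItems.find`, which rescans the whole array for every row and treats two distinct rows with identical text as one. A minimal sketch of an alternative, assuming each row exposes some stable identifier; the `id` field and the `mergeRows` helper are hypothetical and not part of the original skill:

```javascript
// Hypothetical helper: merge a freshly scraped batch into a Map keyed by a
// stable row key, so rows seen in overlapping scroll windows are stored once.
function mergeRows(seen, batch, keyOf = row => row.id ?? row.text) {
  let added = 0;
  for (const row of batch) {
    const key = keyOf(row);
    if (!seen.has(key)) {
      seen.set(key, row);
      added += 1;
    }
  }
  return added; // number of rows that were new in this batch
}

// Two overlapping "windows" of a virtualized list, shaped like $$eval results.
const seen = new Map();
mergeRows(seen, [{ id: 'r1', text: 'Alice' }, { id: 'r2', text: 'Bob' }]);
const newlyAdded = mergeRows(seen, [{ id: 'r2', text: 'Bob' }, { id: 'r3', text: 'Alice' }]);

const allRows = [...seen.values()];
console.log(newlyAdded, allRows.length);
```

Inside the scroll loop, the `$$eval` callback would need to return whatever identifier the site actually provides. If there is none, a composite of several extracted fields could serve as the key.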

Detection cues: container has a fixed height with `overflow: auto/scroll`, row count in the DOM is much smaller than the stated total, rows have `transform: translateY(...)` or `position: absolute; top: ...px`.

Infinite scroll / lazy loading

New content appends when the user scrolls near the bottom.

javascript

// Scroll to bottom until no new content loads (with no-growth tolerance)
let itemCount = 0;
let noGrowthStreak = 0;
const maxScrolls = 80;
let scrolls = 0;

while (scrolls < maxScrolls && noGrowthStreak < 3) {
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(1200);

  const newCount = await page.$$eval('.item', els => els.length);
  if (newCount > itemCount) {
    itemCount = newCount;
    noGrowthStreak = 0;
  } else {
    noGrowthStreak += 1;
  }

  scrolls += 1;
}

Detection cues: Intersection Observer in page JS, "Load more" button, sentinel element at bottom, network requests firing on scroll.
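The scroll loop stops for two different reasons: the page ran out of content, or the `maxScrolls` guardrail was hit and the data may be truncated. A small sketch of the same bookkeeping as a pure function that also reports which one happened; the `stepScrollState` name and the `stopReason` values are illustrative assumptions:

```javascript
// Given the previous state and the item count observed after one scroll,
// return the next state plus a stop reason (null means "keep scrolling").
function stepScrollState(state, observedCount, { maxScrolls = 80, patience = 3 } = {}) {
  const grew = observedCount > state.itemCount;
  const next = {
    itemCount: grew ? observedCount : state.itemCount,
    noGrowthStreak: grew ? 0 : state.noGrowthStreak + 1,
    scrolls: state.scrolls + 1,
  };
  let stopReason = null;
  if (next.noGrowthStreak >= patience) stopReason = 'no-growth';
  else if (next.scrolls >= maxScrolls) stopReason = 'max-scrolls';
  return { ...next, stopReason };
}

// Simulated run: the page loads 20 items, then stops growing.
let state = { itemCount: 0, noGrowthStreak: 0, scrolls: 0 };
for (const observed of [20, 20, 20, 20]) {
  state = stepScrollState(state, observed);
  if (state.stopReason) break;
}
console.log(state);
```

A `max-scrolls` stop is worth surfacing in the delivery notes, since it means the cap, not the page, ended the extraction.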

Pagination

Multi-page results behind "Next" buttons or numbered pages.

javascript

// Click-through pagination (navigation-aware, SPA-safe)
const allData = [];
const maxPages = 50;
let pageIndex = 0;
while (pageIndex < maxPages) {
  const pageData = await page.$$eval('.result-item', items =>
    items.map(el => ({ title: el.querySelector('h3')?.textContent?.trim() }))
  );
  allData.push(...pageData);

  const nextBtn = await page.$('a.next-page:not([disabled])');
  if (!nextBtn) break;

  const previousUrl = page.url();
  const previousFirstItem = await page
    .$eval('.result-item', el => el.textContent?.trim() || '')
    .catch(() => '');

  await nextBtn.click();

  // Post-click detection only: advance must be caused by this click
  const advanced = await Promise.any([
    page
      .waitForURL(url => url.toString() !== previousUrl, { timeout: 5000 })
      .then(() => true),
    page
      .waitForFunction(
        prev => {
          const first = document.querySelector('.result-item');
          return !!first && (first.textContent || '').trim() !== prev;
        },
        previousFirstItem,
        { timeout: 5000 }
      )
      .then(() => true),
  ]).catch(() => false);

  if (!advanced) break;

  await page.waitForLoadState('networkidle').catch(() => {});
  pageIndex += 1;
}
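The loop above bounds the page count with `maxPages`, but a slow site can still keep the script running for a long time. One way to add an overall time limit is a shared wall-clock budget. This is a sketch only: `createBudget` is a hypothetical helper, and the clock is injected so the example runs without real waiting:

```javascript
// Hypothetical helper: a wall-clock budget shared by all loops in the script.
// `now` is injectable so the helper can be exercised with a simulated clock.
function createBudget(totalMs, now = Date.now) {
  const deadline = now() + totalMs;
  return {
    remaining: () => Math.max(0, deadline - now()),
    exhausted: () => now() >= deadline,
  };
}

// Simulated clock for illustration: each tick() advances time by one second.
let fakeTime = 0;
const tick = () => { fakeTime += 1000; };
const budget = createBudget(3000, () => fakeTime);

let pagesVisited = 0;
while (pagesVisited < 50 && !budget.exhausted()) {
  tick(); // stands in for scraping one page
  pagesVisited += 1;
}
console.log(pagesVisited, budget.remaining());
```

In the real script the budget would use the default `Date.now` clock, and the pagination condition would become `pageIndex < maxPages && !budget.exhausted()`.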

Execution Chain

Step 1: Understand the target

Identify from the user request:

URL — the page to extract from
Data shape — what fields / columns are needed
Scope — single page, paginated, infinite scroll, or multi-page crawl
Output format — JSON (default), CSV, or other

Step 2: Obtain selectors and choose execution path

bash

# Try Actionbook index first
actionbook search "<site> <data-description>" --domain <domain>

# If good results (health ≥ 70%), get full selectors
actionbook get "<ID>"


Use this routing strictly:

- **Path A (default when `get` is good):** requested fields are covered by `get` selectors and quality is acceptable.
  - Start from `get` selectors and move to script draft quickly.
  - You may run lightweight mechanism probes (`browser text`, quick scroll checks) before finalizing script strategy.
  - **Do not run full fallback (`snapshot` / `screenshot`) before first draft unless probe/sample validation shows mismatch.**
  - Field mapping must default to `get` selectors and mark source as `actionbook_get`.

- **Path B (partial / unstable):** `get` exists but required fields are missing, selector resolves 0 elements, or validation fails.
  - Run targeted fallback only for failed fields/steps.

- **Path C (no usable coverage):** search/get has no usable result.
  - Run full fallback discovery.
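The routing rules above can be sketched as a small decision function. The input shape is an assumption made for illustration (a 0-100 `healthScore` and a `selectors` map from field name to selector string); the real output of `actionbook get` may be structured differently:

```javascript
// Hypothetical routing helper for Paths A / B / C.
// failedFields lists fields whose selectors resolved 0 elements or failed
// validation in a probe or sample run.
function chooseExecutionPath({ healthScore = 0, selectors = {} }, requiredFields, failedFields = []) {
  const HEALTH_THRESHOLD = 70;
  const covered = requiredFields.filter(f => Boolean(selectors[f]));

  // Path C: nothing usable from search/get, so run full fallback discovery.
  if (healthScore < HEALTH_THRESHOLD || covered.length === 0) {
    return { path: 'C', fallbackFields: [...requiredFields] };
  }

  // Path B: partial or unstable coverage, so fall back only for the gaps.
  const failed = new Set(failedFields);
  const needFallback = requiredFields.filter(f => !selectors[f] || failed.has(f));
  if (needFallback.length > 0) {
    return { path: 'B', fallbackFields: needFallback };
  }

  // Path A: every required field is covered by `get` selectors.
  return { path: 'A', fallbackFields: [] };
}

const result = { healthScore: 85, selectors: { title: 'h3', price: '.price' } };
console.log(chooseExecutionPath(result, ['title', 'price']));
console.log(chooseExecutionPath(result, ['title', 'price'], ['price']));
```

Returning the list of fields that need fallback keeps Path B targeted: only those fields go through `snapshot` rediscovery.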


Step 3: Probe page mechanisms and fallback only when needed

Path A mechanism detection timing:

Run minimal probes either before final script draft or during sample validation.
Before any probe command, ensure the correct page context is open: run `actionbook browser open "<url>"` if the current tab context is unknown or stale.
If probes or the sample run indicate a mismatch (missing rows, unstable selectors, wrong pagination behavior), escalate to Path B targeted fallback.

Fallback discovery by path:

Path B targeted fallback (only failed fields/steps):

bash

actionbook browser open "<url>"     # if not already open
actionbook browser snapshot         # focus on failed field/container mapping
actionbook browser screenshot       # optional visual confirmation for failed area

Path C full fallback (no usable coverage):

bash

actionbook browser open "<url>"
actionbook browser snapshot
actionbook browser screenshot

Mechanism probes (run when script strategy needs confirmation):

bash

# Hydration / streaming check
actionbook browser text "<container-selector>"

# Infinite scroll quick signal (explicit before/after decision)
actionbook browser eval "document.querySelectorAll('<item-selector>').length"   # before
actionbook browser click "<scroll-container-selector-or-body>"                  # focus scroll context
actionbook browser eval "const c=document.querySelector('<scroll-container-selector>') || document.scrollingElement; c.scrollBy(0, c.clientHeight || window.innerHeight);"
actionbook browser eval "document.querySelectorAll('<item-selector>').length"   # after

If the count increases, treat the page as lazy-load/infinite-scroll.


Fallback trigger conditions:
- `actionbook get` cannot map all required fields.
- `actionbook get` selectors return empty/unstable values in sample run.
- Runtime behavior conflicts with expected mechanism (e.g., virtualized container, delayed hydration).


Step 4: Generate Playwright script

Write a standalone Playwright script (`extract_<domain>_<slug>.cjs`) that:

Navigates to the target URL.
Waits for the correct readiness signal (not just `load`; see the mechanisms above).
Handles the detected mechanism (virtual scroll, pagination, etc.).
Extracts data into structured objects.
Writes output to disk (`JSON.stringify` / CSV).
Closes the browser.
Enforces guardrails (`maxPages`, `maxScrolls`, timeout budget) to avoid infinite loops.

Script template:

javascript

// extract_<domain>_<slug>.cjs
// Generated by Actionbook extract skill
// Usage: node extract_<domain>_<slug>.cjs

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('<URL>', { waitUntil: 'domcontentloaded' });

  // -- wait for readiness --
  await page.waitForSelector('<container>', { state: 'visible' });

  // -- extract --
  const data = await page.$$eval('<item-selector>', items =>
    items.map(el => ({
      // fields mapped from user request
    }))
  );

  // -- output --
  const fs = require('fs');
  fs.writeFileSync('output.json', JSON.stringify(data, null, 2));
  console.log(`Extracted ${data.length} items → output.json`);

  await browser.close();
})();
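The template writes JSON only. When the user asks for CSV, the output step needs a serializer that quotes cells correctly. A minimal sketch following RFC 4180 quoting, assuming flat records (no nested objects); `csvCell` and `toCsv` are illustrative names, not part of the skill:

```javascript
// Quote a single CSV cell: wrap it in double quotes when it contains a comma,
// a quote, or a line break, and double any embedded quotes.
function csvCell(value) {
  const s = value === null || value === undefined ? '' : String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Serialize an array of flat objects. Columns default to the keys of the
// first record; pass them explicitly to fix the order or include sparse fields.
function toCsv(records, columns = Object.keys(records[0] ?? {})) {
  const lines = [columns.map(csvCell).join(',')];
  for (const record of records) {
    lines.push(columns.map(c => csvCell(record[c])).join(','));
  }
  return lines.join('\r\n') + '\r\n';
}

const sample = [
  { title: 'Widget, large', price: '9.99' },
  { title: 'He said "hi"', price: null },
];
const csv = toCsv(sample);
console.log(csv);
// In the generated script: fs.writeFileSync('output.csv', csv);
```

Hand-rolling the serializer keeps the generated script dependency-free beyond `playwright`, which matches the requirement that it run standalone.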

Step 5: Execute and validate

Run the script to confirm it works:

bash

node extract_<domain>_<slug>.cjs

Validation rules:

Check	Pass condition
Script exits 0	No runtime errors
Output file exists	Non-empty file written
Record count > 0	At least one item extracted
No null/empty fields	Every declared field has a value in ≥ 90% of records
Data matches page	Spot-check first and last record against `actionbook browser text`

If validation fails, inspect the output, adjust selectors or wait strategy, and re-run.
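The record-count and field-completeness checks in the table can be automated against the written output. A sketch under the assumption that the output is a JSON array of flat objects; `validateExtraction` and its report shape are illustrative:

```javascript
// A field counts as filled when it is not null/undefined and not blank.
function isFilled(value) {
  return value !== null && value !== undefined && String(value).trim() !== '';
}

// Check "record count > 0" and "every declared field has a value in at
// least minFillRatio of records" (90% by default, as in the table above).
function validateExtraction(records, fields, minFillRatio = 0.9) {
  if (!Array.isArray(records) || records.length === 0) {
    return { ok: false, problems: ['no records extracted'] };
  }
  const problems = [];
  for (const field of fields) {
    const filled = records.filter(r => isFilled(r[field])).length;
    if (filled / records.length < minFillRatio) {
      problems.push(`${field}: filled in ${filled}/${records.length} records`);
    }
  }
  return { ok: problems.length === 0, problems };
}

// Ten sample records, two of which are missing a price.
const records = Array.from({ length: 10 }, (_, i) => ({
  title: `Item ${i}`,
  price: i < 8 ? '9.99' : '',
}));
const report = validateExtraction(records, ['title', 'price']);
console.log(report);
```

The spot-check against `actionbook browser text` still has to be done by hand; this only covers the checks that can be read off the output file.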

Step 6: Deliver

Present to the user:

Script path: the `.cjs` file they can re-run anytime.
Data path: the output JSON/CSV file.
Record count: how many items were extracted.
Notes: any mechanism-specific caveats (e.g., "this site uses infinite scroll; the script scrolls up to 50 pages by default").

Output Contract

Every `extract` invocation produces:

Artifact	Path	Format
Playwright script	`./extract_<domain>_<slug>.cjs`	Standalone Node.js script using `playwright`
Extracted data	`./output.json` (default) or user-specified path	JSON array of objects (default), CSV, or user-specified

The script must be re-runnable — a user should be able to execute it later without Actionbook installed, as long as Node.js + Playwright are available in the runtime environment.

Selector Priority

When multiple selector types are available from `actionbook get`, prefer them in this order:

Priority	Type	Reason
1	`data-testid`	Stable, test-oriented, rarely changes
2	`aria-label`	Accessibility-driven, semantically meaningful
3	CSS selector	Structural, may break on redesign
4	XPath	Last resort, most brittle
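The priority table can be applied mechanically when a field comes with several candidate selectors. A sketch assuming each candidate is an object with a `type` and a `value`; that shape, and the `pickSelector` name, are assumptions rather than the documented `actionbook get` format:

```javascript
// Lower rank means preferred, mirroring the table above.
const SELECTOR_RANK = new Map([
  ['data-testid', 1],
  ['aria-label', 2],
  ['css', 3],
  ['xpath', 4],
]);

// Return the highest-priority candidate, or null when none has a known type.
function pickSelector(candidates) {
  const known = candidates.filter(c => SELECTOR_RANK.has(c.type));
  if (known.length === 0) return null;
  return known.reduce((best, c) =>
    SELECTOR_RANK.get(c.type) < SELECTOR_RANK.get(best.type) ? c : best
  );
}

const chosen = pickSelector([
  { type: 'xpath', value: '//div[3]/span' },
  { type: 'css', value: '.price' },
  { type: 'aria-label', value: '[aria-label="Price"]' },
]);
console.log(chosen);
```

Ties keep the first candidate listed, so the input order acts as a secondary preference.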

Error Handling

Error	Action
`actionbook search` returns no results	Fall back to `snapshot` + `screenshot`
Selector returns 0 elements	Re-snapshot, compare with screenshot, update selector
Script times out	Raise the relevant wait `timeout` or add a longer `waitForTimeout`; check for anti-bot measures
Partial data (some fields empty)	Check if content is lazy-loaded; add scroll/wait
Anti-bot / CAPTCHA	Inform user; suggest running with `headless: false` or using their own browser session via `actionbook setup` extension mode
