extract

When to Use This Skill

Activate when the user wants to obtain data from a website:
  • "Extract all product prices from this page"
  • "Scrape the table of results from ..."
  • "Pull the list of authors and titles from arXiv search results"
  • "Collect all job listings from this page"
  • "Get the data from this dashboard table"
  • "Harvest review scores from ..."
  • "Download all the links/images/cards from ..."
The deliverable is always two artifacts:
  1. Executable Playwright script: a standalone `.cjs` file that reproduces the extraction without Actionbook at runtime.
  2. Extracted data: JSON (default), CSV, or a user-specified format, written to disk.

Decision Strategy

Use Actionbook as a conditional accelerator, not a mandatory step. The goal is reliable selectors in the shortest path.
User request
  ├─► actionbook search "<site> <intent>"
  │     ├─ Results with Health Score ≥ 70%  ──► actionbook get "<ID>" ──► use selectors
  │     └─ No results / low score  ──► Fallback
  └─► Fallback: actionbook browser open <url>
        ├─ actionbook browser snapshot   (accessibility tree → find selectors)
        ├─ actionbook browser screenshot (visual confirmation)
        └─ manual selector discovery via DOM inspection
Priority order for selector sources:
| Priority | Source | When |
| --- | --- | --- |
| 1 | `actionbook get` | Site is indexed, health score ≥ 70% |
| 2 | `actionbook browser snapshot` | Not indexed or selectors outdated |
| 3 | DOM inspection via screenshot + snapshot | Complex SPA / dynamic content |
Non-negotiable rule: if `search + get` already provide usable selectors for required fields, start from `get` selectors and do not jump to full fallback (`snapshot` / `screenshot`) by default. Exception: lightweight mechanism probes (for hydration/virtualization/pagination) are allowed when runtime behavior may affect script correctness. Escalate to `snapshot` / `screenshot` only when probes/sample validation indicate selector gaps or instability.
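The first branch of the decision tree can be sketched as a small filter over search results. The result shape used here (`id`, `healthScore` as a percentage) and the tie-breaking choice of preferring the healthiest entry are assumptions for illustration; the real `actionbook search` output may look different.

```javascript
// Hypothetical result shape: { id: string, healthScore: number (percent) }.
const MIN_HEALTH = 70; // threshold taken from the decision tree

function pickIndexedAction(results, minHealth = MIN_HEALTH) {
  const usable = (results || []).filter(r => r.healthScore >= minHealth);
  if (usable.length === 0) return null; // no results / low score: use fallback
  // Assumption: when several entries qualify, prefer the healthiest one.
  return usable.reduce((best, r) => (r.healthScore > best.healthScore ? r : best));
}
```

A `null` return corresponds to the "Fallback" branch; any other return value is the entry whose ID would be passed to `actionbook get`.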

Mechanism-Aware Script Strategy

Websites use patterns that break naive scraping. The generated Playwright script must account for these:

Streaming / SSR / RSC hydration

Pages may render a shell first, then stream or hydrate content.
javascript
// Wait for hydration to complete — not just DOMContentLoaded
await page.waitForSelector('[data-item]', { state: 'attached' });
await page.waitForFunction(() => {
  const items = document.querySelectorAll('[data-item]');
  return items.length > 0 && !document.querySelector('[data-pending]');
});
Detection cues: React root with `data-reactroot`, Next.js `__NEXT_DATA__`, empty containers that fill after JS runs. If `actionbook browser text "<selector>"` returns empty but the screenshot shows content, hydration hasn't completed.

Virtualized lists / virtual DOM

Only visible rows exist in the DOM. Scrolling renders new rows and destroys old ones.
javascript
// Scroll-and-collect loop for virtualized lists (scroll container aware)
const allItems = [];
const maxScrolls = 50;
let scrolls = 0;

const container = await page.$('<scroll-container-selector>');
if (!container) throw new Error('Scroll container not found');

let previousTop = await container.evaluate(el => el.scrollTop);
while (scrolls < maxScrolls) {
  const items = await page.$$eval('[data-row]', rows =>
    rows.map(r => ({ text: r.textContent.trim() }))
  );
  for (const item of items) {
    if (!allItems.find(i => i.text === item.text)) allItems.push(item);
  }

  await container.evaluate(el => el.scrollBy(0, 600));
  await page.waitForTimeout(300);

  const currentTop = await container.evaluate(el => el.scrollTop);
  if (currentTop === previousTop) break;

  previousTop = currentTop;
  scrolls += 1;
}
Detection cues: Container has fixed height with `overflow: auto/scroll`, row count in DOM is much smaller than stated total, rows have `transform: translateY(...)` or `position: absolute; top: ...px`.
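Deduplicating by row text can silently drop distinct rows that happen to share the same text, and a linear `find` gets slow on long lists. One possible alternative, sketched below, keeps a `Map` keyed by a stable row identifier. The `keyOf` callback and the `id` field are hypothetical; a real page might expose something like a row index or a data attribute instead.

```javascript
// Merge a freshly scraped batch into a Map keyed by a stable row identifier.
// keyOf is a hypothetical callback; it falls back to row text when omitted.
function mergeUnique(seen, batch, keyOf = row => row.text) {
  let added = 0;
  for (const row of batch) {
    const key = keyOf(row);
    if (!seen.has(key)) {
      seen.set(key, row);
      added += 1;
    }
  }
  return added; // number of rows not seen at earlier scroll positions
}

// Two overlapping scroll positions; rows 1 and 3 share text but differ by id.
const seen = new Map();
mergeUnique(seen, [{ id: 1, text: 'A' }, { id: 2, text: 'B' }], r => r.id);
const added = mergeUnique(seen, [{ id: 2, text: 'B' }, { id: 3, text: 'A' }], r => r.id);
console.log(added, seen.size); // 1 3
```

The return value can also serve as a stop signal: several consecutive scrolls that add zero rows suggest the end of the list has been reached.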

Infinite scroll / lazy loading

New content appends when the user scrolls near the bottom.
javascript
// Scroll to bottom until no new content loads (with no-growth tolerance)
let itemCount = 0;
let noGrowthStreak = 0;
const maxScrolls = 80;
let scrolls = 0;

while (scrolls < maxScrolls && noGrowthStreak < 3) {
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(1200);

  const newCount = await page.$$eval('.item', els => els.length);
  if (newCount > itemCount) {
    itemCount = newCount;
    noGrowthStreak = 0;
  } else {
    noGrowthStreak += 1;
  }

  scrolls += 1;
}
Detection cues: Intersection Observer in page JS, "Load more" button, sentinel element at bottom, network requests firing on scroll.

Pagination

Multi-page results behind "Next" buttons or numbered pages.
javascript
// Click-through pagination (navigation-aware, SPA-safe)
const allData = [];
const maxPages = 50;
let pageIndex = 0;
while (pageIndex < maxPages) {
  const pageData = await page.$$eval('.result-item', items =>
    items.map(el => ({ title: el.querySelector('h3')?.textContent?.trim() }))
  );
  allData.push(...pageData);

  const nextBtn = await page.$('a.next-page:not([disabled])');
  if (!nextBtn) break;

  const previousUrl = page.url();
  const previousFirstItem = await page
    .$eval('.result-item', el => el.textContent?.trim() || '')
    .catch(() => '');

  await nextBtn.click();

  // Post-click detection only: advance must be caused by this click
  const advanced = await Promise.any([
    page
      .waitForURL(url => url.toString() !== previousUrl, { timeout: 5000 })
      .then(() => true),
    page
      .waitForFunction(
        prev => {
          const first = document.querySelector('.result-item');
          return !!first && (first.textContent || '').trim() !== prev;
        },
        previousFirstItem,
        { timeout: 5000 }
      )
      .then(() => true),
  ]).catch(() => false);

  if (!advanced) break;

  await page.waitForLoadState('networkidle').catch(() => {});
  pageIndex += 1;
}

Execution Chain

Step 1: Understand the target

Identify from the user request:
  • URL — the page to extract from
  • Data shape — what fields / columns are needed
  • Scope — single page, paginated, infinite scroll, or multi-page crawl
  • Output format — JSON (default), CSV, or other
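The four items above can be captured in a small request object before any selector work starts. This is only a sketch: the function name, the scope labels, and the object layout are hypothetical and not part of Actionbook.

```javascript
// Hypothetical scope labels mirroring the list above.
const SCOPES = ['single', 'paginated', 'infinite-scroll', 'crawl'];

function normalizeRequest({ url, fields, scope = 'single', format = 'json' }) {
  const parsed = new URL(url); // throws on a malformed URL
  if (!Array.isArray(fields) || fields.length === 0) {
    throw new Error('At least one field is required');
  }
  if (!SCOPES.includes(scope)) throw new Error(`Unknown scope: ${scope}`);
  return {
    url: parsed.href,
    domain: parsed.hostname,
    fields,
    scope,
    format: String(format).toLowerCase(), // JSON is the default
  };
}
```

Failing early on a missing URL or an empty field list avoids discovering the gap only after a script has been generated.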

Step 2: Obtain selectors and choose execution path

bash
# Try Actionbook index first
actionbook search "<site> <data-description>" --domain <domain>

# If good results (health ≥ 70%), get full selectors
actionbook get "<ID>"

Use this routing strictly:

- **Path A (default when `get` is good):** requested fields are covered by `get` selectors and quality is acceptable.
  - Start from `get` selectors and move to script draft quickly.
  - You may run lightweight mechanism probes (`browser text`, quick scroll checks) before finalizing script strategy.
  - **Do not run full fallback (`snapshot` / `screenshot`) before first draft unless probe/sample validation shows mismatch.**
  - Field mapping must default to `get` selectors and mark source as `actionbook_get`.

- **Path B (partial / unstable):** `get` exists but required fields are missing, selector resolves 0 elements, or validation fails.
  - Run targeted fallback only for failed fields/steps.

- **Path C (no usable coverage):** search/get has no usable result.
  - Run full fallback discovery.
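The routing above can be expressed as a pure function, which makes the Path A/B/C decision easy to reason about. The inputs are hypothetical: `getSelectors` is assumed to map field names to selectors (or to be `null` when search/get returned nothing usable), and `failedFields` lists fields whose selectors resolved 0 elements or failed sample validation.

```javascript
// Returns the path to take and the fields that still need fallback discovery.
function choosePath(requiredFields, getSelectors, failedFields = []) {
  const selectors = getSelectors || {};
  const missing = requiredFields.filter(f => !selectors[f]);

  // Path C: nothing usable came back for any required field.
  if (missing.length === requiredFields.length) {
    return { path: 'C', fallbackFields: [...requiredFields] };
  }

  // Path B: some fields are missing or failed validation; target only those.
  const fallbackFields = [...new Set([...missing, ...failedFields])];
  if (fallbackFields.length > 0) return { path: 'B', fallbackFields };

  // Path A: every required field is covered by a `get` selector.
  return { path: 'A', fallbackFields: [] };
}
```

For example, `choosePath(['title', 'price'], { title: 'h3' })` yields Path B with `price` as the only field needing targeted fallback.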

Step 3: Probe page mechanisms and fallback only when needed

Path A mechanism detection timing:
  • Run minimal probes either before final script draft or during sample validation.
  • Before any probe command, ensure the correct page context is open:
    • `actionbook browser open "<url>"` (if current tab context is unknown/stale)
  • If probes/sample run indicate mismatch (missing rows, unstable selectors, wrong pagination behavior), escalate to Path B targeted fallback.
Fallback discovery by path:
Path B targeted fallback (only failed fields/steps):
bash
actionbook browser open "<url>"     # if not already open
actionbook browser snapshot          # focus on failed field/container mapping
actionbook browser screenshot # optional visual confirmation for failed area

Path C full fallback (no usable coverage):
bash
actionbook browser open "<url>"
actionbook browser snapshot
actionbook browser screenshot
Mechanism probes (run when script strategy needs confirmation):
bash
# Hydration / streaming check
actionbook browser text "<container-selector>"

# Infinite scroll quick signal (explicit before/after decision)
actionbook browser eval "document.querySelectorAll('<item-selector>').length"   # before
actionbook browser click "<scroll-container-selector-or-body>"                  # focus scroll context
actionbook browser eval "const c=document.querySelector('<scroll-container-selector>') || document.scrollingElement; c.scrollBy(0, c.clientHeight || window.innerHeight);"
actionbook browser eval "document.querySelectorAll('<item-selector>').length"   # after

# If count increases, treat page as lazy-load/infinite-scroll.


Fallback trigger conditions:
- `actionbook get` cannot map all required fields.
- `actionbook get` selectors return empty/unstable values in sample run.
- Runtime behavior conflicts with expected mechanism (e.g., virtualized container, delayed hydration).

Step 4: Generate Playwright script

Write a standalone Playwright script (`extract_<domain>_<slug>.cjs`) that:
  1. Navigates to the target URL.
  2. Waits for the correct readiness signal (not just `load`; see mechanisms above).
  3. Handles the detected mechanism (virtual scroll, pagination, etc.).
  4. Extracts data into structured objects.
  5. Writes output to disk (`JSON.stringify` / CSV).
  6. Closes the browser.
  7. Enforces guardrails (`maxPages`, `maxScrolls`, timeout budget) to avoid infinite loops.
Script template:
javascript
// extract_<domain>_<slug>.cjs
// Generated by Actionbook extract skill
// Usage: node extract_<domain>_<slug>.cjs

const fs = require('fs');
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();

    await page.goto('<URL>', { waitUntil: 'domcontentloaded' });

    // -- wait for readiness --
    await page.waitForSelector('<container>', { state: 'visible' });

    // -- extract --
    const data = await page.$$eval('<item-selector>', items =>
      items.map(el => ({
        // fields mapped from user request
      }))
    );

    // -- output --
    fs.writeFileSync('output.json', JSON.stringify(data, null, 2));
    console.log(`Extracted ${data.length} items → output.json`);
  } finally {
    // Close the browser even when extraction throws
    await browser.close();
  }
})().catch(err => {
  // Report the failure and exit non-zero so callers can detect it
  console.error(err);
  process.exitCode = 1;
});
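When the user asks for CSV instead of JSON, the output step needs a serializer. The helper below is a minimal sketch that follows the common RFC 4180 quoting rules; the function names and the `fields` parameter are illustrative, not part of the template.

```javascript
// Quote one value: wrap it in double quotes when it contains a comma, a quote,
// or a line break, and double any embedded quotes.
function csvCell(value) {
  const s = value === null || value === undefined ? '' : String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// `fields` fixes the column order; records missing a field get an empty cell.
function toCsv(records, fields) {
  const header = fields.map(csvCell).join(',');
  const rows = records.map(r => fields.map(f => csvCell(r[f])).join(','));
  return [header, ...rows].join('\r\n') + '\r\n';
}
```

In the output step this could replace the JSON write with something like `fs.writeFileSync('output.csv', toCsv(data, ['title', 'price']))`, where the file name and field list are placeholders.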

Step 5: Execute and validate

Run the script to confirm it works:
bash
node extract_<domain>_<slug>.cjs
Validation rules:
| Check | Pass condition |
| --- | --- |
| Script exits 0 | No runtime errors |
| Output file exists | Non-empty file written |
| Record count > 0 | At least one item extracted |
| No null/empty fields | Every declared field has a value in ≥ 90% of records |
| Data matches page | Spot-check first and last record against `actionbook browser text` |
If validation fails, inspect the output, adjust selectors or wait strategy, and re-run.
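The record-count and field-coverage checks can be automated against the parsed output. The sketch below assumes the output is an array of flat objects; the function names and the report shape are hypothetical.

```javascript
// A value counts as present when it is not null/undefined and not blank.
function isPresent(v) {
  return v !== null && v !== undefined && String(v).trim() !== '';
}

// Checks "record count > 0" and "each field filled in >= 90% of records".
function validateRecords(records, fields, minFillRate = 0.9) {
  if (!Array.isArray(records) || records.length === 0) {
    return { ok: false, problems: ['no records extracted'], fillRates: {} };
  }
  const problems = [];
  const fillRates = {};
  for (const field of fields) {
    const filled = records.filter(r => isPresent(r[field])).length;
    fillRates[field] = filled / records.length;
    if (fillRates[field] < minFillRate) {
      problems.push(`${field}: filled in ${filled}/${records.length} records`);
    }
  }
  return { ok: problems.length === 0, problems, fillRates };
}
```

A field that falls below the threshold is a hint to revisit its selector or the wait strategy, as described above. The spot-check against the live page still has to be done separately.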

Step 6: Deliver

Present to the user:
  1. Script path: the `.cjs` file they can re-run anytime.
  2. Data path: the output JSON/CSV file.
  3. Record count: how many items were extracted.
  4. Notes: any mechanism-specific caveats (e.g., "this site uses infinite scroll; the script scrolls up to 50 pages by default").

Output Contract

Every `extract` invocation produces:
| Artifact | Path | Format |
| --- | --- | --- |
| Playwright script | `./extract_<domain>_<slug>.cjs` | Standalone Node.js script using `playwright` |
| Extracted data | `./output.json` (default) or user-specified path | JSON array of objects (default), CSV, or user-specified |
The script must be re-runnable — a user should be able to execute it later without Actionbook installed, as long as Node.js + Playwright are available in the runtime environment.
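The script name follows the `extract_<domain>_<slug>.cjs` pattern. How `<domain>` and `<slug>` are derived is not specified here, so the sketch below makes an assumption: the hostname without a leading `www.`, and a lowercased short description, both reduced to letters, digits, and underscores.

```javascript
// Reduce arbitrary text to a lowercase, underscore-separated token.
function slugify(text) {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, '_').replace(/^_+|_+$/g, '');
}

// Assumed derivation: <domain> from the URL hostname, <slug> from a description.
function scriptFileName(url, description) {
  const host = new URL(url).hostname.replace(/^www\./, '');
  return `extract_${slugify(host)}_${slugify(description)}.cjs`;
}

console.log(scriptFileName('https://www.arxiv.org/search/?query=llm', 'Authors and titles'));
// extract_arxiv_org_authors_and_titles.cjs
```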

Selector Priority

When multiple selector types are available from `actionbook get`:
| Priority | Type | Reason |
| --- | --- | --- |
| 1 | `data-testid` | Stable, test-oriented, rarely changes |
| 2 | `aria-label` | Accessibility-driven, semantically meaningful |
| 3 | CSS selector | Structural, may break on redesign |
| 4 | XPath | Last resort, most brittle |
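This priority order can be applied mechanically when a field comes back with several candidate selectors. The candidate shape `{ type, value }` below is hypothetical and is not the actual `actionbook get` output format.

```javascript
// Lower rank wins; the order follows the selector priority list.
const SELECTOR_RANK = { 'data-testid': 1, 'aria-label': 2, css: 3, xpath: 4 };

function pickSelector(candidates) {
  // Ignore selector types that are not part of the priority list.
  const known = candidates.filter(c =>
    Object.prototype.hasOwnProperty.call(SELECTOR_RANK, c.type)
  );
  if (known.length === 0) return null;
  return known.reduce((best, c) =>
    SELECTOR_RANK[c.type] < SELECTOR_RANK[best.type] ? c : best
  );
}
```

Given an XPath, a CSS selector, and an `aria-label` candidate for the same field, this returns the `aria-label` one.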

Error Handling

| Error | Action |
| --- | --- |
| `actionbook search` returns no results | Fall back to `snapshot` + `screenshot` |
| Selector returns 0 elements | Re-snapshot, compare with screenshot, update selector |
| Script times out | Add longer `waitForTimeout`, check for anti-bot measures |
| Partial data (some fields empty) | Check if content is lazy-loaded; add scroll/wait |
| Anti-bot / CAPTCHA | Inform user; suggest running with `headless: false` or using their own browser session via `actionbook setup` extension mode |