extract
When to Use This Skill
Activate when the user wants to obtain data from a website:
- "Extract all product prices from this page"
- "Scrape the table of results from ..."
- "Pull the list of authors and titles from arXiv search results"
- "Collect all job listings from this page"
- "Get the data from this dashboard table"
- "Harvest review scores from ..."
- "Download all the links/images/cards from ..."
The deliverable is always two artifacts:
- Executable Playwright script: a standalone `.cjs` file that reproduces the extraction without Actionbook at runtime.
- Extracted data: JSON (default), CSV, or a user-specified format, written to disk.
Decision Strategy
Use Actionbook as a conditional accelerator, not a mandatory step. The goal is reliable selectors in the shortest path.
```
User request
│
├─► actionbook search "<site> <intent>"
│     ├─ Results with Health Score ≥ 70% ──► actionbook get "<ID>" ──► use selectors
│     └─ No results / low score ──► Fallback
│
└─► Fallback: actionbook browser open <url>
      ├─ actionbook browser snapshot   (accessibility tree → find selectors)
      ├─ actionbook browser screenshot (visual confirmation)
      └─ manual selector discovery via DOM inspection
```
Priority order for selector sources:
| Priority | Source | When |
|---|---|---|
| 1 | `actionbook search` + `actionbook get` | Site is indexed, health score ≥ 70% |
| 2 | `actionbook browser open` + `snapshot` | Not indexed or selectors outdated |
| 3 | DOM inspection via screenshot + snapshot | Complex SPA / dynamic content |
Non-negotiable rule: if `search + get` already provides usable selectors for required fields, start from `get` selectors and do not jump to full fallback (`snapshot`/`screenshot`) by default. Exception: lightweight mechanism probes (for hydration/virtualization/pagination) are allowed when runtime behavior may affect script correctness. Escalate to `snapshot`/`screenshot` only when probes or sample validation indicate selector gaps or instability.
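The health-score routing described above can be sketched as a small pure function. This is only an illustration under assumptions: the result shape (`id`, `healthScore` as a 0-100 number) and the name `chooseSelectorSource` are hypothetical and do not describe Actionbook's actual output format.

```javascript
// Hypothetical sketch: pick a selector source from Actionbook search results.
// Assumption: each result looks like { id: string, healthScore: number (0-100) }.
const HEALTH_THRESHOLD = 70;

function chooseSelectorSource(results, threshold = HEALTH_THRESHOLD) {
  // Keep only results that meet the health threshold, best score first.
  const usable = (results || [])
    .filter(r => typeof r.healthScore === 'number' && r.healthScore >= threshold)
    .sort((a, b) => b.healthScore - a.healthScore);

  if (usable.length > 0) {
    // Indexed site with a healthy score: start from `actionbook get` selectors.
    return { source: 'actionbook_get', id: usable[0].id };
  }
  // No results or low score: use the browser snapshot/screenshot fallback.
  return { source: 'fallback', id: null };
}

console.log(chooseSelectorSource([{ id: 'a', healthScore: 55 }, { id: 'b', healthScore: 82 }]));
console.log(chooseSelectorSource([]));
```

Keeping the threshold as a parameter makes it easy to tighten or relax the cut-off without touching the routing logic.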
Mechanism-Aware Script Strategy
Websites use patterns that break naive scraping. The generated Playwright script must account for these:
Streaming / SSR / RSC hydration
Pages may render a shell first, then stream or hydrate content.
```javascript
// Wait for hydration to complete, not just DOMContentLoaded
await page.waitForSelector('[data-item]', { state: 'attached' });
await page.waitForFunction(() => {
  const items = document.querySelectorAll('[data-item]');
  return items.length > 0 && !document.querySelector('[data-pending]');
});
```
Detection cues: React root with `data-reactroot`, Next.js `__NEXT_DATA__`, empty containers that fill after JS runs. If `actionbook browser text "<selector>"` returns empty but the screenshot shows content, hydration hasn't completed.
Virtualized lists / virtual DOM
Only visible rows exist in the DOM. Scrolling renders new rows and destroys old ones.
```javascript
// Scroll-and-collect loop for virtualized lists (scroll container aware)
const allItems = [];
const maxScrolls = 50;
let scrolls = 0;
const container = await page.$('<scroll-container-selector>');
if (!container) throw new Error('Scroll container not found');
let previousTop = await container.evaluate(el => el.scrollTop);
while (scrolls < maxScrolls) {
  // Collect the rows currently rendered in the DOM
  const items = await page.$$eval('[data-row]', rows =>
    rows.map(r => ({ text: r.textContent.trim() }))
  );
  for (const item of items) {
    if (!allItems.find(i => i.text === item.text)) allItems.push(item);
  }
  await container.evaluate(el => el.scrollBy(0, 600));
  await page.waitForTimeout(300);
  // Stop once the container no longer scrolls (end of list reached)
  const currentTop = await container.evaluate(el => el.scrollTop);
  if (currentTop === previousTop) break;
  previousTop = currentTop;
  scrolls += 1;
}
```
Detection cues: Container has fixed height with `overflow: auto/scroll`, row count in DOM is much smaller than stated total, rows have `transform: translateY(...)` or `position: absolute; top: ...px`.
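Deduplicating by text, as the scroll-and-collect loop does, also drops rows that legitimately share the same text. Where rows expose a stable identifier, a keyed merge is one possible alternative. A minimal sketch, assuming each collected row carries a `key` field (for example, read from a row index attribute; that attribute is an assumption and varies by site) and using a hypothetical helper named `mergeRows`:

```javascript
// Hypothetical helper: merge batches of rows collected while scrolling a virtualized list.
// Assumption: each row is { key, ...fields } where `key` is stable per logical row.
function mergeRows(seen, batch) {
  let added = 0;
  for (const row of batch) {
    if (row.key == null) continue; // skip rows without a usable key
    if (!seen.has(row.key)) {
      seen.set(row.key, row);      // Map keeps first-seen insertion order
      added += 1;
    }
  }
  return added;                    // number of new rows contributed by this batch
}

const seen = new Map();
mergeRows(seen, [{ key: '0', text: 'A' }, { key: '1', text: 'A' }]);
mergeRows(seen, [{ key: '1', text: 'A' }, { key: '2', text: 'B' }]);
console.log([...seen.values()].length); // 3
```

The return value doubles as a growth signal: a run of batches that add zero rows suggests the end of the list has been reached.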
Infinite scroll / lazy loading
New content appends when the user scrolls near the bottom.
```javascript
// Scroll to bottom until no new content loads (with no-growth tolerance)
let itemCount = 0;
let noGrowthStreak = 0;
const maxScrolls = 80;
let scrolls = 0;
while (scrolls < maxScrolls && noGrowthStreak < 3) {
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(1200);
  const newCount = await page.$$eval('.item', els => els.length);
  if (newCount > itemCount) {
    itemCount = newCount;
    noGrowthStreak = 0;
  } else {
    // Tolerate a few idle rounds before concluding the list is exhausted
    noGrowthStreak += 1;
  }
  scrolls += 1;
}
```
Detection cues: Intersection Observer in page JS, "Load more" button, sentinel element at bottom, network requests firing on scroll.
Pagination
Multi-page results behind "Next" buttons or numbered pages.
```javascript
// Click-through pagination (navigation-aware, SPA-safe)
const allData = [];
const maxPages = 50;
let pageIndex = 0;
while (pageIndex < maxPages) {
  const pageData = await page.$$eval('.result-item', items =>
    items.map(el => ({ title: el.querySelector('h3')?.textContent?.trim() }))
  );
  allData.push(...pageData);
  const nextBtn = await page.$('a.next-page:not([disabled])');
  if (!nextBtn) break;
  const previousUrl = page.url();
  const previousFirstItem = await page
    .$eval('.result-item', el => el.textContent?.trim() || '')
    .catch(() => '');
  await nextBtn.click();
  // Post-click detection only: advance must be caused by this click
  const advanced = await Promise.any([
    page
      .waitForURL(url => url.toString() !== previousUrl, { timeout: 5000 })
      .then(() => true),
    page
      .waitForFunction(
        prev => {
          const first = document.querySelector('.result-item');
          return !!first && (first.textContent || '').trim() !== prev;
        },
        previousFirstItem,
        { timeout: 5000 }
      )
      .then(() => true),
  ]).catch(() => false);
  if (!advanced) break;
  await page.waitForLoadState('networkidle').catch(() => {});
  pageIndex += 1;
}
```
Execution Chain
Step 1: Understand the target
Identify from the user request:
- URL — the page to extract from
- Data shape — what fields / columns are needed
- Scope — single page, paginated, infinite scroll, or multi-page crawl
- Output format — JSON (default), CSV, or other
Step 2: Obtain selectors and choose execution path
```bash
# Try Actionbook index first
actionbook search "<site> <data-description>" --domain <domain>
# If good results (health ≥ 70%), get full selectors
actionbook get "<ID>"
```
Use this routing strictly:
- **Path A (default when `get` is good):** requested fields are covered by `get` selectors and quality is acceptable.
  - Start from `get` selectors and move to script draft quickly.
  - You may run lightweight mechanism probes (`browser text`, quick scroll checks) before finalizing script strategy.
  - **Do not run full fallback (`snapshot` / `screenshot`) before first draft unless probe/sample validation shows mismatch.**
  - Field mapping must default to `get` selectors and mark source as `actionbook_get`.
- **Path B (partial / unstable):** `get` exists but required fields are missing, selector resolves 0 elements, or validation fails.
  - Run targeted fallback only for failed fields/steps.
- **Path C (no usable coverage):** search/get has no usable result.
  - Run full fallback discovery.
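The Path A / B / C routing can be expressed as a small decision function. The sketch below is illustrative only: `chooseExecutionPath`, the `getSelectors` map (field name to selector string), and the `failedFields` list are hypothetical names, not part of Actionbook.

```javascript
// Hypothetical sketch of the Path A / B / C routing.
// Assumptions:
//   requiredFields - field names requested by the user
//   getSelectors   - { fieldName: selector } from `actionbook get`, or null if nothing usable
//   failedFields   - fields whose selectors resolved 0 elements or failed validation
function chooseExecutionPath(requiredFields, getSelectors, failedFields = []) {
  if (!getSelectors || Object.keys(getSelectors).length === 0) {
    // Path C: no usable coverage, run full fallback discovery for every field.
    return { path: 'C', fallbackFields: [...requiredFields] };
  }
  const missing = requiredFields.filter(f => !getSelectors[f]);
  const failed = failedFields.filter(f => requiredFields.includes(f));
  const needsFallback = [...new Set([...missing, ...failed])];
  if (needsFallback.length > 0) {
    // Path B: targeted fallback only for the missing or failed fields.
    return { path: 'B', fallbackFields: needsFallback };
  }
  // Path A: every required field is covered; start from `get` selectors.
  return { path: 'A', fallbackFields: [] };
}

console.log(chooseExecutionPath(['title', 'price'], { title: 'h3', price: '.price' }));
console.log(chooseExecutionPath(['title', 'price'], { title: 'h3' }));
console.log(chooseExecutionPath(['title', 'price'], null));
```

Returning the list of fields that need fallback keeps Path B targeted: only those fields go through snapshot-based discovery.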
Step 3: Probe page mechanisms and fallback only when needed
Path A mechanism detection timing:
- Run minimal probes either before final script draft or during sample validation.
- Before any probe command, ensure the correct page context is open:
  - `actionbook browser open "<url>"` (if current tab context is unknown/stale)
- If probes/sample run indicate mismatch (missing rows, unstable selectors, wrong pagination behavior), escalate to Path B targeted fallback.
Fallback discovery by path:
**Path B targeted fallback (only failed fields/steps):**
```bash
actionbook browser open "<url>"   # if not already open
actionbook browser snapshot       # focus on failed field/container mapping
actionbook browser screenshot     # optional visual confirmation for failed area
```
**Path C full fallback (no usable coverage):**
```bash
actionbook browser open "<url>"
actionbook browser snapshot
actionbook browser screenshot
```
Mechanism probes (run when script strategy needs confirmation):
```bash
# Hydration / streaming check
actionbook browser text "<container-selector>"
# Infinite scroll quick signal (explicit before/after decision)
actionbook browser eval "document.querySelectorAll('<item-selector>').length" # before
actionbook browser click "<scroll-container-selector-or-body>" # focus scroll context
actionbook browser eval "const c=document.querySelector('<scroll-container-selector>') || document.scrollingElement; c.scrollBy(0, c.clientHeight || window.innerHeight);"
actionbook browser eval "document.querySelectorAll('<item-selector>').length" # after
```
If count increases, treat page as lazy-load/infinite-scroll.
Fallback trigger conditions:
- `actionbook get` cannot map all required fields.
- `actionbook get` selectors return empty/unstable values in sample run.
- Runtime behavior conflicts with expected mechanism (e.g., virtualized container, delayed hydration).
Step 4: Generate Playwright script
Write a standalone Playwright script (`extract_<domain>_<slug>.cjs`) that:
- Navigates to the target URL.
- Waits for the correct readiness signal (not just `load`; see mechanisms above).
- Handles the detected mechanism (virtual scroll, pagination, etc.).
- Extracts data into structured objects.
- Writes output to disk (`JSON.stringify` / CSV).
- Closes the browser.
- Enforces guardrails (`maxPages`, `maxScrolls`, timeout budget) to avoid infinite loops.
Script template:
```javascript
// extract_<domain>_<slug>.cjs
// Generated by Actionbook extract skill
// Usage: node extract_<domain>_<slug>.cjs
const fs = require('fs');
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto('<URL>', { waitUntil: 'domcontentloaded' });

    // -- wait for readiness --
    await page.waitForSelector('<container>', { state: 'visible' });

    // -- extract --
    const data = await page.$$eval('<item-selector>', items =>
      items.map(el => ({
        // fields mapped from user request
      }))
    );

    // -- output --
    fs.writeFileSync('output.json', JSON.stringify(data, null, 2));
    console.log(`Extracted ${data.length} items → output.json`);
  } finally {
    // Close the browser even if extraction throws
    await browser.close();
  }
})().catch(err => {
  // Surface the failure and exit non-zero so validation can detect it
  console.error(err);
  process.exit(1);
});
```
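The template writes JSON only, while CSV is also listed as an output format. A minimal sketch of a CSV serializer that could replace the `JSON.stringify` step, assuming flat records (no nested objects); the helper name `toCsv` and the file name `output.csv` mentioned below are hypothetical:

```javascript
// Hypothetical helper: serialize an array of flat objects to CSV text.
// Assumption: columns come from the first record unless passed explicitly.
function toCsv(records, columns = Object.keys(records[0] || {})) {
  const escape = value => {
    const s = value == null ? '' : String(value);
    // Quote fields containing commas, quotes, or line breaks; double embedded quotes.
    return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  const lines = [columns.map(escape).join(',')];
  for (const record of records) {
    lines.push(columns.map(col => escape(record[col])).join(','));
  }
  return lines.join('\n') + '\n';
}

const csv = toCsv([
  { title: 'Widget, large', price: '9.99' },
  { title: 'He said "hi"', price: null },
]);
console.log(csv);
```

In a generated script this would be written with something like `fs.writeFileSync('output.csv', toCsv(data))`. Passing `columns` explicitly keeps the column order stable when the first record is missing a field.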
Step 5: Execute and validate
Run the script to confirm it works:
```bash
node extract_<domain>_<slug>.cjs
```
Validation rules:
| Check | Pass condition |
|---|---|
| Script exits 0 | No runtime errors |
| Output file exists | Non-empty file written |
| Record count > 0 | At least one item extracted |
| No null/empty fields | Every declared field has a value in ≥ 90% of records |
| Data matches page | Spot-check first and last record against the live page |
If validation fails, inspect the output, adjust selectors or wait strategy, and re-run.
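The record-count and fill-rate checks are easy to automate. The following is a minimal sketch, assuming the output has already been parsed into an array and that a value counts as empty when it is null, undefined, or whitespace only; `validateRecords` is a hypothetical helper name.

```javascript
// Hypothetical validator for the record-count and fill-rate checks.
// Assumptions: `records` is the parsed output array; `fields` are the declared field names.
function validateRecords(records, fields, minFillRate = 0.9) {
  if (!Array.isArray(records) || records.length === 0) {
    return { ok: false, problems: ['no records extracted'] };
  }
  const problems = [];
  for (const field of fields) {
    // Count records where this field holds a non-empty value.
    const filled = records.filter(r => {
      const v = r[field];
      return v != null && String(v).trim() !== '';
    }).length;
    const rate = filled / records.length;
    if (rate < minFillRate) {
      problems.push(`${field}: filled in ${(rate * 100).toFixed(0)}% of records`);
    }
  }
  return { ok: problems.length === 0, problems };
}

const report = validateRecords(
  [{ title: 'A', price: '1' }, { title: 'B', price: '' }],
  ['title', 'price']
);
console.log(report);
```

The spot-check against the page still needs a human or a second extraction pass; this helper only covers the checks that can be computed from the output file alone.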
Step 6: Deliver
Present to the user:
- Script path: the `.cjs` file they can re-run anytime.
- Data path: the output JSON/CSV file.
- Record count: how many items were extracted.
- Notes: any mechanism-specific caveats (e.g., "this site uses infinite scroll; the script scrolls up to 50 pages by default").
Output Contract
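Every `extract` invocation produces:
| Artifact | Path | Format |
|---|---|---|
| Playwright script | `extract_<domain>_<slug>.cjs` | Standalone Node.js script using `playwright` |
| Extracted data | Output file written by the script (`output.json` in the script template) | JSON array of objects (default), CSV, or user-specified |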
The script must be re-runnable — a user should be able to execute it later without Actionbook installed, as long as Node.js + Playwright are available in the runtime environment.
Selector Priority
When multiple selector types are available from `actionbook get`:
| Priority | Type | Reason |
|---|---|---|
| 1 | `data-testid` | Stable, test-oriented, rarely changes |
| 2 | `aria-label` | Accessibility-driven, semantically meaningful |
| 3 | CSS selector | Structural, may break on redesign |
| 4 | XPath | Last resort, most brittle |
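The priority order can be applied mechanically when a field has several candidate selectors. A minimal sketch follows; the type names used as keys (`data-testid`, `aria-label`, `css`, `xpath`) and the `candidates` shape are assumptions about how selectors are labeled, and `pickSelector` is a hypothetical helper.

```javascript
// Hypothetical sketch: choose the most stable selector available for one field.
// Assumption: `candidates` maps a selector type name to a selector string.
const SELECTOR_PRIORITY = ['data-testid', 'aria-label', 'css', 'xpath'];

function pickSelector(candidates) {
  for (const type of SELECTOR_PRIORITY) {
    const value = candidates[type];
    if (typeof value === 'string' && value.trim() !== '') {
      return { type, selector: value };
    }
  }
  return null; // nothing usable: treat the field as needing fallback discovery
}

console.log(pickSelector({ css: '.price', xpath: '//span[1]' }));
console.log(pickSelector({ xpath: '//h3', 'data-testid': '[data-testid="title"]' }));
```

Recording the chosen `type` alongside the selector makes it easier to explain later why a script broke after a redesign.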
Error Handling
| Error | Action |
|---|---|
| `actionbook search` returns no usable results | Fall back to `snapshot` + `screenshot` |
| Selector returns 0 elements | Re-snapshot, compare with screenshot, update selector |
| Script times out | Add a longer timeout |
| Partial data (some fields empty) | Check if content is lazy-loaded; add scroll/wait |
| Anti-bot / CAPTCHA | Inform user; suggest running with |