playwright-scraper


Purpose

This skill enables web scraping using Playwright, a Node.js library for browser automation. It focuses on handling dynamic content, authentication flows, pagination, data extraction, and screenshots to reliably scrape modern websites.

When to Use

Use this skill for scraping sites with JavaScript-rendered content (e.g., React or Angular apps), sites requiring login (e.g., dashboards), handling multi-page results (e.g., search results), or capturing visual data (e.g., screenshots for verification). Avoid for static HTML sites where simpler tools like requests suffice.

Key Capabilities

  • Dynamically load and interact with content using Playwright's browser control.
  • Manage authentication flows, such as logging in via forms or API tokens.
  • Handle pagination by navigating pages, clicking "next" buttons, or parsing URLs.
  • Extract data using selectors, with options for JSON output or file saves.
  • Capture screenshots or full-page PDFs for debugging or reporting.
  • Run in headless or headed (visible) browser mode, as the task requires.

Usage Patterns

Always initialize a browser context first, then create pages for navigation. Use async patterns for reliability. For authenticated scraping, handle cookies or sessions per context. Structure scripts to loop through pages for pagination and use try-catch for flaky elements. Pass configurations via JSON files or environment variables for reusability.
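The lifecycle above (initialize first, then work, always clean up) can be sketched as a small helper. The `withBrowser` name and the injected `launch` callback are illustrative, not part of Playwright's API; injecting the launcher keeps the wrapper usable with any browser type, or with a stub when testing without a real browser:

```javascript
// Sketch of the init -> scrape -> cleanup lifecycle described above.
// `launch` is injected (e.g. () => chromium.launch({ headless: true })),
// `work` receives the live browser and does all navigation/extraction.
async function withBrowser(launch, work) {
  const browser = await launch();
  try {
    return await work(browser);
  } finally {
    await browser.close(); // release the browser even if work() throws
  }
}
```

With Playwright this would be called as `withBrowser(() => require('playwright').chromium.launch({ headless: true }), async browser => { /* scrape */ })`, so no code path can leak a browser process.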

Common Commands/API

Use Playwright's Node.js API. Install via
npm install playwright
(if browser binaries are missing, download them with npx playwright install). Key methods include:
  • Launch browser:
    const browser = await playwright.chromium.launch({ headless: true });
  • Navigate page:
    const page = await browser.newPage(); await page.goto('https://example.com');
  • Handle auth:
    await page.fill('#username', process.env.USERNAME); await page.fill('#password', process.env.PASSWORD); await page.click('#login');
  • Extract data:
    const data = await page.evaluate(() => document.querySelector('#target').innerText); console.log(data);
  • Pagination:
    while (await page.$('#next-button')) { await page.click('#next-button'); await page.waitForSelector('.item'); }
  • Take screenshot:
    await page.screenshot({ path: 'screenshot.png' });

Plain scraper scripts run with
node scraper.js
. Scripts built on the Playwright Test runner instead run with
npx playwright test
plus flags such as
--headed
for visible mode or
--timeout 30000
for a longer per-test timeout.
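For sites that encode the page number in the URL (the "parsing URLs" approach mentioned under Key Capabilities), the target URLs can be built up front instead of clicking through. This helper is a hypothetical sketch; the query parameter name varies per site:

```javascript
// Build the page URLs for URL-parameter pagination, e.g.
// https://search.com/?q=query&page=1 ... &page=N. Assumes the site
// exposes a "page" query parameter; adjust `param` per target site.
function pageUrls(baseUrl, totalPages, param = 'page') {
  const urls = [];
  for (let p = 1; p <= totalPages; p++) {
    const u = new URL(baseUrl); // URL preserves the existing query string
    u.searchParams.set(param, String(p));
    urls.push(u.toString());
  }
  return urls;
}
```

Each URL can then be visited in a plain loop with page.goto(url), which avoids depending on a "next" button being present.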

Integration Notes

Integrate by importing Playwright in Node.js projects. For auth, use environment variables like
$PLAYWRIGHT_USERNAME
and
$PLAYWRIGHT_PASSWORD
to avoid hardcoding. Configuration format: Use a JSON file for settings, e.g.,
{ "url": "https://target.com", "selector": "#data-element" }
. Pass it via script args:
node scraper.js --config config.json
. For larger systems, chain with tools like Puppeteer (if migrating) or export data to databases via
page.evaluate
results. Use a Node.js version Playwright supports (recent releases require Node.js 18+), and configure proxies at launch:
chromium.launch({ proxy: { server: 'http://myproxy.com:8080' } })
.

Error Handling

Anticipate common errors like timeout on dynamic loads or selector failures. Use
page.waitForSelector
with timeouts:
await page.waitForSelector('#element', { timeout: 10000 }).catch(err => console.error('Element not found:', err));
. For network issues, wrap
page.goto
in try-catch:
try { await page.goto(url, { waitUntil: 'networkidle' }); } catch (e) { console.error('Navigation failed:', e.message); await browser.close(); }
. Handle authentication failures by checking for error elements:
if (await page.$('#error-message')) { throw new Error('Login failed'); }
. Log errors with details and retry up to 3 times using a loop.
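The "retry up to 3 times using a loop" advice can be factored into a generic helper. `withRetries` is an illustrative name, and 3 attempts is simply the default suggested above:

```javascript
// Retry an async operation (e.g. a page.goto call) up to `attempts`
// times, logging each failure and rethrowing the last error if all fail.
async function withRetries(fn, attempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      console.error(`Attempt ${attempt}/${attempts} failed:`, err.message);
    }
  }
  throw lastError; // all attempts exhausted
}
```

A flaky navigation then becomes `await withRetries(() => page.goto(url, { waitUntil: 'networkidle' }))`, keeping the retry policy in one place instead of scattered across the script.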

Concrete Usage Examples

  1. Scraping a logged-in dashboard: First, set env vars:
    export PLAYWRIGHT_USERNAME='user@example.com'
    and
    export PLAYWRIGHT_PASSWORD='securepass'
    . Then, run:
    const browser = await playwright.chromium.launch(); const page = await browser.newPage(); await page.goto('https://dashboard.com/login'); await page.fill('#username', process.env.PLAYWRIGHT_USERNAME); await page.fill('#password', process.env.PLAYWRIGHT_PASSWORD); await page.click('#submit'); const data = await page.evaluate(() => document.querySelector('#dashboard-data').innerText); console.log(data); await browser.close();
    This extracts data from a protected page.
  2. Handling pagination on a search site: Script:
    const browser = await playwright.chromium.launch(); const page = await browser.newPage(); await page.goto('https://search.com?q=query'); let items = []; while (true) { items.push(...await page.$$eval('.result-item', elements => elements.map(el => el.innerText))); const nextButton = await page.$('#next-page'); if (!nextButton) break; await nextButton.click(); await page.waitForTimeout(2000); } console.log(items); await browser.close();
    This collects results across multiple pages.
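The loop in example 2 can be separated from the browser-specific parts, which makes the accumulation logic reusable and testable on its own. Both callback names here are illustrative:

```javascript
// Collect items across pages. `extractPage` returns the current page's
// items; `gotoNext` advances to the next page and resolves false when
// there is no next page (e.g. the "#next-page" button is missing).
async function collectAllPages(extractPage, gotoNext) {
  const items = [];
  while (true) {
    items.push(...await extractPage());
    if (!await gotoNext()) break;
  }
  return items;
}
```

In example 2, `extractPage` would wrap the `page.$$eval('.result-item', ...)` call and `gotoNext` would click `#next-page`, returning false when `page.$('#next-page')` is null.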

Graph Relationships

  • Related to: "selenium-automation" (alternative browser automation tool)
  • Depends on: "node-runtime" (for Playwright execution)
  • Complements: "data-extraction" (for post-processing scraped data)
  • In cluster: "community" (shared with other open-source tools)