playwright-scraper
Purpose
This skill enables web scraping using Playwright, a Node.js library for browser automation. It focuses on handling dynamic content, authentication flows, pagination, data extraction, and screenshots to reliably scrape modern websites.
When to Use
Use this skill for scraping sites with JavaScript-rendered content (e.g., React or Angular apps), sites requiring login (e.g., dashboards), multi-page results (e.g., search results), or for capturing visual data (e.g., screenshots for verification). Avoid it for static HTML sites, where a plain HTTP client such as Python's `requests` suffices.
Key Capabilities
- Dynamically load and interact with content using Playwright's browser control.
- Manage authentication flows, such as logging in via forms or API tokens.
- Handle pagination by navigating pages, clicking "next" buttons, or parsing URLs.
- Extract data using selectors, with options for JSON output or file saves.
- Capture screenshots or full-page PDFs for debugging or reporting.
- Run in headless or headed (visible) browser mode.
Usage Patterns
Always launch a browser and create a browser context first, then open pages from that context for navigation. Playwright's API is promise-based, so await every call. For authenticated scraping, keep cookies and session state per context. Structure scripts to loop through pages for pagination, and wrap interactions with flaky elements in try-catch. Pass configuration via JSON files or environment variables for reusability.
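The context-then-page pattern can be sketched as a small helper. This is a minimal illustration, not part of the skill itself: `withPage` is a hypothetical name, and the browser type (for example `require('playwright').chromium`) is passed in as an argument so the helper can be exercised without a real browser.

```javascript
// Sketch: run a scraping task inside its own browser context and always clean up.
// `browserType` is assumed to be a Playwright browser type such as
// require('playwright').chromium; `withPage` itself is a hypothetical helper.
async function withPage(browserType, task, { headless = true, contextOptions = {} } = {}) {
  const browser = await browserType.launch({ headless });
  try {
    // One context per session keeps cookies and storage isolated.
    const context = await browser.newContext(contextOptions);
    const page = await context.newPage();
    return await task(page, context);
  } finally {
    // Closing the browser also closes its contexts and pages.
    await browser.close();
  }
}

// Hypothetical usage:
// const { chromium } = require('playwright');
// withPage(chromium, async (page) => {
//   await page.goto('https://example.com');
//   return page.title();
// }).then(console.log).catch(console.error);
```

Because cleanup sits in `finally`, the browser is closed even when the task throws, which avoids orphaned browser processes after a failed scrape.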
Common Commands/API
Use Playwright's Node.js API. Install via `npm install playwright` and load it with `const playwright = require('playwright');`. Key methods include:
- Launch browser: `const browser = await playwright.chromium.launch({ headless: true });`
- Navigate page: `const page = await browser.newPage(); await page.goto('https://example.com');`
- Handle auth: `await page.fill('#username', process.env.PLAYWRIGHT_USERNAME); await page.fill('#password', process.env.PLAYWRIGHT_PASSWORD); await page.click('#login');`
- Extract data: `const data = await page.evaluate(() => document.querySelector('#target').innerText); console.log(data);`
- Pagination: `while (await page.$('#next-button')) { await page.click('#next-button'); await page.waitForSelector('.item'); }`
- Take screenshot: `await page.screenshot({ path: 'screenshot.png' });`
CLI flags for running scripts: Use `npx playwright test` (the Playwright Test runner, from `@playwright/test`) with flags like `--headed` for visible mode or `--timeout 30000` for extended waits.
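The extraction call above throws if `#target` is missing, because `document.querySelector` returns `null`. A minimal sketch of a more forgiving variant, together with the JSON file save mentioned under Key Capabilities, might look like this. `extractTexts` and `saveJson` are hypothetical helper names, and `page` is assumed to be a Playwright `Page`.

```javascript
const fs = require('fs/promises');

// Collect the trimmed text of every element matching `selector`.
// page.$$eval resolves to an empty array when nothing matches, so there is
// no null dereference as with document.querySelector(...).innerText.
async function extractTexts(page, selector) {
  return page.$$eval(selector, (elements) => elements.map((el) => el.innerText.trim()));
}

// Write scraped records to disk as pretty-printed JSON.
async function saveJson(filePath, records) {
  await fs.writeFile(filePath, JSON.stringify(records, null, 2), 'utf8');
}

// Hypothetical usage inside an async function:
// const rows = await extractTexts(page, '.item');
// await saveJson('items.json', rows);
```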
Integration Notes
Integrate by importing Playwright in Node.js projects. For auth, use environment variables like `$PLAYWRIGHT_USERNAME` and `$PLAYWRIGHT_PASSWORD` to avoid hardcoding credentials. Configuration format: Use a JSON file for settings, e.g., `{ "url": "https://target.com", "selector": "#data-element" }`. Pass it via script args: `node scraper.js --config config.json`. For larger systems, chain with tools like Puppeteer (if migrating) or export data to databases via `page.evaluate` results. Check that your Node.js version is supported by the installed Playwright release (the baseline here is Node.js 14+, but recent releases require a newer LTS version), and handle proxy settings at launch with `playwright.chromium.launch({ proxy: { server: 'http://myproxy.com:8080' } })`.
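One way to wire the `--config` argument and the credential variables together is sketched below. It assumes the config file has exactly the `url` and `selector` keys shown above; `getConfigPath` and `loadConfig` are hypothetical helper names, and the argument parsing is deliberately minimal rather than a full CLI parser.

```javascript
const fs = require('fs');

// Return the value following `--config` in an argv-style array, or null.
function getConfigPath(argv) {
  const i = argv.indexOf('--config');
  return i !== -1 && i + 1 < argv.length ? argv[i + 1] : null;
}

// Merge JSON settings with credentials taken from the environment.
function loadConfig(argv = process.argv.slice(2), env = process.env) {
  const configPath = getConfigPath(argv);
  if (!configPath) throw new Error('Missing --config <file>');
  const config = JSON.parse(fs.readFileSync(configPath, 'utf8'));
  // Fail early if a required setting is absent or empty.
  for (const key of ['url', 'selector']) {
    if (typeof config[key] !== 'string' || config[key] === '') {
      throw new Error(`Config is missing "${key}"`);
    }
  }
  const { PLAYWRIGHT_USERNAME: username, PLAYWRIGHT_PASSWORD: password } = env;
  // Credentials stay out of the JSON file; null means "scrape anonymously".
  return { ...config, credentials: username && password ? { username, password } : null };
}

// Hypothetical usage: node scraper.js --config config.json
// const { url, selector, credentials } = loadConfig();
```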
Error Handling
Anticipate common errors like timeouts on dynamic loads or selector failures. Use `page.waitForSelector` with timeouts: `await page.waitForSelector('#element', { timeout: 10000 }).catch(err => console.error('Element not found:', err));`. For network issues, wrap `page.goto` in try-catch: `try { await page.goto(url, { waitUntil: 'networkidle' }); } catch (e) { console.error('Navigation failed:', e.message); await browser.close(); }`. Handle authentication failures by checking for error elements: `if (await page.$('#error-message')) { throw new Error('Login failed'); }`. Log errors with details and retry up to 3 times using a loop.
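The "retry up to 3 times using a loop" advice can be sketched as a generic wrapper. `withRetry` is a hypothetical helper, not a Playwright API, and the fixed delay between attempts is an assumption; a backoff schedule would work equally well.

```javascript
// Retry an async operation up to `attempts` times, pausing between tries.
async function withRetry(operation, { attempts = 3, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      // Log enough detail to tell which attempt failed and why.
      console.error(`Attempt ${attempt}/${attempts} failed: ${err.message}`);
      if (attempt < attempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}

// Hypothetical usage with a Playwright page, inside an async function:
// await withRetry(() => page.goto(url, { waitUntil: 'networkidle' }));
```

Rethrowing the last error once the attempts are exhausted keeps the failure visible to the caller instead of silently continuing with a half-loaded page.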
Concrete Usage Examples
- Scraping a logged-in dashboard: First, set env vars: `export PLAYWRIGHT_USERNAME='user@example.com'` and `export PLAYWRIGHT_PASSWORD='securepass'`. Then, run:

    const playwright = require('playwright');

    (async () => {
      const browser = await playwright.chromium.launch();
      const page = await browser.newPage();
      await page.goto('https://dashboard.com/login');
      await page.fill('#username', process.env.PLAYWRIGHT_USERNAME);
      await page.fill('#password', process.env.PLAYWRIGHT_PASSWORD);
      await page.click('#submit');
      // Wait for the protected content to render before reading it
      await page.waitForSelector('#dashboard-data');
      const data = await page.evaluate(() => document.querySelector('#dashboard-data').innerText);
      console.log(data);
      await browser.close();
    })();

  This extracts data from a protected page.
- Handling pagination on a search site: Script:

    const playwright = require('playwright');

    (async () => {
      const browser = await playwright.chromium.launch();
      const page = await browser.newPage();
      await page.goto('https://search.com?q=query');
      const items = [];
      while (true) {
        // Collect the text of every result on the current page
        items.push(...await page.$$eval('.result-item', elements => elements.map(el => el.innerText)));
        const nextButton = await page.$('#next-page');
        if (!nextButton) break; // no more pages
        await nextButton.click();
        await page.waitForTimeout(2000); // fixed pause while the next page loads
      }
      console.log(items);
      await browser.close();
    })();

  This collects results across multiple pages.
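The pagination loop above runs until the "next" control disappears, which never happens on sites that keep a disabled button in the DOM. A minimal sketch with a page cap, assuming `page` is a Playwright `Page` and that clicking "next" triggers network activity, could look like this. `collectAllPages` and its option names are hypothetical.

```javascript
// Collect item texts across result pages, stopping when the "next" control
// disappears or when `maxPages` is reached (a guard against endless loops).
async function collectAllPages(page, { itemSelector, nextSelector, maxPages = 50 }) {
  const items = [];
  for (let pageNo = 1; pageNo <= maxPages; pageNo++) {
    // Wait for results instead of sleeping for a fixed time.
    await page.waitForSelector(itemSelector);
    items.push(...await page.$$eval(itemSelector, (els) => els.map((el) => el.innerText)));
    const next = await page.$(nextSelector);
    if (!next || pageNo === maxPages) break;
    await next.click();
    // Assumption: the click causes requests that settle before new results show.
    await page.waitForLoadState('networkidle');
  }
  return items;
}

// Hypothetical usage inside an async function:
// const items = await collectAllPages(page, {
//   itemSelector: '.result-item',
//   nextSelector: '#next-page',
//   maxPages: 20,
// });
```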
Graph Relationships
- Related to: "selenium-automation" (alternative browser automation tool)
- Depends on: "node-runtime" (for Playwright execution)
- Complements: "data-extraction" (for post-processing scraped data)
- In cluster: "community" (shared with other open-source tools)