browser-automation

Browser Automation - POWERFUL

Overview

The Browser Automation skill provides comprehensive tools and knowledge for building production-grade web automation workflows using Playwright. This skill covers data extraction, form filling, screenshot capture, session management, and anti-detection patterns for reliable browser automation at scale.
When to use this skill:
  • Scraping structured data from websites (tables, listings, search results)
  • Automating multi-step browser workflows (login, fill forms, download files)
  • Capturing screenshots or PDFs of web pages
  • Extracting data from SPAs and JavaScript-heavy sites
  • Building repeatable browser-based data pipelines
When NOT to use this skill:
  • Writing browser tests or E2E test suites — use playwright-pro instead
  • Testing API endpoints — use api-test-suite-builder instead
  • Load testing or performance benchmarking — use performance-profiler instead
Why Playwright over Selenium or Puppeteer:
  • Auto-wait built in — no explicit `sleep()` or `waitForElement()` needed for most actions
  • Multi-browser from one API — Chromium, Firefox, WebKit with zero config changes
  • Network interception — block ads, mock responses, capture API calls natively
  • Browser contexts — isolated sessions without spinning up new browser instances
  • Codegen — `playwright codegen` records your actions and generates scripts
  • Async-first — Python async/await for high-throughput scraping

Core Competencies

1. Web Scraping Patterns

Selector priority (most to least reliable):
  1. `data-testid`, `data-id`, or custom data attributes — stable across redesigns
  2. `#id` selectors — unique but may change between deploys
  3. Semantic selectors: `article`, `nav`, `main`, `section` — resilient to CSS changes
  4. Class-based: `.product-card`, `.price` — brittle if classes are generated (e.g., CSS modules)
  5. Positional: `nth-child()`, `nth-of-type()` — last resort, breaks on layout changes
Use XPath only when CSS cannot express the relationship (e.g., ancestor traversal, text-based selection).
Pagination strategies: next-button, URL-based (`?page=N`), infinite scroll, load-more button. See data_extraction_recipes.md for complete pagination handlers and scroll patterns.
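For the URL-based strategy, the page loop can be driven by generated URLs instead of clicking. A minimal sketch using only the standard library (the `page` parameter name follows the `?page=N` convention; adjust it per site):

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

def paginated_urls(base_url, max_pages, param="page", start=1):
    """Yield base_url with the page parameter set for each page,
    preserving any query parameters already present."""
    parts = urlparse(base_url)
    query = parse_qs(parts.query)
    for n in range(start, start + max_pages):
        query[param] = [str(n)]
        yield urlunparse(parts._replace(query=urlencode(query, doseq=True)))
```

Iterating this generator with `page.goto(url)` plus an extraction call per URL replaces the click-and-wait loop for sites that expose page numbers in the URL.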

2. Form Filling & Multi-Step Workflows

Break multi-step forms into discrete functions per step. Each function fills fields, clicks "Next"/"Continue", and waits for the next step to load (URL change or DOM element).
Key patterns: login flows, multi-page forms, file uploads (including drag-and-drop zones), native and custom dropdown handling. See playwright_browser_api.md for the complete API reference on `fill()`, `select_option()`, `set_input_files()`, and `expect_file_chooser()`.
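The per-step pattern described above can be sketched as follows. The selectors and field names are hypothetical; because each step only calls the Playwright page API it receives, the steps can also be exercised with a stub page object:

```python
async def fill_shipping_step(page, data):
    # Step 1: fill fields, advance, then wait for the next step's DOM
    await page.fill("[data-testid='address']", data["address"])
    await page.fill("[data-testid='city']", data["city"])
    await page.click("button:has-text('Next')")
    await page.wait_for_selector("[data-testid='payment-form']")

async def fill_payment_step(page, data):
    # Step 2: same shape as step 1
    await page.fill("[data-testid='card-number']", data["card"])
    await page.click("button:has-text('Continue')")
    await page.wait_for_selector("[data-testid='review-summary']")

async def run_form(page, data):
    # Each step leaves the page ready for the next one
    for step in (fill_shipping_step, fill_payment_step):
        await step(page, data)
```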

3. Screenshot & PDF Capture

  • Full page: `await page.screenshot(path="full.png", full_page=True)`
  • Element: `await page.locator("div.chart").screenshot(path="chart.png")`
  • PDF (Chromium only): `await page.pdf(path="out.pdf", format="A4", print_background=True)`
  • Visual regression: Take screenshots at known states, store baselines in version control with the naming scheme `{page}_{viewport}_{state}.png`
See playwright_browser_api.md for full screenshot/PDF options.
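For the visual-regression naming scheme, a small helper (hypothetical, just encoding the convention above) keeps baseline filenames consistent:

```python
def baseline_name(page_name, viewport, state):
    """Build a '{page}_{viewport}_{state}.png' baseline filename."""
    w, h = viewport["width"], viewport["height"]
    return f"{page_name}_{w}x{h}_{state}.png"

baseline_name("checkout", {"width": 1920, "height": 1080}, "empty-cart")
# → "checkout_1920x1080_empty-cart.png"
```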

4. Structured Data Extraction

Core extraction patterns:
  • Tables to JSON — Extract `<thead>` headers and `<tbody>` rows into dictionaries
  • Listings to arrays — Map repeating card elements using a field-selector map (supports `::attr()` for attributes)
  • Nested/threaded data — Recursive extraction for comments with replies, category trees
See data_extraction_recipes.md for complete extraction functions, price parsing, data cleaning utilities, and output format helpers (JSON, CSV, JSONL).
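As a taste of the cleaning utilities, here is a simplified price parser that handles both `1,299.99` and `1.299,99` styles. This is a sketch, not the implementation from data_extraction_recipes.md:

```python
import re

def parse_price(text):
    """Extract a float from strings like '$1,299.99', '1.299,99 €', or 'USD 45'."""
    if not text:
        return None
    cleaned = re.sub(r"[^\d.,]", "", text)  # keep digits and separators only
    if not cleaned:
        return None
    if "," in cleaned and "." in cleaned:
        # Both separators present: the last one is the decimal mark
        if cleaned.rfind(",") > cleaned.rfind("."):
            cleaned = cleaned.replace(".", "").replace(",", ".")
        else:
            cleaned = cleaned.replace(",", "")
    elif "," in cleaned:
        # Lone comma is a decimal mark only with 1-2 trailing digits
        head, _, tail = cleaned.rpartition(",")
        cleaned = head.replace(",", "") + ("." + tail if len(tail) <= 2 else tail)
    return float(cleaned)
```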

5. Cookie & Session Management

  • Save/restore cookies: `context.cookies()` and `context.add_cookies()`
  • Full storage state (cookies + localStorage): `context.storage_state(path="state.json")` to save, `browser.new_context(storage_state="state.json")` to restore
Best practice: Save state after login, reuse across scraping sessions. Check session validity before starting a long job — make a lightweight request to a protected page and verify you are not redirected to login. See playwright_browser_api.md for cookie and storage state API details.
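One cheap pre-flight before the full redirect check: `context.cookies()` returns dicts with an `expires` epoch timestamp, so an already-expired session cookie can be caught without any navigation. A sketch over that cookie shape (the cookie name is an example):

```python
import time

def has_live_cookie(cookies, name):
    """True if a cookie with this name exists and has not expired.
    An `expires` of -1 means a session cookie (no expiry)."""
    now = time.time()
    for cookie in cookies:
        if cookie.get("name") == name:
            expires = cookie.get("expires", -1)
            return expires == -1 or expires > now
    return False
```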

6. Anti-Detection Patterns

Modern websites detect automation through multiple vectors. Apply these in priority order:
  1. WebDriver flag removal — override the `navigator.webdriver` flag via an init script (critical)
  2. Custom user agent — rotate through real browser UAs; never use the default headless UA
  3. Realistic viewport — set 1920x1080 or similar real-world dimensions (the default 800x600 is a red flag)
  4. Request throttling — add `random.uniform()` delays between actions
  5. Proxy support — per-browser or per-context proxy configuration
See anti_detection_patterns.md for the complete stealth stack: navigator property hardening, WebGL/canvas fingerprint evasion, behavioral simulation (mouse movement, typing speed, scroll patterns), proxy rotation strategies, and detection self-test URLs.
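Items 1 through 3 can be expressed as plain configuration, which keeps them testable without a browser. In a real script you would pass `STEALTH_INIT_JS` to `context.add_init_script()` and the kwargs to `browser.new_context()`; the UA strings below are examples and should come from a maintained, rotated list:

```python
import random

# Override the flag headless Chromium exposes; runs before any page script.
STEALTH_INIT_JS = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
"""

# Example pool only — in practice, rotate real, current UA strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

def stealth_context_kwargs():
    """Kwargs for browser.new_context(): realistic viewport plus a rotated UA."""
    return {
        "viewport": {"width": 1920, "height": 1080},  # real-world size, not 800x600
        "user_agent": random.choice(USER_AGENTS),
    }
```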

7. Dynamic Content Handling

  • SPA rendering: Wait for content selectors (`wait_for_selector`), not the page load event
  • AJAX/Fetch waiting: Use `page.expect_response("**/api/data*")` to intercept and wait for specific API calls
  • Shadow DOM: Playwright pierces open Shadow DOM with the `>>` operator: `page.locator("custom-element >> .inner-class")`
  • Lazy-loaded images: Scroll elements into view with `scroll_into_view_if_needed()` to trigger loading
See playwright_browser_api.md for wait strategies, network interception, and Shadow DOM details.

8. Error Handling & Retry Logic

  • Retry with backoff: Wrap page interactions in retry logic with exponential backoff (e.g., 1s, 2s, 4s)
  • Fallback selectors: On `TimeoutError`, try alternative selectors before failing
  • Error-state screenshots: Capture `page.screenshot(path="error-state.png")` on unexpected failures for debugging
  • Rate limit detection: Check for HTTP 429 responses and respect `Retry-After` headers
See anti_detection_patterns.md for the complete exponential backoff implementation and rate limiter class.
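The first two bullets combine into one small wrapper; a generic sketch, not the implementation in anti_detection_patterns.md:

```python
import asyncio

async def retry(action, attempts=3, base_delay=1.0):
    """Run an async action, retrying on failure with delays of 1s, 2s, 4s..."""
    for i in range(attempts):
        try:
            return await action()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts, surface the original error
            await asyncio.sleep(base_delay * (2 ** i))
```

Fallback selectors fit naturally here: the `action` closure can try each candidate selector in turn and raise only after all fail, leaving the backoff to `retry`.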

Workflows

Workflow 1: Single-Page Data Extraction

Scenario: Extract product data from a single page with JavaScript-rendered content.
Steps:
  1. Launch browser in headed mode during development (`headless=False`), switch to headless for production
  2. Navigate to URL and wait for content selector
  3. Extract data using `query_selector_all` with field mapping
  4. Validate extracted data (check for nulls, expected types)
  5. Output as JSON
```python
async def extract_single_page(url, selectors):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 ..."  # use a real, current UA string
        )
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle")
        # extract_listings comes from data_extraction_recipes.md
        data = await extract_listings(page, selectors["container"], selectors["fields"])
        await browser.close()
    return data
```

Workflow 2: Multi-Page Scraping with Pagination

Scenario: Scrape search results across 50+ pages.
Steps:
  1. Launch browser with anti-detection settings
  2. Navigate to first page
  3. Extract data from current page
  4. Check if "Next" button exists and is enabled
  5. Click next, wait for new content to load (not just navigation)
  6. Repeat until no next page or max pages reached
  7. Deduplicate results by unique key
  8. Write output incrementally (don't hold everything in memory)
```python
async def scrape_paginated(base_url, selectors, max_pages=100):
    all_data = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await (await browser.new_context()).new_page()
        await page.goto(base_url)

        for page_num in range(max_pages):
            items = await extract_listings(page, selectors["container"], selectors["fields"])
            all_data.extend(items)

            next_btn = page.locator(selectors["next_button"])
            if await next_btn.count() == 0 or await next_btn.is_disabled():
                break

            await next_btn.click()
            await page.wait_for_selector(selectors["container"])
            await human_delay(800, 2000)  # randomized pause, see anti_detection_patterns.md

        await browser.close()
    return all_data
```

Workflow 3: Authenticated Workflow Automation

Scenario: Log into a portal, navigate a multi-step form, download a report.
Steps:
  1. Check for existing session state file
  2. If no session, perform login and save state
  3. Navigate to target page using saved session
  4. Fill multi-step form with provided data
  5. Wait for download to trigger
  6. Save downloaded file to target directory
```python
import os

async def authenticated_workflow(credentials, form_data, download_dir):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        state_file = "session_state.json"

        # Restore or create session
        if os.path.exists(state_file):
            context = await browser.new_context(storage_state=state_file)
        else:
            context = await browser.new_context()
            page = await context.new_page()
            await login(page, credentials["url"], credentials["user"], credentials["pass"])
            await context.storage_state(path=state_file)

        page = await context.new_page()
        await page.goto(form_data["target_url"])

        # Fill form steps
        for step_fn in [fill_step_1, fill_step_2]:
            await step_fn(page, form_data)

        # Handle download
        async with page.expect_download() as dl_info:
            await page.click("button:has-text('Download Report')")
        download = await dl_info.value
        await download.save_as(os.path.join(download_dir, download.suggested_filename))

        await browser.close()
```

Tools Reference

| Script | Purpose | Key Flags | Output |
| --- | --- | --- | --- |
| scraping_toolkit.py | Generate Playwright scraping script skeleton | `--url`, `--selectors`, `--paginate`, `--output` | Python script or JSON config |
| form_automation_builder.py | Generate form-fill automation script from field spec | `--fields`, `--url`, `--output` | Python automation script |
| anti_detection_checker.py | Audit a Playwright script for detection vectors | `--file`, `--verbose` | Risk report with score |

All scripts are stdlib-only. Run `python3 <script> --help` for full usage.

Anti-Patterns

Hardcoded Waits

Bad: `await page.wait_for_timeout(5000)` before every action. Good: Use `wait_for_selector`, `wait_for_url`, `expect_response`, or `wait_for_load_state`. Hardcoded waits are flaky and slow.

No Error Recovery

Bad: Linear script that crashes on first failure. Good: Wrap each page interaction in try/except. Take error-state screenshots. Implement retry with exponential backoff.
Ignoring robots.txt

Bad: Scraping without checking robots.txt directives. Good: Fetch and parse robots.txt before scraping. Respect `Crawl-delay`. Skip disallowed paths. Add your bot name to the User-Agent if running at scale.
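The fetch-and-parse step is covered by the standard library: `urllib.robotparser` answers per-path questions and exposes `Crawl-delay`. The rules below are parsed from an inline string for illustration; `load_robots` shows the fetching variant:

```python
from urllib.robotparser import RobotFileParser

def load_robots(robots_url):
    """Fetch and parse a live robots.txt, e.g. 'https://example.com/robots.txt'."""
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # performs the HTTP fetch
    return rp

# Parsing from text, to illustrate the checks without a network call:
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /admin/
Crawl-delay: 2
""".splitlines())

allowed = rp.can_fetch("my-scraper-bot", "/products")     # True
blocked = rp.can_fetch("my-scraper-bot", "/admin/users")  # False
delay = rp.crawl_delay("my-scraper-bot")                  # 2
```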

Storing Credentials in Scripts

Bad: Hardcoding usernames and passwords in Python files. Good: Use environment variables, `.env` files (gitignored), or a secrets manager. Pass credentials via CLI arguments.

No Rate Limiting

Bad: Hammering a site with 100 requests/second. Good: Add random delays between requests (1-3s for polite scraping). Monitor for 429 responses. Implement exponential backoff.
Selector Fragility

Bad: Relying on auto-generated class names (`.css-1a2b3c`) or deep nesting (`div > div > div > span:nth-child(3)`). Good: Use data attributes, semantic HTML, or text-based locators. Test selectors in browser DevTools first.

Not Cleaning Up Browser Instances

Bad: Launching browsers without closing them, leading to resource leaks. Good: Always use `try/finally` or async context managers to ensure `browser.close()` is called.
Running Headed in Production

Bad: Using `headless=False` in production/CI. Good: Develop with headed mode for debugging, deploy with `headless=True`. Use an environment variable to toggle: `headless = os.environ.get("HEADLESS", "true") == "true"`.
Cross-References

  • playwright-pro — Browser testing skill. Use for E2E tests, test assertions, test fixtures. Browser Automation is for data extraction and workflow automation, not testing.
  • api-test-suite-builder — When the website has a public API, hit the API directly instead of scraping the rendered page. Faster, more reliable, less detectable.
  • performance-profiler — If your automation scripts are slow, profile the bottlenecks before adding concurrency.
  • env-secrets-manager — For securely managing credentials used in authenticated automation workflows.