browser-automation
Browser Automation - POWERFUL
Overview
The Browser Automation skill provides comprehensive tools and knowledge for building production-grade web automation workflows using Playwright. This skill covers data extraction, form filling, screenshot capture, session management, and anti-detection patterns for reliable browser automation at scale.
When to use this skill:
- Scraping structured data from websites (tables, listings, search results)
- Automating multi-step browser workflows (login, fill forms, download files)
- Capturing screenshots or PDFs of web pages
- Extracting data from SPAs and JavaScript-heavy sites
- Building repeatable browser-based data pipelines
When NOT to use this skill:
- Writing browser tests or E2E test suites — use playwright-pro instead
- Testing API endpoints — use api-test-suite-builder instead
- Load testing or performance benchmarking — use performance-profiler instead
Why Playwright over Selenium or Puppeteer:
- Auto-wait built in — no explicit sleep() or waitForElement() needed for most actions
- Multi-browser from one API — Chromium, Firefox, WebKit with zero config changes
- Network interception — block ads, mock responses, capture API calls natively
- Browser contexts — isolated sessions without spinning up new browser instances
- Codegen — playwright codegen records your actions and generates scripts
- Async-first — Python async/await for high-throughput scraping
Core Competencies
1. Web Scraping Patterns
Selector priority (most to least reliable):
- data-testid, data-id, or custom data attributes — stable across redesigns
- #id selectors — unique but may change between deploys
- Semantic selectors: article, nav, main, section — resilient to CSS changes
- Class-based: .product-card, .price — brittle if classes are generated (e.g., CSS modules)
- Positional: nth-child(), nth-of-type() — last resort, breaks on layout changes
Use XPath only when CSS cannot express the relationship (e.g., ancestor traversal, text-based selection).
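Selector fallback can be scripted directly. The sketch below walks a prioritized selector list and returns the first one that matches anything on the page; the helper name first_present is ours, not a Playwright API:

```python
async def first_present(page, selectors):
    """Return the first selector (in priority order) that matches at
    least one element on the page, or None if nothing matches.

    locator().count() checks the current DOM without waiting, so this
    is cheap to run even against a long fallback chain.
    """
    for selector in selectors:
        if await page.locator(selector).count() > 0:
            return selector
    return None
```

A scraper would try, e.g., `["[data-testid=price]", ".price"]` and log which tier it fell back to, making selector rot visible before extraction silently returns nulls.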
Pagination strategies: next-button, URL-based (?page=N), infinite scroll, load-more button. See data_extraction_recipes.md for complete pagination handlers and scroll patterns.
2. Form Filling & Multi-Step Workflows
Break multi-step forms into discrete functions per step. Each function fills fields, clicks "Next"/"Continue", and waits for the next step to load (URL change or DOM element).
Key patterns: login flows, multi-page forms, file uploads (including drag-and-drop zones), native and custom dropdown handling. See playwright_browser_api.md for complete API reference on fill(), select_option(), set_input_files(), and expect_file_chooser().
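As an illustration of one step in such a flow, a step function might look like the sketch below; every selector and data key here is hypothetical:

```python
async def fill_shipping_step(page, data):
    """Fill one step of a multi-step form, advance, and wait for the
    next step to render. All selectors are illustrative placeholders."""
    await page.fill("#address", data["address"])
    await page.select_option("#country", data["country"])
    await page.set_input_files("#proof-upload", data["file_path"])
    await page.click("button:has-text('Next')")
    # Wait for the next step's content to appear, not just navigation
    await page.wait_for_selector("[data-testid=payment-step]")
```

Keeping each step in its own function makes partial retries possible: if step 3 fails, you can re-run it without redoing steps 1 and 2.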
3. Screenshot & PDF Capture
- Full page: await page.screenshot(path="full.png", full_page=True)
- Element: await page.locator("div.chart").screenshot(path="chart.png")
- PDF (Chromium only): await page.pdf(path="out.pdf", format="A4", print_background=True)
- Visual regression: Take screenshots at known states, store baselines in version control with naming: {page}_{viewport}_{state}.png
See playwright_browser_api.md for full screenshot/PDF options.
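A small helper (hypothetical, not part of Playwright) keeps the baseline naming convention consistent across a suite:

```python
def baseline_path(page_name, viewport, state, base_dir="baselines"):
    """Build a '{page}_{viewport}_{state}.png' baseline path.

    viewport is a (width, height) tuple, e.g. (1920, 1080), producing
    names like baselines/checkout_1920x1080_filled.png
    """
    width, height = viewport
    return f"{base_dir}/{page_name}_{width}x{height}_{state}.png"
```

The path would then be passed straight to page.screenshot(path=...), so every baseline encodes what page, viewport, and UI state it captures.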
4. Structured Data Extraction
Core extraction patterns:
- Tables to JSON — Extract <thead> headers and <tbody> rows into dictionaries
- Listings to arrays — Map repeating card elements using a field-selector map (supports ::attr() for attributes)
- Nested/threaded data — Recursive extraction for comments with replies, category trees
See data_extraction_recipes.md for complete extraction functions, price parsing, data cleaning utilities, and output format helpers (JSON, CSV, JSONL).
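The workflow examples below call extract_listings, which lives in data_extraction_recipes.md; as a minimal sketch of what such a listings-to-arrays helper could look like, assuming a {field: selector} map where a selector may end in ::attr(name):

```python
async def extract_listings(page, container_selector, field_map):
    """Map repeating card elements to dicts via a field-selector map.

    A selector ending in ::attr(name) reads that attribute; anything
    else reads the element's inner text.
    """
    records = []
    for card in await page.query_selector_all(container_selector):
        record = {}
        for field, selector in field_map.items():
            if "::attr(" in selector:
                css, _, attr = selector.partition("::attr(")
                el = await card.query_selector(css.strip())
                record[field] = await el.get_attribute(attr.rstrip(")")) if el else None
            else:
                el = await card.query_selector(selector)
                record[field] = (await el.inner_text()).strip() if el else None
        records.append(record)
    return records
```

Missing fields come back as None instead of raising, so one malformed card does not abort a whole page of results.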
5. Cookie & Session Management
- Save/restore cookies: context.cookies() and context.add_cookies()
- Full storage state (cookies + localStorage): context.storage_state(path="state.json") to save, browser.new_context(storage_state="state.json") to restore
Best practice: Save state after login, reuse across scraping sessions. Check session validity before starting a long job — make a lightweight request to a protected page and verify you are not redirected to login. See playwright_browser_api.md for cookie and storage state API details.
6. Anti-Detection Patterns
Modern websites detect automation through multiple vectors. Apply these in priority order:
- WebDriver flag removal — Remove navigator.webdriver = true via init script (critical)
- Custom user agent — Rotate through real browser UAs; never use the default headless UA
- Realistic viewport — Set 1920x1080 or similar real-world dimensions (default 800x600 is a red flag)
- Request throttling — Add random.uniform() delays between actions
- Proxy support — Per-browser or per-context proxy configuration
See anti_detection_patterns.md for the complete stealth stack: navigator property hardening, WebGL/canvas fingerprint evasion, behavioral simulation (mouse movement, typing speed, scroll patterns), proxy rotation strategies, and detection self-test URLs.
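A minimal sketch of the first three items. The UA strings are placeholders (maintain your own list of current real-browser UAs), and the init script would be registered with context.add_init_script():

```python
import random

# Placeholder UA strings; in practice, rotate real, current browser UAs
REAL_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]

# Init script that hides the webdriver flag on every new document
HIDE_WEBDRIVER = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)

def stealth_context_kwargs():
    """Keyword arguments for browser.new_context(): a realistic
    viewport plus a user agent rotated per context."""
    return {
        "viewport": {"width": 1920, "height": 1080},
        "user_agent": random.choice(REAL_USER_AGENTS),
    }
```

A context would then be created with browser.new_context(**stealth_context_kwargs()) followed by context.add_init_script(HIDE_WEBDRIVER).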
7. Dynamic Content Handling
- SPA rendering: Wait for content selectors (wait_for_selector), not the page load event
- AJAX/Fetch waiting: Use page.expect_response("**/api/data*") to intercept and wait for specific API calls
- Shadow DOM: Playwright pierces open Shadow DOM with the >> operator: page.locator("custom-element >> .inner-class")
- Lazy-loaded images: Scroll elements into view with scroll_into_view_if_needed() to trigger loading
See playwright_browser_api.md for wait strategies, network interception, and Shadow DOM details.
8. Error Handling & Retry Logic
- Retry with backoff: Wrap page interactions in retry logic with exponential backoff (e.g., 1s, 2s, 4s)
- Fallback selectors: On TimeoutError, try alternative selectors before failing
- Error-state screenshots: Capture page.screenshot(path="error-state.png") on unexpected failures for debugging
- Rate limit detection: Check for HTTP 429 responses and respect Retry-After headers
See anti_detection_patterns.md for the complete exponential backoff implementation and rate limiter class.
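A generic backoff wrapper along these lines (a sketch; tune the delays and narrow the caught exception types to your actual failure modes):

```python
import asyncio

async def retry_with_backoff(action, retries=3, base_delay=1.0):
    """Run an async page interaction, retrying with exponential backoff
    (base_delay, 2*base_delay, 4*base_delay, ...) before giving up."""
    for attempt in range(retries):
        try:
            return await action()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * (2 ** attempt))
```

Usage: `await retry_with_backoff(lambda: page.click("#submit"))`; the lambda defers the interaction so each retry re-executes it fresh.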
Workflows
Workflow 1: Single-Page Data Extraction
Scenario: Extract product data from a single page with JavaScript-rendered content.
Steps:
- Launch browser in headed mode during development (headless=False), switch to headless for production
- Navigate to URL and wait for content selector
- Extract data using query_selector_all with field mapping
- Validate extracted data (check for nulls, expected types)
- Output as JSON
```python
from playwright.async_api import async_playwright

async def extract_single_page(url, selectors):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 ..."
        )
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle")
        data = await extract_listings(page, selectors["container"], selectors["fields"])
        await browser.close()
        return data
```
Workflow 2: Multi-Page Scraping with Pagination
Scenario: Scrape search results across 50+ pages.
Steps:
- Launch browser with anti-detection settings
- Navigate to first page
- Extract data from current page
- Check if "Next" button exists and is enabled
- Click next, wait for new content to load (not just navigation)
- Repeat until no next page or max pages reached
- Deduplicate results by unique key
- Write output incrementally (don't hold everything in memory)
```python
from playwright.async_api import async_playwright

async def scrape_paginated(base_url, selectors, max_pages=100):
    all_data = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await (await browser.new_context()).new_page()
        await page.goto(base_url)
        for page_num in range(max_pages):
            items = await extract_listings(page, selectors["container"], selectors["fields"])
            all_data.extend(items)
            next_btn = page.locator(selectors["next_button"])
            if await next_btn.count() == 0 or await next_btn.is_disabled():
                break
            await next_btn.click()
            await page.wait_for_selector(selectors["container"])
            await human_delay(800, 2000)
        await browser.close()
    return all_data
```
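The human_delay call above is not a Playwright API; it is a small helper, sketched here:

```python
import asyncio
import random

async def human_delay(min_ms, max_ms):
    """Sleep a random, human-looking interval between actions; returns
    the delay actually used, in milliseconds."""
    delay_ms = random.uniform(min_ms, max_ms)
    await asyncio.sleep(delay_ms / 1000)
    return delay_ms
```

Randomized pauses avoid the fixed-interval request cadence that rate limiters and bot detectors key on.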
Workflow 3: Authenticated Workflow Automation
Scenario: Log into a portal, navigate a multi-step form, download a report.
Steps:
- Check for existing session state file
- If no session, perform login and save state
- Navigate to target page using saved session
- Fill multi-step form with provided data
- Wait for download to trigger
- Save downloaded file to target directory
```python
import os

from playwright.async_api import async_playwright

async def authenticated_workflow(credentials, form_data, download_dir):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        state_file = "session_state.json"
        # Restore or create session
        if os.path.exists(state_file):
            context = await browser.new_context(storage_state=state_file)
        else:
            context = await browser.new_context()
            page = await context.new_page()
            await login(page, credentials["url"], credentials["user"], credentials["pass"])
            await context.storage_state(path=state_file)
        page = await context.new_page()
        await page.goto(form_data["target_url"])
        # Fill form steps
        for step_fn in [fill_step_1, fill_step_2]:
            await step_fn(page, form_data)
        # Handle download
        async with page.expect_download() as dl_info:
            await page.click("button:has-text('Download Report')")
        download = await dl_info.value
        await download.save_as(os.path.join(download_dir, download.suggested_filename))
        await browser.close()
```
Tools Reference
| Script | Purpose | Key Flags | Output |
|---|---|---|---|
| | Generate Playwright scraping script skeleton | | Python script or JSON config |
| | Generate form-fill automation script from field spec | | Python automation script |
| | Audit a Playwright script for detection vectors | | Risk report with score |

All scripts are stdlib-only. Run python3 <script> --help for full usage.
Anti-Patterns
Hardcoded Waits
Bad: await page.wait_for_timeout(5000) before every action.
Good: Use wait_for_selector, wait_for_url, expect_response, or wait_for_load_state. Hardcoded waits are flaky and slow.
No Error Recovery
Bad: Linear script that crashes on first failure.
Good: Wrap each page interaction in try/except. Take error-state screenshots. Implement retry with exponential backoff.
Ignoring robots.txt
Bad: Scraping without checking robots.txt directives.
Good: Fetch and parse robots.txt before scraping. Respect Crawl-delay. Skip disallowed paths. Add your bot name to User-Agent if running at scale.
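The standard library's urllib.robotparser handles both the Disallow and Crawl-delay directives; for example (the robots.txt content and bot name are illustrative):

```python
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check a path before queueing it, and honor the declared delay
allowed = rp.can_fetch("my-scraper-bot", "https://example.com/private/report")
delay = rp.crawl_delay("my-scraper-bot")
```

In a live scraper you would call rp.set_url("https://example.com/robots.txt") and rp.read() instead of parsing a literal string, then gate every URL through can_fetch().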
Storing Credentials in Scripts
Bad: Hardcoding usernames and passwords in Python files.
Good: Use environment variables, .env files (gitignored), or a secrets manager. Pass credentials via CLI arguments.
No Rate Limiting
Bad: Hammering a site with 100 requests/second.
Good: Add random delays between requests (1-3s for polite scraping). Monitor for 429 responses. Implement exponential backoff.
Selector Fragility
Bad: Relying on auto-generated class names (.css-1a2b3c) or deep nesting (div > div > div > span:nth-child(3)).
Good: Use data attributes, semantic HTML, or text-based locators. Test selectors in browser DevTools first.
Not Cleaning Up Browser Instances
Bad: Launching browsers without closing them, leading to resource leaks.
Good: Always use try/finally or async context managers to ensure browser.close() is called.
Running Headed in Production
Bad: Using headless=False in production/CI.
Good: Develop with headed mode for debugging, deploy with headless=True. Use an environment variable to toggle: headless = os.environ.get("HEADLESS", "true") == "true".
Cross-References
- playwright-pro — Browser testing skill. Use for E2E tests, test assertions, test fixtures. Browser Automation is for data extraction and workflow automation, not testing.
- api-test-suite-builder — When the website has a public API, hit the API directly instead of scraping the rendered page. Faster, more reliable, less detectable.
- performance-profiler — If your automation scripts are slow, profile the bottlenecks before adding concurrency.
- env-secrets-manager — For securely managing credentials used in authenticated automation workflows.