Bright Data - Web Scraper

Scrape any webpage and get clean markdown content using Bright Data's Web Unlocker API. Automatically bypasses bot detection and CAPTCHA.

Setup

1. Get your API key: get a key from the Bright Data Dashboard.
2. Create a Web Unlocker zone: go to brightdata.com/cp, click "Add" (top-right), and select "Unlocker zone".
3. Set environment variables:

```bash
export BRIGHTDATA_API_KEY="your-api-key"
export BRIGHTDATA_UNLOCKER_ZONE="your-zone-name"
```
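
Before running the scraper, it can help to confirm that both variables are actually set. The following is a minimal sketch of such a pre-flight check; the function name `check_brightdata_env` is hypothetical and not part of the original scripts.

```bash
# Hypothetical helper (not part of the original scripts):
# report every required variable that is unset or empty.
check_brightdata_env() {
  missing=0
  [ -n "${BRIGHTDATA_API_KEY:-}" ] || {
    echo "error: BRIGHTDATA_API_KEY is not set" >&2
    missing=1
  }
  [ -n "${BRIGHTDATA_UNLOCKER_ZONE:-}" ] || {
    echo "error: BRIGHTDATA_UNLOCKER_ZONE is not set" >&2
    missing=1
  }
  # Returns 0 only when both variables are present.
  return "$missing"
}
```

A possible use is `check_brightdata_env && bash scripts/scrape.sh "https://example.com/article"`, so the scrape only starts when the configuration is complete.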

Usage

```bash
bash scripts/scrape.sh "url"
```

Parameters:
  • url (required): The webpage URL to scrape
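
The contents of `scripts/scrape.sh` are not shown here. As a rough sketch of what such a script might do, the functions below send the URL to the Web Unlocker API with `curl`. The endpoint `https://api.brightdata.com/request`, the JSON fields (`zone`, `url`, `format`, `data_format`), and the function names `build_payload` and `scrape_url` are all assumptions and should be checked against the Bright Data documentation.

```bash
# Assumption: request shape for the Web Unlocker API; verify against the official docs.

# Build the JSON request body. $1 = URL, $2 = zone name.
# Backslashes and double quotes in the URL are escaped so the JSON stays valid.
build_payload() {
  url_escaped=$(printf '%s\n' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"zone":"%s","url":"%s","format":"raw","data_format":"markdown"}' \
    "$2" "$url_escaped"
}

# POST the payload and print the response body (expected: markdown) to stdout.
# --fail makes curl exit non-zero on HTTP errors.
scrape_url() {
  curl --silent --show-error --fail \
    --request POST "https://api.brightdata.com/request" \
    --header "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
    --header "Content-Type: application/json" \
    --data "$(build_payload "$1" "${BRIGHTDATA_UNLOCKER_ZONE}")"
}
```

Keeping payload construction separate from the network call makes the JSON easy to inspect without sending a request.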
Examples:

```bash
# Scrape a news article
bash scripts/scrape.sh "https://example.com/article"

# Scrape a product page
bash scripts/scrape.sh "https://shop.example.com/product/123"
```
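
To scrape several pages in one go, the single-URL command can be wrapped in a loop. This is a sketch under the assumption that `scripts/scrape.sh` prints the markdown to stdout and exits non-zero on failure; the helper names `url_to_filename` and `scrape_all` are hypothetical.

```bash
# Turn a URL into a safe file name: drop the scheme, then replace every
# character outside [A-Za-z0-9._-] with "_", and append ".md".
url_to_filename() {
  name=$(printf '%s\n' "$1" | sed -e 's|^[a-zA-Z]*://||' -e 's|[^A-Za-z0-9._-]|_|g')
  printf '%s.md\n' "$name"
}

# Scrape each URL into its own file. $1 = output directory, rest = URLs.
scrape_all() {
  out_dir=$1
  shift
  mkdir -p "$out_dir"
  for url in "$@"; do
    if bash scripts/scrape.sh "$url" > "$out_dir/$(url_to_filename "$url")"; then
      echo "saved: $url" >&2
    else
      echo "failed: $url" >&2
    fi
  done
}
```

For example, `scrape_all out "https://example.com/article" "https://shop.example.com/product/123"` would write one `.md` file per URL into `out/`.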

Output Format

Returns clean markdown content extracted from the webpage:
```markdown
# Page Title

Main content of the page converted to markdown format...

## Section Heading

More content...
```
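
Because the output is plain markdown, it can be piped into standard text tools. As one illustration, the sketch below lists the headings of a scraped page, assuming the markdown uses ATX-style headings (lines starting with one to six `#` characters followed by a space); `list_headings` is a hypothetical helper name.

```bash
# Print only the ATX-style heading lines read from stdin.
# "|| true" keeps the exit status at 0 when a page has no headings.
list_headings() {
  grep -E '^#{1,6} ' || true
}
```

A typical invocation would be `bash scripts/scrape.sh "https://example.com/article" | list_headings`, which gives a quick outline of the page.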

Features

  • Bot Detection Bypass: Automatically handles anti-bot measures
  • CAPTCHA Solving: Bypasses CAPTCHA challenges
  • Clean Markdown: Returns well-formatted markdown content
  • JavaScript Rendering: Handles JavaScript-heavy pages

Dependencies

  • curl: for API requests
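
A script can verify this dependency up front and fail with a clear message instead of a "command not found" error midway. The following is a minimal sketch; `require_command` is a hypothetical helper, not part of the original scripts.

```bash
# Return non-zero with a message if the named command is not on PATH.
require_command() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "error: required command '$1' not found in PATH" >&2
    return 1
  fi
}
```

Calling `require_command curl || exit 1` near the top of a script stops it early when `curl` is missing.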