scrape
Bright Data - Web Scraper
Scrape any webpage and get clean markdown content using Bright Data's Web Unlocker API. Automatically bypasses bot detection and CAPTCHA.
Setup
1. Get your API Key:
Get a key from the Bright Data dashboard.
2. Create a Web Unlocker zone:
Create a zone at brightdata.com/cp: click "Add" (top-right), then select "Unlocker zone".
3. Set environment variables:
bash
export BRIGHTDATA_API_KEY="your-api-key"
export BRIGHTDATA_UNLOCKER_ZONE="your-zone-name"
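Before running the scraper, it can help to confirm that both variables are set. The following is a minimal sketch; `check_env` is a hypothetical helper and not part of the skill itself.

```bash
# Hypothetical helper: verify that the two variables from the setup step exist.
# Returns 0 when both are set and non-empty, 1 otherwise.
check_env() {
  local missing=0
  if [ -z "${BRIGHTDATA_API_KEY:-}" ]; then
    echo "Missing environment variable: BRIGHTDATA_API_KEY" >&2
    missing=1
  fi
  if [ -z "${BRIGHTDATA_UNLOCKER_ZONE:-}" ]; then
    echo "Missing environment variable: BRIGHTDATA_UNLOCKER_ZONE" >&2
    missing=1
  fi
  return "$missing"
}
```

A typical use would be `check_env || exit 1` at the top of a wrapper script.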
Usage
bash
bash scripts/scrape.sh "url"
Parameters:
- url (required): The webpage URL to scrape
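The contents of scripts/scrape.sh are not shown here. As a rough sketch of what such a script might do, the functions below build a JSON request and send it with curl. The endpoint `https://api.brightdata.com/request` and the `format` / `data_format` fields are assumptions based on Bright Data's public request API and should be checked against the current Bright Data documentation; the function names are hypothetical.

```bash
# Escape backslashes and double quotes so a value can sit inside a JSON string.
# (Control characters such as newlines are not handled in this sketch.)
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body. Field names are assumed, not taken from this README.
build_payload() {
  local zone url
  zone=$(json_escape "$1")
  url=$(json_escape "$2")
  printf '{"zone":"%s","url":"%s","format":"raw","data_format":"markdown"}' "$zone" "$url"
}

# Send the request and print the response body (expected to be markdown).
# The endpoint is an assumption; --fail makes curl return non-zero on HTTP errors.
scrape() {
  curl --silent --show-error --fail \
    --request POST "https://api.brightdata.com/request" \
    --header "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
    --header "Content-Type: application/json" \
    --data "$(build_payload "${BRIGHTDATA_UNLOCKER_ZONE}" "$1")"
}
```

With both environment variables exported, `scrape "https://example.com/article"` would print the response to standard output.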
Examples:
bash
# Scrape a news article
bash scripts/scrape.sh "https://example.com/article"
# Scrape a product page
bash scripts/scrape.sh "https://shop.example.com/product/123"
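To scrape several pages in one go, the single-URL command can be wrapped in a loop. This is a sketch that assumes scripts/scrape.sh writes the markdown to standard output and exits non-zero on failure; `url_to_filename` and `scrape_all` are hypothetical helpers, and the file naming scheme is only an illustration.

```bash
# Turn a URL into a filesystem-safe file name: drop the scheme, replace
# anything outside [A-Za-z0-9._-] with "_", and append ".md".
url_to_filename() {
  printf '%s' "$1" | sed -e 's|^[a-zA-Z]*://||' -e 's|[^A-Za-z0-9._-]|_|g' -e 's|$|.md|'
}

# Scrape every URL given after the output directory and save one file per URL.
scrape_all() {
  local out_dir=$1 url
  shift
  mkdir -p "$out_dir"
  for url in "$@"; do
    bash scripts/scrape.sh "$url" > "$out_dir/$(url_to_filename "$url")" \
      || echo "Failed to scrape: $url" >&2
  done
}
```

For example, `scrape_all out "https://example.com/article" "https://shop.example.com/product/123"` would write one markdown file per URL into the out directory.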
Output Format
Returns clean markdown content extracted from the webpage:
markdown
# Page Title
Main content of the page converted to markdown format...
## Section Heading
More content...
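Because the output is plain markdown, standard text tools can post-process it. As a small sketch, the hypothetical helper below lists the headings of a scraped page, assuming headings use the `#` prefix style shown in the sample output.

```bash
# Print only markdown headings (one to six "#" characters followed by a space).
# Reads from the files given as arguments, or from standard input if none.
list_headings() {
  grep -E '^#{1,6} ' "$@"
}
```

It could be combined with the scraper as `bash scripts/scrape.sh "https://example.com/article" | list_headings`. Note that grep exits with status 1 when no heading is found.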
Features
- Bot Detection Bypass: Automatically handles anti-bot measures
- CAPTCHA Solving: Bypasses CAPTCHA challenges
- Clean Markdown: Returns well-formatted markdown content
- JavaScript Rendering: Handles JavaScript-heavy pages
Dependencies
- curl - For API requests
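A quick way to confirm the dependency is present is to look it up with `command -v`. The helper name `require_cmd` below is hypothetical.

```bash
# Return 0 if the named command is available on PATH, 1 otherwise.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "Required command not found: $1" >&2
    return 1
  fi
}
```

For instance, `require_cmd curl || exit 1` stops a wrapper script early when curl is not installed.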