# Web Scraper

Fetch web page content (text + images) and save it as HTML or Markdown locally.

- **Minimal dependencies**: Only requires `requests` and `beautifulsoup4`; no browser automation.
- **Default behavior**: Downloads images to a local `images/` directory automatically.

## Quick start
### Single page

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format html --output /tmp/page.html
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --output /tmp/page.md
```

### Recursive (follow links)

```bash
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --max-depth 2 --output ~/Downloads/docs-archive
```

## Setup
Requires Python 3.8+ and minimal dependencies:

```bash
cd {baseDir}
pip install -r requirements.txt
```

Or install manually:

```bash
pip install requests beautifulsoup4
```

Note: No browser or driver needed; the tool uses pure HTTP requests.
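The pure-HTTP approach can be sketched in a few lines. This is an illustrative minimum, not the actual `scrape.py` implementation, and the User-Agent string is just an example value:

```python
import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent header (example value) reduces the chance
# of being rejected by sites that filter obvious bots.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}

def fetch_page(url: str, timeout: float = 10.0) -> BeautifulSoup:
    """Fetch a page with a plain HTTP request and parse it; no browser needed."""
    resp = requests.get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()  # surface 4xx/5xx responses as exceptions
    return BeautifulSoup(resp.text, "html.parser")
```

From the parsed soup, text, links, and image tags can then be pulled out with `soup.get_text()`, `soup.find_all("a")`, and `soup.find_all("img")`.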
## Inputs to collect
### Single page mode

- **URL**: The web page to scrape (required)
- **Format**: `html` or `md` (default: `html`)
- **Output path**: Where to save the file (default: current directory with an auto-generated name)
- **Images**: Downloads images by default (use `--no-download-images` to disable)
### Recursive mode (--recursive)

- **URL**: Starting point for recursive scraping
- **Format**: `html` or `md`
- **Output directory**: Where to save all scraped pages
- **Max depth**: How many levels deep to follow links (default: 2)
- **Max pages**: Maximum total pages to scrape (default: 50)
- **Domain filter**: Whether to stay within the same domain (default: yes)
- **Images**: Downloads images by default
## Conversation Flow

- Ask the user for the URL to scrape
- Ask for the preferred output format (HTML or Markdown)
  - Note: Both formats include text and images by default
  - HTML: Preserves the original structure with downloaded images
  - Markdown: Clean text format with downloaded images in an `images/` folder
- For recursive mode: Ask for max depth and max pages (optional; sensible defaults exist)
- Ask where to save (or suggest a default path like `/tmp/` or `~/Downloads/`)
- Run the script and confirm success
- Show the saved file/directory path
## Examples

### Single Page Scraping

#### Save as HTML

```bash
{baseDir}/scripts/scrape.py --url "https://docs.openclaw.ai/start/quickstart" --format html --output ~/Downloads/openclaw-quickstart.html
```

#### Save as Markdown (with images, default)

```bash
{baseDir}/scripts/scrape.py --url "https://en.wikipedia.org/wiki/Web_scraping" --format md --output ~/Documents/web-scraping.md
```

Result: Creates `web-scraping.md` plus an `images/` folder containing all downloaded images (text + images).

#### Without downloading images (optional)

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --no-download-images
```

Result: Text plus original image URLs only (images are not downloaded locally).

#### Auto-generate filename

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format html
```

Saves to: `example-com-{timestamp}.html`
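A safe filename of that shape can be derived from the URL roughly as follows. The script's exact naming scheme may differ, so treat this as a hypothetical sketch:

```python
import re
import time
from urllib.parse import urlparse

def auto_filename(url: str, fmt: str = "html") -> str:
    """Build a filesystem-safe name from the URL's host plus a timestamp."""
    host = urlparse(url).netloc or "page"
    # Replace anything outside [a-z0-9] with "-", e.g. example.com -> example-com
    slug = re.sub(r"[^a-z0-9]+", "-", host.lower()).strip("-")
    return f"{slug}-{int(time.time())}.{fmt}"
```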
### Recursive Scraping
#### Basic recursive crawl (depth 2, same domain, with images)

```bash
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --output ~/Downloads/docs-archive
```

Output structure (text + images for all pages):

```
docs-archive/
├── index.md
├── getting-started.md
├── api/
│   ├── authentication.md
│   └── endpoints.md
└── images/          # Shared images from all pages
    ├── logo.png
    └── diagram.svg
```

#### Deep crawl with custom limits
```bash
{baseDir}/scripts/scrape.py \
  --url "https://blog.example.com" \
  --format html \
  --recursive \
  --max-depth 3 \
  --max-pages 100 \
  --output ~/Archives/blog-backup
```

#### Ignore robots.txt (use with caution)
```bash
{baseDir}/scripts/scrape.py \
  --url "https://example.com" \
  --format md \
  --recursive \
  --no-respect-robots \
  --rate-limit 1.0
```

#### Faster scraping (reduced rate limit)
```bash
{baseDir}/scripts/scrape.py \
  --url "https://yoursite.com" \
  --format md \
  --recursive \
  --rate-limit 0.2
```

## Features
### Single Page Mode

- **HTML output**: Preserves the original page structure
  - ✅ Clean, readable HTML document
  - ✅ All images downloaded to an `images/` folder
  - ✅ Suitable for offline viewing
- **Markdown output**: Extracts clean text content
  - ✅ Auto-downloads images to a local `images/` directory (default)
  - ✅ Converts image URLs to relative paths
  - ✅ Clean, readable format for archiving
  - ✅ Falls back to original URLs if a download fails
  - Use the `--no-download-images` flag to keep original URLs only
- **Simple and fast**: Pure HTTP requests, no browser needed
- **Auto filename**: Generates a safe filename from the URL if none is specified
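The image handling described above (download, rewrite to a relative path, fall back to the original URL on failure) might look roughly like this. It is a sketch of the technique, not the script's actual code:

```python
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def localize_images(soup: BeautifulSoup, page_url: str, out_dir: str) -> None:
    """Download each <img> into out_dir/images/ and rewrite its src to a
    relative path; on failure, keep the resolved original URL instead."""
    img_dir = os.path.join(out_dir, "images")
    os.makedirs(img_dir, exist_ok=True)
    for img in soup.find_all("img"):
        src = img.get("src")
        if not src:
            continue
        absolute = urljoin(page_url, src)  # resolve relative src values
        name = os.path.basename(urlparse(absolute).path) or "image"
        try:
            resp = requests.get(absolute, timeout=10)
            resp.raise_for_status()
            with open(os.path.join(img_dir, name), "wb") as fh:
                fh.write(resp.content)
            img["src"] = f"images/{name}"  # relative path for offline viewing
        except requests.RequestException:
            img["src"] = absolute          # fallback: original URL stays usable
```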
### Recursive Mode (--recursive)

- ✅ **Intelligent link discovery**: Automatically follows all links on crawled pages
- ✅ **Depth control**: `--max-depth` limits how many levels deep to crawl (default: 2)
- ✅ **Page limit**: `--max-pages` caps total pages to prevent runaway crawls (default: 50)
- ✅ **Domain filtering**: `--same-domain` keeps the crawl within the starting domain (default: on)
- ✅ **robots.txt compliance**: Respects the site's crawling rules by default
- ✅ **Rate limiting**: `--rate-limit` adds a delay between requests (default: 0.5s)
- ✅ **Smart URL filtering**: Skips images, scripts, CSS, and duplicate URLs
- ✅ **Progress tracking**: Real-time console output with success/fail/skip counts
- ✅ **Organized output**: Preserves the URL structure in the directory hierarchy
- ✅ **Efficient crawling**: Sequential requests with rate limiting to respect servers
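These controls compose naturally into a breadth-first crawl loop. The sketch below illustrates the technique under the assumption that a `fetch_links(url)` callable returns the hyperlinks found on a page (e.g. via `requests` + BeautifulSoup); it is not the script's actual code:

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(start_url, fetch_links, max_depth=2, max_pages=50,
          same_domain=True, rate_limit=0.5):
    """Breadth-first crawl with a depth limit, page cap, optional
    same-domain filter, duplicate skipping, and polite delays."""
    start_host = urlparse(start_url).netloc
    seen, order = {start_url}, []
    queue = deque([(start_url, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)                      # scrape/save the page here
        if depth >= max_depth:
            continue                           # don't follow links any deeper
        for link in fetch_links(url):
            link = urljoin(url, link)          # resolve relative hrefs
            if link in seen:
                continue                       # skip duplicate URLs
            if same_domain and urlparse(link).netloc != start_host:
                continue                       # stay on the starting domain
            seen.add(link)
            queue.append((link, depth + 1))
        time.sleep(rate_limit)                 # be polite between requests
    return order
```

With a fake link graph in place of real pages, `crawl` visits pages level by level and drops off-domain links.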
## Guardrails

### Single Page Mode

- Respect robots.txt and the site's terms of service
- Some sites may block automated access; this tool uses standard HTTP requests
- Large pages with many images may take time to download
### Recursive Mode

- **Start small**: Test with `--max-depth 1 --max-pages 10` first
- **Respect robots.txt**: The default is on; only use `--no-respect-robots` on your own sites
- **Rate limiting**: The 0.5s default is polite; don't go below 0.2s for public sites
- **Same domain**: Strongly recommended to keep `--same-domain` enabled
- **Monitor progress**: Watch for high fail rates (they may indicate blocking)
- **Storage**: Recursive crawls can generate many files; ensure sufficient disk space
- **Legal**: Ensure you have permission to crawl and archive the target site
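A robots.txt check like the default behavior described here can be built on the standard library's `urllib.robotparser`; a minimal sketch, which may differ from the script's actual mechanism:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    """Consult the site's robots.txt before fetching url."""
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()      # fetches robots.txt over HTTP
    except OSError:
        return True        # robots.txt unreachable: assume allowed
    return parser.can_fetch(user_agent, url)
```

`RobotFileParser.parse` can also be fed rule lines directly, which is handy for testing the logic without network access.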
## Troubleshooting

- **Connection errors**: Check your internet connection and URL validity
- **403/blocked**: Some sites block scrapers; the tool uses realistic User-Agent headers
- **Timeout**: Increase the `--timeout` value for slow-loading pages (in seconds)
- **Image download fails**: Failed images fall back to their original URLs
- **Missing images**: Some sites load images dynamically with JavaScript (not supported)