Web Spider

Crawl and analyze websites.
Prerequisites
```bash
# curl for fetching
curl --version

# Gemini for analysis
pip install google-generativeai
export GEMINI_API_KEY=your_api_key

# Optional: better HTML parsing
npm install -g puppeteer
pip install beautifulsoup4
```
Basic Crawling
Fetch Page
```bash
# Simple fetch
curl -s "https://example.com"

# With headers
curl -s -H "User-Agent: Mozilla/5.0" "https://example.com"

# Follow redirects
curl -sL "https://example.com"

# Save to file
curl -s "https://example.com" -o page.html

# Get headers only
curl -sI "https://example.com"

# Get headers and body
curl -sD - "https://example.com"
```
Extract Links
```bash
# Extract all links from a page
curl -s "https://example.com" | grep -oE 'href="[^"]*"' | sed 's/href="//;s/"$//'

# Filter to specific domain
curl -s "https://example.com" | grep -oE 'href="https?://example.com[^"]*"'

# Unique links
curl -s "https://example.com" | grep -oE 'href="[^"]*"' | sort -u
```
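The grep pattern above only matches double-quoted `href` attributes. When beautifulsoup4 isn't installed, a short stdlib-Python parser is a more robust alternative; this is a sketch, and the `extract_links` helper name is made up for this example:

```bash
# Extract href values with Python's stdlib HTML parser. Unlike the grep
# pattern, this also catches single-quoted attribute values.
extract_links() {
  python3 -c "
import sys
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    print(value)

LinkExtractor().feed(sys.stdin.read())
"
}

curl -s "https://example.com" | extract_links
```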
Quick Site Scan
```bash
#!/bin/bash
URL=$1
echo "=== Scanning: $URL ==="
echo ""
echo "### Response Info ###"
curl -sI "$URL" | head -20
echo ""
echo "### Links Found ###"
curl -s "$URL" | grep -oE 'href="[^"]*"' | sort -u | head -20
echo ""
echo "### Scripts ###"
curl -s "$URL" | grep -oE 'src="[^"]*\.js[^"]*"' | sort -u
echo ""
echo "### Meta Tags ###"
curl -s "$URL" | grep -oE '<meta[^>]*>' | head -10
```

AI-Powered Analysis
Page Analysis
```bash
CONTENT=$(curl -s "https://example.com")
gemini -m pro -o text -e "" "Analyze this webpage:
$CONTENT
Provide:
1. What is this page about?
2. Key information/content
3. Navigation structure
4. Technical observations (framework, libraries)
5. SEO observations"
```

Security Scan
```bash
URL="https://example.com"

# Gather information
HEADERS=$(curl -sI "$URL")
CONTENT=$(curl -s "$URL" | head -1000)
gemini -m pro -o text -e "" "Security scan this website:
URL: $URL
HEADERS:
$HEADERS
CONTENT SAMPLE:
$CONTENT
Check for:
- Missing security headers (CSP, HSTS, X-Frame-Options)
- Exposed sensitive information
- Potential vulnerabilities in scripts/forms
- Cookie security settings
- HTTPS configuration"
```
Extract Structured Data
```bash
CONTENT=$(curl -s "https://example.com/products")
gemini -m pro -o text -e "" "Extract structured data from this page:
$CONTENT
Extract into JSON format:
- Product names
- Prices
- Descriptions
- Any available metadata"
```

Multi-Page Crawling
Crawl and Analyze
```bash
#!/bin/bash
BASE_URL=$1
MAX_PAGES=${2:-10}

# Get initial links
LINKS=$(curl -s "$BASE_URL" | grep -oE "href=\"$BASE_URL[^\"]*\"" | sed 's/href="//;s/"$//' | sort -u | head -n "$MAX_PAGES")
echo "Found $(echo "$LINKS" | wc -l) pages"
for link in $LINKS; do
  echo ""
  echo "=== $link ==="
  TITLE=$(curl -s "$link" | grep -oE '<title>[^<]*</title>' | sed 's/<[^>]*>//g')
  echo "Title: $TITLE"
done
```
Sitemap Processing
```bash
# Fetch and parse sitemap
curl -s "https://example.com/sitemap.xml" | grep -oE '<loc>[^<]*</loc>' | sed 's/<[^>]*>//g'

# Crawl pages from sitemap
curl -s "https://example.com/sitemap.xml" |
  grep -oE '<loc>[^<]*</loc>' |
  sed 's/<[^>]*>//g' |
  while read -r url; do
    echo "Processing: $url"
    # Your processing here
  done
```
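Large sites often publish a sitemap index whose `<loc>` entries point at further sitemap files rather than pages. A hedged sketch that follows one level of nesting (the `extract_locs` and `crawl_sitemap` helper names are made up for this example):

```bash
# Pull <loc> values out of sitemap XML on stdin
extract_locs() {
  grep -oE '<loc>[^<]*</loc>' | sed 's/<[^>]*>//g'
}

# Follow one level of sitemap-index nesting: entries ending in .xml are
# treated as nested sitemaps and expanded; everything else is a page URL.
crawl_sitemap() {
  curl -s "$1" | extract_locs | while read -r loc; do
    case "$loc" in
      *.xml) curl -s "$loc" | extract_locs ;;
      *)     echo "$loc" ;;
    esac
  done
}

crawl_sitemap "https://example.com/sitemap.xml"
```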
Specific Extractions
Extract Text Content
```bash
# Remove HTML tags (basic)
curl -s "https://example.com" | sed 's/<[^>]*>//g' | tr -s ' \n'

# Using python
curl -s "https://example.com" | python3 -c "
from html.parser import HTMLParser
import sys

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []
    def handle_data(self, data):
        self.text.append(data.strip())

p = TextExtractor()
p.feed(sys.stdin.read())
print(' '.join(filter(None, p.text)))
"
```
Extract API Endpoints
```bash
CONTENT=$(curl -s "https://example.com/app.js")

# Find API calls in JS
echo "$CONTENT" | grep -oE "(fetch|axios|http)\(['\"][^'\"]*['\"]" | sort -u

# AI extraction
gemini -m pro -o text -e "" "Extract API endpoints from this JavaScript:
$CONTENT
List all:
- API URLs
- HTTP methods used
- Request patterns"
```
Monitor Changes
```bash
#!/bin/bash
URL=$1
HASH_FILE="/tmp/page-hash-$(echo "$URL" | md5sum | cut -d' ' -f1)"
CURRENT=$(curl -s "$URL" | md5sum | cut -d' ' -f1)
if [ -f "$HASH_FILE" ]; then
  PREVIOUS=$(cat "$HASH_FILE")
  if [ "$CURRENT" != "$PREVIOUS" ]; then
    echo "Page changed: $URL"
    # Notify or log
  fi
fi
echo "$CURRENT" > "$HASH_FILE"
```
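To turn the one-off check into ongoing monitoring, the script can be scheduled with cron; the script path and log location below are placeholders:

```bash
# Hypothetical crontab entry (add via crontab -e): check the page every
# 10 minutes and append output to a log file.
*/10 * * * * /path/to/monitor.sh "https://example.com" >> /tmp/page-monitor.log 2>&1
```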
Best Practices
- Respect robots.txt - Check before crawling
- Rate limit - Don't overwhelm servers
- Set User-Agent - Identify your crawler
- Handle errors - Sites go down, pages 404
- Cache responses - Don't re-fetch unnecessarily
- Be ethical - Only crawl what you're allowed to
- Check ToS - Some sites prohibit scraping
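The first two practices above can be sketched in shell. This is a naive prefix check only, not a real robots.txt parser (it ignores User-agent groups, wildcards, and Allow rules; a production crawler should use something like Python's urllib.robotparser), and the `is_disallowed` function name is made up for this example:

```bash
# Naive robots.txt check: succeed if the path starts with any Disallow
# prefix. Ignores User-agent groups, wildcards, and Allow rules.
is_disallowed() {
  local robots="$1" path="$2" prefix
  for prefix in $(printf '%s\n' "$robots" | sed -nE 's/^[Dd]isallow:[[:space:]]*([^[:space:]]+).*/\1/p'); do
    case "$path" in
      "$prefix"*) return 0 ;;
    esac
  done
  return 1
}

# Example with an inline robots.txt; fetch the real one with:
#   ROBOTS=$(curl -s "https://example.com/robots.txt")
ROBOTS=$'User-agent: *\nDisallow: /private/'
if is_disallowed "$ROBOTS" "/private/admin"; then
  echo "Skipping disallowed path"
fi
sleep 1   # rate limit: pause between consecutive requests
```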