
# Web Spider

Crawl and analyze websites.

## Prerequisites

```bash
# curl for fetching
curl --version

# Gemini for analysis
pip install google-generativeai
export GEMINI_API_KEY=your_api_key

# Optional: better HTML parsing
npm install -g puppeteer
pip install beautifulsoup4
```

## Basic Crawling

### Fetch Page

```bash
# Simple fetch
curl -s "https://example.com"

# With headers
curl -s -H "User-Agent: Mozilla/5.0" "https://example.com"

# Follow redirects
curl -sL "https://example.com"

# Save to file
curl -s "https://example.com" -o page.html

# Get headers only
curl -sI "https://example.com"

# Get headers and body
curl -sD - "https://example.com"
```
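The one-off fetches above assume the site answers promptly. For crawling, a small wrapper with a timeout, retries, and HTTP-error detection is more robust. This is a sketch: the `fetch` name and the specific limits are our own choices, not curl defaults.

```shell
# A fetch wrapper with a timeout, retries, and HTTP-error detection.
# The function name and the limits are illustrative choices.
fetch() {
  # --max-time: abort any single attempt after 15 seconds
  # --retry:    retry transient failures (timeouts, 5xx) twice, 1s apart
  # --fail:     exit non-zero on HTTP 4xx/5xx so callers can react
  curl -sL --max-time 15 --retry 2 --retry-delay 1 --fail "$1"
}

# Example: a refused local port fails fast with a non-zero status
if ! fetch "http://127.0.0.1:9/" > /dev/null 2>&1; then
  echo "fetch failed as expected"
fi
```

In the scripts below, replacing bare `curl -s` calls with a wrapper like this lets you skip or log dead pages instead of silently processing empty output.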

### Extract Links

```bash
# Extract all links from a page
curl -s "https://example.com" | grep -oE 'href="[^"]*"' | sed 's/href="//;s/"$//'

# Filter to a specific domain (note the escaped dot)
curl -s "https://example.com" | grep -oE 'href="https?://example\.com[^"]*"'

# Unique links
curl -s "https://example.com" | grep -oE 'href="[^"]*"' | sort -u
```
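Extracted hrefs are often relative (`/about`, `page.html`) and can't be fetched directly. A naive resolver can prefix them with the base URL; this sketch ignores `../` segments and `<base>` tags, and `extract_absolute_links` is our own name.

```shell
# Resolve extracted hrefs against a base URL (naive sketch).
BASE="https://example.com"
extract_absolute_links() {
  grep -oE 'href="[^"]*"' | sed 's/href="//;s/"$//' |
  while read -r link; do
    case "$link" in
      http://*|https://*) echo "$link" ;;   # already absolute
      /*) echo "${BASE}${link}" ;;          # root-relative
      *)  echo "${BASE}/${link}" ;;         # path-relative (naive)
    esac
  done
}

# Demo on an inline snippet instead of a live fetch:
printf '<a href="/about">A</a> <a href="https://other.org/p">B</a>' |
  extract_absolute_links
# → https://example.com/about
#   https://other.org/p
```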

### Quick Site Scan

```bash
#!/bin/bash
URL=$1
echo "=== Scanning: $URL ==="

echo ""
echo "### Response Info ###"
curl -sI "$URL" | head -20

echo ""
echo "### Links Found ###"
curl -s "$URL" | grep -oE 'href="[^"]*"' | sort -u | head -20

echo ""
echo "### Scripts ###"
curl -s "$URL" | grep -oE 'src="[^"]*\.js[^"]*"' | sort -u

echo ""
echo "### Meta Tags ###"
curl -s "$URL" | grep -oE '<meta[^>]*>' | head -10
```

## AI-Powered Analysis

### Page Analysis

```bash
CONTENT=$(curl -s "https://example.com")

gemini -m pro -o text -e "" "Analyze this webpage:

$CONTENT

Provide:
1. What is this page about?
2. Key information/content
3. Navigation structure
4. Technical observations (framework, libraries)
5. SEO observations"
```
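A full page can exceed the model's input limit and inflate cost. Truncating the content before building the prompt keeps the call predictable; the 20000-byte cap here is an arbitrary illustration, not a Gemini requirement.

```shell
# Cap how much page content goes into the prompt (the 20000-byte
# default is an arbitrary illustrative choice).
truncate_for_prompt() {
  head -c "${1:-20000}"
}

# Demo: a 50000-character input is cut to 100 bytes.
# In practice: CONTENT=$(curl -s "https://example.com" | truncate_for_prompt)
printf 'x%.0s' $(seq 1 50000) | truncate_for_prompt 100 | wc -c
```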

### Security Scan

```bash
URL="https://example.com"

# Gather information
HEADERS=$(curl -sI "$URL")
CONTENT=$(curl -s "$URL" | head -n 1000)

gemini -m pro -o text -e "" "Security scan this website:
URL: $URL
HEADERS: $HEADERS
CONTENT SAMPLE: $CONTENT
Check for:
1. Missing security headers (CSP, HSTS, X-Frame-Options)
2. Exposed sensitive information
3. Potential vulnerabilities in scripts/forms
4. Cookie security settings
5. HTTPS configuration"
```
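The presence checks for security headers don't need a model at all; a few lines of shell can flag missing ones deterministically before you spend an AI call. A sketch, with `check_security_headers` being our own name:

```shell
# Offline check for common security headers (sketch; the function
# name and the header list are illustrative, not exhaustive).
check_security_headers() {
  headers="$(cat)"   # raw response headers on stdin
  for h in Strict-Transport-Security Content-Security-Policy \
           X-Frame-Options X-Content-Type-Options; do
    if printf '%s\n' "$headers" | grep -qi "^$h:"; then
      echo "present: $h"
    else
      echo "MISSING: $h"
    fi
  done
}

# Demo on canned headers; in practice:
#   curl -sI "$URL" | check_security_headers
printf 'HTTP/1.1 200 OK\r\nX-Frame-Options: DENY\r\n' | check_security_headers
```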

### Extract Structured Data

```bash
CONTENT=$(curl -s "https://example.com/products")

gemini -m pro -o text -e "" "Extract structured data from this page:

$CONTENT

Extract into JSON format:
- Product names
- Prices
- Descriptions
- Any available metadata"
```

## Multi-Page Crawling

### Crawl and Analyze

```bash
#!/bin/bash
BASE_URL=$1
MAX_PAGES=${2:-10}

# Get initial links
LINKS=$(curl -s "$BASE_URL" | grep -oE "href=\"$BASE_URL[^\"]*\"" | sed 's/href="//;s/"$//' | sort -u | head -n "$MAX_PAGES")
echo "Found $(echo "$LINKS" | wc -l) pages"

for link in $LINKS; do
  echo ""
  echo "=== $link ==="
  TITLE=$(curl -s "$link" | grep -oE '<title>[^<]*</title>' | sed 's/<[^>]*>//g')
  echo "Title: $TITLE"
done
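Once a crawl follows links from more than one page, the same URL tends to turn up repeatedly. A visited set avoids duplicate fetches; this sketch keeps it in a temp file, and the helper names are our own.

```shell
# Track visited URLs so the crawl never fetches a page twice
# (sketch; is_visited/mark_visited are illustrative names).
VISITED_FILE=$(mktemp)
is_visited()   { grep -qxF "$1" "$VISITED_FILE"; }
mark_visited() { is_visited "$1" || echo "$1" >> "$VISITED_FILE"; }

mark_visited "https://example.com/a"
mark_visited "https://example.com/a"          # deduplicated
is_visited "https://example.com/a" && echo "seen"
is_visited "https://example.com/b" || echo "not seen"
```

Inside the crawl loop above, guard each fetch with `is_visited "$link" && continue` and call `mark_visited "$link"` after processing.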

### Sitemap Processing

```bash
# Fetch and parse sitemap
curl -s "https://example.com/sitemap.xml" | grep -oE '<loc>[^<]*</loc>' | sed 's/<[^>]*>//g'

# Crawl pages from sitemap
curl -s "https://example.com/sitemap.xml" |
  grep -oE '<loc>[^<]*</loc>' |
  sed 's/<[^>]*>//g' |
  while read -r url; do
    echo "Processing: $url"
    # Your processing here
  done
```

## Specific Extractions

### Extract Text Content

```bash
# Remove HTML tags (basic)
curl -s "https://example.com" | sed 's/<[^>]*>//g' | tr -s ' \n'

# Using Python
curl -s "https://example.com" | python3 -c "
from html.parser import HTMLParser
import sys

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        self.text.append(data.strip())

p = TextExtractor()
p.feed(sys.stdin.read())
print(' '.join(filter(None, p.text)))
"
```

### Extract API Endpoints

```bash
CONTENT=$(curl -s "https://example.com/app.js")

# Find API calls in JS
echo "$CONTENT" | grep -oE "(fetch|axios|http)\(['\"][^'\"]*['\"]" | sort -u

# AI extraction
gemini -m pro -o text -e "" "Extract API endpoints from this JavaScript:
$CONTENT
List all:
- API URLs
- HTTP methods used
- Request patterns"
```

## Monitor Changes

```bash
#!/bin/bash
URL=$1
HASH_FILE="/tmp/page-hash-$(echo "$URL" | md5sum | cut -d' ' -f1)"

CURRENT=$(curl -s "$URL" | md5sum | cut -d' ' -f1)

if [ -f "$HASH_FILE" ]; then
  PREVIOUS=$(cat "$HASH_FILE")
  if [ "$CURRENT" != "$PREVIOUS" ]; then
    echo "Page changed: $URL"
    # Notify or log
  fi
fi

echo "$CURRENT" > "$HASH_FILE"
```
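The hash comparison tells you *that* a page changed, not *what* changed. If you also keep the previous snapshot on disk, a unified diff can report the changed lines; a sketch, with illustrative paths and a helper name of our own:

```shell
# Report what changed between two snapshots (sketch; pairs with the
# hash script above, snapshot_diff is our own name).
snapshot_diff() {
  # Keep only added/removed content lines from the unified diff,
  # dropping the ---/+++/@@ header lines.
  diff -u "$1" "$2" | grep -E '^[+-][^+-]' || echo "no changes"
}

# Demo with two local snapshots (paths are illustrative)
printf 'a\nb\n' > /tmp/snap-old
printf 'a\nc\n' > /tmp/snap-new
snapshot_diff /tmp/snap-old /tmp/snap-new
# → -b
#   +c
```

In the monitor script, save `curl -s "$URL"` to a per-URL snapshot file alongside the hash, and call `snapshot_diff` when the hashes differ.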

## Best Practices

1. Respect robots.txt - check before crawling
2. Rate limit - don't overwhelm servers
3. Set a User-Agent - identify your crawler
4. Handle errors - sites go down, pages 404
5. Cache responses - don't re-fetch unnecessarily
6. Be ethical - only crawl what you're allowed to
7. Check the ToS - some sites prohibit scraping
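The robots.txt check in point 1 can be sketched in a few lines of awk. This minimal reader only honors `Disallow` rules under `User-agent: *` and ignores `Allow`, wildcards, and crawl-delay; real crawlers should use a proper parser such as Python's `urllib.robotparser`.

```shell
# Minimal robots.txt reader (sketch: only blanket User-agent: * rules;
# disallowed_paths is our own name).
disallowed_paths() {
  awk 'tolower($1) == "user-agent:" { ua = $2 }
       tolower($1) == "disallow:" && ua == "*" && $2 != "" { print $2 }'
}

# Demo on canned rules; in practice:
#   curl -s "https://example.com/robots.txt" | disallowed_paths
printf 'User-agent: *\nDisallow: /private\nDisallow: /tmp\n' | disallowed_paths
# → /private
#   /tmp
```

For point 2, the cheapest rate limit is a `sleep 1` between fetches in any crawl loop; skip a URL when it matches one of the disallowed prefixes.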