Web Spider

Crawl and analyze websites.
Prerequisites
```bash
# curl for fetching
curl --version

# Gemini for analysis
pip install google-generativeai
export GEMINI_API_KEY=your_api_key

# Optional: better HTML parsing
npm install -g puppeteer
pip install beautifulsoup4
```
Basic Crawling
Fetch Page
```bash
# Simple fetch
curl -s "https://example.com"

# With headers
curl -s -H "User-Agent: Mozilla/5.0" "https://example.com"

# Follow redirects
curl -sL "https://example.com"

# Save to file
curl -s "https://example.com" -o page.html

# Get headers only
curl -sI "https://example.com"

# Get headers and body
curl -sD - "https://example.com"
```
Extract Links
```bash
# Extract all links from a page
curl -s "https://example.com" | grep -oE 'href="[^"]*"' | sed 's/href="//;s/"$//'

# Filter to specific domain
curl -s "https://example.com" | grep -oE 'href="https?://example.com[^"]*"'

# Unique links
curl -s "https://example.com" | grep -oE 'href="[^"]*"' | sort -u
```
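The grep pattern above only matches double-quoted `href` attributes. When beautifulsoup4 isn't installed, a short stdlib-Python parser is a more robust alternative; this is a sketch, and the `extract_links` helper name is made up for this example:

```bash
# Extract href values with Python's stdlib HTML parser. Unlike the grep
# pattern, this also catches single-quoted attribute values.
extract_links() {
  python3 -c "
import sys
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    print(value)

LinkExtractor().feed(sys.stdin.read())
"
}

curl -s "https://example.com" | extract_links
```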
Quick Site Scan
```bash
#!/bin/bash
URL=$1
echo "=== Scanning: $URL ==="
echo ""
echo "### Response Info ###"
curl -sI "$URL" | head -20
echo ""
echo "### Links Found ###"
curl -s "$URL" | grep -oE 'href="[^"]*"' | sort -u | head -20
echo ""
echo "### Scripts ###"
curl -s "$URL" | grep -oE 'src="[^"]*\.js[^"]*"' | sort -u
echo ""
echo "### Meta Tags ###"
curl -s "$URL" | grep -oE '<meta[^>]*>' | head -10
```

AI-Powered Analysis
Page Analysis
```bash
CONTENT=$(curl -s "https://example.com")
gemini -m pro -o text -e "" "Analyze this webpage:
$CONTENT
Provide:
1. What is this page about?
2. Key information/content
3. Navigation structure
4. Technical observations (framework, libraries)
5. SEO observations"
```

Security Scan
```bash
URL="https://example.com"

# Gather information
HEADERS=$(curl -sI "$URL")
CONTENT=$(curl -s "$URL" | head -1000)
gemini -m pro -o text -e "" "Security scan this website:
URL: $URL
HEADERS:
$HEADERS
CONTENT SAMPLE:
$CONTENT
Check for:
- Missing security headers (CSP, HSTS, X-Frame-Options)
- Exposed sensitive information
- Potential vulnerabilities in scripts/forms
- Cookie security settings
- HTTPS configuration"
```
Extract Structured Data
```bash
CONTENT=$(curl -s "https://example.com/products")
gemini -m pro -o text -e "" "Extract structured data from this page:
$CONTENT
Extract into JSON format:
- Product names
- Prices
- Descriptions
- Any available metadata"
```

Multi-Page Crawling
Crawl and Analyze
```bash
#!/bin/bash
BASE_URL=$1
MAX_PAGES=${2:-10}

# Get initial links
LINKS=$(curl -s "$BASE_URL" | grep -oE "href=\"$BASE_URL[^\"]*\"" | sed 's/href="//;s/"$//' | sort -u | head -n "$MAX_PAGES")
echo "Found $(echo "$LINKS" | wc -l) pages"
for link in $LINKS; do
  echo ""
  echo "=== $link ==="
  TITLE=$(curl -s "$link" | grep -oE '<title>[^<]*</title>' | sed 's/<[^>]*>//g')
  echo "Title: $TITLE"
done
```
Sitemap Processing
```bash
# Fetch and parse sitemap
curl -s "https://example.com/sitemap.xml" | grep -oE '<loc>[^<]*</loc>' | sed 's/<[^>]*>//g'

# Crawl pages from sitemap
curl -s "https://example.com/sitemap.xml" |
  grep -oE '<loc>[^<]*</loc>' |
  sed 's/<[^>]*>//g' |
  while read -r url; do
    echo "Processing: $url"
    # Your processing here
  done
```
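Large sites often publish a sitemap index whose `<loc>` entries point at further sitemap files rather than pages. A hedged sketch that follows one level of nesting (the `extract_locs` and `crawl_sitemap` helper names are made up for this example):

```bash
# Pull <loc> values out of sitemap XML on stdin
extract_locs() {
  grep -oE '<loc>[^<]*</loc>' | sed 's/<[^>]*>//g'
}

# Follow one level of sitemap-index nesting: entries ending in .xml are
# treated as nested sitemaps and expanded; everything else is a page URL.
crawl_sitemap() {
  curl -s "$1" | extract_locs | while read -r loc; do
    case "$loc" in
      *.xml) curl -s "$loc" | extract_locs ;;
      *)     echo "$loc" ;;
    esac
  done
}

crawl_sitemap "https://example.com/sitemap.xml"
```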
Specific Extractions
Extract Text Content
```bash
# Remove HTML tags (basic)
curl -s "https://example.com" | sed 's/<[^>]*>//g' | tr -s ' \n'

# Using python
curl -s "https://example.com" | python3 -c "
from html.parser import HTMLParser
import sys

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []
    def handle_data(self, data):
        self.text.append(data.strip())

p = TextExtractor()
p.feed(sys.stdin.read())
print(' '.join(filter(None, p.text)))
"
```
Extract API Endpoints
```bash
CONTENT=$(curl -s "https://example.com/app.js")

# Find API calls in JS
echo "$CONTENT" | grep -oE "(fetch|axios|http)\(['\"][^'\"]*['\"]" | sort -u

# AI extraction
gemini -m pro -o text -e "" "Extract API endpoints from this JavaScript:
$CONTENT
List all:
- API URLs
- HTTP methods used
- Request patterns"
```
Monitor Changes
```bash
#!/bin/bash
URL=$1
HASH_FILE="/tmp/page-hash-$(echo "$URL" | md5sum | cut -d' ' -f1)"
CURRENT=$(curl -s "$URL" | md5sum | cut -d' ' -f1)
if [ -f "$HASH_FILE" ]; then
  PREVIOUS=$(cat "$HASH_FILE")
  if [ "$CURRENT" != "$PREVIOUS" ]; then
    echo "Page changed: $URL"
    # Notify or log
  fi
fi
echo "$CURRENT" > "$HASH_FILE"
```
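To turn the one-off check into ongoing monitoring, the script can be scheduled with cron; the script path and log location below are placeholders:

```bash
# Hypothetical crontab entry (add via crontab -e): check the page every
# 10 minutes and append output to a log file.
*/10 * * * * /path/to/monitor.sh "https://example.com" >> /tmp/page-monitor.log 2>&1
```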
Best Practices
- Respect robots.txt - Check before crawling
- Rate limit - Don't overwhelm servers
- Set User-Agent - Identify your crawler
- Handle errors - Sites go down, pages 404
- Cache responses - Don't re-fetch unnecessarily
- Be ethical - Only crawl what you're allowed to
- Check ToS - Some sites prohibit scraping
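The first two practices above can be sketched in shell. This is a naive prefix check only, not a real robots.txt parser (it ignores User-agent groups, wildcards, and Allow rules; a production crawler should use something like Python's urllib.robotparser), and the `is_disallowed` function name is made up for this example:

```bash
# Naive robots.txt check: succeed if the path starts with any Disallow
# prefix. Ignores User-agent groups, wildcards, and Allow rules.
is_disallowed() {
  local robots="$1" path="$2" prefix
  for prefix in $(printf '%s\n' "$robots" | sed -nE 's/^[Dd]isallow:[[:space:]]*([^[:space:]]+).*/\1/p'); do
    case "$path" in
      "$prefix"*) return 0 ;;
    esac
  done
  return 1
}

# Example with an inline robots.txt; fetch the real one with:
#   ROBOTS=$(curl -s "https://example.com/robots.txt")
ROBOTS=$'User-agent: *\nDisallow: /private/'
if is_disallowed "$ROBOTS" "/private/admin"; then
  echo "Skipping disallowed path"
fi
sleep 1   # rate limit: pause between consecutive requests
```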