Web Scraping
Evidential Frame Activation
Source verification mode active.
[assert|neutral] Systematic web data extraction workflow with sequential-thinking planning phase [ground:skill-design:2026-01-12] [conf:0.85] [state:provisional]
Overview
Web scraping enables structured data extraction from web pages through the claude-in-chrome MCP server. This skill enforces a PLAN-READ-TRANSFORM pattern focused exclusively on data extraction without page modifications.
Philosophy: Data extraction fails when executed without understanding page structure. By mandating structure analysis before extraction, this skill improves data quality and reduces selector errors.
Methodology: Six-phase execution with emphasis on READ operations:
- PLAN Phase: Sequential-thinking MCP analyzes extraction requirements
- NAVIGATE Phase: Navigate to target URL(s)
- ANALYZE Phase: Understand page structure via read_page
- EXTRACT Phase: Pull data using get_page_text and javascript_tool
- TRANSFORM Phase: Convert to structured format (JSON/CSV/Markdown)
- STORE Phase: Persist to Memory MCP
Key Differentiator from browser-automation:
- READ-ONLY focus (no form submissions, no button clicks for actions)
- Emphasis on data transformation and output formats
- Pagination handling for multi-page datasets
- Rate limiting awareness for bulk extraction
- No account creation, no purchases, no write operations
Value Proposition: Transform unstructured web content into clean, structured datasets suitable for analysis, reporting, or system integration.
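The six phases above can be composed into a single pipeline. The sketch below is illustrative only: `runScrape` and the injected phase functions are hypothetical stand-ins for the MCP calls detailed in the Main Workflow section.

```javascript
// Illustrative six-phase pipeline; each injected phase function is a
// hypothetical stand-in for the MCP calls described later in this skill.
async function runScrape(targetUrl, phases) {
  const plan = await phases.plan(targetUrl);                 // PLAN: sequential-thinking
  const tabId = await phases.navigate(plan.target_url);      // NAVIGATE
  const structure = await phases.analyze(tabId);             // ANALYZE: read_page
  const raw = await phases.extract(tabId, plan, structure);  // EXTRACT
  const output = phases.transform(raw, plan.output_format);  // TRANSFORM
  await phases.store(output, plan);                          // STORE: Memory MCP
  return output;
}
```

Keeping the phases injectable like this also makes the workflow testable without a live browser.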
When to Use This Skill
Trigger Thresholds:
| Data Volume | Recommendation |
|---|---|
| Single element | Use get_page_text directly (too simple) |
| 5-50 elements | Consider this skill |
| 50+ elements or pagination | Mandatory use of this skill |
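The thresholds above can be encoded as a small dispatch helper; `chooseApproach` and its return labels are hypothetical names used for illustration only.

```javascript
// Hypothetical dispatch helper encoding the trigger thresholds above.
function chooseApproach(elementCount, hasPagination) {
  if (hasPagination || elementCount >= 50) return "web-scraping-skill"; // mandatory
  if (elementCount >= 5) return "consider-skill";                       // judgment call
  return "get_page_text";                                               // too simple
}
```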
Primary Use Cases:
- Product catalog extraction (prices, descriptions, images)
- Article content scraping (headlines, body text, metadata)
- Table data extraction (financial data, statistics, listings)
- Directory scraping (contact info, business listings)
- Research data collection (citations, abstracts, datasets)
- Price monitoring (competitive analysis, market research)
Apply When:
- Task requires structured output (JSON, CSV, Markdown table)
- Multiple pages need to be traversed (pagination)
- Data needs normalization or transformation
- Consistent data schema is required
- Historical data collection for trend analysis
When NOT to Use This Skill
- Form submissions or account actions (use browser-automation)
- Interactive workflows requiring clicks (use browser-automation)
- Single-page simple text extraction (use get_page_text directly)
- API-accessible data (use direct API calls instead)
- Real-time streaming data (use WebSocket or API)
- Content behind authentication requiring login (use browser-automation first)
- Sites with explicit anti-scraping measures (respect robots.txt)
Core Principles
Principle 1: Plan Before Extract
Mandate: ALWAYS invoke sequential-thinking MCP before data extraction.
Rationale: Web pages have complex DOM structures. Planning reveals data locations, identifies patterns, and anticipates pagination before extraction begins.
In Practice:
- Analyze page structure via read_page first
- Map data field locations to selectors
- Identify pagination patterns
- Plan output schema before extraction
- Define data validation rules
Principle 2: Read-Only Operations
Mandate: NEVER modify page state during extraction.
Rationale: Web scraping is about data retrieval, not interaction. Write operations (form submissions, clicks that change state) belong to browser-automation.
Allowed Operations:
- navigate (to target URLs)
- read_page (DOM analysis)
- get_page_text (text extraction)
- javascript_tool (read-only DOM queries)
- computer screenshot (visual verification)
- scroll (for lazy-loaded content)
Prohibited Operations:
- form_input (except for search filters if needed)
- computer left_click on submit buttons
- Any action that modifies page state
- Account creation or login actions
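One way to enforce the read-only mandate mechanically is an operation allowlist. The guard below is a sketch, not part of the MCP API; the operation names mirror the allowed list above.

```javascript
// Hypothetical guard enforcing Principle 2's read-only allowlist.
const READ_ONLY_OPERATIONS = new Set([
  "navigate", "read_page", "get_page_text",
  "javascript_tool", "screenshot", "scroll", "wait"
]);

function assertReadOnly(operation) {
  if (!READ_ONLY_OPERATIONS.has(operation)) {
    throw new Error(`Prohibited write operation in scraping mode: ${operation}`);
  }
}
```

Note that `form_input` is excluded entirely here; if search filters are genuinely required, extend the set deliberately rather than bypassing the guard.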
Principle 3: Structured Output First
Mandate: Define output schema before extraction begins.
Rationale: Knowing the target format guides extraction strategy and ensures consistent data quality across all pages.
Output Formats:
| Format | Use Case | Example |
|---|---|---|
| JSON | API integration, databases | `[{"name": "Widget", "price": 9.99}]` |
| CSV | Spreadsheet analysis | `name,price` header row, `Widget,9.99` data rows |
| Markdown Table | Documentation, reports | `\| name \| price \|` |
Principle 4: Pagination Awareness
Mandate: Detect and handle pagination patterns before starting extraction.
Rationale: Most valuable datasets span multiple pages. Failing to handle pagination yields incomplete data.
Pagination Patterns:
| Pattern | Detection | Handling |
|---|---|---|
| Numbered pages | Links with `?page=N` or `/page/N` URLs | Iterate through page numbers |
| Next button | "Next", ">" or arrow elements | Click until disabled/missing |
| Infinite scroll | Content loads on scroll | Scroll + wait + check for new content |
| Load more | "Load more" button | Click until no new content |
| Cursor-based | URL contains a cursor/token parameter | Extract cursor, iterate |
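The detection column above can be sketched as a pure classifier over a simplified page descriptor. The descriptor fields (`linkHrefs`, `buttonLabels`, `infiniteScrollHint`) are assumptions for illustration; in practice they would be derived from `read_page` / `find` results.

```javascript
// Sketch: classify pagination style from a simplified page descriptor.
// Descriptor shape is an assumption, not an MCP return type.
function detectPaginationType(page) {
  if (page.linkHrefs.some(h => /[?&](page|p)=\d+/.test(h))) return "numbered";
  if (page.linkHrefs.some(h => /[?&]cursor=/.test(h))) return "cursor";
  const labels = page.buttonLabels.map(l => l.toLowerCase());
  if (labels.some(l => l.includes("load more"))) return "load_more";
  if (labels.some(l => l === "next" || l === ">")) return "next_button";
  return page.infiniteScrollHint ? "infinite_scroll" : "none";
}
```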
Principle 5: Rate Limiting Respect
Mandate: Implement delays between requests to avoid overwhelming servers.
Rationale: Aggressive scraping can trigger blocks, harm server performance, and violate terms of service.
Guidelines:
| Request Type | Minimum Delay |
|---|---|
| Same domain | 1-2 seconds |
| Pagination | 2-3 seconds |
| Bulk extraction (100+ pages) | 3-5 seconds |
| If rate-limited | Exponential backoff (start 10s) |
Implementation:

```javascript
// Use computer wait action between page loads
await mcp__claude-in-chrome__computer({
  action: "wait",
  duration: 2, // seconds
  tabId: tabId
});
```
Production Guardrails
MCP Preflight Check Protocol
Before executing any web scraping workflow, run preflight validation:
Preflight Sequence:
```javascript
async function preflightCheck() {
  const checks = {
    sequential_thinking: false,
    claude_in_chrome: false,
    memory_mcp: false
  };

  // Check sequential-thinking MCP (required)
  try {
    await mcp__sequential-thinking__sequentialthinking({
      thought: "Preflight check - verifying MCP availability for web scraping",
      thoughtNumber: 1,
      totalThoughts: 1,
      nextThoughtNeeded: false
    });
    checks.sequential_thinking = true;
  } catch (error) {
    console.error("Sequential-thinking MCP unavailable:", error);
    throw new Error("CRITICAL: sequential-thinking MCP required but unavailable");
  }

  // Check claude-in-chrome MCP (required)
  try {
    const context = await mcp__claude-in-chrome__tabs_context_mcp({});
    checks.claude_in_chrome = true;
  } catch (error) {
    console.error("Claude-in-chrome MCP unavailable:", error);
    throw new Error("CRITICAL: claude-in-chrome MCP required but unavailable");
  }

  // Check memory-mcp (optional but recommended)
  try {
    checks.memory_mcp = true;
  } catch (error) {
    console.warn("Memory MCP unavailable - extracted data will not be persisted");
    checks.memory_mcp = false;
  }

  return checks;
}
```
Error Handling Framework
Error Categories:
| Category | Example | Recovery Strategy |
|---|---|---|
| MCP_UNAVAILABLE | Claude-in-chrome offline | ABORT with clear message |
| PAGE_NOT_FOUND | 404 error | Log, skip page, continue with others |
| ELEMENT_NOT_FOUND | Selector changed | Try alternative selectors, log schema drift |
| RATE_LIMITED | 429 response | Exponential backoff, wait, retry |
| CONTENT_CHANGED | Dynamic content not loaded | Wait longer, scroll to trigger load |
| PAGINATION_END | No more pages | Normal termination, return collected data |
| BLOCKED | Access denied, CAPTCHA | ABORT, notify user, do not bypass |
Error Recovery Pattern:
```javascript
async function extractWithRetry(extractor, context, maxRetries = 3) {
  let lastError = null;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const data = await extractor(context);
      return data;
    } catch (error) {
      lastError = error;
      console.error(`Extraction attempt ${attempt} failed:`, error.message);
      if (isRateLimitError(error)) {
        const delay = Math.pow(2, attempt) * 5; // 10s, 20s, 40s
        await sleep(delay * 1000);
      } else if (!isRecoverableError(error)) {
        break;
      }
    }
  }
  throw lastError;
}

function isRecoverableError(error) {
  const nonRecoverable = [
    "CRITICAL: sequential-thinking MCP required",
    "CRITICAL: claude-in-chrome MCP required",
    "Access denied",
    "CAPTCHA",
    "Blocked"
  ];
  return !nonRecoverable.some(msg => error.message.includes(msg));
}
```
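`extractWithRetry` relies on `sleep` and `isRateLimitError`, which the skill does not define. Minimal sketches follow, assuming rate-limit failures surface as error messages containing "429" or "rate limit"; adjust to however your extractor actually reports errors.

```javascript
// Minimal versions of the helpers extractWithRetry relies on.
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Assumption: rate-limit failures mention 429 or "rate limit" in the message.
function isRateLimitError(error) {
  return /429|rate limit/i.test(error.message);
}
```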
Data Validation Framework
Purpose: Ensure extracted data meets quality standards before storage.
Validation Rules:
```javascript
const VALIDATION_RULES = {
  required_fields: ["name", "url"], // Must be present
  field_types: {
    price: "number",
    name: "string",
    url: "url",
    date: "date"
  },
  constraints: {
    price: { min: 0, max: 1000000 },
    name: { minLength: 1, maxLength: 500 },
    url: { pattern: /^https?:\/\// }
  }
};

function validateRecord(record, rules) {
  const errors = [];
  // Check required fields
  for (const field of rules.required_fields) {
    if (!record[field]) {
      errors.push(`Missing required field: ${field}`);
    }
  }
  // Check field types
  for (const [field, type] of Object.entries(rules.field_types)) {
    if (record[field] && !isValidType(record[field], type)) {
      errors.push(`Invalid type for ${field}: expected ${type}`);
    }
  }
  return { valid: errors.length === 0, errors };
}
```
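`validateRecord` calls an `isValidType` helper that is not defined in this skill; a minimal sketch consistent with the type names in `VALIDATION_RULES`:

```javascript
// Sketch of the isValidType helper used by validateRecord.
function isValidType(value, type) {
  switch (type) {
    case "number": return typeof value === "number" && !Number.isNaN(value);
    case "string": return typeof value === "string";
    case "url":    return typeof value === "string" && /^https?:\/\//.test(value);
    case "date":   return !Number.isNaN(Date.parse(value));
    default:       return true; // Unknown types pass through
  }
}
```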
Main Workflow
Phase 1: Planning (MANDATORY)
Purpose: Analyze extraction requirements and define data schema.
Process:
- Invoke sequential-thinking MCP
- Define target data fields and types
- Plan selector strategy
- Identify pagination pattern
- Define output format
- Plan rate limiting strategy
Planning Questions:
- What data fields need to be extracted?
- What is the expected output format (JSON/CSV/Markdown)?
- Is pagination involved? What pattern?
- What validation rules apply?
- What rate limiting is appropriate?
Output Contract:
```yaml
extraction_plan:
  target_url: string
  output_format: "json" | "csv" | "markdown"
  schema:
    fields:
      - name: string
        selector: string
        type: string | number | url | date
        required: boolean
  pagination:
    type: "numbered" | "next_button" | "infinite_scroll" | "load_more" | "none"
    max_pages: number
    delay_seconds: number
  rate_limit:
    delay_between_requests: number
```
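A filled-in instance may clarify the shape of the contract; every value below (URL, selectors, limits) is illustrative only.

```yaml
extraction_plan:
  target_url: "https://example.com/products"
  output_format: "json"
  schema:
    fields:
      - name: "name"
        selector: ".product-title"
        type: "string"
        required: true
      - name: "price"
        selector: ".price"
        type: "number"
        required: true
  pagination:
    type: "numbered"
    max_pages: 10
    delay_seconds: 2
  rate_limit:
    delay_between_requests: 2
```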
Phase 2: Navigation
Purpose: Navigate to target URL and establish context.
Process:
- Get tab context (tabs_context_mcp)
- Create new tab for scraping (tabs_create_mcp)
- Navigate to starting URL
- Take initial screenshot for verification
- Wait for page load completion
Implementation:
```javascript
// 1. Get existing context
const context = await mcp__claude-in-chrome__tabs_context_mcp({});

// 2. Create dedicated tab
const newTab = await mcp__claude-in-chrome__tabs_create_mcp({});
const tabId = newTab.tabId;

// 3. Navigate to target
await mcp__claude-in-chrome__navigate({
  url: targetUrl,
  tabId: tabId
});

// 4. Wait for page load
await mcp__claude-in-chrome__computer({
  action: "wait",
  duration: 2,
  tabId: tabId
});

// 5. Screenshot initial state
await mcp__claude-in-chrome__computer({
  action: "screenshot",
  tabId: tabId
});
```
Phase 3: Structure Analysis
Purpose: Understand page DOM structure before extraction.
Process:
- Read page accessibility tree (read_page)
- Identify data container elements
- Map field selectors to schema
- Verify selectors exist
- Detect dynamic content patterns
Implementation:
```javascript
// Read page structure
const pageTree = await mcp__claude-in-chrome__read_page({
  tabId: tabId,
  filter: "all" // Get all elements for analysis
});

// For interactive elements only
const interactive = await mcp__claude-in-chrome__read_page({
  tabId: tabId,
  filter: "interactive" // Buttons, links, inputs
});

// Find specific elements
const products = await mcp__claude-in-chrome__find({
  query: "product cards or listing items",
  tabId: tabId
});

const prices = await mcp__claude-in-chrome__find({
  query: "price elements",
  tabId: tabId
});
```
Phase 4: Data Extraction
Purpose: Extract data according to defined schema.
Process:
- For simple text: Use get_page_text
- For structured data: Use javascript_tool
- For specific elements: Use find + read_page
- Handle lazy-loaded content with scroll
Extraction Methods:
Method 1: Full Page Text (simplest, for articles)
```javascript
const text = await mcp__claude-in-chrome__get_page_text({
  tabId: tabId
});
```

Method 2: JavaScript DOM Query (most flexible)
```javascript
const data = await mcp__claude-in-chrome__javascript_tool({
  action: "javascript_exec",
  tabId: tabId,
  text: `
    // Extract product data
    Array.from(document.querySelectorAll('.product-card')).map(card => ({
      name: card.querySelector('.product-title')?.textContent?.trim(),
      price: parseFloat(card.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '')),
      url: card.querySelector('a')?.href,
      image: card.querySelector('img')?.src
    }))
  `
});
```

Method 3: Table Extraction
```javascript
const tableData = await mcp__claude-in-chrome__javascript_tool({
  action: "javascript_exec",
  tabId: tabId,
  text: `
    // Extract table data
    const table = document.querySelector('table');
    const headers = Array.from(table.querySelectorAll('th')).map(th => th.textContent.trim());
    const rows = Array.from(table.querySelectorAll('tbody tr')).map(tr => {
      const cells = Array.from(tr.querySelectorAll('td')).map(td => td.textContent.trim());
      return headers.reduce((obj, header, i) => ({ ...obj, [header]: cells[i] }), {});
    });
    rows
  `
});
```

Method 4: Handling Lazy-Loaded Content
```javascript
// Scroll to load content
await mcp__claude-in-chrome__computer({
  action: "scroll",
  scroll_direction: "down",
  scroll_amount: 5,
  tabId: tabId,
  coordinate: [500, 500]
});

// Wait for content to load
await mcp__claude-in-chrome__computer({
  action: "wait",
  duration: 2,
  tabId: tabId
});

// Now extract
const data = await extractData(tabId);
```
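The methods above repeat the same inline price-parsing expression; a reusable version (a hypothetical helper, not an MCP tool) also makes the failure mode explicit instead of silently producing `NaN`.

```javascript
// Hypothetical price parser factored out of the inline extraction snippets.
function parsePrice(text) {
  if (typeof text !== "string") return null;
  const cleaned = text.replace(/[^0-9.]/g, ""); // Strip currency symbols, commas
  const value = parseFloat(cleaned);
  return Number.isNaN(value) ? null : value;
}
```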
Phase 5: Pagination Handling
Purpose: Navigate through multiple pages and collect all data.
Pagination Patterns:
Pattern A: Numbered Pages
```javascript
async function extractNumberedPages(baseUrl, maxPages, tabId) {
  const allData = [];
  for (let page = 1; page <= maxPages; page++) {
    const url = `${baseUrl}?page=${page}`;
    await mcp__claude-in-chrome__navigate({ url, tabId });
    await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });
    const pageData = await extractPageData(tabId);
    if (pageData.length === 0) break; // No more data
    allData.push(...pageData);
    // Rate limiting
    await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });
  }
  return allData;
}
```

Pattern B: Next Button
```javascript
async function extractWithNextButton(tabId) {
  const allData = [];
  let hasNext = true;
  while (hasNext) {
    const pageData = await extractPageData(tabId);
    allData.push(...pageData);
    // Find next button
    const nextButton = await mcp__claude-in-chrome__find({
      query: "next page button or pagination next link",
      tabId: tabId
    });
    if (nextButton && nextButton.elements && nextButton.elements.length > 0) {
      await mcp__claude-in-chrome__computer({
        action: "left_click",
        ref: nextButton.elements[0].ref,
        tabId: tabId
      });
      await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });
    } else {
      hasNext = false; // No more pages
    }
  }
  return allData;
}
```

Pattern C: Infinite Scroll
```javascript
async function extractInfiniteScroll(tabId, maxScrolls = 20) {
  const allData = [];
  let previousCount = 0;
  let scrollCount = 0;
  while (scrollCount < maxScrolls) {
    const pageData = await extractPageData(tabId);
    if (pageData.length === previousCount) {
      // No new content loaded, done
      break;
    }
    allData.length = 0;
    allData.push(...pageData);
    previousCount = pageData.length;
    // Scroll down
    await mcp__claude-in-chrome__computer({
      action: "scroll",
      scroll_direction: "down",
      scroll_amount: 5,
      tabId: tabId,
      coordinate: [500, 500]
    });
    await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });
    scrollCount++;
  }
  return allData;
}
```
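Because the infinite-scroll pattern re-extracts the full list on each pass, merging passes incrementally instead requires deduplication by a stable key such as `url`. A generic sketch (hypothetical helper):

```javascript
// Dedupe records by a stable key; useful when merging pagination passes.
function dedupeBy(records, key) {
  const seen = new Set();
  return records.filter(r => {
    const k = r[key];
    if (seen.has(k)) return false;
    seen.add(k);
    return true;
  });
}
```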
Phase 6: Data Transformation
Purpose: Convert extracted data to requested output format.
JSON Output:
```javascript
function toJSON(data, pretty = true) {
  return pretty ? JSON.stringify(data, null, 2) : JSON.stringify(data);
}
```

CSV Output:
```javascript
function toCSV(data) {
  if (data.length === 0) return "";
  const headers = Object.keys(data[0]);
  const headerRow = headers.join(",");
  const dataRows = data.map(row =>
    headers.map(h => {
      const val = row[h] ?? ""; // Avoid emitting "undefined" for missing fields
      // Escape quotes and wrap in quotes if value contains comma or quote
      if (typeof val === "string" && (val.includes(",") || val.includes('"'))) {
        return `"${val.replace(/"/g, '""')}"`;
      }
      return val;
    }).join(",")
  );
  return [headerRow, ...dataRows].join("\n");
}
```

Markdown Table Output:
```javascript
function toMarkdownTable(data) {
  if (data.length === 0) return "";
  const headers = Object.keys(data[0]);
  const headerRow = `| ${headers.join(" | ")} |`;
  const separator = `| ${headers.map(() => "---").join(" | ")} |`;
  const dataRows = data.map(row =>
    `| ${headers.map(h => String(row[h] || "")).join(" | ")} |`
  );
  return [headerRow, separator, ...dataRows].join("\n");
}
```
Phase 7: Storage
Purpose: Persist extracted data to Memory MCP for future use.
Implementation:
```javascript
// Store in Memory MCP
await memory_store({
  namespace: `skills/tooling/web-scraping/${projectName}/${timestamp}`,
  data: {
    extraction_metadata: {
      source_url: targetUrl,
      extraction_date: new Date().toISOString(),
      record_count: data.length,
      output_format: outputFormat,
      pages_scraped: pageCount
    },
    extracted_data: data
  },
  tags: {
    WHO: `web-scraping-${sessionId}`,
    WHEN: new Date().toISOString(),
    PROJECT: projectName,
    WHY: "data-extraction",
    data_type: dataType,
    record_count: data.length
  }
});
```
LEARNED PATTERNS
This section is populated by Loop 1.5 (Session Reflection) as patterns are discovered.
High Confidence [conf:0.90]
No patterns captured yet. Patterns will be added as the skill is used and learnings are extracted.
Medium Confidence [conf:0.75]
No patterns captured yet.
Low Confidence [conf:0.55]
No patterns captured yet.
Success Criteria
Quality Thresholds:
- All targeted data fields extracted (100% schema coverage)
- Data validation passes for 95%+ of records
- Pagination handled completely (no missing pages)
- Output format matches requested format
- Rate limiting respected (no 429 errors)
- No page state modifications occurred
- Extraction completed within reasonable time (under 5 minutes per 100 records)
Failure Indicators:
- Schema validation fails for >5% of records
- Missing required fields in output
- Pagination incomplete (detected more pages than scraped)
- Rate limit errors encountered
- Page state modified (form submitted, button clicked for action)
- CAPTCHA or access denial encountered
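The 95% validation threshold can be checked mechanically before declaring success; `meetsQualityThreshold` is a hypothetical helper name, not part of the skill's API.

```javascript
// Hypothetical check encoding the 95% validation threshold above.
function meetsQualityThreshold(validCount, totalCount, threshold = 0.95) {
  if (totalCount === 0) return false; // An empty extraction is a failure
  return validCount / totalCount >= threshold;
}
```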
MCP Integration
Required MCPs:
| MCP | Purpose | Tools Used |
|---|---|---|
| sequential-thinking | Planning phase | `sequentialthinking` |
| claude-in-chrome | Extraction phase | `navigate`, `read_page`, `get_page_text`, `javascript_tool`, `find`, `computer`, `tabs_context_mcp`, `tabs_create_mcp` |
| memory-mcp | Data storage | `memory_store` |
Optional MCPs:
- filesystem (for saving extracted data locally)
Memory Namespace
Pattern:
`skills/tooling/web-scraping/{project}/{timestamp}`

Store:
- Extraction schemas (field definitions)
- Extracted datasets (structured data)
- Selector patterns (for similar pages)
- Error logs (selector failures, schema drift)
Retrieve:
- Similar extraction tasks (vector search by description)
- Proven selectors for known sites
- Historical extraction patterns
Tagging:
```json
{
  "WHO": "web-scraping-{session_id}",
  "WHEN": "ISO8601_timestamp",
  "PROJECT": "{project_name}",
  "WHY": "data-extraction",
  "source_domain": "example.com",
  "data_type": "products|articles|listings|tables",
  "record_count": 150,
  "pages_scraped": 5
}
```
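The namespace pattern and tag schema can be derived together at store time. A sketch (`buildMemoryEntry` is a hypothetical helper, not a Memory MCP API; field names follow the tagging schema above):

```javascript
// Hypothetical helper: derives the namespace path and tag object for one
// scraping session from the pattern and tagging schema documented above.
function buildMemoryEntry({ sessionId, project, sourceDomain, dataType, records, pagesScraped }) {
  const timestamp = new Date().toISOString();
  return {
    namespace: `skills/tooling/web-scraping/${project}/${timestamp}`,
    tags: {
      WHO: `web-scraping-${sessionId}`,
      WHEN: timestamp,
      PROJECT: project,
      WHY: "data-extraction",
      source_domain: sourceDomain,
      data_type: dataType,
      record_count: records.length,
      pages_scraped: pagesScraped
    }
  };
}
```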
Examples
Example 1: Product Catalog Extraction
Complexity: Medium (structured data, pagination)
Task: Extract product listings from e-commerce category page
Planning Output (sequential-thinking):
Thought 1/6: Need to extract product name, price, URL, and image from catalog
Thought 2/6: Schema: {name: string, price: number, url: url, image: url}
Thought 3/6: Identify product card containers via read_page
Thought 4/6: Use javascript_tool to query all product cards
Thought 5/6: Detect pagination pattern (numbered pages with ?page=N)
Thought 6/6: Output format: JSON array, validate price > 0
Execution:
```javascript
// 1. Navigate to catalog
await navigate({ url: "https://example.com/products", tabId });

// 2. Analyze structure
const structure = await read_page({ tabId, filter: "all" });

// 3. Extract data
const products = await javascript_tool({
  action: "javascript_exec",
  tabId,
  text: `
    Array.from(document.querySelectorAll('.product-card')).map(card => ({
      name: card.querySelector('.title')?.textContent?.trim(),
      price: parseFloat(card.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '')),
      url: card.querySelector('a')?.href,
      image: card.querySelector('img')?.src
    }))
  `
});

// 4. Handle pagination (repeat for each page)

// 5. Transform to JSON
const output = JSON.stringify(products, null, 2);
```
Result: 150 products extracted across 5 pages
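Step 4 is left as a comment; under the plan's ?page=N pattern it can be sketched as a loop. `navigate` is the claude-in-chrome tool used above; `extractPage` stands in for the per-page extraction, and the 2-second delay and empty-page stop condition are illustrative assumptions:

```javascript
// Pagination sketch for the ?page=N pattern planned in Thought 5/6.
// `navigate` and `extractPage` are assumed to be in scope.
function pageUrl(base, n) {
  return `${base}?page=${n}`;
}

async function scrapeAllPages(base, maxPages, tabId, extractPage) {
  const all = [];
  for (let n = 1; n <= maxPages; n++) {
    await navigate({ url: pageUrl(base, n), tabId });
    const pageData = await extractPage(tabId);
    if (pageData.length === 0) break;            // empty page: past the last page
    all.push(...pageData);
    await new Promise(r => setTimeout(r, 2000)); // rate limit between pages
  }
  return all;
}
```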
Output Format: JSON
```json
[
  {
    "name": "Product A",
    "price": 29.99,
    "url": "https://example.com/products/a",
    "image": "https://example.com/images/a.jpg"
  }
]
```
Example 2: Article Content Scraping
Complexity: Simple (single page, text focus)
Task: Extract article headline, author, date, and body text
Planning Output (sequential-thinking):
Thought 1/4: Single page extraction, no pagination needed
Thought 2/4: Schema: {headline: string, author: string, date: date, body: string}
Thought 3/4: Use get_page_text for full content, javascript_tool for metadata
Thought 4/4: Output format: Markdown for readability
Execution:
```javascript
// 1. Navigate to article
await navigate({ url: "https://news.example.com/article/123", tabId });

// 2. Extract metadata
const metadata = await javascript_tool({
  action: "javascript_exec",
  tabId,
  text: `({
    headline: document.querySelector('h1')?.textContent?.trim(),
    author: document.querySelector('.author')?.textContent?.trim(),
    date: document.querySelector('time')?.getAttribute('datetime')
  })`
});

// 3. Extract body text
const body = await get_page_text({ tabId });

// 4. Combine
const article = { ...metadata, body };
```
Result: Article content extracted
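The combined object can be rendered to Markdown with a small formatter (`articleToMarkdown` is a hypothetical helper, not an MCP tool; the layout is a plain-Markdown assumption):

```javascript
// Hypothetical formatter: renders the combined article object as Markdown.
function articleToMarkdown({ headline, author, date, body }) {
  return [`# ${headline}`, '', `Author: ${author}`, `Date: ${date}`, '', body].join('\n');
}
```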
Output Format: Markdown
```markdown
# Article Headline

Author: Jane Doe
Date: 2026-01-12

Article body text here...
```
Example 3: Table Data Extraction
Complexity: Simple (structured HTML table)
Task: Extract financial data from HTML table
Planning Output (sequential-thinking):
Thought 1/4: HTML table with headers in <th>, data in <td>
Thought 2/4: Use javascript_tool to parse table structure
Thought 3/4: Convert to array of objects with header keys
Thought 4/4: Output format: CSV for spreadsheet import
Execution:
```javascript
// 1. Navigate to page
await navigate({ url: "https://finance.example.com/data", tabId });

// 2. Extract table
const tableData = await javascript_tool({
  action: "javascript_exec",
  tabId,
  text: `
    const table = document.querySelector('table.financial-data');
    const headers = Array.from(table.querySelectorAll('th')).map(th => th.textContent.trim());
    Array.from(table.querySelectorAll('tbody tr')).map(tr => {
      const cells = Array.from(tr.querySelectorAll('td')).map(td => td.textContent.trim());
      return headers.reduce((obj, header, i) => ({ ...obj, [header]: cells[i] }), {});
    })
  `
});

// 3. Transform to CSV (toCSV is a helper, not an MCP tool)
const csv = toCSV(tableData);
```
Result: 50 rows extracted
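`toCSV` is called in step 3 but is not defined by any MCP toolset. A minimal sketch that derives headers from the first record and quotes every field (doubling embedded quotes per RFC 4180):

```javascript
// Minimal CSV serializer: header row from the first record's keys,
// every field quoted, embedded double quotes doubled.
function toCSV(rows) {
  if (rows.length === 0) return '';
  const headers = Object.keys(rows[0]);
  const quote = v => `"${String(v ?? '').replace(/"/g, '""')}"`;
  return [
    headers.join(','),
    ...rows.map(row => headers.map(h => quote(row[h])).join(','))
  ].join('\n');
}
```

Quoting every field is a simplification; a production serializer might leave plain numbers unquoted.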
Output Format: CSV
```csv
Date,Open,High,Low,Close,Volume
2026-01-12,150.00,152.50,149.00,151.75,1000000
2026-01-11,148.50,150.25,147.00,150.00,950000
```
Example 4: Multi-Page Directory Scraping
Complexity: High (pagination, rate limiting)
Task: Extract business listings from directory with "Load More" pagination
Planning Output (sequential-thinking):
Thought 1/8: Directory with business name, phone, address, website
Thought 2/8: Schema: {name: string, phone: string, address: string, website: url}
Thought 3/8: Pagination: "Load More" button at bottom
Thought 4/8: Rate limit: 3 seconds between loads
Thought 5/8: Max iterations: 20 (to prevent infinite loops)
Thought 6/8: End condition: "Load More" button disappears or disabled
Thought 7/8: Output format: JSON
Thought 8/8: Store in Memory MCP for analysis
Result: 200 businesses extracted across 20 "Load More" clicks
Execution Time: 90 seconds (with rate limiting)
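Unlike the other examples, Example 4 shows no execution code. A sketch of the "Load More" loop under the plan's constraints (max 20 iterations, 3-second delay, stop when the button disappears or is disabled). `javascript_tool` is the claude-in-chrome tool used in the other examples; `clickLoadMore` and the `.load-more` selector are illustrative assumptions:

```javascript
// End condition from Thought 6/8: stop when the button is gone or disabled.
function shouldStop(btnState) {
  return !btnState || btnState.disabled === true;
}

// "Load More" loop following Thoughts 3/8-6/8.
async function loadAllListings(tabId, { maxIterations = 20, delayMs = 3000 } = {}) {
  for (let i = 0; i < maxIterations; i++) {
    // Read button state without otherwise modifying the page
    const btnState = await javascript_tool({
      action: "javascript_exec",
      tabId,
      text: `(() => {
        const btn = document.querySelector('.load-more');
        return btn ? { disabled: btn.disabled } : null;
      })()`
    });
    if (shouldStop(btnState)) break;
    await clickLoadMore(tabId);                     // assumed helper wrapping the click
    await new Promise(r => setTimeout(r, delayMs)); // 3 s rate limit between loads
  }
}
```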
Anti-Patterns to Avoid
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Skip Structure Analysis | Selectors break unexpectedly | ALWAYS use read_page before extraction |
| Modify Page State | Not a scraping task anymore | Use browser-automation for interactions |
| Ignore Pagination | Incomplete datasets | Plan pagination strategy in Phase 1 |
| No Rate Limiting | Server blocks, legal issues | Implement delays between requests |
| Hardcoded Selectors | Break when site updates | Use semantic selectors, find tool |
| Skip Validation | Garbage data in output | Validate against schema before storage |
| Bypass CAPTCHA | Violates ToS, legal risk | ABORT and notify user |
Related Skills
Upstream (provide input to this skill):
- intent-analyzer - Detect web scraping requirements
- prompt-architect - Optimize extraction descriptions
- planner - High-level data collection strategy
Downstream (use output from this skill):
- data-analysis - Analyze extracted datasets
- reporting - Generate reports from data
- ml-pipeline - Use data for training
Parallel (work together):
- browser-automation - For interactive prerequisites (login)
- api-integration - Hybrid scraping/API workflows
Selector Stability Strategies
Problem: Web pages change, selectors break.
Strategies:
| Strategy | Stability | Speed | Example |
|---|---|---|---|
| ID selector | High | Fast | #product-list |
| Data attribute | High | Fast | [data-testid="price"] |
| Semantic class | Medium | Fast | .product-card |
| Text content | Medium | Slow | Contains "Add to Cart" |
| XPath | Low | Medium | //div[@class="price"] |
| Position | Very Low | Fast | First child element |
Best Practices:
- Prefer data-* attributes (designed for scripting)
- Use semantic class names over positional selectors
- Combine multiple attributes for specificity
- Use the find tool with natural language as a fallback
- Log selector failures to detect schema drift
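The fallback and logging practices can be combined in one lookup routine. A sketch (`queryWithFallback` and the selectors are illustrative; any DOM-like object with `querySelector` works):

```javascript
// Fallback chain: try selectors from most to least stable and report which
// one matched, so repeated fallbacks surface schema drift in the error logs.
function queryWithFallback(doc, selectors) {
  for (const sel of selectors) {
    const el = doc.querySelector(sel);
    if (el) return { element: el, selector: sel };
  }
  return { element: null, selector: null }; // all failed: log as selector failure
}
```

When the returned `selector` is not the first entry in the chain, the primary selector has broken and the extraction schema should be reviewed.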
Output Format Templates
JSON Template
```json
{
  "extraction_metadata": {
    "source_url": "https://example.com/products",
    "extraction_date": "2026-01-12T10:30:00Z",
    "record_count": 150,
    "pages_scraped": 5
  },
  "data": [
    {
      "name": "Product Name",
      "price": 29.99,
      "url": "https://example.com/product/1",
      "image": "https://example.com/images/1.jpg"
    }
  ]
}
```
CSV Template
```csv
name,price,url,image
"Product Name",29.99,"https://example.com/product/1","https://example.com/images/1.jpg"
```
Markdown Table Template
```markdown
| Name | Price | URL |
|------|-------|-----|
| Product Name | $29.99 | [Link](https://example.com/product/1) |
```
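Extracted records can be rendered into this table shape programmatically. A sketch (`toMarkdownTable` is a hypothetical helper; the minimal `|---|` separator row is an assumption, equivalent to the padded separators in the template):

```javascript
// Hypothetical helper: renders extracted records as a Markdown table.
function toMarkdownTable(rows, headers) {
  const head = `| ${headers.join(' | ')} |`;
  const sep = `|${headers.map(() => '---').join('|')}|`;
  const body = rows.map(r => `| ${headers.map(h => r[h] ?? '').join(' | ')} |`);
  return [head, sep, ...body].join('\n');
}
```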
Maintenance & Updates
Version History:
- v1.0.0 (2026-01-12): Initial release with READ-only focus, pagination handling, rate limiting
Feedback Loop:
- Loop 1.5 (Session): Store learnings from extractions
- Loop 3 (Meta-Loop): Aggregate patterns every 3 days
- Update LEARNED PATTERNS section with new discoveries
Continuous Improvement:
- Monitor extraction success rate via Memory MCP
- Identify common selector failures for pattern updates
- Optimize extraction strategies based on site patterns
<promise>WEB_SCRAPING_VERILINGUA_VERIX_COMPLIANT</promise>