Web Scraping
Evidential Frame Activation
Source verification mode active.
[assert|neutral] Systematic web data extraction workflow with sequential-thinking planning phase [ground:skill-design:2026-01-12] [conf:0.85] [state:provisional]
Overview
Web scraping enables structured data extraction from web pages through the claude-in-chrome MCP server. This skill enforces a PLAN-READ-TRANSFORM pattern focused exclusively on data extraction without page modifications.
Philosophy: Data extraction fails when executed without understanding page structure. By mandating structure analysis before extraction, this skill improves data quality and reduces selector errors.
Methodology: Six-phase execution with emphasis on READ operations:
- PLAN Phase: Sequential-thinking MCP analyzes extraction requirements
- NAVIGATE Phase: Navigate to target URL(s)
- ANALYZE Phase: Understand page structure via read_page
- EXTRACT Phase: Pull data using get_page_text and javascript_tool
- TRANSFORM Phase: Convert to structured format (JSON/CSV/Markdown)
- STORE Phase: Persist to Memory MCP
Key Differentiator from browser-automation:
- READ-ONLY focus (no form submissions, no button clicks for actions)
- Emphasis on data transformation and output formats
- Pagination handling for multi-page datasets
- Rate limiting awareness for bulk extraction
- No account creation, no purchases, no write operations
Value Proposition: Transform unstructured web content into clean, structured datasets suitable for analysis, reporting, or system integration.
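The six phases above can be composed into a single pipeline. The sketch below is illustrative only: `runScrape` and the injected phase functions are hypothetical stand-ins for the MCP calls detailed in the Main Workflow section.

```javascript
// Illustrative six-phase pipeline; each injected phase function is a
// hypothetical stand-in for the MCP calls described later in this skill.
async function runScrape(targetUrl, phases) {
  const plan = await phases.plan(targetUrl);                 // PLAN: sequential-thinking
  const tabId = await phases.navigate(plan.target_url);      // NAVIGATE
  const structure = await phases.analyze(tabId);             // ANALYZE: read_page
  const raw = await phases.extract(tabId, plan, structure);  // EXTRACT
  const output = phases.transform(raw, plan.output_format);  // TRANSFORM
  await phases.store(output, plan);                          // STORE: Memory MCP
  return output;
}
```

Keeping the phases injectable like this also makes the workflow testable without a live browser.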
When to Use This Skill
Trigger Thresholds:
| Data Volume | Recommendation |
|---|---|
| Single element | Use get_page_text directly (too simple) |
| 5-50 elements | Consider this skill |
| 50+ elements or pagination | Mandatory use of this skill |
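The thresholds above can be encoded as a small dispatch helper; `chooseApproach` and its return labels are hypothetical names used for illustration only.

```javascript
// Hypothetical dispatch helper encoding the trigger thresholds above.
function chooseApproach(elementCount, hasPagination) {
  if (hasPagination || elementCount >= 50) return "web-scraping-skill"; // mandatory
  if (elementCount >= 5) return "consider-skill";                       // judgment call
  return "get_page_text";                                               // too simple
}
```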
Primary Use Cases:
- Product catalog extraction (prices, descriptions, images)
- Article content scraping (headlines, body text, metadata)
- Table data extraction (financial data, statistics, listings)
- Directory scraping (contact info, business listings)
- Research data collection (citations, abstracts, datasets)
- Price monitoring (competitive analysis, market research)
Apply When:
- Task requires structured output (JSON, CSV, Markdown table)
- Multiple pages need to be traversed (pagination)
- Data needs normalization or transformation
- Consistent data schema is required
- Historical data collection for trend analysis
When NOT to Use This Skill
- Form submissions or account actions (use browser-automation)
- Interactive workflows requiring clicks (use browser-automation)
- Single-page simple text extraction (use get_page_text directly)
- API-accessible data (use direct API calls instead)
- Real-time streaming data (use WebSocket or API)
- Content behind authentication requiring login (use browser-automation first)
- Sites with explicit anti-scraping measures (respect robots.txt)
Core Principles
Principle 1: Plan Before Extract
Mandate: ALWAYS invoke sequential-thinking MCP before data extraction.
Rationale: Web pages have complex DOM structures. Planning reveals data locations, identifies patterns, and anticipates pagination before extraction begins.
In Practice:
- Analyze page structure via read_page first
- Map data field locations to selectors
- Identify pagination patterns
- Plan output schema before extraction
- Define data validation rules
Principle 2: Read-Only Operations
Mandate: NEVER modify page state during extraction.
Rationale: Web scraping is about data retrieval, not interaction. Write operations (form submissions, clicks that change state) belong to browser-automation.
Allowed Operations:
- navigate (to target URLs)
- read_page (DOM analysis)
- get_page_text (text extraction)
- javascript_tool (read-only DOM queries)
- computer screenshot (visual verification)
- scroll (for lazy-loaded content)
Prohibited Operations:
- form_input (except for search filters if needed)
- computer left_click on submit buttons
- Any action that modifies page state
- Account creation or login actions
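One way to enforce the read-only mandate mechanically is an operation allowlist. The guard below is a sketch, not part of the MCP API; the operation names mirror the allowed list above.

```javascript
// Hypothetical guard enforcing Principle 2's read-only allowlist.
const READ_ONLY_OPERATIONS = new Set([
  "navigate", "read_page", "get_page_text",
  "javascript_tool", "screenshot", "scroll", "wait"
]);

function assertReadOnly(operation) {
  if (!READ_ONLY_OPERATIONS.has(operation)) {
    throw new Error(`Prohibited write operation in scraping mode: ${operation}`);
  }
}
```

Note that `form_input` is excluded entirely here; if search filters are genuinely required, extend the set deliberately rather than bypassing the guard.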
Principle 3: Structured Output First
Mandate: Define output schema before extraction begins.
Rationale: Knowing the target format guides extraction strategy and ensures consistent data quality across all pages.
Output Formats:
| Format | Use Case | Example |
|---|---|---|
| JSON | API integration, databases | `[{"name": "Widget", "price": 9.99}]` |
| CSV | Spreadsheet analysis | `name,price` header row, `Widget,9.99` data rows |
| Markdown Table | Documentation, reports | `\| name \| price \|` |
Principle 4: Pagination Awareness
Mandate: Detect and handle pagination patterns before starting extraction.
Rationale: Most valuable datasets span multiple pages. Failing to handle pagination yields incomplete data.
Pagination Patterns:
| Pattern | Detection | Handling |
|---|---|---|
| Numbered pages | Links with `?page=N` or `/page/N` URLs | Iterate through page numbers |
| Next button | "Next", ">" or arrow elements | Click until disabled/missing |
| Infinite scroll | Content loads on scroll | Scroll + wait + check for new content |
| Load more | "Load more" button | Click until no new content |
| Cursor-based | URL contains a cursor/token parameter | Extract cursor, iterate |
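The detection column above can be sketched as a pure classifier over a simplified page descriptor. The descriptor fields (`linkHrefs`, `buttonLabels`, `infiniteScrollHint`) are assumptions for illustration; in practice they would be derived from `read_page` / `find` results.

```javascript
// Sketch: classify pagination style from a simplified page descriptor.
// Descriptor shape is an assumption, not an MCP return type.
function detectPaginationType(page) {
  if (page.linkHrefs.some(h => /[?&](page|p)=\d+/.test(h))) return "numbered";
  if (page.linkHrefs.some(h => /[?&]cursor=/.test(h))) return "cursor";
  const labels = page.buttonLabels.map(l => l.toLowerCase());
  if (labels.some(l => l.includes("load more"))) return "load_more";
  if (labels.some(l => l === "next" || l === ">")) return "next_button";
  return page.infiniteScrollHint ? "infinite_scroll" : "none";
}
```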
Principle 5: Rate Limiting Respect
Mandate: Implement delays between requests to avoid overwhelming servers.
Rationale: Aggressive scraping can trigger blocks, harm server performance, and violate terms of service.
Guidelines:
| Request Type | Minimum Delay |
|---|---|
| Same domain | 1-2 seconds |
| Pagination | 2-3 seconds |
| Bulk extraction (100+ pages) | 3-5 seconds |
| If rate-limited | Exponential backoff (start 10s) |
Implementation:

```javascript
// Use computer wait action between page loads
await mcp__claude-in-chrome__computer({
  action: "wait",
  duration: 2, // seconds
  tabId: tabId
});
```
Production Guardrails
MCP Preflight Check Protocol
Before executing any web scraping workflow, run preflight validation:
Preflight Sequence:
```javascript
async function preflightCheck() {
  const checks = {
    sequential_thinking: false,
    claude_in_chrome: false,
    memory_mcp: false
  };

  // Check sequential-thinking MCP (required)
  try {
    await mcp__sequential-thinking__sequentialthinking({
      thought: "Preflight check - verifying MCP availability for web scraping",
      thoughtNumber: 1,
      totalThoughts: 1,
      nextThoughtNeeded: false
    });
    checks.sequential_thinking = true;
  } catch (error) {
    console.error("Sequential-thinking MCP unavailable:", error);
    throw new Error("CRITICAL: sequential-thinking MCP required but unavailable");
  }

  // Check claude-in-chrome MCP (required)
  try {
    const context = await mcp__claude-in-chrome__tabs_context_mcp({});
    checks.claude_in_chrome = true;
  } catch (error) {
    console.error("Claude-in-chrome MCP unavailable:", error);
    throw new Error("CRITICAL: claude-in-chrome MCP required but unavailable");
  }

  // Check memory-mcp (optional but recommended)
  try {
    checks.memory_mcp = true;
  } catch (error) {
    console.warn("Memory MCP unavailable - extracted data will not be persisted");
    checks.memory_mcp = false;
  }

  return checks;
}
```
Error Handling Framework
Error Categories:
| Category | Example | Recovery Strategy |
|---|---|---|
| MCP_UNAVAILABLE | Claude-in-chrome offline | ABORT with clear message |
| PAGE_NOT_FOUND | 404 error | Log, skip page, continue with others |
| ELEMENT_NOT_FOUND | Selector changed | Try alternative selectors, log schema drift |
| RATE_LIMITED | 429 response | Exponential backoff, wait, retry |
| CONTENT_CHANGED | Dynamic content not loaded | Wait longer, scroll to trigger load |
| PAGINATION_END | No more pages | Normal termination, return collected data |
| BLOCKED | Access denied, CAPTCHA | ABORT, notify user, do not bypass |
Error Recovery Pattern:
```javascript
async function extractWithRetry(extractor, context, maxRetries = 3) {
  let lastError = null;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const data = await extractor(context);
      return data;
    } catch (error) {
      lastError = error;
      console.error(`Extraction attempt ${attempt} failed:`, error.message);
      if (isRateLimitError(error)) {
        const delay = Math.pow(2, attempt) * 5; // 10s, 20s, 40s
        await sleep(delay * 1000);
      } else if (!isRecoverableError(error)) {
        break;
      }
    }
  }
  throw lastError;
}

function isRecoverableError(error) {
  const nonRecoverable = [
    "CRITICAL: sequential-thinking MCP required",
    "CRITICAL: claude-in-chrome MCP required",
    "Access denied",
    "CAPTCHA",
    "Blocked"
  ];
  return !nonRecoverable.some(msg => error.message.includes(msg));
}
```
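`extractWithRetry` relies on `sleep` and `isRateLimitError`, which the skill does not define. Minimal sketches follow, assuming rate-limit failures surface as error messages containing "429" or "rate limit"; adjust to however your extractor actually reports errors.

```javascript
// Minimal versions of the helpers extractWithRetry relies on.
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Assumption: rate-limit failures mention 429 or "rate limit" in the message.
function isRateLimitError(error) {
  return /429|rate limit/i.test(error.message);
}
```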
Data Validation Framework
Purpose: Ensure extracted data meets quality standards before storage.
Validation Rules:
```javascript
const VALIDATION_RULES = {
  required_fields: ["name", "url"], // Must be present
  field_types: {
    price: "number",
    name: "string",
    url: "url",
    date: "date"
  },
  constraints: {
    price: { min: 0, max: 1000000 },
    name: { minLength: 1, maxLength: 500 },
    url: { pattern: /^https?:\/\// }
  }
};

function validateRecord(record, rules) {
  const errors = [];
  // Check required fields
  for (const field of rules.required_fields) {
    if (!record[field]) {
      errors.push(`Missing required field: ${field}`);
    }
  }
  // Check field types
  for (const [field, type] of Object.entries(rules.field_types)) {
    if (record[field] && !isValidType(record[field], type)) {
      errors.push(`Invalid type for ${field}: expected ${type}`);
    }
  }
  return { valid: errors.length === 0, errors };
}
```
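`validateRecord` calls an `isValidType` helper that is not defined in this skill; a minimal sketch consistent with the type names in `VALIDATION_RULES`:

```javascript
// Sketch of the isValidType helper used by validateRecord.
function isValidType(value, type) {
  switch (type) {
    case "number": return typeof value === "number" && !Number.isNaN(value);
    case "string": return typeof value === "string";
    case "url":    return typeof value === "string" && /^https?:\/\//.test(value);
    case "date":   return !Number.isNaN(Date.parse(value));
    default:       return true; // Unknown types pass through
  }
}
```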
Main Workflow
Phase 1: Planning (MANDATORY)
Purpose: Analyze extraction requirements and define data schema.
Process:
- Invoke sequential-thinking MCP
- Define target data fields and types
- Plan selector strategy
- Identify pagination pattern
- Define output format
- Plan rate limiting strategy
Planning Questions:
- What data fields need to be extracted?
- What is the expected output format (JSON/CSV/Markdown)?
- Is pagination involved? What pattern?
- What validation rules apply?
- What rate limiting is appropriate?
Output Contract:
```yaml
extraction_plan:
  target_url: string
  output_format: "json" | "csv" | "markdown"
  schema:
    fields:
      - name: string
        selector: string
        type: string | number | url | date
        required: boolean
  pagination:
    type: "numbered" | "next_button" | "infinite_scroll" | "load_more" | "none"
    max_pages: number
    delay_seconds: number
  rate_limit:
    delay_between_requests: number
```
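A filled-in instance may clarify the shape of the contract; every value below (URL, selectors, limits) is illustrative only.

```yaml
extraction_plan:
  target_url: "https://example.com/products"
  output_format: "json"
  schema:
    fields:
      - name: "name"
        selector: ".product-title"
        type: "string"
        required: true
      - name: "price"
        selector: ".price"
        type: "number"
        required: true
  pagination:
    type: "numbered"
    max_pages: 10
    delay_seconds: 2
  rate_limit:
    delay_between_requests: 2
```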
Phase 2: Navigation
Purpose: Navigate to target URL and establish context.
Process:
- Get tab context (tabs_context_mcp)
- Create new tab for scraping (tabs_create_mcp)
- Navigate to starting URL
- Take initial screenshot for verification
- Wait for page load completion
Implementation:
```javascript
// 1. Get existing context
const context = await mcp__claude-in-chrome__tabs_context_mcp({});

// 2. Create dedicated tab
const newTab = await mcp__claude-in-chrome__tabs_create_mcp({});
const tabId = newTab.tabId;

// 3. Navigate to target
await mcp__claude-in-chrome__navigate({
  url: targetUrl,
  tabId: tabId
});

// 4. Wait for page load
await mcp__claude-in-chrome__computer({
  action: "wait",
  duration: 2,
  tabId: tabId
});

// 5. Screenshot initial state
await mcp__claude-in-chrome__computer({
  action: "screenshot",
  tabId: tabId
});
```
Phase 3: Structure Analysis
Purpose: Understand page DOM structure before extraction.
Process:
- Read page accessibility tree (read_page)
- Identify data container elements
- Map field selectors to schema
- Verify selectors exist
- Detect dynamic content patterns
Implementation:
```javascript
// Read page structure
const pageTree = await mcp__claude-in-chrome__read_page({
  tabId: tabId,
  filter: "all" // Get all elements for analysis
});

// For interactive elements only
const interactive = await mcp__claude-in-chrome__read_page({
  tabId: tabId,
  filter: "interactive" // Buttons, links, inputs
});

// Find specific elements
const products = await mcp__claude-in-chrome__find({
  query: "product cards or listing items",
  tabId: tabId
});

const prices = await mcp__claude-in-chrome__find({
  query: "price elements",
  tabId: tabId
});
```
Phase 4: Data Extraction
Purpose: Extract data according to defined schema.
Process:
- For simple text: Use get_page_text
- For structured data: Use javascript_tool
- For specific elements: Use find + read_page
- Handle lazy-loaded content with scroll
Extraction Methods:
Method 1: Full Page Text (simplest, for articles)
```javascript
const text = await mcp__claude-in-chrome__get_page_text({
  tabId: tabId
});
```

Method 2: JavaScript DOM Query (most flexible)
```javascript
const data = await mcp__claude-in-chrome__javascript_tool({
  action: "javascript_exec",
  tabId: tabId,
  text: `
    // Extract product data
    Array.from(document.querySelectorAll('.product-card')).map(card => ({
      name: card.querySelector('.product-title')?.textContent?.trim(),
      price: parseFloat(card.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '')),
      url: card.querySelector('a')?.href,
      image: card.querySelector('img')?.src
    }))
  `
});
```

Method 3: Table Extraction
```javascript
const tableData = await mcp__claude-in-chrome__javascript_tool({
  action: "javascript_exec",
  tabId: tabId,
  text: `
    // Extract table data
    const table = document.querySelector('table');
    const headers = Array.from(table.querySelectorAll('th')).map(th => th.textContent.trim());
    const rows = Array.from(table.querySelectorAll('tbody tr')).map(tr => {
      const cells = Array.from(tr.querySelectorAll('td')).map(td => td.textContent.trim());
      return headers.reduce((obj, header, i) => ({ ...obj, [header]: cells[i] }), {});
    });
    rows
  `
});
```

Method 4: Handling Lazy-Loaded Content
```javascript
// Scroll to load content
await mcp__claude-in-chrome__computer({
  action: "scroll",
  scroll_direction: "down",
  scroll_amount: 5,
  tabId: tabId,
  coordinate: [500, 500]
});

// Wait for content to load
await mcp__claude-in-chrome__computer({
  action: "wait",
  duration: 2,
  tabId: tabId
});

// Now extract
const data = await extractData(tabId);
```
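The methods above repeat the same inline price-parsing expression; a reusable version (a hypothetical helper, not an MCP tool) also makes the failure mode explicit instead of silently producing `NaN`.

```javascript
// Hypothetical price parser factored out of the inline extraction snippets.
function parsePrice(text) {
  if (typeof text !== "string") return null;
  const cleaned = text.replace(/[^0-9.]/g, ""); // Strip currency symbols, commas
  const value = parseFloat(cleaned);
  return Number.isNaN(value) ? null : value;
}
```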
Phase 5: Pagination Handling
Purpose: Navigate through multiple pages and collect all data.
Pagination Patterns:
Pattern A: Numbered Pages
```javascript
async function extractNumberedPages(baseUrl, maxPages, tabId) {
  const allData = [];
  for (let page = 1; page <= maxPages; page++) {
    const url = `${baseUrl}?page=${page}`;
    await mcp__claude-in-chrome__navigate({ url, tabId });
    await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });
    const pageData = await extractPageData(tabId);
    if (pageData.length === 0) break; // No more data
    allData.push(...pageData);
    // Rate limiting
    await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });
  }
  return allData;
}
```

Pattern B: Next Button
```javascript
async function extractWithNextButton(tabId) {
  const allData = [];
  let hasNext = true;
  while (hasNext) {
    const pageData = await extractPageData(tabId);
    allData.push(...pageData);
    // Find next button
    const nextButton = await mcp__claude-in-chrome__find({
      query: "next page button or pagination next link",
      tabId: tabId
    });
    if (nextButton && nextButton.elements && nextButton.elements.length > 0) {
      await mcp__claude-in-chrome__computer({
        action: "left_click",
        ref: nextButton.elements[0].ref,
        tabId: tabId
      });
      await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });
    } else {
      hasNext = false; // No more pages
    }
  }
  return allData;
}
```

Pattern C: Infinite Scroll
```javascript
async function extractInfiniteScroll(tabId, maxScrolls = 20) {
  const allData = [];
  let previousCount = 0;
  let scrollCount = 0;
  while (scrollCount < maxScrolls) {
    const pageData = await extractPageData(tabId);
    if (pageData.length === previousCount) {
      // No new content loaded, done
      break;
    }
    allData.length = 0;
    allData.push(...pageData);
    previousCount = pageData.length;
    // Scroll down
    await mcp__claude-in-chrome__computer({
      action: "scroll",
      scroll_direction: "down",
      scroll_amount: 5,
      tabId: tabId,
      coordinate: [500, 500]
    });
    await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });
    scrollCount++;
  }
  return allData;
}
```
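Because the infinite-scroll pattern re-extracts the full list on each pass, merging passes incrementally instead requires deduplication by a stable key such as `url`. A generic sketch (hypothetical helper):

```javascript
// Dedupe records by a stable key; useful when merging pagination passes.
function dedupeBy(records, key) {
  const seen = new Set();
  return records.filter(r => {
    const k = r[key];
    if (seen.has(k)) return false;
    seen.add(k);
    return true;
  });
}
```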
Phase 6: Data Transformation
Purpose: Convert extracted data to requested output format.
JSON Output:
```javascript
function toJSON(data, pretty = true) {
  return pretty ? JSON.stringify(data, null, 2) : JSON.stringify(data);
}
```

CSV Output:
```javascript
function toCSV(data) {
  if (data.length === 0) return "";
  const headers = Object.keys(data[0]);
  const headerRow = headers.join(",");
  const dataRows = data.map(row =>
    headers.map(h => {
      const val = row[h] ?? ""; // Avoid emitting "undefined" for missing fields
      // Escape quotes and wrap in quotes if value contains comma or quote
      if (typeof val === "string" && (val.includes(",") || val.includes('"'))) {
        return `"${val.replace(/"/g, '""')}"`;
      }
      return val;
    }).join(",")
  );
  return [headerRow, ...dataRows].join("\n");
}
```

Markdown Table Output:
```javascript
function toMarkdownTable(data) {
  if (data.length === 0) return "";
  const headers = Object.keys(data[0]);
  const headerRow = `| ${headers.join(" | ")} |`;
  const separator = `| ${headers.map(() => "---").join(" | ")} |`;
  const dataRows = data.map(row =>
    `| ${headers.map(h => String(row[h] || "")).join(" | ")} |`
  );
  return [headerRow, separator, ...dataRows].join("\n");
}
```
Phase 7: Storage
Purpose: Persist extracted data to Memory MCP for future use.
Implementation:
```javascript
// Store in Memory MCP
await memory_store({
  namespace: `skills/tooling/web-scraping/${projectName}/${timestamp}`,
  data: {
    extraction_metadata: {
      source_url: targetUrl,
      extraction_date: new Date().toISOString(),
      record_count: data.length,
      output_format: outputFormat,
      pages_scraped: pageCount
    },
    extracted_data: data
  },
  tags: {
    WHO: `web-scraping-${sessionId}`,
    WHEN: new Date().toISOString(),
    PROJECT: projectName,
    WHY: "data-extraction",
    data_type: dataType,
    record_count: data.length
  }
});
```
LEARNED PATTERNS
This section is populated by Loop 1.5 (Session Reflection) as patterns are discovered.
High Confidence [conf:0.90]
No patterns captured yet. Patterns will be added as the skill is used and learnings are extracted.
Medium Confidence [conf:0.75]
No patterns captured yet.
Low Confidence [conf:0.55]
No patterns captured yet.
Success Criteria
Quality Thresholds:
- All targeted data fields extracted (100% schema coverage)
- Data validation passes for 95%+ of records
- Pagination handled completely (no missing pages)
- Output format matches requested format
- Rate limiting respected (no 429 errors)
- No page state modifications occurred
- Extraction completed within reasonable time (under 5 minutes per 100 records)
Failure Indicators:
- Schema validation fails for >5% of records
- Missing required fields in output
- Pagination incomplete (detected more pages than scraped)
- Rate limit errors encountered
- Page state modified (form submitted, button clicked for action)
- CAPTCHA or access denial encountered
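The 95% validation threshold can be checked mechanically before declaring success; `meetsQualityThreshold` is a hypothetical helper name, not part of the skill's API.

```javascript
// Hypothetical check encoding the 95% validation threshold above.
function meetsQualityThreshold(validCount, totalCount, threshold = 0.95) {
  if (totalCount === 0) return false; // An empty extraction is a failure
  return validCount / totalCount >= threshold;
}
```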
MCP Integration
Required MCPs:
| MCP | Purpose | Tools Used |
|---|---|---|
| sequential-thinking | Planning phase | `sequentialthinking` |
| claude-in-chrome | Extraction phase | `navigate`, `read_page`, `get_page_text`, `javascript_tool`, `find`, `computer`, `tabs_context_mcp`, `tabs_create_mcp` |
| memory-mcp | Data storage | `memory_store` |
Optional MCPs:
- filesystem (for saving extracted data locally)
Memory Namespace
Pattern:
`skills/tooling/web-scraping/{project}/{timestamp}`

Store:
- Extraction schemas (field definitions)
- Extracted datasets (structured data)
- Selector patterns (for similar pages)
- Error logs (selector failures, schema drift)
Retrieve:
- Similar extraction tasks (vector search by description)
- Proven selectors for known sites
- Historical extraction patterns
Tagging:
```json
{
  "WHO": "web-scraping-{session_id}",
  "WHEN": "ISO8601_timestamp",
  "PROJECT": "{project_name}",
  "WHY": "data-extraction",
  "source_domain": "example.com",
  "data_type": "products|articles|listings|tables",
  "record_count": 150,
  "pages_scraped": 5
}
```
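The namespace pattern and tag schema can be derived together at store time. A sketch (`buildMemoryEntry` is a hypothetical helper, not a Memory MCP API; field names follow the tagging schema above):

```javascript
// Hypothetical helper: derives the namespace path and tag object for one
// scraping session from the pattern and tagging schema documented above.
function buildMemoryEntry({ sessionId, project, sourceDomain, dataType, records, pagesScraped }) {
  const timestamp = new Date().toISOString();
  return {
    namespace: `skills/tooling/web-scraping/${project}/${timestamp}`,
    tags: {
      WHO: `web-scraping-${sessionId}`,
      WHEN: timestamp,
      PROJECT: project,
      WHY: "data-extraction",
      source_domain: sourceDomain,
      data_type: dataType,
      record_count: records.length,
      pages_scraped: pagesScraped
    }
  };
}
```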
Examples
Example 1: Product Catalog Extraction
Complexity: Medium (structured data, pagination)
Task: Extract product listings from e-commerce category page
Planning Output (sequential-thinking):
Thought 1/6: Need to extract product name, price, URL, and image from catalog
Thought 2/6: Schema: {name: string, price: number, url: url, image: url}
Thought 3/6: Identify product card containers via read_page
Thought 4/6: Use javascript_tool to query all product cards
Thought 5/6: Detect pagination pattern (numbered pages with ?page=N)
Thought 6/6: Output format: JSON array, validate price > 0
Execution:
```javascript
// 1. Navigate to catalog
await navigate({ url: "https://example.com/products", tabId });

// 2. Analyze structure
const structure = await read_page({ tabId, filter: "all" });

// 3. Extract data
const products = await javascript_tool({
  action: "javascript_exec",
  tabId,
  text: `
    Array.from(document.querySelectorAll('.product-card')).map(card => ({
      name: card.querySelector('.title')?.textContent?.trim(),
      price: parseFloat(card.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '')),
      url: card.querySelector('a')?.href,
      image: card.querySelector('img')?.src
    }))
  `
});

// 4. Handle pagination (repeat for each page)

// 5. Transform to JSON
const output = JSON.stringify(products, null, 2);
```
Result: 150 products extracted across 5 pages
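Step 4 is left as a comment; under the plan's ?page=N pattern it can be sketched as a loop. `navigate` is the claude-in-chrome tool used above; `extractPage` stands in for the per-page extraction, and the 2-second delay and empty-page stop condition are illustrative assumptions:

```javascript
// Pagination sketch for the ?page=N pattern planned in Thought 5/6.
// `navigate` and `extractPage` are assumed to be in scope.
function pageUrl(base, n) {
  return `${base}?page=${n}`;
}

async function scrapeAllPages(base, maxPages, tabId, extractPage) {
  const all = [];
  for (let n = 1; n <= maxPages; n++) {
    await navigate({ url: pageUrl(base, n), tabId });
    const pageData = await extractPage(tabId);
    if (pageData.length === 0) break;            // empty page: past the last page
    all.push(...pageData);
    await new Promise(r => setTimeout(r, 2000)); // rate limit between pages
  }
  return all;
}
```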
Output Format: JSON
```json
[
  {
    "name": "Product A",
    "price": 29.99,
    "url": "https://example.com/products/a",
    "image": "https://example.com/images/a.jpg"
  }
]
```
Example 2: Article Content Scraping
Complexity: Simple (single page, text focus)
Task: Extract article headline, author, date, and body text
Planning Output (sequential-thinking):
Thought 1/4: Single page extraction, no pagination needed
Thought 2/4: Schema: {headline: string, author: string, date: date, body: string}
Thought 3/4: Use get_page_text for full content, javascript_tool for metadata
Thought 4/4: Output format: Markdown for readability
Execution:
```javascript
// 1. Navigate to article
await navigate({ url: "https://news.example.com/article/123", tabId });

// 2. Extract metadata
const metadata = await javascript_tool({
  action: "javascript_exec",
  tabId,
  text: `({
    headline: document.querySelector('h1')?.textContent?.trim(),
    author: document.querySelector('.author')?.textContent?.trim(),
    date: document.querySelector('time')?.getAttribute('datetime')
  })`
});

// 3. Extract body text
const body = await get_page_text({ tabId });

// 4. Combine
const article = { ...metadata, body };
```
Result: Article content extracted
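The combined object can be rendered to Markdown with a small formatter (`articleToMarkdown` is a hypothetical helper, not an MCP tool; the layout is a plain-Markdown assumption):

```javascript
// Hypothetical formatter: renders the combined article object as Markdown.
function articleToMarkdown({ headline, author, date, body }) {
  return [`# ${headline}`, '', `Author: ${author}`, `Date: ${date}`, '', body].join('\n');
}
```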
Output Format: Markdown
```markdown
# Article Headline

Author: Jane Doe
Date: 2026-01-12

Article body text here...
```
Example 3: Table Data Extraction
Complexity: Simple (structured HTML table)
Task: Extract financial data from HTML table
Planning Output (sequential-thinking):
Thought 1/4: HTML table with headers in <th>, data in <td>
Thought 2/4: Use javascript_tool to parse table structure
Thought 3/4: Convert to array of objects with header keys
Thought 4/4: Output format: CSV for spreadsheet import
Execution:
```javascript
// 1. Navigate to page
await navigate({ url: "https://finance.example.com/data", tabId });

// 2. Extract table
const tableData = await javascript_tool({
  action: "javascript_exec",
  tabId,
  text: `
    const table = document.querySelector('table.financial-data');
    const headers = Array.from(table.querySelectorAll('th')).map(th => th.textContent.trim());
    Array.from(table.querySelectorAll('tbody tr')).map(tr => {
      const cells = Array.from(tr.querySelectorAll('td')).map(td => td.textContent.trim());
      return headers.reduce((obj, header, i) => ({ ...obj, [header]: cells[i] }), {});
    })
  `
});

// 3. Transform to CSV (toCSV is a helper, not an MCP tool)
const csv = toCSV(tableData);
```
Result: 50 rows extracted
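`toCSV` is called in step 3 but is not defined by any MCP toolset. A minimal sketch that derives headers from the first record and quotes every field (doubling embedded quotes per RFC 4180):

```javascript
// Minimal CSV serializer: header row from the first record's keys,
// every field quoted, embedded double quotes doubled.
function toCSV(rows) {
  if (rows.length === 0) return '';
  const headers = Object.keys(rows[0]);
  const quote = v => `"${String(v ?? '').replace(/"/g, '""')}"`;
  return [
    headers.join(','),
    ...rows.map(row => headers.map(h => quote(row[h])).join(','))
  ].join('\n');
}
```

Quoting every field is a simplification; a production serializer might leave plain numbers unquoted.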
Output Format: CSV
```csv
Date,Open,High,Low,Close,Volume
2026-01-12,150.00,152.50,149.00,151.75,1000000
2026-01-11,148.50,150.25,147.00,150.00,950000
```
Example 4: Multi-Page Directory Scraping
Complexity: High (pagination, rate limiting)
Task: Extract business listings from directory with "Load More" pagination
Planning Output (sequential-thinking):
Thought 1/8: Directory with business name, phone, address, website
Thought 2/8: Schema: {name: string, phone: string, address: string, website: url}
Thought 3/8: Pagination: "Load More" button at bottom
Thought 4/8: Rate limit: 3 seconds between loads
Thought 5/8: Max iterations: 20 (to prevent infinite loops)
Thought 6/8: End condition: "Load More" button disappears or disabled
Thought 7/8: Output format: JSON
Thought 8/8: Store in Memory MCP for analysis
Result: 200 businesses extracted across 20 "Load More" clicks
Execution Time: 90 seconds (with rate limiting)
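Unlike the other examples, Example 4 shows no execution code. A sketch of the "Load More" loop under the plan's constraints (max 20 iterations, 3-second delay, stop when the button disappears or is disabled). `javascript_tool` is the claude-in-chrome tool used in the other examples; `clickLoadMore` and the `.load-more` selector are illustrative assumptions:

```javascript
// End condition from Thought 6/8: stop when the button is gone or disabled.
function shouldStop(btnState) {
  return !btnState || btnState.disabled === true;
}

// "Load More" loop following Thoughts 3/8-6/8.
async function loadAllListings(tabId, { maxIterations = 20, delayMs = 3000 } = {}) {
  for (let i = 0; i < maxIterations; i++) {
    // Read button state without otherwise modifying the page
    const btnState = await javascript_tool({
      action: "javascript_exec",
      tabId,
      text: `(() => {
        const btn = document.querySelector('.load-more');
        return btn ? { disabled: btn.disabled } : null;
      })()`
    });
    if (shouldStop(btnState)) break;
    await clickLoadMore(tabId);                     // assumed helper wrapping the click
    await new Promise(r => setTimeout(r, delayMs)); // 3 s rate limit between loads
  }
}
```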
Anti-Patterns to Avoid
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Skip Structure Analysis | Selectors break unexpectedly | ALWAYS use read_page before extraction |
| Modify Page State | Not a scraping task anymore | Use browser-automation for interactions |
| Ignore Pagination | Incomplete datasets | Plan pagination strategy in Phase 1 |
| No Rate Limiting | Server blocks, legal issues | Implement delays between requests |
| Hardcoded Selectors | Break when site updates | Use semantic selectors, find tool |
| Skip Validation | Garbage data in output | Validate against schema before storage |
| Bypass CAPTCHA | Violates ToS, legal risk | ABORT and notify user |
Related Skills
Upstream (provide input to this skill):
- intent-analyzer - Detect web scraping requirements
- prompt-architect - Optimize extraction descriptions
- planner - High-level data collection strategy
Downstream (use output from this skill):
- data-analysis - Analyze extracted datasets
- reporting - Generate reports from data
- ml-pipeline - Use data for training
Parallel (work together):
- browser-automation - For interactive prerequisites (login)
- api-integration - Hybrid scraping/API workflows
Selector Stability Strategies
Problem: Web pages change, selectors break.
Strategies:
| Strategy | Stability | Speed | Example |
|---|---|---|---|
| ID selector | High | Fast | #product-list |
| Data attribute | High | Fast | [data-testid="price"] |
| Semantic class | Medium | Fast | .product-card |
| Text content | Medium | Slow | Contains "Add to Cart" |
| XPath | Low | Medium | //div[@class="price"] |
| Position | Very Low | Fast | First child element |
Best Practices:
- Prefer data-* attributes (designed for scripting)
- Use semantic class names over positional selectors
- Combine multiple attributes for specificity
- Use the find tool with natural language as a fallback
- Log selector failures to detect schema drift
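The fallback and logging practices can be combined in one lookup routine. A sketch (`queryWithFallback` and the selectors are illustrative; any DOM-like object with `querySelector` works):

```javascript
// Fallback chain: try selectors from most to least stable and report which
// one matched, so repeated fallbacks surface schema drift in the error logs.
function queryWithFallback(doc, selectors) {
  for (const sel of selectors) {
    const el = doc.querySelector(sel);
    if (el) return { element: el, selector: sel };
  }
  return { element: null, selector: null }; // all failed: log as selector failure
}
```

When the returned `selector` is not the first entry in the chain, the primary selector has broken and the extraction schema should be reviewed.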
Output Format Templates
JSON Template
```json
{
  "extraction_metadata": {
    "source_url": "https://example.com/products",
    "extraction_date": "2026-01-12T10:30:00Z",
    "record_count": 150,
    "pages_scraped": 5
  },
  "data": [
    {
      "name": "Product Name",
      "price": 29.99,
      "url": "https://example.com/product/1",
      "image": "https://example.com/images/1.jpg"
    }
  ]
}
```
CSV Template
```csv
name,price,url,image
"Product Name",29.99,"https://example.com/product/1","https://example.com/images/1.jpg"
```
Markdown Table Template
```markdown
| Name | Price | URL |
|------|-------|-----|
| Product Name | $29.99 | [Link](https://example.com/product/1) |
```
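Extracted records can be rendered into this table shape programmatically. A sketch (`toMarkdownTable` is a hypothetical helper; the minimal `|---|` separator row is an assumption, equivalent to the padded separators in the template):

```javascript
// Hypothetical helper: renders extracted records as a Markdown table.
function toMarkdownTable(rows, headers) {
  const head = `| ${headers.join(' | ')} |`;
  const sep = `|${headers.map(() => '---').join('|')}|`;
  const body = rows.map(r => `| ${headers.map(h => r[h] ?? '').join(' | ')} |`);
  return [head, sep, ...body].join('\n');
}
```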
Maintenance & Updates
Version History:
- v1.0.0 (2026-01-12): Initial release with READ-only focus, pagination handling, rate limiting
Feedback Loop:
- Loop 1.5 (Session): Store learnings from extractions
- Loop 3 (Meta-Loop): Aggregate patterns every 3 days
- Update LEARNED PATTERNS section with new discoveries
Continuous Improvement:
- Monitor extraction success rate via Memory MCP
- Identify common selector failures for pattern updates
- Optimize extraction strategies based on site patterns
<promise>WEB_SCRAPING_VERILINGUA_VERIX_COMPLIANT</promise>