
Web Scraping

网页抓取

Evidential Frame Activation

证据框架激活

Source verification mode enabled.
[assert|neutral] Systematic web data extraction workflow with sequential-thinking planning phase [ground:skill-design:2026-01-12] [conf:0.85] [state:provisional]
源验证模式已激活。
[assert|neutral] 结合sequential-thinking规划阶段的系统化网页数据提取工作流 [ground:skill-design:2026-01-12] [conf:0.85] [state:provisional]

Overview

概述

Web scraping enables structured data extraction from web pages through the claude-in-chrome MCP server. This skill enforces a PLAN-READ-TRANSFORM pattern focused exclusively on data extraction without page modifications.
Philosophy: Data extraction fails when executed without understanding page structure. By mandating structure analysis before extraction, this skill improves data quality and reduces selector errors.
Methodology: Six-phase execution with emphasis on READ operations:
  1. PLAN Phase: Sequential-thinking MCP analyzes extraction requirements
  2. NAVIGATE Phase: Navigate to target URL(s)
  3. ANALYZE Phase: Understand page structure via read_page
  4. EXTRACT Phase: Pull data using get_page_text and javascript_tool
  5. TRANSFORM Phase: Convert to structured format (JSON/CSV/Markdown)
  6. STORE Phase: Persist to Memory MCP
Key Differentiator from browser-automation:
  • READ-ONLY focus (no form submissions, no button clicks for actions)
  • Emphasis on data transformation and output formats
  • Pagination handling for multi-page datasets
  • Rate limiting awareness for bulk extraction
  • No account creation, no purchases, no write operations
Value Proposition: Transform unstructured web content into clean, structured datasets suitable for analysis, reporting, or system integration.
网页抓取可通过claude-in-chrome MCP服务器从网页中提取结构化数据。该技能遵循PLAN-READ-TRANSFORM模式,仅专注于数据提取,不修改页面内容。
核心理念:若不了解页面结构就执行数据提取,很可能失败。通过强制要求在提取前进行结构分析,该技能可提升数据质量并减少选择器错误。
方法论:分为六个执行阶段,重点关注READ操作:
  1. PLAN阶段:sequential-thinking MCP分析提取需求
  2. NAVIGATE阶段:导航至目标URL
  3. ANALYZE阶段:通过read_page了解页面结构
  4. EXTRACT阶段:使用get_page_text和javascript_tool提取数据
  5. TRANSFORM阶段:转换为结构化格式(JSON/CSV/Markdown)
  6. STORE阶段:存储至Memory MCP
与browser-automation的关键区别:
  • 仅专注于只读操作(不提交表单、不点击按钮执行操作)
  • 强调数据转换和输出格式
  • 支持多页数据集的分页处理
  • 批量提取时考虑速率限制
  • 不创建账户、不进行购买、不执行写入操作
价值主张:将非结构化网页内容转换为干净、结构化的数据集,适用于分析、报告或系统集成。

When to Use This Skill

何时使用该技能

Trigger Thresholds:
| Data Volume | Recommendation |
| --- | --- |
| Single element | Use get_page_text directly (too simple) |
| 5-50 elements | Consider this skill |
| 50+ elements or pagination | Mandatory use of this skill |
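The thresholds above can be sketched as a small decision helper. This is illustrative only; the function name and return labels are assumptions, not part of the skill's API.

```javascript
// Illustrative mapping of data volume to the recommendation table above.
// Pagination always escalates to mandatory use of this skill.
function recommendApproach(elementCount, hasPagination = false) {
  if (hasPagination || elementCount >= 50) return "web-scraping (mandatory)";
  if (elementCount >= 5) return "web-scraping (consider)";
  return "get_page_text directly";
}
```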
Primary Use Cases:
  • Product catalog extraction (prices, descriptions, images)
  • Article content scraping (headlines, body text, metadata)
  • Table data extraction (financial data, statistics, listings)
  • Directory scraping (contact info, business listings)
  • Research data collection (citations, abstracts, datasets)
  • Price monitoring (competitive analysis, market research)
Apply When:
  • Task requires structured output (JSON, CSV, Markdown table)
  • Multiple pages need to be traversed (pagination)
  • Data needs normalization or transformation
  • Consistent data schema is required
  • Historical data collection for trend analysis
触发阈值:
| 数据量 | 建议 |
| --- | --- |
| 单个元素 | 直接使用get_page_text(过于简单) |
| 5-50个元素 | 考虑使用该技能 |
| 50+元素或涉及分页 | 必须使用该技能 |
主要使用场景:
  • 产品目录提取(价格、描述、图片)
  • 文章内容抓取(标题、正文、元数据)
  • 表格数据提取(财务数据、统计数据、列表)
  • 目录抓取(联系信息、商家列表)
  • 研究数据收集(引用、摘要、数据集)
  • 价格监控(竞品分析、市场调研)
适用情况:
  • 任务需要结构化输出(JSON、CSV、Markdown表格)
  • 需要遍历多个页面(分页)
  • 数据需要标准化或转换
  • 需要一致的数据schema
  • 收集历史数据用于趋势分析

When NOT to Use This Skill

何时不使用该技能

  • Form submissions or account actions (use browser-automation)
  • Interactive workflows requiring clicks (use browser-automation)
  • Single-page simple text extraction (use get_page_text directly)
  • API-accessible data (use direct API calls instead)
  • Real-time streaming data (use WebSocket or API)
  • Content behind authentication requiring login (use browser-automation first)
  • Sites with explicit anti-scraping measures (respect robots.txt)
  • 表单提交或账户操作(使用browser-automation)
  • 需要点击的交互式工作流(使用browser-automation)
  • 单页简单文本提取(直接使用get_page_text)
  • 可通过API获取的数据(直接调用API)
  • 实时流数据(使用WebSocket或API)
  • 需要登录的受保护内容(先使用browser-automation)
  • 明确反爬措施的网站(遵守robots.txt)

Core Principles

核心原则

Principle 1: Plan Before Extract

原则1:先规划后提取

Mandate: ALWAYS invoke sequential-thinking MCP before data extraction.
Rationale: Web pages have complex DOM structures. Planning reveals data locations, identifies patterns, and anticipates pagination before extraction begins.
In Practice:
  • Analyze page structure via read_page first
  • Map data field locations to selectors
  • Identify pagination patterns
  • Plan output schema before extraction
  • Define data validation rules
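The practice points above can be enforced with a pre-extraction gate. A minimal sketch, assuming a hypothetical plan object shape (`fields`, `pagination`, `output_format`, `validation_rules` are illustrative names, not a defined contract):

```javascript
// Hypothetical gate: refuse to start extraction until the PLAN phase
// output covers every practice point above (selectors mapped, pagination
// identified, output schema and validation rules defined).
function planIsComplete(plan) {
  return Boolean(
    plan &&
    Array.isArray(plan.fields) && plan.fields.length > 0 &&  // field-to-selector map
    plan.pagination &&                                        // pagination pattern identified
    plan.output_format &&                                     // output schema defined
    Array.isArray(plan.validation_rules)                      // validation rules defined
  );
}
```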
强制要求:数据提取前必须调用sequential-thinking MCP。
理由:网页具有复杂的DOM结构。规划可在提取前确定数据位置、识别模式并预判分页情况。
实践方式:
  • 先通过read_page分析页面结构
  • 将数据字段位置映射到选择器
  • 识别分页模式
  • 提取前定义输出schema
  • 定义数据验证规则

Principle 2: Read-Only Operations

原则2:只读操作

Mandate: NEVER modify page state during extraction.
Rationale: Web scraping is about data retrieval, not interaction. Write operations (form submissions, clicks that change state) belong to browser-automation.
Allowed Operations:
  • navigate (to target URLs)
  • read_page (DOM analysis)
  • get_page_text (text extraction)
  • javascript_tool (read-only DOM queries)
  • computer screenshot (visual verification)
  • scroll (for lazy-loaded content)
Prohibited Operations:
  • form_input (except for search filters if needed)
  • computer left_click on submit buttons
  • Any action that modifies page state
  • Account creation or login actions
强制要求:提取过程中不得修改页面状态。
理由:网页抓取的核心是数据检索,而非交互。写入操作(表单提交、改变状态的点击)属于browser-automation的范畴。
允许的操作:
  • navigate(导航至目标URL)
  • read_page(DOM分析)
  • get_page_text(文本提取)
  • javascript_tool(只读DOM查询)
  • computer screenshot(视觉验证)
  • scroll(加载懒加载内容)
禁止的操作:
  • form_input(除非需要搜索过滤)
  • computer left_click点击提交按钮
  • 任何修改页面状态的操作
  • 账户创建或登录操作

Principle 3: Structured Output First

原则3:优先结构化输出

Mandate: Define output schema before extraction begins.
Rationale: Knowing the target format guides extraction strategy and ensures consistent data quality across all pages.
Output Formats:
| Format | Use Case | Example |
| --- | --- | --- |
| JSON | API integration, databases | `[{"name": "...", "price": 99.99}]` |
| CSV | Spreadsheet analysis | `name,price,url\n"Product",99.99,"https://..."` |
| Markdown Table | Documentation, reports | |
强制要求:提取前定义输出schema。
理由:明确目标格式可指导提取策略,确保所有页面的数据质量一致。
输出格式:
| 格式 | 使用场景 | 示例 |
| --- | --- | --- |
| JSON | API集成、数据库 | `[{"name": "...", "price": 99.99}]` |
| CSV | 电子表格分析 | `name,price,url\n"Product",99.99,"https://..."` |
| Markdown表格 | 文档、报告 | |

Principle 4: Pagination Awareness

原则4:分页感知

Mandate: Detect and handle pagination patterns before starting extraction.
Rationale: Most valuable datasets span multiple pages. Failing to handle pagination yields incomplete data.
Pagination Patterns:
| Pattern | Detection | Handling |
| --- | --- | --- |
| Numbered pages | Links with `?page=N` or `/page/N` | Iterate through page numbers |
| Next button | "Next", ">" or arrow elements | Click until disabled/missing |
| Infinite scroll | Content loads on scroll | Scroll + wait + check for new content |
| Load more | "Load more" button | Click until no new content |
| Cursor-based | `?cursor=abc123` in URL | Extract cursor, iterate |
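Only the two URL-visible patterns in the table above can be classified from the URL alone; next-button, load-more, and infinite scroll require DOM inspection (read_page/find). A minimal sketch of that partial classifier (function name is illustrative):

```javascript
// Partial detector: classifies pagination patterns visible in the URL.
// Returns "unknown" when DOM inspection would be needed instead.
function detectUrlPagination(url) {
  if (/[?&]cursor=/.test(url)) return "cursor";
  if (/[?&]page=\d+/.test(url) || /\/page\/\d+/.test(url)) return "numbered";
  return "unknown";
}
```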
强制要求:提取前检测并处理分页模式。
理由:最有价值的数据集通常跨多个页面。不处理分页会导致数据不完整。
分页模式:
| 模式 | 检测方式 | 处理方式 |
| --- | --- | --- |
| 编号页面 | 链接包含`?page=N`或`/page/N` | 遍历页面编号 |
| 下一页按钮 | 包含"Next"、">"或箭头元素 | 点击直至按钮禁用/消失 |
| 无限滚动 | 滚动时加载内容 | 滚动 + 等待 + 检查新内容 |
| 加载更多 | "Load more"按钮 | 点击直至无新内容 |
| 基于游标 | URL包含`?cursor=abc123` | 提取游标并遍历 |

Principle 5: Rate Limiting Respect

原则5:遵守速率限制

Mandate: Implement delays between requests to avoid overwhelming servers.
Rationale: Aggressive scraping can trigger blocks, harm server performance, and violate terms of service.
Guidelines:
| Request Type | Minimum Delay |
| --- | --- |
| Same domain | 1-2 seconds |
| Pagination | 2-3 seconds |
| Bulk extraction (100+ pages) | 3-5 seconds |
| If rate-limited | Exponential backoff (start 10s) |
Implementation:
javascript
// Use computer wait action between page loads
await mcp__claude-in-chrome__computer({
  action: "wait",
  duration: 2,  // seconds
  tabId: tabId
});
强制要求:请求之间设置延迟,避免压垮服务器。
理由:激进的抓取会触发封禁、影响服务器性能并违反服务条款。
指导方针:
| 请求类型 | 最小延迟 |
| --- | --- |
| 同一域名 | 1-2秒 |
| 分页 | 2-3秒 |
| 批量提取(100+页面) | 3-5秒 |
| 触发速率限制时 | 指数退避(起始10秒) |
实现代码:
javascript
// Use computer wait action between page loads
await mcp__claude-in-chrome__computer({
  action: "wait",
  duration: 2,  // seconds
  tabId: tabId
});

Production Guardrails

生产环境防护措施

MCP Preflight Check Protocol

MCP预检查协议

Before executing any web scraping workflow, run preflight validation:
Preflight Sequence:
javascript
async function preflightCheck() {
  const checks = {
    sequential_thinking: false,
    claude_in_chrome: false,
    memory_mcp: false
  };

  // Check sequential-thinking MCP (required)
  try {
    await mcp__sequential-thinking__sequentialthinking({
      thought: "Preflight check - verifying MCP availability for web scraping",
      thoughtNumber: 1,
      totalThoughts: 1,
      nextThoughtNeeded: false
    });
    checks.sequential_thinking = true;
  } catch (error) {
    console.error("Sequential-thinking MCP unavailable:", error);
    throw new Error("CRITICAL: sequential-thinking MCP required but unavailable");
  }

  // Check claude-in-chrome MCP (required)
  try {
    const context = await mcp__claude-in-chrome__tabs_context_mcp({});
    checks.claude_in_chrome = true;
  } catch (error) {
    console.error("Claude-in-chrome MCP unavailable:", error);
    throw new Error("CRITICAL: claude-in-chrome MCP required but unavailable");
  }

  // Check memory-mcp (optional but recommended).
  // No lightweight probe is exposed here; assume availability and let the
  // STORE phase warn and skip persistence if memory_store later fails.
  checks.memory_mcp = true;

  return checks;
}
执行任何网页抓取工作流前,运行预验证:
预检查流程:
javascript
async function preflightCheck() {
  const checks = {
    sequential_thinking: false,
    claude_in_chrome: false,
    memory_mcp: false
  };

  // Check sequential-thinking MCP (required)
  try {
    await mcp__sequential-thinking__sequentialthinking({
      thought: "Preflight check - verifying MCP availability for web scraping",
      thoughtNumber: 1,
      totalThoughts: 1,
      nextThoughtNeeded: false
    });
    checks.sequential_thinking = true;
  } catch (error) {
    console.error("Sequential-thinking MCP unavailable:", error);
    throw new Error("CRITICAL: sequential-thinking MCP required but unavailable");
  }

  // Check claude-in-chrome MCP (required)
  try {
    const context = await mcp__claude-in-chrome__tabs_context_mcp({});
    checks.claude_in_chrome = true;
  } catch (error) {
    console.error("Claude-in-chrome MCP unavailable:", error);
    throw new Error("CRITICAL: claude-in-chrome MCP required but unavailable");
  }

  // Check memory-mcp (optional but recommended).
  // No lightweight probe is exposed here; assume availability and let the
  // STORE phase warn and skip persistence if memory_store later fails.
  checks.memory_mcp = true;

  return checks;
}

Error Handling Framework

错误处理框架

Error Categories:
| Category | Example | Recovery Strategy |
| --- | --- | --- |
| MCP_UNAVAILABLE | Claude-in-chrome offline | ABORT with clear message |
| PAGE_NOT_FOUND | 404 error | Log, skip page, continue with others |
| ELEMENT_NOT_FOUND | Selector changed | Try alternative selectors, log schema drift |
| RATE_LIMITED | 429 response | Exponential backoff, wait, retry |
| CONTENT_CHANGED | Dynamic content not loaded | Wait longer, scroll to trigger load |
| PAGINATION_END | No more pages | Normal termination, return collected data |
| BLOCKED | Access denied, CAPTCHA | ABORT, notify user, do not bypass |
Error Recovery Pattern:
javascript
async function extractWithRetry(extractor, context, maxRetries = 3) {
  let lastError = null;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const data = await extractor(context);
      return data;
    } catch (error) {
      lastError = error;
      console.error(`Extraction attempt ${attempt} failed:`, error.message);

      if (isRateLimitError(error)) {
        const delay = Math.pow(2, attempt) * 5; // 10s, 20s, 40s
        await sleep(delay * 1000);
      } else if (!isRecoverableError(error)) {
        break;
      }
    }
  }
  throw lastError;
}

function isRecoverableError(error) {
  const nonRecoverable = [
    "CRITICAL: sequential-thinking MCP required",
    "CRITICAL: claude-in-chrome MCP required",
    "Access denied",
    "CAPTCHA",
    "Blocked"
  ];
  return !nonRecoverable.some(msg => error.message.includes(msg));
}
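extractWithRetry above calls isRateLimitError and sleep without defining them. A minimal sketch, assuming rate-limit failures surface as error messages containing "429" or "rate limit" (adjust to the actual MCP error shape):

```javascript
// Minimal sketch of the isRateLimitError helper referenced above.
// Assumption: rate-limit failures carry "429" or "rate limit" in the message.
function isRateLimitError(error) {
  const msg = (error && error.message) || "";
  return msg.includes("429") || /rate limit/i.test(msg);
}

// sleep is also referenced above; standard promise-based delay in milliseconds.
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```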
错误类别:
| 类别 | 示例 | 恢复策略 |
| --- | --- | --- |
| MCP_UNAVAILABLE | Claude-in-chrome离线 | 终止并显示清晰信息 |
| PAGE_NOT_FOUND | 404错误 | 记录日志、跳过页面、继续处理其他页面 |
| ELEMENT_NOT_FOUND | 选择器变更 | 尝试备选选择器、记录schema漂移 |
| RATE_LIMITED | 429响应 | 指数退避、等待、重试 |
| CONTENT_CHANGED | 动态内容未加载 | 延长等待时间、滚动触发加载 |
| PAGINATION_END | 无更多页面 | 正常终止、返回已收集数据 |
| BLOCKED | 访问被拒绝、验证码 | 终止、通知用户、不绕过限制 |
错误恢复模式:
javascript
async function extractWithRetry(extractor, context, maxRetries = 3) {
  let lastError = null;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const data = await extractor(context);
      return data;
    } catch (error) {
      lastError = error;
      console.error(`Extraction attempt ${attempt} failed:`, error.message);

      if (isRateLimitError(error)) {
        const delay = Math.pow(2, attempt) * 5; // 10s, 20s, 40s
        await sleep(delay * 1000);
      } else if (!isRecoverableError(error)) {
        break;
      }
    }
  }
  throw lastError;
}

function isRecoverableError(error) {
  const nonRecoverable = [
    "CRITICAL: sequential-thinking MCP required",
    "CRITICAL: claude-in-chrome MCP required",
    "Access denied",
    "CAPTCHA",
    "Blocked"
  ];
  return !nonRecoverable.some(msg => error.message.includes(msg));
}

Data Validation Framework

数据验证框架

Purpose: Ensure extracted data meets quality standards before storage.
Validation Rules:
javascript
const VALIDATION_RULES = {
  required_fields: ["name", "url"],  // Must be present
  field_types: {
    price: "number",
    name: "string",
    url: "url",
    date: "date"
  },
  constraints: {
    price: { min: 0, max: 1000000 },
    name: { minLength: 1, maxLength: 500 },
    url: { pattern: /^https?:\/\// }
  }
};

function validateRecord(record, rules) {
  const errors = [];

  // Check required fields
  for (const field of rules.required_fields) {
    if (!record[field]) {
      errors.push(`Missing required field: ${field}`);
    }
  }

  // Check field types
  for (const [field, type] of Object.entries(rules.field_types)) {
    if (record[field] && !isValidType(record[field], type)) {
      errors.push(`Invalid type for ${field}: expected ${type}`);
    }
  }

  return { valid: errors.length === 0, errors };
}
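validateRecord above references isValidType without defining it. A minimal sketch covering the four types used in VALIDATION_RULES; extend as needed for other schemas:

```javascript
// Minimal sketch of the isValidType helper referenced by validateRecord.
// Covers the four types used in VALIDATION_RULES ("number", "string",
// "url", "date"); unknown types pass through unvalidated.
function isValidType(value, type) {
  switch (type) {
    case "number": return typeof value === "number" && Number.isFinite(value);
    case "string": return typeof value === "string";
    case "url":    return typeof value === "string" && /^https?:\/\//.test(value);
    case "date":   return !Number.isNaN(Date.parse(value));
    default:       return true;
  }
}
```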

目的:存储前确保提取的数据符合质量标准。
验证规则:
javascript
const VALIDATION_RULES = {
  required_fields: ["name", "url"],  // Must be present
  field_types: {
    price: "number",
    name: "string",
    url: "url",
    date: "date"
  },
  constraints: {
    price: { min: 0, max: 1000000 },
    name: { minLength: 1, maxLength: 500 },
    url: { pattern: /^https?:\/\// }
  }
};

function validateRecord(record, rules) {
  const errors = [];

  // Check required fields
  for (const field of rules.required_fields) {
    if (!record[field]) {
      errors.push(`Missing required field: ${field}`);
    }
  }

  // Check field types
  for (const [field, type] of Object.entries(rules.field_types)) {
    if (record[field] && !isValidType(record[field], type)) {
      errors.push(`Invalid type for ${field}: expected ${type}`);
    }
  }

  return { valid: errors.length === 0, errors };
}

Main Workflow

主要工作流

Phase 1: Planning (MANDATORY)

阶段1:规划(必须执行)

Purpose: Analyze extraction requirements and define data schema.
Process:
  1. Invoke sequential-thinking MCP
  2. Define target data fields and types
  3. Plan selector strategy
  4. Identify pagination pattern
  5. Define output format
  6. Plan rate limiting strategy
Planning Questions:
  • What data fields need to be extracted?
  • What is the expected output format (JSON/CSV/Markdown)?
  • Is pagination involved? What pattern?
  • What validation rules apply?
  • What rate limiting is appropriate?
Output Contract:
yaml
extraction_plan:
  target_url: string
  output_format: "json" | "csv" | "markdown"
  schema:
    fields:
      - name: string
        selector: string
        type: string | number | url | date
        required: boolean
  pagination:
    type: "numbered" | "next_button" | "infinite_scroll" | "load_more" | "none"
    max_pages: number
    delay_seconds: number
  rate_limit:
    delay_between_requests: number
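A filled-in instance of the contract above may make the shape concrete; the URL, selectors, and values here are illustrative only:

```yaml
# Illustrative example only - URL and selectors are hypothetical
extraction_plan:
  target_url: "https://example.com/products"
  output_format: "json"
  schema:
    fields:
      - name: "name"
        selector: ".product-title"
        type: string
        required: true
      - name: "price"
        selector: ".price"
        type: number
        required: true
  pagination:
    type: "numbered"
    max_pages: 10
    delay_seconds: 2
  rate_limit:
    delay_between_requests: 2
```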
目的:分析提取需求并定义数据schema。
流程:
  1. 调用sequential-thinking MCP
  2. 定义目标数据字段和类型
  3. 规划选择器策略
  4. 识别分页模式
  5. 定义输出格式
  6. 规划速率限制策略
规划问题:
  • 需要提取哪些数据字段?
  • 预期输出格式是什么(JSON/CSV/Markdown)?
  • 是否涉及分页?是什么模式?
  • 适用哪些验证规则?
  • 合适的速率限制是多少?
输出约定:
yaml
extraction_plan:
  target_url: string
  output_format: "json" | "csv" | "markdown"
  schema:
    fields:
      - name: string
        selector: string
        type: string | number | url | date
        required: boolean
  pagination:
    type: "numbered" | "next_button" | "infinite_scroll" | "load_more" | "none"
    max_pages: number
    delay_seconds: number
  rate_limit:
    delay_between_requests: number

Phase 2: Navigation

阶段2:导航

Purpose: Navigate to target URL and establish context.
Process:
  1. Get tab context (tabs_context_mcp)
  2. Create new tab for scraping (tabs_create_mcp)
  3. Navigate to starting URL
  4. Take initial screenshot for verification
  5. Wait for page load completion
Implementation:
javascript
// 1. Get existing context
const context = await mcp__claude-in-chrome__tabs_context_mcp({});

// 2. Create dedicated tab
const newTab = await mcp__claude-in-chrome__tabs_create_mcp({});
const tabId = newTab.tabId;

// 3. Navigate to target
await mcp__claude-in-chrome__navigate({
  url: targetUrl,
  tabId: tabId
});

// 4. Wait for page load
await mcp__claude-in-chrome__computer({
  action: "wait",
  duration: 2,
  tabId: tabId
});

// 5. Screenshot initial state
await mcp__claude-in-chrome__computer({
  action: "screenshot",
  tabId: tabId
});
目的:导航至目标URL并建立上下文。
流程:
  1. 获取标签页上下文(tabs_context_mcp)
  2. 创建用于抓取的新标签页(tabs_create_mcp)
  3. 导航至起始URL
  4. 截取初始截图用于验证
  5. 等待页面加载完成
实现代码:
javascript
// 1. Get existing context
const context = await mcp__claude-in-chrome__tabs_context_mcp({});

// 2. Create dedicated tab
const newTab = await mcp__claude-in-chrome__tabs_create_mcp({});
const tabId = newTab.tabId;

// 3. Navigate to target
await mcp__claude-in-chrome__navigate({
  url: targetUrl,
  tabId: tabId
});

// 4. Wait for page load
await mcp__claude-in-chrome__computer({
  action: "wait",
  duration: 2,
  tabId: tabId
});

// 5. Screenshot initial state
await mcp__claude-in-chrome__computer({
  action: "screenshot",
  tabId: tabId
});

Phase 3: Structure Analysis

阶段3:结构分析

Purpose: Understand page DOM structure before extraction.
Process:
  1. Read page accessibility tree (read_page)
  2. Identify data container elements
  3. Map field selectors to schema
  4. Verify selectors exist
  5. Detect dynamic content patterns
Implementation:
javascript
// Read page structure
const pageTree = await mcp__claude-in-chrome__read_page({
  tabId: tabId,
  filter: "all"  // Get all elements for analysis
});

// For interactive elements only
const interactive = await mcp__claude-in-chrome__read_page({
  tabId: tabId,
  filter: "interactive"  // Buttons, links, inputs
});

// Find specific elements
const products = await mcp__claude-in-chrome__find({
  query: "product cards or listing items",
  tabId: tabId
});

const prices = await mcp__claude-in-chrome__find({
  query: "price elements",
  tabId: tabId
});
目的:提取前了解页面DOM结构。
流程:
  1. 读取页面可访问性树(read_page)
  2. 识别数据容器元素
  3. 将字段选择器映射到schema
  4. 验证选择器存在
  5. 检测动态内容模式
实现代码:
javascript
// Read page structure
const pageTree = await mcp__claude-in-chrome__read_page({
  tabId: tabId,
  filter: "all"  // Get all elements for analysis
});

// For interactive elements only
const interactive = await mcp__claude-in-chrome__read_page({
  tabId: tabId,
  filter: "interactive"  // Buttons, links, inputs
});

// Find specific elements
const products = await mcp__claude-in-chrome__find({
  query: "product cards or listing items",
  tabId: tabId
});

const prices = await mcp__claude-in-chrome__find({
  query: "price elements",
  tabId: tabId
});

Phase 4: Data Extraction

阶段4:数据提取

Purpose: Extract data according to defined schema.
Process:
  1. For simple text: Use get_page_text
  2. For structured data: Use javascript_tool
  3. For specific elements: Use find + read_page
  4. Handle lazy-loaded content with scroll
Extraction Methods:
Method 1: Full Page Text (simplest, for articles)
javascript
const text = await mcp__claude-in-chrome__get_page_text({
  tabId: tabId
});
Method 2: JavaScript DOM Query (most flexible)
javascript
const data = await mcp__claude-in-chrome__javascript_tool({
  action: "javascript_exec",
  tabId: tabId,
  text: `
    // Extract product data
    Array.from(document.querySelectorAll('.product-card')).map(card => ({
      name: card.querySelector('.product-title')?.textContent?.trim(),
      price: parseFloat(card.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '')),
      url: card.querySelector('a')?.href,
      image: card.querySelector('img')?.src
    }))
  `
});
Method 3: Table Extraction
javascript
const tableData = await mcp__claude-in-chrome__javascript_tool({
  action: "javascript_exec",
  tabId: tabId,
  text: `
    // Extract table data
    const table = document.querySelector('table');
    const headers = Array.from(table.querySelectorAll('th')).map(th => th.textContent.trim());
    const rows = Array.from(table.querySelectorAll('tbody tr')).map(tr => {
      const cells = Array.from(tr.querySelectorAll('td')).map(td => td.textContent.trim());
      return headers.reduce((obj, header, i) => ({ ...obj, [header]: cells[i] }), {});
    });
    rows
  `
});
Method 4: Handling Lazy-Loaded Content
javascript
// Scroll to load content
await mcp__claude-in-chrome__computer({
  action: "scroll",
  scroll_direction: "down",
  scroll_amount: 5,
  tabId: tabId,
  coordinate: [500, 500]
});

// Wait for content to load
await mcp__claude-in-chrome__computer({
  action: "wait",
  duration: 2,
  tabId: tabId
});

// Now extract
const data = await extractData(tabId);
目的:根据定义的schema提取数据。
流程:
  1. 简单文本:使用get_page_text
  2. 结构化数据:使用javascript_tool
  3. 特定元素:使用find + read_page
  4. 通过scroll处理懒加载内容
提取方法:
方法1:整页文本(最简单,适用于文章)
javascript
const text = await mcp__claude-in-chrome__get_page_text({
  tabId: tabId
});
方法2:JavaScript DOM查询(最灵活)
javascript
const data = await mcp__claude-in-chrome__javascript_tool({
  action: "javascript_exec",
  tabId: tabId,
  text: `
    // Extract product data
    Array.from(document.querySelectorAll('.product-card')).map(card => ({
      name: card.querySelector('.product-title')?.textContent?.trim(),
      price: parseFloat(card.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '')),
      url: card.querySelector('a')?.href,
      image: card.querySelector('img')?.src
    }))
  `
});
方法3:表格提取
javascript
const tableData = await mcp__claude-in-chrome__javascript_tool({
  action: "javascript_exec",
  tabId: tabId,
  text: `
    // Extract table data
    const table = document.querySelector('table');
    const headers = Array.from(table.querySelectorAll('th')).map(th => th.textContent.trim());
    const rows = Array.from(table.querySelectorAll('tbody tr')).map(tr => {
      const cells = Array.from(tr.querySelectorAll('td')).map(td => td.textContent.trim());
      return headers.reduce((obj, header, i) => ({ ...obj, [header]: cells[i] }), {});
    });
    rows
  `
});
方法4:处理懒加载内容
javascript
// Scroll to load content
await mcp__claude-in-chrome__computer({
  action: "scroll",
  scroll_direction: "down",
  scroll_amount: 5,
  tabId: tabId,
  coordinate: [500, 500]
});

// Wait for content to load
await mcp__claude-in-chrome__computer({
  action: "wait",
  duration: 2,
  tabId: tabId
});

// Now extract
const data = await extractData(tabId);

Phase 5: Pagination Handling

阶段5:分页处理

Purpose: Navigate through multiple pages and collect all data.
Pagination Patterns:
Pattern A: Numbered Pages
javascript
async function extractNumberedPages(baseUrl, maxPages, tabId) {
  const allData = [];

  for (let page = 1; page <= maxPages; page++) {
    const url = `${baseUrl}?page=${page}`;
    await mcp__claude-in-chrome__navigate({ url, tabId });
    await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });

    const pageData = await extractPageData(tabId);
    if (pageData.length === 0) break;  // No more data

    allData.push(...pageData);

    // Rate limiting
    await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });
  }

  return allData;
}
Pattern B: Next Button
javascript
async function extractWithNextButton(tabId) {
  const allData = [];
  let hasNext = true;

  while (hasNext) {
    const pageData = await extractPageData(tabId);
    allData.push(...pageData);

    // Find next button
    const nextButton = await mcp__claude-in-chrome__find({
      query: "next page button or pagination next link",
      tabId: tabId
    });

    if (nextButton && nextButton.elements && nextButton.elements.length > 0) {
      await mcp__claude-in-chrome__computer({
        action: "left_click",
        ref: nextButton.elements[0].ref,
        tabId: tabId
      });
      await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });
    } else {
      hasNext = false;  // No more pages
    }
  }

  return allData;
}
Pattern C: Infinite Scroll
javascript
async function extractInfiniteScroll(tabId, maxScrolls = 20) {
  const allData = [];
  let previousCount = 0;
  let scrollCount = 0;

  while (scrollCount < maxScrolls) {
    const pageData = await extractPageData(tabId);

    if (pageData.length === previousCount) {
      // No new content loaded, done
      break;
    }

    allData.length = 0;
    allData.push(...pageData);
    previousCount = pageData.length;

    // Scroll down
    await mcp__claude-in-chrome__computer({
      action: "scroll",
      scroll_direction: "down",
      scroll_amount: 5,
      tabId: tabId,
      coordinate: [500, 500]
    });

    await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });
    scrollCount++;
  }

  return allData;
}
目的:遍历多个页面并收集所有数据。
分页模式:
模式A:编号页面
javascript
async function extractNumberedPages(baseUrl, maxPages, tabId) {
  const allData = [];

  for (let page = 1; page <= maxPages; page++) {
    const url = `${baseUrl}?page=${page}`;
    await mcp__claude-in-chrome__navigate({ url, tabId });
    await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });

    const pageData = await extractPageData(tabId);
    if (pageData.length === 0) break;  // No more data

    allData.push(...pageData);

    // Rate limiting
    await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });
  }

  return allData;
}
模式B:下一页按钮
javascript
async function extractWithNextButton(tabId) {
  const allData = [];
  let hasNext = true;

  while (hasNext) {
    const pageData = await extractPageData(tabId);
    allData.push(...pageData);

    // Find next button
    const nextButton = await mcp__claude-in-chrome__find({
      query: "next page button or pagination next link",
      tabId: tabId
    });

    if (nextButton && nextButton.elements && nextButton.elements.length > 0) {
      await mcp__claude-in-chrome__computer({
        action: "left_click",
        ref: nextButton.elements[0].ref,
        tabId: tabId
      });
      await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });
    } else {
      hasNext = false;  // No more pages
    }
  }

  return allData;
}
模式C:无限滚动
javascript
async function extractInfiniteScroll(tabId, maxScrolls = 20) {
  const allData = [];
  let previousCount = 0;
  let scrollCount = 0;

  while (scrollCount < maxScrolls) {
    const pageData = await extractPageData(tabId);

    if (pageData.length === previousCount) {
      // No new content loaded, done
      break;
    }

    allData.length = 0;
    allData.push(...pageData);
    previousCount = pageData.length;

    // Scroll down
    await mcp__claude-in-chrome__computer({
      action: "scroll",
      scroll_direction: "down",
      scroll_amount: 5,
      tabId: tabId,
      coordinate: [500, 500]
    });

    await mcp__claude-in-chrome__computer({ action: "wait", duration: 2, tabId });
    scrollCount++;
  }

  return allData;
}

Phase 6: Data Transformation

阶段6:数据转换

Purpose: Convert extracted data to requested output format.
JSON Output:
javascript
function toJSON(data, pretty = true) {
  return pretty ? JSON.stringify(data, null, 2) : JSON.stringify(data);
}
CSV Output:
javascript
function toCSV(data) {
  if (data.length === 0) return "";

  const headers = Object.keys(data[0]);
  const headerRow = headers.join(",");

  const dataRows = data.map(row =>
    headers.map(h => {
      const val = row[h];
      if (val === null || val === undefined) return "";  // avoid "undefined" cells
      // Escape quotes; wrap in quotes if value contains comma, quote, or newline
      if (typeof val === "string" && /[",\n]/.test(val)) {
        return `"${val.replace(/"/g, '""')}"`;
      }
      return val;
    }).join(",")
  );

  return [headerRow, ...dataRows].join("\n");
}
Markdown Table Output:
javascript
function toMarkdownTable(data) {
  if (data.length === 0) return "";

  const headers = Object.keys(data[0]);
  const headerRow = `| ${headers.join(" | ")} |`;
  const separator = `| ${headers.map(() => "---").join(" | ")} |`;

  const dataRows = data.map(row =>
    // Escape literal pipes so cell content cannot break the table layout;
    // ?? (not ||) preserves legitimate 0 and false values
    `| ${headers.map(h => String(row[h] ?? "").replace(/\|/g, "\\|")).join(" | ")} |`
  );

  return [headerRow, separator, ...dataRows].join("\n");
}
目的:将提取的数据转换为请求的输出格式。
JSON输出:
javascript
function toJSON(data, pretty = true) {
  return pretty ? JSON.stringify(data, null, 2) : JSON.stringify(data);
}
CSV输出:
javascript
function toCSV(data) {
  if (data.length === 0) return "";

  const headers = Object.keys(data[0]);
  const headerRow = headers.join(",");

  const dataRows = data.map(row =>
    headers.map(h => {
      const val = row[h];
      if (val === null || val === undefined) return "";  // avoid "undefined" cells
      // Escape quotes; wrap in quotes if value contains comma, quote, or newline
      if (typeof val === "string" && /[",\n]/.test(val)) {
        return `"${val.replace(/"/g, '""')}"`;
      }
      return val;
    }).join(",")
  );

  return [headerRow, ...dataRows].join("\n");
}
Markdown表格输出:
javascript
function toMarkdownTable(data) {
  if (data.length === 0) return "";

  const headers = Object.keys(data[0]);
  const headerRow = `| ${headers.join(" | ")} |`;
  const separator = `| ${headers.map(() => "---").join(" | ")} |`;

  const dataRows = data.map(row =>
    // Escape literal pipes so cell content cannot break the table layout;
    // ?? (not ||) preserves legitimate 0 and false values
    `| ${headers.map(h => String(row[h] ?? "").replace(/\|/g, "\\|")).join(" | ")} |`
  );

  return [headerRow, separator, ...dataRows].join("\n");
}

Phase 7: Storage

阶段7:存储

Purpose: Persist extracted data to Memory MCP for future use.
Implementation:
javascript
// Store in Memory MCP
await memory_store({
  namespace: `skills/tooling/web-scraping/${projectName}/${timestamp}`,
  data: {
    extraction_metadata: {
      source_url: targetUrl,
      extraction_date: new Date().toISOString(),
      record_count: data.length,
      output_format: outputFormat,
      pages_scraped: pageCount
    },
    extracted_data: data
  },
  tags: {
    WHO: `web-scraping-${sessionId}`,
    WHEN: new Date().toISOString(),
    PROJECT: projectName,
    WHY: "data-extraction",
    data_type: dataType,
    record_count: data.length
  }
});
目的:将提取的数据存储至Memory MCP以备后续使用。
实现代码:
javascript
// Store in Memory MCP
await memory_store({
  namespace: `skills/tooling/web-scraping/${projectName}/${timestamp}`,
  data: {
    extraction_metadata: {
      source_url: targetUrl,
      extraction_date: new Date().toISOString(),
      record_count: data.length,
      output_format: outputFormat,
      pages_scraped: pageCount
    },
    extracted_data: data
  },
  tags: {
    WHO: `web-scraping-${sessionId}`,
    WHEN: new Date().toISOString(),
    PROJECT: projectName,
    WHY: "data-extraction",
    data_type: dataType,
    record_count: data.length
  }
});

LEARNED PATTERNS

已学习模式

This section is populated by Loop 1.5 (Session Reflection) as patterns are discovered.
该部分由Loop 1.5(会话反思)填充,用于记录发现的模式。

High Confidence [conf:0.90]

高可信度 [conf:0.90]

No patterns captured yet. Patterns will be added as the skill is used and learnings are extracted.
暂无记录的模式。使用该技能并提取经验后会添加模式。

Medium Confidence [conf:0.75]

中等可信度 [conf:0.75]

No patterns captured yet.
暂无记录的模式。

Low Confidence [conf:0.55]

低可信度 [conf:0.55]

No patterns captured yet.
暂无记录的模式。

Success Criteria

成功标准

Quality Thresholds:
  • All targeted data fields extracted (100% schema coverage)
  • Data validation passes for 95%+ of records
  • Pagination handled completely (no missing pages)
  • Output format matches requested format
  • Rate limiting respected (no 429 errors)
  • No page state modifications occurred
  • Extraction completed within reasonable time (≤5 min per 100 records)
Failure Indicators:
  • Schema validation fails for >5% of records
  • Missing required fields in output
  • Pagination incomplete (detected more pages than scraped)
  • Rate limit errors encountered
  • Page state modified (form submitted, button clicked for action)
  • CAPTCHA or access denial encountered
质量阈值:
  • 所有目标数据字段均已提取(100% schema覆盖)
  • 95%+的记录通过数据验证
  • 分页处理完整(无遗漏页面)
  • 输出格式符合要求
  • 遵守速率限制(无429错误)
  • 未修改页面状态
  • 提取在合理时间内完成(每100条记录耗时≤5分钟)
失败指标:
  • 超过5%的记录未通过schema验证
  • 输出中缺少必填字段
  • 分页不完整(检测到的页面数多于抓取的页面数)
  • 遇到速率限制错误
  • 页面状态被修改(提交表单、点击按钮执行操作)
  • 遇到验证码或访问被拒绝
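The 95% validation threshold above can be computed with a small helper — a sketch only: validationRate is a name introduced here, the {field: type} schema shape follows Example 1's planning output, and "url" fields are checked with the WHATWG URL constructor:

```javascript
// Validate records against a simple {field: type} schema and return
// the pass rate compared against the success criteria above.
function validationRate(records, schema) {
  const isValid = (rec) =>
    Object.entries(schema).every(([field, type]) => {
      const val = rec[field];
      if (val === undefined || val === null) return false;
      if (type === "url") {
        try { new URL(val); return true; } catch { return false; }
      }
      return typeof val === type;
    });
  const passed = records.filter(isValid).length;
  return records.length ? passed / records.length : 0;
}

const rate = validationRate(
  [
    { name: "A", price: 29.99, url: "https://example.com/a" },
    { name: "B", price: "n/a", url: "https://example.com/b" } // fails: price not a number
  ],
  { name: "string", price: "number", url: "url" }
);
console.log(rate); // 0.5
```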

MCP Integration

MCP集成

Required MCPs:
| MCP | Purpose | Tools Used |
|-----|---------|------------|
| sequential-thinking | Planning phase | sequentialthinking |
| claude-in-chrome | Extraction phase | navigate, read_page, get_page_text, javascript_tool, find, computer (screenshot, scroll, wait only), tabs_context_mcp, tabs_create_mcp |
| memory-mcp | Data storage | memory_store, vector_search |
Optional MCPs:
  • filesystem (for saving extracted data locally)
必需的MCP:
| MCP | 用途 | 使用工具 |
|-----|------|----------|
| sequential-thinking | 规划阶段 | sequentialthinking |
| claude-in-chrome | 提取阶段 | navigate, read_page, get_page_text, javascript_tool, find, computer(仅截图、滚动、等待), tabs_context_mcp, tabs_create_mcp |
| memory-mcp | 数据存储 | memory_store, vector_search |
可选的MCP:
  • filesystem(用于本地保存提取的数据)

Memory Namespace

内存命名空间

Pattern:
skills/tooling/web-scraping/{project}/{timestamp}
Store:
  • Extraction schemas (field definitions)
  • Extracted datasets (structured data)
  • Selector patterns (for similar pages)
  • Error logs (selector failures, schema drift)
Retrieve:
  • Similar extraction tasks (vector search by description)
  • Proven selectors for known sites
  • Historical extraction patterns
Tagging:
json
{
  "WHO": "web-scraping-{session_id}",
  "WHEN": "ISO8601_timestamp",
  "PROJECT": "{project_name}",
  "WHY": "data-extraction",
  "source_domain": "example.com",
  "data_type": "products|articles|listings|tables",
  "record_count": 150,
  "pages_scraped": 5
}
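The namespace pattern above can be assembled with a one-line helper — a sketch: buildNamespace is a name introduced here, and the timestamp format (ISO 8601 with `:` and `.` replaced for path safety) is an assumption, since the pattern only specifies {timestamp}:

```javascript
// Build the namespace string from the pattern
// skills/tooling/web-scraping/{project}/{timestamp}.
// Assumption: timestamp is ISO 8601 with ':' and '.' swapped for '-'
// so the value is safe to use as a path segment.
function buildNamespace(project, date = new Date()) {
  const timestamp = date.toISOString().replace(/[:.]/g, "-");
  return `skills/tooling/web-scraping/${project}/${timestamp}`;
}

console.log(buildNamespace("catalog-demo", new Date("2026-01-12T10:30:00Z")));
// skills/tooling/web-scraping/catalog-demo/2026-01-12T10-30-00-000Z
```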
模式:
skills/tooling/web-scraping/{project}/{timestamp}
存储内容:
  • 提取schema(字段定义)
  • 提取的数据集(结构化数据)
  • 选择器模式(适用于相似页面)
  • 错误日志(选择器失败、schema漂移)
检索内容:
  • 相似提取任务(按描述进行向量搜索)
  • 针对已知网站的有效选择器
  • 历史提取模式
标签:
json
{
  "WHO": "web-scraping-{session_id}",
  "WHEN": "ISO8601_timestamp",
  "PROJECT": "{project_name}",
  "WHY": "data-extraction",
  "source_domain": "example.com",
  "data_type": "products|articles|listings|tables",
  "record_count": 150,
  "pages_scraped": 5
}

Examples

示例

Example 1: Product Catalog Extraction

示例1:产品目录提取

Complexity: Medium (structured data, pagination)
Task: Extract product listings from e-commerce category page
Planning Output (sequential-thinking):
Thought 1/6: Need to extract product name, price, URL, and image from catalog
Thought 2/6: Schema: {name: string, price: number, url: url, image: url}
Thought 3/6: Identify product card containers via read_page
Thought 4/6: Use javascript_tool to query all product cards
Thought 5/6: Detect pagination pattern (numbered pages with ?page=N)
Thought 6/6: Output format: JSON array, validate price > 0
Execution:
javascript
// 1. Navigate to catalog
await navigate({ url: "https://example.com/products", tabId });

// 2. Analyze structure
const structure = await read_page({ tabId, filter: "all" });

// 3. Extract data
const products = await javascript_tool({
  action: "javascript_exec",
  tabId,
  text: `
    Array.from(document.querySelectorAll('.product-card')).map(card => ({
      name: card.querySelector('.title')?.textContent?.trim(),
      price: parseFloat(card.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '')),
      url: card.querySelector('a')?.href,
      image: card.querySelector('img')?.src
    }))
  `
});

// 4. Handle pagination (repeat for each page)
// 5. Transform to JSON
const output = JSON.stringify(products, null, 2);
Result: 150 products extracted across 5 pages
Output Format: JSON
json
[
  {
    "name": "Product A",
    "price": 29.99,
    "url": "https://example.com/products/a",
    "image": "https://example.com/images/a.jpg"
  }
]
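The price-normalization expression in step 3 deserves a closer look — stripping non-numeric characters handles currency symbols and thousands separators. A standalone sketch of that one line (parsePrice is a name introduced here; the null guard for unparseable text is an addition not present in the inline version):

```javascript
// Same normalization as the extraction snippet above: strip everything
// except digits and dots, then parse. Returns null (not NaN) when the
// text carries no number at all.
function parsePrice(text) {
  const n = parseFloat(String(text).replace(/[^0-9.]/g, ""));
  return Number.isNaN(n) ? null : n;
}

console.log(parsePrice("$1,299.99"));     // 1299.99
console.log(parsePrice("Call for price")); // null
```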
复杂度: 中等(结构化数据、分页)
任务: 从电商分类页面提取产品列表
规划输出(sequential-thinking):
Thought 1/6: Need to extract product name, price, URL, and image from catalog
Thought 2/6: Schema: {name: string, price: number, url: url, image: url}
Thought 3/6: Identify product card containers via read_page
Thought 4/6: Use javascript_tool to query all product cards
Thought 5/6: Detect pagination pattern (numbered pages with ?page=N)
Thought 6/6: Output format: JSON array, validate price > 0
执行代码:
javascript
// 1. Navigate to catalog
await navigate({ url: "https://example.com/products", tabId });

// 2. Analyze structure
const structure = await read_page({ tabId, filter: "all" });

// 3. Extract data
const products = await javascript_tool({
  action: "javascript_exec",
  tabId,
  text: `
    Array.from(document.querySelectorAll('.product-card')).map(card => ({
      name: card.querySelector('.title')?.textContent?.trim(),
      price: parseFloat(card.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '')),
      url: card.querySelector('a')?.href,
      image: card.querySelector('img')?.src
    }))
  `
});

// 4. Handle pagination (repeat for each page)
// 5. Transform to JSON
const output = JSON.stringify(products, null, 2);
结果: 跨5个页面提取了150个产品
输出格式: JSON
json
[
  {
    "name": "Product A",
    "price": 29.99,
    "url": "https://example.com/products/a",
    "image": "https://example.com/images/a.jpg"
  }
]

Example 2: Article Content Scraping

示例2:文章内容抓取

Complexity: Simple (single page, text focus)
Task: Extract article headline, author, date, and body text
Planning Output (sequential-thinking):
Thought 1/4: Single page extraction, no pagination needed
Thought 2/4: Schema: {headline: string, author: string, date: date, body: string}
Thought 3/4: Use get_page_text for full content, javascript_tool for metadata
Thought 4/4: Output format: Markdown for readability
Execution:
javascript
// 1. Navigate to article
await navigate({ url: "https://news.example.com/article/123", tabId });

// 2. Extract metadata
const metadata = await javascript_tool({
  action: "javascript_exec",
  tabId,
  text: `({
    headline: document.querySelector('h1')?.textContent?.trim(),
    author: document.querySelector('.author')?.textContent?.trim(),
    date: document.querySelector('time')?.getAttribute('datetime')
  })`
});

// 3. Extract body text
const body = await get_page_text({ tabId });

// 4. Combine
const article = { ...metadata, body };
Result: Article content extracted
Output Format: Markdown
markdown
# Article Headline
Author: Jane Doe
Date: 2026-01-12

Article body text here...
复杂度: 简单(单页、文本为主)
任务: 提取文章标题、作者、日期和正文
规划输出(sequential-thinking):
Thought 1/4: Single page extraction, no pagination needed
Thought 2/4: Schema: {headline: string, author: string, date: date, body: string}
Thought 3/4: Use get_page_text for full content, javascript_tool for metadata
Thought 4/4: Output format: Markdown for readability
执行代码:
javascript
// 1. Navigate to article
await navigate({ url: "https://news.example.com/article/123", tabId });

// 2. Extract metadata
const metadata = await javascript_tool({
  action: "javascript_exec",
  tabId,
  text: `({
    headline: document.querySelector('h1')?.textContent?.trim(),
    author: document.querySelector('.author')?.textContent?.trim(),
    date: document.querySelector('time')?.getAttribute('datetime')
  })`
});

// 3. Extract body text
const body = await get_page_text({ tabId });

// 4. Combine
const article = { ...metadata, body };
结果: 文章内容已提取
输出格式: Markdown
markdown
# Article Headline
Author: Jane Doe
Date: 2026-01-12

Article body text here...

Example 3: Table Data Extraction

示例3:表格数据提取

Complexity: Simple (structured HTML table)
Task: Extract financial data from HTML table
Planning Output (sequential-thinking):
Thought 1/4: HTML table with headers in <th>, data in <td>
Thought 2/4: Use javascript_tool to parse table structure
Thought 3/4: Convert to array of objects with header keys
Thought 4/4: Output format: CSV for spreadsheet import
Execution:
javascript
// 1. Navigate to page
await navigate({ url: "https://finance.example.com/data", tabId });

// 2. Extract table
const tableData = await javascript_tool({
  action: "javascript_exec",
  tabId,
  text: `
    const table = document.querySelector('table.financial-data');
    const headers = Array.from(table.querySelectorAll('th')).map(th => th.textContent.trim());
    Array.from(table.querySelectorAll('tbody tr')).map(tr => {
      const cells = Array.from(tr.querySelectorAll('td')).map(td => td.textContent.trim());
      return headers.reduce((obj, header, i) => ({ ...obj, [header]: cells[i] }), {});
    })
  `
});

// 3. Transform to CSV
const csv = toCSV(tableData);
Result: 50 rows extracted
Output Format: CSV
csv
Date,Open,High,Low,Close,Volume
2026-01-12,150.00,152.50,149.00,151.75,1000000
2026-01-11,148.50,150.25,147.00,150.00,950000
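The header/cell zip in step 2 is the core trick: pair each header with the cell at the same index. A DOM-free version of that reduce (zipRow is a name introduced here; headers and cells are passed directly instead of coming from querySelectorAll):

```javascript
// Same reduce-based zip as in the table extraction above:
// pair each header with the cell at the same column index.
function zipRow(headers, cells) {
  return headers.reduce((obj, header, i) => ({ ...obj, [header]: cells[i] }), {});
}

console.log(zipRow(["Date", "Close"], ["2026-01-12", "151.75"]));
// { Date: '2026-01-12', Close: '151.75' }
```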
复杂度: 简单(结构化HTML表格)
任务: 从HTML表格提取财务数据
规划输出(sequential-thinking):
Thought 1/4: HTML table with headers in <th>, data in <td>
Thought 2/4: Use javascript_tool to parse table structure
Thought 3/4: Convert to array of objects with header keys
Thought 4/4: Output format: CSV for spreadsheet import
执行代码:
javascript
// 1. Navigate to page
await navigate({ url: "https://finance.example.com/data", tabId });

// 2. Extract table
const tableData = await javascript_tool({
  action: "javascript_exec",
  tabId,
  text: `
    const table = document.querySelector('table.financial-data');
    const headers = Array.from(table.querySelectorAll('th')).map(th => th.textContent.trim());
    Array.from(table.querySelectorAll('tbody tr')).map(tr => {
      const cells = Array.from(tr.querySelectorAll('td')).map(td => td.textContent.trim());
      return headers.reduce((obj, header, i) => ({ ...obj, [header]: cells[i] }), {});
    })
  `
});

// 3. Transform to CSV
const csv = toCSV(tableData);
结果: 提取50行数据
输出格式: CSV
csv
Date,Open,High,Low,Close,Volume
2026-01-12,150.00,152.50,149.00,151.75,1000000
2026-01-11,148.50,150.25,147.00,150.00,950000

Example 4: Multi-Page Directory Scraping

示例4:多页目录抓取

Complexity: High (pagination, rate limiting)
Task: Extract business listings from directory with "Load More" pagination
Planning Output (sequential-thinking):
Thought 1/8: Directory with business name, phone, address, website
Thought 2/8: Schema: {name: string, phone: string, address: string, website: url}
Thought 3/8: Pagination: "Load More" button at bottom
Thought 4/8: Rate limit: 3 seconds between loads
Thought 5/8: Max iterations: 20 (to prevent infinite loops)
Thought 6/8: End condition: "Load More" button disappears or disabled
Thought 7/8: Output format: JSON
Thought 8/8: Store in Memory MCP for analysis
Result: 200 businesses extracted across 20 "Load More" clicks
Execution Time: 90 seconds (with rate limiting)
复杂度: 高(分页、速率限制)
任务: 从包含“Load More”分页的目录提取商家列表
规划输出(sequential-thinking):
Thought 1/8: Directory with business name, phone, address, website
Thought 2/8: Schema: {name: string, phone: string, address: string, website: url}
Thought 3/8: Pagination: "Load More" button at bottom
Thought 4/8: Rate limit: 3 seconds between loads
Thought 5/8: Max iterations: 20 (to prevent infinite loops)
Thought 6/8: End condition: "Load More" button disappears or disabled
Thought 7/8: Output format: JSON
Thought 8/8: Store in Memory MCP for analysis
结果: 点击20次“Load More”按钮后提取200个商家
执行时间: 90秒(包含速率限制等待)
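The plan above — 3-second delay, 20-iteration cap, stop when the button disappears — can be sketched as a generic loop skeleton. The hasMore and extract callbacks are hypothetical placeholders for the MCP-backed calls (e.g. find for the "Load More" button, javascript_tool for pulling listings):

```javascript
// Generic "Load More" loop skeleton: rate-limited, iteration-capped,
// terminating when the page reports no further button.
const sleep = (ms) => new Promise((res) => setTimeout(res, ms));

async function scrapeWithLoadMore({ hasMore, extract, delayMs = 3000, maxIterations = 20 }) {
  const all = [];
  for (let i = 0; i < maxIterations; i++) {
    all.push(...await extract());          // pull records currently rendered
    if (!(await hasMore())) break;         // end condition: button gone/disabled
    await sleep(delayMs);                  // rate limiting between loads
  }
  return all;
}
```

In the real workflow, hasMore would also trigger the click before the next iteration; the skeleton only fixes the ordering of extract, termination check, and delay.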

Anti-Patterns to Avoid

需避免的反模式

| Anti-Pattern | Problem | Solution |
|--------------|---------|----------|
| Skip Structure Analysis | Selectors break unexpectedly | ALWAYS use read_page before extraction |
| Modify Page State | Not a scraping task anymore | Use browser-automation for interactions |
| Ignore Pagination | Incomplete datasets | Plan pagination strategy in Phase 1 |
| No Rate Limiting | Server blocks, legal issues | Implement delays between requests |
| Hardcoded Selectors | Break when site updates | Use semantic selectors, find tool |
| Skip Validation | Garbage data in output | Validate against schema before storage |
| Bypass CAPTCHA | Violates ToS, legal risk | ABORT and notify user |
| 反模式 | 问题 | 解决方案 |
|--------|------|----------|
| 跳过结构分析 | 选择器意外失效 | 提取前必须使用read_page |
| 修改页面状态 | 不再是抓取任务 | 交互操作使用browser-automation |
| 忽略分页 | 数据集不完整 | 阶段1规划分页策略 |
| 不设置速率限制 | 服务器封禁、法律问题 | 请求之间设置延迟 |
| 硬编码选择器 | 网站更新后失效 | 使用语义选择器、find工具 |
| 跳过验证 | 输出垃圾数据 | 存储前根据schema验证 |
| 绕过验证码 | 违反服务条款、法律风险 | 终止并通知用户 |

Related Skills

相关技能

Upstream (provide input to this skill):
  • intent-analyzer
    - Detect web scraping requirements
  • prompt-architect
    - Optimize extraction descriptions
  • planner
    - High-level data collection strategy
Downstream (use output from this skill):
  • data-analysis
    - Analyze extracted datasets
  • reporting
    - Generate reports from data
  • ml-pipeline
    - Use data for training
Parallel (work together):
  • browser-automation
    - For interactive pre-requisites (login)
  • api-integration
    - Hybrid scraping/API workflows
上游技能(为该技能提供输入):
  • intent-analyzer
    - 检测网页抓取需求
  • prompt-architect
    - 优化提取描述
  • planner
    - 高级数据收集策略
下游技能(使用该技能的输出):
  • data-analysis
    - 分析提取的数据集
  • reporting
    - 从数据生成报告
  • ml-pipeline
    - 使用数据进行训练
并行技能(协同工作):
  • browser-automation
    - 用于交互式前置操作(登录)
  • api-integration
    - 混合抓取/API工作流

Selector Stability Strategies

选择器稳定性策略

Problem: Web pages change, selectors break.
Strategies:
| Strategy | Stability | Speed | Example |
|----------|-----------|-------|---------|
| ID selector | High | Fast | #product-123 |
| Data attribute | High | Fast | [data-product-id="123"] |
| Semantic class | Medium | Fast | .product-card |
| Text content | Medium | Slow | Contains "Add to Cart" |
| XPath | Low | Medium | //div[@class='product'][1] |
| Position | Very Low | Fast | First child element |
Best Practices:
  1. Prefer data-* attributes (designed for scripting)
  2. Use semantic class names over positional selectors
  3. Combine multiple attributes for specificity
  4. Use the find tool with natural language as a fallback
  5. Log selector failures to detect schema drift
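Practices 4 and 5 can be combined in a small fallback helper — try selectors from most to least stable and log every miss. A sketch only: firstMatching is a name introduced here, intended to run inside javascript_exec where a DOM is available:

```javascript
// Try candidate selectors from most to least stable; return the first hit.
// Logging each miss leaves the failure trail used to detect schema drift.
function firstMatching(root, selectors) {
  for (const sel of selectors) {
    const el = root.querySelector(sel);
    if (el) return { selector: sel, element: el };
    console.warn(`selector miss: ${sel}`);
  }
  return null;
}

// Usage inside javascript_exec, where document is available:
// firstMatching(document, ['[data-product-id]', '.product-card', 'div.item']);
```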
问题: 网页会变化,选择器会失效。
策略:
| 策略 | 稳定性 | 速度 | 示例 |
|------|--------|------|------|
| ID选择器 | 高 | 快 | #product-123 |
| 数据属性 | 高 | 快 | [data-product-id="123"] |
| 语义类 | 中 | 快 | .product-card |
| 文本内容 | 中 | 慢 | 包含"Add to Cart" |
| XPath | 低 | 中 | //div[@class='product'][1] |
| 位置 | 极低 | 快 | 第一个子元素 |
最佳实践:
  1. 优先使用data-*属性(专为脚本设计)
  2. 使用语义类名而非位置选择器
  3. 组合多个属性提高特异性
  4. 作为备选,使用find工具和自然语言
  5. 记录选择器失败情况以检测schema漂移

Output Format Templates

输出格式模板

JSON Template

JSON模板

json
{
  "extraction_metadata": {
    "source_url": "https://example.com/products",
    "extraction_date": "2026-01-12T10:30:00Z",
    "record_count": 150,
    "pages_scraped": 5
  },
  "data": [
    {
      "name": "Product Name",
      "price": 29.99,
      "url": "https://example.com/product/1",
      "image": "https://example.com/images/1.jpg"
    }
  ]
}
json
{
  "extraction_metadata": {
    "source_url": "https://example.com/products",
    "extraction_date": "2026-01-12T10:30:00Z",
    "record_count": 150,
    "pages_scraped": 5
  },
  "data": [
    {
      "name": "Product Name",
      "price": 29.99,
      "url": "https://example.com/product/1",
      "image": "https://example.com/images/1.jpg"
    }
  ]
}

CSV Template

CSV模板

csv
name,price,url,image
"Product Name",29.99,"https://example.com/product/1","https://example.com/images/1.jpg"
csv
name,price,url,image
"Product Name",29.99,"https://example.com/product/1","https://example.com/images/1.jpg"

Markdown Table Template

Markdown表格模板

markdown
| Name | Price | URL |
|------|-------|-----|
| Product Name | $29.99 | [Link](https://example.com/product/1) |
markdown
| Name | Price | URL |
|------|-------|-----|
| Product Name | $29.99 | [Link](https://example.com/product/1) |

Maintenance & Updates

维护与更新

Version History:
  • v1.0.0 (2026-01-12): Initial release with READ-only focus, pagination handling, rate limiting
Feedback Loop:
  • Loop 1.5 (Session): Store learnings from extractions
  • Loop 3 (Meta-Loop): Aggregate patterns every 3 days
  • Update LEARNED PATTERNS section with new discoveries
Continuous Improvement:
  • Monitor extraction success rate via Memory MCP
  • Identify common selector failures for pattern updates
  • Optimize extraction strategies based on site patterns
<promise>WEB_SCRAPING_VERILINGUA_VERIX_COMPLIANT</promise>
版本历史:
  • v1.0.0 (2026-01-12): 初始版本,专注于只读操作、分页处理、速率限制
反馈循环:
  • Loop 1.5(会话): 存储提取过程中的经验
  • Loop 3(元循环): 每3天汇总模式
  • 更新已学习模式部分,添加新发现
持续改进:
  • 通过Memory MCP监控提取成功率
  • 识别常见选择器失败情况以更新模式
  • 根据网站模式优化提取策略
<promise>WEB_SCRAPING_VERILINGUA_VERIX_COMPLIANT</promise>