using-web-scraping

Web Scraping Skill — Chrome (Playwright) + DuckDuckGo

A privacy-minded, agent-facing web-scraping skill that uses headless Chrome (Playwright/Puppeteer) to load pages and DuckDuckGo for search. It focuses on reliable navigation, structured text extraction, robots.txt compliance, and rate limiting.

When to use

  • Collect public webpage content for summarization, metadata extraction, or link discovery.
  • Use DuckDuckGo for queries when you want a privacy-respecting search source.
  • NOT for bypassing paywalls, scraping private/logged-in content, or violating Terms of Service.

Safety & etiquette

  • Always check and respect `/robots.txt` before scraping a site.
  • Rate-limit requests (default: 1 request/sec) and use polite `User-Agent` strings.
  • Avoid executing arbitrary user-provided JavaScript on scraped pages.
  • Only scrape public content; if login is required, return `login_required` instead of attempting to bypass.
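
The robots.txt rule above could be checked with something like the following. This is a minimal sketch, not a full implementation of the Robots Exclusion Protocol: the helper names `parseRobots` and `isAllowed` are hypothetical, path wildcards (`*`, `$`) are not handled, and fetching the file itself is left to the caller.

```javascript
// Minimal robots.txt check (sketch only; see RFC 9309 for the full rules).
// parseRobots and isAllowed are hypothetical helper names.
function parseRobots(robotsTxt) {
  const groups = []; // [{ agents: [...], rules: [{ allow, path }] }]
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasAgent) { current = { agents: [], rules: [] }; groups.push(current); }
      if (value) current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      // An empty Disallow means "nothing is disallowed", so it adds no rule.
      if ((field === 'allow' || field === 'disallow') && current && value) {
        current.rules.push({ allow: field === 'allow', path: value });
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

function isAllowed(robotsTxt, userAgent, path) {
  const groups = parseRobots(robotsTxt);
  const ua = userAgent.toLowerCase();
  // Prefer groups that name this agent; otherwise fall back to the "*" group.
  const specific = groups.filter(g => g.agents.some(a => a !== '*' && ua.includes(a)));
  const chosen = specific.length ? specific : groups.filter(g => g.agents.includes('*'));
  let best = null;
  for (const rule of chosen.flatMap(g => g.rules)) {
    if (!path.startsWith(rule.path)) continue;
    // Longest matching path wins; on a tie, Allow wins.
    if (!best || rule.path.length > best.path.length ||
        (rule.path.length === best.path.length && rule.allow)) best = rule;
  }
  return best ? best.allow : true; // no matching rule means allowed
}

// usage
// const robots = 'User-agent: *\nDisallow: /private/';
// isAllowed(robots, 'open-skills-bot/1.0', '/private/page'); // false
```

A scraper would fetch `/robots.txt` once per host, cache the text, and skip any URL for which `isAllowed` returns false.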

Capabilities

  • Search DuckDuckGo and return top-N result links.
  • Visit result pages in headless Chrome and extract `title`, `meta description`, `main` text (or best-effort article text), and `canonical` URL.
  • Return results as structured JSON for downstream consumption.

Examples

Node.js (Playwright)

javascript
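const { chromium } = require('playwright');

async function ddgSearchAndScrape(query) {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage({ userAgent: 'open-skills-bot/1.0' });

    // DuckDuckGo search via the HTML-only endpoint. The `.result__title a`
    // selector reflects that page's markup and may change without notice.
    await page.goto('https://html.duckduckgo.com/html/?q=' + encodeURIComponent(query));
    await page.waitForSelector('.result__title a');

    // collect top result URL; DuckDuckGo may wrap it in a redirect link that
    // carries the real target in the `uddg` query parameter
    const rawHref = await page.getAttribute('.result__title a', 'href');
    if (!rawHref) return [];
    const link = new URL(rawHref, page.url());
    const href = link.searchParams.get('uddg') || link.href;

    // visit result and extract; short timeouts keep a missing element from
    // stalling for Playwright's default 30 s before falling back to null
    await page.goto(href, { waitUntil: 'domcontentloaded' });
    const opts = { timeout: 2000 };
    const title = await page.title();
    const description = await page.locator('meta[name="description"]').first().getAttribute('content', opts).catch(() => null);
    const canonical = await page.locator('link[rel="canonical"]').first().getAttribute('href', opts).catch(() => null);
    const article = await page.locator('article, main, #content').first().innerText(opts).catch(() => null);

    return [{ url: href, title, description, text: article, canonical }];
  } finally {
    // always release the browser, even when navigation or extraction throws
    await browser.close();
  }
}

// usage
// ddgSearchAndScrape('open-source agent runtimes').then(console.log);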

Agent prompt (copy/paste)

text
You are an agent with a web-scraping skill. For any `search:` task, use DuckDuckGo to find relevant pages, then open each page in a headless Chrome instance (Playwright/Puppeteer) and extract `title`, `meta description`, `main text`, and `canonical` URL. Always:
- Check and respect robots.txt
- Rate-limit requests (<=1 req/sec)
- Use a clear `User-Agent` and do not execute arbitrary page JS
Return results as JSON: [{url,title,description,text,canonical}], or `login_required` if a page needs authentication.
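
The rate limit named in the prompt (at most 1 request per second) could be enforced with a small wrapper like the one below. This is a sketch under assumptions: `createRateLimiter` is a hypothetical helper, and it applies one global limit rather than a separate limit per host.

```javascript
// Hypothetical helper: spaces out calls so that at most one task starts
// per `minIntervalMs` milliseconds.
function createRateLimiter(minIntervalMs = 1000) {
  let nextSlot = 0; // earliest timestamp (ms) at which the next task may start
  return async function limited(task) {
    const now = Date.now();
    const startAt = Math.max(now, nextSlot);
    nextSlot = startAt + minIntervalMs; // reserve the slot before waiting
    if (startAt > now) {
      await new Promise(resolve => setTimeout(resolve, startAt - now));
    }
    return task();
  };
}

// usage (inside a Playwright script):
// const limit = createRateLimiter(1000);
// for (const url of urls) {
//   await limit(() => page.goto(url, { waitUntil: 'domcontentloaded' }));
// }
```

Reserving the slot before waiting keeps concurrent callers from all starting at the same moment.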

Quick setup

  • Node: `npm i playwright` and run `npx playwright install` for browser binaries.
  • Python: `pip install playwright` and `playwright install`.

Tips

  • Use `page.route` to block large assets (images, fonts) when you only need text.
  • Respect site terms and introduce exponential backoff for retries.
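
Both tips can be sketched in a few lines. The names `blockHeavyAssets` and `withBackoff`, the list of blocked resource types, and the default delays are assumptions made for illustration, not part of the skill.

```javascript
// Resource types to skip when only text is needed (an assumed list; adjust as needed).
const BLOCKED_TYPES = new Set(['image', 'font', 'media']);

// Handler for Playwright's page.route: abort heavy assets, let everything else through.
async function blockHeavyAssets(route) {
  if (BLOCKED_TYPES.has(route.request().resourceType())) return route.abort();
  return route.continue();
}

// Hypothetical retry helper: waits baseMs, 2*baseMs, 4*baseMs, ... between attempts
// and rethrows the last error once `retries` retries have been used up.
async function withBackoff(task, { retries = 3, baseMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      if (attempt >= retries) throw err;
      const delay = baseMs * 2 ** attempt;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}

// usage (inside a Playwright script):
// await page.route('**/*', blockHeavyAssets);
// await withBackoff(() => page.goto(url, { waitUntil: 'domcontentloaded' }));
```

Adding random jitter to each delay is a common refinement when many workers retry at once.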

See also

  • using-youtube-download.md — media-specific scraping and download examples.