using-web-scraping
Web Scraping Skill: Chrome (Playwright) + DuckDuckGo
A privacy-minded, agent-facing web-scraping skill that uses headless Chrome (driven by Playwright or Puppeteer) to load pages and DuckDuckGo for search. It focuses on reliable navigation, structured text extraction, obeying robots.txt, and rate-limiting.
When to use
- Collect public webpage content for summarization, metadata extraction, or link discovery.
- Use DuckDuckGo for queries when you want a privacy-respecting search source.
- NOT for bypassing paywalls, scraping private/logged-in content, or violating Terms of Service.
Safety & etiquette
- Always check and respect `/robots.txt` before scraping a site.
- Rate-limit requests (default: 1 request/sec) and use polite `User-Agent` strings.
- Avoid executing arbitrary user-provided JavaScript on scraped pages.
- Only scrape public content; if login is required, return `login_required` instead of attempting to bypass.
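
The robots.txt rule above can be checked before each visit. The following is a minimal sketch, not a full parser: `isAllowedByRobots` is a hypothetical helper name, and it assumes only `User-agent`, `Allow`, and `Disallow` lines with plain path prefixes (no wildcards, no `$` anchors, no `Crawl-delay`). A maintained robots.txt library is likely a better choice in production.

```javascript
// Simplified robots.txt check (illustrative; handles plain path prefixes only).
function isAllowedByRobots(robotsTxt, userAgent, path) {
  const groups = [];        // [{ agents: [...], rules: [{ allow, prefix }] }]
  let current = null;
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim();   // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      if ((field === 'allow' || field === 'disallow') && current && value) {
        current.rules.push({ allow: field === 'allow', prefix: value });
      }
      lastWasAgent = false;
    }
  }

  // Prefer a group that names this agent; otherwise fall back to the `*` group
  const ua = userAgent.toLowerCase();
  const group =
    groups.find(g => g.agents.some(a => a && a !== '*' && ua.includes(a))) ||
    groups.find(g => g.agents.includes('*'));
  if (!group) return true;

  // Longest matching prefix wins; on a tie, Allow wins; no match means allowed
  let best = null;
  for (const rule of group.rules) {
    if (!path.startsWith(rule.prefix)) continue;
    if (
      !best ||
      rule.prefix.length > best.prefix.length ||
      (rule.prefix.length === best.prefix.length && rule.allow)
    ) {
      best = rule;
    }
  }
  return best ? best.allow : true;
}
```

In use, the robots.txt body would be fetched once per host (for example with the global `fetch` in Node 18+), cached, and consulted with the path of every URL before `page.goto` is called.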
Capabilities
- Search DuckDuckGo and return top-N result links.
- Visit result pages in headless Chrome and extract `title`, `meta description`, `main` text (or best-effort article text), and `canonical` URL.
- Return results as structured JSON for downstream consumption.
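
As a rough illustration of the extraction step, the sketch below reads the fields listed above from a page that Playwright has already loaded. The helper name `extractFields`, the selector list, and the `body` fallback are assumptions made for this example, not part of the skill's definition.

```javascript
// Sketch: pull title, meta description, main text, and canonical URL
// from an already-loaded Playwright page. Names and selectors are illustrative.
async function extractFields(page) {
  // Read an attribute from the first match, or null if the element is absent.
  // Checking count() first avoids waiting for an element that never appears.
  const attr = async (selector, name) => {
    const el = page.locator(selector).first();
    return (await el.count()) > 0 ? el.getAttribute(name) : null;
  };

  // Prefer a semantic main/article container; fall back to the whole body
  const main = page.locator('main, article, #content').first();
  const text = (await main.count()) > 0
    ? await main.innerText()
    : await page.locator('body').innerText();

  return {
    url: page.url(),
    canonical: await attr('link[rel="canonical"]', 'href'),
    title: await page.title(),
    description: await attr('meta[name="description"]', 'content'),
    text: text.trim(),
  };
}
```

The returned object can be serialized directly with `JSON.stringify` for downstream consumers; `canonical` and `description` are `null` when the page does not declare them.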
Examples
Node.js (Playwright)
```javascript
const { chromium } = require('playwright');

async function ddgSearchAndScrape(query) {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage({ userAgent: 'open-skills-bot/1.0' });

    // DuckDuckGo search. The result selector below depends on DuckDuckGo's
    // markup and may need adjusting if the results page layout changes.
    await page.goto('https://duckduckgo.com/');
    await page.fill('input[name="q"]', query);
    await page.keyboard.press('Enter');
    await page.waitForSelector('.result__title a');

    // Collect the top result URL; resolve it in case the href is relative
    const href = await page.getAttribute('.result__title a', 'href');
    if (!href) return [];
    const target = new URL(href, page.url()).href;

    // Visit the result and extract title, meta description, and main text.
    // Short timeouts keep a missing element from stalling until the default timeout.
    await page.goto(target, { waitUntil: 'domcontentloaded' });
    const title = await page.title();
    const description = await page.locator('meta[name="description"]').first()
      .getAttribute('content', { timeout: 2000 }).catch(() => null);
    const article = await page.locator('article, main, #content').first()
      .innerText({ timeout: 2000 }).catch(() => null);

    return [{ url: target, title, description, text: article }];
  } finally {
    // Always release the browser, even if navigation or extraction throws
    await browser.close();
  }
}

// Usage
// ddgSearchAndScrape('open-source agent runtimes').then(console.log);
```
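
The example above visits a single result. When several pages are visited in a row, the 1 request/sec default from the safety rules needs to be enforced somewhere. One possible approach is a small minimum-interval limiter; `createRateLimiter` and `waitTurn` are hypothetical names chosen for this sketch.

```javascript
// Sketch of a minimum-interval rate limiter (names are illustrative).
function createRateLimiter(minIntervalMs = 1000) {
  let nextSlot = 0; // earliest timestamp (ms) at which the next request may start

  return async function waitTurn() {
    const now = Date.now();
    const startAt = Math.max(now, nextSlot);
    // Reserve the slot synchronously so concurrent callers queue up in order
    nextSlot = startAt + minIntervalMs;
    if (startAt > now) {
      await new Promise(resolve => setTimeout(resolve, startAt - now));
    }
  };
}

// Usage (inside an async function, with `page` and `urls` supplied by the caller):
// const waitTurn = createRateLimiter(1000);
// for (const url of urls) {
//   await waitTurn();
//   await page.goto(url, { waitUntil: 'domcontentloaded' });
// }
```

Because each call reserves its slot before sleeping, the limiter keeps requests spaced out even if several scraping tasks share it.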
Agent prompt (copy/paste)
```text
You are an agent with a web-scraping skill. For any `search:` task, use DuckDuckGo to find relevant pages, then open each page in a headless Chrome instance (Playwright/Puppeteer) and extract `title`, `meta description`, `main text`, and `canonical` URL. Always:
- Check and respect robots.txt
- Rate-limit requests (<=1 req/sec)
- Use a clear `User-Agent` and do not execute arbitrary page JS
Return results as JSON: [{url,title,description,text}] or `login_required` if a page needs authentication.
```
Quick setup
- Node: `npm i playwright` and run `npx playwright install` for browser binaries.
- Python: `pip install playwright` and `playwright install`.
Tips
- Use `page.route` to block large assets (images, fonts) when you only need text.
- Respect site terms and introduce exponential backoff for retries.
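
Both tips can be sketched in a few lines. The blocked resource types, the helper names `blockHeavyAssets` and `withBackoff`, and the default retry settings are assumptions made for illustration; tune them to the sites being visited.

```javascript
// Resource types to skip when only text is needed (an illustrative choice)
const BLOCKED_TYPES = new Set(['image', 'font', 'media']);

// Register a Playwright route handler that aborts requests for heavy assets
async function blockHeavyAssets(page) {
  await page.route('**/*', route =>
    BLOCKED_TYPES.has(route.request().resourceType())
      ? route.abort()
      : route.continue()
  );
}

// Retry an async operation, waiting baseDelayMs * 1, 2, 4, ... between attempts
async function withBackoff(operation, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= retries) throw err; // out of retries: surface the last error
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}

// Usage (inside an async function, with `page` and `url` supplied by the caller):
// await blockHeavyAssets(page);
// await withBackoff(() => page.goto(url, { waitUntil: 'domcontentloaded' }));
```

Doubling the delay after each failure gives a struggling server progressively more room to recover, which is gentler than retrying at a fixed interval.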
See also
- using-youtube-download.md — media-specific scraping and download examples.