documentation-scraper


# Documentation Scraper with slurp-ai


## Overview

slurp-ai scrapes documentation websites and compiles them into a single markdown file optimized for AI agent context. It is lightweight, fast, and deterministic: it does NOT use AI to scrape; it produces output FOR AI consumption.

## CRITICAL: Run Outside Sandbox

All commands in this skill MUST be run outside the sandbox. Use `dangerouslyDisableSandbox: true` for all Bash commands, including:

- `which slurp` (installation check)
- `node analyze-sitemap.js` (sitemap analysis)
- `slurp` (scraping)
- File inspection commands (`wc`, `head`, `cat`, etc.)

The sandbox blocks the network access and file operations required for web scraping.

## Pre-Flight: Check Installation

Before scraping, verify that slurp-ai is installed:

```bash
which slurp || echo "NOT INSTALLED"
```

If it is not installed, ask the user to run:

```bash
npm install -g slurp-ai
```

Requires Node.js v20+. Do NOT proceed with scraping until slurp-ai is confirmed installed.
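The installation check above can be wrapped in a reusable guard; a minimal sketch using the portable `command -v` (the `check_tool` helper name is ours, not part of slurp-ai):

```bash
# check_tool: report whether a command is on PATH (hypothetical helper).
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found $1"
  else
    echo "NOT INSTALLED: $1"
  fi
}

check_tool sh     # present on any POSIX system
check_tool slurp  # present only after `npm install -g slurp-ai`
```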

## Commands

| Command | Purpose |
| --- | --- |
| `slurp <url>` | Fetch and compile in one step |
| `slurp fetch <url> [version]` | Download docs to partials only |
| `slurp compile` | Compile partials into a single file |
| `slurp read <package> [version]` | Read local documentation |

Output: creates `slurp_compiled/compiled_docs.md` from the partials in `slurp_partials/`.

## CRITICAL: Analyze Sitemap First

Before running slurp, ALWAYS analyze the sitemap. This reveals the complete site structure and informs your `--base-path` and `--max` decisions.

### Step 1: Run Sitemap Analysis

Use the included `analyze-sitemap.js` script:

```bash
node analyze-sitemap.js https://docs.example.com
```

This outputs:

- Total page count (informs `--max`)
- URLs grouped by section (informs `--base-path`)
- Suggested slurp commands with appropriate flags
- Sample URLs to understand naming patterns
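At its core, a sitemap analysis like this counts `<loc>` entries and groups them by top-level path segment. A self-contained sketch of that counting step, run on an inline sample sitemap (the real script presumably fetches `sitemap.xml` over the network and does more):

```bash
# Sample sitemap inlined for illustration.
sitemap='<urlset>
<url><loc>https://docs.example.com/docs/guides/intro</loc></url>
<url><loc>https://docs.example.com/docs/guides/setup</loc></url>
<url><loc>https://docs.example.com/api/auth</loc></url>
</urlset>'

# Extract <loc> values, strip scheme and host, keep the first path segment, tally.
echo "$sitemap" \
  | grep -o '<loc>[^<]*</loc>' \
  | sed 's#<loc>https://[^/]*##; s#</loc>##' \
  | cut -d/ -f2 | sort | uniq -c
```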

### Step 2: Interpret the Output

Example output:

```
📊 Total URLs in sitemap: 247

📁 URLs by top-level section:
   /docs                          182 pages
   /api                            45 pages
   /blog                           20 pages

🎯 Suggested --base-path options:
   https://docs.example.com/docs/guides/     (67 pages)
   https://docs.example.com/docs/reference/  (52 pages)
   https://docs.example.com/api/             (45 pages)

💡 Recommended slurp commands:

   # Just "/docs/guides" section (67 pages)
   slurp https://docs.example.com/docs/guides/ --base-path https://docs.example.com/docs/guides/ --max 80
```

### Step 3: Choose Scope Based on Analysis

| Sitemap Shows | Action |
| --- | --- |
| < 50 pages total | Scrape the entire site: `slurp <url> --max 60` |
| 50-200 pages | Scope to the relevant section with `--base-path` |
| 200+ pages | Must scope down; pick a specific subsection |
| No sitemap found | Start with `--max 30`, inspect partials, adjust |
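When turning a sitemap page count into a `--max` value, leave some headroom for pages the sitemap omits. A small sketch (the ~20% margin is our heuristic, not a slurp rule):

```bash
# Derive a --max value from a sitemap page count, plus ~20% headroom.
page_count=45   # e.g. the 45-page /api section from the example analysis
max=$(( page_count + page_count / 5 ))
echo "--max $max"
```

For the 67-page `/docs/guides` section this yields `--max 80`, matching the analyzer's suggested command above.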

### Step 4: Frame the Slurp Command

With sitemap data, you can now set accurate parameters:

```bash
# From sitemap: /docs/api has 45 pages
slurp https://docs.example.com/docs/api/ --base-path https://docs.example.com/docs/api/ --max 50
```


**Key insight:** Starting URL is where crawling begins. Base path filters which links get followed. They can differ (useful when base path itself returns 404).


## Common Scraping Patterns

### Library Documentation (versioned)

```bash
# Express.js 4.x docs
slurp https://expressjs.com/en/4x/api.html

# React docs (latest)
slurp https://react.dev/reference/react
```

### API Reference Only

```bash
slurp https://docs.example.com/api/introduction --base-path https://docs.example.com/api/
```

### Full Documentation Site

```bash
slurp https://docs.example.com/
```

## CLI Options

| Flag | Default | Purpose |
| --- | --- | --- |
| `--max <n>` | 20 | Maximum pages to scrape |
| `--concurrency <n>` | 5 | Parallel page requests |
| `--headless <bool>` | true | Use headless browser |
| `--base-path <url>` | start URL | Filter links to this prefix |
| `--output <dir>` | `./slurp_partials` | Output directory for partials |
| `--retry-count <n>` | 3 | Retries for failed requests |
| `--retry-delay <ms>` | 1000 | Delay between retries |
| `--yes` | - | Skip confirmation prompts |

## Compile Options

| Flag | Default | Purpose |
| --- | --- | --- |
| `--input <dir>` | `./slurp_partials` | Input directory |
| `--output <file>` | `./slurp_compiled/compiled_docs.md` | Output file |
| `--preserve-metadata` | true | Keep metadata blocks |
| `--remove-navigation` | true | Strip nav elements |
| `--remove-duplicates` | true | Eliminate duplicates |
| `--exclude <json>` | - | JSON array of regex patterns to exclude |
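Since `--exclude` takes a JSON array of regex patterns, it is easiest to assemble the array in a variable and single-quote it on the command line. A sketch with hypothetical patterns (the `echo` only prints the command it would run):

```bash
# Hypothetical patterns: drop changelog and blog pages from the compiled file.
exclude='["/changelog/", "/blog/"]'
echo "slurp compile --exclude '$exclude'"
```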

## When to Disable Headless Mode

Use `--headless false` for:

- Static HTML documentation sites
- Faster scraping when JS rendering is not needed

The default is headless (`true`), which works for most modern doc sites, including SPAs.

## Output Structure

```
slurp_partials/              # Intermediate files
  └── page1.md
  └── page2.md
slurp_compiled/              # Final output
  └── compiled_docs.md       # Compiled result
```

## Quick Reference

```bash
# 1. ALWAYS analyze sitemap first
node analyze-sitemap.js https://docs.example.com

# 2. Scrape with informed parameters (from sitemap analysis)
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 60

# 3. Skip prompts for automation
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 60 --yes

# 4. Check output
cat slurp_compiled/compiled_docs.md | head -100
```

## Common Issues

| Problem | Cause | Solution |
| --- | --- | --- |
| Wrong `--max` value | Guessing page count | Run `analyze-sitemap.js` first |
| Too few pages scraped | `--max` limit (default 20) | Set `--max` based on sitemap analysis |
| Missing content | JS not rendering | Ensure `--headless true` (default) |
| Crawl stuck/slow | Rate limiting | Reduce `--concurrency` to 3 |
| Duplicate sections | Similar content | Use `--remove-duplicates` (default) |
| Wrong pages included | Base path too broad | Use the sitemap to find the correct `--base-path` |
| Prompts blocking automation | Interactive mode | Add the `--yes` flag |

## Post-Scrape Usage

The output markdown is designed for AI context injection:

```bash
# Check file size (context budget)
wc -c slurp_compiled/compiled_docs.md

# Preview structure
grep "^#" slurp_compiled/compiled_docs.md | head -30

# Use with Claude Code - reference in prompt or via @file
```
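A rough way to turn the byte count from `wc -c` into a token estimate for context budgeting (the ~4 bytes per token ratio is a common heuristic for English text, not an exact figure):

```bash
# Estimate tokens from file size. A fixed byte count is used for illustration;
# in practice: bytes=$(wc -c < slurp_compiled/compiled_docs.md)
bytes=200000
echo "~$(( bytes / 4 )) tokens"
```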

## When NOT to Use

- API specs in OpenAPI/Swagger: use dedicated parsers instead
- GitHub READMEs: fetch directly via raw.githubusercontent.com
- npm package docs: often better to read the source + README
- Frequently updated docs: consider a caching strategy