documentation-scraper


# Documentation Scraper with slurp-ai


## Overview

slurp-ai scrapes documentation websites and compiles them into a single markdown file optimized for AI agent context. It is lightweight, fast, and deterministic: it does NOT use AI to scrape; it produces output FOR AI consumption.

## CRITICAL: Run Outside Sandbox

All commands in this skill MUST be run outside the sandbox. Use `dangerouslyDisableSandbox: true` for all Bash commands, including:

- `which slurp` (installation check)
- `node analyze-sitemap.js` (sitemap analysis)
- `slurp` (scraping)
- File inspection commands (`wc`, `head`, `cat`, etc.)

The sandbox blocks the network access and file operations required for web scraping.

## Pre-Flight: Check Installation

Before scraping, verify that slurp-ai is installed:

```bash
which slurp || echo "NOT INSTALLED"
```

If it is not installed, ask the user to run:

```bash
npm install -g slurp-ai
```

Requires Node.js v20+. Do NOT proceed with scraping until slurp-ai is confirmed installed.
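The installation check above can be wrapped in a reusable guard; a minimal sketch using the portable `command -v` (the `check_tool` helper name is ours, not part of slurp-ai):

```bash
# check_tool: report whether a command is on PATH (hypothetical helper).
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found $1"
  else
    echo "NOT INSTALLED: $1"
  fi
}

check_tool sh     # present on any POSIX system
check_tool slurp  # present only after `npm install -g slurp-ai`
```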

## Commands

| Command | Purpose |
| --- | --- |
| `slurp <url>` | Fetch and compile in one step |
| `slurp fetch <url> [version]` | Download docs to partials only |
| `slurp compile` | Compile partials into a single file |
| `slurp read <package> [version]` | Read local documentation |

Output: creates `slurp_compiled/compiled_docs.md` from the partials in `slurp_partials/`.

## CRITICAL: Analyze Sitemap First

Before running slurp, ALWAYS analyze the sitemap. This reveals the complete site structure and informs your `--base-path` and `--max` decisions.

### Step 1: Run Sitemap Analysis

Use the included `analyze-sitemap.js` script:

```bash
node analyze-sitemap.js https://docs.example.com
```

This outputs:

- Total page count (informs `--max`)
- URLs grouped by section (informs `--base-path`)
- Suggested slurp commands with appropriate flags
- Sample URLs to understand naming patterns
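At its core, a sitemap analysis like this counts `<loc>` entries and groups them by top-level path segment. A self-contained sketch of that counting step, run on an inline sample sitemap (the real script presumably fetches `sitemap.xml` over the network and does more):

```bash
# Sample sitemap inlined for illustration.
sitemap='<urlset>
<url><loc>https://docs.example.com/docs/guides/intro</loc></url>
<url><loc>https://docs.example.com/docs/guides/setup</loc></url>
<url><loc>https://docs.example.com/api/auth</loc></url>
</urlset>'

# Extract <loc> values, strip scheme and host, keep the first path segment, tally.
echo "$sitemap" \
  | grep -o '<loc>[^<]*</loc>' \
  | sed 's#<loc>https://[^/]*##; s#</loc>##' \
  | cut -d/ -f2 | sort | uniq -c
```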

### Step 2: Interpret the Output

Example output:

```
📊 Total URLs in sitemap: 247

📁 URLs by top-level section:
   /docs                          182 pages
   /api                            45 pages
   /blog                           20 pages

🎯 Suggested --base-path options:
   https://docs.example.com/docs/guides/     (67 pages)
   https://docs.example.com/docs/reference/  (52 pages)
   https://docs.example.com/api/             (45 pages)

💡 Recommended slurp commands:

   # Just "/docs/guides" section (67 pages)
   slurp https://docs.example.com/docs/guides/ --base-path https://docs.example.com/docs/guides/ --max 80
```

### Step 3: Choose Scope Based on Analysis

| Sitemap Shows | Action |
| --- | --- |
| < 50 pages total | Scrape the entire site: `slurp <url> --max 60` |
| 50-200 pages | Scope to the relevant section with `--base-path` |
| 200+ pages | Must scope down; pick a specific subsection |
| No sitemap found | Start with `--max 30`, inspect partials, adjust |
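When turning a sitemap page count into a `--max` value, leave some headroom for pages the sitemap omits. A small sketch (the ~20% margin is our heuristic, not a slurp rule):

```bash
# Derive a --max value from a sitemap page count, plus ~20% headroom.
page_count=45   # e.g. the 45-page /api section from the example analysis
max=$(( page_count + page_count / 5 ))
echo "--max $max"
```

For the 67-page `/docs/guides` section this yields `--max 80`, matching the analyzer's suggested command above.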

### Step 4: Frame the Slurp Command

With sitemap data, you can now set accurate parameters:

```bash
# From sitemap: /docs/api has 45 pages
slurp https://docs.example.com/docs/api/ --base-path https://docs.example.com/docs/api/ --max 50
```


**Key insight:** Starting URL is where crawling begins. Base path filters which links get followed. They can differ (useful when base path itself returns 404).


## Common Scraping Patterns

### Library Documentation (versioned)

```bash
# Express.js 4.x docs
slurp https://expressjs.com/en/4x/api.html

# React docs (latest)
slurp https://react.dev/reference/react
```

### API Reference Only

```bash
slurp https://docs.example.com/api/introduction --base-path https://docs.example.com/api/
```

### Full Documentation Site

```bash
slurp https://docs.example.com/
```

## CLI Options

| Flag | Default | Purpose |
| --- | --- | --- |
| `--max <n>` | 20 | Maximum pages to scrape |
| `--concurrency <n>` | 5 | Parallel page requests |
| `--headless <bool>` | true | Use headless browser |
| `--base-path <url>` | start URL | Filter links to this prefix |
| `--output <dir>` | `./slurp_partials` | Output directory for partials |
| `--retry-count <n>` | 3 | Retries for failed requests |
| `--retry-delay <ms>` | 1000 | Delay between retries |
| `--yes` | - | Skip confirmation prompts |

## Compile Options

| Flag | Default | Purpose |
| --- | --- | --- |
| `--input <dir>` | `./slurp_partials` | Input directory |
| `--output <file>` | `./slurp_compiled/compiled_docs.md` | Output file |
| `--preserve-metadata` | true | Keep metadata blocks |
| `--remove-navigation` | true | Strip nav elements |
| `--remove-duplicates` | true | Eliminate duplicates |
| `--exclude <json>` | - | JSON array of regex patterns to exclude |
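Since `--exclude` takes a JSON array of regex patterns, it is easiest to assemble the array in a variable and single-quote it on the command line. A sketch with hypothetical patterns (the `echo` only prints the command it would run):

```bash
# Hypothetical patterns: drop changelog and blog pages from the compiled file.
exclude='["/changelog/", "/blog/"]'
echo "slurp compile --exclude '$exclude'"
```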

## When to Disable Headless Mode

Use `--headless false` for:

- Static HTML documentation sites
- Faster scraping when JS rendering is not needed

The default is headless (`true`), which works for most modern doc sites, including SPAs.

## Output Structure

```
slurp_partials/              # Intermediate files
  └── page1.md
  └── page2.md
slurp_compiled/              # Final output
  └── compiled_docs.md       # Compiled result
```

## Quick Reference

```bash
# 1. ALWAYS analyze sitemap first
node analyze-sitemap.js https://docs.example.com

# 2. Scrape with informed parameters (from sitemap analysis)
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 60

# 3. Skip prompts for automation
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 60 --yes

# 4. Check output
cat slurp_compiled/compiled_docs.md | head -100
```

## Common Issues

| Problem | Cause | Solution |
| --- | --- | --- |
| Wrong `--max` value | Guessing page count | Run `analyze-sitemap.js` first |
| Too few pages scraped | `--max` limit (default 20) | Set `--max` based on sitemap analysis |
| Missing content | JS not rendering | Ensure `--headless true` (default) |
| Crawl stuck/slow | Rate limiting | Reduce `--concurrency` to 3 |
| Duplicate sections | Similar content | Use `--remove-duplicates` (default) |
| Wrong pages included | Base path too broad | Use the sitemap to find the correct `--base-path` |
| Prompts blocking automation | Interactive mode | Add the `--yes` flag |

## Post-Scrape Usage

The output markdown is designed for AI context injection:

```bash
# Check file size (context budget)
wc -c slurp_compiled/compiled_docs.md

# Preview structure
grep "^#" slurp_compiled/compiled_docs.md | head -30

# Use with Claude Code - reference in prompt or via @file
```
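A rough way to turn the byte count from `wc -c` into a token estimate for context budgeting (the ~4 bytes per token ratio is a common heuristic for English text, not an exact figure):

```bash
# Estimate tokens from file size. A fixed byte count is used for illustration;
# in practice: bytes=$(wc -c < slurp_compiled/compiled_docs.md)
bytes=200000
echo "~$(( bytes / 4 )) tokens"
```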

## When NOT to Use

- API specs in OpenAPI/Swagger: use dedicated parsers instead
- GitHub READMEs: fetch directly via raw.githubusercontent.com
- npm package docs: often better to read the source + README
- Frequently updated docs: consider a caching strategy