documentation-scraper
Documentation Scraper with slurp-ai
Overview
slurp-ai scrapes documentation websites and compiles them into a single markdown file optimized for AI agent context. It is lightweight, fast, and deterministic - it does NOT use AI to scrape; it is FOR AI consumption.
CRITICAL: Run Outside Sandbox
All commands in this skill MUST be run outside the sandbox. Use `dangerouslyDisableSandbox: true` for all Bash commands, including:
- `which slurp` (installation check)
- `node analyze-sitemap.js` (sitemap analysis)
- `slurp` (scraping)
- File inspection commands (`cat`, `wc`, `head`, etc.)
The sandbox blocks network access and file operations required for web scraping.
Pre-Flight: Check Installation
Before scraping, verify slurp-ai is installed:
```bash
which slurp || echo "NOT INSTALLED"
```

If not installed, ask the user to run:

```bash
npm install -g slurp-ai
```

Requires: Node.js v20+
Do NOT proceed with scraping until slurp-ai is confirmed installed.
Commands
| Command | Purpose |
|---|---|
| `slurp <url>` | Fetch and compile in one step |
| `slurp fetch <url>` | Download docs to partials only |
| `slurp compile` | Compile partials into single file |
| `slurp read` | Read local documentation |

Output: Creates `slurp_compiled/compiled_docs.md` from partials in `slurp_partials/`.
CRITICAL: Analyze Sitemap First
Before running slurp, ALWAYS analyze the sitemap. This reveals the complete site structure and informs your `--base-path` and `--max` decisions.

Step 1: Run Sitemap Analysis
Use the included `analyze-sitemap.js` script:

```bash
node analyze-sitemap.js https://docs.example.com
```

This outputs:
- Total page count (informs `--max`)
- URLs grouped by section (informs `--base-path`)
- Suggested slurp commands with appropriate flags
- Sample URLs to understand naming patterns
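The heart of that analysis is just grouping sitemap URLs by their first path segment. A minimal sketch of the idea (illustrative only - the bundled `analyze-sitemap.js` may implement it differently):

```javascript
// Count sitemap URLs per top-level path segment, as in the report below.
// Illustrative sketch - not the actual analyze-sitemap.js implementation.
function groupBySection(urls) {
  const counts = {};
  for (const u of urls) {
    const first = new URL(u).pathname.split("/").filter(Boolean)[0] || "";
    const section = "/" + first;
    counts[section] = (counts[section] || 0) + 1;
  }
  return counts;
}

console.log(groupBySection([
  "https://docs.example.com/docs/a",
  "https://docs.example.com/docs/b",
  "https://docs.example.com/api/x",
]));
// { '/docs': 2, '/api': 1 }
```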
Step 2: Interpret the Output
Example output:
```
📊 Total URLs in sitemap: 247

📁 URLs by top-level section:
   /docs    182 pages
   /api      45 pages
   /blog     20 pages

🎯 Suggested --base-path options:
   https://docs.example.com/docs/guides/ (67 pages)
   https://docs.example.com/docs/reference/ (52 pages)
   https://docs.example.com/api/ (45 pages)

💡 Recommended slurp commands:
   # Just "/docs/guides" section (67 pages)
   slurp https://docs.example.com/docs/guides/ --base-path https://docs.example.com/docs/guides/ --max 80
```
Step 3: Choose Scope Based on Analysis
| Sitemap Shows | Action |
|---|---|
| < 50 pages total | Scrape entire site |
| 50-200 pages | Scope to relevant section with `--base-path` |
| 200+ pages | Must scope down - pick a specific subsection |
| No sitemap found | Start with a conservative `--max` and inspect results |
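If you are scripting the decision, the thresholds above are easy to encode (hypothetical helper, mirroring the table):

```javascript
// Map a sitemap page count to a scoping decision (thresholds from the table above).
// null stands for "no sitemap found". Hypothetical helper for automation scripts.
function scopeAdvice(pageCount) {
  if (pageCount === null) return "no sitemap: start conservatively and inspect results";
  if (pageCount < 50) return "scrape entire site";
  if (pageCount <= 200) return "scope to relevant section with --base-path";
  return "must scope down: pick a specific subsection";
}

console.log(scopeAdvice(247)); // must scope down: pick a specific subsection
```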
Step 4: Frame the Slurp Command
With sitemap data, you can now set accurate parameters:

```bash
# From sitemap: /docs/api has 45 pages
slurp https://docs.example.com/docs/api/intro \
  --base-path https://docs.example.com/docs/api/ \
  --max 55
```

**Key insight:** The starting URL is where crawling begins. The base path filters which links get followed. They can differ (useful when the base path itself returns 404).
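The split between start URL and base path reduces to a prefix check on every discovered link (simplified sketch; slurp's real filtering may also normalize URLs):

```javascript
// A discovered link is queued only if it lies under the base-path prefix.
// The start URL itself is fetched regardless, which is why the two can differ.
function shouldFollow(link, basePath) {
  return link.startsWith(basePath);
}

console.log(shouldFollow(
  "https://docs.example.com/docs/api/users",
  "https://docs.example.com/docs/api/"
)); // true
console.log(shouldFollow(
  "https://docs.example.com/blog/post",
  "https://docs.example.com/docs/api/"
)); // false
```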
Common Scraping Patterns
Library Documentation (versioned)
```bash
# Express.js 4.x docs
slurp https://expressjs.com/en/4x/api.html --base-path https://expressjs.com/en/4x/

# React docs (latest)
slurp https://react.dev/learn --base-path https://react.dev/learn
```

API Reference Only
```bash
slurp https://docs.example.com/api/introduction --base-path https://docs.example.com/api/
```

Full Documentation Site
```bash
slurp https://docs.example.com/
```

CLI Options
| Flag | Default | Purpose |
|---|---|---|
| `--max` | 20 | Maximum pages to scrape |
| `--concurrency` | 5 | Parallel page requests |
| `--headless` | true | Use headless browser |
| `--base-path` | start URL | Filter links to this prefix |
| `--output` | `slurp_partials/` | Output directory for partials |
| `--retry-count` | 3 | Retries for failed requests |
| `--retry-delay` | 1000 | Delay between retries (ms) |
| `--yes` | - | Skip confirmation prompts |
Compile Options
| Flag | Default | Purpose |
|---|---|---|
| `--input` | `slurp_partials/` | Input directory |
| `--output` | `slurp_compiled/compiled_docs.md` | Output file |
| `--preserve-metadata` | true | Keep metadata blocks |
| `--remove-navigation` | true | Strip nav elements |
| `--remove-duplicates` | true | Eliminate duplicates |
| `--exclude` | - | JSON array of regex patterns to exclude |
When to Disable Headless Mode
Use `--headless false` for:
- Static HTML documentation sites
- Faster scraping when JS rendering is not needed
Default is headless (true) - works for most modern doc sites including SPAs.
Output Structure
```
slurp_partials/          # Intermediate files
├── page1.md
└── page2.md
slurp_compiled/          # Final output
└── compiled_docs.md     # Compiled result
```
Quick Reference
```bash
# 1. ALWAYS analyze sitemap first
node analyze-sitemap.js https://docs.example.com

# 2. Scrape with informed parameters (from sitemap analysis)
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 80

# 3. Skip prompts for automation
slurp https://docs.example.com/ --yes

# 4. Check output
cat slurp_compiled/compiled_docs.md | head -100
```

Common Issues
| Problem | Cause | Solution |
|---|---|---|
| Wrong `--max` | Guessing page count | Run `analyze-sitemap.js` first |
| Too few pages scraped | `--max` too low | Set `--max` from sitemap page count |
| Missing content | JS not rendering | Ensure `--headless true` (default) |
| Crawl stuck/slow | Rate limiting | Reduce `--concurrency` |
| Duplicate sections | Similar content | Use `--remove-duplicates` |
| Wrong pages included | Base path too broad | Use sitemap to find correct `--base-path` |
| Prompts blocking automation | Interactive mode | Add `--yes` |
Post-Scrape Usage
The output markdown is designed for AI context injection:

```bash
# Check file size (context budget)
wc -c slurp_compiled/compiled_docs.md

# Preview structure
grep "^#" slurp_compiled/compiled_docs.md | head -30

# Use with Claude Code - reference in prompt or via @file
```

When NOT to Use
- API specs in OpenAPI/Swagger: Use dedicated parsers instead
- GitHub READMEs: Fetch directly via raw.githubusercontent.com
- npm package docs: Often better to read source + README
- Frequently updated docs: Consider caching strategy