Crawl Skill

Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.

Prerequisites

Tavily API Key Required - Get your key at https://tavily.com

Add it to `~/.claude/settings.json`:

json

{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}
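
For callers that talk to the API directly rather than through the script, a minimal sketch of picking the key up in Python follows. It assumes the `env` block above ends up in the process environment; `tavily_headers` is a hypothetical helper name, not part of the skill.

```python
import os


def tavily_headers(env=None):
    """Build the two request headers from TAVILY_API_KEY (hypothetical helper)."""
    env = os.environ if env is None else env
    key = env.get("TAVILY_API_KEY", "").strip()
    if not key:
        # Fail early with a pointer to the settings file instead of sending a bad request.
        raise RuntimeError("TAVILY_API_KEY is not set; add it to ~/.claude/settings.json")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.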

Quick Start

Using the Script

bash

./scripts/crawl.sh '<json>' [output_dir]

Examples:

bash

# Basic crawl
./scripts/crawl.sh '{"url": "https://docs.example.com"}'

# Deeper crawl with limits
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'

# Save to files
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs

# Focused crawl with path filters
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'

# With semantic instructions (for agentic use)
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'

When `output_dir` is provided, each crawled page is saved as a separate markdown file.
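
The one-file-per-page behaviour can be approximated in Python as below. This is only a sketch: the naming scheme in `page_filename` is an assumption made for illustration, and the actual script may name its files differently.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def page_filename(url):
    """Turn a page URL into a flat, filesystem-safe markdown filename (assumed scheme)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")
    return f"{slug or 'index'}.md"


def save_pages(results, output_dir):
    """Write each entry of a crawl response's `results` list to its own file."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for page in results:
        target = out / page_filename(page["url"])
        target.write_text(page.get("raw_content") or "", encoding="utf-8")
        written.append(target)
    return written
```

Under this scheme, `https://docs.example.com/api/auth` would be written to `docs.example.com_api_auth.md`.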

Basic Crawl

bash

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 1,
    "limit": 20
  }'
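
The same request can be issued from Python with only the standard library. The sketch below mirrors the curl command above; `build_crawl_request` and `crawl` are illustrative names, and error handling is left to the caller.

```python
import json
import urllib.request

CRAWL_URL = "https://api.tavily.com/crawl"


def build_crawl_request(payload, api_key):
    """Prepare (but do not send) the POST request that the curl command issues."""
    return urllib.request.Request(
        CRAWL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def crawl(payload, api_key, timeout=150):
    """Send the request and decode the JSON body.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = build_crawl_request(payload, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `crawl({"url": "https://docs.example.com", "max_depth": 1, "limit": 20}, api_key)` would send the same body as the curl example.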

Focused Crawl with Instructions

bash

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and code examples",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'

API Reference

Endpoint

POST https://api.tavily.com/crawl

Headers

| Header | Value |
| --- | --- |
| `Authorization` | `Bearer <TAVILY_API_KEY>` |
| `Content-Type` | `application/json` |

Request Body

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `url` | string | Required | Root URL to begin crawling |
| `max_depth` | integer | 1 | Levels deep to crawl (1-5) |
| `max_breadth` | integer | 20 | Links per page |
| `limit` | integer | 50 | Total pages cap |
| `instructions` | string | null | Natural language guidance for focus |
| `chunks_per_source` | integer | 3 | Chunks per page (1-5, requires instructions) |
| `extract_depth` | string | `"basic"` | `basic` or `advanced` |
| `format` | string | `"markdown"` | `markdown` or `text` |
| `select_paths` | array | null | Regex patterns to include |
| `exclude_paths` | array | null | Regex patterns to exclude |
| `allow_external` | boolean | true | Include external domain links |
| `timeout` | float | 150 | Max wait (10-150 seconds) |
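
The constraints in the table can be checked locally before a request is sent. The following is a sketch of such a check, covering only the ranges and choices listed above; `validate_crawl_payload` is a hypothetical helper, and the API may enforce further rules.

```python
def validate_crawl_payload(payload):
    """Return a list of problems found in a crawl request body (empty if none)."""
    problems = []
    if not payload.get("url"):
        problems.append("url is required")

    # Numeric ranges taken from the table above.
    ranges = {"max_depth": (1, 5), "chunks_per_source": (1, 5), "timeout": (10, 150)}
    for field, (low, high) in ranges.items():
        value = payload.get(field)
        if value is not None and not low <= value <= high:
            problems.append(f"{field} must be between {low} and {high}")

    # Enumerated string fields.
    choices = {"extract_depth": ("basic", "advanced"), "format": ("markdown", "text")}
    for field, allowed in choices.items():
        value = payload.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field} must be one of {allowed}")

    if "chunks_per_source" in payload and not payload.get("instructions"):
        problems.append("chunks_per_source requires instructions")
    return problems
```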

Response Format

json

{
  "base_url": "https://docs.example.com",
  "results": [
    {
      "url": "https://docs.example.com/page",
      "raw_content": "# Page Title\n\nContent..."
    }
  ],
  "response_time": 12.5
}
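
A decoded response of this shape is a plain dictionary, so a quick overview of what came back takes only a few lines. The sketch below relies only on the fields shown above; `summarize_crawl` is an illustrative name.

```python
def summarize_crawl(response):
    """Return (url, character count) for each crawled page in a decoded response."""
    return [
        (page["url"], len(page.get("raw_content") or ""))
        for page in response.get("results", [])
    ]


sample = {
    "base_url": "https://docs.example.com",
    "results": [
        {
            "url": "https://docs.example.com/page",
            "raw_content": "# Page Title\n\nContent...",
        }
    ],
    "response_time": 12.5,
}
print(summarize_crawl(sample))
```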

Depth vs Performance

| Depth | Typical Pages | Time |
| --- | --- | --- |
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |

Start with `max_depth=1` and increase only if needed.
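
The growth in the table follows from how depth and breadth compound. As a rough illustration only, assuming every page yields `max_breadth` new links (real sites rarely do), a worst-case page count can be estimated as follows; `max_pages` is a hypothetical helper.

```python
def max_pages(max_depth, max_breadth=20, limit=50):
    """Rough upper bound on pages visited, capped by `limit`.

    Assumes each page contributes `max_breadth` unseen links at every level,
    which overestimates most real crawls.
    """
    reachable = sum(max_breadth ** level for level in range(1, max_depth + 1))
    return min(limit, reachable)
```

With the default breadth of 20, depth 2 could already reach 420 pages in this model, which is why the `limit` cap matters as soon as `max_depth` goes above 1.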

Crawl for Context vs Data Collection

For agentic use (feeding results into context): Always use `instructions` together with `chunks_per_source`. This returns only relevant chunks instead of full pages, preventing context window explosion.

For data collection (saving to files): Omit `chunks_per_source` to get full page content.
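
The two modes differ only in which fields the request body carries. One way to make the choice explicit is a small builder like the sketch below; the mode names `"context"` and `"collect"` and the function `crawl_payload` are inventions for this example, not API parameters.

```python
def crawl_payload(url, mode, instructions=None, **options):
    """Build a request body for 'context' (agentic) or 'collect' (archive) use."""
    payload = {"url": url, **options}
    if instructions:
        payload["instructions"] = instructions
    if mode == "context":
        if not instructions:
            raise ValueError("context mode needs instructions")
        # Ask for relevant chunks only; 3 is the documented default.
        payload.setdefault("chunks_per_source", 3)
    elif mode == "collect":
        # Full page content: make sure no chunking is requested.
        payload.pop("chunks_per_source", None)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return payload
```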

Examples

For Context: Agentic Research (Recommended)

Use when feeding crawl results into an LLM context:

bash

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and authentication guides",
    "chunks_per_source": 3
  }'

Returns only the most relevant chunks (max 500 chars each) per page - fits in context without overwhelming it.
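
The 500-character cap makes the worst-case size of a response easy to bound. A back-of-the-envelope sketch, assuming at most `limit` pages each return the full number of chunks at maximum length (`max_context_chars` is an illustrative name):

```python
def max_context_chars(limit, chunks_per_source=3, chunk_chars=500):
    """Upper bound on characters of chunk text returned for one crawl."""
    return limit * chunks_per_source * chunk_chars
```

For the defaults (`limit` 50, 3 chunks per page) that is at most 75,000 characters of chunk text, whereas full pages have no comparable bound.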

For Context: Targeted Technical Docs

bash

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com",
    "max_depth": 2,
    "instructions": "Find all documentation about authentication and security",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'

For Data Collection: Full Page Archive

Use when saving content to files for later processing:

bash

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/blog",
    "max_depth": 2,
    "max_breadth": 50,
    "select_paths": ["/blog/.*"],
    "exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
  }'

Returns full page content. Use the script with `output_dir` to save the pages as markdown files.
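
Path patterns can be tried out locally before spending a crawl on them. The sketch below is an approximation: it assumes patterns are matched against the URL path with search semantics and that exclusions take precedence over selections, neither of which is confirmed here. `path_selected` is a hypothetical helper.

```python
import re
from urllib.parse import urlparse


def path_selected(url, select_paths=None, exclude_paths=None):
    """Guess whether a URL would pass the given include/exclude patterns."""
    path = urlparse(url).path or "/"
    # Assumption: an exclude match always wins.
    if any(re.search(pattern, path) for pattern in (exclude_paths or [])):
        return False
    # Assumption: with no select patterns, every remaining path is kept.
    if select_paths:
        return any(re.search(pattern, path) for pattern in select_paths)
    return True
```

With the patterns from the example above, `/blog/2024/post` would be kept while `/blog/tag/python` and `/about` would be dropped.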

Map API (URL Discovery)

Use `map` instead of `crawl` when you only need URLs, not content:

bash

curl --request POST \
  --url https://api.tavily.com/map \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find all API docs and guides"
  }'

Returns URLs only (faster than crawl):

json

{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}
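
Because the map response is a flat list of URLs, grouping it by top-level path gives a quick picture of the site's structure. A minimal sketch, with `count_by_section` as an illustrative name:

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_section(urls):
    """Count URLs per first path segment, e.g. '/api' or '/guides'."""
    sections = Counter()
    for url in urls:
        segments = [part for part in urlparse(url).path.split("/") if part]
        sections["/" + segments[0] if segments else "/"] += 1
    return dict(sections)


urls = [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart",
]
print(count_by_section(urls))
```

The section names found this way are natural candidates for `select_paths` patterns in a follow-up crawl.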

Tips

Always use `chunks_per_source` for agentic workflows - prevents context explosion when feeding results to LLMs
Omit `chunks_per_source` only for data collection - when saving full pages to files
Start conservative (`max_depth=1`, `limit=20`) and scale up
Use path patterns to focus on relevant sections
Use Map first to understand site structure before full crawl
Always set a `limit` to prevent runaway crawls
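
The "start conservative" and "always set a limit" tips can be applied mechanically by layering caller options over fixed defaults. A small sketch, where `CONSERVATIVE_DEFAULTS` and `with_safe_defaults` are names made up for this example:

```python
# Starting values suggested in the tips above.
CONSERVATIVE_DEFAULTS = {"max_depth": 1, "limit": 20}


def with_safe_defaults(payload):
    """Fill in max_depth and limit unless the caller set them explicitly."""
    return {**CONSERVATIVE_DEFAULTS, **payload}
```

Explicit values in `payload` win, so scaling up remains a deliberate choice by the caller.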
