tavily-crawl

tavily crawl

Crawl a website and extract content from multiple pages. Supports saving each page as a local markdown file.

Prerequisites

Requires the Tavily CLI. See tavily-cli for install and auth setup.

Quick install:

curl -fsSL https://cli.tavily.com/install.sh | bash && tvly login

When to use

You need content from many pages on a site (e.g., all `/docs/`)
You want to download documentation for offline use
Step 4 in the workflow: search → extract → map → crawl → research

Quick start

Basic crawl

tvly crawl "https://docs.example.com" --json

Save each page as a markdown file

tvly crawl "https://docs.example.com" --output-dir ./docs/

Deeper crawl with limits

tvly crawl "https://docs.example.com" --max-depth 2 --limit 50 --json

Filter to specific paths

tvly crawl "https://example.com" --select-paths "/api/.*,/guides/.*" --exclude-paths "/blog/.*" --json
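
Before spending crawl credits, it can help to preview which paths a comma-separated pattern list would select. The sketch below is a local approximation only: `preview_paths` is a hypothetical helper, and it assumes each pattern must match the whole path under `grep -E` semantics, which may differ from how the crawler itself applies the patterns.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the paths from stdin that match any pattern
# in a comma-separated list. Assumption: full-path match with grep -E;
# the crawler's own matching rules may differ.
preview_paths() {
  local patterns="$1" path pattern
  local -a list
  # Split the comma-separated patterns into an array.
  IFS=',' read -r -a list <<< "$patterns"
  while IFS= read -r path; do
    for pattern in "${list[@]}"; do
      # Anchor the pattern so it has to cover the entire path.
      if printf '%s\n' "$path" | grep -Eq "^(${pattern})\$"; then
        printf '%s\n' "$path"
        break
      fi
    done
  done
}

# Example: only the /api/ and /guides/ paths are printed.
printf '%s\n' /api/auth /guides/setup /blog/news | preview_paths "/api/.*,/guides/.*"
```

Feeding it a list of URL paths (for example, ones collected from a site map) gives a quick sanity check on the patterns before running the real crawl.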

Semantic focus (returns relevant chunks, not full pages)

tvly crawl "https://docs.example.com" --instructions "Find authentication docs" --chunks-per-source 3 --json

Options

Option	Description
`--max-depth`	Levels deep (1-5, default: 1)
`--max-breadth`	Links per page (default: 20)
`--limit`	Total pages cap (default: 50)
`--instructions`	Natural language guidance for semantic focus
`--chunks-per-source`	Chunks per page (1-5, requires `--instructions`)
`--extract-depth`	`basic` (default) or `advanced`
`--format`	`markdown` (default) or `text`
`--select-paths`	Comma-separated regex patterns to include
`--exclude-paths`	Comma-separated regex patterns to exclude
`--select-domains`	Comma-separated regex for domains to include
`--exclude-domains`	Comma-separated regex for domains to exclude
`--allow-external / --no-external`	Include external links (default: allow)
`--include-images`	Include images
`--timeout`	Max wait (10-150 seconds)
`-o, --output`	Save JSON output to file
`--output-dir`	Save each page as a .md file in directory
`--json`	Structured JSON output

Crawl for context vs. data collection

For agentic use (feeding results to an LLM):

Always use `--instructions` with `--chunks-per-source`. This returns only relevant chunks instead of full pages, which prevents context explosion.

tvly crawl "https://docs.example.com" --instructions "API authentication" --chunks-per-source 3 --json

For data collection (saving to files):

Use `--output-dir` without `--chunks-per-source` to get full pages as markdown files.

tvly crawl "https://docs.example.com" --max-depth 2 --output-dir ./docs/
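
After a data-collection crawl, a quick file count shows whether the output directory holds roughly as many pages as expected. This is a minimal sketch: `count_markdown_files` is a hypothetical helper, and it assumes only that saved pages end in `.md`, as the options table describes. It makes no assumption about file names or subdirectory layout.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count .md files under a directory, recursively.
# Prints the count on stdout; returns 1 if the directory does not exist.
count_markdown_files() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir" >&2
    return 1
  fi
  # tr strips the padding some wc implementations add around the number.
  find "$dir" -type f -name '*.md' | wc -l | tr -d ' '
}
```

For the command above, `count_markdown_files ./docs/` can be compared against the `--limit` value to see whether the crawl hit its page cap.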

Tips

Start conservative (`--max-depth 1`, `--limit 20`) and scale up.
Use `--select-paths` to focus on the section you need.
Use map first to understand site structure before a full crawl.
Always set `--limit` to prevent runaway crawls.
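
One way to make these tips hard to forget is a small wrapper that always passes explicit caps. The sketch below is illustrative: `safe_crawl` and the `DRY_RUN` variable are hypothetical names, the conservative defaults mirror the first tip, and the depth range comes from the options table. It uses only flags listed in this document.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run `tvly crawl` with explicit --max-depth and --limit.
# Usage: safe_crawl URL [MAX_DEPTH] [LIMIT]
# Set DRY_RUN=1 to print the command instead of running it.
safe_crawl() {
  local url="$1" depth="${2:-1}" limit="${3:-20}"
  # --max-depth accepts 1-5 according to the options table.
  case "$depth" in
    [1-5]) ;;
    *) echo "max-depth must be 1-5, got: $depth" >&2; return 2 ;;
  esac
  # Require a positive integer with no leading zero for --limit.
  case "$limit" in
    ''|*[!0-9]*|0*) echo "limit must be a positive integer, got: $limit" >&2; return 2 ;;
  esac
  local -a cmd=(tvly crawl "$url" --max-depth "$depth" --limit "$limit" --json)
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Dry run with the conservative defaults; prints the command only.
DRY_RUN=1 safe_crawl "https://docs.example.com"
```

Scaling up then means changing two arguments, for example `safe_crawl "https://docs.example.com" 2 50`, rather than remembering to add the caps each time.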

See also

tavily-map — discover URLs before deciding to crawl
tavily-extract — extract individual pages
tavily-search — find pages when you don't have a URL
