# Firecrawl & Jina Web Scraping

## Firecrawl vs WebFetch

Prefer `firecrawl scrape URL --only-main-content` over the WebFetch tool—it produces cleaner markdown, handles JavaScript-heavy pages, and avoids content truncation (>80% benchmark coverage). WebFetch is acceptable as a fallback when Firecrawl is unavailable.

```bash
# Preferred approach:
firecrawl scrape https://docs.example.com/api --only-main-content
```

## Token-Efficient Scraping

Inspired by Anthropic's dynamic filtering—always filter before reasoning. This reduced input tokens by ~24% and improved accuracy by ~11% in their benchmarks.

### The Principle: Search → Filter → Scrape → Filter → Reason

**DO:** Search (titles/URLs only) → Evaluate relevance → Scrape top hits → Filter by section → Reason
**DON'T:** Search → Scrape everything → Reason over all of it
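The "filter by section" step is simple enough to sketch. This is an illustrative stand-in, not the bundled `filter_web_results.py`: it keeps only the markdown sections whose heading matches a requested name.

```python
import re

def filter_sections(markdown: str, wanted: list[str]) -> str:
    """Keep only markdown sections whose heading contains a wanted term
    (case-insensitive). A section runs from its heading to the next heading."""
    out, keep = [], False
    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            keep = any(w.lower() in line.lower() for w in wanted)
        if keep:
            out.append(line)
    return "\n".join(out)
```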

### Step-by-Step Efficient Workflow

```bash
# Step 1: Search — get titles/URLs only (cheap)
firecrawl search "query" --limit 20

# Step 2: Evaluate results, pick 3-5 best URLs

# Step 3: Scrape only those, filter to relevant sections
firecrawl scrape URL1 --only-main-content |
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py \
    --sections "API,Authentication" --max-chars 5000
```
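Step 2 (evaluate results, pick the best URLs) can be mechanized with a simple relevance score. A hypothetical sketch (in practice the agent usually evaluates titles directly):

```python
def top_urls(results: list[dict], query_terms: list[str], k: int = 5) -> list[str]:
    """Rank search hits by how many query terms appear in the title,
    then keep the top k. `results` items look like {"title": ..., "url": ...}."""
    def score(hit):
        title = hit["title"].lower()
        return sum(term.lower() in title for term in query_terms)
    ranked = sorted(results, key=score, reverse=True)
    return [hit["url"] for hit in ranked[:k] if score(hit) > 0]
```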

### Post-Processing with filter_web_results.py

Pipe any Firecrawl or Exa output through this script to reduce context before reasoning:

```bash
# Extract only matching sections from scraped page
firecrawl scrape URL --only-main-content |
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --sections "Pricing,Plans"

# Keep only paragraphs with keywords
firecrawl search "query" --scrape --pretty |
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --keywords "pricing,cost" --max-chars 5000

# Extract specific JSON fields from API output
python3 ~/.claude/skills/exa-search/scripts/exa_search.py "query" --json |
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --fields "title,url,text" --max-chars 3000

# Combine filters with stats
firecrawl scrape URL --only-main-content |
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --sections "API" --keywords "endpoint" --compact --stats
```

**Full path:** `python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py`
**Flags:** `--sections`, `--keywords`, `--max-chars`, `--max-lines`, `--fields` (JSON), `--strip-links`, `--strip-images`, `--compact`, `--stats`
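The `--keywords` plus `--max-chars` combination can be approximated in a few lines. An illustrative sketch, not the actual script's behavior:

```python
def filter_paragraphs(text: str, keywords: list[str], max_chars: int = 5000) -> str:
    """Keep only paragraphs (blank-line separated) containing a keyword,
    then truncate the joined result to max_chars."""
    kept = [
        p for p in text.split("\n\n")
        if any(k.lower() in p.lower() for k in keywords)
    ]
    return "\n\n".join(kept)[:max_chars]
```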

### Other Token-Saving Patterns

- Use `--only-main-content` to strip navigation and footer boilerplate, reducing token consumption. Omit only when nav/footer content is specifically needed.
- Use `firecrawl map URL --search "topic"` **first** to find relevant subpages before scraping.
- Use `--format links` **first** to get a URL list, evaluate, then scrape selectively.
- Use `--max-chars` with `exa_contents.py` to cap extraction length.
- Use `--formats summary` (Python API script) over full text when you need the gist, not raw content.
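In the same spirit as the `--strip-links`/`--strip-images` flags above, markdown decoration can be dropped before reasoning. A rough sketch (the regexes are approximations, not the script's exact behavior):

```python
import re

def strip_markdown_noise(md: str) -> str:
    """Drop images entirely and collapse [text](url) links to their anchor text."""
    md = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", md)      # images: ![alt](src)
    md = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", md)  # links: keep anchor text
    return md
```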

## Claude API Native Tools (for API Agent Builders)

Anthropic's API now offers built-in dynamic filtering tools: `web_search_20260209` / `web_fetch_20260209`
Header: `anthropic-beta: code-execution-web-tools-2026-02-09`

These have built-in dynamic filtering via code execution. Use them when building Claude API agents directly. Use Firecrawl/Exa when you need: autonomous agents, batch scraping, structured extraction, domain-specific crawling, or when not on the Claude API.
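For orientation, a request enabling these tools might be assembled like this. The tool names and beta header come from the section above; the payload shape and model name are assumptions following the Messages API's usual tool conventions, so verify against Anthropic's API reference before use:

```python
def build_web_tools_request(prompt: str) -> dict:
    """Assemble headers and body for a Messages API call with the beta web
    tools. Hedged sketch: model name and exact shape should be verified."""
    return {
        "headers": {
            "anthropic-beta": "code-execution-web-tools-2026-02-09",
            "content-type": "application/json",
        },
        "body": {
            "model": "claude-sonnet-4-5",  # assumed model name
            "max_tokens": 1024,
            "tools": [
                {"type": "web_search_20260209", "name": "web_search"},
                {"type": "web_fetch_20260209", "name": "web_fetch"},
            ],
            "messages": [{"role": "user", "content": prompt}],
        },
    }
```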

## Available Tools

### 1. Official Firecrawl CLI (`firecrawl`) — Primary

Setup: `npm install -g firecrawl-cli && firecrawl login --api-key $FIRECRAWL_API_KEY`

| Command | Purpose | Quick Example |
|---|---|---|
| `scrape` | Single page → markdown | `firecrawl scrape URL --only-main-content` |
| `crawl` | Entire site with progress | `firecrawl crawl URL --wait --progress --limit 50` |
| `map` | Discover all URLs on a site | `firecrawl map URL --search "API"` |
| `search` | Web search (+ optional scrape) | `firecrawl search "query" --limit 10` |

Full CLI reference: `references/cli-reference.md`

### 2. Auto-Save Alias (`fc-save`) — Shell Alias

Requires shell alias setup (not bundled with this skill).

```bash
fc-save URL
# → Saves to ~/Desktop/Screencaps & Chats/Web-Scrapes/docs-example-com-api.md
```
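The alias itself isn't bundled, but the filename it produces suggests a simple URL-to-slug rule. A hypothetical reimplementation of just the naming step:

```python
import re

def url_to_slug(url: str) -> str:
    """https://docs.example.com/api -> docs-example-com-api
    (strip the scheme, turn dots/slashes into dashes, trim stray dashes)."""
    slug = re.sub(r"^https?://", "", url)
    slug = re.sub(r"[/.]+", "-", slug)
    return slug.strip("-")
```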

### 3. Python API Script (`firecrawl_api.py`) — Advanced Features

Command: `python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py <command>`
Requires: `FIRECRAWL_API_KEY` env var, `pip install firecrawl-py requests`

| Command | Purpose | Quick Example |
|---|---|---|
| `search` | Web search with scraping | `firecrawl_api.py search "query" -n 10` |
| `scrape` | Single URL with page actions | `firecrawl_api.py scrape URL --formats markdown summary` |
| `batch-scrape` | Multiple URLs concurrently | `firecrawl_api.py batch-scrape URL1 URL2 URL3` |
| `crawl` | Website crawling | `firecrawl_api.py crawl URL --limit 20` |
| `map` | URL discovery | `firecrawl_api.py map URL --search "query"` |
| `extract` | LLM-powered structured extraction | `firecrawl_api.py extract URL --prompt "Find pricing"` |
| `agent` | Autonomous extraction (no URLs needed) | `firecrawl_api.py agent "Find YC W24 AI startups"` |
| `parallel-agent` | Bulk agent queries (v2.8.0+) | `firecrawl_api.py parallel-agent "Q1" "Q2" "Q3"` |

Agent models: `spark-1-fast` (10 credits, simple), `spark-1-mini` (default), `spark-1-pro` (thorough)

Full Python API reference: `references/python-api-reference.md`
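`batch-scrape` fans multiple URLs out concurrently; the underlying pattern looks roughly like this, where `scrape_fn` is a placeholder for a real Firecrawl call (this is not the script's actual internals):

```python
from concurrent.futures import ThreadPoolExecutor

def batch_scrape(urls: list[str], scrape_fn, max_workers: int = 5) -> list:
    """Run scrape_fn over urls concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_fn, urls))
```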

### 4. DeepWiki — GitHub Repo Documentation

AI-generated wiki for any public GitHub repo. No API key required.

```bash
~/.claude/skills/firecrawl/scripts/deepwiki.sh <owner/repo> [section] [options]

# Overview
~/.claude/skills/firecrawl/scripts/deepwiki.sh karpathy/nanochat

# Browse sections
~/.claude/skills/firecrawl/scripts/deepwiki.sh langchain-ai/langchain --toc

# Specific section
~/.claude/skills/firecrawl/scripts/deepwiki.sh karpathy/nanochat 4.1-gpt-transformer-implementation

# Full dump for RAG
~/.claude/skills/firecrawl/scripts/deepwiki.sh openai/openai-python --all --save
```

### 5. Jina Reader (`jina`) — Fallback

Use when Firecrawl fails or for Twitter/X URLs (Firecrawl blocks Twitter, Jina works).

```bash
jina https://x.com/username/status/123456
```

## Firecrawl vs Exa vs Native Claude Tools

| Need | Best Tool | Why |
|---|---|---|
| Single page → markdown | `firecrawl scrape --only-main-content` | Cleanest output |
| Search + scrape in one shot | `firecrawl search --scrape` | Combined operation |
| Crawl entire site | `firecrawl crawl --wait --progress` | Link following + progress |
| Autonomous data finding | `firecrawl_api.py agent` | No URLs needed |
| Semantic/neural search | Exa `exa_search.py` | AI-powered relevance |
| Find research papers | Exa `--category "research paper"` | Academic index |
| Quick research answer | Exa `exa_research.py` | Citations + synthesis |
| Find similar pages | Exa `exa_similar.py` | Competitive analysis |
| Claude API agent building | Native `web_search_20260209` | Built-in dynamic filtering |
| Twitter/X content | `jina URL` | Only tool that works |
| GitHub repo docs | `deepwiki.sh owner/repo` | AI-generated wiki |
| Anti-bot / Cloudflare bypass | `scrapling` stealth fetch | Local Turnstile solver |
| Element-level extraction | `scrapling` + CSS selectors | Precision targeting, adaptive tracking |
| No API key scraping | `scrapling` HTTP fetch | 100% local, no credentials |
| Site redesign resilience | `scrapling` adaptive mode | SQLite similarity matching |

## Common Workflows

### Single Page Scraping

```bash
firecrawl scrape https://example.com/page --only-main-content

# Or auto-save:
fc-save URL

# Or to file:
firecrawl scrape URL --only-main-content -o page.md
```

### Documentation Crawling

```bash
# Map first, then crawl relevant paths
firecrawl map https://docs.example.com --search "API"
firecrawl crawl https://docs.example.com --include-paths /api,/guides --wait --progress
```
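The map-then-crawl pattern narrows work by path: `--include-paths /api,/guides` keeps only matching subtrees. The idea, sketched (not Firecrawl's actual implementation):

```python
from urllib.parse import urlparse

def filter_by_paths(urls: list[str], include_paths: list[str]) -> list[str]:
    """Keep URLs whose path starts with one of the include prefixes."""
    return [
        u for u in urls
        if any(urlparse(u).path.startswith(p) for p in include_paths)
    ]
```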

### Research Workflow

```bash
firecrawl search "machine learning best practices 2026" --scrape --scrape-formats markdown
```

### Agent-Powered Research (No URLs Needed)

```bash
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py agent \
  "Compare pricing tiers for Firecrawl, Apify, and ScrapingBee"
```

## Troubleshooting

```bash
# Check status and credits
firecrawl --status && firecrawl credit-usage

# Re-authenticate
firecrawl logout && firecrawl login --api-key $FIRECRAWL_API_KEY

# Check API key
echo $FIRECRAWL_API_KEY
```

- **Scrape fails:** Try `jina URL`, or add `--wait-for 3000` for JS-heavy sites
- **Async job stuck:** Check with `crawl-status`/`batch-status`, cancel with `crawl-cancel`/`batch-cancel`
- **Disable telemetry:** `export FIRECRAWL_NO_TELEMETRY=1`

---

## Reference Documentation

| File | Contents |
|---|---|
| `references/cli-reference.md` | Full CLI parameter reference (scrape, crawl, map, search, fc-save, jina, deepwiki) |
| `references/python-api-reference.md` | Full Python API script reference (all commands, SDK examples) |
| `references/firecrawl-api.md` | Firecrawl Search API reference |
| `references/firecrawl-agent-api.md` | Agent API (spark models, parallel agents, webhooks) |
| `references/actions-reference.md` | Page actions for dynamic content (click, write, wait, scroll) |
| `references/branding-format.md` | Brand identity extraction (colors, fonts, UI) |

## Test Suite

```bash
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py --quick    # Quick validation
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py            # Full suite
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py --test scrape  # Specific test
```