# Firecrawl & Jina Web Scraping

## Firecrawl vs WebFetch
Prefer `firecrawl` over the WebFetch tool—it produces cleaner markdown, handles JavaScript-heavy pages, and avoids content truncation (>80% benchmark coverage). WebFetch is acceptable as a fallback when Firecrawl is unavailable.
Preferred approach:

```bash
firecrawl scrape https://docs.example.com/api --only-main-content
```
## Token-Efficient Scraping
Inspired by Anthropic's dynamic filtering—always filter before reasoning. This reduced input tokens by ~24% and improved accuracy by ~11% in their benchmarks.
The Principle: Search → Filter → Scrape → Filter → Reason
**DO:** Search (titles/URLs only) → Evaluate relevance → Scrape top hits → Filter by section → Reason

**DON'T:** Search → Scrape everything → Reason over all of it

### Step-by-Step Efficient Workflow
```bash
# Step 1: Search — get titles/URLs only (cheap)
firecrawl search "query" --limit 20

# Step 2: Evaluate results, pick 3-5 best URLs

# Step 3: Scrape only those, filter to relevant sections
firecrawl scrape URL1 --only-main-content |
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py \
    --sections "API,Authentication" --max-chars 5000
```
### Post-Processing with filter_web_results.py
Pipe any Firecrawl or Exa output through this script to reduce context before reasoning:
```bash
# Extract only matching sections from scraped page
firecrawl scrape URL --only-main-content |
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --sections "Pricing,Plans"

# Keep only paragraphs with keywords
firecrawl search "query" --scrape --pretty |
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --keywords "pricing,cost" --max-chars 5000

# Extract specific JSON fields from API output
python3 ~/.claude/skills/exa-search/scripts/exa_search.py "query" --json |
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --fields "title,url,text" --max-chars 3000

# Combine filters with stats
firecrawl scrape URL --only-main-content |
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --sections "API" --keywords "endpoint" --compact --stats
```
**Full path:** `python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py`
**Flags:** `--sections`, `--keywords`, `--max-chars`, `--max-lines`, `--fields` (JSON), `--strip-links`, `--strip-images`, `--compact`, `--stats`
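The kind of keyword filtering the script performs can be illustrated with a minimal sketch. This is an illustrative stand-in, not the bundled `filter_web_results.py`; the `filter_paragraphs` helper and sample text are hypothetical:

```python
import re

def filter_paragraphs(markdown: str, keywords: list[str], max_chars: int = 5000) -> str:
    """Keep only paragraphs mentioning any keyword, then cap total length."""
    paras = [p.strip() for p in re.split(r"\n\s*\n", markdown) if p.strip()]
    kept = [p for p in paras if any(k.lower() in p.lower() for k in keywords)]
    return "\n\n".join(kept)[:max_chars]

doc = """# Plans

Our pricing starts at $9/month.

A long paragraph about company history.

Enterprise cost is negotiable."""

print(filter_paragraphs(doc, ["pricing", "cost"]))
```

Dropping the irrelevant paragraphs before the model ever sees them is what produces the token savings described above.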
### Other Token-Saving Patterns
- Use `--only-main-content` to strip navigation and footer boilerplate, reducing token consumption. Omit it only when nav/footer content is specifically needed.
- Use `firecrawl map URL --search "topic"` first to find relevant subpages before scraping.
- Use `--format links` first to get a URL list, evaluate it, then scrape selectively.
- Use `exa_contents.py` with `--max-chars` to cap extraction length.
- Use `--formats summary` (Python API script) over full text when you need the gist, not raw content.
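The map-then-scrape pattern above can be sketched as a small loop. The `scrape_matching` helper is hypothetical, and it assumes (unverified) that `firecrawl map` prints one URL per line:

```shell
# Hypothetical helper: map a site for matching subpages, then scrape each one.
# Assumes `firecrawl map` emits one URL per line (verify against your CLI version).
scrape_matching() {
  site="$1"; topic="$2"
  firecrawl map "$site" --search "$topic" | while read -r url; do
    firecrawl scrape "$url" --only-main-content
  done
}
```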
### Claude API Native Tools (for API Agent Builders)
Anthropic's API now offers built-in dynamic filtering tools:
`web_search_20260209` / `web_fetch_20260209`
Header: `anthropic-beta: code-execution-web-tools-2026-02-09`

These have built-in dynamic filtering via code execution. Use them when building Claude API agents directly. Use Firecrawl/Exa when you need autonomous agents, batch scraping, structured extraction, or domain-specific crawling, or when you are not on the Claude API.
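A sketch of how the beta header and tool types might be wired into a Messages API request. The payload shape follows Anthropic's usual conventions but is an assumption here, not a spec; check the official docs before relying on it:

```python
# Hypothetical request payload for the beta web tools; only the tool type
# strings and the beta header value come from this document.
payload = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "tools": [
        {"type": "web_search_20260209", "name": "web_search"},
        {"type": "web_fetch_20260209", "name": "web_fetch"},
    ],
    "messages": [{"role": "user", "content": "Summarize https://example.com"}],
}
headers = {"anthropic-beta": "code-execution-web-tools-2026-02-09"}
```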
## Available Tools
### 1. Official Firecrawl CLI (`firecrawl`) — Primary

Setup: `npm install -g firecrawl-cli && firecrawl login --api-key $FIRECRAWL_API_KEY`

| Command | Purpose | Quick Example |
|---|---|---|
| `scrape` | Single page → markdown | `firecrawl scrape URL --only-main-content` |
| `crawl` | Entire site with progress | `firecrawl crawl URL --wait --progress` |
| `map` | Discover all URLs on a site | `firecrawl map URL --search "topic"` |
| `search` | Web search (+ optional scrape) | `firecrawl search "query" --scrape` |

Full CLI reference: `references/cli-reference.md`
### 2. Auto-Save Alias (`fc-save`) — Shell Alias

Requires shell alias setup (not bundled with this skill).

```bash
fc-save URL
```
→ Saves to `~/Desktop/Screencaps & Chats/Web-Scrapes/docs-example-com-api.md`
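Since the alias is not bundled, here is one possible shape for it. This is a hedged sketch: the `fc_slug` helper and the exact slug rule are assumptions inferred from the example output above:

```shell
# Hypothetical fc-save implementation: scrape a URL and save it as markdown
# under a slugified filename, e.g. https://docs.example.com/api -> docs-example-com-api.md
fc_slug() {
  echo "$1" | sed -E 's#^https?://##; s#/+$##; s#[^A-Za-z0-9]+#-#g'
}
fc_save() {   # alias fc-save=fc_save in your shell rc
  dir="$HOME/Desktop/Screencaps & Chats/Web-Scrapes"
  mkdir -p "$dir"
  firecrawl scrape "$1" --only-main-content -o "$dir/$(fc_slug "$1").md"
}
```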
### 3. Python API Script (`firecrawl_api.py`) — Advanced Features

Command: `python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py <command>`
Requires: `FIRECRAWL_API_KEY` env var, `pip install firecrawl-py requests`

The script's subcommands cover:

- Web search with scraping
- Single URL scraping with page actions
- Concurrent scraping of multiple URLs
- Website crawling
- URL discovery
- LLM-powered structured extraction
- `agent`: autonomous extraction (no URLs needed)
- Bulk agent queries (v2.8.0+)

Agent models: `spark-1-fast` (10 credits, simple), `spark-1-mini` (default), `spark-1-pro` (thorough)

Full Python API reference: `references/python-api-reference.md`

### 4. DeepWiki — GitHub Repo Documentation
```bash
~/.claude/skills/firecrawl/scripts/deepwiki.sh <owner/repo> [section] [options]
```

AI-generated wiki for any public GitHub repo. No API key required.

```bash
# Overview
~/.claude/skills/firecrawl/scripts/deepwiki.sh karpathy/nanochat

# Browse sections
~/.claude/skills/firecrawl/scripts/deepwiki.sh langchain-ai/langchain --toc

# Specific section
~/.claude/skills/firecrawl/scripts/deepwiki.sh karpathy/nanochat 4.1-gpt-transformer-implementation

# Full dump for RAG
~/.claude/skills/firecrawl/scripts/deepwiki.sh openai/openai-python --all --save
```
undefined5. Jina Reader (jina
) — Fallback
jina5. Jina Reader(jina
)—— 备选工具
jinaUse when Firecrawl fails or for Twitter/X URLs (Firecrawl blocks Twitter, Jina works).
bash
jina https://x.com/username/status/123456当Firecrawl抓取失败,或是需要抓取Twitter/X的URL时使用(Firecrawl会屏蔽Twitter,Jina可以正常抓取)。
bash
jina https://x.com/username/status/123456Firecrawl vs Exa vs Native Claude Tools
Firecrawl、Exa与Claude原生工具对比
| Need | Best Tool | Why |
|---|---|---|
| Single page → markdown | `firecrawl scrape` | Cleanest output |
| Search + scrape in one shot | `firecrawl search --scrape` | Combined operation |
| Crawl entire site | `firecrawl crawl` | Link following + progress |
| Autonomous data finding | `firecrawl_api.py agent` | No URLs needed |
| Semantic/neural search | Exa | AI-powered relevance |
| Find research papers | Exa | Academic index |
| Quick research answer | Exa | Citations + synthesis |
| Find similar pages | Exa | Competitive analysis |
| Claude API agent building | Native | Built-in dynamic filtering |
| Twitter/X content | `jina` | Only tool that works |
| GitHub repo docs | `deepwiki.sh` | AI-generated wiki |
| Anti-bot / Cloudflare bypass | | Local Turnstile solver |
| Element-level extraction | | Precision targeting, adaptive tracking |
| No API key scraping | | 100% local, no credentials |
| Site redesign resilience | | SQLite similarity matching |
## Common Workflows

### Single Page Scraping

```bash
firecrawl scrape https://example.com/page --only-main-content
```

Or auto-save: `fc-save URL`
Or to file: `firecrawl scrape URL --only-main-content -o page.md`
### Documentation Crawling

```bash
# Map first, then crawl relevant paths
firecrawl map https://docs.example.com --search "API"
firecrawl crawl https://docs.example.com --include-paths /api,/guides --wait --progress
```
undefinedResearch Workflow
研究工作流
bash
firecrawl search "machine learning best practices 2026" --scrape --scrape-formats markdownbash
firecrawl search "machine learning best practices 2026" --scrape --scrape-formats markdownAgent-Powered Research (No URLs Needed)
Agent驱动的研究(无需提供URL)
bash
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py agent \
"Compare pricing tiers for Firecrawl, Apify, and ScrapingBee"bash
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py agent \
"Compare pricing tiers for Firecrawl, Apify, and ScrapingBee"Troubleshooting
故障排查
```bash
# Check status and credits
firecrawl --status && firecrawl credit-usage

# Re-authenticate
firecrawl logout && firecrawl login --api-key $FIRECRAWL_API_KEY

# Check API key
echo $FIRECRAWL_API_KEY
```

- **Scrape fails:** Try `jina URL`, or add `--wait-for 3000` for JS-heavy sites
- **Async job stuck:** Check with `crawl-status`/`batch-status`, cancel with `crawl-cancel`/`batch-cancel`
- **Disable telemetry:** `export FIRECRAWL_NO_TELEMETRY=1`
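The scrape-failure fallback can be scripted. A minimal sketch: the `scrape_or_jina` helper is hypothetical, and it assumes both CLIs are installed and logged in:

```shell
# Hypothetical wrapper: try firecrawl first (with a JS wait), fall back to jina.
scrape_or_jina() {
  firecrawl scrape "$1" --only-main-content --wait-for 3000 2>/dev/null \
    || jina "$1"
}
```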
---

## Reference Documentation
| File | Contents |
|---|---|
| `references/cli-reference.md` | Full CLI parameter reference (scrape, crawl, map, search, fc-save, jina, deepwiki) |
| `references/python-api-reference.md` | Full Python API script reference (all commands, SDK examples) |
| | Firecrawl Search API reference |
| | Agent API (spark models, parallel agents, webhooks) |
| | Page actions for dynamic content (click, write, wait, scroll) |
| | Brand identity extraction (colors, fonts, UI) |
## Test Suite

```bash
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py --quick         # Quick validation
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py                 # Full suite
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py --test scrape   # Specific test
```