firecrawl-scraper

Firecrawl Web Scraper Skill

Status: Production Ready ✅ | Last Updated: 2025-11-21 | Official Docs: https://docs.firecrawl.dev | API Version: v2.5

What is Firecrawl?

Firecrawl is a Web Data API for AI that turns entire websites into LLM-ready markdown or structured data. It handles:
  • JavaScript rendering - Executes client-side JavaScript to capture dynamic content
  • Anti-bot bypass - Gets past CAPTCHA and bot detection systems
  • Format conversion - Outputs as markdown, JSON, or structured data
  • Screenshot capture - Saves visual representations of pages
  • Browser automation - Full headless browser capabilities

API Endpoints

1. `/v2/scrape` - Single Page Scraping

Scrapes a single webpage and returns clean, structured content.
Use Cases:
  • Extract article content
  • Get product details
  • Scrape specific pages
  • Convert HTML to markdown
Key Options:
  • `formats`: ["markdown", "html", "screenshot"]
  • `onlyMainContent`: true/false (removes nav, footer, ads)
  • `waitFor`: milliseconds to wait before scraping
  • `actions`: browser automation actions (click, scroll, etc.)
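
The options above end up as fields in the JSON body of a POST to `/v2/scrape`. The following is a minimal Python sketch using only the standard library. `build_scrape_request` is a hypothetical helper name, and the endpoint URL and Bearer header are taken from the REST example in the Cloudflare Workers section of this document. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Endpoint as used in this document's REST example.
API_URL = "https://api.firecrawl.dev/v2/scrape"


def build_scrape_request(api_key, url, formats=("markdown",),
                         only_main_content=True, wait_for_ms=None, actions=None):
    """Build (but do not send) a POST request for /v2/scrape.

    Hypothetical helper: it only assembles the options listed above.
    """
    body = {
        "url": url,
        "formats": list(formats),
        "onlyMainContent": only_main_content,
    }
    if wait_for_ms is not None:
        body["waitFor"] = wait_for_ms  # milliseconds to wait before scraping
    if actions:
        body["actions"] = list(actions)  # browser automation steps
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_scrape_request(
    "fc-your-api-key-here", "https://example.com", wait_for_ms=2000
)
payload = json.loads(request.data)
print(payload)
```

Passing the returned object to `urllib.request.urlopen` would perform the call. Keeping construction separate from sending makes the body easy to inspect before any credits are spent.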

2. `/v2/crawl` - Full Site Crawling

Crawls all accessible pages from a starting URL.
Use Cases:
  • Index entire documentation sites
  • Archive website content
  • Build knowledge bases
  • Scrape multi-page content
Key Options:
  • `limit`: max pages to crawl
  • `maxDepth`: how many links deep to follow
  • `allowedDomains`: restrict to specific domains
  • `excludePaths`: skip certain URL patterns
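
To make the interaction between `limit`, `maxDepth`, and `excludePaths` concrete, here is a toy model in Python: a breadth-first walk over a small in-memory link graph. It is an illustration of the general idea only, not Firecrawl's crawler. The graph, the `plan_crawl` name, and the choice to treat exclude patterns as regular expressions are all assumptions made for this sketch.

```python
import re
from collections import deque

# Toy link graph standing in for a website (made-up paths).
LINKS = {
    "/": ["/docs", "/blog", "/admin"],
    "/docs": ["/docs/intro", "/docs/api"],
    "/blog": ["/blog/post-1"],
    "/docs/intro": ["/docs/intro/deep"],
}


def plan_crawl(start, limit, max_depth, exclude_paths=()):
    """Breadth-first walk that stops after `limit` pages and never
    follows links more than `max_depth` hops from `start`."""
    excluded = [re.compile(pattern) for pattern in exclude_paths]
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue and len(order) < limit:
        path, depth = queue.popleft()
        order.append(path)
        if depth == max_depth:
            continue  # depth budget used up: do not follow links further
        for nxt in LINKS.get(path, []):
            if nxt in seen or any(rx.search(nxt) for rx in excluded):
                continue
            seen.add(nxt)
            queue.append((nxt, depth + 1))
    return order


pages = plan_crawl("/", limit=5, max_depth=2, exclude_paths=[r"^/admin"])
print(pages)
```

In this model `limit` caps the total page count, `max_depth` caps how far from the start page links are followed, and excluded paths are dropped before they ever enter the queue.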

3. `/v2/map` - URL Discovery

Maps all URLs on a website without scraping content.
Use Cases:
  • Find sitemap
  • Discover all pages
  • Plan crawling strategy
  • Audit website structure

4. `/v2/extract` - Structured Data Extraction

Uses AI to extract specific data fields from pages.
Use Cases:
  • Extract product prices and names
  • Parse contact information
  • Build structured datasets
  • Custom data schemas
Key Options:
  • `schema`: Zod or JSON schema defining desired structure
  • `systemPrompt`: guide AI extraction behavior
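
As an illustration of the `schema` option, the sketch below writes a JSON Schema for a product record as a plain Python dictionary and places it in a request body. The product fields and the example URL are made up. The `urls` key is an assumption about the body shape, so check the official API reference before relying on it.

```python
import json

# JSON Schema describing the fields to extract (hypothetical product example).
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["name", "price"],
}

# Request body for /v2/extract. `schema` and `systemPrompt` are the options
# listed above; the `urls` key is an assumption about the body shape.
extract_body = {
    "urls": ["https://example.com/products/widget"],
    "schema": product_schema,
    "systemPrompt": "Extract product data exactly as shown on the page.",
}

print(json.dumps(extract_body, indent=2))
```

Listing only `name` and `price` under `required` leaves `currency` optional, which suits pages that do not state a currency explicitly.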

Authentication

Firecrawl requires an API key for all requests.

Get API Key

  1. Sign up at https://www.firecrawl.dev
  2. Go to dashboard → API Keys
  3. Copy your API key (starts with `fc-`)

Store Securely

NEVER hardcode API keys in code!

```bash
# .env file
FIRECRAWL_API_KEY=fc-your-api-key-here

# .env.local (for local development)
FIRECRAWL_API_KEY=fc-your-api-key-here
```
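
A small startup check can catch a missing or malformed key before the first request. This is a minimal sketch, assuming the variable has already been exported into the process environment (a `.env` file is not read automatically; a loader such as python-dotenv or a shell `export` is needed for that). `load_api_key` is a hypothetical helper name.

```python
import os


def load_api_key(env=None):
    """Return the Firecrawl API key from the environment, or raise if it looks wrong."""
    env = os.environ if env is None else env
    key = env.get("FIRECRAWL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set")
    if not key.startswith("fc-"):
        raise RuntimeError("FIRECRAWL_API_KEY should start with 'fc-'")
    return key


# Example with an explicit mapping instead of the real environment:
print(load_api_key({"FIRECRAWL_API_KEY": "fc-your-api-key-here"}))
```

Accepting an optional mapping keeps the check testable without touching real environment variables.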

SDK Quick Start

Python

```bash
pip install firecrawl-py  # v4.5.0+
```

```python
from firecrawl import FirecrawlApp
import os

app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))
result = app.scrape_url("https://example.com", params={"formats": ["markdown"], "onlyMainContent": True})
print(result.get("markdown"))
```

TypeScript/Node.js

```bash
bun add @mendable/firecrawl-js  # v4.4.1+
```

```typescript
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const result = await app.scrapeUrl('https://example.com', { formats: ['markdown'], onlyMainContent: true });
console.log(result.markdown);
```

See: `templates/` for crawl, extract, and advanced examples.

Common Use Cases

| Use Case | Endpoint | Key Options |
| --- | --- | --- |
| Documentation scraping | `crawl_url()` | `limit: 500`, `allowedDomains` |
| Product data extraction | `extract()` | Zod schema + `systemPrompt` |
| News article scraping | `scrape_url()` | `onlyMainContent: true`, `removeBase64Images` |
| URL discovery | `map()` | Find all pages before crawling |

See: `references/common-patterns.md` for complete examples.

Error Handling

```python
# Python
# `app` is the client created in SDK Quick Start. FirecrawlException must be
# imported from the SDK; its import path depends on the installed version.
try:
    result = app.scrape_url("https://example.com")
except FirecrawlException as e:
    print(f"Firecrawl error: {e}")
```

```typescript
// TypeScript
try {
  const result = await app.scrapeUrl('https://example.com');
  console.log(result.markdown);
} catch (error) {
  // `error` is typed `unknown` in strict mode, so narrow it before reading .message
  console.error('Error:', error instanceof Error ? error.message : error);
}
```
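
Transient failures such as timeouts are often worth retrying before giving up. The sketch below shows one common approach, exponential backoff, as a generic wrapper around any callable. It is not part of the Firecrawl SDK: `with_retries` and `flaky_scrape` are hypothetical names, and the stand-in function fails on purpose so the example runs without network access.

```python
import time


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call(); on failure wait base_delay * 2**n seconds and retry,
    giving up after `attempts` tries."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:  # narrow this to the SDK's exception type in real code
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a stand-in that fails twice, then succeeds (no network involved).
calls = {"n": 0}


def flaky_scrape():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Timeout error")
    return {"markdown": "# Example"}


delays = []  # record the requested delays instead of actually sleeping
result = with_retries(flaky_scrape, sleep=delays.append)
print(result, delays)
```

In real use, `call` would be something like `lambda: app.scrape_url(url)`, and the default `sleep` would pause between attempts.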

Rate Limits & Best Practices

| Best Practice | Why |
| --- | --- |
| Use `onlyMainContent: true` | Reduces credits, cleaner output |
| Set reasonable `limit` | Avoid excessive costs |
| Use `map` endpoint first | Plan crawling strategy |
| Cache results | Avoid re-scraping |
| Batch extract calls | More efficient for multiple URLs |

Credits: the free tier includes 500 credits per month; paid tiers include more.
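
The "cache results" practice can be as simple as remembering each URL's result for a fixed time. Below is a minimal in-memory sketch in Python. `ScrapeCache` is a hypothetical class, the one-hour default is an arbitrary choice, and `fake_fetch` stands in for a real scrape call so the example runs offline.

```python
import time


class ScrapeCache:
    """In-memory cache keyed by URL with a time-to-live."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.monotonic):
        self._fetch = fetch      # function: url -> scraped result
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}       # url -> (expires_at, result)

    def get(self, url):
        now = self._clock()
        hit = self._entries.get(url)
        if hit and hit[0] > now:
            return hit[1]        # fresh entry: reuse it, no new scrape
        result = self._fetch(url)
        self._entries[url] = (now + self._ttl, result)
        return result


# Demo with a stand-in fetch function that counts its calls.
fetch_count = {"n": 0}


def fake_fetch(url):
    fetch_count["n"] += 1
    return f"markdown for {url}"


cache = ScrapeCache(fake_fetch, ttl_seconds=60)
first = cache.get("https://example.com")
second = cache.get("https://example.com")
print(fetch_count["n"])
```

A process-local dictionary is lost on restart. A persistent store would serve long-running pipelines better, at the cost of extra setup.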

Cloudflare Workers Integration

⚠️ The SDK cannot run in Workers because it depends on Node.js APIs. Call the REST API directly instead:

```typescript
// Inside a Worker's fetch handler: `env` holds the Worker bindings
// and `url` is the page to scrape.
const response = await fetch('https://api.firecrawl.dev/v2/scrape', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${env.FIRECRAWL_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ url, formats: ['markdown'], onlyMainContent: true })
});
```

See: `references/common-patterns.md` for a complete Workers example with caching.

When to Use This Skill

| ✅ Use Firecrawl | ❌ Don't Use |
| --- | --- |
| Modern JS-rendered sites | Simple static HTML (use cheerio) |
| Clean markdown for LLMs | Existing Puppeteer setup works |
| RAG/chatbot content | Direct API available |
| Structured data extraction | Budget constraints |
| Bot protection bypass | |

Common Issues

| Issue | Cause | Fix |
| --- | --- | --- |
| "Invalid API Key" | Key not set | Check `$FIRECRAWL_API_KEY` starts with `fc-` |
| "Rate limit exceeded" | Monthly credits used | Check dashboard, upgrade plan |
| "Timeout error" | Page slow to load | Add `waitFor: 10000` |
| "Content is empty" | JS loads late | Add `actions: [{type: "wait", milliseconds: 3000}]` |

Advanced Features

| Feature | Usage |
| --- | --- |
| Browser actions | `actions: [{type: "click", selector: "button"}]` |
| Custom headers | `headers: {"User-Agent": "Custom Bot"}` |
| Webhooks | `webhook: "https://your-domain.com/webhook"` |
| Screenshots | `formats: ["screenshot"]` |

See: `references/endpoints.md` for complete API reference.

When to Load References

| Reference | Load When... |
| --- | --- |
| `endpoints.md` | Need complete API endpoint documentation |
| `common-patterns.md` | Cloudflare Workers, caching, batch processing, error handling |

Package Versions

| Package | Version |
| --- | --- |
| firecrawl-py | 4.5.0+ |
| @mendable/firecrawl-js | 4.4.1+ |
| API | v2 |

Note: The Node.js SDK requires Node.js >=22.0.0 and cannot run in Workers.

Token Savings: ~60% | Production Ready: ✅