Firecrawl Web Scraper Skill
Status: Production Ready
Last Updated: 2026-01-20
Official Docs: https://docs.firecrawl.dev
API Version: v2
SDK Versions: firecrawl-py 4.13.0+, @mendable/firecrawl-js 4.11.1+
What is Firecrawl?
Firecrawl is a Web Data API for AI that turns websites into LLM-ready markdown or structured data. It handles:
- JavaScript rendering - Executes client-side JavaScript to capture dynamic content
- Anti-bot bypass - Gets past CAPTCHA and bot detection systems
- Format conversion - Outputs as markdown, HTML, JSON, screenshots, summaries
- Document parsing - Processes PDFs, DOCX files, and images
- Autonomous agents - AI-powered web data gathering without URLs
- Change tracking - Monitor content changes over time
- Branding extraction - Extract color schemes, typography, logos
API Endpoints Overview
| Endpoint | Purpose | Use Case |
|---|---|---|
| `/scrape` | Single page | Extract article, product page |
| `/crawl` | Full site | Index docs, archive sites |
| `/map` | URL discovery | Find all pages, plan strategy |
| `/search` | Web search + scrape | Research with live data |
| `/extract` | Structured data | Product prices, contacts |
| `/agent` | Autonomous gathering | No URLs needed, AI navigates |
| `/batch/scrape` | Multiple URLs | Bulk processing |
1. Scrape Endpoint (/scrape)
Scrapes a single webpage and returns clean, structured content.

```python
from firecrawl import Firecrawl
import os

app = Firecrawl(api_key=os.environ.get("FIRECRAWL_API_KEY"))

doc = app.scrape(
    url="https://example.com/article",
    formats=["markdown", "html"],
    only_main_content=True
)
print(doc.markdown)
print(doc.metadata)
```

```typescript
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

const result = await app.scrape('https://example.com/article', {
  formats: ['markdown', 'html'],
  onlyMainContent: true
});
console.log(result.markdown);
```
| Format | Description |
|---|---|
| `markdown` | LLM-optimized content |
| `html` | Full HTML |
| `rawHtml` | Unprocessed HTML |
| `screenshot` | Page capture (with viewport options) |
| `links` | All URLs on page |
| `json` | Structured data extraction |
| `summary` | AI-generated summary |
| `branding` | Design system data |
| `changeTracking` | Content change detection |
```python
doc = app.scrape(
    url="https://example.com",
    formats=["markdown", "screenshot"],
    only_main_content=True,
    remove_base64_images=True,
    wait_for=5000,  # Wait 5s for JS
    timeout=30000,
    # Location & language
    location={"country": "AU", "languages": ["en-AU"]},
    # Cache control
    max_age=0,  # Fresh content (no cache)
    store_in_cache=True,
    # Stealth mode for complex sites
    stealth=True,
    # Custom headers
    headers={"User-Agent": "Custom Bot 1.0"}
)
```
Perform interactions before scraping:

```python
doc = app.scrape(
    url="https://example.com",
    actions=[
        {"type": "click", "selector": "button.load-more"},
        {"type": "wait", "milliseconds": 2000},
        {"type": "scroll", "direction": "down"},
        {"type": "write", "selector": "input#search", "text": "query"},
        {"type": "press", "key": "Enter"},
        {"type": "screenshot"}  # Capture state mid-action
    ]
)
```
JSON Mode (Structured Extraction)

```python
doc = app.scrape(
    url="https://example.com/product",
    formats=["json"],
    json_options={
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "price": {"type": "number"},
                "in_stock": {"type": "boolean"}
            }
        }
    }
)
```

Without schema (prompt-only):

```python
doc = app.scrape(
    url="https://example.com/product",
    formats=["json"],
    json_options={
        "prompt": "Extract the product name, price, and availability"
    }
)
```
Branding Extraction
Extract design system and brand identity:

```python
doc = app.scrape(
    url="https://example.com",
    formats=["branding"]
)
```

Returns:
- Color schemes and palettes
- Typography (fonts, sizes, weights)
- Spacing and layout metrics
- UI component styles
- Logo and imagery URLs
- Brand personality traits
2. Crawl Endpoint (/crawl)
Crawls all accessible pages from a starting URL.

```python
result = app.crawl(
    url="https://docs.example.com",
    limit=100,
    max_discovery_depth=3,
    allowed_domains=["docs.example.com"],
    exclude_paths=["/api/*", "/admin/*"],
    scrape_options={
        "formats": ["markdown"],
        "only_main_content": True
    }
)

for page in result.data:
    print(f"Scraped: {page.metadata.source_url}")
    print(f"Content: {page.markdown[:200]}...")
```
Async Crawl with Webhooks

Start the crawl asynchronously (the start call returns a job handle immediately), then poll for status:

```python
status = app.check_crawl_status(job.id)
```
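The same poll-until-terminal loop recurs across crawl, batch, and agent jobs, so it is worth factoring out. A sketch: `poll_until_done` is not part of the SDK, just a hypothetical helper that works with any zero-argument status callable (e.g. `lambda: app.check_crawl_status(job.id)`).

```python
import time

def poll_until_done(get_status, interval=2.0, timeout=120.0):
    """Poll a job-status callable until it reports a terminal state.

    get_status: zero-argument callable returning an object with a
    `status` attribute. Raises TimeoutError if no terminal state is
    reached within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status.status in ("completed", "failed", "cancelled"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status.status!r} after {timeout}s")
        time.sleep(interval)
```

The starting `interval` of a couple of seconds also sidesteps the job-status race condition documented in Issue #6 below.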
3. Map Endpoint (/map)
Rapidly discover all URLs on a website without scraping content.

```python
urls = app.map(url="https://example.com")
print(f"Found {len(urls)} pages")
for url in urls[:10]:
    print(url)
```

Use for: sitemap discovery, crawl planning, website audits.
4. Search Endpoint (/search) - NEW
Perform web searches and optionally scrape the results in one operation.

```python
results = app.search(
    query="best practices for React server components",
    limit=10
)
for result in results:
    print(f"{result.title}: {result.url}")
```
Search + scrape results:

```python
results = app.search(
    query="React server components tutorial",
    limit=5,
    scrape_options={
        "formats": ["markdown"],
        "only_main_content": True
    }
)
for result in results:
    print(f"{result.title}")
    print(result.markdown[:500])
```
```python
results = app.search(
    query="machine learning papers",
    limit=20,
    # Filter by source type
    sources=["web", "news", "images"],
    # Filter by category
    categories=["github", "research", "pdf"],
    # Location
    location={"country": "US"},
    # Time filter
    tbs="qdr:m",  # Past month (qdr:h=hour, qdr:d=day, qdr:w=week, qdr:y=year)
    timeout=30000
)
```

Cost: 2 credits per 10 results, plus scraping costs if enabled.
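That cost rule is easy to budget for up front. A sketch of an estimator (`search_credits` is a hypothetical helper, not an SDK function; the assumption that each scraped result adds 1 basic-scrape credit follows from the scrape pricing listed later in this document):

```python
import math

def search_credits(num_results: int, scrape_each: bool = False) -> int:
    """Estimate credit cost of a /search call.

    2 credits per 10 results (rounded up), plus 1 basic-scrape
    credit per result when scrape_options is enabled.
    """
    credits = math.ceil(num_results / 10) * 2
    if scrape_each:
        credits += num_results  # 1 credit per basic scrape
    return credits
```

For example, a 20-result search alone costs 4 credits; scraping all 20 results adds 20 more.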
5. Extract Endpoint (/extract)
AI-powered structured data extraction from single pages, multiple pages, or entire domains.

```python
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float
    description: str
    in_stock: bool

result = app.extract(
    urls=["https://example.com/product"],
    schema=Product,
    system_prompt="Extract product information"
)
print(result.data)
```
Multi-Page / Domain Extraction

```python
# Extract from entire domain using wildcard
result = app.extract(
    urls=["example.com/*"],  # All pages on domain
    schema=Product,
    system_prompt="Extract all products"
)

# Enable web search for additional context
result = app.extract(
    urls=["example.com/products"],
    schema=Product,
    enable_web_search=True  # Follow external links
)
```
Prompt-Only Extraction (No Schema)

```python
result = app.extract(
    urls=["https://example.com/about"],
    prompt="Extract the company name, founding year, and key executives"
)
```

The LLM determines the output structure.
6. Agent Endpoint (/agent) - NEW
Autonomous web data gathering without requiring specific URLs. The agent searches, navigates, and gathers data using natural language prompts.

```python
# Basic agent usage
result = app.agent(
    prompt="Find the pricing plans for the top 3 headless CMS platforms and compare their features"
)
print(result.data)
```
With schema for structured output:

```python
from pydantic import BaseModel
from typing import List

class CMSPricing(BaseModel):
    name: str
    free_tier: bool
    starter_price: float
    features: List[str]

result = app.agent(
    prompt="Find pricing for Contentful, Sanity, and Strapi",
    schema=CMSPricing
)
```

Optionally, the agent can be focused on specific URLs.
| Model | Best For | Cost |
|---|---|---|
| `spark-1` (default) | Simple extractions, high volume | Standard |
| `spark-1-pro` | Complex analysis, ambiguous data | 60% more |

```python
result = app.agent(
    prompt="Analyze competitive positioning...",
    model="spark-1-pro"  # For complex tasks
)
```
```python
# Start agent (returns immediately)
job = app.start_agent(
    prompt="Research market trends..."
)

status = app.check_agent_status(job.id)
if status.status == "completed":
    print(status.data)
```

**Note**: Agent is in Research Preview. 5 free daily requests, then credit-based billing.

---
7. Batch Scrape - NEW
Process multiple URLs efficiently in a single operation.

Synchronous (waits for completion):

```python
results = app.batch_scrape(
    urls=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ],
    formats=["markdown"],
    only_main_content=True
)
for page in results.data:
    print(f"{page.metadata.source_url}: {len(page.markdown)} chars")
```
Asynchronous (with webhooks):

```python
job = app.start_batch_scrape(
    urls=url_list,
    formats=["markdown"],
    webhook="https://your-domain.com/webhook"
)
```

Webhook receives events: `started`, `page`, `completed`, `failed`.
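A minimal dispatcher for those four events can live behind your webhook route. This is a sketch: the payload field names `type`, `data`, and `error` are assumptions about the delivery body, so verify them against an actual webhook delivery before relying on this.

```python
def handle_webhook(payload: dict) -> str:
    """Route a batch-scrape webhook payload by its event type.

    Assumed payload shape: {"type": "...", "data": [...], "error": "..."}.
    Returns a short description of the action taken.
    """
    event = payload.get("type")
    if event == "started":
        return "job started"
    if event == "page":
        pages = payload.get("data", [])
        return f"received {len(pages)} page(s)"
    if event == "completed":
        return "job completed"
    if event == "failed":
        return f"job failed: {payload.get('error', 'unknown')}"
    return f"ignored event: {event!r}"
```

Always return a 2xx response quickly from the webhook endpoint and do heavy processing (indexing, storage) out of band, so retries are not triggered by slow handlers.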
```typescript
const job = await app.startBatchScrape(urls, {
  formats: ['markdown'],
  webhook: 'https://your-domain.com/webhook'
});

// Poll for status
const status = await app.checkBatchScrapeStatus(job.id);
```
8. Change Tracking - NEW
Monitor content changes over time by comparing scrapes.

```python
# Enable change tracking
doc = app.scrape(
    url="https://example.com/docs",
    formats=["markdown", "changeTracking"]
)

print(doc.change_tracking.status)  # new, same, changed, removed
print(doc.change_tracking.previous_scrape_at)
print(doc.change_tracking.visibility)  # visible, hidden
```
Git-diff mode (default):

```python
doc = app.scrape(
    url="https://example.com/docs",
    formats=["markdown", "changeTracking"],
    change_tracking_options={
        "mode": "diff"
    }
)
print(doc.change_tracking.diff)  # Line-by-line changes
```
JSON mode (structured comparison):

```python
doc = app.scrape(
    url="https://example.com/pricing",
    formats=["markdown", "changeTracking"],
    change_tracking_options={
        "mode": "json",
        "schema": {"type": "object", "properties": {"price": {"type": "number"}}}
    }
)
```

JSON mode costs 5 credits per page.

**Change States**:
- `new` - Page not seen before
- `same` - No changes since last scrape
- `changed` - Content modified
- `removed` - Page no longer accessible

---
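In a monitoring pipeline, each of the four change states typically maps to one action. A sketch of that routing (the action strings are illustrative placeholders for your own indexing/notification hooks):

```python
def route_change(status: str) -> str:
    """Map a change_tracking.status value to a monitoring action."""
    actions = {
        "new": "index page for the first time",
        "same": "skip - no work needed",
        "changed": "re-index and notify subscribers",
        "removed": "archive and drop from index",
    }
    try:
        return actions[status]
    except KeyError:
        raise ValueError(f"unknown change state: {status!r}")
```

Raising on unknown states (rather than silently skipping) surfaces API changes early.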
Store the API key in an environment variable:

```
FIRECRAWL_API_KEY=fc-your-api-key-here
```

**Never hardcode API keys!**

---
Cloudflare Workers Integration
The Firecrawl SDK cannot run in Cloudflare Workers (it requires Node.js). Use the REST API directly:

```typescript
interface Env {
  FIRECRAWL_API_KEY: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { url } = await request.json<{ url: string }>();
    const response = await fetch('https://api.firecrawl.dev/v2/scrape', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${env.FIRECRAWL_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        url,
        formats: ['markdown'],
        onlyMainContent: true
      })
    });
    const result = await response.json();
    return Response.json(result);
  }
};
```
Rate Limits & Pricing

Warning: Stealth Mode Pricing Change (May 2025)
Stealth mode now costs 5 credits per request when actively used. The default "auto" mode only charges stealth credits if the basic attempt fails.

Recommended pattern:

```python
# Use auto mode (default) - only charges 5 credits if stealth is needed
doc = app.scrape(url, formats=["markdown"])

# Or conditionally enable stealth for specific errors
if error_status_code in [401, 403, 500]:
    doc = app.scrape(url, formats=["markdown"], proxy="stealth")
```
Unified Billing (November 2025)
Credits and tokens merged into a single system. The Extract endpoint uses credits (15 tokens = 1 credit).

| Tier | Credits/Month | Notes |
|---|---|---|
| Free | 500 | Good for testing |
| Hobby | 3,000 | $19/month |
| Standard | 100,000 | $99/month |
| Growth | 500,000 | $399/month |

Credit Costs:
- Scrape: 1 credit (basic), 5 credits (stealth)
- Crawl: 1 credit per page
- Search: 2 credits per 10 results
- Extract: 5 credits per page (changed from tokens in v2.6.0)
- Agent: Dynamic (complexity-based)
- Change Tracking JSON mode: +5 credits
Common Issues & Solutions

| Issue | Cause | Solution |
|---|---|---|
| Empty content | JS not loaded | Add `wait_for` or use `actions` |
| Rate limit exceeded | Over quota | Check dashboard, upgrade plan |
| Timeout error | Slow page | Increase `timeout` |
| Bot detection | Anti-scraping | Use `proxy="stealth"`, add custom `headers` |
| Invalid API key | Wrong format | Must start with `fc-` |
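The last row in the table can be enforced before any request is sent, which avoids burning a call on a malformed key. A tiny guard (a hypothetical helper, not part of the SDK):

```python
def validate_api_key(key: str) -> str:
    """Fail fast on malformed Firecrawl API keys.

    Valid keys start with the 'fc-' prefix; anything else raises
    before a request (and a credit) is wasted.
    """
    if not key or not key.startswith("fc-"):
        raise ValueError("Firecrawl API keys must start with 'fc-'")
    return key
```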
Known Issues Prevention
This skill prevents 10 documented issues:
Issue #1: Stealth Mode Pricing Change (May 2025)
Error: Unexpected credit costs when using stealth mode
Source: Stealth Mode Docs | Changelog
Why It Happens: Starting May 8th, 2025, Stealth Mode proxy requests cost 5 credits per request (previously included in standard pricing). This is a significant billing change.
Prevention: Use auto mode (the default), which only charges stealth credits if the basic attempt fails.

```python
# RECOMMENDED: Use auto mode (default)
doc = app.scrape(url, formats=['markdown'])
# Auto retries with stealth (5 credits) only if basic fails

# Or conditionally enable based on error status
try:
    doc = app.scrape(url, formats=['markdown'], proxy='basic')
except Exception as e:
    if getattr(e, "status_code", None) in [401, 403, 500]:
        doc = app.scrape(url, formats=['markdown'], proxy='stealth')
    else:
        raise
```

**Stealth Mode Options**:
- `auto` (default): Charges 5 credits only if stealth succeeds after basic fails
- `basic`: Standard proxies, 1 credit cost
- `stealth`: 5 credits per request when actively used

---
Issue #2: v2.0.0 Breaking Changes - Method Renames
Error: AttributeError: 'FirecrawlApp' object has no attribute 'scrape_url'
Source: v2.0.0 Release | Migration Guide
Why It Happens: v2.0.0 (August 2025) renamed SDK methods across all languages.
Prevention: Use the new method names.

JavaScript/TypeScript: `scrapeUrl()` → `scrape()`, `crawlUrl()` → `crawl()`
Python: `scrape_url()` → `scrape()`, `crawl_url()` → `crawl()`
Issue #3: v2.0.0 Breaking Changes - Format Changes
Error: 'extract' is not a valid format
Source: v2.0.0 Release
Why It Happens: The old `extract` format was renamed to `json` in v2.0.0.
Prevention: Use the new object format for JSON extraction.

```python
# OLD (v1) - no longer valid
doc = app.scrape_url(
    url="https://example.com",
    params={
        "formats": ["extract"],
        "extract": {"prompt": "Extract title"}
    }
)

# NEW (v2)
doc = app.scrape(
    url="https://example.com",
    formats=[{"type": "json", "prompt": "Extract title"}]
)

# NEW (v2) with schema
doc = app.scrape(
    url="https://example.com",
    formats=[{
        "type": "json",
        "prompt": "Extract product info",
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "price": {"type": "number"}
            }
        }
    }]
)
```

**Screenshot format also changed**:

```python
# NEW: Screenshot as object
formats=[{
    "type": "screenshot",
    "fullPage": True,
    "quality": 80,
    "viewport": {"width": 1920, "height": 1080}
}]
```
Issue #4: v2.0.0 Breaking Changes - Crawl Options
Error: 'allowBackwardCrawling' is not a valid parameter
Source: v2.0.0 Release
Why It Happens: Several crawl parameters were renamed or removed in v2.0.0.
Prevention: Use the new parameter names.

Parameter Changes:
- `allowBackwardCrawling` → Use `crawl_entire_domain` instead
- `maxDepth` → Use `max_discovery_depth` instead
- `ignoreSitemap` (bool) → `sitemap` ("only", "skip", "include")

```python
# OLD (v1) - no longer valid
app.crawl_url(
    url="https://docs.example.com",
    params={
        "allowBackwardCrawling": True,
        "maxDepth": 3,
        "ignoreSitemap": False
    }
)

# NEW (v2)
app.crawl(
    url="https://docs.example.com",
    crawl_entire_domain=True,
    max_discovery_depth=3,
    sitemap="include"  # "only", "skip", or "include"
)
```
Issue #5: v2.0.0 Default Behavior Changes
Error: Stale cached content returned unexpectedly
Source: v2.0.0 Release
Why It Happens: v2.0.0 changed several defaults.
Prevention: Be aware of the new defaults.

Default Changes:
- `max_age` now defaults to 2 days (cached by default)
- Several boolean scrape options are now enabled by default

```python
# Force fresh data if needed
doc = app.scrape(url, formats=['markdown'], max_age=0)

# Disable cache entirely
doc = app.scrape(url, formats=['markdown'], store_in_cache=False)
```
Issue #6: Job Status Race Condition
Error: "Job not found" when checking crawl status immediately after creation
Source: GitHub Issue #2662
Why It Happens: Database replication delay between job creation and status endpoint availability.
Prevention: Wait 1-3 seconds before the first status check, or implement retry logic.

```python
import time

# REQUIRED: Wait before first status check
time.sleep(2)  # 1-3 seconds recommended

# Now status check succeeds
status = app.get_crawl_status(job.id)

# Or implement retry logic
def get_status_with_retry(job_id, max_retries=3, delay=1):
    for attempt in range(max_retries):
        try:
            return app.get_crawl_status(job_id)
        except Exception as e:
            if "Job not found" in str(e) and attempt < max_retries - 1:
                time.sleep(delay)
                continue
            raise

status = get_status_with_retry(job.id)
```
Issue #7: DNS Errors Return HTTP 200
Error: DNS resolution failures return `SCRAPE_DNS_RESOLUTION_ERROR` with HTTP 200 status instead of 4xx
Source: GitHub Issue #2402 | Fixed in v2.7.0
Why It Happens: Changed in v2.7.0 for consistent error handling.
Prevention: Check the `success` field and `code` field; don't rely on the HTTP status alone.

```typescript
const result = await app.scrape('https://nonexistent-domain-xyz.com');

// DON'T rely on HTTP status code
// Response: HTTP 200 with { success: false, code: "SCRAPE_DNS_RESOLUTION_ERROR" }

// DO check success field
if (!result.success) {
  if (result.code === 'SCRAPE_DNS_RESOLUTION_ERROR') {
    console.error('DNS resolution failed');
  }
  throw new Error(result.error);
}
```

Note: DNS resolution errors still charge 1 credit despite failure.
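The same guard for Python callers hitting the REST API directly, where the response arrives as a plain dict. A sketch (`ensure_success` is a hypothetical helper; the `success`/`code`/`error` field names match the response body shown above):

```python
def ensure_success(result: dict) -> dict:
    """Raise instead of silently processing a failed scrape response.

    Firecrawl may return HTTP 200 with success=false (e.g. DNS
    failures), so the response body, not the HTTP status code, is
    authoritative.
    """
    if not result.get("success", False):
        code = result.get("code", "UNKNOWN_ERROR")
        raise RuntimeError(f"scrape failed [{code}]: {result.get('error', '')}")
    return result
```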
Issue #8: Bot Detection Still Charges Credits
Error: Cloudflare error page returned as a "successful" scrape, credits charged
Source: GitHub Issue #2413
Why It Happens: The Fire-1 engine charges credits even when bot detection prevents access.
Prevention: Validate that the content isn't an error page before processing; use stealth mode for protected sites.

```python
# First attempt without stealth
doc = app.scrape(url, formats=["markdown"])

# Validate content isn't an error page
if "cloudflare" in doc.markdown.lower() or "access denied" in doc.markdown.lower():
    # Retry with stealth (costs 5 credits if successful)
    doc = app.scrape(url, formats=["markdown"], stealth=True)
```

**Cost Impact**: Basic scrape charges 1 credit even on failure; the stealth retry charges an additional 5 credits.

---
Issue #9: Self-Hosted Anti-Bot Fingerprinting Weakness
Error: "All scraping engines failed!" (SCRAPE_ALL_ENGINES_FAILED) on sites with anti-bot measures
Source: GitHub Issue #2257
Why It Happens: Self-hosted Firecrawl lacks the advanced anti-fingerprinting techniques present in the cloud service.
Prevention: Use the Firecrawl cloud service for sites with strong anti-bot measures, or configure a proxy.

- Self-hosted deployments fail on Cloudflare-protected sites with "All scraping engines failed!"
- Workaround: use the cloud service instead; it has better anti-fingerprinting.

**Note**: This affects self-hosted v2.3.0+ with the default docker-compose setup, which logs the warning: "⚠️ WARNING: No proxy server provided. Your IP address may be blocked."

---
Issue #10: Cache Performance Best Practices (Community-sourced)
Suboptimal: Not leveraging the cache can make requests up to 500% slower
Source: Fast Scraping Docs | Blog Post
Why It Matters: The default `max_age` is 2 days in v2+, but many use cases need different strategies.
Prevention: Use an appropriate cache strategy for your content type.

```python
# Fresh data (real-time pricing, stock prices)
doc = app.scrape(url, formats=["markdown"], max_age=0)

# 10-minute cache (news, blogs)
doc = app.scrape(url, formats=["markdown"], max_age=600000)  # milliseconds

# Use default cache (2 days) for static content
doc = app.scrape(url, formats=["markdown"])  # maxAge defaults to 172800000

# Don't store in cache (one-time scrape)
doc = app.scrape(url, formats=["markdown"], store_in_cache=False)

# Require minimum age before re-scraping (v2.7.0+)
doc = app.scrape(url, formats=["markdown"], min_age=3600000)  # 1 hour minimum
```

**Performance Impact**:
- Cached response: milliseconds
- Fresh scrape: seconds
- Speed difference: **up to 500%**

---
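These strategies are easier to keep consistent when centralized in one lookup instead of scattered literals. A sketch (the category names and the choice to default unknown categories to fresh data are illustrative assumptions, not SDK behavior):

```python
# Illustrative max_age values in milliseconds, per content category
CACHE_STRATEGIES = {
    "realtime": 0,           # always fresh (pricing, stock levels)
    "news": 600_000,         # 10 minutes
    "static": 172_800_000,   # 2 days (the v2 default)
}

def max_age_for(category: str) -> int:
    """Look up a max_age for a content category.

    Unknown categories fall back to 0 (always fresh), trading speed
    for correctness when a page type hasn't been classified yet.
    """
    return CACHE_STRATEGIES.get(category, 0)
```

Usage: `app.scrape(url, formats=["markdown"], max_age=max_age_for("news"))`.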
| Package | Version | Last Checked |
|---|---|---|
| firecrawl-py | 4.13.0+ | 2026-01-20 |
| @mendable/firecrawl-js | 4.11.1+ | 2026-01-20 |
| API Version | v2 | Current |
Official Documentation
Token Savings: ~65% vs manual integration
Error Prevention: 10 documented issues (v2 migration, stealth pricing, job status race, DNS errors, bot detection billing, self-hosted limitations, cache optimization)
Production Ready: Yes
Last verified: 2026-01-21 | Skill version: 2.0.0 | Changes: Added Known Issues Prevention section with 10 documented errors from TIER 1-2 research findings; added v2 migration guidance; documented stealth mode pricing change and unified billing model