Crawl4AI

Overview

Crawl4AI provides comprehensive web crawling and data extraction capabilities. This skill supports both the CLI (recommended for quick tasks) and the Python SDK (for programmatic control).
Choose your interface:
  • CLI (crwl) - Quick, scriptable commands: CLI Guide
  • Python SDK - Full programmatic control: SDK Guide


Quick Start

Installation

```bash
pip install crawl4ai
crawl4ai-setup

# Verify installation
crawl4ai-doctor
```

CLI (Recommended)

```bash
# Basic crawling - returns markdown
crwl https://example.com -o markdown

# JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache

# See more examples
crwl --example
```

Python SDK

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])

asyncio.run(main())
```
For SDK configuration details: SDK Guide - Configuration (lines 61-150)

Core Concepts

Configuration Layers

Both CLI and SDK use the same underlying configuration:

| Concept | CLI | SDK |
|---|---|---|
| Browser settings | `-B browser.yml` or `-b "param=value"` | `BrowserConfig(...)` |
| Crawl settings | `-C crawler.yml` or `-c "param=value"` | `CrawlerRunConfig(...)` |
| Extraction | `-e extract.yml -s schema.json` | `extraction_strategy=...` |
| Content filter | `-f filter.yml` | `markdown_generator=...` |

Key Parameters

Browser Configuration:
  • headless: Run with/without GUI
  • viewport_width/height: Browser dimensions
  • user_agent: Custom user agent
  • proxy_config: Proxy settings

Crawler Configuration:
  • page_timeout: Max page load time (ms)
  • wait_for: CSS selector or JS condition to wait for
  • cache_mode: bypass, enabled, disabled
  • js_code: JavaScript to execute
  • css_selector: Focus on a specific element

For complete parameters: CLI Config | SDK Config
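In the SDK, these parameters map onto the two config objects. A minimal sketch, assuming crawl4ai is installed; all values here are illustrative, not recommended defaults:

```python
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode

# Browser-level settings (illustrative values)
browser_cfg = BrowserConfig(
    headless=True,
    viewport_width=1280,
    viewport_height=720,
    user_agent="Mozilla/5.0 (compatible; MyCrawler/1.0)",
)

# Per-crawl settings (illustrative values)
run_cfg = CrawlerRunConfig(
    page_timeout=60000,            # max page load time, in ms
    wait_for="css:.main-content",  # wait for this selector before capturing
    cache_mode=CacheMode.BYPASS,
    css_selector="main",           # restrict output to this element
)

# Usage: AsyncWebCrawler(config=browser_cfg), then crawler.arun(url, config=run_cfg)
```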

Output Content

Every crawl returns:
  • markdown - Clean, formatted markdown
  • html - Raw HTML
  • links - Internal and external links discovered
  • media - Images, videos, audio found
  • extracted_content - Structured data (if extraction configured)
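The shape of these fields can be illustrated with a stand-in object. This stub is hypothetical, for illustration only; the real object is the result returned by arun, and its exact dictionary keys may differ:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FakeCrawlResult:
    """Hypothetical stand-in mirroring the fields listed above."""
    markdown: str = ""
    html: str = ""
    links: dict = field(default_factory=lambda: {"internal": [], "external": []})
    media: dict = field(default_factory=lambda: {"images": [], "videos": [], "audios": []})
    extracted_content: Optional[str] = None  # JSON string when extraction is configured

# Typical consumption pattern:
result = FakeCrawlResult(
    markdown="# Example\nSome text",
    links={"internal": ["/about"], "external": ["https://other.site"]},
)
print(result.markdown)
for link in result.links["internal"] + result.links["external"]:
    print(link)
```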

Markdown Generation (Primary Use Case)

Crawl4AI excels at generating clean, well-formatted markdown:

CLI

```bash
# Basic markdown
crwl https://docs.example.com -o markdown

# Filtered markdown (removes noise)
crwl https://docs.example.com -o markdown-fit

# With content filter
crwl https://docs.example.com -f filter_bm25.yml -o markdown-fit
```

**Filter configuration:**

```yaml
# filter_bm25.yml (relevance-based)
type: "bm25"
query: "machine learning tutorials"
threshold: 1.0
```

Python SDK

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

bm25_filter = BM25ContentFilter(user_query="machine learning", bm25_threshold=1.0)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)

config = CrawlerRunConfig(markdown_generator=md_generator)
result = await crawler.arun(url, config=config)

print(result.markdown.fit_markdown)  # Filtered
print(result.markdown.raw_markdown)  # Original
```
For content filters: Content Processing (lines 2481-3101)

Data Extraction

1. Schema-Based CSS Extraction (Most Efficient)

No LLM required - fast, deterministic, cost-free.
CLI:
```bash
# Generate schema once (uses LLM)
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

# Use schema for extraction (no LLM)
crwl https://shop.com -e extract_css.yml -s product_schema.json -o json
```

**Schema format:**

```json
{
  "name": "products",
  "baseSelector": ".product-card",
  "fields": [
    {"name": "title", "selector": "h2", "type": "text"},
    {"name": "price", "selector": ".price", "type": "text"},
    {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
  ]
}
```
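In the SDK, the same schema drives JsonCssExtractionStrategy, again with no LLM involved. A sketch of the wiring, assuming crawl4ai is installed; the schema dict mirrors the JSON above:

```python
import json
from crawl4ai import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "products",
    "baseSelector": ".product-card",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))

# Inside an AsyncWebCrawler context:
#   result = await crawler.arun("https://shop.com", config=config)
#   products = json.loads(result.extracted_content)  # list of dicts keyed by field names
```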

2. LLM-Based Extraction

For complex or irregular content:
CLI:
```yaml
# extract_llm.yml
type: "llm"
provider: "openai/gpt-4o-mini"
instruction: "Extract product names and prices"
api_token: "your-token"
```

```bash
crwl https://shop.com -e extract_llm.yml -o json
```
For extraction details: Extraction Strategies (lines 4522-5429)

Advanced Patterns

Dynamic Content (JavaScript-Heavy Sites)

CLI:
```bash
crwl https://example.com -c "wait_for=css:.ajax-content,scan_full_page=true,page_timeout=60000"
```
Crawler config:
```yaml
# crawler.yml
wait_for: "css:.ajax-content"
scan_full_page: true
page_timeout: 60000
delay_before_return_html: 2.0
```

Multi-URL Processing

CLI (sequential):
```bash
for url in url1 url2 url3; do crwl "$url" -o markdown; done
```
Python SDK (concurrent):
```python
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = await crawler.arun_many(urls, config=config)
```
For batch processing: arun_many() Reference (lines 1057-1224)
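arun_many handles concurrency internally. If you instead need explicit throttling around individual arun calls, a plain asyncio pattern works; in this sketch, fetch is a hypothetical stand-in for crawler.arun:

```python
import asyncio

async def crawl_limited(urls, fetch, max_concurrent=3):
    """Run fetch(url) for every URL, with at most max_concurrent in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(url):
        async with sem:  # waits while max_concurrent fetches are already running
            return await fetch(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(one(u) for u in urls))

# Example with a dummy fetch standing in for crawler.arun:
async def fake_fetch(url):
    await asyncio.sleep(0)
    return f"crawled {url}"

results = asyncio.run(crawl_limited(["a", "b", "c"], fake_fetch, max_concurrent=2))
print(results)  # ['crawled a', 'crawled b', 'crawled c']
```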

Session & Authentication

CLI:
```yaml
# login_crawler.yml
session_id: "user_session"
js_code: |
  document.querySelector('#username').value = 'user';
  document.querySelector('#password').value = 'pass';
  document.querySelector('#submit').click();
wait_for: "css:.dashboard"
```

```bash
# Login
crwl https://site.com/login -C login_crawler.yml

# Access protected content (session reused)
crwl https://site.com/protected -c "session_id=user_session"
```

For session management: [Advanced Features](references/complete-sdk-reference.md#advanced-features) (lines 5429-5940)

Anti-Detection & Proxies

CLI:
```yaml
# browser.yml
headless: true
proxy_config:
  server: "http://proxy:8080"
  username: "user"
  password: "pass"
user_agent_mode: "random"
```

```bash
crwl https://example.com -B browser.yml
```

Common Use Cases

Documentation to Markdown

```bash
crwl https://docs.example.com -o markdown > docs.md
```

E-commerce Product Monitoring

```bash
# Generate schema once
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

# Monitor (no LLM costs)
crwl https://shop.com -e extract_css.yml -s schema.json -o json
```

News Aggregation

```bash
# Multiple sources with filtering
for url in news1.com news2.com news3.com; do
  crwl "https://$url" -f filter_bm25.yml -o markdown-fit
done
```

Interactive Q&A

```bash
# First view content
crwl https://example.com -o markdown

# Then ask questions
crwl https://example.com -q "What are the main conclusions?"
crwl https://example.com -q "Summarize the key points"
```

---

Resources

Provided Scripts

  • scripts/extraction_pipeline.py - Schema generation and extraction
  • scripts/basic_crawler.py - Simple markdown extraction
  • scripts/batch_crawler.py - Multi-URL processing

Reference Documentation

| Document | Purpose |
|---|---|
| CLI Guide | Command-line interface reference |
| SDK Guide | Python SDK quick reference |
| Complete SDK Reference | Full API documentation (5900+ lines) |

Best Practices

  1. Start with CLI for quick tasks, SDK for automation
  2. Use schema-based extraction - 10-100x more efficient than LLM
  3. Enable caching during development - use --bypass-cache only when needed
  4. Set appropriate timeouts - 30s normal, 60s+ for JS-heavy sites
  5. Use content filters for cleaner, focused markdown
  6. Respect rate limits - add delays between requests
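Practice #6 can be as simple as a fixed pause between sequential requests. A minimal sketch; crawl_fn is a hypothetical stand-in for a crawl call (e.g. invoking crwl via subprocess), and the injectable sleep parameter exists only to make the pacing easy to test:

```python
import time

def crawl_all(urls, crawl_fn, delay=1.0, sleep=time.sleep):
    """Crawl URLs sequentially, pausing `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # rate-limit: pause before every request after the first
        results.append(crawl_fn(url))
    return results

# Example with a dummy crawl function:
print(crawl_all(["a", "b"], lambda u: u.upper(), delay=0.1))  # ['A', 'B']
```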

Troubleshooting

JavaScript Not Loading

```bash
crwl https://example.com -c "wait_for=css:.dynamic-content,page_timeout=60000"
```

Bot Detection Issues

```bash
crwl https://example.com -B browser.yml
```

```yaml
# browser.yml
headless: false
viewport_width: 1920
viewport_height: 1080
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
```

Content Not Extracted

```bash
# Debug: see full output
crwl https://example.com -o all -v

# Try a different wait strategy
crwl https://example.com -c "wait_for=js:document.querySelector('.content')!==null"
```

Session Issues

```bash
# Verify session
crwl https://site.com -c "session_id=test" -o all | grep -i session
```

---

For comprehensive API documentation, see [Complete SDK Reference](references/complete-sdk-reference.md).