# Crawl4AI
## Overview
Crawl4AI provides comprehensive web crawling and data extraction capabilities. This skill supports both CLI (recommended for quick tasks) and Python SDK (for programmatic control).
Choose your interface:
- CLI (`crwl`) - Quick, scriptable commands: CLI Guide
- Python SDK - Full programmatic control: SDK Guide
## Quick Start
### Installation
```bash
pip install crawl4ai
crawl4ai-setup

# Verify installation
crawl4ai-doctor
```
### CLI (Recommended)
```bash
# Basic crawling - returns markdown
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache

# See more examples
crwl --example
```
### Python SDK
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])

asyncio.run(main())
```

For SDK configuration details: SDK Guide - Configuration (lines 61-150)
## Core Concepts
### Configuration Layers
Both CLI and SDK use the same underlying configuration:

| Concept | CLI | SDK |
|---|---|---|
| Browser settings | `-B browser.yml` | `BrowserConfig` |
| Crawl settings | `-C crawler.yml` | `CrawlerRunConfig` |
| Extraction | `-e extract.yml` | `extraction_strategy` |
| Content filter | `-f filter.yml` | `content_filter` |
### Key Parameters
**Browser Configuration:**
- `headless`: Run with/without GUI
- `viewport_width`/`viewport_height`: Browser dimensions
- `user_agent`: Custom user agent
- `proxy_config`: Proxy settings

**Crawler Configuration:**
- `page_timeout`: Max page load time (ms)
- `wait_for`: CSS selector or JS condition to wait for
- `cache_mode`: bypass, enabled, or disabled
- `js_code`: JavaScript to execute
- `css_selector`: Focus on a specific element

For complete parameters: CLI Config | SDK Config
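The CLI passes these crawler parameters inline via `-c "key=value,key=value"` (used in several examples below). As a rough illustration of how such an override string maps onto the typed parameters above; the coercion rules here are an assumption for the sketch, not the actual `crwl` parser:

```python
def parse_overrides(spec: str) -> dict:
    """Parse a 'key=value,key=value' override string into typed values.

    Illustrative only: splits on commas, coerces booleans and integers,
    and leaves everything else (e.g. 'css:.ajax-content') as a string.
    Values containing commas are not handled by this toy parser.
    """
    config = {}
    for pair in spec.split(","):
        key, _, value = pair.partition("=")
        if value.lower() in ("true", "false"):
            config[key] = value.lower() == "true"
        elif value.isdigit():
            config[key] = int(value)
        else:
            config[key] = value
    return config

overrides = parse_overrides("wait_for=css:.ajax-content,scan_full_page=true,page_timeout=60000")
print(overrides)
# {'wait_for': 'css:.ajax-content', 'scan_full_page': True, 'page_timeout': 60000}
```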
### Output Content
Every crawl returns:
- markdown - Clean, formatted markdown
- html - Raw HTML
- links - Internal and external links discovered
- media - Images, videos, audio found
- extracted_content - Structured data (if extraction configured)
## Markdown Generation (Primary Use Case)
Crawl4AI excels at generating clean, well-formatted markdown:
### CLI
```bash
# Basic markdown
crwl https://docs.example.com -o markdown

# Filtered markdown (removes noise)
crwl https://docs.example.com -o markdown-fit

# With content filter
crwl https://docs.example.com -f filter_bm25.yml -o markdown-fit
```

**Filter configuration:**

```yaml
# filter_bm25.yml (relevance-based)
type: "bm25"
query: "machine learning tutorials"
threshold: 1.0
```
### Python SDK
```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

bm25_filter = BM25ContentFilter(user_query="machine learning", bm25_threshold=1.0)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)

result = await crawler.arun(url, config=config)
print(result.markdown.fit_markdown)  # Filtered
print(result.markdown.raw_markdown)  # Original
```

For content filters: Content Processing (lines 2481-3101)
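For intuition about what the BM25 filter does: each content chunk is scored against the query, and low-scoring chunks are dropped from `fit_markdown`. A self-contained textbook-BM25 sketch; the chunking and threshold handling inside the real `BM25ContentFilter` differ, this only demonstrates the scoring idea:

```python
import math

def bm25_scores(query: str, chunks: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each chunk against the query with textbook BM25.

    Toy illustration: chunks scoring below a threshold would be
    dropped from the filtered markdown output.
    """
    docs = [c.lower().split() for c in chunks]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    scores = []
    for d in docs:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for doc in docs if term in doc)        # document frequency
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))   # inverse document frequency
            tf = d.count(term)                                # term frequency in this chunk
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

chunks = [
    "machine learning tutorials for beginners",
    "site navigation footer copyright links",
]
relevant, noise = bm25_scores("machine learning", chunks)
print(relevant > noise)  # True
```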
## Data Extraction
### 1. Schema-Based CSS Extraction (Most Efficient)
No LLM required - fast, deterministic, cost-free.

CLI:

```bash
# Generate schema once (uses LLM)
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

# Use schema for extraction (no LLM)
crwl https://shop.com -e extract_css.yml -s product_schema.json -o json
```
**Schema format:**

```json
{
  "name": "products",
  "baseSelector": ".product-card",
  "fields": [
    {"name": "title", "selector": "h2", "type": "text"},
    {"name": "price", "selector": ".price", "type": "text"},
    {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
  ]
}
```
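To make the schema's semantics concrete, here is a toy interpreter for it. The `Node` tree and the minimal selector matcher (bare tag names and single `.class` selectors only) are inventions for illustration; the real schema-based extractor runs full CSS selectors against the fetched page:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A drastically simplified DOM element."""
    tag: str
    classes: tuple = ()
    attrs: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)

def matches(node: Node, selector: str) -> bool:
    # Toy selector support: ".class" or a bare tag name only.
    if selector.startswith("."):
        return selector[1:] in node.classes
    return node.tag == selector

def select(root: Node, selector: str) -> list:
    found = [n for n in root.children if matches(n, selector)]
    for child in root.children:
        found += select(child, selector)
    return found

def extract(root: Node, schema: dict) -> list[dict]:
    """Apply a Crawl4AI-style schema: one item per baseSelector match."""
    items = []
    for base in select(root, schema["baseSelector"]):
        item = {}
        for f in schema["fields"]:
            hits = select(base, f["selector"])
            if not hits:
                continue
            if f["type"] == "text":
                item[f["name"]] = hits[0].text
            elif f["type"] == "attribute":
                item[f["name"]] = hits[0].attrs.get(f["attribute"])
        items.append(item)
    return items

schema = {
    "name": "products",
    "baseSelector": ".product-card",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

page = Node("body", children=[
    Node("div", classes=("product-card",), children=[
        Node("h2", text="Widget"),
        Node("span", classes=("price",), text="$9.99"),
        Node("a", attrs={"href": "/widget"}),
    ]),
])

print(extract(page, schema))
# [{'title': 'Widget', 'price': '$9.99', 'link': '/widget'}]
```

Because the schema is plain data, generating it once (with an LLM) and replaying it deterministically is what makes this strategy cheap.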
### 2. LLM-Based Extraction
For complex or irregular content:

CLI:

```yaml
# extract_llm.yml
type: "llm"
provider: "openai/gpt-4o-mini"
instruction: "Extract product names and prices"
api_token: "your-token"
```

```bash
crwl https://shop.com -e extract_llm.yml -o json
```

For extraction details: Extraction Strategies (lines 4522-5429)
## Advanced Patterns
### Dynamic Content (JavaScript-Heavy Sites)
CLI:

```bash
crwl https://example.com -c "wait_for=css:.ajax-content,scan_full_page=true,page_timeout=60000"
```

Crawler config:

```yaml
# crawler.yml
wait_for: "css:.ajax-content"
scan_full_page: true
page_timeout: 60000
delay_before_return_html: 2.0
```
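Conceptually, `wait_for` plus `page_timeout` behave like a poll-until-true loop. A stand-alone sketch of that contract; the `check` callable here simulates a browser-side condition, whereas the real crawler evaluates the CSS/JS condition inside the page:

```python
import time

def wait_for(check, timeout_ms: int, poll_ms: int = 100) -> bool:
    """Poll check() until it returns True or timeout_ms elapses."""
    deadline = time.monotonic() + timeout_ms / 1000
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_ms / 1000)

# Simulated page state: the ".ajax-content" node "appears" on the third poll
state = {"polls": 0}
def ajax_content_present() -> bool:
    state["polls"] += 1
    return state["polls"] >= 3

ok = wait_for(ajax_content_present, timeout_ms=2000, poll_ms=10)
print(ok)  # True
```

If the condition never becomes true, the crawl fails with a timeout rather than hanging, which is why generous `page_timeout` values matter on slow, JS-heavy sites.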
### Multi-URL Processing
CLI (sequential):

```bash
for url in url1 url2 url3; do crwl "$url" -o markdown; done
```

Python SDK (concurrent):

```python
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = await crawler.arun_many(urls, config=config)
```

For batch processing: arun_many() Reference (lines 1057-1224)
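`arun_many()` processes the URLs concurrently. Its effect can be sketched with `asyncio.gather` plus a semaphore cap, using a stubbed `fetch` in place of a real crawl; the concurrency limit and the stub are assumptions for the sketch, not `arun_many()` internals:

```python
import asyncio

async def fetch(url: str) -> str:
    # Stub standing in for a real crawl of one URL.
    await asyncio.sleep(0.01)
    return f"markdown for {url}"

async def crawl_many(urls: list[str], max_concurrent: int = 5) -> list[str]:
    """Crawl URLs concurrently, capping in-flight requests with a semaphore."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        async with sem:
            return await fetch(url)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(bounded(u) for u in urls))

urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = asyncio.run(crawl_many(urls))
print(results[0])  # markdown for https://site1.com
```

The semaphore is the piece worth copying: unbounded `gather` over hundreds of URLs will happily open hundreds of connections at once.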
### Session & Authentication
CLI:

```yaml
# login_crawler.yml
session_id: "user_session"
js_code: |
  document.querySelector('#username').value = 'user';
  document.querySelector('#password').value = 'pass';
  document.querySelector('#submit').click();
wait_for: "css:.dashboard"
```
```bash
# Login
crwl https://site.com/login -C login_crawler.yml

# Access protected content (session reused)
crwl https://site.com/protected -c "session_id=user_session"
```

For session management: [Advanced Features](references/complete-sdk-reference.md#advanced-features) (lines 5429-5940)
### Anti-Detection & Proxies
CLI:

```yaml
# browser.yml
headless: true
proxy_config:
  server: "http://proxy:8080"
  username: "user"
  password: "pass"
user_agent_mode: "random"
```

```bash
crwl https://example.com -B browser.yml
```
## Common Use Cases
### Documentation to Markdown
```bash
crwl https://docs.example.com -o markdown > docs.md
```

### E-commerce Product Monitoring
```bash
# Generate schema once
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

# Monitor (no LLM costs)
crwl https://shop.com -e extract_css.yml -s schema.json -o json
```
### News Aggregation
```bash
# Multiple sources with filtering
for url in news1.com news2.com news3.com; do
  crwl "https://$url" -f filter_bm25.yml -o markdown-fit
done
```
### Interactive Q&A
```bash
# First view content
crwl https://example.com -o markdown

# Then ask questions
crwl https://example.com -q "What are the main conclusions?"
crwl https://example.com -q "Summarize the key points"
```

---
## Resources
### Provided Scripts
- scripts/extraction_pipeline.py - Schema generation and extraction
- scripts/basic_crawler.py - Simple markdown extraction
- scripts/batch_crawler.py - Multi-URL processing
### Reference Documentation
| Document | Purpose |
|---|---|
| CLI Guide | Command-line interface reference |
| SDK Guide | Python SDK quick reference |
| Complete SDK Reference | Full API documentation (5900+ lines) |
## Best Practices
- Start with CLI for quick tasks, SDK for automation
- Use schema-based extraction - 10-100x more efficient than LLM
- Enable caching during development - use `--bypass-cache` only when needed
- Set appropriate timeouts - 30s normal, 60s+ for JS-heavy sites
- Use content filters for cleaner, focused markdown
- Respect rate limits - add delays between requests
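A minimal way to honor the rate-limit advice when looping over URLs: enforce a fixed gap between successive requests. The 50 ms interval and the request placeholder are illustrative; real crawls usually warrant intervals of seconds:

```python
import time

class RateLimiter:
    """Block so that successive acquire() calls are at least `interval` seconds apart."""

    def __init__(self, interval: float):
        self.interval = interval
        self._next_ok = 0.0  # monotonic timestamp when the next call may proceed

    def acquire(self) -> None:
        now = time.monotonic()
        if now < self._next_ok:
            time.sleep(self._next_ok - now)
        self._next_ok = time.monotonic() + self.interval

limiter = RateLimiter(interval=0.05)  # e.g. at least 50 ms between crawls
start = time.monotonic()
for url in ["https://a.com", "https://b.com", "https://c.com"]:
    limiter.acquire()
    # A real request would go here, e.g. invoking `crwl` via subprocess.
elapsed = time.monotonic() - start
print(elapsed >= 0.10)  # True: at least two full intervals elapsed
```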
## Troubleshooting
### JavaScript Not Loading
```bash
crwl https://example.com -c "wait_for=css:.dynamic-content,page_timeout=60000"
```

### Bot Detection Issues
```bash
crwl https://example.com -B browser.yml
```

```yaml
# browser.yml
headless: false
viewport_width: 1920
viewport_height: 1080
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
```
### Content Not Extracted
```bash
# Debug: see full output
crwl https://example.com -o all -v

# Try a different wait strategy
crwl https://example.com -c "wait_for=js:document.querySelector('.content')!==null"
```
### Session Issues
```bash
# Verify session
crwl https://site.com -c "session_id=test" -o all | grep -i session
```

---

For comprehensive API documentation, see [Complete SDK Reference](references/complete-sdk-reference.md).