# Crawl4AI
## Overview
Crawl4AI provides comprehensive web crawling and data extraction capabilities. This skill supports both CLI (recommended for quick tasks) and Python SDK (for programmatic control).
Choose your interface:
- CLI (`crwl`) - Quick, scriptable commands: CLI Guide
- Python SDK - Full programmatic control: SDK Guide
## Quick Start
### Installation
```bash
pip install crawl4ai
crawl4ai-setup

# Verify installation
crawl4ai-doctor
```
### CLI (Recommended)
```bash
# Basic crawling - returns markdown
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache

# See more examples
crwl --example
```
### Python SDK
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])

asyncio.run(main())
```

For SDK configuration details: SDK Guide - Configuration (lines 61-150)
## Core Concepts
### Configuration Layers
Both CLI and SDK use the same underlying configuration:

| Concept | CLI | SDK |
|---|---|---|
| Browser settings | `-B browser.yml` | `BrowserConfig` |
| Crawl settings | `-C crawler.yml` | `CrawlerRunConfig` |
| Extraction | `-e extract.yml` | `extraction_strategy` |
| Content filter | `-f filter.yml` | `content_filter` |
### Key Parameters
**Browser Configuration:**
- `headless`: Run with/without GUI
- `viewport_width`/`viewport_height`: Browser dimensions
- `user_agent`: Custom user agent
- `proxy_config`: Proxy settings

**Crawler Configuration:**
- `page_timeout`: Max page load time (ms)
- `wait_for`: CSS selector or JS condition to wait for
- `cache_mode`: bypass, enabled, or disabled
- `js_code`: JavaScript to execute
- `css_selector`: Focus on a specific element

For complete parameters: CLI Config | SDK Config
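The CLI passes these crawler parameters inline via `-c "key=value,key=value"` (used in several examples below). As a rough illustration of how such an override string maps onto the typed parameters above; the coercion rules here are an assumption for the sketch, not the actual `crwl` parser:

```python
def parse_overrides(spec: str) -> dict:
    """Parse a 'key=value,key=value' override string into typed values.

    Illustrative only: splits on commas, coerces booleans and integers,
    and leaves everything else (e.g. 'css:.ajax-content') as a string.
    Values containing commas are not handled by this toy parser.
    """
    config = {}
    for pair in spec.split(","):
        key, _, value = pair.partition("=")
        if value.lower() in ("true", "false"):
            config[key] = value.lower() == "true"
        elif value.isdigit():
            config[key] = int(value)
        else:
            config[key] = value
    return config

overrides = parse_overrides("wait_for=css:.ajax-content,scan_full_page=true,page_timeout=60000")
print(overrides)
# {'wait_for': 'css:.ajax-content', 'scan_full_page': True, 'page_timeout': 60000}
```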
### Output Content
Every crawl returns:
- markdown - Clean, formatted markdown
- html - Raw HTML
- links - Internal and external links discovered
- media - Images, videos, audio found
- extracted_content - Structured data (if extraction configured)
## Markdown Generation (Primary Use Case)
Crawl4AI excels at generating clean, well-formatted markdown:
### CLI
```bash
# Basic markdown
crwl https://docs.example.com -o markdown

# Filtered markdown (removes noise)
crwl https://docs.example.com -o markdown-fit

# With content filter
crwl https://docs.example.com -f filter_bm25.yml -o markdown-fit
```

**Filter configuration:**

```yaml
# filter_bm25.yml (relevance-based)
type: "bm25"
query: "machine learning tutorials"
threshold: 1.0
```
### Python SDK
```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

bm25_filter = BM25ContentFilter(user_query="machine learning", bm25_threshold=1.0)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)

result = await crawler.arun(url, config=config)
print(result.markdown.fit_markdown)  # Filtered
print(result.markdown.raw_markdown)  # Original
```

For content filters: Content Processing (lines 2481-3101)
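For intuition about what the BM25 filter does: each content chunk is scored against the query, and low-scoring chunks are dropped from `fit_markdown`. A self-contained textbook-BM25 sketch; the chunking and threshold handling inside the real `BM25ContentFilter` differ, this only demonstrates the scoring idea:

```python
import math

def bm25_scores(query: str, chunks: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each chunk against the query with textbook BM25.

    Toy illustration: chunks scoring below a threshold would be
    dropped from the filtered markdown output.
    """
    docs = [c.lower().split() for c in chunks]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    scores = []
    for d in docs:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for doc in docs if term in doc)        # document frequency
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))   # inverse document frequency
            tf = d.count(term)                                # term frequency in this chunk
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

chunks = [
    "machine learning tutorials for beginners",
    "site navigation footer copyright links",
]
relevant, noise = bm25_scores("machine learning", chunks)
print(relevant > noise)  # True
```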
## Data Extraction
### 1. Schema-Based CSS Extraction (Most Efficient)
No LLM required - fast, deterministic, cost-free.

CLI:

```bash
# Generate schema once (uses LLM)
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

# Use schema for extraction (no LLM)
crwl https://shop.com -e extract_css.yml -s product_schema.json -o json
```
**Schema format:**

```json
{
  "name": "products",
  "baseSelector": ".product-card",
  "fields": [
    {"name": "title", "selector": "h2", "type": "text"},
    {"name": "price", "selector": ".price", "type": "text"},
    {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
  ]
}
```
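To make the schema's semantics concrete, here is a toy interpreter for it. The `Node` tree and the minimal selector matcher (bare tag names and single `.class` selectors only) are inventions for illustration; the real schema-based extractor runs full CSS selectors against the fetched page:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A drastically simplified DOM element."""
    tag: str
    classes: tuple = ()
    attrs: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)

def matches(node: Node, selector: str) -> bool:
    # Toy selector support: ".class" or a bare tag name only.
    if selector.startswith("."):
        return selector[1:] in node.classes
    return node.tag == selector

def select(root: Node, selector: str) -> list:
    found = [n for n in root.children if matches(n, selector)]
    for child in root.children:
        found += select(child, selector)
    return found

def extract(root: Node, schema: dict) -> list[dict]:
    """Apply a Crawl4AI-style schema: one item per baseSelector match."""
    items = []
    for base in select(root, schema["baseSelector"]):
        item = {}
        for f in schema["fields"]:
            hits = select(base, f["selector"])
            if not hits:
                continue
            if f["type"] == "text":
                item[f["name"]] = hits[0].text
            elif f["type"] == "attribute":
                item[f["name"]] = hits[0].attrs.get(f["attribute"])
        items.append(item)
    return items

schema = {
    "name": "products",
    "baseSelector": ".product-card",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

page = Node("body", children=[
    Node("div", classes=("product-card",), children=[
        Node("h2", text="Widget"),
        Node("span", classes=("price",), text="$9.99"),
        Node("a", attrs={"href": "/widget"}),
    ]),
])

print(extract(page, schema))
# [{'title': 'Widget', 'price': '$9.99', 'link': '/widget'}]
```

Because the schema is plain data, generating it once (with an LLM) and replaying it deterministically is what makes this strategy cheap.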
### 2. LLM-Based Extraction
For complex or irregular content:

CLI:

```yaml
# extract_llm.yml
type: "llm"
provider: "openai/gpt-4o-mini"
instruction: "Extract product names and prices"
api_token: "your-token"
```

```bash
crwl https://shop.com -e extract_llm.yml -o json
```

For extraction details: Extraction Strategies (lines 4522-5429)
## Advanced Patterns
### Dynamic Content (JavaScript-Heavy Sites)
CLI:

```bash
crwl https://example.com -c "wait_for=css:.ajax-content,scan_full_page=true,page_timeout=60000"
```

Crawler config:

```yaml
# crawler.yml
wait_for: "css:.ajax-content"
scan_full_page: true
page_timeout: 60000
delay_before_return_html: 2.0
```
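Conceptually, `wait_for` plus `page_timeout` behave like a poll-until-true loop. A stand-alone sketch of that contract; the `check` callable here simulates a browser-side condition, whereas the real crawler evaluates the CSS/JS condition inside the page:

```python
import time

def wait_for(check, timeout_ms: int, poll_ms: int = 100) -> bool:
    """Poll check() until it returns True or timeout_ms elapses."""
    deadline = time.monotonic() + timeout_ms / 1000
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_ms / 1000)

# Simulated page state: the ".ajax-content" node "appears" on the third poll
state = {"polls": 0}
def ajax_content_present() -> bool:
    state["polls"] += 1
    return state["polls"] >= 3

ok = wait_for(ajax_content_present, timeout_ms=2000, poll_ms=10)
print(ok)  # True
```

If the condition never becomes true, the crawl fails with a timeout rather than hanging, which is why generous `page_timeout` values matter on slow, JS-heavy sites.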
### Multi-URL Processing
CLI (sequential):

```bash
for url in url1 url2 url3; do crwl "$url" -o markdown; done
```

Python SDK (concurrent):

```python
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = await crawler.arun_many(urls, config=config)
```

For batch processing: arun_many() Reference (lines 1057-1224)
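`arun_many()` processes the URLs concurrently. Its effect can be sketched with `asyncio.gather` plus a semaphore cap, using a stubbed `fetch` in place of a real crawl; the concurrency limit and the stub are assumptions for the sketch, not `arun_many()` internals:

```python
import asyncio

async def fetch(url: str) -> str:
    # Stub standing in for a real crawl of one URL.
    await asyncio.sleep(0.01)
    return f"markdown for {url}"

async def crawl_many(urls: list[str], max_concurrent: int = 5) -> list[str]:
    """Crawl URLs concurrently, capping in-flight requests with a semaphore."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        async with sem:
            return await fetch(url)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(bounded(u) for u in urls))

urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = asyncio.run(crawl_many(urls))
print(results[0])  # markdown for https://site1.com
```

The semaphore is the piece worth copying: unbounded `gather` over hundreds of URLs will happily open hundreds of connections at once.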
### Session & Authentication
CLI:

```yaml
# login_crawler.yml
session_id: "user_session"
js_code: |
  document.querySelector('#username').value = 'user';
  document.querySelector('#password').value = 'pass';
  document.querySelector('#submit').click();
wait_for: "css:.dashboard"
```
```bash
# Login
crwl https://site.com/login -C login_crawler.yml

# Access protected content (session reused)
crwl https://site.com/protected -c "session_id=user_session"
```

For session management: [Advanced Features](references/complete-sdk-reference.md#advanced-features) (lines 5429-5940)
### Anti-Detection & Proxies
CLI:

```yaml
# browser.yml
headless: true
proxy_config:
  server: "http://proxy:8080"
  username: "user"
  password: "pass"
user_agent_mode: "random"
```

```bash
crwl https://example.com -B browser.yml
```
## Common Use Cases
### Documentation to Markdown
```bash
crwl https://docs.example.com -o markdown > docs.md
```

### E-commerce Product Monitoring
```bash
# Generate schema once
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

# Monitor (no LLM costs)
crwl https://shop.com -e extract_css.yml -s schema.json -o json
```
### News Aggregation
```bash
# Multiple sources with filtering
for url in news1.com news2.com news3.com; do
  crwl "https://$url" -f filter_bm25.yml -o markdown-fit
done
```
### Interactive Q&A
```bash
# First view content
crwl https://example.com -o markdown

# Then ask questions
crwl https://example.com -q "What are the main conclusions?"
crwl https://example.com -q "Summarize the key points"
```

---
## Resources
### Provided Scripts
- scripts/extraction_pipeline.py - Schema generation and extraction
- scripts/basic_crawler.py - Simple markdown extraction
- scripts/batch_crawler.py - Multi-URL processing
### Reference Documentation
| Document | Purpose |
|---|---|
| CLI Guide | Command-line interface reference |
| SDK Guide | Python SDK quick reference |
| Complete SDK Reference | Full API documentation (5900+ lines) |
## Best Practices
- Start with CLI for quick tasks, SDK for automation
- Use schema-based extraction - 10-100x more efficient than LLM
- Enable caching during development - use `--bypass-cache` only when needed
- Set appropriate timeouts - 30s normal, 60s+ for JS-heavy sites
- Use content filters for cleaner, focused markdown
- Respect rate limits - add delays between requests
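A minimal way to honor the rate-limit advice when looping over URLs: enforce a fixed gap between successive requests. The 50 ms interval and the request placeholder are illustrative; real crawls usually warrant intervals of seconds:

```python
import time

class RateLimiter:
    """Block so that successive acquire() calls are at least `interval` seconds apart."""

    def __init__(self, interval: float):
        self.interval = interval
        self._next_ok = 0.0  # monotonic timestamp when the next call may proceed

    def acquire(self) -> None:
        now = time.monotonic()
        if now < self._next_ok:
            time.sleep(self._next_ok - now)
        self._next_ok = time.monotonic() + self.interval

limiter = RateLimiter(interval=0.05)  # e.g. at least 50 ms between crawls
start = time.monotonic()
for url in ["https://a.com", "https://b.com", "https://c.com"]:
    limiter.acquire()
    # A real request would go here, e.g. invoking `crwl` via subprocess.
elapsed = time.monotonic() - start
print(elapsed >= 0.10)  # True: at least two full intervals elapsed
```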
## Troubleshooting
### JavaScript Not Loading
```bash
crwl https://example.com -c "wait_for=css:.dynamic-content,page_timeout=60000"
```

### Bot Detection Issues
```bash
crwl https://example.com -B browser.yml
```

```yaml
# browser.yml
headless: false
viewport_width: 1920
viewport_height: 1080
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
```
### Content Not Extracted
```bash
# Debug: see full output
crwl https://example.com -o all -v

# Try a different wait strategy
crwl https://example.com -c "wait_for=js:document.querySelector('.content')!==null"
```
### Session Issues
```bash
# Verify session
crwl https://site.com -c "session_id=test" -o all | grep -i session
```

---

For comprehensive API documentation, see [Complete SDK Reference](references/complete-sdk-reference.md).