# Crawl4AI OpenRouter
Use Crawl4AI when the user explicitly wants Crawl4AI or when a real browser crawl plus optional LLM extraction is the right fit.
Load only the reference file you need:
- Open references/setup.md before first use on a machine or when the environment is failing.
- Open references/recipes.md when you need a concrete pattern for plain markdown crawl, CSS-based extraction, or LLM-based extraction.
## Default Operating Assumptions
- Default LLM provider: `openrouter/nvidia/nemotron-3-super-120b-a12b:free`
- Default API key env var: `OPENROUTER_API_KEY`
- Default OpenRouter base URL: `https://openrouter.ai/api/v1`
- Prefer `LLMExtractionStrategy` plus `LLMConfig` for schema-based extraction.
- Prefer passing the API token via environment variable, not inline in source.
- Prefer `headless=True` unless the task requires visual debugging.
- Prefer `CacheMode.BYPASS` when the user wants fresh content.
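The defaults above can be centralized in one place so env vars override them consistently. A minimal sketch — the constants come from this document, but the resolver function itself is hypothetical, not part of Crawl4AI or the helper script:

```python
import os

# Documented defaults for this skill; env vars override them when set.
DEFAULT_MODEL = "openrouter/nvidia/nemotron-3-super-120b-a12b:free"
DEFAULT_BASE_URL = "https://openrouter.ai/api/v1"


def resolve_llm_settings(env=os.environ):
    """Pick model, base URL, and API key from env vars, falling back to defaults.

    The API token is read from the environment only — never hard-code it.
    """
    return {
        "provider": env.get("CRAWL4AI_OPENROUTER_MODEL", DEFAULT_MODEL),
        "base_url": env.get("CRAWL4AI_OPENROUTER_BASE_URL", DEFAULT_BASE_URL),
        "api_token": env.get("OPENROUTER_API_KEY"),
    }
```

Passing a plain dict as `env` makes the fallback behavior easy to unit-test without touching the real environment.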
## Workflow
- Confirm whether the task is:
  - plain crawl to markdown/text
  - structured extraction with a schema
  - dynamic page crawl that may need waits, JS, or browser tuning
- Read references/setup.md if you have not already verified Python, package install, and browser setup.
- For repeated or non-trivial extraction, use scripts/crawl4ai_extract.py instead of rewriting the integration from scratch.
- Set these env vars in the current shell when needed:
  - `OPENROUTER_API_KEY`
  - optionally `CRAWL4AI_OPENROUTER_MODEL`
  - optionally `CRAWL4AI_OPENROUTER_BASE_URL`
- If the user provides a target JSON shape, save it as a schema file and pass it to the helper script with an extraction instruction.
- Return the extracted JSON or the crawl markdown, and call out any crawl limitations such as auth walls, robots constraints, or weak page structure.
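The schema-file step above can be sketched in Python. The field names (`plan_name`, `monthly_price`, `limits`) are a hypothetical target shape for a pricing page — substitute whatever JSON shape the user actually provides:

```python
import json
import os
import tempfile

# Hypothetical target shape supplied by a user for a pricing page.
schema = {
    "type": "object",
    "properties": {
        "plan_name": {"type": "string"},
        "monthly_price": {"type": "string"},
        "limits": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["plan_name"],
}


def save_schema(schema: dict, path: str) -> str:
    """Write the user's target JSON shape to a schema file for the helper script."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(schema, f, indent=2)
    return path


# Save next to other temp files; pass this path as --schema-file to the helper.
path = save_schema(schema, os.path.join(tempfile.gettempdir(), "schema.json"))
```

The saved file is then handed to `scripts/crawl4ai_extract.py` via `--schema-file`, together with an `--instruction` telling the LLM what to extract.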
## Helper Script
Use the helper for the common case:

```powershell
<python> .\scripts\crawl4ai_extract.py `
  --url "https://example.com" `
  --instruction "Extract the pricing plans and limits." `
  --schema-file ".\schema.json"
```

Important flags:
- `--url`: target page
- `--instruction`: the extraction instruction for the LLM
- `--schema-file`: JSON schema file for structured output
- `--css-selector`: optional content narrowing before extraction
- `--wait-for`: optional CSS selector to wait for on dynamic pages
- `--headless false`: opt out of headless mode for debugging
- `--cache-mode enabled|bypass|read_only|write_only|disabled`: cache behavior for the crawl
- `--max-input-tokens`: optional budget guardrail for large pages
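The flag surface above can be mirrored with `argparse` when you need to wrap or extend the helper. This is a hypothetical re-creation for illustration, not the actual `scripts/crawl4ai_extract.py` (which you should prefer):

```python
import argparse

# Hypothetical parser mirroring the helper's documented flags.
parser = argparse.ArgumentParser(prog="crawl4ai_extract")
parser.add_argument("--url", required=True, help="target page")
parser.add_argument("--instruction", help="extraction instruction for the LLM")
parser.add_argument("--schema-file", help="JSON schema file for structured output")
parser.add_argument("--css-selector", help="narrow content before extraction")
parser.add_argument("--wait-for", help="CSS selector to wait for on dynamic pages")
parser.add_argument("--headless", default="true", choices=["true", "false"],
                    help="pass false to watch the browser while debugging")
parser.add_argument("--cache-mode", default="bypass",
                    choices=["enabled", "bypass", "read_only", "write_only", "disabled"])
parser.add_argument("--max-input-tokens", type=int,
                    help="budget guardrail for large pages")

# Example invocation: only --url is required; cache bypass keeps content fresh.
args = parser.parse_args(["--url", "https://example.com", "--cache-mode", "bypass"])
```

Note that `argparse` maps `--cache-mode` to `args.cache_mode` and `--max-input-tokens` to `args.max_input_tokens`.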
## Notes
- Use OpenRouter by passing a provider string and a custom `base_url` through `LLMConfig`.
- The `openrouter/` provider prefix is an inference from Crawl4AI's LiteLLM-style provider naming plus OpenRouter's OpenAI-compatible endpoint. If that stops working in a future version, switch the provider string while keeping the same base URL and API key flow.
- Keep the skill focused on Crawl4AI. If the user needs generic scraping without Crawl4AI, use a more appropriate web-scraping workflow instead.
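The OpenRouter wiring described in these notes can be sketched as below. The class and argument names (`LLMConfig`, `LLMExtractionStrategy`, `extraction_type="schema"`, `CacheMode.BYPASS`) are assumptions based on Crawl4AI's documented API and should be verified against the installed version; the function is left uncalled so it reads as a sketch, and the helper script remains the preferred path:

```python
import asyncio  # the crawl API is async
import os


async def extract_with_openrouter(url: str, schema: dict, instruction: str) -> str:
    # Imports live inside the function so the sketch can be read (and type-checked
    # for structure) without crawl4ai installed; move them to module level in real use.
    from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig, LLMConfig
    from crawl4ai.extraction_strategy import LLMExtractionStrategy

    llm = LLMConfig(
        provider="openrouter/nvidia/nemotron-3-super-120b-a12b:free",
        api_token=os.environ["OPENROUTER_API_KEY"],  # env var, never inline
        base_url="https://openrouter.ai/api/v1",
    )
    strategy = LLMExtractionStrategy(
        llm_config=llm,
        schema=schema,
        extraction_type="schema",
        instruction=instruction,
    )
    config = CrawlerRunConfig(
        extraction_strategy=strategy,
        cache_mode=CacheMode.BYPASS,  # fresh content by default
    )
    async with AsyncWebCrawler() as crawler:  # headless browser by default
        result = await crawler.arun(url=url, config=config)
        return result.extracted_content
```

If the `openrouter/` prefix stops routing correctly, only the `provider` string in `LLMConfig` needs to change; the `base_url` and `OPENROUTER_API_KEY` flow stays the same.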