crawl4ai-openrouter

Crawl4AI OpenRouter

Use Crawl4AI when the user explicitly wants Crawl4AI or when a real browser crawl plus optional LLM extraction is the right fit.
Load only the reference file you need:
  • Open references/setup.md before first use on a machine or when the environment is failing.
  • Open references/recipes.md when you need a concrete pattern for plain markdown crawl, CSS-based extraction, or LLM-based extraction.

Default Operating Assumptions

  • Default LLM provider: openrouter/nvidia/nemotron-3-super-120b-a12b:free
  • Default API key env var: OPENROUTER_API_KEY
  • Default OpenRouter base URL: https://openrouter.ai/api/v1
  • Prefer LLMConfig plus LLMExtractionStrategy for schema-based extraction.
  • Prefer passing the API token via an environment variable, not inline in source.
  • Prefer headless=True unless the task requires visual debugging.
  • Prefer CacheMode.BYPASS when the user wants fresh content.
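Wired together, these defaults look roughly like the sketch below. This is a sketch under assumptions, not a definitive implementation: the imports follow the recent (v0.6-style) Crawl4AI layout, and the URL, instruction, and single-field schema are placeholders; exact class locations and parameter names may differ between Crawl4AI versions.

```python
import asyncio
import os

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# OpenRouter wiring via LLMConfig: provider string, API key from the env, custom base URL.
llm = LLMConfig(
    provider=os.getenv(
        "CRAWL4AI_OPENROUTER_MODEL",
        "openrouter/nvidia/nemotron-3-super-120b-a12b:free",
    ),
    api_token=os.getenv("OPENROUTER_API_KEY"),
    base_url=os.getenv("CRAWL4AI_OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1"),
)

# Schema-based extraction: the LLM is asked to emit JSON matching this (toy) schema.
strategy = LLMExtractionStrategy(
    llm_config=llm,
    extraction_type="schema",
    schema={"type": "object", "properties": {"title": {"type": "string"}}},
    instruction="Extract the page title.",
)

async def main() -> None:
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=CrawlerRunConfig(
                extraction_strategy=strategy,
                cache_mode=CacheMode.BYPASS,  # fresh content, skip the cache
            ),
        )
        print(result.extracted_content)  # JSON text produced by the LLM

asyncio.run(main())
```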

Workflow

  1. Confirm whether the task is:
    • plain crawl to markdown/text
    • structured extraction with a schema
    • dynamic page crawl that may need waits, JS, or browser tuning
  2. Read references/setup.md if you have not already verified Python, package install, and browser setup.
  3. For repeated or non-trivial extraction, use scripts/crawl4ai_extract.py instead of rewriting the integration from scratch.
  4. Set these env vars in the current shell when needed:
    • OPENROUTER_API_KEY
    • optionally CRAWL4AI_OPENROUTER_MODEL
    • optionally CRAWL4AI_OPENROUTER_BASE_URL
  5. If the user provides a target JSON shape, save it as a schema file and pass it to the helper script with an extraction instruction.
  6. Return the extracted JSON or the crawl markdown, and call out any crawl limitations such as auth walls, robots constraints, or weak page structure.
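For step 5, turning a user-provided JSON shape into a schema file can be sketched as below. The pricing-plan field names here are hypothetical placeholders; substitute whatever shape the user actually asked for.

```python
import json
from pathlib import Path

# Hypothetical target shape supplied by the user; adjust fields to the real request.
pricing_schema = {
    "type": "object",
    "properties": {
        "plans": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "monthly_price_usd": {"type": "number"},
                    "limits": {"type": "string"},
                },
                "required": ["name"],
            },
        }
    },
    "required": ["plans"],
}

# Save it next to the working directory so it can be passed as --schema-file schema.json
Path("schema.json").write_text(json.dumps(pricing_schema, indent=2), encoding="utf-8")
```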

Helper Script

Use the helper for the common case:
```powershell
<python> .\scripts\crawl4ai_extract.py `
  --url "https://example.com" `
  --instruction "Extract the pricing plans and limits." `
  --schema-file ".\schema.json"
```
Important flags:
  • --url: target page
  • --instruction: the extraction instruction for the LLM
  • --schema-file: JSON schema file for structured output
  • --css-selector: optionally narrow the content before extraction
  • --wait-for: optional CSS selector to wait for on dynamic pages
  • --headless false: opt out of headless mode for debugging
  • --cache-mode: one of enabled|bypass|read_only|write_only|disabled
  • --max-input-tokens: optional budget guardrail for large pages

Notes

  • Use OpenRouter by passing a provider string and a custom base_url through LLMConfig.
  • The provider prefix openrouter/ is an inference from Crawl4AI's LiteLLM-style provider naming plus OpenRouter's OpenAI-compatible endpoint. If that stops working in a future version, switch the provider string while keeping the same base URL and API key flow.
  • Keep the skill focused on Crawl4AI. If the user needs generic scraping without Crawl4AI, use a more appropriate web-scraping workflow instead.
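For the plain crawl-to-markdown case (no LLM involved), a minimal sketch looks like the following. Same caveat as above: the imports assume a recent Crawl4AI layout and the URL is a placeholder.

```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def main() -> None:
    browser = BrowserConfig(headless=True)  # flip to False only for visual debugging
    run = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)  # fresh content, skip the cache
    async with AsyncWebCrawler(config=browser) as crawler:
        result = await crawler.arun("https://example.com", config=run)
        print(result.markdown)  # the page rendered to markdown

asyncio.run(main())
```

If the page is dynamic, add wait_for (a CSS selector) to the CrawlerRunConfig, mirroring the helper script's --wait-for flag.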