firecrawl-knowledge-ingest

Firecrawl Knowledge Ingest

Use this when a docs portal needs browser navigation, auth, pagination, or JS rendering.

Onboarding Interview

Infer the portal URL, output format, auth needs, and page limit from context. If the portal is clear, proceed immediately.
Ask at most three concise questions, and only if blocked, such as the portal URL, whether authentication is required, or the desired output format.

Firecrawl Collection Plan

Use Firecrawl browser to:
  • open the portal and inspect navigation
  • identify sections, categories, sidebar links, and article URLs
  • follow sidebar navigation, next links, pagination, load-more controls, or search
  • scrape article content as markdown
  • extract metadata such as title, section, last updated date, author, and tags
Try Firecrawl map as a supplement for public URLs, but use browser navigation for auth-gated or JS-heavy content.
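
The navigation-following part of this plan can be sketched as a bounded breadth-first walk. This is a minimal sketch, not Firecrawl's API: `fetch` is a hypothetical callable standing in for whichever Firecrawl browser or scrape call is actually used, and `collect` and `CrawlResult` are illustrative names. It shows the page limit, the same-portal filter, and per-page failure tracking.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical fetcher: takes a URL, returns (markdown, outgoing links).
# A real run would wrap the Firecrawl call of your choice behind this signature.
Fetcher = Callable[[str], Tuple[str, List[str]]]


@dataclass
class CrawlResult:
    pages: Dict[str, str] = field(default_factory=dict)     # url -> markdown
    failures: Dict[str, str] = field(default_factory=dict)  # url -> error message


def collect(start_url: str, fetch: Fetcher, max_pages: int) -> CrawlResult:
    """Walk the portal breadth-first from start_url, storing at most max_pages pages."""
    result = CrawlResult()
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(result.pages) < max_pages:
        url = queue.popleft()
        try:
            markdown, links = fetch(url)
        except Exception as exc:
            # Record the failure for the final report and keep going.
            result.failures[url] = str(exc)
            continue
        result.pages[url] = markdown
        for link in links:
            # Stay inside the portal and skip anything already queued.
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return result
```

Passing the fetcher in as a parameter keeps the traversal logic testable without network access, and lets the same loop serve both public and auth-gated portals.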

Final Deliverable


Knowledge Ingest: [Portal]

Summary

[Pages extracted, sections covered, limitations]

Output

[JSON/markdown/merged file path or content]

Sections

[Section names and article counts]

Failed Or Restricted Pages

[Any access/loading issues]

Sources

[URLs extracted]

Rerun Inputs

workflow: firecrawl-knowledge-ingest
url: [portal url]
format: [json/markdown/merged]
max_pages: [number]
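
One way to keep these rerun inputs consistent between runs is to validate them before rendering. The sketch below is an assumption-laden illustration: the `RerunInputs` class and its `render` method are hypothetical names, and the only rules it enforces are the three format values listed above and a positive page limit.

```python
from dataclasses import dataclass

# The three output formats named in the rerun inputs.
ALLOWED_FORMATS = {"json", "markdown", "merged"}


@dataclass(frozen=True)
class RerunInputs:
    url: str
    format: str
    max_pages: int
    workflow: str = "firecrawl-knowledge-ingest"

    def __post_init__(self) -> None:
        # Reject values the workflow does not describe.
        if self.format not in ALLOWED_FORMATS:
            raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
        if self.max_pages < 1:
            raise ValueError("max_pages must be a positive integer")

    def render(self) -> str:
        """Return the four key: value fields, one per line."""
        return "\n".join([
            f"workflow: {self.workflow}",
            f"url: {self.url}",
            f"format: {self.format}",
            f"max_pages: {self.max_pages}",
        ])
```

For example, `RerunInputs(url="https://docs.example.com/", format="json", max_pages=50).render()` yields a block that can be pasted into the report as-is (the URL here is a placeholder).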

JSON Shape

Use `source`, `url`, `extractedAt`, `totalArticles`, and `sections[]` with article `title`, `url`, `section`, `content`, and `metadata`.
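
As a rough illustration of that shape, the sketch below groups flat article records by section and wraps them in the top-level fields. The source names only the fields listed above, so the per-section keys `name` and `articles`, the ISO 8601 timestamp format, and the `build_payload` helper itself are assumptions.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def build_payload(source: str, url: str, articles: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Group flat article records by section and add the top-level fields."""
    sections: Dict[str, List[Dict[str, Any]]] = {}
    for article in articles:
        sections.setdefault(article["section"], []).append({
            "title": article["title"],
            "url": article["url"],
            "section": article["section"],
            "content": article["content"],
            "metadata": article.get("metadata", {}),
        })
    return {
        "source": source,
        "url": url,
        # Assumed format: UTC timestamp in ISO 8601.
        "extractedAt": datetime.now(timezone.utc).isoformat(),
        "totalArticles": len(articles),
        # "name" and "articles" are assumed keys; the source only names sections[].
        "sections": [{"name": name, "articles": items} for name, items in sections.items()],
    }


if __name__ == "__main__":
    demo = build_payload(
        "Example Docs",
        "https://docs.example.com/",
        [{"title": "Intro", "url": "https://docs.example.com/intro",
          "section": "Guides", "content": "# Intro"}],
    )
    print(json.dumps(demo, indent=2))
```

Sections come out in first-seen order because Python dictionaries preserve insertion order, which keeps the output close to the portal's own sidebar ordering when articles are collected in navigation order.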

Quality Bar

  • Preserve code examples, tables, and formatting.
  • Strip nav chrome, headers, and footers.
  • Track extraction progress and page failures.
  • Respect authentication boundaries.