web-to-markdown

Web To Markdown

Convert URLs into usable Markdown by applying domain-aware fetching routes, then return the cleaned content directly.

Quick Workflow

  1. Normalize and validate the input URL.
  2. Select a route:
     • r.jina.ai: general web + X/Twitter.
     • defuddle.md: YouTube transcript/content extraction.
     • special-browser-fetch: WeChat/Zhihu/Feishu.
  3. Return Markdown text (or JSON metadata if needed).
For generic URLs (non-YouTube, non-WeChat/Zhihu/Feishu), use this fallback chain:
  • Try r.jina.ai first.
  • If it fails, fall back to a direct HTTP fetch + Readability.
  • If the direct fetch still fails or returns shell-like content (an empty page shell without the article body), fall back to browser extraction.
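
The fallback chain above can be sketched as a loop over ordered stages. This is a minimal illustration, not the code in scripts/url_to_markdown.mjs: `runFallbackChain`, the `{ name, run }` stage shape, and the `isUsable` hook are hypothetical names, and the sketch applies the same usability check to every stage for simplicity.

```javascript
// Try each stage in order and return the first usable result.
// `stages` is an array of { name, run }, where run(url) resolves to Markdown.
// `isUsable` is a hypothetical hook for rejecting empty or shell-like output.
async function runFallbackChain(
  url,
  stages,
  isUsable = (md) => typeof md === "string" && md.trim().length > 0
) {
  const failures = [];
  for (const stage of stages) {
    try {
      const markdown = await stage.run(url);
      if (isUsable(markdown)) {
        return { stage: stage.name, markdown };
      }
      failures.push(`${stage.name}: unusable content`);
    } catch (err) {
      failures.push(`${stage.name}: ${err.message}`);
    }
  }
  throw new Error(`All stages failed for ${url} (${failures.join("; ")})`);
}

// Usage with stub stages. Real stages would call r.jina.ai, a direct fetch
// plus Readability, and a browser extractor, in that order.
const demoStages = [
  { name: "r.jina.ai", run: async () => { throw new Error("HTTP 502"); } },
  { name: "direct-fetch", run: async () => "   " },
  { name: "browser", run: async () => "# Title\n\nBody text." },
];
const demoResult = runFallbackChain("https://example.com/post", demoStages);
demoResult.then((r) => console.log(r.stage)); // prints "browser"
```

Collecting the per-stage failure reasons makes the final error message useful when every stage fails.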

Commands

Run from this skill directory (skills/web-to-markdown):

    npm install
    node scripts/url_to_markdown.mjs <url>

Return metadata with markdown:

    node scripts/url_to_markdown.mjs <url> --json

Force special-site extraction (HTTP first, then browser fallback):

    node scripts/fetch_special_sites.mjs <url> --json

Routing Policy

  • Default route: https://r.jina.ai/<url>.
  • YouTube (youtube.com, youtu.be): https://defuddle.md/<url>.
  • X/Twitter (x.com, twitter.com): https://r.jina.ai/<url>.
  • WeChat/Zhihu/Feishu: run scripts/fetch_special_sites.mjs.
  • If the input is already proxy-formatted (https://defuddle.md/https://... or https://r.jina.ai/https://...), normalize it back to the original URL and re-apply routing.
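
The routing rules above might be expressed roughly as follows. This is a sketch under assumptions, not the skill's actual implementation: `unwrapProxyUrl`, `hostMatches`, and `selectRoute` are hypothetical names, and the special-site domains (`weixin.qq.com`, `zhihu.com`, `feishu.cn`) are assumed because this document does not list them.

```javascript
const PROXY_PREFIXES = ["https://r.jina.ai/", "https://defuddle.md/"];
const YOUTUBE_DOMAINS = ["youtube.com", "youtu.be"];
// Assumed domains for WeChat/Zhihu/Feishu; not taken from the skill itself.
const SPECIAL_DOMAINS = ["weixin.qq.com", "zhihu.com", "feishu.cn"];

// Strip leading proxy prefixes (even nested ones) to recover the original URL.
function unwrapProxyUrl(input) {
  let url = input.trim();
  let changed = true;
  while (changed) {
    changed = false;
    for (const prefix of PROXY_PREFIXES) {
      if (url.startsWith(prefix + "http://") || url.startsWith(prefix + "https://")) {
        url = url.slice(prefix.length);
        changed = true;
      }
    }
  }
  return url;
}

// True when hostname is the domain itself or one of its subdomains.
function hostMatches(hostname, domain) {
  return hostname === domain || hostname.endsWith("." + domain);
}

// Validate the URL, then pick a route by hostname.
function selectRoute(input) {
  const parsed = new URL(unwrapProxyUrl(input)); // throws TypeError if invalid
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  const host = parsed.hostname;
  if (YOUTUBE_DOMAINS.some((d) => hostMatches(host, d))) {
    return { route: "defuddle.md", target: "https://defuddle.md/" + parsed.href };
  }
  if (SPECIAL_DOMAINS.some((d) => hostMatches(host, d))) {
    return { route: "special-browser-fetch", target: parsed.href };
  }
  // Default route, which also covers x.com and twitter.com.
  return { route: "r.jina.ai", target: "https://r.jina.ai/" + parsed.href };
}

console.log(selectRoute("https://r.jina.ai/https://youtu.be/abc").route); // prints "defuddle.md"
```

Matching on the parsed hostname rather than on a substring of the URL avoids misrouting pages that merely mention one of these domains in their path or query string.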

Special-Site Extraction Behavior

Use a two-stage strategy for WeChat/Zhihu/Feishu:
  1. Try cuimp HTTP/TLS impersonation first, then clean the HTML with Mozilla Readability.
  2. If stage 1 fails or returns blocked/shell content, fall back to puppeteer-extra browser impersonation.
  • The HTTP stage impersonates a modern Chrome TLS/HTTP profile via cuimp.
  • The browser stage impersonates a modern Chrome user agent and sends standard sec-ch-ua headers.
  • Remove known login modals and backdrop overlays (best effort).
  • Scroll the page to trigger lazy-loaded article blocks.
  • Parse the cleaned document with Mozilla Readability.
  • Convert the extracted HTML body to Markdown via Turndown.
  • Resolve the browser executable from CHROME_PATH first, then from system Chrome/Chromium/Edge paths.
If special-site extraction fails due to anti-bot checks, account-only pages, or network limits, report the failure clearly and ask for fallback input (for example, raw page text).
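
One way to picture the two-stage flow is the sketch below. It is illustrative only: the fetchers are injected as plain functions so that no cuimp or puppeteer-extra API is assumed, and `isBlockedOrShell`, its marker list, and its length thresholds are hypothetical heuristics rather than the checks used in scripts/fetch_special_sites.mjs.

```javascript
// Hypothetical markers of a blocked or login-gated page; tune per site.
const BLOCK_MARKERS = ["captcha", "verify you are human", "请登录"];

// Estimate visible text by stripping scripts, styles, and tags.
function isBlockedOrShell(html) {
  if (typeof html !== "string") return true;
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<style[\s\S]*?<\/style>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  if (text.length < 200) return true; // assumed threshold for an empty shell
  // Only trust block markers on short pages, so long articles that merely
  // mention one of these words are not rejected.
  const lower = text.toLowerCase();
  return text.length < 2000 && BLOCK_MARKERS.some((m) => lower.includes(m));
}

// Stage 1: HTTP impersonation fetch. Stage 2: browser fetch as a fallback.
// httpFetch and browserFetch are injected async functions returning HTML.
async function extractSpecialSite(url, { httpFetch, browserFetch }) {
  try {
    const html = await httpFetch(url);
    if (!isBlockedOrShell(html)) {
      return { strategy: "special-http-fetch", html };
    }
  } catch (err) {
    // Fall through to the browser stage on any HTTP-stage error.
  }
  const html = await browserFetch(url);
  if (isBlockedOrShell(html)) {
    throw new Error(`Extraction blocked for ${url}; ask for raw page text instead`);
  }
  return { strategy: "special-browser-fetch-fallback", html };
}
```

The returned HTML would then go through Readability and Turndown as described above. The strategy strings reuse the values listed in the output contract.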

Output Contract

For normal usage, output Markdown only.
When --json is used, return:
  • source: backend source (r.jina.ai, defuddle, cuimp, browser-readability).
  • strategy: selected route (r-jina, defuddle, special-http-fetch, special-browser-fetch-fallback).
  • requestedUrl: original input.
  • resolvedUrl: normalized/final URL.
  • markdown: extracted Markdown body.
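
A consumer of the `--json` output could check a payload against the fields listed above, as in this sketch. `validateResult` is a hypothetical helper and the sample payload is invented for illustration. The allowed values are only those enumerated in this section, so the real script may emit others (for example, for the direct-fetch fallback).

```javascript
const SOURCES = ["r.jina.ai", "defuddle", "cuimp", "browser-readability"];
const STRATEGIES = [
  "r-jina",
  "defuddle",
  "special-http-fetch",
  "special-browser-fetch-fallback",
];

// Return a list of problems; an empty list means the payload looks valid.
function validateResult(result) {
  if (result === null || typeof result !== "object") {
    return ["result is not an object"];
  }
  const problems = [];
  if (!SOURCES.includes(result.source)) {
    problems.push(`unexpected source: ${result.source}`);
  }
  if (!STRATEGIES.includes(result.strategy)) {
    problems.push(`unexpected strategy: ${result.strategy}`);
  }
  for (const key of ["requestedUrl", "resolvedUrl", "markdown"]) {
    if (typeof result[key] !== "string") {
      problems.push(`${key} must be a string`);
    }
  }
  return problems;
}

// Illustrative payload only; the values are made up, not real tool output.
const sample = {
  source: "r.jina.ai",
  strategy: "r-jina",
  requestedUrl: "https://r.jina.ai/https://example.com/post",
  resolvedUrl: "https://example.com/post",
  markdown: "# Example\n\nBody text.",
};
console.log(validateResult(sample).length); // prints 0
```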

Resources

  • references/routing-and-notes.md: domain routing rules and operational caveats.
  • scripts/url_to_markdown.mjs: primary entrypoint.
  • scripts/fetch_special_sites_http.mjs: WeChat/Zhihu/Feishu HTTP impersonation fetcher (cuimp JS).
  • scripts/fetch_special_sites.mjs: two-stage extractor (HTTP-first, browser-fallback).