web-to-markdown

Web To Markdown

Convert URLs into usable Markdown by applying domain-aware fetching routes, then return the cleaned content directly.

Quick Workflow

1. Normalize and validate the input URL.
2. Select route:
   - `r.jina.ai`: general web + X/Twitter.
   - `defuddle.md`: YouTube transcript/content extraction.
   - `special-browser-fetch`: WeChat/Zhihu/Feishu.
3. Return Markdown text (or JSON metadata if needed).

For generic URLs (non-YouTube, non-WeChat/Zhihu/Feishu), use this fallback chain:

1. Try `r.jina.ai` first.
2. If it fails, fall back to direct HTTP fetch + Readability.
3. If direct fetch still fails or returns shell-like content, fall back to browser extraction.
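
The fallback chain above can be sketched as an ordered list of fetchers that are tried until one returns usable content. This is a minimal illustration, not the skill's actual code: `fetchWithFallback`, `looksLikeShell`, and the 200-character threshold are hypothetical, and for simplicity the shell check is applied to every step rather than only to the direct fetch.

```javascript
// Hypothetical heuristic: treat blank or very short output as a "shell" page.
// The 200-character threshold is an assumption, not a value from the skill.
function looksLikeShell(text, minLength = 200) {
  return typeof text !== "string" || text.trim().length < minLength;
}

// Try each step in order and return the first usable result.
// Each step is { name, run }, where run(url) returns a Promise<string>.
async function fetchWithFallback(url, steps) {
  const errors = [];
  for (const { name, run } of steps) {
    try {
      const markdown = await run(url);
      if (!looksLikeShell(markdown)) {
        return { source: name, markdown };
      }
      errors.push(`${name}: shell-like content`);
    } catch (err) {
      errors.push(`${name}: ${err.message}`);
    }
  }
  // Nothing worked: surface every failure so the caller can report it clearly.
  throw new Error(`All routes failed for ${url}: ${errors.join("; ")}`);
}
```

In use, `steps` would hold one entry per stage of the chain (the `r.jina.ai` request, the direct fetch plus Readability, and the browser extraction), each wrapped as an async function that takes the URL.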

Commands

Run from this skill directory (`skills/web-to-markdown`):

```bash
npm install
node scripts/url_to_markdown.mjs <url>
```

Return metadata with markdown:

```bash
node scripts/url_to_markdown.mjs <url> --json
```

Force special-site browser extraction:

```bash
node scripts/fetch_special_sites.mjs <url> --json
```

Routing Policy

- Default route: `https://r.jina.ai/<url>`.
- YouTube (`youtube.com`, `youtu.be`): `https://defuddle.md/<url>`.
- X/Twitter (`x.com`, `twitter.com`): `https://r.jina.ai/<url>`.
- WeChat/Zhihu/Feishu: run `scripts/fetch_special_sites.mjs`.
- If input is already proxy-formatted (`https://defuddle.md/https://...` or `https://r.jina.ai/https://...`), normalize back to the original URL and re-apply routing.
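
As a rough sketch of the policy above, routing can be expressed as "unwrap any proxy prefix, then match the hostname". The function names, the `fetchUrl` field, and the `SPECIAL_HOSTS` list are assumptions made for illustration; the hostnames the real scripts match for WeChat/Zhihu/Feishu are not stated here.

```javascript
const PROXY_PREFIXES = ["https://r.jina.ai/", "https://defuddle.md/"];
const YOUTUBE_HOSTS = ["youtube.com", "youtu.be"];
// Assumed hostnames for the special sites; the skill may match different ones.
const SPECIAL_HOSTS = ["mp.weixin.qq.com", "zhihu.com", "feishu.cn"];

// Strip proxy prefixes until the original URL remains.
function unwrapProxy(input) {
  let url = input.trim();
  let changed = true;
  while (changed) {
    changed = false;
    for (const prefix of PROXY_PREFIXES) {
      const rest = url.slice(prefix.length);
      // Only unwrap when the remainder is itself an absolute URL.
      if (url.startsWith(prefix) && /^https?:\/\//.test(rest)) {
        url = rest;
        changed = true;
      }
    }
  }
  return url;
}

// Pick a route for the URL. new URL() throws on invalid input,
// which doubles as the validation step.
function selectRoute(input) {
  const resolvedUrl = unwrapProxy(input);
  const host = new URL(resolvedUrl).hostname.toLowerCase();
  const matches = (domain) => host === domain || host.endsWith("." + domain);

  if (YOUTUBE_HOSTS.some(matches)) {
    return { strategy: "defuddle", resolvedUrl, fetchUrl: "https://defuddle.md/" + resolvedUrl };
  }
  if (SPECIAL_HOSTS.some(matches)) {
    return { strategy: "special-http-fetch", resolvedUrl, fetchUrl: resolvedUrl };
  }
  // Default route, which also covers x.com and twitter.com.
  return { strategy: "r-jina", resolvedUrl, fetchUrl: "https://r.jina.ai/" + resolvedUrl };
}
```

Matching on the hostname suffix rather than a substring of the whole URL avoids misrouting pages that merely mention `youtube.com` in their path or query string.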

Special-Site Extraction Behavior

Use a two-stage strategy for WeChat/Zhihu/Feishu:

1. Try `cuimp` HTTP/TLS impersonation first, then clean HTML with Mozilla Readability.
2. If stage 1 fails or returns blocked/shell content, fall back to `puppeteer-extra` browser impersonation.

Behavior details:

- HTTP stage impersonates a modern Chrome TLS/HTTP profile via `cuimp`.
- Browser stage impersonates a modern Chrome user agent and standard `sec-ch-ua` headers.
- Remove known login modals and backdrop overlays (best effort).
- Scroll the page to trigger lazy-loaded article blocks.
- Parse the cleaned document with Mozilla Readability.
- Convert the extracted HTML body to Markdown via Turndown.
- Resolve the browser executable from `CHROME_PATH` first, then system Chrome/Chromium/Edge paths.
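
The browser lookup order described above might look like the following sketch. `resolveBrowserPath` and the candidate paths are hypothetical, and checking that `CHROME_PATH` actually exists before using it is an assumption about the behavior rather than something the skill documents.

```javascript
// Hypothetical candidate list; adjust for the platforms you actually target.
const DEFAULT_BROWSER_PATHS = [
  "/usr/bin/google-chrome",
  "/usr/bin/chromium",
  "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
  "/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge",
];

// env: an object such as process.env.
// exists: a predicate such as fs.existsSync, injected to keep this testable.
function resolveBrowserPath(env, exists, candidates = DEFAULT_BROWSER_PATHS) {
  const override = env.CHROME_PATH;
  // An explicit CHROME_PATH wins, provided it points at a real file.
  if (override && exists(override)) return override;
  // Otherwise use the first system path that exists, or null if none do.
  return candidates.find((path) => exists(path)) ?? null;
}
```

A caller in Node would pass `process.env` and `fs.existsSync`; a `null` result is the signal to report that no browser was found instead of launching one.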

If special-site extraction fails due to anti-bot checks, account-only pages, or network limits, report failure clearly and ask for fallback input (for example raw page text).

Output Contract

For normal usage, output markdown only.

When `--json` is used, return:

- `source`: backend source (`r.jina.ai`, `defuddle`, `cuimp`, `browser-readability`).
- `strategy`: selected route (`r-jina`, `defuddle`, `special-http-fetch`, `special-browser-fetch-fallback`).
- `requestedUrl`: original input.
- `resolvedUrl`: normalized/final URL.
- `markdown`: extracted markdown body.
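
A consumer of the `--json` output could sanity-check it against the fields listed above. The sketch below is illustrative only: `checkJsonResult` is a hypothetical helper, it assumes the three URL and body fields are strings, and it assumes the listed `source` and `strategy` values are exhaustive, which the generic fallback chain may not guarantee.

```javascript
const SOURCES = ["r.jina.ai", "defuddle", "cuimp", "browser-readability"];
const STRATEGIES = [
  "r-jina",
  "defuddle",
  "special-http-fetch",
  "special-browser-fetch-fallback",
];

// Return a list of problems; an empty list means the shape looks right.
function checkJsonResult(result) {
  const problems = [];
  if (!SOURCES.includes(result.source)) {
    problems.push(`unexpected source: ${result.source}`);
  }
  if (!STRATEGIES.includes(result.strategy)) {
    problems.push(`unexpected strategy: ${result.strategy}`);
  }
  for (const key of ["requestedUrl", "resolvedUrl", "markdown"]) {
    if (typeof result[key] !== "string") {
      problems.push(`${key} must be a string`);
    }
  }
  return problems;
}
```

Returning a list of problems rather than throwing on the first one makes it easier to log everything that is off in a single pass.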

Resources

- `references/routing-and-notes.md`: domain routing rules and operational caveats.
- `scripts/url_to_markdown.mjs`: primary entrypoint.
- `scripts/fetch_special_sites_http.mjs`: WeChat/Zhihu/Feishu HTTP impersonation fetcher (`cuimp` JS).
- `scripts/fetch_special_sites.mjs`: two-stage extractor (HTTP-first, browser-fallback).
