web-to-markdown
Web To Markdown
Convert URLs into usable Markdown by applying domain-aware fetching routes, then return the cleaned content directly.
Quick Workflow
- Normalize and validate the input URL.
- Select route:
  - `r.jina.ai`: general web + X/Twitter.
  - `defuddle.md`: YouTube transcript/content extraction.
  - `special-browser-fetch`: WeChat/Zhihu/Feishu.
- Return markdown text (or JSON metadata if needed).

For generic URLs (non-YouTube, non-WeChat/Zhihu/Feishu), use this fallback chain:
- try `r.jina.ai` first,
- if it fails, fall back to direct HTTP fetch + Readability,
- if direct fetch still fails or returns shell-like content, fall back to browser extraction.
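
The fallback chain above can be sketched as a loop over an ordered list of fetchers. This is a minimal illustration, not the skill's actual implementation: `looksLikeShell`, `fetchWithFallback`, the 200-character threshold, and the placeholder phrases are all assumptions, and the sketch applies the shell-content check to every stage for simplicity.

```javascript
// Heuristic (assumption): very short output or a common "enable JavaScript"
// placeholder is treated as shell-like content that should trigger a fallback.
function looksLikeShell(markdown) {
  const text = (markdown || "").trim();
  if (text.length < 200) return true;
  return /enable javascript|checking your browser/i.test(text);
}

// Try each fetcher in order and return the first usable result.
// `fetchers` is an array of { name, run }, where run(url) resolves to markdown.
async function fetchWithFallback(url, fetchers) {
  const errors = [];
  for (const { name, run } of fetchers) {
    try {
      const markdown = await run(url);
      if (!looksLikeShell(markdown)) return { source: name, markdown };
      errors.push(`${name}: shell-like content`);
    } catch (err) {
      errors.push(`${name}: ${err.message}`);
    }
  }
  // Every stage failed: surface all reasons so the caller can report clearly.
  throw new Error(`All routes failed for ${url}\n${errors.join("\n")}`);
}
```

A caller would pass the stages in the documented order, for example a `r.jina.ai` fetcher, then a direct HTTP + Readability fetcher, then a browser extractor. Collecting every stage's error keeps the final failure message informative.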
Commands
Run from this skill directory (`skills/web-to-markdown`):

```bash
npm install
node scripts/url_to_markdown.mjs <url>
```

Return metadata with markdown:

```bash
node scripts/url_to_markdown.mjs <url> --json
```

Force special-site browser extraction:

```bash
node scripts/fetch_special_sites.mjs <url> --json
```

Routing Policy
- Default route: `https://r.jina.ai/<url>`.
- YouTube (`youtube.com`, `youtu.be`): `https://defuddle.md/<url>`.
- X/Twitter (`x.com`, `twitter.com`): `https://r.jina.ai/<url>`.
- WeChat/Zhihu/Feishu: run `scripts/fetch_special_sites.mjs`.
- If input is already proxy-formatted (`https://defuddle.md/https://...` or `https://r.jina.ai/https://...`), normalize back to the original URL and re-apply routing.
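
As a rough sketch of the routing rules above, the following function unwraps proxy-formatted input and then picks a route by hostname. The function names and the `SPECIAL_DOMAINS` list are assumptions made for illustration (the source does not list the exact WeChat/Zhihu/Feishu hostnames); the route labels are taken from the Quick Workflow section.

```javascript
const PROXY_PREFIXES = ["https://r.jina.ai/", "https://defuddle.md/"];
const YOUTUBE_DOMAINS = ["youtube.com", "youtu.be"];
// Assumption: representative hostnames for WeChat, Zhihu, and Feishu.
const SPECIAL_DOMAINS = ["mp.weixin.qq.com", "zhihu.com", "feishu.cn"];

// Strip proxy prefixes (even nested ones) so routing starts from the original URL.
function unwrapProxy(input) {
  let url = input.trim();
  let prefix;
  while ((prefix = PROXY_PREFIXES.find((p) => url.startsWith(p + "http")))) {
    url = url.slice(prefix.length);
  }
  return url;
}

// True when `host` equals `domain` or is a subdomain of it.
function hostMatches(host, domain) {
  return host === domain || host.endsWith("." + domain);
}

function selectRoute(input) {
  const original = unwrapProxy(input);
  const host = new URL(original).hostname.toLowerCase(); // throws on invalid URLs
  if (YOUTUBE_DOMAINS.some((d) => hostMatches(host, d))) {
    return { route: "defuddle.md", target: "https://defuddle.md/" + original };
  }
  if (SPECIAL_DOMAINS.some((d) => hostMatches(host, d))) {
    return { route: "special-browser-fetch", target: original };
  }
  // X/Twitter and all other hosts share the default route.
  return { route: "r.jina.ai", target: "https://r.jina.ai/" + original };
}
```

Because X/Twitter uses the same proxy as the default route, it needs no dedicated branch here; a real implementation might still keep one for logging or future divergence.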
Special-Site Extraction Behavior
Use a two-stage strategy for WeChat/Zhihu/Feishu:
- Try `cuimp` HTTP/TLS impersonation first, then clean HTML with Mozilla Readability.
- If stage 1 fails or returns blocked/shell content, fall back to `puppeteer-extra` browser impersonation.

- HTTP stage impersonates modern Chrome TLS/HTTP profile via `cuimp`.
- Browser stage impersonates a modern Chrome user agent and standard `sec-ch-ua` headers.
- Remove known login modals and backdrop overlays (best effort).
- Scroll the page to trigger lazy-loaded article blocks.
- Parse cleaned document with Mozilla Readability.
- Convert extracted HTML body to Markdown via Turndown.
- Resolve browser executable from `CHROME_PATH` first, then system Chrome/Chromium/Edge paths.
If special-site extraction fails due to anti-bot checks, account-only pages, or network limits, report failure clearly and ask for fallback input (for example raw page text).
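
The browser-executable lookup described above might look like the following sketch. The candidate paths are illustrative assumptions (real installations vary by operating system), and `resolveBrowserPath` is a hypothetical helper, not a function confirmed to exist in the skill's scripts. Checking that `CHROME_PATH` actually exists before using it is also an assumption.

```javascript
// Assumption: common install locations; adjust for the target system.
const DEFAULT_CANDIDATES = [
  "/usr/bin/google-chrome",
  "/usr/bin/chromium",
  "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
  "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
  "C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe",
];

// Return CHROME_PATH if set and present, else the first existing candidate,
// else null. `exists` is injected so the logic can be tested without a disk.
function resolveBrowserPath(env, exists, candidates = DEFAULT_CANDIDATES) {
  const fromEnv = env.CHROME_PATH;
  if (fromEnv && exists(fromEnv)) return fromEnv;
  return candidates.find((p) => exists(p)) ?? null;
}
```

In a Node script this could be called with `process.env` and `fs.existsSync`. A `null` result is a natural point to report failure clearly rather than letting the browser launch fail with a less readable error.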
Output Contract
For normal usage, output markdown only.
When `--json` is used, return:
- `source`: backend source (`r.jina.ai`, `defuddle`, `cuimp`, `browser-readability`).
- `strategy`: selected route (`r-jina`, `defuddle`, `special-http-fetch`, `special-browser-fetch-fallback`).
- `requestedUrl`: original input.
- `resolvedUrl`: normalized/final URL.
- `markdown`: extracted markdown body.
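
A consumer of the `--json` output could check it against this contract with a small validator. This is a sketch under the assumption that all five fields are always present and that the three URL/body fields are strings; `validateResult` is a hypothetical helper, and the real script may emit additional fields.

```javascript
const SOURCES = ["r.jina.ai", "defuddle", "cuimp", "browser-readability"];
const STRATEGIES = [
  "r-jina",
  "defuddle",
  "special-http-fetch",
  "special-browser-fetch-fallback",
];

// Return a list of problems; an empty list means the object matches the contract.
function validateResult(result) {
  const problems = [];
  if (!SOURCES.includes(result.source)) problems.push("unknown source");
  if (!STRATEGIES.includes(result.strategy)) problems.push("unknown strategy");
  for (const key of ["requestedUrl", "resolvedUrl", "markdown"]) {
    if (typeof result[key] !== "string") problems.push(`${key} must be a string`);
  }
  return problems;
}
```

Returning a list of problems instead of throwing makes it easy to log every mismatch at once when the output shape drifts.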
Resources
- references/routing-and-notes.md: domain routing rules and operational caveats.
- `scripts/url_to_markdown.mjs`: primary entrypoint.
- `scripts/fetch_special_sites_http.mjs`: WeChat/Zhihu/Feishu HTTP impersonation fetcher (`cuimp` JS).
- `scripts/fetch_special_sites.mjs`: two-stage extractor (HTTP-first, browser-fallback).