URL to Markdown
Fetches any URL via the baoyu-fetch CLI (Chrome CDP + site-specific adapters) and converts it to clean markdown.
CLI Setup
Important: The CLI source is vendored in the scripts/vendor/baoyu-fetch/ subdirectory of this skill.
Agent Execution Instructions:
- Determine this SKILL.md file's directory path as {baseDir}
- CLI entry point = {baseDir}/scripts/vendor/baoyu-fetch/src/cli.ts
- Resolve the ${BUN_X} runtime: use bun if it is installed; otherwise use an available fallback runner; if neither is available, suggest installing bun
- ${READER} = ${BUN_X} {baseDir}/scripts/vendor/baoyu-fetch/src/cli.ts
- Replace all ${READER} in this document with the resolved value
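The resolution steps above could be sketched roughly as follows. This is only an illustration, not the skill's actual logic: resolveReader and isAvailable are hypothetical names, and the "npx -y bun" fallback is an assumption, since the text only says that some fallback is tried before suggesting a bun install.

```typescript
// Hypothetical result shape: both fields are null when no runtime was found,
// in which case the agent should suggest installing bun.
interface ResolvedRuntime {
  bunX: string | null;
  reader: string | null;
}

// isAvailable is injected so the sketch stays free of process/OS calls;
// a real agent would check whether the command exists on PATH.
function resolveReader(
  baseDir: string,
  isAvailable: (command: string) => boolean,
): ResolvedRuntime {
  const entry = `${baseDir}/scripts/vendor/baoyu-fetch/src/cli.ts`;
  let bunX: string | null = null;
  if (isAvailable("bun")) {
    bunX = "bun";
  } else if (isAvailable("npx")) {
    bunX = "npx -y bun"; // ASSUMPTION: fallback runner, not confirmed by the text
  }
  return { bunX, reader: bunX === null ? null : `${bunX} ${entry}` };
}
```

Injecting the availability check keeps the decision logic separate from how the agent probes its environment.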
Preferences (EXTEND.md)
Check EXTEND.md existence (priority order):
bash
# macOS, Linux, WSL, Git Bash
test -f .baoyu-skills/baoyu-url-to-markdown/EXTEND.md && echo "project"
test -f "${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "xdg"
test -f "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "user"
powershell
# PowerShell (Windows)
if (Test-Path .baoyu-skills/baoyu-url-to-markdown/EXTEND.md) { "project" }
$xdg = if ($env:XDG_CONFIG_HOME) { $env:XDG_CONFIG_HOME } else { "$HOME/.config" }
if (Test-Path "$xdg/baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "xdg" }
if (Test-Path "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "user" }
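| Path | Location |
|---|---|
| .baoyu-skills/baoyu-url-to-markdown/EXTEND.md | Project directory |
| ${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md | XDG config directory |
| $HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md | User home |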
| Result | Action |
|---|---|
| Found | Read, parse, apply settings |
| Not found | MUST run first-time setup (see below) — do NOT silently create defaults |
EXTEND.md Supports: Download media by default | Default output directory
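For agents that prefer to do the lookup in code rather than in a shell, the same priority order (project, then XDG config, then user home) might be expressed like this. It is a minimal sketch: extendMdCandidates and firstExisting are hypothetical helper names, and the file-existence check is passed in rather than tied to a particular runtime API.

```typescript
interface Candidate {
  scope: "project" | "xdg" | "user";
  path: string;
}

// Mirrors the shell checks: an unset or empty XDG_CONFIG_HOME falls back
// to $HOME/.config, as with ${XDG_CONFIG_HOME:-$HOME/.config}.
function extendMdCandidates(home: string, xdgConfigHome?: string): Candidate[] {
  const rel = "baoyu-skills/baoyu-url-to-markdown/EXTEND.md";
  const xdgBase = xdgConfigHome ? xdgConfigHome : `${home}/.config`;
  return [
    { scope: "project", path: `.${rel}` },
    { scope: "xdg", path: `${xdgBase}/${rel}` },
    { scope: "user", path: `${home}/.${rel}` },
  ];
}

// Returns the highest-priority candidate that exists, or null.
// A null result means first-time setup must run before any conversion.
function firstExisting(
  candidates: Candidate[],
  exists: (path: string) => boolean,
): Candidate | null {
  for (const candidate of candidates) {
    if (exists(candidate.path)) return candidate;
  }
  return null;
}
```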
First-Time Setup (BLOCKING)
CRITICAL: When EXTEND.md is not found, you MUST ask the user for their preferences before creating EXTEND.md. NEVER create EXTEND.md with defaults without asking. This is a BLOCKING operation: do NOT proceed with any conversion until setup is complete.
Ask ALL questions together in ONE call:
Question 1 — header: "Media", question: "How to handle images and videos in pages?"
- "Ask each time (Recommended)" — After saving markdown, ask whether to download media
- "Always download" — Always download media to local imgs/ and videos/ directories
- "Never download" — Keep original remote URLs in markdown
Question 2 — header: "Output", question: "Default output directory?"
- "url-to-markdown (Recommended)" — Save to ./url-to-markdown/{domain}/{slug}.md
- (User may choose "Other" to type a custom path)
Question 3 — header: "Save", question: "Where to save preferences?"
- "User (Recommended)" — ~/.baoyu-skills/ (all projects)
- "Project" — .baoyu-skills/ (this project only)
After user answers, create EXTEND.md at the chosen location, confirm "Preferences saved to [path]", then continue.
Full reference: references/config/first-time-setup.md
Supported Keys
| Key | Default | Values | Description |
|---|---|---|---|
| | | / / | = prompt each time, = always download, = never |
| default_output_dir | empty | path or empty | Default output directory (empty = ./url-to-markdown/) |
EXTEND.md → CLI mapping:
| EXTEND.md key | CLI argument | Notes |
|---|---|---|
| | --download-media | Requires --output to be set |
| default_output_dir: ./posts/ | Agent constructs --output ./posts/{domain}/{slug}.md | Agent generates path, not a direct CLI flag |
Value priority (highest first):
- CLI arguments (--download-media, --output)
- EXTEND.md
- Skill defaults
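The priority order amounts to taking the first defined value from the three sources. A minimal sketch, assuming each setting is represented as undefined when a source does not provide it (firstDefined is a hypothetical helper, not part of the CLI):

```typescript
// Picks a setting by priority: CLI argument, then EXTEND.md, then skill default.
function firstDefined<T>(
  cliValue: T | undefined,
  extendValue: T | undefined,
  skillDefault: T,
): T {
  if (cliValue !== undefined) return cliValue;
  if (extendValue !== undefined) return extendValue;
  return skillDefault;
}

// Example: no CLI override, EXTEND.md sets ./posts/, so ./posts/ wins
// over the skill default.
const outputBase = firstDefined<string>(undefined, "./posts/", "./url-to-markdown/");
```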
Features
- Chrome CDP for full JavaScript rendering via the baoyu-fetch CLI
- Site-specific adapters: X/Twitter, YouTube, Hacker News, generic (Defuddle)
- Automatic adapter selection based on URL, or force with --adapter
- Interaction gate detection: Cloudflare, reCAPTCHA, hCaptcha, custom challenges
- Two capture modes: headless (default) or interactive with wait-for-interaction
- Clean markdown output with YAML front matter
- Structured JSON output available via --format json
- X/Twitter: extracts tweets, threads, and X Articles with media
- YouTube: transcript/caption extraction, chapters, cover images
- Hacker News: threaded comment parsing with proper nesting
- Generic: Defuddle extraction with Readability fallback
- Download images and videos to local directories
- Chrome profile persistence for authenticated sessions
- Debug artifact output for troubleshooting
Usage
bash
# Default: headless capture, markdown to stdout
${READER} <url>
# Save to file
${READER} <url> --output article.md
# Save with media download
${READER} <url> --output article.md --download-media
# Headless mode (explicit)
${READER} <url> --headless --output article.md
# Wait for interaction (login/CAPTCHA) — auto-detect and continue
${READER} <url> --wait-for interaction --output article.md
# Wait for interaction — manual control (Enter to continue)
${READER} <url> --wait-for force --output article.md
# JSON output
${READER} <url> --format json --output article.json
# Force specific adapter
${READER} <url> --adapter youtube --output transcript.md
# Connect to existing Chrome
${READER} <url> --cdp-url http://localhost:9222 --output article.md
# Debug artifacts
${READER} <url> --output article.md --debug-dir ./debug/
Options
| Option | Description |
|---|---|
| <url> | URL to fetch |
| --output <path> | Output file path (default: stdout) |
| --format | Output format: markdown (default) or json |
| | Shorthand for |
| --adapter | Force a specific adapter, for example youtube (default: auto-detect) |
| --headless | Force headless Chrome (no visible window) |
| --wait-for | Interaction wait mode: interaction or force; by default the CLI does not wait for interaction |
| | Alias for |
| | Alias for |
| | Page load timeout in ms (default: 30000) |
| --interaction-timeout <ms> | Login/CAPTCHA wait timeout (default: 600000 = 10 min) |
| --interaction-poll-interval <ms> | Poll interval for interaction checks (default: 1500) |
| --download-media | Download images/videos to local imgs/ and videos/, rewrite markdown links. Requires --output |
| | Base directory for downloaded media (default: same directory as the output file) |
| --cdp-url <url> | Reuse existing Chrome DevTools Protocol endpoint |
| | Custom Chrome/Chromium binary path |
| --chrome-profile-dir <path> | Chrome user data directory (default: from the environment variable, or ./baoyu-skills/chrome-profile) |
| --debug-dir <path> | Write debug artifacts (document.json, markdown.md, page.html, network.json) |
Capture Modes
| Mode | Behavior | Use When |
|---|---|---|
| Default | Headless Chrome, auto-extract on network idle | Public pages, static content |
| --headless | Explicit headless (same as default) | Clarify intent |
| --wait-for interaction | Opens visible Chrome, auto-detects login/CAPTCHA gates, waits for them to clear, then continues | Login-required, CAPTCHA-protected |
| --wait-for force | Opens visible Chrome, auto-detects OR accepts Enter keypress to continue | Complex flows, lazy loading, paywalls |
Interaction gate auto-detection:
- Cloudflare Turnstile / "just a moment" pages
- Google reCAPTCHA
- hCaptcha
- Custom challenge / verification screens
Wait-for-interaction workflow:
- Run with --wait-for interaction → Chrome opens visibly
- CLI auto-detects login/CAPTCHA gates
- User completes login or solves CAPTCHA in the browser
- CLI auto-detects gate cleared → captures page
- If --wait-for force is used, user can also press Enter to trigger capture manually
Agent Quality Gate
CRITICAL: The agent must treat default headless capture as provisional. Some sites render differently in headless mode and can silently return low-quality content without causing the CLI to fail.
After every headless run, the agent MUST inspect the saved markdown output.
Quality checks the agent must perform
- Confirm the markdown title matches the target page, not a generic site shell
- Confirm the body contains the expected article or page content, not just navigation, footer, or a generic error
- Watch for obvious failure signs:
  - "This page could not be found"
  - Login, signup, subscribe, or verification shells
  - Extremely short markdown for a page that should be long-form
  - Raw framework payloads or mostly boilerplate content
- If the result is low quality, incomplete, or clearly wrong, do not accept the run as successful just because the CLI exited with code 0
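Some of these checks can be automated as a first pass before the agent reads the output itself. The sketch below is a heuristic under stated assumptions, not a substitute for inspection: the function names are hypothetical, the front matter is assumed to carry a title key, the 500-character threshold is arbitrary, and the phrase list contains only strings mentioned in this document.

```typescript
interface QualityReport {
  ok: boolean;
  reasons: string[];
}

// Phrases taken from this document; extend per site as needed.
const FAILURE_PHRASES = ["this page could not be found", "just a moment"];

// ASSUMPTION: the YAML front matter has a `title:` key; otherwise fall back
// to the first level-1 heading.
function extractTitle(markdown: string): string {
  const frontMatter = markdown.match(/^---\n([\s\S]*?)\n---/);
  if (frontMatter) {
    const line = frontMatter[1].match(/^title:\s*(.+)$/m);
    if (line) return line[1].trim().replace(/^["']|["']$/g, "");
  }
  const heading = markdown.match(/^#\s+(.+)$/m);
  return heading ? heading[1].trim() : "";
}

function checkMarkdownQuality(
  markdown: string,
  expectedTitleWords: string[],
  minBodyChars: number = 500, // arbitrary threshold for "extremely short"
): QualityReport {
  const reasons: string[] = [];
  const lower = markdown.toLowerCase();
  const title = extractTitle(markdown).toLowerCase();
  const titleMatches = expectedTitleWords.some(
    (word) => title.indexOf(word.toLowerCase()) !== -1,
  );
  if (expectedTitleWords.length > 0 && !titleMatches) {
    reasons.push("title does not match the target page");
  }
  for (const phrase of FAILURE_PHRASES) {
    if (lower.indexOf(phrase) !== -1) reasons.push(`failure phrase found: "${phrase}"`);
  }
  const body = markdown.replace(/^---\n[\s\S]*?\n---\n?/, "").trim();
  if (body.length < minBodyChars) reasons.push("body is very short");
  return { ok: reasons.length === 0, reasons };
}
```

A failing report is a reason to retry in an interaction mode. A passing report only means no obvious failure sign was found.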
Tip: Use --format json to get structured output, including status, login state, and interaction gate details, for programmatic quality assessment. A "status": "needs_interaction" response means the page requires manual interaction.
Recovery workflow the agent must follow
- First run headless (default) unless there is already a clear reason to use interaction mode
- Review markdown quality immediately after the run
- If the content is low quality or indicates login/CAPTCHA, retry with:
  - --wait-for interaction for auto-detected gates (login, CAPTCHA, Cloudflare)
  - --wait-for force when the page needs manual browsing, scroll loading, or complex interaction
- If --wait-for is used, tell the user exactly what to do:
  - If login is required, ask them to sign in in the browser
  - If CAPTCHA appears, ask them to solve it
  - If the page needs time to load, ask them to wait until content is visible
  - For --wait-for force: tell them to press Enter when ready
- If JSON output shows "status": "needs_interaction", switch to --wait-for interaction automatically
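The retry decision in this workflow could be captured in a small function. This is a sketch under assumptions: nextWaitArgs and its parameters are hypothetical, only the status field of the JSON output is relied on, and the lowQuality and needsManualBrowsing flags stand for the agent's own judgment after reviewing the markdown.

```typescript
// Only the field this sketch depends on; the real JSON output has more.
interface CaptureResult {
  status?: string;
}

// Returns extra CLI arguments for a retry, or null when the headless
// result can be accepted as is.
function nextWaitArgs(
  result: CaptureResult | null,
  lowQuality: boolean,
  needsManualBrowsing: boolean,
): string[] | null {
  // A reported interaction gate always switches to auto-detected waiting.
  if (result !== null && result.status === "needs_interaction") {
    return ["--wait-for", "interaction"];
  }
  if (!lowQuality) return null;
  // Manual browsing or scroll loading needs the Enter-to-continue mode.
  return ["--wait-for", needsManualBrowsing ? "force" : "interaction"];
}
```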
Output Path Generation
The agent must construct the output file path, since the baoyu-fetch CLI does not auto-generate paths.
Algorithm:
- Determine base directory from EXTEND.md or default
- Extract domain from URL
- Generate slug from URL path or page title (kebab-case, 2-6 words)
- Construct: {base_dir}/{domain}/{slug}/{slug}.md (each URL gets its own directory so media files stay isolated)
- Conflict resolution: if the path already exists, append a timestamp: {slug}-YYYYMMDD-HHMMSS/{slug}-YYYYMMDD-HHMMSS.md
Pass the constructed path to --output. Media files (imgs/, videos/) are saved into subdirectories next to the markdown file, keeping each URL's assets self-contained.
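One possible implementation of the path algorithm is sketched below. It makes several assumptions that the text does not settle: all function names are hypothetical, the slug prefers the last URL path segment and falls back to the title when that segment has fewer than two words, non-ASCII characters are dropped from slugs, "page" is used when nothing usable remains, and the timestamp is taken in UTC.

```typescript
// Extracts the host from an absolute URL without relying on runtime globals.
function domainOf(url: string): string {
  const match = url.match(/^[a-z][a-z0-9+.-]*:\/\/(?:[^@\/?#]*@)?([^\/?#:]+)/i);
  if (!match) throw new Error(`not an absolute URL: ${url}`);
  return match[1].toLowerCase();
}

function slugWords(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim()
    .split(" ")
    .filter((word) => word.length > 0);
}

// Kebab-case slug capped at 6 words.
function slugFor(url: string, title: string): string {
  const path = url.replace(/[?#].*$/, "").replace(/^[a-z][a-z0-9+.-]*:\/\/[^\/]*/i, "");
  const segments = path.split("/").filter((segment) => segment.length > 0);
  let words = segments.length > 0 ? slugWords(segments[segments.length - 1]) : [];
  if (words.length < 2 && slugWords(title).length > 0) words = slugWords(title);
  return words.slice(0, 6).join("-") || "page";
}

function timestamp(now: Date): string {
  const pad = (n: number): string => (n < 10 ? "0" : "") + n;
  return (
    `${now.getUTCFullYear()}${pad(now.getUTCMonth() + 1)}${pad(now.getUTCDate())}` +
    `-${pad(now.getUTCHours())}${pad(now.getUTCMinutes())}${pad(now.getUTCSeconds())}`
  );
}

// exists is injected so the caller decides how to check the file system.
function buildOutputPath(
  baseDir: string,
  url: string,
  title: string,
  exists: (path: string) => boolean,
  now: Date,
): string {
  const base = baseDir.replace(/\/+$/, "");
  const domain = domainOf(url);
  let slug = slugFor(url, title);
  let path = `${base}/${domain}/${slug}/${slug}.md`;
  if (exists(path)) {
    slug = `${slug}-${timestamp(now)}`;
    path = `${base}/${domain}/${slug}/${slug}.md`;
  }
  return path;
}
```

The resulting string is what would be passed to --output.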
Output Format
Markdown output goes to stdout (or to a file with --output) as clean markdown text.
JSON output (--format json) returns structured data including:
- The adapter that handled the URL
- status: the capture status, for example "needs_interaction"
- Login state detection
- Interaction gate details (kind, provider, prompt)
- Structured content (url, title, author, publishedAt, content blocks, metadata)
- Collected media assets with url, kind, role
- Converted markdown text
- Media download results (when --download-media is used)
When media is downloaded:
- Images are saved to imgs/ next to the output file (or under the media base directory, if one is set)
- Videos are saved to videos/ next to the output file (or under the media base directory, if one is set)
- Markdown media links are rewritten to local relative paths
Built-in Adapters
| Adapter | URLs | Key Features |
|---|---|---|
| X/Twitter | x.com, twitter.com | Tweets, threads, X Articles, media, login detection |
| YouTube | youtube.com, youtu.be | Transcript/captions, chapters, cover image, metadata |
| Hacker News | news.ycombinator.com | Threaded comments, story metadata, nested replies |
| Generic | Any URL (fallback) | Defuddle extraction, Readability fallback, auto-scroll, network idle detection |
Adapter is auto-selected based on URL. Use --adapter to override.
Media Download Workflow
Based on the media download setting in EXTEND.md:
| Setting | Behavior |
|---|---|
| Always download | Run CLI with --download-media --output <path> |
| Never download | Run CLI with --output <path> (no media download) |
| Ask each time (default) | Follow the ask-each-time flow below |
Ask-Each-Time Flow
- Run CLI without --download-media, with --output <path> → markdown saved
- Check saved markdown for remote media URLs in image/video links
- If no remote media found → done, no prompt needed
- If remote media found → ask the user:
  - header: "Media", question: "Download N images/videos to local files?"
  - "Yes": Download to local directories
  - "No": Keep remote URLs
- If user confirms → run CLI again with --download-media --output <same-path> (overwrites markdown with localized links)
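The "check for remote media" step might look like the following. It is a rough sketch: findRemoteMedia and mediaPrompt are hypothetical names, and the detection rules are assumptions (any http/https image link counts, and a plain link counts as video only if its URL ends in a common video extension), since the document does not specify how video links appear in the markdown.

```typescript
// ASSUMPTION: these extensions identify video links.
const VIDEO_EXT = /\.(mp4|webm|mov|m3u8)(\?|#|$)/i;

// Collects unique remote media URLs from markdown image and link syntax.
function findRemoteMedia(markdown: string): string[] {
  const found: string[] = [];
  const link = /(!?)\[[^\]]*\]\((https?:\/\/[^)\s]+)[^)]*\)/g;
  let match: RegExpExecArray | null;
  while ((match = link.exec(markdown)) !== null) {
    const isImage = match[1] === "!";
    const url = match[2];
    if ((isImage || VIDEO_EXT.test(url)) && found.indexOf(url) === -1) {
      found.push(url);
    }
  }
  return found;
}

// Returns the question to ask, or null when no prompt is needed.
function mediaPrompt(markdown: string): string | null {
  const count = findRemoteMedia(markdown).length;
  return count === 0 ? null : `Download ${count} images/videos to local files?`;
}
```

Links that already point at local relative paths are ignored, so rerunning the check after a download does not prompt again.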
Environment Variables
| Variable | Description |
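|---|---|
| | Chrome user data directory (can also use --chrome-profile-dir) |
Troubleshooting: Chrome not found → set a custom Chrome/Chromium binary path. Timeout → increase the page load timeout. Login/CAPTCHA pages → use --wait-for interaction. Debug → use --debug-dir to inspect captured HTML and network logs.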
YouTube Notes
- YouTube adapter extracts transcripts/captions automatically when available
- Transcripts are formatted with chapter headings
- Transcript availability depends on YouTube exposing a caption track. Videos with captions disabled or restricted playback may produce description-only output
- Use --wait-for force if the page needs time to finish loading player metadata
X/Twitter Notes
- Extracts single tweets, threads, and X Articles
- Auto-detects login state; if logged out and content requires auth, JSON output will show "status": "needs_interaction"
- Use --wait-for interaction for login-protected content
Hacker News Notes
- Parses threaded comments with proper nesting and reply hierarchy
- Includes story metadata (title, URL, author, score, comment count)
- Shows comment deletion/dead status
Extension Support
Custom configurations via EXTEND.md. See Preferences section for paths and supported options.