content-extract

Original：🇨🇳 Chinese

Translated

1 scriptsChecked / no sensitive code detected

Robust URL-to-Markdown extraction for OpenClaw workflows. Use this when the user needs to "extract/summarize/convert a webpage to Markdown" (especially for WeChat official accounts at mp.weixin.qq.com) and web_fetch or browser access is blocked or returns messy content. It first uses a low-cost probe via web_fetch, then falls back to the official MinerU API (through the local mineru-extract skill), and returns a traceable result contract with source links.

13installs

Sourceblessonism/openclaw-search-skills

Added on2026-02-15

NPX Install

npx skill4agent add blessonism/openclaw-search-skills content-extract

SKILL.md Content (Chinese)

View Translation Comparison →

content-extract — Upper-level Content Parsing Entry (MCP Semantic Alignment, but no MCP Server running)

Objective: Turn "Provide a URL → Generate readable Markdown + traceable entry" into a unified entry for reuse by all subsequent business skills (github-explorer, writing skills, daily reports, etc.).

Core Principles (Inspired by the Excel Skill breakdown article you sent):

Behavior Specification Layer: Always provide traceable entries (original URL + parsed product path/link), never fabricate sources.
Token Probe: First use a low-cost probe to determine if direct crawling is possible; if not, proceed with heavy parsing (MinerU).
Fallback Mechanism: When failing, return "next action suggestions" instead of a stack of exceptions.

Workflow (Decision Tree)

Input:

url

Domain Whitelist (Skip Probe): If the URL belongs to a site with high probability of anti-crawling/dynamic content (WeChat, Zhihu, etc.), directly use MinerU

Whitelist file:
```
references/domain-whitelist.md
```
For URLs hitting the whitelist: Force
```
model_version=MinerU-HTML
```

Probe (Low Cost): Prioritize using
```
web_fetch(url)
```

Objective: Get the main content Markdown (cheap, fast)
Conditions for judging "failure/invalid" (see
```
references/heuristics.md
```
) include:
- 403/401/anti-crawling
- Only prompts like "environment exception/verification code/please open in WeChat"
- Extremely short content/obvious navigation page/missing main content

Fallback (High Fidelity): Use the official MinerU API

Call downstream driver:

skills/mineru-extract/scripts/mineru_parse_documents.py

For HTML pages (WeChat, etc.): Force
```
model_version=MinerU-HTML
```

Output Unified Result Contract

Regardless of using probe or MinerU, return the same structure:

json

{
  "ok": true,
  "source_url": "...",
  "engine": "web_fetch" ,
  "markdown": "...",
  "artifacts": {
    "out_dir": "...",
    "markdown_path": "...",
    "zip_path": "..."
  },
  "sources": [
    "原文URL",
    "（如使用MinerU）MinerU full_zip_url",
    "（如使用MinerU）本地markdown_path"
  ],
  "notes": ["任何重要限制/失败原因/下一步建议"]
}

Note:
engine
can be
web_fetch
or
mineru
.

MinerU Call (Deterministic Script for Agents)

When MinerU is needed, use this command (returns JSON, and can inline Markdown into JSON for downstream summarization):

bash

python3 mineru-extract/scripts/mineru_parse_documents.py \
  --file-sources "<URL>" \
  --model-version MinerU-HTML \
  --emit-markdown --max-chars 20000

Path Note: The above command assumes you are executing it in the root directory of the skills installation. If mineru-extract is installed in another location, please replace it with the actual path.

Delivery Specifications (Mandatory)

Output must include
```
sources
```
(original entry + parsed product entry).
If MinerU succeeds: Must write
```
markdown_path
```
(local path) into
```
sources
```
for easy review.
If both links fail: Must clearly state the failure reason and provide next steps (e.g., ask the Boss to provide an accessible mirror link / allow me to export HTML via browser relay / use the fallback solution of uploading HTML files for parsing).

What This Skill Does NOT Do

Does not run MCP Server (avoids permanent service and operation/maintenance burdens)
Does not attempt to bypass login/verification codes (this belongs to the access layer issue; we only handle parsing layer and workflow routing)

References

MinerU API docs: https://mineru.net/apiManage/docs
MinerU output files: https://opendatalab.github.io/MinerU/reference/output_files/