news-extractor

News Extractor Skill

Extracts article content from mainstream news platforms and writes it out in JSON and Markdown formats.

Supported Platforms

| Platform | ID | URL Example |
| --- | --- | --- |
| WeChat Official Accounts (微信公众号) | wechat | https://mp.weixin.qq.com/s/xxxxx |
| Toutiao (今日头条) | toutiao | https://www.toutiao.com/article/123456/ |
| NetEase News (网易新闻) | netease | https://www.163.com/news/article/ABC123.html |
| Sohu News (搜狐新闻) | sohu | https://www.sohu.com/a/123456_789 |
| Tencent News (腾讯新闻) | tencent | https://news.qq.com/rain/a/20251016A07W8J00 |
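
The platform IDs and example URLs above suggest how a URL could be mapped to a platform. The following is a minimal sketch, not the skill's actual detection code: the `PLATFORM_PATTERNS` table and the `detect_platform` function are hypothetical names, and the regular expressions are inferred only from the example URLs listed here.

```python
import re
from typing import Optional

# Hypothetical patterns inferred from the example URLs in the table above;
# the patterns the skill really uses may be stricter or looser.
PLATFORM_PATTERNS = {
    "wechat": re.compile(r"^https?://mp\.weixin\.qq\.com/s/"),
    "toutiao": re.compile(r"^https?://www\.toutiao\.com/article/\d+"),
    "netease": re.compile(r"^https?://www\.163\.com/.+\.html"),
    "sohu": re.compile(r"^https?://www\.sohu\.com/a/\d+_\d+"),
    "tencent": re.compile(r"^https?://news\.qq\.com/rain/a/\w+"),
}


def detect_platform(url: str) -> Optional[str]:
    """Return the platform ID whose pattern matches the URL, or None."""
    for platform_id, pattern in PLATFORM_PATTERNS.items():
        if pattern.match(url):
            return platform_id
    return None
```

Returning `None` rather than raising keeps the decision about how to report an unrecognized URL with the caller.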

Dependency Installation

This skill uses uv for dependency management. Install the dependencies before first use:

```bash
cd ~/.claude/skills/news-extractor
uv sync
```

Important: Run every script with `uv run` rather than calling `python` directly. `uv run` automatically uses the dependencies from the project's virtual environment.

Dependency List

| Package | Purpose |
| --- | --- |
| pydantic | Data model validation |
| requests | HTTP requests |
| curl_cffi | Fetching pages with browser impersonation |
| tenacity | Retry mechanism |
| parsel | HTML/XPath parsing |
| demjson3 | Non-standard JSON parsing |

Usage

Basic Usage

```bash
# Extract news, auto-detect the platform, and output JSON + Markdown
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL"

# Specify the output directory
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --output ./output

# Output JSON only
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --format json

# Output Markdown only
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --format markdown

# List supported platforms
uv run .claude/skills/news-extractor/scripts/extract_news.py --list-platforms
```
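
For readers who want to see how these flags fit together, here is a small sketch of an equivalent command-line interface built with `argparse`. It only illustrates the options documented above (`--output`, `--format`, `--list-platforms`); the helper names `build_parser` and `formats_to_write` are hypothetical, and the real `extract_news.py` may parse its arguments differently.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="extract_news.py")
    # The URL is optional so that --list-platforms can be used on its own.
    parser.add_argument("url", nargs="?", help="news article URL")
    parser.add_argument("--output", default="./output", help="output directory")
    parser.add_argument(
        "--format",
        choices=["json", "markdown"],
        default=None,
        help="write only this format (assumed default: both)",
    )
    parser.add_argument(
        "--list-platforms",
        action="store_true",
        help="list supported platforms and exit",
    )
    return parser


def formats_to_write(args: argparse.Namespace) -> List[str]:
    """Return the formats to produce: one if --format is given, else both."""
    return [args.format] if args.format else ["json", "markdown"]
```

For example, `build_parser().parse_args(["URL", "--format", "json"])` yields a namespace for which `formats_to_write` returns only `["json"]`.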

Output Files

By default, the script writes both formats to the output directory (./output unless another directory is specified):
  • {news_id}.json - Structured JSON data
  • {news_id}.md - Markdown-formatted article
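
The naming scheme above can be sketched as a small helper that writes both files. This is an illustration under assumptions, not the script's code: `save_outputs` is a hypothetical function, and creating the directory on demand and using UTF-8 are choices made for this sketch.

```python
from pathlib import Path
from typing import List


def save_outputs(
    news_id: str,
    json_text: str,
    markdown_text: str,
    output_dir: str = "./output",
) -> List[Path]:
    """Write {news_id}.json and {news_id}.md into output_dir and return the paths."""
    base = Path(output_dir)
    # Assumption: the directory is created on demand if it does not exist.
    base.mkdir(parents=True, exist_ok=True)
    json_path = base / f"{news_id}.json"
    md_path = base / f"{news_id}.md"
    json_path.write_text(json_text, encoding="utf-8")
    md_path.write_text(markdown_text, encoding="utf-8")
    return [json_path, md_path]
```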

Workflow

  1. Receive URL - The user provides a news link
  2. Platform Detection - The platform type is identified automatically
  3. Content Extraction - The corresponding crawler fetches and parses the content
  4. Format Conversion - JSON and Markdown are generated
  5. Output Files - The results are saved to the specified directory
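
The five steps above can be sketched as one small pipeline. Everything here is a stand-in chosen to keep the example runnable offline: `fake_wechat_extractor` returns fixed data instead of fetching a page, the `EXTRACTORS` registry and `run_pipeline` are hypothetical names, and step 5 returns the file contents keyed by file name rather than writing to disk.

```python
import json
from typing import Callable, Dict


def fake_wechat_extractor(url: str) -> dict:
    """Stand-in for a real crawler: returns fixed data instead of fetching."""
    return {
        "title": "Article Title",
        "news_url": url,
        "news_id": url.rstrip("/").rsplit("/", 1)[-1],
        "texts": ["Paragraph 1", "Paragraph 2"],
    }


# Hypothetical registry mapping a platform ID to its extractor function.
EXTRACTORS: Dict[str, Callable[[str], dict]] = {"wechat": fake_wechat_extractor}


def detect_platform(url: str) -> str:
    """Step 2: identify the platform (only WeChat in this sketch)."""
    if "mp.weixin.qq.com" in url:
        return "wechat"
    raise ValueError(f"Unrecognized platform: {url}")


def run_pipeline(url: str) -> Dict[str, str]:
    """Steps 1-5: URL in, mapping of output file name to file content out."""
    platform = detect_platform(url)                       # step 2
    article = EXTRACTORS[platform](url)                   # step 3
    as_json = json.dumps(article, ensure_ascii=False, indent=2)  # step 4
    as_markdown = f"# {article['title']}\n\n" + "\n\n".join(article["texts"]) + "\n"
    news_id = article["news_id"]
    # Step 5 would write these to the output directory; here they are returned.
    return {f"{news_id}.json": as_json, f"{news_id}.md": as_markdown}
```

Keeping the extractors in a registry keyed by platform ID means supporting another site only requires adding one entry, without touching the pipeline itself.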

Output Formats

JSON Structure

```json
{
  "title": "Article Title",
  "news_url": "Original Link",
  "news_id": "Article ID",
  "meta_info": {
    "author_name": "Author/Source",
    "author_url": "",
    "publish_time": "2024-01-01 12:00"
  },
  "contents": [
    {"type": "text", "content": "Paragraph text", "desc": ""},
    {"type": "image", "content": "https://...", "desc": ""},
    {"type": "video", "content": "https://...", "desc": ""}
  ],
  "texts": ["Paragraph 1", "Paragraph 2"],
  "images": ["Image URL1", "Image URL2"],
  "videos": []
}
```
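
To make the field layout concrete, the structure above can be modeled with typed classes. The dependency list names pydantic for model validation; this sketch uses standard-library dataclasses instead so that it runs without extra packages, and the class names `ContentItem`, `MetaInfo`, and `NewsItem` are hypothetical rather than taken from the skill's source.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ContentItem:
    type: str  # "text", "image", or "video"
    content: str  # paragraph text, or the media URL
    desc: str = ""


@dataclass
class MetaInfo:
    author_name: str = ""
    author_url: str = ""
    publish_time: str = ""


@dataclass
class NewsItem:
    title: str
    news_url: str
    news_id: str
    meta_info: MetaInfo = field(default_factory=MetaInfo)
    contents: List[ContentItem] = field(default_factory=list)
    texts: List[str] = field(default_factory=list)
    images: List[str] = field(default_factory=list)
    videos: List[str] = field(default_factory=list)
```

Calling `asdict()` on a `NewsItem` produces a nested dictionary with the same keys, in the same order, as the JSON example, ready to pass to `json.dumps`.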

Markdown Structure

```markdown
# Article Title

## Article Information

Author: xxx
Publish Time: 2024-01-01 12:00
Original Link: Link

## Article Content

Paragraph content...

Image

## Media Resources

### Images (N)

1. URL1
2. URL2
```
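
As an illustration of how the JSON structure could be turned into a document with this outline, here is a minimal rendering sketch. The function name `render_markdown`, the heading levels, and the `![Image](...)` syntax for inline images are all assumptions of this example; the skill's real formatter may produce different markup.

```python
from typing import List


def render_markdown(article: dict) -> str:
    """Render an article dict (shaped like the JSON structure) as Markdown."""
    meta = article.get("meta_info", {})
    lines: List[str] = [
        f"# {article['title']}",
        "",
        "## Article Information",
        "",
        f"Author: {meta.get('author_name', '')}",
        f"Publish Time: {meta.get('publish_time', '')}",
        f"Original Link: {article['news_url']}",
        "",
        "## Article Content",
        "",
    ]
    # Body: text paragraphs and inline images, in their original order.
    for item in article.get("contents", []):
        if item["type"] == "text":
            lines += [item["content"], ""]
        elif item["type"] == "image":
            lines += [f"![Image]({item['content']})", ""]
    # Media section: a numbered list of image URLs, emitted only if any exist.
    images = article.get("images", [])
    if images:
        lines += ["## Media Resources", "", f"### Images ({len(images)})", ""]
        lines += [f"{i}. {url}" for i, url in enumerate(images, start=1)]
    return "\n".join(lines).rstrip() + "\n"
```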

Usage Examples

Extract a WeChat Official Account Article

```bash
uv run .claude/skills/news-extractor/scripts/extract_news.py \
  "https://mp.weixin.qq.com/s/ebMzDPu2zMT_mRgYgtL6eQ"
```

Output:

```
[INFO] Platform detected: wechat (微信公众号)
[INFO] Extracting content...
[INFO] Title: Article Title
[INFO] Author: Official Account Name
[INFO] Text paragraphs: 15
[INFO] Images: 3
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.json
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.md
```

Extract a Toutiao Article

```bash
uv run .claude/skills/news-extractor/scripts/extract_news.py \
  "https://www.toutiao.com/article/7434425099895210546/"
```

Error Handling

| Error Type | Description | Solution |
| --- | --- | --- |
| Unrecognized Platform | The URL does not match any supported platform | Check that the URL is correct |
| Platform Not Supported | The site is not one of the supported sites | This skill only supports the listed news sites |
| Extraction Failed | Network error or a change in page structure | Retry, or check that the URL is still valid |
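
The "retry" advice for extraction failures can be sketched as a small helper. The dependency list names tenacity for retries; this example uses a plain loop instead so that it runs with the standard library alone. The exception classes and the `with_retries` function are hypothetical names, not the skill's API.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class UnrecognizedPlatformError(ValueError):
    """The URL does not match any supported platform (retrying will not help)."""


class ExtractionFailedError(RuntimeError):
    """Network error or page structure change (retrying may help)."""


def with_retries(func: Callable[[], T], attempts: int = 3, delay: float = 1.0) -> T:
    """Call func, retrying on ExtractionFailedError with a growing pause."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[ExtractionFailedError] = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except ExtractionFailedError as exc:
            last_error = exc
            if attempt < attempts:
                # Linear backoff: wait longer after each failed attempt.
                time.sleep(delay * attempt)
    assert last_error is not None
    raise last_error
```

Only `ExtractionFailedError` is retried; an unrecognized or unsupported URL fails immediately, because repeating the request cannot change the outcome.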

Notes

  • For educational and research purposes only
  • Do not perform large-scale crawling
  • Respect the target website's robots.txt and terms of service
  • WeChat Official Accounts may require valid cookies (the default configuration usually works)
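
One way to act on the robots.txt note is to check a URL against a site's rules before fetching it. The sketch below uses Python's standard `urllib.robotparser` and takes the robots.txt text as an argument so that it runs offline; the `is_allowed` helper is a hypothetical name, and nothing here implies that the skill performs this check itself.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In practice the robots.txt content would first be downloaded from the target site, for example with the parser's own `set_url()` and `read()` methods.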

References

  • Platform URL pattern reference