news-extractor

News Extractor Skill

Extract article content from mainstream news platforms and output in JSON and Markdown formats.

Supported Platforms

Platform	ID	URL Example
WeChat Official Accounts	wechat	`https://mp.weixin.qq.com/s/xxxxx`
Toutiao	toutiao	`https://www.toutiao.com/article/123456/`
NetEase News	netease	`https://www.163.com/news/article/ABC123.html`
Sohu News	sohu	`https://www.sohu.com/a/123456_789`
Tencent News	tencent	`https://news.qq.com/rain/a/20251016A07W8J00`
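The table above implies a mapping from URL to platform ID. A minimal sketch of that lookup follows, assuming detection can be done on the hostname alone; the names `PLATFORM_HOSTS` and `detect_platform` are hypothetical, and the actual script may also inspect the path.

```python
from typing import Optional
from urllib.parse import urlparse

# Hostnames taken from the URL examples in the table above. Matching on the
# hostname alone is an assumption; the real script may use stricter patterns.
PLATFORM_HOSTS = {
    "mp.weixin.qq.com": "wechat",
    "www.toutiao.com": "toutiao",
    "www.163.com": "netease",
    "www.sohu.com": "sohu",
    "news.qq.com": "tencent",
}


def detect_platform(url: str) -> Optional[str]:
    """Return the platform ID for a URL, or None if no known host matches."""
    host = (urlparse(url).hostname or "").lower()
    return PLATFORM_HOSTS.get(host)
```

For example, `detect_platform("https://www.sohu.com/a/123456_789")` returns `"sohu"`, while a URL on an unlisted host returns `None`.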

Dependency Installation

This skill uses uv for dependency management. Install the dependencies before first use:

```bash
cd ~/.claude/skills/news-extractor
uv sync
```

Important: Run all scripts with `uv run`, not directly with `python`. `uv run` automatically uses the dependencies from the project's virtual environment.

Dependency List

Package	Purpose
pydantic	Data model validation
requests	HTTP requests
curl_cffi	Browser-impersonating HTTP fetching
tenacity	Retry mechanism
parsel	HTML/XPath parsing
demjson3	Non-standard JSON parsing

Usage

Basic Usage

```bash
# Extract news, auto-detect the platform, and output JSON + Markdown
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL"

# Specify the output directory
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --output ./output

# Output JSON only
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --format json

# Output Markdown only
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --format markdown

# List supported platforms
uv run .claude/skills/news-extractor/scripts/extract_news.py --list-platforms
```
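The flags used in the commands above could be declared with `argparse` roughly as follows. This is a sketch, not the script's actual parser: the `both` choice for `--format` and the optional positional `url` are assumptions made so that `--list-platforms` can run without a URL.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the usage commands."""
    parser = argparse.ArgumentParser(prog="extract_news.py")
    # The URL is optional here so that --list-platforms works on its own.
    parser.add_argument("url", nargs="?", help="news article URL")
    parser.add_argument("--output", default="./output", help="output directory")
    # "both" is a hypothetical name for the default JSON + Markdown behavior.
    parser.add_argument("--format", choices=["json", "markdown", "both"], default="both")
    parser.add_argument("--list-platforms", action="store_true")
    return parser


args = build_parser().parse_args(
    ["https://www.toutiao.com/article/123456/", "--format", "json"]
)
print(args.url, args.output, args.format, args.list_platforms)
```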

Output Files

By default, the script writes two files to the output directory (`./output` unless another directory is specified):

- `{news_id}.json`: structured JSON data
- `{news_id}.md`: Markdown-formatted article
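To illustrate the file naming, here is a small sketch that writes both files for one article. The function name `save_outputs` and its signature are hypothetical; only the `{news_id}.json` / `{news_id}.md` naming and the `./output` default come from this document.

```python
import json
from pathlib import Path
from typing import Tuple


def save_outputs(article: dict, markdown: str, output_dir: str = "./output") -> Tuple[Path, Path]:
    """Write {news_id}.json and {news_id}.md into output_dir and return both paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / f"{article['news_id']}.json"
    md_path = out / f"{article['news_id']}.md"
    # ensure_ascii=False keeps Chinese text readable in the JSON file.
    json_path.write_text(json.dumps(article, ensure_ascii=False, indent=2), encoding="utf-8")
    md_path.write_text(markdown, encoding="utf-8")
    return json_path, md_path
```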

Workflow

1. Receive URL: the user provides a news link.
2. Platform detection: the platform type is identified automatically.
3. Content extraction: the corresponding crawler fetches and parses the content.
4. Format conversion: JSON and Markdown are generated.
5. Output files: the results are saved to the specified directory.
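The "call the corresponding crawler" step can be pictured as a registry keyed by platform ID. The sketch below is an assumption about how such dispatch might look; `EXTRACTORS`, `register`, and the stub extractor are hypothetical and stand in for the real per-platform crawlers, which this document does not show.

```python
from typing import Callable, Dict

# Registry mapping platform ID -> extractor function (hypothetical design).
EXTRACTORS: Dict[str, Callable[[str], dict]] = {}


def register(platform_id: str):
    """Decorator that registers an extractor under a platform ID."""
    def decorator(func: Callable[[str], dict]) -> Callable[[str], dict]:
        EXTRACTORS[platform_id] = func
        return func
    return decorator


@register("toutiao")
def extract_toutiao(url: str) -> dict:
    # Stub: a real extractor would fetch the page and parse its content.
    return {"title": "", "news_url": url, "contents": []}


def extract(url: str, platform_id: str) -> dict:
    """Dispatch to the extractor registered for platform_id."""
    try:
        extractor = EXTRACTORS[platform_id]
    except KeyError:
        raise ValueError(f"Platform not supported: {platform_id}") from None
    return extractor(url)
```

A registry keeps the dispatch step independent of any single platform: adding a site means registering one more function rather than editing a chain of conditionals.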

Output Formats

JSON Structure

```json
{
  "title": "Article Title",
  "news_url": "Original Link",
  "news_id": "Article ID",
  "meta_info": {
    "author_name": "Author/Source",
    "author_url": "",
    "publish_time": "2024-01-01 12:00"
  },
  "contents": [
    {"type": "text", "content": "Paragraph text", "desc": ""},
    {"type": "image", "content": "https://...", "desc": ""},
    {"type": "video", "content": "https://...", "desc": ""}
  ],
  "texts": ["Paragraph 1", "Paragraph 2"],
  "images": ["Image URL1", "Image URL2"],
  "videos": []
}
```
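The structure above can be modeled in code. The skill lists pydantic for data model validation; the sketch below uses standard-library dataclasses instead so that it runs without extra dependencies. The class names are hypothetical, and deriving `texts`, `images`, and `videos` from `contents` is an assumption about how those flat lists are produced.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ContentItem:
    type: str      # "text", "image", or "video"
    content: str   # paragraph text or media URL
    desc: str = ""


@dataclass
class MetaInfo:
    author_name: str = ""
    author_url: str = ""
    publish_time: str = ""


@dataclass
class NewsItem:
    title: str
    news_url: str
    news_id: str
    meta_info: MetaInfo = field(default_factory=MetaInfo)
    contents: List[ContentItem] = field(default_factory=list)

    def to_dict(self) -> dict:
        """Return a dict shaped like the JSON structure shown above."""
        data = asdict(self)
        # Assumption: the flat lists are views of `contents`, in document order.
        for key, kind in (("texts", "text"), ("images", "image"), ("videos", "video")):
            data[key] = [c.content for c in self.contents if c.type == kind]
        return data
```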

Markdown Structure

```markdown
Article Title

Article Information

Author: xxx
Publish Time: 2024-01-01 12:00
Original Link: Link

Article Content

Paragraph content...

Media Resources

Images (N)

URL1
URL2
```
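A renderer for this layout might look like the sketch below. It takes a dict in the JSON shape described earlier; the function name `render_markdown`, the heading levels (`#`, `##`, `###`), and the numbered image list are assumptions, since the template above does not specify them.

```python
def render_markdown(article: dict) -> str:
    """Render an article dict as Markdown following the section order above."""
    meta = article.get("meta_info", {})
    lines = [
        f"# {article['title']}",
        "",
        "## Article Information",
        "",
        f"Author: {meta.get('author_name', '')}",
        f"Publish Time: {meta.get('publish_time', '')}",
        f"Original Link: {article['news_url']}",
        "",
        "## Article Content",
        "",
    ]
    # One paragraph per text entry, separated by blank lines.
    for text in article.get("texts", []):
        lines += [text, ""]
    images = article.get("images", [])
    if images:
        lines += ["## Media Resources", "", f"### Images ({len(images)})", ""]
        lines += [f"{i}. {url}" for i, url in enumerate(images, 1)]
        lines.append("")
    return "\n".join(lines)
```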

Usage Examples

Extract WeChat Official Account Article

```bash
uv run .claude/skills/news-extractor/scripts/extract_news.py \
  "https://mp.weixin.qq.com/s/ebMzDPu2zMT_mRgYgtL6eQ"
```

Output:

```
[INFO] Platform detected: wechat (WeChat Official Accounts)
[INFO] Extracting content...
[INFO] Title: Article Title
[INFO] Author: Official Account Name
[INFO] Text paragraphs: 15
[INFO] Images: 3
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.json
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.md
```
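In this example the saved file names match the last path segment of the URL. A sketch of deriving an ID that way follows; `guess_news_id` is a hypothetical helper, and whether the other platforms derive their IDs the same way is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def guess_news_id(url: str) -> str:
    """Return the last path segment of the URL, without a file extension."""
    # PurePosixPath ignores a trailing slash, and .stem drops a suffix
    # such as ".html". An ID containing a dot would be truncated.
    return PurePosixPath(urlparse(url).path).stem
```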

Extract Toutiao Article

```bash
uv run .claude/skills/news-extractor/scripts/extract_news.py \
  "https://www.toutiao.com/article/7434425099895210546/"
```

Error Handling

Error Type	Description	Solution
`Unrecognized Platform`	URL does not match any supported platform	Check if the URL is correct
`Platform Not Supported`	Unsupported site	This Skill only supports the listed news sites
`Extraction Failed`	Network error or page structure change	Retry or check URL validity
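The "retry" advice for extraction failures can be sketched as a small backoff helper. The skill lists tenacity for retries; this standard-library version is a stand-in, and `ExtractionError` and `with_retry` are hypothetical names rather than part of the script.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ExtractionError(Exception):
    """Hypothetical error type standing in for the 'Extraction Failed' case."""


def with_retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on ExtractionError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except ExtractionError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be >= 1")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting, and a small `attempts` value is in keeping with the note against large-scale crawling.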

Notes

- For educational and research purposes only.
- Do not perform large-scale crawling.
- Respect the target website's robots.txt and terms of service.
- WeChat Official Accounts may require valid cookies (the default configuration usually works).

References

Platform URL pattern reference